I've got the hub slot frequency increased to 1/cogs, instead of the old 1/16.
Cool.
Does this still mean the number of cogs should be a power of two, or is (e.g.) 3, 5 or 9 now practical?
They need to be a power of 2, because for each cog there must be a hub RAM instance, and the number of hub RAM instances definitely needs to be a power of 2 to cover the LSBs of the hub addresses.
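To make the power-of-2 point concrete, here is a small C sketch of an interleaved hub: the low address bits select which RAM instance holds a given long, and that masking only covers every address cleanly when the instance count is a power of two. The rotation formula and the names (NCOGS, slice_of, slice_visible) are illustrative assumptions for this sketch, not the actual P2 Verilog.

#include <stdio.h>
#include <stdint.h>

/* Illustrative model only: NCOGS hub RAM slices, selected by the low
 * bits of the long address.  The mask trick below only works when
 * NCOGS is a power of two. */
#define NCOGS      8u           /* must be a power of two */
#define SLICE_MASK (NCOGS - 1u)

/* Which RAM instance holds this long address. */
static unsigned slice_of(uint32_t long_addr)
{
    return long_addr & SLICE_MASK;
}

/* Assumed rotation: each clock, a cog sees the next slice, so a given
 * slice comes around once every NCOGS clocks (the 1/cogs slot rate). */
static unsigned slice_visible(unsigned cog, unsigned clock)
{
    return (cog + clock) & SLICE_MASK;
}

int main(void)
{
    uint32_t addr = 5;                       /* arbitrary long address */
    unsigned want = slice_of(addr);
    for (unsigned t = 0; t < NCOGS; t++)
        if (slice_visible(0, t) == want)
            printf("cog 0 reaches slice %u after %u clock(s)\n", want, t);
    return 0;
}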
As Phil used to decry non-orthogonality, what about a 9th cog with no hub, no smart pins, etc.? Maybe require some special instantiation, like a secret spare stripped-down cog.
Yes, the RAM needs to be 100% mapped, in binary LSB increments, but the cogs do not have to exist; it is just more 'humanly sensible' if they do.
That could make sense, if there were a means to do cog-to-cog messages without going via the hub.
Not sure if the smart pins can be used as a cog-to-cog bridge, or what speeds that could give?
Uses I can think of are Watchdog / system background checking and Debug, but it is too late for that sort of change.
Hopefully, Debug will be good enough, built on what is there now.
Even 27c MCUs can single-step, edit SFRs and RAM, and set breakpoints with no MCU resource cost.
I don't think the P2 can get to 'no MCU resource cost' for debug, but quite small debug stubs should be possible?
Has anyone got a minimum size yet for the 'smallest debug stub'?
If the smart pins and CORDIC were removed, and maybe the LUT RAM trimmed, then 16 cogs + 512 KB hub RAM would fit. That's not a sensible option any longer, though; we're talking Prop3 stuff now.
Still, there was a time when all 16 cogs and all 64 smartpins fit in the A9, right?
I'm not sure that was ever the case; FWIR, the 64-smart-pin build is relatively recent and was done to allow proper test coverage across all the smart pins.
At 800M (peak) operations per second, the memory architecture of the P2 greatly overcomes the major limitations of GPUs' stream processing units, and it certainly beats FPGAs in terms of engineering time and complexity.
If one were to configure parallel P2's as stream processors...
GPUs have gotten crazy fast and a lot of people are actually coding for GPU instead of CPU.
But, that realm is a lot different from the MCU realm, I think.
Look at these specs:
We packed the most raw horsepower we possibly could into this GPU. It’s driven by 3840 NVIDIA® CUDA® cores running at 1.6 GHz and packs 12 TFLOPs of brute force. Plus it’s armed with 12 GB of GDDR5X memory running at over 11 Gbps.
You'd need a lot of P2s to get anywhere near that...
I'm planning to have a lot of P2's... looking around for someone to take a leap on the mother board:)
Chip,
The problem seems to be caused by COGID returning the wrong value.
The following test code blinks the green LEDs on the A9 board, matching the position of the cog LEDs.
In V22 the alignment is correct, and in V23 the alignment is out by 1,
i.e., COGID returns 1 for cog 2 instead of 2.
dat             org
                coginit #2,##@blink         ' start cog 2 running blink
                coginit #7,##@blink         ' start cog 7 running blink
                getct   pa                  ' get current system counter
main            addct1  pa,##20_000_000     ' set CT1 event 20,000,000 clocks ahead
                waitct1
                cogatn  #%10000100          ' strobe ATN to cogs 7 and 2
                jmp     #main

                org
blink           cogid   pb                  ' pb = this cog's ID
                add     pb,#40              ' LED pin = 40 + cog ID
loop            waitatn                     ' wait for ATN from cog 0
                drvnot  pb                  ' toggle the LED
                jmp     #loop
Thanks for finding that. I tested for COGID correctness by using all_cogs_blink.spin2, which only showed me group behavior. I should have been more thorough. I will have this updated in 24 hours. Thanks, again.
It's just an adjustment so that the expression returns the correct ID for the given instance. "sys" must increment for each instance, but it has an offset from the cog ID for its primary purpose. That offset must have changed with the latest optimisations.
The '- 3' is needed because of the offset in the 3-bit counter (for 8 cogs) between when a cog is given a hub cycle to issue the COGID and when the hub actually reports it.
Comments
Seems a lot has been added since then...
Just curious if it's better to have 16 good cogs or 8 great cogs...
Hard to imagine going backwards though...
Now, 8 cogs and 64 smart pins fit in the A9.
Makes me wonder if what will fit in the A9 is somehow related to what will fit in the actual chip...
Certainly there will be some correlation, and some % of A9 will roughly indicate a 'full die area'...
Not sure I understand you right, but what will fit the chip is less than what fits the A9.
On another topic... take a look at this and pay attention to the discussion:
https://researchgate.net/publication/220759541_Performance_comparison_of_FPGA_GPU_and_CPU_in_image_processing
Rich
What fits on the 8.5x8.5 mm die was always only a guess. Chip had thought it was an easy fit for everything but that turned out to be too optimistic.
In the discussion, the author makes the point that it's not the speed so much as the limited algorithmic flexibility that limits some applications. If you look at his stereo vision example... that is where some real-life algorithms start... and then go on... and on... and on. This computation load is why the new Intel RealSense cameras still use a projector... even though they have stereo cameras inside. It reduces the computations tremendously... but also limits range.
Bandwidth to each GPU streamer is also a bottleneck.
Rich
This has hub-instruction timing scaled to the number of cogs, so things run faster and less logic is used.
Could you all please try this out and make sure everything looks okay?
THANKS!!!
This really could be the last version before the chip is built.
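For a rough feel of what the scaled timing buys, here is a back-of-envelope C sketch. It assumes a cog's turn at a given hub slice simply comes around once per slot period (a simplification, not measured FPGA figures): the old scheme had a fixed 16-clock period, the new one scales with the cog count.

#include <stdio.h>

/* Simple model: worst-case wait for a hub slot is (period - 1) clocks.
 * Old scheme: period fixed at 16.  New scheme: period = number of cogs. */
int main(void)
{
    const unsigned old_period = 16;
    for (unsigned ncogs = 4; ncogs <= 16; ncogs *= 2) {
        unsigned new_period = ncogs;
        printf("%2u cogs: worst-case wait %2u clocks (was %2u)\n",
               ncogs, new_period - 1, old_period - 1);
    }
    return 0;
}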
Oops!
V23 hangs on some of my code where V22 was OK.
I'm trying to isolate the issue now...
Sorry Chip
I'm trying to figure out why single_step.spin2 worked. Or, maybe I just tested it on a one-cog image.
This line:
wire [3:0] cogid_r = sys - 4 & cogm; // cogid result
...now uses '- 3', instead of '- 4'. 'cogm' acts as a mask: it's the number of cogs minus 1.
This should do it.
By the way, I HAD tested single_step.spin2 on a single-cog compile, so COGID just got slammed to '0', regardless of this bug.
What's the constant value (was 4, now 3) for?
For the record, COGID returned 7 for cog 0.
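As a sanity check on that off-by-one, here is a tiny C model of the quoted cogid_r expression. It assumes 'sys' reads the real cog ID plus three at the moment the hub reports COGID, per the discussion of the counter offset above; under that assumption, '- 4' yields exactly the symptoms reported (cog 2 reads 1, cog 0 reads 7), and '- 3' lines up.

#include <stdio.h>

/* Toy model of: wire [3:0] cogid_r = sys - N & cogm;
 * cogm = number of cogs - 1 (a mask), so 7 for an 8-cog build.
 * Assumption for this sketch: when the hub reports COGID, the 3-bit
 * counter 'sys' reads (real cog ID + 3) modulo 8.  Unsigned wrap-around
 * below mirrors the counter's modulo-8 behaviour. */
int main(void)
{
    const unsigned cogm = 7;                  /* 8 cogs - 1 */
    for (unsigned id = 0; id < 8; id++) {
        unsigned sys   = (id + 3) & cogm;     /* assumed counter value */
        unsigned v23   = (sys - 4) & cogm;    /* old '- 4': off by one */
        unsigned fixed = (sys - 3) & cogm;    /* new '- 3': correct    */
        printf("cog %u: v23 COGID=%u, fixed COGID=%u\n", id, v23, fixed);
    }
    return 0;
}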
Please test this and let me know if anything is wrong.
THANKS!
Ooooo, I spy a compatibility bug around a corner up ahead.
I just scrolled to the end of the Prop2 Document and noted a small table showing instruction formatting and a subsequent key for the lettering used. To nitpick on terminology: T is Opcode; D and S are Operands.
And you can't go calling the conditional execution field an instruction prefix either. The whole 32 bits are the one instruction.
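To illustrate the point, here is a small C sketch that splits a 32-bit instruction word using the usual Prop2 field widths (4-bit EEEE condition, 7-bit opcode, single C, Z and I bits, 9-bit D and 9-bit S); the names here are just labels for the sketch, not the document's lettering.

#include <stdio.h>
#include <stdint.h>

/* Field split of one 32-bit instruction, assuming the usual layout
 * EEEE OOOOOOO C Z I DDDDDDDDD SSSSSSSSS (4+7+1+1+1+9+9 = 32 bits).
 * The condition field is part of the instruction, not a prefix, and
 * D/S are operand fields. */
static void split(uint32_t ins)
{
    unsigned s    =  ins        & 0x1FF;   /* S operand (9 bits)   */
    unsigned d    = (ins >> 9)  & 0x1FF;   /* D operand (9 bits)   */
    unsigned i    = (ins >> 18) & 1u;      /* I: S is an immediate */
    unsigned z    = (ins >> 19) & 1u;      /* Z: write Z flag      */
    unsigned c    = (ins >> 20) & 1u;      /* C: write C flag      */
    unsigned op   = (ins >> 21) & 0x7F;    /* 7-bit opcode         */
    unsigned cond = (ins >> 28) & 0xF;     /* EEEE condition       */
    printf("cond=%X op=%02X C=%u Z=%u I=%u D=%03X S=%03X\n",
           cond, op, c, z, i, d, s);
}

int main(void)
{
    split(0xF1234567u);    /* arbitrary example word */
    return 0;
}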