I've got the hub slot frequency increased to 1/cogs, instead of the old 1/16.
Cool.
Does this still mean the number of cogs should be a power of two, or is (e.g.) 3, 5 or 9 now practical?
They need to be a power of 2, because for each cog there must be a hub RAM instance, and the number of hub RAM instances definitely needs to be a power of 2 to cover the LSBs of the hub addresses.
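To make the power-of-2 point concrete, here is a small C sketch of an interleaved hub: the low address bits select which RAM instance holds a given long, and that masking only covers every address cleanly when the instance count is a power of two. The rotation formula and the names (NCOGS, slice_of, slice_visible) are illustrative assumptions for this sketch, not the actual P2 Verilog.

#include <stdio.h>
#include <stdint.h>

/* Illustrative model only: NCOGS hub RAM slices, selected by the low
 * bits of the long address.  The mask trick below only works when
 * NCOGS is a power of two. */
#define NCOGS      8u           /* must be a power of two */
#define SLICE_MASK (NCOGS - 1u)

/* Which RAM instance holds this long address. */
static unsigned slice_of(uint32_t long_addr)
{
    return long_addr & SLICE_MASK;
}

/* Assumed rotation: each clock, a cog sees the next slice, so a given
 * slice comes around once every NCOGS clocks (the 1/cogs slot rate). */
static unsigned slice_visible(unsigned cog, unsigned clock)
{
    return (cog + clock) & SLICE_MASK;
}

int main(void)
{
    uint32_t addr = 5;                       /* arbitrary long address */
    unsigned want = slice_of(addr);
    for (unsigned t = 0; t < NCOGS; t++)
        if (slice_visible(0, t) == want)
            printf("cog 0 reaches slice %u after %u clock(s)\n", want, t);
    return 0;
}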
As Phil used to decry non-orthogonality, what about a 9th cog with no hub, no smart pins, etc.? Maybe require some special instantiation, like a secret spare stripped-down cog.
Yes, the RAM needs to be 100% mapped, in binary LSB increments, but the cogs do not have to exist; it is just more 'humanly sensible' if they do.
That could make sense, if there were a means to do cog-to-cog messages without going via the hub.
Not sure if the smart pins can be used as a cog-to-cog bridge, or what speeds that could give?
Uses I can think of are Watchdog / system background checking and Debug, but it is too late for that sort of change.
Hopefully, Debug will be good enough, built on what is there now.
Even 27c MCUs can single-step, edit SFRs and RAM, and set breakpoints with no MCU resource cost.
I don't think the P2 can get to 'no MCU resource cost' for debug, but quite small debug stubs should be possible?
Has anyone got a minimum size yet for the 'smallest debug stub'?
If the smart pins and CORDIC were removed, and maybe the LUT RAM trimmed, then 16 cogs + 512 KB hub RAM would fit. That's not a sensible option any longer, though; we're talking Prop3 stuff now.
Still, there was a time when all 16 cogs and all 64 smartpins fit in the A9, right?
I'm not sure that was ever the case; FWIR, the 64-smart-pin build is relatively recent and was done to allow proper test coverage across all the smart pins.
At 800M (peak) operations per second, the memory architecture of the P2 greatly overcomes the major limitations of GPUs' stream processing units, and it certainly beats FPGAs in terms of engineering time and complexity.
If one were to configure parallel P2's as stream processors...
GPUs have gotten crazy fast and a lot of people are actually coding for GPU instead of CPU.
But, that realm is a lot different from the MCU realm, I think.
Look at these specs:
We packed the most raw horsepower we possibly could into this GPU. It’s driven by 3840 NVIDIA® CUDA® cores running at 1.6 GHz and packs 12 TFLOPs of brute force. Plus it’s armed with 12 GB of GDDR5X memory running at over 11 Gbps.
You'd need a lot of P2s to get anywhere near that...
I'm planning to have a lot of P2's... looking around for someone to take a leap on the mother board:)
Chip,
The problem seems to be caused by COGID returning the wrong value.
The following test code blinks the green LEDs on the A9 board, matching the position of the cog LEDs.
In V22 the alignment is correct, and in V23 the alignment is out by 1,
i.e., COGID returns 1 for cog 2 instead of 2.
dat             org
                coginit #2,##@blink         ' start cog 2 running blink
                coginit #7,##@blink         ' start cog 7 running blink
                getct   pa                  ' get current system counter
main            addct1  pa,##20_000_000     ' set CT1 event 20,000,000 clocks ahead
                waitct1
                cogatn  #%10000100          ' strobe ATN to cogs 7 and 2
                jmp     #main

                org
blink           cogid   pb                  ' pb = this cog's ID
                add     pb,#40              ' LED pin = 40 + cog ID
loop            waitatn                     ' wait for ATN from cog 0
                drvnot  pb                  ' toggle the LED
                jmp     #loop
Thanks for finding that. I tested for COGID correctness by using all_cogs_blink.spin2, which only showed me group behavior. I should have been more thorough. I will have this updated in 24 hours. Thanks, again.
It's just an adjustment so that the expression returns the correct ID for the given instance. "sys" must increment for each instance, but it has an offset from the cog ID for its primary purpose. That offset must have changed with the latest optimisations.
The '- 3' is needed because of the offset in the 3-bit counter (for 8 cogs) between when a cog is given a hub cycle to issue the COGID and when the hub actually reports it.
Comments
Seems a lot has been added since then...
Just curious if it's better to have 16 good cogs or 8 great cogs...
Hard to imagine going backwards though...
Now, 8 cogs and 64 smart pins fit in the A9.
Makes me wonder if what will fit in the A9 is somehow related to what will fit in the actual chip...
Certainly there will be some correlation, and some % of A9 will roughly indicate a 'full die area'...
Not sure I understand you right, but what will fit the chip is less than what fits the A9.
On another topic... take a look at this and pay attention to the discussion:
https://researchgate.net/publication/220759541_Performance_comparison_of_FPGA_GPU_and_CPU_in_image_processing
Rich
What fits on the 8.5x8.5 mm die was always only a guess. Chip had thought it was an easy fit for everything but that turned out to be too optimistic.
In the discussion, the author makes the point that it's not the speed so much as the limited algorithmic flexibility that limits some applications. If you look at his stereo vision example... that is where some real-life algorithms start... and then go on... and on... and on. This computation load is why the new Intel RealSense cameras still use a projector... even though they have stereo cameras inside. It reduces the computations tremendously... but also limits range.
Bandwidth to each GPU streamer is also a bottleneck.
Rich
This has hub-instruction timing scaled to the number of cogs, so things run faster and less logic is used.
Could you all please try this out and make sure everything looks okay?
THANKS!!!
This really could be the last version before the chip is built.
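For a rough feel of what the scaled timing buys, here is a back-of-envelope C sketch. It assumes a cog's turn at a given hub slice simply comes around once per slot period (a simplification, not measured FPGA figures): the old scheme had a fixed 16-clock period, the new one scales with the cog count.

#include <stdio.h>

/* Simple model: worst-case wait for a hub slot is (period - 1) clocks.
 * Old scheme: period fixed at 16.  New scheme: period = number of cogs. */
int main(void)
{
    const unsigned old_period = 16;
    for (unsigned ncogs = 4; ncogs <= 16; ncogs *= 2) {
        unsigned new_period = ncogs;
        printf("%2u cogs: worst-case wait %2u clocks (was %2u)\n",
               ncogs, new_period - 1, old_period - 1);
    }
    return 0;
}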
Oops!
V23 hangs on some of my code where V22 was OK.
I'm trying to isolate the issue now...
Sorry Chip
I'm trying to figure out why single_step.spin2 worked. Or, maybe I just tested it on a one-cog image.
This line:
wire [3:0] cogid_r = sys - 4 & cogm; // cogid result
...now uses '- 3', instead of '- 4'. 'cogm' acts as a mask: it's the number of cogs minus 1.
This should do it.
By the way, I HAD tested single_step.spin2 on a single-cog compile, so COGID just got slammed to '0', regardless of this bug.
What's the constant value (was 4, now 3) for?
For the record, COGID returned 7 for cog 0.
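As a sanity check on that off-by-one, here is a tiny C model of the quoted cogid_r expression. It assumes 'sys' reads the real cog ID plus three at the moment the hub reports COGID, per the discussion of the counter offset above; under that assumption, '- 4' yields exactly the symptoms reported (cog 2 reads 1, cog 0 reads 7), and '- 3' lines up.

#include <stdio.h>

/* Toy model of: wire [3:0] cogid_r = sys - N & cogm;
 * cogm = number of cogs - 1 (a mask), so 7 for an 8-cog build.
 * Assumption for this sketch: when the hub reports COGID, the 3-bit
 * counter 'sys' reads (real cog ID + 3) modulo 8.  Unsigned wrap-around
 * below mirrors the counter's modulo-8 behaviour. */
int main(void)
{
    const unsigned cogm = 7;                  /* 8 cogs - 1 */
    for (unsigned id = 0; id < 8; id++) {
        unsigned sys   = (id + 3) & cogm;     /* assumed counter value */
        unsigned v23   = (sys - 4) & cogm;    /* old '- 4': off by one */
        unsigned fixed = (sys - 3) & cogm;    /* new '- 3': correct    */
        printf("cog %u: v23 COGID=%u, fixed COGID=%u\n", id, v23, fixed);
    }
    return 0;
}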
Please test this and let me know if anything is wrong.
THANKS!
Ooooo, I spy a compatibility bug around a corner up ahead.
I just scrolled to the end of the Prop2 Document and noted a small table showing instruction formatting and a subsequent key for the lettering used. To nitpick on terminology: T is Opcode; D and S are Operands.
And you can't go calling the conditional execution field an instruction prefix either. The whole 32 bits are the one instruction.
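To illustrate the point, here is a small C sketch that splits a 32-bit instruction word using the usual Prop2 field widths (4-bit EEEE condition, 7-bit opcode, single C, Z and I bits, 9-bit D and 9-bit S); the names here are just labels for the sketch, not the document's lettering.

#include <stdio.h>
#include <stdint.h>

/* Field split of one 32-bit instruction, assuming the usual layout
 * EEEE OOOOOOO C Z I DDDDDDDDD SSSSSSSSS (4+7+1+1+1+9+9 = 32 bits).
 * The condition field is part of the instruction, not a prefix, and
 * D/S are operand fields. */
static void split(uint32_t ins)
{
    unsigned s    =  ins        & 0x1FF;   /* S operand (9 bits)   */
    unsigned d    = (ins >> 9)  & 0x1FF;   /* D operand (9 bits)   */
    unsigned i    = (ins >> 18) & 1u;      /* I: S is an immediate */
    unsigned z    = (ins >> 19) & 1u;      /* Z: write Z flag      */
    unsigned c    = (ins >> 20) & 1u;      /* C: write C flag      */
    unsigned op   = (ins >> 21) & 0x7F;    /* 7-bit opcode         */
    unsigned cond = (ins >> 28) & 0xF;     /* EEEE condition       */
    printf("cond=%X op=%02X C=%u Z=%u I=%u D=%03X S=%03X\n",
           cond, op, c, z, i, d, s);
}

int main(void)
{
    split(0xF1234567u);    /* arbitrary example word */
    return 0;
}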