LOGIC TOO BIG!!!!

cgracey · 2017-10-02 21:54

The synthesis guy just came back and said that the logic+memories area is looking to be 72 mm2. We only have 58 mm2 of space in the middle of our huge 8.5 x 8.5 mm die. We are 24% oversized!!!

We need to shave at least 14 mm2 (72 - 58).

We have 16 instances of 8192x32 SP RAM at 1.57 mm2 each = ~25 mm2.

We have 32 instances of 512x32 DP RAM at 0.292 mm2 each = ~9.3 mm2.

Those RAMs total to ~34mm2.

This means our current logic is 72 - 34 = 38 mm2. That's over 3x what Prop2-Hot was with 8 cogs! The smart pins are contributing to this bloat.

Each smart pin is 1/9th the logic of a cog, so 64 of them are equivalent to 64/9 = 7 cogs. The CORDIC is equivalent to 2 cogs. So, we have 16 + 7 + 2 = 25 cogs' equivalent of logic here. This means a cog's worth of logic is about 1.5 mm2 (38 / 25).

If we cut the cogs from 16 down to 8, we save 8 * (1.5 mm2 for logic + 2 * 0.292 mm2 for 512x32 DP RAMs) = 16.7 mm2. That gets us where we need to be, by 2.7 mm2. That 2.7 mm2 will hopefully be enough spare room.
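The budget above can be rechecked with a short Python sketch. It uses only the figures quoted in this post; the variable and function names are made up for illustration, and it carries the unrounded values through instead of the rounded ones (34, 7, 38), so intermediate numbers differ slightly.

```python
# Figures quoted in the post (areas in mm^2); names are illustrative only.
SYNTH_AREA = 72.0     # logic + memories, per the synthesis report
DIE_SPACE = 58.0      # room available in the middle of the die
SP_RAM_AREA = 1.57    # one 8192x32 single-port hub RAM instance
DP_RAM_AREA = 0.292   # one 512x32 dual-port cog RAM instance

overshoot = SYNTH_AREA - DIE_SPACE              # area that must be shaved
ram_area = 16 * SP_RAM_AREA + 32 * DP_RAM_AREA  # all RAM instances
logic_area = SYNTH_AREA - ram_area              # everything that is not RAM

# Logic in "cog equivalents": 16 cogs, 64 smart pins at 1/9 cog each,
# and the CORDIC counted as 2 cogs.
cog_equivalents = 16 + 64 / 9 + 2
area_per_cog = logic_area / cog_equivalents


def saving_from_removing_cogs(n, logic_per_cog=1.5):
    """Area freed by dropping n cogs, each with two 512x32 DP RAMs."""
    return n * (logic_per_cog + 2 * DP_RAM_AREA)


saved = saving_from_removing_cogs(8)
margin = saved - overshoot

print(f"overshoot      {overshoot:.1f} mm2")
print(f"RAM area       {ram_area:.1f} mm2")
print(f"logic area     {logic_area:.1f} mm2")
print(f"area per cog   {area_per_cog:.2f} mm2")
print(f"saved (8 cogs) {saved:.1f} mm2")
print(f"margin         {margin:.1f} mm2")
```

With unrounded inputs the per-cog figure comes out just under 1.5 mm2, so the rounding in the post does not change the conclusion.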

I told him to reduce the cogs from 16 to 8. That also halves the main RAM, for now, but I believe they have a 16kx32 instance we could use to keep the hub RAM at 512KB. We'll see what he comes back with.

msrobots · 2017-10-02 22:03

I hope with main ram you refer to COG ram.

It is still 512 KB HUB ram, but just 8 COGs, right?

Bad news.

edit: ahh, I see 16kx32.

Mike

cgracey · 2017-10-02 22:10

Yes. Same 512KB hub RAM, but only 8 cogs.

Really, I don't know that 8 cogs would be too few. They do a lot more than on Prop1, so 16 might have been huge overkill.

TonyB_ · 2017-10-02 22:22

Sacrificing cogs is better than losing hub RAM. There would have been cogs doing nothing in many cases. People will have to write more efficient code! Let's hope a 16kx32 instance is available to keep it at 512KB. Will the egg beater need re-doing? Looking on the bright side, maximum power consumption will be lower.

cgracey · 2017-10-02 22:24

Deja vu, I know. This is like Groundhog Day, except we aren't looping, anymore.

OnSemi does have a 16384 x 32 SP RAM instance we can use to keep the RAM at 512KB.

They are resynthesizing now with 8 cogs.

cgracey · 2017-10-02 22:25

Yes, power will be lower and hub latency will be reduced. This feels fine, to me.

cgracey · 2017-10-02 22:34

The egg-beater is parameter-driven for 1/2/4/8/16 cogs. Fewer cogs also means a slight speed-up.

evanh · 2017-10-02 22:36

Bugger. I was thinking there was going to be room to spare at 16 Cogs.

Was Prop2-Hot never going to have a Smartpins-like feature?

cgracey · 2017-10-02 22:38

evanh wrote:

Bugger. I was thinking there was going to be room to spare at 16 Cogs.

Was Prop2-Hot never going to have a Smartpins-like feature?

It just had counters, like the Prop1.

We really need to keep all the smart pins, because they are needed to control the analog portions of each pin.

msrobots · 2017-10-02 22:49

I think RAM is more important than COGs. Going back to 8 is sad, but better than halving the HUB.

16 sounded good, but HUBEXEC and LUT ram will reduce the need for COGs bound to a single Object/Protocol as usually done on the P1.

TACHYON for example runs mostly on one COG and produces quite small wonders on a P1.

Time to write some fast P2-P2 transfer code using a couple of smart pins on each side, to make it easier to use more than one P2 on a board. The serial boot already supports multiple P2s; we just need to make it easy to use multiple P2s in a sane manner.

Mike

potatohead · 2017-10-02 23:10

8 is fine. Truth is, 16 COGS made sense without the tasker in hot. We added events, which can do a lot. People can cram a lot into a COG now, and it can respond quick, if needed.

ONWARD!

ozpropdev · 2017-10-02 23:14

Arrrrrgh!

Only 8 cogs hurts a bit LOT.
At least hub and smart pins remain and fingers crossed silicon >160MHz.
Have to change P2 planning a bit.

potatohead · 2017-10-02 23:17

I was mentally ready for this. Features were all we can eat. Of course it got big.

potatohead · 2017-10-02 23:20

We need to test on 8 COGS now.

Publison · 2017-10-02 23:20

Well, Cluso99 got his wish for a scaled-down P2.

Albeit in the same package.

cgracey · 2017-10-02 23:26

To get 16 cogs, we'd have to settle for 128KB hub RAM.
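One way to see why 128KB comes up is sketched below. It assumes the hub RAM would still be built from the 8192x32 SP RAM instances quoted earlier (1.57 mm2 and 32KB each), that only power-of-two hub sizes are considered, and that hub logic area stays the same. This is a reconstruction of the arithmetic, not necessarily how the figure was reached, and the names are illustrative.

```python
# Figures from earlier in the thread (areas in mm^2).
OVERSHOOT = 72.0 - 58.0             # area that must be shaved with 16 cogs
SP_RAM_AREA = 1.57                  # one 8192x32 SP RAM instance
SP_RAM_KB = 8192 * 32 // 8 // 1024  # capacity of one instance in KB (32)


def hub_ram_saving(hub_kb, baseline_kb=512):
    """Area freed by shrinking hub RAM from baseline_kb to hub_kb."""
    removed_instances = (baseline_kb - hub_kb) // SP_RAM_KB
    return removed_instances * SP_RAM_AREA


for hub_kb in (256, 128):
    saving = hub_ram_saving(hub_kb)
    verdict = "enough" if saving >= OVERSHOOT else "not enough"
    print(f"{hub_kb}KB hub: saves {saving:.2f} mm2, {verdict}")
```

Under these assumptions, halving the hub RAM to 256KB falls short of the 14 mm2 needed, while 128KB clears it.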

potatohead · 2017-10-02 23:28

That's too small to be the first one.

Tubular · 2017-10-02 23:30

Finally, proper cull-the-engineers kind of wranglings. Must be in the home straight

512kB better than 16 cogs for most, I think.

Now sit tight for power estimations, though I have to admit that package at 20C/watt JA is quite impressive

Peter Jakacki · 2017-10-02 23:33

Chip, you have packed so much into this design that it is like a triple redundant system and the loss of 8 cogs won't be noticed by practically all except the most esoteric applications since you have interrupts, smart pins, hub exec etc etc etc.

Please go ahead and build it and they will come.

potatohead · 2017-10-02 23:34

Yup

msrobots · 2017-10-03 00:13

Hey, we have 8 pins per COG now; that is double what the P1 has. The COG has a LUT now; that is double the RAM per COG the P1 has. The P2 will run at 160MHz; that is double the P1's 80MHz. The P2 even has 16 times as much HUB RAM as the P1.

One could say a truly P2.

And then there are the smart pins, way more powerful than the 2 counters of a P1 COG, and 8 per COG, statistically.

Also, displaying 16 separate COGs in Blockly needs way too much horizontal space on the screen, and will be confusing...

and that 2.7mm2 could be used for a little bit more HUB, like say 640K?, 596K?, EEPROM?

Ducks and runs for cover...

Mike

cgracey · 2017-10-03 00:16

...And the Prop2 executes instructions in two clocks, instead of the Prop1's four clocks. So, cogs execute 4x faster, when considering the 160MHz.
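The 4x figure can be worked through in a couple of lines. This is a rough sketch that assumes every instruction takes the nominal clock count quoted above and ignores hub waits and multi-cycle instructions; the function name is illustrative.

```python
def mips(clock_mhz, clocks_per_instruction):
    """Nominal instruction rate in millions of instructions per second."""
    return clock_mhz / clocks_per_instruction


p1 = mips(80, 4)    # Prop1: 80MHz, four clocks per instruction
p2 = mips(160, 2)   # Prop2: 160MHz, two clocks per instruction
speedup = p2 / p1

print(f"Prop1 {p1:.0f} MIPS, Prop2 {p2:.0f} MIPS, speed-up {speedup:.0f}x per cog")
```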

ozpropdev · 2017-10-03 00:24

16 cogs with 128k vs 8 cogs with 512k RAM.

512k wins no matter how many different ways you look at it!

cgracey · 2017-10-03 00:27

512KB enables nice VGA displays.

ozpropdev · 2017-10-03 00:30

cgracey wrote:

512KB enables nice VGA displays.

and should work nicely with your new SPIN interpreter.

msrobots · 2017-10-03 00:40

Having 'just' 8 COGs, instead of the dreamed-of 16, will have an impact on driver code. Throwing two COGs at a problem is not as easy anymore as with an abundant 16 of them. So having a USB mouse, a keyboard AND a USB stick as a file system will quickly reduce the number of free COGs.

Is jmpret still there in P2? I struggled with it for a long time, and now that I basically know how to use it, I would really miss the co-routine aspect of it. Time-slicing at will.

Mike

Roy Eltham · 2017-10-03 01:02

Definitely do 512K hub with 8 cogs, not 128K hub with 16 cogs.

512K is already too small, really. 128K would be a death wish.

jmg · 2017-10-03 01:08

cgracey wrote:

If we cut the cogs from 16 down to 8, we save 8 * (1.5 mm2 for logic + 2 * 0.292 mm2 for 512x32 DP RAMs) = 16.7 mm2. That gets us where we need to be, by 2.7 mm2.
That 2.7 mm2 will hopefully be enough spare room.

I told him to reduce the cogs from 16 to 8. That also halves the main RAM, for now, but I believe they have a 16kx32 instance we could use to keep the hub RAM at 512KB. We'll see what he comes back with.

So you are saying it is being synthesized at 256k + 8 COGs, with a hope it could be 512k + 8 COGs?

Phil Pilgrim (PhiPi) · 2017-10-03 01:13

cgracey wrote:

512KB enables nice VGA displays.

Is that still a major consideration? The world is all HDMI now, and the Raspberry Pi does that for cheap. Before refactoring the P2's features to fit the die constraints ad hoc, it might be well to consider how it affects a realistic application arena.

-Phil

cgracey · 2017-10-03 01:17

Good news!

The 8-cog 166MHz synthesis run completed really quickly, which is always a good sign. The 16-cog synthesis effort had been running for most of the day and wasn't getting anywhere, so they killed it and ran the new 8-cog one, instead.

So, it looks like we will have 4 mm2 to spare on the die, after upping the eight hub RAMs to 16384 x 32, in order to maintain 512KB. And that's already accounting for CTS, DFT, floorplanning, and placement density.

One thing I have to work on, though, is the CORDIC. It was failing timing, miserably, due to K-factor correction. This involves two 40-bit adders in sixteen stages which need to become their own stages. So, the CORDIC will take 16 more clocks and need 2048 more flops and 32 more 40-bit adders (X and Y in 16 stages). I'll work on this now and have it ready for the morning. It was only reaching 80MHz, or so, but I had them set a parameter to exclude the K-factor-correction adders and then everything compiled just fine.

cgracey · 2017-10-03 01:22

Phil Pilgrim (PhiPi) wrote:

cgracey wrote:

512KB enables nice VGA displays.

Is that still a major consideration? The world is all HDMI now, and the Raspberry Pi does that for cheap. Before refactoring the P2's features to fit the die constraints ad hoc, it might be well to consider how it affects a realistic application arena.

-Phil

I still like VGA. And NTSC.
