LOGIC TOO BIG!!!!

cgracey · 2017-10-03 01:24

jmg wrote: »

cgracey wrote: »

If we cut the cogs from 16 down to 8, we save 8 * (1.5 mm2 for logic + 2 * 0.292 mm2 for 512x32 DP RAMs) = 16.7 mm2. That gets us where we need to be, by 2.7 mm2.
That 2.7 mm2 will hopefully be enough spare room.

I told him to reduce the cogs from 16 to 8. That also halves the main RAM, for now, but I believe they have a 16kx32 instance we could use to keep the hub RAM at 512KB. We'll see what he comes back with.

So you are saying it is being synthesized at 256k + 8 COGs, with a hope it could be 512k +8 COGs ?

That's right. It shouldn't make any difference timing-wise, as they'll just substitute the bigger RAM in tomorrow.

jmg · 2017-10-03 01:26

cgracey wrote: »

So, it looks like we will have 4 mm2 to spare on the die, after upping the eight hub RAMs to 16384 x 32, in order to maintain 512KB. And that's already accounting for CTS, DFT, floorplanning, and placement density.

Does that mean you can squeeze in another ~64k~128k? bytes of HUB RAM, using 8 smaller units ?

jmg · 2017-10-03 01:31

Phil Pilgrim (PhiPi) wrote: »

cgracey wrote:

512KB enables nice VGA displays.

Is that still a major consideration? The world is all HDMI now, and the Raspberry Pi does that for cheap. Before refactoring the P2's features to fit the die constraints ad hoc, it might be well to consider how it affects a realistic application arena.

-Phil

I think a useful P2 target is LCD displays, and there 800 x 480 seems to be common / growing size & segment.
P2 is a similar price to SSD1963 Display Controller, which has more memory.

800x480 needs 768000 Bytes, at 16bpp, or 384000 bytes at 8bpp

The FTDI Eve2 series claims 800x600 support, using some form of display list handler. Price is ~ $5.40/1k, so less than P2.

Clocked LCDs are more timing tolerant than VGA, so it may be that HyperRAM + LCD + P2 is a possible LCD solution.
HyperRAM is ~$2.30/2k5

Electrodude · 2017-10-03 01:46

cgracey wrote: »

One thing I have to work on, though, is the CORDIC. It was failing timing, miserably, due to K-factor correction. This involves two 40-bit adders in sixteen stages which need to become their own stages. So, the CORDIC will take 16 more clocks and need 2048 more flops and 32 more 40-bit adders (X and Y in 16 stages). I'll work on this now and have it ready for the morning. It was only reaching 80MHz, or so, but I had them set a parameter to exclude the K-factor-correction adders and then everything compiled just fine.

Are the 40-bit adders carry-save adders? If they're not, you might be able to speed things up enough to avoid the extra flops by making them so. I have no idea if the tools can do this automatically, but I'd bet they can't - if they can't, it might involve rewriting the whole CORDIC to deal with unpropagated carry everywhere, which sounds like it'd be too much trouble. A quick Google found several articles, mostly behind paywalls unfortunately, on carry-save CORDIC implementations.

T Chap · 2017-10-03 01:52

add a 9th cog if there is spare space.

cgracey · 2017-10-03 01:55

Electrodude wrote: »

cgracey wrote: »

One thing I have to work on, though, is the CORDIC. It was failing timing, miserably, due to K-factor correction. This involves two 40-bit adders in sixteen stages which need to become their own stages. So, the CORDIC will take 16 more clocks and need 2048 more flops and 32 more 40-bit adders (X and Y in 16 stages). I'll work on this now and have it ready for the morning. It was only reaching 80MHz, or so, but I had them set a parameter to exclude the K-factor-correction adders and then everything compiled just fine.

Are the 40-bit adders carry-save adders? If they're not, you might be able to speed things up enough to avoid the extra flops by making them so. I have no idea if the tools can do this automatically, but I'd bet they can't - if they can't, it might involve rewriting the whole CORDIC to deal with unpropagated carry everywhere, which sounds like it'd be too much trouble. A quick Google found several articles, mostly behind paywalls unfortunately, on carry-save CORDIC implementations.

The synthesis tools can pick carry-save adders when needed for speed, and the timing report I got back earlier had instance names with "csa" in them. So, yes, fast adder architectures have been applied automatically, but still couldn't bridge the timing gap.

cgracey · 2017-10-03 01:56

T Chap wrote: »

add a 9th cog if there is spare space.

That would cause the framework to expand up to the 16-cog level, so it wouldn't be worth it. We are pretty much stuck with powers of 2.

evanh · 2017-10-03 02:04

jmg wrote: »

Phil Pilgrim (PhiPi) wrote: »

cgracey wrote:

512KB enables nice VGA displays.

Is that still a major consideration? The world is all HDMI now, and the Raspberry Pi does that for cheap. Before refactoring the P2's features to fit the die constraints ad hoc, it might be well to consider how it affects a realistic application arena.

-Phil

I think a useful P2 target is LCD displays, and there 800 x 480 seems to be common / growing size & segment.
P2 is a similar price to SSD1963 Display Controller, which has more memory.

800x480 needs 768000 Bytes, at 16bpp, or 384000 bytes at 8bpp

And any raw LCD signals can be routed through a LVDS driver. Which in-turn is the basis of DVI/HDMI and even DisplayPort I suspect. As long as the monitor is happy without all the copy protection garbage then all those options are open via a simple CMOS output on the Prop2.

It just takes some extra pins to do the bit-depth of course, but there again that is up to the system designer to choose how many bits per pixel.

jmg · 2017-10-03 02:08

cgracey wrote: »

T Chap wrote: »

add a 9th cog if there is spare space.

That would cause the framework to expand up to the 16-cog level, so it wouldn't be worth it. We are pretty much stuck with powers of 2.

If fractionally more COGs or fractionally more RAM than 512k is off the table, perhaps better debug support could be added ?

A compact i2c multi-port memory might allow enough speed for Debug, and some 'operating properly' tasks too, via a COG-COG back channel type link.... ?

potatohead · 2017-10-03 02:19

How about we just synthesize what we have, hope it can run at good power, and test what we have so we aren't all living with a bone head error for the next 10 years?

Some of that room may just be needed to implement what is already on the table.

Also, it's no longer free. Never was, as doing that cost COGS, but now it will cost real dollars.

cgracey · 2017-10-03 02:23

I think the time for creative design is past, as we are now heading down the chute.

msrobots · 2017-10-03 02:29

^finally.

Mike

ozpropdev · 2017-10-03 02:36

We already have the Prop123_A9_Prop2_v21a_64smartpins.rbf image ready to go.
FPGA testing can continue as normal.

Mike Green · 2017-10-03 02:40

Given what Peter Jakacki has been able to do with threads in Tachyon Forth on a P1, I'm not too worried about the switch from 16 cogs to 8 cogsl

Seairth · 2017-10-03 02:46

So, CORDIC is going to take longer. But does the hub speedup mean a cog can queue twice as many operations also?

Edit: actually, this is probably moot. At 8 clock cycles, there wouldn't be enough time to keep the pipeline fully loaded anyhow. At least it only has to wait half as long (on average) to start an operation.

Seairth · 2017-10-03 02:48

Given the available space, are there any trivial optimizations that could be applied to push Fmax higher?

msrobots · 2017-10-03 03:14

yes @chip is working on the CORDIC stuff, since it by now brings Fmax down to 80Mhz.

Mike

Cluso99 · 2017-10-03 03:25

512KB is preferable to 16 COGs.

With what we have, IMHO 16 cores would mainly be a marketing advantage - lots of free press.

Have you considered the possibility of reducing the SmartPins? Or every second SmartPin being a DumbPin just for config and ADC?

CORDIC: Could you just extend the number of clocks to execute?

cgracey · 2017-10-03 03:35

Cluso99 wrote: »

512KB is preferable to 16 COGs.

With what we have, IMHO 16 cores would mainly be a marketing advantage - lots of free press.

Have you considered the possibility of reducing the SmartPins? Or every second SmartPin being a DumbPin just for config and ADC?

CORDIC: Could you just extend the number of clocks to execute?

Without smart pins, we couldn't access the DAC and utilize the ADC in each pin. The logic in the 64 smart pins amounts to 87% of the logic in the 8 cogs.

I'm going to add 16 stages to the CORDIC system to perform the K-factor compensation. That means it will take 16 more clocks.

Phil Pilgrim (PhiPi) · 2017-10-03 04:10

cgracey wrote:

I still like VGA. And NTSC.

So do I. But I've not had to purchase a monitor since, like, ten years ago. Are VGA and NTSC still a thing? They seem so last century seventeen years into this one.

-Phil

potatohead · 2017-10-03 04:22

Yes, they still are.

Despite the digital advances, analog is still super easy, still sees a lot of use. No licenses, no BS, lots of cheap hardware out there.

Very worst case, one sends those into a converter chip, where all the Smile is managed by them, leaving us free to focus on other things besides an insane complicated and fast serial stream. That's actually going to be one of my P2 board solutions. Just so the option is there.

Tubular · 2017-10-03 04:39

So do I. And selling based on it

Cluso99 · 2017-10-03 05:03

Composite Video (NTSC, PAL or B&W) monitors are readily available in all sizes and definitions. Lots are used for security and also for car monitors.

VGA LCD monitors are readily available, and also being replaced. IIRC there are VGA to HDMI converters for HDMI only TVs.

evanh · 2017-10-03 05:06

potatohead wrote: »

Very worst case, one sends those into a converter chip, where all the Smile is managed by them, ...

As I mentioned a few posts back, raw LCD data is all that is needed to use DVI/HDMI/DisplayPort. The LVDS driver chip does the serialization. It won't be hard to go digital, it's still the same timing data setups. Functionally that's the same as VGA. It just requires a decent amount of pins is all. And the video DACs get put aside.

Cluso99 · 2017-10-03 05:09

cgracey wrote: »

Cluso99 wrote: »

512KB is preferable to 16 COGs.

With what we have, IMHO 16 cores would mainly be a marketing advantage - lots of free press.

Have you considered the possibility of reducing the SmartPins? Or every second SmartPin being a DumbPin just for config and ADC?

CORDIC: Could you just extend the number of clocks to execute?

Without smart pins, we couldn't access the DAC and utilize the ADC in each pin. The logic in the 64 smart pins amounts to 87% of the logic in the 8 cogs.

I'm going to add 16 stages to the CORDIC system to perform the K-factor compensation. That means it will take 16 more clocks.

Those SmartPins are definitely using a lot of silicon!!!

Put a 4x COG, 32KB Hub P1 self-contained with access to the first 32 I/O only, into the spare space. <just kidding>

evanh · 2017-10-03 05:12

If I got my facts right, the LVDS video serialiser system hasn't really changed since it was introduced for digital video recording back in the early 1990's. DVI followed by HDMI both piggybacked it for display purposes.

evanh · 2017-10-03 05:14

And some LCD's also now use LVDS on their data port as well.

potatohead · 2017-10-03 05:23

Re: LCD

Yup. All agreed, and all good. We have options. Doing it that way does omit the color management, though one can still use the pixel mixer.

Does eat pins though.

I like the analog signals, and I like component the very best, because just one pin delivers a nice monochrome signal all the way up to HDTV resolution, and at TV sweeps, will display on any old TV in the world.

Add two more, and it's full color, via difference signals.

I did pick up one of those Professional Video Monitors a while back. Man, they are insane good. I should have got one years ago. Had no idea what NTSC can actually do. It's crazy what that display will pick off just a composite signal. Pixel perfect up to about 800x400 on component / rgb.

If one watches, they can be had for a song.

Phil Pilgrim (PhiPi) · 2017-10-03 05:33

Okaaay ... I get it. And generating video was a huge deal for the P1, since so few micros could in 2006. But is this still such a major accomplishment for a micro eleven years hence? Sorry, but I'm just trying to figure out where Heater's thin, wide layer of app possibilities fits into a video standard that's current to 1985.

-Phil

cgracey · 2017-10-03 05:49

Phil Pilgrim (PhiPi) wrote: »

Okaaay ... I get it. And generating video was a huge deal for the P1, since so few micros could in 2006. But is this still such a major accomplishment for a micro eleven years hence? Sorry, but I'm just trying to figure out where Heater's thin, wide layer of app possibilities fits into a video standard that's current to 1985.

-Phil

Well, I like the old standards because they are easy-to-use and very direct. Anything that outputs HDMI natively would probably inject a lot of madness between you and those pixels. The simpler your intent, the worse it would likely be. You'd have to do it all their way, becoming them in the process. And that's fine if you don't mind several-second power-ups and having to do 8GB installs over SD cards. It's all just part of the way things are today. Nothing wrong with any of that.

LOGIC TOO BIG!!!!

Comments