LOGIC TOO BIG!!!!

cgracey · 2017-10-03 06:05

Phil, with the ADC in 33mV-span mode, we could do RIAA curve compensation. Think of the possibilities!
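For anyone who hasn't met the RIAA curve: a minimal sketch of the playback (de-emphasis) response, using the standard time constants of 3180 µs, 318 µs and 75 µs, normalized to 0 dB at 1 kHz. The function name is made up for illustration, and nothing here is specific to the P2 ADC; it only shows the shape of the curve a software filter would have to follow.

```python
import math

# Standard RIAA time constants, in seconds.
T1, T2, T3 = 3180e-6, 318e-6, 75e-6

def riaa_playback_gain_db(freq_hz, ref_hz=1000.0):
    """Playback gain in dB relative to ref_hz (hypothetical helper)."""
    def magnitude(f):
        w = 2.0 * math.pi * f
        zero = math.sqrt(1.0 + (w * T2) ** 2)
        poles = math.sqrt(1.0 + (w * T1) ** 2) * math.sqrt(1.0 + (w * T3) ** 2)
        return zero / poles
    return 20.0 * math.log10(magnitude(freq_hz) / magnitude(ref_hz))

for f in (20, 100, 1000, 10000, 20000):
    print(f"{f:>6} Hz: {riaa_playback_gain_db(f):+6.2f} dB")
```

Bass gets boosted by roughly 19 dB at 20 Hz and treble cut by a similar amount at 20 kHz, which is why the noise floor of the front end matters so much.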

potatohead · 2017-10-03 06:19

But is this still such a major accomplishment for a micro eleven years hence?

Not as significant, but still as useful. Besides, there is gonna be some fun in all of that.

Hey, someone did a VR 3D display with a C64: https://hackaday.com/2017/09/14/hacked-headset-brings-vr-to-the-commodore-64/

Hilarious, and pretty awesome at the same time.

IMHO, those features will find good uses. It's like most things that ended up in there.

I'm super excited about the ADC/DAC capabilities myself.

jmg · 2017-10-03 06:19

cgracey wrote: »

Phil, with the ADC in 33mV-span mode, we could do RIAA curve compensation. Think of the possibilities!

With what noise floor numbers?

jmg · 2017-10-03 06:22

Phil Pilgrim (PhiPi) wrote: »

Are VGA and NTSC still a thing?

Yes, but becoming niche...
For the last monitor I bought, I had to pounce on a run-out deal to get a VGA connector; the newer models no longer had the HD15.

NTSC/PAL are still big in security and small-camera uses (car reversing cameras, for example).

potatohead · 2017-10-03 06:23

Anything under -60 dB will be useful. A ton of laptops sit in that range, maybe -70 dB on a better one.

Seairth · 2017-10-03 10:21

msrobots wrote: »

Yes, @chip is working on the CORDIC stuff, since for now it brings Fmax down to 80MHz.

I meant aside from that.

samuell · 2017-10-03 13:20

Gee! Indeed disappointing news.

Well, I also prefer having 8 cogs and keeping the 512KB of RAM. 16 cogs would be overkill for just 128KB of RAM. It matters more to fit lots of code, plus the variables generated during program execution. The processing power per cog will be a significant improvement per se, and if the hub latency decreases, better yet.

Kind regards, Samuel Lourenço

evanh · 2017-10-03 13:25

Seairth wrote: »

Given the available space, are there any trivial optimizations that could be applied to push Fmax higher?

Chip had previously gone through a number of iterations along those lines I believe. Won't be anything trivial remaining.

cgracey · 2017-10-03 13:38

I just sent OnSemi the new Verilog file set with the CORDIC's k-factor adjustment in 16 new stages, instead of being lumped in with the iteration stages, which was slowing things way down.

Yesterday, they sent me a timing report for the synthesis at 166MHz. I attached it so you can see. It shows the slowest paths. They all exactly met timing because the tool optimized them to exactly that point. If you want 180MHz, the tool works harder and takes longer. Kind of like play-dough. If you ask for something unrealistic it won't finish for a long time, before giving up. The slower your timing requirement, the faster it gets the job done. It seemed to handle 166MHz pretty quickly, which is a good sign. By the way, the timing figures in that report are picosecond integers.
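To read picosecond figures like those against a clock target, it helps to turn the frequency into a period first. A minimal sketch, assuming the constraint is exactly the stated frequency (the tool may round it differently); the function names are made up for illustration.

```python
def period_ps(freq_mhz):
    """Clock period in picoseconds for a frequency given in MHz."""
    return 1_000_000.0 / freq_mhz

def slack_ps(freq_mhz, path_delay_ps):
    """Positive slack means the path meets timing; negative means it fails."""
    return period_ps(freq_mhz) - path_delay_ps

for f in (160, 166, 180, 200):
    print(f"{f} MHz -> {period_ps(f):.0f} ps per clock")
```

So a path that "exactly met timing" at 166MHz is sitting at about 6024 ps, and the same path would need to lose roughly 470 ps to pass at 180MHz.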

samuell · 2017-10-03 14:09

cgracey wrote: »

I just sent OnSemi the new Verilog file set with the CORDIC's k-factor adjustment in 16 new stages, instead of being lumped in with the iteration stages, which was slowing things way down.

Yesterday, they sent me a timing report for the synthesis at 166MHz. I attached it so you can see. It shows the slowest paths. They all exactly met timing because the tool optimized them to exactly that point. If you want 180MHz, the tool works harder and takes longer. Kind of like play-dough. If you ask for something unrealistic it won't finish for a long time, before giving up. The slower your timing requirement, the faster it gets the job done. It seemed to handle 166MHz pretty quickly, which is a good sign. By the way, the timing figures in that report are picosecond integers.

Hi Chip,

Does this mean that 180MHz operation is not feasible, or just compromised? By the way, is it still possible to do 1 instruction per cycle, instead of the 1/4 of the P1? This post really changed the game for me in two aspects: the reduced number of cogs, and now the clock frequency.

Kind regards, Samuel Lourenço

cgracey · 2017-10-03 14:13

Samuel, before we're done with this, we will push the synthesizer to the limit of what is possible. I am hoping for 200 megahertz, but maybe 180 is more realistic. If we can even be 160, that is great.

In this architecture, instructions take two clocks.

samuell · 2017-10-03 15:28

cgracey wrote: »

Samuel, before we're done with this, we will push the synthesizer to the limit of what is possible. I am hoping for 200 megahertz, but maybe 180 is more realistic. If we can even be 160, that is great.

In this architecture, instructions take two clocks.

Hum, several sources mention one cycle per instruction. But, assuming 8 cogs at 160MHz, this gives us 80MIPS per cog, which is 4x as fast as the current Propeller. Overall, not as big an improvement as initially expected (compared to 16 cogs at 200MHz, and 200MIPS per cog according to some sources).
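The arithmetic behind those figures, as a quick sketch. The P1 numbers (80MHz, 4 clocks per instruction) are an assumption chosen to match the "4x" comparison; the P2 numbers come from the posts above.

```python
def mips_per_cog(clock_mhz, clocks_per_instruction):
    """Peak instruction rate of one cog, in MIPS."""
    return clock_mhz / clocks_per_instruction

p1 = mips_per_cog(80, 4)      # assumed P1: 80MHz, 4 clocks per instruction
p2 = mips_per_cog(160, 2)     # P2 as discussed: 160MHz, 2 clocks per instruction
hoped = mips_per_cog(200, 1)  # the "200MIPS per cog" some sources mentioned

print(f"P1 {p1:.0f} MIPS/cog, P2 {p2:.0f} MIPS/cog, ratio {p2 / p1:.0f}x")
print(f"8 cogs total: {8 * p2:.0f} MIPS vs hoped-for 16 cogs: {16 * hoped:.0f} MIPS")
```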

Oh well, at least we have smart pins. Perhaps Parallax could consider releasing a version without smart pins, but having 16 cogs, because the raw processing power is also interesting.

Kind regards, Samuel Lourenço

evanh · 2017-10-03 15:39

200 MIPS/Cog was always dreaming, even for the Prop2-Hot ... maybe with nitrogen cooling. EDIT: That said, I was probably guilty of dreaming it way back then.

The main reason to have more Cogs is to farm out the timing sensitive tasks and just make it an easy job to build robots with. A compute platform, the Prop ain't.

potatohead · 2017-10-03 15:49

Hey Chip, when do we find out about power?

wmosscrop · 2017-10-03 19:25

samuell wrote: »

Oh well, at least we have smart pins. Perhaps Parallax could consider releasing a version without smart pins, but having 16 cogs, because the raw processing power is also interesting.

Samuel,

Also keep in mind that the instruction set has been substantially (wonderfully!) improved. This can mean that what took many P1 instructions might only take 1 or 2 P2 instructions (your mileage may vary, of course), with a corresponding increase in throughput.

I would suspect that if you need 16 P2 cogs to process your data, then you might run into I/O bottlenecks first (you simply won't be able to get the data in to/out of the Propeller fast enough). This is assuming you're not having to do extensive processing on a relatively small data set.

Personally, I'll take 8 cogs with a side of smart pins.

And one thing I've always liked about the Propeller is that there are no variations. If your program runs on your Propeller, it will run on every other Propeller.

Walter

samuell · 2017-10-03 20:04

wmosscrop wrote: »

samuell wrote: »

Oh well, at least we have smart pins. Perhaps Parallax could consider releasing a version without smart pins, but having 16 cogs, because the raw processing power is also interesting.

Samuel,

Also keep in mind that the instruction set has been substantially (wonderfully!) improved. This can mean that what took many P1 instructions might only take 1 or 2 P2 instructions (your mileage may vary, of course), with a corresponding increase in throughput.

I would suspect that if you need 16 P2 cogs to process your data, then you might run into I/O bottlenecks first (you simply won't be able to get the data in to/out of the Propeller fast enough). This is assuming you're not having to do extensive processing on a relatively small data set.

Personally, I'll take 8 cogs with a side of smart pins.

And one thing I've always liked about the Propeller is that there are no variations. If your program runs on your Propeller, it will run on every other Propeller.

Walter

Hi Walter,

It always depends on your end application and on which resources you do or don't need. I don't know if there are concrete plans, but I saw 5 variations of the P2, with a varying number of pins (smart or not) and cogs, and having different RAM sizes. I would prefer having fewer cogs if the RAM size is to be reduced from 512KB to just 128KB, but I can live without the smart pins for some of my applications. Nevertheless, having them allows applications like oscilloscopes and the like on a single chip. Let's say it is just another flavor. But if it is to be just one flavor, then it should have smart pins.

Kind regards, Samuel Lourenço

threadz · 2017-10-03 21:23

I'm not really seeing having only 8 cogs as a significant downside.

Sure, with the P1, I did run into problems of not having enough cogs multiple times. But that was largely because I had several COGs dedicated to managing I2C, UART, SPI, 1W, etc. These were such a problem for me that I recall once stumbling onto a P1 Object that could handle 4 full duplex serial buses on one COG. I nearly cried. I don't know what was involved in making that code but I am forever indebted to the author.

The P2 is a totally different story. Smart pins plus interrupts suddenly means that a single cog can manage multiple buses without having to do some kind of timing gymnastics that I assume a blood pact is involved in figuring out. Plus, integrated ADC probably means one less bus to worry about because an external ADC is not needed.

I have some ambitious plans for this microprocessor but I still don't really know what I'd do with more than maybe 5 cogs now that buses are going to be so much easier to manage.

rjo__ · 2017-10-03 21:56

Samuel,

samuell wrote: »

Hum, several sources mention one cycle per instruction.
Kind regards, Samuel Lourenço

Maybe referring to the streamer?

Regards,

Rich

cgracey · 2017-10-03 22:08

Samuel, that was Prop2-Hot from a few years ago that had single-cycle execution. The design was very power-hungry and required three dual-port RAMs per cog, instead of one, for cog memory.

I miss that speed, too. What we have now, though, is more balanced.

cgracey · 2017-10-03 22:12

The synthesis guy said timing closed at a hair below 188MHz. This is worst-case. They need that kind of margin to ensure 160MHz in the actual layout.
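To put a number on that margin, here is a small sketch using the two figures above (188MHz at synthesis, 160MHz targeted after layout). Treating "a hair below 188MHz" as exactly 188 is an approximation.

```python
SYNTH_MHZ = 188.0   # approximate worst-case synthesis result
LAYOUT_MHZ = 160.0  # frequency to be ensured in the actual layout

def period_ps(freq_mhz):
    """Clock period in picoseconds."""
    return 1_000_000.0 / freq_mhz

margin_pct = (SYNTH_MHZ / LAYOUT_MHZ - 1.0) * 100.0
margin_ps = period_ps(LAYOUT_MHZ) - period_ps(SYNTH_MHZ)

print(f"frequency margin: {margin_pct:.1f}%")
print(f"time left for layout effects: about {margin_ps:.0f} ps per clock")
```

That works out to about 17.5% in frequency, or a bit over 900 ps per clock period held in reserve for real routing and parasitics.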

He's going to do some power estimations now.

potatohead · 2017-10-03 22:25

Nice. Looks like good news on that front so far.

cgracey · 2017-10-03 23:01

I compiled our final 8-cog / 512KB-hub image for the Prop123_A9 board and this is the resulting slack histogram:

The paths in red are exceeding the time limit I've set of 10ns (100MHz). The Altera tool optimizes things pretty well, but you can see there are some paths, just a few, that inhibit the 10ns goal from being met. These are due not to logic delays, but interconnect delays. The ASIC tools in use for the actual Prop2 layout have much finer granularity and can smoosh these red paths to the right, getting them into the blue zone, and creating a vertical wall at the goal line. Pretty neat, I think.
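For anyone unfamiliar with slack histograms, a minimal sketch of how one is built. Slack is the time limit minus the path delay, so negative slack is a red (failing) path. The delay values below are made up purely for illustration and are not taken from the Altera report.

```python
import math
from collections import Counter

TARGET_NS = 10.0  # the 10ns (100MHz) limit mentioned above

def slack_histogram(path_delays_ns, target_ns=TARGET_NS, bin_ns=0.5):
    """Count paths per slack bin; keys are the lower edge of each bin in ns."""
    hist = Counter()
    for delay in path_delays_ns:
        slack = target_ns - delay
        hist[math.floor(slack / bin_ns) * bin_ns] += 1
    return dict(sorted(hist.items()))

def failing_paths(path_delays_ns, target_ns=TARGET_NS):
    """Paths whose delay exceeds the limit (negative slack)."""
    return [d for d in path_delays_ns if d > target_ns]

# Hypothetical path delays in ns, for illustration only.
delays = [7.2, 8.9, 9.4, 9.8, 10.3, 10.6]

for bin_edge, count in slack_histogram(delays).items():
    colour = "red" if bin_edge < 0 else "blue"
    print(f"{bin_edge:+5.1f} ns: {'#' * count} ({colour})")
print("failing paths:", failing_paths(delays))
```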

TonyB_ · 2017-10-03 23:16

Re CORDIC, is the following the new situation?

"Cogs can start CORDIC operations every 8 clocks (was 16) and get results 55 clocks later (was 39)"

So cog slot comes around twice as often but each operation lasts longer. Overall a bit slower than before?
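A rough way to answer that, reading the quoted line as a change from 16 to 8 clocks between starts and from 39 to 55 clocks of latency. The sketch assumes a cog issues independent operations back to back, one per slot, without waiting for earlier results; whether real code can do that depends on the algorithm.

```python
def clocks_until_last_result(n_ops, start_interval, latency):
    """Clocks from the first start until the n-th result arrives."""
    return (n_ops - 1) * start_interval + latency

OLD = dict(start_interval=16, latency=39)
NEW = dict(start_interval=8, latency=55)

for n in range(1, 6):
    old = clocks_until_last_result(n, **OLD)
    new = clocks_until_last_result(n, **NEW)
    print(f"{n} op(s): old {old:3d} clocks, new {new:3d} clocks")
```

Under that assumption a single operation is 16 clocks slower, three in a row break even, and anything longer comes out ahead with the new arrangement.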

Rayman · 2017-10-04 05:31

Chip said in the P2 Family thread that core# doesn't have to be power of 2:

cgracey wrote: »

These variants all use the 16-slot egg-beater, so there's really no need to constrain cogs to a power of two. We could have three cogs, if we wanted. Keeping the eggbeater as is keeps things simple and objects' timing consistent.

So, if we can't have 16, can we have 10 or 12 or 13?

jmg · 2017-10-04 06:01

Rayman wrote: »

Chip said in the P2 Family thread that core# doesn't have to be power of 2:

cgracey wrote: »

These variants all use the 16-slot egg-beater, so there's really no need to constrain cogs to a power of two. We could have three cogs, if we wanted. Keeping the eggbeater as is keeps things simple and objects' timing consistent.

So, if we can't have 16, can we have 10 or 12 or 13?

See above, where the question "add a 9th cog if there is spare space" got this reply:

cgracey wrote:

That would cause the framework to expand up to the 16-cog level, so it wouldn't be worth it. We are pretty much stuck with powers of 2.

The memory block count equals the slot count. I guess you might be able to have 16 memory cells, MUX'd on the least-significant nibble, with only (say) 9 COGs allocated and 7 slots left empty/wasted, but that doubles the HUB slot time for only one more COG?

More intriguing might be an egg-beater mapping with 8 COGs mapped to the 8 even slots (same as the 16-COG design) and the 9th given every odd slot, for a large boost in HUB memory bandwidth on that COG.

The smarter question here is probably how much extra memory the area of that 9th/10th COG equates to.

cgracey · 2017-10-04 14:40

With the eggbeater, we are bound to powers of two, since hub-RAM instances must be a power of two and there must be a cog for each hub-RAM instance.
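A toy model of why the power-of-two constraint falls out naturally. This is only a sketch of the general idea, assuming consecutive longs are interleaved across the hub-RAM instances and each cog steps to the next instance every clock; the function names are made up and none of this is taken from the Verilog.

```python
def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0

def slice_for_long(long_index, n_slices):
    """Hub-RAM instance holding a given long: just the low address bits."""
    if not is_power_of_two(n_slices):
        raise ValueError("low-bit selection needs a power-of-two slice count")
    return long_index & (n_slices - 1)

def slice_for_cog(cog, clock, n_slices):
    """Instance a cog is connected to on a given clock (assumed rotation)."""
    return (cog + clock) & (n_slices - 1)

N = 8
for clock in range(3):
    print(f"clock {clock}:", [slice_for_cog(c, clock, N) for c in range(N)])
```

In this model every cog sees a different instance on every clock, so the cog count and the instance count have to match, and picking an instance from the low address bits only works cleanly when that count is a power of two.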

cgracey · 2017-10-04 15:05

So, the power came back at 1.5W, typical case, with everything running.

That's 833mA at 1.8V.
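The conversion is just I = P / V. As a quick check:

```python
def current_ma(power_w, voltage_v):
    """Supply current in milliamps for a given power and core voltage."""
    return power_w / voltage_v * 1000.0

print(f"{current_ma(1.5, 1.8):.0f} mA")
```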

David Betz · 2017-10-04 15:07

cgracey wrote: »

So, the power came back at 1.5W, typical case, with everything running.

That's 833mA at 1.8V.

Is that good? So this chip is now referred to as "P2-cool"?

cgracey · 2017-10-04 15:09

David Betz wrote: »

cgracey wrote: »

So, the power came back at 1.5W, typical case, with everything running.

That's 833mA at 1.8V.

Is that good? So this chip is now referred to as "P2-cool"?

It's going to be really cool.

David Betz · 2017-10-04 15:11

cgracey wrote: »

David Betz wrote: »

cgracey wrote: »

So, the power came back at 1.5W, typical case, with everything running.

That's 833mA at 1.8V.

Is that good? So this chip is now referred to as "P2-cool"?

It's going to be really cool.

Well, we always knew that! :-)
