The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

cgracey · 2015-05-15 03:23

Last week I wrapped up the hub-based CORDIC solver. It's a 36-stage pipeline that every cog can give a command to every 16 clocks. It has built in K-factor compensation, so it will never indicate its CORDIC personality. It does the following computations:

QSIN - computes sine and cosine with 32-bit length and angle. This can be considered a polar-to-cartesian converter.
QROT - rotates 32-bit (X,Y) around (0,0) with a 32-bit angle.
QATN - computes 32-bit length and angle of 32-bit (X,Y). This can be considered a cartesian-to-polar converter.
QLOG - computes log of 32-bit number. Result is 5:27-bit whole:fractional exponent.
QEXP - computes 32-bit log back to normal number.
QMUL - computes 64-bit product of two 32-bit unsigned numbers.
QDIV - computes 32-bit quotient and remainder of 64-over-32-bit unsigned fraction.
QSQR - computes 32-bit root of 64-bit unsigned number.

Because there are 32 CORDIC stages, all the QSIN/QROT/QATN/QLOG/QEXP results are perfect to 32-bits, rounded. In the prior Prop2, there were resolution options for time/accuracy tradeoff. With this system, precision is always maximum and input-to-output timing is a constant 36-clocks. Because of intra-pipeline scaling, the X,Y terms never grow by ~1.6, so a full 32-bits is achieved. With the prior Prop2, only 31-bit X,Y terms were allowed.

Tonight I just finished the smart pin interface on the cog side. It's pretty simple, but I still need to make some basic pin-side smart pin circuits. I'll do that later. Next, I have the last cog section to work on: hub execution. Everything is in place to support it. I just have to make it happen. This is probably not going to be a lot of code, but it will be hard, and probably will take a few days to spool up in my head just to start getting anywhere with it.

After hub exec, I need to make the ROM code for the booter and get some smart-pins working. We've got the new FPGA board on the pick-and-place now and I should be able to start using it to simulate the whole chip soon. Ken and I were thinking that as soon as we have a stable FPGA image, we could make a github thing for the new Chip (sorry about my usage, not sure how it works, yet, though Heater tried to show me a while back).

Cluso, I should probably start understanding what we need for 12MBPS USB CRC'ing.

Thanks for your patience, Everyone.

Cluso99 · 2015-05-15 04:02

Chip, thanks for the update.

WOW! That sounds like some really complex hardware in the CORDIC solver.

I am here when you are ready for the USB info. I will need to dig up the thread where I explained it all.

cgracey wrote: »

Last week I wrapped up the hub-based CORDIC solver. It's a 36-stage pipeline that every cog can give a command to every 16 clocks. It has built in K-factor compensation, so it will never indicate its CORDIC personality. It does the following computations:

QSIN - computes sine and cosine with 32-bit length and angle. This can be considered a polar-to-cartesian converter.
QROT - rotates 32-bit (X,Y) around (0,0) with a 32-bit angle.
QATN - computes 32-bit length and angle of 32-bit (X,Y). This can be considered a cartesian-to-polar converter.
QLOG - computes log of 32-bit number. Result is 5:27-bit whole:fractional exponent.
QEXP - computes 32-bit log back to normal number.
QMUL - computes 64-bit product of two 32-bit unsigned numbers.
QDIV - computes 32-bit quotient and remainder of 64-over-32-bit unsigned fraction.
QSQR - computes 32-bit root of 64-bit unsigned number.

Because there are 32 CORDIC stages, all the QSIN/QROT/QATN/QLOG/QEXP results are perfect to 32-bits, rounded. In the prior Prop2, there were resolution options for time/accuracy tradeoff. With this system, precision is always maximum and input-to-output timing is a constant 36-clocks. Because of intra-pipeline scaling, the X,Y terms never grow by ~1.6, so a full 32-bits is achieved. With the prior Prop2, only 31-bit X,Y terms were allowed.

Tonight I just finished the smart pin interface on the cog side. It's pretty simple, but I still need to make some basic pin-side smart pin circuits. I'll do that later. Next, I have the last cog section to work on: hub execution. Everything is in place to support it. I just have to make it happen. This is probably not going to be a lot of code, but it will be hard, and probably will take a few days to spool up in my head just to start getting anywhere with it.

After hub exec, I need to make the ROM code for the booter and get some smart-pins working. We've got the new FPGA board on the pick-and-place now and I should be able to start using it to simulate the whole chip soon. Ken and I were thinking that as soon as we have a stable FPGA image, we could make a github thing for the new Chip (sorry about my usage, not sure how it works, yet, though Heater tried to show me a while back).

Cluso, I should probably start understanding what we need for 12MBPS USB CRC'ing.

Thanks for your patience, Everyone.

ozpropdev · 2015-05-15 04:29

Nice work Chip! Thanks for the update

evanh · 2015-05-15 05:03

Very nice indeed. It's as good as being in every Cog. How much bigger is this CORDIC vs what an equivalent single threaded version would be?

Sapieha · 2015-05-15 06:59

Hi Chip.

Nice Progress.

And Thanks for update on it

cgracey wrote: »

Last week I wrapped up the hub-based CORDIC solver. It's a 36-stage pipeline that every cog can give a command to every 16 clocks. It has built in K-factor compensation, so it will never indicate its CORDIC personality. It does the following computations:

QSIN - computes sine and cosine with 32-bit length and angle. This can be considered a polar-to-cartesian converter.
QROT - rotates 32-bit (X,Y) around (0,0) with a 32-bit angle.
QATN - computes 32-bit length and angle of 32-bit (X,Y). This can be considered a cartesian-to-polar converter.
QLOG - computes log of 32-bit number. Result is 5:27-bit whole:fractional exponent.
QEXP - computes 32-bit log back to normal number.
QMUL - computes 64-bit product of two 32-bit unsigned numbers.
QDIV - computes 32-bit quotient and remainder of 64-over-32-bit unsigned fraction.
QSQR - computes 32-bit root of 64-bit unsigned number.

Because there are 32 CORDIC stages, all the QSIN/QROT/QATN/QLOG/QEXP results are perfect to 32-bits, rounded. In the prior Prop2, there were resolution options for time/accuracy tradeoff. With this system, precision is always maximum and input-to-output timing is a constant 36-clocks. Because of intra-pipeline scaling, the X,Y terms never grow by ~1.6, so a full 32-bits is achieved. With the prior Prop2, only 31-bit X,Y terms were allowed.

Tonight I just finished the smart pin interface on the cog side. It's pretty simple, but I still need to make some basic pin-side smart pin circuits. I'll do that later. Next, I have the last cog section to work on: hub execution. Everything is in place to support it. I just have to make it happen. This is probably not going to be a lot of code, but it will be hard, and probably will take a few days to spool up in my head just to start getting anywhere with it.

After hub exec, I need to make the ROM code for the booter and get some smart-pins working. We've got the new FPGA board on the pick-and-place now and I should be able to start using it to simulate the whole chip soon. Ken and I were thinking that as soon as we have a stable FPGA image, we could make a github thing for the new Chip (sorry about my usage, not sure how it works, yet, though Heater tried to show me a while back).

Cluso, I should probably start understanding what we need for 12MBPS USB CRC'ing.

Thanks for your patience, Everyone.

Baggers · 2015-05-15 08:35

Awesome news Chip, looking forward to the update

Publison · 2015-05-15 09:09

Thanks for the update Chip!

What's more important right now? P2, walnuts, or tacos?

My guess, Ken says P2.

potatohead · 2015-05-15 09:19

We are getting there! **Continues to save pennies for Parallax FPGA board and P2 meetup**

cgracey · 2015-05-15 09:59

evanh wrote: »

Very nice indeed. It's as good as being in every Cog. How much bigger is this CORDIC vs what an equivalent single threaded version would be?

It is about 20,000 LE's. A cog, by contrast, is around 7,000 LE's. The prior CORDIC system from the earlier Prop2 effort, that went into each cog, was 2,500 LE's. it didn't do multiply, divide, and square root, though. Of those 20,000 LE's, 8,000 are needed for multiply, divide and square root.

cgracey · 2015-05-15 10:26

Publison wrote: »

Thanks for the update Chip!

What's more important right now? P2, walnuts, or tacos?

My guess, Ken says P2.

Tacos are on ice for now, the walnuts are growing, and Prop2 is coming together nicely.

I have been making some wok cookers from 32 jet burners you can get off eBay for $50, delivered from China. I settled on this design and I've made 5 units, so far:

http://youtu.be/aAB7pVfl8Fg

The castings are such poor quality that about one third of them shoot fire out of small voids.I have found that it is almost impossible to repair them.

Publison · 2015-05-15 10:35

cgracey wrote: »

Tacos are on ice for now, the walnuts are growing, and Prop2 is coming together nicely.

I have been making some wok cookers from 32 jet burners you can get off eBay for $50, delivered from China. I settled on this design and I've made 5 units, so far:

http://youtu.be/aAB7pVfl8Fg

The castings are such poor quality that about one third of them shoot fire out of small voids.I have found that it is almost impossible to repair them.

Nuff said... get back to work.

Bill Henning · 2015-05-15 10:37

Thanks for the updates Chip!

Note to self: dust off my DE2-115...

K2 · 2015-05-15 13:37

cgracey wrote: »

Last week I wrapped up the hub-based CORDIC solver. It's a 36-stage pipeline that every cog can give a command to every 16 clocks.

This is great! 20,000 LEs were never devoted to a more worthy cause, imho. The whole scope of P2 sounds delightful.

evanh · 2015-05-15 15:12

cgracey wrote: »

It is about 20,000 LE's. A cog, by contrast, is around 7,000 LE's. The prior CORDIC system from the earlier Prop2 effort, that went into each cog, was 2,500 LE's. it didn't do multiply, divide, and square root, though. Of those 20,000 LE's, 8,000 are needed for multiply, divide and square root.

Ah, that looks like the pipelining is just a tiny part of the whole then. Impressive savings.

Cluso99 · 2015-05-15 15:34

Chip,
I have noticed that in the P1V design, the hub is always read whether it is needed or not. I realise this is a function of the ram blocks in the FPGA.

I presume this won't be the case in the real P2 as it would be a real waste of power. I presume the same was actually done in the real P1.

cgracey · 2015-05-15 15:54

Cluso99 wrote: »

Chip,
I have noticed that in the P1V design, the hub is always read whether it is needed or not. I realise this is a function of the ram blocks in the FPGA.

I presume this won't be the case in the real P2 as it would be a real waste of power. I presume the same was actually done in the real P1.

The actual chips don't read all the time. I remember that much. Certainly, in the new Prop2 FPGA configuration, these enables are only fired on accesses.

evanh · 2015-05-15 15:57

evanh wrote: »

Ah, that looks like the pipelining is just a tiny part of the whole then. Impressive savings.

Hmm, I'm having second thoughts now. In this case, the pipelining is really an unrolled loop so it must be consuming a lot more silicon.

cgracey · 2015-05-15 16:02

evanh wrote: »

Hmm, I'm having second thoughts now. In this case, the pipelining is really an unrolled loop so it must be consuming a lot more silicon.

There are 128 flops per stage (40 per X,Y,Z, 3 for mode and 5 for magnitude), so times 36 stages is 4,608 flops. That takes about 1/2 square mm of silicon, alone.

Cluso99 · 2015-05-15 16:43

Just for reference, I compiled P1V for DE0-Nano...

     Cog+    Cog     Video
     Video   NoVideo Only
LE   1750    1534    216
Comb 1617    1457    160
Regs  588     436    152

evanh · 2015-05-15 16:49

cgracey wrote: »

There are 128 flops per stage (40 per X,Y,Z, 3 for mode and 5 for magnitude), so times 36 stages is 4,608 flops. That takes about 1/2 square mm of silicon, alone.

Hehe, I guess I should be more direct. I'm wondering what the advantage of having one multithreaded CORDIC in the Hub is, verses having 16 ordinary CORDICs in the 16 Cogs. I figured the main advantage was silicon space saved by not duplicating the computation components, but this doesn't seem to be quite the case.

Take the multiply/divide 8000 LEs for example, roughly how big would that be if the CORDIC wasn't pipelined?

Cluso99 · 2015-05-15 17:39

If I understand the pipelined CORDIC correctly, all 16 cogs can use the CORDIC simultaneously due to the pipelining. That would be staged because each cog can only start the CORDIC at it's hub slot.
And since there are 36 stages, the result will be available 3 hub slots later (3*16=48 clocks later), but you can issue two new calculations in the meantime. So in theory each cog could have 3 calculations in progress concurrently - well 36 is the max so not sure how this works out actually.

Is this correct Chip?

If so, this could be a "hot" chip.

Cluso99 · 2015-05-15 18:38

USB Info
Here are some links to the later discussions regarding aid to doing USB LS & FS in software
For USB pin read and CRC
http://forums.parallax.com/showthread.php/151821-P2-Possible-additional-Instructions?highlight=usb+fs+instruction
For CRC generation:
http://forums.parallax.com/showthread.php/151992-CRC-generation?highlight=usb+fs+instruction

I am going to give this a try on the P1V (should have done that ages ago).

Martin Hodge · 2015-05-15 19:28

What are the benefits of such a complex maths system, and what real-world tasks will this chip be able to perform using it?

evanh · 2015-05-15 20:21

Guidance systems comes to mind. Take a robot that you want to send it's gripper into a particular 3D point and at the same time have the arm angled to keep it from knocking into a piece of frame or something. There is a number of fancy calculations needed.

Of course such calculations can already be done on the Prop1 but each floating point operator is quite long winded inside the float library. When there is a large table of calculations to perform then execution speed becomes a significant issue.

Also, the more RAM there is the more people will want to do fancy things with it. Data capture and filtering comes to mind here.

Cluso99 · 2015-05-15 21:34

evanh wrote: »

Guidance systems comes to mind. Take a robot that you want to send it's gripper into a particular 3D point and at the same time have the arm angled to keep it from knocking into a piece of frame or something. There is a number of fancy calculations needed.

Of course such calculations can already be done on the Prop1 but each floating point operator is quite long winded inside the float library. When there is a large table of calculations to perform then execution speed becomes a significant issue.

Also, the more RAM there is the more people will want to do fancy things with it. Data capture and filtering comes to mind here.

Maybe NASA might be interested in this. They will not be able to use any of the latest high performance <20nm chips because they cannot guarantee functionality beyond 10 years due to metal migration (at least that's what I've read). Many of the spacecraft out there are older than 10 years. Even the Mars Rover Opportunity has been roving Mars for over 10 years, and well, good old Voyager 1 & 2 are 37+ years.

evanh · 2015-05-16 02:04

Cluso99 wrote: »

Maybe NASA might be interested in this. They will not be able to use any of the latest high performance <20nm chips because they cannot guarantee functionality beyond 10 years due to metal migration (at least that's what I've read). Many of the spacecraft out there are older than 10 years. Even the Mars Rover Opportunity has been roving Mars for over 10 years, and well, good old Voyager 1 & 2 are 37+ years.

Curiosity was launched in 2011 so will be using newer components. Wikipedia states CPU is 400MIPS RAD750 (PowerPC) with memory of 256 kB of EEPROM, 256 MB of DRAM, and 2 GB of flash.

Looking that up nets 150nm process. I'm guessing that's prolly a good reference for what can be made really tough.

Cluso99 · 2015-05-16 02:34

evanh wrote: »

Curiosity was launched in 2011 so will be using newer components. Wikipedia states CPU is 400MIPS RAD750 (PowerPC) with memory of 256 kB of EEPROM, 256 MB of DRAM, and 2 GB of flash.

Looking that up nets 150nm process. I'm guessing that's prolly a good reference for what can be made really tough.

That was probably state of art when it was designed, probably 5 years before launch, so 2006.

Opportunity is having bouts of amnesia caused by, they think, by the failure of flash memory. It has 7 banks. They reconfigured to isolate the faulty bank but they have had at least one episode since. Maybe the Flash cannot retain its data that long either??? But remember, the hardware was built probably at least 5 years before launch and extensively tested. So its likely 17+ years old.

evanh · 2015-05-16 02:56

In the case of that Flash, due to ease of swapping, it could have been latest hardened product mere months prior to launch date.

However, irrespective of it's age, Flash is never going to be particularly tough. That's exactly where MRAM can supersede. NASA aren't aiming for highest capacities and price will be neither here nor there. Except, of course, MRAM still hasn't been developed enough for even a mere 2GB.

evanh · 2015-05-16 03:40

MRAM progress is happening though, even if it seems to be at a snails pace. Have a read of this - http://spectrum.ieee.org/semiconductors/memory/spin-memory-shows-its-might

A couple of tasty morsels from the above:
- "One was the write speed, which stands at 1.5 nanoseconds. Thats fast enough, Jan says, to compete with the SRAM that takes up most of the memory space on a modern microprocessor: the level-three cache."
- "Samsung reported success in making cells that could potentially be fabricated using a 15-nanometer manufacturing process"
- "TDK-Headways chips, it found no errors in data retention after 528 hours at 150 °C"

Dave Hein · 2015-05-16 05:02

Martin Hodge wrote: »

What are the benefits of such a complex maths system, and what real-world tasks will this chip be able to perform using it?

- Digital Signal Processing
- Image Processing
- Computer Graphics
- Speech Synthesis
- Digital Filtering
- Spectral Analysis
- Pattern Recognition
- Audio Coding and Decoding
- Image Coding and Decoding

The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

Comments