The New 16-Cog, 512KB, 64 analog I/O Propeller Chip - Page 90 — Parallax Forums

The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

Comments

  • cgraceycgracey Posts: 14,155
    edited 2015-05-15 03:23
    Last week I wrapped up the hub-based CORDIC solver. It's a 36-stage pipeline that every cog can give a command to every 16 clocks. It has built-in K-factor compensation, so it will never indicate its CORDIC personality. It does the following computations:

    QSIN - computes sine and cosine with 32-bit length and angle. This can be considered a polar-to-cartesian converter.
    QROT - rotates 32-bit (X,Y) around (0,0) with a 32-bit angle.
    QATN - computes 32-bit length and angle of 32-bit (X,Y). This can be considered a cartesian-to-polar converter.
    QLOG - computes log of 32-bit number. Result is 5:27-bit whole:fractional exponent.
    QEXP - computes 32-bit log back to normal number.
    QMUL - computes 64-bit product of two 32-bit unsigned numbers.
    QDIV - computes 32-bit quotient and remainder of 64-over-32-bit unsigned fraction.
    QSQR - computes 32-bit root of 64-bit unsigned number.
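
As a rough numerical illustration of the result formats listed above, here is a minimal Python sketch. It is a floating-point model, not the CORDIC hardware. The function names are hypothetical, and base-2 logarithms are an assumption (the 5-bit whole part suggests it, but the post does not say). Clamping at the 32-bit limit is also an assumption of this sketch.

```python
import math

FRAC_BITS = 27            # 5:27 format: 5 whole bits, 27 fractional bits
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def qlog_model(x: int) -> int:
    """Log of a 32-bit unsigned number as a 5:27 fixed-point value (base 2 assumed)."""
    if not 1 <= x <= MASK32:
        raise ValueError("x must be a non-zero 32-bit unsigned number")
    return min(round(math.log2(x) * (1 << FRAC_BITS)), MASK32)

def qexp_model(log_value: int) -> int:
    """Convert a 5:27 fixed-point log back to a normal 32-bit number."""
    return min(round(2.0 ** (log_value / (1 << FRAC_BITS))), MASK32)

def qmul_model(a: int, b: int) -> int:
    """64-bit product of two 32-bit unsigned numbers."""
    return (a * b) & MASK64

def qdiv_model(numerator64: int, divisor32: int) -> tuple:
    """Quotient and remainder of a 64-bit over 32-bit unsigned division."""
    return divmod(numerator64, divisor32)

def qsqr_model(n64: int) -> int:
    """Integer square root of a 64-bit unsigned number (fits in 32 bits)."""
    return math.isqrt(n64)

if __name__ == "__main__":
    print(hex(qlog_model(0x80000000)))   # log2(2**31) = 31.0 in 5:27 format
    print(qexp_model(qlog_model(1000)))  # round trip through the log domain
```

The 5:27 split gives a whole part of 0..31, which covers every non-zero 32-bit input if the log is base 2.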

    Because there are 32 CORDIC stages, all the QSIN/QROT/QATN/QLOG/QEXP results are perfect to 32 bits, rounded. In the prior Prop2, there were resolution options for time/accuracy tradeoff. With this system, precision is always maximum and input-to-output timing is a constant 36 clocks. Because of intra-pipeline scaling, the X,Y terms never grow by the ~1.6x CORDIC gain, so a full 32 bits is achieved. With the prior Prop2, only 31-bit X,Y terms were allowed.
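
For anyone unfamiliar with where the ~1.6 growth comes from, the following is a minimal floating-point sketch of rotation-mode CORDIC in Python. It is not the hardware algorithm: the real solver works on integers and applies its scaling inside the pipeline, whereas this sketch simply divides by the gain at the end. The names `cordic_rotate`, `GAIN`, and `ANGLES` are made up for illustration.

```python
import math

ITERATIONS = 32  # one micro-rotation per CORDIC stage

# Elementary rotation angles atan(2^-i) used by each stage.
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]

# Each micro-rotation stretches the vector by sqrt(1 + 2^-2i).
# The product over all stages is the CORDIC gain (K factor), about 1.6468.
GAIN = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(ITERATIONS))

def cordic_rotate(x, y, angle, compensate=True):
    """Rotate (x, y) by `angle` radians using shift-and-add style steps.

    Valid for |angle| up to roughly 1.74 rad (the sum of all stage angles).
    With compensate=False the result is left scaled by GAIN.
    """
    z = angle
    for i in range(ITERATIONS):
        d = 1.0 if z >= 0.0 else -1.0
        scale = 2.0 ** -i  # a right shift by i in integer hardware
        x, y = x - d * y * scale, y + d * x * scale
        z -= d * ANGLES[i]
    if compensate:
        x, y = x / GAIN, y / GAIN
    return x, y

if __name__ == "__main__":
    print(GAIN)
    print(cordic_rotate(1.0, 0.0, math.pi / 6))          # close to (cos 30, sin 30)
    print(cordic_rotate(1.0, 0.0, math.pi / 6, False))   # same vector, grown by GAIN
```

Without compensation a full-scale input would overflow its word, which is presumably why the earlier design was limited to 31-bit X,Y terms.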

    Tonight I just finished the smart pin interface on the cog side. It's pretty simple, but I still need to make some basic pin-side smart pin circuits. I'll do that later. Next, I have the last cog section to work on: hub execution. Everything is in place to support it. I just have to make it happen. This is probably not going to be a lot of code, but it will be hard, and probably will take a few days to spool up in my head just to start getting anywhere with it.

    After hub exec, I need to make the ROM code for the booter and get some smart-pins working. We've got the new FPGA board on the pick-and-place now and I should be able to start using it to simulate the whole chip soon. Ken and I were thinking that as soon as we have a stable FPGA image, we could make a GitHub thing for the new Chip (sorry about my usage, not sure how it works, yet, though Heater tried to show me a while back).

    Cluso, I should probably start understanding what we need for 12 Mbps USB CRC'ing.

    Thanks for your patience, Everyone.
  • Cluso99Cluso99 Posts: 18,069
    edited 2015-05-15 04:02
    Chip, thanks for the update.

    WOW! That sounds like some really complex hardware in the CORDIC solver.

    I am here when you are ready for the USB info. I will need to dig up the thread where I explained it all.
  • ozpropdevozpropdev Posts: 2,792
    edited 2015-05-15 04:29
    Nice work Chip! Thanks for the update :)
  • evanhevanh Posts: 15,917
    edited 2015-05-15 05:03
    Very nice indeed. It's as good as being in every Cog. How much bigger is this CORDIC vs what an equivalent single-threaded version would be?
  • SapiehaSapieha Posts: 2,964
    edited 2015-05-15 06:59
    Hi Chip.

    Nice Progress.

    And thanks for the update on it.



  • BaggersBaggers Posts: 3,019
    edited 2015-05-15 08:35
    Awesome news Chip, looking forward to the update :D
  • PublisonPublison Posts: 12,366
    edited 2015-05-15 09:09
    Thanks for the update Chip!

    What's more important right now? P2, walnuts, or tacos? :)

    My guess, Ken says P2.
  • potatoheadpotatohead Posts: 10,261
    edited 2015-05-15 09:19
    We are getting there! **Continues to save pennies for Parallax FPGA board and P2 meetup**
  • cgraceycgracey Posts: 14,155
    edited 2015-05-15 09:59
    evanh wrote: »
    Very nice indeed. It's as good as being in every Cog. How much bigger is this CORDIC vs what an equivalent single-threaded version would be?

    It is about 20,000 LE's. A cog, by contrast, is around 7,000 LE's. The prior CORDIC system from the earlier Prop2 effort, which went into each cog, was 2,500 LE's. It didn't do multiply, divide, and square root, though. Of those 20,000 LE's, 8,000 are needed for multiply, divide and square root.
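
To put those LE figures side by side, here is a small back-of-envelope calculation in Python. The numbers come from the post above. The "16 per-cog solvers" total is only a hypothetical what-if for comparison, since the earlier per-cog solver had less precision and fewer functions.

```python
# Figures quoted in the post (logic elements).
HUB_CORDIC_LES = 20_000          # shared, pipelined solver
MUL_DIV_SQRT_LES = 8_000         # part of the 20,000 used for QMUL/QDIV/QSQR
COG_LES = 7_000                  # one cog
OLD_PER_COG_CORDIC_LES = 2_500   # earlier Prop2 solver, one per cog
COGS = 16

# LEs left for the trig/log/exp functions in the shared solver.
trig_log_les = HUB_CORDIC_LES - MUL_DIV_SQRT_LES

# Hypothetical: 16 copies of the older per-cog solver (no mul/div/sqrt).
old_style_total = OLD_PER_COG_CORDIC_LES * COGS

# Size of the shared solver measured in cogs, and its share of cogs + solver.
hub_vs_cog = HUB_CORDIC_LES / COG_LES
hub_share = HUB_CORDIC_LES / (HUB_CORDIC_LES + COGS * COG_LES)

if __name__ == "__main__":
    print(f"trig/log part of shared solver: {trig_log_les} LEs")
    print(f"16 old-style per-cog solvers:   {old_style_total} LEs")
    print(f"shared solver is about {hub_vs_cog:.2f} cogs in size")
    print(f"shared solver share of cogs+solver: {hub_share:.1%}")
```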
  • cgraceycgracey Posts: 14,155
    edited 2015-05-15 10:26
    Publison wrote: »
    Thanks for the update Chip!

    What's more important right now? P2, walnuts, or tacos? :)

    My guess, Ken says P2.

    Tacos are on ice for now, the walnuts are growing, and Prop2 is coming together nicely.

    I have been making some wok cookers from 32 jet burners you can get off eBay for $50, delivered from China. I settled on this design and I've made 5 units, so far:

    http://youtu.be/aAB7pVfl8Fg

    The castings are such poor quality that about one third of them shoot fire out of small voids. I have found that it is almost impossible to repair them.
  • PublisonPublison Posts: 12,366
    edited 2015-05-15 10:35
    cgracey wrote: »
    Tacos are on ice for now, the walnuts are growing, and Prop2 is coming together nicely.

    I have been making some wok cookers from 32 jet burners you can get off eBay for $50, delivered from China. I settled on this design and I've made 5 units, so far:

    http://youtu.be/aAB7pVfl8Fg

    The castings are such poor quality that about one third of them shoot fire out of small voids. I have found that it is almost impossible to repair them.

    Nuff said... get back to work. :)
  • Bill HenningBill Henning Posts: 6,445
    edited 2015-05-15 10:37
    Thanks for the updates Chip!

    Note to self: dust off my DE2-115...
  • K2K2 Posts: 693
    edited 2015-05-15 13:37
    cgracey wrote: »
    Last week I wrapped up the hub-based CORDIC solver. It's a 36-stage pipeline that every cog can give a command to every 16 clocks.

    This is great! 20,000 LEs were never devoted to a more worthy cause, imho. The whole scope of P2 sounds delightful.
  • evanhevanh Posts: 15,917
    edited 2015-05-15 15:12
    cgracey wrote: »
    It is about 20,000 LE's. A cog, by contrast, is around 7,000 LE's. The prior CORDIC system from the earlier Prop2 effort, which went into each cog, was 2,500 LE's. It didn't do multiply, divide, and square root, though. Of those 20,000 LE's, 8,000 are needed for multiply, divide and square root.

    Ah, that looks like the pipelining is just a tiny part of the whole then. Impressive savings.
  • Cluso99Cluso99 Posts: 18,069
    edited 2015-05-15 15:34
    Chip,
    I have noticed that in the P1V design, the hub is always read whether it is needed or not. I realise this is a function of the RAM blocks in the FPGA.

    I presume this won't be the case in the real P2 as it would be a real waste of power. I presume the same was actually done in the real P1.
  • cgraceycgracey Posts: 14,155
    edited 2015-05-15 15:54
    Cluso99 wrote: »
    Chip,
    I have noticed that in the P1V design, the hub is always read whether it is needed or not. I realise this is a function of the RAM blocks in the FPGA.

    I presume this won't be the case in the real P2 as it would be a real waste of power. I presume the same was actually done in the real P1.

    The actual chips don't read all the time. I remember that much. Certainly, in the new Prop2 FPGA configuration, these enables are only fired on accesses.
  • evanhevanh Posts: 15,917
    edited 2015-05-15 15:57
    evanh wrote: »
    Ah, that looks like the pipelining is just a tiny part of the whole then. Impressive savings.

    Hmm, I'm having second thoughts now. In this case, the pipelining is really an unrolled loop so it must be consuming a lot more silicon.
  • cgraceycgracey Posts: 14,155
    edited 2015-05-15 16:02
    evanh wrote: »
    Hmm, I'm having second thoughts now. In this case, the pipelining is really an unrolled loop so it must be consuming a lot more silicon.

    There are 128 flops per stage (40 per X,Y,Z, 3 for mode and 5 for magnitude), so times 36 stages is 4,608 flops. That takes about 1/2 square mm of silicon, alone.
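
The flop count above can be checked with a few lines of Python. The field widths and the 1/2 square mm figure are taken from the post. The per-flop area is just that figure divided out, so treat it as a rough average and not a process spec.

```python
# Register bits carried from one pipeline stage to the next.
STAGE_FIELDS = {"x": 40, "y": 40, "z": 40, "mode": 3, "magnitude": 5}
STAGES = 36
PIPELINE_AREA_MM2 = 0.5  # "about 1/2 square mm", from the post

flops_per_stage = sum(STAGE_FIELDS.values())
total_flops = flops_per_stage * STAGES

# Average silicon area per flop implied by those numbers, in square microns.
area_per_flop_um2 = PIPELINE_AREA_MM2 * 1_000_000 / total_flops

if __name__ == "__main__":
    print(flops_per_stage, total_flops, round(area_per_flop_um2, 1))
```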
  • Cluso99Cluso99 Posts: 18,069
    edited 2015-05-15 16:43
    Just for reference, I compiled P1V for DE0-Nano...
                Cog+Video  Cog (no video)  Video only
    LE               1750            1534         216
    Comb             1617            1457         160
    Regs              588             436         152
    
  • evanhevanh Posts: 15,917
    edited 2015-05-15 16:49
    cgracey wrote: »
    There are 128 flops per stage (40 per X,Y,Z, 3 for mode and 5 for magnitude), so times 36 stages is 4,608 flops. That takes about 1/2 square mm of silicon, alone.

    Hehe, I guess I should be more direct. I'm wondering what the advantage of having one multithreaded CORDIC in the Hub is, versus having 16 ordinary CORDICs in the 16 Cogs. I figured the main advantage was silicon space saved by not duplicating the computation components, but this doesn't seem to be quite the case.

    Take the multiply/divide 8000 LEs for example, roughly how big would that be if the CORDIC wasn't pipelined?
  • Cluso99Cluso99 Posts: 18,069
    edited 2015-05-15 17:39
    If I understand the pipelined CORDIC correctly, all 16 cogs can use the CORDIC simultaneously due to the pipelining. That would be staged because each cog can only start the CORDIC at its hub slot.
    And since there are 36 stages, the result will be available 3 hub slots later (3*16=48 clocks later), but you can issue two new calculations in the meantime. So in theory each cog could have 3 calculations in progress concurrently. Well, 36 is the max, so I'm not sure how this actually works out.

    Is this correct Chip?

    If so, this could be a "hot" chip.
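
The slot arithmetic in this post can be sketched in Python as follows. This is a simplified model and not a description of the actual hardware: it assumes cog N may issue only on clocks where `clock % 16 == N`, that every cog issues on every one of its slots, and that a command occupies the pipeline for exactly 36 clocks. Whether a result can be collected between hub slots is not stated in the thread.

```python
import math

HUB_PERIOD = 16   # clocks between issue slots for one cog
LATENCY = 36      # clocks from command in to result out
COGS = 16

def slots_until_result(latency=LATENCY, period=HUB_PERIOD):
    """Hub slots that pass before the result of a command has emerged."""
    return math.ceil(latency / period)

def commands_in_flight(clock, cog, latency=LATENCY, period=HUB_PERIOD):
    """Commands from one cog that are inside the pipeline at `clock`,
    assuming the cog issued on every one of its slots since clock 0."""
    return sum(1 for issue in range(cog, clock + 1, period)
               if clock - issue < latency)

if __name__ == "__main__":
    print(slots_until_result(), slots_until_result() * HUB_PERIOD)
    per_cog = [commands_in_flight(100, cog) for cog in range(COGS)]
    print(per_cog, sum(per_cog))
```

Under these assumptions each cog has 2 or 3 commands in flight at any instant, and the total across all 16 cogs is 36, one per pipeline stage.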
  • Cluso99Cluso99 Posts: 18,069
    edited 2015-05-15 18:38
    USB Info
    Here are some links to the later discussions regarding aid to doing USB LS & FS in software
    For USB pin read and CRC
    http://forums.parallax.com/showthread.php/151821-P2-Possible-additional-Instructions?highlight=usb+fs+instruction
    For CRC generation:
    http://forums.parallax.com/showthread.php/151992-CRC-generation?highlight=usb+fs+instruction

    I am going to give this a try on the P1V (should have done that ages ago).
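
For reference while reading those threads, here is a minimal bitwise sketch of the CRC16 that USB applies to data packets, written in Python. It uses the standard USB parameters (polynomial x^16 + x^15 + x^2 + 1, all-ones seed, bits processed LSB first, result inverted). The function name `crc16_usb` is made up, and this is a plain software reference, not a proposal for how a P2 instruction or smart pin would do it. Token packets use a separate 5-bit CRC that is not shown here.

```python
def crc16_usb(data: bytes) -> int:
    """CRC16 as used on USB data packets.

    Polynomial 0x8005 in reflected form (0xA001), seed 0xFFFF,
    LSB-first bit order, final result inverted.
    """
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift one bit out; fold in the polynomial if that bit was set.
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc ^ 0xFFFF

if __name__ == "__main__":
    # Standard check string used by CRC catalogues.
    print(hex(crc16_usb(b"123456789")))
```

At full speed a new bit arrives every 1/12 microsecond, so the inner loop body is the part that would need hardware help.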
  • Martin HodgeMartin Hodge Posts: 1,246
    edited 2015-05-15 19:28
    What are the benefits of such a complex maths system, and what real-world tasks will this chip be able to perform using it?
  • evanhevanh Posts: 15,917
    edited 2015-05-15 20:21
    Guidance systems come to mind. Take a robot where you want to send its gripper to a particular 3D point and at the same time keep the arm angled so it doesn't knock into a piece of frame or something. There are a number of fancy calculations needed.

    Of course such calculations can already be done on the Prop1 but each floating point operator is quite long-winded inside the float library. When there is a large table of calculations to perform then execution speed becomes a significant issue.

    Also, the more RAM there is the more people will want to do fancy things with it. Data capture and filtering come to mind here.
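
As a concrete (and simplified) example of the kind of calculation meant here, the following Python sketch solves the joint angles of a planar two-link arm for a target point. It is a 2D stand-in for the 3D case, the function names are hypothetical, and it uses the floating-point math library where a P2 program would presumably use cartesian-to-polar and rotate operations such as QATN and QROT.

```python
import math

def two_link_ik(x, y, l1, l2):
    """Joint angles (shoulder, elbow) in radians that place the tip of a
    planar two-link arm at (x, y). Returns one of the two possible poses."""
    r = math.hypot(x, y)      # cartesian-to-polar: length
    phi = math.atan2(y, x)    # cartesian-to-polar: angle
    if r > l1 + l2 or r < abs(l1 - l2):
        raise ValueError("target out of reach")
    # Law of cosines gives the elbow bend.
    cos_elbow = (r * r - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    elbow = math.acos(max(-1.0, min(1.0, cos_elbow)))
    shoulder = phi - math.atan2(l2 * math.sin(elbow),
                                l1 + l2 * math.cos(elbow))
    return shoulder, elbow

def forward(shoulder, elbow, l1, l2):
    """Tip position for given joint angles (two rotations, polar-to-cartesian)."""
    x = l1 * math.cos(shoulder) + l2 * math.cos(shoulder + elbow)
    y = l1 * math.sin(shoulder) + l2 * math.sin(shoulder + elbow)
    return x, y

if __name__ == "__main__":
    s, e = two_link_ik(1.2, 0.7, 1.0, 1.0)
    print(s, e, forward(s, e, 1.0, 1.0))
```

Every target point costs a handful of trig and square-root operations, which is why a table of such points adds up quickly in software floating point.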
  • Cluso99Cluso99 Posts: 18,069
    edited 2015-05-15 21:34
    evanh wrote: »
    Guidance systems come to mind. Take a robot where you want to send its gripper to a particular 3D point and at the same time keep the arm angled so it doesn't knock into a piece of frame or something. There are a number of fancy calculations needed.

    Of course such calculations can already be done on the Prop1 but each floating point operator is quite long-winded inside the float library. When there is a large table of calculations to perform then execution speed becomes a significant issue.

    Also, the more RAM there is the more people will want to do fancy things with it. Data capture and filtering come to mind here.

    Maybe NASA might be interested in this. They will not be able to use any of the latest high performance <20nm chips because they cannot guarantee functionality beyond 10 years due to metal migration (at least that's what I've read). Many of the spacecraft out there are older than 10 years. Even the Mars Rover Opportunity has been roving Mars for over 10 years, and well, good old Voyager 1 & 2 are 37+ years old.
  • evanhevanh Posts: 15,917
    edited 2015-05-16 02:04
    Cluso99 wrote: »
    Maybe NASA might be interested in this. They will not be able to use any of the latest high performance <20nm chips because they cannot guarantee functionality beyond 10 years due to metal migration (at least that's what I've read). Many of the spacecraft out there are older than 10 years. Even the Mars Rover Opportunity has been roving Mars for over 10 years, and well, good old Voyager 1 & 2 are 37+ years old.

    Curiosity was launched in 2011 so will be using newer components. Wikipedia states CPU is 400MIPS RAD750 (PowerPC) with memory of 256 kB of EEPROM, 256 MB of DRAM, and 2 GB of flash.

    Looking that up nets 150nm process. I'm guessing that's prolly a good reference for what can be made really tough.
  • Cluso99Cluso99 Posts: 18,069
    edited 2015-05-16 02:34
    evanh wrote: »
    Curiosity was launched in 2011 so will be using newer components. Wikipedia states CPU is 400MIPS RAD750 (PowerPC) with memory of 256 kB of EEPROM, 256 MB of DRAM, and 2 GB of flash.

    Looking that up nets 150nm process. I'm guessing that's prolly a good reference for what can be made really tough.

    That was probably state of the art when it was designed, probably 5 years before launch, so 2006.

    Opportunity is having bouts of amnesia caused, they think, by the failure of flash memory. It has 7 banks. They reconfigured to isolate the faulty bank but they have had at least one episode since. Maybe the Flash cannot retain its data that long either??? But remember, the hardware was built probably at least 5 years before launch and extensively tested. So it's likely 17+ years old.
  • evanhevanh Posts: 15,917
    edited 2015-05-16 02:56
    In the case of that Flash, due to ease of swapping, it could have been the latest hardened product mere months prior to launch date.

    However, irrespective of its age, Flash is never going to be particularly tough. That's exactly where MRAM can supersede it. NASA aren't aiming for highest capacities and price will be neither here nor there. Except, of course, MRAM still hasn't been developed enough for even a mere 2GB.
  • evanhevanh Posts: 15,917
    edited 2015-05-16 03:40
    MRAM progress is happening though, even if it seems to be at a snail's pace. Have a read of this - http://spectrum.ieee.org/semiconductors/memory/spin-memory-shows-its-might

    A couple of tasty morsels from the above:
    - "One was the write speed, which stands at 1.5 nanoseconds. That’s fast enough, Jan says, to compete with the SRAM that takes up most of the memory space on a modern microprocessor: the level-three cache."
    - "Samsung reported success in making cells that could potentially be fabricated using a 15-nanometer manufacturing process"
    - "TDK-Headway’s chips, it found no errors in data retention after 528 hours at 150 °C"
  • Dave HeinDave Hein Posts: 6,347
    edited 2015-05-16 05:02
    Martin Hodge wrote: »
    What are the benefits of such a complex maths system, and what real-world tasks will this chip be able to perform using it?

    - Digital Signal Processing
    - Image Processing
    - Computer Graphics
    - Speech Synthesis
    - Digital Filtering
    - Spectral Analysis
    - Pattern Recognition
    - Audio Coding and Decoding
    - Image Coding and Decoding