The New 16-Cog, 512KB, 64 analog I/O Propeller Chip - Part 2

Cluso99 · 2015-09-11 04:34

My thoughts were mainly that it could be simpler to implement and explain, and a possible benefit in a reduction of caveats. Currently an extra clock is being inserted in RDLUT which would not be required.

As I said originally, all I was after was to see if it were possible to execute from LUT space. I was happy with any caveats required. Chip went a step further than this since it was easy.

jmg wrote: »

Cluso99 wrote: »

The S & D addresses could be modified (extended) by AUGD/AUGS/AUGDS which could set the A9 bit to the upper 512 longs, permitting the upper cog ram (ie LUT) to be used with standard instructions (ie a pair of instructions).

Sounds like such a Paging Scheme could cause a lot of fun around Interrupts ?

Also, does the P2 really need all memory available as discrete registers (the highest cost RAM) ?

Lots of SW uses more memory as Arrays, than as discrete VARs,
and code can fetch from this memory too, which frees up the more valuable register-capable RAM.

Yanomani · 2015-09-11 05:41

cgracey wrote: »

As it works out, there are no caveats regarding concurrent LUT execution and LUT r/w. The LUT instructions are fetched on "go" cycles and LUT r/w operations occur on non-"go" cycles. RDLUT takes three clocks, not the usual two.

Hi Chip

I believe that any address/data conflicts could be addressed, including streamer-related ones, by splitting the CLUT into two 256-long halves, isolated by independent sets of muxes.

Then, execution and streaming could run in parallel, although in mux-segregated address spaces and subjected to specific but not so restrictive rules.

Henrique

cgracey · 2015-09-11 06:39

Check this out!

Treehouse did some preliminary synthesis work today so we can get a reality check on size and timing.

This is a cell-count/silicon-area report on a single Prop2 cog.

============================================================
  Generated by:           Encounter(R) RTL Compiler RC14.11 - v14.10-s012_1
  Generated on:           Sep 10 2015  04:29:10 pm
  Module:                 cog
  Technology libraries:   onnn180cxrscl Rev2.101
                          onc18_rm_16384x008m16b2r2_00000000_PwcsV162T125 
                          onc18_dp_00512x032m02b1r2w2_PwcsV162T125 
                          onc18_sp_00256x032m04b1r2w2_PwcsV162T125 
                          onc18_sp_08192x032m16b2r2w2_PwcsV162T125 
  Operating conditions:   wc_1.62v_125c (worst_case_tree)
  Wireload mode:          enclosed
  Area mode:              timing library
============================================================

       Instance         Cells  Cell Area  Net Area  Total Area  Wireload     
-----------------------------------------------------------------------------
cog                     13159     424322         0      424322    <none> (D) 
  cog_ram_                  1     291952         0      291952    <none> (D) 
  cog_lut_                  1      94519         0       94519    <none> (D) 
  cog_mem_               2933      10237         0       10237    <none> (D) 
    cog_mem_fifo_        1630       6436         0        6436    <none> (D) 
    srl_318_51             78        128         0         128    <none> (D) 
    srl_190_29             78        128         0         128    <none> (D) 
    sll_336_75            102        128         0         128    <none> (D) 
    sll_196_24             78        126         0         126    <none> (D) 
    inc_add_269_30_10      17         42         0          42    <none> (D) 
  cog_alu_               3722       8527         0        8527    <none> (D) 
    cog_alu_mul_          883       1835         0        1835    <none> (D) 
      mul_38_27           828       1571         0        1571    <none> (D) 
    cog_alu_inc_          677       1548         0        1548    <none> (D) 
      dec_sub_129_22_9     61         96         0          96    <none> (D) 
      inc_add_63_24_8      45         89         0          89    <none> (D) 
      inc_add_119_22_7     42         84         0          84    <none> (D) 
    cog_alu_mux_          601       1348         0        1348    <none> (D) 
    cog_alu_add_          474       1231         0        1231    <none> (D) 
      add_159_23          188        319         0         319    <none> (D) 
    cog_alu_log_          461       1090         0        1090    <none> (D) 
    cog_alu_rot_          456        954         0         954    <none> (D) 
      srl_61_22           212        340         0         340    <none> (D) 
  cog_xfr_               1686       5084         0        5084    <none> (D) 
    add_344_24            157        238         0         238    <none> (D) 
    add_343_24            157        238         0         238    <none> (D) 
    add_180_55            124        224         0         224    <none> (D) 
    sll_313_41            116        181         0         181    <none> (D) 
    srl_365_27            100        176         0         176    <none> (D) 
  add_792_34               66        133         0         133    <none> (D) 
  add_1690_22              62        130         0         130    <none> (D) 
  dec_sub_1343_32_4        61         96         0          96    <none> (D) 
  dec_sub_1310_38_2        61         96         0          96    <none> (D) 
  dec_sub_1390_13_5        61         96         0          96    <none> (D) 

 (D) = wireload is default in technology library

You see that the memories are listed at the top (cog_ram_/cog_lut_) and their areas total 386,471 um2. The total cog, including those memories, is 424,322 um2. That means the cog logic area is only 37,851 um2, or less than 1/25 mm2! The overall cog size will grow to ~500k um2 after we upgrade the cog_lut_ from 256 to 512 words.

In practice that 37,851 um2 logic area needs to be divided by 65% (multiplied by ~1.5) to allow initially empty spaces in the cell array for the clock tree and signal buffers to go into later. That's still really small, though. A cog is only ~13,000 cells!
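
The arithmetic above can be checked with a short script. This is only a sketch built from the figures quoted in the report; the names are made up for illustration, the 65% utilization is the rule of thumb from the post, and the doubled-LUT estimate assumes the LUT area scales linearly with word count, which the thread does not confirm.

```python
# Areas in um^2, copied from the synthesis report above.
COG_TOTAL = 424_322
COG_RAM = 291_952
COG_LUT = 94_519  # 256-word LUT in this synthesis run


def area_summary(total=COG_TOTAL, ram=COG_RAM, lut=COG_LUT, utilization=0.65):
    """Return (memory area, logic area, logic area after placement slack)."""
    memory = ram + lut
    logic = total - memory
    # Dividing by the utilization leaves room for clock tree and buffers.
    placed_logic = logic / utilization
    return memory, logic, placed_logic


memory, logic, placed_logic = area_summary()
# Assumption: a 512-word LUT costs roughly twice the 256-word one.
grown_total = COG_TOTAL + COG_LUT

print(memory)               # 386471
print(logic)                # 37851
print(round(placed_logic))  # 58232
print(grown_total)          # 518841
```

The logic area of 37,851 um2 is 0.037851 mm2, which is just under 1/25 mm2 (0.04 mm2), and the grown total is in line with the "~500k um2" estimate.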

The clock period was set to 10ns (100MHz) for this run, which is a very low speed, but it allowed us to check out the basic flow without getting encumbered by possible failed-timing reports. Speed checks are next.

cgracey · 2015-09-11 06:46

Yanomani wrote: »

cgracey wrote: »

As it works out, there are no caveats regarding concurrent LUT execution and LUT r/w. The LUT instructions are fetched on "go" cycles and LUT r/w operations occur on non-"go" cycles. RDLUT takes three clocks, not the usual two.

Hi Chip

I believe that any address/data conflicts could be addressed, including streamer-related ones, by splitting the CLUT into two 256-long halves, isolated by independent sets of muxes.

Then, execution and streaming could run in parallel, although in mux-segregated address spaces and subjected to specific but not so restrictive rules.

Henrique

That could certainly work, but we would need two memories - one for each half. When you go to read or write, you tie up the whole memory. I think what we have right now is going to be fine because there probably won't be a strong need to stream from the LUT and execute from it, too. I think those will be fairly different types of applications that would use things differently.

cgracey · 2015-09-11 06:52

Cluso99 wrote: »

My thoughts were mainly that it could be simpler to implement and explain, and a possible benefit in a reduction of caveats. Currently an extra clock is being inserted in RDLUT which would not be required.

As I said originally, all I was after was to see if it were possible to execute from LUT space. I was happy with any caveats required. Chip went a step further than this since it was easy.

jmg wrote: »

Cluso99 wrote: »

The S & D addresses could be modified (extended) by AUGD/AUGS/AUGDS which could set the A9 bit to the upper 512 longs, permitting the upper cog ram (ie LUT) to be used with standard instructions (ie a pair of instructions).

Sounds like such a Paging Scheme could cause a lot of fun around Interrupts ?

Also, does the P2 really need all memory available as discrete registers (the highest cost RAM) ?

Lots of SW uses more memory as Arrays, than as discrete VARs,
and code can fetch from this memory too, which frees up the more valuable register-capable RAM.

I could have initiated the read on the "go" cycle (before we know its condition), instead of the next "get" cycle, but it would be disastrous in cases where the instruction wasn't supposed to execute and it wound up interfering with streaming that was going on, causing a glitch. Also, it would have conflicted with possible LUT instruction fetching. I think what we have now is just right. The only way to get around these problems, if they are problems, is to use a dual-port RAM, just like the cog RAM. That would explode the area, though (+3mm2), for a marginal improvement in function.

cgracey · 2015-09-11 07:02

Seairth wrote: »

On a slightly related note, I just noticed that there weren't any INDx registers in the 8/13 document. Did we lose indirect registers in the new design?

Yes, they are gone. We have an ALTDS instruction now that substitutes D and S fields in the next instruction. ALTDS also increments/decrements those fields in its D register, with S supplying the inc/dec controls. It was a really cheap way around what could be a huge hardware situation, like in Prop2-Hot.

cgracey · 2015-09-11 07:07

Cluso99 wrote: »

cgracey wrote: »

As it works out, there are no caveats regarding concurrent LUT execution and LUT r/w. The LUT instructions are fetched on "go" cycles and LUT r/w operations occur on non-"go" cycles. RDLUT takes three clocks, not the usual two.

Aha - got it thanks Chip.

Does that mean it would then insert an extra clock (ie 4 clocks) to get back in "sync" with "go" cycles or does the "go" cycle get shifted?

The "go" cycle can be delayed by any number of clocks (e.g. WAITCNT). The "get" cycle reads D and S registers and the "go" cycle writes the ALU result and reads another instruction. Normal execution goes: get,go,get,go,get,go, etc. If an instruction delays, it looks like this: get,gox,gox,gox,...go.

WRLUT issues the write on "get", then does "go".
RDLUT issues the read on "get", captures the result on "gox", then does "go". It needs that "gox" cycle to capture the data before routing it through the result mux, which also takes some time.
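
The get/gox/go sequence described above can be modeled in a few lines. This is a toy sketch, not the actual Verilog: "ADD" stands in for any ordinary two-clock instruction, and the table and function names are hypothetical.

```python
# Number of extra "gox" cycles per instruction, per the description above.
# Only RDLUT and WRLUT come from the thread; "ADD" is a stand-in for any
# ordinary two-clock instruction.
EXTRA_GOX = {"ADD": 0, "WRLUT": 0, "RDLUT": 1}


def cycles(instr, extra_gox=None):
    """Return the cycle names for one instruction: get, zero or more gox, go."""
    n = EXTRA_GOX[instr] if extra_gox is None else extra_gox
    return ["get"] + ["gox"] * n + ["go"]


def trace(program):
    """Concatenate the cycle names for a list of instructions."""
    out = []
    for instr in program:
        out.extend(cycles(instr))
    return out


print(trace(["ADD", "RDLUT", "WRLUT"]))
# ['get', 'go', 'get', 'gox', 'go', 'get', 'go']
```

A delaying instruction such as WAITCNT would simply carry a larger, data-dependent gox count, for example `cycles("ADD", extra_gox=3)` gives get,gox,gox,gox,go.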

MJB · 2015-09-11 07:35

cgracey wrote: »

The "go" cycle can be delayed by any number of clocks (e.g. WAITCNT). The "get" cycle reads D and S registers and the "go" cycle writes the ALU result and reads another instruction. Normal execution goes: get,go,get,go,get,go, etc. If an instruction delays, it looks like this: get,gox,gox,gox,...go.

I am not qualified to talk here...

just wondered if writing the result on the first gox would make a difference?

cgracey · 2015-09-11 07:38

MJB wrote: »

cgracey wrote: »

The "go" cycle can be delayed by any number of clocks (e.g. WAITCNT). The "get" cycle reads D and S registers and the "go" cycle writes the ALU result and reads another instruction. Normal execution goes: get,go,get,go,get,go, etc. If an instruction delays, it looks like this: get,gox,gox,gox,...go.

I am not qualified to talk here...

just wondered if writing the result on the first gox would make a difference?

Well, in WRLUT there is no "gox", only "get" then "go". So, we would have to make this a 3-clock instruction to even have a "gox". I don't see what good it would do, but I probably don't understand what you are thinking about.

Cluso99 · 2015-09-11 07:55

Fantastic info from Treehouse! Nice to see the sizes for the various parts.

Thanks for the get/gox/go info. Makes it quite simple to understand.

ozpropdev · 2015-09-11 10:18

I always thought the P2 was going to be great from the "get go"

Thanks for the update Chip!

Seairth · 2015-09-11 20:50

Does SETDS (A) set d-field and s-field of D to the low 9 bits of S, or (B) set the low 18 bits of D to the low 18 bits of S?

If (A), what's the use case? If (B) what's the value of having S be an immediate value?
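
To make the two readings concrete, here is a small sketch of each as plain bit operations. It assumes the d-field sits in bits 17..9 and the s-field in bits 8..0, as in the instruction layout quoted later in this thread. The function names are hypothetical, and nothing here says which reading the hardware actually implements.

```python
FIELD_MASK = 0x1FF   # one 9-bit field
DS_MASK = 0x3FFFF    # d-field and s-field together (low 18 bits)
WORD_MASK = 0xFFFFFFFF


def setds_a(d, s):
    """Reading (A): low 9 bits of S go into both the d-field and s-field of D."""
    v = s & FIELD_MASK
    return (d & ~DS_MASK & WORD_MASK) | (v << 9) | v


def setds_b(d, s):
    """Reading (B): low 18 bits of S replace the low 18 bits of D."""
    return (d & ~DS_MASK & WORD_MASK) | (s & DS_MASK)


print(hex(setds_a(0xF0000000, 0x12345)))  # 0xf0028b45
print(hex(setds_b(0xF0000000, 0x12345)))  # 0xf0012345
```

Under reading (B), a 9-bit immediate S could only ever supply the s-field and would zero the d-field, which seems to be the point of the second question.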

Yanomani · 2015-09-11 22:07

In my thoughts, they would always be kept isolated from each other.

Each one capable of acting as a peripheral storage tank, except by having two mux groups connected to their data/address/control buses: one meant to be driven by the streamer logic state machine, and the other, by the ALU Get/Result bus logic state machine.

But, in fact, I was only trying to imagine the way you designed their data, address and control buses.

As you described that they would be tied up for reading and writing, probably it's a common bus, with the streamer SM logic at one side, and the ALU Get/Result bus SM logic at the other one; simpler and faster than I supposed it would be.

And yes, I must recognize that streaming to/from LUT and executing from it are different applications; there would be no need to run them simultaneously.

Henrique

cgracey wrote: »

That could certainly work, but we would need two memories - one for each half. When you go to read or write, you tie up the whole memory. I think what we have right now is going to be fine because there probably won't be a strong need to stream from the LUT and execute from it, too. I think those will be fairly different types of applications that would use things differently.

rjo__ · 2015-09-11 23:26

report on the p123-A7

Disregard my earlier post... board 2 now just sits there like a zombie:)

cgracey · 2015-09-12 01:19

rjo__ wrote: »

report on the p123-A7

Disregard my earlier post... board 2 now just sits there like a zombie:)

Are you saying that you are using the new -A7 board exclusively?

rjo__ · 2015-09-12 03:29

Yes, killed my DE2-115 a while back. Have a couple of other boards... bemicro-A9 (etc.)... P1V working fine, but no time right now to play. The LEDs on the 123 indicate that Jac's P1V is active, but Proptool can't identify the Prop. Was working last weekend... after giving it time to find itself.

When your guys test, make sure they do a serial loop back ala Ozpropdev.

I don't know what the delay between releasing the image and releasing the code will be for the P2... but I read your comment about supporting multiple boards. If I were you I would worry about the Bemicro-A9 and P123 A7&A9 at this point. Once you release the sources, it will give everybody something fun to do. If there is someone that is really inconvenienced by this... let them speak up now:)!!!!

Coley · 2015-09-12 09:05

erm... yeah what about all the people that invested in a DE2 or DE0?? That's a pretty selfish attitude to take IMO.

I'm sure Chip won't do that anyway.

rjo__ wrote: »

Yes, killed my DE2-115 a while back. Have a couple of other boards... bemicro-A9 (etc.)... P1V working fine, but no time right now to play. The LEDs on the 123 indicate that Jac's P1V is active, but Proptool can't identify the Prop. Was working last weekend... after giving it time to find itself.

When your guys test, make sure they do a serial loop back ala Ozpropdev.

I don't know what the delay between releasing the image and releasing the code will be for the P2... but I read your comment about supporting multiple boards. If I were you I would worry about the Bemicro-A9 and P123 A7&A9 at this point. Once you release the sources, it will give everybody something fun to do. If there is someone that is really inconvenienced by this... let them speak up now:)!!!!

evanh · 2015-09-12 10:41

Chip just said he's still on the DE2 himself, so I think you're pretty safe for that one for a while. On the other hand, DE0 will be spare-time dependent I suspect.

David Betz · 2015-09-12 12:11

rjo__ wrote: »

Yes, killed my DE2-115 a while back. Have a couple of other boards... bemicro-A9 (etc.)... P1V working fine, but no time right now to play. The LEDs on the 123 indicate that Jac's P1V is active, but Proptool can't identify the Prop. Was working last weekend... after giving it time to find itself.

When your guys test, make sure they do a serial loop back ala Ozpropdev.

I don't know what the delay between releasing the image and releasing the code will be for the P2... but I read your comment about supporting multiple boards. If I were you I would worry about the Bemicro-A9 and P123 A7&A9 at this point. Once you release the sources, it will give everybody something fun to do. If there is someone that is really inconvenienced by this... let them speak up now:)!!!!

I'm hoping there will be a DE2-115 image since I don't have either a 1-2-3 board or a BeMicro-A9.

rjo__ · 2015-09-12 13:21

Then I think there will be;)

Baggers · 2015-09-12 23:18

Awesome news and update Chip

evanh · 2015-09-14 08:45

Chip,
You might want to review the conversation on ALTDS here - http://forums.parallax.com/discussion/156242/question-about-altds-implementation-in-new-chip/p1

It's an old topic that recently got revived by Seairth wondering about its effectiveness. This then evolved into a question about how to do double indirection. I added my two cents' worth by trying to make a working example of double indirection.

The end result seems to be asymmetrical methods, depending on whether one is reading or writing.

cgracey · 2015-09-15 05:20

After all the PLL drama, I've got Prop2 comfortably running on the Prop 1-2-3 FPGA -A7 board.

This is a much simpler setup than the DE2-115. Everything is on one board, there's no need for a PropPlug, just one USB connection. Download is 15x faster, and there's no need to cycle power, just flip a PGM/RUN switch.

Peter Jakacki · 2015-09-15 05:32

Great great great.

Assuming this means an image is imminent I've got a DE2-115 but I can pick up a BeMicroCVA9 if need be although they seem to have jumped up in price from $149 to $210 (What's the go there?).

BTW, do you have a link to the documents, even if they are (and will be) a work in progress?

Cluso99 · 2015-09-15 08:52

Chip,
I noted on the ALTDS thread you mentioned the new instruction bit order...
CCCC OOOOOOO CZI DDDDDDDDD SSSSSSSSS

Is there any specific reason you have reversed the CZ bits (of CZI) ?
On the P1 they are .... ZCRI ....

If there is no reason, wouldn't it be better to keep the same sequence ZCI so that we don't have to remember to reverse the order of ZC in the P2 ???

BTW I feel your pain with the PLL problem on the A7/A9 FPGA's. What a PITA!

cgracey · 2015-09-15 09:10

Cluso99 wrote: »

Chip,
I noted on the ALTDS thread you mentioned the new instruction bit order...
CCCC OOOOOOO CZI DDDDDDDDD SSSSSSSSS

Is there any specific reason you have reversed the CZ bits (of CZI) ?
On the P1 they are .... ZCRI ....

If there is no reason, wouldn't it be better to keep the same sequence ZCI so that we don't have to remember to reverse the order of ZC in the P2 ???

BTW I feel your pain with the PLL problem on the A7/A9 FPGA's. What a PITA!

The C is first because it just seemed like a better way to go, plus it made the Verilog more orderly, because of the way I had arranged things. I had been thinking about changing it in the back of my mind for quite a while.

I think that the Prop2 is such a different animal than the Prop1 that nobody's going to think much of that difference, in light of everything else that has changed. Just my feeling about it. The new way seems cleaner to me.
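
For anyone wanting to pick apart instruction words with the new ordering, the layout quoted above (CCCC OOOOOOO CZI DDDDDDDDD SSSSSSSSS) maps onto shifts and masks as sketched below. This assumes the leftmost field is the most significant (bit 31 down to bit 0); the function name and dictionary keys are only illustrative.

```python
def decode(instr):
    """Split a 32-bit instruction word into the fields of the quoted layout."""
    return {
        "cond":   (instr >> 28) & 0xF,    # CCCC       bits 31..28
        "opcode": (instr >> 21) & 0x7F,   # OOOOOOO    bits 27..21
        "c":      (instr >> 20) & 1,      # C          bit 20
        "z":      (instr >> 19) & 1,      # Z          bit 19
        "i":      (instr >> 18) & 1,      # I          bit 18
        "d":      (instr >> 9) & 0x1FF,   # DDDDDDDDD  bits 17..9
        "s":      instr & 0x1FF,          # SSSSSSSSS  bits 8..0
    }


# Build a word from known field values and decode it again.
word = (0xF << 28) | (0x55 << 21) | (1 << 20) | (0 << 19) | (1 << 18) | (0x123 << 9) | 0x0AB
fields = decode(word)
print(fields["c"], fields["z"], fields["i"])  # 1 0 1
```

With C ahead of Z, the C flag bit sits one position above the Z flag bit, which is the reverse of the P1's ZCRI ordering mentioned in the question.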

jac_goudsmit · 2015-09-15 23:21

cgracey wrote: »

After all the PLL drama, I've got Prop2 comfortably running on the Prop 1-2-3 FPGA -A7 board.

This is a much simpler setup than the DE2-115. Everything is on one board, there's no need for a PropPlug, just one USB connection. Download is 15x faster, and there's no need to cycle power, just flip a PGM/RUN switch.

15x speedup sounds delicious... Is the programming and pin usage exactly the same for the A9 as the A7? I wonder if I should dig into the BeMicroCV-A9 schematic and the 1-2-3 schematic and see how hard it would be to drop a Propeller on the BeMicroCV-A9 for programming... Is that worth the trouble?

===Jac

David Betz · 2015-09-16 01:00

cgracey wrote: »

After all the PLL drama, I've got Prop2 comfortably running on the Prop 1-2-3 FPGA -A7 board.

This is a much simpler setup than the DE2-115. Everything is on one board, there's no need for a PropPlug, just one USB connection. Download is 15x faster, and there's no need to cycle power, just flip a PGM/RUN switch.

That sounds encouraging. Are you likely to release an A7 image before the new A9 board is ready? Might still be worth picking up an A7 board after all.

Edit: I just realized that you're probably talking about the 1-2-3 A7 board with the whisker wire. Is there any other way to get a clock to the P2? How have people been doing it with P1v on the 1-2-3 A7 board?

jmg · 2015-09-16 02:29

David Betz wrote: »

... Are you likely to release an A7 image before the new A9 board is ready? Might still be worth picking up an A7 board after all.

Given the new A9 delays, an A7 image will certainly come first.
The A9 delays have likely pushed back the time-lines.

David Betz wrote: »

Edit: I just realized that you're probably talking about the 1-2-3 A7 board with the whisker wire. Is there any other way to get a clock to the P2? How have people been doing it with P1v on the 1-2-3 A7 board?

I think the issue is around the PLL paths?
There is a 50MHz canned Osc, so there will be Non PLL clock solutions. That may be fine for most testing.

Adafruit have a small ~$7 Si5351 PCB, which would give a PLL solution up to 200MHz

jmg · 2015-09-16 02:33

cgracey wrote: »

After all the PLL drama, I've got Prop2 comfortably running on the Prop 1-2-3 FPGA -A7 board.

How similar is the FPGA PLL to the P2 PLL, and do you have a proven OnSemi PLL cell you are able to use?
Seems that could be a test coverage problem area ?
