The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

koehler · 2015-09-01 04:02

So, where were we last?

Chip was about done with interrupts, and was going to jump on documentation prior to getting an image out?
And then, Smart Pins while people rev up their FPGA boards?

potatohead · 2015-09-01 04:07

Interrupts are done, block ops and misc instruction polish is likely done now.

Next up, unless some other basics come up, is the boot code. There's a 16K ROM, and we aren't sure what the ROM mapping is. I think it's gonna be stored somewhere and streamed into the RAM. I saw that discussion once, but I never did see a commit to it.

Then docs, then image.

That's how I understand it anyway.

Dave Hein · 2015-09-01 19:01

P2 Watch:
72 days since the end of Spring.
21 days to the beginning of Fall.
61 days to November 1st.

        June                  July                 August               September
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
   -- -- -- -- -- --            1  2  3  4                     1         1 -- -- -- --
-- -- -- -- -- -- --   5  6  7  8  9 10 11   2  3  4  5  6  7  8  -- -- -- -- -- -- --
-- -- -- -- -- -- --  12 13 14 15 16 17 18   9 10 11 12 13 14 15  -- -- -- -- -- -- --
21 22 23 24 25 26 27  19 20 21 22 23 24 25  16 17 18 19 20 21 22  -- -- -- -- -- -- --
28 29 30              26 27 28 29 30 31     23 24 25 26 27 28 29  -- -- -- --
                                            30 31

Interrupts are done.
The P2 instruction set is completed.
Block reads and writes almost done.
Deadline for final Verilog submission to Treehouse: November 1st
P2 Day is almost here!

Heater. · 2015-09-01 19:04

Optimist.

cgracey · 2015-09-01 23:38

Dave Hein wrote: »

I was thinking about how hubexec will interact with rdxxxx and wrxxxx. My understanding is that hubexec will fetch instructions using a FIFO, which will be filled at a rate of one long per cycle. When a program jumps from a cog address to a hub address, the cog will be stalled until the eggbeater memory cycle for the first instruction occurs. After that, the cog will execute straight-line code at a rate of one instruction every two cycles. I assume the FIFO continues to fill when the eggbeater cycle for the next instruction comes around.

What happens when a rdxxxx/wrxxxx address is in the same cycle as a FIFO fetch? I'm assuming the rdxxxx/wrxxxx has priority, correct? Otherwise, the cog would be stalled for an additional 16 cycles waiting for the eggbeater to come around again.

For hub exec, the FIFO signals 'full' at 8 longs, instead of the normal 16 longs, since it will only be reading instructions at half the clock frequency. So, the FIFO does have priority, but it's not as bad as it could be if we waited for 16 longs to load before RDxxxx/WRxxxx could execute. The FIFO needs to have priority to ensure that whatever it's being used for doesn't run out of data.

Dave Hein · 2015-09-01 23:45

OK, that sounds good. I've added hubexec to spinsim using 16 longs, but I can change that to 8 longs with one line of code.

EDIT: On the other hand, it seems like 8 longs will cause stalls. For the first 8 instructions the FIFO will be filled in 8 cycles, and then stop filling for 8 cycles. Then when the next 8 longs are needed it seems the cog will stall for 8 cycles until the required eggbeater cycle is reached.

cgracey · 2015-09-01 23:49

Dave Hein wrote: »

So, just to clarify, here's how I understand how hubexec works.

- When a cog jumps to a long address greater than 511 it goes into the hubexec mode.
- The streamer FIFO is filled with 16 longs, and cog execution continues as soon as the first long is available.
- The streamer FIFO takes precedence over RDxxxx/WRxxxx accesses.
- As longs are consumed from the streamer FIFO subsequent hub RAM reads are performed to continue to fill the FIFO.
- Any jumps to another hub address will cause the streamer FIFO to be refilled, even if the FIFO already contains the data from the target address
- Filling of the streamer FIFO is terminated immediately when a jump is made to an address less than 512

All correct, with a few caveats:

2nd: 8 longs are read, not 16, since instruction rate is clock/2. Whenever less than 8 longs are in the FIFO and the needed hub window appears, more longs are read until the level comes back up to 8.

6th: The FIFO is always in 'read' or 'write' mode, so after a jump to cog RAM from hub RAM, the FIFO is left in read mode and fills up to 8 longs again when the window comes along. This could be optimized away, but wouldn't make any performance difference, because if the program in cog RAM wanted to use the FIFO, it would issue a new RDFAST/WRFAST command, anyway.
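
The rules in the list above, with Chip's two caveats applied, can be captured in a toy model. This is only a sketch: the class and method names are hypothetical, and it tracks execution mode and FIFO reload counts rather than any real timing.

```python
COG_LONGS = 512  # long addresses 0..511 execute from cog RAM


class FetchModel:
    """Toy model of cog/hub execution mode; all names are hypothetical."""

    def __init__(self):
        self.hubexec = False
        self.fifo_reloads = 0

    def jump(self, long_addr):
        """Apply a jump; return True if the cog is now in hubexec mode."""
        if long_addr >= COG_LONGS:
            # Every jump to a hub address restarts the FIFO, even when the
            # target long is already sitting in it.
            self.hubexec = True
            self.fifo_reloads += 1
        else:
            # Back in cog RAM: the FIFO is left in read mode and tops itself
            # up again, so no new reload is counted here.
            self.hubexec = False
        return self.hubexec
```

Jumping to $204 twice in a row counts two reloads in this model, which is the behaviour described in the fifth bullet.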

cgracey · 2015-09-01 23:52

Conga wrote: »

Roy Eltham wrote: »

My understanding is that the streamer/FIFO fills one long on every clock of the hub, and the normal RDxxxx/WRxxxx instructions are satisfied once every 16 clocks (similar to the existing P1 hub/cog interaction).
So in the time it takes to "go around" once, the streamer/FIFO is filled, and delaying the RDxxxx/WRxxxx beyond the normal "wait for its slot" wouldn't happen.

I thought/hoped that this would apply only to the random access to hub memory,
not to explicit block transfers (SETQ right before a RDLONG/WRLONG) = sequential access.

Lack of fast block data transfers would make the new hub memory layout less useful;
optimizing only for HubExec would be too biased --- it's not clear that instruction reading would always be the bottleneck.

The RDLONG/WRLONG-repeat instructions do read/write one long per clock, though the FIFO has priority on any clock where it needs one.

cgracey · 2015-09-02 00:07

koehler wrote: »

So, where were we last?

Chip was about done with interrupts, and was going to jump on documentation prior to getting an image out?
And then, Smart Pins while people rev up their FPGA boards?

Still working on the block-write instruction. This has been very hard gearing up for. It's like having to bench press some huge weight that can't be broken into smaller chunks. All at once, only. I think I'll get this working tonight, after everyone goes to bed. The block-read works great and is now how the cog loads itself (for non-hub-exec COGINITs).

After this, I will add two more instructions, which will be trivial once block-read/write are done. They will read/write the 256-long LUT RAM to/from the hub. They will use the RDLONG/WRLONG instructions, but with another opcode for SETQ (maybe 'SETL'?) that will signal that any following RDLONG/WRLONG will be reading/writing the LUT RAM, not the cog RAM. This way, two instructions can load a whole sin/cos table or color palette from the hub. Then, I will do docs and get an FPGA image out. Ken told me today that the Prop-123 boards with the Cyclone V-A9 chips are going onto the pick-and-place this week.

Dave Hein · 2015-09-02 00:13

I changed the level to 8 longs in spinsim, and it does run without stalls. Because the cog is consuming longs at half the rate they are being read, the FIFO level only reaches 8 even though all 16 longs were read during an eggbeater cycle.
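
The fill/drain arithmetic behind that observation can be sketched in a few lines. This is not spinsim code: the function name is made up, and the model ignores the hub read latency and the 'full' mark, counting only one long arriving per clock and one long consumed every second clock.

```python
def fifo_level_trace(clocks=16, clocks_per_instruction=2):
    """Return the FIFO level at the end of each clock of one eggbeater turn."""
    level = 0
    trace = []
    for t in range(clocks):
        level += 1  # one long arrives from the hub every clock
        if t % clocks_per_instruction == clocks_per_instruction - 1:
            level -= 1  # the cog consumes one long every second clock
        trace.append(level)
    return trace


trace = fifo_level_trace()
print(max(trace), trace[-1])
```

With the defaults, 16 longs are read and 8 are consumed, so the level ends the turn at 8.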

koehler · 2015-09-02 02:12

Just curious, how many transistors do we think the Prop 2 is/will be?

cgracey · 2015-09-02 02:24

koehler wrote: »

Just curious, how many transistors do we think the Prop 2 is/will be?

512KB = 4M bits x 6 transistors per bit = 24M just in hub RAM.

I estimate there's another 4M in logic.
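
For reference, the arithmetic above works out as follows. The sketch assumes "M" is the binary mega (2^20) and treats the 4M logic figure as the rough estimate it is; the function name is hypothetical.

```python
MEGA = 2 ** 20  # assumption: "M" in the estimate is the binary mega


def sram_transistors(size_bytes, transistors_per_bit=6):
    """Transistor count for an SRAM array built from 6-transistor cells."""
    return size_bytes * 8 * transistors_per_bit


hub = sram_transistors(512 * 1024)  # 512KB of hub RAM
logic = 4 * MEGA                    # rough estimate quoted above
print(hub, hub // MEGA, (hub + logic) // MEGA)
```

That gives 25,165,824 transistors for the hub RAM (24M in binary units) and roughly 28M in total.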

Cluso99 · 2015-09-02 02:59

Chip,
What can the LUT (256 longs) be used for besides CLUT (and obviously lookup tables since you mention sin/cos tables)?

Seairth · 2015-09-02 03:30

cgracey wrote: »

For hub exec, the FIFO signals 'full' at 8 longs, instead of the normal 16 longs, since it will only be reading instructions at half the clock frequency. So, the FIFO does have priority, but it's not as bad as it could be if we waited for 16 longs to load before RDxxxx/WRxxxx could execute. The FIFO needs to have priority to ensure that whatever it's being used for doesn't run out of data.

Suppose the following scenario: "JMP $204" on cog 0. In this case, the FIFO streamer will stall until cog 0 is aligned to bank 4. Over the next 8 clock cycles, it queues $204 to $20B. Let's just say that this is clock cycles $F004 to $F00B.

Given this, if the first instruction is fetched from the FIFO on cycle $F005, then the FIFO will be empty on $F014. When the pipeline attempts to fetch the next instruction on $F015, the FIFO will have to stall it for 8 clock cycles before the cog is aligned with the correct bank (the next instruction is at $20C). From there, it will fetch 8 more instructions, $20C-$213, during clock cycles $F01C-$F023. As before, the pipeline will fetch the next instruction from the FIFO on $F01D. This means there would be a stall of 8 cycles every 16 cycles.

On the other hand, if the first instruction were to execute after the FIFO was full, then you would still stall 8 cycles (the FIFO fill period), except that it would just happen to be that the FIFO was always aligned to the proper hub bank when fetching the next set of 8 instructions. More specifically, the first instruction would be fetched on $F00C and the queue would empty on $F01B. The FIFO would immediately queue $20C-$213 over clocks $F01C-$F023 (stalling the pipeline for 8 clock cycles). On $F024, the next instruction would be fetched from the FIFO.

Based on this, the only way I can see avoiding the stall is to have a 16-long FIFO. With that, the FIFO would fetch $204-$213 on cycles $F004-$F013. The first instruction would still be fetched from the FIFO on $F005. The FIFO would now wait until $F024 before queuing the next instruction. As it turns out, $F025 is the next cycle that the pipeline will attempt to fetch an instruction from the FIFO. No stall!

I just don't see how you can avoid pipeline fetch stalls unless the FIFO is 16 deep.

Of course, this is all predicated on not encountering any branch instructions or instructions that would stall the pipeline. Once you encounter those, all bets are off and an 8-deep FIFO might be a reasonable compromise.
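
The window arithmetic in this scenario can be checked with a small helper. It is a sketch under stated assumptions: 16 hub banks selected by the low four bits of the long address, and a cog that sees bank (clock + cog) mod 16 on each clock. The rotation direction is a guess, though it makes no difference for cog 0; the function name is hypothetical.

```python
BANKS = 16  # one hub bank per cog in the 16-cog design


def next_window(long_addr, clock, cog=0):
    """First clock >= `clock` at which `cog` is aligned to the address's bank."""
    bank = long_addr % BANKS
    slot = (clock + cog) % BANKS  # assumed rotation direction
    return clock + (bank - slot) % BANKS


print(hex(next_window(0x204, 0xF000)))  # first fetch after JMP $204
print(hex(next_window(0x20C, 0xF015)))  # window for $20C once the FIFO is empty
```

For cog 0 this reproduces the clocks used in the scenario: $F004 for $204, and $F01C for $20C when asked at $F015.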

Seairth · 2015-09-02 03:34

As a side note to my prior post, it makes me wonder how effective it would be to write code that always jumped to cogexec mode for any snippet that required a memory hubop. Maybe there are some special cases where this would make sense, but I'm not sure...

Electrodude · 2015-09-02 03:37

From my understanding, the FIFO is still 16 levels deep, but when used for hubexec it isn't highest priority once it is 8 longs full. An instruction that needs to interrupt the hub streamer will probably stall anyway, so I don't see Seairth's argument as a problem (although I might be misunderstanding it).

How hard would it be (and would it be useful) to make it possible to manually specify how deep the FIFO has to be in order to become lower priority when using it for data? It doesn't have to be fine-tuneable - making only 8 and 16 options would be fine, but this would seem useful for when you're not using the streamer every single clock.

cgracey · 2015-09-02 04:55

Cluso99 wrote: »

Chip,
What can the LUT (256 longs) be used for besides CLUT (and obviously lookup tables since you mention sin/cos tables)?

There's a Goertzel algorithm circuit in the streamer which can output the lower two bytes of the LUT longs to DACs, while the top two bytes are treated as sine and cosine for accumulation. On each clock, the 32-bit phase is added to, the LUT is read using the upper bits of the phase as the address, then the looked-up top two bytes are each multiplied by an ADC feedback bit (0/1 --> -1/+1) and accumulated separately. After many cycles of this, those accumulations represent an (X,Y) point that expresses angle and amplitude. The CORDIC instruction ARTCAN converts (X,Y) into (rho,theta) so you get power and phase angle. Those bottom two bytes from the LUT were output to DACs as a stimulus, if needed, to excite some system through which the ADC returns a bitstream. When used in this closed-loop mode, you have an instrument which should be able to resolve all kinds of interesting things in the real world that relate to resonance, time-of-flight, phase differences through sensor arrays, and who knows what else. Something new to play with.
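
A software sketch of the accumulation loop described above may help. It is not a model of the actual circuit: the table contents (signed 8-bit sine and cosine), the use of the top 8 phase bits as the LUT address, the choice of cosine for X and sine for Y, and all function names are assumptions, and `math.hypot`/`math.atan2` stand in for the CORDIC conversion. `square_bits` is a made-up stand-in for an ADC bitstream.

```python
import math

LUT_SIZE = 256  # the LUT holds 256 longs


def build_lut():
    """Hypothetical table of (sine, cosine) pairs as signed 8-bit values."""
    return [
        (round(127 * math.sin(2 * math.pi * i / LUT_SIZE)),
         round(127 * math.cos(2 * math.pi * i / LUT_SIZE)))
        for i in range(LUT_SIZE)
    ]


def goertzel(bits, phase_step, lut):
    """Accumulate sine/cosine against a 1-bit stream; return (rho, theta)."""
    phase = 0
    acc_x = acc_y = 0
    for bit in bits:
        phase = (phase + phase_step) & 0xFFFFFFFF  # 32-bit phase accumulator
        sin_v, cos_v = lut[phase >> 24]            # upper bits address the LUT
        sign = 1 if bit else -1                    # ADC bit 0/1 --> -1/+1
        acc_x += sign * cos_v
        acc_y += sign * sin_v
    return math.hypot(acc_x, acc_y), math.atan2(acc_y, acc_x)


def square_bits(count, period):
    """Stand-in ADC bitstream: a square wave roughly in phase with cosine."""
    return [1 if ((k + 1 + period // 4) % period) < period // 2 else 0
            for k in range(count)]


lut = build_lut()
step = 1 << 26  # one full phase turn every 64 samples
rho_on, theta_on = goertzel(square_bits(6400, 64), step, lut)
rho_off, _ = goertzel(square_bits(6400, 40), step, lut)
print(rho_on, theta_on, rho_off)
```

A bitstream at the detector's own frequency accumulates to a large magnitude, while one at a different frequency largely cancels out, which is what makes the (rho,theta) result usable as a narrow-band measurement.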

cgracey · 2015-09-02 05:02

Seairth wrote: »

cgracey wrote: »

For hub exec, the FIFO signals 'full' at 8 longs, instead of the normal 16 longs, since it will only be reading instructions at half the clock frequency. So, the FIFO does have priority, but it's not as bad as it could be if we waited for 16 longs to load before RDxxxx/WRxxxx could execute. The FIFO needs to have priority to ensure that whatever it's being used for doesn't run out of data.

Suppose the following scenario: "JMP $204" on cog 0. In this case, the FIFO streamer will stall until cog 0 is aligned to bank 4. Over the next 8 clock cycles, it queues $204 to $20B. Let's just say that this is clock cycles $F004 to $F00B.

Given this, if the first instruction is fetched from the FIFO on cycle $F005, then the FIFO will be empty on $F014. When the pipeline attempts to fetch the next instruction on $F015, the FIFO will have to stall it for 8 clock cycles before the cog is aligned with the correct bank (the next instruction is at $20C). From there, it will fetch 8 more instructions, $20C-$213, during clock cycles $F01C-$F023. As before, the pipeline will fetch the next instruction from the FIFO on $F01D. This means there would be a stall of 8 cycles every 16 cycles.

On the other hand, if the first instruction were to execute after the FIFO was full, then you would still stall 8 cycles (the FIFO fill period), except that it would just happen to be that the FIFO was always aligned to the proper hub bank when fetching the next set of 8 instructions. More specifically, the first instruction would be fetched on $F00C and the queue would empty on $F01B. The FIFO would immediately queue $20C-$213 over clocks $F01C-$F023 (stalling the pipeline for 8 clock cycles). On $F024, the next instruction would be fetched from the FIFO.

Based on this, the only way I can see avoiding the stall is to have a 16-long FIFO. With that, the FIFO would fetch $204-$213 on cycles $F004-$F013. The first instruction would still be fetched from the FIFO on $F005. The FIFO would now wait until $F024 before queuing the next instruction. As it turns out, $F025 is the next cycle that the pipeline will attempt to fetch an instruction from the FIFO. No stall!

I just don't see how you can avoid pipeline fetch stalls unless the FIFO is 16 deep.

Of course, this is all predicated on not encountering any branch instructions or instructions that would stall the pipeline. Once you encounter those, all bets are off and an 8-deep FIFO might be a reasonable compromise.

Well, the FIFO is actually 16+5 levels deep, because there is a 5-clock latency from issuing the hub 'read' command to getting the results back. When the FIFO is 'full', measuring 16 longs depth, and 'read' commands cease, there are still 5 more longs on the way. So, when the FIFO 'full' mark is set to 8 longs during hub exec, there's a possibility of there being 13 longs in the FIFO. Anyway, it never runs dry.
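
The depth arithmetic here is simple enough to write down. In this sketch the function names are hypothetical; the 5-clock latency and the 8- and 16-long 'full' marks are the figures given above.

```python
HUB_READ_LATENCY = 5  # clocks from issuing a hub 'read' to getting the long


def max_fifo_longs(full_mark, latency=HUB_READ_LATENCY):
    """Worst-case occupancy: the 'full' mark plus reads still in flight."""
    return full_mark + latency


def clocks_of_code(longs, clocks_per_instruction=2):
    """How many clocks of straight-line hub exec that many longs cover."""
    return longs * clocks_per_instruction


print(max_fifo_longs(16), max_fifo_longs(8), clocks_of_code(max_fifo_longs(8)))
```

So the 16-long mark implies 21 levels of storage, and the 8-long hub exec mark can leave up to 13 longs queued, or 26 clocks of straight-line code at one instruction per two clocks.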

cgracey · 2015-09-02 05:08

One other thing about the FIFO: If you are using it for hub exec, you can't use it for streaming modes that require it. To use the streamer's FIFO modes, you need to be executing from the cog. On the other hand, the block-read/write instructions don't use the FIFO, so they can be executed from hub exec mode.

Cluso99 · 2015-09-02 07:58

Chip,
Does the cog have any means/instructions/bus for access to the LUT?
Without going thru the P2Hot loops (ie without changes to the current structure as I don't want to open a can of worms), I am wondering if there are any other uses for the LUT.

cgracey · 2015-09-02 08:19

Cluso99 wrote: »

Chip,
Does the cog have any means/instructions/bus for access to the LUT?
Without going thru the P2Hot loops (ie without changes to the current structure as I don't want to open a can of worms), I am wondering if there are any other uses for the LUT.

There are RDLUT and WRLUT instructions, where D is the data and S[7:0] is the address.
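
As a rough software picture of those two instructions: the class below is a hypothetical model, not an emulator. It assumes only what is stated above, that D carries a 32-bit long and that just the low 8 bits of S select one of the 256 LUT entries.

```python
LUT_LONGS = 256


class LutModel:
    """Toy model of the 256-long LUT RAM as seen through RDLUT/WRLUT."""

    def __init__(self):
        self.ram = [0] * LUT_LONGS

    def wrlut(self, d, s):
        """Write long `d` to the LUT entry addressed by S[7:0]."""
        self.ram[s & 0xFF] = d & 0xFFFFFFFF

    def rdlut(self, s):
        """Read the long at the LUT entry addressed by S[7:0]."""
        return self.ram[s & 0xFF]
```

In this model an S value of $105 lands on the same entry as $05, since only S[7:0] is used.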

cgracey · 2015-09-02 08:20

Cluso99 wrote: »

Chip,
What can the LUT (256 longs) be used for besides CLUT (and obviously lookup tables since you mention sin/cos tables)?

It's always usable as a 256-long RAM.

Cluso99 · 2015-09-02 08:42

cgracey wrote: »

Cluso99 wrote: »

Chip,
Does the cog have any means/instructions/bus for access to the LUT?
Without going thru the P2Hot loops (ie without changes to the current structure as I don't want to open a can of worms), I am wondering if there are any other uses for the LUT.

There are RDLUT and WRLUT instructions, where D is the data and S[7:0] is the address.

Nice.
I presume the RDLUT and WRLUT do not go thru the hub logic (ie no delays as with RD/WRxxxx)?
Therefore we could use it as a scratch store and/or FIFO(s) rather than hub if 256 is enough.

Postedit:
After reading your post above again, I think the RD/WRLUT just transfer longs between LUT and HUB, not COG and LUT.
So my comment above would be incorrect.
Is there any simple way for a cog to use the LUT for fast storage, fifo, etc ???

I presume the LUT is dual port ram?

cgracey · 2015-09-02 08:52

Cluso99 wrote: »

cgracey wrote: »

Cluso99 wrote: »

Chip,
Does the cog have any means/instructions/bus for access to the LUT?
Without going thru the P2Hot loops (ie without changes to the current structure as I don't want to open a can of worms), I am wondering if there are any other uses for the LUT.

There are RDLUT and WRLUT instructions, where D is the data and S[7:0] is the address.

Nice.
I presume the RDLUT and WRLUT do not go thru the hub logic (ie no delays as with RD/WRxxxx)?
Therefore we could use it as a scratch store and/or FIFO(s) rather than hub if 256 is enough.

That's right.
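
One way the scratch-FIFO idea could look in software is a ring buffer laid out in a 256-long RAM. This is a sketch only: the class is hypothetical, the head and tail indices would live in cog registers on real hardware, and each list access stands in for one WRLUT or RDLUT.

```python
class LutFifo:
    """Ring buffer in a 256-long RAM; a hypothetical use of the LUT."""

    SIZE = 256

    def __init__(self):
        self.ram = [0] * self.SIZE
        self.head = 0   # next entry to write
        self.tail = 0   # next entry to read
        self.count = 0  # longs currently queued

    def push(self, value):
        """Queue one long; return False if the buffer is full."""
        if self.count == self.SIZE:
            return False
        self.ram[self.head] = value & 0xFFFFFFFF
        self.head = (self.head + 1) & 0xFF  # 8-bit index wraps for free
        self.count += 1
        return True

    def pop(self):
        """Dequeue the oldest long, or return None if the buffer is empty."""
        if self.count == 0:
            return None
        value = self.ram[self.tail]
        self.tail = (self.tail + 1) & 0xFF
        self.count -= 1
        return value
```

Because the address is only 8 bits wide, the index wrap costs nothing more than a mask.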

Cluso99 · 2015-09-02 09:04

Chip,
Just updated my prev post while you were answering.
So now I am confused.

Postedit:
After reading your post above again, I think the RD/WRLUT just transfer longs between LUT and HUB, not COG and LUT.
So my comment above would be incorrect.
Is there any simple way for a cog to use the LUT for fast storage, fifo, etc ???

I presume the LUT is dual port ram?

evanh · 2015-09-02 10:21

Cluso,
I'd say you had it right first. RD/WRLUT will be register direct addressing (CogRAM addresses for D operand, and LUT addresses for S operand).

MJB · 2015-09-02 10:27

Cluso99 wrote: »

Chip,
Just updated my prev post while you were answering.
So now I am confused.

Postedit:
After reading your post above again, I think the RD/WRLUT just transfer longs between LUT and HUB, not COG and LUT.
So my comment above would be incorrect.
Is there any simple way for a cog to use the LUT for fast storage, fifo, etc ???

I presume the LUT is dual port ram?

Hi Cluso,
from above I quote Chip:

After this, I will add two more instructions, which will be trivial once block-read/write are done. They will read/write the 256-long LUT RAM to/from the hub. They will use the RDLONG/WRLONG instructions, but with another opcode for SETQ (maybe 'SETL'?) that will signal that any following RDLONG/WRLONG will be reading/writing the LUT RAM, not the cog RAM. This way, two instructions can load a whole sin/cos table or color palette from the hub. Then, I will do docs and get an FPGA image out. Ken told me today that the Prop-123 boards with the Cyclone V-A9 chips are going onto the pick-and-place this week.

so I understand there is:
RDLONG/WRLONG with a prefix SETQ (maybe 'SETL'?) for HUB to LUT transfer
and
RD/WRLUT LUT to COG transfer

RDLUT and WRLUT do not go thru the hub logic (ie no delays as with RD/WRxxxx)?
Therefore we could use it as a scratch store and/or FIFO(s) rather than hub if 256 is enough.

confirmed by Chip

evanh · 2015-09-02 10:40

And the LUT RAM won't be dual-ported.

Seairth · 2015-09-02 12:42

cgracey wrote: »

Well, the FIFO is actually 16+5 levels deep, because there is a 5-clock latency from issuing the hub 'read' command to getting the results back. When the FIFO is 'full', measuring 16 longs depth, and 'read' commands cease, there are still 5 more longs on the way. So, when the FIFO 'full' mark is set to 8 longs during hub exec, there's a possibility of there being 13 longs in the FIFO. Anyway, it never runs dry.

Wow. I totally misconstrued your "FIFO signals full at 8" comment. I think I've got it now.

Out of curiosity, if code jumps to the same hub address as the instruction that's currently at the head of the FIFO, will it still cause the instructions to be re-fetched? I expect there will be code that will CALL/LINK into cogexec mode, then RET/JMP back to hubexec mode. In this case, there's no need to refresh the FIFO (assuming the cogexec code didn't use one of the streaming modes that uses the FIFO).

cgracey · 2015-09-02 15:43

Seairth wrote: »

cgracey wrote: »

Well, the FIFO is actually 16+5 levels deep, because there is a 5-clock latency from issuing the hub 'read' command to getting the results back. When the FIFO is 'full', measuring 16 longs depth, and 'read' commands cease, there are still 5 more longs on the way. So, when the FIFO 'full' mark is set to 8 longs during hub exec, there's a possibility of there being 13 longs in the FIFO. Anyway, it never runs dry.

Wow. I totally misconstrued your "FIFO signals full at 8" comment. I think I've got it now.

Out of curiosity, if code jumps to the same hub address as the instruction that's currently at the head of the FIFO, will it still cause the instructions to be re-fetched? I expect there will be code that will CALL/LINK into cogexec mode, then RET/JMP back to hubexec mode. In this case, there's no need to refresh the FIFO (assuming the cogexec code didn't use one of the streaming modes that uses the FIFO).

Because the FIFO is sequential and not randomly accessed, it must be reloaded on every jump to hub memory, even if it contains the needed code.
