New Hub Scheme For Next Chip

Rayman · 2014-05-20 06:36

LMM programs are likely to stall, no matter what you do... But, I think the FIFO makes it no worse than before.
It could even help in some cases, I think...

Rayman · 2014-05-20 06:57

I wonder if things would be simpler if there were just 8 P2 cogs...
Seems like this new memory/FIFO scheme is dictated by having 16 cogs.

If there were only 8, then we could keep the old memory scheme...
There'd still be the new analog and many I/O pins and 512kB of HUB RAM to be excited about.
And, it would still run much faster than P1.
Would this simplify video output too?

On the other hand, I seem to be always low on cogs with P1...
But, still I'm low because it takes several cogs to do video and one for kb, mouse, etc...
The faster speed should make it easier to combine cogs or use less of them for video...

Maybe the choice here is between performance and simplicity...

Todd Marshall · 2014-05-20 07:20

Heater. wrote: »

I'd love an FPGA build of a plain "vanilla" Propeller II, with the new HUB arbiter, and what is done so far.

I'd love an FPGA build of a Propeller I.

Lawson · 2014-05-20 07:39

Heater. wrote: »

@Chip,

Many of us out here are going to our graves, early or otherwise, just trying to keep up with the increasing complexity of things.

For myself I would say:

1) Forget hubexec. I don't see it increasing speed over LMM much.

2) Forget FIFOs and HUB streaming. Most code accesses random locations most of the time.

3) Forget messing with increasing COG memory size by whatever tortuous means.

4) Heck, most applications won't use video or CORDIC, forget all that.

I'd love an FPGA build of a plain "vanilla" Propeller II, with the new HUB arbiter, and what is done so far.

Nice simple, understandable, easy to program, free of baggage that won't get used most of the time. Perhaps a tad less performant than some theoretical maximum but so what?

I'll add another.

5) With the 16 planned cogs, the cost of dedicating a cog to a high speed function is HALF of what it is on the P1. RDBLOCK on its own lets code stream at Fsys/2 with 2 cogs and Fsys/1 with 4 cogs, and each cog has enough spare cycles to do triggering, add sync info, etc. (so the same cost as a 1 or 2 cog video driver on the P1)

Marty

Bill Henning · 2014-05-20 08:04

Idea for the hub:

- use 50% of the cycles for streaming
- use 50% of the cycles for deterministic RDxxx/WRxxx

This solves:

- determinism, now every RDxxx/WRxxx works exactly the same as on a P1, every 16 instruction cycles (32 clock cycles) every cog gets a turn at the hub for RDxxx/WRxxx

- high bandwidth reads/writes (video, signal capture etc) - every cog gets 8 longs every 16 cycles/8 instructions (400MB/sec) for streaming without impacting random read/writes

Food for thought:

- if a cog had two tasks, one task could use the random read/write interface at full speed, and the other task could stream full speed

- 400MB/sec is more than enough for 1080p
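
A quick check of the 400MB/sec figure and the 1080p claim, as a sketch only. The 200 MHz system clock is an assumption inferred from the figure itself (8 longs per 16 clocks is 2 bytes per clock); the 60 Hz refresh rate and the pixel depths are also assumptions, and blanking intervals are ignored.

```python
# Assumptions not stated in the thread: 200 MHz clock, 60 Hz refresh,
# active pixels only (no blanking), 2 or 3 bytes per pixel.
CLOCK_HZ = 200_000_000
LONGS_PER_WINDOW = 8      # "8 longs every 16 cycles"
CLOCKS_PER_WINDOW = 16
BYTES_PER_LONG = 4


def stream_bandwidth(clock_hz=CLOCK_HZ):
    """Bytes per second one cog could stream under the proposed 50/50 split."""
    bytes_per_window = LONGS_PER_WINDOW * BYTES_PER_LONG
    return clock_hz * bytes_per_window // CLOCKS_PER_WINDOW


def video_bandwidth(width, height, refresh_hz, bytes_per_pixel):
    """Bytes per second needed to fetch every active pixel of every frame."""
    return width * height * refresh_hz * bytes_per_pixel


stream = stream_bandwidth()
need_16bpp = video_bandwidth(1920, 1080, 60, 2)
need_24bpp = video_bandwidth(1920, 1080, 60, 3)

print(stream, need_16bpp, need_24bpp)
```

Under these assumptions both pixel depths fit inside the streaming budget; a 4-byte pixel format would not.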

potatohead · 2014-05-20 08:39

I don't care about hubex.

Bill is on the right track. We are expecting one COG to do too much. Chip, I strongly encourage you to think about using them together more to get the really big numbers.

The one thing about LMM that makes me think out of the box a little is the lack of formal assembler support. What if a default kernel and the PASM instructions needed to make it really sing were in ROM?

It does not need to be a big ROM either.

pnut gets LMM support in the same way it got it for HUBEX. Programs are easy to write, and the ROM is defined as a constant the user could override when or if they supply their own kernel.

A cognew(@LMM, @program_start) would kick off the big program easy.

We all optimize the Smile out of that LMM image, adding software features and soft opcodes to get everything done super friendly, fast, easy.

I'll say this again: gcc needs to be awesome on this chip.

If doing that contradicts the above, do what it takes for C to run well.

Phil Pilgrim (PhiPi) · 2014-05-20 09:05

Chip still has not addressed how locks fit into this new scheme -- with or without the FIFO. Locking is a critical feature that cannot be omitted or glossed over. Locks are also hub resources that need to mesh deterministically with data reads and writes. The new hub-access regimes appear to complicate locking, though. Chip? Thoughts?

-Phil

cgracey · 2014-05-20 09:10

Thanks for all of your input, Guys.

My goal today is to get the FIFOs and hub memories defined and connected together. That is a huge building block that would lend some sanity if it were functional.

I think that going forward, non-aligned reads and writes are no problem. For reads and writes that extend across native long boundaries, one more clock is required for random accesses. For FIFO activity, it makes no timing difference. This is all because the next long in memory is always accessible on the next clock. It's really simple.

cgracey · 2014-05-20 09:13

Phil Pilgrim (PhiPi) wrote: »

Chip still has not addressed how locks fit into this new scheme -- with or without the FIFO. Locking is a critical feature that cannot be omitted or glossed over. Locks are also hub resources that need to mesh deterministically with data reads and writes. The new hub-access regimes appear to complicate locking, though. Chip? Thoughts?

-Phil

Say you have a span of hub memory that gets updated and another cog needs to know when it happened. The last element of data can be the 'updated' flag. I don't see any need for locks in all this.

Phil Pilgrim (PhiPi) · 2014-05-20 10:37

cgracey wrote:

Say you have a span of hub memory that gets updated and another cog needs to know when it happened. The last element of data can be the 'updated' flag. I don't see any need for locks in all this.

Huh? There will always be some cases where two processes share the same resource, and locks are necessary to synchronize that sharing. I don't think you can gloss over this so easily, Chip. Awhile back you asked if the new design needed to include locks, and the answer was an overwhelming "yes." I don't see anything in the new scheme that changes this.

-Phil

dMajo · 2014-05-20 10:43

cgracey wrote: »

As Bill pointed out, passing the byte-write-enable signals through the write-FIFO makes byte writes possible:

    WRINIT    addr    'finish writing current write-FIFO and start a new one at addr
    WRBYTEX        'skip first byte in write-FIFO
    WRBYTE    D/#    'add byte to write-FIFO
    WRWORD    D/#    'add word to write-FIFO, completes long, byte-write-enables = 10
    WRLONG    D/#    'add long to write-FIFO, byte-write-enables = 11
    WRWORDX        'skip word in write-FIFO
    WRWORD    D/#    'add word to write-FIFO, completes long, byte-write-enables = 00
    WRLONGX        'skip long in write-FIFO, completes long, byte-write-enables = 00
    WRBYTE    D/#    'add byte to write FIFO, byte-write-enables currently = 01
    WRINIT    addr    'finish writing current write-FIFO and start new one

    RDINIT    addr    'clear read-FIFO and begin reloading from addr
    RDBYTE    D    'read byte from read-FIFO into D
    RDBYTEX        'skip byte in read-FIFO
    RDWORD    D    'read word from read-FIFO into D, long completed, advance read-FIFO
    RDWORDX        'skip word in read-FIFO
    RDWORD    D    'read word from read-FIFO into D, long completed, advance read-FIFO
    RDLONGX        'skip long in read-FIFO, advance read-FIFO
    RDLONG    D    'read long from read-FIFO into D, advance read-FIFO

The big win here is that we get to unify the address scheme to 17 bits, always.

What am I missing here?

Can't the physical RAM be long addressed while logically it is byte addressed?
I mean, can't rd/wr-byte take a byte address where the 2 low address bits select a right shift by a multiple of 8 bits for a read and a left shift by a multiple of 8 bits for a write? For the write, the 2 low address bits can decode to wr_en[0..3].
Logical addresses[2..x] are wired to physical RAM address[0..x-2].

I don't see any aligned/unaligned rd/wr-byte here nor I see the need for WRxxxxX/RDxxxxX .... what am I missing?
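
The addressing idea in this post can be sketched in software. This is only a model under stated assumptions: hub RAM is a list of 32-bit longs, bytes are little-endian within a long, and the names `read_byte`, `write_byte`, and the returned one-hot `wr_en` value are illustrative, not part of any design in the thread.

```python
def read_byte(ram, addr):
    """Read one byte through a long-addressed RAM (little-endian lanes assumed)."""
    long_index = addr >> 2        # logical address[2..x] -> physical address[0..x-2]
    shift = (addr & 3) * 8        # low two bits select a right shift of 0/8/16/24
    return (ram[long_index] >> shift) & 0xFF


def write_byte(ram, addr, value):
    """Write one byte; returns the one-hot write-enable mask that was used."""
    long_index = addr >> 2
    lane = addr & 3
    shift = lane * 8              # low two bits select a left shift of 0/8/16/24
    wr_en = 1 << lane             # decode of the low two bits to wr_en[0..3]
    mask = 0xFF << shift
    old = ram[long_index]
    ram[long_index] = (old & ~mask & 0xFFFFFFFF) | ((value & 0xFF) << shift)
    return wr_en


ram = [0x44332211, 0x00000000]
print(hex(read_byte(ram, 3)))      # top byte of the first long
print(bin(write_byte(ram, 5, 0xAB)))
```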

cgracey · 2014-05-20 10:54

Phil Pilgrim (PhiPi) wrote: »

Huh? There will always be some cases where two processes share the same resource, and locks are necessary to synchronize that sharing. I don't think you can gloss over this so easily, Chip. Awhile back you asked if the new design needed to include locks, and the answer was an overwhelming "yes." I don't see anything in the new scheme that changes this.

-Phil

Sorry, Phil. I was thinking another process just needed to know when some new record had been written to hub memory. In that case, the last element could be written to non-0, signalling to another cog that the data was ready. After processing the data, the receiving cog could then write 0 to that last element to signal back to the sending cog that the data had been processed and it was safe to write a new record in the old one's place.

There will be locks, of course.
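
The flag handshake described in this post can be sketched as follows. It is a single-threaded model with hypothetical names (`try_send`, `try_receive`, `RECORD_LEN`); hub memory is a plain list, and the flag is the last element, written after the data.

```python
RECORD_LEN = 4
FLAG = RECORD_LEN          # index of the 'updated' flag, after the record data


def try_send(hub, record):
    """Sender side: write a record only if the previous one was consumed."""
    assert len(record) == RECORD_LEN
    if hub[FLAG] != 0:
        return False                  # receiver has not processed the old record
    for i, value in enumerate(record):
        hub[i] = value
    hub[FLAG] = 1                     # written last: signals "data ready"
    return True


def try_receive(hub):
    """Receiver side: return the record if one is ready, else None."""
    if hub[FLAG] == 0:
        return None
    record = hub[0:RECORD_LEN]
    hub[FLAG] = 0                     # signals back: "safe to overwrite"
    return record


hub = [0] * (RECORD_LEN + 1)
try_send(hub, [1, 2, 3, 4])
print(try_receive(hub))
```

This covers one sender and one receiver. It does not arbitrate between several cogs that all want to write the same region, which is the case Phil raises and the reason locks remain.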

cgracey · 2014-05-20 10:57

dMajo wrote: »

What am I missing here?

Can't the physical RAM be long addressed while logically it is byte addressed?
I mean, can't rd/wr-byte take a byte address where the 2 low address bits select a right shift by a multiple of 8 bits for a read and a left shift by a multiple of 8 bits for a write? For the write, the 2 low address bits can decode to wr_en[0..3].
Logical addresses[2..x] are wired to physical RAM address[0..x-2].

I don't see any aligned/unaligned rd/wr-byte here nor I see the need for WRxxxxX/RDxxxxX .... what am I missing?

In working on the memory this morning, I realized that all addresses will be 19 bits and there will be no alignment requirements, nor penalties for non-alignment. This is achieved by always using two consecutive hub slots for reading and writing, allowing for overlapped data. This adds one clock for random accesses, with no penalty for FIFO accesses. I'll post more later as it firms up.
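
A software model of why two consecutive longs are enough for any unaligned access, offered as a sketch rather than a description of the actual hardware. Assumptions: hub RAM is a list of 32-bit longs, data is little-endian, the 19-bit address splits into a long index (upper 17 bits) and a byte offset (lower 2 bits), and the wrap at the end of RAM is invented for the model.

```python
ADDR_BITS = 19


def split_address(addr):
    """Split a 19-bit byte address into (long index, byte offset)."""
    addr &= (1 << ADDR_BITS) - 1
    return addr >> 2, addr & 3


def read_unaligned(ram, addr, size):
    """Read size bytes (1, 2 or 4) from any byte address, little-endian."""
    index, offset = split_address(addr)
    low = ram[index]                        # first hub slot
    high = ram[(index + 1) % len(ram)]      # next slot; wrap is an assumption
    combined = (high << 32) | low           # 8 consecutive bytes
    mask = (1 << (8 * size)) - 1
    return (combined >> (8 * offset)) & mask


ram = [0x44332211, 0x88776655]
print(hex(read_unaligned(ram, 1, 4)))       # long straddling both entries
```

An aligned access never needs the second long, which is why the extra clock could later be skipped for aligned addresses.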

Bill Henning · 2014-05-20 11:01

I think that non-aligned reads/writes are not important enough to slow down random access. GCC and other compilers already long-align longs, word-align words, so I am not sure we gain anything with non-aligned reads.

cgracey wrote: »

In working on the memory this morning, I realized that all addresses will be 19 bits and there will be no alignment requirements, nor penalties for non-alignment. This is achieved by always using two consecutive hub slots for reading and writing, allowing for overlapped data. This adds one clock for random accesses, with no penalty for FIFO accesses. I'll post more later as it firms up.

Phil Pilgrim (PhiPi) · 2014-05-20 11:10

cgracey wrote:

There will be locks, of course.

Okay, good. In the P1, locks were implemented via round-robin hub accesses, same as -- and in sequence with -- hub memory accesses. It worked simply, because only one cog could have access to the hub at the same time. That has all changed with the new scheme. How will locks be implemented in the P2?

-Phil

cgracey · 2014-05-20 11:15

Bill Henning wrote: »

I think that non-aligned reads/writes are not important enough to slow down random access. GCC and other compilers already long-align longs, word-align words, so I am not sure we gain anything with non-aligned reads.

I like it because it unifies the addressing scheme. Everything will be based on byte addresses. It keeps everything cleaner. If we weren't supporting bytes and words, we could toss two address bits. As long as we have them, though, we might as well make it as painless to think about as possible. It will be possible to not add the extra clock cycle in case of aligned accesses, but for now, to get it working, I'm going to standardize it. We can optimize it once it's working.

Baggers · 2014-05-20 11:35

Awesome news Chip

cgracey · 2014-05-20 11:37

Phil Pilgrim (PhiPi) wrote: »

Okay, good. In the P1, locks were implemented via round-robin hub accesses, same as -- and in sequence with -- hub memory accesses. It worked simply, because only one cog could have access to the hub at the same time. That has all changed with the new scheme. How will locks be implemented in the P2?

-Phil

I just assumed the same way as in Prop1. A cog using hub RAM would need to finish his write or read before relinquishing the lock. Do you think that's okay?

Phil Pilgrim (PhiPi) · 2014-05-20 11:44

cgracey wrote:

I just assumed the same way as in Prop1. A cog using hub RAM would need to finish his write or read before relinquishing the lock. Do you think that's okay?

So the hub access mechanism for locks will be round-robin as it was for the P1, i.e. separate from the memory-access mechanism? For direct hub reads and writes, can we assume, then, that the program will stall until the transfer takes place? But for FIFO writes, how will the program know that all data were transferred so that the lock state can be changed?

-Phil

cgracey · 2014-05-20 11:49

Phil Pilgrim (PhiPi) wrote: »

So the hub access mechanism for locks will be round-robin as it was for the P1, i.e. separate from the memory-access mechanism? For direct hub reads and writes, can we assume, then, that the program will stall until the transfer takes place? But for FIFO writes, how will the program know that all data were transferred so that the lock state can be changed?

-Phil

We'll need an instruction that waits until the write-FIFO is empty, I think.
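
Why the wait matters can be shown with a small model. Everything here is hypothetical: the `WriteFifo` class, its `hub_slot` and `wait_empty` methods, and the dictionary used as a lock stand in for hardware that had not been designed at this point in the thread.

```python
from collections import deque


class WriteFifo:
    """Toy write-FIFO: writes are queued and reach hub RAM later."""

    def __init__(self, hub):
        self.hub = hub
        self.pending = deque()

    def write(self, addr, value):
        self.pending.append((addr, value))    # returns before hub is updated

    def hub_slot(self):
        """One hub window: commit at most one queued write."""
        if self.pending:
            addr, value = self.pending.popleft()
            self.hub[addr] = value

    def wait_empty(self):
        """Stand-in for a 'wait until the write-FIFO is empty' instruction."""
        while self.pending:
            self.hub_slot()


def update_shared(fifo, lock, values):
    """Take the lock, queue the writes, drain the FIFO, then release."""
    assert not lock["held"]
    lock["held"] = True
    for addr, value in enumerate(values):
        fifo.write(addr, value)
    fifo.wait_empty()          # without this, the release could overtake the data
    lock["held"] = False


hub = [0] * 4
fifo = WriteFifo(hub)
update_shared(fifo, {"held": False}, [1, 2, 3, 4])
print(hub)
```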

Bill Henning · 2014-05-20 11:55

Makes sense.

If the later FPGA image / full silicon won't have the extra cycle penalty for aligned access, that would be perfect.

cgracey wrote: »

I like it because it unifies the addressing scheme. Everything will be based on byte addresses. It keeps everything cleaner. If we weren't supporting bytes and words, we could toss two address bits. As long as we have them, though, we might as well make it as painless to think about as possible. It will be possible to not add the extra clock cycle in case of aligned accesses, but for now, to get it working, I'm going to standardize it. We can optimize it once it's working.

Phil Pilgrim (PhiPi) · 2014-05-20 12:11

cgracey wrote:

We'll need an instruction that waits until the write-FIFO is empty, I think.

That would solve it -- anything that prevents the state change for the lock from occurring out of sequence.

Thanks,
-Phil

dnalor · 2014-05-20 12:19

Heater. wrote: »

@Cluso,

Up to and including all that you have, 500 odd K in this case.

C/C++ allows the programmer to declare structures, classes, and arrays on the stack, of arbitrary size.
Then of course recursive code will blow up the stack.

Cluso99 wrote: »

David & Ross,
I asked because if we get 4KB as I suggested, then running a C program would likely yield a 3KB stack = 768 longs.
Would this be enough?
If the routine pushing onto the stack checked the used depth, then the overflow could be placed into hub.
But if the stack requirements >>768 then it might not be beneficial.

Arrays on the stack!!! You guys never programmed in C on a microcontroller !!!!
One to four parameters. Bytes, words, longs, pointers. Maybe up to six levels deep.
Recursive code I never needed.

jmg · 2014-05-20 12:34

cgracey wrote: »

In working on the memory this morning, I realized that all addresses will be 19 bits and there will be no alignment requirements, nor penalties for non-alignment. This is achieved by always using two consecutive hub slots for reading and writing, allowing for overlapped data. This adds one clock for random accesses, with no penalty for FIFO accesses. I'll post more later as it firms up.

Sounds great

Heater. · 2014-05-20 13:19

dnalor,

Arrays on the stack!!! You guys never programmed in C on a microcontroller !!!!

Sure we have. I can make an FFT or FullDuplexSerial driver in the 512 longs register space of a COG.
Can you?

You are living in the early 1980's

The proposed Propeller has 512K of RAM. That is a ton more than the machines the C language was developed on!

Recursive code I never needed.

You have been missing out on lots of fun

David Betz · 2014-05-20 13:41

Are there still instructions that do what RDLONG, RDWORD, and RDBYTE do on P1?

Are there still instructions that do what WRLONG, WRWORD, and WRBYTE do on P1?

In other words, are these FIFO instructions being added to the P1-style hub memory access instructions or do they replace them?

Rayman · 2014-05-20 14:00

cgracey wrote: »

We'll need an instruction that waits until the write-FIFO is empty, I think.

Couldn't you just have non-fifo hub instructions just stall until the hub access is free?
That's more like it is on P1...

cgracey · 2014-05-20 14:39

David Betz wrote: »

Are there still instructions that do what RDLONG, RDWORD, and RDBYTE do on P1?

Are there still instructions that do what WRLONG, WRWORD, and WRBYTE do on P1?

In other words, are these FIFO instructions being added to the P1-style hub memory access instructions or do they replace them?

For the first FPGA version, I'll just have the following:

RDINIT	D/#19bit	- wait until FIFO write done, set read mode, begin reloading FIFO from address D[18:2] with byte offset D[1:0]
RDLONG	D		- read long from FIFO
RDWORD	D		- read word from FIFO
RDBYTE	D		- read byte from FIFO

WRINIT	D/#19bit	- wait until FIFO write done, set write mode at address D[18:2] with byte offset D[1:0]
WRLONG	D/#		- write long to FIFO *
WRWORD	D/#		- write word to FIFO *
WRBYTE	D/#		- write byte to FIFO *

* optional filter mode doesn't write bytes when they are $FF

I need to get this FIFO stuff working properly before I can add the random accesses.
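
The optional filter mode can be modelled with per-byte write enables. This is a sketch under assumptions: hub RAM is a list of 32-bit longs, lane 0 is the least significant byte, and `byte_enables` and `write_long` are illustrative names, not instructions from the list above.

```python
def byte_enables(value, filter_ff):
    """4-bit write-enable mask for a long; bit n covers byte lane n."""
    enables = 0
    for lane in range(4):
        byte = (value >> (8 * lane)) & 0xFF
        if not (filter_ff and byte == 0xFF):
            enables |= 1 << lane          # lane is written unless filtered out
    return enables


def write_long(ram, index, value, filter_ff=False):
    """Write a long, leaving filtered byte lanes untouched; returns the enables."""
    enables = byte_enables(value, filter_ff)
    mask = 0
    for lane in range(4):
        if enables & (1 << lane):
            mask |= 0xFF << (8 * lane)
    ram[index] = (ram[index] & ~mask & 0xFFFFFFFF) | (value & mask)
    return enables


ram = [0x11223344]
write_long(ram, 0, 0xAAFFBBFF, filter_ff=True)
print(hex(ram[0]))                        # $FF lanes keep their old contents
```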

cgracey · 2014-05-20 14:43

Rayman wrote: »

Couldn't you just have non-fifo hub instructions just stall until the hub access is free?
That's more like it is on P1...

For random accesses, yes.

Bill Henning · 2014-05-20 14:49

The more I think about it, the more I like these FIFO reads/writes, especially as they allow non-aligned access.

Questions:

Can the RD calls set Z if the value is zero, and C to the highest bit, like the P2 instructions?

Does an RDINIT flush any data from the FIFO that may still be in it? Or does it wait until the read fifo is empty?

Do writes have priority over reads?

cgracey wrote: »

For the first FPGA version, I'll just have the following:

RDINIT	D/#19bit	- wait until FIFO write done, set read mode, begin reloading FIFO from address D[18:2] with byte offset D[1:0]
RDLONG	D		- read long from FIFO
RDWORD	D		- read word from FIFO
RDBYTE	D		- read byte from FIFO

WRINIT	D/#19bit	- wait until FIFO write done, set write mode at address D[18:2] with byte offset D[1:0]
WRLONG	D/#		- write long to FIFO *
WRWORD	D/#		- write word to FIFO *
WRBYTE	D/#		- write byte to FIFO *

* optional filter mode doesn't write bytes when they are $FF

I need to get this FIFO stuff working properly before I can add the random accesses.
