LMM programs are likely to stall, no matter what you do... But, I think the FIFO makes it no worse than before.
It could even help in some cases, I think...
I wonder if things would be simpler if there were just 8 P2 cogs...
Seems like this new memory/FIFO scheme is dictated by having 16 cogs.
If there were only 8, then we could keep the old memory scheme...
There'd still be the new analog and many I/O pins and 512kB of HUB RAM to be excited about.
And, it would still run much faster than P1.
Would this simplify video output too?
On the other hand, I seem to be always low on cogs with P1...
But, still I'm low because it takes several cogs to do video and one for kb, mouse, etc...
The faster speed should make it easier to combine cogs or use less of them for video...
Maybe the choice here is between performance and simplicity...
Many of us out here are going to our graves, early or otherwise, just trying to keep up with the increasing complexity of things.
For myself I would say:
1) Forget hubexec. I don't see it increasing speed over LMM much.
2) Forget FIFOs and HUB streaming. Most code accesses random locations most of the time.
3) Forget messing with increasing COG memory size by whatever tortuous means.
4) Heck, most applications won't use video or CORDIC, forget all that.
I'd love an FPGA build of a plain "vanilla" Propeller II, with the new HUB arbiter, and what is done so far.
Nice simple, understandable, easy to program, free of baggage that won't get used most of the time. Perhaps a tad less performant than some theoretical maximum but so what?
I'll add another.
5) With the 16 planned cogs, the cost of dedicating a cog to a high-speed function is HALF of what it is on the P1. RDBLOCK on its own lets code stream at Fsys/2 with 2 cogs and Fsys/1 with 4 cogs, and each cog has enough spare cycles to do triggering, add sync info, etc. (so the same cost as a 1 or 2 cog video driver on the P1)
- use 50% of the cycles for streaming
- use 50% of the cycles for deterministic RDxxx/WRxxx
This solves:
- determinism, now every RDxxx/WRxxx works exactly the same as on a P1, every 16 instruction cycles (32 clock cycles) every cog gets a turn at the hub for RDxxx/WRxxx
- high bandwidth reads/writes (video, signal capture etc) - every cog gets 8 longs every 16 cycles/8 instructions (400MB/sec) for streaming without impacting random read/writes
Food for thought:
- if a cog had two tasks, one task could use the random read/write interface at full speed, and the other task could stream full speed
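The 400MB/sec figure above can be checked with a little arithmetic. This is only a sketch: the 200 MHz system clock is an assumption (the thread does not state it; it is the clock at which 8 longs per 16 clocks works out to the quoted number), and `stream_bandwidth` is a hypothetical helper name.

```python
# Assumption: 200 MHz system clock. Not stated in the thread; chosen because
# it is the clock at which the quoted 400MB/sec figure works out.
FSYS_HZ = 200_000_000
BYTES_PER_LONG = 4


def stream_bandwidth(longs_per_window, clocks_per_window, fsys_hz=FSYS_HZ):
    """Bytes per second when a cog streams `longs_per_window` longs
    every `clocks_per_window` system clocks."""
    return longs_per_window * BYTES_PER_LONG * fsys_hz // clocks_per_window


# 8 longs every 16 clocks, as in the post above.
per_cog = stream_bandwidth(8, 16)
print(per_cog // 1_000_000, "MB/sec per cog")
```

In other words, 8 longs per 16 clocks is 2 bytes per clock, so the bandwidth scales directly with whatever the final clock rate turns out to be.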
Bill is on the right track. We are expecting one COG to do too much. Chip, I strongly encourage you to think about using them together more to get the really big numbers.
The one thing about LMM that makes me think out of the box a little is the lack of formal assembler support. What if a default kernel and the PASM instructions needed to make it really sing were in ROM?
It does not need to be a big ROM either.
pnut gets LMM support in the same way it got it for HUBEX. Programs are easy to write, and the ROM is defined as a constant the user could override when or if they supply their own kernel.
A cognew(@LMM, @program_start) would kick off the big program easy.
We all optimize the Smile out of that LMM image, adding software features and soft opcodes to get everything done super friendly, fast, easy.
I'll say this again: gcc needs to be awesome on this chip.
If doing that contradicts the above, do what it takes for C to run well.
My goal today is to get the FIFOs and hub memories defined and connected together. That is a huge building block that would lend some sanity if it were functional.
I think that going forward, non-aligned reads and writes are no problem. For reads and writes that extend across native long boundaries, one more clock is required for random accesses. For FIFO activity, it makes no timing difference. This is all because the next long in memory is always accessible on the next clock. It's really simple.
Chip still has not addressed how locks fit into this new scheme -- with or without the FIFO. Locking is a critical feature that cannot be omitted or glossed over. Locks are also hub resources that need to mesh deterministically with data reads and writes. The new hub-access regimes appear to complicate locking, though. Chip? Thoughts?
-Phil
Say you have a span of hub memory that gets updated and another cog needs to know when it happened. The last element of data can be the 'updated' flag. I don't see any need for locks in all this.
Huh? There will always be some cases where two processes share the same resource, and locks are necessary to synchronize that sharing. I don't think you can gloss over this so easily, Chip. Awhile back you asked if the new design needed to include locks, and the answer was an overwhelming "yes." I don't see anything in the new scheme that changes this.
As Bill pointed out, passing the byte-write-enable signals through the write-FIFO makes byte writes possible:
WRINIT addr 'finish writing current write-FIFO and start a new one at addr
WRBYTEX 'skip first byte in write-FIFO
WRBYTE D/# 'add byte to write-FIFO
WRWORD D/# 'add word to write-FIFO, completes long, byte-write-enables = 10
WRLONG D/# 'add long to write-FIFO, byte-write-enables = 11
WRWORDX 'skip word in write-FIFO
WRWORD D/# 'add word to write-FIFO, completes long, byte-write-enables = 00
WRLONGX 'skip long in write-FIFO, completes long, byte-write-enables = 00
WRBYTE D/# 'add byte to write FIFO, byte-write-enables currently = 01
WRINIT addr 'finish writing current write-FIFO and start new one
RDINIT addr 'clear read-FIFO and begin reloading from addr
RDBYTE D 'read byte from read-FIFO into D
RDBYTEX 'skip byte in read-FIFO
RDWORD D 'read word from read-FIFO into D, long completed, advance read-FIFO
RDWORDX 'skip word in read-FIFO
RDWORD D 'read word from read-FIFO into D, long completed, advance read-FIFO
RDLONGX 'skip long in read-FIFO, advance read-FIFO
RDLONG D 'read long from read-FIFO into D, advance read-FIFO
The big win here is that we get to unify the address scheme to 17 bits, always.
What am I missing here?
Can't the physical ram be long addressed while logically it can be byte?
I mean can't rd/wr-byte have a byte address where the 2 low lsb indicates multiple of 8 right bit shifts for read and multiple of 8 bit left shifts for write? For the write the 2 low address lsb can decode to wr_en[0..3]
logical addresses[2..x] are wired to physical ram address[0..x-2]
I don't see any aligned/unaligned rd/wr-byte here nor I see the need for WRxxxxX/RDxxxxX .... what am I missing?
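The mapping described in this question can be sketched in a few lines. This is an illustration only, not Chip's design: the function names are hypothetical, hub RAM is modelled as a Python list of 32-bit longs, and little-endian byte lanes (lane 0 in bits 7..0) are assumed.

```python
def split_address(byte_addr):
    """Logical byte address -> (physical long index, byte lane 0..3)."""
    return byte_addr >> 2, byte_addr & 3


def read_byte(ram, byte_addr):
    """Read: shift the addressed long right by 8 * lane and keep 8 bits."""
    index, lane = split_address(byte_addr)
    return (ram[index] >> (8 * lane)) & 0xFF


def write_enables(byte_addr):
    """Decode the two low address bits into a one-hot wr_en[3..0] mask."""
    return 1 << (byte_addr & 3)


def write_byte(ram, byte_addr, value):
    """Write: shift the byte left by 8 * lane; only that lane is updated."""
    index, lane = split_address(byte_addr)
    mask = 0xFF << (8 * lane)
    ram[index] = (ram[index] & ~mask) | ((value & 0xFF) << (8 * lane))


ram = [0x44332211, 0x00000000]
write_byte(ram, 6, 0xAB)       # long 1, lane 2
print(hex(read_byte(ram, 1)), hex(ram[1]), bin(write_enables(6)))
```

This covers single bytes only; a word or long that straddles a long boundary touches two physical longs, which is the case the rest of the thread is concerned with.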
Sorry, Phil. I was thinking another process just needed to know when some new record had been written to hub memory. In that case, the last element could be written to non-0, signalling to another cog that the data was ready. After processing the data, the receiving cog could then write 0 to that last element to signal back to the sending cog that the data had been processed and it was safe to write a new record in the old one's place.
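The flag handshake described here can be sketched as follows. It is a minimal single-threaded model under stated assumptions: hub memory is a plain Python list, the record is four longs with the flag as the last element, and the function names are hypothetical. It only covers the one-sender, one-receiver case from the post; it says nothing about several cogs contending for the same record.

```python
RECORD_LEN = 4            # assumed record size, in longs
FLAG = RECORD_LEN         # the flag is the last element, after the data

hub = [0] * (RECORD_LEN + 1)


def sender_try_write(hub, record):
    """Write a new record only if the previous one has been consumed."""
    if hub[FLAG] != 0:
        return False              # receiver has not finished yet
    hub[0:RECORD_LEN] = record    # data first...
    hub[FLAG] = 1                 # ...flag last, signalling "data ready"
    return True


def receiver_try_read(hub):
    """Take the record if one is ready, then hand the buffer back."""
    if hub[FLAG] == 0:
        return None               # nothing new yet
    record = list(hub[0:RECORD_LEN])
    hub[FLAG] = 0                 # signal "processed, safe to overwrite"
    return record


sender_try_write(hub, [1, 2, 3, 4])
print(receiver_try_read(hub))
```

The ordering matters: the flag is written after the data, so a receiver that sees a non-zero flag never reads a half-written record.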
In working on the memory this morning, I realized that all addresses will be 19 bits and there will be no alignment requirements, nor penalties for non-alignment. This is achieved by always using two consecutive hub slots for reading and writing, allowing for overlapped data. This adds one clock for random accesses, with no penalty for FIFO accesses. I'll post more later as it firms up.
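One way to picture the "two consecutive hub slots" idea is the sketch below. It is a behavioural model only, not the actual hardware: hub RAM is a list of 32-bit longs, byte order is assumed little-endian, `read_long_unaligned` is a hypothetical name, and the end-of-memory case is not modelled.

```python
def read_long_unaligned(ram, byte_addr):
    """Read a 32-bit long at any byte address by fetching two consecutive
    longs and extracting the 32 bits that start at the byte offset."""
    index = byte_addr >> 2        # long address, i.e. address bits [18:2]
    offset = byte_addr & 3        # byte offset, i.e. address bits [1:0]
    low = ram[index]              # first hub slot
    high = ram[index + 1]         # second (next) hub slot
    combined = (high << 32) | low
    return (combined >> (8 * offset)) & 0xFFFFFFFF


ram = [0x44332211, 0x88776655]
print(hex(read_long_unaligned(ram, 1)))
```

With offset 0 the second long contributes nothing, which matches the later remark that the extra clock could be dropped for aligned accesses.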
I think that non-aligned reads/writes are not important enough to slow down random access. GCC and other compilers already long-align longs, word-align words, so I am not sure we gain anything with non-aligned reads.
I like it because it unifies the addressing scheme. Everything will be based on byte addresses. It keeps everything cleaner. If we weren't supporting bytes and words, we could toss two address bits. As long as we have them, though, we might as well make it as painless to think about as possible. It will be possible to not add the extra clock cycle in case of aligned accesses, but for now, to get it working, I'm going to standardize it. We can optimize it once it's working.
Okay, good. In the P1, locks were implemented via round-robin hub accesses, same as -- and in sequence with -- hub memory accesses. It worked simply, because only one cog could have access to the hub at the same time. That has all changed with the new scheme. How will locks be implemented in the P2?
-Phil
I just assumed the same way as in Prop1. A cog using hub RAM would need to finish his write or read before relinquishing the lock. Do you think that's okay?
So the hub access mechanism for locks will be round-robin as it was for the P1, i.e. separate from the memory-access mechanism? For direct hub reads and writes, can we assume, then, that the program will stall until the transfer takes place? But for FIFO writes, how will the program know that all data were transferred so that the lock state can be changed?
-Phil
We'll need an instruction that waits until the write-FIFO is empty, I think.
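The ordering problem Phil raises, and the "wait until the write-FIFO is empty" answer, can be sketched like this. Everything here is hypothetical: the `Cog` and `Lock` classes, their method names, and the queue-based FIFO are illustrations of the ordering only, not proposed instructions or real hardware behaviour.

```python
from collections import deque


class Lock:
    """A single hub lock: at most one cog owns it at a time."""
    def __init__(self):
        self.owner = None

    def try_set(self, cog_id):
        if self.owner is None:
            self.owner = cog_id
            return True
        return False

    def clear(self, cog_id):
        if self.owner == cog_id:
            self.owner = None


class Cog:
    """A cog whose FIFO writes reach hub memory some time later."""
    def __init__(self, cog_id, hub):
        self.cog_id = cog_id
        self.hub = hub
        self.pending = deque()    # write-FIFO contents not yet in hub

    def fifo_write(self, addr, value):
        self.pending.append((addr, value))

    def wait_fifo_empty(self):
        # Stand-in for the proposed "wait until write-FIFO is empty".
        while self.pending:
            addr, value = self.pending.popleft()
            self.hub[addr] = value


hub = [0] * 8
lock = Lock()
writer = Cog(0, hub)

if lock.try_set(writer.cog_id):
    writer.fifo_write(0, 123)
    writer.fifo_write(1, 456)
    writer.wait_fifo_empty()      # all data is in hub before...
    lock.clear(writer.cog_id)     # ...the lock is released
print(hub[:2], lock.owner)
```

Without the wait step, another cog could take the lock and read the shared area while some of the writer's data is still sitting in its FIFO.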
Up to and including all that you have, 500 odd K in this case.
C/C++ allows the programmer to declare structures, classes, and arrays on the stack, of arbitrary size.
Then of course recursive code will blow up the stack.
David & Ross,
I asked because if we get 4KB as I suggested, then running a C program would likely yield a 3KB stack = 768 longs.
Would this be enough?
If the routine pushing onto the stack checked the used depth, then the overflow could be placed into hub.
But if the stack requirements >>768 then it might not be beneficial.
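The overflow idea in this post can be sketched as below. It is a rough model under stated assumptions: `SpillingStack` is a hypothetical name, both memories are plain Python lists, and the extra cost of hub accesses is not modelled.

```python
# 3KB of cog stack at 4 bytes per long, as in the post above.
COG_STACK_LONGS = 3 * 1024 // 4


class SpillingStack:
    """Stack that fills cog memory first and spills the overflow to hub."""
    def __init__(self, cog_capacity=COG_STACK_LONGS):
        self.cog_capacity = cog_capacity
        self.cog = []     # fast stack in cog memory
        self.hub = []     # overflow area in hub memory

    def push(self, value):
        # The depth check happens on every push, as suggested.
        if len(self.cog) < self.cog_capacity:
            self.cog.append(value)
        else:
            self.hub.append(value)

    def pop(self):
        # Anything spilled to hub was pushed last, so it comes off first.
        if self.hub:
            return self.hub.pop()
        return self.cog.pop()


stack = SpillingStack()
for i in range(COG_STACK_LONGS + 2):
    stack.push(i)
print(COG_STACK_LONGS, len(stack.cog), len(stack.hub))
```

As the post notes, this only pays off if the typical depth stays within the cog portion; a program that lives mostly in the spilled region pays the depth check and gets none of the speed benefit.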
Arrays on the stack!!! You guys never programmed in C on a microcontroller !!!!
One to four parameters. Bytes, words, longs, pointers. Maybe up to six levels deep.
Recursive code I never needed.
Are there still instructions that do what RDLONG, RDWORD, and RDBYTE do on P1?
Are there still instructions that do what WRLONG, WRWORD, and WRBYTE do on P1?
In other words, are these FIFO instructions being added to the P1-style hub memory access instructions or do they replace them?
For the first FPGA version, I'll just have the following:
RDINIT D/#19bit - wait until FIFO write done, set read mode, begin reloading FIFO from address D[18:2] with byte offset D[1:0]
RDLONG D - read long from FIFO
RDWORD D - read word from FIFO
RDBYTE D - read byte from FIFO
WRINIT D/#19bit - wait until FIFO write done, set write mode at address D[18:2] with byte offset D[1:0]
WRLONG D/# - write long to FIFO *
WRWORD D/# - write word to FIFO *
WRBYTE D/# - write byte to FIFO *
* optional filter mode doesn't write bytes when they are $FF
I need to get this FIFO stuff working properly before I can add the random accesses.
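To make the listed instructions concrete, here is a small behavioural model. It is a sketch, not the FPGA design: `HubFifo` is a hypothetical class, hub memory is a `bytearray`, data is assumed little-endian, no timing or FIFO depth is modelled, and in filter mode the pointer is assumed to advance past bytes that are skipped.

```python
class HubFifo:
    """Sequential hub access in the style of RDINIT/RDxxxx and WRINIT/WRxxxx."""
    ADDR_MASK = 0x7FFFF           # 19-bit byte address

    def __init__(self, hub):
        self.hub = hub            # bytearray standing in for hub RAM
        self.ptr = 0
        self.filter_ff = False

    def rdinit(self, addr):
        self.ptr = addr & self.ADDR_MASK

    def _read(self, nbytes):
        chunk = self.hub[self.ptr:self.ptr + nbytes]
        self.ptr += nbytes
        return int.from_bytes(chunk, "little")

    def rdlong(self):
        return self._read(4)

    def rdword(self):
        return self._read(2)

    def rdbyte(self):
        return self._read(1)

    def wrinit(self, addr, filter_ff=False):
        self.ptr = addr & self.ADDR_MASK
        self.filter_ff = filter_ff

    def _write(self, value, nbytes):
        data = (value & ((1 << (8 * nbytes)) - 1)).to_bytes(nbytes, "little")
        for b in data:
            # Optional filter mode: $FF bytes are not written.
            if not (self.filter_ff and b == 0xFF):
                self.hub[self.ptr] = b
            self.ptr += 1         # assumption: pointer advances either way

    def wrlong(self, value):
        self._write(value, 4)

    def wrword(self, value):
        self._write(value, 2)

    def wrbyte(self, value):
        self._write(value, 1)


hub = bytearray(range(16))
fifo = HubFifo(hub)
fifo.rdinit(1)                    # start at an unaligned byte address
print(fifo.rdbyte(), hex(fifo.rdword()), hex(fifo.rdlong()))
```

Mixed-size reads simply continue from wherever the previous one stopped, which is why alignment costs nothing on the FIFO path.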
Comments
Marty
- 400MB/sec is more than enough for 1080p
There will be locks, of course.
If the later fpga image / full silicon won't have the extra cycle penalty for aligned access, that would be perfect
Thanks,
-Phil
Sounds great
Can you?
You are living in the early 1980's
The proposed Propeller has 512K of RAM. That is a ton more than the machines the C language was developed on! You have been missing out on lots of fun
Couldn't you just have non-fifo hub instructions just stall until the hub access is free?
That's more like it is on P1...
For random accesses, yes.
Questions:
Can the RD calls set Z and C if zero, and the highest bit, like the P2 instructions?
Does an RDINIT flush any data from the FIFO that may still be in it? Or does it wait until the read fifo is empty?
Do writes have priority over reads?