Removing byte/word addressing would significantly slow down text handling, virtual machines, and byte-oriented graphics.
Especially bad is writing a byte, as we would have to read a long, AND/OR the new byte in, and write the long back.
Perhaps the solution is to only use longs for streaming, and have the RDxxxx{B|W|L} / WRxxxx{B|W|L} instructions bypass the FIFO and go to the hub directly?
A single line of read cache would allow for RDxxxxC, thus all hub reads could then be long reads.
A single line of write cache (but it would need a read-before-write) could allow write gathering.
I really like the new fast hub interface btw, and love the idea of per-clock streaming into/out of the hub!
In working out this read-FIFO, I see it would be possible to have a write-FIFO, as well. This would enable data to be streamed either direction at one long per clock, never stalling execution. No stalls could be achieved by reading and writing as many longs as were in each FIFO, at a time.
Handling sizes other than longs (bytes and words) gums up a lot of things in the hardware, as well as the documentation. It makes the addressing more difficult to understand.
I've asked before and the consensus was that byte read and write instructions were absolutely needed, but are you sure about that? For text data that is large, four bytes to a long are certainly needed, and without byte operations, some software parsing of longs would be required. String buffers could use a long per character, though, as they are usually quite limited in size. Byte and word data could still be declared in your code, as now, but PASM would only support long reads and writes. Hub addressing would become a uniform 17 bits for 128K longs. What do you think?
I say give it a trial run with the next FPGA image. We have to adapt code anyway. That way we can see how much we really would miss it. We'd get to an FPGA image quicker and force the issue as to whether we really need it.
Sounds great. Any more details yet? One-way, I think this can be done in dual-port RAM to keep it small.
The issue depends on direction, I think.
For Read, Byte & word could be managed by 'opcode offset into 32b' - so the fetch is always 32b from hub, but the register load also does a shift based on the LSB.
Write is not so easy, as partial change of a HUB 32b needs atomic Read-Modify-Write
Another way to ease losing HUB Byte Writes, would be to add Byte-Word opcodes to index into COG memory, which is then BLOCK command moved to/from HUB - this probably makes a case for a little more COG RAM, to be used for String/Char buffers.
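To make the read side concrete, here is a minimal sketch of the "always fetch 32b, then shift by the address LSBs" idea. It is only a software model, not the actual cog logic: the function name `extract_from_long` is made up for illustration, and it assumes little-endian byte lanes (byte 0 in bits 7..0) and that words are word-aligned.

```python
def extract_from_long(long_value, byte_addr, size):
    """Model a byte/word/long read when the hub only ever returns whole longs.

    long_value: the 32-bit value fetched from the hub
    byte_addr:  the byte address requested (only the low 2 bits are used here)
    size:       1 for byte, 2 for word, 4 for long
    """
    offset = byte_addr & 3                 # which byte lane inside the long
    if offset % size:
        raise ValueError("unaligned access is not modelled in this sketch")
    shift = 8 * offset                     # shift derived from the address LSBs
    mask = (1 << (8 * size)) - 1           # $FF, $FFFF or $FFFFFFFF
    return (long_value >> shift) & mask


fetched = 0x44332211                       # one long fetched from the hub
low_byte = extract_from_long(fetched, 0, 1)    # byte lane 0
next_byte = extract_from_long(fetched, 1, 1)   # byte lane 1
high_word = extract_from_long(fetched, 2, 2)   # upper word
```

The point is that nothing on the hub side needs to know about bytes or words for reads; the selection happens after the fetch.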
carrying four "byte write enable" signals in the fifo would allow bytes/words to be written without a read/modify/write
the cog would place the byte/word value in the correct place in the long, and the write enables would make sure only the desired change would be made to the hub memory.
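A rough software model of the byte-write-enable idea, for anyone who wants to see it spelled out. The names `make_write` and `hub_write` are hypothetical, and the model assumes little-endian byte lanes with enable bit 0 belonging to byte 0. In real hardware the disabled lanes would simply not be written; the masking in `hub_write` only imitates that in software.

```python
def make_write(value, byte_addr, size):
    """Cog side: place a byte/word/long in its lane and build the 4 enables."""
    offset = byte_addr & 3
    if offset % size:
        raise ValueError("unaligned access is not modelled in this sketch")
    data = (value & ((1 << (8 * size)) - 1)) << (8 * offset)
    enables = ((1 << size) - 1) << offset      # one enable bit per byte lane
    return data, enables


def hub_write(hub, long_index, data, enables):
    """Hub side: only lanes whose enable bit is set are changed."""
    lane_mask = 0
    for lane in range(4):
        if (enables >> lane) & 1:
            lane_mask |= 0xFF << (8 * lane)
    # Software stand-in for per-lane write strobes; no hub read happens in hardware.
    hub[long_index] = (hub[long_index] & ~lane_mask & 0xFFFFFFFF) | (data & lane_mask)


hub = [0xAABBCCDD]                             # a one-long "hub" for the demo
data, enables = make_write(0x11, 2, 1)         # write one byte to lane 2
hub_write(hub, 0, data, enables)
```

The cog sends one long plus four enable bits in a single transaction, so the other three bytes are never touched.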
I think the FIFO is for Linear HW streaming only, as the spoke rate is 2x what a COG can handle opcodes at.
- but the idea of four "byte write enable" signals does solve the RMW problem, which is especially serious as simply adding a longer opcode bangs into the fact the LSB is spinning at fSys, so there is only one Slot to do any COG transactions.
Makes the COG-BUS slightly wider, but tolerable?
Byte read can be managed locally, after the 32b is fetched.
I think the streaming FIFOs (now both ways?) could be 32b only, as that path is mainly for fSys/N DMA-like linear streaming - it still has a spoke-align start delay of HUB access for the start of fill, and empty can be separately started.
A lot of times, I will read a byte-sized flag in hub RAM with a wz to check for and get a command. If I had to read an entire long, the wz would not work without wasting another three bytes.
Leaving out rdbyte/rdword and wrbyte/wrword will slow a LOT of code down greatly - examples include graphics, byte code interpreters, cmm for gcc and much more.
Having said that, it should be possible for the hub to be long-only, and using circuitry in the cog to implement byte extraction and byte insertion (if we have the byte write enables) which may still simplify things enough for Chip.
I would agree - reads can be managed inside the COG, so the HUB path is always 32b, and 4 byte write enables also has a 32b path, but selectively enables B/W/L write.
This would also apply to non-blocking WR, and the 2 opcode non blocking Read.
I hate the idea of it simply for my bytecode interpreter but I do see how it can speed things up but chew up a lot more memory too. Hey, maybe if I had a longcode interpreter instead then it could simplify my whole coding scheme!, and chew up memory, but there's a lot more of that, isn't there?
Others mentioned having extra instructions to extract and manipulate bytes and words, if we at least had that then it would help along with the higher speed but remember we still only have 496 cog longs to play with which those extra manipulations are going to eat into. P2.2013 had RDBYTEC instructions which is the exact opposite of what is being proposed though. Is this a case of trying to fit a square peg into a round hole, we end up shaving off the corners of the peg or perhaps we are just making a bigger hole to fall into!?
EDIT: thinking this through I can't see the need for the hub to be byte addressed so therefore all RDLONG and WRLONG addresses are not left shifted so this means that bit0 distinguishes longs, not bytes.
My longcode "interpreter" would use the lower 17-bits to directly address other code in hub memory but if the 17-bit address was lower than 512 it would directly address cog code which could use the upper 15 bits for parameters such as literals etc without having to read another code.
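A quick sketch of how such a longcode could be decoded, just to illustrate the split described above. This is only one reading of the idea: the function name `decode_longcode` and the returned tuple layout are invented here, and what the upper 15 bits mean for a hub address is left open.

```python
ADDR_BITS = 17                      # uniform long address, 128K longs
ADDR_MASK = (1 << ADDR_BITS) - 1
COG_LIMIT = 512                     # addresses below this refer to cog code


def decode_longcode(code):
    """Split a 32-bit longcode into (target, address, parameter)."""
    addr = code & ADDR_MASK
    if addr < COG_LIMIT:
        # Cog routine: the upper 15 bits can carry a literal or other parameter.
        return ("cog", addr, code >> ADDR_BITS)
    # Hub code: this sketch does not assign any meaning to the upper bits.
    return ("hub", addr, None)


cog_op = decode_longcode((0x1234 << ADDR_BITS) | 0x1F0)
hub_op = decode_longcode(0x00200)
```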
The thing is, P2 FPGA needs to be deployed for testing, so don't let us slow you up Chip
>A lot of times, I will read a byte-sized flag in hub RAM with a wz to check for and get a command. If I had to read an entire long, the wz would not work without wasting another three bytes.
How many flags do you have? Probably 10 max, so there will be a loss of 30 bytes, or about 0.006% of a 512 KB hub.
There are ways around it so as not to waste the space.
For example, I made the Spin SPI buffer routine write buff[0] last. As it was never $00 data, when the cog sees it as non-zero it knows that the rest of the SPI buffer is ready.
But now you would have to make sure the Spin routine fills in array[0] as a long. If it's a byte array, I guess you would have to assemble the 4 bytes before writing them as a long,
but that wastes bytes too. And how does the SPIN2 byte-code interpreter work without bytes? Read 16 longs and have the cog SHR as it spools through the data?
With the P1 it was faster in many cases for the cog to read longs from hub and extract the bytes rather than use rdbyte. It's a bit ironic that getting this big boost in hub access reverses that situation. Even with cog byte manipulation instructions, being able to read a byte directly from hub could be faster. Still faster than P1 though.
Byte reads and writes are absolutely needed. You probably could do without words and simulate them in software.
Ross.
I'm still not sure how you'd do the equivalent of a WRBYTE without a read/modify/write. It seems a little trouble in the hardware design phase would save a lot of trouble for programmers. So I guess I'm squarely in the camp that says byte and word reads and writes should be designed in.
-Phil
I think you are talking about in Software here.
Yes, SW emulation of WRBYTE needs RMW & 2 slots, whilst four "byte write enable" signals costs 3 more signals, but needs a single slot, and that can be the non-blocking single-buffer WR mentioned before.
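For comparison, a small model of the software route, counting hub accesses. This is only a sketch: the `Hub` class, its `accesses` counter and `wrbyte_rmw` are made-up names, and one hub access is treated as one slot.

```python
class Hub:
    """Long-only hub memory that counts how many accesses (slots) were used."""

    def __init__(self, longs):
        self.mem = [0] * longs
        self.accesses = 0

    def rdlong(self, index):
        self.accesses += 1
        return self.mem[index]

    def wrlong(self, index, value):
        self.accesses += 1
        self.mem[index] = value & 0xFFFFFFFF


def wrbyte_rmw(hub, byte_addr, value):
    """Emulate WRBYTE in software on a long-only hub: read, modify, write."""
    index = byte_addr >> 2
    shift = 8 * (byte_addr & 3)
    old = hub.rdlong(index)                        # slot 1: read the whole long
    new = (old & ~(0xFF << shift)) | ((value & 0xFF) << shift)
    hub.wrlong(index, new)                         # slot 2: write it back


hub = Hub(4)
hub.mem[1] = 0xAABBCCDD
wrbyte_rmw(hub, 6, 0x11)                           # byte address 6 = long 1, lane 2
```

Between the read and the write another cog can change the same long, and that change is then overwritten, which is why this version is not atomic without a lock.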
Yes, absolutely. In fact, on reflection I think word should be left in as well. The cost to do atomic word writes in s/w (i.e. in such a way that it would never "surprise" an unwary programmer) would be very high - you'd also need to use a lock to make it work properly.
If the hub bus only allows for addressing of longs, how can any cog circuitry avoid necessitating RMW? It seems like no matter what, the necessary logic for writing individual bytes to hub ram without trashing the other 3 bytes in a long would have to be in the hub or ram itself.
I don't care how it's done in hardware, just so long as it's done there, and the programmer is not burdened with the details.
What FIFO are you talking about?
IIRC the last FIFO you referred to was the 256x for LUT/wavetable.
Byte and Word access to Hub is absolutely necessary.
But, as others have said, the byte/word can be transferred together with 4 bits containing the "byte enables" so that hub can be read/written correctly. I am sure you can do the hw mux/shift to get this correctly for the hub and cog ends with just a small amount of silicon.
However, I still have concerns about the new hub access from the jitter and deterministic point of view.
I think, whatever is the easiest/quickest to get us an FPGA to test would be the best approach. Leave out whatever you haven't done yet so we can get started. Absolutely everything has to be retested, so the sooner we can get on with testing the better. While we are testing we can flesh out any quirks.
Another concern of mine is the power envelope. Presuming all cogs are at full steam, all reading at once from the hub, this means that every cog on every clock could be reading a long. How much current is likely to be used when accessing all 16 parallel 32-bit hub RAMs on every clock?
As Bill pointed out, passing the byte-write-enable signals through the write-FIFO makes byte writes possible:
WRINIT addr 'finish writing current write-FIFO and start a new one at addr
WRBYTEX 'skip first byte in write-FIFO
WRBYTE D/# 'add byte to write-FIFO
WRWORD D/# 'add word to write-FIFO, completes long, byte-write-enables = %1110
WRLONG D/# 'add long to write-FIFO, byte-write-enables = %1111
WRWORDX 'skip word in write-FIFO
WRWORD D/# 'add word to write-FIFO, completes long, byte-write-enables = %1100
WRLONGX 'skip long in write-FIFO, completes long, byte-write-enables = %0000
WRBYTE D/# 'add byte to write FIFO, byte-write-enables currently = %0001
WRINIT addr 'finish writing current write-FIFO and start new one
RDINIT addr 'clear read-FIFO and begin reloading from addr
RDBYTE D 'read byte from read-FIFO into D
RDBYTEX 'skip byte in read-FIFO
RDWORD D 'read word from read-FIFO into D, long completed, advance read-FIFO
RDWORDX 'skip word in read-FIFO
RDWORD D 'read word from read-FIFO into D, long completed, advance read-FIFO
RDLONGX 'skip long in read-FIFO, advance read-FIFO
RDLONG D 'read long from read-FIFO into D, advance read-FIFO
The big win here is that we get to unify the address scheme to 17 bits, always.
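The write sequence above can be walked through with a small software model. This is only a sketch of one possible behaviour: the `WriteFifo` class and its method names are invented, enable bit 0 is assumed to belong to byte 0, and an unaligned word or long simply raises an error because that case has not been defined.

```python
class WriteFifo:
    """Toy model: each completed long is queued as (long_addr, data, enables)."""

    def __init__(self):
        self.entries = []
        self.addr = None
        self._reset()

    def _reset(self):
        self.data, self.enables, self.pos = 0, 0, 0

    def _push(self):
        self.entries.append((self.addr, self.data, self.enables))
        self.addr += 1                      # addresses count longs, not bytes
        self._reset()

    def _put(self, value, size, enable):
        if self.pos % size:
            raise ValueError("unaligned access is not modelled in this sketch")
        if enable:
            self.data |= (value & ((1 << (8 * size)) - 1)) << (8 * self.pos)
            self.enables |= ((1 << size) - 1) << self.pos
        self.pos += size
        if self.pos == 4:                   # long completed
            self._push()

    def wrinit(self, addr):
        if self.pos:                        # finish a partly filled long first
            self._push()
        self.addr = addr

    def wrbyte(self, value): self._put(value, 1, True)
    def wrword(self, value): self._put(value, 2, True)
    def wrlong(self, value): self._put(value, 4, True)
    def wrbytex(self): self._put(0, 1, False)
    def wrwordx(self): self._put(0, 2, False)
    def wrlongx(self): self._put(0, 4, False)


fifo = WriteFifo()
fifo.wrinit(0x100)
fifo.wrbytex()                # skip first byte
fifo.wrbyte(0xAA)
fifo.wrword(0xBBCC)           # completes long, enables %1110
fifo.wrlong(0x11223344)       # enables %1111
fifo.wrwordx()
fifo.wrword(0x5566)           # completes long, enables %1100
fifo.wrlongx()                # enables %0000
fifo.wrbyte(0x77)             # enables so far %0001
fifo.wrinit(0x200)            # flushes the partial long

enables_seen = [enables for _, _, enables in fifo.entries]
```

In this model the enables come out as %1110, %1111, %1100, %0000 and %0001, matching the comments in the listing.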
What happens with a RDBYTE from the FIFO, followed by a RDLONG? Does the RDLONG automatically skip enough bytes to realign to a long boundary, or does the programmer have to do enough RDBYTEXs to compensate?
-Phil
Ooooh, non-aligned reads and writes.
I don't know yet. That seems like kind of a pain, but may be pretty simple with a FIFO. If we looked at the bottom level and one up from that, we could piece together whatever we needed on a read.
I think addresses are always for longs, but another question is what happens with 5 RDBYTEs or 5 WRBYTEs via the FIFO?
Is there a random access version of this, or does any HUB access need 2 lines minimum?
Having the pin read/write-FIFO long width only would be acceptable to me. Getting or setting the pin states at up to Fsys opens up way more opportunities than are lost from poor data density in the Hub ram buffer.
I've only used the byte wide hub instructions in my line-camera laser cavity length control. In that case, the 102 entry image buffer was big enough to justify working with BYTES instead of LONGS.
Is this the same FIFO as proposed for DMA-style streaming (optionally via LUT)? Or a smaller data-flow one, or two (RD_FIFO and WR_FIFO)?
Since the COG can run and use slots that are not needed by streaming DMA or LUT actions, can that COG still read/write the HUB via some direct channel (avoiding the R/W FIFO)?
Later on, if people complain a lot about not having it (it being any feature you can think of), then you can come back to it...
Having said that, it should be possible for the hub to be long-only, and using circuitry in the cog to implement byte extraction and byte insertion (if we have the byte write enables) which may still simplify things enough for Chip.
I would agree - reads can be managed inside the COG, so the HUB path is always 32b, and 4 byte write enables also has a 32b path, but selectively enables B/W/L write.
This would also apply to non-blocking WR, and the 2 opcode non blocking Read.
That makes a lot of sense.
But I still would be OK with a 32bit only P2
Interestingly, I noticed someone suggesting we need more cog ram - I have the simplest solution over here
http://forums.parallax.com/showthread.php/155767-The-case-for-Additional-Extended-COG-RAM-(-2-4-6-8KB)
Even if you don't implement it, I think it provides a simpler view of the prop concept.
The big win here is that we get to unify the address scheme to 17 bits, always.
Sounds great