Someone suggested using a separate register to hold the top of the hardware stack to speed things up a bit. (I don't know whether that was done, because I think the stack is a big shift register with no stack pointer.) Similarly, could or does a separate register hold the top of the FIFO?
1) 3 clocks to get the read command to a particular RAM
2) 1 clock for the RAM to read out the data
3) 4 more clocks to route the data to the cog that requested it
4) 1 last clock to get it into the FIFO
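Those four stages sum to the 9-clock minimum; adding the egg-beater wait (0..7 extra clocks until the RAM slice of interest comes around, an inference from later posts in this thread) gives the 9..16 range. A minimal Python sketch of the arithmetic:

```python
# Model of the hub RAM -> FIFO read latency from the four stages above.
COMMAND_ROUTE = 3   # clocks to get the read command to a particular RAM
RAM_READ      = 1   # clock for the RAM to read out the data
DATA_ROUTE    = 4   # clocks to route the data back to the requesting cog
FIFO_ENTRY    = 1   # clock to get the data into the FIFO

def fifo_load_clocks(eggbeater_wait):
    """Total clocks to land one long in the FIFO.

    eggbeater_wait: 0..7 extra clocks until the right hub slice comes
    around (assumed range, based on the 8-cog egg beater)."""
    assert 0 <= eggbeater_wait <= 7
    return COMMAND_ROUTE + eggbeater_wait + RAM_READ + DATA_ROUTE + FIFO_ENTRY

print(fifo_load_clocks(0), fifo_load_clocks(7))  # 9 16
```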
Thanks for the detailed breakdown, Chip. I was looking at faster/smaller C prologue/epilogue ideas in LUT/COG RAM that could speed up the register-saving steps in function calls, instead of fighting with the FIFO for access to hub memory. But those extra 13-20 clock branches back into HUB-exec are going to hurt, so it might just be better to remain in HUB-exec mode for the entire function call and live with the extra code size overhead.
Update: I imagine that running instruction sequences such as "setq + rdlong" etc. to access hub memory, writing/reading register data to/from a stack frame while running in HUB-exec mode, would also stall the FIFO, potentially by the full egg-beater read cycle each time they are issued. So figuring out the pros/cons of having prologues/epilogues in COG/LUT RAM is not trivial.
Thanks from me as well, Chip.
Are 1) and 3) due mainly to physical distance between cog and hub RAM?
Yes, I had to add registers to buy clock cycles to overcome the routing delay.
Branches to hub take 13..20 clocks. It takes up to 7 clocks to reach the hub window of interest, so that a read command can be issued, then it takes 13 clocks to get to the next instruction. I'm looking at the Verilog and it signals 'go' as soon as the first instruction is entered into the FIFO. So, it doesn't wait for some number of longs, just the first one. Why this takes 13 clocks, I'm not sure right now. I know it takes a few clocks to get the read command to the actual RAM of interest, then it takes a few clocks for the data to come back through sets of registers to the cog that requested it, then there's a clock in the FIFO. Not sure how it all adds up to thirteen at the moment. It seems like a long time.
As Cog/LUT exec branches take 4 clocks, could it be thought of as 4 + 9..16 clocks for hub exec? Or 4 + 9..9+cogs-1, where cogs = 8 currently. The question is: could the 9 be reduced in future?
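Treating the hub-exec branch as the 4-clock cog/LUT branch plus a 9..16-clock FIFO reload, an interpretation of the question above rather than confirmed silicon behaviour, the arithmetic does check out against the 13..20 figure:

```python
COG_BRANCH = 4                  # clocks for a cog/LUT-exec branch
FIFO_RELOAD = range(9, 17)      # 9..16 clocks to refill the FIFO from hub

# Decomposition of the 13..20-clock hub-exec branch window.
hubexec_branch = [COG_BRANCH + r for r in FIFO_RELOAD]
print(hubexec_branch[0], hubexec_branch[-1])  # 13 20
```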
Just wondering why RDFAST takes 10..17 cycles to load FIFO, instead of 9..16 cycles taken by branches to hub RAM.
It'll be pipeline related. It's the same as the +4 for branching in hubexec, incurring the combined hubRAM fetch and instruction pipeline reload, whereas RDLONG only has to align the hubRAM fetch with the operand input of the ALU, which is much further along the pipeline than instruction fetching.
I believe RDFAST waits for two longs to be in the FIFO, not one, in case there is a long crossing.
Thanks for the info, Chip. I was hoping there's a case where RDFAST loads the FIFO in 16 cycles maximum, and I think this is it: a no-wait RDFAST followed by an RFBYTE, e.g. for XBYTE, can always get the byte from the first long in the FIFO.
If there are four instructions before a RDFAST that could be moved after it, or only three before a _RET_ RDFAST, then a no-wait RDFAST will always be faster than a normal one* (assuming a register is used for D, with D[31] = 1), despite needing an extra WAITX instruction to ensure a delay of 16 cycles after the no-wait RDFAST (unless enough instructions can be moved to fill that time). * Time saving is 2..9 cycles.
If there are three instructions before a RDFAST that could be moved after it, or only two before a _RET_ RDFAST, then a no-wait RDFAST will never be slower, and will usually be faster, than a normal one. Time saving is 0..7 cycles.
Edit:
Provided there is no long crossing, the above also applies to RFWORD & RFLONG.
Tony,
I gather you're commenting on when the hubRAM->FIFO data becomes valid with D[31] = 1, rather than actual instruction speed.
Evan, the code below compares the two versions of RDFAST. The main reason for using no-wait RDFAST is to save time but it also makes timing deterministic.
'Original code, normal RDFAST:
'cycles
<instr1> '2
...
<instrN> '2
_ret_ rdfast #0,pb '2+10..17+2 = 14..21
' _ret_ starts new XBYTE
'Reordered code, no-wait RDFAST:
'cycles
rdfast nowait,pb '2
waitx #X '2+X
<instr1> '2
...
_ret_ <instrN> '2+2
' _ret_ starts new XBYTE
nowait long $8000_0000
'Must be at least 16 cycles between waitx and _ret_ incl.
'Assuming 2 cycles for all <instr1> ... <instrN>,
'then 2+X+2*N+2 = 16, or X = 12-2*N
'
'If <instrN> requires prefix other than _ret_,
'then add separate ret and subtract 2 from X
'Table of N and X with cycle counts:
'
'N X Total cycles
' using RDFAST with
' wait no-wait
'
'0 12 14..21 18
'1 10 16..23 18
'2 8 18..25 18
'3 6 20..27 18
'4 4 22..29 18
'5 2 24..31 18
'6 0 26..33 18
'7 - 28..35 18
'
'waitx not needed if N > 6
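The N/X table above can be regenerated mechanically. A sketch with hypothetical helper names, assuming every <instr> takes 2 cycles as the listing's comments state:

```python
def wait_total(n):
    """N two-cycle instructions, then '_ret_ rdfast #0,pb' at 14..21 cycles."""
    return (2 * n + 14, 2 * n + 21)           # (min, max) cycles

def nowait_total(n):
    """rdfast(2) + waitx(2+X) + N instrs(2N) + extra 2 for _ret_.

    X = 12-2N makes waitx through _ret_ inclusive span exactly 16 cycles."""
    if n <= 6:
        x = 12 - 2 * n
        return 2 + (2 + x) + 2 * n + 2
    return 2 + 2 * n + 2                      # waitx not needed if N > 6

# Reprint the table: N, X, wait-version range, no-wait total.
for n in range(8):
    lo, hi = wait_total(n)
    x = 12 - 2 * n if n <= 6 else None
    print(n, x, f"{lo}..{hi}", nowait_total(n))
```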
Note that, for this XBYTE example, moving just one instruction from before RDFAST to after it saves cycles compared to the wait-version average.
EDIT1:
Added N = 0, no-wait only 0.5 cycles slower than wait average.
EDIT2:
The fixed 18 cycles for no-wait could become a variable 11..18 cycles with the following future logic change: RFxxxx waits for the first FIFO long, if necessary, after a no-wait RDFAST. WAITX would then not be needed, unless deterministic timing is wanted.
And before I forget to mention it again, no-wait RDBYTE/WORD/LONG seem possible.
I presume you're thinking about alternative design possibilities for the prop2.
Yes, previous post edited.
Due to the delays reading from hub, it would be good if you could start the reading process at a certain time but read the actual data later, with other instructions in between to reduce the total cycles. This is precisely what no-wait RDFAST allows on the P2 now, as described earlier here: http://forums.parallax.com/discussion/comment/1511479/#Comment_1511479
Yes, TonyB_, that would work. It wouldn't have been hard to make it do that. I find that a lot of transfers I code, though, are multi-long transfers. They are pretty efficient.
Chip, based on the above, is the following an accurate summary for generalised hub RAM read timing?
a) 3 clocks to get the read command to a particular RAM
b) 1..8 clocks for the RAM to read out the data, depending on egg beater
c) 4 clocks to route the data to the cog that requested it
d) 1 clock to get the data into the FIFO or D register
Total 9..16 clocks for hub read, +1 if long crossing or RDFAST when D[31]=0
* * * * * * * * * *
And is this correct for hub RAM writes?
e) 3 clocks to get the write command to a particular RAM
f) 0..7 clocks to start write of data to RAM, depending on egg beater
Total 3..10 clocks for hub write, +1 if long crossing
Write instruction ends at start of egg beater slice of interest
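If this summary is right, both totals reduce to one-line formulas. A minimal Python sketch of the cycle arithmetic (long-crossing and D[31]=0 penalties ignored):

```python
def hub_read_clocks(slice_wait):
    """a) 3 command + b) 1..8 RAM read incl. egg-beater wait
       + c) 4 data route + d) 1 into FIFO or D register."""
    assert 1 <= slice_wait <= 8
    return 3 + slice_wait + 4 + 1

def hub_write_clocks(slice_wait):
    """e) 3 command + f) 0..7 wait to start the write."""
    assert 0 <= slice_wait <= 7
    return 3 + slice_wait

print(hub_read_clocks(1), hub_read_clocks(8))    # 9 16
print(hub_write_clocks(0), hub_write_clocks(7))  # 3 10
```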
I've sort of resigned myself to those sorts of details not really mattering. I had a go at trying to pin it down exactly with the instruction pipelining sequence, but even there we had confusion about depicting synchronous clocking in the diagrams.
You've explained why there are 9 cycles at least for reads, but not the 3 cycles minimum for writes. I assumed that read and write commands take the same time to get from cog to hub, i.e. 3 cycles.
Hub RAM read & write timing (version 2, December 2020)
A hub RAM read involves three separate stages: a read command is sent from
cog to hub, the hub RAM is read and the read data are sent from hub to cog.
There is a fixed pre-read time of 3 cycles, a variable read time due to the
egg beater of 1-8 cycles and a fixed post-read time of 5 cycles. These are
denoted by the letters a, r and b, respectively, in the tables below where
each letter represents one cycle. The shortest read of 9 cycles is given by
aaarbbbbb and the longest of 16 cycles by aaarrrrrrrrbbbbb.
A hub RAM write also has three stages: write data are sent from cog to hub,
a wait for the hub slot and an acknowledgement sent from hub to cog. There
is a fixed pre-wait time of 2 cycles, a variable wait of 0-7 cycles and a
fixed post-wait of 1 cycle, denoted by the letters c, w and d, respectively.
The actual write takes place a few clocks later but does not delay the cog.
The shortest write of 3 cycles is given by ccd and the longest of 10 cycles
by ccwwwwwwwd.
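The letter notation can be generated directly from the stage counts above; a short sketch:

```python
def read_pattern(slice_wait):
    """3 'a' pre-read cycles, 1-8 'r' egg-beater/read cycles, 5 'b' post-read."""
    assert 1 <= slice_wait <= 8
    return "aaa" + "r" * slice_wait + "bbbbb"

def write_pattern(hub_wait):
    """2 'c' pre-wait cycles, 0-7 'w' wait cycles, 1 'd' acknowledge cycle."""
    assert 0 <= hub_wait <= 7
    return "cc" + "w" * hub_wait + "d"

print(read_pattern(1), len(read_pattern(1)))    # aaarbbbbb 9
print(read_pattern(8), len(read_pattern(8)))    # aaarrrrrrrrbbbbb 16
print(write_pattern(0), write_pattern(7))       # ccd ccwwwwwwwd
```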
If a hub RAM read or write is followed by another and the phase or egg beater
slice difference between the two is known, then in most cases it is possible
to use the time that would be wasted on egg beater delays to run one or more
instructions between the hub accesses.
This is easy to see in the tables, in which these intermediate instructions
are denoted by i1, i2, etc., with * indicating that one can be 3 cycles long. The
initial read or write, shown with worst-case timing but of unknown duration,
synchronizes subsequent reads or writes to the egg beater. Cycle lengths are
given from the second hub access onwards and assume no long crossing that
would add an extra cycle.
The above text and the tables of hub RAM timings are in the attached file.
Timings are also available in three separate forum posts:
So if the shortest RDLONG takes 9 clock cycles, once the first hub address is synced up, is the fastest (non-SETQ-based) back-to-back RDLONG sequence achieved by reading long addresses n, n+1, n+2, or is it n, n-1, n-2 (modulo 8)? i.e. which way does the egg-beater increment its addresses, forwards or backwards?
Writes look like they would be fastest when writing to long addresses n, n+3, n+6, n+9, all modulo 8 (or perhaps n, n-3, n-6, etc., if the egg beater cycles its addresses in the other direction).
It could be useful in some cases to know this optimal access pattern if you need to optimize some data structure layouts based on the order in which they get accessed. Of course, that's for basic COG-exec PASM2; any hub-exec code would obviously throw its own spanner in the works and mess things up further.
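One way to explore that question is a toy model of the rotation. In the sketch below, the forward rotation direction and the slice-selection formula are assumptions (the thread leaves the real direction open); the point is only that, under such a model, a +1 long stride sustains minimum-latency reads and a +3 stride sustains minimum-latency writes once the first access has synced up:

```python
# Toy egg-beater model for the stride question above. ASSUMPTION: the slice
# accepting commands advances forward by one each clock.
COGS = 8

def read_clocks(addr_slice, tick):
    """3 clocks to reach the hub, wait for the slice, then 1+4+1 back."""
    wait = (addr_slice - (tick + 3)) % COGS   # 0..7 until our slice's turn
    return 3 + wait + 1 + 4 + 1               # 9..16 total

def write_clocks(addr_slice, tick):
    wait = (addr_slice - (tick + 3)) % COGS
    return 3 + wait                           # 3..10 total

def run(op, start_slice, stride, n=8):
    """Issue n back-to-back accesses at the given long-address stride."""
    tick, addr, out = 0, start_slice, []
    for _ in range(n):
        c = op(addr % COGS, tick)
        out.append(c)
        tick += c                             # next access starts when done
        addr += stride
    return out

print(run(read_clocks, 3, 1))   # after the first (syncing) read, all 9s
print(run(write_clocks, 5, 3))  # after the first (syncing) write, all 3s
```

Reversing the assumed rotation direction would flip the answer to -1 and -3 strides, so the model can't settle the question, only show that one stride per direction is optimal.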
With this read and write info, hub RAM timings could be calculated precisely.
Yes, they could be. I need to come up with the exact formula and add it into the silicon doc.
Felt I was beating a dead horse from then on.
EDIT: Maybe the only clear way is to use a logic simulator package. I found Logisim easy to use - https://sourceforge.net/projects/circuit/
Chip, could you please check the following post?
http://forums.parallax.com/discussion/comment/1511907/#Comment_1511907
If what I said is right, then that's all the info we need to calculate hub RAM timings.
Timings are also available in three separate forum posts:
read - write - read - write
http://forums.parallax.com/discussion/comment/1512306/#Comment_1512306
read - read - read - read
http://forums.parallax.com/discussion/comment/1512310/#Comment_1512310
write - write - write - write
http://forums.parallax.com/discussion/comment/1512311/#Comment_1512311