So if the shortest RDLONG is taking 9 clock cycles, once the first hub address is synced up is the fastest (non setq based) back-to-back RDLONG sequence achieved by reading long addresses n, n+1, n+2, OR is it n, n-1, n-2 (modulo 8)? i.e. Which way does the egg-beater increment its addresses, forwards or backwards?
Writes look like they would be fastest when writing to long addresses n, n+3, n+6, n+9, all modulo 8 (or perhaps n, n-3, n-6 etc. if the egg beater addresses cycle in the other direction).
I assume that real slice 0 handles addresses ending in 000, real slice 1 ... 001, etc. What I call slices 0 and 1 in the tables could be N and N+1 where N is a don't care - only the egg beater slice / phase difference matters here.
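The reasoning behind those sequences can be sketched as follows. This is only an illustrative model, assuming the egg beater advances one slice per clock over 8 slices and that each instruction in the sequence runs at its minimum length (9 clocks for a read, 3 for a write). The function names and the `direction` parameter are hypothetical, and the real rotation direction is exactly the open question above.

```python
SLICES = 8  # the egg beater rotates through 8 hub RAM slices


def best_stride(min_cycles):
    """Slice offset that is in position again after an op of min_cycles clocks."""
    return min_cycles % SLICES


def fastest_sequence(start_slice, min_cycles, count, direction=+1):
    """Slices to target so every op in the run hits its minimum time.

    direction=+1 assumes slices advance forwards, -1 backwards (unverified).
    """
    stride = best_stride(min_cycles)
    return [(start_slice + direction * i * stride) % SLICES for i in range(count)]


reads = fastest_sequence(0, 9, 4)                 # 9-clock reads step 1 slice
writes = fastest_sequence(0, 3, 4)                # 3-clock writes step 3 slices
writes_backwards = fastest_sequence(0, 3, 4, -1)  # same, other direction
```

With these assumptions the read run visits slices n, n+1, n+2 and the write run visits n, n+3, n+6, n+9 (mod 8), matching the sequences guessed at above.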
Apart from documenting the egg beater correctly, my main interest is inserting "zero-cycle" instructions between hub reads and writes.
1) 3 clocks to get the read command to a particular RAM
2) 1 clock for the RAM to read out the data
3) 4 more clocks to route the data to the cog that requested it
4) 1 last clock to get it into the FIFO
That's it.
Chip, based on the above, is the following an accurate summary for generalised hub RAM read timing?
a) 3 clocks to get the read command to a particular RAM
b) 1..8 clocks for the RAM to read out the data, depending on egg beater
c) 4 clocks to route the data to the cog that requested it
d) 1 clock to get the data into the FIFO or D register
Total 9..16 clocks for hub read, +1 if long crossing or RDFAST when D[31]=0
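As a quick check of the arithmetic in steps a) to d), here is a small sketch of the proposed read timing. It simply adds up the stages as listed; the constant and function names are made up for illustration, and the +1 penalty is modelled as a plain flag without distinguishing its two causes.

```python
READ_COMMAND = 3  # a) clocks to get the read command to the RAM
READ_ROUTE = 4    # c) clocks to route the data back to the cog
READ_LATCH = 1    # d) clock to get the data into the FIFO or D register


def hub_read_cycles(slice_wait, extra_clock=False):
    """Total clocks for a hub read under the summary above.

    slice_wait  -- b) 1..8 clocks, depending on the egg beater position
    extra_clock -- True for the +1 case (long crossing, or RDFAST with D[31]=0)
    """
    if not 1 <= slice_wait <= 8:
        raise ValueError("slice_wait must be in 1..8")
    total = READ_COMMAND + slice_wait + READ_ROUTE + READ_LATCH
    return total + (1 if extra_clock else 0)


best_case = hub_read_cycles(1)
worst_case = hub_read_cycles(8)
```

The best and worst cases come out at the 9 and 16 clocks quoted in the total line.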
* * * * * * * * * *
And is this correct for hub RAM writes?
e) 3 clocks to get the write command to a particular RAM
f) 0..7 clocks to start write of data to RAM, depending on egg beater
Total 3..10 clocks for hub write, +1 if long crossing
Write instruction ends at start of egg beater slice of interest
The reason WRxxxx instructions take 3..10 clocks is because they go like this:
a) 2 clocks for the instruction to present the data to the hub
b) 0..7 clocks to wait for the hub slot
c) 1 clock to get the acknowledge from the hub
The actual write takes place a few clocks later, but does not hold up the cog.
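The same kind of sketch works for the write breakdown just given. It models only the clocks that hold up the cog, not the later moment when the data actually lands in RAM. The names are hypothetical, and the `long_crossing` flag is taken from the earlier "+1 if long crossing" line rather than from this breakdown.

```python
WRITE_PRESENT = 2  # a) clocks for the instruction to present data to the hub
WRITE_ACK = 1      # c) clock to get the acknowledge from the hub


def hub_write_cycles(slot_wait, long_crossing=False):
    """Clocks a WRxxxx holds up the cog, per the breakdown above.

    slot_wait     -- b) 0..7 clocks waiting for the hub slot
    long_crossing -- assumed +1 clock when the write straddles two longs
    """
    if not 0 <= slot_wait <= 7:
        raise ValueError("slot_wait must be in 0..7")
    total = WRITE_PRESENT + slot_wait + WRITE_ACK
    return total + (1 if long_crossing else 0)


fastest_write = hub_write_cycles(0)
slowest_write = hub_write_cycles(7)
```

This reproduces the 3..10 clock range stated for WRxxxx.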
Chip, if writing to slot 0, is the 1 clock acknowledgement sent during slot 0?
I just looked at the Verilog and that is the case.
Thanks, Chip.
OK, write-only timings are effectively identical, but read-write-read-write have changed. Fastest one is now read0-write0-read1-write1. I'll edit my posts later, but for now if you change wrX to wrX-1 in the read-write timings they will be correct.
Can you verify this using the silicon? You are probably right, but it would be good to verify.
I can't attach a .txt or .zip file, tried on two computers, so had to post tables above separately.
It's one of the few functions of the forum software that arbitrarily insists on using Javascript. The other is the Edit button, it's a silly little scripted pop down menu with a single button on it.
Slice accesses from each cog are also staggered from its neighbour cog. This is how all cogs can each access a slice in unison. It's a detail that isn't particularly relevant within a single cog, but I thought it worth pointing out at least.
I added a read0-write0-read0-write0 timing to the above, which applies to any unchanged slice.
In case you find all these timings boring, perhaps the most useful application is reading then writing the same slice. If there is no other instruction between the two, the write will be the shortest possible at 3 cycles long. However, just one intermediate instruction will mean missing this first write slot and waiting for the next hub revolution. Four instructions between the read and write will not take any longer than one assuming all are 2 cycles long.
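The read-then-write case can be put into numbers with a rough model. It assumes, as an illustration only, that a write issued immediately after a read of the same slice hits its slot with no wait, that the slot comes round again every 8 clocks, and that every intermediate instruction takes 2 clocks. The function names are hypothetical and none of this has been checked on silicon.

```python
SLICES = 8     # clocks per hub revolution
MIN_WRITE = 3  # shortest possible write


def write_after_read_cycles(gap_cycles):
    """Length of a write to the slice just read, after gap_cycles of other code."""
    wait = (-gap_cycles) % SLICES  # clocks until the slot comes round again
    return MIN_WRITE + wait


def total_after_read(n_instructions, cycles_each=2):
    """Clocks from the end of the read to the end of the write."""
    gap = n_instructions * cycles_each
    return gap + write_after_read_cycles(gap)


totals = [total_after_read(n) for n in range(6)]
```

Under these assumptions zero intermediate instructions cost 3 clocks in total, one to four instructions all cost 11, and a fifth pushes the write into the following revolution.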
Some issues make it difficult for me at the moment to verify what I have written using the silicon and I'd be grateful if someone else could do this as I've spent a lot of time documenting it.
I did testing of that a long time back, when we only had the FPGA. I didn't carefully carry it through to mixed read and write timings though. If memory serves me right, I couldn't quite decide on how the two aligned. I think I got confused then distracted by other interests.
I tell you what - comparing against COGID will be very telling, since COGID is a hub-op with fixed slice association. It doesn't shift with address and has fastest minimum of 2 clocks execution time.
It's only in the last few days, today for writes, that we've had enough info to predict hub read & write timings.
I wasn't predicting, I was measuring. As I said, it wasn't easy and I effectively gave up on mixed types. The single type of RDLONGs or WRLONGs wasn't a huge problem though. I got the same results as you.
At one stage, when someone was asking much later on, I even wrote an equation for the single type case.
Looking through the old programs, it looks like it was when I was unemployed in 2018. Not quite as far back as I thought. It's funny how those fun engrossing times seem further away than reality. Work, on the other hand, seems to burn time with nothing getting done.
PS: I'd forgotten all about that round of testing I'd done January December a year back with the other hub-ops. January was the last time I'd touched the source code.
Ah, there's a flaw in that table above. The original code only dealt with one component having a hubRAM address, but the above needs two for the first three lines of each group. That'll be why I didn't get it finished - It required a reworking to cover the combinations.
GETCT interferes with the timing, so for slices x and y you need to do the first synchronizing read or write twice:
RD x
GETCT 1   'These add
RD x      '16 cycles
RD y
GETCT 2

WR x
GETCT 1   'These add
WR x      '8 cycles
WR y
GETCT 2

RD x
GETCT 1   'These add
RD x      '16 cycles
WR y
GETCT 2

WR x
GETCT 1   'These add
WR x      '8 cycles
RD y
GETCT 2
...
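Turning the two GETCT values from such a sequence into the cost of the final instruction could look like the sketch below. It assumes both GETCTs sample the counter at the same point in their execution, that the captured values are 32 bits wide and may wrap, and that the "These add" span is 16 clocks after a synchronizing read or 8 after a synchronizing write, as annotated above. The names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

# Clocks consumed by "GETCT 1" plus the repeated synchronizing op on slice x.
SYNC_SPAN = {"RD": 16, "WR": 8}


def second_op_cycles(ct1, ct2, sync_kind):
    """Clocks taken by the op on slice y, from the two captured counts.

    sync_kind -- "RD" or "WR", the type of the repeated op on slice x
    """
    elapsed = (ct2 - ct1) & MASK32  # wrap-safe difference
    return elapsed - SYNC_SPAN[sync_kind]


rd_then_rd = second_op_cycles(100, 125, "RD")
wr_then_rd = second_op_cycles(0xFFFFFFF0, 0x00000003, "WR")
```

The second call shows that a counter rollover between the two captures does not disturb the result.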
I'm not sure those specific numbers will be very helpful though, since it was for the other hub-ops, rather than rdlong/wrlong combos that you're wanting.
Okay, here's that quick fix again. I still haven't sorted out labelling the axes yet.
It iterates, across the table, the hub address (decrementing) of the left hand instruction. And iterates, down the table, the hub address (decrementing) of the right hand instruction.
PS: I've used the triple-instruction method too.
PPS: Left hand instruction executes ahead of the right hand instruction.
PPPS: Here's the critical section in the source code. inst6 and inst7 are left hand. inst8 is right hand.
...
inst6   nop                 'hubram instructions use "phase" for hubram address
        getct   tickstart   'measure time
inst7   nop                 'hubram instructions use "phase" for hubram address
inst8   nop                 'hubram instructions use "phase2" for hubram address
        getct   pa          'measure time
...
Hmm, I need an exception for RDLONG as left hand. My magic compensation assumes inst6 and inst7 are eight ticks apart ... Which is not the case when they are a RDLONG. They'll be 16 ticks apart then.
EDIT: That'll be true for RDFAST as well.
EDIT2: Fixed the compensation now - Updated above attachment
EDIT3: Reinstated LOCKRET in the tables
Here's newest source code. Still messy as and unfinished but has been updated to handle mixed read and write of hubRAM without the other hub-ops. I had intended to comment it all originally but that's still not done either.
P.S.
My figures need verifying!
Excellent info, thanks Chip. I need to change the write timings above because they are wrong.
http://forums.parallax.com/discussion/comment/1512305/#Comment_1512305
Text file containing all timings attached to this post
PS: Cordic ops are also hub ops.
EDIT: Ah, the RevA P2-ES boards shipped in December 2018. https://forums.parallax.com/discussion/169367/p2-es-board-support/p1 That fits.
It was assembled for the FPGA, because those are the only HUBSET options listed ... In fact I've found two programs now:
Here's another one where I'd given up - https://forums.parallax.com/discussion/comment/1448317/#Comment_1448317
EDIT: Oops, it will do a minimum 2 clock case when given an immediate operand - https://forums.parallax.com/discussion/comment/1485446/#Comment_1485446 Pretty unuseful case though.
EDIT: Err, lacking axes ...
EDIT:
Added cycles
I do have your version already in last year's code, just haven't been using it much.
Could you please try my version, making sure all hub addresses are long-aligned?
Instead of 64 numbers, I'd like 8:
where
RDLONG0 is a read long from hub addr phase (long aligned)
RDLONG1 is a read long from hub addr phase+1*4
...
RDLONG7 is a read long from hub addr phase+7*4
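For reference, the addresses behind RDLONG0..RDLONG7 can be generated as below. The slice mapping is an assumption carried over from earlier in the thread (long addresses ending in binary 000 belong to slice 0, 001 to slice 1, and so on), and the helper names are hypothetical.

```python
LONG_BYTES = 4
SLICES = 8


def rdlong_address(phase, n):
    """Byte address read by RDLONGn: phase + n*4, with phase long-aligned."""
    if phase % LONG_BYTES:
        raise ValueError("phase must be long-aligned")
    if not 0 <= n < SLICES:
        raise ValueError("n must be in 0..7")
    return phase + n * LONG_BYTES


def slice_of(byte_addr):
    """Assumed slice for a byte address: bits 4..2 of the address."""
    return (byte_addr // LONG_BYTES) % SLICES


addresses = [rdlong_address(0x1000, n) for n in range(SLICES)]
slices = [slice_of(a) for a in addresses]
```

Whatever the real slice numbering is, the eight addresses land on eight different slices, so only the offset from `phase` matters for the timings.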