Hub RAM FIFO read timing
Ahle2
Posts: 1,179
Hi Chip,
Have a look at the code below...
        rdfast  1 << 31, someAddress    ' Cycle 1/2
        nop                             ' Cycle 3/4
        nop                             ' Cycle 5/6
        nop                             ' Cycle 7/8
        rflong  someLongData            ' Cycle 9/10
If I sync the code above to the eggbeater, will the data at "someAddress" be ready to read through the FIFO after 9 cycles?
/Johannes
Comments
The way to use it is to work with the worst case, so you don't need to know the alignment. So, 16 clocks rather than 9.
Obviously, any preceding WRFAST is another case again.
Pretty sure I remember each COG getting access to addresses with a specific lower nibble. Each cycle, that nibble is incremented.
So, on cycle 0
Cog 0 = nibble 0
Cog 1 = nibble 1
Cycle 1
Cog 0 = nibble 1
Cog 1 = nibble 2
Etc...
I do not remember what the difference between 8 and 16 cogs is.
Do I have that right?
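If I have that rotation right, the slice a cog lines up with on a given clock is just (cog + clock) mod cogs. A minimal sketch for the 8-cog case ('slice' and 'now' are my own register names, and this assumes the eggbeater phase is aligned with CT = 0, which would need checking on hardware):

        cogid   slice               ' start from this cog's ID
        getct   now                 ' current system counter
        add     slice, now          ' the rotation advances one slice per clock
        and     slice, #7           ' modulo 8 for the 8-cog eggbeater
        ' 'slice' now holds which hub RAM slice this cog lines up with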
I'm not sure I'd use that D[31]=1 mode...
In practice, you only need to know the difference between the hub RAM slices of two addresses from one access to the next, i.e. (addr2 - addr1) mod 8. This has an interesting mental side effect: I found myself thinking in terms of multiples of this difference, but in reality, when coding the next aligned access, you still need to think in terms of multiples of eight and then add the modulo'd difference on.
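A worked example with made-up numbers: if one access hits the long at hub address $100 (slice 0) and the next hits the long at $114 (5 longs later, so slice 5), then the difference mod 8 is 5. After the first access's window, the second slice comes due 5 clocks later, or 13, 21, ... clocks later; that is, 8n + 5, the "multiples of eight plus the modulo'd difference" pattern.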
WRLONGs can take as little as three clocks, a notable advantage over the prop1. And I think there is a timing alignment difference from RDLONG, presumably due to a difference in buffering stages.
The other hub-ops, like COGID, the locks, and also the cordic commands (not including GETQX/QY), all have a single fixed slot position for each cog, like the prop1.
Thanks for that. Either I knew it, and forgot, or never knew. But, I'm happy to know it now.
I'm having trouble sorting out how that works. Don't get me wrong, I'm taking it as given, but I need to noodle on it more to fully internalize it. Did you determine this by testing, or some other method? If by testing, is there a pointer to the code?
EDIT: I remember it was Xmas a year ago that I first hacked at it.
EDIT2: Huh, just found a few tables from only a couple of months back, but still didn't document it well. (Note these are targeting the other hub-ops.)
EDIT: Ah, I remember, Chip confirmed that COGID always returns a result, so therefore is minimum of 4 clocks, not 2. The instruction sheet needed a correction.
Original thread, not very long but utterly fascinating:
https://forums.parallax.com/discussion/167956/fast-hub-ram-timing/p1
Two reasons for no-wait RDFAST/WRFAST:
1. To reduce/eliminate otherwise wasted cycles and do something useful instead.
2. To allow 100% deterministic timing, albeit at the expense of always assuming the worst-case.
That worst case is 2+17 cycles for RDFAST if D[31] = 0, which suggests a minimum of nine 2-cycle instructions between RDFAST and RFxxxx if D[31] = 1. I'm not sure anyone has confirmed this by testing yet.
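Taken at face value, that suggests a fixed padding pattern like this sketch, assuming plain 2-cycle instructions and no interrupts (the nine-NOP count is the unconfirmed estimate above, not a tested figure):

        rdfast  nowait, someAddress ' 'nowait' preloaded with 1 << 31; 2 cycles
        nop                         ' padding 1 of 9
        nop                         ' padding 2 of 9
        nop                         ' padding 3 of 9
        nop                         ' padding 4 of 9
        nop                         ' padding 5 of 9
        nop                         ' padding 6 of 9
        nop                         ' padding 7 of 9
        nop                         ' padding 8 of 9
        nop                         ' padding 9 of 9
        rflong  someLongData        ' should now be safe for any hub alignment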
Would be very handy to also have no-wait random hub accesses in the future, for WRxxxx at least. RDFAST/WRFAST could be used for a single memory access until then.
I just looked at the Verilog source, and it looks to me like if you do a 'COGID #value' (which makes no sense) there will be no result, and it would therefore take 2..9 clocks, not 4..11.
Okay, found the source code for the above tables. Here's the instruction list used for it:
I think if you use COGID with a # and no WC, it should take 2 to 9 cycles. Are you able to try this out? I'm not at my setup right now.
You can use "WAITX #15 WC" to generate a random delay before executing the instruction. Then, capture the counter before and after COGID. Get the difference minus 2 and that should be the time it took to execute COGID. You will need to run it in a loop and record the lowest value.
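That procedure might look something like this sketch (the loop structure and register names are mine; 'best' would be preset to a large value first):

loop    waitx   #15 wc              ' WC makes the wait random, varying hub alignment
        getct   t1                  ' counter before
        cogid   #0                  ' immediate D, no WC: the no-result form in question
        getct   t2                  ' counter after
        sub     t2, t1              ' elapsed clocks
        sub     t2, #2              ' minus 2 for the bracketing GETCT
        cmps    t2, best wcz        ' lower than anything recorded so far?
 if_b   mov     best, t2            ' record the lowest value
        jmp     #loop               ' repeat to hit all alignments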
That use case makes no sense, of course, but it would take as little as two clocks. So, this means the instruction spreadsheet is okay, right?
Setting bit D[31] to clear the FIFO is the WHOLE point of what I'm after!
I do actually have an application where I need to know the timing exactly! (you will eventually see what I'm after)
There are ways to synchronize everything to the hub, so now I will just prove this on actual hardware. My code will not work unless everything is perfectly synced.
I'm looking forward to it!
If you want to do some of your own testing then here's my source. As it stands it requires Eric's Fastspin because of the #include, but merging the two files wouldn't be hard if you wanted to use Pnut instead.
Thank you Evan... This will come in handy! 😀
Let us suppose a different bytecode interpreter has its inline routine in cog/LUT RAM, which jumps to the start of the inline code in hub RAM that follows immediately after the inline bytecode. I think this jump instruction could be jmp pb.
Q1. If jmp pb is correct, does this jump reload the FIFO? This should not be necessary as the first inline P2 instruction is already the next thing in the FIFO.
Q2. In general, does the hardware check whether a jump address is the same as the next address/program counter and if so not do the jump?
Q3. Is a RET or _RET_ enough to end the inline code and start the next bytecode that follows immediately after the inline RET/_RET_?
Questions 1 and 2 are hardware, though. I'm confident the FIFO will be reloaded in both cases. There's no branch prediction going on; a branch is a branch.
Anytime a branch to hub occurs, the FIFO is reloaded, even if the branch is to the next instruction in hub.
RET or _RET_ is enough to end the in-line code, but it must have been called, of course. Or a return address must have been pushed. It could be more efficient to push a return address in a case where no branch is really necessary.
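A minimal sketch of that pushed-return variant (label names are mine, and 'next_op' is assumed to be in cog RAM so a 9-bit immediate suffices):

        push    #next_op            ' return address for the in-line code's RET
        jmp     pb                  ' enter the in-line hub code (or just fall through,
                                    '  when execution is already headed into it)
next_op                             ' the in-line RET pops #next_op and lands here
        ' fetch/dispatch the next bytecode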
Thanks for the info, Chip. $1FF is on top of the stack, so RET/_RET_ at the end of the in-line code would start a new XBYTE, I hope, despite being in hub exec mode where RFBYTE (for XBYTE) is not allowed.
As an alternative intended for simple code without branching, each instruction in turn could be copied to cog RAM and executed there, to avoid the 13..20 cycles for the branch to hub RAM *. The following code shows the idea but probably won't work due to instruction pipelining:
I think this version will work:
* However liable to FIFO reload(s) that will increase overhead.
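A sketch of that copy-and-execute loop as I understand it (my own reconstruction; how many spacer instructions the pipeline demands is exactly the uncertainty mentioned above):

exec1   rflong  instr               ' pull the next in-line instruction from the FIFO
        nop                         ' spacer so the write lands before 'instr' is fetched
instr   nop                         ' overwritten in place and executed here
        jmp     #exec1              ' on to the next instruction
                                    ' (a real loop also needs an exit test for the
                                    '  end of the in-line code)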
It'll start at the longword boundary that contains the requested address. Other cogs can access hub RAM while that one is waiting for its start slot.
Branches to hub take 13..20 clocks. It takes up to 7 clocks to reach the hub window of interest, so that a read command can be issued, then it takes 13 clocks to get to the next instruction. I'm looking at the Verilog and it signals 'go' as soon as the first instruction is entered into the FIFO. So, it doesn't wait for some number of longs, just the first one. Why this takes 13 clocks, I'm not sure right now. I know it takes a few clocks to get the read command to the actual RAM of interest, then it takes a few clocks for the data to come back through sets of registers to the cog that requested it, and then there's a clock in the FIFO. Not sure how it all adds up to thirteen at the moment. It seems like a long time.
I made a test to just recheck the time and it shows 13 clocks on the Eval board LEDs:
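A rough sketch of how such a GETCT-bracketed check could look (my own arrangement, not Chip's actual test):

        getct   t1                  ' counter just before the branch
        jmp     #\hub_entry         ' absolute branch from cog exec into hub exec

        orgh    $400                ' target placed in hub RAM
hub_entry
        getct   t2                  ' first instruction executed in hub exec
        sub     t2, t1              ' elapsed clocks
        sub     t2, #2              ' minus the GETCT overhead, as in the COGID test
                                    ' -> should read 13..20 depending on alignment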
As cog/LUT exec branches take 4 clocks, could it be thought of as 4 + 9..16 clocks for hub exec? Or 4 + 9..9+cogs-1, where cogs = 8 currently. The question is: could the 9 be reduced in future?