Hub RAM FIFO read timing
Ahle2
Posts: 1,179
Hi Chip,
Have a look at the code below...
        rdfast  1 << 31, someAddress    ' Cycle 1/2
        nop                             ' Cycle 3/4
        nop                             ' Cycle 5/6
        nop                             ' Cycle 7/8
        rflong  someLongData            ' Cycle 9/10
If I sync the code above to the eggbeater, will the data at "someAddress" be ready to read through the FIFO after 9 cycles?
/Johannes
Comments
The way to use it is to work with the worst case, so you don't need to know the alignment. So, 16 clocks rather than 9.
Obviously, any preceding WRFAST is another case again.
Pretty sure I remember each COG getting access to addresses with a specific lower nibble. Each cycle, that nibble is incremented.
So, on cycle 0
Cog 0 = nibble 0
Cog 1 = nibble 1
Cycle 1
Cog 0 = nibble 1
Cog 1 = nibble 2
Etc...
I do not remember what the difference between 8 and 16 cogs is.
Do I have that right?
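If I have that rotation right, the slice a cog lines up with on a given clock is just (cog + clock) mod cogs. A minimal sketch for the 8-cog case ('slice' and 'now' are my own register names, and this assumes the eggbeater phase is aligned with CT = 0, which would need checking on hardware):

        cogid   slice               ' start from this cog's ID
        getct   now                 ' current system counter
        add     slice, now          ' the rotation advances one slice per clock
        and     slice, #7           ' modulo 8 for the 8-cog eggbeater
        ' 'slice' now holds which hub RAM slice this cog lines up with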
I'm not sure I'd use that D[31]=1 mode...
In practice, you only need to know the difference between the hub RAM slices of two addresses from one access to the next, i.e. (addr2 - addr1) mod 8. This has an interesting mental side effect: I found myself thinking in terms of multiples of this difference, but in reality, when coding the next aligned access, you still need to think in terms of multiples of eight and then add the modulo'd difference on.
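A worked example with made-up numbers: if one access hits the long at hub address $100 (slice 0) and the next hits the long at $114 (5 longs later, so slice 5), then the difference mod 8 is 5. After the first access's window, the second slice comes due 5 clocks later, or 13, 21, ... clocks later; that is, 8n + 5, the "multiples of eight plus the modulo'd difference" pattern.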
WRLONGs can take as little as three clocks, a notable advantage over the prop1. And I think there is a timing alignment difference from RDLONG, presumably due to a difference in buffering stages.
The other hub-ops, like COGID, the locks, and also the cordic commands (not including GETQX/QY), all have a single fixed slot position for each cog, like the prop1.
Thanks for that. Either I knew it, and forgot, or never knew. But, I'm happy to know it now.
I'm having trouble sorting out how that works. Don't get me wrong, I'm taking it as given, but I need to noodle on it more to fully internalize it. Did you determine this by testing, or some other method? If by testing, is there a pointer to the code?
EDIT: I remember it was Xmas a year ago that I first hacked at it.
EDIT2: Huh, just found a few tables from only a couple of months back, but still didn't document it well. (Note these are targeting the other hub-ops.)
EDIT: Ah, I remember, Chip confirmed that COGID always returns a result, so therefore is minimum of 4 clocks, not 2. The instruction sheet needed a correction.
Original thread, not very long but utterly fascinating:
https://forums.parallax.com/discussion/167956/fast-hub-ram-timing/p1
Two reasons for no-wait RDFAST/WRFAST:
1. To reduce/eliminate otherwise wasted cycles and do something useful instead.
2. To allow 100% deterministic timing, albeit at the expense of always assuming the worst-case.
That worst case is 2+17 cycles for RDFAST if D[31] = 0, which suggests a minimum of nine 2-cycle instructions between RDFAST and RFxxxx if D[31] = 1. I'm not sure anyone has confirmed this by testing yet.
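Taken at face value, that suggests a fixed padding pattern like this sketch, assuming plain 2-cycle instructions and no interrupts (the nine-NOP count is the unconfirmed estimate above, not a tested figure):

        rdfast  nowait, someAddress ' 'nowait' preloaded with 1 << 31; 2 cycles
        nop                         ' padding 1 of 9
        nop                         ' padding 2 of 9
        nop                         ' padding 3 of 9
        nop                         ' padding 4 of 9
        nop                         ' padding 5 of 9
        nop                         ' padding 6 of 9
        nop                         ' padding 7 of 9
        nop                         ' padding 8 of 9
        nop                         ' padding 9 of 9
        rflong  someLongData        ' should now be safe for any hub alignment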
Would be very handy to also have no-wait random hub accesses in the future, for WRxxxx at least. RDFAST/WRFAST could be used for a single memory access until then.
I just looked at the Verilog source, and it looks to me like if you do a 'COGID #value' (which makes no sense) there will be no result, and it would therefore take 2..9 clocks, not 4..11.
Okay, found the source code for the above tables. Here's the instruction list used for it:
I think if you use COGID with a # and no WC, it should take 2 to 9 cycles. Are you able to try this out? I'm not at my setup right now.
You can use "WAITX #15 WC" to generate a random delay before executing the instruction. Then, capture the counter before and after COGID. Get the difference minus 2 and that should be the time it took to execute COGID. You will need to run it in a loop and record the lowest value.
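That procedure might look something like this sketch (the loop structure and register names are mine; 'best' would be preset to a large value first):

loop    waitx   #15 wc              ' WC makes the wait random, varying hub alignment
        getct   t1                  ' counter before
        cogid   #0                  ' immediate D, no WC: the no-result form in question
        getct   t2                  ' counter after
        sub     t2, t1              ' elapsed clocks
        sub     t2, #2              ' minus 2 for the bracketing GETCT
        cmps    t2, best wcz        ' lower than anything recorded so far?
 if_b   mov     best, t2            ' record the lowest value
        jmp     #loop               ' repeat to hit all alignments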
That use case makes no sense, of course, but it would take as little as two clocks. So, this means the instruction spreadsheet is okay, right?
Setting bit D[31] to clear the FIFO is the WHOLE point of what I'm after!
I do actually have an application where I need to know the timing exactly! (you will eventually see what I'm after)
There are ways to synchronize everything to the hub, so now I will just prove this on actual hardware. My code will not work unless everything is perfectly synced.
I'm looking forward to it!
If you want to do some of your own testing then here's my source. As it stands it requires Eric's Fastspin because of the #include, but merging the two files wouldn't be hard if you wanted to use Pnut instead.
Thank you Evan... This will come in handy! 😀
Let us suppose a different bytecode interpreter has its inline routine in cog/LUT RAM, which jumps to the start of the inline code in hub RAM that follows immediately after the inline bytecode. I think this jump instruction could be jmp pb.
Q1. If jmp pb is correct, does this jump reload the FIFO? This should not be necessary as the first inline P2 instruction is already the next thing in the FIFO.
Q2. In general, does the hardware check whether a jump address is the same as the next address/program counter and if so not do the jump?
Q3. Is a RET or _RET_ enough to end the inline code and start the next bytecode that follows immediately after the inline RET/_RET_?
Questions 1 and 2 are hardware, though. I'm confident the FIFO will be reloaded in both cases. There's no branch prediction going on; a branch is a branch.
Anytime a branch to hub occurs, the FIFO is reloaded, even if the branch is to the next instruction in hub.
RET or _RET_ is enough to end the in-line code, but it must have been called, of course. Or a return address must have been pushed. It could be more efficient to push a return address in a case where no branch is really necessary.
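A minimal sketch of that pushed-return variant (label names are mine, and 'next_op' is assumed to be in cog RAM so a 9-bit immediate suffices):

        push    #next_op            ' return address for the in-line code's RET
        jmp     pb                  ' enter the in-line hub code (or just fall through,
                                    '  when execution is already headed into it)
next_op                             ' the in-line RET pops #next_op and lands here
        ' fetch/dispatch the next bytecode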
Thanks for the info, Chip. $1FF is on top of the stack, so RET/_RET_ at the end of the in-line code would start a new XBYTE, I hope, despite being in hub exec mode where RFBYTE (for XBYTE) is not allowed.
As an alternative intended for simple code without branching, each instruction in turn could be copied to cog RAM and executed there, to avoid the 13..20 cycles for the branch to hub RAM *. The following code shows the idea but probably won't work due to instruction pipelining:
I think this version will work:
* However liable to FIFO reload(s) that will increase overhead.
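A sketch of that copy-and-execute loop as I understand it (my own reconstruction; how many spacer instructions the pipeline demands is exactly the uncertainty mentioned above):

exec1   rflong  instr               ' pull the next in-line instruction from the FIFO
        nop                         ' spacer so the write lands before 'instr' is fetched
instr   nop                         ' overwritten in place and executed here
        jmp     #exec1              ' on to the next instruction
                                    ' (a real loop also needs an exit test for the
                                    '  end of the in-line code)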
It'll start at the longword boundary that contains the requested address. Other cogs can access hub RAM while that one is waiting for its start slot.
Branches to hub take 13..20 clocks. It takes up to 7 clocks to reach the hub window of interest, so that a read command can be issued, then it takes 13 clocks to get to the next instruction. I'm looking at the Verilog and it signals 'go' as soon as the first instruction is entered into the FIFO. So, it doesn't wait for some number of longs, just the first one. Why this takes 13 clocks, I'm not sure right now. I know it takes a few clocks to get the read command to the actual RAM of interest, then it takes a few clocks for the data to come back through sets of registers to the cog that requested it, and then there's a clock in the FIFO. Not sure how it all adds up to thirteen at the moment. It seems like a long time.
I made a test to just recheck the time and it shows 13 clocks on the Eval board LEDs:
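A rough sketch of how such a GETCT-bracketed check could look (my own arrangement, not Chip's actual test):

        getct   t1                  ' counter just before the branch
        jmp     #\hub_entry         ' absolute branch from cog exec into hub exec

        orgh    $400                ' target placed in hub RAM
hub_entry
        getct   t2                  ' first instruction executed in hub exec
        sub     t2, t1              ' elapsed clocks
        sub     t2, #2              ' minus the GETCT overhead, as in the COGID test
                                    ' -> should read 13..20 depending on alignment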
As cog/LUT exec branches take 4 clocks, could it be thought of as 4 + 9..16 clocks for hub exec? Or 4 + 9..9+cogs-1, where cogs = 8 currently. The question is: could the 9 be reduced in future?