Fast hub RAM timing

cgracey · 2018-01-24 10:56

TonyB_,

I've gone into the hub FIFO block and made sure that things can't happen out of order:

- After RDFAST, RFxxxx reads, WFxxxx ignored
- After WRFAST, WFxxxx writes, RFxxxx ignored (returns $00000000)
- When RDFAST/WRFAST are configuring, WFxxxx ignored, RFxxxx ignored (returns $00000000)
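
The three rules above can be read as a small state table. The following is only an illustrative Python model of that table, not the hardware logic; the names `FifoMode`, `rf` and `wf` are made up for this sketch.

```python
from enum import Enum


class FifoMode(Enum):
    """Hypothetical names for the three FIFO states listed above."""
    CONFIGURING = "configuring"  # RDFAST/WRFAST still setting up
    READ = "read"                # RDFAST has completed
    WRITE = "write"              # WRFAST has completed


def rf(mode, fifo_value):
    """Model RFxxxx: real data only in read mode, otherwise $00000000."""
    return fifo_value if mode is FifoMode.READ else 0x00000000


def wf(mode):
    """Model WFxxxx: True if the write is accepted, False if ignored."""
    return mode is FifoMode.WRITE
```

For example, `rf(FifoMode.CONFIGURING, 0xBABEFACE)` gives 0, matching the "returns $00000000" case.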

I'm really glad you brought this up, because there were some ambiguities in these cases that needed certainty. For example, what RFxxxx returns when it can't get data, and when the GETPTR value can be incremented.

I wouldn't suppose someone would leave a streamer running while issuing a new RDFAST or WRFAST, but now we know what to expect in that case and it coincides with likely expectations.

I will get a new version out soon with the new quick-exit RDFAST/WRFAST and this cleaned up RFxxxx/WFxxxx behavior.

Thanks, again, for bringing this up.

P.S. I'm running it now and it's a lot better with RFxxxx returning 0's. Very sensible now.

evanh · 2018-01-24 11:43

Chip,
Obviously an ignore state is a bug from a usage point of view. On that note, in the two cases where RFxxxx is issued while RDFAST is configuring, or where WFxxxx is issued while WRFAST is configuring, we could go one step further and stall on RFxxxx/WFxxxx. With this in place, the two operating modes would no longer differ significantly in use. The b31 flag could be dropped, leaving only the more streamlined mode.

cgracey · 2018-01-24 12:08

evanh wrote: »

Chip,
Obviously an ignore state is a bug from a usage point of view. On that note, in the two cases where RFxxxx is issued while RDFAST is configuring, or where WFxxxx is issued while WRFAST is configuring, we could go one step further and stall on RFxxxx/WFxxxx. With this in place, the two operating modes would no longer differ significantly in use. The b31 flag could be dropped, leaving only the more streamlined mode.

There's a lot of assumption of RFxxxx/WFxxxx being able to execute immediately. It would be pretty disruptive to unravel that thing and put it back together with stalling. That's way too much for me to handle, at this point. Hub exec uses this, too.

This RDFAST/WRFAST D[31] quick-exit is a nice trick for advanced programmers, but I don't think it should be made a standard behavior. My reasoning has to do with the streamer and starting it up with known timing. RFxxxx and WFxxxx, when issued from software, just GO. The streamer, though, may be issuing hidden RFxxxx/WFxxxx's like a drunk sailor and if we do a new RDFAST, for example, in software, we just need some predictable behavior regarding what kind of data interruption the streamer experiences. Data going to 0 until the new stream is available is nice. Software is never going to get this response unless you did the D[31] thing, which is not for casual programmers.

evanh · 2018-01-24 12:19

Good info there. Heh, I've not read up on actually using the Streamer. I'm happy.

TonyB_ · 2018-01-24 23:57

cgracey wrote: »

TonyB_ wrote: »

Another query I have is about the block size in RDFAST/WRFAST/FBLOCK.

64 bytes is 16 longs, or 16 cogs * one long per cog. Is that what determined the minimum size of the hub FIFO? Now that there are only 8 cogs, is the FIFO smaller?

The FIFO is smaller, yes.

The block size could be smaller, too.

I keep the block size at 16 longs, though, so that we can have software compatibility across a family of chips, with up to 16 cogs.

Chip, thanks for that and the other info and changes you made yesterday.

TonyB_ · 2018-01-25 00:14

deleted

TonyB_ · 2018-03-04 01:40

TonyB_ wrote: »

Here are the old 16-cog timings from Instructions v31 :

16-cog timing
         Cogex cycles   Hubex cycles

RDBYTE   9-24           9-44
RDWORD   9-24*          9-44*
RDLONG   9-24*          9-44*

WRBYTE   3-18           3-38
WRWORD   3-18*          3-38*
WRLONG   3-18*          3-38*
WMLONG   3-18*          3-38*

* +1 if crosses hub long

16-cog timing
         Cogex cycles

RDFAST   10-25 + WRFAST finish
WRFAST   3     + WRFAST finish
FBLOCK   2

RFBYTE   2
RFWORD   2
RFLONG   2
RFVAR    2
RFVARS   2

WFBYTE   2
WFWORD   2
WFLONG   2

Chip, do you have time at the moment to tell us the timings for the above instructions for the 8 cog versions, including the new fast RDFAST/WRFAST? Have the latter been added to a public FPGA version? (I can't test it but others could.)

cgracey · 2018-03-04 01:58

I will measure the times soon for 8 cogs.

TonyB_ · 2018-03-28 03:19

cgracey wrote: »

The next thing I need to do is get new FPGA images out. We need everyone who can, to try them out. There should be no surprises.

Q1. Is the fast RDFAST/WRFAST with D[31]=1 already in an FPGA image?
Q2. If there is a WRFAST but no WFxxxx then a RDFAST, does the RDFAST cancel the WRFAST (and vice-versa)?

One final PM sent about XBYTE.

cgracey · 2018-03-28 04:41

TonyB_ wrote: »

cgracey wrote: »

The next thing I need to do is get new FPGA images out. We need everyone who can, to try them out. There should be no surprises.

Q1. Is the fast RDFAST/WRFAST with D[31]=1 already in an FPGA image?
Q2. If there is a WRFAST but no WFxxxx then a RDFAST, does the RDFAST cancel the WRFAST (and vice-versa)?

One final PM sent about XBYTE.

The D[31] no-wait for RDFAST/WRFAST has been in there for a while.

Yes RDFAST and WRFAST are mutually exclusive. One cancels the other.

TonyB_ · 2018-03-28 17:19

Thanks for the info, Chip. Another PM sent.

TonyB_ · 2018-09-13 23:48

TonyB_ wrote: »
TonyB_ wrote: »
Here are the old 16-cog timings from Instructions v31 :
16-cog timing
         Cogex cycles   Hubex cycles

RDBYTE   9-24           9-44
RDWORD   9-24*          9-44*
RDLONG   9-24*          9-44*

WRBYTE   3-18           3-38
WRWORD   3-18*          3-38*
WRLONG   3-18*          3-38*
WMLONG   3-18*          3-38*

* +1 if crosses hub long

16-cog timing
         Cogex cycles

RDFAST   10-25 + WRFAST finish
WRFAST   3     + WRFAST finish
FBLOCK   2

RFBYTE   2
RFWORD   2
RFLONG   2
RFVAR    2
RFVARS   2

WFBYTE   2
WFWORD   2
WFLONG   2

Chip, do you have time at the moment to tell us the timings for the above instructions for the 8 cog versions, including the new fast RDFAST/WRFAST?

It would be good to have the above hub RAM timings, now that we are so close to seeing the real thing. Worst-case memory access times are important to know and I can't discover them for myself as I don't have an FPGA board.

Two related questions about XBYTE:

1. What is the latency if the next bytecode to interpret is not the next byte in the FIFO due to a bytecode jump or call?

2. Is it possible that a random RDBYTE/WRBYTE (or word and long equivalents) could suffer an extra delay because the FIFO needs re-filling at that instant, or is this impossible?

jmg · 2018-09-13 23:59

TonyB_ wrote: »

It would be good to have the above hub RAM timings, now that we are so close to seeing the real thing.

I think if you drop the cores from 16 to 8, what used to be a +16 worst-case quantum flips to +8, so you drop the top numbers by 8?
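
jmg's estimate can be checked against the cogex column of the 16-cog table with a few lines of Python. This is only a sketch of that guess: it assumes the best case stays the same and that the worst case is the best case plus one full hub rotation minus one clock. The function name `cogex_range` is made up here, and measured 8-cog figures may differ.

```python
def cogex_range(best_case, cogs):
    """Return (best, worst) cycles, assuming worst = best + (cogs - 1)."""
    return best_case, best_case + cogs - 1


# Best-case cycle counts taken from the old 16-cog cogex column.
OLD_BEST = {"RDBYTE": 9, "WRBYTE": 3, "RDFAST": 10}

for name, best in OLD_BEST.items():
    lo16, hi16 = cogex_range(best, 16)
    lo8, hi8 = cogex_range(best, 8)
    print(f"{name}: 16-cog {lo16}-{hi16}, 8-cog estimate {lo8}-{hi8}")
```

With 16 cogs the model reproduces 9-24, 3-18 and 10-25 from the table, which is what makes the "drop the top numbers by 8" guess plausible.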

ozpropdev · 2018-09-14 00:24

TonyB_ wrote: »

2. Is it possible that a random RDBYTE/WRBYTE (or word and long equivalents) could suffer an extra delay because the FIFO needs re-filling at that instant, or is this impossible?

If code is running in Hub ram this certainly would be the case.

From the DOCS

Cogs can access hub RAM either via the sequential FIFO interface, or by waiting for RAM slices of interest, while yielding to
the FIFO. If the FIFO is not busy, which is soon the case if data is not being read from or written to it, random accesses will
have full opportunity to access the composite hub RAM.

Dave Hein · 2018-09-14 01:06

Following the thread backwards, the question started with using RDFAST/WRFAST, which is mutually exclusive with running in hub RAM. There is only one FIFO. In either case, filling the FIFO has priority and a random hub RAM access has to wait for the FIFO to be full, and then wait for the appropriate hub slot. Hub timing will depend on where the code is located and where data is located. To ensure deterministic timing, programs and data may need to be aligned on 8 or 16-long boundaries.
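
To make the alignment point concrete, here is a rough Python sketch. It assumes (this is not stated in the thread) that consecutive hub longs are spread across slices, so that a long's slice is its long address modulo the number of cogs. Both helper names are hypothetical.

```python
def hub_slice(byte_addr, cogs=8):
    """Slice index of the long holding byte_addr (assumed mapping)."""
    return (byte_addr >> 2) % cogs


def is_block_aligned(byte_addr, longs=8):
    """True if byte_addr starts on a boundary of `longs` longs."""
    return byte_addr % (longs * 4) == 0


# Code or data starting on an 8-long boundary always starts at slice 0,
# so its phase relative to the hub rotation does not depend on placement.
for addr in (0x1000, 0x1004, 0x101C, 0x1020):
    print(hex(addr), "slice", hub_slice(addr), "aligned", is_block_aligned(addr))
```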

ozpropdev · 2018-09-14 01:45

In hub exec mode the use of fifo hub instructions is not allowed.
The code below demonstrates the effect of doing so.

By setting the fifo pointer using RDFAST the hub exec pc is changed to $500.
Then by using a RFLONG the pc is indexed +4 skipping the next instruction.

dat	org

	jmp	#main

'Hub exec code

	orgh	$400
main	bmask	dirb,#15
	rdfast	#0,##$500   'equivalent to jmp #$500

	mov	outb,#$5a   'we never get here
	jmp	#$

	orgh	$500
	rflong	pa
	mov	outb,#$f0   'instruction is skipped
	mov	outb,#$f
	jmp	#$

TonyB_ · 2018-09-14 02:10

jmg wrote: »

TonyB_ wrote: »

It would be good to have the above hub RAM timings, now that we are so close to seeing the real thing.

I think if you drop the cores from 16 to 8, what used to be a +16 worst-case quantum flips to +8, so you drop the top numbers by 8?

At least 8 from what Chip said earlier:

cgracey wrote: »

The timings need to be updated. In some cases, the improvement is more than 8 clocks.

TonyB_ · 2018-09-14 02:24

ozpropdev wrote: »
TonyB_ wrote: »

2. Is it possible that a random RDBYTE/WRBYTE (or word and long equivalents) could suffer an extra delay because the FIFO needs re-filling at that instant, or is this impossible?

If code is running in Hub ram this certainly would be the case.

From the DOCS
Cogs can access hub RAM either via the sequential FIFO interface, or by waiting for RAM slices of interest, while yielding to
the FIFO. If the FIFO is not busy, which is soon the case if data is not being read from or written to it, random accesses will
have full opportunity to access the composite hub RAM.

The random hub read/write would be during XBYTE, in case that was not clear. I guess the answer to 2. is not possible. I was just wondering whether there might be a situation where the worst-case egg-beater timings might be even worse.

FPGA owners could find out the minimum and maximum cycle times, e.g. by writing non-zero data, then trying RDFAST with D[31] = 1 and increasing the delay until the data read are not zero.

Dave Hein · 2018-09-14 02:33

TonyB_ wrote: »

FPGA owners could find out the minimum and maximum cycle times, e.g. by writing non-zero data, then trying RDFAST with D[31] = 1 and increasing the delay until the data read are not zero.

I don't understand what you are proposing. Adding a delay won't change the value that is being read. Could you post some code to show what you're suggesting. I'd be happy to run it on my FPGA board.

msrobots · 2018-09-14 03:01

ozpropdev wrote: »
In hub exec mode the use of fifo hub instructions is not allowed.
The code below demonstrates the effect of doing so.

By setting the fifo pointer using RDFAST the hub exec pc is changed to $500.
Then by using a RFLONG the pc is indexed +4 skipping the next instruction.
dat	org

	jmp	#main

'Hub exec code

	orgh	$400
main	bmask	dirb,#15
	rdfast	#0,##$500   'equivalent to jmp #$500

	mov	outb,#$5a   'we never get here
	jmp	#$

	orgh	$500
	rflong	pa
	mov	outb,#$f0   'instruction is skipped
	mov	outb,#$f
	jmp	#$

How did you figure that out?
See, mix that with some XBYTE code and a Flash boot program calling into TAQOZ and we even have code protection nobody can break.

The P2 will be so much fun to program, even if the P1 still surprises me. The WordFire 4 Keyboard/4 Screens console I recently got is run by a single P1 and it does an amazing job there.

Enjoy!

Mike

TonyB_ · 2018-09-14 03:01

The global counter could presumably time instructions, but I haven't looked at the code for that yet. A variable delay would change the value read after a RDFAST with D[31] = 1. That form doesn't wait for the FIFO to contain valid data, so RFxxxx will return zero until it does.

ozpropdev · 2018-09-14 03:01

Here's the minimum delay for correct hub read

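dat	org

	bmask	dirb,#15                'make the low 16 pins of port B outputs
	wrlong	##$babeface,##$1000     'write a non-zero test long to hub RAM
	rdfast	##$80000000,##$1000     'D[31]=1: start the FIFO read without waiting
	waitx	#4                      'shortest delay that lets the FIFO fill
	rflong	outb                    'show the long; zero if the FIFO was not ready
	jmp	#$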

cgracey · 2018-09-14 05:03

The latest Google sheet shows all instruction timings for an 8-cog variant. The link is in the Prop2 FPGA Files thread, first post.

TonyB_ · 2018-09-30 20:34

Belated thanks for the timings, Chip. The answer to the following question is still not clear to me:

During XBYTE is it possible that a random RDBYTE/WRBYTE (or word and long equivalents) could suffer an extra delay because the FIFO needs re-filling with more bytecodes at that instant?

The scenario is sequential bytecodes with no branching. On the face of it, the six clock XBYTE overhead suggests that a clash could always be avoided, but I don't know exactly when or how the FIFO is re-filled.

cgracey · 2018-09-30 20:55

TonyB_ wrote: »

Belated thanks for the timings, Chip. The answer to the following question is still not clear to me:

During XBYTE is it possible that a random RDBYTE/WRBYTE (or word and long equivalents) could suffer an extra delay because the FIFO needs re-filling with more bytecodes at that instant?

The scenario is sequential bytecodes with no branching. On the face of it, the six clock XBYTE overhead suggests that a clash could always be avoided, but I don't know exactly when or how the FIFO is re-filled.

That probably wouldn't happen, given the rate of FIFO loading versus a random read or write.

evanh · 2018-10-01 01:54

ozpropdev wrote: »
Here's the minimum delay for correct hub read
dat	org

	bmask	dirb,#15
	wrlong	##$babeface,##$1000
	rdfast	##$80000000,##$1000
	waitx	#4
	rflong	outb
	jmp	#$

I don't remember seeing that before. Certainly cleaner than my effort last night.

So that indicates needing 8 clocks prefetch. Presumably it doesn't need the 9th clock because of the simpler hard wired FIFO addressing.

With some more testing I've confirmed the timing to be 8...15, which is exactly -1 from the RDLONG timings.

cgracey · 2018-10-01 02:55

Good job, Evanh.

TonyB_ · 2018-10-08 14:16

cgracey wrote: »

TonyB_ wrote: »

Belated thanks for the timings, Chip. The answer to the following question is still not clear to me:

During XBYTE is it possible that a random RDBYTE/WRBYTE (or word and long equivalents) could suffer an extra delay because the FIFO needs re-filling with more bytecodes at that instant?

The scenario is sequential bytecodes with no branching. On the face of it, the six clock XBYTE overhead suggests that a clash could always be avoided, but I don't know exactly when or how the FIFO is re-filled.

That probably wouldn't happen, given the rate of FIFO loading versus a random read or write.

This still troubles me somewhat. The probability of a clash during XBYTE might be small but if it's non-zero it will happen, sooner or later. I don't want a random hub access to suffer an extra delay, even if only occasionally. This might be a non-issue entirely or avoidable in software, but I don't know. The worst-case appears to be an XBYTE routine consisting of just a random read or write, e.g. _ret_ rdbyte ...

I think the FIFO is 1-level deep. The doc says a new bytecode is fetched during the last clock of _RET_. If this is the last byte of the long in the FIFO, does this trigger an immediate request to read a new long from hub RAM? If so, then the maximum number of clocks before this long is read is known. If this is less than the time it takes for an XBYTE routine to get around to doing a random read or write, then there would not be a clash.

Does this sound right, Chip?

EDIT:
The hub slices would need to be the same for a clash to occur, presumably.

cgracey · 2018-10-08 14:52

TonyB_,

I don't know if this answers your question, but the FIFO fills to over 8 longs before a random RDxxxx/WRxxxx is allowed.

evanh · 2018-10-08 16:09

Oops, got something wrong.

Trying again ... for a WRLONG, a coinciding FIFO fill will add a delay. The intermediate hub write has a -2 lag (leading) alignment with respect to normal WRLONG hub timings.

'  Testing code (cogram)
test_main
		mov     parm, ##$12345678
		mov     tpin, ##$8000_0000   'non-blocked FIFO filling at RDFAST
		waitx   #3                   'guesstimate hub rotation, tweak to suit
		getct   tick0         'reference time

		wrlong  parm, #0
		getct   tick1         '5 ticks (3 for WRLONG, 2 for GETCT)
		rdfast  tpin, #$100
		getct   tick2         '9 ticks (2 for RDFAST, 2 for GETCT)
		wrlong  parm, #0
		getct   tick3         '19 ticks (5+16=21, so skipped a slot but 2 clocks early)
		wrlong  parm, #0
		getct   tick4         '37 ticks (gap of 18, skipped another slot, alignment normal again)
		wrlong  parm, #0
		getct   tick5         '45 ticks (regular 8-clock interval)
...

EDIT: Improved comments
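
The arithmetic in those comments can be double-checked with a short Python sketch. It uses only the tick values quoted in the comments (taken as clocks since tick0) and assumes an 8-clock slot period, as in the final "regular 8-clock interval" comment. The name `slot_offset` is made up for this check.

```python
SLOT_PERIOD = 8  # clocks between slots for one hub slice (8-cog assumption)

# GETCT readings after each WRLONG, copied from the comments above.
WRLONG_TICKS = [5, 19, 37, 45]


def slot_offset(tick, reference=5, period=SLOT_PERIOD):
    """Signed distance from the slot grid set by the first WRLONG."""
    off = (tick - reference) % period
    return off - period if off > period // 2 else off


gaps = [b - a for a, b in zip(WRLONG_TICKS, WRLONG_TICKS[1:])]
offsets = [slot_offset(t) for t in WRLONG_TICKS]
print("gaps:", gaps)        # clocks between successive WRLONG completions
print("offsets:", offsets)  # a negative value means earlier than the grid
```

The second WRLONG comes out 2 clocks ahead of the grid and the later ones are back on it, which agrees with the "2 clocks early" and "alignment normal again" notes.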
