I've gone into the hub FIFO block and made sure that things can't happen out of order:
- After RDFAST, RFxxxx reads, WFxxxx ignored
- After WRFAST, WFxxxx writes, RFxxxx ignored (returns $00000000)
- When RDFAST/WRFAST are configuring, WFxxxx ignored, RFxxxx ignored (returns $00000000)
I'm really glad you brought this up, because there were some ambiguities in these cases that needed certainty. For example, what RFxxxx returns when it can't get data, and when the GETPTR value can be incremented.
I wouldn't suppose someone would leave a streamer running while issuing a new RDFAST or WRFAST, but now we know what to expect in that case and it coincides with likely expectations.
I will get a new version out soon with the new quick-exit RDFAST/WRFAST and this cleaned up RFxxxx/WFxxxx behavior.
Thanks, again, for bringing this up.
P.S. I'm running it now and it's a lot better with RFxxxx returning 0's. Very sensible now.
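To make these rules concrete, here is a minimal sketch of the intended behavior (labels, addresses and data values are illustrative, not from the thread):
dat             org
                rdfast  #0, ##rx_buf            'read mode: RFxxxx returns data
                rflong  pa                      'PA = $12345678, the first long of rx_buf
                wflong  #123                    'ignored: WFxxxx does nothing after RDFAST
                wrfast  #0, ##tx_buf            'write mode: WFxxxx writes
                wflong  #123                    'queued into the FIFO, written to tx_buf when it flushes
                rflong  pa                      'ignored: PA = $00000000 after WRFAST
                jmp     #$
                orgh    $400
rx_buf          long    $12345678[16]           'non-zero data to read back
tx_buf          long    0[16]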
Chip,
Obviously an ignore state is a bug from the user's point of view. On that note, in the two cases where RFxxxx is issued while RDFAST is configuring, or WFxxxx is issued while WRFAST is configuring, we could go one step further and stall on RFxxxx/WFxxxx. With that in place, the two operating modes would no longer have a significant difference in use: the D[31] flag could be dropped and only the more streamlined mode would remain.
There's a lot of assumption of RFxxxx/WFxxxx being able to execute immediately. It would be pretty disruptive to unravel that thing and put it back together with stalling. That's way too much for me to handle, at this point. Hub exec uses this, too.
This RDFAST/WRFAST D[31] quick-exit is a nice trick for advanced programmers, but I don't think it should be made the standard behavior. My reasoning has to do with the streamer and starting it up with known timing. RFxxxx and WFxxxx, when issued from software, just GO. The streamer, though, may be issuing hidden RFxxxx/WFxxxx's like a drunk sailor, and if we do a new RDFAST in software, for example, we just need some predictable behavior regarding what kind of data interruption the streamer experiences. Data going to 0 until the new stream is available is nice. Software is never going to see this response unless it uses the D[31] option, which is not for casual programmers.
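For reference, the two RDFAST forms look like this side by side (a sketch; the $8000_0000 flag value matches the one used in the test code later in this thread, and the buffer address is arbitrary):
                rdfast  #0, ##$1000             'default: waits until the FIFO is ready...
                rflong  pa                      '...so this read returns valid data
                rdfast  ##$8000_0000, ##$1000   'D[31]=1: quick exit, FIFO loads in the background...
                rflong  pa                      '...so this may return $00000000 until the FIFO catches up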
Another query I have is about the block size in RDFAST/WRFAST/FBLOCK.
Is 64 bytes, i.e. 16 longs or 16 cogs * one long per cog, what determined the minimum size of the hub FIFO? Now that there are only 8 cogs, is the FIFO smaller?
The FIFO is smaller, yes.
The block size could be smaller, too.
I keep the block size at 16 longs, though, so that we can have software compatibility across a family of chips with up to 16 cogs.
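For anyone following the arithmetic: one block = 16 longs = 64 bytes, regardless of cog count. Assuming the D operand of RDFAST/WRFAST counts these 64-byte blocks before the FIFO wraps back to the start address (with 0 meaning no wrapping), a 512-byte circular read buffer would look something like this sketch:
dat             org
                rdfast  ##8, ##ring             'hypothetical: wrap after 8 blocks x 64 bytes = 512 bytes
loop            rflong  pa                      'reads cycle through ring[0..511] repeatedly
                jmp     #loop
                orgh
ring            byte    0[512]                  '8 blocks of 16 longs each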
Chip, thanks for that and the other info and changes you made yesterday.
Chip, do you have time at the moment to tell us the timings for the above instructions for the 8 cog versions, including the new fast RDFAST/WRFAST? Have the latter been added to a public FPGA version? (I can't test it but others could.)
The next thing I need to do is get new FPGA images out. We need everyone who can, to try them out. There should be no surprises.
Q1. Is the fast RDFAST/WRFAST with D[31]=1 already in an FPGA image?
Q2. If there is a WRFAST but no WFxxxx then a RDFAST, does the RDFAST cancel the WRFAST (and vice-versa)?
One final PM sent about XBYTE.
The D[31] no-wait for RDFAST/WRFAST has been in there for a while.
Yes, RDFAST and WRFAST are mutually exclusive. One cancels the other.
Chip, do you have time at the moment to tell us the timings for the above instructions for the 8 cog versions, including the new fast RDFAST/WRFAST?
It would be good to have the above hub RAM timings, now that we are so close to seeing the real thing. Worst-case memory access times are important to know and I can't discover them for myself as I don't have an FPGA board.
Two related questions about XBYTE:
1. What is the latency if the next bytecode to interpret is not the next byte in the FIFO due to a bytecode jump or call?
2. Is it possible that a random RDBYTE/WRBYTE (or word and long equivalents) could suffer an extra delay because the FIFO needs re-filling at that instant, or is this impossible?
If code is running in hub RAM, this certainly would be the case.
From the docs:
Cogs can access hub RAM either via the sequential FIFO interface, or by waiting for RAM slices of interest, while yielding to the FIFO. If the FIFO is not busy, which is soon the case if data is not being read from or written to it, random accesses will have full opportunity to access the composite hub RAM.
Following the thread backwards, the question started with using RDFAST/WRFAST, which is mutually exclusive with executing code from hub RAM: there is only one FIFO. In either case, filling the FIFO has priority, and a random hub RAM access has to wait for the FIFO to be filled and then wait for the appropriate hub slot. Hub timing will depend on where the code is located and where the data is located. To ensure deterministic timing, programs and data may need to be aligned on 8- or 16-long boundaries.
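A sketch of the kind of placement that last sentence is talking about (addresses are arbitrary, but both are multiples of 64 bytes, i.e. 16 longs):
dat             orgh    $400                    '16-long aligned: buffer starts at a known hub slice
data_buf        long    0[16]                   '64 bytes, ending at $43F
                orgh    $440                    'also 16-long aligned: hub-exec code starts at a known slice
entry           rdlong  pa, ##data_buf          'access timing now depends only on placement, not luck
                jmp     #entry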
In hub exec mode, the use of the FIFO hub instructions is not allowed.
The code below demonstrates the effect of doing so.
Setting the FIFO pointer with RDFAST changes the hub exec PC to $500.
Then an RFLONG advances the PC by 4, skipping the next instruction.
dat             org
                jmp     #main
'Hub exec code
                orgh    $400
main            bmask   dirb, #15
                rdfast  #0, ##$500      'equivalent to jmp #$500
                mov     outb, #$5a      'we never get here
                jmp     #$
                orgh    $500
                rflong  pa
                mov     outb, #$f0      'instruction is skipped
                mov     outb, #$f
                jmp     #$
The random hub read/write would be during XBYTE, in case that was not clear. I guess the answer to 2 is that it's not possible. I was just wondering whether there might be a situation where the worst-case egg-beater timings might be even worse.
FPGA owners could find out the minimum and maximum cycle times, e.g. by writing non-zero data, then doing RDFAST with D[31] = 1 and increasing the delay until the data read are no longer zero.
I don't understand what you are proposing. Adding a delay won't change the value that is being read. Could you post some code to show what you're suggesting? I'd be happy to run it on my FPGA board.
How did you figure out that RDFAST/RFLONG trick?
See, mix that with some XBYTE code and a Flash boot program calling into TAQOZ and we even have code protection nobody can break.
The P2 will be so much fun to program, even though the P1 still surprises me: the WordFire 4 Keyboard/4 Screens console I recently got is run by a single P1 and it does an amazing job there.
Enjoy!
Mike
The global counter could presumably time instructions, but I haven't looked at the code for that yet. A variable delay would change the value read after an RDFAST with D[31] = 1, because that form doesn't wait for the FIFO to contain valid data; RFxxxx will return zero until it does.
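A sketch of the kind of test being described here (register names, the buffer address and the reporting step are placeholders, not from the thread):
dat             org
test            mov     delay, #0               'delay under test, swept upward each pass
loop            wrlong  ##$AAAA5555, ##buf      'known non-zero data in hub RAM
                rdfast  ##$8000_0000, ##buf     'no-wait RDFAST: quick exit, FIFO loads in background
                waitx   delay                   'variable delay before the first read
                rflong  pa                      'returns $00000000 until the FIFO holds valid data
                'report PA vs. delay here (serial, or pins on OUTB as in the demo above)
                add     delay, #1
                jmp     #loop
delay           res     1
                orgh    $400
buf             long    0[16]
Sweeping the delay upward, the smallest value for which PA comes back non-zero puts a bound on how long the FIFO takes to deliver its first long.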
Belated thanks for the timings, Chip. The answer to the following question is still not clear to me:
During XBYTE is it possible that a random RDBYTE/WRBYTE (or word and long equivalents) could suffer an extra delay because the FIFO needs re-filling with more bytecodes at that instant?
The scenario is sequential bytecodes with no branching. On the face of it, the six clock XBYTE overhead suggests that a clash could always be avoided, but I don't know exactly when or how the FIFO is re-filled.
That probably wouldn't happen, given the rate of FIFO loading versus a random read or write.
I don't remember seeing that before. Certainly cleaner than my effort last night. So that indicates needing 8 clocks prefetch. Presumably it doesn't need the 9th clock because of the simpler hard wired FIFO addressing.
With some more testing I've confirmed the timing to be 8...15, which is exactly -1 from the RDLONG timings.
This still troubles me somewhat. The probability of a clash during XBYTE might be small, but if it's non-zero it will happen sooner or later. I don't want a random hub access to suffer an extra delay, even if only occasionally. This might be a non-issue entirely, or avoidable in software, but I don't know. The worst case appears to be an XBYTE routine consisting of just a random read or write, e.g. _ret_ rdbyte ...
I think the FIFO is 1-level deep. The doc says a new bytecode is fetched during the last clock of _RET_. If this is the last byte of the long in the FIFO, does this trigger an immediate request to read a new long from hub RAM? If so, then the maximum number of clocks before this long is read is known. If this is less than the time it takes for an XBYTE routine to get around to doing a random read or write, then there would not be a clash.
Does this sound right, Chip?
EDIT:
The hub slices would need to be the same for a clash to occur, presumably.
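To make that worst case concrete, the bytecode handler being described amounts to nothing more than this (the label and the tos register are illustrative; the XBYTE setup itself is omitted):
bc_peek                                         'hypothetical bytecode handler: one random hub read
_ret_           rdbyte  tos, ptra               'then straight back to XBYTE, which fetches the next bytecode;
                                                'this is the access that could coincide with a FIFO refill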
Trying again ... for a WRLONG, a coinciding FIFO fill will add a delay. The intermediate hub write has a -2 lag (leading) alignment with respect to normal WRLONG hub timings.
' Testing code (cogram)
test_main
                mov     parm, ##$12345678
                mov     tpin, ##$8000_0000      'non-blocked FIFO filling at RDFAST
                waitx   #3                      'guestimate hub rotation, tweak to suit
                getct   tick0                   'reference time
                wrlong  parm, #0
                getct   tick1                   '5 ticks (3 for WRLONG, 2 for GETCT)
                rdfast  tpin, #$100
                getct   tick2                   '9 ticks (2 for RDFAST, 2 for GETCT)
                wrlong  parm, #0
                getct   tick3                   '19 ticks (5+16=21, so skipped a slot but 2 clocks early)
                wrlong  parm, #0
                getct   tick4                   '37 ticks (gap of 18, skipped another slot, alignment normal again)
                wrlong  parm, #0
                getct   tick5                   '45 ticks (regular 8-clock interval)
                ...
I don't know if this answers your question, but the FIFO fills to over 8 longs before a random RDxxxx/WRxxxx is allowed.