2 Cog DE0-Nano/CV-A2 Hubexec fifo broken
ozpropdev
Hi Chip
I've encountered an issue when trying to use hubexec on the Nano & BeMicro CV-A2 builds.
It seems that the hubexec fifo doesn't refill.
My hubexec code runs fine on both the DE2 and A9 builds.
I was able to reproduce the problem in the following code..
hubexec         mov     ax,##$c0ffee00
                loc     ptra,#@buffer
                rep     @.loop,#100
                wrlong  ax,ptra++
                add     ax,#1
                nop
                nop
                nop
                nop
                nop
                nop                     ' <<<< adding extra nops breaks hubexec
.loop           ret
Hope it's an easy fix
Edit: Includes BeMicro CV-A2 too.
Comments
1) the FIFO "full" level, at which it stops issuing reads
2) the number of FIFO levels
We need at least five levels just to accommodate the eggbeater latency. We need an additional level for each slice. For smaller numbers of slices, perhaps below eight, we need additional levels. The "full" level probably needs to be increased for less than eight slices, as well.
So, the question is: What is the formula for determining the "full" level and the number of levels for all 1/2/4/8/16 slice counts?
Ok I take that back, not a good idea
I'm also unclear whether a jump within the FIFO does anything clever, or whether any jump always refills the FIFO, but the 'added nop' nature of the failure suggests this is a boundary condition.
Because of the fixed five clock/level latency from read-issue to FIFO-entry.
I need to figure this out. It's somewhat of a brain bender, at this point.
I'm guessing Oz's above failures depend on any preceding inline instructions ahead of the "hubexec/hubexec2" label. Ie: Shifting the tipping NOP to an earlier position in the instructions will still fail.
For our testing, perhaps the 2-cog variants could use a 4-cog eggbeater with only 2 physical cogs, which might be more realistic?
That's right. And what is the "full" level, at which point reads cease?
Why are even 5 levels necessary for the two cog version?
I've resorted to making a simulator that runs on Prop1 and outputs to the serial terminal built into the Propeller Tool. Right now, I'm trying to be sure that I'm modelling the 16-cog case correctly, which I know works (but maybe actually isn't bullet-proof, yet), so that I can try out cases of fewer cogs.
ie if the loop is longer, does it get to the jump before failing, or is it some number of instructions that fails ?
Is there any zone to this - ie if there are more opcodes before jump, is that ok as some window effect ?
Is it only DJNZ, or do all loops (+reload) cause this ?
The "full" FIFO level, at which point the cog FIFO quits issuing contiguous reads to the hub RAM, is #cogs + 6. The number of FIFO levels needed is #cogs + 11.
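As a quick illustration of those two formulas, here is a small Python sketch that tabulates them for the 1/2/4/8/16 cog builds discussed above. The function name `fifo_sizing` is made up for this example; only the `#cogs + 6` and `#cogs + 11` expressions come from the post.

```python
def fifo_sizing(cogs):
    """Return (full_level, total_levels) using the formulas quoted in the post.

    full_level:   level at which the FIFO stops issuing hub RAM reads (#cogs + 6)
    total_levels: number of FIFO levels needed (#cogs + 11)
    """
    full_level = cogs + 6
    total_levels = cogs + 11
    return full_level, total_levels


# Tabulate the cog counts mentioned in the thread.
for cogs in (1, 2, 4, 8, 16):
    full_level, total_levels = fifo_sizing(cogs)
    print(f"{cogs:2d} cogs: full at {full_level:2d}, {total_levels:2d} levels")
```

For the 2-cog Nano/CV-A2 builds this gives a "full" level of 8 and 13 levels in total.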
There is a 6-clock delay between issuing a hub RAM read and having the data into the FIFO. That's what necessitates all these FIFO levels, which are a lot more than I first understood were necessary. The current FPGA releases all have insufficient FIFOs in them.
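To see why the read delay pushes the level count above the "full" level, here is a toy Python model. It is only a sketch under stated assumptions, not the Prop1 simulator mentioned in this thread: it assumes one contiguous read can be issued per clock, that the "full" check happens after that clock's data has landed, and that nothing is draining the FIFO (the worst case for overflow). The name `peak_occupancy` is hypothetical.

```python
from collections import deque


def peak_occupancy(cogs, latency=6, clocks=200):
    """Toy model: highest FIFO level reached when nothing drains the FIFO.

    Assumptions (not taken from the real hardware):
      - one read may be issued on every clock,
      - a read issued at clock t lands in the FIFO at clock t + latency,
      - the "full" test is made after the current clock's data has landed.
    """
    full = cogs + 6              # "full" level from the post
    level = 0                    # longs currently held in the FIFO
    in_flight = deque()          # arrival clocks of reads already issued
    peak = 0

    for t in range(clocks):
        # Data from reads issued `latency` clocks ago arrives now.
        while in_flight and in_flight[0] == t:
            in_flight.popleft()
            level += 1
        peak = max(peak, level)

        # Keep issuing reads until the FIFO reports "full".
        if level < full:
            in_flight.append(t + latency)

    return peak


for cogs in (1, 2, 4, 8, 16):
    print(cogs, "cogs -> peak level", peak_occupancy(cogs))
```

In this model, reads that are already in flight when the FIFO reaches "full" still have to land somewhere, so the peak sits five levels above "full", which lines up with the `#cogs + 11` figure.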
I need to do recompiles on everything now. That will be version 9b.
Here is the Prop1 program I wrote to simulate the FIFO activity. It uses the serial terminal built into the Propeller Tool:
What is the FIFO buying here, above what a wait-counter would also give ?
eg Can the software jump-ahead within the FIFO, and not need a reload, or does any branch that is not in-line, need to trigger a pause+reload ?
If the FIFO has to wait and reload on every branch, I'm unclear what that storage gains over a wait and a smaller FIFO ?
I'm glad you had success fixing the fifo mechanism(s).
Re: New compiles for V9b
A nice feature on the A2 build was the use of the 6 spare leds on the board for P5..P0
Can you do the same for the Nano build too?
The FIFO only needs to get one long into it for hub exec to resume after a branch. So, the cog must wait for its slice of interest, issue a read (first of many, unless a branch occurs), and then six clocks later the new stream of longs is available for execution.
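The branch restart described here can be put into numbers with a short Python sketch. It assumes the wanted slice comes around once every #cogs clocks, so the wait before the read can be issued is anywhere from 0 to #cogs - 1 clocks; that range is an assumption for illustration, and the function names are made up. Only the six-clock figure comes from the post.

```python
READ_LATENCY = 6  # clocks from issuing a hub read to data in the FIFO (from the post)


def resume_delay(cogs, clocks_until_slice):
    """Clocks from a branch until the first new long is in the FIFO.

    clocks_until_slice is how long the cog waits for its slice of interest;
    this sketch assumes it lies in the range 0 .. cogs - 1.
    """
    if not 0 <= clocks_until_slice < cogs:
        raise ValueError("slice wait must be between 0 and cogs - 1")
    return clocks_until_slice + READ_LATENCY


def resume_delay_range(cogs):
    """Best and worst case restart delay under the assumption above."""
    return resume_delay(cogs, 0), resume_delay(cogs, cogs - 1)


for cogs in (2, 16):
    best, worst = resume_delay_range(cogs)
    print(f"{cogs:2d} cogs: {best} to {worst} clocks after a branch")
```

Under these assumptions a 2-cog build resumes 6 to 7 clocks after a branch, and a 16-cog build 6 to 21 clocks.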
The FIFO acts as a flow regulator. Once queued up, it can deliver any pattern of sequential bytes, words, or longs on each clock.
In the case of hub execution, instruction longs are requested no faster than clock/2. The FIFO just keeps passing longs, in sequence, no matter how long each instruction takes. The FIFO is performing a vital function here.
I guess that is the killer detail, the HUB runs faster than the COG can ever use, so some storage is needed.
.. and that jumps about with phase and opcode actual times too...
It's a pity with all that queue resource, that you cannot jump within the queue....
Does this run a Wait-Counter and a FIFO, or just a FIFO ? - it seems a Wait-Counter could allow a smaller FIFO ?
Addit: Thinking some more, maybe FIFO underflow should also auto-wait ?
One minus I can see in lowering the FIFO below the highest value that could ever be needed is that it would add more jitter (tho there is always branch jitter anyway..?)
A benefit of auto-wait on underflow, is if there is some missed test case here of some rare opcode size/hub combination, then it tolerates that, rather than failing as above tests do.
Because the FIFO feeds the streamer, it is not possible to have waits. The streamer needs its data right away.
Does that mean the P2 design needs to be very sure that this revised #cogs + 11 is the max that could ever be needed ?
The streamer could stress it in such a way that all those FIFO levels are needed.