2 Cog DE0-Nano/CV-A2 Hubexec fifo broken — Parallax Forums

ozpropdev Posts: 2,745
edited 2016-06-05 03:50 in Propeller 2
Hi Chip
I've encountered an issue when trying to use hubexec on the Nano & BeMicro CV-A2 builds.
It seems that the hubexec fifo doesn't refill.
My hubexec code runs fine on both the DE2 and A9 builds.

I was able to reproduce the problem with the following code:
hubexec		mov	ax,##$c0ffee00
		loc	ptra,#@buffer
		rep	@.loop,#100
		wrlong	ax,ptra++
		add	ax,#1
		nop
		nop
		nop
		nop
		nop
		nop			' <<<< adding extra nops breaks hubexec
.loop
		ret

Hope it's an easy fix :)

Edit: Includes BeMicro CV-A2 too.

Comments

  • cgracey Posts: 13,402
    I'm glad you found this. Does that NOP with the last comment break the REP block?
  • cgracey wrote: »
    Does that NOP with the last comment break the REP block?
    Yes it does.


  • Removing the REP block still shows the same issue.
    hubexec2	mov	ax,##$c0ffee00
    		loc	ptra,#@buffer
    		mov	bx,#100
    .loop		wrlong	ax,ptra++
    		add	ax,#1
    		nop
    	'	nop			'<<<< breaks hubexec
    		djnz	bx,#.loop
    		ret
    
  • cgracey Posts: 13,402
    I know what the problem is. The FIFO is underflowing. There are two variable metrics for the FIFO:

    1) the FIFO "full" level, at which it stops issuing reads
    2) the number of FIFO levels

    We need at least five levels just to accommodate the eggbeater latency. We need an additional level for each slice. For smaller numbers of slices, perhaps below eight, we need additional levels. The "full" level probably needs to be increased for less than eight slices, as well.

    So, the question is: What is the formula for determining the "full" level and the number of levels for all 1/2/4/8/16 slice counts.
  • Rayman Posts: 11,840
    edited 2016-06-05 21:28
    Or just exclude hubexec from small versions?

    Ok I take that back, not a good idea
  • jmg Posts: 14,595
    cgracey wrote: »
    So, the question is: What is the formula for determining the "full" level and the number of levels for all 1/2/4/8/16 slice counts.
    I take it that was a rhetorical question, as you are in a far better position to judge any Origin+Slope formula than anyone on here.... ?
  • I don't get it. Why do you need to increase the FIFO levels for fewer slices?
  • jmg Posts: 14,595
    Seairth wrote: »
    I don't get it. Why do you need to increase the FIFO levels for fewer slices?
    I'm not following the details either, but Chip mentioned underflow, which suggests you need some min cycles to load the FIFO and then some fill rate after that.
    I'm also unclear if a jump within the FIFO does anything clever, or if any jumps always refills the FIFO, but the 'added nop' failure nature suggests this is a boundary condition.

  • cgracey Posts: 13,402
    Seairth wrote: »
    I don't get it. Why do you need to increase the FIFO levels for fewer slices?

    Because of the fixed five clock/level latency from read-issue to FIFO-entry.

    I need to figure this out. It's somewhat of a brain bender, at this point.
  • evanh Posts: 10,419
    edited 2016-06-05 23:18
    jmg wrote: »
    ... I'm also unclear if a jump within the FIFO does anything clever, or if any jumps always refills the FIFO ...
    HubExec branching, including a REP loop, always stalls to reload the FIFO. So that'll reset everything on each branch.

    I'm guessing Oz's above failures depend on any preceding inline instructions ahead of the "hubexec/hubexec2" label. Ie: Shifting the tipping NOP to an earlier position in the instructions will still fail.
  • evanh Posts: 10,419
    edited 2016-06-05 23:32
    Or, a variation of this, the REP instruction directly acts like a branch, causing a FIFO reload before the loop starts, and the FIFO is only 7 or 8 deep. That would trip the flaw within what's listed.
  • Cluso99 Posts: 17,460
    I wonder if the eggbeater needs to be disabled for 1 and 2 cog variants???

    For our testing, perhaps having the 2-cog variants use a 4-cog eggbeater with only 2 physical cogs might be more realistic?
  • evanh Posts: 10,419
    Chip just has to solve the formula for minimum FIFO depth is all.
  • cgracey Posts: 13,402
    evanh wrote: »
    Chip just has to solve the formula for minimum FIFO depth is all.

    That's right. And what is the "full" level, at which point reads cease.
  • Does this happen if there are no hubram accesses inside the loop? What if you remove the wrlong and have it blink an LED instead?

    Why are even 5 levels necessary for the two cog version?
  • Does this happen if there are no hubram accesses inside the loop? What if you remove the wrlong and have it blink an LED instead?
    Yes, Hubexec still breaks without hub access instructions in the code.

  • cgracey Posts: 13,402
    This FIFO thing is a real brain-bender.

    I've resorted to making a simulator that runs on Prop1 and outputs to the serial terminal built into the Propeller Tool. Right now, I'm trying to be sure that I'm modelling the 16-cog case correctly, which I know works (but maybe actually isn't bullet-proof, yet), so that I can try out cases of fewer cogs.
  • jmg Posts: 14,595
    ozpropdev wrote: »
    Yes, Hubexec still breaks without hub access instructions in the code.
    If you change the code to give a domino ripple of pin-signals you can scope, where does it fail ?
    ie if the loop is longer, does it get to the jump before failing, or is it some number of instructions that fails ?

    Is there any zone to this - ie if there are more opcodes before jump, is that ok as some window effect ?
    Is it only DJNZ, or do all loop (+reload) cause this ?

  • cgracey Posts: 13,402
    edited 2016-06-06 22:55
    I think I finally nailed this FIFO matter. Without having made a simulator that fires random blasts of reads, interspersed with random periods of rest, I don't know how I could have figured this out. Now that I ran all cases of cog counts, a pattern has emerged:

    The "full" FIFO level, at which point the cog FIFO quits issuing contiguous reads to the hub RAM, is #cogs + 6. The number of FIFO levels needed is #cogs + 11.

    There is a 6-clock delay between issuing a hub RAM read and having the data into the FIFO. That's what necessitates all these FIFO levels, which are a lot more than I first understood were necessary. The current FPGA releases all have insufficient FIFOs in them.
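
As a rough illustration of the two rules stated above, here is a small Python sketch (not from the thread; the function name `fifo_sizing` is made up here). It simply evaluates "#cogs + 6" and "#cogs + 11" for each cog count mentioned in the discussion, and the results can be compared against the `full` and `limit` constants in the simulator listing.

```python
def fifo_sizing(cogs):
    """Return (full_level, fifo_levels) per the rule quoted in the post.

    full_level  = cogs + 6   (level at which the FIFO stops issuing reads)
    fifo_levels = cogs + 11  (total FIFO depth needed)
    """
    return cogs + 6, cogs + 11


if __name__ == "__main__":
    # Cog counts discussed in the thread: 1/2/4/8/16
    for cogs in (1, 2, 4, 8, 16):
        full, levels = fifo_sizing(cogs)
        print(f"{cogs:2d} cogs: full = {full:2d}, levels = {levels:2d}")
```

Note that the gap between the two numbers is always 5, whatever the cog count.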

    I need to do recompiles on everything now. That will be version 9b.

    Here is the Prop1 program I wrote to simulate the FIFO activity. It uses the serial terminal built into the Propeller Tool:
    ' - Simulator for Prop2 Eggbeater FIFO
    ' - used to determine 'full' point and FIFO depth 
    
    CON
    
      _clkmode = xtal1 + pll16x
      _xinfreq = 5_000_000
    
      cogs = 16, full = 22, limit = 27      '16 cogs
    ' cogs = 8,  full = 14, limit = 19      '8 cogs
    ' cogs = 4,  full = 10, limit = 15      '4 cogs
    ' cogs = 2,  full = 8, limit = 13       '2 cogs
    ' cogs = 1,  full = 7, limit = 12       '1 cog
    
    OBJ
    
      text: "FullDuplexSerial"
    
    
    VAR
    
      long hub, engaged, incoming, level, lowlevel, highlevel
      long rnd, read, reps, trap
    
      
    PUB start
    
      'start terminal
      text.start(31, 30, 0, 115200)
    
      'init variables
      hub := 6
      engaged := 1
      incoming := %11111
      level := 1
      lowlevel := 1
      highlevel := 1
    
      'simulate random blasts of FIFO reads
      repeat
        read := rnd? & 1
        reps := (||rnd? // 30) + 1
    
        repeat reps
    
          if level => full
            engaged := 0
          elseif engaged or (hub & (cogs-1)) == 0
            incoming |= $20
            engaged := 1
    
          report
    
          if incoming & 1
            level++                
    
          if read
            level--
    
          if level < lowlevel
            lowlevel := level
    
          if level > highlevel
            highlevel := level
    
          hub++
          incoming >>= 1
    
    
    PRI report
    
      text.hex(hub,1)           'hub
      text.tx(32)  
      text.bin(engaged,1)       'engaged
      text.tx(32)  
      text.bin(read,1)          'read
      text.tx(32)  
      text.bin(incoming,5)      'incoming
      text.tx(9)
      text.dec(level)           'level
      text.tx(9)
      text.dec(lowlevel)        'low level
      text.tx(9)
      text.dec(highlevel)       'high level
      text.tx(13)
    
      if level < 1 or level > limit or trap
        trap++
        if trap == 50
          abort
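
For readers without a Prop1 to hand, the loop above can be approximated in Python. This is a hedged port, not Chip's code: Python's `random` module stands in for Spin's `?` operator (so the random sequences differ), the serial report is dropped, and the function name `simulate_fifo` and its `steps`/`seed` parameters are invented for this sketch. It returns the lowest and highest FIFO levels seen instead of trapping.

```python
import random


def simulate_fifo(cogs, full, steps=50_000, seed=1):
    """Approximate port of the Spin loop; returns (lowest, highest) level."""
    rng = random.Random(seed)
    hub = 6               # hub slice counter, same start value as the Spin code
    engaged = True        # FIFO is currently issuing contiguous reads
    incoming = 0b11111    # reads already in flight toward the FIFO
    level = low = high = 1
    done = 0
    while done < steps:
        read = rng.getrandbits(1)      # does the cog consume a long per clock?
        reps = rng.randrange(30) + 1   # burst length of 1..30 clocks
        for _ in range(reps):
            if level >= full:
                engaged = False
            elif engaged or (hub & (cogs - 1)) == 0:
                incoming |= 0x20       # issue a read
                engaged = True
            if incoming & 1:           # data from an earlier read arrives
                level += 1
            if read:                   # cog takes one long
                level -= 1
            low = min(low, level)
            high = max(high, level)
            hub += 1
            incoming >>= 1
            done += 1
    return low, high


if __name__ == "__main__":
    for cogs in (1, 2, 4, 8, 16):
        full, limit = cogs + 6, cogs + 11
        low, high = simulate_fifo(cogs, full)
        print(f"{cogs:2d} cogs: low = {low}, high = {high}, limit = {limit}")
```

With `full = cogs + 6`, the level in this model should never drop below 1 or rise above `cogs + 11`, which is the condition the Spin version traps on.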
    
  • jmg Posts: 14,595
    cgracey wrote: »
    The "full" FIFO level, at which point the cog FIFO quits issuing contiguous reads to the hub RAM, is #cogs + 6. The number of FIFO levels needed is #cogs + 11.

    There is a 6-clock delay between issuing a hub RAM read and having the data into the FIFO. That's what necessitates all these FIFO levels, which are a lot more than I first understood were necessary. The current FPGA releases all have insufficient FIFOs in them.

    What is the FIFO buying here, above what a wait-counter would also give ?
    eg Can the software jump-ahead within the FIFO, and not need a reload, or does any branch that is not in-line, need to trigger a pause+reload ?

    I'm unclear around if the FIFO has to wait and reload on every branch, what that storage is gaining over a wait and smaller fifo ?
  • One of the most interesting posts ever seen on the Parallax forum, imho.
  • Chip
    I'm glad you had success fixing the fifo mechanism(s).

    Re: New compiles for V9b
    A nice feature on the A2 build was the use of the 6 spare leds on the board for P5..P0
    Can you do the same for the Nano build too? :)
  • cgracey Posts: 13,402
    edited 2016-06-06 23:30
    jmg wrote: »
    cgracey wrote: »
    The "full" FIFO level, at which point the cog FIFO quits issuing contiguous reads to the hub RAM, is #cogs + 6. The number of FIFO levels needed is #cogs + 11.

    There is a 6-clock delay between issuing a hub RAM read and having the data into the FIFO. That's what necessitates all these FIFO levels, which are a lot more than I first understood were necessary. The current FPGA releases all have insufficient FIFOs in them.

    What is the FIFO buying here, above what a wait-counter would also give ?
    eg Can the software jump-ahead within the FIFO, and not need a reload, or does any branch that is not in-line, need to trigger a pause+reload ?

    I'm unclear around if the FIFO has to wait and reload on every branch, what that storage is gaining over a wait and smaller fifo ?

    The FIFO only needs to get one long into it for hub exec to resume after a branch. So, the cog must wait for its slice of interest, issue a read (first of many, unless a branch occurs), and then six clocks later the new stream of longs is available for execution.
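
A back-of-envelope sketch of that branch cost, in Python. Two things here are assumptions and are not stated in the thread: that on each clock cog `c` can reach slice `(c + clock) mod cogs`, and that the long at byte address `a` lives in slice `(a >> 2) mod cogs`. The six-clock figure is taken from the post above; the function and parameter names are made up for illustration.

```python
READ_TO_FIFO_CLOCKS = 6  # from the post: read issue -> data available in FIFO


def branch_reload_clocks(cogs, cog_id, clock, byte_addr):
    """Estimated clocks from a branch until the first long is usable."""
    target_slice = (byte_addr >> 2) % cogs    # assumed address-to-slice mapping
    current_slice = (cog_id + clock) % cogs   # assumed eggbeater rotation
    wait = (target_slice - current_slice) % cogs  # clocks until slice of interest
    return wait + READ_TO_FIFO_CLOCKS


if __name__ == "__main__":
    for cogs in (1, 2, 4, 8, 16):
        costs = [branch_reload_clocks(cogs, 0, clk, 0) for clk in range(cogs)]
        print(f"{cogs:2d} cogs: best {min(costs)}, worst {max(costs)} clocks")
```

Under these assumptions the wait for the slice of interest ranges from 0 to `cogs - 1` clocks, so fewer cogs means a smaller worst-case branch penalty in this model.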

    The FIFO acts as a flow regulator. Once queued up, it can deliver any pattern of sequential bytes, words, or longs on each clock.

    In the case of hub execution, instruction longs are requested no faster than clock/2. The FIFO just keeps passing longs, in sequence, no matter how long each instruction takes. The FIFO is performing a vital function here.
  • jmg Posts: 14,595
    cgracey wrote: »
    The FIFO only needs to get one long into it for hub exec to resume after a branch. So, the cog must wait for its slice of interest, issue a read (first of many, unless a branch occurs), and then six clocks later the new stream of longs is available for execution.
    This 6 clock addition, is because the FIFO is not a classic, async fall-thru fifo, but is more a dual-port-RAM FIFO using two counters ?
    cgracey wrote: »
    In the case of hub execution, instruction longs are requested no faster than clock/2...
    I guess that is the killer detail, the HUB runs faster than the COG can ever use, so some storage is needed.
    .. and that jumps about with phase and opcode actual times too...

    It's a pity with all that queue resource, that you cannot jump within the queue....

    Does this run a Wait-Counter and a FIFO, or just a FIFO ? - it seems a Wait-Counter could allow a smaller FIFO ?
  • evanh Posts: 10,419
    jmg wrote: »
    This 6 clock addition, is because the FIFO is not a classic, async fall-thru fifo, but is more a dual-port-RAM FIFO using two counters ?
    Nope, an unbuffered RDLONG takes just as long if I'm reading correctly.
  • jmg Posts: 14,595
    edited 2016-06-07 00:28
    cgracey wrote: »
    The "full" FIFO level, at which point the cog FIFO quits issuing contiguous reads to the hub RAM, is #cogs + 6. The number of FIFO levels needed is #cogs + 11.

    Addit: Thinking some more, maybe FIFO underflow should also auto-wait ?

    One minus I can see in lowering the FIFO depth from the highest possibly needed value is that it would add more jitter (tho there is always branch jitter anyway..?)
    A benefit of auto-wait on underflow, is if there is some missed test case here of some rare opcode size/hub combination, then it tolerates that, rather than failing as above tests do.

  • cgracey Posts: 13,402
    edited 2016-06-07 01:46
    jmg wrote: »
    cgracey wrote: »
    The "full" FIFO level, at which point the cog FIFO quits issuing contiguous reads to the hub RAM, is #cogs + 6. The number of FIFO levels needed is #cogs + 11.

    Addit: Thinking some more, maybe FIFO underflow should also auto-wait ?

    One minus I can see in lowering the FIFO depth from the highest possibly needed value is that it would add more jitter (tho there is always branch jitter anyway..?)
    A benefit of auto-wait on underflow, is if there is some missed test case here of some rare opcode size/hub combination, then it tolerates that, rather than failing as above tests do.

    Because the FIFO feeds the streamer, it is not possible to have waits. The streamer needs its data right away.
  • evanh Posts: 10,419
    Yeah, and HubRAM can keep up no problem.
  • jmg Posts: 14,595
    cgracey wrote: »
    Because the FIFO feeds the streamer, it is not possible to have waits. The streamer needs its data right away.
    ok.
    Does that mean the P2 design needs to be very sure that this revised #cogs + 11, is the max ever possible needed ?

  • cgracey Posts: 13,402
    jmg wrote: »
    cgracey wrote: »
    Because the FIFO feeds the streamer, it is not possible to have waits. The streamer needs its data right away.
    ok.
    Does that mean the P2 design needs to be very sure that this revised #cogs + 11, is the max ever possible needed ?

    The streamer could stress it in such a way that all those FIFO levels are needed.