2 Cog DE0-Nano/CV-A2 Hubexec fifo broken

ozpropdev · 2016-06-05 03:40

Hi Chip
I've encountered an issue when trying to use hubexec on the Nano & BeMicro CV-A2 builds.
It seems that the hubexec fifo doesn't refill.
My hubexec code runs fine on both the DE2 and A9 builds.

I was able to reproduce the problem with the following code:

hubexec		mov	ax,##$c0ffee00
		loc	ptra,#@buffer
		rep	@.loop,#100
		wrlong	ax,ptra++
		add	ax,#1
		nop
		nop
		nop
		nop
		nop
		nop			' <<<< adding extra nops breaks hubexec
.loop
		ret

Hope it's an easy fix

Edit: Includes BeMicro CV-A2 too.

cgracey · 2016-06-05 05:14

I'm glad you found this. Does that NOP with the last comment break the REP block?

ozpropdev · 2016-06-05 05:52

cgracey wrote: »

Does that NOP with the last comment break the REP block?

Yes it does.

ozpropdev · 2016-06-05 06:06

Removing the REP block still shows the same issue.

hubexec2	mov	ax,##$c0ffee00
		loc	ptra,#@buffer
		mov	bx,#100
.loop		wrlong	ax,ptra++
		add	ax,#1
		nop
	'	nop			'<<<< breaks hubexec
		djnz	bx,#.loop
		ret

cgracey · 2016-06-05 20:46

I know what the problem is. The FIFO is underflowing. There are two variable metrics for the FIFO:

1) the FIFO "full" level, at which it stops issuing reads
2) the number of FIFO levels

We need at least five levels just to accommodate the eggbeater latency. We need an additional level for each slice. For smaller numbers of slices, perhaps below eight, we need additional levels. The "full" level probably needs to be increased for fewer than eight slices, as well.

So, the question is: What is the formula for determining the "full" level and the number of levels for all 1/2/4/8/16 slice counts?

Rayman · 2016-06-05 21:26

Or just exclude hubexec from small versions?

Ok I take that back, not a good idea

jmg · 2016-06-05 21:51

cgracey wrote: »

So, the question is: What is the formula for determining the "full" level and the number of levels for all 1/2/4/8/16 slice counts.

I take it that was a rhetorical question, as you are in a far better position to judge any Origin+Slope formula than anyone on here.... ?

Seairth · 2016-06-05 22:17

I don't get it. Why do you need to increase the FIFO levels for fewer slices?

jmg · 2016-06-05 22:33

Seairth wrote: »

I don't get it. Why do you need to increase the FIFO levels for fewer slices?

I'm not following the details either, but Chip mentioned underflow, which suggests you need some min cycles to load the FIFO and then some fill rate after that.
I'm also unclear if a jump within the FIFO does anything clever, or if any jump always refills the FIFO, but the 'added nop' nature of the failure suggests this is a boundary condition.

cgracey · 2016-06-05 22:53

Seairth wrote: »

I don't get it. Why do you need to increase the FIFO levels for fewer slices?

Because of the fixed five clock/level latency from read-issue to FIFO-entry.

I need to figure this out. It's somewhat of a brain bender, at this point.

evanh · 2016-06-05 23:11

jmg wrote: »

... I'm also unclear if a jump within the FIFO does anything clever, or if any jumps always refills the FIFO ...

HubExec branching, including a REP loop, always stalls to reload the FIFO. So that'll reset everything on each branch.

I'm guessing Oz's above failures depend on any preceding inline instructions ahead of the "hubexec/hubexec2" label. Ie: Shifting the tipping NOP to an earlier position in the instructions will still fail.

evanh · 2016-06-05 23:29

Or, a variation of this, the REP instruction directly acts like a branch, causing a FIFO reload before the loop starts, and the FIFO is only 7 or 8 deep. That would trip the flaw within what's listed.

Cluso99 · 2016-06-06 03:05

I wonder if the eggbeater needs to be disabled for 1 and 2 cog variants???

For our testing, perhaps the 2 cog variants could use a 4 cog egg beater with only 2 cogs physical might be more realistic?

evanh · 2016-06-06 03:52

Chip just has to solve the formula for minimum FIFO depth is all.

cgracey · 2016-06-06 06:04

evanh wrote: »

Chip just has to solve the formula for minimum FIFO depth is all.

That's right. And what is the "full" level, at which point reads cease.

Electrodude · 2016-06-06 13:48

Does this happen if there are no hubram accesses inside the loop? What if you remove the wrlong and have it blink an LED instead?

Why are even 5 levels necessary for the two cog version?

ozpropdev · 2016-06-06 14:40

Electrodude wrote: »

Does this happen if there are no hubram accesses inside the loop? What if you remove the wrlong and have it blink an LED instead?

Yes, Hubexec still breaks without hub access instructions in the code.

cgracey · 2016-06-06 20:07

This FIFO thing is a real brain-bender.

I've resorted to making a simulator that runs on Prop1 and outputs to the serial terminal built into the Propeller Tool. Right now, I'm trying to be sure that I'm modelling the 16-cog case correctly, which I know works (but maybe actually isn't bullet-proof, yet), so that I can try out cases of fewer cogs.

jmg · 2016-06-06 20:51

ozpropdev wrote: »

Yes, Hubexec still breaks without hub access instructions in the code.

If you change the code to give a domino ripple of pin-signals you can scope, where does it fail ?
ie if the loop is longer, does it get to the jump before failing, or is it some number of instructions that fails ?

Is there any zone to this - ie if there are more opcodes before jump, is that ok as some window effect ?
Is it only DJNZ, or do all loop (+reload) cause this ?

cgracey · 2016-06-06 22:37

I think I finally nailed this FIFO matter. Without having made a simulator that fires random blasts of reads, interspersed with random periods of rest, I don't know how I could have figured this out. Now that I ran all cases of cog counts, a pattern has emerged:

The "full" FIFO level, at which point the cog FIFO quits issuing contiguous reads to the hub RAM, is #cogs + 6. The number of FIFO levels needed is #cogs + 11.
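That pattern is easy to check against the constants in the simulator's CON block. A minimal sketch in Python, used here only for illustration: the helper name `fifo_params` is hypothetical, and only the two offsets (+6 and +11) come from the post.

```python
def fifo_params(cogs):
    """Return (full, levels) for a cog count, using full = cogs + 6 and levels = cogs + 11."""
    return cogs + 6, cogs + 11

# Print the values for every cog count discussed in the thread.
for cogs in (1, 2, 4, 8, 16):
    full, levels = fifo_params(cogs)
    print(f"{cogs:2d} cogs: full = {full:2d}, levels = {levels:2d}")
```

The results line up with the `full` and `limit` values listed for each cog count in the simulator source.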

There is a 6-clock delay between issuing a hub RAM read and having the data into the FIFO. That's what necessitates all these FIFO levels, which are a lot more than I first understood were necessary. The current FPGA releases all have insufficient FIFOs in them.

I need to do recompiles on everything now. That will be version 9b.

Here is the Prop1 program I wrote to simulate the FIFO activity. It uses the serial terminal built into the Propeller Tool:

' - Simulator for Prop2 Eggbeater FIFO
' - used to determine 'full' point and FIFO depth 

CON

  _clkmode = xtal1 + pll16x
  _xinfreq = 5_000_000

  cogs = 16, full = 22, limit = 27      '16 cogs
' cogs = 8,  full = 14, limit = 19      '8 cogs
' cogs = 4,  full = 10, limit = 15      '4 cogs
' cogs = 2,  full = 8, limit = 13       '2 cogs
' cogs = 1,  full = 7, limit = 12       '1 cog

OBJ

  text: "FullDuplexSerial"


VAR

  long hub, engaged, incoming, level, lowlevel, highlevel
  long rnd, read, reps, trap

  
PUB start

  'start terminal
  text.start(31, 30, 0, 115200)

  'init variables
  hub := 6
  engaged := 1
  incoming := %11111
  level := 1
  lowlevel := 1
  highlevel := 1

  'simulate random blasts of FIFO reads
  repeat
    read := rnd? & 1
    reps := (||rnd? // 30) + 1

    repeat reps

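      if level => full                          'FIFO has reached the "full" level: stop issuing reads
        engaged := 0
      elseif engaged or (hub & (cogs-1)) == 0   'already streaming, or the hub has come around to our slice
        incoming |= $20                         'issue a read; it enters the top of the incoming pipeline
        engaged := 1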

      report

      if incoming & 1
        level++                

      if read
        level--

      if level < lowlevel
        lowlevel := level

      if level > highlevel
        highlevel := level

      hub++
      incoming >>= 1


PRI report

  text.hex(hub,1)           'hub
  text.tx(32)  
  text.bin(engaged,1)       'engaged
  text.tx(32)  
  text.bin(read,1)          'read
  text.tx(32)  
  text.bin(incoming,5)      'incoming
  text.tx(9)
  text.dec(level)           'level
  text.tx(9)
  text.dec(lowlevel)        'low level
  text.tx(9)
  text.dec(highlevel)       'high level
  text.tx(13)

  if level < 1 or level > limit or trap
    trap++
    if trap == 50
      abort
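For anyone without a Prop1 on hand, the same model can be approximated on a PC. The following is a rough Python port, offered as a sketch rather than an exact equivalent: it uses Python's `random` module instead of Spin's `?` operator, so the read patterns differ from the original run, it stops at the first out-of-range level instead of printing 50 more lines, and the function name `simulate` and its `steps` and `seed` parameters are assumptions added for illustration.

```python
import random

def simulate(cogs, full, limit, steps=20_000, seed=1):
    """Run random bursts of FIFO reads; return (lowlevel, highlevel, ok)."""
    rng = random.Random(seed)
    hub, engaged, incoming, level = 6, 1, 0b11111, 1
    low = high = 1
    done = 0
    while done < steps:
        read = rng.getrandbits(1)        # this burst either reads every clock or rests
        reps = rng.randint(1, 30)        # burst length, 1..30 clocks
        for _ in range(reps):
            if level >= full:
                engaged = 0              # reached the "full" level: stop issuing reads
            elif engaged or (hub & (cogs - 1)) == 0:
                incoming |= 0x20         # issue a read into the top of the pipeline
                engaged = 1
            if level < 1 or level > limit:
                return low, high, False  # underflow or overflow, same check as 'report'
            if incoming & 1:
                level += 1               # a long arrives in the FIFO
            if read:
                level -= 1               # the consumer takes a long
            low = min(low, level)
            high = max(high, level)
            hub += 1
            incoming >>= 1
            done += 1
    return low, high, True

if __name__ == "__main__":
    for cogs in (1, 2, 4, 8, 16):
        print(cogs, simulate(cogs, full=cogs + 6, limit=cogs + 11))
```

Passing a smaller `full` or `limit` than the values above is a quick way to experiment with where the model starts to underflow or overflow.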

jmg · 2016-06-06 23:08

cgracey wrote: »

The "full" FIFO level, at which point the cog FIFO quits issuing contiguous reads to the hub RAM, is #cogs + 6. The number of FIFO levels needed is #cogs + 11.

There is a 6-clock delay between issuing a hub RAM read and having the data into the FIFO. That's what necessitates all these FIFO levels, which are a lot more than I first understood were necessary. The current FPGA releases all have insufficient FIFOs in them.

What is the FIFO buying here, above what a wait-counter would also give ?
eg Can the software jump-ahead within the FIFO, and not need a reload, or does any branch that is not in-line, need to trigger a pause+reload ?

I'm unclear around if the FIFO has to wait and reload on every branch, what that storage is gaining over a wait and smaller fifo ?

User Name · 2016-06-06 23:15

One of the most interesting posts ever seen on the Parallax forum, imho.

ozpropdev · 2016-06-06 23:21

Chip
I'm glad you had success fixing the fifo mechanism(s).

Re: New compiles for V9b
A nice feature on the A2 build was the use of the 6 spare leds on the board for P5..P0
Can you do the same for the Nano build too?

cgracey · 2016-06-06 23:28

jmg wrote: »

cgracey wrote: »

The "full" FIFO level, at which point the cog FIFO quits issuing contiguous reads to the hub RAM, is #cogs + 6. The number of FIFO levels needed is #cogs + 11.

There is a 6-clock delay between issuing a hub RAM read and having the data into the FIFO. That's what necessitates all these FIFO levels, which are a lot more than I first understood were necessary. The current FPGA releases all have insufficient FIFOs in them.

What is the FIFO buying here, above what a wait-counter would also give ?
eg Can the software jump-ahead within the FIFO, and not need a reload, or does any branch that is not in-line, need to trigger a pause+reload ?

I'm unclear around if the FIFO has to wait and reload on every branch, what that storage is gaining over a wait and smaller fifo ?

The FIFO only needs to get one long into it for hub exec to resume after a branch. So, the cog must wait for its slice of interest, issue a read (first of many, unless a branch occurs), and then six clocks later the new stream of longs is available for execution.
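As a rough illustration of that branch cost, here is a small Python sketch. It assumes the hub advances one slice per clock (as `hub++` does in the simulator) and that the cog simply waits for the slice holding the branch target. The function name `branch_resume_delay` and the slice arguments are hypothetical; only the six-clock read latency comes from the post.

```python
READ_LATENCY = 6  # clocks from issuing a hub read to the long being in the FIFO (per the post)

def branch_resume_delay(cogs, current_slice, target_slice):
    """Clocks until the first long of the new stream is in the FIFO (simplified model)."""
    wait_for_slice = (target_slice - current_slice) % cogs  # 0 .. cogs-1 clocks
    return wait_for_slice + READ_LATENCY

print(branch_resume_delay(16, 3, 3))  # slice already lined up
print(branch_resume_delay(16, 4, 3))  # slice just missed
```

Under these assumptions the delay ranges from 6 clocks to cogs + 5 clocks, which is also why fewer cogs means a shorter worst-case wait after a branch.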

The FIFO acts as a flow regulator. Once queued up, it can deliver any pattern of sequential bytes, words, or longs on each clock.

In the case of hub execution, instruction longs are requested no faster than clock/2. The FIFO just keeps passing longs, in sequence, no matter how long each instruction takes. The FIFO is performing a vital function here.

jmg · 2016-06-06 23:47

cgracey wrote: »

The FIFO only needs to get one long into it for hub exec to resume after a branch. So, the cog must wait for its slice of interest, issue a read (first of many, unless a branch occurs), and then six clocks later the new stream of longs is available for execution.

This 6 clock addition, is because the FIFO is not a classic, async fall-thru fifo, but is more a dual-port-RAM FIFO using two counters ?

cgracey wrote: »

In the case of hub execution, instruction longs are requested no faster than clock/2...

I guess that is the killer detail, the HUB runs faster than the COG can ever use, so some storage is needed.
.. and that jumps about with phase and opcode actual times too...

It's a pity with all that queue resource, that you cannot jump within the queue....

Does this run a Wait-Counter and a FIFO, or just a FIFO ? - it seems a Wait-Counter could allow a smaller FIFO ?

evanh · 2016-06-06 23:52

jmg wrote: »

This 6 clock addition, is because the FIFO is not a classic, async fall-thru fifo, but is more a dual-port-RAM FIFO using two counters ?

Nope, an unbuffered RDLONG takes just as long if I'm reading correctly.

jmg · 2016-06-07 00:27

cgracey wrote: »

The "full" FIFO level, at which point the cog FIFO quits issuing contiguous reads to the hub RAM, is #cogs + 6. The number of FIFO levels needed is #cogs + 11.

Addit: Thinking some more, maybe FIFO underflow should also auto-wait ?

I can see a minus of lowering the FIFO from the highest possible needed value, is that would add more jitter (tho there is always branch jitter anyway..?)
A benefit of auto-wait on underflow, is if there is some missed test case here of some rare opcode size/hub combination, then it tolerates that, rather than failing as above tests do.

cgracey · 2016-06-07 01:46

jmg wrote: »

cgracey wrote: »

The "full" FIFO level, at which point the cog FIFO quits issuing contiguous reads to the hub RAM, is #cogs + 6. The number of FIFO levels needed is #cogs + 11.

Addit: Thinking some more, maybe FIFO underflow should also auto-wait ?

I can see a minus of lowering the FIFO from the highest possible needed value, is that would add more jitter (tho there is always branch jitter anyway..?)
A benefit of auto-wait on underflow, is if there is some missed test case here of some rare opcode size/hub combination, then it tolerates that, rather than failing as above tests do.

Because the FIFO feeds the streamer, it is not possible to have waits. The streamer needs its data right away.

evanh · 2016-06-07 02:16

Yeah, and HubRAM can keep up no problem.

jmg · 2016-06-07 02:25

cgracey wrote: »

Because the FIFO feeds the streamer, it is not possible to have waits. The streamer needs its data right away.

ok.
Does that mean the P2 design needs to be very sure that this revised #cogs + 11, is the max ever possible needed ?

cgracey · 2016-06-07 02:28

jmg wrote: »

cgracey wrote: »

Because the FIFO feeds the streamer, it is not possible to have waits. The streamer needs its data right away.

ok.
Does that mean the P2 design needs to be very sure that this revised #cogs + 11, is the max ever possible needed ?

The streamer could stress it in such a way that all those FIFO levels are needed.
