Ringbuffer (was Streamer Questions - how to sync)

jmg · 2019-06-16 19:20

evanh wrote: »

The step change above 150 MHz means it is possible to adjust the compensation to suit. But the big improvement is that it is a clean step. There is no width to it.

Nice plots. Did you try the same tests on the upper port pins ?
Yes, that's nicely reduced (de-skewed) the variations in delay paths, likely caused by routing/prop delays.

It also means you should keep well clear of 150MHz and 300MHz, as those exact MHz will vary with PVT.
Luckily the 250MHz being talked about for HDMI is (hopefully) 'in clear air' ?
If you externally heat the P2, how much does that 300MHz step move down ?

Be interesting to run the identical tests on the next silicon revision, and I think Chip mentioned some de-skew work ?

msrobots · 2019-06-16 22:26

evanh wrote: »

msrobots wrote: »

so wrpin ##$1_0000, pinNumber sets the pin to registered independent of being just digital or smartpin?

Independent of smartpin, yep. Not so independent of digital mode. Some pin modes don't have the C bit but are still registered anyway.

do I need to set dirl before and dirh after?

No, that's used when setting up a smartpin. Smartpin config is all in the low 6 bits.
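To make the bit positions in that exchange concrete, here is a small Python model (not P2 code). It assumes only what is stated above: the value $1_0000 written with WRPIN sets the registered ("C") bit, and the smartpin mode sits in the low 6 bits. The constant and function names are made up for illustration.

```python
REGISTERED_BIT = 0x1_0000   # value used with WRPIN in the post above
SMARTPIN_MASK = 0x3F        # "low 6 bits", as stated in the reply above

def is_registered(mode):
    """True if the registered-I/O bit is set in a pin mode word."""
    return bool(mode & REGISTERED_BIT)

def smartpin_field(mode):
    """Extract the low 6 bits, where the smartpin config is said to live."""
    return mode & SMARTPIN_MASK
```

The two fields do not overlap, which is consistent with the answer that registering a pin is independent of the smartpin setting.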

what the hell is DJNF ?

Experience. I was like that reading Oz's code sometimes too.
DJNF turns false on rollover whereas DJNZ turns false on zero. Because they are pre-decrement instructions, Chip explicitly added DJNF as a new instruction just for zero case looping.

Ahh, loop one more, cool.
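The "loop one more" point can be checked with a small Python model (not P2 code). It assumes only the behaviour described above: both instructions decrement first, DJNZ stops looping when the result is zero, and DJNF stops when the result rolls over to $FFFF_FFFF. The helper names and the iteration cap are illustrative.

```python
MASK32 = 0xFFFF_FFFF

def loop_passes(start, stop_value, limit=1000):
    """Count body passes of a loop that ends with a pre-decrement branch."""
    d = start & MASK32
    passes = 0
    while passes < limit:
        passes += 1              # loop body runs once
        d = (d - 1) & MASK32     # pre-decrement with 32-bit wraparound
        if d == stop_value:      # branch not taken, loop ends
            return passes
    return None                  # did not finish within the cap

def djnz_passes(n):
    return loop_passes(n, 0)

def djnf_passes(n):
    return loop_passes(n, MASK32)
```

With a start value of 3 the DJNZ-style loop runs 3 times and the DJNF-style loop runs 4 times. With a start value of 0 the DJNF-style loop runs exactly once, while the DJNZ-style loop wraps around and does not finish within the cap.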

Well, I could set my clock pins to registered (on sending and receiving COGs); funnily, this had no measurable effect on the timing.

Then I set my receive pins to registered and had to adjust my timing by two clocks.

Then I set my send pins to registered and everything goes complete bonkers. So I backed out of that road. Registered outputs make sense on the same P2 but with multiple P2's it may not have that importance.

Since I do just have one P2 the output pins of one sender COG are the input Pins for the next receiver COG. In real usage with multiple P2's this would not happen. Maybe that is why it goes bonkers.
I have no clue at all what happens then: the FastSpin terminal closes and the P2 stops transmitting (blinking) as soon as I enable registered output on the sending COGs.

Another stupid question. There is a MUL opcode but no DIV? I just found QDIV, which uses the cordic and is not as intuitively usable as add or sub. What is that about?

confused,

Mike

jmg · 2019-06-16 22:55

msrobots wrote: »

Then I set my send pins to registered and everything goes complete bonkers. So I backed out of that road.

What clock speeds are you testing at ?
Given evanh's plots, you should keep well clear of the delay speed-bumps, which suggests testing sub ~100MHz, initially, and then maybe ~ 225MHz ?

msrobots · 2019-06-17 00:28

I run at 180 MHz; a 1/1 streamer clock is simply out of reach, especially when one needs to run longer wires. My guess is that 1/3 will work nicely on different clocks since I have 1/2 stable right now on one P2 without clock phase differences except even/odd COGs.

I do drive all the pins low to have no floating pins. That got me from 1/4 to 1/2.

And - yes - I will run some tests soon with different sys clock settings...

Enjoy!

Mike

ersmith · 2019-06-17 01:03

msrobots wrote: »

Another stupid question. There is a MUL opcode but no DIV? I just found QDIV, which uses the cordic and is not as intuitively usable as add or sub. What is that about?

confused,

Mike

Yes, there is no DIV opcode. Beware, MUL is only a 16 bit multiply, so it isn't really like ADD, SUB, and so on (which operate on the full 32 bits of the register). For 32 bit operations use QMUL and QDIV. They're actually pretty simple to use:

  ' calculate A = B x C
  qmul B, C ' start cordic multiply
  ' (optional: some non-cordic instructions can go here)
  getqx A  ' get result

  ' calculate A = B / C
  qdiv B, C
  getqx A

Note that QMUL is a little bit slower than manually constructing a 32 bit MUL out of 16 bit multiplies, so if you really want maximum speed then you'll want a subroutine that does 4 MULs of the various half-words (there's a thread somewhere in the forums showing the fastest multiply for P2). But QMUL is small and easy to do in assembly.
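As a rough illustration of building a wide multiply out of 16-bit pieces, here is a Python model (not P2 code). It assumes MUL acts as an unsigned 16 x 16 multiply of the low words, which is my reading of "only a 16 bit multiply"; the function names are hypothetical and this is not the fast routine from the forum thread mentioned above.

```python
MASK16 = 0xFFFF

def mul16(a, b):
    """Model of an unsigned 16 x 16 -> 32 bit multiply of the low words."""
    return (a & MASK16) * (b & MASK16)

def mul32_from_halfwords(b, c):
    """64-bit product of two 32-bit values from four 16-bit multiplies."""
    bl, bh = b & MASK16, (b >> 16) & MASK16
    cl, ch = c & MASK16, (c >> 16) & MASK16
    low = mul16(bl, cl)                        # low x low
    mid = mul16(bh, cl) + mul16(bl, ch)        # the two cross terms
    high = mul16(bh, ch)                       # high x high
    return low + (mid << 16) + (high << 32)
```

If only the low 32 bits of the result are needed, the high x high term shifts out entirely, so a truncated result needs one multiply fewer.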

evanh · 2019-06-17 01:59

Here's one I've found useful. It's short enough to inline even. I derived it from the posted examples in that topic a few months back.

'===============================================
'Integer multiply 32-bit x 16-bit.  pa = pa x pb.  10 clocks
'  input:  pa, pb
' result:  pa
'scratch:  temp1
'
mul32x16
		getword	temp1, pa, #1
		mul	temp1, pb		'pah * pbl
		mul	pa, pb			'pal * pbl
		shl	temp1, #16
_ret_		add	pa, temp1
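The routine above can be mirrored step by step in Python to check the arithmetic. This is a host-side model only, assuming MUL is an unsigned 16 x 16 multiply of the low words and that registers wrap at 32 bits; the Python names follow the assembly but are otherwise illustrative.

```python
MASK32 = 0xFFFF_FFFF

def mul_lo16(d, s):
    """Model of MUL: multiply the low 16 bits of each operand."""
    return ((d & 0xFFFF) * (s & 0xFFFF)) & MASK32

def mul32x16_model(pa, pb):
    """Python mirror of the five-instruction routine above."""
    temp1 = (pa >> 16) & 0xFFFF        # getword temp1, pa, #1
    temp1 = mul_lo16(temp1, pb)        # pah * pbl
    pa = mul_lo16(pa, pb)              # pal * pbl
    temp1 = (temp1 << 16) & MASK32     # shl temp1, #16
    return (pa + temp1) & MASK32       # add pa, temp1
```

For any 32-bit pa and 16-bit pb the model returns the low 32 bits of pa x pb. Upper bits of pb are ignored in this model, so a multiplier wider than 16 bits would give a wrong answer.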

evanh · 2019-06-17 02:15

msrobots wrote: »

Well, I could set my clock pins to registered (on sending and receiving COGs); funnily, this had no measurable effect on the timing.

Doh! I forgot to try that in my code. Definitely changes things ...

evanh · 2019-06-17 02:36

Okay, everything is registered this time:
[Image: registered data and handshake sampling at full sysclock.png]

evanh · 2019-06-17 03:12

Same again but with bottom side of P2ES board heated to around 65°C. Graph above is at room temperature of 20°C.
[Image: 60C - registered data and handshake sampling at full sysclock.png]

EDIT: Added 80°C graph. Hair dryer was starting to make extra noises!
[Image: 80C - registered data and handshake sampling at full sysclock.png]

evanh · 2019-06-17 03:31

That's so much cleaner now.

I didn't expect it to be so good to be honest. I'm now wondering if my much older tests had a flaw in them too.

PS: It has required another +2 to the handshake compensation. Which is actually convenient since the WAITX is a minimum of +2 by itself.

Yanomani · 2019-06-17 04:37

Hi msrobots

Focusing on the code you used in your tests, to what extent have you been depending on OUTNOT instructions (or DIRNOT, FLTNOT, DRVNOT) to flip (reverse) output (or clock) bit/pin states?

Could the delay (in sysclks) introduced by using registered input/output be causing adverse interference in any of your routines? I'm thinking of the time lag between executing an instruction that is intended to change some pin (or register) state, the moment that change becomes effective at the selected pin (or intermediate register), and a later moment when the now-changed pin (or register) state is expected to be stable enough to, say, feed the input of another ???NOT instruction.

I hope I didn't put so much filler in the sausage of my previous sentence that it has become incomprehensible...

Henrique

evanh · 2019-06-18 05:07

I've just rearranged the handshake from a single send-to-receive handshake, where the receiver task is sitting in wait, to one where there are two handshake lines and the receiver task is free to cycle and check in on, or be interrupted for, a needed DMA action.

In this arrangement the sender raises his handshake then goes into waiting for the receiver to respond. It could be done with a single handshake pin even. The receiver responds then pauses a countable number of clocks to align itself with the predictable tx data.

The notable info to come from this arrangement, where tx briefly waits, is a much larger start compensation than for the simpler arrangement where the receiver waits all the time.

Compensation values
 rx wait   tx wait
==================
   3         21

Note: Compensation 3 is WAITX #1, and 21 is WAITX #19.
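The relationship in that note can be written down as a tiny Python helper (not P2 code). It assumes the fixed offset of 2 comes from the earlier remark that "WAITX is a minimum of +2 by itself"; the function and constant names are illustrative.

```python
WAITX_OVERHEAD = 2   # assumed fixed cost of WAITX itself, in sysclocks

def waitx_operand(compensation):
    """Immediate value to give WAITX for a desired total compensation."""
    if compensation < WAITX_OVERHEAD:
        raise ValueError("compensation below the WAITX minimum")
    return compensation - WAITX_OVERHEAD
```

This reproduces both table entries: a compensation of 3 maps to WAITX #1 and a compensation of 21 maps to WAITX #19.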

evanh · 2019-06-18 11:56

I've improved the data compare accuracy by using the whole buffer now. Here's a graph comparing 1/1 to 1/4 data rates.
[Image: simulated ext data copy.png]

msrobots · 2019-06-18 12:00

I am thinking a lot about the 'workflow?' of my ring buffer.

Currently I start the process by letting the transmitting COG with Buffer Offset 0 transmit to the next receiving COG in the chain (supposed to be on another P2). I use the 'clock' line to signal start of transmission; the receiver waits for a clock change, then overwrites the HUB on the supposed second P2.

After overwriting the HUB the receiver signals the transmitter COG on the same P2 (the second in chain) via COGATN that a new buffer is there, the transmitter overwrites its section of the HUB buffer with its own output buffer and then transmits this buffer to the next P2 until it comes back to the first P2. Then the process starts again.

This does work but has the disadvantage of needing a separate 'buffered' output buffer and the real ring buffer is read only since it gets overwritten under control of the sending COG (and P2).

So it is a 'push' operation and I am quite unhappy with that solution, because of the need of a separate output buffer and the inability of writing directly into the ring buffer. I can not use locks since the transmit is initiated by the sending P2 and can not be stalled without a second control line like @evanh did.

My thinking is that if I could do a 'pull' operation I might be able to use a lock on the ring buffer: setting the lock, reading the ring buffer, releasing the lock. Now the current P2 can get the lock and write at any position in the ring buffer; the sending COG would sit in a waitse1 and can read the current ring buffer without using a lock.

This way the current P2 could change any offset in the ring buffer, no dedicated output buffer would be needed.

But I need to make sure that changes done by the other 6 COGs do not get overwritten by a receive before transmitted. Thus I need to stall receiving as long as un-transmitted changes are present.

Maybe with a second lock just as flag, set by any of the other 6 COGs when writing into the buffer, released after sending?
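One way to sort the idea out is a toy Python model of a single node (not P2 code and not a proposed implementation). Everything here is an assumption for illustration: the class and method names are invented, a Python lock stands in for the P2 hardware lock, and a plain boolean stands in for the second "flag" lock. Unlike the description above, this model also takes the lock while transmitting, purely to keep it simple.

```python
import threading

class RingBufferNode:
    """Toy model of one node's copy of the shared ring buffer."""

    def __init__(self, size):
        self.data = [0] * size
        self.lock = threading.Lock()   # stands in for the ring-buffer lock
        self.dirty = False             # stands in for the second flag lock

    def local_write(self, offset, value):
        """A worker COG changes the buffer and marks it as not yet sent."""
        with self.lock:
            self.data[offset] = value
            self.dirty = True

    def receive(self, incoming):
        """Take a buffer from the previous node unless changes are pending."""
        with self.lock:
            if self.dirty:
                return False           # stall: local changes would be lost
            self.data = list(incoming)
            return True

    def transmit(self):
        """Hand a snapshot to the next node and clear the pending flag."""
        with self.lock:
            snapshot = list(self.data)
            self.dirty = False
            return snapshot
```

In this model a receive is refused while un-transmitted local changes exist and succeeds again after the next transmit. It does not address how changes made by different nodes to the same offset would be merged.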

I think I am almost there, it is just not sorted out in my head.

Enjoy!

Mike

msrobots · 2019-06-18 12:54

ersmith wrote: »
msrobots wrote: »

Another stupid question. There is a MUL opcode but no DIV? I just found QDIV, which uses the cordic and is not as intuitively usable as add or sub. What is that about?

confused,

Mike

Yes, there is no DIV opcode. Beware, MUL is only a 16 bit multiply, so it isn't really like ADD, SUB, and so on (which operate on the full 32 bits of the register). For 32 bit operations use QMUL and QDIV. They're actually pretty simple to use:
  ' calculate A = B x C
  qmul B, C ' start cordic multiply
  ' (optional: some non-cordic instructions can go here)
  getqx A  ' get result

  ' calculate A = B / C
  qdiv B, C
  getqx A
Note that QMUL is a little bit slower than manually constructing a 32 bit MUL out of 16 bit multiplies, so if you really want maximum speed then you'll want a subroutine that does 4 MULs of the various half-words (there's a thread somewhere in the forums showing the fastest multiply for P2). But QMUL is small and easy to do in assembly.

Thank you, works perfect.

Enjoy!

Mike

evanh · 2019-06-18 12:58

Mike,
The ring buffer isn't really needed of course. Direct source->target is no problem. Although, beware that any subsequent changes in the source can end up being half copied.

I don't have example code for you but the general approach should be to know ahead of time the block size to be copied. I'm not sure if you are trying to achieve gapless in time between blocks or not. If so, then the FIFO mechanism requires all blocks to be on boundaries of 64 bytes (16 longwords to fit the largest 16-cog prop2) and can only smoothly transition the block address at the end of a whole 64 byte block.
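The 64-byte boundary rule for gapless chaining can be expressed as a few Python helpers (host-side arithmetic only, not P2 code). The block size is taken from the paragraph above; the helper names are illustrative.

```python
BLOCK_BYTES = 64   # 16 longwords of 4 bytes each, per the paragraph above

def is_block_aligned(addr):
    """True if a hub address sits on a 64-byte block boundary."""
    return addr % BLOCK_BYTES == 0

def align_up(addr):
    """Round a hub address up to the next 64-byte boundary."""
    return (addr + BLOCK_BYTES - 1) // BLOCK_BYTES * BLOCK_BYTES

def blocks_needed(nbytes):
    """Number of whole 64-byte blocks needed to cover nbytes."""
    return (nbytes + BLOCK_BYTES - 1) // BLOCK_BYTES
```

For example, a 256-byte buffer (64 longwords) covers exactly 4 blocks.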

A problem occurs with gapless. You have to skip the start compensation when stitching (chaining) two disparate blocks together in time. This is where interrupts can work well. A dedicated cog also works.

If you don't require gapless (Preferable) then it's just a matter of restarting the two FIFOs at the new addresses. And any address works, limited by word size, ie: If copying 32-bit words then 4-byte boundaries must be observed. I think. EDIT2: No, not correct, any address works but might affect compensation timing if not on a word size boundary.

EDIT: edited comment pertaining to interrupts.

msrobots · 2019-06-18 13:36

Oh @evanh,

you are misunderstanding my project. I basically don't care about the transport mechanism; the goal is a shared ring buffer in HUB between a ring of x daisy-chained P2s, as transparent and fast as doable.

On the P1 @"Beau Schwabe" did something like that and it was interesting to play with it. Combining the idea with the ability of the streamer to select the number of data lines without much code change is quite interesting.

One basically loses 2 COGs and 2 times x+1 pins per P2, but can use other COGs on other P2s with the same HUB mailbox scheme as is used for local COGs. So with 2 P2s you will have 12 COGs, with 3 P2s you have 18.
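The resource arithmetic in that paragraph, as a short Python sketch. It assumes 8 COGs per P2, which is what the figures of 12 and 18 imply, and reads "2 times x+1 pins" as x data lines plus one clock line in each direction; both readings and all names are assumptions.

```python
COGS_PER_P2 = 8    # assumed: the totals above imply 6 free COGs per chip
LINK_COGS = 2      # one receiving and one transmitting COG per chip

def free_cogs(num_p2):
    """COGs left for application work across the whole ring."""
    return num_p2 * (COGS_PER_P2 - LINK_COGS)

def link_pins(data_lines):
    """Pins used per P2: data lines plus a clock line, in and out."""
    return 2 * (data_lines + 1)
```

This gives 12 free COGs for two chips and 18 for three, matching the post. An 8-bit wide link would use 18 pins per chip under the reading above.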

The current version does almost gapless transmission, and that is the problem: the other 6 COGs have no chance to change the ring buffer before it is overwritten by the next one coming around.

So I need gaps for the node that wants to write, but I do not need gaps when the node has nothing to write. So a general timing interrupt would be uncool. As you wrote, direct source->target is no problem. I just want to get the transmissions between the P2s decoupled from each other, insofar as a P2 requests a copy from its predecessor. Still thinking about that.

Enjoy!

Mike

evanh · 2019-06-18 13:52

Hmm, I guess it does commandeer the FIFO. Which kills off hubexec for that cog. Thing is, the managing cog is basically twiddling its thumbs the whole time. I'd certainly be looking for opportunities to use that cog for an additional job.

evanh · 2019-06-18 13:55

You're probably right though. If the DMA gets busy then there ain't going to be a lot of spare hub cycles left for anything else on that cog.

msrobots · 2019-06-18 13:57

I need to read about the locks, they have changed from P1 to P2.

msrobots · 2019-06-18 14:03

evanh wrote: »

I've improved the data compare accuracy by using the whole buffer now. Here's a graph comparing 1/1 to 1/4 data rates.

And thank you for that. I might want to tackle the registered pins again. I did not try that large compensations.

Enjoy!

Mike

evanh · 2019-06-18 14:16

msrobots wrote: »

I did not try that large compensations.

That's only because of the reversal of who starts the transfer. See below for my more complicated tx waiting arrangement that causes the bigger rx compensations.

When rx is waiting, the tx compensation is still 3 (WAITX #1) for 1/1 data rate, compensation 4 for 1/2 data rate, compensation 5 for 1/3 data rate, and compensation 6 for 1/4 data rate.

ORG  0
receiver
' setup
		drvl	#ctspin			' make sure handshake pin is driven low - clear to send
		wrpin	##$1_0000, #ctspin	' enable pin-registers (in and out)
		wrfast	#4, ##ddata		' setup receiver FIFO, 4 blocks (64 longwords)
		setxfrq	##xcfg			' set streamer data rate
		setse1	#$40+rtspin		' arm handshake rise - request to send


' fake processing loop
.fake
		mov	pa, #500		' lots to simulate a heavy workload
.floop
		pollxfi			wz
if_z		outl	#ctspin			' indicate DMA complete
		djnz	pa, #.floop
		jnse1	#.fake			' use an event for speed

' ready
		testp	#ctspin		wz	' if cts pin still high then
if_z		waitxfi				' wait for completion of prior DMA
if_z		outl	#ctspin			' indicate DMA complete, ensues retrigger of SE1 event in tx cog
		outh	#ctspin			' indicate ready to receive
		waitx	#(comp-2)		' Timing compensation setting

		xzero	.rxcfg, #0		' go
		jmp	#.fake

evanh · 2019-06-18 16:08

Intriguing, I just measured the duration required to copy one stand-alone 64 cycle block at 1/1 data rate. As far as I can tell the measurement always comes to 68 sysclocks. Both for sender and receiver.

PS: The measurement is based on a WAITXFI streamer completion event.
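To put the 68-clock figure in perspective, here is a small Python calculation (not P2 code). It assumes the block is 64 longwords sent one per sysclock, and it takes the system clock as a parameter because the clock used for this measurement is not stated; the 180 MHz in the usage note is simply the figure Mike mentioned earlier. Function names are illustrative.

```python
def link_efficiency(payload_clocks=64, measured_clocks=68):
    """Fraction of the measured time spent moving data."""
    return payload_clocks / measured_clocks

def bytes_per_second(sysclock_hz, longs=64, measured_clocks=68):
    """Throughput of repeated stand-alone blocks at a given sysclock."""
    return longs * 4 * sysclock_hz / measured_clocks
```

With these assumptions the stand-alone block is about 94% efficient, and at a 180 MHz sysclock repeated stand-alone blocks would move roughly 678 MB/s.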

msrobots · 2019-06-18 16:32

I am jealous, my code still refuses to sync at 1/1.

I have updated it to use registered pins but still no joy at 1/1.

working on it,

Mike

jmg · 2019-06-18 22:38

evanh wrote: »

I've improved the data compare accuracy by using the whole buffer now. Here's a graph comparing 1/1 to 1/4 data rates.

Nice plot. How much does it `move left`, as it warms up ?
The data match axis seems to go to 64, but the text mentions 32b bus ?
That looks like impressive bandwidth, and you could make a 200+MHz scope, with 32b capture width ( up to 2 x 16b ADC, or 3 x 10b ADC... )

evanh · 2019-06-19 01:11

jmg wrote: »

Nice plot. How much does it `move left`, as it warms up ?

Same as previous graphs. Just that the tail continues down instead of the earlier reversion back to 32 at the top end of the graph.

The data match axis seems to go to 64, but the text mentions 32b bus ?

64 words is the buffer size. 64x32-bit = 256 bytes. The old DMA setting was only copying 32 of them and this was causing an incorrect compare reading at the top end of the graph. I didn't work out why, I just got round to fleshing out the copy length and suddenly the graph shape changed ... Again! - Previously for unregistered I/O.

evanh · 2019-06-19 01:33

Uh, oops, I think I see where it had been screwing up on the compare - matching nulls still counted. Now I don't understand how come the graph never went to 64 all along. *shrug*
EDIT: Here's the compare code - which hasn't been changed from the start.

'compare buffers
		loc	ptra, #@sdata
		loc	ptrb, #@ddata
		mov	count, #0

		rep	@.rend, #64		' 64 loops
		rdlong	pa, ptra++
		rdlong	pb, ptrb++
		cmp	pa, pb		wz
if_z		add	count, #1
.rend
		mov	pa, count
		call	#itod

msrobots · 2019-06-19 07:32

hmm - another thing I do not understand.

I always had two possible wait values locking on at 1/2. Then I registered my clock pins on rx COG and TX COG. As expected I had to add two clocks compensation and had still two possible wait values.

Then I registered my data lines (send COG first): still two possible wait values. Then I registered the data lines of the receiver, and now there is just one possible wait value. Makes no sense to me.

Anyways I am posting the current files before I proceed changing the basic workflow.

Enjoy!

Mike

evanh · 2019-06-19 13:31

evanh wrote: »

The old DMA setting was only copying 32 of them and this was causing an incorrect compare reading at the top end of the graph. I didn't work out why, I just got round to fleshing out the copy length and suddenly the graph shape changed ... Again! - Previously for unregistered I/O.

Grr, turns out that wasn't the reason at all. I can get the double valley shape again by reverting back to the simpler rx waiting arrangement. I'm struggling to explain why that can affect the data copied so much. Best guess at this stage is because there is longer down time in the tx waiting loop, it affects the unstable top end of the spectrum differently.

EDIT: I've double checked for bugs. Even went as far as hacking a small change into the tx waiting source code to give it the rx waiting arrangement - this then matched the rx waiting results.
[Image: rx_wait vs tx_wait.png]

msrobots · 2019-06-19 18:23

I still do not understand your 21 comp when registered. That seems to be wrong. I currently still run 3 comp with registered pins at 1/2, but can't get 1/1 to run at all.

something is wrong,

Mike
