I've just done some preliminary testing of 32-bit (pins 0-31) data transfers between two cogs. Amazingly, I can go right up to 270 MT/s without any errors! No variable compensation required.
Using the very simplest method of receiver always waiting, the only compensation at all is the insertion, on the sender cog, of +3 clocks between the handshake OUTH and sending XINIT. Which makes complete sense, since the receiver can't know the sender has started until after the fact - so there's at least one instruction time of response lag.
Here's a graph of the sensitive areas when stepping through at 1 MHz steps. So there are bad spots around 150 MHz and 300 MHz.
And I presume the width of the bad range is dependent on how bad the jitter is. I was doing 1 MHz steps with an XDIV of 20 on the P2ES chip, so some extra jitter comes from that.
With 2 P2s using a locked-clock scheme as discussed here, yes, you could even measure the jitter by the change in slot width.
Within the same P2, just across cogs, the long term jitter does not matter. There might be a short term jitter effect over the 2-3 sysclocks covering the transit times, but I'd expect short term jitter to be very low.
This plot may be a nice way to check PCB decoupling.
The match count seems to drop only to 24 - I wonder why that is? A deeper notch might have been expected when all bits change?
Since my clock is not a regular clock but more like a signal, I decided to use a transition from low to high to signal the start of a command frame and a transition from high to low to signal a data frame.
I am also pondering using registered (clocked) pins for my signal line. It may not help against the phase difference between the P2s, but it should allow more precise timing on sender and receiver.
A normal digital pin can be set via the smartpin interface to delay its output by one sysclock and then transition at exactly that moment.
Does that mean the outnot command now takes three cycles to execute, or does it still execute in 2 cycles but the pin changes one clock later, in the middle of the next command?
Same question for an input pin: if I wait for a transition on a pin with waitse1, how granular is that?
Sadly I do not have a scope available to test this.
The second case: the instruction still executes in 2 clocks, the pin just changes later. The lag is actually longer than 3 clocks, but the streamer data has the same lag, so they cancel each other out. You'll notice it when trying to handle the turnaround of bidirectional transfers.
In your case, make the compensation only 2 clocks - a NOP will do. See below in the source where I comment on WAITX #0 vs #1.
As for input granularity: I'm currently using WAITSE1 on the receiving end. That makes a natural one-instruction lag to compensate for, at the sending end.
'streamer to streamer copy (excerpt - hspin, xcfg, txcfg and the sdata/ddata buffers are defined elsewhere in the full source)
coginit #$10, ##@receiver ' start receiver task in any spare cog
drvl #hspin
mov dira, ##$ffff_ffff
waitx ##50000 ' pause to allow time for emit status and receiver cog to boot
rdfast #0, ##@sdata ' setup sender FIFO
setxfrq ##xcfg ' set streamer data rate
outh #hspin ' tell receiver to record to hubRAM
waitx #0 ' use #0 for sysclock 1/4 (loose timing), use #1 for sysclock 1/1 (tight timing)
xinit txcfg, #0 ' go!
waitxfi ' wait for completion of DMA
waitx #500 ' pause to collect any overrun
outl #hspin ' stop! tell receiver to die
ORG 0
receiver
wrfast #0, ##ddata ' setup receiver FIFO
setxfrq ##xcfg ' set streamer data rate
setse1 #$40+hspin ' handshake rise
waitse1 ' wait for rise
xinit .rxcfg, #0 ' go!
setse1 #$80+hspin ' handshake fall
waitse1 ' wait for fall
cogid pa
cogstop pa
.rxcfg long $f000_0020 ' 32 words (WFLONG), pins 0-31
I used my scope once to verify my code was actually cycling, i.e. to learn the streamer and get myself out of a duh moment where I'd started off using hubexec. After that, everything was done with prints down the comport. The only way to be sure what was copied is to read out and compare both the send and receive buffers.
EDIT: Added the receiver ORG 0
PS: That is 64 longwords but the streamer config was only set to 32 transfer cycles.
Just did some experiments with registered enabled and it looks messy. The amount of lag varies as the frequency goes up. Actually more like what I was expecting originally.
Notably, at lower sysclock frequencies, the correct compensation is 1 clock. A virtual waitx #-1.
Hmm, I guess the real takeaway is that it's not wise to rely on being able to operate above about 125 MT/s. sysclock/3 works reliably at any sysclock but that's even slower.
EDIT: Attached is the waitx #-1 equivalent compensation for transfers at sysclock rate. It has a distinct and crisp phase shift at 152 MHz.
Ok, slower please, I need to understand that.
If I execute an outnot, or any other command changing a pin, at say sysclock 0, the actual pin lags behind and may be set at sysclock 2 - or may not be, since it lags a bit, and I am not sure whether the next command sees the correct value. So instead I can use a registered pin and it WILL be ready at sysclock 3, while the next command starts at sysclock 2 (a NOP) and the one after that at sysclock 4 (my XCONT). Correct/wrong?
And if I have a waitse1 started at sysclock 0, will it be able to catch the transition at sysclock 3 and execute the next command at sysclock 4 (my XCONT)? Correct/wrong?
If I also register the input pin, I introduce another lag of one cycle on the receiving end, so my waitse1 can see the change at sysclock 4 at the earliest - or, since the output is already registered, still at sysclock 3? So does my XCONT start at sysclock 4 or sysclock 5?
That leads to the next question: if and how does the pipeline affect this?
Sysclock/2 might be very tight, but sysclock/4 seems to be doable even with phase differences of two P2 clocks.
I've just done a sysclock/2 with registered pins and 1-clock compensation. It comes out better than unregistered - clean until 300 MHz (150 MT/s). You're still limited in max throughput but it does allow another option.
Whereas unregistered keeps its jittery patch around 150 MHz sysclock. So registered is the more predictable and less prone to upset from jitter.
It's not that critical to know the exact explanation. My description won't be perfectly accurate, that's for sure. It's only a generalisation for visualisation purposes. I should probably stop rambling until I have a clearer way to tell the story.
Some of my comments have been about extreme cases that aren't likely to be used in practice, simply because they aren't reliable enough.
Main point:
You just have to be aware there is a high likelihood of the sender and receiver not seeing the start signal on exactly the same sysclock. So adding compensation padding after the start trigger is to be expected. How much that compensation turns out to be is found by experimenting.
To cut the story down as much as possible: the compensation really only deals with the time interval from the sender's start signal to the beginning of the sender's streamed data, and how that interval presents to the receiver.
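A minimal restatement of that knob, reusing the names from the listing above (TX_COMP is a placeholder for the experimentally found padding, not a constant from the source):

outh    #hspin          ' start trigger, seen by the receiver
waitx   #TX_COMP        ' the compensation knob - padding found by experiment
xinit   txcfg, #0       ' streamed data begins a fixed time after the trigger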
Ideally the receiving streamer should sample at the middle of each bit sent: thus start one sysclock later when running at 1/2 (two-sysclock-wide bits), and at 1/4 one should start sampling 2 sysclocks later.
Except if the chip is already doing that in hardware (how else would sampling at 1/1 work?).
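A sketch of that mid-bit start, assuming a bitclocks register holding clocks-per-bit and placeholder rxcfg/sig_pin names (this is the same bitclocks>>1 shift mentioned further down):

setse1  #$40+sig_pin    ' event on handshake rise, as in the listing above
mov     pa, bitclocks   ' clocks per bit
shr     pa, #1          ' half a bit-time, in sysclocks
waitse1                 ' sender has started
waitx   pa              ' push the sampling point toward mid-bit (0 at 1/1)
xinit   rxcfg, #0       ' go!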
My new receiver COG seems to understand the concept of data and cmd frames as told. But my sender seems to hiccup somehow and crashes completely. Time to rewrite...
I hope to get that sorted out tonight.
Currently I'm running with sysclock/16, so I can even delay the receiver with a nop to be sure not to sample too early.
The new plan is to forget the send streamer's 'out of command' event as the source for an interrupt.
The receiver COG code needs less housekeeping time, but it needs to wait for a complete command frame, so it needs waitxfi: it does its waitse1, receives its cmd frame, then sets up the next data frame, does its waitse1, receives its data frame, and sets up the next cmd frame. It does not need to wait on waitxfi for the data frame. Sort of an endless loop.
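A rough sketch of that loop, using the rise/fall framing from earlier ($40 = rise, $80 = fall, as in evanh's source; sig_pin, cmdcfg and datcfg are placeholders):

rxloop  setse1  #$40+sig_pin    ' rise marks a command frame
        waitse1
        xinit   cmdcfg, #0      ' capture the command frame
        waitxfi                 ' must complete before acting on it
        ' decode command, point the FIFO at the next data buffer here
        setse1  #$80+sig_pin    ' fall marks a data frame
        waitse1
        xinit   datcfg, #0      ' capture the data frame - no waitxfi,
        jmp     #rxloop         '  the next cmd setup overlaps it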
The sender COG has more housekeeping to do, so it will always take longer. It needs to run a sort of display list depending on where its own buffer sits in the ringbuffer (if any), so it needs to send either 1, 2 or 3 frames to finish the job of transferring the buffer to the next COG/P2.
Since the rest of the COG is doing nothing, why run code in an interrupt just because it is there? So the sender will become an endless loop also.
Now the only puzzle piece left is to set up a smartpin to toggle the clock pin in xxx sysclocks and let sender and receiver sit in an identical waitse1. Then I can use waitx to fine-tune the start offset later on.
This way I would have a very clean start - except for the odd and even sysclocks of the COGs themselves. A waitx #1 makes a COG odd or even.
Time to rewrite... I hope to get that sorted out tonight.
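A sketch of the smartpin clock idea from a couple of paragraphs up, assuming the standard transition-output smartpin mode (P_TRANSITION and P_OE are the stock Spin2 pin-mode constants; clk_pin and period are placeholders):

dirl    #clk_pin                        ' reset smartpin before configuring
wrpin   ##(P_TRANSITION | P_OE), #clk_pin
wxpin   period, #clk_pin                ' sysclocks between transitions
dirh    #clk_pin                        ' smartpin armed
wypin   #2, #clk_pin                    ' fire: two transitions = one clean pulse

Both cogs would then sit in the same waitse1 on clk_pin, with the pulse arriving a programmed number of sysclocks after the wypin.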
If you were pushing this, it might pay to include a 'delay learning' phase, similar to the plots evanh did above, only instead of MHz on the Z axis you sweep WAIT counts, looking for the edge cases.
You then select a delay that is 'best fit' in the middle.
At higher MHz there can be `bonus sysclks` added, and that might even change with temperature and P2 batch.
Using the pin registers is likely to reduce the variation spread, at least in theory. (evanh's comment above suggests that's maybe not confirmed by test?)
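A Spin2 sketch of such a learning phase, where transfer_test() is a hypothetical routine that runs one transfer with the given padding and returns true on a byte-for-byte buffer match (assumes a single contiguous passing window):

PUB find_best_delay() : best | d, first, last
  first := -1
  last := -1
  repeat d from 0 to 15                 ' sweep WAIT counts
    if transfer_test(d)
      if first == -1
        first := d                      ' start of the passing window
      last := d                         ' extend the passing window
  best := (first + last) / 2            ' 'best fit' in the middle, -1 if none passed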
Below 150 MT/s, registered is notably superior to unregistered.
The concern about registered was all in frequencies above 150 MT/s. I concluded that registered or not, pushing into those rates is not recommended. And I gave a rough guide of capping at 125 MT/s to allow some thermal/whatever headroom.
Hmm, JMG, you mentioned pin loading earlier too. That's probably going to be just as big an issue. The rate cap could end up much lower with long tracks.
Fundamentally, there is no clock synchronous with the data lines, and there is no way to add one. It's always going to be a tad hairy, and any high speed attempt is dependent on tuning the clock to suit the board, or tuning the board to suit the clock.
Here's the full source code. It's many pieces slapped together so you have to find the streamer part, which is commented, in the whole source. It relies on using loadp2's terminal for interaction.
EDIT: Updated compensation comments to include registered details

I just need to do a sanity check that my send loop gives the receive loop enough setup time between bursts. So I need to count the maximum clock cycles for the receiver loop and the minimum clock cycles for my sender.
Mike
I had to rewrite everything just about 20 times. For reasons that elude me, I am unable to change the HUB address of a streamer once started. So I rethought the process and now have a working ringbuffer.
Once I had the basics running I could fine-tune it. And it is amazing. I just tested at 180 MHz, and it now runs stable at sysclock/2; it locks at sysclock/1 but has transmission errors then.
Maybe registered pins could make the last difference, but I am obviously too stupid to find out how to do that from the documentation.
Anyways, using 2 or more P2s in a daisy chain will not run that fast. I was aiming for 1/4, and having 1/2 running is pretty cool.
To test this I had to simulate 2 P2s on one P2. In a regular installation one would have just one instance of the driver and the needed buffers; I needed two to test.
The current setup will/can use P0-P31 for data and P55/P56 for the needed clocks. It works from 1 data pin up to 32, but with just one P2 I can only test up to 16 data lines.
In the test file, at the top of the first procedure, you can change two variables to select the variations.
mode can be: 0 for one data line, 1 for two data lines, 2 for four, 3 for eight, 4 for sixteen and 5 for thirty-two data lines.
The other parameter is bitclocks. I decided that an xfrq parameter is too weird to use; what one really wants to set is the number of clocks per bit.
So bitclocks := 0 or 1 will use 1 clock per bit, aka xfrq := $8000_0000
bitclocks := 2 will use 2 clocks per bit, aka xfrq := $4000_0000
bitclocks := 4 will use 4 clocks per bit, aka xfrq := $2000_0000
The interesting thing is that one can use bitclocks := 3, for example. I hope to get this running over wires between P2s - sampling at the middle of 3 clocks allows for phase differences between the two P2s' sysclocks.
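A Spin2 sketch of that bitclocks-to-xfrq conversion - a hypothetical helper, not from the attached source, assuming the mapping above generalizes as $8000_0000 divided by bitclocks (for bitclocks := 3 the division doesn't come out even, so the NCO adds a little jitter of its own):

PUB bitclocks_to_xfrq(bitclocks) : xfrq
  if bitclocks < 2
    xfrq := $8000_0000                ' 1 clock per bit, sysclock/1
  else
    xfrq := $8000_0000 +/ bitclocks   ' unsigned divide: 2 -> $4000_0000, 4 -> $2000_0000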
I am attaching the first release, before I destroy it again while fiddling with moving the read streamer start depending on bitclocks>>1.

Anyways, here it is,
Mike
So wrpin ##$1_0000, pinNumber sets the pin to registered, independent of it being just digital or a smartpin?
Independent of smartpin, yep. Not so independent of digital mode. Some pin modes don't have the C bit but are still registered anyway.
do I need to set dirl before and dirh after?
No, that's used when setting up a smartpin. Smartpin config is all in the low 6 bits.
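So, per the answers above, registering a plain digital pin is a one-liner either way (sig_pin is a placeholder):

wrpin   ##$1_0000, #sig_pin     ' set the registered (clocked) flag
wrpin   #0, #sig_pin            ' back to plain unregistered digital

No dirl/dirh dance needed around these - per the answer above, that's only for enabling smartpin modes.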
what the hell is DJNF?
Experience. I was like that reading Oz's code sometimes too.
DJNF turns false on rollover whereas DJNZ turns false on zero. Because they are pre-decrement instructions, Chip explicitly added DJNF as a new instruction just for zero case looping.
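A small illustration of the difference: a count preloaded with N gives N+1 passes with DJNF, and a preloaded 0 still executes the body exactly once, where DJNZ would wrap around and loop 2^32 times.

        mov     pa, #7
blink   drvnot  #56             ' toggle P56 (a P2 Eval board LED)
        djnf    pa, #blink      ' exits on rollover: body ran 8 times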
Why, @evanh, do I always have more questions after reading your posts?
I will try.
Nine mile skid on a ten mile ride
Hot as a pistol but cool inside
Cat on a tin roof, dogs in a pile
Nothing left to do but smile, smile, smile
Mike

It was just a try, to see if I could get to 1/1. But 1/2 is quite fine. Gosh, I wish I had two P2s to test it over wire with two clock sources. But as is, this streamer rocks.
Enjoy!
Mike

Even though registered I/O adds two more sysclocks of lag, I think it is advisable to use it. It definitely has a big impact on wiping out interference from jitter, so it should be more reliable when pushing for best speed.
Here's the registered version of the same sweep as the graph at the top of this page of comments:
The step change above 150 MHz means it is possible to adjust the compensation to suit. But the big improvement is that it is a clean step. There is no width to it.