Ringbuffer (was Streamer Questions - how to sync)

evanh · 2019-06-20 02:37

msrobots wrote: »

I still do not understand your 21 comp when registered. That seems to be wrong. I currently still run 3 comp with registered pins at 1/2, but can't get 1/1 to run at all.

You should be able to use compensation of 4 when on 1/2 sysclock.

Your code is arranged as receiver waiting for the go signal - or you could say transmitter gives the go. Compensation 21 is what is needed when the transmitter is doing the waiting.

evanh · 2019-06-20 03:08

Compensation is added at the go task:
- If the go task is the transmitter (rx_waits) then the value is 3.
- If the go task is the receiver (tx_waits) then the value is 21.

When tx gives the go, the go and the data both can travel together in parallel. As long as the rx task sees the go signal a couple of clocks before the data arrives then it has the time to activate the streamer as the data arrives.

When rx gives the go, the go signal has to first get all the way back to tx before tx can start sending. And then the data has to make the same trip back to rx before rx's streamer can capture it.

Round trip difference of 18.
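
A minimal sketch of that arithmetic, using only the two compensation values quoted above. The function names are made up for illustration and are not part of anyone's code in this thread.

```python
# Compensation values quoted in this thread, in sysclock ticks.
COMP_TX_GIVES_GO = 3    # rx_waits: go and data travel together
COMP_RX_GIVES_GO = 21   # tx_waits: go travels to tx, then data travels back


def compensation(go_from: str) -> int:
    """Return the compensation applied at the go task ('tx' or 'rx')."""
    if go_from == "tx":
        return COMP_TX_GIVES_GO
    if go_from == "rx":
        return COMP_RX_GIVES_GO
    raise ValueError("go_from must be 'tx' or 'rx'")


def round_trip_difference() -> int:
    """Extra ticks paid when go and data make opposite trips."""
    return compensation("rx") - compensation("tx")


print(round_trip_difference())  # 18
```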

msrobots · 2019-06-20 05:50

evanh wrote: »

msrobots wrote: »

I still do not understand your 21 comp when registered. That seems to be wrong. I currently still run 3 comp with registered pins at 1/2, but can't get 1/1 to run at all.

You should be able to use compensation of 4 when on 1/2 sysclock.

Your code is arranged as receiver waiting for the go signal - or you could say transmitter gives the go. Compensation 21 is what is needed when the transmitter is doing the waiting.

Yes, I could use 3 and 4 before registering the data input lines; since registering them it only works with 3. Not sure why.

I intend to turn around my workflow and let the transmitter wait, so I will find out.

Mike

evanh · 2019-06-20 09:33

[err, got it wrong myself ...]

need to do some testing ... what's the spin2 replacement for something like waitcnt(3_000_000 + cnt) ? Found waitx_() at https://github.com/totalspectrum/spin2cpp/blob/master/doc/spin.md#new-intrinsics

evanh · 2019-06-20 12:30

Mike,
I've done a basic workaround to fix up your compensation issue:

receiveloop
		wrfast  ##0,		rxRingnet_ptr		
		waitse1
'		waitx	rx_waitx 
		xcont   rx_xinit,	rx_mode			'start sampling
		waitxfi
		cogatn	rx_cogatn
		jmp	#receiveloop

and

sendloop	mov	ptra,		txBuffer_ptr
		mov	ptrb,		txRingnet_ptr
		add	ptrb,		txBuffer_offset
		mov	tx_loopcnt,	txBuffer_size
		rep     #2,	 	tx_loopcnt
		rdlong	tx_tmp,		ptra++
		wrlong	tx_tmp,		ptrb++
		rdfast	##0,		txRingnet_ptr

		outnot	txPin_clock	
		
		waitx	#1  'was #8

		xcont   tx_xinit,	tx_mode
		waitxfi
		waitatn
		jmp	#sendloop

evanh · 2019-06-20 12:35

Hmm, something isn't working when data width > 1-bit.

msrobots · 2019-06-21 00:07

well I do add data width>>1 to waitx to place the receiver in the middle of the data width.

Have to check this tonight, daytime is reserved for paid work...

Mike

evanh · 2019-06-21 06:32

It's gotta be something else since rx_waitx is ignored in my little patch above. I went to bed last night and haven't got back to looking.

msrobots · 2019-06-21 23:37

hmm - I turned around my workflow and let the sender wait. Now nothing is working anymore. Somewhere I have a stupid typo, I guess.

Will find out.

Mike

msrobots · 2019-06-22 00:55

How many sysclocks does a COG need to load? Maybe I do not wait long enough?

confused,

Mike

evanh · 2019-06-22 01:51

Probably just over 500. It'll be a little variable due to hub alignment. 520 tops I'd think.

msrobots · 2019-06-22 02:13

ok will try 2 times 500 to be sure. Something is going completely wrong right now and I need to figure out what. At least it makes me comment each line with what it is supposed to do. Still something amiss.

Will get there this weekend, hopefully.

no joy right now,

Mike

evanh · 2019-06-22 02:21

Just tested ... it's in the 530's. So there must be quite a number of extra steps to cog boot.

evanh · 2019-06-22 02:27

529-536 within the launchee cog.
The COGINIT instruction itself also takes some clocks ahead of that in the launcher cog.

msrobots · 2019-06-22 07:41

So with two times waitx #500 I should be out of that one. I seem to have it running again, now with the receiver waiting. Still hiccuping. Sometimes I need a reset to get it working again.
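
A rough check of that margin, as a sketch only. It assumes WAITX #D costs D + 2 clocks (consistent with the "ten for the WAITX #8" figure given elsewhere in this thread) and uses the 529-536 clock boot time measured above. The extra clocks COGINIT spends in the launcher cog are not modelled.

```python
# Measured cog boot time inside the launched cog, from this thread.
COG_BOOT_MIN = 529
COG_BOOT_MAX = 536


def waitx_clocks(d: int) -> int:
    """Clocks consumed by WAITX #d, assuming 2 clocks of instruction overhead."""
    return d + 2


def startup_margin(waits) -> int:
    """Clocks left over after the worst-case boot time (negative = too short)."""
    total = sum(waitx_clocks(d) for d in waits)
    return total - COG_BOOT_MAX


print(startup_margin([500]))        # a single waitx #500 falls short
print(startup_margin([500, 500]))   # two of them leave plenty of slack
```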

I have not found the real issue with my code. Somewhere it does not do what I expect it to do. And it is sort of shameful not to find that in barely 272 lines of code.

working on it...

Mike

evanh · 2019-06-22 08:24

It's a timing exact arrangement between two cogs where factors are still to be discovered. Not surprising it's a tricky one.

I've just had a small hiccup myself - the direction of compensation, not just the amount, needs to be reversed when changing between rx and tx waiting. Obvious in hindsight but it wasn't my first assumption.

msrobots · 2019-06-22 21:21

My basic idea on changing to tx wait is maybe flawed, I am now considering two control lines.

The original thinking was:

1. The transmitter waits for a pull request from the next P2, sends the buffer without using a lock, notifies its own receiver with COGATN that the buffer is sent, then waits for the next pull request.
2. The receiver waits for COGATN, locks the buffer, receives a new buffer from the previous P2, unlocks the buffer, then waits for the next COGATN.
3. The other local COGs now have a slight chance to lock the buffer, write something and be sure that what is written has to be sent out before the receiver overwrites the buffer again.

This way I would not need a dedicated output buffer, just locks.

But the thinking is flawed as the sending COG can't wait for a lock, since it can't stall the transmission.

So it might be sending a not complete buffer. To enable lock for transmission I would need something like DTR/RTS.

But even that would not help, since there are 'number of nodes' versions of the ring buffer running around the ring.

Assume P2A sends to P2B and P2B sends to P2A (my current simulated setup).

P2A changes long X in the buffer. sends buffer, receives previous buffer from P2B (with possible changes from P2B) but its own changes to long X are not in that buffer, they are in the other one. With more P2s in the loop this gets even worse.

I see no way around this but to go back to using an output buffer for each node with a dedicated buffer offset in the ring buffer, with the ring buffer being read-only for the nodes.

This way I do still have different versions of the buffer running around, but each node overwrites its own part of every buffer coming around.
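
A toy simulation of that scheme, as a sketch only. It assumes one slot per node and that each node stamps its own output buffer into its slot just before sending, as the sendloop listing earlier in the thread does. The function and variable names are hypothetical.

```python
def circulate(outputs, rounds):
    """Pass ring buffers around a ring of len(outputs) nodes.

    outputs[i] is node i's private output buffer. Each round, every node
    overwrites its own slot in the buffer it currently holds, then hands
    that buffer to the next node in the ring.
    """
    n = len(outputs)
    ring = [[None] * n for _ in range(n)]      # one buffer copy per node
    for _ in range(rounds):
        for i in range(n):
            ring[i][i] = outputs[i]            # stamp own slot before sending
        sent = [list(buf) for buf in ring]     # all nodes transmit at once
        for i in range(n):
            ring[(i + 1) % n] = sent[i]        # next node receives it
    return ring


# After one round, node 0 holds a buffer that lacks its own data;
# after as many rounds as there are nodes, every copy is complete.
print(circulate(["A", "B", "C"], 1)[0])
print(circulate(["A", "B", "C"], 3))
```

Different versions of the buffer are still in flight at any moment, but no node's data is lost because nobody else writes to its slot.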

So back to square one. And no need to turn around rx and tx waiting.

I need to think about this more - will go to the local watering hole now, get a drink and ponder this for a while...

Mike

evanh · 2019-06-25 09:31

I've sorted out using a single control line for both directions (request and go) of the tx waiting arrangement. It's a good example of when knowing about the total lags involved comes in handy for knowing when to explore further out in the timings. The critical addition is the WAITX + POLLSE1.

Because the tx task still initiates the transaction and is now also the one waiting on the go trigger, it has to deal with two control pulses on the single line per transaction. The crux is that the tx task, when entering the waiting stage, has only just raised the control line to notify the rx task it'd like to make a transaction. This very action pings back to the tx cog's event handler several clocks later, so this early tx event needs to be cleared before arriving at the WAITSE1 opcode. Solution follows:

txinit_wait
		wrpin	##$1_0000, #hspin		' enable pin-registers (in and out)
		drvl	#hspin				' make sure handshake pin is driven low
		setse1	#$40+hspin			' arm handshake rise

		mov	dira, ##$ffff_ffff		' enable data outputs
		mov	pa, #31
.ploop		wrpin	##$1_0000, pa			' enable pin-registers (in and out)
		djnf	pa, #.ploop

		rdfast	#(cycles/16), ##@sdata		' setup sender FIFO
		setxfrq	##xcfg				' set streamer data rate
	_ret_	coginit	#$10, ##@receiver		' start receiver task in any spare cog


'------------------------------------------------------------------------------
txsend_wait
'****** tails to zero ******
		outh	#hspin			' notify the receiver
		outl	#hspin
		waitx	#8			' reach over the OUT to IN to Event of ~12 clocks
		pollse1				' important to clear the above event

		waitse1				' wait here for receiver's go trigger
		nop				' lag compensation matching spacer
		xinit	txcfg, #0		' go!
		getct	tickstart		' diag

		waitxfi				' diag - wait for completion of DMA
	_ret_	getct	tickend			' diag

msrobots · 2019-06-25 09:56

yeah, slowly we are getting there.

I have something running at 1/8 again with tx waiting. Need to dial in tonight for faster speed.

That pollse1 is interesting, I had not thought about that. I am still unsure about the needed time but 12 seems reasonable.

will test soon, off work in 1 hour...

Mike

evanh · 2019-06-25 11:33

The WAITX #8 inserts a spacing of 14 clocks total: Two for the OUTL, ten for the WAITX, and a final two for the POLLSE1.

I know the event can be cleared with a 12 clock space but the mechanism then drops out in the high sysclock band. So I added two clocks clearance for good measure.
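
The spacing arithmetic as a small sketch, assuming the per-instruction costs stated above (2 clocks each for OUTL and POLLSE1, D + 2 clocks for WAITX #D). The function names are just for illustration.

```python
OUTL_CLOCKS = 2      # from the breakdown above
POLLSE1_CLOCKS = 2   # from the breakdown above


def waitx_clocks(d: int) -> int:
    """Clocks consumed by WAITX #d (d plus 2 clocks of overhead)."""
    return d + 2


def event_clear_spacing(d: int) -> int:
    """Clocks covered by the OUTL, WAITX #d and POLLSE1 sequence."""
    return OUTL_CLOCKS + waitx_clocks(d) + POLLSE1_CLOCKS


print(event_clear_spacing(8))  # 14, the value used in the listing
print(event_clear_spacing(6))  # 12, the bare minimum mentioned above
```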

msrobots · 2019-06-27 07:15

I finally found my major bug, the reason it did not want to run with more than one data line.

It is so stupid, I wasted days searching for it. In the end it was a REP one instruction too short.

And if I do not enable the output lines, they do not send anything. Yes. Now I am making progress again.

Enjoy!

Mike

msrobots · 2019-06-28 02:22

OK, here we go again

pins 55 and 56 as control lines for each simulated P2

pins 0-15 for P2A to P2B 16-31 for P2B to P2A.

Mode 0-4 (1-16 bit) can be tested, mode 5 (32 bit) is not testable with one P2 and would not make much sense anyways since each P2 would need 64 pins for transfer...
Currently it works down to 2 sys clocks per bit, still fails at sys clock/1.

Here now TX is waiting for a request and the needed delay is 19 + half the bitclocks to sample in the middle.
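
As a sketch, that delay rule could be written like this. The base of 19 is the value quoted above; integer halving (a right shift by one) is an assumption, and the function name is hypothetical.

```python
BASE_DELAY = 19  # compensation quoted above for the tx-waiting arrangement


def start_delay(clocks_per_bit: int) -> int:
    """Delay in sysclocks that places the sample point mid-bit."""
    return BASE_DELAY + (clocks_per_bit >> 1)


for cpb in (2, 3, 8):
    print(cpb, start_delay(cpb))
```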

I had to revert to using a dedicated output buffer, which I still dislike. Using locks and writing directly into the ring buffer requires some DTR/RTS mechanism, and I would like to stay with one control line per transfer, not two.

But I have a plan here. Not completely formulated and not tested at all, but could work.

jmg · 2019-06-28 02:45

msrobots wrote: »

Currently it works down to 2 sys clocks per bit, still fails at sys clock/1.

When it does fail at /1, what yields does it have ?
Does the /2 transfer, have 2 phase settings it can work at, or just 1 ?

Yanomani · 2019-06-28 02:58

At sysclk/1 the persistence of the output bits after sysclk leading edge will fail to meet the needed hold time at the input ones.

If there are not too many noise sources wandering around, this can be tested by outputting data values in twin pairs. At least the second twin should hit a sane capture by the receiver port; expect some garbage due to switching noise in the interim if the corresponding bits change polarity from one pair to the next.

It's a fundamental limit imposed by the fact you are using a single P2. If you pull the trigger with your right index, you'll not be able to hit that same finger with the bullet.

When you are using two P2s, you'll need to allow at least a half period of sysclk (2 ns at 250 MHz, i.e. 180°) of phase difference between their individual sysclks, with the sender ahead of the receiver, of course.

Even there, you'll need to satisfy tsetup and thold of the receiving pin; not an easy task... but doable.
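
The half-period figure is simple to recompute for other clock rates. A minimal sketch (the function name is made up for illustration):

```python
def half_period_ns(sysclk_mhz: float) -> float:
    """Half of one sysclk period in nanoseconds (a 180 degree phase shift)."""
    return 1000.0 / (2.0 * sysclk_mhz)


print(half_period_ns(250))  # 2.0 ns, the figure quoted above
print(half_period_ns(180))  # roughly 2.78 ns
```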

Short distance between peers, excellent layout, controlled impedance will be prime factors of success.

msrobots · 2019-06-28 03:09

So the new plan.

currently TX has the control line as input driven low and waits for a toggle. RX has the control line as output and toggles to request data.

RX can wait for a lock before toggling, but TX can't wait for a lock; it has to send instantly.

I do not really want 2 control lines, so I am thinking about avoiding the need but still be able to stall the transmission on both ends.

My thinking is that both P2s have the control line as input. The TX COG drives the pin low and tries to get a lock; if successful, it drives the control pin high while still using it as input, thus signaling to the receiver that it is RTS.

The receiver also has the control pin as input, waiting for it to go high, driven only by the TX COG. Now the receiver can get its own lock, switch to output and do an outL.

The TX COG now gets its start signal and the transmission gets started. After completion both release their locks and switch to input; TX drives low again and waits for some time to stall the transmission and allow the other 6 COGs to get the lock and change the original ring buffer, not an output buffer, if they need to. The transmission stalls whenever the TX cog can't get a lock.

This way I should be able to use the control line bi-directionally and stall the transmission on both sides to use locks, and still have no floating pins, since TX, as input, drives the control line either high or low (RTS). RX just uses input and floats without driving the control line, waiting for RTS (input high) from TX, then switches to output and does an outl.

again.

This way both sides of my control line could use locks to access their respective buffers and no separate output buffer would be needed, just the shared HUB area; as long as you access it with the lock you can write anywhere in the buffer.

Theoretically.

Mike

msrobots · 2019-06-28 03:15

jmg wrote: »

msrobots wrote: »

Currently it works down to 2 sys clocks per bit, still fails at sys clock/1.

When it does fail at /1, what yields does it have ?
Does the /2 transfer, have 2 phase settings it can work at, or just 1 ?

Just one, since I use registered pins for I/O; before that I had two values at /2.

jmg · 2019-06-28 03:26

msrobots wrote: »

Just one, since I use registered pins for I/O; before that I had two values at /2.

How do the /2 and /1 windows vary with MHz?
I can see that tsu and th impose physical apertures, but at lower clock speeds you should be able to meet those, even if that needs a finite external delay.
At higher clock speeds, you see the notch effects in those evanh plots, where the aperture sweeps across the clock sampling point.
Working above a notch, or close to a notch, is a somewhat risky place to be, as you are not certain how that moves with PVT.

msrobots · 2019-06-28 03:30

Yanomani wrote: »

At sysclk/1 the persistence of the output bits after sysclk leading edge will fail to meet the needed hold time at the input ones.

If there are not too many noise sources wandering around, this can be tested by outputting data values in twin pairs. At least the second twin should hit a sane capture by the receiver port; expect some garbage due to switching noise in the interim if the corresponding bits change polarity from one pair to the next.

It's a fundamental limit imposed by the fact you are using a single P2. If you pull the trigger with your right index, you'll not be able to hit that same finger with the bullet.

When you are using two P2s, you'll need to allow at least a half period of sysclk (2 ns at 250 MHz, i.e. 180°) of phase difference between their individual sysclks, with the sender ahead of the receiver, of course.

Even there, you'll need to satisfy tsetup and thold of the receiving pin; not an easy task... but doable.

Short distance between peers, excellent layout, controlled impedance will be prime factors of success.

good points.

I do not expect to hit /1 at all. We also have a phase difference between COGs on one P2, a simple waitx #3 can do that; odd and even COGs, so to speak.

But having /2 stable on one P2 should help to get /3 running with two P2s over short distance, I hope.

It is easier to think in clocks per bit: /3 is three clocks per bit, so sampling in the center could work.

And I do adjust the start of the sample by half of the clocks per bit. To make sure the sender is ahead of the receiver.

And the streamer does all the magic of using the same code for 1-2-4-8... data lines. Hub to Hub transfer.

That is something I need to figure out, I would like to measure the frequency of my control line toggle as it is now, so how to get a smart pin to measure the frequency of the pin next to it?

msrobots · 2019-06-28 03:42

jmg wrote: »

msrobots wrote: »

Just one, since I use registered pins for I/O; before that I had two values at /2.

How do the /2 and /1 windows vary with MHz?
I can see that tsu and th impose physical apertures, but at lower clock speeds you should be able to meet those, even if that needs a finite external delay.
At higher clock speeds, you see the notch effects in those evanh plots, where the aperture sweeps across the clock sampling point.
Working above a notch, or close to a notch, is a somewhat risky place to be, as you are not certain how that moves with PVT.

I just run at 180 MHz, haven't tested faster.

What I basically do is use the only working compensation at /2 and add half of my clocks per bit to delay sampling away from the edge to the center of the bit.

The funny thing is, when you use a really large number of sysclocks per bit, you can see each bit on the LED board.

So if 3 clocks per bit fails, one can use 4,5,6,7,70000, whatever
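
For a feel of what each setting buys, here is a back-of-the-envelope sketch of the raw streaming rate. It only divides sysclk by clocks per bit and multiplies by the number of data lines; handshake time, buffer copying and any stall between transfers are ignored, so real throughput will be lower. The function name is hypothetical.

```python
def raw_rate_bits_per_s(sysclk_hz: float, clocks_per_bit: int, data_lines: int) -> float:
    """Peak bit rate while the streamer is actually running."""
    return sysclk_hz / clocks_per_bit * data_lines


# 180 MHz sysclk with 16 data lines, at a few clocks-per-bit settings.
for cpb in (2, 3, 4, 8):
    print(cpb, raw_rate_bits_per_s(180e6, cpb, 16))
```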

This streamer is quite cool.

Mike

msrobots · 2019-06-28 03:53

I think the reason I no longer have 2 values for /2 is that I use just one P2. The second value disappears as a possible candidate as soon as I set the same pins, already set as registered in one COG, as registered again in a second COG. These should be different pins on different P2s, but I have just one.

But it runs perfectly stable right now.

So it is time for me to break it again by implementing my lock strategy by using the control line bi directional.

let the fun begin!

Mike
