Ringbuffer (was Streamer Questions - how to sync)

msrobots · 2019-06-10 00:54

OK,

Currently I just want to try this with digital pin streaming, but maybe one could even use the DAC and ADC to get more than one bit through one wire.

I can set up a streamer in one COG and stream to one or more pins from a HUB location.

I can also set up a streamer in another COG and stream from one or more pins to a HUB location. (This is supposed to be a COG on another P2; I am just testing on one P2.)

But how to sync this? I read:

By executing an XINIT and then a XCONT, you get time in the XINIT command to instantiate a smart pin to perform some operation which will then correlate with the queued XCONT command. Think of tossing a ball up gently, so that you can then hit it with a bat.

That seems to be what I want for the sending side: set up the streamer, toggle some other pin to signal that I am about to start sending, and start sending at the same time.

But what to do on the receiving end?

I am lost right now and need more input …

Mike

jmg · 2019-06-10 01:38

I'm not following. Is this Sync, Async, or some inferred-clock system like i2s with just a frame clock?

Usually Async self-starts on the slave; some systems use a break signal to assist sync.

Sync usually has a connected master -> slave clock, and that clock starts shifting into the slave.
Usually one end of this would use the opposite clock phasing, eg so data-out changes on CLK =\_ on master, but slave loads on CLK _/=

I think I have seen i2s systems where the clock is not sent, but Data and frame are, and if you know the CLK rate and sync to the frame, you can get by without a sync clk, but this would be trickier on p2,
A wait on frame edge, then immediate start of self-clocked RX will then read in 32? data bits.
Not sure if P2 smart pins allow you to preload the Pin divider, to set the relative phase, but usually adjacent P2 opcodes would be fast enough to be tolerable.

msrobots · 2019-06-10 02:43

I am just trying to figure out if I can use the streamer to send bytes from one P2 out of the HUB to another P2 into the HUB, using 1, 2, 4, 8, 16 or 32 pins for data on each P2.

It just needs to go in one direction, and I intend to use one pin as clock, so it would be 2, 3, 5, 9, 17 or 33 pins in use.

As far as I understand the documentation the streamer when streaming from pins to HUB will start reading after XINIT/XCONT.

But how to sync this with a sending streamer so that both start at the same time?

Let's say I want to send a buffer in HUB, 200 longs long, to another P2. Currently I use standard 8,n,1 serial with smart pins. I might be able to use 32 bits instead of 8. Here syncing works in HW in the smart pins, using the stop bit.

But I still need to fetch each long separately and read/write it from/to the HUB.

The streamer on the other hand can send a block of 200 longs up to sysclock (sysclock/2?) per long and does the same on receiving.

So I guess what I am after is a frame clock for each frame of 200 longs.
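To put rough numbers on the 200-long frame, here is a minimal sketch of the transfer time for each data-pin width. The 250 MHz sysclock and the 4 sysclocks per sample are assumptions chosen for illustration (both figures only come up later in the thread), and `frame_time_us` is a hypothetical helper, not anything from the P2 tools.

```python
from fractions import Fraction

def frame_time_us(longs, data_pins, sysclks_per_sample, sysclk_hz):
    """Microseconds needed to stream `longs` 32-bit words over `data_pins` pins."""
    bits = longs * 32
    samples = Fraction(bits, data_pins)      # each sample moves data_pins bits
    seconds = samples * sysclks_per_sample / sysclk_hz
    return float(seconds * 1_000_000)

# Assumed figures: 200 longs, 4 sysclks per sample, 250 MHz sysclock.
for pins in (1, 2, 4, 8, 16, 32):
    wires = pins + 1                          # data pins plus one clock/sync pin
    print(wires, "wires:", frame_time_us(200, pins, 4, 250_000_000), "us")
```

Under those assumptions the frame time scales inversely with the number of data pins, so the choice of width is mostly a trade between pins and latency.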

Mike

evanh · 2019-06-10 04:10

Yep, a start-stop handshake will be in order. If you want zero software jitter then a WAITxxx instruction must be used. It'll be just like the HyperRAM. Trial and error to find the amount of lag compensation in clocks. You can always

Probably best to start off at an easy rate of many sysclocks per data word. If the data rate is at full sysclock rate, certain sysclock frequencies will be unreliable when the lag matches the jitter boundaries, with clock rates above a boundary requiring +1 lag compensation compared to those below. It's also possible the frequency of these jitter boundaries is temperature dependent. That's something to work out later.

I'm thinking that if you don't want to dedicate a whole cog to receiver waiting, then two handshake lines would be the minimum: a request-to-send from the sender and a ready-to-receive from the receiver. The sender is then the only one that is waiting for the small but indeterminate interval from the request to the ready.

The negotiation gets more complicated if wanting to do bidirectional data on the same data lines.

evanh · 2019-06-10 04:14

PS: You will need both Prop2's operating from a single crystal source.

msrobots · 2019-06-10 05:34

evanh wrote: »

PS: You will need both Prop2's operating from a single crystal source.

Yes, I guessed so. It might still work up to a certain speed.

@"Beau Schwabe" had some interesting project of networking or combining P1 resources.

Simple concept:
A block in HUB RAM gets shared. Each P1 has a section where it can write, but it can read the whole buffer, so there are no write conflicts. Each P1 receives the block serially, copies its (buffered) changes to its own write section and sends the buffer along to the next P1. Connected as a ring, this goes on ad infinitum.
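The ring scheme described above can be modelled in a few lines. This is only a toy sketch of the idea, not Beau Schwabe's code: `circulate`, the one-slot-per-node layout and the list-based "block" are all assumptions made for illustration.

```python
def circulate(pending, laps):
    """Pass one shared block around a ring of len(pending) nodes.

    pending[i] is the value node i wants to publish in its own slot.
    Returns each node's local copy of the block after `laps` full laps.
    """
    n = len(pending)
    block = [None] * n                        # the block travelling around the ring
    local = [list(block) for _ in range(n)]   # each node's copy in its own HUB RAM
    for _ in range(laps):
        for node in range(n):
            block[node] = pending[node]       # a node writes only its own slot
            local[node] = list(block)         # but keeps the whole block readable
    return local

print(circulate(["a", "b", "c"], 1))
print(circulate(["a", "b", "c"], 2))
```

In this model the last node sees every slot after the first lap, and every node sees every slot after the second, which matches the "no write conflicts, everyone reads everything" behaviour described above.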

From the program's view it is like the commonly used mailbox system used to communicate with, say, a PASM COG: write into the mailbox, expect a result. Kye's FAT_Engine does not care at all whether the PASM driver is running on the same P1; it just writes into HUB and reads from HUB.

My thinking was that I could start the streamer sending a buffer/frame over and over again toggling some pin as stop bit to sync each frame start. And the receiver could wait for that signal to sync itself to the sending streamer every frame. The streamer would rearm itself via its own end interrupt, send toggle sync pin and done.

And the rest of the sending COG should be free for other things.

Sure, with different clock sources things get more complicated; maybe sysclock/4 could do it. But running at the same clock speed from different sources for the time it takes to transfer one frame should be doable. Or do you think a common clock is absolutely needed?

I will test this concept tonight on my eval board, but there the COGs have a common clock.

Enjoy!

Mike

msrobots · 2019-06-10 06:08

cgracey wrote: »

I think it would be easy to add a mode for streamer byte/word/long pin input, where it only grabs a byte/word/long when it sees a high/low/any transition on some pin.

yes, that is what I am looking for. Is it there somewhere hidden in the documentation?

Mike

evanh · 2019-06-10 06:40

How about a link for Chip's quote.

I actually remember Chip saying that too. However, he is not stating it as done, only that it would be possible to add to the design. I don't think it was ever added.

EDIT: Things suddenly changed from testing v33 to it's all away and final, without any notice, at the end. I get the feeling Parallax decided to give the Prop2 a deadline.

rjo__ · 2019-06-11 13:46

I was watching the clocking discussion... I don't know where it is either:)

I seem to recall that it didn't make it into the hardware.
By using waitse and wrfast or rdfast you can get right around 4 bytes every 5 to 7 clocks... with 32 pins. And you don't have to share a crystal.

jmg · 2019-06-11 20:56

msrobots wrote: »

Sure, with different clock sources things get more complicated; maybe sysclock/4 could do it. But running at the same clock speed from different sources for the time it takes to transfer one frame should be doable. Or do you think a common clock is absolutely needed?

A common clock is not absolutely needed if you do not expect to run at SysCLK/2 maximums.
With Crystal derived sampling, you should be well below coarse bit-creep issues over 32-64-128 bit frames, so that leaves edge sync.
Sysclk/4 sounds like the right ballpark, because you should be able to sync/align within a 1 sysclk jitter window, and you need to allow for align creep into the next bit window, so that's a 2 sysclk time axis locator.
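The budget argument above can be checked with a little arithmetic. This is a rough sketch under stated assumptions: the 100 ppm relative crystal mismatch is an illustrative figure, not a measured one, and the acceptance rule (sync jitter plus accumulated drift must stay within half a bit period) is my reading of the argument, not a P2 specification. Both function names are hypothetical.

```python
from fractions import Fraction

def drift_in_bits(frame_bits, ppm_mismatch):
    """Slip, in bit periods, accumulated over one frame between two free-running clocks."""
    return Fraction(frame_bits * ppm_mismatch, 1_000_000)

def fits_window(frame_bits, ppm_mismatch, sysclks_per_bit, jitter_sysclks=1):
    """True if sync jitter plus drift stays within half a bit period (assumed criterion)."""
    drift_sysclks = drift_in_bits(frame_bits, ppm_mismatch) * sysclks_per_bit
    return jitter_sysclks + drift_sysclks <= Fraction(sysclks_per_bit, 2)

print(fits_window(128, 100, 4))    # short frame at sysclk/4
print(fits_window(128, 100, 2))    # same frame at sysclk/2
print(fits_window(6400, 100, 4))   # 200 longs on a single data pin at sysclk/4
```

With these assumed numbers a 128-bit frame fits at sysclk/4 but not at sysclk/2, and a 6400-bit frame without resync does not fit even at sysclk/4, which suggests resyncing on every frame edge and keeping frames short when the clocks are independent.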

msrobots · 2019-06-11 22:21

I diddled around with this last night.

Man, what a challenge. Wrapping my head around the pin groups and p bits was the hardest part; I could not get a single pin to light up an LED.
Finally I got the right combination, and it helps to know that one also has to DRVL the pins used by the streamer to output. Doh moment (I).

The next Doh moment (II) was that the buffer size used in the streamer's xinit/xcont depends on the number of pins used. So if one pin is used it is byte count times 8, with two pins byte count times 4, but that still does not fit right somehow.
That, combined with some serious misunderstanding of RDFAST/WRFAST and its wrapping (Doh moment III), led to curious results.
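The relation described above (byte count times 8 for one pin, times 4 for two pins) generalises to bits divided by pins. A minimal sketch, assuming the count is the number of pin samples and that it has to fit a 16-bit field; `streamer_count` is a hypothetical helper and the 16-bit limit is an assumption taken from the D[15:0] field mentioned elsewhere in this thread.

```python
def streamer_count(byte_count, data_pins):
    """Number of pin samples needed to move byte_count bytes over data_pins pins."""
    if data_pins not in (1, 2, 4, 8, 16, 32):
        raise ValueError("data_pins must be 1, 2, 4, 8, 16 or 32")
    bits = byte_count * 8
    if bits % data_pins:
        raise ValueError("buffer does not divide evenly into samples")
    count = bits // data_pins
    if count > 0xFFFF:                        # assumed 16-bit count field
        raise ValueError("count does not fit in 16 bits; split the transfer")
    return count

for pins in (1, 2, 4, 8, 16, 32):
    print(pins, "pin(s):", streamer_count(800, pins))   # 800 bytes = 200 longs
```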

I do have a receiving COG, just waiting for a change on the clock line ready and set up to do its XCONT.

'=======================================================================
'
' receiving COG
' 
'=======================================================================
DAT
		org	0

receiveStreamer_init

rxRingnet_ptr	rdlong	rxRingnet_ptr,	ptra++			'HUB address of Buffer to receive
rxPin_clock	rdlong	rxPin_clock,	ptra++			'Pin to use for clock
rx_xinit	rdlong	rx_xinit,	ptra++			'Xinit for streamer
rx_xfrq		rdlong	rx_xfrq,	ptra++			'Xfrq for streamer

		setxfrq	rx_xfrq					'sysclk sample rate
		add	rxPin_clock,	##%011_000000		'now event source for setse1 for change on pin

'-----------------------------------------------------------------------
'
' waits for toggle clock and reads buffer
' 
'-----------------------------------------------------------------------
rxStartStreamer
		wrfast	##0,		rxRingnet_ptr		'point the FIFO at the receive buffer
		setse1	rxPin_clock				'arm event 1 for a change on the clock pin
		waitse1 					'wait for change on clock pin 
		xcont	rx_xinit,	#0  			'start sampling
		jmp	#rxStartStreamer			'do it again

This seems quite wasteful, but since the goal here is to combine the resources of multiple P2s it is not that bad. You lose 2 COGs per P2, so chaining 2 P2s together one has 12 COGs, chaining 3 P2s together one ends up with 18 COGs, and so on.
The sending COG has a little bit more to do but the corresponding code looks like this:

		rdfast	##0,		txBuffer_ptr
		outnot	txPin_clock
		xinit   tx_xcont1,	#0

Basically hitting the button and start running. It would be nice to change that to

		rdfast	##0,		txBuffer_ptr

' use some smartpin mode to toggle txPin_clock once, about 6 clock cycles in the future
' so that both COGs are in a waitse1 and coming out of it on pin toggle


		setse1	txPin_clock
		waitse1 					'wait for change on clock pin 
		xinit   tx_xcont1,	#0

Basically having a smartpin shooting a starter gun and both runners kneeling.

I still do not understand the difference between XINIT and XCONT. Both seem to do the same thing, yet Chip talked about throwing a ball in the air and hitting it with a bat, which is exactly what I want to do here, except that my clock-pin bat should hit two balls next to each other, belonging to two different P2s.

This streamer freq thing is bombastic. If you use a value of 5 or 10 you can watch bits. 4-bit transfer looks cool on the LED board...

Currently I am fighting with FBLOCK.

My receiver reads a continuous bitstream, my sender sends a continuous bitstream. That works so far. The receiver loads its stream at a fixed position, overwriting the previous frame, my ring buffer.

Here comes my trouble. The sender needs to assemble the next frame while sending it. So it needs to send x longs from HUB addr A and then y longs from HUB addr B WITHOUT skipping a beat.
In theory this should work, since the streamer has a command buffer, but I am failing miserably.

This is what I want to do:

txStartStreamer_OutputFirst
		rdfast	##0,		txBuffer_ptr
		outnot	txPin_clock
		xinit   tx_xcont1,	#0
		'fblock	##0,		tx_xaddr2
		rdfast	##0,		tx_xaddr2
		xinit   tx_xcont2,	#0
		ret

I went iteratively through different combinations, no joy:

txStartStreamer_OutputFirst
		'stalli
		rdfast	##0,		txBuffer_ptr
		outnot	txPin_clock
		xinit   tx_xcont1,	#0
		'fblock	##0,		tx_xaddr2
		waitxfi
		rdfast	##0,		txBuffer_ptr
		xinit   tx_xcont1,	#0
		'xcont   tx_xcont2,	#0
		'allowi
		'waitxf
		ret

		waitxfi

		rdfast	##0,		tx_xaddr2
		'fblock	##0,		tx_xaddr2
		xcont   tx_xcont2,	#0
		
		ret

And yes, @jmg, I am hoping for sysclock/4 per bit. Currently sysclock/16 is rock stable, and since I have other problems I need to slow down.

Maybe I need to split the whole buffer transfer into 1-3 bursts each from a single HUB location.

Enjoy!

Mike

jmg · 2019-06-11 22:35

msrobots wrote: »

Here comes my trouble. The sender needs to assemble the next frame while sending it. So it needs to send x longs from HUB addr A and then y longs from HUB addr B WITHOUT skipping the beat.

That's likely to be quite tough, & more so at highest speeds. but if you use a sync-pin, you only need to be gap-less from each sync edge, until the expected burst ends.
You could even use a Pin in capture mode to time between sync-pin edges, to allow you to encode break, or even run a whole lower-bit-rate control path on the sync lines, 'for free'.
eg If you sync on =\_ and capture Pulse Width low, you could look at a SENT (single edge nibble transmission) style encode, that encodes 4 bits per edge pair ?

msrobots · 2019-06-12 02:49

Yeah, I am pondering a lot here.

I think the streamer on the eval is broken in several places.

It does help, not surprisingly, to have the buffers long-aligned, and they also have to be multiples of 64 bytes to use wrapping.

What does not work is fblock. It is supposed to provide a new HUB address at the next wrap-around of rdfast, but it does not do that and does other strange things instead.

I need to let the sending streamer run out of commands to set a new HUB address. Thus getting out of sync with my receiver.

So I need to rethink my approach.

Single bursts are working as expected. Also the event #11 'Streamer idle' to rearm. I have to think about this, but to me it looks like I need some command frames in between to tell the receiver how many bursts of what size to expect to transfer one buffer.

I also might have a testing issue with just one instance. Sometimes I am reading the same buffer I am writing into. It should work, since the offsets are different, but it might be an issue.

As usual it is not as simple as I thought, but my goal is to keep the overhead as small as possible so as not to burn all the speed advantage I hope to gain.

What IS nice is that by using the streamer one can easily support a variety of numbers of data lines between the P2s.

Enjoy!

Mike

Yanomani · 2019-06-12 03:20

Hi msrobots

XINIT and XCONT are meant to be used in immediate sequencing, XINIT {#}D,{#}S first, immediately followed by XCONT {#}D,{#}S.

Both will be accepted by the STREAMER, without any implicit delays, other than the 2 sysclks each one's execution will last, plus the time needed to exhaust the NCO rollover count (D/#[15:0]), as programmed at the XINIT command. And there resides the trick...

By using some (possibly) neutral STREAMER mode to construct the XINIT {#}D,{#}S command, like:
%1100_dddd_eppp_xxxx <long> 32-bit immediate $xxxxxxxx, where:

dddd = 0000 (no streamer DAC output);

eppp = 0xxx (disable pin output);

<long> = provided the above values for dddd and eppp are respected, any value that {#}S can have/hold will not affect such command (nor the pins) in any ways;

Then the value expressed by D/#[15:0] will convey the number of NCO rollovers that the command will be active/last (in the present case, doing nothing other than counting the number of ..., NCO rollovers).

Since you know the frequency you've programmed into the STREAMER NCO, you can account for the total number of sysclks it'll last BEFORE the XCONT is engaged, and thus can depend on some SMART PIN to successfully accomplish its task.
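As a rough illustration of that accounting, the sketch below estimates how many sysclks a 'do nothing' XINIT lasts. It assumes that a SETXFRQ value of $8000_0000 gives one NCO rollover per sysclk and that the rollover rate scales linearly with the value; `xinit_sysclks` is a hypothetical helper, and the 2 sysclks each instruction takes to execute are not included.

```python
from fractions import Fraction

NCO_FULL_RATE = 0x8000_0000   # assumed: this SETXFRQ value gives one rollover per sysclk

def xinit_sysclks(rollover_count, xfrq):
    """Sysclks spent by a streamer command lasting rollover_count NCO rollovers."""
    if not 0 < xfrq <= NCO_FULL_RATE:
        raise ValueError("xfrq must be in 1..$8000_0000")
    return Fraction(rollover_count * NCO_FULL_RATE, xfrq)

print(xinit_sysclks(6, 0x8000_0000))   # NCO at sysclk rate
print(xinit_sysclks(6, 0x4000_0000))   # NCO at sysclk/2
print(xinit_sysclks(8, 0x0800_0000))   # NCO at sysclk/16
```

If the assumption holds, choosing the rollover count is then just a matter of matching the time the smart pin setup needs.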

Thus, the time you'll have to, eventually, engage the mandatory WRFAST and fire any SMART PIN within its intended mode of operation, MUST exactly fit in the time provided by the execution of XINIT (plus, of course, the two initial sysclks of XCONT).

When XCONT starts its execution, the SMART PIN MUST be doing its role, timely and in perfect sync with it.

At the risk of being interpreted as unnecessarily verbose, XINIT and XCONT can be completely different commands; the only precaution I would take is to preserve the original SETXFRQ D/# (Set NCO frequency), since the whole timing depends on its correct operation (AKA: glitchless).

Hope it helps

Henrique

msrobots · 2019-06-12 03:44

jmg wrote: »

msrobots wrote: »

Here comes my trouble. The sender needs to assemble the next frame while sending it. So it needs to send x longs from HUB addr A and then y longs from HUB addr B WITHOUT skipping the beat.

That's likely to be quite tough, & more so at highest speeds. but if you use a sync-pin, you only need to be gap-less from each sync edge, until the expected burst ends.
You could even use a Pin in capture mode to time between sync-pin edges, to allow you to encode break, or even run a whole lower-bit-rate control path on the sync lines, 'for free'.
eg If you sync on =\_ and capture Pulse Width low, you could look at a SENT (single edge nibble transmission) style encode, that encodes 4 bits per edge pair ?

I had to read that a couple of times.

And I came to the same conclusion: use the clock line in a more meaningful way than just toggling. What I need to do is separate command frames from data frames. A command frame would describe the data frames that follow it, giving the offset/address and size of each data frame, so the receiver can put them directly into HUB where they belong.
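One way such a command frame could be laid out is sketched below. The layout is entirely hypothetical (a long holding the number of data frames, then one offset long and one size long per frame, all little-endian); the thread does not define a format, so treat the names and the field order as assumptions.

```python
import struct

def pack_command_frame(segments):
    """segments: list of (hub_offset, size_in_longs) for the data frames that follow."""
    words = [len(segments)]
    for offset, size in segments:
        words += [offset, size]
    return struct.pack("<%dI" % len(words), *words)   # little-endian 32-bit longs

def unpack_command_frame(raw):
    """Inverse of pack_command_frame; returns the list of (hub_offset, size_in_longs)."""
    (count,) = struct.unpack_from("<I", raw, 0)
    flat = struct.unpack_from("<%dI" % (2 * count), raw, 4)
    return list(zip(flat[0::2], flat[1::2]))

frame = pack_command_frame([(0x1000, 50), (0x2000, 150)])
print(len(frame), unpack_command_frame(frame))
```

A fixed-size header like this keeps the receiver simple: it can read one long, learn how many descriptors follow, and then set up one streamer burst per descriptor.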

Mike

msrobots · 2019-06-12 04:00

Yanomani wrote: »

Hi msrobots

XINIT and XCONT are meant to be used in immediate sequencing, XINIT {#}D,{#}S first, immediately followed by XCONT {#}D,{#}S.

Both will be accepted by the STREAMER, without any implicit delays, other than the 2 sysclks each one's execution will last, plus the time needed to exhaust the NCO rollover count (D/#[15:0]), as programmed at the XINIT command. And there resides the trick...

By using some (possibly) neutral STREAMER mode to construct the XINIT {#}D,{#}S command, like:
%1100_dddd_eppp_xxxx <long> 32-bit immediate $xxxxxxxx, where:

dddd = 0000 (no streamer DAC output);

eppp = 0xxx (disable pin output);

<long> = provided the above values for dddd and eppp are respected, any value that {#}S can have/hold will not affect such command (nor the pins) in any ways;

Then the value expressed by D/#[15:0] will convey the number of NCO rollovers that the command will be active/last (in the present case, doing nothing other than counting the number of ..., NCO rollovers).

Since you know the frequency you've programmed into the STREAMER NCO, you can account for the total number of sysclks it'll last BEFORE the XCONT is engaged, and thus can depend on some SMART PIN to successfully accomplish its task.

Thus, the time you'll have to, eventually, engage the mandatory WRFAST and fire any SMART PIN within its intended mode of operation, MUST exactly fit in the time provided by the execution of XINIT (plus, of course, the two initial sysclks of XCONT).

When XCONT starts its execution, the SMART PIN MUST be doing its role, timely and in perfect sync with it.

At the risk of being interpreted as unnecessarily verbose, XINIT and XCONT can be completely different commands; the only precaution I would take is to preserve the original SETXFRQ D/# (Set NCO frequency), since the whole timing depends on its correct operation (AKA: glitchless).

Hope it helps

Henrique

Thank you Henrique,

Sure, one can use a 'fake' Xinit with needed time for 'throwing the ball gently in the air', followed by a Xcont doing the real work. That was his thinking, OK that makes sense.

I really enjoy programming the P2, like PASM on steroids. But I am a hands on programmer, reading about something helps but to use it you have to - hmm - USE IT!

The difference between Xinit and Xcont seems to be that Xinit resets the phase of the NCO, so two (or more) Xconts following each other do work as expected. Just setting a new HUB address with fblock seems to be f...ed up, at least for 1 and 2 bit mode and streaming to the pins.

Enjoy!

Mike

Yanomani · 2019-06-12 04:09

The following lines were meant to be a postscript to my previous post, but I chose to hold them until you had a chance to comment on it:

P.S. On second thought, if carefully done, there can even be another SETXFRQ, provided you properly match the new NCO-timing parameter passing (ALU to STREAMER) with the XINIT instruction execution timing (and, sure, the exhaust of the first NCO total timing).

If done right, the first SETXFRQ can set the NCO rollover to happen at the rate of sysclk, thus easing the calculation of XINIT execution and syncing.

Henrique

msrobots · 2019-06-12 04:28

Oh Henrique,

this way I could fine tune the NCO to my own sys clock? - hmm.

Naah, I need to get this as simple as possible. Sure, a transfer rate of sys clock/4 sounds cool, but too much is too much. Think about it: say, 16 bits at sys clock/8 should give some fast transfer, if needed. But even 4 or 8 pins might be OK.

This is the nice thing with the streamer: the basic code is the same for all bit modes.

The main plan is to connect multiple P2s, and longer wires might be needed, so I am aiming high on the same chip and then backing down with multiple P2s.

Mike

Yanomani · 2019-06-12 04:53

On third thought, if you start from a first SETXFRQ $8000_0000 (the default value on cog start), there is no way for another SETXFRQ to harm the Streamer NCO frequency, because there is always a chance to recalibrate it, at every sysclk, without shortening any cycles.

In turn, raising the NCO frequency could be tricky, and should be done carefully, for obvious reasons.

Henrique

Yanomani · 2019-06-12 05:00

I almost love to think about Streamers and Cogs as a perfect couple of Tango Dancers.

The more the practice, the better the performance.

msrobots · 2019-06-12 07:03

Well, just as the P1's video circuit was never made able to read, it seems there is no clocked input mode for the streamer on the P2.

But currently (on 1 P2) just letting them run next to each other seems to work fine. Without a common clock and with multiple P2s I might need to slow down; time will tell.

Still thinking about buffer handling.

Mike

evanh · 2019-06-13 04:26

Here's a question, JMG / Chip / anyone,
What happens if the PLL is engaged (SS=%11) and only XI is connected to a fixed frequency CMOS crystal oscillator clock source? Will the PLL provide a steady multiplier of that clock source?

cgracey · 2019-06-13 04:41

evanh wrote: »

Here's a question, JMG / Chip / anyone,
What happens if the PLL is engaged (SS=%11) and only XI is connected to a fixed frequency CMOS crystal oscillator clock source? Will the PLL provide a steady multiplier of that clock source?

Yes. You would set the oscillator configuration bits (D[3:2]) to %01 to receive a clock without any loading caps enabled.

jmg · 2019-06-13 04:46

evanh wrote: »

Here's a question, JMG / Chip / anyone,
What happens if the PLL is engaged (SS=%11) and only XI is connected to a fixed frequency CMOS crystal oscillator clock source? Will the PLL provide a steady multiplier of that clock source?

Yes, CMOS drive is fine.
That is how Peter has been running P2D2, and the next model has choices of
a) CMOS Oscillator
b) Si5351 i2c Clock generator
c) TCXO Clipped sine, AC coupled to XI.

Clipped sine looks ok on my bench test, up to 38.4MHz. You AC couple XI through 330pF~1nF to the clipped sine out (avoid 100nF, as Rf + large C has a l-o-n-g settling time; 1M & 1nF is 1ms).
(VC)TCXO seem to have a Zo of 150~200 ohms, and there will be some upper MHz where the P2 gain is not enough. 26MHz is fine, and 38.4MHz looks to have a gain just under unity.

evanh · 2019-06-13 05:01

The question is whether the PLL multiplying would still be usable. Can a driven CMOS 1 MHz clock input on XI be multiplied up to 250 MHz internally in the Prop2?

EDIT: I guess it's really only a question for Chip to answer.

PS: This pertains to msrobots's desire to make inter-prop DMA as easy, and fast, as possible. The idea being that two, or more, prop2's can be in sync on the same clock source without needing some voodoo to get 250 MHz synchronised sysclocks.

jmg · 2019-06-13 05:19

evanh wrote: »

The question is whether the PLL multiplying would still be usable. Can a driven CMOS 1 MHz clock input on XI be multiplied up to 250 MHz internally in the Prop2?

EDIT: I guess it's really only a question for Chip to answer.

PS: This pertains to msrobots's desire to make inter-prop DMA as easy, and fast, as possible. The idea being that two, or more, prop2's can be in sync on the same clock source without needing some voodoo to get 250 MHz synchronised sysclocks.

The usual caveats around low PFD would apply.
The jitter tests show that higher PFD is better, so 10~20MHz is very good for lowest jitter, whilst < 1MHz is quite poor.
So you may be able to circulate a soft-edge/quasi-sine clock of 5~10MHz for something of a compromise, if you needed to lock multiple P2's.

evanh · 2019-06-13 05:21

Here's an example config I envisage (M = x256 in this case):

'HUBSET ##%0000_000E_DDDD_DDMM_MMMM_MMMM_PPPP_CCSS
 HUBSET ##%0000_0001_0000_0001_0000_0000_1111_0111

That hopefully would give an internal sysclock of 256 MHz when using an externally driven 1 MHz clock source. Right?
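For checking a config word like this one, here is a small decoder sketch. It assumes my reading of the P2 documentation: the XI divider and VCO multiplier fields hold the value minus one, and a post-divider field of %1111 means divide by 1 while any other value means divide by (PPPP+1)*2. `decode_hubset` is a hypothetical helper.

```python
def decode_hubset(value, xi_hz):
    """Decode the PLL fields of a HUBSET clock word (field encodings are assumptions)."""
    xi_div = ((value >> 18) & 0x3F) + 1       # DDDDDD: divide XI by 1..64
    vco_mul = ((value >> 8) & 0x3FF) + 1      # MMMMMMMMMM: multiply by 1..1024
    pppp = (value >> 4) & 0xF
    post_div = 1 if pppp == 0b1111 else (pppp + 1) * 2
    return {
        "pll_enabled": bool((value >> 24) & 1),
        "xi_div": xi_div,
        "vco_mul": vco_mul,
        "post_div": post_div,
        "pfd_hz": xi_hz // xi_div,            # reference frequency seen by the PLL
        "sysclk_hz": xi_hz * vco_mul // (xi_div * post_div),
    }

posted = 0b0000_0001_0000_0001_0000_0000_1111_0111
print(decode_hubset(posted, 1_000_000))
```

If the minus-one encoding is right, a multiplier field of 256 decodes to x257, so the word as posted would give 257 MHz from a 1 MHz input, and a field of 255 (`%00_1111_1111`) would give exactly 256 MHz.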

evanh · 2019-06-13 05:24

jmg wrote: »

So you may be able to circulate a soft-edge/quasi-sine clock of 5~10MHz for something of a compromise, if you needed to lock multiple P2's.

Okay, so aside from quality issues, the PLL doesn't care that it's not controlling the crystal then? The mode select still allows the multiply to function?

cgracey · 2019-06-13 05:28

evanh wrote: »

The question is whether the PLL multiplying would still be usable. Can a driven CMOS 1 MHz clock input on XI be multiplied up to 250 MHz internally in the Prop2?

Yes. You have a 6-bit XI divider that ranges from 1..64, followed by a 10-bit multiplier that ranges from 1..1024.

evanh · 2019-06-13 05:30

Cool.

jmg · 2019-06-13 05:44

evanh wrote: »

jmg wrote: »

So you may be able to circulate a soft-edge/quasi-sine clock of 5~10MHz for something of a compromise, if you needed to lock multiple P2's.

Okay, so aside from quality issues, the PLL doesn't care that it's not controlling the crystal then? The mode select still allows the multiply to function?

Correct, the digital dividers do not know, or care, what the signal comes from. The PLL never 'controls the crystal', more the other way around.
One reference signal is the XI pin divided by 1..64, the other is VCO / 1..1024

One detail to watch, with many P2's, could be when to enable the XI, as there are known issues around too-soon enable of clock source before everything has settled.
It could start on RCFAST and then wait for a signal on the data lines, to indicate the master had a valid clock signal ready too.
You could tie reset lines, but I think the reset exit times are not guaranteed to be identical.
