I've just done some preliminary testing of 32-bit (pins 0-31) data transfers between two cogs. Amazingly, I can go right up to 270 MT/s without any errors! No variable compensation required.
Using the very simplest method of receiver always waiting, the only compensation at all is the insertion, on the sender cog, of +3 clocks between the handshake OUTH and sending XINIT. Which makes complete sense, since the receiver can't know the sender has started until after the fact - so there's at least one instruction time of response lag.
Here's a graph of the sensitive areas when stepping through at 1 MHz steps. So there are bad spots around 150 MHz and 300 MHz.
And I presume the width of the bad range is dependent on how bad the jitter is. I was doing 1 MHz steps with an XDIV of 20 on the P2ES chip, so some extra jitter comes from that.
With 2 P2s using a locked-clock scheme as discussed here, yes, you could even measure the jitter by the change in slot width.
Within the same P2, just across cogs, the long term jitter does not matter. There might be a short term jitter effect over the 2-3 sysclocks covering the transit times, but I'd expect short term jitter to be very low.
This plot may be a nice way to check PCB decoupling.
The match count seems to drop only to 24 - I wonder why that is? A deeper notch might have been expected when all bits change?
Since my clock is not a regular clock but more like a signal, I decided to use a transition from low to high to signal the start of a command frame and a transition from high to low to signal a data frame.
I am also pondering using registered (clocked) pins for my signal line. It may not help against the phase difference between the P2s, but it should allow more precise timing on sender and receiver.
A normal digital pin can be set via the smartpin interface to delay its output by one sysclock and then transition at exactly that moment.
Does that mean the outnot command now takes three cycles to execute, or does it still execute in 2 cycles but the pin changes one clock later, in the middle of the next command?
Same question for an input pin: if I wait for a transition on a pin with waitse1, how granular is that?
Sadly I do not have a scope available to test this.
The second case: the instruction still executes in 2 clocks, the pin just changes later. The lag is actually longer than 3 clocks, but the streamer data has the same lag, so they cancel each other out. You'll notice it when trying to handle the turnaround of bidirectional transfers.
In your case, make the compensation only 2 clocks - a NOP will do. See below in the source where I comment on WAITX #0 vs #1.
As for input granularity: I'm currently using WAITSE1 on the receiving end. That makes a natural one-instruction lag to compensate for, at the sending end.
'streamer to streamer copy (excerpt - hspin, xcfg, txcfg and the sdata/ddata buffers are defined elsewhere in the full source)
coginit #$10, ##@receiver ' start receiver task in any spare cog
drvl #hspin
mov dira, ##$ffff_ffff
waitx ##50000 ' pause to allow time for emit status and receiver cog to boot
rdfast #0, ##@sdata ' setup sender FIFO
setxfrq ##xcfg ' set streamer data rate
outh #hspin ' tell receiver to record to hubRAM
waitx #0 ' use #0 for sysclock 1/4 (loose timing), use #1 for sysclock 1/1 (tight timing)
xinit txcfg, #0 ' go!
waitxfi ' wait for completion of DMA
waitx #500 ' pause to collect any overrun
outl #hspin ' stop! tell receiver to die
ORG 0
receiver
wrfast #0, ##ddata ' setup receiver FIFO
setxfrq ##xcfg ' set streamer data rate
setse1 #$40+hspin ' handshake rise
waitse1 ' wait for rise
xinit .rxcfg, #0 ' go!
setse1 #$80+hspin ' handshake fall
waitse1 ' wait for fall
cogid pa
cogstop pa
.rxcfg long $f000_0020 ' 32 words (WFLONG), pins 0-31
I used my scope once to verify my code was actually cycling, i.e. to learn the streamer and get myself out of a duh moment where I'd started off using hubexec. After that, everything was done with prints down the comport. The only way to be sure what was copied is to read out and compare both the send and receive buffers.
EDIT: Added the receiver ORG 0
PS: That is 64 longwords but the streamer config was only set to 32 transfer cycles.
Just did some experiments with registered enabled and it looks messy. The amount of lag varies as the frequency goes up. Actually more like what I was expecting originally.
Notably, at lower sysclock frequencies, the correct compensation is 1 clock. A virtual waitx #-1.
Hmm, I guess the real takeaway is that it's not wise to rely on being able to operate above about 125 MT/s. sysclock/3 works reliably at any sysclock but that's even slower.
EDIT: Attached is the waitx #-1 equivalent compensation for transfers at sysclock rate. It has a distinct and crisp phase shift at 152 MHz.
Ok, slower please, I need to understand that.
If I execute an outnot, or any other command changing a pin, at say sysclock 0, the actual pin lags behind and may be set at sysclock 2 - or may not be, since it lags a bit, and I am not sure whether the next command sees the correct value. So instead I can use a registered pin and it WILL be ready at sysclock 3, while the next command starts at sysclock 2 (a NOP) and the one after that at sysclock 4 (my XCONT). Correct/wrong?
And if I have a waitse1 started at sysclock 0, will it be able to catch the transition at sysclock 3 and execute the next command at sysclock 4 (my XCONT)? Correct/wrong?
If I also register the input pin, I introduce another lag of one cycle on the receiving end, so my waitse1 can see the change at sysclock 4 at the earliest - or, since the output is already registered, still at sysclock 3? So does my XCONT start at sysclock 4 or sysclock 5?
That leads to the next question: if and how does the pipeline affect this?
Sysclock/2 might be very tight, but sysclock/4 seems to be doable even with phase differences of two P2 clocks.
I've just done a sysclock/2 with registered pins and 1-clock compensation. It comes out better than unregistered - clean until 300 MHz (150 MT/s). You're still limited in max throughput but it does allow another option.
Whereas unregistered keeps its jittery patch around 150 MHz sysclock. So registered is the more predictable and less prone to upset from jitter.
It's not that critical to know the exact explanation. My description won't be perfectly accurate, that's for sure. It's only a generalisation for visualisation purposes. I should probably stop rambling until I have a clearer way to tell the story.
Some of my comments have been about extreme cases that aren't likely to be used in practice, simply because they aren't reliable enough.
Main point:
You just have to be aware there is a high likelihood of the sender and receiver not seeing the start signal on exactly the same sysclock. So adding compensation padding after the start trigger is to be expected. How much that compensation turns out to be is found by experimenting.
To cut the story down as much as possible: the compensation really only deals with the time interval from the sender's start signal to the beginning of the sender's streamed data, and how that interval presents to the receiver.
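A minimal restatement of that knob, reusing the names from the listing above (TX_COMP is a placeholder for the experimentally found padding, not a constant from the source):

outh    #hspin          ' start trigger, seen by the receiver
waitx   #TX_COMP        ' the compensation knob - padding found by experiment
xinit   txcfg, #0       ' streamed data begins a fixed time after the trigger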
Ideally the receiving streamer should sample at the middle of each bit sent: thus start one sysclock later when running at 1/2 (two-sysclock-wide bits), and at 1/4 one should start sampling 2 sysclocks later.
Except if the chip is already doing that in hardware (how else would sampling at 1/1 work?).
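A sketch of that mid-bit start, assuming a bitclocks register holding clocks-per-bit and placeholder rxcfg/sig_pin names (this is the same bitclocks>>1 shift mentioned further down):

setse1  #$40+sig_pin    ' event on handshake rise, as in the listing above
mov     pa, bitclocks   ' clocks per bit
shr     pa, #1          ' half a bit-time, in sysclocks
waitse1                 ' sender has started
waitx   pa              ' push the sampling point toward mid-bit (0 at 1/1)
xinit   rxcfg, #0       ' go!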
My new receiver COG seems to understand the concept of data and cmd frames as told. But my sender seems to hiccup somehow and crashes completely. Time to rewrite...
I hope to get that sorted out tonight.
Currently I'm running with sysclock/16, so I can even delay the receiver with a nop to be sure not to sample too early.
The new plan is to forget the send streamer's 'out of command' event as the source for an interrupt.
The receiver COG code needs less housekeeping time, but it needs to wait for a complete command frame, so it needs waitxfi: it does its waitse1, receives its cmd frame, then sets up the next data frame, does its waitse1, receives its data frame, and sets up the next cmd frame. It does not need to wait on waitxfi for the data frame. Sort of an endless loop.
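A rough sketch of that loop, using the rise/fall framing from earlier ($40 = rise, $80 = fall, as in evanh's source; sig_pin, cmdcfg and datcfg are placeholders):

rxloop  setse1  #$40+sig_pin    ' rise marks a command frame
        waitse1
        xinit   cmdcfg, #0      ' capture the command frame
        waitxfi                 ' must complete before acting on it
        ' decode command, point the FIFO at the next data buffer here
        setse1  #$80+sig_pin    ' fall marks a data frame
        waitse1
        xinit   datcfg, #0      ' capture the data frame - no waitxfi,
        jmp     #rxloop         '  the next cmd setup overlaps it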
The sender COG has more housekeeping to do, so it will always take longer. It needs to run a sort of display list depending on where its own buffer sits in the ringbuffer (if any), so it needs to send either 1, 2 or 3 frames to finish the job of transferring the buffer to the next COG/P2.
Since the rest of the COG is doing nothing, why run code in an interrupt just because it is there? So the sender will become an endless loop also.
Now the only puzzle piece left is to set up a smartpin to toggle the clock pin in xxx sysclocks and let sender and receiver sit in an identical waitse1. Then I can use waitx to fine-tune the start offset later on.
This way I would have a very clean start - except for the odd and even sysclocks of the COGs themselves. A waitx #1 makes a COG odd or even.
Time to rewrite... I hope to get that sorted out tonight.
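A sketch of the smartpin clock idea from a couple of paragraphs up, assuming the standard transition-output smartpin mode (P_TRANSITION and P_OE are the stock Spin2 pin-mode constants; clk_pin and period are placeholders):

dirl    #clk_pin                        ' reset smartpin before configuring
wrpin   ##(P_TRANSITION | P_OE), #clk_pin
wxpin   period, #clk_pin                ' sysclocks between transitions
dirh    #clk_pin                        ' smartpin armed
wypin   #2, #clk_pin                    ' fire: two transitions = one clean pulse

Both cogs would then sit in the same waitse1 on clk_pin, with the pulse arriving a programmed number of sysclocks after the wypin.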
If you were pushing this, it might pay to include a 'delay learning' phase, similar to the plots evanh did above, only instead of MHz on the Z axis you sweep WAIT counts, looking for the edge cases.
You then select a delay that is 'best fit' in the middle.
At higher MHz there can be `bonus sysclks` added, and that might even change with temperature and P2 batch.
Using the pin registers is likely to reduce the variation spread, at least in theory. (evanh's comment above suggests that's maybe not confirmed by test?)
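A Spin2 sketch of such a learning phase, where transfer_test() is a hypothetical routine that runs one transfer with the given padding and returns true on a byte-for-byte buffer match (assumes a single contiguous passing window):

PUB find_best_delay() : best | d, first, last
  first := -1
  last := -1
  repeat d from 0 to 15                 ' sweep WAIT counts
    if transfer_test(d)
      if first == -1
        first := d                      ' start of the passing window
      last := d                         ' extend the passing window
  best := (first + last) / 2            ' 'best fit' in the middle, -1 if none passed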
Below 150 MT/s, registered is notably superior to unregistered.
The concern about registered was all in frequencies above 150 MT/s. I concluded that registered or not, pushing into those rates is not recommended. And I gave a rough guide of capping at 125 MT/s to allow some thermal/whatever headroom.
Hmm, JMG, you mentioned pin loading earlier too. That's probably going to be just as big an issue. The rate cap could end up much lower with long tracks.
Fundamentally, there is no clock synchronous with the data lines, and there is no way to add one. It's always going to be a tad hairy, and any high speed attempt is dependent on tuning the clock to suit the board, or tuning the board to suit the clock.
Here's the full source code. It's many pieces slapped together so you have to find the streamer part, which is commented, in the whole source. It relies on using loadp2's terminal for interaction.
EDIT: Updated compensation comments to include registered details

I just need to do a sanity check that my send loop gives the receive loop enough setup time between bursts. So I need to count the maximum clock cycles for the receiver loop and the minimum clock cycles for my sender.
Mike
I had to rewrite everything just about 20 times. For reasons that elude me, I am unable to change the HUB address of a streamer once started. So I rethought the process and now have a working ringbuffer.
Once I had the basics running I could fine-tune it. And it is amazing. I just tested at 180 MHz, and it now runs stable at sysclock/2; it locks at sysclock/1 but has transmission errors then.
Maybe registered pins could make the last difference, but I am obviously too stupid to find out how to do that from the documentation.
Anyways, using 2 or more P2s in a daisy chain will not run that fast. I was aiming for 1/4, and having 1/2 running is pretty cool.
To test this I had to simulate 2 P2s on one P2. In a regular installation one would have just one instance of the driver and the needed buffers; I needed two to test.
The current setup will/can use P0-P31 for data and P55/P56 for the needed clocks. It works from 1 data pin up to 32, but with just one P2 I can only test up to 16 data lines.
In the test file, at the top of the first procedure, you can change two variables to select the variations.
mode can be: 0 for one data line, 1 for two data lines, 2 for four, 3 for eight, 4 for sixteen and 5 for thirty-two data lines.
The other parameter is bitclocks. I decided that an xfrq parameter is too weird to use; what one really wants to set is the number of clocks per bit.
So bitclocks := 0 or 1 will use 1 clock per bit, aka xfrq := $8000_0000
bitclocks := 2 will use 2 clocks per bit, aka xfrq := $4000_0000
bitclocks := 4 will use 4 clocks per bit, aka xfrq := $2000_0000
The interesting thing is that one can use bitclocks := 3, for example. I hope to get this running over wires between P2s - sampling at the middle of 3 clocks allows for phase differences between the two P2s' sysclocks.
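A Spin2 sketch of that bitclocks-to-xfrq conversion - a hypothetical helper, not from the attached source, assuming the mapping above generalizes as $8000_0000 divided by bitclocks (for bitclocks := 3 the division doesn't come out even, so the NCO adds a little jitter of its own):

PUB bitclocks_to_xfrq(bitclocks) : xfrq
  if bitclocks < 2
    xfrq := $8000_0000                ' 1 clock per bit, sysclock/1
  else
    xfrq := $8000_0000 +/ bitclocks   ' unsigned divide: 2 -> $4000_0000, 4 -> $2000_0000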
I am attaching the first release, before I destroy it again while fiddling with moving the read streamer start depending on bitclocks>>1.

Anyways, here it is,
Mike
So wrpin ##$1_0000, pinNumber sets the pin to registered, independent of it being just digital or a smartpin?
Independent of smartpin, yep. Not so independent of digital mode. Some pin modes don't have the C bit but are still registered anyway.
do I need to set dirl before and dirh after?
No, that's used when setting up a smartpin. Smartpin config is all in the low 6 bits.
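So, per the answers above, registering a plain digital pin is a one-liner either way (sig_pin is a placeholder):

wrpin   ##$1_0000, #sig_pin     ' set the registered (clocked) flag
wrpin   #0, #sig_pin            ' back to plain unregistered digital

No dirl/dirh dance needed around these - per the answer above, that's only for enabling smartpin modes.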
what the hell is DJNF?
Experience. I was like that reading Oz's code sometimes too.
DJNF turns false on rollover whereas DJNZ turns false on zero. Because they are pre-decrement instructions, Chip explicitly added DJNF as a new instruction just for zero case looping.
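A small illustration of the difference: a count preloaded with N gives N+1 passes with DJNF, and a preloaded 0 still executes the body exactly once, where DJNZ would wrap around and loop 2^32 times.

        mov     pa, #7
blink   drvnot  #56             ' toggle P56 (a P2 Eval board LED)
        djnf    pa, #blink      ' exits on rollover: body ran 8 times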
Why, @evanh, do I always have more questions after reading your posts?
I will try.
Nine mile skid on a ten mile ride
Hot as a pistol but cool inside
Cat on a tin roof, dogs in a pile
Nothing left to do but smile, smile, smile
Mike

It was just a try, to see if I could get to 1/1. But 1/2 is quite fine. Gosh, I wish I had two P2s to test it over wire with two clock sources. But as is, this streamer rocks.
Enjoy!
Mike

Even though registered I/O adds two more sysclocks of lag, I think it is advisable to use it. It definitely has a big impact on wiping out interference from jitter, so it should be more reliable when pushing for best speed.
Here's the registered version of the same sweep as the graph at the top of this page of comments:
The step change above 150 MHz means it is possible to adjust the compensation to suit. But the big improvement is that it is a clean step. There is no width to it.