I still do not understand your 21 comp when registered. That seems to be wrong. I currently still run 3 comp with registered pins at 1/2, but can't get 1/1 to run at all.
You should be able to use compensation of 4 when on 1/2 sysclock.
Your code is arranged as receiver waiting for the go signal - or you could say transmitter gives the go. Compensation 21 is what is needed when the transmitter is doing the waiting.
Compensation is added at the go task:
- If the go task is the transmitter (rx_waits) then the value is 3.
- If the go task is the receiver (tx_waits) then the value is 21.
When tx gives the go, the go and the data both can travel together in parallel. As long as the rx task sees the go signal a couple of clocks before the data arrives then it has the time to activate the streamer as the data arrives.
When rx gives the go, the go signal has to first get all the way back to tx before tx can start sending. And then the data has to make the same trip back to rx before rx's streamer can capture it.
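The two cases above can be put into a tiny numeric model. This is only a sketch in Python, not P2 code: the function name and the constant names are made up for illustration, and the only facts taken from the thread are the two compensation values, 3 and 21.

```python
# Compensation values quoted in the thread, in sysclock ticks.
COMP_RX_WAITS = 3    # transmitter gives the go, receiver waits
COMP_TX_WAITS = 21   # receiver gives the go, transmitter waits


def compensation(go_task: str) -> int:
    """Return the lag compensation applied at the task that issues the go."""
    if go_task == "tx":      # rx_waits arrangement
        return COMP_RX_WAITS
    if go_task == "rx":      # tx_waits arrangement
        return COMP_TX_WAITS
    raise ValueError("go_task must be 'tx' or 'rx'")


# Extra ticks paid when the go first has to travel back to tx
# before the data can start its trip to rx.
ROUND_TRIP_EXTRA = compensation("rx") - compensation("tx")
print(ROUND_TRIP_EXTRA)
```

The difference between the two values is the extra control-line trip that the tx_waits arrangement has to absorb.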
Yes, I could use 3 and 4 before registering the data input lines; after registering, it only works with 3. Not sure why.
I intend to turn around my workflow and let the transmitter wait, so I will find out.
OK, I will try two times 500 to be sure. Something is going completely wrong right now and I need to figure out what. At least it makes me comment each line with what it is supposed to do. Still something amiss.
So with two times waitx #500 I should be out of that one. I seem to have it running again, now with the receiver waiting. It is still hiccuping, though. Sometimes I need a reset to get it working again.
I have not found the real issue with my code. Somewhere it does not do what I expect it to do. And it is sort of shameful not to find that in barely 272 lines of code.
It's a timing-exact arrangement between two cogs, where some factors are still to be discovered. Not surprising it's a tricky one.
I've just had a small hiccup myself: the direction of the compensation, not just the amount, needs to be reversed when changing between rx and tx waiting. Obvious in hindsight, but it wasn't my first assumption.
My basic idea on changing to tx wait is maybe flawed, I am now considering two control lines.
The original thinking was:
1. The transmitter waits for a pull request from the next P2, sends the buffer without using a lock, notifies its own receiver with COGATN that the buffer has been sent, then waits for the next pull request.
2. The receiver waits for COGATN, locks the buffer, receives the new buffer from the previous P2, unlocks the buffer, then waits for the next COGATN.
3. The other local COGs now have a slight chance to lock the buffer, write something, and be sure that what is written gets sent out before the receiver overwrites the buffer again.
This way I would not need a dedicated output buffer, just locks.
But the thinking is flawed as the sending COG can't wait for a lock, since it can't stall the transmission.
So it might be sending an incomplete buffer. To enable a lock for transmission I would need something like DTR/RTS.
But even that would not help, since there are 'number of nodes' versions of the ring buffer running around the ring.
Assume P2A sends to P2B and P2B sends to P2A (my current simulated setup).
P2A changes long X in the buffer, sends the buffer, and receives the previous buffer from P2B (with possible changes from P2B). But its own changes to long X are not in that buffer; they are in the other one. With more P2s in the loop this gets even worse.
I see no way around this but to go back to using an output buffer for each node, with a dedicated buffer offset in the ring buffer, and having the ring buffer read-only for the nodes.
This way I do still have different versions of the buffer running around, but each node overwrites its own part of every buffer coming around.
So back to square one. And no need to turn around rx and tx waiting.
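The two-node case described above can be simulated in a few lines. This is a rough Python model, not P2 code: the two-long buffers, the values 111 and 222, and all function names are invented for illustration. It only tries to show why a direct write "disappears" after one exchange, and why per-node slots converge.

```python
def exchange(buf_a, buf_b):
    """One ring step with two nodes: each sends its buffer and receives the other's."""
    return list(buf_b), list(buf_a)


def direct_write_demo():
    """P2A writes long X straight into the ring buffer, then one exchange happens."""
    a, b = [0, 0], [0, 0]
    a[0] = 111                      # P2A changes long X (index 0)
    a, b = exchange(a, b)
    return a[0], b[0]               # what each node now sees at long X


def stamp(buf, slot, value):
    """Return a copy of buf with the node's own slot overwritten."""
    out = list(buf)
    out[slot] = value
    return out


def slot_demo(steps):
    """Each node overwrites only its own slot in every buffer before sending it on."""
    a, b = [0, 0], [0, 0]
    out_a, out_b = 111, 222         # per-node output buffers (one long each here)
    for _ in range(steps):
        sent_a = stamp(a, 0, out_a)  # P2A owns slot 0
        sent_b = stamp(b, 1, out_b)  # P2B owns slot 1
        a, b = exchange(sent_a, sent_b)
    return a, b


print(direct_write_demo())
print(slot_demo(2))
```

In the direct-write case P2A's change ends up only in the copy that P2B now holds. With per-node slots, both circulating versions carry both nodes' data after two steps in this two-node model.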
I need to think about this more - will go to the local watering hole now, get a drink and ponder this for a while...
I've sorted out using a single control line for both directions (request and go) of the tx waiting arrangement. It's a good example of when knowing the total lags involved comes in handy for knowing when to explore further out in the timings. The critical addition is the WAITX + POLLSE1.
Because the tx task still initiates the transaction and is now also the one that waits on the go trigger, it has to deal with two control pulses on the single line per transaction. The crux is that the tx task, when entering the waiting stage, has only just raised the control line to notify the rx task that it would like to make a transaction. This very action pings back to the tx cog's event handler several clocks later. So this early tx event needs to be cleared before arriving at the WAITSE1 opcode. Solution follows:
txinit_wait
                wrpin   ##$1_0000, #hspin       ' enable pin-registers (in and out)
                drvl    #hspin                  ' make sure handshake pin is driven low
                setse1  #$40+hspin              ' arm handshake rise
                mov     dira, ##$ffff_ffff      ' enable data outputs
                mov     pa, #31                 ' loop index, pins 31 down to 0
.ploop          wrpin   ##$1_0000, pa           ' enable pin-registers (in and out)
                djnf    pa, #.ploop             ' loop until pa wraps past zero
                rdfast  #(cycles/16), ##@sdata  ' setup sender FIFO
                setxfrq ##xcfg                  ' set streamer data rate
        _ret_   coginit #$10, ##@receiver       ' start receiver task in any spare cog
'------------------------------------------------------------------------------
txsend_wait
'****** tails to zero ******
                outh    #hspin                  ' notify the receiver
                outl    #hspin                  ' end of the request pulse
                waitx   #8                      ' reach over the OUT to IN to Event of ~12 clocks
                pollse1                         ' important to clear the above event
                waitse1                         ' wait here for receiver's go trigger
                nop                             ' lag compensation matching spacer
                xinit   txcfg, #0               ' go!
                getct   tickstart               ' diag
                waitxfi                         ' diag - wait for completion of DMA
        _ret_   getct   tickend                 ' diag
The WAITX #8 inserts a spacing of 14 clocks total: Two for the OUTL, ten for the WAITX, and a final two for the POLLSE1.
I know the event can be cleared with a 12 clock space but the mechanism then drops out in the high sysclock band. So I added two clocks clearance for good measure.
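The spacing arithmetic can be checked with a small Python helper. This is a sketch, not P2 code; the function and constant names are made up. The per-instruction costs are the ones stated above (two clocks for OUTL and POLLSE1, and 2 + n clocks for WAITX #n).

```python
OUTL_CLOCKS = 2       # OUTL that ends the request pulse
POLLSE1_CLOCKS = 2    # POLLSE1 that clears the early event


def waitx_clocks(n: int) -> int:
    """WAITX #n occupies 2 + n sysclock ticks."""
    return 2 + n


def spacing(n: int) -> int:
    """Total clocks from the OUTL through the end of the POLLSE1."""
    return OUTL_CLOCKS + waitx_clocks(n) + POLLSE1_CLOCKS


print(spacing(8))   # the value used in the code above
print(spacing(6))   # the minimum that clears the event at lower sysclock
```

So WAITX #6 would give the bare 12 clock space, and WAITX #8 adds the two clocks of clearance.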
pins 55 and 56 as control lines for each simulated P2
pins 0-15 for P2A to P2B 16-31 for P2B to P2A.
Mode 0-4 (1-16 bit) can be tested, mode 5 (32 bit) is not testable with one P2 and would not make much sense anyways since each P2 would need 64 pins for transfer...
Currently it works down to 2 sysclocks per bit, but still fails at sysclock/1.
TX now waits for a request, and the needed delay is 19 plus half the clocks per bit, to sample in the middle of the bit.
I had to revert to using a dedicated output buffer, which I still dislike. Using locks and writing directly into the ring buffer requires some DTR/RTS mechanism, and I would like to stay with one control line per transfer, not two.
But I have a plan here. Not completely formulated and not tested at all, but could work.
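The delay rule mentioned above ("19 + half the bitclocks") looks like this as a Python sketch. The function name is made up, and rounding the half down with integer division is an assumption; the thread does not say how odd clock counts are handled.

```python
BASE_DELAY = 19   # base delay with TX waiting, as quoted above


def sample_delay(clocks_per_bit: int) -> int:
    """Delay before the receiver starts sampling, aiming at the middle of a bit.

    Assumption: 'half the bitclocks' is rounded down for odd values.
    """
    if clocks_per_bit < 1:
        raise ValueError("clocks_per_bit must be at least 1")
    return BASE_DELAY + clocks_per_bit // 2


for cpb in (2, 3, 8):
    print(cpb, sample_delay(cpb))
```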
At sysclk/1 the persistence of the output bits after the sysclk leading edge will fail to meet the hold time needed at the inputs.
If there are not too many noise sources wandering around, this can be tested by outputting data values in twin pairs. At least the second twin should get a sane capture at the receiver port; expect some garbage in the interim, due to switching noise, wherever the corresponding bits change polarity from one pair to the next.
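A host-side sketch of the twin-pair test idea, in Python. The helper names and the sample values are invented; it only shows how a test pattern could be built and how a capture could be judged on the second twin of each pair.

```python
def twin_pairs(values):
    """Build the test pattern: every value is sent twice in a row."""
    out = []
    for v in values:
        out.extend((v, v))
    return out


def second_twins(captured):
    """Keep only the second sample of each pair."""
    return captured[1::2]


def capture_ok(values, captured):
    """True if the second twin of every pair was captured correctly."""
    return second_twins(captured) == list(values)


pattern = twin_pairs([0x55, 0xAA])
# Hypothetical capture: first twins corrupted by switching noise, second twins clean.
captured = [0x5F, 0x55, 0xFA, 0xAA]
print(pattern, capture_ok([0x55, 0xAA], captured))
```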
It's a fundamental limit imposed by the fact you are using a single P2. If you pull the trigger with your right index, you'll not be able to hit that same finger with the bullet.
When you are using two P2s, you'll need to allow at least half a sysclk period (2 ns at 250 MHz, i.e. 180º) of phase difference between their individual sysclks, with the sender ahead of the receiver, of course.
Even there, you'll need to satisfy tsetup and thold of the receiving pin; not an easy task... but doable.
Short distance between peers, excellent layout, controlled impedance will be prime factors of success.
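For reference, the phase figure quoted above is easy to recompute for other clock rates. A small Python sketch, with made-up function names:

```python
def period_ns(sysclk_hz: float) -> float:
    """Length of one sysclk period in nanoseconds."""
    return 1e9 / sysclk_hz


def phase_to_ns(degrees: float, sysclk_hz: float) -> float:
    """Convert a phase offset in degrees into nanoseconds at the given sysclk."""
    return degrees / 360.0 * period_ns(sysclk_hz)


print(phase_to_ns(180, 250e6))   # half a period at 250 MHz
print(phase_to_ns(180, 180e6))   # same phase at the 180 MHz used in this thread
```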
Currently TX has the control line as an input, driven low, and waits for a toggle. RX has the control line as an output and toggles it to request data.
RX can wait for a lock before toggling, but TX can't wait for a lock; it has to send instantly.
I do not really want two control lines, so I am thinking about how to avoid the need but still be able to stall the transmission on both ends.
My thinking is that both P2s have the control line as an input. The TX COG drives the pin low and tries to get a lock; if successful, it drives the control pin high while still using it as an input, thus signaling RTS to the receiver.
The receiver also has the control pin as an input, waiting for it to go high, driven only by the TX COG. Now the receiver can get its own lock, switch to output and do an OUTL.
The TX COG now gets its start signal and the transmission starts. After completion both release their locks and switch to input; TX drives low again and waits for some time, stalling the transmission so the other 6 COGs can get the lock and change the original ring buffer (not an output buffer) if they need to. The transmission stays stalled for as long as the TX COG can't get a lock.
This way I should be able to use the control line bidirectionally and stall the transmission on both sides to use locks, and still have no floating pins, since TX, as input, drives the control line either high or low (RTS). RX is input only and floats without driving the control line while waiting for RTS (input high) from TX; it then switches to output, does an OUTL, and switches back to input again.
This way both sides of my control line could use locks to access their respective buffers, and no separate output buffer would be needed, just the shared HUB area. As long as you access it with the lock, you can write anywhere in the buffer.
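The proposed sequence can be written down as a step-by-step trace. This is only a Python sketch of the idea as described above, with invented names and trace strings; it is untested against hardware, just like the plan itself.

```python
def transaction(tx_lock_free: bool, rx_lock_free: bool):
    """Trace one attempted transfer over the single bidirectional control line."""
    trace = ["tx: drive control line low"]
    if not tx_lock_free:
        trace.append("tx: lock busy, transfer stalled")
        return trace
    trace.append("tx: lock taken, drive control line high (RTS)")
    if not rx_lock_free:
        trace.append("rx: lock busy, transfer stalled")
        return trace
    trace.append("rx: lock taken, switch to output, OUTL (go)")
    trace.append("tx: stream buffer")
    trace.append("both: release locks, control line back to input")
    trace.append("tx: drive control line low")
    return trace


for step in transaction(True, True):
    print(step)
```

Either side failing to get its lock stops the sequence before the streamer starts, which is the stalling behaviour the plan is after.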
Just one, since I use registered pins for I/O. Before that I had two values at /2.
How do the /2 and /1 windows vary with MHz?
I can see that tsu and th impose physical apertures, but at lower clock speeds you should be able to meet those, even if that needs a finite external delay.
At higher clock speeds, you see the notch effects in those evanh plots, where the aperture sweeps across the clock sampling point.
Working above a notch, or close to a notch, is a somewhat risky place to be, and you are not certain how that moves with PVT.
good points.
I do not expect to hit /1 at all. We also have a phase difference between COGs on one P2; a simple waitx #3 can do that, odd and even COGs, so to speak.
But having /2 stable on one P2 should help to get /3 running with two P2s over a short distance, I hope.
It is easier to think in clocks per bit: /3 is three clocks per bit, so sampling in the center could work.
And I do adjust the start of the sample by half of the clocks per bit, to make sure the sender is ahead of the receiver.
And the streamer does all the magic of using the same code for 1-2-4-8... data lines. Hub to hub transfer.
That is something I need to figure out: I would like to measure the frequency of my control line toggle as it is now, so how do I get a smart pin to measure the frequency of the pin next to it?
I just run at 180 MHz, haven't tested faster.
What I basically do is use the only working compensation at /2 and add half of my clocks per bit, to move the sampling away from the edge to the center of the bit.
The funny thing is that when you use really many sysclocks per bit, you can see each bit on the LED board.
So if 3 clocks per bit fails, one can use 4, 5, 6, 7, 70000, whatever.
I think the reason I no longer have two values for /2 is that I use just one P2. The second value disappears as a possible candidate as soon as I take the pins already set as registered in one COG and set them as registered again in a second COG. These should be different pins on different P2s, but I have just the one.
But it runs perfectly stable right now.
So it is time for me to break it again by implementing my lock strategy, using the control line bidirectionally.
Comments
That is a round trip difference of 18 between the two cases (21 versus 3).
Mike
Need to do some testing... what's the Spin2 replacement for something like waitcnt(3_000_000 + cnt)? Found waitx_() at https://github.com/totalspectrum/spin2cpp/blob/master/doc/spin.md#new-intrinsics
I've done a basic workaround to fix up your compensation issue: and
Have to check this tonight, daytime is reserved for paid work...
Mike
Will find out.
Mike
confused,
Mike
Will get there this weekend, hopefully.
no joy right now,
Mike
The COGINIT instruction itself also takes some clocks ahead of that in the launcher cog.
working on it...
Mike
I have something running at 1/8 again with tx waiting. Need to dial in tonight for faster speed.
That pollse1 is interesting, I had not thought about that. I am still unsure about the needed time but 12 seems reasonable.
will test soon, off work in 1 hour...
Mike
It is so stupid, I wasted days searching for it. In the end it was a REP that was one instruction too short.
And if I do not enable the output lines, they do not send anything. Yes. Now I am making progress again.
Enjoy!
Mike
When it does fail at /1, what yields does it have?
Does the /2 transfer have 2 phase settings it can work at, or just 1?
The bidirectional control line scheme is all theoretical so far.
Mike
This streamer is quite cool.
Mike
let the fun begin!
Mike