Ringbuffer (was Streamer Questions - how to sync) - Page 5 — Parallax Forums

Ringbuffer (was Streamer Questions - how to sync)


Comments

  • Sampling happens when the streamer's NCO rolls over. Because granularity is limited to a full sysclk period, you should start your receiving streamer one sysclk ahead of the sending one; calculating the exact matching point would involve taking into account all the delays in both the sender's and the receiver's pathways.

    At the end of the day, if you could focus both eyes independently, one inside each streamer's data flip-flop, you'd see the receiver beginning its "acceptance" period one sysclk cycle earlier than the sender puts data on its output flip-flop. :crazy:

    It's always about tsetup and thold; the better you are able to satisfy their demands, the better you can grab a sane data block.
  • Yanomani Posts: 1,524
    edited 2019-06-28 04:10
    I made a mistake in my earlier post. When I stated:

    "When using two P2s, you'll need to allow at least a half sysclk period (2 ns at 250 MHz, 180°) of phase difference between their individual sysclks, sure, with the sender ahead of the receiver."

    I should have stated: sure, with the receiver ahead of the sender.

    Sorry :blush:
  • jmg Posts: 15,140
    msrobots wrote: »
    I just run at 180 MHz, haven't tested faster.
    It should work all the way up to sysclk/1, but there it may need a manual delay added externally between the pins.
    D-FFs usually have a tco delay that is enough to meet the tsu/th of a like FF, which is why shift registers can work.
    One would hope the pin registers in the P2 get close to that.

    I'd suggest testing at maybe 50MHz & then checking again at 100MHz, 180MHz.

    A clock generator like the Si5351A has a 333 ps/step phase adjustment per CLKOUT, up to +/- ~16 ns, so with that you could have a master clock for multiple P2s and adjust their sampling points via that phase adjustment.
    The P2 PLL would add some more jitter, but higher PFD values (low XI divides, higher XI MHz) help reduce that.

    msrobots wrote: »
    What I basically do is use the only working compensation at /2 and add half of my clocks per bit, to move sampling away from the edge to the center of the bit's width in clocks.
    ...
    So if 3 clocks per bit fails, one can use 4,5,6,7,70000, whatever
    Yes, if you have many clocks, you can move the sampling point in whole clock increments.

  • evanh Posts: 15,126
    Yanomani wrote: »
    At sysclk/1 the persistence of the output bits after sysclk leading edge will fail to meet the needed hold time at the input ones.
    In practice, with registered I/O, sysclk/1 is measured to work just fine up to 260 MHz at around 70°C. Goes higher at lower temperatures. It primarily works because the clock edge always leads the logic transition, so there is a whole clock period for the output/logic to make its transition before it is needed at the input of the following flop. Standard synchronous global clock design principle.

    PS: Apologies for the lack of posts. I've caught meself a flu bug so it's a struggle just to read each post at the moment.

  • Yanomani wrote: »
    Sampling happens when the streamer's NCO rolls over. Because granularity is limited to a full sysclk period, you should start your receiving streamer one sysclk ahead of the sending one; calculating the exact matching point would involve taking into account all the delays in both the sender's and the receiver's pathways.

    At the end of the day, if you could focus both eyes independently, one inside each streamer's data flip-flop, you'd see the receiver beginning its "acceptance" period one sysclk cycle earlier than the sender puts data on its output flip-flop. :crazy:

    It's always about tsetup and thold; the better you are able to satisfy their demands, the better you can grab a sane data block.

    Well, it is kind of an iterative process: start with, say, sysclk/20 and find the sweet spot of synchronization between the minimum and maximum compensation.

    Then you find the minimum/maximum delay of the pathways and instructions on both ends; either it shows the required result or not.

    Now reduce the clocks per bit: you have the needed delay for your current read/write routines and can find the shortest working delay.
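The iterative search described here can be sketched in Python. This is only an illustration: `passes` is a hypothetical stand-in for running a real transfer at a given lag compensation and checking the copied data, and the 21..38 window in `fake_link` is made up for the example.

```python
def working_window(passes, candidates):
    """Return (low, high) of the first contiguous run of passing values, or None."""
    low = high = None
    for comp in candidates:
        if passes(comp):
            if low is None:
                low = comp
            high = comp
        elif low is not None:
            break  # the first passing run has ended
    return None if low is None else (low, high)


def sweet_spot(window):
    """Pick the middle of the passing window, furthest from both failing edges."""
    low, high = window
    return (low + high) // 2


def fake_link(comp):
    # Hypothetical link model: transfers succeed only for compensations 21..38.
    return 21 <= comp <= 38


window = working_window(fake_link, range(0, 64))
print(window, sweet_spot(window))
```

Repeating the sweep at fewer clocks per bit would show the window shrinking, which is one way to find the shortest setting that still leaves some margin.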

    I do not aim for /1; even /5 would be great. We are talking about HUB-to-HUB transfer between different P2s.

    Take even 6 clocks per bit: at 180 MHz that gives one bit every 6 clocks on one data line, so about 30 Mbit/s; with 2 lines 60 Mbit/s, with 4 lines 120 Mbit/s, with 8 lines 240 Mbit/s.

    I hope to get 3 clocks per bit running over short wires; that would double the numbers.
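As a quick check of these figures, here is a small Python sketch. The function name is illustrative, and it computes the raw streaming rate only, ignoring any gaps between bursts.

```python
def link_rate_mbit(sysclk_hz, clocks_per_bit, data_lines):
    """Raw bit rate in Mbit/s: one bit per data line every `clocks_per_bit` sysclks."""
    return sysclk_hz / clocks_per_bit * data_lines / 1e6


for lines in (1, 2, 4, 8):
    print(f"{lines} line(s): "
          f"{link_rate_mbit(180e6, 6, lines):.0f} Mbit/s at 6 clocks/bit, "
          f"{link_rate_mbit(180e6, 3, lines):.0f} Mbit/s at 3 clocks/bit")
```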

    This is quite fast for seamlessly shared memory from one HUB to the other.

    well at least interesting to me,

    Mike
  • @jmg,

    Yes, it should work at /1, but it does not. My current guess is that my setup, simulating two P2s on one P2, messes things up.

    With 2 P2s the ring should look like this:

    P2A has Pins 0-15 as output, 16-31 as input, 2 COGS running, each COG owns its pins completely.
    P2B has Pins 0-15 as input, 16-31 as output, 2 COGS running, each COG owns its pins completely.
    Pins 0-15 of P2A are connected to pins 0-15 of P2B, same for 16-31.

    Or something like that; with more than two P2s, just daisy-chained as an endless loop.

    Currently I just have one P2, so I use the same pins; that might be the problem.

    No clue otherwise.

    Mike
  • evanh Posts: 15,126
    evanh wrote: »
    In practice, with registered I/O, sysclk/1 is measured to work just fine up to 260 MHz at around 70°C. Goes higher at lower temperatures.
    250 MHz falls to 95°C.
  • evanh Posts: 15,126
    edited 2019-06-29 08:34
    Those temperature limited clock rates are limitations of the pin output drivers vs loading of the EVAL board, ie: slew rate. The sysclock limit of the prop2 internals is maybe another 80-100 MHz higher before they too go down in a crash.

    I think the slew rate is limiting the subsequent setup time, rather than any hold time issues. When operating with sysclock/2 per transfer there is room to push to higher sysclock rates by delaying the relevant rx sample point another clock - lag compensation increased from 21 to 22. There is an oddity here, in that the extra leeway gained is small, which looks like a hold-time issue.

    Here's a graph using some semi-regular data
    sdata byte "1"[16], "2"[16], "3"[16], "4"[16], "5"[16], "6"[16], "7"[16], "8"[16]
    byte "9"[16], "A"[16], "B"[16], "C"[16], "D"[16], "E"[16], "F"[16], 13,10,0,25,33,47,99,107,98,244,86,212,2,7,19,21
    byte "G"[16], "H"[16], "I"[16], "J"[16], "K"[16], "L"[16], "M"[16], "N"[16]
    byte "O"[16], "P"[16], "Q"[16], "R"[16], "S"[16], "T"[16], "U"[16], 13,10,0,25,33,47,99,107,98,244,86,212,2,7,19,21
    byte "V"[16], "W"[16], "X"[16], "Y"[16], "Z"[16], "a"[16], "b"[16], "c"[16]
    byte "d"[16], "e"[16], "f"[16], "g"[16], "h"[16], "i"[16], "j"[16], 13,10,0,25,33,47,99,107,98,244,86,212,2,7,19,21
    byte "k"[16], "l"[16], "m"[16], "n"[16], "o"[16], "p"[16], "q"[16], "r"[16]
    byte "s"[16], "t"[16], "u"[16], "v"[16], "w"[16], "x"[16], "y"[16]
    [Graph: tx_wait temps registered.png]

    EDIT: Rewrote the second paragraph for clarity
  • evanh Posts: 15,126
    edited 2019-06-29 04:32
    In case anyone is wondering how I'm measuring the temperatures: I've simply soldered on a type-K thermocouple pair and am visually reading the temperature from a meter at the end of the test. The more recent tests start at 10 MHz then increment, in 1 MHz steps, spitting out compare counts on the comport, until the Prop2 crashes.

    Note the soldering is at the edge of the grid of vias. I started to scrape the solder mask off the very centre then decided it was wiser to not do any damage to that area.
  • Hi evanh

    Since I'd also recently passed hard times with my last flu episode, I was withholding any new comments, to spare you from answering them...

    But your post is kind of an irresistible scent of fresh coffee, I simply can't avoid commenting...

    Since Chip has advanced a large step in testing the upcoming silicon revision, I was wondering how we could improve this kind of crash test, so as to distinguish failures that might be related to excessive jitter or other instabilities in the PFD/PLL/VCO from other, still undetected, causes that could produce the same observable behaviour.

    Trying to explain it better; let's see if I can...

    After finding the most stable spots for the input clock divider and internal multiplier values for each range of sysclk frequencies (both with crystal and lab frequency generators as inputs), would it not be a good test to somehow raise the XI input frequency, to verify whether, at such stable setups, the limits reached are the same as those observed when fixing the input clock and varying the control values?

    I hope I've written that as understandably as I intended...
  • evanh Posts: 15,126
    edited 2019-06-29 05:45
    Yanomani wrote: »
    But your post is kind of an irresistible scent of fresh coffee, I simply can't avoid commenting...
    :) Nice intro.
    Got my first proper night sleep last night, so I'm definitely on the mend.

    As for clock jitter concerns, I'm leaving that up to others really. It's not something I have experience with. Professionally, I mostly support industrial electricians. And step on their toes occasionally.

    Besides, for testing of this streamer copying, there is only one clock source here. Which makes jitter of minimal concern for the moment.

    Certainly, Mike's problems won't be from clock jitter.
  • evanh Posts: 15,126
    edited 2019-06-29 06:22
    Here's a graph showing the prior semi-regular data (Dark Red) and a new run with some pre-recorded random data (Orange). Only the source data is changed, no change to the program at all.
    [Graph: semiregular vs random.png]

    EDIT: Added a second graph with the rx waiting equivalents included. At this stage I can't explain why, in the unstable high frequencies, rx waiting is behaving differently to tx waiting.
    [Graph: semiregular vs random.png]
  • evanh Posts: 15,126
    edited 2019-06-29 11:24
    I've struck something really weird with the streamers. When dropping down to sysclock/2 or slower, the rx_wait arrangement is producing a strange match count of 255, instead of 256, for non-three compensations. Looking at the copied data itself, the first word is always a duplicate of the second.

    The weirdest part is this doesn't happen at sysclock/1.


    EDIT: Err, it's happening with tx_wait too. I've obviously messed something up ...

    EDIT2: Found it. I'd recently changed to auto-generating the SETXFRQ constant. What I hadn't counted on was it being a signed operation. O_o
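A hedged illustration of how a signed operation can corrupt such a constant. The thread does not show the expression that was used; this Python sketch assumes the constant was derived as $8000_0000 divided by the sysclk divider (taking $8000_0000 as the sysclk/1 NCO value) and that the assembler evaluated it in 32-bit signed arithmetic. Under those assumptions the divide-by-1 case still comes out right, while every other divider does not.

```python
def to_signed32(value):
    """Interpret the low 32 bits of `value` as a two's-complement signed integer."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value


def div_signed32(numerator, divider):
    """32-bit signed division truncating toward zero, returned as an unsigned bit pattern.

    `divider` is assumed to be a positive integer.
    """
    n = to_signed32(numerator)
    quotient = abs(n) // divider
    if n < 0:
        quotient = -quotient
    return quotient & 0xFFFFFFFF


SYSCLK_DIV1 = 0x8000_0000  # assumed NCO value for one transfer per sysclk

for divider in (1, 2, 3, 4):
    unsigned = SYSCLK_DIV1 // divider
    signed = div_signed32(SYSCLK_DIV1, divider)
    print(f"/{divider}: unsigned ${unsigned:08X}  signed ${signed:08X}")
```

If that is what happened, it would be consistent with the symptom showing up at sysclock/2 and slower but not at sysclock/1.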

  • evanh Posts: 15,126
    edited 2019-06-29 13:52
    Good, everything is fitting again.

    Here's the latest data, showing how there are two components to the erratic band of the graph for sysclk/1. See how if you add the purple and red lines together you'd get something similar to the green line.
    [Graph: showing the two components of the erratic behaviour.png]

    EDIT: And the tx_wait arrangement is still different. The crossing points are at the same frequencies and the initial fail-off is the same but I don't know why there is any difference at all.
    [Graph: tx_wait components.png]
  • I finally figured out how to handle my ring buffer without the need of a dedicated second output buffer.

    As usual the solution is clear, once found: locks, and just one buffer circulating. Each COG on each P2 can lock the buffer, change anything in it, and release the buffer. While doing so the whole ringnet is stopped; it resumes after the lock is released.

    This way I can lock across P2's in the net and have my buffer consistent. And yes it is just one buffer transferred from P2 to the next P2. Took some time to get that figured out.

    The wonderful thing is that I now just need one COG per P2, not two as before.

    I went back to waiting on the RX side; somehow the code looks nicer. It still fails at clock/1, but works with different numbers of data lines.

    Just one thing bothers me: my clock pin. Since I now just send one buffer, I should be able to use my data line 0, rather than another pin, to signal that I am about to start sending. I am sending in bursts anyway, so there is some time when the streamers do not use the pins.

    that would eliminate the need of the clock pins.

    But before I break it again I attach it here

    Enjoy!

    Mike
  • jmg Posts: 15,140
    evanh wrote: »
    Good, everything is fitting again.

    Here's the latest data, showing how there are two components to the erratic band of the graph for sysclk/1. See how if you add the purple and red lines together you'd get something similar to the green line.

    Those are good plots, and are consistent at the first failure slopes.
    Above that, the /3 seems strangely different from /4, given you would expect a sub-ns window for the actual Tsu/Th window.
    The slope of the first failure indicates ~110 ps of aperture width, which will also include routing delays. So that looks good.
    Above 300 MHz, maybe the clocks are reducing in amplitude so they are no longer full-swing, and then the FFs might start to misbehave.
    What a /4 design does do, over a /3, is allow more time with the same data to ripple through the wobbly flip-flops.
    The /4 plots each advance one CLK each time, which is going to be close to 3 ns, yet the indicated time-shifts in the falls come in much lower, which suggests more is failing above 300 MHz than just delay effects.

    1/290M-1/305M = 169.58ps
    1/290M-1/345M = 549.72ps
    1/290M-1/365M = 708.55ps
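These period differences can be reproduced with a few lines of Python (the helper name is illustrative). Rounded to two decimals, the results agree with the quoted figures to within 0.01 ps.

```python
def period_shift_ps(f_ref_mhz, f_mhz):
    """Difference between two clock periods, in picoseconds."""
    # 1 / MHz gives microseconds; multiply by 1e6 to get picoseconds.
    return (1.0 / f_ref_mhz - 1.0 / f_mhz) * 1e6


for f in (305, 345, 365):
    print(f"1/290M - 1/{f}M = {period_shift_ps(290, f):.2f} ps")
```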


    evanh wrote: »
    EDIT: And the tx_wait arrangement is still different. The crossing points are at the same frequencies and the initial fail-off is the same but I don't know why there is any difference at all.
    Certainly the /1 plots you would expect to be very similar - and they are up to ~ 305MHz.
    Are those in different directions? (So do they show different pins' Rx registers?)
  • evanh Posts: 15,126
    edited 2019-06-30 12:13
    The difference is not between /3 and /4, it's between tx_wait and rx_wait. You'll note the light blue line in rx_wait plot is flat line 256 all the way to 380 MHz. That's the expected sweet spot for sysclock/3.

    I don't know why I can't get that to happen with tx_wait.

    As for the differences between rx_wait and tx_wait ... there are very few. They are separate compiles of the same source with a few constants adjusted to run a slightly different path in a couple of places. Certainly the physical data paths are the same and same direction data and same cogs. The functional difference is just who waits for who on the synchronising go signal.

    EDIT: Lol, I changed how the selection between tx_wait and rx_wait was being made during the latest testing of tx_wait. To be honest, it hadn't been tested because I was manually commenting out the two CALL lines as needed beforehand. Here's the latest fully working source code:

    EDIT2: Altered display order of settings to align CYCLES with Match in the source code.
  • I don't know why I can't get that to happen with tx_wait.

    Yes, I struggled a lot with tx_wait, it is more complicated.

    My current guess is that with an odd waitx parameter the COGs move in and out of being on odd or even sysclk cycles relative to each other.

    And that might be why tx_wait is more picky. It might even be the order in which the COGs get started. For shorter code I went back to rx_wait.

    I have no fan running and no way to measure temperature, and I am very conservative with overclocking until I have more than one P2.

    Mike
  • evanh Posts: 15,126
    Just different really. They both start giving up at the same place.
  • evanh Posts: 15,126
    edited 2019-06-30 12:42
    msrobots wrote: »
    Since I now just send one buffer, I should be able to use my data line 0, rather than another pin, to signal that I am about to start sending. I am sending in bursts anyway, so there is some time when the streamers do not use the pins.
    Good idea ... yep, works as long as you also OUTL again before kicking in the tx streamer. This changes the compensation by two. And the receiver must not start its next wait until the rx streamer has finished and the extraneous go event is cleared.
  • OUTL again - good one - that might be why it did not work last night, will test later...

    Mike
  • evanh Posts: 15,126
    edited 2019-07-01 05:54
    Ah, just realised a low going edge event would be a smoother solution. Reverts the lag compensation change nicely, ie: In the rx cog, use SETSE1 #$80 instead of SETSE1 #$40

    PS: And tx lag compensation comes after the OUTL now.


    EDIT: Heh, it's just like the start bit on a comport. Except here the spacing details are defined by the architecture. I should get the scope out and have a look ...

  • msrobots Posts: 3,701
    edited 2019-07-02 02:22
    evanh wrote: »
    Ah, just realised a low going edge event would be a smoother solution. Reverts the lag compensation change nicely, ie: In the rx cog, use SETSE1 #$80 instead of SETSE1 #$40

    PS: And tx lag compensation comes after the OUTL now.


    EDIT: Heh, it's just like the start bit on a comport. Except here the spacing details are defined by the architecture. I should get the scope out and have a look ...

    Two hillbillies, one thought.

    Funny, I came to the same conclusions, but did not get it running last night. An event on the falling edge might be a tick earlier, but that makes no measurable difference. The main thing is to have the data line low and set as output, since the streamer ORs its output in.

    And YES it is some sort of a start bit, at least a receiver can check if a sender is connected because of staying high instead of driven low as input. (not doing that yet, just speculating)

    It was late last night, so I gave up, somehow I seem to have to clear the event before waiting on the falling edge, since it had falling edges while transmitting data. Or I have some other stupid bug.

    Can't wait to get off work. This would be sweet to get my ring buffer running without dedicated control lines. A minimum ringnet P2 node would need one COG and two Pins, just like serial...

    I enhanced my test program to run thru 1-16 bit data lines and 4 to 1 bitclocks. It looks like sometimes /1 is working, /2 and bigger are stable.

    Working on it soon again...

    Enjoy!

    Mike
  • msrobots Posts: 3,701
    edited 2019-07-02 12:43
    Double Success!

    I got the signaling over data pin 0 running, and could clean up the code. And I did not just get rid of the need for control lines; it is now also working at sysclk/1.

    Wonderful!

    Slowly this matures into a useful driver, I really need more P2's.

    Right now all of this uses the same pins on the same P2, I doubt that one can get /1 over multiple P2's, but the basic goal of a ring buffer works and can be used with 1-32 pins at any number of clocks per bit.

    The current test code needs pins 0-31 free and simulates 2 separate P2's, next try will be a simulated 3 node net using pins 0-47.

    I think the practical minimum ring will be a 3-line servo cable, gnd and two data lines, easy to plug onto the header. Two of them, one in, one out. One might put resistors in between, as protection against stupid handling mistakes.

    I need more P2's,

    Mike
  • evanh Posts: 15,126
    Good stuff. :)
  • msrobots Posts: 3,701
    edited 2019-07-04 08:56
    OK, next evaluation.

    I was able to replace the waitse1 with an interrupt on se1; my idea was to free the main COG. And it does work, sadly not exactly as I was planning.

    The current Ringnet driver does basically the following steps.

    (now done in INT3)

    1. Lock the buffer and wait for a new buffer from RX, signaled by the RX pin falling from high to low.
    2. Replace the current local buffer and unlock it for a while so other COGs can lock it and make their changes if they need to.
    3. Lock the local buffer again, transfer it to the next P2, and wait for it to come back by returning to step 1.
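The three steps can be modelled in Python to show the idea of a single circulating buffer. This is a simplified sketch, not the PASM2 driver: the `Node` class, its `pending` list of local edits, and the 8-byte buffer are all hypothetical, and the streamer transfer is replaced by a plain function call.

```python
from threading import Lock


class Node:
    """One hypothetical P2 in the ring, holding a local copy of the shared buffer."""

    def __init__(self, size=8):
        self.lock = Lock()
        self.buffer = bytearray(size)
        self.pending = []  # (index, value) edits queued by other local COGs

    def handle(self, incoming):
        # Steps 1-2: with the lock held, replace the local buffer with the new one.
        with self.lock:
            self.buffer[:] = incoming
        # Step 2: the lock is free here, so local writers may lock and change the buffer.
        for index, value in self.pending:
            with self.lock:
                self.buffer[index] = value
        self.pending.clear()
        # Step 3: lock again and hand a snapshot on to the next node.
        with self.lock:
            return bytes(self.buffer)


def circulate(nodes, laps):
    """Pass the single buffer around the ring `laps` times and return its final state."""
    frame = bytes(nodes[0].buffer)
    for _ in range(laps):
        for node in nodes:
            frame = node.handle(frame)
    return frame


a, b = Node(), Node()
a.pending.append((0, 1))
b.pending.append((1, 2))
print(circulate([a, b], laps=2).hex())
```

After two laps both nodes hold the same contents, which is the consistency property the locking scheme is aiming for.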

    The sad part is that I have to issue the "wrfast wrap, Ringnet_ptr" command needed for the receiving streamer at the end of the transmit and the end of my interrupt routine.

    So the streamer is ready to receive on falling edge of control signal on data line 0.

    Everything works fine, except the remaining main COG, doing nothing and being interrupted by the running ringnet, - hmm - can't read or write HUB, because it will mess up the pending wrfast needed for the prepared streamer on RX interrupt.

    So right now the main process of the COG can not access HUB without disturbing the running ringnet interrupt. But I have some plan there to get this running smooth.

    To this point it is working right now.

    Now the plan:

    At step 2 the interrupt has the receive streamer finished, the transmit streamer not yet started, and the lock for the buffer released, waiting a while to give the other COGs of this P2 a chance to lock and change the buffer.

    At this step it could read/write HUB parameters for the command/result of the otherwise idle main process of the COG. In the middle of the interrupt I can do that.

    The current ringnet driver needs just 122 longs in the COG; the rest is free and could be used as data storage, for diagnostic functions on the ringnet, or even for user-defined routines.

    Attached is a running version where the main process in the ringnet driver does nothing.

    EDIT

    I removed the ISR version, does not work stable...

    EDIT

    Mike
  • evanh Posts: 15,126
    You don't want to be waiting on a lock to come free inside an ISR. It's not good form as you are blocking all other activities.

    Chip will need to confirm but I'm pretty certain the streamer via the FIFO has priority over a cog program accessing hubRAM. An apparent clash won't mess anything up, but if the streamer is pulling 100% then the cog will be stalled until the streamer has finished its burst.
  • yes, I agree, and have reverted to the non ISR routine.

    It did work but was not stable.

    so ignore the isr version, not worth it.

    Mike
  • cgracey Posts: 14,133
    evanh wrote: »
    You don't want to be waiting on a lock to come free inside an ISR. It's not good form as you are blocking all other activities.

    Chip will need to confirm but I'm pretty certain the streamer via the FIFO has priority over a cog program accessing hubRAM. An apparent clash won't mess anything up, but if the streamer is pulling 100% then the cog will be stalled until the streamer has finished its burst.

    That is correct. The FIFO has priority over RDxxxx/WRxxxx instructions.
  • evanh Posts: 15,126
    Thanks Chip.