P2 vs modern process limits - Page 4 — Parallax Forums

P2 vs modern process limits

Comments

  • cgracey Posts: 14,155
    The streamer does what you command it to without any pauses.
  • jmg Posts: 15,173
    cgracey wrote: »
    The streamer does what you command it to without any pauses.
    Hmm, that could make writing reliable streamer code to/from any OS related host, tricky.

    On the FTDI parts that have FIFOs, and on High speed Serial modes, it seems handshakes are needed rarely, I suspect when the OS gets tangled elsewhere.
    That means things will appear to work, until in some rare combination, data is lost. The handshake lines cover those cases.

    If the Clock is via a separate PinCell, maybe that just needs a Clock-Enable type option, and the Streamer attaches to what is now a gated clock.
    (This ignores for now, the SysCLK/1 streaming which needs a CLK (mentioned above) + handshake too...)
  • evanh Posts: 15,924
    edited 2016-05-05 05:52
    cgracey wrote: »
    We need to have an event, for sure. Maybe two events are needed. Maybe four? What should they be? We are not actually limited to 16 events. We could add one bit and go up to 32.

    For the LUTs, a LUTRAM-end-write matching the existing LUTRAM-end-read, and same again for events coming from neighbouring LUT. So, that's three more event sources.

    I note all this detection circuitry will need to be moved over to the second port for the local LUT, but detecting the same on the neighbour's LUT will need detection on the neighbour's first port.
  • cgracey Posts: 14,155
    jmg wrote: »
    cgracey wrote: »
    The streamer does what you command it to without any pauses.
    Hmm, that could make writing reliable streamer code to/from any OS related host, tricky.

    On the FTDI parts that have FIFOs, and on High speed Serial modes, it seems handshakes are needed rarely, I suspect when the OS gets tangled elsewhere.
    That means things will appear to work, until in some rare combination, data is lost. The handshake lines cover those cases.

    If the Clock is via a separate PinCell, maybe that just needs a Clock-Enable type option, and the Streamer attaches to what is now a gated clock.
    (This ignores for now, the SysCLK/1 streaming which needs a CLK (mentioned above) + handshake too...)

    The streamer would have to be given some sensitivity to an input pin. Mainly, what the streamer is used for is interfacing to continuous real-time I/O. It could be in spurts now, but under software control, where you give it a command for so many cycles of operation and it does it. If the OS-based receiver could say "stop", but then receive up to 16 more samples before filling up, you could issue streamer commands that output groups of 16 samples, checking for the "stop" signal between commands. That would take a small-enough time that operation would be continuous until the "stop" signal.
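
    The group-of-16 idea above can be sketched in Python. This is only a model, not streamer code: Receiver, stop_asserted, accept, and stream_in_groups are made-up names, and the receiver is assumed to raise "stop" once it has room for 16 or fewer samples.

```python
GROUP_SIZE = 16  # samples the receiver can still take after saying "stop" (from the post)


class Receiver:
    """Hypothetical OS-side FIFO that asserts "stop" when headroom runs low."""

    def __init__(self, capacity, headroom=GROUP_SIZE):
        self.capacity = capacity
        self.headroom = headroom
        self.fifo = []

    def stop_asserted(self):
        # "stop" goes active once free space is down to the guaranteed headroom.
        return self.capacity - len(self.fifo) <= self.headroom

    def accept(self, group):
        if len(self.fifo) + len(group) > self.capacity:
            raise OverflowError("receiver FIFO overflowed")
        self.fifo.extend(group)


def stream_in_groups(samples, stop_asserted, emit, group_size=GROUP_SIZE):
    """Send samples in fixed-size groups, polling "stop" between groups.

    Returns the number of samples actually sent.
    """
    sent = 0
    while sent < len(samples):
        if stop_asserted():
            break  # checked only between groups, never mid-group
        group = samples[sent:sent + group_size]
        emit(group)
        sent += len(group)
    return sent


if __name__ == "__main__":
    rx = Receiver(capacity=40)
    count = stream_in_groups(list(range(100)), rx.stop_asserted, rx.accept)
    print(count, "samples sent before stop;", len(rx.fifo), "in FIFO")
```

    Because "stop" is only polled between groups, the group size must not exceed the headroom the receiver guarantees after asserting it.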
  • cgracey Posts: 14,155
    evanh wrote: »
    cgracey wrote: »
    We need to have an event, for sure. Maybe two events are needed. Maybe four? What should they be? We are not actually limited to 16 events. We could add one bit and go up to 32.

    For the LUTs, a LUTRAM-end-write matching the existing LUTRAM-end-read, and same again for events coming from neighbouring LUT. So, that's three more event sources.

    I note all this detection circuitry will need to be moved over to the second port for the local LUT, but detecting the same on the neighbour's LUT will need detection on the neighbour's first port.

    So, I figure we could use these events in the local cog:

    1) lower cog wrote my LUT
    2) lower cog read my LUT
    3) upper cog wrote its LUT
    4) upper cog read its LUT

    Maybe we could trigger these events at fixed LUT addresses, like $1FF, or maybe they need to be settable. Addresses could involve a mask, too. What do you guys think?
  • jmg Posts: 15,173
    cgracey wrote: »
    So, I figure we could use these events in the local cog:

    1) lower cog wrote my LUT
    2) lower cog read my LUT
    3) upper cog wrote its LUT
    4) upper cog read its LUT

    Maybe we could trigger these events at fixed LUT addresses, like $1FF, or maybe they need to be settable. Addresses could involve a mask, too. What do you guys think?

    A Mask would be flexible, if there is room, as you need sometimes to ignore a write, and sometimes to react to it.
    Otherwise, map events into (asymmetric) address regions.

    To clarify, here lower and upper are Even/Odd COGs, and the 3rd COG cannot connect to the 2nd COG at all, correct? (But the 3rd & 4th can connect.)
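
    For reference, the address-plus-mask trigger discussed above could behave like this Python sketch. It is an assumption about one possible behaviour, not the actual logic: lut_event_hit is a made-up name, and a 9-bit LUT address ($000..$1FF) is assumed.

```python
LUT_ADDR_BITS = 9                      # assumed: LUT addresses run $000..$1FF
ADDR_MAX = (1 << LUT_ADDR_BITS) - 1    # $1FF


def lut_event_hit(addr, match, mask=ADDR_MAX):
    """Return True when addr equals match on every bit that is set in mask.

    mask = $1FF compares all bits (a fixed address);
    clearing mask bits makes the event ignore those address bits.
    """
    return ((addr ^ match) & mask & ADDR_MAX) == 0


if __name__ == "__main__":
    print(lut_event_hit(0x1FF, 0x1FF))               # fixed address, exact match
    print(lut_event_hit(0x1FE, 0x1FF))               # fixed address, different location
    print(lut_event_hit(0x1FE, 0x1FF, mask=0x1FE))   # low bit ignored
```

    With a full mask this reduces to the fixed-address case; with a mask of zero every access would trigger, which is the "ignore some writes, react to others" flexibility in its most extreme form.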
  • jmg Posts: 15,173
    cgracey wrote: »
    The streamer would have to be given some sensitivity to an input pin.

    Yes.
    cgracey wrote: »
    Mainly, what the streamer is used for is interfacing to continuous real-time I/O.
    I can see it is good for pixel streaming, or camera capture.
    Not so good for real-world connections that need a 'hang-on-a-mo' signal.

    cgracey wrote: »
    It could be in spurts now, but under software control, where you give it a command for so many cycles of operation and it does it. If the OS-based receiver could say "stop", but then receive up to 16 more samples before filling up, you could issue streamer commands that output groups of 16 samples, checking for the "stop" signal between commands. That would take a small-enough time that operation would be continuous until the "stop" signal.

    That might work, but not all FIFO interfaces are well documented, and most seem to be designed to connect to FPGAs, which react immediately.
    To displace FPGAs, which a P2 is very close to doing, it needs to react the same way, and interface to the same HS UHS USB FIFOs.

    Otherwise you limit your PC connect speeds.
    I suppose you could ask FTDI for more information - some of their data is vague and incomplete, and restrictive.
  • Not that I have an application in mind, but I was just wondering how long it would take to "ripple/wave" a single long of data through all 16 of the cogs' LUTs (bypassing the hub entirely), wherein a first cog writes a long to the LUT of a second cog (its neighbor) and that second cog "picks that long up" and writes it to the LUT of a third cog (its neighbor) and so on (assuming no coordination signaling involved). Wonder if it would be N instruction cycles, or N*2 (seems unlikely) or N(N-1)/2. Obviously, it wouldn't be 1 instruction cycle total. Would it just be N cycles? That's mostly for understanding's sake, as if one really wanted to share a long across all cogs, I guess one would just read it from the hub. And if one really wanted N-time (sharing across cogs in one instruction cycle), then one would need to use a bunch of pins (as there's no internal bus/ring).
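
    A toy Python model of that ripple, just to count hops. It assumes each cog forwards exactly once to its next neighbor; ripple is a made-up name, one list slot stands in for each cog's LUT location, and no real instruction timing is modelled.

```python
def ripple(value, n_cogs=16):
    """Pass one long from cog 0 to cog n_cogs-1, neighbor to neighbor.

    Returns (luts, hops): the final slot contents and the number of
    neighbor-to-neighbor writes that were needed.
    """
    luts = [None] * n_cogs   # one slot per cog, standing in for a LUT location
    luts[0] = value
    hops = 0
    for cog in range(n_cogs - 1):
        picked_up = luts[cog]        # this cog reads the long from its own LUT
        luts[cog + 1] = picked_up    # and writes it into its neighbor's LUT
        hops += 1
    return luts, hops


if __name__ == "__main__":
    luts, hops = ripple(0xCAFE, 16)
    print("hops:", hops)
```

    Under that assumption the number of forwarding steps is N-1, so the total time grows linearly with N (times whatever one read-plus-write hop costs), not as N(N-1)/2.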
    JRetSapDoog Posts: 954
    edited 2016-05-05 07:18
    cgracey wrote: »
    The streamer can output pixel-type data to DACs/pins that it looks up from the LUT. The streamer can also use the egg-beater hub access, but it's not the egg-beater [itself]. Some talk on here has shown confusion between the egg-beater and the streamer. They are separate things.
    Sometimes saying what something is *not* can add a useful perspective for understanding.

    Congrats, Chip, on nearly wrapping up the design (even though there are some events to work out).
  • evanh Posts: 15,924
    jmg wrote: »
    cgracey wrote: »
    Mainly, what the streamer is used for is interfacing to continuous real-time I/O.
    I can see it is good for pixel streaming, or camera capture.
    Not so good for real-world connections that need a 'hang-on-a-mo' signal.
    Real-world meaning artificial databuses. ;)

    Okay, given USB is already catered for, probably a good starting point would be SDRAM. Getting the Streamer up to handling such transfers would be worth it I'd say. Making use of the Streamer to max out a burst SDRAM transfer I think would be a good feature test.
  • evanh Posts: 15,924
    edited 2016-05-05 07:35
    cgracey wrote: »
    So, I figure we could use these events in the local cog:

    1) lower cog wrote my LUT
    2) lower cog read my LUT
    3) upper cog wrote its LUT
    4) upper cog read its LUT

    Maybe we could trigger these events at fixed LUT addresses, like $1FF, or maybe they need to be settable. Addresses could involve a mask, too. What do you guys think?
    Just keep it simple at the fixed $1FF address is good.

    Another set of events that trigger on every address might be handy. Wouldn't want to use it for interrupts, but polling/waiting could possibly speed up arbitrary writes by not needing to write to $1FF or anywhere specific.
  • cgracey Posts: 14,155
    evanh wrote: »
    jmg wrote: »
    cgracey wrote: »
    Mainly, what the streamer is used for is interfacing to continuous real-time I/O.
    I can see it is good for pixel streaming, or camera capture.
    Not so good for real-world connections that need a 'hang-on-a-mo' signal.
    Real-world meaning artificial databuses. ;)

    Okay, given USB is already catered for, probably a good starting point would be SDRAM. Getting the Streamer up to handling such transfers would be worth it I'd say. Making use of the Streamer to max out a burst SDRAM transfer I think would be a good feature test.

    It already does SDRAM. That was one of the original design goals.
  • evanh Posts: 15,924
    Sexy! I'm all good then. :D
  • Rayman Posts: 14,658
    If two cogs can write to some cog's LUT, seems you'd want to separate those events somehow...
  • cgracey Posts: 14,155
    edited 2016-05-05 11:31
    Rayman wrote: »
    If two cogs can write to some cog's LUT, seems you'd want to separate those events somehow...

    Yes, I was thinking the same thing. The smallest way would be to have the cog select the sensitivity. This way, you don't need more events. The one event could do either. We've got one empty event at the moment. We could use it for write sensing, since the other is already read sensing. A simple D/# instruction could set the sensitivities for both events.

    At the moment, I'm compiling the whole thing, with the new RDLUTN/WRLUTN instructions added (N=next). I'm hoping we still fit. An ALM is almost 3 LE's, so we should be okay. We have a budget of about 112 ALMs per cog for this last feature.
  • Well... I guess this is happening. Ok, then...

    I suggest skipping the configurable events. Just do the following:
    • Cog N gets WRITE event for Cog N+1 writing to $1FF
    • Cog N gets READ event for Cog N+1 reading from $1FE
    • Cog N gets WRITE event for Cog N-1 writing to $1FE
    • Cog N gets READ event for Cog N-1 reading from $1FF

    You will notice that the last event is actually an existing event (13), which means that you only need three new events. If you want to increase the event ID size to 5 bits, that's fine. But I think you could possibly get away with getting rid of one of the counters. (The additional counters were added when we didn't have the smart pins, so having 3 may not be as critical now. This will also free up a few ALMs on the FPGA. Speaking of ALMs, you might also free up a few by taking the HUB back to 512K.)
  • cgracey wrote: »
    At the moment, I'm compiling the whole thing, with the new RDLUTN/WRLUTN instructions added (N=next). I'm hoping we still fit. An ALM is almost 3 LE's, so we should be okay. We have a budget of about 112 ALMs per cog for this last feature.
    Fingers crossed! :)
    It's all wrapping up nicely.

  • cgracey Posts: 14,155
    edited 2016-05-05 12:24
    It done blowed up!

    Error (170012): Fitter requires 11361 LABs to implement the design, but the device contains only 11356 LABs

    I'll get rid of the CT3 timer. Seairth is right that we don't need three timers with smart pins. Two is plenty.
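
    For scale, the numbers in that fitter error work out as below (plain arithmetic on the two figures quoted in the message; the variable names are just for illustration).

```python
required_labs = 11361    # from the fitter error above
available_labs = 11356   # from the fitter error above

shortfall = required_labs - available_labs
utilization = required_labs / available_labs

print(f"short by {shortfall} LABs; requested utilization {utilization:.4%}")
```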
  • cgracey wrote: »
    Rayman wrote: »
    If two cogs can write to some cog's LUT, seems you'd want to separate those events somehow...

    Yes, I was thinking the same thing. The smallest way would be to have the cog select the sensitivity. This way, you don't need more events. The one event could do either. We've got one empty event at the moment. We could use it for write sensing, since the other is already read sensing. A simple D/# instruction could set the sensitivities for both events.

    So, you would have two events, one for reads and one for writes, with the following two settings:

    * Own LUT or Next LUT
    * LUT address

    Seems reasonable. I'd change the event mnemonics slightly:
    RDL   -> RDH (hub read)
    WRL   -> WRH (hub write)
    RLE   -> RDL (lut read)
    (new) -> WRL (lut write)
    

    For defaults, I'd do the following:

    * READ: Own LUT, $1FF (which is what event 13 is currently)
    * WRITE: Next LUT, $1FF

    To enable two-way communication,
    * Cog N would switch READ to Next LUT, $1FF
    * Cog N+1 would switch WRITE to Own LUT, $1FE

    With this,
    1. Cog N+1 WAITWRL
    2. Cog N WRLUTN $1FE
    3. Cog N WAITRDL
    4. Cog N+1 wakes from WAITWRL
    5. Cog N+1 RDLUT $1FE
    6. Cog N wakes from WAITRDL

    and...
    1. Cog N WAITWRL
    2. Cog N+1 WRLUT $1FF
    3. Cog N+1 WAITRDL
    4. Cog N wakes from WAITWRL
    5. Cog N RDLUTN $1FF
    6. Cog N+1 wakes from WAITRDL
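
    A rough way to check the first sequence above is to model it in Python, with threads standing in for the two cogs. Everything here is illustrative: SharedLut and transfer are made-up names, threading.Event stands in for the proposed LUT write/read events, and it assumes an event stays latched until the waiting cog consumes it.

```python
import threading


class SharedLut:
    """Stand-in for Cog N+1's LUT, with write and read events on one address."""

    def __init__(self, event_addr, size=512):
        self.mem = [0] * size
        self.event_addr = event_addr
        self.written = threading.Event()    # models the WRL event
        self.was_read = threading.Event()   # models the RDL event

    def write(self, addr, value):
        self.mem[addr] = value
        if addr == self.event_addr:
            self.written.set()

    def read(self, addr):
        value = self.mem[addr]
        if addr == self.event_addr:
            self.was_read.set()
        return value


def transfer(value, addr=0x1FE):
    """Run one Cog N -> Cog N+1 handoff; return (received value, wake-up log)."""
    lut = SharedLut(addr)
    log = []
    received = []

    def cog_n_plus_1():
        lut.written.wait()                   # WAITWRL
        log.append("N+1 wakes from WAITWRL")
        received.append(lut.read(addr))      # RDLUT $1FE

    def cog_n():
        lut.write(addr, value)               # WRLUTN $1FE
        lut.was_read.wait()                  # WAITRDL
        log.append("N wakes from WAITRDL")

    threads = [threading.Thread(target=cog_n_plus_1),
               threading.Thread(target=cog_n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received[0], log


if __name__ == "__main__":
    print(transfer(0x12345678))
```

    The reverse direction is the mirror image: Cog N+1 writes its own LUT at $1FF and Cog N waits on the write event and reads through the neighbor port.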
  • cgracey wrote: »
    It done blowed up!

    Error (170012): Fitter requires 11361 LABs to implement the design, but the device contains only 11356 LABs

    I'll get rid of the CT3 timer. Seairth is right that we don't need three timers with smart pins. Two is plenty.

    Why are you holding onto that extra 512K? I realize this is mostly memory, but some ALMs have got to be used! Isn't it more important to prove the actual design than to give a little bit extra to those few people that have the A9 board?
  • cgracey Posts: 14,155
    Seairth wrote: »
    cgracey wrote: »
    It done blowed up!

    Error (170012): Fitter requires 11361 LABs to implement the design, but the device contains only 11356 LABs

    I'll get rid of the CT3 timer. Seairth is right that we don't need three timers with smart pins. Two is plenty.

    Why are you holding onto that extra 512K? I realize this is mostly memory, but some ALMs have got to be used! Isn't it more important to prove the actual design than to give a little bit extra to those few people that have the A9 board?

    I don't think that extra address bit amounts to much, so I haven't bothered, yet. The compiler is still moving along, further than it got last time.
  • cgracey Posts: 14,155
    Seairth wrote: »
    cgracey wrote: »
    Rayman wrote: »
    If two cogs can write to some cog's LUT, seems you'd want to separate those events somehow...

    Yes, I was thinking the same thing. The smallest way would be to have the cog select the sensitivity. This way, you don't need more events. The one event could do either. We've got one empty event at the moment. We could use it for write sensing, since the other is already read sensing. A simple D/# instruction could set the sensitivities for both events.

    So, you would have two events, one for reads and one for writes, with the following two settings:

    * Own LUT or Next LUT
    * LUT address

    Seems reasonable. I'd change the event mnemonics slightly:
    RDL   -> RDH (hub read)
    WRL   -> WRH (hub write)
    RLE   -> RDL (lut read)
    (new) -> WRL (lut write)
    

    For defaults, I'd do the following:

    * READ: Own LUT, $1FF (which is what event 13 is currently)
    * WRITE: Next LUT, $1FF

    To enable two-way communication,
    * Cog N would switch READ to Next LUT, $1FF
    * Cog N+1 would switch WRITE to Own LUT, $1FE

    With this,
    1. Cog N+1 WAITWRL
    2. Cog N WRLUTN $1FE
    3. Cog N WAITRDL
    4. Cog N+1 wakes from WAITWRL
    5. Cog N+1 RDLUT $1FE
    6. Cog N wakes from WAITRDL

    and...
    1. Cog N WAITWRL
    2. Cog N+1 WRLUT $1FF
    3. Cog N+1 WAITRDL
    4. Cog N wakes from WAITWRL
    5. Cog N RDLUTN $1FF
    6. Cog N+1 wakes from WAITRDL

    Looks good. I think I'm going to have to sleep for a few hours before I can do any more.
  • cgracey Posts: 14,155
    edited 2016-05-05 22:58
    The compiler ran for 9 hours and wasn't getting the job done. It should take about 80 minutes if things are good. To get this LUT sharing working, something else may have to be pared down.

    I got rid of the CT3 event, which was superfluous with smart pins able to track so much timing.

    Question: Do we need more than one timer event per cog?

    If we got rid of CT2, also, we'd go back to GETCNT/ADDCNT/POLLCNT/WAITCNT, instead of GETCT/ADDCT1/ADDCT2/POLLCT1/POLLCT2/WAITCT1/WAITCT2.

    Does a cog need more than one timer?
  • jmg Posts: 15,173
    cgracey wrote: »
    The compiler ran for 9 hours and wasn't getting the job done. It should take about 80 minutes if things are good. To get this LUT sharing working, something else may have to be pared down.

    I got rid of the CT3 event, which was superfluous with smart pins able to track so much timing.

    Question: Do we need more than one timer event per cog?

    If we got rid of CT2, also, we'd go back to GETCNT/ADDCNT/POLLCNT/WAITCNT, instead of GETCT/ADDCT1/ADDCT2/POLLCT1/POLLCT2/WAITCT1/WAITCT2.

    Does a cog need more than one timer?

    Maybe the tools need a little headroom, meaning 100.00 % usage is not quite reachable ?

    How much does that CT2 save ?
    If timers can trigger events, I'd be reluctant to pare back to only one ?
    Does having two help porting code from P1, which has 2 timers per COG ?


  • Rayman Posts: 14,658
    I'd say no, at first. But, if someone did want to use all the new task switching stuff to do multiple things, then they might want all 3 timers...

    Maybe make cog-cog only to next cog?

    Is there an A10 you can try to compile to? Or, something with more ALMs?
  • cgracey Posts: 14,155
    Rayman wrote: »
    I'd say no, at first. But, if someone did want to use all the new task switching stuff to do multiple things, then they might want all 3 timers...

    Maybe make cog-cog only to next cog?

    Is there an A10 you can try to compile to? Or, something with more ALMs?

    We are using the fattest of the Cyclone V devices.
  • jmg Posts: 15,173
    Rayman wrote: »
    Maybe make cog-cog only to next cog?

    hmm.. I thought it was already limited to Even-Odd pairings ?
    Seems that would save a lot of routing and Muxes, as the two associated LUTs can sit between their two owner COGs ?
  • cgracey Posts: 14,155
    edited 2016-05-05 23:39
    jmg wrote: »
    Rayman wrote: »
    Maybe make cog-cog only to next cog?

    hmm.. I thought it was already limited to Even-Odd pairings ?
    Seems that would save a lot of routing and Muxes, as the two associated LUTs can sit between their two owner COGs ?

    Tubular proposed something similar, but I can't see how it would save any logic. It's a different connection pattern, but takes the same amount of glue.
  • Seairth Posts: 2,474
    edited 2016-05-05 23:50
    For now, maybe get rid of some smartpin cells. Our primary concern is proving out the various features. 32 smartpins is still plenty enough for testing purposes.

    Unless there is a direct correlation between the ability to fit everything in the final silicon and fit everything in an A9, I wouldn't start dropping features just to fit on the A9.

    Also, I know you say the 1MB HUB only has 1 extra addressing line, but wouldn't the FPGA potentially use multiple ALMs for signal routing? I know I'm sounding like a broken record, but it's not part of the final silicon, so there's no harm in seeing if it helps free up resources.
  • cgracey Posts: 14,155
    edited 2016-05-06 00:09
    Seairth wrote: »
    For now, maybe get rid of some smartpin cells. Our primary concern is proving out the various features. 32 smartpins is still plenty enough for testing purposes.

    Unless there is a direct correlation between the ability to fit everything in the final silicon and fit everything in an A9, I wouldn't start dropping features just to fit on the A9.

    Also, I know you say the 1MB HUB only has 1 extra addressing line, but wouldn't the FPGA potentially use multiple ALMs for signal routing? I know I'm sounding like a broken record, but it's not part of the final silicon, so there's no harm in seeing if it helps free up resources.

    I'll try it. It'll probably amount to 20, or so, ALMs.

    We've actually got a LOT of room on the silicon die. We could just proceed by doing what you suggested: not have all pins be smart pins on the -A9. We could put them at 0..31 and then 58..63. That would cover all of port A and the flash and serial pins at the top of port B. Let's do that! Problem solved.
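
    As a quick tally of that pin split, here is a small Python sketch. It assumes 64 I/O pins numbered 0..63; SMART_PIN_RANGES and has_smart_pin are illustrative names only.

```python
TOTAL_PINS = 64                          # assumed: pins 0..63
SMART_PIN_RANGES = [(0, 31), (58, 63)]   # inclusive ranges from the post


def has_smart_pin(pin):
    """True if the pin would keep its smart pin cell under the proposed split."""
    return any(low <= pin <= high for low, high in SMART_PIN_RANGES)


smart_pins = [pin for pin in range(TOTAL_PINS) if has_smart_pin(pin)]
plain_pins = [pin for pin in range(TOTAL_PINS) if not has_smart_pin(pin)]

if __name__ == "__main__":
    print(len(smart_pins), "smart pins,", len(plain_pins), "plain pins")
```

    That leaves 38 pins with smart pin cells and 26 without, under the 64-pin assumption.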