
New P2 Silicon


Comments

  • evanh Posts: 14,132
    edited 2022-02-20 01:49

    I guess more explaining is needed ... It's the other way round, Tony. In the testing, the streamer's NCO and mode dictate the demand data rate. That's a configured known. What's unknown is the FIFO's behaviour when accessing hubRAM.

    So these tests attempt to measure how often the FIFO bursts from hubRAM and, by extrapolation, an average of how much per burst. This is done by setting up the streamer to run continuously at a set rate, then using a SETQ+WRLONG to do a block fill of hubRAM and measuring how long each block fill takes.

    The reasoning is that each time the FIFO accesses hubRAM, it stalls the cog's block fill. Because the block write is in sequential address order, each stall adds an 8-clock-cycle (tick) increment to the time taken to complete the block write. Therefore, the longer the write takes to complete, the more often the FIFO accessed hubRAM.

    The good news is the FIFO can grab a burst of sequential data from hubRAM in one go. This compensates nicely for the fact that the block fill stalls in 8-tick increments. Quote from the silicon doc:

    The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously whenever less than (cogs+7) stages are filled, after which point, up to 5 more longs may stream in, potentially filling all (cogs+11) stages.
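
    With the P2's eight cogs, that works out to a 19-stage FIFO which reloads whenever fewer than 15 stages are filled.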

    Each line of the reports has two config parameters:

    • The streaming data width (column 1)
    • The NCO increment value (column 3). I've also been calling this the NCO divider, and I note Chip calls it the NCO frequency.

    There is a third parameter: the SETQ+WRLONG block write/fill length, which, above, is set at 64k longwords for all tests. Unimpeded, this takes exactly 65536 ticks to complete.

    Okay, so the repeat tests on each line are the measured times taken for a 64k block write to complete.
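
    For illustration, here's a minimal sketch of one measurement pass, per the description above (the mode word RF_MODE, the buffer labels, and the t0/t1 registers are placeholders, not the actual test code):

        rdfast  #0, ##src_buf       ' attach this cog's FIFO to hubRAM for the streamer to read
        setxfrq ##$1C71C71C         ' NCO increment (this value is the divide-by-9 case)
        xinit   ##RF_MODE, #0       ' start the streamer in an RFBYTE/RFWORD/RFLONG read mode
        getct   t0                  ' timestamp before the block fill
        setq    ##$FFFF             ' 64k longwords (count minus 1)
        wrlong  #0, ##dst_buf       ' block fill of zeros; each FIFO burst stalls this
        getct   t1                  ' timestamp after
        sub     t1, t0              ' elapsed ticks: ~65536 unimpeded, more when the FIFO hogs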

  • evanh Posts: 14,132
    edited 2022-02-20 12:20

    @jmg said:
    What is a 'serious excursion' - does it drop data, or just sometimes take more cycles than hoped ?
    ...

    Things went well until discovering that some streamer rates cause the FIFO to badly hog the hubRAM bus.

    What does 'badly hog' amount to ?
    With hub being MOD 8, and a NCO not being mod 8, I can see some varying waits could be needed, but things should never 100% choke ?

    It's just the time taken, as per the measurements. Chip has wired the FIFO to generally burst 6 longwords from hubRAM at a time. This seems pretty good as it creates an early request condition so that up to 8 longwords can be burst without pushing into a second stall window. The averages should be pretty smooth, and are most of the time.
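
    If I have it right, a sequential burst moves one longword per tick, so a burst of up to 8 longwords fits within a single 8-tick stall window of the block fill.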

    EDIT: Here's an example with worst case excursions:

     BYTE    9   1c71c71c    70784    70776    70776    70776    70784    70776    70784    70776    70784    70776    0
    SHORT    9   1c71c71c   589720   589792   589792   589792    76928    76928    76936   589648   589720   589792    0
     LONG    9   1c71c71c   118000   117960   117968   117976   117936   117968   117976   117992   118000   117960    0
    

    The BYTE label means the streamer is in RFBYTE mode. For a given streamer NCO rate, it should need only 1/4 of the hubRAM bandwidth compared to RFLONG. So it's to be expected that, for BYTE lines, the block fill will complete faster than for SHORT or LONG lines. Lower bandwidth should mean fewer bursts. And indeed the fill is quicker: 70784 ticks on the BYTE line vs 76936 ticks on the SHORT line, with 65536 ticks being unimpeded.
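
    To put numbers on that: the NCO increment $1C71C71C is $1_0000_0000/9 rounded down, so the streamer consumes one data element every 9 ticks. In RFLONG mode that drains a FIFO longword every 9 ticks; in RFBYTE mode a longword covers four elements, so only one FIFO longword is needed every 36 ticks.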

    More interesting is that there is no difference between SHORT and LONG. Both take 76936 ticks.

    And of course, the eye-popping excursions out to 589792 ticks. Something is going off the rails with those.

    EDIT: The SHORT and LONG tests measuring the same were a bug in my code. Now fixed and I've updated the example report snippet above too.

  • evanh Posts: 14,132
    edited 2022-02-21 00:55

    Chip,
    Testing is solid now. The conclusion is that the FIFO isn't performing optimally on occasion. Certain timings at streamer startup somehow cause the FIFO to ebb along, refilling only 1-4 longwords at a time. Or, in high-volume cases, it may sit at 6/12 longwords per refill when multiples of 8 would be far more effective.

    Worst case, the FIFO can be hogging 50% of a cog's hubRAM bandwidth while only fetching a single longword every other rotation. EDIT: It might be a lot worse actually. My head is hurting. At any rate, certain combinations should be avoided because of this. Further reading from here - https://forums.parallax.com/discussion/comment/1535820/#Comment_1535820

    A detail: This doesn't afflict RFBYTE streamer ops.

    EDIT2: Okay, the worst cases are more like 90% hogging and are specifically when the FIFO data rate is either a longword every three sysclock ticks or every nine ticks. Not sure how tight a margin either side of 3 and 9.

  • cgracey Posts: 13,910

    Interesting, Evanh. I should have considered, during design, that ...

    @evanh said:

    Chip,
    I've been wanting to clarify a particular detail of the cog execution pipeline. In your ASCII art diagram of the pipeline it isn't completely clear at which points cogRAM addressing occurs. Knowing this would clear things up for me.

    Here's the original:

            |                   |                   |                   |                   |                   |                   |
    rdRAM Ib|------+            |           rdRAM Ic|------+            |           rdRAM Id|------+            |           rdRAM Ie|
            |      |            |                   |      |            |                   |      |            |                   |
    latch Da|--+   +--> rdRAM Db|---------> latch Db|--+   +--> rdRAM Dc|---------> latch Dc|--+   +--> rdRAM Dd|---------> latch Dd|
    latch Sa|--+   +--> rdRAM Sb|---------> latch Sb|--+   +--> rdRAM Sc|---------> latch Sc|--+   +--> rdRAM Sd|---------> latch Sd|
    latch Ia|--+   +--> latch Ib|---------> latch Ib|--+   +--> latch Ic|---------> latch Ic|--+   +--> latch Id|---------> latch Id|
            |  |                |                   |  |                |                   |  |                |                   |
            |  +---------------ALU--------> wrRAM Ra|  +---------------ALU--------> wrRAM Rb|  +---------------ALU--------> wrRAM Rc|
            |                   |                   |                   |                   |                   |                   |
            |                   |stall/done = 'gox' |                   |stall/done = 'gox' |                   |stall/done = 'gox' |
            |       'get'       |      done = 'go'  |       'get'       |      done = 'go'  |       'get'       |      done = 'go'  |
    

    So, the question is: is rdRAM the data from cogRAM ... or is rdRAM the address generation for the next cycle, where data then comes from cogRAM and is subsequently latched?

    EDIT: Another way to ask: Where does PC address gen fit into the diagram?

    So, rdRAM means that the address is being given to the SRAM to initiate a read cycle. At the end of the next cycle, the read data is captured in a 'latch'.

    And wrRAM means that the address and data are being given to the SRAM to initiate a write cycle.

    The rdRAM Ix is when the address is given to the SRAM to read the next instruction.

    All these rdRAM and wrRAM are clock-edge events where the SRAM latches the address (and data) for a r/w operation that will occur on the next cycle.
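
    In other words, for instruction b: one edge presents its address (rdRAM Ib); at the next edge the fetched instruction is captured (latch Ib) while the D/S operand addresses are presented (rdRAM Db/Sb); the edge after that captures the operands (latch Db/Sb) for the ALU; and the following edge hands the result address and data to the SRAM (wrRAM Rb).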

  • evanh Posts: 14,132

    Excellent. Thanks Chip.

  • evanh Posts: 14,132

    Right, Chip, corrections time. The instruction pipeline graphic in the draft Hardware Manual - https://docs.google.com/document/d/1_b6e9_lIGHaPqZKV37E-MmCvqaS3JKYwCvHEJ-uCq8M/edit?usp=sharing - needs correcting to match your clarification. There's one fetching stage too many, and it's missing an Execute stage. Also, the stage with MUX is not Store, it is the missing Execute.

    Instruction Fetch -> Operands Fetch -> Execute -> Execute -> Store

    See my earlier post on alignment of the two ASCII arts - https://forums.parallax.com/discussion/comment/1535164/#Comment_1535164

  • evanh Posts: 14,132
    edited 2022-02-23 10:55

    And the follow-on ... because that affects I/O timing via when the OUT/DIR registers get updated, we then get this:

    _____  clk  _____       _____       _____       _____       _____       _____       _____       _____       _____
     -1  |_____|  0  |_____|  1  |_____|  2  |_____|  3  |_____|  4  |_____|  5  |_____|  6  |_____|  7  |_____|  8  |_____|
    
               |           |    DIRA   |    reg    |    reg    |   P0 OE   |           |           |           |           |
               |           |    OUTA   |    reg    |    reg    |  P0 HIGH  |           |           |           |           |
               |           |           |           |           |  P0 slews |    reg    |    reg    |    INA    |    C/Z    |
               |           |           |           |           |           |           |           |           |           |
         [ DRVH  #0 ]      |           |           |           |           |           |           |     [ TESTP  #0 ]     |
    

    Of note is one fewer staging register each way than is shown in the Pin Timing graphics. I've done some testing, so I'm feeling cocky. :)

    EDIT: Made it narrower.

  • Cluso99 Posts: 18,064
    edited 2022-02-23 08:50

    @evanh,
    Not sure if this might help....

    Thread here
    https://forums.parallax.com/discussion/comment/1497556/#Comment_1497556

  • evanh Posts: 14,132
    edited 2022-02-23 10:51

    Not bad, but it doesn't demonstrate the lag from outputting an SPI clock to the SPI data being read back in. As a result, the phase alignment of TESTP is ethereal.

    Also, in reality, the sampling alignment arrows need extending further. I would previously have said the OUT latencies were correct, but now I think they're +1 to what you have. TESTP latencies are also +1, I think.

    The combined total, including one for pin slew time, is 8 clock cycles. (From end of OUTx to end of TESTP)

    EDIT: Doh! Reading that is tricky. Both Chip and I are working from/to instruction ends. Yours is correct because there is an inbuilt +2 for instruction execution; it just needs the +1 added for pin slew.

    EDIT2: And when at high sysclock frequencies you can expect an additional +4 from externals.
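
    As a sanity check on that 8-tick figure, a hypothetical back-to-back sketch (each instruction takes 2 ticks; reading the driven pin straight back with TESTP):

        drvh    #0              ' drive P0 high
        nop                     ' 2 ticks
        nop                     ' 2 ticks
        nop                     ' 2 ticks
        testp   #0          wc  ' ends 8 ticks after DRVH ends; C should read high

    With one fewer NOP, TESTP ends at +6 and should still read the pin low.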

  • cgracey Posts: 13,910

    I think the hogging is the ...

    @evanh said:

    Chip,
    Testing is solid now. The conclusion is that the FIFO isn't performing optimally on occasion. Certain timings at streamer startup somehow cause the FIFO to ebb along, refilling only 1-4 longwords at a time. Or, in high-volume cases, it may sit at 6/12 longwords per refill when multiples of 8 would be far more effective.

    Worst case, the FIFO can be hogging 50% of a cog's hubRAM bandwidth while only fetching a single longword every other rotation. EDIT: It might be a lot worse actually. My head is hurting. At any rate, certain combinations should be avoided because of this. Further reading from here - https://forums.parallax.com/discussion/comment/1535820/#Comment_1535820

    A detail: This doesn't afflict RFBYTE streamer ops.

    EDIT2: Okay, the worst cases are more like 90% hogging and are specifically when the FIFO data rate is either a longword every three sysclock ticks or every nine ticks. Not sure how tight a margin either side of 3 and 9.

    Evanh, I think what is happening is that the FIFO loads all the way up at every opportunity, having priority over RDLONG instructions, and the RDLONG instructions get what's left over. There are some duty relationships between the FIFO loads and RDLONG instructions where the RDLONG is starved more than it would need to be if the FIFO loading were balanced better by a more complex rule about how much data to read on each load. Right now, the rule is that if there are fewer than a static number of longs in the FIFO, the FIFO loads until that number is reached, after which 5 more longs stream in because of the clock delays. That logic needs to be improved to fix this problem, and that can't happen until we do a silicon respin.

  • evanh Posts: 14,132
    edited 2022-02-25 03:42

    @cgracey said:
    ... That can't happen until we do a silicon respin.

    Putting it on the table for that day.

    In the last couple of days, I've also noticed that FIFO writes to hubRAM are a lot more hoggish than even the worst-case reads. I then worked out this is because the FIFO doesn't explicitly delay its writes; it just has the hub rotation delays.

    On the revisit, we could also look at adding delayed write buffering to the FIFO. I know it'd need auto-flushing after a timeout to keep hubRAM up to date.

  • Wuerfel_21 Posts: 3,458
    edited 2022-02-28 22:06

    Ooop, I think I found a technically-not-a-bug-but-really-obnoxious-anyways:

    It seems that PUSH/POP do not affect the SKIP suspension counter.
    Thus, you can't do the triple tailcall optimization in code that gets called from inside a SKIP sequence.

    I.e. turning

    call #first
    call #second
    call #third
    ret
    

    into

    push #third
    push #second
    jmp #first
    

    EDIT: oop, double would work, triple and more is the issue.

  • Unrelatedly: Is _ret_ pop D supposed to work?

    It seems to work only sometimes, seemingly dependent on what the D address is.

    Or maybe I'm just going insane.

  • evanh Posts: 14,132

    @Wuerfel_21 said:
    Unrelatedly: Is _ret_ pop D supposed to work?

    /me stirs .. testing time ...

  • @evanh said:

    @Wuerfel_21 said:
    Unrelatedly: Is _ret_ pop D supposed to work?

    /me stirs .. testing time ...

    Some of this was discussed before in a related thread...
    https://forums.parallax.com/discussion/comment/1491353/#Comment_1491353

  • evanh Posts: 14,132
    edited 2022-03-01 05:19

    Ah, good stuff ...

    @rogloh said:
    I just tried out the "RET POP D" case to see what would happen. It looks like it returns ok to where it was called from and the D register gets a copy of the return address of the calling code (i.e. caller location in COG RAM + 1), plus the state of the C, Z flags at calling time.

    That makes sense. And Chip predicted the same I see.

    So if the stack contained both a PUSH'd value and the return address ... then the pushed value would become the return address and ... crashie-poo.
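
    A hypothetical illustration (made-up value and labels), given that the one pop feeds both D and the return address:

        call    #tricky             ' CALL pushes the return address
        nop                         ' execution should come back here, but won't
    tricky
        push    #$123               ' $123 now sits on top, above the return address
    _ret_   pop     pa              ' pa := $123 and the cog "returns" to $123 ... crashie-poo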

  • @Wuerfel_21 said:
    Ooop, I think I found a technically-not-a-bug-but-really-obnoxious-anyways:

    It seems that PUSH/POP do not affect the SKIP suspension counter.
    Thus, you can't do the triple tailcall optimization in code that gets called from inside a SKIP sequence.

    I.e. turning

    call #first
    call #second
    call #third
    ret
    

    into

    push #third
    push #second
    jmp #first
    

    EDIT: oop, double would work, triple and more is the issue.

    So you'd like to see PUSH/POP increment/decrement skip call counter if it is non-zero?

  • @TonyB_ said:

    So you'd like to see PUSH/POP increment/decrement skip call counter if it is non-zero?

    Yeah. I don't see any downside to that, since push/pop have to be balanced, anyways.

  • @Wuerfel_21 said:

    @TonyB_ said:

    So you'd like to see PUSH/POP increment/decrement skip call counter if it is non-zero?

    Yeah. I don't see any downside to that, since push/pop have to be balanced, anyways.

    They don't have to be balanced. A POP when the skip call counter is 1 could restart skipping even when the intention is neither to return to the skip sequence nor to resume skipping. The skip bits could be zeroed in this case.

  • @TonyB_ said:

    @Wuerfel_21 said:

    @TonyB_ said:

    So you'd like to see PUSH/POP increment/decrement skip call counter if it is non-zero?

    Yeah. I don't see any downside to that, since push/pop have to be balanced, anyways.

    They don't have to be balanced. A POP when skip call counter is 1 could restart skipping when the intention is not to return to the skip sequence and not to resume skipping. Skip bits could be zeroed in this case.

    But that doesn't work as-is, either. It will just continue skipping on the next RET. RETurning from the skip sequence also doesn't work. As-is, the skip bits will always be consumed eventually.
