New P2 Silicon - Page 29 — Parallax Forums

New P2 Silicon


Comments

  • cgraceycgracey Posts: 14,204
    edited 2022-01-16 10:28

    If we can ever make a P3, might as well make it 64 bits and have 3 larger register fields: two sources and one destination, with several fast indirect registers with auto inc/dec. Lots of things could be done at 22nm, or so.

  • evanhevanh Posts: 16,015

    @cgracey said:
    It would have caused slowing of the register muxing paths, which would have caused an increase in buffering, likely.

    ALU critical paths right?

  • evanhevanh Posts: 16,015
    edited 2022-01-16 10:37

    @cgracey said:
    If we can ever make a P3, might as well make it 64 bits and have 3 larger register fields: two sources and one destination, with several fast indirect registers with auto inc/dec. Lots of things could be done at 22nm, or so.

    Lol, 3 x 16-bit. 64 kWords (512 kBytes) of cogRAM.

  • cgraceycgracey Posts: 14,204

    @evanh said:

    @cgracey said:
    It would have caused slowing of the register muxing paths, which would have caused an increase in buffering, likely.

    ALU critical paths right?

    That feed the ALU, yes.

  • evanhevanh Posts: 16,015
    edited 2022-01-16 10:47

    @cgracey said:
    That feed the ALU, yes.

    Oh, so the reduced number of addressable special registers in the Prop2 cogs helps make that faster compared to the Prop1 then?

    EDIT: And that would mean a round power-of-two will be ideal number of addressable special registers. Maybe minus one to allow SRAM to be one of the mux sources.

  • cgraceycgracey Posts: 14,204

    @evanh said:

    @cgracey said:
    That feed the ALU, yes.

    Oh, so the reduced number of addressable special registers in the Prop2 cogs helps make that faster compared to the Prop1 then?

    EDIT: And that would mean a round power-of-two will be ideal number of addressable special registers. Maybe minus one to allow SRAM to be one of the mux sources.

    The way things work with timing paths, powers-of-two muxes are not necessarily best, because of the varying distances signals must cross. The wiring/buffer delays can easily overshadow a 2-to-1 mux propagation delay.

  • evanhevanh Posts: 16,015

    All the special registers can be packed in tight. They're dedicated to serving the ALU.

  • cgraceycgracey Posts: 14,204
    edited 2022-01-16 11:04

    @evanh said:
    All the special registers can be packed in tight. They're dedicated to serving the ALU.

    It's a huge tug-of-war. The placer/router spends many hours trying to meet the timing requirement. There are many paths competing for the shortest routes. Everything can't be in the same place, but must get spread out across the die, of course.

  • evanhevanh Posts: 16,015
    edited 2022-01-16 11:14

    cogRAM is certainly a long way out, and dimensionally large, in comparison. You're not going to get further away than that. Each ALU is going to be on one side of each SRAM block likely.

    And the associated lutRAM block has to be nearby too. I can see why that needed an extra stage. Gives higher priority to placement around cogRAM.

  • cgraceycgracey Posts: 14,204

    @evanh said:
    cogRAM is certainly a long way out, and dimensionally large, in comparison. You're not going to get further away than that. Each ALU is going to be on one side of each SRAM block likely.

    And the associated lutRAM block has to be nearby too. I can see why that needed an extra stage. Gives higher priority to placement around cogRAM.

    Yes, the logic of each cog grouped right around its own cogRAM and lutRAM. All the RAMs were spread around the perimeter of the cell area, creating two axes of symmetry. This allowed the placer/router a big open area in the center to assemble and wire the ~600k cells.

  • Wuerfel_21Wuerfel_21 Posts: 5,097
    edited 2022-01-16 11:54

    I feel like just adding a signed version of QMUL and a way to get the middle 32 bits of the result in one instruction would significantly speed up fixed-point vector / DSP tasks

  • evanhevanh Posts: 16,015
    edited 2022-01-16 14:09

    I just realised that if QX/QY were mapped to cogRAM addresses then pushing these results over the hub-op bus is complicated by the fact that there are two results at once. They'd then have to arrive via time-staggered updates from the Cordic registers to the cog registers, which makes the whole scheme all the more complicated. The same applies to writing to lutRAM.

    Fixed point QMUL should work just by fetching the high 32 bits with GETQY.

  • Wuerfel_21Wuerfel_21 Posts: 5,097
    edited 2022-01-16 14:09

    @evanh said:
    Fixed point QMUL should work just by fetching the high 32 bits with GETQY

    No, it doesn't. That gets you the integer portion of the result only (assuming 16.16 inputs). And if the operands were signed, it's wrong to boot (unsigned and signed multiplication differ only in the upper half of the result)
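
To make that concrete, here is a small Python sketch (not P2 code) of a 32x32 multiply with 16.16 operands. The helper names `fx16`, `unsigned_product_halves` and `signed_mul_16_16` are made up for this illustration; the "low half" and "high half" stand in for what GETQX and GETQY would return.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Reinterpret a 32-bit pattern as a two's complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def fx16(value):
    """Encode a real number as a 16.16 fixed-point bit pattern (hypothetical helper)."""
    return int(round(value * 65536)) & MASK32

def unsigned_product_halves(a, b):
    """Low and high 32 bits of the unsigned 64-bit product (the QX / QY roles)."""
    p = (a & MASK32) * (b & MASK32)
    return p & MASK32, p >> 32

def signed_mul_16_16(a, b):
    """Signed 16.16 multiply: keep bits [47:16] of the signed 64-bit product."""
    p = to_signed32(a) * to_signed32(b)
    return (p >> 16) & MASK32

# 1.5 * 2.25 = 3.375: the high half alone only carries the integer part.
lo, hi = unsigned_product_halves(fx16(1.5), fx16(2.25))
print(hex(hi))                                        # 0x3
print(hex(signed_mul_16_16(fx16(1.5), fx16(2.25))))   # 0x36000, i.e. 3.375

# -1.5 * 2.25: the low half matches a signed multiply, the high half does not.
lo_u, hi_u = unsigned_product_halves(fx16(-1.5), fx16(2.25))
signed_p = to_signed32(fx16(-1.5)) * to_signed32(fx16(2.25))
print(lo_u == signed_p & MASK32)             # True
print(hi_u == (signed_p >> 32) & MASK32)     # False
```

The last two lines show why a signed variant matters: only the upper half of the product changes between signed and unsigned interpretation.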

  • In my ideal P3 you'd be able to fetch any 32 bits of the full 64 bit multiply result, something like:

        setq2 #16  ' shift result right by 16
        getqx  ans  ' now ans = correct value for 16.16 multiplication
    

    and of course there would be both signed and unsigned multiply.

    Of course if the P3 is a 64 bit processor then perhaps this becomes moot (unless it has a 64x64 -> 128 bit multiply!)

  • evanhevanh Posts: 16,015
    edited 2022-01-16 14:50

    16.16 squared won't all fit in anything other than 64 bits. If you want all the high bits then GETQY is it.

    Like floats, signed is extra logic handling. SHL fpn,#1 WC to extract and prep unsigned value. And RCR fpn,#1 to reconstruct. Of course this isn't two's complement any longer. There are two zeros, for starters.
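
A rough Python model of that sign-and-magnitude round trip, assuming SHL with WC puts the old bit 31 into C and RCR shifts C back into bit 31. The function names are invented for this sketch.

```python
MASK32 = 0xFFFFFFFF

def shl1_wc(x):
    """Model of SHL x,#1 WC: returns (shifted value, carry = old bit 31)."""
    x &= MASK32
    return (x << 1) & MASK32, x >> 31

def rcr1(x, c):
    """Model of RCR x,#1: shift right by one, carry enters at bit 31."""
    return ((c & 1) << 31) | ((x & MASK32) >> 1)

# Round trip: sign bit goes out to C, comes back in via RCR.
value, sign = shl1_wc(0x80018000)
print(hex(value), sign)
print(hex(rcr1(value, sign)))

# Two different bit patterns both decode to magnitude zero: +0 and -0.
print(shl1_wc(0x00000000), shl1_wc(0x80000000))
```

Note the magnitude comes out shifted left by one bit, so any arithmetic on it has to account for that before the RCR puts it back.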

  • evanhevanh Posts: 16,015
    edited 2022-01-16 15:14

    @ersmith said:
    In my ideal P3 you'd be able to fetch any 32 bits of the full 64 bit multiply result, something like:

        setq2 #16  ' shift result right by 16
        getqx  ans  ' now ans = correct value for 16.16 multiplication
    

    Or make it just one double operand instruction. :)

        getqz  hi32,#32
        getqz  mid32,#16
        getqz  lo32,#0
    

    Problem is it likely requires a separate barrel shifter. And those are about as bulky as an integer multiply.

    EDIT: It would be interesting to know if the ALU's shifter circuits could be directed at this, and therefore GETQZ wouldn't need a separate shifter. It would take longer to fetch for sure. And then all 64 bits are input data, which is not the case for the ALU shifter ops.

  • evanhevanh Posts: 16,015

    Ada's suggestions would have smallest circuit footprint.

  • The shifter could be part of the coprocessor and use the hitherto unused SETQ input to QMUL, which could also enable signed mode with bit 5. That seems like the simplest to implement.

      ' normal, unsigned integer mul
      setq #0
      qmul a,b
      getqx result
      ' unsigned, discard 16 LSBs
      setq #16
      qmul a,b
      getqx result
      ' signed multiply (QX identical to unsigned...)
      setq #32
      qmul a,b
      getqx result
      ' signed, discard 16 LSBs -> 16.16 multiply
      setq #48
      qmul a,b
      getqx result
      ' signed, discard 24 LSBs -> 8.24 multiply
      setq #56
      qmul a,b
      getqx result
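
For anyone wanting to check the arithmetic, here is a Python sketch of that proposed encoding: bit 5 of the SETQ value selects signed mode, bits 4..0 give the right shift. This models a suggestion from this thread, not anything the P2 actually does, and `qmul_model` is a made-up name.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Reinterpret a 32-bit pattern as a two's complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def qmul_model(a, b, q=0):
    """Return the 32 bits a GETQX would fetch under the proposed SETQ scheme.

    q bit 5      : 1 = signed multiply, 0 = unsigned (assumed encoding)
    q bits 4..0  : number of LSBs of the 64-bit product to discard
    """
    signed = bool(q & 0x20)
    shift = q & 0x1F
    if signed:
        product = to_signed32(a) * to_signed32(b)
    else:
        product = (a & MASK32) * (b & MASK32)
    return (product >> shift) & MASK32

print(qmul_model(3, 5, 0))                             # plain integer multiply
print(hex(qmul_model(0x00018000, 0x00024000, 16)))     # unsigned 16.16: 1.5 * 2.25
print(hex(qmul_model(0xFFFE8000, 0x00024000, 48)))     # signed 16.16: -1.5 * 2.25
print(hex(qmul_model(0x00800000, 0xFFC00000, 56)))     # signed 8.24: 0.5 * -0.25
```

With a shift of zero the signed and unsigned results agree, which matches the "QX identical to unsigned" comment in the listing.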
    
  • evanhevanh Posts: 16,015
    edited 2022-01-16 15:46

    Nice. It is a multiply specific requirement. Presumably it's still a barrel shifter though. But at least it is cleanly tucked in the Cordic's single pipeline then.

  • evanhevanh Posts: 16,015

    Ada,
    I understand what you're saying now about 16.16 fixed point multiplies. It had never sunk in what, if anything, is different about fixed-point vs integer before. But it is exactly what you've requested of QMUL: finding the best way to retain word size, which comes down to where the fixed point position is. duh!

    With integers, a 32b x 32b = 64b and the fixed point position stays at bit 0 as the word size increases. Therefore, to retain a 32-bit word size, the least significant 32 bits are kept, [63:0] -> [31:0]. So integers truly are a subset of fixed point. This detail hadn't been obvious to me. I'd seen the similarity before but wasn't sure.

    That obviously leads to 16.16 x 16.16 = 32.32 [63:0]. Then picking 32 bits, for the result word, we want fixed point at bit 16, [63:0] -> [47:16]. This had seemed somewhat arbitrary to me previously.

  • Wuerfel_21Wuerfel_21 Posts: 5,097
    edited 2022-01-17 00:41

    A good way to intuit about fixed point is to treat it a bit like physical units. Imagine you have an integer representing some length in meters. We need more precision, so we go down to centimeters. There are 100 cm in 1 m, so we can simply add two zeroes to our meter value. But what happens if we multiply some lengths? We get an area in cm². And there are 10000 cm² in 1 m², because the unit constant got multiplied alongside the actual values.
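
The same analogy as a few lines of Python, as a sketch only: `fixed_mul` is a hypothetical helper that divides the raw product by the scale once, which is the "cm² back to hundredths of m²" step. With a scale of 2^16 it is the 16.16 case discussed above.

```python
def fixed_mul(a, b, scale):
    """Multiply two scaled integers and return the result in the same scale.

    The raw product carries scale*scale, so one factor of scale is divided out.
    (Non-negative inputs assumed to keep the sketch simple.)
    """
    return (a * b) // scale

# 3 m x 2 m, held in centimetres (scale 100).
a_cm, b_cm = 3 * 100, 2 * 100
print(a_cm * b_cm)                  # raw product in cm^2, scale is now 100*100
print(fixed_mul(a_cm, b_cm, 100))   # 600, i.e. 6.00 m^2 in hundredths

# Same idea with a binary scale: 1.5 * 2.25 in 16.16.
print(hex(fixed_mul(0x18000, 0x24000, 1 << 16)))
```

Dividing by 2^16 is just the right shift by 16, which is why the wanted bits of the 64-bit product are [47:16].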

  • evanhevanh Posts: 16,015
    edited 2022-01-17 00:44

    The realisation for me was pure integers fitting the same formula. I'd not fully associated the two means.

  • evanhevanh Posts: 16,015
    edited 2022-01-18 11:43

    So retaining two's complement encoding is desirable then. It should allow direct use of integer summing instructions on fixed-point numbers. No shuffling of bits. C/Z flags still function the same.

    EDIT: In the case of fetching QMUL results, I think that's one extra instruction - A SUB #1 when negative.

    Working out the sign might be a few instructions though. As well as the signs of the multipliers, bit15 of QY is integral to unsigned circular wrapping, aka carry/borrow.

    EDIT2: Okay, circular msb (bit15 of QY) shouldn't be included in sign logic. Had to mull that one over. And it'll naturally appear set on odd rotations of any overflowing multiplication. That's suitable for unsigned.

    The trick is to not overwrite it with the sign bit when an overflow does happen. Then the msb can serve the dual functions desired .... or not, the C/Z flags aren't going to match with summing this way though ...

    EDIT3: Man, couldn't remember how the Prop2's flags are treated with signed summing. It's not really in the docs. Had to dig up the old testing here - https://forums.parallax.com/discussion/comment/1421452/#Comment_1421452

    For signed instructions

    C always holds the sign of the result, even during overflow.

    7fffffff(reg1) - ffffffff(reg2) = 80000000(result)
    Collected flags = 000000a0
    CMP  reg1,reg2:  C = 1,  Z = 0
    SUB  reg1,reg2:  C = 1,  Z = 0
    CMPS reg1,reg2:  C = 0,  Z = 0
    SUBS reg1,reg2:  C = 0,  Z = 0
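
A Python sketch that reproduces the flags in the test output above. It assumes unsigned SUB/CMP set C on borrow and signed SUBS/CMPS set C to the sign of the mathematically exact result; that assumption is only checked against this one quoted test case, and the function names are invented.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Reinterpret a 32-bit pattern as a two's complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def sub_flags(a, b):
    """Unsigned subtract model: returns (result, C, Z) with C = borrow."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    return result, int(a < b), int(result == 0)

def subs_flags(a, b):
    """Signed subtract model: C = sign of the exact (unwrapped) difference."""
    exact = to_signed32(a) - to_signed32(b)
    result = exact & MASK32
    return result, int(exact < 0), int(result == 0)

print(sub_flags(0x7FFFFFFF, 0xFFFFFFFF))    # result 0x80000000 with C = 1, Z = 0
print(subs_flags(0x7FFFFFFF, 0xFFFFFFFF))   # same result bits, but C = 0, Z = 0
```

In the signed case the wrapped result has bit 31 set, yet C stays 0 because the exact difference (+2^31) is positive.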
    

    EDIT4: Hmm, there's probably no circular use for the C flag. Signed and unsigned use of C can be the same: Straight overflow. The difference then is unsigned is 16.16 and C is overflow, whereas signed is 1.15.16 and C is overflow.

    So C flag use is different to summing. But maybe that isn't important. As long as the result is encoded as two's complement.

    EDIT5: Oh, in that case, when both C and Z are set, it can indicate an underflow. Which means a zero check would be IF_NC_AND_Z

    EDIT6: Or, maybe the C flag should just be exception occurred. And have exception code in the result. Codes for overflow, underflow, infinity, negatives of each, and of course the beloved NaN. Z flag can be simple zero result again.

  • evanhevanh Posts: 16,015
    edited 2022-02-13 11:20

    Chip,
    For a while I've wanted to clarify a particular detail of the cog execution pipeline. In your ASCII art diagram of the pipeline it isn't completely clear at which points cogRAM addressing occurs. Knowing this would clear things up for me.

    Here's the original:

            |                   |                   |                   |                   |                   |                   |
    rdRAM Ib|------+            |           rdRAM Ic|------+            |           rdRAM Id|------+            |           rdRAM Ie|
            |      |            |                   |      |            |                   |      |            |                   |
    latch Da|--+   +--> rdRAM Db|---------> latch Db|--+   +--> rdRAM Dc|---------> latch Dc|--+   +--> rdRAM Dd|---------> latch Dd|
    latch Sa|--+   +--> rdRAM Sb|---------> latch Sb|--+   +--> rdRAM Sc|---------> latch Sc|--+   +--> rdRAM Sd|---------> latch Sd|
    latch Ia|--+   +--> latch Ib|---------> latch Ib|--+   +--> latch Ic|---------> latch Ic|--+   +--> latch Id|---------> latch Id|
            |  |                |                   |  |                |                   |  |                |                   |
            |  +---------------ALU--------> wrRAM Ra|  +---------------ALU--------> wrRAM Rb|  +---------------ALU--------> wrRAM Rc|
            |                   |                   |                   |                   |                   |                   |
            |                   |stall/done = 'gox' |                   |stall/done = 'gox' |                   |stall/done = 'gox' |
            |       'get'       |      done = 'go'  |       'get'       |      done = 'go'  |       'get'       |      done = 'go'  |
    

    So, the question is: is rdRAM the data coming from cogRAM, or is rdRAM the address generation for the next cycle, where the data then comes from cogRAM and is subsequently latched?

    EDIT: Another way to ask: Where does PC address gen fit into the diagram?

  • evanhevanh Posts: 16,015
    edited 2022-02-14 04:40

    I suspect it's the latter, ie:

            |                   |                   |                   |                   |                   |                   |
     addr Ib|------+            |            addr Ic|------+            |            addr Id|------+            |            addr Ie|
            |      |            |                   |      |            |                   |      |            |                   |
    stage Da|--+   +-->  addr Db|---------> stage Db|--+   +-->  addr Dc|---------> stage Dc|--+   +-->  addr Dd|---------> stage Dd|
    stage Sa|--+   +-->  addr Sb|---------> stage Sb|--+   +-->  addr Sc|---------> stage Sc|--+   +-->  addr Sd|---------> stage Sd|
    stage Ia|--+   +--> stage Ib|---------> stage Ib|--+   +--> stage Ic|---------> stage Ic|--+   +--> stage Id|---------> stage Id|
            |  |                |                   |  |                |                   |  |                |                   |
            |  +--> execute --->|mux --+-> forward  |  +--> execute --->|mux --+-> forward  |  +--> execute --->|mux --+-> forward  |
            |                   |      |            |                   |      |            |                   |      |            |
            |                   |      +---> addr Ra|----> write Ra     |      +---> addr Rb|----> write Rb     |      +---> addr Rc|
            |                   |                   |                   |                   |                   |                   |
            |                   |stall/done = 'gox' |                   |stall/done = 'gox' |                   |stall/done = 'gox' |
            |       'get'       |      done = 'go'  |       'get'       |      done = 'go'  |       'get'       |      done = 'go'  |
    

    And matching schedule chart:

    --------|-------------------|-------------------|-------------------|-------------------|-------------------|-------------------|
    Operands|                   |                   |      Result       |                   |                   |                   |
    Fetch(a)|    Execute (a)    |    Execute (a)    |   Writeback (a)   |                   |                   |                   |
    --------|-------------------|-------------------|-------------------|-------------------|-------------------|-------------------|
            |    Instruction    |     Operands      |                   |                   |      Result       |                   |
            |     Fetch (b)     |     Fetch (b)     |    Execute (b)    |    Execute (b)    |   Writeback (b)   |                   |
    --------|-------------------|-------------------|-------------------|-------------------|-------------------|-------------------|
            |                   |                   |    Instruction    |     Operands      |                   |                   |
            |                   |                   |     Fetch (c)     |     Fetch (c)     |    Execute (c)    |    Execute (c)    |
    --------|-------------------|-------------------|-------------------|-------------------|-------------------|-------------------|
            |                   |                   |                   |                   |    Instruction    |     Operands      |
            |                   |                   |                   |                   |     Fetch (d)     |     Fetch (d)     |
    --------|-------------------|-------------------|-------------------|-------------------|-------------------|-------------------|
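
The schedule chart can be restated as a tiny Python model. This is only a reading of the chart above (five clocked steps per instruction, a new instruction every two clocks), not a confirmed description of the silicon; `stage_at` and the clock numbering are made up for the sketch.

```python
# Stage names copied from the chart; an instruction occupies five clocks.
STAGES = ["Instruction Fetch", "Operands Fetch", "Execute", "Execute", "Writeback"]
ISSUE_INTERVAL = 2  # a new instruction starts every two clocks (as drawn)

def stage_at(instr_index, clock):
    """Stage of instruction number instr_index at a given clock, or None.

    Clock 0 is defined here as the instruction fetch of instruction 0.
    """
    offset = clock - instr_index * ISSUE_INTERVAL
    if 0 <= offset < len(STAGES):
        return STAGES[offset]
    return None

# Print which stage instructions a..d are in for clocks 1..7,
# which correspond to the seven columns of the chart.
for clock in range(1, 8):
    row = [stage_at(i, clock) or "-" for i in range(4)]
    print(clock, row)
```

At clock 4 this gives writeback for (a), execute for (b) and instruction fetch for (c), the same overlap the fourth column of the chart shows.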
    
  • evanhevanh Posts: 16,015

    Chip,
    I feel your attention is required. On behalf of ManAtWork, I wrote a program to test hubRAM efficiency with streamer reading and cog doing a block write at the same time. Things went well until the discovery that some streamer rates cause the FIFO to badly hog the hubRAM bus. Notably, it hasn't afflicted byte-wise streamer ops. It's quite inconsistent too. An example is here - https://forums.parallax.com/discussion/comment/1535641/#Comment_1535641

    The program in the post after that uses Flexspin with #include, so you can't just drop it into Pnut.

  • evanhevanh Posts: 16,015
    edited 2022-02-20 12:25

    Here's an example run. The second column is the "divider", third column is the NCO value for that divider, fourth column onwards are twelve repeated tests of that config - without restarting the streamer. They all consist of the measured number of sysclock ticks, from GETCT, required to complete a block write of 64 k longwords into hubRAM using a SETQ + WRLONG. Done while the streamer is running as configured by the divider and mode as per implied RFBYTE then RFWORD then RFLONG.

    [NOTE: Updated data on next page]

    So far I've noted serious excursions with dividers of 3, 5, 6, 7, 9, 10, 11, 17, 25, 33 and 41. Again, only occurs with shortwords and longwords.

    Total smartpins = 64   1111111111111111111111111111111111111111111111111111111111111111
    Rev B silicon.  Sysclock 120.0000 MHz
    
        Block Length is 65536
    
     BYTE   2   80000000    87384    87384    87384    87384    87384    87384    87384    87384    87384    87384    87384    87384**
    SHORT   2   80000000   131072   131072   131072   131072   131072   131072   131072   131072   131072   131072   131072   131072**
    
     BYTE   3   55555556    98312    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304    98304**
    SHORT   3   55555556   589824   589784   589784   589784   589784   589784   589784   589784   589784   589784   589784   589784**
     LONG   3   55555556   589760   589784   589784   589784   589784   589784   589784   589784   589784   589784   589784   589784**
    
     BYTE   4   40000000    76720    76728    76720    76728    76720    76728    76720    76728    76720    76728    76720    76728
    SHORT   4   40000000    87384    87384    87384    87384    87384    87384    87384    87384    87384    87384    87384    87384**
     LONG   4   40000000    87384    87384    87384    87384    87384    87384    87384    87384    87384    87384    87384    87384**
    
     BYTE   5   33333334    81920    81920    81920    81920    81928    81920    81920    81912    81920    81928    81920    81920**
    SHORT   5   33333334    81920    81920    81920    81920    81920    81920    81920    81920    81920    81920    81920    81920**
     LONG   5   33333334   126024   126024   126024   126024   126024   126024   126024   126024   126024   126024   126024   126024**
    
     BYTE   6   2aaaaaaa    74904    74904    74904    74896    74896    74896    74896    74896    74912    74904    74904    74904
    SHORT   6   2aaaaaaa    78640    78648    78640    78640    78648    78640    78648    78640    78640    78648    78640    78648
     LONG   6   2aaaaaaa    78640    78648    78640    78648    78640    78648    78640    78648    78640    78648    78640    78648
    
     BYTE   7   24924924    72432    72432    72440    72440    72432    72432    72432    72440    72432    72432    72432    72440
    SHORT   7   24924924    91760    91744    91760    91752    91744    91752    91744    91752    91744    91752    91744    91752**
     LONG   7   24924924    76464    76456    76464    76456    76464    76456    76464    76456    76464    76456    76464    76456
    
     BYTE   8   20000000    71488    71488    71496    71496    71488    71488    71496    71496    71496    71488    71496    71496
    SHORT   8   20000000    76728    76728    76720    76728    76720    76728    76720    76728    76720    76728    76720    76728
     LONG   8   20000000    76728    76720    76728    76720    76728    76720    76728    76720    76728    76720    76728    76720
    
     BYTE   9   1c71c71c    70776    70776    70776    70776    70776    70784    70784    70776    70776    70776    70776    70776
    SHORT   9   1c71c71c    76928    76936    76936    76936    76928    76936    76936    76936    76928    76936    76936    76936
     LONG   9   1c71c71c   589800   589784   589784   589784   589784   589784   589784   589784   589784   589784   589784   589784**
    
     BYTE  10   1999999a    70216    70216    70224    70224    70216    70216    70216    70224    70216    70216    70216    70224
    SHORT  10   1999999a    81928    81920    81920    81920    81920    81912    81920    81920    81920    81920    81912    81920**
     LONG  10   1999999a    81920    81920    81912    81920    81920    81920    81920    81912    81920    81920    81920    81920**
    
     BYTE  11   1745d174    69768    69768    69760    69760    69760    69768    69768    69768    69760    69760    69760    69768
    SHORT  11   1745d174    90104    90120    90112    90120    90112    90120    90112    90120    90112    90120    90112    90120**
     LONG  11   1745d174    90112    90120    90112    90120    90112    90120    90112    90120    90112    90120    90112    90120**
    
     BYTE  12   15555556    69392    69392    69392    69392    69392    69392    69392    69384    69392    69392    69392    69392
    SHORT  12   15555556    74896    74904    74904    74904    74904    74904    74904    74896    74896    74896    74896    74896
     LONG  12   15555556    74896    74896    74904    74904    74896    74904    74904    74904    74896    74896    74896    74896
    
     BYTE  13   13b13b14    69080    69080    69072    69080    69072    69080    69080    69080    69080    69072    69080    69080
    SHORT  13   13b13b14    73032    73024    73032    73024    73032    73024    73032    73024    73024    73024    73024    73024
     LONG  13   13b13b14    73024    73024    73032    73024    73032    73024    73032    73024    73032    73024    73032    73024
    
     BYTE  14   12492492    68808    68808    68816    68808    68808    68816    68816    68808    68816    68816    68808    68816
    SHORT  14   12492492    72432    72432    72432    72440    72432    72432    72440    72440    72432    72432    72432    72440
     LONG  14   12492492    72440    72440    72432    72432    72440    72440    72432    72432    72432    72440    72432    72432
    
     BYTE  15   11111112    68584    68584    68584    68584    68584    68584    68584    68584    68584    68584    68584    68584
    SHORT  15   11111112    75616    75624    75624    75616    75616    75616    75624    75616    75616    75616    75616    75624
     LONG  15   11111112    75616    75624    75616    75624    75616    75616    75616    75624    75624    75616    75616    75616
    
     BYTE  16   10000000    68384    68384    68392    68384    68384    68384    68384    68384    68384    68392    68384    68384
    SHORT  16   10000000    71496    71488    71488    71496    71496    71496    71488    71496    71496    71496    71488    71496
     LONG  16   10000000    71496    71488    71488    71496    71496    71496    71488    71496    71496    71496    71488    71496
    
     BYTE  17   0f0f0f10    68208    68208    68216    68208    68208    68216    68208    68208    68216    68216    68208    68208
    SHORT  17   0f0f0f10   123784   123792   123784   123776   123776   123776   123776   123776   123776   123776   123776   123776**
     LONG  17   0f0f0f10    71112    71112    71112    71112    71112    71112    71112    71120    71120    71112    71112    71112
    
     BYTE  18   0e38e38e    68056    68056    68056    68056    68056    68056    68056    68056    68056    68056    68056    68056
    SHORT  18   0e38e38e    70776    70784    70784    70776    70776    70776    70776    70776    70784    70784    70776    70776
     LONG  18   0e38e38e    70776    70776    70776    70776    70784    70784    70776    70776    70776    70776    70776    70784
    
     BYTE  19   0d79435e    67920    67920    67920    67920    67912    67920    67920    67920    67920    67920    67920    67920
    SHORT  19   0d79435e    77824    77824    77824    77824    77824    77816    77824    77824    77824    77824    77824    77816
     LONG  19   0d79435e    77824    77824    77824    77824    77824    77824    77824    77832    77824    77816    77824    77824
    
     BYTE  20   0ccccccc    67800    67800    67792    67792    67800    67792    67792    67800    67800    67792    67792    67800
    SHORT  20   0ccccccc    70224    70216    70216    70216    70216    70216    70216    70216    70224    70216    70216    70216
     LONG  20   0ccccccc    70224    70216    70216    70216    70224    70216    70216    70216    70224    70216    70216    70216
    
     BYTE  21   0c30c30c    67688    67688    67688    67688    67688    67688    67688    67688    67680    67680    67680    67680
    SHORT  21   0c30c30c    76472    76456    76456    76456    76456    76456    76472    76456    76464    76456    76456    76456
     LONG  21   0c30c30c    69976    69984    69976    69976    69984    69976    69976    69984    69976    69976    69984    69976
    
     BYTE  22   0ba2e8ba    67584    67584    67584    67584    67584    67584    67592    67584    67584    67584    67584    67584
    SHORT  22   0ba2e8ba    69760    69768    69768    69768    69760    69760    69760    69768    69768    69768    69760    69760
     LONG  22   0ba2e8ba    69760    69768    69768    69768    69760    69760    69760    69760    69768    69768    69768    69760
    
     BYTE  23   0b21642c    67488    67488    67496    67496    67488    67488    67496    67496    67488    67496    67496    67488
    SHORT  23   0b21642c    71776    71784    71776    71776    71776    71784    71784    71776    71784    71776    71776    71776
     LONG  23   0b21642c    71776    71776    71776    71776    71776    71784    71784    71776    71776    71776    71776    71776
    
     BYTE  24   0aaaaaaa    67408    67416    67408    67408    67408    67408    67408    67408    67408    67408    67416    67408
    SHORT  24   0aaaaaaa    69384    69392    69392    69392    69392    69392    69392    69392    69392    69384    69392    69392
     LONG  24   0aaaaaaa    69392    69392    69392    69392    69384    69392    69392    69392    69392    69392    69392    69392
    
     BYTE  25   0a3d70a4    67328    67328    67328    67328    67336    67328    67336    67328    67336    67328    67336    67328
    SHORT  25   0a3d70a4    96368    96376    96376    96376    96384    96376    96376    96376    96376    96376    96376    96376**
     LONG  25   0a3d70a4    96376    96368    96376    96368    96376    96368    96376    96376    96376    96376    96376    96376**
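
For reference, the NCO column can be reproduced from the divider with the Python sketch below: 2^31 divided by the divider, rounded to nearest, then doubled. That this matches every row above is checked against the table; whether the test program computes it this way is an assumption, and `nco_for_divider` is a made-up name.

```python
def nco_for_divider(div):
    """NCO value matching the table: round(2**31 / div) * 2, in integer math.

    Valid for div >= 2; div == 1 would need 2**32, which does not fit in 32 bits.
    """
    return (((1 << 32) + div) // (2 * div)) * 2

# A few rows copied from the table above for comparison.
TABLE_SAMPLES = {
    2: 0x80000000, 3: 0x55555556, 5: 0x33333334, 6: 0x2AAAAAAA,
    7: 0x24924924, 10: 0x1999999A, 19: 0x0D79435E, 25: 0x0A3D70A4,
}

for div, expected in TABLE_SAMPLES.items():
    print(div, format(nco_for_divider(div), "08x"), nco_for_divider(div) == expected)
```

Because of the final doubling the NCO value is always even, so for dividers that are not powers of two the actual rate is slightly off from sysclock/divider.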
    
  • evanhevanh Posts: 16,015
    edited 2022-02-20 12:23

    [removed out-of-date info]

  • jmgjmg Posts: 15,173

    @evanh said:
    Here's an example run. The second column is the "divider", third column is the NCO value for that divider, fourth column onwards are twelve repeated tests of that config - without restarting the streamer. They all consist of the measured number of sysclock ticks, from GETCT, required to complete a block write of 64 k longwords into hubRAM using a SETQ + WRLONG. Done while the streamer is running as configured by the divider and mode as per implied RFBYTE then RFWORD then RFLONG.

    So far I've noted serious excursions with dividers of 3, 5, 6, 7, 9, 10, 11, 17, 25, 33 and 41. Again, only occurs with shortwords and longwords.

    What is a 'serious excursion' - does it drop data, or just sometimes take more cycles than hoped ?
    Running a NCO and a streamer could be tricky, for non-binary numbers. Not sure why bytes would be unaffected, but maybe the 32:8 fifo effect is enough to smooth things ?

    Things went well until the discovery that some streamer rates cause the FIFO to badly hog the hubRAM bus.

    What does 'badly hog' amount to ?
    With hub being MOD 8, and a NCO not being mod 8, I can see some varying waits could be needed, but things should never 100% choke ?

  • TonyB_TonyB_ Posts: 2,191
    edited 2022-02-20 15:39

    @jmg said:

    @evanh said:
    So far I've noted serious excursions with dividers of 3, 5, 6, 7, 9, 10, 11, 17, 25, 33 and 41. Again, only occurs with shortwords and longwords.

    What is a 'serious excursion' - does it drop data, or just sometimes take more cycles than hoped ?
    Running a NCO and a streamer could be tricky, for non-binary numbers. Not sure why bytes would be unaffected, but maybe the 32:8 fifo effect is enough to smooth things ?

    Things went well until the discovery that some streamer rates cause the FIFO to badly hog the hubRAM bus.

    What does 'badly hog' amount to ?
    With hub being MOD 8, and a NCO not being mod 8, I can see some varying waits could be needed, but things should never 100% choke ?

    A couple of bad bus hogging examples for two different streamer frequencies: a fast block write of N longs sometimes takes 1.5N and sometimes 9N, or sometimes 1.17N and sometimes 9N.
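
Those multiples of N can be read straight off tick counts like the ones posted earlier on this page. A minimal Python sketch, using a few values copied from that table (which its author has flagged as superseded by data on the next page); `ticks_per_long` is a made-up helper.

```python
BLOCK_LONGS = 65536  # block length used in the posted test run

def ticks_per_long(ticks, block_longs=BLOCK_LONGS):
    """Sysclock ticks spent per longword written during the block write."""
    return ticks / block_longs

# (mode, divider, measured ticks) samples copied from the earlier table.
samples = [
    ("BYTE", 3, 98304),
    ("SHORT", 3, 589784),
    ("BYTE", 4, 76728),
    ("SHORT", 2, 131072),
]

for mode, div, ticks in samples:
    print(mode, div, round(ticks_per_long(ticks), 2))
```

An uncontended block write would sit at 1.0 ticks per long, so anything near 9 means the cog is getting roughly one hub slot in nine.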
