SPI boot code and new CALLPA/CALLPB instructions - Page 4 — Parallax Forums

SPI boot code and new CALLPA/CALLPB instructions

Comments

  • Cluso99Cluso99 Posts: 18,069
    FYI

    I did autobauding in the 80's on the 6802 and the 68705. Neither had UARTs inbuilt so it was bit banging. These were 4MHz 4 clock instructions. The current P1 & P2V is 80MHz with 4 & 2 clock instructions. So that is 20x & 40x.

    The first autobauding was to an ICL minicomputer running at ~53,000 baud using a "special" serial handshake (incapable of using a uart anyway). Due to drift and not being xtal controlled, I also had to re-sync throughout the character. 53,000 x 40 = ~2M baud.

    I also did the autobauding for modems using the AT command set that I/we built. IIRC they could go to 9600 baud. 9,600 x 40 = 384,000 baud.

    With the AT sequence, we were interfacing to micros and mainframes that were crystal controlled, so drift was not a problem. However, the Apple //c originally came with its UART off-speed by IIRC 2%. As we were building modems which were also branded Apple, we had to work with them too.

    The AT is a special sequence with a start=0 bit, followed by bit0=1, followed by bit1+=0. I won't bother showing how we detected parity, or case, as it's not required for this discussion.

    I did the timing by waiting for the commencement of the start bit, and timing it. By dividing the time/2, I could then sample all 8 bits at approximately the middle. The only check was to ensure that the stop bit was indeed a "1" when sampled.

    Timing was only done on the first "A". However, the calculated timing was used on every other character, only by syncing with the commencement of the start bit going to "0". In other words, the only time the speed could be changed was by a new "AT" command.

    It would also have been possible to time the bit0 "1" bit, but in our case this was unnecessary.

    Now you are referring to 8N1 (8N1+ which included 8N2). We only have to sample 9 bits (being 8+stop).

    Presuming our sample is 100% correct, then we have 9 bits, and we are sampling at the centre of the 9th (stop) bit.

    Now let's assume our bit time is 100. Therefore we sample at +50 (after the end of the start bit), then 8 * +100. So we have counted 850 from the end of the start bit to the middle of the stop bit.

    If our 100 was miscounted, what is the earliest/latest where we will fail?
    Obviously, +-50 in 850 (or 0.5 in 8.5) which gives +-5.88%.
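
The sampling arithmetic above can be checked with a short sketch. The function name and the 100-count bit time are illustrative assumptions, not taken from any posted code.

```python
def sample_offsets(bit_time, n_bits=9):
    """Offsets, counted from the end of the start bit, at which each of
    the 8 data bits plus the stop bit is sampled mid-bit."""
    return [bit_time / 2 + i * bit_time for i in range(n_bits)]

offsets = sample_offsets(100)
last = offsets[-1]                      # middle of the stop bit
# Half a bit of accumulated slip over the whole span is the failure point.
tolerance_pct = 100 * (100 / 2) / last
print(last, round(tolerance_pct, 2))
```

The tolerance is the same for any bit time, since it is just the ratio 0.5 / 8.5.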

    Of course we may be catering for a micro that is bit-banging too, so we need to allow for some error here too.

    So the real question becomes, how accurate can we be at calculating the length of the start bit ???

    Firstly, we can only be as accurate as 1 clock count = ~20MHz = 50ns.

    If we assume this to be 2% then 100% = 2.5us = 250,000 baud

    Note: The "A" or "a" is also a common character used in modems and the wifi ESP8266 board. So there is a precedent to using this. It should also be echoed back to the terminal/pc/micro.

  • jmgjmg Posts: 15,173
    edited 2016-10-07 01:41
    Cluso99 wrote: »
    So the real question becomes, how accurate can we be at calculating the length of the start bit ???

    That question only really applies to a design that tries to measure a single-bit time.

    A better AutoBaud design measures over more than one bit time.
    Cluso99 wrote: »
    Firstly, we can only be as accurate as 1 clock count = ~20MHz = 50ns.
    If we assume this to be 2% then 100% = 2.5us = 250,000 baud
    Correct, if we (for now) exclude NCO.

    However, that 2% step size (which applies to 400k Baud (2.5us), not 250k) can be centered to give +/- 1% steps, with the right care.

    The secret is to preserve as much maths resolution as you can, and multiple-bit timing certainly helps here.

    See my example calcs above, with a measurement sum of 10 bit-times, and an EFM8BB1 being stepped through valid baud rates.




  • Cluso99Cluso99 Posts: 18,069
    Here is another way with more accuracy, using the character "x" ASCII $78...

    Autobaud_p2_x.jpg

    Here, either the start bit + 3* "0" bits = 4 bits could be timed. This could be verified with the following 4* "1" bits.

    Alternately, using the two successive falling edges, the start + 3* "0" + 4* "1" bits could be timed giving an 8 bit time.

    Simple maths:
    4 bit times: (t+t)>>3 (2 instructions includes rounding)
    8 bit times: (t+t)>>4 (2 instructions includes rounding)

    It would then be possible to verify that bit7="0" satisfies the calculation, if considered necessary.

    Firstly, we can only be as accurate as 1 clock count = ~20MHz = ~50ns. If we assume this to be 2% then 100% = 2.5us.
    4 bit times: 2.5us / 4bits = 625ns = 1,600,000 baud (1.6M baud)
    8 bit times: 2.5us / 8bits = 312ns = 3,200,000 baud (3.2M baud)

    And this is with a safety margin of +-3.88%.
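
A rough sketch of the resolution argument, assuming a 50ns clock and a 2% measurement resolution as stated in the post. The function name is hypothetical. Timing more bits spreads the same 2.5us window over more bit times, so the usable baud rate scales with the number of bits timed.

```python
def max_baud(bits_timed, clock_ns=50, resolution_pct=2):
    """Highest baud rate at which one clock tick is still only
    `resolution_pct` of the measured interval."""
    window_ns = clock_ns * 100 // resolution_pct   # 2500 ns for 50 ns at 2%
    return bits_timed * 1_000_000_000 // window_ns

for bits in (1, 4, 8):
    print(bits, max_baud(bits))
```

The quoted +-3.88% margin is the 5.88% sampling tolerance less this 2% measurement resolution.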
  • jmgjmg Posts: 15,173
    Cluso99 wrote: »
    ...
    Alternately, using the two successive falling edges, the start + 3* "0" + 4* "1" bits could be timed giving an 8 bit time.

    Yes, this is a subset of what Chip is already doing ?
    He captures both tRR and tFF, which gives good precision for the bit-time.
    Capture of both, also allows coverage of the reset-exit error case.

  • cgraceycgracey Posts: 14,152
    jmg wrote: »
    cgracey wrote: »

    You can make good use of a Pair of captures
    * Sum to get more x-Axis
    * Check tRR relative to tFF to catch rejects
    * Check ratio, to discriminate between two valid AutoBaud chars.

    I'm not understanding these ideas, yet.

    The idea is to combine the two capture readings, to get the most information.
    eg 0x3f will capture tRR = 8b tFF = 7b, which you can add to get 15b times.
    You can also compare them, and if (tRR < tFF), you ignore & continue to wait. (means RST exit was mid-char)

    To resolve which one of 0x3f or 0x1f was sent, you compare the difference with tRR, and slice ~ 1.5/8b


    I'm favouring 0x3f "?" and 0x1f, as they have higher sums, (tho the last one does go a little against your Kbd Test wish )
    0x3f "?" :  on tRR = 8 tFF = 7 Sum=15b
    ===\_s_/=0=.=1=.=2=.=3=.=4=.=5=\_6_._7_/=P==T=\_s_/=0=.=1=.=2=.=3=.=4=.=5=\_6_._7_/=P==T=\_s_/=0=.=1=.=2=.=3=.=4=.=5=\_6_._7_/=P=
    tRR    |            8                  |   2+T    |            8                  |   2+T    |       8                       |   //
    tFF                     7      |        3+T   |                7           |     3+T     | 1 |                                =\_/= 
        f  r             OK tRR:8b,tFF:7b -^          ^- Err:tRR:2b+T,tFF:3b+T        ^OK        ^Err 
    Margins: +1b,-1b = usable
    
    Need a Second Char, for AutoBaud One-Pin command 
    
    0x1f -> OK tRR:8b,tFF:6b Sum=14b   Err:tRR:2b+T,tFF:4b+T 
    ===\_s_/=0=.=1=.=2=.=3=.=4=\_5_._6_._7_/=P==T=\_s_/=0=.=1=.=2=.=3=.=4=\_5_._6_._7_/=P==T=\_
    Margins: +2b,-2b = usable, CAN pair with 0x3f
    

    also possible are 0x20 " " and 0x40 "@", but they have smaller sums.
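
The tRR/tFF figures quoted for each candidate character can be reproduced with a small sketch. It assumes an 8N1 frame sent LSB first, with the line idle high before the start bit; the function name is hypothetical.

```python
def capture_pair(char, data_bits=8):
    """Return (tRR, tFF) in bit times for one 8N1 frame of `char`.

    tFF: start-bit falling edge to the next falling edge in the frame.
    tRR: first rising edge to the rising edge into the stop bit.
    Only valid for characters with exactly two rising and two falling
    edges per frame (0x3f, 0x1f, 0x20 and 0x40 all qualify).
    """
    # Line levels: idle high, start (0), data LSB first, stop (1).
    levels = [1, 0] + [(char >> i) & 1 for i in range(data_bits)] + [1]
    falls = [i for i in range(1, len(levels))
             if levels[i - 1] == 1 and levels[i] == 0]
    rises = [i for i in range(1, len(levels))
             if levels[i - 1] == 0 and levels[i] == 1]
    if len(falls) != 2 or len(rises) != 2:
        raise ValueError("character does not give two edges of each kind")
    return rises[1] - rises[0], falls[1] - falls[0]

for ch in (0x3F, 0x1F, 0x20, 0x40):
    t_rr, t_ff = capture_pair(ch)
    print(hex(ch), t_rr, t_ff, t_rr + t_ff)
```

This only models the correctly phased capture; the wrong-phase pairs (those including the inter-character gap T) are not generated here.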

    But it seems that what would matter in these readings are empirical ratios that have numbers that fill most of the frame. For example 7 and 8 are viable because the 7:8 relationship is sufficient for knowing what came in, and that relationship must be tested with some precision, not just that one is greater than the other. My preferred value of $20 has a 7:3 relationship, which may not hold as much information, but it's plenty. That 7:3 relationship must be validated.
  • jmgjmg Posts: 15,173
    edited 2016-10-07 19:36
    cgracey wrote: »
    But it seems that what would matter in these readings are empirical ratios that have numbers that fill most of the frame. For example 7 and 8 are viable because the 7:8 relationship is sufficient for knowing what came in, and that relationship must be tested with some precision, not just that one is greater than the other. My preferred value of $20 has a 7:3 relationship, which may not hold as much information, but it's plenty. That 7:3 relationship must be validated.

    For reliable operation, you only really need to sense and reject the 'wrong' phased capture of
    Err:tRR:7b+T,tFF:3b+T
    Otherwise, knowing the character was 0x20 should be enough to make a Baud calculation.

    You can check for some window, but I'm not sure what that gains you ? Maybe some noise immunity ?

    Here, a check that tRR was between tFF/2 and tFF/4 would seem quick and simple ?
    That also nicely rejects the wrong-phase case, that needs covering.

    Then, you sum both tRR & tFF, and calculate Baud

    I've expanded the example table of AutoBaud results in this post
    http://forums.parallax.com/discussion/comment/1389598/#Comment_1389598

    Here is a zoomed sawtooth error check near 1.5MBd
     TB=20M/13.54 = 1477104.87  CT=(10/TB)/(1/20M) = 135.4  P2B = 20M/((2^15+round(CT)*6554) >> 16) = 1428571.428 100*(1-P2B/TB) = 3.285%
     TB=20M/13.53 = 1478196.60  CT=(10/TB)/(1/20M) = 135.3  P2B = 20M/((2^15+round(CT)*6554) >> 16) = 1428571.428 100*(1-P2B/TB) = 3.357%
     TB=20M/13.52 = 1479289.94  CT=(10/TB)/(1/20M) = 135.2  P2B = 20M/((2^15+round(CT)*6554) >> 16) = 1428571.428 100*(1-P2B/TB) = 3.428%
     TB=20M/13.51 = 1480384.90  CT=(10/TB)/(1/20M) = 135.1  P2B = 20M/((2^15+round(CT)*6554) >> 16) = 1428571.428 100*(1-P2B/TB) = 3.5%
     TB=20M/13.50 = 1481481.48  CT=(10/TB)/(1/20M) = 135    P2B = 20M/((2^15+round(CT)*6554) >> 16) = 1428571.428 100*(1-P2B/TB) = 3.57%
     TB=20M/13.49 = 1482579.688 CT=(10/TB)/(1/20M) = 134.9  P2B = 20M/((2^15+round(CT)*6554) >> 16) = 1428571.428 100*(1-P2B/TB) = 3.642%
     TB=20M/13.48 = 1483679.525 CT=(10/TB)/(1/20M) = 134.8  P2B = 20M/((2^15+round(CT)*6554) >> 16) = 1428571.428 100*(1-P2B/TB) = 3.714%
     TB=20M/13.47 = 1484780.994 CT=(10/TB)/(1/20M) = 134.7  P2B = 20M/((2^15+round(CT)*6554) >> 16) = 1428571.428 100*(1-P2B/TB) = 3.785%
     TB=20M/13.46 = 1485884.101 CT=(10/TB)/(1/20M) = 134.6  P2B = 20M/((2^15+round(CT)*6554) >> 16) = 1428571.428 100*(1-P2B/TB) = 3.857%
     TB=20M/13.45 = 1486988.847 CT=(10/TB)/(1/20M) = 134.5  P2B = 20M/((2^15+round(CT)*6554) >> 16) = 1538461.538 100*(1-P2B/TB) = -3.461% not quite central
    
    The ideal sawtooth is equally peaked +/-, this is slightly off as the 10b sample has the Maths jitter/LSB effects, and the fast divide of ((2^15+round(CT)*6554) >> 16) contributes another small error.
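
A minimal sketch of the table's calculation, assuming a 20MHz clock and a count taken over 10 bit times. The function names are hypothetical, and the count is rounded to a whole number of clocks before the divide.

```python
SYSCLK_HZ = 20_000_000  # FPGA boot clock assumed in the table

def fast_div10(count):
    """Rounded count/10 by multiply and shift: 6554/65536 is close to 1/10."""
    return (2**15 + count * 6554) >> 16

def autobaud_error_pct(true_baud, sysclk_hz=SYSCLK_HZ):
    """Percent error of the integer-divisor baud rate against the true rate."""
    ct = round(10 * sysclk_hz / true_baud)   # clocks counted over 10 bit times
    p2_baud = sysclk_hz / fast_div10(ct)
    return 100 * (1 - p2_baud / true_baud)

print(fast_div10(135), fast_div10(134))
print(autobaud_error_pct(1_600_000))
```

Counts that land exactly on a half (such as 134.5 in the last table row) depend on the rounding rule in use, which is why the sketch works from whole clock counts.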

    Testing at 2MBd is a good first-pass check, but I would also include numbers like 1.6MBd, which should be possible on a FT232H/FT2232H, and that has an AutoBAUD error of 3.846% to a FPGA 20MHz SysCLK.

    Do you have a Character echo on the first AutoBAUD (eg 0x7e "~") ?
    That's a good way to confirm AutoBAUD was ok, and automated systems can make use of that to speed the download.

    Retrim AutoBauds do not have to echo every time, as a 100% echo could have issues

    Add: Some Pseudo Code to cover enough tests :2
    Pseudo code for AutoBaud, and One-Pin / Two-Pin selection on Autobaud, so Echo can go out on correct pin.
    IF tRR > ((tFF*19) >> 5) THEN   // 19/32 of tFF = 4.15625b when tFF = 7b - Reverse Sync trap
      RETURN
    ELSIF tRR <= ((tFF*6) >> 5) THEN // 6/32 of tFF = 1.3125b when tFF = 7b - too skewed trap
      RETURN
    ELSIF tRR > ((tFF*21) >> 6) THEN // 21/64 = ~32.8% of tFF, between " " 3:7 and "@" 2:8
      AutoCal(10,TwoPin)
      Echo("~")
    ELSE  
      AutoCal(10,OnePin)
      Echo("~")
    ENDIF
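
The pseudo code above translates fairly directly into a runnable sketch. The function name and return values are hypothetical, and the AutoCal and Echo steps are left out; only the ratio tests are shown.

```python
def classify_autobaud(t_rr, t_ff):
    """Return 'two_pin' for a 3:7 (" ") capture, 'one_pin' for a 2:8 ("@")
    capture, or None when the pair should be ignored."""
    if t_rr > ((t_ff * 19) >> 5):    # reverse-sync trap (wrong-phase capture)
        return None
    if t_rr <= ((t_ff * 6) >> 5):    # too-skewed trap
        return None
    if t_rr > ((t_ff * 21) >> 6):    # closer to 3:7 than to 2:8
        return "two_pin"
    return "one_pin"

print(classify_autobaud(300, 700))   # " " at 100 clocks per bit
print(classify_autobaud(200, 800))   # "@" at 100 clocks per bit
print(classify_autobaud(700, 300))   # wrong-phase " " capture with T = 0
```

All thresholds are multiply-and-shift operations, so the test needs no divide.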
    

  • Cluso99Cluso99 Posts: 18,069
    edited 2016-10-07 07:07
    A value of "x" is simple mathematics. Dividing by 3 or 7 is not so easy when time is of the essence in fast transmissions.

    Two instructions give rounding (ie add itself to itself) then dividing (ie shift right by 3 or 4) to get bit time or an extra shift to get 1/2 bit time.
  • Out of curiosity, what is it that you all are trying to solve right now? Maximizing the autobaud rate during boot? And, if so, how critical is this to have? Reliably autobauding up to 1Mbps right now already seems good enough.
  • Why is autobauding so important to P2 serial boot?
  • SeairthSeairth Posts: 2,474
    edited 2016-10-07 15:35
    User Name wrote: »
    Why is autobauding so important to P2 serial boot?

    Autobauding is necessary because the boot ROM cannot know the actual clock speed. By extracting the "clock" from the RX pin, the boot ROM can reliably set its baud rate regardless of the precision of the internal oscillator. Put another way, the boot ROM isn't really figuring out baud rate (since the chip doesn't exactly know how many clock cycles one second is), but instead figuring out the serial_clock-to-internal_clock ratio. The P2 doesn't know that it's at 1Mbps or 4800bps, just that 1 serial_clock equals X internal_clocks. And the X is going to differ for every chip due to the precision of the internal oscillator.

    Edit: actually, autobaud is only necessary for higher baud rates. My guess is that the internal clock will be precise enough to be able to do a low, fixed baud rate (e.g. 9600bps), as long as the stop condition was stretched to ensure that the next start bit wasn't too early.
  • Seairth wrote: »
    Out of curiosity, what is it that you all are trying to solve right now? Maximizing the autobaud rate during boot? And, if so, how critical is this to have? Reliably autobauding up to 1Mbps right now already seems good enough.

    Seconded.

  • jmgjmg Posts: 15,173
    edited 2016-10-07 19:36
    Seairth wrote: »
    Edit: actually, autobaud is only necessary for higher baud rates. My guess is that the internal clock will be precise enough to be able to do a low, fixed baud rate (e.g. 9600bps), as long as the stop condition was stretched to ensure that the next start bit wasn't too early.

    Not quite, the P2 Boot osc is only good to 30%, so ALL baud speeds will require Autobaud.

    Seairth wrote: »
    Out of curiosity, what is it that you all are trying to solve right now? Maximizing the autobaud rate during boot? And, if so, how critical is this to have? Reliably autobauding up to 1Mbps right now already seems good enough.
    Reliable Autobaud up to 1MBd/2MBd is not there 'right now', but it is being worked on :)

    Current release boot loader is Spec'd to 115200, with a ceiling maybe 2x that.
    115200 is frankly glacial, and constraining. That is over 60 seconds to send a P2 image.

    Sure, in some cases, you might be able to change gears and use dual stage loaders, but that adds a whole lot more management, and assumes you can change gears.
    Much cleaner to have a faster, more practical upper AutoBaud ceiling.
    The numbers above suggest > 1.5Mbd, maybe 2Mbd, will be practical (even without NCO step).

    There are many use cases for P2, where a simple single baud rate will be desirable/possible.
    BLE and WiFi links can go way faster than 115200.

    Then, there are other use cases where you want the fastest possible boot times, including reset-exit delay effects.

  • jmgjmg Posts: 15,173
    cgracey wrote: »
    Wow! These are great ideas. And, yes, the captures are cyclical.

    We COULD just do an initial auto-baud and then check the sample pair in each serial interrupt, as the last thing to happen was the stop bit, making a fresh rise-to-rise measurement. We are already 1/2 into the stop bit when the serial ISR is triggered, so we have less than half a bit period to grab the last fall-to-fall sample before a new fall-to-fall is registered, in case a (low) start bit is next. ..

    Thinking a little more about this specific timing detail, here is a way to buy some extra time here:

    INT_tRR: // Interrupt on Rising Capture case - VERY short. Does not INT RX INT.
    tFF = Read(FallValue)
    RETI // This is actually needed only at start of Stop bit, other cases are tolerated.

    INT_RX:
    tRR = Read(RiseValue) // very first thing @ RX mid-stop
    ... non critical code..


    ie at the last rising edge/Stop bit, read and store the waiting tFF value,
    This now means a tFF capture at Start-Bit fall, is not such a problem.
    tRR cannot occur before the end of Start Bit, so it can read on INT_RX, and still have 1.5 bit times of margin, with 1 Stop bit.
    ie this triples the timing margin.
    (2 stop bits of course always gives even more time)

  • cgraceycgracey Posts: 14,152
    edited 2016-10-08 10:00
    In order to tighten up smart pin timing, I've given each smart pin a 32-bit parallel RDPIN output. This means RDPIN will take only two clocks, like most other instructions, and always return a long (no more byte and word possibilities). This got rid of all the pin-to-cog message mechanisms, but added more routing. It is so much simpler now.

    This is going to make everything a lot quicker. This should help full-speed USB at 80MHz quite a bit.

    It's actually much simpler to make the RDPIN paths 32 bits wide, as opposed to the WRPIN/WXPIN/WYPIN paths, as they must be OR'd together from all cogs before heading out to the pins.

    I'm looking forward to seeing what this does for auto-baud. I also added 4 fractional bits to the serial baud generator, so that we can get the extra resolution we need above 500k baud. I had it auto-bauding and loading at 1.5M baud earlier today. It should be solid at 2MHz after this. One problem, though, is that the SHA-256 computation can only run at about 100K bytes per second at 20MHz. We'll deal with that later.
  • dMajodMajo Posts: 855
    edited 2016-10-08 10:20
    @cgracey
    Chip, just out of curiosity/ignorance, what's the difference between PIC's oscillators and P2's one?
    Most of the PICs reach 32MHz, the latest parts even 64MHz. Can't you use such type of oscillators in the Prop?
    As I have understood the P2 will use the same internal oscillator that was used in the P1 10 years ago.

    Clocking Structure
    • Precision Internal Oscillator:
    - ±1% at calibration
    - Selectable frequency range 32 MHz to 31 kHz
    • 31 kHz Low-Power Internal Oscillator
    • 4x Phase-Locked Loop (PLL) for up to 32 MHz Internal Operation
    • External Oscillator Block with Three External
    Clock modes up to 32 MHz
  • cgraceycgracey Posts: 14,152
    edited 2016-10-08 14:08
    dMajo wrote: »
    @cgracey
    Chip, just out of curiosity/ignorance, what's the difference between PIC's oscillators and P2's one?
    Most of the PICs reach 32MHz, the latest parts even 64MHz. Can't you use such type of oscillators in the Prop?
    As I have understood the P2 will use the same internal oscillator that was used in the P1 10 years ago.

    Clocking Structure
    • Precision Internal Oscillator:
    - ±1% at calibration
    - Selectable frequency range 32 MHz to 31 kHz
    • 31 kHz Low-Power Internal Oscillator
    • 4x Phase-Locked Loop (PLL) for up to 32 MHz Internal Operation
    • External Oscillator Block with Three External
    Clock modes up to 32 MHz

    It's all just a matter of how much power you want to burn. I figured 20MHz for boot was still sufficient. We could make it faster, but it would be a silicon change, at this point.
  • cgracey wrote: »
    dMajo wrote: »
    @cgracey
    Chip, just out of curiosity/ignorance, what's the difference between PIC's oscillators and P2's one?
    Most of the PICs reach 32MHz, the latest parts even 64MHz. Can't you use such type of oscillators in the Prop?
    As I have understood the P2 will use the same internal oscillator that was used in the P1 10 years ago.

    Clocking Structure
    • Precision Internal Oscillator:
    - ±1% at calibration
    - Selectable frequency range 32 MHz to 31 kHz
    • 31 kHz Low-Power Internal Oscillator
    • 4x Phase-Locked Loop (PLL) for up to 32 MHz Internal Operation
    • External Oscillator Block with Three External
    Clock modes up to 32 MHz

    It's all just a matter of how much power you want to burn. I figured 20MHz for boot was still sufficient. We could make it faster, but it would be a silicon change, at this point.
    Speaking of silicon, what happened with the shuttle run? Did you get the test chips back yet?

  • jmgjmg Posts: 15,173
    edited 2016-10-08 20:15
    cgracey wrote: »
    In order to tighten up smart pin timing, I've given each smart pin a 32-bit parallel RDPIN output. This means RDPIN will take only two clocks, like most other instructions, and always return a long (no more byte and word possibilities). This got rid of all the pin-to-cog message mechanisms, but added more routing. It is so much simpler now.

    This is going to make everything a lot quicker. This should help full-speed USB at 80MHz quite a bit.

    It's actually much simpler to make the RDPIN paths 32 bits wide, as opposed to the WRPIN/WXPIN/WYPIN paths, as they must be OR'd together from all cogs before heading out to the pins.
    Wow, quite the boost ! I figured you were knocking off a few corners, but that's a big change.
    cgracey wrote: »
    I'm looking forward to seeing what this does for auto-baud. I also added 4 fractional bits to the serial baud generator, so that we can get the extra resolution we need above 500k baud. I had it auto-bauding and loading at 1.5M baud earlier today. It should be solid at 2MHz after this.
    Non fractional AutoBaud should be ok to somewhere between 1~1.5Mbd, (see my rounding equations above) with sparse solutions above 1.5MBd.
    eg 2MHz should work, because with a 20MHz FPGA clock, and 48MHz USB clocks, that happens to hit a sweetspot.
    1.6MBd is a tougher test for 20MHz & 48MHz

    Fractional AutoBaud should get maybe to the 3MBd that is widespread in USB-UARTS.
    cgracey wrote: »
    I also added 4 fractional bits to the serial baud generator, so that we can get the extra resolution ..
    Are those extra bits base 8, base 10 or base 16 ?
  • jmgjmg Posts: 15,173
    David Betz wrote: »
    Speaking of silicon, what happened with the shuttle run? Did you get the test chips back yet?

    Those should have good numbers on the actual 20MHz oscillator, and especially how it varies with Vcc and Temperature.
    cgracey wrote: »
    It's all just a matter of how much power you want to burn. I figured 20MHz for boot was still sufficient. We could make it faster, but it would be a silicon change, at this point.

    I'd measure the test wafers first, to see what you actually get, and what the frequency curves look like.
    Another number to consider would be ~24MHz, as that's related to the USB clock everyone uses these days.
    With fractional Autobaud, you can position samples to within 50ns, and so might AutoBaud up toward 3MBd
  • cgracey wrote: »
    In order to tighten up smart pin timing, I've given each smart pin a 32-bit parallel RDPIN output. This means RDPIN will take only two clocks, like most other instructions, and always return a long (no more byte and word possibilities). This got rid of all the pin-to-cog message mechanisms, but added more routing. It is so much simpler now.

    This is going to make everything a lot quicker. This should help full-speed USB at 80MHz quite a bit.

    It's actually much simpler to make the RDPIN paths 32 bits wide, as opposed to the WRPIN/WXPIN/WYPIN paths, as they must be OR'd together from all cogs before heading out to the pins.
    Neat! I have been thinking about taking another stab at full-speed at 80MHz -- this makes it definite :smile:
  • cgraceycgracey Posts: 14,152
    edited 2016-10-09 10:05
    Big progress...

    I've got 32-bit data paths for all RDPIN/WRPIN/WXPIN/WYPIN instructions, so they all take only 2 clocks now.

    I improved the asynchronous serial mode of the smart pin by getting rid of NCO baud mode and making it use lower bits as fractional bit counts if the top bits are clear:
    Async serial configuration via WXPIN
    
    b=clocks/bit, n=bits/word-1:
    
    %bbbb_bbbb_bbbb_bbbb_xxxx_xxxx_xxx_nnnnn	= normal 16-bit baud
    %0000_00bb_bbbb_bbbb_bbbb_bbxx_xxx_nnnnn	= 10-bit baud with 6-bit fractional
    
    For example, at 20MHz:
    
    $208D0007 = 2400 baud, 8 bits (8333 clocks/bit)
    $000D5407 = 1.5M baud, 8 bits (13.33 clocks/bit)
    
    Just think of it as a 22-bit (16.6) bit count. The smart pin goes into fractional
    mode if the top six bits are clear. The programmer just does his calculation and
    the smart pin makes the best use of it.
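
A sketch of how the 16.6 bit count might be packed, inferred only from the two example values above ($208D0007 and $000D5407). The function name is hypothetical, and this is not a statement of the final smart pin behaviour.

```python
def wxpin_async_config(sysclk_hz, baud, bits=8):
    """Pack clocks/bit as 16.6 fixed point above a 5-bit (bits - 1) field."""
    fixed = round(sysclk_hz / baud * 64)       # 16.6 fixed-point clocks per bit
    if not 0 < fixed < (1 << 22):
        raise ValueError("clocks/bit does not fit the 22-bit field")
    # The 22-bit count sits in bits 31..10; bits 9..5 are unused (x).
    return (fixed << 10) | ((bits - 1) & 0x1F)

print(hex(wxpin_async_config(20_000_000, 1_500_000)))
print(hex(wxpin_async_config(20_000_000, 2400)))
```

For 2400 baud the sketch also fills in fraction bits that the post's $208D0007 leaves at zero; those bits are marked as don't-care in normal 16-bit mode, so the top word and the bit-count field are what should match.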
    

    Now the ROM booter can handle serial at 1.5M baud (with 2 stop bits). This means 1M baud will be safe for a slow/hot RC clock that is running at only 13MHz, instead of the expected 20MHz:
    '
    '
    ' Initial autobaud ISR
    '
    ' $20 -> 10000001001 -> fall-to-fall = 7 bits, then rise-to-rise = 3 bits
    '
    autobaud_isr	rdpin	buf1,#rx_ms1		'get old fall-to-fall time	(7x if $20)
    		mul	buf1,norm1		'make baud rate
    		setbyte	buf1,#7,#0		'set 8 bits
    		wxpin	buf1,#rx_rcv		'set baud rate as early as possible, even if it's wrong
    		dirh	#rx_rcv			'enable serial receiver in case $20 received (10 clocks, so far)
    
    		rdpin	buf0,#rx_ms0		'get new rise-to-rise time	(3x if $20)
    
    		akpin	#rx_ms0			'acknowledge rise-to-rise measurement (ISR trigger)
    
    		mul	buf0,norm0		'normalize rise-to-rise time for comparison
    
    		sub	buf0,buf1		'subtract one from the other
    		abs	buf0			'get absolute difference
    		topone	buf2,buf1		'get magnitude of one original
    		sub	buf2,#4			'subtract 4 for 1/16th test
    		shr	buf0,buf2	wz	'shift down difference by magnitude minus 4, z=1 if $20
    
    	if_nz	dirl	#rx_rcv			'if not $20, disable serial receiver
    
    	if_z	setint1	#0			'if $20, disable int1
    	if_z	nixint1				'if $20, nix any pending int1
    
    		reti1				'exit (4 entry + 32 body + 4 exit clocks)
    
    
    norm1		long	3 * $1_0000 / 21	'normalization constants get clock count into top word
    norm0		long	7 * $1_0000 / 21
    '
    '
    ' Serial receiver plus autobaud maintenance ISR
    '
    receive_isr	rdpin	buf1,#rx_ms1		'get old fall-to-fall time	(7x if $20)
    		rdpin	buf0,#rx_ms0		'get new rise-to-rise time	(3x if $20)
    
    		mul	buf1,norm1		'normalize both samples to 21x for comparison
    		mul	buf0,norm0
    
    		sub	buf0,buf1		'subtract one from the other
    		abs	buf0			'get absolute difference
    		topone	buf2,buf1		'get magnitude of one original
    		sub	buf2,#4			'subtract 4 for 1/16th test
    		shr	buf0,buf2	wz	'shift down difference by magnitude minus 4, z=1 if $20
    
    	if_z	setword	buf1,#7,#0		'if $20, set 8 bits
    	if_z	wxpin	buf1,#rx_rcv		'if $20, update serial receiver baud rate
    	if_z	mov	baud,buf1		'if $20, save baud rate for transmit
    
    		akpin	#rx_rcv			'acknowledge rx byte
    		rdpin	rxbyte,#rx_rcv		'get rx byte
    
    		wrlut	rxbyte,head		'write byte to circular buffer in lut
    		incmod	head,#lut_btop		'increment buffer head
    
    		reti2				'exit (4 entry + 32 body + 4 exit clocks)
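
The normalise-and-compare step in both ISRs can be modelled outside the chip. This is a sketch under assumptions: Python integers stand in for the 32-bit registers, `bit_length() - 1` stands in for TOPONE, and the function name is hypothetical.

```python
NORM1 = 3 * 0x1_0000 // 21   # scales the 7-bit-time fall-to-fall count
NORM0 = 7 * 0x1_0000 // 21   # scales the 3-bit-time rise-to-rise count

def looks_like_0x20(t_ff, t_rr):
    """Return (is_0x20, buf1), where buf1 holds clocks/bit in its top word."""
    buf1 = t_ff * NORM1                  # about (t_ff / 7) << 16
    buf0 = t_rr * NORM0                  # about (t_rr / 3) << 16
    if buf1 == 0:
        return False, 0
    diff = abs(buf0 - buf1)
    shift = max((buf1.bit_length() - 1) - 4, 0)   # TOPONE minus 4
    # A zero result means the two estimates agree to within roughly 1/16.
    return (diff >> shift) == 0, buf1

ok, word = looks_like_0x20(700, 300)     # $20 at 100 clocks per bit
print(ok, (word >> 10) / 64)
print(looks_like_0x20(800, 200)[0])      # "@" gives an 8:2 pair
```

Both counts normalise to the same value only when they are in a 7:3 ratio, which is why a single subtract and shift is enough to accept $20 and reject other characters.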
    

    Now that smart pins are as fast to access as registers, code can be much higher performance.

    Getting rid of all the messaging circuitry simplified things greatly. The net increase in logic usage, due to wider mux/demux circuits was hardly noticeable. I'm going to do a full compile over night to get a better idea of this.
  • evanhevanh Posts: 15,915
    edited 2016-10-09 10:39
    Wow! Is that not the old 8x32 I/O ring bus, now 16x32, that was so real estate hungry? The one that, once removed, allowed the 128kB to 256kB increase in HubRAM size and lots of nice extras in the P2Hot?
  • Cluso99Cluso99 Posts: 18,069
    Sounds like a huge improvement!
    WTG Chip :)
  • *fingers crossed* If this takes more room on the FPGA, but will still fit the ASIC, so be it. That kind of I/O performance is exciting!
  • jmgjmg Posts: 15,173
    cgracey wrote: »
    Big progress...

    I've got 32-bit data paths for all RDPIN/WRPIN/WXPIN/WYPIN instructions, so they all take only 2 clocks now.

    I improved the asynchronous serial mode of the smart pin by getting rid of NCO baud mode and making it use lower bits as fractional bit counts if the top bits are clear:
    Sounding great :)
    cgracey wrote: »
    Just think of it as a 22-bit (16.6) bit count. The smart pin goes into fractional
    mode if the top six bits are clear. The programmer just does his calculation and
    the smart pin makes the best use of it.

    Can you expand on how the pin makes 'best use' of the 6 bit fraction ?
    There are only 8 data bits (HW up to 32*) and a start and stop bit sample to manage.
    For the 8-b UART case, a decimal (4b) rate multiplier and a 10bit AutoBAUD sample** would seem the best fit ?
    or possibly a 3b Rate Multiplier, and maybe some decision to do a +0/+1 on the start bit ?

    * ROM does not need to manage 32b, but the hardware should be checked for this.
    for > 8, does the fractional modulus vary with the selected bit length (it could change at 12b and 24b thresholds?)

    ** To get 10b, I think you can add the tRR(7) and tFF(3) samples.

  • cgraceycgracey Posts: 14,152
    edited 2016-10-09 21:16
    evanh wrote: »
    Wow! Is that not the old 8x32 I/O ring bus, now 16x32, that was so real estate hungry? The one that, once removed, allowed the 128kB to 256kB increase in HubRAM size and lots of nice extras in the P2Hot?

    It is more wiring, but the EDA tools somehow deal with this better than we did in our custom layout.

    I need to go check that compile to see how things went.

    One big upside to this new scheme is that all of the flops have gated clocks now, which will absolutely save power in the ASIC.
  • cgraceycgracey Posts: 14,152
    edited 2016-10-09 21:18
    Well, it seems that all this wide-data-path stuff only increased the Prop123-A9 16-cog/18-smart-pin logic utilization from 89% to 91%, which is surprisingly low. These changes got rid of 68, somewhat complicated, state machines: a sender and receiver in each smart pin and a sender and receiver in each cog. The hub grew to accommodate the 32-bit mux/demux circuits, while those old cog and smart-pin state machines each became 32-bit flop arrays. This is a huge net improvement.
  • jmgjmg Posts: 15,173
    cgracey wrote: »
    '
    '
    ' Initial autobaud ISR
    ...
    		sub	buf0,buf1		'subtract one from the other
    ...
    
    ' Serial receiver plus autobaud maintenance ISR
    '
    receive_isr	rdpin	buf1,#rx_ms1		'get old fall-to-fall time	(7x if $20)
    		rdpin	buf0,#rx_ms0		'get new rise-to-rise time	(3x if $20)
    
    		mul	buf1,norm1		'normalize both samples to 21x for comparison
    		mul	buf0,norm0
    
    		sub	buf0,buf1		'subtract one from the other
    
    This sub step puzzles me, as usually you'd want the highest number / most number of possible bits to keep the maths LSB as small as possible ?
    In my code, I added the 7+3 to get 10b count, then use a fast ~ rounded/10 (60ppm error)
    P2C/((2^15+round(CT)*6554) >> 16)


  • jmgjmg Posts: 15,173
    cgracey wrote: »
    Well, it seems that all this wide-data-path stuff only increased the Prop123-A9 16-cog/18-smart-pin logic utilization from 89% to 91%, which is surprisingly low. These changes got rid of 68, somewhat complicated, state machines: a sender and receiver in each smart pin and a sender and receiver in each cog. The hub grew to accommodate the 32-bit mux/demux circuits, while those old cog and smart-pin state machines each became 32-bit flop arrays. This is a huge net improvement.
    Sounds like good progress... With luck, the ASIC tools will pack those 32-flops very well, better than random-logic.
  • jmgjmg Posts: 15,173
    edited 2016-10-09 22:51
    cgracey wrote: »
    I improved the asynchronous serial mode of the smart pin by getting rid of NCO baud mode and making it use lower bits as fractional bit counts if the top bits are clear:
    Async serial configuration via WXPIN
    
    b=clocks/bit, n=bits/word-1:
    
    %bbbb_bbbb_bbbb_bbbb_xxxx_xxxx_xxx_nnnnn	= normal 16-bit baud
    %0000_00bb_bbbb_bbbb_bbbb_bbxx_xxx_nnnnn	= 10-bit baud with 6-bit fractional
    
    For example, at 20MHz:
    
    $208D0007 = 2400 baud, 8 bits (8333 clocks/bit)
    $000D5407 = 1.5M baud, 8 bits (13.33 clocks/bit)
    
    Just think of it as a 22-bit (16.6) bit count. The smart pin goes into fractional
    mode if the top six bits are clear. The programmer just does his calculation and
    the smart pin makes the best use of it.
    

    Now the ROM booter can handle serial at 1.5M baud (with 2 stop bits).
    With fractional Baud, you should be able to go well above 1.5MBd.
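For anyone wanting to check the WXPIN values quoted above, here is a rough Python sketch of the 16.6 encoding as I read it from that description. The function names are hypothetical, and zeroing the fraction bits in normal mode is my own choice (the quoted layout marks them as don't-care), made so the output matches the quoted examples.

```python
def encode_async_wxpin(clock_hz: float, baud: float, bits: int = 8) -> int:
    """Build a WXPIN value: 22-bit (16.6) clocks/bit in bits 31..10, bits-1 in bits 4..0."""
    if not 1 <= bits <= 32:
        raise ValueError("bits per word must be 1..32 (5-bit field holds bits-1)")
    count_16_6 = round(clock_hz * 64 / baud)  # clocks per bit, scaled by 2**6
    if not 0 < count_16_6 < (1 << 22):
        raise ValueError("clocks per bit does not fit the 16.6 field")
    if count_16_6 >> 16:
        # Top six bits are not clear, so this is normal 16-bit baud mode.
        # The fraction bits are don't-care there; clear them (assumption).
        count_16_6 &= ~0x3F
    return (count_16_6 << 10) | (bits - 1)


def decode_async_wxpin(value: int) -> tuple[float, int]:
    """Return (clocks per bit, bits per word) for a WXPIN value."""
    bits = (value & 0x1F) + 1
    if value >> 26:
        clocks_per_bit = float(value >> 16)             # normal 16-bit baud
    else:
        clocks_per_bit = ((value >> 10) & 0xFFFF) / 64  # 10-bit baud, 6-bit fraction
    return clocks_per_bit, bits
```

With a 20MHz clock, `encode_async_wxpin(20_000_000, 2400)` gives $208D0007 and `encode_async_wxpin(20_000_000, 1_500_000)` gives $000D5407, the same as the two examples in the quote.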

    I get these results from a numeric test sweep. Here I applied a 3-bit fractional baud, because with 8 data bits I can see only 8-10 opportunities per character to apply a 1-clock correction
    (i.e. I'm not sure what a 6-bit fraction can do with 8-bit data).
    These assume a summed sample spanning 10 bit times, for time-axis resolution. All this precision now matters, as the fractional baud (here F/8) has moved the weak link to the baud timing.
    Luckily we have smart pins that can do a very good job here :)
    With the rounding centred and the remainder reduced to 8 steps, the error becomes a balanced sawtooth of just 0.392% peak at 1.5MBd:
     TB=10*P2C/126 = 1587301.587 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 1584158.41    100*(1-P2B/TB)  = 0.198%
     TB=10*P2C/127 = 1574803.149 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 1568627.450   100*(1-P2B/TB)  = 0.392%
     TB=10*P2C/128 = 1562500     P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 1568627.450   100*(1-P2B/TB)  = -0.392%
     TB=10*P2C/129 = 1550387.596 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 1553398.058   100*(1-P2B/TB)  = -0.194%
     TB=10*P2C/130 = 1538461.538 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 1538461.538   100*(1-P2B/TB)  = 0%
     TB=10*P2C/131 = 1526717.557 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 1523809.523   100*(1-P2B/TB)  = 0.190%
     TB=10*P2C/132 = 1515151.515 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 1509433.962   100*(1-P2B/TB)  = 0.377%
    
    Assuming the 3-bit fraction is applied correctly, this gives an autobaud quantization error of 0.392% max.
    The capture resolution of a 10-bit-time sample at 1.5MBd is around 0.8%, so the total is ~1.2%.
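The sweep rows above can be reproduced with a few lines of Python. This is only a sketch of the same arithmetic: `P2C` is assumed to be 20MHz (which is what the TB column implies), and `fractional_baud` and `sweep_row` are illustrative names, not anything from the smart pin or the ROM code.

```python
P2C = 20_000_000  # assumed system clock in Hz; matches the TB column above


def fractional_baud(ct: int, p2c: int = P2C) -> float:
    """Baud the smart pin would run at, given ct clocks measured over 10 bit times.

    (2**12 + ct*6554) >> 13 is roughly round(ct * 8 / 10): the bit period
    in 1/8-clock steps, i.e. a 3-bit fraction.
    """
    eighths = (2**12 + ct * 6554) >> 13
    return p2c / (eighths / 8)


def sweep_row(ct: int, p2c: int = P2C) -> tuple[float, float, float]:
    """Return (true baud TB, quantized baud P2B, error in percent) for one count."""
    tb = 10 * p2c / ct
    p2b = fractional_baud(ct, p2c)
    return tb, p2b, 100 * (1 - p2b / tb)


if __name__ == "__main__":
    for ct in range(126, 133):
        tb, p2b, err = sweep_row(ct)
        print(f"CT={ct}  TB={tb:.3f}  P2B={p2b:.3f}  err={err:+.3f}%")
```

Changing the range to 61..68 or 48..54 covers the 3MBd and 4MBd ballparks in the same way.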

    And here are some 3MBd sweeps:
     TB=10*P2C/61  = 3278688.524 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 3265306.122   100*(1-P2B/TB)  = 0.408%
     TB=10*P2C/62  = 3225806.451 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 3200000       100*(1-P2B/TB)  = 0.8%
     TB=10*P2C/63  = 3174603.174 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 3200000       100*(1-P2B/TB)  = -0.8%
     TB=10*P2C/64  = 3125000     P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 3137254.901   100*(1-P2B/TB)  = -0.392%
     TB=10*P2C/65  = 3076923.076 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 3076923.076   100*(1-P2B/TB)  = 0%
     TB=10*P2C/66  = 3030303.030 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 3018867.924   100*(1-P2B/TB)  = 0.377%
     TB=10*P2C/67  = 2985074.626 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 2962962.962   100*(1-P2B/TB)  = 0.740%
     TB=10*P2C/68  = 2941176.470 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 2962962.962   100*(1-P2B/TB)  = -0.740%
    

    In this 3MBd ballpark, the timebase resolution of a 10-bit-time sample is ~1.59%, for a total jitter of ~2.4%.

    And, for completeness, 4MBd:
     TB=10*P2C/48  = 4166666.666  P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 4210526.315   100*(1-P2B/TB)  = -1.052%
     TB=10*P2C/49  = 4081632.653  P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 4102564.102   100*(1-P2B/TB)  = -0.512%
     TB=10*P2C/50  = 4000000      P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 4000000       100*(1-P2B/TB)  = 0%
     TB=10*P2C/51  = 3921568.627  P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 3902439.024   100*(1-P2B/TB)  = 0.488%
     TB=10*P2C/52  = 3846153.846  P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 3809523.809   100*(1-P2B/TB)  = 0.952%
     TB=10*P2C/53  = 3773584.905  P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 3809523.809   100*(1-P2B/TB)  = -0.952%
     TB=10*P2C/54  = 3703703.703  P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 3720930.232   100*(1-P2B/TB)  = -0.465%
    
    Here the timebase resolution of a 10-bit-time sample is ~2%, for a total jitter of ~3%.
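The "total jitter" figures can be sanity-checked with a small Python sketch. It rests on two assumptions of mine: that the timebase resolution is one clock in the 10-bit-time count (100/CT percent), and that the total is simply that plus the worst quantization error in the sweep. `P2C` is again assumed to be 20MHz and the function names are illustrative only.

```python
P2C = 20_000_000  # assumed system clock in Hz


def quantize_error_pct(ct: int, p2c: int = P2C) -> float:
    """Percent error from rounding the bit period to 1/8-clock steps."""
    eighths = (2**12 + ct * 6554) >> 13  # roughly round(ct * 8 / 10)
    tb = 10 * p2c / ct                   # true baud for this count
    p2b = p2c * 8 / eighths              # baud actually generated
    return 100 * (1 - p2b / tb)


def jitter_budget_pct(ct_nominal: int, ct_sweep: range) -> tuple[float, float, float]:
    """Return (capture resolution, worst quantize error, total), all in percent."""
    resolution = 100 / ct_nominal        # one count out of ct_nominal (assumption)
    worst_quantize = max(abs(quantize_error_pct(ct)) for ct in ct_sweep)
    return resolution, worst_quantize, resolution + worst_quantize
```

For example, `jitter_budget_pct(63, range(61, 69))` gives roughly 1.59% resolution and 0.8% quantization, close to the ~2.4% quoted for 3MBd, and `jitter_budget_pct(50, range(48, 55))` lands near the ~3% quoted for 4MBd.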