SPI boot code and new CALLPA/CALLPB instructions

Cluso99 · 2016-10-07 01:22

FYI

I did autobauding in the 80's on the 6802 and the 68705. Neither had UARTs inbuilt so it was bit banging. These were 4MHz 4 clock instructions. The current P1 & P2V is 80MHz with 4 & 2 clock instructions. So that is 20x & 40x.

The first autobauding was to an ICL minicomputer running at ~53,000 baud using a "special" serial handshake (incapable of using a uart anyway). Due to drift and not being xtal controlled, I also had to re-sync throughout the character. 53,000 x 40 = ~2M baud.

I also did the autobauding for modems using the AT command set that I/we built. IIRC they could go to 9600 baud. 9,600 x 40 = 384,000 baud.

With the AT sequence, we were interfacing to micros and mainframes that were crystal controlled, so drift was not a problem. However, the Apple //c originally came with its UART off-speed by IIRC 2%. As we were building modems which were also branded Apple, we had to work with them too.

The AT is a special sequence with a start=0 bit, followed by bit0=1, followed by bit1+=0. I won't bother showing how we detected parity, or case, as it's not required for this discussion.

I did the timing by waiting for the commencement of the start bit, and timing it. By dividing the time/2, I could then sample all 8 bits at approximately the middle. The only check was to ensure that the stop bit was indeed a "1" when sampled.

Timing was only done on the first "A". However, the calculated timing was used on every other character, only by syncing with the commencement of the start bit going to "0". In other words, the only time the speed could be changed was by a new "AT" command.

It would also have been possible to time the bit0 "1" bit, but in our case this was unnecessary.

Now you are referring to 8N1 (8N1+ which included 8N2). We only have to sample 9 bits (being 8+stop).

Presuming our sample is 100% correct, then we have 9 bits, and we are sampling at the centre of the 9th (stop) bit.

Now let's assume our bit time is 100. Therefore we sample at +50 (after the end of the start bit), then 8 * +100. So we have counted 850 from the end of the start bit to the middle of the stop bit.

If our 100 was miscounted, what is the earliest/latest where we will fail?
Obviously, +-50 in 850 (or 0.5 in 8.5) which gives +-5.88%.
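
The +-50 in 850 figure can be reproduced with a minimal Python sketch. It assumes mid-bit sampling, with the first sample half a bit time after the end of the start bit; the function name is illustrative only.

```python
def sample_tolerance(data_bits: int = 8) -> float:
    """Largest fractional bit-time error before the stop-bit sample misses the stop bit."""
    # Samples land at 0.5, 1.5, ... bit times after the end of the start bit.
    # The stop bit is sample number data_bits + 1, at (data_bits + 0.5) bit times.
    stop_sample_at = data_bits + 0.5
    # The accumulated error at that point may be at most half a bit time.
    return 0.5 / stop_sample_at


tol = sample_tolerance(8)
print(f"{tol * 100:.2f}%")  # 0.5 in 8.5, the same ratio as 50 in 850
```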

Of course we may be catering for a micro that is bit-banging too, so we need to allow for some error here too.

So the real question becomes, how accurate can we be at calculating the length of the start bit ???

Firstly, we can only be as accurate as 1 clock count = ~20MHz = 50ns.

If we assume this to be 2% then 100% = 2.5us = 250,000 baud

Note: The "A" or "a" is also a common character used in modems and the wifi ESP8266 board. So there is a precedent to using this. It should also be echoed back to the terminal/pc/micro.

jmg · 2016-10-07 01:33

Cluso99 wrote: »

So the real question becomes, how accurate can we be at calculating the length of the start bit ???

That question only really applies, to a design that tries to measure a single-bit time.

Better AutoBaud design, measures over more than one bit time.

Cluso99 wrote: »

Firstly, we can only be as accurate as 1 clock count = ~20MHz = 50ns.
If we assume this to be 2% then 100% = 2.5us = 250,000 baud

Correct, if we (for now) exclude NCO.

However that 2% step size, (which applies to 400k(2.5us) Baud, not 250k) can be centered to give +/- 1% steps, with the right care.

The secret is to preserve as much maths resolution as you can, and multiple-bit timing certainly helps here.

See my example calcs above, with a measurement sum of 10 bit-times, and an EFM8BB1 being stepped through valid baud rates.

Cluso99 · 2016-10-07 01:51

Here is another way with more accuracy, using the character "x" ASCII $78...

Here, either the start bit + 3* "0" bits = 4 bits could be timed. This could be verified with the following 4* "1" bits.

Alternately, using the two successive falling edges, the start + 3* "0" + 4* "1" bits could be timed giving an 8 bit time.

Simple maths:
4 bit times: (t+t)>>3 (2 instructions includes rounding)
8 bit times: (t+t)>>4 (2 instructions includes rounding)

This could be then possible to verify that bit7="0" satisfies the calculation if considered necessary.

Firstly, we can only be as accurate as 1 clock count = ~20MHz = ~50ns. If we assume this to be 2% then 100% = 2.5us.
4 bit times: 2.5us / 4bits = 625ns = 1,600,000 baud (1.6M baud)
8 bit times: 2.5us / 8bits = 312ns = 3,200,000 baud (3.2M baud)

And this is with a safety margin of +-3.88%.
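
As a rough check of these numbers, here is a small Python sketch. It assumes a 20MHz count clock and that one clock count may be at most 2% of the timed window; the names are illustrative only.

```python
CLOCK_HZ = 20_000_000   # ~20MHz boot clock, as assumed in the post
RESOLUTION = 0.02       # one clock count is allowed to be 2% of the timed window


def max_baud(bits_timed: int) -> float:
    """Highest baud rate at which one clock count is still <= 2% of the window."""
    tick = 1.0 / CLOCK_HZ            # 50ns per count
    window = tick / RESOLUTION       # 2.5us minimum window
    bit_time = window / bits_timed   # shortest usable bit time
    return 1.0 / bit_time


for bits in (1, 4, 8):
    print(bits, "bit(s) timed ->", round(max_baud(bits)), "baud")
```

Timing a single bit gives about 400k baud with these assumptions, and timing 4 or 8 bits scales that to about 1.6M and 3.2M baud.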

jmg · 2016-10-07 02:00

Cluso99 wrote: »

...
Alternately, using the two successive falling edges, the start + 3* "0" + 4* "1" bits could be timed giving an 8 bit time.

Yes, this is a subset of what Chip is already doing ?
He captures both tRR and tFF, which gives good precision for the bit-time.
Capture of both, also allows coverage of the reset-exit error case.

cgracey · 2016-10-07 06:42

jmg wrote: »
cgracey wrote: »

You can make good use of a Pair of captures
* Sum to get more x-Axis
* Check tRR relative to tFF to catch rejects
* Check ratio, to discriminate between two valid AutoBaud chars.

I'm not understanding these ideas, yet.

The idea is to combine the two capture readings, to get the most information.
eg 0x3f will capture tRR = 8b tFF = 7b, which you can add to get 15b times.
You can also compare them, and if (tRR < tFF), you ignore & continue to wait. (means RST exit was mid-char)

To resolve which one of 0x3f or 0x1f was sent, you compare the difference with tRR, and slice ~ 1.5/8b

I'm favouring 0x3f "?" and 0x1f, as they have higher sums, (tho the last one does go a little against your Kbd Test wish )
0x3f "?" :  on tRR = 8 tFF = 7 Sum=15b
===\_s_/=0=.=1=.=2=.=3=.=4=.=5=\_6_._7_/=P==T=\_s_/=0=.=1=.=2=.=3=.=4=.=5=\_6_._7_/=P==T=\_s_/=0=.=1=.=2=.=3=.=4=.=5=\_6_._7_/=P=
tRR    |            8                  |   2+T    |            8                  |   2+T    |       8                       |   //
tFF                     7      |        3+T   |                7           |     3+T     | 1 |                                =\_/= 
    f  r             OK tRR:8b,tFF:7b -^          ^- Err:tRR:2b+T,tFF:3b+T        ^OK        ^Err 
Margins: +1b,-1b = usable

Need a Second Char, for AutoBaud One-Pin command 

0x1f -> OK tRR:8b,tFF:6b Sum=14b   Err:tRR:2b+T,tFF:4b+T 
===\_s_/=0=.=1=.=2=.=3=.=4=\_5_._6_._7_/=P==T=\_s_/=0=.=1=.=2=.=3=.=4=\_5_._6_._7_/=P==T=\_
Margins: +2b,-2b = usable, CAN pair with 0x3f
also possible are 0x20 " " and 0x40 "@", but they have smaller sums.

But it seems that what would matter in these readings are empirical ratios that have numbers that fill most of the frame. For example 7 and 8 are viable because the 7:8 relationship is sufficient for knowing what came in, and that relationship must be tested with some precision, not just that one is greater than the other. My preferred value of $20 has a 7:3 relationship, which may not hold as much information, but it's plenty. That 7:3 relationship must be validated.
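
The bit-time ratios discussed here (tRR:tFF of 8:7 for 0x3f, 8:6 for 0x1f, 3:7 for $20 and 2:8 for "@") can be checked with a small Python sketch. It assumes a single 8N1 character sent LSB first on an idle-high line; the function name is illustrative only.

```python
def edge_intervals(char: int, data_bits: int = 8):
    """Return (tFF, tRR) in bit times for one 8N1 character sent from idle."""
    # Line level per bit time: idle, start bit, data bits LSB first, stop bit.
    levels = [1, 0] + [(char >> i) & 1 for i in range(data_bits)] + [1]
    falls = [i for i in range(1, len(levels)) if levels[i - 1] == 1 and levels[i] == 0]
    rises = [i for i in range(1, len(levels)) if levels[i - 1] == 0 and levels[i] == 1]
    if len(falls) != 2 or len(rises) != 2:
        raise ValueError("character does not give exactly two falls and two rises")
    return falls[1] - falls[0], rises[1] - rises[0]


for c in (0x3F, 0x1F, 0x20, 0x40):
    tFF, tRR = edge_intervals(c)
    print(f"0x{c:02x}: tFF={tFF}b tRR={tRR}b sum={tFF + tRR}b")
```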

jmg · 2016-10-07 07:00

cgracey wrote: »

But it seems that what would matter in these readings are empirical ratios that have numbers that fill most of the frame. For example 7 and 8 are viable because the 7:8 relationship is sufficient for knowing what came in, and that relationship must be tested with some precision, not just that one is greater than the other. My preferred value of $20 has a 7:3 relationship, which may not hold as much information, but it's plenty. That 7:3 relationship must be validated.

For reliable operation, you only really need to sense and reject the 'wrong' phased capture of
Err:tRR:7b+T,tFF:3b+T
Otherwise, knowing the character was 0x20 should be enough to make a Baud calculation.

You can check for some window, but I'm not sure what that gains you ? Maybe some noise immunity ?

Here, a check that tRR was between tFF/2 and tFF/4 would seem quick and simple ?
That also nicely rejects the wrong-phase case, that needs covering.

Then, you sum both tRR & tFF, and calculate Baud

I've expanded the example table of AutoBaud results in this post
http://forums.parallax.com/discussion/comment/1389598/#Comment_1389598

Here is a zoomed sawtooth error check near 1.5MBd

 TB=20M/13.54 = 1477104.87  CT=(10/TB)/(1/20M) = 135.4  P2B = 20M/((2^15+round(CT)*6554) >> 16) = 1428571.428 100*(1-P2B/TB) = 3.285%
 TB=20M/13.53 = 1478196.60  CT=(10/TB)/(1/20M) = 135.3  P2B = 20M/((2^15+round(CT)*6554) >> 16) = 1428571.428 100*(1-P2B/TB) = 3.357%
 TB=20M/13.52 = 1479289.94  CT=(10/TB)/(1/20M) = 135.2  P2B = 20M/((2^15+round(CT)*6554) >> 16) = 1428571.428 100*(1-P2B/TB) = 3.428%
 TB=20M/13.51 = 1480384.90  CT=(10/TB)/(1/20M) = 135.1  P2B = 20M/((2^15+round(CT)*6554) >> 16) = 1428571.428 100*(1-P2B/TB) = 3.5%
 TB=20M/13.50 = 1481481.48  CT=(10/TB)/(1/20M) = 135    P2B = 20M/((2^15+round(CT)*6554) >> 16) = 1428571.428 100*(1-P2B/TB) = 3.57%
 TB=20M/13.49 = 1482579.688 CT=(10/TB)/(1/20M) = 134.9  P2B = 20M/((2^15+round(CT)*6554) >> 16) = 1428571.428 100*(1-P2B/TB) = 3.642%
 TB=20M/13.48 = 1483679.525 CT=(10/TB)/(1/20M) = 134.8  P2B = 20M/((2^15+round(CT)*6554) >> 16) = 1428571.428 100*(1-P2B/TB) = 3.714%
 TB=20M/13.47 = 1484780.994 CT=(10/TB)/(1/20M) = 134.7  P2B = 20M/((2^15+round(CT)*6554) >> 16) = 1428571.428 100*(1-P2B/TB) = 3.785%
 TB=20M/13.46 = 1485884.101 CT=(10/TB)/(1/20M) = 134.6  P2B = 20M/((2^15+round(CT)*6554) >> 16) = 1428571.428 100*(1-P2B/TB) = 3.857%
 TB=20M/13.45 = 1486988.847 CT=(10/TB)/(1/20M) = 134.5  P2B = 20M/((2^15+round(CT)*6554) >> 16) = 1538461.538 100*(1-P2B/TB) = -3.461% not quite central

The ideal sawtooth is equally peaked +/-, this is slightly off as the 10b sample has the Maths jitter/LSB effects, and the fast divide of (2^15+round(CT)*6554) >> 16) contributes another small error.
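
The table rows can be reproduced with a minimal Python sketch of the same arithmetic. It assumes a 20MHz SysCLK and a capture over 10 bit times; the function names are illustrative only, and Python's round() stands in for whatever rounding the spreadsheet used.

```python
CLK = 20_000_000  # FPGA SysCLK assumed in the table


def fast_div10(count: int) -> int:
    """Rounded count/10 using a multiply and shift (6554/65536 is about 0.1)."""
    return (2**15 + count * 6554) >> 16


def autobaud_error(true_baud: float) -> float:
    """Percent error between the sender's baud and the baud the receiver settles on."""
    ct = round(10 * CLK / true_baud)   # clocks captured over 10 bit times
    p2_baud = CLK / fast_div10(ct)     # whole clocks per bit only, no fraction
    return 100 * (1 - p2_baud / true_baud)


print(f"{autobaud_error(CLK / 13.5):.3f}%")   # row TB=20M/13.50
print(f"{autobaud_error(1_600_000):.3f}%")    # the 1.6MBd case
```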

Testing at 2MBd is a good first-pass check, but I would also include numbers like 1.6MBd, which should be possible on a FT232H/FT2232H, and that has an AutoBAUD error of 3.846% to a FPGA 20MHz SysCLK.

Do you have a Character echo on the first AutoBAUD (eg 0x7e "~") ?
That's a good way to confirm AutoBAUD was ok, and automated systems can make use of that to speed the download.

Retrim AutoBauds do not have to echo every time, as a 100% echo could have issues

Add: Some Pseudo Code to cover enough tests :2

Pseudo code for AutoBaud, and One-Pin / Two-Pin selection on Autobaud, so Echo can go out on correct pin.
IF tRR > ((tFF*19) >> 5) THEN   // 4.15625 - Reverse Sync trap
  RETURN
ELSIF tRR <= ((tFF*6) >> 5) THEN // 1.3125 - too skewed trap
  RETURN
ELSIF tRR > ((tFF*21) >> 6) THEN // ~32.73% between " " 3:7 and "@" 2:8
  AutoCal(10,TwoPin)
  Echo("~")
ELSE  
  AutoCal(10,OnePin)
  Echo("~")
ENDIF
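
The same decision tree as a runnable Python sketch, for trying out capture values. It only mirrors the comparisons above; the return strings are illustrative, and the AutoCal/Echo steps are left out.

```python
def classify(tRR: int, tFF: int) -> str:
    """Return 'reject', 'two_pin' (" ", tRR:tFF = 3:7) or 'one_pin' ("@", 2:8)."""
    if tRR > (tFF * 19) >> 5:    # reverse sync trap
        return "reject"
    if tRR <= (tFF * 6) >> 5:    # too skewed trap
        return "reject"
    if tRR > (tFF * 21) >> 6:    # closer to 3:7 than to 2:8
        return "two_pin"
    return "one_pin"


# Example captures at 13 clocks per bit
print(classify(3 * 13, 7 * 13))   # " " in phase
print(classify(2 * 13, 8 * 13))   # "@" in phase
print(classify(8 * 13, 4 * 13))   # " " captured in the wrong phase, with T = 1 bit
```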

Cluso99 · 2016-10-07 07:05

A value of "x" is simple mathematics. Dividing by 3 or 7 is not so easy when time is of the essence in fast transmissions.

Two instructions give rounding (ie add itself to itself) then dividing (ie shift right by 3 or 4) to get bit time or an extra shift to get 1/2 bit time.

Seairth · 2016-10-07 12:35

Out of curiosity, what is it that you all are trying to solve right now? Maximizing the autobaud rate during boot? And, if so, how critical is this to have? Reliably autobauding up to 1Mbps right now already seems good enough.

User Name · 2016-10-07 14:32

Why is autobauding so important to P2 serial boot?

Seairth · 2016-10-07 15:19

User Name wrote: »

Why is autobauding so important to P2 serial boot?

Autobauding is necessary because the boot ROM cannot know the actual clock speed. By extracting the "clock" from the RX pin, the boot ROM can reliably set its baud rate regardless of the precision of the internal oscillator. Put another way, the boot ROM isn't really figuring out baud rate (since the chip doesn't exactly know how many clock cycles one second is), but instead figuring out the serial_clock-to-internal_clock ratio. The P2 doesn't know that it's at 1Mbps or 4800bps, just that 1 serial_clock equals X internal_clocks. And the X is going to differ for every chip due to the precision of the internal oscillator.

Edit: actually, autobaud is only necessary for higher baud rates. My guess is that the internal clock will be precise enough to be able to do a low, fixed baud rate (e.g. 9600bps), as long as the stop condition was stretched to ensure that the next start bit wasn't too early.

potatohead · 2016-10-07 17:44

Seairth wrote: »

Out of curiosity, what is it that you all are trying to solve right now? Maximizing the autobaud rate during boot? And, if so, how critical is this to have? Reliably autobauding up to 1Mbps right now already seems good enough.

Seconded.

jmg · 2016-10-07 18:48

Seairth wrote: »

Edit: actually, autobaud is only necessary for higher baud rates. My guess is that the internal clock will be precise enough to be able to do a low, fixed baud rate (e.g. 9600bps), as long as the stop condition was stretched to ensure that the next start bit wasn't too early.

Not quite, the P2 Boot osc is only good to 30%, so ALL baud speeds will require Autobaud.
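
A small Python sketch of why a fixed divisor fails at any baud rate. It assumes a nominal 20MHz oscillator and reuses the ~5.88% sampling budget worked out earlier in the thread for 8 data bits plus stop; the names are illustrative only.

```python
NOMINAL_HZ = 20_000_000
TOLERANCE = 0.5 / 8.5   # ~5.88% budget for 8 data bits + stop, mid-bit sampling


def fixed_divisor_error(actual_hz: float, baud: int = 9600) -> float:
    """Fractional baud error when the divisor is chosen for the nominal clock."""
    divisor = round(NOMINAL_HZ / baud)   # picked without knowing the real clock
    actual_baud = actual_hz / divisor
    return abs(actual_baud - baud) / baud


for drift in (0.02, 0.30):
    err = fixed_divisor_error(NOMINAL_HZ * (1 - drift))
    print(f"osc {drift:.0%} slow -> baud error {err:.1%}, within budget: {err < TOLERANCE}")
```

The baud error tracks the oscillator error almost one for one, so the baud rate itself does not matter; only the oscillator tolerance does.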

Seairth wrote: »

Out of curiosity, what is it that you all are trying to solve right now? Maximizing the autobaud rate during boot? And, if so, how critical is this to have? Reliably autobauding up to 1Mbps right now already seems good enough.

Reliable Autobaud up to 1MBd/2MBd is not there 'right now', but it is being worked on

Current release boot loader is Spec'd to 115200, with a ceiling maybe 2x that.
115200 is frankly glacial, and constraining. That is over 60 seconds to send a P2 image.

Sure, in some cases, you might be able to change gears and use dual stage loaders, but that adds a whole lot more management, and assumes you can change gears.
Much cleaner to have a faster, more practical upper AutoBaud ceiling.
The numbers above suggest > 1.5Mbd, maybe 2Mbd, will be practical (even without NCO step).

There are many use cases for P2, where a simple single baud rate will be desirable/possible.
BLE and WiFi links can go way faster than 115200.

Then, there are other use cases where you want the fastest possible boot times, including reset-exit delay effects.

jmg · 2016-10-07 19:01

cgracey wrote: »

Wow! These are great ideas. And, yes, the captures are cyclical.

We COULD just do an initial auto-baud and then check the sample pair in each serial interrupt, as the last thing to happen was the stop bit, making a fresh rise-to-rise measurement. We are already 1/2 into the stop bit when the serial ISR is triggered, so we have less than half a bit period to grab the last fall-to-fall sample before a new fall-to-fall is registered, in case a (low) start bit is next. ..

Thinking a little more about this specific timing detail, here is a way to buy some extra time here:

INT_tRR: // Interrupt on Rising Capture case - VERY short. Does not INT RX INT.
tFF = Read(FallValue)
RETI // This is actually needed only at start of Stop bit, other cases are tolerated.

INT_RX:
tRR = Read(RiseValue) // very first thing @ RX mid-stop
... non critical code..

ie at the last rising edge/Stop bit, read and store the waiting tFF value,
This now means a tFF capture at Start-Bit fall, is not such a problem.
tRR cannot occur before the end of Start Bit, so it can read on INT_RX, and still have 1.5 bit times of margin, with 1 Stop bit.
ie this triples the timing margin.
(2 stop bits of course always gives even more time)

cgracey · 2016-10-08 09:21

In order to tighten up smart pin timing, I've given each smart pin a 32-bit parallel RDPIN output. This means RDPIN will take only two clocks, like most other instructions, and always return a long (no more byte and word possibilities). This got rid of all the pin-to-cog message mechanisms, but added more routing. It's so much simpler now.

This is going to make everything a lot quicker. This should help full-speed USB at 80MHz quite a bit.

It's actually much simpler to make the RDPIN paths 32 bits wide, as opposed to the WRPIN/WXPIN/WYPIN paths, as they must be OR'd together from all cogs before heading out to the pins.

I'm looking forward to seeing what this does for auto-baud. I also added 4 fractional bits to the serial baud generator, so that we can get the extra resolution we need above 500k baud. I had it auto-bauding and loading at 1.5M baud earlier today. It should be solid at 2MHz after this. One problem, though, is that the SHA-256 computation can only run at about 100K bytes per second at 20MHz. We'll deal with that later.

dMajo · 2016-10-08 10:17

@cgracey
Chip, just out of curiosity/ignorance, what's the difference between PIC's oscillators and P2's one?
Most of the PICs reach 32MHz, the latest parts even 64MHz. Can't you use such type of oscillators in the Prop?
As I have understood the P2 will use the same internal oscillator that was used in the P1 10 years ago.

http://ww1.microchip.com/downloads/en/DeviceDoc/40001819A.pdf wrote:

Clocking Structure
• Precision Internal Oscillator:
- ±1% at calibration
- Selectable frequency range 32 MHz to 31 kHz
• 31 kHz Low-Power Internal Oscillator
• 4x Phase-Locked Loop (PLL) for up to 32 MHz Internal Operation
• External Oscillator Block with Three External
Clock modes up to 32 MHz

cgracey · 2016-10-08 14:08

dMajo wrote: »

@cgracey
Chip, just out of curiosity/ignorance, what's the difference between PIC's oscillators and P2's one?
Most of the PICs reach 32MHz, the latest parts even 64MHz. Can't you use such type of oscillators in the Prop?
As I have understood the P2 will use the same internal oscillator that was used in the P1 10 years ago.

http://ww1.microchip.com/downloads/en/DeviceDoc/40001819A.pdf wrote:

Clocking Structure
• Precision Internal Oscillator:
- ±1% at calibration
- Selectable frequency range 32 MHz to 31 kHz
• 31 kHz Low-Power Internal Oscillator
• 4x Phase-Locked Loop (PLL) for up to 32 MHz Internal Operation
• External Oscillator Block with Three External
Clock modes up to 32 MHz

It's all just a matter of how much power you want to burn. I figured 20MHz for boot was still sufficient. We could make it faster, but it would be a silicon change, at this point.

David Betz · 2016-10-08 16:56

cgracey wrote: »

dMajo wrote: »

@cgracey
Chip, just out of curiosity/ignorance, what's the difference between PIC's oscillators and P2's one?
Most of the PICs reach 32MHz, the latest parts even 64MHz. Can't you use such type of oscillators in the Prop?
As I have understood the P2 will use the same internal oscillator that was used in the P1 10 years ago.

http://ww1.microchip.com/downloads/en/DeviceDoc/40001819A.pdf wrote:

Clocking Structure
• Precision Internal Oscillator:
- ±1% at calibration
- Selectable frequency range 32 MHz to 31 kHz
• 31 kHz Low-Power Internal Oscillator
• 4x Phase-Locked Loop (PLL) for up to 32 MHz Internal Operation
• External Oscillator Block with Three External
Clock modes up to 32 MHz

It's all just a matter of how much power you want to burn. I figured 20MHz for boot was still sufficient. We could make it faster, but it would be a silicon change, at this point.

Speaking of silicon, what happened with the shuttle run? Did you get the test chips back yet?

jmg · 2016-10-08 19:51

cgracey wrote: »

In order to tighten up smart pin timing, I've given each smart pin a 32-bit parallel RDPIN output. This means RDPIN will take only two clocks, like most other instructions, and always return a long (no more byte and word possibilities). This got rid of all the pin-to-cog message mechanisms, but added more routing. It's so much simpler now.

This is going to make everything a lot quicker. This should help full-speed USB at 80MHz quite a bit.

It's actually much simpler to make the RDPIN paths 32 bits wide, as opposed to the WRPIN/WXPIN/WYPIN paths, as they must be OR'd together from all cogs before heading out to the pins.

Wow, quite the boost ! I figured you were knocking off a few corners, but that's a big change.

cgracey wrote: »

I'm looking forward to seeing what this does for auto-baud. I also added 4 fractional bits to the serial baud generator, so that we can get the extra resolution we need above 500k baud. I had it auto-bauding and loading at 1.5M baud earlier today. It should be solid at 2MHz after this.

Non fractional AutoBaud should be ok to somewhere between 1~1.5Mbd, (see my rounding equations above) with sparse solutions above 1.5MBd.
eg 2MHz should work, because with a 20MHz FPGA clock, and 48MHz USB clocks, that happens to hit a sweetspot.
1.6MBd is a tougher test for 20MHz & 48MHz

Fractional AutoBaud should get maybe to the 3MBd that is widespread in USB-UARTS.

cgracey wrote: »

I also added 4 fractional bits to the serial baud generator, so that we can get the extra resolution ..

Are those extra bits base 8, base 10 or base 16 ?

jmg · 2016-10-08 19:58

David Betz wrote: »

Speaking of silicon, what happened with the shuttle run? Did you get the test chips back yet?

Those should have good numbers on the actual 20MHz oscillator, and especially how it varies with Vcc and Temperature.

cgracey wrote: »

It's all just a matter of how much power you want to burn. I figured 20MHz for boot was still sufficient. We could make it faster, but it would be a silicon change, at this point.

I'd measure the test wafers first, to see what you actually get, and what the frequency curves look like.
Another number to consider would be ~24MHz, as that's related to the USB clock everyone uses these days.
With fractional Autobaud, you can position samples to within 50ns, and so might AutoBaud up toward 3MBd

garryj · 2016-10-08 20:18

cgracey wrote: »

In order to tighten up smart pin timing, I've given each smart pin a 32-bit parallel RDPIN output. This means RDPIN will take only two clocks, like most other instructions, and always return a long (no more byte and word possibilities). This got rid of all the pin-to-cog message mechanisms, but added more routing. It's so much simpler now.

This is going to make everything a lot quicker. This should help full-speed USB at 80MHz quite a bit.

It's actually much simpler to make the RDPIN paths 32 bits wide, as opposed to the WRPIN/WXPIN/WYPIN paths, as they must be OR'd together from all cogs before heading out to the pins.

Neat! I have been thinking about taking another stab at full-speed at 80MHz -- this makes it definite

cgracey · 2016-10-09 09:15

Big progress...

I've got 32-bit data paths for all RDPIN/WRPIN/WXPIN/WYPIN instructions, so they all take only 2 clocks now.

I improved the asynchronous serial mode of the smart pin by getting rid of NCO baud mode and making it use lower bits as fractional bit counts if the top bits are clear:

Async serial configuration via WXPIN

b=clocks/bit, n=bits/word-1:

%bbbb_bbbb_bbbb_bbbb_xxxx_xxxx_xxx_nnnnn	= normal 16-bit baud
%0000_00bb_bbbb_bbbb_bbbb_bbxx_xxx_nnnnn	= 10-bit baud with 6-bit fractional

For example, at 20MHz:

$208D0007 = 2400 baud, 8 bits (8333 clocks/bit)
$000D5407 = 1.5M baud, 8 bits (13.33 clocks/bit)

Just think of it as a 22-bit (16.6) bit count. The smart pin goes into fractional
mode if the top six bits are clear. The programmer just does his calculation and
the smart pin makes the best use of it.
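
A minimal Python sketch of packing such a WXPIN value, checked against the two examples above. The function name is illustrative only; truncating (rather than rounding) the 16.6 value and zeroing the don't-care bits in normal mode are assumptions made to match those examples.

```python
def wxpin_async(clock_hz: int, baud: int, bits: int = 8) -> int:
    """Pack clocks/bit as a 16.6 count in the top 22 bits and bits-1 in the low 5."""
    fixed = (clock_hz << 6) // baud   # clocks per bit in 16.6 fixed point, truncated
    whole = fixed >> 6
    if whole == 0 or whole >= 1 << 16:
        raise ValueError("clocks per bit out of range")
    if whole >= 1 << 10:
        # Top six bits are in use, so this is a normal 16-bit baud value and
        # the lower bits are don't-care: drop the fraction.
        fixed = whole << 6
    return (fixed << 10) | ((bits - 1) & 0x1F)


print(f"${wxpin_async(20_000_000, 2400):08X}")       # 2400 baud, 8 bits
print(f"${wxpin_async(20_000_000, 1_500_000):08X}")  # 1.5M baud, 8 bits
```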

Now the ROM booter can handle serial at 1.5M baud (with 2 stop bits). This means 1M baud will be safe for a slow/hot RC clock that is running at only 13MHz, instead of the expected 20MHz:

'
'
' Initial autobaud ISR
'
' $20 -> 10000001001 -> fall-to-fall = 7 bits, then rise-to-rise = 3 bits
'
autobaud_isr	rdpin	buf1,#rx_ms1		'get old fall-to-fall time	(7x if $20)
		mul	buf1,norm1		'make baud rate
		setbyte	buf1,#7,#0		'set 8 bits
		wxpin	buf1,#rx_rcv		'set baud rate as early as possible, even if it's wrong
		dirh	#rx_rcv			'enable serial receiver in case $20 received (10 clocks, so far)

		rdpin	buf0,#rx_ms0		'get new rise-to-rise time	(3x if $20)

		akpin	#rx_ms0			'acknowledge rise-to-rise measurement (ISR trigger)

		mul	buf0,norm0		'normalize rise-to-rise time for comparison

		sub	buf0,buf1		'subtract one from the other
		abs	buf0			'get absolute difference
		topone	buf2,buf1		'get magnitude of one original
		sub	buf2,#4			'subtract 4 for 1/16th test
		shr	buf0,buf2	wz	'shift down difference by magnitude minus 4, z=1 if $20

	if_nz	dirl	#rx_rcv			'if not $20, disable serial receiver

	if_z	setint1	#0			'if $20, disable int1
	if_z	nixint1				'if $20, nix any pending int1

		reti1				'exit (4 entry + 32 body + 4 exit clocks)


norm1		long	3 * $1_0000 / 21	'normalization constants get clock count into top word
norm0		long	7 * $1_0000 / 21
'
'
' Serial receiver plus autobaud maintenance ISR
'
receive_isr	rdpin	buf1,#rx_ms1		'get old fall-to-fall time	(7x if $20)
		rdpin	buf0,#rx_ms0		'get new rise-to-rise time	(3x if $20)

		mul	buf1,norm1		'normalize both samples to 21x for comparison
		mul	buf0,norm0

		sub	buf0,buf1		'subtract one from the other
		abs	buf0			'get absolute difference
		topone	buf2,buf1		'get magnitude of one original
		sub	buf2,#4			'subtract 4 for 1/16th test
		shr	buf0,buf2	wz	'shift down difference by magnitude minus 4, z=1 if $20

	if_z	setword	buf1,#7,#0		'if $20, set 8 bits
	if_z	wxpin	buf1,#rx_rcv		'if $20, update serial receiver baud rate
	if_z	mov	baud,buf1		'if $20, save baud rate for transmit

		akpin	#rx_rcv			'acknowledge rx byte
		rdpin	rxbyte,#rx_rcv		'get rx byte

		wrlut	rxbyte,head		'write byte to circular buffer in lut
		incmod	head,#lut_btop		'increment buffer head

		reti2				'exit (4 entry + 32 body + 4 exit clocks)
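
The $20 validation used in both ISRs can be modelled roughly in Python. This is only a sketch of the arithmetic, not of the hardware: it assumes MUL is an unsigned 16x16 multiply, TOPONE returns the index of the highest set bit, and SHR uses only the low five bits of its shift count. The helper names are illustrative only.

```python
NORM1 = 3 * 0x1_0000 // 21   # scales a 7-bit-time count to one bit time in the top word
NORM0 = 7 * 0x1_0000 // 21   # scales a 3-bit-time count the same way


def mul16(a: int, b: int) -> int:
    """Assumed MUL behaviour: unsigned product of the two low words."""
    return (a & 0xFFFF) * (b & 0xFFFF)


def topone(x: int) -> int:
    """Assumed TOPONE behaviour: index of the highest set bit (0 for an input of 0)."""
    return max(x.bit_length() - 1, 0)


def is_space(fall_to_fall: int, rise_to_rise: int) -> bool:
    """True if the two captures agree with the 7:3 shape of $20 to about 1/16th."""
    buf1 = mul16(fall_to_fall, NORM1)
    buf0 = mul16(rise_to_rise, NORM0)
    diff = abs(buf0 - buf1)
    shift = (topone(buf1) - 4) & 31   # magnitude minus 4, as in the ISR
    return (diff >> shift) == 0


print(is_space(7 * 13, 3 * 13))   # $20 at 13 clocks per bit
print(is_space(8 * 10, 2 * 10))   # "@" at 10 clocks per bit
```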

Now that smart pins are as fast to access as registers, code can be much higher performance.

Getting rid of all the messaging circuitry simplified things greatly. The net increase in logic usage, due to wider mux/demux circuits was hardly noticeable. I'm going to do a full compile over night to get a better idea of this.

evanh · 2016-10-09 10:36

Wow! Is that not the old 8x32 I/O ring bus, now 16x32, that was so real estate hungry? The one that, once removed, allowed the 128kB to 256kB increase in HubRAM size and lots of nice extras in the P2Hot?

Cluso99 · 2016-10-09 10:48

Sounds like a huge improvement!
WTG Chip

Seairth · 2016-10-09 13:46

*fingers crossed* If this takes more room on the FPGA, but will still fit the ASIC, so be it. That kind of I/O performance is exciting!

jmg · 2016-10-09 18:38

cgracey wrote: »

Big progress...

I've got 32-bit data paths for all RDPIN/WRPIN/WXPIN/WYPIN instructions, so they all take only 2 clocks now.

I improved the asynchronous serial mode of the smart pin by getting rid of NCO baud mode and making it use lower bits as fractional bit counts if the top bits are clear:

Sounding great

cgracey wrote: »

Just think of it as a 22-bit (16.6) bit count. The smart pin goes into fractional
mode if the top six bits are clear. The programmer just does his calculation and
the smart pin makes the best use of it.

Can you expand on how the pin makes 'best use' of the 6 bit fraction ?
There are only 8 data bits (HW up to 32*) and a start and stop bit sample to manage.
For the 8-b UART case, a decimal (4b) rate multiplier and a 10bit AutoBAUD sample** would seem the best fit ?
or possible a 3b Rate Multiplier, and maybe some decision to do a +0/+1 on the start bit ?

* ROM does not need to manage 32b, but the hardware should be checked for this.
for > 8, does the fractional modulus vary with the selected bit length (it could change at 12b and 24b thresholds?)

** To get 10b, I think you can add the tRR(7) and tFF(3) samples.

cgracey · 2016-10-09 20:56

evanh wrote: »

Wow! Is that not the old 8x32 I/O ring bus, now 16x32, that was so real estate hungry? The one that, once removed, allowed the 128kB to 256kB increase in HubRAM size and lots of nice extras in the P2Hot?

It is more wiring, but the EDA tools somehow deal with this better than we did in our custom layout.

I need to go check that compile to see how things went.

One big upside to this new scheme is that all of the flops have gated clocks now, which will absolutely save power in the ASIC.

cgracey · 2016-10-09 21:12

Well, it seems that all this wide-data-path stuff only increased the Prop123-A9 16-cog/18-smart-pin logic utilization from 89% to 91%, which is surprisingly low. These changes got rid of 68, somewhat complicated, state machines: a sender and receiver in each smart pin and a sender and receiver in each cog. The hub grew to accommodate the 32-bit mux/demux circuits, while those old cog and smart-pin state machines each became 32-bit flop arrays. This is a huge net improvement.

jmg · 2016-10-09 22:03

cgracey wrote: »

'
'
' Initial autobaud ISR
...
		sub	buf0,buf1		'subtract one from the other
...

' Serial receiver plus autobaud maintenance ISR
'
receive_isr	rdpin	buf1,#rx_ms1		'get old fall-to-fall time	(7x if $20)
		rdpin	buf0,#rx_ms0		'get new rise-to-rise time	(3x if $20)

		mul	buf1,norm1		'normalize both samples to 21x for comparison
		mul	buf0,norm0

		sub	buf0,buf1		'subtract one from the other

This sub step puzzles me, as usually you'd want the highest number / most number of possible bits to keep the maths LSB as small as possible ?
In my code, I added the 7+3 to get 10b count, then use a fast ~ rounded/10 (60ppm error)
P2C/((2^15+round(CT)*6554) >> 16)

jmg · 2016-10-09 22:04

cgracey wrote: »

Well, it seems that all this wide-data-path stuff only increased the Prop123-A9 16-cog/18-smart-pin logic utilization from 89% to 91%, which is surprisingly low. These changes got rid of 68, somewhat complicated, state machines: a sender and receiver in each smart pin and a sender and receiver in each cog. The hub grew to accommodate the 32-bit mux/demux circuits, while those old cog and smart-pin state machines each became 32-bit flop arrays. This is a huge net improvement.

Sounds like good progress... With luck, the ASIC tools will pack those 32-flops very well, better than random-logic.

jmg · 2016-10-09 22:47

cgracey wrote: »
I improved the asynchronous serial mode of the smart pin by getting rid of NCO baud mode and making it use lower bits as fractional bit counts if the top bits are clear:
Async serial configuration via WXPIN

b=clocks/bit, n=bits/word-1:

%bbbb_bbbb_bbbb_bbbb_xxxx_xxxx_xxx_nnnnn	= normal 16-bit baud
%0000_00bb_bbbb_bbbb_bbbb_bbxx_xxx_nnnnn	= 10-bit baud with 6-bit fractional

For example, at 20MHz:

$208D0007 = 2400 baud, 8 bits (8333 clocks/bit)
$000D5407 = 1.5M baud, 8 bits (13.33 clocks/bit)

Just think of it as a 22-bit (16.6) bit count. The smart pin goes into fractional
mode if the top six bits are clear. The programmer just does his calculation and
the smart pin makes the best use of it.
Now the ROM booter can handle serial at 1.5M baud (with 2 stop bits).

With fractional Baud, you should be able to go well above 1.5MBd.

I get these results on a numeric test sweep. Here I applied a 3b Fractional baud, as with 8 data bits I can see only 8-10 opportunities to apply a 1 clock correction.
(ie I'm not sure what a 6b fraction can do at 8b Data)
These assume a 10baud time summed sample, for T-Axis resolution. All this precision now matters, as the fractional baud (here F/8) has moved the weak-link to the baud timing.
Lucky we have smart-pins that can do a very good job here

Try Centre, and reduce remainder to 8 steps, now see balanced error sawtooth, of just 0.392% at 1.5MBd.
 TB=10*P2C/126 = 1587301.587 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 1584158.41    100*(1-P2B/TB)  = 0.198%
 TB=10*P2C/127 = 1574803.149 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 1568627.450   100*(1-P2B/TB)  = 0.392%
 TB=10*P2C/128 = 1562500     P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 1568627.450   100*(1-P2B/TB)  = -0.392%
 TB=10*P2C/129 = 1550387.596 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 1553398.058   100*(1-P2B/TB)  = -0.194%
 TB=10*P2C/130 = 1538461.538 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 1538461.538   100*(1-P2B/TB)  = 0
 TB=10*P2C/131 = 1526717.557 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 1523809.523   100*(1-P2B/TB)  = 0.190%
 TB=10*P2C/132 = 1515151.515 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 1509433.962   100*(1-P2B/TB)  = 0.377%

Assuming that 3b fraction applies correctly, this gives an AutoBaud quantize error of 0.392% max
The capture resolution on a 10baud time sample at 1.5MBd, is around 0.8%, total is ~ 1.2%
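
The sweep above can be regenerated with a short Python sketch of the same formula. It assumes P2C = 20MHz and a count CT taken over 10 bit times; the function names are illustrative only.

```python
P2C = 20_000_000  # SysCLK assumed in the sweep


def frac8_divisor(ct: int) -> float:
    """Clocks per bit in steps of 1/8, from a count taken over 10 bit times."""
    eighths = (2**12 + ct * 6554) >> 13   # about round(ct * 8 / 10)
    return eighths / 8


def frac8_error(ct: int) -> float:
    """Percent error between the sender's baud and the F/8 quantized baud."""
    true_baud = 10 * P2C / ct
    p2_baud = P2C / frac8_divisor(ct)
    return 100 * (1 - p2_baud / true_baud)


for ct in range(126, 133):
    print(f"CT={ct} clocks/bit={frac8_divisor(ct):.3f} error={frac8_error(ct):.3f}%")
```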

& here are some 3MBd sweeps

 TB=10*P2C/61  = 3278688.524 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 3265306.122   100*(1-P2B/TB)  = 0.408%
 TB=10*P2C/62  = 3225806.451 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 3200000       100*(1-P2B/TB)  = 0.8%
 TB=10*P2C/63  = 3174603.174 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 3200000       100*(1-P2B/TB)  = -0.8%
 TB=10*P2C/64  = 3125000     P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 3137254.901   100*(1-P2B/TB)  = -0.392%
 TB=10*P2C/65  = 3076923.076 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 3076923.076   100*(1-P2B/TB)  = 0
 TB=10*P2C/66  = 3030303.030 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 3018867.924   100*(1-P2B/TB)  = 0.377%
 TB=10*P2C/67  = 2985074.626 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 2962962.962   100*(1-P2B/TB)  = 0.740%
 TB=10*P2C/68  = 2941176.470 P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 2962962.962   100*(1-P2B/TB)  = -0.740%

At this 3MBd ballpark, the 10baud time timebase resolution is ~ 1.59% for a total jitter of ~ 2.4%

and for completeness 4MBd

 TB=10*P2C/48  = 4166666.666  P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 4210526.315   100*(1-P2B/TB)  = -1.052%
 TB=10*P2C/49  = 4081632.653  P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 4102564.102   100*(1-P2B/TB)  = -0.512%
 TB=10*P2C/50  = 4000000      P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 4000000       100*(1-P2B/TB)  = 0
 TB=10*P2C/51  = 3921568.627  P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 3902439.024   100*(1-P2B/TB)  = 0.488%
 TB=10*P2C/52  = 3846153.846  P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 3809523.809   100*(1-P2B/TB)  = 0.952%
 TB=10*P2C/53  = 3773584.905  P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 3809523.809   100*(1-P2B/TB)  = -0.952%
 TB=10*P2C/54  = 3703703.703  P2B = P2C/((2^12+(round(CT)*6554) >> 13)/8) = 3720930.232   100*(1-P2B/TB)  = -0.465%

10baud time timebase resolution is ~2%, for a total jitter of ~ 3%
