FullDuplexSerial for P1+

Seairth · 2014-04-18 08:11

Since we have been using FDS as an example bit of code that would be affected by all of the changes, I figured I'd start tweaking it to take advantage of the new instruction set. So far, I have made the following changes:

Used RDQUAD in a few places to reduce the number of hub accesses
Much shorter setup code
Rearranged a few blocks to reduce a couple of the hub access stalls
Added some additional comments (this was to help me, mostly, but I figured I'd leave them in)
Replaced JMPRET with JMPSW. This was a drop-in replacement.

Things this does not take into account:

Any changes to the SPIN language
I believe PAR is going away, but it's not clear to me what's replacing it. So the current code assumes PAR still exists.
The smart pin stuff might get rid of some of the serial bit-banging, but I don't know enough about it at this point to make those changes.

As it stands, this code is slightly shorter than the original version (and hopefully bug free). Given the general increase in P1+ performance (5x MIPS) and the reduction of a few hub operations, this version might be able to get up to an 8x increase in baud rate.

FullDuplexSerial_P1+.spin

jmg · 2014-04-18 13:33

Seairth wrote: »

Since we have been using FDS as an example bit of code that would be affected by all of the changes,
...
The smart pin stuff might get rid of some of the serial bit-banging, but I don't know enough about it at this point to make those

Even if Pin Cell can do some Serial work, the general case of SW FDS is still useful.

Maybe the new pin-test opcodes can also help ?

JP D/#,S/@ (jump if pin IN high, pins registered at beginning of ALU cycle)
JNP D/#,S/@ (jump if pin IN (not) high, pins registered at beginning of ALU cycle)

evanh · 2014-04-19 03:59

:wait                   jmpsw   rxcode,txcode         'run a chunk of transmit code, then return

                        mov     t1,rxcnt              'check if bit receive period done
                        sub     t1,cnt
                        cmps    t1,#0           wc
        if_nc           jmp     #:wait

Um, excuse me if my lack of experience shows but can't the SUB generate a zero flag result and the JMP act on that rather than using a CMPS?

EDIT: Of course if these routines were running independently then they could do away with the tight loops and instead use WAITCNT instructions. I think Chip got the waits sorted so they can share the hardware time slicing?

Seairth · 2014-04-19 06:31

evanh wrote: »
:wait                   jmpsw   rxcode,txcode         'run a chunk of transmit code, then return

                        mov     t1,rxcnt              'check if bit receive period done
                        sub     t1,cnt
                        cmps    t1,#0           wc
        if_nc           jmp     #:wait
Um, excuse me if my lack of experience shows but can't the SUB generate a zero flag result and the JMP act on that rather than using a CMPS?

EDIT: Of course if these routines were running independently then they could do away with the tight loops and instead use WAITCNT instructions. I think Chip got the waits sorted so they can share the hardware time slicing?

Z would only work if cnt were exactly the same as t1 at the sub instruction. C is used to detect the less accurate condition where cnt passes rxcnt. You can't use a SUBS here because cnt is unsigned.
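To make the Z-versus-C point concrete, here is a minimal Python sketch that models the three-instruction check (mov / sub / cmps ... wc) with 32-bit wrapping arithmetic. The helper names and the polling instants are hypothetical, chosen only for illustration; the flag meanings follow the Prop1 descriptions quoted in this thread.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def bit_period_done(rxcnt, cnt):
    """Model of: mov t1,rxcnt / sub t1,cnt / cmps t1,#0 wc wz.

    Returns (c, z): c is set once the wrapped difference goes negative
    (cnt has passed rxcnt); z is set only if cnt equals rxcnt exactly.
    """
    t1 = (rxcnt - cnt) & MASK32
    c = to_signed32(t1) < 0
    z = t1 == 0
    return c, z

target = 1000
samples = [990, 996, 1002, 1008]   # hypothetical polling instants; none lands on 1000
flags = [bit_period_done(target, cnt) for cnt in samples]
```

In this model the loop never observes Z because no sample lands exactly on the target, while C flips as soon as a sample falls after it.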

As for the WAITCNT, that was correct for P2. In that case, WAITCNT (or was it called something else? I'm already forgetting!) would self-jump to keep from stalling the pipeline. Depending on the task scheduling, you could overshoot your mark just like the above code, but I suspect it would be more accurate on average. If hardware multitasking were available in P1+, I agree that this is likely the better way to go (with a bit of TLOCK, or whatever it was called, for the hubop blocks). But, that's neither here nor there, since P1+ doesn't currently do hardware multitasking.

Seairth · 2014-04-19 07:05

jmg wrote: »

Even if Pin Cell can do some Serial work, the general case of SW FDS is still useful.

Maybe the new pin-test opcodes can also help ?

JP D/#,S/@ (jump if pin IN high, pins registered at beginning of ALU cycle)
JNP D/#,S/@ (jump if pin IN (not) high, pins registered at beginning of ALU cycle)

The only place I found these to be useful was on the beginning of the receive, looking for the start bit:

receive                 jmpsw   rxcode,txcode         'run a chunk of transmit code, then return

                        test    rxtxmode,#%001  wz    'wait for start bit on rx pin
        if_z            jp      rx_pin,@receive       'looking for LOW, encountered HIGH
        if_nz           jnp     rx_pin,@receive       'looking for HIGH, encountered LOW

I think that's right (and assumes that rx_pin is a cog register). In a way, this might be more readable. But it didn't reduce the number of instructions, unfortunately.

It does, however, shave off 2 clock cycles for non-inverted signalling. Which makes me wonder if these shouldn't be used after all, because it potentially changes the maximum baud rate depending on whether receive signalling is inverted. I suspect, though, that this bit of code is not what's limiting the baud rate.

Seairth · 2014-04-19 07:18

I don't suppose the new pin mode stuff has an inverted mode that would invert the input/output automatically? If so, then this would certainly simplify both send and receive code.

Cluso99 · 2014-04-19 11:39

Seairth wrote: »

I don't suppose the new pin mode stuff has an inverted mode that would invert the input/output automatically? If so, then this would certainly simplify both send and receive code.

From what I understood, originally the pins had an inversion mode. I presume it is still there.

evanh · 2014-04-19 16:56

Seairth wrote: »

But, that's neither here nor there, since P1+ doesn't currently do hardware multitasking.

Comparing is exactly what we are wanting to do here. The reason this topic exists is the earlier discussion over the advantages of fine-grained time slicing versus any sort of cooperative multitasking.

Chip did do something to the WAITs of the Prop2 design to allow them to function cleanly with more than one thread active. It was something like self-repeating in the pipeline instead of going into a static wait state.

evanh · 2014-04-19 17:24

Seairth wrote: »

... You can't use a SUBS here because cnt is unsigned.

That didn't ring true so I had another careful look at the Prop1 datasheet and found that the Carry flag result for SUBS and CMPS is not the same. SUBS generates a "Signed Overflow" while:
CMPS "Signed (D < S)", can this be anything other than it's less than zero?, which looks to be pretty much identical to
CMP "Unsigned (D < S)", I would interpret this one as an unsigned borrow, and also
SUB "Unsigned Borrow".

Hmm, that means SUB should do the job just right!

evanh · 2014-04-19 18:02

Lol, IF_NEVER - there is 28 bits of NOPs! I guess it's inherent in conditional execution. Commented machine code, yeah! My Prop education is starting.

jmg · 2014-04-19 19:00

Seairth wrote: »

It does, however, shave off 2 clock cycles for non-inverted signalling. Which makes me wonder if these shouldn't be used after all, because it potentially changes the maximum baud rate depending on whether receive signalling is inverted. I suspect, though, that this bit of code is not what's limiting the baud rate.

Seairth wrote: »

I don't suppose the new pin mode stuff has an inverted mode that would invert the input/output automatically? If so, then this would certainly simplify both send and receive code.

The Pins are quite configurable, but re-configure is going to be via serial word(s) send.

Other design approaches :

* The JP/JNP are one bit different in the opcode itself, and so code can self modify the opcode once, at config. Run time is then as fast as possible.

* Use conditional builds to select P/N versions, as polarity is unlikely to change during run.

Seairth · 2014-04-19 19:46

evanh wrote: »

Comparing is exactly what we are wanting to do here. The reason this topic exists is the earlier discussion over the advantages of fine-grained time slicing versus any sort of cooperative multitasking.

Except that I was trying to focus on using only those instructions/features which are available at the moment. If Chip adds hardware multitasking, we can see then how it will help. If he adds self-jumping WAITxxx instructions, we can see then how it will help. But until those features (and others) are actually added, I think we should avoid discussing them in this thread.

Heater. · 2014-04-19 19:51

evanh,

How is cnt signed or unsigned? It's just a 32 bit number that counts up and rolls over right?

Surely CMPS and SUBS are the same, is that not a documentation error?

According to my Propeller manual:

Instruction   -INSTR- ZCRI -CON-  -DEST-  -SRC-       Z Result  C Result          Result         Clocks


CMPS D, S     110000  000i 1111 ddddddddd sssssssss   D=S       Signed (D < S)    Not Written    4
SUBS D, S     110101  001i 1111 ddddddddd sssssssss   D-S=0     Signed Overflow   Written        4

Isn't a carry flag result of "Signed (D < S)" and "Signed Overflow" the same thing?

Seairth · 2014-04-19 19:58

jmg wrote: »

The Pins are quite configurable, but re-configure is going to be via serial word(s) send.

Other design approaches :

* The JP/JNP are one bit different in the opcode itself, and so code can self modify the opcode once, at config. Run time is then as fast as possible.

* Use conditional builds to select P/N versions, as polarity is unlikely to change during run.

I like your idea about modifying the instruction! On the transmit side, I suppose you could use the same technique to enable/disable the XOR that flips the bits. It should let you get rid of one of the TESTs though.

Actually, on transmit, I wonder why it's flipping only one bit at a time? It's not like the polarity will change in the middle of the byte.

Electrodude · 2014-04-19 20:21

evanh wrote: »

That didn't ring true so I had another careful look at the Prop1 datasheet and found that the Carry flag result for SUBS and CMPS is not the same. SUBS generates a "Signed Overflow" while:
CMPS "Signed (D < S)", can this be anything other than it's less than zero?, which looks to be pretty much identical to
CMP "Unsigned (D < S)", I would interpret this one as an unsigned borrow, and also
SUB "Unsigned Borrow".

I'm very sure the flags are the same, even if the descriptions are different (but still are equivalent). CMP wr = SUB and SUB nr = CMP, same for the -S and -X versions, as well as TEST and AND, and TESTN and ANDN. I'm pretty sure the only instructions whose behavior is changed by the R bit's value are (wr|rd)(byte|word|long), in which case the R bit decides if it's a read or write.

For comparing with cnt, I'm pretty sure you need 3 instructions (not counting the conditional jump or conditional whatever)

mov t1, cnt
sub t1, oldcnt
cmps t1, #0 wc wz
if_nc_and_nz jmp #timer_expired

This leaves both c and z clear if (cnt-oldcnt)>0 (signed compare), which is what you want. This is different from cnt>oldcnt, because of signedness. cnt is neither signed nor unsigned - it just increments and couldn't care less if you think its value is -1 or 4294967295. That's why you can't use just one sub or subs (or cmp or cmps).

electrodude

evanh · 2014-04-19 20:24

Seairth wrote: »

Except that I was trying to focus on using only those instructions/features which are available at the moment.

I'm struggling to see any value in that.

jmg · 2014-04-19 21:06

evanh wrote: »

I'm struggling to see any value in that.

It allows a size/speed comparison with P1, but we already know it will be much faster.

It would be useful to have an equivalent comparison of some P2-like features (just those already on the short list), and for FDS, the ideal target is to find minimal power, as power is still going to be important here.

That would presume a HW Task-SW Core, and a lowest power form of WAITxxx, so both Tx and Rx can spend most time in minimal power.

A Pin-indexed version would also be nice - JP & WAITP can work on any pin (unlike masking), but less clear is if there is a Tx equiv of that ?

It would be quite impressive to have a low-power, Pin-index capable UART that can be used in a Poll/response design of up to 32 Duplex channels.

evanh · 2014-04-19 21:22

Okay, to clear this up, let's try a 9600 bps example near the counter rollover point:

bitticks = clkfreq / baudrate = 80000000 / 9600 = 8333

Let's say CNT happens to be 4,294,967,000 at the time of first sample. So 296 clock ticks occur from then until unsigned rollover.

First pass results as follows ..

          mov     rxcnt,bitticks    ' 8333
          shr     rxcnt,#1          ' 4166
          add     rxcnt,cnt         ' 4294967000 + 4166 = 3870
          add     rxcnt,bitticks    ' 3870 + 8333 = 12203

:wait     mov     t1,rxcnt          ' 12203
          sub     t1,cnt            ' 12203 - 4294967012 (-284 signed) = 12487
          cmps    t1,#0   wc        ' C = D < S = false
if_nc     jmp     #:wait

The question is what happens to the Carry flag if it was: sub t1,cnt wc ? I'm guessing C gets set, which is not what's wanted.
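The arithmetic in this example can be checked with a short Python sketch that wraps every result to 32 bits. The variable names are illustrative, and the model of `sub ... wc` assumes C is the plain unsigned borrow, as the Prop1 datasheet wording quoted earlier suggests.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

clkfreq, baudrate = 80_000_000, 9600
bitticks = clkfreq // baudrate               # integer divide, as in the post

cnt_at_start = 4_294_967_000                 # CNT when rxcnt is set up
rxcnt = bitticks >> 1                        # half a bit period
rxcnt = (rxcnt + cnt_at_start) & MASK32      # wraps past 2**32
rxcnt = (rxcnt + bitticks) & MASK32          # first sample point

cnt_at_check = 4_294_967_012                 # CNT when the sub executes
t1 = (rxcnt - cnt_at_check) & MASK32         # sub t1,cnt
cmps_c = to_signed32(t1) < 0                 # cmps t1,#0 wc
sub_borrow = rxcnt < cnt_at_check            # assumed C of: sub t1,cnt wc
```

Under that assumption the borrow flag comes out set even though the sample point is still 12487 ticks away, which matches the guess that a bare SUB with WC would end the wait too early here.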

It seems it's important that CMPS has this mismatched behaviour, different from SUBS. Is this where the half-carry flag was used in other CPUs?

Evan

evanh · 2014-04-19 22:06

Heater. wrote: »

Isn't a carry flag result of "Signed (D < S)" and "Signed Overflow" the same thing?

Signed Overflow occurs at the other boundary - 2^31. So 2,147,483,640 + 10 will cause a signed overflow.

Signed (D < S) is a high bit copy of the result of D - S, ie: Is the result negative?
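A small Python sketch can show that the two documented carry descriptions are different conditions. It models "Signed Overflow" as the true result of D - S not fitting in 32 signed bits, and "Signed (D < S)" as an ordinary signed comparison; both functions are illustrative models of the manual's wording, not a description of the silicon.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def subs_carry(d, s):
    """Model of SUBS C ("Signed Overflow"): exact D - S is out of signed range."""
    exact = to_signed32(d) - to_signed32(s)
    return not (-(1 << 31) <= exact <= (1 << 31) - 1)

def cmps_carry(d, s):
    """Model of CMPS C ("Signed (D < S)"): ordinary signed comparison."""
    return to_signed32(d) < to_signed32(s)

small = (5, 10)                               # 5 - 10: no overflow, but 5 < 10
near_limit = (2_147_483_640, (-10) & MASK32)  # 2,147,483,640 - (-10) overflows
```

With these inputs the two models disagree in both directions: the first pair sets only the CMPS-style carry, and the second sets only the SUBS-style carry.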

jmg · 2014-04-20 00:11

If HW task SW with a low power compatible WAITCNT variant does not make it, should there be an opcode for a software-tasking equivalent of WAITCNT? The examples above use 4-5 lines to do a SW version.

evanh · 2014-04-20 01:59

jmg wrote: »

... should there be an opcode for software tasking equivalent of WAITCNT. The examples above use 4-5 lines to do a SW version.

It should be possible to have a special compare that can be branched on, cutting it to two instructions. Something like CMPRS D,S WC ' It would evaluate something like this: Carry = (D - S < #0)

I know I've written many a routine that would have used this for event triggering.
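As a rough illustration of the proposed compare, the following Python sketch models a hypothetical CMPRS whose carry is the sign bit of the wrapped 32-bit difference D - S. CMPRS does not exist on any Propeller; the function name and semantics are taken only from the one-line description above.

```python
MASK32 = 0xFFFFFFFF

def cmprs_carry(d, s):
    """Hypothetical CMPRS D,S WC: carry = sign bit of (D - S) wrapped to 32 bits."""
    return bool(((d - s) & MASK32) & 0x80000000)

# Event-trigger style use: the deadline has passed once carry is set.
deadline = 3870                 # a target that lies just after the counter rollover
before_wrap = 4_294_967_000     # counter value shortly before rollover
after_deadline = 4000           # counter value after rollover and after the deadline

pending = cmprs_carry(deadline, before_wrap)
expired = cmprs_carry(deadline, after_deadline)
```

In this model the comparison stays correct across the rollover, so the wait loop would shrink to the compare plus a conditional jump.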

pjv · 2014-04-20 07:55

evanh wrote: »

It should be possible to have a special compare that can be branched on, cutting it to two instructions. Something like CMPRS D,S WC ' It would evaluate something like this: Carry = (D - S < #0)

I know I've written many a routine that would have used this for event triggering.

Evanh,

Many times I have needed that precisely ! It will do a great job on COMPARES without concern for rollover.

Cheers,

Peter (pjv)

evanh · 2014-04-20 20:47

A bit of a (Potato) summary of the situation:

evanh wrote: »

Signed (D < S) is a high bit copy of the result of D - S, ie: Is the result negative?

At first look, my simple description appears to satisfy the requirement and that D-S<#0 doesn't seem to be needed. To be clear, however, this instruction, CMPS D,S WC, fails to always satisfy the simple "Is the result negative?" when dealing with circular numerics. If S, for example, is a little bigger than the positive overflow threshold, and D is just below overflow, then in a circular sense the two are close together but the compare treats them as at opposite ends of numerical range making D be treated as much bigger than S. We would get a false carry instead of getting the desired true carry. Similar issue with CMP but borrow is at #0 instead.

So, my description of CMPS was incomplete as it failed to address absolute overflow conditions. This is probably due to "high-bit copy of result" not being how CMPS works. Rather, it was how I wanted it to work.
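The failure case described here can be reproduced with a minimal Python sketch. It compares a model of CMPS (a true signed comparison) against the "sign of the wrapped difference" test, using one value just below the signed overflow threshold and one just above it. The specific values are arbitrary illustrations.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def cmps_carry(d, s):
    """Model of CMPS D,S WC: signed D < S, with no wraparound awareness."""
    return to_signed32(d) < to_signed32(s)

def circular_carry(d, s):
    """Sign of the wrapped difference D - S, i.e. 'D is behind S on the circle'."""
    return to_signed32(d - s) < 0

d = 0x7FFFFFFB      # just below the signed overflow threshold
s = 0x80000005      # a little past it, ten counts ahead of d

cmps_result = cmps_carry(d, s)
circular_result = circular_carry(d, s)
```

Away from the threshold the two tests agree, for example with D = 5 and S = 10; they diverge only when the operands straddle 2^31.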

It could be argued that requiring preconditioning is the price of niche functionality. However, as has been noted, binary addition and subtraction are explicitly constructed to produce numerically correct results when dealing with circular input. Why not also the logic bridging compares?

Or maybe there is a need for an appropriate flag so extended maths can also benefit? For that matter, can circular extending beyond the bit size of the ALU even be done?

evanh · 2014-04-21 02:52

evanh wrote: »

... can circular extending beyond the bit size of the ALU even be done?

I'll answer that yes myself. I don't think it's any harder than an ordinary extended subtraction, followed by, again, the most significant bit of the result being copied to carry.

But to do that as pure compares?

EDIT: You know what? I've almost never actually coded any assembly of any sort before. New experience for me. Now it seems to me, the low word has to be treated unsigned so: CMP D,S WC; CMPSX D+1,S+1 WC; will be a normal (little endian) extended compare.

So, for the extended circular variant it would be my imaginary new instruction at the end: CMP D,S WC; CMPRSX D+1,S+1 WC;
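The two-step extended compare can be sketched in Python as follows. This assumes CMP's carry is the unsigned borrow of the low longs and that CMPSX's carry is "signed (D < S + C)" on the high longs, which is how the Propeller 1 manual words it; the circular variant stands in for the imaginary CMPRSX and is purely hypothetical.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_signed32(x):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def cmp_borrow(d_lo, s_lo):
    """Model of CMP D,S WC on the low longs: unsigned borrow."""
    return (d_lo & MASK32) < (s_lo & MASK32)

def cmpsx_carry(d_hi, s_hi, c_in):
    """Assumed model of CMPSX D+1,S+1 WC: signed (D < S + C), S + C not wrapped."""
    return to_signed32(d_hi) < to_signed32(s_hi) + int(c_in)

def extended_less_than(d, s):
    """Signed 64-bit d < s, built from the two 32-bit steps above."""
    d_lo, d_hi = d & MASK32, (d >> 32) & MASK32
    s_lo, s_hi = s & MASK32, (s >> 32) & MASK32
    return cmpsx_carry(d_hi, s_hi, cmp_borrow(d_lo, s_lo))

def extended_circular_carry(d, s):
    """Hypothetical circular variant: sign bit of the wrapped 64-bit difference."""
    return bool(((d - s) & MASK64) >> 63)

pairs = [(0, 1), (1, 0), (5, 5), (-1, 0), (0, -1),
         (1 << 32, (1 << 32) + 5), ((1 << 32) + 5, 1 << 32),
         ((1 << 32) - 1, 1 << 32), (-(1 << 40), 3)]
results = [extended_less_than(d, s) for d, s in pairs]
```

For the sample pairs the two-step model agrees with a direct 64-bit signed comparison; the circular variant differs only when the operands straddle the 2^63 boundary.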

jmg · 2014-04-24 23:29

Seairth wrote: »

Except that I was trying to focus on using only those instructions/features which are available at the moment. If Chip adds hardware multitasking, we can see then how it will help. If he adds self-jumping WAITxxx instructions, we can see then how it will help. But until those features (and others) are actually added, I think we should avoid discussing them in this thread.

Morphing this, with your other idea of Timer extensions to the COG counters, I think can yield low impact solutions that are smaller and faster.

The new JP / JNP boolean opcodes are great for pin polling - they could be extended very slightly to allow flag polling (JNB/JB).

A spare Carry type BIT added to the CTRx config register can sticky-flag signal a timer way point, either in Adder mode (existing) or Reload Mode (new)

In the most compact form, a single JNB opcode would loop jmpsw until Timer WayPoint, and a following line clears the flag.

If there is opcode room, a simple Auto-clear variant of the opcode, for Flags would be
JNBC -> loops until Bit True, and on True, clears the Bit and exits with no jump.
