FullDuplexSerial for P1+
Seairth
Since we have been using FDS as an example bit of code that would be affected by all of the changes, I figured I'd start tweaking it to take advantage of the new instruction set. So far, I have made the following changes:
- Used RDQUAD in a few places to reduce the number of hub accesses
- Much shorter setup code
- Rearranged a few blocks to reduce a couple of the hub access stalls
- Added some additional comments (this was to help me, mostly, but I figured I'd leave them in)
- Replaced JMPRET with JMPSW. This was a drop-in replacement.
Things this does not take into account:
- Any changes to the SPIN language
- I believe PAR is going away, but it's not clear to me what's replacing it. So the current code assumes PAR still exists.
- The smart pin stuff might get rid of some of the serial bit-banging, but I don't know enough about it at this point to make those changes.
As it stands, this code is slightly shorter than the original version (and hopefully bug free). Given the general increase in P1+ performance (5x MIPS) and the reduction of a few hub operations, this version might be able to get up to an 8x increase in baud rate.
FullDuplexSerial_P1+.spin
Comments
Even if Pin Cell can do some Serial work, the general case of SW FDS is still useful.
Maybe the new pin-test opcodes can also help?
JP D/#,S/@ (jump if pin IN high, pins registered at beginning of ALU cycle)
JNP D/#,S/@ (jump if pin IN (not) high, pins registered at beginning of ALU cycle)
Um, excuse me if my lack of experience shows but can't the SUB generate a zero flag result and the JMP act on that rather than using a CMPS?
EDIT: Of course if these routines were running independently then they could do away with the tight loops and instead use WAITCNT instructions. I think Chip got the waits sorted so they can share the hardware time slicing?
Z would only work if cnt were exactly the same as t1 at the sub instruction. C is used to detect the less accurate condition where cnt passes rxcnt. You can't use a SUBS here because cnt is unsigned.
As for the WAITCNT, that was correct for P2. In that case, WAITCNT (or was it called something else? I'm already forgetting!) would self-jump to keep from stalling the pipeline. Depending on the task scheduling, you could overshoot your mark just like the above code, but I suspect it would be more accurate on average. If hardware multitasking were available in P1+, I agree that this is likely the better way to go (with a bit of TLOCK, or whatever it was called, for the hubop blocks). But that's neither here nor there, since P1+ doesn't currently do hardware multitasking.
The only place I found these to be useful was on the beginning of the receive, looking for the start bit:
I think that's right (and assumes that rx_pin is a cog register). In a way, this might be more readable. But it didn't reduce the number of instructions, unfortunately.
It does, however, shave off 2 clock cycles for non-inverted signalling. Which makes me wonder if these shouldn't be used after all, because it potentially changes the maximum baud rate depending on whether receive signalling is inverted. I suspect, though, that this bit of code is not what's limiting the baud rate.
Comparing is exactly what we are wanting to do here. The reason why this topic exists is because of the earlier discussion over the advantage of fine-grained time slicing vs any sort of cooperative multitasking.
Chip did do something to the WAITs of the Prop2 design to allow them to function cleanly with more than one thread active. It was something like self-repeating in the pipeline instead of going into a static wait state.
That didn't ring true so I had another careful look at the Prop1 datasheet and found that the Carry flag result for SUBS and CMPS is not the same. SUBS generates a "Signed Overflow" while:
CMPS "Signed (D < S)", can this be anything other than it's less than zero?, which looks to be pretty much identical to
CMP "Unsigned (D < S)", I would interpret this one as an unsigned borrow, and also
SUB "Unsigned Borrow".
Hmm, that means SUB should do the job just right!
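To make the four Carry definitions quoted above easier to compare, here is a rough Python model of them. It is only a sketch built from the datasheet wording as quoted in this post; the function names are made up for illustration and nothing here is taken from the actual FDS source.

```python
MASK = 0xFFFFFFFF  # model 32-bit cog registers


def to_signed(x):
    """Reinterpret a 32-bit pattern as a two's complement signed value."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


def carry_sub(d, s):
    """SUB: C = unsigned borrow."""
    return (d & MASK) < (s & MASK)


def carry_cmp(d, s):
    """CMP: C = unsigned (D < S). Same test as the SUB borrow."""
    return (d & MASK) < (s & MASK)


def carry_subs(d, s):
    """SUBS: C = signed overflow of D - S."""
    exact = to_signed(d) - to_signed(s)
    return not (-(1 << 31) <= exact < (1 << 31))


def carry_cmps(d, s):
    """CMPS: C = signed (D < S)."""
    return to_signed(d) < to_signed(s)


if __name__ == "__main__":
    for d, s in [(5, 10), (0xFFFFFFFF, 1), (0x80000000, 1)]:
        print(hex(d), hex(s),
              "SUB/CMP:", carry_sub(d, s),
              "SUBS:", carry_subs(d, s),
              "CMPS:", carry_cmps(d, s))
```

Under this model SUB and CMP always agree on C, while SUBS and CMPS disagree for an ordinary pair like 5 and 10, and CMP and CMPS disagree as soon as one operand has its top bit set.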
The Pins are quite configurable, but reconfiguring them is going to be done by sending serial word(s).
Other design approaches :
* The JP/JNP are one bit different in the opcode itself, and so code can self modify the opcode once, at config.
Run time is then as fast as possible.
* Use conditional builds to select P/N versions, as polarity is unlikely to change during run.
Except that I was trying to focus on using only those instructions/features which are available at the moment. If Chip adds hardware multitasking, we can see then how it will help. If he adds self-jumping WAITxxx instructions, we can see then how it will help. But until those features (and others) are actually added, I think we should avoid discussing them in this thread.
How is cnt signed or unsigned? It's just a 32 bit number that counts up and rolls over right?
Surely CMPS and SUBS are the same, is that not a documentation error?
According to my Propeller manual:
Isn't a carry flag result of "Signed (D < S)" and "Signed Overflow" the same thing?
I like your idea about modifying the instruction! On the transmit side, I suppose you could use the same technique to enable/disable the XOR that flips the bits. It should let you get rid of one of the TESTs though.
Actually, on transmit, I wonder why it's flipping only one bit at a time? It's not like the polarity will change in the middle of the byte.
For comparing with cnt, I'm pretty sure you need 3 instructions (not counting the conditional jump or conditional whatever). This sets c or z if (cnt-oldcnt)>0 (signed compare), which is what you want. This is different from cnt>oldcnt, because of signedness. cnt is neither signed nor unsigned - it just increments and couldn't care less if you think its value is -1 or 4294967295. That's why you can't use just one sub or subs (or cmp or cmps).
electrodude
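The difference between (cnt-oldcnt)>0 and cnt>oldcnt described above can be sketched in Python. This is an illustration only, assuming a free-running 32-bit counter; `elapsed_signed` and `elapsed_naive` are hypothetical helper names, not anything from the attached code.

```python
MASK = 0xFFFFFFFF  # cnt is a free-running 32-bit counter


def to_signed(x):
    """Reinterpret a 32-bit pattern as a two's complement signed value."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


def elapsed_signed(cnt, oldcnt):
    """Subtract first (mod 2^32), then test the sign of the difference."""
    return to_signed((cnt - oldcnt) & MASK) > 0


def elapsed_naive(cnt, oldcnt):
    """Direct unsigned compare, which breaks across the rollover."""
    return (cnt & MASK) > (oldcnt & MASK)


if __name__ == "__main__":
    # cnt is 32 ticks past oldcnt, but has rolled over in between
    print(elapsed_signed(0x00000010, 0xFFFFFFF0))  # True
    print(elapsed_naive(0x00000010, 0xFFFFFFF0))   # False
```

The signed-difference form stays correct as long as the two values are less than 2^31 ticks apart.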
It allows a size/speed comparison with P1, but we already know it will be much faster.
It would be useful to have an equivalent comparison using some P2-like features (just those already on the short list), and for FDS, the ideal target is to find minimal power, as Power is still going to be important here.
That would presume a HW Task-SW Core, and a lowest power form of WAITxxx, so both Tx and Rx can spend most time in minimal power.
A Pin-indexed version would also be nice - JP & WAITP can work on any pin (unlike masking), but less clear is if there is a Tx equiv of that ?
It would be quite impressive to have a low-power, Pin-index capable UART that can be used in a Poll/response design of up to 32 Duplex channels.
bitticks = clkfreq / baudrate = 80000000 / 9600 = 8333
Let's say CNT happens to be 4,294,967,000 at the time of first sample. So 296 clock ticks occur from then until unsigned rollover.
First pass results as follows ..
The question is what happens to the Carry flag if it was: sub t1,cnt wc ? I'm guessing C gets set, which is not what's wanted.
It seems it's important that CMPS has this mismatched behaviour, different from SUBS. Is this where the half-carry flag was used in other CPUs?
Evan
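The numbers in this example can be run through a small Python model. It is only a sketch: it assumes t1 holds the wrapped target count (first sample plus one bit time), models SUB's C as an unsigned borrow, and uses a hypothetical sample point 100 ticks after the first sample.

```python
MASK = 0xFFFFFFFF

clkfreq = 80_000_000
baudrate = 9600
bitticks = clkfreq // baudrate            # 8333

cnt_at_sample = 4_294_967_000
ticks_to_rollover = (1 << 32) - cnt_at_sample           # 296
target = (cnt_at_sample + bitticks) & MASK              # 8037 after wrapping

# Hypothetical moment 100 ticks later: still before the rollover,
# and well before the target.
cnt_now = (cnt_at_sample + 100) & MASK

# "sub t1,cnt wc" modelled as unsigned borrow with t1 = target
borrow = target < cnt_now                                # C would be set

# Signed view of the same subtraction result
diff = (target - cnt_now) & MASK
remaining = diff - (1 << 32) if diff & 0x80000000 else diff

print(bitticks, ticks_to_rollover, target)
print("borrow:", borrow, "ticks remaining:", remaining)
```

In this model the unsigned borrow is already set while 8233 ticks remain, which matches the guess above that C from a plain SUB is not what's wanted here.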
Signed Overflow occurs at the other boundary - 2^31. So 2,147,483,640 + 10 will cause a signed overflow.
Signed (D < S) is a high bit copy of the result of D - S, ie: Is the result negative?
It should be possible to have a special compare that can be branched on. Cutting it to two instructions. Something like CMPRS D,S WC ' It would evaluate something like this: Carry = (D - S < #0)
I know I've written many a routine that would have used this for event triggering.
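The signed overflow boundary mentioned above can be checked with a short Python sketch. `add_signed32` is a hypothetical helper that models a 32-bit two's complement add; it is not a Propeller instruction.

```python
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def add_signed32(a, b):
    """Return (wrapped 32-bit signed result, signed-overflow flag)."""
    exact = a + b
    overflow = not (INT32_MIN <= exact <= INT32_MAX)
    wrapped = ((exact + (1 << 31)) & 0xFFFFFFFF) - (1 << 31)
    return wrapped, overflow


if __name__ == "__main__":
    print(add_signed32(2_147_483_640, 10))  # crosses 2^31
    print(add_signed32(2_147_483_640, 7))   # lands exactly on INT32_MAX
```

Crossing 2^31 flips the sign of the wrapped result and raises the overflow flag, while the unsigned rollover at 2^32 is a different boundary altogether.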
Evanh,
Many times I have needed that precisely ! It will do a great job on COMPARES without concern for rollover.
Cheers,
Peter (pjv)
At first look, my simple description appears to satisfy the requirement and that D-S<#0 doesn't seem to be needed. To be clear, however, this instruction, CMPS D,S WC, fails to always satisfy the simple "Is the result negative?" when dealing with circular numerics. If S, for example, is a little bigger than the positive overflow threshold, and D is just below overflow, then in a circular sense the two are close together but the compare treats them as at opposite ends of numerical range making D be treated as much bigger than S. We would get a false carry instead of getting the desired true carry. Similar issue with CMP but borrow is at #0 instead.
So, my description of CMPS was incomplete as it failed to address absolute overflow conditions. This is probably due to "high-bit copy of result" not being how CMPS works. Rather, it was how I wanted it to work.
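Here is a small Python sketch of the case described above. CMPS is modelled from its "Signed (D < S)" description, and CMPRS is the imaginary instruction proposed earlier, modelled as "Carry = sign bit of (D - S) mod 2^32". Both functions and the operand values are assumptions for illustration only.

```python
MASK = 0xFFFFFFFF


def to_signed(x):
    """Reinterpret a 32-bit pattern as a two's complement signed value."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


def carry_cmps(d, s):
    """CMPS as documented: C = signed (D < S)."""
    return to_signed(d) < to_signed(s)


def carry_cmprs(d, s):
    """Hypothetical CMPRS: C = high bit of the wrapped difference D - S."""
    return bool(((d - s) & MASK) & 0x80000000)


if __name__ == "__main__":
    d = 0x7FFFFFF0   # just below the positive overflow threshold
    s = 0x80000005   # a little past it; D is 0x15 ticks behind S
    print("CMPS :", carry_cmps(d, s))    # False: treated as far apart
    print("CMPRS:", carry_cmprs(d, s))   # True: D is circularly behind S
```

For operands that sit well away from the 2^31 boundary, such as 5 and 10, both models give the same carry.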
It could be argued that requiring preconditioning is the price of niche functionality. However, as has been noted, binary addition and subtraction are explicitly constructed to produce numerically correct results when dealing with circular input. Why not also the logic bridging compares?
Or maybe there is a need for an appropriate flag so extended maths can also benefit? For that matter, can circular extending beyond the bit size of the ALU even be done?
I'll answer that yes myself. I don't think it's any harder than an ordinary extended subtraction, followed by, again, the most significant bit of the result being copied to carry.
But to do that as pure compares?
EDIT: You know what? I've almost never actually coded any assembly of any sort before. New experience for me. Now it seems to me, the low word has to be treated unsigned so: CMP D,S WC; CMPSX D+1,S+1 WC; will be a normal (little endian) extended compare.
So, for the extended circular variant it would be my imaginary new instruction at the end: CMP D,S WC; CMPRSX D+1,S+1 WC;
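A rough Python model of the two extended compares, to check the idea on paper. Assumptions: CMP gives C = unsigned (D < S), CMPSX gives C = signed (D < S+C) evaluated without intermediate wraparound, and CMPRSX is the imaginary instruction from this post, taken to mean "C = high bit of (D - S - C) mod 2^32". All function names are hypothetical.

```python
MASK = 0xFFFFFFFF


def to_signed(x):
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


def split64(x):
    """Return (low long, high long) of a 64-bit value, little endian."""
    return x & MASK, (x >> 32) & MASK


def cmp_wc(d, s):
    return (d & MASK) < (s & MASK)           # unsigned borrow from low long


def cmpsx_wc(d, s, c):
    return to_signed(d) < to_signed(s) + c   # signed (D < S+C)


def cmprsx_wc(d, s, c):
    return bool(((d - s - c) & MASK) & 0x80000000)  # imaginary instruction


def extended_cmps(d, s):
    (d_lo, d_hi), (s_lo, s_hi) = split64(d), split64(s)
    return cmpsx_wc(d_hi, s_hi, cmp_wc(d_lo, s_lo))


def extended_cmprs(d, s):
    (d_lo, d_hi), (s_lo, s_hi) = split64(d), split64(s)
    return cmprsx_wc(d_hi, s_hi, cmp_wc(d_lo, s_lo))


if __name__ == "__main__":
    d, s = 0x7FFFFFFFFFFFFFF0, 0x8000000000000005
    print("extended signed  :", extended_cmps(d, s))
    print("extended circular:", extended_cmprs(d, s))
```

In this model the circular result is just bit 63 of the full 64-bit difference, which fits the earlier remark that it is no harder than an ordinary extended subtraction.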
Morphing this, with your other idea of Timer extensions to the COG counters, I think can yield low impact solutions that are smaller and faster.
The new JP / JNP boolean opcodes are great for pin polling - they could extend very slightly to allow Flag polling.(JNB/JB)
A spare Carry type BIT added to the CTRx config register can sticky-flag signal a timer way point, either in Adder mode (existing) or Reload Mode (new)
In the most compact form, a single JNB opcode would loop jmpsw until Timer WayPoint, and a following line clears the flag.
If there is opcode room, a simple Auto-clear variant of the opcode, for Flags would be
JNBC -> loops until Bit True, and on True, clears the Bit and exits with no jump.