Fast Full-Duplex Serial, 1 Cog - a.k.a. FFDS1

lonesock Posts: 913
edited 2013-06-06 - 09:31:30 in Propeller 1
Hi, everybody.

I'm ready to upload this to the OBEX, but I would like a little more peer review before tempting a prop newbie to try my code if it's buggy. [8^)

The main FFDS1 features are basically the components of the name. [8^)
* fast : 460_800 baud on an 80 MHz prop
* full-duplex : FFDS1 can handle full speed block TX and RX at the same time
* no jitter : um, no jitter
* FFDS1 uses only a single cog.

Any and all feedback appreciated...bugs, features, documentation, etc.

thanks,
Jonathan

P.S. This version is a bit updated over the prelim version posted a while back on another thread. This will be the official place to get FFDS1 until such time as it goes into the OBEX.
Free time status: see my avatar [8^)
F32 - fast & concise floating point: OBEX, Thread
Unrelated to the prop: KISSlicer

Comments

  • jmg Posts: 14,122
    edited 2012-11-02 - 12:03:48
    More generic usage suggestions:
    Allow a constant value to be set for the Tx/Rx bit counts?
    Edit: Allow a constant value to be set for the number of stop bits?

    Even though the 'standard' PC serial is 8 bits, there are instances where this would be used Prop-Prop, and there, up to 32 bits is easily handled.
    This can save 'which byte' message fragment handling.
    The baud precision required is of course tighter than for 8 bits, but easily met with a crystal-clocked Prop.
    A defined STOP bit count can help pace Send speeds, and give remote units a known time to (eg) flip RS485 direction in half-duplex designs.

    Other extensions could be parity/checksum style check bits, and/or an address bit handler.
    The address bit can be forced parity on a PC, or a host uC with a 9-bit mode, or it can be managed in the serial cog.

    We did what I called a twisted ring daisy chain of many small micros running at high bauds.

    Here the protocol rule was simple Address-bit-edge based :
    The N bytes following an ABE =\_ were mine, and in those slots a reply was inserted, with AB held high. All other cases are simply echoed.
    The AB edge thus moves as the packet travels around the ring.
    The result is that a long TX message drops N bytes to every node, in chain order, and the incoming data has reply info from all nodes.
    If you send more than Node count, those bytes arrive back unchanged, so you can check total installed chain size.
  • Duane C. Johnson Posts: 955
    edited 2012-11-02 - 12:17:59
    Hi lonesock;
    lonesock wrote: »
    The main FFDS1 features are basically the components of the name. [8^)
    * fast : 460_800 baud on a 80 MHz prop
    That's not very fast. There are others that run much faster.

    Duane J
  • jmg Posts: 14,122
    edited 2012-11-02 - 12:57:34
    That's not very fast. There are others that run much faster.

    I thought 465116 Baud was quite good for Full Duplex in 1 COG ?
  • lonesock Posts: 913
    edited 2012-11-02 - 13:40:27
    Stop bits would be easy enough to change. Using > 8 data bits would be a bit trickier, not because of the transmission code itself, but because I would have to find a different interface mechanism (i.e. are the consecutive values to send stored in words or longs, with possible wasted bits, or packed together contiguously, requiring hub ops at random times?). I guess just extending to 16 or 32 would work easily. (Conditional compiling would be very nice in this application.)

    (btw, I really like your ring protocol.)

    In terms of features to add, I can't really go too crazy with this, not and still hit 460.8 kbps at 80MHz (which is one of my main goals). I was hoping to maybe add in software flow control, but even that might get too crazy. [8^)

    Regarding speed, I could not think of a faster way to do jitter-free RX as well as TX in one cog. The cog samples at 1/2 bit period intervals, and RX uses a counter to see how much of the start bit has been captured once it arrives. If it has been on < 1/4 bit period, then I wait one more 1/2 bit period before sampling RX. This guarantees that I sample RX at least 1/4 bit period from either edge. I know other people have done high-speed drivers, and jitter-free drivers, and one-cog drivers (off the top of my head I recall the PBnJ driver, and Kye had one too).

    thanks,
    Jonathan
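
The sampling rule described above can be checked with a little arithmetic. The following is a minimal C++ sketch, not code from FFDS1: it assumes a bit period of 172 clock ticks (80 MHz / 172 is roughly the 465116 baud figure quoted earlier in the thread), it assumes that later samples follow one full bit period apart, and the names `sample_offset` and `edge_margin` are made up for illustration. It models the half-bit poll that first sees the start bit and returns where inside each bit the receiver ends up sampling.

```cpp
#include <algorithm>
#include <cstdint>

// Assumed bit period in system clock ticks (80 MHz / 172 is about 465116 baud).
constexpr int32_t kBitTicks = 172;
constexpr int32_t kHalfBit = kBitTicks / 2;     // polling interval
constexpr int32_t kQuarterBit = kBitTicks / 4;  // minimum distance wanted from an edge

// `captured` is how many ticks of the start bit had already elapsed when a
// half-bit poll first saw RX low (0 < captured <= kHalfBit).
// Returns the offset from the leading edge of each bit at which RX is sampled.
int32_t sample_offset(int32_t captured) {
    // Too close to the leading edge: wait one more half-bit period.
    if (captured < kQuarterBit) {
        return captured + kHalfBit;
    }
    return captured;
}

// Distance in ticks from the sample point to the nearest bit edge.
int32_t edge_margin(int32_t captured) {
    const int32_t offset = sample_offset(captured);
    return std::min(offset, kBitTicks - offset);
}
```

Under these assumptions, every possible `captured` value from 1 to 86 ticks gives a margin of at least 43 ticks, i.e. a quarter of a bit period from either edge, which is the guarantee claimed in the post.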
  • Duane C. Johnson Posts: 955
    edited 2012-11-02 - 16:45:24
    My bad %^(

    I just didn't see the underline and read it as 460 to 800 baud.
    I apologize for my stupidity.

    Duane J
  • jmg Posts: 14,122
    edited 2012-11-02 - 16:51:44
    I just didn't see the underline and read it as 460 to 800 baud.
    Hehe, now that is slow...
  • lonesock Posts: 913
    edited 2012-11-02 - 17:24:35
    ...
    I just didn't see the underline and read it as 460 to 800 baud.
    That is both understandable and very funny! I adopted that Spin convention in my own notes because it is less confusing than commas for those who use them instead of decimal points. Now it just sort of slips out every once in a while. [8^)

    Jonathan
  • SRLM Posts: 5,045
    edited 2012-12-05 - 21:47:49
    I think I may have found a bug. At the very least, it's not working for me.

    The problem seems to be somewhere in the PASM receive function, and it has to do with receiving too much data at once. Here is main.spin:
    CON
    	_CLKMODE          = XTAL1 + PLL16x
    	_XINFREQ          = 5_000_000 
    	
    '*******************************************************************************	
    OBJ
    	debug		: "FFDS1_66.spin"   
    
    CON
    	debug_rxpin = 31
    	debug_txpin = 30
    	debug_baud = 115200
    
    PUB Main
    
    	debug.Start(debug_rxpin, debug_txpin, debug_baud)
    	
    	waitcnt(CLKFREQ + CNT)
    
    	repeat
    		debug.Tx(debug.Rx)
    
    

    As you can see, it simply echoes the characters you send it. When I type in characters by hand, it works fine. But when I try to send a bunch of characters (such as the characters in the code block above), it produces bad output:
    Terminal ready
    
    *** file: main.spin
    ascii-xfr -s -c 0 main.spin 
    ASCII upload of "main.spin"
    
    
    *** exit status: 0
    CN
      _LME     XA1 L1x
                      _IFE     =50_0 	***************************************	BJ&#65533;(V!$&#65533;&#65533;e&#65533;r&#65533;&#65533;
    
        eu_xi  1                                                                                O
    P X-&#65533;&#65533;(V&#65533;10euxi 3
              U&#65533;}`i,dbgtp,dgbu)
    &#65533;   ctCKRH&#65533;(*&#65533;&#65533;($&#65533;&#65533;dbgT(eu.x
    Thanks for using picocom
    

    This output changes each time I run the program.

    I've also tried running two instances of the FFDS1 object with one Tx'ing into the other's Rx, and it works fine (as expected: that's what the test program does). Transmit also works fine: I can transmit all sorts of things without error.

    As a final test, I've tried hooking it up to a GPS outputting at 9600, and get similar results.
  • lonesock Posts: 913
    edited 2012-12-06 - 10:44:38
    SRLM wrote: »
    ...
    As a final test, I've tried hooking it up to a GPS outputting at 9600, and get similar results.
    OK, thanks. The fact that this error can happen at 9600 is scary! I will look into this right away!

    thanks,
    Jonathan
  • kuroneko Posts: 3,623
    edited 2012-12-07 - 15:58:58
    Jonathan asked me to post this fix. I'll attach it as rev 0.9. The problem was the recovery time needed between processing the stopbit and scanning for a new startbit. This has now been shortened by one half-bit period.
    {417}         jmpret    lockstep_ret, tx_jump  
    {418}         jmp       #rx_cleanup    
    
    has been replaced by
    {417}         tjz       phsb, #rx_main wr
    
    The original version used this sequence to restart the receiver with the criteria that a hubop needs a lockstep jmpret following within 2 insns:
    wrword    rx_ptr, update_head_ptr
                                                   
                  jmpret    lockstep_ret, tx_jump   ' <- highlighted in the original post
                  jmp       #rx_cleanup    
    
    rx_cleanup    mov       phsb, #0
                                     
    rx_main       jmpret    lockstep_ret, tx_jump
    
    IOW, simply removing the first jmpret wouldn't have been enough. While it did work for my test setup, hubop restrictions are usually there for a reason. Which brings us to tjz, which does the jump (shadow[phsb] always zero) but also writes back (wr) to phsb, therefore clearing both shadow and counter register.
  • lonesock Posts: 913
    edited 2012-12-07 - 20:35:00
    kuroneko wrote: »
    ...
    Which brings us to tjz which does the jump (shadow[phsb] always zero) but also writes back (wr) to phsb therefore clearing both shadow and counter register.
    This is pure PASM gold...thanks!

    Jonathan
  • MagIO2 Posts: 2,181
    edited 2012-12-28 - 10:53:14
    @lonesock:
    Finally I found some time for testing your FFDS1. Thanks for sharing this great object! It'll be assimilated into my codebase ;o)

    Some little things I'll change right away, some changes might come after further testing.
    Right away changes:
    1. reordering the variables -> the variables needed by the PASM-part should be in sequence and separated from variables only needed by the SPIN-part. This way I can use memory allocated during runtime (rudimentary memory management) instead of memory allocated by the compiler.
    2. doing the setup of PASM in PASM and not via injection in the start function -> same reason, separation of cognew and start-function
    3. moving the 'wait for end of transmission' from end of function to start of function -> I think in a lot of use-cases, you output a buffer and do other things after that. So, why block the 'do other things'-part for the whole transmission time? This only makes sense if you directly want to overwrite the transfer-buffer. But for this case you have the waittx-function. (To be honest, the waittx function currently does not make sense because tx already waits and it's not safe to use it for syncing across COGs).
    This is a point which might be of general interest. Of course I see that it is more beginner-friendly to keep it like it is, but I also think that it'll increase the net-transfer-rate and overall program-speed if doing it the other way around. Maybe it makes sense to have both versions?

    Possible changes:
    Maybe it's faster to generate the whole output-string before sending (hex,dec,bin) instead of calling the single character tx-function.

    I found that kind of weakness in the FDS when experimenting with the raspberry. Transmission worked without problems up to ... I don't remember, but some hundred kbit/sec, but after 115200 the net transfer-speed did not increase because the SPIN-part simply is too slow to deliver the bytes fast enough. That's why your driver is a great improvement.
  • SRLM Posts: 5,045
    edited 2012-12-30 - 19:40:44
    I may have found an issue with the rxtime(ms) method. Here it is in Spin:
    PUB RxTime(ms) : rxbyte | tout
    {{
      * Waits for a byte to be received or a timeout to occur.
      > ms : the number of milliseconds to wait for an incoming byte
      < returns -1 if no byte received, $00..$FF if byte
    
      e.g. if (c := RxTime( 10 )) < 0
    }}
      tout := clkfreq / 1000 * ms + cnt
      repeat
        rxbyte := RxCheck
      while (rxbyte < 0) and ((cnt - tout) < 0)
    


    I've converted it to C++, and done my testing there, but I believe that the functionality is the same. To be specific, I have not tested with the Spin version of the program. In any case, here is the C/C++ version:
    int32_t Serial::GetCTime(int32_t ms)
    {
      int32_t tout = ((CLKFREQ / 1000) * ms) + CNT;
      int32_t rxbyte = 0;
      do {
        rxbyte = GetCCheck();
      } while ((rxbyte < 0) && ((CNT - tout) < 0));
      return rxbyte;
    }
    

    The Problems
    1) It doesn't seem to wait. I don't know why this is, but it definitely does not.
    2) If CNT wraps around while the function is waiting, it will exit prematurely (this is the same for Spin for sure).

    I wrote the following (in C/C++) to correct these issues:
    int32_t Serial::GetCTime(int32_t ms)
    {
    	int tout = (CLKFREQ/1000)*ms;
    	int rxbyte;
    	int totaltime = 0;
    	int previous_cnt = CNT;
    	int current_cnt;
    	do
    	{
    		rxbyte = GetCCheck();
    		current_cnt = CNT;
    		totaltime += current_cnt-previous_cnt;
    		previous_cnt = current_cnt;
    	}while ( rxbyte < 0 && totaltime < tout);
    	return rxbyte;
    }
    

    I couldn't figure out what might cause problem #1, though.
  • kuroneko Posts: 3,623
    edited 2012-12-30 - 20:08:52
    @SRLM: FWIW, the SPIN version works just fine, just verified a 26sec timeout (@80MHz).
  • kuroneko Posts: 3,623
    edited 2012-12-30 - 20:29:45
    SRLM wrote: »
    int32_t Serial::GetCTime(int32_t ms)
    {
      int32_t tout = ((CLKFREQ / 1000) * ms) + CNT;
      int32_t rxbyte = 0;
      do {
        rxbyte = GetCCheck();
      } while ((rxbyte < 0) && ((CNT - tout) < 0));  // "CNT - tout" was highlighted in the original post
      return rxbyte;
    }
    
    I believe the highlighted part is giving you trouble. CNT is a #define for _CNT which in turn is defined as
    extern _COGMEM volatile unsigned int _CNT __asm__("CNT");
    
    That's unlikely to work. Don't know right now what the compiler is going to make of it, most likely an unsigned expression (never < 0). In your second example you side-step this by assigning the unsigned CNT to a signed int.
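
The signed/unsigned point can be illustrated in isolation. Below is a minimal C++ sketch that uses a plain `uint32_t` in place of the Propeller's `CNT` register; `deadline_after_ms` and `before_deadline` are hypothetical helper names, not part of FFDS1 or PropGCC. Casting the unsigned difference to `int32_t` restores the behaviour of the Spin expression `(cnt - tout) < 0`, and that comparison stays correct across a counter wrap as long as the timeout is shorter than 2^31 ticks (about 26.8 s at 80 MHz).

```cpp
#include <cstdint>

// Deadline `ms` milliseconds after `now`. Unsigned arithmetic wraps modulo 2^32,
// just like a free-running 32-bit counter.
uint32_t deadline_after_ms(uint32_t now, uint32_t clkfreq, uint32_t ms) {
    return now + (clkfreq / 1000u) * ms;
}

// True while `now` has not yet reached `deadline`.
// The unsigned difference (now - deadline) can never be negative, so it is
// reinterpreted as a signed value before the comparison. The conversion is
// modular on two's-complement targets (and guaranteed so since C++20).
bool before_deadline(uint32_t now, uint32_t deadline) {
    return static_cast<int32_t>(now - deadline) < 0;
}
```

A polling loop would then read something like `while (rxbyte < 0 && before_deadline(CNT, tout))`, assuming `CNT` yields the current counter value as an unsigned 32-bit integer.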
  • Circuitsoft Posts: 1,018
    edited 2013-01-04 - 09:35:27
    Kuroneko, I've been making my own attempt to port this driver to gcc/gas in a maintainable manner, and I'm lost. Can you take a look at http://vdubshouse.zapto.org/parallax/#/c/2/?

    Thanks.
  • kuroneko Posts: 3,623
    edited 2013-01-04 - 18:26:51
    Kuroneko, I've been making my own attempt to port this driver to gcc/gas in a maintainable manner, and I'm lost. Can you take a look at http://vdubshouse.zapto.org/parallax/#/c/2/?

    How do you build the test? Do you have a Makefile for this or at least a command line? Just so we are on the same page.
  • Circuitsoft Posts: 1,018
    edited 2013-01-05 - 11:57:43
    propeller-gcc-elf -DTEST ffds1.c ffds1m.S -o ffds1
    propeller-load -r ffds1
  • Circuitsoft Posts: 1,018
    edited 2013-01-05 - 12:27:38
    Oh, in case you don't have a Gerrit account, I attached a snapshot.
  • kuroneko Posts: 3,623
    edited 2013-01-05 - 17:58:48
    From code review I noticed 2 things. You will have issues later with the timed methods, CNT is declared as unsigned and the compiler issues a simple cmp (see earlier postings). Also, your coginit is a complete mess. Function arguments are ID, code, par (without any shifts & Co applied), i.e.
    thisobj.Cog = 1 + coginit(8, &fds_entry, &thisobj.Write_buf_ptr);
    
    This will at least produce meaningful PASM code. However, the code entry address resolves as 0 (as far as the object dump tells me). Which means it's either resolved at load time or has to be prepared differently.
  • Circuitsoft Posts: 1,018
    edited 2013-01-05 - 22:59:02
    I'm now back on the computer I did this on, and found my actual command lines:

    propeller-elf-gcc -g3 -DTEST ffds1.c ffds1m.S -o ffds1
    propeller-load -D clkmode=xtal1+pll8x -D clkfreq=96mhz -e ffds1

    The board is a QuickStart board, but I desoldered the 0-ohm resistor on the crystal, and attached my own 12MHz crystal to the solder pads. I had done this to play with some USB things, but haven't gotten around to it on this board yet. Anyway, the hardware works with spin programs, but I haven't figured out what I need to change to make C work yet. I suppose, for now, I'd better just switch the hardware back.
  • SRLM Posts: 5,045
    edited 2013-04-09 - 23:42:45
    What would be the best way to add RTS and CTS pins to the PASM code? In particular, I'm interested in RTS.
  • Tracy Allen Posts: 6,404
    edited 2013-04-10 - 08:44:34
    Cody,
    Jonathan's code is written tight, so messing with it is upon your own cog-nition. But take a look at how it is done in fullDuplexSerial4port. It is not a heavy load. In between received characters, you are testing for a new start bit. If one is not detected, then you jump to a routine that compares the difference between the head and tail pointers to the programmed % of buffer full. If above that percent, it sets the RTS to its stop state and then goes on back to looking for the start bit.

    Some outside devices take some time to respond to RTS and may empty their xmt buffer before halting transmission, so the buffer has to be big enough to account for that. The RTS process will mess up if you are operating on the edge of the speed zone, where input characters are arriving head to tail with one stop bit. It would be fine at lower speeds or when incoming characters are paced.

    A parameter in the init method in fullDuplexSerial4port selects the pin to use for RTS, or -1 if none. If none is selected, the initialization code changes the JMP to the RTS processing to a NOP.
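
The buffer-threshold check described in this post can be sketched as follows. This is an illustrative C++ model only, not code from fullDuplexSerial4port or FFDS1: the function names, the ring-buffer layout (`head` is the next write index, `tail` the next read index) and the percentage parameter are all assumptions.

```cpp
#include <cstddef>

// Number of bytes waiting in a ring buffer with `size` slots.
std::size_t bytes_buffered(std::size_t head, std::size_t tail, std::size_t size) {
    return (head + size - tail) % size;
}

// True when the fill level is above `percent_full` percent of the buffer,
// i.e. when RTS should be put into its "stop sending" state.
bool rts_should_stop(std::size_t head, std::size_t tail,
                     std::size_t size, std::size_t percent_full) {
    return bytes_buffered(head, tail, size) * 100 > size * percent_full;
}
```

With a 64-byte buffer and a 75% threshold, for example, RTS would be asserted once more than 48 bytes are pending. The remaining headroom is what absorbs the bytes a slow-reacting sender still pushes out after it sees RTS change.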
  • SRLM Posts: 5,045
    edited 2013-04-10 - 15:14:03
    Jonathan's code is written tight, so messing with it is upon your own cog-nition. But take a look at how it is done in fullDuplexSerial4port. It is not a heavy load. In between received characters, you are testing for a new start bit. If one is not detected, then you jump to a routine that compares the difference between the head and tail pointers to the programmed % of buffer full. If above that percent, it sets the RTS to its stop state and then goes on back to looking for the start bit.

    Some outside devices take some time to respond to RTS and may empty their xmt buffer before halting transmission, so the buffer has to be big enough to account for that. The RTS process will mess up if you are operating on the edge of the speed zone, where input characters are arriving head to tail with one stop bit. It would be fine at lower speeds or when incoming characters are paced.

    A parameter in the init method in fullDuplexSerial4port selects the pin to use for RTS, or -1 if none. If none is selected, the initialization code changes the JMP to the RTS processing to a NOP.

    Thanks for the advice. The RTS and CTS acronyms confuse me: isn't RTS an indicator to the host (the Propeller in this case) to disable transmission? I'm getting this information from the RN-42 Bluetooth datasheet:
    15 UART_RTS UART RTS, goes high to disable host transmitter     Low level output from RN-42     0 - 3.3
    16 UART_CTS UART CTS, if set high, disables transmitter         Low level input to RN-42           0 - 3.3
    

    Basically, I want to make sure that I'm not losing bytes when sending by sending too much from the Propeller. Your second paragraph would be a solution for CTS? Or, are RTS/CTS like RX/TX: it's relative to where you are looking from?

    So, the solution that I came up with is as follows:
    Tx_main.tx_byte
    				// set up for sending out a byte
    				rdbyte	Tmp, Write_ptr
    				add		Write_ptr, #1
    
    				// force the stop bit
    				or		Tmp, #$100
    
    				jmpret	Tx_jump, #Lockstep
    				
    				// SRLM: Add RTS support (shown in bold in the original post)
    				mov		INA, INA
    				and		INA, Maskrts wz
    	if_nz		jmp		#Lockstep
    				// sign extend the 1 into all upper bits
    				shl		Tmp, #(32-9)
    				sar		Tmp, #(32-10)
    				mov		PHSA, Tmp
    
    				// 10 bits (start + 8 data + stop) makes 20 half-bits
    				mov		Half_bits_out, #20
    

    In some preliminary tests with a button acting as the master on the RTS line, this seems to work (at 460800). Does anybody see any timing problems?

    edit: some of the format of the PASM is a bit different: it's from a C++ GAS driver, which is why there are no local labels and the variables are capitalized. And that's why comments use // instead of '.
  • Tracy Allen Posts: 6,404
    edited 2013-04-10 - 16:35:35
    Yes indeed, cts and rts have the same ambiguity as tx and rx. The names are written from the standpoint of DTE, data terminal equipment, for which tx is an output, and cts is an input that can throttle the flow of data out from tx. And rts is an output from the DTE that tells the external device to hold off on sending more data. The opposite of DTE is DCE, data communications equipment (e.g. modem), and it uses the same names, but tx is its input and cts is its output. It's just one of the things that makes RS232 hookups confusing.

    The RN-42 bluetooth module should by all rights be a modem, DCE device, but the description has it tagged as a DTE, at least for the handshaking lines...
    15 UART_RTS UART RTS, goes high to disable host transmitter     Low level output from RN-42     0 - 3.3
    16 UART_CTS UART CTS, if set high, disables transmitter         Low level input to RN-42           0 - 3.3
    


    FullDuplexSerial4port works from the standpoint of DTE, so cts is an input for control of flow from the tx pin. The cts pin state is combined with the state of the xmit buffer. In the following, the uart initialization patches if_z to either if_z_or_c or if_z_or_nc, depending on the desired polarity, or leaves it at if_z if cts control is not desired.
    transmit  jmpret   txcode,rxcode1     'run a chunk of receive code, then return
                                          'patched to a jmp if pin not used
    txcts0    test     ctsmask,ina   wc   'if flow-controlled dont send
              rdlong   t1,tx_head_ptr     '{7-22} - head[0]
              cmp      t1,tx_tail   wz    'tail[0]
    ctsi0     if_z     jmp #transmit      'may be patched to if_z_or_c or if_z_or_nc
    
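The effect of the patched condition can be modelled in a few lines. The C++ sketch below is an illustration under stated assumptions, not a translation of the PASM: `CtsMode` and `may_transmit` are invented names, and the mapping of polarity to carry flag assumes that `test ctsmask,ina wc` with a single-bit mask sets C when the CTS pin reads high. It mirrors the three variants of the branch: `if_z` (loop back when the buffer is empty), and `if_z_or_c` / `if_z_or_nc` (also loop back when the CTS pin is in the blocking state).

```cpp
#include <cstdint>

// How the CTS pin is interpreted; corresponds to the three patch variants.
enum class CtsMode {
    Unused,      // branch left as if_z: only the buffer state matters
    HighBlocks,  // branch patched to if_z_or_c: pin high holds off transmission
    LowBlocks    // branch patched to if_z_or_nc: pin low holds off transmission
};

// True when the transmitter may start sending the next byte.
bool may_transmit(uint32_t head, uint32_t tail, bool cts_pin_high, CtsMode mode) {
    if (head == tail) {
        return false;  // Z set: transmit buffer is empty, nothing to send
    }
    switch (mode) {
        case CtsMode::HighBlocks: return !cts_pin_high;
        case CtsMode::LowBlocks:  return cts_pin_high;
        case CtsMode::Unused:     return true;
    }
    return true;
}
```

Folding the pin test into the existing "buffer empty" branch is what keeps the cost low: the flow-control decision adds one `test` instruction and no extra jump.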
  • SRLM Posts: 5,045
    edited 2013-06-04 - 23:50:24
    There seems to be a small bug: the driver does not set DIRA for the RX pin. So, if the pin given as RX is an output then the driver fails.
  • kuroneko Posts: 3,623
    edited 2013-06-04 - 23:59:05
    SRLM wrote: »
    There seems to be a small bug: the driver does not set DIRA for the RX pin. So, if the pin given as RX is an output then the driver fails.
    What do you think the driver should/can do about this?
  • SRLM Posts: 5,045
    edited 2013-06-05 - 00:36:15
    kuroneko wrote: »
    What do you think the driver should/can do about this?

    I think it should set the rx pin to input when the start method is called.

    Maybe you were referring to how all the DIRAs are OR'd together? That does present some interesting problems. My solution is to set the RX pin to input, then test DIRA to see if it really is input. If it is not (another cog has it set to output), then Start() fails and returns the appropriate error code.
  • kuroneko Posts: 3,623
    edited 2013-06-05 - 00:50:05
    SRLM wrote: »
    I think it should set the rx pin to input when the start method is called.
    Assuming the PASM driver can start at all (free cog) that leaves 6 other cogs which you can't easily affect. So why bother with the caller?
    ... then test DIRA to see if it really is input
    How do you distinguish between a pin driven from an external source and driven by another cog? Besides, you can only read your own dira ...
  • SRLM Posts: 5,045
    edited 2013-06-05 - 01:14:09
    You're right. For others interested in reading about the I/O registers, I found a good source from Jeff Martin here: http://forums.parallax.com/showthread.php/100671-Reading-the-output-state-of-a-pin-in-one-cog-that-has-been-set-by-a-different-c

    I guess it's more of a problem with the Propeller hardware, then. It would have been handy to be able to read what the DIRA register was set at (after all the OR'ing) so that these cases could be accounted for.

    Still, I think it would be a good idea to set the RX pin to input, even if that covers only 25% of the cogs.
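
The limitation discussed in this exchange follows from how the Propeller 1 combines the per-cog direction registers: a pin is an output if any cog sets its DIRA bit, and each cog can read only its own DIRA. A small C++ model makes the consequence visible. The names (`effective_dir`, `kCogs`) are hypothetical and the registers are plain integers, not real hardware.

```cpp
#include <array>
#include <cstdint>

constexpr int kCogs = 8;  // the Propeller 1 has eight cogs

// The direction actually applied to the pins is the OR of every cog's DIRA.
uint32_t effective_dir(const std::array<uint32_t, kCogs>& dira) {
    uint32_t dir = 0;
    for (uint32_t d : dira) {
        dir |= d;
    }
    return dir;
}

// What the driver cog can do on its own: clear the RX bit in its own DIRA.
void make_input_in_cog(std::array<uint32_t, kCogs>& dira, int cog, int pin) {
    dira[cog] &= ~(uint32_t{1} << pin);
}
```

If another cog still has the bit set, `effective_dir` keeps reporting the pin as an output after `make_input_in_cog` runs, and in this model the driver cog has no way to observe that, since it only ever sees its own entry.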