Smarter/faster array element indexing using indirect register addressing in PASM?

northcove · 2014-08-11 16:34

I am trying to improve the performance of a PASM routine that continually processes registers which have originated as separate LONG arrays of user values in hub ram. My PASM implementation works correctly using MOVS and MOVD for register indexing but my approach has several drawbacks:

1. my actual implementation reads and writes registers in many different places; keeping track of all the instructions that need modification via MOVS / MOVD is onerous;
2. my approach requires additional temporary "working" registers that represent the current element of each vector I'm processing;
3. the overhead of code self-modification for simulating vector indexing, both before the loop and on every iteration, is causing performance problems with my solution.

Here's an example that illustrates my issues:

dat                  
              org

main       'compute S[i]:=A[i]+B[i]+C[i]+S[i] repeatedly

:restart      movs          :rd_A, #A           'init A src reg
              movs          :rd_B, #B           'init B src reg
              movs          :rd_C, #C           'init C src reg
              movs          :rd_S, #S           'init S src reg
              movd          :wr_S, #S           'init S dst reg
              mov           _N, #N              'init counter
              mov           _P, par             'init hub ram dst reg

:compute      '      
:rd_A         mov           _A, 0-0             'read current A element into working register
:rd_B         mov           _B, 0-0             'read current B element into working register
:rd_C         mov           _C, 0-0             'read current C element into working register

              'do lots of read/write with A, B, and C here before they are added to S
              
:rd_S         mov           _S, 0-0             'read current S element into working register
              add           _S, _A              'add working A to working S
              add           _S, _B              'add working B to working S
              add           _S, _C              'add working C to working S
:wr_S         mov           0-0, _S             'write working S to pasm reg
              wrlong        _S, _P              'write working S to hub ram
              djnz          _N, #:increment     'operate on the next array elements
              jmp           #:restart           'start all over from the first array elements

:increment    add           :rd_A, #1           'incr A src reg
              add           :rd_B, #1           'incr B src reg
              add           :rd_C, #1           'incr C src reg
              add           :rd_S, #1           'incr S src reg
              add           :wr_S, DI           'incr S dst reg
              add           _P, #4              'incr dst hub ram ptr           
              jmp           #:compute

  A     long  0 [ N ]       'input vector inited by hub cog 
  B     long  0 [ N ]       'input vector inited by hub cog  
  C     long  0 [ N ]       'input vector inited by hub cog
  S     long  0 [ N ]       'input vector sum result in pasm cog
  DI    long  1 << 9        'reg val to incr dst reg       
  
  'my "working" registers
  _A    res
  _B    res
  _C    res
  _S    res
  _N    res
  _P    res

              fit

The single biggest issue is that my loop needs to run as fast as possible, but the overhead of the :restart and :increment instruction blocks is causing problems. (The project is a serial multiplexer that handles up to 32 input channels, up to 1 megabaud input data rate, and delivers a 3 megabaud output data rate.) Is there a better way to achieve array-like indexing with PASM registers than what I'm doing?

Cheers,

Christopher
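
To make the MOVS/MOVD mechanics in the listing above concrete, here is a minimal Python sketch of what those instructions do to a 32-bit PASM instruction word. The source field occupies bits 0..8 and the destination field bits 9..17, which is why DI is 1 << 9. The helper names (movs, movd), the register addresses S_BASE and WORK_S, and the Python modelling itself are illustrative assumptions, not part of the original code.

```python
SRC_MASK = 0x1FF                 # source field: bits 0..8
DST_SHIFT = 9                    # destination field: bits 9..17
DST_MASK = 0x1FF << DST_SHIFT
DI = 1 << DST_SHIFT              # same constant as DI in the listing

def movs(word, value):
    """Model MOVS: replace only the 9-bit source field."""
    return (word & ~SRC_MASK) | (value & SRC_MASK)

def movd(word, value):
    """Model MOVD: replace only the 9-bit destination field."""
    return (word & ~DST_MASK) | ((value & 0x1FF) << DST_SHIFT)

# 'mov 0-0, 0-0': MOV opcode, result written, always executed, fields empty.
template = 0xA0BC0000

# Hypothetical cog addresses for S[0] and the working register _S.
S_BASE, WORK_S = 0x040, 0x064

rd_S = movs(movd(template, WORK_S), S_BASE)   # mov _S, S[0]
wr_S = movs(movd(template, S_BASE), WORK_S)   # mov S[0], _S

# The :increment block: 'add :rd_S, #1' bumps the source field,
# 'add :wr_S, DI' bumps the destination field.
rd_S_next = (rd_S + 1) & 0xFFFFFFFF
wr_S_next = (wr_S + DI) & 0xFFFFFFFF

print(hex(rd_S_next & SRC_MASK))                  # source field is now S_BASE + 1
print(hex((wr_S_next & DST_MASK) >> DST_SHIFT))   # destination field is now S_BASE + 1
```

Every instruction that touches an indexed element needs its own patch and its own increment, which is where the overhead described above comes from.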

kuroneko · 2014-08-11 16:53

The three hub arrays (A, B, C) are never updated (based on the fragment you posted). Why do you keep adding them up (once should be fine)? Also, _A, _B and _C are written to and read from once. Merge those. Then you write _S to hub, why do you keep it in cog RAM as well?

I'd also re-order the jumps slightly (but that's just me):

              wrlong        _S, _P              'write working S to hub ram

:increment    add           :rd_A, #1           'incr A src reg
              add           :rd_B, #1           'incr B src reg
              add           :rd_C, #1           'incr C src reg
              add           :rd_S, #1           'incr S src reg
              add           :wr_S, DI           'incr S dst reg
              add           _P, #4              'incr dst hub ram ptr           
              djnz          _N, #:compute       'operate on the next array elements
              jmp           #:restart           'start all over from the first array elements

northcove · 2014-08-11 17:17

Thanks for your reply but they are not the performance improvements you are looking for (Jedi mind trick-hand wave.)

That code is not intended to update hub ram or even do anything useful; it is a simple example just to show a technique for indexing through PASM registers to access individual elements. I only wrote S out to hub ram to verify that my example worked correctly.

In reality, in my actual implementation A, B and C are being modified in the inner loop (hence the "do lots of read/write.." comment.)

I need to find a better way of reading/writing registers as though they are elements in large arrays. I'm still learning how best to achieve this on a chip that offers self-modifying instructions rather than more conventional indirect register addressing.

My ideal is to merely adjust a base pointer and/or register using a single instruction each time through the main loop and then access current elements of A, B, C etc using a fixed offset from that main pointer / register. But unfortunately I'm not yet high enough on the PASM learning curve to achieve this.

Bill Henning · 2014-08-11 17:41

The Propeller 1 does not have any index registers, indexed register modes, or auto increment/decrement addressing.

northcove · 2014-08-11 17:45

^Thanks Bill, can you think of a faster way to index through registers than what I'm doing?

Bill Henning · 2014-08-11 17:52

Take a look at the documentation for hub ops, and re-organize your code to optimally interleave it.
top:
RDxxxxx
ins1
ins2
RDxxxxx
ins1
ins2
RDxxxxx
ins1
ins2
WRxxxxx
ins1
jmp #top

Note that once you are synchronized to the hub, each of the above RDxxxx blocks will take 16 clock cycles, ins1/2 essentially execute for free.

If there are three instructions after a hub operation, 12 clock cycles are wasted, i.e. the loop above executes in 64 clock cycles, but the one below takes 80:

top:
RDxxxxx
ins1
ins2
ins3 ***** effectively takes 16 clock cycles
RDxxxxx
ins1
ins2
RDxxxxx
ins1
ins2
WRxxxxx
ins1
jmp #top

(thanks to kuroneko for pointing out my silly mistake. Shows me I have not been writing enough crazy pasm lately)

Basically, schedule your hub ops carefully
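
The 64 versus 80 clock figures above can be reproduced with a small Python model. This is only a sketch under simplifying assumptions: the cog's hub window recurs every 16 clocks, a hub instruction needs at least 8 clocks and finishes on a window, and every other instruction (including jmp) takes 4 clocks. The function names and the list encoding of the loops are made up for illustration.

```python
HUB_PERIOD = 16   # this cog's hub window comes around every 16 clocks
HUB_MIN = 8       # assumed minimum clocks for a hub instruction
INSTR = 4         # clocks for an ordinary instruction (including jmp)

def run(program, start=0, phase=0):
    """Return the clock at which one pass through the program finishes."""
    t = start
    for op in program:
        if op == "hub":
            done = t + HUB_MIN
            done += (phase - done) % HUB_PERIOD   # wait for the next hub window
            t = done
        else:
            t += INSTR
    return t

def loop_clocks(program):
    """Clocks per pass once the loop has synchronised to the hub."""
    first = run(program)
    second = run(program, start=first)
    return second - first

# Bill's two loops: "i" is any 4-clock instruction, the last one being jmp.
two_spacers = ["hub", "i", "i"] * 4
three_spacers = ["hub", "i", "i", "i"] + ["hub", "i", "i"] * 3

print(loop_clocks(two_spacers))     # 64
print(loop_clocks(three_spacers))   # 80
```

In this model the third spacer pushes the next hub instruction just past its window, so it stalls until the following one and the block costs 32 clocks instead of 16.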

kuroneko · 2014-08-11 18:32

Bill Henning wrote: »

top:
RDxxxxx
ins1
ins2
ins3
RDxxxxx
...

Note that once you are synchronized to the hub, each of the above RDxxxx blocks will take 16 clock cycles, ins1/2/3 essentially execute for free.

Neat trick.

Bill Henning · 2014-08-11 18:37

LOL!!!!

Yep.

DUH!

TWO spacer instructions!!!!

I am getting old and forgetful... fixing post.

Thanks.

kuroneko wrote: »

Neat trick.

Phil Pilgrim (PhiPi) · 2014-08-11 18:54

Just imagine how fast LMM would be if the three-spacer-instruction cha-cha actually worked!

-Phil

northcove · 2014-08-11 18:58

Bill Henning wrote: »

Take a look at the documentation for hub ops, and re-organize your code to optimally interleave it.

Yeah, that's a neat trick - thanks.

Got any tricks for faster r/w access of incremented PASM registers other than modifying instructions via MOVS/MOVD?

Cheers,

Christopher

msrobots · 2014-08-11 19:06

you can also replace complete instructions with a simple mov. Or use add to increment source and dest in one operation.

Enjoy!

Mike

Bill Henning · 2014-08-11 19:23

not really

well... kuroneko has used PHSA (if my memory is working this time) as a hub pointer, but that is crazy code.

northcove wrote: »

Yeah, that's a neat trick - thanks.

Got any tricks for faster r/w access of incremented PASM registers other than modifying instructions via MOVS/MOVD?

Cheers,

Christopher

ChrisGadd · 2014-08-11 19:32

Your top post mentions up to 32 input channels, each of which I'm guessing stores data in its own array, and you're trying to find a routine to read from each array in turn? If that's accurate, then is there a way to have all of the input channels write to the same array, and simply interleave the registers that they write to? The first channel writes to longs 0, 32, 64... channel 2 gets 1, 33, 65... and so on. The PASM routine could then just start at the beginning and read sequentially down the array.

northcove · 2014-08-11 20:56

ChrisGadd wrote: »

Your top post mentions up to 32 input channels, each of which I'm guessing stores data in its own array, and you're trying to find a routine to read from each array in turn? If that's accurate, then is there a way to have all of the input channels write to the same array, and simply interleave the registers that they write to? The first channel writes to longs 0, 32, 64... channel 2 gets 1, 33, 65... and so on. The PASM routine could then just start at the beginning and read sequentially down the array.

^Yeah, that's closer to my real problem.

My actual code scans multiple pins for delimited serial data not dissimilar to NMEA sentences from a GPS, for example. (In my application the data could be coming in at 500k-1Mbaud / 1kHz records over fibre optic cable, not 4800 baud / 1Hz records from a GPS.) The PASM code performs checks on the incoming bytes and writes the record to hub ram when records are complete. The receiver code also performs more complex preprocessing of these records before writing them to hub ram. Each receiver channel has about twenty fields associated with it: rx pin mask, ticks per bit, start char, stop char, bits remaining, checksum, error count, destination buffer address, destination buffer length, routing information, IO stats, etc. My single port implementation handles 1.5Mbaud input data just fine (tested with a multi-port transmitter I wrote that handles 3Mbaud and ~1k records per second per port.)

Converting the single port code over to multi-port, I'm finding that the MOVS/MOVD workaround for the absence of indexed register modes or auto incrementing is causing overhead that I wasn't anticipating when writing PASM on the Prop. My multi-channel code requires ~twenty MOVS/MOVD before the first iteration to set the base source and destination registers to load the elements into working registers. Each successive iteration requires another ~twenty instructions to increment the source registers and yet another twenty instructions to save & increment the destination registers. And I'm trying to keep the port switching overhead to under 0.2uSecs !!
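
To put the 0.2 uSec target in perspective, here is a back-of-the-envelope Python sketch. It assumes an 80 MHz system clock and 4 clocks per ordinary instruction; the thread does not state the clock frequency at this point, so CLKFREQ is an assumption, and the function name is illustrative.

```python
CLKFREQ = 80_000_000        # assumed system clock in Hz
CLOCKS_PER_INSTR = 4        # ordinary (non-hub) PASM instruction

def instr_time_ns(count):
    """Execution time of `count` ordinary instructions, in nanoseconds."""
    return count * CLOCKS_PER_INSTR * 1_000_000_000 // CLKFREQ

budget_ns = 200                                  # the 0.2 uSec target
per_instr_ns = instr_time_ns(1)                  # 50 ns
instr_in_budget = budget_ns // per_instr_ns      # 4 instructions
per_iteration_ns = instr_time_ns(20 + 20)        # increment sources + save/increment destinations

print(per_instr_ns, instr_in_budget, per_iteration_ns)   # 50 4 2000
```

Under these assumptions the budget allows about four instructions per port switch, while the roughly forty per-iteration fix-up instructions cost about 2 uSec.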

Bill Henning · 2014-08-11 21:25

Keep in mind the instruction rate of the propeller, and the bit rates you are dealing with. There is no way that you could handle more than a few high speed channels per cog, even if the timing is perfectly predictable within each stream after the start bit. Assuming a standard start bit, 8 data bits, stop bit, and only sampling each bit cell in the middle, you *might* be able to handle 3 mbps streams per cog, at best, and I would not rely on being able to handle more than two.

northcove wrote: »

^Yeah, that's closer to my real problem.

My actual code scans multiple pins for delimited serial data not dissimilar to NMEA sentences from a GPS, for example. (In my application the data could be coming in at 500k-1Mbaud / 1kHz records over fibre optic cable, not 4800 baud / 1Hz records from a GPS.) The PASM code performs checks on the incoming bytes and writes the record to hub ram when records are complete. The receiver code also performs more complex preprocessing of these records before writing them to hub ram. Each receiver channel has about twenty fields associated with it: rx pin mask, ticks per bit, start char, stop char, bits remaining, checksum, error count, destination buffer address, destination buffer length, routing information, IO stats, etc. My single port implementation handles 1.5Mbaud input data just fine (tested with a multi-port transmitter I wrote that handles 3Mbaud and ~1k records per second per port.)

Converting the single port code over to multi-port, I'm finding that the MOVS/MOVD workaround for the absence of indexed register modes or auto incrementing is causing overhead that I wasn't anticipating when writing PASM on the Prop. My multi-channel code requires ~twenty MOVS/MOVD before the first iteration to set the base source and destination registers to load the elements into working registers. Each successive iteration requires another ~twenty instructions to increment the source registers and yet another twenty instructions to save & increment the destination registers. And I'm trying to keep the port switching overhead to under 0.2uSecs !!

northcove · 2014-08-11 22:25

Bill Henning wrote: »

Keep in mind the instruction rate of the propeller, and the bit rates you are dealing with. There is no way that you could handle more than a few high speed channels per cog, even if the timing is perfectly predictable within each stream after the start bit. Assuming a standard start bit, 8 data bits, stop bit, and only sampling each bit cell in the middle, you *might* be able to handle 3 mbps streams per cog, at best, and I would not rely on being able to handle more than two.

Copy that. Realistically there will only be one or two high speed streams (>500kbaud) to handle. When streams are numerous they'll mostly be in the 38.4-115.2kbaud range. I have sufficient free cogs to allocate to specific streams when the bandwidth requires it. I'm also fortunate in that I can tweak the incoming baud to match my receivers because I developed the transmitters.

One of my issues is that the PASM register layout is inherited from arrays in the Spin cog's DAT section, eg:

DAT
  'initialised by hub obj and copied to cog ram
  rxmask      long  0 [ max_ports ]    'rx rxmask for port
  rxbyte      long  0 [ max_ports ]    'current byte for port
  bitime      long  0 [ max_ports ]    'word 0: data bit time, word 1: start bit delay
  delims      long  0 [ max_ports ]    'byte 0: start char, byte 1: end char
  flags       long  0 [ max_ports ]    'word 0: cur flags, word 1: ini flags 
  headref     long  0 [ max_slots ]    'handle to head addr 
  headptr     long  0 [ max_slots ]    'addr of rcv buffer start
  tailref     long  0 [ max_slots ]    'handle to tail addr
  tailptr     long  0 [ max_slots ]    'addr of rcv buffer end
  bufcnt      long  0 [ max_ports ]    'curr user buffers for port
  ...

In the single-to-multi-port conversion I've lazily resorted to using "working" registers that reflect the single-port usage of registers. What I've ended up with is this overhead before every channel loop:

:rescan
                    'init base registers        
                    movs        :r_rxmask, #rxmask
                    movs        :r_rxbyte, #rxbyte
                    movd        :w_rxbyte, #rxbyte
                    movs        :r_nxtbit, #nxtbit
                    movd        :w_nxtbit, #nxtbit
                    movs        :r1_bitime, #bitime
                    movs        :r2_bitime, #bitime
                    movs        :r_recdelim, #delims
                    movs        :r_flags, #flags 
                    movd        :w_flags, #flags 
                    movs        :r_dstptr, #dstptr
                    movd        :w_dstptr, #dstptr 
                    movs        :r_tailptr, #tailptr
                    movs        :r_headref, #headref 
                    movs        :r_headptr, #headptr
                    movs        :r_tailref, #tailref
                    movs        :r_hsh_rx, #hsh_rx
                    movd        :w_hsh_rx, #hsh_rx
                    movs        :r_hsh_tx, #hsh_tx
                    movd        :w_hsh_tx, #hsh_tx

And this overhead after every channel iteration:

:incr_regs
                    add         :r_rxmask, #1
                    add         :r_rxbyte, #1
                    add         :w_rxbyte, _dri
                    add         :r_nxtbit, #1
                    add         :w_nxtbit, _dri
                    add         :r1_bitime, #1
                    add         :r2_bitime, #1
                    add         :r_recdelim, #1
                    add         :r_flags, #1 
                    add         :w_flags, _dri
                    add         :r_dstptr, #1
                    add         :w_dstptr, _dri
                    add         :r_tailptr, #1
                    add         :r_headref, #1 
                    add         :r_headptr, #1
                    add         :r_tailref, #1
                    add         :r_hsh_rx, #1
                    add         :w_hsh_rx, _dri
                    add         :r_hsh_tx, #1
                    add         :w_hsh_tx, _dri

Yikes & yuck.

Possibly a better solution can be found by reorganising the layout such that each channel has a base register and its fields are in adjacent registers. I'm not sure this would do much to make up for the absence of register indexing/autoincrementing instructions, however.

ChrisGadd · 2014-08-12 06:29

yikes and yuck indeed. Have you considered unrolling the loop? That is instead of one loop that handles everything, break every iteration of the loop into its own routine? It would almost certainly be faster, and due to the number of registers that need updating might not be too much larger.

northcove · 2014-08-12 13:36

I cannot think of any way to unroll the loop without making the implementation be limited to a fixed number of channels. I want my implementation to be flexible enough to handle 1 channel at 1Mbaud while also supporting up to 32 channels at much slower speeds. My application has both requirements.

While my single port implementation was really fast, I'm having a complete rethink and starting over for the multi-port implementation. Given the Prop's self-modifying code capability, next I will experiment with a template sequence of instructions (start bit scanning, data byte assembly) to handle a single channel and then, upon entry, have the cog duplicate that instruction template for additional channels with a one-time setting of the source and destination registers of each channel's instructions. The return registers will be set appropriately to process channels in sequence. I'm a n00b at PASM and this is my first time writing self-modifying code but I like the new direction it's taking me.

northcove · 2014-08-26 18:19

ChrisGadd wrote: »

yikes and yuck indeed. Have you considered unrolling the loop? That is instead of one loop that handles everything, break every iteration of the loop into its own routine? It would almost certainly be faster, and due to the number of registers that need updating might not be too much larger.

Here's an update with my loop unrolling experiment for my multi-channel serial receiver. The project took almost two weeks longer than I expected. My excuse is I spent most of last week skiing in the mountains with my kids, my girlfriend, her kids and her ex. Also, writing self-modifying Propeller assembly for the first time using only a terminal program and 200MHz scope for debugging is without a doubt the slowest development I have ever endured.

Initial performance results are below. They come from testing a single cog reading serial data (8 data bits, 1 stop bit, no handshaking) as delimited records. For this project I am using records defined by a "$" start char and $D (carriage return) stop char, as commonly used by NMEA 0183. The records used for testing had pseudo-random lengths between 12 and 96 bytes. My transmitter was emitting about 1,500 records per second at 1.5M Baud.

          Chans   Max Baud      MCSF*         
          -----  ---------   -----------
            1    1_500_000     1560KHz
            2      460_800      833KHz
            3      250_000      556KHz                   
            4      230_400      417KHz
            5      115_200      333KHz
            6      115_200      278KHz

*Minimum Channel Sample Frequency, as reported by my scope when XORing a diagnostic pin each time a port's data pin was sampled.

Here's what I did:

1. Wrote a tiny PASM cog that supported reading and writing its registers from another Spin object using only 12 instructions and 2 data registers in the cog RAM.

pub start
  return cognew ( @entry, @_req )

pub get_reg ( reg_idx )
  result := reg_idx << 1
  _req := @result
  repeat while _req

pub set_reg ( reg_idx, reg_val )
  result := reg_idx << 1 | 1
  _req := @result
  repeat while _req
  
dat           org   'hub ram

_req    long 0

dat           org   'cog ram   

entry               rdlong      _ptr, par       wz
        if_z        jmp         #entry          'no cmd, check again      
                    rdlong      _val, _ptr      'get cmd val     
                    shr         _val, #1        wc 'check set bit and shift to get reg index       
        if_nc       movd        $+2, _val       'set get_reg dest
        if_c        movd        $+3, _val       'set set_reg dest
        if_nc       wrlong      0-0, _ptr       'read reg value, copy to usr hub var
        if_c        add         _ptr, #8        'offset to reg_val param
        if_c        rdlong      0-0, _ptr       'write usr val from hub ram to reg
                    mov         _val, #0        'zero cmd
                    wrlong      _val, par       'reset cmd reg
                    jmp         #entry          'do it again

_ptr    res
_val    res

This cog enabled me to use its registers $00E to $1EF for my own purposes. Once the registers were configured, I set register $000 with an instruction to execute the custom code that was created on-the-fly by a Spin object.

2. Wrote a generic PASM template for the instructions and data for reading serial data from a single pin. For explanation purposes, here is the basic code for assembling bytes from bits.

dat

port_code_beg       tjnz    _rxbyte, #:check_time
        
                    'check start bit
:seek_start         test    _rxmask, ina        wz
        if_nz       jmp     #0-0                'process next port

                    'got start bit
                    mov     _nxtime, cnt 
                    add     _nxtime, _sbtime
                    mov     _rxbyte, _hibyte
                    jmp     #0-0                'process next port                             

:check_time         cmp    _nxtime, cnt         wc              
        if_nc       jmp     #0-0                'process next port                                          

                    'got data bit
                    test    _rxmask, ina        wz
                    shl     _rxbyte, #1         wc                             
        if_nz       or      _rxbyte, #1
        if_nc       jmp     #:got_byte
                    add     _nxtime, _bitime
                    jmp     #0-0                'process next port 

:got_byte           shr     _rxbyte, #1         'shift out stop bit                         
                    rev     _rxbyte, #24        'reverse data bit order

                    'got complete byte here
                    call    #handle_byte

:exit               mov     _rxbyte, #0        'start all over with new byte
                    jmp     #0-0               'process next port
port_code_end

  _rxmask     long  0
  _bitime     long  0
  _sbtime     long  0
  _dbmask     long  0
  _nxtime     long  0
  _rxbyte     long  0
  _bufreg     long  0
  _rcvptr     long  0
  _endptr     long  0

3. Then I wrote the Spin code to duplicate that PASM template for every serial channel and set each instruction's source and destination bits, customised for each channel. The source bits of every JMP #0-0 instruction were overwritten with the register number of the next channel's code. Oh yeah, I also wrote a little disassembler to see what was going on with the code I was writing on-the-fly. Here's a dump of the multi-receiver's PASM registers when configured for two channels starting at register $010:

0010:00004000
0011:000002B6
0012:000003E9
0013:00010000
0014:00000000
0015:00000000
0016:0000002D
0017:00000000
0018:00000000
0019:E87C2A20 TJNZ   111010 0001 015 #020
001A:623C21F2 TEST   011000 1000 010  1F2
001B:5C54003A JMP    010111 0001 000 #03A
001C:A0BC29F1 MOV    101000 0010 014  1F1
001D:80BC2812 ADD    100000 0010 014  012
001E:A0BC2A0E MOV    101000 0010 015  00E
001F:5C7C003A JMP    010111 0001 000 #03A
0020:853C29F1 CMP    100001 0100 014  1F1
0021:5C4C003A JMP    010111 0001 000 #03A
0022:623C21F2 TEST   011000 1000 010  1F2
0023:2DFC2A01 SHL    001011 0111 015 #001
0024:68D42A01 OR     011010 0011 015 #001
0025:5C4C0028 JMP    010111 0001 000 #028
0026:80BC2811 ADD    100000 0010 014  011
0027:5C7C003A JMP    010111 0001 000 #03A
0028:28FC2A01 SHR    001010 0011 015 #001
0029:3CFC2A18 REV    001111 0011 015 #018
002A:5CFC1E0F CALL   010111 0011 00F #00F
002B:A0FC2A00 MOV    101000 0011 015 #000
002C:5C7C003A JMP    010111 0001 000 #03A
002D:00001FFC
002E:00001BD8
002F:00001C38
0030:00000000
0031:00020000
0032:000002B6
0033:000003E9
0034:00000000
0035:00000000
0036:00000000
0037:0000004E
0038:00000000
0039:00000000
003A:E87C6C41 TJNZ   111010 0001 036 #041
003B:623C63F2 TEST   011000 1000 031  1F2
003C:5C540019 JMP    010111 0001 000 #019
003D:A0BC6BF1 MOV    101000 0010 035  1F1
003E:80BC6A33 ADD    100000 0010 035  033
003F:A0BC6C0E MOV    101000 0010 036  00E
0040:5C7C0019 JMP    010111 0001 000 #019
0041:853C6BF1 CMP    100001 0100 035  1F1
0042:5C4C0019 JMP    010111 0001 000 #019
0043:623C63F2 TEST   011000 1000 031  1F2
0044:2DFC6C01 SHL    001011 0111 036 #001
0045:68D46C01 OR     011010 0011 036 #001
0046:5C4C0049 JMP    010111 0001 000 #049
0047:80BC6A32 ADD    100000 0010 035  032
0048:5C7C0019 JMP    010111 0001 000 #019
0049:28FC6C01 SHR    001010 0011 036 #001
004A:3CFC6C18 REV    001111 0011 036 #018
004B:5CFC1E0F CALL   010111 0011 00F #00F
004C:A0FC6C00 MOV    101000 0011 036 #000
004D:5C7C0019 JMP    010111 0001 000 #019
004E:00001FFE
004F:00001C38
0050:00001C98
0051:00000000

Notice that each channel's instructions have their source and destination bits independently configured; this is how I keep the channel switching overhead to a minimum. The receiver is put into action by setting register $000 to JMP #$019. My Spin code went through the PASM code registers for each channel's instruction block and fixed up the source and destination bits wherever needed. JMPs to #0-0 were overwritten to jump to the next channel's instruction block.

The data registers before each channel's instruction block contain the channel's settings: pin mask, bit ticks, next data bit ticks, received bits so far, trace mask, etc. The data registers following each code block contain the hub addresses of where and how the received data should be written for the user. (My implementation supports multiple receive buffers for each channel. If a receiver buffer is busy, eg. being processed by the user, the current record can be written to another buffer. This allows a single channel's records to be processed by multiple cogs, quite useful when records are coming in at 1KHz and the higher-level processing is being done in Spin.)

I've deliberately omitted the code that does the processing of a received byte. This code is part of the generic template for the real project but I left it out to make this explanation easier to understand. After more testing I'll post the receiver project to the Propeller objex.

Cheers,

Christopher
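
The field layout visible in the dump above can be checked with a short Python sketch. It splits an instruction word into opcode, ZCRI, condition, destination and source fields, and patches the destination and source the way the post describes the Spin fix-up doing. The function names are illustrative, and this is not the author's Spin code (which was not posted); the instruction words are taken from the dump.

```python
def decode(word):
    """Split a 32-bit PASM instruction into its fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "zcri":   (word >> 22) & 0xF,
        "cond":   (word >> 18) & 0xF,
        "dest":   (word >> 9) & 0x1FF,
        "src":    word & 0x1FF,
    }

def patch(word, dest=None, src=None):
    """Return word with its destination and/or source field replaced."""
    if dest is not None:
        word = (word & ~(0x1FF << 9)) | ((dest & 0x1FF) << 9)
    if src is not None:
        word = (word & ~0x1FF) | (src & 0x1FF)
    return word & 0xFFFFFFFF

def dump_fields(word):
    """Format the fields in the same layout as the dump's right-hand columns."""
    f = decode(word)
    imm = "#" if f["zcri"] & 1 else " "      # I bit set: source is an immediate
    return f"{f['opcode']:06b} {f['zcri']:04b} {f['dest']:03X} {imm}{f['src']:03X}"

# Channel 1's first instruction (register $019) and its first JMP ($01B).
ch1_tjnz = 0xE87C2A20
ch1_jmp = 0x5C54003A

# Re-target them at channel 2's registers, as seen at $03A and $03C.
ch2_tjnz = patch(ch1_tjnz, dest=0x036, src=0x041)
ch2_jmp = patch(ch1_jmp, src=0x019)

print(dump_fields(ch1_tjnz))                   # 111010 0001 015 #020
print(f"{ch2_tjnz:08X}", f"{ch2_jmp:08X}")     # E87C6C41 5C540019
```

One thing a fix-up pass has to watch for: immediates such as the #018 in REV share the source field with register addresses, so each instruction needs to be patched by what it means, not just by the value in the field.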

kuroneko · 2014-08-26 18:33

northcove wrote: »

                    test    _rxmask, ina        wz
                    shl     _rxbyte, #1         wc
        if_nz       or      _rxbyte, #1

This is more efficiently done with test wc and rcl.

I'm also a bit worried about this bit:

:check_time         cmp    _nxtime, cnt         wc              
        if_nc       jmp     #0-0                'process next port

northcove · 2014-08-26 19:05

Quite true, thanks. I've noticed Chip's FullDuplexSerial code using test wc & rcl too. I've deferred switching over because I'm using the C status from shl to determine when the byte is complete. _rxbyte is initialised with $ff00_0000; when the MSB is zero I know all the bits have been sampled. This approach eliminates the overhead of a bit counter.

northcove · 2014-08-26 19:07

kuroneko wrote: »

This is more efficiently done with test wc and rcl.

I'm also a bit worried about this bit:

:check_time         cmp    _nxtime, cnt         wc              
        if_nc       jmp     #0-0                'process next port

I was wondering how long it would take you to find that. Usually only takes me $ffffffff / 80_000_000 seconds maximum.

kuroneko · 2014-08-26 19:09

northcove wrote: »

I've deferred switching over because I'm using the C status from shl to determine when the byte is complete. _rxbyte is initialised with $ff00_0000; when the MSB is zero I know all the bits have been sampled.

Whether you use shl wc or rcl wc doesn't really matter in this case. They both move D[31] into carry.

northcove wrote: »

I was wondering how long it would take you to find that. Usually only takes me $ffffffff / 80_000_000 seconds maximum.

Are you saying you put that there on purpose?
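
On the shl/rcl point above, a small Python model can show that both sequences shift the sampled pin into _rxbyte and report the old bit 31 in carry, so the $ff00_0000 end-of-byte marker works either way. This is a sketch with illustrative function names; the pin state is passed in as 0 or 1, which assumes a single-pin mask so that test ... wc leaves the pin state in C.

```python
MASK32 = 0xFFFFFFFF
HIBYTE = 0xFF000000          # initial _rxbyte value; the eight 1s mark 8 data bits

def shl_or(rxbyte, pin):
    """test wz / shl #1 wc / if_nz or #1: returns (new value, carry)."""
    carry = rxbyte >> 31
    rxbyte = (rxbyte << 1) & MASK32
    if pin:
        rxbyte |= 1
    return rxbyte, carry

def rcl(rxbyte, pin):
    """test wc / rcl #1 wc: C goes into bit 0, old bit 31 comes out in C."""
    carry = rxbyte >> 31
    return ((rxbyte << 1) & MASK32) | pin, carry

def rev8(value):
    """Model 'rev _rxbyte, #24': reverse the low 8 bits, clear the rest."""
    return int(f"{value & 0xFF:08b}"[::-1], 2)

def receive(bits, step):
    """Feed sampled bits until carry comes out clear; return (byte, samples used)."""
    rxbyte = HIBYTE
    for n, pin in enumerate(bits, start=1):
        rxbyte, carry = step(rxbyte, pin)
        if not carry:                    # marker exhausted: this sample was the stop bit
            return rev8(rxbyte >> 1), n  # shift out the stop bit, fix the bit order
    return None, len(bits)

value = 0x0D                                          # carriage return
bits = [(value >> i) & 1 for i in range(8)] + [1]     # LSB first, then stop bit

print(receive(bits, shl_or))   # (13, 9)
print(receive(bits, rcl))      # (13, 9)
```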

northcove · 2014-08-26 19:27

Nice one, thanks. I knew the fault was in the code when I posted it, figured someone would flag it. Created it a couple days ago while hastily freeing up registers.

Also, the 1.5MBaud MCSF* in my previous table should be 3.8MHz, not 1.56MHz.

northcove · 2014-08-26 22:25

kuroneko wrote: »

I'm also a bit worried about this bit:

:check_time         cmp    _nxtime, cnt         wc              
        if_nc       jmp     #0-0                'process next port

Is there a better way than this? Would prefer to not introduce a temporary register.

                    mov     _tmp, _nxtime
                    sub     _tmp, cnt       
                    cmps    _tmp, #0            wc
        if_nc       jmp     #0-0

kuroneko · 2014-08-26 22:32

northcove wrote: »

Is there a better way than this? Would prefer to not introduce a temporary register.

Use cnt instead?

                    mov     cnt, _nxtime
                    sub     cnt, cnt
                    cmps    cnt, #0             wc
        if_nc       jmp     #0-0

You could also use a normal counter which makes the comparison easier, i.e. run any counter in LOGIC.always so the above becomes e.g.

cmp     _nxtime, phsa wc
if_nc   jmp     #0-0
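
The concern with the original cmp _nxtime, cnt wc is that cnt is a free-running 32-bit counter that wraps, so an unsigned comparison gives the wrong answer whenever the deadline and the current count sit on opposite sides of the wrap. The following Python sketch models both tests. The function names are illustrative, and "due" corresponds to the carry flag being set (the if_nc jmp is not taken).

```python
MASK32 = 0xFFFFFFFF

def due_unsigned(nxtime, cnt):
    """cmp _nxtime, cnt wc: C is set when nxtime < cnt as unsigned values."""
    return nxtime < cnt

def due_signed(nxtime, cnt):
    """sub then cmps #0 wc: C is set when (nxtime - cnt) is negative as a signed 32-bit value."""
    diff = (nxtime - cnt) & MASK32
    return diff >= 0x80000000

# Deadline 32 ticks in the future, but it lies just past the wrap.
cnt, nxtime = 0xFFFFFFF0, 0x00000010
print(due_unsigned(nxtime, cnt), due_signed(nxtime, cnt))   # True False

# Deadline 32 ticks in the past; cnt has wrapped, the deadline has not.
cnt, nxtime = 0x00000010, 0xFFFFFFF0
print(due_unsigned(nxtime, cnt), due_signed(nxtime, cnt))   # False True
```

The signed-difference form gives the right answer as long as the deadline is less than 2^31 ticks away from the current count.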
