Smarter/faster array element indexing using indirect register addressing in PASM?
I am trying to improve the performance of a PASM routine that continually processes registers which have originated as separate LONG arrays of user values in hub ram. My PASM implementation works correctly using MOVS and MOVD for register indexing but my approach has several drawbacks:
1. my actual implementation reads and writes registers in many different places; keeping track of all the instructions that need modification via MOVS / MOVD is onerous;
2. my approach requires additional temporary "working" registers that represent the current element of each vector I'm processing;
3. the overhead of code self-modification for simulating vector indexing before and every iteration is causing performance problems with my solution.
Here's an example that illustrates my issues:
The single biggest issue is my loop needs to run as fast as possible but the overhead of the :restart and :increment instruction blocks is causing problems. (The project is a serial multiplexer that handles up to 32 input channels, up to 1megabaud input data rate, and offers delivers a 3 megabaud output data rate.) Is there a better way to achieve array-like indexing with PASM registers than what I'm doing?
Cheers,
Christopher
1. my actual implementation reads and writes registers in many different places; keeping track of all the instructions that need modification via MOVS / MOVD is onerous;
2. my approach requires additional temporary "working" registers that represent the current element of each vector I'm processing;
3. the overhead of code self-modification for simulating vector indexing before and every iteration is causing performance problems with my solution.
Here's an example that illustrates my issues:
dat
org
main 'compute S[i]:=A[i]+B[i]+C[i]+S[i] repeatedly
:restart movs :rd_A, #A 'init A src reg
movs :rd_B, #B 'init B src reg
movs :rd_C, #C 'init C src reg
movs :rd_S, #S 'init S src reg
movd :wr_S, #S 'init S dst reg
mov _N, #N 'init counter
mov _P, par 'init hub ram dst reg
:compute '
:rd_A mov _A, 0-0 'read current A element into working register
:rd_B mov _B, 0-0 'read current B element into working register
:rd_C mov _C, 0-0 'read current C element into working register
'do lots of read/write with A, B, and C here before they are added to S
:rd_S mov _S, 0-0 'read current S element into working register
add _S, _A 'add working A to working S
add _S, _B 'add working B to working S
add _S, _C 'add working C to working S
:wr_S mov 0-0, _S 'write working S to pasm reg
wrlong _S, _P 'write working S to hub ram
djnz _N, #:increment 'operate on the next array elements
jmp #:restart 'start all over from the first array elements
:increment add :rd_A, #1 'incr A src reg
add :rd_B, #1 'incr B src reg
add :rd_C, #1 'incr C src reg
add :rd_S, #1 'incr X src reg
add :wr_S, DI 'incr X dst reg
add _P, #4 'incr dst hub ram ptr
jmp #:compute
A long 0 [ N ] 'input vector inited by hub cog
B long 0 [ N ] 'input vector inited by hub cog
C long 0 [ N ] 'input vector inited by hub cog
S long 0 [ N ] 'input vector sum result in pasm cog
DI long 1 << 9 'reg val to incr dst reg
'my "working" registers
_A res
_B res
_C res
_S res
_N res
_P res
fit
The single biggest issue is my loop needs to run as fast as possible but the overhead of the :restart and :increment instruction blocks is causing problems. (The project is a serial multiplexer that handles up to 32 input channels, up to 1megabaud input data rate, and offers delivers a 3 megabaud output data rate.) Is there a better way to achieve array-like indexing with PASM registers than what I'm doing?
Cheers,
Christopher

Comments
I'd also re-order the jumps slightly (but that's just me):
wrlong _S, _P 'write working S to hub ram :increment add :rd_A, #1 'incr A src reg add :rd_B, #1 'incr B src reg add :rd_C, #1 'incr C src reg add :rd_S, #1 'incr X src reg add :wr_S, DI 'incr X dst reg add _P, #4 'incr dst hub ram ptr djnz _N, #:compute 'operate on the next array elements jmp #:restart 'start all over from the first array elementsThat code is not intended to update hub ram or even do anything useful; it is a simple example just to show how a technique of indexing through PASM registers to access individual elements. I only wrote out S to hub ram to verify my example worked correctly.
In reality, in my actual implementation A, B and C are being modified in the inner loop (hence the "do lots of read/write.." comment.)
I need to find a better way of reading/writing registers as though they are elements in large arrays. Still learning how to best achieve this on that offers self-modifying instructions rather than more conventional indirect register addressing.
My ideal is to merely adjust a base pointer and/or register using a single instruction each time through the main loop and then access current elements of A, B, C etc using a fixed offset from that main pointer / register. But unfortunately I'm not yet high enough on the PASM learning curve to achieve this.
top:
RDxxxxx
ins1
ins2
RDxxxxx
ins1
ins2
RDxxxxx
ins1
ins2
WRxxxxx
ins1
jmp #top
Note that once you are synchronized to the hub, each of the above RDxxxx blocks will take 16 clock cycles, ins1/2 essentially execute for free.
If there are three instructions after a hub operation, it will waste 12 clock cycles - ie above executes in 64 clock cycles, but below takes 80
top:
RDxxxxx
ins1
ins2
ins3 ***** effectively takes 16 clock cycles
RDxxxxx
ins1
ins2
RDxxxxx
ins1
ins2
WRxxxxx
ins1
jmp #top
(thanks to kuroneko for pointing out my silly mistake. Shows me I have not been writing enough crazy pasm lately)
Basically, schedule your hub ops carefully
Yep.
DUH!
TWO spacer instructions!!!!
I am getting old and forgetfull... fixing post.
Thanks.
-Phil
Yeah, that's a neat trick - thanks.
Got any tricks for faster r/w access of incremented PASM registers other than modifying instructions via MOVS/MOVD?
Cheers,
Christopher
Enjoy!
Mike
well... kuroneko has used PHSA (if my memory is working this time) as a hub pointer, but that is crazy code.
^Yeah, that's closer to my real problem.
My actual code scans multiple pins for delimited serial data not dissimilar to NMEA sentences from a GPS, for example. (In my application the data could be coming in at 500k-1Mbaud / 1kHz records over fibre optic cable, not 4800 baud / 1Hz records from a GPS.) The PASM code performs checks on the incoming bytes and writes the record to hub ram when records are complete. The receiver code also performs more complex preprocessing of these records before writing them to hub ram. Each receiver channel has about twenty fields associated with it: rx pin mask, ticks per bit, start char, stop char, bits remaining, checksum, error count, destination buffer address, destination buffer length, routing information, IO stats, etc. My single port implementation handles 1.5Mbaud input data just fine (tested with a multi-port transmitter I wrote that handles 3Mbaud and ~1k records per second per port.)
Converting the single port code over to multi-port I'm finding that the MOVS/MOVD workaround for the absence of indexed register modes or auto incrementing is causing overhead that I wasn't anticipating writing PASM on the Prop. My multi-channel code requires ~twenty MOVS/MOVD before the first iteration to set the base source and destination registers to load the elements into working registers. Each sucessive iteration requires another ~twenty instructions to increment the source registers and yet another twenty instructions to save & increment the destination registers. And I'm trying to keep the port switching overhead to under 0.2uSecs !!
Copy that. Realistically there will only be one or two high speed streams (>500kbaud) to handle. When streams are numerous they'll mostly be in the 38.4-115.2kbaud range. I have sufficient free cogs to allocate to specific streams when the bandwidth requires it. I'm also fortunate in that I can tweak the incoming baud to match my receivers because I developed the transmitters.
One of my issues is that the PASM register layout is inherited from arrays in the the Spin cog's DAT section, eg:
In the single-to-multi-port conversion I've lazily resorted to using "working" registers that reflected the single port usage of registers. What I've ended up with this overhead before every channel loop:
:rescan 'init base registers movs :r_rxmask, #rxmask movs :r_rxbyte, #rxbyte movd :w_rxbyte, #rxbyte movs :r_nxtbit, #nxtbit movd :w_nxtbit, #nxtbit movs :r1_bitime, #bitime movs :r2_bitime, #bitime movs :r_recdelim, #delims movs :r_flags, #flags movd :w_flags, #flags movs :r_dstptr, #dstptr movd :w_dstptr, #dstptr movs :r_tailptr, #tailptr movs :r_headref, #headref movs :r_headptr, #headptr movs :r_tailref, #tailref movs :r_hsh_rx, #hsh_rx movd :w_hsh_rx, #hsh_rx movs :r_hsh_tx, #hsh_tx movd :w_hsh_tx, #hsh_txAnd this overhead after every channel iteration:
:incr_regs add :r_rxmask, #1 add :r_rxbyte, #1 add :w_rxbyte, _dri add :r_nxtbit, #1 add :w_nxtbit, _dri add :r1_bitime, #1 add :r2_bitime, #1 add :r_recdelim, #1 add :r_flags, #1 add :w_flags, _dri add :r_dstptr, #1 add :w_dstptr, _dri add :r_tailptr, #1 add :r_headref, #1 add :r_headptr, #1 add :r_tailref, #1 add :r_hsh_rx, #1 add :w_hsh_rx, _dri add :r_hsh_tx, #1 add :w_hsh_tx, _driYikes & yuck.
Possibly a better solution can be found by reorganising the layout such that each channel has a base register and its fields are in adjacent registers. Not sure this would fix much the absence of register indexing/autoincrementing instructions, however.
While my single port implementation was really fast, I'm having a complete rethink and starting over for the multi-port implementation. Given the Prop's self-modifying code capability next I will experiment with a template sequence of instructions (start bit scanning, data byte assembly) to handle a single channel and then upon entry have the cog duplicate that instruction template for additional channels with a one-time setting of each channel's instruction's source and destination registers. The return registers will be appropriately to process channels in sequence. I'm a n00b at PASM and this is my first time writing self-modifying code but I like the new direction it's taking me.
Here's an update with my loop unrolling experiment for my multi-channel serial receiver. The project took almost two weeks longer than I expected. My excuse is I spent most of last week skiing in the mountains with my kids, my girlfriend, her kids and her ex. Also, writing self-modifying Propeller assembly for the first time using only a terminal program and 200MHz scope for debugging is without a doubt the slowest development I have ever endured.
Initial performance results from testing from a single cog reading serial 8 data bits-1 stop bit, no handshaking for delimited records. For this project I am using records defined by a "$" start char and $D (carriage return) stop char as commonly used by NMEA 0183. The records used for testing had pseudo-random length between 12 and 96 bytes. My transmitter was emitting about 1,500 records per second at 1.5M Baud.
Chans Max Baud MCSF* ----- --------- ----------- 1 1_500_000 1560KHz 2 460_800 833KHz 3 250_000 556KHz 4 230_400 417KHz 5 115_200 333KHz 6 115_200 278KHz *Minimum Channel Sample Frequency reported by my scope when XORin a diagnostic pin each time a port's data pin was sampled.Here's what I did:
1. Wrote a tiny PASM cog that supported reading and writing its registers from another Spin object using only 12 instructions and 2 data registers in the cog RAM.
pub start return cognew ( @entry, @_req ) pub get_reg ( reg_idx ) result := reg_idx << 1 _req := @result repeat while _req pub set_reg ( reg_idx, reg_val ) result := reg_idx << 1 | 1 _req := @result repeat while _req dat org 'hub ram _req long 0 dat org 'cog ram entry rdlong _ptr, par wz if_z jmp #entry 'no cmd, check again rdlong _val, _ptr 'get cmd val shr _val, #1 wc 'check set bit and shift to get reg index if_nc movd $+2, _val 'set get_reg dest if_c movd $+3, _val 'set set_reg dest if_nc wrlong 0-0, _ptr 'read reg value, copy to usr hub var if_c add _ptr, #8 'offset to reg_val param if_c rdlong 0-0, _ptr 'write usr val from hub ram to reg mov _val, #0 'zero cmd wrlong _val, par 'reset cmd reg jmp #entry 'do it again _ptr res _val resThis cog enabled me to use its registers $00E to $1EF for my own purposes. Once the registers were configured, I set register $000 with an instruction to execute the custom code that was created on-the-fly by a Spin object.
2. Wrote a generic PASM template for the instructions and data for reading serial data from a single pin. For explanation purposes, here is the basic code for assembling bytes from bits.
dat port_code_beg tjnz _rxbyte, #:check_time 'check start bit :seek_start test _rxmask, ina wz if_nz jmp #0-0 'process next port 'got start bit mov _nxtime, cnt add _nxtime, _sbtime mov _rxbyte, _hibyte jmp #0-0 'process next port :check_time cmp _nxtime, cnt wc if_nc jmp #0-0 'process next port 'got data bit test _rxmask, ina wz shl _rxbyte, #1 wc if_nz or _rxbyte, #1 if_nc jmp #:got_byte add _nxtime, _bitime jmp #0-0 'process next port :got_byte shr _rxbyte, #1 'shift out stop bit rev _rxbyte, #24 'reverse data bit order 'got complete byte here call #handle_byte :exit mov _rxbyte, #0 'start all over with new byte jmp #0-0 'process next port port_code_end _rxmask long 0 _bitime long 0 _sbtime long 0 _dbmask long 0 _nxtime long 0 _rxbyte long 0 _bufreg long 0 _rcvptr long 0 _endptr long 03. Then I wrote the Spin code to duplicate that PASM template for every serial channel and set the instruction register's source and destination bits customised for each channel. Every source bits for JMP #0-0 instruction was overwritten with the register number for the next channel's code. Oh yeah, I also wrote a little disassembler to see what was going on with the code I was writing on-the-fly. Here's a dump of the multi-receiver's PASM registers when configured for two channels starting at register $010:
Notice that each channel's instructions have their source and destination bits independently configured, this is how I keep the channel switching overhead to a minimum. The receiver is put into action by setting register $000 to JMP #$019. My Spin code went through the PASM code registers for each channel's instruction block and fixed up the source and destination bits whereevery needed. JMPs to #0-0 were overwritten to jump to the next channel's instruction block. The data registers before each channel's instruction block contains the channel's settings: pin mask, bit ticks, next data bit ticks, received bits so far, trace mask, etc. The data registers following each code block contain the hub addresses of where and how the received data should be written for the user. (My implementation supports multiple receive buffers for each channel. If a receiver buffer is busy, eg. being processed by the user, the current record can be written to another buffer. This allows a single channel's records to be processed by multiple cogs, quite useful when records are coming in at 1KHz and the higher-level processing is being done in Spin.) I've deliberately omitted the code that does the processing of a received byte. This code is part of the generic template for the real project but I left it to make this explanation easier to understand. After more testing I'll post the receiver project to the Propeller objex.
Cheers,
Christopher
I'm also a bit worried about this bit:
:check_time cmp _nxtime, cnt wc if_nc jmp #0-0 'process next portI was wondering how long it would take you to find that. Usually only takes me $ffffffff / 80_000_000 seconds maximum.
Are you saying you put that there on purpose?
Also, the 1.5MBaud MCSF* in my previous table should be 3.8MHz, not 1.56MHz.
Is there a better way than this? Would prefer to not introduce a temporary register.
mov _tmp, _nxtime sub _tmp, cnt cmps _tmp, #0 wc if_nc jmp #0-0mov cnt, _nxtime sub cnt, cnt cmps cnt, #0 wc if_nc jmp #0-0You could also use a normal counter which makes the comparison easier, i.e. run any counter in LOGIC.always so the above becomes e.g.