Smarter/faster array element indexing using indirect register addressing in PASM?
northcove
Posts: 49
I am trying to improve the performance of a PASM routine that continually processes registers which have originated as separate LONG arrays of user values in hub ram. My PASM implementation works correctly using MOVS and MOVD for register indexing but my approach has several drawbacks:
1. my actual implementation reads and writes registers in many different places; keeping track of all the instructions that need modification via MOVS / MOVD is onerous;
2. my approach requires additional temporary "working" registers that represent the current element of each vector I'm processing;
3. the overhead of code self-modification for simulating vector indexing before and every iteration is causing performance problems with my solution.
Here's an example that illustrates my issues:
The single biggest issue is my loop needs to run as fast as possible but the overhead of the :restart and :increment instruction blocks is causing problems. (The project is a serial multiplexer that handles up to 32 input channels, up to 1megabaud input data rate, and offers delivers a 3 megabaud output data rate.) Is there a better way to achieve array-like indexing with PASM registers than what I'm doing?
Cheers,
Christopher
1. my actual implementation reads and writes registers in many different places; keeping track of all the instructions that need modification via MOVS / MOVD is onerous;
2. my approach requires additional temporary "working" registers that represent the current element of each vector I'm processing;
3. the overhead of code self-modification for simulating vector indexing before and every iteration is causing performance problems with my solution.
Here's an example that illustrates my issues:
dat org main 'compute S[i]:=A[i]+B[i]+C[i]+S[i] repeatedly :restart movs :rd_A, #A 'init A src reg movs :rd_B, #B 'init B src reg movs :rd_C, #C 'init C src reg movs :rd_S, #S 'init S src reg movd :wr_S, #S 'init S dst reg mov _N, #N 'init counter mov _P, par 'init hub ram dst reg :compute ' :rd_A mov _A, 0-0 'read current A element into working register :rd_B mov _B, 0-0 'read current B element into working register :rd_C mov _C, 0-0 'read current C element into working register 'do lots of read/write with A, B, and C here before they are added to S :rd_S mov _S, 0-0 'read current S element into working register add _S, _A 'add working A to working S add _S, _B 'add working B to working S add _S, _C 'add working C to working S :wr_S mov 0-0, _S 'write working S to pasm reg wrlong _S, _P 'write working S to hub ram djnz _N, #:increment 'operate on the next array elements jmp #:restart 'start all over from the first array elements :increment add :rd_A, #1 'incr A src reg add :rd_B, #1 'incr B src reg add :rd_C, #1 'incr C src reg add :rd_S, #1 'incr X src reg add :wr_S, DI 'incr X dst reg add _P, #4 'incr dst hub ram ptr jmp #:compute A long 0 [ N ] 'input vector inited by hub cog B long 0 [ N ] 'input vector inited by hub cog C long 0 [ N ] 'input vector inited by hub cog S long 0 [ N ] 'input vector sum result in pasm cog DI long 1 << 9 'reg val to incr dst reg 'my "working" registers _A res _B res _C res _S res _N res _P res fit
The single biggest issue is my loop needs to run as fast as possible but the overhead of the :restart and :increment instruction blocks is causing problems. (The project is a serial multiplexer that handles up to 32 input channels, up to 1megabaud input data rate, and offers delivers a 3 megabaud output data rate.) Is there a better way to achieve array-like indexing with PASM registers than what I'm doing?
Cheers,
Christopher
Comments
I'd also re-order the jumps slightly (but that's just me):
That code is not intended to update hub ram or even do anything useful; it is a simple example just to show how a technique of indexing through PASM registers to access individual elements. I only wrote out S to hub ram to verify my example worked correctly.
In reality, in my actual implementation A, B and C are being modified in the inner loop (hence the "do lots of read/write.." comment.)
I need to find a better way of reading/writing registers as though they are elements in large arrays. Still learning how to best achieve this on that offers self-modifying instructions rather than more conventional indirect register addressing.
My ideal is to merely adjust a base pointer and/or register using a single instruction each time through the main loop and then access current elements of A, B, C etc using a fixed offset from that main pointer / register. But unfortunately I'm not yet high enough on the PASM learning curve to achieve this.
top:
RDxxxxx
ins1
ins2
RDxxxxx
ins1
ins2
RDxxxxx
ins1
ins2
WRxxxxx
ins1
jmp #top
Note that once you are synchronized to the hub, each of the above RDxxxx blocks will take 16 clock cycles, ins1/2 essentially execute for free.
If there are three instructions after a hub operation, it will waste 12 clock cycles - ie above executes in 64 clock cycles, but below takes 80
top:
RDxxxxx
ins1
ins2
ins3 ***** effectively takes 16 clock cycles
RDxxxxx
ins1
ins2
RDxxxxx
ins1
ins2
WRxxxxx
ins1
jmp #top
(thanks to kuroneko for pointing out my silly mistake. Shows me I have not been writing enough crazy pasm lately)
Basically, schedule your hub ops carefully
Yep.
DUH!
TWO spacer instructions!!!!
I am getting old and forgetfull... fixing post.
Thanks.
-Phil
Yeah, that's a neat trick - thanks.
Got any tricks for faster r/w access of incremented PASM registers other than modifying instructions via MOVS/MOVD?
Cheers,
Christopher
Enjoy!
Mike
well... kuroneko has used PHSA (if my memory is working this time) as a hub pointer, but that is crazy code.
^Yeah, that's closer to my real problem.
My actual code scans multiple pins for delimited serial data not dissimilar to NMEA sentences from a GPS, for example. (In my application the data could be coming in at 500k-1Mbaud / 1kHz records over fibre optic cable, not 4800 baud / 1Hz records from a GPS.) The PASM code performs checks on the incoming bytes and writes the record to hub ram when records are complete. The receiver code also performs more complex preprocessing of these records before writing them to hub ram. Each receiver channel has about twenty fields associated with it: rx pin mask, ticks per bit, start char, stop char, bits remaining, checksum, error count, destination buffer address, destination buffer length, routing information, IO stats, etc. My single port implementation handles 1.5Mbaud input data just fine (tested with a multi-port transmitter I wrote that handles 3Mbaud and ~1k records per second per port.)
Converting the single port code over to multi-port I'm finding that the MOVS/MOVD workaround for the absence of indexed register modes or auto incrementing is causing overhead that I wasn't anticipating writing PASM on the Prop. My multi-channel code requires ~twenty MOVS/MOVD before the first iteration to set the base source and destination registers to load the elements into working registers. Each sucessive iteration requires another ~twenty instructions to increment the source registers and yet another twenty instructions to save & increment the destination registers. And I'm trying to keep the port switching overhead to under 0.2uSecs !!
Copy that. Realistically there will only be one or two high speed streams (>500kbaud) to handle. When streams are numerous they'll mostly be in the 38.4-115.2kbaud range. I have sufficient free cogs to allocate to specific streams when the bandwidth requires it. I'm also fortunate in that I can tweak the incoming baud to match my receivers because I developed the transmitters.
One of my issues is that the PASM register layout is inherited from arrays in the the Spin cog's DAT section, eg:
In the single-to-multi-port conversion I've lazily resorted to using "working" registers that reflected the single port usage of registers. What I've ended up with this overhead before every channel loop:
And this overhead after every channel iteration:
Yikes & yuck.
Possibly a better solution can be found by reorganising the layout such that each channel has a base register and its fields are in adjacent registers. Not sure this would fix much the absence of register indexing/autoincrementing instructions, however.
While my single port implementation was really fast, I'm having a complete rethink and starting over for the multi-port implementation. Given the Prop's self-modifying code capability next I will experiment with a template sequence of instructions (start bit scanning, data byte assembly) to handle a single channel and then upon entry have the cog duplicate that instruction template for additional channels with a one-time setting of each channel's instruction's source and destination registers. The return registers will be appropriately to process channels in sequence. I'm a n00b at PASM and this is my first time writing self-modifying code but I like the new direction it's taking me.
Here's an update with my loop unrolling experiment for my multi-channel serial receiver. The project took almost two weeks longer than I expected. My excuse is I spent most of last week skiing in the mountains with my kids, my girlfriend, her kids and her ex. Also, writing self-modifying Propeller assembly for the first time using only a terminal program and 200MHz scope for debugging is without a doubt the slowest development I have ever endured.
Initial performance results from testing from a single cog reading serial 8 data bits-1 stop bit, no handshaking for delimited records. For this project I am using records defined by a "$" start char and $D (carriage return) stop char as commonly used by NMEA 0183. The records used for testing had pseudo-random length between 12 and 96 bytes. My transmitter was emitting about 1,500 records per second at 1.5M Baud.
Here's what I did:
1. Wrote a tiny PASM cog that supported reading and writing its registers from another Spin object using only 12 instructions and 2 data registers in the cog RAM.
This cog enabled me to use its registers $00E to $1EF for my own purposes. Once the registers were configured, I set register $000 with an instruction to execute the custom code that was created on-the-fly by a Spin object.
2. Wrote a generic PASM template for the instructions and data for reading serial data from a single pin. For explanation purposes, here is the basic code for assembling bytes from bits.
3. Then I wrote the Spin code to duplicate that PASM template for every serial channel and set the instruction register's source and destination bits customised for each channel. Every source bits for JMP #0-0 instruction was overwritten with the register number for the next channel's code. Oh yeah, I also wrote a little disassembler to see what was going on with the code I was writing on-the-fly. Here's a dump of the multi-receiver's PASM registers when configured for two channels starting at register $010:
Notice that each channel's instructions have their source and destination bits independently configured, this is how I keep the channel switching overhead to a minimum. The receiver is put into action by setting register $000 to JMP #$019. My Spin code went through the PASM code registers for each channel's instruction block and fixed up the source and destination bits whereevery needed. JMPs to #0-0 were overwritten to jump to the next channel's instruction block. The data registers before each channel's instruction block contains the channel's settings: pin mask, bit ticks, next data bit ticks, received bits so far, trace mask, etc. The data registers following each code block contain the hub addresses of where and how the received data should be written for the user. (My implementation supports multiple receive buffers for each channel. If a receiver buffer is busy, eg. being processed by the user, the current record can be written to another buffer. This allows a single channel's records to be processed by multiple cogs, quite useful when records are coming in at 1KHz and the higher-level processing is being done in Spin.) I've deliberately omitted the code that does the processing of a received byte. This code is part of the generic template for the real project but I left it to make this explanation easier to understand. After more testing I'll post the receiver project to the Propeller objex.
Cheers,
Christopher
I'm also a bit worried about this bit:
I was wondering how long it would take you to find that. Usually only takes me $ffffffff / 80_000_000 seconds maximum.
Are you saying you put that there on purpose?
Also, the 1.5MBaud MCSF* in my previous table should be 3.8MHz, not 1.56MHz.
Is there a better way than this? Would prefer to not introduce a temporary register.