Code review/challenge:fastest PASM implementing read of variable num longs to cog
ags
Posts: 386
I am often impressed by the techniques used by the most skilled PASM coders here. I think of it like design patterns for high-level languages. I just don't have that level of experience with assembly code.
If anyone is interested in a challenge, I am in need of an optimized method to read a variable length record (1-14 longs) from hub RAM into a circular buffer in cog RAM. I'm using a mailbox to indicate the record is ready in hub RAM; when the value >0, it indicates the record length (not including the length long itself).
I'm using this in a timing-critical project in a complex larger driver. The timing of the overall driver will be dictated by the worst-case delay during this load. I believe that will be when a full 14 long record is presented, and it also corresponds with a wrap around the end of the cog buffer. (I suspect that the best implementation will be the best regardless of the specific longest record length)
I've stared at this until I'm blind. The basic strategy recognizes that I'll never read a record longer than the cog buffer length, so I'll never have more than one buffer wrap (most times there will be none - but worst case is what matters). The shortest load-increment-loop code is 16 clocks, also aligning consecutive rdlongs to hub access timing, so I've preserved that. Rather than incrementing other counters in that loop I perform those calculations (albeit rather clumsily) and determine if wrap is needed. If so, I use the same loading loop twice: once to the buffer end, then again for any remaining. The special case of ending just at the cog buffer end is also handled. I've also pushed things in front of the loading loop to utilize all time between the rdlong reading the record length and the rdlong loading the first record long.
This is the best I can do. I may be totally missing a much more elegant method. Despite hours of effort, this looks like a huge hack to me. This also seems like a rather common method to use, so I suspect that there are some really fast, elegant implementations beyond my creativity. If anyone has suggestions I would appreciate the input. Full disclosure: I can't finalize an architecture for the overall driver until I know the worst-case timing of this method. That means this code has been reviewed, but not tested. It may be functionally flawed. By my assessment it requires 22 + 4*n instructions (not including the call to this method) worst case for an n-long record.
Thanks.
PS - if someone can explain how to cut-and-paste (using the appropriate code designation) and get aligning (from PropTool) to remain readable I'd appreciate it. I hate trying to align code manually - and never succeed)
If anyone is interested in a challenge, I am in need of an optimized method to read a variable length record (1-14 longs) from hub RAM into a circular buffer in cog RAM. I'm using a mailbox to indicate the record is ready in hub RAM; when the value >0, it indicates the record length (not including the length long itself).
I'm using this in a timing-critical project in a complex larger driver. The timing of the overall driver will be dictated by the worst-case delay during this load. I believe that will be when a full 14 long record is presented, and it also corresponds with a wrap around the end of the cog buffer. (I suspect that the best implementation will be the best regardless of the specific longest record length)
I've stared at this until I'm blind. The basic strategy recognizes that I'll never read a record longer than the cog buffer length, so I'll never have more than one buffer wrap (most times there will be none - but worst case is what matters). The shortest load-increment-loop code is 16 clocks, also aligning consecutive rdlongs to hub access timing, so I've preserved that. Rather than incrementing other counters in that loop I perform those calculations (albeit rather clumsily) and determine if wrap is needed. If so, I use the same loading loop twice: once to the buffer end, then again for any remaining. The special case of ending just at the cog buffer end is also handled. I've also pushed things in front of the loading loop to utilize all time between the rdlong reading the record length and the rdlong loading the first record long.
This is the best I can do. I may be totally missing a much more elegant method. Despite hours of effort, this looks like a huge hack to me. This also seems like a rather common method to use, so I suspect that there are some really fast, elegant implementations beyond my creativity. If anyone has suggestions I would appreciate the input. Full disclosure: I can't finalize an architecture for the overall driver until I know the worst-case timing of this method. That means this code has been reviewed, but not tested. It may be functionally flawed. By my assessment it requires 22 + 4*n instructions (not including the call to this method) worst case for an n-long record.
Thanks.
PS - if someone can explain how to cut-and-paste (using the appropriate code designation) and get aligning (from PropTool) to remain readable I'd appreciate it. I hate trying to align code manually - and never succeed)
'------------------------------------------------------------------------------------------------------------ ORG driverInit mov mailbox, par add mailbox, #4 'setup mailbox mov hubAdr, mailbox [COLOR=#ff0000]add[/COLOR] hubAdr, #4 'setup hub address holding record [error repaired] movd indLoad, #cogBuf 'initialize indirect load address to cog buffer starting address {OTHER PROCESSING HERE} loadBuf 'load one variable-length record (1-14 longs) 'mailbox (hub RAM address) contains record length (#longs) 'recLen holds the record length 'bufFreeCnt holds number of contiguous free longs in cog RAM (may wrap around from end to start) 'bufHdrmCnt holds "headroom" - number of longs from current buffer position to top of buffer 'is a record available in the hub? (1-14 longs) rdlong recLen, mailbox wz 'z set if no record available to load if_z jmp #loadBuf_ret 'is there room in the buffer to hold the entire record? cmp bufFreeCnt,recLen wc 'c set if record will not fit in buffer - not set if equal if_c jmp #loadBuf_ret mov loadCnt, recLen 'seed number of longs to load in loop with record length max loadCnt, bufHdrmCnt 'limit longs to load by buffer headroom - will need two passes sub bufFreeCnt, recLen 'update buffer free count (will never be <0) sub bufHdrmCnt, recLen wz, wc 'does the record hit or span end of buffer? (may be =<0) [COLOR=#ff0000]abs[/COLOR] recLen, bufHdrmCnt 'recLen now contains load count for second pass (if needed) if_be add bufHdrmCnt, #COG_BUF_LEN 'if buffer wrap needed reset headroom (will be >0) movs indLoad, hubBufAdr 'initialize indirect address to read hub RAM at record start indLoad rdlong 0-0, 0-0 'NOTE: Must initialize dest=#cogBuf at driver startup add indLoad, _bit9bit2 'increment cog and hub addresses djnz loadCnt, #indLoad 'loop for specified long count if_a jmp #:finish 'if buffer headroom >0, done movd indLoad, #cogBuf 'reset location for next write to buffer to start of buffer if_z jmp #finish 'no need to take another pass if no unread longs remain mov loadCnt, recLen wz, wc 'set remaining longs to load, clear z,c (0<recLen<$8000_0000) jmp #indLoad :finish wrlong _zero, mailbox 'clear mailbox, signalling ready for next record loadBuf_ret ret _zero long 0 _bit9bit2 long 00000100 mailbox res 1 recLen res 1 loadCnt res 1 bufFreeCnt res 1 bufHdrmCnt res 1 hubBufAdr res 1 cogBuf res COG_BUF_LEN '------------------------------------------------------------------------------------------------------------
Comments
re: editor, looks like you use the WYSIWYG version (consumes binary indicators). Change your settings to the Standard Editor. Always worked for me.
Generally code I've reviewed this thoroughly works. I must have a mental block. Do you mean the loop itself is flawed, or the mess around it?
Thanks for the reply.
Yes, it's the loop (and one insn around it). While the _bit9 part takes care of the destination register (OK) the bit2 increments the source register (index) not the address (NG). Your approach would only work if your record data is located in the first 128 hub longs (rdlong dst, #addr).
Yes, of course you are correct. How embarrassing - and frustrating. To spend so much time on something that, while admittedly a hack, was hoped to be somewhat ingenious - only to realize that the core was fundamentally flawed. So it seems that there is no way to read from hub RAM into cog RAM in 4 instructions, incrementing both the cog and hub addresses (unless using direct addressing as you point out above). Is that true?
This also makes me look at something that is obvious, but put it into a context that will help it sink in and be more natural for my way of thinking and help prevent similar mistakes in the future. I find that most every time I use a movi/movd instruction, it is because I really want indirect addressing - which is not supported in PASM. The new view is that rd* and wr* are actually indirect instructions (indirect source for rd* and indirect destination for wr*). That is why my method doesn't work. As in your example above, the # modifies that, and I would describe that as changing an indirect address into a direct address (limited by the 9-bit size of the address), whereas with other "normal" instructions, it changes direct addressing into immediate addressing. (Comments on that?)
Edit: This brings up an interesting related question? Is the reason for rd*/wr* operations taking 8 clocks (rather than 4) due to extra steps in completing an indirect reference, compared to direct or immediate? I always presumed it was due to hub RAM speed vs cog RAM speed. Also, all hub operations take 8 clocks to execute (neglecting hub window synchronization) so I doubt it, but it did raise the question (presuming my indirect/direct/immediate model is reasonable).
Edit: After examining the code referenced in your reply, I see how it works (very clever). It's not clear to me how I can apply that with a circular buffer though. The unrolled loop works in one hub window because you don't have to spend time incrementing the rdlong destination (cog) address (hardcoded for each possible consecutive read). With a circular buffer, this is dynamic and cannot be predetermined. If I increment this then I've missed the hub window and may be able to use a loop (saving space) - but will not meet my timing budget. Am I missing something?
Also in the referenced post, you mention a way ("... more trouble than it's usually worth") to loop within one hub window. It may be worth the trouble to me. Can you direct me towards that method (if appropriate for this application?) Since my pride is already defeated (for today), I'll ask what the other mistake is. I don't see it. (I expect it would have taken me a while to find the fatal flaw in the loop though, despite it being an obvious mistake. A good case for peer code reviews.)
Thanks again.
-Phil
Your certitude is justified, Phil. Fixed in OP. I suppose I prefer the "other error" to be a stupid typo rather than error in logic with all the fuss with z and c flags, timing, and optimizing. Oh, except for the fatal stupid flaw in the main load loop...
Maybe you can adapt it to your app.
-Phil
The [post=1108886]version I usually use[/post] goes backwards and transfers always 2n longs. While easy enough to use for plain block transfers it's a bit more complicated/unacceptable for unaligned loads (2n+1 longs) let alone circular buffers.
max loadCnt, bufHdrmCnt picks the minimum of the two parameters. Assuming the latter is 10 and you want to transfer 14 values you end up with 10 in the first loop. Later you subtract recLen from bufHdrmCnt (-4) then add this to recLen (14 + (-4) = 10) for the second run. You probably want abs recLen, bufHdrmCnt instead.
Just deleted many lines. Just to clarify:
Indirect: the operand is the address of a register that contains the address of the register that will be accessed, eg source operand (although crossing cog/hub RAM space) in: Direct: the operand is the address of the register to be accessed, eg both operands in: Immediate: the operand is the value itself, no register is accessed, eg source operand in:
But it's not helpful because there is no control in PASM of addressing mode (other than literals using the definitions above). I hope I'm not making any misstatements.
I think I see how that could work. Problem is the (circular) buffer I'm loading is 128 longs. This is a hard problem.
You are absolutely correct (I would have used neg because it is only used if it was negative (before this instruction)) but abs is more clearly indicating what the intent is. It's remarkable how you are able to find these problems so quickly.
I think I now have a very nicely reviewed and debugged non-functioning example of code that won't work.:frown:
I suppose I could try the trick using a counter to increment the rdlong source operand - but that could be messy.
Always loads from the same hub location. First long holds record length, between 1-14. The overall design I had in mind had a budget of about 320 clocks to load one record (14 longs worst case). So if I cannot find a single-hub-window-per-long method that just can't work. Also using the counters is clearly not possible either.
Yes, of course. Looks quite straightforward... now that you've figured it out Runtime penalty of just 4 clocks* each call, and 15 longs for hubAdr and d1s1. I will proceed and see how this fits into the overall driver.
The other option I came up with would be to have the load loop limit itself to no more than some number (likely 5-7) long reads each call. If the record length is greater than that, it keeps track and picks up on the next call. While it would change the overall driver design, I would end up filling many more instances of much smaller available time slices, which may be enough to keep up the overall data rate. If I can accommodate timing of your idea that seems cleaner (and my earlier work isn't totally wasted.)
Thank you.
Edit: *originally thought to repay with savings of 4 clocks after the load loop, but for worst-case timing (requiring a wrap of the circular buffer) it remains 32 clocks, same as the original code. Propville credit, not legal tender, awarded for 8 clocks saved before the load loop since hub window makes it moot.
The code you posted has been very helpful. The idea of using the array to index to the next hub address with a simple add is creative. I should have thought of that once you pointed out my error (which I call not considering that rd*/wr* use "indirect addressing" for hub RAM) - but I did not, credit to you.
You also demonstrated a use of cmpsub - which I have seen every time I open the documentation but have never found a good use for - until now.
Also, setting the carry flag with: is a great trick which I picked up from one of the links you provided. Very nice.
Thank you for taking the time to reply. With this as a starting point, I have been able to create an overall architecture that can accommodate the timing of this basic design. I also have modified for use in the "push" driver cog which presents data to the reader cog. I truly appreciate the help.