Code review/challenge: fastest PASM implementing read of a variable number of longs to cog

ags · 2013-08-19 22:28

I am often impressed by the techniques used by the most skilled PASM coders here. I think of it like design patterns for high-level languages. I just don't have that level of experience with assembly code.

If anyone is interested in a challenge, I am in need of an optimized method to read a variable length record (1-14 longs) from hub RAM into a circular buffer in cog RAM. I'm using a mailbox to indicate the record is ready in hub RAM; when the value >0, it indicates the record length (not including the length long itself).

I'm using this in a timing-critical project in a complex larger driver. The timing of the overall driver will be dictated by the worst-case delay during this load. I believe that will be when a full 14 long record is presented, and it also corresponds with a wrap around the end of the cog buffer. (I suspect that the best implementation will be the best regardless of the specific longest record length)

I've stared at this until I'm blind. The basic strategy recognizes that I'll never read a record longer than the cog buffer length, so I'll never have more than one buffer wrap (most times there will be none - but worst case is what matters). The shortest load-increment-loop code is 16 clocks, also aligning consecutive rdlongs to hub access timing, so I've preserved that. Rather than incrementing other counters in that loop I perform those calculations (albeit rather clumsily) and determine if wrap is needed. If so, I use the same loading loop twice: once to the buffer end, then again for any remaining. The special case of ending just at the cog buffer end is also handled. I've also pushed things in front of the loading loop to utilize all time between the rdlong reading the record length and the rdlong loading the first record long.
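As an illustration only, the two-pass split described above can be modelled in Python (not PASM). The function and parameter names are hypothetical, and the sketch assumes the headroom count is always at least 1:

```python
def split_record(rec_len, headroom, buf_len):
    """Model the two-pass circular-buffer load.

    rec_len  -- longs in the record (1..14 in this thread)
    headroom -- longs from the current write position to the buffer end (>= 1)
    buf_len  -- total buffer length in longs
    Returns (first_pass, second_pass, new_headroom).
    """
    first_pass = min(rec_len, headroom)      # load up to the buffer end
    second_pass = rec_len - first_pass       # remainder goes to the buffer start
    new_headroom = headroom - rec_len
    if new_headroom <= 0:                    # hit or passed the end: wrap
        new_headroom += buf_len
    return first_pass, second_pass, new_headroom


print(split_record(14, 10, 128))   # (10, 4, 124)
```

A record that ends exactly at the buffer end gives a second pass of 0 and a headroom reset to the full buffer length, which is the special case mentioned above.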

This is the best I can do. I may be totally missing a much more elegant method. Despite hours of effort, this looks like a huge hack to me. This also seems like a rather common method to use, so I suspect that there are some really fast, elegant implementations beyond my creativity. If anyone has suggestions I would appreciate the input. Full disclosure: I can't finalize an architecture for the overall driver until I know the worst-case timing of this method. That means this code has been reviewed, but not tested. It may be functionally flawed. By my assessment it requires 22 + 4*n instructions (not including the call to this method) worst case for an n-long record.

Thanks.

PS: If someone can explain how to cut and paste code (using the appropriate code designation) so that the alignment from PropTool remains readable, I'd appreciate it. I hate trying to align code manually, and never succeed.

'------------------------------------------------------------------------------------------------------------
ORG
driverInit
  mov mailbox, par
  add mailbox, #4                'setup mailbox
  mov hubAdr, mailbox
  add hubAdr, #4                'setup hub address holding record [error repaired]
  movd indLoad, #cogBuf    'initialize indirect load address to cog buffer starting address

{OTHER PROCESSING HERE}

loadBuf                               'load one variable-length record (1-14 longs)
                                           'mailbox (hub RAM address) contains record length (#longs)
                                           'recLen holds the record length
                                           'bufFreeCnt holds number of contiguous free longs in cog RAM (may wrap around from end to start)
                                           'bufHdrmCnt holds "headroom" - number of longs from current buffer position to top of buffer

'is a record available in the hub? (1-14 longs)
  rdlong recLen, mailbox wz           'z set if no record available to load
  if_z jmp #loadBuf_ret
'is there room in the buffer to hold the entire record?
  cmp bufFreeCnt,recLen wc         'c set if record will not fit in buffer - not set if equal
  if_c jmp #loadBuf_ret

  mov loadCnt, recLen                  'seed number of longs to load in loop with record length
  max loadCnt, bufHdrmCnt           'limit longs to load by buffer headroom - will need two passes
  sub bufFreeCnt, recLen              'update buffer free count (will never be <0)
  sub bufHdrmCnt, recLen wz, wc   'does the record hit or span end of buffer? (may be =<0)
  abs recLen, bufHdrmCnt             'recLen now contains load count for second pass (if needed)
  if_be add bufHdrmCnt, #COG_BUF_LEN                'if buffer wrap needed reset headroom (will be >0)
  movs indLoad, hubBufAdr                                       'initialize indirect address to read hub RAM at record start

  indLoad rdlong 0-0, 0-0                                           'NOTE: Must initialize dest=#cogBuf at driver startup
  add indLoad, _bit9bit2                                             'increment cog and hub addresses
  djnz loadCnt, #indLoad                                            'loop for specified long count

  if_a jmp #:finish                                                        'if buffer headroom >0, done

  movd indLoad, #cogBuf                                           'reset location for next write to buffer to start of buffer
  if_z jmp #:finish                                                        'no need to take another pass if no unread longs remain

  mov loadCnt, recLen wz, wc                                    'set remaining longs to load, clear z,c (0<recLen<$8000_0000)
  jmp #indLoad
:finish wrlong _zero, mailbox                                     'clear mailbox, signalling ready for next record
loadBuf_ret ret

_zero long 0
_bit9bit2 long 00000100
mailbox res 1
recLen res 1
loadCnt res 1
bufFreeCnt res 1
bufHdrmCnt res 1
hubBufAdr res 1
cogBuf res COG_BUF_LEN

'------------------------------------------------------------------------------------------------------------

kuroneko · 2013-08-20 00:37

If it really has to be a circular buffer I'd suggest you stick with the two pass loading but use an unrolled loading loop (post 1042678) provided you have the code space (the one you use doesn't work).

re: editor, looks like you use the WYSIWYG version (consumes binary indicators). Change your settings to the Standard Editor. Always worked for me.

ags · 2013-08-20 06:00

kuroneko wrote: »

If it really has to be a circular buffer...

Maybe I'm overlooking something here too. I am loading the variable length record into the hub from another cog. The other cog will stall at times so I need to be able to read to and write from my cog buffer at different rates. That pointed me to a circular buffer but perhaps there's another way?

... I'd suggest you stick with the two pass loading but use an unrolled loading loop (post 1042678) provided you have the code space.

I presumed that I didn't have the option of unrolling the loop since I have a variable length record. I may not have the space to accommodate this for up to 14 args, though.

(the one you use doesn't work).

Generally code I've reviewed this thoroughly works. I must have a mental block. Do you mean the loop itself is flawed, or the mess around it?

Thanks for the reply.

kuroneko · 2013-08-20 06:22

ags wrote: »

Maybe I'm overlooking something here too. I am loading the variable length record into the hub from another cog. The other cog will stall at times so I need to be able to read to and write from my cog buffer at different rates. That pointed me to a circular buffer but perhaps there's another way?

Is it that you only ever have one record to work with or do you expect having to read e.g. 4 more before you've processed the first record in the buffer?

ags wrote: »

Do you mean the loop itself is flawed, or the mess around it?

Yes, it's the loop (and one insn around it). While the _bit9 part takes care of the destination register (OK) the bit2 increments the source register (index) not the address (NG). Your approach would only work if your record data is located in the first 128 hub longs (rdlong dst, #addr).
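To make the field arithmetic concrete, here is a small Python model (not PASM) of what adding bit 9 and bit 2 does to an instruction word. It relies only on the Propeller 1 layout of a 9-bit destination field in bits 17..9 and a 9-bit source field in bits 8..0; the starting register numbers are hypothetical and the opcode bits are left at zero:

```python
D_SHIFT = 9            # destination field occupies bits 17..9
FIELD_MASK = 0x1FF     # both fields are 9 bits wide


def fields(instr):
    """Return (dest, src) register fields of a 32-bit instruction word."""
    return (instr >> D_SHIFT) & FIELD_MASK, instr & FIELD_MASK


def step(instr, increment):
    """Model `add indLoad, increment` applied to the instruction word."""
    return (instr + increment) & 0xFFFFFFFF


# Hypothetical starting point: dest field = register 100, src field = register 50
instr = (100 << D_SHIFT) | 50
bit9bit2 = (1 << 9) | (1 << 2)

instr = step(instr, bit9bit2)
print(fields(instr))   # (101, 54)
```

The destination moves on by one register as intended, but the source field now names a register four slots further along. Without the # modifier, rdlong takes its hub address from whatever that register contains, so the hub address itself has not been incremented by 4.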

ags · 2013-08-20 09:19

kuroneko wrote: »

Is it that you only ever have one record to work with or do you expect having to read e.g. 4 more before you've processed the first record in the buffer?

Short answer is I need to be able to store and consume data records at different rates. Therefore the number of records stored in the cog changes over time based on the aggregate of the difference in storage/consumption rates up to that point in time.

While the _bit9 part takes care of the destination register (OK) the bit2 increments the source register (index) not the address (NG). Your approach would only work if your record data is located in the first 128 hub longs (rdlong dst, #addr).

Yes, of course you are correct. How embarrassing - and frustrating. To spend so much time on something that, while admittedly a hack, was hoped to be somewhat ingenious - only to realize that the core was fundamentally flawed. So it seems that there is no way to read from hub RAM into cog RAM in 4 instructions, incrementing both the cog and hub addresses (unless using direct addressing as you point out above). Is that true?

This also makes me look at something that is obvious, but put it into a context that will help it sink in and be more natural for my way of thinking and help prevent similar mistakes in the future. I find that most every time I use a movi/movd instruction, it is because I really want indirect addressing - which is not supported in PASM. The new view is that rd* and wr* are actually indirect instructions (indirect source for rd* and indirect destination for wr*). That is why my method doesn't work. As in your example above, the # modifies that, and I would describe that as changing an indirect address into a direct address (limited by the 9-bit size of the address), whereas with other "normal" instructions, it changes direct addressing into immediate addressing. (Comments on that?)

Edit: This brings up an interesting related question. Is the reason for rd*/wr* operations taking 8 clocks (rather than 4) due to extra steps in completing an indirect reference, compared to direct or immediate? I always presumed it was due to hub RAM speed vs cog RAM speed. Also, all hub operations take 8 clocks to execute (neglecting hub window synchronization) so I doubt it, but it did raise the question (presuming my indirect/direct/immediate model is reasonable).

Edit: After examining the code referenced in your reply, I see how it works (very clever). It's not clear to me how I can apply that with a circular buffer though. The unrolled loop works in one hub window because you don't have to spend time incrementing the rdlong destination (cog) address (hardcoded for each possible consecutive read). With a circular buffer, this is dynamic and cannot be predetermined. If I increment this then I've missed the hub window and may be able to use a loop (saving space) - but will not meet my timing budget. Am I missing something?

Also in the referenced post, you mention a way ("... more trouble than it's usually worth") to loop within one hub window. It may be worth the trouble to me. Can you direct me towards that method (if appropriate for this application?)

Yes, it's the loop (and one insn around it).

Since my pride is already defeated (for today), I'll ask what the other mistake is. I don't see it. (I expect it would have taken me a while to find the fatal flaw in the loop though, despite it being an obvious mistake. A good case for peer code reviews.)

Thanks again.

Phil Pilgrim (PhiPi) · 2013-08-20 13:53

I'm sure you didn't mean the following:

mov hubAdr, mailbox
  mov hubAdr, #4                'setup hub address holding record

-Phil

ags · 2013-08-20 15:11

Phil Pilgrim (PhiPi) wrote: »
I'm sure you didn't mean the following:
mov hubAdr, mailbox
  mov hubAdr, #4                'setup hub address holding record
-Phil

Your certitude is justified, Phil. Fixed in OP. I suppose I prefer the "other error" to be a stupid typo rather than error in logic with all the fuss with z and c flags, timing, and optimizing. Oh, except for the fatal stupid flaw in the main load loop...

Cluso99 · 2013-08-20 15:16

There is a method to read hub quickly by reading in reverse. I didn't come up with that sequence, but I made use of it in my overlay loader (see obex). The method may help your project.

Phil Pilgrim (PhiPi) · 2013-08-20 16:14

Here's a short description of the reverse method that Cluso mentioned, applied to the LMM kernel, along with a slightly slower method that runs forward:

http://forums.parallax.com/showthread.php/114938-Faster-LMM-Code-Not-reversed-this-time.-%28IMPORTANT-Addendum-in-Top-Post%29

Maybe you can adapt it to your app.

-Phil

kuroneko · 2013-08-20 16:45

ags wrote: »

Edit: After examining the code referenced in your reply, I see how it works (very clever). It's not clear to me how I can apply that with a circular buffer though. The unrolled loop works in one hub window because you don't have to spend time incrementing the rdlong destination (cog) address (hardcoded for each possible consecutive read). With a circular buffer, this is dynamic and cannot be predetermined. If I increment this then I've missed the hub window and may be able to use a loop (saving space) - but will not meet my timing budget. Am I missing something?

I was thinking along the line that you keep the index to the first free slot (e.g. 4). Transferring values would be done by jumping to code entry 4*3 (i.e. rdlong buf4, addr). This will then execute 10 times filling slots 4 to 13. According to the flags you then decide to call the function again from the top (i.e. rdlong buf0, addr). As it limits itself re: parameter count it stops at index 3. The only thing you have to do externally is to prepare/remember the next entry point.

ags wrote: »

Also in the referenced post, you mention a way ("... more trouble than it's usually worth") to loop within one hub window. It may be worth the trouble to me. Can you direct me towards that method (if appropriate for this application?)

The version I usually use (post 1108886) goes backwards and always transfers 2n longs. While easy enough to use for plain block transfers it's a bit more complicated/unacceptable for unaligned loads (2n+1 longs) let alone circular buffers.

ags wrote: »

Since my pride is already defeated (for today), I'll ask what the other mistake is. I don't see it. (I expect it would have taken me a while to find the fatal flaw in the loop though, despite it being an obvious mistake. A good case for peer code reviews.)

max loadCnt, bufHdrmCnt picks the minimum of the two parameters. Assuming the latter is 10 and you want to transfer 14 values you end up with 10 in the first loop. Later you subtract recLen from bufHdrmCnt (-4) then add this to recLen (14 + (-4) = 10) for the second run. You probably want abs recLen, bufHdrmCnt instead.
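The arithmetic in that example can be checked with a few lines of Python (a model of the numbers only, not PASM; the variable names are hypothetical):

```python
rec_len, headroom = 14, 10

first_pass = min(rec_len, headroom)     # PASM `max loadCnt, bufHdrmCnt` clamps to 10
headroom_after = headroom - rec_len     # sub bufHdrmCnt, recLen  -> -4

with_add = rec_len + headroom_after     # add recLen, bufHdrmCnt  -> 10 (wrong)
with_abs = abs(headroom_after)          # abs recLen, bufHdrmCnt  -> 4  (intended)

print(first_pass, with_add, with_abs)   # 10 10 4
```

Only the abs variant makes the two passes add up to the record length (10 + 4 = 14).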

kuroneko · 2013-08-20 17:06

ags wrote: »

... whereas with other "normal" instructions, it changes direct addressing into immediate addressing. (Comments on that?)

Both end up with an immediate operand. What a particular insn does with it is up to the insn, i.e. a rdlong interprets it as an address while shl uses it as shift count. I don't think we should overcomplicate things here.

ags · 2013-08-20 18:57

kuroneko wrote: »

Both end up with an immediate operand. What a particular insn does with it is up to the insn, i.e. a rdlong interprets it as an address while shl uses it as shift count. I don't think we should overcomplicate things here.

Just deleted many lines. Just to clarify:

Indirect: the operand is the address of a register that contains the address of the register that will be accessed, eg source operand (although crossing cog/hub RAM space) in:

rdlong adrCogRegToLoad, adrCogRegContHubAdrToRead

Direct: the operand is the address of the register to be accessed, eg both operands in:

mov adrCogRegToStoreVal, adrCogAdrToReadVal

Immediate: the operand is the value itself, no register is accessed, eg source operand in:

mov adrCogRegToStoreVal, #actualValToStore

But it's not helpful because there is no control in PASM of addressing mode (other than literals using the definitions above). I hope I'm not making any misstatements.

ags · 2013-08-20 19:00

kuroneko wrote: »

I was thinking along the line that you keep the index to the first free slot (e.g. 4). Transferring values would be done by jumping to code entry 4*3 (i.e. rdlong buf4, addr). This will then execute 10 times filling slots 4 to 13. According to the flags you then decide to call the function again from the top (i.e. rdlong buf0, addr). As it limits itself re: parameter count it stops at index 3. The only thing you have to do externally is to prepare/remember the next entry point.

I think I see how that could work. Problem is the (circular) buffer I'm loading is 128 longs. This is a hard problem.

ags · 2013-08-20 19:06

kuroneko wrote: »

max loadCnt, bufHdrmCnt picks the minimum of the two parameters. Assuming the latter is 10 and you want to transfer 14 values you end up with 10 in the first loop. Later you subtract recLen from bufHdrmCnt (-4) then add this to recLen (14 + (-4) = 10) for the second run. You probably want abs recLen, bufHdrmCnt instead.

You are absolutely correct. I would have used neg, because the value is only used when it was negative before this instruction, but abs indicates the intent more clearly. It's remarkable how you are able to find these problems so quickly.

I think I now have a very nicely reviewed and debugged non-functioning example of code that won't work.

kuroneko · 2013-08-20 19:29

ags wrote: »

I think I see how that could work. Problem is the (circular) buffer I'm loading is 128 longs. This is a hard problem.

Yes, I can see how this will be a problem. How much code space do you have available for the loader? Initially I thought the buffer is about the same size as the largest record but here unrolled is out of the question (but the principle does work, see below).

''
'' PASM dynamic parameter retrieval
''
''        Author: Marko Lukat
'' Last modified: 2013/08/21
''       Version: 0.3
''
CON
  BLEN = 8                                       ' entries in cog buffer
  
VAR
  long  link
  
PUB null : n | a[8]

  repeat n from 0 to 7
    a[n] := |< (16+n)
    
  link := @a{0}
  cognew(@entry, @link)
  repeat while link

  repeat
    link := 8
    repeat while link
  
DAT             org     0

entry           rdword  addr, par
done            wrlong  zero, par               ' transfer/setup done
                test    $, #1 wc                ' set carry                     (##)
                
idle            rdlong  eins, par wz
                mov     zwei, indx              ' preserve jump target
        if_z    jmp     #$-2

                add     indx, eins              ' |
                cmpsub  indx, #BLEN             ' update first free slot index

                sub     eins, #1                ' |
                rcl     eins, #1                ' |                             (##)
                shl     eins, #23               ' |
                or      eins, addr              ' build control word
                
                call    #args                   ' primary load
        if_c    call    #jump                   ' wrap


                mov     dira, mask              ' display data

                rdlong  vscl, #0
                mov     cnt, cnt
                add     cnt, #9

                waitcnt cnt, vscl
                mov     outa, arg0
                
                waitcnt cnt, vscl
                mov     outa, arg1

                waitcnt cnt, vscl
                mov     outa, arg2

                waitcnt cnt, vscl
                mov     outa, arg3

                waitcnt cnt, vscl
                mov     outa, arg4

                waitcnt cnt, vscl
                mov     outa, arg5

                waitcnt cnt, vscl
                mov     outa, arg6

                waitcnt cnt, vscl
                mov     outa, arg7

                waitcnt cnt, vscl
                mov     outa, mask

                waitcnt cnt, vscl

                jmp     #done


args            add     jump, zwei              ' prime call
args_ret
jump_ret        long    0
jump            jmpret  $, $+1                  ' call sub function
                
                long    s0, s1, s2, s3, s4, s5, s6, s7
                
s0              rdlong  arg0, eins              ' read 1st argument
                cmpsub  eins, delta wc          ' [increment address and] check exit
        if_nc   jmp     args_ret                ' cond: early return
        
s1              rdlong  arg1, eins              ' read 2nd argument
                cmpsub  eins, delta wc
        if_nc   jmp     args_ret

s2              rdlong  arg2, eins              ' read 3rd argument
                cmpsub  eins, delta wc
        if_nc   jmp     args_ret

s3              rdlong  arg3, eins              ' read 4th argument
                cmpsub  eins, delta wc
        if_nc   jmp     args_ret

s4              rdlong  arg4, eins              ' read 5th argument
                cmpsub  eins, delta wc
        if_nc   jmp     args_ret

s5              rdlong  arg5, eins              ' read 6th argument
                cmpsub  eins, delta wc
        if_nc   jmp     args_ret

s6              rdlong  arg6, eins              ' read 7th argument
                cmpsub  eins, delta wc
        if_nc   jmp     args_ret

s7              rdlong  arg7, eins              ' read 8th argument
                cmpsub  eins, delta wc
                jmp     args_ret

' initialised data and/or presets

delta           long    %10 << 23 | $FFFC       '  %10  deal with movi setup
                                                ' -(-4) address increment
indx            long    1                       ' force wrap
mask            long    $00FF0000

' uninitialised data and/or temporaries

addr            res     1
eins            res     1
zwei            res     1

arg0            res     1
arg1            res     1
arg2            res     1
arg3            res     1
arg4            res     1
arg5            res     1
arg6            res     1
arg7            res     1

                fit

CON
  zero    = $1F0                                ' par (dst only)
   
DAT

ags · 2013-08-20 21:34

kuroneko wrote: »

Yes, I can see how this will be a problem. How much code space do you have available for the loader?

Without timing for the loader I can't finish the rest, so I can only estimate. Probably about 128 longs or so.
I suppose I could try the trick using a counter to increment the rdlong source operand - but that could be messy.

kuroneko · 2013-08-20 21:42

ags wrote: »

Probably about 128 longs or so.

Buffer length of 128 and record length 1..14 are still the same? Does the hub location change or do the records come always from the same location?

ags · 2013-08-20 22:48

kuroneko wrote: »

Buffer length of 128 and record length 1..14 are still the same? Does the hub location change or do the records come always from the same location?

Always loads from the same hub location. First long holds record length, between 1-14. The overall design I had in mind had a budget of about 320 clocks to load one record (14 longs worst case). So if I cannot find a single-hub-window-per-long method that just can't work. Also using the counters is clearly not possible either.

kuroneko · 2013-08-20 23:34

How does the processing part know how big the records are (you don't store their length) or is it not important or maybe encoded in the data itself?

kuroneko · 2013-08-21 01:36

This is - so far - untested but should do the trick provided that the same hub location prerequisite stays true (data follows record length).

DAT             org     0
'------------------------------------------------------------------------------------------------------------
driverInit      add     mailbox, par                    ' |
                add     $-1, dst1                       ' |
                djnz    recLen, #$-2                    ' upset cog array with hub references

                {OTHER PROCESSING HERE}

loadBuf                                                 ' load one variable-length record (1-14 longs)
                                                        ' mailbox (hub RAM address) contains record length (#longs)
                                                        ' recLen holds the record length

                rdlong  recLen, mailbox wz              ' z set if no record available to load
                cmpsub  bufFreeCnt, recLen wc           ' c clear if record will not fit in buffer (c set if equal)
    if_z_or_nc  jmp     loadBuf_ret                     ' !recLen or bufFreeCnt < recLen

                mov     loadCnt, recLen                 ' seed number of longs to load in loop with record length
                max     loadCnt, bufHdrmCnt             ' limit longs to load by buffer headroom - may need two passes

                movs    indLoad, #hubAdr                ' initialize indirect address to read hub RAM at record start

                sub     bufHdrmCnt, recLen wz,wc        ' does the record hit or span end of buffer? (may be =<0)
        if_be   add     bufHdrmCnt, #COG_BUF_LEN        ' if buffer wrap needed reset headroom (will be >0)
                sub     recLen, loadCnt                 ' recLen now contains load count for second pass (if needed)

indLoad         rdlong  cogBuf, 0-0                     ' hub >> cog
                add     indLoad, d1s1                   ' increment cog and hub addresses
                djnz    loadCnt, #indLoad               ' loop for specified long count

        if_be   movd    indLoad, #cogBuf                ' reset location for next write to buffer to start of buffer
        if_ae   jmp     #:finish                        ' no need to take another pass if no unread longs remain

                mov     loadCnt, recLen wz,wc           ' set remaining longs to load, clear z,c (0 < recLen < $8000_0000)
                jmp     #indLoad

:finish         wrlong  zero, mailbox                   ' clear mailbox, signalling ready for next record
loadBuf_ret     ret

' initialised data and/or presets

mailbox         long    4
hubAdr          long    8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60

bufFreeCnt      long    COG_BUF_LEN                     ' number of contiguous free longs in cog RAM (may wrap around from end to start)
bufHdrmCnt      long    COG_BUF_LEN                     ' "headroom" - number of longs from current buffer position to top of buffer

recLen          long    14 +1                           ' max record length + mailbox

dst1            long    |< 9
d1s1            long    |< 9 | 1

' uninitialised data and/or temporaries

loadCnt         res     1
cogBuf          res     COG_BUF_LEN

tail            fit
'------------------------------------------------------------------------------------------------------------
CON
  COG_BUF_LEN = 128
  zero        = $1F0
  
DAT
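For readers following the idea rather than the exact PASM, here is a rough Python model of the address-table technique used in the listing above. It is a sketch under assumptions: the cog is a plain list, the hub is a dict, and the register positions, the par value and the data values are all hypothetical:

```python
def load_record(hub, cog, table_base, buf_base, count):
    """Model the 3-instruction load loop.

    hub        -- dict mapping hub byte address -> long value
    cog        -- list modelling cog registers
    table_base -- cog register index of the first hubAdr entry
    buf_base   -- cog register index of the first buffer slot to write
    count      -- number of longs to copy
    Returns the cog index following the last slot written.
    """
    dst, src = buf_base, table_base
    for _ in range(count):
        cog[dst] = hub[cog[src]]   # rdlong: src register holds the hub address
        dst += 1                   # add indLoad, d1s1: bit 9 bumps the dest field
        src += 1                   #                    bit 0 bumps the src field
    return dst


# Hypothetical layout: par = 0x1000, mailbox at par+4, record data from par+8
par = 0x1000
cog = [0] * 64
TABLE, BUF = 10, 30
for i in range(14):
    cog[TABLE + i] = par + 8 + 4 * i          # what driverInit builds
hub = {par + 8 + 4 * i: 0xA0 + i for i in range(14)}

load_record(hub, cog, TABLE, BUF, 3)
print([hex(v) for v in cog[BUF:BUF + 3]])     # ['0xa0', '0xa1', '0xa2']
```

Stepping the source field by one register now works because each table entry already holds the next hub address, at the cost of one table long per record long.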

ags · 2013-08-21 05:57

kuroneko wrote: »

How does the processing part know how big the records are (you don't store their length) or is it not important or maybe encoded in the data itself?

Record length is encoded in the data itself. I don't use it during load because decoding it from the data would require extra processing (the classic 32-bit population count problem) and I didn't think I would have time for this.

ags · 2013-08-21 06:10

kuroneko wrote: »

This is - so far - untested but should do the trick provided that the same hub location prerequisite stays true (data follows record length).

Yes, of course. Looks quite straightforward... now that you've figured it out

Runtime penalty of just 4 clocks* each call, and 15 longs for hubAdr and d1s1. I will proceed and see how this fits into the overall driver.

The other option I came up with would be to have the load loop limit itself to no more than some number (likely 5-7) long reads each call. If the record length is greater than that, it keeps track and picks up on the next call. While it would change the overall driver design, I would end up filling many more instances of much smaller available time slices, which may be enough to keep up the overall data rate. If I can accommodate timing of your idea that seems cleaner (and my earlier work isn't totally wasted.)

Thank you.

Edit: *originally thought to repay with savings of 4 clocks after the load loop, but for worst-case timing (requiring a wrap of the circular buffer) it remains 32 clocks, same as the original code. Propville credit, not legal tender, awarded for 8 clocks saved before the load loop since hub window makes it moot.

kuroneko · 2013-08-22 02:20

ags wrote: »

Edit: *originally thought to repay with savings of 4 clocks after the load loop, but for worst-case timing (requiring a wrap of the circular buffer) it remains 32 clocks, same as the original code. Propville credit, not legal tender, awarded for 8 clocks saved before the load loop since hub window makes it moot.

Not quite sure what you're asking here (if anything). But yes, the wrapping has a cost of 2 hub windows (1.5 in fact). However, the timing can easily be improved to 4+n[+2] instead of 5+n[+2]:

DAT             org     0
'------------------------------------------------------------------------------------------------------------
driverInit      add     mailbox, par                    ' |
                add     $-1, dst1                       ' |
                djnz    recLen, #$-2                    ' upset cog array with hub references

                {OTHER PROCESSING HERE}

loadBuf                                                 ' load one variable-length record (1-14 longs)
                                                        ' mailbox (hub RAM address) contains record length (#longs)
                                                        ' recLen holds the record length

                rdlong  recLen, mailbox wz              '  +0 = z set if no record available to load
                cmpsub  bufFreeCnt, recLen wc           '  +8   c clear if record will not fit in buffer (c set if equal)
    if_z_or_nc  jmp     loadBuf_ret                     '  -4   !recLen or bufFreeCnt < recLen

                mov     loadCnt, recLen                 '  +0 = seed number of longs to load in loop with record length
                max     loadCnt, bufHdrmCnt             '  +4   limit longs to load by buffer headroom - may need two passes

                sub     bufHdrmCnt, recLen wz,wc        '  +8   does the record hit or span end of buffer? (may be =<0)
                sub     recLen, loadCnt                 '  -4   recLen now contains load count for second pass (if needed)

indLoad         rdlong  cogBuf, hubAdr                  '  +0 = hub >> cog
                add     indLoad, d1s1                   '  +8   increment cog and hub addresses
                djnz    loadCnt, #indLoad               '  -4   loop for specified long count

        if_be   add     bufHdrmCnt, #COG_BUF_LEN        '  +4   if buffer wrap needed reset headroom (will be >0)

        if_be   movd    indLoad, #cogBuf                '  +8   reset location for next write to buffer to start of buffer
        if_ae   jmp     #:finish                        '  -4   no need to take another pass if no unread longs remain

                mov     loadCnt, recLen wz,wc           '  +0 = set remaining longs to load, clear z,c (0 < recLen < $8000_0000)
                jmp     #indLoad                        '  +4   8 clocks penalty

:finish         wrlong  zero, mailbox                   '  +0 = clear mailbox, signalling ready for next record
                movs    indLoad, #hubAdr                '  +8   reset indirect address to read hub RAM at record start
loadBuf_ret     ret                                     '  

' initialised data and/or presets

mailbox         long    4
hubAdr          long    8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60

bufFreeCnt      long    COG_BUF_LEN                     ' number of contiguous free longs in cog RAM (may wrap around from end to start)
bufHdrmCnt      long    COG_BUF_LEN                     ' "headroom" - number of longs from current buffer position to top of buffer

recLen          long    14 +1                           ' max record length + mailbox

dst1            long    |< 9
d1s1            long    |< 9 | 1

' uninitialised data and/or temporaries

loadCnt         res     1
cogBuf          res     COG_BUF_LEN

tail            fit
'------------------------------------------------------------------------------------------------------------
CON
  COG_BUF_LEN = 128
  zero        = $1F0
  
DAT
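As a rough cross-check of the 4+n[+2] figure, the Python sketch below converts it to clocks. It assumes, and this is an interpretation rather than something stated explicitly in the thread, that the figure counts hub windows of 16 clocks and that the bracketed +2 applies only when the record wraps the buffer. The function name is hypothetical:

```python
HUB_WINDOW_CLOCKS = 16   # a cog gets a hub access slot every 16 system clocks


def load_clocks(n, wrap, overhead_windows=4):
    """Estimated clocks to load an n-long record.

    Assumes the 4+n[+2] figure counts hub windows, with the bracketed
    +2 applying only when the record wraps the circular buffer.
    """
    windows = overhead_windows + n + (2 if wrap else 0)
    return windows * HUB_WINDOW_CLOCKS


print(load_clocks(14, wrap=True))    # 320
print(load_clocks(14, wrap=False))   # 288
```

Under those assumptions the worst case of a wrapping 14-long record comes to 320 clocks, the same number as the budget mentioned earlier in the thread.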

ags · 2013-08-26 10:01

kuroneko wrote: »

Not quite sure what you're asking here (if anything). But yes, the wrapping has a cost of 2 hub windows (1.5 in fact). However, the timing can easily be improved to 4+n[+2] instead of 5+n[+2]

@kuroneko: You are correct, my last post was neither a question nor a clear statement - sloppy on my part.

The code you posted has been very helpful. The idea of using the array to index to the next hub address with a simple add is creative. I should have thought of that once you pointed out my error (which I call not considering that rd*/wr* use "indirect addressing" for hub RAM) - but I did not, credit to you.

You also demonstrated a use of cmpsub - which I have seen every time I open the documentation but have never found a good use for - until now.

Also, setting the carry flag with:

test  $, #1  wc

is a great trick which I picked up from one of the links you provided. Very nice.

Thank you for taking the time to reply. With this as a starting point, I have been able to create an overall architecture that can accommodate the timing of this basic design. I also have modified for use in the "push" driver cog which presents data to the reader cog. I truly appreciate the help.