Fastest possible memory transfer

macca · 2012-07-01 06:13

Hello,

I need to transfer a block of bytes from HUB to COG memory, with this code:

                        MOVD    :l1, #buffer
                        MOV     count, 64
:l1                     RDLONG  buffer, srcptr
                        ADD     :l1, increment
                        ADD     srcptr, #4
                        DJNZ    count, #:l1

srcptr                  LONG    $0
count                   LONG    $0
increment               LONG    000000000000_000000100_000000000 

buffer                  RES     64

According to the documentation, if I'm not wrong, it loses a hub slot on every iteration.

Is there a better, faster way to transfer a block of memory without losing slots?

Regards.

kuroneko · 2012-07-01 06:25

macca wrote: »

Is there a better, faster way to transfer a block of memory without losing slots?

Not sure about better, but this example (post 978929) will get you maximum transfer speed in both directions.

Heater. · 2012-07-01 06:27

Not sure how the hub slots work out offhand, but you can always unroll the loop a bit.
For example, do that rdlong, add, add sequence 4 times in each iteration and reduce the start count to 16.
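
A minimal sketch of that unrolling, assuming the buffer is still 64 longs and that srcptr has already been loaded with the hub address (for instance from PAR). The labels :l0 to :l3 are made up for illustration. As far as I understand the hub timing, a RDLONG followed by two 4-clock instructions fits the 16-clock hub window, so only the group that also contains the DJNZ should miss a slot:

```spin
                        MOVD    :l0, #buffer            ' each RDLONG gets its own destination field
                        MOVD    :l1, #buffer+1
                        MOVD    :l2, #buffer+2
                        MOVD    :l3, #buffer+3
                        MOV     count, #16              ' 16 passes x 4 longs = 64 longs
:l0                     RDLONG  0-0, srcptr             ' 0-0 is patched by the MOVD above
                        ADD     srcptr, #4              ' next hub long (byte address)
                        ADD     :l0, increment          ' this slot moves on by 4 cog longs
:l1                     RDLONG  0-0, srcptr
                        ADD     srcptr, #4
                        ADD     :l1, increment
:l2                     RDLONG  0-0, srcptr
                        ADD     srcptr, #4
                        ADD     :l2, increment
:l3                     RDLONG  0-0, srcptr
                        ADD     srcptr, #4
                        ADD     :l3, increment
                        DJNZ    count, #:l0             ' third instruction after the RDLONG: next read misses a slot

increment               LONG    4 << 9                  ' destination field += 4
srcptr                  LONG    0                       ' hub address, assumed to be set before the loop runs
count                   RES     1
buffer                  RES     64
```

If that timing assumption holds, four longs should take roughly 80 clocks instead of 128, but I have not measured it.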

Mark_T · 2012-07-01 13:57

If you can arrange the hub memory buffer to be in the first 512 bytes of hub RAM then you can combine the increments:

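                        MOVD    :l1, #buffer            ' destination field = first cog long
                        MOVS    :l1, #srcaddr           ' source field = hub address; srcaddr is a constant, whole buffer below 512
                        MOV     count, #64              ' immediate count needs the '#'
 :l1
                        RDLONG  buffer, #srcptr         ' both fields are overwritten by the MOVD/MOVS above
                        ADD     :l1, increment          ' dest += 1 cog long, source += 4 hub bytes
                        DJNZ    count, #:l1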

 increment              LONG    %000000000000_000000001_000000100

 count                  RES    1

Note that your code had the wrong increment value: you add 1 to a cog address to get to the next long, not 4. Also, binary constants start with %.
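
To spell out where that increment constant comes from (a sketch; the field positions are from the Propeller instruction encoding, where the destination field is bits 17..9 and the source field is bits 8..0): adding 1 << 9 bumps the cog destination by one long, and adding 4 bumps the hub byte address by one long. So the same value could also be written as:

```spin
increment               LONG    1 << 9 | 4      ' dest field += 1 (next cog long), source field += 4 (next hub long)
```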

macca · 2012-07-01 23:26

Mark_T wrote: »

If you can arrange the hub memory buffer to be in the first 512 bytes of hub RAM then you can combine the increments:

How do I place the buffer in the first 512 bytes?
The whole program also uses Spin code and a (yet unknown) number of cogs.

Heater. · 2012-07-01 23:56

You can't. Or at least not with any "normal" Spin programming.
Spin will place code and declared variables wherever it likes and gives you no control over that.
It might be that if you declare an array at the start of the first object, and that object itself contains little code, the array ends up within 512 bytes of the start of RAM. But that would only be an accident of the compiler implementation in use.

pjv · 2012-07-02 07:46

Hi Macca;

I believe you can reserve any amount of Low Hub space after address $10 by having your first object contain only a DAT section of the size you choose, plus of course the declaration of one next object.

Cheers,

Peter (pjv)

Heater. · 2012-07-02 08:09

You will need at least 1 PUB in your first object which will then be calling your "real" main object with the address of the reserved space.

Still, this is not nice as it depends on undocumented compiler behavior, i.e. that the main object will always be first in memory and that no other junk is put in low memory.

But I guess if it works it works...
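
Something like this is what I have in mind, as a rough sketch only. The object name "real_main" and its start(bufaddr) method are hypothetical, and the whole thing relies on the undocumented layout mentioned above, so check the actual value of @lowbuf before depending on it:

```spin
' Top object (sketch): nothing but the reserved low-hub buffer and one PUB.
' "real_main" and its start(bufaddr) method are hypothetical names.

OBJ
  main : "real_main"

PUB start
  main.start(@lowbuf)           ' hand the buffer address to the real program

DAT
lowbuf        long      0[64]   ' 256 bytes, hoped to land near the start of hub RAM
```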

macca · 2012-07-02 08:35

Well, the hub memory location is not reliable, and the solution using a counter seems too error prone (I don't like to have something running on its own in that case). I think I'll unroll the loop a bit to lose fewer slots.
Thanks for your help.

pjv · 2012-07-02 09:53

Hi Macca;

Sorry, I meant to have used the word "PUB" instead of "next object" as Heater pointed out. Although it is not documented, this technique works consistently according to verbal confirmation by Chip Gracey.

Then, to get the fastest transfer rate you asked for, use the technique outlined by Mark_T.

I believe there are no other approaches that will be as fast.

Cheers,

Peter (pjv)

Phil Pilgrim (PhiPi) · 2012-07-02 11:03

I think the following (untested) will do what you want for transferring data into the cog from the hub:

CON

  _clkmode      = xtal1 + pll16x
  _xinfreq      = 5_000_000

  BUF_SIZE      = 128           'Number of longs in buffer.

PUB  start

  cognew(@pasm,@bufend - 4)     'Start cog, pointing to last long in hub buffer. 

DAT

              org       0
pasm          mov       hubbufend,par

              '...
              call      #loadbuf
              '...

loadbuf       movd      long0,#bufend-1         'Point to last long in cog buffer.
              movd      long1,#bufend-2         'Point to next-to-last long in cog buffer.
              mov       loop_back,djnz_instr    'Replace jmp loadbuf_ret instruction with a djnz.
              mov       ptr,hubbufend           'Point to last long in hub buffer.
              jmp       #long0                  'Start the transfer.
              
loop          sub       long0,dec_dest          'Decrement long0 dest addr by 2.
long1         rdlong    0-0,ptr                 'Read long at pointer address. (Pointer is address + 3.)
              sub       ptr,#7                  'Pointer + 3 - 7 == pointer - 4.
              sub       long1,dec_dest          'Decrement long1 dest addr by 2.
long0         rdlong    0-0,ptr                 'Read long at pointer address. (Pointer is actual address.)
loop_back     jmp       loadbuf_ret             'Is djnz ptr,#loop until overwritten by transfer.

buffer        long      0[BUF_SIZE]             'Serves as both hub and cog buffers.
bufend 

djnz_instr    djnz      ptr,#loop               'Copied to loop_back to perform loop until overwritten by transfer.
dec_dest      long      2 << 9                  'Amount by which to decrement destination address in transfer instructions.
hubbufend     long      0-0                     'Hub address of the last long in buffer.
loadbuf_ret   ret                               'Return address gets placed here.

ptr           res       0                       'Pointer into hub.

The transfer loop loads the data in reverse order and hits the hub sweet spot every time without using a counter. The hub and cog buffers occupy the same hub memory and must be placed immediately after loop_back. To start, the jmp loadbuf_ret instruction is replaced by a djnz, which creates the loop. The loop terminates when the djnz gets overwritten again by the jmp loadbuf_ret from the hub. (You read that right, no "#".) Actually, due to pipelining, the loop will terminate one or two transfers later than that, depending upon whether BUF_SIZE is even or odd. But the additional rdlongs are harmless, since they overwrite the code with the same instructions that are already there.

Unfortunately, the same technique cannot be employed for transferring data out of the cog to the hub.

-Phil

pjv · 2012-07-02 14:39

Hello Phil;

I have not studied your example in great detail, but I suspect you have hit another home run!

I see you also use that reverse DJNZ trick as a pointer.... it gives me great pleasure every time I can use that; kind of a dual function with a single instruction. It really helps keep the ripple-sorter I use down to a very compact routine.

Nice going!

Cheers,

Peter (pjv)

macca · 2012-07-03 01:49

That's really interesting, thanks Phil!

kuroneko · 2012-07-03 07:04

macca wrote: »

... and the solution using a counter seems too error prone (I don't like to have something running on its own in that case)

Can you elaborate? What error conditions do you expect to encounter with a LOGIC.always setup? Just curious ...

macca · 2012-07-03 07:26

kuroneko wrote: »

Can you elaborate? What error conditions do you expect to encounter with a LOGIC.always setup? Just curious ...

What I'm worried about is that there is a register running on its own, and the code execution is expected to stay synchronized with it. One day someone may forget that and/or make a change that alters this synchronization. Then the program stops working, and the problem will be very hard to find.

lonesock · 2012-07-03 10:36

kuroneko's solution works great, and it is also nice to have an alternative for those times when your counters are already in use.

Jonathan

kuroneko · 2012-07-04 20:27

Phil Pilgrim (PhiPi) wrote: »

I think the following (untested) will do what you want for transferring data into the cog from the hub:

Thanks for posting this again. I always found the override approach slightly irritating but since I had a counter solution I didn't see the need for doing anything about it. Until today^A. This example doesn't use counters, doesn't override instructions and can serve multiple hub buffers in both directions but with a 2n long limitation.

CON
  BUF_SIZE = 128                                ' number of longs in buffer (2n)

VAR
  long  hub[BUF_SIZE]
  
PUB start

  cognew(@pasm, @hub{0})                                                         

DAT             org     0

pasm            '...
                mov     ptr, par                ' hub location
                call    #loadbuf                ' transfer fixed size buffer
                '...
                waitpeq $, #0

loadbuf         movd    long0, #ptr -1          ' last long in cog buffer
                movd    long1, #ptr -2          ' second-to-last long in cog buffer
                add     ptr, #BUF_SIZE * 4 -1   ' last byte in hub buffer (8n + 7)
                movi    ptr, #BUF_SIZE - 2      ' add magic marker
                
long0           rdlong  0-0, ptr                ' |
                sub     long0, dst2             ' |
                sub     ptr, i2s7 wc            ' |
long1           rdlong  0-0, ptr                ' |
                sub     long1, dst2             ' |
        if_nc   djnz    ptr, #long0             ' sub #7/djnz (Thanks Phil!)

loadbuf_ret     ret

' initialised data and/or presets

dst2            long    2 << 9                  ' dst +/-= 2
i2s7            long    2 << 23 | 7

' uninitialised data and/or temporaries

buffer          res     BUF_SIZE
ptr             res     1                       ' buffer + BUF_SIZE

tail            fit
                
DAT

^A ... not really needed but why waste an idea ...

Phil Pilgrim (PhiPi) · 2012-07-04 21:02

Nicely done, Kuroneko, and a diabolically clever use of those upper ptr bits!

-Phil
