[Propeller Assembly] Using PAR

Vega256 · 2011-07-04 18:54

Hey,

If I pass in an address of a piece of data using PAR but I wanted to know the address of the next sequential piece of data in memory as well, would I just add 1 to PAR or is this completely incorrect?

potatohead · 2011-07-04 19:20

That is correct.

Adding one would point to the next BYTE in the HUB. Adding two would be a WORD, and four is a LONG.

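mov index, PAR              ' copy PAR into a cog register; PAR itself is read-only
add index, #4               ' advance one LONG (4 bytes)
rdlong cog_var1, index      ' read the long at PAR + 4
add index, #1               ' advance one BYTE
rdbyte cog_var2, index      ' read the byte at PAR + 5
etc...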

I see Cluso beat me to it. Transfer PAR to a working cog storage location first, then operate on it. Didn't say it explicitly, and should have.

Cluso99 · 2011-07-04 19:26

You may need to add 4. WARNING: You cannot actually add to PAR because it cannot be written. You have to copy it into a register in your cog, add to that copy, and use the copy in your rdxxxx instruction.

This is because the hub is addressed in bytes. PAR can only carry a long-aligned address: the 14-bit value passed to the cog is shifted 2 bits left, so the bottom 2 bits are always zero. However, if you are picking up bytes you add 1, and for words you add 2.

IIRC there is a good example in passing parameters in the VGA sample code in the obex.
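
To make the copy-then-add pattern concrete, here is a minimal sketch of both sides, assuming a Spin object that passes a two-long parameter block. The names params, entry, ptr, first and second are invented for illustration and are not taken from the VGA sample.

```spin
' Hypothetical demo object: params, entry, ptr, first and second are made-up names.
VAR
  long  params[2]                         ' two consecutive, long-aligned hub longs

PUB start
  params[0] := 1234
  params[1] := 5678
  cognew(@entry, @params)                 ' @params arrives in the new cog as PAR

DAT
              org     0
entry         mov     ptr, par            ' PAR is read-only, so work on a copy
              rdlong  first, ptr          ' first  := params[0]
              add     ptr, #4             ' hub addresses count bytes: next long is +4
              rdlong  second, ptr         ' second := params[1]
:halt         jmp     #:halt              ' nothing else to do in this demo

ptr           res     1
first         res     1
second        res     1
```

For words or bytes the same pattern should apply with rdword or rdbyte and a step of 2 or 1.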

Vega256 · 2011-07-05 07:09

So then to write to an array in SPIN via PASM, I would need the address of the first element, then I would need to increment the address by x offset and write the next piece of data, and repeat until the end of the array. Correct?

potatohead · 2011-07-05 08:01

Yes. Props have no index registers. And the hub access windows are generally 2, 6 and 10 instructions, if you are not branching and/or using the special wait instructions.

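:loop           wrlong          value, index      'write one long to the hub

                add             index, offset     'step to the next element (offset = 4 for longs)
                djnz            count, #:loop     'count down the remaining elements (separate counter, not offset)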

Simple and fast case. Takes advantage of one of the specialized auto decrement instructions. Maybe initialize a region of RAM with this.
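
A minimal sketch of that fill from both sides, assuming a 16-long array declared in Spin whose address is passed through PAR. BUF_LONGS, buffer, entry and the $DEADBEEF fill value are placeholders chosen for this example.

```spin
' Hypothetical demo object: every name here is made up for the sketch.
CON
  BUF_LONGS = 16                          ' array length in longs (assumed)

VAR
  long  buffer[BUF_LONGS]

PUB start
  cognew(@entry, @buffer)                 ' pass the address of buffer[0] through PAR

DAT
              org     0
entry         mov     index, par          ' working copy of the array address
              mov     count, #BUF_LONGS   ' number of longs to write
:loop         wrlong  value, index        ' buffer[n] := value
              add     index, #4           ' step one long (4 bytes)
              djnz    count, #:loop       ' repeat until count reaches zero
:halt         jmp     #:halt              ' park the cog when done

value         long    $DEADBEEF           ' arbitrary fill pattern
index         res     1
count         res     1
```

The element counter and the address step are kept in separate registers here, so the stride stays at 4 for the whole run.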

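:loop           wrlong          value, index      'hub write

                add             index, offset     'step to the next element
                cmp             index, boundary   wc, wz   'C is set while index < boundary

                nop                               'filler: these slots are free,
                nop                               'the next hub window is not due yet
                
                nop
      if_b      jmp             #:loop            'loop until index reaches boundary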

The second loop misses the 2-instruction hub access window because of the compare, so its WRLONG has to wait for the 6-instruction window. It runs just as fast with the three nop instructions in there, because the cog would be stalled waiting for the hub anyway. When writing routines that work with the HUB RAM, I find it best to write it sequentially, in the order you find helps get it done, then break the instructions into groups to show the window and make counting easy. In the more common case there is time to do stuff between accesses, but only so much before you spill into the next, bigger window...

Then re-sequence them to fit the best case access window, 2, 6, or 10 instructions. Very significant speed gains are to be had that way. Here's one example from the Potatotext 2 driver I just completed.

:loop                RDLONG     colpix, screen            'Fetch two chars
                       mov      A, colpix                 'Need working copy
                       and      colpix, vff00ff00         'mask for colors only, both characters

                       mov      B, A                      'working copy for second character
                       and      A, #$FF                   'Prepare to calculate first character pixel data address
                       
                       shl      A, #3                     'Multiply by 8
                       add      A, fontsum                'point to pixel data in font table

                       RDBYTE   pixels, A                 'fetch pixels from hub for first character
                       
                       and      B, v00ff00ff              'pixels only
                       shr      B, #13                    'Prepare to fetch pixels from HUB, shift contains multiply

                       nop
                       nop 

                       shl      pixels, #16               'Put pixels in position for combining with colors
                       add      B, fontsum                'point to pixel data second character



                       RDBYTE   pixels1, B                'Fetch pixel data from font table

                       rev      pixels, #8                'Reverse pixels for proper display
                       rev      pixels1, #8               'Do it for both characters

                       or       colpix, pixels            'Combine colors and pixels for TV COG
                       or       colpix, pixels1

                       'add      index, #4
                       add      screen, #4                'Point to next pair of characters



                       WRLONG   colpix, index             'Put complete pixel and color data in buffer for TV COG

                       add      index, #4                 'Point to next buffer address
                       'add      screen, #4
                                             
                       djnz     count, #:loop             'Do all of the characters in the scan line, until buffer full

I'll have to go pull the nop instructions out of that one. Forgot and left them in there.

I'll typically drop them in as placeholders to visualize the access cycle, then stuff instructions in there, out of order where possible, to get the best access times. On that loop, it was very important to get that last cycle to be 2 instructions to hit the time requirement. It's currently 6+6+6+2 = 20 instructions. One more instruction after the last HUB OP would make it 6+6+6+6, a jump to 24 instructions.

Another instruction in the 6 instruction area would bump it to 10. 6+10+6+6 = 28, with a worst case of 10+10+10+6 = 36 instructions. Well worth considering, if you need the access speed. I was at the 28 instructions the first time, putting the operations in simple, logical, sequential order. Choosing when to do things made a considerable difference. Roughly 25 percent faster.

Work out the windows first, then write your access loop, add up your total hub accesses, and finally drop in your between-access operations, out of order where needed, for best speed.

This is also a great way to calculate quickly whether or not something is going to be possible. Add up the HUB accesses, then approximate instructions, and you've a rough metric on how much time it will take to move the data. A few exercises like this, and it's clear to see it pays off to combine data operations with the HUB where possible, keeping the number of reads and writes to the minimum.
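
As a rough worked example of that bookkeeping, here is a sketch of a hub-to-hub long copy, assuming a three-long parameter block (source address, destination address, count). All names are invented for the example. With two instructions after each hub op, the RDLONG and WRLONG should land on consecutive windows, so one long should cost about two windows (roughly 32 clocks) once the loop is in sync.

```spin
' Hypothetical demo object: source, dest, args and the cog register names are made up.
VAR
  long  source[8]
  long  dest[8]
  long  args[3]

PUB start
  args[0] := @source                      ' where to read from
  args[1] := @dest                        ' where to write to
  args[2] := 8                            ' how many longs to copy
  cognew(@entry, @args)

DAT
              org     0
entry         mov     ptr, par            ' PAR is read-only, so copy it first
              rdlong  src, ptr            ' src   := args[0]
              add     ptr, #4
              rdlong  dst, ptr            ' dst   := args[1]
              add     ptr, #4
              rdlong  count, ptr          ' count := args[2]

:loop         rdlong  value, src          ' hub op
              add     src, #4             ' slot 1 of the 2-instruction window
              nop                         ' slot 2 is free for real work

              wrlong  value, dst          ' hub op, should catch the very next window
              add     dst, #4             ' slot 1
              djnz    count, #:loop       ' slot 2

:halt         jmp     #:halt              ' park the cog when done

ptr           res     1
src           res     1
dst           res     1
count         res     1
value         res     1
```

Adding a third instruction to either group would be expected to push that hub op out to the 6-instruction window, which is the effect described above.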

kuroneko · 2011-07-05 18:42

@potatohead: Would your code benefit from one (or two^A) fewer hub window(s) in that loop you've shown?

^A provided you drop the rev requirement

potatohead · 2011-07-05 23:42

It would benefit clocks lower than 80 MHz. TVs don't realistically do more than 640 pixels, and even that requires a nice TV, or S-video, or a monochrome display. Figured if I could get 80 chars at 80 MHz, that's good. Losing one or two hub cycles would bring up the character density for lower clocks though. Reversing the font bits can be done in the init stage of whatever the higher level task is; I just prefer not to, because leaving the fonts alone makes the driver easier for others to put to use.
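
If the font were reversed once at init, a Spin helper along these lines might do it. This is only a sketch: reverse_font, font_addr and size are made-up names, and it assumes one byte per 8-pixel font row.

```spin
' Hypothetical helper, not part of the driver.
' font_addr = hub address of the font table, size = number of bytes in it.
PUB reverse_font(font_addr, size)
  repeat size                                     ' one pass per font byte
    byte[font_addr] := byte[font_addr] >< 8       ' >< 8 reverses the low 8 bits
    font_addr++                                   ' move to the next byte
```

Run once before the driver cog starts, this would make the two rev instructions in the loop unnecessary, at the cost of the font no longer matching its original bit order.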

Honestly, a future port to VGA would benefit, because the sweeps are higher. The lowest sweep I know of that will display reasonably is 640x200, as seen in the full color tile driver. Would have to be interlaced for this display, but... many newer LCD type VGA displays will just de-interlace that, or not render with a lot of flicker like a CRT will. Did a quick test on the one I have, and the interlace display rendering was superior to a CRT. So that's the plan later on.

It's probably not worth it to lose the rev instructions for TV, as that driver renders at the max for TV at 80 MHz. But a VGA one likely won't at the current speed, meaning rev would have to go in that case to get higher character densities.

What did you have in mind? Something good probably. Do tell... Maybe Vega256 will enjoy some HUB access time chatter from the master!

kuroneko · 2011-07-05 23:47

6+2+6+2 (fontsum_paired = fontsum * $00010001):

DAT
:loop           RDLONG  colpix, screen            'Fetch two chars
                mov     A, colpix                 'Need working copy
                and     colpix, vff00ff00         'mask for colors only, both characters

                add     screen, #4                'Point to next pair of characters
                and     A, v00ff00ff              'Prepare to calculate character pixel data addresses
                shl     A, #3                     'Multiply by 8
                add     A, fontsum_paired         'point to pixel data in font table

                RDBYTE  pixels, A                 'fetch pixels from hub for first character
                shl     pixels, #16               'Put pixels in position for combining with colors
                shr     A, #16                    'point to pixel data second character

                RDBYTE  pixels1, A                'Fetch pixel data from font table
                rev     pixels, #8                'Reverse pixels for proper display
                rev     pixels1, #8               'Do it for both characters

                or      colpix, pixels            'Combine colors and pixels for TV COG
                or      colpix, pixels1

                WRLONG  colpix, index             'Put complete pixel and color data in buffer for TV COG
                add     index, #4                 'Point to next buffer address
                djnz    count, #:loop             'Do all of the characters in the scan line, until buffer full

6+2+2+2 (fontsum_paired = fontsum * $00010001, rev requirement lifted):

DAT
:loop           RDLONG  colpix, screen            'Fetch two chars
                mov     A, colpix                 'Need working copy
                and     colpix, vff00ff00         'mask for colors only, both characters

                add     screen, #4                'Point to next pair of characters
                and     A, v00ff00ff              'Prepare to calculate character pixel data addresses
                shl     A, #3                     'Multiply by 8
                add     A, fontsum_paired         'point to pixel data in font table

                RDBYTE  pixels, A                 'fetch pixels from hub for first character
                or      colpix,pixels             'Combine colors and pixels for TV COG
                shr     A, #16                    'point to pixel data second character

                RDBYTE  pixels, A                 'Fetch pixel data from font table
                shl     pixels, #16               'Relocate pixel data
                or      colpix, pixels            'Combine colors and pixels for TV COG

                WRLONG  colpix, index             'Put complete pixel and color data in buffer for TV COG
                add     index, #4                 'Point to next buffer address
                djnz    count, #:loop             'Do all of the characters in the scan line, until buffer full

potatohead · 2011-07-05 23:57

Very nice, combining the add operations. I like it. Will plug that in and give it a go here in the near future. Nicely done. One long will hold two 16 bit addresses, appreciated.
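
For reference, a sketch of how fontsum_paired could be built once at startup, assuming fontsum holds a hub address that fits in 16 bits. Under that assumption the shift-and-or below gives the same result as multiplying by $00010001. The start method, the init label and the tiny font stand-in are hypothetical scaffolding, not part of either driver.

```spin
' Hypothetical demo object: only fontsum and fontsum_paired come from the thread.
PUB start
  fontsum := @font                                ' patch the DAT image before the cog loads it
  cognew(@init, 0)

DAT
font            byte    0[8]                      ' stand-in for the real font table

DAT
                org     0
init            mov     fontsum_paired, fontsum   ' low word := font table address
                shl     fontsum_paired, #16       ' move that copy into the high word
                or      fontsum_paired, fontsum   ' both words now hold the address
:halt           jmp     #:halt                    ' the character loop would go here

fontsum         long    0                         ' hub address of the font, set by start
fontsum_paired  res     1
```

The trick depends on the low word never carrying into the high word, so fontsum plus the largest character offset (255 * 8) has to stay below $10000. The loops above also depend on the first RDBYTE using only the low 16 bits of the address.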

Edit: And there is another thing about the Prop. It's best to operate on data in sizes that make sense for it. Not doing that where there are indexes tends to be very costly.

@Vega256, sorry for extending your query. Usually, the access window questions very closely follow indexing ones.

Figured I would get the jump on that, just as somebody did for me when I crossed this point in PASM.
