[Propeller Assembly] Using PAR
Hey,
If I pass in an address of a piece of data using PAR but I wanted to know the address of the next sequential piece of data in memory as well, would I just add 1 to PAR or is this completely incorrect?
Comments
Adding one would point to the next BYTE in the HUB. Adding two would be a WORD, and four is a LONG.
mov index, PAR
add index, #4
rdlong cog_var1, index
add index, #1
rdbyte cog_var2, index
etc...
I see Cluso beat me to it. Transfer PAR to a working cog storage location first, then operate on it. Didn't say it explicitly, and should have.
This is because the hub is byte-addressed. PAR can only carry a long-aligned address: only bits 15..2 of the address are passed to the cog, so the bottom two bits always read back as zero. Once you have that address in a working register, though, you add 1 to step through bytes, 2 for words, and 4 for longs.
IIRC there is a good example in passing parameters in the VGA sample code in the obex.
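To make the PAR handoff concrete, here is a minimal sketch, assuming a Spin object that keeps two consecutive longs in a VAR array. The names params, entry, index, first and second are made up for illustration and are not taken from the VGA sample code.

```spin
VAR
  long params[2]                          ' two consecutive longs in hub RAM (VAR longs are long-aligned)

PUB start
  params[0] := 1234
  params[1] := 5678
  cognew(@entry, @params)                 ' the address of params[0] arrives in PAR

DAT
              org     0
entry         mov     index, par          ' PAR is read-only, so work on a copy
              rdlong  first, index        ' long at PAR
              add     index, #4           ' hub addresses count bytes, so the next long is +4
              rdlong  second, index       ' long at PAR + 4
:halt         jmp     #:halt              ' park the cog

index         res     1
first         res     1
second        res     1
              fit     496
```

After the two reads, first should hold 1234 and second 5678. Adding #1 instead of #4 would leave rdlong reading the same long again, because rdlong ignores the bottom two address bits.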
:loop   wrlong  value, index
        add     index, offset
        djnz    offset, #:loop
Simple and fast case. Takes advantage of one of the specialized auto-decrement instructions. Maybe initialize a region of RAM with this.
:loop   wrlong  value, index
        add     index, offset
        cmp     index, boundary wc, wz
        nop
        nop
        nop
 if_nz  jmp     #:loop              'loop until index reaches boundary
The second loop misses the hub access window because of the compare. It runs just as fast with the nop instructions in there, since the cog would otherwise just sit waiting for the next window.
When writing routines that work with HUB RAM, I find it best to write them sequentially first, in whatever order helps you get it done, then break the instructions into groups to show the windows and make counting easy. The more common case is that there is time to do stuff between hub accesses, but only so much, or else the next window is bigger. Then re-sequence the instructions to fit the best-case access window of 2, 6, or 10 instructions. Very significant speed gains are to be had that way. Here's one example from the Potatotext 2 driver I just completed.
:loop   RDLONG  colpix, screen      'Fetch two chars
        mov     A, colpix           'Need working copy
        and     colpix, vff00ff00   'mask for colors colors only both characters
        mov     B, A                'working copy for second character
        and     A, #$FF             'Prepare to calculate first character pixel data address
        shl     A, #3               'Multiply by 8
        add     A, fontsum          'point to pixel data in font table
        RDBYTE  pixels, A           'fetch pixels from hub for first character
        and     B, v00ff00ff        'pixels only
        shr     B, #13              'Prepare to fetch pixels from HUB, shift contains multiply
        nop
        nop
        shl     pixels, #16         'Put pixels in position for combining with colors
        add     B, fontsum          'point to pixel data second character
        RDBYTE  pixels1, B          'Fetch pixel data from font table
        rev     pixels, #8          'Reverse pixels for proper display
        rev     pixels1, #8         'Do it for both characters
        or      colpix, pixels      'Combine colors and pixels for TV COG
        or      colpix, pixels1
        'add    index, #4
        add     screen, #4          'Point to next pair of characters
        WRLONG  colpix, index       'Put complete pixel and color data in buffer for TV COG
        add     index, #4           'Point to next buffer address
        'add    screen, #4
        djnz    count, #:loop       'Do all of the characters in the scan line, until buffer full
I'll have to go pull the nop instructions out of that one. Forgot and left them in there. I'll typically drop them in as placeholders to visualize the access cycle, then stuff instructions in there, out of order where possible, to get the best access times. On that loop, it was very important to get that last cycle down to 2 instructions to hit the time requirement. It's currently 6+6+6+2 = 20 instructions. One more instruction after the last hub op would make it 6+6+6+6, a jump to 24 instructions.
Another instruction in the 6 instruction area would bump it to 10. 6+10+6+6 = 28, with a worst case of 10+10+10+6 = 36 instructions. Well worth considering, if you need the access speed. I was at the 28 instructions the first time, putting the operations in simple, logical, sequential order. Choosing when to do things made a considerable difference. Roughly 25 percent faster.
Calculate that, then write your access loop, then add up your total hub accesses, finally dropping in your between access operations out of order for best speed.
This is also a great way to calculate quickly whether or not something is going to be possible. Add up the HUB accesses, then approximate instructions, and you've a rough metric on how much time it will take to move the data. A few exercises like this, and it's clear to see it pays off to combine data operations with the HUB where possible, keeping the number of reads and writes to the minimum.
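As a rough illustration of that kind of estimate, here is a sketch of a hub-to-hub copy loop with the window counting written into the comments. It assumes the usual Propeller 1 timing (a hub window every 16 clocks, 4 clocks per ordinary instruction), an 80 MHz system clock, and a three-long parameter block at PAR. The names ptr, src, dst, count and tmp are hypothetical.

```spin
DAT
              org     0
entry         mov     ptr, par            ' parameter block (assumed layout): source, destination, long count
              rdlong  src, ptr
              add     ptr, #4
              rdlong  dst, ptr
              add     ptr, #4
              rdlong  count, ptr          ' assumed to be at least 1
:loop         rdlong  tmp, src            ' hub op 1
              add     src, #4             ' 1 instruction, fits the 2-instruction window
              wrlong  tmp, dst            ' hub op 2, catches the very next window
              add     dst, #4             ' add + djnz = 2 instructions,
              djnz    count, #:loop       ' so the rdlong catches the next window too
:halt         jmp     #:halt              ' park the cog when done

' Estimate: 2 hub ops per long, each landing in consecutive windows
'           = 2 * 16 = 32 clocks per long
'           80_000_000 / 32 = 2_500_000 longs per second (about 10 MB/s)

ptr           res     1
src           res     1
dst           res     1
count         res     1
tmp           res     1
              fit     496
```

Adding even one more instruction between the wrlong and the rdlong would push that half of the loop out to the next window, 48 clocks per long instead of 32, which is the same effect as the cmp in the second loop earlier in the thread.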
A provided you drop the rev requirement
Honestly, a future port to VGA would benefit, because the sweeps are higher. The lowest sweep I know of that will display reasonably is 640x200, as seen in the full color tile driver. Would have to be interlaced for this display, but... many newer LCD type VGA displays will just de-interlace that, or not render with a lot of flicker like a CRT will. Did a quick test on the one I have, and the interlace display rendering was superior to a CRT. So that's the plan later on.
It's probably not worth it to lose the rev instructions for TV, as that driver renders at the max for TV @ 80. But, a VGA one likely won't at the current speed, meaning rev would have to go, in that case to get higher character densities.
What did you have in mind? Something good probably. Do tell... Maybe Vega256 will enjoy some HUB access time chatter from the master!
DAT
:loop   RDLONG  colpix, screen      'Fetch two chars
        mov     A, colpix           'Need working copy
        and     colpix, vff00ff00   'mask for colors colors only both characters
        add     screen, #4          'Point to next pair of characters
        and     A, v00ff00ff        'Prepare to calculate character pixel data addresses
        shl     A, #3               'Multiply by 8
        add     A, fontsum_paired   'point to pixel data in font table
        RDBYTE  pixels, A           'fetch pixels from hub for first character
        shl     pixels, #16         'Put pixels in position for combining with colors
        shr     A, #16              'point to pixel data second character
        RDBYTE  pixels1, A          'Fetch pixel data from font table
        rev     pixels, #8          'Reverse pixels for proper display
        rev     pixels1, #8         'Do it for both characters
        or      colpix, pixels      'Combine colors and pixels for TV COG
        or      colpix, pixels1
        WRLONG  colpix, index       'Put complete pixel and color data in buffer for TV COG
        add     index, #4           'Point to next buffer address
        djnz    count, #:loop       'Do all of the characters in the scan line, until buffer full
6+2+2+2 (fontsum_paired = fontsum * $00010001, rev requirement lifted):
DAT
:loop   RDLONG  colpix, screen      'Fetch two chars
        mov     A, colpix           'Need working copy
        and     colpix, vff00ff00   'mask for colors colors only both characters
        add     screen, #4          'Point to next pair of characters
        and     A, v00ff00ff        'Prepare to calculate character pixel data addresses
        shl     A, #3               'Multiply by 8
        add     A, fontsum_paired   'point to pixel data in font table
        RDBYTE  pixels, A           'fetch pixels from hub for first character
        or      colpix,pixels       'Combine colors and pixels for TV COG
        shr     A, #16              'point to pixel data second character
        RDBYTE  pixels, A           'Fetch pixel data from font table
        shl     pixels, #16         'Relocate pixel data
        or      colpix, pixels      'Combine colors and pixels for TV COG
        WRLONG  colpix, index       'Put complete pixel and color data in buffer for TV COG
        add     index, #4           'Point to next buffer address
        djnz    count, #:loop       'Do all of the characters in the scan line, until buffer full
Edit: And there is another thing about the Prop. It's best to operate on data in sizes that make sense for it. Not doing that, especially where indexes are involved, tends to be very costly.
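One way to read that advice, sketched under assumptions: when four bytes sit together in a long-aligned location, a single rdlong plus some cog-side shifting can stand in for four separate rdbyte hub accesses. The names ptr, quad and b0 to b3 are hypothetical, and whether this actually wins depends on what else is available to fill the hub windows.

```spin
DAT
              org     0
entry         mov     ptr, par            ' PAR is long-aligned, which is what rdlong expects
              rdlong  quad, ptr           ' one hub access fetches four bytes
              mov     b0, quad
              and     b0, #$FF            ' byte at ptr + 0 (the Propeller is little-endian)
              mov     b1, quad
              shr     b1, #8
              and     b1, #$FF            ' byte at ptr + 1
              mov     b2, quad
              shr     b2, #16
              and     b2, #$FF            ' byte at ptr + 2
              mov     b3, quad
              shr     b3, #24             ' byte at ptr + 3, shr shifts in zeros so no mask is needed
:halt         jmp     #:halt              ' park the cog

ptr           res     1
quad          res     1
b0            res     1
b1            res     1
b2            res     1
b3            res     1
              fit     496
```

The driver loops above use the same idea when they fetch two characters with one RDLONG and split them inside the cog.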
@Vega256, sorry for extending your query. Usually, the access window questions very closely follow indexing ones.