I've been mapping out the external memory bank-to-bank copy and graphics copy cases. The logic for that code is going to be rather messy and branch around a fair bit, but thankfully it can run while we are waiting for the streaming burst to end, so this slowish code is hidden in otherwise dead time and won't affect performance. Right now, though, my fill operations still run in REP loops that tie up the CPU for the entire transfer, so the next sub-burst setup code can't run until the fill completes, which isn't ideal (this applies only to fills, not the normal read/write bursts).
I'm wondering if I can fix this by having the streamer replicate the data automatically in its immediate mode. E.g. if I set up the streamer with the code below, will I get 50 $aa55 words replicated, or will it send only two $aa55 words and then repeat the final $aa byte for the remaining 96 byte transfers:
setword xmode, #100, #0 ' 100 byte clocks for 50 words
...
xcont xmode, pattern
xmode long $60CE_0000 ' immediate to pin mode (4 x 8 bits)
pattern long $aa55aa55
Basically, what happens to the data output from a 32 bit long in streamer mode if more than 4 clocks are assigned to the transfer but the long passed holds only 32 bits? Does it recycle the data in the long over and over, or does it repeat the last (upper) byte once it runs out of data after the first 4 clocks? I guess I'll have to set up a test unless anyone already knows this...
Update: Just read the docs on this:
"S/# supplies 32 bits of data which form a set of 1/2/4/8/16-bit values that are shifted by 1/2/4/8/16/32 bits on each subsequent NCO rollover, with the last value repeating. Each value is output in sequence."
I guess I have my answer: it looks like it might just repeat the last byte.
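To make that reading of the docs concrete, here is a minimal Python model of the quoted sentence. It is not P2 code and has not been checked against hardware. The function name `immediate_stream` is made up for illustration, and it assumes the values are taken from the low end of the long first, which is what "repeat the last (upper) byte" implies.

```python
def immediate_stream(data, value_bits, clocks):
    """Model of the quoted doc sentence: emit `clocks` values of
    `value_bits` each from one 32-bit long, lowest bits first,
    repeating the last value once the long is used up."""
    mask = (1 << value_bits) - 1
    count = 32 // value_bits              # values held in one long
    out = []
    for n in range(clocks):
        slot = min(n, count - 1)          # assumption: stick on the final value
        out.append((data >> (slot * value_bits)) & mask)
    return out


# The case from the post: 100 byte clocks with the long $aa55aa55.
bytes_out = immediate_stream(0xAA55AA55, 8, 100)
print([hex(b) for b in bytes_out[:6]])
print(bytes_out[4:] == [0xAA] * 96)
```

Under this model only the first four byte clocks carry the $55, $aa, $55, $aa sequence, so a 100-clock command would not produce 50 replicated words.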
Interesting approach, evanh. I think the two-bit immediate-to-LUT lookup mode could be used to achieve this and free the COG during fills. Each byte of the streamer's immediate data would hold 4 two-bit values, each of which indexes LUT RAM and resolves to one byte of the desired long (or word) pattern, which is then output to the pins. So I can see a way it could possibly work. Byte-sized fills can work directly without the LUT RAM. The main downside is that up to 8 sets of 4 LUT entries are used (one set of 4 entries per COG), and these have to sit on bbbb00000 boundaries in the LUT RAM. It might be doable later as an optimisation if I really need it, by breaking up my LUT RAM code into smaller blocks and leaving room for these 4 entries to be scattered through the LUT RAM addresses where they need to be. Unless that fits nicely, it's not especially ideal.
Update: Another issue with this is the extra setup overhead of writing the byte patterns into LUT RAM before the transfers can begin. For a long fill pattern you'd have to extract the 4 individual bytes from the original pattern, replicate each one over all 4 bytes of the long being written, and then write each long to its own LUT address. That's already 12 instructions (the following 3 instructions done 4 times), and this excludes the per-COG ptra setup. The complexity may not be worth the gain...
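A rough Python model of the lookup idea, covering only the 16 two-bit values held in one immediate long. The helper `lut_2bit_stream`, the example pattern and the choice of bbbb = 3 are all hypothetical. It assumes the indices are consumed low bits first and that each index selects `base + index`, with `base` on a bbbb00000 boundary as described above. What the streamer does after the 16th value is not modelled here.

```python
def lut_2bit_stream(imm, lut, base):
    """First 16 outputs of a 2-bit immediate -> LUT lookup (model only)."""
    out = []
    for n in range(16):
        index = (imm >> (2 * n)) & 0b11   # next 2-bit value, low bits first
        out.append(lut[base + index])     # LUT entry it resolves to
    return out


pattern = 0x12345678                      # example long fill pattern
base = 0b0011 << 5                        # hypothetical bbbb = 3 -> address $060
# One LUT entry per byte of the pattern, lowest byte at index 0.
lut = {base + i: (pattern >> (8 * i)) & 0xFF for i in range(4)}

imm = 0xE4E4E4E4                          # each byte is %11_10_01_00: indices 0,1,2,3
stream = lut_2bit_stream(imm, lut, base)
print([hex(b) for b in stream])
```

In this model one immediate long yields four full repetitions of the pattern's bytes in ascending order.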
getbyte a, pattern, #0 ' eg. get byte 0,1,2, or 3
movbyts a, #0
wrlut a, ptra++
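For reference, the effect of running those three instructions four times can be modelled in Python as below. This is only a sketch of the data flow, not P2 code; `build_lut_entries` and the example pattern are invented for illustration, and the list stands in for the four consecutive LUT locations written through ptra.

```python
def movbyts_replicate(value):
    """Model of `movbyts a, #0`: every byte lane takes byte 0 of the input."""
    return (value & 0xFF) * 0x01010101


def build_lut_entries(pattern):
    """Model of repeating getbyte / movbyts / wrlut for bytes 0..3."""
    entries = []
    for n in range(4):
        a = (pattern >> (8 * n)) & 0xFF   # getbyte a, pattern, #n
        a = movbyts_replicate(a)          # movbyts a, #0
        entries.append(a)                 # wrlut a, ptra++
    return entries


entries = build_lut_entries(0x12345678)
print([hex(e) for e in entries])
```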
The way I see it, we can simply fill all 4 bytes in the LUT long with the same value, and then set up the right 8 bit output pin group for the data bus once at the start in the streamer commands. Then there's no need to figure out how to shuffle bytes into their real positions in the LUT longs. The streamer command's "ppp" value should be able to select the final 8 bit output pin group for us nicely. The other way requires getbyte and setbyte patches in multiple places and doesn't seem to gain anything for the extra work.
EDIT: Actually, maybe we can use the ppp bits to realign and force the streamer to make use of only the 8 bit values at the bottom of the LUT long when outputting to the pins, in which case that extra movbyts might not be needed at all.
Now I finally get what you mean about zero extended. The streamer in this mode will OR its output onto the other 24 bits in the group above this 8 bit data bus, which could be our control pins. Yep, those bits need to be zeroes.
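A tiny Python illustration of why the upper bits matter, under the simplifying assumptions that the data bus is the lowest 8 bits of the 32-bit group and that the streamer's 32-bit value is bitwise ORed with whatever the cog is already driving. The control-pin layout and the function name `pins_after_or` are hypothetical.

```python
def pins_after_or(control_bits, lut_long):
    """Model: pin state is the cog's own output bits ORed with the
    32-bit value the streamer fetched from LUT."""
    return (control_bits | lut_long) & 0xFFFFFFFF


control = 0x00000100                      # hypothetical: one control pin high, just above the bus
replicated = 0x55555555                   # byte copied into all 4 lanes
zero_extended = 0x00000055                # same byte, upper 24 bits zero

bad = pins_after_or(control, replicated)
good = pins_after_or(control, zero_extended)

print(hex(bad >> 8), hex(good >> 8))      # upper 24 bits after the OR
```

With the replicated entry the upper 24 bits no longer match what the cog intended for the control pins, while the zero-extended entry leaves them untouched.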
Comments
I'm wondering if I can fix this if the streamer replicates the data size automatically in its immediate mode. E.g. If I setup the streamer to do this with the code below, will I get 50 $aa55 words replicated or will it only send two $aa55 words then replicate the final $aa byte for the remaining 96 byte transfers perhaps:
Basically what happens with the data output from a 32 bit long in streamer mode if there are more than 4 clocks assigned to the streamer transfers but only 32 bits in the long passed? Does it just recycle the data in the long over and over or will it repeat the last (upper) byte when it runs out of data after the first 4 clocks? I guess I have to setup a test unless anyone already knows this...
Update: Just read the docs on this:
"S/# supplies 32 bits of data which form a set of 1/2/4/8/16-bit values that are shifted by 1/2/4/8/16/32 bits on each subsequent NCO rollover, with the last value repeating. Each value is output in sequence."
I guess I have my answer, looks like it might just repeat the last byte.
If you're outputting bytes, yes.
I've used lutRAM addresses 0 and 1. There's 15 other choices by setting bits 16-19 in xmode.
EDIT: Ah, maybe I never used it beyond eight clocks at a time. I've found an old test program that only does eight for each XINIT.