Fastest way to get INA Nibbles shifted to 32bit?

As some SPI memory devices are now starting to show up in 100 MHz Quad-SPI versions,
I was happy about getting 4 times the bandwidth, but then I thought that
shifting 4 bits from INA is not as simple as reading one pin and using RCL (or maybe MUXC),
so some of the gain could be lost to extra software overhead.
Examples should assume that we are using a counter to generate the clock.
The code only needs to get 4 pins shifted in as fast as possible, but at a synchronous time interval
so as not to get a mismatch with CLK.
movs buffer,ina          ' copy INA bits 0..8 into the low 9 bits of buffer
ror buffer,#1 wc         ' C = bit 0 of buffer
rcl buffer2,#1           ' shift C into buffer2
ror buffer,#1 wc
rcl buffer2,#1
...
...
and so on, using unrolled loops.
But the above code is not faster than reading 1 bit at a time from regular SPI.
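For comparison, here is a minimal sketch of a shift-and-OR variant. It assumes the four data lines sit on P0..P3 and that nib and buffer are cog registers; the pin placement and both names are assumptions for illustration. It takes four instructions per nibble (about 16 clocks on a Propeller 1, where most instructions take 4 clocks) instead of nine for the ROR/RCL sequence. It does not deal with staying in step with the counter-generated CLK.

```
' Sketch only: assumes quad data on P0..P3 (hypothetical pin placement)
        mov     nib, ina            ' sample all input pins
        and     nib, #$0F           ' keep only P0..P3
        shl     buffer, #4          ' make room for the new nibble
        or      buffer, nib         ' merge the nibble into the low 4 bits
' repeat the four lines above 8 times (unrolled) to fill a 32-bit long

nib     long    0                   ' hypothetical scratch register
buffer  long    0                   ' hypothetical accumulator
```

The first nibble read ends up in the top 4 bits of buffer after eight repeats.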
I could see running 4 cogs synchronously,
each treating it as a 1-bit SPI, but on a different one of the quad pins.
If it's used for VGA color data, you are allowed to use some tricks that
take advantage of the fact that it only uses 24 bits (H & V are always 0), so it should need only 6 nibbles.
Comments
Later in the program, the lower 16 bits of the even longs have to be masked and OR'd with the masked upper 16 bits of the odd longs before transfer to the hub. Or you could just collect 16 lower bits, as in the first block above and transfer the data as words.
-Phil
We're all patiently waiting for SQI RAM
As you point out, it's easy to realize 4X the speed on any pins using 4 cogs. But, I think 4 cogs is just too much.
For FlashPoint, I'm currently working on drivers for 2 and 3 cog fast reads.
With 3 cogs, I think I can do a 256x240 "full color" bitmap (8-bit) display on TV.
With 2 cogs, I can now show a 480x272 (6-bit) image on my 4.3" touchscreens at 25 Hz (fast enough to make it flicker free) from either SQI flash or 4X SRAM chips.