Clever ways to write in parallel?
pgbpsu
Posts: 460
I'm in the middle of an application that needs temporary storage of incoming data, which will later be pulled back from temp storage and written to SD. I'm using Atmel DataFlash units for the temporary storage, but to get the speed we need I'm using five in parallel. They use an SPI interface, which I've got working. What I'm trying to do now is come up with a clever way to write to them in parallel. They share a clock but have individual chip select and data lines. My code for writing to them is included below, but I'd like to speed this up if possible. Basically I have 5 registers and I want to pull the highest bit of each one and put it onto a group of output pins. My output pins are consecutive (P10-P6).
So my question is how do I best pluck the highest bit off 5 different hub registers and get them together for output?
              rdlong  data_0, Prop_Address        ' Grab the first long
              add     Prop_Address, #4            ' Point at the next long
              rdlong  data_1, Prop_Address        ' Grab the second long
              add     Prop_Address, #4            ' Point at the next long
              rdlong  data_2, Prop_Address        ' Grab the third long
              add     Prop_Address, #4            ' Point at the next long
              rdlong  data_3, Prop_Address        ' Grab the fourth long
              add     Prop_Address, #4            ' Point at the next long
              rdlong  data_4, Prop_Address        ' Grab the fifth long
              call    #SET_DATA_OUTPUT            ' Set data pins to output
              call    #LOWER_CS_PINS              ' Lower all the necessary chip select lines
              mov     num_bits, #32   wz          ' How many bits to clock out; wz gives Z=0 since 32 <> 0
:write_loop   muxz    outa, CLK_MASK              ' Lower clock line (Z=0 clears the masked bit)
              shl     data_0, #1      wc          ' Shift data left 1; C = bit shifted out
              muxc    outa, DATA0_MASK            ' Set OUTA[DATA0_MASK] = C
              shl     data_1, #1      wc          ' Shift data left 1; C = bit shifted out
              muxc    outa, DATA1_MASK            ' Set OUTA[DATA1_MASK] = C
              shl     data_2, #1      wc          ' Shift data left 1; C = bit shifted out
              muxc    outa, DATA2_MASK            ' Set OUTA[DATA2_MASK] = C
              shl     data_3, #1      wc          ' Shift data left 1; C = bit shifted out
              muxc    outa, DATA3_MASK            ' Set OUTA[DATA3_MASK] = C
              shl     data_4, #1      wc          ' Shift data left 1; C = bit shifted out
              muxc    outa, DATA4_MASK            ' Set OUTA[DATA4_MASK] = C
              xor     outa, DF_SCK_MASK           ' Raise the clock line
              djnz    num_bits, #:write_loop wz   ' Repeat until all 32 bits are out; Z=1 when num_bits = 0
              call    #RAISE_CS_LINES             ' Raise the chip select lines; data pins back to input
I think the loop part works out to 417 instructions. Add to that the ~40 instructions involved in copying the data from hub to cog and doing a bit of setup, and I'm looking at ~23 µs, which isn't as bad as I first thought. I haven't had a chance to try this yet, and it ended up being a bit shorter than I expected, but if anyone has suggestions for ways to speed this up please let me know.
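As a rough sanity check on that estimate (assuming the usual 80 MHz clock, 4 clocks per cog instruction, and 8-23 clocks for each hub read):

  13 instructions/bit x 32 bits   ≈ 416 loop instructions (plus the mov that sets num_bits)
  416 instructions x 50 ns        ≈ 20.8 µs
  ~40 setup/hub instructions      ≈ 2-3 µs
  total                           ≈ 23 µs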
Thanks in advance,
pgb
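For comparison, here is one other way to answer the "pluck the highest bit off 5 registers" question: collect the five carry bits into a single 5-bit group with rcl, then drop the whole group onto the consecutive data pins at once. A rough, untested sketch, assuming data_0 maps to P10 down to data_4 on P6; bits and DATA_MASK are placeholder names (DATA_MASK covering those five pins). It is actually a few instructions longer per bit than the muxc loop above, but it gets the five MSBs into one contiguous group before touching outa:

:write_loop   andn    outa, CLK_MASK              ' Clock low
              mov     bits, #0
              shl     data_0, #1      wc          ' C = MSB of data_0
              rcl     bits, #1                    ' Shift C into bits
              shl     data_1, #1      wc
              rcl     bits, #1
              shl     data_2, #1      wc
              rcl     bits, #1
              shl     data_3, #1      wc
              rcl     bits, #1
              shl     data_4, #1      wc
              rcl     bits, #1                    ' bits[4..0] = MSBs of data_0..data_4
              shl     bits, #6                    ' Align with P10..P6
              andn    outa, DATA_MASK             ' Clear the five data pins
              or      outa, bits                  ' Drive all five data lines from the group
              or      outa, CLK_MASK              ' Clock high - chips latch their bits
              djnz    num_bits, #:write_loop      ' Next bit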
Comments
One idea is to take three more of these memory chips and three more I/O lines,
load ONE byte (containing bit 0, bit 1, bit 2 ... bit 7) and store each BIT to one chip:
bit 0 to chip 0
bit 1 to chip 1
bit 2 to chip 2
...
bit 7 to chip 7
All other possibilities I can think of (which are NOT all possibilities) would take more time.
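A rough, untested sketch of the inner step for that idea, assuming the eight data lines sit on consecutive pins (DATA_BASE, DATA8_MASK, hub_ptr, sample and byte_count are placeholder names; CLK_MASK is the shared clock):

:byte_loop    andn    outa, CLK_MASK              ' Clock low
              rdbyte  sample, hub_ptr             ' Fetch one byte from hub RAM
              add     hub_ptr, #1                 ' Advance to the next byte
              shl     sample, #DATA_BASE          ' Line bits 7..0 up with the eight data pins
              andn    outa, DATA8_MASK            ' Clear the eight data pins
              or      outa, sample                ' Bit n of the byte drives chip n's data line
              or      outa, CLK_MASK              ' Clock high - each chip latches its one bit
              djnz    byte_count, #:byte_loop     ' Next byte

That works out to roughly eight instructions per stored byte instead of ~13 per bit, which is where the speed of the eight-chip layout would come from.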
best regards
Stefan
That IS a clever solution which would also add a significant amount of additional storage space. Unfortunately, my application is already out of pins. But I'll keep that in mind for future designs where pins aren't at such a premium.
Thanks,
pgb
This principle of storing the bits of one value across different chips would work with 5 chips too,
but it needs a little more bit-shifting.
What range do the values you want to store have?
8 bits? 16 bits?
As an example, I'll assume 16 bits:
Load this 16-bit value and AND it with %11111 to get the lowest 5 bits (bits 1-5).
Store that bit pattern to your outputs.
Shift RIGHT by 5 bits to get the next 5 bits (bits 6-10),
AND, store.
Shift RIGHT by 5 bits again to get the next 5 bits (bits 11-15),
AND, store.
So now one bit is left over from the first 16-bit word.
Keep that leftover bit,
load the next 16-bit word,
AND with %11110, OR in the leftover bit from the word before,
store to the outputs.
After the next 16-bit word there are two bits left over,
then 3, then 4.
So if you program a loop with hardcoded shifts: 5 bits for the first word,
4 bits for the second word,
3 bits for the third word,
2 bits for the fourth,
1 bit for the fifth word,
you would need more memory but gain speed by avoiding an instruction that decrements the shift value (5, 4, 3, 2, 1).
Now you would have to calculate whether this is faster in execution than the other solution.
If your values are 15-bit or 20-bit it would work out more regularly with 5 chips,
as there wouldn't be leftover bits that have to be filled up with bits from the next word.
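One way to code that without the hardcoded shift counts is to keep a small bit buffer and a count, and pull in a new 16-bit word whenever fewer than 5 bits are left. A rough, untested sketch (bitbuf, bitcnt, word16, group, hub_ptr and num_groups are placeholder names):

:pack_loop    cmp     bitcnt, #5      wc          ' Fewer than 5 bits buffered?
    if_nc     jmp     #:emit                      ' No - go emit a 5-bit group
              rdword  word16, hub_ptr             ' Yes - pull in the next 16-bit sample
              add     hub_ptr, #2
              shl     word16, bitcnt              ' Place it above the leftover bits
              or      bitbuf, word16
              add     bitcnt, #16
:emit         mov     group, bitbuf
              and     group, #%11111              ' Low 5 bits = one bit for each of the 5 chips
              shr     bitbuf, #5                  ' Drop them from the buffer
              sub     bitcnt, #5
              ' ...drive the five data lines from "group" and pulse the clock, as in the main loop
              djnz    num_groups, #:pack_loop

It trades the unrolled, hardcoded shifts for a couple of extra instructions per group, so which version is faster comes down to counting cycles for both.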
I think this can work, but the stored data will be scrambled, in a way.
The read-out algorithm has to take care of this and work the same way backwards to get the normal 16-bit values back.
best regards
Stefan
My data samples are 32 bits, which I don't think is any problem for the algorithm you've suggested. As you state at the end, the data are effectively encrypted, and in order to get back a single 32-bit sample I'd have to read from each of the DataFlash chips. My current approach, although possibly a bit slower, has the benefit of putting all the data from a single sample into a single DataFlash. That makes reading the data for a single channel/chip back in and storing it to the SD card in its own file MUCH easier.
After laying out the routine in my original post last night, it turned out better/simpler than I expected. I actually have 100 µs to write out the 5 32-bit samples before new ones are available, so time isn't as tight as I may have led you to believe. Even if my code ends up taking 25 µs, I still have lots of wiggle room.
Since this approach seems fast enough, and getting data back from the DF chips will be much simpler, I'll probably stick with what I've got (provided I can actually get it working). But thanks nonetheless for the suggestions.
Cheers,
Peter
We picked 5 because that's the number of channels of data we have. It turns out it also uses up all the remaining pins (which isn't a good thing). The nice thing about it is that it allows us to receive data as fast as we need to (one channel to each DF chip), and it makes pulling the buffered data back into the Prop and then passing it out to SD pretty easy. Our final file structure will be one file per channel. To accomplish that with our current layout, we can simply read from the single DF chip of interest, grab 512 bytes, spit them to SD, grab 512 more, etc., until that file is finished, then move to the next DF and so on.
I'm sure there are some very clever ways to do this. For now I think I've settled on less clever and more straightforward. This was all done to allow for our 10,000 samples/second on 5 channels for 30 seconds, and the math works out, so I think we're alright.
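A quick check of that math, assuming 32-bit samples throughout:

  per channel:   10,000 samples/s x 4 bytes   = 40,000 bytes/s
  over 30 s:     40,000 bytes/s x 30 s        = 1,200,000 bytes (~1.2 MB) buffered per DataFlash
  write window:  new samples arrive every 100 µs and the write-out takes ~23-25 µs, so there is plenty of margin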
Just plugging away on the code....
p
Considering that you are going to split the byte across 8 chips, you can drive them in parallel except for data, so: 1 clock, 1 CS, 8 data (total 10 pins).
Your solution with 5 chips: 1 clock, 5 CS, 5 data (11 pins: one more).
regards