Clever ways to write in parallel?
pgbpsu
Posts: 460
I'm in the middle of an application that needs temporary storage of incoming data, which will later be pulled back from temp storage and written to SD. I'm using Atmel DataFlash units for the temporary storage, but to get the speed we need I'm using five in parallel. They use an SPI interface, which I've got working. What I'm trying to do now is come up with a clever way to write to them in parallel. They share a clock but have individual chip select and data lines. My code for writing to them is included below, but I'd like to speed this up if possible. Basically I have 5 registers and I want to pull the highest bit of each one and put it onto a group of output pins. My output pins are consecutive (P10-P6).
So my question is how do I best pluck the highest bit off 5 different hub registers and get them together for output?
              rdlong  data_0, Prop_Address        ' Grab the first long
              add     Prop_Address, #4            ' Point at the next long
              rdlong  data_1, Prop_Address        ' Grab the second long
              add     Prop_Address, #4            ' Point at the next long
              rdlong  data_2, Prop_Address        ' Grab the third long
              add     Prop_Address, #4            ' Point at the next long
              rdlong  data_3, Prop_Address        ' Grab the fourth long
              add     Prop_Address, #4            ' Point at the next long
              rdlong  data_4, Prop_Address        ' Grab the fifth long
              call    #SET_DATA_OUTPUT            ' Set data pins to output
              call    #LOWER_CS_PINS              ' Lower all the necessary chip select lines
              mov     num_bits, #32   wz          ' How many bits to clock out; wz gives Z=0 since 32 <> 0
:write_loop   muxz    outa, CLK_MASK              ' Lower clock line (Z=0 clears the masked bit)
              shl     data_0, #1      wc          ' Shift data left 1; C = bit shifted out
              muxc    outa, DATA0_MASK            ' Set OUTA[DATA0_MASK] = C
              shl     data_1, #1      wc          ' Shift data left 1; C = bit shifted out
              muxc    outa, DATA1_MASK            ' Set OUTA[DATA1_MASK] = C
              shl     data_2, #1      wc          ' Shift data left 1; C = bit shifted out
              muxc    outa, DATA2_MASK            ' Set OUTA[DATA2_MASK] = C
              shl     data_3, #1      wc          ' Shift data left 1; C = bit shifted out
              muxc    outa, DATA3_MASK            ' Set OUTA[DATA3_MASK] = C
              shl     data_4, #1      wc          ' Shift data left 1; C = bit shifted out
              muxc    outa, DATA4_MASK            ' Set OUTA[DATA4_MASK] = C
              xor     outa, DF_SCK_MASK           ' Raise the clock line
              djnz    num_bits, #:write_loop wz   ' Repeat until all 32 bits are out; Z=1 when num_bits = 0
              call    #RAISE_CS_LINES             ' Raise the chip select lines; data pins back to input
I think the loop part works out to 417 instructions. Add to that the ~40 instructions involved in copying the data from hub to cog and doing a bit of setup, and I'm looking at ~23 µs, which isn't as bad as I first thought. I haven't had a chance to try this yet, and it ended up being a bit shorter than I expected, but if anyone has suggestions for ways to speed this up please let me know.
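As a rough sanity check on that estimate (assuming the usual 80 MHz clock, 4 clocks per cog instruction, and 8-23 clocks for each hub read):

  13 instructions/bit x 32 bits   ≈ 416 loop instructions (plus the mov that sets num_bits)
  416 instructions x 50 ns        ≈ 20.8 µs
  ~40 setup/hub instructions      ≈ 2-3 µs
  total                           ≈ 23 µs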
Thanks in advance,
pgb
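For comparison, here is one other way to answer the "pluck the highest bit off 5 registers" question: collect the five carry bits into a single 5-bit group with rcl, then drop the whole group onto the consecutive data pins at once. A rough, untested sketch, assuming data_0 maps to P10 down to data_4 on P6; bits and DATA_MASK are placeholder names (DATA_MASK covering those five pins). It is actually a few instructions longer per bit than the muxc loop above, but it gets the five MSBs into one contiguous group before touching outa:

:write_loop   andn    outa, CLK_MASK              ' Clock low
              mov     bits, #0
              shl     data_0, #1      wc          ' C = MSB of data_0
              rcl     bits, #1                    ' Shift C into bits
              shl     data_1, #1      wc
              rcl     bits, #1
              shl     data_2, #1      wc
              rcl     bits, #1
              shl     data_3, #1      wc
              rcl     bits, #1
              shl     data_4, #1      wc
              rcl     bits, #1                    ' bits[4..0] = MSBs of data_0..data_4
              shl     bits, #6                    ' Align with P10..P6
              andn    outa, DATA_MASK             ' Clear the five data pins
              or      outa, bits                  ' Drive all five data lines from the group
              or      outa, CLK_MASK              ' Clock high - chips latch their bits
              djnz    num_bits, #:write_loop      ' Next bit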
Comments
One idea is to take three more of these memory chips and three more I/O lines,
load ONE byte (containing bit 0, bit 1, bit 2 ... bit 7) and store each BIT to one chip:
bit 0 to chip 0
bit 1 to chip 1
bit 2 to chip 2
...
bit 7 to chip 7
All other possibilities I can think of (which are NOT all possibilities) would take more time.
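A rough, untested sketch of the inner step for that idea, assuming the eight data lines sit on consecutive pins (DATA_BASE, DATA8_MASK, hub_ptr, sample and byte_count are placeholder names; CLK_MASK is the shared clock):

:byte_loop    andn    outa, CLK_MASK              ' Clock low
              rdbyte  sample, hub_ptr             ' Fetch one byte from hub RAM
              add     hub_ptr, #1                 ' Advance to the next byte
              shl     sample, #DATA_BASE          ' Line bits 7..0 up with the eight data pins
              andn    outa, DATA8_MASK            ' Clear the eight data pins
              or      outa, sample                ' Bit n of the byte drives chip n's data line
              or      outa, CLK_MASK              ' Clock high - each chip latches its one bit
              djnz    byte_count, #:byte_loop     ' Next byte

That works out to roughly eight instructions per stored byte instead of ~13 per bit, which is where the speed of the eight-chip layout would come from.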
best regards
Stefan
That IS a clever solution which would also add a significant amount of additional storage space. Unfortunately, my application is already out of pins. But I'll keep that in mind for future designs where pins aren't at such a premium.
Thanks,
pgb
This principle of storing the bits of one value across different chips would work with 5 chips too,
but it needs a little more bit-shifting.
What range do the values you want to store have?
8 bits? 16 bits?
As an example, I'll assume 16 bits:
Load this 16-bit value and AND it with %11111 to get the lowest 5 bits (bits 1-5).
Store that bit pattern to your outputs.
Shift RIGHT by 5 bits to get the next 5 bits (bits 6-10),
AND, store.
Shift RIGHT by 5 bits again to get the next 5 bits (bits 11-15),
AND, store.
So now one bit is left over from the first 16-bit word.
Keep that leftover bit,
load the next 16-bit word,
AND with %11110, OR in the leftover bit from the word before,
store to the outputs.
After the next 16-bit word there are two bits left over,
then 3, then 4.
So if you program a loop with hardcoded shifts: 5 bits for the first word,
4 bits for the second word,
3 bits for the third word,
2 bits for the fourth,
1 bit for the fifth word,
you would need more memory but gain speed by avoiding an instruction that decrements the shift value (5, 4, 3, 2, 1).
Now you would have to calculate whether this is faster in execution than the other solution.
If your values are 15-bit or 20-bit it would work out more regularly with 5 chips,
as there wouldn't be leftover bits that have to be filled up with bits from the next word.
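One way to code that without the hardcoded shift counts is to keep a small bit buffer and a count, and pull in a new 16-bit word whenever fewer than 5 bits are left. A rough, untested sketch (bitbuf, bitcnt, word16, group, hub_ptr and num_groups are placeholder names):

:pack_loop    cmp     bitcnt, #5      wc          ' Fewer than 5 bits buffered?
    if_nc     jmp     #:emit                      ' No - go emit a 5-bit group
              rdword  word16, hub_ptr             ' Yes - pull in the next 16-bit sample
              add     hub_ptr, #2
              shl     word16, bitcnt              ' Place it above the leftover bits
              or      bitbuf, word16
              add     bitcnt, #16
:emit         mov     group, bitbuf
              and     group, #%11111              ' Low 5 bits = one bit for each of the 5 chips
              shr     bitbuf, #5                  ' Drop them from the buffer
              sub     bitcnt, #5
              ' ...drive the five data lines from "group" and pulse the clock, as in the main loop
              djnz    num_groups, #:pack_loop

It trades the unrolled, hardcoded shifts for a couple of extra instructions per group, so which version is faster comes down to counting cycles for both.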
I think this can work, but the stored data will be scrambled, in a way.
The read-out algorithm has to take care of this and work the same way backwards to get the normal 16-bit values back.
best regards
Stefan
My data samples are 32 bits, which I don't think is any problem for the algorithm you've suggested. As you state at the end, the data are effectively encrypted, and in order to get back a single 32-bit sample I'd have to read from each of the DataFlash chips. My current approach, although possibly a bit slower, has the benefit of putting all the data from a single sample into a single DataFlash. That makes reading the data for a single channel/chip back in and storing it to the SD card in its own file MUCH easier.
After laying out the routine in my original post last night, it turned out better/simpler than I expected. I actually have 100 µs to write out the 5 32-bit samples before new ones are available, so time isn't as tight as I may have led you to believe. Even if my code ends up taking 25 µs, I still have lots of wiggle room.
Since this approach seems fast enough, and getting data back from the DF chips will be much simpler, I'll probably stick with what I've got (provided I can actually get it working). But thanks nonetheless for the suggestions.
Cheers,
Peter
We picked 5 because that's the number of channels of data we have. It turns out it also uses up all the remaining pins (which isn't a good thing). The nice thing about it is that it allows us to receive data as fast as we need to (one channel to each DF chip), and it makes pulling the buffered data back into the Prop and then passing it out to SD pretty easy. Our final file structure will be one file per channel. To accomplish that with our current layout, we can simply read from the single DF chip of interest, grab 512 bytes, spit them to SD, grab 512 more, etc., until that file is finished, then move to the next DF and so on.
I'm sure there are some very clever ways to do this. For now I think I've settled on less clever and more straightforward. This was all done to allow for our 10,000 samples/second on 5 channels for 30 seconds, and the math works out, so I think we're alright.
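A quick check of that math, assuming 32-bit samples throughout:

  per channel:   10,000 samples/s x 4 bytes   = 40,000 bytes/s
  over 30 s:     40,000 bytes/s x 30 s        = 1,200,000 bytes (~1.2 MB) buffered per DataFlash
  write window:  new samples arrive every 100 µs and the write-out takes ~23-25 µs, so there is plenty of margin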
Just plugging away on the code....
p
Considering that you are going to split the byte across 8 chips, you can drive them in parallel except for data, so: 1 clock, 1 CS, 8 data (total 10 pins).
Your solution with 5 chips: 1 clock, 5 CS, 5 data (11 pins: one more).
regards