OLED 96-prop - Loading an Image from uSD card, and draw speed.
etherVision
I've been working on displaying animation on the 96-Prop from 4D. I've modified some of the demo code and added SD read capability. The problem I'm having now is image write speed, which seems to be about 2 fps. I added a speed test section that just writes frames of solid color, 1 pixel at a time, and I still get only about 2 fps. I'm posting the code here in the hope that someone will help me optimize it further, or figure out whether it's just a problem with the unit that I have.
Comments
Here are the 2 routines added to the Prop driver code:
PUB PutPixel16 (X, Y, V1, V2)
'' Writes 2 bytes (16 bits) of color data to the upper-left corner (X,Y) of the area in
'' Graphic RAM defined by the Set_GRAM_Access method.
  Set_GRAM_Access (X, 95, Y, 63)
  OUTA[CS_OLED] := 0
  OUTA[WR_OLED] := 0
  OUTA[7..0] := V1 ' MSB
  OUTA[WR_OLED] := 1
  OUTA[WR_OLED] := 0
  OUTA[7..0] := V2 ' LSB
  OUTA[WR_OLED] := 1
  OUTA[CS_OLED] := 1
PUB PutPixelNext (V1, V2)
'' Writes 2 bytes (16 bits) of color data to the next location
  OUTA[CS_OLED] := 0
  OUTA[WR_OLED] := 0
  OUTA[7..0] := V1 ' MSB
  OUTA[WR_OLED] := 1
  OUTA[WR_OLED] := 0
  OUTA[7..0] := V2 ' LSB
  OUTA[WR_OLED] := 1
  OUTA[CS_OLED] := 1
code is in the routine that calls PutPixelNext, I can imagine that Spin is just a little
too slow. I'm sure you could optimize the Spin code a little bit and get a few more
frames per second, but you'll do better with assembly language.
I doubt you even get 2 fps.
This is down to Spin; with machine code you should easily reach 60 fps.
My results with a 2GB PNY uSD card are 164 kbytes/sec write, 316 kbytes/sec read. This translates to more than 24 frames per second ... i.e. true video. Not bad. Past experiments with the USBwiz, a USB host product by GHIelectronics, maxed out at 54 kbytes/sec write speed.
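The frame-rate figure above can be checked with a little arithmetic. This is only a sketch, assuming a 96x64 panel (the driver's GRAM window ends at column 95, row 63), 2 bytes per pixel (PutPixel16 sends an MSB and an LSB), and "kbytes" meaning 1000 bytes. The function names are made up for illustration.

```python
# Assumptions (not measured here): 96x64 pixels, 16-bit color, 1 kbyte = 1000 bytes.
WIDTH, HEIGHT, BYTES_PER_PIXEL = 96, 64, 2


def frame_bytes(width=WIDTH, height=HEIGHT, bytes_per_pixel=BYTES_PER_PIXEL):
    """Size of one uncompressed frame in bytes."""
    return width * height * bytes_per_pixel


def max_fps(read_bytes_per_sec, size=None):
    """Upper bound on frames per second if SD reads are the only cost."""
    if size is None:
        size = frame_bytes()
    return read_bytes_per_sec / size


print(frame_bytes())               # 12288
print(f"{max_fps(316_000):.1f}")   # a little under 26 fps at 316 kbytes/sec
```

This is an upper bound only: it ignores the time needed to push the pixels to the display, so the real frame rate will be lower.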
On another front, I have had success with a 'color averaging' routine implemented in Visual Basic 6. It may take 6 seconds or so to process a large 3 MB JPEG from my digital camera, but it will examine individual pixel blocks, calculate an average color, and construct 12 kbyte .txt files which can be written to the uSD card. It runs in batch mode, translating a folder full of JPGs to small BMPs and Propeller TXT files. I modified the uOLED-96-Prop demo code into a simple SlideShow program with about three dozen colorful images grabbed using Google Images. The image quality is amazingly high. I love this thing!
I've also been toying with the uSD card write operation, specifically the sdspiqasm code, the really fast version of the low-level SPI interface code. This code is amazing! The author has used the two cog counters as NCOs (numerically controlled oscillators) in a very unusual manner. One NCO drives the CLK pin to the uSD card, the other the DI pin (data to the card). The CLK NCO is configured in such a way that loading the PHSA register with, for example, #8 will cause the CLK pin to go low for 8 cycles before returning high. This saves an instruction or two. The other NCO is used as a bit shifter to an output pin, with its FRQ register set to zero: data is shifted into bit 31, which maps to the DI pin. Another implementation elsewhere in his code toggles the clock line up and down repeatedly while assembly code reads in data bits from the SD card. Cool, amazing stuff! My hat's off. PS: I also liked the trick of editing the cog code while it is still in hub memory, changing a few longs which hold the pin numbers, THEN calling cognew() to load the Spin-modified code into the cog. This is truly Propeller-ology at its finest!
Once I got my head around this code I thought I saw an opportunity to speed up the low-level writeblock routine. The bit shift-out uses 3 instructions in a loop. I rewrote the code without the loop and modified the phase and period of the NCO-generated CLK signal so that the CLK's rising edge (the SD card's signal to sample the data on its DI pin) aligned with the assembly instructions, which were just repeated "shl phsb,#1" + "nop" pairs, effectively 2 instructions per bit output. It seemed to work (at least nothing crashed and burned!). Unfortunately the final write time for 2 Mbytes was hardly changed, at about 12.8 seconds. WHY? My rough first estimate (from the timing loop) was that an upper limit of 666 kbytes per second would be possible IN THEORY. I thought I would at least see some small improvement.
I don't have access to my logic analyzer over the weekend, but if I did, I'm fairly certain I would see long pauses during which a BUSY asserted by the SD card was applying the brakes for a few milliseconds during each 512-byte write attempt. It's back to the drawing board, with the best hope being a multiblock write operation. The actual write of 512 bytes with my modified code should take less than 0.8 ms ... I still have another 2.3 ms of additional unexplained delay (I'm only at 25% of theoretical max write speed). With multiblock I could just keep sending blocks and the SD card will (I think) just keep accepting them until an erase sector's worth of data has been received, and only then force a BUSY. That's just a guess. If anyone out there has real working multiblock source code from any source, it would be a welcome starting point!
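The numbers in the two paragraphs above hang together if you work them through. The sketch below assumes a 64 MHz system clock and 4 clocks per instruction (both quoted elsewhere in this thread), and reads "2 Mbytes" as 2 x 1024 x 1024 bytes. Those are assumptions on my part, and the helper names are made up.

```python
# Assumed figures: 64 MHz clock, 4 clocks per instruction, 2 Mbytes = 2 MiB.
CLKFREQ = 64_000_000
CLOCKS_PER_INSTR = 4
BLOCK_BYTES = 512


def shift_rate(instr_per_bit, clkfreq=CLKFREQ):
    """Raw shift-out rate in bytes per second for a given loop cost per bit."""
    return clkfreq / (instr_per_bit * CLOCKS_PER_INSTR) / 8


def block_ms(bytes_per_sec):
    """Time in milliseconds to move one 512-byte block at the given rate."""
    return BLOCK_BYTES / bytes_per_sec * 1000


loop_rate = shift_rate(3)                  # original 3-instruction loop
unrolled_rate = shift_rate(2)              # unrolled shl + nop version
observed_rate = 2 * 1024 * 1024 / 12.8     # 2 Mbytes written in 12.8 s

print(f"3-instr loop : {loop_rate / 1000:.0f} kbytes/s, {block_ms(loop_rate):.3f} ms/block")
print(f"2-instr code : {unrolled_rate / 1000:.0f} kbytes/s, {block_ms(unrolled_rate):.3f} ms/block")
print(f"observed     : {observed_rate / 1000:.0f} kbytes/s, {block_ms(observed_rate):.3f} ms/block")
print(f"observed / 3-instr limit: {observed_rate / loop_rate:.0%}")
```

Under these assumptions the 3-instruction loop gives roughly 667 kbytes/s (about 0.77 ms per block), the observed rate is about 164 kbytes/s (about 3.1 ms per block), and the ratio is about 25%. That leaves roughly 2.3 ms per block that the shift-out loop itself cannot explain.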
In fact it is all standard usage and not really amazing to the experienced programmer.
A small addition for the folks who want to do it themselves.
(1) Outputting a high pulse of length N ticks requires setting PHS to minus N and FRQ to 1. As this depends on CLKFREQ, it can become tricky when the crystal is not well matched to the required timing... But you can't do any better with the next best option, the WAITCNT instruction.
(2) Shifting things out from PHS has been discussed a few times by our video specialists; the issue is shifting things in. We still have nothing better than an unrolled INA loop...
will do a 10 Mhz shift-out - which is much slower than using the Video Logic.
Edit: Oops, the last line must read 10MHz of course
Edit2: Oh dear, it must be 31 of course, as the MSB is already output before the loop starts...
Post Edited (deSilva) : 11/10/2007 7:52:56 PM GMT
4 clks: shl phsb,#1 ' output bit5
4 clks: nop
At the uOLED-96-Prop's system clock frequency of 64 MHz, that's 125 ns per bit, or an 8 MHz SPI clock.
If running at 80 MHz, that's a 10 MHz SPI clock.
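For anyone who wants to try other clock settings, here is the same calculation as a small sketch. It assumes two 4-clock instructions (shl + nop) per bit, i.e. 8 system clocks per SPI bit. The function names are just for illustration.

```python
# Assumption: one SPI bit costs 2 instructions x 4 clocks = 8 system clocks.
CLOCKS_PER_BIT = 2 * 4


def bit_time_ns(clkfreq_hz, clocks_per_bit=CLOCKS_PER_BIT):
    """Duration of one SPI bit in nanoseconds."""
    return clocks_per_bit * 1_000_000_000 / clkfreq_hz


def spi_clock_mhz(clkfreq_hz, clocks_per_bit=CLOCKS_PER_BIT):
    """Resulting SPI clock in MHz."""
    return clkfreq_hz / clocks_per_bit / 1_000_000


for clkfreq in (64_000_000, 80_000_000):
    print(clkfreq, bit_time_ns(clkfreq), spi_clock_mhz(clkfreq))
```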
Can anyone explain why uncommenting the OR statement below would cause all $00 values to be written to the SD card instead of "always odd" values (with bits 7 through 1 preserved)? Note, it's not a timing issue, because substituting a NOP instruction has no effect ... all data is correctly written and read.
At first it seemed possible to drive the CLK pin at 16 MHz (2 clks high, 2 clks low), delete the NOP, and just have a string of SHL instructions, but how would you start the CLK and stop it from toggling beyond the last bit?
Below is the modified code, with an 8 MHz SPI clock. It did not improve write speed, as I believe the bottleneck lies with an extended BUSY state issued by the SD card, not the Propeller SPI clock speed. It did, however, work, so it will serve as a good example to study (and hopefully improve!)
I know nothing of the video registers. Can video & PHS be combined to shift out 32 data bits (WITH CLK controlled by NCO) ???
PS: OK, I just got it! Your code has no penalty for its DJNZ loop, as it serves the same purpose as my NOP, with much less code and a lot more elegance!
MODIFIED SDSPIQASM:
Great job guys.
On the writing to the SD card, there's a couple of things going on. One is the between-byte
overhead; that adds up. Another is the between-block overhead, but that's not too bad.
What really hurts is, while most writes go pretty quick (the SD card says 'done'), a small
percentage of the writes incur a significant latency (almost certainly due to the need to
erase another erase block, but I'm guessing at this.) Depending on how you do your timing
and how many bytes you are writing, you may miss this altogether, or see it as though it
were split among the blocks, or if you time each individual block for enough of them, you'll
see quick quick quick quick ... for a bunch and then *slow* and then quick quick quick ...
In my tests speeding up the actual bit-by-bit code doesn't make much of a difference for
writing.
For reading, on the other hand, improvement is still possible, but (I believe) it will take a bunch
more instructions. If you completely unroll 32-bits worth of input, two instructions per bit
(this is possible), you should get a reasonably good improvement. It's slightly tricky to do this
but not too bad. But as I said, you'll probably need a bunch more instructions.
Actually the best improvement on reading data to the display is probably just not using pread
but instead writing another subroutine or something that will (essentially) DMA data from a
file directly to a circular buffer (of probably at least 1K so you can keep a 512 byte block in
flight). This way the Spin code that does the FAT16 stuff happens in parallel with the SD
readblock stuff which happens in parallel with the video writing stuff, and no memory copy
is needed (pread does a memory copy that slows things down).
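To make the circular-buffer idea a bit more concrete, here is a rough model of it. It is only a sketch of the bookkeeping, not driver code: on the Propeller the producer would be the SD readblock cog and the consumer the display cog, and the class and method names here are hypothetical. The buffer is 1K so that one 512-byte block can be in flight while the other is being consumed.

```python
BLOCK = 512  # SD block size in bytes


class RingBuffer:
    """Single-producer, single-consumer ring holding whole 512-byte blocks."""

    def __init__(self, size=2 * BLOCK):
        if size % BLOCK != 0 or size < 2 * BLOCK:
            raise ValueError("size must be a multiple of 512 and at least 1K")
        self.buf = bytearray(size)
        self.size = size
        self.head = 0  # total bytes ever written (SD side)
        self.tail = 0  # total bytes ever consumed (display side)

    def available(self):
        return self.head - self.tail

    def free(self):
        return self.size - self.available()

    def put_block(self, block):
        """Store one block straight into the ring; False if there is no room."""
        if len(block) != BLOCK:
            raise ValueError("expected exactly one 512-byte block")
        if self.free() < BLOCK:
            return False
        pos = self.head % self.size  # always block-aligned, so no wrap mid-block
        self.buf[pos:pos + BLOCK] = block
        self.head += BLOCK
        return True

    def get(self, n):
        """Consume up to n bytes, handling wrap-around at the end of the ring."""
        n = min(n, self.available())
        pos = self.tail % self.size
        first = min(n, self.size - pos)
        out = bytes(self.buf[pos:pos + first]) + bytes(self.buf[:n - first])
        self.tail += n
        return out


ring = RingBuffer()
ring.put_block(bytes([1]) * BLOCK)
ring.put_block(bytes([2]) * BLOCK)
print(ring.put_block(bytes([3]) * BLOCK))  # False: both slots are in use
print(len(ring.get(600)))                  # 600 bytes, spanning both blocks
```

The point of the layout is that the block lands where the consumer will read it, so there is no extra memory copy of the kind pread does.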
Definitely fun stuff. I hope Parallax adds an SD or microSD card slot to some version of one
of their development boards; it really makes pretty amazing things possible, and it's just a
ton of fun. (They might want to *call* it an MMC slot or something, though, so they don't
have to pay royalties or whatnot; I'm not sure about the legalities.)
to two buffers while our modified serial cog continually monitored these same buffers for transmission as packets via a
115.2kbaud serial link to a PC. The two buffers are defined by SPIN code and a method associated with the data acquisition
cog is passed the address of these two buffers. Likewise, the serial cog has a method which accepts the two buffer addresses.
The serial cog runs an 'application' that repeatedly monitors the two buffers' 1st bytes for non-zero content. The first byte of
each buffer is used as the availability flag. Once the data cog has filled a buffer, only then does it write a non-zero entry
into that buffer's first byte which also happens to be the number of bytes in the buffer that needs to be transmitted by the
serial cog. The serial cog, upon detection of a non-zero 1st byte entry in the buffer, proceeds to transmit its contents via the
serial port. Upon completion it zeros the buffer's 1st byte, signalling the data cog that the buffer is now available to store
more data. The experiment may run for days at a time, continually streaming 10K bytes per second to a file in the PC.
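The first-byte handshake described above can be modelled in a few lines. This is a sketch only, with made-up function names and buffer size, and it runs both sides in one thread purely to show the ownership rule: the producer publishes the length last, and the consumer clears it last.

```python
BUF_SIZE = 64  # hypothetical buffer size; byte 0 is the flag/length


def try_fill(buf, payload):
    """Producer side (data cog): fill a free buffer, then publish its length."""
    if buf[0] != 0:
        return False  # consumer has not finished with this buffer yet
    if not 0 < len(payload) < len(buf) or len(payload) > 255:
        raise ValueError("payload must be non-empty and fit behind the flag byte")
    buf[1:1 + len(payload)] = payload
    buf[0] = len(payload)  # written last: non-zero means "ready to send"
    return True


def drain(buf, sink):
    """Consumer side (serial cog): send a ready buffer, then release it."""
    count = buf[0]
    if count == 0:
        return 0  # nothing to send
    sink.extend(buf[1:1 + count])
    buf[0] = 0  # written last: zero means "free for the producer"
    return count


buffers = [bytearray(BUF_SIZE), bytearray(BUF_SIZE)]
sent = bytearray()
try_fill(buffers[0], b"first")
try_fill(buffers[1], b"second")
for b in buffers:
    drain(b, sent)
print(sent)  # bytearray(b'firstsecond')
```

With two buffers the producer can be filling one while the consumer is still transmitting the other, which is what keeps the stream continuous.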
I have also double-buffered 11 kHz 8-bit audio data, reading from a USB flash drive (via USBwiz) to an audio pin with RC
filter attached. I plugged the speaker output to the mic input on my notebook PC, then played my "Wizard of Oz" DVD
using Audacity configured for RAW 11.025 kHz 8-bit signed audio recording. The resulting 60 Mbyte audio file was then
copied to the USB flash drive. I could select any start time (in tenths of seconds) and any playtime (in tenths of seconds).
The audio quality was excellent (at least for my old ears!). I even had an old wired video editor remote control that could
fast forward or rewind the audio. I would open the audio file with a USBwiz cog, then inform the sound cog of the address
of two audio buffers. The sound cog would then internally ask the USBwiz cog to fill the buffers with data. Instead of
waiting in a forever loop until the "command" value returns to zero, these methods would send the command to the cog
and immediately exit. The "calling" application (another cog) would then merely examine each buffer's first byte to see
when data actually arrived.
I think the SPIN portion of the pread() method may need to be integrated within the readblock code and placed within the cog
SDSPI code. From my experience with the audio project, it was nice to be able to command the sound cog to start playing
audio THEN go along my merry way and monitor for key presses on the remote control in SPIN. This frees the high-level
SPIN code from the microsecond-fast tasks.
I'm thinking I would like to open a file for reading, then inform my video or audio cog of the addresses of the double buffers
and let them INTERNALLY call a cog-based Pread() like function. These cog 'applications' would have to have access
to the hub addresses of all the variables that pread() in SPIN code currently manipulates so that everything stays current
and up to date. The same could be done for Pwrite().
Tonight I finally came across some excellent source code for performing MultiBlock reads and writes. I hope to give
them a test this coming week. My write rates (164 kbytes per second) are about half my read rates (316 kbytes per
second). I still, perhaps foolishly, believe that it should be possible to double my write rate. My write code can spit
out a 512-byte block in about 0.8 ms, yet 3.1 ms per block is the observed AVERAGE write time. The existing code
can, indeed, deliver the data in a timely manner; it's just that the flashing of memory internal to the SD card takes
time. While the internal host processor of the SD card is waiting for this flash process to complete, it could be busy
reading in more data for the next flash operation. As I understand the process, if I'm writing Multiblock, I instruct
the flash drive to PRE-ERASE 'x' number of 512-byte BLOCKS, then I proceed to send block after block up to the
number specified in the PRE-ERASE command. The SD card, if it has enough internal RAM, is happy to accept as
many blocks as it has RAM to store, hopefully waiting until an ERASE SECTOR-sized number of blocks have been received
before initiating the next flash operation. The SD card host processor can even skip its own normal internal erase
operation, which would otherwise accompany every BLOCK write, as this task has already been done with the PRE-ERASE
command. We will see, of course, if my crude understanding is correct! Hopefully progress can be made in the near future.
Another Amazing Thread :)
The 96-Prop has an 8 MHz crystal... using a 16x PLL would give a 128 MHz clkfreq (can we do this?) or a 16 MHz SPI clock (is this correct?)
This has come up before ... The Prop (any incarnation) will not work reliably at 128 MHz. Some chips may and some may not.
It depends mostly on temperature and power supply voltage.