JPEG decoding with picojpeg and FlexC+Spin2
This took some work, but I figured out how to decode and show a 640x300 .jpg image using picojpeg.
The 16 bpp VGA driver is in Spin2, but the main code is in Fastspin's version of C.
Here's a screenshot.
Comments
Here's the jpg file that was decoded.
It's probably not optimized yet, but how long does it take to render the image?
Note: This is at the "Hey, I just got this working!" stage and not the polished final stage. But, it may never get polished.
This code originally decoded the image into a giant array of 24-bit color pixels on the heap.
But, the P2 doesn't have enough RAM to do that and show a 16-bit image from HUB RAM at the same time.
So, I looked and figured out that it decodes in small chunks it calls MCUs. My test image uses MCUs that consist of four 8x8 pixel blocks. That may be the only format that works at the moment.
What the code does now is copy each MCU to the display buffer after it is decoded, converting from 24bpp to 16bpp along the way.
This way, we only need heap storage for a single MCU.
I think this could also enable decent video, especially when used with eMMC, once optimized for speed...
Also, it seems the red, blue, and green bytes are kept in separate buffers that you have to read from.
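The per-MCU copy described above could look roughly like the sketch below. It is not the code from the project. It assumes a 16x16 MCU whose red, green, and blue planes are each stored as four consecutive 8x8 blocks (top-left, top-right, bottom-left, bottom-right), and it assumes the 16 bpp display buffer uses 5:6:5 RGB. The names `rgb565` and `blit_mcu_16x16` are made up for illustration, and there is no clipping for partial MCUs at the image edges.

```c
#include <stdint.h>

/* Pack 8-bit R, G, B into one 16-bit pixel (assumed 5:6:5 layout). */
static uint16_t rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Hypothetical helper: copy one decoded 16x16 MCU into the frame buffer.
 * r, g, b each point to 256 bytes holding four 8x8 blocks back to back
 * (an assumption about the decoder's MCU buffer layout).
 * mcu_x and mcu_y are the MCU's column and row, counted in MCUs. */
static void blit_mcu_16x16(uint16_t *fb, int fb_width, int mcu_x, int mcu_y,
                           const uint8_t *r, const uint8_t *g, const uint8_t *b)
{
    for (int by = 0; by < 16; by += 8) {
        for (int bx = 0; bx < 16; bx += 8) {
            /* Start of this 8x8 block in the planes: 0, 64, 128 or 192. */
            int src = bx * 8 + by * 16;
            for (int y = 0; y < 8; y++) {
                uint16_t *dst = fb + (mcu_y * 16 + by + y) * fb_width
                                   + mcu_x * 16 + bx;
                for (int x = 0; x < 8; x++, src++)
                    dst[x] = rgb565(r[src], g[src], b[src]);
            }
        }
    }
}
```

With this approach the only decoded pixel storage needed at any time is the three 256-byte planes of the current MCU.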
Still, 24bpp would be nice too. Maybe for QVGA resolution that would be the way to go...
Those are in the fastspin github. I also posted a zip file of a fastspin beta in the fastspin discussion thread a few days ago.
Rayman, the image displayed on the VGA monitor looks great. I'll play around with the program a bit to see if I can speed up the IDCT. I've had a bit of experience implementing the DCT in hardware and software.
File I/O: 453 msecs
VLC Decode: 948 msecs
IDCT: 672 msecs
Display: 131 msecs
Total: 2204 msecs
It’s hard to tell exactly from the serial window.
What P2 clock rate, image size, and resolution? Hopefully it can be sped up a lot more to be more responsive, like by 10-20x (!), otherwise it will be limited to just slideshow stuff. With HyperRAM frame buffers we could at least have multiple images and flip seamlessly.
I think the key to getting the JPEG decoder to run faster is to write the time-critical portions in assembly, and execute it from cog or lut memory. Hub accesses are probably taking a lot of time also, so data needs to be moved to cog memory to get the fastest speed.
It would be nice if the algorithm could be made to fit in a single COG and you just spawn it dynamically and it takes a source address in HUB and a frame buffer address and then does the whole decode. If you can make it fit in LUT+COG RAM instead of HUB-exec (or at least the time critical loops) that should speed it up further.
I then realized that I was including the time to copy the results to the MCU memory, which I thought shouldn't be that significant. However, when I implemented a separate timer for the copy operation I determined that this takes 381 msecs. So the initial IDCT time is actually 291 msecs, and the optimized IDCT takes 89 msecs. That's more than 3 times faster.
It was good to see that the hub exec version is only 4 msecs slower than the cog exec version at 89 msecs versus 85 msecs.
There may be other versions of the IDCT algorithm that work better on the P2. The Winograd algorithm used by picojpeg minimizes the number of multiplies. It may be better to use a method that has more multiplies, but fewer adds and subtracts.
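When trying out alternative IDCT variants, it can help to have a slow but obviously correct reference to compare results against. The sketch below is a plain textbook separable 8x8 inverse DCT in floating point. It is not the Winograd routine from picojpeg and not the faster method speculated about above, since it uses far more multiplies and adds than either. The function names are made up for illustration. It uses the orthonormal scaling, so a decoder that folds scale factors into its quantization table may need its output rescaled before comparing.

```c
/* cos(j*pi/16) for j = 0..8; the rest follow from symmetry. */
static const float quarter_cos[9] = {
    1.0f, 0.98078528f, 0.92387953f, 0.83146961f, 0.70710678f,
    0.55557023f, 0.38268343f, 0.19509032f, 0.0f
};

/* cos(m*pi/16) for any m >= 0. */
static float cos_pi16(int m)
{
    m &= 31;                     /* the period is 32 steps of pi/16 */
    if (m > 16) m = 32 - m;      /* cos(2*pi - t) = cos(t) */
    /* cos(pi - t) = -cos(t) covers the second quarter */
    return (m <= 8) ? quarter_cos[m] : -quarter_cos[16 - m];
}

/* Naive 1-D 8-point inverse DCT (orthonormal scaling). */
static void idct8(const float in[8], float out[8])
{
    for (int n = 0; n < 8; n++) {
        float sum = in[0] * 0.35355339f;          /* sqrt(1/8) for the DC term */
        for (int k = 1; k < 8; k++)
            sum += 0.5f * in[k] * cos_pi16((2 * n + 1) * k);
        out[n] = sum;
    }
}

/* 2-D inverse DCT of one 8x8 block, in place: rows first, then columns. */
static void idct8x8(float block[64])
{
    float tmp[8], res[8];
    for (int row = 0; row < 8; row++) {
        idct8(&block[row * 8], res);
        for (int i = 0; i < 8; i++) block[row * 8 + i] = res[i];
    }
    for (int col = 0; col < 8; col++) {
        for (int i = 0; i < 8; i++) tmp[i] = block[i * 8 + col];
        idct8(tmp, res);
        for (int i = 0; i < 8; i++) block[i * 8 + col] = res[i];
    }
}
```

A quick sanity check is a block with only the DC coefficient set: every output sample should come out as that coefficient divided by 8.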
@Rayman, I'm having fun tinkering with it.
In my previous post I stated a best case time of 72 msecs, but I over-counted the chroma pixels. The best case time for this algorithm should have been about 54 msecs. So it looks like there is still some room for improvement, mostly in the hub accesses.
If you can block copy your pixel data from HUB into LUTRAM, I have found that RDLUT X, PTRA++ is a pretty fast way (3 cycles) to access sequential data. It is faster than reading indirectly from COGRAM with ALTI etc. (4 cycles), unless the processing loop is already unrolled and hard-coded to fixed register positions in COGRAM. It's best to take full advantage of the fast HUB block transfers in and out of the COG where possible. Another option is the FIFO itself, but you can't use that in HUB-exec mode.
WRLUT X, PTRA++ is very handy too before writing back from LUT to HUB with SETQ2. I put that to good use for my video driver.
If any table lookups can speed things up further that might be useful too, though a table approach may require additional memory accesses if it doesn't fit in the COG itself.