sorta-optimized JPEG decoder
Was bored tonight, so decided to see how fast I can get TJpgDec (by the same fella as FatFS) to run on the P2. Answer seems to be "not very".
Build main.c from the attached ZIP file with a (hopefully recent) flexspin, have a VGA monitor on pin 32, etc, etc.
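(For anyone who hasn't used the toolchain before: something along the lines of flexspin -2 -O1 main.c followed by loadp2 main.binary should do it, but check the exact flags against your own setup.)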
Timing for my assortment of four 320x240 images (the measurements are consistent across repeated runs, but can change a good bit with memory alignment):
Unoptimized (using tjpgd_orig.c, default -O1):
I did set up tjpgdcnf.h to reasonable values, of course.
reimu_jpg decompd in 532904393 cycles!
pigge_jpg decompd in 500664481 cycles!
edge_jpg decompd in 491380321 cycles!
youmu_jpg decompd in 497175873 cycles!
TOTAL test suite: 2022125068 cycles!!!!
Obvious blunder fixed (using tjpgd_orig_enumfix.c, default -O1):
A float constant was not constant-evaluated...
reimu_jpg decompd in 172734098 cycles!
pigge_jpg decompd in 139880914 cycles!
edge_jpg decompd in 130399770 cycles!
youmu_jpg decompd in 136295706 cycles!
TOTAL test suite: 579310488 cycles!!!!
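For the record, the kind of pattern that tripped it up looks something like this (a minimal sketch, not TJpgDec's exact code): a fixed-point constant spelled as a float expression, which, if it isn't folded at compile time, drags soft-float math into every use. Precomputing the integer sidesteps the whole question.

#include <stdint.h>

/* Sketch of the failure mode (not TJpgDec's actual code): a fixed-point
   constant written as a float expression. If the compiler doesn't fold
   it at compile time, every use pays for runtime float math plus a
   conversion. Spelling it as a plain integer avoids relying on folding. */
#define SCALE_FLOATY   ((int32_t)(1.41421356 * 4096 + 0.5))  /* hopes for constant folding */
#define SCALE_PREBAKED ((int32_t)5793)                        /* 1.41421356 * 4096, rounded */

int32_t descale_floaty(int32_t x)   { return (x * SCALE_FLOATY)   >> 12; }
int32_t descale_prebaked(int32_t x) { return (x * SCALE_PREBAKED) >> 12; }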
Various manual optimizations (default -O1):
About twice as fast, but still kinda underwhelming...
reimu_jpg decompd in 88671818 cycles!
pigge_jpg decompd in 59422170 cycles!
edge_jpg decompd in 51630130 cycles!
youmu_jpg decompd in 56050650 cycles!
TOTAL test suite: 255774768 cycles!!!!
At -O2
Lmao, -O2 being questionably useful as ever, but at least it works.
reimu_jpg decompd in 91051564 cycles!
pigge_jpg decompd in 61695788 cycles!
edge_jpg decompd in 53613036 cycles!
youmu_jpg decompd in 58332956 cycles!
TOTAL test suite: 264693344 cycles!!!!
Disabling the fast IDCT routine (at -O1 again)
This is probably what I spent the most time on, and it makes barely any impact. The codegen for the C version was pretty decent already...
reimu_jpg decompd in 100094530 cycles!
pigge_jpg decompd in 66338186 cycles!
edge_jpg decompd in 57495194 cycles!
youmu_jpg decompd in 63157322 cycles!
TOTAL test suite: 287085232 cycles!!!!
I wonder if I could make a sampling profiler with the timer IRQ that could tell me where most of that time is spent...
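The rough shape would be something like the snippet below: a periodic IRQ samples the interrupted program counter into a histogram of code-address buckets, and the biggest buckets afterwards point at the hot code. sample_pc() here is a hypothetical stand-in for however the handler would recover the saved PC on the P2; the bookkeeping around it is the easy part.

#include <stdint.h>

/* Sampling-profiler sketch: a timer IRQ fires at some fixed rate, reads
   the program counter it interrupted and bumps a histogram bucket.
   sample_pc() is hypothetical -- it stands in for reading the IRQ's
   saved return address. */
extern uint32_t sample_pc(void);        /* hypothetical helper */

#define PROF_BUCKET_SHIFT 9             /* 512-byte buckets */
#define PROF_NBUCKETS     1024          /* 1024 * 512 B = 512 KB of hub RAM */
static uint32_t prof_hist[PROF_NBUCKETS];

void prof_irq_handler(void) {
    uint32_t bucket = sample_pc() >> PROF_BUCKET_SHIFT;
    if (bucket < PROF_NBUCKETS) prof_hist[bucket]++;
}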
Comments
Is it faster than picojpeg?
https://forums.parallax.com/discussion/172174/jpeg-decoding-with-picojpeg-and-flexc-spin2/p1
That one seems to do its IDCT with 16-bit values, which I guess is interesting (TJpgDec uses 32-bit and thus needs QMUL, though my optimized version uses pipelined QMULs for speed), but as noted above, the IDCT is surprisingly not the slow part.
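For context, the difference between the two approaches boils down to something like this (illustrative only, neither library's exact code): the 32-bit form wants a widening multiply, hence QMUL on the P2, while a 16-bit IDCT gets away with a plain 16x16 multiply at the cost of headroom and precision.

#include <stdint.h>

/* Illustrative fixed-point multiplies, not either library's exact code. */
static inline int32_t imul_fix32(int32_t a, int32_t c) {
    return (int32_t)(((int64_t)a * c) >> 12);   /* 32-bit data, 4096-scaled constants (needs a widening multiply) */
}
static inline int16_t imul_fix16(int16_t a, int16_t c) {
    return (int16_t)(((int32_t)a * c) >> 8);    /* 16-bit data, 256-scaled constants */
}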
Not sure I have the energy to figure out how to build your code right now; could you hook up the same benchmark for comparison? (decode memory->memory, RGB565, ideally with one of the 320x240 test images I provided)
Meanwhile, I found there's still some funky instruction alignment that I'd missed in the IDCT; fixing it pushes the whole thing below 250M cycles (~1 second) to run through the four test images:
Had some more fun, but apparently I've peered too far into the void. It peered back. By which I mean I found two compiler bugs. One is reported; the other has to do with -Olocal-reuse (which I created), so I will have to fix it myself.
Current funny numbers:
Under 200 megacycles. Use -O1,inline-single,experimental for these numbers.
This is where I'll leave it, I think. Well, actually, I think I want to try hooking up PSRAM for some high-res testing, but that's that.
(Still use -O1,inline-single,experimental)
Okay, now with PSRAM (by default set up for a 96MB board) in 640x480. Not really optimized. I think @rogloh 's driver can do a rectangle blit somehow, but I was too lazy to figure out how. I am amused at the number of compressed images that fit into just the P2's hub RAM.
Use the code from this ZIP regardless (I have still included the RAM-less main.c), for the above ZIP has a clipping bug!
It has a command list that you can build up, and it will execute the whole thing at once. While I was making my video drivers, he optimized this so the driver doesn't do all the queue service stuff while executing the list. Much faster.
Also, here I've modified it to directly produce the P2-native xBGR 32-bit format. The other format modes still work. In fact, I fixed a clipping issue in monochrome mode, which was broken in the original library, and added an example for that (no PSRAM required).
Yeah, the memory driver has a graphics API to do this and it uses the request list format. Look for gfxWriteImage(...) if you want to copy a hub RAM graphics image to PSRAM. It will expand contiguous pixel data into scan lines of a given width, and it can also reverse the scan line order if you need that for an upside-down stored image. Some graphics formats store it reversed.
I vaguely recall that optimization. To find it would require lots of thread wading...
While executing the list, the driver used to process the request queue after every command from the list. I needed the list to execute as fast as possible, so you changed the code so it runs through the list without servicing the queue between commands. That made this effect possible:
Nice of you to assume that I'm not using my terrible hand-rolled spin wrapper.
Unrelatedly... This is the same 16bpp hub RAM only benchmark as before, except I went insane and started hacking on flexspin to improve the performance (and also more source-level micro-improvements).
Obnoxiously, there's some sort of alignment effect with hubexec code that has more impact on performance than my actual micro-optimizations (i.e. shaving off one instruction from the code actually makes the times worse).
Will post code when PRs land in flexspin master ~
Yes, that's when making use of Fcache to contain the whole loop will help. EDIT: And, further, also using the freed-up FIFO for data reads as well.
Oh, I already converted all the hot loops to inline ASM. The biggest bottleneck remaining is everything to do with the bitstream decoding, since it can't be inlined due to needing to call the function to replenish the buffers.
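The split looks roughly like this (a sketch of the usual bit-reader pattern, not TJpgDec's actual structures): a cheap inlinable fast path that only touches a local accumulator, plus an out-of-line refill that does the buffer work. It's that refill call in the hot path that keeps the loop from living entirely in FCACHE.

#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint8_t *src;     /* input buffer */
    size_t         len, pos;
    uint32_t       acc;     /* bit accumulator, MSB-aligned */
    int            nbits;   /* valid bits in acc */
} bitrdr_t;

/* Deliberately not inlined: pulls bytes into the accumulator.
   (Real JPEG also has to deal with 0xFF byte stuffing here.) */
void br_refill(bitrdr_t *br) {
    while (br->nbits <= 24 && br->pos < br->len) {
        br->acc   |= (uint32_t)br->src[br->pos++] << (24 - br->nbits);
        br->nbits += 8;
    }
}

/* Hot path: cheap while the accumulator has enough bits, but the
   occasional refill call is what blocks full inlining/FCACHE. */
static inline uint32_t br_get(bitrdr_t *br, int n) {
    if (br->nbits < n) br_refill(br);
    uint32_t v = br->acc >> (32 - n);
    br->acc  <<= n;
    br->nbits -= n;
    return v;
}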
mmm, not simple I see.
Okay, I think all the showstoppers have been eliminated (USE LATEST FLEXSPIN GIT!), so here's my for-now final version.
There's a number of different examples: main_rgb565.c is the one I've been using for the aforementioned benchmarks (though I cheated slightly since the last results and also optimized my out_func a bit).
(Still use -O1,experimental for these results.)
The PSRAM ones are pre-configured for a Rayslogic 96MB board, because that's what I mostly keep around for testing. I think if you delete all the overrides for the exmem_mini object, it should work on a P2EDGE (maybe do play with PSRAM_DELAY).
Nice overall speedup, from the original 2022 million P2 clock cycles you started with down to 176 million now (over 10x).
So do you think you're pretty close to the performance limits, or would you still expect some further (diminishing) gains to be possible? I do see a bunch of tight inline code in key functions already, so perhaps it's already reaching the limits... unless there are other areas still to improve. Right now it doesn't seem like you could do a JPEG decode at video refresh rates like 24Hz for 320x240 with this code base (if you wanted to use it as the basis for some sort of motion-JPEG style of video decoder, for example).
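(Rough numbers, assuming a ~300 MHz sysclock: 176M cycles over 4 images is about 44M cycles per 320x240 frame, i.e. roughly 150 ms or ~7 fps, whereas 24 Hz would allow only about 12.5M cycles per frame - so it would need another ~3.5x on top of what you have.)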
I tried to play along and run this but hit this bug with the latest flexspin I'd already downloaded/compiled. It doesn't seem to like your ##constants with ptrb indexes, although once I changed them to just 128 it seemed to work and let me see it running.
The PSRAM RGB888 demo is nice BTW! Just imagine it decoding at real-time video rates... it could certainly work for a JPG photo browser, for example - and you could always flip between offscreen buffers to hide the update artifacts.
Well, that doesn't count; the first 4x speedup was due to a bug where it didn't properly fold a constant.
Current bottleneck, I think, is the coefficient loop in mcu_load (which is basically a per-pixel loop). I think forcing the huffman extract function to be inline caused a mild improvement at the cost of bloat. The regular bit extract function is inlined.
For video, see the cinepak thread. The quality on that isn't that much worse than pure motion JPEG. IIRC that was going just fine doing 60 FPS at 320x240 with a single hub buffer. The bottleneck there is actually the PSRAM read/write for larger resolutions / double buffering. I've started work on an improved encoder but ran into weird issues that I couldn't debug due to the Windows GCC tools being shite, so I shelved it. The biggest improvement was actually just fixing the RGB->YUV down-conversion that happens before the actual encoder. Should probably publish the WIP code for that somewhere.
Any non-meme video codec would require motion vectors, which would need a lot of bandwidth to the buffer.
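For anyone curious, that coefficient loop in mcu_load has the classic baseline-JPEG shape; roughly like the sketch below (generic JPEG, not the library's actual code; huff_decode / get_bits / huff_extend are stand-ins for the real extract helpers). The point is that every coefficient has to come out of the huffman bitstream no matter what happens to it afterwards.

#include <stdint.h>

/* Stand-ins for the decoder's real bitstream/huffman helpers. */
extern int      huff_decode(void *br, const void *tbl);  /* next huffman symbol */
extern uint32_t get_bits(void *br, int n);                /* raw bits */
extern int32_t  huff_extend(uint32_t v, int size);        /* sign-extend per JPEG rules */
extern const uint8_t zigzag[64];

/* Generic baseline-JPEG block decode: every coefficient is pulled from
   the bitstream even if the IDCT is skipped later, which is why the
   1/8 scaling mode doesn't save as much as you'd hope. */
void decode_block(void *br, const void *dc_tbl, const void *ac_tbl,
                  int32_t *dc_pred, int32_t blk[64]) {
    for (int i = 0; i < 64; i++) blk[i] = 0;

    int s = huff_decode(br, dc_tbl);                 /* DC size category */
    if (s) *dc_pred += huff_extend(get_bits(br, s), s);
    blk[0] = *dc_pred;

    for (int k = 1; k < 64; ) {
        int rs   = huff_decode(br, ac_tbl);          /* packed (run, size) */
        int run  = rs >> 4, size = rs & 15;
        if (size == 0) {
            if (run != 15) break;                    /* EOB: rest of block is zero */
            k += 16;                                 /* ZRL: sixteen zeros */
        } else {
            k += run;
            if (k > 63) break;                       /* guard against corrupt streams */
            blk[zigzag[k]] = huff_extend(get_bits(br, size), size);
            k++;
        }
    }
}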
flexspin not new enough, git pull.
I just cloned the tree about an hour ago. Seems to be at the top already, so I'm not sure what gives.
roger@RLs-Mac-mini flexprop % git pull
Already up to date.
roger@RLs-Mac-mini flexprop % git status
On branch master
Your branch is up to date with 'origin/master'.
nothing to commit, working tree clean
roger@RLs-Mac-mini flexprop % git log
commit dfe2b01e23787344ad80be2a1de6691cd44f7f72 (HEAD -> master, origin/master, origin/HEAD)
Author: Eric Smith <ersmith@totalspectrum.ca>
Date: Fri May 5 10:59:35 2023 -0300
Updated spin2cpp
...
Wrong repo, you need spin2cpp. Its submodule in the flexprop repo is only occasionally updated.
roger@RLs-Mac-mini flexprop % git pull --recurse-submodules
Fetching submodule PropLoader
Fetching submodule loadp2
Fetching submodule spin2cpp
Fetching submodule spin2cpp/Test/spinsim
Already up to date.
EDIT: OOPS now I think I know what you mean...
No, you need to clone the spin2cpp repo separately (or cd into its submodule and pull from there, but that might cause explosions later down the line).
Just couldn't stop myself and was able to eke out a few more cycles by transforming the color conversion loop such that it doesn't need to do as many branches and can fit entirely in FCACHE (either the subsampled or the non-subsampled version):
Actually, it turns out that using -O1,experimental,aggressive-mem gives even faster times. Might just be alignment, but it reduces code size, which is always nice.
EDIT: yep, it's alignment; the only thing that changes is that some instructions get removed in __system___fmtchar.
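For reference, the work that colour conversion loop has to do per pixel is roughly the following (generic integer YCbCr -> RGB565, not my exact restructured/FCACHE'd version):

#include <stdint.h>

/* Generic fixed-point YCbCr -> RGB565 (JFIF coefficients scaled by 2^10).
   Illustrative only; the real loop is rearranged to cut branches so it
   fits in FCACHE. */
static inline uint16_t ycc_to_rgb565(int y, int cb, int cr) {
    cb -= 128; cr -= 128;
    int r = y + ((1436 * cr) >> 10);              /* 1.402 * Cr */
    int g = y - ((352 * cb + 731 * cr) >> 10);    /* 0.344 * Cb + 0.714 * Cr */
    int b = y + ((1815 * cb) >> 10);              /* 1.772 * Cb */
    if (r < 0) r = 0; else if (r > 255) r = 255;  /* clamp to 8 bits */
    if (g < 0) g = 0; else if (g > 255) g = 255;
    if (b < 0) b = 0; else if (b > 255) b = 255;
    return (uint16_t)(((r & 0xF8) << 8) | ((g & 0xFC) << 3) | (b >> 3));
}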
This could be very handy.
The picojpeg code had a mode where it output 1 pixel per block. Makes for a super fast way to show thumbnails…
This also has scaling by 1/2, 1/4 and 1/8. The 1/8 mode uses the same trick to skip doing the actual IDCT, but it's not much faster since, as mentioned, the bottleneck is decoding the actual coefficients, which still needs to be done to keep the bitstream in sync. Also it will still call your output function once per block, so 1/8 will call it for each pixel. The other downscale modes I haven't really optimized very much, but they're not much slower than 1/1.
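The 1/8 "trick" itself is tiny; it boils down to something like this (hedged sketch, not the library's exact code): the block average is just the dequantised DC term, so the whole IDCT collapses into a shift, an add and a clamp.

#include <stdint.h>

/* 1/8-scale output: each 8x8 block becomes one pixel whose value is the
   block average, i.e. the dequantised DC coefficient (IDCT skipped).
   Illustrative only. */
static inline uint8_t block_to_eighth_pixel(int32_t dc_coeff, int32_t qtbl0) {
    int32_t v = ((dc_coeff * qtbl0) >> 3) + 128;  /* DC carries 8x the mean; undo level shift */
    if (v < 0) v = 0; else if (v > 255) v = 255;
    return (uint8_t)v;
}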