sorta-optimized JPEG decoder
Was bored tonight, so decided to see how fast I can get TJpgDec (by the same fella as FatFS) to run on the P2. Answer seems to be "not very".
Build main.c from the attached ZIP file with a (hopefully recent) flexspin, have a VGA monitor on pin 32, etc, etc.
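(For anyone who hasn't used the toolchain before: something along the lines of flexspin -2 -O1 main.c followed by loadp2 main.binary should do it, but check the exact flags against your own setup.)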
Timing for my assortment of four 320x240 images (the measurements are consistent across repeated runs, but can change a good bit with memory alignment):
Unoptimized (using tjpgd_orig.c, default -O1):
I did set up tjpgdcnf.h to reasonable values, of course.
reimu_jpg decompd in 532904393 cycles!
pigge_jpg decompd in 500664481 cycles!
edge_jpg decompd in 491380321 cycles!
youmu_jpg decompd in 497175873 cycles!
TOTAL test suite: 2022125068 cycles!!!!
Obvious blunder fixed (using tjpgd_orig_enumfix.c, default -O1):
A float constant was not constant-evaluated...
reimu_jpg decompd in 172734098 cycles!
pigge_jpg decompd in 139880914 cycles!
edge_jpg decompd in 130399770 cycles!
youmu_jpg decompd in 136295706 cycles!
TOTAL test suite: 579310488 cycles!!!!
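For the record, the kind of pattern that tripped it up looks something like this (a minimal sketch, not TJpgDec's exact code): a fixed-point constant spelled as a float expression, which, if it isn't folded at compile time, drags soft-float math into every use. Precomputing the integer sidesteps the whole question.

#include <stdint.h>

/* Sketch of the failure mode (not TJpgDec's actual code): a fixed-point
   constant written as a float expression. If the compiler doesn't fold
   it at compile time, every use pays for runtime float math plus a
   conversion. Spelling it as a plain integer avoids relying on folding. */
#define SCALE_FLOATY   ((int32_t)(1.41421356 * 4096 + 0.5))  /* hopes for constant folding */
#define SCALE_PREBAKED ((int32_t)5793)                        /* 1.41421356 * 4096, rounded */

int32_t descale_floaty(int32_t x)   { return (x * SCALE_FLOATY)   >> 12; }
int32_t descale_prebaked(int32_t x) { return (x * SCALE_PREBAKED) >> 12; }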
Various manual optimizations (default -O1):
About twice as fast, but still kinda underwhelming...
reimu_jpg decompd in 88671818 cycles!
pigge_jpg decompd in 59422170 cycles!
edge_jpg decompd in 51630130 cycles!
youmu_jpg decompd in 56050650 cycles!
TOTAL test suite: 255774768 cycles!!!!
At -O2
Lmao, -O2 being questionably useful as ever, but at least it works.
reimu_jpg decompd in 91051564 cycles!
pigge_jpg decompd in 61695788 cycles!
edge_jpg decompd in 53613036 cycles!
youmu_jpg decompd in 58332956 cycles!
TOTAL test suite: 264693344 cycles!!!!
Disabling the fast IDCT routine (at -O1 again)
This is probably what I spent the most time on, and it makes barely any impact. The codegen for the C version was pretty decent already...
reimu_jpg decompd in 100094530 cycles!
pigge_jpg decompd in 66338186 cycles!
edge_jpg decompd in 57495194 cycles!
youmu_jpg decompd in 63157322 cycles!
TOTAL test suite: 287085232 cycles!!!!
I wonder if I could make a sampling profiler with the timer IRQ that could tell me where most of that time is spent...
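The rough shape would be something like the snippet below: a periodic IRQ samples the interrupted program counter into a histogram of code-address buckets, and the biggest buckets afterwards point at the hot code. sample_pc() here is a hypothetical stand-in for however the handler would recover the saved PC on the P2; the bookkeeping around it is the easy part.

#include <stdint.h>

/* Sampling-profiler sketch: a timer IRQ fires at some fixed rate, reads
   the program counter it interrupted and bumps a histogram bucket.
   sample_pc() is hypothetical -- it stands in for reading the IRQ's
   saved return address. */
extern uint32_t sample_pc(void);        /* hypothetical helper */

#define PROF_BUCKET_SHIFT 9             /* 512-byte buckets */
#define PROF_NBUCKETS     1024          /* 1024 * 512 B = 512 KB of hub RAM */
static uint32_t prof_hist[PROF_NBUCKETS];

void prof_irq_handler(void) {
    uint32_t bucket = sample_pc() >> PROF_BUCKET_SHIFT;
    if (bucket < PROF_NBUCKETS) prof_hist[bucket]++;
}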
Comments
Is it faster than picojpeg?
https://forums.parallax.com/discussion/172174/jpeg-decoding-with-picojpeg-and-flexc-spin2/p1
That one seems to do its IDCT with 16-bit values, which I guess is interesting (TJpgDec uses 32-bit and thus needs QMUL, though my optimized version uses pipelined QMULs for speed), but as noted above, the IDCT is surprisingly not the slow part.
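For context, the difference between the two approaches boils down to something like this (illustrative only, neither library's exact code): the 32-bit form wants a widening multiply, hence QMUL on the P2, while a 16-bit IDCT gets away with a plain 16x16 multiply at the cost of headroom and precision.

#include <stdint.h>

/* Illustrative fixed-point multiplies, not either library's exact code. */
static inline int32_t imul_fix32(int32_t a, int32_t c) {
    return (int32_t)(((int64_t)a * c) >> 12);   /* 32-bit data, 4096-scaled constants (needs a widening multiply) */
}
static inline int16_t imul_fix16(int16_t a, int16_t c) {
    return (int16_t)(((int32_t)a * c) >> 8);    /* 16-bit data, 256-scaled constants */
}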
Not sure I have the energy to figure out how to build your code right now; could you hook up the same benchmark for comparison? (decode memory->memory, RGB565, ideally with one of the 320x240 test images I provided)
Meanwhile, I found there's still some funky instruction alignment that I'd missed in the IDCT; fixing it pushes the whole thing below 250M cycles (~1 second) to run through the four test images:
Had some more fun, but apparently I've peered too far into the void. It peered back. By which I mean I found two compiler bugs. One is reported; the other has to do with -Olocal-reuse (which I created), so I will have to fix it myself.
Current funny numbers:
Under 200 megacycles. Use -O1,inline-single,experimental for these numbers.
This is where I'll leave it, I think. Well, actually, I think I want to try hooking up PSRAM for some high-res testing, but that's that.
(Still use -O1,inline-single,experimental)
Okay, now with PSRAM (by default set up for a 96MB board) in 640x480. Not really optimized. I think @rogloh 's driver can do a rectangle blit somehow, but I was too lazy to figure out how. I am amused at the number of compressed images that fit into just the P2's hub RAM.
Use the code from this ZIP regardless (I have still included the RAM-less main.c), for the above ZIP has a clipping bug!
It has a command list that you can build up, and it will execute the whole thing at once. While I was making my video drivers, he optimized this so the driver doesn't do all the queue service stuff while executing the list. Much faster.
Also, here I've modified it to directly produce the P2-native xBGR 32-bit format. The other format modes still work. In fact, I fixed a clipping issue in monochrome mode, which was broken in the original library, and added an example for that (no PSRAM required).
Yeah, the memory driver has a graphics API to do this and it uses the request list format. Look for gfxWriteImage(...) if you want to copy a hub RAM graphics image to PSRAM. It will expand contiguous pixel data into scan lines of a given width, and it can also reverse the scan line order if you need that for an upside-down stored image. Some graphics formats store it reversed.
I vaguely recall that optimization. To find it would require lots of thread wading...
While executing the list, the driver used to process the request queue after every command from the list. I needed the list to execute as fast as possible, so you changed the code so it runs through the list without servicing the queue between commands. That made this effect possible:
Nice of you to assume that I'm not using my terrible hand-rolled spin wrapper.
Unrelatedly... This is the same 16bpp hub RAM only benchmark as before, except I went insane and started hacking on flexspin to improve the performance (and also more source-level micro-improvements).
Obnoxiously, there's some sort of alignment effect with hubexec code that has more impact on performance than my actual micro-optimizations (i.e. shaving off one instruction from the code actually makes the times worse).
Will post code when PRs land in flexspin master ~
Yes, that's when making use of Fcache to contain the whole loop will help. EDIT: And, further, also using the freed-up FIFO for data reads as well.
Oh, I already converted all the hot loops to inline ASM. The biggest bottleneck remaining is everything to do with the bitstream decoding, since it can't be inlined due to needing to call the function to replenish the buffers.
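The split looks roughly like this (a sketch of the usual bit-reader pattern, not TJpgDec's actual structures): a cheap inlinable fast path that only touches a local accumulator, plus an out-of-line refill that does the buffer work. It's that refill call in the hot path that keeps the loop from living entirely in FCACHE.

#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint8_t *src;     /* input buffer */
    size_t         len, pos;
    uint32_t       acc;     /* bit accumulator, MSB-aligned */
    int            nbits;   /* valid bits in acc */
} bitrdr_t;

/* Deliberately not inlined: pulls bytes into the accumulator.
   (Real JPEG also has to deal with 0xFF byte stuffing here.) */
void br_refill(bitrdr_t *br) {
    while (br->nbits <= 24 && br->pos < br->len) {
        br->acc   |= (uint32_t)br->src[br->pos++] << (24 - br->nbits);
        br->nbits += 8;
    }
}

/* Hot path: cheap while the accumulator has enough bits, but the
   occasional refill call is what blocks full inlining/FCACHE. */
static inline uint32_t br_get(bitrdr_t *br, int n) {
    if (br->nbits < n) br_refill(br);
    uint32_t v = br->acc >> (32 - n);
    br->acc  <<= n;
    br->nbits -= n;
    return v;
}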
mmm, not simple I see.
Okay, I think all the showstoppers have been eliminated (USE LATEST FLEXSPIN GIT!), so here's my for-now final version.
There's a number of different examples: main_rgb565.c is the one I've been using for the aforementioned benchmarks (though I cheated slightly since the last results and also optimized my out_func a bit).
(Still use -O1,experimental for these results.)
The PSRAM ones are pre-configured for a Rayslogic 96MB board, because that's what I mostly keep around for testing. I think if you delete all the overrides for the exmem_mini object, it should work on a P2EDGE (maybe do play with PSRAM_DELAY).
Nice overall speedup, from the original 2022 million P2 clock cycles you started with down to 176 million now (over 10x).
So do you think you're pretty close to the performance limits, or would you still expect some further (diminishing) gains to be possible? I do see a bunch of tight inline code in key functions already, so perhaps it's already reaching the limits... unless there are other areas still to improve. Right now it doesn't seem like you could do a JPEG decode at video refresh rates like 24Hz for 320x240 with this code base (if you wanted to use it as the basis for some sort of motion-JPEG style of video decoder, for example).
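(Rough numbers, assuming a ~300 MHz sysclock: 176M cycles over 4 images is about 44M cycles per 320x240 frame, i.e. roughly 150 ms or ~7 fps, whereas 24 Hz would allow only about 12.5M cycles per frame - so it would need another ~3.5x on top of what you have.)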
I tried to play along and run this but hit this bug with the latest flexspin I'd already downloaded/compiled. It doesn't seem to like your ##constants with ptrb indexes, although once I changed them to just 128 it seemed to work and let me see it running.
The PSRAM RGB888 demo is nice BTW! Just imagine it decoding at real-time video rates... it could certainly work for a JPG photo browser, for example - and you could always flip between offscreen buffers to hide the update artifacts.
Well, that doesn't count; the first 4x speedup was due to a bug where it didn't properly fold a constant.
Current bottleneck, I think, is the coefficient loop in mcu_load (which is basically a per-pixel loop). I think forcing the huffman extract function to be inline caused a mild improvement at the cost of bloat. The regular bit extract function is inlined.
For video, see the cinepak thread. The quality on that isn't that much worse than pure motion JPEG. IIRC that was going just fine doing 60 FPS at 320x240 with a single hub buffer. The bottleneck there is actually the PSRAM read/write for larger resolutions / double buffering. I've started work on an improved encoder but ran into weird issues that I couldn't debug due to the Windows GCC tools being shite, so I shelved it. The biggest improvement was actually just fixing the RGB->YUV down-conversion that happens before the actual encoder. Should probably publish the WIP code for that somewhere.
Any non-meme video codec would require motion vectors, which would need a lot of bandwidth to the buffer.
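For anyone curious, that coefficient loop in mcu_load has the classic baseline-JPEG shape; roughly like the sketch below (generic JPEG, not the library's actual code; huff_decode / get_bits / huff_extend are stand-ins for the real extract helpers). The point is that every coefficient has to come out of the huffman bitstream no matter what happens to it afterwards.

#include <stdint.h>

/* Stand-ins for the decoder's real bitstream/huffman helpers. */
extern int      huff_decode(void *br, const void *tbl);  /* next huffman symbol */
extern uint32_t get_bits(void *br, int n);                /* raw bits */
extern int32_t  huff_extend(uint32_t v, int size);        /* sign-extend per JPEG rules */
extern const uint8_t zigzag[64];

/* Generic baseline-JPEG block decode: every coefficient is pulled from
   the bitstream even if the IDCT is skipped later, which is why the
   1/8 scaling mode doesn't save as much as you'd hope. */
void decode_block(void *br, const void *dc_tbl, const void *ac_tbl,
                  int32_t *dc_pred, int32_t blk[64]) {
    for (int i = 0; i < 64; i++) blk[i] = 0;

    int s = huff_decode(br, dc_tbl);                 /* DC size category */
    if (s) *dc_pred += huff_extend(get_bits(br, s), s);
    blk[0] = *dc_pred;

    for (int k = 1; k < 64; ) {
        int rs   = huff_decode(br, ac_tbl);          /* packed (run, size) */
        int run  = rs >> 4, size = rs & 15;
        if (size == 0) {
            if (run != 15) break;                    /* EOB: rest of block is zero */
            k += 16;                                 /* ZRL: sixteen zeros */
        } else {
            k += run;
            if (k > 63) break;                       /* guard against corrupt streams */
            blk[zigzag[k]] = huff_extend(get_bits(br, size), size);
            k++;
        }
    }
}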
flexspin not new enough, git pull.
I just cloned the tree about an hour ago. Seems to be at the top already, so I'm not sure what gives.
roger@RLs-Mac-mini flexprop % git pull
Already up to date.
roger@RLs-Mac-mini flexprop % git status
On branch master
Your branch is up to date with 'origin/master'.
nothing to commit, working tree clean
roger@RLs-Mac-mini flexprop % git log
commit dfe2b01e23787344ad80be2a1de6691cd44f7f72 (HEAD -> master, origin/master, origin/HEAD)
Author: Eric Smith <ersmith@totalspectrum.ca>
Date: Fri May 5 10:59:35 2023 -0300
Updated spin2cpp
...
Wrong repo, you need spin2cpp. Its submodule in the flexprop repo is only occasionally updated.
roger@RLs-Mac-mini flexprop % git pull --recurse-submodules
Fetching submodule PropLoader
Fetching submodule loadp2
Fetching submodule spin2cpp
Fetching submodule spin2cpp/Test/spinsim
Already up to date.
EDIT: OOPS now I think I know what you mean...
No, you need to clone the spin2cpp repo separately (or cd into its submodule and pull from there, but that might cause explosions later down the line).
Just couldn't stop myself and was able to eke out a few more cycles by transforming the color conversion loop such that it doesn't need to do as many branches and can fit entirely in FCACHE (either the subsampled or the non-subsampled version):
Actually, it turns out that using -O1,experimental,aggressive-mem gives even faster times. Might just be alignment, but it reduces code size, which is always nice.
EDIT: yep, it's alignment; the only thing that changes is that some instructions get removed in __system___fmtchar.
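For reference, the work that colour conversion loop has to do per pixel is roughly the following (generic integer YCbCr -> RGB565, not my exact restructured/FCACHE'd version):

#include <stdint.h>

/* Generic fixed-point YCbCr -> RGB565 (JFIF coefficients scaled by 2^10).
   Illustrative only; the real loop is rearranged to cut branches so it
   fits in FCACHE. */
static inline uint16_t ycc_to_rgb565(int y, int cb, int cr) {
    cb -= 128; cr -= 128;
    int r = y + ((1436 * cr) >> 10);              /* 1.402 * Cr */
    int g = y - ((352 * cb + 731 * cr) >> 10);    /* 0.344 * Cb + 0.714 * Cr */
    int b = y + ((1815 * cb) >> 10);              /* 1.772 * Cb */
    if (r < 0) r = 0; else if (r > 255) r = 255;  /* clamp to 8 bits */
    if (g < 0) g = 0; else if (g > 255) g = 255;
    if (b < 0) b = 0; else if (b > 255) b = 255;
    return (uint16_t)(((r & 0xF8) << 8) | ((g & 0xFC) << 3) | (b >> 3));
}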
This could be very handy.
The picojpeg code had a mode where it output 1 pixel per block. Makes for a super fast way to show thumbnails…
This also has scaling by 1/2, 1/4 and 1/8. The 1/8 mode uses the same trick to skip doing the actual IDCT, but it's not much faster since, as mentioned, the bottleneck is decoding the actual coefficients, which still needs to be done to keep the bitstream in sync. Also it will still call your output function once per block, so 1/8 will call it for each pixel. The other downscale modes I haven't really optimized very much, but they're not much slower than 1/1.
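The 1/8 "trick" itself is tiny; it boils down to something like this (hedged sketch, not the library's exact code): the block average is just the dequantised DC term, so the whole IDCT collapses into a shift, an add and a clamp.

#include <stdint.h>

/* 1/8-scale output: each 8x8 block becomes one pixel whose value is the
   block average, i.e. the dequantised DC coefficient (IDCT skipped).
   Illustrative only. */
static inline uint8_t block_to_eighth_pixel(int32_t dc_coeff, int32_t qtbl0) {
    int32_t v = ((dc_coeff * qtbl0) >> 3) + 128;  /* DC carries 8x the mean; undo level shift */
    if (v < 0) v = 0; else if (v > 255) v = 255;
    return (uint8_t)v;
}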