Ok, that's mighty useful. I guess you could double buffer, or somehow just 'switch' the image on rather than have it render lines/blocks at a time, to give the illusion of speed.
Very nice. I wonder if there is scope for a motion-JPEG type of video decoder on the P2, if the decode rate can be boosted to ~24 Hz or so. Is it anywhere in that ballpark if you throw a few cogs at it?
Rayman, can you post your code. I'd be interested in looking at it. I've written a few JPEG decoders in the past. I looked at the picojpeg code on GitHub, and I'm wondering how you implemented the inverse DCT. picojpeg's IDCT multiplies 16-bit numbers, so the P2's hardware multiplier could be used. I don't know whether FlexGUI's C compiler uses the hardware multiplier or the CORDIC multiplier for short ints. The assembly code would show which multiplier is being used.
Note: This is at the "Hey, I just got this working!" stage and not the polished final stage. But, it may never get polished.
This code originally decoded the image into a giant array of 24-bit color pixels on the heap.
But, the P2 doesn't have enough RAM to do that and show a 16-bit image from HUB RAM at the same time.
So, I looked and figured out that it decodes in small chunks called MCUs (minimum coded units). My test image uses MCUs that consist of four 8x8 pixel blocks. That may be the only format that works at the moment.
What the code does now is copy each MCU to the display buffer after it is decoded, converting from 24bpp to 16bpp along the way.
This way, we only need heap storage for a single MCU.
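The per-MCU copy described above can be sketched in plain C. This is a hedged illustration, not the actual decoder's code: it assumes the decoded MCU sits in separate R/G/B plane buffers and that the frame buffer is 16 bpp RGB565. The 8:8:8-to-5:6:5 packing shown here is, as I understand it, what the P2's RGBSQZ instruction does in hardware.

```c
#include <stdint.h>

/* Hypothetical sketch: copy one decoded 16x16 MCU into a 16 bpp frame
   buffer, packing 8:8:8 RGB down to RGB565 on the way.  Plane pointers
   and MCU size are illustrative, not the decoder's real API. */
#define MCU_W 16
#define MCU_H 16

static inline uint16_t pack565(uint8_t r, uint8_t g, uint8_t b)
{
    /* keep top 5 bits of R, 6 of G, 5 of B (what RGBSQZ does on the P2) */
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

void copy_mcu(uint16_t *fb, int fb_stride,      /* stride in pixels   */
              int mcu_x, int mcu_y,             /* MCU origin, pixels */
              const uint8_t *rp, const uint8_t *gp, const uint8_t *bp)
{
    for (int y = 0; y < MCU_H; y++) {
        uint16_t *dst = fb + (mcu_y + y) * fb_stride + mcu_x;
        for (int x = 0; x < MCU_W; x++) {
            int i = y * MCU_W + x;
            dst[x] = pack565(rp[i], gp[i], bp[i]);
        }
    }
}
```

With this arrangement the heap only ever holds one MCU's worth of 24-bit pixels, as described above.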
I think this could also enable decent video, especially when used with eMMC, once optimized for speed...
If the writes to external memory can be accumulated to take advantage of something like scan-line write bursts, then HyperRAM should be usable for the frame buffer. There would then be no need to convert down to 16 bpp to save room, unless that is the colour mode in use.
Converting to 16bpp is fairly efficient thanks to the RGBSQZ assembly instruction. I think that allows for higher resolution than you can get with 24-bit color.
Also, it seems the red, blue, and green bytes are kept in separate buffers that you have to read from.
Still, 24bpp would be nice too. Maybe for QVGA resolution that would be the way to go...
Eric, I downloaded fseek.c, and got the program working. Thanks.
Rayman, the image displayed on the VGA monitor looks great. I'll play around with the program a bit to see if I can speed up the IDCT. I've had a bit of experience implementing the DCT in hardware and software.
What P2 rate and image size & resolution? Hopefully it can be sped up a lot more to be more responsive, like by 10-20x (!), otherwise it will be limited to just slideshow stuff. With HyperRAM frame buffers we could at least have multiple images and flip seamlessly.
Actually it's fairly easy to measure. I just measured and accumulated the elapsed cycles every time there was file i/o, calls to the IDCT, VLC decode and transfers to the VGA buffer. After the program is completely done I print out the results to the serial window.
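The per-stage timers described here can be sketched in C. Everything below is illustrative: getcycles() stands in for reading the P2's free-running cycle counter (something like _cnt() under flexspin; check the compiler's builtins), stubbed out here so the sketch runs anywhere. Unsigned subtraction keeps the deltas correct even across one counter wrap.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of per-stage profiling: snapshot a cycle counter around each
   stage and accumulate the elapsed cycles, then report at the end.
   fake_clock/getcycles() are stand-ins for the real hardware counter. */
static uint32_t fake_clock;
static uint32_t getcycles(void) { return fake_clock; }

enum { T_FILEIO, T_VLC, T_IDCT, T_DISPLAY, T_NUM };
static uint32_t accum[T_NUM];
static uint32_t t0;

static void tic(void)      { t0 = getcycles(); }
static void toc(int stage) { accum[stage] += getcycles() - t0; }  /* wrap-safe */

void report(uint32_t sysclk_hz)
{
    static const char *name[T_NUM] =
        { "File I/O", "VLC Decode", "IDCT", "Display" };
    for (int i = 0; i < T_NUM; i++)
        printf("%-10s %u msecs\n", name[i],
               (unsigned)(accum[i] / (sysclk_hz / 1000)));
}
```

Note that a 32-bit counter at 300 MHz wraps in about 14 seconds, so each individual tic/toc interval has to stay under that.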
I think the key to getting the JPEG decoder to run faster is to write the time-critical portions in assembly, and execute it from cog or lut memory. Hub accesses are probably taking a lot of time also, so data needs to be moved to cog memory to get the fastest speed.
The image is 640x300. The P2 is running at 300 MHz. The JPEG file is 62914 bytes, so I think we should be able to read that in less than 30 msecs. I think with assembly that uses the P2 features it should be possible to get at least a 10x improvement.
Ok, thanks for that, Dave. Yeah, making use of PASM2 will be the way to go.
It would be nice if the algorithm could be made to fit in a single COG and you just spawn it dynamically and it takes a source address in HUB and a frame buffer address and then does the whole decode. If you can make it fit in LUT+COG RAM instead of HUB-exec (or at least the time critical loops) that should speed it up further.
I've been working on speeding up the IDCT, and I've made some progress. The IDCT time that I posted in a previous post was 672 msecs. After converting almost all of it to assembly I got 471 msecs, which isn't much improvement. So then I converted all of the IDCT to assembly, and ran it in its own cog. That only improved it by 4 msecs to 467 msecs.
I then realized that I was including the time to copy the results to the MCU memory, which I thought shouldn't be that significant. However, when I implemented a separate timer for the copy operation I determined that this takes 381 msecs. So the initial IDCT time is actually 291 msecs, and the optimized IDCT takes 89 msecs. That's more than 3 times faster.
It was good to see that the hub exec version is only 4 msecs slower than the cog exec version at 89 msecs versus 85 msecs.
Interesting results. Do you see much further scope for IDCT improvements, other than parallelizing with more COGs? For example is the algorithm making use of block or fifo transfers into COG/LUT registers and working on that data instead of random accesses to HUB memory etc, and then later using block/fifo transfers of results back to HUB?
@rogloh, I counted the number of instructions that are executed, and multiplying by 2 cycles/instruction gives a best-case number of 72 msecs. The hub reads and writes take more than 2 cycles each, so 86 msecs for the cog version is pretty close to the best case. I thought about using the FIFO to read the entire 8x8 block into cog RAM, and then write the entire block out when done. It might be worth a try, but it's going to be hard to get down from 86 msecs to 72 msecs.
There may be other versions of the IDCT algorithm that work better on the P2. The Winograd algorithm used by picojpeg minimizes the number of multiplies. It may be better to use a method that has more multiplies, but fewer adds and subtracts.
Ok @"Dave Hein", that is an interesting result, and shows it is mainly compute bound, not memory bound. At 300 MHz we have ~150 MIPS at our disposal, and for that image size I guess you are doing about 640/8 * 300/8 IDCT blocks, or ~3040 if you round up the edges. Let's say ~3k in 86 ms. That is about one every 28.7 us, or 4300 P2 instructions per 8x8 block (67 instructions per pixel). Does that seem reasonable given the work needed per pixel? I know there are going to be other overheads that have to be amortised over this.
You also have to account for the chroma pixels, which are sub-sampled 2:1 in both directions relative to the luma pixels. So you have to multiply by 1.5, which gives a total of 4560 8x8 blocks. So that's one every 18.9 us, or 44 instructions per pixel. That does seem high. For each pixel you would expect approximately 2 reads, 2 writes, 6 add/subtracts, 3 moves, 1 multiply plus normalizing, 1 rounding and 1 clamping. Assigning weights of 3, 2, 1, 1, 4, 2 and 2 respectively to the various operations I get a total of 6 + 4 + 6 + 3 + 4 + 2 + 2 = 27 instructions/pixel. So there does seem to be room for improvement.
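The normalize/round/clamp tail of that per-pixel budget looks something like the following in C. This is only a sketch: the fixed-point SHIFT of 12 and the JPEG level shift of +128 are illustrative choices, not necessarily the constants the actual decoder uses, and the right shift of a negative accumulator assumes arithmetic shift, as JPEG decoders commonly do.

```c
#include <stdint.h>

/* Descale a fixed-point IDCT accumulator: round, undo the JPEG level
   shift (+128), and clamp to 0..255.  SHIFT = 12 is illustrative. */
#define SHIFT 12

static inline uint8_t descale_clamp(int32_t acc)
{
    /* add half an LSB for rounding, shift down, restore level shift */
    int32_t v = ((acc + (1 << (SHIFT - 1))) >> SHIFT) + 128;
    if (v < 0)   v = 0;
    if (v > 255) v = 255;
    return (uint8_t)v;
}
```

Counted as P2 instructions this is roughly one add, one shift, one add, and two compare/limit operations per pixel, which lines up with the "normalizing, rounding and clamping" entries in the estimate above.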
In my previous post I stated a best case time of 72 msecs, but I over-counted the chroma pixels. The best case time for this algorithm should have been about 54 msecs. So it looks like there is still some room for improvement, mostly in the hub accesses.
Yeah, knowing the instructions per pixel based on the operations expected to occur should give you a good idea of how close to best case it is.
If you can block-copy your pixel data from HUB into LUT RAM, I have found RDLUT X, PTRA++ is pretty fast (3 cycles) for accessing sequential data, faster than reading indirectly from COG RAM with ALTI etc. (4 cycles), unless the processing loop is already unrolled and hard-coded to fixed register positions in COG RAM. Best to take full advantage, if possible, of the fast HUB block transfers in and out of the COG. Another option is the FIFO itself, but you can't use that in HUB-exec mode.
WRLUT X, PTRA++ is very handy too before writing back from LUT to HUB with SETQ2. I put that to good use for my video driver.
If any table lookups can speed things up further that might be useful too, though a table approach may require additional memory accesses if it doesn't fit in the COG itself.
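One classic table trick of that sort, sketched here in C (illustrative, not from the actual code): replace the two-branch clamp with a single lookup into a biased 1 KB table. As noted above, whether this wins on the P2 depends on where the table lives; an extra HUB read could cost more than the branches it removes, whereas a table in COG/LUT RAM might not fit alongside the code.

```c
#include <stdint.h>

/* Branch-free clamp to 0..255 via a biased lookup table.
   Valid for inputs in the range -512 .. 511. */
#define CLAMP_BIAS 512
static uint8_t clamp_tab[1024];

void clamp_init(void)
{
    for (int i = 0; i < 1024; i++) {
        int v = i - CLAMP_BIAS;
        clamp_tab[i] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
    }
}

static inline uint8_t clamp8(int v) { return clamp_tab[v + CLAMP_BIAS]; }
```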
Comments
Here's the jpg file that was decoded.
It's probably not optimized yet, but what kind of time does it take to render the image?
Those are in the fastspin github. I also posted a zip file of a fastspin beta in the fastspin discussion thread a few days ago.
File I/O: 453 msecs
VLC Decode: 948 msecs
IDCT: 672 msecs
Display: 131 msecs
Total: 2204 msecs
It's hard to tell exactly from the serial window.
@Rayman, I'm having fun tinkering with it.