Cinepak video player proof-of-concept
IIRC @Rayman did some video playback before, but he only managed to do, like, uncompressed 360x180 with 8bpp indexed color.
Here I am with a POC of 640x480 not-really-truecolor video (without audio currently) utilizing Cinepak compression. Requires PSRAM to buffer those large frames though.
Cinepak, for the unfamiliar, is basically the simplest scheme that could be reasonably described as a "video codec", developed in the 90s to play grainy video from CD-ROM on commodity computers without video acceleration. It uses vector quantization, which is similar to indexed color, except that you index whole blocks of pixels (in this case 2x2 or 4x4). It also supports basic inter-frame compression (omission of macroblocks that don't change between frames). It is kinda grainy, but at 640x480 it is mostly okay, especially when using NTSC output (any bright red objects may disagree...).
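The vector-quantization idea is simple enough to sketch in a few lines. Here's a toy Python sketch (grayscale, made-up data, and nothing like the real Cinepak bitstream, which also has strips, V1/V4 block types and skip codes) just showing the core trick of replacing each 2x2 block of pixels with a codebook index:

```python
# Toy vector quantization over 2x2 grayscale blocks (hypothetical data;
# real Cinepak adds strips, V1/V4 block types, skip codes and YUV).

def split_blocks(img, w, h):
    """Split a flat w*h grayscale image into 2x2 blocks (4-tuples)."""
    return [(img[y*w+x],     img[y*w+x+1],
             img[(y+1)*w+x], img[(y+1)*w+x+1])
            for y in range(0, h, 2) for x in range(0, w, 2)]

def nearest(block, codebook):
    """Index of the codebook entry with the smallest squared error."""
    return min(range(len(codebook)),
               key=lambda i: sum((a-b)**2 for a, b in zip(block, codebook[i])))

def encode(img, w, h, codebook):
    """One index per 2x2 block instead of 4 pixel values."""
    return [nearest(b, codebook) for b in split_blocks(img, w, h)]

def decode(indices, w, h, codebook):
    """Paste the indexed codebook blocks back into a flat image."""
    img = [0] * (w * h)
    for n, idx in enumerate(indices):
        x, y = (n % (w//2)) * 2, (n // (w//2)) * 2
        a, b, c, d = codebook[idx]
        img[y*w+x],     img[y*w+x+1]     = a, b
        img[(y+1)*w+x], img[(y+1)*w+x+1] = c, d
    return img
```

With a 256-entry codebook that's one index byte per four pixels of image data, before the codebook itself is counted.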
Code is massively messy, but here's what you need to know if you want to mess around:
- Need to compile with reasonably recent flexspin
- There are three main files (all using the cinepak.spin2 library):
  - cinetest.spin2 - Play video from SD card, up to 640x480, using a PSRAM frame buffer
  - cinetest_nosd.spin2 - Decompress a still frame, up to 640x480, using a PSRAM frame buffer
  - cinetest_nopsram.spin2 - Play video from SD card, up to 320x240, no PSRAM involved. No space for double-buffering at 24bpp...
By default everything is set up for a P2EDGE 32MB with a VGA board on pin 32, but that's all configurable. Note that lower-bandwidth memory is really not fast enough for truecolor 640x480 operation, but stepping down a gear to 16bpp makes it okay again for 8-bit PSRAM (and presumably HyperRAM, too, though I haven't actually tested HyperRAM support at all). Also note that the 8bpp mode is monochrome only.
I've put some example files up here: https://mega.nz/file/zTYQUTpb#lsh8e5rGcbK0ZIfjb7LlwBDpJX-gkF5zolMnvBw3qzs
The code download also has some still frames to use with the nosd version.
To play your own videos, you need some raw Cinepak frames concatenated together. You can create such a file with FFmpeg like this:
ffmpeg -i some_video.mp4 -vf scale=640:480:flags=lanczos -c:v cinepak -max_strips 4 -map 0:v -f data some_video.cvd
Obviously change the resolution to whatever is needed (e.g. 640x360 for widescreen files). You can also use an image as the input to get a single frame of output (that's what those .cinep files are).
Note the max_strips value: it corresponds to the same value in cinepak.spin2. If this option is omitted, it defaults to 3. More strips -> higher quality, higher bitrate, but also more memory needed to buffer codebooks (well, ffmpeg-encoded files never re-use them anyway...)
You can also add -q 1 to force maximum quality (though it's still a bit iffy).
Also, if your video is monochrome (or you want to make it...), be sure to change the filter chain to
Technically cinepak also works for palette video, but I haven't messed around with that (seems rather pointless when you can have "true"color).
Remaining to do:
- make less-than-fullscreen video not be stuck to the top-left corner
- add audio and frame rate control
  - what audio codec and container?
- improve NTSC output
- figure out why @rogloh 's video driver sometimes switches top and bottom fields around in interlace modes (notice that Source/AltSource are reversed in the nosd version to get correct field order; I have no idea why this happens)
- figure out why there's flagging on some monitors
- set a proper square-pixel mode (12.27 MHz dot clock)
- write a better Cinepak encoder (ffmpeg's is slow and bad)
  - I've actually started doing that, more on that later
This is great work @Wuerfel_21 ! The P2 can certainly play back some reasonable-quality video with your code. I was able to get it working, playing back the bunny movie from microSD with some 16-bit-wide PSRAM fitted to the P2-EVAL. Apart from a quick test I did way back when developing my video driver, I think you are probably the first person to actually use its double-buffering capabilities (or at least the first to mention it in the forums). Also, in theory, with two independent banks of PSRAM (different data pins) you could even double the read/write bandwidth, because my video driver can be set up to read video from one memory bank while you write to the other, then switch over automatically to the other bank per field/frame (or at least it is meant to be able to do that - never tested it yet, you'd be the first there too). That feature may end up being helpful in dual 4- or 8-bit PSRAM setups for higher performance.
Not sure about the interlace issue you mentioned, but maybe there is a bug you can identify... or something else? Playback mostly seems smooth, though I did see the scrolling credits at the end are a little jerky - is there a little frame rate variation in the decoding there? I do also see some grain/shimmer as you mention, but that may all be part of it. I'll need to read through your code in depth to see what you did.
Certainly you'll want to add some sound soon, as that would make it really compelling.
EDIT: okay, now I see some of the other videos running a bit too fast; re-reading your post, I learned there is no frame rate control in there yet, which explains it. Also very much liked the '80s Howard Jones inclusion, that takes me back to fun times.
If you want to see real speed, try the psram-less version. Buffer copy is a major bottleneck, so reduced resolution + faster copy (even though not asynchronous) makes it go weeeeeeee.
The issue is that which of Source/AltSource ends up being the top/bottom field seems to be inconsistent based on ???. In the no_sd version, AltSource is the top field, but in the regular version, AltSource is the bottom field. If you change it in either one, the picture is wrong. The documentation file also doesn't specify which is supposed to be which???
It's been a while since I messed with that, since I kinda got caught up in the whole "write better encoder" thing. Turns out generating a good codebook (= the "palette" of blocks) is an NP-hard problem (as is regular color quantization, which is why most programs suck at converting images to 256 colors (or, god forbid, 16)).
I think I cobbled together an algorithm that is somewhat nicer than FFMPEG's, though it's mildly slower overall (but I plan to implement multi-threading to offset that). The main tangible improvement, though, is in the RGB -> YUV conversion. I use proper rounding (so encoded colors match input colors +/-1 instead of being kinda off) and gamma-correct the Y values to maintain the original pixel's brightness with the subsampled UV vector (this reduces blockiness and edge artifacts on bright red areas).
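The rounding point is easy to demonstrate in isolation. Here's a hedged Python sketch using generic full-range BT.601-style coefficients (an assumption - the actual encoder's matrix and its gamma-corrected Y handling aren't reproduced here): with round-to-nearest, an RGB -> YUV -> RGB round trip stays within +/-1 per channel.

```python
# Sketch of the rounding point only: round-to-nearest keeps an
# RGB -> YUV -> RGB round trip within +/-1 per channel.
# Coefficients are generic full-range BT.601 (an assumption); the
# gamma-corrected Y trick from the post is not reproduced here.

def rgb_to_yuv(r, g, b):
    y = round( 0.299*r + 0.587*g + 0.114*b)
    u = round(-0.169*r - 0.331*g + 0.500*b)
    v = round( 0.500*r - 0.419*g - 0.081*b)
    return y, u, v

def yuv_to_rgb(y, u, v):
    r = round(y + 1.402*v)
    g = round(y - 0.344*u - 0.714*v)
    b = round(y + 1.772*u)
    return r, g, b
```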
(Also @VonSzarvas plz fix APNG uploading, had to externally host this.)
The actual algorithm is a combination of fast-search PNN (with the funny k-d trees) and classic LBG (I found ELBG's shifting step to be slow (not sure why) and not actually that great at anything but reducing numerical error. It's still in there to deal with dead codewords, but not much else). I'm also using a weighted distance metric (UV get double weight, cause that's cheap to implement) to reduce banding and specks of inappropriate hue.
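For the unfamiliar, the LBG part is essentially a k-means loop over block vectors. A toy Python sketch of one such refinement with the weighted metric described above (vector components standing in for 4 Y samples plus U and V, chroma weighted double; illustration only, not the actual cinepunk code - no PNN seeding, no dead-codeword rescue):

```python
# Toy LBG-style refinement (essentially k-means) with a weighted
# distance metric: the chroma components count double, as described.
# Vectors here are (Y0, Y1, Y2, Y3, U, V).

WEIGHTS = (1, 1, 1, 1, 2, 2)  # 4 luma samples + U + V, chroma x2

def dist(a, b, weights=WEIGHTS):
    return sum(w * (x - y)**2 for w, x, y in zip(weights, a, b))

def lbg_refine(vectors, codebook, iters=10):
    for _ in range(iters):
        # Assignment step: bin each input vector to its nearest codeword
        groups = [[] for _ in codebook]
        for v in vectors:
            best = min(range(len(codebook)),
                       key=lambda i: dist(v, codebook[i]))
            groups[best].append(v)
        # Update step: move each codeword to the centroid of its bin
        # (empty bins = dead codewords; left alone in this toy version)
        codebook = [tuple(sum(c) / len(g) for c in zip(*g)) if g else cw
                    for g, cw in zip(groups, codebook)]
    return codebook
```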
Nice work @Wuerfel_21 !
Seems I was able to do 480p (widescreen, uncompressed) using an eMMC chip:
Did figure out color cell compression at one point (but not real-time decompression):
But this looks way better, especially if you can use ffmpeg to make the files...
But wait, any 480p 16bpp frame wouldn't fit in memory? Or is it 640x360?
Cinepak maxes out at 2bpp for actual image data (excluding codebooks), but yeah, it seems like a way superior scheme (better quality and probably faster to decompress) than CCC. The codebooks add up to 3K per strip used, though. OTOH, image data can be reduced a lot by changing the V1/V4 coding threshold while still maintaining decent quality. It bottoms out at 0.5 bits per pixel, but at that point you should just reduce the resolution.
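Those numbers fall straight out of the standard Cinepak layout, as some quick Python arithmetic shows:

```python
# Back-of-envelope numbers behind the figures above, assuming standard
# Cinepak layout: a 4x4 macroblock is either "V4" (four 2x2 codebook
# indices, 4 bytes) or "V1" (one index for the whole block, 1 byte);
# codebooks hold 256 entries of 6 bytes each (4 Y samples + 1 U + 1 V),
# and each strip carries both a V1 and a V4 book.

MACROBLOCK_PIXELS = 4 * 4

def bits_per_pixel(bytes_per_macroblock):
    return bytes_per_macroblock * 8 / MACROBLOCK_PIXELS

v4_bpp = bits_per_pixel(4)             # all-V4 worst case -> 2.0 bpp
v1_bpp = bits_per_pixel(1)             # all-V1 floor      -> 0.5 bpp
codebook_bytes_per_strip = 2 * 256 * 6 # -> 3072 bytes, the "3K per strip"
```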
Yea, though as mentioned the encoding isn't so hot.
Speaking of, here's another funny APNG illustrating the aforementioned YUV conversion issue. "Fast YUV" is similar to what FFmpeg does, minus rounding errors:
(Note that this is before actual quantization is applied. Also this is an extreme case and at low resolution.)
Yeah I think this is why having dual independent PSRAM banks might be useful - though obviously more pins are required (36 IO pins). While the video driver is reading a frame from one bank you have the entire PSRAM bandwidth to yourself on the other bank for writing. This bandwidth increase could be very large if there is no reading going on and could speed up the copy to PSRAM with large transfers. I wonder then if 720p50 or 25 could even be achievable...? Is there any more headroom on the decoding side for increasing the resolution? Or could you use more COGs if required? What is the COG use currently?
Ok, I'll see if I can have a quick look soon.
There's a profiling option that will tell you some per-frame metrics. Do keep in mind that IO and FS metrics are masked by VQ due to async transfers. I think 800x480 would certainly work at 300+ MHz (the current 640x480 uses the default VGA 252 MHz setting). Might try to see what's possible at 1024x768 (native resolution of the VGA monitor on the """bench"""). Probably need to reduce buffers to 16bpp (which is... fine-ish). Also, the coding efficiency kinda goes down with resolution, since the number of "interior" blocks (i.e. ones where all the Y values are approximately equal) grows quadratically relative to the number of "edge" blocks, favoring the former in the VQ algorithm.
In my WIP encoder, I already counteract that by increasing weights of interior blocks for building the V1 codebook (since V1 won't get picked for edge blocks, anyways). I think I may need to correspondingly de-weight the V4 blocks at the same locations (so the V4 book doesn't get filled with as many pointless codes), but that would require an additional copy of data. Which I guess is fine considering how slow the actual algorithm is. Which would also allow me to reset the weights after the skip/V1/V4 decisions are made, which is probably a good idea.
Currently cog use is roughly like this:
The decode cog is running Spin2 (and as currently written, is rebooted for every frame), so it could also do other things. You can also change it to run on Cog 0 by fiddling with the options at the top of cinetest, but that makes it slower.
I think the PSRAM access schedule isn't quite ideal yet. I might look into using request lists to make one buffer swap (i.e. uploading a 640x4 line and downloading the second-next one over the same buffer) into a single op, which might improve bandwidth utilization. But that would make the interface less nice.
Using multiple cogs would be tricky due to the variable-rate encoding (i.e. you can't easily split the data). What could be done is having separate cogs for odd/even scanlines. They'd both need to decode the entire bitstream, but would only process half of each block. You could also split strip processing, but then you'd need to constrain the encoder to never use fewer than N strips and never re-use codewords between different strips, and also figure out how to split the read bandwidth among the strips. I.e. NO.
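The odd/even idea amounts to both workers walking the same block stream but only writing their own scanlines. A toy Python sketch of just the partitioning (using already-decoded 2x2 blocks as a stand-in for the bitstream):

```python
# Toy sketch of the odd/even scanline split: each worker walks the
# same stream of (already decoded) 2x2 blocks, but writes only its
# own rows. Not real Cinepak decoding, just the work partitioning.

def decode_half(blocks, w, h, parity):
    """Fill only rows where row % 2 == parity; None elsewhere."""
    img = [None] * (w * h)
    for n, (a, b, c, d) in enumerate(blocks):
        x, y = (n % (w//2)) * 2, (n // (w//2)) * 2
        if y % 2 == parity:        # top row of this 2x2 block
            img[y*w+x], img[y*w+x+1] = a, b
        if (y + 1) % 2 == parity:  # bottom row of this 2x2 block
            img[(y+1)*w+x], img[(y+1)*w+x+1] = c, d
    return img
```

Since 2x2 blocks always start on an even row, each worker writes exactly one row of every block, and the two halves interleave back into the full frame.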
Could also figure out if the funky FIFO wrapping trick the vector decoder uses is even worth it or could be made faster. That's a minor headache, though it doesn't scale with resolution.
@Wuerfel_21 yeah, signaled 480p but widescreen with top and bottom black
Another trick might be to see if there's some signaling at 24 Hz that a TV would accept…
Buffer might not need to be full screen in that case?
You might also try to fiddle with the per-COG burst sizes so that your writes line up well with the burst size and you reduce the total number of requests issued, while also trying to maximize the use of remaining bandwidth per scan line. It's a bit of a balance that depends on the video line frequency, the scan line size and the P2 clock frequency, and there are probably some sweet spots.
Seems like that has potential if it increases performance sufficiently and if you have an application needing higher resolution. Even seeing just SDTV resolution working is nice enough IMO.
From purely a video and PSRAM perspective I know a 1024x768 true-colour frame buffer is achievable on a P2 at 325MHz, but whether sufficient write bandwidth remains for a video decoder to use is the main question. A dual PSRAM setup would probably allow it, and single PSRAM not. Going down to 16bpp halves the bandwidth over true-colour at the expense of video quality. Also you can update new frames just at 30Hz or 25Hz instead of 60/50Hz so only need half the write bandwidth again. Presumably the decoder is already using that trick.
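As a sanity check on those bandwidth points, here's the back-of-envelope arithmetic in Python (1 MB = 10^6 bytes; this counts only the decoder's writes into the frame buffer, ignoring the driver's read side, and assumes truecolor is stored as 32 bits per pixel in external memory - an assumption that matches the "halves the bandwidth" comparison; 24-bit packing would change the numbers):

```python
# Rough write-bandwidth arithmetic for the points above
# (1 MB = 10**6 bytes; decoder writes to the frame buffer only;
# truecolor assumed stored as 4 bytes/pixel in exmem).

def write_mb_s(w, h, bytes_per_px, updates_per_s):
    return w * h * bytes_per_px * updates_per_s / 1e6

full_60  = write_mb_s(1024, 768, 4, 60)  # truecolor updated every refresh
full_30  = write_mb_s(1024, 768, 4, 30)  # new frames only at 30 Hz
rgb16_30 = write_mb_s(1024, 768, 2, 30)  # 16bpp at 30 fps
```

That works out to roughly 188.7, 94.4 and 47.2 MB/s respectively - each step (halving the update rate, then halving the pixel size) cutting the required write bandwidth in half, as described.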
1024x768 definitely works for the still frames. Now to actually encode some videos in that resolution (will take approximately forever)
1024x768 @ 16bpp seems to be basically fine for 24fps video (or letterboxed 1024x576 at 30fps, especially if you cut down the exmem video region to increase bandwidth). There are weird spikes (esp. in the "other" profiling column); should probably investigate that.
Converting videos is an overnight job at this resolution, though. Should probably also set a higher strip limit for such high resolutions, but then I'd have to convert again.
Thriller 1024x768: https://mega.nz/file/PLI0UKIQ#RTnjP7PA8_N9DRif11gJlFUgWgiIKn-kJ5UZEaZ3Yjk
Big Buck Bunny 1024x576: https://mega.nz/file/WShSAJxQ#ovhzVzBuuqruEmrlDspXldlPt3RkE1QmlOykAugXB3E
Don't forget to change to 16bpp mode in cinepak.spin2
One other low-processing video format maybe worth looking at is Smacker or Bink from radgametools. It used to be used for a lot of video game cut-scenes, and I remember them claiming HD video playback on an 80486.
Yeah, but those don't have proper format documentation (Cinepak doesn't either, but as previously exposited, it's dead simple) or free encoders.
Either way, I've looked into audio. MP2 is probably doable but would require serious brain hurt (the P2 really sucks at signed fixed point beyond 16-bit precision).
I kinda like the Heptafon codec I developed myself, but that's kinda underkill, being designed for the P1 (no multiply, no output buffering, etc.) and thus not quite transparent. It's also not available in any standard tool.
All the usual ADPCM codecs are memes. Like, entirely obsoleted by the existence of the aforementioned Heptafon (maybe my encoder is just better rather than the format, but I've tried encoding Microsoft ADPCM with Audacity and ffmpeg and both suck). All of the ones that do outperform it require multiplies and have higher bitrate. I could probably make a Heptafon derivative that does use such things, which would in turn destroy those.
Seems I had the libmad library working for MP3 on the P1, but not in realtime:
Maybe it would work on the P2?
Still, I'd just add a second uSD and do uncompressed .wav audio...
Got it going on my P2-EVAL setup with the PSRAM module using the P32-P47 pins. The higher resolution looks very nice and crisp. If I get a chance sometime I'll try to fit another PSRAM module on P0-P23 and put the VGA board on P24-P31, and see if I can figure out some way to hack your code to flip frames alternately from each PSRAM, assuming it can be done with the way you have it coded (you'll probably know if that is even possible). My second PSRAM board does have two PSRAM loads per data pin though (it's a 64MB module instead of 32MB); hopefully it can still run okay at 325MHz. I should try that one out separately first, I guess... If that works, it'd allow 24bpp colour at this higher resolution.
Ok, forgot to mention, the encoder project got shelved because I ran into issues I couldn't easily debug due to tooling shortcomings (couldn't get ASAN to work, couldn't get debugger to work on 64 bit, etc, etc). WIP code is here, doesn't quite work: https://github.com/Wuerfel21/cinepunk
Maybe this is the moment to try Rust or something.