Video Player (now with eMMC and 480p) — Parallax Forums


Rayman Posts: 13,954
edited 2020-06-02 18:43 in Propeller 2
I previously posted about a video player from uSD using FSRW: https://forums.parallax.com/discussion/171570

Just hacked that to use the new FSRW for eMMC. eMMC has an 8-bit data bus, so it's much faster.

Can now do 480p widescreen video at 30 fps and 16 bpp. This looks way, way better than the QVGA 8-bit indexed bitmap version.

Also demonstrates using a single buffer for video. There's no choice, as each image is 342 kB. It can actually go at 60 fps (did that by accident).
Video has to be widescreen though, because a full-height frame won't fit in RAM. (Or maybe there's a way around that?)
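
To see why a full frame can't fit, here's a quick check in Python (a sketch; 512 KB is the P2's hub RAM, and 854×480 is an assumed full widescreen-480p frame, not a figure from the post):

```python
HUB_RAM = 512 * 1024                      # P2 hub RAM in bytes

def frame_bytes(width, height, bpp=16):
    """Size of one uncompressed frame in bytes."""
    return width * height * bpp // 8

full_frame = frame_bytes(854, 480)        # hypothetical full 480p widescreen frame
print(full_frame, full_frame > HUB_RAM)   # 819840 True: too big for hub RAM
```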

File size is ginormous. I could only put half the movie into one file because 5 minutes of this hits the 4 GB file size limit of FAT32.
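
Back-of-envelope on the FAT32 limit (a sketch, ignoring audio, using the 342 kB per frame and 30 fps figures above):

```python
frame = 342 * 1024               # bytes per frame, from the post
rate = frame * 30                # bytes per second at 30 fps
fat32_limit = 4 * 1024**3        # FAT32 maximum file size (4 GiB)
minutes = fat32_limit / rate / 60
print(rate, round(minutes, 1))   # ~10.5 MB/s raw video, ~6.8 minutes per file
```

With audio muxed in, that drops toward the roughly 5 minutes mentioned above.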

Anyway, here's a video of it working. Somehow I managed to get the audio out of sync. I'm not sure how that is even possible though... Have to look into that sometime...

Comments

  • Great work! Now we just need some type of light compression to reduce the file size: 4:2:2 YUV/RGB, etc. It would be good to see how many COGs that might take.
  • Rayman Posts: 13,954
    Some kind of compression would be very helpful...
  • Circuitsoft Posts: 1,166
    edited 2020-06-03 01:46
    4:1:1 sub-sampling would be half the data rate. IIRC, there is some kind of YUV<->RGB color conversion available, and it should be pretty easy to do the 1:4 interpolation to decompress it. Y411 is probably the simplest defined format to decode.
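
    For scale, the bits-per-pixel arithmetic (a sketch; the "half" figure reads as relative to 24-bit RGB):

```python
y411_bpp = 6 * 8 / 4   # 4 Y + 1 U + 1 V = 6 bytes per 4 pixels = 12 bpp
print(y411_bpp, y411_bpp / 24, y411_bpp / 16)   # 12.0: half of 24 bpp, 3/4 of 16 bpp
```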
  • rogloh Posts: 5,184
    edited 2020-06-03 02:25
    Yeah Circuitsoft, for this we'd need to transform this Y411 data into YUV0 format (or perhaps somehow use RGBSQZ to convert back into 16-bit colour at the end):

    Input reads 6 bytes (4 Y, U, V) per four pixels presumably in this sequence:

    V Y0 Y1 U Y2 Y3

    Outputs 32 bits x 4 pixels something like this:

    Y0:U:V:0 Y1:U:V:0 Y2:U:V:0 Y3:U:V:0

    Here's a PASM2 snippet that may do this work: 28 clocks for 4 pixels including a REP, plus 4 write clocks with a SETQ2 burst later, making 8 clocks per pixel. This is ~21 µs per 640-pixel-wide scan line at 250 MHz, so one COG could do this decompression.
    rflong  pixel           ' reads U:Y1:Y0:V 
    rfword  y2_3            ' reads 0:0:Y3:Y2
    movbyts pixel, #%%1302  ' rearrange to Y0:U:V:Y1
    getbyte y1, pixel, #0   ' extract Y1 before we lose it
    setbyte pixel, #0, #0   ' clear low byte: pixel is now Y0:U:V:0
    wrlut   pixel, ptra++   ' save pixel1
    setbyte pixel, y1, #3   ' replace Y0 with Y1
    wrlut   pixel, ptra++   ' save pixel2
    setbyte pixel, y2_3, #3
    wrlut   pixel, ptra++   ' save pixel3
    shr     y2_3, #8
    setbyte pixel, y2_3, #3
    wrlut   pixel, ptra++   ' save pixel4
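
    The snippet above can be modelled in Python (a sketch; bytes follow the V Y0 Y1 U Y2 Y3 order given earlier, and each output long is Y:U:V:0 with Y in the top byte):

```python
def unpack_y411(group):
    """Expand one 6-byte Y411 group (V, Y0, Y1, U, Y2, Y3) into four Y:U:V:0 longs."""
    v, y0, y1, u, y2, y3 = group
    # each output pixel: byte3 = Y, byte2 = U, byte1 = V, byte0 = 0
    return [(y << 24) | (u << 16) | (v << 8) for y in (y0, y1, y2, y3)]

pixels = unpack_y411([0x10, 0x20, 0x21, 0x30, 0x22, 0x23])
print([hex(p) for p in pixels])  # ['0x20301000', '0x21301000', '0x22301000', '0x23301000']
```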
    
  • How hard would it be to make the U and V linearly interpolated from one set of 4 pixels to the next?
  • rogloh Posts: 5,184
    edited 2020-06-03 06:28
    Not sure how hard. You'd need to average two U, V pairs and use that to compute the next result. I guess you'd retain the last U, V values, find the difference with the new values, divide by 4, then add those increments to 3 of the U, V values in the 4-pixel group. It could be quite a lot of extra instructions for the visual improvement you might get.
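
    That increment scheme, modelled in Python (a sketch; pixel k of each 4-pixel group gets the previous U, V plus k increments of (next - prev)/4):

```python
def interp_uv(prev_uv, next_uv):
    """Linearly interpolate U,V across one 4-pixel group."""
    pu, pv = prev_uv
    nu, nv = next_uv
    du, dv = (nu - pu) // 4, (nv - pv) // 4   # per-pixel increments
    return [(pu + k * du, pv + k * dv) for k in range(4)]

print(interp_uv((100, 200), (140, 160)))  # [(100, 200), (110, 190), (120, 180), (130, 170)]
```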
  • Rayman Posts: 13,954
    There is a mixpix instruction... Maybe that could help with interpolation?
  • Rayman wrote: »
    There is a mixpix instruction... Maybe that could help with interpolation?

    Yes, but some byte shuffling is needed to get 4 U/V values to interpolate into a long.
    But then it's easy to interpolate any ratio you want.
  • Neat that MIXPIX can be used, though these blends take 7 clocks each and you'd need 3 of them per four pixels processed, so I'm guessing you might need at least 2 COGs for this vs. one for the non-interpolated version, for processing VGA-width video at line rate with a 250 MHz P2 clock and everything else involved. Even doing it in 2 COGs might be a challenge. Who can write the sample loop code to do it all in less than 24 clocks per pixel, to fit in just two COGs? One more clock is needed for the SETQ2 + WRLONG burst.

    Maybe a simpler single interpolation could be done to fit within two COGs (or even one?): just 2 replicated pixels and one interpolated pixel every four pixels, instead of 3 replicated pixels, or the more intensive 3 interpolated pixels. That might still give a reasonable effect.
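
    A quick check of that 24-clocks-per-pixel budget (a sketch; assumes 640 active pixels, a 250 MHz clock, standard 480p60 timing with 525 total lines, and two COGs taking alternate scan lines):

```python
PIXELS = 640
CLOCK = 250e6
LINE_TIME = 1 / (60 * 525)            # ~31.7 us per scan line at 480p60

def time_per_line(clocks_per_pixel, cogs):
    """Decode time per scan line, split across cogs."""
    return PIXELS * clocks_per_pixel / CLOCK / cogs

print(time_per_line(24, 2) < LINE_TIME)   # True: 24 clk/px just fits in two COGs
print(time_per_line(8, 1) < LINE_TIME)    # True: 8 clk/px fits in one COG
```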
  • Wuerfel_21 Posts: 4,541
    edited 2020-06-05 10:59
    You don't need 3 MIXPIX for 4 processed pixels. You need 3 BLNPIX for 8 processed pixels, since two sets of UV can be processed in parallel. So you can process two scanlines in parallel, you just need to interleave them in the compressed data.

    Rough pseudo-asm
    ' decompress into 32bit UVYx
    ' assume FIFO is pointed at interleaved compressed data
    ' assume PTRA and PTRB are pointed to two scanline buffers
    ' should use REP, but IDK how that works RN
    rflong rightuv ' load first UV pair
    mov iter,#VIDEO_WIDTH/4
    :loop
    mov leftuv,rightuv
    rflong rightuv ' even line's UVs are in bottom word
    rflong eveny
    rflong oddy
    
    ' even-line pixel 0
    rolword evenpx0,leftuv,#0
    rolbyte evenpx0,eveny,#0
    rolbyte evenpx0,#0,#0
    
    ' odd-line pixel 0
    rolword oddpx0,leftuv,#1
    rolbyte oddpx0,oddy,#0
    rolbyte oddpx0,#0,#0
    
    ' UVs for column 1
    mov tempuv,leftuv
    setpiv #64
    blnpix tempuv,rightuv
    
    ' even-line pixel 1
    rolword evenpx1,tempuv,#0
    rolbyte evenpx1,eveny,#1
    rolbyte evenpx1,#0,#0
    
    ' odd-line pixel 1
    rolword oddpx1,tempuv,#1
    rolbyte oddpx1,oddy,#1
    rolbyte oddpx1,#0,#0
    
    ' UVs for column 2
    mov tempuv,leftuv
    setpiv #128
    blnpix tempuv,rightuv
    
    ' even-line pixel 2
    rolword evenpx2,tempuv,#0
    rolbyte evenpx2,eveny,#2
    rolbyte evenpx2,#0,#0
    
    ' odd-line pixel 2
    rolword oddpx2,tempuv,#1
    rolbyte oddpx2,oddy,#2
    rolbyte oddpx2,#0,#0
    
    ' UVs for column 3
    mov tempuv,leftuv
    setpiv #192
    blnpix tempuv,rightuv
    
    ' even-line pixel 3
    rolword evenpx3,tempuv,#0
    rolbyte evenpx3,eveny,#3
    rolbyte evenpx3,#0,#0
    
    ' odd-line pixel 3
    rolword oddpx3,tempuv,#1
    rolbyte oddpx3,oddy,#3
    rolbyte oddpx3,#0,#0
    
    ' writeout
    setq #3
    wrlong evenpx0,ptra++
    setq #3
    wrlong oddpx0,ptrb++
    
    djnz iter,#loop
    

    Read compressed data: 4*2 = 8 cycles
    Assemble pixels: 8*3*2 = 48 cycles
    UV interpolation: (4+7)*3 = 33 cycles
    writeout (assuming worst-case waitstates): 2*(10+3+2) = 30 cycles
    Total: 8 + 48 + 33 + 30 = 119 cycles for 8 pixels = 14.875 cycles per pixel

    Or am I missing something important?
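
    The tally above, recomputed (a sketch; 2-cycle instructions, a 7-cycle BLNPIX, and the worst-case write figures from the post):

```python
read = 4 * 2                      # four RFLONGs
assemble = 8 * 3 * 2              # 3 instructions per pixel, 8 pixels
interp = (2 + 2 + 7) * 3          # MOV + SETPIV + BLNPIX, three times
writeout = 2 * (10 + 3 + 2)       # two SETQ bursts, worst-case waitstates
total = read + assemble + interp + writeout
print(total, total / 8)           # 119 cycles, 14.875 cycles per pixel
```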
  • Very interesting approach, Wuerfel_21.

    If this blend method works, ~15 cycles per pixel will require 2 COGs for VGA resolution and line rates at 250 MHz with 60 fps source data. But replaying at 30 Hz it could perhaps be done in one COG; that would need a suitable frame buffer in external memory if the colour depth gets increased to 24 bpp and the frames are output (twice) at 60 Hz. HyperRAM may suit this if we can get the entire frame written in time, which requires a 640*360*4 * 30, or ~27 MB/s, write rate. That should be doable even with sysclk/2 writes and a 252 MHz P2.
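
    That write-rate claim, checked (a sketch; 640×360 at 32 bits per pixel and 30 new frames per second, against an assumed byte-per-transfer sysclk/2 HyperRAM write rate at 252 MHz):

```python
needed = 640 * 360 * 4 * 30      # bytes/s to stream 32 bpp frames at 30 fps
hyperram = 252e6 / 2             # ~126 MB/s: one byte per sysclk/2 transfer (assumed)
print(needed, needed < hyperram) # 27648000 (~27.6 MB/s) True: plenty of headroom
```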