P2 FPD-Link (LVDS) Displays
n_ermosh
Posts: 294
in Propeller 2
Thought I'd finally get something cool going on the P2. I need a user interface (display + input) for my P1-based CO2 laser cutter, and figured I'd try to build it with the P2. Originally I was planning to use an HDMI or VGA display, but thought I would instead see if I could drive the LVDS signals directly and get rid of extra parts.
TL;DR: it works, for the most part. Details below.
The big caveat is that it's unlikely to work on any given display. I had a couple lying around from various projects, but the one I found to work is a Newhaven Display 1024x600 10.1" panel. The big thing that makes this work is that its minimum pixel clock is low (20MHz, although the panel driver datasheet states ~40MHz), compared to the other displays I have, which are around 60-80MHz. The fun thing about TFT displays is that they will retain their pixel states for a bit as each pixel's capacitor discharges, so theoretically much slower refresh rates (and therefore lower pixel clock rates) should draw just fine. The main limitation is the PLL that divides the pixel clock for every bit in the pixel (I assume). Many displays will work at lower than spec'd pixel clocks, allowing us to run them slower from a microcontroller.
Sending the FPD-Link data stream from the P2 is actually super simple with the new streamer. The FPD-Link bus is 4 bits wide (1 clock and 3 data). Each clock period divides into 7 bit times and describes one pixel. So a 4-byte word can be set up to shift out a single pixel to the bus, and while it's being shifted, the next pixel is computed. The 4-byte word is structured as follows:
- the 0th, 4th, 8th, etc. bits are the value of the clock
- the 1st, 5th, 9th, etc. bits are the value of Rx1 (contains the red bits and 1 green bit)
- the 2nd, 6th, 10th, etc. bits are the value of Rx2 (contains the remaining green bits and 2 blue bits)
- the 3rd, 7th, 11th, etc. bits are the value of Rx3 (contains the remaining blue bits and the sync bits)
Because each lane's pixel is 7 bits long (28 bits across the 4 lanes), the top 4 bits of the word are unused and don't get streamed.
The streamer will shift out a single bit of the bus each cycle. So the pixel clock takes 7 streamer cycles, and will be (P2 system clock rate × streamer clock divider)/7. In my case, I managed to get good results setting the divider to 0x16000000 (1/5) and the system clock to 360MHz, resulting in a pixel clock rate of just over 10MHz, which is fast enough for this display, with some experimentation to figure out the blanking (the datasheet values didn't quite work for me).
A lot of these displays can also ignore the H sync and V sync bits and just use the Data Enable (DE) bit. I set the driver up to allow for a display that requires those sync bits, but mine doesn't, so I left that code commented out.
From there, an LVDS transmitter (used to convert the TTL output of the P2 to LVDS) is hooked up and connected to the display. I used the DS90LV047ATMTCX. (Aside: I had to transplant a connector from a different display to be able to use the IPEX connectors a lot of displays have instead of an FPC cable, and also re-terminate the IPEX connector cable assembly I had to match the pinout of the display.) At a 70 MHz bit rate, the headers on the eval board are just good enough, but signal integrity would be better with a properly designed PCB. Keeping the wiring short helped.
Thanks to the 512KB of RAM, we can store a framebuffer for the full display size with 1-bit pixels (76.8KB for my display). From there, each pixel cycle reads out the next group of 8 pixels (a byte) using the new sequential FIFO for fast hub reads. A 1-bit pixel buffer allows for 2 colors, but the colors themselves can be defined with 18 bits. A larger framebuffer will allow for more colors: with 512KB of RAM, this display (in theory) could be driven with 16 different colors.
Here are a couple photos, and the code for this is attached. (Please excuse my use of function pointers in structs; FlexC doesn't support C++. I'm working on getting riscvp2 set up but am having issues, and will post about that separately if I can't figure it out.) The code is very much a minimum viable product to demonstrate the concept and is not a complete and configurable module. If anyone has ideas for how to speed it up to enable more displays to work with this, please share them.
Next steps:
- 2-4 bit pixels for more colors.
- get the Propeller font loaded for larger and prettier characters. Maybe even try other TTF fonts rendered at specific sizes.
- get riscvp2 working to turn this into a portable C++ class.
- another fun idea could be a VERY primitive OpenGL implementation. Since we can store a full frame buffer, a cog or two could be used to do 4D matrix math and 3D visualizations of shapes. OpenGL might be overkill, but some generic drawing would be possible; even at a 10Hz refresh rate it might be doable. I haven't looked too deeply at the new math/CORDIC functionality in the P2, so I'm not sure how well it can be done.
Credits:
Some of this work (specifically driving displays at low pixel clock rates) was inspired by https://sites.google.com/site/geekattempts/home-1/drive-an-old-laptop-display-from-an-avr. A lot of good stuff there.
Comments
By the way could any of those SPLITB, MERGEB, SPLITW, MERGEW instructions help you translate the data format?
One thing I've been wondering for a long time now is whether the P2 "bit_dac" mode could send a signal that satisfies the receiving LCD display (which would let you ditch the conversion IC). For this to work you may need to set up a pin pair, with the second pin outputting the inverted state of the original pin.
The bit_dac splits up the 3v3 range into 16 levels, so ~200mV apart
Perhaps this is possible; it would save that external LVDS converter chip (not that it's a huge deal, but the less the better).
Using the DAC, I'll need a lot more details. While it can probably meet the level requirements, I can immediately think of two problems: settling time and impedance matching. LVDS manages its high speeds by using a constant-current driver into the 100 ohm line impedance, driving it to the ~300mV differential. Take a look at a typical LVDS driver; the push-pull current driver can be very fast. The P2 DAC is PWM based and probably has some RC filter on the front. It also looks like the PWM must be at least 256 ticks per period, so it wouldn't be able to respond to the changes that need to happen on every bit change. And if the output impedance of the DAC isn't within 20-30% of 100 ohm differential or 50 ohm single ended, there would probably be some crazy reflections that would mess up the signal pretty badly. At a 70 MHz bit rate, the line should happily support ~700MHz without ringing to keep the edges clean. I'd need more details on the analog characteristics of the DAC front end to really know if it can be used.
Yes, you can do some things to speed this up. Use the REP instruction and loops inside the pixel routines, instead of doing the outer function call per pixel. That way you avoid the DJNZ, call, and return overheads on each pixel. Also, you can use the INCMOD and TESTB instructions here. These two things combined probably save you about 20 P2 clocks per pixel and could help drop the needed P2 clock rate if that is a goal.
So instead of this: you can try something like this and call it once per scan line:
You should be able to output your LVDS pixels in 20 P2 clocks per 7 × 4 bit pixel with the code above. 21 clocks then makes a good exact multiple to stream out the 7 nibbles: 1 nibble is streamed every 3 P2 clocks. I think that makes it 66% faster than before.
Update: just realised that your OR instructions can be removed from the loop as well and done just once before the REP. This shrinks it by 6 more clocks, down to 14 per pixel, which is another multiple of 7. So you could probably send 1 nibble out every 2 P2 clocks! This is then 2.5 times faster than before, so if it works you could do a 20MHz pixel rate panel (mono) with a 280MHz P2. The code with that type of optimization is shown below.
Update 2: Once you boost things like this you may hit the next roadblock, which is keeping the streamed video output fed during the rest of your outer loop code now that there is less time per pixel. You may wish to consider streaming the h-blank and/or v-blank portion(s) from some precomputed (constant) line data stored in HUB RAM to buy more time for any additional per-frame housekeeping overhead that would otherwise cause the streamer pixel output to underrun. That should help fix such a problem.
The 123 ohm 3v3 DAC could be made into a 100 ohm DAC using a parallel resistor at the source (this is what the 75 ohm, 2v mode does anyway). Perhaps a 510 or 560 ohm parallel resistor. The bit_dac step size would then be around 180mV.
Also, there are some CRO shots showing the PWM dither between two adjacent 8 bit levels of the DAC (which is quite different to dithering the full 3v3 range like "normal pwm")
https://forums.parallax.com/discussion/comment/1364249/#Comment_1364249
@Tubular good to know, 120 ohm is close enough to 100 ohm that it would probably work fine. 3ns is just fast enough, so maybe it could work, assuming that a pin can be set up to mirror another pin (so I don't need to send 7x8 words, which would take more than one long). Need to look through the smart pin modes and figure that out.
I generated a higher quality font (Menlo is a nice fixed width font on macOS) and it draws nicely as 16x32.
One other thing I noticed is that when using a colored background, you start to see refresh flicker, since the refresh rate is about 10Hz.
It should be possible to get 256 colors with a pixel clock of sysclock/7. That would be 3 instructions per pixel. I think it can be done.
The above has a bit of unrolling to meet the 3-instructions-per-pixel requirement. Here is the simpler, rolled-up version that won't work due to the 3 instruction limit. The palette_and_encode_table translates a color to 28 pre-encoded LVDS bits; 7 of those are the clock. The blanking and sync pixels would have their own pre-calculated longs that could be kept outside the palette table. The streamer takes the long with 28 bits and outputs 4 at a time. These 4 bits go through the streamer look-up table to convert them to a differential output. Check out Chip's HDMI code, as I think it is easy to read and is a decent example of a cycle-efficient video driver: forums.parallax.com/discussion/comment/1475526/#Comment_1475526 It's 640x480 for v2 silicon.
So I initially missed that there isn't enough RAM for 8 bits per pixel. But this can be handled. For a 4 bits/pixel framebuffer, keep the palette table at 256 entries and copy the 16-color palette 16 times. That makes the 4 unused bits "don't care." Then alternate between reading a byte and shifting the old byte. Essentially it wastes a bunch of cog RAM to mask data using a lookup table, but the cog doesn't have much time to use that RAM anyway. Easily extendable to 2 and 1 bit per pixel as well.
I used this in a software HDMI encoder. But ALTS may be a better solution since the table doesn't need to start at 0.
The lwip code in my github uses spin+pasm for a serial port. It seems I'm one of a few people using riscvp2.
As @rogloh and @SaucySoliton have shown, there are definitely ways to shift pixels out faster, but setting up between lines takes time, so I need to think through how to use the streamer to stream multiple blank pixels to buy that time.
edit: looks like only the RDFAST->pins mode will allow for nibbles to be streamed and continue reading, if I'm understanding it correctly. Unfortunately that means setting up the RDFAST for the next line/frame would need to be done AFTER the last blank pixel is shifted out, which would take too much time and glitch the clock. Although, maybe setting up a line buffer in RAM that includes the blanking pixels would let RDFAST just wrap and loop constantly. Worth experimenting, but it'll be a while before I start trying to get an external RAM set up for a higher resolution framebuffer.
I thought of a way to do a full 18-bit color framebuffer last night. Store the full framebuffer in HyperRAM (possibly something similar to here), and use a double buffer for a single line in hub RAM. While drawing a line from that buffer, another cog reads in the next line from the off-chip RAM. Then the display driver swaps the buffers, and the buffering cog loads the next line in lockstep with the drawing cog. Maybe there's some really creative way to load the line while drawing the blank pixels, but I need to think about that more. Probably not, since it would require resetting the FIFO for writing while simultaneously trying to read, which I don't think is possible.
Maybe it's possible to use the DVI/HDMI output for such displays. You would need the literal mode instead of the 8b/10b encoder, and to set the NCO for the clock to 1/7 instead of 1/10. But I have not checked the details.
This would give you differential outputs automatically.
Andy
Yep, that's pretty much the method used in my own video+HyperRAM driver pair. The trick is going to be dealing with the sync portions when streaming as you are finding.
I was thinking you could use a pre-encoded frame buffer form where each 8 pixels take 7 longs in the HUB or external RAM you are sourcing from. This is probably OK for writing text, which is usually 8-pixel oriented/aligned to begin with. For manipulating bitmapped graphics it gets a little messy, and you'd need read/modify/write cycles to change pixel colours, which could limit performance with HyperRAM. If a fast way were found to translate between the two formats on the fly, you could possibly use that in a text mode in the driver to populate the data dynamically from a screen buffer and write it back to HUB ahead of time. In my own driver I'm still hoping there might be a chance at 14 clocks per LVDS pixel, but I expect doing 7 clocks per pixel will be tough if not impossible if the HUB is involved, plus some cycles are lost to housekeeping overheads and HUB block read/write cycles too.
If you keep working on this you may discover some interesting new approaches. You'll be surprised what the P2 can achieve in the end.
I'd thought about that too. The problem then, I think, is that you'd need to generate 7 times the amount of data to feed it, because each pixel needs 7 symbols generated, whereas the nibble approach can fit an entire pixel in 28 bits and send it in one streamer operation. Unfortunately it's those extra 4 bits left over in the long that cause the most difficulty in the code during pixel generation for any HUB streaming. We need a tight way to pack some of the next pixel's data in there if HUB streaming is an option.
I think the raw drawing driver should be as general as possible: draw every pixel from a framebuffer. If text is the desired outcome, then have higher level code add the text to the frame buffer. The other nice thing is that for full color depth, each pixel needs to be a 32-bit long, so it can directly store the color information (and even the pixel/DE bits) in LVDS streamer format, not needing to look up colors in a color lookup table, which could actually speed up the pixel drawing operation. This also won't have the packing problems above, as the top nibble will just be ignored since the streamer is being reset every 7 nibbles.
It's a real shame the LVDS application wasn't discussed more when Chip was re-doing the streamer for rev2 silicon, as it could have easily translated the pixel format for us and natively allowed LVDS LCD displays with potentially little effort. But it would have been tough to convince the mob to take more risks etc. I get that too. We were lucky we got TMDS support.
I'll probably have a HyperRAM driver out soon that you should be able to try to access/populate your frame buffer data with. But it still needs something to trigger the request via a mailbox which is set up with the external memory address to read. If your code is flat out doing per-pixel streamer commands and RFLONGs it may get tricky to issue the request to a mailbox in hub RAM, but hopefully not impossible if you can free up part of your loop, particularly during the h-sync portion. Using 14 P2 clocks per LVDS pixel makes this a lot easier; 7 clocks is really tight with the per-scan-line sync work unless that is pre-encoded in the frame buffer too. If at least some of your sync portion data could be packed into hub, you might be able to stream more than one pixel at a time to buy some more cycles at the start of each scan line. The RDFAST change + refill time might be an issue too unless you can auto-loop on its 64 byte boundary.
In the worst case, if things are really tight and you can't find spare cycles to write a mailbox to HUB, you might have to invent your own HyperRAM driver and share data directly via LUT RAM perhaps, but speaking from my own experience it is quite a lot of work. It's a challenge but also a good way to learn the P2.
Yeah, that's basically how the current driver I have works. I set the number of cycles to 7 so that if we get to the command early, XCONT will block until those 7 are output and then set up the next 7 for output once the streamer is ready.
I'll look out for it. Yeah, I think there's enough room during the blank pixels at the end of each line to signal the HyperRAM driver to start buffering the next line. The cog attention mechanism should work for this since it's only 2 clocks. Reading and buffering from external RAM should happen faster than it takes to draw the line (or else this whole thing is blown), and then the cog reading the RAM can just wait for attention and start buffering the next line into the other buffer.
I'll see how yours works when it's done, though it might take a total of 3 cogs to do this cleanly (1 for the HyperRAM driver, 1 for the LVDS stream driver, and 1 to coordinate the buffering between the two). And the main program (or any cog that wants to write to the framebuffer) would talk to the HyperRAM driver to add stuff to the frame buffer when the bus is available.
The RDFAST FIFO block wrapping can be used to avoid ever needing to change the FIFO pointers once running, but I was wondering if the LUT palette lookup mode and a 4-bit -> 4-bit (or 4 -> 8 for differential output) symbol conversion could also help you switch its output pattern on the fly during v-blank/sync lines without needing a different RDFAST input source address. It may be possible just by using a different streamer command with a different LUT base address offset (e.g., use a mapping that only outputs clock transitions but keeps all LVDS channels zeroed on v-sync lines). If so, that might let you stream pre-computed compressed sync pattern sequences from the HUB RAM scan line FIFO buffer during blanking portions with a single streamer command, and that would buy cycles for mailbox writes to HUB RAM. Maybe for displays that only use the DE signal this won't be such a problem. It gets trickier if you need to set the VS and HS bits independently of DE, but it may still be possible using translated 4-bit symbols and specific input patterns that output different LVDS pin patterns depending on the selected palette.
Some other existing driver features may not be possible to achieve, such as cursors/mouse sprites etc., given they work with native pixel data, not LVDS data. Syncing would also need to be figured out.
Am I correct to assume this LVDS panel you use can accept reduced blanking and achieve 60Hz refresh at 1024x600 with a 40MHz pixel clock? If so that is a nice sweet spot with the P2 operating at 280MHz.
I'm not sure about reduced blanking, but a 40MHz pixel clock will give 60Hz refresh (according to the spec sheet too). But that means pushing out each bit at 280MHz. While possible from a hardware standpoint, it would definitely be a challenge. Though if all the LVDS data is stored in the frame buffer and the only thing the driver does is read and then stream in a loop, it may be possible.
With respect to reduced blanking, if the panel can output 1024x600@60Hz with a 40MHz pixel clock, that would definitely indicate reduced blanking. I wonder if there is any panel data showing the minimum total line count (e.g., 620 lines with 600 active lines). If so, that would let you calculate the line frequency and the number of horizontal blanking pixels you can have. If it is at least somewhere in the vicinity of 32-64 pixels, that could free around ~200-400 P2 clocks at 7 clks/pixel, and some interesting work could be done in this portion, such as reading in 64 longs of font data for the scan line or some palette data dynamically.
By the way, the approach SaucySoliton suggested earlier in this thread is a very good one indeed. Plus you could always pixel-double and get 512x300 graphics on a 1024x600 panel with 8bpp, mapping 256 palette entries to 18-bit colour using just 150KB of HUB RAM. That could still look decent on a smallish panel. Or just double the pixel width only, for a 600 line count with 300KB of HUB. Repeating pixels in an immediate streamer mode buys you some further cycle budget too. There are some reasonable options there.
I think the holy grail here would be built-in text and/or 8bpp graphics in a single LVDS COG, but the text rendering part is tricky if the LVDS COG is also doing display output at 7 clocks per pixel. Drawing text into an 8bpp graphics frame buffer is not difficult though, and each pixel can be coloured independently, allowing overlaid text and graphics. It's a lot simpler if you don't have to concern yourself with the LVDS format too.
I did look into some sample LVDS text code and believe I've identified an unrolled inner loop sequence that could generate 8 pixels of 16-colour text in 55 clocks while simultaneously streaming it out on the fly; whether it works or not with real HW at 280MHz is another story. It also depends on the behaviour of FBLOCK allowing streamer source switchovers with no loss of FIFO contents during sync/blanking handling. The good thing is that for a 1024 pixel panel the source data (text/graphics) is going to be a multiple of 64 bytes, so switching over seamlessly probably has a decent chance.
Speaking of repeating... the bit shifter repeats the last data, so we don't always need to store 7 nibbles of data. We all know that 8 pixels of data can be packed into 7 longs. But by using the last-pattern repeat, we could pack 7 pixels into 6 longs. We would send 49 nibbles but only store 48. It slightly restricts the available colors with the constraints B3=B2, G2=G1, R1=R0.
Unrelated: We can switch between 16 LUT offsets using the D field of XCONT. If the S value was $6543210 then the streamer would output a complete pixel sequence.
Also, this approach is quite easily adjustable if you wanted 256-colour text, by widening the source screen memory format to 24/32 bits and using rfwords and getbytes instead of rdbyte and getnibs etc. It could also share a common 256-colour palette with an 8bpp graphics mode (possibly selectable per scan line), as well as a 4-bit to 8-bit LUT conversion table (in another portion of LUT RAM) for differential outputs, if that works with the bitDAC mode, for example, to simulate LVDS drivers...
Has the code been tested? There won't be any skipping if pixels = 0. Skipped instructions are not shown and presumably look like the following?
Another point: the FIFO will need refilling occasionally; however, there are plenty of instructions after the rfbyte of colours before the next rfbyte of the char.
Actually, when pixels = 0 the skip mask is meant to be set to $11111111, which still skips every 4th instruction. However, upon initial entry it should also have been preset to $11111111, not 0, as you rightly point out. I've gone and updated the problem line in the code above. No, this code has not been tested; it was just an idea, as I'd mentioned. It's not meant to be a completed implementation.
The FIFO should have enough time in the loop to refill, as only one long gets read from it every 56 P2 clocks. However, the trick is setting up the next read block for the FIFO between the scan line blanking portions and the active pixel portions, probably using FBLOCK. The FIFO source needs to be seamlessly switched somehow, and this part is the biggest unknown, at least to me. I think if the block size is already a multiple of 64 bytes there is a better chance for that to work one way or another. The good thing is the source data is already a multiple of 64 bytes for a 1024 pixel panel, whether in graphics mode or in text mode.
Update: Looking at the FBLOCK instruction further, it seems it takes effect when the block wraps, so if the horizontal blanking is set up to stream some multiple of 64 bytes it may be possible to keep things seamless between the active and blanking portions and have the LVDS pins continuously updated. 64 bytes is not a multiple of 7 nibbles, so some extra nibbles likely need to be inserted to create a final multiple of 28 bits (1 LVDS pixel). However, this is achievable with some extra immediate streamer commands either before or after blanking to line things up.
The cool thing in the code above is that during the final active pixel at the end of the REP loop there is extra time to set up the next stream-from-hub command and change the FIFO to prepare to read the next data. Similarly, there will be plenty of time during h-blanking to prepare the next active pixel hub read region with another FBLOCK and read in 64 longs of font for the scan line. I'm reasonably confident it is doable, so long as there are at least ~100 P2 clocks or so during blanking to read in 64 longs and do the other setup work. That's only 14 blanking pixels, and I'm sure there can be more than that sent!