[Release] JET ENGINE - New tile&sprite graphics driver
Wuerfel_21
Posts: 5,052
Yes, the name is another Propeller pun.
(Note that Version 2 has been released now. Just work yourself backwards from the last page in this thread to find a download)
Copypaste from OBEX description:
Somewhat of a work-in-progress: Stable and functional, but with room for improvement.
This is a game graphics driver with NTSC and PAL60 output.
Short overview of features:
- 256x224 resolution
- uses 5 cogs and a bunch of memory
- 16x16 tiles and sprites
- 32 sprites on screen
- 4 colors per scanline per sprite/tile
- 8-way scrolling
- full-screen post-"""processing"""
- Antialiased ROM font text
- Screen can be split into horizontal strips - "subscreens" for status displays, parallax (heh) scrolling and more
- For more detailed info, look at the scrolltext in demo.spin, as well as the code itself. I tried documenting the PASM rendering code as well as possible: most lines have a comment explaining what they do!
I've been working on this one for a while. There are still some things that should/could be done:
- Clean up and structure the code a little better - especially in the demo application!
- Write output object for SCART-style RGB (VGA is AFAICT impossible). Should be easy, but ATM I lack the hardware to test/do it.
- Further optimizations and features
- Write an actual game/application that uses it
- Write a cogless SD card driver (due to jet engine's high cog usage)
- Explore fast mixed XMM/LMM assembly techniques to overcome memory limit
And because "no pix no clix" or whatever, here's a terrible screenshot of my demo application (which isn't very impressive, overall):
Comments
Thanks.
This is actually the first big PASM program that I've written mostly from scratch (the TV output object is largely derived from TV.spin, of course). The rendering code, however, is entirely new. It fills up all 4 cogs: not a single long is unused. Some longs are even reused multiple times: the a0, d0 and d1 temp variables are used in multiple places for different things, as well as holding some init code. Many other variables that are guaranteed not to be needed at the same time are also aliased under two different names. Then there's also some clever self-modifying code that changes the condition codes of instructions in "tight" code. A good example is the waitvid for the right scroll border. In 256-pixel wide modes, it is not needed, but in 240-pixel scrolling modes, the previous waitvid might have only been 1 pixel long, so there is only space for one instruction until the next one. That instruction of course must be the one updating VSCL. No instruction can be added before the last waitvid either, nor are the Z or C flags guaranteed to be unaffected by the screen filter. Then I remembered the MUX* instructions and used those to "kill" the relevant instructions during horizontal blanking if necessary.
I think it turned out pretty well for never writing much PASM before.
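To make the MUX* trick a bit more concrete, here is a small Python model of it. This is only a sketch and not the driver's actual code: it assumes the Propeller 1 instruction layout, where bits 21..18 of an instruction long hold the condition field (%0000 = never, %1111 = always), and the instruction word used is an arbitrary example.

```python
# Model of "killing" an instruction by rewriting its condition field.
# Assumed Propeller 1 layout: bits 21..18 hold the condition
# (%0000 = never executes, %1111 = always executes).

COND_MASK = 0b1111 << 18  # selects all four condition bits


def muxc(dest, mask, c_flag):
    """Model of MUXC: copy the C flag into every bit of dest selected by mask."""
    if c_flag:
        return (dest | mask) & 0xFFFFFFFF
    return dest & ~mask & 0xFFFFFFFF


def condition(instr):
    """Extract the 4-bit condition field from an instruction long."""
    return (instr >> 18) & 0b1111


# Arbitrary example word with condition = always (not taken from the driver).
instr = 0xA0BC0A01

killed = muxc(instr, COND_MASK, False)   # C clear: condition becomes "never"
revived = muxc(killed, COND_MASK, True)  # C set: condition becomes "always"

print(f"original condition: {condition(instr):04b}")
print(f"killed condition:   {condition(killed):04b}")
print(f"revived condition:  {condition(revived):04b}")
```

Only the condition bits change, so the opcode and operands survive the round trip and the instruction can be switched on and off per mode without spending a flag test in the tight loop.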
I have PASM code to read FAT32 files that I converted to P2 PASM that is in the new P2 ROM. Might that be a help to you?
Possibly. Ideally, one would only have to use the filesystem during program initialization to check that all required files are present and not fragmented, and then store their address-on-the-card somewhere in hub RAM. This would severely reduce runtime overhead, as neither expensive FAT reads nor large lookup tables are needed to read/write a random sector. In my testing, when there is enough consecutive space on the card, Windows' FAT driver will not fragment a file (and of course, one can just manually trigger a defragmentation if it is fragmented, regardless of OS). I think hand-written XMM assembly can be a lot faster than what PropGCC (which I haven't messed with yet) outputs, mainly because you can put custom, application-specific code into the interpreter cog. I did some mental gymnastics and figured out that:
- LMM/hubexec can get arbitrarily close to 16 cycles per non-hubop instruction by unrolling. 20 cycles however is more likely.
- XMM can, apart from loading the code/data from external memory, be as fast as LMM. 128 instructions fit into a single SD sector. If a routine needs to span multiple sectors, the last instruction would be a jump into some cog code to fetch the next sector.
- jumping within one sector is cheap if the buffers are 512-byte-aligned (movs on the program counter)
- SPI top speed is FREQ/4 = 20 MHz with a short break every 32(?) bits. Given a pessimistic assumption of an average speed (including SD protocol overhead) of 10 MHz and another pessimistic assumption that one wants 50% of the cog time to be spent executing code, we can read 5,000,000 bits per second. Divided by 4096 bits per sector, that's ~1220 sectors per second. Divided by 60, that's ~20 sectors per frame. That sounds a little bit too fast to be real. Did I make a mistake in my math?
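The estimate in the list above can be re-run as a short script. It is just the same arithmetic under the same stated assumptions (10 MHz average SPI rate, 50% of cog time spent on transfers, 60 frames per second); the variable names are made up for illustration. Under those assumptions the numbers do check out.

```python
# Re-run of the back-of-the-envelope XMM-from-SD estimate.
# All inputs are the assumptions stated in the post, not measurements.

SECTOR_BYTES = 512
LONG_BYTES = 4            # one PASM instruction is one 32-bit long

avg_spi_hz = 10_000_000   # assumed average SPI bit rate incl. protocol overhead
io_share = 0.5            # assumed share of cog time spent reading the card
frame_rate = 60           # frames per second

instructions_per_sector = SECTOR_BYTES // LONG_BYTES
bits_per_second = avg_spi_hz * io_share
sectors_per_second = bits_per_second / (SECTOR_BYTES * 8)
sectors_per_frame = sectors_per_second / frame_rate

print(f"instructions per sector: {instructions_per_sector}")
print(f"sectors per second:      {sectors_per_second:.1f}")
print(f"sectors per frame:       {sectors_per_frame:.1f}")
```

Whether a real card sustains a 10 MHz average is a separate question that the script cannot answer; it only confirms that ~1220 sectors per second and ~20 sectors per frame follow from the inputs.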
The physical SD SPI exchange should be in cog to get the most speed. LMM will slow down the read significantly.
I did a high speed overlay loader that can load snippets of code from hub to cog and then execute in cog. It's useful when you have repeating loops of code that will run much faster in cog. IIRC it's in obex but if not search the forum. It's maybe 8+ years ago. We use it in ZiCog (the z80 emulator for some instructions). This may be of some use to you too.
It is one of the nicest pieces of PASM I have ever studied. Somehow one sees @kuroneko's handwriting in it, too.
The PASM driver supports read-ahead and write-behind using a sector buffer in the cog. So if you issue a write sector command, your HUB block gets copied very fast into cog RAM, then the command returns while the SD PASM cog is still writing, in parallel with your calling cog.
Same goes for reading, when you read a sector, it gets read into cog ram, then delivered to HUB. Then the driver reads the next sector in case you need it next, doing that in parallel to your calling cog.
Almost all of the initialization and FAT stuff is done in SPIN, but if you know your sector addresses, nothing beats @lonesock's block-driver on the propeller.
Enjoy!
Mike
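As a rough illustration of the "known sector addresses" idea discussed in this thread: if a file is verified to be unfragmented at startup, finding the sector for any byte offset is a single add and divide, with no FAT lookup. The function name and the start sector below are hypothetical; this is a sketch of the address calculation only, not of any particular driver's API.

```python
# Offset-to-sector mapping for a file known to be contiguous on the card.
# start_sector would be looked up once through the filesystem at init time.

SECTOR_BYTES = 512


def sector_for_offset(start_sector, byte_offset):
    """Return (absolute sector number, offset within that sector)."""
    return (start_sector + byte_offset // SECTOR_BYTES,
            byte_offset % SECTOR_BYTES)


# Hypothetical file starting at sector 8192; locate byte 1300 of it.
sector, offset_in_sector = sector_for_offset(8192, 1300)
print(sector, offset_in_sector)
```

This only holds while the file stays contiguous, which is why the fragmentation check at initialization matters.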
Also, this thread got somewhat offtopic real fast.
If you want to you can do slow serial from the main application cog, same with I2C for accessing the eeprom (storing scores and settings).
So I think you should be able to make a game with this without needing to get SD card reading working without a cog of its own.
Also, sending RS232 data could be done at a pretty high baud rate (but with long-ish pauses between bytes) using WAITVID.
Some compilers (PropGCC and fastspin, for example) will do this kind of "overlay" loading from HUB to COG automatically. I think it was Bill Henning that dubbed this FCACHE, and the compilers do it for small loops or (in PropGCC's case) small recursive functions. It certainly can make a big difference in speed.
Eric
Very early on Linus managed to hit the WHOP in an 800x600 driver prototype.
Nobody exploited that, until kuroneko many years later. Most of us thought it a glitch.
@Wuerfel_21 don't forget your VBLANK time. If you need to, spin down a couple sprite COGS, do something, then get them running before active display. Nice engine!
https://forums.parallax.com/discussion/111807/propeller-based-demo-to-be-released-at-breakpoint/p1
https://www.linusakesson.net/scene/turbulence/index.php
Video here:
https://hd0.linusakesson.net/files/lft_turbulence_h264_capture_720x576.mp4
So what does WHOP stand for?
- Drop-in replacement object for JET_v01_composite.spin
- Uses 3_000_000 baud serial (highest I could go without major glitches)
- PC software has accurate color reproduction and can record image sequences
Will upload it soon-ish
It should be noted that the screen frequently glitches, but I suspect insufficient buffering on the PC side, as even Alt+SysRq can affect it.
Oh, as in it's fast enough for full framerate feed to the PC? If so, it's no surprise the PC can't keep up non-stop like that. Desktop OSes aren't built for that and the comport hardware doesn't have DMA or any large buffer to compensate.
EDIT: I have measured it: less than 10% speed, averaging 4.6 frames per second.
Sadly, it seems that recording mode is extra susceptible to glitches... I might want to make a slower version for glitchless recording.
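For comparison, here is a rough upper bound on what a 3_000_000 baud link could carry. The thread does not describe the wire format, so this sketch assumes one byte per pixel and 8N1 framing (10 bits on the wire per byte); both are assumptions for illustration only.

```python
# Upper bound on frames per second over the serial link.
# Assumptions (not from the thread): 1 byte per pixel, 8N1 framing.

BAUD = 3_000_000
BITS_PER_BYTE_ON_WIRE = 10   # start bit + 8 data bits + stop bit
WIDTH, HEIGHT = 256, 224     # driver resolution from the feature list
TARGET_FPS = 60

bytes_per_frame = WIDTH * HEIGHT
bytes_per_second = BAUD / BITS_PER_BYTE_ON_WIRE
max_fps = bytes_per_second / bytes_per_frame
fraction_of_full_rate = max_fps / TARGET_FPS

print(f"bytes per frame: {bytes_per_frame}")
print(f"max frames/s:    {max_fps:.2f}")
print(f"share of 60 fps: {fraction_of_full_rate:.1%}")
```

Under those assumptions the ceiling is a little over 5 frames per second, which is in the same range as the measured average and well below full frame rate.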
Waitvid Hand Off Point.
It is the cycle where WAITVID gets its data from the S and D busses. Whatever is on them gets used.
Funny thing, Linus and all of us missed it! I actually suggested he shift his timing to make it work "properly" totally missing the implications!
He did, and it went unexploited until kuroneko.
In particular, it sets P0 to output and has a bunch of apparently unused stuff in the VAR section that takes up a couple longs and might be confusing to people reading the code.
Then again, the glue code is designed to be hacked and copied around.
Should I release a v02 package without the gunk?
This PNG can be directly used as a palette for FFMPEG's "paletteuse" filter. I think 120x480 full framerate video might be possible ;-)