[Release] JET ENGINE - New tile&sprite graphics driver

Wuerfel_21 · 2018-10-04 14:51

Yes, the name is another Propeller pun.

(Note that Version 2 has been released now. Just work your way backwards from the last page in this thread to find a download.)

Copypaste from OBEX description:

Somewhat of a work-in-progress: stable and functional, but with room for improvement.

This is a game graphics driver with NTSC and PAL60 output.

Short overview of features:
- 256x224 resolution
- uses 5 cogs and a bunch of memory
- 16x16 tiles and sprites
- 32 sprites on screen
- 4 colors per scanline per sprite/tile
- 8-way scrolling
- full-screen post-"""processing"""
- Antialiased ROM font text
- Screen can be split into horizontal strips - "subscreens" for status displays, parallax (heh) scrolling and more
- For more detailed info, look at the scrolltext in demo.spin, as well as the code itself. I tried documenting the PASM rendering code as well as possible: most lines have a comment explaining what they do!

I've been working on this one for a while. There are still some things that should/could be done:
- Clean up and structure the code a little better - especially in the demo application!
- Write output object for SCART-style RGB (VGA is AFAICT impossible). Should be easy, but ATM I lack the hardware to test/do it.
- Further optimizations and features
- Write an actual game/application that uses it
- Write a cogless SD card driver (due to jet engine's high cog usage)
- Explore fast mixed XMM/LMM assembly techniques to overcome memory limit

And because "no pix no clix" or whatever, here's a terrible screenshot of my demo application (which isn't very impressive, overall):

Publison · 2018-10-04 15:20

Impressive!

Wuerfel_21 · 2018-10-04 16:25

Publison wrote: »

Impressive!

Thanks.
This is actually the first big PASM program that I've written mostly from scratch (the TV output object is largely derived from TV.spin, of course). The rendering code, however, is entirely new. It fills up all 4 cogs completely: not a single long is unused. Some longs are even reused multiple times: the a0, d0 and d1 temp variables are used in multiple places for different things, as well as holding some init code. Many other variables that are guaranteed not to be needed at the same time are also aliased under two different names.
Then there's also some clever self-modifying code that changes the condition codes of instructions in "tight" code. A good example is the waitvid for the right scroll border. In 256-pixel-wide modes it is not needed, but in 240-pixel scrolling modes the previous waitvid might have been only 1 pixel long, so there is space for just one instruction before the next one. That instruction of course must be the one updating VSCL. No instruction can be added before the last waitvid either, nor are the Z or C flags guaranteed to be unaffected by the screen filter. Then I remembered the MUX* instructions and used those to "kill" the relevant instructions during horizontal blanking if necessary.
I think it turned out pretty well for never having written much PASM before.
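
The condition-code trick can be modelled outside the Propeller. The following is only a sketch in Python, not the actual JET code: it assumes the standard Propeller 1 instruction layout (4-bit condition field in bits 21..18, where 0000 means "never execute") and models a MUXC/MUXZ-style write as "copy one flag into every masked bit". The instruction word and all names here are hypothetical.

```python
# Propeller 1 instruction word layout (32 bits):
#   [31:26] opcode  [25:22] Z/C/R/I  [21:18] condition  [17:9] dest  [8:0] source
COND_SHIFT = 18
COND_MASK = 0xF << COND_SHIFT   # selects the 4-bit condition field
IF_ALWAYS = 0b1111              # instruction always executes
IF_NEVER = 0b0000               # instruction behaves like a NOP


def mux(dest, mask, flag):
    """Model of a MUX*-style write: copy `flag` into every bit of `dest` selected by `mask`."""
    if flag:
        return (dest | mask) & 0xFFFFFFFF
    return dest & ~mask & 0xFFFFFFFF


def condition(instr):
    """Extract the condition field of an instruction word."""
    return (instr >> COND_SHIFT) & 0xF


# Hypothetical instruction word; only the condition field matters for this sketch.
instr = (0b101000 << 26) | (IF_ALWAYS << COND_SHIFT) | (0x1F0 << 9) | 0x001

killed = mux(instr, COND_MASK, False)   # condition becomes 0000: never executes
revived = mux(killed, COND_MASK, True)  # condition becomes 1111: always executes
```

Because only the condition bits are touched, the opcode, destination and source survive, so the same long can be switched on and off from one frame mode to the next without using Z or C at the patched instruction itself.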

Cluso99 · 2018-10-04 17:02

Nice job! Thanks for posting your work.

I have PASM code to read FAT32 files that I converted to P2 PASM that is in the new P2 ROM. Might that be a help to you?

Wuerfel_21 · 2018-10-04 17:31

Might that be a help to you?

Possibly. Ideally, one would only have to use the filesystem during program initialization, to check that all required files are present and not fragmented, and then store their address-on-the-card somewhere in hub RAM. This would severely reduce runtime overhead, as neither expensive FAT reads nor large lookup tables are needed to read/write a random sector. In my testing, when there is enough consecutive space on the card, Windows' FAT driver will not fragment a file (and of course, one can just manually trigger a defragmentation if it is fragmented, regardless of OS). I think hand-written XMM assembly can be a lot faster than what PropGCC (which I haven't messed with yet) outputs, mainly because you can put custom, application-specific code into the interpreter cog. I did some mental gymnastics and figured out that:
- LMM/hubexec can get arbitrarily close to 16 cycles per non-hubop instruction by unrolling. 20 cycles, however, is more likely.
- XMM can, apart from loading the code/data from external memory, be as fast as LMM. 128 instructions fit into a single SD sector. If a routine needs to span multiple sectors, the last instruction would be a jump into some cog code to fetch the next sector.
- Jumping within one sector is cheap if the buffers are 512-byte-aligned (movs on the program counter).
- SPI top speed is FREQ/4 = 20 MHz with a short break every 32(?) bits. Given a pessimistic assumption of an average speed (including SD protocol overhead) of 10 MHz and another pessimistic assumption that one wants 50% of the cog time to be spent executing code, we can read 5,000,000 bits per second. Divided by 4096 bits per sector, that's ~1220 sectors per second. Divided by 60, that's ~20 sectors per frame. That sounds a little bit too fast to be real. Did I make a mistake in my math?
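
For anyone who wants to redo the numbers in the list above, here is a small Python sketch. It assumes an 80 MHz system clock (which is what FREQ/4 = 20 MHz implies) and uses the pessimistic 10 MHz average and 50% share from the last bullet. The `jump_within_sector` helper is a hypothetical model of what a movs on a hub-address program counter would do with a 512-byte-aligned buffer, namely replace only the low 9 bits.

```python
CLKFREQ_HZ = 80_000_000   # assumption: standard 80 MHz Propeller clock
SECTOR_BYTES = 512        # one SD sector
LONG_BYTES = 4            # one PASM instruction is one 32-bit long
FRAMES_PER_SECOND = 60

# How many instructions fit into one sector
instructions_per_sector = SECTOR_BYTES // LONG_BYTES

# SPI throughput estimate from the post
spi_peak_hz = CLKFREQ_HZ // 4             # FREQ/4
average_hz = 10_000_000                   # pessimistic average incl. SD protocol overhead
transfer_share = 0.5                      # other half of cog time runs the loaded code
bits_per_second = average_hz * transfer_share
sectors_per_second = bits_per_second / (SECTOR_BYTES * 8)
sectors_per_frame = sectors_per_second / FRAMES_PER_SECOND


def jump_within_sector(pc, offset_bytes):
    """Model of MOVS on a program counter: only the low 9 bits (0..511) are replaced."""
    return (pc & ~0x1FF) | (offset_bytes & 0x1FF)


print(instructions_per_sector, spi_peak_hz)
print(round(sectors_per_second, 1), round(sectors_per_frame, 1))
```

Under these assumptions the arithmetic itself holds up; whether a real card sustains a 10 MHz average is a separate question that this sketch does not answer.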

Cluso99 · 2018-10-04 17:46

It's 3:20am here so it's not maths time.

The physical SD SPI exchange should be in cog to get the most speed. LMM will slow down the read significantly.

I did a high speed overlay loader that can load snippets of code from hub to cog and then execute in cog. It's useful when you have repeating loops of code that will run much faster in cog. IIRC it's in OBEX but if not search the forum. It's maybe 8+ years ago. We use it in ZiCog (the Z80 emulator) for some instructions. This may be of some use to you too.

Wuerfel_21 · 2018-10-04 18:18

The physical SD SPI exchange should be in cog.

Of course. You really only need "read sector" and "write sector". All the initialization can be LMM code that can be overwritten afterwards. A lot of the SPI code can also be reused for SPI RAM, since only the beginning and end of a transaction are different. For RAM: 8-bit command + 24-bit address, then data. For SD: 8-bit command + 32-bit parameter + 8-bit optional CRC + some garbage + 8-bit data token, then data, then two CRC bytes.
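
To make the two transaction headers concrete, here is a sketch in Python that only builds the leading bytes described above. The opcode values are assumptions for illustration: 0x03 is the READ opcode on common SPI SRAM parts, CMD17 is the SD single-block read, and 0xFE is the usual SD data start token. The 0xFF CRC byte is a placeholder, on the assumption that CRC checking is off in SPI mode. The function names are hypothetical.

```python
SPI_RAM_READ = 0x03          # assumption: READ opcode of a typical SPI SRAM
SD_READ_SINGLE_BLOCK = 17    # CMD17
SD_DATA_START_TOKEN = 0xFE   # token that precedes the 512 data bytes on a read


def spi_ram_header(command, address):
    """SPI RAM: 8-bit command followed by a 24-bit address, MSB first."""
    if not 0 <= address < (1 << 24):
        raise ValueError("address must fit into 24 bits")
    return bytes([command]) + address.to_bytes(3, "big")


def sd_command(index, argument, crc=0xFF):
    """SD in SPI mode: 8-bit command (0x40 | index), 32-bit parameter, 8-bit CRC."""
    if not 0 <= argument < (1 << 32):
        raise ValueError("argument must fit into 32 bits")
    return bytes([0x40 | (index & 0x3F)]) + argument.to_bytes(4, "big") + bytes([crc])


ram_read = spi_ram_header(SPI_RAM_READ, 0x012345)
sd_read = sd_command(SD_READ_SINGLE_BLOCK, 0x00002000)
print(ram_read.hex(), sd_read.hex())
```

Everything after these headers is the same "shift N bytes in or out" loop, which is the part that could be shared between the two devices.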

msrobots · 2018-10-04 22:12

You should look at the code of FSRW32. Part of it is a very, very nice SD block driver written by Jonathan Dummer aka @Lonesock.

It is one of the nicest pieces of PASM I have ever studied. Somehow one sees @kuroneko's handwriting in it too.

The PASM driver supports read-ahead and write-behind using a sector buffer in the cog. So if you issue a write sector command, your HUB block gets copied very fast into cog RAM, then the command returns while the SD PASM cog is still writing, doing that in parallel with your calling cog.

Same goes for reading, when you read a sector, it gets read into cog ram, then delivered to HUB. Then the driver reads the next sector in case you need it next, doing that in parallel to your calling cog.

Almost all of the initialization and FAT stuff is done in SPIN, but if you know your sector addresses, nothing beats @lonesock's block-driver on the propeller.

Enjoy!

Mike

Wuerfel_21 · 2018-10-04 22:39

That's where the 20 MHz SPI trick comes from. Read-ahead/write-behind can't really be implemented when you want the SD driver to sit in the same cog as the code that uses it. But having the SD card code premade makes everything easier, doesn't it?

Also, this thread got somewhat offtopic real fast.

Roy Eltham · 2018-10-04 23:20

You are using 5 of the 8 cogs for this, if you allocate the remaining 3 cogs as 1 for sd card, 1 for audio driver, and 1 as the main application, then you are fine!
If you want to, you can do slow serial from the main application cog, and the same goes for I2C to access the EEPROM (storing scores and settings).

So I think you should be able to make a game with this without needing an SD card reader that works without a cog of its own.

Wuerfel_21 · 2018-10-04 23:36

You're forgetting the keyboard! As far as I can tell, it needs its own cog, as the keyboard controls and clocks the data transfer. Even the earliest IBM PCs that used the PS/2 keyboard protocol (which actually predates the PS/2 computer line) used a dedicated microcontroller for the keyboard. Most kinds of game controllers don't have that problem (NES/SNES controllers are just a shift register and Wii expansion controllers are I2C devices). I actually still haven't built my SNES-gamepad-to-"proprietary"-DIN6 adapter cable, so I can't use those yet, and I assume most people don't have any of them plus the hardware to hook them to a Prop just lying around. And what fun is making a game no one else can play?

Also, sending RS232 data could be done at a pretty high baud rate (but with long-ish pauses between bytes) using WAITVID.

ersmith · 2018-10-05 01:48

Cluso99 wrote: »

I did a high speed overlay loader that can load snippets of code from hub to cog and then execute in cog. It's useful when you have repeating loops of code that will run much faster in cog. IIRC it's in OBEX but if not search the forum. It's maybe 8+ years ago. We use it in ZiCog (the Z80 emulator) for some instructions. This may be of some use to you too.

Some compilers (PropGCC and fastspin, for example) will do this kind of "overlay" loading from HUB to COG automatically. I think it was Bill Henning that dubbed this FCACHE, and the compilers do it for small loops or (in PropGCC's case) small recursive functions. It certainly can make a big difference in speed.

Eric

Roy Eltham · 2018-10-05 10:11

The main cog can read whatever input you want to have easily, even a keyboard.

evanh · 2018-10-05 10:47

I think I remember it was kuroneko that found that the video shifter could be fed faster without requiring any WAITVID instruction, or any waiting of any sort. It was a jaw dropper for me because it did a totally undocumented trick with the Cog internals.

potatohead · 2018-10-05 12:52

WHOP!

Very early on Linus managed to hit the WHOP in an 800x600 driver prototype.

Nobody exploited that, until kuroneko many years later. Most of us thought it a glitch.

@Wuerfel_21 don't forget your VBLANK time. If you need to, spin down a couple sprite COGS, do something, then get them running before active display. Nice engine!

evanh · 2018-10-05 13:17

potatohead wrote: »

WHOP!

Very early on Linus managed to hit the WHOP in an 800x600 driver prototype.

Nobody exploited that, until kuroneko many years later. Most of us thought it a glitch.

Gee, before my time again. Who was Linus?

Publison · 2018-10-05 13:33

He did a brilliant Breakpoint demo in 2009:

https://forums.parallax.com/discussion/111807/propeller-based-demo-to-be-released-at-breakpoint/p1

Publison · 2018-10-05 13:40

Details of the Turbulence demo can be found here:
https://www.linusakesson.net/scene/turbulence/index.php

Video here:
https://hd0.linusakesson.net/files/lft_turbulence_h264_capture_720x576.mp4

evanh · 2018-10-05 13:48

Ah, the demo scene'r. That was a celebration for sure. One with the force.

I hadn't known he was around the Prop from the beginning. That's so classic that he was the one to dig up the WHOP. It's a mindset isn't it.

So what does WHOP stand for?

Wuerfel_21 · 2018-10-05 14:09

I've been working on a screen capture tool for JET engine:

- Drop-in replacement object for JET_v01_composite.spin
- Uses 3_000_000 baud serial (highest I could go without major glitches)
- PC software has accurate color reproduction and can record image sequences

Will upload it soon-ish

evanh · 2018-10-05 14:19

Snazzy. 3Mbps is very fast for the Prop1 I think. That's some fast coding work there too.

Wuerfel_21 · 2018-10-05 14:22

Actually, I'm just using the video generator as a shift register, so the code itself is rather lazy.
It should be noted that the screen frequently glitches, but I suspect insufficient buffering on the PC side, as even Alt+SysRq can affect it.

evanh · 2018-10-05 14:38

Wuerfel_21 wrote: »

Actually, I'm just using the video generator as a shift register, so the code itself is rather lazy.
It should be noted that the screen frequently glitches, but I suspect insufficient buffering on the PC side, as even Alt+SysRq can affect it.

Oh, as in it's fast enough for full framerate feed to the PC? If so, it's no surprise the PC can't keep up non-stop like that. Desktop OSes aren't built for that and the comport hardware doesn't have DMA or any large buffer to compensate.

Wuerfel_21 · 2018-10-05 14:49

full framerate

I wish. I haven't measured it, but I think it's running at maybe 15% speed. Definitely better than trying to take screenshots with GEAR though.

EDIT: I have measured it now. It is less than 10% speed, averaging 4.6 frames per second.
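
As a rough plausibility check of that figure, here is a back-of-the-envelope sketch in Python. It rests on two assumptions that are not stated anywhere in this thread: 10 bits on the wire per byte (8N1-style framing) and one byte per pixel. The actual capture protocol may differ, so treat the upper bound as illustrative only.

```python
BAUD = 3_000_000
BITS_PER_SERIAL_BYTE = 10   # assumption: start bit + 8 data bits + stop bit
WIDTH, HEIGHT = 256, 224    # JET ENGINE resolution
BYTES_PER_PIXEL = 1         # assumption: one color byte per pixel
DISPLAY_FPS = 60

bytes_per_second = BAUD / BITS_PER_SERIAL_BYTE
frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL

# Upper bound on capture rate if the link were saturated with pixel data only
max_capture_fps = bytes_per_second / frame_bytes

# Measured value from the post, expressed as a share of the display rate
measured_fps = 4.6
measured_share = measured_fps / DISPLAY_FPS

print(frame_bytes, round(max_capture_fps, 2), round(measured_share * 100, 1))
```

Under those assumptions the link tops out a little above 5 frames per second, so a measured 4.6 frames per second (just under 8% of 60) is in the expected range.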

Wuerfel_21 · 2018-10-05 15:53

Here is a little test run of video recording. JETViewer outputs a bunch of numbered PNG files, which I've encoded into VP9 video with the correct pixel aspect ratio (JET pixels are not perfect squares, as the viewer would want you to believe).
Sadly, it seems that recording mode is extra susceptible to glitches... I might want to make a slower version for glitchless recording.

evanh · 2018-10-05 16:05

Yeah, the more "other" I/O happening the more likely to be an issue. You're probably hitting something like 20k IRQ/sec.

potatohead · 2018-10-05 16:42

evanh wrote: »

Ah, the demo scene'r. That was a celebration for sure. One with the force. I hadn't known he was around the Prop from the beginning. That's so classic that he was the one to dig up the WHOP. It's a mindset isn't it.

So what does WHOP stand for?

Waitvid Hand Off Point.

It is the cycle where WAITVID gets its data from the S and D buses. Whatever is on them gets used.

Funny thing, Linus and all of us missed it! I actually suggested he shift his timing to make it work "properly" totally missing the implications!

He did, and it went unexploited until kuroneko.

Wuerfel_21 · 2018-10-05 17:33

I actually had to account for the runaway video generator behaviour in the serial output module (which is on the OBEX now, btw) by gating the video generator output using OUTA.

Wuerfel_21 · 2018-10-18 16:38

I've found some rather questionable old unused code in JET_v01.spin (the glue code file).
In particular, it sets P0 to output and has a bunch of apparently unused stuff in the VAR section that takes up a couple longs and might be confusing to people reading the code.
Then again, the glue code is designed to be hacked and copied around.

Should I release a v02 package without the gunk?

JT Cook · 2018-11-27 04:06

Wuerfel_21, can you upload the palette you used for the screen capture application?

Wuerfel_21 · 2018-11-27 16:27

It is generated by the source code (which is included in the JAR) and included in every PNG it writes. However, it includes some colors that don't actually exist. Here is a version that only has colors that actually exist (all other slots are black).

This PNG can be directly used as a palette for FFMPEG's "paletteuse" filter. I think 120x480 full framerate video might be possible ;-)
