3D teapot demo — Parallax Forums

3D teapot demo

Here are some fruits of what I've been working on: a teapot model with 2464 triangles, 256x256 sphere environment texture + vertex AO, rendered to 320x240 16bpp at 20 FPS.


This is mostly a demo of the optimized and correct 4-cog triangle rasterizer - the transform and setup is currently unoptimized. I believe 60 FPS will be achieved for this demo when I implement optimized geometry processing (currently raster takes only ~10% of the frame time!). Also currently Z-sort is implemented using linked lists in hub RAM, which is unwieldy. I need to replace it with big command blocks stored to PSRAM...

If you want to run it:

  • You need a board with PSRAM (only for framebuffering)
  • Use a very recent flexspin
  • The pin settings likely don't match your board, but it's almost 4AM and I'm too lazy to dig out the EDGE board now
  • Video driver in use is jaunty, don't use video modes other than 'HDMI' or VGA2X
  • Keyboard (or gamepad) lets you interact:
    • Tab to toggle texture/solid mode
    • Arrow keys and S/D to rotate manually
    • Space to resume spinning automatically
    • Esc to reset rotation

Comments

  • Tubular Posts: 4,713

    This looks amazing Ada, looking forward to running it

  • cgracey Posts: 14,256

    Yes, this looks really neat!

  • For the EDGE P2-EC32MB, this works for PSRAM cfg:

    exmem : "exmem_mini" | PSRAM_DELAY = 11, MEMORY_TYPE = 8, PSRAM_CLK = 56, PSRAM_SELECT = 57, PSRAM_BASE = 40, PSRAM_BANKS = 1
    

    Looks awesome...can't believe this is running on a microcontroller :D
    Got it rotating with an ADXL345, kind of like Adafruit demos with their IMUs (https://www.adafruit.com/product/2472, 1st thumbnail), except it looks like theirs runs on a Mac, whereas this is nice and self-contained!
    Being a total noob to 3D, what sort of scale are the pitch, roll, etc? I just scaled up my accelerometer data until it worked, but obviously that's not very scientific.

  • Rayman Posts: 14,865

    Nice @Wuerfel_21
    Combining with IMU is cool too @avsa242

  • Rayman Posts: 14,865

    BTW: Guess forgot about this way of overriding object settings:
    exmem : "exmem_mini" | PSRAM_DELAY = 11, MEMORY_TYPE = 8, PSRAM_CLK = 40 addpins 1, PSRAM_SELECT = 42, PSRAM_BASE = 32, PSRAM_BANKS = 6

    Do all the spin2 compilers support this, or just flexprop?

    Have to remember this one...

  • @avsa242 said:
    Looks awesome...can't believe this is running on a microcontroller :D

    This is only the garbage proof-of-concept version...
    In the end the graphics library should be able to render full 3D worlds at 30 FPS or more.

    Being a total noob to 3D, what sort of scale are the pitch, roll, etc? I just scaled up my accelerometer data until it worked, but obviously that's not very scientific.

    They're 32-bit binary angles, so 2^32 is a full rotation. This is just the native format of the P2 rotate instructions.
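
To make the scale concrete, here is a minimal sketch of converting between degrees and 32-bit binary angles. The helper names are hypothetical and not part of the demo; the only fact taken from the thread is that 2^32 corresponds to one full rotation.

```python
# Hypothetical helpers (not from the demo source): degrees <-> 32-bit binary angle.
FULL_TURN = 1 << 32  # 2**32 units make one full rotation


def degrees_to_bam(deg: float) -> int:
    """Convert degrees to a 32-bit binary angle, wrapping at a full turn."""
    return round(deg / 360.0 * FULL_TURN) % FULL_TURN


def bam_to_degrees(angle: int) -> float:
    """Convert a 32-bit binary angle back to degrees in [0, 360)."""
    return (angle % FULL_TURN) * 360.0 / FULL_TURN


print(hex(degrees_to_bam(90)))
print(bam_to_degrees(0x8000_0000))
```

A nice property of this format is that wrap-around comes for free: adding two angles and keeping the low 32 bits is already the correct sum modulo one turn.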

    @Rayman said:
    BTW: Guess forgot about this way of overriding object settings:
    exmem : "exmem_mini" | PSRAM_DELAY = 11, MEMORY_TYPE = 8, PSRAM_CLK = 40 addpins 1, PSRAM_SELECT = 42, PSRAM_BASE = 32, PSRAM_BANKS = 6

    Do all the spin2 compilers support this, or just flexprop?

    Yes, this is universal

  • cgracey Posts: 14,256

    This will be really interesting for live data visualization.

  • rogloh Posts: 5,865
    edited 2025-01-04 11:15

    Nice one @Wuerfel_21 👏
    Got it working here with PSRAM. I wonder what sort of 3d games could be developed with such a capability. Are there any existing older open source games that might suit this sort of performance level and could then run on a P2 or would something using it need to be home grown? I know it's only 2400*20 triangles/sec, so it's not some super HW accelerated thing but maybe that's still enough for something reasonably simple without too many triangles. Something like those old mechwarrior games perhaps.

  • rogloh Posts: 5,865
    edited 2025-01-04 03:01

    @Rayman said:
    BTW: Guess forgot about this way of overriding object settings:
    exmem : "exmem_mini" | PSRAM_DELAY = 11, MEMORY_TYPE = 8, PSRAM_CLK = 40 addpins 1, PSRAM_SELECT = 42, PSRAM_BASE = 32, PSRAM_BANKS = 6

    Do all the spin2 compilers support this, or just flexprop?

    Yes, this is universal

    This parameter override approach should help me improve my multiple external drivers wrapper code to select memory type when I next update the code for release. Ideally the lower level driver wrapper could also cull the unused memory driver objects based on some #ifdef style macro derived from these parameters so as not to incur a memory footprint penalty. I think that could already be done with flexspin but I'm still waiting for PNut to include some sort of conditional code inclusion before heading down that path...

  • @rogloh said:
    Nice one @Wuerfel_21 👏
    Got it working here with PSRAM. I wonder what sort of 3d games could be developed with such a capability. Are there any existing older open source games that might suit this sort of performance level and could then run on a P2 or would something using it need to be home grown? I know it's only 2400*20 triangles/sec, so it's not some super HW accelerated thing but maybe that's still enough for something reasonably simple without too many triangles. Something like those old mechwarrior games perhaps.

    The current bottleneck is the super unoptimized geometry handling, the triangle fill can go toe to toe with (very) early accelerators. 56 P2 cycles per pixel, IIRC it gets to 60-70 with overhead factored in. That's ~16 cycles globally with 4 cogs in parallel. Though the optimized geo code will eat into the same cycle budget, along with any audio (all other cogs are taken).

    (btw, the demo only runs at 252MHz, so "free" 25% improvement going to 320)
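
The cycle figures above can be turned into a rough fill-rate estimate. This is only a back-of-the-envelope sketch: the 65 cycles per pixel is an assumed midpoint of the quoted 60-70 range, and it assumes the four cogs run perfectly in parallel with no geometry work or memory stalls.

```python
CLOCK_HZ = 252_000_000   # demo clock quoted in the post
RENDER_COGS = 4          # cogs running the rasterizer
CYCLES_PER_PIXEL = 65    # assumption: midpoint of the quoted 60-70 range


def effective_cycles_per_pixel(cycles_per_pixel, cogs):
    """Cycles per pixel seen globally when the cogs work fully in parallel."""
    return cycles_per_pixel / cogs


def pixels_per_second(clock_hz, cycles_per_pixel, cogs):
    """Best-case fill rate, ignoring geometry work and memory stalls."""
    return clock_hz * cogs / cycles_per_pixel


def clock_speedup(old_hz, new_hz):
    """Relative speed gained purely from raising the clock."""
    return new_hz / old_hz


print(f"{effective_cycles_per_pixel(CYCLES_PER_PIXEL, RENDER_COGS):.2f} cycles/pixel globally")
print(f"{pixels_per_second(CLOCK_HZ, CYCLES_PER_PIXEL, RENDER_COGS) / 1e6:.1f} Mpixel/s best case")
print(f"{clock_speedup(252e6, 320e6):.2f}x from 252 MHz to 320 MHz")
```

With these assumptions the numbers come out at 16.25 cycles per pixel globally (matching the "~16" above), about 15.5 Mpixel/s, and a 1.27x gain from the higher clock.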

    I can't think of any existing game that could be easily fitted. Most 3D games use floating point and a lot of RAM - imagine your average mid-90s low end PC, probably has a Pentium (faster at floats than ints!) and between 8 and 32MB of directly addressable RAM.

  • rogloh Posts: 5,865

    @Wuerfel_21 said:
    I can't think of any existing game that could be easily fitted. Most 3D games use floating point and a lot of RAM - imagine your average mid-90s low end PC, probably has a Pentium (faster at floats than ints!) and between 8 and 32MB of directly addressable RAM.

    Yeah that's probably the sort of machine & game era I was thinking about. Those 3d polygon style games from back then. I guess without floating point there's a bit of a limitation there. Need to use fixed point/integer math and leverage HW multiply wherever possible.
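
As an illustration of the fixed-point approach mentioned here, the following is a minimal sketch of 16.16 fixed-point multiplication. The 16.16 format is an assumption chosen for the example; the thread does not say which fixed-point layout the renderer uses.

```python
FRAC_BITS = 16          # assumption: 16.16 format, not confirmed by the thread
ONE = 1 << FRAC_BITS    # fixed-point representation of 1.0


def to_fixed(x: float) -> int:
    """Convert a float to 16.16 fixed point (rounded to nearest)."""
    return round(x * ONE)


def to_float(a: int) -> float:
    """Convert a 16.16 fixed-point value back to a float."""
    return a / ONE


def fixed_mul(a: int, b: int) -> int:
    """Multiply two 16.16 values.

    The intermediate product has 32 fractional bits, so it is shifted back
    down by FRAC_BITS. On real hardware that product needs 64 bits; Python
    integers are unbounded, so no overflow handling is shown here.
    """
    return (a * b) >> FRAC_BITS


print(to_float(fixed_mul(to_fixed(1.5), to_fixed(2.25))))
```

Note that the right shift rounds toward negative infinity, so results that are not exactly representable will be truncated downward instead of rounded to nearest.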

  • pik33 Posts: 2,402

    I can't think of any existing game that could be easily fitted.

    Doom?

  • Rayman Posts: 14,865

    I, Robot would work, right?
    https://en.m.wikipedia.org/wiki/I,_Robot_(video_game)

  • pik33 Posts: 2,402
    edited 2025-01-07 11:09

    Tested this at last :) Works on EC32 with USB at 16 and HDMI at 0 - my standard setup so only "exmem" line had to be modified. The aspect ratio on my monitor is strange in either setting I can choose, "wide" seems to be too wide, 4:3 is too narrow. The monitor reports 720x480 on HDMI input.

    And the program outputs something on the serial terminal.

  • @pik33 said:
    Tested this at last :) Works on EC32 with USB at 16 and HDMI at 0 - my standard setup so only "exmem" line had to be modified. The aspect ratio on my monitor is strange in either setting I can choose, "wide" seems to be too wide, 4:3 is too narrow. The monitor reports 720x480 on HDMI input.

    The garbolium video driver I lazily grabbed off the "old projects" pile can't do a true 640 mode without borders (not enough time after processing the last pixel), so 720 it is. I should hook up the newer one for 16bpp PSRAM operation, but not today.

    And the program outputs something on the serial terminal.

    It's at 2000000 baud - it prints end-to-end frame time and raster µcode work time


    Unrelatedly, the next step is to figure out how to turn models into command streams that can be easily worked on in parallel, i.e. each command has some large N of like items to process and it can be split such that each cog handles ~N/4 of them.
    The current idea is to have a buffer of 256 transformed/lit/etc vertices (stored in the at-that-point unused framebuffer area).
    For a small model, all vertices can be transformed in one command and then the next command builds all the triangles.
    For a larger model, this needs to be (smartly) split into multiple chunks. (see also: https://www.researchgate.net/publication/6979989_An_improved_vertex_caching_scheme_for_3D_mesh_rendering )
    Interesting: With this approach simple animation skinning is essentially free, since you can load matrix A, transform some verts, load matrix B, transform some more, then draw triangles that span the gap. An idea with legs and feet.
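
The "each cog handles ~N/4" idea can be sketched as a simple range split. This is a hypothetical illustration of the partitioning only, not the demo's actual scheduling code; the function name and the (start, count) representation are made up for the example.

```python
def split_batch(n_items: int, n_workers: int = 4):
    """Split n_items into contiguous (start, count) ranges, one per worker.

    Counts differ by at most one, with the earlier workers taking the
    leftover items when n_items is not a multiple of n_workers.
    """
    base, extra = divmod(n_items, n_workers)
    ranges = []
    start = 0
    for worker in range(n_workers):
        count = base + (1 if worker < extra else 0)
        ranges.append((start, count))
        start += count
    return ranges


# A full 256-entry vertex batch divides evenly across 4 cogs.
print(split_batch(256))
# A small batch leaves some workers with less (or nothing) to do.
print(split_batch(10))
```

Contiguous ranges keep each worker reading a single linear block of the command payload, which is the property that makes a "large N of like items" easy to hand out.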

    I'm not sure what to do with texture UVs though: If they are processed alongside the vertex, that will cause a lot of duplicates (same position, different UV). But Envmapping (as shown in the demo) requires this - UV is made up from vertex normal. Could separate position from UV/lighting, but then lighting gets into trouble when it depends on position (i.e. fog/depth-cue). Not that transforming a position is super expensive to begin with, but it also reduces the efficiency of the buffer.

  • Very cool demo, Ada!

    If you thought this would be fun but don't have a board with PSRAM, try this version. It sends the output images as a USB Video Class device. The PSRAM is only used to store completed frames for the display driver. That prevents some pretty bad flickering. The JPEG artifacts are pretty visible with this kind of source material. The UVC output cuts the framerate in half since the rendering is stopped while the frame is JPEG encoded and sent via USB. There aren't enough cogs or on-chip memory to run both at the same time. I think one cog does double duty as a render cog and JPEG encoder.

  • @SaucySoliton neat! :+1:

    Eventually PSRAM will store basically everything (models, textures and raster command buckets), so that trick won't work anymore...

  • Rayman Posts: 14,865

    How does this compare with small3dlib?
    Better one would presume?

    https://forums.parallax.com/discussion/172200/3d-graphics-with-small3dlib-and-flexc

  • Still working on new geometry format (streaming and multi-core friendly). Very slowly.

  • cgracey Posts: 14,256

    @Wuerfel_21 said:
    Still working on new geometry format (streaming and multi-core friendly). Very slowly.

    Looking hopeful.

  • Finally made it work. I only wrote the converter tool and changed the Spin code to match, but that's already enough to push it to 30 FPS (2 vsync per rendered frame).

    I tried writing the tool in Rust (it is included in the ZIP). Not sure if that was the best idea (the amount of something as usize disagrees...), but it works ig.
    There's some added Z-sort artifacts in the rendering because cache optimization changes the triangle submit order (which in the source .obj I manually tweaked a bit). Implementing some way to gain back control over that is somewhere down the list.

    If you want to understand the format, the Spin code is probably more useful. The basics:

    • There's a 256-slot cache of transformed vertices (stored over framebuffer memory)
    • It's a list of commands. Each has an opcode byte. Currently just 3 are used. (in the end there will be more of course)
      • $00 terminates the command list
      • $02 transforms a batch of vertices and stores them into the cache. Each has an arbitrary slot index that it gets stored in.
      • $03 sets up a batch of triangles. Each triangle is defined by 3 cache slot indices.
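
To show how such a command list might be walked, here is a toy encoder and decoder. Only the three opcode values ($00, $02, $03) and the general idea of slot indices come from the post. Everything else is a made-up assumption for illustration: the one-byte item count after the opcode, the signed 16-bit x/y/z vertex payload, and the function names. The real format is defined by the Spin code and converter tool in the ZIP.

```python
import struct

OP_END, OP_VERTS, OP_TRIS = 0x00, 0x02, 0x03  # opcode values from the post
VERT_FMT = "<Bhhh"                            # assumed: slot byte + 3 signed 16-bit coords
VERT_SIZE = struct.calcsize(VERT_FMT)         # 7 bytes with this assumed layout


def encode(batches):
    """Encode [("verts", [(slot, x, y, z), ...]), ("tris", [(a, b, c), ...])]."""
    out = bytearray()
    for kind, items in batches:
        if not 0 < len(items) <= 255:
            raise ValueError("batch size must fit the assumed 1-byte count")
        if kind == "verts":
            out += bytes([OP_VERTS, len(items)])
            for slot, x, y, z in items:
                out += struct.pack(VERT_FMT, slot, x, y, z)
        elif kind == "tris":
            out += bytes([OP_TRIS, len(items)])
            for a, b, c in items:
                out += bytes([a, b, c])   # three cache slot indices
        else:
            raise ValueError(f"unknown batch kind: {kind}")
    out.append(OP_END)                    # $00 terminates the list
    return bytes(out)


def decode(data):
    """Walk the command list until the terminator and return the batches."""
    pos, batches = 0, []
    while True:
        op = data[pos]
        pos += 1
        if op == OP_END:
            return batches
        count = data[pos]
        pos += 1
        if op == OP_VERTS:
            items = [struct.unpack_from(VERT_FMT, data, pos + VERT_SIZE * i)
                     for i in range(count)]
            pos += VERT_SIZE * count
            batches.append(("verts", items))
        elif op == OP_TRIS:
            items = [tuple(data[pos + 3 * i: pos + 3 * i + 3]) for i in range(count)]
            pos += 3 * count
            batches.append(("tris", items))
        else:
            raise ValueError(f"unknown opcode ${op:02X}")


stream = encode([("verts", [(0, 0, 0, 0), (1, 16, 0, 0), (2, 0, 16, 0)]),
                 ("tris", [(0, 1, 2)])])
print(stream.hex(" "))
print(decode(stream))
```

Because triangles refer to cache slots instead of carrying their own coordinates, a vertex shared by several triangles is transformed once per batch, which is where the cache optimization in the converter pays off.
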
  • cgracey Posts: 14,256

    30 FPS is amazing! Do you think much more is possible? 30 FPS is quite sufficient, anyway.

  • Rayman Posts: 14,865

    Seeing some good stuff here!

    Looks like broke off display and memory drivers from emulator. This could be very useful...

    Have to try this out.

  • Rayman Posts: 14,865

    Hmm... This is built with FlexProp, right? Seems it needs a newer version than latest release?

  • Wuerfel_21 Posts: 5,140
    edited 2025-01-15 22:49

    @cgracey said:
    30 FPS is amazing! Do you think much more is possible? 30 FPS is quite sufficient, anyway.

    This is still without optimized geometry code, still just the Spin/inline ASM. I just changed it to use the command format. So there'll be a huge boost with proper full ASM code.
    For this simple demo it really needs to hit 60FPS to satisfy. Also there's features missing on the road to the full 3D library that will set performance back:

    • 3D clipping (so stuff outside the view cone doesn't cause numeric explosion) - currently only 2D clipping is implemented, which only works when the XY coordinates are in +/-32K range (in 1/16th pixels). I want to use a combined approach so full 3D clipping is only used where really needed, instead of everything that touches the screen edges (3D clipping can turn one triangle into multiple).
    • Actually flexible system that can handle different rendering modes, multiple textures, skinning, dynamic lighting, etc
    • Using PSRAM for textures. (They're really large - will need some cache/prefetch logic since Z-sort turns this into semi-random access)
    • Using PSRAM for Z-sort buckets (Really large and limits scene complexity)
    • Using PSRAM for model data (for completeness sake)
    • Some hook to run audio code on the worker cogs*
    • (There's more features to add, but they wouldn't have a huge performance impact)

    *: Main program cog + USB driver + Display driver + PSRAM driver + 4 render workers fills up the chip, no space for dedicated audio cog. Some audio work can be jammed into the display driver (since it'll handle HDMI audio, anyways), but I think some timeslice will need to be cut off the graphics to do sound mixing. There's a convenient time gap when the frame is done rendering and being copied out, so if the mixing fits there it's free.
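
The 2D clipping limit mentioned in the list can be illustrated with a small range check. This sketch assumes the "+/-32K range (in 1/16th pixels)" means screen coordinates held as signed 16-bit values with 4 fractional bits; that interpretation and the helper names are assumptions, not taken from the demo source.

```python
SUBPIXEL = 16                           # 1/16th pixel units, from the post
COORD_MIN, COORD_MAX = -32768, 32767    # assumption: signed 16-bit storage


def to_subpixel(px: float) -> int:
    """Convert a pixel coordinate to 1/16th pixel units."""
    return round(px * SUBPIXEL)


def fits_2d_clipper(x_px: float, y_px: float) -> bool:
    """True if both coordinates stay inside the assumed +/-32K range."""
    return all(COORD_MIN <= to_subpixel(v) <= COORD_MAX for v in (x_px, y_px))


print(fits_2d_clipper(319, 239))    # a corner of the 320x240 screen
print(fits_2d_clipper(5000, 120))   # a vertex projected far off-screen
```

Under this assumption the usable range is roughly -2048 to +2048 pixels, so a vertex that projects further out than that would need the full 3D clipping path.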

  • GPU P2+separate application P2 :) (j/k...sort of...don't mean to create rabbit holes!)

  • @avsa242 said:
    GPU P2+separate application P2 :) (j/k...sort of...don't mean to create rabbit holes!)

    That would partially solve the audio issue and not much else :) (Unless you want to do, like, accurate physics simulations and then render the outcome)

  • rogloh Posts: 5,865

    @Wuerfel_21 said:
    Finally made it work. I only wrote the converter tool and changed the Spin code to match, but that's already enough to push it to 30 FPS (2 vsync per rendered frame).

    Neat. I wonder if higher resolutions beyond 320x240 could work if the final image is stored in PSRAM where there is plenty of space. Does the entire frame buffer need to be rendered into HUB first or could each scan line or smaller portion of the frame be copied into PSRAM as it loads the frame in PSRAM, reducing the HUB RAM requirements? Would you envisage bandwidth issues for that?

    I already have some primitive capability for accelerating line drawing to PSRAM in my memory driver vs writing individual pixels. Back then I was considering adding other primitives such as drawing circles and rects at some future time, and I did wonder if some triangle portion or stripe could be also drawn this way or perhaps a scan line's worth of data by the driver in parallel with something else filling it. It might possibly help offload some work being done by the render cogs, but it will of course have to share bandwidth and it's only possible to do during memory idle times when the video driver is not accessing it. Although perhaps there's no real gain there if the workload is high, and it may just become a bottleneck instead.
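
For a sense of scale on the resolution question, here is a quick sketch of framebuffer sizes against the P2's 512 KB of hub RAM, plus the raw copy-out bandwidth at a given frame rate. It only does the arithmetic; it says nothing about how the renderer would actually tile or stream the frame.

```python
HUB_RAM_BYTES = 512 * 1024   # nominal P2 hub RAM


def framebuffer_bytes(width: int, height: int, bpp: int = 16) -> int:
    """Size of one full frame in bytes."""
    return width * height * bpp // 8


def copy_bandwidth(width: int, height: int, fps: int, bpp: int = 16) -> int:
    """Bytes per second needed to copy every finished frame out once."""
    return framebuffer_bytes(width, height, bpp) * fps


for w, h in [(320, 240), (640, 480)]:
    size = framebuffer_bytes(w, h)
    verdict = "fits in" if size <= HUB_RAM_BYTES else "exceeds"
    print(f"{w}x{h}x16bpp: {size} bytes, {verdict} hub RAM, "
          f"{copy_bandwidth(w, h, 30)} B/s at 30 FPS")
```

A 320x240 16bpp frame is 153,600 bytes, while 640x480 would be 614,400 bytes, which is more than the whole hub RAM. That is why rendering in strips or tiles and copying each to PSRAM would be needed for higher resolutions.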

  • pik33 Posts: 2,402
    edited 2025-01-16 08:30

    @rogloh said:
    Neat. I wonder if higher resolutions beyond 320x240 could work if the final image is stored in PSRAM where there is plenty of space. Does the entire frame buffer need to be rendered into HUB first or could each scan line or smaller portion of the frame be copied into PSRAM as it loads the frame in PSRAM, reducing the HUB RAM requirements? Would you envisage bandwidth issues for that?

    I already have some primitive capability for accelerating line drawing to PSRAM in my memory driver vs writing individual pixels. Back then I was considering adding other primitives such as drawing circles and rects at some future time, and I did wonder if some triangle portion or stripe could be also drawn this way or perhaps a scan line's worth of data by the driver in parallel with something else filling it. It might possibly help offload some work being done by the render cogs, but it will of course have to share bandwidth and it's only possible to do during memory idle times when the video driver is not accessing it. Although perhaps there's no real gain there if the workload is high, and it may just become a bottleneck instead.

    I use a modified version of your PSRAM driver, in which the PSRAM list doesn't check the queue while executing the list. That modification enabled this multi-window demo I did. I don't know if this mod made it to the current version of the driver.
    With this feature, a PSRAM list can draw a fast triangle on a PSRAM based framebuffer.
    I will try to add a triangle procedure to my video driver and check how fast it can work.

  • rogloh Posts: 5,865

    @pik33 said:
    I use a modified version of your PSRAM driver, in which the PSRAM list doesn't check the queue while executing the list. That modification enabled this multi-window demo I did. I don't know if this mod made it to the current version of the driver.

    Yeah I want to add that option to the next release. Thanks for reminding me.

    With this feature, a PSRAM list can draw a fast triangle on a PSRAM based framebuffer.
    I will try to add a triangle procedure to my video driver and check how fast it can work.

    The drawing acceleration code can be extended to other items as it is HUB exec'd from HUB RAM but it needs to allow some pre-emption from the high priority clients such as the video driver or it will be risky to allow it to complete and could otherwise corrupt video. Line plotting did check the queue per pixel IIRC, but it might be possible to do other primitive transfers like computing start/end interpolated co-ordinates of triangle strips to block copy from HUB to PSRAM for example, or even from PSRAM to PSRAM. Having it compute textures would be too much though IMO.
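
The idea of computing start/end coordinates per scan line can be sketched as follows. This is a generic span generator using floating-point edge interpolation and pixel-centre sampling, written for clarity; it is not the memory driver's code, and a real implementation would likely use incremental integer stepping instead.

```python
import math


def triangle_spans(v0, v1, v2):
    """Return [(y, x_start, x_end)] with x_end exclusive, one entry per scan line.

    Vertices are (x, y) integer pairs. A pixel is covered when its centre
    (x + 0.5, y + 0.5) lies inside the triangle.
    """
    (x0, y0), (x1, y1), (x2, y2) = sorted([v0, v1, v2], key=lambda v: v[1])
    spans = []
    for y in range(y0, y2):
        yc = y + 0.5                                        # sample at the pixel centre
        xa = x0 + (x2 - x0) * (yc - y0) / (y2 - y0)         # long edge, top to bottom
        if yc < y1:
            xb = x0 + (x1 - x0) * (yc - y0) / (y1 - y0)     # upper short edge
        else:
            xb = x1 + (x2 - x1) * (yc - y1) / (y2 - y1)     # lower short edge
        left, right = min(xa, xb), max(xa, xb)
        x_start = math.ceil(left - 0.5)                     # first centre at or right of left
        x_end = math.ceil(right - 0.5)                      # first centre at or right of right
        if x_end > x_start:
            spans.append((y, x_start, x_end))
    return spans


for y, xs, xe in triangle_spans((0, 0), (4, 0), (0, 4)):
    print(f"line {y}: copy {xe - xs} pixels starting at x={xs}")
```

Each span is one contiguous run of pixels on a single line, so it maps naturally onto a block transfer, and the gaps between spans are natural points to check for higher-priority requests.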
