Inline assembly is awesome

Rayman · 2024-08-10 21:38

I've seen it help before, but it's a real game changer for one particular application...

Testing out a reduced pin count 2.4" display board and driver for SimpleP2.
Uses a 3.4 MHz rated I/O expander for many of the control signals. Pretty much just the data bus and the WR signal use actual P2 pins directly, so as to be fast.

Anyway, one thing we want is to transfer the entire screen from P2 Hub RAM to the display.
If it can be fast enough, that would allow double buffering and continuous transfer from P2 Hub RAM to the screen.
Ideally, the update would be faster than the screen's internal refresh rate of ~70 Hz or so.

But the original driver, in Spin2, was way too slow. Update rate was 1.6 Hz (with 180 MHz P2 clock).
Just turned that update loop into inline assembly and it's now 36 Hz.
Not quite ideal, but good enough. Maybe I can speed it up or increase the P2 clock, if needed.
But, I think this is going to be fine.
Here's the code that transfers the data to the LCD:

PUB WrBytes_DAT(p,n)|i,pin_wr     ' write n bytes of data, starting from pointer p

  if (n<=0)
    return
  GPIO.[CS]:=0
  GPIO.[DC]:=1
  WriteRegMCP(MCP23009_GPIO,GPIO)                      ' DC high: data

'{ This is the inline assembly version !!!!!
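  pin_wr:=wr                 ' copy the WR pin number into a local so the inline code can use it as a register
  org
     rdfast #0,p             ' start the hub FIFO reading at address p
.loop
     rfbyte i                ' fetch the next byte from the FIFO
     setbyte outb,i,#1       ' put it on byte 1 of OUTB (pins 40..47, assuming the data bus is there)
     waitx #2                ' short delay before toggling WR
     drvnot pin_wr           ' toggle WR
     waitx #2                ' hold WR for a few clocks
     drvnot pin_wr           ' toggle WR back
     djnz n, #.loop          ' repeat until all n bytes are sent
  end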
'}
{ This is the Spin2 version  !!!!!!!!!
  repeat i from 0 to n-1
    pinw(bus addpins 7, byte[p][i])           ' write data to bus
    pint(WR)                       ' clock out data byte
    pint(WR)
  }
  GPIO.[CS]:=1
  WriteRegMCP(MCP23009_GPIO,GPIO)

Rayman · 2024-08-10 21:42

BTW: This experiment with using I/O expander to reduce pin count for 2.4" LCD has been sorta successful. The one painfully slow thing is drawing lines though. It's not impressive.
That's why I'm looking to double buffer with P2 Hub RAM. This should make it appear just as fast as something like a VGA driver would be.

TonyB_ · 2024-08-10 21:46

Loop would be 20% quicker using REP instead of DJNZ.
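
For illustration, here is a sketch of how the REP variant of the loop might look. It assumes the same locals (p, n, i, pin_wr) as WrBytes_DAT above and that the block runs from cog RAM via ORG/END; it has not been tested on hardware. The 20% figure matches a rough cycle count: the DJNZ loop comes to about 20 clocks per byte, 4 of which are the taken DJNZ branch, and REP removes that branch.

```spin2
  pin_wr:=wr
  org
     rdfast #0,p             ' start the hub FIFO reading at address p
     rep @.done,n            ' repeat the block up to .done n times, no branch overhead
     rfbyte i                ' fetch the next byte from the FIFO
     setbyte outb,i,#1       ' put it on byte 1 of OUTB (pins 40..47, assumed data bus)
     waitx #2
     drvnot pin_wr           ' toggle WR
     waitx #2
     drvnot pin_wr           ' toggle WR back
.done
  end
```

Note that REP with a count of zero repeats indefinitely, so the existing n<=0 check before the loop still matters.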

Rayman · 2024-08-10 22:20

@TonyB_ Yeah, was trying to remember if rep works inside inline assembly or not... Does REP work with inline assembly in both FlexProp and Parallax versions of Spin2?

Wuerfel_21 · 2024-08-10 22:44

Reading a bunch of data from memory onto pins sounds like a prime job for streamer DMA.

In my emulators I have a driver for ILI9342 LCDs using direct RGB input. That really works like VGA and can bypass the GRAM in the display controller, so it's fully synchronized to the P2's rendering, no tearing or stuttering, just smooth 60Hz animation. (downside is that when you turn the dot clock off the entire thing freezes). That uses 12 pins: 6 data pins, HSYNC, VSYNC, DOTCLK, CS, CLK, SDA. The latter 3 are just for setting up the registers, and CLK/SDA can of course be shared with other SPI-ish devices.

In theory the other ILI-something-somethings should be able to do it, too, but most of them are portrait orientation (240x320), in which case you need to disable RAM bypass mode and get a really ugly tear diagonally through the screen. Though that's just an artifact of rendering in scanlines. I've seen a 320x240 IPS screen that should also work somewhere, but no matching breakout board. That needs 8 data pins because it can actually do 24 bit color.

Wuerfel_21 · 2024-08-10 22:55

@Rayman said:
@TonyB_ Yeah, was trying to remember if rep works inside inline assembly or not... Does REP work with inline assembly in both FlexProp and Parallax versions of Spin2?

If you're using ORG/END, yes, it just works. flexspin's ASM/ENDASM has problems with it, but it'll warn you about it (just write a DJNZ loop, it'll convert it into REP if possible).

evanh · 2024-08-10 22:55

@Rayman said:
Does REP work with inline assembly in both FlexProp and Parallax versions of Spin2?

Yes. And generally branching within is fine.
Actually, I just noticed Eric's docs for FlexC state that RET shouldn't be used for inline assembly and that the correct solution is to branch to a C label within the function. FlexC/Spin also automatically replaces any RET with an equivalent branch, which is how Pnut works for Spin2 too. I've personally always used RET when desired and never had an issue.

Rayman · 2024-08-11 22:18

As I recall, it's very different in the usual case between FlexProp and PropTool...
I think PropTool copies the code into cog RAM first and then runs it from there.
I suppose this means there is some size limit on the inline assembly code.
Probably not a big deal for the usual case though...

FlexProp compiles everything to assembly anyway, so the inline code just gets dropped in and runs in hubexec, like everything else.
But this can be slow, so I think there is also an option to have it copied to cog RAM first.

evanh · 2024-08-11 22:35

For Flexspin, using ORG/END in Spin2 code will replicate Pnut's approach of copying to and running in cogRAM.
