Memory drivers for P2 - PSRAM/SRAM/HyperRAM (was HyperRAM driver for P2)

evanh · 2020-04-18 07:03

BTW: Your binary works fine with the 22 pF cap installed.

rogloh · 2020-04-18 07:03

I used getrnd to compute the pixels, so it is as random as that. You can also see this code is not very efficient, given that it calls Point for every pixel and tests the bpp parameter each time.

PUB pointtest(n, w, b) | first, second, x, y, r, colour
    first := cnt
    repeat n
        asm
            getrnd colour
            getrnd x
            and x, #$3ff
            getrnd y
            and y, #$3ff
        endasm
        Point(ram#RAM, w, x, y, colour, b)
    second := cnt
    return (second - first)

PUB Point(buf, width, x,y,c, b)
    if b < 8 
        return
    elseif b==8
        ram.writeByte(buf, x+y*width, c)
    elseif b==16
        ram.writeWord(buf, 2*(x+y*width), c)
    elseif b==32
        ram.writeLong(buf, 4*(x+y*width), c)

Also, each writeByte has to set up a few things in Spin too, so it might be slowish.

PUB writeByte(bank, addr, data) | m
    m := cogid * 3
    mailbox[m+2] := 1
    mailbox[m+1] := data
    mailbox[m] := REQ_WRITEBYTE + (bank & $f) << 24 + (addr & $ffffff)
    repeat until mailbox[m] => 0
    return mailbox[m]
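
A minimal Python sketch of how the request long in writeByte is packed and polled, for anyone reading along without a P2 handy. The REQ_WRITEBYTE value here is hypothetical (the real constant is not shown in this thread); it is only assumed to have bit 31 set, which is what makes the "repeat until mailbox[m] => 0" poll work as a signed test.

```python
# Hypothetical request code: the driver's real REQ_WRITEBYTE value is not shown here.
# Bit 31 is assumed set so the mailbox long reads as negative while the request is pending.
REQ_WRITEBYTE = 0xA0000000


def mailbox_index(cogid):
    """Each COG owns three consecutive longs in the mailbox array."""
    return cogid * 3


def pack_request(req, bank, addr):
    """Pack request code, 4-bit bank and 24-bit address into one 32-bit long."""
    # Explicit parentheses: Python's + binds tighter than <<, the opposite of Spin.
    return (req + ((bank & 0xF) << 24) + (addr & 0xFFFFFF)) & 0xFFFFFFFF


def is_pending(mailbox_long):
    """Mirror of the Spin poll: pending while the signed 32-bit value is negative."""
    return bool(mailbox_long & 0x80000000)
```

The bank and address are masked before packing, so an out-of-range value cannot spill into the request code bits.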

evanh · 2020-04-18 07:06

evanh wrote: »

BTW: Your binary works fine with the 22 pF cap installed.

And that's with the accessory board not even in the ideal position of P16-31.

rogloh · 2020-04-18 07:09

Yeah that may help in some cases but remember this is not overclocking anything right now. The HyperRAM is rated to 100MHz and that is what we are using. Writes are still done at sysclk/2 so the clock should be centered in the middle of the data (you can check if you have a good scope).

rogloh · 2020-04-18 07:38

I think I could speed up non-horizontal lines heaps if I could somehow add Bresenham's algorithm directly into the memory COG. But I would need to have the free space for that unless I try to somehow jump out and run it in hub-exec mode. I do know a way I could free up 80 longs or so in the COG RAM, but I don't think I will go down that path right now. The only other thing I was thinking of squeezing in was a graphics fill operation (ie. multiple fill repeats of the same pattern each offset by the line pitch). That could possibly help clearing out rectangular screen regions. Most of the code infrastructure to do this is probably already there from the graphics copy so it may not be that much more to put in. I still have ~20 LUT longs free plus ~10 COG RAM longs unless I resort to my other bank lookup option to free space (which then takes 2-3 more instructions per request).

Edit: In some cases it may even make sense to have a dedicated COG driver to do the graphics which optimizes some of this processing code and it could run as fast as it possibly can in PASM, making its own mailbox calls directly to the external memory COG. This could be especially useful to deal with line co-ordinate calcs and the different bpp handling or rendering text from fonts etc. It's all still doable in Spin but can be a little slow perhaps. The Spin code could just provide a list of high level drawing commands to this Graphics COG and let it go draw things in the background, co-ordinating with the memory driver COG as needed.

evanh · 2020-04-18 08:10

It's working really well for something without specific hardware support. Good work.

rogloh · 2020-04-18 08:16

Cheers. I've spent a while on it. The driver that is. The gfx demo stuff was mostly just hacked up today.

whicker · 2020-04-18 12:32

Forget Bresenham when you can divide. Or use fixed point math with sca for that matter.

If your deltaY is less than deltaX, then you can divide to find the number of horizontal pixels to burst.

Some complication as most lines start and end in half spans.

===
   ======
         ===
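
The span idea above can be sketched in Python as follows. This is only an illustration for one octant (left to right, downwards, deltaY no larger than deltaX); the function names are made up for the example and nothing here is taken from the driver. It divides to find where each scan line's run starts and ends, so each row becomes one burst instead of one write per pixel.

```python
def ceil_div(a, b):
    """Integer ceiling division that also works for negative a (b > 0)."""
    return -(-a // b)


def line_spans(x0, y0, x1, y1):
    """Return (y, x_start, length) horizontal runs for a shallow line.

    Restricted to dx > 0 and 0 <= dy <= dx to keep the sketch short.
    Pixel rows are chosen by rounding y to the nearest scan line.
    """
    dx, dy = x1 - x0, y1 - y0
    if not (dx > 0 and 0 <= dy <= dx):
        raise ValueError("sketch only handles 0 <= dy <= dx, dx > 0")
    if dy == 0:
        return [(y0, x0, dx + 1)]  # purely horizontal: a single burst
    spans = []
    for k in range(dy + 1):
        # Columns i with round(i * dy / dx) == k lie in [start, end).
        start = max(0, ceil_div((2 * k - 1) * dx, 2 * dy))
        end = min(dx + 1, ceil_div((2 * k + 1) * dx, 2 * dy))
        spans.append((y0 + k, x0 + start, end - start))
    return spans
```

For the line from (0,0) to (11,2) this gives runs of 3, 6 and 3 pixels, which is the half span, full span, half span shape drawn above.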

rogloh · 2020-04-18 15:15

Yes whicker I think there should be a good way to do this that is simpler than Bresenham. If plotted lines can take further advantage of write bursts of multiple pixels on the same scan line that could speed it up too. In a request list parameter block for a future graphics line operation I can probably pass in up to 4 additional parameters in addition to the current start address and colour fill value. So I was thinking of something like the startaddr (lowest pixel address), deltax, deltay, and endaddr (higher pixel addr). The code continues processing using some math on the address and the scaled delta values until the address reaches up to or beyond endaddr, perhaps with some safety counter limit to ensure we can't go beyond the maximum possible for the screen size, unless we trust the caller to validate all the data first.

But to do this I think I'd first have to free up those 80 COG longs I mentioned and eliminate my existing very fast bank+service table lookup method. It's still worth considering in time. I'm pretty sure I can at least speed up some vertical lines by extending my byte/word/long fills to support graphics with an extra pitch offset and repeat count parameter. That should already fit I hope. Vertical lines right now are rather slow as they are basically one mailbox request per pixel.
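
To make the pitch plus repeat count idea concrete, here is a small Python model. It is a sketch only: the function and parameter names are invented for the example and do not reflect the driver's actual request format. A fill becomes one burst per row, with each row's address offset by the line pitch, so a vertical line is just a fill that is one pixel wide.

```python
def fill_bursts(start_addr, width_px, rows, pitch_bytes, bytes_per_pixel):
    """Return one (address, byte_count) burst per row of a rectangular fill."""
    if bytes_per_pixel not in (1, 2, 4):
        raise ValueError("only 8/16/32bpp are modelled")
    return [(start_addr + r * pitch_bytes, width_px * bytes_per_pixel)
            for r in range(rows)]


def apply_fill(mem, bursts, pattern):
    """Write the repeating pixel pattern into a bytearray standing in for external RAM."""
    for addr, count in bursts:
        for i in range(count):
            mem[addr + i] = pattern[i % len(pattern)]
```

With this model a client issues a single request for the whole rectangle rather than one mailbox request per pixel.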

Wuerfel_21 · 2020-04-18 15:30

drawing arbitrary lines directly onto an external framebuffer to me doesn't seem useful enough to sacrifice overall speed for.

rogloh · 2020-04-18 15:37

You are probably right Wuerfel_21. I certainly don't want to add that initially. But if I can readily extend my existing fill burst to support repeats with a pitch argument for doing graphics fills, it can also be used to get a speedup of simple vertical lines. Fast drawing of horizontal and vertical lines is probably useful for GUI type applications for drawing group/window outlines and text field boxes etc. Plotting arbitrary angle lines is perhaps less useful (in a typical GUI application), and can always be done in another COG until this capability is possibly added in the future. With any luck maybe there are other things I can do to make room without sacrificing performance or putting in some extensions that support this type of thing with hub exec perhaps.

Update: Another thing to factor in is that these graphics helper operations in the memory driver can only fully work with 8/16/32bpp colour depth modes. For 1/2/4bpp colour depths this memory driver doesn't do the pixel operations as it can't act on items smaller than a single byte. So for those smaller bit depths an external COG would really have to do all the line drawing if it needs to be single pixel granular. These smaller bit depth frame buffers are still supported by my video driver using external memory which always assumes the overall frame buffer is aligned to a byte address. I suspect 16 colour modes are still going to be quite useful for providing a GUI and they will leave a lot more external memory bandwidth available to other COGs too.

rogloh · 2020-04-19 03:23

Here's another binary showing two independent COGs writing into the shared HyperRAM frame buffer and the video COG reading from it. The main (first) COG still does the same demo and cycles through all the different patterns while the second COG only uses the circle demo and continually runs it which overlays its drawn circles on top of the main demo contents being drawn. The second COG is currently not 8/16 bpp aware and this has an interesting effect whenever it switches into 16bpp mode by biasing the circles more to the top of screen.

The good thing is that I think this starts to exercise the poller/arbiter under load from multiple COGs and, importantly, shows it maintaining the context for each COG's requests over the fragmented bursts, all while stable video is maintained.

Same P2-EVAL config - Fit HyperRAM module on pins P32-47 and VGA module on pins P0-7

rogloh · 2020-04-19 06:52

Just added vertical lines/block graphics fills to the HyperRAM driver and am now seeing something working in 8 bit colour modes, and it took only 9 more LUTRAM longs and didn't slow things down. It does look like it should help free up the COG making the fill request from writing the individual pixels/lines. On the scope I see it can do a single pixel write from the fill list request in 1.8us of the free time on the scan line (P2@200MHz). This includes all overhead, yielding to the poller, managing and servicing the request list, address phase setup, data transfer etc.

As an example, at 200MHz and displaying 1024x768 8bpp it looks like it squeezes in about 6-7 of these single pixel write opportunities per scan line when there is only one COG writing. So the write rate is perhaps up to 340k pixels/second for vertical lines in this configuration. Other video resolutions/colour depths/burst sizes/P2 clock frequencies will of course change these numbers. There isn't much of a penalty to widen the line/rectangle drawn slightly either. Filling 1 pixel wide vs 10 pixels wide doesn't change much given the overhead dwarfs the transfer rate.
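
The pixel rate estimate above can be checked with a few lines of Python. The video timing used here is an assumption: the thread does not give the exact mode, so the common 60 Hz XGA timing (65 MHz pixel clock, 1344 total pixels per line) is used to get the scan line rate.

```python
# Assumed 60 Hz XGA timing; the actual mode used in the demo is not stated.
XGA_PIXEL_CLOCK_HZ = 65_000_000
XGA_TOTAL_PIXELS_PER_LINE = 1344


def line_rate_hz(pixel_clock_hz, total_pixels_per_line):
    """Scan lines per second for a given pixel clock and total line length."""
    return pixel_clock_hz / total_pixels_per_line


def writes_per_second(line_rate, writes_per_line):
    """Single pixel writes per second when a fixed number fit on each scan line."""
    return line_rate * writes_per_line


rate = line_rate_hz(XGA_PIXEL_CLOCK_HZ, XGA_TOTAL_PIXELS_PER_LINE)
low = writes_per_second(rate, 6)
high = writes_per_second(rate, 7)
print(f"{rate / 1000:.1f}k lines/s, {low / 1000:.0f}k to {high / 1000:.0f}k writes/s")
```

Under that assumption, 6 to 7 writes per line works out to roughly 290k to 340k single pixel writes per second, in line with the figure quoted.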

EDIT: spoke too soon on extra LUTRAM use. This was correct for 8 bit fills, but for 16bpp and 32bpp depths with block fills, I needed 2 more LUTRAM longs to handle their size differences. Still fits.

evanh · 2020-04-19 12:23

Someone here asked about eye patterns. I never thought I could do them with this scope but after a talk with a friend and pondering whether I really could, and him convincing me that two probes is enough after I said that two of my probes have lost their ground clips ... I started actually trying ...

I started with the simplest case, a clock only from a continuously toggling output ... turned on persistent traces and was pleasantly surprised how effective it was. I then dug out my regular testing code for the HyperRAM, set it running at a fixed sysclock of 10 MHz and proceeded to explore getting the trigger to focus on the clock during data bursts. This turned out to be not too hard: I set the trigger to fire on high times shorter than a limit slightly above the sysclock period. I had results.

Changed the testing software to a data reading test with registered data pins and an unregistered clock pin, the worst performing config. Stepped up to 50 MHz sysclock (25 MHz HR clock) and captured my first screenshot. Top trace is the HR clock on P24, bottom trace is data bit2 on P18 (Ignore the blue fill, this is a side effect of the sampling method when there are two states at the same relative position):

Here you see a clear latency of the read data from the HR chip of about 6 ns. Nothing surprising there I don't think. Just shows what a clean signal looks like.

Next is 80 MHz sysclock (40 MHz HR clock):

Here I've zoomed in to 5 ns per division and the clock is losing its square wave definition. The data is still intact and the latency is easier to see.

Next is 90 MHz sysclock:

Here the read data is all corrupt without changing the chosen setup. The eye pattern doesn't look bad as far as I can see. There is some general wobble starting but I'm assuming this is a scope probe thing. Those ground clips need to be shorter.

Have to hit the sack and got work in the morning so won't be posting much more for a bit ...

rogloh · 2020-04-19 22:15

An eye diagram can be used for several things, such as checking signal jitter and noise margins. In this last case, where you get the HyperRAM read errors, I think this is more likely to be caused by a data timing delay relative to the sampling clock, where the arriving data signal is not meeting setup or hold times when being captured in the P2 input stage. Unfortunately the P2 output clock you've captured on your scope is not going to have the same phase as the internal P2 sampling clock, due to overall propagation delays in the P2 as it is sent through the I/O block and PCB etc. So it can't be accurately used for that purpose with HyperRAM reads.

You could still use this for validating HyperRAM write timing if both these clock and data signals reach the HyperRAM around the same time (which should be the case if the PCB trace length and impedance is sufficiently well matched etc).

evanh · 2020-04-20 07:36

Those are with the 22 pF cap installed on the HR clock.

Write timings, with the 22 pF, should give about 1.0 ns data setup. Will be quite hard to see in the fuzz I suspect ...

PS: Here's read data at 240 MHz sysclock. It's notable that the HR clock signal has nicely ramped up while the HR read data out is pretty much the same shape as the clock now. By contrast, at 90 MHz the data still had substantial high and low flats.

evanh · 2020-04-20 07:49

rogloh wrote: »

Unfortunately this P2 output clock you've obtained on your scope is not going to have the same phase as the internal P2 sampling clock due to overall propagation delays in the P2 as it is sent through the I/O block and PCB etc. So it can't be accurately used for that purpose with HyperRAM reads.

Yeah, that is the crux of the uncertainty.

evanh · 2020-04-20 08:07

evanh wrote: »

Write timings, with the 22 pF, should give about 1.0 ns data setup. Will be quite hard to see in the fuzz I suspect ...

Hehe, of course, as it turns out, hanging a scope probe off one of the data pins with it that tight also causes problems. Anything that cuts into the setup time will destroy the data.

rogloh · 2020-04-20 11:37

@evanh, yes unfortunately loading the clock signal with the scope probe has likely shifted its phase a fair bit and probably messed up your high speed writes, which may explain why your read back is corrupted, or perhaps it failed to recognise the read command in the address setup phase.

It's probably rather difficult to measure without some high-end test gear/probes once you get down into the single-digit ns range and you are adding several pF of probe load or more.

evanh · 2020-04-20 20:32

Data pin loading is worse. Henrique may be right; he's suggested adding more than 1.0 ns of lag to the clock. I guess that's actually a good thing for selecting a suitable tiny logic buffer.

We can treat this as dialling in the parameter.

rogloh · 2020-04-21 06:07

This line drawing thing is still bugging me quite a bit. I know I don't want to slow down other memory driver aspects or add a lot more code given I don't have much room left in the COG, but I think there must be a simple way to extend my current graphics vertical line/fill to support slanted lines based on what whicker mentioned. Gut feeling tells me if I have about 20 more instructions spare I can probably get it to fit, though it's still good to keep some room left over for other bug fixes...

Even if this line capability is just limited to 8/16/32 bpp modes I still think it would be nice that a COG can simply issue the first address co-ordinate and the height/width offsets to the other end of the line and the memory driver will fill in the pixels for the line. Otherwise a graphics COG could get tied up issuing hundreds of single pixel writes to the mailbox or to a large request list when it could instead be doing other useful things while the memory COG writes to the memory in the background. This line primitive would also help allow us to place a series of line drawing commands in request lists without consuming one list item per pixel (16 bytes), and the client COG could then in parallel be off doing other things such as co-ordinate transformations, setting up polygon fills, other pixel blend ops etc.

I'm still looking into it...

evanh · 2020-04-21 06:31

Is there any plan for 4-bit pixel depth? Some sort of read-modify-write I would assume.

Yanomani · 2020-04-21 07:25

evanh wrote: »

Data pin loading is worse. Henrique may be right; he's suggested adding more than 1.0 ns of lag to the clock. I guess that's actually a good thing for selecting a suitable tiny logic buffer.

We can treat this as dialling in the parameter.

I'm still working on it; over the past three months I have found several options. Some look promising, others don't.

Soon I'll be posting the ones I believe would fit best.

In the meantime, I'm trying to improve my little bunker/home, to better face the pandemic challenge.

Late addition was an "electric shower". For those who are unaware of its existence, the following link explains what they are, and the dangers they bury inside.

https://youtube.com/watch?v=vmOoAWrn1kM

P.S. Obviously, mine is way better installed than the one depicted at the youtube video, but the term "suicide shower" still describes its operation. Try googling the terms within quotes... really dazzling...

rogloh · 2020-04-21 07:44

evanh wrote: »

Is there any plan for 4-bit pixel depth? Some sort of read-modify-write I would assume.

Not gonna happen in this driver. 8/16/32 read/write access only, plus byte granular bursts. Any low bit depth read/update accesses for 1/2/4bpp graphics will need to be done by the client COG.

Update: I mean it could actually be done, but support for those special cases needs more COG RAM that I don't have right now. I do have the concept of a locked request which could be used to do a read, update it, then write back before relinquishing the bus to other COGs' requests (including video). The video COG gets locked at setup time so its fragmented bursts don't yield back to the poller until the whole request is completed, allowing higher performance. Round-robin COGs should not lock their requests when they are added to the polling list, though I don't prevent it. Locking long bursts of RR COGs could upset video. This locking could also probably be dynamically enabled on a single request basis to allow an atomic read/modify/write of bitfields, but I just need more room to support things like that...
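
For reference, this is roughly what a client COG would have to do for each 4bpp pixel, shown as a Python sketch on a bytearray standing in for the frame buffer. The nibble order (even x in the low nibble) is an assumption made for the example, and the names are hypothetical. On real hardware the read and the write would be two separate memory requests, which is why atomicity needs the locking described above.

```python
def set_pixel_4bpp(mem, pitch_bytes, x, y, colour):
    """Read-modify-write one 4-bit pixel; returns the byte address touched.

    Assumption: even x lives in the low nibble, odd x in the high nibble.
    """
    addr = y * pitch_bytes + (x >> 1)
    shift = (x & 1) * 4
    old = mem[addr]                                   # read
    cleared = old & ~(0xF << shift) & 0xFF            # modify: clear target nibble
    mem[addr] = cleared | ((colour & 0xF) << shift)   # write back
    return addr
```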

evanh · 2020-04-21 09:04

Just thinking that whatever ends up being used for line drawing at 4-bit depth could also apply to 8-bit depth.

evanh · 2020-04-21 09:13

Yanomani wrote: »

Late addition was an "electric shower". For those who are unaware of its existence, the following link explains what they are, and the dangers they bury inside.

I've got an export diverting box on my mains to modulate any excess into the hot water cylinder. It works really well, never have to pay for any hot water.

Pity the lines company is screwing everyone with home generation systems though. I'm barely saving a cent, over not having the solar PV, since they charge more for what I use from the grid now. I would have gone off-grid if the roof space was big enough for it.

rogloh · 2020-04-23 05:34

Here's another HyperRAM external memory test binary demo that achieves a 1920x1080 8bpp frame buffer over VGA. It uses the same outputs P0-7 for VGA, and P32-P47 for HyperRAM module.

Note: be aware that it clocks the P2 at 297MHz and overclocks the HyperRAM to 148.5MHz!

Interestingly it doesn't really drop the overall performance that much at all going from XGA to full HD. The extra clock boost up from 200MHz seems to mostly cover the increased resolution load. Only 8bpp is working, 16bpp was removed due to overloading.

I've boosted the text output a little bit by first reading from HyperRAM into hub RAM then rendering pixels on top of that area in hub memory, then copying back. It's a lot faster than writing individual font pixels directly to HyperRAM this way when printing strings.

Below is the simple hacked up demo code I used that writes a text font into HUB or external RAM. You can see it is not going to be as fast as it could be with all the extra calculations it is doing per pixel. There are certainly other techniques that could boost its text speed further, by writing a long instead of individual bytes for example and using some inline PASM etc.

PUB gfxDrawChar(buf, c, xpos, ypos, width, font, fontheight, fgcolour, bgcolour, bpp) | x, y, f            
   repeat y from 0 to fontheight-1
      f:=byte[font][(y<<8) + c]
      if bpp == 8
          repeat x from 0 to 7
              if (f & (1<<x))
                 if buf & $80000000 == 0
                    byte[buf][(ypos+y)*width+xpos+x] := fgcolour
                 else
                    ram.writebyte(buf & $7fffffff, ((ypos+y)*width+xpos+x), fgcolour)
              elseif fgcolour <> bgcolour ' not transparent
                 if buf & $80000000 == 0
                    byte[buf][(ypos+y)*width+xpos+x] := bgcolour
                 else
                    ram.writebyte(buf & $7fffffff, ((ypos+y)*width+xpos+x), bgcolour)
      elseif bpp == 16
          repeat x from 0 to 7
              if (f & (1<<x))
                 if buf & $80000000 == 0
                    word[buf][(ypos+y)*width+xpos+x] := fgcolour
                 else
                    ram.writeword(buf & $7fffffff, ((ypos+y)*width+xpos+x)*2, fgcolour)
              elseif fgcolour <> bgcolour ' not transparent
                 if buf & $80000000 == 0
                    word[buf][(ypos+y)*width+xpos+x] := bgcolour
                 else
                    ram.writeword(buf & $7fffffff, ((ypos+y)*width+xpos+x)*2, bgcolour)

Update: forgot to adjust text string which still mentions 1024x768 200MHz and also the 'gator copy resolution width in case anyone notices. Due to no clipping and a random number selection from 0-2047 instead of stopping at 1919 pixels, you might see the random pixel loop is a little biased towards the left side of screen where it simply wraps around for now. In fact most of the demo loop wraps this way.

EDIT: we found this attachment has a video timing issue with the front porch - use later posted versions.

rogloh · 2020-04-23 07:12

Reducing the above code to this dedicated 8bpp hub printing function already boosted performance from 474 to 755 strings/s so a fair bit of the slowness is probably on the Spin side. Some raw PASM could probably do even better. The test string happens to be 75 chars long and wraps once.

PUB gfxDrawChar8bpp(buf, c, xpos, ypos, width, font, fontheight, fgcolour, bgcolour) | x, y, f, base
   base:= buf + ypos*width + xpos
   repeat y from 0 to fontheight-1
      f:=byte[font][(y<<8) + c]
      repeat x from 0 to 7
          if (f&(1<<x))
             byte[base][x]:= fgcolour
          elseif fgcolour <> bgcolour ' not transparent
             byte[base][x] := bgcolour
      base += width

The outer calling loop became this...

PUB printText(buf, str, xpos, ypos, fgcolour, bgcolour, font, height, width, bpp) | c, pos, xoffset, colour
   pos:=0
   xoffset:=0
   repeat while byte[str][pos] <> 0
      c := byte[str][pos]
      if c == 13 
         ypos+=height
         xoffset:=0
      elseif c > 31
        if bpp==8
            gfxDrawChar8bpp(buf, c, xpos+xoffset*8, ypos, width, font, height, fgcolour, bgcolour)
        elseif bpp==16
            gfxDrawChar16bpp(buf, c, xpos+xoffset*8, ypos, width, font, height, fgcolour, bgcolour)
        xoffset++
      pos++
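
One way to cut the per-pixel work further, along the lines of "writing a long instead of individual bytes" mentioned earlier, is to build each 8 pixel glyph row in one go and store it as a single run. Below is a Python sketch of that idea on a bytearray. It keeps the same font layout as the Spin code (row y of character c at font[(y<<8) + c], bit 0 as the leftmost pixel) and the same fgcolour == bgcolour transparency rule; the function names are made up for the example, and like the original it does no clipping.

```python
def glyph_row_pixels(font, c, y, fg, bg, existing):
    """Build the 8 output bytes for one glyph row of character c."""
    f = font[(y << 8) + c]
    transparent = (fg == bg)       # same convention as the Spin code
    out = bytearray(existing[:8])  # start from what is already on screen
    for x in range(8):
        if f & (1 << x):
            out[x] = fg
        elif not transparent:
            out[x] = bg
    return bytes(out)


def draw_char_8bpp(buf, c, xpos, ypos, width, font, fontheight, fg, bg):
    """Draw one character, storing each row as a single 8 byte run (no clipping)."""
    base = ypos * width + xpos
    for y in range(fontheight):
        buf[base:base + 8] = glyph_row_pixels(font, c, y, fg, bg, buf[base:base + 8])
        base += width
```

On the P2 the per-row run would map naturally onto two long writes in hub RAM, or one short burst to external memory.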

evanh · 2020-04-23 07:30

The horizontal timings must be out. The picture is partly off the right side of the display for me. Large black blanking area on the left side.

rogloh · 2020-04-23 07:33

I've found that too. My monitor also crops some off the right. Here's the HD timing I am using in my driver. Perhaps there are better values to try.

            long   CLK297MHz
            long   297000000
                   '_HSyncPolarity___FrontPorch__SyncWidth___BackPorch__Columns
                   '     1 bit         7 bits      8 bits      8 bits    8 bits
            long   (SYNC_POS<<31) | ( 44<<24) | ( 88<<16) | (148<<8 ) |(1920/8)
                   
                   '_VSyncPolarity___FrontPorch__SyncWidth___BackPorch__Visible
                   '     1 bit         8 bits      3 bits      9 bits   11 bits
            long   (SYNC_POS<<31) | (  4<<23) | (  5<<20) | ( 36<<11) | 1080
            long   2<<8
            long   0
            long   0   ' reserved for CFRQ parameter

So this means...

HFP = 44 pixels
HSYNC = 88 pixels
HBP = 148 pixels
VISIBLE = 1920 pixels

VFP = 4 lines
VSYNC = 5 lines
VBP = 36 lines
VISIBLE = 1080 lines
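
As a cross-check of the packed longs, here is a Python sketch that unpacks the two timing words using the field widths from the comments and works out the totals and refresh rate. Two things are assumptions: the SYNC_POS value (set to 1 here purely for illustration) and the pixel clock being sysclk/2, i.e. 148.5 MHz, which is not stated outright in the post.

```python
SYNC_POS = 1  # hypothetical value, only the bit position matters here


def unpack_h(word):
    """Horizontal long: 1 bit polarity, 7 bit FP, 8 bit sync, 8 bit BP, 8 bit columns."""
    return {"sync_pos": (word >> 31) & 1,
            "front_porch": (word >> 24) & 0x7F,
            "sync": (word >> 16) & 0xFF,
            "back_porch": (word >> 8) & 0xFF,
            "visible": (word & 0xFF) * 8}  # columns are stored as pixels / 8


def unpack_v(word):
    """Vertical long: 1 bit polarity, 8 bit FP, 3 bit sync, 9 bit BP, 11 bit visible."""
    return {"sync_pos": (word >> 31) & 1,
            "front_porch": (word >> 23) & 0xFF,
            "sync": (word >> 20) & 0x7,
            "back_porch": (word >> 11) & 0x1FF,
            "visible": word & 0x7FF}


def total(t):
    """Total pixels per line, or lines per frame, including blanking."""
    return t["front_porch"] + t["sync"] + t["back_porch"] + t["visible"]


h = unpack_h((SYNC_POS << 31) | (44 << 24) | (88 << 16) | (148 << 8) | (1920 // 8))
v = unpack_v((SYNC_POS << 31) | (4 << 23) | (5 << 20) | (36 << 11) | 1080)
pixel_clock_hz = 297_000_000 / 2  # assumption: two sysclks per pixel
refresh_hz = pixel_clock_hz / (total(h) * total(v))
```

With these values the totals come to 2200 pixels per line and 1125 lines per frame, giving 60 Hz under the sysclk/2 assumption.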
