Memory drivers for P2 - PSRAM/SRAM/HyperRAM (was HyperRAM driver for P2)

rogloh · 2020-04-11 09:54

Thanks all.

jmg wrote: »

Sounds good.
How does that cope with some temperature cycling, as in the margin tests evanh did above ?

Testing that can probably come later. It's not as if we could do very much about temperature aspects apart from controlling delay/register settings for timing. I guess it is what it is.

TonyB_ wrote: »

Excellent, well done!

As well as 640x480, sysclk of 252MHz could do 800x600 @ 56.25Hz, pclk 36MHz, if your monitor supports that (but not 800x600 DVI/HDMI of course).

Yes I intend to try a few more combinations again. I want to push this up high and see if I can get XGA at 16bpp if I can read the memory at sysclk/1 now. I sort of hope it might even be possible to reach 1920x1200x4bpp also using a 1x sysclk:pixel clock ratio in transparent mode, but I need to check it out to see if that is doable...first I will need to remember what my video driver was capable of as it has been a while since I used it.

evanh · 2020-04-11 10:06

Wow from me too! Wasn't expecting anything for a day or two. It's going to be fun keeping everyone's setup stable me thinks.

rogloh · 2020-04-11 10:53

evanh wrote: »

Wow from me too! Wasn't expecting anything for a day or two. It's going to be fun keeping everyone's setup stable me thinks.

Well much of it was ready to go. Hardest part today was finding room for the extra instruction in the video driver that I need to set up the new 3 long mailbox request. I had to (hopefully temporarily) remove a 3 instruction mod I'd added last time to cancel both interlaced+line doubled text, which along with my PAL fix and a couple of others had apparently filled up the video driver COG to its total maximum of 502 longs. So I'm on the hunt for another long in the video driver to make up for it, or I live with its existing weird behavior of generating scanlines 1-1-3-3-5-5 in odd fields and 2-2-4-4-6-6 in the even fields if you happen to inadvertently select both text options at the same time, which just looks messed up.

Update: just got a 800x600 32bpp frame buffer working at 200MHz.

Update2: just got a 1920x1080 8bpp frame buffer working at 297MHz with sysclk/1 - its image is added below. It looks amazing! I also tried 1920x1200 at 308MHz but I think this is just a bit too much and the pixels start to shimmer a bit on the monitor. Perhaps sysclk/2 will work better there.

Update3: left this image for a while and now noticing a little pixel fuzz at the side of the screen when I look carefully - maybe self heating has caused a bit of a drop in transfer reliability - it is near the edge of its range according to evanh's data. It could possibly be my monitor too or P2 jitter perhaps, this is analog VGA not DVI. It could even be noise, as this board is only being powered by my Mac's USB port. The oscilloscope had shown a lot of noise on the logic low level of the clock and chip select lines when I was probing the signals too.

evanh · 2020-04-11 23:00

Do you have the capacitor installed?

rogloh · 2020-04-11 23:21

No I don't have the capacitor. I am working with a stock board right now as I am only doing writes at sysclk/2 rates.

evanh · 2020-04-11 23:32

Okay, 297 MHz shouldn't be a signal integrity problem then. Best config is HRclock registered and HRdata unregistered. 297 MHz is bang in the middle of that band at room temperature without capacitor.

evanh · 2020-04-11 23:40

How are you calculating the pixel clock for SETXFRQ? If it's not a clean multiple to sysclock then shimmer or crawl can be the result.

rogloh · 2020-04-12 01:07

It was a clean divide by two, both for 1920x1080 using 297/2 and for my prior 1920x1200 attempt using 308MHz/2. The latter seemed to have this weird shimmer effect right away but the former seemed to look ok at first glance but then on more inspection has some slight fuzzy noise at certain locations across the screen where the colour is changing a lot (ie. high frequency). It was not noticeable on the flat circle colours which is why I didn't see it immediately. It's weird and needs closer inspection, some of the pixels sort of flicker between different colours but it is subtle. It possibly looked like the data being read back has occasional problems and it may be analog noise too. My ground level was bouncing about on the scope for some signals. It might also be due to not getting a 1:1 clocking on the analog VGA LCD monitor (scaler issue) because I noticed that the text character on the top right was truncated which it isn't normally. This is not a native 1080 panel it is 1200 pixels tall.
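A minimal sketch of the "clean multiple" check evanh is asking about, using only the clock figures quoted in this thread plus one made-up non-integer case. The function names are hypothetical and this says nothing about how the value passed to SETXFRQ is actually encoded; it only tests whether sysclk divided by the pixel clock is a whole number.

```python
from fractions import Fraction

def pixel_divider(sysclk_hz: int, pclk_hz: int) -> Fraction:
    """Return sysclk / pixel clock as an exact fraction."""
    return Fraction(sysclk_hz, pclk_hz)

def is_clean_divide(sysclk_hz: int, pclk_hz: int) -> bool:
    """True when the pixel clock is an exact integer division of sysclk."""
    return pixel_divider(sysclk_hz, pclk_hz).denominator == 1

cases = [
    (297_000_000, 148_500_000),  # 1920x1080 at sysclk/2 (from the thread)
    (308_000_000, 154_000_000),  # 1920x1200 attempt at sysclk/2 (from the thread)
    (252_000_000, 36_000_000),   # 800x600 suggestion (from the thread)
    (200_000_000, 65_000_000),   # hypothetical non-integer ratio for contrast
]
for sysclk, pclk in cases:
    print(sysclk, pclk, pixel_divider(sysclk, pclk), is_clean_divide(sysclk, pclk))
```

Both 1920-wide modes come out as an exact divide by two, so by this test the pixel clock ratio is not the source of the shimmer.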

rogloh · 2020-04-12 01:08

evanh wrote: »

Okay, 297 MHz shouldn't be a signal integrity problem then. Best config is HRclock registered and HRdata unregistered. 297 MHz is bang in the middle of that band at room temperature without capacitor.

I was using all pins unregistered.

Update: perhaps I should move to having registered clocks instead of live clocking on that pin. It might create more consistent frequency band transitions and a higher overall operating range. To do this I need to retime my critical read/write setup code which will be a total PITA. But I might look into it.

evanh · 2020-04-12 01:32

Another option is reduce sysclock frequency a little and reduce the blanking intervals to maintain Hsync/Vsync frequencies. I've noticed that LCD monitors use HSync/VSync to derive the outputted resolution. 285 MHz sysclock would be a good target with all pins unregistered.
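A rough sketch of that retiming idea, assuming the 1080 mode used the standard 1080p60 totals of 2200 pixels by 1125 lines at 148.5 MHz (an assumption; the thread does not state the exact timing) and keeping the pixel clock at sysclk/2. The function name and the choice to snap the horizontal total to a multiple of 8 pixels are hypothetical.

```python
def retimed_totals(old_pclk_hz, new_pclk_hz, h_total, v_total, h_active, h_step=8):
    """Shrink the horizontal total so the line rate stays close to the original."""
    h_freq = old_pclk_hz / h_total                 # original line rate in Hz
    ideal = new_pclk_hz / h_freq                   # fractional total that would match exactly
    new_h_total = round(ideal / h_step) * h_step   # snap to a multiple of h_step pixels
    if new_h_total <= h_active:
        raise ValueError("no room left for horizontal blanking")
    new_h_freq = new_pclk_hz / new_h_total
    return new_h_total, new_h_freq, new_h_freq / v_total

# 297 MHz sysclk -> 148.5 MHz pixel clock; 285 MHz sysclk -> 142.5 MHz pixel clock.
h_total, h_freq, v_freq = retimed_totals(148_500_000, 142_500_000, 2200, 1125, 1920)
print(h_total, round(h_freq), round(v_freq, 2))
```

Under those assumptions the horizontal total drops from 2200 to 2112 pixels, so blanking shrinks from 280 to 192 pixels while the line and frame rates stay within a small fraction of a percent of the original.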

rogloh · 2020-04-12 01:35

Yeah I'll play with it a bit more today.

rogloh · 2020-04-13 02:49

So on my analog hires Sony monitor I am able to get 1920x1080x8bpp working with the P2 clocked at 297MHz and the HyperRAM transfer rate at sysclk/1. It looks pretty clean and I can't see any flickery pixels. Perhaps the noise I was seeing on my LCD is just due to imperfect scaling at that resolution, though I should leave it on for a while for the P2 to fully warm up...it is also probably a bit colder in this room today, ~18C. Pushing up to 1920x1200 is too high a bandwidth for the Sony and it can't sync to it.

Funny thing was when I tried 1920x1080x16bpp it actually worked! I couldn't believe it because the 297MB/s bus bandwidth is insufficient for that rate when overheads are added. Then I realized that it was actually drawing every second line, and it had an interlaced effect. This is because the video driver keeps generating requests without checking/waiting, and the first request hadn't finished when the second one was issued. When the mailbox is cleared by the HyperRAM it zeroed out the second request that had been generated in the meantime, so the video driver just outputs a blank line then a picture line, etc. I think this could actually be a useful feature perhaps in some cases if you wanted hi-colour on a 1080 display, though you could probably also just use a 540p resolution if that halves the pixel rate.
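The every-second-line effect can be reproduced with a toy model. This is only a sketch of the behaviour described above, not the driver's code: it assumes a one-slot mailbox, a video driver that posts a request at the start of every scan line without checking, and a memory driver that clears the mailbox when it finishes. The timing numbers are made up.

```python
def lines_drawn(num_lines, line_time_us, service_time_us):
    """Return the scan lines that actually get serviced in the toy model."""
    drawn = []
    busy_until = 0.0                      # memory driver is idle at t = 0
    for line in range(num_lines):
        post_time = line * line_time_us   # video driver posts without checking
        if post_time < busy_until:
            # Previous request is still running. This one sits in the mailbox
            # and is wiped when the driver clears the mailbox on completion.
            continue
        drawn.append(line)
        busy_until = post_time + service_time_us
    return drawn

# Service time longer than one line but shorter than two: every other line is lost.
print(lines_drawn(6, line_time_us=14.8, service_time_us=20.0))
# Service time shorter than a line: every line is drawn.
print(lines_drawn(4, line_time_us=14.8, service_time_us=10.0))
```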

Ok, several minutes later while formulating this post, the video is still clean on the analog VGA monitor. Nothing flickering.

Update: added photo.

Update2: now seeing some success with graphics copies too. First time I tested it out and it appears to be moving graphics blocks around on the display.

Tubular · 2020-04-13 12:30

Amazing stuff, Roger. HD at 8bpp, you can do a heap with that

So the graphics copies are to and from regions of Hyperram? What kind of buffer is required in the P2?

rogloh · 2020-04-13 22:39

Tubular wrote: »

Amazing stuff, Roger. HD at 8bpp, you can do a heap with that

So the graphics copies are to and from regions of Hyperram? What kind of buffer is required in the P2?

Yeah Lachlan when I started out I really didn't think it would be possible to get everything combined and push that far, the P2 continues to deliver nice surprises. It looked nice and clean on the Sony and the HyperRAM chip was not even warm to touch either. I think the Dell LCD scaler didn't like the 1080 video timing I was using.

Having an 8 bpp frame buffer at 1080p (via VGA) should be fun to play with for those P2 systems and temperature environments that support pushing chips this high. There are no guarantees because this 3.3V RAM is being overclocked and signal integrity/board layout is probably going to be critical. You will also want a well controlled operating temperature range.

Few more details on graphics copies:

The copy can work in a few ways: Hub RAM to/from HyperRAM/Flash, HyperRAM/Flash to itself (same bank), or between two different banks on the Hyper bus (HyperFlash to HyperRAM for example). I've only tried out the HyperRAM to HyperRAM (same bank) case so far and am still testing it, but it looked good on screen.

If the source and destination addresses of the graphics copy are both on the Hyper bus, you will still require a small temporary hub RAM buffer. It defines the width of the region being copied. So for a 100x100 pixel block being copied it would use 100 bytes of hub RAM at 8bpp, or 200 bytes at 16bpp etc. In most cases the copy should only typically need to have a single scan line sized buffer free in hub RAM. This same buffer gets shared for all the scan lines being copied as it iterates over the block's height in pixels.

The per COG and per bank burst limits are enforced throughout the copy so its read and write operations are automatically fragmented into sub-bursts which won't interfere with the normal graphics reads by the video driver or hold off HyperRAM refresh. This was where the copy code became rather tricky in the driver.
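As an illustration of the fragmentation idea only, here is a sketch that splits one transfer into sub-bursts no larger than a configured limit. The function name is hypothetical, and the real driver has more to deal with (separate per COG and per bank limits, yielding to the poller between bursts) than this shows.

```python
def split_into_bursts(start_addr, length, max_burst):
    """Yield (address, size) pieces, each no larger than max_burst bytes."""
    if max_burst <= 0:
        raise ValueError("max_burst must be positive")
    offset = 0
    while offset < length:
        size = min(max_burst, length - offset)   # last piece may be shorter
        yield start_addr + offset, size
        offset += size

# A 2048 byte scan line with a 640 byte burst limit becomes 3 full bursts plus a remainder.
for addr, size in split_into_bursts(0x1000, 2048, 640):
    print(hex(addr), size)
```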

The graphics row pitch is independently settable (byte granular) for both source and destination addresses, so you can pack or unpack data during the copy, i.e. linear layout converted to graphics memory layout, or you can do a full block copy of rectangular regions on screen maintaining its pitch. I intended it to be pretty versatile when it was added in and it saves doing this all manually in SPIN which will otherwise be slower due to lots of mailbox requests and waiting that ties up the processor.
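A small model of a copy with independent source and destination pitch, going through a single scan line sized buffer as described above. Here `mem` is a plain bytearray standing in for external memory and `line_buf` stands in for the hub RAM buffer; all names are hypothetical and this is a sketch of the concept rather than the driver's implementation.

```python
def copy_rect(mem, src, dst, width_bytes, height, src_pitch, dst_pitch):
    """Copy a rectangle inside mem one scan line at a time via a temp buffer."""
    line_buf = bytearray(width_bytes)            # one scan line of the block
    for row in range(height):
        s = src + row * src_pitch                # start of this source row
        d = dst + row * dst_pitch                # start of this destination row
        line_buf[:] = mem[s:s + width_bytes]     # read phase into the buffer
        mem[d:d + width_bytes] = line_buf        # write phase out of the buffer

# Unpack a linear 3x2 byte block (pitch 3) into a layout with an 8 byte pitch.
mem = bytearray(64)
mem[0:6] = b"abcdef"
copy_rect(mem, src=0, dst=16, width_bytes=3, height=2, src_pitch=3, dst_pitch=8)
print(mem[16:19], mem[24:27])
```

Setting both pitches to the frame buffer pitch gives the plain on-screen block copy case; setting the source pitch equal to the width gives the linear-to-graphics unpacking case.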

This graphics copy feature uses a request list entry which should be convenient for GUI/windowing manager COGs. They could create a list of various block copy operations to construct a graphics display from some common off screen primitive graphics items like radio buttons, check boxes etc and to also move/update rectangular window portions around on screen with a mouse. I think it should work out fairly well.

Tubular · 2020-04-14 00:54

Ok. Sounds truly ideal.

We need these new S27KL0642x HyperRAMs to make their appearance... so things are no longer overclocked

evanh · 2020-04-14 01:05

Tubular wrote: »

We need these new S27KL0642x HyperRAMs to make their appearance... so things are no longer overclocked

I don't think that'll have any real impact. I suspect the usable frequency bands will be just as narrow with those when used in same setup. A custom board layout with Prop2 + HR will be where we get real gains - Possibly with the use of a "TinyLogic" buffer chip to insert a clock delay that isn't slower slewing. It'd all be a lot of guesswork for me though, I've never tried to do impedance matching layouts

rogloh · 2020-04-16 04:30

Over the last few days of testing I ran into a couple of problems with my HyperRAM graphics copies which are now fixed. The code was only partially dealing with a special corner case of another special case and this was exposed in stress testing with video. First, a graphics copy in a list being tested with random width transfers broke the list pointer whenever the random width became zero, causing the P2 to go haywire and typically lock up. Second, any graphics copies exceeding a burst size during another COG's video access would corrupt the screen. In the end these were both simple one line fixes but they were quite a bit harder for me to track down.

The first issue demonstrates that any request list setup really needs to be correct at all times when it is passed to the HyperRAM driver. For reads and other graphics copies involving the hub there is definitely the potential to corrupt the P2 memory if the addresses and lengths are wrong. Not a lot I can do there except to first check that the request and bank values are valid and error out if not. But after checking that, the driver has to assume the remaining fields of the list are correct and continues to process accordingly. If the wrong addresses get used then the P2 memory can be corrupted. So these request lists can be very powerful but also very dangerous. Be warned...

evanh · 2020-04-16 04:40

We don't need no protection!

rogloh · 2020-04-16 06:50

Singing...

We don't need no hub protection
We don't need no list control
No more PASM, in the LUT RAM
Coder leave that cog alone
Hey, coder, leave that cog alone!
All in all it's just another bug in the call
All in all it's our best way, to hit the wall

ps.: this is what isolation for several weeks does to you.

Lyrics slightly adjusted.

Yanomani · 2020-04-16 07:50

Trying it, now aloud!

Ready to impress a prospective new wife!

(Best candidate so far...)

rogloh · 2020-04-17 09:31

Your prospective wife pic made a good benchmark test Yanomani. Turns out I can block transfer 721 images per second out of hub RAM into HyperRAM using 1024x768x16bpp at 200MHz with a sysclk/2 write rate and a burst transfer size of 252 bytes. Each transferred image is 126x128 pixels which at 16bpp is ~32kB, yielding over 22MB/s of pixel write rate.
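The arithmetic behind those figures, as a quick sketch (the function name is made up; the inputs are the numbers quoted above):

```python
def blit_rate(width_px, height_px, bits_per_pixel, blits_per_second):
    """Return (bytes per image, bytes written per second)."""
    bytes_per_image = width_px * height_px * bits_per_pixel // 8
    return bytes_per_image, bytes_per_image * blits_per_second

size, rate = blit_rate(126, 128, 16, 721)
print(size, rate)            # 32256 23256576
print(126 * 16 // 8)         # 252 bytes per image row
```

Each 126 pixel row is 252 bytes at 16bpp, the same as the burst size quoted, so an image takes 128 bursts.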

This was a good way to test my gfx copy from Hub to HyperRAM.

Update: noticed pitch was off by 2x in the test which has affected the displayed aspect ratio below. It didn't affect the transfer rate, it's just that the displayed output got squished. Fixed and now looks square.

Tubular · 2020-04-17 23:37

Lol nice relevant choice of test image, and measurement units 'Gator Blts/s'

evanh · 2020-04-18 00:02

rogloh wrote: »

... can block transfer 721 per second out of hub RAM into HyperRAM using 1024x768x16bpp at 200MHz with a sysclk/2 write rate and a burst transfer size of 252 bytes. Each transferred image is 126x128 pixels which at 16bpp is ~32kB, yielding over 22MB/s of pixel write rate.

The video display is pulling 90 MB/s at the same time, right? That's a good 50% of sysclock/1 read bandwidth gone to the display. Presumably more with overheads.

rogloh · 2020-04-18 01:19

Yeah for this XGA resolution with the P2 at 200MHz the video COG reads 2048 bytes every scan line. This video timing displays 1024 active pixels out of 1344 total pixels per scan line, so we are reading in 2kB in every ~20.16us per scan line. This is already contributing a load that is over 101 Million bytes/second out of a theoretical maximum of 200M per second, which is itself unattainable once overheads are counted. I have recently reduced the overheads significantly for the video reads using locked bursts (no yielding to the poller) but they are still going to be in the order of around 10%. With the remaining ~88MB/s of bandwidth, if using sysclk/2 writes, we theoretically would be left with ~44MB/s but with the other write setup and polling yield overheads and Spin request setup overheads from the test etc, it looks like we drop about half again to reach 22MB/s at this 252 byte burst size.
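A back-of-envelope version of that budget. It assumes one byte per sysclk at sysclk/1 and reads the "around 10%" overhead as 10% of the video load, which is an assumption that happens to reproduce the ~88MB/s figure; the function name is hypothetical.

```python
def write_budget(sysclk_hz, bytes_per_line, line_time_s, video_overhead=0.10):
    """Return (video load, bandwidth left over, left over at sysclk/2) in bytes/s."""
    video_load = bytes_per_line / line_time_s            # bytes/s read by the video COG
    raw = sysclk_hz                                      # one byte per clock at sysclk/1
    remaining = raw - video_load * (1 + video_overhead)  # what is left after video + overhead
    return video_load, remaining, remaining / 2          # sysclk/2 writes halve it again

video, remaining, half = write_budget(200_000_000, 2048, 20.16e-6)
print(round(video / 1e6, 1), round(remaining / 1e6, 1), round(half / 1e6, 1))
```

The gap between the ~44MB/s this leaves and the measured 22MB/s is the write setup, polling yield and Spin request overhead mentioned above, which this sketch does not model.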

Even larger burst write sizes could potentially improve this but there are limits that need to be enforced to prevent the HyperRAM !CS low time from exceeding 4us and also to constrain any resulting video service time jitter. Smaller bursts are always going to be less efficient, so tall+skinny graphics image transfers will never copy as fast as short+wide ones for example. Also the round-robin polled writer COGs always have to yield back to the poller between each of their bursts so that video is not impacted during writes. This also reduces the remaining effective bandwidth. Going to sysclk/1 writes in systems that could support it would be the main way for further improvement.

evanh · 2020-04-18 01:35

Contrary to the datasheet's assertions, I'm pretty certain the 4 us limit is not critical at all. The refresh mechanism has proven to handle a lot of abuse and the small hints in the datasheet indicate it is because the refreshing actually operates at double speed or can change gear when needing to catch up.

Yanomani · 2020-04-18 02:53

rogloh wrote: »

Your prospective wife pic made a good benchmark test Yanomani.

rogloh, i owe you a special thanks. You took a weight off my shoulders.

I was a little concerned, due to my off topic comment ...

But you managed to turn a blond-haired lemon into a Swiss lemonade!

rogloh · 2020-04-18 06:22

For those with a P2-EVAL rev B, the HyperRAM board and VGA output capability who want to try out their HyperRAM board, here is a binary demo doing some graphics rendering into a 1024x768 external memory frame buffer. The P2 is safely clocked at 200MHz.

Install your VGA output board on pins 0-7, and the HyperRAM board on pins 32-47.

Serial I/O is not used at the moment, this demo runs automatically in a loop from startup and just cycles between 8 bpp and 16 bpp colour modes.

The absolute benchmark values are not particularly relevant because they depend on the size of the items drawn which are randomised, but they can be used to compare 16bpp performance against 8bpp. All drawing code is written in FastSpin and is not necessarily optimised. The rendered text display in graphics mode is particularly inefficient in my test implementation (it's all per-pixel based rendering into the frame buffer), and there are many ways things can be sped up with better block transfers from hub etc.

You should be able to download the binary image from FlexGUI or directly with loadp2. I have successfully used this command below with my loadp2 version on a Mac (Note: your serial port will be different and you may have a different version of loadp2 with different arguments, mine is old).

loadp2 - a loader for the propeller 2 - version 0.017 2019-09-16

loadp2 -p /dev/cu.usbserial-DN43U3VJ -l 2000000 -b230400 hypvgatest.binary -CHIP

Update: be aware of whether your board resets if the terminal is not held open after download. You can probably add the -t option in loadp2 if that is the case.

evanh · 2020-04-18 06:44

Oh, that's about 10.5 us of data per scan line ... assuming it's done in a single burst.

rogloh · 2020-04-18 06:58

@evanh In this test I was using 640 byte bursts. So that is 3.2 bursts for reading 2048 bytes total per line in 16bpp mode and each burst transfers its 640 bytes of data on the bus in 3.2us at a P2 rate of 200MHz, plus it contains another 0.55us address setup time and other COG processing overheads while CS is low. There is also 0.15us of CS high time between bursts of the video transfer. So it looks like 3.2/3.9 for this burst size and video mode (82% efficiency). I could probably push this higher slightly to free a little more bandwidth for the COG writers. Performance is certainly burst size dependent.
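The same sum as a sketch, so other burst sizes can be tried. It assumes the 0.55us setup and 0.15us CS high figures stay constant regardless of burst size and that the setup time counts toward CS low time; both are assumptions drawn from the numbers above, and the function names are hypothetical. Times are kept in integer nanoseconds to avoid rounding surprises.

```python
def burst_efficiency(burst_bytes, sysclk_mhz, setup_ns=550, cs_high_ns=150):
    """Return (efficiency, CS low time in ns) for one burst at sysclk/1."""
    data_ns = burst_bytes * 1000 / sysclk_mhz     # one byte per clock
    cs_low_ns = data_ns + setup_ns                # setup happens while CS is low
    return data_ns / (cs_low_ns + cs_high_ns), cs_low_ns

def max_burst_bytes(sysclk_mhz, cs_low_limit_ns=4000, setup_ns=550):
    """Largest burst that keeps CS low time within the limit."""
    return (cs_low_limit_ns - setup_ns) * sysclk_mhz // 1000

eff, cs_low = burst_efficiency(640, 200)
print(round(eff, 3), cs_low)       # 0.821 3750.0
print(max_burst_bytes(200))        # 690
```

Under these assumptions a 640 byte burst holds CS low for 3.75us, and about 690 bytes is the most that fits inside a 4us CS low limit at 200MHz, so there is not much headroom left to gain from larger video bursts.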

evanh · 2020-04-18 07:00

rogloh wrote: »

The absolute benchmark values are not particularly relevant because they depend on the size of the items drawn which are randomised, but they can be used to compare 16bpp performance against 8bpp. All drawing code is written in FastSpin and is not necessarily optimised. The rendered text display in graphics mode is particularly inefficient in my test implementation (it's all per-pixel based rendering into the frame buffer), and there are many ways things can be sped up with better block transfers from hub etc.

150,000 random pixels/sec! I'm impressed given the amount of arbitration that I would've thought was needed. How random are they?
