All PASM2 gurus - help optimizing a text driver over DVI?

rogloh · 2019-12-14 08:37

potatohead wrote: »

Does it make sense to distribute the video data differently?

Spread it across all the potential "rows" just as a test? Or does reading not refresh at all?

Unfortunately reading long bursts seems to hold off refresh of the data being read. I would have expected it to be able to refresh after a read like normal SDRAM does, but it seems to have problems if you exceed the CS low time. Or perhaps some other corruption can happen when the CS is low for longer than this set refresh interval. Strangely it seems to pick up the state of other rows, not pure random data.

If we need to manage our own refresh we would need to not use bursts (nor would we want to), we'd have to read through all rows in the device within the desired refresh time as quickly as possible, (e.g. single byte transfer). Some of this could be done during blanking if the driver knows when we are blanking, as well as idle periods when the video COG is not requesting data. Alternatively we just break up the bursts (this adds a bit more overhead). We may need 3-6 bursts to meet the 4us limit, though it is clock and video mode dependent. The 16us is far more generous and we may only need 1 or 2 bursts in that case.
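As a rough illustration of the burst arithmetic above, here is a minimal Python sketch that splits one scan line transfer into bursts that each fit under a CS low limit. The transfer rate, per-burst overhead and line size are illustrative assumptions, not measured values from the driver, and `split_burst` is a hypothetical helper name.

```python
def split_burst(total_bytes, bytes_per_us, max_cs_low_ns, overhead_ns):
    """Split one transfer into bursts that each keep CS low within the limit.

    All parameters are illustrative assumptions:
      bytes_per_us  - sustained transfer rate once data is flowing
      max_cs_low_ns - maximum CS low time (e.g. 4000 or 16000)
      overhead_ns   - per-burst setup time (command/address plus latency)
    """
    usable_ns = max_cs_low_ns - overhead_ns
    if usable_ns <= 0:
        raise ValueError("overhead exceeds the CS low limit")
    # Integer maths avoids floating point rounding at the boundary.
    max_bytes = usable_ns * bytes_per_us // 1000
    if max_bytes == 0:
        raise ValueError("no data fits in one burst at this rate")
    bursts = []
    remaining = total_bytes
    while remaining > 0:
        chunk = min(remaining, max_bytes)
        bursts.append(chunk)
        remaining -= chunk
    return bursts


# Hypothetical case: a 2560-byte scan line at 200 bytes/us, 200 ns overhead.
print(split_burst(2560, 200, 4000, 200))   # 4 us limit
print(split_burst(2560, 200, 16000, 200))  # 16 us limit
```

With these made-up numbers the 4us limit needs four bursts and the 16us limit needs one, which is in line with the estimates above, but the real counts depend on the clock and video mode.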

rogloh · 2019-12-14 08:45

I am currently trying to get the mouse working with USB so I can put together some interactive control of my graphics stuff, which can be more helpful than simple canned tests/demos. But I am having problems with the garryj USB code right now. I can spawn a COG to start the USB from within my application, and my USB mouse seems to work now I patched it for revB (I can get it to output debug co-ordinates over serial etc), but my video structures are all getting trashed, affecting the picture, font etc, if I call the USB code, and they are okay if I don't. I wonder if the USB code is somehow writing to other memory areas outside of what it is meant to...?

Has anyone seen any evidence of this type of behaviour with the USB code? Or is it known to be fully self-contained and well behaved with fastspin?

jmg · 2019-12-14 08:45

rogloh wrote: »

..
If we need to manage our own refresh we would need to not use bursts (nor would we want to), we'd have to read through all rows in the device within the desired refresh time as quickly as possible, (e.g. single byte transfer)..

If doing this, I wonder if even one byte needs to be read ?
There may be some smaller number of clocks needed during the address time, to trigger refresh.
Read into the row buffer may be enough, it may not need any byte to move out of the pins.

evanh · 2019-12-14 08:57

There are multiple places in the datasheet implying that CS low blocks the auto-refresh. Mostly in terms of "all reads and writes", eg:

"Because the DRAM cells cannot be refreshed during a read or write transaction, ..."

Whicker,
I'd say you've got this one exactly right. The datasheet here is effectively saying it'll skip rows when CS is low too long by saying that they can be made up for by running through them manually as a catchup:

The host system may also effectively increase the tCMS value by explicitly taking responsibility for performing all refresh and doing burst refresh reading of multiple sequential rows in order to catch up on distributed refreshes missed by longer transactions.

The catch is I don't see any way to know what was missed by the auto-refresh. This implies the catchup needs to run a full end-to-end refresh sequence of its own to cover all bases.

Here's some supporting quotes from the datasheet (I bolded for highlighting):

Configuration Register 1 (CR1) is used to define the distributed refresh interval for this HyperRAM device. The core DRAM array requires periodic refresh of all bits in the array. This can be done by the host system by reading or writing a location in each row within a specified time limit. The read or write access copies a row of bits to an internal buffer. At the end of the access the bits in the buffer are written back to the row in memory, thereby recharging (refreshing) the bits in the row of DRAM memory cells.

Refresh of all rows could be done as a single batch of accesses at the beginning of each interval, in groups (burst refresh) of several rows at a time, spread throughout each interval, or as single row refreshes evenly distributed throughout the interval. The self-refresh logic distributes single row refresh operations throughout the interval so that the memory is not busy doing a burst of refresh operations for a long period, such that the burst refresh would delay host access for a long period.

So, refresh can be done externally by cycling a read of every row and not giving the auto-refresh any real time.
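A minimal Python sketch of what "cycling a read of every row" means in terms of addresses and time budget. It assumes the 1024-byte row size mentioned elsewhere in this thread; the device capacity and the refresh window used in the example are placeholders to be replaced with the datasheet values for the actual part, and both function names are hypothetical.

```python
def row_addresses(capacity_bytes, row_bytes=1024):
    """Yield one byte address per row: touching each refreshes that row."""
    for addr in range(0, capacity_bytes, row_bytes):
        yield addr


def per_row_budget_ns(capacity_bytes, window_ms, row_bytes=1024):
    """Average time available per row if every row must be touched
    once within window_ms (window_ms is a placeholder, not a quoted spec)."""
    rows = capacity_bytes // row_bytes
    return window_ms * 1_000_000 // rows


# Placeholder numbers: an 8 MiB device and a 64 ms window.
CAPACITY = 8 * 1024 * 1024
rows = sum(1 for _ in row_addresses(CAPACITY))
print(rows, "rows,", per_row_budget_ns(CAPACITY, 64), "ns per row on average")
```

The budget is only an average. The row reads could equally be batched during blanking or idle periods, as long as every row is visited inside the window.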

evanh · 2019-12-14 09:18

rogloh wrote: »

... But I am having problems with the garryj USB code right now. I can spawn a COG to start the USB from within my application, and my USB mouse seems to work now I patched it for revB ...

There were a few variations but the latest version was all set for revB. I'm guessing you've got an older one. https://forums.parallax.com/discussion/170149/p2-hosted-usb-keyboard-mouse/p1

rogloh · 2019-12-14 09:32

Yeah that thread has the codebase I started with @evanh, and I patched to revB by enabling the revB smartpin stuff as below. I am getting this weird display data corruption on the video side and the overall image size is still within 512kB though close to the limit. I've also tried reducing it further by pulling out the large bird file, didn't help. I also added in an external USB power supply on the second USB port power input of the P2-EVAL thinking my Mac may have issues powering the board with HyperRAM and VGA and DVI and USB and high speed P2 all together, didn't help though. The USB side does actually work but appears to trash the HUB memory somehow.

{
                wrpin   ##USB_V1HMODE_FS, #DP           ' The host is also the root hub, so full-speed is its native speed
                wrpin   ##USB_V1HMODE_FS, #DM
                wxpin   ##_12Mbps, #DM                  ' Default to Full-Speed
}
                wrpin   ##USB_V2MODE, #DP               ' Low-speed signalling is always used
                wrpin   ##USB_V2MODE, #DM
                wxpin   ##USB_H_FS_NCO, #DM             ' Host mode and default to 12Mbs baudrate

evanh · 2019-12-14 10:09

Ah, right, it would seem mine is the modified copy.

I've got a #define for switching between the two.

rogloh · 2019-12-14 10:20

This might be some type of stack corruption. At one point it was crashing shortly after start, then I increased the stack size for the USB SPIN code COG, then it took proportionally longer to crash.

Update: got the mouse cursor working on the screen. For some reason the Fastspin/USB code didn't like me putting additional variables in the VAR section. When I put them in the DAT area it worked. Not sure why but not digging into it right now either.

cgracey · 2019-12-14 15:58

With all these HyperRAM issues, how might those 8-pin 8MB QSPI SRAM chips compare now? I think they were cheaper. What about using two of them for an 8-bit data path?

Wuerfel_21 · 2019-12-14 18:06

AFAIK those are actually DRAM, too (they just call it "pseudo SRAM" because the interface is a bit similar to the usual SPI SRAMs (think 23LC1024)) and have the same issue of having to de-select them every couple of µs so as not to lose data. Although if they're cheap, it may be attractive to gang up four of them for a 16-bit bus?
(Assuming Chip means 8MByte, that is. Are there 8MBit 8-pin SRAMs? idk)

jmg · 2019-12-14 19:17

cgracey wrote: »

With all these HyperRAM issues, how might those 8-pin 8MB QSPI SRAM chips compare now? I think they were cheaper. What about using two of them for an 8-bit data path?

Those also have CS rules, but maybe not as onerous ? I think they also lack DDR choice, which limits the top speed.
HyperRAM issues seem to be mainly around figuring out what actually matters to the part, and the data is frustratingly vague...

A good list of serial memories is here, under the 4 vendor tabs http://www.wridy.com/list-354-1.html
8 pin parts seem to top-out at 64Mb, and octa-parts road map to 128Mb

ISSI had HyperRAM parts planned to 256Mb
32Mx8 IS66/67WVH32M8DBLL 2.7-3.6V 100MHz (200 DDR)

Addit: there is more info on Octa parts here
http://www.apmemory.com/html/product_psram.php
That shows 128Mb parts but only in 1.8V bus models, with speeds listed as BGA 1.8V only: -7 (133MHz), -6 (166MHz), -5 (200MHz)

Some of these parts spec
"tRBXwait Row Boundary Crossing Wait Time 30 65 30 65 ns "
and others will not cross a Row boundary at all, simply wrapping instead.

3.0V parts spec speeds of
APS6408L-3OBMx DDR OPI Xccela PSRAM -7 (133MHz) -9 (109MHz)
APS6408L-3OCx Octal DDR PSRAM -7(133MHz) -9(109MHz)

It may be a good idea to limit the row boundary crossing in all SW, to make this more generally portable ?

Addit2:
Winbond show a roadmap that supports 166MHz at 3.0V point
W956D8MBY 1.8V 200MHz -40~85C, Automotive 8Mx8 UD S
W956A8MBY 3V 166MHz -40~85C, Automotive 8Mx8 UD S
W957D8MFY 1.8V 200MHz -40~85C, Automotive 16Mx8 UD S
W957A8MFY 3V 166 / 200MHz -40~85C, Automotive 16Mx8 UD S

Winbond's road map takes HyperRAM out to 128Mb/256Mb/512Mb, all at 200MHz, but only in 1.8V (so P2 support at 1.8V is going to be important long term)

jmg · 2019-12-14 19:22

jmg wrote: »

rogloh wrote: »

..
If we need to manage our own refresh we would need to not use bursts (nor would we want to), we'd have to read through all rows in the device within the desired refresh time as quickly as possible, (e.g. single byte transfer)..

If doing this, I wonder if even one byte needs to be read ?
There may be some smaller number of clocks needed during the address time, to trigger refresh.
Read into the row buffer may be enough, it may not need any byte to move out of the pins.

I found more on this, scattered in the data:
"The read or write access copies a row of bits to an internal buffer. At the end of the access the bits in the buffer are written back to the row in memory, thereby recharging (refreshing) the bits in the row of DRAM memory cells. "

and I think the 'end of the access' they mean here, is the time labeled tACC = Access on the waveforms.
It may be possible to poll RWDS to speed this user-refresh more, some of the time.

evanh · 2019-12-14 20:26

Big bursts can be done reliably. Just have to finish by also following through with some downtime to refresh remaining rows is all.

evanh · 2019-12-14 20:37

Oh, duh, every read or write is also a row refresh! I just realised that, in the case of the video frame buffer, the unused areas aren't accounted for. We're only seeing what is being used for the frame buffer. Which is guaranteed to be refreshed just by the display action.

So we have no idea how badly the rest of the DRAM is performing.

That also leads to the question of why does any of the frame buffer have issues ever?

jmg · 2019-12-14 21:47

evanh wrote: »

Oh, duh, every read or write is also a row refresh! I just realised that, in the case of the video frame buffer, the unused areas aren't accounted for. We're only seeing what is being used for the frame buffer. Which is guaranteed to be refreshed just by the display action.

Based on rogloh's reports, even that assumption is not quite right.

evanh wrote: »

So we have no idea how badly the rest of the DRAM is performing.

True.

evanh wrote: »

That also leads to the question of why does any of the frame buffer have issues ever?

It seems simple read is not enough.

That may be because either/or combination of
a) the internal refresh timer interacts with user reads, and somehow skips/misses refresh.
b) All reads may not refresh.

Data is fairly clear an address-read does refresh within the address+access window, (before data outputs), but a roll-over read seems to lack any time for the write-back, so that may not refresh ?

evanh · 2019-12-14 22:02

The datasheet is explicit about the read/write action. I quoted it above - https://forums.parallax.com/discussion/comment/1484964/#Comment_1484964

evanh · 2019-12-14 23:19

Once the row is accessed, it has all the remaining row select time to refresh that row. Presumably the refresh is as it says, immediately after access. There is plenty of spare time in that window even if a new CS toggle occurs abruptly.

EDIT: Hmm, burst writes can't do that. Obviously there is no point in writing the row until the row latches are all updated with the host writes. So maybe when they say access, maybe they mean the remaining select duration instead.

jmg · 2019-12-14 23:51

evanh wrote: »

Once the row is accessed, it has all the remaining row select time to refresh that row. Presumably the refresh is as it says, immediately after access. There is plenty of spare time in that window even if a new CS toggle occurs abruptly.

Yes, but that's only partial info.
Less clear is if refresh on wrap into next row occurs, and reading other vendors data, I think the answer is no.

I've expanded my post above with more x8 PSRAM memories & notes.
Many simply do not move across row boundaries, so avoid this issue.
One vendor has it as a register option, but adds "tRBXwait Row Boundary Crossing Wait Time 30 65 30 65 ns " ie a std access time is added, & memory stalls for that duration.

Given those caveats and rules, it looks like it could be a good idea to avoid code that assumes linear burst, and instead always work within a row.
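To make "always work within a row" concrete, here is a minimal Python sketch that cuts a linear transfer into pieces that never cross a row boundary. It assumes the 1024-byte row size discussed in this thread, and `split_at_rows` is a hypothetical helper, not part of any existing driver.

```python
def split_at_rows(start, length, row_bytes=1024):
    """Return (address, byte_count) pieces that each stay inside one row."""
    pieces = []
    addr = start
    end = start + length
    while addr < end:
        # First address of the next row after 'addr'.
        row_end = (addr // row_bytes + 1) * row_bytes
        count = min(end, row_end) - addr
        pieces.append((addr, count))
        addr += count
    return pieces


# A 100-byte transfer starting 24 bytes before a row boundary.
print(split_at_rows(1000, 100))
```

Each piece would be issued as its own transaction, so the memory never has to wrap or stall at a row crossing, at the cost of one extra setup per row touched.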

rogloh · 2019-12-15 00:29

At this point there is probably no need to get too worried about the refresh/chip select low time issues encountered so far. I think HyperRAM can readily be used with the P2 for video and other uses. We will either do our own refresh or break up the burst transfers to limit the ChipSelect low time. It's probably not that big of an issue really to break up the bursts. I can easily alter my HyperRAM arbiter memory driver to suit doing that. Breaking up a long burst is simple; the only downside is some additional setup overhead that eats into the scan line budget a bit. We can also possibly use this slower refresh 16us configuration register in the HyperRAM to help significantly, though I don't know how stable doing that will be at higher temps.

Actually breaking up large bursts is a desirable thing to do for the non-video applications where you still want to give other COGs some low latency access and not starve them too long while one COG is doing a really long burst (eg. >1kB), however the issue then is that the memory access becomes non-atomic. A "locking" burst transfer request option may be required to be added in cases where this is important, or locks could be done out of band by the applications themselves if two COGs have to share some critical external memory area. The high priority video COG transfers that take precedence in the scheduling could always be locking in order to maintain their performance, while other COGs could do it either way if we add that capability at some point.

The bigger issue I see is how to transfer reliably at sysclk/1 across a wide range of temps, boards, and device variation... that's probably going to cause the most headaches. We don't know if the values ozpropdev already characterized apply across a wide temperature range and over all the different HyperRAM parts. Plus any PCB wiring delays from extra capacitance etc might start to come into play at some point if the timing gets really tight. Sysclk/2 might be easier to target, but limits bandwidth. Another tradeoff as usual.

jmg wrote: »

evanh wrote: »

Oh, duh, every read or write is also a row refresh! I just realised that, in the case of the video frame buffer, the unused areas aren't accounted for. We're only seeing what is being used for the frame buffer. Which is guaranteed to be refreshed just by the display action.

Based on rogloh's reports, even that assumption is not quite right.

I'm not 100% sure this corruption is a refresh issue directly where contents leak away, because these affected scanline portions often seem to take on the values of other areas of memory. It's possibly like something is trying to refresh but gets corrupted in the process perhaps and uses data read from elsewhere. Also it is not a full 1kB row that is corrupted but a smaller yet contiguous portion part way through the transfer. Might be some internal timing glitch during a long burst when refresh was trying to do something while we roll over within a row (on a page or half page boundary perhaps), or across a row. It does hit specific pixel regions each time indicating some type of boundary.

evanh · 2019-12-15 00:43

Every row is another access. I don't see what is unclear about that.

jmg · 2019-12-15 00:44

rogloh wrote: »

.... We can also possibly use this slower refresh 16us configuration register in the HyperRAM to help significantly, though I don't know how stable doing that will be at higher temps....

Looking at other vendor data and road map devices, it seems that 16us option is far from universal. Shame, as it looked to be useful.
Most do seem to be sticking to 4us as a spec point, and 1024 byte page Row Buffer Read sizes. Not all support linear burst across page boundaries & some that do, pause when doing so.

evanh · 2019-12-15 01:08

I'm currently needing many seconds to get fade errors. Just been testing with a couple of 2 million byte blocks. At room temp, two seconds hammering one block isn't enough to generate any errors in the starved block.

Giving it a good blast with the hair dryer can get specific weak spots to fade in the starved block within two seconds. The hammered block is perfect throughout. Block cycle time is close to 100 ms.

PS: Testing method: Block data is procedural random (XORO32) that I don't re-seed. Each fresh block write continues from prior state. I take a copy of the random state variable at the beginning of the starved block write. This gets used later for seeding the compare of the same starved block after the rounds of read hammering.
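The testing method described above can be modelled on a PC with a minimal Python sketch. Python's `random.Random` stands in for the P2's XORO32 (it is not the same generator), a `bytearray` stands in for the HyperRAM, and the block size is scaled down from the 2 million bytes used on hardware. All names are hypothetical.

```python
import random

BLOCK = 4096  # scaled-down block size for the model


def fill_block(mem, start, length, rng):
    """Write the next 'length' bytes of the running random stream."""
    for i in range(length):
        mem[start + i] = rng.getrandbits(8)


def count_errors(mem, start, length, saved_state):
    """Replay the stream from the saved state and count mismatches."""
    rng = random.Random()
    rng.setstate(saved_state)
    return sum(1 for i in range(length)
               if mem[start + i] != rng.getrandbits(8))


mem = bytearray(2 * BLOCK)
rng = random.Random(1)               # seeded once, never re-seeded
fill_block(mem, 0, BLOCK, rng)       # block that will be hammered
starved_state = rng.getstate()       # snapshot before the starved block
fill_block(mem, BLOCK, BLOCK, rng)   # starved block continues the stream

# On hardware, rounds of read hammering on block 0 would happen here.
print("errors in starved block:", count_errors(mem, BLOCK, BLOCK, starved_state))
```

Saving only the generator state means no reference copy of the block is needed, which is what makes the method practical on the P2 with large blocks.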

rogloh · 2019-12-15 03:25

jmg wrote: »

Looking at other vendor data and road map devices, it seems that 16us option is far from universal. Shame, as it looked to be useful.
Most do seem to be sticking to 4us as a spec point, and 1024 byte page sizes. Not all support linear burst across page boundaries & some that do, pause when doing so.

Note a "page" is not an entire "row". Eg. From the ISSI HyperRAM data sheet pg10 footnotes:

1. A Row is a group of words relevant to the internal memory array structure and additional latency may be inserted by RWDS when crossing Row boundaries - this is device dependent behavior, refer to each HyperBus device data sheet for additional information. Also, the number of Rows may be used in the calculation of a distributed refresh interval for HyperRAM memory.
2. A Page is a 16-word (32-byte) length and aligned unit of device internal read or write access and additional latency may be inserted by RWDS when crossing Page boundaries - this is device dependent behavior, refer to each HyperBus device data sheet for additional information.
3. The Column address selects the burst transaction starting word location within a Row. The Column address is split into an upper and lower portion. The upper portion selects an 8-word (16-byte) Half-page and the lower portion selects the word within a Half-page where a read or write transaction burst starts.

We still need to see if this corruption aligns with their page (32 byte) boundary. I think that may have something to do with it, even though they don't seem to pause during burst transfers with extra RWDS latency inserted there (on this ISSI device at least).
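For checking whether a corrupted offset lines up with one of these boundaries, a minimal Python sketch of the address breakdown from the footnotes may help. The 32-byte page, 16-byte half-page and 2-byte word sizes come from the footnotes quoted above; the 1024-byte row size is the figure discussed in this thread and is device dependent. `decompose` is a hypothetical helper.

```python
ROW_BYTES = 1024        # device dependent; value discussed in this thread
PAGE_BYTES = 32         # 16 words
HALF_PAGE_BYTES = 16    # 8 words
WORD_BYTES = 2


def decompose(byte_addr):
    """Break a byte address into row, page, half-page and word indices."""
    row, col = divmod(byte_addr, ROW_BYTES)
    return {
        "row": row,
        "page": col // PAGE_BYTES,              # page within the row
        "half_page": col // HALF_PAGE_BYTES,    # upper column address
        "word": (col % HALF_PAGE_BYTES) // WORD_BYTES,  # lower column address
        "page_aligned": col % PAGE_BYTES == 0,
    }


# Example: where does byte offset 1074 fall?
print(decompose(1074))
```

Running the start offsets of the corrupted scan line portions through this would show whether they consistently have `page_aligned` set.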

evanh · 2019-12-15 03:35

My best guess as to what the purpose of a page is: The buffer size for maintaining continuous burst data while sequencing the row to row steps. Which will include both the refreshing write back as well as accessing the next row.

EDIT: I haven't read anything that says page alignment helps in any way but it might make those steps consistent and non-blocking.

EDIT2: There was something about wrapped transfers needing it I think.

EDIT3: Here's a quote from datasheet

Linear burst accepts data in a sequential manner across page boundaries.

jmg · 2019-12-15 05:18

rogloh wrote: »

Note a "page" is not an entire "row".

fixed above.

rogloh wrote: »

2. A Page is a 16-word (32-byte) length and aligned unit of device internal read or write access and additional latency may be inserted by RWDS when crossing Page boundaries - this is device dependent behavior, refer to each HyperBus device data sheet for additional information.

That sounds generic, and I see Winbond parts say slightly different things

32Mb part
"A Page is a 16-word (32-byte) length and aligned unit of device internal read or write access and additional latency may be inserted by RWDS when refresh is undergoing."
So their 'device dependent behavior' is a little different, and their 32Mb data suggests refresh can catch you by surprise .... sounds like a part best avoided ?

64Mb part differs
"2. The initial read access time starts when the Row and Upper Column (Half-page) address bits are captured by a slave interface. Continuous linear read burst is enabled by memory devices internally interleaving access to 16 byte half-pages."

rogloh · 2019-12-15 05:28

It seems to me like this must be the main challenge with HyperRAM design (refreshing and sustaining long bursts). Perhaps these silicon vendors are still figuring out how to do it and ISSI can at least manage sustained row transfers without gaps. A sustained burst is great for video RAM transfers. Requiring the transfer to stop for refresh or simply pause when crossing small page boundaries within the RAM chip is not great if you don't have a hardware memory controller that is handling this for you, or when you'd like to stream out video directly from the device without a buffer in between. The P2 doesn't have any time to check for RWDS validity when it uses the streamer unless it is clocked very slowly. We can at least break up the bursts if required.

whicker · 2019-12-15 06:10

It isn't that dire. During a burst write, RWDS is being driven by the master, so the HyperRAM can't even protest. It simply must take the data.

When running the chip in fixed latency mode and not doing the stupid legacy wrapping, the only true gotcha may be crossing the 1024 byte row mark (which you can anticipate), and crossing from one die to the other.

I wish I had the time and resources for writing a comprehensive memory controller cog program. I've been eyeing getting an MSO for years but their prices after software options are outrageous.

evanh · 2019-12-15 06:29

For the Rigols, you buy the base model and all software options can be hack-enabled. All the details were on the web back when they were new at least. Things may have changed though, I haven't looked in years.

Tubular · 2019-12-15 22:23

Looks good Roger. Watched it for 15 minutes or so and didn't spot any glitches

I went to capture it, but the capture device rejects the resolution despite going to 1080p

VESA timing would seem to be 108 rather than 110 MHz, and the TCL monitor reports 1280x1024 at 61 Hz... could this be the issue?

rogloh · 2019-12-15 22:55

Yeah, it's a bit over 60Hz refresh at 110MHz; we should try with a 108MHz pixel clock if it needs 60Hz refresh. I'll try to send you a 60Hz build shortly once I get back to it.
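The 61 Hz reading is consistent with simple arithmetic. This minimal Python sketch assumes the standard VESA totals for 1280x1024 at 60 Hz (1688 x 1066 including blanking); the driver's actual totals may differ, so treat them as an assumption.

```python
# Assumed VESA totals for 1280x1024@60 (active plus blanking).
H_TOTAL = 1688
V_TOTAL = 1066


def refresh_hz(pixel_clock_hz, h_total=H_TOTAL, v_total=V_TOTAL):
    """Vertical refresh rate for a given pixel clock and frame totals."""
    return pixel_clock_hz / (h_total * v_total)


for clock in (108_000_000, 110_000_000):
    print(f"{clock / 1e6:.0f} MHz -> {refresh_hz(clock):.2f} Hz")
```

Under those totals, 108 MHz gives about 60.02 Hz and 110 MHz about 61.13 Hz, which matches what the monitor reports.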

Now I have the mouse working I was thinking of putting in some on screen test controls in the top display region (sourced from hub RAM) to adjust the HyperRAM clocking dynamically and maybe the HyperRAM max CS time too. Could be handy just for these experiments...

One issue I have is that if I want to adjust resolutions dynamically with the mouse, I may have to shut down the video COG and change the P2 clock speed. This will affect the USB mouse COG. Hopefully if I respawn it, it would detect the mouse without needing to reset or hot plug the mouse again...
