HyperRam/Flash as VGA screen buffer (Now XGA, 720p &1080p) &Rev.B

evanh · 2019-03-28 08:16

Yeah, working on doing a reliable autosweep right now.

evanh · 2019-03-28 09:33

Here we go.

evanh · 2019-03-28 11:33

Cooled it down with some ice packs and got another 30 MHz at the top.

jmg · 2019-03-28 20:03

evanh wrote: »

Cooled it down with some ice packs and got another 30 MHz at the top.

Great scans

These are the waterfalls I derive from that, and the timing apertures they imply (I dropped the upper 4 pins, as they have unusual PCB loadings)

1/379M-1/357M = 0.162ns timing aperture spread to cover 60 pins, register mode
1/327M-1/287M = 0.426ns timing aperture spread to cover 60 pins, non-register mode
TmNon-TmReg = 0.5399ns shift in mean aperture sample point, relative register to non register. (+2 Sysclks)

As expected, register pin mode is much tighter spread, with 162ps, vs 426ps non-register mode

That register spread is quite good, one sixth of 1 nanosecond across any pin.

There is also an absolute shift in the mean sampling point of just over half a nanosecond (0.5399ns + 2 Sysclks), with no skirt overlap, between the Reg<->NoReg choices. That might be useful in some testing or design cases.

Derived from   pin_echo2.txt
 Registered
380 MHz 6   00000000000000000000000000000000000000000000000000000000000100
379 MHz 6   00000000000000000000000000000000000000000000000000000000000100
378 MHz 6   00000000000000000000000000000000000000000000010000000000000100
377 MHz 6   00000000001100000010000000000000000000000000010000000000000100
376 MHz 6   00000000001110000010000000000000000000000000010000000000000100
375 MHz 6   00000000001110000110000000000000000000000000010000000000000100
374 MHz 6   00000000001110000010000000000000000000000000010010000000000100
373 MHz 6   00000000001110001110000000010000000000000000110010000000000100
372 MHz 6   00000000001111111110000000010000000000000000111010000110000100
371 MHz 6   00000000001111111110010000111100000000000000111010010110001100
370 MHz 6   00010100011111111110011001111100010011000000111111110110011101
369 MHz 6   00010100011111111110011011111110110011000000111111111110011101
368 MHz 6   00010100011111111110011111111110111011000010111111111110011101
367 MHz 6   00011100011111111110111111111111111011000011111111111110011101
366 MHz 6   00011100011111111110111111111111111011000011111111111110011111
365 MHz 6   00011100011111111111111111111111111111000011111111111110011111
364 MHz 6   00011101111111111111111111111111111111010011111111111110011111
363 MHz 6   00011101111111111111111111111111111111010011111111111110011111
362 MHz 6   00011111111111111111111111111111111111110011111111111110111111
361 MHz 6   00011111111111111111111111111111111111110011111111111110111111
360 MHz 6   00011111111111111111111111111111111111110011111111111110111111
359 MHz 6   00011111111111111111111111111111111111110111111111111111111111
358 MHz 6   00011111111111111111111111111111111111110111111111111111111111
357 MHz 6   00011111111111111111111111111111111111111111111111111111111111 << ignore upper pins
356 MHz 6   00011111111111111111111111111111111111111111111111111111111111
355 MHz 6   00111111111111111111111111111111111111111111111111111111111111
354 MHz 6   01111111111111111111111111111111111111111111111111111111111111


Unregistered
328 MHz 4   00000000000000000000000000000000000000000000000000000000000000
327 MHz 4   00000000000000000000010001000000000000000000000000000000000000
...
297 MHz 4   00000011011111111111111111111110111111111111101111101111101100
296 MHz 4   00001011011111111111111111111111111111111111111111111111111100
295 MHz 4   00001011011111111111111111111111111111111111111111111111111100
294 MHz 4   00001011011111111111111111111111111111111111111111111111111100
293 MHz 4   00001011111111111111111111111111111111111111111111111111111100
292 MHz 4   00001011111111111111111111111111111111111111111111111111111100
291 MHz 4   00001011111111111111111111111111111111111111111111111111111100
290 MHz 4   00001011111111111111111111111111111111111111111111111111111101
289 MHz 4   00001011111111111111111111111111111111111111111111111111111111
288 MHz 4   00001011111111111111111111111111111111111111111111111111111111
287 MHz 4   00001111111111111111111111111111111111111111111111111111111111

1/379M-1/357M = 0.162ns timing aperture spread to cover all pins, register mode
1/327M-1/287M = 0.426ns timing aperture spread to cover all pins, non-register mode 
TmReg = 1/((379M+357M)/2) = 2.7174ns
TmNon = 1/((327M+287M)/2) = 3.2573ns
TmNon-TmReg   = 0.5399ns shift in mean aperture sample point, relative register to non register. (+ 2 sysclks)
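
The aperture figures above are period differences between the frequency where the first counted pin flips and the frequency where all counted pins have flipped. A minimal sketch of that arithmetic follows; the function names are made up for illustration, and the frequencies are the ones read off the waterfalls in this post.

```python
def period_ns(mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1e3 / mhz

def aperture_spread_ns(f_first_mhz, f_all_mhz):
    """Spread between the first pin flipping and all counted pins flipped.

    f_first_mhz: highest frequency at which the first counted pin flips
    f_all_mhz:   frequency at which every counted pin has flipped
    """
    return period_ns(f_all_mhz) - period_ns(f_first_mhz)

def mean_sample_point_ns(f_first_mhz, f_all_mhz):
    """Period at the midpoint frequency, as used for TmReg and TmNon."""
    return period_ns((f_first_mhz + f_all_mhz) / 2)

reg_spread = aperture_spread_ns(379, 357)   # registered pin mode
non_spread = aperture_spread_ns(327, 287)   # non-registered pin mode
tm_reg = mean_sample_point_ns(379, 357)
tm_non = mean_sample_point_ns(327, 287)

print(f"registered spread     {reg_spread:.3f} ns")
print(f"non-registered spread {non_spread:.3f} ns")
print(f"mean shift            {tm_non - tm_reg:.4f} ns (+ 2 sysclks)")
```

Written this way the spread comes out positive, since the lower frequency has the longer period.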

evanh · 2019-03-28 20:26

Some of that 162 ps could even be due to XDIV = 20. I had to make it a larger value to achieve 1 MHz steps.

jmg · 2019-03-28 20:53

evanh wrote: »

Some of that 162 ps could even be due to XDIV = 20. I had to make it a larger value to achieve 1 MHz steps.

True, but the VCO only needs to be stable over a handful of sysclk cycles here, and the larger XDIV mainly increases lower frequency noise, or the variation over a few thousand sysclks.
It is useful to include any XDIV variation anyway.

If we take ~ 3 sysclks at ~ 10ns, a litmus test of 0.1% jitter variation (poor) on that, maps to ~ 10ps.
That 162ps aperture is going to 'walk' with temperature (PVT) as well.
It is nice to know what the pin-spread component is & how registered helps reduce spread.

Your two plots above are at different cooling, right ?
Do you have a guess/estimate of the dTemp on those 2 captures ?
Did any code change between them ? It may be possible to extract a ps/°C indicator tempco from the two tables ?

Yanomani · 2019-03-28 21:25

Thanks evanh and jmg for crafting the tables, they will help me a lot.

I especially enjoyed the waterfalls, because they ease the study of when and where things start to change.

jmg · 2019-03-28 22:39

Here is a comparison of the two plots, warmer and cold.
There seems to be about 16ps of variance (VCO jitter?) in the waterfall patterns, ie the same pin is last in both plots, but the exact ripples vary a couple of lines across the pins

The shift due to test temperature looks to be ~ 143ps, which needs the temperature change before it can be turned into a tempco.
Icepacks may imply 25°C or 30°C change ?
That's a ballpark indication of 5 ps/°C, (=28.6°C) or over a 50°C operating range => 250ps, or over 75°C => 375ps

pin_echo2.txt   Registered  COLD 
363 MHz 6   00011101111111111111111111111111111111010011111111111110011111 <<
362 MHz 6   00011111111111111111111111111111111111110011111111111110111111
361 MHz 6   00011111111111111111111111111111111111110011111111111110111111
360 MHz 6   00011111111111111111111111111111111111110011111111111110111111
359 MHz 6   00011111111111111111111111111111111111110111111111111111111111
358 MHz 6   00011111111111111111111111111111111111110111111111111111111111
357 MHz 6   00011111111111111111111111111111111111111111111111111111111111
                                                    ^- last pin 

pin_echo.txt  Registered WARMER
347 MHz 6   00011101111111111110111111111111111111010011111111111110011111
346 MHz 6   00011101111111111110111111111111111111010011111111111110111111
345 MHz 6   00011111111111111111111111111111111111010011111111111110111111 <<
344 MHz 6   00011111111111111111111111111111111111110011111111111111111111
343 MHz 6   00111111111111111111111111111111111111110111111111111111111111
342 MHz 6   01111111111111111111111111111111111111110111111111111111111111
341 MHz 6   01111111111111111111111111111111111111110111111111111111111111
340 MHz 6   01111111111111111111111111111111111111111111111111111111111111
                                                    ^- last pin 

General wobble in pins 
 1/347M-1/345M = 16.706 ps 

d Aperture/ dT ?°C   1/363M-1/345M = 143 ps
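
The wobble and temperature-shift numbers above can be reproduced with a few lines. This is only a sketch: the 25 °C and 30 °C deltas are the guesses from this post, not measured values, and the helper names are hypothetical.

```python
def period_ps(mhz):
    """Clock period in picoseconds for a frequency given in MHz."""
    return 1e6 / mhz

def tempco_ps_per_c(shift_ps, d_temp_c):
    """Aperture shift per degree C for an assumed temperature delta."""
    return shift_ps / d_temp_c

# Pattern wobble between the two plots (rows 347 MHz and 345 MHz, warmer run)
wobble_ps = period_ps(345) - period_ps(347)

# Shift of the marked (<<) rows: 363 MHz cold vs 345 MHz warmer
shift_ps = period_ps(345) - period_ps(363)

print(f"wobble {wobble_ps:.3f} ps, temperature shift {shift_ps:.1f} ps")
for d_temp_c in (25.0, 30.0):  # assumed deltas, not measurements
    k = tempco_ps_per_c(shift_ps, d_temp_c)
    print(f"dT {d_temp_c:.0f} C -> {k:.2f} ps/C, "
          f"{k * 50:.0f} ps over 50 C, {k * 75:.0f} ps over 75 C")
```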

I wonder if this MHz sweep technique can give a margin indicator to show where to avoid ?

So I then scan the file for ~50% bits flipped, as a better quality indicator of the middle of the aperture (highest slope of change)
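
The scan described above could look something like the sketch below. It assumes each data row has the form `<freq> MHz <sysclks> <bit string>`, as in the tables in this thread, that the rightmost bit is pin 0, and that rows are grouped by the sysclk column; the function names are hypothetical and this is not the script actually used.

```python
def parse_line(line):
    """Split '372 MHz 6   0101...' into (mhz, sysclks, bits), else None."""
    parts = line.split()
    if len(parts) < 4 or parts[1] != "MHz":
        return None
    if not (parts[0].isdigit() and parts[2].isdigit()):
        return None
    bits = parts[3]
    if set(bits) - {"0", "1"}:
        return None
    return int(parts[0]), int(parts[2]), bits

def midpoint_rows(lines, pins_used=None):
    """For each sysclk count, find the row closest to 50% of bits set.

    pins_used: if given, only the lowest-numbered pins (rightmost bits)
               are counted, e.g. to drop the unusually loaded upper pins.
    Returns {sysclks: (mhz, ones)}.
    """
    best = {}
    for line in lines:
        row = parse_line(line)
        if row is None:
            continue  # skip headers, '...' and blank lines
        mhz, sysclks, bits = row
        if pins_used is not None:
            bits = bits[-pins_used:]
        ones = bits.count("1")
        distance = abs(ones - len(bits) / 2)
        if sysclks not in best or distance < best[sysclks][0]:
            best[sysclks] = (distance, mhz, ones)
    return {k: (mhz, ones) for k, (_, mhz, ones) in best.items()}

# Tiny synthetic example; real input would be the lines of pin_echo2.txt
sample = [
    "4 MHz 6   0001",
    "3 MHz 6   0011",
    "2 MHz 6   0111",
    "some header text",
    "9 MHz 5   1100 << note",
]
print(midpoint_rows(sample))
```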

pin_echo2.txt   Registered  COLD
372 MHz  -> 6 : 17 1's
371 MHz  -> 6 : 23 1's

317 MHz  -> 5 : 15 1's
315 MHz  -> 5 : 25 1's
316 MHz  -> 5 : 29 1's
- remains at 5 : 1's in bottom 58 pins, down to 80MHz test stop.
 1/372M-1/317M = 0.466ns ??

312MHz Unregistered  -> 4 : 17 1's 
158MHz Unregistered  -> 3 : 14 1's 

 1/158M-1/315M = 3.1545ns - more expected spacing

The Unregistered sweep to next aperture, gives a healthy 3.15ns between 4 & 3 changes, which nicely ~equals the period at 312MHz - ie an 'expected' result.

The puzzle here, is the Registered 6 sysclk & 5 sysclk apertures change points, seem to be only 466ps apart ?!
Below ~ 316MHz, Registered remains stable, (just pin 63 is late change at ~150MHz due to high loading effects on that pin, & the top 6 pins are clearly different )
Not sure how to explain how it can pick up another sysclk in just 466 ps period change ?
This is pushing up to the top-spec and well overclocked, so maybe other effects come into play ?
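
To put the puzzle in numbers, here is a small sketch comparing each aperture step with one sysclk period, using the frequencies quoted in this post (315 MHz is used for the unregistered step, as in the formula above). The variable names are just for illustration.

```python
def period_ns(mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1e3 / mhz

# Registered: 6-sysclk midpoint near 372 MHz, 5-sysclk midpoint near 317 MHz
reg_step_ns = period_ns(317) - period_ns(372)

# Unregistered: 4-sysclk midpoint near 315 MHz, 3-sysclk midpoint near 158 MHz
unreg_step_ns = period_ns(158) - period_ns(315)

one_sysclk_ns = period_ns(315)

print(f"registered step   {reg_step_ns:.3f} ns "
      f"({reg_step_ns / one_sysclk_ns:.0%} of a sysclk at 315 MHz)")
print(f"unregistered step {unreg_step_ns:.3f} ns "
      f"({unreg_step_ns / one_sysclk_ns:.0%} of a sysclk at 315 MHz)")
```

The unregistered step is close to a full period because 158 MHz is about half of 315 MHz, so 1/158M is roughly 2/315M. The registered step is only a small fraction of a period, which is the odd part.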

Below 300+ MHz, the Registered delay is well behaved.

Just above 300MHz is where both Registered & UnRegistered have an aperture point, so that's probably an operating point best avoided.

Rayman · 2019-03-29 00:25

Here's 1080p @ 4bpp. Can't do a pretty bird in 4bpp...

This is full screen, whereas couldn't quite get to full screen at 8bpp.

Need 300 MHz to get my monitor to like it.
Still needed that NOP in read burst routine.
Had to reorder nibbles in video loop. Glad that option was there!
Had to add to front porch Hsync to make right.

rogloh · 2019-03-29 01:27

Impressive, @Rayman . How much memory bandwidth was left for writes in this 4bpp 1080p mode?

Rayman · 2019-03-29 01:50

See scope traces..
Maybe half line each line

whicker · 2019-03-29 04:18

I was looking at your code rather in depth today.
I noticed this line in most of your different display resolution sources:

              'Wait for Pin_RWDS to go high
              setse1    #$80+Pin_RWDS
              waitse1

When I look at the P2 documentation V32, I see that setse1 has a pattern like this:

%001_PPPPPP = INA/INB bit of pin %PPPPPP rises
%010_PPPPPP = INA/INB bit of pin %PPPPPP falls
%011_PPPPPP = INA/INB bit of pin %PPPPPP changes

%10x_PPPPPP = INA/INB bit of pin %PPPPPP is low
%11x_PPPPPP = INA/INB bit of pin %PPPPPP is high

so the line with $80 is basically:

              setse1    #%1000_0000+Pin_RWDS
              'same as:
              setse1    #%010_000000+Pin_RWDS

So despite the comment, it's waiting for a falling edge of RWDS?
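
As a cross-check of that reading, here is a small decoder for the 9-bit selector, built only from the pattern table quoted above. The pin number 40 is a made-up stand-in for Pin_RWDS, whose real value is not shown in this post.

```python
def decode_setse(value):
    """Decode a 9-bit SETSE pin-event selector into (condition, pin).

    Follows the %MMM_PPPPPP table quoted above; mode %000 is not part of
    that table, so it is reported as such rather than guessed.
    """
    mode = (value >> 6) & 0b111
    pin = value & 0b111111
    if mode == 0b001:
        condition = "rises"
    elif mode == 0b010:
        condition = "falls"
    elif mode == 0b011:
        condition = "changes"
    elif mode in (0b100, 0b101):
        condition = "is low"
    elif mode in (0b110, 0b111):
        condition = "is high"
    else:
        condition = "not in quoted table"
    return condition, pin

PIN_RWDS = 40  # hypothetical pin number, for illustration only

print(decode_setse(0x80 + PIN_RWDS))  # what the code in question does
print(decode_setse(0x40 + PIN_RWDS))  # what the comment describes
```

With $80 the mode bits come out as %010, the falling-edge selector; a rising-edge wait would be $40 plus the pin number.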

evanh · 2019-03-29 04:25

jmg wrote: »

Icepacks may imply 25°C or 30°C change ?

Yep, I'd think 30 °C. I have a little PTC based weather thermometer resting on top of the chip. And the ice packs were top and bottom. Bottom one was touching the eval board. I left it that way for a while before the cold run. Thermometer read about 2.0 °C. Got a little lower in the end.

The first run, at room temperature, the thermometer was probably in the low 30's.

evanh · 2019-03-29 04:30

I've been planning to buy some fine thermocouple wire, that I'll solder to the centre underside of the board. This'll give much more direct reading. I must get it ordered ...

evanh · 2019-03-29 05:05

jmg wrote: »

Did any code change between them ?

Only change was the start and end MHz of the runs. I tried 400 MHz too but that locked up.

BTW: There are 62 pins in the graphs, numbered 61 .. 0.

Rayman · 2019-03-29 09:55

Whicker: good catch. I’ll have to fix the text and/or code.

It would be correct if said:

'Wait for Pin_RWDS to go high and then low

Maybe I should try to skew the timing somehow... On the other hand, it seems to work...

evanh · 2019-03-29 12:35

There is a difference in how those two event types get triggered. The edge detect version, %010, does as you are saying.

The level detect version, %100, however, can catch you out, as it will retrigger on a held level if that level condition is still present after checking (clearing the trigger). If not managed, this will create excess unexpected triggers, which can produce duplicate or leading type effects where the program seems to be instant when it shouldn't be.

whicker · 2019-03-29 17:33

Rayman wrote: »
Whicker: good catch. I’ll have to fix the text and/or code.

It would be correct if said:
'Wait for Pin_RWDS to go high and then low
Maybe I should try to skew the timing somehow... On the other hand, it seems to work...

I think it works because the data for the first pixel is from skipping at least one byte, or it unfortunately is testing the 8 nS data line hold time right at the margin. I think it's the former, because in WriteRamBurstSub we see the following line:

              'Latency Clocks
                nop
                mov       i2,#27'24  'need to check that this is right...
LoopLat1
                outnot    #Pin_CK
                djnz      i2,#LoopLat1

I think the default latency is 5 clocks, and 2x latency?
I haven't looked at the ISSI datasheet.

But the latency clocking starts early before all of the command address has arrived. One rise for 15:8 and one fall for 7:0.
This is already clocked manually in the first part of the code.

So with the assumed default settings, 5-1=4 clock cycles for the first part of the latency. 4*2 = 8 transitions.
5 more clock cycles for the second part of the latency. 5*2 = 10 transitions.

So that's only 8+10=18 clock transitions.

If default latency was 6, then it's 6-1=5, 5*2=10 transitions. Then adding to it is 6*2 = 12. 22 transitions to get the first data byte.
As shown in the attached image, with a latency of 4, then it's a total of 14 transitions, then put the data on, then rising edge of clock.
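
The transition counts above follow one rule: the first latency period loses one clock to the tail of the command/address phase, and the second (doubled) period is counted in full. A sketch of that counting, under the same assumed settings and with a hypothetical function name:

```python
def latency_transitions(latency_clocks, doubled=True):
    """Clock transitions before the first data byte, per the counting above.

    latency_clocks: configured initial latency in clocks (assumed value)
    doubled:        True when the 2x latency case applies
    """
    first_part = (latency_clocks - 1) * 2       # one clock already sent manually
    second_part = latency_clocks * 2 if doubled else 0
    return first_part + second_part

for latency in (4, 5, 6):
    print(f"latency {latency}: {latency_transitions(latency)} transitions")

LOOP_COUNT_IN_CODE = 27  # the value loaded into i2 in WriteRamBurstSub
print("code toggles the clock", LOOP_COUNT_IN_CODE, "times")
```

None of the three cases reaches the 27 toggles in the loop, which is the mismatch being pointed at.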

Somewhere there's something interesting going on is all I'm saying.

Rayman · 2019-03-29 17:50

I suppose I could do a slow read and see what's actually stored in the first few bytes of each row...

jmg · 2019-03-29 19:25

whicker wrote: »
...

so the line with $80 is basically:
              setse1    #%1000_0000+Pin_RWDS
              'same as:
              setse1    #%010_000000+Pin_RWDS
So despite the comment, it's waiting for a falling edge of RWDS?
...
Somewhere there's something interesting going on is all I'm saying.

Well spotted.
The clock also changes gear, from SW send of the addr.command part, to using a SmartPin generated faster clock, for the latency 'spare clocks' region & following data

              setbyte   outb,#0,#0      'CA7..0  'lower 3 bits are A2..A0
              outnot    #Pin_CK
              
              setbyte   dirb,#$00,#0 'release control of buffer             
            
              'configure smartpin to run HR clock
              dirl      #Pin_CK
              wrpin     #%1_00110_0,#Pin_CK
              wxpin     #1,#Pin_CK  'add on every clock
              mov       pa,#1
              shl       pa,#30      'SysCLK/4 = 62.5 or 75MHz
              wypin     pa,#Pin_CK
              dirh      #Pin_CK
                       
              'prepare to load buffer using fifo              
              loc       ptra,#@HyperBuffer
              mov       pa,pb
              mul       pa,##480  'pb is section of buffer, 0..2
              add       ptra,pa
              wrfast    #0,ptra
              
               
               'calc # bytes to read
              mov       pa,##480           
               
              'Wait for Pin_RWDS to go high
              setse1    #$80+Pin_RWDS
              waitse1  
              
              nop  'need this at 300 MHz?
              
              'read in bytes
              rep   #1,pa 
              wfbyte    inb            
.reploop2

Current code uses SysCLK/4, but that pushes to need 300MHz PLL's, which is ok for testing, but not really shippable...
It may be possible to use SysCLK/2, and shuffle the code sections about a little, eg so 'prepare to load buffer using fifo goes before 'configure smartpin to run HR clock
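
For reference, the relation between the wypin value and the HyperBus clock rate can be sketched as below. It assumes the smart pin acts as a 32-bit accumulator that adds Y on every sysclk (wxpin #1) and outputs its top bit, which is consistent with the SysCLK/4 comment in the code; the helper names are made up.

```python
ACCUMULATOR_BITS = 32  # assumed width of the NCO accumulator

def nco_increment(divisor):
    """Y value that makes the output run at sysclk / divisor."""
    return (1 << ACCUMULATOR_BITS) // divisor

def hr_clock_mhz(sysclk_mhz, y):
    """Output frequency for increment y, adding once per sysclk."""
    return sysclk_mhz * y / (1 << ACCUMULATOR_BITS)

y_div4 = nco_increment(4)   # same as 1 << 30, as in the code above
y_div2 = nco_increment(2)   # the SysCLK/2 option being suggested

print(hr_clock_mhz(250, y_div4))  # 62.5
print(hr_clock_mhz(300, y_div4))  # 75.0
print(hr_clock_mhz(150, y_div2))  # 75.0
```

Under these assumptions a SysCLK/2 clock would give the same 75 MHz HyperBus rate from a 150 MHz sysclk, at the cost of fewer sysclks per byte for the read loop.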

I notice octaRAM floats DQS slightly longer (goes low later) than HyperFLASH RWDS, so maybe a pull-down mode on the pin would tolerate RAM types more.

jmg · 2019-03-29 19:59

jmg wrote: »

...
The Unregistered sweep to next aperture, gives a healthy 3.15ns between 4 & 3 changes, which nicely ~equals the period at 312MHz - ie an 'expected' result.

The puzzle here, is the Registered 6 sysclk & 5 sysclk apertures change points, seem to be only 466ps apart ?!
Below ~ 316MHz, Registered remains stable, (just pin 63 is late change at ~150MHz due to high loading effects on that pin, & the top 6 pins are clearly different )
Not sure how to explain how it can pick up another sysclk in just 466 ps period change ?
This is pushing up to the top-spec and well overclocked, so maybe other effects come into play ?

Looking some more into 466ps apparent clock width, I did find one mechanism that appeared in doing PCB trace delay spice runs.. and that is clock ringing.
If the internal clock on the PAD ring has some ringing, as per the spice, that could explain the 466ps separation measurement result ?
I'm not sure what this could mean for things like edge-detection circuits, but I think that's not done in the PAD Ring ?

cgracey · 2019-03-29 23:29

jmg wrote: »

jmg wrote: »

...
The Unregistered sweep to next aperture, gives a healthy 3.15ns between 4 & 3 changes, which nicely ~equals the period at 312MHz - ie an 'expected' result.

The puzzle here, is the Registered 6 sysclk & 5 sysclk apertures change points, seem to be only 466ps apart ?!
Below ~ 316MHz, Registered remains stable, (just pin 63 is late change at ~150MHz due to high loading effects on that pin, & the top 6 pins are clearly different )
Not sure how to explain how it can pick up another sysclk in just 466 ps period change ?
This is pushing up to the top-spec and well overclocked, so maybe other effects come into play ?

Looking some more into 466ps apparent clock width, I did find one mechanism that appeared in doing PCB trace delay spice runs.. and that is clock ringing.
If the internal clock on the PAD ring has some ringing, as per the spice, that could explain the 466ps separation measurement result ?
I'm not sure what this could mean for things like edge-detection circuits, but I think that's not done in the PAD Ring ?

It seems maybe it missed the first clock, but was registered on the second clock, instead?

jmg · 2019-03-29 23:37

cgracey wrote: »

It seems maybe it missed the first clock, but was registered on the second clock, instead?

If that was true, the change would not be 466ps ?
The non-registered path does seem to have next-clock step size, which is more expected.

cgracey · 2019-03-29 23:44

jmg wrote: »

cgracey wrote: »

It seems maybe it missed the first clock, but was registered on the second clock, instead?

If that was true, the change would not be 466ps ?
The non-registered path does seem to have next-clock step size, which is more expected.

Maybe there is some delay between the clocking that occurs in the actual I/O pad and the internal core flops? Or, the data paths subject to the two clocking contexts?

Rayman · 2019-03-30 21:07

I just did a test where I made the first two pixels in the first row different. Did the fast row read and then output result to serial port.
Looks good. We are getting the data read in where it is supposed to be.

Actually, it's 4 pixels, two bytes...

Rayman · 2019-03-31 00:45

I was just wondering why the 640x480x16bpp bird photo looks so good on my 1080p monitor...
Detail around the beak is much smoother than it should be at this resolution.

Turns out the monitor actually has upscaling built into it.

It's a Samsung S24E450. Not even my best monitor...

This is really good for a P2 with VGA HDMI focus...
Think I'm going to target 16bpp VGA for games. 1080p for sure for profi stuff.

Rayman · 2019-03-31 00:49

Here's what it looks like without upscaling:

Rayman · 2019-03-31 16:11

Just tried HyperRam code for HyperFlash. Looks to me like has same exact read and write protocol.

But, got no response at all from HyperFlash…
Went back to datasheet and now I see that the CSn ball is different for HyperFlash

Looks like all the other needed pins are the same, except this one.

Yanomani · 2019-03-31 18:30

Hi Rayman

They made it that way to allow multi-chip package options, where one HyperFlash and one HyperRam can both be accommodated, their respective silicon dice stacked one on top of the other, using a single FBGA package footprint.

Be aware that some multi-chip packaged devices could also present a timing penalty of up to 1 ns on some or all of the signals and their respective interrelationships.

Rayman · 2019-03-31 22:14

Ok, have to order some new boards with the fix...

Also looking at dual HyperRam+HyperFlash along with dual eMMC (see attached)
