HyperRAM Solutions for P2 (and P1)

RJSM · 2017-06-05 11:24

Thanks Cluso

here's the short clip. When you get back to Oz I'd be keen to talk to you about your SMT soldering setup

RJSM · 2017-06-05 11:30

I just tried looking at that clip on a Mac and it's badly corrupted up to the 17s mark using QuickTime Player. It was made on a PC - will investigate tomorrow

Sorry about that

RJSM · 2017-06-05 11:37

Just tried Macgo, a freeware Mac media player, and it shows the video OK on my MacBook Pro - hope that helps

evanh · 2017-06-05 11:45

I've whipped it through Handbrake, give this a try.

RJSM · 2017-06-05 11:52

Evanh - yes that's sorted out the problem running Quicktime - thanks

Cluso99 · 2017-06-05 13:18

RJSM wrote: »

Thanks Cluso

here's the short clip. When you get back to Oz I'd be keen to talk to you about your SMT soldering setup

email or PM me end of next week in case I forget

ozpropdev · 2017-06-05 13:39

Very nice work Richard!

Yanomani · 2017-06-05 14:54

deleted

Yanomani · 2017-06-05 15:03

Hi RJSM

RJSM wrote: »

Hi Henrique

Great timing diagram !

Some clarification - In my code CR0 and CR1 are configured just once on program startup

Here are the 8 bytes (hex) I’m writing to each register :
CR0 : 60 00 01 00 00 00 AF E8
CR1 : 60 00 01 00 00 01 00 02

The E8 in the CR0 data stream configures HR for 3 clock latency and also requests a fixed 2x latency. Doing that seems desirable, simplifying the need to explicitly handle RWDS apart from a direction change before sending data. As far as the clock is concerned, all is as in your timing diagram.
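As a cross-check of the CR0 value quoted above, here is a minimal Python sketch that unpacks the 16-bit register word $AFE8. The field layout and the latency code table are assumptions taken from a reading of the HyperRAM datasheets, not from the code in this thread, and `decode_cr0` is a made-up helper name. Check both against the datasheet for your part.

```python
# Assumed CR0 layout (verify against your part's datasheet):
#   bit 15      deep power down control (1 = normal operation)
#   bits 14-12  drive strength code
#   bits 7-4    initial latency code
#   bit 3       fixed latency enable (1 = always 2x latency)
#   bits 1-0    burst length code
LATENCY_CLOCKS = {0b0000: 5, 0b0001: 6, 0b1110: 3, 0b1111: 4}  # assumed table


def decode_cr0(value):
    """Split a 16-bit CR0 word into the fields listed above."""
    latency_code = (value >> 4) & 0xF
    return {
        "normal_operation": bool((value >> 15) & 1),
        "drive_strength_code": (value >> 12) & 0x7,
        "initial_latency_clocks": LATENCY_CLOCKS.get(latency_code),
        "fixed_latency": bool((value >> 3) & 1),
        "burst_length_code": value & 0x3,
    }


cr0 = decode_cr0(0xAFE8)  # the last two bytes of the CR0 write above
print(cr0)
```

Under these assumptions the low byte E8 gives latency code 1110 with the fixed-latency bit set, which agrees with the 3 clock, fixed 2x latency description in the post.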

In my file WR_HR.jpg, I’m not accessing CR0. In the LA traces you’ll see that DQ5 is set high before the first rising clock edge. Unfortunately I did not capture DQ6 here, which would have made this clearer: it would have been low, i.e. the first byte I send is $20 and not $60, which would indeed have been a register access.
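The $20 versus $60 distinction can be sketched in Python as a decode of the top three bits of the first command/address byte. That DQ6 selects register space follows from the post; the meanings given for DQ7 (read/write) and DQ5 (linear burst) are assumptions based on the usual HyperBus CA layout, and `decode_ca_first_byte` is a hypothetical helper name.

```python
def decode_ca_first_byte(b):
    """Decode the three flag bits carried in the first CA byte."""
    return {
        "read": bool(b & 0x80),            # DQ7 / CA[47], assumed 1 = read
        "register_space": bool(b & 0x40),  # DQ6 / CA[46], 1 = register access
        "linear_burst": bool(b & 0x20),    # DQ5 / CA[45], assumed 1 = linear
    }


for first_byte in (0x20, 0x60):
    print(hex(first_byte), decode_ca_first_byte(first_byte))
```

With this reading, $20 is a linear-burst write to memory space and $60 is the same write aimed at register space, so a trace showing DQ5 high and DQ6 low is a memory write.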

On the question of the long duration of my RWDS I find the datasheet unclear regarding exactly when one should make RWDS an output. In HR_WR, after setting up the streamer and smart pin, then pulling CS low (with RWDS initially configured as an input pin), I’ve got the following lines in my code to change RWDS direction and pull it low (to ensure there’s no data masking) :

waitx #n          ' delay; n is the delay cycles constant
drvl #HR_RWDS     ' make RWDS an output and drive it low (no data masking)

In WR_HR.jpg my delay cycles constant n was rather large and I’ve reduced it – now RWDS is high exactly as shown in your timing diagram. In the case of a RD_HR there is no need to change the direction of that pin, of course; it’s an input. Good to have that sorted out thanks to your input.

In a couple of minutes i'll post code (now with some documentation) and a short video clip as well

First, I must acknowledge your excellent work, both in hardware and software, and express how much I personally thank you by sharing it with us.

Without it, we wouldn't have had a real chance to discuss and share our efforts on such an important and urgent subject as parallel comms.

Since the last opportunity, I've crafted a new write-transaction panorama image, now focusing on Fclk = 120 MHz, and an edited copy of your original table, with the missing 8 in place, I hope.

It spans from the transaction's starting #CS low-going edge to the ending high-going one, and depicts the three zones I've used to divide the time spent at the P2/HyperRAM interface during write accesses to HyperRAM.

Clocks, control signals, data bus contents and individual bit lane values, for the CA group and the sixteen originally intended data bytes, are shown in an LA-style display format.

Let me post it, then further comments can be shared.

Henrique

Yanomani · 2017-06-05 15:49

Hi RJSM

An expanded view of the same image, this time showing bus tags/values.

Excluding CA[8] (the bit lane D0 value at CK's third rising edge, during the CA phase), which should be 0 (tagged as "Reserved" in the ISSI datasheet Rev. A 06/28/2016, page 6), they behave like identical twins.

I've restricted the transaction plot to 16 data bytes to spare the obvious repeating frames.

More on the go...

Henrique

Rayman · 2017-06-05 17:32

Nice work. 50 MB/s is pretty good.
I'll have to see if this works on my board...

BTW: I don't know if this is new or not, but I just noticed a chip with both HyperRAM and HyperFlash in one package. $16 though...

jmg · 2017-07-13 21:13

RJSM wrote: »

I'll post my effort on a P1 HyperRAM test in a week or so in that area of the forum. It still needs extensive documentation. Would be sooner - but a kitchen renovation here is making this slow-going !

Curious for any updates on this ?

jmg · 2017-07-13 21:36

jmg wrote: »

Because the 4us is a royal pain, I've been running a pencil over self-refresh, with a view to a simple direct P1 <-> HyperRAM(s) -> LCD Display connection.
The LCD has a DE line, and needs parallel up to 24b, giving choices of 1, 2 or 3 HyperRAMs connected.
With no DE, the data is ignored, so refresh or write can be done anytime DE is inactive. (more forgiving than VGA)

I think a P1 can read continually readily enough, and so that handles refreshing the RAM of the displayed frame.

Further to this, I've been looking into MCU -> HyperRAM -> 24b LCD.
Yes, 2 or 3 HyperRAM is possible, but whilst they are somewhat cheap ($1.65/1k), they are not free, and that still needs 24+ pins write pathway.
Smarter seems to be ONE HyperRAM, as that is already very large in pixel count.

So, I've looked at smaller CPLDs to 'see what can fit'.
These are TQFP44, ~$1.30/1k, and you need just 1, and it gives an 8b path to host MCU.
Once the LCD data path is there, I iterate to 'do more' in the CPLD, so the MCU has to do less.
There are pin limits on this, as 24b LCD and 8b HR leaves just 4 control/signal lines.

This is LCD data I've used :
Horiz : 800, Vert 480, LCD_CLK typ 30MHz, Max 50MHz
HORIZ DE Blanking 85 < 128 < 512 DCLK - tH
VERT DE Blanking, 4 < 45 < 255 tH

This data has min and max for the blanking, so you need eg 800+(85..512) clocks per line.
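To put numbers on the blanking ranges, here is a small Python sketch of the line and frame arithmetic, using only the figures listed above. The names are illustrative, and treating the vertical blanking figures as line counts is an assumption.

```python
H_ACTIVE, V_ACTIVE = 800, 480
H_BLANK = {"min": 85, "typ": 128, "max": 512}   # DCLK periods per line
V_BLANK = {"min": 4, "typ": 45, "max": 255}     # line periods (tH), assumed
LCD_CLK_HZ = 30_000_000                         # typical LCD_CLK


def clocks_per_line(kind):
    """Total DCLK periods in one line: active pixels plus DE blanking."""
    return H_ACTIVE + H_BLANK[kind]


def lines_per_frame(kind):
    """Total line periods in one frame: active lines plus DE blanking."""
    return V_ACTIVE + V_BLANK[kind]


def frame_rate_hz(kind, clk_hz=LCD_CLK_HZ):
    """Frame rate for a given blanking choice at the given pixel clock."""
    return clk_hz / (clocks_per_line(kind) * lines_per_frame(kind))


for kind in ("min", "typ", "max"):
    print(kind, clocks_per_line(kind), lines_per_frame(kind),
          round(frame_rate_hz(kind), 2))
```

With typical blanking this works out to 928 clocks per line and 525 lines per frame, or a little under 62 frames per second at 30 MHz.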

CPLD currently takes 2 words on 4 edges, and latches 24b to LCD bus, and checks the spare 8b for command codes/tags.

Latest rev derives LCD_DE from embedded codes, which slashes MCU overhead into a single playback clock burst
(but does mean you do need to initially insert those tags, and memory is a little larger than simple pixel space)

A mode bit allows CPLD(auto) or MCU(manual) control of LCD_DE, so that for early testing, and Fast-Fills/Erase the LCD can work with no tags yet.

Helping this is the recent better availability of oscillators in the 60, 72 and 80 MHz region; HR_CLK is 2x LCD_CLK.
Starting at the lower MHz end is probably smarter.
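A quick Python check of what those oscillator choices imply. It assumes 4 bytes per pixel in HyperRAM (24b of pixel data plus the spare 8b tag byte, read as 2 words on 4 clock edges of an 8-bit DDR bus); the constant names are illustrative.

```python
LCD_CLK_MAX_MHZ = 50      # from the LCD data above
BYTES_PER_PIXEL = 4       # assumed: 24b pixel + 8b command/tag byte
BYTES_PER_HR_CLOCK = 2    # 8-bit bus, one byte on each clock edge


def lcd_clk_mhz(hr_clk_mhz):
    """Pixel clock that results from a given HyperRAM clock."""
    hr_clocks_per_pixel = BYTES_PER_PIXEL // BYTES_PER_HR_CLOCK
    return hr_clk_mhz / hr_clocks_per_pixel


for osc_mhz in (60, 72, 80):
    pix = lcd_clk_mhz(osc_mhz)
    print(osc_mhz, "MHz HR_CLK ->", pix, "MHz LCD_CLK,",
          "within spec" if pix <= LCD_CLK_MAX_MHZ else "too fast")

# Pixel space only; embedded tags in the blanking periods add a little more.
frame_bytes = 800 * 480 * BYTES_PER_PIXEL
print(frame_bytes, "bytes per frame of pixel data")
```

A 60 MHz oscillator lands exactly on the typical 30 MHz LCD_CLK, and one frame of pixel data takes well under a quarter of an 8 MB part.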

RJSM · 2017-07-18 21:59

P2 HyperRAM project - progress update

I’ve now populated 2 new OSHPark PCBs – one targeting HyperRAM for P1 and the other for P2.

The P2 version is just a small breakout PCB that plugs into the prototyping area of the Parallax add-on board for the DE2-115. The HR chip is then connected to P11…0 via hookup wires – light blue for the data bus and yellow for the control signals.

The resistor you see in the photo is 56 ohms - this is placed in the CLK line from P9 – and proved essential in getting reliable performance with this setup (see the validation test below).

Aside : I could have made a P2 PCB with a double row header to plug directly into the I/O connector on the add-on board (which would have meant more direct inter-connections to the HR chip) - but I ended up going with this arrangement for a bit more flexibility.

For testing, I’ve added a new command (Q) to my code to access the full 8MB of HyperRAM as 65536 blocks, each of 128 bytes.

Here’s what I implemented for a validation test [timings in usec] :

Initialize block counter
* Fill 128 byte buffer with random bytes (using XORO32) [30.4 usec]
Write this buffer to HR [3.0 usec]
Read the buffer back from HR [3.0 usec]
Compare write and read buffers, maintaining an error counter [33.5 usec]
Decrement block counter and repeat at * till done
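The steps above can be sketched in Python against a simulated memory. This is only an illustration of the loop structure, not the P2 code: `FakeHyperRAM` is a hypothetical stand-in for the real read/write routines, and Python's `random` module replaces XORO32.

```python
import random

BLOCK_SIZE = 128


class FakeHyperRAM:
    """Hypothetical stand-in for the HyperRAM driver: a plain byte array."""

    def __init__(self, size):
        self.mem = bytearray(size)

    def write(self, addr, data):
        self.mem[addr:addr + len(data)] = data

    def read(self, addr, count):
        return bytes(self.mem[addr:addr + count])


def validate(ram, num_blocks, seed=1):
    """Write, read back and compare every block; return the error count."""
    rng = random.Random(seed)
    errors = 0
    for block in range(num_blocks - 1, -1, -1):      # count blocks down
        buf = bytes(rng.getrandbits(8) for _ in range(BLOCK_SIZE))
        addr = block * BLOCK_SIZE
        ram.write(addr, buf)                         # write buffer to HR
        back = ram.read(addr, BLOCK_SIZE)            # read it back
        errors += sum(a != b for a, b in zip(buf, back))
    return errors


print(validate(FakeHyperRAM(256 * BLOCK_SIZE), 256))
```

On real hardware the two `ram` calls would be the streamer-based write and read, and the full test would use 65536 blocks.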

Total time per 128 byte buffer here is 70 usec, most of which is taken up generating the random numbers and comparing the write/read buffers.

I’ve just finished running this code continuously for ~24 hours and during that time ~150 gigabytes of random data were transferred to/from HR - without a single error.
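The figures above can be checked with a few lines of Python. It uses only the timings quoted in this post; counting the data total per direction (written bytes and read bytes separately) is an assumption about how the ~150 gigabytes was tallied.

```python
BLOCK_BYTES = 128
NUM_BLOCKS = 65536
T_BLOCK_S = 70e-6     # total time per 128-byte block
T_BURST_S = 3.0e-6    # time for one 128-byte write or read

burst_rate_mb_s = BLOCK_BYTES / T_BURST_S / 1e6   # raw transfer rate
full_pass_s = NUM_BLOCKS * T_BLOCK_S              # one sweep of all 8 MB
blocks_in_24h = 24 * 3600 / T_BLOCK_S
gb_each_way = blocks_in_24h * BLOCK_BYTES / 1e9   # written (or read) in 24 h

print(round(burst_rate_mb_s, 1), "MB/s during a burst")
print(round(full_pass_s, 2), "s per full pass")
print(round(gb_each_way), "GB in each direction over 24 hours")
```

So the burst itself runs at a little over 40 MB/s, a full pass over the chip takes under 5 seconds, and 24 hours of running moves on the order of 150 GB each way.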

Wanted to post this info now as I’ll soon be travelling for an extended period of time (until early September) without access to P1/P2 projects.

RJSM · 2017-07-18 22:09

Can someone please clarify the mode of operation of the XORO32 instruction ?

Chip’s May 23 post re XORO32, seems at odds with what’s documented in the latest v20.xlsx instruction spreadsheet, which states -

“Iterate D with xoroshiro32+ algorithm. Top 1..15 bits of sum of low and high words are high-quality pseudo-random data.”

The code I just posted was based on Chip's May 23 information. I did look at the random data being generated and superficially it seems ok - but then I can’t claim I did any rigorous mathematical testing of the randomness of the data.

jmg · 2017-07-18 22:18

Cool.

RJSM wrote: »

For testing, I’ve added a new command (Q) to my code to access the full 8MB of HyperRAM as 65536 blocks, each of 128 bytes.

It could also be useful to access as row * column sizes. (eg 8192 x 512 words)
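A rough Python sketch of the two views of the same 8 MB. It assumes 16-bit words and a simple linear mapping, with hypothetical function names.

```python
BLOCK_BYTES = 128
ROWS, COLS = 8192, 512     # 512 words per row
BYTES_PER_WORD = 2         # assumed 16-bit words


def block_to_byte_addr(block):
    """Byte address of a 128-byte block (the Q command's view)."""
    return block * BLOCK_BYTES


def row_col_to_byte_addr(row, col):
    """Byte address of word `col` in row `row` (the row * column view)."""
    return (row * COLS + col) * BYTES_PER_WORD


def byte_addr_to_row_col(addr):
    """Inverse mapping: which row and word column a byte address falls in."""
    return divmod(addr // BYTES_PER_WORD, COLS)


print(byte_addr_to_row_col(block_to_byte_addr(65535)))
```

Both views cover the same 8 MB, with each 1024-byte row holding eight of the 128-byte blocks.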

RJSM wrote: »

Here’s what I implemented for a validation test [timings in usec] :
Initialize block counter
* Fill 128 byte buffer with random bytes (using XORO32) [30.4 usec]
Write this buffer to HR [3.0 usec]
Read the buffer back from HR [3.0 usec]
Compare write and read buffers, maintaining an error counter [33.5 usec]
Decrement block counter and repeat at * till done

Total time per 128 byte buffer here is 70 usec, most of which is taken up generating the random numbers and comparing the write/read buffers.

I’ve just finished running this code continuously for ~24 hours and during that time ~150 gigabytes of random data were transferred to/from HR - without a single error.

Nice test.
Can you increase the buffer size to larger values ?
Useful would be line-buffer of say 3200 bytes / 1600 words, and frame buffer times of ~ 30ms.

Refresh spec is > 64ms, so it could also be useful to look for the typical retention time actually needed - ie place a delay between Write and Read, and increase it slowly until Read failures appear.

It probably does not need to be true random, just 'not the same as last time'.
I've used a simple (address upper XOR address lower) to get a rolling and moving memory offset test pattern.
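One way to read that pattern, as a Python sketch. The split point (16 bits) and the 16-bit result width are assumptions, and `pattern16` is an illustrative name.

```python
def pattern16(addr):
    """Test word for an address: upper address bits XOR lower 16 bits."""
    return ((addr >> 16) ^ (addr & 0xFFFF)) & 0xFFFF


# The same offset in different 64K regions gets a different value,
# so an addressing fault is unlikely to read back as "correct" data.
for addr in (0x000005, 0x010005, 0x020005):
    print(hex(addr), hex(pattern16(addr)))
```

Because the value depends only on the address, the compare pass can regenerate the expected data instead of keeping a copy of the write buffer.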

evanh · 2017-07-22 08:28

RJSM wrote: »

Can someone please clarify the mode of operation of the XORO32 instruction ?

Chip’s May 23 post re XORO32, seems at odds with what’s documented in the latest v20.xlsx instruction spreadsheet, which states -

“Iterate D with xoroshiro32+ algorithm. Top 1..15 bits of sum of low and high words are high-quality pseudo-random data.”

The code I just posted was based on Chip's May 23 information. I did look at the random data being generated and superficially it seems ok - but then I can’t claim I did any rigorous mathematical testing of the randomness of the data.

Ah, I see now. Having been alongside Chip for this one little corner, I had to read, compare and reread both Chip's earlier postings and the instruction description to work out where your question is coming from. The instruction description reads as if the summing is done by the instruction, when it most definitely doesn't do that part. The summing sits outside the iterator section, so it's easy to leave out, and of course that saves a decent amount of logic.

The 1 to 15 bits of useful HQ output is just too dense detail, out of context even, for one line of text. The second sentence of the instruction description probably should just say it's used for high quality but tiny PRNG.
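To show the split between iterating and summing, here is a Python sketch of the general xoroshiro+ structure with two 16-bit halves. The rotate/shift constants are illustrative placeholders only (the actual XORO32 constants are not given in this thread), and which half of D holds which state word is also an assumption.

```python
MASK16 = 0xFFFF


def rotl16(x, k):
    """Rotate a 16-bit value left by k bits."""
    return ((x << k) | (x >> (16 - k))) & MASK16


def xoro32_iterate(state, a=14, b=2, c=7):
    """Advance the 32-bit state once; a, b, c are placeholder constants."""
    s0 = state & MASK16
    s1 = (state >> 16) & MASK16
    s1 ^= s0
    s0 = rotl16(s0, a) ^ s1 ^ ((s1 << b) & MASK16)
    s1 = rotl16(s1, c)
    return (s1 << 16) | s0


def prn_from_state(state):
    """The separate summing step: add the low and high words, keep 16 bits."""
    return ((state & MASK16) + (state >> 16)) & MASK16


state = 1
for _ in range(4):
    state = xoro32_iterate(state)
    print(hex(state), hex(prn_from_state(state)))
```

In this reading the instruction corresponds to `xoro32_iterate` alone, and user code performs `prn_from_state` afterwards, taking the top bits of the sum as the random data.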

Here's a good place to read up on the workings of the xoro32 instruction - http://forums.parallax.com/discussion/comment/1409681/#Comment_1409681

rjo__ · 2017-08-02 21:35

Love the code... don't understand it all, but can easily see where I would start hacking.

While you are traveling I thought maybe I would propose a "kickstarter" here.

I need at least 4 of these... don't care about price.

How about everyone else? How many... what do you want to pay?... postage is extra:)

Ale · 2017-08-03 07:16

I'd like to jump on the HyperRAM bandwagon too. Does anyone have a board here in Europe?
I'd probably get that Artix board from Trenz Electronic. It costs around 70 € for a (quite fast) 15k LE device with an 8 MByte HyperRAM part and 87 IOs. Not bad, I think. It is a Xilinx part; I normally use either Lattice or Altera parts.

jmg · 2017-08-17 04:13

I see JSC are now also offering 8-bit bus DTR memory....

https://www10.edacafe.com/nbc/articles/view_article.php?section=ICNews&articleid=1527055

"JSC, made known sampling of a new high-speed, self-refresh OctaRAM based on JSC's low-pin-count interface, which announced the industry's fastest Octa serial interface RAM, the OctaRAM JSC64SS product family."

I've asked for data.

I also find
http://www.macronix.com/en-us/products/NOR-Flash/Pages/OctaFlash.aspx#3V
which shows 512Mb (prodn) and 256Mb, 128Mb planned.
These parts DO have a 300mil 16-SOP package choice, which is easier to handle and prototype with than 6x8mm 24-TFBGA(5x5)

Ale · 2017-08-19 12:32

@jmg: They seem to have only BGA versions of the DRAMs and BGA/SOP of the flash chips. Maybe later

jmg · 2017-08-21 01:08

Ale wrote: »

@jmg: They seem to have only BGA versions of the DRAMs and BGA/SOP of the flash chips. Maybe later

Yes, the SOP16 is to me an unusual package choice, I think selected because it has been used in the past for bigger memories than would fit in SO-8
I see Digikey has 220 items in SO-16, under Serial Flash.

A smarter gull-wing might have been a TQFP32, which has a thinner and smaller footprint than SOP16-300, but that seems to not be the industry choice.
Even QFN20 or similar could be a middle ground between BGA and larger gull wing ?

Cluso99 · 2017-08-21 01:16

Pinout is a dog's breakfast
