Proposed external & HyperRAM memory interface suitable for video drivers with other COGs.

evanh · 2020-02-29 07:03

Yanomani wrote: »

Actually, there is not any flag the HyperBus controller can access, in order to be informed that any row's content can be already rotten, due to starvation (or total absence) of refresh.

I was talking about how the internal automatic refresh likely works. There is no informing outside the chip.

Manually, it all comes down to ensuring explicit row accesses I think. As Roger indicated, this would need 16 k accesses to refresh the whole 16 MB chip.

rogloh · 2020-02-29 07:03

Yeah whicker I had the exact same idea there too as something that may help, though I still think it's probably just easiest to break up the bursts to be always under 4us, and we will have to break them up for video applications anyway.

It seems as currently implemented we will probably lose 13% or so of the HyperRAM bus bandwidth in other overhead within the CS low time. I counted up the instructions in my code and it appears that there's about 30 instructions after CS goes low before the clock is even active (which is 0.30us for a 200MHz P2 with 100MHz HyperRAM), and about another 0.16us overhead for the address phase and latency time, both contributing to about 0.5us in overhead within the 4us, leaving 3.5us free for the actual transfer. There is also more overhead between bursts for polling and service setup (outside the CS low time) but that can be significantly reduced for video COGs, which can continue to repeat additional sub-bursts until the overall transfer request is complete (ie. the video COG won't be relinquishing service to another COG between its sub-bursts).

This 12.5% overhead within the CS low time will get worse if you clock the HyperRAM slower than 100MHz. Operating the P2 at 200MHz is a sweet spot though (for VGA and SVGA) but by the sound of what @evanh found you can easily overclock it for the writes anyway. For reads at sysclk/1 I'm not so sure. For a 252MHz P2 we might like to use the newer 133MHz or higher rated HyperRAMs to remain within spec.
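The overhead figures above can be checked with a few lines of arithmetic. This is only a sketch: it assumes 2 sysclks per P2 instruction (consistent with 30 instructions taking 0.30us at 200MHz), it assumes the 0.16us address/latency phase scales with the HyperRAM clock while the instruction overhead scales with sysclk, and the function and parameter names are made up for illustration.

```python
def cs_low_overhead(sysclk_mhz=200.0, hyperram_mhz=100.0,
                    setup_instructions=30, ca_latency_us_at_100mhz=0.16,
                    tcsm_us=4.0):
    """Return (overhead_us, overhead_fraction, payload_bytes) for one burst.

    Assumptions (not from any datasheet): 2 sysclks per instruction, and the
    address + latency phase scaling inversely with the HyperRAM clock.
    """
    setup_us = setup_instructions * 2 / sysclk_mhz            # CS low, clock idle
    ca_latency_us = ca_latency_us_at_100mhz * 100.0 / hyperram_mhz
    overhead_us = setup_us + ca_latency_us
    transfer_us = tcsm_us - overhead_us                        # left for data
    payload_bytes = int(transfer_us * hyperram_mhz * 2)        # DDR: 2 bytes/clock
    return overhead_us, overhead_us / tcsm_us, payload_bytes


if __name__ == "__main__":
    for ram_mhz in (100.0, 50.0):
        us, frac, nbytes = cs_low_overhead(hyperram_mhz=ram_mhz)
        print(f"{ram_mhz:5.0f} MHz HyperRAM: {us:.2f} us overhead "
              f"({frac:.1%}), about {nbytes} payload bytes per burst")
```

With the defaults the overhead comes out at 0.46us; rounding that up to 0.5us gives the 12.5% quoted above, and halving the HyperRAM clock makes the fraction larger, as the post says.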

evanh · 2020-02-29 07:21

An easy adjustment is to set CR1 register to zero. This doubles the tCSM period to 8 us. So then everywhere the docs say 4 us max, you can replace that with 8 us max instead.

rogloh · 2020-02-29 07:26

When I tried going to 16us and testing with video I had periodic corruption issues which Tubular and ozpropdev noticed with me at the time. It was only when I went back to 4us this all went away and things were rock solid. Edit: No the problems went away when I broke up the bursts to be less than 4us. I think the 16us actually helped.
I can't recall what the result was with 8us or 6us.

Also it looks like the newer HyperRAM Tubular mentioned from Cypress doesn't support these slower refresh settings any more so it's probably not very standard.

rogloh · 2020-02-29 07:27

I'm really considering this 3 long mailbox thing now. Most of the instructions below would go away for every single burst transfer, which I think may become the most commonly issued request. With 3 mailbox longs to read per COG we lose 8 more clock cycles in the polling loop vs what I have today, but we could replace all of this code with a single mov instruction and the execf, and the driver side may not need as much setup work either. Fixed length bursts then save 7 instructions (14 clocks), and variable length bursts save even more, from 20-27 clocks, which easily repays the increased 8 clock cycle penalty during polling. Individual R/W accesses would suffer these extra 8 clocks however, as they don't use the 3rd parameter. A fill request could use this third mailbox long argument, if ever done individually outside a request list.

                            getnib  b, data, #5             '...no: new burst, get block size 
                            cmp     b, #15 wz               'check for variable length transfer
            if_z            jmp     #r_varlen               'handle it
                            getbyte c, data, #3             'get number of block transfers
                            test    c wz                    'test for zero
            if_z            mov     c, #256                 '...and treat 0 as 256
                            shl     c, b                    'now scale this into the number of bytes
                            execf   newburstr
...

r_varlen                    setq    #2-1                    'variable length, read new hub address+size
                            rdlong  b, data wz              'read 2 longs (b=addr / c=len)
                            mov     data, b                 'set new hub address
            if_nz           execf   newburstr               'start a new write burst request
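For anyone following the fixed-length path without a P2 manual to hand, here is a small Python model of what that decode computes. It is a sketch only: it assumes GETNIB with #5 extracts bits 23..20 of the mailbox long and GETBYTE with #3 extracts bits 31..24, and the function name is hypothetical. The variable-length path (the `r_varlen` branch) is only signalled here, not modelled.

```python
def decode_fixed_burst(data):
    """Model of the fixed-length burst decode on a 32-bit mailbox long.

    Returns the burst size in bytes, or None when the block-size nibble
    is 15, which the PASM code treats as a variable-length request.
    """
    b = (data >> 20) & 0xF          # getnib b, data, #5  (block size exponent)
    if b == 15:
        return None                 # would jump to r_varlen
    c = (data >> 24) & 0xFF         # getbyte c, data, #3 (number of blocks)
    if c == 0:
        c = 256                     # zero is treated as 256 blocks
    return c << b                   # shl c, b (scale to a byte count)


if __name__ == "__main__":
    for word in (0x02300000, 0x00400000, 0x01F00000):
        print(hex(word), decode_fixed_burst(word))
```

With the 3 long mailbox proposal, this whole decode would be replaced by the requester writing the byte count directly into the third long.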

evanh · 2020-02-29 07:43

rogloh wrote: »

When I tried going to 16us and testing with video I had periodic corruption issues which Tubular and ozpropdev noticed with me at the time. It was only when I went back to 4us this all went away and things were rock solid. Edit: No the problems went away when I broke up the bursts to be less than 4us. I think the 16us actually helped.
I can't recall what the result was with 8us or 6us.

Also it looks like the newer HyperRAM Tubular mentioned from Cypress doesn't support these slower refresh settings any more so it's probably not very standard.

You'd have to make sure you always stay under the limit. Otherwise it'll start skipping rows. At the slower rate that'll break things easier.

The ISSI parts have the same config bits. I'd be surprised if those are being removed, unless they are offering some alternative ways of having longer bursts.

evanh · 2020-02-29 07:50

rogloh wrote: »

... For reads at sysclk/1 I'm not so sure. For a 252MHz P2 we might like to use the newer 133MHz or higher rated HyperRAMs to remain within spec.

For a refresh access, the read data can be ignored. Just have to issue the CA and enough clocks to trigger the row refresh.

Yanomani · 2020-02-29 11:00

The spec for a minimum tRWR is there (in the datasheets) for a reason:

HyperRAMs use that time to write back the last-used, row-wide ghost register into its position in the main memory array, regardless of whether any new information was written to it during the last access.

Those are some vestiges (remains?) of the (now) ancient/vintage refresh read procedure that has been with us since DRAM's inception.

Following the timing diagram (just a warning: the default two-times latency count is assumed from here on):

- after tRWR, the first latency count follows; during this time, any impending self-refresh operation could be performed internally (AKA, far from curious eyes), affecting any row that's about to starve (perhaps including the one you are just about to access (or not, by prematurely ending the operation in course), just because the internal controller can't trust you, i.e. it can't presume you won't give up on the "apparently" intended operation);

- then, there comes the second latency count; the period of time used by HR internals to read the row addressed during the CA phase from the main memory array into one of the ghost registers; thereon, that ghost register becomes exposed to the external, bi-directionally accessible interface bus, able to send data (READ COMMAND in progress) or receive data (WRITE COMMAND in progress); during WRITE COMMANDS whose destination is the main memory array, any masking (RWDS-level-controlled) would preclude changing its contents, byte-by-byte, as specified by positioning RWDS = HIGH at byte A-time, byte B-time, both, or none.

- the ordering of the above two operations can be changed, without any loss of significance, provided that, at the end of them, the right ghost register remains connected to the external bi-directional interface bus.

- then, the byte/word xfer time commences and the internal controller still needs to keep supervising it, in order to be able to load the, now free, other ghost buffer, by reading the next row contents from the main memory array into it, in advance, since it can't know how far the current accesses will reach, nor if they'll be limited to the contents of the currently exposed ghost buffer, already in use (Linear Burst terms enforced).

Perhaps, and that's only one of my suppositions, if, by the end of the current operation, there are already two rows in the two ghost buffers, and the operation in course is "terminated" by CS# going HIGH, the whole time budget available for "impending" self-refresh ops in a possible just-to-be-instantiated read/write operation could be consumed, leaving no time available for anything (including any other "possible" self-refresh alarm reaching its time-out) but blindly following the external HyperBus master controller's chained command operations.

Wrong, next post explains it, in detail.

In the past (good old times), datasheets, spec sheets, and sure, app notes, used to be a fantastic source of good information; a thorough reading session could provide enough material to get you started; some design and experimenting would let you feel comfortable, able to leverage the embodied technology, exempt from being regularly caught out by incomplete, hidden, unexplained or, worst of all, unreliable information.

Unfortunately, that is not the case anymore... sad.

evanh · 2020-02-29 12:13

Of note is tRFH == tACC == tRWR. And doing some reading ... it turns out the refreshing of DRAM cells is actually performed by the read sense amplifiers. ie: The data sense lines are also the data write lines. So access time includes the writeback/refresh time for free. Reduces circuit size not needing separate write lines either. Nifty trick. I've learnt something new.

This article details it - http://www.cse.scu.edu/~tschwarz/coen180/LN/DRAM.html
The section DRAM Read - point 4 has the details. Figure 5 shows the sense amp schematic, with D1 and D1* being the two Digit Lines shown in figure 3.

evanh · 2020-02-29 12:18

Figure 3 also implies two transistors per bit. I didn't think that was the case for DRAM. Maybe there are even further reduced designs from that one that only need one transistor per bit of storage.

Yanomani · 2020-02-29 12:19

As for HyperRAMs, the internal memory array is time-multiplexed, providing enough bandwidth for two ghost registers to operate in synchronism (kind of a seesaw scheme).

If one intends to start accessing near the end of any row, one needs to do so wisely, so that the HyperRAM's internal controller is given enough time to read two rows in the course of each linear burst operation:

- the one specified by the contents of the internal registers loaded in the CA phase, whose destination will be the "exposed" ghost buffer, and the next one (address-wise), which will be held in the "undercover" ghost buffer until needed, when the access sequence reaches the last word of the current row.

Now, HyperRAM Read Initial Access Time tACC is the name of the game. If one targets the very last words of a row, then one should provide enough CK periods thereon, so the product TCK x nWORDS equals 2 x tACC, or a bit more, in fact.

Not specified anywhere, but clear as crystal (at least to me), or the two latency counts would need to be dedicated to filling both ghost buffers with two rows, in sequence, only to guarantee that linear burst operations will at least work, marginally, if an initial access targets the very end (less than tACC / (TCK x number of ought-to-be accessed WORDS), till the end of a row is reached).

And if the two latency counts were spent that way, no time would be left for any "impending" self-refresh operation to take place, at least during the present access timeframe.

tRWR has been exhausted too early, at the beginning of the access cycle, before the HyperRAM's internal controller could even be aware of the initial address that will be accessed by the current command, so it can only be used to write back the contents of a possible last "orphan" ghost buffer, left at the end of a previous operation.

Yanomani · 2020-02-29 12:44

Not exactly OT, but Digikey is offering 2015 Cypress HyperRAMs at peanut prices, both 2015-S27KS0641DPBHV020-ND (8MB, 351 in stock, US$1.20 each) and 2015-S70KS1281DPBHV020-ND (16MB, 216 in stock, US$2.15 each), under "NO WARRANTY" terms:

https://forum.digikey.com/t/cypress-no-warranty-parts/5047

These are 1.8V, 166MHz (332 MBps) parts, thus not exactly suited to our 3.3V environment, but, in case of interest, and given the right voltage-level interface chips are not that expensive, and there is also the point of speed gain...

evanh · 2020-02-29 13:08

I have my doubts we can reliably get to 200 MT/s at 3.3 V anyway. Trying to dance with a 1.8 V part seems nuts.

EDIT: Also, Hyperbus 2.0 parts support 400 MT/s at 3.3 V. If we can get reliable sysclock/1 transfers going then changing to the newer versions would be ideal.

evanh · 2020-02-29 13:08

Yanomani wrote: »

Not specified anywhere, but clear as crystal (at least to me), or the two latency counts would need to be dedicated to filling both ghost buffers with two rows, in sequence, only to guarantee that linear burst operations will at least work, marginally, if an initial access targets the very end (less than tACC / (TCK x number of ought-to-be accessed WORDS), till the end of a row is reached).

The sense amps act as latches - read my above post. There will be a read buffer of 8 bytes deep (the amount needed to cover subsequent tACC at full data rate). The rest is likely just muxes and maybe an I/O register pair.

The double latency is expected. This situation would be exactly why it is documented as possible to occur.

evanh · 2020-02-29 13:32

evanh wrote: »

rogloh wrote: »

Also it looks like the newer HyperRAM Tubular mentioned from Cypress doesn't support these slower refresh settings any more so it's probably not very standard.

The ISSI parts have the same config bits. I'd be surprised if those are being removed, unless they are offering some alternative ways of having longer bursts.

Damn, I see what you are talking about now. On page 5 it states those two bits are now read-only. That sucks.

EDIT: Oh, that's a Hyperbus 2.0 part. Time to read the full datasheet ...

evanh · 2020-02-29 13:57

evanh wrote: »

There will be a read buffer of 8 bytes deep (the amount needed to cover subsequent tACC at full data rate). The rest is likely just muxes and maybe an I/O register pair.

Ah, they call that a half-page. And I just read that additional latency is signalled when coinciding with a refresh, so it's not for short row boundaries like I had thought.

Hmm, maybe there are restrictions if not starting on half-page boundaries ...

EDIT: Correction, a half-page is 16 bytes (8 shortwords). Locations are all shortwords, not bytes.

Yanomani · 2020-02-29 13:59

evanh wrote: »

Yanomani wrote: »

Not specified anywhere, but clear as crystal (at least to me), or the two latency counts would need to be dedicated to filling both ghost buffers with two rows, in sequence, only to guarantee that linear burst operations will at least work, marginally, if an initial access targets the very end (less than tACC / (TCK x number of ought-to-be accessed WORDS), till the end of a row is reached).

The sense amps act as latches - read my above post. There will be a read buffer of 8 bytes deep (the amount needed to cover subsequent tACC at full data rate). The rest is likely just muxes and maybe an I/O register pair.

The double latency is expected. This situation would be exactly why it is documented as possible to occur.

Just now I've had a chance to read your posts (it was my breakfast time, when you were posting).

Now I understand it better (reads), and just by reading the HyperBus 2.0 parts datasheet, I noticed they'll need 35ns (tRWR = tACC = tRFH), which gives 7 HR clocks (@200 MHz), or 14 bytes, which suggests at least 16 bytes (or 8 words) should be read from, or written to, each row to ensure enough time is given for 2.0 parts to operate safely.
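The 35ns to 16 bytes reasoning above is easy to reproduce. A minimal sketch, assuming DDR transfers (2 bytes per HyperRAM clock) and rounding up to the 16-byte half-page mentioned earlier in the thread; the function name and parameters are illustrative only.

```python
import math


def min_bytes_to_cover(t_ns=35.0, clock_mhz=200.0,
                       bytes_per_clock=2, half_page_bytes=16):
    """Return (clocks, raw_bytes, rounded_bytes) needed to span t_ns.

    rounded_bytes is raw_bytes rounded up to a whole number of half-pages.
    """
    clocks = math.ceil(t_ns * clock_mhz / 1000.0)
    raw_bytes = clocks * bytes_per_clock
    rounded = math.ceil(raw_bytes / half_page_bytes) * half_page_bytes
    return clocks, raw_bytes, rounded


if __name__ == "__main__":
    print(min_bytes_to_cover())                  # 200 MHz HyperBus 2.0 case
    print(min_bytes_to_cover(clock_mhz=100.0))   # same timing at 100 MHz
```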

Despite the inherent self-refresh mechanism during reads, as pointed out by you (good document indeed, saved...), most of my musings still hold valid in the case of write operations targeting the main memory array's data contents.

evanh · 2020-02-29 14:08

I just realised it's 16 bytes myself.

Yanomani · 2020-02-29 14:17

At the end of each day, our heads did inflate a little, and so do our eyes, popping...

Too much information to archive, classify, nothing to erase at all, because we turn to the same subjects, very frequently...

Passing too much time writing @ XMHz, @ YV, Z nS (if not pS at all)... Ouch

Yanomani · 2020-02-29 15:45

evanh wrote: »

Yanomani wrote: »

Actually, there is not any flag the HyperBus controller can access, in order to be informed that any row's content can be already rotten, due to starvation (or total absence) of refresh.

I was talking about how the internal automatic refresh likely works. There is no informing outside the chip.

Manually, it all comes down to ensuring explicit row accesses I think. As Roger indicated, this would need 16 k accesses to refresh the whole 16 MB chip.

Very strange that dual-die, stacked devices, like the 128Mb ISSIs, would need more than 8192 refresh operations at all.

Each die still has 8k rows, and the whole device supports only fixed latency.

And the above-mentioned difference in refresh behavior (16384 rows (PER TWO-DIE DEVICE) versus 8192 rows (PER SINGLE-DIE DEVICE)) isn't mentioned in the ISSI datasheets I have in my collection.

It isn't mentioned even in the existing list of differences between devices, neither in the 2016 version (REV 00B, 2016-01-25) nor in the 2018 version (REV B4, 2018-09-21).

Also, when accessing either die in an MCP package, the other die is freed by its own internal controller (one is present on each die); only the highest Row Address line (CA35) is multiplexed between them, through a special circuit that differentiates the top and bottom die internally.

CA35 = 0 means Bottom device;
CA35 = 1 means Top device.

Sure, configuration registers 0 and 1 must be set individually for each device, but still under differentiation conditions, as established by the (CA35) value.

The rest of the circuitry is the same as a 64M device, so 8192 rows are expected to pertain to each one, and the differentiation circuitry ensures that only one device is connected to the external bus at any time (except during CA phase, when the differentiation is established, and maintained till the end of any access cycle).

That means accesses done (read or write) at one device are replicated at the other one; the sole difference being that reads don't interfere with each other, since one of the data/RWDS buffers is 3-stated when the other die is being accessed. Writes don't either, since the die that is not receiving new data has its own (internal) RWDS pulled high during the whole data block transfer, effectively masking out each and every byte/word.

Perhaps there is some misunderstanding going on (including on my part, for sure); better to verify than to repeatedly issue refresh reads to an already refreshed memory bank; twice the cost for an already done job.
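To make the CA35 die selection concrete, here is a rough Python sketch that packs a 48-bit Command/Address word and reads back the die-select bit. The bit layout used (CA47 = R/W#, CA46 = address space, CA45 = burst type, CA44..CA16 = word address A31..A3, CA2..CA0 = A2..A0) is assumed from the usual HyperBus CA description and should be checked against the actual datasheet; the function names are hypothetical.

```python
def build_ca(byte_addr, read=True, register_space=False, linear=True):
    """Pack a 48-bit HyperBus CA word (assumed layout, see note above)."""
    word_addr = byte_addr >> 1                       # locations are 16-bit words
    ca = (int(read) << 47) | (int(register_space) << 46) | (int(linear) << 45)
    ca |= ((word_addr >> 3) & 0x1FFFFFFF) << 16      # A31..A3 -> CA44..CA16
    ca |= word_addr & 0x7                            # A2..A0  -> CA2..CA0
    return ca


def die_select(ca):
    """0 = bottom die, 1 = top die, per the CA35 description above."""
    return (ca >> 35) & 1


if __name__ == "__main__":
    for addr in (0x000000, 0x7FFFFE, 0x800000, 0xFFFFFE):
        print(hex(addr), "-> die", die_select(build_ca(addr)))
```

Under this assumed layout CA35 corresponds to byte address bit 23, so the split falls on the 8 MB boundary of a 16 MB dual-die part.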

whicker · 2020-02-29 16:02

Yes @Yanomani there are now 4 DICE versions as well (at the 1.8V level).

I doubt their internal refresh counters are synchronized, with some refreshing on their own faster than others. Hence the need for fixed latency (always assuming the worst case).

In my own bench testing, I was surprised how often I didn't hit the additional latency refresh penalty with variable latency enabled on a single die version.

evanh · 2020-02-29 16:04

Oh, I read, just before, that the tCSM spec is based on half the period used for each row refreshing. So, 8 k rows will be the correct number for full array refresh. Roger guessed at 16 k and I didn't see a problem with it.

EDIT: It's under heading of "Distributed Refresh Interval". There is also some hints on the auto-refresh sequencing that suggests it can sequence at two distinct speeds and thereby catch up after a large burst. So not as basic as I first surmised. It's not particularly clear on what the limits are though.

Because tCSM is set to half the required distributed refresh interval, any series of maximum length host accesses that delay refresh operations will catch up on refresh operations at twice the rate required by the refresh interval divided by the number of rows.

evanh · 2020-02-29 16:19

Then it immediately says the usual:

The host system is required to respect the tCSM value by ending each transaction before violating tCSM.

The only way I can think of that makes the first quote a useful feature, is if the second quote is occasionally violated by a large margin.

whicker · 2020-02-29 16:37

I read that word salad as related to the Nyquist sampling theorem. The refresh is merely delayed, unless it is delayed for so long that the counter went to the next row.

evanh · 2020-02-29 17:04

That first quote seems to be stating it can cycle multiple times at double rate until caught back up, as long as there are gaps for it to take. The tCSM spec is to provide for this double rate. Although it isn't specific about how many rows this double rate can cover.

It would also seem to imply it can adjust/sync to whenever a gap appears if a refresh is already pending. So not the typical async behaviour of Nyquist.

Certainly more sophisticated than I first imagined.

whicker · 2020-02-29 19:47

Say there's a refresh counter incrementing every 8uS. It's off doing its own thing at wherever address, and that row refresh gets triggered to happen at that interval.

So now you come in and want control of the memory array. You must wait the worst case time for the refresh to complete just in case the start of your access and the refresh were simultaneous.

But what if the refresh demand more likely was triggered like 3 uS into your 4 uS burst.

By leaving a gap every 4uS, by definition any possible overlap of the refresh and your memory access will allow the deferred refresh to complete before incrementing.

Then you do another 4uS burst. No refresh needs to happen. You're about 5 of 8 uS into the refresh cadence. Then you do another 4 uS burst. Yet again somewhere in there about 3 uS into that burst the next row wants to be refreshed but has to wait.

It's a ping pong kind of thing. The worst combination would be just slightly over 8 uS long bursts causing a beat frequency of repeatedly skipped rows.
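The scenario described above can be played out in a toy model. This is purely a sketch of that description, not of documented HyperRAM behaviour: it assumes a refresh counter ticking every 8us, a refresh that is deferred to the next CS-high gap if the bus is busy, a refresh that itself takes negligible time, and a row that counts as skipped if the next tick arrives while the refresh is still waiting. All names are hypothetical.

```python
def count_skipped_rows(burst_ns, gap_ns, ticks=8192,
                       refresh_period_ns=8000, phase_ns=0):
    """Count refresh ticks that are still pending when the next tick arrives.

    Bursts of burst_ns (CS low) repeat back to back, separated by gap_ns.
    """
    cycle_ns = burst_ns + gap_ns
    skipped = 0
    for k in range(ticks):
        t = phase_ns + k * refresh_period_ns     # time this row falls due
        pos = t % cycle_ns                       # position within burst+gap
        if pos < burst_ns:                       # bus busy: must wait for gap
            wait_ns = burst_ns - pos
            if wait_ns >= refresh_period_ns:     # next tick beats the gap
                skipped += 1
    return skipped


if __name__ == "__main__":
    print("4.0 us bursts:", count_skipped_rows(4000, 100))
    print("8.1 us bursts:", count_skipped_rows(8100, 100))
```

In this model bursts of 4us or less can never skip a row, while bursts slightly over 8us skip one periodically, which is the beat effect the post describes.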

evanh · 2020-02-29 23:25

If it can defer in any way then it's not following Nyquist. And indications are it can defer effectively 32 milliseconds because that's 100% array coverage at double rate.
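The 32 millisecond figure follows from the numbers already quoted in the thread: 8 k rows, a nominal per-row interval of 2 x tCSM, and catch-up at one row per tCSM. A quick sketch of that arithmetic, assuming an 8 MB (64 Mbit) die for the bytes-per-row figure; the names are illustrative.

```python
def refresh_figures(rows=8192, tcsm_us=4.0, array_bytes=8 * 1024 * 1024):
    """Derive full-array refresh times from the row count and tCSM."""
    row_interval_us = 2 * tcsm_us                     # nominal distributed interval
    return {
        "bytes_per_row": array_bytes // rows,
        "nominal_full_refresh_ms": rows * row_interval_us / 1000.0,
        "double_rate_full_refresh_ms": rows * tcsm_us / 1000.0,
    }


if __name__ == "__main__":
    for name, value in refresh_figures().items():
        print(f"{name}: {value}")
```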

evanh · 2020-02-29 23:37

If I've got that right, then it would need a way to track its progress through the array. An easy method would be to have two counters. One incrementing at the nominal 2 x tCSM (8 us) period, the other one incrementing when a row is refreshed. And the refresh gets triggered by inequality between the two counters ... and CS de-asserted ... oops, and a tCSM timeout completed. New timeout starts with the second counter incremented by a row refresh.
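As a way of testing that guess on paper, here is a toy version of the two-counter idea. It is speculation layered on speculation: one counter counts rows falling due every 2 x tCSM, the other counts rows actually refreshed, and one refresh is assumed possible in each gap, with gaps arriving once per tCSM-length cycle. The function name and the simplifications are mine, not anything from a datasheet.

```python
def catch_up_time_ns(stall_ns, cycle_ns=4000, row_period_ns=8000):
    """Time needed after a refresh-blocking stall for 'done' to reach 'due'.

    due  = rows that have fallen due (one per row_period_ns of elapsed time)
    done = rows actually refreshed (at most one per burst+gap cycle)
    """
    if cycle_ns >= row_period_ns:
        raise ValueError("no spare refresh slots: backlog would never clear")
    t = stall_ns
    done = 0
    while done < t // row_period_ns:   # counters unequal: refresh still pending
        done += 1                      # one refresh in the gap that opens now
        t += cycle_ns                  # then the next burst runs
    return t - stall_ns


if __name__ == "__main__":
    for stall in (0, 8000, 80000):
        print(stall, "ns stall ->", catch_up_time_ns(stall), "ns to catch up")
```

In this model the backlog clears in roughly the same time as the stall lasted, which is what catching up "at twice the rate" would look like.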

It certainly would be handy if they just said how it works.

Yanomani · 2020-03-01 01:46

Well, no one is perfect, but...

I believe I've gathered some clues, about the reason some 2015'Cypress parts are being sold by Digikey, at hoard-inviting discounts...

A genuine, jewelry-alike piece of information, coming from a Cypress datasheet: Document Number: 001-97964 Rev. *E Revised March 01, 2016.

I'm just musing aloud...

What are we really looking for? To understand how self-refresh circuitry works, so we can leverage from that knowledge, in order to get better results...

evanh · 2020-03-01 02:26

Huh, yep, that is a good matching reason. I hadn't even looked at the Digikey link before now.
