Proposed external & HyperRAM memory interface suitable for video drivers with other COGs.
rogloh
As part of my ongoing video driver design I have found it necessary to design an
external memory request interface intended to allow the video driver to access
external memory for its frame buffer in addition to the Hub memory space, which
allows expansion of video resolutions and bit depths available to beyond what
is supported by 512kB in Hub memory alone.
In this case HyperRAM is the intended target but this design is quite generic
and could be just as easily be used for other memory types including read only
memory such as flash, or potentially even SD cards in some applications though
that is not the intention here.
What this offers is an interface specification for transferring data to or from
external memory, and allowing multiple COGs to share the external memory.
The arbiter function of the memory driver is implementation dependent and is
not specified here, however in video applications priority may be needed
so video COG(s) would get sufficient bandwidth for their real-time needs.
The memory request setup and polling needs to be fast and two 32 bit hub memory
locations are used to indicate a request is being made from a particular COG
to the memory driver COG. This is essentially a mailbox design where each
COG has its own pair of consecutive longs and a memory driver COG repeatedly
polls these longs over its set of COGs being serviced to process their incoming
requests.
The format is as follows:
First Mailbox Long:
 bit                                                                 bit
 31       28 27       24 23                                           0
------------------------------------------------------------------------
|  Request   |  BankAddr  |          External Memory Address           |
------------------------------------------------------------------------

The External Memory Address is self explanatory. It is the address of the
item being requested to either read or write in external memory.
The BankAddr is not specified in detail however it could be used for dividing
the memory into smaller sections with independent chip selected banks. Or it
could go to different devices on different pins for example, dividing up
the overall external memory space by device.
It can also be considered as an extension to the External Memory Address field
to allow memory devices larger than 16MB to be used. Alternatively, portions
of the 24 bit External Memory Address could be used to increase the BankAddr
field or whenever memories smaller than 16MB are desired. It is flexible but
implementation specific.
The upper 4 bit Request field is comprised of the following bits:
     bit 31           bit 30            bit 29          bit 28
--------------------------------------------------------------------
| Request_Active  |  Write/Error  |       Access_Type[1:0]         |
--------------------------------------------------------------------

Access_Type is a 2 bit field that indicates the size of the element in the
memory being requested.

Access_Type    Transfer Size
    00       -  Byte
    01       -  Word
    10       -  Long
    11       -  Burst (described below)

Write/Error should be set to 1 for performing external memory writes, or
set to 0 for external memory reads.
Request_Active is set by the calling COG to make a memory transfer request.
It will be cleared by the memory driver COG once the request has been serviced,
indicating the channel for that COG is idle and a new request can be made.
If the transfer is successful the entire long will be cleared, allowing
the requesting COG to easily poll for a zero value to indicate the request
is complete, when it knows no failures or errors are going to be possible.
In the case of devices supporting errors such as read only devices being
written to, or addresses out of range, or device not ready etc, this design
can also allow the indication of errors. When a memory transfer fails
the Request_Active bit is cleared however the Write/Error bit is set to
indicate an error occurred, and any error code can be left in the second
long for reading by the COG and/or in the remaining portion of the first
mailbox long. Errors are implementation dependent.
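
To make the layout of the first mailbox long concrete, here is a minimal Python sketch that packs and unpacks it using the bit positions given above. The helper names `pack_request` and `unpack_request` are hypothetical and only model the encoding; a real client would build this value in PASM2 or Spin2.

```python
# Bit positions follow the First Mailbox Long layout described in the post.
REQUEST_ACTIVE = 1 << 31          # set by the requesting COG
WRITE_OR_ERROR = 1 << 30          # 1 = write on request, error flag on completion
ACCESS_BYTE, ACCESS_WORD, ACCESS_LONG, ACCESS_BURST = 0, 1, 2, 3


def pack_request(address, bank=0, access=ACCESS_LONG, write=False):
    """Build the first mailbox long for a new request (hypothetical helper)."""
    if not 0 <= address < (1 << 24):
        raise ValueError("address must fit in 24 bits")
    if not 0 <= bank < 16:
        raise ValueError("bank must fit in 4 bits")
    if access not in (ACCESS_BYTE, ACCESS_WORD, ACCESS_LONG, ACCESS_BURST):
        raise ValueError("access type must be 0..3")
    value = REQUEST_ACTIVE | (access << 28) | (bank << 24) | address
    if write:
        value |= WRITE_OR_ERROR
    return value


def unpack_request(value):
    """Split a first mailbox long back into its fields."""
    return {
        "active": bool(value & REQUEST_ACTIVE),
        "write_or_error": bool(value & WRITE_OR_ERROR),
        "access": (value >> 28) & 0x3,
        "bank": (value >> 24) & 0xF,
        "address": value & 0xFFFFFF,
    }
```

For example, a word write to address $123456 in bank 2 packs to $D2123456, and a byte read from address 0 in bank 0 packs to $80000000.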
Second Mailbox Long:
 bit                                                                 bit
 31                                                                   0
------------------------------------------------------------------------
|                         Data Read or Written                         |
------------------------------------------------------------------------

In the case of individual access to bytes, words and longs, this long will be
used for the transfer of data to/from the external memory.
In the case of a read, it will get populated by the memory driver with the data
after completing the read and before it clears the Request_Active bit in the
first long. The data will be zero extended by the driver if the transfer size
is a byte or a word.
In the case of a write, data will be read from this location then written
into the external memory, once the driver sees the Request_Active field set
and a write being requested. It would then clear the Request_Active flag by
zeroing out the long.
For the burst transfers the long is interpreted differently, as shown below:
 bit                                                                 bit
 31        24 23    20 19                                             0
------------------------------------------------------------------------
|  Transfers  |  Size  |              Hub Memory Address               |
------------------------------------------------------------------------

The Hub Memory Address is the address in Hub RAM where the burst transfer
will source its data for writes, or have the data written to for reads.
The Size nibble is used to indicate the size of each item being transferred
which is a power of 2. The overall number of bytes read or written will be
scaled by this number.

Size   Multiplier   Amount
  0         1       1 byte
  1         2       1 word
  2         4       1 long
  3         8       8 bytes
  .         .       .
 14     16384       16kB
 15     32768       32kB

The Transfers byte indicates how many transfers to perform to complete the
request. A zero in this field represents a value of 256. This allows a
maximum of 8MB to be requested at a time (256 x 32kB).
Not every size from 1-8MB is possible to transfer. But the ranges overlap.
Other sizes not directly supported can still be made using multiple sequential
requests.
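
As an illustration of the burst form of the second long, the following Python sketch packs the Transfers, Size and Hub Memory Address fields and computes the total byte count. The function names are hypothetical, and the code only models the encoding described above (including the rule that a Transfers value of zero means 256).

```python
def pack_burst(hub_address, transfers, size):
    """Build the second mailbox long for a burst request (hypothetical helper)."""
    if not 0 <= hub_address < (1 << 20):
        raise ValueError("hub address must fit in 20 bits")
    if not 1 <= transfers <= 256:
        raise ValueError("transfers must be 1..256")
    if not 0 <= size <= 15:
        raise ValueError("size must be 0..15")
    # 256 does not fit in a byte, so it is stored as 0.
    return ((transfers & 0xFF) << 24) | (size << 20) | hub_address


def burst_bytes(second_long):
    """Total bytes moved by a burst: transfers x 2**size."""
    transfers = ((second_long >> 24) & 0xFF) or 256
    size = (second_long >> 20) & 0xF
    return transfers << size
```

With these definitions, 80 transfers at Size = 3 into hub address $1000 encodes as $50301000 and moves 640 bytes, while 256 transfers at Size = 15 moves the 8MB maximum.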
To transfer 640 bytes in the case of a video driver at 8bpp VGA resolution,
you could request 80 transfers and use a multiplier of 8 (Size = 3).
For 1920 pixels in HDTV resolution you could choose 240 x 8, or 120 x 16,
or 30 x 64, etc. Transfer sizes for most common video resolutions that
use an even number of pixels should be supported by this approach.
The multiplier becomes useful for different bit depths, as you could keep
the same transfer size for the same resolution, but just change the size
according to the bit depth. So a video resolution of 640x480x32bpp could
use 80 transfers x 32 (Size = 5) to read 2560 bytes, while at the same
resolution at 4bpp would use 80 transfers x 4 (Size = 2) to read 320 bytes.
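
The choices above can be enumerated mechanically. This Python sketch (the function name `burst_options` is made up for illustration) lists every Transfers/Size pair that covers a byte count exactly in one request, under the 1-256 transfer and Size 0-15 limits of the proposal.

```python
def burst_options(total_bytes):
    """Return all (transfers, size) pairs with transfers * 2**size == total_bytes."""
    options = []
    for size in range(16):                 # Size nibble: 0..15
        item = 1 << size                   # bytes per transfer
        if total_bytes % item == 0:
            transfers = total_bytes // item
            if 1 <= transfers <= 256:      # Transfers byte: 1..256
                options.append((transfers, size))
    return options
```

`burst_options(1920)` includes the 240 x 8, 120 x 16 and 30 x 64 choices mentioned above, while an odd length such as 641 bytes returns an empty list and would need multiple sequential requests.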
Example mailbox polling (supporting optional errors):
Once a requesting COG issues its external memory request, that COG can poll
for the result using the following code. If no errors are required the last
instruction can be omitted, as well as the use of the zero flag.
            wrlong  request, mailbox        ' issue request
poll        rdlong  result, mailbox wc wz   ' poll for completion
    if_c    jmp     #poll                   ' wait for request to be cleared
    if_nz   jmp     #handle_error           ' zero is success, non-zero failure
            '...continue
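
For readers less familiar with the flag usage, the decision made by the polling code can be modelled in a few lines of Python. This is only a sketch of the logic (the name `poll_status` is hypothetical): the carry flag corresponds to bit 31 of the long read back, and the zero flag to the whole long being zero.

```python
def poll_status(first_long):
    """Classify the first mailbox long as read back by the requesting COG."""
    if first_long & (1 << 31):     # Request_Active still set (C flag)
        return "pending"
    if first_long == 0:            # whole long cleared (Z flag)
        return "ok"
    return "error"                 # active bit clear but long non-zero
```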
Comments
I would like to throw some cents in. I think your second long should always (not just in burst mode) contain the address of the item read from or written to.
My reasoning is that your driver writes it into the HUB, and my program then needs to read the value from the HUB again and put it where it belongs (another HUB access) to free the mailbox.
Same for writing: I need to read my source (HUB) and write mailbox2 (HUB), then your driver reads the HUB.
current proposal
I need to read a long. I write command & external address in Mailbox1: ONE write to HUB. Your driver writes the result: ONE write to HUB. Now I need one HUB read and one HUB write to place it.
Makes four HUB accesses (excluding waiting for the result / clearing mailboxes, which stays the same in both versions).
I need to write a long. I read the long from HUB, I write it into mailbox2, I write the command, you read the long from HUB: at least 4, excluding command/response as above.
Makes four HUB accesses.
Let's use indirection.
I need to read a long. I write HUB address in mailbox2, command & external address in Mailbox1: TWO writes to HUB. Your driver will put the long where it belongs: ONE write to HUB.
Makes just three HUB accesses.
I need to write a long. I write HUB address in mailbox2, command & external address in Mailbox1: TWO writes to HUB. Your driver will get the long where it actually is: ONE read from HUB.
Makes just three HUB accesses.
The other thing is the double use of read/write (as input) and error flag. It will work, but basically you need to set bit 31 AND clear bit 30 after each command to clear the error flag on reads. I have no better idea right now, but it is sort of itching me.
This might be usable for other things too.
Mike
Also by using setq you could read in 2 longs in a burst rather than reading once, then reading again for the result. If the transfer is complete after you test the request_active bit and see it clear, you can just use the data from the second long already read in. It should be reasonably fast.
The error flag is optional. I think it would only ever be needed for memory types that can fail, such as writes to flash. In pure RAM cases, it won't error. Nibble accesses are fast in the P2, you can set all 4 top bits at the same time for a fresh read or write request, no need to set/clear individual bits like the P1.
Also from the memory driver perspective, it can request a burst read of all eight mailboxes in a row. It would get this in 16 clocks, and have any single write data already present. No need to go read the hub a second time for the write data unless indirection/burst is indicated. This improves latency.
Basically burst transfer does what I asked for; I did not notice it can even read 1 byte as a "burst".
Mike
Rather than have the requesting cog poll HUBRAM in a loop, couldn't the arbiter cog use COGATN to signal completion? Then the requesting cog could have the option to focus on something useful while waiting, without burning a hub access per loop.
I found 2 longs was always too restrictive, and the way around that is to use the 2nd long as a pointer to another buffer. That's what Catalina for P1 did and I always thought that having a pointer to a pointer was IMHO silly tho P1 was tight for memory.
It could potentially be done that way. In the case of my own video client COG, it would probably not even be checking the result, and will assume the data is ready 25us later simply because of its real-time nature. There is nothing it can do if the data is not ready, apart from maybe drawing a blank line instead of replicating what is left in the buffer from last time if it knows the transfer is not complete.
There needs to be a way to indicate such a notification preference. Perhaps another memory area in HUB RAM can be defined for the driver COG to use that nominates COG preferences for how they are to be notified on completion of their requests. I'd prefer to not have to fill out an additional long each time for this however if the choice is made on a per request basis, unless we steal a bit from the BankAddr perhaps...
For memory use specifically I am hoping to keep this particular mailbox specification down to two longs just to help minimize the setup required and reduce the driver polling latency, as it has to be particularly simple for video and with few instructions. I could see for other devices needing additional control, or for locks etc, more longs may be required, though they can always use different drivers dedicated for that purpose.
I am thinking this allows for the expansion of future scheduling options and selection of features like requesting to be notified by COGATN from the driver upon completion, or just via the regular mailbox polling. It does mean that the first time a COG starts to use the external memory driver, it would probably also need to have a way to reset back to defaults, in case some prior code running on this same COG ID had earlier configured the arbiter with its own settings. I am thinking this could simply be a write of 0xFFFFFFFF to the mailbox 1 register, as that value would not need to be used for other things. You would lose one of the 16 banks (the top 16MB of the overall 256MB address space) to this control path feature, but it would open up future capabilities once the arbiters become more complex.
As an example we may wish to put in a way for a video COG to indicate to the driver that it is in vertical blanking for the next number of clock cycles and it would be free to grant larger requests to other non-video COGs, or do more refresh cycles etc. Or for it to advertise how many clocks until the next scanline request, so the arbiter could do other things. It might allow a way to request a regular update of data to be sent to hub without requesting by the COG as well, or setup a block copy internally between memory banks, or zero fill etc. Lots of possibilities...
I think I will add this. Does anyone else have any thoughts on this?
It seems like a reasonable tradeoff to me. If someone needs more than 240MB then one of the control words could be to flip a bank switch.
After already extending what I had started to support other things such as fill and variable length copies, HyperRAM register access etc, here are the shortcomings I've identified with this current proposal that I've implemented...
For the transfers that exceed 256 bytes that are not 1-256 multiples of powers of 2, I needed to extend the read and write burst requests by pointing the second mailbox long value to a second pair of longs in hub RAM containing the real hub address buffer and a separate transfer length (in bytes), instead of combining length and hub address together in the second mailbox long as I do for the fixed burst transfers that my video driver uses. It was not too difficult to support this in the driver but it did add some more overhead to the code path that deals with it, requiring an additional memory read including its extra egg-beater latency. It also complicates setting up variable sized requests by requiring structures that point to other structures in memory (outside the mailbox area). I suspect arbitrary length copies will be quite frequently requested from high level languages, so it would be good to have this capability built in and optimised from the start. Also the extra packing and unpacking of data lengths and sizes within the hub memory address itself takes more instruction overhead, both at request setup time and in my driver, using various getnib, setnib, getbyte, setbyte operations to populate and extract these fields. It would be nice to avoid this.
Polling latency - right now the polling loop is always transferring 16 longs from all 8 COG's mailbox long pairs before checking and branching on an active memory request. This block move operation on the P2, after the setq #16-1 is issued, takes a minimum of 24 clocks plus some variable egg-beater latency of up to 7 more clocks. There is both good and bad in doing it like this.
The benefit is that once you detect that a COG mailbox request is active in the first mailbox long being polled, you already have read in the extra data value in the second mailbox slot so there is no need to go read it again before you continue the operation, saving at least another 9-16 clocks for an additional rdlong. Another benefit when the request/data long pairs are consecutively stored in memory per COG is that you can setup and issue the request from the requestor COG quickly using "setq #1" & "wrlong req, mailbox" fast block transfer sequence. Similarly the requesting COG can also poll the mailbox for the result and get the data back in the same block transfer operation when reading a byte, word or long from external memory. It only takes one extra cycle to read/write the second long if using setq #1 beforehand which was a real no-brainer in this case and well worth doing.
The downside however is that there is always 8 extra cycles of overhead in the polling loop as you read ALL the COGs mailbox data. Because all COGs mailboxes get stored consecutively you will always need to read these 16 longs even if only a small handful of COGs are ever actively using the external memory driver because you can't easily re-arrange the read order, particularly if COG requestors are dynamic. It is true that an optimization exists for the single COG requestor case where you can reduce this mailbox group read down to just two longs for that COG alone, but that is a very special singular case and not the general case.
Now if you ever want to expand the mailbox size for passing in additional parameters (which I'd quite like to do now for bank-to-bank and graphics copies etc) you can't access this additional data with the first read operation in the polling loop unless you increase the setq size within the polling loop to read say 32 longs if you wanted 4 longs per mailbox, or even 64 for 8 longs per mailbox. This adds 16-48 extra clocks, or the equivalent of 8-24 instructions worth of overhead, to the processing path when including this additional polling latency, which seems excessive to me. By effectively reducing the polling rate it also increases the overall per request jitter in any cases where that may matter (it probably won't in most cases). The alternative approach is to separate the first mailbox long from the remaining mailbox parameter entries in memory. But as soon as you do that, the requesting COG cannot do the mailbox command setup and data parameter writing atomically with a "setq #; wrlong" sequence, nor poll and get the result back in the related "setq #; rdlong" sequence. It has to do two writes and potentially two reads (if data is being returned). The benefit however is that once we decouple the mailbox request from all its data arguments we can increase its data size to arbitrary values and keep each individual data element more unpacked in separate longs, which is faster to use in the driver and probably also to set up the original request if a block transfer is used. A requesting COG can also easily reuse a request template already set up in HUB memory and just write a (constant) pointer to it in the first mailbox long each time a new request is issued.
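
The polling cost figures above can be reproduced with a small Python model. It assumes, as an inference from the quoted "minimum of 24 clocks" for 16 longs, that a block read costs a fixed 8 clocks plus one clock per long; the variable egg-beater latency is ignored, and the names are hypothetical.

```python
COGS = 8
FIXED_CLOCKS = 8   # assumption: inferred from 24 clocks minimum for 16 longs


def min_poll_clocks(longs_per_mailbox, cogs=COGS):
    """Minimum clocks to block-read every COG's mailbox once."""
    return FIXED_CLOCKS + cogs * longs_per_mailbox


def extra_clocks(longs_per_mailbox):
    """Added polling cost relative to the current 2 long mailbox."""
    return min_poll_clocks(longs_per_mailbox) - min_poll_clocks(2)
```

Under this model 4 longs per mailbox adds 16 clocks and 8 longs per mailbox adds 48 clocks, matching the 16-48 clock range quoted above; at 2 clocks per instruction that is the 8-24 instruction equivalent.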
This pointer approach is good for all the different cases with different parameter requirements. E.g.:
All these different request cases (and others) can be accommodated with a mailbox scheme that uses a pointer to a variable sized request block. Also if these request blocks include an additional link field they can be chained together into a nice request list from the original pointer which just becomes the head of the list.
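
A chained request list of this kind might look like the following Python sketch. The field names (`command`, `ext_address`, `hub_address`, `length`, `link`) are purely illustrative assumptions, not the driver's actual block layout, and in hub RAM the link would be an address rather than an object reference.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class RequestBlock:
    """Hypothetical variable sized request block with a link field."""
    command: str
    ext_address: int
    hub_address: int
    length: int
    link: Optional["RequestBlock"] = None   # next request, or None at the tail


def walk(head: Optional[RequestBlock]) -> Iterator[RequestBlock]:
    """Yield blocks in the order a driver would service the list."""
    block = head
    while block is not None:
        yield block
        block = block.link
```

The mailbox would then only ever hold a pointer to the head block, and the driver follows `link` until it reaches the end of the list.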
Right now I am still sort of following both alternatives in software to see which one wins out. I have the implementation that started from the original proposal I'm working with, and there probably will be ways to extend it to support the additional features discussed, though it is getting harder to achieve and the performance may start to suffer in the more complex cases.
I am also looking at coding up some of the key parts of an alternative implementation that uses this mailbox pointer to a request block approach with only one mailbox slot and want to see how much additional instruction overhead it adds to the various code paths. I think the latter approach will be far more extensible to the features I wish to add, perhaps at the expense of more cycles. But in the end it may help streamline the code in some places too and have other benefits....
Great Stuff,
Mike
I think, for my SDRAM driver for P2-Hot, each cog could just specify read or write, start block address, and maybe a block count. When the driver satisfied the request, it would write zero back to the request mailbox, whose MSB had signalled the request. I think block size might have been 1KB, or something. If I recall, the entire mailbox was one long per cog. Latency was low enough for video.
Chip's P2-hot SDRAM driver sounds quite similar to where my current approach has started from.
I think I can take some ideas from the second approach and feed them into the first approach for the request lists and more complex graphics transfers. In doing this work I've already found a common optimization I can use in either approach using the LUT RAM for the increased per COG state storage I need with these extended requests.
If I could simply extend it to a 3 long mailbox it might help out a bit too, with only a moderate 8 clock increase in polling overhead. It would still negatively affect atomic request rates a little but I know it can buy back more than 7 instructions of transfer size unpacking/computing operations in the block transfer requests, which people should really try to use for highest performance with HyperRAM anyway. The thing is that any latency impact on other COGs from high priority video block transfers is still going to swamp any single byte/long/word access anyway so realistically these extra overheads on individual element transfer are probably of little impact in the real world. I need to keep things in perspective.
I'm certainly not abandoning the current method quite yet and all this thinking helps me figure out what else I need to put in for these copy/request list features. It's just getting more complicated for me to extend things from the current approach, though even if I do it could remain pretty fast. With SKIPF and EXECF being used I think there's only three branches being taken for the simplest requests - one from polling to per COG handler setup, the next to go perform the requested service, and the last to return back to the polling loop. I think that's about as small as you can get it. The rest of the instructions in the sequence basically need to just do the work. There's not a great deal of other testing/branching overhead apart from computing the block transfer size or going off and reading some variable block size information from hub (which is what is bugging me).
When you say "atomic granularity" do you mean the base address of each block in external memory you wish to access would be quantised to 1kB? If so, that then impacts per scan line frame buffer memory layout. I can actually accommodate this in the video driver with the scan line skew parameter.
But if you meant that transfer bursts get quantised in steps of 1kB, that is not ideal for video which would then be reading in more data than is needed at various resolutions, wasting bus bandwidth.
eg.
640 pixels @ 8bpp rounds up to 1kB
800 pixels @ 8bpp rounds up to 1kB
1280 pixels @ 8bpp rounds up to 2kB
etc
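
The rounding in these examples, and the bandwidth it would waste, can be checked with a short Python sketch (the helper names are hypothetical):

```python
def rounded_up(nbytes, block=1024):
    """Round a transfer length up to a whole number of blocks."""
    return -(-nbytes // block) * block     # ceiling division


def wasted_fraction(nbytes, block=1024):
    """Fraction of the transferred data that the scan line does not need."""
    total = rounded_up(nbytes, block)
    return (total - nbytes) / total
```

For a 640 byte scan line quantised to 1kB, 384 of the 1024 bytes read (37.5%) would be unused.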
I can already accommodate transferring any number from 1-256 of 1kB sized blocks in the current design with two mailbox longs. The upper 12 bits of the second long define the transfer length requested. You can choose to transfer anything from:
1-256 bytes
1-256 words
1-256 longs
1-256 8 byte values
...
1-256 1kB sized blocks
...
1-256 16kB sized blocks
In my recent expansion to support arbitrary length transfers I've replaced the previously defined maximum 32kB block size to now indicate the hub address points to a variable byte length block size & new hub address. It all works out but adds some additional read overhead when used and could complicate memory management for request lists slightly.
E.g. if ptra points to the list of requests in memory you just do this to initiate the transfer. Really simple.
There's no need to even setup the second parameter making its command launch really fast once your list is setup.
Having to transfer 1024B-blocks @Sysclk = 250MHz and HyperCK = 125MHz (DDR = 250 MBps), will blow-out your 4uS self-refresh time-budget.
And I'm not solely talking about (1024 x 4nS = 4096 nS); there is also the (tWRW + 2 x Latency Count (5 x THyperCk each, @125MHz)) barrier, that, sure, encompasses the CA phase. The above time adds 96nS to the whole process, which gives a total of 4,192 nS (~4.6% higher than spec'd).
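
The arithmetic behind this objection can be sketched in Python. It takes the figures quoted above (4 ns per byte at 250 MBps, 96 ns of fixed overhead per access, a 4 us budget) as given; the function names are hypothetical.

```python
def transfer_ns(nbytes, ns_per_byte=4, overhead_ns=96):
    """Time the bus is occupied for one burst, using the figures quoted above."""
    return nbytes * ns_per_byte + overhead_ns


def max_burst_bytes(budget_ns=4000, ns_per_byte=4, overhead_ns=96):
    """Largest burst that still fits inside the refresh time budget."""
    return (budget_ns - overhead_ns) // ns_per_byte
```

A 1024 byte block comes to 4192 ns, over the 4 us budget, while under the same assumptions the largest burst that fits is 976 bytes.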
In theory I have 2-3 sub-burst limits that apply here. Right now I only set up limit 1) but I really also need to apply 2) & 3) dynamically, not statically.
1) A per COG limit - this allows RR COGs to not affect video. One day we could optionally dynamically adjust this value to enforce fairness amongst different COGs to stop COGs asking for large transfers from dominating.
2) A per device limit, eg. HyperRAM needs 4us, but HyperFlash is not limited.
3) A different read/write limit, e.g. we might sometimes do writes at sysclk/2 but reads at sysclk/1. This scales up/down the sub-burst size that fits into the 4us interval.
Unfortunately for item 2) it means I need to lookup a per device limit and apply it. Probably a few more instructions for that step. Doing 3) will be another few instructions to test the device bus transfer speed and scale for a read/write. It all starts to add up when you need to cover all cases.
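
One way to picture how the three limits combine is the Python sketch below. It is only a model under stated assumptions: the per device limit is expressed as a time budget (None for a device with no limit, such as HyperFlash), the read/write speed is expressed as ns per byte, fixed access overhead is ignored, and all names and example values are illustrative.

```python
def sub_burst_limit(cog_limit_bytes, device_budget_ns, ns_per_byte):
    """Largest sub-burst in bytes allowed by the COG, device and bus speed."""
    if device_budget_ns is None:           # e.g. a device with no time limit
        return cog_limit_bytes
    return min(cog_limit_bytes, device_budget_ns // ns_per_byte)


def split(total_bytes, limit):
    """Break one request into sub-bursts no larger than limit."""
    chunks = []
    while total_bytes > 0:
        n = min(total_bytes, limit)
        chunks.append(n)
        total_bytes -= n
    return chunks
```

With a 4000 ns device budget, halving the bus speed (8 ns per byte instead of 4) halves the sub-burst size from 1000 to 500 bytes, which is the scaling described in item 3).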
Here was the recent measurement taken:
https://forums.parallax.com/discussion/comment/1490412/#Comment_1490412
https://www.cypress.com/file/498626/download
You're probably aware of this already, thought I'd mention it just in case
Polling doesn't run concurrently as I have a dedicated polling loop outside the transfer code. For large bursts though I believe I can overlap some next burst setup work during the time while waiting to complete the sub-bursts. I don't do this yet, but know about that optimisation. I could at least increase the hub/ext-mem addresses and set up a couple of other things before I do the waitxfi. One problem with doing this though is that the single accesses share this code path and we don't want to burden those sequences with unnecessary work, though I guess a skipf can help in that case.
The problem is the total amount of overhead to refresh each row is going to be in the order of a microsecond and there's around 16k rows to do in 64ms (16MB device). It may not be feasible to keep interrupting the transfers this often when it will already manage the refresh for you if you stick to the 4us limit.
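
A rough Python calculation of that cost, using the approximate figures above (about 16k rows, a 64 ms refresh period, about 1 us of overhead per row), with a hypothetical helper name:

```python
def refresh_overhead(rows=16 * 1024, period_ms=64.0, per_row_us=1.0):
    """Return (interval between row refreshes in us, fraction of time spent)."""
    period_us = period_ms * 1000.0
    interval_us = period_us / rows
    fraction = rows * per_row_us / period_us
    return interval_us, fraction
```

This gives one row refresh roughly every 3.9 us, with about a quarter of the bus time spent on controller-driven refresh under these assumptions.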
Actually I wasn't aware of this, so it was helpful to point out. It's actually a bit annoying they've changed the register format so I think the register setup code would now need to be customised for a given device/manufacturer version etc. It should be handled outside the driver at setup time. We can at least read the registers beforehand to determine the manufacturer.
There are other possible methods of conquering Controller(P2)-supervised refresh ops, that would consume almost none of our data-transfer bandwidth.
Depending "how wise" their embedded self-refresh algorithm can be, I believe there are some alternative ways to keep their main memory array contents completely sane, without even having to waste that "pesky" 1uS.
I think it's pretty simplistic. It'll be something like a row counter and a row refresh done bit. The row counter will be incremented, and done bit cleared, metronomically. If no opportunity for a refresh is available during that row's period then it'll be skipped.
Following that line of thinking, leads to the conclusion that even the existence of any "done" bit is questionable.
If one doesn't ask "is there anyone using the bathroom?" before opening the door, why feel surprised on finding that, effectively, there is someone already inside?
ISSI has released preliminary information about new ECC-supervised HyperRAMs, equipped with an "ERR" pin, to flag that at least one ECC error has occurred during the ongoing access.
It's statistically unlikely that some row that did lose its content integrity, by not having received enough "attention" from some kind of refresh procedure, would fail to develop content-related ECC errors.
So, at least those memories would let you know that their contents aren't reliable, anymore.
But there is a caveat, as ever...
Contents that are never accessed would not generate any warning.
Tricky, tricky, tricky..
Either you have to access them, very frequently, to ensure they were given enough attention to be fresh at all, or you have to access them, very frequently, to ensure they'll be given enough attention to be fresh at all.
One can be spanked for not having a dog by his side, or, otherwise, be spanked for having an unresponsive (or ferocious) dog by his side.
Read the last word of the row, then continue to have the address counter increment into the first word of the next row.