Proposed external & HyperRAM memory interface suitable for video drivers with other COGs.

rogloh · 2019-10-21 01:29

As part of my ongoing video driver design I have found it necessary to design an
external memory request interface intended to allow the video driver to access
external memory for its frame buffer in addition to the Hub memory space. This
allows video resolutions and bit depths to expand beyond what is supported by
the 512kB of Hub memory alone.

In this case HyperRAM is the intended target but this design is quite generic
and could just as easily be used for other memory types including read only
memory such as flash, or potentially even SD cards in some applications though
that is not the intention here.

What this offers is an interface specification for transferring data to or from
external memory, and allowing multiple COGs to share the external memory.

The arbiter function of the memory driver is implementation dependent and is
not specified here, however in video applications prioritization may be needed
so that video COG(s) get sufficient bandwidth for their real-time needs.

The memory request setup and polling needs to be fast and two 32 bit hub memory
locations are used to indicate a request is being made from a particular COG
to the memory driver COG. This is essentially a mailbox design where each
COG has its own pair of consecutive longs and a memory driver COG repeatedly
polls these longs over its set of COGs being serviced to process their incoming
requests.

The format is as follows:

First Mailbox Long:

  bit                                                                    bit 
   31     28 27      24 23                                                0  
   ------------------------------------------------------------------------  
  | Request | BankAddr |            External Memory Address                | 
   ------------------------------------------------------------------------

The External Memory Address is self explanatory. It is the address of the
item being requested to either read or write in external memory.

The BankAddr is not specified in detail however it could be used for dividing
the memory into smaller sections with independent chip selected banks. Or it
could go to different devices on different pins for example, dividing up
the overall external memory space by device.

It can also be considered as an extension to the External Memory Address field
to allow memory devices larger than 16MB to be used. Alternatively, portions
of the 24 bit External Memory Address could be used to increase the BankAddr
field or whenever memories smaller than 16MB are desired. It is flexible but
implementation specific.

The upper 4 bit Request field is comprised of the following bits:

         bit 31           bit 30           bit 29           bit 28
    --------------------------------------------------------------------
   | Request_Active |   Write/Error  |        Access_Type[1:0]          |
    --------------------------------------------------------------------

Access_Type is a 2 bit field that indicates the size of the element in the
memory being requested.

  Access_Type   Transfer Size
      00     -     Byte 
      01     -     Word
      10     -     Long
      11     -     Burst (described below)

Write/Error should be set to 1 for performing external memory writes, or
set to 0 for external memory reads.

Request_Active is set by the calling COG to make a memory transfer request.
It will be cleared by the memory driver COG once the request has been completed,
indicating the channel for that COG is idle and a new request can be made.

If the transfer is successful the entire long will be cleared, allowing
the requesting COG to easily poll for a zero value to indicate the request
is complete, when it knows no failures or errors are going to be possible.

In the case of devices supporting errors such as read only devices being
written to, or addresses out of range, or device not ready etc, this design
can also allow the indication of errors. When a memory transfer fails
the Request_Active bit is cleared however the Write/Error bit is set to
indicate an error occurred, and any error code can be left in the second
long for reading by the COG and/or in the remaining portion of the first
mailbox long. Errors are implementation dependent.
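
For illustration, the bit layout of the first mailbox long can be modelled on a
host machine. This is a minimal Python sketch rather than driver code; the
function and constant names are made up here, and it only packs and unpacks
the fields as described above.

```python
# Host-side model of the first mailbox long. All names are illustrative.
ACCESS_BYTE, ACCESS_WORD, ACCESS_LONG, ACCESS_BURST = 0, 1, 2, 3

REQUEST_ACTIVE = 1 << 31   # bit 31, set by the requesting COG
WRITE_OR_ERROR = 1 << 30   # bit 30, 1 = write on request, 1 = error on completion


def pack_request(access_type: int, write: bool, bank: int, address: int) -> int:
    """Build a first mailbox long with Request_Active set."""
    if not 0 <= access_type <= 3:
        raise ValueError("Access_Type is a 2 bit field")
    if not 0 <= bank <= 0xF:
        raise ValueError("BankAddr is a 4 bit field")
    if not 0 <= address <= 0xFFFFFF:
        raise ValueError("External Memory Address is a 24 bit field")
    value = REQUEST_ACTIVE | (access_type << 28) | (bank << 24) | address
    if write:
        value |= WRITE_OR_ERROR
    return value


def unpack_request(value: int) -> dict:
    """Split a first mailbox long back into its fields."""
    return {
        "active": bool(value & REQUEST_ACTIVE),
        "write_or_error": bool(value & WRITE_OR_ERROR),
        "access_type": (value >> 28) & 0x3,
        "bank": (value >> 24) & 0xF,
        "address": value & 0xFFFFFF,
    }


read_long = pack_request(ACCESS_LONG, write=False, bank=0, address=0x001000)
write_byte = pack_request(ACCESS_BYTE, write=True, bank=1, address=0x123456)
```

Here a long read at address 0x001000 in bank 0 packs to 0xA0001000, and a byte
write at 0x123456 in bank 1 packs to 0xC1123456.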

Second Mailbox Long:

  bit                                                                    bit 
   31                                                                     0  
   ------------------------------------------------------------------------  
  |                          Data Read or Written                          | 
   ------------------------------------------------------------------------

In the case of individual access to bytes, words and longs, this long will be
used for the transfer of data to/from the external memory.

In the case of a read, it will get populated by the memory driver with the data
after completing the read and before it clears the Request_Active bit in the
first long. The data will be zero extended by the driver if the transfer size
is a byte or a word.

In the case of a write, data will be read from this location then written
into the external memory, once the driver sees the Request_Active field set
and a write being requested. It would then clear the Request_Active flag by
zeroing out the first long.

For the burst transfers the long is interpreted differently, as shown below:

  bit                                                                    bit 
   31         24 23  20 19                                                0  
   ------------------------------------------------------------------------  
  |  Transfers  | Size |                 Hub Memory Address                | 
   ------------------------------------------------------------------------

The Hub Memory Address is the address in Hub RAM where the burst transfer
will source its data for writes, or have the data written to for reads.

The Size nibble gives the size of each item being transferred as a power
of 2 (item size = 2^Size bytes). The overall number of bytes read or written
is the Transfers count multiplied by this item size.

  Size    Multiplier    Amount
    0          1        1 byte
    1          2        1 word
    2          4        1 long
    3          8        8 bytes
    .          .        .
   14      16384        16kB
   15      32768        32kB

The Transfers byte indicates how many transfers to perform to complete the
request. A zero in this field represents a value of 256. This allows a
maximum of 8MB to be requested at a time (256 x 32kB).

Not every size from 1 byte to 8MB is possible to transfer in a single request,
but the ranges overlap. Sizes not directly supported can still be made up
from multiple sequential requests.

To transfer 640 bytes in the case of a video driver at 8bpp VGA resolution,
you could request 80 transfers and use a multiplier of 8 (Size = 3).
For 1920 pixels in HDTV resolution you could choose 240 x 8, or 120 x 16,
or 30 x 64, etc. Transfer sizes for most common video resolutions that
use an even number of pixels should be supported by this approach.

The multiplier becomes useful for different bit depths, as you could keep
the same transfer count for the same resolution, but just change the Size
according to the bit depth. So a video resolution of 640x480x32bpp could
use 80 transfers x 32 (Size = 5) to read 2560 bytes, while the same
resolution at 4bpp would use 80 transfers x 4 (Size = 2) to read 320 bytes.
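
As a worked check of the burst encoding, here is a small host-side Python
sketch (hypothetical helper names, not part of any driver) that packs the
second mailbox long for a burst and lists which Transfers/Size pairs can
express a given byte count in a single request.

```python
# Host-side model of the burst form of the second mailbox long.
# Helper names are illustrative only.

def pack_burst(transfers: int, size: int, hub_address: int) -> int:
    """Pack Transfers (1..256, with 256 stored as 0), Size and Hub address."""
    if not 1 <= transfers <= 256:
        raise ValueError("Transfers must be 1..256")
    if not 0 <= size <= 15:
        raise ValueError("Size is a 4 bit field")
    if not 0 <= hub_address < (1 << 20):
        raise ValueError("Hub Memory Address is a 20 bit field")
    return ((transfers & 0xFF) << 24) | (size << 20) | hub_address


def burst_bytes(second_long: int) -> int:
    """Total bytes moved by a burst request."""
    transfers = ((second_long >> 24) & 0xFF) or 256   # 0 represents 256
    size = (second_long >> 20) & 0xF
    return transfers << size


def burst_options(byte_count: int) -> list:
    """All (transfers, size) pairs that give exactly byte_count bytes."""
    options = []
    for size in range(16):
        unit = 1 << size
        if byte_count % unit == 0 and 1 <= byte_count // unit <= 256:
            options.append((byte_count // unit, size))
    return options


vga_8bpp = burst_options(640)    # includes (80, 3), i.e. 80 x 8 bytes
hdtv_8bpp = burst_options(1920)  # includes (240, 3), (120, 4) and (30, 6)
```

A byte count such as 641 has no single-request encoding, so
`burst_options(641)` comes back empty and would need to be split into several
requests.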

Example mailbox polling (supporting optional errors):

Once a requesting COG issues its external memory request, that COG can poll
for the result using the following code. If no errors are required the last
instruction can be omitted, as well as the use of the zero flag.

         wrlong request, mailbox      ' issue request

poll     rdlong result, mailbox wcz   ' poll for completion
  if_c   jmp    #poll                 ' wait for request to be cleared
  if_nz  jmp    #handle_error         ' zero is success, non-zero failure
         '...continue
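
The flag tests in the polling loop amount to a three-way decision on the value
read back from the first mailbox long. A host-side Python sketch of that
decision (the function name is made up, and where an error code lives is left
to the implementation):

```python
def classify_result(first_long: int) -> str:
    """Mirror the C (bit 31) and Z (zero) tests made after the rdlong."""
    if first_long & (1 << 31):
        return "busy"      # Request_Active still set, keep polling
    if first_long == 0:
        return "ok"        # whole long cleared by the driver on success
    return "error"         # Request_Active clear but long non-zero
```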

msrobots · 2019-10-21 02:15

Very interesting.

I would like to throw some cents in. I think your second long should always (not just in burst mode) contain the HUB address of the item read from or written to.

My reasoning is that your driver writes the result into the HUB, and my program then needs to read the value again from the HUB and put it where it belongs (another HUB access) to free the mailbox.
Same for writing: I need to read my source (HUB) and write mailbox2 (HUB), then your driver reads it from the HUB.

Current proposal:

I need to read a long. I write command & external address in Mailbox1: ONE write to HUB. Your driver writes the result: ONE write to HUB. Now I need one HUB read and one HUB write to place it.
Makes four HUB accesses (excluding waiting for result/clearing mailboxes, which stays the same in both versions).

I need to write a long. I read the long from HUB, I write it into mailbox2, I write the command, you read the long from HUB: at least four, excluding command/response as above.
Makes four HUB accesses.

Let's use indirection:

I need to read a long. I write the HUB address in mailbox2 and command & external address in Mailbox1: TWO writes to HUB. Your driver will put the long where it belongs: ONE write to HUB.
Makes just three HUB accesses.

I need to write a long. I write the HUB address in mailbox2 and command & external address in Mailbox1: TWO writes to HUB. Your driver will get the long from where it actually is: ONE read from HUB.
Makes just three HUB accesses.

The other thing is the double use of bit 30 as read/write (as input) and error flag. It will work, but basically you need to set bit 31 AND clear bit 30 after each command to clear the error flag on reads. I have no better idea right now, but it is sort of itching me.

This might be usable for other things too.

Mike

rogloh · 2019-10-21 02:46

The way I think about it is that data to be written does not always come from HUB, it can be internally sourced from the COG, similarly it doesn't always need to go to HUB - you might be doing some type of VM running executable code and making requests for data from external memory to be used within the COG. So there is no need to do indirection for ALL transfers. If you want to use indirection for a single transfer you still can do this, by setting the transfers to 1 and size to 0/1/2 in the second long. This will result in one byte, word or long being transferred indirectly as you requested.

Also by using setq you could read in 2 longs in a burst rather than reading once, then reading again for the result. If the transfer is complete after you test the request_active bit and see it clear, you can just use the data from the second long already read in. It should be reasonably fast.

The error flag is optional. I think it would only ever be needed for memory types that can fail, such as writes to flash. In pure RAM cases, it won't error. Nibble accesses are fast in the P2, you can set all 4 top bits at the same time for a fresh read or write request, no need to set/clear individual bits like the P1.

Also from the memory driver perspective, it can request a burst read of all eight mailboxes in a row. It would get this in 16 clocks, and have any single write data already present. No need to go read the hub a second time for the write data unless indirection/burst is indicated. This improves latency.

msrobots · 2019-10-21 03:01

Aah, yes.

Basically burst transfer does what I asked for. I did not notice it can even read 1 byte as a "burst".

Mike

AJL · 2019-10-21 03:07

Rather than have the requesting cog poll HUBRAM in a loop, couldn't the arbiter cog use COGATN to signal completion? Then the requesting cog could have the option to focus on something useful while waiting, without burning a hub access per loop.

Cluso99 · 2019-10-21 03:09

FWIW, in my SD Driver I have used 4 hub longs IIRC (at work atm) as follows...

command    byte  0  ' command to the driver (non-zero) set by caller, cleared by driver
                    ' the caller must have cleared the status before setting this
status     byte  0  ' result of the command (non-zero is error) set by driver, cleared by caller
                    ' the driver clears command before setting this, $0 is good 
cog        byte  0  ' $80+cogid (set by the driver)
lock       byte  0  ' $80+lockid (set by the driver) - a lock is taken when driver is invoked
sector     long  0  ' SD sector required (set by caller)
ptrbuffer  long  0  ' pointer to the location where the sector data is stored
spare?     long  0  ' SD pins used: CSpin<<24 + CKpin<<16 + DIpin<<8 + DOpin, set by caller
                    ' once the driver takes these, the slot is usable for whatever

I have used similar in the P1 for serial. Here I used $1xx to indicate a character and $100 as empty. By using longs, it's also possible to use them as pointers to hub buffers.

I found 2 longs was always too restrictive, and the way around that is to use the 2nd long as a pointer to another buffer. That's what Catalina for P1 did and I always thought that having a pointer to a pointer was IMHO silly tho P1 was tight for memory.

rogloh · 2019-10-21 04:03

AJL wrote: »

Rather than have the requesting cog poll HUBRAM in a loop, couldn't the arbiter cog use COGATN to signal completion? Then the requesting cog could have the option to focus on something useful while waiting, without burning a hub access per loop.

It could potentially be done that way. In the case of my own video client COG, it would probably not even be checking the result, and will assume the data is ready 25us later simply because of its real-time nature. There is nothing it can do if the data is not ready, apart from maybe drawing a blank line instead of replicating what is left in the buffer from last time if it knows the transfer is not complete.

There needs to be a way to indicate such a notification preference. Perhaps another memory area in HUB RAM can be defined for the driver COG to use that nominates each COG's preference for how it is to be notified on completion of its requests. If the choice is made on a per request basis, however, I'd prefer not to have to fill out an additional long each time, unless we steal a bit from the BankAddr perhaps...

rogloh · 2019-10-21 04:14

Cluso99 wrote: »

I found 2 longs was always too restrictive, and the way around that is to use the 2nd long as a pointer to another buffer. That's what Catalina for P1 did and I always thought that having a pointer to a pointer was IMHO silly tho P1 was tight for memory.

For memory use specifically I am hoping to keep this particular mailbox specification down to two longs just to help minimize the setup required and reduce the driver polling latency, as it has to be particularly simple for video and with few instructions. I could see for other devices needing additional control, or for locks etc, more longs may be required, though they can always use different drivers dedicated for that purpose.

rogloh · 2019-10-23 07:28

I think it makes sense in this spec to add an inband control path to be able to configure the memory driver according to each COG's requirements. I am proposing to extend the use of the BankAddr bits to allow for this. So the idea is that if the BankAddr field is set to 0xF = 1111 this indicates a control read/write instead of a normal external memory read or write. Both the remaining 24 bit part of the mailbox long and the 32 bit second long can be used to pass parameters for the request, though it could be useful to partition some of the first mailbox long to identify which driver "register" or setting is being accessed. Perhaps into 8 bits + 16 bits, leaving the full 32 bits in the other mailbox register for wider arguments or additional control parameters by using a hub or external memory pointer.

I am thinking this allows for the expansion of future scheduling options and selection of features like requesting to be notified by COGATN from the driver upon completion, or just via the regular mailbox polling. It does mean that the first time a COG starts to use the external memory driver, it would probably also need to have a way to reset back to defaults, in case some prior code running on this same COG ID had earlier been configuring the arbiter with its own settings. I am thinking this could simply be a write of 0xFFFFFFFF to the mailbox 1 register, as that value would not need to be used for other things. You would lose one of the 16 banks (the top 16MB of the overall 256MB address space) to this control path feature, but it would open up future capabilities once the arbiters become more complex.
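
To make the proposed control path concrete, here is a host-side Python sketch
of how a first mailbox long could be classified. It is only a model of the
idea in this post: the function names are invented, and the placement of the
8 bit "register" above the 16 bit argument is an assumption, since the split
is only suggested as a possibility.

```python
CONTROL_BANK = 0xF               # BankAddr value proposed for control requests
RESET_TO_DEFAULTS = 0xFFFFFFFF   # proposed reset value for mailbox long 1


def classify_request(first_long: int) -> str:
    """Decide whether a request is a reset, a control access or a memory access."""
    if first_long == RESET_TO_DEFAULTS:
        return "reset"
    if (first_long >> 24) & 0xF == CONTROL_BANK:
        return "control"
    return "memory"


def split_control(first_long: int) -> tuple:
    """Assumed layout: bits 23..16 select a driver register, bits 15..0 are an argument."""
    return (first_long >> 16) & 0xFF, first_long & 0xFFFF
```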

As an example we may wish to put in a way for a video COG to indicate to the driver that it is in vertical blanking for the next number of clock cycles and it would be free to grant larger requests to other non-video COGs, or do more refresh cycles etc. Or for it to advertise how many clocks until the next scanline request, so the arbiter could do other things. It might allow a way to request a regular update of data to be sent to hub without requesting by the COG as well, or setup a block copy internally between memory banks, or zero fill etc. Lots of possibilities...

I think I will add this. Does anyone else have any thoughts on this?

AJL · 2019-10-23 08:54

It seems like a reasonable tradeoff to me. If someone needs more than 240MB then one of the control words could be to flip a bank switch.

rogloh · 2020-02-28 12:10

I'm encountering some driver design dilemmas at the moment while trying to add more support for request lists, graphics memory copies and external bank to bank transfers to my HyperRAM driver, which this original proposal wasn't really ever designed to support. The existing proposal is basically highly optimized for single accesses (byte/word/long reads and writes), as well as hub to external memory and external to hub memory bursts, transferring a variable number (1-256) of nominated fixed block sizes (indicated as powers of two). This is ideally suited for atomic element access plus video and cache type burst transfers, which is where things had first started from.

After already extending what I had started to support other things such as fill and variable length copies, HyperRAM register access etc, here are the shortcomings I've identified with this current proposal that I've implemented...

For the transfers that exceed 256 bytes that are not 1-256 multiples of powers of 2, I needed to extend the read and write burst requests by pointing the second mailbox long value to a second pair of longs in hub RAM containing the real hub buffer address and a separate transfer length (in bytes), instead of combining length and hub address together in the second mailbox long as I do for the fixed burst transfers that my video driver uses. It was not too difficult to support this in the driver but it did add some more overhead to the code path that deals with it, requiring an additional memory read including its extra egg-beater latency. It also complicates setting up variable sized requests by requiring structures that point to other structures in memory (outside the mailbox area). I suspect arbitrary length copies will be quite frequently requested from high level languages, so it would be good to have this capability built in and optimised from the start.

Also the extra packing and unpacking of data lengths and sizes within the hub memory address itself takes more instruction overhead, both at request setup time and in my driver, using various getnib, setnib, getbyte, setbyte operations to populate and extract these fields. It would be nice to avoid this.

Polling latency - right now the polling loop is always transferring 16 longs from all 8 COG's mailbox long pairs before checking and branching on an active memory request. This block move operation on the P2, after the setq #16-1 is issued, takes a minimum of 24 clocks plus some variable egg-beater latency of up to 7 more clocks. There is both good and bad in doing it like this.

The benefit is that once you detect that a COG mailbox request is active in the first mailbox long being polled, you already have read in the extra data value in the second mailbox slot so there is no need to go read it again before you continue the operation, saving at least another 9-16 clocks for an additional rdlong. Another benefit when the request/data long pairs are consecutively stored in memory per COG is that you can setup and issue the request from the requestor COG quickly using "setq #1" & "wrlong req, mailbox" fast block transfer sequence. Similarly the requesting COG can also poll the mailbox for the result and get the data back in the same block transfer operation when reading a byte, word or long from external memory. It only takes one extra cycle to read/write the second long if using setq #1 beforehand which was a real no-brainer in this case and well worth doing.

The downside however is that there is always 8 extra cycles of overhead in the polling loop as you read ALL the COGs mailbox data. Because all COGs mailboxes get stored consecutively you will always need to read these 16 longs even if only a small handful of COGs are ever actively using the external memory driver because you can't easily re-arrange the read order, particularly if COG requestors are dynamic. It is true that an optimization exists for the single COG requestor case where you can reduce this mailbox group read down to just two longs for that COG alone, but that is a very special singular case and not the general case.

Now if you ever want to expand the mailbox size for passing in additional parameters (which I'd quite like to do now for bank-to-bank and graphics copies etc) you can't access this additional data with the first read operation in the polling loop unless you increase the setq size within the polling loop to read, say, 32 longs if you wanted 4 longs per mailbox, or even 64 for 8 longs per mailbox. This adds 16-48 extra clocks, or the equivalent of 8-24 instructions worth of overhead, to the processing path when including this additional polling latency, which seems excessive to me. By effectively reducing the polling rate it also increases the overall per request jitter in any cases where that may matter (it probably won't in most cases).

The alternative approach is to separate the first mailbox long from the remaining mailbox parameter entries in memory. But as soon as you do that, the requesting COG cannot do the mailbox command setup and data parameter writing atomically with a "setq #; wrlong" sequence, nor poll and get the result back in the related "setq #; rdlong" sequence. It has to do two writes and potentially two reads (if data is being returned).

The benefit however is that once we decouple the mailbox request from all its data arguments we can increase its data size to arbitrary values and keep each individual data element more unpacked in separate longs, which is faster to use in the driver and probably also faster to set up in the original request if a block transfer is used. A requesting COG can also easily reuse a request template already set up in HUB memory and just write a (constant) pointer to it in the first mailbox long each time a new request is issued.
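
The polling overhead figures can be reproduced with a little arithmetic. This
Python sketch assumes, as the numbers in this post imply, roughly one extra
clock per additional long in the block read and two clocks per instruction;
both constants are assumptions for the estimate rather than measured values.

```python
COGS = 8                     # mailboxes read on every polling pass
CLOCKS_PER_INSTRUCTION = 2   # assumed cost used for the instruction equivalent


def extra_poll_clocks(longs_per_mailbox: int, baseline_longs: int = 2) -> int:
    """Extra clocks per polling pass compared with the two-long mailbox."""
    return COGS * (longs_per_mailbox - baseline_longs)


# Mailbox size in longs -> (extra clocks, instruction equivalent)
overhead = {
    longs: (extra_poll_clocks(longs),
            extra_poll_clocks(longs) // CLOCKS_PER_INSTRUCTION)
    for longs in (3, 4, 8)
}
```

Under those assumptions a 4 long mailbox costs 16 extra clocks per pass and an
8 long mailbox costs 48, while a 3 long mailbox would cost 8.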

This pointer approach is good for all the different cases with different parameter requirements. E.g.:

For single element ext-mem write you need two parameters:
- ExtMem Address  (including bank select bits)
- Data (byte/word/long)

For burst transfers involving hub and ext-mem you need three parameters:
- ExtMem Address  (including bank select bits)
- Hub address
- Count of bytes to transfer

For fills of bytes/words/longs you need 3 parameters:
- ExtMem Address  (including bank select bits)
- Data pattern
- Count of bytes/words/longs to transfer

For ext-mem to ext-mem copies you need 5 parameters
- ExtMem SrcAddress (including bank select bits)
- Hub address of intermediate buffer (if used, or 0 if direct bus transfer)
- Size of intermediate buffer in bytes (if used, or 0 if direct bus transfer)
- Count of bytes to transfer in total
- ExtMem DestAddress (including bank select bits)

For graphics copies you need 6 parameters (or 7 if it is external memory bank-to-bank transfer):
- ExtMem Src/DstAddress (including bank select bits)
- Hub address of buffer
- Size of each scan line transfer in bytes (related to horizontal pixels written and bit depth)
- Scan line repeat count (vertical pixels written)
- Offset to add to SrcAddress per scan line (pitch)
- Offset to add to HubAddress per scan line (pitch)
- (optional ExtMem DestAddress for bank-to-bank case)

All these different request cases (and others) can be accommodated with a mailbox scheme that uses a pointer to a variable sized request block. Also if these request blocks include an additional link field they can be chained together into a nice request list from the original pointer which just becomes the head of the list.
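
A host-side Python sketch of the request block idea follows. It models a
variable sized block with a link field and walks a chain from the head
pointer; the class, the kind labels and the example values are invented for
illustration, and only the parameter counts come from the cases listed above.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional

# Allowed parameter counts per request kind (labels are illustrative).
PARAM_COUNTS = {
    "single_write": (2,),
    "burst": (3,),
    "fill": (3,),
    "ext_copy": (5,),
    "gfx_copy": (6, 7),   # 7 for the external bank-to-bank case
}


@dataclass
class RequestBlock:
    kind: str
    params: List[int]
    link: Optional["RequestBlock"] = None   # next block, None ends the list

    def __post_init__(self) -> None:
        if len(self.params) not in PARAM_COUNTS[self.kind]:
            raise ValueError(f"wrong parameter count for {self.kind}")


def walk(head: Optional[RequestBlock]) -> Iterator[RequestBlock]:
    """Follow the link fields from the head of the list to the end."""
    block = head
    while block is not None:
        yield block
        block = block.link


# Example: a fill followed by a hub/ext-mem burst, chained from one head.
burst = RequestBlock("burst", [0x0010_0000, 0x4000, 640])
head = RequestBlock("fill", [0x0000_0000, 0xFF, 1024], link=burst)
order = [block.kind for block in walk(head)]
```

In this model the mailbox would only need to carry the address of `head`, and
the driver would process blocks until it reaches one with no link.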

Right now I am still sort of following both alternatives in software to see which one wins out. I have the implementation that started from the original proposal I'm working with, and there probably will be ways to extend it to support the additional features discussed, though it is getting harder to achieve and the performance may start to suffer in the more complex cases.

I am also looking at coding up some of the key parts of an alternative implementation that uses this mailbox pointer to a request block approach with only one mailbox slot and want to see how much additional instruction overhead it adds to the various code paths. I think the latter approach will be far more extensible to the features I wish to add, perhaps at the expense of more cycles. But in the end it may help streamline the code in some places too and have other benefits....

msrobots · 2020-02-28 16:58

I think your second approach has a lot of benefits, that the requesting COG can set up multiple parameter blocks in advance and maybe even chain them makes a lot of sense.

Great Stuff,

Mike

cgracey · 2020-02-28 22:13

There's always value in just keeping things simple.

I think, for my SDRAM driver for P2-Hot, each cog could just specify read or write, start block address, and maybe a block count. When the driver satisfied the request, it would write zero back to the request mailbox, whose MSB had signalled the request. I think block size might have been 1KB, or something. If I recall, the entire mailbox was one long per cog. Latency was low enough for video.

rogloh · 2020-02-28 23:39

The last two responses basically illustrate my dilemma. Both are true. LOL.

Chip's P2-hot SDRAM driver sounds quite similar to where my current approach has started from.

I think I can take some ideas from the second approach and feed them into the first approach for the request lists and more complex graphics transfers. In doing this work I've already found a common optimization I can use in either approach using the LUT RAM for the increased per COG state storage I need with these extended requests.

If I could simply extend it to a 3 long mailbox it might help out a bit too, with only a moderate 8 clock increase in polling overhead. It would still negatively affect atomic request rates a little but I know it can buy back more than 7 instructions of transfer size unpacking/computing operations in the block transfer requests, which people should really try to use for highest performance with HyperRAM anyway. The thing is that any latency impact on other COGs from high priority video block transfers is still going to swamp any single byte/long/word access anyway so realistically these extra overheads on individual element transfer are probably of little impact in the real world. I need to keep things in perspective.

I'm certainly not abandoning the current method quite yet and all this thinking helps me figure out what else I need to put in for these copy/request list features. It's just getting more complicated for me to extend things from the current approach, though even if I do it could remain pretty fast. With SKIPF and EXECF being used I think there's only three branches being taken for the simplest requests - one from polling to per COG handler setup, the next to go perform the requested service, and the last to return back to the polling loop. I think that's about as small as you can get it. The rest of the instructions in the sequence basically need to just do the work. There's not a great deal of other testing/branching overhead apart from computing the block transfer size or going off and reading some variable block size information from hub (which is what is bugging me).

cgracey · 2020-02-28 23:52

It maybe wouldn't hurt to make the atomic granularity rather large, like 1KB. I think most cogs would need a number of those blocks and wouldn't need to have other cogs access the same blocks. Inter-cog mailboxing can be done in hub memory.

rogloh · 2020-02-29 00:59

I'm not entirely sure what you mean Chip.

When you say "atomic granularity" do you mean the base address of each block in external memory you wish to access would be quantised to 1kB? If so, that then impacts per scan line frame buffer memory layout. I can actually accommodate this in the video driver with the scan line skew parameter.

But if you meant that transfer bursts get quantised in steps of 1kB, that is not ideal for video which would then be reading in more data than is needed at various resolutions, wasting bus bandwidth.
eg.
640 pixels @ 8bpp rounds up to 1kB
800 pixels @ 8bpp rounds up to 1kB
1280 pixels @ 8bpp rounds up to 2kB
etc
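
The rounding in the list above can be checked with a few lines of Python. This is only a sketch of the arithmetic, assuming 8bpp means one byte per pixel and that a 1kB quantum means 1024 bytes; the function name is made up for illustration.

```python
def round_up_to_block(nbytes, block=1024):
    """Round a byte count up to the next multiple of block."""
    return -(-nbytes // block) * block

for pixels in (640, 800, 1280):
    line_bytes = pixels                      # 8bpp: one byte per pixel
    fetched = round_up_to_block(line_bytes)  # what a 1kB-quantised burst would read
    wasted = fetched - line_bytes            # bytes read but never displayed
    print(f"{pixels} pixels: fetch {fetched} bytes, {wasted} wasted")
```

The wasted bytes per scan line are the bus bandwidth that a 1kB burst quantum would cost the video driver.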

I can already accommodate transferring any number from 1-256 of 1kB sized blocks in the current design with two mailbox longs. The upper 12 bits of the second long define the transfer length requested. You can choose to transfer anything from:

1-256 bytes
1-256 words
1-256 longs
1-256 8 byte values
...
1-256 1kB sized blocks
...
1-256 16kB sized blocks

In my recent expansion to support arbitrary length transfers I've replaced the previously defined maximum 32kB block size to now indicate the hub address points to a variable byte length block size & new hub address. It all works out but adds some additional read overhead when used and could complicate memory management for request lists slightly.
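
As an illustration of the length scheme described above, here is a sketch that finds a (count, unit size) pair for a given byte length. The exact split of the 12 bits is not spelled out in this post, so the sketch only assumes a count of 1-256 and a power-of-two unit size from 1 byte up to 16kB; the names and the returned tuple are hypothetical, not the driver's actual encoding.

```python
MAX_COUNT = 256     # 1-256 units per request, as listed in the post
MAX_EXPONENT = 14   # 2**14 = 16kB, the largest unit size listed

def encode_length(nbytes):
    """Return (count, exponent) with count * 2**exponent == nbytes, or None.

    Picks the smallest power-of-two unit that keeps count within 1..256.
    None means the length cannot be expressed this way (assumed to need
    the variable length form read from hub instead).
    """
    if nbytes < 1:
        return None
    for exponent in range(MAX_EXPONENT + 1):
        unit = 1 << exponent
        if nbytes % unit == 0 and nbytes // unit <= MAX_COUNT:
            return nbytes // unit, exponent
    return None

print(encode_length(640))    # 160 longs
print(encode_length(257))    # not expressible as count * power of two
```

Lengths such as 257 bytes are the ones that fall outside this scheme, which is the gap the variable byte length form is meant to cover.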

cgracey · 2020-02-29 02:47

Rogloh, you are right. Video needs dictate that finer granularity should be supported. If video doesn't get its data on time, it doesn't work.

rogloh · 2020-02-29 02:54

Actually I think I can steal some ideas from my alternative approach and apply them to the current approach, sort of creating the best of both worlds. In the case of more complex transfers like graphics copies and fills I can make it so that they can be applied only from within a list of requests (whose list size can be one if needed). I do have my special 0xF bank and 8 other services can be used with it to rapidly identify this case and branch out to perform it. So if you want to do some existing transfers you can keep doing it the existing way with the two mailbox longs, but for more complex transfers such as graphics copies and fills or multiple audio channel transfers etc you just set up a list and a pointer to it.

E.g. if ptra points to the list of requests in memory you just do this to initiate the transfer. Really simple.

    setbyte ptra, #START_LIST, #3  '  where START_LIST could equal something like $BF.
    wrlong  ptra, mailbox  '  where mailbox is your COG's mailbox being polled by the external memory driver

There's no need to even set up the second parameter, making the command launch really fast once your list is set up.
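
For anyone following along without a P2 handy, this sketch shows the value that ends up in the mailbox after the setbyte above. It assumes the first mailbox long layout given at the top of the thread (request in bits 31-28, bank in bits 27-24) and uses the example value $BF; the Python function names are made up for illustration.

```python
START_LIST = 0xBF  # example value from the post: request $B, bank $F

def list_request(list_ptr, cmd=START_LIST):
    """Mimic `setbyte ptra, #START_LIST, #3`: replace the top byte of the pointer."""
    return (list_ptr & 0x00FFFFFF) | ((cmd & 0xFF) << 24)

def decode(mailbox_long):
    """Split the first mailbox long into (request, bank, address)."""
    return mailbox_long >> 28, (mailbox_long >> 24) & 0xF, mailbox_long & 0xFFFFFF

word = list_request(0x012345)   # hypothetical hub address of the request list
print(hex(word), decode(word))
```

The list pointer rides in the low 24 bits, so a single wrlong carries both the command and the address of the list.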

Yanomani · 2020-02-29 03:00

Some thoughts about 1kB HyperRam data transfers...

Transferring 1024B blocks at Sysclk = 250MHz and HyperCK = 125MHz (DDR = 250 MBps) will blow out your 4uS self-refresh time budget.

And I'm not solely talking about the data phase (1024 x 4nS = 4096 nS); there is also the (tWRW + 2 x Latency Count (5 x THyperCk each, @125MHz)) barrier, which, sure, encompasses the CA phase. That adds 96 nS to the whole process, which gives a total of 4,192 nS (~4.6% higher than spec'd).
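
A quick sketch of the arithmetic above, taking the 4 nS per byte and 96 nS overhead figures from this post as given rather than deriving them from a datasheet. The constant and function names are hypothetical.

```python
NS_PER_BYTE = 4            # 250 MBps DDR at HyperCK = 125MHz
OVERHEAD_NS = 96           # CA phase + latency figure quoted in the post
REFRESH_BUDGET_NS = 4000   # 4uS CS-low limit

def burst_time_ns(nbytes):
    """Total CS-low time for one burst of nbytes."""
    return nbytes * NS_PER_BYTE + OVERHEAD_NS

def max_burst_bytes():
    """Largest burst that still fits inside the refresh budget."""
    return (REFRESH_BUDGET_NS - OVERHEAD_NS) // NS_PER_BYTE

total = burst_time_ns(1024)
print(total, total - REFRESH_BUDGET_NS, max_burst_bytes())
```

Under these numbers a burst has to stay a little under 1kB to respect the limit.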

rogloh · 2020-02-29 03:08

Yes I know about this, and I limit the large block sizes to be lower than 1kB on the bus by breaking the overall burst transfers into sub-bursts. I also have to compensate slightly for the address phase & latency CS low time, reducing the bandwidth further.

In theory I have 2-3 sub-burst limits that apply here. Right now I only set up limit 1) but I really also need to apply 2) & 3) dynamically, not statically.

1) A per COG limit - this allows RR COGs to not affect video. One day we could optionally dynamically adjust this value to enforce fairness amongst different COGs to stop COGs asking for large transfers from dominating.

2) A per device limit, eg. HyperRAM needs 4us, but HyperFlash is not limited.

3) A different read/write limit, e.g. we might sometimes do writes at sysclk/2 but reads at sysclk/1. This scales up/down the sub-burst size that fits into the 4us interval.

Unfortunately for item 2) it means I need to look up a per device limit and apply it. Probably a few more instructions for that step. Doing 3) will be another few instructions to test the device bus transfer speed and scale for a read/write. It all starts to add up when you need to cover all cases.
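
The three limits above could be combined along these lines. This is a sketch of the idea only, not the driver's code: the function names, the 100ns overhead figure and the one-byte-per-transfer-clock assumption are all hypothetical.

```python
def device_limit_bytes(cs_low_limit_ns, overhead_ns, sysclk_hz, clk_div):
    """Bytes that fit in one CS-low window at a byte rate of sysclk/clk_div.

    cs_low_limit_ns of None models a device with no limit (e.g. HyperFlash).
    """
    if cs_low_limit_ns is None:
        return None
    window_ns = cs_low_limit_ns - overhead_ns
    return (window_ns * sysclk_hz) // (clk_div * 1_000_000_000)

def sub_burst_limit(cog_limit, dev_limit):
    """Apply the tightest of the per COG and per device limits."""
    limits = [x for x in (cog_limit, dev_limit) if x is not None]
    return min(limits) if limits else None

def split_transfer(nbytes, limit):
    """Break a transfer into sub-bursts no larger than limit."""
    if limit is None:
        return [nbytes]
    chunks = []
    while nbytes > 0:
        n = min(nbytes, limit)
        chunks.append(n)
        nbytes -= n
    return chunks

read_limit = device_limit_bytes(4000, 100, 200_000_000, 1)   # reads at sysclk/1
write_limit = device_limit_bytes(4000, 100, 200_000_000, 2)  # writes at sysclk/2
print(read_limit, write_limit, split_transfer(2000, read_limit))
```

Halving the transfer clock halves the sub-burst size that fits into the same 4us window, which is why limit 3) has to scale limit 2).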

evanh · 2020-02-29 03:16

How big, time wise, is the smallest interval between client requests? ie: When it's processing the mailboxes. Or can that run concurrently to block transfers?

rogloh · 2020-02-29 03:20

I've measured the idle time between clock activity on the HyperRAM bus to be about 1.2us @ 200MHz P2 with back to back requests. However I found I can shave this down a lot lower for the video requests that keep control of the bus between bursts and avoid the re-polling and other service setup overhead for the next request. I think I counted about 30 instructions between activity in that case, so maybe only 0.3us. This is good for video which can dominate the bus bandwidth and will have its requests broken up.

Here was the recent measurement taken:
https://forums.parallax.com/discussion/comment/1490412/#Comment_1490412
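
The 0.3us estimate above follows from the instruction count, as this small sketch shows. It assumes 2 clocks per instruction, which ignores hub access waits and taken branches, so treat it as a lower bound; the function name is made up for illustration.

```python
def instructions_to_us(n_instr, sysclk_hz, clocks_per_instr=2):
    """Approximate execution time in microseconds for n_instr instructions."""
    return n_instr * clocks_per_instr / sysclk_hz * 1e6

def us_to_clocks(us, sysclk_hz):
    """Number of sysclk cycles in a given number of microseconds."""
    return round(us * sysclk_hz / 1e6)

print(instructions_to_us(30, 200_000_000))   # video case, bus kept between bursts
print(us_to_clocks(1.2, 200_000_000))        # measured back to back gap, in clocks
```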

Tubular · 2020-02-29 03:30

I noticed from this doc that the auto refresh seems to have been tightened on the v2 hyperrams, so 4us is the new max
https://www.cypress.com/file/498626/download

You're probably aware of this already, thought I'd mention it just in case

evanh · 2020-02-29 03:34

I'm not sure it'll help due to burden but the driver could take control of the DRAM refresh cycles. It's mentioned in the datasheet somewhere. This would allow much larger single bursts because the refreshes can be bunched then.

rogloh · 2020-02-29 03:42

evanh wrote: »

Or can that run concurrently to block transfers?

Polling doesn't run concurrently as I have a dedicated polling loop outside the transfer code. For large bursts though I believe I can overlap some next burst setup work during the time while waiting to complete the sub-bursts. I don't do this yet, but know about that optimisation. I could at least increase the hub/ext-mem addresses and set up a couple of other things before I do the waitxfi. One problem with doing this though is that the single accesses share this code path and we don't want to burden those sequences with unnecessary work, though I guess a skipf can help in that case.

evanh wrote: »

I'm not sure it'll help due to burden but the driver could take control of the DRAM refresh cycles. It's mentioned in the datasheet somewhere. This would allow much larger single bursts because the refreshes can be bunched then.

The problem is the total amount of overhead to refresh each row is going to be in the order of a microsecond and there's around 16k rows to do in 64ms (16MB device). It may not be feasible to keep interrupting the transfers this often when it will already manage the refresh for you if you stick to the 4us limit.
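
To put rough numbers on that, here is a sketch using the figures from this reply (about 16k rows, 64ms, and roughly 1us of overhead per manual row refresh). These are the approximate values quoted here, not datasheet numbers, and the names are hypothetical.

```python
ROWS = 16 * 1024           # "around 16k rows" for a 16MB device
REFRESH_PERIOD_US = 64_000 # all rows must be refreshed within 64ms
PER_ROW_OVERHEAD_US = 1.0  # "in the order of a microsecond"

def refresh_interval_us():
    """Average time available between row refreshes."""
    return REFRESH_PERIOD_US / ROWS

def refresh_bus_fraction():
    """Fraction of bus time spent on driver-managed refresh."""
    return ROWS * PER_ROW_OVERHEAD_US / REFRESH_PERIOD_US

print(refresh_interval_us(), refresh_bus_fraction())
```

One row every ~3.9us lines up with the 4us limit, and spending about a quarter of the bus time on manual refresh is the cost being weighed against simply staying within that limit.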

Tubular wrote: »

I noticed from this doc that the auto refresh seems to have been tightened on the v2 hyperrams, so 4us is the new max
https://www.cypress.com/file/498626/download

You're probably aware of this already, thought I'd mention it just in case

Actually I wasn't aware of this, so it was helpful to point out. It's actually a bit annoying they've changed the register format so I think the register setup code would now need to be customised for a given device/manufacturer version etc. It should be handled outside the driver at setup time. We can at least read the registers beforehand to determine the manufacturer.

evanh · 2020-02-29 03:50

rogloh wrote: »

The problem is the total amount of overhead to refresh each row is going to be in the order of a microsecond and there's around 16k rows to do in 64ms (16MB device). It may not be feasible to keep interrupting the transfers this often when it will already manage the refresh for you if you stick to the 4us limit.

It's a pity it doesn't have a command for "do power hungry back-to-back refreshes until CS goes high".

rogloh · 2020-02-29 04:01

Wouldn't that be nice.

Yanomani · 2020-02-29 04:19

Suddenly, I feel the need to have variable-latency-able HyperRams assembled onto 64004-ES-REVA boards; only two HyperRams, no flash at all...

There are other possible methods of conquering Controller(P2)-supervised refresh ops that would consume almost none of our data-transfer bandwidth.

Depending on "how wise" their embedded self-refresh algorithm can be, I believe there are some alternative ways to keep their main memory array contents completely sane, without even having to waste that "pesky" 1uS.

evanh · 2020-02-29 04:34

Yanomani wrote: »

Depending on "how wise" their embedded self-refresh algorithm can be, I believe there are some alternative ways to keep their main memory array contents completely sane, without even having to waste that "pesky" 1uS.

I think it's pretty simplistic. It'll be something like a row counter and a row refresh done bit. The row counter will be incremented, and done bit cleared, metronomically. If no opportunity for a refresh is available during that row's period then it'll be skipped.
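
That guess can be written down as a toy model. This is purely a sketch of the speculation above (a row counter, a done bit, one fixed period per row), not anything taken from a datasheet; the function and its inputs are hypothetical.

```python
def simulate_refresh(bus_busy, rows):
    """Count skipped refreshes per row under the guessed scheme.

    bus_busy[i] is True when CS is held low for the whole of period i,
    leaving no gap in which the device can refresh the current row.
    """
    skipped = [0] * rows
    row = 0
    for busy in bus_busy:
        done = False               # done bit cleared at the start of the period
        if not busy:
            done = True            # refresh slotted into an idle gap
        if not done:
            skipped[row] += 1      # no opportunity: this row's turn is lost
        row = (row + 1) % rows     # row counter advances regardless
    return skipped

print(simulate_refresh([False, True, False, True], rows=2))
```

In this model a row whose period always coincides with a long burst never gets refreshed, and nothing reports it.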

Yanomani · 2020-02-29 05:28

Actually, there is not any flag the HyperBus controller can access, in order to be informed that any row's content can be already rotten, due to starvation (or total absence) of refresh.

Following that line of thinking, leads to the conclusion that even the existence of any "done" bit is questionable.

If one doesn't ask "is there anyone using the bathroom?" before opening the door, why feel surprised to find that, effectively, there is someone already inside?

ISSI has released preliminary information about new ECC-supervised HyperRams, equipped with an "ERR" pin to flag that at least one ECC error has occurred during the ongoing access.

It's statistically unlikely that a row that has lost its content integrity, by not having received enough "attention" from some kind of refresh procedure, would fail to develop content-related ECC errors.

So, at least those memories would let you know that their contents aren't reliable, anymore.

But there is a caveat... ever...

Contents that are never accessed would not generate any warning.

Tricky, tricky, tricky..

Either you have to access them very frequently to ensure they were given enough attention to be fresh at all, or you have to access them very frequently to ensure they'll be given enough attention to be fresh at all.

One can be spanked for not having a dog by his side or, otherwise, be spanked for having an unresponsive (or ferocious) dog by his side.

whicker · 2020-02-29 06:43

Could you cheat a little by dividing the needed manual refresh reads in half?

Read the last word of the row, then continue to have the address counter increment into the first word of the next row.
