Proposed external & HyperRAM memory interface suitable for video drivers with other COGs.

rogloh Posts: 1,872
edited 2019-10-21 - 06:01:03 in Propeller 2
As part of my ongoing video driver design I have found it necessary to design an
external memory request interface intended to allow the video driver to access
external memory for its frame buffer in addition to the Hub memory space. This
allows video resolutions and bit depths to expand beyond what the 512kB of Hub
memory alone can support.

In this case HyperRAM is the intended target but this design is quite generic
and could just as easily be used for other memory types including read only
memory such as flash, or potentially even SD cards in some applications, though
that is not the intention here.

What this offers is an interface specification for transferring data to or from
external memory, and allowing multiple COGs to share the external memory.

The arbiter function of the memory driver is implementation dependent and is
not specified here. However, in video applications priority may be needed
so that video COG(s) get sufficient bandwidth for their real-time needs.

The memory request setup and polling needs to be fast and two 32 bit hub memory
locations are used to indicate a request is being made from a particular COG
to the memory driver COG. This is essentially a mailbox design where each
COG has its own pair of consecutive longs and a memory driver COG repeatedly
polls these longs over its set of COGs being serviced to process their incoming
requests.
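
The mailbox layout above can be modelled on a host machine with a few lines of C. This is only a sketch under assumptions: eight COGs, mailboxes packed back to back, and bit 31 of the first long acting as the "request pending" flag (the Request_Active bit defined further down in this post). The names mailbox_offset and scan_mailboxes are hypothetical and not part of the proposal.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_COGS        8            /* assumption: one mailbox per COG */
#define LONGS_PER_BOX   2            /* two consecutive longs per COG */
#define REQUEST_ACTIVE  0x80000000u  /* bit 31 of the first mailbox long */

/* Byte offset of a COG's mailbox from a (hypothetical) mailbox base address. */
static uint32_t mailbox_offset(unsigned cog)
{
    assert(cog < NUM_COGS);
    return cog * LONGS_PER_BOX * 4u;
}

/* One polling pass: return a bitmask of COGs with a pending request. */
static unsigned scan_mailboxes(const uint32_t boxes[NUM_COGS * LONGS_PER_BOX])
{
    unsigned pending = 0;
    for (unsigned cog = 0; cog < NUM_COGS; cog++) {
        if (boxes[cog * LONGS_PER_BOX] & REQUEST_ACTIVE)
            pending |= 1u << cog;
    }
    return pending;
}
```

How the driver orders or prioritises the pending COGs found by such a pass is the arbiter's business and is deliberately left open here.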

The format is as follows:

First Mailbox Long:
  bit                                                                    bit 
   31     28 27      24 23                                                0  
   ------------------------------------------------------------------------  
  | Request | BankAddr |            External Memory Address                | 
   ------------------------------------------------------------------------  
The External Memory Address is self explanatory. It is the address of the
item being requested to either read or write in external memory.

The BankAddr is not specified in detail however it could be used for dividing
the memory into smaller sections with independent chip selected banks. Or it
could go to different devices on different pins for example, dividing up
the overall external memory space by device.

It can also be considered as an extension to the External Memory Address field
to allow memory devices larger than 16MB to be used. Alternatively, when
memories smaller than 16MB are used, portions of the 24 bit External Memory
Address could be used to widen the BankAddr field. It is flexible but
implementation specific.

The upper 4 bit Request field is comprised of the following bits:
         bit 31           bit 30           bit 29           bit 28
    --------------------------------------------------------------------
   | Request_Active |   Write/Error  |        Access_Type[1:0]          |
    --------------------------------------------------------------------

Access_Type is a 2 bit field that indicates the size of the element in the
memory being requested.
  Access_Type   Transfer Size
      00     -     Byte 
      01     -     Word
      10     -     Long
      11     -     Burst (described below)

Write/Error should be set to 1 for performing external memory writes, or
set to 0 for external memory reads.

Request_Active is set by the calling COG to make a memory transfer request.
It will be cleared by the memory driver COG once the request has been completed,
indicating the channel for that COG is idle and a new request can be made.
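
As a rough illustration of the bit layout described so far, the first mailbox long could be packed like this in C. The helper name make_request and the enum names are hypothetical; only the field positions come from the diagrams above.

```c
#include <assert.h>
#include <stdint.h>

/* Access_Type values from the table above. */
enum access_type { ACCESS_BYTE = 0, ACCESS_WORD = 1, ACCESS_LONG = 2, ACCESS_BURST = 3 };

#define REQUEST_ACTIVE  (1u << 31)   /* bit 31 */
#define WRITE_FLAG      (1u << 30)   /* bit 30: 1 = write, 0 = read */

/* Pack Request[31:28], BankAddr[27:24] and External Memory Address[23:0]. */
static uint32_t make_request(int is_write, enum access_type type,
                             unsigned bank, uint32_t addr)
{
    assert(bank <= 0xFu && addr <= 0x00FFFFFFu);
    return REQUEST_ACTIVE
         | (is_write ? WRITE_FLAG : 0u)
         | ((uint32_t)type << 28)
         | ((uint32_t)bank << 24)
         | addr;
}
```

For example, a long read from address $001000 in bank 0 packs to $A0001000, and a byte write to address $123456 in bank 3 packs to $C3123456.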

If the transfer is successful the entire long will be cleared, allowing
the requesting COG to easily poll for a zero value to indicate the request
is complete, when it knows no failures or errors are going to be possible.

In the case of devices supporting errors such as read only devices being
written to, or addresses out of range, or device not ready etc, this design
can also allow the indication of errors. When a memory transfer fails
the Request_Active bit is cleared however the Write/Error bit is set to
indicate an error occurred, and any error code can be left in the second
long for reading by the COG and/or in the remaining portion of the first
mailbox long. Errors are implementation dependent.
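
The three outcomes a requesting COG can observe in the first long can be summarised in a small C helper. This is a sketch only: the names are made up, and treating every non-zero value with bit 31 clear as an error is an assumption that mirrors the "zero is success" rule above, since error encoding is left implementation dependent.

```c
#include <stdint.h>

enum mailbox_state { MB_PENDING, MB_DONE, MB_ERROR };

/* Classify the first mailbox long as read back by the requesting COG. */
static enum mailbox_state classify_mailbox(uint32_t first_long)
{
    if (first_long & 0x80000000u)   /* Request_Active still set */
        return MB_PENDING;
    if (first_long == 0)            /* whole long cleared: success */
        return MB_DONE;
    return MB_ERROR;                /* e.g. Write/Error (bit 30) left set */
}
```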

Second Mailbox Long:
  bit                                                                    bit 
   31                                                                     0  
   ------------------------------------------------------------------------  
  |                          Data Read or Written                          | 
   ------------------------------------------------------------------------  
In the case of individual access to bytes, words and longs, this long will be
used for the transfer of data to/from the external memory.

In the case of a read, it will get populated by the memory driver with the data
after completing the read and before it clears the Request_Active bit in the
first long. The data will be zero extended by the driver if the transfer size
is a byte or a word.

In the case of a write, data will be read from this location then written
into the external memory, once the driver sees the Request_Active field set
and a write being requested. It would then clear the Request_Active flag by
zeroing out the first long.
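
To make the ordering concrete, here is a small host-side C model of a driver servicing one single-element request. It is a sketch under several assumptions that are not in the spec: a 64 byte array stands in for external memory, the BankAddr field is ignored, bytes are stored little-endian, and an out-of-range address is reported by leaving only the Write/Error bit set. The name service_single is hypothetical.

```c
#include <stdint.h>

#define EXT_MEM_BYTES 64u               /* tiny stand-in for external memory */
static uint8_t ext_mem[EXT_MEM_BYTES];

/* box[0] = first mailbox long (request), box[1] = second mailbox long (data). */
static void service_single(uint32_t box[2])
{
    uint32_t req = box[0];
    if (!(req & 0x80000000u))
        return;                         /* Request_Active clear: nothing to do */

    unsigned type = (req >> 28) & 3u;
    if (type == 3u)
        return;                         /* burst requests are not modelled here */

    uint32_t nbytes = 1u << type;       /* byte = 1, word = 2, long = 4 */
    uint32_t addr = req & 0x00FFFFFFu;
    if (addr + nbytes > EXT_MEM_BYTES) {
        box[0] = 0x40000000u;           /* assumed error report: bit 30 only */
        return;
    }

    if (req & 0x40000000u) {            /* write: take data from second long */
        uint32_t data = box[1];
        for (uint32_t i = 0; i < nbytes; i++)
            ext_mem[addr + i] = (uint8_t)(data >> (8u * i));
    } else {                            /* read: zero extended result */
        uint32_t data = 0;
        for (uint32_t i = 0; i < nbytes; i++)
            data |= (uint32_t)ext_mem[addr + i] << (8u * i);
        box[1] = data;                  /* data is stored first ... */
    }
    box[0] = 0;                         /* ... then the request long is cleared */
}
```

The point of the model is the last two stores: the data long is always valid before the requester can see the first long go to zero.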

For the burst transfers the long is interpreted differently, as shown below:
  bit                                                                    bit 
   31         24 23  20 19                                                0  
   ------------------------------------------------------------------------  
  |  Transfers  | Size |                 Hub Memory Address                | 
   ------------------------------------------------------------------------  
The Hub Memory Address is the address in Hub RAM where the burst transfer
will source its data for writes, or have the data written to for reads.

The Size nibble is used to indicate the size of each item being transferred
which is a power of 2. The overall number of bytes read or written will be
scaled by this number.
  Size    Multiplier    Amount
    0          1        1 byte
    1          2        1 word
    2          4        1 long
    3          8        8 bytes
    .          .        .
   14      16384        16kB
   15      32768        32kB

The Transfers byte indicates how many transfers to perform to complete the
request. A zero in this field represents a value of 256. This allows a
maximum of 8MB to be requested at a time (256 x 32kB).

Not every size from 1 byte to 8MB is possible to transfer. But the ranges overlap.
Other sizes not directly supported can still be made using multiple sequential
requests.
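
The burst form of the second long can be packed and unpacked as in the following C sketch. The function names make_burst and burst_bytes are hypothetical; the field positions and the "zero means 256" rule are taken from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Pack Transfers[31:24], Size[23:20] and Hub Memory Address[19:0].
   transfers is 1..256; 256 is stored as 0 in the 8 bit field. */
static uint32_t make_burst(unsigned transfers, unsigned size, uint32_t hub_addr)
{
    assert(transfers >= 1 && transfers <= 256);
    assert(size <= 15 && hub_addr <= 0x000FFFFFu);
    return ((uint32_t)(transfers & 0xFFu) << 24)
         | ((uint32_t)size << 20)
         | hub_addr;
}

/* Total number of bytes a burst long asks for: transfers x 2^Size. */
static uint32_t burst_bytes(uint32_t burst_long)
{
    uint32_t transfers = burst_long >> 24;
    uint32_t size = (burst_long >> 20) & 0xFu;
    if (transfers == 0)
        transfers = 256;
    return transfers << size;
}
```

With these, 80 transfers of Size 3 into hub address $01000 packs to $50301000 and describes 640 bytes, while the largest request (Transfers = 0, Size = 15) describes 8388608 bytes.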

To transfer 640 bytes in the case of a video driver at 8bpp VGA resolution,
you could request 80 transfers and use a multiplier of 8 (Size = 3).
For 1920 pixels in HDTV resolution you could choose 240 x 8, or 120 x 16,
or 30 x 64, etc. Transfer sizes for most common video resolutions that
use an even number of pixels should be supported by this approach.
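
One possible way to find a Transfers/Size pair for a given byte count is to strip factors of two, as in this C sketch. The name split_burst is hypothetical, and the choice of the largest possible Size (fewest transfers) is just one option; the post itself uses 80 x 8 for 640 bytes, and any factorisation that fits the fields is equally valid as far as the spec goes.

```c
#include <stdint.h>

/* Try to express nbytes as transfers x 2^size with transfers in 1..256 and
   size in 0..15. Returns 1 and fills the outputs on success, 0 otherwise.
   A result of 256 transfers would be stored as 0 in the Transfers field. */
static int split_burst(uint32_t nbytes, unsigned *transfers, unsigned *size)
{
    unsigned s = 0;
    if (nbytes == 0)
        return 0;
    while ((nbytes & 1u) == 0 && s < 15) {  /* pull out factors of two */
        nbytes >>= 1;
        s++;
    }
    if (nbytes > 256)
        return 0;   /* not expressible in one request: split it up */
    *transfers = (unsigned)nbytes;
    *size = s;
    return 1;
}
```

For 640 bytes this yields 5 transfers at Size 7 (5 x 128), and for 1920 bytes 15 transfers at Size 7. A count such as 641 is rejected and would need multiple sequential requests.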

The multiplier becomes useful for different bit depths, as you could keep
the same number of transfers for the same resolution, but just change the Size
according to the bit depth. So a video resolution of 640x480x32bpp could
use 80 transfers x 32 (Size = 5) to read 2560 bytes, while the same
resolution at 4bpp would use 80 transfers x 4 (Size = 2) to read 320 bytes.
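
The bit depth trick above works because, with width/8 transfers per scanline, the multiplier is numerically equal to the bits per pixel. A C sketch of that calculation follows; the name scanline_burst is hypothetical, and restricting the width to a multiple of 8 and the depth to a power of two up to 32 are assumptions made to keep the example simple.

```c
/* Pick burst parameters for one scanline: width/8 transfers, each of
   2^size bytes where 2^size == bpp. Returns 1 on success, 0 if the
   combination does not fit this simple scheme. */
static int scanline_burst(unsigned width, unsigned bpp,
                          unsigned *transfers, unsigned *size)
{
    unsigned s = 0;
    if (width == 0 || width % 8 != 0 || width / 8 > 256)
        return 0;
    if (bpp == 0 || bpp > 32 || (bpp & (bpp - 1)) != 0)
        return 0;                   /* only power of two depths handled */
    while ((1u << s) < bpp)
        s++;                        /* size = log2(bpp) */
    *transfers = width / 8;
    *size = s;
    return 1;
}
```

So 640 pixels at 32bpp gives 80 transfers at Size 5, at 8bpp 80 transfers at Size 3, and at 4bpp 80 transfers at Size 2, matching the figures above.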


Example mailbox polling (supporting optional errors):

Once a requesting COG issues its external memory request, that COG can poll
for the result using the following code. If no errors are required the last
instruction can be omitted, as well as the use of the zero flag.
         wrlong request, mailbox      ' issue request

poll     rdlong result, mailbox wcz   ' poll: C = bit 31 (Request_Active), Z = long is zero
  if_c   jmp    #poll                 ' wait for request to be cleared
  if_nz  jmp    #handle_error         ' zero is success, non-zero failure
         '...continue 

Comments

  • Very interesting.

    I would like to throw some cents in. I think your second long should always (not just in burst mode) contain the address of the item read from or written to.

    My reasoning is that your driver writes it into the HUB, and my program then needs to read the value from the HUB again and put it where it belongs (another HUB access) to free the mailbox.
    Same for writing: I need to read my source (HUB), write mailbox2 (HUB), and then your driver reads HUB.

    Current proposal:

    I need to read a long. I write command & external address in Mailbox1: ONE write to HUB. Your driver writes the result: ONE write to HUB. Now I need one HUB read and one HUB write to place it.
    Makes four HUB accesses (excluding waiting for the result / clearing mailboxes, which stays the same in both versions).

    I need to write a long. I read the long from HUB, I write it into mailbox2, I write the command, you read the long from HUB: at least 4, excluding command/response as above.
    Makes four HUB accesses.

    Let's use indirection.

    I need to read a long. I write the HUB address in mailbox2, and command & external address in Mailbox1: TWO writes to HUB. Your driver will put the long where it belongs: ONE write to HUB.
    Makes just three HUB accesses.

    I need to write a long. I write the HUB address in mailbox2, and command & external address in Mailbox1: TWO writes to HUB. Your driver will get the long from where it actually is: ONE read from HUB.
    Makes just three HUB accesses.

    The other thing is the double use of read/write (as input) and error flag. It will work, but basically you need to set bit 31 AND clear bit 30 with each command to clear the error flag on reads. I have no better idea right now, but it is sort of itching me.

    This might be usable for other things too.

    Mike
  • rogloh Posts: 1,872
    edited 2019-10-21 - 02:55:01
    The way I think about it is that data to be written does not always come from HUB, it can be internally sourced from the COG, similarly it doesn't always need to go to HUB - you might be doing some type of VM running executable code and making requests for data from external memory to be used within the COG. So there is no need to do indirection for ALL transfers. If you want to use indirection for a single transfer you still can do this, by setting the transfers to 1 and size to 0/1/2 in the second long. This will result in one byte, word or long being transferred indirectly as you requested.

    Also by using setq you could read in 2 longs in a burst rather than reading once, then reading again for the result. If the transfer is complete after you test the request_active bit and see it clear, you can just use the data from the second long already read in. It should be reasonably fast.

    The error flag is optional. I think it would only ever be needed for memory types that can fail, such as writes to flash. In pure RAM cases, it won't error. Nibble accesses are fast in the P2, you can set all 4 top bits at the same time for a fresh read or write request, no need to set/clear individual bits like the P1.

    Also from the memory driver perspective, it can request a burst read of all eight mailboxes in a row. It would get this in 16 clocks, and have any single write data already present. No need to go read the hub a second time for the write data unless indirection/burst is indicated. This improves latency.
  • Aah, yes.

    Basically burst transfer does what I asked for; I did not notice it can even read 1 byte as a "burst".

    Mike
  • Rather than have the requesting cog poll HUBRAM in a loop, couldn't the arbiter cog use COGATN to signal completion? Then the requesting cog could have the option to focus on something useful while waiting, without burning a hub access per loop.
  • FWIW, in my SD Driver I have used 4 hub longs IIRC (at work atm) as follows...
    command    byte  0  ' command to the driver (non-zero) set by caller, cleared by driver
                        ' the caller must have cleared the status before setting this
    status     byte  0  ' result of the command (non-zero is error) set by driver, cleared by caller
                        ' the driver clears command before setting this, $0 is good 
    cog        byte  0  ' $80+cogid (set by the driver)
    lock       byte  0  ' $80+lockid (set by the driver) - a lock is taken when driver is invoked
    sector     long  0  ' SD sector required (set by caller)
    ptrbuffer  long  0  ' pointer to the location where the sector data is stored
    spare?     long  0  ' SD pins used: CSpin<<24 + CKpin<<16 + DIpin<<8 + DOpin, set by caller
                        ' once the driver takes these, the slot is usable for whatever
    
    I have used similar in the P1 for serial. Here I used $1xx to indicate a character and $100 as empty. By using longs, it's also possible to use them as pointers to hub buffers.

    I found 2 longs was always too restrictive, and the way around that is to use the 2nd long as a pointer to another buffer. That's what Catalina for P1 did and I always thought that having a pointer to a pointer was IMHO silly tho P1 was tight for memory.
  • Rather than have the requesting cog poll HUBRAM in a loop, couldn't the arbiter cog use COGATN to signal completion? Then the requesting cog could have the option to focus on something useful while waiting, without burning a hub access per loop.

    It could potentially be done that way. In the case of my own video client COG, it would probably not even be checking the result, and will assume the data is ready 25us later simply because of its real-time nature. There is nothing it can do if the data is not ready, apart from maybe drawing a blank line instead of replicating what is left in the buffer from last time if it knows the transfer is not complete.

    There needs to be a way to indicate such a notification preference. Perhaps another memory area in HUB RAM can be defined for the driver COG to use that nominates each COG's preference for how it is to be notified on completion of its requests. If the choice is instead made on a per request basis, I'd prefer not to have to fill out an additional long each time, unless we steal a bit from the BankAddr perhaps...
  • rogloh Posts: 1,872
    edited 2019-10-21 - 04:15:51
    Cluso99 wrote: »
    I found 2 longs was always too restrictive, and the way around that is to use the 2nd long as a pointer to another buffer. That's what Catalina for P1 did and I always thought that having a pointer to a pointer was IMHO silly tho P1 was tight for memory.

    For memory use specifically I am hoping to keep this particular mailbox specification down to two longs just to help minimize the setup required and reduce the driver polling latency, as it has to be particularly simple for video and with few instructions. I could see for other devices needing additional control, or for locks etc, more longs may be required, though they can always use different drivers dedicated for that purpose.
  • rogloh Posts: 1,872
    edited 2019-10-23 - 07:36:08
    I think it makes sense in this spec to add an inband control path to be able to configure the memory driver according to each COG's requirements. I am proposing to extend the use of the BankAddr bits to allow for this. So the idea is that if the BankAddr field is set to 0xF = 1111 this indicates a control read/write instead of a normal external memory read or write. Both the remaining 24 bit part of the mailbox long and the 32 bit second long can be used to pass parameters for the request, though it could be useful to partition some of the first mailbox long to identify which driver "register" or setting is being accessed. Perhaps into 8 bits + 16 bits, leaving the full 32 bits in the other mailbox register for wider arguments or additional control parameters by using a hub or external memory pointer.

    I am thinking this allows for the expansion of future scheduling options and selection of features like requesting being notified by COGATN from the driver upon completion or just via the regular mailbox polling. It does mean that the first time a COG starts to use the external memory driver, it would probably also need to have a way to reset back to defaults, in case some prior code running earlier on this same COG ID had configured the arbiter with its own settings. I am thinking this could simply be a write of 0xFFFFFFFF to the mailbox 1 register, as that value would not need to be used for other things. You would lose one of the 16 banks (the top 16MB of the overall 256MB address space) to this control path feature, but it would open up future capabilities once the arbiters become more complex.

    As an example we may wish to put in a way for a video COG to indicate to the driver that it is in vertical blanking for the next number of clock cycles and it would be free to grant larger requests to other non-video COGs, or do more refresh cycles etc. Or for it to advertise how many clocks until the next scanline request, so the arbiter could do other things. It might allow a way to request a regular update of data to be sent to hub without requesting by the COG as well, or setup a block copy internally between memory banks, or zero fill etc. Lots of possibilities...

    I think I will add this. Does anyone else have any thoughts on this?
  • It seems like a reasonable tradeoff to me. If someone needs more than 240MB then one of the control words could be to flip a bank switch.