I think the narrowing low down in frequency is because they fail right at my breakpoints, which were designed for the RAM ranges. If I tune these breakpoints differently they may work better and be wider again. Not sure.
Now why it continues to work at higher frequencies I'm not sure. It might be my test itself. I only check 16 bytes and these are the same simple counting pattern at the same address. I need to generate a better flash test pattern to read to be sure it is working right.
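As a rough illustration only (the mixing formula and helper name here are made up, not the test code actually in use), an address-dependent pattern in Spin2 could look like this, so each byte depends on where it lives in flash rather than repeating the same 16 byte count:

PRI patternByte(flashAddr) : b
  ' mix the address so neighbouring pages and shifted windows no longer look identical
  b := (((flashAddr * 31) + (flashAddr >> 7)) ^ $A5) & $FF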
I'm only comparing to prior hyperRAM behaviour. The same narrowing at higher frequencies happened there. What didn't happen was the extra wide highest band.
EDIT: I'm questioning the validity of the "287-360 MHz ok" result. My interest in testing HyperFlash myself is near zero. I see it as a burden on the HR performance.
EDIT2: Bah! Those forum icons are too close in looks. I mistook both posts for Roger's.
Did you ever try relying on Hybrid Bursts to ease the task of dealing with those page boundary crossings (and the lack of RWDS, and hence of valid data, that they introduce), so as to avoid leaving gaps in Hub RAM?
No, as of right now I was just using linear bursts. Hmm, does hybrid burst mode fully solve it? I guess you need to discard those extra bytes read at the start before it reaches your desired bytes and then starts linear bursting again. That may solve it, and I would just need to increase the delay before reading from the streamer. I'll look into it; that may fix it at the expense of some additional latency on flash reads. Great tip, thanks Yanomami.
EDIT: Actually no, I think this hybrid thing works a little differently. You get your desired bytes at the end of the page, then a gap while the remaining bytes at the start of the page arrive, then the next page starts linearly after that. So it still requires two streamer commands spaced apart to work, I think, not just an initial delay at the start which is what I was hoping for. I think I might still have to break my bursts apart into two transactions to fit the model...
I've been working on this flash linear burst page size alignment issue today. I think I've managed to squeeze it in, but I am still testing the idea. It is both helped and complicated by the fact that burst reads can get fragmented.
The way it works is that the first burst read from flash has to compute the bytes remaining in the first 16 byte page being read. If the full 16 bytes are being read (i.e. the address's least significant nibble is 0) then reads continue as normal; otherwise the first burst fragment size is set to the bytes left in that page, that fragment proceeds to completion, and then any remaining read portions get fragmented as normal. The SPIN2 driver code can ensure that the burst size for flash banks is aligned to a multiple of 16 bytes, and any per-COG limit should take this into account too. That way the remainder of any read burst will always resume on another page boundary and the gap problem should go away. For the highest performance the flash can be accessed on page aligned boundaries to avoid the slight extra overhead, but this scheme will allow it to read any number of bytes at any address and not leave gaps in hub RAM. I think this is important to achieve in the driver.
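As a sketch only of the fragmentation rule just described (hypothetical helper name in Spin2, not the driver's actual PASM):

CON
  FLASH_PAGE = 16                                  ' HyperFlash linear burst page size in bytes

PRI firstFragment(addr, count) : frag
  ' bytes remaining in the first 16 byte page; a page-aligned start gives the full 16
  frag := FLASH_PAGE - (addr & (FLASH_PAGE - 1))
  frag := frag <# count                            ' never larger than the whole request

After this first fragment completes, the rest of the burst starts on a page boundary, so the normal fragmenting can carve it up without leaving gaps, as long as the fragment sizes stay multiples of 16.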
The new code for this takes up about 7 longs and has now totally filled up my LUT RAM. I've had to start to shuffle things around to make room. I think I only have about 9 spare longs left in COG RAM now, and that's it. I might need to hunt for a few more longs soon, especially if I find any errors in the LUT RAM code.
I've not been able to include single reads of flash longs and words that cross page boundaries in this scheme, only the bursts. So ideally that should not be done when reading from flash, or if they are required then set up a read burst of 2 or 4 bytes instead. I could possibly put this extra page crossing check in the SPIN driver, but it would of course also slightly impact any single RAM reads during the validation phase. TBD.
Writes work a little differently, as only single words or bursts are writeable. The bursts have to remain within 256 words (512 bytes) and not cross a 512 byte buffer boundary. Written words should already be word aligned and I return an error if they fall on odd addresses.
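A minimal sketch of that write check (hypothetical helper name, not the driver's actual validation code):

CON
  WRITE_BUF = 512                                      ' HyperFlash write buffer size in bytes

PRI flashWriteOk(addr, count) : ok
  if (addr & 1) or (count & 1) or (count == 0)         ' must start word aligned and be whole words
    return false
  ok := (addr / WRITE_BUF) == ((addr + count - 1) / WRITE_BUF)  ' must stay inside one 512 byte buffer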
Request lists will be able to read from flash, but writing to it within request lists using special commands like fills and graphics writes will be a problem and is prevented in the driver. Flash sector writing is a special case anyway and requires extra setup commands to enable it, so normal word and burst writes can and should still be used for that purpose.
I added in the following final features and now I'm 100% full for both LUT and COG RAM. It's chock-a-block!
1) List cancellation. A client COG can instruct the driver to cleanly stop at the end of the list item currently being processed, which may be useful for clean shutdowns. The client clears the top bit of the list request in the first mailbox long; the driver COG polls this at the end of each list item before advancing and, once it detects the bit is cleared, writes zero to that address and stops. The same mailbox long can also be used to monitor the progress of the request list, as it gets updated with the address of the list item currently being processed by the driver.
2) Prevention of writes to flash in lists when running extended requests such as graphics fills and copies. The single word writes and single flash write bursts can still be put in a request list however. Also HyperFlash can still be read from within request lists for graphics operations such as image copies, copies into HyperRAM, wavetable data etc.
3) The flash burst read fix for crossing 16 byte page boundaries in HyperFlash has been added. There should now be no gaps for any read address / length as long as the configured burst sizes remain multiples of this page size.
4) Automatic long/word memory address alignment (P1 style addressing) for any atomic 16 or 32 bit HyperFlash word and long reads. This prevents crossing page boundaries too, which is good (a small sketch of this alignment follows below this list). HyperRAM can still be accessed at any byte address for reads/writes of bytes/words/longs (P2 style addressing), making it more versatile. Flash word writes to unaligned word addresses, or writes of an odd number of bytes, will also be detected and return an unaligned error, because the HyperFlash needs 16 bit writes (or multiples of that) sent each time. Individual byte or long write requests to HyperFlash are not supported and those commands will fail.
Now I just need to validate this and it's then feature complete. I can't fit anything else in! Any bugs introduced here that require further instructions to remedy are going to really challenge me.
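Regarding point 4, the alignment itself is just a mask of the low address bits - here is a tiny Spin2 sketch (hypothetical helper name; the driver does this in PASM):

PRI alignFlashAddr(addr, elementSize) : aligned
  ' elementSize is 2 for word reads, 4 for long reads (P1 style addressing)
  aligned := addr & !(elementSize - 1)
  ' a 2 or 4 byte aligned element can never straddle a 16 byte flash page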
@evanh, I retested HyperFlash again and printed out the delay values that work at each frequency (tested from 3-9). If the delay LSB = 1 it means unregistered data bus pins, and 0 means registered data bus pins. The clock pin was unregistered. Results are attached, showing the delays that read back the expected pattern with sysclk/1 reads from HyperFlash; delays that failed are omitted from the set. There seems to be reasonable overlap.
I've edited out some repeating values to keep the post size down. The middle frequency of each cut region (...) where the delays overlap would make a sensible point to change from one delay value to the next. That works out to around 58MHz, 107MHz, 155MHz, 215MHz, 265MHz and 310MHz, which seems reasonably balanced.
Flash good at 25 MHz - good delays are: 3 4
Flash good at 26 MHz - good delays are: 3 4
Flash good at 27 MHz - good delays are: 3 4
Flash good at 28 MHz - good delays are: 3 4
Flash good at 29 MHz - good delays are: 3 4
...
Flash good at 88 MHz - good delays are: 3 4
Flash good at 89 MHz - good delays are: 3 4
Flash good at 90 MHz - good delays are: 3 4
Flash good at 91 MHz - good delays are: 3 4
Flash good at 92 MHz - good delays are: 3 4
Flash good at 93 MHz - good delays are: 4
Flash good at 94 MHz - good delays are: 4
Flash good at 95 MHz - good delays are: 4 5
Flash good at 96 MHz - good delays are: 4 5
Flash good at 97 MHz - good delays are: 4 5
Flash good at 98 MHz - good delays are: 4 5
Flash good at 99 MHz - good delays are: 4 5
...
Flash good at 115 MHz - good delays are: 4 5
Flash good at 116 MHz - good delays are: 4 5
Flash good at 117 MHz - good delays are: 4 5
Flash good at 118 MHz - good delays are: 4 5
Flash good at 119 MHz - good delays are: 5
Flash good at 120 MHz - good delays are: 5
Flash good at 121 MHz - good delays are: 5
Flash good at 122 MHz - good delays are: 5
Flash good at 123 MHz - good delays are: 5
Flash good at 124 MHz - good delays are: 5
Flash good at 125 MHz - good delays are: 5 6
Flash good at 126 MHz - good delays are: 5 6
Flash good at 127 MHz - good delays are: 5 6
Flash good at 128 MHz - good delays are: 5 6
...
Flash good at 183 MHz - good delays are: 5 6
Flash good at 184 MHz - good delays are: 5 6
Flash good at 185 MHz - good delays are: 5 6
Flash good at 186 MHz - good delays are: 5 6
Flash good at 187 MHz - good delays are: 6
Flash good at 188 MHz - good delays are: 6
Flash good at 189 MHz - good delays are: 6
Flash good at 190 MHz - good delays are: 6
Flash good at 191 MHz - good delays are: 6
Flash good at 192 MHz - good delays are: 6 7
Flash good at 193 MHz - good delays are: 6 7
Flash good at 194 MHz - good delays are: 6 7
Flash good at 195 MHz - good delays are: 6 7
...
Flash good at 233 MHz - good delays are: 6 7
Flash good at 234 MHz - good delays are: 6 7
Flash good at 235 MHz - good delays are: 6 7
Flash good at 236 MHz - good delays are: 6 7
Flash good at 237 MHz - good delays are: 7
Flash good at 238 MHz - good delays are: 7
Flash good at 239 MHz - good delays are: 7
Flash good at 240 MHz - good delays are: 7
Flash good at 241 MHz - good delays are: 7
Flash good at 242 MHz - good delays are: 7
Flash good at 243 MHz - good delays are: 7
Flash good at 244 MHz - good delays are: 7
Flash good at 245 MHz - good delays are: 7
Flash good at 246 MHz - good delays are: 7
Flash good at 247 MHz - good delays are: 7
Flash good at 248 MHz - good delays are: 7
Flash good at 249 MHz - good delays are: 7
Flash good at 250 MHz - good delays are: 7 8
Flash good at 251 MHz - good delays are: 7 8
Flash good at 252 MHz - good delays are: 7 8
Flash good at 253 MHz - good delays are: 7 8
...
Flash good at 276 MHz - good delays are: 7 8
Flash good at 277 MHz - good delays are: 7 8
Flash good at 278 MHz - good delays are: 7 8
Flash good at 279 MHz - good delays are: 7 8
Flash good at 280 MHz - good delays are: 8
Flash good at 281 MHz - good delays are: 8
Flash good at 282 MHz - good delays are: 8
Flash good at 283 MHz - good delays are: 8
Flash good at 284 MHz - good delays are: 8
Flash good at 285 MHz - good delays are: 8
Flash good at 286 MHz - good delays are: 8 9
Flash good at 287 MHz - good delays are: 8 9
Flash good at 288 MHz - good delays are: 8 9
Flash good at 289 MHz - good delays are: 8 9
...
Flash good at 331 MHz - good delays are: 8 9
Flash good at 332 MHz - good delays are: 8 9
Flash good at 333 MHz - good delays are: 8 9
Flash good at 334 MHz - good delays are: 8 9
Flash good at 335 MHz - good delays are: 9
Flash good at 336 MHz - good delays are: 9
Flash good at 337 MHz - good delays are: 9
Flash good at 338 MHz - good delays are: 9
...
Flash good at 355 MHz - good delays are: 9
Flash good at 356 MHz - good delays are: 9
Flash good at 357 MHz - good delays are: 9
Flash good at 358 MHz - good delays are: 9
Flash good at 359 MHz - good delays are: 9
Flash good at 360 MHz - good delays are: 9
Here it is graphically for both RAM and Flash, showing the overlapping ranges that worked from 25MHz to 360MHz for sysclk/1 reads. Interesting that the RAM is different and is the one that has narrower bands. I noticed in the data sheets that the default drive strength impedance is 27 ohms for the HyperFlash vs 34 ohms for HyperRAM. Maybe that makes a slight difference...?
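To put the crossover idea into code, a sketch like this (illustrative only, using the mid-overlap switch points quoted above, which will move with board and temperature) would pick the chart delay value from the P2 clock frequency:

PRI flashDelayForFreq(freqMHz) : d
  ' switch points taken from the middle of the overlap regions in the tables above
  if freqMHz < 58
    d := 3
  elseif freqMHz < 107
    d := 4
  elseif freqMHz < 155
    d := 5
  elseif freqMHz < 215
    d := 6
  elseif freqMHz < 265
    d := 7
  elseif freqMHz < 310
    d := 8
  else
    d := 9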
... I noticed in the data sheets that the default drive strength impedance is 27 ohms for the HyperFlash vs 34 ohms for HyperRAM. Maybe that makes a slight difference...?
Damn good point, I sure hope so. The difference sure is dramatic imho.
Well I just now set the HyperRAM CR0 regs to 27 ohms impedance, and the profile looks the same as it was for sysclk/1 though - it still doesn't match the HyperFlash which is a pity.
I retested HyperFlash again and printed out the delay values that work at each frequency (tested from 3-9).
Nice tables, I wonder how those move with temperature and if there is a single value that can be applied over a practical temperature range, or if this needs temp sense and live adjust (which would be more of a pain).
I've seen newer RAM parts specify ROM pattern areas, which can at least assist with bus tuning - maybe these issues are more widespread?
At room temp I think it is somewhat stable, but the temperature extremes do show variation, and evanh did some earlier work on that. A ROM pattern would be nice. In theory the RAM could be scanned through at init time if someone wanted to try different delay values; the problem is you'll get one or two working read delay values, and when it is two you don't know which one is best unless you actually scan over the frequency range at that temperature.
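As an illustration of that init-time scan (all the driver calls here are hypothetical names, and it only tells you which delays work, not which of two is best):

PUB scanReadDelays(testAddr) : goodMask | d, i, ok, pattern[4], readback[4]
  repeat i from 0 to 15
    byte[@pattern][i] := i                    ' known pattern; writes are not delay sensitive
  hyperWriteBurst(testAddr, @pattern, 16)     ' hypothetical write call
  repeat d from 3 to 9                        ' candidate delay values from the tables
    hyperSetReadDelay(d)                      ' hypothetical delay setter
    hyperReadBurst(testAddr, @readback, 16)   ' hypothetical read call
    ok := true
    repeat i from 0 to 3
      if pattern[i] <> readback[i]
        ok := false
    if ok
      goodMask |= 1 << d                      ' bit set for each delay that read back correctly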
It's probably easiest to assume linear variation over temperature and change your delay accordingly. If you know the current chip/board temperature and it changes slowly, you could have a COG adapting the driver delay. For sysclk/2 it's probably less of an issue given how wide the bands are there, but it could still have an impact at some point.
Writes don't show this timing problem at sysclk/2 thankfully as the clock is centered in the middle of the bit.
@"Dave Hein" No it needs to be a single value. This is how the delay parameter is used below in the read code. The "delay" register here is actually the number in the charts above divided by two, because the LSB is used elsewhere to gain a half step of delay by selecting between registered vs live I/O input (regdatabus) which introduces a small amount of extra delay and is ideal to be able to transition between bands. If we didn't have that there would be some frequencies that become unusable with HyperRAM (at sysclk/1 input rates).
Basically, if you set the delay too high and wait too long to start the streamer, you miss the first byte(s) coming back from the HyperRAM. If you set the delay too low then you don't wait long enough, and the streamer will clock in $FF from the undriven data bus before the HyperRAM has a chance to respond to the clock you are sending it. So there is a sweet spot. Unfortunately, as well as varying with the P2 clock rate, it also varies with temperature as @evanh found.
wxpin clockdiv, clkpin 'adjust transition delay to # clocks
setxfrq xfreqr 'setup streamer frequency
wypin clks, clkpin 'setup number of transfer clocks
wrpin regdatabus, datapins 'setup data bus inputs as registered or not
waitx delay 'tuning delay for input data reading
xinit xrecv, #0 'start data transfer and then jump to setup code
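In other words, turning a chart value from the tables above into the two settings used by that code could be sketched like this (hypothetical helper, not the driver's actual code):

PRI delaySettings(chartValue) : delay, registered
  delay      := chartValue >> 1         ' whole sysclk cycles for the WAITX
  registered := (chartValue & 1) == 0   ' LSB 0 = registered data bus inputs, 1 = unregistered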
It also may vary from chip to chip, or from batch to batch, or maybe varies depending on which pin is used. Tweaked code based on the timing of a few chips is a little scary.
Dave,
It is hairy, but it's not that bad. Chip fabrication is very consistent. The biggest variability is temperature ... and board layout. A different board will give a different outcome unless they all have a spec to conform to. That's something I'm keen to have, except I don't have the knowledge or experience myself.
JMG or Mark T might have the knowledge and experience. Von is working on the revC Eval Board with this sort of thing in mind but I'm not sure what can be achieved with general expansion headers compared to a dedicated hyperRAM right next to the prop2.
... Or are you saying that 7 and 8 are the smallest delays that work, and in that case wouldn't it just be 7?
The potential number of choices for the compensation is dependent on the ratio to sysclock. If it's sysclock/1 then there can only be one value that works at any given frequency, if that. For sysclock/2 there is potential for two workable compensations at any given frequency. And on it goes, with /3 having three compensations, /4 having four compensations ...
Here's a sysclock/4 example of read data. You can see the first shift of reliable operation begins just above 80 MHz. For a short band it only has three working compensations. If that was sysclock/1 there wouldn't be any working compensation value for a short band.
I have managed to scrounge a few more longs in the code by sharing registers in different places etc., and I think with some effort I might free up just enough to get Read-Modify-Write supported as one final feature of this driver.
The only issue is I need to change the single element (byte/word/long) read from the simple single mailbox write into 2 mailbox writes.
The single element read request format is currently this in HUB RAM:
mailbox + 0 : read request (byte/word/long) | external address
mailbox + 4 : don't care
mailbox + 8 : don't care
To support a read/modify/write request I would need to change it to this:
mailbox + 0 : read request (byte/word/long) | external address
mailbox + 4 : new data value
mailbox + 8 : mask
The completion of the read code path would be altered to examine the mailbox+8 long (mask) to see if it was zero or not. If it is zero, it would complete the read as normal and the data would be returned in mailbox+4 in HUB. If the mask was non-zero, it would be applied to the just-read value, and the relevant bits in the new data value would be updated according to the mask bits (either with SETQ/MUXQ or AND/OR etc) and written back to the address just read, using the same element size.
Importantly the original read value would still be returned in mailbox+4. This allows a read-update to be supported for semaphores etc. E.g. you try to set a bit and see if it was already a 1 or a 0 before you set it, indicating whether it is already in use. I would always run this read-update cycle as a back to back operation on the bus so no other COGs could affect the change. This feature would also be very handy for graphics updates of pixel data that is smaller than a byte, and it avoids multiple mailbox requests to do this and any associated polling delay between them.
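To make the masked update and the semaphore idea concrete, here is a Spin2 sketch (readModifyWriteWord is a hypothetical API name, and in the driver the merge would be done in PASM with SETQ/MUXQ rather than like this):

PRI maskedMerge(oldValue, newData, mask) : result
  ' keep the old bits where mask is 0, take the new bits where mask is 1
  result := (oldValue & !mask) | (newData & mask)

PUB tryLock(addr) : gotIt | old
  ' set bit 0 of a word used as a lock; the returned pre-update value tells us
  ' whether the lock was already held before our masked write set it
  old := readModifyWriteWord(addr, $0001, $0001)    ' hypothetical driver call
  gotIt := (old & 1) == 0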
The only downside to this approach is that normal reads which used to be an easy matter of writing a single long to the first mailbox to trigger them would now need to ensure that the mask mailbox entry is also cleared to zero in case it has been changed by any other request such as a write since the last read was done. So it typically adds an extra write by the client. This is mainly of concern to PASM clients not so much SPIN2 clients as I will have the SPIN2 API do it for you. Eg. just one new line gets added to the code below. The extra overhead time is probably not that big a deal given the performance of single reads is already limited by much larger overhead and you'll typically want to use burst reads anyway, but it is still annoying and I am still deciding whether this change is worth it. Until I code it up and make sure it fits in the freed space I guess it is moot. It will be really tight. But it could be good.
Any thoughts?
PUB readWord(addr) : r | m
  if MAX_INSTANCES == 1                   ' optimization for single instance, everything mapped to single bus
    m := mailboxAddrCog[cogid()]          ' get mailbox base address for this COG
    if m == 0                             ' prevent hang if driver is not running
      return -1
  else                                    ' multiple buses, need to lookup address to find mailbox for bus
    m := addrMap[addr>>24]
    if m +> MAX_INSTANCES-1               ' if address not mapped, exit
      return -1
    m := mailboxAddr[m] + cogid()*12
  long[m][2] := 0                         '<------------------ NEW LINE NEEDED TO AVOID READ-MODIFY-WRITE
  long[m] := R_READWORD + (addr & $fffffff)
  repeat until long[m] >= 0
  return long[m][1]
Given that each cog has its own mailbox, PASM code that never sets that value can just ignore it, or clear it only when it knows that it needs clearing
Yes, that third mailbox long would remain at zero after single reads, so if you did multiple single reads in a row you could avoid the clearing each time, after setting it up just once at the start. The PASM will know what it is doing, so it can make the decision on what to do as needed. That can be helpful. I just liked that single long mailbox write to trigger a read, it was so simple.
(Also, speaking of overhead, I think there'd be tremendous value in a cut-down, low-overhead driver. Only one mailbox, one RAM bank, etc., so you can do fast-ish small accesses (like one would need for XMM, emulators, etc.).)
Yeah, there are plenty of features that could be removed or hard coded to speed up the whole thing. I expect it can/should be done after the main code is complete, as it is simpler to remove things than to hand craft them in once you know what it needs to do, and I've mentioned this in the past too. But the features I've included in the full version should be quite useful, especially for the combined HyperFlash + HyperRAM case with the P2 EVAL module, as well as for GUI and external memory graphics. We trade off a bit of performance for this versatility. For medium sized transfer bursts it won't make a huge difference, but for individual random access use we could certainly speed it up with a cut down variant.
I'm also thinking it would be cool to have some type of XMM model like we used to have on the P1, but somehow using the HyperRAM and/or HyperFlash with caching. But I don't know how it could work yet, or whether it could make use of Hub exec or not. Without caching the performance won't be good, but with caching enabled it might end up working out okay for running very large programs. Something for the future...
First, because it can be used to do masked writes down to the bit level - that is a very useful addition to byte/word/long. And second the semaphore thing. Not sure where I would need it, but it seems to be quite useful too.
The usual P1 XMM model (as implemented in GCC, and probably similar in Catalina, but IDK, ask @RossH) is mostly transparent to the running code - it just has to use special function calls for jumps and for any memory access that may be external. This is relatively easy to hack onto an existing compiler, but is very very slow, because every instruction fetch goes through a bounds check to determine if it crosses into the next cache line and every jump needs to be a function call. This approach would be even slower on P2, because hubexec could not be used (OR COULD IT? The hardware breakpoint could make it work!).
I myself have also used the XMM moniker for a model where the code is fully aware that it and any external data it might want to use is being sliced into 512 byte pages and moved to hub for processing. This is fast because I can arrange the code to minimize external memory access and after being moved to Hub, code runs at usual LMM speed, because the assembler forces the last instruction in any page to always be a "jmp #vm_nextpage" or an unconditional jump/return. This model is much harder to support in a compiler though.
A single mailbox, single bank cut down driver without any special features like lists, fills, multi-bank copies, graphics transfers, register access, burst/latency control etc could be sped up quite a bit. The polling loop could also come down to within 16 clocks once aligned with the egg-beater, and you can still get all 3 longs read in the one poll loop, saving any additional reads later. It fits nicely:
rep #3, #0
setq #2 ' 2 clocks
rdlong request, mboxaddr ' 11 clocks for reading 3 longs once aligned to hub
tjs request, #service_request ' 2 clocks
Looking at the total code saved in the read path I'd roughly estimate a doubling of the request rate could just about be had for the cut down driver for single element transfers. So let's say ~2M/s instead of ~1M/s at about 250MHz or so.
As it is right now the single element read code path is about 84 instructions long plus the mailbox polling loop which varies with the number of active COGs but is 40 cycles at best for a single COG for this comparison. As well as the HyperRAM transfer itself this code path currently reads and sets up the different bank control pins, reads and applies per bank burst settings from LUT, extracts per COG mailbox settings and other state, applies per bank latencies and read delays, applies per COG burst settings, tests for list requests, and sets up round robin fairness for the next poll. A minimalist implementation would remove all of this and it would then be in the vicinity of 44 instructions plus its tighter shorter polling loop.
For burst transfers this gain will be reduced as the size increases because some extra work gets done during the actual transfer time itself, but there will still be some gains there too.
I do think a cut down driver for tightly coupled applications could be useful to include as well as the fully featured one. It just would not be as useful for graphics or for multiple COGs sharing the common memory. You could also only use either the HyperFlash or the HyperRAM in your application, not both, unless perhaps two driver COGs were spawned and they were never active at the same time, carefully controlled by the application COG using it.
I myself have also used the XMM moniker for a model where the code is fully aware that it and any external data it might want to use is being sliced into 512 byte pages and moved to hub for processing. This is fast because I can arrange the code to minimize external memory access and after being moved to Hub, code runs at usual LMM speed, because the assembler forces the last instruction in any page to always be a "jmp #vm_nextpage" or an unconditional jump/return. This model is much harder to support in a compiler though.
Someone here in the forum got it running on a P1 (I forget the name); you somehow needed to add some compile time switches and tweak the linker script?
Mike