Memory drivers for P2 - PSRAM/SRAM/HyperRAM (was HyperRAM driver for P2)

RossH · 2020-06-06 04:44

Wuerfel_21 wrote: »

The usual P1 XMM model (as implemented in GCC and probably similar in Catalina, but IDK ask @RossH ) ...

No point in asking me - I have no idea how XMM is implemented in GCC. I know Steve Denson did some of the original work on the cache which (I believe) both GCC and Catalina use to support XMM where the RAM itself is too slow to access directly (e.g. SRAM) - so you are probably right that they are basically similar.

I am still trying to come to grips with whether it is worth implementing code execution from XMM on the P2. I keep wavering between doing a trivial "P1" type port - which would be very easy but suffer from the same problems as XMM code execution has on the P1 (i.e. that it is quite slow) - or doing something more sophisticated.

The problem is that I really have no idea yet whether XMM will be widely used on the P2. Even on the P1 (which absolutely needed it to execute large programs) it didn't see very much use outside us hardcore fanatics.

I will probably continue to waver on this for a bit yet. There are just too many other interesting things to do!

rogloh · 2020-06-06 05:49

If some form of XMM was doable with HUB-exec and caching it could actually be quite a good combination. If the generated code knows about all its jumps/calls crossing some "page" boundary (I know it is not true paging) it could possibly return to some page loader code that either jumps over to another cached page elsewhere in HUB or brings the next page in from HyperRAM, and a whole bunch of these pages could be retained in the 512kB of HUB. With the largest burst transfers it might only take a few microseconds to bring in around 512 bytes to 1kB or so (128-256 instructions). For code that doesn't branch everywhere and mostly stays within its working set, performance could still be pretty decent. Smaller sized page transfers could also be used to speed up the loading rate at the expense of more inter-page branching. It could be tuned to see what page sizes work better.
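To make the page idea above a bit more concrete, here is a minimal Python model of such a cache. It is only a sketch under assumptions: the `PageCache` class, the 1kB page size, the four page slots and the least-recently-used eviction are all made up for illustration and are not taken from any existing P2 tool or driver.

```python
from collections import OrderedDict


class PageCache:
    """Toy model of a HUB-resident page cache for code held in external memory."""

    def __init__(self, external, page_size=1024, slots=4):
        self.external = external      # bytes object standing in for HyperRAM/HyperFlash
        self.page_size = page_size    # assumed page size in bytes
        self.slots = slots            # assumed number of page slots reserved in HUB
        self.pages = OrderedDict()    # page number -> page contents, oldest first
        self.loads = 0                # number of burst reads that were needed

    def fetch(self, address):
        """Return (page_data, offset) for an external address, loading on a miss."""
        page_no, offset = divmod(address, self.page_size)
        if page_no in self.pages:
            self.pages.move_to_end(page_no)       # hit: mark as most recently used
        else:
            if len(self.pages) >= self.slots:
                self.pages.popitem(last=False)    # evict the least recently used page
            start = page_no * self.page_size
            self.pages[page_no] = bytes(self.external[start:start + self.page_size])
            self.loads += 1                       # one burst transfer from external memory
        return self.pages[page_no], offset


external = bytes(range(256)) * 64                 # 16kB of stand-in "program" data
cache = PageCache(external, page_size=1024, slots=4)

# A loop that bounces between two pages only pays for the first load of each.
for _ in range(100):
    cache.fetch(0x0010)
    cache.fetch(0x0410)
print(cache.loads)
```

The `loads` counter is the thing to watch when trying different page sizes: code that stays within its working set keeps it low, while code that branches everywhere drives it up.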

Either HyperRAM or HyperFlash could be used for the program memory with the same amount of performance. Programs could grow as large as 32MB with the Flash on the P2-EVAL breakout. That's massive. I think this is worth some consideration once people get familiar with how the HyperRAM/Flash could be used here.

With video applications you could still share a video frame buffer in external memory with program memory and give the video driver priority over the caching VM loader. Performance can take a hit but it still could work. There's a lot of memory bandwidth to go around.

Cluso99 · 2020-06-06 18:52

I wrote a P1 Fast Overlay loader back in 2008? Heater used it in ZiCog. It loads from hub in reverse ie last address first so that it hits every hub slot.
We used overlays back in the 70’s on the minis and I suspect even earlier on the mainframes. It was absolutely necessary in a 5KB core memory model (cog) with a shared memory (i.e. hub) of 10KB.

rogloh · 2020-06-07 07:07

So I was able to find enough COGRAM and LUTRAM to squeeze in the read-modify-write stuff I was talking about recently. I think it should work out now. Unfortunately I had to shuffle around my EXECF sequences for reads quite a bit to make it all fit which is always a risk of breaking other stuff in pretty nasty ways, so I'll need to re-test a lot of this again.

Current situation:
COGRAM use 502 longs
LUTRAM use 512 longs

I might be able to free a few more COGRAM longs by sharing registers even more but that also gets risky and makes the code more fragile in any future changes, especially if you give the same register two different names for more clarity. You think you can freely re-use something but it then creates a side-effect somewhere else, and the more branching and skipf code paths etc, the harder this stuff gets to track.

evanh · 2020-06-07 11:07

I've implemented setting of CR0 myself now. In the process of testing out more combinations I just bumped into confirmation of Von's assertion that P16-P31 pins are more evenly routed than the others. Read data testing doesn't seem to be affected but sysclock/1 writes definitely are. This is of course the most hairy of my setups with the 22 nF capacitor on the HR clock pin.

Here's pins P32-P47:

 HyperRAM Burst Writes - Data pins registered, Clock pin unregistered
===============================
HubStart  HyperStart    BYTES    BLOCKS  HR_DIV  HR_WRITE    HR_READ  BASEPIN   DRIVE   CR0
00040000   003e8fa0   0000c350       2       1   a0cec350   e0cec350      32       7   ff1f
 ------------------------------------------------------------------------------------------
|                                       COUNT OF BIT ERRORS                                |
|------------------------------------------------------------------------------------------|
|        |                                 Compensations                                   |
|   XMUL |       0       1       2       3       4       5       6       7       8       9 |
|--------|---------------------------------------------------------------------------------|
...
     300 |  399828  400242  399618  399048       0  400687  400098  399863  400250  399445
     301 |  400015  400279  399866  399853       0  399099  399788  400982  400451  399256
     302 |  399434  400291  399672  398792       0  399643  399664  400503  400084  400247
     303 |  400354  399646  400398  400141       0  399740  399815  400467  399417  399182
     304 |  399461  399469  399676  400301       0  399913  399116  399890  400066  400520
     305 |  400359  400132  400471  400050       0  400554  399894  400055  399169  400547
     306 |  399762  399951  399633  400542       0  400285  400642  400205  400625  401250
     307 |  400055  399581  400313  399948       0  400090  399640  400285  400224  399772
     308 |  400256  400319  399840  400420       0  399549  399547  399891  400879  399912
     309 |  399954  399921  400330  400375       0  400486  400530  399327  399628  399428
     310 |  399901  400845  399819  400061       0  399190  400112  399622  399490  400139
     311 |  399687  399826  400341  399130       0  400161  400120  399717  399916  401052
     312 |  400340  400418  400500  400613       0  400549  400256  399733  399830  400725
     313 |  400347  399734  400124  399633       0  399231  398971  399759  400249  400693
     314 |  400438  400502  400440  401155       0  399682  399825  399901  400024  400078
     315 |  400239  400530  400131  400150       0  399074  400356  399792  400817  399448
     316 |  400064  400335  400136  399958       1  399491  400386  400571  399900  399381
     317 |  399956  399834  399691  400752       0  400450  399327  400501  399379  400078
     318 |  400764  400028  400853  399625       0  399809  399661  400141  400173  400122
     319 |  399647  399684  400492  399107       0  400345  399829  399920  400280  400620
     320 |  400424  400380  399751  400041       1  399806  398797  399988  399113  399122
     321 |  399202  400540  399867  399741       0  399446  400328  400547  400617  399992
     322 |  400371  400195  399429  399750       2  400792  400472  399964  401351  400252
     323 |  399552  399951  399819  400441       6  399205  399767  400019  400029  399876
     324 |  400361  400212  399866  399941       4  399704  399531  399438  400107  399676
     325 |  399835  400288  400014  399367       3  399075  400266  400550  400008  400084
     326 |  400245  400531  400382  398874       6  400437  400288  399909  400079  400152
     327 |  399946  399914  400190  399055      10  399533  400312  399975  399794  400696
     328 |  399140  399882  400408  400007       6  399792  399947  400855  399614  400075
     329 |  399882  400371  400385  400435       4  400301  399222  399698  399763  399799
     330 |  399931  399355  399452  399889      14  400283  399716  400041  400472  400237
     331 |  399585  400112  399953  399702      21  400489  401103  400571  400231  399667
     332 |  400083  400168  399533  399519      10  400544  399541  399465  400156  400220
     333 |  399974  400515  400636  399414      13  399598  400180  400049  401114  400735
     334 |  399940  399817  400841  400458      15  399570  399482  399209  400474  400020
     335 |  399810  399431  400308  399488       7  399804  400123  400274  399853  399974
     336 |  399783  400124  400004  400176      13  400083  400636  399685  400060  400301
     337 |  399507  400403  399997  399625      14  399707  400848  400231  399387  400008
     338 |  400101  399930  399762  400936      21  399936  399771  399824  399014  399699
     339 |  400280  400049  400232  400341      20  399866  400165  400427  399687  399601
     340 |  399951  400531  400162  400230      25  399685  400009  400025  400449  400531
     341 |  400037  400082  400702  399615      19  400452  400670  399349  399766  399527
     342 |  400191  400072  400051  400870      29  399968  400621  399665  399743  399377
     343 |  399789  400012  400052  399956      26  399891  399217  400415  399098  399953
     344 |  400635  399717  401135  400376      20  400384  399762  400387  399505  399924
     345 |  399612  400186  400029  399037      24  400270  399213  400027  399780  400447
     346 |  400550  399779  399639  400051      29  399898  399847  399407  400757  399031
     347 |  399737  399711  399205  399722      29  400370  400258  399018  399389  399813
     348 |  400013  400877  399856  399580      45  400562  399977  399272  400207  399686
     349 |  399812  400425  399803  399745      21  400153  399482  399631  400280  399451
     350 |  400163  400589  399915  399966      24  400184  399733  400290  400222  400912
     351 |  399848  399902  399840  400484      28  399763  400110  400364  399549  399838
     352 |  400182  399575  400498  399645      25  399935  399838  399402  400166  399770
     353 |  400054  400479  400252  400134      21  400690  399961  399616  399658  400595
     354 |  400162  399984  400560  400471      19  400600  401076  398929  400207  400279
     355 |  400577  398707  399237  400181      24  400375  400232  400253  399682  399938
     356 |  398731  399867  400018  400084      25  399292  399691  399482  400246  399744
     357 |  400078  399636  400086  400791      20  400339  400088  399749  400402  400418
     358 |  400699  399788  400193  400225      23  400135  400491  400706  400188  400078
     359 |  399398  399817  400424  400325      31  399718  399294  399304  400313  399704
     360 |  399409  399850  399671  399921      28  400134  399673  400496  400267  400737
     361 |  398865  399554  400301  399890      24  399916  399752  399994  399972  400108
     362 |  399601  400303  399682  399814      22  399859  400015  398922  399823  399931
     363 |  400067  400090  400847  399782      37  399344  400577  399640  398585  399716
     364 |  399846  400326  398851  400801      24  400014  400009  400748  401041  400290
     365 |  400462  399778  400534  400295      36  399943  399327  400096  400384  399299
     366 |  400161  401006  400314  399868      35  400631  400370  399861  399681  399789
     367 |  399709  399621  400039  399815      33  400473  400190  400186  399520  398993
     368 |  400664  399604  399236  399702      36  399459  399348  400124  399654  400085
     369 |  399342  400061  399701  399976      34  399938  399312  400334  400447  400015
     370 |  399819  399342  400056  400370      45  399778  399878  400185  400576  400660
     371 |  400186  400426  400614  399160      41  400135  400365  400389  399811  400073
     372 |  400421  399204  399787  399818      37  400072  399848  400777  399876  400027
     373 |  400724  399693  400222  400128      43  400107  399751  399887  400326  399241
     374 |  400429  400006  400270  400365      61  399752  399501  399668  398990  399567
     375 |  400511  400059  399784  399347      44  399465  400277  400197  399675  400267
     376 |  399928  399690  399561  400344      66  399520  399767  399840  399935  399561
     377 |  400154  400391  399919  400376      57  399961  400499  399662  400001  400209
     378 |  399631  399854  400549  399116      43  400156  400716  400269  400221  399769
     379 |  400214  399878  399331  399512      63  399705  399822  399973  399602  400432
     380 |  399590  400473  400183  400582      64  400249  400223  399765  399518  399210
     381 |  400203  399439  399739  400485      81  400501  400516  399490  401436  399570
     382 |  399699  399416  400382  400017      68  400051  399909  400499  399433  400352
     383 |  400341  400060  399743  400227     106  401119  400673  399867  399378  399582
     384 |  399994  399935  400416  399606     102  399831  399746  400320  399737  399829
     385 |  401206  400181  400021  400156     100  399874  400544  399984  399492  400325
     386 |  400296  400528  398380  399497      85  398731  400734  399813  399834  399302
     387 |  400041  400420  400184  399950      99  400911  399621  399749  399189  400541
     388 |  399296  400353  401566  399394     100  400671  399946  400095  399864  399286
     389 |  400509  400306  400623  400304      93  399768  399769  399776  400066  399605
     390 |  399414  399095  400597  400164     122  400202  399900  399838  400517  399879

evanh · 2020-06-07 11:07

And the same again but on pins P16-P31:

 HyperRAM Burst Writes - Data pins registered, Clock pin unregistered
===============================
HubStart  HyperStart    BYTES    BLOCKS  HR_DIV  HR_WRITE    HR_READ  BASEPIN   DRIVE   CR0
00040000   003e8fa0   0000c350       2       1   a0aec350   e0aec350      16       7   ff1f
 ------------------------------------------------------------------------------------------
|                                       COUNT OF BIT ERRORS                                |
|------------------------------------------------------------------------------------------|
|        |                                 Compensations                                   |
|   XMUL |       0       1       2       3       4       5       6       7       8       9 |
|--------|---------------------------------------------------------------------------------|
...
     300 |  400164  400231  400514  399283       0  399616  399587  400353  400166  399695
     301 |  399787  399852  399431  399861       0  399805  399444  399408  400619  399899
     302 |  400388  400369  399797  400235       0  400005  400164  400310  399301  399670
     303 |  400480  400450  399820  399775       0  398937  399754  400200  399737  399726
     304 |  399797  399714  399754  400073       0  400532  399401  399766  400080  399621
     305 |  399700  400406  400087  400464       0  400128  400097  400137  400116  399700
     306 |  400209  399659  400422  399979       0  400372  400134  399562  399871  400121
     307 |  399120  399972  399446  399906       0  400188  399591  400713  401112  399020
     308 |  400033  400220  399307  399842       0  400042  399580  400193  399127  399845
     309 |  399805  399195  400170  399927       0  399861  399961  399975  399582  400356
     310 |  400504  399405  400619  399220       0  399911  399733  399918  399654  399712
     311 |  400002  400504  400158  399782       0  399562  399355  399623  400137  399702
     312 |  399438  399138  399379  399917       0  399150  399919  399738  400711  400214
     313 |  399848  399593  400585  400837       0  400267  400021  399421  399344  400392
     314 |  400326  399667  400924  399881       0  399758  400122  399554  400535  399587
     315 |  399965  400334  399966  399115       0  399830  399788  399586  400522  399312
     316 |  400347  399184  399960  399972       0  400366  400105  400522  399487  399621
     317 |  399479  399682  399013  399812       0  400235  400099  400195  399741  399593
     318 |  400119  399439  400851  400648       0  399304  400588  399620  399252  400007
     319 |  399705  400110  399795  400367       0  400528  399402  400356  399665  400687
     320 |  399701  399647  399784  400040       0  399920  399856  399736  400386  399971
     321 |  400203  400466  400239  399713       0  399548  399147  400729  399743  399746
     322 |  399744  400146  399656  399905       0  399733  399734  399313  399671  400330
     323 |  400310  400390  399855  400595       0  399577  400222  400270  400626  399870
     324 |  400013  400314  400673  400609       0  399636  399982  400187  399760  400664
     325 |  399863  399406  400195  399753       0  400570  399473  400099  400200  399550
     326 |  399469  399747  399830  399771       0  398961  400214  399686  399640  400504
     327 |  399995  399737  399317  400034       0  399723  400531  401067  400345  400141
     328 |  400547  400406  399662  400115       0  400134  399390  399573  399990  400061
     329 |  400119  400909  400655  399725       0  400806  399775  400127  400084  400368
     330 |  400676  400591  399427  399910       0  400038  399105  400137  399955  401108
     331 |  400114  399489  399982  400409       0  399888  400517  400181  400349  399332
     332 |  399872  399986  400426  400554       0  399822  399831  400251  399801  400152
     333 |  400156  399895  399995  399569       0  399604  399546  400445  399704  399629
     334 |  399600  399692  400199  399713       0  400836  400064  399548  399573  399756
     335 |  400746  400626  400286  399600       0  399255  400519  400348  400508  400717
     336 |  400500  400619  400114  400400       0  399904  399933  399664  400032  399353
     337 |  400230  400460  399905  400230       0  399969  399822  400443  400341  399888
     338 |  399778  400571  400934  398848       0  400639  399493  400322  399427  400251
     339 |  400219  400642  400405  399229       0  399678  399328  399705  399973  399215
     340 |  400587  400664  400066  400088       0  400482  399728  399686  398257  400125
     341 |  399669  399997  400166  399753       0  399952  400809  399851  401140  400161
     342 |  399479  400055  400057  400224       0  399391  400247  399852  400147  399393
     343 |  400332  400296  399061  399588       0  400735  400247  399983  399426  399771
     344 |  400971  399160  400228  399548       0  400299  399680  400058  399999  398927
     345 |  399843  399957  400140  400432       0  399075  401056  400786  400382  400302
     346 |  399556  400224  400030  399757       0  399321  400304  398836  399929  400304
     347 |  399877  399505  400441  399928       0  400770  399872  399877  399503  399569
     348 |  399528  399274  400631  399727       0  399863  399388  399824  399602  399342
     349 |  399512  399652  399827  399459       0  399955  400548  400822  399877  400204
     350 |  400043  400167  400510  400521       0  401237  399988  399812  400065  400177
     351 |  400000  400336  399760  399436       0  399982  400245  399480  399900  399617
     352 |  400307  400545  400192  400803       0  400303  400741  399891  399654  400129
     353 |  400664  400548  400173  400168       0  400902  399663  400360  399751  399978
     354 |  400251  400119  400266  401337       0  400059  400116  399927  399913  400075
     355 |  400041  400254  399515  400330       0  400314  399535  399513  400136  399413
     356 |  399701  400448  399896  399659       0  399636  399163  400483  399942  399513
     357 |  400126  400363  400104  399911       0  400032  399907  399258  399047  400173
     358 |  400111  399959  400161  400248       0  400860  399900  400131  400496  399784
     359 |  400298  400848  400267  399859       0  399275  400026  400432  399770  399671
     360 |  399060  400518  401037  400193       0  400154  400324  399253  399691  400114
     361 |  399804  399467  400447  399211       0  399336  399936  399874  400793  400205
     362 |  398814  399900  399797  399760       0  399894  399750  400178  399485  400730
     363 |  399620  400668  400501  400608       0  400529  399916  399558  399939  400256
     364 |  400254  400196  400324  399662       0  399322  399573  400009  399230  400063
     365 |  400201  400283  399851  399155       0  399797  399904  399796  399693  399777
     366 |  399788  399156  399922  400010       1  400055  400823  400486  399466  400023
     367 |  399948  400225  399297  400409       0  400376  399745  400031  400186  399132
     368 |  400148  400290  400056  399426       0  400648  399747  399729  400136  399847
     369 |  400334  400358  399363  400593       0  399942  399841  399934  400647  399503
     370 |  400014  400685  400116  400254       0  399645  400072  400135  401141  400253
     371 |  399962  400511  399650  400713       0  400121  399629  399857  401298  399495
     372 |  400508  400113  400392  400502       0  399292  400317  399521  399567  399921
     373 |  399641  399707  399930  400214       0  400328  399954  399826  400236  399968
     374 |  400413  399988  398800  399126       0  399785  399622  400343  399860  399916
     375 |  400392  399688  400733  399645       0  400301  399910  399971  400225  400330
     376 |  399736  400236  400331  400377       0  399488  400131  400808  400424  400157
     377 |  400615  399729  400730  399859       0  399600  400330  400497  399933  400146
     378 |  399429  399619  400808  400422       0  399750  399889  399095  399520  399881
     379 |  400327  400473  400468  400060       0  400085  400078  399758  399535  399558
     380 |  399730  400077  400826  399674       0  399994  399666  400240  400894  400192
     381 |  399815  399990  399921  399897       0  399888  399636  399078  400152  400146
     382 |  400189  399973  400735  400053       0  400553  399982  400245  399915  399513
     383 |  399103  400144  399531  399770       0  399708  400397  399842  400191  399491
     384 |  399672  400552  400508  399452       0  400013  400456  400142  399872  399212
     385 |  400456  399607  399815  399605       0  400582  399413  400334  399716  400161
     386 |  399797  399623  399883  399503       0  400139  400040  400020  400153  399679
     387 |  400411  399358  400650  399006       1  399756  399776  400179  399353  400642
     388 |  399252  400047  400647  399485       0  399735  400215  400278  400057  399905
     389 |  399924  399599  400306  400203       0  400004  400859  399678  400458  399568
     390 |  399437  399511  399352  399538       0  399904  399688  399528  400412  400610

evanh · 2020-06-07 11:14

Here's latest revision of the testing code with CR0 setting option.

rogloh · 2020-06-07 11:37

What sort of write timing variation have you found as you change CR0 impedance @evanh? The results posted above are for the lowest impedance 19 ohm drive value for CR0 ($ff1f) by the looks of it. I wonder how much that should actually affect write timing if this bus is tri-stated? Do we have other values to compare this with?

evanh · 2020-06-07 11:43

Roger,
Registering the HR clkpin for data reads definitely gives a higher usable clock speed. And setting the CR0 drive strength to 19 ohms does help a little too. I'm getting over 320 MT/s read speed now. On the down side, the 22 pF capacitor definitely drags it down. A dedicated board layout will help a lot.

evanh · 2020-06-07 11:45

rogloh wrote: »

What sort of write timing variation have you found as you change CR0 impedance @evanh? The results posted above are for the lowest impedance 19 ohm drive value for CR0 ($ff1f) by the looks of it. I wonder how much that should actually affect write timing if this bus is tri-stated? Do we have other values to compare this with?

That's with the capacitor in place. It's only a demo of the differences with something sensitive.

EDIT: ie: The difference between a basepin of 16 and 32 is tiny and generally doesn't impact reliability.

jmg · 2020-06-07 20:22

evanh wrote: »

.... This is of course the most hairy of my setups with the 22 nF capacitor on the HR clock pin.

I think you meant 22pF there ?
Did you try a trimmer for that skew-cap ? Now there are nice error tables, there may be a better C value ?

rogloh · 2020-06-08 00:49

This information on keeping routing well matched (and in general short) should be useful for @"Peter Jakacki" and his HyperRAM add on for P2D2. His design uses P32-P39 for the data bus but there is nothing bad about those P2 pins in general, just that on the P2-EVAL board their trace lengths are not quite as evenly matched to the header pins vs other ports which may compromise a future sysclk/1 write operation. I still use P32 all the time with sysclk/1 reads.

evanh · 2020-06-08 06:27

jmg wrote: »

evanh wrote: »

.... This is of course the most hairy of my setups with the 22 nF capacitor on the HR clock pin.

I think you meant 22pF there ?

Oops, lol, yeah, 22 pF.

Did you try a trimmer for that skew-cap ? Now there are nice error tables, there may be a better C value ?

I was quite happy with the measured result on the scope. The 22 pF brought the slope nicely parallel to the data traces and the roughly 1 ns lag was just what I wanted.

There was a clear case of attenuation kicking in though. The board layout has a lot of capacitance I suspect. I don't think any signal improvement can be made without a dedicated snug HyperRAM on the prop2 board. Even then, I worry that adding a capacitor will be a major nerf, so want to have the space to modify the first experiment board.

Sadly I don't have any layout even on the drawing board yet.

evanh · 2020-06-08 06:32

rogloh wrote: »

This information on keeping routing well matched (and in general short) should be useful for @"Peter Jakacki" and his HyperRAM add on for P2D2. His design uses P32-P39 for the data bus but there is nothing bad about those P2 pins in general, just that on the P2-EVAL board their trace lengths are not quite as evenly matched to the header pins vs other ports which may compromise a future sysclk/1 write operation. I still use P32 all the time with sysclk/1 reads.

Correct, it's the board causing it, not the chip.

On the other hand, I still favour keeping P28-P31 and associated VIO away from any connectors. If that VIO is taken out then the prop2 is bricked because the sysclock oscillators won't power up without it.

Surac · 2020-06-08 06:53

Just a quick question:
Is the HyperRAM rated for the clock speeds you are using?
The data rates you are achieving top mine in a completely different project by a good margin.

Best regards

evanh · 2020-06-08 07:15

Nope.

We're using Parallax's accessory board, which is a 200 MT/s rated part (IS66WVH16M8BLL). Given how much better the HyperFlash has performed for Roger, maybe the faster Hyperbus2 parts will be a notable boost for us.

The HyperFlash is IS26KL256S-DABLI00, also 200 MT/s.

Tubular · 2020-06-08 09:28

Surac, there's a "version 2" of HyperRAM coming from Infineon/Cypress that goes faster and is rated to 400 MBps (=400 MT/s), in both 1v8 and 3v3 variants.

So far, the 1v8 parts of version 2 are available in stock, "S27KS0642*", but the 3v3 "S27KL0642*" versions are not yet in stock. Hopefully soon.

When these parts arrive, we will be within spec again

rogloh · 2020-06-21 08:12

Finally had a chance to get back onto this after a couple of weeks of being sidetracked with AVR micros, PS/2 keyboards and Z80 CRTCs of all things.

I tested out the read-modify-write feature I had recently added and it appears to work. This now lets us write arbitrary bitfields within individual bytes/words/longs and retrieve the prior value in a single mailbox operation.

It will be useful for semaphores, general bitfield updates and graphics pixel operations on any elements that differ from the native 8/16/32 bit storage sizes.

The existing single element access APIs were just these:

PUB readByte(addr) : r
PUB readWord(addr) : r
PUB readLong(addr) : r
PUB writeByte(addr, data) : r 
PUB writeWord(addr, data) : r 
PUB writeLong(addr, data) : r

And the updated API now includes these too:

PUB readModifyByte(addr, data, mask) : r
PUB readModifyWord(addr, data, mask) : r
PUB readModifyLong(addr, data, mask) : r

The mask is the same size as the element being written (8, 16, or 32 bits). Its binary ones mark the bit positions at which the corresponding bit of the data parameter will be written to the HyperRAM, overwriting the existing bit value. Where a mask bit is zero, the existing bit in memory is left alone.

The original value before any update gets applied is also returned by the API.

If the mask used is zero, no updates are applied (and this then defaults to the same behaviour as the general read case in the PASM driver).
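For anyone who prefers to see the mask semantics spelled out, here is a small Python model of the behaviour described above. It is a sketch only: `read_modify` is a hypothetical helper, the `bytearray` stands in for HyperRAM, and little-endian element storage is an assumption made for the model; none of this is the driver's actual code.

```python
def read_modify(memory, addr, data, mask, size):
    """Model a masked read-modify-write of `size` bytes at `addr`.

    Bits where `mask` is 1 are taken from `data`; bits where `mask` is 0
    keep their existing value. The value held before the update is returned.
    """
    full = (1 << (8 * size)) - 1                      # all-ones for this element size
    old = int.from_bytes(memory[addr:addr + size], "little")
    new = (old & ~mask & full) | (data & mask & full)
    if mask & full:                                   # a zero mask degenerates to a plain read
        memory[addr:addr + size] = new.to_bytes(size, "little")
    return old


memory = bytearray([0xA5, 0x00, 0x00, 0x00])          # stand-in for external memory

prior = read_modify(memory, 0, data=0x0F, mask=0x3C, size=1)
print(hex(prior), hex(memory[0]))                     # prior value, then updated byte

unchanged = read_modify(memory, 0, data=0xFF, mask=0x00, size=1)
print(hex(unchanged), hex(memory[0]))                 # zero mask: read only
```

In the first call only bits 2..5 are replaced, so $A5 becomes $8D and $A5 is returned. That is the pattern you would use to update a 4-bit pixel or a flag field without disturbing its neighbours.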

I'm now down to 1 long free in COG RAM.

That's it.

Rayman · 2020-06-22 13:09

What is in LUT?

rogloh · 2020-06-22 13:28

Rayman wrote: »

What is in LUT?

Most Hyper memory access code and per bank/cog state live there. There are zero longs left in LUTRAM! Any more code space will have to come from optimisations or elimination of features.

Things are looking good with this final? version in my testing. Tonight I just got a video frame buffer output to my video driver from HyperFlash for the first time ever. So I can send my driver a frame buffer held in either RAM or Flash controlled just by the nominated address which is mapped to one of the devices on the bus. I should be able to graphics copy image data out of flash directly to RAM too, this could be useful for image resources for GUI elements etc. Just about to try that out once I write something useful for testing into flash.

I am very glad I decided to enable different input timing for each different bank back when I was contemplating doing all that. At one time I wasn't fully sure I would need to do this and it also added some small setup overheads, but it would have been very hard to add it in at this stage. This issue just showed up already as a problem at 200MHz with my video frame buffer. HyperFlash needed a delay of 8 to show stable pixels while RAM wanted 9. This input delay is kept as a per bank parameter so the driver can handle the different input timing for each read access.

rogloh · 2020-06-23 03:58

I just noticed something interesting with 16MB HyperRAM on the P2-EVAL during some re-testing. If you cross over from one stacked die to the other in the same HyperRAM multi-chip package (MCP) at the 8MB boundary during a read, the burst read will wrap around only with the starting die and not cross to the next die in the package. When I looked further at the data sheet, I found this was documented as:

5. When Linear Burst is selected by CA[45], the device cannot advance across to next die.

Here's a picture of what happens. I had a frame buffer starting at around 8MB-64kB (with unprovisioned memory but primarily purple colours) and when it crosses at the 8MB boundary, you start to see some colour patterns that I had written at address 0 in the first die in the package. After the scan line burst read completes at the boundary, the frame continues on within the second die (primarily green colours). Same thing happens when wrapping from 16MB back to zero, but the data read at the crossing will be that stored starting at 8MB.

I can't really do much about this now, it is a feature of the HyperRAM MCP itself. The only way we could deal with it would be to specifically test for an 8MB crossing within every burst read and split the bursts at the boundary (a little bit like I did for flash page crossings), but this is not worth the additional overhead on all read bursts and probably also all other fills/copies etc. So to avoid this it is best to just keep your frame buffers (or other data elements) fully contained within the same 8MB block if you use an MCP based HyperRAM device like the one on the Parallax HyperRAM module. Future single die HyperRAM devices will probably not have this issue anyway.
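If a buffer really has to straddle the boundary, the splitting could in principle be done on the caller side instead of in the driver. The Python sketch below shows the arithmetic only; `split_at_die_boundary` is a hypothetical helper, not part of the driver API, and it does not handle the wrap from the top of the device back to address zero.

```python
DIE_SIZE = 8 * 1024 * 1024        # 8MB per die in the stacked package


def split_at_die_boundary(addr, length, die_size=DIE_SIZE):
    """Split a burst (addr, length) into pieces that each stay within one die."""
    pieces = []
    while length > 0:
        room = die_size - (addr % die_size)   # bytes left before the next boundary
        chunk = min(length, room)
        pieces.append((addr, chunk))
        addr += chunk
        length -= chunk
    return pieces


# A 4kB read starting 1kB below the 8MB boundary becomes two requests.
for start, count in split_at_die_boundary(DIE_SIZE - 1024, 4096):
    print(hex(start), count)
```

Each piece can then be issued as its own burst request, at the cost of one extra request for the rare transfer that crosses the boundary.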

evanh · 2020-06-23 04:32

rogloh wrote: »

... Future single die HyperRAM devices will probably not have this issue anyway.

Yeah, regular DDR4 DIMMs are 8 Gbit (1 GByte) per die, with the latest packing 16 Gbit already.

rogloh · 2020-06-25 05:48

Recently I was just testing the round robin scheduling in this HyperRAM driver and noticed something "interesting". As a test I had 5 COGs competing for HyperRAM access and wanted to see how many requests each one obtains relative to the others to compare the fairness.

Each round-robin (RR) COG does the exact same operation drawing some vertical lines into the frame buffer moving from left to right (covering 1920 pixels wide on a FullHD screen), cycling the colour when it reaches the end and wraps around again to the left side. These operations form a colour row or bar per COG on the screen and one strict priority video COG outputs this screen over VGA. If a round-robin COG is getting more requests serviced compared to others its bar advances faster relative to the others and it looks like a race visually with some less serviced COGs being "lapped" making this nice and easy to see in real time. The request servicing over these 5 RR COGs looks good like this and they advance at pretty much the same speed. I count the requests and take the average as a percentage of the total request count and show this at the top left of the screen per COG.

So when all RR COGs do the same thing and request the same operation taking the same duration, it is fair and there is minimal separation between the different COG's bars. However if one COG is then stopped and becomes inactive, its requests cease and the fairness changes. Instead of being equally allocated to the other 4 RR COGs, one COG is given an advantage and the request share then looks more like this (apologies for the blurry shot):

COG0 40%
COG1 20%
COG2 20%
COG3 20%
COG4 0% (inactive)

I found the issue is the way these RR COGs are polled. Each time a request is serviced the RR COG polling order advances, like this:

Initially: COG0, COG1, COG2, COG3, COG4
next request: COG1, 2, 3, 4, 0
next request: 2, 3, 4, 0, 1
next request: 3, 4, 0, 1, 2
next request: 4, 0, 1, 2, 3
next request: 0, 1, 2, 3, 4 (continuing etc)

This works fairly when all COGs are active. They have an equal time at the first spot, 2nd spot, 3rd, 4th, and last spot. The problem happens if COG4 is inactive, because the next in line is COG0. This essentially gives COG0 two goes at the top spot because COG4 never needs servicing. For 2 in every 5 requests serviced, COG0 gets polled before the others. To fix this requires another more complicated implementation where you select each RR COG only once per polling loop iteration, or somehow determine the full polling order more randomly so inactive COGs aren't followed by the same COG. I think a polling change like that might have to come later if at all. There is a tradeoff between complexity and polling latency here. My current implementation keeps the polling as simple as possible to try to be as fast as it can be. Currently the loop generated for 5 RR COGs and one strict priority video COG would be the one shown below. It builds a skip mask to determine the polling sequence.

poller
                            incmod  rrcounter, #4           'cycle the round-robin (RR) counter
                            bmask   mask, rrcounter         'generate a RR skip mask from the count
                            shl     mask, #1                'don't skip first instruction in skip mask

repcount                    rep     #10, #0                 'repeat until we get a request for something
                            setq    #24-1                   'read 24 longs
                            rdlong  req0, mbox              'get all mailbox requests and data longs
                            tjs     req7, cog7_handler      ' (video COG for example)
polling_code                skipf   mask                    ']dynamic polling code starts from here....
                            jatn    atn_handler             ']JATN triggers reconfiguration 
                            tjs     req0, cog0_handler      ']
                            tjs     req1, cog1_handler      ']
                            tjs     req2, cog2_handler      ']
                            tjs     req3, cog3_handler      ']
                            tjs     req4, cog4_handler      '] Loop is generated based on
                            tjs     req0, cog0_handler      '] number of RR COGs
                            tjs     req1, cog1_handler      ']
                            tjs     req2, cog2_handler      ']
                            tjs     req3, cog3_handler      ']
                            tjs     req4, cog4_handler      ']
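As a rough illustration of how the skip mask picks the polling order, here is a small Python model of the BMASK/SHL/SKIPF steps above. It assumes BMASK sets the low rrcounter+1 bits and that bit 0 of the SKIPF mask applies to the first instruction after SKIPF (the JATN). It also assumes only the first five un-skipped TJS slots matter for one pass of the loop; how the REP block length interacts with the skipped instructions is not modelled.

```python
NUM_RR = 5
# Instructions following SKIPF in the listing: JATN, then two copies of the RR TJS slots
SLOTS = ["jatn"] + [f"req{i}" for i in range(NUM_RR)] * 2

def skip_mask(rrcounter):
    mask = (1 << (rrcounter + 1)) - 1  # BMASK: low (rrcounter + 1) bits set
    return mask << 1                   # SHL #1: bit 0 stays clear so JATN is never skipped

def executed_slots(rrcounter):
    """Names of the instructions after SKIPF whose skip bit is clear."""
    mask = skip_mask(rrcounter)
    return [name for bit, name in enumerate(SLOTS) if not (mask >> bit) & 1]

def polling_order(rrcounter):
    """First NUM_RR polled mailboxes (assumption: later slots only repeat the order)."""
    return executed_slots(rrcounter)[1:1 + NUM_RR]
```

In this model `polling_order(0)` starts at req1 and `polling_order(4)` starts at req0, so the starting COG rotates by one each time the counter advances.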

I think to work around this polling issue when RR COGs are all doing the same thing and true fairness is needed, it is best to only enable COGs in the RR polling loop that will actually be active and remove any others already in there by default.

When requests are randomly arriving this should be far less of a problem. It's mainly happening when they are all doing the same thing at the same rate, and some COG(s) are idle. I noticed if I add a random delay to each RR client COG after its request completes the fairness starts to return.
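The 40/20/20/20 split can be reproduced with a very small Python simulation. This is only a sketch of the polling rule described above, under the assumption that every active COG always has a request pending (the "all doing the same thing at the same rate" case); it ignores request duration and hub timing entirely.

```python
from collections import Counter

def simulate(active, num_cogs=5, requests=1000):
    """Percentage of serviced requests per COG when the order rotates by one each time."""
    served = Counter()
    start = 0
    for _ in range(requests):
        for offset in range(num_cogs):
            cog = (start + offset) % num_cogs
            if cog in active:            # first active COG in the current order wins
                served[cog] += 1
                break
        start = (start + 1) % num_cogs   # order advances by one per serviced request
    return {cog: 100 * served[cog] / requests for cog in range(num_cogs)}
```

With all five COGs active each one gets 20%. With `simulate({0, 1, 2, 3})` COG0 gets 40% because it inherits the slot where the idle COG4 would have been polled first.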

avsa242 · 2020-06-25 10:37

I wish I had something useful to contribute, but man, what a great visual aid/tool...really neat idea. Coming up with tools like this really helps diagnose problems, letting you really see them. Sometimes it takes something like this to get to that "Ah! I know what's wrong now" moment.

rogloh · 2020-06-25 13:15

Yeah this approach helped me encounter the issue visually and it was good to use it to prove out the strict priority COG polling setting as well. In that case the bar of the highest priority COG (after video) screams along the fastest, then the next priority COG's bar, the third priority bar is pretty slow to move, and the fourth bar barely moves at all. Definitely strict priority as intended there.

For some time I thought I must have had a bug in the code causing this type of unfairness, but in the end it was just the polling design's own limitation. I needed to write it down and really think about the effect of the polling sequence and what happens when skipping inactive COGs. It would be cool to come up with a fast scheme that somehow improves on this and keeps it fair for equal load even when some COGs are idle and which still fits in the existing COGRAM footprint. If it doesn't fit the space then any change to that poller will have to wait until I free more COGRAM by changing the table lookup method I use and would then add four extra cycles of overhead per request. Doing that can wait for another time though....it's not worth all the extra work right now. I want to get it out, it's working very well already.
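For comparison, one common round-robin variant is to resume polling just after the COG that was last serviced instead of always advancing by one. The Python sketch below models only that rule, again assuming every active COG always has a request pending. It is not what the driver does, and whether something like it would fit the COGRAM and cycle budget is untested here.

```python
from collections import Counter

def simulate_resume_after_served(active, num_cogs=5, requests=1000):
    """Percentage of serviced requests per COG when polling resumes after the served COG."""
    served = Counter()
    start = 0
    for _ in range(requests):
        for offset in range(num_cogs):
            cog = (start + offset) % num_cogs
            if cog in active:
                served[cog] += 1
                start = (cog + 1) % num_cogs  # next pass starts after the COG just serviced
                break
    return {cog: 100 * served[cog] / requests for cog in range(num_cogs)}
```

In this model four active COGs out of five get 25% each, so an idle COG no longer hands its turn to one particular neighbour.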

rogloh · 2020-06-27 05:45

I was hoping I might be able to reorder the HyperRAM driver's mailbox parameter order to increase performance slightly for PASM clients. Right now the mailbox parameter order is this:

mailboxBase + 0: request & bank/external address
mailboxBase + 4: read/write data or read/write hub address for bursts
mailboxBase + 8: mask or transfer count

If I reverse this order, the request long is written last, which is what triggers the memory request. A SETQ #2 followed by a WRLONG then becomes a safe and fast way to generate memory requests even with the fifo running, as it might be in video driver clients. The existing problem is that any fifo use can interfere with the SETQ read/write bursts and introduce gaps between the longs transferred, potentially causing a request to be triggered prematurely with stale data parameters in some mailbox registers. My own video driver client works around this issue by writing the second two mailbox longs first (with a SETQ #1 burst), then writing the request mailbox long separately after that, but doing this slows down the initial request a little bit. Changing the order would improve that side of things. I also thought that it may let the polling sequence that reads the result and the status be tightened to something like the sample below, which would also allow us to use the flags from the last long of the read burst to detect the poll exit condition. However there is still a problem...


        mov     addr, ##$ABCDE
        call    #readlong ' read HyperRAM memory
        ' data reg now contains the result
        ...

' readlong:
' input addr - external address to read from (trashed afterwards)
' output data - result
readlong 
        mov     mask, #0 ' set mask = 0 to prevent read-modify-write cycle
        setbyte addr, #READLONG, #3 ' setup 32 bit read request using external address
        setq    #3-1 ' writing 3 longs
        wrlong  mask, mailboxPtr ' trigger mailbox request
        rep     #3, #0 ' poll for result (15 clock cycle rep loop once aligned to hub)
        setq    #3-1   ' reading 3 longs (or possibly just two if you had a second mailboxPtr)
        rdlong  mask, mailboxPtr wcz  ' read memory data and polling status
 if_nc  ret     wcz  ' need to check with evanh if you can really return from the end of a rep loop

mask  long 0
data  long 0
addr  long 0

mailboxPtr long MAILBOXADDR

The new problem is that if the final read burst of the data result+status is itself interrupted by a fifo transfer on the client COG, between reading the data and reading the polling status, you might get stale data in the data result long. You'd then need to read it again after the REP loop completes if you ever use the fifo during the polling operation. So the change of order helps one side but hinders the other side. We sort of want to keep the existing order on polling for the result to prevent this problem. We can't really fix both ends.
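The stale read can be shown with a toy Python model of the mailbox. The assumptions here are mine: bit 31 of the request long means "still pending", the driver writes the result before it clears that bit, and a fifo gap is modelled as the driver completing after a chosen number of longs have been read. The field names and values are placeholders.

```python
PENDING = 0x80000000  # assumption: bit 31 set means the request is still pending

def fresh_mailbox():
    return {"request": PENDING, "data": 0xDEAD, "mask": 0}  # 0xDEAD stands for stale data

def burst_read(mailbox, order, driver_completes_after):
    """Read the mailbox longs one at a time in 'order'. The driver finishes the
    request once 'driver_completes_after' longs have been read (the fifo gap)."""
    seen = {}
    for n, field in enumerate(order):
        if n == driver_completes_after:
            mailbox["data"] = 0x1234   # assumed: driver writes the result first...
            mailbox["request"] = 0     # ...then clears the pending bit
        seen[field] = mailbox[field]
    done = not (seen["request"] & PENDING)
    return done, seen["data"]

EXISTING_ORDER = ["request", "data", "mask"]   # request long at the lowest address
REVERSED_ORDER = ["mask", "data", "request"]   # request long last
```

With `REVERSED_ORDER` and the driver completing after two longs, the client sees "done" together with the stale data value. With `EXISTING_ORDER` a gap in the same burst can only make the client see "pending" and poll again, never "done" with old data.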

It would be a fairly simple change in the PASM driver to reorder the mailbox but the SPIN layer which abstracts this ordering needs to be changed in lots of places (still mostly straightforward). If it is going to happen at all I think it's worth doing now before the code is released because changing it later will affect all the PASM clients.

I'll need to mull this over and I'm really on the fence about doing it now that it introduces new problems...any suggestions?

rogloh · 2020-06-27 08:58

After thinking it through a bit more, I should just keep the original mailbox order as is. The result polling side should be the one that is optimized, given that RDLONGs typically take longer than WRLONGs to execute. Also the fifo use is a special case and not all PASM clients will need it, so they can still use a 3 long WRLONG burst to set up the mailbox whenever they don't use the fifo, just the way it works today. With any luck this request setup sequence won't add too many extra clocks for the second WRLONG anyway.

SETQ    #2-1
WRLONG  data, mailboxPtr2 
WRLONG  addr, mailboxPtr1

And the read result polling can still exit the mailbox polling loop with the data quickly this way without needing the flags read:

POP     exitaddr ' pop return address off the stack
REP     #3, #0 ' repeat until result is ready
SETQ    #2-1 ' read data and status
RDLONG  addr, mailboxPtr1
TJNS    addr, exitaddr ' returns to caller when request is serviced

I'm going to leave the order alone.

evanh · 2020-06-27 10:39

Sounds right to me. I wasn't sure which way was which in your earlier post.

rogloh · 2020-06-27 23:35

@evanh In the earlier post the existing mailbox long order was shown but I was just contemplating changing it to enable the request setup/polling according to the sample code provided. However in the end I've decided against changing it as it still introduces a problem on the read side even though it can improve writing the request.

Had a question you might be able to answer with your knowledge of the egg beater timing. In this code sequence:

SETQ    #2-1
WRLONG  data, mailboxPtr2 
WRLONG  addr, mailboxPtr1

How long would the second WRLONG take if mailboxPtr1 = mailboxPtr2 - 4 ?

I know that WRLONGs take anywhere from 3-10 clocks, but I'm hoping it might be on the shorter side of that range when it follows the first WRLONG which will already sync up to the egg beater.

evanh · 2020-06-28 00:25

rogloh wrote: »

... but I'm hoping it might be on the shorter side of that range when it follows the first WRLONG which will already sync up to the egg beater.

It is a determinable amount of sysclocks but it depends on the modulo'd difference between the final address of the burst and the address of mailboxPtr1. If they both shift in unison, ie: the delta doesn't change, then you're in luck.

EDIT: Basically, if you can arrange addresses so that burstEnd % totalCogs == (mailboxPtr1 - 3) % totalCogs then you should achieve the optimal WRLONG of 3 sysclocks.

EDIT2: I don't think I'd worry about it with a 4-cog prop2, but a 16-cog prop2 would definitely want to use this.
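The condition in the first EDIT can be written as a one-line check. This Python sketch only restates that expression. It assumes the addresses are counted in longs (hub slots) rather than bytes, which the post does not spell out, and it says nothing about the cycle count when the condition is not met.

```python
def wrlong_is_optimal(burst_end, mailbox_ptr1, total_cogs=8):
    """Literal form of: burstEnd % totalCogs == (mailboxPtr1 - 3) % totalCogs.
    Addresses are assumed to be in long units; total_cogs is 8 on the current P2."""
    return burst_end % total_cogs == (mailbox_ptr1 - 3) % total_cogs
```

For example, `wrlong_is_optimal(5, 16)` is true for an 8-cog chip because both sides are 5 modulo 8, and shifting both addresses by the same amount keeps it true, which is the "delta doesn't change" point above.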
