Memory drivers for P2 - PSRAM/SRAM/HyperRAM (was HyperRAM driver for P2)

evanh · 2020-05-15 07:53

evanh wrote: »

EDIT: That said, on the prop2, I doubt it'll get used for more than a single byte at a time. So pointless needing it to electrically perform when it's faster to just bit-bash the one byte and leave the performance for complete bursts without RWDS.

On that note, I can see it being feasible to make the custom board layout, with the integrated HyperRAM using pins P48 to P58. RWDS on P58, sharing with the SD/EEPROM pins.

EDIT: Grr, bad idea. SD card needs to be able to operate concurrently. Particularly since the HR will likely be busy all the time.

whicker · 2020-05-16 00:11

I'd like to see a dual parallel hyperram board, but that's impractical on the eval board because of more than 16 pins.

But the data would really fly, still sharing the same clock and cs pins but of course being in fixed latency mode.

rogloh · 2020-05-16 00:38

Yeah that would be quite good whicker.

If byte & word granularity was sacrificed and you always had to read/write 32 bits on both chips it may not be much of a change to the driver - it's mainly just some clock counter scaling and slightly different streamer commands, and also giving up the RWDS pin control. So another driver variant of what I have could probably be developed with this wider capability in time.

Supporting individual byte transfers would complicate it a lot more with 2 RWDS lines and requires separate CS lines too. Don't even want to think about that right now.

evanh · 2020-05-16 01:11

Shouldn't need two CS lines. Actually, maybe RWDS and CS, rather than CLK and CS, can be combined somehow ... using resistors and capacitors ... neither, as used with the prop2, are going to need the performance of the clock and data lines.

evanh · 2020-05-16 01:19

I say "with the prop2" because the masked byte writes are only useful for a single byte at a time. In other words, random writes only. It's not going to be practical to build a mask map amongst the 8-bit data for burst writes.

The presumption is random writes will be bit-bashed at a slower toggle rate than consecutive bursts.

rogloh · 2020-05-16 01:31

I guess with two RWDS lines you could combine the CS signals and just fully mask the chip you don't want to write to with its own RWDS signal. Somewhat less messy to figure out byte writes than using different CS pins as well. So four total control pins (cs, clk, rwds1, rwds2) and 16 data lines.

evanh · 2020-05-16 01:42

rogloh wrote: »

So four total control pins (cs, clk, rwds1, rwds2) and 16 data lines.

Yep. Twenty pins is getting hungry though.

They would be tidier on P32-P47 but I'd want the 16 data lines on P16-P31 so that those pins are not exposed to transients from a connector.

evanh · 2020-05-17 14:14

Von,
I've been mulling over possible ideas for building a flexible high speed test rig for proving out timings. Here's what I've got so far:

One big plus of this setup is it fits alongside the SD card without any interference.

The idea is to have short track lengths for the data pins and long track lengths for CLK, RWDS, and CS. The plan is to have a phase shift of the CLK by about 1.0 ns so that data setup is guaranteed in hardware while minimising attenuation.

RWDS is pushed out 2-3 ns, into next cycle, with software to compensate. A big question mark on attenuation here. This is needed to accommodate the lag from the 220 ohm resistor. The trimmer is there only to experimentally find the fine tune resistance. Likewise for all three trimmers.

CS can be much slower with suitable accommodation in software. The idea is that RWDS can be toggled high quite a lot and still keep CS low. In fact, I think I've got a way to use the streamer with RWDS at full sysclock for short writes. Including all steps of command, address, latency gap and data in an unbroken but short burst.

rogloh · 2020-05-17 15:30

evanh. Not sure I like the loss of knowing exactly when CS returns high due to component tolerances. It's a fair bit of mucking about to save that one pin. It also probably makes polling RWDS troublesome if you ever want to use variable latency.

In fact, I think I've got a way to use the streamer with RWDS at full sysclock for short writes. Including all steps of command, address, latency gap and data in an unbroken but short burst.

It would be interesting if you get that working. My driver can drive out back to back clocks for writes at sysclk/2 and supports byte granular writes. Sysclk/1 writes are not supported (at least yet) in this driver.

VonSzarvas · 2020-05-17 15:54

I'm with rogloh!

I think in systems that require HyperRAM, then it will most likely be the most important "peripheral" of the P2; such that sharing pins would be avoided.

Plenty of other pins that could share function with other things. Sure, that's a sweeping statement with no real examples in mind! But that's my hunch- given how pernickety HR seems to be with timing requirements it just makes sense (for me) to avoid headaches and run those 10 pins as directly, as well matched and as short as possible. And certainly not to add resistance beyond the minimal required to clean up fast edges.

Though I kinda like the cunning way to hardware-set the CLK shift! Although surely P2 could handle that more reliably? This will probably be more obvious to me after I get a chance to experiment with the cap tuning. You guys are well ahead of me on all this.

evanh · 2020-05-18 04:10

Sysclock/1 writes only work with the phase shifted clock. Basically, it requires something to delay or slow the edges with respect to the data. So the tuning is all about matching a resistor or capacitor to the board layout. Change the board and the tuning needs changed to suit the desired 1.0 - 1.5 ns of ideal phase difference.

I figured if I'm going to aim for a more reliable sysclock/1 then why not look at dealing to fitting it to the somewhat unused pin group as well.

And if P48-P57 is not going to be it then P21-P31 has to be it instead. The oscillator pin group is best kept away from any connectors.

rogloh · 2020-05-18 04:26

Definitely give it a try to see what you can achieve @evanh. It just may not be a widely implemented way to do it unless it is really solid and ultimately offers very compelling advantages. I see what you are trying to do to put the HyperRAM up as high as possible but I think boards such as the P2D2 will be using those two pins on 56, 57 for other purposes such as I2C anyway. The P2-EVAL board really can only use the current Hyper module on 0, 16, 32. Any new custom board can still do whatever it wants though.

A small phase shift might be possible with a fast gate delay of some kind or buffered clock. How stable that is relative to a simple capacitive delay of 22pF I'm not sure but hopefully some small device could be found that does the right job. As you say probably just a fixed 1-1.5ns delay would be nice to put on a board and work fairly well up to its rated 166MHz DDR with a 333MHz P2. What active device can give a stable delay of this magnitude? Perhaps a tight tolerance cap is the best?
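To put those numbers in proportion, here is a rough sketch of the arithmetic in Python (not from the thread). It assumes a 333 MHz sysclock with data changing once per sysclock, so one data slot equals one sysclock period; the function names are made up for illustration.

```python
def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def delay_fraction(delay_ns: float, sysclk_mhz: float) -> float:
    """Fraction of one sysclock period taken up by a fixed clock delay."""
    return delay_ns / period_ns(sysclk_mhz)


SYSCLK_MHZ = 333.0  # figure quoted in the post above

slot_ns = period_ns(SYSCLK_MHZ)
for delay in (1.0, 1.5):
    share = delay_fraction(delay, SYSCLK_MHZ)
    print(f"{delay} ns delay is {share:.1%} of a {slot_ns:.3f} ns data slot")
```

Under these assumptions the 1.0 to 1.5 ns window is roughly a third to a half of each data slot, which is why a small drift in the delay element matters.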

rogloh · 2020-05-18 04:34

I've been building up the HyperRAM driver API in SPIN2 and a significant amount ~75% is coded but have now run into an issue with the syntax and it is slowing me down. It doesn't seem like there is a way to do unsigned comparisons in Fastspin.

The SPIN2 language definition allows +> and +>= type of unsigned comparisons but I get an error with this and Fastspin (v 4.1.9). I am hoping to make a single driver that works with both PNUT and Fastspin but perhaps this is not going to be possible without many changes where I check for negative values in lots of places...

PUB mapAddrDevice(addr, bus, memoryType, size, cspin, clkpin, rwdspin, resetpin, burst) | device, pinInfo, i, latency
    ' check for invalid arguments
    if size < SIZE_16MB or size > SIZE_128MB or bus +>= MAX_INSTANCES or memoryType +>= TYPE_LAST
        return ERR_INVALID

/Users/roger/Downloads/flexgui-4.0.3/samples/hyper5.spin2:290: error: syntax error, unexpected '='

Update: Actually I just tried the +> by itself with Fastspin instead of +>= and it doesn't generate an error so perhaps I can subtract one in many places and it may fix the issue...at the possible expense of more runtime overhead depending how the constants are compiled. e.g. do this sort of thing:

PUB setupCog(cog, bus, burst, priority, flags) | f
    ' check for invalid arguments
    if bus +> MAX_INSTANCES-1
        return ERR_INVALID
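The substitution above relies on `x +>= n` and `x +> n-1` agreeing for unsigned 32-bit values. A small Python model of that, as a sketch only: the helper names and the `MAX_INSTANCES` value are hypothetical, and the equivalence assumes `n` is at least 1 so that `n-1` does not wrap around.

```python
MASK32 = 0xFFFFFFFF


def as_u32(x: int) -> int:
    """Reinterpret a Python int as an unsigned 32-bit value, like a P2 long."""
    return x & MASK32


def unsigned_ge(a: int, b: int) -> bool:
    """Models the Spin2 unsigned comparison a +>= b."""
    return as_u32(a) >= as_u32(b)


def unsigned_gt(a: int, b: int) -> bool:
    """Models the Spin2 unsigned comparison a +> b."""
    return as_u32(a) > as_u32(b)


MAX_INSTANCES = 4  # hypothetical value, for illustration only

# Both forms agree for in-range, out-of-range and negative inputs.
for bus in (-1, 0, 3, 4, 0x7FFFFFFF):
    assert unsigned_ge(bus, MAX_INSTANCES) == unsigned_gt(bus, MAX_INSTANCES - 1)

# A negative argument looks like a huge unsigned number, so one test rejects it.
assert unsigned_gt(-1, MAX_INSTANCES - 1)
```

This is also why a single unsigned comparison can replace separate "negative" and "too large" checks.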

evanh · 2020-05-18 04:57

rogloh wrote: »

evanh. Not sure I like the loss of knowing exactly when CS returns high due to component tolerances. It's a fair bit of mucking about to save that one pin. It also probably makes polling RWDS troublesome if you ever want to use variable latency.

At slower read rates checking for RWDS will be fine. Polling isn't ever going to be a fast solution though. IMHO, it's a dead option for the prop2. It's why I'm entertaining the ditching of RWDS entirely.

It'll be less bulky once the resistor values are nailed down and the trimmers can be ditched.

... My driver can drive out back to back clocks for writes at sysclk/2 and supports byte granular writes.

Arbitrary masks in large burst writes? I was envisaging singles only ... for the moment.

rogloh · 2020-05-18 05:01

No, only the start and ending bytes need to be masked for a burst to get byte granular addressing. All other bytes within the burst get fully written. It's only the first word and last word of the burst that need finer RWDS control to achieve this.
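As an illustration of that point, here is a Python sketch (not taken from the driver) that works out which 16-bit words a byte-granular burst touches and whether its first or last word needs a masked byte. It assumes byte address `a` lives in word `a >> 1`; the function name and return layout are hypothetical.

```python
def burst_plan(addr: int, length: int):
    """Return (first_word, word_count, mask_first, mask_last) for a byte burst.

    mask_first / mask_last are True when the first / last 16-bit word of the
    burst holds one byte outside the requested range, which would have to be
    masked rather than written.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    end = addr + length              # one past the last byte written
    first_word = addr >> 1
    last_word = (end - 1) >> 1
    word_count = last_word - first_word + 1
    mask_first = (addr & 1) == 1     # starts on the odd byte of its word
    mask_last = (end & 1) == 1       # stops after the even byte of its word
    return first_word, word_count, mask_first, mask_last


# Bytes 3..6: words 1..3, with byte 2 and byte 7 left untouched.
print(burst_plan(3, 4))
```

Every word between the first and the last is written in full, so the mask decision only ever concerns the two ends.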

evanh · 2020-05-18 05:19

VonSzarvas wrote: »

... given how pernickety HR seems to be with timing requirements it just makes sense (for me) to avoid headaches and run those 10 pins as directly, as well matched and as short as possible. And certainly not to add resistance beyond the minimal required to clean up fast edges.

For data pins, totally. But clock absolutely needs shifted to provide the data setup timing. The easiest way is to soften/lag the clock edges. Part of doing that is make the clock track longer to give it an L-C property. And the resistor is a cheap reliable fine tune. Ideally, the track would be engineered to perform perfectly but that's way beyond my knowledge.

JMG thought it was a good idea to stick with using passive components over trying to select something active that'll always be thermally sensitive.

Though I kinda like the cunning way to hardware-set the CLK shift! Although surely P2 could handle that more reliably?

The prop2 does wonderfully at sysclock/2. The timing is clean because the HRdata and HRclock can transition on alternate sysclocks. This gives clean data setup and hold timings. Just like SPI clock and data.

Problem is at sysclock/1 the data setup time vanishes. The only way the prop2 could possibly have finer timing is to use both polarities of the sysclock. This sort of trick is not provided though ... So an external solution is needed to provide the data setup time.

whicker · 2020-05-18 05:28

evanh wrote: »

JMG thought it was a good idea to stick with using passive components over trying to select something active that'll always be thermally sensitive.

Unfortunately, every solution is thermally sensitive. Capacitor, delay line, logic gate, or even just a long snaking PCB trace.
A long PCB trace actually is quite sensitive to temperature extremes on a cheap FR4 board due to changing dielectric "constant".
But that's the extent of my knowledge.

evanh · 2020-05-18 05:35

I guess that's a reason to under-do the track length and rely somewhat more on the resistor then.

Funnily, I note the Hyperbus V2 spec says only 0.5 ns data setup time is needed. This can likely be satisfied with just an "unregistered" pin for the HRclock. Maybe short tracks all round is desirable.

evanh · 2020-05-18 06:01

rogloh wrote: »

No, only the start and ending bytes need to be masked for a burst to get byte granular addressing. All other bytes within the burst get fully written. It's only the first word and last word of the burst that need finer RWDS control to achieve this.

Oh, I'd not bother with it at all then. Just always require shortword aligned bursts. Both read and write.

rogloh · 2020-05-18 06:08

evanh wrote: »

Oh, I'd not bother with it at all then. Just always require shortword aligned bursts. Both read and write.

Actually it's not always good for 8bpp graphics doing that. Unless you are in 16bpp colour mode, or want all graphics blocks copied only to every second pixel and be multiples of two pixels wide, doing that has ramifications. I've made this driver work with 8bpp graphics so I've enabled byte granular writes for bursts (both for start address and odd byte lengths).

Of course other drivers for non-graphics applications could ignore RWDS though. For a different cache application for example it might be okay to only support 32 bit writes on aligned boundaries.

evanh · 2020-05-18 07:07

Ah, of course, blitting needs it. And read-modify-write is not a friendly thing with these type buses. I don't suppose you've had any ideas on how to perform 4 bits per pixel ops?

evanh · 2020-05-18 07:50

evanh wrote: »

I guess that's a reason to under-do the track length and rely somewhat more on the resistor then.

Grr, that wouldn't help as much as adding a capacitor. I suppose I could use a trim-capacitor for tuning instead of the trim-resistor. Not as sturdy but I guess the clock tuning will be the first thing resolved anyway.

rogloh · 2020-05-18 08:09

evanh wrote: »

Ah, of course, blitting needs it. And read-modify-write is not a friendly thing with these type buses. I don't suppose you've had any ideas on how to perform 4 bits per pixel ops?

Yep. Blitting is not ideal with read/modify/write and HyperRAM. In fact in a worst case implementation, by adding read/modify/write on each end of a burst it could probably slow things down by a factor of 5 in some cases because you then need 5 mailbox transactions instead of 1. In some cases, depending on the burst size it may make sense to read an entire portion in and modify the ends in hub RAM, copy the middle portion from hub to hub and then write the whole lot back to HyperRAM. It could get down to just over a 2x penalty.

In fact for graphics modes < 8bpp this will be the way to go in the immediate term as HyperRAM does not support sub-byte access. If there was space freed in the driver in time it might be possible to add sub byte masking within request lists for individual pixel changes, and then you don't need to interact with the mailbox more than once to trigger the operation but there probably still isn't a huge gain there. HyperRAM is best accessed in bursts for high performance.
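One reading of the 5x and 2x figures above, sketched in Python. This is an interpretation rather than anything stated by the driver: it assumes a driver without RWDS masking, where each partial end costs one read plus one write on top of the main burst, and the function names are hypothetical.

```python
def naive_transactions(mask_first: bool, mask_last: bool) -> int:
    """Mailbox requests when each partial end is read, patched and rewritten."""
    partial_ends = int(mask_first) + int(mask_last)
    return 1 + 2 * partial_ends      # main burst + (read + write) per end


def bulk_transactions() -> int:
    """Mailbox requests when the whole span is read into hub RAM,
    patched there, and written back in one burst."""
    return 2


print(naive_transactions(True, True))   # both ends partial
print(bulk_transactions())
```

The bulk approach still pays for the hub-to-hub copy of the middle portion, which would account for the penalty being "just over" 2x rather than exactly 2x.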

VonSzarvas · 2020-05-18 08:47

evanh wrote: »

Sysclock/1 writes only work with the phase shifted clock. Basically, it requires something to delay or slow the edges with respect to the data. So the tuning is all about matching a resistor or capacitor to the board layout. Change the board and the tuning needs changed to suit the desired 1.0 - 1.5 ns of ideal phase difference.

The penny drops. Thanks evan.

ps. I won't start on this until next week Tuesday now, as another priority stepped in. But I'll start researching over the week. Keeping everything matched and tuning for a ~1ns delay on clk might work out if the min-max range is 0.5-1.5ns. Lots of helpful replies here for everyone that I'll read through more carefully later before getting started.

rogloh · 2020-05-19 06:38

Wow, I just enabled the optimizer in Fastspin 4.1.9 and it saves a lot of space and should also help speed things up a lot in the SPIN2 API for the HyperRAM driver. Until now I was keeping it turned off and sort of watching the driver code start to bloat up towards 14kB of SPIN2 + 4kB PASM2 driver and getting concerned, and wondering how it will compare in size with the interpreted SPIN2 Chip is doing.

I now hope people can properly enable this optimizer to save space with Fastspin.

For example comparing the output for this readByte method primitive:

PUB readByte(addr) | m
    if MAX_INSTANCES == 1         ' optimization for single instance, everything mapped to single bus
       m := mailboxAddrCog[cogid] ' get mailbox base address for this COG
       if m == 0                  ' prevent hang if driver is not running
           return -1 
    else                          ' multiple buses, need to lookup address to find mailbox for bus
       m := addrMap[addr>>24]
       if m +> MAX_INSTANCES-1    ' if address not mapped, exit
          return -1
       m := mailboxAddr[m] + cogid*12  ' compute COG mailbox offset
    long[m] := REQ_READBYTE + (addr & $fffffff) ' generate read request in mailbox
    repeat until long[m] => 0                   ' wait to complete
    return long[m+1]                            ' return result
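The mailbox handshake in readByte can be modelled in a few lines of Python. This is a sketch for illustration only: `REQ_READBYTE` is taken to be bit 31, which is what the `decod ..., #31` in the compiled listings in this post implies, and the helper names are hypothetical.

```python
REQ_READBYTE = 0x8000_0000     # bit 31 set, per the compiled listings
ADDR_MASK = 0x0FFF_FFFF        # $fffffff in the Spin2 source
MAILBOX_BYTES_PER_COG = 12     # cogid*12 in the Spin2 source


def mailbox_address(base: int, cogid: int) -> int:
    """Per-COG mailbox address within one bus's mailbox area."""
    return base + cogid * MAILBOX_BYTES_PER_COG


def read_byte_request(addr: int) -> int:
    """Request long written to the mailbox to start a byte read."""
    return (REQ_READBYTE + (addr & ADDR_MASK)) & 0xFFFF_FFFF


def to_signed32(word: int) -> int:
    """Interpret a 32-bit value as a signed long."""
    word &= 0xFFFF_FFFF
    return word - (1 << 32) if word & 0x8000_0000 else word


def request_pending(word: int) -> bool:
    """The client keeps polling while the mailbox long is negative."""
    return to_signed32(word) < 0


req = read_byte_request(0x1234_5678)
print(hex(req), request_pending(req))
```

Because the request bit is the sign bit, the "wait to complete" loop reduces to a signed compare against zero, which is what the `cmps` / `if_b` pair in the listing does.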

The optimised (default level) Fastspin compiled code is this (24 longs):

00554                 | _readbyte
00554     03 66 04 F6 |     mov COUNT_, #3
00558     35 00 C0 FD |     calla   #pushregs_
0055c                 | '     if MAX_INSTANCES == 1         ' optimization for single instance, everything mapped to single bus
0055c                 | '        m := mailboxAddrCog[cogid] ' get mailbox base address for this COG
0055c     01 A2 60 FD |     cogid   result1
00560     02 A2 64 F0 |     shl result1, #2
00564     00 9F 04 F1 |     add ptr__dat__, #256
00568     4F A2 00 F1 |     add result1, ptr__dat__
0056c     51 AC 08 FB |     rdlong  local01, result1 wz
00570                 | '        if m == 0                  ' prevent hang if driver is not running
00570                 | '            return -1 
00570     00 9F 84 F1 |     sub ptr__dat__, #256
00574     01 A2 64 A6 |  if_e   neg result1, #1
00578     2C 00 90 AD |  if_e   jmp #LR__0002
0057c                 | '     else                          ' multiple buses, need to lookup address to find mailbox for bus
0057c                 | '        m := addrMap[addr>>24]
0057c                 | '        if m +> MAX_INSTANCES-1    ' if address not mapped, exit
0057c                 | '           return -1
0057c                 | '        m := mailboxAddr[m] + cogid*12  ' compute COG mailbox offset
0057c                 | '     long[m] := REQ_READBYTE + (addr & $fffffff) ' generate read request in mailbox
0057c     1F AE C4 F9 |     decod   local02, #31
00580     52 B0 00 F6 |     mov local03, arg01
00584     FF FF 07 FF
00588     FF B1 04 F5 |     and local03, ##268435455
0058c     58 AE 00 F1 |     add local02, local03
00590     56 AE 60 FC |     wrlong  local02, local01
00594                 | '     repeat until long[m] => 0                   ' wait to complete
00594                 | LR__0001
00594     56 AE 00 FB |     rdlong  local02, local01
00598     00 AE 5C F2 |     cmps    local02, #0 wcz
0059c     F4 FF 9F CD |  if_b   jmp #LR__0001
005a0                 | '     return long[m+1]                            ' return result
005a0     01 AC 04 F1 |     add local01, #1
005a4     56 A2 00 FB |     rdlong  result1, local01
005a8                 | LR__0002
005a8     4D F0 03 F6 |     mov ptra, fp
005ac     42 00 C0 FD |     calla   #popregs_
005b0                 | _readbyte_ret
005b0     2E 00 64 FD |     reta

while the unoptimized code bloats rapidly and looks like this (66 longs!) :

01984                 | ' PUB readByte(addr) | m
01984                 | _readbyte
01984     07 66 04 F6 |     mov COUNT_, #7
01988     35 00 C0 FD |     calla   #pushregs_
0198c     8A 2A 01 F6 |     mov local01, arg01
01990                 | '     if MAX_INSTANCES == 1         ' optimization for single instance, everything mapped to single bus
01990                 | '        m := mailboxAddrCog[cogid] ' get mailbox base address for this COG
01990     4C 1D D0 FD |     calla   #__system__cogid
01994     72 2C 01 F6 |     mov local02, result1
01998     96 2E 01 F6 |     mov local03, local02
0199c     02 2E 65 F0 |     shl local03, #2
019a0     00 DF 04 F1 |     add ptr__dat__, #256
019a4     6F 30 01 F6 |     mov local04, ptr__dat__
019a8     00 DF 84 F1 |     sub ptr__dat__, #256
019ac     98 2E 01 F1 |     add local03, local04
019b0     97 32 01 FB |     rdlong  local05, local03
019b4                 | '        if m == 0                  ' prevent hang if driver is not running
019b4     00 32 0D F2 |     cmp local05, #0 wz
019b8     0C 00 90 5D |  if_ne  jmp #LR__0097
019bc                 | '            return -1 
019bc     FF FF 7F FF
019c0     FF E5 04 F6 |     mov result1, ##-1
019c4     B8 00 90 FD |     jmp #LR__0102
019c8                 | LR__0097
019c8                 | '     else                          ' multiple buses, need to lookup address to find mailbox for bus
019c8     78 00 90 FD |     jmp #LR__0099
019cc                 | '        m := addrMap[addr>>24]
019cc     95 2C 01 F6 |     mov local02, local01
019d0     18 2C 45 F0 |     shr local02, #24
019d4     96 2E 01 F6 |     mov local03, local02
019d8     01 00 00 FF
019dc     60 DE 04 F1 |     add ptr__dat__, ##608
019e0     6F 30 01 F6 |     mov local04, ptr__dat__
019e4     01 00 00 FF
019e8     60 DE 84 F1 |     sub ptr__dat__, ##608
019ec     98 2E 01 F1 |     add local03, local04
019f0     97 32 C1 FA |     rdbyte  local05, local03
019f4                 | '        if m +> MAX_INSTANCES-1    ' if address not mapped, exit
019f4     00 32 1D F2 |     cmp local05, #0 wcz
019f8     0C 00 90 ED |  if_be  jmp #LR__0098
019fc                 | '           return -1
019fc     FF FF 7F FF
01a00     FF E5 04 F6 |     mov result1, ##-1
01a04     78 00 90 FD |     jmp #LR__0102
01a08                 | LR__0098
01a08                 | '        m := mailboxAddr[m] + cogid*12  ' compute COG mailbox offset
01a08     99 2E 01 F6 |     mov local03, local05
01a0c     02 2E 65 F0 |     shl local03, #2
01a10     4C DF 04 F1 |     add ptr__dat__, #332
01a14     6F 30 01 F6 |     mov local04, ptr__dat__
01a18     4C DF 84 F1 |     sub ptr__dat__, #332
01a1c     98 2E 01 F1 |     add local03, local04
01a20     BC 1C D0 FD |     calla   #__system__cogid
01a24     72 34 01 F6 |     mov local06, result1
01a28     9A 36 01 F6 |     mov local07, local06
01a2c     01 36 65 F0 |     shl local07, #1
01a30     9A 36 01 F1 |     add local07, local06
01a34     02 36 65 F0 |     shl local07, #2
01a38     97 2C 01 FB |     rdlong  local02, local03
01a3c     9B 2C 01 F1 |     add local02, local07
01a40     96 32 01 F6 |     mov local05, local02
01a44                 | LR__0099
01a44                 | '     long[m] := REQ_READBYTE + (addr & $fffffff) ' generate read request in mailbox
01a44     00 00 40 FF
01a48     00 2C 05 F6 |     mov local02, ##-2147483648
01a4c     95 2E 01 F6 |     mov local03, local01
01a50     FF FF 07 FF
01a54     FF 2F 05 F5 |     and local03, ##268435455
01a58     97 2C 01 F1 |     add local02, local03
01a5c     99 2C 61 FC |     wrlong  local02, local05
01a60                 | '     repeat until long[m] => 0                   ' wait to complete
01a60                 | LR__0100
01a60     99 2C 01 FB |     rdlong  local02, local05
01a64     00 2C 5D F2 |     cmps    local02, #0 wcz
01a68     04 00 90 3D |  if_ae  jmp #LR__0101
01a6c     F0 FF 9F FD |     jmp #LR__0100
01a70                 | LR__0101
01a70                 | '     return long[m+1]                            ' return result
01a70     99 2C 01 F6 |     mov local02, local05
01a74     01 2C 05 F1 |     add local02, #1
01a78     96 E4 00 FB |     rdlong  result1, local02
01a7c     00 00 90 FD |     jmp #LR__0102
01a80                 | LR__0102
01a80     6B F0 03 F6 |     mov ptra, fp
01a84     42 00 C0 FD |     calla   #popregs_
01a88                 | _readbyte_ret
01a88     2E 00 64 FD |     reta

evanh · 2020-05-19 07:23

Eric knows his compiler tools for sure, marks him as a comp-sci graduate.

rogloh · 2020-05-19 07:43

Yep for sure.
For a minimal app, referencing all current HyperRAM driver functions (to prevent method removal) and including the PASM2 driver which is ~3800 bytes or so, I get these build sizes (which include a 1kB hub overhead plus Fastspin's own stuff):

No Optimization :     20672 bytes
Default Optimization: 14688 bytes
Full Optimization :   14720 bytes
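For scale, the savings implied by those three figures can be worked out directly. A small Python sketch using only the numbers listed above; the helper name is made up.

```python
SIZES = {"none": 20672, "default": 14688, "full": 14720}  # bytes, from the list above


def saving(baseline: int, optimised: int):
    """Return (bytes saved, fraction of the baseline saved)."""
    saved = baseline - optimised
    return saved, saved / baseline


for level in ("default", "full"):
    saved, frac = saving(SIZES["none"], SIZES[level])
    print(f"{level}: {saved} bytes saved ({frac:.1%})")
```

With these figures the default level already captures essentially all of the gain, and the full level comes out 32 bytes larger than the default.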

For those wanting to interact directly with the HyperRAM driver mailbox (eg. from a PASM2 COG), it will free a lot more space as you don't need the extra SPIN2 layer API, which while very helpful to use is not mandatory. You'll have to understand the setup parameters and mailbox format.

Eg. upon driver start you just pass in a pointer to 8 long parameters which define the devices and COG parameters etc. I will also document the format of items accordingly.

' setup driver COG startup parameters
params[0]:= freq
params[1]:= @cogList[bus*NUMCOGS]
params[2]:= flags
params[3]:= busBasePin[bus]
params[4]:= @devices[bus*32] 'per bank settings
params[5]:= maskA[bus] 'port A (lower 32 pins) reset mask
params[6]:= maskB[bus] 'port B (upper 32 pins) reset mask
params[7]:= mailboxAddr[bus] 'mailbox address for the driver
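To visualise that parameter block, here is a Python sketch that packs eight longs the way they would sit in hub RAM (little-endian, 32 bytes in total). The field names are labels invented here to mirror the comments above, and all the values are placeholders, not real driver settings.

```python
import struct

# Hypothetical labels for params[0]..params[7], in order.
PARAM_NAMES = (
    "freq", "cog_list_ptr", "flags", "base_pin",
    "devices_ptr", "mask_a", "mask_b", "mailbox_ptr",
)


def pack_params(**values: int) -> bytes:
    """Pack the eight startup longs as 32 little-endian bytes."""
    missing = [name for name in PARAM_NAMES if name not in values]
    if missing:
        raise ValueError(f"missing parameters: {missing}")
    longs = [values[name] & 0xFFFFFFFF for name in PARAM_NAMES]
    return struct.pack("<8I", *longs)


# Placeholder values only.
blob = pack_params(
    freq=200_000_000, cog_list_ptr=0x1000, flags=0, base_pin=32,
    devices_ptr=0x1100, mask_a=0, mask_b=0x0000_0800, mailbox_ptr=0x1200,
)
print(len(blob), struct.unpack_from("<I", blob, 12)[0])
```

A PASM2 client that bypasses the SPIN2 layer would need to build an equivalent block itself, once the format of each field is documented.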

rogloh · 2020-05-20 03:31

I was able to compile my new HyperRAM driver codebase in PNut v34s running on VirtualBox. Still not tested, just compiling without errors.
The size difference vs Fastspin is interesting. Looks like the SPIN2 driver object is currently about 8kB including the 3600 byte PASM code. This probably compares to just over 13kB in Fastspin with optimisation enabled. Though the Fastspin version should still be somewhat faster to run of course. By how much, I'm keen to find out at some point.

I needed to change a few things before it compiled and this is what I learned (I'm sure it has been discussed before, but this is the first time I've ever run PNut so I'm learning the hard way when porting the driver code to be hopefully runnable using both environments):

- PNUT needs that return parameter to compile without errors if you want to return something, Fastspin doesn't need it.

PUB getHyperDriver() : r
    return @hyper_driver
vs
PUB getHyperDriver()  ' Fastspin allows this syntax and can still return a value
    return @hyper_driver

- PNUT needs cogid to be returned via function cogid() while Fastspin allows just cogid to be used

- Fastspin allows # but PNut now always needs a dot. Eg:

driver#REQ_READBYTE  ' Fastspin allows this
vs
driver.REQ_READBYTE

- There is no cognew function in PNut to spawn PASM COGs; you need to use coginit with 16 as the argument to start a new COG.

driverCog := cognew(addr, @params)
vs
driverCog := coginit(16, addr, @params)

- SPIN2 method parameters can't use the same name as labels do in the PASM2 code in PNut.

- PNut requires any no-argument SPIN2 methods to be defined and called with ()

- Finally there was a problem with greater than and equal to order
PNut needs this:

repeat until long[m] >= 0

while (perhaps an older) Fastspin needed this to work correctly:

repeat until long[m] => 0

Hopefully a newer Fastspin should fix this.
Update: looks like Fastspin 4.1.9 is doing what I want now and can use the Pnut syntax...this should work according to the listing output.
00730                 | '     repeat until long[m] >= 0
00730                 | LR__0001
00730     81 CC 01 FB |     rdlong  dump_tmp001_, _dump_m
00734     00 CC 5D F2 |     cmps    dump_tmp001_, #0 wcz
00738     F4 FF 9F CD |  if_b   jmp #LR__0001

evanh · 2020-05-20 03:58

The interpreter footprint isn't free. They're almost the same size with that included.

rogloh · 2020-05-20 04:18

Yeah that is an interesting observation, and currently they are comparable in total size, though that Spin2 interpreter is a common overhead that other code can also use (I hope!) so as more client application code is added these example images will probably start to diverge further in code space consumed. I guess I was interested more in the HyperRAM driver sizes with this particular comparison.

Main thing is this driver is not a total hog and should be fully usable in both environments. Any unused method removal by the tools can help further too. It will consume far less memory than it enables!
