EDIT: That said, on the prop2, I doubt it'll get used for more than a single byte at a time. So there's no point needing it to perform electrically when it's faster to just bit-bash the one byte and leave the performance for complete bursts without RWDS.
On that note, I can see it being feasible to make the custom board layout, with the integrated HyperRAM using pins P48 to P58. RWDS on P58, sharing with the SD/EEPROM pins.
EDIT: Grr, bad idea. SD card needs to be able to operate concurrently. Particularly since the HR will likely be busy all the time.
But the data would really fly, still sharing the same clock and cs pins but of course being in fixed latency mode.
If byte & word granularity was sacrificed and you always had to read/write 32 bits on both chips it may not be much of a change to the driver - it's mainly just some clock counter scaling and slightly different streamer commands, and also giving up the RWDS pin control. So another driver variant of what I have could probably be developed with this wider capability in time.
Supporting individual byte transfers would complicate it a lot more with 2 RWDS lines and requires separate CS lines too. Don't even want to think about that right now
Shouldn't need two CS lines. Actually, maybe RWDS and CS, rather than CLK and CS, can be combined somehow ... using resistors and capacitors ... neither, as used with the prop2, is going to need the performance of the clock and data lines.
I say "with the prop2" because the mask byte writes are only usefully a single byte at a time. In other words, random writes only. It's not going to be practical to build a mask map amongst the 8-bit data for burst writes.
The presumption is random writes will be bit-bashed at a slower toggle rate than consecutive bursts.
I guess with two RWDS lines you could combine the CS signals and just fully mask the chip you don't want to write to with its own RWDS signal. Somewhat less messy to figure out byte writes than using different CS pins as well. So four total control pins (cs, clk, rwds1, rwds2) and 16 data lines.
They would be tidier on P32-P47 but I'd want the 16 data lines on P16-P31 so that those pins are not exposed to transients from a connector.
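Just to make that pin budget concrete, a purely illustrative pin map for the dual-chip scheme might look like the following sketch; only the P16-P31 data placement comes from the discussion above, the control pin numbers are hypothetical placeholders:

CON
  HR_DATA_BASE = 16     ' D0..D15 on P16..P31, 8 bits to each HyperRAM chip
  HR_CS        = 32     ' shared chip select (hypothetical placement)
  HR_CLK       = 33     ' shared clock (hypothetical placement)
  HR_RWDS1     = 34     ' RWDS for chip 1: per-byte write masking / latency indicator
  HR_RWDS2     = 35     ' RWDS for chip 2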
Von,
I've been mulling over possible ideas for building a flexible high speed test rig for proving out timings. Here's what I've got so far:
One big plus of this setup is it fits alongside the SD card without any interference.
The idea is to have short track lengths for the data pins and long track lengths for CLK, RWDS, and CS. The plan is to have a phase shift of the CLK by about 1.0 ns so that data setup is guaranteed in hardware while minimising attenuation.
RWDS is pushed out 2-3 ns, into the next cycle, with software to compensate. A big question mark on attenuation here. This is needed to accommodate the lag from the 220 ohm resistor. The trimmer is there only to experimentally find the fine-tune resistance. Likewise for all three trimmers.
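As a rough sanity check on those numbers (assuming the RWDS node presents something like 10 pF of pin plus track capacitance to the series resistor, an assumed figure rather than a measurement): τ = R·C ≈ 220 Ω × 10 pF ≈ 2.2 ns, which sits inside the 2-3 ns lag described above, with the trimmer left to soak up whatever the real capacitance turns out to be.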
CS can be much slower with suitable accommodation in software. The idea is that RWDS can be toggled high quite a lot and still keep CS low. In fact, I think I've got a way to use the streamer with RWDS at full sysclock for short writes. Including all steps of command, address, latency gap and data in an unbroken but short burst.
evanh. Not sure I like the loss of knowing exactly when CS returns high due to component tolerances. It's a fair bit of mucking about to save that one pin. It also probably makes polling RWDS troublesome if you ever want to use variable latency.
In fact, I think I've got a way to use the streamer with RWDS at full sysclock for short writes. Including all steps of command, address, latency gap and data in an unbroken but short burst.
It would be interesting if you get that working. My driver can drive out back to back clocks for writes at sysclk/2 and supports byte granular writes. Sysclk/1 writes are not supported (at least yet) in this driver.
I think in systems that require HyperRAM, then it will most likely be the most important "peripheral" of the P2; such that sharing pins would be avoided.
Plenty of other pins that could share function with other things. Sure, that's a sweeping statement with no real examples in mind! But that's my hunch- given how pernickety HR seems to be with timing requirements it just makes sense (for me) to avoid headaches and run those 10 pins as directly, as well matched and as short as possible. And certainly not to add resistance beyond the minimal required to clean up fast edges.
Though I kinda like the cunning way to hardware-set the CLK shift! Although surely P2 could handle that more reliably? This will probably be more obvious to me after I get a chance to experiment with the cap tuning. You guys are well ahead of me on all this.
Sysclock/1 writes only work with the phase-shifted clock. Basically, it requires something to delay or slow the edges with respect to the data. So the tuning is all about matching a resistor or capacitor to the board layout. Change the board and the tuning needs changing to suit the desired 1.0 - 1.5 ns of ideal phase difference.
I figured if I'm going to aim for a more reliable sysclock/1 then why not look at fitting it to the somewhat unused pin group as well.
And if P48-P57 is not going to be it then P21-P31 has to be it instead. The oscillator pin group is best kept away from any connectors.
Definitely give it a try to see what you can achieve @evanh . It just may not be a widely implemented way to do it unless it is really solid and ultimately offers very compelling advantages. I see what you are trying to do to put the HyperRAM up as high as possible but I think boards such as the P2D2 will be using those two pins on 56, 57 for other purposes such as I2C anyway. The P2-EVAL board really can only use the current Hyper module on 0, 16, 32. Any new custom board can still do whatever it wants though.
A small phase shift might be possible with a fast gate delay of some kind or buffered clock. How stable that is relative to a simple capacitive delay of 22pF I'm not sure but hopefully some small device could be found that does the right job. As you say probably just a fixed 1-1.5ns delay would be nice to put on a board and work fairly well up to its rated 166MHz DDR with a 333MHz P2. What active device can give a stable delay of this magnitude? Perhaps a tight tolerance cap is the best?
I've been building up the HyperRAM driver API in SPIN2 and a significant amount (~75%) is coded, but I have now run into an issue with the syntax and it is slowing me down. It doesn't seem like there is a way to do unsigned comparisons in Fastspin.
The SPIN2 language definition allows +> and +>= type of unsigned comparisons but I get an error with this and Fastspin (v 4.1.9). I am hoping to make a single driver that works with both PNUT and Fastspin but perhaps this is not going to be possible without many changes where I check for negative values in lots of places...
PUB mapAddrDevice(addr, bus, memoryType, size, cspin, clkpin, rwdspin, resetpin, burst) | device, pinInfo, i, latency
  ' check for invalid arguments
  if size < SIZE_16MB or size > SIZE_128MB or bus +>= MAX_INSTANCES or memoryType +>= TYPE_LAST
    return ERR_INVALID
Update: Actually I just tried the +> by itself with Fastspin instead of +>= and it doesn't generate an error, so perhaps I can subtract one in many places and it may fix the issue...at the possible expense of more runtime overhead depending on how the constants are compiled. e.g. do this sort of thing:
PUB setupCog(cog, bus, burst, priority, flags) | f
  ' check for invalid arguments
  if bus +> MAX_INSTANCES-1
    return ERR_INVALID
evanh. Not sure I like the loss of knowing exactly when CS returns high due to component tolerances. It's a fair bit of mucking about to save that one pin. It also probably makes polling RWDS troublesome if you ever want to use variable latency.
At slower read rates checking for RWDS will be fine. Polling isn't ever going to be a fast solution though. IMHO, it's a dead option for the prop2. It's why I'm entertaining the ditching of RWDS entirely.
It'll be less bulky once the resistor values are nailed down and the trimmers can be ditched.
... My driver can drive out back to back clocks for writes at sysclk/2 and supports byte granular writes.
Arbitrary masks in large burst writes? I was envisaging singles only ... for the moment.
No, only the start and ending bytes need to be masked for a burst to get byte granular addressing. All other bytes within the burst get fully written. It's only the first word and last word of the burst that need finer RWDS control to achieve this.
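Purely as a sketch of that end-of-burst decision (a fragment only, with made-up variable names, and ignoring which byte lane maps to which RWDS half-period):

' does a byte-granular burst of <count> bytes starting at byte address <addr>
' need RWDS masking on its first and/or last 16-bit word?
maskHead := addr & 1              ' odd start address: one byte of the first word must be masked
maskTail := (addr + count) & 1    ' odd end address: one byte of the last word must be masked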
... given how pernickety HR seems to be with timing requirements it just makes sense (for me) to avoid headaches and run those 10 pins as directly, as well matched and as short as possible. And certainly not to add resistance beyond the minimal required to clean up fast edges.
For data pins, totally. But the clock absolutely needs shifting to provide the data setup timing. The easiest way is to soften/lag the clock edges. Part of doing that is making the clock track longer to give it an L-C property. And the resistor is a cheap, reliable fine tune. Ideally, the track would be engineered to perform perfectly but that's way beyond my knowledge.
JMG thought it was a good idea to stick with using passive components over trying to select something active that'll always be thermally sensitive.
Though I kinda like the cunning way to hardware-set the CLK shift! Although surely P2 could handle that more reliably?
The prop2 does wonderfully at sysclock/2. The timing is clean because the HRdata and HRclock can transition on alternate sysclocks. This gives clean data setup and hold timings. Just like SPI clock and data.
Problem is at sysclock/1 the data setup time vanishes. The only way the prop2 could possibly have finer timing is to use both polarities of the sysclock. This sort of trick is not provided though ... So an external solution is needed to provide the data setup time.
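To put numbers on that (taking a 300 MHz sysclock purely as an example): at sysclk/2 the HRdata and HRclock edges land on alternate sysclocks, so data leads the clock edge by one sysclock period, roughly 3.3 ns of setup. At sysclk/1 both edges come out of the same sysclock, so that margin collapses to roughly zero and the external 1.0-1.5 ns shift has to supply the setup time instead.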
JMG thought it was a good idea to stick with using passive components over trying to select something active that'll always be thermally sensitive.
Unfortunately, every solution is thermally sensitive. Capacitor, delay line, logic gate, or even just a long snaking PCB trace.
A long PCB trace actually is quite sensitive to temperature extremes on a cheap FR4 board due to changing dielectric "constant".
But that's the extent of my knowledge.
I guess that's a reason to under-do the track length and rely somewhat more on the resistor then.
Funnily, I note the Hyperbus V2 spec says only 0.5 ns data setup time is needed. This can likely be satisfied with just an "unregistered" pin for the HRclock. Maybe short tracks all round is desirable.
No only the start and ending bytes need to be masked for a burst to get byte granular addressing. All other bytes within the burst get fully written. It's only the first word and last word of the burst that need finer RWDS control to achieve this.
Oh, I'd not bother with it at all then. Just always require shortword aligned bursts. Both read and write.
Oh, I'd not bother with it at all then. Just always require shortword aligned bursts. Both read and write.
Actually, doing that isn't always good for 8bpp graphics. Unless you are in 16bpp colour mode, or are happy for all graphics blocks to be copied only to every second pixel and be multiples of two pixels wide, it has ramifications. I've made this driver work with 8bpp graphics so I've enabled byte granular writes for bursts (both for start address and odd byte lengths).
Of course other drivers for non-graphics applications could ignore RWDS though. For a different cache application for example it might be okay to only support 32 bit writes on aligned boundaries.
Ah, of course, blitting needs it. And read-modify-write is not a friendly thing with this type of bus. I don't suppose you've had any ideas on how to perform 4 bits per pixel ops?
I guess that's a reason to under-do the track length and rely somewhat more on the resistor then.
Grr, that wouldn't help as much as adding a capacitor. I suppose I could use a trim-capacitor for tuning instead of the trim-resistor. Not as sturdy but I guess the clock tuning will be the first thing resolved anyway.
Ah, of course, blitting needs it. And read-modify-write is not a friendly thing with these type buses. I don't suppose you've had any ideas on how to perform 4 bits per pixel ops?
Yep. Blitting is not ideal with read/modify/write and HyperRAM. In fact in a worst case implementation, by adding read/modify/write on each end of a burst it could probably slow things down by a factor of 5 in some cases because you then need 5 mailbox transactions instead of 1. In some cases, depending on the burst size it may make sense to read an entire portion in and modify the ends in hub RAM, copy the middle portion from hub to hub and then write the whole lot back to HyperRAM. It could get down to just over a 2x penalty.
In fact for graphics modes < 8bpp this will be the way to go in the immediate term as HyperRAM does not support sub-byte access. If there was space freed in the driver in time it might be possible to add sub byte masking within request lists for individual pixel changes, and then you don't need to interact with the mailbox more than once to trigger the operation but there probably still isn't a huge gain there. HyperRAM is best accessed in bursts for high performance.
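As a minimal sketch of that read-modify-write idea for a single 4bpp pixel: the writeByte method name and the nibble packing order are assumptions here, only readByte appears earlier in the thread.

PUB putPixel4bpp(base, x, y, width, colour) | a, b
  a := base + ((y * width + x) >> 1)        ' byte address holding this 4bpp pixel
  b := readByte(a)                          ' read the existing byte from HyperRAM
  if x & 1
    b := (b & $0F) | ((colour & $0F) << 4)  ' odd pixel assumed in the high nibble
  else
    b := (b & $F0) | (colour & $0F)         ' even pixel assumed in the low nibble
  writeByte(a, b)                           ' write the modified byte back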
Sysclock/1 writes only works with the phase shifted clock. Basically, it requires something to delay or slow the edges with respect to the data. So the tuning is all about matching a resistor or capacitor to the board layout. Change the board and the tuning needs changed to suit the desired 1.0 - 1.5 ns of idea phase difference.
The penny drops. Thanks evan.
ps. I won't start on this until next week Tuesday now, as another priority stepped in. But I'll start researching over the week. Keeping everything matched and tuning for a ~1ns delay on clk might work out if the min-max range is 0.5-1.5ns. Lots of helpful replies here for everyone that I'll read through more carefully later before getting started.
Wow, I just enabled the optimizer in Fastspin 4.1.9 and it saves a lot of space and should also help speed things up a lot in the SPIN2 API for the HyperRAM driver. Until now I was keeping it turned off, sort of watching the driver code start to bloat up towards 14kB of SPIN2 + 4kB PASM2 driver, getting concerned, and wondering how it will compare in size with the interpreted SPIN2 that Chip is doing.
I now hope people can properly enable this optimizer to save space with Fastspin.
For example comparing the output for this readByte method primitive:
PUB readByte(addr) | m
  if MAX_INSTANCES == 1                            ' optimization for single instance, everything mapped to single bus
    m := mailboxAddrCog[cogid]                     ' get mailbox base address for this COG
    if m == 0                                      ' prevent hang if driver is not running
      return -1
  else                                             ' multiple buses, need to lookup address to find mailbox for bus
    m := addrMap[addr>>24]
    if m +> MAX_INSTANCES-1                        ' if address not mapped, exit
      return -1
    m := mailboxAddr[m] + cogid*12                 ' compute COG mailbox offset
  long[m] := REQ_READBYTE + (addr & $fffffff)      ' generate read request in mailbox
  repeat until long[m] => 0                        ' wait to complete
  return long[m+1]                                 ' return result
The optimised (default level) Fastspin compiled code is this (24 longs):
00554 | _readbyte
00554 03 66 04 F6 | mov COUNT_, #3
00558 35 00 C0 FD | calla #pushregs_
0055c | ' if MAX_INSTANCES == 1 ' optimization for single instance, everything mapped to single bus
0055c | ' m := mailboxAddrCog[cogid] ' get mailbox base address for this COG
0055c 01 A2 60 FD | cogid result1
00560 02 A2 64 F0 | shl result1, #2
00564 00 9F 04 F1 | add ptr__dat__, #256
00568 4F A2 00 F1 | add result1, ptr__dat__
0056c 51 AC 08 FB | rdlong local01, result1 wz
00570 | ' if m == 0 ' prevent hang if driver is not running
00570 | ' return -1
00570 00 9F 84 F1 | sub ptr__dat__, #256
00574 01 A2 64 A6 | if_e neg result1, #1
00578 2C 00 90 AD | if_e jmp #LR__0002
0057c | ' else ' multiple buses, need to lookup address to find mailbox for bus
0057c | ' m := addrMap[addr>>24]
0057c | ' if m +> MAX_INSTANCES-1 ' if address not mapped, exit
0057c | ' return -1
0057c | ' m := mailboxAddr[m] + cogid*12 ' compute COG mailbox offset
0057c | ' long[m] := REQ_READBYTE + (addr & $fffffff) ' generate read request in mailbox
0057c 1F AE C4 F9 | decod local02, #31
00580 52 B0 00 F6 | mov local03, arg01
00584 FF FF 07 FF
00588 FF B1 04 F5 | and local03, ##268435455
0058c 58 AE 00 F1 | add local02, local03
00590 56 AE 60 FC | wrlong local02, local01
00594 | ' repeat until long[m] => 0 ' wait to complete
00594 | LR__0001
00594 56 AE 00 FB | rdlong local02, local01
00598 00 AE 5C F2 | cmps local02, #0 wcz
0059c F4 FF 9F CD | if_b jmp #LR__0001
005a0 | ' return long[m+1] ' return result
005a0 01 AC 04 F1 | add local01, #1
005a4 56 A2 00 FB | rdlong result1, local01
005a8 | LR__0002
005a8 4D F0 03 F6 | mov ptra, fp
005ac 42 00 C0 FD | calla #popregs_
005b0 | _readbyte_ret
005b0 2E 00 64 FD | reta
while the unoptimized code bloats rapidly and looks like this (66 longs!):
Yep for sure.
For a minimal app, referencing all current HyperRAM driver functions (to prevent method removal) and including the PASM2 driver which is ~3800 bytes or so, I get these build sizes (which include a 1kB hub overhead plus Fastspin's own stuff):
No Optimization : 20672 bytes
Default Optimization: 14688 bytes
Full Optimization : 14720 bytes
For those wanting to interact directly with the HyperRAM driver mailbox (eg. from a PASM2 COG), it will free a lot more space as you don't need the extra SPIN2 layer API, which while very helpful to use is not mandatory. You'll have to understand the setup parameters and mailbox format.
Eg. upon driver start you just pass in a pointer to 8 long parameters which define the devices and COG parameters etc. I will also document the format of items accordingly.
' setup driver COG startup parameters
params[0]:= freq
params[1]:= @cogList[bus*NUMCOGS]
params[2]:= flags
params[3]:= busBasePin[bus]
params[4]:= @devices[bus*32] 'per bank settings
params[5]:= maskA[bus] 'port A (lower 32 pins) reset mask
params[6]:= maskB[bus] 'port B (upper 32 pins) reset mask
params[7]:= mailboxAddr[bus] 'mailbox address for the driver
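For what it's worth, here is a minimal sketch of what a raw mailbox interaction might look like from SPIN2, based only on the layout readByte uses above (12 bytes per COG, request long goes negative while busy); the method name and parameters are made up, and the result is assumed to sit in the long after the request word:

PUB rawReadByte(mailboxBase, addr) : r | m
  m := mailboxBase + cogid() * 12                ' this COG's 12-byte mailbox slot
  long[m] := REQ_READBYTE + (addr & $fffffff)    ' post the read request (negative while busy)
  repeat until long[m] >= 0                      ' wait for the driver to flag completion
  r := long[m][1]                                ' result assumed in the long following the request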
I was able to compile my new HyperRAM driver codebase in PNut v34s running on VirtualBox. Still not tested, just compiling without errors.
The size difference vs Fastspin is interesting. Looks like the SPIN2 driver object is currently about 8kB including the 3600 byte PASM code. This probably compares to just over 13kB in Fastspin with optimisation enabled. Though the Fastspin version should still be somewhat faster to run of course. By how much, I'm keen to find out at some point.
I needed to change a few things before it compiled and this is what I learned (I'm sure it has been discussed before, but this is the first time I've ever run PNut so I'm learning the hard way when porting the driver code to be hopefully runnable using both environments):
- PNut needs a return parameter declared to compile without errors if you want to return something; Fastspin doesn't need it.
PUB getHyperDriver() : r
  return @hyper_driver
vs
PUB getHyperDriver() ' Fastspin allows this syntax and can still return a value
  return @hyper_driver
- PNut needs cogid to be called as the function cogid() while Fastspin allows just cogid to be used
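For example (id here is just an arbitrary local):
id := cogid()    ' PNut needs the function-call form
vs
id := cogid      ' Fastspin also accepts the bare keyword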
- Fastspin allows # but PNut now always needs a dot. Eg:
driver#REQ_READBYTE ' Fastspin allows this
vs
driver.REQ_READBYTE
- There is no cognew function in PNut to spawn PASM COGs; you need to use coginit with 16 as the argument to start a new COG.
driverCog := cognew(addr, @params)
vs
driverCog := coginit(16, addr, @params)
- SPIN2 method parameters can't use the same names as labels in the PASM2 code in PNut.
- PNut requires any no-argument SPIN2 methods to be defined and called with ()
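E.g. (hypothetical method name):
PUB flushBuffers()    ' definition needs the ()
and at the call site:
flushBuffers()        ' call needs the () as well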
- Finally, there was a problem with the ordering of the greater-than and equals characters
PNut needs this:
repeat until long[m] >= 0
while (perhaps an older) Fastspin needed this to work correctly:
repeat until long[m] => 0
Hopefully a newer Fastspin should fix this.
Update: looks like Fastspin 4.1.9 is doing what I want now and can use the PNut syntax...this should work according to the listing output.
00730 | ' repeat until long[m] >= 0
00730 | LR__0001
00730 81 CC 01 FB | rdlong dump_tmp001_, _dump_m
00734 00 CC 5D F2 | cmps dump_tmp001_, #0 wcz
00738 F4 FF 9F CD | if_b jmp #LR__0001
Yeah that is an interesting observation, and currently they are comparable in total size, though that Spin2 interpreter is a common overhead that other code can also use (I hope!) so as more client application code is added these example images will probably start to diverge further in code space consumed. I guess I was interested more in the HyperRAM driver sizes with this particular comparison.
Main thing is this driver is not a total hog and should be fully usable in both environments. Any unused method removal by the tools can help further too. It will consume far less memory than it enables!