EDIT: That said, on the prop2, I doubt it'll get used for more than a single byte at a time. So pointless needing it to electrically perform when it's faster to just bit-bash the one byte and leave the performance for complete bursts without RWDS.
On that note, I can see it being feasible to make the custom board layout, with the integrated HyperRAM using pins P48 to P58. RWDS on P58, sharing with the SD/EEPROM pins.
EDIT: Grr, bad idea. SD card needs to be able to operate concurrently. Particularly since the HR will likely be busy all the time.
If byte & word granularity was sacrificed and you always had to read/write 32 bits on both chips it may not be much of a change to the driver - it's mainly just some clock counter scaling and slightly different streamer commands, and also giving up the RWDS pin control. So another driver variant of what I have could probably be developed with this wider capability in time.
Supporting individual byte transfers would complicate it a lot more with 2 RWDS lines and requires separate CS lines too. Don't even want to think about that right now.
Shouldn't need two CS lines. Actually, maybe RWDS and CS, rather than CLK and CS, can be combined somehow ... using resistors and capacitors ... neither, as used with the prop2, are going to need the performance of the clock and data lines.
I say "with the prop2" because the mask byte writes are only usefully a single byte at a time. In other words, random writes only. It's not going to be practical to build a mask map amongst the 8-bit data for burst writes.
The presumption is random writes will be bit-bashed at a slower toggle rate than consecutive bursts.
I guess with two RWDS lines you could combine the CS signals and just fully mask the chip you don't want to write to with its own RWDS signal. Somewhat less messy to figure out byte writes than using different CS pins as well. So four total control pins (cs, clk, rwds1, rwds2) and 16 data lines.
Von,
I've been mulling over possible ideas for building a flexible high speed test rig for proving out timings. Here's what I've got so far:
One big plus of this setup is it fits alongside the SD card without any interference.
The idea is to have short track lengths for the data pins and long track lengths for CLK, RWDS, and CS. The plan is to have a phase shift of the CLK by about 1.0 ns so that data setup is guaranteed in hardware while minimising attenuation.
RWDS is pushed out 2-3 ns, into next cycle, with software to compensate. A big question mark on attenuation here. This is needed to accommodate the lag from the 220 ohm resistor. The trimmer is there only to experimentally find the fine tune resistance. Likewise for all three trimmers.
CS can be much slower with suitable accommodation in software. The idea is that RWDS can be toggled high quite a lot and still keep CS low. In fact, I think I've got a way to use the streamer with RWDS at full sysclock for short writes. Including all steps of command, address, latency gap and data in an unbroken but short burst.
evanh. Not sure I like the loss of knowing exactly when CS returns high due to component tolerances. It's a fair bit of mucking about to save that one pin. It also probably makes polling RWDS troublesome if you ever want to use variable latency.
In fact, I think I've got a way to use the streamer with RWDS at full sysclock for short writes. Including all steps of command, address, latency gap and data in an unbroken but short burst.
It would be interesting if you get that working. My driver can drive out back to back clocks for writes at sysclk/2 and supports byte granular writes. Sysclk/1 writes are not supported (at least yet) in this driver.
I think in systems that require HyperRAM, then it will most likely be the most important "peripheral" of the P2; such that sharing pins would be avoided.
Plenty of other pins that could share function with other things. Sure, that's a sweeping statement with no real examples in mind! But that's my hunch- given how pernickety HR seems to be with timing requirements it just makes sense (for me) to avoid headaches and run those 10 pins as directly, as well matched and as short as possible. And certainly not to add resistance beyond the minimal required to clean up fast edges.
Though I kinda like the cunning way to hardware-set the CLK shift! Although surely P2 could handle that more reliably? This will probably be more obvious to me after I get a chance to experiment with the cap tuning. You guys are well ahead of me on all this.
Sysclock/1 writes only work with the phase-shifted clock. Basically, it requires something to delay or slow the clock edges with respect to the data. So the tuning is all about matching a resistor or capacitor to the board layout. Change the board and the tuning needs to change to suit the desired 1.0 - 1.5 ns of ideal phase difference.
I figured if I'm going to aim for a more reliable sysclock/1 then why not look at fitting it to the somewhat unused pin group as well.
And if P48-P57 is not going to be it then P21-P31 has to be it instead. The oscillator pin group is best kept away from any connectors.
Definitely give it a try to see what you can achieve @evanh. It just may not be a widely implemented way to do it unless it is really solid and ultimately offers very compelling advantages. I see what you are trying to do, putting the HyperRAM up as high as possible, but I think boards such as the P2D2 will be using pins 56 and 57 for other purposes such as I2C anyway. The P2-EVAL board really can only use the current Hyper module on 0, 16, 32. Any new custom board can still do whatever it wants though.
A small phase shift might be possible with a fast gate delay of some kind or buffered clock. How stable that is relative to a simple capacitive delay of 22pF I'm not sure but hopefully some small device could be found that does the right job. As you say probably just a fixed 1-1.5ns delay would be nice to put on a board and work fairly well up to its rated 166MHz DDR with a 333MHz P2. What active device can give a stable delay of this magnitude? Perhaps a tight tolerance cap is the best?
I've been building up the HyperRAM driver API in SPIN2 and a significant amount ~75% is coded but have now run into an issue with the syntax and it is slowing me down. It doesn't seem like there is a way to do unsigned comparisons in Fastspin.
The SPIN2 language definition allows +> and +>= type of unsigned comparisons but I get an error with this and Fastspin (v 4.1.9). I am hoping to make a single driver that works with both PNUT and Fastspin but perhaps this is not going to be possible without many changes where I check for negative values in lots of places...
PUB mapAddrDevice(addr, bus, memoryType, size, cspin, clkpin, rwdspin, resetpin, burst) | device, pinInfo, i, latency
    ' check for invalid arguments
    if size < SIZE_16MB or size > SIZE_128MB or bus +>= MAX_INSTANCES or memoryType +>= TYPE_LAST
        return ERR_INVALID
Update: Actually I just tried the +> by itself with Fastspin instead of +>= and it doesn't generate an error, so perhaps I can subtract one in many places and it may fix the issue... at the possible expense of more runtime overhead depending on how the constants are compiled. E.g. do this sort of thing:
PUB setupCog(cog, bus, burst, priority, flags) | f
    ' check for invalid arguments
    if bus +> MAX_INSTANCES-1
        return ERR_INVALID
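The subtract-one rewrite works because, for unsigned 32-bit values and a non-zero limit, `x +>= limit` and `x +> limit-1` are the same test. Here is a quick host-side check in Python (not Spin2). The helper names and the value of `MAX_INSTANCES` are made up for this sketch. It also shows why a single unsigned compare rejects negative arguments.

```python
MASK32 = 0xFFFFFFFF

def u32(x):
    """Reinterpret a Python int as an unsigned 32-bit value, like a P2 long."""
    return x & MASK32

def unsigned_gt(a, b):
    """Model of Spin2's  a +> b  (unsigned greater-than)."""
    return u32(a) > u32(b)

def unsigned_ge(a, b):
    """Model of Spin2's  a +>= b  (unsigned greater-or-equal)."""
    return u32(a) >= u32(b)

MAX_INSTANCES = 4  # hypothetical value; the real constant lives in the driver

for bus in (-1, 0, 3, 4, 5):
    # The limit is non-zero, so both forms must agree for every input.
    assert unsigned_ge(bus, MAX_INSTANCES) == unsigned_gt(bus, MAX_INSTANCES - 1)

# A negative bus number looks like a huge unsigned value, so it is rejected too.
print(unsigned_gt(-1, MAX_INSTANCES - 1))  # True
```

The identity breaks if the limit is 0, because `limit-1` then wraps to $FFFFFFFF and the `+>` form can never be true.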
evanh. Not sure I like the loss of knowing exactly when CS returns high due to component tolerances. It's a fair bit of mucking about to save that one pin. It also probably makes polling RWDS troublesome if you ever want to use variable latency.
At slower read rates checking for RWDS will be fine. Polling isn't ever going to be a fast solution though. IMHO, it's a dead option for the prop2. It's why I'm entertaining the ditching of RWDS entirely.
It'll be less bulky once the resistor values are nailed down and the trimmers can be ditched.
... My driver can drive out back to back clocks for writes at sysclk/2 and supports byte granular writes.
Arbitrary masks in large burst writes? I was envisaging singles only ... for the moment.
No, only the start and ending bytes need to be masked for a burst to get byte granular addressing. All other bytes within the burst get fully written. It's only the first word and last word of the burst that need finer RWDS control to achieve this.
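As a rough illustration of that point (a Python model, not the driver's actual PASM2), the function below works out which byte slots of a word-aligned burst would need RWDS held high. The function name is invented here, and the model assumes an even byte address occupies the first slot of each 16-bit word.

```python
def burst_write_masks(addr, length):
    """Return one flag per byte slot of the burst: True = masked (RWDS high).

    The burst on the bus always covers whole 16-bit words, so it starts at
    addr rounded down and ends at addr+length rounded up to a word boundary.
    """
    if length <= 0:
        return []
    first_word = addr & ~1           # round the start down to a word boundary
    end = addr + length              # one past the last byte to be written
    last_word_end = (end + 1) & ~1   # round the end up to a word boundary
    return [not (addr <= a < end) for a in range(first_word, last_word_end)]

# Four bytes starting at an odd address: three words go out on the bus,
# and only the very first and very last byte slots are masked.
print(burst_write_masks(0x1001, 4))
```

Every slot between the two ragged ends comes out unmasked, which is why only the first and last word need special RWDS handling.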
... given how pernickety HR seems to be with timing requirements it just makes sense (for me) to avoid headaches and run those 10 pins as directly, as well matched and as short as possible. And certainly not to add resistance beyond the minimal required to clean up fast edges.
For data pins, totally. But clock absolutely needs shifted to provide the data setup timing. The easiest way is to soften/lag the clock edges. Part of doing that is to make the clock track longer to give it an L-C property. And the resistor is a cheap reliable fine tune. Ideally, the track would be engineered to perform perfectly but that's way beyond my knowledge.
JMG thought it was a good idea to stick with using passive components over trying to select something active that'll always be thermally sensitive.
Though I kinda like the cunning way to hardware-set the CLK shift! Although surely P2 could handle that more reliably?
The prop2 does wonderfully at sysclock/2. The timing is clean because the HRdata and HRclock can transition on alternate sysclocks. This gives clean data setup and hold timings. Just like SPI clock and data.
Problem is at sysclock/1 the data setup time vanishes. The only way the prop2 could possibly have finer timing is to use both polarities of the sysclock. This sort of trick is not provided though ... So an external solution is needed to provide the data setup time.
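To put rough numbers on this, here is an idealised Python model of the write margins. It ignores pin skew, edge rates and track delays on the data lines. The function name and the 250 MHz sysclock are arbitrary choices for the sketch, not figures from any driver.

```python
def write_margins_ns(sysclk_mhz, clk_delay_ns, divider):
    """Return (setup, hold) in ns seen at the HyperRAM for one data byte.

    divider=2: data and HR clock change on alternate sysclocks.
    divider=1: both change on the same sysclock, so only an external
               delay on the clock line provides any setup time.
    """
    if divider not in (1, 2):
        raise ValueError("only sysclock/1 and sysclock/2 are modelled")
    t_sys = 1000.0 / sysclk_mhz      # sysclock period in ns
    byte_time = divider * t_sys      # how long each byte stays on the bus
    built_in = t_sys if divider == 2 else 0.0
    setup = built_in + clk_delay_ns  # clock edge arrives this long after data
    hold = byte_time - setup         # data stays this long after the edge
    return setup, hold

for divider, delay in ((2, 0.0), (1, 0.0), (1, 1.0)):
    setup, hold = write_margins_ns(250, delay, divider)
    print(f"sysclock/{divider}, {delay} ns clock delay: "
          f"setup {setup:.2f} ns, hold {hold:.2f} ns")
```

With no external delay the sysclock/1 case has zero setup time. Each nanosecond of clock delay is traded directly out of the hold margin.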
JMG thought it was a good idea to stick with using passive components over trying to select something active that'll always be thermally sensitive.
Unfortunately, every solution is thermally sensitive. Capacitor, delay line, logic gate, or even just a long snaking PCB trace.
A long PCB trace actually is quite sensitive to temperature extremes on a cheap FR4 board due to changing dielectric "constant".
But that's the extent of my knowledge.
I guess that's a reason to under-do the track length and rely somewhat more on the resistor then.
Funnily, I note the Hyperbus V2 spec says only 0.5 ns data setup time is needed. This can likely be satisfied with just an "unregistered" pin for the HRclock. Maybe short tracks all round is desirable.
No, only the start and ending bytes need to be masked for a burst to get byte granular addressing. All other bytes within the burst get fully written. It's only the first word and last word of the burst that need finer RWDS control to achieve this.
Oh, I'd not bother with it at all then. Just always require shortword aligned bursts. Both read and write.
Actually it's not always good for 8bpp graphics doing that. Unless you are in 16bpp colour mode, or want all graphics blocks copied only to every second pixel and be multiples of two pixels wide, doing that has ramifications. I've made this driver work with 8bpp graphics so I've enabled byte granular writes for bursts (both for start address and odd byte lengths).
Of course other drivers for non-graphics applications could ignore RWDS though. For a different cache application for example it might be okay to only support 32 bit writes on aligned boundaries.
Ah, of course, blitting needs it. And read-modify-write is not a friendly thing with these type buses. I don't suppose you've had any ideas on how to perform 4 bits per pixel ops?
I guess that's a reason to under-do the track length and rely somewhat more on the resistor then.
Grr, that wouldn't help as much as adding a capacitor. I suppose I could use a trim-capacitor for tuning instead of the trim-resistor. Not as sturdy but I guess the clock tuning will be the first thing resolved anyway.
Ah, of course, blitting needs it. And read-modify-write is not a friendly thing with these type buses. I don't suppose you've had any ideas on how to perform 4 bits per pixel ops?
Yep. Blitting is not ideal with read/modify/write and HyperRAM. In fact in a worst case implementation, by adding read/modify/write on each end of a burst it could probably slow things down by a factor of 5 in some cases because you then need 5 mailbox transactions instead of 1. In some cases, depending on the burst size it may make sense to read an entire portion in and modify the ends in hub RAM, copy the middle portion from hub to hub and then write the whole lot back to HyperRAM. It could get down to just over a 2x penalty.
In fact for graphics modes < 8bpp this will be the way to go in the immediate term as HyperRAM does not support sub-byte access. If there was space freed in the driver in time it might be possible to add sub byte masking within request lists for individual pixel changes, and then you don't need to interact with the mailbox more than once to trigger the operation but there probably still isn't a huge gain there. HyperRAM is best accessed in bursts for high performance.
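A minimal Python sketch of that read, patch and write-back approach for a 4bpp mode follows. `FakeHyperRAM` only stands in for the external memory and counts requests. All names are invented for the example, and the nibble order (even pixel in the low nibble) is an assumption.

```python
class FakeHyperRAM:
    """Stand-in for the external memory; counts mailbox-style requests."""
    def __init__(self, size):
        self.mem = bytearray(size)
        self.requests = 0

    def read(self, addr, count):
        self.requests += 1
        return bytearray(self.mem[addr:addr + count])

    def write(self, addr, data):
        self.requests += 1
        self.mem[addr:addr + len(data)] = data

def set_pixel_4bpp(line, x, colour):
    """Patch one 4-bit pixel in a hub-side buffer (even x in the low nibble)."""
    shift = (x & 1) * 4
    keep = ~(0xF << shift) & 0xFF    # mask that preserves the other nibble
    line[x >> 1] = (line[x >> 1] & keep) | ((colour & 0xF) << shift)

def plot_run(ram, base, x0, x1, colour):
    """Set pixels x0..x1 inclusive using one read burst and one write burst."""
    first, last = x0 >> 1, x1 >> 1   # first and last byte touched
    buf = ram.read(base + first, last - first + 1)
    for x in range(x0, x1 + 1):
        set_pixel_4bpp(buf, x - 2 * first, colour)
    ram.write(base + first, buf)

ram = FakeHyperRAM(16)
plot_run(ram, 0, 3, 6, 0xA)
print(ram.requests, ram.mem[1:4].hex())
```

However many pixels the run covers, the external memory sees two requests. That is where the "just over 2x" figure comes from.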
Sysclock/1 writes only work with the phase-shifted clock. Basically, it requires something to delay or slow the clock edges with respect to the data. So the tuning is all about matching a resistor or capacitor to the board layout. Change the board and the tuning needs to change to suit the desired 1.0 - 1.5 ns of ideal phase difference.
The penny drops. Thanks evan.
ps. I won't start on this until next week Tuesday now, as another priority stepped in. But I'll start researching over the week. Keeping everything matched and tuning for a ~1ns delay on clk might work out if the min-max range is 0.5-1.5ns. Lots of helpful replies here for everyone that I'll read through more carefully later before getting started.
Wow, I just enabled the optimizer in Fastspin 4.1.9 and it saves a lot of space and should also help speed things up a lot in the SPIN2 API for the HyperRAM driver. Until now I was keeping it turned off and sort of watching the driver code start to bloat up towards 14kB of SPIN2 + 4kB PASM2 driver and getting concerned, and wondering how it will compare in size with the interpreted SPIN2 Chip is doing.
I now hope people can properly enable this optimizer to save space with Fastspin.
For example comparing the output for this readByte method primitive:
PUB readByte(addr) | m
    if MAX_INSTANCES == 1 ' optimization for single instance, everything mapped to single bus
        m := mailboxAddrCog[cogid] ' get mailbox base address for this COG
        if m == 0 ' prevent hang if driver is not running
            return -1
    else ' multiple buses, need to lookup address to find mailbox for bus
        m := addrMap[addr>>24]
        if m +> MAX_INSTANCES-1 ' if address not mapped, exit
            return -1
        m := mailboxAddr[m] + cogid*12 ' compute COG mailbox offset
    long[m] := REQ_READBYTE + (addr & $fffffff) ' generate read request in mailbox
    repeat until long[m] => 0 ' wait to complete
    return long[m+1] ' return result
The optimised (default level) Fastspin compiled code is this (24 longs):
00554 | _readbyte
00554 03 66 04 F6 | mov COUNT_, #3
00558 35 00 C0 FD | calla #pushregs_
0055c | ' if MAX_INSTANCES == 1 ' optimization for single instance, everything mapped to single bus
0055c | ' m := mailboxAddrCog[cogid] ' get mailbox base address for this COG
0055c 01 A2 60 FD | cogid result1
00560 02 A2 64 F0 | shl result1, #2
00564 00 9F 04 F1 | add ptr__dat__, #256
00568 4F A2 00 F1 | add result1, ptr__dat__
0056c 51 AC 08 FB | rdlong local01, result1 wz
00570 | ' if m == 0 ' prevent hang if driver is not running
00570 | ' return -1
00570 00 9F 84 F1 | sub ptr__dat__, #256
00574 01 A2 64 A6 | if_e neg result1, #1
00578 2C 00 90 AD | if_e jmp #LR__0002
0057c | ' else ' multiple buses, need to lookup address to find mailbox for bus
0057c | ' m := addrMap[addr>>24]
0057c | ' if m +> MAX_INSTANCES-1 ' if address not mapped, exit
0057c | ' return -1
0057c | ' m := mailboxAddr[m] + cogid*12 ' compute COG mailbox offset
0057c | ' long[m] := REQ_READBYTE + (addr & $fffffff) ' generate read request in mailbox
0057c 1F AE C4 F9 | decod local02, #31
00580 52 B0 00 F6 | mov local03, arg01
00584 FF FF 07 FF
00588 FF B1 04 F5 | and local03, ##268435455
0058c 58 AE 00 F1 | add local02, local03
00590 56 AE 60 FC | wrlong local02, local01
00594 | ' repeat until long[m] => 0 ' wait to complete
00594 | LR__0001
00594 56 AE 00 FB | rdlong local02, local01
00598 00 AE 5C F2 | cmps local02, #0 wcz
0059c F4 FF 9F CD | if_b jmp #LR__0001
005a0 | ' return long[m+1] ' return result
005a0 01 AC 04 F1 | add local01, #1
005a4 56 A2 00 FB | rdlong result1, local01
005a8 | LR__0002
005a8 4D F0 03 F6 | mov ptra, fp
005ac 42 00 C0 FD | calla #popregs_
005b0 | _readbyte_ret
005b0 2E 00 64 FD | reta
while the unoptimized code bloats rapidly and looks like this (66 longs!) :
01984 | ' PUB readByte(addr) | m
01984 | _readbyte
01984 07 66 04 F6 | mov COUNT_, #7
01988 35 00 C0 FD | calla #pushregs_
0198c 8A 2A 01 F6 | mov local01, arg01
01990 | ' if MAX_INSTANCES == 1 ' optimization for single instance, everything mapped to single bus
01990 | ' m := mailboxAddrCog[cogid] ' get mailbox base address for this COG
01990 4C 1D D0 FD | calla #__system__cogid
01994 72 2C 01 F6 | mov local02, result1
01998 96 2E 01 F6 | mov local03, local02
0199c 02 2E 65 F0 | shl local03, #2
019a0 00 DF 04 F1 | add ptr__dat__, #256
019a4 6F 30 01 F6 | mov local04, ptr__dat__
019a8 00 DF 84 F1 | sub ptr__dat__, #256
019ac 98 2E 01 F1 | add local03, local04
019b0 97 32 01 FB | rdlong local05, local03
019b4 | ' if m == 0 ' prevent hang if driver is not running
019b4 00 32 0D F2 | cmp local05, #0 wz
019b8 0C 00 90 5D | if_ne jmp #LR__0097
019bc | ' return -1
019bc FF FF 7F FF
019c0 FF E5 04 F6 | mov result1, ##-1
019c4 B8 00 90 FD | jmp #LR__0102
019c8 | LR__0097
019c8 | ' else ' multiple buses, need to lookup address to find mailbox for bus
019c8 78 00 90 FD | jmp #LR__0099
019cc | ' m := addrMap[addr>>24]
019cc 95 2C 01 F6 | mov local02, local01
019d0 18 2C 45 F0 | shr local02, #24
019d4 96 2E 01 F6 | mov local03, local02
019d8 01 00 00 FF
019dc 60 DE 04 F1 | add ptr__dat__, ##608
019e0 6F 30 01 F6 | mov local04, ptr__dat__
019e4 01 00 00 FF
019e8 60 DE 84 F1 | sub ptr__dat__, ##608
019ec 98 2E 01 F1 | add local03, local04
019f0 97 32 C1 FA | rdbyte local05, local03
019f4 | ' if m +> MAX_INSTANCES-1 ' if address not mapped, exit
019f4 00 32 1D F2 | cmp local05, #0 wcz
019f8 0C 00 90 ED | if_be jmp #LR__0098
019fc | ' return -1
019fc FF FF 7F FF
01a00 FF E5 04 F6 | mov result1, ##-1
01a04 78 00 90 FD | jmp #LR__0102
01a08 | LR__0098
01a08 | ' m := mailboxAddr[m] + cogid*12 ' compute COG mailbox offset
01a08 99 2E 01 F6 | mov local03, local05
01a0c 02 2E 65 F0 | shl local03, #2
01a10 4C DF 04 F1 | add ptr__dat__, #332
01a14 6F 30 01 F6 | mov local04, ptr__dat__
01a18 4C DF 84 F1 | sub ptr__dat__, #332
01a1c 98 2E 01 F1 | add local03, local04
01a20 BC 1C D0 FD | calla #__system__cogid
01a24 72 34 01 F6 | mov local06, result1
01a28 9A 36 01 F6 | mov local07, local06
01a2c 01 36 65 F0 | shl local07, #1
01a30 9A 36 01 F1 | add local07, local06
01a34 02 36 65 F0 | shl local07, #2
01a38 97 2C 01 FB | rdlong local02, local03
01a3c 9B 2C 01 F1 | add local02, local07
01a40 96 32 01 F6 | mov local05, local02
01a44 | LR__0099
01a44 | ' long[m] := REQ_READBYTE + (addr & $fffffff) ' generate read request in mailbox
01a44 00 00 40 FF
01a48 00 2C 05 F6 | mov local02, ##-2147483648
01a4c 95 2E 01 F6 | mov local03, local01
01a50 FF FF 07 FF
01a54 FF 2F 05 F5 | and local03, ##268435455
01a58 97 2C 01 F1 | add local02, local03
01a5c 99 2C 61 FC | wrlong local02, local05
01a60 | ' repeat until long[m] => 0 ' wait to complete
01a60 | LR__0100
01a60 99 2C 01 FB | rdlong local02, local05
01a64 00 2C 5D F2 | cmps local02, #0 wcz
01a68 04 00 90 3D | if_ae jmp #LR__0101
01a6c F0 FF 9F FD | jmp #LR__0100
01a70 | LR__0101
01a70 | ' return long[m+1] ' return result
01a70 99 2C 01 F6 | mov local02, local05
01a74 01 2C 05 F1 | add local02, #1
01a78 96 E4 00 FB | rdlong result1, local02
01a7c 00 00 90 FD | jmp #LR__0102
01a80 | LR__0102
01a80 6B F0 03 F6 | mov ptra, fp
01a84 42 00 C0 FD | calla #popregs_
01a88 | _readbyte_ret
01a88 2E 00 64 FD | reta
Yep for sure.
For a minimal app, referencing all current HyperRAM driver functions (to prevent method removal) and including the PASM2 driver which is ~3800 bytes or so, I get these build sizes (which include a 1kB hub overhead plus Fastspin's own stuff):
No Optimization : 20672 bytes
Default Optimization: 14688 bytes
Full Optimization : 14720 bytes
For those wanting to interact directly with the HyperRAM driver mailbox (eg. from a PASM2 COG), it will free a lot more space as you don't need the extra SPIN2 layer API, which while very helpful to use is not mandatory. You'll have to understand the setup parameters and mailbox format.
Eg. upon driver start you just pass in a pointer to 8 long parameters which define the devices and COG parameters etc. I will also document the format of items accordingly.
' setup driver COG startup parameters
params[0]:= freq
params[1]:= @cogList[bus*NUMCOGS]
params[2]:= flags
params[3]:= busBasePin[bus]
params[4]:= @devices[bus*32] 'per bank settings
params[5]:= maskA[bus] 'port A (lower 32 pins) reset mask
params[6]:= maskB[bus] 'port B (upper 32 pins) reset mask
params[7]:= mailboxAddr[bus] 'mailbox address for the driver
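On the mailbox side, the readByte method shown earlier gives a few hints about the request format: a request long with the top bit set, a 28-bit address field, 12 bytes of mailbox per COG, and completion signalled by the request long reading back as non-negative. Here is a small Python model of just those pieces. It is a reading of that one method, not the documented mailbox format, and the helper names are invented.

```python
REQ_READBYTE = 1 << 31        # the listing builds this with "decod #31"
ADDR_MASK = 0x0FFFFFFF        # the $fffffff mask used in readByte
MAILBOX_BYTES_PER_COG = 12    # from "cogid*12" in readByte

def mailbox_address(base, cogid):
    """Hub address of a COG's mailbox, given the bus's mailbox base address."""
    return base + cogid * MAILBOX_BYTES_PER_COG

def read_byte_request(addr):
    """32-bit request long as readByte would write it to the mailbox."""
    return (REQ_READBYTE + (addr & ADDR_MASK)) & 0xFFFFFFFF

def as_signed32(value):
    """Interpret a 32-bit pattern as a signed long."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & (1 << 31) else value

def request_done(mailbox_long):
    """Same test as the Spin2 loop that waits for long[m] >= 0."""
    return as_signed32(mailbox_long) >= 0

req = read_byte_request(0x0012_3456)
print(hex(req), request_done(req))   # a fresh request is still pending
```

A PASM2 client talking to the mailbox directly would follow the same write-then-poll pattern without the SPIN2 layer in between.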
I was able to compile my new HyperRAM driver codebase in PNut v34s running on VirtualBox. Still not tested, just compiling without errors.
The size difference vs Fastspin is interesting. Looks like the SPIN2 driver object is currently about 8kB including the 3600 byte PASM code. This probably compares to just over 13kB in Fastspin with optimisation enabled. Though the Fastspin version should still be somewhat faster to run of course. By how much, I'm keen to find out at some point.
I needed to change a few things before it compiled and this is what I learned (I'm sure it has been discussed before, but this is the first time I've ever run PNut so I'm learning the hard way when porting the driver code to be hopefully runnable using both environments):
- PNUT needs that return parameter to compile without errors if you want to return something, Fastspin doesn't need it.
PUB getHyperDriver() : r
    return @hyper_driver
vs
PUB getHyperDriver() ' Fastspin allows this syntax and can still return a value
    return @hyper_driver
- PNUT needs cogid to be returned via function cogid() while Fastspin allows just cogid to be used
- Fastspin allows # but PNut now always needs a dot. Eg:
driver#REQ_READBYTE ' Fastspin allows this
vs
driver.REQ_READBYTE
- There is no cognew function in PNut to spawn PASM COGs; you need to use coginit with 16 as the argument to start a new COG.
driverCog := cognew(addr, @params)
vs
driverCog := coginit(16, addr, @params)
- SPIN2 method parameters can't use the same name as labels do in the PASM2 code in PNut.
- PNut requires any no-argument SPIN2 methods to be defined and called with ()
- Finally, there was a problem with the character order of the greater-than-or-equal operator.
PNut needs this:
repeat until long[m] >= 0
while (perhaps an older) Fastspin needed this to work correctly:
repeat until long[m] => 0
Hopefully a newer Fastspin should fix this.
Update: looks like Fastspin 4.1.9 is doing what I want now and can use the Pnut syntax...this should work according to the listing output.
00730 | ' repeat until long[m] >= 0
00730 | LR__0001
00730 81 CC 01 FB | rdlong dump_tmp001_, _dump_m
00734 00 CC 5D F2 | cmps dump_tmp001_, #0 wcz
00738 F4 FF 9F CD | if_b jmp #LR__0001
Yeah that is an interesting observation, and currently they are comparable in total size, though that Spin2 interpreter is a common overhead that other code can also use (I hope!) so as more client application code is added these example images will probably start to diverge further in code space consumed. I guess I was interested more in the HyperRAM driver sizes with this particular comparison.
Main thing is this driver is not a total hog and should be fully usable in both environments. Any unused method removal by the tools can help further too. It will consume far less memory than it enables!
Comments
But the data would really fly, still sharing the same clock and cs pins but of course being in fixed latency mode.
They would be tidier on P32-P47 but I'd want the 16 data lines on P16-P31 so that those pins are not exposed to transients from a connector.
For reference, the Fastspin error reported for the +>= comparison in mapAddrDevice was:
/Users/roger/Downloads/flexgui-4.0.3/samples/hyper5.spin2:290: error: syntax error, unexpected '='
Update: Actually I just tried the +> by itself with Fastspin instead of +>= and and it doesn't generate an error so perhaps I can subtract one in many places and it may fix the issue...at the possible expense of more runtime overhead depending how the constants are compiled. e.g. do this sort of thing:
PUB setupCog(cog, bus, burst, priority, flags) | f ' check for invalid arguments if bus +> MAX_INSTANCES-1 return ERR_INVALID
It'll be less bulky once the resistor values are nailed down and the trimmers can be ditched.
Arbitrary masks in large burst writes? I was envisaging singles only ... for the moment.
JMG thought it was a good idea to stick with using passive components over trying to select something active that'll always be thermally sensitive.
The prop2 does wonderfully at sysclock/2. The timing is clean because the HRdata and HRclock can transition on alternate sysclocks. This gives clean data setup and hold timings. Just like SPI clock and data.
Problem is at sysclock/1 the data setup time vanishes. The only way the prop2 could possibly have finer timing is to use the both polarities of the sysclock. This sort of trick is not provided though ... So an external solution is needed to provide the data setup time.
Unfortunately, every solution is thermally sensitive. Capacitor, delay line, logic gate, or even just a long snaking PCB trace.
A long PCB trace actually is quite sensitive to temperature extremes on a cheap FR4 board due to changing dielectric "constant".
But that's the extent of my knowledge.
Funnily, I note the Hyperbus V2 spec says only 0.5 ns data setup time is needed. This can likely be satisfied with just an "unregistered" pin for the HRclock. Maybe short tracks all round is desirable.
Actually it's not always good for 8bpp graphics doing that. Unless you are in 16bpp colour mode, or want all graphics blocks copied only to every second pixel and be multiples of two pixels wide, doing that has ramifications. I've made this driver work with 8bpp graphics so I've enabled byte granular writes for bursts (both for start address and odd byte lengths).
Of course other drivers for non-graphics applications could ignore RWDS though. For a different cache application for example it might be okay to only support 32 bit writes on aligned boundaries.
Yep. Blitting is not ideal with read/modify/write and HyperRAM. In fact in a worst case implementation, by adding read/modify/write on each end of a burst it could probably slow things down by a factor of 5 in some cases because you then need 5 mailbox transactions instead of 1. In some cases, depending on the burst size it may make sense to read an entire portion in and modify the ends in hub RAM, copy the middle portion from hub to hub and then write the whole lot back to HyperRAM. It could get down to just over a 2x penalty.
In fact for graphics modes < 8bpp this will be the way to go in the immediate term as HyperRAM does not support sub-byte access. If there was space freed in the driver in time it might be possible to add sub byte masking within request lists for individual pixel changes, and then you don't need to interact with the mailbox more than once to trigger the operation but there probably still isn't a huge gain there. HyperRAM is best accessed in bursts for high performance.
The penny drops. Thanks evan.
ps. I won't start on this until Tuesday next week now, as another priority stepped in. But I'll start researching over the week. Keeping everything matched and tuning for a ~1ns delay on clk might work out if the min-max range is 0.5-1.5ns. Lots of helpful replies here for everyone that I'll read through more carefully later before getting started.
I now hope people can properly enable this optimizer to save space with Fastspin.
For example comparing the output for this readByte method primitive:
PUB readByte(addr) | m
    if MAX_INSTANCES == 1 ' optimization for single instance, everything mapped to single bus
        m := mailboxAddrCog[cogid] ' get mailbox base address for this COG
        if m == 0 ' prevent hang if driver is not running
            return -1
    else ' multiple buses, need to lookup address to find mailbox for bus
        m := addrMap[addr>>24]
        if m +> MAX_INSTANCES-1 ' if address not mapped, exit
            return -1
        m := mailboxAddr[m] + cogid*12 ' compute COG mailbox offset
    long[m] := REQ_READBYTE + (addr & $fffffff) ' generate read request in mailbox
    repeat until long[m] => 0 ' wait to complete
    return long[m+1] ' return result
The optimised (default level) Fastspin compiled code is this (24 longs):
00554 | _readbyte
00554 03 66 04 F6 | mov COUNT_, #3
00558 35 00 C0 FD | calla #pushregs_
0055c | ' if MAX_INSTANCES == 1 ' optimization for single instance, everything mapped to single bus
0055c | ' m := mailboxAddrCog[cogid] ' get mailbox base address for this COG
0055c 01 A2 60 FD | cogid result1
00560 02 A2 64 F0 | shl result1, #2
00564 00 9F 04 F1 | add ptr__dat__, #256
00568 4F A2 00 F1 | add result1, ptr__dat__
0056c 51 AC 08 FB | rdlong local01, result1 wz
00570 | ' if m == 0 ' prevent hang if driver is not running
00570 | ' return -1
00570 00 9F 84 F1 | sub ptr__dat__, #256
00574 01 A2 64 A6 | if_e neg result1, #1
00578 2C 00 90 AD | if_e jmp #LR__0002
0057c | ' else ' multiple buses, need to lookup address to find mailbox for bus
0057c | ' m := addrMap[addr>>24]
0057c | ' if m +> MAX_INSTANCES-1 ' if address not mapped, exit
0057c | ' return -1
0057c | ' m := mailboxAddr[m] + cogid*12 ' compute COG mailbox offset
0057c | ' long[m] := REQ_READBYTE + (addr & $fffffff) ' generate read request in mailbox
0057c 1F AE C4 F9 | decod local02, #31
00580 52 B0 00 F6 | mov local03, arg01
00584 FF FF 07 FF
00588 FF B1 04 F5 | and local03, ##268435455
0058c 58 AE 00 F1 | add local02, local03
00590 56 AE 60 FC | wrlong local02, local01
00594 | ' repeat until long[m] => 0 ' wait to complete
00594 | LR__0001
00594 56 AE 00 FB | rdlong local02, local01
00598 00 AE 5C F2 | cmps local02, #0 wcz
0059c F4 FF 9F CD | if_b jmp #LR__0001
005a0 | ' return long[m+1] ' return result
005a0 01 AC 04 F1 | add local01, #1
005a4 56 A2 00 FB | rdlong result1, local01
005a8 | LR__0002
005a8 4D F0 03 F6 | mov ptra, fp
005ac 42 00 C0 FD | calla #popregs_
005b0 | _readbyte_ret
005b0 2E 00 64 FD | reta
while the unoptimized code bloats rapidly and looks like this (66 longs!) :
01984 | ' PUB readByte(addr) | m
01984 | _readbyte
01984 07 66 04 F6 | mov COUNT_, #7
01988 35 00 C0 FD | calla #pushregs_
0198c 8A 2A 01 F6 | mov local01, arg01
01990 | ' if MAX_INSTANCES == 1 ' optimization for single instance, everything mapped to single bus
01990 | ' m := mailboxAddrCog[cogid] ' get mailbox base address for this COG
01990 4C 1D D0 FD | calla #__system__cogid
01994 72 2C 01 F6 | mov local02, result1
01998 96 2E 01 F6 | mov local03, local02
0199c 02 2E 65 F0 | shl local03, #2
019a0 00 DF 04 F1 | add ptr__dat__, #256
019a4 6F 30 01 F6 | mov local04, ptr__dat__
019a8 00 DF 84 F1 | sub ptr__dat__, #256
019ac 98 2E 01 F1 | add local03, local04
019b0 97 32 01 FB | rdlong local05, local03
019b4 | ' if m == 0 ' prevent hang if driver is not running
019b4 00 32 0D F2 | cmp local05, #0 wz
019b8 0C 00 90 5D | if_ne jmp #LR__0097
019bc | ' return -1
019bc FF FF 7F FF
019c0 FF E5 04 F6 | mov result1, ##-1
019c4 B8 00 90 FD | jmp #LR__0102
019c8 | LR__0097
019c8 | ' else ' multiple buses, need to lookup address to find mailbox for bus
019c8 78 00 90 FD | jmp #LR__0099
019cc | ' m := addrMap[addr>>24]
019cc 95 2C 01 F6 | mov local02, local01
019d0 18 2C 45 F0 | shr local02, #24
019d4 96 2E 01 F6 | mov local03, local02
019d8 01 00 00 FF
019dc 60 DE 04 F1 | add ptr__dat__, ##608
019e0 6F 30 01 F6 | mov local04, ptr__dat__
019e4 01 00 00 FF
019e8 60 DE 84 F1 | sub ptr__dat__, ##608
019ec 98 2E 01 F1 | add local03, local04
019f0 97 32 C1 FA | rdbyte local05, local03
019f4 | ' if m +> MAX_INSTANCES-1 ' if address not mapped, exit
019f4 00 32 1D F2 | cmp local05, #0 wcz
019f8 0C 00 90 ED | if_be jmp #LR__0098
019fc | ' return -1
019fc FF FF 7F FF
01a00 FF E5 04 F6 | mov result1, ##-1
01a04 78 00 90 FD | jmp #LR__0102
01a08 | LR__0098
01a08 | ' m := mailboxAddr[m] + cogid*12 ' compute COG mailbox offset
01a08 99 2E 01 F6 | mov local03, local05
01a0c 02 2E 65 F0 | shl local03, #2
01a10 4C DF 04 F1 | add ptr__dat__, #332
01a14 6F 30 01 F6 | mov local04, ptr__dat__
01a18 4C DF 84 F1 | sub ptr__dat__, #332
01a1c 98 2E 01 F1 | add local03, local04
01a20 BC 1C D0 FD | calla #__system__cogid
01a24 72 34 01 F6 | mov local06, result1
01a28 9A 36 01 F6 | mov local07, local06
01a2c 01 36 65 F0 | shl local07, #1
01a30 9A 36 01 F1 | add local07, local06
01a34 02 36 65 F0 | shl local07, #2
01a38 97 2C 01 FB | rdlong local02, local03
01a3c 9B 2C 01 F1 | add local02, local07
01a40 96 32 01 F6 | mov local05, local02
01a44 | LR__0099
01a44 | ' long[m] := REQ_READBYTE + (addr & $fffffff) ' generate read request in mailbox
01a44 00 00 40 FF
01a48 00 2C 05 F6 | mov local02, ##-2147483648
01a4c 95 2E 01 F6 | mov local03, local01
01a50 FF FF 07 FF
01a54 FF 2F 05 F5 | and local03, ##268435455
01a58 97 2C 01 F1 | add local02, local03
01a5c 99 2C 61 FC | wrlong local02, local05
01a60 | ' repeat until long[m] => 0 ' wait to complete
01a60 | LR__0100
01a60 99 2C 01 FB | rdlong local02, local05
01a64 00 2C 5D F2 | cmps local02, #0 wcz
01a68 04 00 90 3D | if_ae jmp #LR__0101
01a6c F0 FF 9F FD | jmp #LR__0100
01a70 | LR__0101
01a70 | ' return long[m+1] ' return result
01a70 99 2C 01 F6 | mov local02, local05
01a74 01 2C 05 F1 | add local02, #1
01a78 96 E4 00 FB | rdlong result1, local02
01a7c 00 00 90 FD | jmp #LR__0102
01a80 | LR__0102
01a80 6B F0 03 F6 | mov ptra, fp
01a84 42 00 C0 FD | calla #popregs_
01a88 | _readbyte_ret
01a88 2E 00 64 FD | reta
For a minimal app, referencing all current HyperRAM driver functions (to prevent method removal) and including the PASM2 driver which is ~3800 bytes or so, I get these build sizes (which include a 1kB hub overhead plus Fastspin's own stuff):
No Optimization      : 20672 bytes
Default Optimization : 14688 bytes
Full Optimization    : 14720 bytes
For those wanting to interact directly with the HyperRAM driver mailbox (eg. from a PASM2 COG), it will free a lot more space as you don't need the extra SPIN2 API layer, which, while very helpful, is not mandatory. You'll have to understand the setup parameters and mailbox format.
Eg. upon driver start you just pass in a pointer to 8 long parameters which define the devices and COG parameters etc. I will also document the format of items accordingly.
' setup driver COG startup parameters
params[0]:= freq
params[1]:= @cogList[bus*NUMCOGS]
params[2]:= flags
params[3]:= busBasePin[bus]
params[4]:= @devices[bus*32] 'per bank settings
params[5]:= maskA[bus] 'port A (lower 32 pins) reset mask
params[6]:= maskB[bus] 'port B (upper 32 pins) reset mask
params[7]:= mailboxAddr[bus] 'mailbox address for the driver
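For anyone poking at the mailbox directly, the readByte listing earlier in this post gives away a few details that can be modelled on a host in Python. This is a sketch under assumptions, not the documented format: REQ_READBYTE being bit 31 is inferred from the `decod local02, #31` instruction, the 28-bit address mask from `addr & $fffffff`, and the 12 bytes per COG from `cogid*12`. The Python function names are made up for illustration.

```python
REQ_READBYTE = 1 << 31        # inferred from `decod local02, #31` in the listing
ADDR_MASK = 0x0FFFFFFF        # from `addr & $fffffff` in readByte
MAILBOX_BYTES_PER_COG = 12    # from `cogid*12` in readByte


def read_byte_request(addr):
    """Build the 32-bit request long that readByte writes to the mailbox."""
    return (REQ_READBYTE + (addr & ADDR_MASK)) & 0xFFFFFFFF


def cog_mailbox_addr(mailbox_base, cogid):
    """Hub address of a given COG's mailbox slot for one bus."""
    return mailbox_base + cogid * MAILBOX_BYTES_PER_COG


def request_pending(mailbox_long):
    """readByte spins until the long is >= 0 as a signed value,
    i.e. until bit 31 of the request long has been cleared."""
    return bool(mailbox_long & 0x80000000)


if __name__ == "__main__":
    print(hex(read_byte_request(0x12345678)))
    print(hex(cog_mailbox_addr(0x1000, 3)))
    print(request_pending(read_byte_request(0x12345678)))
```

A PASM2 client would do the equivalent with a wrlong of the request long followed by an rdlong poll loop, as in the optimised listing.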
The size difference vs Fastspin is interesting. Looks like the SPIN2 driver object is currently about 8kB including the 3600 byte PASM code. This probably compares to just over 13kB in Fastspin with optimisation enabled. Though the Fastspin version should still be somewhat faster to run of course. By how much, I'm keen to find out at some point.
I needed to change a few things before it compiled and this is what I learned (I'm sure it has been discussed before, but this is the first time I've ever run PNut so I'm learning the hard way when porting the driver code to be hopefully runnable using both environments):
- PNut needs the return parameter declared to compile without errors if you want to return something; Fastspin doesn't need it.
PUB getHyperDriver() : r
    return @hyper_driver
vs
PUB getHyperDriver() ' Fastspin allows this syntax and can still return a value
    return @hyper_driver
- PNut needs cogid to be called as a function, cogid(), while Fastspin allows just cogid to be used
- Fastspin allows # but PNut now always needs a dot. Eg:
driver#REQ_READBYTE ' Fastspin allows this
vs
driver.REQ_READBYTE
- There is no cognew function in PNut to spawn PASM COGs; you need to use coginit with 16 as the first argument to start a new COG.
driverCog := cognew(addr, @params)
vs
driverCog := coginit(16, addr, @params)
- SPIN2 method parameters can't use the same name as labels do in the PASM2 code in PNut.
- PNut requires any no-argument SPIN2 methods to be defined and called with ()
- Finally there was a problem with greater than and equal to order
PNut needs this:
repeat until long[m] >= 0
while (perhaps an older) Fastspin needed this to work correctly:
repeat until long[m] => 0
Hopefully a newer Fastspin should fix this.
Update: looks like Fastspin 4.1.9 is doing what I want now and can use the PNut syntax. This should work according to the listing output.
00730 | ' repeat until long[m] >= 0
00730 | LR__0001
00730 81 CC 01 FB | rdlong dump_tmp001_, _dump_m
00734 00 CC 5D F2 | cmps dump_tmp001_, #0 wcz
00738 F4 FF 9F CD | if_b jmp #LR__0001
Main thing is this driver is not a total hog and should be fully usable in both environments. Any unused method removal by the tools can help further too. It will consume far less memory than it enables!