Of course, good call TonyB_, thanks. I was looking for some constant with a top bit set lower down in the code a little while ago and had missed seeing that. Perfect for the job! Thanks TonyB_.
This is good too because I also need those two spare COG RAM locations free to even toggle a pin for measuring its benefits.
Since the pin toggle I put in to time the execution with and without the suggested RDFAST tweak is too fast for my logic analyzer, I have added a COG that tries to time the execution by watching a pin and reading the counter when it transitions. I'm having trouble getting this timing code working, though: the result jumps all over the place and doesn't make sense. My driver drops P56 low when it starts a request and drives it high at the end, so it should be working. I print out a global VAR called "lasttime" in my Fastspin test harness as an unsigned int after the request completes, but I'm getting garbage values. Not sure what I have wrong here... any ideas?
eg. single HyperRAM read request returns this, but should only take a microsecond or so. I must be overlooking something obvious...this should be easy enough to achieve.
In advance, I ask your indulgence for any confusion on my part, due to my bad eyesight (and weird brain...), but, thinking about it as an interlocked waiting sequence, I believe I would have coded it just a bit differently:
IMHO, that way you're guaranteed to catch the high-to-low transition of P56 at the beginning, and are also able to catch its low-to-high transition at the end of each timing loop.
Also, and perhaps it's just my opinion, by changing the %zzzz-bits, the PAT-hardware would react earlier than expected, hence the detection would not be directly related to any pin state change, but to the inverter-behaviour state-change (Z selects =/!=), as commanded by %zzzz = %1111.
Hope it helps more than confuses...
Henrique
Addendum: Perhaps, as a matter of personal preference, even the first lonely "modcz" could be taken out of the looping wait and kept isolated at the beginning, to show that it works just as a setup instruction in that sequence.
I simply believe its role can be better understood this way.
Addendum II: Eh eh, it appears that I was caught by some traps too; even the first setpat/waitpat pair can be taken out from the looping sequence, and left alone, isolated, with the first (and lone) modcz; the last setpat/waitpat pair just left them at the right state, ensuring a sane re-entering.
Yeah I'll give that a try shortly. I know there should be sufficient gaps between high and low pin transitions to be able to execute the following setpat+waitpat sequence without missing the transition, but something seems to be getting stuck somewhere. I've moved the first waitpat group out of the rep loop and now clear the lasttime value before I begin, and I now see it stuck at zero. I think my understanding of what setpat/waitpat do is wrong, and they must behave a bit differently from how waitpeq/waitpne worked on the P1, as something seems to be locking up there, or it's a new problem now. I might need to create a separate test program just to test this out.
Just a warning: unless something ensures that the above routine starts and executes the first waitpat well within the time P56 is being held HIGH, there is no guarantee that the first """\_________/""" falling transition will be flagged as expected, so you can lose a full low-period count at first. Once it gets past that first "possible hiccup", all other timing readings will be correct.
Henrique
Yeah that was the intent of the first group of 3 instructions in my rep loop: to avoid the hiccup, i.e. to solve the initial problem of catching the pin randomly in some state and getting the first computation wrong. It didn't really need to be in the rep loop, however, and can happen once at the start, so I've since moved it outside the repeat loop.
Update: I might just move to a COGATN approach to time it instead of waiting on pin transitions if I can't figure it out. Getting this timing thing working reliably is going to be useful for comparing any performance tweaks to the HyperRAM transfer code that might be added, though I barely have the space for it right now.
I think it should be called matching a "new pin pattern" event, not a pin pattern event. The event only happens when the new pin pattern is first entered from a different pattern, not if it is in that state to begin with. I didn't realize this from the documentation and had (incorrectly) assumed P1 style level sensitive behaviour.
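To make the distinction concrete, here is a minimal Python sketch of the two behaviours as described in this post. It is only a model of the description above (an event on entering the pattern versus a P1-style level match), not a verified model of the P2 silicon, and the function names are made up for illustration.

```python
def level_wait(samples, target):
    """P1-style level-sensitive wait: index of the first sample equal to target."""
    for i, value in enumerate(samples):
        if value == target:
            return i
    return None  # never matched


def new_pattern_wait(samples, target):
    """'New pattern' wait: index where the pin first *becomes* target
    after having been something else. A pin that already sits at target
    and never leaves it produces no event at all."""
    for i in range(1, len(samples)):
        if samples[i] == target and samples[i - 1] != target:
            return i
    return None  # no transition into the pattern was seen


# Hypothetical sampled states of one pin (1 = high, 0 = low).
pin = [1, 1, 1, 0, 0, 1, 1]

print(level_wait(pin, 1))        # 0: satisfied immediately, pin already high
print(new_pattern_wait(pin, 1))  # 5: only the low-to-high entry counts
print(new_pattern_wait([1, 1, 1], 1))  # None: models a wait that stays stuck
```

The last call is the interesting one: under the "new pattern" reading, waiting for a state the pin is already in never completes, which would be consistent with the lock-up reported earlier in the thread.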
I'm sure there is at least some way of achieving it, perhaps more than one, in fact.
One idea would be throwing POLLPAT and conditional execution of instructions into the mix, but since WC, WCZ, or WZ would get used too, I'm still not sure about the way this could affect the modcz/setpat/waitpat structure, in order to ensure the resulting functionality.
Just a bit more thought, and experimentation, as usual...
Just measured it, and the rdfast ##$80000000 thing saved 32 clocks on a 256 byte memory copy transfer at sysclk/2 (882 vs 850 cycles at P2 clock =200MHz)
That's a pretty good and a surprising improvement, I thought it might only be 8 clocks but it was more. Definitely going to be added (no-brainer).
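For scale, the reported figures can be turned into time and percentage with a few lines of Python. This is just arithmetic on the numbers quoted above (882 and 850 cycles at a 200 MHz P2 clock); the helper name is made up for illustration.

```python
def cycles_to_ns(cycles, clk_hz):
    """Convert a clock-cycle count to nanoseconds at the given clock rate."""
    return cycles * 1e9 / clk_hz


CLK_HZ = 200_000_000   # P2 clock quoted in the post
before, after = 882, 850  # cycles for the 256-byte copy, as reported

saved = before - after
print(saved)                                 # clocks saved
print(cycles_to_ns(saved, CLK_HZ))           # time saved in ns
print(round(100 * saved / before, 2))        # saving as a percentage
print(cycles_to_ns(before, CLK_HZ) / 1000)   # original copy time in us
```

At 200 MHz each clock is 5 ns, so 32 clocks is 160 ns out of roughly 4.4 us for the whole transfer.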
I am just amazed what amount of code you are able to press into a COG. And the variations available thru configuration. Your Video driver sounds fantastic with mixed graphic and text areas supporting all needed output formats out of one driver.
At that point I was a bit stunned, just having a 2 port serial driver written in PASM and feeling not so proud of it anymore as before.
Now your Hyperram integration, and comments like '2 more longs left' but you still add stuff while fighting for every long in the COG.
How are you doing this? Do you sleep sometimes?
My offer still stands to test a 2 Hyperram-board configuration on a P2 rev B if you want to.
Bowing deep,
Mike
Cheers Mike. I do like to pack in all those features where I can. P2 capacity seems to keep expanding in size the more you learn how to use it.
Yes I sleep, a lot. In fact probably quite a bit more than normal during lockdown. In theory at the start I'd expected that this whole driver should have taken me around 6-8 weeks or so if I worked on it full time, but instead it got stretched out to over 6 months by only working a few hours or so at a time here and there and then leaving it for a few days at a time whenever I got tired of it. Once you do that there is a lot of context switching and relearning going on, and it becomes inefficient and drags out. But our extended lockdown here was a demotivator for me in general.
I think people should be able to try to use a 2 HyperRAM board setup now if they have that type of equipment handy. If you just call InitHyperDriver twice with a second address range and second base pin to create another memory bus, it should spawn a second driver and use that whenever the address range maps to it. Then you could try to copy memory from one board's RAM to another with memory.copy(...) . That would exercise both boards.
It should just work (but as mentioned is still untested), although today I found copying from HyperFlash to HyperRAM may have some issues which I am trying to pinpoint to see if it is something in my test harness itself or in the driver. This has worked before in the driver so something has either regressed there or there is another problem. HyperRAM to HyperRAM copy does seem to still work at least which is a good sign for the PASM driver. I am using my memtest.spin program to look at this. Hopefully it is something nice and simple to resolve.
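The address-range routing described above (a second driver instance that is used whenever an address maps to its range) can be sketched in Python as follows. Everything here is an assumption for illustration: the `Bus` class, the `route` helper, the 16 MB window sizes and the base pins are hypothetical and are not taken from the driver's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bus:
    """One memory bus / driver instance (hypothetical description)."""
    name: str
    base: int      # first address served by this bus
    size: int      # number of bytes served
    base_pin: int  # first P2 pin of the bus

    def contains(self, addr):
        return self.base <= addr < self.base + self.size


def route(buses, addr):
    """Return the bus whose address window contains addr."""
    for bus in buses:
        if bus.contains(addr):
            return bus
    raise ValueError(f"address {addr:#x} is not mapped to any bus")


# Two hypothetical 16 MB windows, one per HyperRAM board.
buses = [
    Bus("hyper0", base=0x0000_0000, size=0x0100_0000, base_pin=32),
    Bus("hyper1", base=0x0100_0000, size=0x0100_0000, base_pin=16),
]

src, dst = 0x0000_1000, 0x0100_1000
print(route(buses, src).name, "->", route(buses, dst).name)
```

A copy whose source and destination resolve to different buses, as in the last line, is the case that would exercise both boards.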
Do you use no-wait WRFASTs? How many separate RDFAST / WRFAST instructions are there for 256 bytes? Maximum possible saving for a no-wait RDFAST / WRFAST is 17 / 3 cycles.
Yeah I do now, I am also seeing quantization to 8 clock boundaries in my measurements as I increased the transfer size by one byte. It slips by 8 clocks every 8 bytes. There should be only one WRFAST and one RDFAST needed as the burst size was 320 bytes which is still over the 256 byte sized copy I used in the test, so no fragmentation would occur. I think it must be somehow rounding back further to gain even more than whatever the RDFAST change did by itself. I'm definitely not complaining that it's better than expected.
Saving 32 cycles does not appear possible purely on instruction timing, which is what I asked about in my previous post. Anyway, what a great idea to set bit 31 for no-waits!
Yeah I hope to look into this more and want to retest now that I've fixed the bug below.
I tracked down the flash to RAM copy problem I mentioned; it was a special case. To make it occur you needed to have locked the COG's burst first before doing this type of copy. If the COG is set up as a regular/default RR COG and its bursts can be fragmented (i.e. it normally yields between read/write bursts), its bank information is reloaded each time and the problem doesn't happen, which is why I hadn't seen it. But if the COG has also enabled its F_LOCKED setting, the different RAM bank information was not automatically being reloaded, and a flash to RAM copy basically tries to do a flash to flash copy, which doesn't do anything. It also explains why RAM to RAM worked okay, being the same bank anyway.
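As a rough illustration of that failure mode, here is a toy Python model. It is only a sketch of the description above, assuming a copy has a read phase and a write phase that each need the right bank selected; the address map, names and flags are hypothetical and do not reflect the driver's real data structures.

```python
FLASH, RAM = "flash", "ram"
FLASH_BASE = 0x0200_0000  # hypothetical: addresses at or above this are flash


def bank_of(addr):
    """Pick the bank an address belongs to (hypothetical address map)."""
    return FLASH if addr >= FLASH_BASE else RAM


def copy_banks(src, dst, locked, reload_when_locked):
    """Return (bank used for the read phase, bank used for the write phase).

    Unlocked transfers yield between phases, so the bank is reloaded.
    Locked transfers only reload it if reload_when_locked is True,
    which stands in for the fix.
    """
    bank = bank_of(src)
    read_bank = bank
    if not locked or reload_when_locked:
        bank = bank_of(dst)  # bank information refreshed for the write
    return read_bank, bank


flash_addr, ram_addr = 0x0200_0000, 0x0000_1000

print(copy_banks(flash_addr, ram_addr, locked=False, reload_when_locked=False))
print(copy_banks(flash_addr, ram_addr, locked=True, reload_when_locked=False))
print(copy_banks(flash_addr, ram_addr, locked=True, reload_when_locked=True))
```

In this model only the locked, non-reloading case ends up writing to the flash bank, and a RAM to RAM copy is unaffected either way because both phases use the same bank.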
In my tracking down of this issue I also stumbled upon a duplicated instruction I could remove which was great as I found I need an extra long to fix this special case anyway. I often seem to get lucky like that.
I'll add it to the next release but if anyone needs to fix this in their local build, make the two line change to hyperdrv.spin2 as shown below.
            if_nz   jmp     #moretransfers          ' a b c  more transfers still to go
' REMOVE THIS LINE ----->>  wrlut   #0, ptra[8]
                    tjz     link, #listcomplete     ' a b c  test link for next request
                    rdlong  pa, ptrb[-1]            ' check if list has been aborted by client
                    tjns    pa, #listcomplete       ' will exit if it has
                    wrlong  link, ptrb              ' setup list next pointer
                    altd    id, #id0                ' compute COG's state address
                    bitl    0-0, #LIST_BIT          ' clear list flag for this COG
            _ret_   push    #poller                 ' we will return to polling
moretransfers
{ ADD THIS LINE ---->> }    getbyte request, addr1, #3  ' prepare request
                    testb   d, #FLASH_BIT wz        ' test if flash bank is being accessed
Update: @TonyB_ after retesting with this fix, I see only 8 clocks improvement as I sort of initially expected with the RDFAST optimisation, tested using a single isolated burst write. Seems the previous test was not valid.
Yeah I'm basically bringing one together at the moment. I am integrating my video driver samples to use the new memory driver before releasing the update for that too. In the meantime if you dig through these posts you might find an example of at least a binary that demonstrated its use with some graphics...here was one I did earlier: http://forums.parallax.com/discussion/comment/1494843/#Comment_1494843
( Fit HyperRAM module on pins P32-47 and VGA module on pins P0-7)
I did see some visible pixel noise issue yesterday with random memory startup data and I want to track that down first. It's been clean before with image data so I am worried I might have tweaked something recently while not using video that messed something up or I've done something different with the PLL in my video example that has somehow introduced jitter. Still investigating. Interestingly it appears to hit all resolutions/timings, it's like a mild fuzz on various pixels like bit errors. Maybe I'm one cycle off now and latching close to a transition...? It's weird, I hope I've not fried anything.
When I get to it later today, I'm going to try P16-31 and sysclk/2 to see if that helps with anything too. I hadn't written anything to the memory by this point so I'm wondering if the data being read back was even reliable. Maybe I was starving it of refreshes. Hopefully just something dumb.
I was able to dig into this more. Looks like this may be related to signal integrity issues at high speeds. I moved back to sysclk/2 and it thankfully went away. Hopefully v2 HyperRAM will be better for us there, given we won't need to overclock it when we run it faster and its I/O performance will probably be better. The problem shows up far more with vertical lines drawn, which causes lots of high frequency transitions, and in RGBI mode, where any bit errors would more often flip the colour rather than showing subtle gradient changes as in other modes, which may not be as easily observed. If you draw areas with solid colours (which is what a lot of my earlier work did) it doesn't seem to be visible. You probably need high frequency transitions with lots of bits changing at once to see it. Interestingly only certain vertical parts of the screen seem to be affected and not all. I don't know if it is related to some offset from the start of the video line, or if it is a particular colour transition. If it was a colour transition I'd expect it to run all the way down the screen in areas where the colour is the same, but it is sometimes only a smaller portion of it. Maybe P2 jitter is causing this if the memory to P2 path delay is constant but the P2 frequency is very slightly changing near the actual signal transition for small durations in time...?
I think sysclk/1 operation is going to need a pretty tight board layout to be in any way reliable/usable as anything other than video RAM, this is why it is good to have the memory on the base board and not running a long way from it via various connectors with different path lengths etc. That at least keeps things more well constrained. Another improvement might be to try to use the RWDS clock to clock the data in with some external DDR latch sitting between the RAM and the P2, but I don't know what HW device if any even supports that. That is probably the whole purpose of RWDS during the read phase in the first place, as a source synchronous clock of sorts.
This is weird, my old demo which also uses sysclk/1 reads with an older HyperRAM driver does NOT seem to exhibit this same issue with vertical lines... I now think I'm going crazy here and second guessing myself. I tested with SVGA @200MHz and 1080p @297MHz and both seemed to look clean/stable on my monitor with none of the artefacts mentioned above. So something probably has changed in the software to cause this now...
Could anyone with a VGA monitor and HyperRAM board on their P2-EVAL try these binaries out to see if the drawn image is also clean for them or if they see noise on the vertical lines once it is all drawn and stable? This is based on my old now outdated driver which I had believed worked okay there.
For these setups I've put the HyperRAM board down on P16..31 to try to help performance and the VGA breakout expansion is still at P0..7 so that is where they should be fitted.
Also it had a USB driver incorporated for the breakout on P48-P56 so just keep those pins free of other types of devices, but it is not needed in this demo.
Maybe at some point, but I can test other things in the meantime.
Using the 1080p demo I just posted in the prior zip and with the magnifying glass close up to my monitor I'm seeing a very slight amount of noise only in the leftmost 5-10% of the screen on some vertical lines. The rest is really clean. I'm now wondering if that is some type of analog noise on the VGA output. If I could do digital over DVI I might see if that noise can be eliminated. I'll probably have to drop back down to a lower resolution for testing that though. My SVGA example seems totally clean at 200MHz with sysclk/1 on my good Sony monitor.
Ok, so I took my old code with VGA resolution and boosted it up to 252MHz from 200MHz and the problem becomes readily apparent (you can easily see in this binary vga demo). It was also there in the DVI output version at this same resolution so it is probably not analog noise getting onto the video pins.
At least based on this result by itself I do think this is probably an overclock issue at sysclk/1. Interestingly at 297MHz it was cleaner than this. Maybe the delay timing can be tweaked to assist 252MHz if that was near a transition point...
If the streamer's NCO is producing a pixel clock that's not a clean fraction of the sysclock then that always produces varying amounts of noise, the worst being distinct pixel crawl.
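A small Python model shows why a non-integer clock ratio leads to uneven pixels. This is an idealised fractional divider using exact fractions, not the streamer's actual NCO hardware, and the pixel clocks used (40 MHz and 25.175 MHz) are illustrative assumptions that do not come from this thread.

```python
from fractions import Fraction


def pixel_widths(sysclk_hz, pixclk_hz, n_pixels=1000):
    """Return the set of pixel widths (in sysclocks) that an ideal
    phase accumulator produces when dividing sysclk down to pixclk."""
    step = Fraction(pixclk_hz, sysclk_hz)  # phase added per sysclock
    phase = Fraction(0)
    widths = set()
    clocks = 0
    pixels = 0
    while pixels < n_pixels:
        clocks += 1
        phase += step
        if phase >= 1:        # accumulator rolled over: next pixel starts
            phase -= 1
            widths.add(clocks)
            clocks = 0
            pixels += 1
    return widths


print(pixel_widths(200_000_000, 40_000_000))  # every pixel is 5 clocks wide
print(pixel_widths(252_000_000, 25_175_000))  # a mix of 10 and 11 clocks
```

When the ratio is an exact integer every pixel gets the same number of sysclocks; otherwise some pixels are one clock longer than others, which is the kind of irregularity that can show up on screen as crawl.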
Comments
Turns out I have 2 instructions free in COGRAM, so I can setup a constant with $80000000 and not use RDFAST #0, reg
I might try to time this to see how much it saves for burst writes. I guess it could be one hub window cycle or so.
There's this already but if there wasn't then any unconditional instruction hence opcode[31]=1 in cog RAM with opcode[13:0]=0 would do the job.
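The condition stated here is easy to check off-chip. The Python helper below tests a 32-bit long for exactly the two properties named in the post (bit 31 set, bits 13:0 clear); the function name is made up, and whether the remaining bits matter for RDFAST is not covered by this sketch.

```python
def usable_as_nowait_operand(value):
    """True if a 32-bit long has bit 31 set and bits 13:0 all zero,
    the two properties called out in the post."""
    value &= 0xFFFFFFFF
    bit31_set = (value >> 31) & 1 == 1
    low14_clear = (value & 0x3FFF) == 0
    return bit31_set and low14_clear


print(usable_as_nowait_operand(0x80000000))  # True: the explicit constant
print(usable_as_nowait_operand(0x80000001))  # False: low bits not zero
print(usable_as_nowait_operand(0x00000000))  # False: bit 31 clear
```

Running something like this over a listing of the assembled cog image would be one way to hunt for an existing long that can double as the constant.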
eg. single HyperRAM read request returns this, but should only take a microsecond or so. I must be overlooking something obvious...this should be easy enough to achieve.
lasttime = 2055634353
Disassembly of inline ASM looks okay too I think
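To see how far off that value is, the counter arithmetic can be checked in Python. This is a sketch of the usual way to difference two samples of a free-running 32-bit counter and convert the result to time; the 200 MHz clock is an assumption here (it is the figure quoted elsewhere in the thread), and the helper names are made up.

```python
MASK32 = 0xFFFFFFFF


def elapsed_ticks(start, end):
    """Difference between two samples of a free-running 32-bit counter,
    correct even if the counter wrapped once between them."""
    return (end - start) & MASK32


def ticks_to_us(ticks, clk_hz):
    """Convert counter ticks to microseconds."""
    return ticks * 1_000_000 / clk_hz


CLK_HZ = 200_000_000  # assumed P2 clock

# Wrap-around case: start just before the counter rolls over.
print(elapsed_ticks(0xFFFFFFF0, 0x00000010))  # 32 ticks

# The value printed by the test harness, taken as a tick count.
print(ticks_to_us(2055634353, CLK_HZ) / 1e6, "seconds")
```

Read as a duration, 2055634353 ticks is on the order of ten seconds rather than a microsecond, so it looks more like a raw counter sample or a stale value than a difference of two nearby samples.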
Based on the general equation for event-triggering recognition, as posted by Chip at:
https://forums.parallax.com/discussion/comment/1466019/#Comment_1466019
"event = (((C ? INB : INA) & D) == S) ^ !Z "
the same waiting loop can be coded in other terms, similar to the ones you used in your code, just a little different, as follows:
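Separately from the PASM listing, the quoted expression itself can be evaluated off-chip. The Python sketch below is a direct transcription of that one line, nothing more: it says when the compare term is true for a given INA/INB snapshot and does not model when the hardware actually latches the event. The function name and the example values are assumptions; P56 is taken to be bit 24 of INB (56 - 32).

```python
def pat_condition(ina, inb, c, z, d, s):
    """Evaluate (((C ? INB : INA) & D) == S) ^ !Z for one pin snapshot."""
    pins = inb if c else ina
    return ((pins & d) == s) ^ (not z)


P56_MASK = 1 << (56 - 32)   # P56 lives in INB, bit 24

high = P56_MASK             # INB snapshot with P56 high
low = 0                     # INB snapshot with P56 low

# Z = 1 selects "equal": true while the masked pins match S.
print(pat_condition(0, high, c=1, z=1, d=P56_MASK, s=P56_MASK))  # True
print(pat_condition(0, low,  c=1, z=1, d=P56_MASK, s=P56_MASK))  # False

# Z = 0 selects "not equal": the same D and S now match the low state.
print(pat_condition(0, low,  c=1, z=0, d=P56_MASK, s=P56_MASK))  # True
```

So with the mask and match values left alone, flipping Z alone is enough to switch between watching for P56 high and watching for P56 low.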
Mike
Another tribute to your efforts! Congrats!
Two versions in the zipfile:
- svga.binary generates SVGA @ 200MHz
- fullhd.binary generates 1080p @ 297MHz.
Same HyperRAM+VGA pinout as above.
I'll give those binaries a try out ...