Ah, I did my command and address phases differently to you in my testing. I used bit-bashing for both, only engaging the streamer for the data phase.
This means, because of the required extra latency clocks after address phase, there was never any close instruction timing between smartpin and streamer starting. It's several sysclocks after I've started the clock smartpin before the streamer is needed to kick in. This matters because it makes it really simple to be sysclock exact just by having an adjustable WAITX in between the two instructions.
I suspect you've ended up with the WXPIN #2 to give you a +1 delay on the clock so it lines up with start of streamer output when the two instructions are packed one after the other.
Here's a binary you can play with to test out your memory module and to whet your appetite. I will be releasing the rest of the driver source very soon (hopefully by tonight/tomorrow).
200MHz P2 sysclk/1 operation, fit your HyperRAM module on P32-47 of the P2-EVAL.
115200 baud rate to console.
This allows you to flash/dump memory etc and experiment a bit. Ideally at some point someone may extend/rework this to add reading from SD card and flashing data from there too.
loadp2 -t memtest.binary
( Entering terminal mode. Press Ctrl-] to exit. )
Memory driver started, P2 Frequency = 200000000
External Memory Driver Test Tool, ESC aborts at any time
Commmands:
[D] = Dump memory, space continues
[R] = Read memory
[W] = Write memory
[F] = Fill memory
[M] = Move memory
[C] = Compare memory
[E] = Erase Flash
[B] = Blank check
[P] = Program Flash
[S] = Show settings
[G] = Generate Random data
[Q] = Quit
Enter command (?=HELP) :
Thanks for that. Yeah your 22pF mod has shifted things about, you would need a tweak to the timing profile if you wanted to use that HW.
Huh, looking back at my old log files I see I started investigating using a 10 pF cap in place of the 22 pF. It looked like it was okay. I should do some more testing of that ...
EDIT: Oh, I also discovered that P32 as the base pin on the RevB Eval Board sucks. Even with the 22 pF cap, writes at sysclock/1 still can fail at certain bands, namely around 120 MHz and 240 MHz. Best not to use P32 unless you have the RevC Eval Board.
What difference between the Rev B and Rev C Eval Boards do you think caused this?
Yeah, 64kB scratch space in HUB to play with so you don't trash the actual application, though I don't really fully check for all overwrites so be wary.
With the revC Eval Board, Von has ensured the tracks are the same length and spacing within each 16 pin group to ensure the best high speed timings. This wasn't done for revB, so it's only fortuitous that using P0 or P16 as the base pin works well for this on revB.
I like to test with P32 myself because as you say it is the worst group of the 3. Despite that I can still get over 300MHz reads and can do 1080p 8bpp with it with sysclk/1 rate reads on this port, though I do stick to sysclk/2 writes.
Anyone got a simple way to update the DACs from SPIN? I have another demo which in theory streams audio in and out via HyperRAM and I'd like to see if I can output audio samples.
I really just need to add the init code and the code to write a new sample to the DACs. That's the missing bit.
Presumably, Ahle2 has done plenty of DAC work. Not sure if he's using Spin or Pasm for the low level though. I wouldn't be surprised if he's using a streamer and software mixing to a looping buffer in hubRAM. Two DACs from one streamer would require interleaving of the data for the two channels.
On the other hand, he may have opted to use the hardware dither provided by the smartpins. For that he'd probably want to use an interrupt for pacing the samples to the smartpins.
Hmm, even Chip's demos of this type use Pasm for the low level routines. Everything I've done is pure Pasm.
I haven't had my hands on a P2 or even an FPGA running a P2, but just looking at Chip's code snippets and other PASM code, it seems to be quite a rich assembly language. I also love the fact that PASM doesn't have to use its own COG. Do you find it to be a huge improvement over P1 PASM?
I bought some bits but never really programmed the prop1.
Yeah, there are a ton more instructions, even for more regular logic and maths. Chip kind of went to town on special instructions for managing the extra hardware resources. Some might say those should have been memory mapped special registers instead but, like you say, it does give an expressive richness that you don't get when it's just an endless bunch of MOVes. I think the main reason for the reorientation away from addressable special registers was that it would chew up far too many cogRAM addresses.
Using the information I found in that link above from Ahle2 I've now got audio working in SPIN2 in a simple HyperRAM demo using a small amount of inline PASM. Scoping the audio doesn't show jitter and the tones sound pure.
It does the following:
- Main COG launches the HyperRAM driver
- Main COG spawns an audio COG giving it the highest priority in the HyperRAM driver
- Main COG spawns 5 round-robin polled dummy load COGs
- Main COG gives itself 2nd highest priority in HyperRAM driver
- Main COG alternates sending 2 tone buffers into HyperRAM on demand from Audio COG using WAITATN
- Audio COG reads double buffered data from HyperRAM one sample at a time
- Audio COG outputs sound via DACs and waits for sample period to expire
- Audio COG notifies main with COGATN whenever it needs more data
- Dummy load COGs read random HyperRAM bytes at random addresses non-stop
Here's the binary and demo code:
P2-eval A/V board goes on P0-7, HyperRAM on P32, Baud = 115200
Update: Realized I have the LED code for 5 pins from P56-P60 still in place. This will toggle SDCS, DO, DI signals so be wary if you have the SD card present and it drives DO. There is no clock driven so that may save us. Not sure how safe it is to drive all these LEDs with random patterns in general.
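For anyone trying to picture the double-buffer handshake in the list above, here is a rough host-side model in Python. It is not the attached demo code and it runs on a PC, not a P2. The function name `run_demo`, the buffer length, the tone labels, and the choice to prime both buffers before playback starts are all assumptions made for illustration. The COGATN/WAITATN signalling is reduced to a plain function call.

```python
def run_demo(num_samples, buf_len, tones):
    """Model an audio consumer draining two alternating buffers.

    Hypothetical model only: 'refill' stands in for the main COG writing a
    tone buffer into HyperRAM after the audio COG signals it (COGATN on the
    real hardware). Returns the samples played and the number of refills.
    """
    buffers = [None, None]
    refills = 0

    def refill(which):
        # Main COG side: alternate between the available tones.
        nonlocal refills
        tone = tones[refills % len(tones)]
        buffers[which] = [tone] * buf_len
        refills += 1

    # Assumption: both buffers are filled once before playback begins.
    refill(0)
    refill(1)

    played = []
    active, pos = 0, 0
    for _ in range(num_samples):
        # Audio COG side: read one sample at a time from the active buffer.
        played.append(buffers[active][pos])
        pos += 1
        if pos == buf_len:
            # Buffer drained: ask for more data, then switch to the other one.
            refill(active)
            active ^= 1
            pos = 0
    return played, refills


played, refills = run_demo(num_samples=6, buf_len=2, tones=["A", "B"])
print(played, refills)
```

The point of the model is only that the consumer never waits on the buffer it is currently reading: the refill always targets the buffer that has just been drained.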
I hacked up a quick test for looking at write performance in 1080p 8bpp mode from Fastspin code and am using the P2 to measure its own performance and plot into its own 1080p frame buffer and display it.
The P2 is clocked at 297MHz, using sysclk/1 reads and sysclk/2 writes with HyperRAM. The burst size is set to 640 bytes. This is the number of bytes the COGs can fit in a 4uS interval at sysclk/1 rates with some safety margin when the overheads are factored in. At sysclk/2 it is halved to 320 bytes for writes in this case.
The horizontal axis represents the number of bytes written in a single burst write request.
The vertical axis represents the achieved write request rate (red), and the resulting write bandwidth (cyan)
The scale is not shown (I still need to port my graphics font drawing code for that), but is this:
- 100 bytes wide per major tick interval
- 20000 requests/second per major tick interval for red request rate trace, so the peak is around 180000 requests/s for small requests.
- 10MiB/s for bandwidth per major tick interval for the cyan trace.
The visible peak in this photo is around 32MiB/s on the right. The image is truncated, though, and the bandwidth still increases further beyond this.
You can see the sweet spots as the 320 byte bursts fill up and then drop as the burst is split and another fragment needs to be sent. Different video modes and colour depths will have different performance levels, and the video line timing becomes very important because the number of requests from the non-video COGs that is possible to achieve is the number that can be started in the idle time before the next scan line request.
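A toy Python model of the sawtooth may help here. It only shows how a request gets cut into fragments at the burst limit and why the cost jumps just past each multiple of 320 bytes. The function names and the fixed `overhead_bytes` per fragment are made up for illustration and are not measured values from the driver.

```python
def split_into_bursts(size, burst_limit):
    """Return the fragment sizes a request of `size` bytes is cut into."""
    full, rest = divmod(size, burst_limit)
    return [burst_limit] * full + ([rest] if rest else [])


def relative_bandwidth(size, burst_limit, overhead_bytes=100):
    """Payload divided by total cost, with cost counted in byte-times.

    overhead_bytes is a hypothetical fixed cost per fragment (setup,
    latency clocks, polling); the real figure depends on the driver.
    """
    frags = split_into_bursts(size, burst_limit)
    cost = sum(f + overhead_bytes for f in frags)
    return size / cost


for size in (320, 321, 640, 641):
    frags = split_into_bursts(size, 320)
    print(size, frags, round(relative_bandwidth(size, 320), 3))
```

In this model a 321 byte request needs a second fragment that carries one byte but still pays the full per-fragment overhead, which is the drop visible after each sweet spot.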
Multiple COGs could possibly improve this more, as could PASM based COGs because some of the measured overhead is in the SPIN code itself, though it should be quite small with Fastspin. Here was my test loop, it's as tight as it could be really:
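' sweep the size of a single write request
repeat size from 1 to 1900
    count := 0
    ' count how many write requests complete in a 100 ms window
    timeout := _clkfreq/10
    timeout += getct()
    repeat
        mem.write(0, 3000000, size)
        count++
    until pollct(timeout)
    ' plot the request rate, then the resulting bandwidth
    mem.fillBytes(size+(1000-(count/20))*xmax, PLOTCOLOR, 1, 0)
    mem.fillBytes(size+(1000-(count*size/scale))*xmax, PLOTCOLOR2, 1, 0)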
Actually scrap that about the 4us only being able to send 640 bytes. I actually had the COG's limit set to 640 bytes (320 for writes) in the test above, and the RAM device burst limit is actually 1024 bytes for sysclk/1 which is 3.45us of useful transfer in 4us (86.2% bus efficiency). Once I increased the COG's burst limit beyond 512, I see increased performance, up to around 36MiB/s at 1kB burst sizes and increasing slowly.
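The 3.45us and 86.2% figures can be checked with a few lines of Python. This assumes sysclk/1 moves one byte per sysclock, which is what the quoted numbers imply, and it counts data time only (no command, address or latency clocks). The constant and function names are just for this sketch.

```python
SYSCLK_HZ = 297_000_000   # P2 clock used in the test above
DEVICE_WINDOW_S = 4e-6    # HyperRAM per-transfer limit quoted above


def burst_time_s(nbytes, sysclks_per_byte=1, sysclk_hz=SYSCLK_HZ):
    """Time spent moving data only; protocol overhead is excluded."""
    return nbytes * sysclks_per_byte / sysclk_hz


t = burst_time_s(1024)
efficiency = t / DEVICE_WINDOW_S
print(f"{t * 1e6:.2f} us, {efficiency:.1%}")  # prints "3.45 us, 86.2%"
```

With `sysclks_per_byte=2` the same window holds half as many bytes, which is where the halved write limit comes from.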
The lower value of the memory device limit and the per COG limit takes effect in my driver. Hyperflash doesn't have a device limit per transfer so we set it to the streamer limit of ~64kB. HyperRAM has a device limit of 4us per transfer. The COG limit is used to restrict the number of bytes a low priority COG can send in the presence of real-time COGs such as video/audio. It is used for QoS. If you set it too high it can impact the real-time services' requests by delaying them. Setting it too low reduces performance by excessively fragmenting the bursts and yielding to other COGs' requests more often.
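As a sketch of that rule in Python (not the driver's actual code): the effective limit is the smaller of the two, and the data time of one maximum-size burst gives a rough idea of how long a real-time COG could be held up behind a low priority transfer that has already started. The function names are hypothetical, 64 * 1024 stands in for the "~64kB" streamer limit, and overheads are ignored.

```python
def effective_burst_limit(device_limit_bytes, cog_limit_bytes):
    """The lower of the device limit and the per-COG limit takes effect."""
    return min(device_limit_bytes, cog_limit_bytes)


def worst_case_block_us(limit_bytes, sysclk_hz, sysclks_per_byte=1):
    """Data time of one maximum-size burst in microseconds (no overheads)."""
    return limit_bytes * sysclks_per_byte / sysclk_hz * 1e6


HYPERRAM_LIMIT = 1024         # device limit at sysclk/1, from the post above
HYPERFLASH_LIMIT = 64 * 1024  # approximation of the ~64kB streamer limit

for name, device in (("HyperRAM", HYPERRAM_LIMIT), ("HyperFlash", HYPERFLASH_LIMIT)):
    limit = effective_burst_limit(device, cog_limit_bytes=640)
    print(name, limit, round(worst_case_block_us(limit, 297_000_000), 2))
```

Raising the COG limit shortens the list of fragments per request but lengthens this blocking time, which is the QoS trade-off described above.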
The numbers above indicate with a single HyperRAM you could fully update the 1080p screen and flip it at a rate of about 17Hz. If sysclk/1 writes were achievable this would about double. If you have a dual HyperRAM setup I expect it will be able to update 1080p 8bpp at around 60Hz with sysclk/2 writes. You will just need COGs that can generate the data fast enough.
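The frame rate estimate is simple arithmetic, shown here in Python. A 1080p 8bpp frame is 1920 x 1080 bytes, and the two bandwidth values are the roughly 32MiB/s and 36MiB/s write figures mentioned earlier. The function name is just for this sketch.

```python
FRAME_BYTES = 1920 * 1080   # 1080p at 8bpp, one byte per pixel
MIB = 1024 * 1024


def full_frame_updates_per_s(write_mib_per_s):
    """How many complete frames per second a given write bandwidth can fill."""
    return write_mib_per_s * MIB / FRAME_BYTES


for bw in (32, 36):
    print(bw, round(full_frame_updates_per_s(bw), 1))  # 16.2 and 18.2
```

Those two results bracket the "about 17Hz" figure, and doubling the bandwidth doubles the rate.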
I tidied up the code and tried to run my demos in PNut. I found that PNut needs a :1 on some method pointers that return results when they are used as callback functions, and adding that then broke the Fastspin build of my driver. This is the last thing to sort out before release. I have posted a question about it in the Fastspin thread, which hopefully Eric will shed some light on.
I really want my driver to work in both Fastspin and PNut without two driver versions. The demos work in both setups at least. If there was an #ifdef I could work around it.
Update: I have figured out a possible workaround which uses another argument as a pointer to a long to return the result in, instead of using a return value from the callback method. It seems to at least compile on both platforms now, still testing.
First things first: I'll need to go, just for a while, to have a meal, coffee (lots of it), and, perhaps, another pair of eyeballs-and-glasses, combo-pack style!
Thanks. One intent of this memory driver was to ensure it can be configured to allow real-time video and audio to be prioritized over other requests while still providing some access for other COGs to share the memory. By constraining burst sizes and controlling request servicing order you can bound service latency. That is, you can provide some QoS (quality of service). It's a very similar problem to solve in the packet network world where I've come from.
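To illustrate the servicing order idea, here is a small Python model of one possible policy: strict priority for the real-time COGs, then round-robin polling for everyone else, as in the audio demo arrangement. It is a sketch of the concept, not the driver's implementation, and the function name, COG numbers and data structures are all hypothetical.

```python
def pick_next(pending, priority_order, rr_order, rr_start):
    """Choose which COG's request to service next.

    pending:        set of COG ids that currently have a request waiting
    priority_order: real-time COG ids, highest priority first
    rr_order:       remaining COG ids, polled round-robin
    rr_start:       index in rr_order where polling resumes
    Returns (cog_id, next_rr_start), or (None, rr_start) when idle.
    """
    # Real-time COGs always win, in strict priority order.
    for cog in priority_order:
        if cog in pending:
            return cog, rr_start
    # Otherwise poll the rest, starting after the last one serviced.
    n = len(rr_order)
    for i in range(n):
        idx = (rr_start + i) % n
        if rr_order[idx] in pending:
            return rr_order[idx], (idx + 1) % n
    return None, rr_start


# Hypothetical layout: COG 1 = audio, COG 0 = main, COGs 2..6 = dummy loads.
print(pick_next({1, 3}, [1, 0], [2, 3, 4, 5, 6], 0))
print(pick_next({3, 5}, [1, 0], [2, 3, 4, 5, 6], 2))
```

Combined with a cap on burst size, a policy like this means a real-time request waits for at most one in-flight burst before it is serviced.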
One other thing I should have mentioned in the release notes that was still untested is the multiple HyperRAM modules thing. It is designed for it, but hasn't fully been exercised yet as I still only have the one module. When I can travel more than 5km again I will try it out.
Oh, that lockdown is long. Things are pretty relaxed here although masks are certainly more common now. My new work requires us to wear them because we're classed an essential business.
Auckland got more roughed up, one death, with the quarantine failure but that is all but sorted again. The security guards now replaced with soldiers. I gather that's what happened in Melbourne too.
Roger, I'm looking for the number of latency clocks that are used to read the hyperRAM, including the registers. Where's that located in the sources? I can see it used in the driver from "pinconfig" top 8 bits but it gets really complicated reversing out from there.
Our hotel quarantine program over here in our state failed spectacularly due to lax and untrained contractor arrangements and no-one wanting to be accountable for managing it. Every politician has tried to get as far away from it as they could in the resulting recent inquiry about it given the many hundreds of deaths and other heavy economic losses directly resulting from its failure, according to genomic data.
From our Department of Health tracking:
https://www.dhhs.vic.gov.au/tracking-coronavirus-victoria
"More recent data indicates that for the 1,589 cases sequenced from cases with symptom onset from 14 July to 14 August, all but 12 were linked to Rydges. The other 12 cases are linked to the Stamford Hotel cluster. It is likely that 99% of current cases of Covid-19 in Victoria have arisen from the Rydges or Stamford Plaza hotels."
We are supposed to get below 5 cases a day over a 14 day average before we can travel beyond 5km or leave home for anything other than the 4 allowed reasons (food, exercise, work, care). Who knows if that is achievable or how long it will take... by Nov/Dec maybe, if we are lucky? It's still probably hovering around 10-15 cases or so at the moment. Hopefully we can go to the beach then, when the weather is hot. Lockdown life sucks.
Update: found this info which I might be able to use:
https://forums.parallax.com/discussion/comment/1486118/#Comment_1486118
Let me know if you hit any major issues. But drop to sysclk/2 reads first to see if that helps if you do encounter corrupted data.
Nice clean sound. Well done!
First thing I tried was removing all the clock smartpin dis/enable instruction pairs. Didn't work. Oh well, I suppose you'd tried that already.
Yep. Been there done that.
Yeah, we've been in one form or another of our second lockdown since July 8, and we only had our 8pm/9pm curfew lifted today, when a Supreme Court case against it is due to start. The curfew, which neither police nor health authorities asked for, may possibly have been illegal.
https://www.theaustralian.com.au/breaking-news/fought-tooth-and-nail-to-conceal-vic-knew-curfew-could-be-illegal-court-told/news-story/aa718fa733d7c06a765c5efe1f0e5dcc