reSound - A sound driver and mixer for the P2


Comments

  • For some uses, say background music for a game, the tunes could be compiled in a sense, ordered, and set up for burst reads into a HUB buffer. Other misc effects get played, from the HRAM, at one-frame precision during VBLANK. That's just one example, but where that chaos is known, it's likely the problem can be solved or profiled in some way.

    Should be very interesting.

  • rogloh Posts: 2,256
    edited 2020-02-07 - 00:58:36
    Yeah that's right Ahle2, it is not "random" as such, it is deterministic. It can just be non-sequential if played back at pitches > 2x the normal rate. Hub FIFO caches will help of course, but how big do these FIFOs need to be for each channel, I wonder? For very high quality samples they could start to get rather large. Now if they can all be read and fit in hub RAM while being played back, that could still work with HyperRAM. Which COG is responsible for managing the requests, deciding when to access them, and where the sample data is to be stored in hub: the player COG, the mixer COG, the HyperRAM COG, or a combination? I'd not really expect different audio channel caches to be fully managed by the HyperRAM COG. Right now it acts on memory transfer requests of a given size to a given destination, to/from hub RAM using HyperRAM, so the client request logic determines what it reads and where it puts it. I think this mixer COG might have to manage these audio queues itself, accessing them from hub RAM when it can and reading new data from external memory when it needs to. I don't know how it will decide when it needs to read new data.
  • rogloh Posts: 2,256
    edited 2020-02-07 - 00:59:05
    Actually it might be possible to have two overlapping audio buffers in hub RAM per input channel to hold the samples and ping-pong between them. When you see you are nearing the end of the first sample buffer, you trigger a HyperRAM read which runs in the background and populates the next batch of samples into a second hub RAM buffer, then switch to reading from this second buffer when you need to, and repeat the process. That way you have much more time to do this work and can wait longer to collect the result when it competes with concurrent HyperRAM video transfers. This may be a good way to go.

    Update: One thing to consider is that the COG requesting these audio sample read bursts may like to generate multiple requests, one per audio input channel, during the same sample interval, rather than waiting for the read burst result to come back before it can request another read burst of samples which would hold it up. This is where a list of requests may be desirable. Right now the HyperRAM COG does not support such a concept and just works on one mailbox request at a time before checking the same mailbox again, but it might be possible to add this if something like this is needed for audio.
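The ping-pong scheme described in this post can be sketched in C. The buffer size, structure layout, and the fill-request hook below are all illustrative assumptions, not the actual driver's code:

```c
#include <assert.h>
#include <string.h>

#define BUF_SAMPLES 256   /* hypothetical size of each hub half-buffer */

/* One channel's ping-pong state: two hub buffers, a read cursor, and a
   flag recording that a background HyperRAM fill has been requested. */
typedef struct {
    short buf[2][BUF_SAMPLES];
    int   active;        /* which half is currently being read   */
    int   rd;            /* read index into the active half      */
    int   fill_pending;  /* refill of the other half requested?  */
} pingpong_t;

/* Hypothetical stand-in for posting a HyperRAM read-burst request
   that completes in the background. */
static void request_hyperram_fill(pingpong_t *pp, int half) {
    (void)half;          /* a real driver would queue (pp, half) here */
    pp->fill_pending = 1;
}

/* Pop one sample. Halfway through the active buffer, trigger a
   background fill of the other half; when the active half is
   exhausted, swap halves and repeat. */
short pingpong_next(pingpong_t *pp) {
    short s = pp->buf[pp->active][pp->rd++];
    if (pp->rd == BUF_SAMPLES / 2 && !pp->fill_pending)
        request_hyperram_fill(pp, pp->active ^ 1);  /* early trigger */
    if (pp->rd == BUF_SAMPLES) {                    /* swap halves   */
        pp->active ^= 1;
        pp->rd = 0;
        pp->fill_pending = 0;
    }
    return s;
}
```

Triggering the fill at the half-way mark, rather than at the end, is what buys the extra time to absorb contention with video transfers.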
  • evanh wrote: »
    Ah, the advantage of a ratio of 8:1 is there are four sysclocks of lag from the SPI clock rise to the SPI tx data out appearing at its pin. So 8:1 perfectly delays the tx by half the SPI clock period, which is nicely bang on the low-going SPI clock edge. Exactly as desired.

    At 6:1, the tx pin transitions one sysclock after the SPI clock low, leaving still two sysclocks of tx data setup before the next rising SPI clock.

    At 5:1 you're down to a single sysclock of setup.

    Oh, something of a breakthrough with this. If I change the SPI clock from idling low to idling high then, instead of using the prior rising clock edge, the prior falling edge gives a whole SPI clock period of lead. That means that at a ratio of just 4:1 it can still be bang on ideal. And 3:1 is possible.

  • One more issue that is still not resolved in this discussion is that even if HyperRAM read bursts are cached or ping-ponged etc. (we can just think of this as read-ahead), we still have the issue of starting a new sample as soon as we are notified of it changing. Right now @Ahle2's excellent mixer can begin a new sample at the next sample clock; there is basically only a one sample interval delay at worst before the new sound starts being output. But HyperRAM will always have some delay in getting the result when a new sample is to be loaded and played, and this could well be more than a sample period. Possibly something in the order of a video scanline's worth of latency before any read results start to appear. We almost need the driver to buffer a handful of samples (let's say 50 microseconds' worth) internally somewhere, so that there can be some lag between issuing the request for new sample data from HyperRAM and when it is really required to be output.

    There's plenty of room in the COG for holding a few extra samples in an output FIFO of premixed audio data, which should allow this to work. Also, if I2S output from this COG is added at some point, some small amount of extra buffering may be desirable for that purpose anyway if the streamer is involved in the work, assuming this small output FIFO is moved over to hub RAM. Or if smartpin modes can be used to clock out the I2S data, perhaps the streamer is not required and this output FIFO can remain in COG/LUT RAM.
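A quick way to size that small output FIFO is to convert a worst-case latency window into a sample count. This helper is a sketch; the rates and windows below are illustrative, not values from the driver:

```c
/* How many premixed output samples are needed to ride out a given
   worst-case latency window before fresh data arrives. Rounds up so
   the FIFO covers at least the full window. */
unsigned fifo_samples(unsigned rate_hz, unsigned latency_us) {
    return (unsigned)(((unsigned long long)rate_hz * latency_us
                       + 999999ULL) / 1000000ULL);
}
```

At 44.1 kHz a 50 microsecond window is only a few samples, so a tiny COG/LUT-resident FIFO is plausible.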
  • jmg Posts: 14,320
    rogloh wrote: »
    ...
    There's plenty of room in the COG for holding a few extra samples in an output FIFO of premixed audio data, which should allow this to work. Also, if I2S output from this COG is added at some point, some small amount of extra buffering may be desirable for that purpose anyway if the streamer is involved in the work, assuming this small output FIFO is moved over to hub RAM. Or if smartpin modes can be used to clock out the I2S data, perhaps the streamer is not required and this output FIFO can remain in COG/LUT RAM.

    I recall coding for a sound card some years back, and the only way I was able to get stable, click/artifact-free handling was to run a 4-phase version of a ping-pong buffer.
    A 2-phase one was not enough, given the latencies and the jitter in those latencies.
    The SW polled to check the READ and WRITE pointers were in opposite quadrants, so there was one quadrant's worth of tolerance built into the code.
    PCs are likely worse than the P2, but a P2 running video and audio and HyperRAM has many balls in the air too...
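jmg's 4-phase check can be sketched as follows: a power-of-two ring buffer is split into four quadrants, and the consumer only keeps reading while the read and write pointers sit in diagonally opposite quadrants, leaving a full quadrant of slack to absorb latency jitter. The buffer size and pointer handling here are illustrative:

```c
#include <assert.h>

#define RING_SIZE 1024u   /* must be a power of two (assumed size) */

/* Which quadrant (0..3) of the ring a pointer falls in. */
static unsigned quadrant(unsigned ptr) {
    return (ptr & (RING_SIZE - 1)) / (RING_SIZE / 4);
}

/* Safe to keep consuming only while the read and write quadrants
   differ by exactly 2, i.e. they are diagonally opposite. */
int quadrants_opposite(unsigned rd, unsigned wr) {
    return ((quadrant(wr) - quadrant(rd)) & 3u) == 2u;
}
```

The modulo-4 subtraction makes the check wrap correctly as both pointers advance around the ring.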

  • rogloh Posts: 2,256
    edited 2020-02-07 - 04:06:35
    Yeah jmg it will get complex sharing the video+audio+other COG requests. But with the right design I think it can still work for many cases.

    I just took a look at the Smartpin synchronous TX mode. I think this may be useful for stereo 32 bit I2S transmission directly from the COG - probably no need for the streamer if this works.

    If you don't already have an external clock source, you need to set up an output bit clock (BCLK) using another Smartpin and then read it back in as the B input for the Smartpin serial data output, which will clock out the next data bit that was loaded into the Smartpin earlier using WYPIN. In this particular synchronous mode it can run continuously, and it notifies the COG of completion of the transmission of the 32 bit (stereo) value by raising IN. This signal can then be used to replace the wait condition between samples.

    You'd also need to set up an LRCLK output at BCLK / 32, sent on another Smartpin whose rising edge is in phase with the falling edge of BCLK.

    These clock output pins could perhaps use transition output mode, if they can be restarted synchronously to avoid slipping the phase. Or another more suitable Smartpin mode may be used. The problem if you don't have an external clock is that the output may not be precisely the desired sample rate. E.g. a bit clock of 32 * 44100 = 1411200 Hz, but a P2 at 252 MHz can only synthesize 252 MHz / 178 = 1415730 Hz or 252 MHz / 179 = 1407821 Hz without jitter (EDIT: Hmm, maybe just even dividers only). It's close though, and you may not perceive pitches being off. The P2/video clock might be able to be tweaked down a fraction to help too (e.g. 251 MHz). Other NCO modes might be possible but may add some jitter. How tolerable that is would depend on the receiving device, I guess.
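The divider arithmetic above can be checked numerically. This sketch just reproduces the integer division and expresses the resulting pitch error in parts per million; it is a verification aid, not driver code:

```c
/* From a 252 MHz sysclock only integer dividers give a jitter-free
   clock, so the achievable BCLK rates bracket the ideal 32 * 44100 Hz. */
long long bclk_from_div(long long sysclk_hz, long long divider) {
    return sysclk_hz / divider;   /* truncating integer division */
}

/* Pitch error in parts-per-million relative to the ideal rate. */
long long ppm_error(long long actual_hz, long long ideal_hz) {
    return (actual_hz - ideal_hz) * 1000000LL / ideal_hz;
}
```

Both nearest dividers land within about 0.3% of the ideal rate, which supports the comment that the pitch offset may be imperceptible.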
  • evanh wrote: »
    Ah, the advantage of a ratio of 8:1 is there are four sysclocks of lag from the SPI clock rise to the SPI tx data out appearing at its pin. So 8:1 perfectly delays the tx by half the SPI clock period, which is nicely bang on the low-going SPI clock edge. Exactly as desired.

    At 6:1, the tx pin transitions one sysclock after the SPI clock low, leaving still two sysclocks of tx data setup before the next rising SPI clock.

    At 5:1 you're down to a single sysclock of setup.

    Oh, something of a breakthrough with this. If I change the SPI clock from idling low to idling high then, instead of using the prior rising clock edge, the prior falling edge gives a whole SPI clock period of lead. That means that at a ratio of just 4:1 it can still be bang on ideal. And 3:1 is possible.

    VERY NICE! My sd driver idles high because it is beneficial to me. Glad to see it actually works out the best. I'm still trying to understand a context of multiple files and how that would work. I also have the idea of adding a command to load to LUT since in theory, it would be a faster path from cog to cog than through hub. With the new eggbeater hub it might not be that much of an improvement though... Not sure still.

    I don't have a scope, so verifying max SPI clock is just based on numbers. During initialization the smartpin clock WXPIN value is calculated as:
    PRI GetBitTime(max_clkfreq) : bt
        bt := ((clkfreq / max_clkfreq) / 2) + 1   ' SPI bit time
        if bt < 2                                 ' make sure at least sysclock/2
            bt := 2

    So perhaps that 2 should become a constant ratio? Maybe this is the only thing that needs to be changed, fingers crossed! I spent some time trying to clean up code for the next release and got stuck on another bug-fix.
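The suggested change can be sketched in C: the hard-coded minimum of 2 becomes a named ratio constant. MIN_RATIO = 4 here reflects evanh's finding for tx with an idling-high clock, but it is an assumption for illustration, not the driver's actual value:

```c
#include <assert.h>

#define MIN_RATIO 4   /* minimum sysclock-to-SPI-clock ratio (assumed) */

/* C rendition of the driver's GetBitTime calculation, with the
   minimum clamp expressed via MIN_RATIO instead of a literal 2. */
unsigned get_bit_time(unsigned clkfreq, unsigned max_clkfreq) {
    unsigned bt = (clkfreq / max_clkfreq) / 2 + 1;  /* SPI bit time  */
    if (bt < MIN_RATIO / 2)                         /* enforce floor */
        bt = MIN_RATIO / 2;
    return bt;
}
```

With MIN_RATIO = 4 this behaves identically to the original Spin code, so raising the constant later is a one-line change.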
  • evanh Posts: 9,442
    edited 2020-02-07 - 07:04:39
    A generic ratio calculation works fine with tx clocking on the rising SPI clock because the prop2's I/O stages guarantee four sysclocks of data hold time of the prior data bit for the SPI device. So, although it's leading by a half SPI clock and looks like the data is clocked too early at larger ratios, it's actually very reliable no matter how slow the SPI clock gets.

    However, this new extra lag compensation, allowed by inverting the SPI clock and using a whole leading SPI clock period, means that it's customised for the 4:1 ratio and can't be used for larger ratios.

    EDIT: Of course, I'm talking about tx smartpin only. Rx smartpin is happy with any ratio from 2:1 and up, and the clock pin can idle high or low. It doesn't care. I presume you've already found my post in the tricks and traps - https://forums.parallax.com/discussion/comment/1488948/#Comment_1488948

    EDIT2: These statements are all assuming the prop2 is the SPI bus master and clock source.
  • I found the bug that causes most portamento effects in the module player to behave crazily. It turns out that it needs a last-set-period memory. Very obvious when you think about it, but I thought it was okay since most tunes I tried used pitch bend primarily. It is fixed in my development code.

    Regarding a HyperRAM backend: just like rogloh said, I think it will do if we can make a double buffer (of big enough size) plus an interrupt callback routine to trigger per-channel independent HyperRAM reads (with a queue in the HyperRAM driver). It doesn't need any modifications to reSound; just register a callback. As long as the delay is less than ~1/20 second (50 ms), it is instant according to our brains.
  • rogloh Posts: 2,256
    edited 2020-05-04 - 22:40:17
    From your recent post in the Fastspin thread you mentioned this @Ahle2 (but I thought it best to answer here):
    Ahle2 wrote: »
    I want to prove that reSounds cog attention/interrupt/double buffer mechanism works as intended triggering a read from SD card on the back buffer when needed. I have made a working poll test to prove most of it, but I'm trying to get a fully working interrupt example running. Then I want to test reSound with multiple double buffers (mixed down to two stereo channels) of different sizes running at different sample rates and triggering interrupts asynchronously. The polling version has proven most of it, but I need to make a complete working example before I release the next version of reSound.

    My HyperRAM driver now supports request lists which can copy multiple arbitrary sized blocks of external memory into hub memory to/from any addresses in a single request. This may help you with playback if the amount of data is too large to fit in HUB or if the SD filesystem introduces a lot of latency for multi-channel streaming. If video is running and it's using an external frame buffer, the video COG will get the highest priority servicing but an audio COG can be the next in line to get serviced first before the non-realtime COGs are round-robin polled. There will be latency of up to one video scan line before the audio COG's request can begin to be serviced and each transferred sub-burst in each list entry will have to yield back to check for another video request, but if your code can buffer multiple audio samples per channel read and you are not expecting the full request list to be transferred every scan line then I think it might work out pretty well and not burden the requesting COG much (just think of it as a DMA engine). It is less likely to work so well if each audio request is only a single audio sample from lots of channels when video is running unless the audio sample rate is lower than video's horizontal sync rate, so some small local hub buffer per audio channel will probably still be warranted regardless.

    Another way to go if the music fits in HyperRAM but the SD is slow even with double buffering, is to map a filesystem over HyperRAM or (Hyper)Flash. If a sound file is copied into the HyperRAM or read from flash it would speed up access time and increase the transfer rate significantly. Seems like Rayman may already be doing some interesting work there too.
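A request list of the kind described above might look something like this in C. The field names and layout are hypothetical, not the HyperRAM driver's actual mailbox format:

```c
#include <assert.h>
#include <stdint.h>

/* One entry copies a single arbitrary-sized block between external
   (HyperRAM) memory and hub RAM; the driver walks a whole list of
   these as one mailbox request. */
typedef struct {
    uint32_t ext_addr;   /* external memory address      */
    uint32_t hub_addr;   /* hub RAM destination / source */
    uint32_t count;      /* bytes to transfer            */
} xfer_entry_t;

/* Total bytes a list of n entries will move; useful for checking
   that a list fits the per-scanline service budget described above. */
uint32_t list_total_bytes(const xfer_entry_t *list, int n) {
    uint32_t total = 0;
    for (int i = 0; i < n; i++)
        total += list[i].count;
    return total;
}
```

A requesting audio COG would fill one entry per channel and post the whole list, then continue working while the driver services it, which matches the "DMA engine" framing above.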
  • Ahle2 Posts: 1,039
    edited 2020-05-05 - 13:28:17
    Roger,

    I do not have a HyperRAM board to use with my EVAL board to test this out.
    if your code can buffer multiple audio samples per channel read and you are not expecting the full request list to be transferred every scan line then I think it might work out pretty well and not burden the requesting COG much (just think of it as a DMA engine).

    Yes, that is the whole idea. I can have up to 64 completely independent buffers of different sizes, running at different sample rates, with different sample formats (U8, S8, U16, S16, little/big endianness), and mix them independently to from 1 up to 8 output pins. The driver can "only" handle 32 double buffers though, with interrupt handling and all. I have a running interrupt example using CD-quality stereo audio from an SD card, compiled with FastSpin. It is the worst kind of hack, since FastSpin is not made for interrupt handling, but it proves that my driver works.

    Other stuff that is partly implemented, or will be very soon: a distortion effect, a multimode resonance filter, reverb, spatial surround... And there is room for more after that.

    Btw, You are doing an amazing job with your video driver and hyperRAM driver! :)
  • Wow, keep adding all your cool filter effects features Ahle2, your audio COG sounds like it will end up being very versatile when done.

    Having up to 32 input double buffers sounds great. With that many input streams, maybe there will be some opportunity for those multi-channel Scream Tracker S3M files to be replayed one day on the P2, in addition to the regular 4-channel mod files you've already got working.
    Btw, You are doing an amazing job with your video driver and hyperRAM driver! :)

    Cheers Ahle2. To be honest, the whole request list thing was basically prompted by thinking more about some streaming audio stuff after you released this reSound demo. I also liked the idea of a wavetable synth application that can make ready use of larger amounts of sample data in HyperRAM / HyperFlash and which may need low latency streaming from multiple address buffers in external memory during audio playback. So your code pretty much convinced me to add that whole request list feature to my driver, and now that it's finally in I'm glad: it can be used for other things like graphics copies too, and it frees requesting COGs to do other work in parallel while the driver manages its list.
  • Roger,
    Actually you could use up to 64 channels for FastTracker or ScreamTracker modules using this driver... (I just read on the wiki that FT2 and ST only support 32.) The 32-channel limit is for double buffers, which are needed only when streaming from memory outside of hub RAM. Pure sample playback from hub doesn't need filling of any buffers... Just point to a location in hub, set the number of samples to play back, set frequency and gain for all outputs. Fire and forget. This can be done simultaneously for up to 64 channels.

    I have a beta ready for release very soon... Do you want to test it out with your HyperRAM driver? You can have what I have got so far...
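The fire-and-forget playback Ahle2 describes might be modeled like this. The descriptor fields and function name are hypothetical illustrations, not reSound's actual API:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CHANNELS 64   /* per the post: up to 64 simultaneous channels */

/* Hypothetical per-channel descriptor for hub-RAM sample playback. */
typedef struct {
    const int16_t *samples;  /* hub address of the sample data */
    uint32_t       count;    /* number of samples to play      */
    uint32_t       freq;     /* playback sample rate in Hz     */
    uint8_t        active;   /* channel currently in use?      */
} channel_t;

static channel_t channels[MAX_CHANNELS];

/* Fire and forget: claim the first free channel and start it, or
   return -1 if all 64 channels are busy. */
int play_sample(const int16_t *buf, uint32_t count, uint32_t freq) {
    for (int i = 0; i < MAX_CHANNELS; i++) {
        if (!channels[i].active) {
            channels[i].samples = buf;
            channels[i].count   = count;
            channels[i].freq    = freq;
            channels[i].active  = 1;
            return i;
        }
    }
    return -1;
}
```

The point of the model is that no buffer refilling is involved: once a hub-resident sample is pointed at, the caller is done.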
  • rogloh Posts: 2,256
    edited 2020-05-08 - 02:45:37
    Ahle2 wrote: »
    Roger,
    Actually you could use up to 64 channels for FastTracker or ScreamTracker modules using this driver... (I just read on the wiki that FT2 and ST only support 32.) The 32-channel limit is for double buffers, which are needed only when streaming from memory outside of hub RAM. Pure sample playback from hub doesn't need filling of any buffers... Just point to a location in hub, set the number of samples to play back, set frequency and gain for all outputs. Fire and forget. This can be done simultaneously for up to 64 channels.

    I have a beta ready for release very soon... Do you want to test it out with your HyperRAM driver? You can have what I have got so far...

    Sounds great Johannes. I would definitely try it out when your code is ready and mine is also testable from Spin. In my case I am just sorting out the Spin driver at the moment figuring out a simple way to do the mapping of address ranges to a device and bus, which potentially allows management of multiple instances for systems with more than one Hyperbus in the future. The multiple buses capability itself might come later as I don't have the HW to test it, but I just don't want the API to have to fundamentally change...

    Given sufficient buffering in your audio driver, I'm hoping we can get decent video and audio simultaneously using the HyperRAM, with decent video repaint/update speed for a responsive GUI remaining. I already anticipate this should be achievable with a suitably clocked P2.
  • I think Parallax are just about to build the next HyperRAM boards; I have an order waiting on that batch.

    Perhaps we could persuade them to run a couple of "overruns" with dual HyperRAM, as opposed to the normal one flash and one HyperRAM.
  • This could be really interesting, mainly if it also includes using variable-latency-aware HyperRAMs, unlike the ones that were assembled during the first batch.

    Besides the obvious capability of improving access timings, it would enable experimenters to gather a better understanding of the self-refresh circuitry/logic embodied in newer devices.

    Things like extending read/write access cycles well over the present 4 µs limit and, at the same time, being able to take full control over the mandatory refresh operations could unleash the full potential of a better integration between the P2 and HyperRAMs.
  • Ahle2 Posts: 1,039
    edited 2020-05-17 - 13:54:36
    Hi all,

    I have uploaded the first beta release of reSound to the top post... I think the examples speak for themselves, so I will not say much more.
    The two wav playback examples need you to specify a wav file on an SD card; the other examples just need you to specify which smart pins to use.

    Not all the modules I wanted to include fit in the zip archive (time to change forum policies?), so you have to get them through the link I posted in the top post instead.
    Btw, the module playback has been improved dramatically; almost everything is implemented, even bugs found in the original routine on the Amiga.

    /Johannes
  • I think this is going to be great Ahle2, though I'm having major problems with those mod files in the downloaded zip archive. Some of the voices/channels seem to be corrupted or not playing. Tracks get stuck for a bit and repeat; in places it's almost like a strange dance remix of that overload.mod file :lol: , and there's some distortion/crackling/blurping etc. Not sure what is going wrong. I also tried the original overload.mod file from your earlier first release in case the newer zipped version was somehow corrupted, but had the same results. The earlier code played it much better, albeit at a lower volume.

    All I did when I ran main.spin2 of your mod player demo was to adjust pins 0,1 to be 6,7 and use the P2-EVAL A/V breakout board fitted on pins P0-P7. Using the fastspin compiler v4.1.2 and revB silicon. I could try to hunt for a newer version of FastSpin to see if it helps, in case that's something different in the setup.

    I'm having issues with the other demos too.

    For the multi-channel demo I only hear the cricket sounds nothing else. I tried enabling front left and right, as well as back left and right on pins 6,7 (independently) as I only have the two channel amplifier. Nothing apart from the crickets.

    For the Sample Playback example I only hear the hi-hat channel.

    Any ideas? I'm sure it works fine in your setup.
  • Roger,
    You should be running the latest fastspin 4.1.9 or 4.1.10
    Eric has been squashing bugs as we find them :)
  • rogloh Posts: 2,256
    edited 2020-05-18 - 03:29:17
    Cluso99 wrote: »
    Roger,
    You should be running the latest fastspin 4.1.9 or 4.1.10
    Eric has been squashing bugs as we find them :)

    Yeah I know, I'm getting out of date again. I'll check it out.

    Update: FIXED! Demos all working now and it sounds great. I'm using 4.1.9. You just need to change one line to get it to work: it doesn't like this line with the single #. Changing it to ## fixes the error.

    and nrOfMixes, #$00ff_ffff ' The top 8 MSBs has got a different meaning

    Without the change it gives this error:
    "/Users/roger/Downloads/flexgui-2/bin/fastspin.mac" -2 -l -D_BAUD=230400 -O0 -I "/Users/roger/Downloads/flexgui-2/include" -I "/Users/roger/Downloads/flexgui-4.0.3/spin"  "/Users/roger/Downloads/reSoundBeta1/example_SamplePlayback/main.spin2"
    Propeller Spin/PASM Compiler 'FastSpin' (c) 2011-2020 Total Spectrum Software Inc.
    Version 4.1.9 Compiled on: May 18 2020
    /Users/roger/Downloads/reSoundBeta1/example_SamplePlayback/../driver/reSound.spin2:223: error: immediate operand 16777215 out of range
    

    It's amazing how hearing these old modfile tunes takes you right back to that long lost age when you were playing those fun games in the 90's. Xenon2, GODs, Dune2 etc. Stuff you haven't heard for like 25 years. COOL.
  • Ahle2 Posts: 1,039
    edited 2020-05-19 - 11:19:13
    Cluso99, Roger,

    I forgot to put that extra "#" in there to augment the value, but for some reason it works on the version of FastSpin I'm using...

    Version 4.1.5 Compiled on: Apr 18 2020

    It shouldn't work... but it does??

    Did anyone try out the interrupt fill example? Did it work? It is a house of cards and will fail if tampered with in the least. It does prove that the reSound driver itself works as intended. I really wonder why it fails (when tampered with) when I make a complete copy of the whole cog RAM on ISR entry and then restore everything on return?! I didn't want to spend more time figuring out exactly what's going on, since it is a hack and not supported by the FastSpin paradigm, and my time is better spent on making the actual driver somewhat near my vision for it. The beta has actually got some of the sound processing options I mentioned earlier, but they are not finished or working perfectly, so my examples don't use them.

    Btw, Cluso99.... Your Spin2 syntax highlighting in VSC is the best thing since sliced bread. I'm using it daily, thanks! :)

  • Ahle2 wrote: »
    Cluso99, Roger,

    I forgot to put that extra "#" in there to augment the value, but for some reason it works on the version of FastSpin I'm using...

    Version 4.1.5 Compiled on: Apr 18 2020

    It shouldn't work... but it does??

    Did anyone try out the interrupt fill example? Did it work? It is a house of cards and will fail if tampered with in the least. It does prove that the reSound driver itself works as intended. I really wonder why it fails (when tampered with) when I make a complete copy of the whole cog RAM on ISR entry and then restore everything on return?! I didn't want to spend more time figuring out exactly what's going on, since it is a hack and not supported by the FastSpin paradigm, and my time is better spent on making the actual driver somewhat near my vision for it. The beta has actually got some of the sound processing options I mentioned earlier, but they are not finished or working perfectly, so my examples don't use them.

    Btw, Cluso99.... Your Spin2 syntax highlighting in VSC is the best thing since sliced bread. I'm using it daily, thanks! :)
    Pleased someone is using the VSC syntax highlighting. I really prefer it over PropTool now.

    As for the missing #, fastspin 4.1.9 fixed a bug where # was assumed when doing reg[x] := reg[y], so it's possibly the same problem.
  • I have fixed the examples to work with the latest version of FastSpin (Version 4.1.9 Compiled on: May 13 2020). So to all that couldn't get them to work, download and try again! :)

    @Eric Smith
    I think that FastSpin uses "$1F4/IJMP1" for its internal operations?! Is that assumption right? So the reason that the interrupt fill example fails sometimes is because the interrupt vector gets corrupted and points to some "random" location in hub that then gets called as an ISR!

    @JonnyMac
    did you try this new beta release? I think it does what you were after. It should be easy to modify the buffer fill examples with additional input channels that can play samples over the soundtrack!
  • I have not (yet) -- been working on a client project so I've been away from the forums for a while.
  • Ahle2 wrote: »
    I have fixed the examples to work with the latest version of FastSpin (Version 4.1.9 Compiled on: May 13 2020). So to all that couldn't get them to work, download and try again! :)

    @Eric Smith
    I think that FastSpin uses "$1F4/IJMP1" for its internal operations?! Is that assumption right? So the reason that the interrupt fill example fails sometimes is because the interrupt vector gets corrupted and points to some "random" location in hub that then gets called as an ISR!
    No, fastspin does not use $1F4 for anything internally. I don't know why your interrupt fill example is failing, but fastspin was not built with interrupt safety in mind (and in particular uses the CORDIC and other hardware resources without trying to block interrupts) so using interrupts in a COG running fastspin code probably won't work.

    Regards,
    Eric
  • Ahle2 Posts: 1,039
    edited 2020-05-22 - 11:25:33
    Thanks for the response Eric! :)

    I think this is a little bit strange, since a simple repeat doing nothing in the main loop shouldn't use the CORDIC, right? The ISR uses the CORDIC for math, but that shouldn't be a problem, because nothing is interrupting it. I don't use the LUT RAM for anything, and the whole cog register RAM is stored/restored on each interrupt. Anyone up for the task of finding out what's really happening? (it's impossible! ;) )

    /Johannes
  • How big is the .lst file? That'll show any use of cordic and the like. REPs completely block IRQs for the duration, BTW.

  • evanh wrote: »
    REPs completely block IRQs for the duration, BTW.

    Ah. fastspin uses REP for many loops, so that may be your problem Johannes.
  • Thanks Evan and Eric,

    Yes, the FCACHE with its REP instruction is in there, but it shouldn't crash the application with unpredictable results, just delay the execution. My new theory is that it is the "objptr" that gets interrupted between the add and sub instructions. When the high-level code in the ISR gets called, the pointer is way off and things get corrupted.
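The failure mode hypothesized here can be illustrated in C: a shared base pointer is temporarily adjusted (add), used, then restored (sub), and an interrupt landing between the add and the sub observes the shifted value. This is a minimal model of the suspected race, not fastspin's actual code generation:

```c
#include <assert.h>

static int heap[16];
static int *objptr = heap;     /* shared base pointer (models objptr) */
static int isr_saw_offset;     /* what the simulated "ISR" observed   */

/* Stands in for an interrupt handler that also uses the base pointer. */
static void fake_isr(void) {
    isr_saw_offset = (int)(objptr - heap);
}

/* Non-reentrant access pattern: shift the base to the member, read,
   then shift back. If an ISR fires between the add and the sub, it
   sees the shifted base and would address the wrong locations. */
int access_member(int offset, int fire_isr_midway) {
    objptr += offset;          /* add: temporarily adjust the base */
    if (fire_isr_midway)
        fake_isr();            /* interrupt lands in the window    */
    int v = *objptr;
    objptr -= offset;          /* sub: restore the base            */
    return v;
}
```

Saving and restoring cog RAM on ISR entry would not help here, because the ISR runs while the pointer is mid-adjustment; the window between the add and the sub is the problem.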