reSound - A sound driver and mixer for the P2


Comments

  • For some uses, say background music for a game, the tunes could be compiled in a sense, ordered, and set up for burst reads into a HUB buffer. Other misc effects get played from the HRAM, at one-frame precision, during VBLANK. That's just one example, but where that chaos is known, it's likely the problem can be solved or profiled in some way.

    Should be very interesting.

  • rogloh Posts: 1,999
    edited 2020-02-07 - 00:58:36
    Yeah that's right Ahle2, it is not "random" as such, it is deterministic. It can just be non-sequential if played back at pitches > 2x normal rate. Hub FIFO caches will help of course, but how big do these FIFOs need to be for each channel, I wonder? For very high quality samples, they could start to get rather large. Now if they can all be read and fit in HUB RAM whilst being played back, that could still work with HyperRAM. Which COG is responsible for managing the requests, deciding when to access them and where the sample data is to be stored in Hub? The player COG, the mixer COG, the HyperRAM COG, or a combination? I'd not really expect different audio channel caches to be fully managed by the HyperRAM COG. Right now it acts on memory transfer requests of a given size to a given destination to/from hub RAM using HyperRAM. So the client request logic determines what it reads and where it puts it. I think this mixer COG might have to manage these audio queues and try to access them when it can from hub RAM, and read new data from external memory if it needs to. I don't know how it will decide when it needs to read new data.
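    To illustrate the "deterministic but non-sequential" point, here is a minimal host-side sketch in Python. It assumes a 16.16 fixed-point phase accumulator, which is a common way to do pitched sample playback; reSound's actual internals are not shown in this thread, and the function name is made up for the example.

```python
def sample_indices(step_16_16, count):
    """Return the source sample index read for each of `count` output samples.

    step_16_16 is the playback rate in 16.16 fixed point (0x10000 = 1.0x).
    This is an assumed model, not code taken from reSound.
    """
    phase = 0
    indices = []
    for _ in range(count):
        indices.append(phase >> 16)   # integer part selects the source sample
        phase += step_16_16           # advance by the pitch-dependent step
    return indices


normal = sample_indices(0x10000, 5)   # 1.0x: every source sample, in order
fast = sample_indices(0x28000, 5)     # 2.5x: source samples get skipped
print(normal, fast)
```

    The read addresses are fully predictable from the step value, which is what would let a read-ahead cache know in advance which part of external memory it needs next.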
  • rogloh Posts: 1,999
    edited 2020-02-07 - 00:59:05
    Actually it might be possible to have two overlapping audio buffers in hub RAM per input channel to hold the samples and ping pong between them. If you see you are reaching near the end of the first sample buffer, you trigger a HyperRAM read which runs in the background and populates the next batch of samples into a second Hub RAM buffer, and you later switch to reading from this second buffer when you need to, then repeat the process etc. That way you have much more time to do this work and can wait longer to collect the result when it competes with HyperRAM video transfers at the same time. This may be a good way to go.

    Update: One thing to consider is that the COG requesting these audio sample read bursts may like to generate multiple requests, one per audio input channel, during the same sample interval, rather than waiting for the read burst result to come back before it can request another read burst of samples which would hold it up. This is where a list of requests may be desirable. Right now the HyperRAM COG does not support such a concept and just works on one mailbox request at a time before checking the same mailbox again, but it might be possible to add this if something like this is needed for audio.
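    The ping-pong idea and the per-interval request list can be sketched as a small Python model. Everything here is hypothetical: the class, its method names, and the use of Python lists in place of hub RAM buffers and HyperRAM. It only shows the control flow, not real driver code.

```python
class PingPongChannel:
    """One input channel with two hub-RAM-style buffers (modelled as lists)."""

    def __init__(self, external, half_size, threshold):
        self.external = external      # stands in for sample data held in HyperRAM
        self.half_size = half_size    # samples per buffer
        self.threshold = threshold    # read position that triggers a refill
        self.fetch_pos = 0            # next external offset to fetch from
        self.buffers = [self._fetch(), None]
        self.active = 0
        self.pos = 0
        self.requested = False

    def _fetch(self):
        chunk = self.external[self.fetch_pos:self.fetch_pos + self.half_size]
        self.fetch_pos += len(chunk)
        return chunk

    def wants_refill(self):
        """True once per buffer, when the reader nears the end of it."""
        if self.requested or self.pos < self.threshold:
            return False
        self.requested = True
        return True

    def service_refill(self):
        """Run later by the (modelled) memory driver to fill the idle buffer."""
        self.buffers[1 - self.active] = self._fetch()

    def read(self):
        if self.pos == len(self.buffers[self.active]):
            # Switch buffers; the other one must have been filled by now.
            self.active = 1 - self.active
            self.pos = 0
            self.requested = False
        sample = self.buffers[self.active][self.pos]
        self.pos += 1
        return sample


channels = [PingPongChannel(list(range(base, base + 12)), half_size=4, threshold=3)
            for base in (0, 100)]
out = [[], []]
for _ in range(12):
    requests = []                     # request list for this sample interval
    for i, ch in enumerate(channels):
        out[i].append(ch.read())
        if ch.wants_refill():
            requests.append(ch)       # queue it rather than block on the result
    for ch in requests:               # serviced "in the background"
        ch.service_refill()
print(out)
```

    In this model the gap between `threshold` and the end of the buffer is the time budget the memory driver has to deliver the data, so a lower threshold tolerates more latency at the cost of requesting earlier.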
  • evanh wrote: »
    Ah, the advantage of a ratio of 8:1 is there is four sysclocks of lag from the SPI clock rise to the SPI tx data out appearing at its pin. So 8:1 perfectly delays the tx by half the SPI clock period which is nicely bang on the low going SPI clock edge. Exactly as desired.

    At 6:1, the tx pin transitions one sysclock after the SPI clock low, leaving still two sysclocks of tx data setup before the next rising SPI clock.

    At 5:1 you're down to a single sysclock of setup.

    Oh, something of a breakthrough with this. If I change the SPI clock from idling low to idling high, then, instead of using the prior rising clock edge, the prior falling edge gives a whole SPI clock period for leading. That means that at a ratio of just 4:1 it can still be bang on ideal. And 3:1 is possible.
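    The setup figures above can be reproduced with a simplified Python model. It assumes the four-sysclock tx lag quoted in the post, a 50% duty SPI clock, and a device that samples on the rising edge; the function names are invented for this sketch and real pin timing should be checked on hardware.

```python
TX_LAG_SYSCLOCKS = 4  # lag from the triggering clock edge to data at the pin (from the post)


def tx_setup_idle_low(ratio, lag=TX_LAG_SYSCLOCKS):
    """Setup time in sysclocks when tx is launched from the prior rising edge."""
    # The next rising (sampling) edge is one full SPI period after the trigger.
    return ratio - lag


def tx_setup_idle_high(ratio, lag=TX_LAG_SYSCLOCKS):
    """Setup time in sysclocks when tx is launched from the prior falling edge."""
    # The sampling edge is one and a half SPI periods after the trigger
    # (assumes a 50% duty clock, which only holds exactly for even ratios).
    return 1.5 * ratio - lag


def is_ideal(setup, ratio):
    """Ideal means the data changes exactly half an SPI period before sampling."""
    return setup == ratio / 2


for ratio in (8, 6, 5):
    print(ratio, tx_setup_idle_low(ratio))
print(4, tx_setup_idle_high(4))
```

    In this model the idle-low case is ideal only at 8:1 and the idle-high case only at 4:1, which matches the description of each scheme being tuned to one ratio.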

  • One more issue that is still not resolved in this discussion is that even if HyperRAM read bursts are cached or ping-ponged etc (we can just think of this as read-ahead), we still have the issue of starting a new sample as soon as we are notified of it changing. Right now @Ahle2's excellent mixer can begin a new sample at the next sample clock. There is basically only a one sample interval delay at worst before the new sound starts being output. But HyperRAM will always have some delay to get the result when a new sample is to be loaded and played, and this could well be more than a sample period. Possibly something on the order of a video scanline's worth of latency before any read results start to appear. We almost need the driver to buffer some handful of samples (let's say 50 microseconds worth) internally somewhere so that there can be some lag between issuing the request for new sample data from HyperRAM and the point where it is really required to be output.

    There's plenty of room in the COG for holding a few extra samples in an output FIFO of premixed audio data, which should allow this to work. Also if I2S output from this COG is added at some point, some small amount of extra buffering may be desirable for that purpose anyway if the streamer is involved in the work, assuming this small output FIFO is moved over to hub RAM. Or if smartpin modes can be used to clock out the I2S data, perhaps the streamer is not required and this output FIFO can remain in COG/LUT RAM.
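    As a rough size check for that output FIFO, the number of premixed samples needed to cover a given latency is just latency times sample rate, rounded up. The sample rates below are assumptions for illustration; the thread does not state the rate reSound runs at.

```python
import math


def fifo_depth(latency_seconds, sample_rate_hz):
    """Smallest whole number of samples that spans the given latency."""
    return math.ceil(latency_seconds * sample_rate_hz)


depth_44k1 = fifo_depth(50e-6, 44_100)   # 50 us at an assumed 44.1 kHz
depth_96k = fifo_depth(50e-6, 96_000)    # 50 us at an assumed 96 kHz
print(depth_44k1, depth_96k)
```

    Either way it is only a handful of longs, which fits the "plenty of room in the COG" remark.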
  • jmg Posts: 14,278
    rogloh wrote: »
    ...
    There's plenty of room in the COG for holding a few extra samples in an output FIFO of premixed audio data, which should allow this to work. Also if I2S output from this COG is added at some point, some small amount of extra buffering may be desirable for that purpose anyway if the streamer is involved in the work, assuming this small output FIFO is moved over to hub RAM. Or if smartpin modes can be used to clock out the I2S data, perhaps the streamer is not required and this output FIFO can remain in COG/LUT RAM.

    I recall coding for a sound card some years back, and the only way I was able to get stable, click/artifact-free handling was to run a 4-phase version of a ping-pong buffer.
    A 2-phase one was not enough, given the latencies and the jitter in those latencies.
    The SW polled to check that the READ and WRITE pointers were in opposite quadrants, so there was one quadrant's worth of tolerance built into the code.
    PCs are likely worse than P2, but a P2 running video and audio and HyperRAM has many balls in the air too...
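    A minimal Python sketch of that quadrant check, assuming a single circular buffer split into four equal quadrants. The function names and the 1024-sample size are made up for the example; the original sound card code is not shown in the thread.

```python
def quadrant(pointer, buffer_size):
    """Which quarter (0..3) of the circular buffer a pointer falls in."""
    return (pointer % buffer_size) * 4 // buffer_size


def pointers_safe(read_ptr, write_ptr, buffer_size):
    """True when the writer is exactly two quadrants away from the reader.

    That leaves one whole quadrant of slack on either side, so jitter in
    either pointer of up to a quadrant cannot make them collide.
    """
    gap = (quadrant(write_ptr, buffer_size) - quadrant(read_ptr, buffer_size)) % 4
    return gap == 2


SIZE = 1024  # assumed buffer length in samples
print(pointers_safe(100, 600, SIZE), pointers_safe(100, 300, SIZE))
```

    Polling code would refill (or hold off the writer) whenever `pointers_safe` turns false, rather than waiting for the pointers to actually meet.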

  • rogloh Posts: 1,999
    edited 2020-02-07 - 04:06:35
    Yeah jmg it will get complex sharing the video+audio+other COG requests. But with the right design I think it can still work for many cases.

    I just took a look at the Smartpin synchronous TX mode. I think this may be useful for stereo 32 bit I2S transmission directly from the COG - probably no need for the streamer if this works.

    If you don't already have an external clock source, you need to set up an output bit clock (BCLK) using another Smartpin and then read it back in as the B input for the Smartpin serial data output, which will clock out the next data bit that has already been loaded into the Smartpin earlier using WYPIN. In this particular synchronous mode it can run continuously and notifies the COG of completion of the transmission of the 32 bit (stereo) value by raising IN. This signal can then be used to replace the wait condition between samples.

    You'd also need to set up an LRCLK output that is BCLK / 32, sent on another Smartpin, whose rising edge is in phase with the falling edge of BCLK.

    These clock output pins could perhaps use transition output mode, if they can be restarted synchronously to avoid slipping the phase. Or another more suitable Smartpin mode may be used. The problem if you don't have an external clock is that the output may not be precisely the desired sample rate. E.g. a bit clock at 32*44100 = 1411200 Hz is wanted, but a P2 at 252 MHz can only synthesize 252 MHz/178 = 1415730 Hz or 252 MHz/179 = 1407821 Hz without jitter (EDIT: Hmm, maybe just even dividers only). It's close though and you may not perceive pitches being off. The P2/video clock might be able to be tweaked down a fraction to help too (e.g. 251 MHz). Other NCO modes might be possible but may add some jitter. How tolerable that is would depend on the receiving device I guess.
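    The divider arithmetic above is easy to check on a PC. This Python sketch lists the two integer dividers either side of the ideal value and expresses the resulting pitch error in cents (hundredths of a semitone). The helper name and the `even_only` flag are inventions for this example; the flag just models the "maybe even dividers only" caveat.

```python
import math


def divider_options(sysclk_hz, target_hz, even_only=False):
    """Return (divider, actual_hz, error_in_cents) for the two nearest dividers."""
    ideal = sysclk_hz / target_hz
    step = 2 if even_only else 1
    low = math.floor(ideal / step) * step
    options = []
    for div in (low, low + step):
        actual = sysclk_hz / div
        cents = 1200 * math.log2(actual / target_hz)
        options.append((div, actual, cents))
    return options


SYSCLK = 252_000_000
BCLK = 32 * 44_100            # 1411200 Hz bit clock for 44.1 kHz stereo

opts = divider_options(SYSCLK, BCLK)
even_opts = divider_options(SYSCLK, BCLK, even_only=True)
for div, actual, cents in opts + even_opts:
    print(f"/{div}: {actual:.0f} Hz, {cents:+.2f} cents")
```

    Both odd-or-even candidates land within about six cents of the target, which supports the guess that the pitch offset would be hard to hear.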
  • evanh wrote: »
    evanh wrote: »
    Ah, the advantage of a ratio of 8:1 is there is four sysclocks of lag from the SPI clock rise to the SPI tx data out appearing at its pin. So 8:1 perfectly delays the tx by half the SPI clock period which is nicely bang on the low going SPI clock edge. Exactly as desired.

    At 6:1, the tx pin transitions one sysclock after the SPI clock low, leaving still two sysclocks of tx data setup before the next rising SPI clock.

    At 5:1 you're down to a single sysclock of setup.

    Oh, something of a breakthrough with this. If I change the SPI clock from idling low to idling high, then, instead of using the prior rising clock edge, the prior falling edge gives a whole SPI clock period for leading. That means that at a ratio of just 4:1 it can still be bang on ideal. And 3:1 is possible.

    VERY NICE! My SD driver idles high because it is beneficial to me. Glad to see it actually works out the best. I'm still trying to understand a context of multiple files and how that would work. I also have the idea of adding a command to load to LUT since, in theory, it would be a faster path from cog to cog than through hub. With the new eggbeater hub it might not be that much of an improvement though... Not sure still.

    I don't have a scope, so verifying the max SPI clock is just based on numbers. During initialization the smartpin clock WXPIN value is calculated as:
    PRI GetBitTime(max_clkfreq) : bt
        bt := ((clkfreq / max_clkfreq) / 2) + 1   ' SPI bit time
        if bt < 2                                 ' make sure at least sysclock / 2
            bt := 2
    

    So perhaps that 2 should become a constant ratio? Maybe this is the only thing that needs to be changed, fingers crossed! I spent some time trying to clean up code for the next release and got stuck on another bug-fix.
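    For anyone wanting to try numbers without a P2 to hand, here is a Python port of that calculation with the floor pulled out as a parameter. The `min_bt` parameter is a hypothetical addition reflecting the "constant ratio" idea; the default of 2 matches the Spin2 code above, and `//` stands in for Spin2's integer divide.

```python
def get_bit_time(clkfreq, max_clkfreq, min_bt=2):
    """Python model of GetBitTime; min_bt is a hypothetical tunable floor."""
    bt = ((clkfreq // max_clkfreq) // 2) + 1   # same integer maths as the Spin2 version
    if bt < min_bt:                            # clamp to the minimum allowed value
        bt = min_bt
    return bt


# Example clock values are assumptions chosen only to exercise the formula.
print(get_bit_time(200_000_000, 25_000_000))    # ratio 8
print(get_bit_time(200_000_000, 50_000_000))    # ratio 4
print(get_bit_time(200_000_000, 200_000_000))   # ratio 1, hits the clamp
```

    Raising `min_bt` would be one way to experiment with a larger minimum ratio without touching the rest of the formula.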
  • evanh Posts: 8,981
    edited 2020-02-07 - 07:04:39
    A generic ratio calculation works fine with tx clocking on the rising SPI clock because the prop2's I/O stages guarantee four sysclocks of data hold time of the prior data bit for the SPI device. So, although it's leading by a half SPI clock and looks like the data is clocked too early at larger ratios, it's actually very reliable no matter how slow the SPI clock gets.

    However, this new extra lag compensation, allowed by inverting the SPI clock and using a whole leading SPI clock period, is customised for the 4:1 ratio and can't be used for larger ratios.

    EDIT: Of course, I'm talking about tx smartpin only. Rx smartpin is happy with any ratio from 2:1 and up, and the clock pin can idle high or low. It doesn't care. I presume you've already found my post in the tricks and traps - https://forums.parallax.com/discussion/comment/1488948/#Comment_1488948

    EDIT2: These statements are all assuming the prop2 is the SPI bus master and clock source.
  • I found the bug that causes most portamento effects in the module player to behave crazy. It turns out that it needs a last set period memory. Very obvious when you think about it, but I thought it was okay since most tunes I tried used pitch bend primarily. It is fixed in my development code.

    Regarding a HyperRAM backend: just like rogloh said, I think it will do if we can make a double buffer (of big enough size) plus an interrupt callback routine to trigger per-channel independent HyperRAM reads (with a queue in the HyperRAM driver). It doesn't need any modifications to reSound. Just register a callback. As long as the delay is less than ~1/20 second (50 ms), it is instant according to our brains.