Paula (Amiga) inspired audio driver [0.95 - envelopes]

pik33 · 2022-01-05 11:06

Now it works. I simply forgot to zero the ptrb at the start.

There are 11 NOPs free at 354 MHz and 6 NOPs at 320 MHz. After an hour of playing at 354 MHz the P2 Edge is hot. Not too hot to touch, something slightly more than 40 °C (about 104 °F). Still stable.

isr1        wypin   lsample,#left        '2     The sample has to be outputted every 90 cycles     
            wypin   rsample,#right       '4

            incmod  counter,a20000000    '6 
            cmp     counter,irqtime wcz  '8     Check if it is time for the next sample
    if_ne   reti1                        '10/12 If not, do nothing

            getword rsample,lsnext,#1    '12
            getword lsample,lsnext,#0    '14
            cmp     ptrb,front wcz       '16    If the buffer is empty, do nothing 
    if_e    reti1                        '18/20

            rdlut   lsnext,ptrb++        '21    else read the sample and its time from LUT
            rdlut   irqtime,ptrb++       '23    Read the time for this sample

            reti1

pik33 · 2022-01-05 11:26

.. the "6 nops" low pass filter started to work....

.. and I have to measure the filter and/or find the error in the calculations, as what is supposed to be a 17 kHz low-pass filter seems to have a much lower cutoff frequency. Then the filter has to be made switchable via the player.

rogloh · 2022-01-05 14:01

Just double check your INCMOD "S" value is correct. I think it is meant to be the last value in the incrementing sequence after which it wraps to zero. So INCMOD D, #0 would always remain at zero, while INCMOD D, #1 would count 0,1,0,1, etc. At least that's how I think it works.
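The wrap rule described above can be checked with a tiny model. This is only a Python sketch of the behaviour as rogloh describes it (S is the last value of the sequence, after which D wraps to zero); the helper names `incmod` and `sequence` are made up for illustration and are not part of any P2 tool.

```python
MASK32 = 0xFFFFFFFF

def incmod(d: int, s: int) -> int:
    """Model of INCMOD D,S as described in the post:
    if D already equals S, wrap to 0, otherwise increment."""
    return 0 if d == s else (d + 1) & MASK32

def sequence(s: int, steps: int, start: int = 0) -> list:
    """Return the first `steps` values D takes, starting from `start`."""
    d = start
    out = []
    for _ in range(steps):
        out.append(d)
        d = incmod(d, s)
    return out

print(sequence(0, 4))   # S = 0: D stays at zero
print(sequence(1, 4))   # S = 1: 0, 1, 0, 1
```

Under this model a counter that should take N distinct values needs S = N - 1, which is the point of the "double check" above.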

pik33 · 2022-01-05 14:16

incmod works with $20000000

The filters are not stable at 320 MHz. Some modules make the player stop playing, while at 354 MHz they work (more room available in the ISR).

I made a big mistake implementing this filter and that's why calculations failed...

A new idea: precompute all the samples in the main loop, not (sample, time) pairs but every output sample. That is: compute the sample as it is done now, determine how many samples with the same value need to be placed in the buffer, place them there, and let the ISR write them to the DACs and do nothing else.

This may make it possible to implement the PWM volume, the Amiga filter and postprocessing...

pik33 · 2022-01-10 11:24

A major rewrite in progress, but the changed driver (v 014) actually works.

As described in the previous post, the main program computes all samples and puts them into the buffer.
The interrupt only pushes these samples into the DACs.

A disadvantage: a more complex main loop which takes longer to execute (to be optimized), but Amiga modules still play.
Advantages:

much simpler ISR
all samples available in the main loop so filters can be added, on-screen oscilloscope can be made, etc
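The split of work described above can be modelled as a single-producer ring buffer. The following Python sketch is an assumption-laden illustration, not the driver itself: the 512-entry size and the "one slot kept free" full test are taken from the PASM fragments quoted later in the thread, while the class and method names are invented here.

```python
BUF_SIZE = 512  # LUT entries used as the sample buffer (figure from the thread)

class SampleRing:
    """Toy model: the main loop fills the buffer, the ISR only drains it."""

    def __init__(self):
        self.lut = [0] * BUF_SIZE
        self.front = 0   # write index, owned by the main loop
        self.rd = 0      # read index, stands in for ptrb in the ISR
        self.dac = 0     # last value written to the DAC

    def is_full(self) -> bool:
        # One slot stays unused so that full and empty can be told apart.
        return (self.rd - 1) % BUF_SIZE == self.front

    def is_empty(self) -> bool:
        return self.rd == self.front

    def push_run(self, sample: int, count: int) -> int:
        """Main loop: store up to `count` copies of `sample`.
        Returns how many were stored before the buffer filled up."""
        written = 0
        while written < count and not self.is_full():
            self.lut[self.front] = sample
            self.front = (self.front + 1) % BUF_SIZE
            written += 1
        return written

    def isr(self) -> int:
        """ISR: output the next sample if there is one, else hold the last."""
        if not self.is_empty():
            self.dac = self.lut[self.rd]
            self.rd = (self.rd + 1) % BUF_SIZE
        return self.dac

ring = SampleRing()
ring.push_run(7, 3)                    # three output samples with the same value
print([ring.isr() for _ in range(4)])  # the fourth call finds the buffer empty
```

The real main loop waits when the buffer is full rather than giving up, and the real ISR writes the previously fetched sample first; the sketch leaves both details out to keep the producer/consumer roles visible.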

I still have no time for, and no idea how to add, a Paula-type PWM volume. This has low priority, as it is complex to implement and I have no real Amiga to compare the result against.

rogloh · 2022-01-10 12:24

@pik33 said:
A major rewrite in progress, but the changed driver (v 014) actually works.

As described in the previous post, the main program computes all samples and puts them into the buffer.
The interrupt only pushes these samples into the DACs.

A disadvantage: a more complex main loop which takes longer to execute (to be optimized), but Amiga modules still play.
Advantages:

much simpler ISR

all samples available in the main loop so filters can be added, on-screen oscilloscope can be made, etc

I still have no time for, and no idea how to add, a Paula-type PWM volume. This has low priority, as it is complex to implement and I have no real Amiga to compare the result against.

See if you can come up with a scheme that still allows samples to be read in from PSRAM (aim for about 1us of access latency per each random sample access in setups above 250MHz). This way you'll hopefully be able to play larger mod files (or maybe one day even s3m files LOL) and fit them in external memory.

pik33 · 2022-01-10 14:40

I need to connect PSRAMs to test. I have no Edge with PSRAM (and no hope to get one in the near future) and I don't know what I can achieve with these chips connected using wires (speed?)

I can, however, simulate 1 us of latency when fetching a sample to test whether it still works.

Edit. 1 us latency per sample should still work. This loop

delay   mov qq,##90   ' about 360 cycles (?)
p900    djnz qq,#p900

caused the player to distort at a loop count of about 300 without the filter and about 200 with the filter. There is still a lot of code to optimize in the main loop. The setq/rdlongs will be replaced with something less time-consuming, since I don't need to read all these parameters for every sample.
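To put the numbers above into microseconds, here is a small Python calculation. It assumes the 4 clocks per DJNZ iteration implied by the "about 360 cycles" comment, a 354 MHz sysclock (the figure mentioned earlier in the thread), and that "300" and "200" are values of the loop count; all three are readings of the post, not measurements.

```python
def delay_us(count: int, sysclk_hz: float, clocks_per_iter: int = 4) -> float:
    """Approximate duration of the DJNZ delay loop in microseconds.
    clocks_per_iter = 4 follows the '90 -> about 360 cycles' comment."""
    return count * clocks_per_iter / sysclk_hz * 1e6

SYSCLK_HZ = 354_000_000  # assumed clock

for count in (90, 200, 300):
    print(count, round(delay_us(count, SYSCLK_HZ), 2), "us")
```

With these assumptions a count of 90 is close to 1 us, and the distortion thresholds correspond to roughly 2.3 us (with filter) and 3.4 us (without filter) of added latency per sample fetch.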

pik33 · 2022-01-11 10:53

0.15.

A lot of time (and also size) optimizations: this version doesn't hang while running interrupts at Paula x2 speed! (It distorts, but the previous version didn't run at all.)

This means there will be no problem with getting samples from PSRAM or (and?) adding filters to the audio at the standard Paula clock.

The main difference: all registers are no longer read on every loop. Instead, only the channel that has to be computed reads its registers. About 160 clocks (on average) are saved on every loop.
The API difference: the cmd register, which is used to reset the sample phase accumulator (= to start playing the note), now has to be set to $FFFFFFFF instead of any non-zero value to allow the sample to run. This allowed me to remove 8 instructions (16 clocks) from the main loop. It also added an unneeded (?) "feature": setting the register to any value other than $FFFFFFFF will cause the sample to loop when the phase accumulator reaches a zero bit in the cmd register (the accumulator is ANDed with this value every loop).
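The effect of ANDing the phase accumulator with cmd can be illustrated with a short Python model. The order of operations (add the step, then AND) and the name `step_phase` are assumptions made for this sketch; the post only states that the accumulator is ANDed with cmd every loop.

```python
MASK32 = 0xFFFFFFFF

def step_phase(phase: int, increment: int, cmd: int) -> int:
    """One loop iteration: advance the 32-bit accumulator, then AND with cmd.
    (Assumed ordering; the driver source is not quoted in the post.)"""
    return ((phase + increment) & MASK32) & cmd

# cmd = $FFFFFFFF: the accumulator runs freely.
print(hex(step_phase(0x10, 0x40, MASK32)))

# cmd = 0: the accumulator is forced back to zero (note start).
print(hex(step_phase(0x10, 0x40, 0)))

# Any other mask: the accumulator wraps where the mask has a zero bit.
phase = 0
for _ in range(5):
    phase = step_phase(phase, 0x40, 0xFF)
    print(hex(phase))
```

In this model a mask of $FF makes the accumulator loop within 256 steps, which matches the looping "feature" described above.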

Several other changes also stripped several clocks from the loop, the loop time is about 200 clocks less now.

evanh · 2022-01-11 11:51

With the RAMs being managed from a dedicated cog, it should be possible to perform a prefetching scheme for read data. The tricky part is making it deterministic. Which basically means application driven - the developer has to specify what to prefetch, and preallocate the space for it, before actually needing the data.

Write data is much easier. The regular write-this-block-of-data can be a background operation. As long as data rate stays sane then no issue.

PS: I've not tried reading up on Roger's driver as yet. No idea what flexibility it might have.

pik33 · 2022-01-11 13:55

While playing an Amiga module, sample data are read sequentially, byte by byte and every byte, so a prefetch is possible. The exceptions are starting a new sample and looping the sample. The driver can skip samples and use 16-bit samples, but these features are not present in the original Paula, so they are not needed to play a module. The read, however, is still sequential and predictable, so we can tell the RAM driver to do this.

The audio driver's LUT-based buffer now contains about 140 microseconds of audio data.

Maybe the module can be played directly from the Edge's or Eval's flash memory using such a dedicated driver/cache? I have to try this.

rogloh · 2022-01-11 14:56

@evanh said:
With the RAMs being managed from a dedicated cog, it should be possible to perform a prefetching scheme for read data. The tricky part is making it deterministic. Which basically means application driven - the developer has to specify what to prefetch, and preallocate the space for it, before actually needing the data.

You can try to prefetch if you can and this will reduce the number of requests needed which is helpful. Also my driver has some in-built capability to skip some bytes of memory between read portions with its graphics copy stuff, and I think this might be useful for audio, however any wrapping back around to lower memory addresses during audio sample looping still has to be dealt with by the caller.

pik33 · 2022-01-15 19:23

A huge size optimization (445->190 longs) and a major rewrite for v. 0.18 (use player2.bas to test this, as the registers changed). No more code repeated for 8 channels, no more 72 longs in cog RAM for registers. ALTS/ALTD are used instead, with only one procedure and one register set.

https://github.com/pik33/P2-retromachine/tree/main/Propeller/Tracker player

I have to clean this up by moving all old versions to a recycle folder.

This change makes room for more functions:

there is a current pointer available in the hub for every channel, so the main program can now stream a long sample to one looped channel buffer (wav playing from the SD/Flash/PSRAM)
there is a current sample value available in the hub, so things like oscilloscopes can be done without messing with the driver (simply read these samples in the main program)
there is room for more channels in the driver

evanh · 2022-01-15 21:31

DAC update rate of sysclock/90 !? .... I was going in my head "That's insane! Why?" ... Then it dawned on me you may not know that the smartpin DACmodes have an integral DAC update timer with buffering. If you were setting the DACs directly then you've coded using the IRQ correctly.

However, as long as the program feeds the next sample into the smartpin sometime before the next DAC update period, then it'll be a clean metronomic interval between the two samples.

PS: Eliminating interrupts altogether will free up the smaller 14 of 90 clocks, 15%, of cog time.

evanh · 2022-01-15 21:52

Which, in turn, means the lutRAM buffer handling can be compacted right down to just waiting for smartpin ready. No software buffering at all.

Ariba · 2022-01-15 22:13

@evanh said:
DAC update rate of sysclock/90 !? .... I was going in my head "That's insane! Why?" ... Then it dawned on me you may not know that the smartpin DACmodes have an integral DAC update timer with buffering. If you were setting the DACs directly then you've coded using the IRQ correctly.

However, as long as the program feeds the next sample into the smartpin sometime before the next DAC update period, then it'll be a clean metronomic interval between the two samples.

PS: Eliminating interrupts altogether will free up the smaller 14 of 90 clocks, 15%, of cog time.

Interesting idea! But you need a DAC output per channel, and have to mix them together in analog with resistors. With CMOS switches (4016 or 4066) and another 4 smart pins, you can also do the PWM volume control.
So 8 pins + some external circuit for a Paula emulation, but this may be very close to the original.

Andy

evanh · 2022-01-15 22:24

Pik's doing all that in software I think. It's just the two 16-bit dithered DAC smartpins being fed by the ISR.

'--------------------------------------------------------------------------
'------ Interrupt service -------------------------------------------------
'------ Output the sample, get the next one if exists ---------------------
'--------------------------------------------------------------------------

isr1        wypin   lsample,#left        '2     The sample has to be outputted every 90 cycles     
            wypin   rsample,#right       '4

            cmp     ptrb,front wcz       '6    If the buffer is empty, do nothing 
    if_e    reti1                        '8/10

            rdlut   lsnext,ptrb++        '11    else read the sample and its time from LUT
            getword rsample,lsnext,#1    '13
            getword lsample,lsnext,#0    '15
            reti1                        '17/19

PS: His program is whizzing along doing the mixing at over 300 MHz sysclock! There's quite some MIPS to burn.

rogloh · 2022-01-15 23:35

@pik33 Instead of this:

            testb   sstart0,#30 wz
    if_nz   jmp     #p403

            mov    pointer0,#0  
            bitl   sstart0,#30
            add    ptra,#8
            wrlong sstart0,ptra
            sub    ptra,#8

p403        setq #1

you could just do this (C flag is affected too but your code doesn't need to preserve it here anyway):

            bitl   sstart0, #30 wcz            
    if_z    mov    pointer0, #0  
    if_z    wrlong sstart0, ptra[2]

p403        setq #1
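The equivalence of the two fragments can be checked with a Python model of the registers involved. This is a sketch under two assumptions taken from the post and from how the replacement is written: BITL with WCZ reports the bit's state before clearing, and ptra[2] in a WRLONG addresses ptra + 2*4 bytes. The function names and the dictionary standing in for hub RAM are illustrative only.

```python
BIT30 = 1 << 30

def original(sstart0, pointer0, ptra, hub):
    """testb / jmp / mov / bitl / add / wrlong / sub version."""
    hub = dict(hub)
    if sstart0 & BIT30:            # testb sstart0,#30 wz ; if_nz jmp #p403
        pointer0 = 0               # mov pointer0,#0
        sstart0 &= ~BIT30          # bitl sstart0,#30
        ptra += 8                  # add ptra,#8
        hub[ptra] = sstart0        # wrlong sstart0,ptra
        ptra -= 8                  # sub ptra,#8
    return sstart0, pointer0, ptra, hub

def optimized(sstart0, pointer0, ptra, hub):
    """bitl ... wcz / if_z mov / if_z wrlong ptra[2] version."""
    hub = dict(hub)
    was_set = bool(sstart0 & BIT30)   # flags take the bit's state before clearing
    sstart0 &= ~BIT30                 # clearing an already-clear bit changes nothing
    if was_set:
        pointer0 = 0
        hub[ptra + 2 * 4] = sstart0   # long index 2 = byte offset 8
    return sstart0, pointer0, ptra, hub

for start in (0x4000_1234, 0x0000_1234):
    print(original(start, 99, 0x1000, {}) == optimized(start, 99, 0x1000, {}))
```

Both paths leave ptra unchanged, which is why the add/sub pair can be folded into the indexed write.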

pik33 · 2022-01-16 09:07

Applied the patch: it works. 3 instructions instead of 7! I didn't realize there is something like ptra[2] available, after nearly a full year (since February 8th) with a P2 on my desk... Too old or too busy, or both...

This was also the first time when I utilized ALTx instructions (that's why I wrote this topic about aliases).

Now I have to try to write a player with an oscilloscope on the screen. This can be a good example of using the display-list driver: I can replace several text lines with graphics to display the oscilloscope while keeping most of the screen in text mode, saving hub RAM space.

pik33 · 2022-01-18 21:41

A little offtopic...

These 4 lines in the player2.bas

    dlptr=v030.dl_ptr
    olddl=lpeek(dlptr)
    for i=0 to 539: lpoke dlptr+4*i,lpeek(dlptr+4*i+4) : next i
    lpoke dlptr+539*4,olddl

do this to the display:

The display is in character mode... (112x31 chars @ 8x16 pixels). I had to recall how to use the DL, and then I discovered I still have a lot of unimplemented features. A MOD player using SD and a keyboard for control, connected either via a serial terminal or via an RPi-based interface, is now possible; all the needed elements are in place and ready to use.
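What the four BASIC lines do to the display list can be modelled as rotating a list by one entry. The Python sketch below treats the display list as a plain list of 540 longs; the function name is invented, and the meaning of each entry (one per display line) is an assumption based on the context.

```python
def rotate_display_list(dl: list) -> None:
    """Move every entry one place towards the start and put the
    old first entry at the end, in place (same net effect as the
    lpeek/lpoke loop on the 540 display-list longs)."""
    first = dl[0]
    for i in range(len(dl) - 1):
        dl[i] = dl[i + 1]
    dl[-1] = first

dl = list(range(540))   # stand-in for the 540 display-list entries
rotate_display_list(dl)
print(dl[:3], dl[-2:])
```

Calling this once per frame would make the picture scroll up by one line while the frame buffer itself stays untouched.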

The module is http://modarchive.org/index.php?request=view_by_moduleid&query=35071 - worth listening to.

It seems connecting the power amplifier to RCAs of AV board instead of audio jack gives slightly better audio quality... one IC less in the audio path

Wuerfel_21 · 2022-01-18 23:12

@pik33 said:
It seems connecting the power amplifier to RCAs of AV board instead of audio jack gives slightly better audio quality... one IC less in the audio path

Yes, the audio amplifier is a bit poor and not needed when going into a line input. And for when you do plug in headphones, it is WAY TOO LOUD.

pik33 · 2022-01-28 15:48

The driver was upgraded, now without any problems, to 16 channels.

rogloh · 2022-01-29 00:41

Wow, maybe soon you will be able to play that DOPE.MOD file with 28 channels. Although it would need external memory. I wonder if it could work across two COGs somehow...and still be mixed somewhere?

evanh · 2022-01-29 01:59

A couple of optimised ISR options:

isr1
            cmp     ptrb,front wz        '6         If the buffer is empty, do nothing
    if_z    reti1                        '8/10

            rdlut   lsnext,ptrb++        '11        read the stereo samples from LUT
            wypin   lsnext,#left         '15        The sample has to be outputted every 100 cycles
            getword lsnext,lsnext,#1     '13
            wypin   lsnext,#right        '17
            reti1                        '21

isr1
            cmp     ptrb,front wz        '6         If the buffer is not empty
    if_nz   rdlut   lsnext,ptrb++        '8/9       read the stereo samples from LUT
    if_nz   wypin   lsnext,#left         '10/11     The sample has to be outputted every 100 cycles
    if_nz   getword lsnext,lsnext,#1     '12/13
    if_nz   wypin   lsnext,#right        '14/15
            reti1                        '18/19

Original:

isr1        wypin   lsample,#left        '6     The sample has to be outputted every 100 cycles     
            wypin   rsample,#right       '8

            cmp     ptrb,front wcz       '10    If the buffer is empty, do nothing 
    if_e    reti1                        '12/14

            rdlut   lsnext,ptrb++        '15    else read the sample and its time from LUT
            getword rsample,lsnext,#1    '17
            getword lsample,lsnext,#0    '19
            reti1                        '23

pik33 · 2022-01-29 07:28

This if_nz thing can speed the ISR up - I will implement this

Adding more channels is possible but I have to get rid of this:

            mov     rs,#0            ' Mix all channels to rs and ls
            mov     ls,#0
            add     rs,rs1
            add     rs,rs2            'todo: in channel computing, mov rs to oldrs, then sub oldrs, add newrs
            add     rs,rs3
            add     rs,rs4
            add     rs,rs5
            add     rs,rs6
            add     rs,rs7
            add     rs,rs8
            add     rs,rs9
            add     rs,rs10
            add     rs,rs11
            add     rs,rs12
            add     rs,rs13
            add     rs,rs14
            add     rs,rs15
            add     rs,rs16


            add     ls,ls1
            add     ls,ls2
            add     ls,ls3
            add     ls,ls4
            add     ls,ls5
            add     ls,ls6
            add     ls,ls7
            add     ls,ls8
            add     ls,ls9
            add     ls,ls10
            add     ls,ls11
            add     ls,ls12
            add     ls,ls13
            add     ls,ls14
            add     ls,ls15
            add     ls,ls16

Instead, let every channel keep its own sample, sub it from the sum and add the new one.
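The subtract-old, add-new idea can be sketched as follows. This Python model keeps one running sum for one side; a stereo driver would keep one for left and one for right. The class name and interface are hypothetical and only illustrate the bookkeeping.

```python
class RunningMixer:
    """Keep the mix as a running sum so updating one channel costs
    one subtraction and one addition instead of re-adding all channels."""

    def __init__(self, channels: int):
        self.samples = [0] * channels   # last sample of each channel
        self.total = 0                  # current sum of all channels

    def update(self, channel: int, new_sample: int) -> int:
        self.total += new_sample - self.samples[channel]
        self.samples[channel] = new_sample
        return self.total

mixer = RunningMixer(16)
mixer.update(0, 100)
mixer.update(5, -30)
print(mixer.update(0, 20))   # channel 0 changes from 100 to 20
```

The cost per update no longer depends on the channel count, which is what makes going beyond 16 channels cheaper than extending the chain of adds.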

pik33 · 2022-01-29 07:42

@rogloh said:
Wow, maybe soon you will be able to play that DOPE.MOD file with 28 channels. Although it would need external memory. I wonder if it could work across two COGs somehow...and still be mixed somewhere?

It's big! 28 channels - so there IS a reason to make this driver 32 chn....

And, at last, try to solder these PSRAMs... Mouser PL still doesn't have Edge 32MB in stock.

evanh · 2022-01-29 09:16

@pik33 said:
Adding more channels is possible but I have to get rid of this:
...

Assuming I've read the code correctly ... I think the whole thing needs to be restructured. Move away from processing one channel at a time and just do them all every loop:

The channel selector naturally vanishes.
The SCA instructions are begging to feed summing instead of wasted on MOVs. The channel mixing can be integrated here.
Both the lutRAM buffering and interrupt can and need to vanish. There is already hardware sample buffering and the timing in the smartpins. TESTP or WAITSE1 for waiting on buffer ready.
Overall loop time becomes constant.

pik33 · 2022-01-29 11:18

just do them all every loop:

No way. This is a 3.5 MHz sample rate. The channel selector and buffering work around the lack of time: one channel needs to be recalculated every 100-800 samples, so on average every 400/16=25 samples for 16 channels, but there can be a moment when all channels need to be calculated at once. That's why I have to buffer this and optimize where possible. While all these channels are being calculated, the ISR has 512 samples in the buffer to play, about 140 microseconds. In this situation there will be no less than 100 samples' time to fill the buffer again.
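The time budget in this post can be worked through numerically. The Python sketch below assumes a 354 MHz sysclock and 100 clocks per output sample (figures that appear elsewhere in the thread) and uses the post's own average of 400 samples between recalculations of a channel; the function names are made up.

```python
def buffer_time_us(samples: int, sysclk_hz: float, clocks_per_sample: int) -> float:
    """How long the ISR can play from a buffer holding `samples` entries."""
    sample_rate = sysclk_hz / clocks_per_sample
    return samples / sample_rate * 1e6

def average_gap_samples(mean_period_samples: float, channels: int) -> float:
    """Average number of output samples between two channel recalculations."""
    return mean_period_samples / channels

SYSCLK_HZ = 354_000_000      # assumed clock
CLOCKS_PER_SAMPLE = 100      # assumed ISR period

gap = average_gap_samples(400, 16)
print(round(buffer_time_us(512, SYSCLK_HZ, CLOCKS_PER_SAMPLE), 1), "us of audio in the buffer")
print(gap, "samples between recalculations on average")
print(gap * CLOCKS_PER_SAMPLE, "clocks available per recalculation on average")
```

Under these assumptions the 512-entry buffer holds a little under 145 us of audio, and the main loop has about 2500 clocks on average for each channel recalculation, with the buffer absorbing the case where many channels fall due together.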

evanh · 2022-01-29 12:33

Thanks for the detail.

I found three ANDs that can be deleted. I've commented them out below:

p101        cmp     oldt0,time0 wz   ' There must not be 2 entries with the same time
    if_z    sub     front,#1         ' 
'    if_z    and     front,#511     

                ...

p301        mov     t2,ptrb            ' Check if the buffer is full    
            sub     t2,#1
'            and     t2,#511
            cmp     t2,front wcz
    if_e    jmp     #p301    

            wrlut   newsample, front
            add     front,#1
'            and     front,#511
            djnz    t1,#p301

pik33 · 2022-01-29 12:37

I found three ANDs that can be deleted

To be experimentally checked.

evanh · 2022-01-29 12:39

I wonder if this zero check should be up at the start rather than down at the buffering? If there's nothing to buffer then why was the sample computed?

            cmp      dt0,#0 wz
    if_z    jmp      #loop1
            mov      t1,dt0
