Now it works. I simply forgot to zero the ptrb at the start.
There are 11 nops free at 354 MHz and 6 nops at 320 MHz. After an hour of playing at 354 MHz the P2 Edge is hot. Not too hot to touch, slightly more than 40C/100F. Still stable.
isr1            wypin   lsample,#left           '2     The sample has to be outputted every 90 cycles
                wypin   rsample,#right          '4
                incmod  counter,a20000000       '6
                cmp     counter,irqtime wcz     '8     Check if it is time for the next sample
    if_ne       reti1                           '10/12 If not, do nothing
                getword rsample,lsnext,#1       '12
                getword lsample,lsnext,#0       '14
                cmp     ptrb,front wcz          '16    If the buffer is empty, do nothing
    if_e        reti1                           '18/20
                rdlut   lsnext,ptrb++           '21    else read the sample and its time from LUT
                rdlut   irqtime,ptrb++          '23    Read the time for this sample
                reti1
.. the "6 nops" low pass filter started to work....
.. and I have to measure the filter and/or find the error in the calculations, as what is supposed to be a 17 kHz low pass filter seems to have a much lower cutoff frequency. Then the filter has to be made switchable via the player.
Just double check your INCMOD "S" value is correct. I think it is meant to be the last value in the incrementing sequence after which it wraps to zero. So INCMOD D, #0 would always remain at zero, while INCMOD D, #1 would count 0,1,0,1, etc. At least that's how I think it works.
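A small PC-side model of the wrap behaviour described above may help when checking the "S" value. It is only a sketch in Python, not P2 code: the function names are made up for illustration, and it assumes INCMOD resets D to zero when D equals S and otherwise increments it.

```python
def incmod(d: int, s: int) -> tuple[int, bool]:
    """Model of INCMOD D,S: if D == S wrap to 0 (carry set), else D + 1."""
    if d == s:
        return 0, True
    return (d + 1) & 0xFFFFFFFF, False


def sequence(s: int, steps: int) -> list[int]:
    """Values of D seen over `steps` iterations, starting from D = 0."""
    d = 0
    seen = []
    for _ in range(steps):
        seen.append(d)
        d, _carry = incmod(d, s)
    return seen


print(sequence(0, 4))  # S = 0: D never leaves zero
print(sequence(1, 4))  # S = 1: D alternates between 0 and 1
print(sequence(3, 6))  # S = 3: D counts 0..3, so the period is S + 1
```

Under this model a counter with S = N has a period of N + 1, which is the off-by-one worth checking.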
The filters are not stable at 320 MHz. Some modules can make the player stop playing, while at 354 MHz they work (more room available in the ISR).
I made a big mistake implementing this filter and that's why calculations failed...
A new idea: precompute all these samples outside the ISR, not just each sample with its time but every output sample. That is, compute the sample as it is done now, determine how many samples with the same value need to be placed in the buffer, place them there, and let the ISR do nothing except put them to the DACs.
This may make it possible to implement the PWM volume, the Amiga filter and postprocessing...
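The idea can be sketched as a PC-side model in Python (not the driver code). The names `fill` and `isr` and the 512-entry buffer size are assumptions for illustration: the main loop writes each value as many times as it has to be repeated, and the "interrupt" only pops the next entry.

```python
from collections import deque

BUFFER_SIZE = 512  # assumed ring buffer capacity, in samples


def fill(buffer: deque, sample: int, repeats: int) -> int:
    """Main loop side: append up to `repeats` copies of `sample`.

    Returns how many copies fitted, so the caller can retry the rest later.
    """
    room = BUFFER_SIZE - len(buffer)
    written = min(room, repeats)
    buffer.extend([sample] * written)
    return written


def isr(buffer: deque, last: int) -> int:
    """Interrupt side: output the next sample, or hold the last one if empty."""
    return buffer.popleft() if buffer else last


buf = deque()
fill(buf, 7, 3)   # value 7 lasts for 3 output periods
fill(buf, 9, 2)   # then value 9 lasts for 2 output periods
out = [isr(buf, 0) for _ in range(5)]
print(out)
```

The cost moves to the main loop, which now writes one buffer entry per output period instead of one entry per value change.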
A major rewrite in progress, but the changed driver (v 014) actually works.
As in the previous post. The main program computes all samples and puts them into the buffer.
The interrupt only pushes these samples into the DACs.
A disadvantage: a more complex main loop which takes longer to execute (to be optimized), but Amiga modules still play.
Advantages:
much simpler ISR
all samples available in the main loop so filters can be added, on-screen oscilloscope can be made, etc
Still no time, and no idea how, to add a Paula type PWM volume. This has low priority, as it is complex to implement and I have no real Amiga to compare the result with.
@pik33 said:
A major rewrite in progress, but the changed driver (v 014) actually works.
As in the previous post. The main program computes all samples and puts them into the buffer.
The interrupt only pushes these samples into the DACs.
A disadvantage: a more complex main loop which takes longer to execute (to be optimized), but Amiga modules still play.
Advantages:
much simpler ISR
all samples available in the main loop so filters can be added, on-screen oscilloscope can be made, etc
Still no time, and no idea how, to add a Paula type PWM volume. This has low priority, as it is complex to implement and I have no real Amiga to compare the result with.
See if you can come up with a scheme that still allows samples to be read in from PSRAM (aim for about 1us of access latency per each random sample access in setups above 250MHz). This way you'll hopefully be able to play larger mod files (or maybe one day even s3m files LOL) and fit them in external memory.
I need to connect PSRAMs to test. I have no Edge with PSRAM (and no hope of getting one in the near future) and I don't know what speed I can achieve with these chips connected using wires.
I can however simulate one us latency while getting a sample to test if it still works.
Edit. 1 us latency per sample should still work. This loop
caused the player to distort at about 300 without the filter and 200 with the filter. There is still a lot of code to optimize in the main loop. These setq/rdlongs will be replaced with something less time consuming. I don't need to read all these parameters every sample.
A lot of time (and also size) optimizations - this version doesn't hang up while running interrupts at Paula x2 speed! (It distorts, but the previous version didn't run at all.)
This means there will be no problem with getting samples from PSRAM or (and?) adding filters to the audio at the standard Paula clock.
The main difference is: removed reading all registers every loop. Instead, only the channel which has to be computed reads its registers. About 160 clocks (average) saved at every loop
The API difference: the cmd register, which is used to reset the sample phase accumulator (= to start playing the note), now has to be set to $FFFFFFFF instead of any non-zero value to allow the sample to run. This removed 8 instructions (16 clocks) from the main loop. It also added an unneeded (?) "feature": setting any value other than $FFFFFFFF will cause the sample to loop when the phase accumulator reaches a zero bit in the CMD register (the accumulator is ANDed with this value every loop).
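A rough Python model of that side effect, as a sketch only. The thread does not give the driver's real register layout or step size, so `advance` and the values below are hypothetical. The only thing taken from the post is that the phase accumulator is ANDed with the cmd value on every loop.

```python
MASK32 = 0xFFFFFFFF


def advance(phase: int, step: int, cmd: int) -> int:
    """One main loop pass: add the step, then AND with the cmd register."""
    return ((phase + step) & MASK32) & cmd


# cmd = $FFFFFFFF: the AND changes nothing, the sample runs normally
print(hex(advance(0x0000FFFF, 1, 0xFFFFFFFF)))

# cmd with bit 16 cleared: the carry into bit 16 is masked off,
# so the phase falls back to zero and the sample "loops"
print(hex(advance(0x0000FFFF, 1, 0xFFFEFFFF)))

# cmd = 0: the phase is held at zero, i.e. the accumulator is reset
print(hex(advance(0x00001234, 1, 0)))
```

This also shows why a single AND can replace the earlier test for "any non-zero value": the reset and the normal run are the same instruction with different masks.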
Several other changes also stripped several clocks from the loop, the loop time is about 200 clocks less now.
With the RAMs being managed from a dedicated cog, it should be possible to perform a prefetching scheme for read data. The tricky part is making it deterministic. Which basically means application driven - the developer has to specify what to prefetch, and preallocate the space for it, before actually needing the data.
Write data is much easier. The regular write-this-block-of-data can be a background operation. As long as data rate stays sane then no issue.
PS: I've not tried reading up on Roger's driver as yet. No idea what flexibility it might have.
While playing an Amiga module, sample data are read sequentially, byte by byte and every byte, so a prefetch is possible. The exceptions are starting a new sample and looping the sample. The driver can skip samples and use 16-bit samples, but these features are not present in the original Paula, so they are not needed to play a module. The read however is still sequential and predictable, so we can tell the RAM driver to do this.
The audio driver's LUT based buffer now contains about 140 microseconds of audio data.
Maybe the module can be played directly from Edge's or Eval's flash memory using such a dedicated driver/cache? I have to try this.
@evanh said:
With the RAMs being managed from a dedicated cog, it should be possible to perform a prefetching scheme for read data. The tricky part is making it deterministic. Which basically means application driven - the developer has to specify what to prefetch, and preallocate the space for it, before actually needing the data.
You can try to prefetch if you can and this will reduce the number of requests needed which is helpful. Also my driver has some in-built capability to skip some bytes of memory between read portions with its graphics copy stuff, and I think this might be useful for audio, however any wrapping back around to lower memory addresses during audio sample looping still has to be dealt with by the caller.
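The caller-side wrap handling could look roughly like this. It is a Python sketch, not code for any real memory driver: `plan_reads` and its parameters are invented for illustration, and it only splits one logical prefetch into linear (address, length) requests that never cross the loop end.

```python
def plan_reads(pos: int, count: int, loop_start: int, loop_end: int) -> list[tuple[int, int]]:
    """Split a prefetch of `count` bytes starting at `pos` into linear reads.

    Addresses run up to loop_end (exclusive) and then wrap to loop_start.
    """
    if not (0 <= loop_start < loop_end) or not (0 <= pos < loop_end):
        raise ValueError("position and loop points are inconsistent")
    reads = []
    while count > 0:
        length = min(count, loop_end - pos)  # stop at the loop end
        reads.append((pos, length))
        count -= length
        pos += length
        if pos == loop_end:                  # wrap back to the loop start
            pos = loop_start
    return reads


print(plan_reads(10, 5, 40, 100))    # still before the loop end: one read
print(plan_reads(90, 30, 40, 100))   # crosses the loop end: two reads
```

Each tuple would then become one request to the external memory driver, so a short loop costs more requests than a long one for the same amount of data.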
A huge size optimization (445->190 longs) and a major rewrite for v. 0.18 (use player2.bas to test this - the registers changed). No more code repeated for 8 channels, no more 72 longs in cog RAM for registers. ALTS/ALTD are used instead, with only one procedure and one register set.
I have to clean this up by moving all old versions to the recycle folder.
This change creates the place for more functions:
there is a current pointer available in the hub for every channel, so the main program can now stream a long sample to one looped channel buffer (wav playing from the SD/Flash/PSRAM)
there is a current sample value available in the hub, so things like oscilloscopes can be done without messing with the driver (simply read these samples in the main program)
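A possible shape for the streaming case, as a Python model only. The class and method names are hypothetical, and the real driver's pointer format is not shown in the thread. The looped channel buffer is treated as two halves, and whenever the play pointer moves into one half, the half it just left is refilled from the long source.

```python
class Streamer:
    """Keep a small looped buffer topped up from a long sample source."""

    def __init__(self, source: bytes, size: int):
        self.source = source
        self.size = size            # looped channel buffer size (even)
        self.half = size // 2
        self.buffer = bytearray(size)
        self.src_pos = 0
        self._refill(0)
        self._refill(1)
        self.playing_half = 0

    def _refill(self, half: int) -> None:
        chunk = self.source[self.src_pos:self.src_pos + self.half]
        self.src_pos += len(chunk)
        chunk = chunk.ljust(self.half, b"\x00")   # pad with silence at the end
        start = half * self.half
        self.buffer[start:start + self.half] = chunk

    def poll(self, play_ptr: int) -> None:
        """Call regularly with the channel's current pointer read from the hub."""
        half = (play_ptr % self.size) // self.half
        if half != self.playing_half:
            self._refill(self.playing_half)       # refill the half just left
            self.playing_half = half


s = Streamer(bytes(range(1, 13)), 8)
s.poll(4)            # player entered the second half: first half is refilled
print(list(s.buffer))
```

The main program has to call `poll` at least once per half buffer of playback, otherwise the player wraps into stale data.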
DAC update rate of sysclock/90 !? .... I was going in my head "That's insane! Why?" ... Then it dawned on me you may not know that the smartpin DACmodes have an integral DAC update timer with buffering. If you were setting the DACs directly then you've coded using the IRQ correctly.
However, as long as the program feeds the next sample into the smartpin sometime before the next DAC update period, then it'll be a clean metronomic interval between the two samples.
PS: Eliminating interrupts altogether will free up the smaller 14 of 90 clocks, 15%, of cog time.
@evanh said:
DAC update rate of sysclock/90 !? .... I was going in my head "That's insane! Why?" ... Then it dawned on me you may not know that the smartpin DACmodes have an integral DAC update timer with buffering. If you were setting the DACs directly then you've coded using the IRQ correctly.
However, as long as the program feeds the next sample into the smartpin sometime before the next DAC update period, then it'll be a clean metronomic interval between the two samples.
PS: Eliminating interrupts altogether will free up the smaller 14 of 90 clocks, 15%, of cog time.
Interesting idea! But you need a DAC output per channel, and have to mix them together in the analog domain with resistors. With CMOS switches (4016 or 4066) and another 4 smart pins, you can also do the PWM volume control.
So 8 pins + some external circuit for a Paula emulation, but this may be very close to the original.
Pik's doing all that in software I think. It's just the two 16-bit dithered DAC smartpins being fed by the ISR.
'--------------------------------------------------------------------------
'------ Interrupt service -------------------------------------------------
'------ Output the sample, get the next one if exists ---------------------
'--------------------------------------------------------------------------
isr1            wypin   lsample,#left           '2     The sample has to be outputted every 90 cycles
                wypin   rsample,#right          '4
                cmp     ptrb,front wcz          '6     If the buffer is empty, do nothing
    if_e        reti1                           '8/10
                rdlut   lsnext,ptrb++           '11    else read the sample and its time from LUT
                getword rsample,lsnext,#1       '13
                getword lsample,lsnext,#0       '15
                reti1                           '17/19
PS: His program is whizzing along doing the mixing at over 300 MHz sysclock! There's quite some MIPS to burn.
Applied the patch: it works. 3 instructions instead of 7! I didn't realize there is something like ptra[2] available, after nearly a full year (since February 8th) with a P2 on my desk... Too old or too busy, or both of these...
This was also the first time when I utilized ALTx instructions (that's why I wrote this topic about aliases).
Now I have to try to write a player with the oscilloscope on the screen. This can be a good example of using the display list based driver: I can replace several text lines with graphics to display the oscilloscope while keeping most of the screen in text mode, saving hub RAM space.
dlptr=v030.dl_ptr
olddl=lpeek(dlptr)
for i=0 to 539: lpoke dlptr+4*i,lpeek(dlptr+4*i+4) : next i
lpoke dlptr+539*4,olddl
do this to the display:
The display is in character mode... (112x31 chars @ 8x16 pixels). I had to recall how to use the DL and then I discovered I still have a lot of unimplemented features. A MOD player using the SD card and a keyboard for control, connected either via a serial terminal or via an RPi based interface, is now possible: all the needed elements are in place and ready to use.
@pik33 said:
It seems connecting the power amplifier to RCAs of AV board instead of audio jack gives slightly better audio quality... one IC less in the audio path
Yes, the audio amplifier is a bit poor and not needed when going into a line input. And for when you do plug in headphones, it is WAY TOO LOUD.
Wow, maybe soon you will be able to play that DOPE.MOD file with 28 channels. Although it would need external memory. I wonder if it could work across two COGs somehow...and still be mixed somewhere?
isr1
                cmp     ptrb,front wz           '6     If the buffer is empty, do nothing
    if_z        reti1                           '8/10
                rdlut   lsnext,ptrb++           '11    read the stereo samples from LUT
                wypin   lsnext,#left            '15    The sample has to be outputted every 100 cycles
                getword lsnext,lsnext,#1        '13
                wypin   lsnext,#right           '17
                reti1                           '21
isr1
                cmp     ptrb,front wz           '6     If the buffer is not empty
    if_nz       rdlut   lsnext,ptrb++           '8/9   read the stereo samples from LUT
    if_nz       wypin   lsnext,#left            '10/11 The sample has to be outputted every 100 cycles
    if_nz       getword lsnext,lsnext,#1        '12/13
    if_nz       wypin   lsnext,#right           '14/15
                reti1                           '18/19
Original:
isr1            wypin   lsample,#left           '6     The sample has to be outputted every 100 cycles
                wypin   rsample,#right          '8
                cmp     ptrb,front wcz          '10    If the buffer is empty, do nothing
    if_e        reti1                           '12/14
                rdlut   lsnext,ptrb++           '15    else read the sample and its time from LUT
                getword rsample,lsnext,#1       '17
                getword lsample,lsnext,#0       '19
                reti1                           '23
@rogloh said:
Wow, maybe soon you will be able to play that DOPE.MOD file with 28 channels. Although it would need external memory. I wonder if it could work across two COGs somehow...and still be mixed somewhere?
It's big! 28 channels - so there IS a reason to make this driver 32 chn....
And, at last, try to solder these PSRAMs... Mouser PL still doesn't have Edge 32MB in stock.
@pik33 said:
Adding more channels is possible but I have to get rid of this:
...
Assuming I've read the code correctly ... I think the whole thing needs restructuring. Move away from processing one channel at a time and just do them all every loop:
The channel selector naturally vanishes.
The SCA instructions are begging to feed summing instead of wasted on MOVs. The channel mixing can be integrated here.
Both the lutRAM buffering and interrupt can and need to vanish. There is already hardware sample buffering and the timing in the smartpins. TESTP or WAITSE1 for waiting on buffer ready.
No way. This is a 3.5 MHz sample rate. The channel selector and buffering work around the lack of time: one channel needs to be recalculated every 100-800 samples, so on average every 400/16=25 samples for 16 channels, but there can be a moment when all channels need to be calculated at once. That's why I have to buffer this and optimize where possible. While all these channels are being calculated, the ISR has 512 samples in the buffer to play, about 140 microseconds. In this situation there will be no less than 100 sample times to fill the buffer again.
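The figures in that reply can be checked with a few lines of Python. This is just arithmetic, not driver code. The exact PAL Paula clock of 3,546,895 Hz is an assumption here, since the post only says "3.5 MHz".

```python
PAULA_CLOCK_HZ = 3_546_895   # assumed: PAL Amiga Paula clock
BUFFER_SAMPLES = 512         # LUT buffer size quoted in the post

# How long the ISR can keep playing from a full buffer, in microseconds
buffer_us = BUFFER_SAMPLES / PAULA_CLOCK_HZ * 1e6


def avg_samples_between_recalcs(channels: int, avg_period: float = 400.0) -> float:
    """Average output samples between two channel recalculations."""
    return avg_period / channels


print(round(buffer_us, 1))                 # roughly the "about 140 microseconds"
print(avg_samples_between_recalcs(16))     # the 400/16 figure from the post
```

So on average the main loop has about 25 output periods per channel update, and the buffer exists to cover the worst case where many channels fall due at once.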
I found three ANDs that can be deleted. I've commented them out below:
p101            cmp     oldt0,time0 wz          ' There must not be 2 entries with the same time
    if_z        sub     front,#1                '
'   if_z        and     front,#511
...
p301            mov     t2,ptrb                 ' Check if the buffer is full
                sub     t2,#1
'               and     t2,#511
                cmp     t2,front wcz
    if_e        jmp     #p301
                wrlut   newsample, front
                add     front,#1
'               and     front,#511
                djnz    t1,#p301
I wonder if this zero check should be up at the start rather than down at the buffering? If there's nothing to buffer then why was the sample computed?
incmod works with $20000000
0.15.
https://github.com/pik33/P2-retromachine/tree/main/Propeller/Tracker player
Which, in turn, means the lutRAM buffer handling can be compacted right down to just waiting for smartpin ready. No software buffering at all.
Andy
@pik33 Instead of this:
you could just do this (C flag is affected too but your code doesn't need to preserve it here anyway):
A little offtopic...
These 4 lines in the player2.bas
The module is http://modarchive.org/index.php?request=view_by_moduleid&query=35071 - worth listening to.
The driver was upgraded, now without any problems, to 16 channels.
A couple of optimised ISR options:
This if_nz thing can speed the ISR up - I will implement this
Adding more channels is possible but I have to get rid of this:
Instead, let every channel keep its own sample, sub it from the sum and add the new one.
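That incremental approach can be modelled in Python as below. It is a sketch with invented names, assuming the part to get rid of is a full re-sum of every channel on each update, which the thread does not show at this point. Each channel remembers its last contribution, so an update costs one subtract and one add regardless of the channel count.

```python
class Mixer:
    """Running mix: update one channel without re-adding all the others."""

    def __init__(self, channels: int):
        self.current = [0] * channels   # last sample contributed by each channel
        self.total = 0                  # running sum of all channels

    def update(self, channel: int, new_sample: int) -> int:
        # Subtract the channel's old contribution and add the new one
        self.total += new_sample - self.current[channel]
        self.current[channel] = new_sample
        return self.total


mixer = Mixer(32)
mixer.update(0, 10)
mixer.update(2, -3)
print(mixer.update(0, 4))   # channel 0 changes from 10 to 4
```

The price is one stored sample per channel, which is small next to the per-channel registers the driver already keeps.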
Thanks for the detail.
To be experimentally checked.