SID´s adventure in P2 land — Parallax Forums

Ahle2 Posts: 1,178
edited 2019-01-10 16:14 in Propeller 2
Hi all,

SID is happy to announce his arrival in P2 land! :)

Just some small modifications to the P1 SIDcog code and this is what I have got so far. I first looked at the P2 instruction set and thought to myself that it's too different from the P1 and that getting this going on the P2 would be a cumbersome task. Boy, was I wrong!! I couldn't see the forest for all the trees (soooooooo many instructions now). Instruction-wise, almost everything is unmodified vs the P1. After replacing MIN/MAX with FGES/FLES and "wc wz" with "wcz", removing some "NR" after TJNZ, changing some MOVS for LUTs to the P2 equivalent and replacing the P1 "counter code" with P2 "smartpin code"..... It works!

I have made NO optimizations at all compared to the P1 code, because I would like a P2 baseline version that is as close to the P1 code as possible. I will start optimizing from there. In fact, it still uses subroutine calls for multiplication, slow HUB access with RDLONG/RDWORD/WRWORD and a 31 kHz sample rate (at 180 MHz and twice the instruction rate of the P1, the P2 is mostly idling between samples). My goal is to have it running at 250 kHz (1/4 of a real SID) on the P2 at 180 MHz, using LUT RAM, real multiplication instructions and other P2-specific optimizations.

The fun times begin! Just load the code onto your P2 Eval board, change L_PIN/R_PIN at the top, and change dumpFile at the bottom for different tunes.

/Johannes

Comments

  • cgracey Posts: 14,133
    edited 2019-01-10 16:16
    Great, Ahle2!

    I wrote a little program to test out the A/V board we made to connect to the P2 Eval. It uses the CORDIC and DACs to make nice analog signals:

    pinbase	=	8
    
    left	=	pinbase+6
    right	=	pinbase+7
    
    freql	=	440.0		'left frequency
    freqr	=	445.0		'right frequency
    
    volume	=	$0080		'$0000..$7FFF volume
    
    fsys	=	250_000_000.0
    
    dat	org
    
    	hubset	##%1_000001_0000011000_1111_10_00	'enable crystal+PLL, stay in 20MHz+ mode
    	waitx	##20_000_000/100			'wait ~10ms for crystal+PLL to stabilize
    	hubset	##%1_000001_0000011000_1111_10_11	'now switch to PLL running at 250MHz
    
    	wrpin	dac,#left
    	wxpin	#256,#left
    	dirh	#left
    
    	wrpin	dac,#right
    	wxpin	#256,#right
    	dirh	#right
    
    	setse1	#%001<<6 + right
    
    .loop	add	p1,f1		'calculate left sample
    	qrotate	amp,p1
    	getqx	x
    	bitnot	x,#15
    	wypin	x,#left
    
    	add	p2,f2		'calculate right sample
    	qrotate	amp,p2
    	getqx	x
    	bitnot	x,#15
    	wypin	x,#right
    
    	waitse1			'wait for new period
    
    	jmp	#.loop		'loop
    
    
    
    x	long	0
    
    p1	long	0
    p2	long	0
    
    f1	long	round(freql * 65536.0 * 65536.0 * 256.0 / fsys)
    f2	long	round(freqr * 65536.0 * 65536.0 * 256.0 / fsys)
    
    amp	long	volume
    
    'dac	long	%10111_00000000_01_00010_0	'random dither (noisier, needs no period)
    dac	long	%10111_00000000_01_00011_0	'pwm (quieter, needs 256N period)
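A quick way to sanity-check the `f1`/`f2` constants in the listing above is to redo the arithmetic off-chip. This is only a Python model, not PASM2. It assumes the phase accumulator is a 32-bit angle that advances once per 256-clock smart pin period, which is how the `wxpin #256` / `waitse1` pair in the listing reads. The function names are made up for this sketch.

```python
# Off-chip model of the phase-increment constants (f1, f2) in the listing.
FSYS = 250_000_000.0   # system clock in Hz, as in the listing
PERIOD = 256           # smart pin period in clocks (wxpin #256)


def phase_increment(freq_hz):
    """32-bit phase step added once per sample period (hypothetical helper)."""
    return round(freq_hz * 65536.0 * 65536.0 * PERIOD / FSYS)


def frequency_from_increment(inc):
    """Inverse mapping: tone frequency that a given phase step produces."""
    return inc * FSYS / (2 ** 32 * PERIOD)


f1 = phase_increment(440.0)   # left channel
f2 = phase_increment(445.0)   # right channel
print(f1, f2)
print(frequency_from_increment(f1), frequency_from_increment(f2))
```

The rounding to an integer step costs well under a millihertz at these frequencies, so the two tones stay about 5 Hz apart as intended.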
    
  • Super @Ahle2 ! This is a great start for more voyages with the Prop2.
  • Awesome! Greets too. Nice to see you back.
  • Coley Posts: 1,108
    edited 2019-01-10 17:06
    Excellent news!
    Will we also see Retronitus at some point in the future?
  • Cluso99 Posts: 18,066
    Excellent work Ahle :smiley:
  • Great work as usual... it will be very handy for people to have something as nice as SIDCOG to test out audio so early on. Thanks @Ahle2 !
  • +1 - Thanks @Ahle2 - well done!
  • @Ahle2,
    This sounds interesting. However all of the links in your signature are broken. Can you point us newbies to info and background on the SIDcog?

    Thanks
    Tom
  • Ahle2 Posts: 1,178
    cgracey wrote: »
    Great, Ahle2!

    I wrote a little program to test out the A/V board we made to connect to the P2 Eval. It uses the CORDIC and DACs to make nice analog signals:

    Thanks for the code snippet... That confirms that I did the smartpin configuration right, I even got the whole selectable event / period edge wait thingy right. :smile: The documentation is a little bit sketchy but it gives enough info to figure things out.

    The CORDIC is just awesome, too bad it's not of much use for emulating the SID. I do have future plans for it though.
  • Ahle2 Posts: 1,178
    Publison wrote: »
    Super @Ahle2 ! This is a great start for more voyages with the Prop2.
    potatohead wrote: »
    Awesome! Greets too. Nice to see you back.
    Coley wrote: »
    Excellent news!
    Will we also see Retronitus at some point in the future?
    Cluso99 wrote: »
    Excellent work Ahle :smiley:
    +1 - Thanks @Ahle2 - well done!

    Thanks a lot guys! :smiley:
    twm47099 wrote: »
    @Ahle2,
    This sounds interesting. However all of the links in your signature are broken. Can you point us newbies to info and background on the SIDcog?

    Thanks
    Tom

    I will fix the links... Short version: in 2009 I posted this in the P1 forum: forums.parallax.com/discussion/118285/sidcog-the-sound-of-the-commodore-64-now-in-the-obex/p1
    And then a video on YouTube:
    rogloh wrote: »
    Great work as usual... it will be very handy for people to have something as nice as SIDCOG to test out audio so early on. Thanks @Ahle2 !

    Thanks rogloh... SIDcog will finally be able to sound GREAT on the P2; the limited 31 kHz sample rate on the P1 (@80 MHz) always bothered me. Half the cycles went to emulating a multiplication instruction.
  • Ahle2 Posts: 1,178
    I tried changing my multiplication subroutine to the built-in multiplication instruction and to my surprise it didn't work as expected. I thought that they both deliver a 32-bit result and should work the same?!

    Then I delved into the instruction document and saw that the two operands get truncated to 16 bits before the actual multiplication. That means that multiplying an 18-bit value by an 8-bit value will give the wrong result even though the product is at most 26 bits. It also doesn't handle signed operands. I have to come up with a fast way of doing S18 x U8 without too many P2 instructions. (S18 really is S32 behind the scenes, but I only use a range of -$20000 to $1ffff.)

    Here is my multiplication routine for reference. It handles signs, uses all bits of the operands and delivers an S32 result. If the result is more than 32 bits, those extra bits get discarded (of course). It still gives the same result as the built-in P2 instruction if I manually truncate the two operands to 16 bits before multiplying.
    multiply      mov       r1,   #0            'Clear 32-bit product
    multiLoop     shr       arg2, #1   wcz      'Halve multiplier and get its LSB in C
      if_c        add       r1,   arg1          'Add multiplicand to product on C
                  shl       arg1, #1            'Double multiplicand
      if_nz       jmp       #multiLoop          'Continue while multiplier is nonzero
    multiply_ret  ret
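To make the difference concrete, here is a rough Python model (not PASM2) of the two behaviours described in the post above: a multiply that only looks at the low 16 bits of each operand, and the shift-and-add loop that uses every bit and keeps the low 32 bits of the product. The helper names are hypothetical, and the 16-bit truncation is modelled purely from the description in the post.

```python
MASK32 = 0xFFFFFFFF


def mul_hw16(d, s):
    """Model of a 16x16 unsigned multiply: only the low 16 bits of each operand count."""
    return ((d & 0xFFFF) * (s & 0xFFFF)) & MASK32


def mul_shift_add(arg1, arg2):
    """Model of the shift-and-add routine above, keeping the low 32 bits."""
    arg1 &= MASK32
    arg2 &= MASK32
    r1 = 0                                  # mov r1,#0
    while True:
        carry = arg2 & 1                    # shr arg2,#1 wcz -> C = old bit 0
        arg2 >>= 1                          #                    Z = (arg2 == 0)
        if carry:
            r1 = (r1 + arg1) & MASK32       # if_c add r1,arg1
        arg1 = (arg1 << 1) & MASK32         # shl arg1,#1
        if arg2 == 0:                       # if_nz jmp #multiLoop
            return r1


def to_signed32(x):
    """Interpret a 32-bit pattern as a two's-complement signed value."""
    return x - (1 << 32) if x & 0x80000000 else x


s18 = -100000      # fits the -$20000..$1ffff range mentioned above
u8 = 200
print(to_signed32(mul_shift_add(s18, u8)))        # full-width signed result
print(to_signed32(mul_hw16(s18 & MASK32, u8)))    # what 16-bit truncation gives
```

With operands that already fit in 16 bits the two functions agree, which matches the observation that manually truncating the inputs makes the routine behave like the built-in instruction.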
    
  • evanh Posts: 15,126
    I banged my head on that 16-bit limit too. There is a signed version though, MULS.

    QMUL can handle larger word size but is a fixed 56 clocks to process. Which is a lot faster than your worst case. On the good side, some of those spare clocks can sometimes be put to use instead of just waiting the whole time.

  • Ahle2 Posts: 1,178
    evanh wrote: »
    There is a signed version though, MULS.

    How is it possible that I missed that?! :blush:


  • Ahle2 Posts: 1,178
    edited 2019-01-11 14:11
    After going through the whole SIDcog code it seems like I'm not going to benefit from hardware multiplication anywhere. I am always doing S18b x U8b, S27b x U4b and such. Even though I'm still using a subroutine call for multiplication I managed to get SIDcog running at 125 kHz on the P2 (half my goal) and BOOOOOY does it sound better than the lousy 31 kHz that it has always used since 2009! :-D
    I will soon upload the 125 kHz version for you all to be able to hear all those crystal clear waveforms.
  • Use QMUL.
  • Ahle2 Posts: 1,178
    Dave Hein wrote: »
    Use QMUL.

    It's unsigned only, so I can't benefit there either! I am sure there is a smart way of using multiple MULS or other trickery to get the result I want. I have to think about it some more.
  • Dave Hein Posts: 6,347
    edited 2019-01-11 14:50
    With multiplication it doesn't matter whether it's unsigned or signed. You will get the same result. This assumes your 18- and 27-bit values are sign-extended to 32-bit values.

    EDIT: It should work for the lower 32 bits. I'm not sure if QMUL will give you the correct answer for the upper 32 bits if you use signed values.
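The point about the lower 32 bits is easy to check off-chip. The following Python sketch (a model only, with hypothetical helper names) sign-extends an 18-bit value to 32 bits, multiplies it as if everything were unsigned, and compares the two halves of the 64-bit product with a true signed multiply.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def sign_extend(value, bits):
    """Sign-extend a `bits`-wide two's-complement field to a Python int."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def unsigned_mul(a, b):
    """32x32 unsigned multiply of the raw bit patterns; returns (low, high) words."""
    full = (a & MASK32) * (b & MASK32)
    return full & MASK32, full >> 32


def signed_mul(a, b):
    """Reference signed multiply; returns (low, high) words of the 64-bit result."""
    full = (a * b) & MASK64
    return full & MASK32, full >> 32


a = sign_extend(0x27960, 18)   # 18-bit pattern for -100000
b = 200                        # unsigned 8-bit operand

u_lo, u_hi = unsigned_mul(a, b)
s_lo, s_hi = signed_mul(a, b)
print(hex(u_lo), hex(s_lo))    # low words
print(hex(u_hi), hex(s_hi))    # high words
```

The low words match, so a 32-bit result can be taken from an unsigned multiply as long as the inputs were sign-extended first. The high words do not match for a negative operand, which is the caveat raised in the EDIT.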
  • Ahle2 Posts: 1,178
    @Dave Hein

    I changed all my multiplications to qmul and it works as expected with signed values, as you said! :) Too bad about the 55 cycles vs 2 cycles though. In two cases my multiplication routine was faster than qmul. Both cases used a 4-bit multiplier. For the other cases qmul is quite a bit faster.
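A back-of-the-envelope cost model suggests why only the 4-bit multipliers win. This Python sketch is an estimate under stated assumptions, not a measurement: it assumes 2 clocks per ordinary instruction, 4 clocks for a taken branch, call or return, and the 55-clock QMUL figure quoted above, and it ignores the extra instruction needed to fetch the QMUL result.

```python
QMUL_CLOCKS = 55   # figure quoted in the thread


def shift_add_clocks(multiplier):
    """Estimated clocks for the shift-and-add routine (ASSUMED timings, see lead-in)."""
    # The loop runs once per bit up to the highest set bit of the multiplier.
    iterations = max(1, (multiplier & 0xFFFFFFFF).bit_length())
    setup = 4 + 2                          # call + "mov r1,#0"
    body = iterations * (2 + 2 + 2)        # shr, conditional add, shl
    branches = (iterations - 1) * 4 + 2    # taken jumps, then the final fall-through
    return setup + body + branches + 4     # + ret


for bits in (4, 5, 8, 16):
    worst_case = (1 << bits) - 1
    print(bits, shift_add_clocks(worst_case), QMUL_CLOCKS)
```

Under these assumptions the loop breaks even somewhere between a 4-bit and a 5-bit multiplier, which is consistent with the observation in the post.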
  • evanh Posts: 15,126
    Ahle2 wrote: »
    Too bad about the 55 cycles vs 2 cycles though. In two cases my multiplication routine was faster than qmul. Both cases used a 4-bit multiplier. For the other cases qmul is quite a bit faster.

    Looking at your code, I see what seems to be a sequence of filter calculations. There should be ways to rearrange the processing to parallel up the multiplies. Often it doesn't matter if there is a lag introduced as a result. Make use of the cordic's pipeline.

    The original Xoroshiro128 PRNG demonstrates this very well. They explicitly arranged to process the output result before iterating the engine so as to allow more parallelism.
  • I am not familiar with P2asm yet, but wouldn't a 32x16 multiply look like this?
    MOV prod, v1
    SAR prod, #16
    MUL prod, v2
    SHL prod, #16
    MUL v1, v2
    ADD prod, v1
    
    Jonathan
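The six-instruction sequence above can be checked against ordinary integer arithmetic. Below is a Python model of it (not PASM2), assuming that MUL multiplies the low 16 bits of both operands as unsigned values, as described earlier in the thread, and that `v2` fits in 16 bits. The helper names are invented for the sketch.

```python
MASK32 = 0xFFFFFFFF


def mul16(d, s):
    """Model of MUL: unsigned product of the low 16 bits of each operand."""
    return ((d & 0xFFFF) * (s & 0xFFFF)) & MASK32


def sar32(x, n):
    """Model of SAR: arithmetic shift right of a 32-bit pattern."""
    x &= MASK32
    if x & 0x80000000:
        x -= 1 << 32
    return (x >> n) & MASK32


def mul_32x16(v1, v2):
    """Step-by-step model of the 32x16 sequence; returns the low 32 bits."""
    prod = sar32(v1, 16)               # MOV prod,v1 / SAR prod,#16
    prod = mul16(prod, v2)             # MUL prod,v2  (high half times v2)
    prod = (prod << 16) & MASK32       # SHL prod,#16
    low = mul16(v1, v2)                # MUL v1,v2    (low half times v2, clobbers v1)
    return (prod + low) & MASK32       # ADD prod,v1


for v1 in (-100000, 131071, 0x12345678, -1):
    for v2 in (0, 200, 0xFFFF):
        print(v1, v2, mul_32x16(v1, v2) == (v1 * v2) & MASK32)
```

Because MUL ignores the upper 16 bits, it makes no difference whether the first shift is arithmetic or logical. Note that the sequence overwrites `v1`, so a caller that still needs the multiplicand has to keep a copy.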
  • evanh Posts: 15,126
    That is good! Obvious in hindsight. And the last instruction can be an auto-return too.

  • The standard way to do a signed multiply with unsigned hardware, or vice versa, is to conditionally add/subtract each argument depending on the sign of the other, or something like that; it's simple algebra from the definition of 2's complement. This works at full precision.

    You can parallel several cordic multiplies using the pipeline for increased throughput if you get the timing right - in
    theory 7 can be in-flight at once, although I've not tried (I've got 3 rotates in parallel whilst writing DAC outputs)
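The conditional correction mentioned above can be written out as follows. This is a Python sketch of the general two's-complement identity, not code from the thread: take the unsigned 64-bit product, then subtract each operand from the high word when the other operand is negative. The function name is hypothetical.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def signed_mul_via_unsigned(a, b):
    """Signed 32x32 -> 64-bit product built from an unsigned multiply."""
    ua, ub = a & MASK32, b & MASK32
    full = ua * ub                      # what an unsigned multiplier returns
    lo, hi = full & MASK32, full >> 32
    if ua & 0x80000000:                 # a negative: subtract b from the high word
        hi = (hi - ub) & MASK32
    if ub & 0x80000000:                 # b negative: subtract a from the high word
        hi = (hi - ua) & MASK32
    return (hi << 32) | lo


for a, b in [(-100000, 200), (-3, -7), (123456, -654321), (0x7FFFFFFF, -1)]:
    print(a, b, signed_mul_via_unsigned(a, b) == (a * b) & MASK64)
```

The low word never needs fixing, which is why the correction only matters when more than 32 bits of the result are used.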
  • Ahle2 Posts: 1,178
    Coley wrote: »
    Excellent news!
    Will we also see Retronitus at some point in the future?

    Probably not Retronitus the way it is on the P1, because to make it fast (high sample rate) I made a lot of P1-specific decisions on the data format, structure, fixed waveform types per channel etc.

    On the P2 I will be able to get a high sample rate while making the engine a lot more flexible and feature-rich. A Retronitus-like music/sound engine is not my priority at the moment though. First thing is to learn the P2 better by optimizing SIDcog for the new instruction set and features. After that I will implement a flexible sample-based sound driver for smartpin, S/PDIF and I2S. This will be the main focus for quite some time I think (and maybe I will do a surprise in between these two).
  • Ahle2 Posts: 1,178
    evanh wrote: »
    Looking at your code, I see what seems to be a sequence of filter calculations. There should be ways to rearrange the processing to parallel up the multiplies. Often it doesn't matter if there is a lag introduced as a result. Make use of the cordic's pipeline.

    The original Xoroshiro128 PRNG demonstrates this very well. They explicitly arranged to process the output result before iterating the engine so as to allow more parallelism.
    Thanks for the input Evan, that gives me some ideas... I will eventually get to this at some point in the optimization process. At the moment I'm shrinking all those mask/shift operations to getnib, getword etc. Next up is using the LUT for the lookups and removing the lookup subroutine. All these optimizations will hopefully get SIDcog running at 250 kHz at 180 MHz. At the moment I would have to clock the P2 at 270 MHz to get 250 kHz, and I have only just started the optimization process. Still it sounds sooooo good at 125 kHz compared to the 31 kHz on the P1. It's like the veil that has been coloring the sound for (almost) a decade has been lifted. :smile:
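To put the sample-rate targets in this thread into perspective, the clock budget per output sample is simply the system clock divided by the sample rate. The Python sketch below works that out for the figures mentioned; the clocks-per-instruction values (4 on the P1, 2 on the P2) are assumptions for a rough instruction count and ignore hub access and branch penalties.

```python
def clocks_per_sample(sysclk_hz, sample_rate_hz):
    """Clock cycles available between two output samples."""
    return sysclk_hz / sample_rate_hz


P1_CLOCKS_PER_INSTR = 4   # assumption: typical P1 instruction
P2_CLOCKS_PER_INSTR = 2   # assumption: typical P2 instruction

cases = [
    ("P1  80 MHz @  31 kHz", 80e6, 31e3, P1_CLOCKS_PER_INSTR),
    ("P2 180 MHz @ 125 kHz", 180e6, 125e3, P2_CLOCKS_PER_INSTR),
    ("P2 270 MHz @ 250 kHz", 270e6, 250e3, P2_CLOCKS_PER_INSTR),
    ("P2 180 MHz @ 250 kHz", 180e6, 250e3, P2_CLOCKS_PER_INSTR),
]
for label, sysclk, rate, cpi in cases:
    clocks = clocks_per_sample(sysclk, rate)
    print(f"{label}: {clocks:.0f} clocks, roughly {clocks / cpi:.0f} instructions")
```

Read this way, the stated goal of 250 kHz at 180 MHz means fitting the whole per-sample loop into 720 clocks, down from the 1080 that 270 MHz would allow.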
  • Ahle2 Posts: 1,178
    lonesock wrote: »
    I am not familiar with P2asm yet, but wouldn't a 32x16 multiply look like this?
    MOV prod, v1
    SAR prod, #16
    MUL prod, v2
    SHL prod, #16
    MUL v1, v2
    ADD prod, v1
    
    Jonathan

    Thanks for this Jonathan. It is quite obvious now when I'm looking at your code. :smile:
  • Ahle2 Posts: 1,178
    edited 2019-01-14 10:29
    Mark_T wrote: »
    You can parallel several cordic multiplies using the pipeline for increased throughput if you get the timing right - in
    theory 7 can be in-flight at once, although I've not tried (I've got 3 rotates in parallel whilst writing DAC outputs)

    This is the way to go, I agree!... The downside is that the code will get less readable. I think the CORDIC solver is the best thing since sliced bread and the way it is pipelined and shared between cogs is quite ingenious. It may take a lot of cycles for each operation, but the throughput for continuous operations, when done right, is excellent. I'm thinking about a 3D engine using the rotate operation on an array of 3D points, all pipelined for fast calculations. Just look at some of the 3D stuff done on an 8 MHz Amiga or a 1 MHz C64 without any hardware aid. Those were ~0.5 MIPS and ~0.2 MIPS machines. It's mind-boggling (for an MCU) to have this kind of "3D power". The P2 is such a cool architecture, I just love it! :smile:
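The latency-versus-throughput trade-off can be estimated with a few lines of arithmetic. This Python sketch rests on two assumptions that are not confirmed in this thread beyond the quoted numbers: a result latency of about 55 clocks, and a cog being able to issue a new CORDIC command every 8 clocks. Real code also has to place the result reads at the right moments, which this model ignores.

```python
import math

CORDIC_LATENCY = 55   # clocks until a result is ready (figure quoted in the thread)
ISSUE_INTERVAL = 8    # ASSUMPTION: one command slot per cog every 8 clocks


def serial_clocks(n):
    """Issue one operation, wait for its result, then issue the next."""
    return n * CORDIC_LATENCY


def pipelined_clocks(n):
    """Issue operations back to back and collect results as they emerge."""
    return CORDIC_LATENCY + (n - 1) * ISSUE_INTERVAL if n > 0 else 0


def max_in_flight():
    """How many operations can overlap before the first result comes out."""
    return math.ceil(CORDIC_LATENCY / ISSUE_INTERVAL)


for n in (1, 3, 7):
    print(n, serial_clocks(n), pipelined_clocks(n))
print(max_in_flight())
```

Under these assumptions the overlap limit works out to the "7 in flight" figure mentioned earlier in the thread, and a batch of seven operations takes a bit over a quarter of the serial time.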
  • I'm going to look at IIR and FIR digital filter implementation using pipelined cordic multiplies next I think
  • cgracey Posts: 14,133
    Good news, Ahle2!

    Are you using the 75-ohm DAC for output? The PWM mode %10111_00000000_01_00011_0 with a 256n time base sounds really good on my P2 Eval board with the A/V add-on, which has a headphone amplifier. Very important to select the LDO regulator for those pins; otherwise, the 3.3V switcher whines like crazy.

    These add-on board kits should ship sometime soon.
  • Ahle2 Posts: 1,178
    edited 2019-01-15 08:16
    Mark_T wrote: »
    I'm going to look at IIR and FIR digital filter implementation using pipelined cordic multiplies next I think

    Fun times indeed! :smiley: I will follow your progress on this with interest. For now I will keep SIDcog's multimode resonance filter (IIR) the way it is and optimize it in the last step of this journey.
  • Ahle2 Posts: 1,178
    edited 2019-01-15 08:34
    cgracey wrote: »
    Good news, Ahle2!

    Are you using the 75-ohm DAC for output? The PWM mode %10111_00000000_01_00011_0 with a 256n time base sounds really good on my P2 Eval board with the A/V add-on, which has a headphone amplifier. Very important to select the LDO regulator for those pins; otherwise, the 3.3V switcher whines like crazy.

    These add-on board kits should ship sometime soon.

    I'm testing all options for DAC outputs, but indeed 75-ohm in PWM mode with the period set to multiples of 256 sounds the best (naturally). And yes, switching regulators are not good for audio stuff. The BOE board was "horrible" in this regard, and the closer to the PCB you got with your fingers, the more it whined. Also, I'm always suspicious of headphone amp ICs. They tend to make the response non-linear, add noise etc. I have a pro-grade external sound card that I connect directly to the P2 pin; it takes care of decoupling and has a very steep lowpass filter for filtering out the overlaid 8-bit PWM signal. I must say that I'm very satisfied with the overall sound quality of the P2 DACs. I will make some SNR/THD measurements and see what that gives.