SID's adventure in P2 land
Ahle2
Posts: 1,179
Hi all,
SID is happy to announce his arrival in P2 land!
Just some small modifications to the P1 SIDcog code and this is what I have got so far. I first looked at the P2 instruction set and thought to myself that it's too different from the P1 and that this would be a cumbersome task to get going on the P2. Boy, was I wrong!! I couldn't see the forest for all the trees (soooooooo many instructions now). Almost everything is unmodified instruction-wise vs. the P1. After replacing MIN/MAX with FGES/FLES and "wc wz" with "wcz", removing some "NR" after TJNZ, changing some MOVS for LUTs to the P2 equivalent, and replacing the P1 "counter code" with P2 "smartpin code"... it works!
I have made NO optimizations at all compared to the P1 code, because I would like a P2 baseline version that is as close to the P1 code as possible. I will start optimizing from there. In fact, it still uses subroutine calls for multiplication, slow HUB access with RDLONG/RDWORD/WRWORD, and a 31 kHz sample rate (the P2 is mostly idling between samples at 180 MHz and twice the instruction rate compared to the P1). My goal is to have it running at 250 kHz (1/4 of a real SID) on the P2 at 180 MHz, using LUT RAM, real multiplication instructions and other P2-specific optimizations.
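To put those sample rates in perspective, here is a rough back-of-the-envelope sketch of the per-sample cycle budget. It is only an illustration, not part of SIDcog: it assumes 4 clocks per instruction on the P1 at 80 MHz and 2 clocks per instruction on the P2 at 180 MHz, and the function names are made up for this example.

```python
def clocks_per_sample(sys_clock_hz, sample_rate_hz):
    """Whole system clocks available between two output samples."""
    return sys_clock_hz // sample_rate_hz


def instructions_per_sample(sys_clock_hz, sample_rate_hz, clocks_per_instruction):
    """Rough instruction budget per sample (ignores hub stalls and waits)."""
    return clocks_per_sample(sys_clock_hz, sample_rate_hz) // clocks_per_instruction


# (system clock in Hz, assumed clocks per instruction)
P1 = (80_000_000, 4)
P2 = (180_000_000, 2)

for label, (clock, cpi), rate in [
    ("P1 @ 31 kHz", P1, 31_000),
    ("P2 @ 31 kHz", P2, 31_000),
    ("P2 @ 250 kHz", P2, 250_000),
]:
    print(label, clocks_per_sample(clock, rate), "clocks,",
          instructions_per_sample(clock, rate, cpi), "instructions")
```

Under these assumptions the 250 kHz target leaves 720 clocks (about 360 instructions) per sample, which is why the subroutine multiplies and slow HUB accesses have to go.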
The fun times begin! Just load the code on your P2 Eval board, change L_PIN/R_PIN at the top, and change dumpFile at the bottom for different tunes.
/Johannes
Comments
I wrote a little program to test out the A/V board we made to connect to the P2 Eval. It uses the CORDIC and DACs to make nice analog signals:
Will we also see Retronitus at some point in the future?
This sounds interesting. However all of the links in your signature are broken. Can you point us newbies to info and background on the SIDcog?
Thanks
Tom
Thanks for the code snippet... That confirms that I got the smartpin configuration right; I even got the whole selectable event / period edge wait thingy right. The documentation is a little bit sketchy, but it gives enough info to figure things out.
The CORDIC is just awesome; too bad it's not of much use for emulating the SID. I do have future plans for it though.
Thanks a lot guys!
I will fix the links... Short version: in 2009 I posted this in the P1 forum: forums.parallax.com/discussion/118285/sidcog-the-sound-of-the-commodore-64-now-in-the-obex/p1
And then a video on YouTube:
Thanks roghloh... SIDcog will finally be able to sound GREAT on the P2, the 31 kHz limited sample rate on the P1 (@80 MHz) always bothered me. Half the cycles went to emulating a multiplication instruction.
Then I delved into the instruction document and saw that the two operands get truncated to 16 bits before the actual multiplication is done. That means that multiplying an 18-bit value by an 8-bit value will give the wrong result even though the product is at most 26 bits. On top of that, it doesn't handle signed operations. I have to come up with a fast way of doing S18 x U8 without too many P2 instructions. (S18 really is S32 behind the scenes, but I only use a range of -$20000 to $1ffff.)
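The truncation problem, and one possible way around it, can be modelled in a few lines. This is a hypothetical sketch rather than anything from SIDcog: `mul_u16` is a Python stand-in for an unsigned 16x16 multiply, and `mul_s18_u8` splits the S18 operand into two 16-bit halves so that two such multiplies plus a shift and an add give the right low 32 bits. The shift by 16 throws away exactly the bits where signed and unsigned products would differ, so no signed multiply is needed.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret the low 32 bits of value as a two's-complement number."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def mul_u16(d, s):
    """Model of an unsigned 16x16 multiply: operands are truncated to 16 bits."""
    return (d & MASK16) * (s & MASK16)


def mul_s18_u8(a, b):
    """S18 x U8 built from two unsigned 16x16 multiplies (hypothetical).

    a is in -$20000..$1FFFF (held as a 32-bit register image), b is 0..255.
    """
    a &= MASK32                  # 32-bit register image of the signed value
    low = mul_u16(a, b)          # low 16 bits of a times b
    high = mul_u16(a >> 16, b)   # high 16 bits of a times b
    # Shifting left by 16 in a 32-bit register drops the upper half of 'high'.
    return to_signed32(low + ((high << 16) & MASK32))


# The naive single multiply loses bits 16 and 17 of the operand:
print(mul_u16(0x1FFFF, 255), 0x1FFFF * 255)
# The split version keeps them and also works for negative values:
print(mul_s18_u8(0x1FFFF, 255), mul_s18_u8(-0x20000, 255))
```

Whether this beats a CORDIC multiply or a shift-and-add loop on real hardware depends on how many extra instructions the split, shift and add cost, so treat it as a starting point only.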
Here is my multiplication routine for reference. It handles signs, uses all bits of both operands and delivers a signed 32-bit result. If the result is more than 32 bits, the extra bits get discarded (of course). It still gives the same result as the built-in P2 instruction if I manually truncate the two operands to 16 bits before multiplying.
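As a stand-in illustration only (this is not the SIDcog routine), a generic shift-and-add multiply with the properties described above could be modelled like this. The name `soft_mul_s32` and the structure are assumptions made for the example; the loop runs once per significant bit of the second operand, which is why short multipliers finish quickly.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret the low 32 bits of value as a two's-complement number."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def soft_mul_s32(a, b):
    """Signed shift-and-add multiply that keeps only the low 32 result bits."""
    negative = (a < 0) != (b < 0)   # result sign, handled separately
    a, b = abs(a), abs(b)
    product = 0
    while b:                        # one iteration per significant bit of b
        if b & 1:
            product = (product + a) & MASK32
        a = (a << 1) & MASK32       # bits shifted past bit 31 are discarded
        b >>= 1
    if negative:
        product = -product
    return to_signed32(product)


print(soft_mul_s32(-0x20000, 255))
print(soft_mul_s32(3, -4))
```

With both operands masked to 16 bits first, the low 32 bits match a plain 16x16 unsigned product, which is the comparison mentioned above.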
QMUL can handle larger word sizes but takes a fixed 56 clocks to process, which is a lot faster than your worst case. On the good side, some of those spare clocks can sometimes be put to use instead of just waiting the whole time.
How is it possible that I missed that?!
I will soon upload the 125 kHz version for you all to be able to hear all those crystal clear waveforms.
It's unsigned only, so I can't benefit there either! I am sure there is a smart way of using multiple MULs or other trickery to get the result I want. I have to think about it some more.
EDIT: It should work for the lower 32 bits. I'm not sure if QMUL will give you the correct answer for the upper 32 bits if you use signed values.
I changed all my multiplications to QMUL and it works as expected with signed values, as you said! Too bad about the 55 cycles vs. 2 cycles though. In two cases my multiplication routine was faster than QMUL; both used a 4-bit multiplier. In the other cases QMUL is quite a bit faster.
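The reason signed values survive an unsigned multiply is that the low 32 bits of a product are the same in two's complement whether the operands are read as signed or unsigned. A small Python model shows this; `umul32x32` is just a hypothetical stand-in for an unsigned 32x32 multiply with a 64-bit result, not a description of the CORDIC hardware.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret the low 32 bits of value as a two's-complement number."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def umul32x32(x, y):
    """Unsigned 32x32 -> 64-bit multiply, returned as (low32, high32)."""
    product = (x & MASK32) * (y & MASK32)
    return product & MASK32, product >> 32


def signed_mul_low32(a, b):
    """Feed signed values to the unsigned multiplier and keep the low half."""
    low, _high = umul32x32(a, b)
    return to_signed32(low)


print(signed_mul_low32(-0x20000, 255))
print(signed_mul_low32(-3, -7))
```

This only holds as long as the true product fits in 32 signed bits, which is the case for an S18 x U8 multiply.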
Looking at your code, I see what seems to be a sequence of filter calculations. There should be ways to rearrange the processing to parallel up the multiplies. Often it doesn't matter if there is a lag introduced as a result. Make use of the cordic's pipeline.
The original Xoroshiro128 PRNG demonstrates this very well. They explicitly arranged to process the output result before iterating the engine so as to allow more parallelism.
dependent on the sign of the other, or something like that; it's simple algebra from the definition of 2's complement.
This works at full precision.
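One way to read that algebra, sketched in Python under the assumption that the correction meant here is the standard one: writing a negative operand as its unsigned image minus 2^32 shows that the upper half of the unsigned product is too large by the other operand's image. The helper names are invented for this example.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed64(value):
    """Interpret the low 64 bits of value as a two's-complement number."""
    value &= MASK64
    return value - (1 << 64) if value & (1 << 63) else value


def smul32x32(a, b):
    """Signed 32x32 -> 64 multiply built on an unsigned multiply.

    With a = x - 2**32 (when a < 0) and b = y - 2**32 (when b < 0):
    a*b = x*y - 2**32 * (y if a < 0 else 0) - 2**32 * (x if b < 0 else 0)
    modulo 2**64, so only the upper 32 bits need correcting.
    """
    x, y = a & MASK32, b & MASK32   # unsigned register images
    product = x * y                 # what an unsigned multiplier returns
    low = product & MASK32
    high = product >> 32
    if a < 0:
        high -= y                   # subtract the other operand's image
    if b < 0:
        high -= x
    high &= MASK32
    return to_signed64((high << 32) | low)


print(smul32x32(-0x20000, 255))
print(smul32x32(-2**31, -2**31))
```

The lower 32 bits never need touching, so code that only uses the low half can skip the two conditional subtracts.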
You can parallel several cordic multiplies using the pipeline for increased throughput if you get the timing right - in theory 7 can be in-flight at once, although I've not tried (I've got 3 rotates in parallel whilst writing DAC outputs)
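A toy timing model shows why pipelining pays off. It assumes the 56-clock latency quoted earlier in the thread and, as a guess about the issue rate, that a cog can start one new CORDIC command every 8 clocks; both constants and the function names are assumptions for illustration, not measured figures.

```python
CORDIC_LATENCY = 56   # clocks from command to result (figure from the thread)
ISSUE_INTERVAL = 8    # assumed: one command slot per cog every 8 clocks


def sequential_clocks(count):
    """Issue a command, wait for its result, then issue the next one."""
    return count * CORDIC_LATENCY


def pipelined_clocks(count):
    """Issue commands back to back, then collect the results in order."""
    if count == 0:
        return 0
    return (count - 1) * ISSUE_INTERVAL + CORDIC_LATENCY


def max_in_flight():
    """How many commands fit inside one latency window."""
    return CORDIC_LATENCY // ISSUE_INTERVAL


print(max_in_flight())
print(sequential_clocks(7), pipelined_clocks(7))
```

Under these assumptions seven multiplies cost 104 clocks pipelined instead of 392 one at a time, at the price of having to interleave the result reads with the command issues.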
Probably not Retronitus the way it is on the P1, because to make it fast (high sample rate) I used a lot of P1 specific decisions on the data format, structure and fixed waveform types per channel etc.
On the P2 I will be able to get a high sample rate while making the engine a lot more flexible and feature-rich. A Retronitus-like music/sound engine is not my priority at the moment though. First thing is to learn the P2 better by optimizing SIDcog for the new instruction set and features. After that I will implement a flexible sample-based sound driver for smartpin, S/PDIF and I2S. This will be the main focus for quite some time I think (and maybe I will do a surprise in between these two).
Thanks for this Jonathan. It is quite obvious now when I'm looking at your code.
This is the way to go, I agree!... The downside is that the code will get less readable. I think the CORDIC solver is the best thing since sliced bread, and the way it is pipelined and shared between cogs is quite ingenious. It may take a lot of cycles for each operation, but the throughput for continuous operations, when done right, is excellent. I'm thinking about a 3D engine using the rotate operation on an array of 3D points, all pipelined for fast calculations. Just look at some of the 3D stuff done on an 8 MHz Amiga or a 1 MHz C64 without any hardware aid. Those were ~0.5 MIPS and ~0.2 MIPS machines. It's mind-boggling (for an MCU) to have this kind of "3D power". The P2 is such a cool architecture, I just love it!
Are you using the 75-ohm DAC for output? The PWM mode %10111_00000000_01_00011_0 with a 256n time base sounds really good on my P2 Eval board with the A/V add-on, which has a headphone amplifier. Very important to select the LDO regulator for those pins; otherwise, the 3.3V switcher whines like crazy.
These add-on board kits should ship sometime soon.
Fun times indeed! I will follow your progress on this with interest. For now I will keep SIDcog's multimode resonance filter (IIR) the way it is and optimize it as the last step of this journey.
I'm testing all options for DAC outputs, but indeed 75-ohm in PWM mode with the period set to multiples of 256 sounds the best (naturally). And yes, switching regulators are not good for audio stuff. The BOE board was "horrible" in this regard, and the closer to the PCB you got with your fingers, the more it whined. Then I'm always suspicious of headphone amp ICs. They tend to make the response non-linear, add noise, etc. I have a pro-grade external sound card that I connect directly to the P2 pin; it takes care of decoupling and has a very steep lowpass filter for filtering out the 8-bit PWM overlaid signal. I must say that I'm very satisfied with the overall sound quality of the P2 DACs. I will make some SNR/THD measurements and see what that gives.
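For anyone wondering which sample rates a period of 256n gives, here is a small arithmetic sketch. It assumes the period is counted in system clocks and that the clock is 180 MHz as used elsewhere in this thread; the function names are hypothetical.

```python
SYS_CLOCK_HZ = 180_000_000   # assumed system clock


def pwm_sample_rate(multiple, sys_clock_hz=SYS_CLOCK_HZ):
    """Sample rate when one sample period lasts 256 * multiple clocks."""
    return sys_clock_hz / (256 * multiple)


def closest_multiple(target_hz, max_multiple=16, sys_clock_hz=SYS_CLOCK_HZ):
    """Multiple of 256 whose sample rate lies nearest to target_hz."""
    return min(range(1, max_multiple + 1),
               key=lambda n: abs(pwm_sample_rate(n, sys_clock_hz) - target_hz))


for n in range(1, 7):
    print(n, 256 * n, pwm_sample_rate(n))

print(closest_multiple(250_000))
```

Under these assumptions a 250 kHz target lands between two options, and the nearest one (n = 3, a 768-clock period) gives 234.375 kHz.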