First experimental try with mp3 playing
pik33
Needs the bleeding-edge Flexprop to compile (spin2cpp built from GitHub).
Needs a clean mp3 (currently named test1.mp3 at line 15 of the Basic code) without any ID3 tags or other metadata.
This will end in Prop2play, but it needs some work.
Edit: I am now trying the decoder on the music I know. It introduces very strange artifacts on Jarre's Oxygene Part 4.
Still a long way to go.
Comments
This is really cool!
I am not familiar with the code you are building with. I want to share a great sounding bit of code that was the original basis for WINAMP.
https://www.rarewares.org/rrw/amp.php
The command line code you find there decoded most mp3 files great! I remember having some trouble with joint stereo and variable bit rate files, which was down to both features being new at the time I was running the software.
Running on what you might ask?
SGI of course!
Last time I ran Amp was on a 30 MHz Indigo. At 30 MHz, Amp could handle stereo MP3 files of up to 256 kbit/s all day long over a shared NFS file system. Doing that took somewhere north of 90 percent CPU.
Maybe this is helpful in some way.
Back when I first started working in the early 90s in a Telco's research labs, I recall a colleague showing me MP3s when they were sort of bleeding edge and not too many people had heard of such a thing. I think he was playing them on an early Pentium (66 MHz?) under DOS. That's about what you needed for doing it back then. Not sure what DOS software he used.
Even the fastest 80486 could play some MP3s according to this guy's tests. But the original Pentium would have come out before that 100MHz 486.
Could have been Amp. There was some history on the now hard to find, if they can be found, SGI freeware pages.
Amp became the basis for a bunch of players, including Winamp. MSDOS was on the list of ports, along with BeOS and some others. That code base is very lean and mean.
The version I linked would play almost anything under 256 kbps on a 30 MHz MIPS R3000. A 66 MHz Pentium, assuming it was not clogged with some odd wait state or front side bus limit, was up to the task, in my opinion.
I looked at the Amp code. It's float. Maybe there will be a P3 with an FPU in the future, but now we have no FPU and we need optimized integer code to do the job in real time.
The decoder used is the Adafruit MP3 decoder, modified by @ersmith, who added the P2 asm part to it. I then concatenated all these small C files to make one object out of them and added a small Basic program as the main code, including my audio driver to output the sound.
This works, but there is something wrong with the decoded audio. While "normal" instruments play OK, several percussive, impulse sounds are decoded incorrectly, with long "tails". I tried Jarre's "Oxygene", where these artifacts are audible mostly in Part 4. With my limited understanding of C and the mp3 decoding process I don't know yet what can be wrong there.
There are 2 types of blocks in mp3: short and long. The first place I will search for a bug is short block decoding... percussive sounds may switch the encoder to short block mode. I will simply try to switch it off, so instead of bad sound there will be silence.
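A minimal sketch of that "mute the short blocks" experiment, in C. The struct and function names here are hypothetical and are not taken from the Adafruit decoder; the only format facts assumed are that a Layer III granule carries 576 frequency lines per channel and that block_type 2 in the side info marks short windows.

```c
#include <stdint.h>
#include <string.h>

#define GRANULE_SAMPLES  576  /* frequency lines per granule per channel (Layer III) */
#define BLOCK_TYPE_SHORT 2    /* side-info block_type value that marks short windows */

/* Hypothetical per-granule side info; a real decoder keeps many more fields. */
typedef struct {
    int window_switching_flag;
    int block_type;
} granule_info_t;

/* Zero the granule if it uses short blocks. Returns 1 if it was muted, 0 otherwise. */
int mute_if_short(const granule_info_t *gi, int32_t *samples)
{
    if (gi->window_switching_flag && gi->block_type == BLOCK_TYPE_SHORT) {
        memset(samples, 0, GRANULE_SAMPLES * sizeof(samples[0]));
        return 1;
    }
    return 0;
}
```

If the "tails" turn into short gaps of silence with this in place, the short block path is the one to look at.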
Edit: Yes, it is the short block decoding that is broken.... Now, I have to learn more about this....
Edit 2: This is a Propeller implementation problem. I compiled the original code on a Linux machine and it works OK. There are 2 possibilities: a bug in Flexprop or something that I broke while making the one-file version. The next step will be compiling the multi-file standalone P2 version from the other topic and checking if it works.
Are you using the DACs with dithering to extend beyond 8 bits? I vaguely recall a wrap issue when values beyond $FF00 are used, e.g. $FF01 and above start to wrap to $00 output values periodically.
It's great that you're having a go at this.
Yes, I do. The $FF00 problem can cause clipping if not corrected, but this is not the problem here. I think there is a bug in Flexprop. I am now preparing a simple test to check this and will report it if it really is a compiler bug.
$FFxx simply outputs the same voltage as $FF00.
Edit: the problem is now reported.
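For the $FF00 ceiling mentioned above, one way to avoid clipping at the top of the range is to scale samples into 0..$FF00 instead of 0..$FFFF. This is only a sketch: it assumes the DAC takes an unsigned 16-bit level, and the function name is made up for illustration.

```c
#include <stdint.h>

#define DAC_MAX 0xFF00u  /* levels above this output the same voltage as $FF00 */

/* Map a signed 16-bit sample onto the usable DAC range 0..DAC_MAX. */
uint16_t sample_to_dac(int16_t s)
{
    uint32_t u = (uint32_t)((int32_t)s + 32768);  /* offset binary, 0..65535 */
    return (uint16_t)((u * DAC_MAX) >> 16);       /* product fits in 32 bits */
}
```

With this scaling a full-scale positive sample lands at $FEFF, so nothing ever reaches the flat region above $FF00.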
Looking at the decoder, it seems that the DCT function can be sped up with similar inline ASM crackheaddery to what I applied to the JPEG code. Obviously need a bug-free version first. Also, does it really need 32 bit multiplies? I'm pretty sure I've seen some audio-related DCT code that works with 16 bit mul (what would you need more precision for? I know MP3 can encode more than 16 bit precision, but for an embedded decoder we don't care...)
At least 32 bits are needed because of rounding errors that accumulate while doing a lot of muls and adds. Also, a format like 4.28 is needed to avoid clipping on intermediate results, which, for 16-bit ints, becomes 4.12. That means less than 12 bits of precision at the end, which means too much noise. I compiled one such decoder (minimp3) for the bare metal RPi: they still used 32 bit ints but in 16.16 format or something like this... there was a #define for FRAC_BITS, but the decoder crashed when I tried to change it. It was good in louder parts but unacceptably noisy in quiet parts; SNR was less than 50 dB. Unacceptable for me, so I had to find something else and ended up with libmad.
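To make the 4.28 versus 4.12 comparison concrete, here is a small sketch of the two multiplies. The function names are made up and this is not code from any of the decoders mentioned; it only shows that both formats have the same headroom (4 integer bits) but 28 versus 12 fractional bits.

```c
#include <stdint.h>

/* Q4.28 multiply: 32-bit operands, 64-bit intermediate, 28 fractional bits kept.
   Right-shifting a negative value is assumed to be arithmetic, as on GCC and Clang. */
static inline int32_t mul_q4_28(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 28);
}

/* Q4.12 multiply: 16-bit operands, 32-bit intermediate, only 12 fractional bits kept. */
static inline int16_t mul_q4_12(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * b) >> 12);
}
```

For example, 0.5 is 1 << 27 in Q4.28 and 1 << 11 in Q4.12. Every multiply drops the bits below the last fractional bit, so in the 16-bit format the truncation error per operation is 2^16 times larger, and it adds up over the many muls and adds of a DCT.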
I have very limited time because of a crappy project I was pushed into because there was no one else to do it, and a house renovation with surprises (as always with house renovations), but the bug has to be found. The decoder works in real time on a P2: at my "standard" 336 MHz it is fast enough (about 18 ms for one frame) to allow reading from SD and decoding in real time using the same cog and an 8 kB buffer for the input file data. Of course these DCTs can be much faster when rewritten in asm.
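As a rough check of that time budget: an MPEG-1 Layer III frame holds 1152 samples per channel, so its playing time depends only on the sample rate. The sketch below assumes a 44.1 kHz file (the post does not state the sample rate) and uses made-up function names.

```c
#include <stdint.h>

#define SAMPLES_PER_FRAME 1152u  /* MPEG-1 Layer III, per channel */

/* Playing time of one frame in microseconds. */
uint32_t frame_duration_us(uint32_t sample_rate_hz)
{
    return (uint32_t)((uint64_t)SAMPLES_PER_FRAME * 1000000u / sample_rate_hz);
}

/* Share of real time spent decoding, in percent (rounded down). */
uint32_t decode_load_percent(uint32_t decode_us, uint32_t sample_rate_hz)
{
    return decode_us * 100u / frame_duration_us(sample_rate_hz);
}
```

At 44.1 kHz a frame lasts about 26.1 ms, so 18 ms of decoding takes roughly two thirds of the cog's time and leaves about 8 ms per frame for SD reads and everything else.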
I guess that's a nope to that then /shrug
If one wanted absurdly fast decoding at the expense of somewhat noisy audio... I spent a lot of time developing a really efficient ADPCM-type codec optimized for P1: https://github.com/Wuerfel21/Heptafon P2 can probably decode some 20 heptafon streams at once without sweat if that's what's needed. I could probably make a P2-optimized one. Heptafon is designed to avoid multiplication, but on P2 it's fast...
That made a difference:
Instead of
#define CLZ(x) __builtin_clz((x))
I added 1 to the result:
#define CLZ(x) 1+ __builtin_clz((x))
Oxygene 4 seems to sound as it should... Let's listen to the full album...
The dynamic "guard bits" feature of the decoder needs to be revised. I tried to debug these guard bit values and they can be as high as 15 sometimes... (and I said 4.28....) They use them to shr the value before proceeding (to avoid overflow?) in several places of the code.
Adding one more "guard bit" removed the sporadic glitches the decoder still had from time to time.
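For readers who have not met guard bits before, here is a sketch of the general idea: count how many bits below the sign bit are unused by the largest sample in a block, which tells you how much headroom is left before an operation can overflow. This is my reading of the technique, not a copy of the decoder's code, and the names are hypothetical. It uses a function rather than a macro, which also avoids the precedence trap of a macro body like `1+ __builtin_clz((x))` that has no outer parentheses.

```c
#include <stdint.h>

/* Leading zero count; __builtin_clz(0) is undefined, so 0 is handled separately. */
static inline int clz32(uint32_t x)
{
    return x ? __builtin_clz(x) : 32;
}

/* Guard bits of a block: unused bits between the sign bit and the
   most significant bit of the largest magnitude in the block. */
int guard_bits(const int32_t *x, int n)
{
    uint32_t mask = 0;
    for (int i = 0; i < n; i++) {
        uint32_t v = (uint32_t)x[i];
        mask |= (x[i] < 0) ? (0u - v) : v;  /* magnitude without signed overflow */
    }
    return clz32(mask) - 1;                 /* minus one for the sign bit */
}
```

A block whose largest magnitude is 1 << 27 has 3 guard bits; a stage that can grow values by more than that has to shift the data right first. An off-by-one in the leading zero count changes that decision directly, which would fit the kind of glitches described above.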
The good thing is: the decoder now works with an acceptable quality, so it can now be moved to Prop2play.
That's great news Pik33
Now the problem is: I cannot make the decoder run in another cog. It works in the player when executing in the main cog, but the main cog has too much work to do. There is no time to do mp3 decoding in it as well.
Making a function that runs in another cog and calling mp3decode from it crashes the player. Stack, heap, ???... I gave it 64k of stack and it still crashes. I have to experiment with this in a simpler environment than the player itself.
Edit: it can run in the second cog in a simple program. Let's determine how much heap and stack it needs to work...
Edit 2: 4k heap, 1k stack, and it runs....
Cool @pik33. I was able to get this to work. I use lossless nowadays so don't have a lot of MP3 stuff handy anymore so I just downloaded some 12s music sample online of an MP3 (which turned out had ID3 tags). It failed of course but once I removed the metadata with this online tool, I could get it to play.
https://products.groupdocs.app/metadata/remove-from-mp3
A single COG MP3 player object could be quite handy. I wouldn't expect it to be able to simultaneously decode and do too much else of your application in the same COG but maybe it could report some play time position and could control seeking or play/pause etc from other COGs. Problem might be file access if you need to share the filesystem with other COGs unless you give it exclusive control. Also if your audio system output has other sources a mixer might be needed somewhere too.
Skipping tags is another thing to do.
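Skipping an ID3v2 tag at the start of a file needs only the 10-byte tag header: the bytes "ID3", two version bytes, a flags byte, and a 4-byte "syncsafe" size with 7 useful bits per byte. A minimal sketch (the function name is made up, and this is not the player's code):

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the number of bytes to skip at the start of the file,
   or 0 if the buffer does not begin with a valid ID3v2 header. */
size_t id3v2_tag_size(const uint8_t *buf, size_t len)
{
    if (len < 10)
        return 0;
    if (buf[0] != 'I' || buf[1] != 'D' || buf[2] != '3')
        return 0;
    if ((buf[6] | buf[7] | buf[8] | buf[9]) & 0x80)
        return 0;                      /* syncsafe bytes never have bit 7 set */

    size_t size = ((size_t)buf[6] << 21) | ((size_t)buf[7] << 14) |
                  ((size_t)buf[8] << 7)  |  (size_t)buf[9];
    size_t total = 10 + size;          /* header + tag body */
    if (buf[5] & 0x10)
        total += 10;                   /* optional footer (ID3v2.4) */
    return total;
}
```

Read the first 10 bytes, call this, and seek to the returned offset before feeding data to the decoder. An ID3v1 tag, if present, sits in the last 128 bytes of the file and would need separate handling.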
The audio driver used has 8 channels so it can mix another 6 mono or 3 stereo audio sources (and I have 16 channels version too)
The problem is somewhere in Prop2play. There is a lot of stuff left over from its first versions that still needs to be changed or rewritten. It has, for example, several manually allocated buffers that can collide with the main program stack. I now have to find what collides with what and remove the bugs that cause this.
I got the first mp3 audio, with a lot of bleeps (a better buffering needed) from p2play.
That heap was not too small, it was ....
way too big.
The player became big enough to make the stack grow over the audio and video driver's buffers placed at the top of the HUB RAM. Reducing the heap size helped. It's time to make use of drivers and resources I have already burned into the flash (including PSRAM ) instead of compiling all of this into the program.
There is a lot of flash on P2 boards: we need a way to dynamically load objects from there. I have this already solved for cog drivers, but not for things like the mp3 decoder, a big piece of hub based code.
And I cannot compile the player with any other optimization level than O1. Size optimization fails with the weird error:
D:/programowanie/p2-retromachine/Propeller/P2P16/player31.p2asm:69422: error: fit 480 failed: pc is 493
The thing with hub exec code is that it's already sitting in fixed addresses in the image and is ready to be executed directly from that address and always would be unless the code is specifically written and compiled by the tool to be relocatable and can be dynamically loaded somewhere at run time into some other available memory area. In comparison it's much simpler to dynamically load COG exec driver code as it is basically already relocatable because the COG address space where it is being loaded to and executed from is independent of where it was sourced from in HUB RAM (unless the driver was written to access fixed HUB RAM addresses for its data which would prevent inherent relocatability at COG load time). If the tools supported loading and executing relocatable hub exec code from flash/external RAM blocks it could improve this limitation. But that's not simply done and would probably require either some caching or an overlay style solution. Hopefully one day Eric might consider this feature. I heard he's getting a PSRAM Edge board soon so with any luck he might be able to look into that possibility one day. RossH has already recently added external PSRAM to his tools so it's certainly doable, but is non-trivial.
Must be exceeding COG space somehow. Normally 496 longs would typically be free to use but it could now be 480 if 16 longs are already reserved for flexspin's use somewhere?
An extended buffer and several small changes make the MP3 play in Prop2Play. Still to do before publishing the new version, when opening the file: reading the sampling frequency and skipping the id3v2 tag.
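The sampling frequency can be read from the first 4 bytes of any MPEG audio frame header. The sketch below is not the player's code and the function name is made up; it checks the 11-bit sync word, reads the version and sample rate index fields, and looks the rate up in the standard table.

```c
#include <stdint.h>

/* Returns the sample rate in Hz from a 4-byte MPEG audio frame header,
   or 0 if the bytes are not a valid header. */
uint32_t mp3_sample_rate(const uint8_t h[4])
{
    static const uint32_t mpeg1_rates[3] = { 44100, 48000, 32000 };

    if (h[0] != 0xFF || (h[1] & 0xE0) != 0xE0)
        return 0;                          /* 11-bit frame sync missing */

    unsigned version = (h[1] >> 3) & 3;    /* 3=MPEG-1, 2=MPEG-2, 0=MPEG-2.5, 1=reserved */
    unsigned sr_idx  = (h[2] >> 2) & 3;    /* 3 is reserved */
    if (version == 1 || sr_idx == 3)
        return 0;

    uint32_t rate = mpeg1_rates[sr_idx];
    if (version == 2)
        rate /= 2;                         /* MPEG-2: 22050, 24000, 16000 */
    else if (version == 0)
        rate /= 4;                         /* MPEG-2.5: 11025, 12000, 8000 */
    return rate;
}
```

A typical 44.1 kHz, 128 kbps file starts its frames with FF FB 90, which this decodes as 44100. The sketch does not check the layer or bitrate fields, so a real player would want a few more sanity checks before trusting a header.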