Linux on P2
SaucySoliton
Linux is not running directly on the P2. It runs "inside" a RISC-V emulator. I used this one: https://github.com/cnlohr/mini-rv32ima
You'll need a P2-EC32MB.
flexcc -2 rv32demo2.c
loadp2 -t -b 2000000 rv32demo2.binary -9 .
It can also load the kernel from SD card, perhaps with some slight modifications.
Boot-up time is about 5 minutes. But that is quite good compared to this ARM emulator on an ATmega: https://hackaday.com/2012/03/28/building-the-worst-linux-pc-ever/
I did some simple profiling and found that PSRAM access uses about 1/6 of the CPU time. That is after adding a simple instruction cache. It seems that any PSRAM operation takes over 600 cycles. This is terrible latency, but fixing it might only improve the emulator by 10% or so. There are still some optimizations that could be done, like using SIGNX and other P2 instructions. Even with a 50% improvement, we are still looking at a 2 minute boot time. I see 2 paths to doing anything practical with Linux on P2.
1. Perhaps riscvp2 could be adapted to run Linux. It would need 2 modifications: an XMM mode and interrupts.
2. Try to run Linux natively on the P2. The compiler would need an XMM mode for using the PSRAM. The lack of an MMU does not seem to be a problem: the emulator tested here does not have an MMU and the kernel does not need it.
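The "10% or so" estimate above can be checked with Amdahl's law. This is only a back-of-envelope sketch: the 1/6 share and the 5 minute boot time come from the post, while the speed-up factors for PSRAM access are hypothetical values chosen for illustration.

```python
def speedup(fraction, factor):
    """Amdahl's law: overall speedup when `fraction` of the run time
    becomes `factor` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

PSRAM_FRACTION = 1 / 6   # share of CPU time spent in PSRAM access (from the profiling)
BOOT_MINUTES = 5.0       # measured boot time

# Hypothetical PSRAM speed-ups: 2x, 4x, and "infinitely fast" as an upper bound.
for factor in (2, 4, float("inf")):
    s = speedup(PSRAM_FRACTION, factor)
    print(f"PSRAM {factor}x faster: {s:.3f}x overall, boot ~{BOOT_MINUTES / s:.2f} min")
```

Halving the PSRAM latency gives about a 9% overall gain, and even free PSRAM access would cap the gain at 20%, so the remaining 5/6 of the time is where the larger savings would have to come from.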

Comments
Neat!
I think @ersmith has a just in time risc V thing, are you using that?
Sweet! I was wondering how long before someone put a Penguin on the P2. I just didn't expect it to be run on top of an emulation layer!
PSRAM needs to be cached. If this is an emulated CPU, the cache subsystem will be simpler to do. Also, the RISC-V emulator is a good starting point, because there is already a Linux kernel for it, but a "cached XMM" P2 virtual machine using the native P2 instruction set where possible should perform better.
The main question is now: what for? The simple answer is: to prove it can. A Linux kernel and drivers are a huge bloat of abstraction layers that slow things down. I escape from the operating system even on a Raspberry, using bare metal environment where I can use the hardware and memory in the way I want.
Another answer can be this "caching PSRAM XMM" that doesn't exist yet. Such a machine, with a compiler using this mode, can extend the P2's possibilities.
Yes it does. Catalina already has it, making it realistic to execute real-world programs on the Propeller 2 using PSRAM as XMM.
And such programs can be loaded and run using Catalyst in seconds - no 5 minute boot time needed.
And more to come soon
Ross.
Nice work Saucy
The reason your PSRAM ends up slow is that Roger's driver is not optimized for low latency small accesses. Especially not if you go through too many layers of high-level wrapper.
For my console emulators I rolled my own optimized PSRAM code. I don't think I ever cycle counted it, but it's a lot faster than 600 cycles.
Yeah, what Wuerfel_21 said. My original PSRAM driver is not really intended for top emulator performance, but instead for applications using COG sharing and doing larger transfers (therefore it's more suited for video and general use). For single COG use with small sized accesses you can do better if you talk directly to the device in your own code as this removes the service polling time. But even then some inherent latency still persists and the improvement is bounded. You can probably do a little better with an I-cache added on top.
In some cases the real-world latency may in fact be better with HyperRAM, because the 16 bit PSRAM architecture on P2-EDGE is rather problematic for writes: the driver needs to do read-modify-writes whenever individual bytes or words are written, which gets really slow. However, the 8 bit HyperRAM is going to be slower to transfer than 16 bit PSRAM unless you try sysclk/1 operation, which gets flaky timing-wise over some frequencies. When latency dominates and small transfers are done, the reduced HyperRAM bandwidth isn't that significant.
When I last tested, from memory I think my driver's latency was more like ~300 cycles. It was in the order of 1us at 300MHz IIRC. If you are encountering 600 clocks it might be due to read/modify/writes kicking in. Try writing aligned longs where possible.
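To illustrate why sub-long writes cost more, here is a toy model of a memory that only accepts aligned 32-bit writes. It is not the actual driver code: the function names, the little-endian byte order, and the transaction counts are assumptions made for the sketch.

```python
def write_long(mem, addr, value):
    """Aligned 32-bit write: a single memory transaction."""
    assert addr % 4 == 0, "long writes must be aligned"
    mem[addr // 4] = value & 0xFFFFFFFF
    return 1  # transactions used

def write_byte_rmw(mem, addr, value):
    """Byte write done as read-modify-write of the containing long.
    `mem` is a list of 32-bit words; little-endian byte order is assumed."""
    index = addr // 4
    shift = (addr % 4) * 8
    word = mem[index]                                   # read the whole long
    word = (word & ~(0xFF << shift)) | ((value & 0xFF) << shift)  # patch one byte
    mem[index] = word & 0xFFFFFFFF                      # write the long back
    return 2  # transactions used: one read plus one write

mem = [0x11223344]
print(write_byte_rmw(mem, 1, 0xAB), hex(mem[0]))
```

In this model every byte store pays for two transactions, each with its own latency, which would be consistent with seeing roughly double the expected cycle count.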
That's why a cache is needed. The "CPU cog" cannot use the PSRAM exclusively, you need a framebuffer, a memory for audio samples, etc. So this multi-cog driver is still needed in such a system.
In my "Paula style" audio driver I also implemented a very simple cache to avoid a separate transfer for every sample. Instead, every audio channel has its own 256-byte cache, which is refilled on a miss.
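A per-channel cache of this kind can be sketched as a single 256-byte block per channel. This is a minimal illustration, not the audio driver's code: the class name, the `bytes` object standing in for PSRAM, and the fill counter are all made up for the example.

```python
LINE_BYTES = 256  # cache size per audio channel, as described above

class ChannelCache:
    """One-block read cache for a single audio channel."""

    def __init__(self, backing):
        self.backing = backing   # bytes object standing in for PSRAM
        self.base = None         # start address of the cached block
        self.line = b""
        self.fills = 0           # number of burst reads issued

    def read(self, addr):
        base = addr - (addr % LINE_BYTES)
        if base != self.base:    # miss: fetch the whole block in one burst
            self.line = self.backing[base:base + LINE_BYTES]
            self.base = base
            self.fills += 1
        return self.line[addr - base]

samples = bytes(range(256)) * 4          # 1024 bytes of fake sample data
cache = ChannelCache(samples)
for a in range(len(samples)):
    cache.read(a)
print(cache.fills)                       # burst reads needed for 1024 samples
```

Sequential playback then costs one burst per 256 samples instead of one transfer per sample, so the fixed latency is paid far less often.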
Did directly attached HyperRAM, as in not going through connectors, ever get tested for top performance?
Kudos! @SaucySoliton
I really like Linux, and with libraries and better drivers, it could be quite a nice platform to port to.
Yeah, it comes with some bloat, but it sure is nice to be able to run your favorite open source tools.
What Linux distro are you using? Better drivers? Bloat? Huh?
Except for well-known GPU and WiFi driver problems, which can generally be blamed on uncooperative manufacturers, or unless you're using some special piece of overpriced equipment whose developers couldn't afford to write Linux drivers and whose lawyers were too afraid to allow the release of enough info for someone else to do so, I find Linux to have better driver support than Windows (I never have to bother with installing drivers - it all just works right out of the box!) and practically no bloat, while on Windows it has to install a driver for every little thing and there's nothing but bloat.
I use Linux every day both at home and at work and rarely run into true driver problems, of the kind where either nobody wrote a Linux driver for the piece of hardware in question or the only one available is too buggy to use. Linux's driver situation isn't quite the same as it was 10-15 years ago. On the other hand, Windows's driver situation has only gotten worse, with all the draconian driver signing rules in place nowadays.
It's pretty amazing that @SaucySoliton has gotten Linux running on the P2, but I'm quite certain that no other mainstream OS will ever run on the P2 - they're all orders of magnitude more bloated. I hope with some caching this can be made to be of practical use.
Unfortunately I don't have a setup with directly attached HyperRAM to test. But @evanh and I did test sysclk/1 performance using the connectors with the Parallax HyperRAM breakout board and test results are available if you want to wade through the long HyperRAM thread. From memory it typically topped out somewhere around 300-320MHz or so depending on the settings to use registered clocks and I/O. But there were some gaps in the frequency range where sampling happened at transitions and data errors occurred. A better setup may be able to more finely tweak clock and data signal separation and not combine HyperRAM and HyperFlash on the same bus but the V1 RAM itself was being significantly overclocked from 133MHz with DDR. V2 HyperRAM promised better performance up to 200MHz but I think it was still typically rated for that speed at 1.8V (maybe one manufacturer did have 3V 200MHz or 166MHz rated operation, can't recall offhand)
There is only one existing Prop2 module containing such RAM suitable for high speed sysclock/1 performance - https://forums.parallax.com/discussion/comment/1544576/#Comment_1544576
An interesting comparison happened. https://hackaday.com/2023/03/19/rp2040-runs-linux-through-risc-v-emulation/ Also uses cnlohr's emulator.
RP2040 at 375MHz using SD card as RAM. 230kB cache. 10-15 minute boot time. Hangs after entering username.
P2 at 250MHz with 32MB PSRAM. 5 minute boot time. Shell is usable.
Something seems off here. The RP2040 completes most operations in 1 clock cycle, for 375 MIPS. The P2 takes 2 clock cycles, so 125 MIPS. It must be the SD card slowing down memory operations. I expect that making an effective RAM cache controller is hard.
Yes, SD as RAM should be quite a bit slower than PSRAM, although you'd expect a good cache controller could potentially help it out a bit. The P2 at 250MHz can transfer the equivalent of a 512 byte sector from PSRAM in ~2us, excluding initial latency which, if optimised for this purpose, could probably be brought down to something in the order of 0.5us. It would be very difficult for any normal SD card to achieve this degree of performance, especially when randomly accessing the disk, which would be hamstringing the RP2040. You'd really need one of those fancy new UHS-II or UHS-III cards to compete.
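The ~2us figure can be reproduced with simple arithmetic. The sketch below assumes a 16 bit wide PSRAM bus clocked at sysclk/2; the bus width is mentioned elsewhere in this thread, but the divider is an assumption, as is the 0.5us latency taken from the estimate above.

```python
def burst_time_us(n_bytes, sysclk_mhz, bus_bits=16, clk_divider=2):
    """Time to stream n_bytes over the bus, excluding initial latency.
    bus_bits and clk_divider are assumed values, not measured ones."""
    bytes_per_transfer = bus_bits / 8
    transfers_per_us = sysclk_mhz / clk_divider
    return n_bytes / (bytes_per_transfer * transfers_per_us)

sector = burst_time_us(512, 250)
print(f"512 bytes at 250 MHz: {sector:.3f} us, plus 0.5 us latency = {sector + 0.5:.3f} us")
```

Under these assumptions a 512 byte sector takes about 2.05us to stream, or roughly 2.5us including the optimistic latency figure.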
I also wonder if it would be more competitive if you could read smaller blocks from the SD card for each cache row, because 512 bytes seems like a lot of data to read or write for every read miss or write-back given the relatively slow SD transfer rate. What sized PSRAM transfers did you use in your Linux on the P2 project @SaucySoliton ?
Data access was actual size, so 1/2/4 bytes. Instruction access was 64 bytes. But of course that would only be done once for every 16 instructions.
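The "once for every 16 instructions" figure follows from 64-byte lines and 4-byte RV32 instructions. The sketch below counts line fills for a direct-mapped instruction cache; the direct-mapped organisation and the 16-line size are assumptions for illustration, since the post only says the cache is simple and uses 64-byte fetches.

```python
LINE_BYTES = 64   # instruction fetch size from the post
INSN_BYTES = 4    # RV32 instruction size (no compressed instructions in rv32ima)

def count_fills(pcs, n_lines=16):
    """Return how many PSRAM line fills a trace of program counters causes
    in a direct-mapped cache with n_lines lines (n_lines is hypothetical)."""
    tags = [None] * n_lines
    fills = 0
    for pc in pcs:
        line_addr = pc // LINE_BYTES
        slot = line_addr % n_lines
        if tags[slot] != line_addr:   # miss: fetch the 64-byte line
            tags[slot] = line_addr
            fills += 1
    return fills

straight = range(0, 4096, INSN_BYTES)  # 1024 sequential instructions
loop = [pc for _ in range(100) for pc in range(0, 128, INSN_BYTES)]  # 32-insn loop, 100 passes

print(count_fills(straight), "fills for", len(straight), "straight-line instructions")
print(count_fills(loop), "fills for", len(loop), "loop instructions")
```

Straight-line code pays one fill per 16 instructions, while a loop that fits in the cache pays only for its first pass, which is where an instruction cache earns most of its benefit.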
I've been disconnected from here for quite a while, but now that AI has come in, and has the ability to fill in for my lack of programming skills, I have the hope of getting some things I wanted to do done. So sorry for the delay in answering.
I have had a similar experience to yours, probably, in the Linux world. I generally don't use Windows unless I am gaming, and that doesn't really happen unless I am at a LAN party, for which I have an old Alienware laptop that will still boot Windows 7. Typically I deal with the bloat of Linux, lately Ubuntu 22.04 and 24.04, because I am hosting GPUs and running LLMs. They have gotten quite bloated, with my server running 24.04 using up 90% of its 32GB root SSD on the OS. I've heard that Windows has grown to 4x that, but to me it is out of bounds because, AFAIK, when you run anything later than Windows 7, Microsoft gets a free copy of your data according to the EULA. Same for Skype when it was acquired by Microsoft, so I stopped using it. Who needs North Korean hackers when you have parasites like this?!
As far as a lighter Linux goes, I have some Milk-V Duo (64MB) and Milk-V Duo256 boards that run a stripped-down Ubuntu. They are quite nice little boards at under $15 and a little bit bigger than a stick of gum! I also run IRIX on old SGIs and it is still possible to run an Indy on 32MB of RAM, with an older version of the OS, and with 256MB (the top for an Indy) the latest 6.5.30 runs pretty well.
So, I run Linux every day on at least half a dozen systems if you count my phones, which all run a de-Googled Android on a Linux kernel (all Android phones run Linux). I'm with you... left Windows behind, but I still don't care for the bloat.
An update to that question: I did the testing a couple of months later and found what I'd class as stable operation of the P2Stamp at 300 MHz sysclock, and data rate of sysclock/1 (Hyper clock @ 150 MHz) - https://forums.parallax.com/discussion/comment/1566321/#Comment_1566321
A custom design that provided even better heatsinking of the Prop2 would go faster, but not a lot.
PS: It involved the removal of several PLCC-84 socket pins that connect to the HyperRAM chip. Not hard to do before the PLCC socket is soldered into the PCB. This eliminates the dangly wires so to speak. Here's a photo - https://forums.parallax.com/discussion/comment/1568728/#Comment_1568728 You can see the missing socket pins along the upper middle edge.
PPS: And here's the source code I used for testing - https://forums.parallax.com/discussion/comment/1569287/#Comment_1569287
The downside to using sysclock/1 data rate is the margin for error is really tight. Particularly at the higher clock rates. It pretty much requires a look-up-table of characterised lag compensation values. Not just for frequency but also for Prop2 junction/die temperature. A live temperature measurement of the thermal pad would be advised.
Now I am a bit confused. I'm used to seeing sysclock/2 or sysclock/3 for PSRAMs but something tells me this sentence above has more to do with marketing than math.
Very nice. Can't be letting our dangly bits get in the way, huh?
Hyperbus is a DDR interface, and DDR means the data rate is double the clock rate. And I did attempt clarity by stating "data rate" above.
PS: I'm the only idiot attempting to go for sysclock/1. Everyone else has been using HyperRAM at sysclock/2 (max 85 MHz Hyper clock).
PPS: Rayman's 96 MB PSRAM add-on board has to be operated slower at sysclock/3 because of the large number of parallel RAM chips and much longer tracks.
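The relationship between sysclock, data rate and Hyper clock described above can be written out as a small calculation. The function name is made up for the example, and the 340 MHz sysclock is only what the quoted 85 MHz maximum would imply at sysclock/2, not a figure stated in the thread.

```python
def hyper_clock_mhz(sysclk_mhz, data_rate_divider):
    """Hyper clock for a DDR bus: the data rate is sysclk / divider,
    and DDR moves two transfers per clock cycle."""
    data_rate_mtps = sysclk_mhz / data_rate_divider  # million transfers per second
    return data_rate_mtps / 2

print(hyper_clock_mhz(300, 1))  # sysclock/1 at 300 MHz sysclock
print(hyper_clock_mhz(340, 2))  # sysclock/2 at an assumed 340 MHz sysclock
```

So "sysclock/1" refers to the data rate, and the clock pin itself toggles at half that.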
Are there bigger 3.3V PSRAM chips yet? Could 8 tiny (same package as on the EC32MB) 16MB PSRAMs be driven at sysclock/2 if they were placed much more compactly? 128MB would be nice for Linux!
Yes, newer 64M x 8bit capacity parts exist. Not sure about availability though. Looks like 4-bit packages top out at 64M x 4bit.
What type of project for P2 with Linux do you have in mind?
I'd expect it should be achievable with a good design, assuming 10-20mm trace lengths and 16 MB per device with two chips per nibble, making a 16 bit wide effective RAM. With the existing PSRAM chips I was able to get 64MB (fanout of two) mostly working for P2-EVAL with a sub-optimal design, so a good design should be even better.
The limit for good operation appears to be 3 RAM chips per lane (empirically from the boards I have). So with a 16 bit bus, 3 banks and 64Mbit/8MByte per chip, you can get 96MB.
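The capacity figures in this part of the thread all follow the same arithmetic: chips per bank times banks times chip size. A small sketch, assuming 4-bit wide PSRAM chips on a 16 bit bus (the function and parameter names are made up for the example):

```python
def total_mbytes(bus_bits, chip_bits, banks, chip_mbit):
    """Total capacity in MB for a bus built from identical RAM chips."""
    chips_per_bank = bus_bits // chip_bits
    return chips_per_bank * banks * chip_mbit // 8

print(total_mbytes(16, 4, 1, 64))    # one bank of 64 Mbit chips
print(total_mbytes(16, 4, 3, 64))    # three chips per lane, 64 Mbit each
print(total_mbytes(16, 4, 2, 128))   # two chips per nibble, 16 MB each
```

These give 32MB, 96MB and 128MB respectively, matching the board and the proposals discussed above.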
128Mbit chips have been alleged as "in development" before, but I don't think anything ever came of that.
Ganging two 8 bit devices up in a 16 bit bus would require some rather annoying setup (or wasting vast amounts of LUT).
Up to 512 Mbit parts are listed - https://www.digikey.co.nz/en/products/detail/issi-integrated-silicon-solution-inc/IS66WVH64M8DBLL-166B1LI/24617503
Wow, I didn't realize that these 3V parts had finally come to fruition.
Someone needs to make a new HyperRAM board and try these 64MB parts. I'm focussed on SW at this time, but it'd be interesting to spin a double-wide breakout board that could surface mount one or two of these parts. I didn't read the data sheet fully, but I am optimistic that my HyperRAM driver could be adapted to it reasonably easily, and the faster rated speed should allow for high speed operation. It would also be great to find a small part that could tweak the input clock phase slightly under P2 control to give finer input latency adjustments and more reliable transfers, especially at sysclk/1. The spare pins on the breakout could possibly control such a peripheral via CLK+DATA or I2C, if these devices even exist.