Linux on P2

SaucySoliton · 2023-01-09 23:12

Linux is not running directly on the P2. It is "inside" a RISC-V emulator. I used this one: https://github.com/cnlohr/mini-rv32ima

You'll need a P2-EC32MB.

flexcc -2 rv32demo2.c 
loadp2 -t -b 2000000 rv32demo2.binary -9 .

It can also load the kernel from SD card, perhaps with some slight modifications.

Boot-up time is about 5 minutes. But that is quite good compared to this ARM emulator on an ATmega: https://hackaday.com/2012/03/28/building-the-worst-linux-pc-ever/

I did some simple profiling and found that PSRAM access uses about 1/6 of the CPU time. That is after adding a simple instruction cache. It seems that any PSRAM operation takes over 600 cycles. This is terrible latency. But fixing it might only improve the emulator by 10% or so. There are still some optimizations that could be done, like using SIGNX and other P2 instructions. Even with a 50% improvement, we are still looking at a 2 minute boot time. I see 2 paths to doing anything practical with Linux on P2.

1. Perhaps riscvp2 could be adapted to run Linux. It would need 2 modifications: XMM mode and interrupts.
2. Try to run Linux natively on the P2. The compiler would need an XMM mode for using the PSRAM. The lack of an MMU seems not to be a problem. The emulator tested here does not have an MMU and the kernel does not need it.
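
The "10% or so" estimate above can be checked with Amdahl's law. This is only a sketch: the 1/6 fraction and the 5 minute baseline come from the post, while the speedup factor applied to PSRAM access (for example 2x for halving the latency) is an assumption, and the function names are made up for illustration.

```c
#include <assert.h>

/* Overall speedup when a fraction f of the run time is made s times
 * faster (Amdahl's law): 1 / ((1 - f) + f / s). */
double overall_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}

/* Boot time in minutes after applying that speedup to a baseline. */
double boot_minutes(double baseline_min, double f, double s)
{
    return baseline_min / overall_speedup(f, s);
}
```

With f = 1/6 and an assumed s = 2, the overall speedup is 12/11, roughly 9%, and a 5 minute boot becomes about 4.6 minutes. Even if PSRAM access cost nothing at all, the speedup would be capped at 1.2x.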

Rayman · 2023-01-09 23:24

Neat!
I think @ersmith has a just-in-time RISC-V thing, are you using that?

ke4pjw · 2023-01-09 23:41

Sweet! I was wondering how long before someone put a Penguin on the P2. I just didn't expect it to be run on top of an emulation layer!

pik33 · 2023-01-10 08:38

It seems that any PSRAM operation takes over 600 cycles.

PSRAM needs to be cached. If this is an emulated CPU, the cache subsystem will be simpler to do. Also, the RISC-V emulator is a good starting point, because there is already a Linux kernel for it, but a "cached XMM" P2 virtual machine using the native P2 instruction set where possible should perform better.

The main question is now: what for? The simple answer is: to prove it can be done. A Linux kernel and drivers are a huge bloat of abstraction layers that slow things down. I escape from the operating system even on a Raspberry, using a bare-metal environment where I can use the hardware and memory in the way I want.

Another answer can be this "caching PSRAM XMM" that doesn't exist yet. Such a machine, with a compiler using this mode, can extend P2 possibilities.

RossH · 2023-01-10 11:05

Another answer can be this "caching PSRAM XMM" that doesn't exist yet.

Yes it does. Catalina already has it, making it realistic to execute real-world programs on the Propeller 2 using PSRAM as XMM.

And such programs can be loaded and run using Catalyst in seconds - no 5 minute boot time needed.

And more to come soon

Ross.

Tubular · 2023-01-10 11:52

Nice work Saucy

Wuerfel_21 · 2023-01-10 16:45

The reason your PSRAM ends up slow is that Roger's driver is not optimized for low latency small accesses. Especially not if you go through too many layers of high-level wrapper.

For my console emulators I rolled my own optimized PSRAM code. I don't think I ever cycle counted it, but it's a lot faster than 600 cycles.

rogloh · 2023-01-11 02:42

Yeah, what Wuerfel_21 said. My original PSRAM driver is not really intended for top emulator performance, but instead for applications using COG sharing and doing larger transfers (therefore it's more suited for video and general use). For single COG use with small sized accesses you can do better if you talk directly to the device in your own code as this removes the service polling time. But even then some inherent latency still persists and the improvement is bounded. You can probably do a little better with an I-cache added on top.

In some cases the real-world latency may in fact be better with HyperRAM because the 16 bit PSRAM architecture on P2-EDGE is rather problematic for writes and the driver needs to do read-modify-writes in cases where individual bytes or words are written, which gets really slow. However the 8 bit HyperRAM is going to be slower to transfer than 16 bit PSRAM unless you try sysclk/1 operation which gets flaky timing-wise over some frequencies. When latency dominates and small transfers are done the reduced HyperRAM bandwidth isn't that significant.

When I last tested, from memory I think my driver's latency should be more like ~300 cycles. It was in the order of 1us at 300MHz IIRC. If you are encountering 600 clocks it might be due to the need to perform read/modify/writes kicking in. Try writing aligned longs where possible.
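
To illustrate why byte and word writes cost more than aligned long writes, here is a minimal sketch of a read-modify-write. It is not the actual driver code: `ext_ram` is a hypothetical stand-in for external memory that can only be accessed one aligned 32-bit long at a time, the little-endian byte lanes are an assumption, and the transaction counter exists only to make the extra bus trip visible.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_LONGS 16

/* Hypothetical external RAM model: long-only access, counts bus trips. */
typedef struct {
    uint32_t longs[MEM_LONGS];
    unsigned transactions;
} ext_ram;

uint32_t ram_read_long(ext_ram *m, uint32_t addr)
{
    assert(addr % 4 == 0 && addr / 4 < MEM_LONGS);
    m->transactions++;
    return m->longs[addr / 4];
}

void ram_write_long(ext_ram *m, uint32_t addr, uint32_t value)
{
    assert(addr % 4 == 0 && addr / 4 < MEM_LONGS);
    m->transactions++;
    m->longs[addr / 4] = value;
}

/* Byte write: read the containing long, patch one byte lane
 * (little-endian), write the long back. Two transactions, not one. */
void ram_write_byte(ext_ram *m, uint32_t addr, uint8_t value)
{
    uint32_t base  = addr & ~3u;
    unsigned shift = (addr & 3u) * 8;
    uint32_t word  = ram_read_long(m, base);
    word = (word & ~(0xFFu << shift)) | ((uint32_t)value << shift);
    ram_write_long(m, base, word);
}
```

In this model an aligned long write is one transaction while a single byte write is two, so if each transaction carries a few hundred cycles of latency, the byte write roughly doubles it.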

pik33 · 2023-01-11 07:11

That's why a cache is needed. The "CPU cog" cannot use the PSRAM exclusively, you need a framebuffer, a memory for audio samples, etc. So this multi-cog driver is still needed in such a system.

In my "Paula style" audio driver I also implemented a very simple cache to avoid single transfers for every sample. Instead, every audio channel has its own 256-byte cache, which is read when missed.

hinv · 2023-01-12 01:58

@rogloh said:

In some cases the real-world latency may in fact be better with HyperRAM because the 16 bit PSRAM architecture on P2-EDGE is rather problematic for writes and the driver needs to do read-modify-writes in cases where individual bytes or words are written, which gets really slow. However the 8 bit HyperRAM is going to be slower to transfer than 16 bit PSRAM unless you try sysclk/1 operation which gets flaky timing-wise over some frequencies. When latency dominates and small transfers are done the reduced HyperRAM bandwidth isn't that significant.

Did directly attached HyperRAM, as in not going through connectors, ever get tested for top performance?

hinv · 2023-01-12 02:02

Kudos! @SaucySoliton
I really like Linux, and with libraries and better drivers, it could be quite a nice platform to port to.
Yeah, it comes with some bloat, but it sure is nice to be able to run your favorite open source tools.

Electrodude · 2023-01-12 04:03

@hinv said:
Kudos! @SaucySoliton
I really like Linux, and with libraries and better drivers, it could be quite a nice platform to port to.
Yeah, it comes with some bloat, but it sure is nice to be able to run your favorite open source tools.

What Linux distro are you using? Better drivers? Bloat? Huh?

Except for well-known GPU and WiFi driver problems, which can generally be blamed on uncooperative manufacturers, or unless you're using some special piece of overpriced equipment whose developers couldn't afford to write Linux drivers and whose lawyers were too afraid to allow the release of enough info for someone else to do so, I find Linux to have better driver support than Windows (I never have to bother with installing drivers - it all just works right out of the box!) and practically no bloat, while on Windows it has to install a driver for every little thing and there's nothing but bloat.

I use Linux every day both at home and at work and rarely run into true driver problems, of the kind where either nobody wrote a Linux driver for the piece of hardware in question or the only one available is too buggy to use. Linux's driver situation isn't quite the same as it was 10-15 years ago. On the other hand, Windows's driver situation has only gotten worse, with all the draconian driver signing rules in place nowadays.

It's pretty amazing that @SaucySoliton has gotten Linux running on the P2, but I'm quite certain that no other mainstream OS will ever run on the P2 - they're all orders of magnitude more bloated. I hope with some caching this can be made to be of practical use.

rogloh · 2023-01-12 05:15

@hinv said:
Did directly attached HyperRAM, as in not going through connectors, ever get tested for top performance?

Unfortunately I don't have a setup with directly attached HyperRAM to test. But @evanh and I did test sysclk/1 performance using the connectors with the Parallax HyperRAM breakout board and test results are available if you want to wade through the long HyperRAM thread. From memory it typically topped out somewhere around 300-320MHz or so depending on the settings to use registered clocks and I/O. But there were some gaps in the frequency range where sampling happened at transitions and data errors occurred. A better setup may be able to more finely tweak clock and data signal separation and not combine HyperRAM and HyperFlash on the same bus but the V1 RAM itself was being significantly overclocked from 133MHz with DDR. V2 HyperRAM promised better performance up to 200MHz but I think it was still typically rated for that speed at 1.8V (maybe one manufacturer did have 3V 200MHz or 166MHz rated operation, can't recall offhand)

evanh · 2023-01-12 07:07

There is only one existing Prop2 module containing such RAM suitable for high speed sysclock/1 performance - https://forums.parallax.com/discussion/comment/1544576/#Comment_1544576

SaucySoliton · 2023-03-23 03:47

An interesting comparison happened. https://hackaday.com/2023/03/19/rp2040-runs-linux-through-risc-v-emulation/ Also uses cnlohr's emulator.

RP2040 at 375MHz using SD card as RAM. 230kB cache. 10-15 minute boot time. Hangs after entering username.

P2 at 250MHz with 32MB PSRAM. 5 minute boot time. Shell is usable.

Something seems off here. The RP2040 completes most operations in 1 clock cycle, for 375 MIPS. The P2 takes 2 clock cycles, so 125 MIPS. It must be the SD card slowing down memory operations. I expect that making an effective RAM cache controller is hard.

rogloh · 2023-03-23 04:24

Yes, SD as RAM should be quite a bit slower than PSRAM, although you'd expect a good cache controller could potentially help it out a bit. The P2 at 250MHz can transfer the equivalent of a 512 byte sector from PSRAM in ~2us, excluding initial latency which, if optimised for this purpose, could probably be made something in the order of 0.5us. It would be very difficult for any normal SD card to achieve this degree of performance, especially when randomly accessing the disk, which would be hamstringing the RP2040. You'd really need one of those fancy new UHS-II or UHS-III cards to compete.
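
The ~2us figure can be reproduced with a quick calculation. This is a sketch under stated assumptions: the 16 bit bus width and 250MHz clock are from this thread, but the sysclk/2 transfer rate (clk_divider = 2) is an assumption, and the function name is made up for illustration.

```c
#include <assert.h>

/* Time in microseconds to stream `bytes` over a parallel bus.
 * bus_bits:    data bus width in bits
 * sysclk_mhz:  system clock in MHz
 * clk_divider: system clocks per bus transfer (2 means sysclk/2) */
double burst_time_us(unsigned bytes, unsigned bus_bits,
                     double sysclk_mhz, unsigned clk_divider)
{
    assert(bus_bits >= 8 && bus_bits % 8 == 0);
    assert(clk_divider > 0 && sysclk_mhz > 0.0);
    double transfers = (double)bytes / (bus_bits / 8);
    return transfers * clk_divider / sysclk_mhz;
}
```

For 512 bytes on a 16 bit bus at 250MHz and sysclk/2, that is 256 transfers taking 512 clocks, or about 2.05us. Adding the ~0.5us of initial latency mentioned above gives roughly 2.5us per sector-sized read.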

I also wonder if it would be more competitive if you could read smaller blocks from the SD card for each cache row, because 512 bytes seems like a lot of data to read or write for every read miss or write-back given the relatively slow SD transfer rate. What sized PSRAM transfers did you use in your Linux on the P2 project @SaucySoliton ?

SaucySoliton · 2023-03-23 06:43

@rogloh said:

What sized PSRAM transfers did you use in your Linux on the P2 project @SaucySoliton ?

Data access was actual size, so 1/2/4 bytes. Instruction access was 64 bytes. But of course that would only be done once for every 16 instructions.
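
As a rough illustration of the "once for every 16 instructions" point, here is a sketch of a direct-mapped instruction cache with 64-byte lines. It is not the code from rv32demo2.c: the number of lines, the struct layout and the names are all hypothetical, and `mem` is a plain array standing in for PSRAM.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64u   /* instruction fetch size quoted above */
#define NUM_LINES  8u    /* hypothetical number of cache lines */

typedef struct {
    uint32_t tag[NUM_LINES];
    uint8_t  valid[NUM_LINES];
    uint8_t  data[NUM_LINES][LINE_BYTES];
    unsigned fills;      /* counts 64-byte line transfers */
} icache;

/* Fetch one 32-bit instruction at pc through a direct-mapped cache.
 * mem stands in for external RAM; a miss copies one whole line. */
uint32_t icache_fetch(icache *c, const uint8_t *mem, uint32_t pc)
{
    assert(pc % 4 == 0);
    uint32_t line_addr = pc / LINE_BYTES;
    uint32_t index     = line_addr % NUM_LINES;
    if (!c->valid[index] || c->tag[index] != line_addr) {
        memcpy(c->data[index], mem + line_addr * LINE_BYTES, LINE_BYTES);
        c->tag[index]   = line_addr;
        c->valid[index] = 1;
        c->fills++;
    }
    uint32_t insn;
    memcpy(&insn, &c->data[index][pc % LINE_BYTES], sizeof insn);
    return insn;
}
```

With the cache zero-initialised, fetching 32 sequential instructions (128 bytes of straight-line code) causes only 2 line fills, which is where the one-transfer-per-16-instructions ratio comes from. Branchy code will do worse.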
