riscvp2: a C and C++ compiler for P2

Rayman · 2024-05-09 21:59

Ok, that didn't work out... P2 instantly reboots when hello.elf is loaded... Hmm...

ersmith · 2024-05-09 22:14

@Rayman said:
@ersmith Been wondering about something... In your Github readme files you seem to be doing things as if logged into root. I seem to have to preface a lot of things with "sudo". Do you really run as logged into root all the time? Or, is there something I'm missing?

No, I never log in as root. I just make sure that my regular user owns any files (like the /opt/riscv directory) that I need to modify.

evanh · 2024-05-09 23:09

I've always wondered what is common practice with editing files in non-/home directories too. Looks like the answer is "it depends".

Rayman · 2024-05-10 13:31

Ok, figured out the clock. I was blundering badly, but more importantly I see that you need to manually clear the lower two bits of the clock mode. The code first does a hubset with the mode that way (this is without the PLL), then adds 3 to it and does a hubset again to turn on the PLL. Think it's good now.

_DEFAULT_CLOCK_MODE=$14d28fb & $FFFF_FFFC   'From Hub Address $18 (using FlexProp!) With lower two bits off!
_DEFAULT_CYCLES_PER_SEC=297_000_000  'From Hub Address $14 (using FlexProp!)
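
The masking step can be checked with a few lines of C. This is only a sketch of the arithmetic described in this post, not the actual startup code; the helper names are made up for illustration, and the constant is the one quoted from FlexProp.

```c
#include <stdint.h>

/* Clock mode value read from hub address $18 (quoted in the post). */
#define CLOCK_MODE_FROM_FLEXPROP 0x014d28fbu

/* Hypothetical helper: clear the two low bits, giving the value
   used for the first hubset (PLL not yet selected). */
static uint32_t clock_mode_base(uint32_t mode)
{
    return mode & 0xFFFFFFFCu;
}

/* Hypothetical helper: the value for the second hubset, base + 3. */
static uint32_t clock_mode_with_pll(uint32_t base)
{
    return base + 3u;
}
```

If the low bits are left set, adding 3 gives $14d28fe rather than the intended $14d28fb, which is why the mask matters.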

Rayman · 2024-05-11 15:34

@ersmith looks like spin2cpp doesn't know about getsec() either.
But, think can work around that...

Or, maybe not, can't do inline assembly either...

Rayman · 2024-05-16 21:37

@ersmith Just out of curiosity... Since this is actually GCC going on here, does that mean that the GDB debugger might work?
Guess that would mean disconnecting the serial output or opening up a different serial output?

Rayman · 2024-05-19 22:05

@ersmith Hard to imagine, but I think there's a bug in propeller2.h

//RJA thinks this is wrong: #define _waitus(u) _waitx((u)/(_clockfreq()/1000000))
#define _waitus(u) _waitx((u)*(_clockfreq()/1000000))

Also, added a _waitms() using same way.
Was trying to use csr for this but must be doing something wrong...

millis_write_csr
        ' wait pb milliseconds
        rdlong temp, #$14   ' get frequency
        qdiv   temp, ##1000
        getqx  temp     ' now have freq/1000 in temp
        'qmul   pb, temp
        'getqx  pb
        mul    pb, temp
        waitx  pb
        ret

ersmith · 2024-05-20 10:22

@Rayman said:
@ersmith Just out of curiosity... Since this is actually GCC going on here, does that mean that the GDB debugger might work?
Guess that would mean disconnecting the serial output or opening up a different serial output?

We'd have to implement a gdb debug stub on the P2 side. There probably is something already for RISC-V, but it may rely on instructions and/or hardware features that aren't implemented yet.

//RJA thinks this is wrong: #define _waitus(u) _waitx((u)/(_clockfreq()/1000000))
#define _waitus(u) _waitx((u)*(_clockfreq()/1000000))

Whoops, good catch! I'll check in the fix.
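
To see why the division was wrong, here is a small host-side sketch that computes the cycle count both ways. The function names are invented for illustration; only the two expressions come from the macros quoted in this thread.

```c
#include <stdint.h>

/* Old macro body: divides, so short waits truncate toward 0 cycles. */
static uint32_t waitus_cycles_old(uint32_t us, uint32_t clockfreq)
{
    return us / (clockfreq / 1000000u);
}

/* Fixed macro body: microseconds times cycles-per-microsecond. */
static uint32_t waitus_cycles_fixed(uint32_t us, uint32_t clockfreq)
{
    return us * (clockfreq / 1000000u);
}
```

At 252 MHz a 100 us wait comes out as 0 cycles with the old expression and 25200 cycles with the fixed one.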

Also, added a _waitms() using same way.
Was trying to use csr for this but must be doing something wrong...

Changing from qmul to mul probably won't work, because frequency/1000 will exceed 16 bits if you're running at more than 65 MHz.
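
A quick way to see the overflow is to model the two multiplies on a PC. This is a sketch that assumes mul uses only the low 16 bits of each operand and that qmul/getqx yields the low 32 bits of a full 32x32 product; the function names are hypothetical.

```c
#include <stdint.h>

/* Models a 16x16-bit unsigned multiply: the upper 16 bits of each
   operand are ignored. */
static uint32_t mul16(uint32_t d, uint32_t s)
{
    return (d & 0xFFFFu) * (s & 0xFFFFu);
}

/* Models a full 32x32 multiply, keeping the low 32 bits of the result. */
static uint32_t mul32(uint32_t d, uint32_t s)
{
    return d * s;
}
```

At 252 MHz, frequency/1000 is 252000, which does not fit in 16 bits, so a 10 ms wait computes 553920 cycles with the 16-bit model instead of 2520000. At 65 MHz (65000) both models agree.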

Rayman · 2024-05-21 22:22

The other issue with waitms is that it could break the uart if it uses rep, since that would stop the interrupt. That'd be a concern, right?

AJL · 2024-05-21 23:21

The concept might work if the rep wrapped an addct, pollse, conditional jump, waitct sequence.

That way the timing should remain constant and the uart driver will not be starved of interrupts for the full delay period.

ersmith · 2024-05-22 00:51

@Rayman said:
The other issue with waitms is that it could break the uart if it uses rep, since that would stop the interrupt. That'd be a concern, right?

I can't remember if rep in HUB stops interrupts, but it might be safer to use a djnz loop instead.
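
In C terms, the loop idea might look like the sketch below: a counted loop of 1 ms waits rather than one long wait, so no single wait holds things up for the whole delay. Everything here is an assumption for illustration. The wait_fn callback stands in for a short blocking wait such as _waitx(), and record_wait is a host-side stub so the logic can be checked off-target.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical callback type standing in for a short blocking wait. */
typedef void (*wait_fn)(uint32_t cycles);

/* Host-side stub that records what was requested (illustration only). */
static uint64_t total_cycles;
static uint32_t longest_wait;

static void record_wait(uint32_t cycles)
{
    total_cycles += cycles;
    if (cycles > longest_wait) {
        longest_wait = cycles;
    }
}

/* Waits ms milliseconds as a counted loop of 1 ms waits. */
static void wait_ms_chunked(uint32_t ms, uint32_t clockfreq, wait_fn wait)
{
    uint32_t cycles_per_ms = clockfreq / 1000u;

    assert(wait != 0);
    while (ms-- > 0) {
        wait(cycles_per_ms);   /* each individual wait lasts 1 ms */
    }
}
```

Loop overhead makes the total slightly longer than requested; a waitct-style scheme like the one AJL describes would avoid that drift.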

BTW I've updated the riscvp2 repo a bit to reflect the new xpack 13.2.0-2 compiler. I've also made it easier to change clock frequency: just edit the _clkfreq definition at the top of riscvptrace_p2.spin2. The default is now 252 MHz, which should be nicer for HDMI work (the default 160 MHz dated back to when we were first getting boards and the official Parallax line was that 180 MHz was the maximum frequency).

Rayman · 2024-05-22 01:11

Thanks @ersmith sounds great!

Rayman · 2024-05-22 02:06

I suppose in the context of micropython around 250 MHz is good for future hdmi output. Glad easier to change though…

Rayman · 2024-05-22 18:32

@ersmith Think asked this before, but had to pick a folder for libgcc.a...
Does this look right?
Seems to work...

RV32_GCCLIB=/opt/riscv/lib/gcc/riscv-none-elf/13.2.0/rv32ima/ilp32/libgcc.a

ersmith · 2024-05-22 19:11

@Rayman said:
@ersmith Think asked this before, but had to pick a folder for libgcc.a...
Does this look right?
Seems to work...

RV32_GCCLIB=/opt/riscv/lib/gcc/riscv-none-elf/13.2.0/rv32ima/ilp32/libgcc.a

Do you really need to explicitly specify libgcc.a? I think the compiler should find it for you. But yeah, that one or rv32imc should be fine -- the latter ought to be a little bit smaller.

Rayman · 2024-05-22 19:34

the "c" is for compressed, right? Does that mean smaller but slower because needs to be decompressed?

ersmith · 2024-05-22 20:25

@Rayman said:
the "c" is for compressed, right? Does that mean smaller but slower because needs to be decompressed?

Smaller but not really any slower (it means some instructions are encoded in 16 bits instead of 32 bits). I guess decoding the 16 bit instructions is slightly more complicated and so may be slightly slower, but they all get compiled to the same P2 code.
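
For reference, telling the two encodings apart is cheap. In the standard RISC-V encoding, a 16-bit parcel whose two low bits are both 1 starts a 32-bit (or longer) instruction, and anything else is a 16-bit compressed instruction. The sketch below is illustrative only, is not taken from the riscvp2 source, and ignores the longer-than-32-bit encodings.

```c
#include <stdint.h>

/* Returns the instruction length in bytes, given the first 16-bit
   parcel. Encodings longer than 32 bits are not handled here. */
static unsigned rv_insn_length(uint16_t first_parcel)
{
    return ((first_parcel & 0x3u) == 0x3u) ? 4u : 2u;
}
```

For example, 0x0001 (c.nop) is a 2-byte instruction, while 0x0013 (the low half of the 32-bit nop, addi x0, x0, 0) marks a 4-byte one.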

ersmith · 2024-06-05 12:15

I've been working on changes to allow running RISC-V code from flash (or potentially other external memory, in the future). The flash code is checked in. To compile a program to run from flash use -T riscvp2_flash.ld instead of -T riscvp2.ld. Caveat emptor, I've just got this working so there are almost certainly bugs lurking. It's also not terribly performant yet.

You'll need to use the latest loadp2 from my github repository, with the -HIMEM=flash flag, to actually program the code into flash, and you'll have to use the .elf file to do so (generating a .binary from the elf will probably fail and won't do what you want even if it doesn't). The code to be written to flash is stored in the ELF with load addresses above 0x80000000 (i.e. with the high bit set). At run time the JIT compiler checks for this and reads the RISC-V code from flash if the high bit of the address is set. There's a really stupid single entry cache for this that needs improving.
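
The high-bit check and the single-entry cache can be sketched roughly as follows. This is not the actual riscvp2 code: the 256-byte line size, the idea that the flash offset is the address with the high bit cleared, and all of the names (including the fake_flash array that stands in for real flash on a PC) are assumptions for illustration.

```c
#include <stdint.h>
#include <string.h>

#define HIMEM_BIT 0x80000000u
#define LINE_SIZE 256u               /* assumed cache line size */

static uint8_t  fake_flash[1024];    /* host-side stand-in for flash */
static unsigned flash_reads;         /* counts slow line fills */

/* Hypothetical backing read: copies one whole line out of "flash". */
static void fake_flash_read(uint32_t offset, uint8_t *dst)
{
    memcpy(dst, &fake_flash[offset], LINE_SIZE);
    flash_reads++;
}

/* Addresses with the high bit set live in external memory. */
static int is_himem(uint32_t addr)
{
    return (addr & HIMEM_BIT) != 0;
}

/* Single-entry cache: one line of data plus the offset it came from. */
static uint8_t  line_data[LINE_SIZE];
static uint32_t line_tag;
static int      line_valid;

static uint8_t himem_read_byte(uint32_t addr)
{
    uint32_t offset = addr & ~HIMEM_BIT;          /* assumed mapping */
    uint32_t tag    = offset & ~(LINE_SIZE - 1u); /* start of the line */

    if (!line_valid || line_tag != tag) {         /* miss: refill */
        fake_flash_read(tag, line_data);
        line_tag   = tag;
        line_valid = 1;
    }
    return line_data[offset - tag];
}
```

With only one entry, code that alternates between two lines refills on every access, which is presumably why this cache needs improving.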

One fairly annoying restriction is that only 512K of code may be run from the high memory. That's because the cache tags only have 20 bits for the address, so the entire code address space is only 1M. For now that's split 512K HUB + 512K external. We could probably force all the code to go into external memory and thus allow the whole 1M of space to be there. Expanding the space is a big pain, because it requires re-working the whole cache architecture.
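
The arithmetic behind the limit, written out as C constants. The macro names are made up; only the 20-bit figure and the even split come from the post.

```c
#include <stdint.h>

#define TAG_ADDR_BITS    20u                           /* from the post */
#define CODE_SPACE_BYTES (1u << TAG_ADDR_BITS)         /* 1M total */
#define HUB_CODE_BYTES   (CODE_SPACE_BYTES / 2u)       /* 512K in HUB */
#define EXT_CODE_BYTES   (CODE_SPACE_BYTES - HUB_CODE_BYTES)

/* Hypothetical helper: does a code offset fit in the tag's address field? */
static int fits_in_tag(uint32_t code_offset)
{
    return code_offset < CODE_SPACE_BYTES;
}
```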

I've tried to make everything modular so that at least in principle we could replace the flash with PSRAM or other external memory. loadp2 runs the external memory code in another COG with a mailbox, so it might not be too bad to update that, but the RISC-V JIT needs a very simple read-only version of the code that runs in the same COG. I don't understand how PSRAM works well enough to simplify the existing drivers.

Rayman · 2024-06-05 12:33

Very Interesting... So, we could presumably do micropython like this and have more free hub RAM?
Have to imagine it slows things down a lot when in flash though, right?
I think some chips use the XIP features of some flash to cycle through a short span of memory quickly. Would that be applicable here?
Would there be a way to flag some lesser used objects to be stored in flash?

Wuerfel_21 · 2024-06-05 13:16

@ersmith said:
I've tried to make everything modular so that at least in principle we could replace the flash with PSRAM or other external memory. loadp2 runs the external memory code in another COG with a mailbox, so it might not be too bad to update that, but the RISC-V JIT needs a very simple read-only version of the code that runs in the same COG. I don't understand how PSRAM works well enough to simplify the existing drivers.

Check the PSRAM code in my emulators, it's read-only and as simple as it really can be. Though depending on what you're doing, dropping a request into a mailbox can be worth the overhead. Doing an arbitrary "read N bytes from address X" like what roger's driver offers is actually very complicated due to alignment (EC32 PSRAM only operates on units of 32 bit) and row boundaries. Meanwhile if you're reading some fixed 2^n sized blocks from self-aligned addresses, that's very simple.
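
The "fixed 2^n sized blocks from self-aligned addresses" case reduces to two mask operations, sketched here. This is not code from any of the drivers mentioned; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static int is_pow2(uint32_t n)
{
    return n != 0 && (n & (n - 1u)) == 0;
}

/* Start address of the self-aligned block containing addr. */
static uint32_t block_base(uint32_t addr, uint32_t block_size)
{
    assert(is_pow2(block_size));
    return addr & ~(block_size - 1u);
}

/* Position of addr inside its block. */
static uint32_t block_offset(uint32_t addr, uint32_t block_size)
{
    assert(is_pow2(block_size));
    return addr & (block_size - 1u);
}
```

Because every request starts at block_base() and has a fixed length that divides the row size, a read can never straddle a row boundary or start at an unaligned address.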

ersmith · 2024-06-05 14:52

@Rayman said:
Very Interesting... So, we could presumably do micropython like this and have more free hub RAM?

Yes, that was one of the major use cases I had in mind

Have to imagine it slows things down a lot when in flash though, right?

It can, although it depends on exactly what's running. Remember the usual riscvp2 code path is:

(1) JIT compiler translates RISC-V instructions from HUB to HUB/LUT
(2) JIT compiler calls the translated instructions, which return when a branch is encountered
(3) rinse & repeat

The new mode changes step (1) so that the JIT compiler optionally reads the RISC-V instructions from flash. That part is slow, but once the instructions are translated they still run at full speed. So if the translated code fits into the HUB cache of P2 code then it will run at full speed (except for the latency caused by translating).
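
The three steps, and the point that a hot loop is only translated once, can be modelled with a toy like the one below. It is purely illustrative and not how the real JIT is structured: the cache size, the direct-mapped lookup, and the toy guest program (blocks that each branch to the next and wrap at 16) are all made up.

```c
#include <stdint.h>

#define CACHE_SLOTS 4u

typedef struct {
    int      valid;
    uint32_t guest_pc;   /* RISC-V address this entry was translated from */
    uint32_t next_pc;    /* where the block's closing branch goes */
} TranslatedBlock;

static TranslatedBlock cache[CACHE_SLOTS];
static unsigned translations;        /* counts the slow step (1) */

/* Toy guest program: each block is 4 bytes and branches to the next
   one, wrapping around at address 16. */
static uint32_t toy_branch_target(uint32_t pc)
{
    return (pc + 4u) % 16u;
}

/* Step (1): translate on a cache miss, reuse the entry on a hit. */
static TranslatedBlock *lookup_or_translate(uint32_t pc)
{
    TranslatedBlock *b = &cache[(pc / 4u) % CACHE_SLOTS];

    if (!b->valid || b->guest_pc != pc) {
        b->valid    = 1;
        b->guest_pc = pc;
        b->next_pc  = toy_branch_target(pc);
        translations++;
    }
    return b;
}

/* Steps (2) and (3): "run" a block up to its branch, then repeat. */
static uint32_t run_blocks(uint32_t pc, unsigned count)
{
    while (count-- > 0) {
        TranslatedBlock *b = lookup_or_translate(pc);
        pc = b->next_pc;
    }
    return pc;
}
```

Running 100 blocks of this four-block loop does only four translations, because every later pass hits the cache. Where the RISC-V source bytes came from (HUB or flash) only affects the cost of those four misses.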

I think some chips use the XIP features of some flash to cycle through a short span of memory quickly. Would that be applicable here?

Maybe? We're doing something kind of similar (reading a burst of data to fill a cache). OTOH I don't know how much difference that makes in practice, the flash should already be returning the data as fast as it can. I think the advantage of XIP is that you can get it to return the middle of the cache line (where the next instruction is) first, and then wrap around to give the rest of the cache. Possibly we can work out some way to do this, although it would mean moving the read code to another COG.
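
The wrap-around order described here (wanted word first, then the rest of the line) is easy to write down. This sketch only shows the index order and assumes nothing about how any particular flash chip implements it; the function name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Fills order[0..line_size-1] with the positions within a cache line,
   starting at the position that is needed first and wrapping around. */
static void wrapped_burst_order(uint32_t start, uint32_t line_size,
                                uint32_t *order)
{
    assert(line_size != 0);
    for (uint32_t i = 0; i < line_size; i++) {
        order[i] = (start + i) % line_size;
    }
}
```

For an 8-entry line with the next instruction at position 5, the order is 5, 6, 7, 0, 1, 2, 3, 4.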

Would there be a way to flag some lesser used objects to be stored in flash?

Yes, although it might be even better to just increase the P2 HUB cache size so more of the translated routines can be kept. It all depends on how big the "hot" code path is.

@Wuerfel_21 said:
Check the PSRAM code in my emulators, it's read-only and as simple as it really can be. Though depending on what you're doing, dropping a request into a mailbox can be worth the overhead. Doing an arbitrary "read N bytes from address X" like what roger's driver offers is actually very complicated due to alignment (EC32 PSRAM only operates on units of 32 bit) and row boundaries. Meanwhile if you're reading some fixed 2^n sized blocks from self-aligned addresses, that's very simple.

Thanks! As you've guessed, it's just reading cache lines so nicely aligned and predictable sizes.

Rayman · 2024-06-05 15:15

Increasing size of HUB cache size sounds like the right thing to do...

rogloh · 2024-06-05 15:22

This sounds promising, Eric. It will be good to see how fast we can make it run from PSRAM or SPI flash or even HyperRAM/HyperFlash. Pity about the 1M limit. Maybe one day some alternative caching scheme may work out... it would be nice to open things up to something like the 8MB-32MB available on the P2-EC32MB board, or even the 16MB SPI flash typically available on many boards.

Sharing PSRAM between multiple COGs will require something like my or Chip's driver - perhaps simplified down to do smaller sized/aligned accesses.

If you code to my mailbox format, you'll be able to get it working on HyperRam/HyperFlash/PSRAM and even other SRAM chips. One day I plan to make it read SPI Flash as well (probably in dual-spi mode for speed) but haven't got to that point yet. It should also work with a video driver accessing the external memory at the same time, with some performance reduction from bandwidth sharing.

Wuerfel_21 · 2024-06-05 15:45

@rogloh said:
Sharing PSRAM between multiple COGs will require something like my or Chip's driver - perhaps simplified down to do smaller sized/aligned accesses.

Using a lock also just works. Though with async data pins it can make the timing a bit flakey.

Rayman · 2024-06-05 16:21

I think Chip's PSRAM code is fairly straightforward.

rogloh · 2024-06-05 16:55

It is straightforward, but as it stands it doesn't have a mechanism to ensure priority/QoS, so video accesses for larger scan lines could be held off too long in some cases when it is heavily loaded by multiple COGs. If code accesses by one other COG are kept small, a single video COG will likely still be okay in general. Also, I'm not sure if it fragments larger transfers that cross page boundaries, which can be a problem for video (plus it's 32 bit transfers only).
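
For anyone unsure what fragmenting at page boundaries involves, here is a minimal sketch. It is not taken from either driver; the Fragment type and function name are made up, and the page size is a parameter because the real value depends on the memory part.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t addr;
    uint32_t len;
} Fragment;

/* Splits [addr, addr + len) so that no fragment crosses a page
   boundary. Returns the number of fragments written to out[], which
   is capped at max_frags. page_size must be a power of two. */
static unsigned split_at_pages(uint32_t addr, uint32_t len,
                               uint32_t page_size,
                               Fragment *out, unsigned max_frags)
{
    unsigned n = 0;

    assert(page_size != 0 && (page_size & (page_size - 1u)) == 0);
    while (len > 0 && n < max_frags) {
        uint32_t room  = page_size - (addr & (page_size - 1u));
        uint32_t chunk = (len < room) ? len : room;

        out[n].addr = addr;
        out[n].len  = chunk;
        n++;
        addr += chunk;
        len  -= chunk;
    }
    return n;
}
```

A 100-byte transfer starting at address 1000 with 1024-byte pages becomes two fragments: 24 bytes at 1000 and 76 bytes at 1024.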

ersmith · 2024-06-05 18:24

@rogloh said:
It is straightforward, but as it stands it doesn't have a mechanism to ensure priority/QoS, so video accesses for larger scan lines could be held off too long in some cases when it is heavily loaded by multiple COGs. If code accesses by one other COG are kept small, a single video COG will likely still be okay in general. Also, I'm not sure if it fragments larger transfers that cross page boundaries, which can be a problem for video (plus it's 32 bit transfers only).

Sure, but for my purposes (just running code from psram) all I need are "read a block of data" and "write a block of data" routines (and the write is only needed by loadp2, not by the riscvp2 runtime). I can live with any reasonable restrictions on alignment and block transfer size, although if the block size is > 1K that will make loadp2 unhappy -- we can work around this though if necessary. With flash I'm transferring flash pages, 256 bytes at a time on 256 byte boundaries.

Rayman · 2024-06-05 20:57

@ersmith When you say "latest loadp2", I guess you mean download the latest source and then cross-compile for Windows somehow?
Or, is there still that thing around where Github automatically compiles every day?

ersmith · 2024-06-05 21:19

@Rayman said:
@ersmith When you say "latest loadp2", I guess you mean download the latest source and then cross-compile for Windows somehow?
Or, is there still that thing around where Github automatically compiles every day?

I meant to build from source. Ada hasn't done her magic to make Github automatically compile loadp2, just spin2cpp.

Here's a zip file with the latest loadp2:

Rayman · 2024-06-05 22:14

Thanks @ersmith
Can you pls remind me how to increase hub cache size?
