Do you have any figures on how much smaller the compressed code becomes?
The compressed code is about 25% smaller than uncompressed (so we save about 64K on a 256K binary). RISC-V instructions are already denser than P2 instructions, because the instruction set uses 3 operands and has very compiler-friendly instructions like "beq r1, r2, label", which compares two registers and jumps to the label if they are equal.
Does this RISC-V JIT pathway allow you to create a listing file that is P2-Assembler, with original C (or C++) source lines as comments?
Could you also single step in a hypothetical debugger, with reference back to the starting source?
No, definitely not. The final RISC-V to P2 compilation is happening on the P2 itself at run time, so none of the C source is available.
Understood, but that 'final RISC-V to P2 compilation' is using explicit rules, which the PC-Side could apply just like P2 does, in order to create the P2-asm.
To get commented listing, it would need to start with the RISC-V C-listing, and replace the RISC-V opcodes with P2-jit ones.
How useful that would be, I guess depends on how close the RISC-V & P2 code blocks are.
Except that the compilation really is "just in time", only code that is actually executed gets compiled. So for example in:
printf("continue? ");
x = getchar();
if (x == 'y') {
    continue_operation();
} else {
    cancel_operation();
}
which arm of the "if" statement gets compiled on the P2 depends on what the user responded. A listing file wouldn't be practical, because where the code ends up in the cache memory depends on the history and the environment (including user input).
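The compile-on-first-execution behaviour described above can be sketched in plain C. This is a toy model for illustration only, not the actual riscvp2 runtime: `execute_block`, the block addresses, and the direct-mapped cache layout are all made up.

```c
#define CACHE_SLOTS 16

/* Toy model of a JIT translation cache: each source block is identified by
   its RISC-V address and is only "compiled" the first time it executes. */
typedef struct { unsigned src_pc; int valid; } CacheEntry;

static CacheEntry cache[CACHE_SLOTS];
static int compile_count;               /* how many blocks got translated */

/* Look the block up in a tiny direct-mapped cache; translate on a miss. */
static void execute_block(unsigned src_pc) {
    CacheEntry *e = &cache[(src_pc >> 4) % CACHE_SLOTS];
    if (!e->valid || e->src_pc != src_pc) {
        e->src_pc = src_pc;             /* miss: compile this block now */
        e->valid = 1;
        compile_count++;
    }
}

/* The printf/getchar example above, reduced to block addresses. */
static int run_scenario(int user_said_yes) {
    for (int i = 0; i < CACHE_SLOTS; i++) cache[i].valid = 0;
    compile_count = 0;
    execute_block(0x1000);              /* the if-test itself */
    if (user_said_yes)
        execute_block(0x1010);          /* continue_operation() arm */
    else
        execute_block(0x1020);          /* cancel_operation() arm */
    execute_block(0x1000);              /* re-running it hits the cache */
    return compile_count;               /* only 2 of the 3 blocks compiled */
}
```

Whichever value `user_said_yes` takes, only two of the three blocks ever get translated, and where they land depends on execution history, which is why a static listing can't be produced.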
So I was able to get out of Makefile hell a couple of days back and build a minimal Micropython with some amount of debug effort with PropellerGCC and Dave Hein's handy p2gcc translation tools. I have been able to run some of it from the REPL loop interactively however I notice a couple of crashes which I am still debugging. I have a hunch that there are still some remaining issues with the s2pasm and/or p2link tools and memory alignment for some symbols. I have found with a different link order different parts of the Micropython environment can lockup. It could potentially be module code size dependent too because adding debug code can cause it to stop failing as well (a Heisenbug), though each code change could also be affecting overall alignment. Further detective work is required and I'm still tinkering... ozpropdev showed me his PASM2 debugger and it looks like it could be something useful for this type of thing too.
I need to fix the serial port code so I can paste in test code faster without corruption at high speed due to the rudimentary polling serial receive code not keeping up with pasted input rate. Am quickly getting tired of typing in test code manually. I will also look into other terminal programs that might delay per character instead of using the loadp2 terminal.
Don't have any take on its overall performance level as yet but there's some reasonable sign of life at least.
loadp2 -t build/python.bin
( Entering terminal mode at 115200 bps. Press Ctrl-] to exit. )
MicroPython v1.11-105-gef00048fe-dirty on 2019-07-10; P2-EVAL with propeller2-cpu
Type "help()" for more information.
>>> a = 1
>>> print (a)
1
>>> print (a+10)
11
>>> print("Hello world")
Hello world
>>> print(a ^ 2)
3
>>> help()
Welcome to MicroPython!
For online docs please visit http://docs.micropython.org/
Control commands:
CTRL-A -- on a blank line, enter raw REPL mode
CTRL-B -- on a blank line, enter normal REPL mode
CTRL-C -- interrupt a running program
CTRL-D -- on a blank line, exit or do a soft reset
CTRL-E -- on a blank line, enter paste mode
For further help on a specific object, type help(obj)
>>> dir()
['__name__', 'a']
>>> print(bin(a))
0b1
>>> def foo():
... print("Hello")
...
>>> foo()
Hello
>>> for i in range(3,-1,-1):
... print(i)
...
3
2
1
0
>>>
Can you just slow down the serial baud rate, as a quick-fix, to pace the characters better?
Good idea, might give that a go.
Is the code size still ~176kB indicated earlier?
Basically yeah, it is 177kB (0x2c4b8) for a minimal build with rudimentary serial IO but nothing else. Though recently I have been extending features and increased the heap space to 20kB from its relatively tiny default of 2k (increasing my actual Python workspace to use for my tests so I don't run out of memory) so right now my overall P2 image has grown a bit. We shouldn't be worried about heap space in the build (it's statically allocated) because that is just the buffer space Python has to use for storing your run time Python code/data, so it's really the Micropython executable image code size that is the main item to consider. Whatever HUB RAM is left after that can be divided amongst other COGs & stacks and the Micropython heap.
I've just enabled these extra Micropython features below for some testing and found the P2 image size increased to ~195kB excluding any heap increase. As you add further Python features it will keep going higher.
#define MICROPY_DEBUG_PRINTERS (1)
#define MICROPY_PY_BUILTINS_BYTEARRAY (1)
#define MICROPY_PY_MATH (1)
#define MICROPY_PY_BUILTINS_HELP (1)
#define MICROPY_REPL_AUTO_INDENT (1)
#define MICROPY_PY_MATH_FACTORIAL (1)
Gave it a try down at 300baud and 160MHz P2 and it still corrupted characters during the paste which is weird. My Mac system must still be buffering somewhere and bursting in data faster than this. 300 baud is ridiculously slow for just executing the readline buffer fill code before even parsing anything in Micropython itself. Need to hook a logic probe on the Rx pin to see if data is bursting faster than expected.
EDIT: Added the scope and I see the problem right away, we are echoing characters back from the P2 at 300baud as well while receiving and can't execute the normal getch() during this time (single COG, not threaded). I need to look at improving serial.
Yeah @evanh I won't be wanting 300baud. It was only a quick experiment. I am currently bit banging though I noticed your code in another thread using smartpin modes for TX which may help. Ideally for a single COG code there would be a receive and perhaps TX buffer in the VM COG's LUT RAM and we could be using some type of interrupt driven receive routines to help avoid data loss. Though without flow control or other HW back pressure there is always some risk of loss.
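A buffered receive along those lines is a classic single-producer/single-consumer ring buffer: one side (an interrupt routine or a second COG) pushes bytes as they arrive, the main loop pops them at its leisure. A minimal, platform-neutral sketch; all names are illustrative, and a real P2 version would sit in LUT RAM and be fed from the smartpin:

```c
#include <stdbool.h>
#include <stdint.h>

#define RXBUF_SIZE 256              /* power of two, so wrap is a cheap mask */

static volatile uint8_t  rxbuf[RXBUF_SIZE];
static volatile uint32_t rx_head;   /* written only by producer (ISR/RX cog) */
static volatile uint32_t rx_tail;   /* written only by consumer (main loop)  */

/* Producer side: called when a byte arrives. Drops the byte when full --
   without flow control or other back pressure some loss risk remains. */
static bool rx_push(uint8_t b) {
    uint32_t next = (rx_head + 1) & (RXBUF_SIZE - 1);
    if (next == rx_tail)
        return false;               /* buffer full: byte lost */
    rxbuf[rx_head] = b;
    rx_head = next;
    return true;
}

/* Consumer side: returns -1 when no data is waiting. */
static int rx_pop(void) {
    if (rx_tail == rx_head)
        return -1;                  /* empty */
    uint8_t b = rxbuf[rx_tail];
    rx_tail = (rx_tail + 1) & (RXBUF_SIZE - 1);
    return b;
}
```

Because each index is written by only one side, no locking is needed; the interrupt or second COG calls `rx_push`, and `rx_pop` replaces the polling getch in the main loop.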
... My Mac system must still be buffering somewhere and bursting in data faster than this....
PC's can buffer, but they cannot deliver faster than the baud speed.
The buffers are why it's better to adjust baud down, than to try to insert delays, as the former is exact, but the PC delays are coarse and elastic.
Having said that, I was using PC delays to do starved TX burst tests, and I was impressed I could get to maybe 100us LSBs over USB-UARTs for inter-packet pauses.
Maybe this ported Python does not yet use smart pins? They would naturally give 1 char each way buffering.
Currently in FullDuplexSerial2.spin, but can be extracted quite easily, was a separate file while in development.
But with two serial you could use one for debug...
Mike
Currently, the only reason I use receive routine is for key presses. So, very low actual rx data volume. The one character smartpin buffer is fine for this.
Mike's approach of throwing a whole cog at it is probably the way to go for larger buffering.
Here's my rx routine:
getch
        testp   #rx_pin wz          'byte received? (IN high == yes)
 if_z   rdpin   pb, #rx_pin         'get data from Z buffer
 if_z   shr     pb, #32-8           'shift the data to bottom of register
 if_z   ret     wcz                 'restore C/Z flags of calling routine
        jmp     #getch              'wait while Smartpin buffer is empty
I am looking at adding some smartpin TX and perhaps then RX as well to see if it helps. I'm currently deciphering the P2 documentation to see what is the correct order/values for smart pin initialization of pin 62 (TX) in async serial mode, then plan on playing with your/Chip's code for Smartpin based TX and patching into putch() in the p2gcc library code. This might help me in the short term. A second COG is a little more work to connect up but could be better for dealing with the buffering issue. Though I think a single COG variant for Micropython would be nice and compact too.
        wrpin   #%00_11111_0, #rx_pin   'Asynchronous serial receive
        wxpin   ##ASYNCFG, #rx_pin      'set baud rate and framing
        dirh    #rx_pin

        wrpin   #%01_11110_0, #tx_pin   'Asynchronous serial transmit mode
        wxpin   ##ASYNCFG, #tx_pin      'set X with baud rate and framing
        dirh    #tx_pin
'       wypin   #0, #tx_pin             'trigger first tx ready state (not needed if dual checked)
'single check is buffer full only, dual checking adds in tx flag
Note the last instruction is commented out entirely. If you are happy to emit one null during init, by uncommenting it, then the RQPIN in tx routine can be removed instead.
Thanks @evanh, this stuff would be useful. I had just hacked this test code in already for putch.c but it doesn't deal with fractional components yet so your code is nicer. It worked at 115200bps using Smartpin transmit for me so I'm happy.
int p2bitcycles;

void putch(int val)
{
    /* val arrives in r0 (p2gcc calling convention) */
    __asm__(" rdlong r3, ##_p2bitcycles");  // clocks per bit (sysclock/baud)
    //__asm__(" mov r3, ##160000000/115200");
    __asm__(" or r0, #$100");               // add stop bit above the 8 data bits
    __asm__(" shl r0, #1");                 // make room for the start bit (0)
    __asm__(" mov r1, #10");                // 10 bits total: start + 8 data + stop
    __asm__(" getct r2");
    __asm__("loop");
    __asm__(" shr r0, #1 wc");              // next bit into C, LSB first
    __asm__(" drvc #62");                   // drive TX pin 62 from C
    __asm__(" addct1 r2, r3");              // wait one bit period
    __asm__(" waitct1");
    __asm__(" djnz r1, #loop");
}
void initputch(void)
{
    __asm__(" flth #62");
    __asm__(" wrpin ##%0000_0000_000_0000000000000_01_11110_0, #62 'set smartpin async tx serial");
    __asm__(" rdlong r0, ##_p2bitcycles");
    __asm__(" shl r0, #16");                // clocks per bit goes in X[31:16]
    __asm__(" or r0, #7 ' 8 data bits");
    __asm__(" wxpin r0, #62");
    __asm__(" drvh #62");
}

void smartputch(int val)
{
    __asm__("polltx");
    __asm__(" rqpin inb, #62 wc 'transmitting? (C high == yes) Needed to initiate tx");
    __asm__(" testp #62 wz 'buffer free? (IN high == yes)");
    __asm__("if_nc_or_z wypin r0, #62 'write new byte to Y buffer");  // pin 62, as in the rest of this code
    __asm__("if_nc_or_z jmp lr 'return");
    __asm__(" jmp #polltx 'wait while Smartpin is both full (nz) and transmitting (c)");
}
Update: Using this at 4800bps and 160MHz keeps up with character rate during paste, but still drops some characters at the end of the line while it is processing the full line, so I still need some end of line transmit pacing during paste. I guess I need a better serial terminal app, though I found that loadp2 drops the DTR line and resets the prop after download before I can get it into minicom. Argh.
Adding HW receive could buy more time, and you could remove wait-on-send from exit delay, and instead add wait-on-sent to Tx entry. (to avoid Tx overrun)
That allows more processor time elsewhere.
Well, definitely tidied it up making it a subroutine. It is somewhat customised for my needs but can still be used as a template. "clk_freq", "clk_mode" and "asyn_baud" are the cogRAM mapped versions of all three of the agreed shared convention variables, located at hub addresses $14, $18, and $1c respectively. "xtalmul" represents a dynamic direct replacement of XMUL. I made it the only dynamic component of the sysclock as an input parameter just to make the whole job easier.
'===============================================
'Set new sysclock frequency
'and also adjusts the diag comport to suit
' Note: Uses the CORDIC. This means that any already running operations will be lost
' input: xtalmul, asyn_baud
' result: clk_mode, clk_freq
'scratch: pa, pb, c_flag
setclkfrq
'recalculate sysclock hertz (assumes xtalmul unit is 1 MHz)
mov clk_freq, xtalmul 'range 1-1024
mul clk_freq, #50
mul clk_freq, ##20_000 'clk_freq useable as one second pause in a WAITX
'start cordic calculation for new baud divider of diag comport
qdiv clk_freq, asyn_baud 'cordic runs in parallel, 55 sysclocks
'make sure not transmitting on comport before changing clock rate
.txwait rqpin inb, #TXPIN wc 'transmitting? (C high == yes)
if_c jmp #.txwait
'sysclock frequency adjustment (assumes xtalmul unit is 1 MHz)
hubset clk_mode '**IMPORTANT** Switches to RCFAST using known prior mode
mov clk_mode, xtalmul 'replace old with new ...
sub clk_mode, #1 'range 1-1024
shl clk_mode, #8
or clk_mode, ##(1<<24 + (XDIV-1)<<18 + XPPPP<<4 + XOSC<<2)
hubset clk_mode 'setup PLL mode for new frequency (still operating at RCFAST)
waitx ##22_000_000/100 '~10ms (at RCFAST) for PLL to stabilise
mov pa, clk_mode
or pa, #XSEL 'add PLL as the clock source select
hubset pa 'engage! Switch back to newly set PLL
'finish up the baud divider calculation and set smartpins
getqx pa 'collect result of cordic divide operation
getqy pb '32.32 (64-bit) is way cool
rolword pa, pb, #1 'shift 16.16 into PA
sets pa, #7 'comport 8N1 framing (bottom 10 bits should be replaced but 9's enough)
wxpin pa, #RXPIN 'set rx baud and framing
_ret_ wxpin pa, #TXPIN 'set tx baud and framing
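The CORDIC divide plus ROLWORD above is aiming at a 16.16 fixed-point clocks-per-bit value with the 8N1 framing stuffed into the low bits. The value it targets can be computed directly in C; this is my reading of the routine (the "replace bottom 9 bits" follows the SETS and its comment), and `async_x_frac` is an illustrative name:

```c
#include <stdint.h>

/* 16.16 fixed-point clocks-per-bit, with the bottom 9 bits replaced by the
   framing value (#7 = 8 data bits - 1), as in:
       qdiv clk_freq, asyn_baud / getqx+getqy / rolword / sets pa, #7 */
static uint32_t async_x_frac(uint32_t clkfreq, uint32_t baud) {
    uint32_t x = (uint32_t)(((uint64_t)clkfreq << 16) / baud);  /* 16.16 divide */
    return (x & ~0x1FFu) | 7;       /* SETS replaces the bottom 9 bits */
}
```

At 160 MHz and 115200 bps the integer part is still 1388 clocks per bit, but the fractional bits between bit 9 and bit 16 are preserved, which is what lets the smartpin hold accurate baud across odd clock/baud ratios.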
PS: I've also just gone and renamed rx_pin to RXPIN and tx_pin to TXPIN.
@rogloh: why are you re-inventing the wheel here? I mean sure, try it with a different compiler, but why not at least start with my existing micropython port to the P2 (https://github.com/totalspectrum/micropython). Trust me, just because I happened to compile it with a RISC-V compiler doesn't give it cooties.
I solved the serial buffer problem by running serial code in another COG (see BufferSerial.spin2, translated by spin2cpp into BufferSerial.c). There's code for VGA and USB support (also running in separate COGs). There are some P2 support objects (modpyb.c) which shouldn't be too hard to port to p2gcc. In fact I plan to convert them to use the standard propeller2.h header file that was discussed elsewhere in the forum.
Just for comparison, a minimal build with the latest RISC-V to P2 cross-compiler creates a firmware.bin file that's 119876 bytes. That includes 32K of interpreter + cache. It runs up to pyexec_friendly_repl, but hangs somewhere in there and I don't really want to debug it since I do have a fully working version of micropython with all the bells and whistles.
OK, I fibbed: I did spend a few minutes debugging minimal micropython, and it turned out it was just a matter of needing to define MICROPY_MIN_USE_STDOUT=1 (the interpreter was running, just not producing any output).
Since this thread is about "Can't wait for PropGCC for P2" and since it is a working version of GCC for P2, I'll add a pitch here for https://github.com/totalspectrum/riscvp2, which is a description of how to get a RISC-V toolchain and convert it to a P2 toolchain. I think the process is straightforward, and the resulting compiler is quite complete: building a hello world app is as simple as one compile command followed by an objcopy to a binary.
It would be easy to eliminate the final "objcopy" step if we were to modify loadp2 to understand ELF files.
There's still a lot of room for improvement in the documentation, and I'm working on that, but it really needs help from others as well -- so far I haven't had any feedback at all, so I don't even know if anyone is using it. But I think it's the most complete c99 solution for the P2 right now, with a complete standard library (Note: Catalina is pretty complete as well, but is a c89 rather than c99 compiler -- for many users that won't matter).
Performance is very good, in fact it handily beats every other C compiler for the P2 on the Dhrystone 2.1 benchmark:
riscvp2: 46598 dhrystones/sec
p2gcc: 19436 dhrystones/sec (actual result was 9718 dhrystones/sec at 80 MHz clock)
catalina: 11171 dhrystones/sec
fastspin: 10330 dhrystones/sec
The steps I needed to take to build a minimal micropython for P2 were:
(1) Install a RISC-V toolchain and then convert it to a P2 toolchain using the directions at https://github.com/totalspectrum/riscvp2.
(2) Modify the Makefile slightly to build for riscv-none-embed instead of arm-none-eabi. The modified Makefile is at https://github.com/totalspectrum/micropython/ports/minimal/Makefile.riscvp2. The differences really are pretty small, as you can see by diffing against the original Makefile.
(3) "make -f Makefile.riscvp2"
The firmware.bin, i.e. what actually runs on the P2, is 119876 bytes.
I think my problem here is really marketing: because it has "riscv" in the name people just dismiss it as "only an emulator". If I took the GNU MCU toolchain and went through and renamed "riscv-none-embed-gcc" to "p2-none-embed-gcc", and advertised it as "CMM for P2", I'm guessing people would accept it more. Because that's essentially what it is: a sort of compressed memory model using standard opcodes (rv32imc) instead of the custom opcodes that PropGCC used for its CMM.
This subtlety dawned on me only very recently - in the last few days - and it has certainly made me much more interested. I would like to, at some point, see if I can get PropWare (with its full CMake build system and C++ classes) working on top of your RISCV work.
It's odd that people would reject a C compiler because it is "only an emulator" when they seem to demand an interpreted version of Spin that isn't even JIT compiled. So native compiled versions of Spin are not desirable and a RISC-V JIT implementation of C is "only an emulator" and hence is also not acceptable?
I do have a getch routine that runs in a separate cog. This allows for receiving characters at full speed. I'll try to find that code when I have a chance.
It's a high-level marketing problem.
The techno-babble implementation details or names don't really play into the silence surrounding it.
ersmith, what you've created with RISC-V JIT is so brilliant it's going to take us mere mortals years to catch up. People stick with what they're familiar with. It's such a large unknown. But what it may or may not be lacking is a friendly way to get started and probe and poke at it. I don't know, it's on my list.
I can read C but personally don't really use C, nor do I have some massive mountain of C source code I'm just itching to port over to the P2. For more publicity, I was thinking of submitting the compiler & JIT system to the riscv.org website once the EVAL boards are ready and shipping.
As for the fastspin compiler suite, I'm 100% on board and assure you I haven't even touched PNut once. Fastspin is the best way to experience programming a P2! Life's just getting in the way of me even powering up my P2-EVAL right now. And I also have that fear of breaking the sample chip, with any replacements a few months out, so I've been treating it with a very delicate touch.
I think the reasoning here is that Spin2 is specifically optimized for P2's strengths (XBYTE and friends) and just trades speed for smaller code size, whereas recompiling RISC-V instructions is considered as a sub-optimal use of resources. Although as said above, the code is more compact due to the 3-operand architecture and the compressed opcodes.
I think there's likely a good bit of performance to be gained from a P2-specific compressed format or just native Hubexec.
Then again, that would take a large development effort, whereas good quality RISC-V backends for GCC and LLVM/clang are already available. Also, C itself isn't really equipped to make good use of Hubexec code anyway: there are literally 450-ish registers (have you ever used that many local variables in C? Although a good optimizer can inline and bend the calling convention to eliminate unnecessary spilling. Or there could be an __attribute__ to move thread-local variables into cogram.) AND the LUT (which could be used for fast local int[] and some kind of FCACHE, I guess? Or to store some floating point routines?). Well, this turned into some nice rambling, hasn't it?
Perhaps you missed the benchmark I posted above, but the RISC-V toolchain compiled P2 binary runs the Dhrystone benchmark more than twice as fast as all of the P2 "native" compilers, including p2gcc, Catalina, and fastspin, all of which use hubexec (for that matter riscvp2 uses hubexec too, once the RISC-V opcodes are translated).
But Dhrystone is an artificial benchmark, so let's try something else, like maybe @heater 's fftbench:
command lines used:
fastspin -2 -O2 -o fastspin.bin fft_bench.c
catalina -lci -p2 -O3 -C P2_EVAL -C NATIVE -D PROPELLER fft_bench.c
p2gcc -D _cnt=getcnt fft_bench.c
riscv-none-embed-gcc -T riscvp2_lut.ld -specs=nano.specs -Os -o a.elf fft_bench.c
+ riscv-none-objcopy -O binary a.elf riscvp2.bin
results: time / size of binary loaded to P2
riscvp2: 25245 us 20968 bytes
fastspin: 39835 us 16384 bytes
catalina: 55876 us 27808 bytes
p2gcc: 64765 us 21204 bytes
All of the binaries are more or less the same size (I think the major difference is in the libraries that they use, catalina and riscvp2 being the most complete, fastspin the least complete). Once again riscvp2 is substantially faster than the other choices.
fastspin already essentially uses an unlimited number of registers within a function, and with -O2 it uses LUT as an FCACHE. I wrote it, so I like it... but it's still slower than riscvp2.
(Oh, and the riscvp2_lut.ld option to the riscv compiler makes it use the LUT as FCACHE as well.)
(1) Install a RISC-V toolchain and then convert it to a P2 toolchain using the directions at https://github.com/totalspectrum/riscvp2.
(2) Modify the Makefile slightly to build for riscv-none-embed instead of arm-none-eabi. The modified Makefile is at https://github.com/totalspectrum/micropython/ports/minimal/Makefile.riscvp2. The differences really are pretty small, as you can see by diffing against the original Makefile.
(3) "make -f Makefile.riscvp2"
The firmware.bin, i.e. what actually runs on the P2, is 119876 bytes.
I think my problem here is really marketing: because it has "riscv" in the name people just dismiss it as "only an emulator". If I took the GNU MCU toolchain and went through and renamed "riscv-none-embed-gcc" to "p2-none-embed-gcc", and advertised it as "CMM for P2", I'm guessing people would accept it more. Because that's essentially what it is: a sort of compressed memory model using standard opcodes (rv32imc) instead of the custom opcodes that PropGCC used for its CMM.
This subtlety dawned on me only very recently - in the last few days - and it has certainly made me much more interested. I would like to, at some point, see if I can get PropWare (with its full CMake build system and C++ classes) working on top of your RISC-V work.
It's a high-level marketing problem.
The techno-babble implementation details or names don't really play into the silence surrounding it.
ersmith, what you've created with RISC-V JIT is so brilliant it's going to take us mere mortals years to catch up. People stick with what they're familiar with. It's such a large unknown. But what it may or may not be lacking is a friendly way to get started and probe and poke at it. I don't know, it's on my list.
I can read C but personally don't really use C, nor do I have this massive mountain of C source code I'm just itching to port over to the P2. For more publicity, I was thinking of submitting the compiler & JIT system to the riscv.org website when the EVAL boards are ready and shipping?
As for the fastspin compiler suite, I'm 100% on board and assure you I haven't even touched PNut once. Fastspin is the best way to experience programming a P2!!! Life's just getting in the way of me even powering up my P2-EVAL right now. And I also have that fear of breaking the sample chip, with any replacements a few months out, so I've been treating it with a very delicate touch.
I think the reasoning here is that Spin2 is specifically optimized for P2's strengths (XBYTE and friends) and just trades speed for smaller code size, whereas recompiling RISC-V instructions is considered a sub-optimal use of resources. Although as said above, the code is more compact due to the 3-operand architecture and the compressed opcodes.
I think there's likely a good bit of performance to be gained from a P2-specific compressed format or just native Hubexec.
Then again, that would take a large development effort, whereas good quality RISC-V backends for GCC and LLVM/clang are already available. Also, C itself isn't really equipped to make good use of Hubexec code anyway: there are literally 450-ish registers (have you ever used that many local variables in C? Although a good optimizer can inline and bend the calling convention to eliminate unnecessary spilling. Or there could be an __attribute__ to move thread-local variables into cogram.) AND the LUT (which could be used for fast local int[] and some kind of FCACHE, I guess? Or to store some floating point routines?)
Well, this turned into some nice rambling, didn't it?
But Dhrystone is an artificial benchmark, so let's try something else, like maybe @heater's fftbench.
All of the binaries are more or less the same size (I think the major difference is in the libraries that they use, catalina and riscvp2 being the most complete, fastspin the least complete). Once again riscvp2 is substantially faster than the other choices.
fastspin already essentially uses an unlimited number of registers within a function, and with -O2 it uses LUT as an FCACHE. I wrote it, so I like it... but it's still slower than riscvp2.
(Oh, and the riscvp2_lut.ld option to the riscv compiler makes it use the LUT as FCACHE as well.)
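Concretely, switching linker scripts should be the only change needed to enable that behavior. A sketch, assuming the default script is named riscvp2.ld as in the riscvp2 repository (the source file name is just an example):

```shell
# default linker script: JIT runs without the LUT cache
riscv-none-embed-gcc -O2 -T riscvp2.ld -o fftbench.elf fftbench.c

# alternative linker script: the JIT uses LUT RAM as an FCACHE,
# so hot loops get recompiled into fast LUT-resident code
riscv-none-embed-gcc -O2 -T riscvp2_lut.ld -o fftbench.elf fftbench.c
```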