
Can't Wait for PropGCC on the P2?


Comments

  • Thanks for explaining the inner workings, Eric.

    Do you have any figures on how much smaller the compressed code becomes?
  • Tubular wrote: »
    Thanks for explaining the inner workings, Eric.

    Do you have any figures on how much smaller the compressed code becomes?

    The compressed code is about 25% smaller than uncompressed (so we save about 64K on a 256K binary). The RISC-V instructions are already denser than P2 instructions, because the instruction set uses 3 operands and has very compiler-friendly instructions like "beq r1, r2, label", which compares two registers and jumps to the label if they are equal.
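    As a quick illustration (this example is not from the post above): RISC-V's compare-and-branch means a C equality test can compile down to a single branch opcode, where a two-operand instruction set typically needs a compare followed by a conditional jump.

    /* Illustrative sketch only.  On rv32 the test below can become e.g.
     *     beq  a0, a1, .L1
     * while a two-operand ISA usually emits something like
     *     cmp  r0, r1
     *     jz   .L1
     */
    int pick(int a, int b, int x, int y)
    {
        if (a == b)        /* compare two registers and branch in one instruction */
            return x;
        return y;
    }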
  • jmg
    ersmith wrote: »
    Tubular wrote: »
    Thanks for explaining the inner workings, Eric.

    Do you have any figures on how much smaller the compressed code becomes?

    The compressed code is about 25% smaller than uncompressed (so we save about 64K on a 256K binary). The RISC-V instructions are already denser than P2 instructions, because the instruction set uses 3 operands and has very compiler-friendly instructions like "beq r1, r2, label", which compares two registers and jumps to the label if they are equal.

    Does this RISC-V - JIT pathway allow you to create a listing file that is P2-Assembler, with original C (or C++) source lines as comments ?
    Could you also single step in a hypothetical debugger, with reference back to the starting source ?
  • jmg wrote: »
    Does this RISC-V - JIT pathway allow you to create a listing file that is P2-Assembler, with original C (or C++) source lines as comments ?

    No, definitely not. The final RISC-V to P2 compilation is happening on the P2 itself at run time, so none of the C source is available.
  • jmg
    ersmith wrote: »
    jmg wrote: »
    Does this RISC-V - JIT pathway allow you to create a listing file that is P2-Assembler, with original C (or C++) source lines as comments ?

    No, definitely not. The final RISC-V to P2 compilation is happening on the P2 itself at run time, so none of the C source is available.

    Understood, but that 'final RISC-V to P2 compilation' is using explicit rules, which the PC-Side could apply just like P2 does, in order to create the P2-asm.
    To get a commented listing, it would need to start with the RISC-V C listing and replace the RISC-V opcodes with the P2 JIT ones.
    How useful that would be, I guess, depends on how close the RISC-V & P2 code blocks are.
  • jmg wrote: »
    ersmith wrote: »
    jmg wrote: »
    Does this RISC-V - JIT pathway allow you to create a listing file that is P2-Assembler, with original C (or C++) source lines as comments ?

    No, definitely not. The final RISC-V to P2 compilation is happening on the P2 itself at run time, so none of the C source is available.

    Understood, but that 'final RISC-V to P2 compilation' is using explicit rules, which the PC-Side could apply just like P2 does, in order to create the P2-asm.
    Except that the compilation really is "just in time", only code that is actually executed gets compiled. So for example in:
       printf("continue? ");
       x = getchar();
       if (x == 'y') {
          continue_operation();
       } else {
          cancel_operation();
       }
    
    which arm of the "if" statement gets compiled on the P2 depends on what the user responded. A listing file wouldn't be practical, because where the code ends up in the cache memory depends on the history and the environment (including user input).
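    To picture why the cache contents depend on history, here is a minimal, purely hypothetical sketch of a direct-mapped translation cache in C. None of these names come from the actual RISC-V JIT; translate_block and CACHE_LINES are invented for the example.

    #include <stdint.h>
    #include <stddef.h>

    #define CACHE_LINES 64                  /* hypothetical number of cache slots */

    typedef struct {
        uint32_t riscv_pc;                  /* source address this slot currently holds */
        void   (*p2_code)(void);            /* entry point of the translated P2 code */
    } cache_line_t;

    static cache_line_t cache[CACHE_LINES];

    /* invented helper: compiles RISC-V code starting at pc, returns its P2 entry point */
    extern void (*translate_block(uint32_t pc))(void);

    /* Execute the block at 'pc', translating it only the first time it is reached.
     * Which slots get filled, and what they later get overwritten with, depends
     * entirely on the order in which blocks are executed, so no static listing is possible. */
    static void run_block(uint32_t pc)
    {
        cache_line_t *slot = &cache[(pc >> 2) % CACHE_LINES];
        if (slot->p2_code == NULL || slot->riscv_pc != pc) {
            slot->riscv_pc = pc;
            slot->p2_code  = translate_block(pc);   /* the JIT step, at run time */
        }
        slot->p2_code();
    }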

  • So I was able to get out of Makefile hell a couple of days back and, with some amount of debug effort, build a minimal Micropython using PropellerGCC and Dave Hein's handy p2gcc translation tools. I have been able to run some of it interactively from the REPL loop; however, I notice a couple of crashes which I am still debugging. I have a hunch that there are still some remaining issues with the s2pasm and/or p2link tools and memory alignment for some symbols. I have found that with a different link order, different parts of the Micropython environment can lock up. It could potentially be module code size dependent too, because adding debug code can make it stop failing (a Heisenbug), though each code change could also be affecting overall alignment. Further detective work is required and I'm still tinkering... ozpropdev showed me his PASM2 debugger and it looks like it could be useful for this type of thing too.

    I need to fix the serial port code so I can paste in test code faster without corruption at high speed due to the rudimentary polling serial receive code not keeping up with pasted input rate. Am quickly getting tired of typing in test code manually. I will also look into other terminal programs that might delay per character instead of using the loadp2 terminal.

    Don't have any take on its overall performance level as yet but there's some reasonable sign of life at least.
    loadp2 -t build/python.bin
    ( Entering terminal mode at 115200 bps.  Press Ctrl-] to exit. )
    MicroPython v1.11-105-gef00048fe-dirty on 2019-07-10; P2-EVAL with propeller2-cpu
    Type "help()" for more information.
    >>> a = 1
    >>> print (a)
    1
    >>> print (a+10)
    11
    >>> print("Hello world")
    Hello world
    >>> print(a ^ 2)
    3
    >>> help()
    Welcome to MicroPython!
    
    For online docs please visit http://docs.micropython.org/
    
    Control commands:
      CTRL-A        -- on a blank line, enter raw REPL mode
      CTRL-B        -- on a blank line, enter normal REPL mode
      CTRL-C        -- interrupt a running program
      CTRL-D        -- on a blank line, exit or do a soft reset
      CTRL-E        -- on a blank line, enter paste mode
    
    For further help on a specific object, type help(obj)
    >>> dir()
    ['__name__', 'a']
    >>> print(bin(a))
    0b1
    >>> def foo():
    ...     print("Hello")
    ... 
    >>> foo()
    Hello
    >>> for i in range(3,-1,-1):
    ...     print(i)
    ... 
    3
    2
    1
    0
    >>>
    
  • jmg
    rogloh wrote: »
    I need to fix the serial port code so I can paste in test code faster without corruption at high speed due to the rudimentary polling serial receive code not keeping up with pasted input rate. Am quickly getting tired of typing in test code manually. I will also look into other terminal programs that might delay per character instead of using the loadp2 terminal.
    Can you just slow down the serial baud rate, as a quick-fix, to pace the characters better ?

    Is the code size still ~176kB indicated earlier ?

  • rogloh
    edited 2019-07-10 03:56
    jmg wrote: »
    Can you just slow down the serial baud rate, as a quick-fix, to pace the characters better ?
    Good idea, might give that a go.
    Is the code size still ~176kB indicated earlier ?

    Basically yeah, it is 177kB (0x2c4b8) for a minimal build with rudimentary serial IO but nothing else. Recently, though, I have been extending features and increased the heap space to 20kB from its relatively tiny default of 2k (increasing the actual Python workspace available for my tests so I don't run out of memory), so right now my overall P2 image has grown a bit. We shouldn't worry about the heap space in the build (it's statically allocated), because that is just the buffer space Python has for storing your run-time Python code/data; it's really the Micropython executable image code size that is the main item to consider. Whatever HUB RAM is left after that can be divided amongst other COGs & stacks and the Micropython heap.

    I've just enabled these extra Micropython features below for some testing and found the P2 image size increased to ~195kB excluding any heap increase. As you add further Python features it will keep going higher.

    #define MICROPY_DEBUG_PRINTERS (1)
    #define MICROPY_PY_BUILTINS_BYTEARRAY (1)
    #define MICROPY_PY_MATH (1)
    #define MICROPY_PY_BUILTINS_HELP (1)
    #define MICROPY_REPL_AUTO_INDENT (1)
    #define MICROPY_PY_MATH_FACTORIAL (1)
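    For reference, the "statically allocated" heap mentioned a couple of paragraphs up is normally just a fixed byte array that the port's startup code hands to MicroPython's garbage collector. A minimal sketch, assuming the generic minimal port's gc_init()/mp_init() calls and borrowing the 20kB figure purely as an example (this is not rogloh's actual code):

    #include "py/gc.h"
    #include "py/runtime.h"

    /* Example 20kB heap.  It lives in the executable image's BSS, so it is
     * claimed out of HUB RAM before any run-time allocation happens. */
    static char heap[20 * 1024];

    void start_micropython(void)        /* hypothetical wrapper function */
    {
        gc_init(heap, heap + sizeof(heap));   /* hand the buffer to the GC */
        mp_init();
        /* ... then run the REPL ... */
    }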
  • rogloh
    edited 2019-07-10 04:26
    jmg wrote: »
    Can you just slow down the serial baud rate, as a quick-fix, to pace the characters better ?
    Gave it a try down at 300 baud with the P2 at 160MHz and it still corrupted characters during the paste, which is weird. My Mac must still be buffering somewhere and bursting in data faster than that. 300 baud is ridiculously slow for just executing the readline buffer-fill code, before even parsing anything in Micropython itself. Need to hook a logic probe on the Rx pin to see if data is bursting faster than expected.

    EDIT: Added the scope and I see the problem right away: we are echoing characters back from the P2 at 300 baud as well while receiving, and we can't execute the normal getch() during this time (single COG, not threaded). I need to look at improving the serial code.
  • evanh
    300 bps is too slow for smartpin async serial. The clock divider will overflow. 9600 is probably the lowest safe bit rate.
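    A rough check of that limit, assuming the 16.6 fixed-point clocks-per-bit format used for the async smartpins further down in the thread and a 160MHz sysclock (the numbers are only illustrative):

    #include <stdio.h>

    /* The async-serial bit period has a 16-bit integer part, so anything over
     * 65535 sysclocks per bit cannot be represented by the smartpin divider. */
    int main(void)
    {
        const unsigned sysclk = 160000000;
        const unsigned bauds[] = { 300, 2400, 4800, 9600, 115200 };

        for (unsigned i = 0; i < sizeof bauds / sizeof bauds[0]; i++) {
            unsigned clocks_per_bit = sysclk / bauds[i];
            printf("%6u baud -> %6u clocks/bit %s\n", bauds[i], clocks_per_bit,
                   clocks_per_bit > 65535 ? "(divider overflows)" : "(fits)");
        }
        return 0;   /* e.g. 300 baud -> 533333 clocks/bit: overflow */
    }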

  • rogloh
    edited 2019-07-10 04:41
    Yeah @evanh, I won't be wanting 300 baud. It was only a quick experiment. I am currently bit banging, though I noticed your code in another thread using smartpin modes for TX, which may help. Ideally, for single-COG code, there would be a receive and perhaps a TX buffer in the VM COG's LUT RAM, and we could use some type of interrupt-driven receive routine to help avoid data loss. Though without flow control or other HW back pressure there is always some risk of loss.
  • jmg
    rogloh wrote: »
    ... My Mac system must still be buffering somewhere and bursting in data faster than this....
    PCs can buffer, but they cannot deliver faster than the baud rate.
    The buffers are why it's better to adjust the baud rate down than to try to insert delays: the former is exact, while the PC delays are coarse and elastic.
    Having said that, I was using PC delays to do starved TX burst tests, and I was impressed that I could get down to maybe 100us LSBs over USB-UARTs for inter-packet pauses.
    evanh wrote: »
    300 bps is too slow for smartpin async serial. The clock divider will overflow. 9600 is probably the lowest safe bit rate.
    Maybe this ported Python does not yet use smart pins? They would naturally give 1 char of buffering each way.

  • Maybe this ported Python does not yet use smart pins? They would naturally give 1 char of buffering each way.
    Correct. As I mentioned above, it is bit-bang only and there is no time to simultaneously receive during transmit. It needs a better serial implementation.
  • I have some PASM code for a full-duplex serial driver supporting 2 ports, with all 4 buffers in LUT. It needs some longs in HUB as a mailbox and its own COG.

    Currently it lives in FullDuplexSerial2.spin, but it can be extracted quite easily; it was a separate file while in development.

    But with two serial ports you could use one for debug...

    Mike
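    Not Mike's actual interface, but a hypothetical sketch of what a HUB-RAM mailbox for a two-port, LUT-buffered serial COG could look like from the C side (every name here is invented for the illustration):

    #include <stdint.h>

    /* Hypothetical per-port mailbox longs shared between the serial COG and the
     * client COG.  The serial COG polls tx_char; the client polls rx_ready. */
    typedef struct {
        volatile uint32_t tx_char;    /* client writes char | 0x100, COG clears when sent */
        volatile uint32_t rx_char;    /* COG stores the most recent received char here */
        volatile uint32_t rx_ready;   /* COG sets nonzero when rx_char is valid */
        volatile uint32_t baud;       /* requested bit rate for this port */
    } serial_mailbox_t;

    /* Two ports, as in Mike's driver; placed in HUB RAM so both COGs can see them. */
    static serial_mailbox_t serial_mbox[2];

    static int mbox_getch(int port)          /* non-blocking receive, -1 if empty */
    {
        serial_mailbox_t *m = &serial_mbox[port];
        if (!m->rx_ready)
            return -1;
        int c = (int)m->rx_char;
        m->rx_ready = 0;                     /* acknowledge so the COG can refill */
        return c;
    }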
  • evanh
    edited 2019-07-10 05:17
    rogloh wrote: »
    I am currently bit banging though I noticed your code in another thread using smartpin modes for TX which may help. Ideally for a single COG code there would be a receive and perhaps TX buffer in the VM COG's LUT RAM and we could be using some type of interrupt driven receive routines to help avoid data loss. Though without flow control or other HW back pressure there is always some risk of loss.
    Currently, the only reason I use receive routine is for key presses. So, very low actual rx data volume. The one character smartpin buffer is fine for this.

    Mike's approach of throwing a whole cog at it is probably the way to go for larger buffering.

    Here's my rx routine:
    getch
    		testp	#rx_pin		wz	'byte received? (IN high == yes)
    	if_z	rdpin	pb, #rx_pin		'get data from Z buffer
    	if_z	shr	pb, #32-8		'shift the data to bottom of register
    	if_z	ret			wcz	'restore C/Z flags of calling routine
    		jmp	#getch			'wait while Smartpin buffer is empty
    
  • I am looking at adding some smartpin TX and perhaps then RX as well to see if it helps. I'm currently deciphering the P2 documentation to see what is the correct order/values for smart pin initialization of pin 62 (TX) in async serial mode, then plan on playing with your/Chip's code for Smartpin based TX and patching into putch() in the p2gcc library code. This might help me in the short term. A second COG is a little more work to connect up but could be better for dealing with the buffering issue. Though I think a single COG variant for Micropython would be nice and compact too.
  • evanh
    edited 2019-07-10 05:45
    Cool. There is also the init of the smartpins when using them. Here's my relevant pieces:

    CON
    	CLOCKFREQ	= round(float(XTALFREQ) / float(XDIV) * float(XMUL) / float(XDIVP))
    	ASYNCFG		= round(float(CLOCKFREQ) * 64.0 / float(BAUDRATE))<<10 + 7	'bitrate format is 16.6<<10, 8N1 framing
    

    DAT
    		wrpin	#%00_11111_0, #rx_pin		'Asynchronous serial receive
    		wxpin	##ASYNCFG, #rx_pin		'set baudrate and framing
    		dirh	#rx_pin
    		wrpin	#%01_11110_0, #tx_pin		'Asynchronous serial transmit mode
    		wxpin	##ASYNCFG, #tx_pin		'set X with baudrate and framing
    		dirh	#tx_pin
    '		wypin	#0, #tx_pin			'trigger first tx ready state (not needed if dual checked)
    							'single check is buffer full only, dual checking adds in tx flag
    
    Note the last instruction is commented out entirely. If you are happy to emit one null during init, by uncommenting it, then the RQPIN in tx routine can be removed instead.

  • rogloh
    edited 2019-07-10 06:24
    Thanks @evanh, this stuff would be useful. I had already hacked in this test code for putch.c, but it doesn't deal with fractional components yet, so your code is nicer. It worked at 115200bps using Smartpin transmit for me, so I'm happy.
    int p2bitcycles;
    
    void putch(int val)
    {
        __asm__("        rdlong  r3, ##_p2bitcycles");
        //__asm__("        mov     r3, ##160000000/115200");
        __asm__("        or      r0, #$100");
        __asm__("        shl     r0, #1");
        __asm__("        mov     r1, #10");
        __asm__("        getct   r2");
        __asm__("loop");
        __asm__("        shr     r0, #1 wc");
        __asm__("        drvc    #62");
        __asm__("        addct1  r2, r3");
        __asm__("        waitct1");
        __asm__("        djnz    r1, #loop");
    }
    
    void initputch(void)
    {
        __asm__("        flth   #62");
        __asm__("        wrpin  ##%0000_0000_000_0000000000000_01_11110_0, #62 'set smartpin async tx serial");
        __asm__("        rdlong r0, ##_p2bitcycles");
        __asm__("        shl    r0, #16");
        __asm__("        or     r0, #7 ' 8 data bits");
        __asm__("        wxpin  r0, #62");
        __asm__("        drvh   #62");
    }
    
    void smartputch(int val)
    {
        __asm__("polltx");
        __asm__("            rqpin   inb, #62   wc 'transmitting? (C high == yes)  Needed to initiate tx");
        __asm__("            testp   #62        wz 'buffer free? (IN high == yes)");
        __asm__("if_nc_or_z  wypin   r0, #tx_pin   'write new byte to Y buffer");
        __asm__("if_nc_or_z  jmp lr                'return");
        __asm__("            jmp #polltx           'wait while Smartpin is both full (nz) and transmitting (c)");
    }
    

    Update: Using this at 4800bps and 160MHz keeps up with the character rate during paste, but it still drops some characters at the end of the line while the full line is being processed, so I still need some end-of-line transmit pacing during paste. I guess I need a better serial terminal app, though I found that loadp2 drops the DTR line and resets the prop after download, before I can get it into minicom. Argh.
  • jmg
    rogloh wrote: »
    Update: Using this at 4800bps and 160MHz keeps up with character rate during paste, but still drops some characters at the end of the line while it is processing the full line, so I still need some end of line transmit pacing during paste..

    Adding HW receive could buy more time, and you could remove the wait-on-send from the exit path and instead add a wait-until-sent check at TX entry (to avoid TX overrun).
    That allows more processor time elsewhere.

  • evanh
    edited 2019-07-10 10:41
    Well, I definitely tidied it up by making it a subroutine. It is somewhat customised for my needs but can still be used as a template. "clk_freq", "clk_mode" and "asyn_baud" are the cogRAM-mapped versions of the three agreed shared-convention variables, located at hub addresses $14, $18, and $1c respectively. "xtalmul" is a dynamic direct replacement of XMUL; I made it the only dynamic component of the sysclock, as an input parameter, just to make the whole job easier.
    '===============================================
    'Set new sysclock frequency
    'and also adjusts the diag comport to suit
    '     Note:  Uses the CORDIC.  This means that any already running operations will be lost
    '  input:  xtalmul, asyn_baud
    ' result:  clk_mode, clk_freq
    'scratch:  pa, pb, c_flag
    
    setclkfrq
    
    'recalculate sysclock hertz (assumes xtalmul unit is 1 MHz)
    		mov	clk_freq, xtalmul		'range 1-1024
    		mul	clk_freq, #50
    		mul	clk_freq, ##20_000		'clk_freq useable as one second pause in a WAITX
    
    'start cordic calculation for new baud divider of diag comport
    		qdiv	clk_freq, asyn_baud		'cordic runs in parallel, 55 sysclocks
    
    'make sure not transmitting on comport before changing clock rate
    .txwait		rqpin	inb, #TXPIN	wc		'transmitting? (C high == yes)
    	if_c	jmp	#.txwait
    
    'sysclock frequency adjustment (assumes xtalmul unit is 1 MHz)
    		hubset	clk_mode			'**IMPORTANT**  Switches to RCFAST using known prior mode
    
    		mov	clk_mode, xtalmul		'replace old with new ...
    		sub	clk_mode, #1			'range 1-1024
    		shl	clk_mode, #8
    		or	clk_mode, ##(1<<24 + (XDIV-1)<<18 + XPPPP<<4 + XOSC<<2)
    		hubset	clk_mode			'setup PLL mode for new frequency (still operating at RCFAST)
    
    		waitx	##22_000_000/100		'~10ms (at RCFAST) for PLL to stabilise
    		mov	pa, clk_mode
    		or	pa, #XSEL			'add PLL as the clock source select
    		hubset	pa				'engage!  Switch back to newly set PLL
    
    'finish up the baud divider calculation and set smartpins
    		getqx	pa				'collect result of cordic divide operation
    		getqy	pb				'32.32 (64-bit) is way cool
    		rolword	pa, pb, #1			'shift 16.16 into PA
    		sets	pa, #7				'comport 8N1 framing (bottom 10 bits should be replaced but 9's enough)
    		wxpin	pa, #RXPIN			'set rx baud and framing
    	_ret_	wxpin	pa, #TXPIN			'set tx baud and framing
    
    

    PS: I've also just gone and renamed rx_pin to RXPIN and tx_pin to TXPIN.

  • rogloh wrote: »
    I need to fix the serial port code so I can paste in test code faster without corruption at high speed due to the rudimentary polling serial receive code not keeping up with pasted input rate. Am quickly getting tired of typing in test code manually. I will also look into other terminal programs that might delay per character instead of using the loadp2 terminal.
    @rogloh : why are you re-inventing the wheel here? I mean, sure, try it with a different compiler, but why not at least start with my existing micropython port to the P2 (https://github.com/totalspectrum/micropython)? Trust me, just because I happened to compile it with a RISC-V compiler doesn't give it cooties :).

    I solved the serial buffer problem by running serial code in another COG (see BufferSerial.spin2, translated by spin2cpp into BufferSerial.c). There's code for VGA and USB support (also running in separate COGs). There are some P2 support objects (modpyb.c) which shouldn't be too hard to port to p2gcc. In fact I plan to convert them to use the standard propeller2.h header file that was discussed elsewhere in the forum.

  • Just for comparison, a minimal build with the latest RISC-V to P2 cross-compiler creates a firmware.bin file that's 119876 bytes. That includes 32K of interpreter + cache. It runs up to pyexec_friendly_repl, but hangs somewhere in there and I don't really want to debug it since I do have a fully working version of micropython with all the bells and whistles.
  • OK, I fibbed: I did spend a few minutes debugging minimal micropython, and it turned out it was just a matter of needing to define MICROPY_MIN_USE_STDOUT=1 (the interpreter was running, just not producing any output).

    Since this thread is about "Can't wait for PropGCC for P2", and since this is a working version of GCC for P2, I'll add a pitch here for https://github.com/totalspectrum/riscvp2, which is a description of how to get a RISC-V toolchain and convert it to a P2 toolchain. I think the process is straightforward, and the resulting compiler is quite complete: building a hello world app is as simple as:
    riscv-none-embed-gcc -T riscvp2.ld -o hello.elf hello.c
    riscv-none-embed-objcopy -O binary hello.elf hello.bin
    
    It would be easy to eliminate the final "objcopy" step if we were to modify loadp2 to understand ELF files.

    There's still a lot of room for improvement in the documentation, and I'm working on that, but it really needs help from others as well -- so far I haven't had any feedback at all, so I don't even know if anyone is using it. But I think it's the most complete c99 solution for the P2 right now, with a complete standard library (Note: Catalina is pretty complete as well, but is a c89 rather than c99 compiler -- for many users that won't matter).

    Performance is very good, in fact it handily beats every other C compiler for the P2 on the Dhrystone 2.1 benchmark:
    riscvp2:  46598 dhrystones/sec
    p2gcc:    19436 dhrystones/sec (actual result was 9718 dhrystones/sec at 80 MHz clock)
    catalina: 11171 dhrystones/sec
    fastspin: 10330 dhrystones/sec
    

    The steps I needed to take to build a minimal micropython for P2 were:

    (1) Install a RISC-V toolchain and then convert it to a P2 toolchain using the directions at https://github.com/totalspectrum/riscvp2.

    (2) Modify the Makefile slightly to build for riscv-none-embed instead of arm-none-eabi. The modified Makefile is at https://github.com/totalspectrum/micropython/ports/minimal/Makefile.riscvp2. The differences really are pretty small, as you can see by diffing against the original Makefile.

    (3) "make -f Makefile.riscvp2"

    The firmware.bin, i.e. what actually runs on the P2, is 119876 bytes.

    I think my problem here is really marketing: because it has "riscv" in the name people just dismiss it as "only an emulator". If I took the GNU MCU toolchain and went through and renamed "riscv-none-embed-gcc" to "p2-none-embed-gcc", and advertised it as "CMM for P2", I'm guessing people would accept it more. Because that's essentially what it is: a sort of compressed memory model using standard opcodes (rv32imc) instead of the custom opcodes that PropGCC used for its CMM.
  • ersmith wrote: »
    I think my problem here is really marketing: because it has "riscv" in the name people just dismiss it as "only an emulator". If I took the GNU MCU toolchain and went through and renamed "riscv-none-embed-gcc" to "p2-none-embed-gcc", and advertised it as "CMM for P2", I'm guessing people would accept it more. Because that's essentially what it is: a sort of compressed memory model using standard opcodes (rv32imc) instead of the custom opcodes that PropGCC used for its CMM.

    This subtlety dawned on me only very recently - in the last few days - and it has certainly made me much more interested. I would like to, at some point, see if I can get PropWare (with its full CMake build system and C++ classes) working on top of your RISC-V work.
  • ersmith wrote: »
    I think my problem here is really marketing: because it has "riscv" in the name people just dismiss it as "only an emulator". If I took the GNU MCU toolchain and went through and renamed "riscv-none-embed-gcc" to "p2-none-embed-gcc", and advertised it as "CMM for P2", I'm guessing people would accept it more. Because that's essentially what it is: a sort of compressed memory model using standard opcodes (rv32imc) instead of the custom opcodes that PropGCC used for its CMM.
    It's odd that people would reject a C compiler because it is "only an emulator" when they seem to demand an interpreted version of Spin that isn't even JIT compiled. So native compiled versions of Spin are not desirable and a RISC-V JIT implementation of C is "only an emulator" and hence is also not acceptable?
  • I do have a getch routine that runs in a separate cog. This allows for receiving characters at full speed. I'll try to find that code when I have a chance.
  • David Betz wrote: »
    ersmith wrote: »
    I think my problem here is really marketing: because it has "riscv" in the name people just dismiss it as "only an emulator". If I took the GNU MCU toolchain and went through and renamed "riscv-none-embed-gcc" to "p2-none-embed-gcc", and advertised it as "CMM for P2", I'm guessing people would accept it more. Because that's essentially what it is: a sort of compressed memory model using standard opcodes (rv32imc) instead of the custom opcodes that PropGCC used for its CMM.
    It's odd that people would reject a C compiler because it is "only an emulator" when they seem to demand an interpreted version of Spin that isn't even JIT compiled. So native compiled versions of Spin are not desirable and a RISC-V JIT implementation of C is "only an emulator" and hence is also not acceptable?

    It's a high-level marketing problem.
    The techno-babble implementation details or names don't really play into the silence surrounding it.

    ersmith, what you've created with RISC-V JIT is so brilliant it's going to take us mere mortals years to catch up. People stick with what they're familiar with. It's such a large unknown. But what it may or may not be lacking is a friendly way to get started and probe and poke at it. I don't know, it's on my list.

    I can read C but personally don't really use C, nor do I have a massive mountain of C source code I'm just itching to port over to the P2. For more publicity, I was thinking of submitting the compiler & JIT system to the riscv.org website when the EVAL boards are ready and shipping?

    As for the fastspin compiler suite, I'm 100% on board and assure you I haven't even touched PNut once. Fastspin is the best way to experience programming a P2!!! Life's just getting in the way of me even powering up my P2-EVAL right now. And I also have that fear of breaking the sample chip, with any replacements a few months out, so I've been treating it with a very delicate touch.
  • David Betz wrote: »
    ersmith wrote: »
    I think my problem here is really marketing: because it has "riscv" in the name people just dismiss it as "only an emulator". If I took the GNU MCU toolchain and went through and renamed "riscv-none-embed-gcc" to "p2-none-embed-gcc", and advertised it as "CMM for P2", I'm guessing people would accept it more. Because that's essentially what it is: a sort of compressed memory model using standard opcodes (rv32imc) instead of the custom opcodes that PropGCC used for its CMM.
    It's odd that people would reject a C compiler because it is "only an emulator" when they seem to demand an interpreted version of Spin that isn't even JIT compiled. So native compiled versions of Spin are not desirable and a RISC-V JIT implementation of C is "only an emulator" and hence is also not acceptable?

    I think the reasoning here is that Spin2 is specifically optimized for the P2's strengths (XBYTE and friends) and just trades speed for smaller code size, whereas recompiling RISC-V instructions is considered a sub-optimal use of resources. Although, as said above, the code is more compact due to the 3-operand architecture and the compressed opcodes.
    I think there's likely a good bit of performance to be gained from a P2-specific compressed format or just native Hubexec.
    Then again, that would take a large development effort, whereas good quality RISC-V backends for GCC and LLVM/clang are already available. Also, C itself isn't really equipped to make good use of Hubexec code anyway: there are literally 450-ish registers (have you ever used that many local variables in C? Although a good optimizer can inline and bend the calling convention to eliminate unnecessary spilling, or there could be an __attribute__ to move thread-local variables into cogram), AND the LUT (which could be used for fast local int[] and some kind of FCACHE, I guess? Or to store some floating point routines?).
    Well, this has turned into some nice rambling, hasn't it?
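    Purely as an illustration of the attribute idea floated above (no current P2 toolchain implements this; the cogram and lutram attribute names are made up):

    /* Hypothetical syntax only.  The idea: keep hot, thread-local state in COG
     * RAM instead of spilling it to HUB, and pin a small array into LUT RAM. */
    __attribute__((cogram)) static int loop_counter;     /* would map to a cog register */
    __attribute__((lutram)) static int fir_taps[64];     /* fast local int[] in LUT */

    int accumulate(const int *samples, int n)
    {
        int sum = 0;                  /* an optimizer would keep this in a register */
        for (int i = 0; i < n; i++) {
            loop_counter++;
            sum += samples[i] * fir_taps[i & 63];
        }
        return sum;
    }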
  • Wuerfel_21 wrote: »
    I think there's likely a good bit of performance to be gained from a P2-specific compressed format or just native Hubexec.
    Perhaps you missed the benchmark I posted above, but the P2 binary compiled by the RISC-V toolchain runs the Dhrystone benchmark more than twice as fast as all of the P2 "native" compilers, including p2gcc, Catalina, and fastspin, all of which use hubexec (for that matter, riscvp2 uses hubexec too, once the RISC-V opcodes are translated).

    But Dhrystone is an artificial benchmark, so let's try something else, like maybe @heater 's fftbench:
    command lines used:
    
    fastspin -2 -O2 -o fastspin.bin fft_bench.c
    catalina -lci -p2 -O3 -C P2_EVAL -C NATIVE -D PROPELLER fft_bench.c
    p2gcc -D _cnt=getcnt fft_bench.c
    riscv-none-embed-gcc -T riscvp2_lut.ld -specs=nano.specs -Os -o a.elf fft_bench.c
      + riscv-none-embed-objcopy -O binary a.elf riscvp2.bin
    
    results: time / size of binary loaded to P2
    
    riscvp2:  25245 us  20968 bytes
    fastspin: 39835 us  16384 bytes
    catalina: 55876 us  27808 bytes
    p2gcc:    64765 us  21204 bytes
    

    All of the binaries are more or less the same size (I think the major difference is in the libraries that they use, catalina and riscvp2 being the most complete, fastspin the least complete). Once again riscvp2 is substantially faster than the other choices.
    Also, C itself isn't really equipped to make good use of Hubexec code, anyways. there's literally 450-ish registers (have you ever used that many local variables in C? Although a good optimizer can inline and bend the calling convention to eliminate unnecessary spilling. Or there could be an __attribute__ to move thread-local variables into cogram.) AND the LUT (which could be used for fast local int[] and some kind of FCACHE, I guess? Or to store some floating point routines?)
    fastspin already essentially uses an unlimited number of registers within a function, and with -O2 it uses LUT as an FCACHE. I wrote it, so I like it... but it's still slower than riscvp2.

    (Oh, and the riscvp2_lut.ld option to the riscv compiler makes it use the LUT as FCACHE as well.)