Having it all on SD card is great. I was pleasantly surprised when the edited source successfully saved back to the uSD card.
The free memory display in the Pye status line is useful, too.
Not sure how to detect the card most reliably yet. We do have many options, including using a current sink to gently oppose the pullups built into the card and reading off the resulting analog voltage at the junction (or using the pin A>D threshold mode for simplicity). If we get this going it could be useful, because it might not be necessary to actually change the state of the node being 'read'.
As a comparison, I was able to get my native P2 Micropython integrated with a milliseconds counter and running to the point where I could run the same benchmark ersmith ran with his RISC-V version, then check the results.
Eric's data shows he was able to run the test below at about 34,000 iterations per second (~340K counts over the 10 second run). It is a simple counter increment in a while loop that polls the milliseconds timer until 10 seconds have elapsed; the counter printed is the number of iterations completed in that time.
### Performance
```
import pyb

def perfTest():
    millis = pyb.millis
    endTime = millis() + 10000
    count = 0
    while millis() < endTime:
        count += 1
    print("Count: ", count)
```
We're getting about 340K on this, which isn't great, but we haven't really optimized yet.
My own results, listed below, are similar. We get almost 360k counter increments in 10 seconds of runtime using this same test loop in Micropython on a P2 clocked at 160MHz. This is a slightly better number for native P2 code without any runtime translation overhead from RISC-V, as one would expect, but not significantly better at this stage. Presumably the translation overhead is small and the JIT cache is working very nicely for this small example. Thankfully the P2 native version was certainly not any slower.
I ran it three times for consistency. I also wanted to see how having multiple instructions in the loop affected the result, so I created other tests that doubled and tripled the inner loop's workload using additional counter variables.
It appears from these numbers that the loop overhead itself is a significant component of this test. By my reckoning, approximately two thirds of the time in the single-addition test is taken up by the while-loop overhead, assuming that incrementing a second variable takes the same time as incrementing the first. So the counter addition on its own only took about a third of the time in the first test. Another way to look at it: we could do roughly 3x as many Micropython additions per second in an unrolled sequence, because the loop overhead is so significant, so the raw processing rate would be in the order of ~100k counter increments per second on a 160MHz P2.
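To make that arithmetic concrete, here is a small back-of-envelope sketch (plain Python; the numbers are the 10-second counts from the session below, and the helper names are mine):
```
# 10-second counts from the runs below, for 1, 2 and 3 additions per loop.
counts = {1: 358407, 2: 269526, 3: 220009}
DURATION = 10.0  # seconds

# Seconds per loop iteration for each variant.
t = {n: DURATION / c for n, c in counts.items()}

# Each extra addition costs (t[3] - t[1]) / 2 seconds; the rest is loop/timer overhead.
per_add = (t[3] - t[1]) / 2
overhead = t[1] - per_add

print("per addition : %.1f us" % (per_add * 1e6))            # ~8.8 us
print("loop overhead: %.1f us = %.0f%% of the 1-add loop"
      % (overhead * 1e6, 100 * overhead / t[1]))             # ~19.1 us = ~69%
print("raw add rate : ~%.0fk adds/s" % (1 / per_add / 1e3))  # ~114k
```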
This also tells me that, in general, Micropython code will not be particularly fast on a P2 compared to TaqOZ, and most likely Spin2 as well (which should really fly). Python is ~800 times slower at incrementing a variable than PASM is at incrementing a register variable, for example. Speed may not be an issue for everyone, though; some will prefer the portability/versatility/familiarity of Python. I still wonder how things might fare if we can reduce the hub overhead in the C function prologues/epilogues and find other useful native P2 speedups.
I have attached a demo binary of Micropython that uses a single COG and the regular P62/P63 serial port as the console, with an interrupt-driven receive and smartpin-based TX/RX built in (thanks ozpropdev for helping me there!). I've created a 256kB heap in hub RAM for large programs, but this was a fairly minimal build, so there are not a lot of Python modules included (no SD card support, for example, and no 64-bit integers).
Pasting code works, though if you ever create a backlog beyond what the 2kB serial receive buffer in LUT RAM can hold while the last line is still being processed, it will start to drop received characters. So far, with the small pasted code snippets I've played with, I haven't hit this limit even at 230400 bps. If you do get dropped characters, just paste less at a time or add some terminal delay per line sent.
I have included a native port of the "pyb" module ersmith uses, and it includes support for Pin functions, though I haven't tested this part thoroughly with Smartpins, so if you use it, YMMV. The timer stuff seems to be working okay, and mine also supports an emulated 64-bit counter like ersmith's does, so the clock ticks should stay accurate when the revA P2's 32-bit counter wraps. Having the same timing and IO interface means it should be easy to compare relative performance of code across these implementations. The attached demo.zip file includes more details.
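For context on why the emulated 64-bit counter matters, a one-liner shows how quickly the 32-bit cycle counter wraps at 160MHz:
```
# A 32-bit free-running counter at 160 MHz wraps in well under a minute.
print(2**32 / 160_000_000)  # ~26.8 seconds
```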
Cheers,
Roger.
```
loadp2 -t -l 2000000 -f 160000000 -b 230400 -PATCH build/python.bin
( Entering terminal mode at 230400 bps. Press Ctrl-] to exit. )
MicroPython v1.11-105-gef00048fe-dirty on 2019-07-16; P2-EVAL with propeller2-cpu
Type "help()" for more information.
>>>
paste mode; Ctrl-C to cancel, Ctrl-D to finish
=== import pyb
===
=== def perfTest():
===     millis = pyb.millis
===     endTime = millis() + 10000
===     count = 0
===     while millis() < endTime:
===         count += 1
===     print("Count: ", count)
===
=== def perfTest2():
===     millis = pyb.millis
===     endTime = millis() + 10000
===     count = 0
===     count2 = 0
===     while millis() < endTime:
===         count += 1
===         count2 += 1
===     print("Count: ", count)
===
=== def perfTest3():
===     millis = pyb.millis
===     endTime = millis() + 10000
===     count = 0
===     count2 = 0
===     count3 = 0
===     while millis() < endTime:
===         count += 1
===         count2 += 1
===         count3 += 1
===     print("Count: ", count)
===
=== def test(iterations, testfn):
===     for i in range(iterations):
===         testfn()
===
=== def run():
===     print("Testing 1 additions per loop over 10s")
===     test(3,perfTest)
===     print("Testing 2 additions per loop over 10s")
===     test(3,perfTest2)
===     print("Testing 3 additions per loop over 10s")
===     test(3,perfTest3)
===
=== run()
===
Testing 1 additions per loop over 10s
Count: 358407
Count: 358394
Count: 358394
Testing 2 additions per loop over 10s
Count: 269526
Count: 269520
Count: 269520
Testing 3 additions per loop over 10s
Count: 220009
Count: 220004
Count: 220004
>>>
>>>
```
Here is a great video on optimizing MicroPython. The short story: inline assembly, plus careful use of the language to favor its leaner constructs.
Note: I wanted to test it myself for a fair comparison, so here's the same test run with Eric's Micropython program loaded:
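```
loadp2 -t -f 160000000 -PATCH -b 230400 upython.binary
( Entering terminal mode at 230400 bps. Press Ctrl-] to exit. )
started USB on cog 2
MicroPython eric_v15 on 2019-07-14; P2-Eval-Board with p2-cpu
Type "help()" for more information.
>>>
```
After pasting in the same test code as above:
```
>>> run()
Testing 1 additions per loop over 10s
Count: 337256
Count: 336693
Count: 337820
Testing 2 additions per loop over 10s
Count: 250918
Count: 250929
Count: 252519
Testing 3 additions per loop over 10s
Count: 204079
Count: 206387
Count: 206598
>>>
```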
Nice tests. Here is the same code, tweaked to run for the same times, on a desktop PC with Python 3.6 installed.
The PC-to-P2 ratios: 96402456 / 337256 ~ 285.8x, and 96402456 / 358407 ~ 269.0x.
Tests are additions per loop over 10s:

| Additions per loop | PC perf_counter | PC process_time | PC monotonic | MicroPython eric_v15 (p2-cpu) | MicroPython v1.11-105-gef00048fe-dirty (native) |
| --- | --- | --- | --- | --- | --- |
| 1 | 96402456 | 10221415 | 104482492 | 337256 | 358407 |
| 1 | 96016418 | 10225649 | 105592940 | 336693 | 358394 |
| 1 | 95987645 | 10238059 | 104914693 | 337820 | 358394 |
| 2 | 62619860 | 9753854 | 66711887 | 250918 | 269526 |
| 2 | 62689434 | 9762975 | 66574504 | 250929 | 269520 |
| 2 | 62628501 | 9733592 | 66501913 | 252519 | 269520 |
| 3 | 50340327 | 9379793 | 52883123 | 204079 | 220009 |
| 3 | 50627717 | 9377611 | 52903206 | 206387 | 220004 |
| 3 | 50213318 | 9368160 | 52937436 | 206598 | 220004 |
| 3-add / 1-add ratio | 52.21% | 91.76% | 50.66% | 60.51% | 61.38% |

(Each ratio is a 3-addition count divided by a 1-addition count in the same column.)
Notes on the PC timer functions used:
- time.perf_counter() -> float: system-wide; includes time elapsed during sleep.
- time.process_time() -> float: does not include time elapsed during sleep.
- time.monotonic() -> float: returns the value (in fractional seconds) of a monotonic clock, i.e. a clock that cannot go backwards; not affected by system clock updates.
And the code:
```
# Py_OnP2 - Code run on a PC for a litmus check
from time import perf_counter  # perf_counter_ns -> int, on 3.7

def perfTest():
    secsf = perf_counter  # time.perf_counter() -> float seconds
    endTime = secsf() + 10.000
    count = 0
    while secsf() < endTime:
        count += 1
    print("Count: ", count)

def perfTest2():
    secsf = perf_counter
    endTime = secsf() + 10.000
    count = 0
    count2 = 0
    while secsf() < endTime:
        count += 1
        count2 += 1
    print("Count: ", count)

def perfTest3():
    secsf = perf_counter
    endTime = secsf() + 10.000
    count = 0
    count2 = 0
    count3 = 0
    while secsf() < endTime:
        count += 1
        count2 += 1
        count3 += 1
    print("Count: ", count)

def test(iterations, testfn):
    for i in range(iterations):
        testfn()

def run():
    print("Testing 1 additions per loop over 10s")
    test(3,perfTest)
    print("Testing 2 additions per loop over 10s")
    test(3,perfTest2)
    print("Testing 3 additions per loop over 10s")
    test(3,perfTest3)

run()
```
Addit: I included process_time() and time.monotonic() runs to compare. They may be better for loop testing, since hard real time matters less here than the CPU time actually allocated. They give quite different numbers, and I'm not sure what to make of that; they all seem to take 10s per pass.
I saw that video a number of months ago, and my opinion (remember, opinion) is that MicroPython might be good for learning a language, but if the coder has to go through that much recoding to get performance, it is not a really useful tool for many embedded applications.
The goal of most embedded compilers these days is to push the button at some point and get optimized code. Not saying you can't write slow code in any language, but ...
@jmg, Yeah regardless of which result you take above it appears that a full PC really cranks running Python compared to the P2 running Micropython, though it's not really a pure comparison as it's running different software.
Let's say the PC is clocking at ~3.2GHz and running from L1 cache and the P2 is hitting 80MIPs, that is a 3200/80 speed ratio or 40:1, yet we have a ~275:1 ratio instead. Besides the compiler and clock-speed differences, I guess the Micropython code itself may also be slower than full Python. It would be good to run Micropython on that PC as well, not just full Python, to compare.
Note: I built the P2-GCC version of Micropython using the following options, with the optimizer set to size (-Os) for all the Micropython files:
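```
propeller-elf-gcc -I. -I../.. -Ibuild -Wall -std=c99 -mcog -S -Os -DNDEBUG -MD -o build/py/reader.spin1 ../../py/reader.c
```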
I might try other values and see if there is any noticeable effect. In time I also want to see if I can come up with sed scripts or other tools to translate the function prologues/epilogues into compressed versions that use block hub transfers ("setq") instead of an individual hub transfer for every register saved/loaded. That sequential code alone drops the hub execution rate to below 20MIPs from its peak of 80MIPs, and it is done twice in many functions, adding yet more hub overhead delay.
Let's say the PC is clocking at ~3.2GHz and running from L1 cache and the P2 is hitting 80MIPs, that is a 3200/80 speed ratio or 40:1, yet we have a ~275:1 ratio instead.
Slight nitpick, but PC CPUs execute more than one instruction per cycle. That brings the theoretical ratio closer to the actual ratio.
@jmg, Yeah regardless of which result you take above it appears that a full PC really cranks running Python compared to the P2 running Micropython, though it's not really a pure comparison as it's running different software..
Yes, it is a bit apples and oranges, but it does give a reference point.
I noted Python 3.7 adds time functions in ns, so I downloaded 3.7.4, and that has bounced all the speed numbers around!
I also see time.clock():
On Unix, return the current processor time as a floating point number expressed in seconds. The precision, and in fact the very definition of the meaning of “processor time”, depends on that of the C function of the same name.
On Windows, this function returns wall-clock seconds elapsed since the first call to this function, as a floating point number, based on the Win32 function QueryPerformanceCounter(). The resolution is typically better than one microsecond.
Deprecated since version 3.3, will be removed in version 3.8: the behaviour of this function depends on the platform; use perf_counter() or process_time() instead, depending on your requirements, to have a well defined behaviour.
Summary:
- On a PC, the integer _ns functions are actually slower than float, so they must be adding more conversion software. On the P2, however, integer functions are likely to be faster.
- process_time() looks to have significant overhead; maybe it needs to talk to the OS to get the sleep time.
- Python 3.7.4 is slower than 3.6.0, but monotonic() shows the least speed degradation.
- I might have expected perf_counter() to be the simplest call, and thus the fastest. Appears not.
Slight nitpick, but PC CPUs execute more than one instruction per cycle. That brings the theoretical ratio closer to the actual ratio.
Superscalar, yep. That could buy plenty of performance. Plus all the other branch prediction and caching stuff in PC CPUs. Hub exec branches on P2's incur a lot of overhead along with the extra memory lookup latency. The P2 can't really ever try to compete with a PC here.
.... Hub exec branches on P2's incur a lot of overhead along with the extra memory lookup latency. The P2 can't really ever try to compete with a PC here.
I don't think users will expect the P2 to give the same numbers.
The comparisons are still useful - e.g. one indicator here is that the time-checking function costs about twice as much on the P2 as on the PC - roughly 2 'quanta units' vs the PC's 1.
Does P2.Python use 32b or 64b integers ?
In the build I put together, it's 32-bit integers. Actually I think Micropython uses signed 30 or 31 bit numbers internally (a build option) and translates accordingly. There didn't seem to be sufficient support for 64-bit operations in the p2gcc libraries yet to enable the larger integers, though in time that could be added.
I think there is some support for large integers in Eric's RISC-V based build. So there might be some difference there.
Interesting results, Roger. I agree that the performance numbers are disappointing so far (for both versions). I suspect that we could both tweak our respective compilers somewhat and get a few percent more here and there. But if Parallax really wants maximum performance out of micropython they should arrange for a "real" P2 compiler port, rather than the hacks we're relying on now.
@Tubular : thanks for the feedback on booting micropython. I've put a v16 image in the first post which can now detect the sdcard it boots from and mount it automatically, so it should make development a lot easier. If there's a "main.py" file on the sdcard it will run that at boot time, so it's relatively easy to customize.
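As a minimal (hypothetical) illustration of that hook, a main.py on the card could be as simple as:
```
# main.py - executed automatically at boot when present on the sdcard
import pyb
print("Booted from sdcard at millis =", pyb.millis())
```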
Interesting results, Roger. I agree that the performance numbers are disappointing so far (for both versions). I suspect that we could both tweak our respective compilers somewhat and get a few percent more here and there. But if Parallax really wants maximum performance out of micropython they should arrange for a "real" P2 compiler port, rather than the hacks we're relying on now.
A "real" P2 compiler port? What would constitute a real P2 compiler? I suppose just modifying the old PropGCC to generate P2 code would not cut it. I guess then it would have to be a port of the current GCC or LLVM toolchains. I wonder how much one would have to budget for something like that? Who could do it?
But if Parallax really wants maximum performance out of micropython they should arrange for a "real" P2 compiler port, rather than the hacks we're relying on now.
Yes, having a C/C++ compiler fully optimized for the P2 will be great one day. Though I start to wonder whether hub exec may itself become the limiting factor in those circumstances; ideally we'd want it to be, I guess, since at that point there is probably no further gain possible.
Interesting results, Roger. I agree that the performance numbers are disappointing so far (for both versions). I suspect that we could both tweak our respective compilers somewhat and get a few percent more here and there. But if Parallax really wants maximum performance out of micropython they should arrange for a "real" P2 compiler port, rather than the hacks we're relying on now.
Here are some other platforms' timings, running a similar polling + INC loop, that I found.
```
# perfTest1 examples (Testing 1 32b addition + PollTimer) from other platforms https://github.com/micropython/micropython/wiki/Performance
# MicroPython on Teensy 3.1: (96MHz ARM) Count: 1,098,681
# Micro Python v1.0.1 PYBv1.0 with STM32F405RG (168MHz) Count: 2,122,667
# C++ on Teensy 3.1: (96MHz ARM) 32b++ Count: 95,835,923
# C++ on Arduino Pro Mini: (16 MHz Atmega328) 32b++ Count: 4,970,227
```
i.e. the ARM ports are in the 1M~2M ballpark.
If the P2 time-test here could be reduced to the same cost as the inc lines (~10us), you would get to ~0.5M, so that's not too bad considering the path taken to get this number.
I'm not sure performance numbers really matter. If I were designing a product with the P2, I would use Spin and P2ASM. If I were teaching students using the P2, I would teach Python. There, performance really doesn't count for as much as learning an academically/administratively-accepted language.
Here are some other platforms' timings, running a similar polling + INC loop, that I found.
```
# perfTest1 examples (Testing 1 32b addition + PollTimer) from other platforms https://github.com/micropython/micropython/wiki/Performance
# MicroPython on Teensy 3.1: (96MHz ARM) Count: 1,098,681
# Micro Python v1.0.1 PYBv1.0 with STM32F405RG (168MHz) Count: 2,122,667
# C++ on Teensy 3.1: (96MHz ARM) 32b++ Count: 95,835,923
# C++ on Arduino Pro Mini: (16 MHz Atmega328) 32b++ Count: 4,970,227
```
i.e. the ARM ports are in the 1M~2M ballpark.
If the P2 time-test here could be reduced to the same cost as the inc lines (~10us), you would get to ~0.5M, so that's not too bad considering the path taken to get this number.
These are good numbers to compare against. If you derate the ARM-based Teensy 3.1 from 96MHz down to 80MHz, you get roughly 916000 vs our 360000, which is a fairer comparison. The ARM can likely access its RAM much faster than the P2 - probably a single clock cycle per read vs many more on the P2 on average. So if we can speed up the timing code (it includes two CORDIC divisions, which consume quite a few clocks; we could probably halve that to a single division with a fixed frequency encoded at compile time), along with the hub RAM transfer reduction, we may be able to narrow the gap further.
Also, I've noticed the P2GCC toolchain in COG mode often uses a separate RAM address for each variable accessed (and there seem to be multiple instances of these addresses for the same variable). So the generated code has to read the address of the variable (which the compiler assumed was already in COG RAM) from hub RAM, and then the variable itself from hub RAM. This could be changed to read the variable at a known address encoded into the RDLONG instruction, i.e. RDLONG var, ##addr directly, potentially saving extra hub cycles.
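As an illustrative sketch only (hypothetical labels, not actual p2gcc output), that would save one hub access per variable read:
```
' current pattern: two hub accesses to read one variable
        rdlong  r0, ##var_addr_slot   ' fetch the variable's hub address from hub RAM
        rdlong  r1, r0                ' then fetch the variable itself
' suggested: encode the known address directly into the instruction
        rdlong  r1, ##variable        ' one hub access via the 32-bit immediate
```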
I saw that video a number of months ago, and my opinion (remember, opinion) is that MicroPython might be good for learning a language, but if the coder has to go through that much recoding to get performance, it is not a really useful tool for many embedded applications.
The goal of most embedded compilers these days is to push the button at some point and get optimized code. Not saying you can't write slow code in any language, but ...
Totally.
To me, it was a rundown on where gains can be had, up through the extreme of using inline assembly. Ideally, after viewing it and applying the various ideas, someone coding can simply make different choices. Some of those are pretty simple, with nice returns.
Spin 2 will have a similar attribute. It's going to be faster and leaner due to how it's designed. But it's also got inline assembly, as several P2 tools do; we need that because of the hardware features.
The MicroPython inline implementation is rough; one would definitely use it for critical sections only. FastSpin has a sweet implementation. The better that part of things is, the more likely it sees use.
How useful? Who knows? It's so early.
I just wanted to share what I thought was a great rundown on where the pain points can potentially be avoided.
I'm not sure performance numbers really matter. If I were designing a product with the P2, I would use Spin and P2ASM. If I were teaching students using the P2, I would teach Python. There, performance really doesn't count for as much as learning an academically/administratively-accepted language.
-Phil
Agreed.
BTW I am using Python at work. Nothing I am doing with Python is time critical - I just don't care if it takes 5 mins or 50 mins, as while it runs I do something else. What I like is the interactive part, as I can debug each line when required; the syntax for decoding html is difficult for a beginner at this. Fortunately I wrote html code a few moons ago, so at least I understand the html section.
I find Python's libraries (if you want to call them that) quite frustrating. The calling parameters are not clearly explained, so one spends quite a lot of time trying various syntax to get the desired results. Error descriptions don't help much either.
The only way I made my way through learning anything in Bash was to google and google and google. There was always someone who had publicly asked the question on a forum somewhere, and there always seems to be someone with an in-depth understanding of the mechanics of the language ready to answer. Some blogs had good examples too.
Googling worked quite well in the end, but it beats me how any Joe code monkey could have learned this stuff before it was available like this.
I've updated the first post with v16b. It has the same features as v16 but is compiled with an improved riscvp2 engine, and hence has better performance. I ran @rogloh 's script (I actually entered the program and ran it entirely on the P2, thanks to pye.py) and got the following results:
```
>>> run()
Testing 1 additions per loop over 10s
Count: 378045
Count: 379502
Count: 378045
Testing 2 additions per loop over 10s
Count: 286924
Count: 287338
Count: 288998
Testing 3 additions per loop over 10s
Count: 231199
Count: 231205
Count: 230136
```
This is a little faster than the p2gcc results above. No doubt we can tweak p2gcc to do even better, and then tweak riscvp2 again to do better still, and so on. Although perhaps it makes sense to wait for Parallax to decide on a tool strategy and settle on a compiler, and in the meantime work on adding features to the P2 version of micropython.
What additional P2 features would people like to see? The current VGA driver is text only, to save space (it does support 8 bit foreground and background colors for each character). An alternative might be to do a 320x240x8bpp screen buffer. The memory cost would be 76K, versus the current 16K, and obviously the text wouldn't be as nice (we could do 50x24 in a 6x10 font, instead of 100x40 with the current 8x15 font). For reference, my version of micropython currently has a 220K user heap. I think Roger's has 256K, but AFAIK does not have any VGA support yet and not as many modules, so I don't think p2gcc will save us anything on RAM (but trying the experiment would be interesting).
Anything else that would be really nice to have in Python for the P2?
Always good to see improvements, Eric. I have also had some success optimising the prologs/epilogs today after playing further, and this has gained us two good things: higher performance and a smaller executable. With the code now using SETQ block transfers instead of lots of rdlong/wrlongs, the executable size has dropped by about 20kB to about 199kB (excluding the heap). I've also enabled a few more things in the configuration, which added some modules to this build, but there are still no kbd/SD/VGA features like yours has, though they are certainly possible to integrate.
For this work I needed some Perl and sed scripts with some nasty long regular expressions matching over multiple lines. It turns out there were two variants of the prolog generated, and I was only matching one of them. Because of my reversed register save order with SETQ, mixing the old and new forms together messed things up in the code, so I reversed the register order in the VM to compensate and allow interoperability (which then broke the setjmp/longjmp order :frown: ). Once I figured that out and reversed that too, I was able to get the code running again.
Here are the faster results I am getting now, with most of the prologs/epilogs optimized for P2 (though not quite all yet, until I match that second prolog type):
```
>>> run()
Testing 1 additions per loop over 10s
Count: 393664
Count: 393670
Count: 393669
Testing 2 additions per loop over 10s
Count: 299834
Count: 299827
Count: 299826
Testing 3 additions per loop over 10s
Count: 239221
Count: 239216
Count: 239215
>>> help('modules')
__main__          frozentest        pyb               uio
array             gc                sys               ustruct
builtins          micropython       ucollections
Plus any modules on the filesystem
>>>
```
The regexp that matches the longest function prolog is nasty to type or read, and I have 7 different variants of these for the prologs and epilogs, covering the different numbers of registers being saved/restored on the stack. Each one just substitutes a single text string (in this case "_prologue_r8") into the code, which I could then confirm in the files and transform later into real optimized code. That marker matches this prolog:
```
        sub     sp, #4
        wrlong  r8, sp
        sub     sp, #4
        wrlong  r9, sp
        sub     sp, #4
        wrlong  r10, sp
        sub     sp, #4
        wrlong  r11, sp
        sub     sp, #4
        wrlong  r12, sp
        sub     sp, #4
        wrlong  r13, sp
        sub     sp, #4
        wrlong  r14, sp
        sub     sp, #4
        wrlong  lr, sp
```
I then run a sed script to transform this "_prologue_r8" (and all the other variants) into the type of code below, before the spin is translated to spin2 and assembled. It's a roundabout way to go, but it works; ideally it could be done as part of sp2pasm parsing/translation. The sample prolog code above gets converted to this:
```
'_prologue_r8 replacement
        sub     sp, #32
        mov     r15, lr
        setq    #7
        wrlong  r15, sp
```
What additional P2 features would people like to see? The current VGA driver is text only, to save space (it does support 8 bit foreground and background colors for each character) ... RAM = 16K ... 100x40 with the current 8x15 font.
Anything else that would be really nice to have in Python for the P2?
I can see good lab use for the P2-as-instrument, and there a larger font, or some font-zoom option, could be nice.
100x40 is good for text work, but not so good for 'across the room' reading.
I guess the simplest is a zoom-all that just drops the 100x40 and expands the clocks-per-pixel.
Zoom by character line would allow a by-lines mix of display heights, and it could use the original Y axis index (1..40) as starting point, to keep screen indexing portable.
That costs just 40 bytes, for a start-of-line tag, or it could use an existing terminal control code slot to trigger.
e.g. I find these SGR codes:
10: Primary (default) font
11-19: Alternative font - select alternative font n − 10
Next would be a selective zoom, or a modest number of larger chars (Hex, Float, units?), but maybe the zoom-all is good enough? SGR 11-19 could map to scale/zoom x2,x3,x4,x5..x10, applied to the first char in the line. Google also found this, which I had not seen before: https://en.wikipedia.org/wiki/Windows_Terminal
Further to this, I added another benchmark in Micropython that does a factorial computation, and found a fairly large performance difference between the two types of builds (though I am still testing v15, without Eric's latest changes). Running both at 160MHz, the RISC-V version was ~3.5x slower than native P2 in this case. Both tested builds have long int support (MICROPY_LONGINT_IMPL_MPZ), so that shouldn't have been a factor. It could be that the recursive nature of this code exercises the stack/heap a fair bit more and places far more demands on the JIT translation, due to actual function calls instead of a simple increment operation... or maybe something else?
```
import pyb

def perfTest():
    millis = pyb.millis
    endTime = millis() + 10000
    count = 0
    while millis() < endTime:
        count += 1
    print("Count: ", count)

def perfTest2():
    millis = pyb.millis
    endTime = millis() + 10000
    count = 0
    count2 = 0
    while millis() < endTime:
        count += 1
        count2 += 1
    print("Count: ", count)

def perfTest3():
    millis = pyb.millis
    endTime = millis() + 10000
    count = 0
    count2 = 0
    count3 = 0
    while millis() < endTime:
        count += 1
        count2 += 1
        count3 += 1
    print("Count: ", count)

def fact(n):
    if n == 0:
        return 1
    else:
        return n * fact(n-1)

def perfTest4():
    millis = pyb.millis
    endTime = millis() + 10000
    count = 0
    while millis() < endTime:
        count += 1
        fact(10)
    print("Count: ", count)

def test(iterations, testfn):
    for i in range(iterations):
        testfn()

def run():
    # print("Testing 1 additions per loop over 10s")
    # test(3,perfTest)
    # print("Testing 2 additions per loop over 10s")
    # test(3,perfTest2)
    # print("Testing 3 additions per loop over 10s")
    # test(3,perfTest3)
    print("Testing 10! calculations per loop over 10s")
    test(3,perfTest4)
```
Results:
P2 Native:
```
Testing 10! calculations per loop over 10s
Count: 15461
Count: 15461
Count: 15461
```
RISC-V (v15):
```
Testing 10! calculations per loop over 10s
Count: 4364
Count: 4362
Count: 4480
```
### Comments
Good to see numbers on this stuff. Here is another data point, from an Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz.
In the benchmarking code I imported "utime" instead of the "pyb" module and mapped the millis function to its utime.ticks_ms() function like this:
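(A minimal reconstruction of that mapping, assuming the same perfTest shape as the earlier script:)
```
# Reconstructed from the description above: utime.ticks_ms in place of pyb.millis
import utime

def perfTest():
    millis = utime.ticks_ms
    endTime = millis() + 10000
    count = 0
    while millis() < endTime:
        count += 1
    print("Count: ", count)
```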
I achieved the following output:
```
>>> run()
Testing 1 additions per loop over 10s
Count: 141431655
Count: 142062157
Count: 140679732
Testing 2 additions per loop over 10s
Count: 111738309
Count: 111696344
Count: 112710277
Testing 3 additions per loop over 10s
Count: 102735397
Count: 101499762
Count: 99670684
```
This Mac is running it fast. Almost 400x faster than a P2! Wow.