Having it all on SD card is great. I was pleasantly surprised when the edited source successfully saved back to the uSD card.
The free memory display in the Pye status line is useful, too.
Not sure how to detect the card most reliably yet. We do have many options, including using a current sink to gently oppose the pullups built into the card and reading off the resulting analog voltage at the junction (or using the pin A>D threshold mode for simplicity). If we get this going it could be useful, because it might not be necessary to actually change the state of the node being 'read'.
As a comparison, I was able to get my native P2 Micropython integrated with a milliseconds counter and running to the point where I could run the same benchmark ersmith ran with his RISC-V version, then check the results.
Eric's data shows he was able to run the test below at about 34,000 iterations per second (~340K counts over the 10 second run). It is a simple counter increment in a while loop that polls the milliseconds timer until 10 seconds have elapsed; the counter printed is the number of iterations completed in that time.
### Performance
```
import pyb

def perfTest():
    millis = pyb.millis
    endTime = millis() + 10000
    count = 0
    while millis() < endTime:
        count += 1
    print("Count: ", count)
```
We're getting about 340K on this, which isn't great, but we haven't really optimized yet.
My own results, listed below, are similar. We get almost 360k counter increments in 10 seconds of runtime using this same test loop in Micropython on a P2 clocked at 160MHz. This is a slightly better number for native P2 code without any runtime translation overhead from RISC-V, as one would expect, but not significantly better at this stage. Presumably the translation overhead is small and the JIT cache is working very nicely for this small example. Thankfully the P2 native version was certainly not any slower.
I ran it three times for consistency. I also wanted to see how having multiple instructions in the loop affected the result, so I created other tests that doubled and tripled the inner loop's workload using additional counter variables.
It appears from these numbers that the loop overhead itself is a significant component of this test. By my reckoning, approximately two thirds of the time in the single-addition test is taken up by the while-loop overhead, assuming that incrementing a second variable takes the same time as incrementing the first. So the counter addition on its own only took about a third of the time in the first test. Another way to look at it: we could do roughly 3x as many Micropython additions per second in an unrolled sequence, because the loop overhead is so significant, so the raw processing rate would be in the order of ~100k counter increments per second on a 160MHz P2.
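To make that arithmetic concrete, here is a small back-of-envelope sketch (plain Python; the numbers are the 10-second counts from the session below, and the helper names are mine):
```
# 10-second counts from the runs below, for 1, 2 and 3 additions per loop.
counts = {1: 358407, 2: 269526, 3: 220009}
DURATION = 10.0  # seconds

# Seconds per loop iteration for each variant.
t = {n: DURATION / c for n, c in counts.items()}

# Each extra addition costs (t[3] - t[1]) / 2 seconds; the rest is loop/timer overhead.
per_add = (t[3] - t[1]) / 2
overhead = t[1] - per_add

print("per addition : %.1f us" % (per_add * 1e6))            # ~8.8 us
print("loop overhead: %.1f us = %.0f%% of the 1-add loop"
      % (overhead * 1e6, 100 * overhead / t[1]))             # ~19.1 us = ~69%
print("raw add rate : ~%.0fk adds/s" % (1 / per_add / 1e3))  # ~114k
```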
This also tells me that, in general, Micropython code will not be particularly fast on a P2 compared to TaqOZ, and most likely Spin2 as well (which should really fly). Python is ~800 times slower at incrementing a variable than PASM is at incrementing a register variable, for example. Speed may not be an issue for everyone, though; some will prefer the portability/versatility/familiarity of Python. I still wonder how things might fare if we can reduce the hub overhead in the C function prologues/epilogues and find other useful native P2 speedups.
I have attached a demo binary of Micropython that uses a single COG and the regular P62/P63 serial port as the console, with an interrupt-driven receive and smartpin-based TX/RX built in (thanks ozpropdev for helping me there!). I've created a 256kB heap in hub RAM for large programs, but this was a fairly minimal build, so there are not a lot of Python modules included (no SD card support, for example, and no 64-bit integers).
Pasting code works, though if you ever create a backlog beyond what the 2kB serial receive buffer in LUT RAM can hold while the last line is still being processed, it will start to drop received characters. So far, with the small pasted code snippets I've played with, I haven't hit this limit even at 230400 bps. If you do get dropped characters, just paste less at a time or add some terminal delay per line sent.
I have included a native port of the "pyb" module ersmith uses, and it includes support for Pin functions, though I haven't tested this part thoroughly with Smartpins, so if you use it, YMMV. The timer stuff seems to be working okay, and mine also supports an emulated 64-bit counter like ersmith's does, so the clock ticks should stay accurate when the revA P2's 32-bit counter wraps. Having the same timing and IO interface means it should be easy to compare relative performance of code across these implementations. The attached demo.zip file includes more details.
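For context on why the emulated 64-bit counter matters, a one-liner shows how quickly the 32-bit cycle counter wraps at 160MHz:
```
# A 32-bit free-running counter at 160 MHz wraps in well under a minute.
print(2**32 / 160_000_000)  # ~26.8 seconds
```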
Cheers,
Roger.
```
loadp2 -t -l 2000000 -f 160000000 -b 230400 -PATCH build/python.bin
( Entering terminal mode at 230400 bps. Press Ctrl-] to exit. )
MicroPython v1.11-105-gef00048fe-dirty on 2019-07-16; P2-EVAL with propeller2-cpu
Type "help()" for more information.
>>>
paste mode; Ctrl-C to cancel, Ctrl-D to finish
=== import pyb
===
=== def perfTest():
===     millis = pyb.millis
===     endTime = millis() + 10000
===     count = 0
===     while millis() < endTime:
===         count += 1
===     print("Count: ", count)
===
=== def perfTest2():
===     millis = pyb.millis
===     endTime = millis() + 10000
===     count = 0
===     count2 = 0
===     while millis() < endTime:
===         count += 1
===         count2 += 1
===     print("Count: ", count)
===
=== def perfTest3():
===     millis = pyb.millis
===     endTime = millis() + 10000
===     count = 0
===     count2 = 0
===     count3 = 0
===     while millis() < endTime:
===         count += 1
===         count2 += 1
===         count3 += 1
===     print("Count: ", count)
===
=== def test(iterations, testfn):
===     for i in range(iterations):
===         testfn()
===
=== def run():
===     print("Testing 1 additions per loop over 10s")
===     test(3,perfTest)
===     print("Testing 2 additions per loop over 10s")
===     test(3,perfTest2)
===     print("Testing 3 additions per loop over 10s")
===     test(3,perfTest3)
===
=== run()
===
Testing 1 additions per loop over 10s
Count: 358407
Count: 358394
Count: 358394
Testing 2 additions per loop over 10s
Count: 269526
Count: 269520
Count: 269520
Testing 3 additions per loop over 10s
Count: 220009
Count: 220004
Count: 220004
>>>
>>>
```
Here is a great video on optimizing MicroPython. The short story: inline assembly, plus careful use of the language to favor its leaner constructs.
Note: I wanted to test it myself for a fair comparison, so here's the same test run with Eric's Micropython program loaded:
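```
loadp2 -t -f 160000000 -PATCH -b 230400 upython.binary
( Entering terminal mode at 230400 bps. Press Ctrl-] to exit. )
started USB on cog 2
MicroPython eric_v15 on 2019-07-14; P2-Eval-Board with p2-cpu
Type "help()" for more information.
>>>
```
After pasting in the same test code as above:
```
>>> run()
Testing 1 additions per loop over 10s
Count: 337256
Count: 336693
Count: 337820
Testing 2 additions per loop over 10s
Count: 250918
Count: 250929
Count: 252519
Testing 3 additions per loop over 10s
Count: 204079
Count: 206387
Count: 206598
>>>
```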
Nice tests. Here is the same code, tweaked to run for the same times, on a desktop PC with Python 3.6 installed.
The PC-to-P2 ratios: 96402456 / 337256 ~ 285.8x, and 96402456 / 358407 ~ 269.0x.
Tests are additions per loop over 10s:

| Additions per loop | PC perf_counter | PC process_time | PC monotonic | MicroPython eric_v15 (p2-cpu) | MicroPython v1.11-105-gef00048fe-dirty (native) |
| --- | --- | --- | --- | --- | --- |
| 1 | 96402456 | 10221415 | 104482492 | 337256 | 358407 |
| 1 | 96016418 | 10225649 | 105592940 | 336693 | 358394 |
| 1 | 95987645 | 10238059 | 104914693 | 337820 | 358394 |
| 2 | 62619860 | 9753854 | 66711887 | 250918 | 269526 |
| 2 | 62689434 | 9762975 | 66574504 | 250929 | 269520 |
| 2 | 62628501 | 9733592 | 66501913 | 252519 | 269520 |
| 3 | 50340327 | 9379793 | 52883123 | 204079 | 220009 |
| 3 | 50627717 | 9377611 | 52903206 | 206387 | 220004 |
| 3 | 50213318 | 9368160 | 52937436 | 206598 | 220004 |
| 3-add / 1-add ratio | 52.21% | 91.76% | 50.66% | 60.51% | 61.38% |

(Each ratio is a 3-addition count divided by a 1-addition count in the same column.)
Notes on the PC timer functions used:
- time.perf_counter() -> float: system-wide; includes time elapsed during sleep.
- time.process_time() -> float: does not include time elapsed during sleep.
- time.monotonic() -> float: returns the value (in fractional seconds) of a monotonic clock, i.e. a clock that cannot go backwards; not affected by system clock updates.
And the code:
```
# Py_OnP2 - Code run on a PC for a litmus check
from time import perf_counter  # perf_counter_ns -> int, on 3.7

def perfTest():
    secsf = perf_counter  # time.perf_counter() -> float seconds
    endTime = secsf() + 10.000
    count = 0
    while secsf() < endTime:
        count += 1
    print("Count: ", count)

def perfTest2():
    secsf = perf_counter
    endTime = secsf() + 10.000
    count = 0
    count2 = 0
    while secsf() < endTime:
        count += 1
        count2 += 1
    print("Count: ", count)

def perfTest3():
    secsf = perf_counter
    endTime = secsf() + 10.000
    count = 0
    count2 = 0
    count3 = 0
    while secsf() < endTime:
        count += 1
        count2 += 1
        count3 += 1
    print("Count: ", count)

def test(iterations, testfn):
    for i in range(iterations):
        testfn()

def run():
    print("Testing 1 additions per loop over 10s")
    test(3,perfTest)
    print("Testing 2 additions per loop over 10s")
    test(3,perfTest2)
    print("Testing 3 additions per loop over 10s")
    test(3,perfTest3)

run()
```
Addit: I included process_time() and time.monotonic() runs to compare. They may be better for loop testing, since hard real time matters less here than the CPU time actually allocated. They give quite different numbers, and I'm not sure what to make of that; they all seem to take 10s per pass.
I saw that video a number of months ago, and my opinion (remember, opinion) is that MicroPython might be good for learning a language, but if the coder has to go through that much recoding to get performance, it is not a really useful tool for many embedded applications.
The goal of most embedded compilers these days is to push the button at some point and get optimized code. Not saying you can't write slow code in any language, but ...
@jmg, Yeah regardless of which result you take above it appears that a full PC really cranks running Python compared to the P2 running Micropython, though it's not really a pure comparison as it's running different software.
Let's say the PC is clocking at ~3.2GHz and running from L1 cache and the P2 is hitting 80MIPs, that is a 3200/80 speed ratio or 40:1, yet we have a ~275:1 ratio instead. Besides the compiler and clock-speed differences, I guess the Micropython code itself may also be slower than full Python. It would be good to run Micropython on that PC as well, not just full Python, to compare.
Note: I built the P2-GCC version of Micropython using the following options, with the optimizer set to size (-Os) for all the Micropython files:
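```
propeller-elf-gcc -I. -I../.. -Ibuild -Wall -std=c99 -mcog -S -Os -DNDEBUG -MD -o build/py/reader.spin1 ../../py/reader.c
```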
I might try other values and see if there is any noticeable effect. In time I also want to see if I can come up with sed scripts or other tools to translate the function prologues/epilogues into compressed versions that use block hub transfers ("setq") instead of an individual hub transfer for every register saved/loaded. That sequential code alone drops the hub execution rate to below 20MIPs from its peak of 80MIPs, and it is done twice in many functions, adding yet more hub overhead delay.
Let's say the PC is clocking at ~3.2GHz and running from L1 cache and the P2 is hitting 80MIPs, that is a 3200/80 speed ratio or 40:1, yet we have a ~275:1 ratio instead.
Slight nitpick, but PC CPUs execute more than one instruction per cycle. That brings the theoretical ratio closer to the actual ratio.
@jmg, Yeah regardless of which result you take above it appears that a full PC really cranks running Python compared to the P2 running Micropython, though it's not really a pure comparison as it's running different software..
Yes, it is a bit apples and oranges, but it does give a reference point.
I noted Python 3.7 adds time functions in ns, so I downloaded 3.7.4, and that has bounced all the speed numbers around!
I also see time.clock():
On Unix, return the current processor time as a floating point number expressed in seconds. The precision, and in fact the very definition of the meaning of “processor time”, depends on that of the C function of the same name.
On Windows, this function returns wall-clock seconds elapsed since the first call to this function, as a floating point number, based on the Win32 function QueryPerformanceCounter(). The resolution is typically better than one microsecond.
Deprecated since version 3.3, will be removed in version 3.8: the behaviour of this function depends on the platform; use perf_counter() or process_time() instead, depending on your requirements, to have a well defined behaviour.
Summary:
- On a PC, the integer _ns functions are actually slower than float, so they must be adding more conversion software. On the P2, however, integer functions are likely to be faster.
- process_time() looks to have significant overhead; maybe it needs to talk to the OS to get the sleep time.
- Python 3.7.4 is slower than 3.6.0, but monotonic() shows the least speed degradation.
- I might have expected perf_counter() to be the simplest call, and thus the fastest. Appears not.
Slight nitpick, but PC CPUs execute more than one instruction per cycle. That brings the theoretical ratio closer to the actual ratio.
Superscalar, yep. That could buy plenty of performance. Plus all the other branch prediction and caching stuff in PC CPUs. Hub exec branches on P2's incur a lot of overhead along with the extra memory lookup latency. The P2 can't really ever try to compete with a PC here.
.... Hub exec branches on P2's incur a lot of overhead along with the extra memory lookup latency. The P2 can't really ever try to compete with a PC here.
I don't think users will expect the P2 to give the same numbers.
The comparisons are still useful - e.g. one indicator here is that the time-checking function costs about twice as much on the P2 as on the PC - roughly 2 'quanta units' vs the PC's 1.
Does P2.Python use 32b or 64b integers ?
In the build I put together, it's 32-bit integers. Actually I think Micropython uses signed 30 or 31 bit numbers internally (a build option) and translates accordingly. There didn't seem to be sufficient support for 64-bit operations in the p2gcc libraries yet to enable the larger integers, though in time that could be added.
I think there is some support for large integers in Eric's RISC-V based build. So there might be some difference there.
Interesting results, Roger. I agree that the performance numbers are disappointing so far (for both versions). I suspect that we could both tweak our respective compilers somewhat and get a few percent more here and there. But if Parallax really wants maximum performance out of micropython they should arrange for a "real" P2 compiler port, rather than the hacks we're relying on now.
@Tubular : thanks for the feedback on booting micropython. I've put a v16 image in the first post which can now detect the sdcard it boots from and mount it automatically, so it should make development a lot easier. If there's a "main.py" file on the sdcard it will run that at boot time, so it's relatively easy to customize.
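As a minimal (hypothetical) illustration of that hook, a main.py on the card could be as simple as:
```
# main.py - executed automatically at boot when present on the sdcard
import pyb
print("Booted from sdcard at millis =", pyb.millis())
```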
Interesting results, Roger. I agree that the performance numbers are disappointing so far (for both versions). I suspect that we could both tweak our respective compilers somewhat and get a few percent more here and there. But if Parallax really wants maximum performance out of micropython they should arrange for a "real" P2 compiler port, rather than the hacks we're relying on now.
A "real" P2 compiler port? What would constitute a real P2 compiler? I suppose just modifying the old PropGCC to generate P2 code would not cut it. I guess then it would have to be a port of the current GCC or LLVM toolchains. I wonder how much one would have to budget for something like that? Who could do it?
But if Parallax really wants maximum performance out of micropython they should arrange for a "real" P2 compiler port, rather than the hacks we're relying on now.
Yes, having a C/C++ compiler fully optimized for the P2 will be great one day. Though I start to wonder whether hub exec may itself become the limiting factor in those circumstances; ideally we'd want it to be, I guess, since at that point there is probably no further gain possible.
Interesting results, Roger. I agree that the performance numbers are disappointing so far (for both versions). I suspect that we could both tweak our respective compilers somewhat and get a few percent more here and there. But if Parallax really wants maximum performance out of micropython they should arrange for a "real" P2 compiler port, rather than the hacks we're relying on now.
Here are some other platforms' timings, running a similar polling + INC loop, that I found.
```
# perfTest1 examples (Testing 1 32b addition + PollTimer) from other platforms https://github.com/micropython/micropython/wiki/Performance
# MicroPython on Teensy 3.1: (96MHz ARM) Count: 1,098,681
# Micro Python v1.0.1 PYBv1.0 with STM32F405RG (168MHz) Count: 2,122,667
# C++ on Teensy 3.1: (96MHz ARM) 32b++ Count: 95,835,923
# C++ on Arduino Pro Mini: (16 MHz Atmega328) 32b++ Count: 4,970,227
```
i.e. the ARM ports are in the 1M~2M ballpark.
If the P2 time-test here could be reduced to the same cost as the inc lines (~10us), you would get to ~0.5M, so that's not too bad considering the path taken to get this number.
I'm not sure performance numbers really matter. If I were designing a product with the P2, I would use Spin and P2ASM. If I were teaching students using the P2, I would teach Python. There, performance really doesn't count for as much as learning an academically/administratively-accepted language.
Here are some other platforms' timings, running a similar polling + INC loop, that I found.
```
# perfTest1 examples (Testing 1 32b addition + PollTimer) from other platforms https://github.com/micropython/micropython/wiki/Performance
# MicroPython on Teensy 3.1: (96MHz ARM) Count: 1,098,681
# Micro Python v1.0.1 PYBv1.0 with STM32F405RG (168MHz) Count: 2,122,667
# C++ on Teensy 3.1: (96MHz ARM) 32b++ Count: 95,835,923
# C++ on Arduino Pro Mini: (16 MHz Atmega328) 32b++ Count: 4,970,227
```
i.e. the ARM ports are in the 1M~2M ballpark.
If the P2 time-test here could be reduced to the same cost as the inc lines (~10us), you would get to ~0.5M, so that's not too bad considering the path taken to get this number.
These are good numbers to compare against. If you derate the ARM-based Teensy 3.1 from 96MHz down to 80MHz, you get roughly 916000 vs our 360000, which is a fairer comparison. The ARM can likely access its RAM much faster than the P2 - probably a single clock cycle per read vs many more on the P2 on average. So if we can speed up the timing code (it includes two CORDIC divisions, which consume quite a few clocks; we could probably halve that to a single division with a fixed frequency encoded at compile time), along with the hub RAM transfer reduction, we may be able to narrow the gap further.
Also, I've noticed the P2GCC toolchain in COG mode often uses a separate RAM address for each variable accessed (and there seem to be multiple instances of these addresses for the same variable). So the generated code has to read the address of the variable (which the compiler assumed was already in COG RAM) from hub RAM, and then the variable itself from hub RAM. This could be changed to read the variable at a known address encoded into the RDLONG instruction, i.e. RDLONG var, ##addr directly, potentially saving extra hub cycles.
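As an illustrative sketch only (hypothetical labels, not actual p2gcc output), that would save one hub access per variable read:
```
' current pattern: two hub accesses to read one variable
        rdlong  r0, ##var_addr_slot   ' fetch the variable's hub address from hub RAM
        rdlong  r1, r0                ' then fetch the variable itself
' suggested: encode the known address directly into the instruction
        rdlong  r1, ##variable        ' one hub access via the 32-bit immediate
```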
I saw that video a number of months ago, and my opinion (remember, opinion) is that MicroPython might be good for learning a language, but if the coder has to go through that much recoding to get performance, it is not a really useful tool for many embedded applications.
The goal of most embedded compilers these days is to push the button at some point and get optimized code. Not saying you can't write slow code in any language, but ...
Totally.
To me, it was a rundown on where gains can be had, up through the extreme of using inline assembly. Ideally, after viewing it and applying the various ideas, someone coding can simply make different choices. Some of those are pretty simple, with nice returns.
Spin 2 will have a similar attribute. It's going to be faster and leaner due to how it's designed. But it's also got inline assembly, as several P2 tools do; we need that because of the hardware features.
The MicroPython inline implementation is rough; one would definitely use it for critical sections only. FastSpin has a sweet implementation. The better that part of things is, the more likely it sees use.
How useful? Who knows? It's so early.
I just wanted to share what I thought was a great rundown on where the pain points can potentially be avoided.
I'm not sure performance numbers really matter. If I were designing a product with the P2, I would use Spin and P2ASM. If I were teaching students using the P2, I would teach Python. There, performance really doesn't count for as much as learning an academically/administratively-accepted language.
-Phil
Agreed.
BTW I am using Python at work. Nothing I am doing with Python is time critical - I just don't care if it takes 5 mins or 50 mins, as while it runs I do something else. What I like is the interactive part, as I can debug each line when required; the syntax for decoding html is difficult for a beginner at this. Fortunately I wrote html code a few moons ago, so at least I understand the html section.
I find Python's libraries (if you want to call them that) quite frustrating. The calling parameters are not clearly explained, so one spends quite a lot of time trying various syntax to get the desired results. Error descriptions don't help much either.
The only way I made my way through learning anything in Bash was to google and google and google. There was always someone who had publicly asked the question on a forum somewhere, and there always seems to be someone with an in-depth understanding of the mechanics of the language ready to answer. Some blogs had good examples too.
Googling worked quite well in the end, but it beats me how any Joe code monkey could have learned this stuff before it was available like this.
I've updated the first post with v16b. It has the same features as v16 but is compiled with an improved riscvp2 engine, and hence has better performance. I ran @rogloh 's script (I actually entered the program and ran it entirely on the P2, thanks to pye.py) and got the following results:
```
>>> run()
Testing 1 additions per loop over 10s
Count: 378045
Count: 379502
Count: 378045
Testing 2 additions per loop over 10s
Count: 286924
Count: 287338
Count: 288998
Testing 3 additions per loop over 10s
Count: 231199
Count: 231205
Count: 230136
```
This is a little faster than the p2gcc results above. No doubt we can tweak p2gcc to do even better, and then tweak riscvp2 again to do better still, and so on. Although perhaps it makes sense to wait for Parallax to decide on a tool strategy and settle on a compiler, and in the meantime work on adding features to the P2 version of micropython.
What additional P2 features would people like to see? The current VGA driver is text only, to save space (it does support 8 bit foreground and background colors for each character). An alternative might be to do a 320x240x8bpp screen buffer. The memory cost would be 76K, versus the current 16K, and obviously the text wouldn't be as nice (we could do 50x24 in a 6x10 font, instead of 100x40 with the current 8x15 font). For reference, my version of micropython currently has a 220K user heap. I think Roger's has 256K, but AFAIK does not have any VGA support yet and not as many modules, so I don't think p2gcc will save us anything on RAM (but trying the experiment would be interesting).
Anything else that would be really nice to have in Python for the P2?
Always good to see improvements, Eric. I have also had some success optimising the prologs/epilogs today after playing further, and this has gained us two good things: higher performance and a smaller executable. With the code now using SETQ block transfers instead of lots of rdlong/wrlongs, the executable size has dropped by about 20kB to about 199kB (excluding the heap). I've also enabled a few more things in the configuration, which added some modules to this build, but there are still no kbd/SD/VGA features like yours has, though they are certainly possible to integrate.
For this work I needed some Perl and sed scripts with some nasty long regular expressions matching over multiple lines. It turns out there were two variants of the prolog generated, and I was only matching one of them. Because of my reversed register save order with SETQ, mixing the old and new forms together messed things up in the code, so I reversed the register order in the VM to compensate and allow interoperability (which then broke the setjmp/longjmp order :frown: ). Once I figured that out and reversed that too, I was able to get the code running again.
Here are the faster results I am getting now, with most of the prologs/epilogs optimized for P2 (though not quite all yet, until I match that second prolog type):
```
>>> run()
Testing 1 additions per loop over 10s
Count: 393664
Count: 393670
Count: 393669
Testing 2 additions per loop over 10s
Count: 299834
Count: 299827
Count: 299826
Testing 3 additions per loop over 10s
Count: 239221
Count: 239216
Count: 239215
>>> help('modules')
__main__          frozentest        pyb               uio
array             gc                sys               ustruct
builtins          micropython       ucollections
Plus any modules on the filesystem
>>>
```
The regexp that matches the longest function prolog is nasty to type or read, and I have 7 different variants of these for the prologs and epilogs, covering the different numbers of registers being saved/restored on the stack. Each one just substitutes a single text string (in this case "_prologue_r8") into the code, which I could then confirm in the files and transform later into real optimized code. That marker matches this prolog:
```
        sub     sp, #4
        wrlong  r8, sp
        sub     sp, #4
        wrlong  r9, sp
        sub     sp, #4
        wrlong  r10, sp
        sub     sp, #4
        wrlong  r11, sp
        sub     sp, #4
        wrlong  r12, sp
        sub     sp, #4
        wrlong  r13, sp
        sub     sp, #4
        wrlong  r14, sp
        sub     sp, #4
        wrlong  lr, sp
```
I then run a sed script to transform this "_prologue_r8" (and all the other variants) into the type of code below, before the spin is translated to spin2 and assembled. It's a roundabout way to go, but it works; ideally it could be done as part of sp2pasm parsing/translation. The sample prolog code above gets converted to this:
```
'_prologue_r8 replacement
        sub     sp, #32
        mov     r15, lr
        setq    #7
        wrlong  r15, sp
```
What additional P2 features would people like to see? The current VGA driver is text only, to save space (it does support 8 bit foreground and background colors for each character) ... RAM = 16K ... 100x40 with the current 8x15 font.
Anything else that would be really nice to have in Python for the P2?
I can see good lab use for the P2-as-instrument, and there a larger font, or some font-zoom option, could be nice.
100x40 is good for text work, but not so good for 'across the room' reading.
I guess the simplest is a zoom-all that just drops the 100x40 and expands the clocks-per-pixel.
Zoom by character line would allow a by-lines mix of display heights, and it could use the original Y axis index (1..40) as starting point, to keep screen indexing portable.
That costs just 40 bytes, for a start-of-line tag, or it could use an existing terminal control code slot to trigger.
e.g. I find these SGR codes:
10: Primary (default) font
11-19: Alternative font - select alternative font n − 10
Next would be a selective zoom, or a modest number of larger chars (Hex, Float, units?), but maybe the zoom-all is good enough? SGR 11-19 could map to scale/zoom x2,x3,x4,x5..x10, applied to the first char in the line. Google also found this, which I had not seen before: https://en.wikipedia.org/wiki/Windows_Terminal
Further to this, I added another benchmark in Micropython that does a factorial computation, and found a fairly large performance difference between the two types of builds (though I am still testing v15, without Eric's latest changes). Running both at 160MHz, the RISC-V version was ~3.5x slower than native P2 in this case. Both tested builds have long int support (MICROPY_LONGINT_IMPL_MPZ), so that shouldn't have been a factor. It could be that the recursive nature of this code exercises the stack/heap a fair bit more and places far more demands on the JIT translation, due to actual function calls instead of a simple increment operation... or maybe something else?
```
import pyb

def perfTest():
    millis = pyb.millis
    endTime = millis() + 10000
    count = 0
    while millis() < endTime:
        count += 1
    print("Count: ", count)

def perfTest2():
    millis = pyb.millis
    endTime = millis() + 10000
    count = 0
    count2 = 0
    while millis() < endTime:
        count += 1
        count2 += 1
    print("Count: ", count)

def perfTest3():
    millis = pyb.millis
    endTime = millis() + 10000
    count = 0
    count2 = 0
    count3 = 0
    while millis() < endTime:
        count += 1
        count2 += 1
        count3 += 1
    print("Count: ", count)

def fact(n):
    if n == 0:
        return 1
    else:
        return n * fact(n-1)

def perfTest4():
    millis = pyb.millis
    endTime = millis() + 10000
    count = 0
    while millis() < endTime:
        count += 1
        fact(10)
    print("Count: ", count)

def test(iterations, testfn):
    for i in range(iterations):
        testfn()

def run():
    # print("Testing 1 additions per loop over 10s")
    # test(3,perfTest)
    # print("Testing 2 additions per loop over 10s")
    # test(3,perfTest2)
    # print("Testing 3 additions per loop over 10s")
    # test(3,perfTest3)
    print("Testing 10! calculations per loop over 10s")
    test(3,perfTest4)
```
Results:
P2 Native:
```
Testing 10! calculations per loop over 10s
Count: 15461
Count: 15461
Count: 15461
```
RISC-V (v15):
```
Testing 10! calculations per loop over 10s
Count: 4364
Count: 4362
Count: 4480
```
### Comments
Good to see numbers on this stuff. Here is another data point, from an Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz.
In the benchmarking code I imported "utime" instead of the "pyb" module and mapped the millis function to its utime.ticks_ms() function like this:
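(A minimal reconstruction of that mapping, assuming the same perfTest shape as the earlier script:)
```
# Reconstructed from the description above: utime.ticks_ms in place of pyb.millis
import utime

def perfTest():
    millis = utime.ticks_ms
    endTime = millis() + 10000
    count = 0
    while millis() < endTime:
        count += 1
    print("Count: ", count)
```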
I achieved the following output:
```
>>> run()
Testing 1 additions per loop over 10s
Count: 141431655
Count: 142062157
Count: 140679732
Testing 2 additions per loop over 10s
Count: 111738309
Count: 111696344
Count: 112710277
Testing 3 additions per loop over 10s
Count: 102735397
Count: 101499762
Count: 99670684
```
This Mac is running it fast. Almost 400x faster than a P2! Wow.