NQPY -- not quite python -- a study
Reinhard
in Propeller 2
I started something here that has been on my mind for a long time:
a tool that translates Python bytecode into assembler,
in this special case into p2asm.
It is a feasibility study and is far from mature,
but it shows that the effort can be worth it.
I know that I am not popular here on the forum, but I would still be interested in your opinion.
Further information can be found in the readme file in the attachment.
Comments
@Reinhard
As someone else who isn't popular on the forum, I'd say I hope that wouldn't discourage you from working on it or even sharing it. I don't use Python myself, but am nevertheless intrigued by the idea of compiling it to native code and will check it out.
Cheers
Hey Guys!
Can't imagine why you'd consider yourselves unpopular. I hope silly talk wasn't exchanged in the past.
You both have a supporter in me, and I'm sure many many more people who've enjoyed and benefited from all your contributions over many years.
Really appreciate and enjoy reading all that you share.
I had the same thought. Both are members who offer up what they're working on, which is a benefit to the Propeller community.
+1
Reinhard and avsa242 both probably meant the popularity of their posted projects rather than of their personalities.
Understood, but I'd hate for them to stop offering up what they do, even if others don't comment.
I can't speak for Reinhard but yes Evan; that, or being "well-known" in general was what I thought was meant.
👍 for the support, regardless!
Cheers
As someone who I guess qualifies as mildly popular, I will say that getting lots of replies on your forum thread really just has a lot to do with you yourself posting lots of them (and therefore bringing the thread back to the top). The thing has to be interesting, too, I guess. But if you indeed made something, it is inherently interesting, at least to you, so that's a non-issue.
Your generated assembly looks pretty poor. Flexspin has an extensive P1/P2 assembly optimizer module that could possibly be used to help. It doesn't really have an external interface though, and the internal one is questionably documented. It basically amounts to creating an IRList with your instructions in it and calling OptimizeIRLocal on it, which will then, through a byzantine internal process, eliminate unnecessary instructions in the list. It also needs a Function object (which I just realized needs to be passed both as a parameter and through globalcurfunc - lol), mostly because optimization flags are per-function. Possibly it can be cleaned up a bit if there's interest.

Quite true! I lurk more than post (same IRL, except without the creepy connotation of the word 'lurk'). Anyway, I don't want to derail Reinhard's thread.
Thanks for the comments.
I can also take criticism, though of course I prefer positive comments.
With no reaction at all, I feel like the 'elephant in the room that everyone is ignoring'.
As mentioned, this is a first attempt to get a feeling of whether and how this is even possible.
This results in inefficient instruction sequences like these:
'-----------------------------------------------------------------------------------------------------------------------------
' STACK LEN = 4
mov reg1 , delay
mov reg8 , ##123456
sub reg1 , reg8
mov delay , reg1 <<<<<<
'-----------------------------------------------------------------------------------------------------------------------------
' STACK LEN = 4
mov reg1 , delay <<<<<< oops
mov reg9 , ##1000000
cmp reg1 , reg9 wcz
if_nc jmp #Label88
'-----------------------------------------------------------------------------------------------------------------------------
The tool is also very wasteful in assigning register/variable names.
This is all a consequence of the solution approach.
Python bytecode is stack oriented, and the idea is this:
wait until Python's stack pointer points to STACKTOP again, and then evaluate that frame.
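To make that concrete, here is only a sketch using the standard dis module, not the tool itself: it collects instructions into one frame until the simulated stack depth is back at zero.

import dis

def split_into_frames(code):
    # Sketch only: group instructions until the simulated stack depth
    # returns to zero, i.e. the stack pointer is back at STACKTOP.
    frames, frame, depth = [], [], 0
    for ins in dis.get_instructions(code):
        frame.append(ins)
        arg = ins.arg if ins.opcode >= dis.HAVE_ARGUMENT else None
        depth += dis.stack_effect(ins.opcode, arg)
        if depth == 0:
            frames.append(frame)
            frame = []
    return frames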
Then I get this:
'-----------------------------------------------------------------------------------------------------------------------------
' Instruction(opname='LOAD_NAME', opcode=101, arg=1, argval='delay', argrepr='delay', offset=68, starts_line=15, is_jump_target=False) 1 1 1
' Instruction(opname='LOAD_CONST', opcode=100, arg=8, argval=123456, argrepr='123456', offset=70, starts_line=None, is_jump_target=False) 2 1 2
' Instruction(opname='INPLACE_SUBTRACT', opcode=56, arg=None, argval=None, argrepr='', offset=72, starts_line=None, is_jump_target=False) 1 -1 3
' Instruction(opname='STORE_NAME', opcode=90, arg=1, argval='delay', argrepr='delay', offset=74, starts_line=None, is_jump_target=False) 0 -1 4
' STACK LEN = 4
mov reg1 , delay
mov reg8 , ##123456
sub reg1 , reg8
mov delay , reg1
'-----------------------------------------------------------------------------------------------------------------------------
' Instruction(opname='LOAD_NAME', opcode=101, arg=1, argval='delay', argrepr='delay', offset=76, starts_line=16, is_jump_target=False) 1 1 1
' Instruction(opname='LOAD_CONST', opcode=100, arg=9, argval=1000000, argrepr='1000000', offset=78, starts_line=None, is_jump_target=False) 2 1 2
' Instruction(opname='COMPARE_OP', opcode=107, arg=1, argval='<=', argrepr='<=', offset=80, starts_line=None, is_jump_target=False) 1 -1 3
' Instruction(opname='POP_JUMP_IF_FALSE', opcode=114, arg=44, argval=88, argrepr='to 88', offset=82, starts_line=None, is_jump_target=False) 0 -1 4
' STACK LEN = 4
mov reg1 , delay
mov reg9 , ##1000000
cmp reg1 , reg9 wcz
if_nc jmp #Label88
'-----------------------------------------------------------------------------------------------------------------------------
Then I translate the current stack into registers.
I get the register designations from the bytecode instructions
with the attributes arg, argval and, very importantly, the flag is_jump_target.
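For illustration only (this is not the actual tool code, and it handles only the few opcodes from the listing above), such a mapping could look roughly like this; it consumes the instruction objects from dis, and the register naming just mimics the listings:

def frame_to_p2asm(frame):
    out, stack = [], []                       # stack holds the scratch registers in use
    for ins in frame:
        if ins.is_jump_target:                # jump targets become labels
            out.append(f"Label{ins.offset}")
        if ins.opname == "LOAD_NAME":
            reg = f"reg{len(stack) + 1}"
            out.append(f"mov {reg} , {ins.argval}")
            stack.append(reg)
        elif ins.opname == "LOAD_CONST":
            reg = f"reg{ins.arg}"             # constants keep their const-table index
            out.append(f"mov {reg} , ##{ins.argval}")
            stack.append(reg)
        elif ins.opname == "INPLACE_SUBTRACT":
            b = stack.pop()
            out.append(f"sub {stack[-1]} , {b}")
        elif ins.opname == "STORE_NAME":
            out.append(f"mov {ins.argval} , {stack.pop()}")
    return out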
That's the basic idea, maybe I can think of something better.
The aim is also to include entire Spin objects; the Python import statement can be bent to this purpose.
Classes and class methods are not currently supported.
A comment:
I have just found out that the whole thing works with Python 3.10 and 3.5,
but strangely not with Python 3.6.9.
o o o
I have just compiled the newest Python version (3.11.0) and I see the opcodes have changed.
At first glance it seems they would be even more efficient to translate into native code.
But then I would have to provide a separate tool version for each Python version.
I still have to think about that.
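One idea, sketched here only with standard library calls and not taken from the tool: dispatch on the opname strings instead of the numeric opcodes. That survives the renumbering between 3.10 and 3.11, although it does not help where 3.11 merges opcodes (for example the INPLACE_* operations becoming BINARY_OP).

import sys

HANDLERS = {
    # keyed on the name, so it does not matter that 3.10 and 3.11
    # assign different numbers to the same operation
    "LOAD_CONST": lambda ins: f"mov reg{ins.arg} , ##{ins.argval}",
    "STORE_NAME": lambda ins: f"mov {ins.argval} , reg1",
}

def translate(ins):
    if ins.opname not in HANDLERS:
        raise NotImplementedError(f"{ins.opname} (Python {sys.version.split()[0]})")
    return HANDLERS[ins.opname](ins)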
But I'll continue with it as a hobby, even if it's never going to be ready for production use.
As my brothers used to say, "beggars can't be choosers", so just be grateful that someone took the time and effort to make something of use.
Oh, it's just sunk in. Man, why?! It sounds like you're giving yourself licks with a cat-o'-nine-tails.
I remember a posting from someone else on the forum showing a Google effort at JIT compiling Javascript ... it came with all the typical compiler caveats on coding restrictions needed to make it actually perform at a speed similar to natively compiled code. Basically, if you wanted the script to run at compiled speed, you had to rewrite it as if it were another language.
PS: I shouldn't be surprised. I'm regularly in awe of what gets done on the forums.
Guido van Rossum, the Python inventor, has gotten involved in further development and sees making Python faster as a priority.
That's a good, great decision.
No question at all.
Otherwise I don't understand your answer.
@evanh said:
PS: I shouldn't be surprised. I'm regularly in awe of what gets done on the forums.
Actually I wanted to extend MicroPython, but I did not succeed in reproducing the port on the P2.
With the translator to native code, I saw something within my modest means.
And if it all doesn't work out, I'm also good at cooking.
I'll continue with the Python 3.10 version.
Unnecessary register assignments are eliminated.
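For example, a tiny peephole pass along these lines could fold the redundant moves shown earlier; this is only a sketch assuming lines formatted like the listings above, not the tool's actual code:

import re

def fold_frame(lines):
    # If a frame starts by loading a variable into a scratch register and
    # ends by storing that same register back, work on the variable directly.
    if len(lines) < 2:
        return lines
    first, last = lines[0].split(), lines[-1].split()
    # pattern:  mov regN , var   ...   mov var , regN
    if (len(first) == len(last) == 4
            and first[0] == last[0] == "mov"
            and first[1] == last[3] and first[3] == last[1]):
        reg, var = first[1], first[3]
        return [re.sub(rf"\b{reg}\b", var, l) for l in lines[1:-1]]
    return lines

fold_frame(["mov reg1 , delay",
            "mov reg8 , ##123456",
            "sub reg1 , reg8",
            "mov delay , reg1"])
# -> ["mov reg8 , ##123456", "sub delay , reg8"]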
Have a nice weekend.
Hi Reinhard,
some comments from my side.
If you want more feedback, I would recommend making it as easy for the readers as possible. For example, I think that many people, like me, read the forum on a mobile phone. That means we cannot open the ZIP, or at least not easily. So I thought I would have a look at it on the PC, but it turns out you have to run Python scripts to see what it is all about. At the moment I am not willing to spend too much time just to see anything at all.
So I would recommend making the relevant things directly visible.
A second aspect is that you can get answers very well in this nice forum if you ask a very clear question.
But I think there can be a different motivation for posting: at least for me there are lots of inspiring things here. I just read something and think that I can combine some idea from it with other ideas and use them. So someone might stumble over something perhaps a year later and get inspired then, even if my project is not so well done.
I am not really sure if you want to read my opinion.
I have used MicroPython for projects with the ESP32 and with the Raspi Pico. I was able to complete two projects in really, really breathtaking time. This was possible because everything I needed was just there in the compiled libraries. A third project failed badly because I had to do things with MicroPython itself, which is so very, very slow.
(I also use Python on the PC but only if I can use the mighty and fast libraries pyplot and numpy.)
So I learned that (Micro)Python can be very convenient, but only if everything you need is already there in the fast, compiled libraries.
In my opinion, one of the main differences between Python and other languages is the dynamic typing. At the same time, this is a main reason that Python is so very slow. So if the Python people want to speed up the language significantly, they must give up one of its core features.
I have played a lot now with Taqoz and its speed on the P2. Random HUB RAM access is very much slower than COG or LUT access. Only if you have some longer linear sequence of assembler code in HUB RAM can the micro-cache do its job. If you have short snippets of code for each bytecode, then it is faster to have a bytecode interpreter executing in the cog. As the bytecode is much more compact than the 32-bit assembler code, you can squeeze much more into the 512k. So in the special case of the P2, I think it is "not so bad" to have a bytecode (or wordcode) interpreter instead of a compiler. And it is probably very much simpler to do. The compiler will only be better if it can do the optimization and if it does caching.
I think that Peter has taken many months to find a very good compromise for Taqoz on the P2 between speed and compactness. The idea of using 16-bit wordcodes, which are directly the addresses of code, is magnificent. Having the addresses directly makes the wordcode interpreter fast. Using words instead of longs makes the code compact. Having the core codes execute from COG RAM makes them fast. Having the top of the stack in registers and the rest in LUT makes the stack fast. The same goes for the loop stack. Perhaps you want to write a translator from the Python bytecodes to wordcodes and then use a similar wordcode interpreter like Taqoz?
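As a rough conceptual model in Python only (Taqoz itself is hand-written P2 assembly): each 16-bit word in the code stream is directly the address of the routine that implements it, so dispatch is just "fetch a word, jump to it". Here a dict has to stand in for the address space.

def run_wordcode(words, routines, stack):
    # routines: address -> function; in Taqoz the word *is* the address,
    # so there is no table lookup at all, which keeps the inner loop fast
    for w in words:
        routines[w](stack)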
Sorry that some of my opinions are not so positive. I hope that the rest might give some positive inspiration...
Christof
@"Christof Eb."
Thanks for the detailed comments.
What is presented here is not 100% Python. In contrast to MicroPython, no huge libraries are loaded.
The Python bytecode is translated directly into assembler.
Once there is an import statement, Spin objects should be included that way.
But that will become clear step by step.
Reinhard,
I meant that people on the forum should be grateful for the work of other forum members.