NQPY -- not quite python -- a study — Parallax Forums


I started something here that has been on my mind for a long time.
A tool that translates python bytecode into assembler.
In this particular case, the target is p2asm.
It is a feasibility study and is far from mature.
It shows that the effort can be worth it.
I know that I am not popular here in the forum, but I would still be interested in your opinion.
Further information can be found in the readme file in the attachment.

Comments

  • @Reinhard
    As someone else who isn't popular on the forum, I'd say I hope that wouldn't discourage you from working on it or even sharing it. I don't use Python myself, but am nevertheless intrigued by the idea of compiling it to native code and will check it out.

    Cheers

  • Hey Guys!

    Can't imagine why you'd consider yourselves unpopular. I hope silly talk wasn't exchanged in the past.

    You both have a supporter in me, and I'm sure many many more people who've enjoyed and benefited from all your contributions over many years.
    Really appreciate and enjoy reading all that you share.

  • Can't imagine why you'd consider yourselves unpopular.

    I had the same thought. Both are members that offer up what they're working on which is a benefit to the Propeller community.

    I hope that wouldn't discourage you from working on it or even sharing it.

    +1

  • evanh Posts: 15,537
    edited 2023-08-09 14:53

    Reinhard and avsa242 both probably meant the popularity of their posted projects rather than of themselves.

  • Understood, but I'd hate for them to stop offering up what they do, even if others don't comment.

  • I can't speak for Reinhard but yes Evan; that, or being "well-known" in general was what I thought was meant. ;)
    👍 for the support, regardless!

    Cheers

  • As someone who I guess qualifies as mildly popular, I will say that getting lots of replies on your forum thread really just has a lot to do with you yourself posting lots of them (and therefore bringing the thread back to the top). The thing has to be interesting, too, I guess. But if you indeed made something, it is inherently interesting, at least to you, so that's a non-issue.


    Your generated assembly looks pretty poor. Flexspin has an extensive P1/P2 assembly optimizer module that could possibly be used to help. It doesn't really have an external interface, though, and the internal one is questionably documented. It basically amounts to creating an IRList with your instructions in it and calling OptimizeIRLocal on it, which then, through a byzantine internal process, eliminates unnecessary instructions in the list. It also needs a Function object (which, I just realized, needs to be passed both as a parameter and through the global curfunc - lol), mostly because optimization flags are per-function. It could possibly be cleaned up a bit if there's interest.

  • @Wuerfel_21 said:
    As someone who I guess qualifies as mildly popular, I will say that getting lots of replies on your forum thread really just has a lot to do with you yourself posting lots of them (and therefore bringing the thread back to the top). The thing has to be interesting, too, I guess. But if you indeed made something, it is inherently interesting, at least to you, so that's a non-issue.

    Quite true! I lurk more than post (same IRL except without the creepy connotation of the word 'lurk'). Anyways, don't want to derail Reinhard's thread.

  • Reinhard Posts: 489
    edited 2023-08-10 07:09

    Thanks for the comments.
    I can take criticism too, though of course I prefer positive comments. :)
    With no reaction at all, I feel like the 'elephant in the room that everyone is ignoring'.

    As mentioned, this is a first attempt to get a feeling for whether and how this is even possible.
    This results in inefficient instruction sequences like these:
    '-----------------------------------------------------------------------------------------------------------------------------
    ' STACK LEN = 4
    mov reg1 , delay
    mov reg8 , ##123456
    sub reg1 , reg8
    mov delay , reg1        ' <<<<<<
    '-----------------------------------------------------------------------------------------------------------------------------
    ' STACK LEN = 4
    mov reg1 , delay        ' <<<<<< oops
    mov reg9 , ##1000000
    cmp reg1 , reg9 wcz
    if_nc jmp #Label88
    '-----------------------------------------------------------------------------------------------------------------------------

    and the tool is very wasteful in assigning register/variable names.
    It's all a consequence of the approach I chose.
    Python's bytecode is stack-oriented, and that's the idea behind it:
    wait until Python's stack pointer points to STACKTOP again, then evaluate the instructions collected so far as one unit.

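    from dis import stack_effect

    # 'it' iterates over the dis instructions of the script; evaluate() emits the p2asm
    stack = []      # instructions collected since the stack was last empty
    stacktop = 0    # simulated depth of Python's value stack
    for temp in it:  # ends cleanly at StopIteration; real errors are no longer swallowed
        stack.append(temp)
        effect = stack_effect(temp.opcode, temp.arg)
        stacktop += effect
        print("'", temp, "   ", stacktop, "    ", effect, "    ", len(stack))

        # GET_ITER leaves the iterator on the stack for the whole loop,
        # so counting restarts to let the loop body return to zero on its own
        if temp.opname == 'GET_ITER':
            stacktop = 0

        # back at STACKTOP: this group is stack-neutral and can be translated
        if stacktop == 0:
            evaluate(stack)
            stack = []
            print("'-----------------------------------------------------------------------------------------------------------------------------")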
    

    Then I get this:
    '-----------------------------------------------------------------------------------------------------------------------------
    ' Instruction(opname='LOAD_NAME', opcode=101, arg=1, argval='delay', argrepr='delay', offset=68, starts_line=15, is_jump_target=False) 1 1 1
    ' Instruction(opname='LOAD_CONST', opcode=100, arg=8, argval=123456, argrepr='123456', offset=70, starts_line=None, is_jump_target=False) 2 1 2
    ' Instruction(opname='INPLACE_SUBTRACT', opcode=56, arg=None, argval=None, argrepr='', offset=72, starts_line=None, is_jump_target=False) 1 -1 3
    ' Instruction(opname='STORE_NAME', opcode=90, arg=1, argval='delay', argrepr='delay', offset=74, starts_line=None, is_jump_target=False) 0 -1 4
    ' STACK LEN = 4
    mov reg1 , delay
    mov reg8 , ##123456
    sub reg1 , reg8
    mov delay , reg1
    '-----------------------------------------------------------------------------------------------------------------------------
    ' Instruction(opname='LOAD_NAME', opcode=101, arg=1, argval='delay', argrepr='delay', offset=76, starts_line=16, is_jump_target=False) 1 1 1
    ' Instruction(opname='LOAD_CONST', opcode=100, arg=9, argval=1000000, argrepr='1000000', offset=78, starts_line=None, is_jump_target=False) 2 1 2
    ' Instruction(opname='COMPARE_OP', opcode=107, arg=1, argval='<=', argrepr='<=', offset=80, starts_line=None, is_jump_target=False) 1 -1 3
    ' Instruction(opname='POP_JUMP_IF_FALSE', opcode=114, arg=44, argval=88, argrepr='to 88', offset=82, starts_line=None, is_jump_target=False) 0 -1 4
    ' STACK LEN = 4
    mov reg1 , delay
    mov reg9 , ##1000000
    cmp reg1 , reg9 wcz
    if_nc jmp #Label88
    '-----------------------------------------------------------------------------------------------------------------------------

    Then I translate the current stack into registers.
    I get the register designations from the bytecode instructions' attributes
    arg and argval and, very importantly, the is_jump_target flag.
    That's the basic idea; maybe I can think of something better.
    The aim is also to include entire Spin objects; the Python import statement can be bent for this purpose.
    Classes and class methods are not currently supported.
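    One possible shape for the evaluate() step is sketched below, restricted to the Python 3.10 opcodes that appear in the listing. It is not the tool's actual code: the function name evaluate_group, the stand-in Ins tuple, and the one-register-per-stack-slot allocation (reg1, reg2, ...) are assumptions. Labels are emitted as LabelN from the instruction offset, matching the Label88 naming above. For comparisons it uses the signed cmps, since Python integers are signed, and picks the condition under which the Python test is false. If I read the P2 flag semantics correctly (after cmp/cmps with wcz, C = D < S and Z = D == S), the listing's if_nc for `<=` would also jump when delay equals 1000000, whereas the false branch needs if_nc_and_nz.

```python
from collections import namedtuple

# Stand-in for dis.Instruction with only the attributes this sketch reads (hypothetical).
Ins = namedtuple("Ins", "opname argval offset is_jump_target")

# Python 3.10 arithmetic opcodes -> P2 instruction (hypothetical subset).
ARITH = {
    "BINARY_ADD": "add", "INPLACE_ADD": "add",
    "BINARY_SUBTRACT": "sub", "INPLACE_SUBTRACT": "sub",
}

# After "cmps a , b wcz": C = (a < b) signed, Z = (a == b).
# Each entry is the condition under which the Python test is FALSE.
JUMP_IF_FALSE = {
    "<": "if_nc",          # a >= b
    "<=": "if_nc_and_nz",  # a > b
    "==": "if_nz",
    "!=": "if_z",
    ">": "if_c_or_z",      # a <= b
    ">=": "if_c",          # a < b
}


def evaluate_group(group):
    """Translate one stack-neutral group of instructions into p2asm lines."""
    out = []
    depth = 0           # simulated stack depth; slot n lives in register regN
    pending_cmp = None  # operator of a COMPARE_OP waiting for its jump
    for ins in group:
        if ins.is_jump_target:
            out.append(f"Label{ins.offset}")
        op = ins.opname
        if op in ("LOAD_NAME", "LOAD_FAST"):
            depth += 1
            out.append(f"mov reg{depth} , {ins.argval}")
        elif op == "LOAD_CONST":
            depth += 1
            out.append(f"mov reg{depth} , ##{ins.argval}")  # always ##, for simplicity
        elif op in ARITH:
            out.append(f"{ARITH[op]} reg{depth - 1} , reg{depth}")
            depth -= 1
        elif op in ("STORE_NAME", "STORE_FAST"):
            out.append(f"mov {ins.argval} , reg{depth}")
            depth -= 1
        elif op == "COMPARE_OP":
            out.append(f"cmps reg{depth - 1} , reg{depth} wcz")
            depth -= 1
            pending_cmp = ins.argval
        elif op == "POP_JUMP_IF_FALSE" and pending_cmp is not None:
            out.append(f"{JUMP_IF_FALSE[pending_cmp]} jmp #Label{ins.argval}")
            depth -= 1
            pending_cmp = None
        else:
            raise NotImplementedError(f"{op} is not handled by this sketch")
    return out


group = [
    Ins("LOAD_NAME", "delay", 76, False),
    Ins("LOAD_CONST", 1000000, 78, False),
    Ins("COMPARE_OP", "<=", 80, False),
    Ins("POP_JUMP_IF_FALSE", 88, 82, False),
]
for line in evaluate_group(group):
    print(line)
```

    For the second group of the listing this prints mov reg1 , delay, then mov reg2 , ##1000000, cmps reg1 , reg2 wcz and if_nc_and_nz jmp #Label88. Allocating registers by stack depth keeps the sketch short; the tool's own scheme (names derived from arg) would slot in at the same places.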

    A side note:
    I have just found out that
    the whole thing works with Python 3.10 and 3.5,
    but strangely not with Python 3.6.9.

  • o o o
    I have just compiled the newest Python version (3.11.0) and I see that the opcodes have changed.
    At first glance, the new bytecode seems even better suited to translation into native code.
    But then I would have to provide a separate tool version for each Python version.
    I still have to think about that.
    But I'll continue as a hobby, even if it's never going to be production-ready.
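    Instead of one tool version per Python release, one option is a thin normalization layer in front of the translator. A minimal sketch, assuming only arithmetic needs mapping: up to 3.10 each operator has its own opcode (BINARY_SUBTRACT, INPLACE_SUBTRACT, ...), while 3.11 folds them into a single BINARY_OP whose operator symbol dis reports in argrepr. The names normalize, operations, OLD_STYLE and NEW_STYLE and the table contents are illustrative; other changed opcodes (jumps, calls) would need similar entries.

```python
import dis

# Python <= 3.10: one opcode per operator (hypothetical subset of the table).
OLD_STYLE = {
    "BINARY_ADD": "add", "INPLACE_ADD": "add",
    "BINARY_SUBTRACT": "sub", "INPLACE_SUBTRACT": "sub",
    "BINARY_AND": "and", "INPLACE_AND": "and",
    "BINARY_OR": "or", "INPLACE_OR": "or",
}
# Python >= 3.11: a single BINARY_OP; dis puts the operator symbol in argrepr.
NEW_STYLE = {
    "+": "add", "+=": "add", "-": "sub", "-=": "sub",
    "&": "and", "&=": "and", "|": "or", "|=": "or",
}


def normalize(ins):
    """Return a version-independent operation name, or None for other opcodes."""
    if ins.opname == "BINARY_OP":
        return NEW_STYLE.get(ins.argrepr)
    return OLD_STYLE.get(ins.opname)


def operations(source):
    """Normalized arithmetic operations of a script, on whichever Python runs this."""
    code = compile(source, "<nqpy>", "exec")
    return [op for op in map(normalize, dis.get_instructions(code)) if op]


print(operations("delay -= 123456"))
```

    This prints ['sub'] on 3.10 as well as on 3.11 and later, so the back end only ever sees the normalized names.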

  • As my brothers used to say, "beggars can't be choosers", so just be grateful that someone took the time and effort to make something of use.

  • evanh Posts: 15,537
    edited 2023-08-10 10:45

    @Reinhard said:
    I started something here that has been on my mind for a long time.
    A tool that translates python bytecode into assembler.

    Oh, it's just sunk in. Man, why?! It sounds like you're giving yourself licks with a cat-o'-nine-tails.

    I remember a posting from someone else on the forum showing a Google effort at JIT-compiling JavaScript ... it came with all the typical compiler caveats about coding restrictions needed to make it actually perform at a speed similar to natively compiled code. Basically, if you wanted the script to run at compiled speed, you had to rewrite it as if it were another language.

    PS: I shouldn't be surprised. I'm regularly in awe of what gets done on the forums.

  • @Genetix said:
    As my brothers used to say, "beggars can't be choosers", so just be grateful that someone took the time and effort to make something of use.

    Guido van Rossum, the creator of Python, has become involved in its further development and sees making Python faster as a priority.
    That's a great decision.
    No question at all.
    Otherwise I don't understand your answer.

  • @evanh said:
    PS: I shouldn't be surprised. I'm regularly in awe of what gets done on the forums.

    Originally I wanted to extend MicroPython, but I did not succeed in replicating the port on the P2.
    A translator to native code seemed to be within my modest means.

    But if everything doesn't work out, I'm also good at cooking. ;)

  • I'll continue with the Python 3.10 version.
    Unnecessary register assignments have now been eliminated.
    Have a nice weekend.
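    The "oops" reload from the first listing (mov delay , reg1 immediately followed by mov reg1 , delay) can be removed with a simple peephole pass over the generated lines. A minimal sketch, assuming the text format of the listings above; it is not necessarily how the tool does it, and it is not Flexspin's optimizer. The pass only drops the second of two plain mov instructions that copy between the same two locations, skips over comment lines, and deliberately resets at anything else (a label, a conditional instruction, an instruction with wc/wz), because a label means another path may reach the second mov with different register contents. The function names are hypothetical.

```python
import re

# A plain, unconditional "mov dst , src" between two named locations (no wc/wz, no ##).
MOV = re.compile(r"^\s*mov\s+(\w+)\s*,\s*(\w+)\s*$")


def is_comment(line):
    """Blank lines and p2asm comments (starting with ') carry no code."""
    stripped = line.strip()
    return not stripped or stripped.startswith("'")


def drop_redundant_movs(lines):
    """Drop 'mov B , A' when the previous instruction was 'mov A , B'.

    After the first mov both locations hold the same value, so copying it back
    changes nothing. Comments may sit in between; anything else (labels included)
    resets the state.
    """
    out = []
    last_copy = None  # (dst, src) of the previous instruction if it was a plain mov
    for line in lines:
        if is_comment(line):
            out.append(line)
            continue
        m = MOV.match(line)
        if m and last_copy == (m.group(2), m.group(1)):
            continue  # redundant copy-back: skip it, keep last_copy as is
        out.append(line)
        last_copy = (m.group(1), m.group(2)) if m else None
    return out


listing = [
    "mov reg1 , delay",
    "mov reg8 , ##123456",
    "sub reg1 , reg8",
    "mov delay , reg1",
    "' STACK LEN = 4",
    "mov reg1 , delay",
    "mov reg9 , ##1000000",
]
print("\n".join(drop_redundant_movs(listing)))
```

    On this fragment only the second mov reg1 , delay disappears; if a label such as Label76 stood between the store and the reload, both would be kept.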

  • @Reinhard said:
    I started something here that has been on my mind for a long time.
    A tool that translates python bytecode into assembler.
    In this particular case, the target is p2asm.
    It is a feasibility study and is far from mature.
    It shows that the effort can be worth it.
    I know that I am not popular here in the forum, but I would still be interested in your opinion.
    Further information can be found in the readme file in the attachment.

    Hi Reinhard,
    some comment from my side.
    If you want more feedback, I would recommend making it as easy as possible for readers. For example, I think that many people like me read the forum on a mobile phone. This means that we cannot open the ZIP file, or at least not easily. So I thought I would have a look at it on the PC, but it turns out that you have to run Python scripts to see what it is all about. At the moment I am not willing to spend that much time just to see anything.
    So I would recommend showing the relevant things directly in the post.

    A second aspect is that you can get very good answers in this nice forum if you ask a very clear question.

    But I think there can be a different motivation for posting: at least for me, there are lots of inspiring things here. I just read something and think that I can combine some idea from it with other ideas and use them. So I think someone might stumble over something perhaps a year later and get inspired by it, even if my project is not so well done.


    I am not really sure if you want to read my opinion.
    I have used MicroPython for projects with the ESP32 and the Raspberry Pi Pico. I was able to complete two projects in really, really breathtaking time. This was possible because everything I needed was already there in the compiled libraries. A third project failed badly because I had to do things in MicroPython itself, which is very, very slow.
    (I also use Python on the PC but only if I can use the mighty and fast libraries pyplot and numpy.)

    So I learned that (Micro)Python can be very convenient, but only if:

    • All libraries needed for the project are on board as compiled code. All code that you write in Python itself is very much slower. In practice you need a huge amount of memory to keep all these libraries on board, most of which a given project does not need. I think this is a big problem for the P2. (The Pico can execute from external flash memory.)
    • Of course, the whole point of (Micro)Python is that the whole system is available as an interpreter. If you must do the compiling on the PC, the benefit of the language vanishes. So your system would need an assembler on board.
    • The substantial benefit of the P2 is that it has 8 cores, so a language for it must be designed to run on those 8 cores in parallel. When I used MicroPython on the Pico, it did not work properly and crashed when using both cores. The documentation still says today: "This module implements multithreading support. This module is highly experimental...." In my opinion, as long as this does not work stably, MicroPython on the P2 makes no sense at all.

    In my opinion, one of the main differences between Python and other languages is dynamic typing. At the same time, this is one of the main reasons why Python is so slow. So if the Python developers want to speed up the language significantly, they must give up one of its core features.
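    The dynamic-typing point can be made concrete for a translator like this one. If every Python value is mapped onto one 32-bit P2 register (an assumption about NQPY, not something stated in the thread), the translator has in effect given up dynamic typing and must reject other values before emitting code. A minimal sketch of such a check on a script's constants, using only the standard dis module; the function name and the signed 32-bit range are assumptions.

```python
import dis

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def untranslatable_constants(source):
    """List constants that cannot live in a signed 32-bit P2 register."""
    problems = []
    for ins in dis.get_instructions(compile(source, "<nqpy>", "exec")):
        # Newer CPython versions may load small ints with another opcode; those always fit.
        if ins.opname != "LOAD_CONST" or ins.argval is None:
            continue  # skip non-constants and the implicit 'return None'
        value = ins.argval
        if not isinstance(value, int):
            problems.append(f"offset {ins.offset}: {value!r} is not an int")
        elif not INT32_MIN <= value <= INT32_MAX:
            problems.append(f"offset {ins.offset}: {value} does not fit in 32 bits")
    return problems


print(untranslatable_constants("delay = 123456\nratio = 1.5"))
```

    A full translator would need the same kind of check for every operation, not just constants, which is exactly the restriction that dynamic typing otherwise avoids.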

    I have now played a lot with Taqoz and its speed on the P2. Random hub RAM access is very much slower than COG or LUT access. Only if you have a longer linear sequence of assembler instructions in hub RAM can the micro-cache do its job. If you have short snippets of code for each bytecode, then it is faster to have a bytecode interpreter executing in the cog. As the bytecode is much more compact than 32-bit assembler code, you can squeeze much more into the 512 KB of hub RAM. So in the special case of the P2, I think it is "not so bad" to have a bytecode (or wordcode) interpreter instead of a compiler, and it is probably much simpler to do. A compiler will only be better if it does optimization and caching.

    I think Peter has taken many months to find a very good compromise between speed and compactness for Taqoz on the P2. The idea of using 16-bit wordcodes that are directly the addresses of code is magnificent. Having the addresses directly makes the wordcode interpreter fast. Using words instead of longs makes the code compact. Having the core codes execute from COG RAM makes them fast. Keeping the top of the stack in registers and the rest in LUT makes the stack fast; the same goes for the loop stack. Perhaps you want to write a translator from Python bytecodes to wordcodes and then use a wordcode interpreter similar to Taqoz's?
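    To make the wordcode idea concrete, here is a toy model in Python. It is not Taqoz's implementation (there the words are addresses of PASM routines); the handler names, the inline two-word literal encoding and the variable-slot numbering are all hypothetical. It shows the properties described above: the word selects its routine without any decoding step, the top of stack is cached in its own variable (standing in for a register) with the rest in a separate list (standing in for LUT), and the program is measured in 2-byte words rather than 4-byte P2 instructions.

```python
# Toy model of a wordcode interpreter: each 16-bit word is the "address"
# (here: an index into HANDLERS) of the routine that executes it.
MASK = 0xFFFFFFFF  # P2 registers are 32 bits wide


def h_fetch(words, ip, tos, stack, mem):
    """Push variable mem[words[ip]]: spill the old top of stack, load the new one."""
    stack.append(tos)
    return mem[words[ip]], ip + 1


def h_lit(words, ip, tos, stack, mem):
    """Push a 32-bit literal stored inline as two words (low, high)."""
    stack.append(tos)
    return words[ip] | (words[ip + 1] << 16), ip + 2


def h_sub(words, ip, tos, stack, mem):
    """Replace the top two entries with (second - top), wrapped to 32 bits."""
    return (stack.pop() - tos) & MASK, ip


def h_store(words, ip, tos, stack, mem):
    """Pop the top of stack into variable mem[words[ip]]."""
    mem[words[ip]] = tos
    return stack.pop(), ip + 1


HANDLERS = [h_fetch, h_lit, h_sub, h_store]
FETCH, LIT, SUB, STORE = range(len(HANDLERS))


def run(words, mem):
    """Execute a wordcode list; tos models a cog register, stack models the LUT."""
    tos, stack, ip = 0, [], 0
    while ip < len(words):
        handler = HANDLERS[words[ip]]  # no decoding: the word selects the routine
        tos, ip = handler(words, ip + 1, tos, stack, mem)
    return mem


# delay -= 123456, with delay in variable slot 0
words = [FETCH, 0, LIT, 123456 & 0xFFFF, 123456 >> 16, SUB, STORE, 0]
print(run(words, [200000]), 2 * len(words), "bytes of wordcode")
```

    This prints [76544] 16 bytes of wordcode. A translator from Python bytecode would emit such word lists per stack-neutral group instead of p2asm lines.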

    Sorry that some of my opinions are not so positive. I hope the rest might give you some positive inspiration...
    Christof

  • @"Christof Eb."
    Thanks for the detailed comments.
    It is not 100% Python that is presented here. In contrast to MicroPython, no huge libraries are loaded.
    The Python bytecode is translated directly into assembler.
    Once there is an import statement, it should pull in Spin objects.
    But that will come step by step.

  • Genetix Posts: 1,751
    edited 2023-08-11 23:37

    Reinhard,

    I meant that people on the forum should be grateful for the work of other forum members.
