Inline assembly considered harmful
ersmith
Posts: 6,053
in Propeller 2
OK, probably a controversial title. The new inline assembly feature of Spin2 (and its corresponding feature in languages like C and BASIC) is certainly handy and lots of us are using it. I'd like to suggest though that it would be good for us to minimize use of it. Here are some reasons why:
(1) Assembly code is harder to read (and write) than high level language code. This especially affects readers: newcomers to the platform probably haven't wrapped their heads around assembly code yet, and may well be confused by inline assembly blocks. It's better to keep as much as possible in the high level language.
(2) It's often not necessary. The heavy lifting is generally done by smart pins, so timing of the Spin2 code is (usually) not too critical. Even in cases where the Spin2 code is doing its job, it's frequently better to use addct/waitct to ensure cycle accurate delays, rather than relying on counting cycles in the assembly code. Spin2 is much faster than Spin1, so it's often possible to do real work in Spin2.
(3) If you really need exact timing, it may be better to run the code in another COG (not inline on the same COG).
(4) Inline assembly is not portable. Not all compilers/languages support inline assembly, and those that do have quite different syntaxes for it. This will matter a lot when we want to port drivers, in particular to MicroPython.
(5) If you need higher speed from your code, consider using a high level language/compiler that can compile to native code. For Spin2, there's fastspin. See the next message for some examples of how tightly fastspin can optimize Spin2 code.
Yes, having said all this there will be times when inline assembly can do the job and nothing else can. My argument is that those times are much fewer than we may initially think, and it's good to get in the habit of sticking to portable Spin2 code as much as possible.
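A minimal sketch of point (2), pacing a clock pin from plain Spin2 with getct/waitct instead of counting instruction cycles. The method name, parameter names, and the idea of passing the half period in system clock ticks are assumptions made for illustration, not code from any existing object:

```spin2
PUB clock_pulses(clk, half_period, count) | t
  ' clk         : pin number to pulse (hypothetical parameter)
  ' half_period : high time and low time, in system clock ticks
  ' count       : number of full clock cycles to emit
  t := getct()                     ' reference point on the system counter
  repeat count
    pinh(clk)                      ' drive the clock pin high
    waitct(t += half_period)       ' wait for the next deadline, measured from the previous one
    pinl(clk)                      ' drive the clock pin low
    waitct(t += half_period)
```

Because each deadline is computed from the previous deadline rather than from "now", the edge spacing should not depend on how fast the surrounding code runs, whether interpreted or compiled. The half period does need to be longer than the loop overhead; if a deadline has already passed when waitct is reached, the wait lasts until the counter wraps around.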
Comments
EXAMPLE: bit-banged SPI
Some code to bit-bang an SPI interface (from the "simplespi" benchmark). First the plain Spin2:
Here's what gets generated (I've omitted the startup code):
Some things to note:
(1) Everything got inlined (the compiler will do this at -O2 with small functions or ones that are only called a few times).
(2) The IF statement with pinl/pinh got converted to drvc. We could also have used pinw, but that's a more complicated function; for just one bit it's usually best to use pinl/pinh, since that lets the compiler do more optimization.
(3) Notice how the test of the high bit (x & $8000_0000) can be combined with the left shift. Something similar can be done with tests of the low bit and right shifts.
(4) The whole loop got converted into a block loaded via FCACHE into LUT memory, so it can run at full speed.
(5) Instead of `mov local02, ##512` the compiler generated `decod local02, #9`. Little optimizations like this add up, and it's the kind of thing that is easily overlooked when writing by hand.
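For readers without the benchmark source to hand, here is a hypothetical loop written in the style that notes (2) and (3) describe. It is not the simplespi code itself; the pin assignments and the method name are assumptions made for illustration:

```spin2
CON
  MOSI = 0                         ' hypothetical data pin
  SCLK = 1                         ' hypothetical clock pin

PUB shift_out_msb(x)
  ' Send 32 bits of x, most significant bit first.
  repeat 32
    if x & $8000_0000              ' test the high bit...
      pinh(MOSI)
    else
      pinl(MOSI)
    x <<= 1                        ' ...then shift left, the pairing described in note (3)
    pinh(SCLK)                     ' pulse the clock
    pinl(SCLK)
```

Keeping the high-bit test right next to the shift, and using pinl/pinh rather than pinw for the single data bit, is the source-level shape those notes are talking about. What a particular compiler version actually emits for it is best checked in its own assembly output.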
==================================================================
EXAMPLE: timing input pulses
Spin2 code for timing the duration of an input pulse:
The Spin2 code compiled with fastspin -2 -O2:
The final assembly is pretty efficient; you won't really get much better writing by hand.
But often you need fine control over timing. Your resulting SPI clock, for example, may be too fast for an LCD. In PASM you just insert a NOP, but how do you do that in high level code, independent of the optimization level?
One problem with too much inline PASM is that all the PASM snippets of all the included objects must fit into the available cog memory that is reserved for inline PASM.
Andy
Impressive.
Certainly, users can tune their HLL code style to get better assembler, and I do that all the time on other MCUs.
There is always an Assembly-generated window open, showing how well, or poorly, the compiler did !!
However, if you need to use it then, go ahead, just be aware of the porting issues (if any).
Ariba, inline PASM doesn't have to go in COG memory. fastspin lets you do either.
It can also save cogs...
Tricky with FastSpin though, because loops run slow unless you figure out how to get it moved to LUT.
Personally, I really, really like inline assembly.
Eric changed things so that you can place functions in cog/lut memory, check the general docs.
He also has alternate specifications for inline asm that will load it into the fcache, also in general docs.
Frankly, I don't trust the available compilers for the Propeller, and certainly wouldn't use one for a client project. The free compiler writers have no financial motivation to support me or any other user of their tool, and they could just go away, leaving users high and dry. A lot of people loved using BST, but it now sits, unchanged in years, because Brad walked away from the project and elected not to open-source his code before leaving the community (which is certainly his right; while saddened by it, I don't criticize him for it).
The assembly code coming out of available compilers certainly fits that description. Most of us doing inline assembly, however, are adding comments to help ourselves and those reading the code. I have received a lot of feedback from readers of my N&V columns and users of my objects -- a few have stated that my code motivated them to learn more about assembly. That said, we all know that I don't write complicated assembly, but I do write clean, nicely-formatted code that works.
My thought about Propeller compilers, exactly. Spin is fast enough for most tasks and when I need speed I can turn to assembly. With the P2, I'm happy that I don't have to start and stop a cog to get that speed. Of course, using Spin I don't have to worry about using a compiler. Long term, I'm going to encourage my friends at Parallax to write a C interpreter for the P2 (a BASIC interpreter, too). No, neither would be as fast as native code compilers (which P2 users are free to enjoy), but it would give better code density and the syntax that the C and BASIC communities really like. Parallax changed from a PIC tool supplier to a microcontroller vendor with the BASIC Stamp 1. I think an updated version of PBASIC that takes advantage of the P2 would be a welcomed product by a lot of people. As much as I like Spin, I'd probably use P2BASIC from time-to-time as well.
I disagree. Cogs are a resource that should not be wasted and using inline assembly in the P2 is a big help in saving resources. Why should I run a cog driver for just a couple of WS2812 LEDs on a convention badge? I was forced to in the P1, but not with the P2. I believe in freedom of choice, so I have full ports of my jm_rgbx_pixels and jm_apa102c drivers for the P2, but I also have "EZ" versions (for the WS2812 and APA102c) that use inline assembly for those projects that don't need the features of the full-blown objects.
So, what? We have fixed target, and in my opinion, should be writing good code for what we have versus something that *may* run somewhere else. I don't have to think about my Spin code running on another micro, so the portability argument is moot. Now, if what is meant that working Spin2 with inline assembly won't properly compile with an outside tool, it is the responsibility of the toolmaker to fix that.
Why? So I can deal with compiler bugs and hope they get fixed before my project deadline? Why should I or any other competent programmer trust that the compiler provider -- especially one not charging for the tool -- is doing the right thing? We're all human.
I spoke with @cgracey about this when writing my article and I don't believe this is the case. The way he explained it is that the parameters, results, locals, and inline PASM2 get shuttled into the Spin cog when encountered; the code runs, and the variables are written back. IIRC, there is room for about 300 instructions, so the short segments that I and others are writing are not going to create a problem. I can have as many inline segments as I want, so long as none of them exceed ~300 instructions. None of mine do; they're generally fewer than 20-30 lines of assembly code.
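As a rough illustration of the mechanism described above, here is a hypothetical short inline segment. The method and parameter names are made up for this sketch; ORG/END is the Spin2 inline PASM2 syntax, and the parameters are visible to the assembly as ordinary registers while the segment runs:

```spin2
PUB pulse_pin(pin, ticks)
  ' pin   : pin number to pulse (hypothetical parameter)
  ' ticks : approximate pulse width in system clock ticks
  org
        drvh    pin                ' drive the pin high
        waitx   ticks              ' hold for roughly "ticks" clocks
        drvl    pin                ' drive the pin low
  end
```

A segment like this is only three instructions, far below the approximate 300-instruction limit mentioned above, and it needs no extra cog.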
I will determine my own needs, thank you, and will continue to include comments in my inline assembly segments. I'm looking forward to the day that compiler writers start commenting their assembly output so that others can learn from it, otherwise, making that output available seems pointless.
I'd love to know how this code helps a student learn P2 assembly.
If I desire precision for that routine, I think this way does the job well and actually helps those looking at my listings to learn from them.
I'm with you, Ray!
I think it's fair to say that I have written more information about programming Propeller chips than anyone else. I choose Spin, but have always shared with my readers that they have choices -- I have in fact promoted the non-Spin tools in my writing. You will never see me put in an article or book, "You should use Spin because..." Likewise, I think it's a disservice to the community to tell others that they shouldn't use inline assembly if that suits the needs of their projects.
Or, you can manually place a function in LUT or COG by declaring it like:
I don't think it is reasonable to offer a language facility but then claim it is somehow "wrong" or "harmful" to use it. If you really believe that (as I do about some language elements that I have been asked to add in the past) then don't offer it.
If you do offer it, then expect it to be used, and be prepared to support it. Sure, you can provide "advice" on the use of the facility, but you should also be prepared for that advice to be ignored. You can't anticipate all ends, and probably shouldn't try.
Also, Parallax has only recently provided HLL support for the P2. For the past two years they have relied on their customers to develop their own software tools. I would hope that Parallax adopts one of the existing C solutions, and promotes it as the Parallax C solution. This would help to ensure software support in the future.
Of course you do, and you do a good job of it. But my point is that assembly language is inherently harder to read than high level language. That's the reason high level languages were invented! Moreover, someone reading inline assembly has to understand both the native language it's embedded in and the assembly itself. So whatever you do it's harder to read than just writing in the high level language.
I wasn't thinking so much of portability between micros (although that is a concern too) as portability between languages/tools. For code you're writing for yourself maybe this doesn't matter. For code shared with the community, the simpler the code is the more likely other people can take advantage of it in other languages/tools.
In particular, MicroPython (which is going to be an officially supported P2 language) does not have inline assembly, so objects with a lot of inline assembly are going to be harder to port. Again, this may or may not matter to any particular programmer.
Of course you will, and for your own projects use whatever style/tools/conventions you like! My thread title was deliberately provocative, but it was intended in a tongue-in-cheek way. Obviously everyone should use whatever tools/style/convention best suits them. But for code shared with the community, it is good to keep in mind the broader needs and abilities of the community. That's not to try to dictate anything (heaven knows that all code that's shared is helpful, and your code in particular is wonderful Jon!).
My point is just that code that minimizes inline assembly is even more accessible and broadly useful than code that unnecessarily uses inline assembly.
Just to emphasize: I'm not trying to criticize anyone's code, or dictate to anyone about how they must write their code. I'm offering a suggestion here that I hope people will consider, but it's just my 2c.
I remember when you added this... But I've since forgotten exactly how it works...
So, is the default optimization -O1? And, does that automatically use FCACHE?
MicroPython *does* have inline assembly on other platforms, see
http://docs.micropython.org/en/v1.9.3/pyboard/pyboard/tutorial/assembler.html
There's even a book about it
https://www.amazon.com.au/MicroPython-Inline-Assembler-Example-Magda-ebook/dp/B07ZQLSJHF
The creator of MicroPython talks here about how to speed up code all within MP, ending with the inline assembly (around 500x faster than naive code)
As a halfway point I expect being able to manipulate memory or registers using machine.mem16 may be useful, and isn't too hard to read
eg (from MP docs) machine.mem16[stm.GPIOA + stm.GPIO_ODR] ^= BIT14
--
It's true we don't need to hurry into having inline assembler in P2 MicroPython, thanks to the .CPU proposal you made, which Rogloh also implemented in Native P2 MicroPython, so it really only becomes an issue once cogs are in short supply. But as soon as that happens you really want (and need) to be doing the inline assembly in the same cog.
Eric, I get where you're coming from with the FlexGUI (compiler) perspective, but for the official Spin2 (interpreter), having inline assembly that doesn't need to spawn another cog to do something efficient in the PASM2 domain makes a heap of sense.
Yes and yes. Since fastspin 4.3.0 you can turn individual optimizations on or off (so you could enable -O1 without FCACHE via -O1,!fcache).
fastspin/flexgui can handle inline assembly just fine. I'm just suggesting that people think twice before using it, because it will make porting their code to other languages (in particular MicroPython as it exists now) harder.
Similarly, although fastspin has some very nice (IMHO) extensions to Spin2 like default parameter values or being able to use Spin1 syntax, you should think twice before using those features and tying yourself to fastspin. That's one reason I added the -Werror option to warn about these kinds of things. They're not inherently wrong, and you should use them in your own code when appropriate, but in code that's shared with the community you may want to avoid non-portable constructs.
This thread could have easily been titled "Why Inline Assembly is Not Needed With Optimizing Compilers." It wasn't, though. He chose to be provocative (his words) by painting with a very broad, tyrannical brush.
Look, I understand that those writing compilers for the Propeller family put in an enormous effort, with no financial gain from it. But this is a choice they make -- nobody forces them to give away their work. Even if I don't use those tools, I have used my platform (e.g., writing for N&V) to promote those tools. I really like the Propeller as a device, and if people want to try it using something other than Spin, that's okay with me.
My suggestions to compiler writers:
1) Don't tell people what they should and shouldn't do; allow them their freedom to choose from what your tool will accommodate.
2) Make Spin compilers 100% compatible with Parallax tools, then add more features (BST did this).
3) If you're going to encourage people to look at the assembly output of your tool, make it easy on the eyes and useful for students.
Exactly.
I certainly did not read the entire thread and this is a bit off topic, but after reading this, I do have something to say. Please don't take this as an attack on you, just because I am quoting you and referring to what you said, because it is not an attack on you or your preferences.
Although not as long as you, I have been a Parallax customer and a forum member for many years. Unlike you, I prefer C/C++ over Spin code, but that is just my preference. Regardless of the language used for programming the Propeller, I personally do not believe that we have seen a well executed and complete Parallax supplied programming environment. In my opinion, the Propeller Tool and SimpleIDE both lacked completeness.
I agree with you that the free compiler writers have no financial motivation to provide support, and yes, I also agree they could also just disappear.
I believe that even though Parallax resources are limited, they need to come up with the funding to create a programming environment for both C/C++ and Spin, which is both complete and professional, as well as being supported by Parallax. I believe this is paramount to their success.
I don't think I ever claimed the assembly output was "readable", just that it was "efficient", and that's what I was focused on. If I was going for readability I would have shown the listing file output, which does at least have the original Spin code inserted as comments.
Also, regarding assembly output readability: the main annoyance with fastspin's ASM output is that even minor changes rename almost all the LR__XXX labels, which makes it hard to diff two files. Is there anything you could do to minimize the number of label names a single change can impact?
As for PASM2, I thought that Spin2 was supposed to have enough commands and power to alleviate the need for inline assembly. If not then maybe that power should be added. PASM2 will not go away, but maybe with a more powerful Spin2, you could eliminate the use of inline assembly, for us neophytes.
I remember SimpleIDE and jazzed; once he left, so did the Parallax support for that IDE. Just a reminder of how quickly things can change, even at Parallax.
Ray
That's exactly what ersmith has done?
The in-line-asm feature is there and it is very useful, but it is a 'use with care' and 'do you really need to use that' feature.
To those who tell me that Spin2 is fast enough (indeed, it is much faster than Spin1, even at the same clock speed) and that I shouldn't need inline assembly, I ask them to swallow their own medicine and not use a compiler. Oh, but they prefer to write in a different syntax. I get it, and I support that. I prefer to write the bulk of my code in Spin2, and when it strikes me or I believe it is absolutely necessary, I use some inline assembly. This is my preference. As others have pointed out, using inline assembly in the P2 is a great way to learn PASM2 in a controlled environment. I'm going to point this out in my seminar.
Except with Spin; they created it, they own it, and they support it. Like many, I hope that things work out in such a way that Parallax will add more official languages to the P2. The interpreter is no longer etched in silicon (it gets downloaded with your app), which means it can be updated, and it can be replaced with another language. I sincerely believe that a P2BASIC interpreter would make a lot of people very happy -- and it would be getting back to Parallax's roots.
Ray
I don't think you understand how the Spin2 tools work. The Spin2 program is compiled to bytecodes on the PC. The bytecodes and binary image of the interpreter cog code get downloaded into memory. On boot, the binary image of the interpreter is moved into a cog, started, and executes the compiled bytecodes.
This is not a simple exercise in Spin2/inline assembly.