P2asm assembler features?
TMM
Posts: 21
Your favorite? P2 noob is back with more inane questions!
I've been trying to find some information on what assembler features are supported in the assemblers that are out there. I've skimmed the following documents to try and find out:
- Spin2 doc v52
- Silicon doc v35
- PASM2 manual from 2022
I have not been able to find out things like:
- Do we have local labels?
- Do we have macros?
And probably more stuff that I've missed.
If there's some documentation I've missed then it'd be super helpful if someone could point me at it. I'm currently using flexspin as my assembler, but from what I've gathered the assemblers available for the P2 all seem to have the same feature set pretty much, and are largely compatible with each other.
Thanks a lot!

Comments
Labels is it. There are two levels: labels with a leading dot are local to the stretch of code between two undotted labels.
Or at least that's true for all the Pasm2 assemblers that I've used. Pasm1 used a leading colon instead.
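To make the two-level scoping concrete, here is a toy Python model (not how any real assembler is implemented). It treats each dotted label as belonging to the most recent undotted label above it. The function name `qualify_labels`, the "label starts in column 0" rule and the `global.local` naming are assumptions made for this illustration only.

```python
# Toy model of two-level label scoping: a dotted label is qualified by the
# most recent undotted (global) label above it. Assumes, for simplicity,
# that labels start in column 0 and instructions are indented.

def qualify_labels(lines):
    """Return a dict mapping qualified label names to line indices."""
    table = {}
    current_global = None
    for index, line in enumerate(lines):
        if not line or line[0].isspace():
            continue  # nothing in column 0, so no label on this line
        label = line.split()[0]
        if label.startswith("."):
            if current_global is None:
                raise ValueError(f"local label {label} before any global label")
            table[current_global + label] = index
        else:
            current_global = label
            table[label] = index
    return table


source = [
    "blink       mov    count, #10",
    ".loop       drvnot #56",
    "            djnz   count, #.loop",
    "delay       mov    count, #20",
    ".loop       waitx  #100",
    "            djnz   count, #.loop",
]

labels = qualify_labels(source)
```

Both `.loop` labels coexist because each one is only visible between its own pair of undotted labels, which is why short names like `.loop` can be reused freely.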
TMM,
I haven't looked at it myself but maybe working from a RISC-V emulator would be of interest for your project? Eric is pretty good at undertaking useful feature requests too - https://forums.parallax.com/discussion/170295/riscvp2-a-c-and-c-compiler-for-p2/p1
The first observation to make here is that the mainstream P2 assemblers (PNut, Spin Tools and flexspin) are really Spin2 compilers. Spin2 supports both inline ASM and DAT block ASM directly in its syntax (unlike C et al where it's very poorly integrated). "pure ASM" mode where no Spin runtime or compiled HLL code is emitted into the binary is essentially a special-case subset of the regular Spin2 syntax where only CON and DAT blocks are used.
Thus they all have the same basic syntax and features:
Flexspin's extensions:
- A preprocessor (#include, #ifdef and such) - doesn't really support full C macro expansion
- Access to local labels from outside their scope (foo:bar to access .bar defined under global label foo)

see also: https://github.com/totalspectrum/spin2cpp/blob/master/doc/spin.md
Spin Tools also has some equivalent to most of these extensions, but they're not quite compatible.
One thing to note here is that the Spin2 syntax descends from Spin1 and the P1. On the P1, PASM is used in a more microcode-like fashion. The P1 does not have hubexec, the 496 longs of code loaded into cog RAM is all there is, and you'd use that to build some sort of virtual system component. On P1, the Spin bytecode interpreter is ROM-resident and essentially is the actual CPU you program for and user PASM is essentially used to implement high-speed virtual peripherals. This paradigm of course never leads to very large PASM programs in a single file, so the language never grew facilities to handle such. (of course we are all very clever folks and figured ways to subvert this system, though in the end the Spin interpreter always wins on code density, which is possibly the most important thing on a 32K RAM machine).
The P2 is deliberately designed such that its PASM can be treated as a more normal™ CPU instruction set (execute from main memory, relative branches, stack-based CALL/RET, etc), but the tooling all descends from the P1 tooling and is therefore ill-equipped for larger programs. E.g. without the namespace extension or careful use of naming conventions, you can very easily reference a cog/lut label that doesn't exist in the cog running the code and cause horrible bugs. (I essentially suggested the namespace extension after writing a lot of code with just prefix naming to stop this problem. See for example my SNES emulator - this one has the misfortune of having ppr_, ppc_ and ppm_ prefixes to keep 3 cogs processing graphics from having label name collisions - this doesn't really help code readability)
Thanks a lot for the background information! I guess it's nice that I wasn't just missing something this time
Would anyone here be at all interested in a sort of NASM but for P2? PWASM?
(Propeller Wide ASseMbler?)
Or does basically everyone use SPIN2 together with assembly? Am I just the only one that wants to do this?
I had found this! I have been considering it as a back-end but I feel that I should probably learn how to do it myself first, even if Eric's turns out to be way better. I also kind of want to expose the Propeller itself to user-space. The idea is more or less to make it a sort of "retro tech art piece" in the sense that I want to make something that you couldn't just run on a Raspberry Pi or something.
Yes, basically everyone on here. It's actually very nice I think, to have it all integrated, it just has the mentioned weaknesses when you try to make a huge PASM-only program, which I'd think is a niche usecase to begin with. (even the aforementioned game emulator programs end up using Spin and flexspin's C library to handle filesystem operations and user interface, though in a slightly non-standard way where the HLL code lives in upper memory). I haven't really had time to put the new namespace extension to good use yet, I think that'll largely solve the biggest issue.
That said, there's plenty of space for and fun in making your own tool. But you'd need some experience for that first.
Just want to throw a plug in here for Dave Hein's amazing p2asm program. A very pure and simple PASM assembler. No Spin! Dave died a few years ago, but his work lives on as part of Catalina.
Dave's original version (see https://github.com/davehein/p2gcc/tree/master/p2asm_src) is no longer maintained, but Catalina's version is!
You can find Catalina's version in Catalina's source/p2asm_src directory, and the binary in Catalina's bin directory - it is completely self-contained and has no dependencies on either p2gcc or Catalina^. There is also a super trivial example of a PASM program in demos/p2/flash.pasm.
I really should create a fork of Dave's github and put Catalina's version there. I'll add this to my TO DO list!
Ross.
^ Thought I'd better add a note for anyone exploring Catalina's bin directory looking for the p2asm executable, because as well as p2asm you will see p2_asm scripts (for both Linux and Windows) - these call p2asm, but first they use Catalina's pre-processor on the source to add C-like preprocessor capabilities - i.e. #include, #if, #define etc. But this is optional, and p2asm can be used by itself and is not dependent on any Catalina components.
Well, perhaps for the sake of completeness:
The very nice Forth System Taqoz comes with an assembler too. And of course there are macros and you can define your own.
" I also kind of want to expose the Propeller itself to user-space. The idea is more or less to make it a sort of "retro tech art piece" in the sense that I want to make something that you couldn't just run on an Raspberry Pi or something"
In my experience Forth is a good way and tool to achieve this. In comparison to Python it trades convenience for speed and compactness. For Forth, 512kB of RAM is really big. And you scarcely need assembler because Forth is fast. You might want to have a look at Taqoz....
In my opinion there are two main reasons that Propellers are not widely used:
1. Price of board hardware to get started.
2. The emphasis on the special interpreter languages Spin1 and Spin2, which are designed to be used together with assembler if you need speed. Both have to be learned. Unfortunately you cannot easily build on the huge amount of libraries or tutorials for Arduino.
I think that the strength of the P2 is that it is easier to do things with several cores than to use interrupts. There is no Linux to get in the way. So if I start a project, then my confidence to get it done is often higher with P2.
It's probably time to mention what TMM is planning - Make a tiny Linux for the Prop2 - https://forums.parallax.com/discussion/comment/1570107/#Comment_1570107
Oh, thank you, I had not read this.
Hm, @TMM , it will be rather difficult to get something useful out of P2's 512kB of RAM. P2 unfortunately has no Thumb instruction set and its code density is low.
When PCs had <512kB, their tools like compilers were hand-written in assembler. Do you plan to write a compiler from scratch?
I once tried to port an emulator for mc6809/coco3. That machine with OS9 seems to have worked with 512kB. I was then not able to make a really fast emulator for the memory management unit.
An idea would be to bring together Robert Swierczek's C compiler (I experimented with it here: https://forums.parallax.com/discussion/175070/a-study-apropos-a-multitasking-operating-system-for-p2-with-local-c-compiler#latest ) with an XBYTE machine for P2 written in assembler. This C compiler produces code for a virtual machine based on two accumulators that has 8-bit opcodes together with 24-bit operands. It can compile itself and is more than minimal. The XBYTE mechanism of P2 is fast, as its hardware interpreter invokes code in COG/LUT RAM and can also use the microcache of the streamer mechanism of P2 for fast access to HUB RAM. It also uses 8-bit codes. Perhaps a pure asm COG can run the virtual machine with XBYTE and a supporting server COG running C can handle the stuff where you use the libraries of FlexC?
Oh, had to google what CDE is. "Umpf" Well, you have 512kB of directly coupled RAM. A COG has to share its bandwidth with the others. Think of a machine with 512kB and running at 25MHz.
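As a rough illustration of "8-bit opcodes together with 24-bit operands", the Python sketch below packs and unpacks one 32-bit instruction word. The bit layout (opcode in the low byte, signed operand in the upper 24 bits) and the helper names `pack` and `unpack` are assumptions for this example; the thread does not say how that compiler actually lays out its instruction words.

```python
# Hypothetical layout: bits 7..0 = opcode, bits 31..8 = signed 24-bit operand.
# The real VM's encoding is not described in the thread.

def pack(opcode, operand):
    """Combine an 8-bit opcode and a signed 24-bit operand into one word."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must fit in 8 bits")
    if not -(1 << 23) <= operand < (1 << 23):
        raise ValueError("operand must fit in 24 signed bits")
    return ((operand & 0xFFFFFF) << 8) | opcode


def unpack(word):
    """Split a 32-bit word back into (opcode, signed operand)."""
    opcode = word & 0xFF
    operand = (word >> 8) & 0xFFFFFF
    if operand & 0x800000:  # sign bit of the 24-bit field is set
        operand -= 1 << 24
    return opcode, operand


word = pack(0x12, -5)
decoded = unpack(word)
```

With a layout like this, the low byte is the part a byte-oriented dispatcher would look at, and the operand comes out of the same word with a shift and a sign extension.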
Have fun, Christof
All my P2 programs (mostly CPU emulators) are 100% PASM2 and I've never used SPIN2.
@"Christof Eb." Thanks for the information! As for my project, apart from the memory amount the CDE used to run on 25MHz 68030 based unix machines! The original release was in 1993. The reason I picked the P2 for this project is because I think that even with the extmem overhead I can probably make something that can run a Unix and X server fast enough between the 8 cogs
My high-level plan is to basically use lut ram as a sort of L2 cache, hubram as a sort of L3 cache, and have extram as main ram. Bandwidth-wise it should be possible to achieve several hundred megabytes a second, which is faster than contemporary (with original CDE) ram, which was more around 100-150MB/s.
Theoretically the raw umph is there, which is why I wanted to try this!
And yeah, I might end up writing my own compiler for this project. But I'm still exploring the options. Right now I'm mostly trying to get to grips with all of the existing tools and work done by others. But the point of this is kind of to do something that seems on the face of it to be kind of ridiculous
Ok, it seemed to take a while for me to understand your goal.... My own -kind of ridiculous- experiments are more on the mechanical side, where I use P2 and its compilers as tools to control something. For this it's helpful to know strengths and limits of P2 and just respect them like I do respect mechanical limits of parts or materials. So for me the 512kB limit is "given" and therefore P2 is more like 1983 than 1993. Just read that mc68030 had virtual address management and also small caches.
I am curious, how you will tackle the usage of the caches in combination with extmem. As P2 itself cannot handle virtual memory you will have to have some sort of a virtual processor? Or is Linux able to handle paged memory in a way that you only need to swap memory pages, when a task switch occurs?
Edit: I should add that I am curious for a few good ideas, because my feeble attempt at the mc6809 OS9 emulator ( https://forums.parallax.com/discussion/174794/towards-os9-operating-system-on-p2/p1 ) was throttled down from an emulated speed of about 3.6MHz to <1.3MHz due to the emulation of an MMU, which was in the way for each and every RAM access. Meanwhile I might also have learned some bits to make it better....
Christof
@"Christof Eb." The general idea is that I write a JITting VM that implements the MMU like functionality. Trying to keep code in LUT and data in cog memory in "cache lines" in the hopes of bursting in extmem data "just in time" as well. Since the compiler should know at least roughly what data is going to be read, most of the time, I think it could be reasonably fast. Like I only need to do 1 "useful" instruction every 10 clocks in order to hit my target. This is still pretty wildly optimistic, but I think it is "optimistic" and not "delusional" but we shall see.
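The "1 useful instruction every 10 clocks" budget can be put into numbers with a few lines of Python. The 250 MHz system clock is an assumption for this sketch (the thread never states the clock speed), and `useful_mips` is a hypothetical helper name.

```python
# Back-of-the-envelope budget. P2_SYSCLK_HZ is an assumed value, not a figure
# taken from the thread.

def useful_mips(sysclk_hz, clocks_per_useful_instruction):
    """Millions of useful instructions per second for a given clock budget."""
    return sysclk_hz / clocks_per_useful_instruction / 1e6


P2_SYSCLK_HZ = 250_000_000  # assumption: a 250 MHz system clock
budget = useful_mips(P2_SYSCLK_HZ, 10)
print(f"{budget:.0f} million useful instructions per second per cog")
```

Under that assumption the budget works out to 25 million useful instructions per second per cog, the same figure as the clock rate of the 25MHz 68030 machines mentioned earlier. Whether that is exactly the target meant here is not stated, and a 68030 does not retire one instruction per clock, so treat it as an order-of-magnitude check only.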
If the code uses a lot of dynamic calculations to calculate pointer addresses then performance will suuuuuuck, but I'll burn that bridge when I get there.
The general idea is, more-or-less, to translate all code into relative addressing where possible. Then for each trace the compiler should have a reasonable idea what data will be necessary. The idea then is to make sure that when the trace starts the memory is already in cog ram, and when it needs to be paged out we write it to hubram, unless we run out of space there in which case it will go back into extram.
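A minimal Python sketch of that paging policy, read literally: a line is brought into cog RAM before a trace starts, and the oldest cog line is paged out to hub RAM, or back to external RAM when hub RAM is full. The class name `TieredMemory`, the oldest-first eviction order and the capacities are assumptions for illustration, not part of the plan described above.

```python
from collections import OrderedDict


class TieredMemory:
    """Toy model of cog RAM <- hub RAM <- external RAM paging."""

    def __init__(self, cog_lines, hub_lines, ext):
        self.cog = OrderedDict()  # line address -> data, oldest first
        self.hub = OrderedDict()
        self.ext = dict(ext)      # backing store, assumed to hold every line
        self.cog_lines = cog_lines
        self.hub_lines = hub_lines

    def _page_out(self):
        """Evict the oldest cog line to hub RAM, or to ext RAM if hub is full."""
        address, data = self.cog.popitem(last=False)
        if len(self.hub) < self.hub_lines:
            self.hub[address] = data
        else:
            self.ext[address] = data

    def prefetch(self, address):
        """Ensure a line is in cog RAM; return the tier it was found in."""
        if address in self.cog:
            self.cog.move_to_end(address)  # mark as most recently used
            return "cog"
        if address in self.hub:
            data, source = self.hub.pop(address), "hub"
        else:
            data, source = self.ext.pop(address), "ext"
        if len(self.cog) >= self.cog_lines:
            self._page_out()
        self.cog[address] = data
        return source


memory = TieredMemory(cog_lines=2, hub_lines=1, ext={0: "a", 1: "b", 2: "c", 3: "d"})
trace = [memory.prefetch(address) for address in (0, 1, 2, 3, 0, 1)]
```

In this run only the fifth access is served from hub RAM; every other access has to go out to external RAM, which shows how quickly a small hub tier stops helping once the working set outgrows it.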
I'm experimenting with the idea of running the jit itself on a different cog than the executing cog, and using the dual ports to keep the "execution cog" as empty as possible, in the hopes of being able to keep all flags in "real flags" on the cog so I don't have to generate any code to save and restore flag states, and let all of the flags and side-effects on the cog just happen "naturally".
This is all hugely "handwavey" and I don't really know if any of this is going to work. But I think it at least MIGHT work?
And there's several ways in which I can cheat if I have my own C compiler, there could be "trusted" programs that are not jitted at all. The jit is really only necessary to prevent (accidental) crashes of the whole system by rogue pin fiddling or executing "privileged" instructions like COGABRT or something. The first prototype of this can work with all of this just compiled ahead of time. The JIT is only a protection mechanism which isn't inherently necessary to achieve what I want to achieve. It's really only necessary in order to be able to safely run potentially buggy or malicious code.
I might be able to get away with "just" a special C compiler to take unix source code, X11, and CDE and make it do the "mmu" stuff cooperatively and just hope nothing goes wrong.
Thank you, @TMM , for the explanation, very interesting! Good luck!
@TMM have you seen my JIT toolkit (https://github.com/totalspectrum/p2-jit-tools)? It might be useful to you. A slightly more sophisticated (and customized) version of it was used to produce the p2-riscv runtime for translating RISC-V to P2 instructions.
@ersmith I had seen that! It looks very interesting, I was planning to at least study it. You've made a lot of really cool stuff for the P2!
Catalina does all this, so I can advise you on a few things before you start:
1. Using the LUT as a cache is possible, but it's not very effective even if the LUT is shared between two cogs - it simply isn't big enough. It is probably better to use the LUT for additional cog code space - you're probably going to need it!
2. You can use a pair of spare smart pins (in "repository" mode) for faster communications between adjacent cogs than is possible via Hub RAM. And the advantage over using a shared LUT is that any two (or more) cogs can share pins, not just two adjacent cogs.
3. You'll want to use Hub RAM for all local variables and stack space as well as for caching the XMM RAM. The XMM RAM can be used for code and heap, but it is not fast enough to use for local variables or stack.
I suggest using Catalina as a prototyping tool, since it allows you to explore all these options now. You may decide not to use Catalina in your final solution, but you can use it to see whether or not the P2 is going to be fast enough for your needs.
Ross.