P2asm assembler features?

TMM · 2025-11-10 11:39

Your favorite? P2 noob is back with more inane questions!

I've been trying to find some information on what assembler features are supported in the assemblers that are out there. I've skimmed the following documents to try and find out:

Spin2 doc v52
Silicon doc v35
PASM2 manual from 2022

I have not been able to find out things like:

Do we have local labels?
Do we have macros?

And probably more stuff that I've missed.

If there's some documentation I've missed then it'd be super helpful if someone could point me at it. I'm currently using flexspin as my assembler, but from what I've gathered the assemblers available for the P2 all seem to have the same feature set pretty much, and are largely compatible with each other.

Thanks a lot!

evanh · 2025-11-10 12:27

Labels is it. There are two levels: labels with a leading dot are local to the region between two undotted labels.

Or at least that's true for all the Pasm2 assemblers that I've used. Pasm1 used a leading colon instead.
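
To make the scoping rule concrete, here is a minimal Python sketch that qualifies each dotted label with the most recent undotted label. It is not how any of the actual assemblers are implemented; the function name `qualify_labels` and the `global.local` naming of the results are assumptions for illustration only.

```python
def qualify_labels(labels):
    """Map each label to a unique scoped name.

    A label starting with '.' is local to the most recent undotted
    (global) label, so the same local name can be reused in each scope.
    """
    scoped = []
    seen = set()
    current_global = None
    for name in labels:
        if name.startswith("."):
            if current_global is None:
                raise ValueError(f"local label {name} before any global label")
            full = current_global + name      # e.g. "send" + ".loop"
        else:
            current_global = name             # a new global label opens a new scope
            full = name
        if full in seen:
            raise ValueError(f"duplicate label {full}")
        seen.add(full)
        scoped.append(full)
    return scoped
```

For example, `qualify_labels(["send", ".loop", "recv", ".loop"])` gives `['send', 'send.loop', 'recv', 'recv.loop']`: the two `.loop` labels do not collide because each belongs to a different global label.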

evanh · 2025-11-10 13:00

TMM,
I haven't looked at it myself but maybe working from a RISC-V emulator would be of interest for your project? Eric is pretty good at undertaking useful feature requests too - https://forums.parallax.com/discussion/170295/riscvp2-a-c-and-c-compiler-for-p2/p1

Wuerfel_21 · 2025-11-10 13:47

The first observation to make here is that the mainstream P2 assemblers (PNut, Spin Tools and flexspin) are really Spin2 compilers. Spin2 supports both inline ASM and DAT block ASM directly in its syntax (unlike C et al where it's very poorly integrated). "pure ASM" mode where no Spin runtime or compiled HLL code is emitted into the binary is essentially a special-case subset of the regular Spin2 syntax where only CON and DAT blocks are used.

Thus they all have the same basic syntax and features:

local labels start with a leading dot
CON block constants and any constant expression can be used as values
FILE directive
convenient syntax for AUGS/AUGD
predefined symbols for pin modes etc.
DEBUG statement and associated debug stub machinery
(semi-new) DITTO unrolling
etc etc

Flexspin's extensions:

C-style preprocessor (#include, #ifdef and such) - doesn't really support full C macro expansion
triple @@@ true address operator (not relevant to pure mode)
conditional assembly (if/else/end based on constant expression)
accessing local labels from outside their scope (this inexplicably uses a colon even in P2ASM, like foo:bar to access .bar defined under global label foo)
(semi-new) label namespaces (essentially another layer of scoping)

see also: https://github.com/totalspectrum/spin2cpp/blob/master/doc/spin.md

Spin Tools also has some equivalent to most of these extensions, but they're not quite compatible.

One thing to note here is that the Spin2 syntax descends from Spin1 and the P1. On the P1, PASM is used in a more microcode-like fashion. The P1 does not have hubexec, the 496 longs of code loaded into cog RAM is all there is, and you'd use that to build some sort of virtual system component. On P1, the Spin bytecode interpreter is ROM-resident and essentially is the actual CPU you program for and user PASM is essentially used to implement high-speed virtual peripherals. This paradigm of course never leads to very large PASM programs in a single file, so the language never grew facilities to handle such. (of course we are all very clever folks and figured ways to subvert this system, though in the end the Spin interpreter always wins on code density, which is possibly the most important thing on a 32K RAM machine).

The P2 deliberately is designed such that its PASM can be treated as a more normal^TM CPU instruction set (execute from main memory, relative branches, stack-based CALL/RET, etc), but the tooling all descends from the P1 tooling and is therefore ill-equipped for larger programs. e.g. without the namespace extension or careful use of naming conventions, you can very easily reference a cog/lut label that doesn't exist in the cog running the code and cause horrible bugs. (I essentially suggested the namespace extension after writing a lot of code with just prefix naming to stop this problem. See for example my SNES emulator - this one has the misfortune of having ppr_, ppc_ and ppm_ prefixes to keep 3 cogs processing graphics from having label name collisions - this doesn't really help code readability)

TMM · 2025-11-10 19:09

@Wuerfel_21 said:
The first observation to make here is that the mainstream P2 assemblers (PNut, Spin Tools and flexspin) are really Spin2 compilers.

Thanks a lot for the background information! I guess it's nice that I wasn't just missing something this time

Would anyone here be at all interested in a sort of NASM but for P2? PWASM? (Propeller Wide ASseMbler?)

Or does basically everyone use SPIN2 together with assembly? Am I just the only one that wants to do this?

@evanh said:
I haven't looked at it myself but maybe working from a RISC-V emulator would be of interest for your project? Eric is pretty good at undertaking useful feature requests too - https://forums.parallax.com/discussion/170295/riscvp2-a-c-and-c-compiler-for-p2/p1

I had found this! I have been considering it as a back-end but I feel that I should probably learn how to do it myself first, even if Eric's turns out to be way better. I also kind of want to expose the Propeller itself to user-space. The idea is more or less to make it a sort of "retro tech art piece" in the sense that I want to make something that you couldn't just run on a Raspberry Pi or something.

Wuerfel_21 · 2025-11-10 19:23

@TMM said:
Or does basically everyone use SPIN2 together with assembly? Am I just the only one that wants to do this?

Yes, basically everyone on here. It's actually very nice I think, to have it all integrated, it just has the mentioned weaknesses when you try to make a huge PASM-only program, which I'd think is a niche usecase to begin with. (even the aforementioned game emulator programs end up using Spin and flexspin's C library to handle filesystem operations and user interface, though in a slightly non-standard way where the HLL code lives in upper memory). I haven't really had time to put the new namespace extension to good use yet, I think that'll largely solve the biggest issue.

That said, there's plenty of space for and fun in making your own tool. But you'd need some experience for that first.

RossH · 2025-11-11 04:57

Just want to throw a plug in here for Dave Hein's amazing p2asm program. A very pure and simple PASM assembler. No Spin! Dave died a few years ago, but his work lives on as part of Catalina.

Dave's original version (see https://github.com/davehein/p2gcc/tree/master/p2asm_src) is no longer maintained, but Catalina's version is!

You can find Catalina's version in Catalina's source/p2asm_src directory, and the binary in Catalina's bin directory - it is completely self-contained and has no dependencies on either p2gcc or Catalina^. There is also a super trivial example of a PASM program in demos/p2/flash.pasm.

I really should create a fork of Dave's github and put Catalina's version there. I'll add this to my TO DO list!

Ross.

^ Thought I'd better add a note for anyone exploring Catalina's bin directory looking for the p2asm executable, because as well as p2asm you will see p2_asm scripts (for both Linux and Windows) - these call p2asm, but first they use Catalina's pre-processor on the source to add C-like preprocessor capabilities - i.e. #include, #if, #define etc. But this is optional, and p2asm can be used by itself and is not dependent on any Catalina components.

Christof Eb. · 2025-11-11 07:55

Well, perhaps for the sake of completeness:

The very nice Forth System Taqoz comes with an assembler too. And of course there are macros and you can define your own.

"I also kind of want to expose the Propeller itself to user-space. The idea is more or less to make it a sort of "retro tech art piece" in the sense that I want to make something that you couldn't just run on a Raspberry Pi or something"

In my experience Forth is a good way and tool to achieve this. In comparison to Python it trades convenience for speed and compactness. For Forth 512kB Ram is really big. And you scarcely need assembler because Forth is fast. You might want to have a look at Taqoz....

In my opinion there are two main reasons that Propellers are not used widely:
1. Price of board hardware to get started.
2. The emphasis on the special interpreter languages Spin1 and Spin2, which are designed to be used together with assembler if you need speed. Both have to be learned. Unfortunately you cannot easily build on the huge amount of libraries or tutorials for Arduino.

I think that the strength of P2 is that it is easier to do things with several cores than to use interrupts. There is no Linux to get in the way. So if I start a project, then my confidence to get it done is often higher with P2.

evanh · 2025-11-11 10:19

It's probably time to mention what TMM is planning - Make a tiny Linux for the Prop2 - https://forums.parallax.com/discussion/comment/1570107/#Comment_1570107

Christof Eb. · 2025-11-11 15:42

@evanh said:
It's probably time to mention what TMM is planning - Make a tiny Linux for the Prop2 - https://forums.parallax.com/discussion/comment/1570107/#Comment_1570107

Oh, thank you, I had not read this.

Hm, @TMM , it will be rather difficult to get something useful out of P2's 512kB of RAM. P2 unfortunately has no Thumb instruction set and its code density is low.
When PCs had <512kB their tools like compilers were hand-written in assembler. Do you plan to write a compiler from scratch?

I once tried to port an emulator for mc6809/coco3. That machine with OS9 seems to have worked with 512kB. I was then not able to make a really fast emulator for the memory management unit.

An idea would be to bring together Robert Swierczek's C-Compiler (I experimented with it here: https://forums.parallax.com/discussion/175070/a-study-apropos-a-multitasking-operating-system-for-p2-with-local-c-compiler#latest ) with an XBYTE machine for P2 written in assembler. This C-Compiler produces code for a virtual machine based on two accumulators that has 8-bit opcodes together with 24-bit operands. It can compile itself and is more than minimal. The XBYTE mechanism of P2 is fast, as its hardware interpreter invokes code in COG/LUT RAM and can also use the microcache of the streamer mechanism of P2 for fast access to HUB RAM. It also uses 8-bit codes. Perhaps a pure asm COG can run the virtual machine with XBYTE, and a supporting server COG running C can handle the stuff where you use the libraries of FlexC?
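
As a rough illustration of the instruction format described above (an 8-bit opcode plus a 24-bit operand), the following Python sketch packs and unpacks such a 32-bit word. Placing the opcode in the low byte and treating the operand as a signed value are assumptions made for this example; the layout the compiler actually uses may differ.

```python
OPCODE_BITS = 8
OPERAND_BITS = 24
OPERAND_MASK = (1 << OPERAND_BITS) - 1
OPERAND_SIGN = 1 << (OPERAND_BITS - 1)


def pack(opcode, operand):
    """Pack an 8-bit opcode and a signed 24-bit operand into one 32-bit word."""
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode out of range")
    if not -OPERAND_SIGN <= operand < OPERAND_SIGN:
        raise ValueError("operand out of range")
    # Assumed layout: operand in the upper 24 bits, opcode in the low byte.
    return ((operand & OPERAND_MASK) << OPCODE_BITS) | opcode


def unpack(word):
    """Split a 32-bit word back into (opcode, signed operand)."""
    opcode = word & ((1 << OPCODE_BITS) - 1)
    operand = (word >> OPCODE_BITS) & OPERAND_MASK
    if operand & OPERAND_SIGN:            # sign-extend the 24-bit field
        operand -= 1 << OPERAND_BITS
    return opcode, operand
```

With the opcode in the low byte, a byte-wise fetch of a little-endian word sees the opcode first, which is why that layout was picked for the sketch.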

Oh, had to google what CDE is. "Umpf" Well, you have 512kB of directly coupled RAM. A COG has to share its bandwidth with the others. Think of a machine with 512kB and running at 25MHz.

Have fun, Christof

TonyB_ · 2025-11-11 15:49

@TMM said:
Or does basically everyone use SPIN2 together with assembly? Am I just the only one that wants to do this?

All my P2 programs (mostly CPU emulators) are 100% PASM2 and I've never used SPIN2.

TMM · 2025-11-11 16:28

@"Christof Eb." Thanks for the information! As for my project, apart from the memory amount the CDE used to run on 25MHz 68030 based unix machines! The original release was in 1993. The reason I picked the P2 for this project is because I think that even with the extmem overhead I can probably make something that can run a Unix and X server fast enough between the 8 cogs

My high-level plan is to basically use lut ram as a sort of L2 cache, hubram as a sort of L3 cache, and have extram as main ram. Bandwidth-wise it should be possible to achieve several hundred megabytes a second, which is faster than contemporary (with original CDE) ram, which was more around 100-150MB/s.

Theoretically the raw umph is there, which is why I wanted to try this!

And yeah, I might end up writing my own compiler for this project. But I'm still exploring the options. Right now I'm mostly trying to get to grips with all of the existing tools and work done by others. But the point of this is kind of to do something that seems on the face of it to be kind of ridiculous

Christof Eb. · 2025-11-12 08:03

@TMM said:
@"Christof Eb." Thanks for the information! As for my project, apart from the memory amount the CDE used to run on 25MHz 68030 based unix machines! The original release was in 1993. The reason I picked the P2 for this project is because I think that even with the extmem overhead I can probably make something that can run a Unix and X server fast enough between the 8 cogs

My high-level plan is to basically use lut ram as a sort of L2 cache, hubram as a sort of L3 cache, and have extram as main ram. Bandwidth-wise it should be possible to achieve several hundred megabytes a second, which is faster than contemporary (with original CDE) ram, which was more around 100-150MB/s.

Theoretically the raw umph is there, which is why I wanted to try this!

And yeah, I might end up writing my own compiler for this project. But I'm still exploring the options. Right now I'm mostly trying to get to grips with all of the existing tools and work done by others. But the point of this is kind of to do something that seems on the face of it to be kind of ridiculous

Ok, it seemed to take a while for me to understand your goal.... My own -kind of ridiculous- experiments are more on the mechanical side, where I use P2 and its compilers as tools to control something. For this it's helpful to know the strengths and limits of P2 and just respect them like I do respect mechanical limits of parts or materials. So for me the 512kB limit is "given" and therefore P2 is more like 1983 than 1993. Just read that mc68030 had virtual address management and also small caches.

I am curious, how you will tackle the usage of the caches in combination with extmem. As P2 itself cannot handle virtual memory you will have to have some sort of a virtual processor? Or is Linux able to handle paged memory in a way that you only need to swap memory pages, when a task switch occurs?
Edit: I should add that I am curious for a few good ideas, because my feeble attempt about the mc6809 OS9 emulator ( https://forums.parallax.com/discussion/174794/towards-os9-operating-system-on-p2/p1 ) was throttled down from emulated speed about 3.6MHz to <1.3MHz due to the emulation of a MMU, which was in the way for each and every RAM access. Meanwhile I might also have learned some bits to make it better....

Christof

TMM · 2025-11-12 13:10

@"Christof Eb." The general idea is that I write a JITting VM that implements the MMU like functionality. Trying to keep code in LUT and data in cog memory in "cache lines" in the hopes of bursting in extmem data "just in time" as well. Since the compiler should know at least roughly what data is going to be read, most of the time, I think it could be reasonably fast. Like I only need to do 1 "useful" instruction every 10 clocks in order to hit my target. This is still pretty wildly optimistic, but I think it is "optimistic" and not "delusional" but we shall see.

If the code uses a lot of dynamic calculations to calculate pointer addresses then performance will suuuuuuck, but I'll burn that bridge when I get there.

The general idea is, more-or-less, to translate all code into relative addressing where possible. Then for each trace the compiler should have a reasonable idea what data will be necessary. The idea then is to make sure that when the trace starts the memory is already in cog ram, and when it needs to be paged out we write it to hubram, unless we run out of space there in which case it will go back into extram.
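
The paging policy in the previous paragraph (data in cog RAM first, paged out to hubram, and back to extram only when hubram is full) could be modeled roughly like this. It is only a toy Python model: the class name `TieredStore`, the capacities, and the least-recently-used eviction choice are all assumptions for illustration, not part of the plan described above.

```python
from collections import OrderedDict


class TieredStore:
    """Toy model: lines live in cog RAM, spill to hub RAM, then to ext RAM."""

    def __init__(self, cog_lines, hub_lines):
        self.cog_lines = cog_lines
        self.hub_lines = hub_lines
        self.cog = OrderedDict()   # least recently used entry comes first
        self.hub = OrderedDict()
        self.ext = {}              # assumed large enough to hold everything

    def _spill_hub(self):
        # Hub RAM is full: push its oldest lines back out to external RAM.
        while len(self.hub) > self.hub_lines:
            addr, data = self.hub.popitem(last=False)
            self.ext[addr] = data

    def _spill_cog(self):
        # Cog RAM is full: page its oldest lines out to hub RAM.
        while len(self.cog) > self.cog_lines:
            addr, data = self.cog.popitem(last=False)
            self.hub[addr] = data
            self._spill_hub()

    def touch(self, addr):
        """Bring a line into cog RAM and report which tier it came from."""
        if addr in self.cog:
            self.cog.move_to_end(addr)
            return "cog"
        if addr in self.hub:
            data, source = self.hub.pop(addr), "hub"
        else:
            # Lines never seen before are treated as zero-filled ext RAM.
            data, source = self.ext.pop(addr, 0), "ext"
        self.cog[addr] = data
        self._spill_cog()
        return source
```

Counting how often `touch` returns `"ext"` for a given access pattern gives a crude feel for how much prefetching the compiler would have to get right.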

I'm experimenting with the idea of running the jit itself on a different cog from the executing cog, and using the dual ports to keep the "execution cog" as empty as possible, in the hopes of being able to keep all flags in "real flags" on the cog so I don't have to generate any code to save and restore flag states, and let all of the flags and side-effects on the cog just happen "naturally".

This is all hugely "handwavey" and I don't really know if any of this is going to work. But I think it at least MIGHT work?

And there's several ways in which I can cheat if I have my own C compiler, there could be "trusted" programs that are not jitted at all. The jit is really only necessary to prevent (accidental) crashes of the whole system by rogue pin fiddling or executing "privileged" instructions like COGABRT or something. The first prototype of this can work with all of this just compiled ahead of time. The JIT is only a protection mechanism which isn't inherently necessary to achieve what I want to achieve. It's really only necessary in order to be able to safely run potentially buggy or malicious code.

I might be able to get away with "just" a special C compiler to take unix source code, X11, and CDE and make it do the "mmu" stuff cooperatively and just hope nothing goes wrong.

Christof Eb. · 2025-11-12 14:03

Thank you, @TMM , for the explanation, very interesting! Good luck!

ersmith · 2025-11-12 20:45

@TMM have you seen my JIT toolkit (https://github.com/totalspectrum/p2-jit-tools)? It might be useful to you. A slightly more sophisticated (and customized) version of it was used to produce the p2-riscv runtime for translating RISC-V to P2 instructions.

TMM · 2025-11-13 11:17

@ersmith I had seen that! It looks very interesting, I was planning to at least study it. You've made a lot of really cool stuff for the P2!

RossH · 2025-11-14 01:53

@TMM said:
My high-level plan is to basically use lut ram as a sort L2 cache, hubram as a sort of L3 cache, and have extram as main ram.

Catalina does all this, so I can advise you on a few things before you start:
1. Using the LUT as a cache is possible, but it's not very effective even if the LUT is shared between two cogs - it simply isn't big enough. It is probably better to use the LUT for additional cog code space - you're probably going to need it!
2. You can use a pair of spare smart pins (in "repository" mode) for faster communications between adjacent cogs than is possible via Hub RAM. And the advantage over using a shared LUT is that any two (or more) cogs can share pins, not just two adjacent cogs.
3. You'll want to use Hub RAM for all local variables and stack space as well as for caching the XMM RAM. The XMM RAM can be used for code and heap, but it is not fast enough to use for local variables or stack.

I suggest using Catalina as a prototyping tool, since it allows you to explore all these options now. You may decide not to use Catalina in your final solution, but you can use it to see whether or not the P2 is going to be fast enough for your needs.

Ross.

TMM · 2025-11-18 13:45

@RossH Thanks a lot for the pointers!

I was thinking of using the LUT as a cache, but mostly as an instruction cache. The idea would be to page things in and out of hubram/psram on an as-needed basis. Since this will need to be able to do multitasking I will need a way to swap code and data.

But I will definitely start with Catalina, at least to see how you did all of this! Thanks again!

TonyB_ · 2025-11-18 22:00

@TMM said:
I was thinking of using the LUT as a cache, but mostly as an instruction cache. The idea would be to page things in and out of hubram/psram on an as-needed basis. Since this will need to be able to do multitasking I will need a way to swap code and data.

I'm not sure anyone has yet tried LUT sharing where cog A only does fast block reads of code from hub RAM for cog B to execute. Cog B could start running a new block of code before most of it has been read as fast block moves take only one cycle after the first one. I think fast block moves can be interrupted and ideally a new fast block move inside an interrupt routine terminates the old interrupted one without any funny business but I have not tested this.

Have you considered using XBYTE? The FIFO acts as a fast instruction queue of up to 76 bytes (19 longs). There is a six cycle overhead for each bytecode but immediate data can be read in only two cycles, much faster than from hub RAM. Bytecode programs are always interpreted but will be smaller than P2 object code.

RossH · 2025-11-19 00:04

@TMM said:
@RossH Thanks a lot for the pointers!

I was thinking of using the LUT as a cache, but mostly as an instruction cache. The idea would be to page things in and out of hubram/psram on an as-needed basis. Since this will need to be able to do multitasking I will need a way to swap code and data.

But I will definitely start with Catalina, at least to see how you did all of this! Thanks again!

Normal execution from XMM RAM is fairly straightforward, and uses a Hub-RAM based cache (see target/p2/xmm.t and target/p2/cogcache.t) plus Roger Loh's PSRAM drivers (see target/p2/cogpsram.t). The LUT is also involved here, but only as a page buffer (which can be disabled if the LUT is required for other purposes).

The Hub-RAM cache can be from 1k to 64k. Once you see basically how the XMM execution and cache works, shifting the cache to the LUT is then fairly straightforward (see target/p2/lutcache.t). The LUT cache size is limited to 1k (theoretically it could be 2k, but I use the other 1k for code - this generally turns out to be a better use of the LUT).
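
For readers unfamiliar with the idea, a Hub-RAM cache in front of slower external RAM can be pictured with the following generic direct-mapped cache model in Python. This is not Catalina's implementation (see the files named above for that); the class name `DirectMappedCache`, the 64-byte line size, and the direct-mapped organization are assumptions chosen for brevity.

```python
class DirectMappedCache:
    """Generic direct-mapped read cache over a slow backing store."""

    def __init__(self, backing, cache_bytes=1024, line_bytes=64):
        self.backing = backing                  # bytes-like "external RAM"
        self.line_bytes = line_bytes
        self.num_lines = cache_bytes // line_bytes
        self.tags = [None] * self.num_lines     # which line each slot holds
        self.lines = [b""] * self.num_lines
        self.hits = 0
        self.misses = 0

    def read_byte(self, addr):
        line_no = addr // self.line_bytes
        slot = line_no % self.num_lines         # each line maps to exactly one slot
        if self.tags[slot] != line_no:
            # Miss: fetch the whole line from the backing store in one burst.
            start = line_no * self.line_bytes
            self.lines[slot] = bytes(self.backing[start:start + self.line_bytes])
            self.tags[slot] = line_no
            self.misses += 1
        else:
            self.hits += 1
        return self.lines[slot][addr % self.line_bytes]
```

Varying `cache_bytes` across the 1k to 64k range mentioned above and watching `hits` and `misses` for a given access pattern shows how quickly a small cache starts thrashing.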

Using pins for cog-to-cog communications is supported by the cache and the floating point plugin (see target/p2/xmm.c, target/p2/cogcache.c, target/p2/lutcache.c and target/p2/floatc.t - search for the symbol FLOAT_PINS or CACHE_PINS). This of course requires that enough spare smart pins are available, so it is not enabled by default.

There is a (very brief) description of this stuff in the Catalina Reference Manual (Propeller 2), in the section Cache Support and Smart Pins.

Feel free to ask questions. I will try and remember how it all hangs together - but no guarantees!

Note that Catalina also supports multi-tasking, but not when executing from XMM RAM - that's still on my "to do" list, because it adds even more complexity to something that is already very complex. Instead, Catalina opts for a "multi-model" approach, where multiple tasks can be executing on multiple cogs from Hub RAM, but in conjunction with a single task that executes from XMM RAM. They communicate (if required) via Hub RAM. Building such programs is quite complex, so Catalina provides a utility (called Catapult) that does this job for you. See the document called Getting Started with Catapult.

Ross.

rogloh · 2025-11-23 06:11

@TMM,
I noticed RossH had mentioned PSRAM drivers above and that you were interested in caching and execution from code stored in external RAM. I was also interested in this a while back and some initial work has been done on this. It started out as a discussion regarding AVR emulation in order to break the 512k limit with GCC support etc but morphed into ideas for trying to execute large P2 programs from external RAM that could be read from external RAM in strips/rows into HUB. I was able to get something going using an approach that escaped out from normal branch/call instructions etc and could bring in code on demand from PSRAM as well as support a simple cache. It was able to run a modified version of MicroPython as a proof of concept and get fairly decent results. I've also had ideas for running directly from LUT but can't recall where I left that or if any old threads or snippets covered those ideas.

You might want to check out this thread linked below.
https://forums.parallax.com/discussion/174344/p2-native-avr-cpu-emulation-with-external-memory-xbyte-etc/p1

Unfortunately back then I think the whole external memory thing was all rather new/niche and there wasn't a huge interest seen in people for something like this, so it sort of fizzled out and I had plenty of other things to work on like my video stuff. Even RossH took a while to see the light perhaps from his prior P1 experiences, LOL.

RossH · 2025-11-24 01:45

@rogloh said:
Even RossH took a while to see the light perhaps from his prior P1 experiences, LOL.

I'm not sure whether I "saw the light" or I was "seduced by the dark side".

Beware! As Yoda warned Luke ... “If Once You Start Down The Dark Path, Forever Will It Dominate Your Destiny.”

Christof Eb. · 2025-11-24 13:12

As I fear that emulating virtual memory on P2 will always be slow, I am wondering if it would be a good idea to go back a little bit in time and try to run one of the first Unix versions. This was designed to run on a pdp11, which had only a memory space of 64 kB (32k words). The lsi11-23 running rt11-xm that I did some programming on to control a test rig had an MMU and 1MB of RAM, but we used this as a RAM disk to load overlays from.
I wonder how they did task switches then in unix? Did they swap the whole task to/from disk? I think they also had overlays?
Cheers Christof
