Neat Ross. I will make use of the #include and #ifdef
You guys do realize that fastspin has always had #include and #ifdef, right? Ross, you're certainly free to use fastspin as the backend assembler for Catalina (it's MIT licensed). Like bstc it has a simple preprocessor and support for "@@@" absolute addressing on both P1 and P2. In fact if you rename the fastspin executable to anything that starts with the letters "bstc" it somewhat emulates bstc's command line options and error message syntax.
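For anyone who hasn't tried the preprocessor, here is a minimal sketch of the kind of thing it allows (the include file, the USE_P2 symbol and the constant values are made up for illustration, not part of any real library):
#define USE_P2                      ' build-time symbol, defined here just for the sketch
#include "pins.spinh"               ' hypothetical include file of shared constants
DAT
#ifdef USE_P2
led_mask    long  $FF               ' example value for a P2 build
#else
led_mask    long  $00FF_0000        ' example value for a P1 build
#endif
mask_addr   long  @@@led_mask       ' @@@ yields the absolute hub address of led_mask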
It'd be great if you, @"Dave Hein", and I can come to some agreement on what the Propeller specific libraries should look like for Catalina, p2gcc, and fastspin. @DavidZemon started a thread about language independent APIs for P2 which would be a good place for us to discuss that.
Hi Eric
My current priority is to get Catalina for the P2 up to the level of functionality of Catalina for the P1. After that, I'd be happy to investigate alternative APIs. It should be easy enough to support both.
Hi Eric
I want Catalina to be independent of any particular SPIN tool on the P2, just as it was on the P1. Because the P2 toolset is still very fluid, I don't want to tie myself down to any one tool just yet by depending on any specific toolset functionality. So I would have needed this anyway, and in any case it only took a couple of hours to develop.
You guys do realize that fastspin has always had #include and #ifdef, right?
Eric,
Does that apply to pasm code?
Is there any docs? I've not got spin under my belt so docs would help a lot.
I've been enjoying fastspin for P2ES testing. I use Pnut only for v33 FPGA testing now.
You guys do realize that fastspin has always had #include and #ifdef, right?
Eric,
Does that apply to pasm code?
Yes.
Is there any docs? I've not got spin under my belt so docs would help a lot.
I've been enjoying fastspin for P2ES testing. I use Pnut only for v33 FPGA testing now.
I haven't had time to write a complete Spin document. The doc/spin.md file (translated to doc/spin.pdf in the spin2gui distribution) describes the differences between "standard" Spin and fastspin. It assumes familiarity with Spin, as documented in the Propeller 1 manual.
There's also doc/basic.md, which describes the BASIC dialect that fastspin understands. It's a little more complete, since I couldn't rely on any third party documentation (BASIC varies a lot between implementations!)
A little off topic here, but is there any documentation on the CMM memory model? I have used it in the past but find the timing of functions to be unclear and the instructions it creates to be strange. I try to use LMM whenever possible because I need the fast and accurate timing of PASM code.
Catalina and PropGCC have different CMM models, but the underlying idea is similar -- that instead of executing PASM instructions directly (as in LMM) there is some decompression of encoded words, so that the code is smaller but slower. The PropGCC CMM instructions are documented at https://github.com/parallaxinc/propgcc-docs/blob/master/doc/CompressedMemoryModel.md.
In PropGCC you can recover some of the timing in a particular (small) function by declaring it with "__attribute__((fcache))", which disables compression on that function and forces it to be loaded into FCACHE before it executes. I'm not sure if Catalina has anything similar.
Thank you for that. I have always wondered why the code timing never lined up and why the generated code didn't match the Propeller instructions. Mystery solved.
There is a brief description of the Catalina CMM instruction set in the document "Catalina_compact.inc" in the default Catalina target directory. Catalina CMM includes a FCACHE capability, which is used to speed up some library functions and also the multithreading code - but you cannot force a specific C function to be FCACHED.
Another small milestone on the way to Catalina on the P2 ...
Attached is the first Compact Memory Model (CMM) program compiled by Catalina for the P2. Same program as before - a simple multi-cog "blink all LEDS" for the P2_EVAL board.
Same provisos apply - no optimization has yet been done to the code generator, so it is a little larger and less efficient than it will eventually be. Even so, the actual program code (i.e. minus the kernel code, which is about the same size for both LMM and CMM) is about 70% of the size of the LMM version, and from experience this will eventually come down to about 50%. I would expect it to also be about 50% of the size of the code the native code generator produces, once I get that working.
Not really sure what to do next. There is still lots of debugging and library work to do on these code generators, but that's not much fun - and of course there are no P2-specific plugins as yet, which will make Catalina a lot more useful, but which will take an awful lot of work.
So I think I may get a version of the "native" P2 code generator working first - just to get a better "feel" for the P2. So far, most of my time has been taken up with tripping over all the "gotchas" involved in porting P1 code to the P2.
What is your overall feeling while converting from Pasm1 to Pasm2?
I am still overwhelmed with the new instructions...
Hi Mike
I have not made much use of the new instructions or features yet - this is one reason I want to move on to the native code generator before going back and converting all the existing Catalina plugin code. But overall, it seems fairly easy to convert P1 code to a "P1-like" subset of the P2 - although I have to say that some of the worst "gotchas" seem to be unnecessary and self-inflicted.
For instance, why can't we easily align data on a long or word boundary in a DAT block on the P2, as we could on the P1? You probably don't realize just how much you rely on this when coding for the P1 ... until you try to port that code to the P2!
I have seen some discussion here about introducing an "align" keyword - as in "align long" or "align word" - but the extra verbiage makes it very undesirable to have to align every declaration separately. My preference would be for either a different type of DAT block - say ADAT for a P1-style "aligned" DAT block - or else an align directive that turns alignment on or off altogether until the next align directive (i.e. "align on" or "align off").
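To make the suggestion concrete, the block-level form might look something like this - purely hypothetical syntax, sketched only for illustration and not accepted by any current assembler:
DAT
            align on                ' hypothetical: pad each long/word to its natural boundary
flag        byte  1
counter     long  0                 ' would be padded up to the next long boundary, as on the P1
            align off               ' hypothetical: revert to byte-packed, P2-style layout
flag2       byte  2
counter2    long  0                 ' would follow immediately, possibly unaligned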
RossH, why do you need to align data on the P2? It can read and write longs and words at any offset.
I don't need it, but on the P1 I relied on it in several instances, because I didn't anticipate this would change
For example, in the compact mode on the P1 I made use of the fact that longs were automatically aligned on long boundaries, so when you mixed word instructions and long instructions, you could always read memory as longs and either get one complete long instruction, or two complete word instructions in each long read. The cost was the occasional word of memory "wasted", but it made the compact mode simple, compact and fast.
Decoding instructions becomes more complex (and slower) if each fetch might require a second read and then reconstruction of the instruction from the two halves.
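As a rough illustration of the fetch loop that the long-aligned layout allows - a sketch only, using a made-up marker bit rather than Catalina's actual compact encoding:
fetch       rdlong  op, pc          ' one aligned read always returns whole instructions
            add     pc, #4
            test    op, bit31 wz    ' made-up convention: bit 31 set means "one 32-bit op"
if_nz       jmp     #exec_long
            mov     op2, op         ' otherwise the long holds two 16-bit compact ops
            shr     op2, #16        ' high word is the second op
            and     op, word_mask   ' low word is the first op
            ' ... decode and dispatch op, then op2 ...
bit31       long    $8000_0000
word_mask   long    $0000_FFFF
With unaligned longs, any of those reads might instead return the tail of one instruction and the head of the next, and the loop would need a second read and a reassembly step.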
I have also written P1 code where I assumed a long address would always have 00 as the two lower bits, so I either didn't need to store them, or I could use them for something else. Luckily, I didn't do this in the final version of the compact kernel - but I did it in a few other places, which I shall have to track down.
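For what it's worth, the trick in question looks something like this - a sketch with made-up register names, relying on the guarantee that ptr itself is long-aligned:
            or      ptr, #%01       ' stash a flag in bit 0 of a long-aligned address
            ...
            test    ptr, #%11 wz    ' later: recover the flag bits
            andn    ptr, #%11       ' clear them before using ptr as an address again
            rdlong  value, ptr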
I could simplify the compact mode quite a lot given the new capabilities of the P2 ... but I have lots of other things to do first, so I have just left it assuming that all longs are long-aligned ... but now I have to explicitly make that the case!
Aren't aligned accesses faster, since only one hub access needs to be done, and atomic, since they don't span two longs?
The penalty for unaligned word/long accesses is just ONE clock.
So I think I may get a version of the "native" P2 code generator working first - just to get a better "feel" for the P2. So far, most of my time has been taken up with tripping over all the "gotchas" involved in porting P1 code to the P2.
I think you'll find the native code generator to be quite straightforward -- there's really no need for LMM on P2, and every LMM instruction can map pretty directly to one or two native P2 instructions. This is particularly true if you use ptra or ptrb as the stack pointer: then you can do something like:
wrlong r0, ptra++ ' push argument
calla #C_function ' call the function
...
C_function
...
reta
That's assuming your stack grows up -- I can't remember how Catalina does it. If you need it to grow down, you can still do that, but at the cost of extra instructions in the function prologue / epilogue:
wrlong r0, --ptra ' push argument
calld pa, #C_function ' call the function, return address + CZ placed in pa register
...
C_function
wrlong pa, --ptra ' save return address on stack
...
rdlong pa, ptra++ ' pop return address
jmp pa ' jump indirect through it
In leaf functions the second sequence can omit the push/pop and actually end up faster, since you save the stack accesses that calla and reta implicitly insert.
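For example, a leaf function under that convention can skip the stack entirely - a minimal sketch, assuming the same calld/jmp convention as above and made-up register names:
            calld   pa, #leaf_func  ' return address goes straight into pa
            ...
leaf_func
            add     r0, r1          ' body makes no further calls, so pa is never disturbed
            jmp     pa              ' return directly through pa - no hub stack traffic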
Beware of the difference between "pa" and "ptra" and "pb" and "ptrb": I've confused them enough that I've finally learned that they are different registers, but I wish "pa" and "pb" had been called "parama" and "paramb" instead.
Hi Ross
FYI The Pnut compiler supports "ALIGNW" and "ALIGNL" directives.
They only apply to the next data item, so they don't really help much in this case
You just emit them before the data item in question. That is, every time you used to write:
long x
word y
you output
alignl
long x
alignw
word y
instead. There's an obvious optimization where you skip outputting the "alignl" or "alignw" if the data is already known to be aligned properly (so the "alignw" in the example is redundant).
I could very easily modify fastspin to have a directive or command line option to automatically align things -- fastspin supports both P1 and P2, so the alignment code is there for P1 and just skipped for P2. Would that be useful to you?
Yes, this is what I do, except I currently use the "orgh" directive (I didn't know about "alignl").
I could very easily modify fastspin to have a directive or command line option to automatically align things -- fastspin supports both P1 and P2, so the alignment code is there for P1 and just skipped for P2. Would that be useful to you?
Well, it would be ... if you could convince all the assemblers to also support it!
I'm pretty sure all of the assemblers for the P2 support ALIGNL and ALIGNW - they're standard parts of the P2 assembly language.
Thanks for the "heads up", Eric!
Catalina's stack grows downwards. I probably won't change that in the short term. I might look at changing it later.
Attached is the first Catalina C program running in native mode on the P2. As usual, it's just the multi-cog "all LEDs blink" program for the P2_EVAL board.
However, I cheated a bit - it just occurred to me that if I took the LMM code generator, and just tweaked the kernel to execute the code straight from hub rather than via the usual LMM loop, then with a very few simple modifications I would have a P2 program executing natively!
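To see why the tweak was so small: an LMM kernel's inner loop is just a software fetch-execute cycle, and on the P2 that whole loop can be replaced by a jump into hub memory. A rough sketch (the label and register names are illustrative, not Catalina's actual kernel):
' P1-style LMM loop - fetch each hub long and execute it in place:
lmm_loop    rdlong  lmm_instr, pc
            add     pc, #4
lmm_instr   nop                     ' the fetched instruction lands and executes here
            jmp     #lmm_loop
' on the P2 the kernel can simply hand control to hubexec and run the same code natively:
            jmp     #\hub_code      ' #\ forces an absolute jump into hub memory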
And it works!
But this is not how things really should be - it's just a bit of a novelty that I couldn't resist posting. I really need to replace the calls to most of the LMM "primitives" with the relevant P2 instructions - that will make the code much faster!
Still, I am beginning to like working on the P2!
I'm not familiar with HyperRam, but it did occur to me that one day there might be a need for both EMM and XMM support on the P2.
But they aren't very high on my list
What is "EMM"?
EMM was not really a separate memory model - it was essentially just LMM code, but with the kernel and the plugins loaded directly from EEPROM instead of first being loaded into Hub RAM. It was quite useful on the P1, where RAM was so limited, but since the P2 has so much more RAM it may not be as useful.