[Solved?] Inline assembly: Immediate instructions broken?

pedward · 2013-03-16 15:11

I guess the part you're not saying, that I seem to derive from the discussion, is that the assembler is working from a hub view of memory, not from a COG view of memory. The symbolic offsets are treated as hub addresses, even when they actually can't be used that way in COG mode because you can't directly access HUB memory. But because the code is compiled and stored in HUB memory, then copied to COG memory to run, the assembler is treating it as a monolithic binary.

The code only works because you have targeted fixups for a few instructions.

Am I getting that right?

ersmith · 2013-03-16 16:35

pedward wrote: »

I guess the part you're not saying, that I seem to derive from the discussion, is that the assembler is working from a hub view of memory, not from a COG view of memory.

I'm sorry, I guess I wasn't clear -- that is what I meant when I said that GAS uses byte addresses (as required in HUB memory) whereas PASM uses long addresses (as required in COG memory). Thanks for clarifying this.

The default mode of operation for GCC is LMM mode, where everything executes from HUB memory. That's one reason we chose that convention. But the other was simply that all the GNU tools (assembler, linker, etc.) are built with the expectation that addresses are byte addresses -- all sizes and offsets are calculated as bytes, and when the linker places sections in memory and adjusts symbols it operates on the size in bytes. Changing this would be very difficult if not impossible.

The code only works because you have targeted fixups for a few instructions.

I wouldn't quite put it that way... the way I think of it is just that PASM and GAS have different ways of looking at everything. For GAS, everything is bytes, and so a jmp instruction targets a byte address (that happens to be encoded with the bottom 2 bits missing because those always have to be 0). That isn't quite accurate from the hardware perspective, of course: PASM actually matches the real hardware much better. But GAS is a cross-platform assembler that targets zillions of other processors, too, so it's not surprising that it has a more "traditional" and byte oriented view of operations, rather than the Propeller long oriented view.
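To make the two views concrete, here is a minimal Python sketch of the conversion between a byte address (the GAS view) and a long address (the PASM/COG view). It is only an illustration of the arithmetic, not code from GAS or PASM, and the function names are made up for this example.

```python
LONG_SIZE = 4  # bytes per Propeller long


def byte_to_long_address(byte_address):
    """Convert a byte address (GAS view) to a long address (PASM/COG view)."""
    if byte_address % LONG_SIZE != 0:
        raise ValueError("address is not aligned on a long boundary")
    # For an aligned address the bottom 2 bits are always 0,
    # so dropping them loses no information.
    return byte_address >> 2


def long_to_byte_address(long_address):
    """Convert a long address back to the equivalent byte address."""
    return long_address << 2
```

For example, byte address 0x40 corresponds to long address 0x10, and converting back gives 0x40 again.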

Moreover, it is quite possible to write significant self-modifying code in GAS: the LMM and CMM kernels, for example. CMM in particular involves unpacking instructions and executing them in place, so there's lots of self-modifying code. It just requires a different way of thinking than PASM does.

It is still possible to use PASM style syntax and symbols with PropGCC, as long as the PASM code is self-contained and does not make external references (which is the usual case, actually). For example, you can compile the PASM file into a .dat block that is just a "binary blob", and have PropGCC load that. Or, you can have spin2cpp convert PASM syntax to GAS syntax with the --gas option, which does things like converting symbol references into offsets and dividing by 4.

Just for the record, I do prefer PASM syntax for writing COG code, and I expect most people probably do, which is one reason I added the --gas and --dat options to spin2cpp. Unfortunately PASM is missing a lot of features such as sections and relocatable output which we needed for GCC. Moreover, GCC is pretty much a hub compiler (with LMM, CMM, and XMM modes, anyway) and so the byte addresses of GAS are actually a better fit by default to GCC; which is not surprising, since they were written for each other.

This is a good conversation... it's great to have more minds thinking about this, and if we can find a better way of merging PASM and GCC I'd be pleased. But so far the best solution we've come up with for making GAS "more compatible" is the .cog_ram directive.

Eric

pedward · 2013-03-16 17:24

ersmith wrote: »

I'm sorry, I guess I wasn't clear -- that is what I meant when I said that GAS uses byte addresses (as required in HUB memory) whereas PASM uses long addresses (as required in COG memory). Thanks for clarifying this.

Eric

Trying to clarify further, this model has bigger implications than just byte offsets. The assembler is treating the prop as a 32KB memory micro with the code compiled like it was going to run from HUB address space, like a conventional micro.

It's not compiling code with a NUMA model (which, after considerable pondering I've decided is closest to what the Propeller is), which has "local" memory and "other" memory that's shared with other processors.

I fear that relying on the similarity between big system NUMA and how the Propeller does it, wouldn't really buy anything for compatibility because there aren't any OSS compilers that "know" NUMA, they seem to punt the problem to the OS.

I wonder if it may be possible to adjust the compiler's perspective to treat the Propeller as a NUMA architecture, specifically all addresses are "long" unless stored in a specific section, then those accesses are byte oriented?

jac_goudsmit · 2013-03-16 21:49

ersmith wrote: »

Some technical notes:

Thanks Eric, I really appreciate this feedback; it's probably the most useful reply I got so far.

It's not too hard to keep GCC and the libraries from breaking by introducing a directive (e.g. ".imm_dat") to enable spin-asm compatible behavior. Obviously GCC doesn't use this so it won't break. By default the directive is off, and there should probably be a directive to counteract it, (e.g. ".imm_gcc").

You mentioned that you don't want to keep track of two addresses for each symbol but that's not really necessary: it's always possible to calculate the cog address if you know the hub address of the start of the cog. The linker already generates the Propeller-specific _load_start_* symbol so it knows where the code for a cog begins.

If the ORG command gets implemented the same way as spin-asm, the assembler just needs to keep a list of all the ORG statements and where they occur. Also, an implicit ORG 0 should be generated at the location where _load_start_* points. Then instead of calculating (hub_address - _load_start) / 4 for cog addresses, it should find the last ORG statement before the referenced address, and use ((hub_address_of_symbol - hub_address_where_ORG_appears) / 4 + ORG_argument). For example, if label foo appears at hub address 120 and there is an ORG 234 at hub address 100, a MOV bar, #foo can translate #foo by calculating ((120 - 100) / 4 + 234) = 239.
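The lookup could be sketched roughly as follows. This is a Python model of the arithmetic only, not binutils code; the function name and the list-of-tuples representation are assumptions for illustration, and the load start at hub address 80 in the usage note is a made-up value.

```python
import bisect


def cog_address(symbol_hub_address, org_statements):
    """Translate a hub (byte) address into a cog (long) address.

    org_statements is a list of (hub_address_where_ORG_appears, ORG_argument)
    tuples, sorted by hub address. The first entry plays the role of the
    implicit ORG 0 at the load start of the cog image.
    """
    positions = [position for position, _ in org_statements]
    # Find the last ORG located at or before the symbol.
    index = bisect.bisect_right(positions, symbol_hub_address) - 1
    if index < 0:
        raise ValueError("no ORG precedes this address")
    org_position, org_argument = org_statements[index]
    return (symbol_hub_address - org_position) // 4 + org_argument
```

With `orgs = [(80, 0), (100, 234)]`, `cog_address(120, orgs)` gives 239 as in the example above, while `cog_address(92, orgs)` falls under the implicit ORG 0 and gives 3.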

To prevent offset problems in cases where multiple modules are linked into the same cog or where an ORG has a reference to another address and has to be calculated too, the final calculation of cog and hub address should happen at link time, not at compile time (I'm not sure where exactly this functionality should go, at this time). In an expression such as #foo + 123 when .imm_dat is active, the compiler would mark the expression with a flag (say PROPELLER_REF_COGADDRESS) that indicates that the linker should make the referenced address into a cog address. When the linker sees this, it interprets the expression as ((foo - previous_ORG_location) / 4 + previous_ORG_argument) + 123. The result gets stored in the final instruction.
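As a rough model of what the linker would do with such a flagged expression, here is a small Python sketch. The `Relocation` class and `resolve` function are hypothetical names for illustration; only the flag name PROPELLER_REF_COGADDRESS and the formula come from the proposal.

```python
from dataclasses import dataclass


@dataclass
class Relocation:
    symbol_hub_address: int  # final hub (byte) address of the symbol
    addend: int              # constant part of the expression, e.g. the 123 in #foo + 123
    cog_address: bool        # stands in for the proposed PROPELLER_REF_COGADDRESS flag


def resolve(reloc, org_position, org_argument):
    """Compute the value stored in the instruction at link time.

    org_position and org_argument describe the last ORG before the symbol.
    """
    if reloc.cog_address:
        base = (reloc.symbol_hub_address - org_position) // 4 + org_argument
    else:
        base = reloc.symbol_hub_address
    # The addend is applied after the conversion, so it is not divided by 4.
    return base + reloc.addend
```

With foo at hub address 120 and an ORG 234 at hub address 100, `#foo + 123` resolves to 239 + 123 = 362 when the flag is set, and to 120 + 123 = 243 when it is clear.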

By the way, the setting of .cog_ram would be irrelevant for this way of working: I think the question whether an address should be encoded as cog address or hub address is not a matter (i.e. attribute) of the symbol itself, but of the relocation in which the symbol is used. Two instructions MOV bar, #foo in different locations of the program, one when .imm_dat is active and one when .imm_gcc is active, should produce different code.

Adding the @ and & prefixes should be trivial: basically they would clear (@) or set (&) the previously mentioned PROPELLER_REF_COGADDRESS in the relocation record, regardless of the .imm_dat vs. .imm_gcc setting.
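The precedence I have in mind (an explicit prefix wins over the directive setting) could be modeled like this. It is a hypothetical Python sketch: the function name is made up, and the operand handling is simplified to a plain string with an optional one-character prefix.

```python
def wants_cog_address(operand, imm_dat_active):
    """Decide whether a reference should become a cog address.

    Returns (flag, expression), where flag models the proposed
    PROPELLER_REF_COGADDRESS bit and expression is the operand text
    with any @ or & prefix removed.
    """
    if operand.startswith("@"):
        return False, operand[1:]   # @ always asks for a hub address
    if operand.startswith("&"):
        return True, operand[1:]    # & always asks for a cog address
    # No prefix: follow the .imm_dat / .imm_gcc setting.
    return imm_dat_active, operand
```

So `wants_cog_address("@foo", True)` still yields a hub reference, and `wants_cog_address("&foo", False)` still yields a cog reference.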

I would like to propose that there be a

#define TC_PARSE_CONS_EXPRESSION(EXP,  NBYTES)  md_cons_expression(EXP)

in tc-propeller.h, and implement md_cons_expression to call expression() after parsing for #, @ or &. The TC_PARSE_CONS_EXPRESSION macro is invoked by directives such as LONG so it will also fix bug 60. The existing parse_src function can also call md_cons_expression instead of parsing the # prefix itself and flagging it locally.

Thanks for reading and considering! I'm thinking of doing this work myself but I have very little time and I want to get my own project done before the OPC, so I don't know if/when I'll get to this. If I end up implementing this change, would you consider integrating it in the main branch?

===Jac

Heater. · 2013-03-17 23:25

I'm not sure I can go along with classifying the Prop as a NUMA machine.

In NUMA any memory location you are accessing may be in memory local to you or far away in a space shared with others. That sounds a bit Prop like but remember in NUMA all memory lives in the same address space and is accessed the same way. Your code will run whether it is written for NUMA or not. It uses the same instructions to get at "near" memory or "far". Of course if your program takes advantage of the memory architecture by keeping its working set closer to the CPU doing the work on it then it will run much faster. This is much like the issues of writing efficient code with cached memory systems only with the extra hassle of thinking about parallel processes and how to divide and distribute the work.

By contrast a Prop COG has two totally different address spaces, COG and HUB, with different instructions to access them. You can think of it like the MIPS architecture where most operations are register to register and very fast but memory access requires special load and store instructions which may take a long time depending on whether your data is in one cache or another or main memory or swap space. So I see the Prop as a machine with 512 32 bit registers.

Where things get weird looking at the Prop that way is that it executes instructions from its registers rather than fetching them from main memory.

This seems to be a unique Propeller feature. Does anyone know of another CPU that does that?

That weirdness is what this thread is all about. Clearly GCC is expecting a "normal" computer with byte addressed memory. Anything else gets you into a world of confusion. Any kind of NUMA awareness on the part of a compiler would not really help here.

jac_goudsmit · 2013-03-18 07:03

Heater. wrote: »

That weirdness is what this thread is all about. Clearly GCC is expecting a "normal" computer with byte addressed memory. Anything else gets you into a world of confusion. Any kind of NUMA awareness on the part of a compiler would not really help here.

Since instructions mostly work on cog addresses, I see the Propeller as 8 computers that each have 2KB of memory organized in longs, with a relatively slow access method to a larger byte-addressed memory. When it's running machine language it sees itself that way. Spin as well as LMM and other non-cog memory models try to focus on the hub memory as main memory (and slow down the overall expected speed by doing so).

The Propeller is not unique in using longword-sized addressing: The TMS320Cxx series of DSP's also uses longword-addressed memory and there is a GCC for it. I understand that it was easier to use byte-addressing when porting Binutils and GCC for the Propeller, and that in most cases it makes more sense to use LMM or CMM or XMM for programming at a higher level, because that's exactly when you need the larger amount of memory in the form of the hub. And while I have no actual experience porting GCC, I think I have a good grasp of how compilers work and I understand how it makes sense to calculate e.g. offsets in memory in terms of bytes (not longs). Perhaps in hindsight, Eric should have seen that it would have been better to implement the @ operator in the first place to indicate in assembly source code that a hub address is needed, so that the # operator could have been reserved for cog mode addresses and be used in the same way as in spin-asm.

It all doesn't really matter how we got here and I think Eric (and David and Steve and everyone else who contributed) did an awesome job porting Binutils and GCC, and this is definitely not a personal attack on any of them. What matters is that pasm code for the spin compiler (particularly self-modifying code that uses instructions other than MOV(A) ) simply doesn't work the way that a programmer with spin-asm experience would expect. It fails silently and in an unexpected way, and workarounds such as the .cog_ram directive just aren't good enough in my opinion. I think we need to make it possible to program cog-mode PASM in the same way as spin-asm works, and I'm sure it's possible to do this, even without breaking GCC, see post #35 above.

I think that a working, compatible GAS is not too much for users / programmers to ask. I wish I had more time to work on it.

===Jac

Heater. · 2013-03-18 07:30

jac_goudsmit,

The TMS320Cxx series of DSP's also uses longword-addressed memory and there is a GCC for it.

Indeed there are many machines that fetch instructions and/or data as 32 bit longs. Generally they still work with byte addresses and require accesses to be aligned on 4 byte boundaries. Think MIPS, 68000, even some ARMs won't tolerate misaligned access.

I see the Propeller as 8 computers that each have 2KB of memory organized in longs,

Many others see it that way too. That leads to the interesting conclusion that the Prop's COGs don't have any registers, everything is done memory to memory. I guess it does not make much sense to fuss over the distinction.

ersmith · 2013-03-18 15:19

jac_goudsmit wrote: »

It's not too hard to keep GCC and the libraries from breaking by introducing a directive (e.g. ".imm_dat") to enable spin-asm compatible behavior. Obviously GCC doesn't use this so it won't break. By default the directive is off, and there should probably be a directive to counteract it, (e.g. ".imm_gcc").

Or maybe we should make .cog_ram do this. Although I wonder: what about .cog_ram doesn't work for you now?

You mentioned that you don't want to keep track of two addresses for each symbol but that's not really necessary: it's always possible to calculate the cog address if you know the hub address of the start of the cog. The linker already generates the Propeller-specific _load_start_* symbol so it knows where the code for a cog begins.

Good point... but I'm not sure the linker is organized in such a way that the _load_start_* symbols are available at points where we want to calculate expressions. I suspect that the assignment of section addresses is done after all expression evaluation (it almost has to be, since in many architectures the size of instructions can depend on the value of expressions).

If the ORG command gets implemented the same way as spin-asm, the assembler just needs to keep a list of all the ORG statements and where they occur.

ORG is a whole 'nother problem, since in binutils it's the linker that assigns addresses -- it almost doesn't matter what the assembler does. Now, we could have the assembler resolve all addresses for programs that don't make external references, but that's not much different from using bstc or spin2cpp to generate binary blobs.

Thanks for reading and considering! I'm thinking of doing this work myself but I have very little time and I want to get my own project done before the OPC, so I don't know if/when I'll get to this. If I end up implementing this change, would you consider integrating it in the main branch?

I fear this is going to be harder than you think -- PASM compatibility in GAS is something that many of us have wrestled with (and as you've seen, come up with unsatisfying solutions to). But I hope I'm wrong, and I'd be interested to see what you come up with! The ultimate decision on what goes into the release will lie with David (the project leader) and Parallax.

Thanks for thinking about this
Eric

jac_goudsmit · 2013-03-19 10:34

Heater. wrote: »

jac_goudsmit,

Indeed there are many machines that fetch instructions and/or data as 32 bit longs. Generally they still work with byte addresses and require accesses to be aligned on 4 byte boundaries. Think MIPS, 68000, even some ARMs won't tolerate misaligned access.

I know the 68000 will throw a bus exception when you access something that's not aligned on its own size, but unlike the TI DSP I mentioned, the 68000 bus is still byte-addressed. The TI DSP uses longword addressing very much in the same way as cog memory on the Propeller works: Address "x" is 32 bits and address "x+1" is the next 32 bits. I never worked with the GCC compiler for the TI DSP (I don't think it existed back then) but the TI compiler that I used in 1998/99 had some strange properties. For example, sizeof(char) was equal to sizeof(long) because by definition, sizeof(char) is one, but one address is enough to store a 32-bit integer too. I don't remember how strings were handled but I can imagine how hard it must have been to port GCC to it.

Many others see it that way too. That leads to the interesting conclusion that the Prop's COGs don't have any registers, everything is done memory to memory. I guess it does not make much sense to fuss over the distinction.

I agree, the distinction between "everything is a register" and "there are no registers" is not really important. I know the GCC compiler regards the cog as lots of registers: it doesn't let you take the address of a cog variable which conveniently circumvents probably the most important issue when the compiler was ported: How do you implement a pointer variable? The answer (and I agree with how this works) was to simply define a pointer in C/C++ as a location in hub memory.

That still doesn't take away the need for cog pointers when using GAS, though :-)

===Jac

jac_goudsmit · 2013-03-19 14:51

ersmith wrote: »

Or maybe we should make .cog_ram do this. Although I wonder: what about .cog_ram doesn't work for you now?

I had a .cog_ram directive in my code that I used for demonstrating this problem, and it didn't fix the problem of not dividing address #foo by 4 in a mov bar, #foo. I'm pretty sure I was using the original out-of-the-box Binutils but now I'm starting to doubt myself. o_O

Good point... but I'm not sure the linker is organized in such a way that the _load_start_* symbols are available at points where we want to calculate expressions. I suspect that the assignment of section addresses is done after all expression evaluation (it almost has to be, since in many architectures the size of instructions can depend on the value of expressions).

But as you say, a possible workaround would be to put a label at the start of what needs to be loaded into a cog, and then calculate the byte offset from that label and divide it by 4. It seems to me that if it can be done in source code, it can be done implicitly by the linker.
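In other words, the manual workaround amounts to the following arithmetic. This is a Python sketch with hypothetical names, not assembler syntax; the 512-long limit is the size of cog memory.

```python
COG_LONGS = 512  # a cog holds 512 longs (2KB)


def cog_address_from_label(symbol_hub_address, cog_start_hub_address):
    """Cog address of a symbol, given the hub address of a label placed
    at the start of the code that gets loaded into the cog."""
    offset = symbol_hub_address - cog_start_hub_address
    if offset < 0 or offset % 4 != 0:
        raise ValueError("symbol is not a long-aligned location after the start label")
    address = offset // 4
    if address >= COG_LONGS:
        raise ValueError("symbol lies beyond the end of cog memory")
    return address
```

For a start label at hub address 0x100 and a symbol at 0x120, the byte offset is 0x20 and the cog address is 8.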

ORG is a whole 'nother problem, since in binutils it's the linker that assigns addresses -- it almost doesn't matter what the assembler does. Now, we could have the assembler resolve all addresses for programs that don't make external references, but that's not much different from using bstc or spin2cpp to generate binary blobs.

I don't pretend to know how ORG works in GAS, but I think the usual meaning in any assembly language is "Hey Assembler, assume that the following code is loaded at address XXX". The meaning is NOT "Hey linker/loader, I want you to actually load this at address XXX". I don't think there would be a problem if the ORG instruction for the Propeller (which is not specifically implemented at this time) would work the same as for Spin asm, which is "Assume that this is loaded from address XXX". Many drivers like to move their code to a different location in the cog, so that they can reserve the top of the cog memory as a buffer that starts at address 0. I know several video drivers that do this. Steve's Propalyzer code also relocates itself so that it can fill the top of the cog with code that overwrites itself with its own results (mov $, INA). The most efficient way to do this in spin-asm is to use ORG. But of course if there is a remote possibility that ORG should have a meaning for hub memory, there's always the option of implementing an optional @ prefix to the ORG argument.

I fear this is going to be harder than you think -- PASM compatibility in GAS is something that many of us have wrestled with (and as you've seen, come up with unsatisfying solutions to). But I hope I'm wrong, and I'd be interested to see what you come up with! The ultimate decision on what goes into the release will lie with David (the project leader) and Parallax.

I think I hear a challenge :-)

===Jac

David Betz · 2013-03-19 16:50

jac_goudsmit wrote: »

I had a .cog_ram directive in my code that I used for demonstrating this problem, and it didn't fix the problem of not dividing address #foo by 4 in a mov bar, #foo. I'm pretty sure I was using the original out-of-the-box Binutils but now I'm starting to doubt myself. o_O

But as you say, a possible workaround would be to put a label at the start of what needs to be loaded into a cog, and then calculate the byte offset from that label and divide it by 4. It seems to me that if it can be done in source code, it can be done implicitly by the linker.

I don't pretend to know how ORG works in GAS, but I think the usual meaning in any assembly language is "Hey Assembler, assume that the following code is loaded at address XXX". The meaning is NOT "Hey linker/loader, I want you to actually load this at address XXX". I don't think there would be a problem if the ORG instruction for the Propeller (which is not specifically implemented at this time) would work the same as for Spin asm, which is "Assume that this is loaded from address XXX". Many drivers like to move their code to a different location in the cog, so that they can reserve the top of the cog memory as a buffer that starts at address 0. I know several video drivers that do this. Steve's Propalyzer code also relocates itself so that it can fill the top of the cog with code that overwrites itself with its own results (mov $, INA). The most efficient way to do this in spin-asm is to use ORG. But of course if there is a remote possibility that ORG should have a meaning for hub memory, there's always the option of implementing an optional @ prefix to the ORG argument.

I think I hear a challenge :-)

===Jac

You're certainly welcome to work on this but I would recommend that you not change anything in the generic parts of binutils, just the machine-specific modules. Any major changes in the way gas or ld work would make it more difficult to move to a new gcc release or to upstream our Propeller changes.

jac_goudsmit · 2013-03-20 11:16

David Betz wrote: »

You're certainly welcome to work on this but I would recommend that you not change anything in the generic parts of binutils, just the machine-specific modules. Any major changes in the way gas or ld work would make it more difficult to move to a new gcc release or to upstream our Propeller changes.

Of course.

===Jac

David Betz · 2013-03-20 11:58

jac_goudsmit wrote: »

Of course.

===Jac

Thanks! Let me know if you come up with something you'd like us to look at.

jac_goudsmit · 2013-04-13 10:12

Update/Bump:

As I understand, Eric made some improvements to the P2test branch to fix this. There now is a .pasm directive (to replace .cog_ram), and the @ operator was added. It appears that the immediate operator # now works correctly for all instructions, and as an extra feature, disassemblies (using objdump) also seem to work better than before.

Awesome!

Thanks Eric!

===Jac

SRLM · 2013-04-13 11:38

jac_goudsmit wrote: »

Update/Bump:

As I understand, Eric made some improvements to the P2test branch to fix this. There now is a .pasm directive (to replace .cog_ram), and the @ operator was added. It appears that the immediate operator # now works correctly for all instructions, and as an extra feature, disassemblies (using objdump) also seem to work better than before.

Awesome!

Thanks Eric!

===Jac

So, is this resolved for all cases?

jac_goudsmit · 2013-04-13 16:25

SRLM wrote: »

So, is this resolved for all cases?

The case where it matters is when assembling / compiling code for Cog mode, and it looks like my code is getting translated correctly all the way, including directives such as LONG.

My code doesn't need the @ or & operator but I did a brief test and they get apparently assembled correctly too: LONG &location translates to the cog memory address of location, and LONG @location gets translated to the hub memory address of location. I didn't test whether @location works in an instruction but in cog mode that shouldn't be necessary anyway. I also didn't test if &location gives a cog memory location regardless of whether .pasm was given.

Eric told me that when .pasm is given (or its command-line equivalent), it should be possible to throw spin files at the assembler and it will basically ignore everything outside DAT sections. I haven't tested this either.

What still doesn't work is that in inline assembly, an argument declared with "i" restriction (which expands to #value) won't work with a LONG (or presumably other) directive. So issue 60 is still open. In other words, LONG doesn't accept a spurious # at the beginning of its argument. This is fairly easy to work around so I'm not worried about it.

Bottom line: I didn't do a full test but it appears solved.

===Jac
