C Expressive Enough for Idiomatic PASM?

ersmith · 2011-08-18 19:45

RossH wrote: »

I suppose you could decide on a "minimum" and compile all your library code to use only that number of scratch registers - but for many C programs there is more library code than application code, so this would essentially negate the whole point of doing it in the first place!

Also, it would be difficult (perhaps impossible) to write an algorithm that could determine what would be the appropriate number of scratch registers to use in any particular case. The alternative would be to do this manually - i.e recompile your program several times using a different numbers of scratch registers, and then run it to see whether the code size/performance tradeoff suits your needs.

Very good points here, Ross. You do raise an interesting possibility though -- it certainly would be possible to make additional builtin registers available at compile time via a command line switch. Did you experiment with this much in Catalina? As you point out, it wouldn't help the libraries at all, but would ease some register pressure on the user code. But, again as you point out, it would become quite a burden to the programmer to try different switches to see the effects on code, so really only a limited selection of additional registers would be practical.

In gcc we're going to try to compromise a bit by allowing the user to put variables explicitly in cog ram, at least for the PASM mode. That's a lot harder to do in LMM mode, which is another complication I guess for any attempt to add registers dynamically at compile time.

Eric

Phil Pilgrim (PhiPi) · 2011-08-18 20:21

So it seems that GCC is stuck with a traditional n-accumulator-load/store programming model. While I understand and appreciate the motivation for standardizing on GCC and am not trying to rock the boat -- seriously -- are there other C compiler tools/builders/whatever that support a more relaxed programming model where the variables themselves can be treated as "accumulators", as is done in PASM?

-Phil

RossH · 2011-08-18 21:05

ersmith wrote: »

it certainly would be possible to make additional builtin registers available at compile time via a command line switch. Did you experiment with this much in Catalina?

Hi Eric,

A bit, but not using a a command line switch - I just recompiled the whole compiler each time (why not, when it only takes a few seconds?

).

However, the tradeoff is very different if you are using LMM/XMM since the kernel space is so much more valuable for other things - having too many scratch registers probably means moving significant functionality out of the kernel.

Ideally I wanted Catalina to have around 40 scratch registers (e.g. 32 integer registers, 8 floating point registers) but the XMM support requirements meant that 24 was as high as I could get to without a lot of other painful sacrifices. And really, the benefit for C programs beyond about 16 scratch registers seemed to be minimal anyway - both on code size and performance. The law of diminishing returns kicks in hard, especially for the type of C programs that we typically run on the propeller (which are not large or complex enough to really benefit from many more registers).

If you are targeting your cog-based C compiler at driver code, you certainly need your code to be fast - but driver code is not typically algorithmically complex - so I suspect GCC won't get much benefit from having lots of scratch registers either.

I will certainly be interested in the final outcome.

Ross.

RossH · 2011-08-18 21:16

Phil Pilgrim (PhiPi) wrote: »

So it seems that GCC is stuck with a traditional n-accumulator-load/store programming model. While I understand and appreciate the motivation for standardizing on GCC and am not trying to rock the boat -- seriously -- are there other C compiler tools/builders/whatever that support a more relaxed programming model where the variables themselves can be treated as "accumulators", as is done in PASM?

-Phil

Hi Phil,

The problem is really with C itself, not with any particular compiler implementation - the whole C language is specifically designed to fit well with "traditional" processor architectures.

There are many other unorthodox processor architectures that support variants of C (e.g. array processors) - it may be interesting to investigate those. But what you would end up with would not really be C, so you'd have to wonder if there was any real benefit.

Ross.

Phil Pilgrim (PhiPi) · 2011-08-18 21:37

Ross,

So I guess I'm still at a loss wondering why something like this can't still be optimized to the obvious single line of PASM:

        mov   r1,variablea
        mov   r2,variableb
        add   r1,r2
        mov   variablea,r1

-Phil

RossH · 2011-08-18 22:33

Phil Pilgrim (PhiPi) wrote: »
Ross,

So I guess I'm still at a loss wondering why something like this can't still be optimized to the obvious single line of PASM:
        mov   r1,variablea
        add   r1,variableb
        mov   variablea,r1
-Phil

Hi Phil,

You might be surprised! Let's assume for arguments sake that your line of code is a = a + b

I don't have the C language spec to hand, but from memory if the architecture is one that raises a signal on integer overflow, and you had the program set up to intercept that signal, then in fact code like this is both correct and cannot be further optimized! The compiler needs to use the temporary register r1 to evaluate the expression so that it does not leave a with the wrong value should an overflow occur while adding a and b.

I agree that the code is probably dumb in the case of two integer variables in cog ram, especially since I doubt any propeller C compiler would raise a signal on integer overflow (most C compilers don't ... but some do!). However, if the variables were type float instead of int, then most C compilers would do so (I think it may even be mandatory to do so).

Note that if the same line were written as a += b, then you would expect a to be overwritten even if an overflow occurred. So in that case the line could legally be optimized to:

        add   variablea,variableb

Ross.

Heater. · 2011-08-19 00:04

RossH,

Brilliant, in all the years of dinking around with C I have never heard of getting a signal out of arithmetic overflow. But then I'm not one for burying by head in language specs.

Turns out you can do it with GCC http://tinymicros.com/wiki/Linux_Overflow_Signal_Handler_Example
Works on Intel and ARM here.

However you are mistaken on one point "if the same line were written as a += b, then you would expect a to be overwritten even if an overflow occurred"

I experiment with the code on that link and sure enough the following code generates the same overflow signal:

int a = INT_MAX;
  int b = 1;
  installSignalHandlers ();
  a += b;

Turns out GCC inserts a function call to do the addition, check for overflow and raise software interrupt or whatever. I guess the CPU does not raise such exceptions in hardware, like divide by zero for example, so this would be awful slow to use except when debugging/testing.

Also if the hardware did generate these exceptions the code could be optimized the same way Phil suggests because the exception would be raised before the result was written back. If the checking is done via a function call for addition the same applies.

Anyway I really don't think we should expect C code running in COG to be able to do any of that.

Whilst I'm here, who cares about libraries for in COG code? With only 496 instructions you can fill that with a page or two of source. Any common routines you need can be cut and pasted into that source. Not saying libs should not be supported but doing so should not impact the optimizations/register usage of the code generator.

RossH · 2011-08-19 02:09

Heater. wrote: »

RossH,

However you are mistaken on one point "if the same line were written as a += b, then you would expect a to be overwritten even if an overflow occurred"

Well, I it is possible I am mistaken, but I don't believe so. If a platform implements integer overflow (it doesn't have to) then the oveflow signal should occur in both cases. The difference is that if you read the value of a when you are in the signal handler caused by an overflow in the statement a+= b, then I believe it would be expected for a to contain the overflowed value (if you think about it, on most architectures this would have to be the case, otherwise the overflow has not actually occurred, and cannot be detected!). But in the signal handler caused by an overflow in the statement a = a + b, variable a must contain the original value of a, and not the overflowed value - this is why the original statements as posted by Phil cannot be optimized further.

Heater. wrote: »

Whilst I'm here, who cares about libraries for in COG code? With only 496 instructions you can fill that with a page or two of source. Any common routines you need can be cut and pasted into that source. Not saying libs should not be supported but doing so should not impact the optimizations/register usage of the code generator.

Well, I agree about the page or two of source being as much as you're likely to be able to fit, but why no libraries? They don't add any complexity to the code. If you plan on leaving them out, you may as well also leave out function calls altogether - after all, who needs them if you can only fit a page or two in each program?. For that matter, who needs recursion? Or stack variables? Or pointers? Or arrays? Or signals? Or a non-local goto?

Throw them all out by all means! Just don't call the resulting language C

Ross.

RossH · 2011-08-19 02:13

Heater. wrote: »

RossH,

However you are mistaken on one point "if the same line were written as a += b, then you would expect a to be overwritten even if an overflow occurred"

Well, I it is possible I am mistaken, but I don't believe so. If a platform implements integer overflow (it doesn't have to) then the oveflow signal should occur in both cases. The difference is that if you read the value of a when you are in the signal handler caused by an overflow in the statement a+= b, then I believe it would be expected (or at least acceptable) for a to contain the overflowed value (if you think about it, on most architectures this would have to be the case, otherwise the overflow has not actually occurred, and cannot be detected!). But in the signal handler caused by an overflow in the statement a = a + b, variable a must contain the original value of a, and not the overflowed value - this is why the original statements as posted by Phil cannot be optimized further.

GCC may treat these cases identically, and that may be an acceptable implemention - but it is only one possible interpretation (and a rather strange one!). But on platforms where the overlow is detected in hardware the two cases would generally be different, and be as I have described.

Heater. wrote: »

Whilst I'm here, who cares about libraries for in COG code? With only 496 instructions you can fill that with a page or two of source. Any common routines you need can be cut and pasted into that source. Not saying libs should not be supported but doing so should not impact the optimizations/register usage of the code generator.

Well, I agree about the page or two of source being as much as you're likely to be able to fit, but why no libraries? They don't add any complexity to the code. If you plan on leaving them out, you may as well also leave out function calls altogether - after all, who needs them if you can only fit a page or two in each program?. For that matter, who needs recursion? Or stack variables? Or pointers? Or arrays? Or signals? Or a non-local goto?

Throw them all out by all means! Just don't call the resulting language C

Ross.

jmg · 2011-08-19 02:46

Heater. wrote: »

Whilst I'm here, who cares about libraries for in COG code? With only 496 instructions you can fill that with a page or two of source. Any common routines you need can be cut and pasted into that source. Not saying libs should not be supported but doing so should not impact the optimizations/register usage of the code generator.

I've seen C libraries that are effectively in-line functions.

They exist as libraries so they are easy to manage, but they run just like pasted code.

RossH · 2011-08-19 02:54

jmg wrote: »

I've seen C libraries that are effectively in-line functions.

They exist as libraries so they are easy to manage, but they run just like pasted code.

Exactly! In fact, this is what the Catalina Optimizer does - if a function can be inlined, it is inlined - whether it is a library function or a user-defined function.

This means there is no overhead for writing your program as lots of small library functions. And it can make your code a lot easier to read, maintain and reuse!

Ross.

ersmith · 2011-08-19 04:15

Phil Pilgrim (PhiPi) wrote: »
So I guess I'm still at a loss wondering why something like this can't still be optimized to the obvious single line of PASM:
        mov   r1,variablea
        mov   r2,variableb
        add   r1,r2
        mov   variablea,r1

Actually it can, but it depends on the circumstances. If the variables are in hub memory then there's no way to access them directly except by loading them into a cog register. And as Ross points out if overflow detection is enabled then there are other issues (that's not normally the case for C, but it is for other languages or for some compiler options).

The current gcc prototype for the propeller will take the C function:

#define _COGMEM __attribute__((cogmem))
#define _NATIVE __attribute__((native))

_COGMEM int a;
_COGMEM int b;

_NATIVE void update(void)
{
    a = a + b;
}

and produce the result:

_update
    add    _a, _b
_update_ret
    ret
_b
    long    0
_a
    long    0

I don't think it gets much more idiomatic than that :-). Note that we do have to give some hints to the compiler, though. The _COGMEM define is to tell it to put a and b into cog registers; without that they will go in hub memory. And the _NATIVE define tells it we want a function that uses the standard call/ret PASM calling convention. Note that such functions cannot be recursive, so this isn't the default for C.

Eric

RossH · 2011-08-19 05:16

ersmith wrote: »
The current gcc prototype for the propeller will take the C function:
#define _COGMEM __attribute__((cogmem))
#define _NATIVE __attribute__((native))

_COGMEM int a;
_COGMEM int b;

_NATIVE void update(void)
{
    a = a + b;
}
and produce the result:
_update
    add    _a, _b
_update_ret
    ret
_b
    long    0
_a
    long    0
I don't think it gets much more idiomatic than that :-).

Eric,

In C, an int is a signed type, so I believe the instruction should be adds _a, _b

Catalina doesn't have cog variables of course, but when I compile a similar code fragment:

   register int a;
   register int b;

   a = 1;
   b = 2;
   a = a + b;

I get:

mov r17, #1
 mov r16, #2
 adds r17, r16

Ross.

ericball · 2011-08-19 08:19

Is there a particular reason why the register keyword won't suffice instead of the cogmem attribute?

ersmith · 2011-08-19 08:38

RossH wrote: »

Eric,

In C, an int is a signed type, so I believe the instruction should be adds _a, _b

Ross

Unless I'm mistaken the only difference between add and adds is in how the Z and C flags are set. Since the flags aren't used, there's no need to set them, so add is fine here. adds would be a little clearer in the PASM code, but normally gcc only checks for signedness when the result is actually compared.

Eric

ersmith · 2011-08-19 08:42

ericball wrote: »

Is there a particular reason why the register keyword won't suffice instead of the cogmem attribute?

Unfortunately the "register" keyword only allows you to put variables into predefined registers; it doesn't allow declaration of new registers. Changing this behavior would involve changing the machine independent parts of gcc. It would be a nice idea (I agree it would look much nicer and seem more natural) but it's a lot of work compared to using machine specific attributes.

Global and static register variables are not allowed by the C standard, so whatever we use will be a compiler specific extension anyway.

Eric

RossH · 2011-08-19 16:48

ersmith wrote: »

Unless I'm mistaken the only difference between add and adds is in how the Z and C flags are set. Since the flags aren't used, there's no need to set them, so add is fine here. adds would be a little clearer in the PASM code, but normally gcc only checks for signedness when the result is actually compared.

Eric

Yes, you're right. It only matters if you intend to use the carry/overflow bit.

Ross.

RossH · 2011-08-19 18:12

ersmith wrote: »

Unfortunately the "register" keyword only allows you to put variables into predefined registers; it doesn't allow declaration of new registers. Changing this behavior would involve changing the machine independent parts of gcc. It would be a nice idea (I agree it would look much nicer and seem more natural) but it's a lot of work compared to using machine specific attributes.

Global and static register variables are not allowed by the C standard, so whatever we use will be a compiler specific extension anyway.

Eric

Hi Eric,

I also get asked the "register" question on Catalina, but there the answer is a bit more obvious - Catalina has hub-based variables, stack-based variables and cog-based variables (i.e. registers), so it is clear that the register keyword should retain its more traditional meaning of selecting between stack-based and hub-based (not between hub-based and cog-based). To do otherwise would confuse and annoy most C programmers.

But PASM has only hub-based and cog-based variables. So if the cog-based version of GCC is primarily intended to be a PASM replacement (e.g. to be used to allow drivers to be written in C), then you may also only have cog based and hub-based variables. In that case it would be perfectly logical to use the register keyword to select between cog-based and hub-based.

It would also be in accordance with the C standard - the register keyword is really intended to just give the compiler a hint to make access to the variable "as fast as possible" (that's what it actually says!). On the propeller, which has effectively 496 registers, it would be quite logical to have this mean to make the variable cog-based instead of hub-based.

It would also solve another thorny problem, - i.e. that the propeller has no real concept of the "address" of a cog location - but the C standard explicitly says that you cannot take the address of a variable in a register!

Two birds with one stone!

Ross.

jmg · 2011-08-19 19:31

RossH wrote: »

It would also solve another thorny problem, - i.e. that the propeller has no real concept of the "address" of a cog location

? - Did you mean from other COGs ?
At the asm level, the prop certainly does know about the address of a cog location, and I could see that being used in C, for low level apps...

You could of course, use this ?

#define _register __attribute__((cogmem))

_register int a;

or maybe 

#define COG_register __attribute__((cogmem))

COG_register int a;

RossH · 2011-08-19 20:00

jmg wrote: »

? - Did you mean from other COGs ?
At the asm level, the prop certainly does know about the address of a cog location, and I could see that being used in C, for low level apps...

No, I mean there is no way to have one cog location "point" to another cog location in the same cog. You can of course pretend that a cog value is the index of a cog location, and then manipulate the source and destination fields of an instruction (which is what we all do) but that's not quite the same thing. For one thing, there is no way to differentiate between a cog address of (say) $400 and a hub address of $400 - at least not without artificially introducing some flag bits in the upper (unused) bits. Messy!

Ross.

ersmith · 2011-08-19 21:25

RossH wrote: »

But PASM has only hub-based and cog-based variables. So if the cog-based version of GCC is primarily intended to be a PASM replacement (e.g. to be used to allow drivers to be written in C), then you may also only have cog based and hub-based variables. In that case it would be perfectly logical to use the register keyword to select between cog-based and hub-based.

It would also be in accordance with the C standard - the register keyword is really intended to just give the compiler a hint to make access to the variable "as fast as possible" (that's what it actually says!). On the propeller, which has effectively 496 registers, it would be quite logical to have this mean to make the variable cog-based instead of hub-based.

Yes, but unfortunately the C standard also is quite explicit that the register keyword is only permitted for function local variables (ISO/IEC 9899:1999 section 6.9); it also conflicts with the static keyword. This makes it much less useful for declaring cog variables, since normally we would want such variables to be persistent beyond a single function invocation. Indeed, function local variables are already (normally) placed in cog memory, in the predefined registers.

For functions with a lot of local variables (more than the predefined registers) it might be useful to be able to force them to be allocated in cog registers. But a more general solution for that case is to provide a compiler switch or pragma that extends the number of predefined registers -- that would allow all functions to share the allocated space.

Eric

ersmith · 2011-08-19 21:35

jmg wrote: »

? - Did you mean from other COGs ?
At the asm level, the prop certainly does know about the address of a cog location, and I could see that being used in C, for low level apps...

Allowing pointers to cog locations opens up a big can of worms. Ross has already pointed out the basic issues -- the address space overlaps with hub memory addresses, and self modifying code is required to dereference such pointers. Other problems: the addresses have to increment differently (in hub memory int x[2] is *(x + 8), in cog memory it would be *(x + 2)), char and short addresses are not well defined, accessing bytes and shorts requires masking, and writing bytes and shorts requires read-modify-write. I wouldn't say these problems are insurmountable, but they are certainly very difficult. I think for now treating cog based variables like registers (so no arrays in cog memory, and no pointer dereferencing) is probably the best approach. That's what we're doing for the propeller gcc port -- when it sees the cogmem attribute it checks the variable type to eliminate arrays, and sets some bits internally to forbid taking the address of the variable.

I should point out here that the compiler is still a work in progress, so we may end up changing the particular syntax of how variables are placed in cog memory (maybe instead of an attribute there will be some other gcc extension). The semantics aren't very likely to change, at least not for the first release.

Eric

RossH · 2011-08-20 01:34

ersmith wrote: »

...
Indeed, function local variables are already (normally) placed in cog memory, in the predefined registers.

For functions with a lot of local variables (more than the predefined registers) it might be useful to be able to force them to be allocated in cog registers. But a more general solution for that case is to provide a compiler switch or pragma that extends the number of predefined registers -- that would allow all functions to share the allocated space.

Eric

Hi Eric,

I didn't realize you intended to have predefined registers - so yes, you would want the register keyword to be able to specify their use. The rest of your post also implies you will also have stack-based local variables - I assumed you would not have those either, since they are so costly to access that they would really start to eat up your 496 available instructions. If all the data is on the stack, a single very trivial looking line of C code can easily take a dozen PASM instructions. This means that each cog-based C program may only be able to have around 40 such lines of C code!

However, asuming you do want this capability, then I can see that you have little alternative but specifying each variable's actual location via addiitonal GCC attributes - there just aren't enough available C storage class specifiers.

Ross.

dnalor · 2011-08-20 13:06

For variables in HUB-Memory I would expect to have it to declare as "far", because it is another address range. All others ("near") are in COG-Memory.

jmg · 2011-08-20 14:57

ersmith wrote: »

That's what we're doing for the propeller gcc port -- when it sees the cogmem attribute it checks the variable type to eliminate arrays, and sets some bits internally to forbid taking the address of the variable.

So anyone wanting arrays, will need to use in line ASM ?
Will ASM be able to reference/access the address ok ?

RossH · 2011-08-20 16:36

jmg wrote: »

So anyone wanting arrays, will need to use in line ASM ?
Will ASM be able to reference/access the address ok ?

Hi jmg,

Not necessarily - perhaps Eric just means the array itself couldn't be declared in cog ram. A pointer to the array could be.

Ross.

RossH · 2011-08-20 16:38

dnalor wrote: »

For variables in HUB-Memory I would expect to have it to declare as "far", because it is another address range. All others ("near") are in COG-Memory.

dnalor,

The "near" and "far" style addressing is slightly different - both a near and far address can refer to the same location. On the propeller, they would refer to separate address spaces.

Ross.

ersmith · 2011-08-21 06:18

jmg wrote: »

So anyone wanting arrays, will need to use in line ASM ?
Will ASM be able to reference/access the address ok ?

Arrays will have to be stored in hub memory. In fact by default all data will be stored in hub memory, so if you just declare "int x" or "int a[9]" these will be declaring hub variables. To put "x" in cog memory you'll have to give it a special attribute. There will be no way (at least in the first release) to put an array like "a" into cog memory.

Yes, it will be possible to create/access arrays in cog memory using inline assembly. We'll probably create some macros for doing this, so everyone doesn't have to re-invent the wheel.

Note, though, that if you think about it arrays in cog memory are not any more efficient than arrays in hub memory. The sequence to access a pointer in hub memory (r = *x) would look something like:

rdlong r,x

and it takes 8 cycles (if we're aligned on the hub window). The same sequence for accessing a cog ram variable would look like:

movs :temp,x
nop
:temp mov r,0-0

and it takes 12 cycles. So the hub memory code will actually be faster, if it can be kept aligned to the hub window access. gcc tries to do this; it can't always succeed, but keeping track of tedious things like instruction timings is something computers are actually pretty good at.

Eric

Phil Pilgrim (PhiPi) · 2011-08-21 08:19

Eric,

'Good point about comparative array access efficiency. The disparity certainly exists for word and byte arrays. I believe it's a push for in-cog long arrays, though, because the array index does not have to be shifted to compute the address, as it does for hub arrays, saving another four cycles. Still, though, if gcc can sort things to hit the sweet spot, there's little reason to clog precious cog memory with arrays and constant tables. The only exception would be that, if the same code is used in multiple cogs, arrays other than constant tables would have to be replicated in the hub, which would waste space over having them in the cogs.

-Phil

Kevin Wood · 2011-08-21 13:23

>>> The only exception would be that, if the same code is used in multiple cogs, arrays other than constant tables would have to be replicated in the hub, which would waste space over having them in the cogs.

I'm not clear on this, but why would you need to replicate the arrays? Wouldn't the "same" code in say 3 cogs just r/w to a single array?

Edit: Well, okay, lexical scoping, but global vars really aren't that bad.

However, if it's as you say, it could be a compiler option to choose where the array is allocated, choosing to preserve hub space over shorter access times.

C Expressive Enough for Idiomatic PASM?

Comments