Is there a memory saving when using a short in Prop C? (Answered)

mikeologist Posts: 337
edited 2017-12-10 18:18 in Propeller 1
My question is based on a few concepts:
1) Prop C compiles to Prop ASM
2) Prop ASM only uses longs, the byte and word are structures of SPIN

Given that I have those correct (hope I do), does Prop C have an underlying structure that allows it to store two shorts in the same long?
If so, can I guarantee this benefit by order of declaration and/or malloc?
I'm trying to save space on my flags and sentinels, would be nice if there was a boolean or bit data type.

Comments

  • Heater. Posts: 21,230
    Whilst PASM operates on 32 bit words it does have RDBYTE, RDWORD, RDLONG, WRBYTE, WRWORD, WRLONG. So as far as accessing data in HUB it can deal with 8, 16 and 32 bit sizes.

    Given that your C program is using data from HUB it will use the correct instructions to access char, short etc.

    If you define a bunch of shorts in your C program they will get packed efficiently into HUB memory and you will save space over using regular int etc.
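    For instance, here is a minimal sketch (assuming prop-gcc's usual 8/16/32-bit sizes for char, short and int) that lets you check how much hub RAM each choice actually costs:

    #include <stdio.h>

    short flags[8];      /* 8 x 2 bytes = 16 bytes of hub RAM */
    int   counts[8];     /* 8 x 4 bytes = 32 bytes of hub RAM */

    int main(void) {
      printf("short=%d int=%d flags[]=%d counts[]=%d\n",
             (int)sizeof(short), (int)sizeof(int),
             (int)sizeof(flags), (int)sizeof(counts));
      return 0;
    }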

  • A quick check of the code shows that each value takes up its correct space in memory. Having said that, all data is word aligned, so there are holes in memory when defining chars.

    Even though two chars are defined next to each other, they are always word aligned and take up 2 bytes each.

    So since shorts use two bytes they would use less space than "int" or "long".

    If you use a structure they are packed structures so they would also use less memory.

    The CMM memory model is interpreted assembly, whereas the LMM memory model builds straight assembly code. Use the latter to get processor speed and timing with your code, but it uses more memory.

    Mike
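    A rough sketch of that difference (exact padding depends on the ABI, but sizeof will tell you what you actually got):

    #include <stdio.h>

    /* Standalone chars may each end up occupying a whole aligned slot. */
    char flag_a;
    char flag_b;

    /* Grouped in a struct, the chars occupy adjacent bytes. */
    struct flags_t {
      char  a;
      char  b;
      short c;      /* two chars plus a short typically fit in 4 bytes */
    };

    int main(void) {
      printf("struct flags_t is %d bytes\n", (int)sizeof(struct flags_t));
      return 0;
    }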


  • evanh Posts: 15,192
    The difference in addressing between Hub byte wide and Cog long-word wide is a potential point of confusion.

    I tripped up on this recently myself when using the Hub as a mailbox between multiple Cogs. In my Prop2 testing I forgot that the addressing offset had to be multiplied by 4 to correctly space out a table of long words in HubRAM. The writing routine used a basic add to build the pointer, while the reading routine used a PTRA++. The reading routine was automatically scaled for the data size being read, i.e. post-incrementing by 4 for each read, whereas the pre-calculated write address was not auto-scaled.

    Took me a while to work out why the data didn't look like it was being written at all.
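    The same distinction shows up in C, where plain pointer arithmetic is auto-scaled but a hand-built byte address is not (a small illustrative sketch, not the actual mailbox code):

    #include <stdint.h>

    int32_t mailbox[16];                /* a table of longs in hub RAM */

    int32_t *slot_scaled(int i) {
      return mailbox + i;               /* compiler scales the offset by 4 */
    }

    int32_t *slot_by_hand(int i) {
      /* building the byte address yourself: the *4 is your problem */
      return (int32_t *)((uint8_t *)mailbox + i * 4);
    }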
  • Heater. Posts: 21,230
    iseries,
    Even though two chars are defined next to each other, they are always word aligned and take up 2 bytes each.
    I hope that is not true for strings and arrays of bytes.

    I don't have the means to check at the moment.
  • evanh Posts: 15,192
    edited 2017-12-10 12:54
    No, not true. Hub addressing is byte-sized and RD/WRBYTE can pick/place in byte-by-byte increments.

    EDIT: That said, iseries is talking about a C implementation, and he also states that "If you use a structure they are packed structures so they would also use less memory."

    So, tables of bytes at least will pack efficiently even if individual byte variables don't.

    Heater and I were both talking from the view of the Cogs rather than Prop C.
  • mikeologist wrote: »
    My question is based on a few concepts:
    1) Prop C compiles to Prop ASM
    2) Prop ASM only uses longs, the byte and word are structures of SPIN

    Given that I have those correct (hope I do), does Prop C have an underlying structure that allows it to store two shorts in the same long?
    If so, can I guarantee this benefit by order of declaration and/or malloc?
    I'm trying to save space on my flags and sentinels, would be nice if there was a boolean or bit data type.

    I did some tests in the past with sample programs using int, short, or byte variables; sometimes you save some space and other times you increase the code size. While it may be true that the data section gets smaller, the code section may grow because conversion or additional code is necessary. For example, if you use signed shorts, internally they may need to be converted to signed longs, which requires bit shifting to extend the sign. With unsigned variables you may actually save something using short or byte because they never need conversion.

    My conclusion is that it depends on what the code does, and generally it is not worth the effort unless absolutely necessary.
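    To illustrate the signed/unsigned point (a hedged sketch; the actual instruction counts depend on the compiler version and optimization level):

    /* A signed short read from hub RAM typically has to be sign-extended
       to 32 bits before use, costing extra instructions (e.g. a shift
       left / arithmetic shift right pair). */
    int add_signed(int x, short y)            { return x + y; }

    /* An unsigned short only needs its upper bits cleared, which the
       16-bit hub read normally does already, so no extra code. */
    int add_unsigned(int x, unsigned short y) { return x + y; }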
  • Heater. Posts: 21,230
    edited 2017-12-10 15:38
    I was talking from the point of view of COGS, HUB and prop-gcc.

    Sadly I don't have a prop-gcc installed here to verify any claims I make.

    macca is right as far as I recall. Changing a 32-bit value to a 16-bit one might incur the penalty of increased code size. It might not be worth it for a few variables, but it should be a win if you have arrays.

    There is nothing like building the code to find out...

  • Heater. wrote: »
    Whilst PASM operates on 32 bit words it does have RDBYTE, RDWORD, RDLONG, WRBYTE, WRWORD, WRLONG. So as far as accessing data in HUB it can deal with 8, 16 and 32 bit sizes.

    Given that your C program is using data from HUB it will use the correct instructions to access char, short etc.

    If you define a bunch of shorts in your C program they will get packed efficiently into HUB memory and you will save space over using regular int etc.

    That is excellent info. So I'll just leave that to the compiler.

    I'm getting into making true abstract data structures for the Propeller in C.
    This will help. Thanks a bunch!
  • macca wrote: »
    I did some tests in the past with sample programs using int, short, or byte variables; sometimes you save some space and other times you increase the code size. ...

    My conclusion is that it depends on what the code does, and generally it is not worth the effort unless absolutely necessary.
    Heater. wrote: »
    macca is right as far as I recall. Changing a 32-bit value to a 16-bit one might incur the penalty of increased code size. It might not be worth it for a few variables, but it should be a win if you have arrays.

    There is nothing like building the code to find out...

    This is spectacularly sound advice. I will experiment with each case.

    I can view the hex by opening my compiled file in SPIN; is there a way to view the compiled assembly code in between?
  • Heater. Posts: 21,230
    You can use objdump with the -d flag to disassemble the executable and display the PASM.

    There should be a prop-gcc objdump in whatever directory your compiler is installed into.

    Sorry I forget the exact details now. Google will help.
  • Heater. wrote: »
    You can use objdump with the -d flag to disassemble the executable and display the PASM.

    There should be a prop-gcc objdump in whatever directory your compiler is installed into.

    Sorry I forget the exact details now. Google will help.

    I googled it and found this page

    In cmd I navigated to my projects folder and ran:
      propeller-elf-objdump -d -S Stack.elf | propeller-elf-c++filt --strip-underscore > Stack.dump
    and it spit it right out.

    Now, I've got a whole new host of questions. I will read and try to answer them myself, but I do have one big one that I must ask. Since this runs as Prop ASM, according to the above-linked site it carries an entire, albeit minimal, C kernel.
    Can I minimize this in memory by compiling all of my C for the entire project into a single file and dispatching the necessary bits using the provided multi-cog library? And can I remove the unused bits from my final dump file and then finish compiling that into the SPIN hex file?

    Thank you all for taking the time :smile:
  • Heater. Posts: 21,230
    I'm not sure I understand your question.

    Typically your entire project is compiled into a single file. All loaded to the Prop at once.

    Starting COGs is done much the same as it is with Spin. Point a COG at some code to run and start it.

    There might be a question about pruning out unused functions, in your code or in library code. I don't recall how that worked out just now.
  • Heater. wrote: »
    I'm not sure I understand your question.

    Typically your entire project is compiled into a single file. All loaded to the Prop at once.

    Starting COGs is done much the same as it is with Spin. Point a COG at some code to run and start it.

    There might be a question about pruning out unused functions, in your code or in library code. I don't recall how that worked out just now.

    The question is inspired by the line:
    main.dump Full disassembly dump of the completely-linked program, including the C kernel and runtime
    in the table near the top of the article that I linked.

    I assume this means that Prop ASM has no facilities other than the provided instructions, which is the nature of all ASM code regardless of the platform. Forgive me, I'm new to ASM. As such, any all-C program must carry with it every facility it needs beyond the bare-metal instructions. While this is presumably true for all ASM, it is very different from writing C programs on Windows or Linux, where the OS provides the kernel and all the associated facilities, internal or external.
    To that point, if this gcc application has been designed to implement a kernel with the program, I am willing to bet that there are unused, general-purpose portions of that kernel that could be removed by the old guess-and-check method.

    It also seems that this kernel should only exist once and, though it's not the best coding practice, I'd imagine everything needs to be compiled into a single program with all sub-programs loaded via the multi-cog library, as opposed to writing an individual C program for each cog.
  • Heater. Posts: 21,230
    Hmm...

    In a typical C program you have:

    A bunch of C language source files
    -> Which get compiled into a bunch of assembly language files ( you may never see those)
    -> Which get assembled into a bunch of object files (Actual binary instructions, you may never see those either)
    -> Which get linked into an executable binary file.

    Now, that executable may need functions from libraries to be loaded at run time. Not applicable to prop-gcc as there is no dynamic library support.
    It may make calls to an operating system. Not applicable to prop-gcc as there is no operating system.

    The "kernel" referred to there is not any operating system kernel, like Linux, but rather the LMM kernel that is required to run the instructions of your C program from HUB memory.

    What is LMM?

    Well, a Propeller cannot execute code from HUB memory. But some years ago Bill Henning proposed a nice, fast way for a PASM program in a COG to load instructions from HUB, one at a time, and execute them. That opened the way to prop-gcc, Catalina, and other C compilers. The "kernel" is just that little bit of code required to run your C instructions from HUB memory.

    Yes, all the code you want to run on whatever COGS will be contained in that single executable binary file.

    Now, the question is, if there are functions in my source code or in a library that are never used, are they included in the final executable binary?




  • Heater. wrote: »
    Yes, all the code you want to run on whatever COGS will be contained in that single executable binary file.

    Now, the question is, if there are functions in my source code or in a library that are never used, are they included in the final executable binary?

    You produce excellent answers, thank you.
    That answers a few of my other questions that I was reading up on (re: actual memory location, kernel function, et al.).

    So there's probably little to nothing that can be removed from such a simple kernel.

    As for whether the compiler includes unused code: I did a bit of reading and yes, it looks as though gcc can be set to optimize out all unused code. On this page the answers say yes.
    The second answer says: "GCC/binutils can do this if you compile with -ffunction-sections -fdata-sections and link with --gc-sections." and each section is linked to the gcc page.
    In checking if this is used in the SimpleIDE I now see the "Enable Pruning" option.
    From the Prop C manual:
    Enable Pruning: Enable this option to have the compiler and linker remove unused code from
    the program image. This option saves the most space in projects using non-optimized libraries;
    Propeller GCC uses optimized libraries, so the savings are usually minimal with this option
    enabled.
    I assume, hopefully correctly, that these are the same functionality? (A short sketch of what those options act on follows at the end of this post.)

    So, looks like I just need to use structs and "Enable Pruning" to really cram in the bits.
    I think I'll test "Enable Pruning" to be sure it gets everything.

    This does pose a new question on the same topic, though, and it's one I will try to answer on my own. When the answer specified "GCC/binutils" this includes objdump, meaning I can optimize out excess code in my dump file for review. Pretty sure the answer is yes. I grew up on Turbo C and switched to C++ Builder a little after the turn of the century. Guess I've got a lot of reading to do on gcc to catch up.
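    Here is a minimal sketch of what those pruning options act on (the file and function names are just for illustration):

    /* util.c -- compiled with  -ffunction-sections -fdata-sections
       and linked with          -Wl,--gc-sections
       so each function gets its own section and the linker can drop
       any section that nothing references. */

    int used_helper(int x) {        /* referenced somewhere: kept */
      return x * 2;
    }

    int never_called(int x) {       /* no callers anywhere: garbage-collected */
      return x * 3;
    }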
  • Heater. Posts: 21,230
    No, do tell, what is the new question?

    C++ is a whole other bucket of worms. But used wisely it can help organize your code without adding any overhead in space or time.

    It works for all those 8 bit Arduino users after all.
  • mikeologist Posts: 337
    edited 2017-12-11 03:35
    When the answer specified "GCC/binutils" this includes objdump, meaning I can optimize out excess code in my dump file for review. Pretty sure the answer is yes.

    My fault for not putting it as a question. The question was, is objdump compatible with these optimization flags? I assume the answer is yes because they state GCC & binutils, of which I assume objdump is one.

    I ask because I know a lot of languages and I think my goals will require me to learn both Prop C and Prop ASM well. That being the case, I plan on using objdump to make my own Rosetta Stone, using C concepts with which I'm already very familiar.

  • Heater. Posts: 21,230
    edited 2017-12-11 07:37
    Ah yes. That is a time honored technique. Compile some little C functions and have a look at the assembler version that the compiler produces. This can be very useful when you want to get to know an unfamiliar assembly language or instruction set. It's also very useful if you want to write a function in assembler that will be called from C and you need to know how C will be passing the parameters into it and returning the result.

    It's also useful sometimes to check that the C compiler has optimized things nicely.

    There is a better way than using objdump. Just use gcc to compile the source into assembler output rather than going all the way to object files and linking a final executable.

    Like so:

    $ gcc -S -o myfunc.s myfunc.c

    Which compiles this:
    int myfunc (int x, int y) {
      return x + y;
    }
    
    into this assembly language file:
            .file   "myfunc.c"
            .text
            .globl  myfunc
            .type   myfunc, @function
    myfunc:
    .LFB0:
            .cfi_startproc
            pushq   %rbp
            .cfi_def_cfa_offset 16
            .cfi_offset 6, -16
            movq    %rsp, %rbp
            .cfi_def_cfa_register 6
            movl    %edi, -4(%rbp)
            movl    %esi, -8(%rbp)
            movl    -4(%rbp), %edx
            movl    -8(%rbp), %eax
            addl    %edx, %eax
            popq    %rbp
            .cfi_def_cfa 7, 8
            ret
            .cfi_endproc
    .LFE0:
            .size   myfunc, .-myfunc
            .ident  "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609"
            .section        .note.GNU-stack,"",@progbits
    
    That is x86 but prop-gcc does much the same.
  • Heater. wrote: »
    There is a better way than using objdump. Just use gcc to compile the source into assembler output rather than going all the way to object files and linking a final executable.

    That gives me the occasion to show the difference between using int or short; the above function compiled with prop-gcc gives this output:
    .text
    	.balign	4
    	.global	_myfunc
    _myfunc
    	sub	sp, #4
    	wrlong	r14, sp
    	mov	r14, sp
    	sub	sp, #8
    	mov	r7, r14
    	sub	r7, #8
    	wrlong	r0, r7
    	mov	r7, r14
    	sub	r7, #4
    	wrlong	r1, r7
    	mov	r7, r14
    	sub	r7, #4
    	mov	r6, r14
    	sub	r6, #8
    	rdlong	r6, r6
    	rdlong	r7, r7
    	add	r7, r6
    	mov	r0, r7
    	mov	sp, r14
    	rdlong	r14, sp
    	add	sp, #4
    	lret
    

    If the second parameter is declared as short, the code increases by 3 instructions:
    int myfunc (int x, short y) {
      return x + y;
    }
    
    	.text
    	.balign	4
    	.global	_myfunc
    _myfunc
    	sub	sp, #4
    	wrlong	r14, sp
    	mov	r14, sp
    	sub	sp, #8
    	mov	r7, r14
    	sub	r7, #8
    	wrlong	r0, r7
    	mov	r7, r1
    	mov	r6, r14
    	sub	r6, #4
    	wrword	r7, r6
    	mov	r7, r14
    	sub	r7, #4
    	rdword	r7, r7
    	shl	r7, #16   ; <- sign extend
    	sar	r7, #16   ; <- sign extend
    	mov	r6, r14
    	sub	r6, #8
    	rdlong	r6, r6
    	add	r7, r6
    	mov	r0, r7
    	mov	sp, r14
    	rdlong	r14, sp
    	add	sp, #4
    	lret
    

    Of course this is just a sample and in a real program things may be very different; the point is that smaller variables may not result in smaller or faster code. Here the short version is also slower because of the additional instructions.

    If you are using arrays, of course, things may be very different, as Heater wrote. Say you want to collect 1000 samples from a 12-bit ADC: if you declare an array of 1000 ints (32 bits each), it uses 4000 bytes, but if you declare it as shorts (16 bits each, more than enough for the data) it uses only 2000 bytes. In this case any increase in code size is largely compensated by the saved data size.
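    In code, the trade-off from that last example looks roughly like this (a sketch, assuming 12-bit samples as above):

    int   samples_int[1000];     /* 1000 * 4 = 4000 bytes of hub RAM */
    short samples_short[1000];   /* 1000 * 2 = 2000 bytes, plenty for 12 bits */

    void store_sample(int i, int adc_value) {
      /* a wrword instead of a wrlong for the store; the 2000 bytes saved
         in data dwarf the few extra code bytes the short access may cost */
      samples_short[i] = (short)adc_value;
    }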
  • Heater. wrote: »
    There is a better way than using objdump. Just use gcc to compile the source into assembler output rather than going all the way to object files and linking a final executable.

    Very useful demo. So I can handle all the pieces individually and count changes at a 1-to-1 ratio. Though manageable, 2000 lines of ASM is a bit more daunting a task to understand than your wonderfully simple example.
    macca wrote: »
    If the second parameter is declared as short, the code increases by 3 instructions:

    The result of "saving" space on a 32-bit word by using half of it as a 16-bit word is that I have to use more steps to achieve that same goal. So that would be counterproductive on that level.

    Based on that I could say that, because of the operational overhead, I'm not likely to see any actual memory gain unless the variables are paired and packed, and even then it will slow runtime performance with the extra access instructions.
    macca wrote: »
    If you are using arrays of course things may be very different, as Heater wrote. Say you want to collect 1000 samples from a 12-bit ADC, if you declare an array of 1000 int (32 bit each), it uses 4000 bytes, but if you declare it as short (16 bit each, more than enough for the data) it uses only 2000 bytes, in this case any increase of the code size is largely compensated by the saved data size.

    Concluding that the only place to consider using this is where there will be substantial memory savings.

    If I modify this ASM code how do I compile it back into my project?

    Thank you all so much for all the great answers and examples.