ElfSizer: Determine your PropGCC binary size without needing SimpleIDE
DavidZemon
in Propeller 1
Determining the size of a project has been a serious problem with PropWare ever since I first started. I've been relying on the number provided by propeller-load, but I know that is not an accurate representation of the code size. Today, I am pleased to announce a standalone tool which will invoke objdump to determine the size of a PropGCC binary. It is written in C++ and (if I did it right) has no dependencies on shared libraries except for system defaults (like libstdc++).
Thanks goes to @jazzed for providing the original code in SimpleIDE. ElfSizer is a direct rip from Build::procReadyReadSizes(). All I did was wrap it with some command-line arguments and then swap out Qt functions for STL implementations.
My goal is for this program to be Linux/Windows/Mac/Pi compatible, though as of this post I have only tested Linux. A 64-bit Windows binary is available on the build server, but it has not been tested.
Source code: https://github.com/DavidZemon/elfsizer
Binary downloads: http://david.zemon.name:8111/project.html?projectId=ElfSizer&guest=1
My next step is to hook this into PropWare. Once done, I will look at testing on Windows and Pi.
In the meantime, I hope it will get some use by fellow PropGCC users. If anyone has bug reports or feature requests, just let me know.
Comments
No
I just tried them now. Ugh... Nothing like spending the weekend on a project and finding out the work was unnecessary.
The good news is the output of propeller-elf-size will be much easier to parse, and I can therefore do it with a CMake script rather than invoking an external program which invokes another external program.
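(For reference, if propeller-elf-size follows the standard binutils "size" output, it's just a header line plus one whitespace-separated row per file, which is easy to split apart; the numbers and filename below are only placeholders:)

   text    data     bss     dec     hex filename
   8192     256    1024    9472    2500 blinky.elf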
Thanks David
Nope. No use since propeller-elf-size reports the same thing, plus extra. If you're thinking of removing it from PropGCC, you've got my vote.
My best guess is that the big projects are still written in Spin - or interpreted via a Spin kernel (like Tachyon & PropBASIC). Most folks writing something big enough to need XMM are not using PropGCC. It's hard to fight momentum... especially when Parallax is (last I saw) still pushing Spin for commercial and production use.
As for momentum, there are good reasons to use SPIN and PASM. Those recommendations are more than just inertia. The C tools today are looking really good and they are capable. On the P2, I expect them to see more use as project size and resources scale up.
So don't take this as a "C is bad" kind of thing. It's not. Do take it as a SPIN + PASM is really good kind of thing. That's the truth, and a rational basis for those recommendations.
It's an artifact of the chip, SPIN and PASM being designed as one atomic thing.
Sorry, when I say "Spin" I'm referring to anything compiled via a Spin compiler such as PropTool, bstc, OpenSpin, etc.
But, there will be new users, and the P2 chip offers a more roomy experience. The P2 chip, assuming we get it made this time around, is shaping up to offer a compelling and well-differentiated set of features. Big ones are large projects, multiprocessing, real time, and no OS needed, though having one could rock hard too.
When we had C on the earlier design, it made a lot of sense. I had a good experience, and a lot of that boiled down to all the technical details needed for doing it on P1 being largely optional to start out.
Big programs on P1 are just enough of a niche as to not attract enough attention, IMHO. In the sweet spot, SPIN and C are comparable, and there is inertia in play for sure. A lot of that has to do with the specialized nature of the Prop.
Hobby, amateur, and specialized users may not see the benefits C can provide on P1 as being worth the effort C can require too. This depends on their body of C knowledge and their project requirements. IMHO, this dynamic should open up and relax things some on P2. Favorable to C.
This is not a negative either. It's just an artifact of where everything is at on the P1. It's really easy to say, "too big" and put that C code onto something else much better suited for it. Diminishing returns are in play here.
But, that "make it work on P1 anyway" effort can pay off.
On the P2, just grabbing some code out there and running it will be appropriate for a lot more use cases, due to the higher speed, HUBEXEC offering a more C friendly execute mode, and more RAM.
If SDRAM on this one ends up easy and fast?
C should start to shine, IMHO. Big programs will have room to work and speed that makes sense. When we can talk megabytes and reasonable speed, people are going to find they can employ C, do bigger stuff, and still get that real time on a Prop chip.
Because of all that, I think it's really important to not extrapolate too much from P1.
Additionally, the volume of new users will be favorably impacted by a good C environment being available. I was never sure of that on P1, just due to its size and how it works.
P2 is an order better and larger. That's going to favor C in a lot more scenarios as well as expand the number of plausible scenarios. This means people will be able to select a P2 chip for more things, and select C to do it. This all holds the potential for a much larger community, and we all want that.
Finally, I feel very strongly about the goals being orthogonal more than anything else. Maximizing both makes great sense. There are users aplenty and no need to cultivate any sort of exclusivity.
Having options is good.
We get relocatable code too.
But, a COG or two, out of 16, can do what is being done on P1, and do so more quickly, and with enough RAM to work without too much hassle, IMHO.
XMM should be really big, and it can go to HUBEXEC code too. That should be a nice gain, right?
By easy and fast, I mean a "DMA" type COG setup to fetch data from SDRAM, and some software to manage things should result in a standard set of code to make XMM possible, a lot faster and largely transparent.
With all the COGS left over, load up peripheral libraries, and there will be a lot fewer worries about resources and all that.
That is gonna matter, IMHO.
And again, options are good. One can trade overlays and all that for a slower, but simpler big program too.
All of it more roomy, faster and with new hardware features. Good times ahead for C programmers, if you ask me.
Remember, the big slow or modest-speed program, combined with faster little programs running concurrently or in parallel depending on what it is, still applies.
"Slow" XMM will be a lot faster than SPIN on the P1 is now. Sure seems to me, that's plenty fast, given how we have seen bigger projects work in Propland.
I tend to think in terms of use cases and applicability more than I do speed relative to the COG. On the P1, for example, SPIN runs about as fast as assembly language does on an older 8 bit computer. It's a dog relative to the COG, which looks like real hardware by the same measure.
Comparing them directly paints a negative picture. However, when comparing them against project requirements, it's a lot different. A lot of people said SPIN was slow. I thought it was fast, because the stuff I wanted SPIN to do was well matched to that speed. PASM closes the gaps, and a lot more can be done than one would think.
The same things are true in C, and depending on the memory model, people can make their choices, size, speed, external RAM, etc... this is powerful and useful. The only reason there isn't more going on right now is due to the small overall size and speed of the P1.
If, on the P2, XMM runs at some multiple of SPIN on the P1, that is going to be pretty fast relative to a lot of higher level bigger project requirements.
The egg beater hub also means running a few threads will run just as fast, and there remains plenty of RAM and COGS to mix the execute and memory models to optimum performance too.
The big thing will be example code and some cool projects / demos to show the way. If I were to very seriously generalize SPIN vs C, there is not yet enough of that in C land to help pull people in. And that is not a blame, just a fact.
On P2, that should change. It's more of a little system on chip, it's got more general purpose hardware features, etc...
XMM is an aberration, a freak of nature, an amusing mutant in a circus. Perhaps should have been drowned at birth.
Running any kind of code on Propeller from external memory is a good trick. It's great for us weirdos that want to run CP/M on an emulator on the Prop. In the same way that Linux has been run from an 8 bit ATMEL whatever.
It is in no way a sensible, practical, proposition.
Anyone who needs "big code" for a practical, economic, application would be using an STM32 F4 or whatever, perhaps with a Prop like chip to help with some hard real-time stuff.
I don't see this situation changing much with the P2.
But, the P2 has room enough to run some usable C code from HUB. So the pressure for more space is diminished. I would not even think about XMM.
I would not worry about the vanishingly minute "Propeller community" with their "Spin/PASM bias". This is a big world and I really hope the P2 reaches further than this "old boys club". C/C++ is one answer to getting that reach.
Me, I want to see the P2 run a Javascript engine. Kids today don't have the attention span for anything more complicated
"Better" or "right" discussions are energy draining as well as endlessly subjective.
@Heater, I disagree. A lot of those big programs need to be fast because the device does not offer what a Prop does. And sometimes they just need to be fast anyway. Fair enough.
But, there are totally good reasons to run big programs XMM style. Just presenting a good UI is one easy to spot case. Being able to offer up a nice display a user can work with doesn't require a ton of speed, particularly when it can benefit from a helper COG or two.
I'm putting the finishing touches on a simple blitter. It takes object lists and screen target lists and writes the display. Won't take much to extend that into something that can take UI data from external RAM. And doing that is just a COG or two. One running a big program, one able to get display work done.
We have enough room for this kind of thing to work nicely, and that's a real gain over P1. Doing that kind of thing on a P1 is hard. So far, it's mostly a doddle on P2. Need to see what SDRAM access turns out like, but if it's close to what we got on the last design, big programs will be relevant and useful.
@David. Well, if you want. SPIN on P2 may take a while... I figured we would get C and PASM first, just like last time. And that's the time for people to do cool stuff to attract attention and figure things out. Get the jump on SPIN.
Personally, I believe having good options means more users. More users means more code and that all is a good thing for everyone.
We have people here fluent in a lot of things! IMHO, that means people can show up and see success, unless we squabble about it.
Like that BASIC thread. One user posted up some success they had. Damn cool! That's what we are here for right? To some of us, it's just BASIC. To others, it's a show that hints at the fact that they too can get something done.
And as a learning device, a P2 may well turn out to be quite the playground.
I'll say this too. I can program in C. I'm not very skilled, but I can. On P1, it just didn't make enough sense, and I picked up SPIN and PASM and had a good time.
On the P2, it will make a lot more sense. Many of us may find being able to just flat out run or use more C code out there changes things.
My mind is open. I want to learn stuff, do stuff and enjoy doing that with others. And if I can help, I do. When more of us do that, it's better.
So the most important thing is when the new chip brings in new people, we are there to do what we do. Maybe they bring stuff to the table we don't too. Sweet!
I see a lot of push to do the "one true thing" everyone can reuse, etc... When that happens, great. But, it should not come at the expense of people getting stuff done.
The SPIN+PASM way is one way, the C way is another, Forth, BASIC, Heater wants JS...
Bring it on. It should be all good.
Err, that's not really true today; eXecute In Place (XIP) is growing, even on those larger MCUs.
QuadSPI is being used for that today, and the faster variants from Spansion et al. are being designed in.
You can now get both FLASH and RAM in low pin count serial streaming parts designed especially for XIP.
Anyone designing a new MCU core these days, should include Short-SKIP style opcodes, as they co-operate nicely with Serial streaming memory.
If it is leaving out the kernel then that's useful, because these days the kernel is overlaid with the bss/heap so the space it takes can be used by the application at runtime, i.e. the effective memory footprint does exclude the kernel.
I don't know about you, but I carefully write (embedded) code to use as little dynamic memory allocation as possible. It is a cool feature and I'm glad it was implemented... but it does me little to no good.
Really? The bss is where all uninitialized variables go. So if you declare something like the buffer shown below, that's 2K of bss space used right there, and so the kernel is free for this program. In general, if you add up all the (uninitialized) variable space and it comes to more than about 1500 bytes, then the kernel is not taking up any extra room.
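(The declaration in the original post didn't survive the quote; any roughly 2 KB uninitialized global would do it, for example - the name here is just illustrative:)

   int buffer[512];   /* 512 * 4 bytes = 2048 bytes, uninitialized, so it lands in bss */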
Note that the bss is cleared to 0 before the C code starts (but after the kernel starts, obviously). So if you have variables that you initialize to 0, you can possibly save a bit of space by dropping the explicit initializer so they land in bss instead; see the example below.
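(The two snippets from the original post were lost in the quote; they were presumably contrasting something like these two forms - the variable name is just illustrative:)

   /* With an explicit initializer, this may be placed in the data section: */
   int counter = 0;

   /* With no initializer it goes to bss instead, and is still guaranteed to start at 0: */
   int counter;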
Ah! I thought you meant that it was returned to the heap, not stack (stack == bss right?). That is much more useful
Just to clarify my usage of terms (I think these are pretty standard in the C world, but perhaps I've misremembered something):
"data" is the initialized data, variables that have an explicit initial value given to them when they are declared
"bss" is the uninitialized data, variables that are declared but not given any initial value. These all get set to 0 before main starts (I think this is required by the C standard; certainly it is widespread usage).
"heap" is the memory used for long term run-time allocation (malloc, realloc, etc.)
"stack" is the memory used at run-time for spilling variables from registers, saving return addresses and local variables, and similar transient (per-function) memory allocation including "alloca".
In PropGCC, memory is laid out with code first, then data, then bss. After that comes the runtime allocation area used for the heap and stack. The heap grows up from the end of bss, and the stack grows down from end of memory.
The LMM/CMM kernel is loaded after the data, so it overlays the bss (and perhaps extends into the heap/stack area, depending on how big the bss is). This isn't a problem, because the kernel is loaded very first thing and then is not needed any more (if we need to start a new cog, we copy the kernel back to a temporary hub buffer from COG memory). After the kernel is loaded the bss is zeroed.
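Putting that in a rough picture (just a sketch of the layout described above, not exact addresses):

   low hub addresses
   +----------------------+
   | code                 |
   | data                 |
   | bss                  |  <- LMM/CMM kernel is loaded here first, then bss is zeroed
   | heap                 |  grows toward higher addresses
   |        (free)        |
   | stack                |  grows toward lower addresses
   +----------------------+
   high hub addresses (end of memory)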
Ah, this is my mistake then. I thought data and bss were subsets of the stack.
Question time: where is each of the following variables stored, assuming GCC isn't pruning unused variables:
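(The code block didn't make it into this quote; judging from the names in the guess below, it was presumably something like this - the initializer values are just illustrative:)

   int global1 = 42;        /* global with an explicit initializer */
   int global2;             /* global with no initializer */

   void foo(void)
   {
       int fooVar1 = 1;     /* locals in foo */
       int fooVar2;
   }

   int main(void)
   {
       int mainVar1 = 2;    /* locals in main */
       int mainVar2;
       foo();
       return 0;
   }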
If I had to guess, I would say this:
data = global1
bss = global2
stack = mainVar1, mainVar2, fooVar1, fooVar2
But I could see GCC easily optimizing mainVar1 and mainVar2 into data and bss, similar to global1 and global2.
Spin generally ends up being more compact, but C/C++ is faster, and for certain things can actually end up smaller because of optimization, inlining, etc. I'm not strongly advocating one or the other - I like them both.
I have found there are things that Spin does poorly that are simple in C/C++, like having a single object used in multiple other code classes. I know you can fake it with DAT sections, and parameter passing, but that's kind of a hack, and falls apart the moment you want two of them.
On the other hand, Spin has some very cool built-ins for OUTA and DIRA (the array index and ellipsis operators), and a few other things that are pretty cool too.
I think there's a place for both, so maybe there should be a truce instead of a war.