ElfSizer: Determine your PropGCC binary size without needing SimpleIDE
DavidZemon
in Propeller 1
Determining the size of a project has been a serious problem with PropWare ever since I first started. I've been relying on the number provided by propeller-load, but I know that is not an accurate representation of the code size. Today, I am pleased to announce a standalone tool which will invoke objdump to determine the size of a PropGCC binary. It is written in C++ and (if I did it right) has no dependencies on shared libraries except for system defaults (like libstdc++).
Thanks goes to @jazzed for providing the original code in SimpleIDE. ElfSizer is a direct rip from Build::procReadyReadSizes(). All I did was wrap it with some command-line arguments and then swap out Qt functions for STL implementations.
My goal is for this program to be Linux/Windows/Mac/Pi compatible, though as of this post I have only tested Linux. A 64-bit Windows binary is available on the build server, but it has not been tested.
Source code: https://github.com/DavidZemon/elfsizer
Binary downloads: http://david.zemon.name:8111/project.html?projectId=ElfSizer&guest=1
My next step is to hook this into PropWare. Once done, I will look at testing on Windows and Pi.
In the meantime, I hope it will get some use by fellow PropGCC users. If anyone has bug reports or feature requests, just let me know.
Comments
No
I just tried them now. Ugh... Nothing like spending the weekend on a project and finding out the work was unnecessary.
The good news is the output of propeller-elf-size will be much easier to parse, and I can therefore do it with a CMake script rather than invoking an external program which invokes another external program.
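(For reference, if propeller-elf-size follows the standard binutils "size" output, it's just a header line plus one whitespace-separated row per file, which is easy to split apart; the numbers and filename below are only placeholders:)

   text    data     bss     dec     hex filename
   8192     256    1024    9472    2500 blinky.elf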
Thanks David
Nope. No use since propeller-elf-size reports the same thing, plus extra. If you're thinking of removing it from PropGCC, you've got my vote.
My best guess is that the big projects are still written in Spin - or interpreted via a Spin kernel (like Tachyon & PropBASIC). Most folks writing something big enough to need XMM are not using PropGCC. It's hard to fight momentum... especially when Parallax is (last I saw) still pushing Spin for commercial and production use.
As for momentum, there are good reasons to use SPIN and PASM. Those recommendations are more than just inertia. The C tools today are looking really good and they are capable. On the P2, I expect them to see more use as project size and resources scale up.
So don't take this as a "C is bad" kind of thing. It's not. Do take it as a SPIN + PASM is really good kind of thing. That's the truth, and a rational basis for those recommendations.
It's an artifact of the chip, SPIN and PASM being designed as one atomic thing.
Sorry, when I say "Spin" I'm referring to anything compiled via a Spin compiler such as PropTool, bstc, OpenSpin, etc.
But, there will be new users, and the P2 chip offers a more roomy experience. The P2 chip, assuming we get it made this time around, is shaping up to offer a compelling and well-differentiated set of features. Big ones are large projects, multiprocessing, real time, and no OS needed, though having one could rock hard too.
When we had C on the earlier design, it made a lot of sense. I had a good experience, and a lot of that boiled down to all the technical details needed for doing it on P1 being largely optional to start out.
Big programs on P1 are just enough of a niche as to not attract enough attention, IMHO. In the sweet spot, SPIN and C are comparable, and there is inertia in play for sure. A lot of that has to do with the specialized nature of the Prop.
Hobby, amateur, and specialized users may not see the benefits C can provide on P1 as being worth the effort C can require too. This depends on their body of C knowledge and their project requirements. IMHO, this dynamic should open up and relax things some on P2. Favorable to C.
This is not a negative either. It's just an artifact of where everything is at on the P1. It's really easy to say, "too big" and put that C code onto something else much better suited for it. Diminishing returns are in play here.
But, that "make it work on P1 anyway" effort can pay off.
On the P2, just grabbing some code out there and running it will be appropriate for a lot more use cases, due to the higher speed, HUBEXEC offering a more C friendly execute mode, and more RAM.
If SDRAM on this one ends up easy and fast?
C should start to shine, IMHO. Big programs will have room to work and speed that makes sense. When we can talk megabytes and reasonable speed, people are going to find they can employ C, do bigger stuff, and still get that real time on a Prop chip.
Because of all that, I think it's really important to not extrapolate too much from P1.
Additionally, the volume of new users will be favorably impacted by a good C environment being available. I was never sure of that on P1, just due to its size and how it works.
P2 is an order better and larger. That's going to favor C in a lot more scenarios as well as expand the number of plausible scenarios. This means people will be able to select a P2 chip for more things, and select C to do it. This all holds the potential for a much larger community, and we all want that.
Finally, I feel very strongly about the goals being orthogonal more than anything else. Maximizing both makes great sense. There are users aplenty and no need to cultivate any sort of exclusivity.
Having options is good.
We get relocatable code too.
But, a COG or two, out of 16, can do what is being done on P1, and do so more quickly, and with enough RAM to work without too much hassle, IMHO.
XMM should be really big, and it can go to HUBEXEC code too. That should be a nice gain, right?
By easy and fast, I mean a "DMA" type COG setup to fetch data from SDRAM, and some software to manage things should result in a standard set of code to make XMM possible, a lot faster and largely transparent.
With all the COGS left over, load up peripheral libraries, and there will be a lot fewer worries about resources and all that.
That is gonna matter, IMHO.
And again, options are good. One can trade overlays and all that for a slower, but simpler big program too.
All of it more roomy, faster and with new hardware features. Good times ahead for C programmers, if you ask me.
Remember, the big slow or modest-speed program, combined with faster little programs running concurrently or in parallel depending on what it is, still applies.
"Slow" XMM will be a lot faster than SPIN on the P1 is now. Sure seems to me, that's plenty fast, given how we have seen bigger projects work in Propland.
I tend to think in terms of use cases and applicability more than I do speed relative to the COG. On the P1, for example, SPIN runs about as fast as assembly language does on an older 8 bit computer. It's a dog relative to the COG, which looks like real hardware by the same measure.
Comparing them directly paints a negative picture. However, when comparing them against project requirements, it's a lot different. A lot of people said SPIN was slow. I thought it was fast, because the stuff I wanted SPIN to do was well matched to that speed. PASM closes the gaps, and a lot more can be done than one would think.
The same things are true in C, and depending on the memory model, people can make their choices, size, speed, external RAM, etc... this is powerful and useful. The only reason there isn't more going on right now is due to the small overall size and speed of the P1.
If, on the P2, XMM runs at some multiple of SPIN on the P1, that is going to be pretty fast relative to a lot of higher level bigger project requirements.
The egg beater hub also means running a few threads will run just as fast, and there remains plenty of RAM and COGS to mix the execute and memory models to optimum performance too.
The big thing will be example code and some cool projects / demos to show the way. If I were to very seriously generalize SPIN vs C, there is not yet enough of that in C land to help pull people in. And that is not a blame, just a fact.
On P2, that should change. It's more of a little system on chip, it's got more general purpose hardware features, etc...
XMM is an aberration, a freak of nature, an amusing mutant in a circus. Perhaps should have been drowned at birth.
Running any kind of code on Propeller from external memory is a good trick. It's great for us weirdos that want to run CP/M on an emulator on the Prop. In the same way that Linux has been run from an 8 bit ATMEL whatever.
It is in no way a sensible, practical, proposition.
Anyone who needs "big code" for a practical, economic, application would be using an STM32 F4 or whatever, perhaps with a Prop like chip to help with some hard real-time stuff.
I don't see this situation changing much with the P2.
But, the P2 has room enough to run some usable C code from HUB. So the pressure for more space is diminished. I would not even think about XMM.
I would not worry about the vanishingly minute "Propeller community" with their "Spin/PASM bias". This is a big world and I really hope the P2 reaches further than this "old boys club". C/C++ is one answer to getting that reach.
Me, I want to see the P2 run a Javascript engine. Kids today don't have the attention span for anything more complicated
"Better" or "right" discussions are energy draining as well as endlessly subjective.
@Heater, I disagree. A lot of those big programs need to be fast because the device does not offer what a Prop does. And sometimes they just need to be fast anyway. Fair enough.
But, there are totally good reasons to run big programs XMM style. Just presenting a good UI is one easy to spot case. Being able to offer up a nice display a user can work with doesn't require a ton of speed, particularly when it can benefit from a helper COG or two.
I'm putting the finishing touches on a simple blitter. It takes object lists and screen target lists and writes the display. Won't take much to extend that into something that can take UI data from external RAM. And doing that is just a COG or two. One running a big program, one able to get display work done.
We have enough room for this kind of thing to work nicely, and that's a real gain over P1. Doing that kind of thing on a P1 is hard. So far, it's mostly a doddle on P2. Need to see what SDRAM access turns out like, but if it's close to what we got on the last design, big programs will be relevant and useful.
@David. Well, if you want. SPIN on P2 may take a while... I figured we would get C and PASM first, just like last time. And that's the time for people to do cool stuff to attract attention and figure things out. Get the jump on SPIN.
Personally, I believe having good options means more users. More users means more code and that all is a good thing for everyone.
We have people here fluent in a lot of things! IMHO, that means people can show up and see success, unless we squabble about it.
Like that BASIC thread. One user posted up some success they had. Damn cool! That's what we are here for right? To some of us, it's just BASIC. To others, it's a show that hints at the fact that they too can get something done.
And as a learning device, a P2 may well turn out to be quite the playground.
I'll say this too. I can program in C. I'm not very skilled, but I can. On P1, it just didn't make enough sense, and I picked up SPIN and PASM and had a good time.
On the P2, it will make a lot more sense. Many of us may find being able to just flat out run or use more C code out there changes things.
My mind is open. I want to learn stuff, do stuff and enjoy doing that with others. And if I can help, I do. When more of us do that, it's better.
So the most important thing is when the new chip brings in new people, we are there to do what we do. Maybe they bring stuff to the table we don't too. Sweet!
I see a lot of push to do the "one true thing" everyone can reuse, etc... When that happens, great. But, it should not come at the expense of people getting stuff done.
The SPIN+PASM way is one way, the C way is another, Forth, BASIC, Heater wants JS...
Bring it on. It should be all good.
Err, that's not really true today; eXecute In Place (XIP) is growing, even on those larger MCUs.
QuadSPI is being used for that today, and the faster variants from Spansion et al. are being designed in.
You can now get both FLASH and RAM in low pin count serial streaming parts designed especially for XIP.
Anyone designing a new MCU core these days, should include Short-SKIP style opcodes, as they co-operate nicely with Serial streaming memory.
If it is leaving out the kernel then that's useful, because these days the kernel is overlaid with the bss/heap so the space it takes can be used by the application at runtime, i.e. the effective memory footprint does exclude the kernel.
I don't know about you, but I carefully write (embedded) code to use as little dynamic memory allocation as possible. It is a cool feature and I'm glad it was implemented... but it does me little to no good.
Really? The bss is where all uninitialized variables go. So if you declare something like the buffer shown below, that's 2K of bss space used right there, and so the kernel is free for this program. In general, if you add up all the (uninitialized) variable space and it comes to more than about 1500 bytes, then the kernel is not taking up any extra room.
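(The declaration in the original post didn't survive the quote; any roughly 2 KB uninitialized global would do it, for example - the name here is just illustrative:)

   int buffer[512];   /* 512 * 4 bytes = 2048 bytes, uninitialized, so it lands in bss */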
Note that the bss is cleared to 0 before the C code starts (but after the kernel starts, obviously). So if you have variables that you initialize to 0, you can possibly save a bit of space by dropping the explicit initializer so they land in bss instead; see the example below.
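(The two snippets from the original post were lost in the quote; they were presumably contrasting something like these two forms - the variable name is just illustrative:)

   /* With an explicit initializer, this may be placed in the data section: */
   int counter = 0;

   /* With no initializer it goes to bss instead, and is still guaranteed to start at 0: */
   int counter;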
Ah! I thought you meant that it was returned to the heap, not stack (stack == bss right?). That is much more useful
Just to clarify my usage of terms (I think these are pretty standard in the C world, but perhaps I've misremembered something):
"data" is the initialized data, variables that have an explicit initial value given to them when they are declared
"bss" is the uninitialized data, variables that are declared but not given any initial value. These all get set to 0 before main starts (I think this is required by the C standard; certainly it is widespread usage).
"heap" is the memory used for long term run-time allocation (malloc, realloc, etc.)
"stack" is the memory used at run-time for spilling variables from registers, saving return addresses and local variables, and similar transient (per-function) memory allocation including "alloca".
In PropGCC, memory is laid out with code first, then data, then bss. After that comes the runtime allocation area used for the heap and stack. The heap grows up from the end of bss, and the stack grows down from end of memory.
The LMM/CMM kernel is loaded after the data, so it overlays the bss (and perhaps extends into the heap/stack area, depending on how big the bss is). This isn't a problem, because the kernel is loaded very first thing and then is not needed any more (if we need to start a new cog, we copy the kernel back to a temporary hub buffer from COG memory). After the kernel is loaded the bss is zeroed.
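Putting that in a rough picture (just a sketch of the layout described above, not exact addresses):

   low hub addresses
   +----------------------+
   | code                 |
   | data                 |
   | bss                  |  <- LMM/CMM kernel is loaded here first, then bss is zeroed
   | heap                 |  grows toward higher addresses
   |        (free)        |
   | stack                |  grows toward lower addresses
   +----------------------+
   high hub addresses (end of memory)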
Ah, this is my mistake then. I thought data and bss were subsets of the stack.
Question time: where is each of the following variables stored, assuming GCC isn't pruning unused variables:
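(The code block didn't make it into this quote; judging from the names in the guess below, it was presumably something like this - the initializer values are just illustrative:)

   int global1 = 42;        /* global with an explicit initializer */
   int global2;             /* global with no initializer */

   void foo(void)
   {
       int fooVar1 = 1;     /* locals in foo */
       int fooVar2;
   }

   int main(void)
   {
       int mainVar1 = 2;    /* locals in main */
       int mainVar2;
       foo();
       return 0;
   }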
If I had to guess, I would say this:
data = global1
bss = global2
stack = mainVar1, mainVar2, fooVar1, fooVar2
But I could see GCC easily optimizing mainVar1 and mainVar2 into data and bss, similar to global1 and global2.
Spin generally ends up being more compact, but C/C++ is faster, and for certain things can actually end up smaller because of optimization, inlining, etc. I'm not strongly advocating one or the other - I like them both.
I have found there are things that Spin does poorly that are simple in C/C++, like having a single object used in multiple other code classes. I know you can fake it with DAT sections, and parameter passing, but that's kind of a hack, and falls apart the moment you want two of them.
On the other hand, Spin has some very cool built-ins for OUTA and DIRA (the array index and ellipsis operators), and a few other things that are pretty cool too.
I think there's a place for both, so maybe there should be a truce instead of a war.