Why can't we work out stack space at compile time?
PointDog
Posts: 22
Hi all, I hope I'm not starting a new thread for nothing, I've searched the old posts and can't turn anything up.
I am wondering why the stack space required for an object or method called by the cognew command can't be calculated at compile time.
Most compiled languages use a simple call-tree generator to map out the possible call paths; the required stack space is then the cost of the largest branch.
This is mildly inefficient, since some calls that are mutually exclusive in the external hardware logic are not mutually exclusive in the code. That, however, is outweighed by the amount of trouble you get when you are searching for the cognew(@needle_in_a_haystack, @stack_too_small) call whose procedure is overwriting another procedure's stack, or worse still, a procedure on another cog.
It seems very backward to compile an application onto the microcontroller, guess the stack size, and then use another object to check that your guess was correct. Worse still, if you make a mistake and don't exercise the procedure to its call limits during testing (which can be very hard to keep track of on paper), you may get an incorrect result.
I've written code that compiles to over 30k on a PIC, using about 1.5k of RAM, and it looks as though calculating stack sizes by hand if I port the code would take weeks.
Perhaps someone knows the reason why it has to be this way?
I am considering writing a program to calculate the stack size from the Spin source, but because no one has attempted it, I fear that I have missed something. So if anyone knows why the stack size can't be calculated at compile time, let me know.
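The call-tree approach described above can be sketched as a depth-first walk: given a static call graph and a per-method frame size, the worst-case stack need is the heaviest path from the entry method to any leaf. This is only an illustration in Python; the method names and frame sizes are made up, and it assumes no recursion.

```python
def worst_case_stack(call_graph, frame_size, entry):
    """Return the deepest-path stack cost starting at `entry`.

    Assumes the call graph is acyclic (no recursion), which Spin
    permits but a real tool would need to detect and reject.
    """
    callees = call_graph.get(entry, [])
    deepest = max(
        (worst_case_stack(call_graph, frame_size, c) for c in callees),
        default=0,
    )
    return frame_size[entry] + deepest

# Hypothetical call graph and per-method frame sizes (in longs):
call_graph = {
    "start": ["read_adc", "log_sample"],
    "log_sample": ["write_sector"],
    "read_adc": [],
    "write_sector": [],
}
frame_size = {"start": 8, "read_adc": 4, "log_sample": 6, "write_sector": 12}

# start -> log_sample -> write_sector is the heaviest path: 8 + 6 + 12
print(worst_case_stack(call_graph, frame_size, "start"))  # 26
```

The mutual-exclusion caveat above shows up here directly: if `read_adc` and `log_sample` can never actually both be deep on the stack, the analysis still charges for the heavier of the two paths, which is safe but pessimistic.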
Comments
Edit: you beat me to it Mike!
-Phil
Mike
Sorry if I seem a bit thick on the subject, I'm trying to migrate from the PIC because my applications are really pushing the chip to its limits, so some of the concepts of the propeller are new to me.
You're quite right about the amount of stack required for a call being the # temps at the point of call + return + # parameters + maximum stack used by the procedure being called. The problem is that you have to look for the worst case call in the procedure being analyzed rather than just using a static "who called whom" tree which the compiler already needs to determine which procedures to leave out because they're not ever used. This is not an insurmountable problem, but might require a rewrite of large parts of the compiler depending on how it's organized.
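The per-call-site accounting Mike describes can be sketched like this. All numbers are hypothetical, the fixed overhead of two longs (return address plus result slot) is an assumption for illustration, and the recursion again presumes an acyclic call graph:

```python
OVERHEAD = 2  # assumed: return address + result slot per call

def method_stack(method, methods):
    """Worst-case stack for `method`: its own locals plus the most
    expensive of its call sites, where each call site costs the
    temporaries live at that point, the call overhead, the parameters,
    and the callee's own worst-case need."""
    locals_, call_sites = methods[method]
    worst_call = max(
        (temps + OVERHEAD + params + method_stack(callee, methods)
         for temps, params, callee in call_sites),
        default=0,
    )
    return locals_ + worst_call

# methods[name] = (local longs, [(live temps, #params, callee), ...])
methods = {
    "main": (2, [(1, 2, "fmt"), (3, 1, "log")]),  # two call sites
    "log":  (4, [(0, 3, "fmt")]),
    "fmt":  (1, []),
}

# main's second call site dominates: 2 + (3 + 2 + 1 + (4 + (0+2+3+1)))
print(method_stack("main", methods))  # 18
```

This is exactly why a static "who called whom" tree isn't enough: the two calls to `fmt` from `main` and `log` cost different amounts, because the temporaries live at each call site differ.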
Mike
The Propeller is a 32 bit computer, but can handle byte and 16-bit word data just as easily in SPIN (the actual COG instruction set provides 32 bit arithmetic only). This all impacts how you do the conversion.
SPIN and C are not very different. They have similar operators and function at a similar level (conceptually). It's a no-brainer to go through a C program and convert it line by line into SPIN. The real time-consuming part is the I/O functionality. Given the way the Propeller works, maybe you could do things much more efficiently by making use of the Propeller's power. On the other hand, if the PIC you're using has a fancy I/O function block (like a USB interface, for example), you might spend an inordinate amount of time reimplementing what MicroChip has already done using a bare Propeller COG. Timers, frequency synthesizers, video output generators, 1-wire control, and UARTs are straightforward given that someone else has already done the work for the Propeller. You get the idea?
About the stack size: I wouldn't worry about it too much. You should have a rough idea of the depth of your static call tree just based on how you've written your C program. Figure out the average number of parameters on a call, add 3 to account for the return address, result variable (whether used or not), and the back chain, and you have a starting value for the stack size. Double or triple it for testing purposes. You can always put in a routine that checks free stack space, call it at the "leaves" of the call tree, and save the minimum free space for analysis later.
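One common way to implement that runtime check is a watermark: fill the stack area with a sentinel value before launching the cog, then scan back from the top for untouched sentinels to find the high-water mark. The following is a Python simulation of the idea (not the actual Spin object); the sentinel value and sizes are illustrative.

```python
SENTINEL = 0xDEADBEEF  # assumed fill value; any unlikely pattern works

def prepare(stack_longs):
    """Fill a fresh stack area with the sentinel before cog launch."""
    return [SENTINEL] * stack_longs

def high_water_mark(stack):
    """Longs actually used = total minus trailing untouched sentinels.

    If a real value happens to equal the sentinel, the mark reads low,
    which is why an 'unlikely' fill pattern matters.
    """
    untouched = 0
    for value in reversed(stack):
        if value != SENTINEL:
            break
        untouched += 1
    return len(stack) - untouched

stack = prepare(32)
for i in range(11):        # pretend the launched cog wrote 11 longs
    stack[i] = i
print(high_water_mark(stack))  # 11
```

Run the cog through its worst-case paths, read the mark, add a safety margin, and you have an empirical stack size to put in the cognew call.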
If you'd be willing to share some more detailed information about your project, I'd be happy to offer a more focused opinion. Feel free to do it via Private Message here.
Assembler on a single cog in the Propeller will probably not outperform your PIC, unless you're doing a lot of multibyte arithmetic. The Propeller requires four clocks for most instructions; IIRC, PICs run most instructions in a single cycle. So you'd be moving from 40 MIPS to 20 MIPS.
Of course, you've got eight cores.
You may be disappointed. SPIN is an interpreted language, and is hundreds of times slower than native-compiled C code.
No C compiler currently targets the Propeller. (In fact, no compilers target the Propeller's native instruction set at all, with the exception of some research projects I haven't released yet.)
That said, converting C to SPIN should be straightforward if that's what you want to do. SPIN is spookily similar to BCPL, which the original C dialect was translated to before compilation. You could even compile it directly to SPIN bytecodes, though the only reference materials are mine and they're not officially supported.
Parallax has been quite friendly to open-source and third-party tools targeting the Propeller; I wouldn't expect resistance from them.
That's exactly my problem: it's a tremendous amount of work to shift processors. But I'm using the fastest PIC available, unless I jump to the dsPIC, which requires a new compiler and learning a new architecture anyway. I looked into the AVRs, but they don't seem to be much faster (16 MIPS). A lot of my code is fairly high level, the main portion being a FAT16 file system, so I figured it might be a similar amount of work to write a converter as to port the code by hand; however, I don't yet know the architecture, or Spin, well enough to make a decision.
Mike, the project is a datalogger used by electricity distribution companies. It records to a CompactFlash card. At the moment it has 2x 200 ksps A/D converters on SPI, a Hitachi LCD, the CF card, and an FT245 USB FIFO, because the Microchip on-board USB looked like a lot of work to set up, test, and then write drivers for. The code has to continuously monitor a power line and sample at a high frequency to record harmonic voltages as well as AFLC load control signals (used by electricity companies to switch things like water heaters on and off). While it's monitoring and doing RMS calculations of the main waveform and at least one harmonic frequency, it also has to record the data to a flash card using the FAT16 file system. The majority of my code is the FAT16 file system (32k), containing procedures to format, chdir, mkdir, delete, open_file, and close_file. The hardware part is easy enough, but I dread porting the FAT16 code by hand and then testing it. It is also the part that uses the most RAM; it has to buffer a 512-byte sector.
But it's also the part that would benefit most from the multi-core architecture. I would build a software FIFO into the buffer; then one cog could take care of the filing system, one could take care of the continual RMS conversion, and one could take over the SPI bus, but with two MOSI (SDI) pins so that the converters can talk at the same time (you can't do that on a PIC; software SDI uses too much of the processor).
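A sketch of that software FIFO idea: one cog (the SPI sampler) produces into a ring buffer in shared RAM and another cog (the FAT16 writer) consumes from it. With a single producer and a single consumer, the head and tail indices each have exactly one writer, so no lock is needed. This is an illustrative Python model, not Propeller code; the class name and buffer size are assumptions.

```python
class RingFifo:
    """Single-producer/single-consumer ring buffer.

    One slot is sacrificed to distinguish full from empty, so a
    buffer of size N holds at most N-1 values.
    """

    def __init__(self, size=512):      # e.g. one sector's worth of samples
        self.buf = [0] * size
        self.head = 0                  # written only by the producer cog
        self.tail = 0                  # written only by the consumer cog

    def put(self, value):
        nxt = (self.head + 1) % len(self.buf)
        if nxt == self.tail:
            return False               # full; producer must retry later
        self.buf[self.head] = value
        self.head = nxt                # publish after the data is in place
        return True

    def get(self):
        if self.tail == self.head:
            return None                # empty; consumer must retry later
        value = self.buf[self.tail]
        self.tail = (self.tail + 1) % len(self.buf)
        return value

f = RingFifo(4)
for v in (10, 20, 30):
    f.put(v)
print(f.get(), f.get())  # 10 20
```

On the Propeller the same pattern works with the buffer and both indices in hub RAM, since hub reads and writes of a long are atomic; the order of operations in `put` (data first, then head) is what keeps the consumer from seeing a slot before it is filled.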
Oh, and the logger is an 8-channel unit, so the PIC at the moment has to do all that eight times.
The PIC can do it, but I've had to drop the harmonic and AFLC measurement function. I will be releasing the instrument without it, and then looking for a way to incorporate it in a revision.
I wish the propeller was out when I started this design, I would have written it in spin to start with.
You could use the USB-serial chip instead of the FIFO one to save some pins. The FullDuplexSerial routines sit in a single cog and can handle over 1MBps. You could even do floating point if you needed it at 40-60us per operation using 1-2 cogs.
Post Edited (Mike Green) : 10/27/2006 10:28:46 PM GMT
And is price a large factor in this project?
(I can't imagine that you'll be selling hundreds of thousands of them. Too specialized.)
What about a FAT16/32-on-a-chip solution?
They also sell complete breakout boards for hobbyists and prototyping...
You can read more about the product on DOSonCHIP
I mean, if the FAT16 part of the code is the largest bit, and there's a product available that can just... let you bypass that...
As for your Stack-size problems, there exists an object that can read how much of an object's stack is actually used.
Stack Monitor
I can't see why Phil didn't mention it himself, as it's quite an Excellent Piece Of Code...