Overlays for large spin programs
Dr_Acula
Posts: 5,484
I've got this obsession about trying to run large spin programs on the propeller! Searches for this on google seem to recursively point me back to old threads that I have long since forgotten I started, so apologies in advance for sounding like a broken record
Ok, let's say we have a propeller chip, we have some mass storage such as an SD card and we have a display for debugging such as a VGA display. Many propeller boards will be able to do this. Further, just to make the challenge a bit more interesting, can this be done with just the proptool?
I'm going to note at this stage there are at least 4 propeller compilers that may be able to do this - proptool, homespun, BST and sphinx. Maybe openspin. sphinx might be the easiest to modify as it is written in spin and works on the propeller, but I'm going to try the proptool first.
What are overlays? Well, I'm glad you asked. Overlays http://en.wikipedia.org/wiki/Overlay_(programming) were popular before virtual memory drivers came along. In a simple sense, you have a core program, and then you have other parts of the program that are loaded into memory and run as needed. For a very simplistic model, a propeller might have the lower 16k of hub devoted to the core program which would contain the SD driver and the graphics driver, and then other parts of the program are loaded in as needed.
My understanding of overlays is they were used back in the 1980s when memory was typically 64k or less, but they were a reinvention of a technique used in mainframes 20 years earlier. What is interesting is the systems that used overlays had memory that was equal or less than that available in the propeller. So, if the boffins of yore could do this, we should be able too. Or to translate this to propeller-speak, this is totally impossible *grin*
Let's try hacking the proptool. What we need is to have 2 files. We take file 1, which has three objects, SD, VGA and Overlay1 and we compile that. The we have file 2, which also has three objects SD, VGA and Overlay2, and we compile that. Now we need to run the program and move Overlay1 or Overlay 2 into higher hub memory and see if we can swap them in and out.
First step is to create a common point in both programs. Some of the propeller compiles have the @@@ symbol but let's try the proptool first. Compiled spin programs don't start at address 0 in hub ram, so we need to work out how the compiler is working. Dat sections seem to go first, and with our two programs, the Dat sections are going to be the same so that should not matter. Var seems to be compiled after the first Pub so that makes things easier. But the location of the first Pub seems to move depending on how many objects are added. And further, it doesn't matter where in the code the objects are listed, they all end up at the beginning.
I'm using a string print routine to track where things are ending up and using F8 to compile. If I print a string, and have one object, the string appears at address 0x2A. If I add a second object that moves to address 0x2E. So it appears there is a lookup table of objects at the beginning of the program with one long per object. I think looking further at the bytes, that some of those bytes are the jump location in code.
So if program 1 and program 2 both contain the same number of objects then the jump locations should not change.
I think what we need now is dummy overlays. So program 1 might be compiled with a call to Overlay1, and from within overlay1 we have an overlay loader (read a binary file from SD) and then continue with the program with calls to Pubs within that overlay. Then the program needs to return and load up Overlay2. I'm not sure of the exact details of this, but I think that if code is swapped in and out of high memory and both programs are compiled with the same core objects and one dummy overlay object, all the jump locations should be the same. Further, it should be possible to pass one long which is a pointer to a list of longs in hub and this can be a way of passing data back and forth.
There are many compiler options but I'm thinking of staying with the proptool, and compiling two binary programs with a simple F8 and then save to SD card. They will be different sizes, but if this works, the core part of the program with the SD and display driver should be the same. If we have some sort of data pointer in the program, eg some text "end program", it should be possible for a program to work out how big it is itself by searching through its own hub. It can then load in another binary program, search for this same string, and then after that string start moving that data into higher memory. So we can then load the top part of the program in and out of high memory without the lower part of the program realising it has happened because the lower part of the program is the same.
This is the theory. I'm not entirely sure of the syntax within a spin program, but I think it will end up with a core program with lots of overlays commented out, and you uncomment them one at a time and recompile each time for a new binary file. The structure of the program is up to the programmer, but considering a game for instance, one might have one overlay that involves artificial intelligence (eg in pacman) and a new overlay that involves moving sprites around as per the recalculated positions. This means overlays are not being moved in and out of memory too often as there is obviously a time cost associated with this. At the very simplest level, the code is something like
Will this outperform a cached memory solution using virtual memory? Possibly not, and I suspect the GCC coders have a better solution. But... Spin can't run using cached memory, so there are not really any other options for Spin.
I'd be very interested in thoughts from forum members. Is this crazy?
Ok, let's say we have a propeller chip, we have some mass storage such as an SD card and we have a display for debugging such as a VGA display. Many propeller boards will be able to do this. Further, just to make the challenge a bit more interesting, can this be done with just the proptool?
I'm going to note at this stage there are at least 4 propeller compilers that may be able to do this - proptool, homespun, BST and sphinx. Maybe openspin. sphinx might be the easiest to modify as it is written in spin and works on the propeller, but I'm going to try the proptool first.
What are overlays? Well, I'm glad you asked. Overlays http://en.wikipedia.org/wiki/Overlay_(programming) were popular before virtual memory drivers came along. In a simple sense, you have a core program, and then you have other parts of the program that are loaded into memory and run as needed. For a very simplistic model, a propeller might have the lower 16k of hub devoted to the core program which would contain the SD driver and the graphics driver, and then other parts of the program are loaded in as needed.
My understanding of overlays is they were used back in the 1980s when memory was typically 64k or less, but they were a reinvention of a technique used in mainframes 20 years earlier. What is interesting is the systems that used overlays had memory that was equal or less than that available in the propeller. So, if the boffins of yore could do this, we should be able too. Or to translate this to propeller-speak, this is totally impossible *grin*
Let's try hacking the proptool. What we need is to have 2 files. We take file 1, which has three objects, SD, VGA and Overlay1 and we compile that. The we have file 2, which also has three objects SD, VGA and Overlay2, and we compile that. Now we need to run the program and move Overlay1 or Overlay 2 into higher hub memory and see if we can swap them in and out.
First step is to create a common point in both programs. Some of the propeller compiles have the @@@ symbol but let's try the proptool first. Compiled spin programs don't start at address 0 in hub ram, so we need to work out how the compiler is working. Dat sections seem to go first, and with our two programs, the Dat sections are going to be the same so that should not matter. Var seems to be compiled after the first Pub so that makes things easier. But the location of the first Pub seems to move depending on how many objects are added. And further, it doesn't matter where in the code the objects are listed, they all end up at the beginning.
I'm using a string print routine to track where things are ending up and using F8 to compile. If I print a string, and have one object, the string appears at address 0x2A. If I add a second object that moves to address 0x2E. So it appears there is a lookup table of objects at the beginning of the program with one long per object. I think looking further at the bytes, that some of those bytes are the jump location in code.
So if program 1 and program 2 both contain the same number of objects then the jump locations should not change.
I think what we need now is dummy overlays. So program 1 might be compiled with a call to Overlay1, and from within overlay1 we have an overlay loader (read a binary file from SD) and then continue with the program with calls to Pubs within that overlay. Then the program needs to return and load up Overlay2. I'm not sure of the exact details of this, but I think that if code is swapped in and out of high memory and both programs are compiled with the same core objects and one dummy overlay object, all the jump locations should be the same. Further, it should be possible to pass one long which is a pointer to a list of longs in hub and this can be a way of passing data back and forth.
There are many compiler options but I'm thinking of staying with the proptool, and compiling two binary programs with a simple F8 and then save to SD card. They will be different sizes, but if this works, the core part of the program with the SD and display driver should be the same. If we have some sort of data pointer in the program, eg some text "end program", it should be possible for a program to work out how big it is itself by searching through its own hub. It can then load in another binary program, search for this same string, and then after that string start moving that data into higher memory. So we can then load the top part of the program in and out of high memory without the lower part of the program realising it has happened because the lower part of the program is the same.
This is the theory. I'm not entirely sure of the syntax within a spin program, but I think it will end up with a core program with lots of overlays commented out, and you uncomment them one at a time and recompile each time for a new binary file. The structure of the program is up to the programmer, but considering a game for instance, one might have one overlay that involves artificial intelligence (eg in pacman) and a new overlay that involves moving sprites around as per the recalculated positions. This means overlays are not being moved in and out of memory too often as there is obviously a time cost associated with this. At the very simplest level, the code is something like
PUB main start sd driver start vga driver repeat load overlay1 run overlay 1 load overlay2 run overlay2
Will this outperform a cached memory solution using virtual memory? Possibly not, and I suspect the GCC coders have a better solution. But... Spin can't run using cached memory, so there are not really any other options for Spin.
I'd be very interested in thoughts from forum members. Is this crazy?
Comments
I think the relocation of those compiled objects is quite simple. David Betz might be the best person to answer this as I think he worked out what needs to e done to relocate an object, which is after all, what you are trying to do. Each object would then become an overlay.
Thanks for the summary of how to relocate Spin objects. Could you explain a little what's happening here? For example, if I have two instances of a Spin object, which of these values are the same for both instances and which vary? Also, if I want to run the same object on two different COGs then I must have a separate stack for each COG. Which of these variables indicates where the stack lives?
Thanks!
David
VBASE is the absolute starting address of the object's VAR variables. This is different for each instance of the object. DBASE is the absolute address of the RESULT variable when calling a method. This is dependent on the location in the stack at the time that a method is called. DCURR is the stack pointer. When a method is called, DCURR is set to the current stack pointer value plus space for the stack frame, local variables and calling parameters.
The stack frame is a 4-word structure that is placed on the stack when calling another method. It contains the values of PBASE, VBASE, DBASE and PCURR just before a method is to be called. The stack frame is used by the RETURN and ABORT bytecodes to restore the Spin state variables. The stack frame also contains two additional bits to indicate whether the return value should be pushed to the stack, and whether an ABORT should stop returning at this point or cause a return to the previous method.
From what I remember, I also answered at least one of your ancient threads, Dr Acula.
What I do in my CogOS and in another big command driven program is:
1. Drivers (like FullDuplex or TV or keyboard, SD card...) are separated in a PASM and a spin part.
2. The PASM mailbox is created at the end of HUB-RAM by a memory manager which also knows which drivers are running
3. The spin part is told by an init function where to find the mailbox
The overlays only need to use those stripped-down spin parts of the drivers. They ask a stripped down version of the memory manager where to find the mailbox, init the spin driver and then can use the ressource.
Passing parameters to the overlay works as well as starting functions in the overlay loader (somehow similar to an INT13 interrupt) and returning a result.
Maybe it would be an improvement if it is possible to share one spin driver between overlay and overlay loader, but so far my overlays are small enough.
We had "overlays" in the 1970's on the Marconi Locus 16 embedded mini-computer for radar systems. Except they wern't really overlays in the modern sense but rather a means of switching in chunks of RAM to high memory that has previously been loaded with code.
We had overlays in the late 70's early 80's with CP/M. Sometimes as hardware memory bank switching, as above. Sometime using actual "overlaying linkers" that would build programs bigger than memory space and swap functions in and out of disk.
We had overlays in the 1980's on the PC. Whilst working at Racal on a schematic design and PCB layout package called CADSTAR. CADSTAR was written in PL/M (A Pascal/C like language) and had to fit all it's code and the design data into 640K. This was done with PLINK an "overlaying linker".
The thing about an overlaying linker is that you can tell it how to partition your program into independent parts that can be swapped in and out of high memory. Of course those parts can have independent sub-parts and so on in a nested fashion. This was all specified in a big linker configuration file. With that in place your code could call functions in overlays directly. The linker would insert code that would intercept those calls, load the appropriate overlay, find the function in it and call it.
Sweet.
Except as your program grows his becomes a real pain to organize. Then you find that the amount of common, shared, code you need in the "resident" section becomes too big. All in all it was a pain to build programs with overlays.
Then you start adopting the various methods of expanding the PC memory, bank switching basically. That gets you more space for design data. But organizational problems grow and grow!
Then comes 32 bit machines and Linux, virtual memory, shared libraries and swap files. And we are happy at last
In Spinix, I would use a shell script to run large programs that have been broken up into smaller programs. That how it runs the Spin compiler. The spc script runs spinit, spasm and splink to compile, assemble and link a spin program.
YES !!!! This is a very sweet resource that people are just beginning to see the utility and FUN of using. Spinx is a delightful system setup that makes the Propeller with an SDcard very extensible and powerful.
The dilemma is that people are very wary of Forth. The public perception is that it is passe. But it is better than ever when you apply 32bits, 8 CPUs, and have 32K of ram extended by a SDcard interface.
I wasn't aware that Spinx had progressed that far. I am still focusing on Forth for my own personal learning, and pfth is being very rewarding.
So - consider a large spin program - object 1,2,3,4 each compiles to 15k so the total size is 60k. This clearly won't fit in the hub. For simplicity lets assume it is all spin with no pasm, and no cogs are being started or stopped. Object 1 stays in memory and objects 2,3,4 are loaded in and out as required.
The program is written in the proptool and so there are no #ifdefs. Blocks of code are commented in and out as needed and the program is compiled 3 times - object 1+2, object 1+3 then object 1+4. These 3 binary files, each 30k in size, are copied to an SD disk and the program starts off running the program in object 1+2.
The main object passes control to each object in turn.
I think the clue to getting this working is Dave Hein's previous work - in fact I think much of the code might be already written. His code is below and explains the whole pbase vbase concept. The main difference I am looking for is a way of including an SD loader, and making the tasks much bigger.
Looking at your code, each task is calling objects like the serial object. Normal spin has limitations too such as objects not being able to reference back to other objects declared in the main routine. But there are other solutions - eg an overlay could replicate all the spin code for running the serial driver, and just pass it the location in hub and a new serial driver can be incrementing head, tail and transmitting and receiving data.
So - for simplicity, each overlay consists of spin code, and if any objects are needed within that spin code, they are included in the overlay object, even if that duplicates code already in the main object (and assuming no conflicts like trying to talk to the same hub locations). So things are as simple as possible - the overlay is a self contained piece of code.
I'm still not sure about jumps in spin - whether they are relative or absolute. If they are relative, then code is easier to relocate to different memory locations. But I suspect jumps are absolute, so if they are, I'm thinking of ways of compiling code.
If an overlay is self contained, could one compile it to always run at a fixed location in ram? Say it runs at location 20,000 in hub. DAT code seems to end up at the start of a program, so could one pad out the beginning of a spin program with a DAT section consisting of 20,000 NOPs.
Hmm - that won't work. If the overlay has any DAT code it will end up at the beginning as well, with an unknown length.
Maybe instead of the overlay sitting in high ram, it would be easier to relocate to low ram? Then it becomes like a standard Spin program.
So we kick off the program with a bootloader that has a jump to a PUB Main, and we put in a DAT with 20,000 NOPs and the compiler then reorders this so the DAT is at the beginning and then jumps to the PUB which is now running from higher memory. Now we load a precompiled overlay into lower memory. Maybe we need to save the object list first and then reload that back in, as the object list sits at the beginning of the program.
Hmm- tricky.
Ok, the challenge is to try to hack the proptool in the simplest way. Let's check that.
Program 1
and now add a little DAT section
Program 2
We know that the DAT section gets moved so it ends up before the program. So this means the program starts at 0x18. Then we can infer that the PUB compiles to 36 65 04 7C 32 00
The "repeat" presumably compiles to a hidden GoTo. So then we can look at this PUB code and confirm that yes, indeed, it is the same code but in a different location. So in fact, the "repeat" is compiling to a "GoTo PC +/- a certain number"
The code at the beginning of programs is complicated, so lets leave the complicated stuff at the beginning and jump to a simple routine at the top of ram. So we can maybe compile a program which might be
I'm not entirely sure about stacks and the like, but the general idea is to have the main routine kept very simple and running in high memory. Also in high memory needs to be the code to actually load and unload overlays - eg an SD driver, or an SRAM or Flash driver. Probably whichever is the smallest code. I'm kind of drawn to SPI SRAM with 4 data lines.
I'm not sure of the details, but I think there must be a solution where you have a self contained spin program where the main has been relocated to near the top of ram, then you run it, and then trick the program by moving different code into lower ram, running that, then unloading it and moving the original code back. If the stack etc are preserved, the program never need know that a new bit of code was moved in and out of lower ram.
You might be better off using homespun or bst and look at the listing. There you can see where code is being located and what comes first. IIRC bst is capable of reserving space at the bottom of hub before any spin objects. I don't recall the command, but it was so we could access the lower hub ram using rd/wrbyte/word/long and the immediate hub value mode.
Dave, yes I am thinking along the same lines. Overlays are complete spin programs. Same number of parameters so a common skeleton structure.
I'm wondering about something like
If that would work, need to also think about DAT and VAR and OBJ - do they need to be the same structure for the overlays?
Sounds like you are building an operating system.
I used to take the Kyedos route and load a second program when certain menus were requested but I've recently figured out how to move all the touchscreen menu data onto the SD card which has greatly reduced the size of the program. I no longer need to break it up into two programs. I fear my consolidation down to a single program may be short lived. I can store information about a menu's layout and contents on the SD card but there are times when there will be something unique about a menu or display that doesn't lend itself to being stored externally and requires a unique method. These unique methods continue to accumulate, threatening my single program arrangement.
It would sure be nice if these unique methods could be stored externally and used as a sort of "overlay".
I'll continue to watch these discussions with interest.