Idea: Relocatable assembly modules - serialize machine code from one cog into a
Dennis Ferron
Posts: 480
So, I'm working on this TV driver and I want to make the driver as simple as possible. That means I don't have anything but the bare minimum code to generate NTSC. Yet I'd also like to use my new driver in as many different kinds of new programs as possible, without adding too much code. I want to be able to add and remove features in the driver, without changing my simple core codebase.
For instance, what if I want to superimpose a mouse cursor over the video image without erasing pixels under the cursor? That's trivial to do when you generate the NTSC, because injecting the mouse cursor pixels into the TV output doesn't erase the pixels in the buffer, but more difficult to do outside the TV driver because you'd have to erase the buffer pixels and save them when the cursor obscures them. And what if there is no buffer (sprite based)? Or, what if I want to impose text over scanline 3-d graphics? I don't want this stuff in the TV driver core codebase because it isn't 100% necessary, but I still need to run this code in the TV driver cog.
One way this could be done is with conditional compilation, and have all the features marked with #ifdef's. We don't have that in the Propeller tool. But honestly, even if I did have conditional compilation, I wouldn't want to use it, because the problem is still the same - it's difficult to read code that is loaded with extra junk.
It occurred to me that the real root of my problem is that I can't break a Propeller assembly program into pieces in many files, and combine these components in one place to build a cog program. Or... what if I could?
What I propose is a Spin + Assembly library for building the 2 KB cog programs conditionally, but at run time, not compile time. A feature plug-in framework for assembly language drivers, if you will. You have your basic cog program that can be customized (like my TV driver) and in that program you set aside some amount of memory to be used to hold the code and data of plug-ins. Say I only use 1K of memory for my TV driver; at the end of it I can reserve another 1K to hold plug-ins.
The scheme requires a little cooperation from the programs. There would be two little kernels, one for serializing (streaming programs out to hub memory) and another for deserialization (streaming code in from the hub). They'd be snippets of assembly code that you cut and paste into the top of your assembly language program; they would be as small as possible and you only need one or the other in each module.
In the assembly code that you intend to plug in to something else, you would start with the serialization kernel. After that, you would code an entry point and the function that will be called when the plug-in is activated. You would then use cognew (from Spin) to actually start the serialization kernel + plug-in in its own cog. (This scheme requires a spare cog temporarily.) Now you have a cog running the serialization kernel + the plug-in, even though the plug-in isn't actually plugged into anything. All that happens is that the serialization kernel examines cog memory from the start of your entry point to the end of your data segment, and copies those bytes out to hub memory.
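To make the copy-out step concrete, here is a minimal sketch in Python that models it, assuming cog RAM is a list of 512 32-bit longs and the hub image is a little-endian byte string. The names `serialize` and `deserialize` are hypothetical stand-ins for the two kernels, not existing library routines.

```python
import struct

COG_LONGS = 512  # a cog holds 512 32-bit longs (2 KB)

def serialize(cog_ram, start, end):
    """Copy cog longs [start, end) into a little-endian byte image (the 'hub' copy)."""
    if not (0 <= start <= end <= COG_LONGS):
        raise ValueError("range must lie inside cog RAM")
    longs = cog_ram[start:end]
    return struct.pack("<%dL" % len(longs), *longs)

def deserialize(image, cog_ram, dest):
    """Copy a byte image back into cog RAM, starting at long address dest."""
    count = len(image) // 4
    if dest < 0 or dest + count > COG_LONGS:
        raise ValueError("image does not fit at dest")
    cog_ram[dest:dest + count] = struct.unpack("<%dL" % count, image)

# Hypothetical usage: the plug-in occupies longs 16..18 in its own cog
# and is reloaded at long 256 in the host cog.
plugin_cog = [0] * COG_LONGS
plugin_cog[16:19] = [0x11111111, 0x22222222, 0xDEADBEEF]
image = serialize(plugin_cog, 16, 19)
host_cog = [0] * COG_LONGS
deserialize(image, host_cog, 256)
```

The sketch copies the longs verbatim; any address patching would happen on the image between the two calls.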
Next, you stop the cog that is running the plug-in and load your host driver (the program that will accept the plug-in). The host driver contains a deserialization kernel that starts up before your real driver actually starts. From Spin, you query the deserialization kernel as to where in cog RAM the plug-in will be relocated to (your plug-in was compiled for a certain memory address that is different from the location reserved for it). Again from Spin, you take this information about where the plug-in will be moved to, and decode all of the instructions in the plug-in image that reference cog RAM locations. You then update these addresses to reflect the new address range that the plug-in will occupy.
(Variables that contain pointers to other cog RAM memory addresses might be a special case - you don't know whether they are meant to point to labels in code, or whether they are actually not pointers, just number values that should be kept the same. I propose it should be up to the programmer to mark these special variables, by putting them into a special, labeled section which the serializer is informed about.)
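One way to picture the "marked section" idea is a list of offsets that tells the relocator which longs hold cog pointers. The Python sketch below is only an illustration: the function name `relocate_pointers`, the offset list, and the assumption that a pointer lives in the low 9 bits of its long are all mine, not part of any existing tool.

```python
def relocate_pointers(image, pointer_offsets, old_base, new_base):
    """Return a copy of image with only the marked pointer longs shifted.

    image           -- list of 32-bit longs making up the plug-in
    pointer_offsets -- indexes into image that the programmer marked as pointers
    old_base        -- cog address the plug-in was assembled for
    new_base        -- cog address the host reserved for it
    """
    delta = new_base - old_base
    out = list(image)
    for off in pointer_offsets:
        addr = (out[off] & 0x1FF) + delta      # assumed: pointer in low 9 bits
        if not 0 <= addr < 512:
            raise ValueError("relocated pointer leaves cog RAM")
        out[off] = (out[off] & 0xFFFFFE00) | addr
    return out

# Hypothetical usage: long 0 is a marked pointer, long 1 is a plain number
# that happens to have the same value and must stay untouched.
image = [0x012, 0x012]
moved = relocate_pointers(image, [0], old_base=0x010, new_base=0x100)
```

Unmarked longs pass through unchanged, which is the point of making the programmer declare which variables are pointers.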
Once the Spin code part of the library finishes the relocation adjustment, the deserializer in the driver cog is given the address of the program image in hub RAM, and it reconstitutes this image back into cog RAM, but now as part of the host program. The code gets run by the host program calling into the entry point of the plug-in. (This entry point is known to the host program because it is the label of the space reserved for the plug-in buffer.)
I'm not asking for volunteers to write this for me - I'll write it. What I want to know is what you think of it.
In addition to customization, it would also allow for "gypsy" code that wanders from cog to cog in response to load. Because the scheme preserves code and data perfectly, not just static compiled data but actual runtime data as well, one could imagine that instead of just plug-ins you could also have self-contained tasks that get transferred to whatever cog has the most free CPU cycles. For instance, suppose in a 3-d game you have some extra cycles here and there in different cogs. Tasks that can run in the background (like player AI) could be shuffled from cog to cog depending on what part of the 3-d engine is idle.
Comments
My log($1) cents,
Marty
for any code that fits in the hub, since you can patch up all the loop addresses and the like.
You also have to take into account library subroutines (that is, this subroutine you're "linking in" requires the
multiplication subroutine, so that needs to be linked in). And you need to decide if you want your relocation
information (what is where) to live in the hub with the code (fast but competes for space) or in main memory
(slower, does not compete for space, and you don't have to worry about other cogs taking that memory).
Finally, you have to decide what your data model will be; will it be ASM-like in that you just have a flat set
of registers? Or do you want to go ahead and use a main-memory stack and move data to and fro?
I was going to make a BASIC compiler targeting this sort of architecture, but like so many things, I just haven't
found the time. But imagine an interactive BASIC environment running at full assembly speeds!
One snag - how does the loader figure out what its base address is? I was thinking you could use JMPRET to a known address (i.e. JMPRET 1, #0, where 0 is a NOP and 1 is a RET) then pull the source (i.e. PC) from the RET as the base address. (Can't use JMPRET 0,#0 due to pipeline issues.)
I don't think a generic loader is possible because it's virtually impossible to distinguish code from data and whether immediate values are data or register addresses.
I'm fond of the KISS approach. Library subroutines could be hell to support, and they'd need to live in the cog with the "code swapper", ultimately limiting the swappable code size. I vote to just leave them out if they're not strictly needed. Code/data separation in swapped code could be handled with a delimiter or equivalent system (i.e. some nonsense instruction). Immediate values and register addresses can be easily differentiated. There's a bit in every instruction that controls this. Hm... how would you access data structures from the "host" cog? Maybe hide some instructions in the data section of the swapped code? This would also give access to static functions in the "host" cog (i.e. use a two-stage JMPRET chain).
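For reference, the bit in question is the I flag (bit 22) of the 32-bit Propeller instruction word. The following Python sketch pulls the fields apart; the function names are hypothetical, and it only assumes the documented layout of opcode (bits 31..26), ZCRI flags (25..22), condition (21..18), destination (17..9) and source (8..0).

```python
def decode(instr):
    """Split a 32-bit Propeller instruction long into its fields."""
    return {
        "opcode": (instr >> 26) & 0x3F,
        "immediate": bool((instr >> 22) & 1),  # I flag: source is a literal
        "cond": (instr >> 18) & 0xF,
        "dest": (instr >> 9) & 0x1FF,
        "src": instr & 0x1FF,
    }

def src_is_register(instr):
    """True when the source field names a cog register rather than a literal."""
    return not decode(instr)["immediate"]

# Hypothetical instruction word built by hand for illustration.
word = (0b101000 << 26) | (1 << 22) | (0xF << 18) | (0x123 << 9) | 0x045
fields = decode(word)
```

Note that the I flag only says the source is a literal; a literal can still be a cog address (a jump to #label, for example), so this test alone probably does not settle which values need relocating.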
rockiki: Funny I had the same thought: "What if Femto BASIC worked this way?"
ericball: Yeah, you'd have to tag what's what so the loader/linker knows what to change and what not to touch.
Mike: I see. I wasn't aware that was how the large memory model works. That'll be a good point for further research.
rockiki: Libraries... subroutines... you are playing into my natural inclination to explore all the complex possibilities, but...
Lawson: You've got it. KISS is definitely where I need to go with this. Thing is, if I give the system too many features, it might become more work to use it than to do without it. I've been working on a component model for software in general, and I keep running into the same brick wall: the dependencies kill you. On the other hand, there is no getting away from some issue with dependencies; besides what rockiki mentions with subroutines there is the question of how will the movable code communicate with the host? I'm a fan of message-passing (a la Smalltalk) but that's either the best idea or the worst idea here. I'll tell you which next week.
So then I thought about it from the COG's perspective - each cog long contains either an instruction or data. Instructions are an opcode, a destination address and a source address or an immediate value (which may be an address value). Hmmm... If we ignore immediate data, then the source & destination of each instruction are both addresses, and the loader will need to update both. And... there's no reason we can't use data addresses instead of immediate values.
- ORG 0 each relocatable routine
- relocatable routines may not use immediate values - all data must be stored following the code
- the loader is passed the base address of the relocatable routine (14 bits), the number of code longs (8 or 9 bits) and the number of data longs (8 or 9 bits)
- after each code RDLONG, the loader does an ADD ptr, baseaddr << 9 + baseaddr
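The rules above can be modeled in a few lines of Python. This is only a sketch under my reading of the list: `cog_base` is assumed to be the cog address where the routine lands, `relocate_code` and `load` are hypothetical names, and the routine is assumed to have been assembled at ORG 0 with no immediates.

```python
FIELD_MASK = 0x1FF  # destination and source fields are 9 bits each

def relocate_code(code_longs, cog_base):
    """Add cog_base to both 9-bit address fields of every code long."""
    out = []
    for instr in code_longs:
        dest = ((instr >> 9) & FIELD_MASK) + cog_base
        src = (instr & FIELD_MASK) + cog_base
        if dest > FIELD_MASK or src > FIELD_MASK:
            raise ValueError("relocated address exceeds 9 bits")
        # Same arithmetic as ADD ptr, (baseaddr << 9) + baseaddr
        out.append((instr + ((cog_base << 9) + cog_base)) & 0xFFFFFFFF)
    return out

def load(code_longs, data_longs, cog_base):
    """Relocated code followed by the data longs, which are copied untouched."""
    return relocate_code(code_longs, cog_base) + list(data_longs)

# Hypothetical routine: one instruction with dest=3, src=4, plus two data longs.
routine = [(0b100000 << 26) | (0x003 << 9) | 0x004]
loaded = load(routine, [7, 9], cog_base=0x040)
```

The range check is an addition of mine: without it, a field that overflows 9 bits would carry into the neighbouring field. It also means a routine that touches the special-purpose registers at $1F0-$1FF would be rejected by this sketch, which hints that those references would need separate handling.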
Hmm... What's missing is how the routine returns to the caller. Well... that's an exercise for tomorrow. (Though I have a really sneaky idea...)
Post Edited (ericball) : 7/11/2007 8:37:43 PM GMT