Preprocessor progress for the reverse LMM
Phil Pilgrim (PhiPi)
Posts: 23,514
In an earlier thread, I detailed a method for speeding large memory model (LMM) execution by arraying the code backwards in memory. Since it's awkward to write code this way, some sort of preprocessor was called for, which I've written in Perl. An adidtional advantage of using a preprocessor is that you can make up your own programming model and opcodes and incorporate them into the assembler as if they were native to the hardware. And that's just what I ended up doing.
My first objective was to design a virtual machine (VM) for the LMM whose instructions were all single longs. This means that no immediate data could be included as a second long after the original instruction, since doing so can be rather wasteful. Nonetheless, it's still possible to include up to nine bits of data in instructions that require VM intervention by cramming them into the unused destination field of a jmp (which is just a jmpret with an nr):
The VM can then extract the data from the instruction for whatever purpose it's intended. The nine-bit limitation precludes relative jmps and calls to addresses more than 512 longs from the caller, but this can be handled using a jump table in the leftover cog memory. My feeling is that such a table will remain small, since most jumps are local anyway.
I've also eliminated tstjz, tstjnz, and djnz since their implementation would require an extra long. But I might add them later to use specific registers, rather than the full set, as a compromise.
In addition to straight assembly language, the VM accommodates "threaded code". This gives it the capability to emulate Forth interpreters, for example, and provides the user some extensibility to make up his own instructions. Each threaded code "instruction" is just a hub address that points to a subroutine, so it takes up a word rather than a long. While this has the potential to result in very compact code, relative to assembly, the savings are difficult to realize in short programs, due to the necessary overhead of useful library "words", which is the term typically used when referring to threaded code subroutines.
The inclusion of threaded code also necessitates designing a programming model based on a restricted, specialized register set, along with an expression stack, much like the stack in Forth, so I'll introduce that model next.
Programs in this model include three segments:
····0: The base address of the current object (@@0).
····1: The size of the stack area in longs.
····2: The beginning hub address of the LMM program header, which actually comes at the end of the program.
It can also include any other data or variable space the user program needs. (Variable space can be partitioned from the stacks by adjusting the stack pointers at the beginning of one's program.)
When the VM starts, it reads the register segment into the cog and fixes any addresses in the included address table by adding the object's base address to them. This helps to save time in jumps and calls during runtime. It then initializes the stack pointers and starts the emulation loop from the beginning of the user's program.
The stack area houses both the expression stack (longs), building upward from the beginning of the array, and the return stack (words), building downward from the end. No attempt is made to detect stack collisions, so the programmer must make sure to include enough room.
Several special registers are available to the programmer. These are:
Incidentally, here's what the preprocessor produces from the above code:
Well, this thread is getting pretty long, so maybe I should stop here for now, even though there's a lot more to talk about. I'm really not sure how much further to pursue this: whether to treat it as useful or just as an acdemic exercize. At the very least, it demonstrates what could be accomplished with a decent set of preprocessor hooks built into the IDE. As it is now, I have to copy the output from my Perl program and paste it into the IDE edit window.
-Phil
Post Edited (Phil Pilgrim (PhiPi)) : 1/28/2008 10:49:47 PM GMT
My first objective was to design a virtual machine (VM) for the LMM whose instructions were all single longs. This means that no immediate data could be included as a second long after the original instruction, since doing so can be rather wasteful. Nonetheless, it's still possible to include up to nine bits of data in instructions that require VM intervention by cramming them into the unused destination field of a jmp (which is just a jmpret with an nr):
jmpret <data>,#VM_service nr
The VM can then extract the data from the instruction for whatever purpose it's intended. The nine-bit limitation precludes relative jmps and calls to addresses more than 512 longs from the caller, but this can be handled using a jump table in the leftover cog memory. My feeling is that such a table will remain small, since most jumps are local anyway.
I've also eliminated tstjz, tstjnz, and djnz since their implementation would require an extra long. But I might add them later to use specific registers, rather than the full set, as a compromise.
In addition to straight assembly language, the VM accommodates "threaded code". This gives it the capability to emulate Forth interpreters, for example, and provides the user some extensibility to make up his own instructions. Each threaded code "instruction" is just a hub address that points to a subroutine, so it takes up a word rather than a long. While this has the potential to result in very compact code, relative to assembly, the savings are difficult to realize in short programs, due to the necessary overhead of useful library "words", which is the term typically used when referring to threaded code subroutines.
The inclusion of threaded code also necessitates designing a programming model based on a restricted, specialized register set, along with an expression stack, much like the stack in Forth, so I'll introduce that model next.
Programs in this model include three segments:
- The "register segment" (rseg), which resides in cog memory with the VM and includes the register space and jump table.
- The "code segment" (cseg), which comprises the executable assembly code and threaded code and runs backwards in hub memory.
- The "data segment" (dseg), which contains any user variables that reside in hub memory. In reality, the dseg is just part of the cseg, with the exception that it's not reversed, which makes its visualization a lot easier.
····0: The base address of the current object (@@0).
····1: The size of the stack area in longs.
····2: The beginning hub address of the LMM program header, which actually comes at the end of the program.
It can also include any other data or variable space the user program needs. (Variable space can be partitioned from the stacks by adjusting the stack pointers at the beginning of one's program.)
When the VM starts, it reads the register segment into the cog and fixes any addresses in the included address table by adding the object's base address to them. This helps to save time in jumps and calls during runtime. It then initializes the stack pointers and starts the emulation loop from the beginning of the user's program.
The stack area houses both the expression stack (longs), building upward from the beginning of the array, and the return stack (words), building downward from the end. No attempt is made to detect stack collisions, so the programmer must make sure to include enough room.
Several special registers are available to the programmer. These are:
- pc: The LMM program counter.
- ip: The threaded code instruction pointer.
- ep: The expression stack pointer.
- op: The object pointer (base address of the current object).
- rp: The return stack pointer.
- ra and rb: General purpose data registers (accumulators) for which specialized instructions exist and which, together, can be treated as a 64-bit register in certain circumstances.
- rx: General purpose data register for which an additional specialized instruction exists for handling object-relative addresses.
- jmp, call, ret: These aren't really "new", of course. They're just intercepted by the preprocessor and converted to the appropriate relative jumps (add pc,#<displacement>) or VM-mediated jumps, calls, and returns.
- mov <reg>,#@<hub address>: This is just a special notation that overloads the mov instruction. When the preprocessor encounters it, it places a reference to the hub address in the jump table (which gets fixed on startup) and encodes a simple mov from the jump table location. This entails that this particular jump table entry take up an entire long (for speed), rather than a singe word, as other entries do.
- enter: Enter threaded code. The next "instruction" is the address of a threaded code word.
- next: Execute the next threaded code word. This is used at end of an assembly-language subroutine "word" instead of ret
- exit: Exit threaded code back into assembly language.
- xnext: Combination of exit and next, used at the end of a threaded code "word" which is, itself, written in threaded code.
- pusha, pushb, pushx, pushab: Push the indicated register (ra, rb, rx) or register pair (ra:rb) onto the expression stack.
- popa, popb, popx, popab: Pop the indicated register (ra, rb, rx) or register pair (ra:rb) off of the expression stack.
- lda, ldb, ldab: Load the indicated register (ra, rb,) or register pair (ra:rb) from the word(s) pointed to by ip, post-decrementing (backwards cseg, remember) ip by two for each one.
- ldx: Load the rx register from the word pointed to by ip, post-decrementing ip by two and adding op to rx.
- mulab: Multiply (unsigned) ra by rb, leaving the 64-bit product in ra:rb.
- divabx: Divide (unsigned) ra:rb by rx, leaving the quotient in rb and the remainder in ra. An overflow will occur if rx < ra on entry, but it's not flagged.
cseg double enter '(a -- 2*a) Double the number on top of the stack. .dup .add .xnext dup enter '(a -- a a) Duplicate the number on top of the stack. popa pusha pusha next add popa '(a b -- a+b) Add the two numbers on top of the stack. popb add ra,rb pusha next xnext xnext 'Link to xnext from .xnext
Incidentally, here's what the preprocessor produces from the above code:
'--------[noparse][[/noparse] Code Segment ]-------- xnext jmpret vm#_vm_ret,vm#_xnext 'Link to xnext from .xnext jmpret vm#_vm_ret,vm#_next jmpret vm#_vm_ret,vm#_pusha add vm#_ra,vm#_rb jmpret vm#_vm_ret,vm#_popb add jmpret vm#_vm_ret,vm#_popa '(a b -- a+b) Add the two numbers on top of the stack. jmpret vm#_vm_ret,vm#_next jmpret vm#_vm_ret,vm#_pusha jmpret vm#_vm_ret,vm#_pusha jmpret vm#_vm_ret,vm#_popa dup jmpret vm#_vm_ret,vm#_enter '(a -- a a) Duplicate the number on top of the stack. word 0 word @xnext,@add,@dup double jmpret vm#_vm_ret,vm#_enter '(a -- 2*a) Double the number on top of the stack. '------[noparse][[/noparse] Register Segment ]------ org $1F0 '--------[noparse][[/noparse] Header ]-------- my_prog word 0,0
Well, this thread is getting pretty long, so maybe I should stop here for now, even though there's a lot more to talk about. I'm really not sure how much further to pursue this: whether to treat it as useful or just as an acdemic exercize. At the very least, it demonstrates what could be accomplished with a decent set of preprocessor hooks built into the IDE. As it is now, I have to copy the output from my Perl program and paste it into the IDE edit window.
-Phil
Post Edited (Phil Pilgrim (PhiPi)) : 1/28/2008 10:49:47 PM GMT
Comments
I think this is something more than academic.
My assessment of the Propeller situation is as follows:
(a) PASM is way too fast for many tasks, but it is the only possibility to speed up the too slow SPIN. The true performance of the Propeller has never been unleashed as not many algorithmically demanding programs can be written in 496 PASM instructions. With notable exceptions as TV and some high speed Data Acquisition, COGs are mainly waiting at the moment.
(b) The possible speed-up by using multiple COGs overcompensates the degradation from LMM. But what's more: There might be not much degradation at all! By your improved LMM loop an instruction accessing the HUB memory is not slower than a true HUB instruction! So everything depends on the ratio of HUB/COG instructions. I gave some figures in another thread some weeks ago that showed a bad speed-down with the old 2-cycle LMM loop, but when thinking in terms of large scale assembly programs, those programs will have to access much more HUB variables than the recent small scale PASM "snippets".
(c) We shall have to see how the new C-compiler will behave. Maybe its code will be very good, but my general experience tells me it is very difficult to generate small code for a RISC machine without extreme internal code optimization. So maybe the generated code will soon fill the HUB, so we again have to wait for a secondary overlay scheme.
Conclusion:
I should very much like to write (or generate!) assembly code not restricted to 496 instructions. A basic assembler is not difficult to write as the many simulators available show; uploading a binary is also well understood. So the IDE could be soon elided.
Post Edited (deSilva) : 1/29/2008 2:42:27 AM GMT
Phil you can try the PreCompiler Feature of PreSpin to simplify the handling.
PreSpin automatically gets the Source code written in the IDE, does its own preprocessing (includes for example), then writes a file: preprocessed.spin. If you specify a Precompiler (your Perl program) in the settings, then this program is started. After the Precompiler has finished, the changed preprocessed.spin is opened in the IDE and the Spin Compiler and Uploader is started. You can do this all with 1 mouseclick.
Your Perl programm must be able to read the preprocessed.spin file, makes the Preprocessing and overwrite the preprocessed.spin with the output.
Andy
Thanks for the reminder! I've installed PreSpin and am very impressed with its capabilities! The automated file dialogs are ponderously slow to transact, though. (The "File Open" dialog for preprocessed.spin often takes a full minute to complete.) But I assume this is a Windows issue that you don't have any control over.
Aside from that, I do have a couple suggestions:
1. Have PreSpin write a one-line file in the Propeller Tool folder showing where preprocessed.spin was written. That way the user's preprocessor can find it, given the multitude of possibilities across several projects emanating from different folders. Better still, would be just to pipe the code to the user's STDIN and retrieve it from STDOUT, allowing the user's preprocessor to be path-agnostic.
2. My preprocessor can flag errors. Right now, all it can do is write the error as a comment to preprocessed.spin. If there were some way to communicate errors to PreSpin (via STDERR, perhaps) to block the automatic Compile+Run, that would be very nice. Even better would be for PreSpin to locate and highlight the error in the original source, via some agreed-upon error-reporting protocol.
I know this entails a lot of work. But I'm convinced that the more successful and accepted PreSpin becomes, the more incentive and direction Parallax (i.e. Jeff Martin) will have for including something like it in the IDE. Promoting and achieving this latter objective has become my mantra for the Propeller Forum (in case no one's noticed ).
Thanks!
-Phil
The 1 Minute Delay at loading should not be, I never had such a problem. I see, that this makes PreSpin not very usefull for you.
1) When I start the external Preprocessor, I pass the full path and name of preprocessed.spin as a CommandLine parameter to the application. I also set the Working directory for the application to the one where preprocessed.spin is located, so it should work with a relative Path.
2) You can write a Error Message like this as first line of preprocessed.spin:
The first line is tested after preprocessing and if <' --> Preprocessor:> in the Line is detected, the Compiler and Downloader is not started.
But I can't jump to Lines with Errors.
Perhaps you can describe where exactly this long Delay at Loading occures.
After the FileOpen Dialoge of PreSpin is closed, half a second later should open the Open-Dialoge of PropTools and
the path and filename is written to the Filename field. After another half second the Dialoge closes and the Spin File is loaded.
Andy
Well, this is weird. I was trying to do a screen capture of the stuck dialog box for you and managed to crash Windows Explorer in the process. When it came back up, the problem was gone. Anyway, thanks for the additional information.
As it turns out, your program has inspired me to perform the necessary interactions with the Propeller Tool in a standalone fashion using a Windows-aware version of my preprocessor. My plan is to do everything through the clipboard, instead of using an auxilliary file, putting the original source in a block comment so it can be flipped back for editing. I think I should be able to do error highlightling in the process by homing the cursor, and incrementing down the correct number of lines and over the correct number of columns. We'll see.
Thanks again for an excellent program!
-Phil