Phil Pilgrim (PhiPi)
01-29-2008, 05:11 AM
In an earlier thread (http://forums.parallax.com/showthread.php?p=701330), I detailed a method for speeding large memory model (LMM) execution by arraying the code backwards in memory. Since it's awkward to write code this way, some sort of preprocessor was called for, which I've written in Perl. An additional advantage of using a preprocessor is that you can make up your own programming model and opcodes and incorporate them into the assembler as if they were native to the hardware. And that's just what I ended up doing.
My first objective was to design a virtual machine (VM) for the LMM whose instructions were all single longs. This means that no immediate data could be included as a second long after the original instruction, since doing so can be rather wasteful. Nonetheless, it's still possible to include up to nine bits of data in instructions that require VM intervention by cramming them into the unused destination field of a jmp (which is just a jmpret with an nr):
jmpret <data>,#VM_service nr
The VM can then extract the data from the instruction for whatever purpose it's intended. The nine-bit limitation precludes relative jmps and calls to addresses more than 512 longs from the caller, but this can be handled using a jump table in the leftover cog memory. My feeling is that such a table will remain small, since most jumps are local anyway.
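As a rough illustration of the packing described above, here is a small Python model (not PASM, and not the VM's actual code) that builds the long for such an instruction and pulls the nine-bit payload back out of the destination field. The bit layout follows the standard Propeller instruction format; the names encode_vm_call, extract_data, and the VM_SERVICE address are made up for the example.

```python
# Propeller (P1) instruction layout, most significant bits first:
#   opcode(6) | ZCRI(4) | condition(4) | destination(9) | source(9)
JMPRET_OPCODE = 0b010111
COND_ALWAYS = 0b1111
IMMEDIATE_NO_RESULT = 0b0001   # Z=0, C=0, R=0 (the "nr" effect), I=1 (the "#")


def encode_vm_call(data, service):
    """Build 'jmpret <data>,#<service> nr' as a 32-bit long."""
    if not 0 <= data < 512 or not 0 <= service < 512:
        raise ValueError("both fields are limited to nine bits")
    return ((JMPRET_OPCODE << 26) | (IMMEDIATE_NO_RESULT << 22)
            | (COND_ALWAYS << 18) | (data << 9) | service)


def extract_data(instruction):
    """Recover the nine-bit payload from the destination field."""
    return (instruction >> 9) & 0x1FF


VM_SERVICE = 0x1A0             # hypothetical cog address of a VM routine
long_value = encode_vm_call(37, VM_SERVICE)
print(hex(long_value), extract_data(long_value))
```

In the cog itself the same extraction would presumably be a shift right by nine followed by a mask, applied to the instruction long the LMM loop has just fetched.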
I've also eliminated tstjz, tstjnz, and djnz since their implementation would require an extra long. But I might add them later to use specific registers, rather than the full set, as a compromise.
In addition to straight assembly language, the VM accommodates "threaded code". This gives it the capability to emulate Forth interpreters, for example, and provides the user some extensibility to make up his own instructions. Each threaded code "instruction" is just a hub address that points to a subroutine, so it takes up a word rather than a long. While this has the potential to result in very compact code, relative to assembly, the savings are difficult to realize in short programs, due to the necessary overhead of useful library "words", which is the term typically used when referring to threaded code subroutines.
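For readers who haven't met threaded code before, the following toy model in Python may help. It is a generic sketch, not this VM: hub addresses are just dictionary keys, primitives are Python functions, and Python's own call stack stands in for the return stack. All names and addresses (DUP, ADD, DOUBLE, execute) are invented for the example.

```python
def dup(stack):
    """(a -- a a)"""
    stack.append(stack[-1])


def add(stack):
    """(a b -- a+b), wrapped to 32 bits"""
    b = stack.pop()
    a = stack.pop()
    stack.append((a + b) & 0xFFFFFFFF)


# Hypothetical hub addresses for three "words".
DUP, ADD, DOUBLE = 0x0100, 0x0104, 0x0110

memory = {
    DUP: dup,               # primitive, written in "assembly"
    ADD: add,               # primitive, written in "assembly"
    DOUBLE: (DUP, ADD),     # threaded word: nothing but a list of addresses
}


def execute(memory, address, stack):
    """Run the word at `address`; threaded words call other words in turn."""
    body = memory[address]
    if callable(body):
        body(stack)
    else:
        for called in body:
            execute(memory, called, stack)


stack = [21]
execute(memory, DOUBLE, stack)
print(stack)
```

The body of DOUBLE is two 16-bit addresses, four bytes in all, where the equivalent pair of calls written as LMM instructions would take two longs.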
The inclusion of threaded code also necessitates designing a programming model based on a restricted, specialized register set, along with an expression stack, much like the stack in Forth, so I'll introduce that model next.
Programs in this model include three segments:
    The "register segment" (rseg), which resides in cog memory with the VM and includes the register space and jump table.
    The "code segment" (cseg), which comprises the executable assembly code and threaded code and runs backwards in hub memory.
    The "data segment" (dseg), which contains any user variables that reside in hub memory. In reality, the dseg is just part of the cseg, with the exception that it's not reversed, which makes its visualization a lot easier.
The LMM VM is started with a cognew that provides the address of a stack area, which is a long array in VAR space whose first three longs must include the following data:
    0: The base address of the current object (@@0).
    1: The size of the stack area in longs.
    2: The beginning hub address of the LMM program header, which actually comes at the end of the program.
It can also include any other data or variable space the user program needs. (Variable space can be partitioned from the stacks by adjusting the stack pointers at the beginning of one's program.)
When the VM starts, it reads the register segment into the cog and fixes any addresses in the included address table by adding the object's base address to them. This helps to save time in jumps and calls during runtime. It then initializes the stack pointers and starts the emulation loop from the beginning of the user's program.
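The address fix-up at startup amounts to a one-time relocation. A minimal Python sketch of the idea follows, assuming the table holds object-relative offsets and that hub addresses fit in 16 bits. The function name and all the values are invented for illustration.

```python
def fix_addresses(address_table, object_base):
    """Turn object-relative offsets into absolute hub addresses."""
    return [(offset + object_base) & 0xFFFF for offset in address_table]


# Made-up startup longs: object base, stack size in longs, header address.
stack_area = [0x0410, 64, 0x0123]
object_base = stack_area[0]

relocated = fix_addresses([0x0010, 0x0024, 0x0100], object_base)
print([hex(address) for address in relocated])
```

Doing the addition once at startup means a jump or call through the table needs only a read at runtime, with no per-jump arithmetic.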
The stack area houses both the expression stack (longs), building upward from the beginning of the array, and the return stack (words), building downward from the end. No attempt is made to detect stack collisions, so the programmer must make sure to include enough room.
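The two-stacks-in-one-array arrangement can be modeled in a few lines of Python. This is only a sketch under assumptions the post doesn't settle: the expression stack is taken to start at offset 0, pointers are byte offsets, pushes on the long stack post-increment, and pushes on the word stack pre-decrement. The class and method names are hypothetical; ep and rp are the register names from the post.

```python
import struct


class StackArea:
    """One buffer shared by a long stack (growing up) and a word stack (growing down)."""

    def __init__(self, size_in_longs):
        self.memory = bytearray(4 * size_in_longs)
        self.ep = 0                    # expression stack pointer, byte offset
        self.rp = len(self.memory)     # return stack pointer, byte offset

    def push_long(self, value):
        struct.pack_into("<I", self.memory, self.ep, value & 0xFFFFFFFF)
        self.ep += 4

    def pop_long(self):
        self.ep -= 4
        return struct.unpack_from("<I", self.memory, self.ep)[0]

    def push_return(self, address):
        self.rp -= 2
        struct.pack_into("<H", self.memory, self.rp, address & 0xFFFF)

    def pop_return(self):
        address = struct.unpack_from("<H", self.memory, self.rp)[0]
        self.rp += 2
        return address

    def free_bytes(self):
        """Gap between the two stacks; zero or less means they have collided."""
        return self.rp - self.ep


area = StackArea(4)
area.push_long(1)
area.push_long(2)
area.push_return(0x1234)
print(area.free_bytes())
```

As in the VM, nothing here stops the two stacks from overwriting each other; free_bytes is just a debugging aid for sizing the array.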
Several special registers are available to the programmer. These are:
    pc: The LMM program counter.
    ip: The threaded code instruction pointer.
    ep: The expression stack pointer.
    op: The object pointer (base address of the current object).
    rp: The return stack pointer.
    ra and rb: General purpose data registers (accumulators) for which specialized instructions exist and which, together, can be treated as a 64-bit register in certain circumstances.
    rx: General purpose data register for which an additional specialized instruction exists for handling object-relative addresses.
An assortment of new instructions has been added, whose execution is mediated by the VM. The preprocessor compiles each of these instructions into a VM call, so the user needn't be concerned about the details. Here are the new instructions:
    jmp, call, ret: These aren't really "new", of course. They're just intercepted by the preprocessor and converted to the appropriate relative jumps (add pc,#<displacement>) or VM-mediated jumps, calls, and returns.
    mov <reg>,#@<hub address>: This is just a special notation that overloads the mov instruction. When the preprocessor encounters it, it places a reference to the hub address in the jump table (which gets fixed on startup) and encodes a simple mov from the jump table location. This means that this particular jump table entry takes up an entire long (for speed), rather than a single word, as other entries do.
    enter: Enter threaded code. The next "instruction" is the address of a threaded code word.
    next: Execute the next threaded code word. This is used at the end of an assembly-language subroutine "word" instead of ret.
    exit: Exit threaded code back into assembly language.
    xnext: Combination of exit and next, used at the end of a threaded code "word" which is, itself, written in threaded code.
    pusha, pushb, pushx, pushab: Push the indicated register (ra, rb, rx) or register pair (ra:rb) onto the expression stack.
    popa, popb, popx, popab: Pop the indicated register (ra, rb, rx) or register pair (ra:rb) off of the expression stack.
    lda, ldb, ldab: Load the indicated register (ra, rb) or register pair (ra:rb) from the word(s) pointed to by ip, post-decrementing (backwards cseg, remember) ip by two for each one.
    ldx: Load the rx register from the word pointed to by ip, post-decrementing ip by two and adding op to rx.
    mulab: Multiply (unsigned) ra by rb, leaving the 64-bit product in ra:rb.
    divabx: Divide (unsigned) ra:rb by rx, leaving the quotient in rb and the remainder in ra. An overflow will occur if rx <= ra on entry (the quotient won't fit in 32 bits), but it's not flagged.
Threaded code words are entered as if they were normal assembly language instructions but are prefixed with a period (.), so they stand out as special. (Since the preprocessor is two-pass and maintains a symbol table, this wouldn't have been necessary, but it looks nice visually.) It's also possible to chain multiple threaded code words on one line, like this:
        cseg
double  enter                   '(a -- 2*a) Double the number on top of the stack.
        .dup .add .xnext
dup     enter                   '(a -- a a) Duplicate the number on top of the stack.
        popa
        pusha
        pusha
        next
add     popa                    '(a b -- a+b) Add the two numbers on top of the stack.
        popb
        add     ra,rb
        pusha
        next
xnext   xnext                   'Link to xnext from .xnext
Incidentally, here's what the preprocessor produces from the above code:
'--------[ Code Segment ]--------
xnext   jmpret  vm#_vm_ret,vm#_xnext    'Link to xnext from .xnext
        jmpret  vm#_vm_ret,vm#_next
        jmpret  vm#_vm_ret,vm#_pusha
        add     vm#_ra,vm#_rb
        jmpret  vm#_vm_ret,vm#_popb
add     jmpret  vm#_vm_ret,vm#_popa     '(a b -- a+b) Add the two numbers on top of the stack.
        jmpret  vm#_vm_ret,vm#_next
        jmpret  vm#_vm_ret,vm#_pusha
        jmpret  vm#_vm_ret,vm#_pusha
        jmpret  vm#_vm_ret,vm#_popa
dup     jmpret  vm#_vm_ret,vm#_enter    '(a -- a a) Duplicate the number on top of the stack.
        word    0
        word    @xnext,@add,@dup
double  jmpret  vm#_vm_ret,vm#_enter    '(a -- 2*a) Double the number on top of the stack.
'------[ Register Segment ]------
        org     $1F0
'--------[ Header ]--------
my_prog word    0,0
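The ordering in the listing above (last source line first, threaded words reversed within their line, and a zero word added when the count is odd) can be mimicked with a short Python sketch. This is not the Perl preprocessor, only a guess at the reordering step that reproduces what the listing shows. The item format and the name reverse_cseg are invented for the example, and the padding rule is inferred from the single "word 0" line.

```python
def reverse_cseg(items):
    """Emit code-segment items in reverse order.

    Each item is ("long", text) for one instruction line, or
    ("words", [names]) for a run of threaded code words.
    """
    output = []
    for kind, payload in reversed(items):
        if kind == "words":
            if len(payload) % 2:
                # Assumed: pad to a whole number of longs, pad word lowest in memory.
                output.append("word 0")
            names = ",".join("@" + name for name in reversed(payload))
            output.append("word " + names)
        else:
            output.append(payload)
    return output


source = [
    ("long", "double jmpret vm#_vm_ret,vm#_enter"),
    ("words", ["dup", "add", "xnext"]),
    ("long", "dup jmpret vm#_vm_ret,vm#_enter"),
]

for line in reverse_cseg(source):
    print(line)
```

Run on the first three source lines of the example, this yields the same four lines, in the same order, as the corresponding part of the generated code segment.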
Well, this thread is getting pretty long, so maybe I should stop here for now, even though there's a lot more to talk about. I'm really not sure how much further to pursue this: whether to treat it as useful or just as an academic exercise. At the very least, it demonstrates what could be accomplished with a decent set of preprocessor hooks built into the IDE. As it is now, I have to copy the output from my Perl program and paste it into the IDE edit window.
-Phil
Post Edited (Phil Pilgrim (PhiPi)) : 1/28/2008 10:49:47 PM GMT