Cluso99 said...
We are also going to need a fixed location for the return-to-Spin address. If the interpreter changes, the "PUSHx" address you currently use will change. Let me think on this. We do not want to waste code space in the interpreter. The interpreter always has the fetch loop at cog $020, and my interpreter maintains this. So if you just wish to return (no push required), just jmp #$020. Therefore, maybe all pushes should be done in LMM? What do you think?
Rather than keeping the loop at $020 and putting regs somewhere in the interpreter code, I have been putting the spare longs between the init code and the start of the loop.
The reason for this is that if you want to use inline asm in a separate object, you need the offsets not just to loop but also to the lmm entry points for jmp etc., to the regs, and to x, y, a, etc. Putting the regs, lmm entry points, etc. at the start of the code, in the same way that x, y, a, etc. are, makes it easy to have a set of offsets for the other object that don't keep moving. I ended up with something like this, which I copy into any object I want inline asm in:
'must be same as lmm_spin with local variables overlayed
                org     0
x               res     1               'these 8 occupy the entry-code space
y               res     1
a               res     1
sda_bit
t1              res     1
delaycnt
t2              res     1
data1
op              res     1
parm1
op2             res     1
count1
adr             res     1
ackbit1
reg1            res     1
scl_bit
reg2            res     1
devAddr
reg3            res     1
status
reg4            res     1
plx
reg5            res     1
ply
reg6            res     1
plz
reg7            res     1
reg8            res     1
reg9            res     1
reg10           res     1
reg11           res     1
lmm_jmpret      res     1
lmm_jmpnz       res     1
lmm_djnz        res     1
lmm_jmpnc       res     1
lmm_jmpz        res     1
lmm_jmpc        res     1
loop            res     1
This gets all the interesting offsets into that object.
@Brad, a bytecode($3C) command will work great for this application. I was planning on doing some self-modifying code to insert the $3C, but you already have a way to do it. Now I just need to figure out how the Spin interpreter handles parameters so I can pass the address of the LMM PASM routine to it.
Ray has a good example above. The easy way to do it is to look at the list files to see how parameters are pushed by the bytecode, and then follow the source to see how they are popped by the interpreter. Because Bytecode() always evaluates to constants, something like
Bytecode($39, (@fred>>8) & $FF, @fred & $FF, $3C) is quite easy.
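To make the byte layout of that Bytecode() call concrete, here is a small Python model of the four bytes it emits. It assumes, as the example implies, that $39 is followed by a two-byte constant with the high byte first and that $3C is the otherwise unused opcode taken over for LMM entry. The function name is made up for illustration.

```python
def push_addr_then_lmm(addr: int) -> bytes:
    """Model of Bytecode($39, (@fred>>8) & $FF, @fred & $FF, $3C).

    Assumption: $39 pushes the following two bytes as a constant,
    high byte first; $3C is the opcode reused to enter LMM code.
    """
    if not 0 <= addr <= 0xFFFF:
        raise ValueError("hub address must fit in 16 bits")
    return bytes([0x39, (addr >> 8) & 0xFF, addr & 0xFF, 0x3C])


if __name__ == "__main__":
    # Print the emitted bytes for a sample hub address.
    print(push_addr_then_lmm(0x1234).hex(" "))
```

Only the byte order is being shown here; how the interpreter pops the value is still something to confirm from the interpreter source.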
I still need to figure out how to push a variable onto the stack from Spin, but I should be able to figure that out by looking at the bytecode for a function call.
I moved the sqrt, strsize and strcomp functions to Hub RAM, and I execute them as LMM PASM code for now. I'll eventually execute them as FCACHE loops. I'm currently moving the *move and *fill functions to Hub RAM. These will be done as tight loops in FCACHE as well.
Cluso, I like your idea of adding a second opcode byte after $3C. This will allow for an additional 256 opcodes. I'll take a look at your overlay loader in the OBEX.
Timmoore said...
you can also use @@@ to get rid of the +16 to get hub addresses
I'm not familiar with the @@@ operator. I'll look into it. I do know that @@ will add the object offset to a value, but that's done at run time. I realized today that relative jumps can be implemented in LMM PASM code just by manipulating the pc. I implemented strsize with the following code.
strsize0        mov     a,x
                rdbyte  t1,a wz
                add     a,#1
        if_nz   sub     lmm_pc,#12      'Jump back two instructions if not zero
                sub     x,a
                jmp     #notx
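The relative jump is easier to follow when stepped through. Below is a rough Python model of the six instructions above. It assumes the LMM kernel has already advanced lmm_pc past the fetched long before the instruction executes, and that the notx routine (not shown in this thread) complements x and returns it. Both are assumptions, as is the function name.

```python
MASK = 0xFFFFFFFF  # cog registers are 32 bits wide


def run_strsize(hub: bytes, start: int) -> int:
    """Step through the strsize0 LMM routine on a zero-terminated string."""
    x, a, zero = start, 0, False
    pc = 0                                # byte offset into the routine
    while True:
        index = pc // 4
        pc += 4                           # kernel advances pc before executing
        if index == 0:                    # mov a,x
            a = x
        elif index == 1:                  # rdbyte t1,a wz
            zero = hub[a] == 0
        elif index == 2:                  # add a,#1 (no wz, Z is unchanged)
            a = (a + 1) & MASK
        elif index == 3:                  # if_nz sub lmm_pc,#12
            if not zero:
                pc -= 12                  # 16 - 12 = 4, back at the rdbyte
        elif index == 4:                  # sub x,a  -> -(length + 1)
            x = (x - a) & MASK
        else:                             # jmp #notx, assumed to return NOT x
            return ~x & MASK


if __name__ == "__main__":
    print(run_strsize(b"hello\x00", 0))
```

Subtracting 12 lands on the rdbyte because lmm_pc already points one long past the sub when it executes. The final complement turns -(length + 1) into length.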
Here's an example: div and sqr are running from lmm, and $3C allows inline asm. The example is Bean's txbyte. Both the inline asm address and the parameter for txbyte are passed as Spin parameters.
I am using @@@ from bst to get the hub address of the lmm asm routines, and bytecode from bst.
Some of the lmm code is from hippy's example from a while ago.
Bean, I finally got a chance to look at your code, and I was puzzled by the use of the djnz in the LMM PASM code. It seemed like the address would be executed if the djnz did not jump but continued on. Then I realized that the top half of the address long is zero, so this is executed as a NOP. That's a nice trick.
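The NOP trick can be checked with a few lines of Python. In a Propeller 1 instruction long the condition field sits in bits 21..18, and a condition of %0000 means the instruction never executes. A hub address fits in 16 bits, so those bits are always zero. The helper names below are made up for illustration.

```python
def condition_field(instruction: int) -> int:
    """Return bits 21..18 of a Propeller 1 instruction long."""
    return (instruction >> 18) & 0xF


def executes_as_nop(instruction: int) -> bool:
    """A condition of %0000 means 'never', so the long behaves as a NOP."""
    return condition_field(instruction) == 0


if __name__ == "__main__":
    hub_address = 0x7FFC      # any 16-bit hub address stored after the djnz
    jmp_to_loop = 0x5C7C0020  # encoding of jmp #$020, condition %1111
    print(executes_as_nop(hub_address), executes_as_nop(jmp_to_loop))
```

So when the djnz falls through, the address long that follows is fetched and skipped at the cost of one LMM fetch.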
Tim, I looked at your code and saw how the @@@ is used. That's a nice feature in BST. I looked at your TransmitByte routine and I don't understand why you have two calls to the $3C pseudo-op. It looks like the first call passes the value of char, and the second call passes the value of result. Is that correct, and why do you do that?
Is there a description of the $C7 pseudo-op somewhere? I suppose I could look at the Spin interpreter code to see what it's doing. How does it distinguish between an address offset (@txbyte) and a local stack offset? Does the $C7 push the sum of a one-byte offset plus the object offset? I guess a $6x pushes the value located relative to the "result" location.
Dave
|===========================================================================|
Spin Block TransmitByte with 1 Parameters and 0 Extra Stack Longs. Method 2
PUB TransmitByte(char)
Local Parameter DBASE:0000 - Result
Local Parameter DBASE:0004 - char
|===========================================================================|
62 result := bytecode($C7, @txbyte,$64,$3C)
Addr : 08E2: C7 10 : Memory Op Long PBASE + ADDRESS Address = 0010
Addr : 08E4: 64 : Variable Operation Local Offset - 1 Read
Addr : 08E5: 3C : Unused
Addr : 08E6: 61 : Variable Operation Local Offset - 0 Write
63 result := bytecode($C7, @txbyte,$60,$3C)
Addr : 08E7: C7 10 : Memory Op Long PBASE + ADDRESS Address = 0010
Addr : 08E9: 60 : Variable Operation Local Offset - 0 Read
Addr : 08EA: 3C : Unused
Addr : 08EB: 61 : Variable Operation Local Offset - 0 Write
Addr : 08EC: 32 : Return
Post Edited (Dave Hein) : 6/26/2010 3:22:17 PM GMT
Dave, the 2nd call was testing returning values to Spin, so it's not needed.
I have basically been writing Spin code similar to what I need and seeing what bytecodes I end up with.
$C7 looks good for the lmm address. It takes the address from the object table and adds an offset (up to 255), so I found it works even when the hub address is > 256.
$60, $64, $68 read a local variable: result, parameters, then local variables (left to right, with result being 0), up to 7 variables. I don't know what happens after that. Basically bits 4..2 are the variable number.
$6F gets the address of a local variable, in this case the address of the 3rd variable (result is 0); again bits 4..2 are the variable number.
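A short Python decoder shows the bit layout described above. The variable number comes from bits 4..2, as stated. Reading the low two bits as 0 = read, 1 = write and 3 = address is inferred from the listing ($60/$64 read, $61 write) and from $6F. The meaning of the value 2 is not covered in this thread, so it is labelled as unknown here. The function name is made up.

```python
# Low two bits of a $60..$7F local-variable bytecode, as inferred from
# the listing in this thread. Value 2 is not shown there.
LOCAL_OPS = {0: "read", 1: "write", 2: "unknown (not shown in thread)", 3: "address"}


def decode_local_op(bytecode: int):
    """Split a $60..$7F bytecode into (variable number, operation)."""
    if bytecode & 0xE0 != 0x60:
        raise ValueError("not a $60..$7F local-variable bytecode")
    variable = (bytecode >> 2) & 0x7   # bits 4..2, result is variable 0
    operation = LOCAL_OPS[bytecode & 0x3]
    return variable, operation


if __name__ == "__main__":
    for code in (0x60, 0x61, 0x64, 0x68, 0x6F):
        print(hex(code), decode_local_op(code))
```

With result counted as variable 0, $64 reads the first parameter (char in TransmitByte) and $61 writes result, which matches the listing.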
One of the things I have realized is that since the parameters are on a stack, the parameters should be in the opposite order.
This is what I am using in my driver that uses inline pasm. The lmm address is the last parameter. The lmm pops this off the stack and starts executing it; the inline pasm code then pops the rest of the parameters off the stack. You want it this way so the number of parameters can vary depending on what you are calling.
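Here is a minimal Python sketch of why pushing the lmm address last lets the argument count vary. The two routines, their addresses, and the dictionary standing in for hub memory are all hypothetical. Only the ordering idea comes from the post.

```python
def enter_lmm(stack: list, routines: dict):
    """Model of $3C: pop the routine address (pushed last), then run it."""
    address = stack.pop()
    return routines[address](stack)


def tx_byte(stack: list):
    """Hypothetical routine taking one argument."""
    char = stack.pop()
    return ("tx", char)


def write_reg(stack: list):
    """Hypothetical routine taking two arguments, popped in reverse order."""
    value = stack.pop()
    reg = stack.pop()
    return ("write", reg, value)


# Hypothetical hub addresses for the two routines.
ROUTINES = {0x0100: tx_byte, 0x0140: write_reg}

if __name__ == "__main__":
    print(enter_lmm([0x41, 0x0100], ROUTINES))    # one argument
    print(enter_lmm([7, 99, 0x0140], ROUTINES))   # two arguments
```

The entry code never needs to know how many arguments follow. Each routine pops exactly what it expects.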
By the way, what I am working on is an I2C driver. I started a while ago looking at writing an I2C driver in LMM but running on a separate cog. This would allow me to keep the different device drivers separate and load up the ones I want, but run them faster than Spin. Since I only need 400KHz, the slowdown of LMM isn't a problem. Dave, with your idea I changed the approach and am trying to write the I2C drivers using inline pasm just for the parts that need to be faster, e.g. the device init etc. is in Spin, and the device access for a main sensor loop is in inline asm.
Here's a working I2C device driver for the HMC5843 using inline pasm for the main sensor read. Init, calibration, etc. are in Spin, as well as filtering the sensor readings, etc.
The demo reads the sensors using both Spin and inline pasm; the inline pasm (really a mix of pasm and Spin) is ~3.8x faster than the full Spin code.
Reading the sensor takes 5.7ms in Spin and < 1.5ms in Spin/pasm.
The device driver is in a separate object, so you can't address the labels etc. in the asm in the main code directly.
Thought you may be interested in me posting the latest Spin Bytecode document. It is a conglomeration of Hippy's original research, Chip's code and my findings.
That document is very useful. I don't understand the nomenclature for the extra bytes, but I'll read it over a few times to see if it makes sense. Can the register opcode $3F be used to write values into the cog's memory? Is that how you write patches into the Spin interpreter? Do you have some sample code for doing that?
Dave:
Here is the link to the code which places my zero-footprint debugger into a running version of the ROM Interpreter. (This uses my debugger in mode -3.)
The ROM Interpreter addresses (in cog) are known (and fixed). If you use a modified RAM version of the interpreter, these addresses will change.
http://forums.parallax.com/showthread.php?p=748420 see the post beginning with... Release v0.275 for SPIN (mode -3: interrupts the Rom Interpreter). Download the two ...Spin3_275.spin programs.
In ClusoDebuggerSpin3_275.spin look at the routine...
PRI FindAndPatch | ix, pb, vb, id, tf
'==========================================
'locate the footprint in hub ram
  tf := 0                                 'set true counts =0
  pb := word[$000C]   'PINIT              'start looking from Init Prog Counter
  vb := word[$0008]   'VBASE              'stop at Variable Base
  repeat ix from pb to (vb - 1)
    id := byte[ix+3] <<24 | byte[ix+2] <<16 | byte[ix+1] <<8 | byte[ix]         'little endian!
    if id == $A0E0433A
      tf++
      quit
'==========================================
  id := byte[ix+6] <<8 | byte[ix+9]
  if id == $9495
    tf++
  id := byte[ix+13] <<24 | byte[ix+16] <<16 | byte[ix+30] <<8 | byte[ix+33]
  if id == $B4B5B4B5
    tf++
  id := byte[ix+37] <<24 | byte[ix+36] <<16 | byte[ix+35] <<8 | byte[ix+34]     'little endian!
  if id == $A0E0433A
    tf++
'==========================================
  if tf == 4
    fdx.str(string("i) Footprint found at "))
    fdx.hex(ix,4)
  else
    fdx.str(string("e) Footprint NOT FOUND"))
    fdx.tx($0D)
    REBOOT                                'NOT FOUND so reboot propeller!!
'==========================================
'all found so patch
  byte[ix+6]  := $82                      '        := [$1E2]
  byte[ix+9]  := $83                      '        := [$1E3]
  byte[ix+13] := $A2                      ' [$1E2] :=
  byte[ix+16] := $A3                      ' [$1E3] :=
  byte[ix+30] := $A2                      ' [$1E2] :=
  byte[ix+33] := $A3                      ' [$1E3] :=
'==========================================
  fdx.str(string(", patch installed."))
This routine looks for the routine "RunOnce"
PRI RunOnce ( xa, xb ) | safe1, safe2
'==========================================
  result := $43E0A0                       'footprint
'==========================================
  safe1 := OUTA                           '\ save instructions at $1E2
  safe2 := OUTB                           '/                    & $1E3
  OUTA := xa                              '\ $1E2 <--- place 2x instructions in here
  OUTB := xb                              '/ $1E3 <--- place 2x instructions in here
  result := LookUp( 1 : 1..3 )            '> Run instructions at $1E2-$1E3
  OUTA := safe1                           '\ restore instructions at $1E2
  OUTB := safe2                           '/                       & $1E3
'==========================================
  result := $43E0A0                       'footprint
'==========================================
Presuming it is found, the routine "FindAndPatch" has ix set to point to the "RunOnce" routine address in hub (note the result := has the bytecode $3A). "FindAndPatch" then patches parts of the hub routine "RunOnce" and then returns. "RunOnce" has not yet been executed.
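The footprint search can be modelled in a few lines of Python to show why the constant $43E0A0 is matched as the long $A0E0433A. The bytecode $3A is followed by the three constant bytes $43, $E0, $A0, and reading those four bytes as a little-endian long gives the value FindAndPatch compares against. The function names and the sample image below are made up for illustration.

```python
# Bytes compiled for "result := $43E0A0": bytecode $3A, then the constant.
FOOTPRINT = bytes([0x3A, 0x43, 0xE0, 0xA0])


def little_endian_long(hub: bytes, ix: int) -> int:
    """Assemble four hub bytes the way the FindAndPatch loop does."""
    return int.from_bytes(hub[ix:ix + 4], "little")


def find_footprint(hub: bytes, start: int, stop: int):
    """Return the first index in [start, stop) where the footprint begins."""
    for ix in range(start, stop):
        if little_endian_long(hub, ix) == 0xA0E0433A:
            return ix
    return None


if __name__ == "__main__":
    image = bytes(10) + FOOTPRINT + bytes(4)   # made-up hub image
    print(find_footprint(image, 0, len(image)))
```

The real routine also checks the OUTA/OUTB register bytes and the second footprint before it patches anything. This sketch covers only the first match.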
Next the debugger is started.
Finally, the routine "StartTrace" is run, which runs the patched routine "RunOnce" 3 times to install the patches into the Interpreter which takes control of the Interpreter.
PRI StartTrace : okay | x1, x2
'==========================================
'this will install the debug code into $1F0-$1F3 and start it
  x1 := long[D_BOOT]                      '\ copy code to $1F0..$1F3
  x2 := long[D_BOOT + 4]                  '|
  RunOnce ( x1, x2 )                      '|
  x1 := long[D_BOOT + 8]                  '|
  x2 := long[D_BOOT + 12]                 '|
  RunOnce ( x1, x2 )                      '/
  x1 := long[D_BOOT + 16]                 '\ now place jump to execute debugging
  x2 := 0                                 '|
  RunOnce ( x1, x2 )                      '/
'==========================================
  x1 := $11111111                         '<===============================
So, we are loading the LMM routine into $1F0-$1F3 by running 2 instructions at a time, 3 times, from cog $1E2-$1E3 of the interpreter cog. The first 2 instructions load $1F0-$1F1 & $1F2-$1F3 respectively, and the 3rd pair just uses the first to jump to $1F0 which executes the debugger.
You will obviously be loading a different set of code to run elsewhere.
OK, I see how that works. The $3F opcode does read and write cog memory, but only the 32 longs located from $1E0 to $1FF. It is normally used by the Spin compiler to access the special registers from $1F0 to $1FF. The range function in the Spin interpreter occupies the locations $1DF to $1E4, so this is patched with instructions to write other locations in cog memory, and it is called by using the lookup instruction. That's a cool trick.
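The patch bytes in FindAndPatch suggest how the byte after $3F selects a register. The Python sketch below is inferred only from the values shown in this thread ($94/$95 for reading OUTA/OUTB, $B4/$B5 for writing them, $82/$83 and $A2/$A3 for $1E2/$1E3). It treats the low five bits as an offset from $1E0 and bit 5 as the write flag. Other bit patterns are not covered here, and the function name is made up.

```python
def decode_reg_operand(operand: int):
    """Decode the byte that follows $3F, per the patch bytes in this thread.

    Assumption: bits 4..0 select one of the 32 longs at $1E0..$1FF and
    bit 5 set means write. Other operations are not modelled.
    """
    register = 0x1E0 + (operand & 0x1F)
    operation = "write" if operand & 0x20 else "read"
    return register, operation


if __name__ == "__main__":
    for operand in (0x94, 0x95, 0xB4, 0xB5, 0x82, 0x83, 0xA2, 0xA3):
        register, operation = decode_reg_operand(operand)
        print(f"${operand:02X} -> {operation} ${register:03X}")
```

Read this way, the patch simply changes the register offset from $14/$15 (OUTA at $1F4, OUTB at $1F5) to $02/$03, so the same Spin assignments land in $1E2 and $1E3.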
As an update, I switched 2 more i2c drivers to use inline pasm so I have HMC5843 (tri-axis compass), ADXL345(tri-axis accelerometer) and ITG-3200(tri-axis gyro). I integrated this into the main sensor loop for my Quadcopter. The sensor reading loop has gone from ~44 times/sec to ~120 times/sec. And I still have the pressure sensor to change to inline pasm.
The lmm spin interpreter code I use hasn't changed from above.
So I can on average speed up I2C by 3.3-5.6x (it's not higher because it still needs delays to slow the i2c clock speed down to 400KHz) without using more cogs. The cost is memory: adding the Spin interpreter costs ~2K bytes of hub memory. I might be able to reuse that memory as stack space for other cogs, though I am already doing this for all other Spin cogs, i.e. using the hub space of pasm cogs as stack.
The other problem I haven't solved yet is that the core i2c routines (start, stop, read/write) get copied into each i2c object, using more hub memory. I could move them into the Spin interpreter at the cost of moving other stuff out to lmm, but the other stuff gets more speed benefit from running in the cog than the i2c routines do. Or I could collapse the i2c objects together, but I was trying to avoid that.
So the first location after the jmp to _lmm_call contains the address to store the return address·and the next location is the call address?
That's nicer than the way I did it. I stored the return address in x, and the called routine needed to save it if it called another routine. Yours comes at the cost of 5 longs, though; relying on the called routine to save it means I only added 1 instruction to the normal jmp version.
I found if you are using bytecode for the lmm address, there are 2 forms of $C7 for the lmm routine address
result := bytecode($68, $64, $C7 , @lmmgetdata1, $3c)
this works if the offset of lmmgetdata1 is < $7f from the start of the object. If it is more then the 2nd form is useful
Tim: I was thinking that the I2C routine would be PASM and loaded into 1 cog. It may be able to sample all 4 sensors (compass, accel, gyro, pressure) and just place the results into hub for the processing to be done by another cog(s) in PASM as well. Of course the code for processing is easier in spin. Perhaps this could be a mix using the LMM inline code to speedup the maths section. BTW in my faster interpreter, the maths section gets the most gain in speed due to the faster decode.
What cogs are you using for what in the Quad? If you are short of space, what is consuming hub memory?
1. Current way, all in spin, reading the 4 sensors - compass, gyro, accelerometer and pressure takes ~22ms, so 44/sec
2. All in PASM. This takes another cog, but I do have 2 spare. It means I have to put all the pasm code together in 1 file, so the code isn't really usable for other projects. I didn't really want to take this approach; I like having separate objects that can be reused in other projects.
3. PASM cog using LMM (this was the approach I was taking), load up lmm kernel in a cog, the pasm device drivers are in other objects and are loadable into the lmm cog.
4. Inline pasm. I have converted 3 of the 4 drivers (the 4th is converted but still being tested). This doesn't take another cog. and with 3 drivers using inline pasm the time to read the sensors is 8.125ms so I can read the sensors ~123/sec.
I have integrated method 4 into the Quad code, I can ifdef between the 2 methods and the IMU looks stable. Working on the pressure sensor driver now.
I think the 4th approach works well, I can put some stuff in inline pasm and have other code in spin. e.g. I read the device using pasm but the filtering on the output is spin code.
The only downside is ~4Kbyte added to the code for the lmm/spin interpreter and inline pasm. I haven't optimized the code, and I expect to remove some duplication, which will remove ~1.5Kbytes. I have the space for it, though if I want to support a video overlay driver it's going to be very close.
Update: I just integrated the pressure sensor with inline pasm into the Quad code, the sensor read loop time is now 6.5ms so I can read all the sensors ~150/sec.
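The loop rates quoted above follow directly from the loop times. A quick Python check, using only the figures from these posts (the helper names are made up):

```python
def reads_per_second(loop_ms: float) -> float:
    """Convert a sensor-loop time in milliseconds to loops per second."""
    return 1000.0 / loop_ms


def speedup(before_ms: float, after_ms: float) -> float:
    """Ratio of the old loop time to the new one."""
    return before_ms / after_ms


if __name__ == "__main__":
    for label, ms in (("all Spin, 4 sensors", 22.0),
                      ("3 drivers inline pasm", 8.125),
                      ("4 drivers inline pasm", 6.5)):
        print(f"{label}: {reads_per_second(ms):.1f} reads/sec")
    print(f"overall speedup: {speedup(22.0, 6.5):.2f}x")
```

The all-Spin time is quoted as approximately 22ms, so the 44/sec figure corresponds to a loop slightly longer than 22ms; the other rates match the quoted times to within rounding.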
Post Edited (Timmoore) : 6/28/2010 12:29:54 AM GMT
Tim: Ok, nice work. Well, if you get stuck for space I can interrupt the existing rom interpreter and replace one or more routines with an LMM machine. That would save the hub space for the interpreter, less a bit of overhead. I will wait and see how you go anyway.
Cluso99, thanks. I thought about your work. I will need to think about it a bit. Currently I insert the lmm and spare space between the interpreter initialization code and the main loop. This makes it easy for lmm code in other objects to know the entry points to the lmm, the loop and the spare variable space.
If I patch the ROM interpreter I can't move the code around, so the offsets will be spread out. It's doable but messy, and all the offsets would change with any change to the patches.
Here's a running lmm with 4 i2c drivers using inline lmm pasm.
The drivers are written as separate objects. They do their initialization in Spin and have a routine that does the core sensor reading in inline pasm. The drivers are averaging 45-50 longs for the inline pasm. The core i2c pasm routines (start/stop, etc.) are split into another object, i2clmm.spin, with ~160 longs of inline pasm.
The lmm moves sqr, div, locknew and coginit out to lmm code to gain space in the Spin interpreter.
The lmm has functions used by the inline pasm code: jmp to hub address, call to hub address, call via hub jump table.
The lmm is entered via bytecode $3C. Parameters can be passed into, and output returned from, the inline pasm routines via the Spin stack.
I am getting an average speed-up for i2c of 3x. It was faster, but I lost ~10% in speed when optimizing for memory usage, removing ~400 longs.
Cluso, Tim and Bean, I finally had a chance to look at your code from the past few days. The calling and return code looks good. It seems like there are a number of ways this can be implemented. Another way to do it would be to put the return address on the stack. Or maybe we could use Tim's technique with a dedicated return address register, and push it to the stack only if calling another routine.
The jump table code looks very interesting. That could provide a substantial speedup in a lot of cases.
I've spent the past few days trying to figure out how to patch the LMM interpreter into running cog code instead of reloading it with a new image. I ran into a few problems, but I finally got it to work. The addresses $1E0 to $1EF are accessible with the $3F opcode. I use a technique similar to Hippy's and Cluso's, but instead of using only $1E2 and $1E3 I write a 5-long LMM interpreter into $1E0 to $1E3 and $1E6. $1E6 contains a constant labelled "wrtop", which isn't referenced directly in the Spin interpreter, so it seems to be safe to use.
I execute the interpreter with the lookup function. It executes an LMM PASM routine that writes the permanent LMM interpreter into the sqrt/boolean not space. When this is done I jump to the permanent interpreter and run another LMM PASM routine that restores the five locations at $1E0. It continues execution at $1E0, and the Spin interpreter doesn't know what happened to it.
I put all the LMM interpreter code in the lmm_interp.spin object. It's just the basic LMM interpreter with only the FJMP pseudo-op. It doesn't have any extra registers, but the Spin interpreter x, y, a, t1, t2, and other registers are available. The interpreter is patched by calling the start method, and LMM PASM code is executed by calling the run method with the address of the PASM code and one parameter. These are pushed on the stack, and are popped into the x and y registers.
The attached demo code implements a serial output, serial input and a function to read the cog's registers all in one cog. The ReadReg method is interesting because it only requires two PASM instructions, and I load and execute it from the stack.
My next step is to implement more pseudo-ops, such as the subroutine call and a cache routine, and see if I can speed up the sqrt and other instructions that are running as LMM PASM.
Dave
Post Edited (Dave Hein) : 6/29/2010 7:11:28 PM GMT
Very nice, I will be looking at using this with my lmm if you don't mind.
Some suggestions
1. Look at the interpreter code for jA/B: this is coginit, locknew/lockset/lockclr. These are slow/uncommon, so I don't see a reason to attempt to cache this for execution; just using lmm should be good enough, and it gives you a fair bit of space, 19 longs.
2. I would look at switching the order of arguments to $3C: put @txcode after value. The arguments are pushed onto the stack, so the last is first on the stack. That means you can write lmm code with differing numbers of arguments; the lmm address is always first, followed by the other arguments.
3. Have you thought about reloading the Spin code from the ROM? E.g. when sqr is called, rather than executing lmm code, reload the code from ROM, execute it and then patch it back. It's probably not worth it if you want to have code there, since you have to load it twice, but for lmm variable space it might be worth it.
The jump table was mostly to make lmm objects manageable. You don't have an easy way to get the hub address of lmm code in other objects except by calling a Spin function and having it return the hub address, so rather than lots of helper functions that just return an lmm address, I made the helper return a jump table.
Initially I had the jump table access in lmm code, which meant there was a lot of duplicate jump table code. Moving it into the lmm kernel removed the duplicate code and approximately doubles the speed of doing a jump table call.
@Cluso, do you have a link for the faster sqrt code? I would like to incorporate it into the LMM PASM interpreter. It would also be good to have a link to the compiled Spin interpreter listing. Is this any different than just getting a listing from BST?
I have a question about one line of code in the Spin interpreter. A few lines after the "push" label, and just before the "jmp #loop" instruction, there is a test instruction that writes the zero flag. However, the code at the top of the loop ignores the zero flag. Do you know the reason for the test instruction? It seems to be a wasted instruction.
@Tim, I agree that the jA/B code is a good candidate for moving to LMM PASM. I decided to use this space for the FCACHE memory. Three longs are used to jump to the jA/B LMM PASM code, which leaves 16 longs for running small loops. I also moved the strsize/strcomp section to hub RAM, and I implemented the FCACHE loader there. I plan on implementing the FCALL function there as well.
I intended to put the code address on the stack last, but I got it backwards. I will change the order to the way you suggested.
I plan on running the loops in the FCACHE. I've already done this with the strsize and strcomp routines. strsize should run a bit faster than the current Spin interpreter's version because I separated it from the strcomp loop. The new strsize loop is only 3 longs, and there will be no hub stalls, versus the current loop that is 9 longs.
I hope to post an updated version of the interpreter soon.
Dave
Post Edited (Dave Hein) : 6/30/2010 3:05:04 AM GMT
Thanks for the info on the mathops. It makes sense now.
It would save a few longs if I put the code at the beginning of the top object. However, I would like to put all of the LMM stuff in its own object, which would mean that the addresses will most likely be greater than $1FF.
result := bytecode($C7, @txbyte,$64,$3C)
As long as txbyte is in the first 256 bytes of hub, then if you call, at the start of your lmm code,
call    #popyx
x contains @txbyte and y contains char.
Updated to use @@@ in more places.
Post Edited (Timmoore) : 6/26/2010 3:47:57 AM GMT
I could use that in PropBasic LMM instead of always adding the offset.
[Edit] Oh, I see that the Parallax Propeller tool doesn't support '@@@' :(
Bean.
Post Edited (Bean) : 6/26/2010 1:10:32 PM GMT
Post Edited (Timmoore) : 6/26/2010 4:56:19 PM GMT
Updated the attachment to fix a few bugs
Post Edited (Timmoore) : 6/26/2010 8:39:40 PM GMT
Thanks,
Dave
Here is the link to the code which places·my zero-footprint debugger into a running version of the ROM Interpreter. (This uses my debugger in mode -3).
The ROM Interpreter addresses (in cog)·are known (and fixed). If you use·a modified ram version of the interpreter, these addresses will change.
http://forums.parallax.com/showthread.php?p=748420·see the post beginning with... Release v0.275 for·SPIN (mode -3: interrupts the Rom Interpreter). Download the two ...Spin3_275.spin programs.
In ClusoDebuggerSpin3_275.spin look at the routine...
This routine looks for the routine "RunOnce".
Presuming it is found, the routine "FindAndPatch" has ix set to point to the "RunOnce" routine address in hub (note the result:= has the bytecode $3A). "FindAndPatch" then patches parts of the hub routine "RunOnce" and then returns. "RunOnce" has not yet been executed.
Next the debugger is started.
Finally, the routine "StartTrace" is run, which runs the patched routine "RunOnce" 3 times to install the patches into the Interpreter; these patches take control of the Interpreter.
So, we are loading the LMM routine into $1F0-$1F3 by running 2 instructions at a time, 3 times, from cog $1E2-$1E3 of the interpreter cog. The first 2 pairs load $1F0-$1F1 and $1F2-$1F3 respectively, and the 3rd pair just uses its first instruction to jump to $1F0, which executes the debugger.
You will obviously be loading a different set of code to run elsewhere.
Post Edited (Cluso99) : 6/27/2010 7:09:46 AM GMT
The lmm spin interpreter code I use hasn't changed from above.
So I can speed up I2C by 3.3-5.6x on average (it's not higher because it still needs delays to slow the I2C clock down to 400 kHz) without using more cogs. The cost is memory: the Spin interpreter adds ~2K bytes to hub memory, though I might be able to reuse that memory as stack space for other cogs. I am already doing this for all the other Spin cogs, i.e. using the hub space of PASM cogs as stack.
The other problem I haven't solved yet is that the core I2C routines (start, stop, read/write) get copied into each I2C object, using more hub memory. I could move them into the Spin interpreter at the cost of moving other stuff out to LMM, but the other stuff gets more speed benefit from running in the cog than the I2C routines do. Or I could collapse the I2C objects together, but I was trying to avoid that.
Only 14 COG locations for the LMM execution code.
Bean
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Use BASIC on the Propeller with the speed of assembly language.
PropBASIC thread http://forums.parallax.com/showthread.php?p=867134
March 2010 Nuts and Volts article http://www.parallax.com/Portals/0/Downloads/docs/cols/nv/prop/col/nvp5.pdf
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
There are two rules in life:
· 1) Never divulge all information
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
If you choose not to decide, you still have made a choice. [RUSH - Freewill]
That's nicer than the way I did it, though at the cost of 5 longs. I stored the return address in x, and the called routine needed to save it if it called another routine. Relying on the called routine to fix it means I only added 1 instruction to the normal jmp version.
I found that if you are using bytecode for the LMM address, there are 2 forms of $C7 for the LMM routine address.
The first form works if the offset of lmmgetdata1 is < $7f from the start of the object. If it is more, then the 2nd form is useful.
If the top bit of the address is set, then the address is extended.
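A rough Python model of the short/extended address idea, for anyone following along. This is only a sketch: it assumes the short form is a single byte with the top bit clear, and the extended form is two bytes where the first byte has the top bit set and carries the high 7 bits of a 15-bit offset. The function names are made up for illustration, and the exact boundary should be checked against the interpreter source.

```python
def encode_offset(offset):
    """Encode an object-relative offset in 1 or 2 bytes (assumed layout)."""
    if not 0 <= offset <= 0x7FFF:
        raise ValueError("offset must fit in 15 bits")
    if offset < 0x80:
        # Short form: one byte, top bit clear.
        return bytes([offset])
    # Extended form: top bit set, high 7 bits first, then the low byte.
    return bytes([0x80 | (offset >> 8), offset & 0xFF])


def decode_offset(data):
    """Return (offset, number of bytes consumed) under the same assumption."""
    first = data[0]
    if first & 0x80 == 0:
        return first, 1
    return ((first & 0x7F) << 8) | data[1], 2
```

Under these assumptions, moving a routine past the short-form limit makes the operand one byte longer, which is why hand-written bytecode needs the second form.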
Post Edited (Timmoore) : 6/27/2010 11:36:08 PM GMT
What cogs are you using for what in the Quad? If you are short of space, what is consuming hub memory?
1. Current way, all in Spin: reading the 4 sensors (compass, gyro, accelerometer and pressure) takes ~22 ms, so ~45/sec.
2. All in PASM. This takes another cog, but I do have 2 spare. It means I have to put all the PASM code together in 1 file, so the code isn't really reusable in other projects. I didn't really want to take this approach; I like having separate objects that can be reused in other projects.
3. PASM cog using LMM (this was the approach I was taking): load up an LMM kernel in a cog; the PASM device drivers are in other objects and are loadable into the LMM cog.
4. Inline PASM. I have converted 3 of the 4 drivers (the 4th is converted but still being tested). This doesn't take another cog, and with 3 drivers using inline PASM the time to read the sensors is 8.125 ms, so I can read the sensors ~123/sec.
I have integrated method 4 into the Quad code. I can ifdef between the 2 methods, and the IMU looks stable. Working on the pressure sensor driver now.
I think the 4th approach works well: I can put some stuff in inline PASM and have other code in Spin, e.g. I read the device using PASM but the filtering on the output is Spin code.
The only downside is ~4 Kbytes added to the code for the LMM/Spin interpreter and inline PASM. I haven't optimized the code, and I expect to remove some duplication, which will remove ~1.5 Kbytes. I have the space for it, though if I want to support a video overlay driver it's going to be very close.
Update: I just integrated the pressure sensor with inline PASM into the Quad code. The sensor read loop time is now 6.5 ms, so I can read all the sensors ~150/sec.
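The loop times quoted above convert to read rates and speedups with simple arithmetic. A small Python sketch, using only the three timings from this post (the labels and function names are mine):

```python
SPIN_LOOP_MS = 22.0  # all-Spin sensor loop time quoted above


def reads_per_second(loop_ms):
    """Sensor reads per second for a given loop time in milliseconds."""
    return 1000.0 / loop_ms


def speedup_vs_spin(loop_ms):
    """How many times faster a loop is than the all-Spin loop."""
    return SPIN_LOOP_MS / loop_ms


if __name__ == "__main__":
    timings_ms = {
        "all Spin": 22.0,
        "3 drivers in inline PASM": 8.125,
        "4 drivers in inline PASM": 6.5,
    }
    for label, ms in timings_ms.items():
        print(f"{label}: {reads_per_second(ms):.1f} reads/sec, "
              f"{speedup_vs_spin(ms):.2f}x vs Spin")
```

The 6.5 ms loop works out to roughly 3.4x the all-Spin rate, which is in line with the I2C speedups mentioned elsewhere in the thread.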
Post Edited (Timmoore) : 6/28/2010 12:29:54 AM GMT
If I patch the ROM interpreter I can't move the code around, so the offsets will be spread out. It's doable but messy, and all the offsets would change with any change to the patches.
The drivers are written as separate objects. They do their initialization in Spin and have a routine that does the core sensor reading in inline PASM. The drivers are averaging 45-50 longs for the inline PASM. The core I2C PASM routines (start/stop, etc.) are split into another object, i2clmm.spin, which is ~160 longs of inline PASM.
The LMM version moves sqr, div, locknew and coginit out to LMM code to gain space in the Spin interpreter.
The LMM kernel has functions used by the inline PASM code: jmp to hub address, call to hub address, call via hub jump table.
The LMM is entered via bytecode $3c. Parameters can be passed into and out of the inline PASM routines via the Spin stack.
I am getting an average speedup for I2C of 3x. It was faster, but I lost ~10% in speed when optimizing for memory usage (removing ~400 longs).
The jump table code looks very interesting. That could provide a substantial speedup in a lot of cases.
I've spent the past few days trying to figure out how to patch the LMM interpreter into running cog code instead of reloading it with a new image. I ran into a few problems, but I finally got it to work. The addresses $1E0 to $1EF are accessible with the $3F opcode. I use a technique similar to Hippy's and Cluso's, but instead of using only $1E2 and $1E3 I write a 5-long LMM interpreter into $1E0 to $1E3 and $1E6. $1E6 contains a constant labelled "wrtop", which isn't referenced directly in the Spin interpreter, so it seems to be safe to use.
I execute the interpreter with the lookup function. It executes an LMM PASM routine that writes the permanent LMM interpreter into the sqrt/boolean not space. When this is done I jump to the permanent interpreter and run another LMM PASM routine that restores the five locations at $1E0. It continues execution at $1E0, and the Spin interpreter doesn't know what happened to it.
I put all the LMM interpreter code in the lmm_interp.spin object. It's just the basic LMM interpreter with only the FJMP pseudo-op. It doesn't have any extra registers, but the Spin interpreter x, y, a, t1, t2, and other registers are available. The interpreter is patched by calling the start method, and LMM PASM code is executed by calling the run method with the address of the PASM code and one parameter. These are pushed on the stack, and are popped into the x and y registers.
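For readers new to LMM, the fetch-execute idea with a far-jump pseudo-op can be modelled in a few lines of Python. This is a conceptual sketch only, not the code in lmm_interp.spin: hub memory is a Python list, the "ADD" and "EXIT" operations are hypothetical stand-ins for real PASM instructions, and FJMP is assumed to take its target from the long that follows it.

```python
def run_lmm(hub, pc, regs, max_steps=1000):
    """Model of an LMM loop: fetch a long from hub, advance pc, execute it."""
    for _ in range(max_steps):
        op = hub[pc]      # models fetching the next instruction from hub
        pc += 1           # models stepping pc to the next long
        if op[0] == "FJMP":
            # Pseudo-op: the following long holds the new hub address.
            pc = hub[pc]
        elif op[0] == "ADD":
            # Stand-in for an ordinary instruction executed in the cog.
            regs[op[1]] += op[2]
        elif op[0] == "EXIT":
            # Hypothetical: hand control back to the Spin fetch loop.
            return regs
    raise RuntimeError("step limit reached")


# A tiny program: add 1, jump over the long at index 3, add 10, exit.
program = [
    ("ADD", "x", 1),
    ("FJMP",), 4,
    ("ADD", "x", 100),   # skipped by the far jump
    ("ADD", "x", 10),
    ("EXIT",),
]
```

Every ordinary instruction costs a hub fetch in this scheme, which is why small loops benefit from being copied into the cog and run there instead.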
The attached demo code implements a serial output, serial input and a function to read the cog's registers, all in one cog. The ReadReg method is interesting because it only requires two PASM instructions, and I load and execute it from the stack.
My next step is to implement more pseudo-ops, such as the subroutine call and a cache routine, and see if I can speed up the sqrt and other instructions that are running as LMM PASM.
Dave
Post Edited (Dave Hein) : 6/29/2010 7:11:28 PM GMT
Some suggestions:
1. Look at the interpreter code for jA/B. This is coginit and locknew/lockset/lockclr, which are slow and uncommon, so I don't see a reason to attempt to cache this for execution. Just using LMM should be good enough, and it gives you a fair bit of space: 19 longs.
2. I would look at switching the order of arguments to $3c: put @txcode after value. The arguments are pushed onto the stack, so the last is first on the stack. That means you can write LMM code with differing numbers of arguments; the LMM address is always first, followed by the other arguments.
3. Have you thought about reloading the Spin code from the ROM? E.g. when sqr is called, rather than executing LMM code, reload the code from ROM, execute it and then patch it back. It's probably not worth it if you want to have code there, since you have to load it twice, but for LMM variable space it might be worth it.
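The argument-order point in suggestion 2 can be seen with a plain stack model. This Python sketch is illustrative only: the function names and the per-routine argument-count table are hypothetical, and a Python list stands in for the Spin stack.

```python
def call_lmm(stack, address, *args):
    """Push the arguments first and the LMM code address last."""
    for value in args:
        stack.append(value)
    stack.append(address)  # pushed last, so it is popped first


def lmm_entry(stack, arg_counts):
    """Pop the address first, then as many arguments as that routine wants."""
    address = stack.pop()
    count = arg_counts[address]
    args = [stack.pop() for _ in range(count)]
    args.reverse()  # restore the original left-to-right order
    return address, args
```

Because the address always comes off the stack first, the entry code is the same no matter how many arguments follow; only the routine being called needs to know its own count.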
The jump table was mostly to make LMM objects manageable. You don't have an easy way to get the hub address of LMM code in other objects except by calling a Spin function and having it return the hub address, so rather than lots of helper functions that each just return an LMM address, I made the helper return a jump table.
Initially I had the jump table access in LMM code, which meant there was a lot of duplicate jump table code. Moving it into the LMM kernel removed the duplicate code and approximately doubles the speed of doing a jump table call.
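A minimal Python sketch of the jump table idea, under stated assumptions: the routine names, the index constants and the hub addresses are all made up for illustration, and a list stands in for the table that the helper would return.

```python
# Hypothetical routine indices shared by the caller and the driver object.
I2C_START, I2C_STOP, I2C_READ, I2C_WRITE = range(4)


def get_jump_table():
    """Hypothetical helper: one call returns every entry point of an object."""
    return [0x0400, 0x0420, 0x0450, 0x0490]  # made-up hub addresses


def call_via_table(table, index):
    """Model of the kernel's table lookup: index in, hub address out."""
    if not 0 <= index < len(table):
        raise IndexError("no such routine in jump table")
    return table[index]
```

The caller only has to fetch the table once per object, and the lookup itself lives in one place instead of being repeated in every piece of LMM code that makes a call.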
BTW, have you realised I published a compiled listing of Chip's Interpreter?
I didn't realise there was a listing. Do you have a pointer? A search didn't find it.
I have a question about one line of code in the Spin interpreter. A few lines after the "push" label, and just before the "jmp #loop" instruction, there is a test instruction that writes the zero flag. However, the code at the top of the loop ignores the zero flag. Do you know the reason for the test instruction? It seems to be a wasted instruction.
@Tim, I agree that the jA/B code is a good candidate for moving to LMM PASM. I decided to use this space for the FCACHE memory. Three longs are used to jump to the jA/B LMM PASM code, which leaves 16 longs for running small loops. I also moved the strsize/strcomp section to hub RAM, and I implemented the FCACHE loader there. I plan on implementing the FCALL function there as well.
I intended to put the code address on the stack last, but I got it backwards. I will change the order to the way you suggested.
I plan on running the loops in the FCACHE. I've already done this with the strsize and strcomp routines. strsize should run a bit faster than the current Spin interpreter's version because I separated it from the strcomp loop. The new strsize loop is only 3 longs, and there will be no hub stalls, versus the current loop that is 9 longs.
I hope to post an updated version of the interpreter soon.
Dave
Post Edited (Dave Hein) : 6/30/2010 3:05:04 AM GMT
Why do you use 3 longs to enter the LMM? I use this for the LMM code; just make sure the core LMM routines are in the first $1ff of hub memory.
It would save a few longs if I put the code at the beginning of the top object. However, I would like to put all of the LMM stuff in its own object, which would mean that the addresses will most likely be greater than $1FF.
Dave
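The $1FF limit discussed above lines up with the 9-bit source field of a PASM instruction, which can hold an immediate value from 0 to $1FF. Assuming that is the reason for the limit (the thread does not spell it out), a quick Python check for whether a set of entry points could be reached that way might look like this. The names and addresses in the example are hypothetical.

```python
SRC_FIELD_BITS = 9                          # PASM source field width
MAX_IMMEDIATE = (1 << SRC_FIELD_BITS) - 1   # $1FF


def fits_immediate(hub_address):
    """True if the address can be written as a 9-bit immediate operand."""
    return 0 <= hub_address <= MAX_IMMEDIATE


def check_entry_points(entry_points):
    """Map each (hypothetical) routine name to whether its address fits."""
    return {name: fits_immediate(addr) for name, addr in entry_points.items()}
```

For example, `check_entry_points({"lmm_core": 0x0040, "lmm_in_own_object": 0x0A00})` flags the second address as too large, which is the situation described when the LMM code lives in its own object.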