Inter COG calls in PASM.
heater
Posts: 3,370
My 6809 emulator has far too much PASM code to fit in a single COG. Currently it makes use of Cluso's overlay loader to pull in lumps of PASM from HUB as and when required. This is awfully slow.
What I want to do is move the code that is currently in overlays into one or more other COGs. I tried this with the Z80 emulator but the way I did it turned out to be very slow.
Normally PASM code is given some command number in a HUB variable that it reads and then calls an appropriate function.
I want to get rid of this command decoding by replacing the command numbers with actual COG addresses of the function to be invoked so that they may be used directly in jumps.
This is made a bit easier by the use of the @@@ operator in BST.
So in PASM we can call a function in some other COG with:
        mov     temp, #some_function    'Invoke some_function in slave COG
        wrlong  temp, cmd_ptr
:wait   rdlong  temp, cmd_ptr wz        'and wait for it to complete
  if_nz jmp     #:wait                  'slave clears the command to zero when done
In the slave COG we have a loop that will detect when its command is not zero and then use it as a jump to the correct routine:
get_cmd_1     rdlong  temp_1, cmd_ptr_1 wz    'Read a command from HUB
        if_nz jmp     temp_1                  'If set jump to the required function (no # here)
              jmp     #get_cmd_1
This means that the calling COG needs to know the address of the slave COG's command LONG in HUB. This we can get with:
cmd_ptr long @@@cmd 'Pointer to slave COG 1 command in HUB.
The calling COG also needs the address in COG of the required function in the slave, which it has anyway if the PASM for both COGs is in the same Spin file. The appropriate "org 0" statements have to be used to separate them.
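For anyone without the attachment, here is a minimal sketch of how the pieces above might sit together in one Spin file. It is not the attached test program: the clock settings, the start method, master_entry, slave_entry, zero_1 and the body of some_function are placeholders of mine, and it assumes BST so that @@@ yields an absolute HUB address.

```
CON
  _clkmode = xtal1 + pll16x                   'assumed clock setup, adjust for your board
  _xinfreq = 5_000_000

PUB start
  cognew(@slave_entry, 0)                     'slave COG polls the command long
  cognew(@master_entry, 0)                    'master COG makes one call

DAT
cmd           long    0                       'HUB command long, 0 = idle/done

              org     0                       'master image: COG addresses start at 0
master_entry  mov     temp, #some_function    'COG address of the routine in the slave image
              wrlong  temp, cmd_ptr           'post the call
:wait         rdlong  temp, cmd_ptr wz        'slave writes 0 when finished
        if_nz jmp     #:wait
:halt         jmp     #:halt                  'nothing more to do in this sketch
cmd_ptr       long    @@@cmd                  'absolute HUB address of cmd (BST only)
temp          res     1

              org     0                       'slave image: COG addresses start at 0 again
slave_entry
get_cmd_1     rdlong  temp_1, cmd_ptr_1 wz    'read a command from HUB
        if_nz jmp     temp_1                  'non-zero is a COG address, jump to it
              jmp     #get_cmd_1
some_function                                 'placeholder routine, real work goes here
              wrlong  zero_1, cmd_ptr_1       'signal completion to the caller
              jmp     #get_cmd_1
zero_1        long    0
cmd_ptr_1     long    @@@cmd
temp_1        res     1
```

Because both images live in the same DAT section, the assembler resolves #some_function to its COG address within the slave image, which is exactly the value the slave needs for its indirect jmp.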
Attached is a simple program to test and demonstrate this.
I'm fishing here for any suggestions to enhance performance or neaten it up.
What would be nice is to be able to split the different COGs' PASM into separate files, but then setting up the addresses of the command pointers and COG functions becomes a pain. Unless someone knows a sneaky way to do that linkage.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
Comments
Waitpeq will immediately start when it detects its "ID", so both COGs are in sync. On one side you can write to the port and on the other side you can read immediately. This way you don't lose any cycles because of HUB-RAM access on both sides.
The slave will then set the pins to its own ID while the master switches to input and uses waitpeq to wait for the slave. When the slave is done, it outputs the master's ID.
1. How many helper cogs do you have?
For 3 helper cogs...
The cog address for each helper is stored as a concatenation of 3 x 9bit cog addresses (for a jump) and 5 info bits (as I use in the faster spin interpreter)
e.g. xxxxx ccccccccc bbbbbbbbb aaaaaaaaa
where aaaaaaaaa = cog 1 address, bbbbbbbbb = cog 2 address, ccccccccc = cog 3 address, and xxxxx are flags
If a cog is not required, its field holds the address of its own command loop ("get_cmd_1" above).
So helper cog 1 would be faster, just doing a jmp indirect to the long fetched.
Cog 2 would need to do SHR temp, #9 first
Cog 3 would need to do SHR temp, #18 first
These commands could in turn be an overlay to be loaded. So for example, cog 3 may be a large overlay loader, with the 5 bits being an offset to the overlay to be loaded.
Once you have the method, setting up addresses can be done in spin before anything is started. See my debugger for that.
The above way would also allow for more than 1 helper cog to be active at once. A clear would be done by writing a byte or word back to hub by each used helper cog to indicate when complete.
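A rough sketch of what helper cog 2 might look like under this packing scheme. The names get_cmd_2, temp_2 and cmd_ptr_2 are made up for illustration, and it relies on my understanding that a jmp through a register only uses the low 9 bits of that register, so no masking is needed after the shift.

```
DAT
              org     0
' Helper COG 2: its 9-bit address is field b (bits 17..9) of the packed long
' laid out as xxxxx ccccccccc bbbbbbbbb aaaaaaaaa.
get_cmd_2     rdlong  temp_2, cmd_ptr_2       'fetch the packed command long from HUB
              shr     temp_2, #9              'bring field b down to bits 8..0
              jmp     temp_2                  'assumed: only bits 8..0 form the jump target

cmd_ptr_2     long    @@@cmd                  'HUB address of the shared long (BST @@@), cmd defined elsewhere
temp_2        res     1
```

When field b holds the address of get_cmd_2 itself, the jmp simply lands back on the rdlong, which is the "not required" case described above. Helper cog 1 would drop the shr, and helper cog 3 would use #18 instead of #9.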
MagIO2: our biggest problem with emulations like these is there is never any pins spare :-( It still does not overcome the fact that addresses (routine numbers if you like) have to be passed, so waiting for something to do via hub is about as fast as one can get.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Links to other interesting threads:
· Home of the MultiBladeProps: TriBlade,·RamBlade,·SixBlade, website
· Single Board Computer:·3 Propeller ICs·and a·TriBladeProp board (ZiCog Z80 Emulator)
· Prop Tools under Development or Completed (Index)
· Emulators: CPUs Z80 etc; Micros Altair etc;· Terminals·VT100 etc; (Index) ZiCog (Z80) , MoCog (6809)
· Search the Propeller forums·(uses advanced Google search)
My cruising website is: ·www.bluemagic.biz·· MultiBladeProp is: www.bluemagic.biz/cluso.htm
Cluso: Not sure how many COGs required yet. Somehow I knew the "stuff three addresses per LONG" trick would come up. But I'm not sure I understand where you are going with it here.
If we have three addresses for three helper COGs stuffed into one command LONG then two COGs have to shift the address prior to using it, as you say. What happens when the command is complete? The COGs now have to pack their "done status" (which could well be the address of "get_cmd") back into the command LONG in HUB. Which means a bunch of shifting and ORing.
Actually I don't see the point of the packing at all yet. But given that our emulator dispatch tables are laid out like that it must fit in somewhere.
Anyway, Cluso, you have provided a nice optimization here. That is: use the address of get_cmd for the "done" status rather than a zero. This reduces the slave COGs wait loop to just two instructions:
and gets rid of the requirement of defining zero in a COG long. Saves 3 LONGs in the example.
The caller COG's wait loop gains an instruction:
as it now has to compare the HUB command with something rather than just checking for zero. But if I understand correctly, that "cmp" does not cost any time, as it sits in a tight loop with rdlong and so execution speed is controlled by the HUB access windows anyway.
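The two loops described above could look roughly like this. This is a reconstruction from the description, not the code from the attachment; master_entry and the body of some_function are placeholders, and it assumes that a label used in a DAT long yields its COG address.

```
DAT
cmd           long    get_cmd_1               'HUB command long, idle value = COG address of get_cmd_1

              org     0                       'caller COG
master_entry  mov     temp, #some_function
              wrlong  temp, cmd_ptr           'post the call
:wait         rdlong  temp, cmd_ptr
              cmp     temp, #get_cmd_1 wz     'done once the slave has written get_cmd_1 back
        if_nz jmp     #:wait
:halt         jmp     #:halt
cmd_ptr       long    @@@cmd
temp          res     1

              org     0                       'slave COG
get_cmd_1     rdlong  temp_1, cmd_ptr_1       'two-instruction wait loop:
              jmp     temp_1                  'idle value jumps straight back to get_cmd_1
some_function                                 'placeholder routine, real work goes here
              mov     temp_1, #get_cmd_1
              wrlong  temp_1, cmd_ptr_1       '"done" is the address of the wait loop itself
              jmp     #get_cmd_1
cmd_ptr_1     long    @@@cmd
temp_1        res     1
```

The slave no longer needs a zero constant or a conditional jump, at the price of the extra cmp in the caller.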
Post Edited (heater) : 11/15/2009 8:29:33 AM GMT
I did think about having each COG's code in a different Spin file and setting up addresses/commands in Spin. It's a bit messy as one has to pass the addresses around from module to module and manually check that none have been forgotten.
On the other hand it saves having to append "_0", "_1", etc or whatever to all the names that are common to each COG. For example they will probably all need a read_memory_byte function so there has to be three copies of that function each with a different name. We don't get any help from the assembler finding any mistakes here either. Accidentally use a name from the wrong COG and the assembler won't tell you.
When I thought about the 6502 emulator, I thought that if all involved COGs read from external RAM at the same time, all could monitor the input data and decide which one will interpret the data according to the opcode. I did not think about what would happen with multi-byte opcodes though, and all registers would have to be in HUB too.
Not long ago I discovered a working HP9820 (manufactured in 1971!) at the university and I was planning on coding a simulator for it. It may be easier than a 6809.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Visit some of my articles at Propeller Wiki:
MATH on the propeller propeller.wikispaces.com/MATH
pPropQL: propeller.wikispaces.com/pPropQL
pPropQL020: propeller.wikispaces.com/pPropQL020
OMU for the pPropQL/020 propeller.wikispaces.com/OMU
I had a similar scheme set up for my attempt at a four COG Z80 emulation. All COGs look at the ops and only one of them proceeds. Multi-byte ops were OK: whoever decides to emulate the instruction hoovers up the extra bytes and writes out an updated PC for the next round.
Anyway I ended up with a four COG emulator that was no faster than the single COG one. Transferring control and ensuring all the Z80 regs are available to all COGs all the time slowed it down. Also it ate up a lot of HUB space with extra dispatch tables. It turned out that for the Z80, where CP/M does not use many non-8080 instructions, it's quicker overall to put the least used ops into overlays.
The 6809 is more of a pain because all the ops are generally used and need to be quick.
Here is an update of the inter COG call test with the address of get_cmd used as the null operation instead of zero.
This is the client program:
And here's the host (server) program:
The client calls the host to set and reset an output pin. This results in a 500 kHz square wave, which indicates a net service overhead of 1 µs per call.
-Phil
Because the CONstant block is completely compiled first, so it has no idea of what is to follow and can't reference other symbols as they don't exist yet.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
If you always do what you always did, you always get what you always got.
Yeah, so I figured. I even tried putting a separate CONstant block after the DAT section, and that didn't produce the desired result, either. So I guess the compiler must scan for and process all the constant blocks first, regardless of where they occur.
-Phil
It does precisely that.
-Phil
I would. When I was trying to nut out why the Parallax compiler generated different floating point math to the one I was playing with, I spent about 6 days inside the Parallax compiler single stepping it with IDA, particularly around the CON block parser. It's a really nice bit of clean code. It's not often that de-compiled assembler is almost self documenting :)
Excuse me whilst I go off the air for a week or so trying to understand how that all works!
What I would like to do with that is put the remote procedures' command numbers into the emulator's op-code dispatch tables. Straight away I find I can't use a "remote" constant to define a long:
Post Edited (heater) : 11/16/2009 7:06:43 AM GMT
that should probably be :
The extra # is only used where it is an immediate operand in assembler.
Problems:
1) With the code to get the flags set right and the external RAM access code included, it may not fit.
2) The performance is, well I would say "glacial" but glaciers seem to be moving quite fast nowadays.
I have tried hard to split ops up into micro-ops wherever possible, the usual candidates, load something, do something, write something. Where the micro-ops are shared by many different ops in different combinations. But the 6809 has a lot of ops that don't split so well. Let's look at the MUL instruction as an example:
MUL: an 8-bit multiply with a 16-bit result.
That's 27 instructions in my implementation, including getting the flags right. So running it as an overlay costs 5 instructions of intro to the loader, plus 3 instructions of loader loop per instruction in the overlay, plus running the thing itself, another 27 instructions. That's a total of 5 + 3 x 27 + 27 = 113 instructions.
To do that with Phil's cunning RPC mechanism is 6 instructions in the caller and 5 extra in the RPC dispatcher + 27 MUL = 38 instructions.
Almost 3 times faster!
Other long winded ops include:
PUSHes and POPs that can save/restore up to 8 registers (8 or 16 bit) depending on a bit mask in the op. TFR and EXG that can move any reg to any reg.
DAA decimal adjust
SWI. There are three different software interrupts that stack all regs.
RTI.
I haven't come to any decision about how to proceed yet.
Do we have a COG or two to burn?
Should we also look to applying this to ZiCog?