Inter COG calls in PASM.

heater · 2009-11-14 20:59

My 6809 emulator has far too much PASM code to fit in a single COG. Currently it makes use of Cluso's overlay loader to pull in lumps of PASM from HUB as and when required. This is awfully slow.

What I want to do is move the code that is now in overlays into one or more other COGs. I tried this with the Z80 emulator, but the way I did it turned out to be very slow.

Normally PASM code is given some command number in a HUB variable that it reads and then calls an appropriate function.

I want to get rid of this command decoding by replacing the command numbers with actual COG addresses of the function to be invoked so that they may be used directly in jumps.

This is made a bit easier by the use of the @@@ operator in BST.

So in PASM we can call a function in some other COG with:

                 mov     temp, #some_function   'Invoke some_function in slave COG
                 wrlong  temp, cmd_ptr          '
:wait            rdlong  temp, cmd_ptr wz       'and wait for it to complete (slave writes zero when done)
           if_nz jmp     #:wait

In the slave COG we have a loop that will detect when its command is not zero and then use it as a jump to the correct routine:

get_cmd_1        rdlong  temp_1, cmd_ptr_1 wz   'Read a command from HUB
           if_nz jmp     temp_1                 'If set jump to the required function (no # here)
                 jmp     #get_cmd_1

This means that the calling COG needs to know the address of the slave COG's command LONG in HUB. This we can get with:

cmd_ptr long @@@cmd 'Pointer to slave COG 1 command in HUB.

The calling COG also needs the COG address of the required function in the slave. It has that anyway if the PASM for both COGs is in the same Spin file; the appropriate "org 0" statements have to be used to separate them.
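Putting the pieces above together, a minimal single-file layout might look like the sketch below. It is not the attached test program: the routine body is a placeholder, the names zero_1, temp_0 and cmd_ptr_0 are invented for illustration, and it assumes the BST compiler for the @@@ operator.

```spin
' Sketch only: assumes BST (for @@@); the routine body is a placeholder.
PUB start
  cognew(@get_cmd_1, 0)                      ' Start the slave so it is polling
  cognew(@caller, 0)                         ' Start the calling COG

DAT
cmd           long    0                      ' Command LONG in HUB: 0 = idle, else a slave COG address

              org     0                      ' ---- Calling COG ----
caller        mov     temp_0, #some_function ' COG address of the routine in the slave
              wrlong  temp_0, cmd_ptr_0      ' Post the command
:wait         rdlong  temp_0, cmd_ptr_0 wz   ' Poll until the slave clears it
        if_nz jmp     #:wait
              jmp     #caller                ' Call again, just to keep the demo running

cmd_ptr_0     long    @@@cmd                 ' Absolute HUB address of cmd (BST only)
temp_0        res     1

              org     0                      ' ---- Slave COG ----
get_cmd_1     rdlong  temp_1, cmd_ptr_1 wz   ' Read a command from HUB
        if_nz jmp     temp_1                 ' Non-zero: jump to that COG address (no # here)
              jmp     #get_cmd_1

some_function nop                            ' Placeholder for the real work
              wrlong  zero_1, cmd_ptr_1      ' Clear the command to signal completion
              jmp     #get_cmd_1

zero_1        long    0
cmd_ptr_1     long    @@@cmd
temp_1        res     1
```

Because each section starts with its own "org 0", #some_function in the caller resolves to the routine's address inside the slave COG, which is exactly the value the slave needs for its indirect jump.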

Attached is a simple program to test and demonstrate this.

I'm fishing here for any suggestions to enhance performance or neaten it up.

What would be nice is to be able to split the different COGs' PASM into separate files, but then setting up the addresses of the command pointers and COG functions becomes a pain. Unless someone knows a sneaky way to do that linkage.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

MagIO2 · 2009-11-14 23:14

If you have 1-3 free pins (depends on the number of COGs you want to cooperate) you can signal a call using the port. The slave(s) would then use the waitpeq to see if it has something to do. Maybe you could also use additional pins for passing the 'command' and parameters directly.

Waitpeq will continue immediately when it detects its "ID", so both COGs are in sync. On one side you can write to the port and on the other side you can read immediately. This way you don't lose any cycles because of HUB-RAM access on both sides.

The slave will then set the pins to its own ID while the master switches to input and uses waitpeq to wait for the slave. When the slave is done, it outputs the master's ID.

heater · 2009-11-14 23:26

I did think about using pins for signalling. Problem is the emulators need external RAM which eats pins, for example on a TriBlade prop board there are no free pins. Should be useful in other applications though.

MagIO2 · 2009-11-14 23:50

Too bad that Chip did not open port B for internal communication. Would have been perfect for such things.

Cluso99 · 2009-11-15 00:21

Heater:

1. How many helper cogs do you have?

For 3 helper cogs...
The cog address for each helper is stored as a concatenation of 3 x 9bit cog addresses (for a jump) and 5 info bits (as I use in the faster spin interpreter)
e.g. xxxxx ccccccccc bbbbbbbbb aaaaaaaaa
where aaaaaaaaa = cog 1 address, bbbbbbbbb = cog 2 address, ccccccccc = cog 3 address, and xxxxx are flags
If a cog is not required, its field holds the address of "get_cmd_1" above.

So helper cog 1 would be faster, just doing a jmp indirect to the long fetched.
Cog 2 would need to do SHR temp, #9 first
Cog 3 would need to do SHR temp, #18 first
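As a rough sketch of what helper cog 2 would run under this packing (the names get_cmd_2, routine_2 and temp_2 are made up here, and passing the command long's hub address in PAR is an assumption, not something from the scheme itself):

```spin
DAT
              org     0
' Helper cog 2: its 9-bit target is field bbbbbbbbb (bits 17..9) of the command long.
get_cmd_2     rdlong  temp_2, par            ' Fetch the packed command long (hub address assumed in PAR)
              shr     temp_2, #9             ' Move field b down to bits 8..0
              jmp     temp_2                 ' Indirect jump: only the low 9 bits select the target

' Hypothetical routine. A real one would do its work and then write its
' completion status back to hub; that handshake is omitted in this sketch.
routine_2     nop
              jmp     #get_cmd_2

temp_2        res     1
```

No masking is needed after the shift because the indirect jmp only looks at the low 9 bits of the register, which is also why cog 1 can jump on the fetched long directly. Helper cog 3 would be the same with "shr temp_3, #18". While a cog's field holds the address of its own get_cmd loop, it simply keeps polling.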

These commands could in turn be an overlay to be loaded. So for example, cog 3 may be a large overlay loader, with the 5 bits being an offset to the overlay to be loaded.

Once you have the method, setting up addresses can be done in spin before anything is started. See my debugger for that.

The above way would also allow for more than 1 helper cog to be active at once. A clear would be done by writing a byte or word back to hub by each used helper cog to indicate when complete.

MagIO2: our biggest problem with emulations like these is there are never any pins spare :-( It still does not overcome the fact that addresses (routine numbers if you like) have to be passed, so waiting for something to do via hub is about as fast as one can get.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Links to other interesting threads:

· Home of the MultiBladeProps: TriBlade,·RamBlade,·SixBlade, website
· Single Board Computer:·3 Propeller ICs·and a·TriBladeProp board (ZiCog Z80 Emulator)
· Prop Tools under Development or Completed (Index)
· Emulators: CPUs Z80 etc; Micros Altair etc;· Terminals·VT100 etc; (Index) ZiCog (Z80) , MoCog (6809)
· Search the Propeller forums·(uses advanced Google search)
My cruising website is: ·www.bluemagic.biz·· MultiBladeProp is: www.bluemagic.biz/cluso.htm

kuroneko · 2009-11-15 00:34

heater said...
What would be nice is to be able to split the different COGs' PASM into separate files, but then setting up the addresses of the command pointers and COG functions becomes a pain. Unless someone knows a sneaky way to do that linkage.

I know what you mean. Unfortunately it seems you can't refer to DAT labels in the CONstants section so I ended up defining the command CONstants manually (HX512 DMA). This is a bit painful when your actual code is still in flux but once it stabilises it's not too bad. A caller can then simply pick up the command with obj#cmd.

heater · 2009-11-15 08:08

MagIO2: The lack of that "phantom" port B for internal comms tricks has been missed many times. It's a real shame.

Cluso: Not sure how many COGs required yet. Somehow I knew the "stuff three addresses per LONG" trick would come up. But I'm not sure I understand where you are going with it here.

If we have three addresses for three helper COGs stuffed into one command LONG then 2 COGs have to shift the address prior to using it, as you say. What happens when the command is complete? The COGs now have to pack their "done status" (which could well be the address of "get_cmd") back into the command LONG in HUB. Which means a bunch of shifting and ORing.
Actually I don't see the point of the packing at all yet. But given that our emulator dispatch tables are laid out like that it must fit in somewhere.

Anyway, Cluso, you have provided a nice optimization here. That is: use the address of get_cmd for the "done" status rather than a zero. This reduces the slave COGs wait loop to just two instructions:

get_cmd_1        rdlong  temp_1, cmd_ptr_1 wz   'Read a command from HUB
                 jmp     temp_1                 'Jump to the required function (no # here)

and gets rid of the requirement of defining zero in a COG long. Saves 3 LONGs in the example.

The caller COG's wait loop gains an instruction:

                 mov     temp_0, #on_overflow   'Invoke on_overflow in slave COG 1
                 wrlong  temp_0, cmd_ptr_1s     '
:wait            rdlong  temp_0, cmd_ptr_1s     'and wait for it to complete
                 cmp     temp_0, #on_overflow wz
           if_z  jmp     #:wait
                 jmp     #done_0

as it now has to compare the HUB command with something rather than just checking for zero. BUT, if I understand correctly, that "cmp" does not cost any time, as it sits in a tight loop with rdlong and so execution speed is controlled by the HUB access windows anyway.
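The clock counts bear that out, as far as I understand the Propeller 1 timing: a COG gets a HUB window every 16 clocks, a rdlong that is already in step with its window takes 8 clocks, and ordinary instructions take 4. The annotated sketch below uses a made-up constant ON_OVERFLOW in place of the real slave address and PAR in place of cmd_ptr_1s, purely so that it stands alone.

```spin
CON
  ON_OVERFLOW = 5                             ' Hypothetical COG address of the slave routine

DAT
              org     0
caller        mov     temp_0, #ON_OVERFLOW    ' Invoke the routine in the slave COG
              wrlong  temp_0, par             ' (command LONG address assumed to be in PAR)
:wait         rdlong  temp_0, par             ' 8 clocks once the loop is in step with the HUB window
              cmp     temp_0, #ON_OVERFLOW wz ' 4 clocks
        if_z  jmp     #:wait                  ' 4 clocks: 8 + 4 + 4 = 16 = one HUB rotation
done_0        jmp     #done_0                 ' Placeholder for whatever follows the call

temp_0        res     1
```

Without the cmp the loop would still take 16 clocks per pass, because the next rdlong would just wait 4 clocks longer for its window. So the extra instruction fills time that was being spent waiting anyway.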

Post Edited (heater) : 11/15/2009 8:29:33 AM GMT

heater · 2009-11-15 08:26

Oh yeah, the thing about having multiple helper COGs active at once is neat and must be useful somewhere. We have the problem that they probably all want access to ext RAM so I don't want to think about it.

I did think about having each COG's code in a different Spin file and setting up addresses/commands in Spin. It's a bit messy as one has to pass the addresses around from module to module and manually check that none have been forgotten.

On the other hand it saves having to append "_0", "_1", etc or whatever to all the names that are common to each COG. For example they will probably all need a read_memory_byte function so there has to be three copies of that function each with a different name. We don't get any help from the assembler finding any mistakes here either. Accidentally use a name from the wrong COG and the assembler won't tell you.

Ale · 2009-11-15 09:53

heater:

When I thought about the 6502 emulator, I thought that if all involved COGs read from external RAM at the same time all could monitor the input data and decide which will interpret the data according to the opcode. I did not think what would happen with multi byte opcodes though, and all registers have to be in HUB too.

I discovered not long ago a working HP9820 (manufactured in 1971!) at the UNI and I was planning on coding a simulator for it. It may be easier than a 6809.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Visit some of my articles at Propeller Wiki:
MATH on the propeller propeller.wikispaces.com/MATH
pPropQL: propeller.wikispaces.com/pPropQL
pPropQL020: propeller.wikispaces.com/pPropQL020
OMU for the pPropQL/020 propeller.wikispaces.com/OMU

heater · 2009-11-15 10:08

Ale:

I had a similar scheme set up for my attempt at a four COG Z80 emulation. All COGs look at the ops and only one of them proceeds. Multi-byte ops was OK, whoever decides to emulate the instruction hoovers up the extra bytes and writes out an updated PC for the next round.

Anyway I ended up with a four COG emulator that was no faster than the single COG one. Transferring control and ensuring all the Z80 regs are available to all COGs all the time slowed it down. Also it ate up a lot of HUB space with extra dispatch tables. Turned out that for the Z80, where CP/M does not use many non-8080 instructions, it's quicker overall to put the least used ops into overlays.

The 6809 is more of a pain because all the ops are generally used and need to be quick.

Here is an update of the inter COG call test with the address of get_cmd used as the null operation instead of zero.

Phil Pilgrim (PhiPi) · 2009-11-15 23:05

Here's a scheme for handling remote PASM procedure calls between objects. It uses jump tables, which alleviates having to know the actual cog addresses of the various remote procedures. (Unfortunately, it's not possible to include even relative DAT addresses in a CONstant block. 'Not sure why.)

This is the client program:

CON

  _clkmode      = xtal1 + pll16x
  _xinfreq      = 5_000_000

  const         = 6

OBJ

  rpc           : "rpc_host"

PUB start

  cognew(@client, rpc.start)

DAT

'-------[ Calling program ]----------------------------------------------------

              org       0

client        jmp       #startup                'Jump to startup. (Later, holds saved Z flag in bit 0.)

'Jump table: Identical entries, one for each available routine in host.

              jmpret    cmdvalue,#cmdset        'Put command + 1 into cmdvalue source field; jump to cmdset.
              jmpret    cmdvalue,#cmdset        'Put command + 1 into cmdvalue source field; jump to cmdset.

'Set up and execute remote procedure call.

cmdset        wrlong    cmdvalue,par            'Write it to command central (hub).
              muxnz     client,#1               'Save the non-Z flag in (now unused) client bit 0.

'Wait for return.

waitret       rdlong    cmdvalue,par wz         'Has remote procedure returned?
        if_nz jmp       #waitret                '  No:  Keep checking.

              test      client,#1 wz            '  Yes: Restore the Z flag.
rpcall        ret                               '       Return to caller.

'Initialization.

startup       mov       parmaddr,par            'Save the parameter address (not used in this example.)
              add       parmaddr,#4

'Main program code.

mainloop      jmpret    rpcall,#rpc#SET         'Call the remote set routine.
              jmpret    rpcall,#rpc#CLR         'Call the remote clr routine.
              jmp       #mainloop               'Loop back.

'Variables.

cmdvalue      res       1
parmaddr      res       1

And here's the host (server) program:

CON

  SET           = 1             'Set command.
  CLR           = 2             'Clr command.

PUB start

  cognew(@host, @instrbuf)      'Start the host service.
  return @instrbuf              'Return the command buffer address.

DAT

'-------[ Command and parameter buffer ]---------------------------------------

instrbuf      long      0                       'Command buffer.
              long      0                       'Parameter buffer.

'-------[ Called program ]-----------------------------------------------------

              org       0

'Jump table.

host          jmp       #setup                  'Go do setup first. (Becomes "jmp #waitcmd" after setup.)
              long      0                       'Dummy placeholder, since command is now value + 1 .
              jmp       #setpin                 'Address of set routine.
              jmp       #clrpin                 'Address of clr routine.

'Initialization.

setup         mov       argaddr,par             'Save parameter address (not used in this example).
              add       argaddr,#4
              mov       dira,#1                 'Enable pin A0 output.
              movs      host,#waitcmd           'Make command zero (NOP) point back to waitcmd.
              jmp       #waitcmd                'Start the dispatcher.

'Procedure return.

rpret         wrlong    done,par                'Return from remote call. Tell client.

'Call dispatcher.

waitcmd       rdlong    command,par             'Get the command.
              jmp       command                 'Jump to it via jump table.

'Remote procedures.

setpin        mov       outa,#1
              jmp       #rpret

clrpin        mov       outa,#0
              jmp       #rpret

'Constant and variable.

done          long      0

command       res       1
argaddr       res       1

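The client calls the host to set and reset an output pin. This results in a 500 kHz square wave: each 2 µs period contains one set call and one clr call, which indicates a net service overhead of about 1 µs per call (80 clocks at the 80 MHz selected in the client's CON block).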

-Phil

BradC · 2009-11-16 01:07

Phil Pilgrim (PhiPi) said...
(Unfortunately, it's not possible to include even relative DAT addresses in a CONstant block. 'Not sure why.)

Because the CONstant block is completely compiled first, so it has no idea of what is to follow and can't reference other symbols as they don't exist yet.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
If you always do what you always did, you always get what you always got.

Phil Pilgrim (PhiPi) · 2009-11-16 04:23

Brad,

Yeah, so I figured. I even tried putting a separate CONstant block after the DAT section, and that didn't produce the desired result, either. So I guess the compiler must scan for and process all the constant blocks first, regardless of where they occur.

-Phil

BradC · 2009-11-16 04:35

Phil Pilgrim (PhiPi) said...
Brad,
Yeah, so I figured. I even tried putting a separate CONstant block after the DAT section, and that didn't produce the desired result, either. So I guess the compiler must scan for and process all the constant blocks first, regardless of where they occur.

It does precisely that.

Phil Pilgrim (PhiPi) · 2009-11-16 05:49

Well, I wouldn't be so bold as to say "precisely", since neither of us has access to Chip's compiler source.

But, yes, there's overwhelming evidence to support that hypothesis.

-Phil

BradC · 2009-11-16 06:23

Phil Pilgrim (PhiPi) said...
Well, I wouldn't be so bold as to say "precisely"

I would. When I was trying to nut out why the Parallax compiler generated different floating point math to the one I was playing with I spent about 6 days inside the Parallax compiler single stepping it with IDA, particularly around the CON block parser. It's a really nice bit of clean code. It's not often that de-compiled assembler is almost self documenting [noparse]:)[/noparse]

heater · 2009-11-16 07:01

Phil, spectacular.

Excuse me whilst I go off the air for a week or so trying to understand how that all works!

What I would like to do with that is put the remote procedures' command numbers into the emulator's op-code dispatch tables. Straight away I find I can't use a "remote" constant to define a long:

dispatch_table   long #rpc#CLR       'error
                 long #rpc#SET       'more error

Post Edited (heater) : 11/16/2009 7:06:43 AM GMT

BradC · 2009-11-16 07:05

heater said...
Phil, spectacular.

Excuse me whilst I go off the air for a week or so trying to understand how that all works!

What I would like to do with that is put the remote procedures' command numbers into the emulator's op-code dispatch tables. Straight away I find I can't use a constant to define a long:
dispatch_table   long #rpc#CLR       'error
                 long #rpc#SET       'more error

That should probably be:

dispatch_table   long rpc#CLR
                 long rpc#SET

The extra # is only used where it is an immediate operand in assembler.

heater · 2009-11-16 07:14

Of course. Never thought of that, what a silly bunt.

Cluso99 · 2009-11-16 09:50

Initially, maybe you could use another overlay to decode the extra instructions and more overlays to implement them. I think the main thing is to get the code running so that you know just what you are dealing with. I would expect you may find numerous sections of common code which could be called via the extra vectors for each instruction emulation.

heater · 2009-11-16 11:45

As it stands pretty much all the op-codes are implemented. As usual flag setting is a bit neglected so far. I do have a 6809 instruction set exerciser program now that people have been using to check their FPGA implementations of 6809.

Problems:
1) With the flags set right and the external RAM access code included, it may not fit.
2) The performance is, well I would say "glacial" but glaciers seem to be moving quite fast nowadays.

I have tried hard to split ops up into micro-ops wherever possible, the usual candidates, load something, do something, write something. Where the micro-ops are shared by many different ops in different combinations. But the 6809 has a lot of ops that don't split so well. Let's look at the MUL instruction as an example:

MUL, an 8-bit multiply with a 16-bit result.
That's 27 instructions in my implementation, including getting the flags right. So to run that as an overlay is 5 instructions of intro to the loader + 3 instructions of loader loop per instruction in the overlay (81) + running the thing itself, another 27 instructions. That's a total of 113 instructions.
To do that with Phil's cunning RPC mechanism is 6 instructions in the caller + 5 extra in the RPC dispatcher + 27 for MUL = 38 instructions.

Almost 3 times faster !
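For anyone wanting to redo the sums for other ops, the same estimate can be written as Spin constants. The names are invented for illustration and the per-step counts are simply the ones quoted above. These are instruction counts rather than clock counts, so HUB waits are ignored.

```spin
CON
  MUL_LEN       = 27                          ' PASM instructions in the MUL routine
  OVERLAY_COST  = 5 + 3 * MUL_LEN + MUL_LEN   ' Loader intro + 3 per loaded instruction + execution = 113
  RPC_COST      = 6 + 5 + MUL_LEN             ' Caller + dispatcher + execution = 38
```

Taken at face value that is 113 against 38, a ratio of a little under 3. The shorter the routine, the more the fixed 11 instructions of RPC overhead matter, so the gain shrinks for small ops.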

Other long winded ops include:
PUSHes and POPs that can save/restore up to 8 registers (8 or 16 bit) depending on a bit mask in the op. TFR and EXG that can move any reg to any reg.
DAA decimal adjust
SWI. There are three different software interrupts that stack all regs.
RTI.

I haven't come to any decision about how to proceed yet.

Do we have a COG or two to burn?
Should we also look to applying this to ZiCog?
