HUB EXEC Update Here

whicker · 2014-02-11 20:23

ozpropdev wrote: »

I believe address $0 was chosen to fit in with register re-mapping / multi tasks.

I guess I need to ask again in a different way:

Is this register remapping per task, or per cog?

ozpropdev · 2014-02-11 20:33

whicker wrote: »

I guess I need to ask again in a different way:

Is this register remapping per task, or per cog?

Remapping is per task. See the Prop2 docs extract below:

REGISTER REMAPPING
------------------

The SETMAP instruction is used to remap a 2^n-sized block of registers starting at $000, so
that direct accesses to those registers will be redirected to a range of identically-sized
blocks, which also build from $000. This feature allows a single program to run multiple
instances of itself by having unique sets of statically-addressable registers which switch
according to either INDB or the current task.

When using remapping, you must locate your program code above the last block of registers
that will receive remapped accesses. For example, if you select 8 blocks of 16 registers, but
are only using 6 of those blocks, your program code must not start below register 96 (6*16),
to avoid encroaching into the registers which are going to be the recipients of remapping.
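
The placement rule above is simple arithmetic. Here is a minimal sketch in Python (not PASM); the function name `min_code_start` is made up for illustration and is not part of any Propeller tool:

```python
def min_code_start(blocks_in_use: int, regs_per_block: int) -> int:
    """Return the lowest register address that program code may occupy.

    Hypothetical helper: blocks_in_use is how many remap blocks the
    program really uses, regs_per_block is the 2^n register count
    selected by %RRR.
    """
    if regs_per_block < 1 or regs_per_block & (regs_per_block - 1):
        raise ValueError("regs_per_block must be a power of two")
    if blocks_in_use < 1:
        raise ValueError("at least one block is always in use")
    return blocks_in_use * regs_per_block


# The example from the text: 8 blocks of 16 registers selected, 6 in use.
print(min_code_start(6, 16))  # 96
```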

Here is the SETMAP instruction:

    SETMAP  D/#             - Configure register remapping to %M_BBB_RRR

        %M = mode

            %0 = INDB selects the block
            %1 = task number selects the block

        %BBB = block count

            %000 = 1 block          remapping disabled for %000
            %001 = 2 blocks         remapping enabled for %001..%111
            %010 = 4 blocks
            %011 = 8 blocks
            %100 = 16 blocks
            %101 = 32 blocks
            %110 = 64 blocks
            %111 = 128 blocks

        %RRR = register count

            %000 = 1 register       remap $000
            %001 = 2 registers      remap $000..$001
            %010 = 4 registers      remap $000..$003
            %011 = 8 registers      remap $000..$007
            %100 = 16 registers     remap $000..$00F
            %101 = 32 registers     remap $000..$01F
            %110 = 64 registers     remap $000..$03F
            %111 = 128 registers    remap $000..$07F


The new mapping scheme will be in effect on the third instruction after SETMAP. After that,
changes to INDB or the task number will have an immediate effect on block selection. The
remapping mechanism only works with hard-coded D and S addresses which range from $000 to
the remapped-register-count minus 1 (see %RRR above), not via INDA and INDB accesses.

Below is an elaboration of all uniquely-useful remapping schemes:


                                  S/D addresses
%M_BBB_RRR    blocks regs      initial -> remapped       block selector
-----------------------------------------------------------------------------
%x_000_xxx    1      x               <same>

%0_001_000    2      1      %000000000 -> %00000000P     P = INDB[0]
%0_001_001    2      2      %00000000X -> %0000000PX
%0_001_010    2      4      %0000000XX -> %000000PXX     (2 threads)
%0_001_011    2      8      %000000XXX -> %00000PXXX
%0_001_100    2      16     %00000XXXX -> %0000PXXXX
%0_001_101    2      32     %0000XXXXX -> %000PXXXXX
%0_001_110    2      64     %000XXXXXX -> %00PXXXXXX
%0_001_111    2      128    %00XXXXXXX -> %0PXXXXXXX

%0_010_000    4      1      %000000000 -> %0000000PP     PP = INDB[1..0]
%0_010_001    4      2      %00000000X -> %000000PPX
%0_010_010    4      4      %0000000XX -> %00000PPXX     (4 threads)
%0_010_011    4      8      %000000XXX -> %0000PPXXX
%0_010_100    4      16     %00000XXXX -> %000PPXXXX
%0_010_101    4      32     %0000XXXXX -> %00PPXXXXX
%0_010_110    4      64     %000XXXXXX -> %0PPXXXXXX
%0_010_111    4      128    %00XXXXXXX -> %PPXXXXXXX

%0_011_000    8      1      %000000000 -> %000000PPP     PPP = INDB[2..0]
%0_011_001    8      2      %00000000X -> %00000PPPX
%0_011_010    8      4      %0000000XX -> %0000PPPXX     (8 threads)
%0_011_011    8      8      %000000XXX -> %000PPPXXX
%0_011_100    8      16     %00000XXXX -> %00PPPXXXX
%0_011_101    8      32     %0000XXXXX -> %0PPPXXXXX
%0_011_110    8      64     %000XXXXXX -> %PPPXXXXXX

%0_100_000    16     1      %000000000 -> %00000PPPP     PPPP = INDB[3..0]
%0_100_001    16     2      %00000000X -> %0000PPPPX
%0_100_010    16     4      %0000000XX -> %000PPPPXX     (16 threads)
%0_100_011    16     8      %000000XXX -> %00PPPPXXX
%0_100_100    16     16     %00000XXXX -> %0PPPPXXXX
%0_100_101    16     32     %0000XXXXX -> %PPPPXXXXX

%0_101_000    32     1      %000000000 -> %0000PPPPP     PPPPP = INDB[4..0]
%0_101_001    32     2      %00000000X -> %000PPPPPX
%0_101_010    32     4      %0000000XX -> %00PPPPPXX     (32 threads)
%0_101_011    32     8      %000000XXX -> %0PPPPPXXX
%0_101_100    32     16     %00000XXXX -> %PPPPPXXXX

%0_110_000    64     1      %000000000 -> %000PPPPPP     PPPPPP = INDB[5..0]
%0_110_001    64     2      %00000000X -> %00PPPPPPX
%0_110_010    64     4      %0000000XX -> %0PPPPPPXX     (64 threads)
%0_110_011    64     8      %000000XXX -> %PPPPPPXXX

%0_111_000    128    1      %000000000 -> %00PPPPPPP     PPPPPPP = INDB[6..0]
%0_111_001    128    2      %00000000X -> %0PPPPPPPX
%0_111_010    128    4      %0000000XX -> %PPPPPPPXX     (128 threads)

%1_001_000    2      1      %000000000 -> %00000000T     T = bit 0 of the task number
%1_001_001    2      2      %00000000X -> %0000000TX
%1_001_010    2      4      %0000000XX -> %000000TXX     (2 tasks)
%1_001_011    2      8      %000000XXX -> %00000TXXX
%1_001_100    2      16     %00000XXXX -> %0000TXXXX
%1_001_101    2      32     %0000XXXXX -> %000TXXXXX
%1_001_110    2      64     %000XXXXXX -> %00TXXXXXX
%1_001_111    2      128    %00XXXXXXX -> %0TXXXXXXX

%1_010_000    4      1      %000000000 -> %0000000TT     TT = task number
%1_010_001    4      2      %00000000X -> %000000TTX
%1_010_010    4      4      %0000000XX -> %00000TTXX     (4 tasks)
%1_010_011    4      8      %000000XXX -> %0000TTXXX
%1_010_100    4      16     %00000XXXX -> %000TTXXXX
%1_010_101    4      32     %0000XXXXX -> %00TTXXXXX
%1_010_110    4      64     %000XXXXXX -> %0TTXXXXXX
%1_010_111    4      128    %00XXXXXXX -> %TTXXXXXXX
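
The table above follows one pattern: the block selector bits are placed directly above the register-index bits. A small Python model of that pattern, as a sketch only: the names `decode_setmap` and `remap` are invented here, the caller supplies the selector (INDB in mode %0, the task number in mode %1), and rejecting configurations that exceed 512 registers is an assumption based on the table listing none of them.

```python
def decode_setmap(cfg: int):
    """Split a 7-bit %M_BBB_RRR value into (mode, block_bits, reg_bits)."""
    mode = (cfg >> 6) & 1
    block_bits = (cfg >> 3) & 0b111  # log2 of the block count
    reg_bits = cfg & 0b111           # log2 of the registers per block
    return mode, block_bits, reg_bits


def remap(cfg: int, addr: int, selector: int) -> int:
    """Model the redirection of a hard-coded D or S address."""
    _mode, block_bits, reg_bits = decode_setmap(cfg)
    if block_bits + reg_bits > 9:
        # Assumption: combinations above 512 registers are not usable.
        raise ValueError("blocks * registers exceeds 512")
    if block_bits == 0:
        return addr                  # %000 = 1 block, remapping disabled
    if addr >= (1 << reg_bits):
        return addr                  # outside the remapped window
    block = selector & ((1 << block_bits) - 1)
    return (block << reg_bits) | addr


# %0_010_010: 4 blocks of 4 registers, selector 3, register $002 -> $00E
print(hex(remap(0b0_010_010, 0x002, 3)))  # 0xe
```

Addresses at or above the register count pass through unchanged, which is why program code has to sit above the remapped blocks.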


Here is an example program which uses remapping with multi-threading:

DAT             org

period          long    2-1             '$000, thread 0   (20 longs initially execute as NOPs)
time            long    0               '$001, thread 0
pin_x           long    0               '$002, thread 0
pin_y           long    1               '$003, thread 0

                long    4-1             '$000, thread 1
                long    0               '$001, thread 1
                long    2               '$002, thread 1
                long    3               '$003, thread 1

                long    8-1             '$000, thread 2
                long    0               '$001, thread 2
                long    4               '$002, thread 2
                long    5               '$003, thread 2

                long    16-1            '$000, thread 3
                long    0               '$001, thread 3
                long    6               '$002, thread 3
                long    7               '$003, thread 3

pc              long    loop[4]         '$010..$013, all threads start at loop

                setmap  #%0_010_010     'remap 4 blocks of 4 regs by INDB[1..0]
                fixindb #pc+3,#pc       'set INDB to cycle through blocks and threads
                nop                     'allow SETMAP to take effect before 'switch'

loop            switch                  'switch to next thread
                incmod  time,period wc  'increment time and reset if period reached (C=1)
        if_c    notp    pin_x           'if period reached, toggle pin_x
                setpc   pin_y           'if period reached, pin_y high
                jmp     #loop           '(4 threads executing same code with unique variables)


Here is an example program which uses remapping with multi-tasking:

DAT             org

period          long    2-1             '$000, task 0   (16 longs initially execute like NOPs)
time            long    0               '$001, task 0
pin_x           long    0               '$002, task 0
pin_y           long    1               '$003, task 0

                long    4-1             '$000, task 1
                long    0               '$001, task 1
                long    2               '$002, task 1
                long    3               '$003, task 1

                long    8-1             '$000, task 2
                long    0               '$001, task 2
                long    4               '$002, task 2
                long    5               '$003, task 2

                long    16-1            '$000, task 3
                long    0               '$001, task 3
                long    6               '$002, task 3
                long    7               '$003, task 3


                setmap  #%1_010_010     'remap 4 blocks of 4 regs by task
                settask #%%3210         'set all 4 tasks in motion
                jmptask #%1111,#loop    'herd tasks to loop


loop            incmod  time,period wc  'increment time and reset if period reached (C=1)
        if_c    notp    pin_x           'if period reached, toggle pin_x
                setpc   pin_y           'if period reached, pin_y high
                jmp     #loop           '(4 tasks executing same code with unique registers)
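
To see what the two example programs do with their per-block variables, here is a rough Python simulation. It is a sketch under stated assumptions, not a cycle-accurate model: INCMOD is assumed to wrap to 0 with C=1 when the value equals the limit and otherwise to increment with C=0, SETPC is assumed to copy C to the pin, and each task is run for one whole loop pass at a time instead of being interleaved per instruction.

```python
def incmod(value: int, limit: int):
    """Assumed INCMOD behaviour: returns (new_value, carry)."""
    if value == limit:
        return 0, True
    return value + 1, False


def simulate(passes: int):
    """Run the shared loop body once per task, `passes` times."""
    # One 4-register block per task: [period, time, pin_x, pin_y]
    blocks = [
        [2 - 1, 0, 0, 1],
        [4 - 1, 0, 2, 3],
        [8 - 1, 0, 4, 5],
        [16 - 1, 0, 6, 7],
    ]
    pins = [0] * 8
    toggles = [0] * len(blocks)
    for _ in range(passes):
        for task, regs in enumerate(blocks):   # same code, different block
            period, time, pin_x, pin_y = regs
            regs[1], carry = incmod(time, period)
            if carry:
                pins[pin_x] ^= 1               # if_c notp pin_x
                toggles[task] += 1
            pins[pin_y] = int(carry)           # setpc pin_y
    return toggles, pins


toggles, pins = simulate(16)
print(toggles)  # [8, 4, 2, 1]
```

Each task toggles its pin_x at half the rate of the previous one, using identical code and only the remapped registers $000..$003.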

Ariba · 2014-02-11 21:48

My proposal for the mnemonics:

LCALL    for link-call
LRET     for link return (is in fact: jmp 0)
LJMP     for list-jump

It may be a bit confusing that LCALL and LJMP look similar but do very different things, but the similar look shows that LJMP also writes to the link-reg at $0.

Andy

cgracey · 2014-02-11 22:16

ozpropdev wrote: »

Remapping is per task...

Remapping is systemic to the cog and affects all tasks. You can select between INDB-based register block selection and task-based register block selection.

Here is a little program that uses task-based remapping and the LINK instruction (which has many other proposed names):

dat	org

lnkreg	long	0	'task0 lnkreg	(these longs execute as nop's at start)
pinreg	long	0	'task0 pinreg

	long	0	'task1 lnkreg
	long	1	'task1 pinreg

	long	0	'task2 lnkreg
	long	2	'task2 pinreg

	long	0	'task3 lnkreg
	long	3	'task3 pinreg


	setmap	#%1_010_001	'remap 4 blocks of 2 registers by task

	jmptask	#%1000,#t3	'set task 3,2,1 start address
	jmptask	#%0100,#t2
	jmptask	#%0010,#t1

	settask	#%%3210		'enable 4 tasks in round-robin


t0	link	#flip		'task0 flips pin0
	notp	#4		'...then pin4
	jmp	#t0

t1	link	#flip		'task1 flips pin1
	notp	#5		'...then pin5
	jmp	#t1

t2	link	#flip		'task2 flips pin2
	notp	#6		'...then pin6
	jmp	#t2

t3	link	#flip		'task3 flips pin3
	notp	#7		'...then pin7
	jmp	#t3


flip	notp	pinreg		'flip task's pin
	jmp	lnkreg		'return to task

cgracey · 2014-02-11 22:24

I got the LINK stuff all implemented and tested after an overnight bug that turned out to be related to data-forwarding. When changing the D address to $000 in the last stage of the pipeline, there are several other circuits that need to know about it. Things are working perfectly again, now that that's solved. Whew!

Tonight I'm going to add a quick feature that Baggers had requested a while back: a 32-bit write mask for WRWIDE. This will let you control which bytes get written on a WRWIDE. There are also three instructions which will set the mask for you, based on non-zero bytes/words/longs within the WIDE registers. Make way for WIDEBM, WIDEWM, and WIDELM. On startup and after every WRWIDE, the mask will be reset to all 1's.

rogloh · 2014-02-11 22:28

cgracey wrote: »

I've got the link instructions implemented:

LINKTAB D,@S 'jump to location in table (PC+S+D) and write {Z,C,%00_0000_0000_0000,PC} into register $000.
LINK D/#/@ 'jump to location and write {Z,C,%00_0000_0000_0000,PC} into register $000.

They seem to all work fine, but I'll test more tomorrow and get an update out.

Chip, does this above notation for LINKTAB mean if there is a jump table stored in hub memory at some given (fixed) address that any users referencing it with LINKTAB instructions would always need to compute their D or S relative to this table, rather than allowing absolute base + offset addressing? ie. if we always want to keep the table starting at $1000 in hub memory, and the LINKTAB instruction is called from some arbitrary address $204C, then we need S to be set to -$204C, and D to be the index (or vice versa), to be able to compensate?

The reason I ask is I see a very good use LINKTAB for offering us a call vector table holding function entry addresses of dynamically relocatable function blocks which get loaded into hub RAM from external storage (eg. function blocks read in from SDRAM). If all function calls then used this LINKTAB calling mechanism it wouldn't ever matter where the functions themselves were loaded into hub memory as long as each was in a contiguous memory block and used relative jumps internally within the function itself. However a big problem appears if using this new LINKTAB instruction always needs D/S address offsets computed relative to the current instruction address (which itself is dynamic for relocatable code) then all these LINKTAB D or S values need to be patched when loaded the code into relocatable addresses in hub memory to become relative to where they were loaded so they can still access the same table in hub memory at the fixed address. That's not as nice or fast as being able to load it in anywhere and start executing straight away, and you need to keep track of all places to patch while you load it, which is slow and ugly to patch on the fly. Having absolute as well as relative versions of this instruction would be nicer. I can see uses for both.

Roger.

ozpropdev · 2014-02-11 22:57

cgracey wrote: »

Remapping is systemic to the cog and affects all tasks. You can select between INDB-based register block selection and task-based register block selection.

Oops! I forgot about the INDB mode.

cgracey wrote: »

Tonight I'm going to add a quick feature that Baggers had requested a while back: a 32-bit write mask for WRWIDE. This will let you control which bytes get written on a WRWIDE. There are also three instructions which will set the mask for you, based on non-zero bytes/words/longs within the WIDE registers. Make way for WIDEBM, WIDEWM, and WIDELM. On startup and after every WRWIDE, the mask will be reset to all 1's.

That would be very useful.

Sapieha · 2014-02-11 23:22

Hi rogloh.

It is for exactly that possibility that I asked for

CALLFUN Table, Index


cgracey · 2014-02-11 23:35

rogloh wrote: »

Chip, does this above notation for LINKTAB mean if there is a jump table stored in hub memory at some given (fixed) address that any users referencing it with LINKTAB instructions would always need to compute their D or S relative to this table, rather than allowing absolute base + offset addressing? ie. if we always want to keep the table starting at $1000 in hub memory, and the LINKTAB instruction is called from some arbitrary address $204C, then we need S to be set to -$204C, and D to be the index (or vice versa), to be able to compensate?

The reason I ask is I see a very good use LINKTAB for offering us a call vector table holding function entry addresses of dynamically relocatable function blocks which get loaded into hub RAM from external storage (eg. function blocks read in from SDRAM). If all function calls then used this LINKTAB calling mechanism it wouldn't ever matter where the functions themselves were loaded into hub memory as long as each was in a contiguous memory block and used relative jumps internally within the function itself. However a big problem appears if using this new LINKTAB instruction always needs D/S address offsets computed relative to the current instruction address (which itself is dynamic for relocatable code) then all these LINKTAB D or S values need to be patched when loaded the code into relocatable addresses in hub memory to become relative to where they were loaded so they can still access the same table in hub memory at the fixed address. That's not as nice or fast as being able to load it in anywhere and start executing straight away, and you need to keep track of all places to patch while you load it, which is slow and ugly to patch on the fly. Having absolute as well as relative versions of this instruction would be nicer. I can see uses for both.

Roger.

The @S is the relative address of the start of the table (or whatever is there) and D is the offset, or index. The jump address is PC+relative+D. AUGS can be used to create bigger offsets than 9 bits. This LINKTAB instruction amounts to a special shortcut which could be coded discretely using other instructions. There are a lot of ways you could code up similar branches, but hopefully this is a useful form.
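
The difference between the PC-relative form described here and the absolute form rogloh asks about can be sketched in a few lines of Python. Everything in this block is an illustrative model: the function names are invented, the 16-bit PC width is inferred from the {Z,C,%00_0000_0000_0000,PC} layout (2 + 14 + 16 bits), and which PC value gets saved (the instruction's own or the following one) is left to the caller because the thread does not say.

```python
PC_MASK = 0xFFFF  # assumption: 16-bit PC field, from the 32-bit link layout


def link_value(z: bool, c: bool, return_pc: int) -> int:
    """Pack {Z, C, 14 zero bits, PC} as written into register $000."""
    return (int(z) << 31) | (int(c) << 30) | (return_pc & PC_MASK)


def linktab_relative(pc: int, rel_s: int, index_d: int) -> int:
    """Jump target for the form described above: PC + relative + D."""
    return (pc + rel_s + index_d) & PC_MASK


def linktab_absolute(base_s: int, index_d: int) -> int:
    """Jump target for the proposed absolute form: S + D."""
    return (base_s + index_d) & PC_MASK


TABLE = 0x1000               # fixed table address from rogloh's example
rel = TABLE - 0x204C         # offset assembled for a caller at $204C

print(hex(linktab_relative(0x204C, rel, 3)))  # 0x1003
print(hex(linktab_relative(0x3000, rel, 3)))  # 0x1fb7
print(hex(linktab_absolute(TABLE, 3)))        # 0x1003
```

The second line shows the relocation issue: the same relative offset, executed from a different address, no longer lands in the table, while the absolute form does not depend on where the caller sits.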

Sapieha · 2014-02-11 23:49

Hi Chip.

Only problem ---

That functions end with RET, so that the table can even mix in other subroutines

cgracey wrote: »

The @S is the relative address of the start of the table (or whatever is there) and D is the offset, or index. The jump address is PC+relative+D. AUGS can be used to create bigger offsets than 9 bits. This LINKTAB instruction amounts to a special shortcut which could be coded discretely using other instructions. There are a lot of ways you could code up similar branches, but hopefully this is a useful form.

cgracey · 2014-02-12 00:01

Sapieha wrote: »

Hi Chip.

Only problem ---

That functions end with RET, so that the table can even mix in other subroutines

How is that a problem?

Sapieha · 2014-02-12 00:10

Hi Chip

Whenever I or anyone else writes a routine, we don't know how it will be used.

Maybe I will use it both as a function and as a standard subroutine, so it is always a problem if I need to remember that it has special return requirements.

And as I said and will always say:

> Skip LINKTAB and remake it as CALLTAB, which uses the standard RETURN system.
That will give much more flexibility in writing portable code.

cgracey wrote: »

How is that a problem?

Cluso99 · 2014-02-12 00:58

cgracey wrote: »

I got the LINK stuff all implemented and tested after an overnight bug that turned out to be related to data-forwarding. When changing the D address to $000 in the last stage of the pipeline, there are several other circuits that need to know about it. Things are working perfectly again, now that that's solved. Whew!

Tonight I'm going to add a quick feature that Baggers had requested a while back: a 32-bit write mask for WRWIDE. This will let you control which bytes get written on a WRWIDE. There are also three instructions which will set the mask for you, based on non-zero bytes/words/longs within the WIDE registers. Make way for WIDEBM, WIDEWM, and WIDELM. On startup and after every WRWIDE, the mask will be reset to all 1's.

Sounds quite useful, although I am not quite understanding the WIDEBM/WM/LM (byte/word/long mask) concept with respect to the "based on non-zero bytes/words/longs within the WIDE registers" part. But I am happy to wait for the docs.

rogloh · 2014-02-12 02:09

cgracey wrote: »

The @S is the relative address of the start of the table (or whatever is there) and D is the offset, or index. The jump address is PC+relative+D. AUGS can be used to create bigger offsets than 9 bits. This LINKTAB instruction amounts to a special shortcut which could be coded discretely using other instructions. There are a lot of ways you could code up similar branches, but hopefully this is a useful form.

Ok thanks Chip. Well it is a useful form in any applications where these jump tables are always going to be stored relative to the block calling them, but unfortunately they are not quite as useful when the jump table is designed to be common to all functions and stored globally at a fixed known address in hub memory and we need to access them from relocated code. In that case we will have to patch all calls on the fly to compensate for their calling addresses, that will slow things down and require storage of all call addresses.

When I saw LINKTAB I was really hoping to come up with a scheme using this calling mechanism for allowing a type of "XMM" execution mode where code stored in a large SDRAM could be read into hub memory on the fly and executed using hub exec (think of doing "malloc" / "free" for actual functions to be copied in from SDRAM dynamically). I saw this as being a great way to be able to very quickly cache moderately sized functions in hub memory to be executed at speed. The pointers in the common function table would either point to the hub address where they are loaded, or to the address of the code that manages hub memory blocks, loads the requested function in and executes it immediately afterwards. The caller doesn't need to know if the code is yet in hub memory, by the time it returns it will have been executed regardless. I had thought this could have worked out well for that but it probably won't be practical with the relative addresses required now, pity, bad luck there I guess.

Also I just reread my quoted post, bit of a grammar fail there - sorry about that, must have rushed that one out too fast.

cgracey · 2014-02-12 02:10

Cluso99 wrote: »

Sounds quite useful, although I am not quite understanding the WIDEBM/WM/LM (byte/word/long mask) concept with respect to the "based on non-zero bytes/words/longs within the WIDE registers" part. But I am happy to wait for the docs.

SETMASK D/# sets the 32-bit mask to any value
WIDEBM sets a write bit for each corresponding byte that is non-0
WIDEWM sets two bits for each corresponding word that is non-0
WIDELM sets four bits for each corresponding long that is non-0

Those last three instructions cause bytes/words/longs in the eight WIDE registers NOT to be written if their values are zero. By using the SETWIDZ instruction to clear the WIDEs, then writing, say, some non-0 long pixel values into the WIDEs, and then doing a WIDELM before a WRWIDE, only the new pixels will be written to hub memory, leaving the original hub background data intact where there were no new pixels.

WIDEBM/WIDEWM/WIDELM are kind of fancy and maybe not so critical, as you could just do a RDWIDE to load background data first. The SETMASK instruction is really important, though, as it lets you mask byte-writes when you do 32 at a time with WRWIDE.
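
The mask instructions can be modelled in Python as follows. This is a sketch, not the hardware definition: the helper names are invented, and the bit order (mask bit i controls byte i of the 32-byte WIDE block) is an assumption the thread does not spell out.

```python
def wide_mask(data: bytes, unit: int) -> int:
    """Build a 32-bit write mask from 32 bytes of WIDE data.

    unit = 1 models WIDEBM, 2 models WIDEWM, 4 models WIDELM:
    every non-zero byte/word/long sets 1/2/4 mask bits.
    """
    if len(data) != 32 or unit not in (1, 2, 4):
        raise ValueError("need 32 bytes and a unit of 1, 2 or 4")
    mask = 0
    for start in range(0, 32, unit):
        if any(data[start:start + unit]):
            mask |= ((1 << unit) - 1) << start
    return mask


def wrwide(hub: bytearray, addr: int, data: bytes, mask: int = 0xFFFFFFFF):
    """Model a masked WRWIDE: write only bytes whose mask bit is set."""
    for i in range(32):
        if (mask >> i) & 1:
            hub[addr + i] = data[i]


wides = bytearray(32)
wides[7] = 0x7F                      # one non-zero byte inside long 1
print(hex(wide_mask(wides, 1)))      # 0x80
print(hex(wide_mask(wides, 2)))      # 0xc0
print(hex(wide_mask(wides, 4)))      # 0xf0
```

Note that with the long mask the three zero bytes of that long are still written, since the whole long counts as non-zero.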

cgracey · 2014-02-12 02:14

rogloh wrote: »

Ok thanks Chip. Well it is a useful form in any applications where these jump tables are always going to be stored relative to the block calling them, but unfortunately they are not quite as useful when the jump table is designed to be common to all functions and stored globally at a fixed known address in hub memory and we need to access them from relocated code. In that case we will have to patch all calls on the fly to compensate for their calling addresses, that will slow things down and require storage of all call addresses.

When I saw LINKTAB I was really hoping to come up with a scheme using this calling mechanism for allowing a type of "XMM" execution mode where code stored in a large SDRAM could be read into hub memory on the fly and executed using hub exec (think of doing "malloc" / "free" for actual functions to be copied in from SDRAM dynamically). I saw this as being a great way to be able to very quickly cache moderately sized functions in hub memory to be executed at speed. The pointers in the common function table would either point to the hub address where they are loaded, or to the address of the code that manages hub memory blocks, loads the requested function in and executes it immediately afterwards. The caller doesn't need to know if the code is yet in hub memory, by the time it returns it will have been executed regardless. I had thought this could have worked out well for that but it probably won't be practical with the relative addresses required now, pity, bad luck there I guess.

Also I just reread my quoted post, bit of a grammar fail there - sorry about that, must have rushed that one out too fast.

Could you describe the instruction you would like to see in the chip? Maybe we could make it work.

Baggers · 2014-02-12 03:01

cgracey wrote: »

I got the LINK stuff all implemented and tested after an overnight bug that turned out to be related to data-forwarding. When changing the D address to $000 in the last stage of the pipeline, there are several other circuits that need to know about it. Things are working perfectly again, now that that's solved. Whew!

Tonight I'm going to add a quick feature that Baggers had requested a while back: a 32-bit write mask for WRWIDE. This will let you control which bytes get written on a WRWIDE. There are also three instructions which will set the mask for you, based on non-zero bytes/words/longs within the WIDE registers. Make way for WIDEBM, WIDEWM, and WIDELM. On startup and after every WRWIDE, the mask will be reset to all 1's.

Awesome progress Chip, you're on a roll lately

David Betz · 2014-02-12 03:43

ozpropdev wrote: »

It's all about "LEAF" functions isn't it?

What about LEAF,LEAFRET?

In fact, GCC is likely to use this instruction to call all functions not just leaf functions so LEAF probably isn't appropriate.

David Betz · 2014-02-12 03:53

rogloh wrote: »

When I saw LINKTAB I was really hoping to come up with a scheme using this calling mechanism for allowing a type of "XMM" execution mode where code stored in a large SDRAM could be read into hub memory on the fly and executed using hub exec (think of doing "malloc" / "free" for actual functions to be copied in from SDRAM dynamically). I saw this as being a great way to be able to very quickly cache moderately sized functions in hub memory to be executed at speed. The pointers in the common function table would either point to the hub address where they are loaded, or to the address of the code that manages hub memory blocks, loads the requested function in and executes it immediately afterwards. The caller doesn't need to know if the code is yet in hub memory, by the time it returns it will have been executed regardless.

Yes, this is what I was thinking about when I mentioned the idea of using an overlay loader to allow programs larger than hub memory. I think the scheme you describe is quite similar to traditional overlay handlers.

rogloh · 2014-02-12 04:59

cgracey wrote: »

Could you describe the instruction you would like to see in the chip? Maybe we could make it work.

Ok sure, will try. Where you already have this capability:

LINKTAB D,@S 'jump to location in table (PC+S+D) and write {Z,C,%00_0000_0000_0000,PC} into register $000.

I would instead prefer to be able to index into the lookup table at an absolute address using D+S but just do the same thing as you have done with writing data to "LR" register 0, so it now becomes this:

LINKTAB D,S 'jump to location in table (S+D) and write {Z,C,%00_0000_0000_0000,PC} into register $000.

You could use S as the base and make D the index.

We still need to set up the D,S values before the jump is done and then handle the LR appropriately within the leaf/non-leaf function being jumped to for the overall call and return sequence to work. As the lookup table base address in hub RAM would remain fixed, that value could stay in a COG register permanently, but we would need to set up D with an index value to create the lookup table index each time around.

All function calls made to the relocatable hub code loaded from say SDRAM then effectively become this...

AUGD    #((FN_INDEX * 4) >> 9)
MOV     index, #((FN_INDEX * 4) & $1ff)
LINKTAB index, table_base

where FN_INDEX is the index into the lookup table for the function being called, index is the temporary COG register that is added to the fixed table_base register to access the table.

The cost is now 3 PASM instructions per function call instead of one call instruction, but it gives us the indirection we require for calling dynamically relocatable code. The AUGD operation is required to obtain table offsets > 511, i.e. when we have more than 128 function entry addresses listed in the table, since each table entry is a long of 4 bytes. But if the scaling shift of 2 could also be done in hardware before the addition, making the lookup address (S + D<<2), that would extend the range to 512 functions before the extra AUGD is required, which is starting to become a reasonably large number of functions for an application. In those cases we would only require 2 PASM instructions of hub memory overhead per call.
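
The 128 versus 512 figures follow from the 9-bit immediate field. A short Python check of that arithmetic, with invented helper names and nothing assumed about the instructions beyond the 9-bit limit stated above:

```python
IMM_BITS = 9  # a 9-bit immediate holds 0..511


def split_offset(fn_index: int, entry_size: int = 4):
    """Split a byte offset into (upper bits for the AUG prefix, low 9 bits)."""
    offset = fn_index * entry_size
    return offset >> IMM_BITS, offset & ((1 << IMM_BITS) - 1)


def needs_prefix(fn_index: int, hardware_shift: bool = False) -> bool:
    """True if the value no longer fits in 9 bits.

    With hardware_shift the instruction itself would scale the index
    by 4, so the immediate only has to hold the index.
    """
    value = fn_index if hardware_shift else fn_index * 4
    return value >= (1 << IMM_BITS)


print(split_offset(127))        # (0, 508)
print(split_offset(128))        # (1, 0)
print(needs_prefix(511, True))  # False
```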

Hope that helps describe what I am interested in here.

Cheers,

Roger.

Bill Henning · 2014-02-12 07:31

Hi Chip,

cgracey wrote: »

Could you describe the instruction you would like to see in the chip? Maybe we could make it work.

I'll also take a stab at explaining it - I asked for something very similar in posts #120 & #121 in this thread (based on what Sapieha was asking for), main difference is I'd like a word table of pointers.

(Extracted from #120)

CALLVECT D,#n
CALLVECT D,S

CALLLIST looked weird with 3 L's, so I changed it to VECT for the example

D holds the base address of a WORD table in the hub

n or S are the index

This way it could dispatch to 512 system/VM routines. (per DLL)

(Extracted from #121)

If the addresses in the word table are relative to their position, this would get us DLL's.

The calling task/cog would place the start of the DLL in the register "dllbase"

Then it could call any routine in the DLL with

CALLVECT dllbase,#routineindex ' (0..511)<<2

A dll would be: (in the hub)

WORD @function0 ' may be better used to hold number of functions, or offset to dllstaticdata
WORD @function1
....
WORD @functionN-1

' local DLL data area
dllstaticdata:

function0:

function1:

...

As relative addresses would be used, no need for a relocating loader

This would allow compiled code to share a DLL version of libc, libm etc (as long as the libraries were thread-safe)

As long as crt0.s contained (in a cog register) a small table of DLL base pointers, C code could call DLL functions with a simple

CALLVECT DLLx,#fn_index

A single instruction!

Of course, a DLL could be used for overlays, efficient handling of virtual functions etc

Cluso99 · 2014-02-12 08:08

LINKTAB uses D as the PC-relative base address, and @S as the offset into the table.

ragloh prefers this be an absolute address. So D would hold the absolute base address, and @S would be the offset.

In both of the above cases, @S would be added to the value stored in register D, but PC would also be added in the original and not in ragloh's case.

IMHO, the ragloh case (absolute table address in reg D) makes the most sense. D only needs to be set once. If the code is relocatable, then this can easily be calculated once and added to D.

In either case, the LINKTAB would save the return address in cog $000 and JMP to the table + offset (@S). The LINKTAB could be preceded by AUGS if a larger offset than 511 is required.

Bill's case is quite different because he is asking for a table of words representing addresses. This requires fetching D and adding #S<<2?, then fetching the word at the resulting address (a 16-bit absolute address <<2), and that would be the address to jump to. As I understand it, this is not possible in the current 4-stage pipeline.
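A rough Python model of the two LINKTAB variants being compared. It only captures the address arithmetic described in this post: the return address goes to cog register $000 and the jump target is D + S, with PC added as well in the original (relative) form. Addresses are counted in instruction units and the return address is assumed to be PC + 1; both are assumptions, and the function name `linktab` is illustrative only.

```python
def linktab(cog, pc, d_reg, s_offset, relative):
    """Model one LINKTAB: save a return address in cog[$000], return the new PC.

    cog      : dict mapping cog register number -> value
    pc       : address of the LINKTAB instruction
    d_reg    : cog register holding the table base
    s_offset : offset into the table (the @S operand)
    relative : True  = original form, base in D is PC-relative
               False = absolute form, base in D is used as-is
    """
    cog[0x000] = pc + 1              # assumed: return to the next instruction
    target = cog[d_reg] + s_offset
    if relative:
        target += pc
    return target

# Absolute form: D holds the table address itself
cog = {0x010: 0x4000}
print(hex(linktab(cog, pc=0x1000, d_reg=0x010, s_offset=3, relative=False)))  # 0x4003

# Relative form: D holds (table address - pc) for this particular call site
cog = {0x010: 0x3000}
print(hex(linktab(cog, pc=0x1000, d_reg=0x010, s_offset=3, relative=True)))   # 0x4003
```

In the absolute form the target does not depend on where the LINKTAB sits, which is why D only needs to be set once.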

Bill Henning · 2014-02-12 08:28

Yes, mine would be a hub instruction.

And I originally suggested it for P3

Table address in D could be absolute only.

Table entries should be relative to start of table, scaled by 4
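As a sketch of what such a table lookup would compute, here is a Python model of a hypothetical CALLVECT. It assumes an absolute table address in D, 16-bit little-endian table entries, and entries holding (routine address - table start) / 4, as described above. The helper names `callvect_target` and `build_table` are invented for this example.

```python
import struct

def callvect_target(hub, table_base, index):
    """Resolve the jump target for a hypothetical CALLVECT table_base, #index.

    hub        : bytes-like model of hub RAM
    table_base : absolute byte address of the word table (the value in D)
    index      : routine index (n or S)
    """
    (entry,) = struct.unpack_from("<H", hub, table_base + 2 * index)
    return table_base + (entry << 2)     # entry is relative to table start, scaled by 4

def build_table(routine_offsets):
    """Pack a word table from byte offsets relative to the table start."""
    return b"".join(struct.pack("<H", off >> 2) for off in routine_offsets)

table = build_table([0x10, 0x40])        # two routines, 0x10 and 0x40 bytes into the DLL

hub = bytearray(0x2000)
hub[0x1000:0x1000 + len(table)] = table  # DLL loaded at $1000
hub[0x1800:0x1800 + len(table)] = table  # same bytes loaded again at $1800

print(hex(callvect_target(hub, 0x1000, 1)))  # 0x1040
print(hex(callvect_target(hub, 0x1800, 1)))  # 0x1840
```

The same table bytes resolve correctly at either load address, which is the "no relocating loader" property.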

Cluso99 wrote: »

LINKTAB uses D as the PC-relative base address, and @S as the offset into the table.

ragloh prefers this be an absolute address. So D would hold the absolute base address, and @S would be the offset.

In both of the above cases, @S would be added to the value stored in register D, but PC would also be added in the original and not in ragloh's case.

IMHO, the ragloh case (absolute table address in reg D) makes the most sense. D only needs to be set once. If the code is relocatable, then this can easily be calculated once and added to D.

In either case, the LINKTAB would save the return address in cog $000 and JMP to the table + offset (@S). The LINKTAB could be preceded by AUGS if a larger offset than 511 is required.

Bill's case is quite different because he is asking for a table of words representing addresses. This requires fetching D and adding #S<<2?, then fetching the word at the resulting address (a 16-bit absolute address <<2), and that would be the address to jump to. As I understand it, this is not possible in the current 4-stage pipeline.

Cluso99 · 2014-02-12 08:29

ragloh/Roger,
I see you think the table address should be in S and offset in D. Currently D must be a cog register, so this is best to hold the Table absolute base address (it's easy to set this once, and add a relocation value if required). S can be an immediate 9 bits permitting up to 512 table entries, and can be expanded by using AUGS.

So the only difference is deleting the PC addition.

We can also use this method to link directly into relocatable code routines by setting D to the relocated base of the routines, and #S being the start offset from the base, without actually requiring a jump table.

Cluso99 · 2014-02-12 08:35

Yes Bill, scaled by 4 because the table is longs, and yes the offset is from the table start.

This should work whether the table is in hub or cog, so if D is < $200, no scaling of S. Chip, is this possible to do?
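The hub/cog rule suggested here amounts to the following, shown as a minimal Python sketch. It assumes the table base held in D is compared against $200 and that hub tables hold longs at byte addresses; the function name is illustrative.

```python
COG_LIMIT = 0x200   # base values below this are treated as cog register addresses

def table_entry_address(base, s):
    """Address of entry `s` for a table starting at `base`.

    Cog tables are indexed by register, so S is used unscaled.
    Hub tables hold longs at byte addresses, so S is scaled by 4.
    """
    return base + s if base < COG_LIMIT else base + (s << 2)

print(hex(table_entry_address(0x100, 3)))    # 0x103  (cog table)
print(hex(table_entry_address(0x1000, 3)))   # 0x100c (hub table)
```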

Bill Henning · 2014-02-12 08:38

If the table >$200, it should be a table of words, to save hub memory.

Cluso99 wrote: »

Yes Bill, scaled by 4 because the table is longs, and yes the offset is from the table start.

This should work whether the table is in hub or cog, so if D is < $200, no scaling of S. Chip, is this possible to do?

Sapieha · 2014-02-12 08:44

Hi Bill.

In my opinion, and what I was thinking: that table still needs to be JMPs to routines that end with RET.

Bill Henning wrote: »

If the table >$200, it should be a table of words, to save hub memory.

Cluso99 · 2014-02-12 10:04

Bill, the table has to be JMPs and therefore longs. You cannot do an extra (3rd) fetch in the existing 4-stage pipeline. So you cannot fetch a word, because what is fetched needs to be the next instruction to be executed.

Bill Henning · 2014-02-12 10:18

Ray,

I understand that. Basically, I was proposing a new hub instruction for P3, for high level language / DLL support, in a way that minimized memory used at the expense of an extra hub read cycle.

Cluso99 wrote: »

Bill, the table has to be JMPs and therefore longs. You cannot do an extra (3rd) fetch in the existing 4-stage pipeline. So you cannot fetch a word, because what is fetched needs to be the next instruction to be executed.

rogloh · 2014-02-12 12:56

Cluso99 wrote: »

LINKTAB uses D as the PC-relative base address, and @S as the offset into the table.

ragloh prefers this be an absolute address. So D would hold the absolute base address, and @S would be the offset.

In both of the above cases, @S would be added to the value stored in register D, but PC would also be added in the original and not in ragloh's case.

IMHO, the ragloh case (absolute table address in reg D) makes the most sense. D only needs to be set once. If the code is relocatable, then this can easily be calculated once and added to D.

In either case, the LINKTAB would save the return address in cog $000 and JMP to the table + offset (@S). The LINKTAB could be preceded by AUGS if a larger offset than 511 is required.

Bill's case is quite different because he is asking for a table of words representing addresses. This requires fetching D and adding #S<<2?, then fetching the word at the resulting address (a 16-bit absolute address <<2), and that would be the address to jump to. As I understand it, this is not possible in the current 4-stage pipeline.

Yes, as Cluso says, you can easily make the absolute form of LINKTAB relative when you compute the base address, but it is difficult to go the other way in relocated code, as you would need to patch the code on the fly at each LINKTAB position.

Also think of it this way:

A relative LINKTAB can be used to allow static code to call relocated DLL code, where the table of jumps is relative to the calling code.

But an absolute LINKTAB can go much further and allow relocated DLLs to call other relocated code at will. This is what we want for enabling hub exec of large SDRAM applications.

In this case there is a single instance of the jump table at a static (fixed) hub address known to all callers, so each LINKTAB used as a function call does not need to be patched to compensate for its call position when the DLL is loaded. That patching could be a lot of work; it would definitely limit speed, and it requires a list of all patch positions to be loaded into hub RAM first, which is probably too much effort to be workable at speed.

To make it fully clear: for this LINKTAB instruction itself I would just like the cog PC to be set to D+S (or even better, D + S<<2), so the table itself holds absolute JMP instructions, not 16/32-bit pointers that are read and copied into the PC. The "LR" at register 0 also needs to be updated so we can "return" later to the original caller.
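One way to picture the patching argument is the Python sketch below. It is only a model under assumptions: the jump table sits at a fixed hub address (`TABLE_ADDR`, an arbitrary value chosen here), a relative LINKTAB needs a base value of (table address - call-site address), and the call-site offsets are hypothetical.

```python
TABLE_ADDR = 0x0800          # assumed fixed hub address of the jump table

def base_value(site_addr, relative):
    """Table-base value a LINKTAB at site_addr needs in order to reach TABLE_ADDR."""
    return TABLE_ADDR - site_addr if relative else TABLE_ADDR

def patches_on_reload(site_offsets, old_load, new_load, relative):
    """List (site offset, new base value) pairs that change when a module moves."""
    return [
        (off, base_value(new_load + off, relative))
        for off in site_offsets
        if base_value(new_load + off, relative) != base_value(old_load + off, relative)
    ]

sites = [0x10, 0x24, 0x58]   # hypothetical call-site offsets inside a DLL

# Moving the DLL from $2000 to $6000:
print(len(patches_on_reload(sites, 0x2000, 0x6000, relative=True)))   # 3
print(len(patches_on_reload(sites, 0x2000, 0x6000, relative=False)))  # 0
```

With the relative form every call site has to be revisited after loading; with the absolute form nothing in the loaded code changes.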

BTW my username is rogloh not ragloh. My eyesight is going too.

UPDATE: Please read below for alternative scheme that doesn't require LINKTAB.

HUB EXEC Update Here
