Remapping is per task; see the Prop2 docs extract below.
REGISTER REMAPPING
------------------
The SETMAP instruction is used to remap a 2^n-sized block of registers starting at $000, so
that direct accesses to those registers will be redirected to a range of identically-sized
blocks, which also build from $000. This feature allows a single program to run multiple
instances of itself by having unique sets of statically-addressable registers which switch
according to either INDB or the current task.
When using remapping, you must locate your program code above the last used block of
registers which the upper-most block of registers will be remapped to. For example, if you
select 8 blocks of 16 registers, but are only using 6 of those blocks, your program code
must not start below register 96 (6*16), to avoid encroaching into the registers which are
going to be the recipients of remapping.
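The placement rule above is simple arithmetic, and can be checked with a minimal Python sketch. The function name `min_code_start` is made up here for illustration; it is not part of any Propeller tool.

```python
def min_code_start(blocks_used: int, regs_per_block: int) -> int:
    """Lowest register address at which program code may begin.

    Registers 0 .. blocks_used*regs_per_block - 1 receive remapped
    accesses, so code has to start at or above that product.
    """
    if blocks_used < 1 or regs_per_block < 1:
        raise ValueError("both counts must be at least 1")
    return blocks_used * regs_per_block


# The case from the text: 8 blocks of 16 registers selected, 6 blocks in use.
print(min_code_start(6, 16))  # 96
```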
Here is the SETMAP instruction:
SETMAP D/#      - Configure register remapping to %M_BBB_RRR
    %M = mode
        %0 = INDB selects the block
        %1 = task number selects the block
    %BBB = block count
        %000 =   1 block        remapping disabled for %000
        %001 =   2 blocks       remapping enabled for %001..%111
        %010 =   4 blocks
        %011 =   8 blocks
        %100 =  16 blocks
        %101 =  32 blocks
        %110 =  64 blocks
        %111 = 128 blocks
    %RRR = register count
        %000 =   1 register     remap $000
        %001 =   2 registers    remap $000..$001
        %010 =   4 registers    remap $000..$003
        %011 =   8 registers    remap $000..$007
        %100 =  16 registers    remap $000..$00F
        %101 =  32 registers    remap $000..$01F
        %110 =  64 registers    remap $000..$03F
        %111 = 128 registers    remap $000..$07F
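To make the %M_BBB_RRR packing concrete, here is a small Python sketch that builds the operand from a mode, a block count and a register count. The helper names `setmap_operand` and `format_operand` are hypothetical and only mirror the field layout listed above.

```python
def setmap_operand(by_task: bool, blocks: int, regs: int) -> int:
    """Pack mode, block count and register count into %M_BBB_RRR."""
    for name, value in (("blocks", blocks), ("regs", regs)):
        # Both counts must be powers of two in the range 1..128.
        if value < 1 or value > 128 or value & (value - 1):
            raise ValueError(f"{name} must be a power of two from 1 to 128")
    bbb = blocks.bit_length() - 1   # log2(blocks)
    rrr = regs.bit_length() - 1     # log2(regs)
    return (int(by_task) << 6) | (bbb << 3) | rrr


def format_operand(value: int) -> str:
    """Render the operand in the %M_BBB_RRR notation used by the docs."""
    return f"%{(value >> 6) & 1}_{(value >> 3) & 7:03b}_{value & 7:03b}"


print(format_operand(setmap_operand(False, 4, 4)))  # %0_010_010
print(format_operand(setmap_operand(True, 4, 2)))   # %1_010_001
```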
The new mapping scheme will be in effect on the third instruction after SETMAP. After that,
changes to INDB or the task number will have an immediate effect on block selection. The
remapping mechanism only works with hard-coded D and S addresses which range from $000 to
the remapped-register-count minus 1 (see %RRR above), not via INDA and INDB accesses.
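The address substitution described above can be modelled in a few lines of Python. This is only a sketch of my reading of the docs: the function name `remap` is invented, the caller supplies the selector (INDB or the task number), and rejecting combinations wider than 9 address bits is an assumption based on what the table of schemes lists.

```python
def remap(addr: int, selector: int, operand: int) -> int:
    """Return the register actually accessed for a hard-coded D/S address.

    addr     -- 9-bit register address written in the instruction
    selector -- INDB or the task number, whichever the mode bit selects
    operand  -- the %M_BBB_RRR value given to SETMAP
    """
    rrr = operand & 0b111
    bbb = (operand >> 3) & 0b111
    if not 0 <= addr <= 0x1FF:
        raise ValueError("register addresses are 9 bits")
    if bbb + rrr > 9:
        # Assumption: such combinations are simply not listed as useful.
        raise ValueError("scheme would exceed 9 address bits")
    if bbb == 0 or addr >= (1 << rrr):
        return addr                       # disabled, or outside the remapped range
    block = selector & ((1 << bbb) - 1)   # low BBB-count bits pick the block
    return (block << rrr) | addr


# 4 blocks of 4 registers: register $002 seen by selector 3 lands at $00E.
print(hex(remap(0x002, 3, 0b0_010_010)))  # 0xe
```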
Below is an elaboration of all uniquely-useful remapping schemes:
                              S/D addresses
%M_BBB_RRR   blocks   regs    initial    -> remapped     block selector
-----------------------------------------------------------------------------
%x_000_xxx   1        x       <same>
%0_001_000   2        1       %000000000 -> %00000000P   P = INDB[0]
%0_001_001   2        2       %00000000X -> %0000000PX
%0_001_010   2        4       %0000000XX -> %000000PXX   (2 threads)
%0_001_011   2        8       %000000XXX -> %00000PXXX
%0_001_100   2        16      %00000XXXX -> %0000PXXXX
%0_001_101   2        32      %0000XXXXX -> %000PXXXXX
%0_001_110   2        64      %000XXXXXX -> %00PXXXXXX
%0_001_111   2        128     %00XXXXXXX -> %0PXXXXXXX
%0_010_000   4        1       %000000000 -> %0000000PP   PP = INDB[1..0]
%0_010_001   4        2       %00000000X -> %000000PPX
%0_010_010   4        4       %0000000XX -> %00000PPXX   (4 threads)
%0_010_011   4        8       %000000XXX -> %0000PPXXX
%0_010_100   4        16      %00000XXXX -> %000PPXXXX
%0_010_101   4        32      %0000XXXXX -> %00PPXXXXX
%0_010_110   4        64      %000XXXXXX -> %0PPXXXXXX
%0_010_111   4        128     %00XXXXXXX -> %PPXXXXXXX
%0_011_000   8        1       %000000000 -> %000000PPP   PPP = INDB[2..0]
%0_011_001   8        2       %00000000X -> %00000PPPX
%0_011_010   8        4       %0000000XX -> %0000PPPXX   (8 threads)
%0_011_011   8        8       %000000XXX -> %000PPPXXX
%0_011_100   8        16      %00000XXXX -> %00PPPXXXX
%0_011_101   8        32      %0000XXXXX -> %0PPPXXXXX
%0_011_110   8        64      %000XXXXXX -> %PPPXXXXXX
%0_100_000   16       1       %000000000 -> %00000PPPP   PPPP = INDB[3..0]
%0_100_001   16       2       %00000000X -> %0000PPPPX
%0_100_010   16       4       %0000000XX -> %000PPPPXX   (16 threads)
%0_100_011   16       8       %000000XXX -> %00PPPPXXX
%0_100_100   16       16      %00000XXXX -> %0PPPPXXXX
%0_100_101   16       32      %0000XXXXX -> %PPPPXXXXX
%0_101_000   32       1       %000000000 -> %0000PPPPP   PPPPP = INDB[4..0]
%0_101_001   32       2       %00000000X -> %000PPPPPX
%0_101_010   32       4       %0000000XX -> %00PPPPPXX   (32 threads)
%0_101_011   32       8       %000000XXX -> %0PPPPPXXX
%0_101_100   32       16      %00000XXXX -> %PPPPPXXXX
%0_110_000   64       1       %000000000 -> %000PPPPPP   PPPPPP = INDB[5..0]
%0_110_001   64       2       %00000000X -> %00PPPPPPX
%0_110_010   64       4       %0000000XX -> %0PPPPPPXX   (64 threads)
%0_110_011   64       8       %000000XXX -> %PPPPPPXXX
%0_111_000   128      1       %000000000 -> %00PPPPPPP   PPPPPPP = INDB[6..0]
%0_111_001   128      2       %00000000X -> %0PPPPPPPX
%0_111_010   128      4       %0000000XX -> %PPPPPPPXX   (128 threads)
%1_001_000   2        1       %000000000 -> %00000000T   T = bit 0 of the task number
%1_001_001   2        2       %00000000X -> %0000000TX
%1_001_010   2        4       %0000000XX -> %000000TXX   (2 tasks)
%1_001_011   2        8       %000000XXX -> %00000TXXX
%1_001_100   2        16      %00000XXXX -> %0000TXXXX
%1_001_101   2        32      %0000XXXXX -> %000TXXXXX
%1_001_110   2        64      %000XXXXXX -> %00TXXXXXX
%1_001_111   2        128     %00XXXXXXX -> %0TXXXXXXX
%1_010_000   4        1       %000000000 -> %0000000TT   TT = task number
%1_010_001   4        2       %00000000X -> %000000TTX
%1_010_010   4        4       %0000000XX -> %00000TTXX   (4 tasks)
%1_010_011   4        8       %000000XXX -> %0000TTXXX
%1_010_100   4        16      %00000XXXX -> %000TTXXXX
%1_010_101   4        32      %0000XXXXX -> %00TTXXXXX
%1_010_110   4        64      %000XXXXXX -> %0TTXXXXXX
%1_010_111   4        128     %00XXXXXXX -> %TTXXXXXXX
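The rows listed above appear to follow two rules: block bits plus register bits must fit in the 9-bit register address, and task mode stops at 4 blocks (presumably because there are 4 tasks). That is an inference from the table, not something the docs state. A short Python sketch that enumerates the schemes under those assumed rules reproduces the same number of enabled rows:

```python
def useful_schemes():
    """Enumerate (mode, bbb, rrr) triples under the two assumed rules."""
    schemes = []
    # Mode 0 (INDB) allows up to 128 blocks; mode 1 (task) up to 4 blocks.
    for mode, max_bbb in ((0, 7), (1, 2)):
        for bbb in range(1, max_bbb + 1):
            for rrr in range(8):
                if bbb + rrr <= 9:          # must fit in a 9-bit address
                    schemes.append((mode, bbb, rrr))
    return schemes


rows = useful_schemes()
print(len(rows))  # 57, matching the enabled rows in the table
print(sum(1 for mode, _, _ in rows if mode == 0))  # 41 INDB-selected schemes
```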
Here is an example program which uses remapping with multi-threading:
DAT             org
period          long    2-1             '$000, thread 0 (20 longs initially execute as NOPs)
time            long    0               '$001, thread 0
pin_x           long    0               '$002, thread 0
pin_y           long    1               '$003, thread 0
                long    4-1             '$000, thread 1
                long    0               '$001, thread 1
                long    2               '$002, thread 1
                long    3               '$003, thread 1
                long    8-1             '$000, thread 2
                long    0               '$001, thread 2
                long    4               '$002, thread 2
                long    5               '$003, thread 2
                long    16-1            '$000, thread 3
                long    0               '$001, thread 3
                long    6               '$002, thread 3
                long    7               '$003, thread 3
pc              long    loop[4]         '$010..$013, all threads start at loop
                setmap  #%0_010_010     'remap 4 blocks of 4 regs by INDB[1..0]
                fixindb #pc+3,#pc       'set INDB to cycle through blocks and threads
                nop                     'allow SETMAP to take effect before 'switch'
loop            switch                  'switch to next thread
                incmod  time,period wc  'increment time and reset if period reached (C=1)
        if_c    notp    pin_x           'if period reached, toggle pin_x
                setpc   pin_y           'if period reached, pin_y high
                jmp     #loop           '(4 threads executing same code with unique variables)
Here is an example program which uses remapping with multi-tasking:
DAT             org
period          long    2-1             '$000, task 0 (16 longs initially execute like NOPs)
time            long    0               '$001, task 0
pin_x           long    0               '$002, task 0
pin_y           long    1               '$003, task 0
                long    4-1             '$000, task 1
                long    0               '$001, task 1
                long    2               '$002, task 1
                long    3               '$003, task 1
                long    8-1             '$000, task 2
                long    0               '$001, task 2
                long    4               '$002, task 2
                long    5               '$003, task 2
                long    16-1            '$000, task 3
                long    0               '$001, task 3
                long    6               '$002, task 3
                long    7               '$003, task 3
                setmap  #%1_010_010     'remap 4 blocks of 4 regs by task
                settask #%%3210         'set all 4 tasks in motion
                jmptask #%1111,#loop    'herd tasks to loop
loop            incmod  time,period wc  'increment time and reset if period reached (C=1)
        if_c    notp    pin_x           'if period reached, toggle pin_x
                setpc   pin_y           'if period reached, pin_y high
                jmp     #loop           '(4 tasks executing same code with unique registers)
LCALL for link-call
LRET for link return (is in fact: jmp 0)
LJMP for list-jump
It may be a bit confusing that LCALL and LJMP look similar but do very different things, but the similar look shows that LJMP also writes to the link-reg at $0.
Remapping is systemic to the cog and affects all tasks. You can select between INDB-based register block selection and task-based register block selection.
Here is a little program that used task-based remapping and the LINK instruction (which has many other proposed names):
dat             org
lnkreg          long    0               'task0 lnkreg (these longs execute as nop's at start)
pinreg          long    0               'task0 pinreg
                long    0               'task1 lnkreg
                long    1               'task1 pinreg
                long    0               'task2 lnkreg
                long    2               'task2 pinreg
                long    0               'task3 lnkreg
                long    3               'task3 pinreg
                setmap  #%1_010_001     'remap 4 blocks of 2 registers by task
                jmptask #%1000,#t3      'set task 3,2,1 start address
                jmptask #%0100,#t2
                jmptask #%0010,#t1
                settask #%%3210         'enable 4 tasks in round-robin
t0              link    #flip           'task0 flips pin0
                notp    #4              '...then pin4
                jmp     #t0
t1              link    #flip           'task1 flips pin1
                notp    #5              '...then pin5
                jmp     #t1
t2              link    #flip           'task2 flips pin2
                notp    #6              '...then pin6
                jmp     #t2
t3              link    #flip           'task3 flips pin3
                notp    #7              '...then pin7
                jmp     #t3
flip            notp    pinreg          'flip task's pin
                jmp     lnkreg          'return to task
I got the LINK stuff all implemented and tested after an overnight bug that turned out to be related to data-forwarding. When changing the D address to $000 in the last stage of the pipeline, there are several other circuits that need to know about it. Things are working perfectly again, now that that's solved. Whew!
Tonight I'm going to add a quick feature that Baggers had requested a while back: a 32-bit write mask for WRWIDE. This will let you control which bytes get written on a WRWIDE. There are also three instructions which will set the mask for you, based on non-zero bytes/words/longs within the WIDE registers. Make way for WIDEBM, WIDEWM, and WIDELM. On startup and after every WRWIDE, the mask will be reset to all 1's.
LINKTAB D,@S 'jump to location in table (PC+S+D) and write {Z,C,%00_0000_0000_0000,PC} into register $000.
LINK D/#/@ 'jump to location and write {Z,C,%00_0000_0000_0000,PC} into register $000.
They seem to all work fine, but I'll test more tomorrow and get an update out.
Chip, does the above notation for LINKTAB mean that if a jump table is stored in hub memory at some given (fixed) address, any users referencing it with LINKTAB instructions would always need to compute their D or S relative to this table, rather than being allowed absolute base + offset addressing? I.e. if we always want to keep the table starting at $1000 in hub memory, and the LINKTAB instruction is called from some arbitrary address $204C, then we need S to be set to -$204C, and D to be the index (or vice versa), to be able to compensate?
The reason I ask is that I see a very good use for LINKTAB: a call vector table holding function entry addresses of dynamically relocatable function blocks which get loaded into hub RAM from external storage (e.g. function blocks read in from SDRAM). If all function calls then used this LINKTAB calling mechanism, it wouldn't ever matter where the functions themselves were loaded into hub memory, as long as each was in a contiguous memory block and used relative jumps internally within the function itself. However, a big problem appears if this new LINKTAB instruction always needs D/S address offsets computed relative to the current instruction address (which itself is dynamic for relocatable code). Then all these LINKTAB D or S values need to be patched when loading the code into relocatable addresses in hub memory, to become relative to where they were loaded, so they can still access the same table in hub memory at the fixed address. That's not as nice or fast as being able to load it in anywhere and start executing straight away, and you need to keep track of all places to patch while you load it, which is slow and ugly to do on the fly. Having absolute as well as relative versions of this instruction would be nicer. I can see uses for both.
Roger.
The @S is the relative address of the start of the table (or whatever is there) and D is the offset, or index. The jump address is PC+relative+D. AUGS can be used to create bigger offsets than 9 bits. This LINKTAB instruction amounts to a special shortcut which could be coded discretely using other instructions. There are a lot of ways you could code up similar branches, but hopefully this is a useful form.
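As a rough model of the description above, the following Python sketch computes a LINKTAB jump target (PC + relative + D) and the value written to register $000, laid out as {Z, C, 14 zero bits, 16-bit PC}. The function names are invented, and which PC value the hardware actually uses (the instruction's own address or the following one) is not stated in the post, so treat `pc` as whatever the hardware supplies.

```python
def link_value(z: bool, c: bool, pc: int) -> int:
    """Build the 32-bit link word {Z, C, %00_0000_0000_0000, PC}."""
    return (int(z) << 31) | (int(c) << 30) | (pc & 0xFFFF)


def linktab_target(pc: int, rel: int, d: int) -> int:
    """Jump address for the relative form: PC + relative + D (16-bit wrap assumed)."""
    return (pc + rel + d) & 0xFFFF


# A table $104C below the calling PC of $204C, index 4:
print(hex(linktab_target(0x204C, -0x104C, 4)))   # 0x1004
print(hex(link_value(True, False, 0x204C)))      # 0x8000204c
```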
When I or anyone else writes a routine, we don't know how it will be used.
I may use it both as a function and as a standard subroutine, so it is always a problem if I have to keep in mind that it has special return requirements.
And as I have said, and will always say:
> skip LINKTAB and remake it as CALLTAB, which uses the standard RETURN system.
That will give much more flexibility in writing portable code.
Sounds quite useful, although I don't quite understand the WIDEBM/WM/LM (byte/word/long mask) concept with respect to the "based on non-zero bytes/words/longs within the WIDE registers" part. But I am happy to wait for the docs.
Ok thanks Chip. Well, it is a useful form in any application where these jump tables are always going to be stored relative to the block calling them, but unfortunately it is not quite as useful when the jump table is designed to be common to all functions and stored globally at a fixed known address in hub memory, and we need to access it from relocated code. In that case we will have to patch all calls on the fly to compensate for their calling addresses, which will slow things down and require storage of all call addresses.
When I saw LINKTAB I was really hoping to come up with a scheme using this calling mechanism for allowing a type of "XMM" execution mode where code stored in a large SDRAM could be read into hub memory on the fly and executed using hub exec (think of doing "malloc" / "free" for actual functions to be copied in from SDRAM dynamically). I saw this as being a great way to be able to very quickly cache moderately sized functions in hub memory to be executed at speed. The pointers in the common function table would either point to the hub address where they are loaded, or to the address of the code that manages hub memory blocks, loads the requested function in and executes it immediately afterwards. The caller doesn't need to know if the code is yet in hub memory, by the time it returns it will have been executed regardless. I had thought this could have worked out well for that but it probably won't be practical with the relative addresses required now, pity, bad luck there I guess.
Also I just reread my quoted post, bit of a grammar fail there - sorry about that, must have rushed that one out too fast.
SETMASK D/#     sets the 32-bit mask to any value
WIDEBM          sets a write bit for each corresponding byte that is non-0
WIDEWM          sets two bits for each corresponding word that is non-0
WIDELM          sets four bits for each corresponding long that is non-0
Those last three instructions cause bytes/words/longs in the eight WIDE registers NOT to be written if their values are zero. By using the SETWIDZ instruction to clear the WIDEs, then writing, say, some long non-0 pixel values into the WIDEs, and then doing a WIDELM before a WRWIDE, only the new pixels will be written to hub memory, leaving the original hub background data intact where there were no new pixels.
WIDEBM/WIDEWM/WIDELM are kind of fancy and maybe not so critical, as you could just do a RDWIDE to load background data first. The SETMASK instruction is really important, though, as it lets you mask byte-writes when you do 32 at a time with WRWIDE.
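To illustrate the mask idea, here is a Python sketch that derives a write mask from 32 bytes of WIDE data and applies it to a hub buffer. It assumes mask bit i corresponds to byte i of the WIDE registers, which the post does not spell out; `wide_mask` and `wrwide` are illustrative names, with `unit` 1, 2 and 4 standing in for WIDEBM, WIDEWM and WIDELM.

```python
def wide_mask(wide: bytes, unit: int) -> int:
    """Set `unit` mask bits for every non-zero byte (1), word (2) or long (4)."""
    if len(wide) != 32 or unit not in (1, 2, 4):
        raise ValueError("need 32 bytes and a unit of 1, 2 or 4")
    mask = 0
    for start in range(0, 32, unit):
        if any(wide[start:start + unit]):
            mask |= ((1 << unit) - 1) << start
    return mask


def wrwide(hub: bytearray, addr: int, wide: bytes, mask: int) -> None:
    """Write only the bytes whose mask bit is set; others keep the hub data."""
    for i in range(32):
        if (mask >> i) & 1:
            hub[addr + i] = wide[i]


wide = bytearray(32)      # cleared, as SETWIDZ would do
wide[7] = 0xFF            # one non-zero byte inside long 1
print(hex(wide_mask(bytes(wide), 1)))  # 0x80
print(hex(wide_mask(bytes(wide), 2)))  # 0xc0
print(hex(wide_mask(bytes(wide), 4)))  # 0xf0
```

With the long mask, a masked write replaces only bytes 4..7 of the hub background and leaves the other 28 bytes as they were.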
Could you describe the instruction you would like to see in the chip? Maybe we could make it work.
When I saw LINKTAB I was really hoping to come up with a scheme using this calling mechanism for allowing a type of "XMM" execution mode where code stored in a large SDRAM could be read into hub memory on the fly and executed using hub exec (think of doing "malloc" / "free" for actual functions to be copied in from SDRAM dynamically). I saw this as being a great way to be able to very quickly cache moderately sized functions in hub memory to be executed at speed. The pointers in the common function table would either point to the hub address where they are loaded, or to the address of the code that manages hub memory blocks, loads the requested function in and executes it immediately afterwards. The caller doesn't need to know if the code is yet in hub memory, by the time it returns it will have been executed regardless.
Yes, this is what I was thinking about when I mentioned the idea of using an overlay loader to allow programs larger than hub memory. I think the scheme you describe is quite similar to traditional overlay handlers.
Ok sure, will try. Where you already have this capability:
LINKTAB D,@S 'jump to location in table (PC+S+D) and write {Z,C,%00_0000_0000_0000,PC} into register $000.
I would instead prefer to be able to index into the lookup table at an absolute address using D+S but just do the same thing as you have done with writing data to "LR" register 0, so it now becomes this:
LINKTAB D,S 'jump to location in table (S+D) and write {Z,C,%00_0000_0000_0000,PC} into register $000.
You could use S as the base and make D the index.
We still need to set up the D,S values before the jump is done, and then handle the LR appropriately within the leaf/non-leaf function being jumped to, for the overall call and return sequence to work. As the lookup table base address in hub RAM would remain fixed, that value could stay in a COG register permanently, but we would need to set up D with an index value to create the lookup table index each time around.
All function calls made to the relocatable hub code loaded from say SDRAM then effectively become this...
where FN_INDEX is the index into the lookup table for the function being called, index is the temporary COG register that is added to the fixed table_base register to access the table.
The cost is now 3 PASM instructions per function call instead of one call instruction, but it gives us the indirection we require for calling dynamically relocatable code. The AUGD operation is required to obtain table offsets > 511, i.e. when we have more than 128 function entry addresses listed in the table, since each table entry is a long of 4 bytes. But if the scaling shift of 2 could also be done in hardware before the addition, making the created lookup address (S + D<<2), that could extend the range to 512 functions before the extra AUGD would be required, which is starting to become a reasonably large number of total functions in an application. In those cases we would only require 2 PASM instructions of hub memory overhead per call made.
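The 128-versus-512 arithmetic above can be sketched in Python. This reflects one reading of the post: a 9-bit immediate tops out at 511, table entries are 4-byte longs, and a call costs 3 instructions when AUGD is needed and 2 otherwise. All names here are hypothetical.

```python
IMMEDIATE_MAX = 511  # largest value a 9-bit immediate field can hold


def table_offset(fn_index: int, hardware_scaling: bool = False) -> int:
    """Offset placed in the instruction: index*4 unless hardware does the <<2."""
    return fn_index if hardware_scaling else fn_index * 4


def needs_augd(fn_index: int, hardware_scaling: bool = False) -> bool:
    """True when the offset no longer fits in 9 bits."""
    return table_offset(fn_index, hardware_scaling) > IMMEDIATE_MAX


def instructions_per_call(fn_index: int, hardware_scaling: bool = False) -> int:
    """Assumed cost: index setup + LINKTAB, plus AUGD when required."""
    return 3 if needs_augd(fn_index, hardware_scaling) else 2


print(needs_augd(127), needs_augd(128))              # False True
print(needs_augd(511, True), needs_augd(512, True))  # False True
```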
Hope that helps describe what I am interested in here.
I'll also take a stab at explaining it - I asked for something very similar in posts #120 & #121 in this thread (based on what Sapieha was asking for), main difference is I'd like a word table of pointers.
(Extracted from #120)
CALLVECT D,#n
CALLVECT D,S
CALLLIST looked weird with 3 L's, so I changed it to VECT for the example
D holds the base address of a WORD table in the hub
n or S are the index
This way it could dispatch to 512 system/VM routines. (per DLL)
(Extracted from #121)
If the addresses in the word table are relative to their position, this would get us DLL's.
The calling task/cog would place the start of the DLL in the register "dllbase"
Then it could call any routine in the DLL with
CALLVECT dllbase,#routineindex ' (0..511)<<2
A dll would be: (in the hub)
WORD @function0 ' may be better used to hold number of functions, or offset to dllstaticdata
WORD @function1
....
WORD @functionN-1
' local DLL data area
dllstaticdata:
function0:
function1:
...
As relative addresses would be used, no need for a relocating loader
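A small Python model of the position-independent word table may help here. It assumes each 16-bit entry is little-endian, unsigned, and holds the distance from the entry's own hub address to its function; the post only says the addresses are "relative to their position", so these details are guesses. `place_dll` and `resolve` are made-up names.

```python
# Two table entries: function0 is 4 bytes past entry 0,
# function1 is $0E bytes past entry 1 (entry 1 sits at base+2).
DLL_IMAGE = bytes([0x04, 0x00, 0x0E, 0x00])


def place_dll(hub: bytearray, base: int) -> None:
    """Copy the unmodified DLL image to any hub address."""
    hub[base:base + len(DLL_IMAGE)] = DLL_IMAGE


def resolve(hub: bytearray, dllbase: int, index: int) -> int:
    """Turn a routine index into a hub address, as CALLVECT might."""
    entry = dllbase + 2 * index
    offset = int.from_bytes(hub[entry:entry + 2], "little")
    return entry + offset


hub = bytearray(0x400)
place_dll(hub, 0x100)
place_dll(hub, 0x300)   # same bytes, different location, no patching
print(hex(resolve(hub, 0x100, 1)), hex(resolve(hub, 0x300, 1)))  # 0x110 0x310
```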
This would allow compiled code to share a DLL version of libc, libm etc (as long as the libraries were thread-safe)
As long as crt0.s contained (in a cog register) a small table of DLL base pointers, C code could call DLL functions with a simple
CALLVECT DLLx,#fn_index
A single instruction!
Of course, a DLL could be used for overlays, efficient handling of virtual functions etc
LINKTAB uses D as the PC-relative base address, and @S as the offset into the table.
ragloh prefers this be an absolute address. So D would hold the absolute base address, and @S would be the offset.
In both of the above cases, @S would be added to the value stored in the register D, but PC would also be added in the original and not in ragloh's case.
IMHO, the ragloh case (absolute table address in reg D) makes the most sense. D only needs to be set once. If it is relocatable, then this can easily be calculated once and added to D.
In either case, the LINKTAB would save the return address in cog $000 and JMP to the table + offset (@S). The LINKTAB could be preceded by AUGS if a larger offset than 511 is required.
Bill's case is quite different because he is asking for a table of words representing addresses. This requires fetching D and adding it to #S<<2(?); the word at the resulting address would then be fetched (a 16-bit absolute address, <<2), and this would be the address to be jumped to. As I understand it, this is not possible in the current 4-stage pipeline.
ragloh/Roger,
I see you think the table address should be in S and offset in D. Currently D must be a cog register, so this is best to hold the Table absolute base address (it's easy to set this once, and add a relocation value if required). S can be an immediate 9 bits permitting up to 512 table entries, and can be expanded by using AUGS.
So the only difference is deleting the PC addition.
We can also use this method to link directly into relocatable code routines by setting D to the relocated base of the routines, and #S being the start offset from the base, without actually requiring a jump table.
Bill, the table has to be jmps and therefore longs. You cannot do an extra (3rd) fetch in the existing 4-stage pipeline. So you cannot fetch a word, because it needs to be the next instruction to be executed.
I understand that. Basically, I was proposing a new hub instruction for P3, for high level language / DLL support, in a way that minimized memory used at the expense of an extra hub read cycle.
Yes as Cluso says you can easily make the absolute form of LINKTAB relative when you are computing the base address, but it is difficult to go the other way in relocated code, as you need to patch code on the fly at each LINKTAB position in the code.
Also think of it this way:
A relative LINKTAB can be used to allow static code to call relocated DLL code, where the table of jumps is relative to the calling code.
But an absolute LINKTAB can now go much further and allows relocated DLLs to call other relocated code at will. This is what we want for enabling hub exec of large SDRAM applications. In this case there is a single instance of the jump table at a static (fixed) hub address known to all callers, and therefore each instance of LINKTAB used as a function call does not need to be patched to compensate for its call position when the DLL is loaded. That patching could be a lot of work: it would definitely limit the speed, and it requires the storage of all patch positions to be loaded to hub RAM first, which is probably too much effort to make it workable at speed.
To make it fully clear, for this LINKTAB instruction itself I would just like the COG PC to be set to D+S, (or even better D+ S<<2), so the actual table itself holds absolute JMP instructions, not 16/32 bit pointers which are read and copied into the PC. The "LR" at register 0 also needs to be updated so we can "return" later to the original caller.
BTW my username is rogloh not ragloh. My eyesight is going too.
UPDATE: Please read below for alternative scheme that doesn't require LINKTAB.
Comments
I guess I need to ask again in a different way:
Is this register remapping per task, or per cog?
Remapping is per task. See the Prop2 DOCS extract below.
Andy
Remapping is systemic to the cog and affects all tasks. You can select between INDB-based register block selection and task-based register block selection.
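As a rough illustration of the two block-selection modes, here is a small Python model of the %M_BBB_RRR encoding from the docs extract. The function names are made up for this sketch, and two points are assumptions rather than documented behaviour: that block k starts at register k * register_count, and that the selector (INDB or task number) wraps modulo the block count.

```python
def decode_setmap(config):
    """Split a 7-bit %M_BBB_RRR value into (mode, block_count, register_count)."""
    mode = (config >> 6) & 1                     # 0 = INDB selects block, 1 = task number selects block
    block_count = 1 << ((config >> 3) & 0b111)   # %BBB: 1..128 blocks
    register_count = 1 << (config & 0b111)       # %RRR: 1..128 registers per block
    return mode, block_count, register_count


def remap(addr, config, selector):
    """Return the register a hard-coded D/S address would be redirected to.

    selector stands for INDB or the task number, depending on the mode bit.
    Assumption: block k starts at k * register_count, and the selector wraps.
    """
    _, block_count, register_count = decode_setmap(config)
    if block_count == 1 or addr >= register_count:
        return addr                              # remapping disabled, or address outside the window
    block = selector % block_count
    return block * register_count + addr


cfg = 0b1_011_100                                # task mode, 8 blocks of 16 registers
print(remap(5, cfg, 2))                          # register 5 as seen by task 2
print(remap(15, cfg, 5))                         # last register of the sixth block
```

With 8 blocks of 16 registers and only 6 blocks in use, the highest redirected register in this model is 95, which agrees with the rule in the docs that code must not start below register 96.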
Here is a little program that used task-based remapping and the LINK instruction (which has many other proposed names):
Tonight I'm going to add a quick feature that Baggers had requested a while back: a 32-bit write mask for WRWIDE. This will let you control which bytes get written on a WRWIDE. There are also three instructions which will set the mask for you, based on non-zero bytes/words/longs within the WIDE registers. Make way for WIDEBM, WIDEWM, and WIDELM. On startup and after every WRWIDE, the mask will be reset to all 1's.
Chip, does this above notation for LINKTAB mean if there is a jump table stored in hub memory at some given (fixed) address that any users referencing it with LINKTAB instructions would always need to compute their D or S relative to this table, rather than allowing absolute base + offset addressing? ie. if we always want to keep the table starting at $1000 in hub memory, and the LINKTAB instruction is called from some arbitrary address $204C, then we need S to be set to -$204C, and D to be the index (or vice versa), to be able to compensate?
The reason I ask is that I see a very good use for LINKTAB in offering us a call vector table holding function entry addresses of dynamically relocatable function blocks which get loaded into hub RAM from external storage (eg. function blocks read in from SDRAM). If all function calls then used this LINKTAB calling mechanism, it wouldn't ever matter where the functions themselves were loaded into hub memory, as long as each was in a contiguous memory block and used relative jumps internally within the function itself. However, a big problem appears if this new LINKTAB instruction always needs D/S address offsets computed relative to the current instruction address (which itself is dynamic for relocatable code): then all these LINKTAB D or S values need to be patched when loading the code into relocatable addresses in hub memory, to become relative to where they were loaded, so they can still access the same table in hub memory at the fixed address. That's not as nice or fast as being able to load it in anywhere and start executing straight away, and you need to keep track of all places to patch while you load it, which is slow and ugly to patch on the fly. Having absolute as well as relative versions of this instruction would be nicer. I can see uses for both.
Roger.
Oops! I forgot about the INDB mode.
That would be very useful.
It is for those possibilities that I asked for
CALLFUN Table, Index
The @S is the relative address of the start of the table (or whatever is there) and D is the offset, or index. The jump address is PC+relative+D. AUGS can be used to create bigger offsets than 9 bits. This LINKTAB instruction amounts to a special shortcut which could be coded discretely using other instructions. There are a lot of ways you could code up similar branches, but hopefully this is a useful form.
Only problem ---
That functions end with RET, so that the table can even mix in other subroutines
How is that a problem?
At the time I or anyone else writes a routine, we don't know how it will be used.
Maybe I will use it both as a function and as a standard subroutine, so it is always a problem if I have to keep in mind that it has special return requirements
And as I said and will always say
> skip LINKTAB and remake it as CALLTAB, which uses the standard RETURN system.
That will give much more flexibility in writing portable code.
Ok thanks Chip. Well it is a useful form in any applications where these jump tables are always going to be stored relative to the block calling them, but unfortunately they are not quite as useful when the jump table is designed to be common to all functions and stored globally at a fixed known address in hub memory and we need to access them from relocated code. In that case we will have to patch all calls on the fly to compensate for their calling addresses, that will slow things down and require storage of all call addresses.
When I saw LINKTAB I was really hoping to come up with a scheme using this calling mechanism for allowing a type of "XMM" execution mode where code stored in a large SDRAM could be read into hub memory on the fly and executed using hub exec (think of doing "malloc" / "free" for actual functions to be copied in from SDRAM dynamically). I saw this as being a great way to be able to very quickly cache moderately sized functions in hub memory to be executed at speed. The pointers in the common function table would either point to the hub address where they are loaded, or to the address of the code that manages hub memory blocks, loads the requested function in and executes it immediately afterwards. The caller doesn't need to know if the code is yet in hub memory, by the time it returns it will have been executed regardless. I had thought this could have worked out well for that but it probably won't be practical with the relative addresses required now, pity, bad luck there I guess.
Also I just reread my quoted post, bit of a grammar fail there - sorry about that, must have rushed that one out too fast.
SETMASK D/# sets the 32-bit mask to any value
WIDEBM sets a write bit for each corresponding byte that is non-0
WIDEWM sets two bits for each corresponding word that is non-0
WIDELM sets four bits for each corresponding long that is non-0
Those last three instructions cause bytes/words/longs in the eight WIDE registers NOT to be written if their values are zero. By using the SETWIDZ instruction to clear the WIDEs, and then writing, say, some long non-0 pixel values into the WIDEs, and then doing a WIDELM before a WRWIDE, only the new pixels will be written to hub memory, leaving the original hub background data intact where there were no new pixels.
WIDEBM/WIDEWM/WIDELM are kind of fancy and maybe not so critical, as you could just do a RDWIDE to load background data first. The SETMASK instruction is really important, though, as it lets you mask byte-writes when you do 32 at a time with WRWIDE.
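To make the mask behaviour concrete, here is a small Python model of the three mask instructions and a masked WRWIDE. It is only a sketch of the description above: the function names are invented, and it assumes that mask bit i controls byte i of the 32 bytes held in the eight WIDE registers.

```python
def wide_mask(wide, unit):
    """Build a 32-bit write mask from the 32 WIDE bytes.

    unit = 1, 2 or 4 models WIDEBM, WIDEWM or WIDELM: every byte of a
    unit gets its mask bit set when the unit as a whole is non-zero.
    Assumption: mask bit i corresponds to byte i.
    """
    assert len(wide) == 32 and unit in (1, 2, 4)
    mask = 0
    for start in range(0, 32, unit):
        if any(wide[start:start + unit]):
            mask |= ((1 << unit) - 1) << start
    return mask


def wrwide(hub, addr, wide, mask):
    """Write only those bytes of wide whose mask bit is set."""
    for i in range(32):
        if (mask >> i) & 1:
            hub[addr + i] = wide[i]


background = bytearray([0xAA] * 32)      # existing hub data
wide = bytearray(32)                     # as after SETWIDZ: all zero
wide[4:8] = bytes([0x00, 0x00, 0x12, 0x00])   # one non-zero long "pixel"
wrwide(background, 0, wide, wide_mask(wide, 4))
print(background.hex())
```

In this model the long-granular mask overwrites all four bytes of the pixel, including its zero bytes, while a byte-granular mask (unit = 1) would write only the single non-zero byte.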
Could you describe the instruction you would like to see in the chip? Maybe we could make it work.
Awesome progress Chip, you're on a roll lately
Ok sure, will try. Where you already have this capability:
LINKTAB D,@S 'jump to location in table (PC+S+D) and write {Z,C,%00_0000_0000_0000,PC} into register $000.
I would instead prefer to be able to index into the lookup table at an absolute address using D+S but just do the same thing as you have done with writing data to "LR" register 0, so it now becomes this:
LINKTAB D,S 'jump to location in table (S+D) and write {Z,C,%00_0000_0000_0000,PC} into register $000.
You could use S as the base and make D the index.
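The difference between the two forms can be modelled in a few lines of Python. This is only a sketch of the address arithmetic as described in this thread: the helper names are invented, and it assumes PC means the address of the LINKTAB instruction itself, which the thread does not spell out.

```python
TABLE_BASE = 0x1000      # fixed hub address of the shared jump table (example value from the thread)


def linktab_relative(pc, d, s):
    """Target for the existing form: PC + S + D."""
    return pc + s + d


def linktab_absolute(d, s):
    """Target for the requested form: S + D, independent of PC."""
    return s + d


def patched_offset(pc, table_base=TABLE_BASE):
    """Offset the relative form needs so that it still reaches the fixed table."""
    return table_base - pc


index = 8                                        # byte offset of the wanted table entry
for load_addr in (0x204C, 0x3000):               # same code loaded at two hub addresses
    rel = linktab_relative(load_addr, index, patched_offset(load_addr))
    abs_ = linktab_absolute(index, TABLE_BASE)
    print(hex(load_addr), hex(rel), hex(abs_))
```

Both forms reach the same entry, but the relative form only does so because patched_offset is recomputed for every load address, which is the per-call patching that the absolute form avoids.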
We still need to set up the D,S values before the jump is done and then handle the LR appropriately within the leaf/non-leaf function being jumped to for the overall call and return sequence to work. As the lookup table address base in hub RAM would remain fixed, that value could stay in a COG register permanently, but we would need to set up D with an index value to create the lookup table index each time around.
All function calls made to the relocatable hub code loaded from say SDRAM then effectively become this...
where FN_INDEX is the index into the lookup table for the function being called, index is the temporary COG register that is added to the fixed table_base register to access the table.
The cost is now 3 PASM instructions per each function call instead of one call instruction, but it gives us the indirection we require for calling dynamically relocatable code. The AUGD operation is required to obtain table offsets > 511, when we have more than 128 function entry addresses listed in the table, since each table entry is read from a long of 4 bytes. But if the scaling shift of 2 could also be done in hardware before the addition operation, making the created lookup address (S + D<<2) that could extend the range to 512 functions before the extra AUGD would be required which is starting to become a reasonably large number of total functions in an application. In those cases we would only require 2 PASM instructions of hub memory overhead per call made.
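The range arithmetic above can be checked with a short Python sketch. The names are invented for illustration, and the split into two versus three instructions assumes the sequence is "load index, LINKTAB" with an extra AUGD only when the index value no longer fits in the 9-bit immediate.

```python
IMMEDIATE_BITS = 9       # size of the immediate field
BYTES_PER_LONG = 4       # each table entry is one long


def reachable_entries(scaled):
    """Number of table entries addressable without AUGD.

    scaled=False: the immediate is a byte offset, so only every 4th value is an entry.
    scaled=True:  hardware shifts the index left by 2, so every value is an entry.
    """
    values = 1 << IMMEDIATE_BITS
    return values if scaled else values // BYTES_PER_LONG


def instructions_per_call(fn_index, scaled):
    """2 instructions when the index fits the immediate, 3 when AUGD is needed (assumed sequence)."""
    return 2 if fn_index < reachable_entries(scaled) else 3


print(reachable_entries(False), reachable_entries(True))
print(instructions_per_call(200, False), instructions_per_call(200, True))
```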
Hope that helps describe what I am interested in here.
Cheers,
Roger.
I'll also take a stab at explaining it - I asked for something very similar in posts #120 & #121 in this thread (based on what Sapieha was asking for), main difference is I'd like a word table of pointers.
(Extracted from #120)
CALLVECT D,#n
CALLVECT D,S
CALLLIST looked weird with 3 L's, so I changed it to VECT for the example
D holds the base address of a WORD table in the hub
n or S are the index
This way it could dispatch to 512 system/VM routines. (per DLL)
(Extracted from #121)
If the addresses in the word table are relative to their position, this would get us DLL's.
The calling task/cog would place the start of the DLL in the register "dllbase"
Then it could call any routine in the DLL with
CALLVECT dllbase,#routineindex ' (0..511)<<2
A dll would be: (in the hub)
WORD @function0 ' may be better used to hold number of functions, or offset to dllstaticdata
WORD @function1
....
WORD @functionN-1
' local DLL data area
dllstaticdata:
function0:
function1:
...
As relative addresses would be used, no need for a relocating loader
This would allow compiled code to share a DLL version of libc, libm etc (as long as the libraries were thread-safe)
As long as crt0.s contained (in a cog register) a small table of DLL base pointers, C code could call DLL functions with a simple
CALLVECT DLLx,#fn_index
A single instruction!
Of course, a DLL could be used for overlays, efficient handling of virtual functions etc
And I originally suggested it for P3
Table address in D could be absolute only.
Table entries should be relative to start of table, scaled by 4
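A word-table dispatch of this kind could be modelled in Python as below. This is a sketch under stated assumptions, not a description of any real instruction: the helper names are invented, entries are taken to be 16-bit little-endian values holding (byte offset from the table start) >> 2, and the dispatch simply returns the computed target address.

```python
import struct


def build_table(offsets):
    """Pack byte offsets (multiples of 4, measured from the table start) as 16-bit entries."""
    return struct.pack("<%dH" % len(offsets), *(off >> 2 for off in offsets))


def callvect(hub, dllbase, index):
    """Model of CALLVECT dllbase,#index: read one word, scale by 4, add the table base."""
    (entry,) = struct.unpack_from("<H", hub, dllbase + 2 * index)
    return dllbase + (entry << 2)


hub = bytearray(0x400)
image = build_table([0x10, 0x18])        # function0 at +$10, function1 at +$18

for base in (0x100, 0x200):              # load the same image at two hub addresses
    hub[base:base + len(image)] = image
    print(hex(base), hex(callvect(hub, base, 0)), hex(callvect(hub, base, 1)))
```

Because the entries are relative to the table start, the same image bytes work at either load address; only the base register changes, which is the "no relocating loader" property described above.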
I see you think the table address should be in S and offset in D. Currently D must be a cog register, so this is best to hold the Table absolute base address (it's easy to set this once, and add a relocation value if required). S can be an immediate 9 bits permitting up to 512 table entries, and can be expanded by using AUGS.
So the only difference is deleting the PC addition.
This should work whether the table is in hub or cog, so if D is < $200, no scaling of S. Chip is this possible to do?
In my opinion, and this is what I was thinking: that table still needs to be JUMPs to routines that end with RET