LUT-exec model concept
rogloh
Following on from a recent discussion in the HyperRAM driver thread here is an idea for a "LUT-exec" model which might allow us to execute code stored externally to the P2's own memory....
On the P2 we would like to find a way to execute code from external sources like HyperRAM/Flash at a reasonable speed. The P1 offered us XMM, however its performance was only mediocre, and it would be nice to try to improve on this with the P2 if possible. The faster P2, combined with something like HyperRAM's high bandwidth and large capacity external memory, should hopefully be able to offer us something better. Reading code into HUB from external memory and then using HUB exec could be one approach to take, but another way to consider is to execute the externally sourced code from within LUT RAM. It's not going to be as fast as native hub-exec of internal memory, and there are serious penalties for code that branches a lot, but it might still improve on what the P1's XMM had to offer, or on what is possible if every external address is read individually from external memory.
The LUT memory space already offers us several advantages. Firstly, instructions can be executed there at the highest speed. Secondly, it is rather fast to read instructions into LUT from HUB using SETQ2 bursts. Thirdly, LUT offers us a fast way to identify branch target addresses by using the COG's hardware stack and RDLUT operations together, as shown later. Finally, instructions that use RDLONG/WRLONG, such as stack operations and other general data accesses, do not interfere with execution; under hub-exec they have to compete with the FIFO reading instructions from HUB, which can slow down execution. Also, not every VM would need every single bit of COG+LUT RAM, so there is probably some free space to put to good use.
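For example, a burst of 16 longs lands in LUT with just two instructions; a minimal sketch, assuming hubaddr is a register holding the HUB source address:

        SETQ2   #16-1               ' transfer 16 longs into LUT
        RDLONG  0, hubaddr          ' fills LUT offsets 0..15 (which execute at $200..$20F)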
So how could such a LUT-exec model work...?
We would basically need the build tools emitting code for this execution mode to encode and translate a few special instructions when it comes to branches, calls, and returns, instead of generating them natively as for HUB or COG exec. Also, the DJNZ/TJNZ/SKIP/REP types of P2 instructions could not be used in the output code, and enabling inline assembly would require some care to avoid certain instructions that could break the model.
For any branches or calls generated, they would now need to be encoded into two longs instead of one. The first long is a P2 CALL instruction that escapes out to a special handler routine in COG (or LUT) RAM. The second long, following it, is a constant that contains the target branch address. LUT execution of the externally sourced code stops at this point once the CALL executes, and the special handler then takes over.
' a branch to an address xxxxxx is done like this:
        CALL    jump_handler
        long    $xxxxxx         ' the target address

' a call to an address yyyyyy is done like this:
        CALL    call_handler    ' instruction to branch to call handler in COGRAM or LUTRAM
        long    $yyyyyy         ' the target address
The two special branch/call handlers would need to pop the hardware stack and that will give the LUT address offset of the target address constant because it is also where the currently executed code would normally return to (the return address is just after the CALL instruction). This LUT RAM address can then be read quickly to find the target address using a single RDLUT. Additionally for the call handler case, it will also need to push the application's return address (which will just be the starting external memory address of the previous block read into LUT from HUB plus the LUT offset*4 + 4) onto the HUB stack. That will let the application code do its own return to this address later. Here is how these handlers could work:
' jump handler code present in COG RAM (or LUT) does this:
        POP     offset
        RDLUT   target, offset
        ' jump to the routine to read target address from ext-mem/cache to HUB/LUT here
        ' it ends with....
        JMP     #$200           ' to resume LUT-EXEC

' call handler code present in COG RAM (or LUT) does this:
        POP     offset
        RDLUT   target, offset
        ADD     offset, #1
        MUL     offset, #4
        ADD     address, offset ' address is the start of the code read last into LUT (previous branch target)
        PUSHA   address         ' if PTRA is SP
        ' jump to the routine to read target address from ext-mem/cache to HUB/LUT here
        ' it ends with....
        JMP     #$200           ' to resume LUT-EXEC
For any conditional branches/calls we would still be able to make use of the Z&C flags and do something like this for example, where r2 is being compared to r3 and the code will branch if r2 < r3:
        CMP     r2, r3 wc       ' condition being tested
if_c    CALL    jump_handler
        long    $target_address | (if_c << 28) ' i.e. combine the 4 bit conditional test with the target address in the long
The main thing to ensure here would be to encode the same 4 bit "if_xx" condition in the long constant that follows each type of CALL instruction above. That prevents the long constant from ever being executed in the case of the branch not being taken. This means the branch/call target address is restricted to 28 bits; handily this also matches up with the HyperRAM driver's full 256MB address range.
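It also means the handlers must strip those condition bits back out before using the address; a minimal sketch, assuming addrmask is a register preloaded with $0FFFFFFF:

        RDLUT   target, offset      ' read the combined condition+address long
        AND     target, addrmask    ' keep only the 28 bit external address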
For any cases where indirect calls are made (e.g. function pointers), these can be supported by using the CALLPA or CALLPB instructions to copy the value of the register containing the indirect address into a parameter register, either PA or PB. We would need to reserve one of these two PA or PB registers for the VM's exclusive use.
CALLPA r1, indirect_call_handler ' instruction to perform an indirect call to the address in register r1
In this case PA would get the address from register r1, and the handler can make direct use of it, and also set up the stack in HUB before branching:
' indirect_call_handler code does this:
        POP     offset
        MUL     offset, #4      ' address in LUT, conveniently also removes flags
        ADD     address, offset ' address is the start of the code read last into LUT (previous branch target)
        PUSHA   address         ' if PTRA is SP
        ' jump to the routine to read the jump address in PA from ext-mem/cache to HUB/LUT here
        ' it ends with....
        JMP     #$200           ' resume LUT-EXEC
For any cases where the application code returns (e.g., with a RETA), the tool will need to transform the return instruction into this:
JMP ret_handler
The return handler would need to read the real return address from the HUB stack and use that.
' ret_handler code does this:
        POPA    address         ' if PTRA is SP
        ' jump to the routine to read target address from ext-mem/cache to HUB/LUT here
        ' it ends with....
        JMP     #$200           ' resume LUT-EXEC
We still need to deal with the overrun cases where we reach the end of the executable block of code in LUTRAM and have not yet branched out, triggering the next code block reload. There are two types of reload cases here:
1) No branch or call occupies the last instruction slot in the LUT execution area of size "MAX" (where MAX is some power of 2), so execution simply runs off the end of the block.
2) A branch starts at the last instruction slot in LUT RAM, but its target address constant has been chopped off because it was never part of the block that was read in.
The first case can be handled by always keeping one instruction at the end of the LUT RAM executable area that jumps to a reload handler whose job it is to reload the next continuing portion of code into LUT RAM and resume execution. So when the code reaches the end of the block, it will trigger a reload of more code and can continue executing.
        JMP     reload_handler

' reload handler does this:
        ADD     address, #(MAX*4) ' address is the start of the code read last into LUT (previous branch target)
        ' go call the routine to read target address from ext-mem/cache to HUB/LUT here
        ' ends with....
        JMP     #$200           ' resume LUT-EXEC
There is one further complexity for the second reload case, when the last instruction stored in the LUT executable area is itself a call to one of our branch handlers and its target address was never read into the LUT RAM. One way to solve this might be to make a modification to the branch and call handlers mentioned above, where we also test whether they were invoked right at the end of the LUT RAM (after "MAX" instruction slots have been executed). If so, in this special case we then obtain the target address directly from the HUB, assuming one additional long is always read/stored there beforehand from the original HyperRAM read request. E.g.
' jump handler does this:
        POP     offset
        TEST    offset, #(MAX-1) wz ' test if reached MAX instructions, note: destroys Z
        RDLUT   target, offset
if_z    ADD     address, #(MAX*4)   ' compute end of last read block
if_z    RDLONG  address, address    ' read next branch target address
        ' go call the routine to read target address from ext-mem/cache to HUB/LUT here
        ' ends with....
        TESTB   offset, #30 wz      ' also restore Z before resuming
        JMP     #$200               ' resume LUT-EXEC
This technique does add a bit more overhead to each branch however. It would be nicer to find an even simpler way to deal with this if possible, maybe by patching the last instruction stored at location MAX-1 in LUT RAM dynamically to use a different continuation handler if a CALL instruction opcode was read there initially. There is a chance this may accidentally match other arbitrary data however, depending on whether any constant data is also stored in the executable area...
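A hedged, untested sketch of that patching idea, assuming the CALL #abs opcode pattern is %1101101 in bits 27..21 (inst and cont_call are register names invented here for illustration):

' after each block load, see if the last slot holds one of our escape CALLs
        RDLUT   inst, #(MAX-1)      ' last instruction slot of the block
        SHR     inst, #21
        AND     inst, #%1111111     ' isolate the 7 bit opcode field
        CMP     inst, #%1101101 wz  ' CALL #abs opcode pattern (assumed)
if_z    WRLUT   cont_call, #(MAX-1) ' substitute the continuation handler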
Another interesting feature of this execution model is that it essentially already provides a two level storage hierarchy for holding external code. The HyperRAM driver would read in requested blocks of code to HUB, and then the VM COG could then bring (smaller portions?) of these into LUT from HUB where they are then executed. So in theory this approach could be extended to also support a cache of code blocks being frequently executed. Caching would become beneficial if it is faster to identify that your target branch region is already available in any blocks present in HUB memory compared to the time it takes to request another block from HyperRAM, and that the hit rate is high enough to compensate for any extra cache management overhead. It might be possible for the VM COG to overlap some cache tag management functions in parallel to the HyperRAM driver streaming in the data, doing work in otherwise dead time while waiting perhaps.
Finally, P2 has a few possible instructions that may also be helpful for caching...
ENCOD ' to find the top bit, possibly useful for managing a free list of blocks, in case they need to be determined dynamically to hold code blocks in HUB.
MUL or SEUSSF, perhaps, for calculating a hash of the address as some more randomized value, to help avoid thrashing when two target addresses with the same LSBs would otherwise map to the same cache slot (sketched below).
RDFAST+RFLONG for a hash bucket scheme?
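For example, a minimal sketch of the SEUSSF hashing idea, assuming 256 byte blocks, 16 cache slots, and invented register names:

        MOV     hash, target        ' target = external branch address
        SHR     hash, #8            ' drop the in-block offset bits
        SEUSSF  hash                ' 1:1 pseudo-random bit scramble
        AND     hash, #$F           ' select one of 16 cache slots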
So does anyone else have any thoughts on this whole idea? Do you think it could be useful, or are there major problems with it, etc.?
Roger.
Comments
Rather than the call being included in this block, it could be bumped to the next block and replaced by a NOP.
This is a little wasteful, but it has the advantage of making the more complex processing happen at compile time, and if the location contains $00 as data instead of code then there is no modification made. The performance hit in this corner case is the time taken to load a whole new block into LUT just to hit the branch or call on the first instruction. The smaller the block, the less impact, but also the more likely it is to occur.
Presumably the compiler could still use regular branches and calls, plus the DJNZ/TJNZ/SKIP/REP types, as long as the targets are relative and don't reach beyond the loaded block. That would allow tight loops to maintain maximum performance if they fit in the block size, even at the expense of padding the preceding block with a jump to 'fetch next block' and as many NOPs as necessary to ensure the loops don't span block boundaries (sketched below).
This should help maintain code density within blocks, and help reduce thrashing where code is branching to other routines that could fit in the same standard sized block.
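For instance, a hypothetical padded layout (reload_handler as in the post above, the other names invented):

' ...end of block N: the loop won't fit, so pad and bump it to block N+1
        JMP     reload_handler      ' fetch the next block early
        NOP                         ' padding up to the block boundary
        NOP
' ---- block boundary ----
loop    ADD     r1, r2              ' tight loop now fits entirely in one block
        DJNZ    r4, #loop           ' relative branch stays inside the block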
This meant that if your code alternated between LET and PRINT statements, for example, the interpreter code for each had to be reloaded with each statement, since they were handled by separate blocks of code. This resulted in reduced performance.
Which brings me to my point: how do you propose preventing something like this from dramatically slowing down execution for compiled code?
Especially in the case where the code for a loop spans the last execution slot in a block? Would it even be possible to detect this situation and realign the block sizes to keep a loop (if possible) within a single block?
Walter
Well I guess that is one way to consider it. I was hoping to come up with something that puts less of a burden on the tools side (at least initially), so there might be a better chance that someday people like @n_ermosh and @ersmith could incorporate something like this into the tools without a lot of extra hassle. Tools that need to compute static offsets from branch entry points to be able to NOP-pad them are a little trickier, but I guess achievable if we wanted.
Thinking about it last night, I did find a way to improve the code above, specifically optimising the normal branch case where a branch is not chopped off at the end of the LUT RAM execution space. It now slows normal branch operations by only 4 clocks instead of 8, and adds nothing to the overhead of the chopped-off special case either.
There are going to be two competing factors here as you set up the executable block size in LUT RAM. If you make it too short you need to keep bringing in more; if you make it long, you may be able to keep executing for longer, but you add more clocks bringing in more code than you need after you branch. Without any code to test, my gut feeling tells me there is probably a sweet spot somewhere around 8-16 instructions, because branches happen pretty frequently. I think some old rule of thumb was a branch around every 5 instructions or so.
So the idea might be to bring in some decent sized chunks of code, something like 256B-1kB, from HyperRAM into HUB, then bring in small portions of that (much faster) into LUT on demand. If, say, 8-16 different large blocks of code are in HUB RAM and we can quickly load from them, then once loaded, much of the program's execution in loops can probably continue without a lot of reloading from HyperRAM. We just need a fairly efficient way to translate a branch target into a block and locate where that block is in HUB to reload it.
I did some back of the envelope stuff and figured that if we mapped external addresses directly to HUB addresses before reading to LUT (I know this is not realistic in any way whatsoever because you still have to translate them first and do the caching, but I wanted a ballpark figure for some upper bound on performance when the code is already present in HUB), then something over 20 MIPS was achievable @252MHz, assuming 1 branch follows 4 real instructions. This assumes all branches and no calls, again unrealistic, but whatever.
Eg.
4 LUT instructions executed before branch - 8 clocks
CALL for branch handler - 4 clocks
POP - 2 clocks
RDLUT - 3 clocks
AND - 2 clocks
TJZ - 2 clocks
SETQ2 #15 - 2 clocks, assuming 16 instructions get read into LUT
RDLONG - 9..16 + 15 = 24..31 clocks, averages ~28 clocks
JMP #$200 - 4 clocks
The total is 55 clocks for 5 instructions executed, or one in ~11 clocks, making 252/11 ~ 22.9 MIPS @252MHz.
If you only use 8 LUT RAM entries for code instead of 16, this shaves 8 clocks from the above and you get 47 clocks for 5 instructions or 26.8 MIPS @252MHz.
The cache/address translation overhead is going to slow this down from here in the real world. I just hope it can achieve better than the ~2-3 MIPS we will see at best if we try to read HyperRAM on every instruction. If we could average something like 5-10 MIPS I'd be pretty happy.
If I'd known about this LUT exec idea before, I might have suggested to @cgracey to add a baby CAM (like 8-16 entries or something) to the P2. It would have sped this entire process up enormously. We could have had some very useful instructions for translating external memory addresses to HUB addresses, along these lines:
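Purely hypothetical, since no such instructions exist on the P2 - the mnemonics and operand order here are invented for illustration:

        CAMWR   hubaddr, extaddr    ' store tag: extaddr's block is now cached at hubaddr
        CAMRD   hubaddr, extaddr wc ' lookup: C=1 on a miss, else hubaddr = cached block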
Or if this baby CAM per COG needed too many gates, one common CAM could have been HUB based and shareable amongst COGs with HUBOP perhaps. That would still have helped this problem massively. It could have been useful for a range of other fast lookup applications too. Woulda/shoulda/coulda...
So we now really need to find a fast way to map a 32 bit external address to a block number (of some fixed size) in HUB RAM. That will make all the difference here. E.g. take the upper bits of the 32 bit address and hash them to get a block number, then add that to the offset in HUB memory where the cache is stored. We also need to know whether this block was the one associated with the address, or some other one that we will replace by reading from HyperRAM. A multi-way cache would be good to avoid thrashing too.
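As a rough sketch of such a mapping, assuming a direct-mapped cache of 16 slots of 256 bytes with a tag table in COG RAM (all register and label names invented here):

        MOV     slot, target        ' target = external branch address
        SHR     slot, #8            ' 256 byte blocks
        AND     slot, #$F           ' 16 slots, direct mapped
        ALTS    slot, #tags         ' index the COG-resident tag table
        MOV     tag, 0-0            ' fetch the tag stored for this slot
        MOV     temp, target
        ANDN    temp, #$FF          ' external address of the block start
        CMP     tag, temp wz        ' Z set on a cache hit
if_nz   CALL    #refill             ' miss: fetch block from HyperRAM, update tag
        SHL     slot, #8
        ADD     slot, cachebase     ' HUB address of the cached block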
If we can do all mapping/lookup stuff in under about another 47 clocks, then the MIPS numbers above will be approximately halved to find the actual expected performance range (assuming most of the code's working set of instructions is already in the cache so the hit rate remains high). If it takes more like 94 clocks then the MIPS will be down to 1/3 of the numbers above (still useful by the way).
Actually I just realized another big problem with this scheme... if either SETQ or AUGS are present in the application code and they get broken up across different blocks read into LUT RAM, it will mess things up, so SETQ and AUGS use may need to be limited in the application as well. That's going to be much more of a restriction, especially AUGS use for reading long constants etc. So having the tools pad these with an optional prior NOP, so they always stay together within any (say) 8 or 16 long offset from any prior branch target, might be best. It puts more of a burden on the tools though, which I am trying to avoid.
In the case of a cache lookup match it would add 34 clocks to the above code I had plus one rdlong, making 43..50 extra clocks, average say 47 clocks. This brings the total numbers to 94 clocks for 5 instructions which is 13.4 MIPS @ 252MHz.
So if the hit rate is very high and we average 4 instructions per branch instruction, in the best cases we could probably still peak above 10 MIPS assuming 8 LUT entries get read each branch. This accounts for the special cases of dealing with chopped off branches in LUT RAM, and also the last 8 longs of a 256 byte cache row being loaded (which also needed attention so we don't run out of instructions).
If the time to read in 256 bytes from HyperRAM is about another estimated 1.5us (sysclk/1) and the hit rate remains high (like ~80%) once much of a program's working set is loaded, then we still could see numbers in the high single digits MIPS range. This will obviously depend on how well the caching works with the application.
E.g. With 80% hit rate, and 4 LUT exec instructions per branch on average...
0.80 * 94 + 0.2 * (252*1.5) ~ 150 clocks per 5 instructions or 1 in 30 clocks @ 252 MHz. This is still over 8 MIPS which would be great to achieve if running an application directly from HyperFlash/HyperRAM for example.
So I think LUT-exec probably still has merit at this point if we can somehow solve the SETQ/AUGS issue.
Update: one possible way to deal with AUGS being broken up might be to do a special extra compare for the AUGS opcode before we resume LUT-exec, and patch it with a different reload instruction instead. It would add 7 clocks to the branch handling path in all normal cases, or 9 clocks when it matches an AUGS in the last instruction. The only issue then might be incorrectly matching against the same value randomly appearing in a constant address. To remedy that you might have to look one instruction earlier to see if it is one of our special branch instructions (CALL) that uses a long constant... still doable. The SETQ fix could be done the same way (with 7-9 more clocks of associated checking penalty), or SETQ could simply not be used in the generated code, which would be a pity as it would slow down prologues/epilogues and also prevent burst transfers.
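A hedged, untested sketch of that check, assuming AUGS encodes as %11110 in bits 27..23 and our escape CALL as %1101101 in bits 27..21 (register names invented; clock counts will differ a little from the figures above):

        RDLUT   inst, #(MAX-1)      ' last long of the block
        SHR     inst, #23
        AND     inst, #%11111
        CMP     inst, #%11110 wz    ' does it look like an AUGS?
if_nz   JMP     #no_patch           ' no: nothing to do
        RDLUT   prev, #(MAX-2)      ' it may just be a branch constant that
        SHR     prev, #21           '  happens to look like an AUGS, so also
        AND     prev, #%1111111     '  check the long before it
        CMP     prev, #%1101101 wz  ' is that one of our escape CALLs?
if_nz   WRLUT   reload_augs, #(MAX-1) ' genuine AUGS: patch in the alternate reload
no_patch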
Update2:
Actually, the SETQ could be tested with only 4 more clocks of penalty on top of the 7 above. I think something like the sketch below may work... or maybe I need another AND mask and reverse the test order. TBD. SETQ2 and AUGD might not be generated in application code, and in that case losing them is not an issue.
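A hedged, untested sketch using one stored mask/pattern pair, assuming SETQ encodes as opcode %1101011 with selector %000101000 in the S field (opmask, setq_pat, and reload_setq are invented names):

' opmask   = %0000_1111111_000_000000000_111111111 (keeps opcode + S fields)
' setq_pat = %0000_1101011_000_000000000_000101000 (assumed SETQ pattern)
        RDLUT   inst, #(MAX-1)      ' last long of the block
        AND     inst, opmask        ' strip condition, C/Z/I and D fields
        CMP     inst, setq_pat wz   ' Z set if it is a SETQ
if_z    WRLUT   reload_setq, #(MAX-1) ' patch with the alternate reload handler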
Still needs a way to load it into external RAM of course, and a modified VM that does the work above.
Then again maybe if the changes are all in the PASM source after translation, the addresses should all get recalculated for us...?