LUT-exec model concept
rogloh
Following on from a recent discussion in the HyperRAM driver thread here is an idea for a "LUT-exec" model which might allow us to execute code stored externally to the P2's own memory....
On the P2 we would like to find a way to execute code from external sources like HyperRAM/Flash at a reasonable speed. The P1 offered us XMM, however its performance was only mediocre, and it would be nice to try to improve on this with the P2 if possible. The faster P2, combined with something like HyperRAM's high bandwidth and large capacity external memory, should hopefully be able to offer us something better. Reading code into HUB from external memory and then using HUB exec could be one approach to take, but another way to consider is to execute the externally sourced code from within LUT RAM. It's not going to be as fast as native hub-exec of internal memory, and there are serious penalties for code that branches a lot, but it might still improve on what the P1's XMM had to offer, or on what is possible if every external address is read individually from external memory.
The LUT memory space already offers us several advantages. Firstly, instructions can be executed there at the highest speed. Secondly, it is rather fast to read instructions into LUT from HUB using SETQ2 bursts. Thirdly, LUT offers us a fast way to identify branch target addresses by using the COG's hardware stack and RDLUT operations together, as shown later. Finally, instructions that use RDLONG/WRLONG, such as stack operations and other general data accesses, do not interfere with execution; under hub-exec they have to compete with the FIFO reading instructions from HUB, which can slow down execution. Also, not every VM would need every single bit of COG+LUT RAM, so there is probably some free space to put to good use.
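For example, a burst of 16 longs lands in LUT with just two instructions; a minimal sketch, assuming hubaddr is a register holding the HUB source address:

        SETQ2   #16-1               ' transfer 16 longs into LUT
        RDLONG  0, hubaddr          ' fills LUT offsets 0..15 (which execute at $200..$20F)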
So how could such a LUT-exec model work...?
We would basically need the build tools emitting code for this execution mode to encode and translate a few special instructions when it comes to branches, calls, and returns, instead of generating them natively as for HUB or COG exec. Also, the DJNZ/TJNZ/SKIP/REP types of P2 instructions could not be used in the output code, and enabling inline assembly would require some care to avoid certain instructions that could break the model.
For any branches or calls generated, they would now need to be encoded into two longs instead of one. The first long is a P2 CALL instruction that escapes out to a special handler routine in COG (or LUT) RAM. The second long, following it, is a constant that contains the target branch address. LUT execution of the externally sourced code stops at this point once the CALL executes, and the special handler then takes over.
' a branch to an address xxxxxx is done like this:
        CALL    jump_handler
        long    $xxxxxx         ' the target address

' a call to an address yyyyyy is done like this:
        CALL    call_handler    ' instruction to branch to call handler in COGRAM or LUTRAM
        long    $yyyyyy         ' the target address
The two special branch/call handlers would need to pop the hardware stack and that will give the LUT address offset of the target address constant because it is also where the currently executed code would normally return to (the return address is just after the CALL instruction). This LUT RAM address can then be read quickly to find the target address using a single RDLUT. Additionally for the call handler case, it will also need to push the application's return address (which will just be the starting external memory address of the previous block read into LUT from HUB plus the LUT offset*4 + 4) onto the HUB stack. That will let the application code do its own return to this address later. Here is how these handlers could work:
' jump handler code present in COG RAM (or LUT) does this:
        POP     offset
        RDLUT   target, offset
        ' jump to the routine to read target address from ext-mem/cache to HUB/LUT here
        ' it ends with....
        JMP     #$200           ' to resume LUT-EXEC

' call handler code present in COG RAM (or LUT) does this:
        POP     offset
        RDLUT   target, offset
        ADD     offset, #1
        MUL     offset, #4
        ADD     address, offset ' address is the start of the code read last into LUT (previous branch target)
        PUSHA   address         ' if PTRA is SP
        ' jump to the routine to read target address from ext-mem/cache to HUB/LUT here
        ' it ends with....
        JMP     #$200           ' to resume LUT-EXEC
For any conditional branches/calls we would still be able to make use of the Z&C flags and do something like this for example, where r2 is being compared to r3 and the code will branch if r2 < r3:
        CMP     r2, r3 wc       ' condition being tested
if_c    CALL    jump_handler
        long    $target_address | (if_c << 28) ' i.e. combine the 4 bit conditional test with the target address in the long
The main thing to ensure here would be to encode the same 4 bit "if_xx" condition in the long constant that follows each type of CALL instruction above. That prevents the long constant from ever being executed in the case of the branch not being taken. This means the branch/call target address is restricted to 28 bits; handily this also matches up with the HyperRAM driver's full 256MB address range.
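It also means the handlers must strip those condition bits back out before using the address; a minimal sketch, assuming addrmask is a register preloaded with $0FFFFFFF:

        RDLUT   target, offset      ' read the combined condition+address long
        AND     target, addrmask    ' keep only the 28 bit external address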
For any cases where indirect calls are made (e.g. function pointers), these can be supported by using the CALLPA or CALLPB instructions to copy the value of the register containing the indirect address into a parameter register, either PA or PB. We would need to reserve one of these two PA or PB registers for the VM's exclusive use.
CALLPA r1, indirect_call_handler ' instruction to perform an indirect call to the address in register r1
In this case PA would get the address from register r1, and the handler can make direct use of it, and also set up the stack in HUB before branching:
' indirect_call_handler code does this:
        POP     offset
        MUL     offset, #4      ' address in LUT, conveniently also removes flags
        ADD     address, offset ' address is the start of the code read last into LUT (previous branch target)
        PUSHA   address         ' if PTRA is SP
        ' jump to the routine to read the jump address in PA from ext-mem/cache to HUB/LUT here
        ' it ends with....
        JMP     #$200           ' resume LUT-EXEC
For any cases where the application code returns (e.g., with a RETA), the tool will need to transform the return instruction into this:
JMP ret_handler
The return handler would need to read the real return address from the HUB stack and use that.
' ret_handler code does this:
        POPA    address         ' if PTRA is SP
        ' jump to the routine to read target address from ext-mem/cache to HUB/LUT here
        ' it ends with....
        JMP     #$200           ' resume LUT-EXEC
We still need to deal with the overrun cases where we reach the end of the executable block of code in LUTRAM and have not yet branched out, triggering the next code block reload. There are two types of reload cases here:
1) No branch or call occupies the last instruction slot in the LUT execution area of size "MAX" (where MAX is some power of 2), so execution simply runs off the end of the block.
2) A branch starts at the last instruction slot in LUT RAM, but its target address constant has been chopped off because it was never part of the block that was read in.
The first case can be handled by always keeping one instruction at the end of the LUT RAM executable area that jumps to a reload handler whose job it is to reload the next continuing portion of code into LUT RAM and resume execution. So when the code reaches the end of the block, it will trigger a reload of more code and can continue executing.
        JMP     reload_handler

' reload handler does this:
        ADD     address, #(MAX*4) ' address is the start of the code read last into LUT (previous branch target)
        ' go call the routine to read target address from ext-mem/cache to HUB/LUT here
        ' ends with....
        JMP     #$200           ' resume LUT-EXEC
There is one further complexity for the second reload case, when the last instruction stored in the LUT executable area is itself a call to one of our branch handlers and its target address was never read into the LUT RAM. One way to solve this might be to make a modification to the branch and call handlers mentioned above, where we also test whether they were invoked right at the end of the LUT RAM (after "MAX" instruction slots have been executed). If so, in this special case we then obtain the target address directly from the HUB, assuming one additional long is always read/stored there beforehand from the original HyperRAM read request. E.g.
' jump handler does this:
        POP     offset
        TEST    offset, #(MAX-1) wz ' test if reached MAX instructions, note: destroys Z
        RDLUT   target, offset
if_z    ADD     address, #(MAX*4)   ' compute end of last read block
if_z    RDLONG  address, address    ' read next branch target address
        ' go call the routine to read target address from ext-mem/cache to HUB/LUT here
        ' ends with....
        TESTB   offset, #30 wz      ' also restore Z before resuming
        JMP     #$200               ' resume LUT-EXEC
This technique does add a bit more overhead to each branch however. It would be nicer to find an even simpler way to deal with this if possible, maybe by patching the last instruction stored at location MAX-1 in LUT RAM dynamically to use a different continuation handler if a CALL instruction opcode was read there initially. There is a chance this may accidentally match other arbitrary data however, depending on whether any constant data is also stored in the executable area...
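A hedged, untested sketch of that patching idea, assuming the CALL #abs opcode pattern is %1101101 in bits 27..21 (inst and cont_call are register names invented here for illustration):

' after each block load, see if the last slot holds one of our escape CALLs
        RDLUT   inst, #(MAX-1)      ' last instruction slot of the block
        SHR     inst, #21
        AND     inst, #%1111111     ' isolate the 7 bit opcode field
        CMP     inst, #%1101101 wz  ' CALL #abs opcode pattern (assumed)
if_z    WRLUT   cont_call, #(MAX-1) ' substitute the continuation handler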
Another interesting feature of this execution model is that it essentially already provides a two level storage hierarchy for holding external code. The HyperRAM driver would read in requested blocks of code to HUB, and then the VM COG could then bring (smaller portions?) of these into LUT from HUB where they are then executed. So in theory this approach could be extended to also support a cache of code blocks being frequently executed. Caching would become beneficial if it is faster to identify that your target branch region is already available in any blocks present in HUB memory compared to the time it takes to request another block from HyperRAM, and that the hit rate is high enough to compensate for any extra cache management overhead. It might be possible for the VM COG to overlap some cache tag management functions in parallel to the HyperRAM driver streaming in the data, doing work in otherwise dead time while waiting perhaps.
Finally, P2 has a few possible instructions that may also be helpful for caching...
ENCOD ' to find the top bit, possibly useful for managing a free list of blocks, in case they need to be determined dynamically to hold code blocks in HUB.
MUL or SEUSSF, perhaps, for calculating a hash of the address as some more randomized value, to help avoid thrashing when two target addresses with the same LSBs would otherwise map to the same cache slot (sketched below).
RDFAST+RFLONG for a hash bucket scheme?
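For example, a minimal sketch of the SEUSSF hashing idea, assuming 256 byte blocks, 16 cache slots, and invented register names:

        MOV     hash, target        ' target = external branch address
        SHR     hash, #8            ' drop the in-block offset bits
        SEUSSF  hash                ' 1:1 pseudo-random bit scramble
        AND     hash, #$F           ' select one of 16 cache slots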
So does anyone else have any thoughts on this whole idea? Do you think it could be useful, or are there major problems with it, etc.?
Roger.
Comments
Rather than the call being included in this block, it could be bumped to the next block and replaced by a NOP.
This is a little wasteful, but it has the advantage of making the more complex processing happen at compile time, and if the location contains $00 as data instead of code then there is no modification made. The performance hit in this corner case is the time taken to load a whole new block into LUT just to hit the branch or call on the first instruction. The smaller the block, the less impact, but also the more likely it is to occur.
Presumably the compiler could still use regular branches and calls, plus the DJNZ/TJNZ/SKIP/REP types, as long as the targets are relative and don't reach beyond the loaded block. That would allow tight loops to maintain maximum performance if they fit in the block size, even at the expense of padding the preceding block with a jump to 'fetch next block' and as many NOPs as necessary to ensure the loops don't span block boundaries (sketched below).
This should help maintain code density within blocks, and help reduce thrashing where code is branching to other routines that could fit in the same standard sized block.
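For instance, a hypothetical padded layout (reload_handler as in the post above, the other names invented):

' ...end of block N: the loop won't fit, so pad and bump it to block N+1
        JMP     reload_handler      ' fetch the next block early
        NOP                         ' padding up to the block boundary
        NOP
' ---- block boundary ----
loop    ADD     r1, r2              ' tight loop now fits entirely in one block
        DJNZ    r4, #loop           ' relative branch stays inside the block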
This meant that if your code alternated between LET and PRINT statements, for example, the interpreter code for each had to be reloaded with each statement, since they were handled by separate blocks of code. This resulted in reduced performance.
Which brings me to my point: how do you propose preventing something like this from dramatically slowing down execution for compiled code?
Especially in the case where the code for a loop spans the last execution slot in a block? Would it even be possible to detect this situation and realign the block sizes to keep a loop (if possible) within a single block?
Walter
Well I guess that is one way to consider it. I was hoping to come up with something that puts less of a burden on the tools side (at least initially), so there might be a better chance that someday people like @n_ermosh and @ersmith could incorporate something like this into the tools without a lot of extra hassle. Tools that need to compute static offsets from branch entry points to be able to NOP-pad them are a little trickier, but I guess achievable if we wanted.
Thinking about it last night, I did find a way to improve the code above, specifically optimising the normal branch case where a branch is not chopped off at the end of the LUT RAM execution space. It now slows normal branch operations by only 4 clocks instead of 8, and adds nothing to the overhead of the chopped-off special case either.
There are going to be two competing factors here as you set up the executable block size in LUT RAM. If you make it too short you need to keep bringing in more; if you make it long, you may be able to keep executing for longer, but you add more clocks bringing in more code than you need after you branch. Without any code to test, my gut feeling tells me there is probably a sweet spot somewhere around 8-16 instructions, because branches happen pretty frequently. I think some old rule of thumb was a branch around every 5 instructions or so.
So the idea might be to bring in some decent sized chunks of code, something like 256B-1kB, from HyperRAM into HUB, then bring in small portions of that (much faster) into LUT on demand. If, say, 8-16 different large blocks of code are in HUB RAM and we can quickly load from them, then once loaded, much of the program's execution in loops can probably continue without a lot of reloading from HyperRAM. We just need a fairly efficient way to translate a branch target into a block and locate where that block is in HUB to reload it.
I did some back of the envelope stuff and figured that if we mapped external addresses directly to HUB addresses before reading to LUT (I know this is not realistic in any way whatsoever because you still have to translate them first and do the caching, but I wanted a ballpark figure for some upper bound on performance when the code is already present in HUB), then something over 20 MIPS was achievable @252MHz, assuming 1 branch follows 4 real instructions. This assumes all branches and no calls, again unrealistic, but whatever.
Eg.
4 LUT instructions executed before branch - 8 clocks
CALL for branch handler - 4 clocks
POP - 2 clocks
RDLUT - 3 clocks
AND - 2 clocks
TJZ - 2 clocks
SETQ2 #15 - 2 clocks, assuming 16 instructions get read into LUT
RDLONG - 9..16 + 15 = 24..31 clocks, averages ~28 clocks
JMP #$200 - 4 clocks
The total is 55 clocks for 5 instructions executed, or one in ~11 clocks, making 252/11 ~ 22.9 MIPS @252MHz.
If you only use 8 LUT RAM entries for code instead of 16, this shaves 8 clocks from the above and you get 47 clocks for 5 instructions or 26.8 MIPS @252MHz.
The cache/address translation overhead is going to slow this down from here in the real world. I just hope it can achieve better than the ~2-3 MIPS we will see at best if we try to read HyperRAM on every instruction. If we could average something like 5-10 MIPS I'd be pretty happy.
If I'd known about this LUT exec idea before, I might have suggested to @cgracey to add a baby CAM (like 8-16 entries or something) to the P2. It would have sped this entire process up enormously. We could have had some very useful instructions for translating external memory addresses to HUB addresses, along these lines:
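Purely hypothetical, since no such instructions exist on the P2 - the mnemonics and operand order here are invented for illustration:

        CAMWR   hubaddr, extaddr    ' store tag: extaddr's block is now cached at hubaddr
        CAMRD   hubaddr, extaddr wc ' lookup: C=1 on a miss, else hubaddr = cached block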
Or if this baby CAM per COG needed too many gates, one common CAM could have been HUB based and shareable amongst COGs with HUBOP perhaps. That would still have helped this problem massively. It could have been useful for a range of other fast lookup applications too. Woulda/shoulda/coulda...
So we now really need to find a fast way to map a 32 bit external address to a block number (of some fixed size) in HUB RAM. That will make all the difference here. E.g. take the upper bits of the 32 bit address and hash them to get a block number, then add that to the offset in HUB memory where the cache is stored. We also need to know whether this block was the one associated with the address, or some other one that we will replace by reading from HyperRAM. A multi-way cache would be good to avoid thrashing too.
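As a rough sketch of such a mapping, assuming a direct-mapped cache of 16 slots of 256 bytes with a tag table in COG RAM (all register and label names invented here):

        MOV     slot, target        ' target = external branch address
        SHR     slot, #8            ' 256 byte blocks
        AND     slot, #$F           ' 16 slots, direct mapped
        ALTS    slot, #tags         ' index the COG-resident tag table
        MOV     tag, 0-0            ' fetch the tag stored for this slot
        MOV     temp, target
        ANDN    temp, #$FF          ' external address of the block start
        CMP     tag, temp wz        ' Z set on a cache hit
if_nz   CALL    #refill             ' miss: fetch block from HyperRAM, update tag
        SHL     slot, #8
        ADD     slot, cachebase     ' HUB address of the cached block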
If we can do all mapping/lookup stuff in under about another 47 clocks, then the MIPS numbers above will be approximately halved to find the actual expected performance range (assuming most of the code's working set of instructions is already in the cache so the hit rate remains high). If it takes more like 94 clocks then the MIPS will be down to 1/3 of the numbers above (still useful by the way).
Actually I just realized another big problem with this scheme... if either SETQ or AUGS are present in the application code and they get broken up across different blocks read into LUT RAM, it will mess things up, so SETQ and AUGS use may need to be limited in the application as well. That's going to be much more of a restriction, especially AUGS use for reading long constants etc. So having the tools pad these with an optional prior NOP, so they always stay together within any (say) 8 or 16 long offset from any prior branch target, might be best. It puts more of a burden on the tools though, which I am trying to avoid.
In the case of a cache lookup match it would add 34 clocks to the above code I had plus one rdlong, making 43..50 extra clocks, average say 47 clocks. This brings the total numbers to 94 clocks for 5 instructions which is 13.4 MIPS @ 252MHz.
So if the hit rate is very high and we average 4 instructions per branch instruction, in the best cases we could probably still peak above 10 MIPS assuming 8 LUT entries get read each branch. This accounts for the special cases of dealing with chopped off branches in LUT RAM, and also the last 8 longs of a 256 byte cache row being loaded (which also needed attention so we don't run out of instructions).
If the time to read in 256 bytes from HyperRAM is about another estimated 1.5us (sysclk/1) and the hit rate remains high (like ~80%) once much of a program's working set is loaded, then we still could see numbers in the high single digits MIPS range. This will obviously depend on how well the caching works with the application.
E.g. With 80% hit rate, and 4 LUT exec instructions per branch on average...
0.80 * 94 + 0.2 * (252*1.5) ~ 150 clocks per 5 instructions or 1 in 30 clocks @ 252 MHz. This is still over 8 MIPS which would be great to achieve if running an application directly from HyperFlash/HyperRAM for example.
So I think LUT-exec probably still has merit at this point if we can somehow solve the SETQ/AUGS issue.
Update: one possible way to deal with AUGS being broken up might be to do a special extra compare for the AUGS opcode before we resume LUT-exec, and patch it with a different reload instruction instead. It would add 7 clocks to the branch handling path in all normal cases, or 9 clocks when it matches an AUGS in the last instruction. The only issue then might be incorrectly matching against the same value randomly appearing in a constant address. To remedy that you might have to look one instruction earlier to see if it is one of our special branch instructions (CALL) that uses a long constant... still doable. The SETQ fix could be done the same way (with 7-9 more clocks of associated checking penalty), or SETQ could simply not be used in the generated code, which would be a pity as it would slow down prologues/epilogues and also prevent burst transfers.
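A hedged, untested sketch of that check, assuming AUGS encodes as %11110 in bits 27..23 and our escape CALL as %1101101 in bits 27..21 (register names invented; clock counts will differ a little from the figures above):

        RDLUT   inst, #(MAX-1)      ' last long of the block
        SHR     inst, #23
        AND     inst, #%11111
        CMP     inst, #%11110 wz    ' does it look like an AUGS?
if_nz   JMP     #no_patch           ' no: nothing to do
        RDLUT   prev, #(MAX-2)      ' it may just be a branch constant that
        SHR     prev, #21           '  happens to look like an AUGS, so also
        AND     prev, #%1111111     '  check the long before it
        CMP     prev, #%1101101 wz  ' is that one of our escape CALLs?
if_nz   WRLUT   reload_augs, #(MAX-1) ' genuine AUGS: patch in the alternate reload
no_patch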
Update2:
Actually, the SETQ could be tested with only 4 more clocks of penalty on top of the 7 above. I think something like the sketch below may work... or maybe I need another AND mask and reverse the test order. TBD. SETQ2 and AUGD might not be generated in application code, and in that case losing them is not an issue.
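A hedged, untested sketch using one stored mask/pattern pair, assuming SETQ encodes as opcode %1101011 with selector %000101000 in the S field (opmask, setq_pat, and reload_setq are invented names):

' opmask   = %0000_1111111_000_000000000_111111111 (keeps opcode + S fields)
' setq_pat = %0000_1101011_000_000000000_000101000 (assumed SETQ pattern)
        RDLUT   inst, #(MAX-1)      ' last long of the block
        AND     inst, opmask        ' strip condition, C/Z/I and D fields
        CMP     inst, setq_pat wz   ' Z set if it is a SETQ
if_z    WRLUT   reload_setq, #(MAX-1) ' patch with the alternate reload handler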
Still needs a way to load it into external RAM of course, and a modified VM that does the work above.
Then again maybe if the changes are all in the PASM source after translation, the addresses should all get recalculated for us...?