Looking for a way to save AUGS/AUGD/SETQ state
rogloh
Posts: 6,217
Hi all,
I am currently looking for a good way to determine the state of any prior pending AUGS/AUGD/SETQ instructions at an arbitrary stopped point in some executed P2 code. I am working on an external memory scheme that fragments code into blocks of 128 or 256 bytes, and this can stop code executing at any instruction. I will essentially replay these 3 instructions prior to resuming in order to reload this state, but it would be nice not to have to search for them manually and instead just know the current AUGD/AUGS/SETQ state so their values can be restored, eg. for CORDIC use or for loading extended constant values > 511. It may not necessarily need to support block transfers, although I might not want to rule that out either. I don't expect I will need to worry about SETQ2 or ALTxx in my application, as high level code like "C" won't use those unless the optimizer somehow does. I'm also assuming that once consumed, AUGD/AUGS will always default back to zero (hopefully I'm not wrong in that assumption).
I think I can probably extract AUGS and "Q" values with these snippets:
        mov     augsval, #0
        shr     augsval, #9          ' shift value down by 9 bits
        setnib  augsval, #$F, 1      ' create AUGS instruction to be executed later
        ...
augsval long    0                    ' this instruction will be executed to restore any pending "AUGS" value

        mov     qval, #0
        muxq    qval, ##-1           ' extract current Q value
        ...
        setq    qval                 ' restores Q value
But I'm not sure if AUGD can be done similarly, maybe with ALTxx somehow, although I might have to hunt for that instruction specifically. AUGD/AUGS both have one nibble as $F so they're not too hard to identify, but I'd have to read more longs from HUB for this.
Any other ideas? I guess it's possible to just use LUT RAM if I keep some free space for this. HUB RAM is similarly possible but burns quite a few more clocks, though still probably faster than direct instruction matching.
        wrlut   #0, lutstorage
        rdlut   augdval, lutstorage
        shr     augdval, #9          ' shift value down by 9 bits
        bith    augdval, #23+(4<<6)  ' create AUGD instruction to be executed later
        ...
augdval long    0                    ' this instruction will be executed to restore any pending "AUGD" value
UPDATE: Damn, I just realized that the conditional flags might also need to be copied, which complicates this scheme further... so I might still have to resort to a brute force lookup of the prior 3 instructions.
I also wondered if there was a way to somehow use forced interrupts on block boundaries, if these could automatically restore the pending AUGD/AUGS/SETQ state, but after reading the documentation it looks like they wait for those instructions to complete before triggering:
"When an interrupt event occurs, certain conditions must be met during execution before the interrupt branch can happen:
- ALTxx / CRCNIB / SCA / SCAS / GETCT+WC / GETXACC / SETQ / SETQ2 / XORO32 / XBYTE must not be executing
- AUGS must not be executing or waiting for a S/# instruction
- AUGD must not be executing or waiting for a D/# instruction
- REP must not be executing or active
- STALLI must not be executing or active
- The cog must not be stalled in any WAITx instruction"

Comments
Yeah, I was going to start a lecture on this until I got to the bottom of your posting. Consequently, restoring Q is only needed if you've overwritten it in your ISR.
The biggest issue arises with the CORDIC. If its pipeline is being utilised for faster compute then that routine needs to hold off interrupts for the duration of the pipelining, which can be a substantial number of clocks.
@rogloh Could we back up and look at this on a higher level? You mentioned using C code, if so then maybe we could modify the compiler or alternatively add a postprocess to the compiler output so that the problematic instructions do not end a block (inserting NOPs if necessary). A postprocessor would have the advantage that it could cope with user assembly too, including inline assembly in the C code.
[Oh, oops, I hadn't understood Roger's intent at all.]
The IRQ-blocking approach is a problem because then the instructions will start to read past the loaded code block.
However, as Eric points out, the correct solution to this is to just not assemble such instructions over the block boundary.
Probably going to want fixed block granular boundaries too, with things like branches and tight loops avoiding sitting across a block-end boundary as much as possible. It's an enhanced version of the above instruction bumping that also assembles extra NOPs to push a branch target into the next block.
EDIT: On the other hand, would there be a feature for multiple contiguous blocks being copied physically adjacent in hubRAM? If so then there are options for grouping blocks, which would avoid inserting spacing NOPs at every block boundary.
Sure, I was thinking of going down this route too, but I first wanted to demo the entire concept with build scripts that modify flexspin's existing P2ASM output files generated from compiled C code, to prove it was achievable before discussing all the gory details and requesting suitable changes to flexC compilation etc. Strangely, I had sort of anticipated my ideas might get shot down before they even had a chance to shine... so your comment here is really refreshing and encouraging.
I already know this idea is achievable from what I did with p2gcc and Micropython ages ago, which at the time obtained what I thought was surprisingly decent performance: something in the ballpark of ~40% of native code speed in the best case. On a fast P2 I think this could be very useful to support in flexC as well, for example when running larger application code such as a fully featured editor app, a local build tool, or a GUI + graphics libraries from external memory. The key to decent performance is assigning a large enough instruction cache to contain most, or ideally all, of the working set of the running application.
So I'll post what I've figured out over the last couple of weeks here as a brain dump, and perhaps you can chime in on what else you think might be possible...
The essence of my external code memory scheme is to leverage a fully associative I-cache in HUB RAM, operating with a FIFO cache replacement policy, that dynamically contains executable P2 code read in from external memory (eg. PSRAM) and then executed in regular HUB-exec mode, just like most flexspin code already is. Right now there is no plan for an external memory D-cache and all application data still remains in HUB, although down the line it might be good to have some indirect methods for accessing large amounts of lookup table or read-only type data held in external memory. I already have external memory APIs that can do this, so it's not too difficult to merge that into the C code as is.
To separate the cacheable functions, the P2 function blocks output by flexspin are divided into two addressable memory regions. Functions considered internal exist below 1MB and those considered external and cacheable are placed above 1MB (it could be split at 512kB too if needed, I've just used 1MB for now). An "orgh $100000" line is placed into the intermediate p2asm file and any C functions marked as "FAR_" are moved after this separator line prior to the final binary being generated from this p2asm source file. Currently I just have a preprocessor macro that prepends "FAR_" to nominated external memory functions, plus a sed script that splits things up and moves them from where they were compiled initially in the p2asm file to after the 1MB separator line. The flexspin toolchain warns about any generated addresses above 512kB, but thankfully it still generates the code at the higher addresses (and simply pads the binary up to 1MB with zeroes prior to generating the far code blocks). Some file compression steps might be desirable down the line to deal with that, but this larger binary could be post-processed, split, and flashed into different flash regions with loadp2. Speaking of which, there is nothing to prevent the external memory image from being stored and "executed" from flash. It would require a different memory driver to read from flash instead of external RAM, but it could still work with the same I-cache scheme I have developed (albeit at lower performance than a PSRAM or HyperRAM based solution, for example).
The proof of concept code just marks these "FAR" C functions using "#define FAR(x) FAR_##x"; the "FAR_" prefix is explicitly searched for by my translation sed scripts and these functions get handled differently. With proper compiler support I thought it would be much nicer to just use another locating attribute in the compiler, such as what is shown below, so it could be tracked internally in the toolchain, potentially generated at the right high addresses in the output file, and possibly dealt with differently during assembly code generation:
int function(int args) __attribute__(ext)

At boot time I needed to tap into the startup code to initialize things, so I just preceded the "call #_main" line in the p2asm file with an earlier call to a special C function that sets up my external memory handler (it spawns my SPIN2 based memory object as another COG), inits the cache structures in HUB RAM, and loads the LUT RAM with the call handlers. It also copies the image out of flash into PSRAM at the appropriate external memory addresses obtained from the original binary, before returning to call the original main() function. Some compilers have the concept of startup code that runs before main executes, but I didn't see it in flexspin so I needed to hack the original p2asm code a little here. I also needed to incorporate my COG RAM variables as well.
The function calls made between the two different memory regions need to be coded differently from each other. We need to replace all call & branch instructions that target external memory with special FAR calls, using indirect calls/trampolines implemented with the callpa instruction. The "pa" register passes the target external memory branch/call address to the special call handler. This also includes indirect calls made using a register, as happens when the function addresses stored in an object's method table are used. Conveniently, the same existing parameter and return stack use is retained in all the functions generated, so nothing else changes there.
All internal COG/HUB memory functions calling other internal COG/HUB functions directly still occur as normal, they're still just implemented using the typical "call #_function" style. So there is no additional call/branch overhead or changes needed there. Same thing happens if an external memory function calls an internal memory function - the current calling code being generated remains the same. In this case it will just return back from internal memory to the I-cache row stored in HUB and continue on from there.
However any internal memory based function calling an external memory function such as "call #_FAR_function" is instead replaced by a callpa instruction to call a special handler that does the far call for it using a "callpa ##_FAR_function, farcallint" instruction to make the call. Similarly any external memory based function calling another external function does the same type of "callpa ##_FAR_function, farcallext" replacement and the handler for that is implemented slightly differently to account for the return address needing to be translated back to the original external memory address and optionally re-cached if its original cache row was overwritten in the meantime.
All of the other indirect function calls made (in both internal and external functions) need to instead call special handlers which test whether the address being called is above the internal/external boundary, and then either branch to the external memory call handler when the address is above the boundary, or to the destination address directly when a low memory function is being called. This is done with a simple test of the "pa" register parameter holding the called address.
Branches within external memory functions also need to be handled, and the address tested to see if it exists in the cache. In this case we have three possibilities:
1) branch address is within the currently executing cache row - we just jump directly to the HUB address mapped within this row
2) branch address exists in some other currently cached row - jump directly to HUB at that mapped row address in the cache (a cache hit)
3) branch address does not exist in the cache (a cache miss) - the oldest cache row entry is dropped and reloaded with the new block contents by transferring a block of data from external memory into HUB at the cache row start.
So we then end up having this group of call/branch handlers being identified and transformed in the source code:
making a far direct call from int->ext memory
call #FAR_xxx -> callpa ##FAR_xxx, farcallint
making a far indirect call from int->ext memory
call reg -> callpa reg, farcallinttest
making a far direct call from ext->ext memory
call #FAR_xxx -> callpa ##FAR_xxx, farcallext
making a far indirect call from ext->ext memory
call reg -> callpa reg, farcallexttest
making a far branch from ext->ext memory
jmp #LR_xxx -> callpa ##LR_xxx, farjmp
Right now branching directly from int->ext memory is not implemented - it should not really ever happen unless setjmp/longjmp is used perhaps. Maybe that is something for later...
For highest speed I store these various call handlers in LUT RAM. From what I recall they consumed something less than 128 longs or so, including the cache handling code. These routines could be stored in HUB if needed, but that would introduce more branch overhead from a couple of extra FIFO reloads, and I want to minimize the overhead. There are also about 20 longs of state and callpa target address values required in COGRAM, which I prepend just before the COG_BSS_START label is defined. Unfortunately the P2 (or flexspin?) does not seem to allow ##S constants with callpa, so it required 5 extra COGRAM longs to hold these addresses. Doing a "callpa ##value, ##address" complains about crossing the COG/HUB boundary or something; it might be offset/absolute address related. I just worked around it by using S registers directly in these calls at the expense of COGRAM - although this saves having dual AUGS+AUGD instruction overheads on all special calls/branches, which is a good thing too.
There are a number of restrictions that currently need to be enforced for the code generated in functions marked as external to still operate OK at their cached addresses. Most of this is probably already the case with typical C-compiled output, but here's a likely incomplete list of what we can't do, assuming we don't know the row boundary at compile time. Note that these are still fine within internal memory functions...
Now this list of restrictions can be reduced further if the compiler knows the cache row size and where the final memory addresses storing the instructions are located (which it would in its final step). For example, if we have a two-instruction ALTx sequence crossing a (say) 128 or 256 byte address boundary (TBD), we could insert a NOP and start the sequence at the beginning of the next row instead. This extra NOP only slows things down very marginally but is nicer than losing the ALTx feature/optimization. Similarly, any local branches known to land in the same cache row would not need to be translated into the special FAR branch handler call instead of the faster regular branch, and instructions such as DJNZ reg, #label could still occur if #label is within the same 128 or 256 byte block as the caller's current address. With my scripts I couldn't easily determine this so I ruled it out, but with compiler support in the assembler it would be much faster.
At one time I thought I might be able to feed the toolchain the modified p2asm code to generate a listing file, then extract the instruction sequences placed at block wrap addresses and work from there, adding NOPs and adjusting function by function before recompiling, but this could get somewhat tedious and a lot of patching might be needed as more local branch addresses get affected in the file. Perhaps it's still doable in a single pass; not 100% sure there.
Some special optimizations that happen in generated code could affect things too so I thought it might be simplest to start out disabling all optimizations for external functions being generated or at least those known to affect things adversely.
Also, we obviously need to handle the block wrapping cases where executed instructions cross cached rows. I do this by keeping extra instruction space in the cache rows in HUB: 12 bytes before the row's instructions are stored, and 4 bytes after the row ends. So for 256 byte cacheable blocks I actually store them in a 272 byte total row size in HUB. A MUL block, #272 instruction is then used to convert from block ID to a row offset that is added to the cache start address. At the end of each cache row in HUB I store a "JMP #\wrap_handler" instruction, set up once at init time. If the executed instructions ever reach the end of the row, this instruction executes and the next incremented block number is checked to see whether it's already cached, or is reloaded if needed. Also, if the last row ended with AUGS/AUGD/SETQ sequences etc., I copy the last 12 bytes at the end of the row into the reserved area before the start of the next row and optionally jump into this region to repeat these instructions before the next row's instructions begin. This is the complication I have to resolve and was the source of the post above. If AUGS/AUGD use is separated by ordinary instructions it is easy to handle by copying a single last instruction directly, but if it's a sequence of more than one of these special instructions it gets more complicated and more address adjustments need to be made. Fortunately some of this can happen while the new block is being read in via the streamer, but some instruction copying has to be done first, before the prior cache row is lost by being overwritten. If some flexspin compiler change could assist with this by introducing NOPs at block boundaries it would be really handy. I will potentially corrupt the current "Q" value with my block transfers of AUGD/AUGS etc., so I probably want to preserve and reload it before resuming after the row wraps, in case the CORDIC needs it.
I doubt the C code does block transfers natively requiring SETQ/SETQ2 use unless the optimizer makes some use of that.
I do have some (untested) sample code already written for this based on my earlier work, which I am still implementing/refining. The FIFO replacement policy I use for row replacement (or "LRC", least recently cached) is very fast to implement and requires no extra state management, unlike LRU/LFU approaches etc. Another fast method is a simple random replacement policy, which I've read is used by some ARM cores. One benefit we have here is that the next cache row to be replaced can be pre-computed while the external memory block is being read in, as we have a small amount of extra time there. As to HUB RAM usage, my simple caching scheme supports up to 255 rows of up to 256 bytes of I-cache (~69kB of HUB RAM including the extra row/tag overheads) and uses a block->cache row mapping table allowing up to 16MB (or more) of memory to be addressed (eg. PSRAM on the P2-Edge). The byte based block->cache row mapping table is also stored in HUB RAM and is scalable: if your application only addressed up to 4MB of RAM and used a 256 byte block size it would only take 16kB of HUB RAM, while 16MB applications would need 64kB, etc. Alternatively, a cache with more than 255 rows is possible (it may be faster with smaller row sizes), but this requires a word based block->row table instead of a byte based one, which doubles its size; that might still be okay with moderate external memory sizes. In any case there is some flexibility here for tuning the cache performance. Gut feeling tells me the sweet spot probably lies around 32-64 instructions per block; this is a tradeoff between the latency of filling cache rows and the working set hit rate. A larger block creates more spatial locality and reduces the wrap overheads at the end of each row, but it takes longer to fill. A smaller block allows more blocks in the I-cache and less time to reload from external memory, but probably needs more blocks loaded per function call.
If my memory driver is used to access memory, the initial latency is on the order of 1us, and the 256 byte transfer takes about the same time at a 256MHz clock, making a total of ~2us per row read. So I've picked 256 bytes for initial testing to try to balance this out. A directly coupled memory driver (also saving a COG) is another option for higher performance, as it would reduce the latency significantly, but it may not share external memory so well with other COGs - especially any video COGs using external memory at the same time (which might be a target application too).
Brain dump complete...