Looking for a way to save AUGS/AUGD/SETQ state
rogloh
Posts: 6,217
Hi all,
I am currently looking for a good way to determine the state of any prior pending AUGS/AUGD/SETQ instructions at an arbitrary stopped point in some executed P2 code. I am working on an external memory scheme that fragments code into blocks of 128 or 256 bytes, and this can stop code executing at any instruction. I will essentially replay these 3 instructions prior to resuming in order to reload this state, but it would be nice not to have to search for them manually and instead just know the current AUGD/AUGS/SETQ state so their values can be restored, eg. for CORDIC use or for loading extended constant values > 511. It may not necessarily need to support block transfers, although I might not want to rule that out either. I don't expect I will need to worry about SETQ2 or ALTxx in my application, as high level code like "C" won't use those unless the optimizer somehow does. I'm also assuming that once consumed, AUGD/AUGS will always default back to zero (hopefully I'm not wrong in that assumption).
I think I can probably extract AUGS and "Q" values with these snippets:
        mov     augsval, #0
        shr     augsval, #9          ' shift value down by 9 bits
        setnib  augsval, #$F, 1      ' create AUGS instruction to be executed later
        ...
augsval long    0                    ' this instruction will be executed to restore any pending "AUGS" value

        mov     qval, #0
        muxq    qval, ##-1           ' extract current Q value
        ...
        setq    qval                 ' restores Q value
But I'm not sure if AUGD can be done similarly, maybe with ALTxx somehow, although I might have to hunt for that instruction specifically. AUGD/AUGS both have one nibble as $F so they're not too hard to identify, but I'd have to read more longs from HUB for this.
Any other ideas? I guess it's possible to just use LUT RAM if I keep some free space for this. HUB RAM is similarly possible but burns quite a few more clocks, though still probably faster than direct instruction matching.
        wrlut   #0, lutstorage
        rdlut   augdval, lutstorage
        shr     augdval, #9          ' shift value down by 9 bits
        bith    augdval, #23+(4<<6)  ' create AUGD instruction to be executed later
        ...
augdval long    0                    ' this instruction will be executed to restore any pending "AUGD" value
UPDATE: Damn, I just realized that the conditional flags might also need to be copied, which complicates this scheme further... so I might still have to resort to a brute force lookup of the prior 3 instructions.
I also wondered if there was a way to somehow use forced interrupts on block boundaries, if these could automatically restore the pending AUGD/AUGS/SETQ state, but after reading the documentation it looks like they wait for those instructions to complete before triggering:
"When an interrupt event occurs, certain conditions must be met during execution before the interrupt branch can happen:
- ALTxx / CRCNIB / SCA / SCAS / GETCT+WC / GETXACC / SETQ / SETQ2 / XORO32 / XBYTE must not be executing
- AUGS must not be executing or waiting for a S/# instruction
- AUGD must not be executing or waiting for a D/# instruction
- REP must not be executing or active
- STALLI must not be executing or active
- The cog must not be stalled in any WAITx instruction"

Comments
Yeah, I was going to start a lecture on this until I got to the bottom of your posting. Consequently, restoring Q is only needed if you've overwritten it in your ISR.
The biggest issue arises with the CORDIC. If its pipeline is being utilised for faster compute then that routine needs to hold off interrupts for the duration of the pipelining, which can be a substantial number of clocks.
@rogloh Could we back up and look at this on a higher level? You mentioned using C code, if so then maybe we could modify the compiler or alternatively add a postprocess to the compiler output so that the problematic instructions do not end a block (inserting NOPs if necessary). A postprocessor would have the advantage that it could cope with user assembly too, including inline assembly in the C code.
[Oh, oops, I hadn't understood Roger's intent at all.]
The IRQ-blocking approach is a problem because then the instructions will start to read past the loaded code block.
However, as Eric points out, the correct solution to this is to just not assemble such instructions over the block boundary.
Probably going to want fixed block granular boundaries too, with things like branches and tight loops avoiding sitting across a block-end boundary as much as possible. It's an enhanced version of the above instruction bumping that also assembles extra NOPs to push a branch target into the next block.
EDIT: On the other hand, would there be a feature for multiple contiguous blocks being copied physically adjacent in hubRAM? If so then there are options for grouping blocks, which would avoid inserting spacing NOPs at every block boundary.
Sure, I was thinking of going down this route too, but I first wanted to demo the entire concept with build scripts that modify flexspin's existing P2ASM output files generated from compiled C code, to prove it was achievable before discussing all the gory details and requesting suitable changes to flexC compilation etc. Strangely, I had sort of anticipated my ideas might get shot down before they even had a chance to shine... so your comment here is really refreshing and encouraging.
I already know this idea is achievable from what I did with p2gcc and Micropython ages ago, which at the time obtained what I thought was surprisingly decent performance: something in the ballpark of ~40% of native code speed in the best case. On a fast P2 I think this could be very useful to support in flexC as well, for example when running larger application code such as a fully featured editor app, a local build tool, or a GUI + graphics libraries from external memory. The key to decent performance is assigning a large enough instruction cache to contain most, or ideally all, of the working set of the running application.
So I'll post what I've figured out over the last couple of weeks here as a brain dump, and perhaps you can chime in on what else you think might be possible...
The essence of my external code memory scheme is to leverage a fully associative I-cache in HUB RAM, operating with a FIFO cache replacement policy, that dynamically contains executable P2 code read in from external memory (eg. PSRAM) and then executed in regular HUB-exec mode, just like most flexspin code already is. Right now there is no plan for an external memory D-cache and all application data still remains in HUB, although down the line it might be good to have some indirect methods for accessing large amounts of lookup table or read-only type data held in external memory. I already have external memory APIs that can do this, so it's not too difficult to merge that into the C code as is.
To separate the cacheable functions, the P2 function blocks output by flexspin are divided into two addressable memory regions. Functions considered internal exist below 1MB and those considered external and cacheable are placed above 1MB (it could be split at 512kB too if needed, I've just used 1MB for now). An "orgh $100000" line is placed into the intermediate p2asm file and any C functions marked as "FAR_" are moved after this separator line prior to the final binary being generated from this p2asm source file. Currently I just have a preprocessor macro that prepends "FAR_" to nominated external memory functions, plus a sed script that splits things up and moves them from where they were compiled initially in the p2asm file to after the 1MB separator line. The flexspin toolchain warns about any generated addresses above 512kB, but thankfully it still generates the code at the higher addresses (and simply pads the binary up to 1MB with zeroes prior to generating the far code blocks). Some file compression steps might be desirable down the line to deal with that, but this larger binary could be post-processed, split, and flashed into different flash regions with loadp2. Speaking of which, there is nothing to prevent the external memory image from being stored and "executed" from flash. It would require a different memory driver to read from flash instead of external RAM, but it could still work with the same I-cache scheme I have developed (albeit at lower performance than a PSRAM or HyperRAM based solution, for example).
The proof of concept code just marks these "FAR" C functions using "#define FAR(x) FAR_##x"; the "FAR_" prefix is explicitly searched for by my translation sed scripts and these functions get handled differently. With proper compiler support I thought it would be much nicer to just use another locating attribute in the compiler, such as what is shown below, so it could be tracked internally in the toolchain, potentially generated at the right high addresses in the output file, and possibly dealt with differently during assembly code generation:
int function(int args) __attribute__(ext)

At boot time I needed to tap into the startup code to initialize things, so I just preceded the "call #_main" line in the p2asm file with an earlier call to a special C function that sets up my external memory handler (it spawns my SPIN2 based memory object as another COG), inits the cache structures in HUB RAM, and loads the LUT RAM with the call handlers. It also copies the image out of flash into PSRAM at the appropriate external memory addresses obtained from the original binary, before returning to call the original main() function. Some compilers have the concept of startup code that runs before main executes, but I didn't see it in flexspin so I needed to hack the original p2asm code a little here. I also needed to incorporate my COG RAM variables as well.
The function calls made between the two different memory regions need to be coded differently from each other. We need to replace all call & branch instructions that target external memory with special FAR calls, using indirect calls/trampolines implemented with the callpa instruction. The "pa" register passes the target external memory branch/call address to the special call handler. This also includes indirect calls made using a register, as happens when the function addresses stored in an object's method table are used. Conveniently, the same existing parameter and return stack use is retained in all the functions generated, so nothing else changes there.
All internal COG/HUB memory functions calling other internal COG/HUB functions directly still occur as normal, they're still just implemented using the typical "call #_function" style. So there is no additional call/branch overhead or changes needed there. Same thing happens if an external memory function calls an internal memory function - the current calling code being generated remains the same. In this case it will just return back from internal memory to the I-cache row stored in HUB and continue on from there.
However any internal memory based function calling an external memory function such as "call #_FAR_function" is instead replaced by a callpa instruction to call a special handler that does the far call for it using a "callpa ##_FAR_function, farcallint" instruction to make the call. Similarly any external memory based function calling another external function does the same type of "callpa ##_FAR_function, farcallext" replacement and the handler for that is implemented slightly differently to account for the return address needing to be translated back to the original external memory address and optionally re-cached if its original cache row was overwritten in the meantime.
All of the other indirect function calls made (in both internal and external functions) need to instead call special handlers which test whether the address being called is above the internal/external boundary, and then either branch to the external memory call handler when the address is above the boundary, or to the destination address directly when a low memory function is being called. This is done with a simple test of the "pa" register parameter holding the called address.
Branches within external memory functions also need to be handled, and the address tested to see if it exists in the cache. In this case we have three possibilities:
1) branch address is within the currently executing cache row - we just jump directly to the HUB address mapped within this row
2) branch address exists in some other currently cached row - jump directly to HUB at that mapped row address in the cache (a cache hit)
3) branch address does not exist in the cache (a cache miss) - the oldest cache row entry is dropped and reloaded with the new block contents by transferring a block of data from external memory into HUB at the cache row start.
So we then end up having this group of call/branch handlers being identified and transformed in the source code:
making a far direct call from int->ext memory
call #FAR_xxx -> callpa ##FAR_xxx, farcallint
making a far indirect call from int->ext memory
call reg -> callpa reg, farcallinttest
making a far direct call from ext->ext memory
call #FAR_xxx -> callpa ##FAR_xxx, farcallext
making a far indirect call from ext->ext memory
call reg -> callpa reg, farcallexttest
making a far branch from ext->ext memory
jmp #LR_xxx -> callpa ##LR_xxx, farjmp
Right now branching directly from int->ext memory is not implemented - it should not really ever happen unless setjmp/longjmp is used perhaps. Maybe that is something for later...
For highest speed I store these various call handlers in LUT RAM. From what I recall they consumed something less than 128 longs or so, including the cache handling code. These routines could be stored in HUB if needed, but that would introduce more branch overhead from a couple of extra FIFO reloads, and I want to minimize the overhead. There are also about 20 longs of state and callpa target address values required in COGRAM, which I prepend just before the COG_BSS_START label is defined. Unfortunately the P2 (or flexspin?) does not seem to allow ##S constants with callpa, so it required 5 extra COGRAM longs to hold these addresses. Doing a "callpa ##value, ##address" complains about crossing the COG/HUB boundary or something; it might be offset/absolute address related. I just worked around it by using S registers directly in these calls at the expense of COGRAM - although this saves having dual AUGS+AUGD instruction overheads on all special calls/branches, which is a good thing too.
There are a number of restrictions that currently need to be enforced for the code generated in functions marked as external to still operate OK at their cached addresses. Most of this is probably already the case with typical C-compiled output, but here's a likely incomplete list of what we can't do, assuming we don't know the row boundary at compile time. Note that these are still fine within internal memory functions...
Now this list of restrictions can be reduced further if the compiler knows the cache row size and where the final memory addresses storing the instructions are located (which it would in its final step). For example, if we have a two-instruction ALTx sequence crossing a (say) 128 or 256 byte address boundary (TBD), we could insert a NOP and start the sequence at the beginning of the next row instead. This extra NOP only slows things down very marginally but is nicer than losing the ALTx feature/optimization. Similarly, any local branches known to land in the same cache row would not need to be translated into the special FAR branch handler call instead of the faster regular branch, and instructions such as DJNZ reg, #label could still occur if #label is within the same 128 or 256 byte block as the caller's current address. With my scripts I couldn't easily determine this so I ruled it out, but with compiler support in the assembler it would be much faster.
At one time I thought I might be able to feed the toolchain the modified p2asm code to generate a listing file, then extract the instruction sequences placed at block wrap addresses and work from there, adding NOPs and adjusting function by function before recompiling, but this could get somewhat tedious and a lot of patching might be needed as more local branch addresses get affected in the file. Perhaps it's still doable in a single pass; not 100% sure there.
Some special optimizations that happen in generated code could affect things too so I thought it might be simplest to start out disabling all optimizations for external functions being generated or at least those known to affect things adversely.
Also, we obviously need to handle the block wrapping cases where executed instructions cross cached rows. I do this by keeping extra instruction space in the cache rows in HUB: 12 bytes before the row's instructions are stored, and 4 bytes after the row ends. So for 256 byte cacheable blocks I actually store them in a 272 byte total row size in HUB. A MUL block, #272 instruction is then used to convert from block ID to a row offset that is added to the cache start address. At the end of each cache row in HUB I store a "JMP #\wrap_handler" instruction, set up once at init time. If the executed instructions ever reach the end of the row, this instruction executes and the next incremented block number is checked to see whether it's already cached, or is reloaded if needed. Also, if the last row ended with AUGS/AUGD/SETQ sequences etc., I copy the last 12 bytes at the end of the row into the reserved area before the start of the next row and optionally jump into this region to repeat these instructions before the next row's instructions begin. This is the complication I have to resolve and was the source of the post above. If AUGS/AUGD use is separated by ordinary instructions it is easy to handle by copying a single last instruction directly, but if it's a sequence of more than one of these special instructions it gets more complicated and more address adjustments need to be made. Fortunately some of this can happen while the new block is being read in via the streamer, but some instruction copying has to be done first, before the prior cache row is lost by being overwritten. If some flexspin compiler change could assist with this by introducing NOPs at block boundaries it would be really handy. I will potentially corrupt the current "Q" value with my block transfers of AUGD/AUGS etc., so I probably want to preserve and reload it before resuming after the row wraps, in case the CORDIC needs it.
I doubt the C code does block transfers natively requiring SETQ/SETQ2 use unless the optimizer makes some use of that.
I do have some (untested) sample code already written for this based on my earlier work, which I am still implementing/refining. The FIFO replacement policy I use for row replacement (or "LRC", least recently cached) is very fast to implement and requires no extra state management, unlike LRU/LFU approaches etc. Another fast method is a simple random replacement policy, which I've read is used by some ARM cores. One benefit we have here is that the next cache row to be replaced can be pre-computed while the external memory block is being read in, as we have a small amount of extra time there. As to HUB RAM usage, my simple caching scheme supports up to 255 rows of up to 256 bytes of I-cache (~69kB of HUB RAM including the extra row/tag overheads) and uses a block->cache row mapping table allowing up to 16MB (or more) of memory to be addressed (eg. PSRAM on the P2-Edge). The byte based block->cache row mapping table is also stored in HUB RAM and is scalable: if your application only addressed up to 4MB of RAM and used a 256 byte block size it would only take 16kB of HUB RAM, while 16MB applications would need 64kB, etc. Alternatively, a cache with more than 255 rows is possible (it may be faster with smaller row sizes), but this requires a word based block->row table instead of a byte based one, which doubles its size; that might still be okay with moderate external memory sizes. In any case there is some flexibility here for tuning the cache performance. Gut feeling tells me the sweet spot probably lies around 32-64 instructions per block; this is a tradeoff between the latency of filling cache rows and the working set hit rate. A larger block creates more spatial locality and reduces the wrap overheads at the end of each row, but it takes longer to fill. A smaller block allows more blocks in the I-cache and less time to reload from external memory, but probably needs more blocks loaded per function call.
If my memory driver is used to access memory, the initial latency is on the order of 1us, and the 256 byte transfer takes about the same time at a 256MHz clock, making a total of ~2us per row read. So I've picked 256 bytes for initial testing to try to balance this out. A directly coupled memory driver (also saving a COG) is another option for higher performance, as it would reduce the latency significantly, but it may not share external memory so well with other COGs - especially any video COGs using external memory at the same time (which might be a target application too).
Brain dump complete...