P2 native & AVR CPU emulation with external memory, XBYTE etc
This is following on from a portion of a brief subdiscussion we had earlier in the Console Emulation thread, around here: https://forums.parallax.com/discussion/comment/1533831/#Comment_1533831
I've got my AVR emulator coded and working now using HUB RAM only for code storage. I've been looking into the FIFO wrap interrupt, wondering if I can use it to support external RAM blocks read into HUB RAM: a block would execute fully to the end of the FIFO block, with an interrupt firing whenever the next block needs to be read. However, I can't seem to get it triggering reliably at well defined points.
Attached is some test code that shows the FIFO block wrap interrupt (FBW) triggering somewhere around 52-53 bytes read into the 64 byte block. I've also seen other values depending on extra waitx delays I had introduced into the test code, so there could be some amount of hub window slop involved. It's not especially precise about when it triggers, so for now I've given up on the idea of using it to help the emulation deal with the program counter exceeding the FIFO block size.
What I'm now looking at is counting the PC wrap points manually. This only adds a couple of instructions of overhead to the loop, which isn't too bad if it allows continuous external memory based operation.
@TonyB_ I've also investigated using XBYTE, but once it's all coded and counted I can't see a benefit in clock cycles saved. I need to break into the XBYTE loop to include my new block wrap test and debugger test on each instruction, plus store the low AVR opcode byte separately, and it would also need another instruction per EXECF sequence, which takes up more COG RAM. In both cases the total loop overhead comes to 20 clocks, so any gain XBYTE might have achieved is swallowed up.
To support external memory, my AVR emulation loop will follow something like the code portion below. An I-cache of sorts will be used, so once the working set of blocks is loaded into HUB (assuming it fits), most executed code should continue to be read and executed directly from HUB. Branches, skips, returns, and the other instructions that use two words per opcode need similar handling to set up the new FIFO read address at the correct block offset in HUB. The LPM and SPM instructions, which can also read/write single bytes in the code memory, have to be handled differently as well (though that should be simpler if there is no actual caching, just some extra pre-fetching for the LPM auto-increment cases):
mainloop     rfword  instr               ' read next instruction from FIFO
             incmod  addr, #blocksize    ' increment and wrap on block size (in words)
             tjz     addr, #getnextblock ' branch out to read in next block now
continue     push    return_addr         ' save emulator return address
             getbyte x, instr, #1        ' extract upper opcode portion
             shr     x, #1 wc            ' 7 bit table index
             altd    x, #opcode_table    ' lookup execf sequence in table
             execf   0-0                 ' execute AVR instruction for opcode

opcode_table long    (%0 << 10) + op_nop
             ...

op_nop       ret

getnextblock
             ' TODO: use getptr to find current address and locate the
             ' associated hub block from an address map table
             ' if not present then read new block from ext mem into a free hub
             ' block (or replace oldest block), to act as I-cache
             ' index into hub block base address to setup rdfast
             jmp     #continue
So the hierarchy of execution speed would comprise the following cases:
- case 1) code is read and executed directly from a HUB RAM block (some multiple of 64 bytes)
- case 2) code branches to an address already stored in a HUB RAM block buffer - a branch within the current block is fastest; a different block is slightly slower due to an additional mapping table lookup
- case 3) code branches out to a new block not yet in HUB RAM. It first needs to be pulled in from external memory and stored in a free HUB RAM block and then mapped accordingly in a table. Any older address it replaces needs to be invalidated in the mapping table for that block's start address.
Hopefully once up and running, case 3 will be infrequent, because hitting it often would slow down execution significantly. It's basically a cache miss, and incurs the time penalty of loading a block of (say) 128-256 bytes from external RAM. I haven't decided on the best block size, but given the latency of PSRAM access I'm guessing 128-256 bytes per block might not be a bad size to try, with maybe up to 16kB worth of I-cache blocks. We will see...
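The three cases above can be sketched as a tiny model, with round-robin replacement standing in for whatever scheme is finally used. All the names and sizes here are illustrative, not taken from the actual emulator:

```python
BLOCK_SIZE = 128        # bytes per I-cache block (one of the sizes being considered)
CACHE_ENTRIES = 16      # kept small for the sketch

class BlockCache:
    def __init__(self):
        self.map = {}                        # AVR block number -> cache entry index
        self.owner = [None] * CACHE_ENTRIES  # cache entry index -> AVR block number
        self.next_victim = 0                 # round-robin replacement pointer
        self.hits = 0
        self.misses = 0

    def lookup(self, avr_addr):
        """Return the cache entry holding avr_addr, loading its block on a miss."""
        block = avr_addr // BLOCK_SIZE
        entry = self.map.get(block)
        if entry is not None:                # cases 1 and 2: block already in HUB RAM
            self.hits += 1
            return entry
        self.misses += 1                     # case 3: block must come from external RAM
        entry = self.next_victim
        self.next_victim = (self.next_victim + 1) % CACHE_ENTRIES
        old = self.owner[entry]
        if old is not None:
            del self.map[old]                # invalidate the block being replaced
        self.owner[entry] = block
        self.map[block] = entry
        # (real code would read BLOCK_SIZE bytes from PSRAM into this HUB block here)
        return entry

cache = BlockCache()
cache.lookup(0x100)               # miss: first touch of this block
cache.lookup(0x17E)               # hit: same 128-byte block
print(cache.hits, cache.misses)   # -> 1 1
```

The point of the model is just the two tables: a forward map from AVR block number to cache entry (the hit path), and a reverse map so a replaced block can be invalidated.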
Comments
I've come up with an approach for caching external code and loading it on demand in my emulator. I still have to fully incorporate and test it, but it should do something useful with any luck. It gets called whenever a new block of instructions is needed (eg, wrapping at the end of the current block into a new block) in the main loop sample code posted earlier. Further work is needed for the other branching cases that jump part way into blocks, but that handling will be almost identical: just apply an address offset before the RDFAST call is made, and adjust the addr variable above to track the next block wrap point as well as the active block.
There could be more optimized ways to map blocks to cache entries using addresses instead of byte indexes in the block mapping table, but that would increase the mapping table size by 4x. Right now, for a 256kB target AVR (eg ATmega2560) and a 128 byte block size, the direct mapping table is 2kB using byte indexes to I-cache entries. Using long addresses instead would increase that to 8kB (which may still be okay) and could save one instruction in the most frequently taken cache hit path. However, if the entire 8MB AVR address space were ever supported someday (not sure it ever would be, if it's not already supported by GCC), the mapping table with long addresses and 128 byte blocks would blow out to 256kB instead of 64kB. The only other table I use is indexed by I-cache entry, so we know which block is using which cache entry, and I already use longs for that one. For (say) a 16kB I-cache and 128 byte blocks it needs 16k/128 = 128 entries x 4 bytes = 512 bytes, which is not much of a storage issue on the P2.
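The table-size arithmetic above works out like this (same figures, just spelled out):

```python
KB = 1024
block = 128   # bytes per I-cache block

# 256kB AVR code space (eg ATmega2560), direct mapping table:
entries_256k = 256 * KB // block      # 2048 blocks to map
print(entries_256k * 1)               # 2048  -> 2kB table with byte indexes
print(entries_256k * 4)               # 8192  -> 8kB table with long addresses

# hypothetical full 8MB AVR address space:
entries_8m = 8 * KB * KB // block     # 65536 blocks to map
print(entries_8m * 1)                 # 65536  -> 64kB with byte indexes
print(entries_8m * 4)                 # 262144 -> 256kB with long addresses

# reverse table, one long per cache entry, for a 16kB I-cache:
print((16 * KB // block) * 4)         # 128 entries * 4 bytes = 512 bytes
```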
Sample caching and external memory driver read code is shown below. For lower latency the PSRAM read code could be made part of the emulator COG itself if we wanted exclusive access to the RAM without other COGs sharing it. But for initial testing I'll probably just use my own memory driver mailbox.
There are two issues here: (1) using FIFO and RFxxxx instructions in conjunction with external RAM and (2) using XBYTE.
If (1) works then (2) will work automatically; however, for AVR emulation it seems XBYTE offers no advantage. With hindsight, it would have been easy for XBYTE to optionally extract a bytecode from the high byte of an instruction word, read from hub RAM using RFWORD instead of RFBYTE. One extra XBYTE config bit would be needed, exactly matching bit 0 of the RFBYTE/RFWORD opcode. Something useful for a P2+/P3, I think.
Yes, and I am focusing on (1) now. With another day or two's time applied to it I think I'll probably have something running from external RAM. The XBYTE thing was separate from this, and unfortunately in this case XBYTE didn't help me a lot - I initially thought it might save a few clocks, but I had to break into the loop, which added more. I did at least move to using 7 bit opcodes to index my opcode table, and that freed a lot of spare space which I have needed for the extra external memory access/caching code. I've also added a few other niceties for controlling P2 specific stuff like the I/O and smartpins, the CORDIC, ATN and COG spawning in the AVR IO space, as well as a UART. That part is working quite nicely now and I can run C test programs compiled with GCC targeting the AVR 6 core.
Once I can make it work from external RAM, we would have a way to run large programs (slowly) from external memory with low P2 memory overhead. This could be useful in some applications such as providing GUI style editors/debug tools etc which don't have to run especially fast, and could get large but won't need to take too much P2 space. They could also spawn P2 native COGs to do high speed work. The approach would work for other emulators and possibly one day real P2 code too. We'll see where it goes...
I've got my emulated AVR code running from external PSRAM now. The AVR program space is stored in external RAM, and read into an I-cache on demand where it is executed from HUB RAM blocks using RFWORD. I'm currently testing 128 bytes per cache entry, 32kB total cache but of course this can be altered. The data space for the AVR is held in HUB RAM (up to 64kB).
Using some performance counters I put inside the emulator I am also tracking total instructions executed, total branches, cache hits and cache misses. With this silly test program below I get about 63 P2 clocks per AVR instruction on average including the read overheads. I'm seeing about 11k branches in 81k instructions or around one branch in 7.3 instructions. This result would then yield about 4MIPs at 252MHz.
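A quick sanity check of those figures (same numbers as quoted above):

```python
sysclk = 252_000_000     # P2 clock used for the test
clocks_per_instr = 63    # measured average, including read overheads
mips = sysclk / clocks_per_instr / 1e6
print(mips)                       # -> 4.0 MIPS

instructions = 81_000
branches = 11_000
print(instructions / branches)    # -> ~7.36 instructions per branch
```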
I've mapped a simple polled UART, the CORDIC operations, P2 I/O pins, P2 Smartpins, COG spawning, ATN, P2 counter, PseudoRNG, and LOCK management into the AVR IO space. So in theory you could fully control a P2 from a large AVR GCC program and spawn COGs from data held in the AVR program space etc. A video driver COG's screen buffer memory could be mapped into a portion of AVR external RAM so the AVR could present a console too, which would be quite useful for a large text UI application that doesn't consume too much P2 memory.
Still testing it but it seems to be working now.
The C code I ran above is this:
Here's the secret unoptimized sauce that reads and executes cached code on demand from external PSRAM (with my driver). It targets the 16 bit opcode format of the AVR, but could be changed to handle other CPUs' opcode types.
I'm sunk at EXECF. O_o
The 0-0 just means the instruction is D-patched by the previous altd. I was not including the rest of the AVR code here, just the code that tracks blocks of instructions, but for more clues, here's a bit more:
Great work, Roger. Is a cache miss massively costly? I think Evan's point is the code is quite simple up to and including EXECF but not so much afterwards!
Not really too costly at high P2 clock speeds. The miss just brings in another batch of 128 bytes to HUB (could be smaller or larger). The new block will remain in HUB RAM until it is pushed out, if/when the working set has fully cycled through the total number of cache rows. The latency/execution time for the external read is on the order of 1us for 128 byte transfers at 250MHz, but that buys another 64 instructions, so it wouldn't happen that often. Once the set has been loaded into HUB, selecting it stays much faster, and branches within the currently active block are faster again. I think it should be usable for some (non time-critical) emulator applications.
I'll post the full emulator when it's tidied up and tested a little more...
Update: this first cache implementation is very basic, and its replacement scheme is least recently cached, not proper LFU, LRU, etc. How well that works in practice is TBD, and relies on the working set not exceeding the number of available cache entries very often or for very long. There might be better schemes to use, but at the same time we don't really want to spend a lot of cycles on it, and the incmod wrap scheme I use is crazy simple/fast.
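As a toy illustration of that trade-off, here is least-recently-cached (effectively FIFO eviction, which the incmod wrap gives you) against true LRU on a made-up block reference trace. The trace and cache size are invented; real behaviour depends entirely on the program's working set:

```python
from collections import OrderedDict

def misses_fifo(trace, ways):
    """Evict the oldest-loaded block, regardless of use (least recently cached)."""
    cache, order, miss = set(), [], 0
    for b in trace:
        if b not in cache:
            miss += 1
            if len(cache) == ways:
                cache.discard(order.pop(0))   # evict oldest-loaded block
            cache.add(b)
            order.append(b)
    return miss

def misses_lru(trace, ways):
    """Evict the least recently *used* block."""
    cache, miss = OrderedDict(), 0
    for b in trace:
        if b in cache:
            cache.move_to_end(b)              # refresh recency on a hit
        else:
            miss += 1
            if len(cache) == ways:
                cache.popitem(last=False)     # evict least recently used
        cache[b] = True
    return miss

trace = [0, 1, 2, 0, 3, 0, 4, 0, 1, 2] * 3    # block 0 is "hot", others cycle
print(misses_fifo(trace, 4), misses_lru(trace, 4))   # -> 18 15
```

LRU wins here because the hot block keeps getting evicted under FIFO, but the gap is modest, and FIFO needs only a wrapping counter rather than per-entry recency bookkeeping, which matters when the lookup sits in every block wrap.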
Lol, oops, I didn't even notice the 0-0 there. My O_o is a facial expression. EXECF I know nothing about, and the return to mainloop comment bamboozled me.

The return to mainloop (or debugloop) only happens because of that "push return_addr" instruction prior to the actual EXECF, which itself is basically a jump and a skipf combined into one operation. EXECF is great. Last night during some optimizations I found I can save another 4 cycles with an EXECF instead of JMPs to my branch_handler too.
I've also moved the readword sequence back into the callers and only call loadnextblock when required, which saves another 8 clocks most of the time, instead of the call + return made on each readword operation.
I'm very surprised. Over half of my P2ASM code is preceded by SKIPF or EXECF. Most of the remainder would be too if nested skipping were available, i.e. skipping inside subroutines called from skip sequences.
EXECF the instruction is great, but EXECF the name is not, frankly. JSKIPF or JMPSKP would be more descriptive. The latter leaves the door open for a future CALLSKP for nested skipping.
I guess my focus is generally on combined speed and determinism. Every time I read up on SKIPF I find I'm adding unneeded execution overhead.
And the odd time I think I might have a use for it I then find the inevitable REP loop blows that out of the water.
Yeah, that would have been nice: saving the current EXECF skip sequence on the stack, instead of just keeping one and restoring it for the final return. However, I did just find you can nest the calls/returns more than one level deep within EXECF, which is still really useful. Previously I had thought it was limited to one call deep for some reason, until I discovered that.
Soooo does this mean I will soon be able to run my AVR code made using Bascom AVR on the Prop? :-)
Potentially down the track it might allow something to work, yeah. Keep in mind this does not try to emulate any actual AVR HW peripherals but instead maps some of the P2 features (including IO pins) to the AVR IO space, so actual device drivers and functions expecting real AVR peripherals would not work without changing things to use some P2 replacements where applicable. Right now this emulator just targets the largest ATMega2560/1 devices, but it could be scaled back to the smaller devices with two byte program counters. To date I've focused on compiling GCC programs not BASIC though and it is an experimental exercise to see how emulated code could be made to run with external memory. Ideally we would have a way to run real P2 code this way.
Today I was also looking into the Arduino API and build process and wondering if some of that could ultimately be made to run on the P2 too, as some sort of frankenhybrid P2/Arduino board thing that uses the Arduino API and C/C++ code. It's probably going to be around 4-5x slower than native AVRs though and rather non-deterministic too.
That sounds very interesting, there is so much cool stuff going on since we have external memory !
Yeah it really opens up new possibilities with the P2 including good things for audio, video, emulators, RAM filesystem storage, possible future on board dev tools for P2 coding etc, etc, etc.
Wow, just catching up on all this, you have been busy. This is potentially super useful, even if it doesn't run at full avr speed.
I'm curious what the cog use is like?
It's a single COG emulator and it's getting rather full now, maybe ~40 longs left in both LUT and COG RAM, though some extra space is used for my own debug during development which can be removed in the end. The CPU emulation code itself is fully complete, and the spare space left could either be allocated to a few more IO features, or could contain some internal PSRAM reading code to avoid needing a separate memory driver COG. I do have a second SPIN COG which is used for disassembling code in my debugger portion and decodes opcodes to print instruction names, but this part is not required for normal operation. For smaller AVRs you could use less than 64k RAM, and the whole thing could have quite a small HUB footprint. I actually started out on a version that reads AVR code purely from HUB RAM, and this has recently morphed into one that can read from external RAM (I do still have the original version though).
Here's an idea I had recently for executing native P2 instructions read from external memory while remaining in COGEXEC mode.
It uses 64 bits for each instruction. This doubles code memory space used which is somewhat wasteful but allows reasonably fast execution in COGEXEC mode, 8x slower than native when not branching, and a nice way to deal with instructions that require special handling. The 8x rate is a handy multiple for the HUB window access. The 64 bit P2 code to be executed could be generated by converting normal 32 bits instructions to 64 bits with (hopefully) fairly simple tools. Most normal instructions are just zero extended to 64 bits while the special ones have other upper bits added to "escape" them into 64 bit instructions.
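A conversion tool along those lines could be sketched like this. The escape bit position and the "special" pattern here are invented purely for illustration; a real tool would match genuine P2 opcode encodings:

```python
SPECIAL = {0xDEAD_BEEF}       # placeholder 32-bit patterns the VM must intercept
ESCAPE_BIT = 1 << 32          # nonzero upper bits mark an escaped instruction

def expand(instr32):
    """Zero-extend an ordinary instruction to 64 bits; tag special ones for the VM."""
    if instr32 in SPECIAL:
        return ESCAPE_BIT | instr32
    return instr32            # ordinary: executes as-is in the 64-bit stream

prog32 = [0x1234_5678, 0xDEAD_BEEF]
prog64 = [expand(i) for i in prog32]
print([hex(i) for i in prog64])   # -> ['0x12345678', '0x1deadbeef']
```

The emulator loop then only needs to test the upper 32 bits: zero means "execute directly", nonzero means "dispatch to a handler", which keeps the common path short.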
This approach allows the current program counter to be easily computed vs any variable length instruction schemes that try to conserve memory.
The basic P2 emulator loop is shown below. REP was not used so interrupts could still be supported eventually perhaps.
Importantly the flags are not affected by this loop and are preserved for the application.
We'd like a larger JMP/CALL space with immediates, otherwise we are limited to only 1MB in HUB RAM.
We could potentially include some special additional instructions with 32 bit immediates, like these:
Maybe ### could be used by tools to indicate 32bit constants.
An alternative approach is to keep the executed instructions as 32 bits only and restrict the set of allowed instructions to be executed from external memory so as not to crash the emulator. This could work ok for tools generating code for high-level languages such as C which would only need a subset of the P2 instruction set anyway. We would still need some way to branch in/out of external memory addresses though with (conditional) calls/jumps. They could probably be transformed into local branches that then emulate those instructions.
In fact, you'll need at least two consecutive longs anyway, in order to be able to execute/process sequences like "GETCT WC + GETCT". Other instruction sequences that also block interrupts until some condition is satisfied, but don't strictly rely on consecutive processing, would also work, though the total execution time would not match bare metal - which isn't the objective.
Yes, the consecutive long thing gets rather tricky. I'm also looking at what it takes to execute I-cache rows directly with hub exec, maybe with a
_ret_ NOP
instruction placed at the end of each cache row to allow an emulator return for handling the wrap into the next block. Again, if instructions that need to be consecutive span two disjoint I-cache rows we have a problem, unless the offending last instruction (ALT, SETQ, AUGS etc) is somehow copied/duplicated at the beginning of the next block in some space reserved for this purpose - maybe that could work too.

We'd still need to find a way to escape to perform far calls and jumps at a minimum. I've not figured out a way to do that yet without losing the JMP address information in the instruction itself, unless it is taken from some previously loaded register, which changes the coding a bit more. Eg, this might work, and it could allow conditional branches too, with if_x prefixes on the jmp #farjmp_handler instruction where needed. Far calls would be similar. Other stuff like standalone DJNZ and JATN instructions would still not be possible though.
I'm not certain what the target is here. Is this intended for emulators, to "escape" to P2 code inside emulated instructions to speed them up?
Is it possible to do fast block reads using SETQ/SETQ2 at the same time as the streamer is reading external RAM, to copy code from hub RAM to cog/LUT RAM just after the hub RAM is written by the streamer?
Sure can. Streamer has priority, as in it'll stall the cog to grab the hubRAM databus, but as long as the cog isn't reading where the streamer is writing then no issue. Or more accurately, the FIFO does the cog stalling.
The "escape" part is to allow us to break out of the emulator's instruction processing loop running the code natively and go handle special case instructions that could otherwise crash the emulator. Although emulator is probably a bad description of what this is, if it is running actual P2 code. It's probably more like a type of VM.
It would be great to come up with a scheme that can work without special tool support for generating runnable code but it doesn't look like that is doable. We'd really need other tools that generate this type of code. A special mode in P2GCC could be one approach where far calls and jumps are coded in a way that work with the i-cached external memory.
Also I thought more about the COGRAM handler for a far call and I think the callpa #/D, #/S instruction could be rather useful for jumping out of the emulator/VM in hub exec mode. It would also push the current PC which we extract and push to a real stack and later pop for the return address. The "pa" register could be used to hold the target and this callpa instruction could be still made conditional with if_xxx prefixes etc, and we could also augment the #/D value for creating immediate 32 bit branch target addresses.
Thinking about how this would work for streamer @ sysclk/2. Is it: FIFO writes block of 8 longs one to each slice, do nothing for 8 cycles, write next block etc. Block read stalled during FIFO writes, read 8 longs, stalled again etc.
EDIT:
"FIFO write" = read from FIFO and write to hub RAM
Dunno. There are likely high and low water lines, but no idea if they're a round eight apart.
EDIT: Actually, it's more likely to be sporadic singles, doubles and triples, I guess. The access time is a single tick, but the slot rotation limits when that happens.
EDIT2: I suppose that very slot restriction will create a rhythm like you've described in that sysclk/2 case.
Cases like that seem to be crying out for a pre-processing unit, perhaps by properly expanding the code into the LUT and running it there...