P2 native & AVR CPU emulation with external memory, XBYTE etc

This is following on from a portion of a brief subdiscussion we had earlier in the Console Emulation thread around here.. https://forums.parallax.com/discussion/comment/1533831/#Comment_1533831
I've got my AVR emulator coded and working now using HUB RAM only for code storage, and I've been looking into the FIFO block wrap interrupt, wondering if I can use it to support blocks of external RAM read into HUB RAM and executed right through to the end of the FIFO block, with an interrupt raised whenever the next block needs to be read in. However, I can't seem to get it triggering reliably at well defined points.
Attached is some test code that shows the FIFO block wrap interrupt (FBW) triggering somewhere around 52-53 bytes into the 64 byte block. I've also seen other values depending on extra waitx delays I introduced into the test code, so there could be some amount of hub window slop involved. It's not especially precise about when it triggers, so for now I've given up on the idea of using it to help the emulation deal with the program counter exceeding the FIFO block size.
What I'm now looking at is a way to count the PC wrap points manually. This is only going to add a couple of instructions of overhead to the loop which isn't too bad if it allows continuous external memory based operation.
@TonyB_ I've also investigated the use of XBYTE, but when it's all coded and counted I can't see a benefit in clock cycles. Once I break into the XBYTE loop to include my new block wrap test and the debugger test on each instruction, plus store the low AVR opcode byte separately, it also needs another instruction per EXECF sequence, which takes up more COG RAM. In both cases the total loop overhead appears to be 20 clocks, so any gain XBYTE might have provided has been swallowed up.
To support external memory, my AVR emulation loop is going to follow something like the code portion below. An I-cache of sorts will be used, so once the working set of blocks is loaded into HUB (assuming it fits), most executed code should continue to be read and executed directly from HUB. The branches, skips and returns, and the other instructions that use two words per opcode, need similar handling to set up the new FIFO read address at the correct block offset in HUB. The LPM and SPM instructions, which can also read/write single bytes of the code memory, have to be handled differently as well (though that should be simpler if there is no actual caching, just some extra pre-fetching for the LPM auto-increment cases):
mainloop rfword instr ' read next instruction from FIFO
incmod addr, #blocksize ' increment and wrap on block size (in words)
tjz addr, #getnextblock ' branch out to read in next block now
continue push return_addr ' save emulator return address
getbyte x, instr, #1 ' extract upper opcode portion
shr x, #1 wc ' 7 bit table index
altd x, #opcode_table ' lookup execf sequence in table
execf 0-0 ' execute AVR instruction for opcode
opcode_table
long (%0 << 10) + op_nop
...
op_nop
ret
getnextblock
' TODO: use getptr to find current address and locate the associated hub block from an address map table
' if not present then read new block from ext mem into a free hub block (or replace oldest block), to act as I-cache
' index into hub block base address to setup rdfast
jmp #continue
So the hierarchy of execution speed comprises the following cases:
- case 1) code is read and executed directly from a HUB RAM block (some multiple of 64 bytes)
- case 2) code branches to an address already stored in a HUB RAM block buffer - a branch within the current block is fastest, a branch to a different block slightly slower due to an additional mapping table lookup
- case 3) code branches out to a new block not yet in HUB RAM. It first needs to be pulled in from external memory and stored in a free HUB RAM block and then mapped accordingly in a table. Any older address it replaces needs to be invalidated in the mapping table for that block's start address.
Hopefully, once it's up and running, case 3 will be infrequent, because hitting it often will slow down execution significantly. It's basically a cache miss and incurs the time penalty of loading a block of (say) 128-256 bytes from external RAM. I haven't decided on the best block size, but given the latency of PSRAM access I'm guessing around 128-256 bytes per block might not be a bad size to try, with maybe up to 16kB worth of I-cache blocks. We will see...
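To give a feel for the mapping, here's a rough C sketch of the lookup I have in mind (BLOCK_SIZE, block_map, cache_base and fill_from_psram are illustrative placeholders rather than the actual emulator names):

#include <stdint.h>

#define BLOCK_SIZE  128                         // bytes per i-cache block (could be 64-256)
#define NUM_BLOCKS  (256 * 1024 / BLOCK_SIZE)   // 256kB AVR code space -> 2048 blocks

uint8_t  block_map[NUM_BLOCKS];                 // block number -> cache entry, 0 = not cached
uint8_t *cache_base;                            // start of the i-cache area in HUB RAM

uint8_t fill_from_psram(uint32_t block);        // placeholder: burst-read the block, return its new entry

// Return the HUB address the FIFO should be pointed at for a given AVR byte PC.
// Entries are 1-based so slot 0 of the cache goes unused, matching the 0 = unmapped convention.
uint8_t *code_lookup(uint32_t pc_bytes)
{
    uint32_t block  = pc_bytes / BLOCK_SIZE;    // which block the PC falls in
    uint32_t offset = pc_bytes % BLOCK_SIZE;    // offset within that block
    uint8_t  entry  = block_map[block];
    if (entry == 0)                             // case 3: miss, go fetch from external RAM
        entry = fill_from_psram(block);
    return cache_base + (uint32_t)entry * BLOCK_SIZE + offset;   // cases 1 and 2: hit
}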
Comments
I've come up with an approach for caching external code and loading it on demand in my emulator - I'm still to fully incorporate and test it, but it should do something useful with any luck. It gets called whenever a new block of instructions is needed (eg, wrapping at the end of the current block into a new block) in the main loop sample code posted earlier. Further work is needed for the other branching cases that jump part way into blocks, but that handling will be almost identical: it just needs an address offset applied before the RDFAST call is made, plus an adjustment to the addr variable above for tracking the next block wrap point, and to the active block.
There could be more optimized ways to map blocks to cache entries using addresses instead of byte indexes in the block mapping table, but that increases the mapping table size by 4x. Right now, for a 256k target AVR (eg ATMega2560) and a 128 byte block size, the direct mapping table would be 2kB in size with byte indexes to i-cache entries, but using long addresses instead would increase this to 8kB (which still may be okay) and could save one instruction in the most frequently taken cache hit path. However, if the entire 8MB AVR address space was ever supported someday (not sure it ever would be if not already supported by GCC), this mapping table would blow out to 256kB instead of 64kB with long addresses and 128 byte blocks. The only other table I use gets indexed by i-cache entry, so we know which block is using which cache entry, and I already use longs for that one. For (say) a 16kB i-cache and 128 byte blocks it would need 16k/128 = 128 entries x 32 bits = 512 bytes, which is not much of a storage issue on the P2.
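Putting the same numbers side by side (128 byte blocks assumed throughout):

256kB code space / 128B blocks =  2048 blocks ->  2kB map with byte indexes, or   8kB with long addresses
8MB   code space / 128B blocks = 65536 blocks -> 64kB map with byte indexes, or 256kB with long addresses
16kB  i-cache    / 128B blocks =   128 entries -> 128 x 4 bytes = 512 byte entry-to-block table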
Sample caching and external memory driver read code is shown below. For lower latency, the PSRAM read code could be made part of the emulator COG itself if we wanted exclusive access to the RAM without other COGs sharing it. But for initial testing I'll probably just use my own memory driver mailbox.
getnextblock    incmod  activeblock, maxcodeblocks  ' we just rolled over into the next block
                mov     pb, blockmaptable           ' get block table address that maps blocks to indexes in cache
                add     pb, activeblock             ' add to now active block
                rdbyte  pa, pb wz                   ' read associated block idx in i-cache mem, 0=unmapped
    if_z        jmp     #blocknotpresent            ' handle block not present case
                mul     pa, blocksize               ' multiply by block size
                add     pa, cachebase               ' add to start address of i-cache in HUB
    _ret_       rdfast  #0, pa                      ' setup new FIFO read address

blocknotpresent incmod  nextfreeidx, maxfreeidx     ' roll around to find next block index to use (circular queue)
                mov     pa, nextfreeidx             ' contains the block index in i-cache we will use
                mul     pa, #4                      ' this free block table uses 32 bit entries
                add     pa, freeblockbase           ' add to start address of free block table in HUB
                rdlong  hubaddr, pa wc              ' c flag=1 means free, c=0 means in use by address provided
    if_nc       wrbyte  #0, hubaddr                 ' unmap previous association in block table
                wrlong  pb, pa                      ' associate new block area with map entry using it
                wrbyte  nextfreeidx, pb             ' setup block table association for this block

                ' now go read a block from ext mem
                mov     extaddr, activeblock        ' get external memory source address to read by:
                mul     extaddr, blocksize          ' ...multiplying the new active block by block size
                add     extaddr, extbasereq         ' ...and including the external memory request and offset
                mov     hubaddr, blocksize          ' get hub RAM destination address by:
                mul     hubaddr, nextfreeidx        ' ...multiplying next free block idx by block size
                add     hubaddr, cachebase          ' ...and offsetting to start of i-cache in HUB
                setq    #3-1                        ' transfer 3 longs
                wrlong  extaddr, mailbox            ' send request to driver
poll            rdlong  pa, mailbox wc              ' poll for result
    if_c        jmp     #poll                       ' wait until ready
    _ret_       rdfast  #0, hubaddr                 ' setup new FIFO read address

activeblock     long    0           ' currently active block number in use for reading code (based on emulated PC)
maxcodeblocks   long    0           ' number of blocks to cover emulated address space for code
blockmaptable   long    0           ' start address of block map table in HUB
cachebase       long    0           ' start address of i-cache in HUB
nextfreeidx     long    0           ' next free block idx to use in i-cache
maxfreeidx      long    0           ' maximum number of blocks mappable in i-cache
freeblockbase   long    0           ' start address of free block table in HUB
mailbox         long    0           ' address of mailbox for external memory driver
extaddr         long    0           ' external memory address
hubaddr         long    0           ' hub address
blocksize       long    0           ' size of a block in bytes
extbasereq      long    $B0000000   ' external memory request code for burst read + any offset of AVR code in external memory
There are two issues here: (1) using FIFO and RFxxxx instructions in conjunction with external RAM and (2) using XBYTE.
If (1) works then (2) will work automatically; however, for AVR emulation it seems XBYTE offers no advantage. With hindsight, it would have been easy for XBYTE to optionally extract a bytecode from the high byte of an instruction word, read from hub RAM using RFWORD instead of RFBYTE. One extra XBYTE config bit would be needed, exactly matching bit 0 of the RFBYTE/RFWORD opcode. Something useful for a P2+/P3, I think.
Yes, and I am focusing on working on 1) now. With another day or two applied to it I think I'll probably have something running from external RAM. The XBYTE thing was separate from this, and unfortunately in this case XBYTE didn't help me a lot - I initially thought it might save a few clocks, but I had to break into the loop, which added more. I did at least move to using 7 bit opcodes to index my opcode table, and that freed a lot of spare space which I have needed for the extra external memory access/caching code. I've also added a few other niceties for controlling P2 specific stuff like the I/O and smartpins, the CORDIC, ATN and COG spawning in the AVR IO space, as well as a UART. That part is working quite nicely now and I can run C test programs compiled with GCC targeting the avr6 core.
Once I can make it work from external RAM, we would have a way to run large programs (slowly) from external memory with low P2 memory overhead. This could be useful in some applications such as providing GUI style editors/debug tools etc which don't have to run especially fast, and could get large but won't need to take too much P2 space. They could also spawn P2 native COGs to do high speed work. The approach would work for other emulators and possibly one day real P2 code too. We'll see where it goes...
I've got my emulated AVR code running from external PSRAM now. The AVR program space is stored in external RAM, and read into an I-cache on demand where it is executed from HUB RAM blocks using RFWORD. I'm currently testing 128 bytes per cache entry, 32kB total cache but of course this can be altered. The data space for the AVR is held in HUB RAM (up to 64kB).
Using some performance counters I put inside the emulator I am also tracking total instructions executed, total branches, cache hits and cache misses. With this silly test program below I get about 63 P2 clocks per AVR instruction on average, including the read overheads. I'm seeing about 11k branches in 81k instructions, or around one branch every 7.3 instructions. This result would then yield about 4 MIPS at 252MHz.
( Entering terminal mode. Press Ctrl-] or Ctrl-Z to exit. )
AVR emulator running on COG 2

LED test:
ATN test:
ATN flag is clear
Sending ATN
ATN was received
CORDIC tests:
1000*1000 = 1000000
1000000*12311 = 12311000000
1000/20 = 50
10000000000/33 = 303030303
1000//30 = 10
SQRT(200) = 14
LOG2(1234) = 10.269126676
EXP2(9) = 512
Counter test:
COUNTER32 = 90604913
COUNTER32 = 90762371
COUNTER32 = 90915203
COUNTER32 = 91068035
COUNTER32 = 91220867
COUNTER32 = 91373699
COUNTER32 = 91526531
COUNTER32 = 91679363
COUNTER32 = 91832195
COUNTER32 = 91985027
Pseudo Random Number Generator test:
RANDOM32 = d82ff183
RANDOM32 = 7f2ae15d
RANDOM32 = 72146b34
RANDOM32 = 5e96abfb
RANDOM32 = ca1779fd
RANDOM32 = 8e10dabc
RANDOM32 = 10c078e1
RANDOM32 = 6b118797
RANDOM32 = 6c8233bc
RANDOM32 = 9eca37c3
INSTRUCTIONS = 81964 ticks=5200058 ticks/instr=63
BRANCHES = 11186
HITS = 11977
MISSES = 29
Press <space> to stop COG, other keys echo:
Stopping
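For what it's worth, the headline numbers above cross-check as simple arithmetic on the printed counters:

ticks/instr = 5200058 / 81964      ~= 63.4 P2 clocks per AVR instruction
MIPS        = 252000000 / 63.4     ~= 4.0 million AVR instructions/second
branch rate = 81964 / 11186        ~= one branch every 7.3 instructions
hit rate    = 11977 / (11977 + 29) ~= 99.8% of block lookups found in the i-cache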
I've mapped a simple polled UART, the CORDIC operations, P2 I/O pins, P2 Smartpins, COG spawning, ATN, P2 counter, PseudoRNG, and LOCK management into the AVR IO space. So in theory you could fully control a P2 from a large AVR GCC program and spawn COGs from data held in the AVR program space etc. A video driver COG's screen buffer memory could be mapped into a portion of AVR external RAM so the AVR could present a console too, which would be quite useful for a large text UI application that doesn't consume too much P2 memory.
Still testing it but it seems to be working now.
The C code I ran above is this:
#include <avr/io.h>

#define _SFR_IO32(io_addr) _MMIO_DWORD(((io_addr) + __SFR_OFFSET))

#undef DIRA
#undef DIRB
#undef PORTA
#undef PORTB
#undef PINA
#undef PINB

#define DIRA      _SFR_IO32(0x0)
#define DIRB      _SFR_IO32(0x4)
#define PORTA     _SFR_IO32(0x8)
#define PORTB     _SFR_IO32(0xC)
#define PINA      _SFR_IO32(0x10)
#define PINB      _SFR_IO32(0x14)
#define XVAL      _SFR_IO32(0x18)
#define YVAL      _SFR_IO32(0x1C)
#define COUNTER   _SFR_IO8(0x20)
#define COUNTER32 _SFR_IO32(0x20)
#define RANDOM    _SFR_IO8(0x24)
#define RANDOM32  _SFR_IO32(0x24)
#define HUB       _SFR_IO32(0x28)
#define CORDICOP  _SFR_IO8(0x2C)
#define COGINIT   _SFR_IO8(0x2D)
#define SMARTPIN  _SFR_IO8(0x2E)
#define SMARTOP   _SFR_IO8(0x2F)
#define LOCKID    _SFR_IO8(0x30)
#define LOCK      _SFR_IO8(0x31)
#define COGATN    _SFR_IO8(0x32)
#define COGID     _SFR_IO8(0x33)
#define PERF      _SFR_IO8(0x34)
#define UDR       _SFR_IO8(0x35)
#define USR       _SFR_IO8(0x36)
#define AVAL      _SFR_MEM32(0x60)
#define BVAL      _SFR_MEM32(0x64)
#define CVAL      _SFR_MEM32(0x68)

#define MULTIPLY 0 // Y:X = A*B
#define DIVIDE   1 // X=(0:A)/B, Y=(0:A)%B or X=(C:A)/B, Y=(C:A)%B
#define FRAC     2 // X=(A:0)/B, Y=(A:0)%B or X=(A:C)/B, Y=(A:C)%B
#define ROTATE   3 // X,Y=rotate(A,0) by B or (A,C) by B
#define VECTOR   4 // (A, B) into length X, angle Y
#define SQRT     5 // X=sqrt(B:A)
#define LOG      6 // X=log(A), result in 5:27 unsigned bit format
#define EXP      7 // X=exp(A), A in 5:27 bit unsigned format
#define CPARAM   8 // ADD/OR this value in to include the C PARAMETER as setq data in CORDIC, instead of treating as 0

#define CORDIC1(op, a)          (AVAL=(a), CORDICOP=((op) & ~CPARAM), XVAL)
#define CORDIC2(op, a, b)       (AVAL=(a), BVAL=(b), CORDICOP=((op) & ~CPARAM), XVAL)
#define CORDIC3(op, c, a, b)    (AVAL=(a), BVAL=(b), CVAL=(c), CORDICOP=((op) | CPARAM), XVAL)
#define CORDIC1_Y(op, a)        (AVAL=(a), CORDICOP=((op) & ~CPARAM), YVAL)
#define CORDIC2_Y(op, a, b)     (AVAL=(a), BVAL=(b), CORDICOP=((op) & ~CPARAM), YVAL)
#define CORDIC3_Y(op, c, a, b)  (AVAL=(a), BVAL=(b), CVAL=(c), CORDICOP=((op) | CPARAM), YVAL)
#define CORDIC1_64(op, a)       (AVAL=(a), CORDICOP=((op) & ~CPARAM), *(unsigned long long *)&XVAL)
#define CORDIC2_64(op, a, b)    (AVAL=(a), BVAL=(b), CORDICOP=((op) & ~CPARAM), *(unsigned long long *)&XVAL)
#define CORDIC3_64(op, c, a, b) (AVAL=(a), BVAL=(b), CVAL=(c), CORDICOP=((op) | CPARAM), *(unsigned long long *)&XVAL)

#define NEWCOG       16
#define NEWCOG_PAIR  17
#define HUBEXEC      32
#define HUBEXEC_PAIR 33

#define COGSTART(id, a, b) (PINA=(a), PINB=(b), COGINIT=(id), 0, COGINIT)
#define COGSTOP(x)  (COGID=x)
#define HUBOP(x)    (HUB=x)
#define ATN(mask)   COGATN=mask
#define WAITATN()   while (COGATN == 0) do { }
#define POLLATN()   (COGATN!=0)

#define SMART_AK 0
#define SMART_WX 1
#define SMART_WY 2
#define SMART_WR 3

#define AKPIN(x, pin) do { SMARTPIN = (pin); PINA = (x); SMARTOP = SMART_AK; } while(0);
#define WXPIN(x, pin) do { SMARTPIN = (pin); PINA = (x); SMARTOP = SMART_WX; } while(0);
#define WYPIN(x, pin) do { SMARTPIN = (pin); PINA = (x); SMARTOP = SMART_WY; } while(0);
#define WRPIN(x, pin) do { SMARTPIN = (pin); PINA = (x); SMARTOP = SMART_WR; } while(0);
#define RDPIN(pin)    (SMARTPIN=(pin), SMARTPIN)
#define RQPIN(pin)    (SMARTPIN=(pin), SMARTOP)

#undef RXC
#undef UDRE
#define RXC  7
#define UDRE 5

#include <stdio.h>

// uart rx for stdin
static int p2getch(FILE *stream)
{
    while (!(USR & (1<<RXC)));
    return (unsigned char)UDR;
}

// uart tx for stdout
static int p2putch(char ch, FILE *stream)
{
    if (ch == '\n') // LF->CRLF
        p2putch('\r', stream);
    while (!(USR & (1<<UDRE)));
    UDR = (unsigned char)ch;
    return 0;
}

// IO mapping
static FILE p2io = FDEV_SETUP_STREAM(p2putch, p2getch, _FDEV_SETUP_RW);

// There is no support in AVR-LIBC for printing 64 bit values with printf,
// so here's a slow way to do it, just for testing...
void print_uint64_dec(uint64_t u)
{
    if (u >= 10) {
        uint64_t udiv10 = u/10;
        u %= 10;
        print_uint64_dec(udiv10);
    }
    putchar((int)u + '0');
}

void wait(count) //temporary hack for a delay until I add a WAITX
{
    uint8_t x;
    uint8_t last=COUNTER;
    while(1) {
        x = COUNTER;
        if (x < last) {
            if (count-- == 0)
                return;
        }
        last = x;
    }
}

uint32_t clk1;
uint32_t clk2;
uint32_t insn1;
uint32_t insn2;
uint32_t branches1;
uint32_t hits1;
uint32_t misses1;

void main(void)
{
    uint8_t x;
    uint32_t result;

    clk1 = COUNTER32;
    PERF=0;             // read instruction count
    insn1 = XVAL;
    PERF=1;             // read branch count
    branches1 = XVAL;
    PERF=2;             // read cache hits
    hits1 = XVAL;
    PERF=3;             // read cache misses
    misses1 = XVAL;

    // map STDIO to UART
    stdout = &p2io;
    stdin = &p2io;

    printf("AVR emulator running on COG %d\n", COGID);

    puts("\nLED test:");
    // LIGHT up both LEDs on P2Edge32MB
    DIRA |= 0xC0000;
    PORTA |= 0xC0000;

    puts("ATN test:");
    POLLATN();          // clear any old ATN mask
    if (POLLATN())
        puts("ATN flag is not clear");
    else
        puts("ATN flag is clear");
    puts("Sending ATN");
    ATN(1<<COGID);      // send ourselves an ATN
    if (POLLATN())
        puts("ATN was received");
    else
        puts("ATN was not received");

    puts("CORDIC tests:");
    printf("1000*1000 = %lu\n", CORDIC2(MULTIPLY, 1000, 1000));
    printf("1000000*12311 = ");
    print_uint64_dec(CORDIC2_64(MULTIPLY, 1000000, 12311));
    puts("");
    printf("1000/20 = %lu\n", CORDIC2(DIVIDE, 1000, 20));
    printf("10000000000/33 = %lu\n", CORDIC3(DIVIDE, (unsigned long)(10000000000LL>>32), (unsigned long)(10000000000LL),33));
    printf("1000//30 = %lu\n", CORDIC2_Y(DIVIDE, 1000, 30));
    printf("SQRT(200) = %lu\n", CORDIC2(SQRT, 200, 0));
    result=CORDIC1(LOG,1234);
    printf("LOG2(1234) = %lu.", result>>27);
    result = XVAL & 0x7ffffff;
    result = CORDIC2(MULTIPLY, result, 1000000000);
    result = CORDIC3(DIVIDE, YVAL, XVAL, 1L<<27);
    printf("%09lu\n", result);
    printf("EXP2(9) = %lu\n", CORDIC1(EXP, 9L<<27));

    puts("Counter test:");
    for (x = 0; x<10; x++)
        printf("COUNTER32 = %lu\n", COUNTER32);

    puts("Pseudo Random Number Generator test:");
    for (x = 0; x<10; x++)
        printf("RANDOM32 = %08lx\n", RANDOM32);

    PERF=0;             // read instruction count
    insn2 = XVAL;
    clk2 = COUNTER32;   // read P2 ticks
    printf("INSTRUCTIONS = %lu ticks=%lu ticks/instr=%lu\n", insn2-insn1, clk2-clk1, (clk2-clk1)/(insn2-insn1));
    PERF=1;             // read branch count
    printf("BRANCHES = %lu\n", XVAL-branches1);
    PERF=2;             // read cache hits
    printf("HITS = %lu\n", XVAL-hits1);
    PERF=3;             // read cache misses
    printf("MISSES = %lu\n", XVAL-misses1);

    puts("Press <space> to stop COG, other keys echo:");
    while(1) {
        if ((x=getchar()) == 32) {
            puts("\nStopping\n");
            wait(10);
            COGSTOP(COGID);
        }
        else
            putchar(x);
    }
}
Here's the secret unoptimized sauce that reads and executes cached code on demand from external PSRAM (with my driver). It targets the 16 bit opcode format of the AVR but could be changed to handle other CPUs' opcode types.
' The main emulator and optional debugging loops

debugloop       getptr  addr                        ' save HUB FIFO position
                call    debugger_base               ' run debugger (hub EXEC)
                rdfast  #0, addr                    ' restore HUB FIFO and fall through to normal processing
mainloop        rfword  instr                       ' read next instruction from HUB FIFO
                incmod  blockoffset, #BLOCKWRAP wz
    if_z        call    #loadnextblock              ' wrap to next block when we reach the end
                push    return_addr                 ' push emulator return address (mainloop or debugloop)
                getbyte d, instr, #1                ' extract upper opcode portion
                shr     d, #1 wc                    ' only use 7 bits for table index, retain lsb in C flag
                altd    d, #opcode_table            ' lookup opcode in table
                execf   0-0                         ' execute AVR instruction for opcode

branch_handler  add     branches, #1                ' FOR PERFORMANCE TESTING ONLY! remove in final
                subr    lastbranchstart, blockoffset wc ' FOR PERF ONLY!
                add     instructions, lastbranchstart   ' FOR PERF ONLY!
    if_c        add     instructions, #BLOCKSIZE/2  ' FOR PERF ONLY!
                mov     blockoffset, addr           ' get branch address
                and     blockoffset, #BLOCKMASK>>1  ' ...to compute offset (in words, later adjusted to bytes)
                mov     lastbranchstart, blockoffset ' FOR PERF ONLY!
                mov     activeblock, addr           ' get branch address
                shr     activeblock, #BLOCKSHIFT-1  ' ...to compute next active block number
                jmp     #readblock                  ' continue to read the next block

readword        rfword  addr                        ' reads next word from FIFO
                incmod  blockoffset, #BLOCKWRAP wz  ' detect block wrap around
    if_nz       ret                                 ' if none then return

loadnextblock   incmod  activeblock, lastcodeblock  ' we just rolled over into the next block
                subr    lastbranchstart, #BLOCKSIZE/2   ' FOR PERF ONLY!
                add     instructions, lastbranchstart   ' FOR PERF ONLY!
                mov     lastbranchstart, #0         ' FOR PERF ONLY!
readblock       mov     pb, blockmaptable           ' get start of block mapping table that maps blocks to entries in cache
                add     pb, activeblock             ' add this address to the active block to index this table
                rdbyte  hubaddr, pb wz              ' find the i-cache memory index associated with this block, 0=unmapped
    if_z        jmp     #blocknotpresent            ' go handle the block not present case, if required
                add     cachehits, #1               ' FOR PERF ONLY!
                mul     hubaddr, #BLOCKSIZE         ' multiply cache entry # by block size in bytes (transfer size)
                add     hubaddr, cachebase          ' add to start address of i-cache in HUB RAM
                add     hubaddr, blockoffset        ' add offset needed for branch cases, harmless in non-branch wrap case
                add     hubaddr, blockoffset        ' add offset needed for branch cases, harmless in non-branch wrap case
    _ret_       rdfast  #0, hubaddr                 ' setup new FIFO read address and return

blocknotpresent incmod  nextfreeidx, maxfreeidx     ' roll around to find next free cache index to use (circular queue)
                add     cachemisses, #1             ' FOR PERF ONLY!
                mov     pa, nextfreeidx             ' pa contains the index in i-cache we will use
                add     pa, #1                      ' use one based cache entries
                mov     hubaddr, pa                 ' save for later
                wrbyte  pa, pb                      ' associate new cache entry with this block
                mul     pa, #4                      ' this free block table uses 32 bit entries, so x4 to get table offset
                add     pa, freeblocktable          ' add this to the start address of the free block table in HUB
                rdlong  lastaddr, pa wc             ' lookup block using this cache entry, c=1 if free, c=0 in use by address provided
    if_nc       wrbyte  #0, lastaddr                ' unmap previous block to cache entry association in block table
                wrlong  pb, pa                      ' associate new block map address with this cache entry
                mov     extaddr, activeblock        ' get external memory source address to read, by:
                mul     extaddr, #BLOCKSIZE         ' ...multiplying the new active block by block size in bytes
                add     extaddr, extrdburstreq      ' ...and including the external memory request and offset
                mul     hubaddr, #BLOCKSIZE         ' multiply next free block idx by block size in bytes
                add     hubaddr, cachebase          ' ...and offset to start of i-cache in HUB
                mov     transfersize, #BLOCKSIZE    ' setup number of bytes to read
                setq    #3-1                        ' transfer 3 longs
                wrlong  extaddr, mailbox            ' send request to external memory driver
poll            rdlong  pa, mailbox wc              ' check for result
    if_c        jmp     #poll                       ' retry until ready
                add     hubaddr, blockoffset        ' add offset needed for branch case
                add     hubaddr, blockoffset        ' add offset needed for branch case
    _ret_       rdfast  #0, hubaddr                 ' setup new FIFO read address
I'm sunk at EXECF. O_o
The 0-0 just means the instruction is D-patched by the previous altd. I was not including the rest of the AVR code here, just the code that tracks blocks of instructions, but for more clues, here's a bit more:
opcode_table    long    (%0 << 10) + op_nop_movw    ' 0000_000x 00-01
                ...

op_nop_movw
    if_nc       ret                                 ' nop case
                getnib  s, instr, #0                ' get s index
                altgw   s, #regbase                 ' lookup register
                getword d1, 0-0, #0                 ' get register pair
                getnib  d, instr, #1                ' get d index
                altsw   d, #regbase                 ' lookup register
    _ret_       setword 0-0, d1, #0                 ' set register pair
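To show what the table dispatch boils down to, here's roughly the equivalent in C (hypothetical names, assuming the 128-entry table has been filled with handlers - this is just to illustrate how the 7 bit index and the saved lsb come out of a 16 bit AVR opcode):

#include <stdint.h>

typedef void (*op_handler_t)(uint16_t instr, int lsb);

op_handler_t opcode_table[128];          // 128 entries, indexed by the top 7 opcode bits

void dispatch(uint16_t instr)
{
    uint8_t hi  = (uint8_t)(instr >> 8); // getbyte d, instr, #1
    int     lsb = hi & 1;                // shr d, #1 wc keeps this bit in the C flag
    uint8_t idx = hi >> 1;               // 7 bit table index
    opcode_table[idx](instr, lsb);       // altd + execf: run the handler for this opcode
}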
Great work, Roger. Is a cache miss massively costly? I think Evan's point is the code is quite simple up to and including EXECF but not so much afterwards!
Not really too costly at high P2 clock speeds. A miss just brings in another batch of 128 bytes to HUB (could be smaller or larger). The new block then remains in HUB RAM until it is pushed out, if/when the working set has fully cycled through the total number of cache rows. The latency/execution time for the external read is in the order of 1us for 128 byte transfers at 250MHz, but a block holds another 64 instructions so it wouldn't happen that often. Once the working set has been loaded into HUB it's much faster to select a block, and branches within the existing active block are faster again. I think it should be usable for some (non time-critical) emulator applications.
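As a rough back-of-envelope at 250MHz, using the ~63 clocks per AVR instruction figure from the earlier test:

miss penalty      ~= 1us ~= 250 clocks to fetch one 128 byte block from PSRAM
block contents     = 128 bytes / 2 = 64 AVR instructions
straight-line run  = 64 x 63 ~= 4000 clocks (~16us) of execution per block

So even missing on every single block only adds roughly 6% on sequential code; it's tight loops that keep branching between uncached blocks that would really hurt.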
I'll post the full emulator when it's tidied up and tested a little more...
Update: this first cache implementation is very basic and its replacement scheme is least recently cached, not proper LFU, LRU etc. How well that works in practice is TBD and relies on the working set not exceeding the number of available cache entries very often or for very long. There might be better schemes to use, but at the same time we don't really want to spend a lot of cycles on this, and the incmod wrap scheme I use is crazy simple/fast.
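In C terms the "least recently cached" policy is nothing more than a wrapping allocation pointer, roughly like this (an illustrative sketch only, not the emulator's actual data structures):

#include <stdint.h>

#define NUM_ENTRIES 128                     // e.g. 16kB i-cache / 128 byte blocks
#define NUM_BLOCKS  2048                    // e.g. 256kB code space / 128 byte blocks

static uint8_t  block_map[NUM_BLOCKS];      // block -> cache entry (1-based), 0 = unmapped
static uint32_t entry_owner[NUM_ENTRIES];   // cache entry -> owning block + 1, 0 = free
static uint32_t next_victim;                // circular replacement pointer (the incmod in PASM)

static uint8_t alloc_entry(uint32_t block)
{
    next_victim = (next_victim + 1) % NUM_ENTRIES;  // advance and wrap, incmod style
    uint32_t old = entry_owner[next_victim];
    if (old != 0)
        block_map[old - 1] = 0;             // invalidate whichever block used to live here
    entry_owner[next_victim] = block + 1;   // record the new owner
    block_map[block] = (uint8_t)(next_victim + 1);  // 1-based so 0 can mean "unmapped"
    return (uint8_t)(next_victim + 1);
}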
Lol, oops, I didn't even notice the 0-0 there. My O_o is a facial expression. EXECF I know nothing about and the returning to mainloop comment bamboozled me.

The return to mainloop (or debugloop) only happens because of that "push return_addr" instruction prior to the actual EXECF, which itself is basically a jump and skipf combined into one operation. EXECF is great. Last night during some optimizations I found I can save another 4 cycles with an EXECF instead of JMPs to my branch_handler too.
I've also moved the readword sequence back into the callers and now only call loadnextblock when required, which saves another 8 clocks most of the time compared with the call + return made on each readword operation.
I'm very surprised. Over half of my P2ASM code is preceded by SKIPF or EXECF. Most of the remainder would be too if nested skipping were available, i.e. skipping inside subroutines called from skip sequences.
EXECF the instruction is great but EXECF the name is not, frankly. JSKIPF or JMPSKP are more descriptive. The latter leaves the door open for a future CALLSKP for nested skipping.
I guess my focus is generally on combined speed and determinism. Every time I read up on SKIPF I find I'm adding unneeded execution overhead.
And the odd time I think I might have a use for it I then find the inevitable REP loop blows that out of the water.
Yeah that would have been nice, to have saved the current EXECF skip sequence on the stack instead of just keeping one and restoring it for the final return. However I did just find you could nest the calls/returns more than one level deep within EXECF which is still really useful. Previously I had thought it was only limited to one call deep for some reason until I discovered that.
Soooo does this mean I will soon be able to run my AVR code made using Bascom AVR on the Prop? :-)
Potentially down the track it might allow something to work, yeah. Keep in mind this does not try to emulate any actual AVR HW peripherals but instead maps some of the P2 features (including IO pins) to the AVR IO space, so actual device drivers and functions expecting real AVR peripherals would not work without changing things to use some P2 replacements where applicable. Right now this emulator just targets the largest ATMega2560/1 devices, but it could be scaled back to the smaller devices with two byte program counters. To date I've focused on compiling GCC programs not BASIC though and it is an experimental exercise to see how emulated code could be made to run with external memory. Ideally we would have a way to run real P2 code this way.
Today I was also looking into the Arduino API and build process and wondering if some of that could ultimately be made to run on the P2 too, as some sort of frankenhybrid P2/Arduino board thing that uses the Arduino API and C/C++ code. It's probably going to be around 4-5x slower than native AVRs though and rather non-deterministic too.
That sounds very interesting, there is so much cool stuff going on since we have external memory!
Yeah it really opens up new possibilities with the P2 including good things for audio, video, emulators, RAM filesystem storage, possible future on board dev tools for P2 coding etc, etc, etc.
Wow, just catching up on all this, you have been busy. This is potentially super useful, even if it doesn't run at full avr speed.
I'm curious what the cog use is like?
It's a single COG emulator and it's getting rather full now - maybe ~40 longs left over across both LUT and COG RAM, though some extra space is used for my own debug during development, which can be removed in the end. The CPU emulation code itself is fully complete, and the spare space left could either be allocated to a few more IO features, or could contain some internal PSRAM reading code to avoid needing a separate memory driver COG. I do have a second SPIN COG which is used for disassembling code in my debugger portion and decodes opcodes to print instruction names, but this part is not required for normal operation. For smaller AVRs you could use less than 64k RAM and the whole thing could have quite a small HUB footprint. I actually started out on a version that reads AVR code purely from HUB RAM, and this has recently morphed into one that can read from external RAM (I do still have the original version though).
Here's an idea I had recently for executing native P2 instructions read from external memory while remaining in COGEXEC mode.
It uses 64 bits for each instruction. This doubles the code memory space used, which is somewhat wasteful, but allows reasonably fast execution in COGEXEC mode - 8x slower than native when not branching - and gives a nice way to deal with instructions that require special handling. The 8x rate is a handy multiple for the HUB window access. The 64 bit P2 code to be executed could be generated by converting normal 32 bit instructions to 64 bits with (hopefully) fairly simple tools. Most normal instructions are just zero extended to 64 bits, while the special ones have other upper bits added to "escape" them into 64 bit instructions.
This approach allows the current program counter to be easily computed vs any variable length instruction schemes that try to conserve memory.
The basic P2 emulator loop is shown below. REP was not used so that interrupts could perhaps still be supported eventually.
Importantly the flags are not affected by this loop and are preserved for the application.
loop        rflong  instr               ' read P2 instruction to be executed
            rflong  esc                 ' read as 64 bits
            incmod  x, #BLOCKSIZE-1     ' update partial PC at end of block
            tjz     x, #nextblock       ' handle wrap case
testesc     tjnz    esc, #escape        ' handle escaped/64 bit instructions
instr       nop     {patched}           ' instruction to be executed
            jmp     #loop               ' restart sequence

esc         long    0                   ' stores high 32 bits of 64 bit instruction
x           long    0                   ' offset in block
block       long    0                   ' block number being executed

nextblock
            ' go read next block from i-cache/extmem and update the PC
            ' also compute the hub address needed for FIFO reads from the i-cache entry for this PC
            incmod  block, #LASTBLOCK   ' increment block number (PC = (block*BLOCKSIZE + x) * 4)
            ...
            rdfast  #0, hubaddr
            jmp     #testesc            ' continue executing instruction

escape
            ' - Special handling for "escaped" instructions would be done here.
            ' - This uses the data read in the "esc" long to branch off and handle any instructions that
            '   affect the pipeline or change/require the PC like the ones listed below.
            ' - Some instructions may not be supported and could be ignored as NOPs or trigger exceptions?
            ' - The JMPs/CALLs will be pretty easy to emulate, would just need to lookup next block in i-cache
            '   and update FIFO read position. We do need to handle both S/#S cases and #S would jump into
            '   the emulator COG so that might need to be prevented.
            ' - ALTxxx and AUGx may need a temporary pipeline created by copying this instruction and the
            '   next one into some sequential COG addresses before executing both of them together.
            '   How many we could do in a row might need to be restricted.
            ' - Any _ret_ instructions need to be escaped as well to update the PC after executing them.
{
  jmp
  call
  jxx
  jnxx
? rep       ' ? = may not be able to support these, or need to find some way to do them
? skipf
? skip
? execf
* rfxx      ' * we would want to prevent these as they affect the FIFO we are using
* rfvar[s]
* wfxx
* rdfast
* wrfast
* fblock
  djnz etc
  ijnz etc
  tjnz etc
  altxx etc
p scas      ' p = affects pipeline
p sca
p xoro
  setq      ' may need to be back to back in some cases for burst transfers
  calla
  callb
  callpa
  callpb
  calld
  jmprel
  ret
  reta
  retb
  augd
  augs
  loc
  _ret_
}
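The 32-to-64 bit expansion itself could plausibly be done offline by a small post-processing tool, something like the sketch below (needs_escape() and the escape encoding are placeholders I've made up for illustration, not a worked-out format):

#include <stdint.h>
#include <stdio.h>

// Placeholder classifier: would return nonzero for branches, FIFO ops, ALTx/AUGx,
// SKIP/EXECF, _ret_ forms etc. that need special handling at run time.
static int needs_escape(uint32_t insn)          { (void)insn; return 0; }

// Placeholder encoding of the upper 32 bits that routes to the escape handler.
static uint32_t escape_code_for(uint32_t insn)  { (void)insn; return 1; }

// Expand a stream of 32 bit P2 instructions into 64 bit pairs: ordinary instructions
// are zero extended, special ones get a nonzero upper long so the run-time loop's
// "tjnz esc, #escape" test traps them.
int expand(FILE *in, FILE *out)
{
    uint32_t insn;
    while (fread(&insn, sizeof insn, 1, in) == 1) {
        uint32_t pair[2] = { insn, 0 };
        if (needs_escape(insn))
            pair[1] = escape_code_for(insn);
        if (fwrite(pair, sizeof pair[0], 2, out) != 2)
            return -1;
    }
    return 0;
}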
We'd like a larger JMP/CALL space with immediates, otherwise we are limited to only 1MB in HUB RAM.
We could potentially include some special additional instructions with 32 bit immediates like these:
Maybe ### could be used by tools to indicate 32bit constants.
An alternative approach is to keep the executed instructions as 32 bits only and restrict the set of instructions allowed to execute from external memory, so as not to crash the emulator. This could work OK for tools generating code from high-level languages such as C, which only need a subset of the P2 instruction set anyway. We would still need some way to branch in/out of external memory addresses though, with (conditional) calls/jumps. They could probably be transformed into local branches that then emulate those instructions.
In fact, you'll need at least two consecutive longs anyway, in order to be able to execute/process sequences like "GETCT WC + GETCT". Other instruction sequences that also block interrupts until some condition is satisfied, but don't strictly rely on the "consecutive processing" concept, would also work, though the total execution time would not match bare metal, which isn't the objective.
Yes, the consecutive long thing gets rather tricky. I'm also looking at what it takes to execute i-cache rows directly with hub exec, maybe with a
_ret_ NOP
instruction placed at the end of each cache row to allow a return to the emulator for handling the wrap into the next block. Again, if instructions that need to be consecutive span two disjoint i-cache rows we have a problem, unless the offending last instruction (alt, setq, augs etc) is somehow copied/duplicated at the beginning of the next block in some reserved space for this purpose - maybe that could work too. We'd still need to find a way to escape to perform far calls and jumps at a minimum. I've not figured out a way to do that yet without losing the JMP address information in the instruction itself, unless it is taken from some prior loaded register, which changes the coding a bit more. Eg, this might work, and it could allow conditional branches too, with if_xx prefixes on the jmp #farjmp_handler instruction where needed. Far calls would be similar. Other stuff like standalone djnz and jatn instructions would still not be possible though.
            mov     farjmp, ##branch_target_addr    ' uses augs where required
            jmp     #farjmp_handler

farjmp_handler
            ' check farjmp address value and branch accordingly (to HUB RAM, or i-cache ROW)
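In other words, the handler only has to classify the target address and pick a path; conceptually something like this (hypothetical names, reusing the block lookup idea sketched earlier):

#include <stdint.h>

#define HUB_TOP 0x80000                         // 512kB of HUB RAM on the P2

extern uint8_t *code_lookup(uint32_t addr);     // block/i-cache lookup as sketched earlier

// Resolve a far branch target to a real HUB address the COG can jump to:
// plain HUB addresses run via hubexec, external addresses run from their i-cache row.
uint32_t resolve_far_target(uint32_t target)
{
    if (target < HUB_TOP)
        return target;                          // already in HUB RAM
    return (uint32_t)(uintptr_t)code_lookup(target);    // map external address to its cache row
}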
I'm not certain what the target is here. Is this intended for emulators, to "escape" to P2 code inside emulated instructions to speed them up?
Is it possible to do fast block reads using SETQ/SETQ2 at the same time as the streamer is reading external RAM, to copy code from hub RAM to cog/LUT RAM just after the hub RAM is written by the streamer?
Sure can. Streamer has priority, as in it'll stall the cog to grab the hubRAM databus, but as long as the cog isn't reading where the streamer is writing then no issue. Or more accurately, the FIFO does the cog stalling.
The "escape" part is to allow us to break out of the emulator's instruction processing loop running the code natively and go handle special case instructions that could otherwise crash the emulator. Although emulator is probably a bad description of what this is, if it is running actual P2 code. It's probably more like a type of VM.
It would be great to come up with a scheme that works without special tool support for generating runnable code, but it doesn't look like that is doable. We'd really need other tools that generate this type of code. A special mode in P2GCC could be one approach, where far calls and jumps are coded in a way that works with the i-cached external memory.
Also, I thought more about the COGRAM handler for a far call, and I think the callpa #/D, #/S instruction could be rather useful for jumping out of the emulator/VM in hub exec mode. It would also push the current PC, which we extract and push to a real stack and later pop for the return address. The "pa" register could be used to hold the target, this callpa instruction could still be made conditional with if_xxx prefixes etc, and we could also augment the #/D value for creating immediate 32 bit branch target addresses.
Thinking about how this would work for streamer @ sysclk/2. Is it: FIFO writes block of 8 longs one to each slice, do nothing for 8 cycles, write next block etc. Block read stalled during FIFO writes, read 8 longs, stalled again etc.
EDIT:
"FIFO write" = read from FIFO and write to hub RAM
Dunno. There are likely high and low water lines, but no idea if they're a round eight apart.
EDIT: Actually, it's more likely to be sporadic singles, doubles and triples I guess. The access time is a single tick but the slot rotation limits when that happens.
EDIT2: I suppose that very slot restriction will create a rhythm like you've described in that sysclock/2 case.
Cases like that seem to be crying out for a pre-processing unit, perhaps by properly expanding the code into the LUT and running it there...