P2 native & AVR CPU emulation with external memory, XBYTE etc
This is following on from a portion of a brief subdiscussion we had earlier in the Console Emulation thread, around here: https://forums.parallax.com/discussion/comment/1533831/#Comment_1533831
I've got my AVR emulator coded and working now using HUB RAM only for code storage. I've been looking into the FIFO wrap interrupt, wondering if I can use it to support external RAM blocks read into HUB RAM: a block would execute fully to the end of the FIFO block, with an interrupt firing whenever the next block needs to be read. However, I can't seem to get it triggering reliably at well defined points.
Attached is some test code that shows the FIFO block wrap interrupt (FBW) triggering somewhere around 52-53 bytes read into the 64 byte block. I've also seen other values depending on extra waitx delays I had introduced into the test code, so there could be some amount of hub window slop involved. It's not especially precise about when it triggers, so for now I've given up on the idea of using it to help the emulation deal with the program counter exceeding the FIFO block size.
What I'm now looking at is counting the PC wrap points manually. This only adds a couple of instructions of overhead to the loop, which isn't too bad if it allows continuous external memory based operation.
@TonyB_ I've also investigated using XBYTE, but once it's all coded and counted I can't see a benefit in clock cycles saved. I need to break into the XBYTE loop to include my new block wrap test and debugger test on each instruction, plus store the low AVR opcode byte separately, and it would also need another instruction per EXECF sequence, which takes up more COG RAM. In both cases the total loop overhead comes to 20 clocks, so any gain XBYTE might have achieved is swallowed up.
To support external memory, my AVR emulation loop will follow something like the code portion below. An I-cache of sorts will be used, so once the working set of blocks is loaded into HUB (assuming it fits), most executed code should continue to be read and executed directly from HUB. Branches, skips, returns, and the other instructions that use two words per opcode need similar handling to set up the new FIFO read address at the correct block offset in HUB. The LPM and SPM instructions, which can also read/write single bytes in the code memory, have to be handled differently as well (though that should be simpler if there is no actual caching, just some extra pre-fetching for the LPM auto-increment cases):
mainloop     rfword  instr               ' read next instruction from FIFO
             incmod  addr, #blocksize    ' increment and wrap on block size (in words)
             tjz     addr, #getnextblock ' branch out to read in next block now
continue     push    return_addr         ' save emulator return address
             getbyte x, instr, #1        ' extract upper opcode portion
             shr     x, #1 wc            ' 7 bit table index
             altd    x, #opcode_table    ' lookup execf sequence in table
             execf   0-0                 ' execute AVR instruction for opcode

opcode_table long    (%0 << 10) + op_nop
             ...

op_nop       ret

getnextblock
             ' TODO: use getptr to find current address and locate the
             ' associated hub block from an address map table
             ' if not present then read new block from ext mem into a free hub
             ' block (or replace oldest block), to act as I-cache
             ' index into hub block base address to setup rdfast
             jmp     #continue
So the hierarchy of execution speed would comprise the following cases:
- case 1) code is read and executed directly from a HUB RAM block (some multiple of 64 bytes)
- case 2) code branches to an address already stored in a HUB RAM block buffer - a branch within the current block is fastest; a different block is slightly slower due to an additional mapping table lookup
- case 3) code branches out to a new block not yet in HUB RAM. It first needs to be pulled in from external memory and stored in a free HUB RAM block and then mapped accordingly in a table. Any older address it replaces needs to be invalidated in the mapping table for that block's start address.
Hopefully once up and running, case 3 will be infrequent, because hitting it often would slow down execution significantly. It's basically a cache miss, and incurs the time penalty of loading a block of (say) 128-256 bytes from external RAM. I haven't decided on the best block size, but given the latency of PSRAM access I'm guessing 128-256 bytes per block might not be a bad size to try, with maybe up to 16kB worth of I-cache blocks. We will see...
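The three cases above can be sketched as a tiny model, with round-robin replacement standing in for whatever scheme is finally used. All the names and sizes here are illustrative, not taken from the actual emulator:

```python
BLOCK_SIZE = 128        # bytes per I-cache block (one of the sizes being considered)
CACHE_ENTRIES = 16      # kept small for the sketch

class BlockCache:
    def __init__(self):
        self.map = {}                        # AVR block number -> cache entry index
        self.owner = [None] * CACHE_ENTRIES  # cache entry index -> AVR block number
        self.next_victim = 0                 # round-robin replacement pointer
        self.hits = 0
        self.misses = 0

    def lookup(self, avr_addr):
        """Return the cache entry holding avr_addr, loading its block on a miss."""
        block = avr_addr // BLOCK_SIZE
        entry = self.map.get(block)
        if entry is not None:                # cases 1 and 2: block already in HUB RAM
            self.hits += 1
            return entry
        self.misses += 1                     # case 3: block must come from external RAM
        entry = self.next_victim
        self.next_victim = (self.next_victim + 1) % CACHE_ENTRIES
        old = self.owner[entry]
        if old is not None:
            del self.map[old]                # invalidate the block being replaced
        self.owner[entry] = block
        self.map[block] = entry
        # (real code would read BLOCK_SIZE bytes from PSRAM into this HUB block here)
        return entry

cache = BlockCache()
cache.lookup(0x100)               # miss: first touch of this block
cache.lookup(0x17E)               # hit: same 128-byte block
print(cache.hits, cache.misses)   # -> 1 1
```

The point of the model is just the two tables: a forward map from AVR block number to cache entry (the hit path), and a reverse map so a replaced block can be invalidated.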
Comments
I've come up with an approach for caching external code and loading it on demand in my emulator. I still have to fully incorporate and test it, but it should do something useful with any luck. It gets called whenever a new block of instructions is needed (eg, wrapping at the end of the current block into a new block) in the main loop sample code posted earlier. Further work is needed for the other branching cases that jump part way into blocks, but that handling will be almost identical: just apply an address offset before the RDFAST call is made, and adjust the addr variable above to track the next block wrap point as well as the active block.
There could be more optimized ways to map blocks to cache entries using addresses instead of byte indexes in the block mapping table, but that would increase the mapping table size by 4x. Right now, for a 256kB target AVR (eg ATmega2560) and a 128 byte block size, the direct mapping table is 2kB using byte indexes to I-cache entries. Using long addresses instead would increase that to 8kB (which may still be okay) and could save one instruction in the most frequently taken cache hit path. However, if the entire 8MB AVR address space were ever supported someday (not sure it ever would be, if it's not already supported by GCC), the mapping table with long addresses and 128 byte blocks would blow out to 256kB instead of 64kB. The only other table I use is indexed by I-cache entry, so we know which block is using which cache entry, and I already use longs for that one. For (say) a 16kB I-cache and 128 byte blocks it needs 16k/128 = 128 entries x 4 bytes = 512 bytes, which is not much of a storage issue on the P2.
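The table-size arithmetic above works out like this (same figures, just spelled out):

```python
KB = 1024
block = 128   # bytes per I-cache block

# 256kB AVR code space (eg ATmega2560), direct mapping table:
entries_256k = 256 * KB // block      # 2048 blocks to map
print(entries_256k * 1)               # 2048  -> 2kB table with byte indexes
print(entries_256k * 4)               # 8192  -> 8kB table with long addresses

# hypothetical full 8MB AVR address space:
entries_8m = 8 * KB * KB // block     # 65536 blocks to map
print(entries_8m * 1)                 # 65536  -> 64kB with byte indexes
print(entries_8m * 4)                 # 262144 -> 256kB with long addresses

# reverse table, one long per cache entry, for a 16kB I-cache:
print((16 * KB // block) * 4)         # 128 entries * 4 bytes = 512 bytes
```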
Sample caching and external memory driver read code is shown below. For lower latency the PSRAM read code could be made part of the emulator COG itself if we wanted exclusive access to the RAM without other COGs sharing it. But for initial testing I'll probably just use my own memory driver mailbox.
There are two issues here: (1) using FIFO and RFxxxx instructions in conjunction with external RAM and (2) using XBYTE.
If (1) works then (2) will work automatically; however, for AVR emulation it seems XBYTE offers no advantage. With hindsight, it would have been easy for XBYTE to optionally extract a bytecode from the high byte of an instruction word, read from hub RAM using RFWORD instead of RFBYTE. One extra XBYTE config bit would be needed, exactly matching bit 0 of the RFBYTE/RFWORD opcode. Something useful for a P2+/P3, I think.
Yes, and I am focusing on (1) now. With another day or two's time applied to it I think I'll probably have something running from external RAM. The XBYTE thing was separate from this, and unfortunately in this case XBYTE didn't help me a lot - I initially thought it might save a few clocks, but I had to break into the loop, which added more. I did at least move to using 7 bit opcodes to index my opcode table, and that freed a lot of spare space which I have needed for the extra external memory access/caching code. I've also added a few other niceties for controlling P2 specific stuff like the I/O and smartpins, the CORDIC, ATN and COG spawning in the AVR IO space, as well as a UART. That part is working quite nicely now and I can run C test programs compiled with GCC targeting the AVR 6 core.
Once I can make it work from external RAM, we would have a way to run large programs (slowly) from external memory with low P2 memory overhead. This could be useful in some applications such as providing GUI style editors/debug tools etc which don't have to run especially fast, and could get large but won't need to take too much P2 space. They could also spawn P2 native COGs to do high speed work. The approach would work for other emulators and possibly one day real P2 code too. We'll see where it goes...
I've got my emulated AVR code running from external PSRAM now. The AVR program space is stored in external RAM, and read into an I-cache on demand where it is executed from HUB RAM blocks using RFWORD. I'm currently testing 128 bytes per cache entry, 32kB total cache but of course this can be altered. The data space for the AVR is held in HUB RAM (up to 64kB).
Using some performance counters I put inside the emulator I am also tracking total instructions executed, total branches, cache hits and cache misses. With this silly test program below I get about 63 P2 clocks per AVR instruction on average including the read overheads. I'm seeing about 11k branches in 81k instructions or around one branch in 7.3 instructions. This result would then yield about 4MIPs at 252MHz.
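A quick sanity check of those figures (same numbers as quoted above):

```python
sysclk = 252_000_000     # P2 clock used for the test
clocks_per_instr = 63    # measured average, including read overheads
mips = sysclk / clocks_per_instr / 1e6
print(mips)                       # -> 4.0 MIPS

instructions = 81_000
branches = 11_000
print(instructions / branches)    # -> ~7.36 instructions per branch
```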
I've mapped a simple polled UART, the CORDIC operations, P2 I/O pins, P2 Smartpins, COG spawning, ATN, P2 counter, PseudoRNG, and LOCK management into the AVR IO space. So in theory you could fully control a P2 from a large AVR GCC program and spawn COGs from data held in the AVR program space etc. A video driver COG's screen buffer memory could be mapped into a portion of AVR external RAM so the AVR could present a console too, which would be quite useful for a large text UI application that doesn't consume too much P2 memory.
Still testing it but it seems to be working now.
The C code I ran above is this:
Here's the secret unoptimized sauce that reads and executes cached code on demand from external PSRAM (with my driver). It targets the 16 bit opcode format of the AVR, but could be changed to handle other CPUs' opcode types.
I'm sunk at EXECF. O_o
The 0-0 just means the instruction is D-patched by the previous altd. I was not including the rest of the AVR code here, just the code that tracks blocks of instructions, but for more clues, here's a bit more:
Great work, Roger. Is a cache miss massively costly? I think Evan's point is the code is quite simple up to and including EXECF but not so much afterwards!
Not really too costly at high P2 clock speeds. The miss just brings in another batch of 128 bytes to HUB (could be smaller or larger). The new block will remain in HUB RAM until it is pushed out, if/when the working set has fully cycled through the total number of cache rows. The latency/execution time for the external read is on the order of 1us for 128 byte transfers at 250MHz, but that buys another 64 instructions, so it wouldn't happen that often. Once the set has been loaded into HUB, selecting it stays much faster, and branches within the currently active block are faster again. I think it should be usable for some (non time-critical) emulator applications.
I'll post the full emulator when it's tidied up and tested a little more...
Update: this first cache implementation is very basic, and its replacement scheme is least recently cached, not proper LFU, LRU, etc. How well that works in practice is TBD, and relies on the working set not exceeding the number of available cache entries very often or for very long. There might be better schemes to use, but at the same time we don't really want to spend a lot of cycles on it, and the incmod wrap scheme I use is crazy simple/fast.
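As a toy illustration of that trade-off, here is least-recently-cached (effectively FIFO eviction, which the incmod wrap gives you) against true LRU on a made-up block reference trace. The trace and cache size are invented; real behaviour depends entirely on the program's working set:

```python
from collections import OrderedDict

def misses_fifo(trace, ways):
    """Evict the oldest-loaded block, regardless of use (least recently cached)."""
    cache, order, miss = set(), [], 0
    for b in trace:
        if b not in cache:
            miss += 1
            if len(cache) == ways:
                cache.discard(order.pop(0))   # evict oldest-loaded block
            cache.add(b)
            order.append(b)
    return miss

def misses_lru(trace, ways):
    """Evict the least recently *used* block."""
    cache, miss = OrderedDict(), 0
    for b in trace:
        if b in cache:
            cache.move_to_end(b)              # refresh recency on a hit
        else:
            miss += 1
            if len(cache) == ways:
                cache.popitem(last=False)     # evict least recently used
        cache[b] = True
    return miss

trace = [0, 1, 2, 0, 3, 0, 4, 0, 1, 2] * 3    # block 0 is "hot", others cycle
print(misses_fifo(trace, 4), misses_lru(trace, 4))   # -> 18 15
```

LRU wins here because the hot block keeps getting evicted under FIFO, but the gap is modest, and FIFO needs only a wrapping counter rather than per-entry recency bookkeeping, which matters when the lookup sits in every block wrap.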
Lol, oops, I didn't even notice the 0-0 there. My O_o is a facial expression. EXECF I know nothing about, and the return to mainloop comment bamboozled me.

The return to mainloop (or debugloop) only happens because of that "push return_addr" instruction prior to the actual EXECF, which itself is basically a jump and a skipf combined into one operation. EXECF is great. Last night during some optimizations I found I can save another 4 cycles with an EXECF instead of JMPs to my branch_handler too.
I've also moved the readword sequence back into the callers and only call loadnextblock when required, which saves another 8 clocks most of the time, instead of the call + return made on each readword operation.
I'm very surprised. Over half of my P2ASM code is preceded by SKIPF or EXECF. Most of the remainder would be too if nested skipping were available, i.e. skipping inside subroutines called from skip sequences.
EXECF the instruction is great, but EXECF the name is not, frankly. JSKIPF or JMPSKP would be more descriptive. The latter leaves the door open for a future CALLSKP for nested skipping.
I guess my focus is generally on combined speed and determinism. Every time I read up on SKIPF I find I'm adding unneeded execution overhead.
And the odd time I think I might have a use for it I then find the inevitable REP loop blows that out of the water.
Yeah, that would have been nice: saving the current EXECF skip sequence on the stack, instead of just keeping one and restoring it for the final return. However, I did just find you can nest the calls/returns more than one level deep within EXECF, which is still really useful. Previously I had thought it was limited to one call deep for some reason, until I discovered that.
Soooo does this mean I will soon be able to run my AVR code made using Bascom AVR on the Prop? :-)
Potentially down the track it might allow something to work, yeah. Keep in mind this does not try to emulate any actual AVR HW peripherals but instead maps some of the P2 features (including IO pins) to the AVR IO space, so actual device drivers and functions expecting real AVR peripherals would not work without changing things to use some P2 replacements where applicable. Right now this emulator just targets the largest ATMega2560/1 devices, but it could be scaled back to the smaller devices with two byte program counters. To date I've focused on compiling GCC programs not BASIC though and it is an experimental exercise to see how emulated code could be made to run with external memory. Ideally we would have a way to run real P2 code this way.
Today I was also looking into the Arduino API and build process and wondering if some of that could ultimately be made to run on the P2 too, as some sort of frankenhybrid P2/Arduino board thing that uses the Arduino API and C/C++ code. It's probably going to be around 4-5x slower than native AVRs though and rather non-deterministic too.
That sounds very interesting, there is so much cool stuff going on since we have external memory !
Yeah it really opens up new possibilities with the P2 including good things for audio, video, emulators, RAM filesystem storage, possible future on board dev tools for P2 coding etc, etc, etc.
Wow, just catching up on all this, you have been busy. This is potentially super useful, even if it doesn't run at full avr speed.
I'm curious what the cog use is like?
It's a single COG emulator and it's getting rather full now, maybe ~40 longs left in both LUT and COG RAM, though some extra space is used for my own debug during development which can be removed in the end. The CPU emulation code itself is fully complete, and the spare space left could either be allocated to a few more IO features, or could contain some internal PSRAM reading code to avoid needing a separate memory driver COG. I do have a second SPIN COG which is used for disassembling code in my debugger portion and decodes opcodes to print instruction names, but this part is not required for normal operation. For smaller AVRs you could use less than 64k RAM, and the whole thing could have quite a small HUB footprint. I actually started out on a version that reads AVR code purely from HUB RAM, and this has recently morphed into one that can read from external RAM (I do still have the original version though).
Here's an idea I had recently for executing native P2 instructions read from external memory while remaining in COGEXEC mode.
It uses 64 bits for each instruction. This doubles code memory space used which is somewhat wasteful but allows reasonably fast execution in COGEXEC mode, 8x slower than native when not branching, and a nice way to deal with instructions that require special handling. The 8x rate is a handy multiple for the HUB window access. The 64 bit P2 code to be executed could be generated by converting normal 32 bits instructions to 64 bits with (hopefully) fairly simple tools. Most normal instructions are just zero extended to 64 bits while the special ones have other upper bits added to "escape" them into 64 bit instructions.
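A conversion tool along those lines could be sketched like this. The escape bit position and the "special" pattern here are invented purely for illustration; a real tool would match genuine P2 opcode encodings:

```python
SPECIAL = {0xDEAD_BEEF}       # placeholder 32-bit patterns the VM must intercept
ESCAPE_BIT = 1 << 32          # nonzero upper bits mark an escaped instruction

def expand(instr32):
    """Zero-extend an ordinary instruction to 64 bits; tag special ones for the VM."""
    if instr32 in SPECIAL:
        return ESCAPE_BIT | instr32
    return instr32            # ordinary: executes as-is in the 64-bit stream

prog32 = [0x1234_5678, 0xDEAD_BEEF]
prog64 = [expand(i) for i in prog32]
print([hex(i) for i in prog64])   # -> ['0x12345678', '0x1deadbeef']
```

The emulator loop then only needs to test the upper 32 bits: zero means "execute directly", nonzero means "dispatch to a handler", which keeps the common path short.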
This approach allows the current program counter to be easily computed vs any variable length instruction schemes that try to conserve memory.
The basic P2 emulator loop is shown below. REP was not used so interrupts could still be supported eventually perhaps.
Importantly the flags are not affected by this loop and are preserved for the application.
We'd like a larger JMP/CALL space with immediates, otherwise we are limited to only 1MB in HUB RAM.
We could potentially include some special additional instructions with 32 bit immediates, like these:
Maybe ### could be used by tools to indicate 32bit constants.
An alternative approach is to keep the executed instructions as 32 bits only and restrict the set of allowed instructions to be executed from external memory so as not to crash the emulator. This could work ok for tools generating code for high-level languages such as C which would only need a subset of the P2 instruction set anyway. We would still need some way to branch in/out of external memory addresses though with (conditional) calls/jumps. They could probably be transformed into local branches that then emulate those instructions.
In fact, you'll need at least two consecutive longs anyway, in order to be able to execute/process sequences like "GETCT WC + GETCT". Other instruction sequences that also block interrupts until some condition is satisfied, but don't strictly rely on consecutive processing, would also work, though the total execution time would not match bare metal - which isn't the objective.
Yes, the consecutive long thing gets rather tricky. I'm also looking at what it takes to execute I-cache rows directly with hub exec, maybe with a
_ret_ NOP
instruction placed at the end of each cache row to allow an emulator return for handling the wrap into the next block. Again, if instructions that need to be consecutive span two disjoint I-cache rows we have a problem, unless the offending last instruction (ALT, SETQ, AUGS etc) is somehow copied/duplicated at the beginning of the next block in some space reserved for this purpose - maybe that could work too.

We'd still need to find a way to escape to perform far calls and jumps at a minimum. I've not figured out a way to do that yet without losing the JMP address information in the instruction itself, unless it is taken from some previously loaded register, which changes the coding a bit more. Eg, this might work, and it could allow conditional branches too, with if_x prefixes on the jmp #farjmp_handler instruction where needed. Far calls would be similar. Other stuff like standalone DJNZ and JATN instructions would still not be possible though.
I'm not certain what the target is here. Is this intended for emulators, to "escape" to P2 code inside emulated instructions to speed them up?
Is it possible to do fast block reads using SETQ/SETQ2 at the same time as the streamer is reading external RAM, to copy code from hub RAM to cog/LUT RAM just after the hub RAM is written by the streamer?
Sure can. Streamer has priority, as in it'll stall the cog to grab the hubRAM databus, but as long as the cog isn't reading where the streamer is writing then no issue. Or more accurately, the FIFO does the cog stalling.
The "escape" part is to allow us to break out of the emulator's instruction processing loop running the code natively and go handle special case instructions that could otherwise crash the emulator. Although emulator is probably a bad description of what this is, if it is running actual P2 code. It's probably more like a type of VM.
It would be great to come up with a scheme that can work without special tool support for generating runnable code but it doesn't look like that is doable. We'd really need other tools that generate this type of code. A special mode in P2GCC could be one approach where far calls and jumps are coded in a way that work with the i-cached external memory.
Also I thought more about the COGRAM handler for a far call and I think the callpa #/D, #/S instruction could be rather useful for jumping out of the emulator/VM in hub exec mode. It would also push the current PC which we extract and push to a real stack and later pop for the return address. The "pa" register could be used to hold the target and this callpa instruction could be still made conditional with if_xxx prefixes etc, and we could also augment the #/D value for creating immediate 32 bit branch target addresses.
Thinking about how this would work for streamer @ sysclk/2. Is it: FIFO writes block of 8 longs one to each slice, do nothing for 8 cycles, write next block etc. Block read stalled during FIFO writes, read 8 longs, stalled again etc.
EDIT:
"FIFO write" = read from FIFO and write to hub RAM
Dunno. There are likely high and low water lines, but no idea if they're a round eight apart.
EDIT: Actually, it's more likely to be sporadic singles, doubles and triples, I guess. The access time is a single tick, but the slot rotation limits when that happens.
EDIT2: I suppose that very slot restriction will create a rhythm like you've described in that sysclk/2 case.
Cases like that seem to be crying out for a pre-processing unit, perhaps by properly expanding the code into the LUT and running it there...