
P2 native & AVR CPU emulation with external memory, XBYTE etc


Comments

  • TonyB_ Posts: 2,127
    edited 2022-02-12 13:11

    @TonyB_ said:

    @evanh said:

    @TonyB_ said:
    Is it possible to do fast block reads using SETQ/SETQ2 at the same time as the streamer is reading external RAM, to copy code from hub RAM to cog/LUT RAM just after the hub RAM is written by the streamer?

    Sure can. Streamer has priority, as in it'll stall the cog to grab the hubRAM databus, but as long as the cog isn't reading where the streamer is writing then no issue. Or more accurately, the FIFO does the cog stalling.

    Thinking about how this would work for the streamer @ sysclk/2. Is it: the FIFO writes a block of 8 longs, one to each slice, does nothing for 8 cycles, writes the next block, etc.? The block read is stalled during FIFO writes, reads 8 longs, is stalled again, etc.

    EDIT:
    "FIFO write" = read from FIFO and write to hub RAM

    Hub slice N+1 must be written immediately after slice N, as it cannot wait for the next rotation at sysclk/2; N+2 immediately after N+1, etc., so there must be block writes.

    [Text editor freezing every few seconds is incredibly annoying.]

  • @Yanomani said:

    @rogloh said:
    Again if the instructions that need to be consecutive span two disjoint i-cache rows we have a problem, unless the offending last instruction (alt, setq, augs etc) is somehow copied/duplicated at the beginning of the next block in some reserved space for this purpose - maybe that could work too.

    Cases like that seem to be crying out for a pre-processing unit, perhaps one that properly expands the code into the LUT and runs it there...

    Yeah, I remember I had another LUTexec proposal a long time back. I'll have to dig it up and look at that idea again, as it was another way to do things. However, I do know it would require the extra step of reading the block of code in from HUB to LUT as well (adds 1 clock per instruction in a burst read with SETQ2).

    In the meantime this callpa #, # approach I am looking into for far branching plus hub exec from an i-cache row terminated with a ret instruction might have some legs. I just need a way to quickly identify the instructions that might need to be duplicated if they fall at the end of a block in the i-cache and we wrap around. If we restrict them to just AUGD/AUGS/SETQ/SETQ2 it might be okay. The ALTx ones are more problematic to match quickly, but high-level code probably won't need to use those ALT instructions.

    Some instructions taking a SETQ operand will work with an older SETQ value, but others need it set immediately before the instruction. So I think just duplicating those 4 instructions (AUGD/AUGS/SETQ/SETQ2) at the start of each new cache row, if the last block executed ended on one, would be okay and keep things working. I'm going to work on this idea more and maybe try some hand-crafted sample PASM code to see if it works.

  • Maybe LUT sharing could help there: one COG could do the memory access, the other the execution.

  • evanh Posts: 15,192

    @TonyB_ said:
    [Text editor freezing every few seconds is incredibly annoying.]

    Ya, I'm now turning scripting back off again. I used to leave it on while editing. Sadly, scripting is needed for clicking the edit button.

  • Wuerfel_21 Posts: 4,513
    edited 2022-02-12 23:43

    I realize this is wildly offtopic, but this UserScript I wrote to hack Spin syntax highlighting into existence also, as a side effect, fixes the editor freezing issues:

    // ==UserScript==
    // @name         Owie ouch spin plugin
    // @version      0.1
    // @match        *://forums.parallax.com/*
    // @run-at document-start
    // @require https://cdnjs.cloudflare.com/ajax/libs/highlight.js/10.5.0/highlight.min.js
    // @require https://raw.githack.com/Wuerfel21/hljs-spin/master/spin.js
    // ==/UserScript==
    console.log("Runnin!!!");
    
    var head = document.getElementsByTagName('head')[0]
    var oldAppendChild = head.appendChild;
    head.appendChild = function() {
        if (arguments[0].tagName == "SCRIPT" && arguments[0].src.toLowerCase().includes("highlightjs")) {
            console.log("AROOOGA!! "+arguments[0].src);
            return;
        }
        oldAppendChild.apply(this, arguments);
    };
    
    document.addEventListener('DOMContentLoaded', (event) => {
      document.querySelectorAll('.codeBlock').forEach((block) => {
          console.log("EUIEUIEUIEUIEUI");
        hljs.highlightBlock(block);
      });
    });
    
    var styleLink = document.createElement("link");
    styleLink.type = "text/css";
    styleLink.rel = "stylesheet";
    styleLink.href = "https://cdnjs.cloudflare.com/ajax/libs/highlight.js/10.5.0/styles/default.min.css";
    head.appendChild(styleLink);
    

    If you trace the freezes, they happen inside that obfuscated hljs blob. I think it recomputes the syntax highlighting whenever it feels like it (which involves parsing every code block through every language grammar). To run the unobfuscated version, which I can patch Spin support into, the blob is blocked from loading, and that solves the issue. You do, however, now need to refresh to see highlighting on a fresh post.

  • evanh Posts: 15,192
    edited 2022-02-13 00:49

    I didn't know there was any highlighting. I only get the mono font. EDIT: Huh, there are some colours right there in your script now that I've turned scripting on. :)

    How does one add a user script?

  • @evanh said:
    How does one add a user script?

    Install a userscript manager extension (I use Tampermonkey), then open its control panel, add a new script, paste in the script, and save.

  • evanh Posts: 15,192

    In that case, Jim should just remove the highlighter feature entirely.

  • Yeah posting code has been sucking a lot lately with the slowdowns.

  • rogloh Posts: 5,171
    edited 2022-02-13 02:20

    @msrobots said:
    Maybe LUT sharing could help there: one COG could do the memory access, the other the execution.

    That's one idea, sort of using the second COG as a type of MMU making the first COG's memory requests. I'm still hoping to have just one COG running this though, even if it has to do the memory access itself eventually. Right now I'm using my own driver COG's mailbox, which makes it a bit easier to set up and initialize memory etc., and even allows sharing the external RAM with other video COGs (obviously with some degraded performance, so hopefully the i-cache helps a bit there).

    The i-cache scheme I've come up with for the AVR is okay for small devices (~256kB to 2MB), but once the memory space exceeds that, the byte-wide table I use to map 128-byte address blocks to cache rows starts to dominate HUB RAM. E.g., an ATmega2560 only needs a 2kB table, but a 32MB address space would need 256kB! Obviously that takes too much HUB RAM. It can be reduced by increasing the block size or reducing the executable area: e.g., 8MB with a 256-byte row size (64 P2 instructions) only needs 32kB of HUB RAM for the table, which is more manageable. Another alternative is to have multiple table structures or other caching schemes, but these would run slower. Mine takes just one extra HUB RAM access per block hit, which is not bad, and this test needs to run on every branch or row wrap. A block miss is worse and needs multiple HUB RAM accesses to decide where the new row will go in the cache, plus the row transfer time (some of which could be overlapped).
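
    For reference, the block map table needs one byte per block, so its size is simply

        table size in bytes = executable address space / block size

    e.g. 8MB / 256B = 32768 entries = 32kB, matching the figure above.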

  • rogloh Posts: 5,171
    edited 2022-02-13 04:43

    I was just looking at the best case performance with i-cache hits and my callpa ##, # scheme for far branches because this will effectively be the limit on the MIPS possible with external RAM execution of native P2 code when branches are present in the code.

    To branch to a new 32-bit immediate far address "destination_addr", the HUB exec code executing from the i-cache runs this instruction:

            callpa ##destination_addr, #farbranch_handler
    

    The shortest code path to handle this using my current block caching scheme is probably this (assuming the handler is stored in COGRAM):

    farbranch_handler
        pop ina ' drop internal stack return addr data passed to us
        mov block, pa  ' copy target address
        shr block, #8 ' extract block (assuming block size 256 bytes)
        add block, blockmaptable ' index the block map table
        rdbyte row, block ' read block mapping entry byte
        tjz row, #loadrow ' if zero, we are unmapped
        mul row, #264 ' 256 plus two longs (64 instructions and padded at end with a ret, and optionally preceded with room for the last executed setq/setq2/augs/augd)
        add row, cachebaseplus4 ' add to base of cache storage area (plus 4 bytes to account for room for preceding instruction)
        and pa, #$ff ' extract offset in block
        add pa, row ' add to row base address
        jmp pa ' jump to this i-cache address in HUB exec
    

    I count the following cycles:
    2 for the implied AUGD before the callpa instruction, for the 32-bit immediate destination address
    4 to do the callpa jump back into COGEXEC mode
    9 regular 2-cycle instructions = 18 clock cycles
    9-16 cycles for the rdbyte
    13-20 cycles for the final branch back into HUB exec mode in the i-cache row
    Total: somewhere from 46 to 60 clocks.

    So if the cache hit rate is perfect and we branch on average after every 6 P2 instructions (for example), the best we will see is 7 instructions executed every 6x2+46 clocks, or 1 instruction every 58/7 = 8.3 clocks. Taking the other bound, we get 7 instructions in 6x2+60 clocks, or 1 in 72/7 = 10.3 clocks. At 252MHz this is somewhere between 24 and 30 MIPS best case, with a 6:1 ratio of non-branches to branches and a perfect i-cache hit rate (everything already cached for the application's working set). In reality it will be worse than this, but it is nice to see how fast it could be under more ideal circumstances; i.e. we won't really see better than this unless the code is very linear with very few branches.
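
    Generalizing that arithmetic: with n straight-line instructions per far branch and a branch overhead of C = 46..60 clocks, the best-case rate is

        clocks per instruction = (2n + C) / (n + 1)

    which for n = 6 reproduces the 8.3 to 10.3 clocks (24-30 MIPS at 252MHz) above.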

    There is also another branch needed at the end of each cache row reached (e.g., every 64 instructions), and more cycles are needed to check for and patch in the optional setq(2)/augd/augs instructions at the start of each row. One rdlong, some opcode tests and one optional wrlong, plus several other address-generation instructions, would be needed in addition to the branching code above.

    If we manage to get something that runs in the vicinity of 20 MIPS peak, it might run a bit like the original P1 but be able to address a huge amount of instruction memory.

  • Actually, I just noticed this optimization... but it only saves 2 clocks.

    farbranch_handler
        pop ina ' drop internal stack return addr data passed to us
        ror pa, #8  ' align block in target address
        add pa, blockmaptable ' index the block map table
        rdbyte row, pa ' read block mapping entry byte
        tjz row, #loadrow ' if zero, we are unmapped
        mul row, #264 ' 256 plus two longs (64 instructions and padded at end with a ret, and optionally preceded with room for the last executed setq/setq2/augs/augd)
        add row, cachebaseplus4 ' add to base of cache storage area (plus 4 bytes to account for room for preceding instruction)
        shr pa, #24 ' extract offset in block
        add pa, row ' add to row base address
        jmp pa ' jump to this i-cache address in HUB exec
    
  • Isn't this like LMM on the P1, just with cache lines to read larger blocks?

    Catalina C already supports external memory on the P1 using @Cluso99's RamBlade build with 512KB on a P1. Maybe @RossH can adapt Catalina to use the RAM on the 32MB Edge similarly?

    That would be cool. But I am not a C programmer, so getting Spin code to run from external memory would be really cool for me, though C might be more important for general use.

    But once you have figured out a P2 VM in PASM2 reading from the P2 Edge external RAM, we might be able to say 'please, pretty please' to @ersmith and @Wuerfel_21 to build a code generator to support that?

    Being able to run code from external RAM would be really nice.

    It is already nice to use it for video RAM or as a simple file system like @"Mike Green" showed, but expanding the code execution space, even if slower, would allow a lot bigger programs.

    just my thinking

    Mike

  • rogloh Posts: 5,171
    edited 2022-02-13 07:25

    Here's some sample i-cache row wrap handling code. We return to this COGRAM code from the HW stack using a ret retained at the end of each cache row (one that is never clobbered by reading in the row). I just don't know if the spare instruction (coded as the long $fbdc0000), using its immediate #D, #S form, will absorb any extended immediates pending from prior AUGD/AUGS instructions just executed at the end of the row and still be a NOP. Hopefully it will, as this saves an extra instruction otherwise. This code only handles SETQ/SETQ2/AUGD/AUGS, which are the main ones needed for large constants and block reads. It won't handle multiple of these instructions back to back either, just the last one at the end of the block, so that when it wraps we can continue on. I think it will be useful.

    For cache hits the i-cache row wrap overhead looks like: a return to COGRAM, 21 regular instructions plus 2 reads and an optional HUB write before returning back into HUB exec in the i-cache row. So typically something around 4+42+(18..32)+(13..20), or 75-96 clocks, and about ten more for the special SETQ/SETQ2/AUGS/AUGD cases when duplicated. So this operation drops the best-case straight-line execution (a pathological case) to 64 instructions executed from cached external memory every 128 + 75 clocks, or 3.17 clocks per instruction, which is 79 MIPS @ 252MHz. It will never be attained with real-world code, but it's interesting to see nonetheless.

    wrap_handler
        long    $fbdc0000 ' soak up any just executed augd/s instructions with reserved opcode using #,# form, hope this works!
        rczr    flags  ' preserve flags because we are changing them
        add     activerow, #252 ' go to end of the row
        rdlong  instr, activerow ' get instruction just executed at end of cache row
        getnib  pa, instr, #6 ' get 6th nibble
        sub     pa, #$f wz  ' compare against augd/augs opcode
     if_z jmp   #skiptest
        mov     pa, instr  ' get instruction
    and     pa, ##$0ff801fe ' apply mask for setq/setq2
    cmp     pa, ##$0d600028 wz ' check for setq/setq2
    skiptest
        setbyte block, #0, #3 'clear offset 
        add     block, #1 ' advance to next block, maximum external address space wrap case? cmpsub/incmod instead perhaps or ignore?
        mov     pa, blockmaptable
        add     pa, block
        rdbyte  activerow, pa
        tjz     activerow, #loadrow2 ' slightly different row loader, needs to patch instruction at start if z flag is set
        mul     activerow, #264 ' 256 plus two longs (64 instructions padded at end with ret, optionally preceded with last executed setq/setq2/augs/augd)
        add     activerow, cachebaseplus4
        mov     pa, activerow
    if_z sub    pa, #4      ' rewind i-cache entry address by one instruction
    if_z wrlong instr, pa     ' load last instruction executed into i-cache row
        push    #wrap_handler ' we will return to this row wrap handler code using HW stack
        rczl    flags wcz ' restore the flags we changed
        jmp     pa
    
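    Using the same style of estimate for both ends of the 75-96 clock wrap overhead (W):

        clocks per instruction = (128 + W) / 64 = 3.17 to 3.5

    i.e. roughly 72-79 MIPS at 252MHz for purely straight-line cached code.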
    
  • @msrobots said:
    Isn't this like LMM on the P1, just with cache lines to read larger blocks?

    Catalina C already supports external memory on the P1 using @Cluso99's RamBlade build with 512KB on a P1. Maybe @RossH can adapt Catalina to use the RAM on the 32MB Edge similarly?

    That would be cool. But I am not a C programmer, so getting Spin code to run from external memory would be really cool for me, though C might be more important for general use.

    But once you have figured out a P2 VM in PASM2 reading from the P2 Edge external RAM, we might be able to say 'please, pretty please' to @ersmith and @Wuerfel_21 to build a code generator to support that?

    Being able to run code from external RAM would be really nice.

    It is already nice to use it for video RAM or as a simple file system like @"Mike Green" showed, but expanding the code execution space, even if slower, would allow a lot bigger programs.

    just my thinking

    Mike

    Agree with everything you have said above. If we find a useful way to extend high-level languages to run from external memory, it would be great to have some tools extended to support this. I'm considering looking at P2GCC to see what might be done there (at least for a test), but if Eric/Wuerfel_21/Chip and others can also make use of this stuff down the track, all the better. A larger SPIN2 code segment might be useful too for really huge programs (still keeping data in HUB RAM).

  • RossH Posts: 5,347

    @msrobots said:
    Catalina C already supports external memory on the P1 using @Cluso99's RamBlade build with 512KB on a P1. Maybe @RossH can adapt Catalina to use the RAM on the 32MB Edge similarly?

    Catalina supports all the P1 platforms with XMM RAM that I am aware of, but more may have been added since I last checked. However, in hindsight using XMM for executing program code was probably a mistake on the P1, and one that I am not all that keen to repeat on the P2. On the P1 it was too slow to be much use except for some very niche applications, and I now think that if your ambition is just to run larger programs and you don't care about the much reduced execution speeds, then you would probably have been better off using a different processor in the first place (I do realize this is heresy for many Propeller users! :) )

    I am actually now dropping Catalina support for some of the P1 XMM platforms, since I doubt some of them even exist any longer. How many people still have a working Morpheus, for example? And that platform was a beast to support! :(

    Let's not repeat this debacle on the P2. If you want to use XMM as a buffer then fine. Or turn it into a file system, and then all the languages and tools can already use it - for instance, Catalina can already load and execute overlays from file systems, which will then execute at full speed.

    Ross.

  • Yes, overlays are the second best option, and someone here on the forum even got PropGCC to create and link them; kind of a relic, but still supported, somehow.

    But having that P2 Edge with RAM, I think it would make sense to support it.

    Just my 2 cents

    Mike

  • rogloh Posts: 5,171
    edited 2022-02-13 22:23

    @RossH said:
    Catalina supports all the P1 platforms with XMM RAM that I am aware of, but more may have been added since I last checked. However, in hindsight using XMM for executing program code was probably a mistake on the P1, and one that I am not all that keen to repeat on the P2. On the P1 it was too slow to be much use except for some very niche applications, and I now think that if your ambition is just to run larger programs and you don't care about the much reduced execution speeds, then you would probably have been better off using a different processor in the first place (I do realize this is heresy for many Propeller users! :) )

    Yes, all those P1 external memory solutions were poor for quite a few reasons...
    1) the slow P1 IO pin speed limited the bandwidth/performance of all external memory reads
    2) the remaining IO pin count became too low with the faster parallel-bus RAM solutions
    3) they could not take advantage of newer RAMs like HyperRAM/PSRAM, which were not really around, or suited to P1 clock speeds, back then
    4) the very limited HUB RAM size prevented having a decent-sized i-cache that would have improved performance
    5) fairly low HUB RAM bandwidth for getting data into COGs.

    The P2 + external memory solutions available now should help overcome many of these issues. The memory speed is two orders of magnitude faster, and we can easily provide 32kB-64kB of i-cache in HUB RAM.

    @RossH said:
    I am actually now dropping Catalina support for some of the P1 XMM platforms, since I doubt some of them even exist any longer. How many people still have a working Morpheus, for example? And that platform was a beast to support! :(

    Let's not repeat this debacle on the P2. If you want to use XMM as a buffer then fine. Or turn it into a file system, and then all the languages and tools can already use it - for instance, Catalina can already load and execute overlays from file systems, which will then execute at full speed.

    It's not for Catalina then, given your prior painful history (sounds a bit like asking for GCC for the P2, lol), but hopefully there might be other tools that can use this eventually.

    I still see a range of applications where external memory execution may become useful on a P2... despite being slower than COGEXEC or HUBEXEC code.
    1) obviously for emulators, though that is more ideal for non-P2 code... look at what Wuerfel_21 has achieved, amazing!
    2) applications needing lots of free HUB RAM for data storage or other real-time PASM COG use, where the execution speed of the controlling application logic is less important and it could get large - e.g., a CNC machine's GUI application control code, some robot logic, etc.
    3) applications requiring larger codebases or library code than would fit in 512kB, where operating speed is not critical, or applications with larger networking stacks, etc.
    4) in-situ debug/development tools, decent file editors, etc.
    5) data acquisition where you want to keep as much HUB RAM free for high-speed data and just run the control code out of external memory, or spawn other COGs from there.
    etc.

  • evanh Posts: 15,192
    edited 2022-02-13 10:44

    It's probably better to leave native alone and instead extend VM'd/tokenised/scripted languages to operate from external memories.

    Emulators can do anything of course. That's a different ball game.

  • TonyB_ Posts: 2,127
    edited 2022-02-13 11:13

    @rogloh said:
    In the meantime this callpa #, # approach I am looking into for far branching plus hub exec from an i-cache row terminated with a ret instruction might have some legs. I just need a way to quickly identify the instructions that might need to be duplicated if they fall at the end of a block in the i-cache and we wrap around. If we restrict them to just AUGD/AUGS/SETQ/SETQ2 it might be okay. The ALTx ones are more problematic to match quickly, but high-level code probably won't need to use those ALT instructions.

    What if the compiler could pad the end of each 256-byte code block with one or two NOPs if needed for AUGx/ALTx/SETQx to avoid any wrap around problems?

  • rogloh Posts: 5,171
    edited 2022-02-13 12:17

    @TonyB_ said:
    What if the compiler could pad the end of each 256-byte code block with one or two NOPs if needed for AUGx/ALTx/SETQx to avoid any wrap around problems?

    From reading the spec for AUGS, it looks like it uses the value on the next #S occurrence, not necessarily the next instruction, so NOPs may not cut it. That's why I cancelled it out with the reserved opcode using the #D, #S form. It could either be part of the row or part of the handler; if part of the row, maybe it could be combined with the ret as well. I'm sort of hoping it won't need too many special compiler rules to generate the code - that's one of the aims here. If something like P2GCC doesn't use them much except for loading 32-bit constants, then it may not be that big a deal. That being said, the ability to align to the next block boundary would be useful if you do want to squeeze in special inline PASM instruction sequences that can only safely execute within the same block (like djnz, skipf, etc.).

    "Queue #n to be used as upper 23 bits for next #S occurrence, so that the next 9-bit #S will be augmented to 32 bits. "

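    A minimal illustration of that pending behaviour (hypothetical registers x and y, per the quoted spec):

        augs    #1          ' queue upper bits for the next #S occurrence
        mov     x, y        ' register source, no #S here: the AUGS stays pending
        add     x, #1       ' this #S occurrence is the one augmented to 32 bits
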
  • @evanh said:
    It's probably better to leave native alone and instead extend VM'd/tokenised/scripted languages to operate from external memories.

    I see no particular reason to leave native alone if someone ultimately finds a reasonable way to make it work to some extent. Of course, without any tools supporting it, it would not fly.

  • evanh Posts: 15,192

    :) I'll bug out. Not my area.

  • In an experiment I just found that the reserved opcode long method I was using above in my wrap handler would only soak up the ##S replacement from a prior AUGS, but not the one from a prior AUGD. However, the COGID #D instruction seems to soak that up and appears to be harmless. So I think we can do a _RET_ COGID #0 at the end of the i-cache block, followed by the reserved opcode long later in the COG handler, which nicely combines the return operation as well. This will cancel out any AUGS or AUGD executed as the last instruction in a block, and we can then duplicate them at the start of the next block. AUGS would be useful for 32-bit constants, and AUGD for 32-bit address constants for far branches and calls. I don't imagine standard (high-level) code would ever need to put them back to back.

    For SETQ/SETQ2, repeating it should be fine without any need to cancel the prior one. And if we do need to use SETQ in our handler, we will just clobber the old value anyway.
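
    So the end-of-row sequence would look something like this (a sketch of the description above, not yet tested code):

        ' last long of every i-cache row:
        _ret_   cogid   #0          ' return to the wrap handler; the #D immediate soaks a pending AUGD

        ' first long of the COG wrap handler:
        long    $fbdc0000           ' reserved #D,#S opcode; soaks a pending AUGS harmlessly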

  • rogloh Posts: 5,171
    edited 2022-02-14 03:52

    I figured out how to load a row with the (untested) code below... it looks like a cache miss essentially only adds a 3-long write burst plus the row transfer time, plus one polling rdlong, versus the (hopefully) far more common cache hit case. Once the cache row is filled it will jump into the destination cache line. Thankfully most of the extra cache housekeeping work runs in parallel with the memory transfer, so the code overhead is pretty small compared with the transfer time + latency itself. It will be as fast as it can be.

    In this example I have the LUT RAM holding a 128-entry cache row association table, but this table could probably be stored in HUB RAM if more of the LUT needs to remain free; there should still be time for the extra HUB RAM accesses during the cache row transfer, which is going to take at least a microsecond or so.

    loadrow
                ' setup external memory source address for transfer
    
                mov     extaddr, block          ' get block number
                mul     extaddr, #256           ' multiply by block size to get block address in ext mem 0-16MB
                add     extaddr, extrdaddr      ' include any external memory offset and the read request nibble
    
                'the hub destination address is already setup in advance to the next cache row to be used
    
                setq    #3-1                     ' write 3 longs
                wrlong  extaddr, mailbox         ' issue external read transfer request to ext memory driver to bring in new row
    
                ' now go do cache housekeeping work while waiting for transfer to complete
    
                rdlut   ownerblock, nextrow wc  ' compute the old owner of this row if any, c=1 means empty
                wrlut   pa, nextrow             ' set the new owner of the cache row
        if_nc   wrbyte  #0, ownerblock          ' clear out old block map table mapping
                wrbyte  nextrow, pa             ' setup new block mapping entry 
    
                mov     activerow, hubaddr      ' get active row start address from the hub read address
                sub     hubaddr, #4             ' rewind by a long
                wrlong  instr, hubaddr          ' go write previous instruction in case it is required
    
                incmod  nextrow, #127 wc        ' select next cache row to use next time (127 rows x 264 bytes, 0=reserved)
        if_c    mov     nextrow, #1             ' 1 based row numbers
                mov     hubaddr, nextrow        ' get nextrow
                mul     hubaddr, #264           ' scale by row size
                add     hubaddr, cachebaseplus4  ' add offset in HUB RAM to get address
    
                shr     pa, #24                 ' get branch offset in cache row
                add     pa, activerow           ' add to row base
        if_z    sub     pa, #4                  ' rewind by a long if needed, to repeat last AUGD/AUGS/SETQ/SETQ2 instruction
                rczl    flags  wcz              ' restore flags
    poll
                rdlong  flags, mailbox          ' check for completion of external request
                tjns    flags, pa               ' branch to just read cache row with jump into HUB EXEC
                jmp     #poll                   ' or retry
    
    
    flags           long 0
    activerow       long 0   
    cachebaseplus4  long 0
    blockmaptable   long 0    
    nextrow         long 0  ' setup in advance
    block           long 0
    instr           long 0
    extaddr         long 0
    hubaddr         long 0   ' setup in advance
    transfersize    long 256
    ownerblock      long 0
    extrdaddr       long $B0000000 ' read burst request + offset in external memory
    mailbox         long 0
    
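    For reference, the setq #3-1 + wrlong pair above bursts three consecutive cog registers (extaddr, hubaddr, transfersize) to the driver mailbox, so the request layout implied by this code is:

        ' mailbox+0 : read burst request nibble + external address  (extaddr)
        ' mailbox+4 : destination address in hub RAM                (hubaddr)
        ' mailbox+8 : transfer length in bytes                      (transfersize = 256)
        ' mailbox+0 is then polled and stays negative while the request is outstanding (see tjns)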
  • @rogloh said:
    In an experiment I just found that the reserved opcode long method I was using above in my wrap handler would only soak up the ##S replacement from a prior AUGS, but not the one from a prior AUGD. However, the COGID #D instruction seems to soak that up and appears to be harmless. So I think we can do a _RET_ COGID #0 at the end of the i-cache block, followed by the reserved opcode long later in the COG handler, which nicely combines the return operation as well. This will cancel out any AUGS or AUGD executed as the last instruction in a block, and we can then duplicate them at the start of the next block. AUGS would be useful for 32-bit constants, and AUGD for 32-bit address constants for far branches and calls. I don't imagine standard (high-level) code would ever need to put them back to back.

    For SETQ/SETQ2, repeating it should be fine without any need to cancel the prior one. And if we do need to use SETQ in our handler, we will just clobber the old value anyway.

    There are other instructions to consider: CRCNIB / SCA / SCAS / GETCT+WC / GETXACC / XORO32 and REP. It seems to me that it would be better if the compiler had an option to prevent these crossing a page boundary in hub RAM by prefixing them with NOP(s), or at least to give a warning in the case of REP. Let the compiler do the tedious work.

    How much time would be saved with 256-byte blocks instead of 264?

  • rogloh Posts: 5,171
    edited 2022-02-14 11:51

    @TonyB_ said:
    There are other instructions to consider: CRCNIB / SCA / SCAS / GETCT+WC / GETXACC / XORO32 and REP. It seems to me that it would be better if the compiler had an option to prevent these crossing a page boundary in hub RAM by prefixing them with NOP(s), or at least to give a warning in the case of REP. Let the compiler do the tedious work.

    There is nothing stopping a compiler from doing what it wants. It could even use DJNZ and JMPs, and the ALTxxx stuff as well, if they are known to fit within the block. In my view the instructions you listed above would not typically be used by a compiler for a high-level language (certainly C) unless inline PASM is used. In that case you would just need to ensure they fit inside a 256-byte block. For this it would be useful to have the ability to align code to the next 256-byte boundary, so you could control things better if you wanted to use inline PASM for speed or some of the extra features above.

    How much time would be saved with 256-byte blocks instead of 264?

    The blocks are actually 256 bytes, but the cache row space in HUB RAM is 264 bytes per row, allowing an optional duplicated AUGS/AUGD/SETQ/SETQ2 instruction to be prepended, as well as the ret at the end of the cache row that refills and continues in the next row. These extra 8 bytes are not part of the executable being read in, so the compiler doesn't have to do anything special - it is handled automatically. Because these instructions (AUGS/AUGD) are used a lot for long target branch addresses and 32-bit constants in high-level languages, it would be inconvenient to have to add lots of NOPs around them at block boundaries.
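
    To make the geometry concrete, the hub RAM layout of one cache row is then:

        ' one i-cache row = 264 bytes of HUB RAM:
        '   +0    optional duplicated AUGS/AUGD/SETQ/SETQ2 from the end of the previous row
        '   +4    the 256-byte block (64 instructions) read in from external RAM
        '   +260  permanent return instruction back to the wrap handler (never overwritten)
        ' cachebaseplus4 points at offset +4, so row * 264 + cachebaseplus4 is the block start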

  • RossH Posts: 5,347
    edited 2022-02-14 12:01

    Oops! Wrong thread - moving to Catalina thread!

  • Here's the latest work to demonstrate the basic idea. It compiles now, but I need to work on something to actually start testing it out. It probably needs a mod to p2gcc to build code that will be runnable.

  • rogloh Posts: 5,171
    edited 2022-02-15 06:28

    I've been looking at the P2GCC changes that might be needed to try this external memory execution concept out.
    I think what I need to do is basically this:

    • use the -lmm output mode setting in propeller-elf-gcc parameters during the C to assembly step
    • convert occurrences of "lcall" and "brs" opcodes in s2pasm to use my new callpa #,# method of doing branches and calls (see the sketch at the end of this post)
    • handle occurrences of call #__LMM_PUSHM and call #__LMM_POPM so we use setq burst transfers (I already did this in my own version for MicroPython)
    • handle use cases of the link register "lr", specifically "mov pc,lr" and do a "callpa lr, #farcall" or "callpa lr, #farbranch" instead, and also for other indirect calls
    • combine prefix.spin2 with my posted code, and alter my current use of PTRA to use the local sp register instead, and assume a decreasing stack
    • change p2link to split up its code and data segment addresses, so data remains in HUB RAM and code is in a different higher address range (and can be extracted separately in a binary file to be loaded into external RAM)

    Hopefully doing that lot will allow me to test real C code to get this idea off the ground. I know it is still far from ideal just using the P1 assembly code conversion with the P2GCC tools, but it is a start and will let us experiment with the idea. I still strongly believe that if we ever get a proper P2 port of GCC, it could improve performance quite a bit over the P1 conversion - e.g., allowing more than 15 registers, and using PTRA or PTRB indexed addressing for stack frame arguments and automatic variables would reduce code size and increase speed as well.

    Possibly the hardest thing here might be modifying p2link to separate its code and data - not sure yet, haven't dug into that one.
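
    As a rough before/after for the s2pasm change mentioned above (a sketch only - the label names are hypothetical, and the far-call handler would be a push-return variant of the farbranch_handler posted earlier):

        ' P1 LMM-style call emitted by propeller-elf-gcc:
        '       lcall   #_somefunc
        ' rewritten to the callpa far-call form used here:
                callpa  ##_somefunc, #farcall_handler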
