ANNOUNCING: Large memory model for Propeller assembly language programs!

Bill Henning Posts: 6,445
edited 2012-05-16 06:12 in Propeller 1
Dear Friends,

I am afraid I've been sitting on this idea for some weeks; I came up with it shortly after getting my Propeller kit. However, since I have not had time to do much with it, and it does not look like I'll have time this year, I've decided to publish now.

Feel free to use these methods in your compilers, but please credit me if you do :) and don't try to patent it!

If you are going to use it in a commercial compiler, I take PayPal :)

While mulling over the memory limitations of cogs (in between bouts of bugging Chip asking for new features for the next propeller... Chip: hint hint) I thought of something very interesting:

nxt     rdlong  instr,pc
        add     pc,#4
instr   nop                     ' placeholder!
        jmp     #nxt

{ Now Chip, that's another reason I keep bugging you for auto incrementing pointers! }

The above code fragment can execute code stored in HUB memory at 32 cycles per instruction.

Of course I considered that overhead too high, however by the simple expedient of unrolling the loop four times, we get it down to a far more appealing 20 cycles per instruction. Appealing? At 5x slower than local code? YES! It lets us execute LARGE programs!
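Roughly, the unrolled fetch/execute loop would look something like this (exact cycle counts depend on how the rdlongs line up with the hub access window):

nxt     rdlong  instr1,pc       ' fetch the next hub long into the slot below
        add     pc,#4
instr1  nop                     ' the fetched instruction executes here
        rdlong  instr2,pc
        add     pc,#4
instr2  nop
        rdlong  instr3,pc
        add     pc,#4
instr3  nop
        rdlong  instr4,pc
        add     pc,#4
instr4  nop
        jmp     #nxt            ' one jmp amortized over four fetched instructions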

Now that is not all.

Consider... the code executed from main memory can call routines in local memory. We have the main "Program Counter" in a local register.

We can have "FJMP", "FCALL" and even "FBRcc" instructions!

Granted, they will be slow compared to native COG code, but they will be MUCH faster than Spin or any byte code language. I have not written all the primitives, but all we need is an "SP" pointer held in each cog running in "large" model for saved hub return addresses, and a small number of primitive functions that can be called; later they can be "masked" by a macro assembler to look like native instructions.

The "instructions" I propose are:

FJMP addr   ' calls a routine that replaces PC with the long at PC, then jumps to nxt
FCALL addr  ' increments SP by 2, replaces PC with the long at PC after it saves PC+4 at SP
FRET        ' loads PC from the word at SP, decrements SP by two
FBRcc addr  ' works the same as FJMP but is conditional
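
To give an idea of how a macro assembler might emit these, the hub-resident code is just ordinary instructions interleaved with jumps into the cog kernel, each followed by its operand long (r0/count are placeholder registers, and the kernel entry point names and the way hub addresses get expressed to the assembler are details still to be nailed down):

        mov     r0,count        ' ordinary instructions run as fetched
        jmp     #fcall          ' FCALL: jump into the cog's call primitive...
        long    @some_routine   ' ...which picks up this operand long at PC
        jmp     #fjmp           ' FJMP: unconditional far jump
        long    @main_loop

The operand long sits right after the jmp in the hub stream, because by the time the jmp executes the interpreter has already advanced PC past it.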

There you go guys. This "Large Model" I came up with allows for creation of compilers for "conventional" languages meant for more conventional architectures.

And I have a way of addressing the performance penalty ... reducing it in most cases to less than a factor of two compared to native cog code!

Let me know what you think :)

I'd prefer that we standardize on the registers used for PC and SP, as well as the entry points for the 'kernel' routines. I'd also prefer to keep the kernel as small as possible, in order to leave as much cog memory as possible free for buffers and for transient code.

Oh what the heck, I'll let the other cat out of the bag!

There will be another primitive.

Call it "FCACHE"

When the "kernel" executes an "FCACHE" instruction, which in reality just calls a small primitive routine in the cog, it will copy all longs after the FCACHE to the cog's execution buffer, stopping only when it runs across a "NULL" long (0). (I will be using $080-$0FF as the "FCACHE" code area; I would VERY MUCH appreciate it if others adopted my conventions, so that languages compiling to my large model will be code compatible!)

The code between FCACHE and the NULL will be copied to the cache area, and the cog's FCACHE primitive will jump to it after setting PC to the address of the hub long just past the NULL. When it exits, the cached code is responsible for jumping back to nxt.

Cached code must NOT call (or FCALL) any hub code; as a matter of fact, it must obey the rules of normal cog assembly programs.
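
For example, a compiled FCACHE block would sit in the hub instruction stream something like this (count/total/sample are placeholder registers, and any branches inside the block have to be assembled so they land in the $080-$0FF cache area):

        jmp     #fcache             ' call the cog's FCACHE primitive
        ' -- everything from here to the NULL long is copied to $080-$0FF --
        mov     count,#100          ' this part now runs at full cog speed
:loop   add     total,sample        ' (placeholder work)
        djnz    count,#:loop        ' internal branches must resolve to cache addresses
        jmp     #nxt                ' cached code returns to the fetch loop itself
        long    0                   ' NULL: the copy stops here and PC resumes just past it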

Yes, by loading more than the 128 words I suggest, this can be used as a "paging" mechanism for very large programs.

Yes, this also makes it possible to run multiple threads per cog - I have a "YIELD" primitive in mind that saves the PC and SP and switches to another thread of execution (tasks, for now, must statically allocate non-overlapping registers).
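
Just to sketch the simplest two-thread case (this is not the full model, and the register names are only placeholders), YIELD merely has to swap the running thread's PC and SP with a saved pair:

f_yield mov     t1,pc           ' save the running thread's context
        mov     t2,sp
        mov     pc,altPC        ' restore the other thread's context
        mov     sp,altSP
        mov     altPC,t1
        mov     altSP,t2
        jmp     #nxt            ' resume fetching in the other thread

altPC   long    0               ' saved context for the second thread
altSP   long    0
t1      long    0
t2      long    0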

OK, that's it.

No one better try to patent this as "their" IP - that's why I'm very publicly disclosing this :)

Comments

  • Bill Henning Posts: 6,445
    edited 2006-11-10 09:25
    A bit more detail:

    I propose the following branch instructions:

    FBRC addr ' branch to far address if Carry flag is set
    FBRNC addr ' branch to far address if Carry flag is clear
    FBRZ addr ' branch to far address if Zero flag is set
    FBRNZ addr ' branch to far address if Zero flag is NOT set

    By the way, the same mechanism would also work for, say, external memory, except the inner interpreter loop would have to be changed.

    I'd like all of us to get together and work out a standard everyone will conform to - sort of a "Propeller ABI".

    A couple of limitations:

    Code directly executed out of hub memory MAY NOT use any of the conditional branch instructions directly; it MUST use the FBRcc primitives (otherwise it would branch out of the hub interpreter loop!).

    Code in FCACHE blocks must not use any of the Fxxxx primitives, as mentioned in the earlier message.

    A couple of HUGE advantages:

    Think of system calls in HUB memory, things like SPI_IN / SPI_OUT / SD_READ / SD_WRITE

    All those system calls can include FCACHE blocks and run at full cog speed!

    A neat trick:

    In code executed out of HUB memory (BUT NOT INSIDE FCACHE/NULL blocks!)... consider the effect of

            if_c    add     pc,#40          ' yep, a short conditional branch in HUB code without using a primitive!

    So the FBRcc primitives are only needed when the branch target is more than ±128 words away from the hub location where the add/sub is executed. If we accept that limitation, there is no need for the FBRcc primitives at all, which keeps the kernel smaller.
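
    A couple more examples of the same trick (pc already points at the next hub long by the time the fetched instruction runs, and the offsets are in bytes):

            if_nc   add     pc,#8           ' skip the next two longs when C is clear
            if_nz   sub     pc,#16          ' loop back to the long three above this one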

    Oh, I also want to reserve FSVC for a system service call routine.

    I'd also like to reserve 128 longs in the first 512 longs of hub memory, I have some excellent ideas for them, but I am too tired to spill any more beans tonite.

    I will be setting up a blog for this project soon.

    Good Night,

    Bill


    Post Edited (Bill Henning) : 11/10/2006 10:06:53 AM GMT
  • cgracey Posts: 14,256
    edited 2006-11-10 09:48
    Bill,

    This is a great idea! I love the fetch/execute and FCACHE, but YIELD could be a real sleeper. If you could get multiple ASM threads running on one cog, that would be great -- especially if people could crunch multiple mid-bandwidth processes like serial comms. I hear you (and everyone else) on the auto-incrementing address register(s). This large memory model is going to be exciting.

    BTW, I hope there are no people on the forum who would be so rotten as to patent things they've seen here. Well, the system is broken, so the joke might be on them, after all. "He that diggeth a pit shall fall into it" - and hopefully sooner than later.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • Bill Henning Posts: 6,445
    edited 2006-11-10 09:55
    Chip,

    Thanks!

    I am pretty sure I can get YIELD working.

    That's one of the reasons I want 128 of the first 512 longs. I'd really like 256 of them, but some might see that as greedy. I'm aiming for four threads per cog, and eight is not impossible if people respect my 128 word limit for FCACHE blocks.

    Of course this also means that we can do a native-code large model SPIN compiler that would be 10x faster than the current Spin, and normally close to pure assembly language program speeds. Ditto for a C compiler :)

    Best,

    Bill

    p.s.

    Multiple mid-speed threads are why I came up with YIELD :)
  • Graham Stabler Posts: 2,510
    edited 2006-11-10 10:12
    As any patent would be filed after this info was posted on a public forum, they would be pretty screwed anyway.

    This looks rather cool.

    Graham
  • nutson Posts: 242
    edited 2006-11-10 15:17
    Great idea, Bill. This not only opens up the possibility of having a C compiler for the prop one day, but also interesting options such as multiple props executing the same program with different data sets, feasible since each prop has a dedicated register set. Shared RAM bandwidth is going to be the limiting factor in the prop's performance, as in many multi-processor systems. By making shared RAM access more intelligent, assigning time slots only to props demanding access, the prop's performance could be increased even more. The rotating combustion engine piston type of explanation for shared resource access would not apply then (what a pity), but 8-digit MIPS figures would certainly soften the pain.

    Nico Hattink
  • Mike Green Posts: 23,101
    edited 2006-11-10 15:28
    Bill,
    One of the issues here is that the "entry points" for the basic "instructions" are going to have to be hard-fixed, or go through a fixed jump table, or use a "linking process"; otherwise maintenance and upgrades are going to be a nightmare. In particular, if I were to incorporate this into the Propeller OS (which sounds like a great idea), the loader/I2C cog would use this as its basic loop, and some of the basic "instructions" would read/write EEPROM or load and execute a SPIN program. If I were to make corrections or changes to the loader, some of the "entry points" could change unless there was a well-defined convention. I was planning to add primitives to read/write between EEPROM and the loader's cog memory. This now gives a much better general framework than the simple overlay loader I had envisioned. Doing overlays from EEPROM is way slower than from HUB memory, but it might be very useful for some applications and would be completely independent of HUB memory, needing only its own 2-I/O-pin I2C bus.
    Mike
  • Mike Green Posts: 23,101
    edited 2006-11-10 15:58
    Bill,
    Another piece: Unless the SPIN compiler is modified to make some low memory available, it will be difficult at best to integrate some of your ideas with the existing Propeller Tool. It would be a shame to not be able to use the existing SPIN interpreter.

    It would be easy to modify the OS's loader to load and execute a modified SPIN image that skips over a block of low HUB memory. The space in the EEPROM could be used for other things or used to initialize the 128 long area. The Propeller Tool would have to have a directive added (like _xxxx = ???) that would specify the size of the area to be skipped (from $10 to $10+???-1). This would be compatible with the existing boot loader and all existing code.

    Another Propeller Tool directive that would be useful would be an "ORG"-like statement that would specify the location in the binary image to use for the following assembly/data information. This could be used to initialize fixed areas like the 128 long area.
    Mike
  • helloseth Posts: 43
    edited 2006-11-10 16:37
    This reminds me a bit of a feature in another MCU I am following: http://www.intellasys.net/products/index.php

    Their Seaforth24 product seems very similar to the Prop, but bigger and more complex.


    They have/will have multi-core (24) MCUs, which process Forth directly. But one of their cool features is that one core can execute code 'read' directly from another core. Basically the code is passed one 'word' at a time from the other core. (They had a few white papers on their Resources page explaining this, but that page is empty as I write this.)

    Seth
  • Cliff L. Biffle Posts: 206
    edited 2006-11-10 16:49
    Bill,

    This is almost exactly what I've already tried for the Forth kernel (the paging approach I mentioned) -- I used it to implement an experimental DTC interpreter and some user-native-code support.

    On the Propeller, with the stack in shared RAM, it is no faster than an ITC interpreter, and in many cases is slower. The speed of the ITC is also bound by memory bus bandwidth, but it transfers significantly less.

    I'm still considering this approach for user-defined native words, but I'm mostly responding to correct your statement about it being faster than "any bytecode language." This is most likely not correct. (Bytecode VMs and ITC are a trivial transform apart, so I take your statement as applying to both, as well as token-threaded code.)

    It will, of course, be significantly faster than SPIN. :)
  • Paul Baker Posts: 6,351
    edited 2006-11-10 19:28
    At the risk of veering off into YABPC (yet another bashing-patents conversation), an examiner wouldn't know to look in Company X's online forums for prior art (I know first-hand that forums are not part of their search strategy). Assuming it was filed today, it would be ~3 years before the examiner saw it; by then this thread would be 3 years old, and then there are ~2 more years before it became a patent, at which point a lawyer is looking for material buried in a forum that is 5 years old, even assuming a party of interest has hired a lawyer. SIRs are the only avenue in which a person can reasonably expect an examiner will see the information, and that costs money. IEEE and ACM (I miss my access to their databases of articles) also have a better-than-average expectation of being noticed by an examiner, but you have to convince the reviewers it is worthwhile information for them to publish.
    Graham Stabler said...
    As any patent would be filed after this info was posted on a public forum they would be pretty screwed anyway.

    This looks rather cool.

    Graham

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Paul Baker
    Propeller Applications Engineer

    Parallax, Inc.

    Post Edited (Paul Baker (Parallax)) : 11/10/2006 8:49:46 PM GMT
  • Mike Green Posts: 23,101
    edited 2006-11-10 20:12
    How about some suggestions for rearranging and maybe eliminating an instruction or two?
    entry                   rdlong  pc,PAR
                            mov     stkPtr,pc
                            shr     stkPtr,#16
                            jmp     #nxt
    
    fjmp                    rdlong  pc,pc
                            jmp     #nxt
    
    fcache                  movd    :copyIt,#$80
                            nop
    :copyIt                 rdlong  0-0,pc             wz
                            add     :copyIt,dspIncr
                            add     pc,#4
                    if_nz   jmp     #:copyIt
                            jmp     #$80
    
    fret                    sub     stkPtr,#2
                            rdword  pc,stkPtr
                            jmp     #nxt
    
    fcall                   rdlong  nxtPc,pc
                            add     pc,#4
                            wrword  pc,stkPtr
                            add     stkPtr,#2
                            mov     pc,nxtPc
    
    nxt                     rdlong  :inst1,pc
                            add     pc,#4
    :inst1                  nop
                            rdlong  :inst2,pc
                            add     pc,#4
    :inst2                  nop
                            rdlong  :inst3,pc
                            add     pc,#4
    :inst3                  nop
                            rdlong  :inst4,pc
                            add     pc,#4
    :inst4                  nop
                            jmp     #nxt
    
    dspIncr                 long    1 << 9
    pc                      long    0
    stkPtr                  long    0
    nxtPc                 long    0
    
    


    One question for others ... How about having the stack in the cog? There are pros and cons. If it's strictly a call stack, it could be reasonably limited in depth. A good place would be to run the stack downwards from the end of the cache area. It wouldn't be too hard to pack return addresses 2 per long word. Advantage is that there'd be one less thing to allocate in HUB RAM. Disadvantage is that it'd be harder to switch execution threads.
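
    For what it's worth, a packed push could look something like this (the pop is just the mirror image; stkIdx, stk0, t1, t2 are placeholder registers, and the stack is shown growing upwards from stk0 to keep the sketch short):

    f_call2                 rdlong  nxtPc,pc        ' the target address follows the call in hub
                            add     pc,#4           ' pc is now the 16-bit return address
                            mov     t1,stkIdx
                            shr     t1,#1           ' two return addresses share one cog long
                            add     t1,#stk0
                            movd    :even,t1        ' patch both stores to hit that long
                            movd    :odd,t1
                            test    stkIdx,#1  wz   ' even entry -> low word, odd -> high word
                            add     stkIdx,#1
                            mov     t2,pc
                            shl     t2,#16
    :even   if_z            mov     0-0,pc          ' even: store low word, clear high word
    :odd    if_nz           or      0-0,t2          ' odd: merge into the high word
                            mov     pc,nxtPc
                            jmp     #nxt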

    Post Edited (Mike Green) : 11/10/2006 8:24:06 PM GMT
  • cgracey Posts: 14,256
    edited 2006-11-10 20:39
    I found something... The nop could be gotten rid of in fcache by post-fixing the destination address:
    fcache                  rdlong  $80,pc             wz
                            add     fcache,dspIncr
                            add     pc,#4
                    if_nz   jmp     #fcache

                            movd    fcache,#$80
                            jmp     #$80


    Here is a way to make it ~33% faster by adding 4 instructions:

    fcache                  rdlong  $80,pc             wz
                            add     fcache,dspIncr2          ' dspIncr2 = 2 << 9
                            add     pc,#4
    fcache2         if_nz   rdlong  $81,pc             wz
                            add     fcache2,dspIncr2         ' dspIncr2 = 2 << 9
                            add     pc,#4
                    if_nz   jmp     #fcache

                            movd    fcache,#$80
                            movd    fcache2,#$81
                            jmp     #$80


    I love doing stuff like this!


    Mike Green said...
    How about some suggestions for rearranging and maybe eliminating an instruction or two?
    entry                   rdlong  pc,PAR
                            mov     stkPtr,pc
                            shr     stkPtr,#16
                            jmp     #nxt
    
    fjmp                    rdlong  pc,pc
                            jmp     #nxt
    
    fcache                  movd    :copyIt,#$80
                            nop
    :copyIt                 rdlong  0-0,pc             wz
                            add     :copyIt,dspIncr
                            add     pc,#4
                    if_nz   jmp     #:copyIt
                            jmp     #$80

    fret                    sub     stkPtr,#2
                            rdword  pc,stkPtr
                            jmp     #nxt
    
    fcall                   rdlong  nxtPc,pc
                            add     pc,#4
                            wrword  pc,stkPtr
                            add     stkPtr,#2
                            mov     pc,nxtPc
    
    nxt                     rdlong  :inst1,pc     {THIS IS NEAT!!!}
                            add     pc,#4
    :inst1                  nop
                            rdlong  :inst2,pc
                            add     pc,#4
    :inst2                  nop
                            rdlong  :inst3,pc
                            add     pc,#4
    :inst3                  nop
                            rdlong  :inst4,pc
                            add     pc,#4
    :inst4                  nop
                            jmp     #nxt

    dspIncr                 long    1 << 9
    pc                      long    0
    stkPtr                  long    0
    
    

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.

    Post Edited (Chip Gracey (Parallax)) : 11/10/2006 8:46:06 PM GMT
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-10 20:42
    This is all pretty exciting stuff! It turns out that autoincrementing wouldn't help in the nxt loop, if it were available, since the extra instruction is necessary for pipelining considerations, anyway.

    I might suggest a jump table for the fcall, fret, etc. This keeps their cog addresses constant; and there's no speed penalty, since you can use an indirect jump to execute them.

    Also, you don't really need special treatment for conditional branches, since the address of the jump will be the least-significant word in the long making up the next "instruction". This means that the 16 most-significant bits are zero — a nop! If the jump isn't taken, it'll just fall through the nop onto the next instruction. So, just using the Propeller's conditionals on the jmp to fjmp, say, will suffice.
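
    In other words, a conditional far branch could be emitted simply as (the target name is illustrative):

            if_c    jmp     #fjmp           ' taken: fjmp does rdlong pc,pc on the long below
                    long    target_addr     ' hub addresses fit in 16 bits, so the upper bits
                                            ' (including the condition field) are zero: a nop
            ' execution falls through to here when C was clear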

    -Phil
  • Cliff L. Biffle Posts: 206
    edited 2006-11-10 20:48
    Mike,

    If you're trying to save shared RAM, putting the stack in the Cog works well. I've got a prototype (for my other compiler for a different language).

    However, if you're doing it for speed, you may be disappointed; I was unable to get it faster (in the general case) than putting the stack in shared RAM and keeping TOS in a register. I don't have that lab notebook here or I'd post the math.

    Perhaps someone cleverer than I can pull it off; I will gladly steal^Wuse their code. :)

    Edit: I'm speaking here specifically of a data stack or mixed data/return stack (as in C), not a dedicated return stack (as in Forth). Putting a return stack in the Cog would be easier, but also less of an optimization (the data stack tends to be hotter by an order of magnitude in languages that separate them).
  • cgracey Posts: 14,256
    edited 2006-11-10 21:05
    Man, this little bit of code really gets my gears turning. It is ingenious! It can execute assembly language from hub RAM at 1/4 rate, and then cache any loops and execute them internally. For many applications, this could mean 95% of the speed of entirely-native cog execution, but with the giant benefits of a larger memory model. This is just awesome. Way to go, Bill!

    I keep experiencing this phenomenon where what I had accepted as a hard and fast limitation gets blown straight through by some unexpected finesse, and it's been making it difficult to get back onto the next chip. So many times I've thought, "Well, the next chip will be able to do that." And then we find a way to get the current chip to do it. Fun fun fun.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.

    Post Edited (Chip Gracey (Parallax)) : 11/10/2006 9:48:42 PM GMT
  • Bill Henning Posts: 6,445
    edited 2006-11-10 21:29
    I am glad you like it :)

    I've been bursting at the seams to let it out, but I was first trying to think of some way of directly monetizing it.

    Last night I decided that indirect monetization (getting better known, eventually making a web site about it, people supporting the idea and helping me) is better in this case.

    More tonight, when I am not at work - I will document the threading model I came up with; Chip, I sent you a PM outlining the basics of it :)
  • Ym2413a Posts: 630
    edited 2006-11-10 21:31
    Oh wow! This idea looks amazing.
  • Mike Green Posts: 23,101
    edited 2006-11-10 22:29
    Chip,
    One little correction in the case that you find a zero value on the first fetch.
    This way, pc is properly set to point after the zero value.
    fcache                  rdlong  $80,pc             wz
                            add     fcache,dspIncr2          ' dspIncr2 = 2 << 9
                    if_nz   add     pc,#4
    fcache2         if_nz   rdlong  $81,pc             wz
                            add     fcache2,dspIncr2         ' dspIncr2 = 2 << 9
                            add     pc,#4
                    if_nz   jmp     #fcache
                            movd    fcache,#$80
                            movd    fcache2,#$81
                            jmp     #$80
    
    
  • cgracey Posts: 14,256
    edited 2006-11-10 22:38
    Oh, I didn't think about that. That would have caused a problem. Good thinking.
    Mike Green said...
    Chip,
    One little correction in the case that you find a zero value on the first fetch.
    This way, pc is properly set to point after the zero value.
    fcache                  rdlong  $80,pc             wz
                            add     fcache,dspIncr2          ' dspIncr2 = 2 << 9
                    if_nz   add     pc,#4                    ' <-- the corrected line
    fcache2         if_nz   rdlong  $81,pc             wz
                            add     fcache2,dspIncr2         ' dspIncr2 = 2 << 9
                            add     pc,#4
                    if_nz   jmp     #fcache
                            movd    fcache,#$80
                            movd    fcache2,#$81
                            jmp     #$80
    
    

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • Dennis Ferron Posts: 480
    edited 2006-11-10 22:48
    Begin YABPC:

    I realize copyrights and patents are apples and oranges, but can the GNU General Public License be used to protect an idea from patent sharks? For instance, is there a way Bill could release this as GPL or public domain and thereby make it unpatentable? What about circuit schematics - can making the schematic public domain protect it from being patented by an unscrupulous company later?

    end YABPC:
  • Paul Baker Posts: 6,351
    edited 2006-11-10 22:59
    Well, the whole software thing has been thoroughly loused up by the interpretations of the statutes by the legal system. It wasn't until recently that anything of a software nature could be patented. The loophole that has since been expanded to a crater the size of Texas is that software coupled to the act of executing it on hardware is now considered patentable. But yes, anything that is publicly disclosed is considered prior art; the crux of this is that it's only as good as how well publicized it is. IOW, if it's not commonly known and available, there is a more than likely chance the examiner won't know about it and won't apply it. But anything of public knowledge is fair game; I had on more than one occasion used actual sections of code of the Linux operating system to reject an application. But this was only because I or a senior examiner I consulted with knew that Linux did the same thing.

    Now, out-of-the-way publications are just as valid, but if they weren't applied on the front end, it requires overturning the patent on the back end (i.e. suing in a court of law). But that can easily run into the millions of dollars, so it's best to avoid the situation whenever possible.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Paul Baker
    Propeller Applications Engineer

    Parallax, Inc.

    Post Edited (Paul Baker (Parallax)) : 11/10/2006 11:06:27 PM GMT
  • Mike Green Posts: 23,101
    edited 2006-11-10 23:00
    Well, to get very specific, very early, I propose that any COG that is using Bill's system start with at least the following longs:
            org          0
            jmp        #cache             ' initialization code in cache
            jmp        #f_next            ' continue with next instruction
            jmp        #f_jmp             ' jump routine
            jmp        #f_call              ' call routine
            jmp        #f_ret              ' return routine
            jmp        #f_cache          ' cache & jump routine
            jmp        #f_expand        ' expansion / system call
            jmp        #f_next            ' reserved
    
    


    Note that by using jmp instructions, you can either transfer control or use the non-immediate form of a jump, but it's clearer what you intend. The call stack will begin at $7F and move downwards. The cache will begin at $80 as planned and go to $FF. The support routines will fit between this jump table and the stack. Any thoughts?
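
    For example, compiled hub code would then always go through the fixed slots, no matter how the kernel behind them changes (the operand labels are illustrative):

            jmp     #3              ' call: slot 3 always vectors to f_call
            long    @some_routine
            jmp     #2              ' far jump: slot 2 always vectors to f_jmp
            long    @main_loop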

    I'm pushing for some basic conventions because I want to start using this with the Propeller OS I wrote and I'm getting ready to make a new release.
    Mike

    Post Edited (Mike Green) : 11/10/2006 11:04:26 PM GMT
  • Bill Henning Posts: 6,445
    edited 2006-11-10 23:09
    I like it!

    Suggestions:

    - I think the call stack should be in hub memory. The threading model I am playing with pretty much requires that (up to eight threads per cog)

    - let's reserve the first sixteen longs as jump vectors; leaves room for my threading and messaging primitives

    - let's try to keep the kernel below $80, including all vectors. I know, this is limiting, but I have more ideas...

    Rough cog memory map:

    $000-$00F vectors
    $010-$07F primitives
    $080-$0FF FCACHE area
    $100-$17F FDCACHE area (more on this later)
    $180-$1EF virtual peripheral area / thread space (more on this later)
    $1F0-$1FF the reserved cog special registers

    Please note that each cache and virtual peripheral area works out to be 512 bytes long. This is not an accident. Think swapping in from SPI EEPROM, or swapping in/out to/from SPI ferroram.
    Basically what I am doing is treating a cog's memory as a program-controlled L1 cache, with the HUB memory as the "main memory".

    Later I plan to treat HUB memory as an L2 cache :)

    (I'm having a late lunch so I could comment)
  • Mike Green Posts: 23,101
    edited 2006-11-10 23:31
    Bill,
    I'm trying to simplify some aspects of this, yet allow for complexity later when needed. Unless you're doing multi-threading, you may not need a HUB-based stack; the vectors, basic primitives, and cache are all that would be needed (and would be strictly upward compatible with the multi-threaded version). I would still like to push ahead with a cog-based stack version, but make sure that only the call/ret/initialization routines know about that.
    Mike
  • Mike Green Posts: 23,101
    edited 2006-11-10 23:45
    One more thing ... The PC variable needs to be somewhere directly accessible so that the relative "jumps" can be done. Since location zero is only used for the jump to the initialization code, I suggest that we use COG location zero for the PC value. It would be set up by the once-only initialization code, probably using PAR directly or indirectly. Consider keeping the HUB stack pointer in the upper half of location zero. It might make multi-threading easier. You'd swap one long for another and you'd be off in the next thread. It would add an instruction or two to a call or return at most.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-11 00:47
    Further simplifying, I don't think you even need an fjmp routine. It's only one instruction (rdlong pc,pc), which could just as easily be put inline: rdlong pc,pc followed by the address for long absolute jumps, mov pc,#nnn for jumps to the first 512 program locations, add pc,#nnn for short forward relative jumps, and sub pc,#nnn for short backward relative jumps.
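
    Spelled out, the inline forms would be (targets are illustrative):

            rdlong  pc,pc           ' long absolute jump...
            long    far_target      ' ...to the hub address held in the next long
            mov     pc,#$1F0        ' absolute jump to a low hub address (9-bit immediate)
            add     pc,#16          ' short relative jump four longs forward
            sub     pc,#32          ' short relative jump backwards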

    -Phil
  • Mike Green Posts: 23,101
    edited 2006-11-11 00:55
    OK, so now we're down to
    DAT
                            org     0
    bill_pc                 jmp     #f_initial
    bill_next               jmp     #f_next
    bill_call               jmp     #f_call
    bill_retn               jmp     #f_retn
    bill_cache              jmp     #f_cache
    bill_expand             jmp     #f_next
    bill_reserved0          jmp     #f_next
    bill_reserved1          jmp     #f_next
    bill_reserved2          jmp     #f_next
    bill_reserved3          jmp     #f_next
    bill_reserved4          jmp     #f_next
    bill_reserved5          jmp     #f_next
    bill_reserved6          jmp     #f_next
    bill_reserved7          jmp     #f_next
    bill_reserved8          jmp     #f_next
    bill_reserved9          jmp     #f_next
    
    
  • Bill Henning Posts: 6,445
    edited 2006-11-11 00:57
    Excellent!
  • Mike Green Posts: 23,101
    edited 2006-11-11 01:33
    OK, here we go. The entries are defined as CONs so they can be referenced using the object constant "obj#con" notation. There's no jump entry since we can use a "RDLONG obj#bhPC,obj#bhPC" for that purpose and it doesn't cost any more time. I'll add some more comments after I've gotten some dinner.

    Comments? Questions?

    Post Edited (Mike Green) : 11/11/2006 1:40:51 AM GMT
  • Mike Green Posts: 23,101
    edited 2006-11-11 01:49
    Another special case:

    A halt/pause could be "sub pc,#4" (pc already points at the next hub long, so the instruction just keeps fetching itself).