ANNOUNCING: Large memory model for Propeller assembly language programs!
Bill Henning
Posts: 6,445
Dear Friends,
I am afraid I've been sitting on this idea for some weeks; I came up with it shortly after getting my Propeller kit. However, since I have not had time to do much with it, and it does not look like I'll have time this year, I've decided to publish it now.
Feel free to use these methods in your compilers, but please credit me if you do :) and don't try to patent it!
If you are going to use it in a commercial compiler, I take PayPal :)
While mulling over the memory limitations of cogs (in between bouts of bugging Chip asking for new features for the next propeller... Chip: hint hint) I thought of something very interesting:
[code]
nxt     rdlong  instr,pc
        add     pc,#4
instr   nop             ' placeholder!
        jmp     #nxt
[/code]
{ Now Chip, that's another reason I keep bugging you for auto incrementing pointers! }
The above code fragment can execute code stored in HUB memory at 32 cycles per instruction.
Of course I considered that overhead too high; however, by the simple expedient of unrolling the loop four times, we get it down to a far more appealing 20 cycles per instruction. Appealing? At 5x slower than local code? YES! It lets us execute LARGE programs!
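To make the fetch-execute idea concrete, here is a minimal Python sketch of the loop. The hub, the decode, and the (opcode, operand) encoding are illustrative stand-ins, not real Propeller semantics:

```python
# Minimal model of the LMM fetch-execute loop. "hub" stands in for hub
# RAM, one entry per 32-bit long; instructions are modeled as simple
# (opcode, operand) tuples. Names and encoding are illustrative only.

def run_lmm(hub, pc=0):
    """Fetch one long at a time and 'execute' it in place, like the
    rdlong / add pc,#4 / execute / jmp loop above."""
    acc = 0
    while True:
        op, arg = hub[pc // 4]   # nxt: rdlong instr,pc
        pc += 4                  #      add pc,#4
        if op == "add":          # instr: the fetched long executes here
            acc += arg
        elif op == "halt":
            return acc           # (the real loop just jumps back to nxt)

program = [("add", 1), ("add", 2), ("add", 3), ("halt", 0)]
print(run_lmm(program))  # 6
```

Unrolling the real loop four times saves three of the four `jmp #nxt` instructions per group, which is where the 32-to-20 cycle improvement comes from.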
Now that is not all.
Consider... the code executed from main memory can call routines in local memory. We have the main "Program Counter" in a local register.
We can have "FJMP", "FCALL" and even "FBRcc" instructions!
Granted, they will be slow compared to native COG code, but they will be MUCH faster than Spin or any byte code language. I have not written all the primitives, but all we need is an "SP" pointer held in each cog running in "large" model for saved hub return addresses, and a small number of primitive functions that can be called; later they can be "masked" by a macro assembler to look like native instructions.
The "instructions" I propose are:
FJMP addr  ' calls a routine that replaces PC with the long at PC, then jumps to nxt
FCALL addr ' increments SP by 2, saves PC+4 at SP, then replaces PC with the long at PC
FRET       ' loads PC from the word at SP, decrements SP by two
FBRcc addr ' works the same as FJMP but is conditional
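For anyone who wants to play with the semantics before writing PASM, here is a toy Python model of FJMP/FCALL/FRET. PC and SP are plain integers standing in for cog registers; SP moves by 2 per the descriptions above, and the exact stack layout is an assumption:

```python
# Illustrative model of the FJMP/FCALL/FRET primitives. "hub" is a list
# of longs indexed by (byte address // 4); "stack" is a hub-resident
# return stack with word-sized slots. All names are illustrative.

class LmmKernel:
    def __init__(self, hub):
        self.hub = hub       # hub memory, one entry per long
        self.pc = 0          # hub byte address of the next long to fetch
        self.sp = 0          # stack pointer, stepped by 2 as in the post
        self.stack = [0] * 64

    def fjmp(self):
        # replace PC with the long stored at PC, then resume fetching
        self.pc = self.hub[self.pc // 4]

    def fcall(self):
        # save the return address (just past the address long), then jump
        target = self.hub[self.pc // 4]
        self.sp += 2
        self.stack[self.sp] = self.pc + 4
        self.pc = target

    def fret(self):
        self.pc = self.stack[self.sp]
        self.sp -= 2

k = LmmKernel(hub=[100, 0, 0, 0])
k.fcall()
print(k.pc)   # 100: now executing the called hub routine
k.fret()
print(k.pc)   # 4: back just past the address long
```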
There you go guys. This "Large Model" I came up with allows for creation of compilers for "conventional" languages meant for more conventional architectures.
And I have a way of addressing the performance penalty ... reducing it in most cases to less than a factor of two compared to native cog code!
Let me know what you think :)
I'd prefer that we standardize on registers used for PC and SP, as well as the entry points for the 'kernel' routines. I'd prefer to keep the kernel as small as possible in order to have as much cog memory free for use as buffers and for transient code as possible.
Oh what the heck, I'll let the other cat out of the bag!
There will be another primitive.
Call it "FCACHE"
When the "kernel" executes an "FCACHE" instruction (which in reality just calls a small primitive routine in the cog), it will copy all longs after the FCACHE to the cog's execution buffer, stopping only when it runs across a "NULL" long (0). I will be using $080-$0FF as the "FCACHE" code area; I would VERY MUCH appreciate it if others adopted my conventions, because that way languages compiling to my large model will be code compatible!
The code between FCACHE and NULL will be copied to the cache area, and the cog FCACHE primitive will jump to it after setting PC to the address of the hub word just past the NULL. When it exits, the code is responsible to jump to nxt.
Cached code must NOT call (or FCALL) any hub code; as a matter of fact, it must obey the rules of normal cog assembly programs.
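The copy semantics can be sketched in a few lines of Python (byte-addressed hub, a dict for cog memory; all names illustrative):

```python
# Sketch of the FCACHE primitive: starting at PC, copy longs into the
# cog's cache area until a NULL (0) long is found, then leave PC
# pointing at the long just past the NULL so the cached block can
# jump back to nxt when it finishes.

CACHE_BASE = 0x080           # cog longs $080-$0FF, per the post

def fcache(hub, pc, cog):
    """Copy the FCACHE block into cog memory; return the new PC."""
    dst = CACHE_BASE
    while True:
        long_ = hub[pc // 4]
        pc += 4
        if long_ == 0:       # NULL long terminates the block
            return pc        # PC now points just past the NULL
        cog[dst] = long_
        dst += 1

cog = {}
new_pc = fcache([11, 22, 0, 99], 0, cog)
print(new_pc)                # 12: just past the NULL at byte address 8
```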
Yes, by loading more than the 128 words I suggest, this can be used as a "paging" mechanism for very large programs.
Yes, this also makes it possible to run multiple threads per cog - I have a "YIELD" primitive in mind that saves PC and SP and switches to another thread of execution (tasks for now must statically allocate non-overlapping registers).
Ok, that's it.
No one had better try to patent this as "their" IP - that's why I'm very publicly disclosing this :)
Comments
I propose the following branch instructions:
FBRC addr ' branch to far address if Carry flag is set
FBRNC addr ' branch to far address if Carry flag is clear
FBRZ addr ' branch to far address if Zero flag is set
FBRNZ addr ' branch to far address if Zero flag is NOT set
By the way, the same mechanism would also work with, say, external memory, except the inner interpreter loop would have to be changed.
I'd like all of us to get together and work out a standard everyone will conform to - sort of a "Propeller ABI".
A couple of limitations:
Code directly executed out of hub memory MAY NOT use any of the conditional branch instructions directly; it MUST use the FBRcc primitives (otherwise it would branch out of the hub interpreter loop!)
Code in FCACHE blocks must not use any of the Fxxxx primitives as mentioned in the earlier message
A couple of HUGE advantages:
Think of system calls in HUB memory, things like SPI_IN / SPI_OUT / SD_READ / SD_WRITE
All those system calls can include FCACHE blocks and run at full cog speed!
A neat trick:
In code executed out of HUB memory (BUT NOT INSIDE FCACHE/NULL blocks!)... consider the effect of
        if_c  add pc,#40            ' yep, a short conditional branch in HUB code without using a primitive!
So the FBRcc primitives are only needed if the branch target is more than +-128 words distant from the hub location where the add/sub op is executed. If we accept that limitation, there is no need for FBRcc primitives. Which keeps the kernel smaller.
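The trick amounts to a conditional relative branch inside the interpreter loop; a toy Python model of a single fetch step shows the effect (the encoding here is made up purely for illustration):

```python
# "if_c add pc,#40" is just a conditional relative branch: adjusting PC
# inside the fetch loop skips hub longs without calling any FBRcc
# primitive. Carry is modeled as a plain boolean.

def step(hub, pc, carry):
    op, arg = hub[pc // 4]
    pc += 4                  # the loop's own "add pc,#4"
    if op == "add_pc_if_c" and carry:
        pc += arg            # short branch taken, no primitive needed
    return pc

hub = [("add_pc_if_c", 8), ("nop", 0), ("nop", 0), ("nop", 0)]
print(step(hub, 0, carry=True))   # 12: skipped the next two longs
print(step(hub, 0, carry=False))  # 4: fell through to the next long
```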
Oh, I also want to reserve FSVC for a system service call routine.
I'd also like to reserve 128 longs in the first 512 longs of hub memory. I have some excellent ideas for them, but I am too tired to spill any more beans tonight.
I will be setting up a blog for this project soon.
Good Night,
Bill
Post Edited (Bill Henning) : 11/10/2006 10:06:53 AM GMT
This is a great idea! I love the fetch/execute and FCACHE, but YIELD could be a real sleeper. If you could get multiple ASM threads running on one cog, that would be great -- especially if people could crunch multiple mid-bandwidth processes like serial comms. I hear you (and everyone else) on the auto-incrementing address register(s). This large memory model is going to be exciting.
BTW, I hope there are no such people on the forum that would be so rotten as to patent things they've seen here. Well, the system is broken, so the joke might be on them, after all. "He that diggeth a pit shall fall into it" - and hopefully sooner than later.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
Thanks!
I am pretty sure I can get YIELD working
That's one of the reasons I want 128 of the first 512 longs. I'd really like 256 of them, but some might see that as greedy. I'm aiming for four threads per cog, and eight is not impossible if people respect my 128 word limit for FCACHE blocks.
Of course this also means that we can do a native code large model SPIN compiler that would be 10x faster than the current Spin, and normally close to pure assembly language program speeds. Ditto for a C compiler :)
Best,
Bill
p.s.
Multiple mid-speed threads are why I came up with YIELD :)
This looks rather cool.
Graham
Nico Hattink
One of the issues here is that the "entry points" for the basic "instructions" are going to have to be hard-fixed, go through a fixed jump table, or use a "linking process"; otherwise maintenance and upgrades are going to be a nightmare.
In particular, if I were to incorporate this into the Propeller OS (which sounds like a great idea), the loader/I2C cog would use this as its basic loop, and some of the basic "instructions" would read/write EEPROM or load and execute a SPIN program. If I were to make corrections or changes to the loader, some of the "entry points" could change unless there was a well defined convention.
I was planning to add primitives to read/write between EEPROM and the loader's cog memory. This now gives a much better general framework than the simple overlay loader I had envisioned. Doing overlays from EEPROM is way slower than from HUB memory, but it might be very useful for some applications and would be completely independent from HUB memory, needing only its own 2 I/O pin I2C bus.
Mike
Another piece: Unless the SPIN compiler is modified to make some low memory available, it will be difficult at best to integrate some of your ideas with the existing Propeller Tool. It would be a shame to not be able to use the existing SPIN interpreter.
It would be easy to modify the OS's loader to load and execute a modified SPIN image that skips over a block of low HUB memory. The space in the EEPROM could be used for other things or used to initialize the 128 long area. The Propeller Tool would have to have a directive added (like _xxxx = ???) that would specify the size of the area to be skipped (from $10 to $10+???-1). This would be compatible with the existing boot loader and all existing code.
Another Propeller Tool directive that would be useful would be an "ORG"-like statement that would specify the location in the binary image to use for the following assembly/data information. This could be used to initialize fixed areas like the 128 long area.
Mike
Their Seaforth24 product seems very similar to the Prop, but bigger and more complex.
They have/will have multi-core (24) MCUs, which process Forth directly. But one of their cool features is that one core can execute code 'read' directly from another core. Basically the code is passed one 'word' at a time from the other core. (They had a few white papers on their Resources page explaining this, but that page is empty as I write this.)
Seth
This is almost exactly what I've already tried for the Forth kernel (the paging approach I mentioned) -- I used it to implement an experimental DTC interpreter and some user-native-code support.
On the Propeller, with stack in shared RAM, it is no faster than an ITC interpreter, and in many cases is slower. The speed of the ITC is also bound by memory bus bandwidth, but transfers significantly less.
I'm still considering this approach for user-defined native words, but I'm mostly responding to correct your statement about it being faster than "any bytecode language." This is most likely not correct. (Bytecode VMs and ITC are a trivial transform apart, so I take your statement as applying to both, as well as token-threaded code.)
It will, of course, be significantly faster than SPIN.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer
Parallax, Inc.
Post Edited (Paul Baker (Parallax)) : 11/10/2006 8:49:46 PM GMT
One question for others ... How about having the stack in the cog? There are pros and cons. If it's strictly a call stack, it could be reasonably limited in depth. A good place would be to run the stack downwards from the end of the cache area. It wouldn't be too hard to pack return addresses 2 per long word. Advantage is that there'd be one less thing to allocate in HUB RAM. Disadvantage is that it'd be harder to switch execution threads.
Post Edited (Mike Green) : 11/10/2006 8:24:06 PM GMT
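The two-per-long packing Mike mentions is simple shift-and-mask arithmetic, since hub addresses fit in 16 bits. A quick sketch (the low-word-first layout is an assumption):

```python
# Packing two 16-bit return addresses into one 32-bit cog long, as
# suggested for a cog-resident call stack. Layout (older address in
# the low word) is illustrative, not a fixed convention.

def pack(addr_lo, addr_hi):
    return (addr_hi << 16) | (addr_lo & 0xFFFF)

def unpack(long_):
    return long_ & 0xFFFF, long_ >> 16

packed = pack(0x1234, 0x5678)
print(hex(packed))                 # 0x56781234
print([hex(a) for a in unpack(packed)])  # ['0x1234', '0x5678']
```

This doubles the effective stack depth per cog long, at the cost of an extra shift/mask on every call and return.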
fcache          rdlong  $80,pc          wz
                add     fcache,dspIncr
                add     pc,#4
        if_nz   jmp     #fcache
                movd    fcache,#$80
                jmp     #$80
Here is a way to make it ~33% faster by adding 4 instructions:

fcache          rdlong  $80,pc          wz
                add     fcache,dspIncr2         ' dspIncr2 = 2 << 9
                add     pc,#4
fcache2 if_nz   rdlong  $81,pc          wz
                add     fcache2,dspIncr2        ' dspIncr2 = 2 << 9
                add     pc,#4
        if_nz   jmp     #fcache
                movd    fcache,#$80
                movd    fcache2,#$81
                jmp     #$80
I love doing stuff like this!
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
Post Edited (Chip Gracey (Parallax)) : 11/10/2006 8:46:06 PM GMT
I might suggest a jump table for the fcall, fret, etc. This keeps their cog addresses constant; and there's no speed penalty, since you can use an indirect jump to execute them.
Also, you don't really need special treatment for conditional branches, since the address of the jump will be the least-significant word in the long making up the next "instruction". This means that the 16 most-significant bits are zero — a nop! If the jump isn't taken, it'll just fall through the nop onto the next instruction. So, just using the Propeller's conditionals on the jmp to fjmp, say, will suffice.
-Phil
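Phil's fall-through trick can be modeled in a few lines. Here ints stand in for address longs whose upper 16 bits are zero (and therefore execute as nops), while strings stand in for real instructions; the encoding is purely illustrative:

```python
# A conditional jump to the fjmp primitive: if taken, the primitive
# consumes the address long that follows; if not taken, that long's
# upper 16 bits are zero, so the LMM loop executes it as a nop and
# falls through to the next real instruction.

def run(hub, pc, carry):
    trace = []
    while pc // 4 < len(hub):
        long_ = hub[pc // 4]
        pc += 4
        if long_ == "if_c_jmp_fjmp":
            if carry:
                pc = hub[pc // 4]      # primitive eats the address long
                trace.append(("jumped", pc))
                break
            # branch not taken: fall through to the address long below
        elif isinstance(long_, int):
            trace.append("nop")        # upper 16 bits zero: acts as nop
        else:
            trace.append(long_)
    return trace

hub = ["if_c_jmp_fjmp", 16, "work"]
print(run(hub, 0, carry=True))    # [('jumped', 16)]
print(run(hub, 0, carry=False))   # ['nop', 'work']
```

This is why, as Phil says, no dedicated FBRcc primitives are strictly needed: the Propeller's own condition prefixes on the jump to fjmp do the job.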
If you're trying to save shared RAM, putting the stack in the Cog works well. I've got a prototype (for my other compiler for a different language).
However, if you're doing it for speed, you may be disappointed; I was unable to get it faster (in the general case) than putting the stack in shared RAM and keeping TOS in a register. I don't have that lab notebook here or I'd post the math.
Perhaps someone cleverer than I can pull it off; I will gladly steal^Wuse their code.
Edit: I'm speaking here specifically of a data stack or mixed data/return stack (as in C), not a dedicated return stack (as in Forth). Putting a return stack in the Cog would be easier, but also less of an optimization (the data stack tends to be hotter by an order of magnitude in languages that separate them).
I keep experiencing this phenomenon where what I had accepted as a hard and fast limitation gets blown straight through by some unexpected finesse, and it's been making it difficult to get back onto the next chip. So many times I've thought, "Well, the next chip will be able to do that." And then we find a way to get the current chip to do it. Fun fun fun.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
Post Edited (Chip Gracey (Parallax)) : 11/10/2006 9:48:42 PM GMT
I've been bursting at the seams to let it out, but I was first trying to think of some way of directly monetizing it.
Last night I decided that indirect monetization (getting better known, eventually making a web site about it, people supporting the idea and helping me) is better in this case.
More tonight, when I am not at work - I will document the threading model I came up with; Chip, I sent you a PM outlining the basics of it :)
One little correction in the case that you find a zero value on the first fetch.
This way, pc is properly set to point after the zero value.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
I realize copyrights and patents are apples and oranges, but can the GNU General Public License be used to protect an idea from patent sharks? For instance, is there a way Bill could release this as GPL or public domain and thereby make it unpatentable? What about circuit schematics - can making the schematic public domain protect it from being patented by an unscrupulous company later?
Now, out-of-the-way publications are just as valid, but if they weren't applied on the front end, it requires overturning the patent on the back end (i.e. suing in a court of law). But that can easily run into the millions of dollars, so it's best to avoid the situation whenever possible.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer
Parallax, Inc.
Post Edited (Paul Baker (Parallax)) : 11/10/2006 11:06:27 PM GMT
Note that by using jmp instructions, you can either transfer control or use the non-immediate form of a jump, but it's clearer what you intend. The call stack will begin at $7F and move downwards. The cache will begin at $80 as planned and go to $FF. The support routines will fit between this jump table and the stack. Any thoughts?
I'm pushing for some basic conventions because I want to start using this with the Propeller OS I wrote and I'm getting ready to make a new release.
Mike
Post Edited (Mike Green) : 11/10/2006 11:04:26 PM GMT
Suggestions:
- I think the call stack should be in hub memory. The threading model I am playing with pretty much requires that (up to eight threads per cog)
- let's reserve the first sixteen longs as jump vectors; leaves room for my threading and messaging primitives
- let's try to keep the kernel below $80, including all vectors. I know, this is limiting, but I have more ideas...
Rough cog memory map:
$000-$00F vectors
$010-$07F primitives
$080-$0FF FCACHE area
$100-$17F FDCACHE area (more on this later)
$180-$1EF virtual peripheral area / thread space (more on this later)
$1F0-$1FF the reserved cog special registers
Please note each cache and virtual peripheral area works out to be 512 bytes long. This is not an accident. Think swapping in from SPI EEPROM, or swapping in/out to/from SPI ferroelectric RAM.
Basically what I am doing is treating a cog's memory as a program controlled L1 cache, with the HUB memory as the "main memory".
Later I plan to treat HUB memory as an L2 cache :)
(I'm having a late lunch so I could comment)
I'm trying to simplify some aspects of this, yet allow for complexity later when needed. Unless you're doing multi-threading, you may not need a HUB based stack and the vectors, basic primitives, and cache are all that would be needed (and would be strictly upward compatible with the multi-threaded version). I would still like to push ahead with a cog-based stack version, but make sure that only the call/ret/initialization routines know about that.
Mike
-Phil
Comments? Questions?
Post Edited (Mike Green) : 11/11/2006 1:40:51 AM GMT
A halt/pause could be "sub pc,#4".
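A quick sanity check of why that works: the fetch loop has already done "add pc,#4" before the fetched long executes, so "sub pc,#4" puts PC right back on itself and the loop refetches the same long forever:

```python
# "sub pc,#4" as an LMM halt: the loop's own "add pc,#4" and the
# executed "sub pc,#4" cancel out, so PC never advances.
pc = 96        # byte address of the "sub pc,#4" long in hub memory
pc += 4        # the fetch loop increments PC after the rdlong
pc -= 4        # the fetched instruction executes: sub pc,#4
print(pc)      # 96: the next fetch re-reads the same long
```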