ANNOUNCING: Large memory model for Propeller assembly language programs! (Parallax Forums, Propeller Chip forum)

Bill Henning | Posted 11/10/2006 12:55 AM (GMT -8)
Dear Friends,
I am afraid I've been sitting on this idea for some weeks; I came up with it shortly after getting my Propeller kit. However, since I have not had time to do much with it, and it does not look like I'll have time this year, I've decided to publish now.
Feel free to use these methods in your compilers, but please credit me if you do :) and don't try to patent it!
If you are going to use it in a commercial compiler, I take PayPal :)
While mulling over the memory limitations of cogs (in between bouts of bugging Chip asking for new features for the next propeller... Chip: hint hint) I thought of something very interesting:
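[code]
nxt             rdlong  instr,pc        ' fetch the next instruction long from hub memory
                add     pc,#4           ' advance the hub program counter
instr           nop                     ' placeholder! overwritten by the rdlong above
                jmp     #nxt            ' immediate jump back to the top of the loop
[/code]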
{ Now Chip, that's another reason I keep bugging you for auto incrementing pointers! }
The above code fragment can execute code stored in HUB memory at 32 cycles per instruction: the four instructions take 20 cycles, so the next rdlong misses its hub window and has to wait for the following one, 16 cycles later.
Of course I considered that overhead too high; however, by the simple expedient of unrolling the loop four times, we get it down to a far more appealing 20 cycles per instruction. Appealing? At 5x slower than local code? YES! It lets us execute LARGE programs!
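The 32 and 20 cycle figures can be checked with a rough timing model. The sketch below is only an approximation under stated assumptions: one hub window per cog every 16 clocks, 8 clocks for an rdlong that arrives exactly on its window, 4 clocks for every other instruction, and a fetched instruction that is itself an ordinary 4-clock cog instruction. The function names are made up for the illustration.

```python
HUB_PERIOD = 16  # assumed clocks between hub access windows for one cog
HUB_OP = 8       # assumed clocks for rdlong once aligned with its window
COG_OP = 4       # assumed clocks for an ordinary cog instruction


def next_window(t):
    """Round t up to the next multiple of HUB_PERIOD."""
    return -(-t // HUB_PERIOD) * HUB_PERIOD


def cycles_per_instruction(unroll, passes=100):
    """Steady-state clocks per hub instruction for an nxt loop unrolled `unroll` times."""
    t = 0
    first_sync = 0
    for p in range(passes + 1):
        for slot in range(unroll):
            w = next_window(t)          # rdlong stalls until the hub window
            if slot == 0 and p == 0:
                first_sync = w
            if slot == 0 and p == passes:
                return (w - first_sync) / (passes * unroll)
            t = w + HUB_OP              # rdlong instr,pc
            t += COG_OP                 # add pc,#4
            t += COG_OP                 # the fetched instruction itself
        t += COG_OP                     # jmp #nxt closing the unrolled loop


for n in (1, 2, 4, 8):
    print(n, cycles_per_instruction(n))
```

Under these assumptions the model gives 32, 24, 20 and 18 clocks per instruction for unroll factors of 1, 2, 4 and 8, so unrolling further than four times buys progressively less as the cost approaches the 16-clock hub window.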
Now that is not all.
Consider... the code executed from main memory can call routines in local memory. We have the main "Program Counter" in a local register.
We can have "FJMP", "FCALL" and even "FBRcc" instructions!
Granted, they will be slow compared to native COG code, but they will be MUCH faster than Spin or any byte code language. I have not written all the primitives, but all we need is an "SP" pointer held in each cog running in "large" model for saved hub return addresses, and a small number of primitive functions that can be called; later they can be "masked" by a macro assembler to look like native instructions.
The "instructions" I propose are:
FJMP addr ' calls routine that replaces PC with long at PC, then jumps to nxt
FCALL addr ' increments SP by 2, replaces PC with long at PC after it saves PC+4 at SP
FRET ' loads PC from word at SP, decrements SP by two
FBRcc addr ' works the same as FJMP but is conditional
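To make the control flow concrete, here is a toy Python model of the fetch loop and the FJMP / FCALL / FRET primitives as described above. It is only a sketch: hub memory is a dict keyed by byte address, native instructions are stand-in strings, the primitives are tags rather than real jumps into cog kernel routines, and `run`, `HALT` and the stack base 0x100 are invented for the demonstration.

```python
def run(hub, pc, sp, max_steps=1000):
    """Interpret `hub` (dict: byte address -> long) starting at byte address pc."""
    trace = []
    for _ in range(max_steps):
        instr = hub[pc]              # nxt: rdlong instr,pc
        pc += 4                      #      add pc,#4
        if instr == "FJMP":
            pc = hub[pc]             # replace PC with the long at PC
        elif instr == "FCALL":
            target = hub[pc]         # the long after FCALL holds the target
            sp += 2                  # increment SP by 2 (16-bit slots)
            hub[sp] = pc + 4         # save PC+4, the address after the operand
            pc = target
        elif instr == "FRET":
            pc = hub[sp]             # load PC from the word at SP
            sp -= 2                  # decrement SP by 2
        elif instr == "HALT":        # hypothetical, only to end the demo
            return trace
        else:
            trace.append(instr)      # stand-in for executing a native instruction
    raise RuntimeError("step limit reached")


program = {
    0: "FCALL", 4: 24,               # call the subroutine at address 24
    8: "op_a",
    12: "FJMP", 16: 32,              # jump over the unused long at 20
    20: "never_reached",
    24: "op_sub", 28: "FRET",
    32: "op_b", 36: "HALT",
}
trace = run(program, pc=0, sp=0x100)
print(trace)
```

The subroutine body runs first, then execution resumes after the FCALL operand, and the FJMP skips the long at address 20.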
There you go guys. This "Large Model" I came up with allows for creation of compilers for "conventional" languages meant for more conventional architectures.
And I have a way of addressing the performance penalty ... reducing it in most cases to less than a factor of two compared to native cog code!
Let me know what you think :)
I'd prefer that we standardize on registers used for PC and SP, as well as the entry points for the 'kernel' routines. I'd prefer to keep the kernel as small as possible in order to have as much cog memory free for use as buffers and for transient code as possible.
Oh what the heck, I'll let the other cat out of the bag!
There will be another primitive.
Call it "FCACHE"
When the "kernel" executes an "FCACHE" instruction, which in reality just calls a small primitive routine in the cog, it will copy all longs after the FCACHE to the cog's execution buffer (I will be using $080-$0FF as the "FCACHE" code area, I would VERY MUCH appreciate it if others adopted my conventions; that way languages compiling to my large model will be code compatible!) stopping only when it runs across a "NULL" long (0).
The code between FCACHE and NULL will be copied to the cache area, and the cog FCACHE primitive will jump to it after setting PC to the address of the hub long just past the NULL. When the cached code finishes, it is responsible for jumping back to nxt.
Cached code must NOT call (or FCALL) any hub code, as a matter of fact, it must obey the rules of normal cog assembly programs.
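The copy step of FCACHE can be sketched in Python as follows. This is a simplified model, not kernel code: hub memory is a dict of longs keyed by byte address, the cache is a dict keyed by cog register number, and the names `fcache`, `CACHE_BASE` and `CACHE_SIZE` are assumptions for the illustration. Unlike a real copy loop, the model does not store the terminating NULL in the cache. Note that a zero long always ends the block, so a block cannot contain a long whose value is 0.

```python
CACHE_BASE = 0x080   # first cog register of the cache area ($080-$0FF)
CACHE_SIZE = 0x080   # 128 longs


def fcache(hub, pc):
    """Copy longs starting at hub[pc] until a zero long; return (cache, new_pc)."""
    cache = {}
    dest = CACHE_BASE
    while True:
        value = hub[pc]
        pc += 4                      # PC always moves past the long just read
        if value == 0:               # NULL terminator ends the block
            break
        if dest >= CACHE_BASE + CACHE_SIZE:
            raise ValueError("cached block does not fit in $080-$0FF")
        cache[dest] = value
        dest += 1                    # cog registers are long-addressed
    return cache, pc


hub = {0x40: 0xAAAA, 0x44: 0xBBBB, 0x48: 0, 0x4C: 0xCCCC}
cache, new_pc = fcache(hub, 0x40)
print(cache, hex(new_pc))
```

After the call, PC points at the hub long just past the NULL, which is where hub execution resumes once the cached code jumps back to nxt.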
Yes, by loading more than the 128 words I suggest, this can be used as a "paging" mechanism for very large programs.
Yes, this also makes it possible to run multiple threads per cog - I have a "YIELD" primitive in mind that saves PC, SP and switches to another thread of execution (tasks for now must statically allocate non-overlapping registers.)
Ok, that's it.
No one better try to patent this as "their" IP - that's why I'm very publicly disclosing this :)

Bill Henning | Posted 11/10/2006 1:25 AM (GMT -8)
A bit more detail:
I propose the following branch instructions:
FBRC addr ' branch to far address if Carry flag is set
FBRNC addr ' branch to far address if Carry flag is clear
FBRZ addr ' branch to far address if Zero flag is set
FBRNZ addr ' branch to far address if Zero flag is NOT set
By the way, the same mechanism would also work for, say, external memory, except the inner interpreter loop would have to be changed.
I'd like all of us to get together and work out a standard everyone will conform to - sort of a "Propeller ABI"
A couple of limitations:
Code directly executed out of hub memory MAY NOT use any of the conditional branch instructions directly, it MUST use the FBRcc primitives (otherwise it would branch out of the hub interpreter loop!)
Code in FCACHE blocks must not use any of the Fxxxx primitives as mentioned in the earlier message
A couple of HUGE advantages:
Think of system calls in HUB memory, things like SPI_IN / SPI_OUT / SD_READ / SD_WRITE
All those system calls can include FCACHE blocks and run at full cog speed!
A neat trick:
In code executed out of HUB memory (BUT NOT INSIDE FCACHE/NULL blocks!)... consider the effect of
if_c add pc,#40 ' yep, a short conditional branch in HUB code without using a primitive!
So the FBRcc primitives are only needed if the branch target is more than +-128 words distant from the hub location where the add/sub op is executed. If we accept that limitation, there is no need for FBRcc primitives. Which keeps the kernel smaller.
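A small Python sketch of where such a short branch lands. It assumes the nxt loop shown earlier, where pc has already been advanced by 4 before the fetched instruction executes, and a 9-bit immediate (0 to 511 bytes) on add/sub, which limits the reach to about 127 longs in either direction. The function name and the signed-offset convention (negative meaning `sub pc,#n`) are made up for the illustration.

```python
def short_branch_target(instr_addr, offset):
    """Hub address reached by `add pc,#offset` (or `sub` for a negative
    offset) when that instruction was fetched from instr_addr."""
    if offset % 4 != 0:
        raise ValueError("offset must be a multiple of 4 to stay long-aligned")
    if abs(offset) > 0x1FF:
        raise ValueError("immediate must fit in 9 bits")
    # pc already points at the long after the branch when it executes
    return instr_addr + 4 + offset


print(hex(short_branch_target(0x100, 40)))   # forward: skips 10 longs after the next one
print(hex(short_branch_target(0x100, -8)))   # backward: the long just before the branch
```

Note that an offset of 0 simply falls through to the next instruction, and an offset of -4 re-executes the branch itself.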
Oh, I also want to reserve FSVC for a system service call routine.
I'd also like to reserve 128 longs in the first 512 longs of hub memory, I have some excellent ideas for them, but I am too tired to spill any more beans tonite.
I will be setting up a blog for this project soon.
Good Night,
Bill
Post Edited (Bill Henning) : 11/10/2006 10:06:53 AM GMT

Graham Stabler | Posted 11/10/2006 2:12 AM (GMT -8)
As any patent would be filed after this info was posted on a public forum they would be pretty screwed anyway.
This looks rather cool.
Graham

nutson | Posted 11/10/2006 7:17 AM (GMT -8)
Great idea, Bill. This not only opens up the possibility of having a C compiler for the prop one day, but also interesting options such as multiple props executing the same program with different data sets, feasible as each prop has a dedicated register set. Shared RAM bandwidth is going to be the limiting factor in the prop's performance, as in many multiprocessor systems. By making shared RAM access more intelligent, assigning time slots only to props demanding access, the prop's performance could be increased even more. The rotating combustion engine piston type of explanation for shared resource access would not apply then, what a pity, but 8 digit MIPS figures would certainly soften the pain.
Nico Hattink

Mike Green | Posted 11/10/2006 7:28 AM (GMT -8)
Bill,
One of the issues here is that the "entry points" for the basic "instructions" are going to have to be hard fixed or go through a jump table that's fixed or use a "linking process" otherwise maintenance and upgrades are going to be a nightmare. In particular, if I were to incorporate this into the Propeller OS (which sounds like a great idea), the loader/I2C cog would use this as its basic loop and some of the basic "instructions" would read/write EEPROM or load and execute a SPIN program. If I were to make corrections or changes to the loader, some of the "entry points" could change unless there was a well defined convention. I was planning to add primitives to read/write between EEPROM and the loader's cog memory. This now gives a much better general framework than the simple overlay loader I had envisioned. Doing overlays from EEPROM is way slower than from HUB memory, but might be very useful for some applications and would be completely independent from HUB memory, needing only its own 2 I/O pin I2C bus.
Mike

Mike Green | Posted 11/10/2006 7:58 AM (GMT -8)
Bill,
Another piece: Unless the SPIN compiler is modified to make some low memory available, it will be difficult at best to integrate some of your ideas with the existing Propeller Tool. It would be a shame to not be able to use the existing SPIN interpreter.
It would be easy to modify the OS's loader to load and execute a modified SPIN image that skips over a block of low HUB memory. The space in the EEPROM could be used for other things or used to initialize the 128 long area. The Propeller Tool would have to have a directive added (like _xxxx = ???) that would specify the size of the area to be skipped (from $10 to $10+???-1). This would be compatible with the existing boot loader and all existing code.
Another Propeller Tool directive that would be useful would be an "ORG"-like statement that would specify the location in the binary image to use for the following assembly/data information. This could be used to initialize fixed areas like the 128 long area.
Mike

Cliff L. Biffle | Posted 11/10/2006 8:49 AM (GMT -8)
Bill,
This is almost exactly what I've already tried for the Forth kernel (the paging approach I mentioned) -- I used it to implement an experimental DTC interpreter and some user-native-code support.
On the Propeller, with stack in shared RAM, it is no faster than an ITC interpreter, and in many cases is slower. The speed of the ITC is also bound by memory bus bandwidth, but transfers significantly less.
I'm still considering this approach for user-defined native words, but I'm mostly responding to correct your statement about it being faster than "any bytecode language." This is most likely not correct. (Bytecode VMs and ITC are a trivial transform apart, so I take your statement as applying to both, as well as token-threaded code.)
It will, of course, be significantly faster than SPIN.

Paul Baker | Posted 11/10/2006 11:28 AM (GMT -8)
At the risk of veering off into YABPC (yet another bashing patents conversation), an examiner wouldn't know to look in Company X's online forums for prior art (I know first-hand forums are not part of their search strategy). Assuming it was filed today, it would be ~3 years before the examiner saw it, by which time this thread would be 3 years old, and then there are ~2 more years before it became a patent, at which point a lawyer would be looking for material buried in a forum that is 5 years old, even assuming a party of interest has hired a lawyer. SIRs are the only avenue in which a person can reasonably expect an examiner will see the information, and that costs money. IEEE and ACM (I miss my access to their databases of articles) also have a better than average expectation of being noticed by an examiner, but you have to convince the reviewers it is worthwhile information for them to publish.
Graham Stabler said... As any patent would be filed after this info was posted on a public forum they would be pretty screwed anyway.
This looks rather cool.
Graham
Paul Baker
Propeller Applications Engineer
Parallax, Inc.
Post Edited (Paul Baker (Parallax)) : 11/10/2006 8:49:46 PM GMT

Mike Green | Posted 11/10/2006 12:12 PM (GMT -8)
How about some suggestions for rearranging and maybe eliminating an instruction or two?
entry           rdlong  pc,PAR
                mov     stkPtr,pc
                shr     stkPtr,#16
                jmp     #nxt

fjmp            rdlong  pc,pc
                jmp     #nxt

fcache          movd    :copyIt,#$80
                nop
:copyIt         rdlong  0-0,pc wz
                add     :copyIt,dspIncr
                add     pc,#4
        if_nz   jmp     #:copyIt
                jmp     #$80

fret            sub     stkPtr,#2
                rdword  pc,stkPtr
                jmp     #nxt

fcall           rdlong  nxtPc,pc
                add     pc,#4
                wrword  pc,stkPtr
                add     stkPtr,#2
                mov     pc,nxtPc

nxt             rdlong  :inst1,pc
                add     pc,#4
:inst1          nop
                rdlong  :inst2,pc
                add     pc,#4
:inst2          nop
                rdlong  :inst3,pc
                add     pc,#4
:inst3          nop
                rdlong  :inst4,pc
                add     pc,#4
:inst4          nop
                jmp     #nxt

dspIncr         long    1 << 9
pc              long    0
stkPtr          long    0
nxtPc           long    0
One question for others ... How about having the stack in the cog? There are pros and cons. If it's strictly a call stack, it could be reasonably limited in depth. A good place would be to run the stack downwards from the end of the cache area. It wouldn't be too hard to pack return addresses 2 per long word. Advantage is that there'd be one less thing to allocate in HUB RAM. Disadvantage is that it'd be harder to switch execution threads.
Post Edited (Mike Green) : 11/10/2006 8:24:06 PM GMT

Chip Gracey (Parallax) | Posted 11/10/2006 12:39 PM (GMT -8)
I found something... The nop could be gotten rid of in fcache by post-fixing the destination address:
fcache          rdlong  $80,pc wz
                add     fcache,dspIncr
                add     pc,#4
        if_nz   jmp     #fcache
                movd    fcache,#$80
                jmp     #$80
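The fcache loops step through the cache area by adding dspIncr (1 << 9) to the rdlong instruction itself, which bumps its destination field. A minimal Python sketch of that bit manipulation, assuming the Propeller 1 layout in which the 9-bit destination field occupies bits 9 to 17 of the instruction long. The helper names and the starting bit pattern are arbitrary and only for illustration.

```python
DEST_SHIFT = 9
DEST_MASK = 0x1FF << DEST_SHIFT      # destination field, bits 9..17
DSP_INCR = 1 << DEST_SHIFT           # same value as `dspIncr long 1 << 9`


def movd(instr, reg):
    """Model of movd: replace only the destination field of instr."""
    return (instr & ~DEST_MASK & 0xFFFFFFFF) | ((reg & 0x1FF) << DEST_SHIFT)


def dest(instr):
    """Extract the destination register number from an instruction long."""
    return (instr >> DEST_SHIFT) & 0x1FF


# Arbitrary placeholder bits outside the destination field (not a real opcode).
instr = movd(0xFC0001FF, 0x80)
stepped = (instr + DSP_INCR) & 0xFFFFFFFF    # what `add fcache,dspIncr` does

print(hex(dest(instr)), hex(dest(stepped)))
```

The add leaves the other fields untouched as long as the destination does not wrap past $1FF. One more increment from $1FF would carry into the bits above the destination field, so the copy has to stop before that point.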
Here is a way to make it ~33% faster by adding 4 instructions:
fcache          rdlong  $80,pc wz
                add     fcache,dspIncr2         ' dspIncr2 = 2 << 9
                add     pc,#4
fcache2 if_nz   rdlong  $81,pc wz
                add     fcache2,dspIncr2        ' dspIncr2 = 2 << 9
                add     pc,#4
        if_nz   jmp     #fcache
                movd    fcache,#$80
                movd    fcache2,#$81
                jmp     #$80
I love doing stuff like this!
Mike Green said... How about some suggestions for rearranging and maybe eliminating an instruction or two?
entry           rdlong  pc,PAR
                mov     stkPtr,pc
                shr     stkPtr,#16
                jmp     #nxt

fjmp            rdlong  pc,pc
                jmp     #nxt

fcache          movd    :copyIt,#$80
                nop
:copyIt         rdlong  0-0,pc wz
                add     :copyIt,dspIncr
                add     pc,#4
        if_nz   jmp     #:copyIt
                jmp     #$80

fret            sub     stkPtr,#2
                rdword  pc,stkPtr
                jmp     #nxt

fcall           rdlong  nxtPc,pc
                add     pc,#4
                wrword  pc,stkPtr
                add     stkPtr,#2
                mov     pc,nxtPc

nxt             rdlong  :inst1,pc               {THIS IS NEAT!!!}
                add     pc,#4
:inst1          nop
                rdlong  :inst2,pc
                add     pc,#4
:inst2          nop
                rdlong  :inst3,pc
                add     pc,#4
:inst3          nop
                rdlong  :inst4,pc
                add     pc,#4
:inst4          nop
                jmp     #nxt

dspIncr         long    1 << 9
pc              long    0
stkPtr          long    0
Chip Gracey
Parallax, Inc.
Post Edited (Chip Gracey (Parallax)) : 11/10/2006 8:46:06 PM GMT

Phil Pilgrim (PhiPi) | Posted 11/10/2006 12:42 PM (GMT -8)
This is all pretty exciting stuff! It turns out that autoincrementing wouldn't help in the nxt loop, if it were available, since the extra instruction is necessary for pipelining considerations, anyway.
I might suggest a jump table for the fcall, fret, etc. This keeps their cog addresses constant; and there's no speed penalty, since you can use an indirect jump to execute them.
Also, you don't really need special treatment for conditional branches, since the address of the jump will be the least-significant word in the long making up the next "instruction". This means that the 16 most-significant bits are zero — a nop! If the jump isn't taken, it'll just fall through the nop onto the next instruction. So, just using the Propeller's conditionals on the jmp to fjmp, say, will suffice.
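Phil's observation can be checked against the instruction encoding. The sketch below assumes the Propeller 1 layout, where the 4-bit condition field sits in bits 18 to 21 and a condition of %0000 means "never execute". Any long holding a 16-bit hub address therefore has a zero condition field and behaves as a nop if the interpreter falls through onto it. The helper names are invented for the illustration.

```python
COND_SHIFT = 18      # assumed position of the 4-bit condition field
COND_MASK = 0xF


def condition(long_value):
    """Extract the condition field from a 32-bit instruction long."""
    return (long_value >> COND_SHIFT) & COND_MASK


def is_never_executed(long_value):
    """True when the condition field is %0000, so the long acts as a nop."""
    return condition(long_value) == 0


# Every long-aligned address that fits in 16 bits is a harmless nop.
all_nops = all(is_never_executed(addr) for addr in range(0, 0x10000, 4))
print(all_nops)
```

The same holds for any value below 1 << 18, so a far address stored in the long after a conditional jump to fjmp is simply skipped when the jump is not taken.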
-Phil

Cliff L. Biffle | Posted 11/10/2006 12:48 PM (GMT -8)
Mike,
If you're trying to save shared RAM, putting the stack in the Cog works well. I've got a prototype (for my other compiler for a different language).
However, if you're doing it for speed, you may be disappointed; I was unable to get it faster (in the general case) than putting the stack in shared RAM and keeping TOS in a register. I don't have that lab notebook here or I'd post the math.
Perhaps someone cleverer than I can pull it off; I will gladly steal^Wuse their code. 
Edit: I'm speaking here specifically of a data stack or mixed data/return stack (as in C), not a dedicated return stack (as in Forth). Putting a return stack in the Cog would be easier, but also less of an optimization (the data stack tends to be hotter by an order of magnitude in languages that separate them).

Bill Henning | Posted 11/10/2006 1:29 PM (GMT -8)
I am glad you like it :)
I've been bursting at the seams to let it out, but I was first trying to think of some way of directly monetizing it.
Last night I decided that indirect monetization (getting better known, eventually making a web site about it, people supporting the idea and helping me) is better in this case.
More tonight, when I am not at work - I will document the threading model I came up with; Chip I sent you a PM outlining the basics of it :)

Ym2413a | Posted 11/10/2006 1:31 PM (GMT -8)

Mike Green | Posted 11/10/2006 2:29 PM (GMT -8)
Chip, One little correction in the case that you find a zero value on the first fetch. This way, pc is properly set to point after the zero value.
fcache          rdlong  $80,pc wz
                add     fcache,dspIncr2         ' dspIncr2 = 2 << 9
        if_nz   add     pc,#4
fcache2 if_nz   rdlong  $81,pc wz
                add     fcache2,dspIncr2        ' dspIncr2 = 2 << 9
                add     pc,#4
        if_nz   jmp     #fcache
                movd    fcache,#$80
                movd    fcache2,#$81
                jmp     #$80

Chip Gracey (Parallax) | Posted 11/10/2006 2:38 PM (GMT -8)
Oh, I didn't think about that. That would have caused a problem. Good thinking.
Mike Green said...Chip, One little correction in the case that you find a zero value on the first fetch. This way, pc is properly set to point after the zero value.
fcache          rdlong  $80,pc wz
                add     fcache,dspIncr2         ' dspIncr2 = 2 << 9
        if_nz   add     pc,#4
fcache2 if_nz   rdlong  $81,pc wz
                add     fcache2,dspIncr2        ' dspIncr2 = 2 << 9
                add     pc,#4
        if_nz   jmp     #fcache
                movd    fcache,#$80
                movd    fcache2,#$81
                jmp     #$80
Chip Gracey
Parallax, Inc.

Dennis Ferron | Posted 11/10/2006 2:48 PM (GMT -8)
Begin YABPC:
I realize copyrights and patents are apples and oranges, but can the GNU General Public License be used to protect an idea from patent sharks? For instance, is there a way Bill could release this as GPL or public domain and thereby make it unpatentable? What about circuit schematics - can making the schematic public domain protect it from being patented by an unscrupulous company later?
end YABPC:

Paul Baker | Posted 11/10/2006 2:59 PM (GMT -8)
Well, the whole software thing has been thoroughly loused up by the interpretations of the statutes by the legal system. It wasn't until recently that anything of a software nature could be patented. The loophole that has since been expanded to a crater the size of Texas is that software coupled to the act of executing it on hardware is now considered patentable. But yes, anything that is publicly disclosed is considered prior art; the crux of this is that it's only as good as how well publicized it is. IOW if it's not commonly known and available, there is a more than likely chance the examiner won't know about it and won't apply it. But anything of public knowledge is fair game; I had on more than one occasion used actual sections of code of the Linux operating system to reject an application. But this was only because I or a senior examiner I consulted with knew that Linux did the same thing.
Now out of the way publications are just as valid, but if they weren't applied on the front end, it requires overturning the patent on the back end (i.e. suing in a court of law). But that can easily run into the millions of dollars, so it's best to avoid the situation whenever possible.
Paul Baker
Propeller Applications Engineer
Parallax, Inc.
Post Edited (Paul Baker (Parallax)) : 11/10/2006 11:06:27 PM GMT

Mike Green | Posted 11/10/2006 3:31 PM (GMT -8)
Bill, I'm trying to simplify some aspects of this, yet allow for complexity later when needed. Unless you're doing multi-threading, you may not need a HUB based stack and the vectors, basic primitives, and cache are all that would be needed (and would be strictly upward compatible with the multi-threaded version). I would still like to push ahead with a cog-based stack version, but make sure that only the call/ret/initialization routines know about that.
Mike