ANNOUNCING: Large memory model for Propeller assembly language programs!



Bill Henning
11-10-2006, 03:55 PM
Dear Friends,

I am afraid I've been sitting on this idea for some weeks; I came up with it shortly after getting my Propeller kit. However, since I have not had time to do much with it, and it does not look like I'll have time this year, I've decided to publish now.

Feel free to use these methods in your compilers, but please credit me if you do :) and don't try to patent it!

If you are going to use it in a commercial compiler, I take PayPal :)

While mulling over the memory limitations of cogs (in between bouts of bugging Chip asking for new features for the next propeller... Chip: hint hint) I thought of something very interesting:



nxt     rdlong  instr,pc
        add     pc,#4
instr   nop                     ' placeholder!
        jmp     #nxt


{ Now Chip, that's another reason I keep bugging you for auto incrementing pointers! }

The above code fragment can execute code stored in HUB memory at 32 cycles per instruction.

Of course I considered that overhead too high; however, by the simple expedient of unrolling the loop four times, we get it down to a far more appealing 20 cycles per instruction. Appealing? At 5x slower than local code? YES! It lets us execute LARGE programs!
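The 32 and 20 cycle figures can be checked by counting clocks. The listing below is only a sketch of the four-way unrolled loop with timing comments. It assumes 4 clocks per ordinary instruction, roughly 8 clocks for a RDLONG that arrives exactly on the cog's hub window, and one hub window every 16 clocks. The labels i1..i4 are illustrative names, not part of any agreed kernel.

```pasm
DAT
        org     0
nxt     rdlong  i1,pc           ' t =  0.. 8  lands on the hub window
        add     pc,#4           ' t =  8..12
i1      nop                     ' t = 12..16  fetched instruction executes here
        rdlong  i2,pc           ' t = 16..24  next window, no waiting
        add     pc,#4           ' t = 24..28
i2      nop                     ' t = 28..32
        rdlong  i3,pc           ' t = 32..40
        add     pc,#4           ' t = 40..44
i3      nop                     ' t = 44..48
        rdlong  i4,pc           ' t = 48..56
        add     pc,#4           ' t = 56..60
i4      nop                     ' t = 60..64
        jmp     #nxt            ' t = 64..68  misses the window at t = 64
                                ' the next RDLONG waits until t = 80,
                                ' so 4 instructions cost 80 clocks = 20 each
pc      long    0               ' hub address of the next instruction
```

Under the same assumptions, the single-instruction loop spends 8 + 4 + 4 + 4 = 20 clocks per pass, misses the window at 16 and has to wait for the one at 32, which is where the 32 cycle figure comes from.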

Now that is not all.

Consider... the code executed from main memory can call routines in local memory. We have the main "Program Counter" in a local register.

We can have "FJMP", "FCALL" and even "FBRcc" instructions!

Granted, they will be slow compared to native COG code, but they will be MUCH faster than Spin or any byte code language. I have not written all the primitives, but all we need is an "SP" pointer held in each cog running in "large" model for saved hub return addresses, and a small number of primitive functions that can be called; later they can be "masked" by a macro assembler to look like native instructions.

The "instructions" I propose are:

FJMP addr   ' calls routine that replaces PC with long at PC, then jumps to nxt
FCALL addr  ' increments SP by 2, replaces PC with long at PC after it saves PC+4 at SP
FRET        ' loads PC from word at SP, decrements SP by two
FBRcc addr  ' works the same as FJMP but is conditional

There you go guys. This "Large Model" I came up with allows for creation of compilers for "conventional" languages meant for more conventional architectures.

And I have a way of addressing the performance penalty ... reducing it in most cases to less than a factor of two compared to native cog code!

Let me know what you think :)

I'd prefer that we standardize on registers used for PC and SP, as well as the entry points for the 'kernel' routines. I'd prefer to keep the kernel as small as possible in order to have as much cog memory free for use as buffers and for transient code as possible.

Oh what the heck, I'll let the other cat out of the bag!

There will be another primitive.

Call it "FCACHE"

When the "kernel" executes an "FCACHE" instruction, which in reality just calls a small primitive routine in the cog, it will copy all longs after the FCACHE to the cog's execution buffer (I will be using $080-$0FF as the "FCACHE" code area, I would VERY MUCH appreciate it if others adopted my conventions; that way languages compiling to my large model will be code compatible!) stopping only when it runs across a "NULL" long (0).

The code between FCACHE and NULL will be copied to the cache area, and the cog FCACHE primitive will jump to it after setting PC to the address of the hub word just past the NULL. When it exits, the code is responsible to jump to nxt.

Cached code must NOT call (or FCALL) any hub code, as a matter of fact, it must obey the rules of normal cog assembly programs.
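As a rough sketch of what an FCACHE block might look like in hub memory: the fragment below assumes the kernel exposes fcache and nxt as cog labels, and that data, bits and pinMask are cog registers set up elsewhere. All of those names are made up for illustration.

```pasm
' hub-resident code, reached through the nxt loop
        jmp     #fcache         ' kernel primitive: copy what follows to $080 and run it
        org     $080            ' the block is assembled for its cog address
shift8  shl     data,#1 wc      ' ordinary cog code, runs at native speed
        muxc    outa,pinMask    ' drive the pin(s) in pinMask from the carry flag
        djnz    bits,#shift8    ' plain cog branches are fine inside the block
        jmp     #nxt            ' the block itself must return to the kernel
        long    0               ' NULL long ends the copy
        ' pc points here afterwards, so interpreted hub code continues at this long
```

One consequence worth noting: a plain NOP assembles to an all-zero long, so it would end the copy early and cannot appear inside a block.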

Yes, by loading more than the 128 words I suggest, this can be used as a "paging" mechanism for very large programs.

Yes, this also makes it possible to run multiple threads per cog - I have a "YIELD" primitive in mind that saves PC, SP and switches to another thread of execution (tasks for now must statically allocate non-overlapping registers.)

OK, that's it.

No one better try to patent this as "their" IP - that's why I'm very publicly disclosing this :)

Bill Henning
11-10-2006, 04:25 PM
A bit more detail:

I propose the following branch instructions:

FBRC addr ' branch to far address if Carry flag is set
FBRNC addr ' branch to far address if Carry flag is clear
FBRZ addr ' branch to far address if Zero flag is set
FBRNZ addr ' branch to far address if Zero flag is NOT set

By the way, the same mechanism would also work with, say, external memory, except the inner interpreter loop would have to be changed.

I'd like all of us to get together and work out a standard everyone will conform to - sort of a "Propeller ABI"

A couple of limitations:

Code directly executed out of hub memory MAY NOT use any of the conditional branch instructions directly, it MUST use the FBRcc primitives (otherwise it would branch out of the hub interpreter loop!)

Code in FCACHE blocks must not use any of the Fxxxx primitives as mentioned in the earlier message

A couple of HUGE advantages:

Think of system calls in HUB memory, things like SPI_IN / SPI_OUT / SD_READ / SD_WRITE

All those system calls can include FCACHE blocks and run at full cog speed!

A neat trick:

In code executed out of HUB memory (BUT NOT INSIDE FCACHE/NULL blocks!)... consider the effect of

        if_c    add     pc,#40          ' yep, a short conditional branch in HUB code without using a primitive!

So the FBRcc primitives are only needed if the branch target is more than +-128 words distant from the hub location where the add/sub op is executed. If we accept that limitation, there is no need for FBRcc primitives. Which keeps the kernel smaller.
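A small sketch of a backward branch built the same way, assuming total and count are cog registers (hypothetical names) and that pc has already been advanced past the current long by the time that long executes, as in the nxt loop above:

```pasm
' hub-resident code, fetched and executed one long at a time
        mov     total,#0
        mov     count,#10
        add     total,count     ' loop top, at some hub address A
        sub     count,#1 wz     ' A+4: Z is set when count reaches zero
  if_nz sub     pc,#12          ' A+8: pc already holds A+12, so this goes back to A
        ' A+12: execution continues here once count is zero
```

Because the immediate field is 9 bits wide, the largest offset is 511 bytes, which is just under 128 longs in either direction.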

Oh, I also want to reserve FSVC for a system service call routine.

I'd also like to reserve 128 longs in the first 512 longs of hub memory, I have some excellent ideas for them, but I am too tired to spill any more beans tonite.

I will be setting up a blog for this project soon.

Good Night,

Bill


Post Edited (Bill Henning) : 11/10/2006 10:06:53 AM GMT

cgracey
11-10-2006, 04:48 PM
Bill,

This is a great idea! I love the fetch/execute and FCACHE, but YIELD could be a real sleeper. If you could get multiple ASM threads running on one cog, that would be great -- especially if people could crunch multiple mid-bandwidth processes like serial comms. I hear you (and everyone else) on the auto-incrementing address register(s). This large memory model is going to be exciting.

BTW, I hope there are no such people on the forum that would be so rotten as to patent things they've seen here. Well, the system is broken, so the joke might be on them, after all. "He that diggeth a pit shall fall into it" - And hopefully sooner than later.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


Chip Gracey
Parallax, Inc.

Bill Henning
11-10-2006, 04:55 PM
Chip,

Thanks!

I am pretty sure I can get YIELD working

That's one of the reasons I want 128 of the first 512 longs. I'd really like 256 of them, but some might see that as greedy. I'm aiming for four threads per cog, and eight is not impossible if people respect my 128 word limit for FCACHE blocks.

Of course this also means that we can do a native code large model SPIN compiler that would be 10x faster than the current spin, and normally close to pure assembly language program speeds. Ditto for a C compiler :)

Best,

Bill

p.s.

Multiple mid-speed threads is why I came up with YIELD :)

Graham Stabler
11-10-2006, 05:12 PM
As any patent would be filed after this info was posted on a public forum they would be pretty screwed anyway.

This looks rather cool.

Graham

nutson
11-10-2006, 10:17 PM
Great idea, Bill. This not only opens up the possibility of having a C compiler for the prop one day, but also interesting options such as multiple props executing the same program with different data sets, feasible as each prop has a dedicated register set. Shared RAM bandwidth is going to be the limiting factor in the prop's performance, as in many multi processor systems. By making shared RAM access more intelligent, assigning time slots only to props demanding access, the prop's performance could be increased even more. The rotating combustion engine piston type of explanation for shared resource access would not apply then, what a pity, but 8 digit MIPS figures would certainly soften the pain.

Nico Hattink

Mike Green
11-10-2006, 10:28 PM
Bill,
One of the issues here is that the "entry points" for the basic "instructions" are going to have to be hard fixed or go through a jump table that's fixed or use a "linking process" otherwise maintenance and upgrades are going to be a nightmare. In particular, if I were to incorporate this into the Propeller OS (which sounds like a great idea), the loader/I2C cog would use this as its basic loop and some of the basic "instructions" would read/write EEPROM or load and execute a SPIN program. If I were to make corrections or changes to the loader, some of the "entry points" could change unless there was a well defined convention. I was planning to add primitives to read/write between EEPROM and the loader's cog memory. This now gives a much better general framework than the simple overlay loader I had envisioned. Doing overlays from EEPROM is way slower than from HUB memory, but might be very useful for some applications and would be completely independent from HUB memory, needing only its own 2 I/O pin I2C bus.
Mike

Mike Green
11-10-2006, 10:58 PM
Bill,
Another piece: Unless the SPIN compiler is modified to make some low memory available, it will be difficult at best to integrate some of your ideas with the existing Propeller Tool. It would be a shame to not be able to use the existing SPIN interpreter.

It would be easy to modify the OS's loader to load and execute a modified SPIN image that skips over a block of low HUB memory. The space in the EEPROM could be used for other things or used to initialize the 128 long area. The Propeller Tool would have to have a directive added (like _xxxx = ???) that would specify the size of the area to be skipped (from $10 to $10+???-1). This would be compatible with the existing boot loader and all existing code.

Another Propeller Tool directive that would be useful would be an "ORG"-like statement that would specify the location in the binary image to use for the following assembly/data information. This could be used to initialize fixed areas like the 128 long area.
Mike

helloseth
11-10-2006, 11:37 PM
This reminds me a bit of a feature in another MCU I am following. http://www.intellasys.net/products/index.php

Their Seaforth24 product seems very similar to the Prop, but bigger and more complex.


They have/will have multi-core (24) MCUs, which process Forth directly. But one of their cool features is that one core can execute code 'read' directly from another core. Basically the code is passed one 'word' at a time from the other core. (They had a few white papers on their Resources page explaining this, but that page is empty as I write this.)

Seth

Cliff L. Biffle
11-10-2006, 11:49 PM
Bill,

This is almost exactly what I've already tried for the Forth kernel (the paging approach I mentioned) -- I used it to implement an experimental DTC interpreter and some user-native-code support.

On the Propeller, with stack in shared RAM, it is no faster than an ITC interpreter, and in many cases is slower. The speed of the ITC is also bound by memory bus bandwidth, but transfers significantly less.

I'm still considering this approach for user-defined native words, but I'm mostly responding to correct your statement about it being faster than "any bytecode language." This is most likely not correct. (Bytecode VMs and ITC are a trivial transform apart, so I take your statement as applying to both, as well as token-threaded code.)

It will, of course, be significantly faster than SPIN. :)

Paul Baker
11-11-2006, 02:28 AM
At the risk of veering off into YABPC (yet another bashing patents conversation), an examiner wouldn't know to look in Company X's online forums for prior art (I know first-hand forums are not part of their search strategy). Assuming it was filed today it would be ~3 years before the examiner saw it; by then this thread would be 3 years old, and then there are ~2 more years before it became a patent, at which point a lawyer is now looking for material buried in a forum that is 5 years old, even assuming a party of interest has hired a lawyer. SIRs are the only avenue by which a person can reasonably expect an examiner will see the information, and that costs money. IEEE and ACM (I miss my access to their databases of articles) also have a better than average expectation of being noticed by an examiner, but you have to convince the reviewers it is worthwhile information for them to publish.



▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker (mailto:pbaker@parallax.com)
Propeller Applications Engineer
Parallax, Inc. (http://www.parallax.com)

Post Edited (Paul Baker (Parallax)) : 11/10/2006 8:49:46 PM GMT

Mike Green
11-11-2006, 03:12 AM
How about some suggestions for rearranging and maybe eliminating an instruction or two?



entry     rdlong  pc,PAR
          mov     stkPtr,pc
          shr     stkPtr,#16
          jmp     #nxt

fjmp      rdlong  pc,pc
          jmp     #nxt

fcache    movd    :copyIt,#$80
          nop
:copyIt   rdlong  0-0,pc wz
          add     :copyIt,dspIncr
          add     pc,#4
  if_nz   jmp     #:copyIt
          jmp     #$80

fret      sub     stkPtr,#2
          rdword  pc,stkPtr
          jmp     #nxt

fcall     rdlong  nxtPc,pc
          add     pc,#4
          wrword  pc,stkPtr
          add     stkPtr,#2
          mov     pc,nxtPc

nxt       rdlong  :inst1,pc
          add     pc,#4
:inst1    nop
          rdlong  :inst2,pc
          add     pc,#4
:inst2    nop
          rdlong  :inst3,pc
          add     pc,#4
:inst3    nop
          rdlong  :inst4,pc
          add     pc,#4
:inst4    nop
          jmp     #nxt

dspIncr   long    1 << 9
pc        long    0
stkPtr    long    0
nxtPc     long    0



One question for others ... How about having the stack in the cog? There are pros and cons. If it's strictly a call stack, it could be reasonably limited in depth. A good place would be to run the stack downwards from the end of the cache area. It wouldn't be too hard to pack return addresses 2 per long word. Advantage is that there'd be one less thing to allocate in HUB RAM. Disadvantage is that it'd be harder to switch execution threads.

Post Edited (Mike Green) : 11/10/2006 8:24:06 PM GMT

cgracey
11-11-2006, 03:39 AM
I found something... The nop could be gotten rid of in fcache by post-fixing the destination address:
fcache    rdlong  $80,pc wz
          add     fcache,dspIncr
          add     pc,#4
  if_nz   jmp     #fcache
          movd    fcache,#$80
          jmp     #$80

Here is a way to make it ~33% faster by adding 4 instructions:

fcache    rdlong  $80,pc wz
          add     fcache,dspIncr2         ' dspIncr2 = 2 << 9
          add     pc,#4
fcache2
  if_nz   rdlong  $81,pc wz
          add     fcache2,dspIncr2        ' dspIncr2 = 2 << 9
          add     pc,#4
  if_nz   jmp     #fcache
          movd    fcache,#$80
          movd    fcache2,#$81
          jmp     #$80

I love doing stuff like this!


Mike Green said...
How about some suggestions for rearranging and maybe eliminating an instruction or two?

[...]

nxt       rdlong  :inst1,pc       {THIS IS NEAT!!!}

[...]





Post Edited (Chip Gracey (Parallax)) : 11/10/2006 8:46:06 PM GMT

Phil Pilgrim (PhiPi)
11-11-2006, 03:42 AM
This is all pretty exciting stuff! It turns out that autoincrementing wouldn't help in the nxt loop, if it were available, since the extra instruction is necessary for pipelining considerations, anyway.

I might suggest a jump table for the fcall, fret, etc. This keeps their cog addresses constant; and there's no speed penalty, since you can use an indirect jump to execute them.

Also, you don't really need special treatment for conditional branches, since the address of the jump will be the least-significant word in the long making up the next "instruction". This means that the 16 most-significant bits are zero — a nop! If the jump isn't taken, it'll just fall through the nop onto the next instruction. So, just using the Propeller's conditionals on the jmp to fjmp, say, will suffice.
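A sketch of that conditional far jump as it might appear in hub code, assuming the kernel's fjmp entry is visible as a cog label and that a and b are cog registers (hypothetical names; the target address is a made-up value):

```pasm
        cmp     a,b wz          ' set Z if a = b
  if_z  jmp     #fjmp           ' taken: fjmp loads pc from the long below
        long    $0000_1230      ' hub address of the target
        add     a,#1            ' reached when the jump is not taken
```

When the jump is not taken, the nxt loop fetches the address long as if it were an instruction. Its condition field (bits 21..18) is %0000, so it never executes and only costs one fetch slot.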

-Phil

Cliff L. Biffle
11-11-2006, 03:48 AM
Mike,

If you're trying to save shared RAM, putting the stack in the Cog works well. I've got a prototype (for my other compiler for a different language).

However, if you're doing it for speed, you may be disappointed; I was unable to get it faster (in the general case) than putting the stack in shared RAM and keeping TOS in a register. I don't have that lab notebook here or I'd post the math.

Perhaps someone cleverer than I can pull it off; I will gladly steal^Wuse their code. :)

Edit: I'm speaking here specifically of a data stack or mixed data/return stack (as in C), not a dedicated return stack (as in Forth). Putting a return stack in the Cog would be easier, but also less of an optimization (the data stack tends to be hotter by an order of magnitude in languages that separate them).

cgracey
11-11-2006, 04:05 AM
Man, this little bit of code really gets my gears turning. It is ingenious! It can execute assembly language from hub ram at 1/4 rate, and then cache any loops and execute them internally. For many applications, this could mean 95% the speed of entirely-native cog execution, but with the giant benefits of a larger memory model. This is just awesome. Way to go, Bill!

I keep experiencing this phenomenon where what I had accepted as a hard and fast limitation gets blown straight through by some unexpected finesse, and it's been making it difficult to get back onto the next chip. So many times I've thought, "Well, the next chip will be able to do that." And then we find a way to get the current chip to do it. Fun fun fun.


Post Edited (Chip Gracey (Parallax)) : 11/10/2006 9:48:42 PM GMT

Bill Henning
11-11-2006, 04:29 AM
I am glad you like it :)

I've been bursting at the seams to let it out, but I was first trying to think of some way of directly monetizing it.

Last night I decided that indirect monetization (getting better known, eventually making a web site about it, people supporting the idea and helping me) is better in this case.

More tonight, when I am not at work - I will document the threading model I came up with; Chip I sent you a PM outlining the basics of it :)

Ym2413a
11-11-2006, 04:31 AM
Oh wow! This idea looks amazing.

Mike Green
11-11-2006, 05:29 AM
Chip,
One little correction in the case that you find a zero value on the first fetch.
This way, pc is properly set to point after the zero value.



fcache    rdlong  $80,pc wz
          add     fcache,dspIncr2         ' dspIncr2 = 2 << 9
  if_nz   add     pc,#4
fcache2
  if_nz   rdlong  $81,pc wz
          add     fcache2,dspIncr2        ' dspIncr2 = 2 << 9
          add     pc,#4
  if_nz   jmp     #fcache
          movd    fcache,#$80
          movd    fcache2,#$81
          jmp     #$80

cgracey
11-11-2006, 05:38 AM
Oh, I didn't think about that. That would have caused a problem. Good thinking.



Dennis Ferron
11-11-2006, 05:48 AM
Begin YABPC:

I realize copyrights and patents are apples and oranges, but can the GNU General Public License be used to protect an idea from patent sharks? For instance, is there a way Bill could release this as GPL or public domain and thereby make it unpatentable? What about circuit schematics - can making the schematic public domain protect it from being patented by an unscrupulous company later?

end YABPC:

Paul Baker
11-11-2006, 05:59 AM
Well, the whole software thing has been thoroughly loused up by the interpretations of the statutes by the legal system. It wasn't until recently that anything of a software nature could be patented. The loophole that has since been expanded to a crater the size of Texas is that software coupled to the act of executing it on hardware is now considered patentable. But yes, anything that is publicly disclosed is considered prior art; the crux of this is it's only as good as how well publicized it is. IOW if it's not commonly known and available, there is a more than likely chance the examiner won't know about it and won't apply it. But anything of public knowledge is fair game; I have on more than one occasion used actual sections of code of the Linux operating system to reject an application. But this was only because I or a senior examiner I consulted with knew that Linux did the same thing.

Now out-of-the-way publications are just as valid, but if they weren't applied on the front end, it requires overturning the patent on the back end (i.e. suing in a court of law). But that can easily run into the millions of dollars, so it's best to avoid the situation whenever possible.


Post Edited (Paul Baker (Parallax)) : 11/10/2006 11:06:27 PM GMT

Mike Green
11-11-2006, 06:00 AM
Well, to get very specific, very early, I propose that any COG that is using Bill's system start with at least the following longs:



          org     0
          jmp     #cache          ' initialization code in cache
          jmp     #f_next         ' continue with next instruction
          jmp     #f_jmp          ' jump routine
          jmp     #f_call         ' call routine
          jmp     #f_ret          ' return routine
          jmp     #f_cache        ' cache & jump routine
          jmp     #f_expand       ' expansion / system call
          jmp     #f_next         ' reserved



Note that by using jmp instructions, you can either transfer control or use the non-immediate form of a jump, but it's clearer what you intend. The call stack will begin at $7F and move downwards. The cache will begin at $80 as planned and go to $FF. The support routines will fit between this jump table and the stack. Any thoughts?

I'm pushing for some basic conventions because I want to start using this with the Propeller OS I wrote and I'm getting ready to make a new release.
Mike

Post Edited (Mike Green) : 11/10/2006 11:04:26 PM GMT

Bill Henning
11-11-2006, 06:09 AM
I like it!

Suggestions:

- I think the call stack should be in hub memory. The threading model I am playing with pretty much requires that (up to eight threads per cog)

- let's reserve the first sixteen longs as jump vectors; leaves room for my threading and messaging primitives

- let's try to keep the kernel below $80, including all vectors. I know, this is limiting, but I have more ideas...

Rough cog memory map:

$000-$00F vectors
$010-$07F primitives
$080-$0FF FCACHE area
$100-$17F FDCACHE area (more on this later)
$180-$1EF virtual peripheral area / thread space (more on this later)
$1F0-$1FF the reserved cog special registers

Please note each cache and virtual peripheral area works out to be 512 bytes long. This is not an accident. Think swapping in from SPI eeprom or swapping in/out to/from SPI ferroram.
Basically what I am doing is treating a cog's memory as a program controlled L1 cache, with the HUB memory as the "main memory".

Later I plan to treat HUB memory as an L2 cache :)

(I'm having a late lunch so I could comment)

Mike Green
11-11-2006, 06:31 AM
Bill,
I'm trying to simplify some aspects of this, yet allow for complexity later when needed. Unless you're doing multi-threading, you may not need a HUB based stack and the vectors, basic primitives, and cache are all that would be needed (and would be strictly upward compatible with the multi-threaded version). I would still like to push ahead with a cog-based stack version, but make sure that only the call/ret/initialization routines know about that.
Mike

Mike Green
11-11-2006, 06:45 AM
One more thing ... The PC variable needs to be somewhere directly accessible so that the relative "jumps" can be done. Since location zero is only used for the jump to the initialization code, I suggest that we use COG location zero for the PC value. It would be set up by the once-only initialization code, probably using PAR directly or indirectly. Consider keeping the HUB stack pointer in the upper half of location zero. It might make multi-threading easier. You'd swap one long for another and you'd be off in the next thread. It would add an instruction or two to a call or return at most.

Phil Pilgrim (PhiPi)
11-11-2006, 07:47 AM
Further simplifying, I don't think you even need an fjmp routine. It's only one instruction (rdlong pc,pc), which could just as easily be put inline, followed by the address, for long absolute jumps. The other cases are equally short: mov pc,#nnn for jumps to the first 512 program locations, add pc,#nnn for short forward relative jumps, and sub pc,#nnn for short backward relative jumps.
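Written out, the four inline forms might look like this in hub code. They are alternatives, not a sequence, and the addresses and offsets are made-up values for illustration. Each relies on pc already pointing at the following long when the instruction executes.

```pasm
' 1) long absolute jump: the target address sits in the next long
        rdlong  pc,pc
        long    $0000_4000

' 2) absolute jump to a hub byte address below $200
        mov     pc,#$100

' 3) short forward jump: skip the next 4 longs
        add     pc,#16

' 4) short backward jump: 4 longs back from the long after this one
        sub     pc,#16
```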

-Phil

Mike Green
11-11-2006, 07:55 AM
OK, so now we're down to



DAT
                org     0
bill_pc         jmp     #f_initial
bill_next       jmp     #f_next
bill_call       jmp     #f_call
bill_retn       jmp     #f_retn
bill_cache      jmp     #f_cache
bill_expand     jmp     #f_next
bill_reserved0  jmp     #f_next
bill_reserved1  jmp     #f_next
bill_reserved2  jmp     #f_next
bill_reserved3  jmp     #f_next
bill_reserved4  jmp     #f_next
bill_reserved5  jmp     #f_next
bill_reserved6  jmp     #f_next
bill_reserved7  jmp     #f_next
bill_reserved8  jmp     #f_next
bill_reserved9  jmp     #f_next

Bill Henning
11-11-2006, 07:57 AM
Excellent!

Mike Green
11-11-2006, 08:33 AM
OK, here we go. The entries are defined as CONs so they can be referenced using the object constant "obj#con" notation. There's no jump entry since we can use a "RDLONG obj#bhPC,obj#bhPC" for that purpose and it doesn't cost any more time. I'll add some more comments after I've gotten some dinner.

Comments? Questions?

Post Edited (Mike Green) : 11/11/2006 1:40:51 AM GMT

Mike Green
11-11-2006, 08:49 AM
Another special case:

A halt/pause could be "sub pc,#4".

Mike Green
11-11-2006, 08:56 AM
Another thought for discussion:

The instructions being interpreted/executed sometimes need operands and all of them are in COG memory. Since the use of the cache is indivisible as far as threads are concerned, how about, by convention, putting any working storage at the end of the cache area? Alternatively, we could allocate some small number (say 16) after the end of the jump table?

There also needs to be a primitive to copy the long following the instruction to a specific location (then skip the 2nd long too). This allows a long (often a HUB address) to be loaded without a complex multiple instruction sequence. Let's call it a LOAD. If we use a JMPRET instruction, the destination would be overstored with the data so the fact that a useless return address is stored there first doesn't matter. So, we'd use a "JMPRET <dest>,bhLoad" followed by the long value.



f_load    rdlong  f_temp,PC       ' get the data
          sub     PC,#4
          rdlong  f_temp2,PC      ' get the instruction again
          shr     f_temp2,#9
          movd    :loadIt,f_temp2
          add     PC,#8
:loadIt   mov     0-0,f_temp
          jmp     #f_next
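On the hub side, a LOAD might then be used as sketched here. bufPtr and value are hypothetical cog registers, bhLoad is assumed to be the cog address of the jump-table entry for f_load, and the constant is a made-up hub address.

```pasm
        jmpret  bufPtr,bhLoad   ' D = register to fill, S = jump-table entry for f_load
        long    $0000_6000      ' the 32-bit value to load
        rdlong  value,bufPtr    ' bufPtr now holds $6000
```

The return address that JMPRET writes into bufPtr is harmless, since f_load overwrites the whole register with the data long before returning to the fetch loop.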



Comments? Thoughts?

Post Edited (Mike Green) : 11/11/2006 2:35:46 AM GMT

Bill Henning
11-11-2006, 11:41 AM
OK Gentlemen!

Thank you for the enthusiasm, I am now at home, and about to spend the next few hours on this forum :)

Mike - you've put in an AWFUL lot of thought and work on fleshing out the far model! Thank you!

Chip - thanks for the enthusiasm! And the improved FCACHE!

Phil - Great suggestions! They help a LOT!

Ok, there will be more later, and I want to thank everyone... and I hope to post a LOT more tonight and tomorrow night.

Mike - I like your FLOAD, I think it deserves the next slot in the jump table right after FCACHE. I'm debating the potential merits of a mirroring FSTORE, but I am not sure its needed.

Much more in a bit, but I need to do some typing off-line to edit it properly before posting...

Phil Pilgrim (PhiPi)
11-11-2006, 12:06 PM
Mike,

Brilliant idea using JMPRET for the FLOAD function!

Bill,

I agree: FSTORE isn't needed. FLOAD is only needed to grab a constant larger than $1ff so it can be used in an indirect operand. There's no complementary requirement for storing, since the "large number" is already in a local register. Basically, the FLOADs just replace all those defined LONGs that appear at the end of a regular assembly program.

Chip,

'Just got to thinking. If you hadn't made the writing of condition codes optional, none of this would even be possible! (Or at least it would be a lot more awkward!)

-Phil

Mike Green
11-11-2006, 12:12 PM
I've been playing with some sample code for the OS. The FLOAD is useful for the occasional value greater than 9 bits. Anything more complicated than that ... seems to call for using FCACHE. Unfortunately, SPIN makes it difficult to reference locations in HUB RAM from DAT code. They all have to be patched up.

Bill Henning
11-11-2006, 12:38 PM
Ok guys, here I go...

Mike, I LOVE FLOAD.

I was originally intending to just keep data right at the end of FCACHE'd segments, as obviously the FCACHE'd code will know where it will wind up in the cog; but this is great for regular far code.

To be honest, I don't like 'sub pc,#4' for HALT as there is no way out of it short of restarting a cog; so it may as well be halted so that it won't generate heat.

Phil, thanks for agreeing, no FSTORE.

Now...

I can see we will have several different kernels - that is great, special purpose kernels fit the "philosophy" of the propeller. I will be publishing later tonite the specs for a multi-threaded kernel :-) I already sent Chip a preview earlier today, but I want to clean it up a bit before posting it.

In order to keep all kernels binary compatible, and the idea simple... may I suggest a steering panel?

I'd love to see a panel composed of myself, Mike, Chip and Phil, with others of course welcome to contribute ideas at all times; and I hope people would not think it too presumptuous of me to assume the role of steering it.

Meanwhile, please find attached a somewhat cleaned up, somewhat commented version of 'kernel.spin'

It still needs a start method, but that will be trivial.


Post Edited (Bill Henning) : 11/11/2006 7:20:43 AM GMT

Phil Pilgrim (PhiPi)
11-11-2006, 12:39 PM
Guys,

'Just had a thought while in the shower (second best place for ideas!): Since what's coming from this effort really amounts to an emulator, why not include hooks in a debugging version for breakpoints and single-stepping, too? It would have to be able to accept commands from a debugger (probably written in Spin), to signal when events occur, and to dump the machine state back to hub memory on demand. But I think these things could be accomplished rather easily in the general model we've been discussing.

-Phil

Bill Henning
11-11-2006, 12:42 PM
Good point Phil - that would work!

My other ideas (other than threading) include 'virtual peripherals' like on the SX, and user code transparent demand paging of code :)

Except that instead of an emulator, this is really more like a VM on a processor without virtualization, as the vast majority of code will run at full speed, and is not interpreted by code, but is actually executed by the normal fetch-decode-execute-store cycles of the cog

Bill Henning
11-11-2006, 12:47 PM
Spin extension request:

INLINE

block of large model assembly code follows; if needed throws out part of spin interpreter to use FCACHE functionality to execute the block, then reloads tossed away spin interpreter code to continue

ideally the spin compiler would flag as errors any native cog instructions that break the large model and any FCACHE rules (basically, jumping out of the FCACHE block to anything but legal Fxxx primitives)

Phil Pilgrim (PhiPi)
11-11-2006, 12:54 PM
Oh gosh, Bill. Do you realize how tightly the Spin interpreter is packed? I doubt there's a single bit left to handle any kind of caching. I'm sure any "inlining" will have to be initiated in a separate cog by a COGNEW.

-Phil

Bill Henning
11-11-2006, 01:00 PM
Proposed standard cog memory map:

Common to ALL kernel versions, required for binary compatibility!

000: PC
001: SP
002: @next
003: @fcall
004: @fret
005: @fcache
006: @fload
007: @fsystem
008-00f: reserved
010-07f: kernel primitives (also Mike's local stack grows down from $07f in local stack kernels)
080-0ff: FCACHE code buffer

"Standard" kernel - not multi-threaded, stack can be local or hub based

100-17f: DCACHE - yes, what it looks like, proposed Data cache area! One of the system calls will load it
180-1ef: virtual peripheral area

"Threaded kernel"

100-17f: DCACHE

The next 112 locations, range $180-$1ef are divided up as follows:

008: @yield - prematurely ends a thread's execution time, passes control to next ready thread
Unfortunately, the "real" PC/SP must be moved under the threaded model.

For compatibility, I am thinking of revising the standard spec to place PC/SP/BP/STATUS starting at $180 even
for non-threaded kernels.

Thread local guaranteed local registers for currently executing thread; PC and SP at $0 and $1
180: PC      virtual program counter
181: SP      return stack pointer
182: BP      base pointer for future high level language support
183: STATUS  status register to save Z and C during context switches
184: R0      registers
185: R1
186: R2
187: R3
188: R4
189: R5
18A: R6
18B: R7

followed by eight groups of twelve registers each; each group stores STATUS, BP, PC, SP, R0-R7 for a potential thread (BP is a reserved register; for future two-instruction "local hub" variable access)

the last four locations are scratch locations for FCACHE blocks

Each thread must have its own stack in hub memory

More on how threading works in another message later; I had to revise the design due to issues I found trying to code the context switch routine

Post Edited (Bill Henning) : 11/11/2006 6:51:50 AM GMT

Bill Henning
11-11-2006, 01:02 PM
Phil:

INLINE is to be a compiler directive to allow us to embed non-standard code; i.e. have SPIN assemble it at the locations where it would end up in sequence in hub memory.

It is NOT intended to inline code into the SPIN cogs!

Think of it like #asm for hub execution friendly code, hopefully it will also emit correct code for FCALL, FCACHE and friends :)

Bill Henning
11-11-2006, 01:59 PM
Ok, design change.

After editing the message above AGAIN, I decided time for a new message.

For compatibility, the threading model is forcing some changes in the standard kernel layout. Even non-threaded kernels are expected to reserve the following range:

1E0-1EF: Currently running context registers; copied to/from here on context switches

1E0: PC
1E1: SP
1E2: BP
1E3: FLAGS
1E4: R0
1E5: R1
1E6: R2
1E7: R3
1E8: R4
1E9: R5
1EA: R6
1EB: R7
1EC: READYTASKS - bitmap of threads that are ready to run
1ED: TIMESLICE - number of "interpreter loop cycles" between forced context switches
1EE: TIMELEFT - "interpreter loop cycles" left for currently running thread
1EF: CURRTASK - pointer to currently running task's thread context

Yes, this does make this kernel a pre-emptive multitasking kernel.

Threaded kernels also make the following reservations:

1D4-1DF: THREAD0 context
1C8-1D3: THREAD1 context
1BC-1C7: THREAD2 context
1B0-1BB: THREAD3 context
1A4-1AF: THREAD4 context
198-1A3: THREAD5 context
18C-197: THREAD6 context
180-18B: THREAD7 context

The reason for the reverse ordering is that thread contexts grow DOWN from the space reserved for the running context. There is theoretically no reason why more threads cannot be supported; as long as we stay out of the FCACHE area and don't use DCACHE, an additional ten threads are possible, for a maximum of 18 running threads per cog, allowing for a theoretical limit of 8 cogs * 18 threads... 144 threads on one propeller!!!!
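
The layout arithmetic above can be checked with a small Python sketch. It is only an illustration, not part of any kernel: the names (`RUNNING_CTX`, `context_range` and so on) are made up for the example, and it assumes every context is exactly 12 longs, as in the list above.

```python
# Sanity check of the proposed cog memory layout (illustrative names only).
RUNNING_CTX = 0x1E0          # start of the currently running context ($1E0-$1EF)
CTX_LONGS = 12               # PC, SP, BP, FLAGS, R0-R7
DCACHE_START, DCACHE_END = 0x100, 0x17F
COGS = 8

def context_range(thread):
    """Return (first, last) cog address of a saved thread context.

    Contexts grow DOWN from the running context, so THREAD0 sits just below $1E0.
    """
    first = RUNNING_CTX - CTX_LONGS * (thread + 1)
    return first, first + CTX_LONGS - 1

for t in range(8):
    first, last = context_range(t)
    print(f"THREAD{t}: {first:03X}-{last:03X}")

# Extra contexts that fit if DCACHE ($100-$17F) is given up for thread storage.
dcache_longs = DCACHE_END - DCACHE_START + 1
extra_threads = dcache_longs // CTX_LONGS
threads_per_cog = 8 + extra_threads
print(extra_threads, threads_per_cog, COGS * threads_per_cog)
```

Under those assumptions the sketch reproduces the table (THREAD0 at 1D4-1DF down to THREAD7 at 180-18B) and the 10 / 18 / 144 figures.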

Post Edited (Bill Henning) : 11/11/2006 7:49:35 AM GMT

Mike Green
11-11-2006, 02:01 PM
Bill,
I had a look at your "kernel.spin" posting. It's nice to have the documentation of everything that has transpired. It does cause problems with verbosity in integrating the kernel with my existing I2C routines. Can you distill down the statement of who's to get credit to maybe a paragraph that can be included in my source code, so that I can trim out all the other comments and the Large Memory Model stuff isn't longer than my whole I2C package? Thanks.
Mike

By the way, the comment on the "long f_load" is incorrect now (line 74). Also, relocating the initialization routine to the cache area won't work the way you've done it since the COGINIT instruction loads a block of 512 contiguous longs and the Propeller Tool doesn't pad out the DAT section when it sees the ORGs. Better to leave the initialization code at the end of the "kernel" where it may either be pushed into the cache area as the "kernel" gets bigger or be overwritten by the stack in the non-threaded version.

Mike Green
11-11-2006, 02:04 PM
Bill,
I like moving the PC and SP to high memory in the cog. What do you want to do with the locations currently occupied by PC and SP?
Mike

Bill Henning
11-11-2006, 02:09 PM
Mike - let's simply shift the primitives down.

If you don't mind, why don't you make the change regarding initialization code, put it back where it would get overwritten by your stack, and fix the 'long f_load' comment?

Hmm.. good suggestion, based on which I think perhaps the revision log etc. should go to a 'largemodel.credits' file.
I think perhaps the YIELD call should come before SYSTEM though. If running a non-threaded kernel, it can just call f_next

I'm hot-and-heavy into writing (but not testing yet) the threading code...

cgracey
11-11-2006, 02:09 PM
This would, indeed, be a problem. Not only is the interpreter wound way too tight, but it's in ROM, as well. Launching the large-model emulator would have to be done through a COGNEW or COGINIT.

There are some issues with the current compiler, as well, but those are mainly fixable. I think I need to make a mode, maybe triggered through 'ORG -1' where the compiler will quit worrying about cog ram being exceeded by asm instructions. I think this could be realized by 'org -1' causing the cog address to not be incremented anymore, until a normal org appeared, as would be required for a cached block.

It's true that DAT labels are relative to the object they're in -- they're not absolute, so this would need some patching, as someone pointed out. The compiler loses this data during object mixing, and I know this would be a major (but perhaps eventual) undertaking. In the interim, when a large-model emulator is launched, it could be told what the base address of the object is, and then patch at run-time. It would come down to a single addition, but require a 2-4 instruction sequence to realize within the large-model emulator.

To launch the emulator you'd need to convey at least two pointers: one to the start of the code (@virtualasmcode), and one to the start of the object (@@0) so branch addresses could be resolved. A jump address in virtual asm code would be expressed as @datlabel.
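
The run-time fix-up described above boils down to one addition per address. A minimal Python sketch of that step follows; the function name and the example values are made up, and the 16-bit mask only reflects the assumption that hub addresses fit in 16 bits.

```python
# Sketch of run-time address fix-up (all names and values are made up).
def patch_addresses(offsets, object_base):
    """Turn object-relative DAT offsets into absolute hub byte addresses."""
    return [(object_base + off) & 0xFFFF for off in offsets]

object_base = 0x0010                 # stand-in for the value of @@0
jump_targets = [0x0040, 0x0124]      # stand-ins for @datlabel values
print([hex(a) for a in patch_addresses(jump_targets, object_base)])
```

In the emulator itself this would be the "single addition" mentioned above, applied to the long fetched after a jump or call primitive before it is loaded into the virtual PC.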

For a full-blown, more native approach, Spin would not have to be a consideration. Any language would do.

Am I following this whole thing correctly?




Phil Pilgrim (PhiPi) said...
Oh gosh, Bill. Do you realize how tightly the Spin interpreter is packed? I doubt there's a single bit left to handle any kind of caching. I'm sure any "inlining" will have to be initiated in a separate cog by a COGNEW.

-Phil

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


Chip Gracey
Parallax, Inc.

Bill Henning
11-11-2006, 02:17 PM
Hmm... I think you are.

And I think we can use the 'BP' base pointer I introduced to be the object base address in hub memory for large model asm code included in spin blocks... what do you think?

For example, in large model asm code, to refer to a VAR long foo, we would write:

INLINE
mov R0,@foo
add R0,BP
rdlong R0,R0

I propose that PC, SP and BP be set up from the PAR passed to COGNEW when initiating a new large model process

Chip, any comments on the threading?

Bill Henning
11-11-2006, 02:24 PM
By the way Chip...

The kernel code can execute out of ROM just as easily as out of RAM :)

Think of what you can do to the next version of spin with FAR more code space!

And if at all possible, I'd like to reserve at least 128 of the first 512 bytes of hub memory. 256 would be better.

Why?

FAST Messaging between tasks, fast system calls.

Why?

wrlong req,#cog0request

...

rdlong rep,#cog0result


Post Edited (Bill Henning) : 11/11/2006 7:30:35 AM GMT

Mike Green
11-11-2006, 02:42 PM
Chip,
It's a little more complicated than that. For example, if you want to have code addresses (for calls, jumps, etc.), the granularity is wrong (longs vs. bytes). All code offsets have to be multiplied by 4. Also, although it's potentially dangerous, the use of a "here" operand is very helpful with relative jumps of a short distance (maybe up to 8 longs in either direction). Sure you can have local labels, but, for very short distances, they're more of a nuisance than a help.

Bill,
How should the initial information be passed? As 3 longs or 3 words? What order?

Can you define the format of the System "instruction"? I'd like to nail down the code for the non-threaded version. I assume there'll be some parameters following the "instruction". How many longs?

I'll try to edit in your changes (like the context registers) tomorrow and post my version (with the changes) then.
Mike

Bill Henning
11-11-2006, 02:58 PM
Mike, let me think on those excellent questions a few minutes... thanks for the edits on those, or if you have not made other changes, I can put them in. We should pass it back and forth :)

For now... tada... here is the multi-tasking context switching code!

Right now it assumes all eight threads always run; I will change that tomorrow. I have used up all the registers between $1E0 and $1EF now!

Ok, the end of the unrolled fetch-exec loop used to read:

:inst4 nop
jmp #f_next

In order to have eight threads, this needs to change to the following:


:inst4                  nop
                        DJNZ    TIMELEFT,#f_next
                        mov     TIMELEFT,TIMESLICE      ' reset slice clock for next thread

                        ' save current C and Z into FLAGS register

                        MUXZ    FLAGS,#zFlag            ' Thanks Chip! Forgot about MUX instructions
                        MUXC    FLAGS,#cFlag            ' you saved everyone 8 cycles on every task switch!

                        ' save current context
                        movs    ctxsave,#PC
                        movd    ctxsave,CURRTASK
                        mov     f_temp,#12
ctxsave                 mov     0-0,0-0
                        add     ctxsave,src_inc_const
                        add     ctxsave,dst_inc_const
                        DJNZ    f_temp,#ctxsave

                        ' go to next context - for now I'm assuming all threads always run
                        ' this is proof of concept code; next version will use the last spare
                        ' word

                        sub     CURRTASK,#12
                        cmp     CURRTASK,#$180 wc       ' C set if CURRTASK went below THREAD7
        if_b            mov     CURRTASK,#$1D4          ' go back to top thread
                        movs    ctxload,CURRTASK
                        movd    ctxload,#PC
                        mov     f_temp,#12

ctxload                 mov     0-0,0-0
                        add     ctxload,src_inc_const
                        add     ctxload,dst_inc_const
                        DJNZ    f_temp,#ctxload

                        ' restore flags

                        andn    FLAGS,zFlag wz
                        rcr     FLAGS,#1 wc

                        jmp     #f_next


NOTE: I have NOT run this code yet; there may be bugs, I could have left some in there...
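
To make the intended behaviour of the switch easier to follow, here is a toy Python model of it. It is not the kernel: cog RAM is just a list of 512 integers, flags and time slicing are left out, and the addresses are taken from the proposed layout ($1E0 running context, $1EF CURRTASK, saved contexts from $1D4 down to $180).

```python
# Toy model of the round-robin context switch (not the real kernel).
COG = [0] * 512              # cog RAM, one Python int per long
PC = 0x1E0                   # running context: PC, SP, BP, FLAGS, R0-R7
CURRTASK = 0x1EF             # holds the cog address of the running thread's saved context
CTX_LONGS = 12
TOP_CTX, BOTTOM_CTX = 0x1D4, 0x180   # THREAD0 and THREAD7 slots

def context_switch(cog):
    """Save the running context, step to the next slot, load it."""
    slot = cog[CURRTASK]
    cog[slot:slot + CTX_LONGS] = cog[PC:PC + CTX_LONGS]      # ctxsave loop
    slot -= CTX_LONGS                                          # sub CURRTASK,#12
    if slot < BOTTOM_CTX:                                      # cmp ... wc / if_b mov
        slot = TOP_CTX
    cog[CURRTASK] = slot
    cog[PC:PC + CTX_LONGS] = cog[slot:slot + CTX_LONGS]      # ctxload loop

# Give each saved context a recognisable PC, then start with THREAD0 running.
for t in range(8):
    COG[TOP_CTX - t * CTX_LONGS] = 0x1000 * (t + 1)
COG[CURRTASK] = TOP_CTX
COG[PC] = COG[TOP_CTX]

order = []
for _ in range(9):
    order.append(COG[PC])
    context_switch(COG)
print([hex(v) for v in order])
```

The printed PCs walk from THREAD0 to THREAD7 and then wrap back to THREAD0, which is the behaviour the `cmp` / `if_b` pair is meant to produce.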

Post Edited (Bill Henning) : 11/11/2006 8:23:28 AM GMT

cgracey
11-11-2006, 02:59 PM
Bill Henning said...


I propose that PC, SP and BP be set up from the PAR passed to COGNEW when initiating a new large model process

Chip, any comments on the threading?
Yes, it seems to me that those three values must be conveyed in a 3-word structure via PAR.

About threading, I like how you reduced the task-specific working registers to R0..R7 so that only they and three other values must be swapped to make a thread switch. You know, though, if a set of threads is pre-written to run on a single cog, they could be hard-coded to use separate register areas. This swapping is only necessary for un-related-at-design-time threads, right? But, that might be the real value of this thing - threads can start, spawn, and stop without any intimate knowledge of each other. With swapping, someone could write a serial driver and someone else could run a few of them on a single cog. You'd get a lot more bang out of a cog that way.


Phil Pilgrim (PhiPi)
11-11-2006, 02:59 PM
Here's a small bit of code I propose adding to the kernel. It enables Forth-style threaded-code. (This isn't the same as Bill's multi-threaded code. 'Too bad the two terms are so close. It can be confusing.) Anyway, this kind of threaded code is just a list of word-length addresses, each pointing to a location in the large-model codespace, along with a way to enter and exit such lists, as well as to fetch the next address and jump to that location. The inspiration for this comes from similar instructions which were native to the Zilog Super8 microcontroller. These instructions were all it took for the Super8 to support Forth almost natively. This was 20-some years ago and, unfortunately, Zilog didn't know what they had. So far as I know, it was only available in NMOS, and it got hot. The product had an enthusiastic customer base but died from lack of interest on Zilog's part.

In addition to an instruction pointer register (ip) in cog RAM, here's all the code that's necessary (sp is the same as Mike's stkptr):




'Part of initial jump table.

ienter        long    xienter
iexit         long    xiexit
inext         long    xinext

'Working code.

xienter       wrword  ip,sp           'Push IP on the stack.
              add     sp,#2
              mov     ip,pc           'New IP points to next instruction (a threaded-code word)
              jmp     #xinext         'Go get it.

xiexit        sub     sp,#2           'Pop IP from the stack.
              rdword  ip,sp

xinext        rdword  pc,ip           'Point PC to the next address in the threaded list.
              add     ip,#2           'Increment the instruction pointer.
              jmp     #nxt            'Go execute the next threaded-code word.




Here's a sample of some Forth-like procedures written in a mixture of large-model assembly and threaded-code. tos, ttos, and ep are app-specific registers in the scratch area.




'Large-model assembly procedures:

EXIT          jmp     iexit           'Exit from code thread.

PUSH          wrword  tos,ep          'Push stack top onto expression stack.
              add     ep,#2           'Increment expression stack pointer.
              rdword  tos,ip          'Load the new stack top from thread-code stream.
              add     ip,#2           'Increment instruction pointer.
              jmp     inext           'Go do next instruction.

POP           call    #pop_one
              jmp     inext

SWAP          sub     ep,#2           'Point ep at nos (ep normally points one past it).
              rdword  ttos,ep         'Get nos into ttos.
              wrword  tos,ep          'Put tos into nos.
              add     ep,#2           'Restore expression stack pointer.
              mov     tos,ttos        'Copy ttos to tos.
              jmp     inext

ADDW          call    #pop_one        'ttos<-tos tos<-nos
              add     tos,ttos        'Replace tos with sum.
              jmp     inext

pop_one       mov     ttos,tos        'Save stack top for possible use in operation.
              sub     ep,#2           'Pop tos from expression stack.
              rdword  tos,ep
pop_one_ret   ret

'Threaded-code procedures:

LFTROT        jmp     ienter
              word    ROT
              word    ROT
              word    EXIT

OVER          jmp     ienter
              word    SWAP
              word    DUP
              word    ROT
              word    SWAP
              word    EXIT

INCHES        jmp     ienter
              word    PUSH
              word    10
              word    MULT
              word    PUSH
              word    254
              word    DIV
              word    EXIT

MILLIMETERS   jmp     ienter
              word    PUSH
              word    254
              word    MULT
              word    PUSH
              word    10
              word    DIV
              word    EXIT
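
For readers who have not met this style of threading, the control flow of ienter / iexit / inext can be modelled in a few lines of Python. This is a loose sketch and not a translation of the code above: words are looked up by name instead of by hub address, the stacks are Python lists, and MULT and DIV (used in the sample but not defined in the post) are assumed to be plain integer multiply and divide.

```python
# Toy model of the ienter/iexit/inext inner interpreter (hypothetical names).
class Machine:
    def __init__(self):
        self.words = {}      # name -> primitive callable or list of thread cells
        self.rstack = []     # return stack of saved IPs
        self.estack = []     # expression stack
        self.ip = None       # (thread, index) of the next cell to fetch

    def run(self, name):
        """Execute one word to completion."""
        self.ip = ([name, "HALT"], 0)
        while True:
            thread, i = self.ip                 # inext: fetch the next cell...
            cell = thread[i]
            self.ip = (thread, i + 1)           # ...and advance IP
            if cell == "HALT":
                return
            body = self.words[cell]
            if callable(body):
                body(self)                      # large-model assembly primitive
            else:
                self.rstack.append(self.ip)     # ienter: push IP,
                self.ip = (body, 0)             # then point IP at the new thread

def exit_(m):                                   # iexit: pop IP
    m.ip = m.rstack.pop()

def push(m):                                    # PUSH: the literal follows in the thread
    thread, i = m.ip
    m.estack.append(thread[i])
    m.ip = (thread, i + 1)

def mult(m):                                    # assumed meaning of MULT
    b, a = m.estack.pop(), m.estack.pop()
    m.estack.append(a * b)

def div(m):                                     # assumed meaning of DIV
    b, a = m.estack.pop(), m.estack.pop()
    m.estack.append(a // b)

m = Machine()
m.words.update(EXIT=exit_, PUSH=push, MULT=mult, DIV=div)
m.words["INCHES"] = ["PUSH", 10, "MULT", "PUSH", 254, "DIV", "EXIT"]
m.words["MILLIMETERS"] = ["PUSH", 254, "MULT", "PUSH", 10, "DIV", "EXIT"]

m.estack.append(100)
m.run("MILLIMETERS")
print(m.estack)          # [2540]
```

The point of the model is only the flow of IP: a threaded word saves the caller's IP and starts fetching from its own list, and EXIT restores it.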




-Phil

Post Edited (Phil Pilgrim (PhiPi)) : 11/11/2006 8:15:37 AM GMT

cgracey
11-11-2006, 03:06 PM
Bill Henning said...

                        ' save current C and Z into FLAGS register
        if_z            or      FLAGS,#zFlag
        if_nz           andn    FLAGS,#zFlag
        if_c            or      FLAGS,#cFlag     ' cFlag must be the lowest bit in the word!
        if_nc           andn    FLAGS,#cFlag

Bill, there are instructions called MUXZ, MUXNZ, MUXC, and MUXNC that will write a flag or its complement to any number of bits in a destination. Here's how to reduce the above code:

                    MUXZ    FLAGS,#zFlag
                    MUXC    FLAGS,#cFlag

I think you'll need to use MUXNZ for the Z flag so that when later TESTed, it restores properly.
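
Why MUXNZ rather than MUXZ matters can be seen in a small Python model of the bit operations. The mask values are assumptions (the thread only says that cFlag must be the lowest bit), and the helper names are invented for the sketch.

```python
# Model of MUXZ/MUXNZ/MUXC and TEST on 32-bit values (flag masks are assumed).
MASK32 = 0xFFFFFFFF
C_FLAG = 0b01   # assumed: carry kept in bit 0, as the quoted comment requires
Z_FLAG = 0b10   # assumed: zero flag kept in bit 1

def mux(dest, mask, bit):
    """Set every masked bit of dest to `bit` (MUXC/MUXZ when bit is the flag,
    MUXNC/MUXNZ when bit is its complement)."""
    return ((dest | mask) if bit else (dest & ~mask)) & MASK32

def pasm_test_wz(dest, mask):
    """TEST dest,#mask WZ: Z is set when (dest AND mask) is zero."""
    return (dest & mask) == 0

def save_flags(flags, z, c):
    flags = mux(flags, Z_FLAG, not z)   # MUXNZ FLAGS,#zFlag
    return mux(flags, C_FLAG, c)        # MUXC  FLAGS,#cFlag

def restore_flags(flags):
    z = pasm_test_wz(flags, Z_FLAG)     # TEST FLAGS,#zFlag WZ
    c = bool(flags & C_FLAG)            # C read back from bit 0
    return z, c

for z in (False, True):
    for c in (False, True):
        print(z, c, restore_flags(save_flags(0, z, c)))
```

Because TEST sets Z when the masked bit is zero, the saved bit has to be the complement of Z for the round trip to work; saving with MUXZ instead would hand back an inverted Z.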


Post Edited (Chip Gracey (Parallax)) : 11/11/2006 8:10:17 AM GMT

Bill Henning
11-11-2006, 03:10 PM
Thanks Chip.

I reduced it to 12 registers including the new FLAGS (to save C and Z) specifically to be able to compile code without having to statically allocate registers; but you are right, hand-built thread-aware code could make use of the DCACHE area for more local variables.

You are also correct, the point of a thread specific limited context was support for 'classic' threading functions for spawning, killing, suspending etc threads. Theoretically, threads could even be swapped out to FRAM over SPI!

The multi-threaded kernel would even allow for Unix style fork() and friends.

Eventually I want to be able to allow an FCALL to a system library to spawn a new thread on the calling cog!

For example, I am considering adding an event-watching system, so that when say a 'CTS' signal is detected, a call can be made to 'RS232Receive', which then spawns a thread that FCACHE's the actual receive routine :)

OR,

  FCALL SPI_READ
  #disk_block

:-)

Which spawns a thread, that uses an FCACHE'd block to read AT FULL SPI SPEED a block!

The thread can then die, and the cog goes back to regular threading.

This allows a FULL FUNCTION multi-tasking OS!

Compilers, shells, interpreters... BRING IT ON!

A full feature BASIC!

A C compiler!



Bill Henning
11-11-2006, 03:11 PM
THANKS!!! I forgot about those!


Chip Gracey (Parallax) said...

Bill Henning said...

                        ' save current C and Z into FLAGS register
        if_z            or      FLAGS,#zFlag
        if_nz           andn    FLAGS,#zFlag
        if_c            or      FLAGS,#cFlag     ' cFlag must be the lowest bit in the word!
        if_nc           andn    FLAGS,#cFlag

Bill, there are instructions called MUXZ, MUXNZ, MUXC, and MUXNC that will write a flag or its complement to any number of bits in a destination. Here's how to reduce the above code:

                    MUXZ    FLAGS,#zFlag
                    MUXC    FLAGS,#cFlag

I think you'll need to use MUXNZ for the Z flag so that when later TESTed, it restores properly.

Bill Henning
11-11-2006, 03:17 PM
That is VERY interesting!

Might have interesting implications for code density where speed is not as important!

I see no reason why there cannot be a forth-kernel variant.

I can see my large model idea spawning all kinds of special purpose kernels.

Personally, I'd love to see a floating point package implemented as FCALLable routines, with FCACHE'd segments for the actual work; and I could even see a special floating point kernel that implemented the following primitives:

FTOI
ITOF
FADD
FSUB
FMUL
FDIV
FMOD

More complex math functions could then be built on them and FCALL'ed at high speed.

IEEE 32 bit math please :)





Bill Henning
11-11-2006, 03:48 PM
By the way everyone... in case anyone is worried.

Let me be VERY clear... I am NOT looking to "submarine patent" anyone.

I do expect to be credited for my ideas; and would expect some kind of arrangement for anyone wanting to do closed source commercial work that they then distribute (internal use does not count)

As long as I am properly credited, I hereby very publicly commit not to go after any open source use of these ideas.

Parallax is further VERY welcome to incorporate these ideas in SPIN and its environment; but I do ask for appropriate credit, which I am certain they would have provided even if I did not ask :)

This also covers the pre-emptive multi-tasking kernel, and anything else I post to this thread related to large model on the propeller.

Post Edited (Bill Henning) : 11/11/2006 8:53:04 AM GMT

Bill Henning
11-11-2006, 11:32 PM
After sleeping on it... a few thoughts...

1) There will be a SECOND multi-tasking kernel. This (slower) one will store the process table in HUB memory, in order to make cog memory locations $180-$1DF available for use for a loadable library or virtual peripheral code.

This will slow down context switching, but will allow for a "pool" of cogs to load and run the next ready thread, and an effectively unlimited number of threads system wide.

2) The kernel will choose between cog and hub stack during initialization, and patch FCALL and FRET respectively. The "spare" code can live in the FCACHE area and be overwritten

3) The "single tasking" kernel can become multi-tasking by FCALLing the FSTART system call

4) The floating point math library I proposed - only FADD / FSUB / FMUL / FDIV / FREM should be primitives (and they should fit in the library area); ITOF, FTOI, ATOF, FTOA, SIN, COS, TAN and friends should be FCALL'able library routines.

Now I'm off to work for a startup downtown, I won't be able to check the forum until I get home around 7pm Pacific....

I owe I owe... off to work I go!

Lawson
11-12-2006, 12:07 AM
Actually you can make a self-incrementing address pointer in a cog. Just use the timers plus one pin. Set up counter A to divide the system clock by the right amount (1/16 usually) and have it output on "the pin". Counter B can then be set up to increment by some number once for every low-high transition of "the pin" (say +4 every time "the pin" transitions low-high). Now reading counter B's PHS register (using a source field read) will provide a pointer to hub RAM that auto-increments independent of the current cog code.

Now, I can see some issues with this. First, it needs a pin to link the two counters (a crying shame the counters don't have an input clock divider!). Second, the pointer auto-increments independent of the cog code, so it would be prone to losing synchronization and causing some "interesting" bugs.
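
The two-counter idea can be sketched as an idealised Python simulation. The model is an assumption, not a description of the real hardware timing: counter A is treated as an accumulator whose top bit drives the pin, counter B adds a step on each rising edge, and any latency in the edge detector is ignored.

```python
# Idealised model of the two-counter trick (edge-detect latency ignored).
MASK32 = 0xFFFFFFFF

def simulate(clocks, divide=16, step=4, base=0):
    """Return counter B's PHS value after `clocks` system clocks.

    Counter A: PHSA += FRQA each clock, pin = PHSA bit 31.
    Counter B: PHSB += step on each low-to-high transition of the pin.
    """
    frqa = (1 << 32) // divide
    phsa, phsb, pin = 0, base, 0
    for _ in range(clocks):
        phsa = (phsa + frqa) & MASK32
        new_pin = phsa >> 31
        if new_pin and not pin:
            phsb = (phsb + step) & MASK32
        pin = new_pin
    return phsb

print(simulate(64))          # four rising edges in 64 clocks -> 16
```

The model also shows the synchronization concern: the pointer advances with the clock whether or not the cog code has consumed the previous value.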

my two cents,
Marty

Cliff L. Biffle
11-12-2006, 04:47 AM
Bill, I've spoken to you about this in PM, but you keep banging this drum.

These ideas are so close to STC/DTC that a patent would not stand anyway. You've been posting about a message a day along the lines of "I won't patent this, but please if you do this in commercial software give me credit!"

Bill, I've been doing this in commercial software for nearly 15 years. Your code is clever and I'm really pleased with what you've done, but please, let's focus on the code and quit the ego-stroking.

potatohead
11-12-2006, 05:41 AM
Holy Cow!

I'm offline for a coupla days and see this! Great work guys!

Coupla things with regard to IP:

-I would not GPL this. Doing so would essentially require the same thing of derivative works, unless they were separated into distinct elements. That's a big hassle with few returns, IMHO. If anything, do a BSD style where credits are part of the story, but other licenses can be attached to finished programs. That way, derivative works are not mandated to be open, Parallax is free to incorporate the work done here into its supporting software, credit is given where it makes sense, and ongoing development of this framework can continue in an open fashion by whoever is interested or motivated by application need to do so.

-In this nasty IP environment, I don't think some steady and frank conversation about these matters is out of line. There is always somebody...

The next version of the Propeller will benefit greatly from the work done on this thread, I'm sure.

(Goes back to work through the code posted here...)

Post Edited (potatohead) : 11/11/2006 10:45:47 PM GMT

cgracey
11-12-2006, 07:15 AM
Cliff,

Bill's just excited and enthusiastic about engineering. No need to knock him. We've all been zealous at times, and hopefully we will be often. I'm sure that no matter how cool of ideas any of us get, in the long run we'll just be happy to have had them, and we'll be enriched if we have shared them, which is Bill's overriding interest here. It's true that inspiration comes to many people, even for the same things. This is what our patent system is in conflict with. Under the mind-warping paranoia it induces, Bill probably felt compelled to bring it up. He has some valuable ideas that he wants to share and refine with the forum, and such concerns would definitely cross my mind, too.


Cliff L. Biffle said...
Bill, I've spoken to you about this in PM, but you keep banging this drum.

These ideas are so close to STC/DTC that a patent would not stand anyway. You've been posting about a message a day along the lines of "I won't patent this, but please if you do this in commercial software give me credit!"

Bill, I've been doing this in commercial software for nearly 15 years. Your code is clever and I'm really pleased with what you've done, but please, let's focus on the code and quit the ego-stroking.



Post Edited (Chip Gracey (Parallax)) : 11/12/2006 12:36:39 AM GMT

Cliff L. Biffle
11-12-2006, 09:03 AM
I actually thought I deleted that message; I re-sent it to Bill in PM. It probably wasn't appropriate for a public forum, and I'd delete it now if Chip hadn't quoted it. (Chip, if you get the inclination, un-quote it and delete my copy; Bill has one of his own.)

Ah, well. Let's keep hacking and quit our (my) sniping.

paulmac
11-12-2006, 10:02 AM
Chip Gracey (Parallax) said...


The nop could be gotten rid of in fcache by post-fixing the destination address:

Here is a way to make it ~33% faster by adding 4 instructions:

I love doing stuff like this!




This is getting a bit like Perl (http://www.perl.org/) golf (http://en.wikipedia.org/wiki/Perl_Golf)! :)

Bill Henning
11-12-2006, 12:16 PM
Cliff,

I responded to the PM, and true, I did not appreciate the public criticism, but I will respond in public as well, without acrimony.

I have been writing and designing software since 1982, and in that time I've had a number of unpleasant experiences.

You are almost certainly correct that these ideas are close to STC/DTC - I have not had time to research that, as I just got home from work.

However, they are definitely unique on the propeller, and pure software FCACHE is pretty damned unique (as far as I know)

Given that I have not applied for a patent, and that I deliberately published SO NO ONE ELSE COULD, your distrust of my intention is misplaced.

The fact is, I reserve the right to patent this should I need to in a defensive manner, but I have ALREADY publicly posted, committing not to go after FOSS developers / users / pets / etc. (sorry, a bit of sarcasm slipped out). In the unlikely event I feel I have to patent it (why on earth would I want to go through the time, expense, or headaches unless it was defensive?) I'd still keep my earlier promises.

While I trust Chip, Mike, Phil, you, and every other decent person, to credit as appropriate, NOT EVERYONE is like that! (Personal experiences tell me this, even in university situations where citations are supposedly mandatory!)

By posting that I expect credit and some agreed compensation (that I would share with contributors on an agreed upon basis) for CLOSED SOURCE COMMERCIAL USE, I have established the "ground rules" if you will - so if ACME markets a "whizbang widget" that uses these techniques and doesn't credit me or try to arrange something, I have the option of trying to do something about it, as I clearly and publicly indicated the "rules".

Having to be like this sucks. Having to waste forum bandwidth, and misunderstandings, sucks even more.

I hope I have not scared even ONE FOSS developer off - that is not my intent.

Since I decided to free it, I want credit, which all decent people would provide without asking. I do not believe that to be wrong, and I am just trying to make sure that the less than decent people take a pause.

And while there ARE certain similarities with Forth engines (btw I have coded several token threaded interpreters, and will look into the alternate methods you've mentioned, they seem intriguing), I'm quite certain there is no other large memory model for a limited memory multiprocessor system with shared memory that provides for large model features at a very small run time penalty by program controlled caching and with the latest version provides a fully pre-emptive multitasking pico kernel for a processor with only 512 addressable words!

I'd bet that before I posted no one seriously considered the propeller capable of supporting a large code space memory model for a non-interpreted / threaded language running at almost native speed with multitasking!

Anyway, enough apparent tooting of my horn. I would not have posted this except for your posting.

Believe it or not, I actually respect your concerns.

I would not have made the posting you were objecting to except for someone expressing concerns about my intentions and your past PMs; I was actually trying to defuse the situation and get more people involved!


Cliff L. Biffle said...
Bill, I've spoken to you about this in PM, but you keep banging this drum.

These ideas are so close to STC/DTC that a patent would not stand anyway. You've been posting about a message a day along the lines of "I won't patent this, but please if you do this in commercial software give me credit!"

Bill, I've been doing this in commercial software for nearly 15 years. Your code is clever and I'm really pleased with what you've done, but please, let's focus on the code and quit the ego-stroking.

Post Edited (Bill Henning) : 11/12/2006 5:30:46 AM GMT

Bill Henning
11-12-2006, 12:18 PM
Thanks Chip.

You understand me EXACTLY!


Chip Gracey (Parallax) said...

Cliff,

Bill's just excited and enthusiastic about engineering. No need to knock him. We've all been zealous at times, and hopefully we will be often. I'm sure that no matter how cool of ideas any of us get, in the long run we'll just be happy to have had them, and we'll be enriched if we have shared them, which is Bill's overriding interest here. It's true that inspiration comes to many people, even for the same things. This is what our patent system is in conflict with. Under the mind-warping paranoia it induces, Bill probably felt compelled to bring it up. He has some valuable ideas that he wants to share and refine with the forum, and such concerns would definitely cross my mind, too.


Cliff L. Biffle said...
Bill, I've spoken to you about this in PM, but you keep banging this drum.

These ideas are so close to STC/DTC that a patent would not stand anyway. You've been posting about a message a day along the lines of "I won't patent this, but please if you do this in commercial software give me credit!"

Bill, I've been doing this in commercial software for nearly 15 years. Your code is clever and I'm really pleased with what you've done, but please, let's focus on the code and quit the ego-stroking.

Bill Henning
11-12-2006, 01:10 PM
Ok, after considering it, I cannot think of an easier way of passing arguments to FCALL'ed functions than a good old bog standard method - they follow the call instruction!

Library calls are to expect the following:

FCALL
@some_lib_routine
arg1/@arg1
arg2/@arg2
...
argN/@argN

Library functions may use non-long arguments, BUT they have to pad their argument list to a long boundary, and modify the PC by the appropriate number of bytes so that the return works... the easiest way is to pop the return address and set the corrected PC manually.

PLEASE NOTE:

Even though it "wastes" memory, I want all pointers to be 32 bits long, with unused upper bits. It's only a matter of time until Chip et al present us with a Propeller with more memory, and I'd like existing code to work. We have a chance to do it right, people.

EXCEPTION:

Cure large model address threaded interpreters can use sixteen bit pointers if they play nice and leave PC word aligned; they willingly accept memory limitations for tighter coding.

I'm also thinking that we need a shared library standard; probably just a set of jump vectors like at the start of the kernel.

Given how FCACHE will speed up complex calls, it may make sense for shared library calls to use something like the following:

FLIBCALL
[16 bit lib vector pointer][16 bit "function ID" specifying which fn to call]
[arg list, like the optional one for FCALL]

Comments?
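
The FLIBCALL descriptor proposed above can be sketched as a single packed long. This is only an illustration, written in Python rather than Propeller assembly. The post does not say which half of the long holds which field, so placing the library vector pointer in the upper 16 bits is an assumption, and pack_flibcall / unpack_flibcall are hypothetical names.

```python
def pack_flibcall(lib_vector, function_id):
    """Pack a 16-bit library vector pointer and a 16-bit function ID into one long.

    Assumption: the vector pointer occupies the upper half of the long.
    """
    if not 0 <= lib_vector <= 0xFFFF:
        raise ValueError("library vector pointer must fit in 16 bits")
    if not 0 <= function_id <= 0xFFFF:
        raise ValueError("function ID must fit in 16 bits")
    return (lib_vector << 16) | function_id


def unpack_flibcall(word):
    """Split the packed long back into (lib_vector, function_id)."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF


# Example: library vector table at hub address $0010, function number 3
descriptor = pack_flibcall(0x0010, 3)
```

Because both fields are 16 bits, a scheme like this caps library addresses at 64KB, which is the trade-off against the 32 bit pointers asked for above.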

Mike Green
11-12-2006, 01:41 PM
Bill,
1) I assume that the called routine will access the arguments indirectly through PC. Easy thing to do for non-long parameters is to exit to a routine that adds 3, then masks with !3 to round up to the nearest long word boundary, then falls through to f_next. A lot of routines will still compute their parameters and some will use a combination of techniques.
2) Think carefully about how you would do libraries. We have no memory manager for HUB RAM and currently all this code is absolute (or most of it ... we do have relative jumps). You did incorporate a BP value. Do we want to adjust all long addresses with this (FJUMP and FCALL) so we have completely relocatable code?
Mike
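
The rounding step in point 1 (add 3, then mask with !3) can be sketched as follows. This is a Python model of the arithmetic only, not kernel code; align_to_long and skip_inline_args are hypothetical names.

```python
LONG_SIZE = 4  # bytes per Propeller long


def align_to_long(pc):
    """Round a byte address up to the next long boundary: add 3, clear the low two bits."""
    return (pc + (LONG_SIZE - 1)) & ~(LONG_SIZE - 1)


def skip_inline_args(pc, arg_bytes):
    """Return the PC to resume at after consuming arg_bytes of inline arguments."""
    return align_to_long(pc + arg_bytes)


# A routine that read 5 bytes of arguments starting at $0200 resumes at $0208.
resume_pc = skip_inline_args(0x0200, 5)
```

An address that is already long aligned is left unchanged, so routines with long-only arguments can exit through the same path.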

Bill Henning
11-12-2006, 01:58 PM
1) excellent suggestion! That is now the standard :) Another entry for credits.txt

2) Actually I was considering FJMP/FCALL relative to PC, and a separate LIBCALL library:16, function:16, with library being a hub address indirectly pointing to the library (I'd prefer 32 bit pointer here for supporting more hub ram in the future) with a function entry number. Yes, a lot of indirection, but gives us relocatable libraries that could be demand loaded...

BP was for potential support for local variables on the stack or a heap... I did not think it totally thru yet which would be the best approach, frankly I don't like the run time penalty, but it would make true nested procedure calls with local variables available. I might leave that to byte code languages tho.

Bill Henning
11-13-2006, 01:45 AM
Status update - I have a single/multi threaded cog kernel image compiling, but it needs a few more tweaks, specifically I have to finish the system calls I am writing to start/stop threads on the calling cog.

By default a new cog started with the kernel comes up in single-tasking mode, without any penalty for potentially being multi-threaded (ok, about 30 longs of cog memory penalty), but large model programs running on the kernel will be able to create, delete, pause and resume threads on the fly by calling system library routines. I am also moving the stack back into the hub memory, but will leave Mike's excellent cog based stack routines in there, commented out, for people who prefer stacks in the cog. I am also considering moving the process table to hub memory, to allow for more tasks per cog and more free cog memory.

Yes, I know I am using "threads" and "tasks" interchangeably, because at this point, the kernel would allow them to be used as either tightly coupled threads working on the same image, or totally separate tasks. If I move the process table to hub memory, I can not only allow far more tasks, but the tasks can freely migrate from cog to cog... I could do dynamic load management.

For example, if you start a bunch of tasks distributed on five cogs because you are running a keyboard and a two-cog VGA display, and you wanted to start a *second* two-cog display (that you did not have free cogs for), you would not have to kill any tasks, because the tasks running on the two cogs pre-empted for the additional VGA display would be redistributed to the remaining cogs, which would simply become somewhat slower due to running more threads.
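
A rough sketch of the redistribution described in that example, in Python. The thread does not specify a placement policy, so the "move each displaced task to the least loaded remaining cog" rule here is purely an assumption, and redistribute is a hypothetical name.

```python
def redistribute(assignments, freed_cogs):
    """Move tasks off the cogs in freed_cogs onto the remaining cogs.

    assignments maps cog number -> list of task names.
    Assumed policy: each displaced task goes to the remaining cog that
    currently runs the fewest tasks (ties go to the lowest cog number).
    """
    remaining = {cog: list(tasks) for cog, tasks in assignments.items()
                 if cog not in freed_cogs}
    if not remaining:
        raise ValueError("no cogs left to run the displaced tasks")
    displaced = [task for cog in sorted(freed_cogs)
                 for task in assignments.get(cog, [])]
    for task in displaced:
        target = min(remaining, key=lambda cog: (len(remaining[cog]), cog))
        remaining[target].append(task)
    return remaining


# Five cogs running tasks; cogs 3 and 4 are taken over by a second display.
before = {0: ["a", "b"], 1: ["c"], 2: ["d"], 3: ["e", "f"], 4: ["g"]}
after = redistribute(before, {3, 4})
```

No task is killed; the three remaining cogs simply end up with more tasks each.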

I won't have much more time to work on this today, but will resume late tonite and tomorrow.

There are many more features coming...

Post Edited (Bill Henning) : 11/12/2006 6:50:02 PM GMT

Bill Henning
11-13-2006, 01:46 AM
By the way, if there is interest in it, I can actually support a Unix style 'fork()' system call, with the same semantics.

Tracy Allen
11-13-2006, 12:35 PM
Bill said, <<I'd bet that before I posted no one seriously considered the propeller capable of supporting a large code space memory model for a non-interpreted / threaded language running at almost native speed with multitasking!>>.

Wow, away from this for a week, and look what transpires!

It's taken a couple of hours to study through the primitives posted in this thread, from the central idea and on through things like Mike's clever JMPRET mechanism for reading in data longs. I'm still trying to absorb the potential of multitasking, with tasks queued to cogs within this framework. An education. You guys are real professionals.

There was discussion of how tightly written is the Spin interpreter. There is in many cases a one to one correspondence between Spin instructions and Propasm instructions. At first I thought that the IDE might compile something like we are talking about here. For example, compile a waitcnt() directly to one propasm instruction, and then at run time the interpreter would simply read that in and execute it in place. But that's not the way it works, rather, it uses byte code, or tokens, to build up the parameter list and the command, right? I'm not real clear on that. And that is why it takes the speed hit.

However this new model does work with direct read and execute. There is a new set of rules that will have to be made very clear in order to avoid spectacular bugs. Code loaded into the cache at native speed also has the native flexibility, while single instructions read and executed one at a time from the HUB have to follow tighter rules for preparing the source and destination. But it is all opening up a whole new vista of possibilities, that is for sure.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Tracy Allen
www.emesystems.com (http://www.emesystems.com)

Bill Henning
11-14-2006, 10:20 AM
Hi Tracy,

Thanks :) however it amazes me what code density Chip gets out of Spin; the speed penalty is not an issue for a lot of tasks, so I am not surprised he went for a densely encoded byte code, it was a good decision.

You are correct, the rules for code 'run' out of hub memory have to be fairly strict.

Generally, the limitations of what instructions may be directly executed boil down to:

- only jumps to cog based low jump vectors are allowed
- no JMPRET except to Mike's FLOAD (also means no CALL)
- no TJxx instructions
- no DJNZ instructions

Therefore for other jumps, FJUMP must be used; or the 'add pc,#offset' / 'sub pc,#offset' trick, which of course may be conditional.
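
The 'add pc,#offset' trick can be illustrated with a toy model of the fetch loop from the first post (rdlong, add pc,#4, execute). This is a Python sketch, not kernel code: the tuple "instructions" and the run function are hypothetical stand-ins. The point it shows is that the PC has already been advanced when the fetched instruction executes, so the offset is counted from the following instruction.

```python
def run(program, base=0, max_steps=100):
    """Execute a list of toy instructions, one per hub long, starting at byte address base."""
    pc = base
    trace = []
    for _ in range(max_steps):
        op = program[(pc - base) // 4]  # models: rdlong instr, pc
        pc += 4                         # models: add pc, #4
        if op[0] == "halt":
            break
        if op[0] == "emit":
            trace.append(op[1])
        elif op[0] == "add_pc":         # forward relative branch
            pc += op[1]
        elif op[0] == "sub_pc":         # backward relative branch
            pc -= op[1]
    return trace


program = [
    ("emit", "a"),        # address 0
    ("add_pc", 4),        # address 4: pc is already 8 here, so it becomes 12
    ("emit", "skipped"),  # address 8: never executed
    ("emit", "b"),        # address 12
    ("halt",),            # address 16
]
trace = run(program)
```

By the same arithmetic, a 'sub pc,#8' branches back to the instruction immediately before itself.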

I will almost certainly change FJUMP to be relative to the task's base address, ditto for FCALL; however, FSYSTEM will call hub based library code at absolute addresses, probably through jump vectors so that libraries too can be relocated. This simple change paves the way for swapping tasks in/out, relocating tasks, etc.

I did not have much time in the last couple of days to post here or code, but I got quite a bit of design done, and figured out some thorny issues such as how will one cog ask another to start a new task.

By moving the task control blocks to hub memory the arbitrary limit of 8 tasks per cog is removed; and the latest code / specification allows for task migration from cog to cog.

I will keep this thread updated with new information as it develops :) tonight I am taking time to clean up my workbench so I can wire up a propeller board or two this week.

Bill

Post Edited (Bill Henning) : 11/14/2006 3:25:37 AM GMT

Mike Green
11-14-2006, 11:58 AM
Just as a side comment ... I used JMPRET because it is a jump, allows a destination register with the assembler, and doesn't do anything with the destination that can't be ignored (for the FLOAD). I would have used a JMP instruction, but there's no way to get the assembler to stick a destination register in. If you're directly generating binary instructions, I'd use the JMP.

Bill Henning
11-16-2006, 01:14 PM
Thanks Mike. I got the new propeller tool today that supports Chip's new ORGX directive that makes FCACHE'd blocks easier :)

AndreL
11-19-2006, 03:39 AM
Hopefully Chip can integrate this stuff (or a variant) into the tool. Over a year ago, we came up with very similar techniques; in fact, when I did my seminar at Parallax I described the technique. About 30 minutes after we got the first Prop chips 15 months ago, we figured out how to stream large programs with caches in ASM; we had to, to make games. But, of course, like you we had to fight the tool and do things manually. In the end we settled on doing 256-instruction pre-caching of code blocks and found that running that way, rather than an instruction at a time, was the best. So all this talk of patenting etc. worries me: we did this before anyone even saw a Prop publicly, and I even talked about it to 25 people around the world and showed them the demos that used it. So let's all agree that this is common sense stuff to anyone that does compilers and VMs, and no one should try to patent anything :)

Anyway, adding macros to the assembler would make all this and other techniques a lot easier to deal with.

Andre'

Bill Henning
11-19-2006, 11:03 AM
LOL Andre,

Don't worry! I was thinking defensive... ie if someone patents it, to prevent them. I had no intention to patent it unless I had to stop some Big Bad Company from claiming it to stop us from using it or to make us pay to use it. I did want some credit :) and it looks like people are willing to credit me :)

I considered your approach, of pre-caching a fixed number of instructions (I was thinking of 128, with another 128 for caching library code) - that approach is really more like paging or overlays, but when disclosing what I was working on, I was disclosing what I thought would work best with the compiler I am thinking of writing (oops. I guess that's another cat out of the bag); which is why I came up with the variable sized FCACHE'd blocks.

YES, I really want macros! And conditional assembly! And dare I say it... a linker!

By the way, the Hydra looks like an amazing piece of work, can't wait to play with it :)

Best,

Bill

AndreL
11-19-2006, 11:51 AM
Right, just sitting in a loop and streaming instructions is good for fast execution, but not for REALLY fast ASM code and/or compiled code generated from a compiler, while loading in chunks and then executing is what you need for best speed, especially if there is a lot of cache coherence, so you don't need flushes very often. Anyone that does any ASM games for the Prop would use either technique out of need depending on what they are doing, and that's what we did: try different variants, see what is worth the time to get working, etc. The most important thing is for people to just devise compilers for the Prop and self hosted systems like our toy BASIC and the much more advanced FORTH, so people can get more work done without knowing the details. Of course then there is the issue of global variable access in ASM etc. If I had more time, the first thing I would do is create a VERY powerful macro assembler that did all the memory accessing and streaming via macros and code block types. THEN use that to develop languages. Then we don't have to rely on any other tools; the macro assembler just generates a Prop image and that's that.


Andre'

Paul Baker
11-19-2006, 02:05 PM
Bill, I wouldn't worry too much. I personally know the examiners that would examine such a patent application; it's my old unit. The concept is close enough to user directed pre-caching that anyone would have to narrow their claims down so much to get around the pre-existing art. And the examiners would know to apply that body of art to severely narrow the scope of the claimed invention. The only chance of it getting through is if the applicant appealed the decision to the board.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker (mailto:pbaker@parallax.com)
Propeller Applications Engineer
Parallax, Inc. (http://www.parallax.com)

Post Edited (Paul Baker (Parallax)) : 11/19/2006 7:10:35 AM GMT

Bill Henning
11-19-2006, 02:37 PM
Hi Paul,

I'm not worried; I was just responding to AndreL's concern :)

Bill

Loopy Byteloose
11-19-2006, 07:03 PM
As far as recognition and protection of intellectual property, can't we just refer to these as the Bill Henning Propeller Primitives? Seems that would perpetuate recognition of who came up with them. [kind of like Ohm's Law, etc.]
It seems to me that if something has someone's name on it, false patent claims are far less likely to be successful.

I am looking forward to eventually using these.
For now, I am taking a leisure route to Propeller studies as I just cannot keep up with all of it.

Seems the world is going toward lots of parallel processing.


▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
"If you want more fiber, eat the package. Not enough? Eat the manual."

Tropical regards, G. Herzog [ 黃鶴 ] in Taiwan

Bill Henning
11-20-2006, 12:17 AM
Thanks Kramer,

But I am not asking for THAT much recognition :) Mostly I was seeking credits in source code, any compiler documentation, or any explanation of the methods.

Now I'm going to wire up another propeller board ...

Best,

Bill

M. K. Borri
11-21-2006, 07:14 AM
So... there would be an option to load the spin interpreter, or the large memory model assy "interpreter"? Sounds neat... especially if you want the Prop used in education, as this emulates a more conventional micro -- I can't get the chairman of EE here to look at the Prop in that sense because of it (never mind that I think most micros in the future will be multicore, but hey).

Bill Henning
12-05-2006, 01:17 PM
Hi M.K.,

I am currently looking into building a simple compiler to generate large model code; I've just been too busy the last couple of weeks to do much with it - other than finalizing the process control tables layout and the memory manager data structures.

I was also thrown into a tizzy contemplating what I could do on the future 8 cog / 256KB / 8 cycle hub access -or- 16 cog / 128KB / 16 cycle hub access 160MIP cog future propellers...

M. K. Borri
12-06-2006, 07:56 AM
Why don't you patent it and sell the patent to Parallax for, idunno, lawyer costs plus a movie ticket? That way everyone that needs to be happy is happy... OK, except the lawyer who gets more money than he should, but we have to live with that I guess.

Bill Henning
12-06-2006, 11:55 AM
Why would I spend time on that now? Basically, I deliberately published the idea here to establish "prior art" so no one else can patent it (reasonably) :)

Now I just want to build cool stuff :)

M. K. Borri
12-07-2006, 07:35 AM
I did that re: a car mp3 player in 1999 and got shafted anyway.

Bill Henning
01-11-2007, 05:00 PM
Status update:

- I've started testing my new large model assembler for the propeller; I hope to release a beta test version in one to two weeks.

Why did I write yet another assembler?

- I wanted an assembler designed specifically for large model programs

- I wanted the following additional features:

- conditional assembly (IF / ELSE / ENDIF)
- nested include files
- macros (label MACRO arg_1,..,arg_n / ENDM)
- HORG (hub memory ORG)
- CORG (alias for ORG, cog memory org)
- listing files
- symbol tables for an external linker / loader

I've also defined some heap management primitives (malloc,free), as well as the task control blocks for the threads in the cogs that will load my multi-tasking pico kernel. I am currently considering adding C-style stdin/stdout/stderr streams to the task control blocks.

I'll publish a more detailed hub memory map later, when I've finalized it; in general I am trying to keep it Spin compatible (I hope to have the ability to have some cogs run the spin interpreter) however I need to figure out how to start a spin interpreter in a cog, and how to limit what range of memory it will try to use.

Basic hub memory layout:

$0000-$000F: boot loader / spin initialization area
$0010-$01FF: reserved for shared kernel data
$0200-$05FF: large model 1KB default kernel image, modifies itself for single/multi-threaded depending on task control block pointed to by par
$0600-$09FF: 1KB buffer space (any task can request to use it via a semaphore) (also loaded into cogs when default kernel loaded)
$0C00-topmem: code / heap space
$topmem-$7FFF: task control blocks, they grow down from end of RAM.

I expect the memory footprint to stay under 3k with a small number of threads, leaving 29k for code :)
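
The 3k / 29k figures can be checked against the layout above. The Python sketch below only restates the listed addresses; note that the post does not assign $0A00-$0BFF to anything, so that range is left out here, and the 29k is a ceiling before task control blocks (which grow down from the end of RAM) are subtracted. The variable names are hypothetical.

```python
HUB_SIZE = 0x8000  # 32KB of hub RAM

# (name, first byte, last byte) as listed in the post
regions = [
    ("boot loader / spin initialization", 0x0000, 0x000F),
    ("shared kernel data",                0x0010, 0x01FF),
    ("default kernel image",              0x0200, 0x05FF),
    ("buffer space",                      0x0600, 0x09FF),
]
CODE_START = 0x0C00  # start of code / heap space

# Size in bytes of each listed region (both end addresses are inclusive)
sizes = {name: last - first + 1 for name, first, last in regions}

reserved_kb = CODE_START // 1024               # everything below the code area
code_kb = (HUB_SIZE - CODE_START) // 1024      # upper bound on code / heap space
```

Both the kernel image and the buffer come out at exactly 1KB, matching the descriptions in the layout.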

Anyway, I'll keep plugging away :) sorry it's going so slowly, but I am currently overworked.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com (http://www.mikronauts.com) - a new blog about microcontrollers

Ym2413a
01-11-2007, 11:33 PM
Wow, This is cool!
29k code space could do a lot!

mahjongg
03-04-2007, 11:01 PM
This is important information, and it's now a bit "invisible".

When it's all ironed out and working well, I think this topic should be made "sticky", so other users know there is an alternative to using either SPIN or the 512 word COG machine language.

Mahjongg

Bill Henning
03-04-2007, 11:36 PM
Hi Mahjongg,

I've been too busy with consulting work to update it recently - but I will start updating again next week.

Here is a sneak peek at some of the changes:

HUB MEMORY:

0000-000F Initialization area, I'd leave it as defined (I hope to eventually run Spin programs under this scheme)
0010-00FF Reserved for OS mail boxes, pointers to library bases, etc
0100-01FF 32 mailboxes of eight bytes each
0200-03FF I/O buffer 0 - 512 bytes
0400-0BFF I/O buffers 1..4, also used as loading area for cog kernel images
0C00-77FF available memory, managed by primitives imaginatively called 'malloc' & 'free' or 'kmalloc' and 'kfree'
7800-7BFF process control blocks
7C00-7FFF system library

(on the next propeller with 128k the PCB's and system library move to the end of ram, and I'll be greedy and double their size)

The system library implements:

- block SD driver, simple read/write/erase 512 byte blocks, NO FS layer support here
- I2C eeprom support on the default pins, simple read/write/erase 512 byte blocks
- serial I/O over the default pins, simple setbaud (just saves it in a hub location)/read/write
- SPAWN, KILL, MALLOC and FREE
- minimal string library (strcpy, strcmp, atoi, itoa)
- minimal memory library (memcpy, memcmp, memset)
- minimal 32 bit IEEE floating point library (FADD, FSUB, FMUL, FDIV, FREM, FTOI, ITOF, and if it fits, FSIN, FCOS, FTAN, FLOGN, FEXP, FTOA, ATOF)

(I realize that I am very optimistic as to how much can be squeezed into the 1KB system library area; its size may have to be increased, or the floating point library could be dropped from being part of the base system library)

The mailboxes are for message passing; processes will communicate via simple messages.
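
Given the layout above (32 mailboxes of eight bytes each at 0100-01FF), the hub address of a mailbox follows directly from its index. A small Python sketch of that arithmetic; mailbox_address is a hypothetical name, and the internal layout of the eight bytes is not specified in the post.

```python
MAILBOX_BASE = 0x0100   # first mailbox, from the hub memory map above
MAILBOX_SIZE = 8        # bytes per mailbox
MAILBOX_COUNT = 32


def mailbox_address(index):
    """Return the hub byte address of mailbox number index (0..31)."""
    if not 0 <= index < MAILBOX_COUNT:
        raise IndexError("mailbox index out of range")
    return MAILBOX_BASE + index * MAILBOX_SIZE


# The last byte of the last mailbox lands on $01FF, the end of the region.
last_byte = mailbox_address(MAILBOX_COUNT - 1) + MAILBOX_SIZE - 1
```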

The SPAWN routine loads 2KB (512 longs) into IO buffers 1..4 and then COGNEW's it. In order to simplify code (and thus keep it small), SPAWN can only load from block (512 byte) aligned starting locations from either I2C or SD card, and without any file system - just a starting block offset.

KILL can just kill a thread or stop a whole cog.

MALLOC and FREE are obvious; each block of memory has two word pointers at the front of it. The first one points to the next block, and the second points to the process table entry that owns it. When FREEing a block, adjacent free blocks will automatically be merged (thus hopefully cutting down on fragmentation and avoiding the need for periodic garbage collection)
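
A minimal Python model of the block scheme just described: a two-word header (next block pointer, owner) and merging of adjacent free blocks on FREE. It is a sketch under stated assumptions, not the actual system library: using owner 0 to mean "free", the first-fit search, the split rule and the Heap class itself are all hypothetical.

```python
HEADER = 4  # two 16-bit words: pointer to next block, pointer to owning process entry
FREE = 0    # assumption: an owner of 0 marks a free block


class Heap:
    def __init__(self, start, end):
        # address of block header -> [address of next block, owner]
        self.blocks = {start: [end, FREE]}

    def malloc(self, size, owner):
        """First-fit allocation; returns the payload address, or None if nothing fits."""
        size = (size + 3) & ~3  # keep payloads long aligned
        for addr in sorted(self.blocks):
            nxt, who = self.blocks[addr]
            room = nxt - addr - HEADER
            if who != FREE or room < size:
                continue
            # Split only if the remainder can hold a header plus one long.
            if room - size >= HEADER + 4:
                split = addr + HEADER + size
                self.blocks[split] = [nxt, FREE]
                nxt = split
            self.blocks[addr] = [nxt, owner]
            return addr + HEADER
        return None

    def free(self, ptr):
        """Mark the block free, then merge every run of adjacent free blocks."""
        self.blocks[ptr - HEADER][1] = FREE
        addr = min(self.blocks)
        while addr in self.blocks:
            nxt, who = self.blocks[addr]
            if who == FREE and nxt in self.blocks and self.blocks[nxt][1] == FREE:
                self.blocks[addr][0] = self.blocks.pop(nxt)[0]  # absorb neighbour
            else:
                addr = nxt


heap = Heap(0x0C00, 0x7800)
a = heap.malloc(16, owner=0x7800)
b = heap.malloc(32, owner=0x7800)
c = heap.malloc(8, owner=0x7800)
heap.free(a)
heap.free(b)   # a and b are adjacent, so they merge into one free block
heap.free(c)   # everything is free again: a single block remains
```

After the three frees the heap is back to one free block spanning the whole region, which is the fragmentation behaviour the merge rule is meant to give.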

Every process will also get an STDIN and STDOUT handle however I am thinking of all processes sharing a single STDERR in order to save memory. I/O handles are just pointers to a mailbox; messages equivalent to open/read/write/ioctl/close will be defined.

The whole system will have an extremely minimalist Unix-like feel, however the intention is to make it a 'nano' sized system - I think micro kernels are too large.

Best,

Bill


Post Edited (Bill Henning) : 3/4/2007 5:07:38 PM GMT

ImageCraft
10-10-2007, 01:57 PM
DCACHE?

Bill, did you mention how DCACHE will be used? If it's for caching HUB memory locations, how do you propose to modify the rd/wr HUB instructions to use the faster DCACHE?

Bill Henning
10-10-2007, 02:10 PM
DCACHE is meant to be used as a data cache; for things like string operations, reading/writing blocks for devices etc under program control; however you are correct, a smart loader could use the dcache area to hold global longs, and patch the RDLONG/WRLONG's in the hub to instead refer to registers in the DCACHE area.

I was also considering a paged array model, where the dcache could be used to hold two 64 element blocks.

FYI, I have a running kernel, finally had time to debug it :) and I did change FCACHE a bit.

OLD WAY, OBSOLETE:

FCACHE
<many longs of code, 0 not a valid long>
0 ' indicates end of block

NEW, CORRECT WAY:

FCACHE
number of longs to load
<longs to load>

Please note that if the fcached block is smaller than 128 longs, after the code you can have pre-initialized variables!
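
The difference between the two FCACHE formats can be sketched in Python. Hub memory is modelled as a list of longs and pc is a byte address pointing at the long after the FCACHE primitive. The function names are hypothetical, the 128 long limit is taken from the remark above, and returning a PC just past the block is an assumption about where hub execution resumes.

```python
FCACHE_LONGS = 128  # cache size implied by the "smaller than 128 longs" remark


def load_fcache_zero_terminated(hub, pc):
    """Old format: copy longs until a 0 long is found. Returns (block, next pc)."""
    block = []
    while hub[pc // 4] != 0:
        block.append(hub[pc // 4])
        pc += 4
    return block, pc + 4  # step over the terminating 0


def load_fcache_counted(hub, pc):
    """New format: the first long is the number of longs to load."""
    count = hub[pc // 4]
    if count > FCACHE_LONGS:
        raise ValueError("block does not fit in the cache area")
    first = pc // 4 + 1
    return hub[first:first + count], pc + 4 * (1 + count)


old_hub = [0x11, 0x22, 0, 0x33]
old_block, old_pc = load_fcache_zero_terminated(old_hub, 0)

# With a count, a 0 long (for example a zero-initialized variable) is legal.
new_hub = [3, 0x11, 0, 0x22, 0x99]
new_block, new_pc = load_fcache_counted(new_hub, 0)
```

The counted form is what makes pre-initialized variables after the code possible, since a 0 long no longer ends the block.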

How's the assembler & compiler coming?

p.s.

It was EXTREMELY painful to write large model test code in the spin environment... the test program is very...icky.


ImageCraft
12-08-2007, 04:09 PM
BTW, to resurrect this thread (ha ha), let me put on public record that the ImageCraft C will use at the minimum some ideas from Bill Henning's LMM with contributions from Mike Green, Chip etc. We *will* fully document the runtime architecture: register maps, COG RAM usage, entry points etc. At this moment, our LMM will look a bit different from Bill's proposal, mainly in terms of lack of support for the threading, the number of registers and the COG RAM usage. There will also be different kernel routines, to support C. Also, experience has shown that most C loops are not that bad, so the FCACHE will probably be smaller, to give more space to the internal stack.

Anyway, I'd like to thank and acknowledge Bill Henning publicly for the LMM idea. It's reasonably obvious once someone discovers it, but certainly without such a mechanism, Propeller C would not be of much use.

Now all we have to do is to finish the compiler :)

CardboardGuru
12-08-2007, 05:03 PM
ImageCraft said...
It's reasonably obvious once someone discovers it


The very best ideas usually are.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Help to build the Propeller wiki - propeller.wikispaces.com (http://propeller.wikispaces.com)
Play Defender - Propeller version of the classic game (http://forums.parallax.com/showthread.php?p=685888)
Prop Room Robotics - my web store for Roomba spare parts (http://www.proproomrobotics.co.uk) in the UK

J. A. Streich
12-09-2007, 08:30 AM
I want in, this is cool! I think I'd like to write a shell for the new system once it's finalized.. I need to learn the instruction set for the Prop's ASM, but I'll need to do that anyway for my next project.

I would like to suggest a trap and interrupt system. I know that with the other Cogs such a system isn't strictly needed, but it would be very interesting to see the kernel pass along the interrupts to the client programs -- perhaps that is part of what the mail box is for? With all the different applications of the chip what causes interrupts and what doesn't would be hard to figure out.

MJB
05-16-2012, 10:02 AM
such a pity - the thread ends here and no follow up, no pointers to any outcome

especially for people like me - new to propeller
working through the forum and then - dead end

looks like there is a whole lot of activity in 2006/2007 and then ... end ...

any results from this great work here?
products that build on this?

further reading?
links?

thanks - a very enthusiastic propeller newbie
MJB

pik33
05-16-2012, 11:10 AM
Search and you can find; there is a big subforum about gcc on Propeller, it uses LMM

http://forums.parallax.com/forumdisplay.php?91-Propeller-GCC-Alpha-Test-Forum

Rayman
05-16-2012, 11:50 AM
This thread does deserve a better ending!

Bill Henning may have single handedly shaped the future of Propeller programming!

Prop1 C codes like Catalina, GCC, and the other one all make use of it to allow programs of virtually unlimited size and also run faster than SPIN.

Prop2 is going to rely very heavily on this mode as probably a greater number of people will be using C with it...

Bean
05-16-2012, 01:12 PM
I would like to acknowledge that PropBasic uses Bill Henning's LMM idea (with my own twist) to generate LMM code.

Thanks Bill for your efforts...

Bean