ANNOUNCING: Large memory model for Propeller assembly language programs!

potatohead · 2006-11-11 22:41

Holy Cow!

I'm offline for a coupla days and see this! Great work guys!

Coupla things with regard to IP:

-I would not GPL this. Doing so would essentially require the same thing of derivative works, unless they were separated into distinct elements. That's a big hassle with few returns, IMHO. If anything, do a BSD style where credits are part of the story, but other licenses can be attached to finished programs. That way, derivative works are not mandated to be open, Parallax is free to incorporate the work done here into its supporting software, credit is given where it makes sense, and ongoing development of this framework can continue in an open fashion by whoever is interested or motivated by application need to do so.

-In this nasty IP environment, I don't think some steady and frank conversation about these matters is out of line. There is always somebody...

The next version of the Propeller will benefit greatly from the work done on this thread, I'm sure.

(Goes back to work through the code posted here...)

Post Edited (potatohead) : 11/11/2006 10:45:47 PM GMT

cgracey · 2006-11-12 00:15

Cliff,

Bill's just excited and enthusiastic about engineering. No need to knock him. We've all been zealous at times, and hopefully we will be often. I'm sure that no matter how cool of ideas any of us get, in the long run we'll just be happy to have had them, and we'll be enriched if we have shared them, which is Bill's overriding interest here. It's true that inspiration comes to many people, even for the same things. This is what our patent system is in conflict with. Under the mind-warping paranoia it induces, Bill probably felt compelled to bring it up. He has some valuable ideas that he wants to share and refine with the forum, and such concerns would definitely cross my mind, too.

Cliff L. Biffle said...
Bill, I've spoken to you about this in PM, but you keep banging this drum.

These ideas are so close to STC/DTC that a patent would not stand anyway. You've been posting about a message a day along the lines of "I won't patent this, but please if you do this in commercial software give me credit!"

Bill, I've been doing this in commercial software for nearly 15 years. Your code is clever and I'm really pleased with what you've done, but please, let's focus on the code and quit the ego-stroking.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Post Edited (Chip Gracey (Parallax)) : 11/12/2006 12:36:39 AM GMT

Cliff L. Biffle · 2006-11-12 02:03

I actually thought I deleted that message; I re-sent it to Bill in PM. It probably wasn't appropriate for a public forum, and I'd delete it now if Chip hadn't quoted it. (Chip, if you get the inclination, un-quote it and delete my copy; Bill has one of his own.)

Ah, well. Let's keep hacking and quit our (my) sniping.

paulmac · 2006-11-12 03:02

Chip Gracey (Parallax) said...

The nop could be gotten rid of in fcache by post-fixing the destination address:

Here is a way to make it ~33% faster by adding 4 instructions:

I love doing stuff like this!

This is getting a bit like Perl golf!

Bill Henning · 2006-11-12 05:16

Cliff,

I responded to the PM, and true, I did not appreciate the public criticism, but I will respond in public as well, without acrimony.

I have been writing and designing software since 1982, and in that time I've had a number of unpleasant experiences.

You are almost certainly correct that these ideas are close to STC/DTC - I have not had time to research that, as I just got home from work.

However, they are definitely unique on the propeller, and pure software FCACHE is pretty damned unique (as far as I know)

Given that I have not applied for a patent, and that I deliberately published SO NO ONE ELSE COULD, your distrust of my intention is misplaced.

The fact is, I reserve the right to patent this should I need to in a defensive manner, but I have ALREADY publicly posted committing not to go after FOSS developers / users / pets / etc. (sorry, a bit of sarcasm slipped out) In the unlikely event I feel I have to patent it (why on earth would I want to go through the time, expense, or headaches unless it was defensive?) I'd still keep my earlier promises.

While I trust Chip, Mike, Phil, you, and every other decent person, to credit as appropriate, NOT EVERYONE is like that! (Personal experiences tell me this, even in university situations where citations are supposedly mandatory!)

By posting that I expect credit and some agreed compensation (that I would share with contributors on an agreed upon basis) for CLOSED SOURCE COMMERCIAL USE, I have established the "ground rules" if you will - so if ACME markets a "whizbang widget" that uses these techniques and doesn't give credit or try to arrange something, I have the option of trying to do something about it, as I clearly and publicly indicated the "rules".

Having to be like this sucks. Having to waste forum bandwidth, and misunderstandings, sucks even more.

I hope I have not scared even ONE FOSS developer off - that is not my intent.

Since I decided to free it, I want credit, which all decent people would provide without asking. I do not believe that to be wrong, and I am just trying to make sure that the less than decent people take a pause.

And while there ARE certain similarities with Forth engines (btw I have coded several token threaded interpreters, and will look into the alternate methods you've mentioned, they seem intriguing), I'm quite certain there is no other large memory model for a limited memory multiprocessor system with shared memory that provides for large model features at a very small run time penalty by program controlled caching and with the latest version provides a fully pre-emptive multitasking pico kernel for a processor with only 512 addressable words!

I'd bet that before I posted no one seriously considered the propeller capable of supporting a large code space memory model for a non-interpreted / threaded language running at almost native speed with multitasking!

Anyway, enough apparent tooting of my horn. I would not have posted this except for your posting.

Believe it or not, I actually respect your concerns.

I would not have made the posting you were objecting to except for someone expressing concerns about my intentions and your past PMs. I was actually trying to defuse the situation and get more people involved!

Post Edited (Bill Henning) : 11/12/2006 5:30:46 AM GMT

Bill Henning · 2006-11-12 05:18

Thanks Chip.

You understand me EXACTLY!

Bill Henning · 2006-11-12 06:10

Ok, after considering it, I cannot think of an easier way to pass arguments to FCALL'ed functions than a good old bog standard method - they follow the call instruction!

Library calls are to expect the following:

FCALL
@some_lib_routine
arg1/@arg1
arg2/@arg2
...
argN/@argn

Library functions may use non-long arguments, BUT they have to pad their argument list to a long boundary, and modify the PC by the appropriate number of bytes so that the return works... the easiest way is to pop the return address and set the corrected PC manually.
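To make the convention concrete, here is a minimal Python model (not Propeller assembly) of arguments that follow the call. Hub memory is modelled as a list of longs and the PC as an index into it. The `FCALL`/`HALT` markers, the `add_two` routine, and the routine table are hypothetical stand-ins for illustration, not the actual kernel.

```python
# Toy model of "arguments follow the call instruction".
# Hub memory is a list of longs; pc is an index into it (counted in longs).
# FCALL, HALT and the routine table are illustrative stand-ins, not the real kernel.

FCALL = "FCALL"
HALT = "HALT"

def add_two(hub, pc):
    """Library routine taking two inline long arguments.

    Reads its arguments through pc, then returns the corrected pc
    (just past the argument list) so execution resumes in the right place.
    """
    a, b = hub[pc], hub[pc + 1]
    return a + b, pc + 2

def run(hub, routines):
    pc, results = 0, []
    while hub[pc] != HALT:
        assert hub[pc] == FCALL, "this toy only understands FCALL"
        target = routines[hub[pc + 1]]     # the long after FCALL names the routine
        value, pc = target(hub, pc + 2)    # the callee consumes its own arguments
        results.append(value)
    return results

hub = [FCALL, "add_two", 3, 4,
       FCALL, "add_two", 10, 20,
       HALT]
results = run(hub, {"add_two": add_two})
print(results)
```

The point of the sketch is only that the callee, not the caller, is responsible for stepping the PC past its argument list before returning.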

PLEASE NOTE:

Even though it "wastes" memory, I want all pointers to be 32 bits long, with unused upper bits. It's only a matter of time until Chip et al present us with a propeller with more memory and I'd like existing code to work. We have a chance to do it right, people.

EXCEPTION:

Cure large model address threaded interpreters can use sixteen bit pointers if they play nice and leave PC word aligned; they willingly accept memory limitations for tighter coding.

I'm also thinking that we need a shared library standard; probably just a set of jump vectors like at the start of the kernel.

Given how FCACHE will speed up complex calls, it may make sense for shared library calls to use something like the following:

FLIBCALL
[16 bit lib vector pointer][16 bit "function ID" specifying which fn to call]
[arg list like optional one for FCALL]

Comments?
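As a rough illustration of the proposed FLIBCALL operand, the Python sketch below packs a 16 bit library vector pointer and a 16 bit function ID into one long and unpacks it again. Which field sits in the upper half is an assumption made here, and `pack_libcall` / `unpack_libcall` are hypothetical names.

```python
# Hypothetical packing of the FLIBCALL operand long.
# Assumption: library vector pointer in the upper 16 bits, function ID in the lower 16.

def pack_libcall(lib_vector, function_id):
    """Pack two 16-bit fields into one 32-bit long."""
    if not (0 <= lib_vector <= 0xFFFF and 0 <= function_id <= 0xFFFF):
        raise ValueError("both fields must fit in 16 bits")
    return (lib_vector << 16) | function_id

def unpack_libcall(word):
    """Split the operand long back into (lib_vector, function_id)."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF

operand = pack_libcall(0x0C00, 3)
print(hex(operand), unpack_libcall(operand))
```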

Mike Green · 2006-11-12 06:41

Bill,
1) I assume that the called routine will access the arguments indirectly through PC. Easy thing to do for non-long parameters is to exit to a routine that adds 3, then masks with !3 to round up to the nearest long word boundary, then falls through to f_next. A lot of routines will still compute their parameters and some will use a combination of techniques.
2) Think carefully about how you would do libraries. We have no memory manager for HUB RAM and currently all this code is absolute (or most of it ... we do have relative jumps). You did incorporate a BP value. Do we want to adjust all long addresses with this (FJUMP and FCALL) so we have completely relocatable code?
Mike
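The round-up step in point 1 is the usual "add 3, mask off the low two bits" idiom. A small Python sketch of the arithmetic follows; Python's `~3` plays the role of the `!3` mask, and the function names are hypothetical.

```python
# Round a byte address up to the next long (4-byte) boundary.

def round_up_to_long(pc):
    """Add 3, then clear the two low bits."""
    return (pc + 3) & ~3

def skip_byte_args(pc, nbytes):
    """Advance pc past nbytes of inline arguments, then realign to a long."""
    return round_up_to_long(pc + nbytes)

for address in (0x100, 0x101, 0x103, 0x104):
    print(hex(address), "->", hex(round_up_to_long(address)))
```

An address that is already long aligned is left unchanged, so the exit routine can be used unconditionally.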

Bill Henning · 2006-11-12 06:58

1) excellent suggestion! That is now the standard

Another entry for credits.txt

2) Actually I was considering FJMP/FCALL relative to PC, and a separate LIBCALL library:16, function:16, with library being a hub address indirectly pointing to the library (I'd prefer 32 bit pointer here for supporting more hub ram in the future) with a function entry number. Yes, a lot of indirection, but gives us relocatable libraries that could be demand loaded...

BP was for potential support for local variables on the stack or a heap... I did not think it totally thru yet which would be the best approach, frankly I don't like the run time penalty, but it would make true nested procedure calls with local variables available. I might leave that to byte code languages tho.
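The double indirection described in point 2 can be sketched in Python as below. This is only a model under stated assumptions: the hub is a dict keyed by byte address, each library starts with a table of entry offsets relative to its own base, and `LIB_PTR`, `load_library` and `libcall_target` are hypothetical names.

```python
# Toy model of a relocatable library reached through two levels of indirection.
# Assumption: each library begins with a vector table of entry offsets (one long each),
# relative to the library's own base address.

def load_library(hub, base, entry_offsets):
    """Place a library's vector table at `base` in the hub model."""
    for index, offset in enumerate(entry_offsets):
        hub[base + 4 * index] = offset

def libcall_target(hub, lib_ptr, function_number):
    """Resolve (library pointer, function number) to an absolute address."""
    base = hub[lib_ptr]                              # where the library lives right now
    entry_offset = hub[base + 4 * function_number]   # per-function vector entry
    return base + entry_offset

hub = {}
LIB_PTR = 0x0010                  # hypothetical slot holding the library's base

load_library(hub, 0x1000, [0x20, 0x48])
hub[LIB_PTR] = 0x1000
first = libcall_target(hub, LIB_PTR, 1)

# "Relocate" the library: only the pointer slot changes, callers stay the same.
load_library(hub, 0x3000, [0x20, 0x48])
hub[LIB_PTR] = 0x3000
second = libcall_target(hub, LIB_PTR, 1)

print(hex(first), hex(second))
```

The calling code keeps using the same pointer slot and function number before and after the move, which is what would make demand loading possible.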

Bill Henning · 2006-11-12 18:45

Status update - I have a single/multi threaded cog kernel image compiling, but it needs a few more tweaks, specifically I have to finish the system calls I am writing to start/stop threads on the calling cog.

By default a new cog started with the kernel comes up in single-tasking mode, without any penalty for potentially being multi-threaded (ok, about 30 longs of cog memory penalty) but large model programs running on the kernel will be able to create, delete, pause and resume threads on the fly by calling system library routines. I am also moving the stack back into the hub memory, but will leave Mike's excellent cog based stack routines in there, but commented out, for people who prefer stacks in the cog. I am also considering moving the process table to hub memory, to allow for more tasks per cog and more free cog memory.

Yes, I know I am using "threads" and "tasks" interchangeably, because at this point, the kernel would allow them to be used as either tightly coupled threads working on the same image, or totally separate tasks. If I move the process table to hub memory, I can not only allow far more tasks, but the tasks can freely migrate from cog to cog... I could do dynamic load management.

For example, if you start a bunch of tasks distributed on five cogs because you are running a keyboard and a two-cog VGA display, and you then want to start a *second* two-cog display (that you did not have free cogs for), you would not have to kill any tasks: the tasks running on the two cogs pre-empted for the additional VGA display would be redistributed to the remaining cogs, which would simply become somewhat slower due to running more threads.
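A rough Python sketch of that redistribution follows. The placement policy (always hand a displaced task to the least-loaded remaining cog) is an assumption made for illustration; the post does not say how the kernel would choose. `redistribute` and the task names are hypothetical.

```python
# Toy model of moving tasks off cogs that are being taken over by a driver.
# Policy assumption: each displaced task goes to the least-loaded remaining cog.

def redistribute(assignments, freed_cogs):
    """Return a new cog -> task-list mapping with freed_cogs emptied out."""
    remaining = {cog: list(tasks)
                 for cog, tasks in assignments.items()
                 if cog not in freed_cogs}
    if not remaining:
        raise ValueError("no cogs left to run the displaced tasks")
    displaced = [task
                 for cog in sorted(freed_cogs)
                 for task in assignments.get(cog, [])]
    for task in displaced:
        # Ties are broken by the lowest cog number.
        target = min(remaining, key=lambda cog: (len(remaining[cog]), cog))
        remaining[target].append(task)
    return remaining

assignments = {
    0: ["t0a", "t0b"],
    1: ["t1a", "t1b"],
    2: ["t2a", "t2b"],
    3: ["t3a", "t3b"],
    4: ["t4a", "t4b"],
}
after = redistribute(assignments, {3, 4})
loads = {cog: len(tasks) for cog, tasks in after.items()}
print(loads)
```

No task is dropped; the three remaining cogs simply carry more threads each.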

I won't have much more time to work on this today, but will resume late tonite and tomorrow.

There are many more features coming...

Post Edited (Bill Henning) : 11/12/2006 6:50:02 PM GMT

Bill Henning · 2006-11-12 18:46

By the way, if there is interest in it, I can actually support a Unix style 'fork()' system call, with the same semantics.

Tracy Allen · 2006-11-13 05:35

Bill said, <<I'd bet that before I posted no one seriously considered the propeller capable of supporting a large code space memory model for a non-interpreted / threaded language running at almost native speed with multitasking!>>.

Wow, away from this for a week, and look what transpires!

It's taken a couple of hours to study through the primitives posted in this thread, from the central idea and on through things like Mike's clever JMPRET mechanism for reading in data longs. I'm still trying to absorb the potential of multitasking, with tasks queued to cogs within this framework. An education. You guys are real professionals.

There was discussion of how tightly written is the Spin interpreter. There is in many cases a one to one correspondence between Spin instructions and Propasm instructions. At first I thought that the IDE might compile something like we are talking about here. For example, compile a waitcnt() directly to one propasm instruction, and then at run time the interpreter would simply read that in and execute it in place. But that's not the way it works, rather, it uses byte code, or tokens, to build up the parameter list and the command, right? I'm not real clear on that. And that is why it takes the speed hit.

However this new model does work with direct read and execute. There is a new set of rules that will have to be made very clear in order to avoid spectacular bugs. Code loaded into the cache at native speed also has the native flexibility, while single instructions read and executed one at a time from the HUB have to follow tighter rules for preparing the source and destination. But it is all opening up a whole new vista of possibilities, that is for sure.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Tracy Allen
www.emesystems.com

Bill Henning · 2006-11-14 03:20

Hi Tracy,

Thanks :) however it amazes me what code density Chip gets out of Spin; the speed penalty is not an issue for a lot of tasks, so I am not surprised he went for a densely encoded byte code, it was a good decision.

You are correct, the rules for code 'run' out of hub memory have to be fairly strict.

Generally, the limitations of what instructions may be directly executed boil down to:

- only jumps to cog based low jump vectors are allowed
- no JMPRET except to Mike's FLOAD (also means no CALL)
- no TJxx instructions
- no DJNZ instructions

Therefore for other jumps, FJUMP must be used; or the 'add pc,#offset' / 'sub pc,#offset' trick, which of course may be conditional.
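The 'add pc,#offset' / 'sub pc,#offset' trick can be illustrated with a toy fetch-and-execute loop in Python. This is a sketch, not the kernel: instructions are tuples, the PC counts instructions rather than bytes, and the Z flag is updated after every non-PC operation (real Propeller instructions only write flags when asked to). All names are hypothetical.

```python
# Toy "one instruction at a time" executor where the PC is an ordinary register,
# so a branch is just a (possibly conditional) add or sub on "pc".

def run(program, regs, max_steps=1000):
    regs["pc"] = 0
    regs["z"] = False
    steps = 0
    while regs["pc"] < len(program) and steps < max_steps:
        op, dst, imm, cond = program[regs["pc"]]
        regs["pc"] += 1                    # the loop advances pc before executing
        steps += 1
        if cond == "if_nz" and regs["z"]:  # condition not met: acts as a no-op
            continue
        if op == "add":
            regs[dst] += imm
        elif op == "sub":
            regs[dst] -= imm
        else:
            raise ValueError("unknown op: " + op)
        if dst != "pc":
            regs["z"] = (regs[dst] == 0)   # simplified flag update
    return regs

# A countdown loop written without DJNZ: decrement, then conditionally move pc back.
program = [
    ("add", "total", 5, "always"),
    ("sub", "count", 1, "always"),
    ("sub", "pc", 3, "if_nz"),   # pc is already 3 here, so this lands back on 0
]
regs = run(program, {"total": 0, "count": 3})
print(regs["total"], regs["count"], regs["pc"])
```

Because the PC has already been advanced when the instruction executes, the backward offset is measured from the instruction after the branch.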

I will almost certainly change FJUMP to be relative to the task's base address, ditto for FCALL; however, FSYSTEM will call hub based library code at absolute addresses, probably through jump vectors so that libraries too can be relocated. These simple changes pave the way for swapping tasks in/out, relocating tasks, etc.

I did not have much time in the last couple of days to post here or code, but I got quite a bit of design done, and figured out some thorny issues such as how will one cog ask another to start a new task.

By moving the task control blocks to hub memory the arbitrary limit of 8 tasks per cog is removed; and the latest code / specification allows for task migration from cog to cog.

I will keep this thread updated with new information as it develops :) tonight I am taking time to clean up my workbench so I can wire up a propeller board or two this week.

Bill

Post Edited (Bill Henning) : 11/14/2006 3:25:37 AM GMT

Mike Green · 2006-11-14 04:58

Just as a side comment ... I used JMPRET because it is a jump, allows a destination register with the assembler, and doesn't do anything with the destination that can't be ignored (for the FLOAD). I would have used a JMP instruction, but there's no way to get the assembler to stick a destination register in. If you're directly generating binary instructions, I'd use the JMP.

Bill Henning · 2006-11-16 06:14

Thanks Mike. I got the new propeller tool today that supports Chip's new ORGX directive that makes FCACHE'd blocks easier :)

AndreL · 2006-11-18 20:39

Hopefully Chip can integrate this stuff (or a variant) into the tool. Over a year ago, we came up with very similar techniques, in fact when I did my seminar at parallax I described the technique since about 30 mins after we got the first prop chips 15 months ago, we figured out how to stream large programs with caches in ASM, we had to, to make games. But, of course like you we had to fight the tool and do things manually. In the end we settled on doing 256 instruction pre-caching of code blocks and found that running that way, rather than an instruction at a time was the best. So all this talk of patenting etc. worries me, we did this before anyone even saw a prop publicly, and I even talked about it to 25 people around the world and showed them the demos that used it. So let's all agree that this is common sense stuff to anyone that does compilers and VMs, and no one should try to patent anything :)

Anyway, adding macros to the assembler would make all this and other techniques a lot easier to deal with.

Andre'

Bill Henning · 2006-11-19 04:03

LOL Andre,

Don't worry! I was thinking defensive... ie if someone patents it, to prevent them. I had no intention to patent it unless I had to stop some Big Bad Company from claiming it to stop us from using it or to make us pay to use it. I did want some credit :) and it looks like people are willing to credit me :)

I considered your approach, of pre-caching a fixed number of instructions (I was thinking of 128, with another 128 for caching library code) - that approach is really more like paging or overlays, but when disclosing what I was working on, I was disclosing what I thought would work best with the compiler I am thinking of writing (oops. I guess that's another cat out of the bag); which is why I came up with the variable sized FCACHE'd blocks.

YES, I really want macros! And conditional assembly! And dare I say it... a linker!

By the way, the Hydra looks like an amazing piece of work, can't wait to play with it :)

Best,

Bill

AndreL · 2006-11-19 04:51

Right, just sitting in a loop and streaming instructions is good for fast execution, but not REALLY fast ASM code and/or compiled code generated from a compiler, while loading in chunks and then executing is what you need for best speed, especially if there is a lot of cache coherence you don't need flushes very often. Anyone that does any ASM games for the prop would use either technique out of need depending on what they are doing and that's what we did, try different variants, see what is worth the time to get working etc. The most important thing is for people to just devise compilers for the prop and self hosted systems like our toy BASIC and the much more advanced FORTH, so people can get more work done without knowing the details. Of course then there is the issue of global variable access in ASM etc. If I had more time, the first thing I would do is create a VERY powerful macro assembler that did all the memory accessing and streaming via macros and code block types. THEN use that to develop languages. Then we don't have to rely on any other tools, the macro assembler just generates a prop image and that's that.
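The remark about coherence can be illustrated with a small Python model of fixed-size block pre-caching. It keeps a single resident block and counts how often a new one must be copied in. The 256-instruction block size comes from the thread; the trace format and `count_block_loads` are hypothetical, and the model ignores actual cycle costs.

```python
# Toy model of fixed-size block pre-caching with one resident block.
# Addresses are counted in instructions, not bytes.

BLOCK_SIZE = 256

def count_block_loads(trace, block_size=BLOCK_SIZE):
    """Count how many times a new block must be loaded for an address trace."""
    loads = 0
    resident = None
    for address in trace:
        block = address // block_size
        if block != resident:
            loads += 1
            resident = block
    return loads

# A 100-instruction loop run 10 times, placed entirely inside one block...
inside_one_block = list(range(0, 100)) * 10
# ...and the same size loop placed so that it straddles a block boundary.
straddles_boundary = list(range(200, 300)) * 10

loads_inside = count_block_loads(inside_one_block)
loads_straddling = count_block_loads(straddles_boundary)
print(loads_inside, loads_straddling)
```

In this model the same loop costs one load or twenty depending only on where it sits relative to a block boundary, which is one argument for letting a compiler choose the cached block boundaries.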

Andre'

Paul Baker · 2006-11-19 07:05

Bill, I wouldn't worry too much. I personally know the examiners that would examine such a patent application, it's my old unit. The concept is close enough to user directed pre-caching that anyone would have to narrow their claims down so much to get around the pre-existing art. And the examiners would know to apply that body of art to severely narrow the scope of the claimed invention. The only chance of it getting through is if the applicant appealed the decision to the board.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer

Parallax, Inc.

Post Edited (Paul Baker (Parallax)) : 11/19/2006 7:10:35 AM GMT

Bill Henning · 2006-11-19 07:37

Hi Paul,

I'm not worried; I was just responding to AndreL's concern :)

Bill

LoopyByteloose · 2006-11-19 12:03

As far as recognition and protection of intellectual property, can't we just refer to these as the Bill Henning Propeller Primitives? Seems that would perpetuate recognition of who came up with them. [kinda like Ohm's Law, etc.]
It seems to me that if something has someone's name on it, false patent claims are far more unlikely to be successful.

I am looking forward to eventually using these.
For now, I am taking a leisure route to Propeller studies as I just cannot keep up with all of it.

Seems the world is going toward lots of parallel processing.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
"If you want more fiber, eat the package. Not enough? Eat the manual."

Tropical regards, G. Herzog [ 黃鶴 ] in Taiwan

Bill Henning · 2006-11-19 17:17

Thanks Kramer,

But I am not asking for THAT much recognition :) Mostly I was seeking credits in source code, any compiler documentation, or any explanation of the methods.

Now I'm going to wire up another propeller board ...

Best,

Bill

M. K. Borri · 2006-11-21 00:14

So... there would be an option to load the spin interpreter, or the large memory model assy "interpreter"? Sounds neat... especially if you want the Prop used in education as this emulates a more conventional micro -- I can't get the chairman of EE here to look at the Prop in that sense because of it (nevermind that I think most micros in the future will be multicore but hey).

Bill Henning · 2006-12-05 06:17

Hi M.K.,

I am currently looking into building a simple compiler to generate large model code; I've just been too busy last couple of weeks to do much with it - other than finalizing the process control tables layout and the memory manager data structures.

I was also thrown into a tizzy contemplating what I could do on the future 8 cog / 256KB / 8 cycle hub access -or- 16 cog / 128KB / 16 cycle hub access 160MIP cog future propellers...

M. K. Borri · 2006-12-06 00:56

Why don't you patent it and sell the patent to Parallax for, idunno, lawyer costs plus a movie ticket? That way everyone that needs to be happy is happy... OK, except the lawyer who gets more money than he should, but we have to live with that I guess.

Bill Henning · 2006-12-06 04:55

Why would I spend time on that now? Basically, I deliberately published the idea here to establish "prior art" so no one else can patent it (reasonably) :)

Now I just want to build cool stuff :)

M. K. Borri · 2006-12-07 00:35

I did that re: a car mp3 player in 1999 and got shafted anyway.

Bill Henning · 2007-01-11 10:00

Status update:

- I've started testing my new large model assembler for the propeller; I hope to release a beta test version in one to two weeks.

Why did I write yet another assembler?

- I wanted an assembler designed specifically for large model programs

- I wanted the following additional features:

- conditional assembly (IF / ELSE / ENDIF)
- nested include files
- macros (label MACRO arg_1,..,arg_n / ENDM)
- HORG (hub memory ORG)
- CORG (alias for ORG, cog memory org)
- listing files
- symbol tables for an external linker / loader

I've also defined some heap management primitives (malloc,free), as well as the task control blocks for the threads in the cogs that will load my multi-tasking pico kernel. I am currently considering adding C-style stdin/stdout/stderr streams to the task control blocks.

I'll publish a more detailed hub memory map later, when I've finalized it; in general I am trying to keep it Spin compatible (I hope to have the ability to have some cogs run the spin interpreter) however I need to figure out how to start a spin interpreter in a cog, and how to limit what range of memory it will try to use.

Basic hub memory layout:

$0000-$000F: boot loader / spin initialization area
$0010-$01FF: reserved for shared kernel data
$0200-$05FF: large model 1KB default kernel image, modifies itself for single/multi-threaded depending on task control block pointed to by par
$0600-$09FF: 1KB buffer space (any task can request to use it via a semaphore) (also loaded into cogs when default kernel loaded)
$0C00-topmem: code / heap space
$topmem-$7FFF: task control blocks, they grow down from end of RAM.

I expect the memory footprint to stay under 3k with a small number of threads, leaving 29k for code :)

Anyway, I'll keep plugging away :) sorry it's going so slowly, but I am currently overworked.
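The sizes implied by that layout can be checked with a few lines of Python. The region table below copies the fixed ranges from the listing; the range between $09FF and $0C00 is not described in the post, so it is simply left out here. The 32KB hub size follows from the $7FFF end address, and the variable names are hypothetical.

```python
# Sizes implied by the posted hub memory layout (addresses are inclusive).

REGIONS = [
    (0x0000, 0x000F, "boot loader / spin initialization area"),
    (0x0010, 0x01FF, "reserved for shared kernel data"),
    (0x0200, 0x05FF, "default kernel image"),
    (0x0600, 0x09FF, "buffer space"),
]
CODE_START = 0x0C00     # start of code / heap space
HUB_RAM_END = 0x7FFF    # task control blocks grow down from here

def size(start, end):
    """Number of bytes in an inclusive address range."""
    return end - start + 1

sizes = {name: size(start, end) for start, end, name in REGIONS}
below_code = CODE_START                     # everything under $0C00
available = size(CODE_START, HUB_RAM_END)   # code/heap plus task control blocks

for name, nbytes in sizes.items():
    print(f"{nbytes:5d} bytes  {name}")
print(f"{below_code} bytes below code space, {available} bytes from $0C00 up")
```

Note that the 29k above $0C00 is shared between code/heap and the task control blocks that grow down from the top of RAM.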

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com - a new blog about microcontrollers

Ym2413a · 2007-01-11 16:33

Wow, This is cool!
29k code space could do a lot!

mahjongg · 2007-03-04 16:01

This is important information, and it's now a bit "invisible".

When it's all ironed out and working well, I think this topic should be made "sticky", so other users know there is an alternative to using either SPIN or the 512 word COG machine language.

Mahjongg