The instructions being interpreted/executed sometimes need operands, and all of them are in COG memory. Since the use of the cache is indivisible as far as threads are concerned, how about, by convention, putting any working storage at the end of the cache area? Alternatively, we could allocate some small number of longs (say 16) after the end of the jump table?
There also needs to be a primitive to copy the long following the instruction to a specific location (and then skip over that 2nd long too). This allows a long (often a HUB address) to be loaded without a complex multiple-instruction sequence. Let's call it a LOAD. If we use a JMPRET instruction, the destination would be overwritten with the data, so the fact that a useless return address is stored there first doesn't matter. So, we'd use a "JMPRET <dest>,bhLoad" followed by the long value.
f_load      rdlong  f_temp,PC       ' get the data
            sub     PC,#4
            rdlong  f_temp2,PC      ' get the instruction again
            shr     f_temp2,#9
            movd    :loadIt,f_temp2
            add     PC,#8
:loadIt     mov     0-0,f_temp
            jmp     #f_next
Comments? Thoughts?
Post Edited (Mike Green) : 11/11/2006 2:35:46 AM GMT
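For anyone who wants to check the address arithmetic in f_load without a Propeller at hand, here is a minimal Python sketch of the same steps. It is only a model under stated assumptions: hub memory is a dict keyed by byte address, cog RAM is a 512-entry list, and PC already points at the long after the JMPRET when the primitive is entered. The names `fload`, `hub` and `cog` are made up for illustration and do not appear in the kernel.

```python
# Toy model of the f_load primitive: copy the long that follows the
# instruction into the cog register named by the instruction's
# destination field, then step PC past that data long.

DEST_SHIFT = 9        # destination field sits in bits 17..9 of a Propeller instruction
DEST_MASK = 0x1FF     # 9-bit cog address


def fload(hub, cog, pc):
    """Perform the load and return the new PC.

    hub: dict mapping long-aligned byte addresses to 32-bit values (assumed model)
    cog: list of 512 longs standing in for cog RAM
    pc:  byte address of the long FOLLOWING the JMPRET instruction
    """
    data = hub[pc]                               # rdlong f_temp,PC
    instr = hub[pc - 4]                          # sub PC,#4 / rdlong f_temp2,PC
    dest = (instr >> DEST_SHIFT) & DEST_MASK     # shr f_temp2,#9 / movd keeps 9 bits
    cog[dest] = data                             # :loadIt mov 0-0,f_temp
    return pc + 4                                # net effect of sub #4 then add #8


# Example: an instruction at hub $100 whose destination field is $1F0,
# followed by the constant $12345678 at hub $104.
hub = {0x100: 0x1F0 << DEST_SHIFT, 0x104: 0x12345678}
cog = [0] * 512
new_pc = fload(hub, cog, 0x104)
```

The net PC change is +4 from the data long, so execution resumes at the long after the constant.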
Thank you for the enthusiasm, I am now at home, and about to spend the next few hours on this forum
Mike - you've put in an AWFUL lot of thought and work on fleshing out the far model! Thank you!
Chip - thanks for the enthusiasm! And the improved FCACHE!
Phil - Great suggestions! They help a LOT!
Ok, there will be more later, and I want to thank everyone... and I hope to post a LOT more tonight and tomorrow night.
Mike - I like your FLOAD, I think it deserves the next slot in the jump table right after FCACHE. I'm debating the potential merits of a mirroring FSTORE, but I am not sure it's needed.
Much more in a bit, but I need to do some typing off-line to edit it properly before posting...
Brilliant idea using JMPRET for the FLOAD function!
Bill,
I agree: FSTORE isn't needed. FLOAD is only needed to grab a constant larger than $1ff so it can be used in an indirect operand. There's no complementary requirement for storing, since the "large number" is already in a local register. Basically, the FLOADs just replace all those defined LONGs that appear at the end of a regular assembly program.
Chip,
'Just got to thinking. If you hadn't made the writing of condition codes optional, none of this would even be possible! (Or at least it would be a lot more awkward!)
I've been playing with some sample code for the OS. The FLOAD is useful for the occasional value greater than 9 bits. Anything more complicated than that ... seems to call for using FCACHE. Unfortunately, SPIN makes it difficult to reference locations in HUB RAM from DAT code. They all have to be patched up.
I was originally intending to just keep data right at the end of FCACHE'd segments, as obviously the FCACHE'd code will know where it will wind up in the cog; but this is great for regular far code.
To be honest, I don't like 'sub pc,#4' for HALT as there is no way out of it short of restarting a cog; so it may as well be halted so that it won't generate heat.
Phil, thanks for agreeing, no FSTORE.
Now...
I can see we will have several different kernels - that is great, special purpose kernels fit the "philosophy" of the propeller. I will be publishing later tonight the specs for a multi-threaded kernel :-) I already sent Chip a preview earlier today, but I want to clean it up a bit before posting it.
In order to keep all kernels binary compatible, and the idea simple... may I suggest a steering panel?
I'd love to see a panel composed of myself, Mike, Chip and Phil, with others of course welcome to contribute ideas at all times; and I hope people would not think it too presumptuous of me to assume the role of steering it.
Meanwhile, please find attached a somewhat cleaned up, somewhat commented version of 'kernel.spin'
It still needs a start method, but that will be trivial.
Post Edited (Bill Henning) : 11/11/2006 7:20:43 AM GMT
'Just had a thought while in the shower (second best place for ideas!): Since what's coming from this effort really amounts to an emulator, why not include hooks in a debugging version for breakpoints and single-stepping, too? It would have to be able to accept commands from a debugger (probably written in Spin), to signal when events occur, and to dump the machine state back to hub memory on demand. But I think these things could be accomplished rather easily in the general model we've been discussing.
My other ideas (other than threading) include 'virtual peripherals' like on the SX, and user-code-transparent demand paging of code :)
Except that instead of an emulator, this is really more like a VM on a processor without virtualization, as the vast majority of code will run at full speed, and is not interpreted by code, but is actually executed by the normal fetch-decode-execute-store cycles of the cog.
block of large model assembly code follows; if needed throws out part of spin interpreter to use FCACHE functionality to execute the block, then reloads tossed away spin interpreter code to continue
ideally the spin compiler would flag as errors any native cog instructions that break the large model and any FCACHE rules (basically, jumping out of the FCACHE block to anything but legal Fxxx primitives)
Oh gosh, Bill. Do you realize how tightly the Spin interpreter is packed? I doubt there's a single bit left to handle any kind of caching. I'm sure any "inlining" will have to be initiated in a separate cog by a COGNEW.
Common to ALL kernel versions, required for binary compatibility!
000: PC
001: SP
002: @next
003: @fcall
004: @fret
005: @fcache
006: @fload
007: @fsystem
008-00f: reserved
010-07f: kernel primitives (also Mike's local stack grows down from $07f in local stack kernels)
080-0ff: FCACHE code buffer
"Standard" kernel - not multi-threaded, stack can be local or hub based
100-17f: DCACHE - yes, what it looks like, proposed Data cache area! One of the system calls will load it
180-1ef: virtual peripheral area
"Threaded kernel"
100-17f: DCACHE
The next 112 locations, range $180-$1ef are divided up as follows:
008: @yield - prematurely ends a thread's execution time, passes control to next ready thread
Unfortunately, the "real" PC/SP must be moved under the threaded model.
For compatibility, I am thinking of revising the standard spec to place PC/SP/BP/STATUS starting at $180 even
for non-threaded kernels.
Thread local guaranteed local registers for currently executing thread; PC and SP at $0 and $1
180: PC      virtual program counter
181: SP      return stack pointer
182: BP      base pointer for future high level language support
183: STATUS  status register to save Z and C during context switches
184: R0      registers
185: R1
186: R2
187: R3
188: R4
189: R5
18A: R6
18B: R7
followed by eight groups of twelve registers each; each group stores STATUS, BP, PC, SP, R0-R7 for a potential thread (BP is a reserved register, for future two-instruction "local hub" variable access)
the last four locations are scratch locations for FCACHE blocks
Each thread must have its own stack in hub memory
More on how threading works in another message later; I had to revise the design due to issues I found trying to code the context switch routine
Post Edited (Bill Henning) : 11/11/2006 6:51:50 AM GMT
INLINE is to be a compiler directive to allow us to embed non-standard code; ie have SPIN assemble it at the locations where it would end up in sequence in hub memory.
It is NOT intended to inline code into the SPIN cogs!
Think of it like #asm for hub execution friendly code; hopefully it will also emit correct code for FCALL, FCACHE and friends :)
After editing the message above AGAIN, I decided time for a new message.
For compatibility, the threading model is forcing some changes in the standard kernel layout. Even non-threaded kernels are expected to reserve the following range:
1E0-1EF: Currently running context registers; copied to/from here on context switches
1E0: PC
1E1: SP
1E2: BP
1E3: FLAGS
1E4: R0
1E5: R1
1E6: R2
1E7: R3
1E8: R4
1E9: R5
1EA: R6
1EB: R7
1EC: READYTASKS - bitmap of threads that are ready to run
1ED: TIMESLICE - number of "interpreter loop cycles" between forced context switches
1EE: TIMELEFT - "interpreter loop cycles" left for currently running thread
1EF: CURRTASK - pointer to the currently running task's thread context
Yes, this does make this kernel a pre-emptive multitasking kernel.
Threaded kernels also make the following reservations:
The reason for the reverse ordering is that thread contexts grow DOWN from the space reserved for the running context. There is theoretically no reason why more threads cannot be supported; as long as we stay out of the FCACHE area and don't use DCACHE, an additional ten threads are possible, for a maximum of 18 running threads per cog, allowing for a theoretical limit of 8 cogs * 18 threads... 144 threads on one propeller!!!!
Post Edited (Bill Henning) : 11/11/2006 7:49:35 AM GMT
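The thread counts quoted above can be checked with a few lines of arithmetic. The Python sketch below is only a sanity check of the layout as described in this post; it assumes each saved context is 12 longs (PC, SP, BP, FLAGS, R0-R7), that the saved contexts sit directly below the running context at $1E0, and that DCACHE is $100-$17F. The constant names are invented for the example.

```python
# Arithmetic check of the threaded-kernel register layout described above.
CONTEXT_LONGS = 12                      # PC, SP, BP, FLAGS, R0..R7
RUNNING_BASE = 0x1E0                    # running context; $1EC..$1EF hold scheduler state
SAVED_THREADS = 8
DCACHE_START, DCACHE_END = 0x100, 0x17F
COGS = 8

# Saved contexts grow DOWN from the running context.
slots = [RUNNING_BASE - CONTEXT_LONGS * (i + 1) for i in range(SAVED_THREADS)]

# Extra contexts that would fit if the DCACHE area were given up.
dcache_longs = DCACHE_END - DCACHE_START + 1
extra_threads = dcache_longs // CONTEXT_LONGS

threads_per_cog = SAVED_THREADS + extra_threads
total_threads = COGS * threads_per_cog

print([hex(s) for s in slots])
print(threads_per_cog, total_threads)
```

Under these assumptions the first saved context lands at $1D4 and the eighth at $180, which is why the thread table exactly fills $180-$1DF.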
Bill,
I had a look at your "kernel.spin" posting. It's nice to have the documentation of everything that has transpired. It does cause problems with verbosity in integrating the kernel with my existing I2C routines. Can you distill down the statement of who's to get credit to maybe a paragraph that can be included in my source code, yet I can trim out all the other comments so that the Large Memory Model stuff isn't longer than my whole I2C package. Thanks.
Mike
By the way, the comment on the "long f_load" is incorrect now (line 74). Also, relocating the initialization routine to the cache area won't work the way you've done it since the COGINIT instruction loads a block of 512 contiguous longs and the Propeller Tool doesn't pad out the DAT section when it sees the ORGs. Better to leave the initialization code at the end of the "kernel" where it may either be pushed into the cache area as the "kernel" gets bigger or be overwritten by the stack in the non-threaded version.
If you don't mind, why don't you make the change regarding initialization code, put it back where it would get overwritten by your stack, and fix the long_f comment?
Hmm.. good suggestion, based on which I think perhaps the revision log etc should go to a 'largemodel.credits' file.
I think perhaps the YIELD call should come before SYSTEM though. If running a non-threaded kernel, it can just call f_next
I'm hot-and-heavy into writing (but not testing yet) the threading code...
This would, indeed, be a problem. Not only is the interpreter wound way too tight, but it's in ROM, as well. Launching the large-model emulator would have to be done through a COGNEW or COGINIT.
There are some issues with the current compiler, as well, but those are mainly fixable. I think I need to make a mode, maybe triggered through 'ORG -1' where the compiler will quit worrying about cog ram being exceeded by asm instructions. I think this could be realized by 'org -1' causing the cog address to not be incremented anymore, until a normal org appeared, as would be required for a cached block.
It's true that DAT labels are relative to the object they're in -- they're not absolute, so this would need some patching, as someone pointed out. The compiler loses this data during object mixing, and I know this would be a major (but perhaps eventual) undertaking. In the interim, when a large-model emulator is launched, it could be told what the base address of the object is, and then patch at run-time. It would come down to a single addition, but require a 2-4 instruction sequence to realize within the large-model emulator.
To launch the emulator you'd need to convey at least two pointers: one to the start of the code (@virtualasmcode), and one to the start of the object (@@0) so branch addresses could be resolved. A jump address in virtual asm code would be expressed as @datlabel.
For a full-blown, more native approach, Spin would not have to be a consideration. Any language would do.
Am I following this whole thing correctly?
Phil Pilgrim (PhiPi) said...
Oh gosh, Bill. Do you realize how tightly the Spin interpreter is packed? I doubt there's a single bit left to handle any kind of caching. I'm sure any "inlining" will have to be initiated in a separate cog by a COGNEW.
And I think we can use the 'BP' base pointer I introduced to be the object base address in hub memory for large model asm code included in spin blocks... what do you think?
For example, in large model asm code, to refer to a VAR long foo, we would write:
INLINE
mov R0,@foo
add R0,BP
rdlong R0,R0
I propose that PC, SP and BP be set up from the PAR passed to COGNEW when initiating a new large model process
Chip,
It's a little more complicated than that. For example, if you want to have code addresses (for calls, jumps, etc.), the granularity is wrong (longs vs. bytes). All code offsets have to be multiplied by 4. Also, although it's potentially dangerous, the use of a "here" operand is very helpful with relative jumps of a short distance (maybe up to 8 longs in either direction). Sure you can have local labels, but, for very short distances, they're more of a nuisance than a help.
Bill,
How should the initial information be passed? As 3 longs or 3 words? What order?
Can you define the format of the System "instruction"? I'd like to nail down the code for the non-threaded version. I assume there'll be some parameters following the "instruction". How many longs?
I'll try to edit in your changes (like the context registers) tomorrow and post my version (with the changes) then.
Mike
Mike, let me think on those excellent questions a few minutes... thanks for the edits on those, or if you have not made other changes, I can put them in. We should pass it back and forth :)
For now... tada... here is the multi-tasking context switching code!
Right now it assumes all eight threads always run; I will change that tomorrow. I have now used up all the registers between $1E0 and $1EF!
Ok, the end of the unrolled fetch-exec loop used to read:
:inst4 nop
jmp #f_next
In order to have eight threads, this needs to change to the following:
:inst4          nop
                DJNZ    TIMELEFT,#f_next
                mov     TIMELEFT,TIMESLICE      ' reset slice clock for next thread
                ' save current C and Z into FLAGS register
                MUXZ    FLAGS,#zFlag            ' Thanks Chip! Forgot about MUX instructions
                MUXC    FLAGS,#cFlag            ' you saved everyone 8 cycles on every task switch!
                ' go to next context - for now I'm assuming all threads always run
                ' this is proof of concept code; next version will use the last spare
                ' word
                sub     CURRTASK,#12
                cmp     CURRTASK,#$180 wc       ' C set when CURRTASK is below the lowest context
        if_b    mov     CURRTASK,#$1D4          ' go back to top thread
                movs    ctxload,CURRTASK
                movd    ctxload,#PC
                mov     f_temp,#12
ctxload         mov     0-0,0-0
                add     ctxload,src_inc_const
                add     ctxload,dst_inc_const
                DJNZ    f_temp,#ctxload
I propose that PC, SP and BP be set up from the PAR passed to COGNEW when initiating a new large model process
Chip, any comments on the threading?
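To make the round-robin behaviour of the code above easier to follow, here is a small Python model of it. This is a sketch under assumptions, not the kernel: it treats cog RAM as a plain list, uses the $180-$1DF thread table and $1E0 running context from the layout posted earlier, and, like the proof-of-concept code as posted, only copies the incoming context in. The function names and the `state` dict are hypothetical.

```python
# Toy model of the time-slice countdown and round-robin context load.
CONTEXT_LONGS = 12        # longs copied per context
TOP_SLOT = 0x1D4          # highest saved context
LOWEST_SLOT = 0x180       # lowest saved context
RUNNING_BASE = 0x1E0      # where the running context lives


def next_task(currtask):
    currtask -= CONTEXT_LONGS            # sub CURRTASK,#12
    if currtask < LOWEST_SLOT:           # cmp CURRTASK,#$180 wc / if_b
        currtask = TOP_SLOT              # mov CURRTASK,#$1D4
    return currtask


def load_context(cog, currtask):
    # The ctxload loop: copy 12 longs into the running-context registers.
    for i in range(CONTEXT_LONGS):
        cog[RUNNING_BASE + i] = cog[currtask + i]


def step(cog, state):
    """One pass through the end of the fetch-execute loop; True on a switch."""
    state["timeleft"] -= 1               # DJNZ TIMELEFT,#f_next
    if state["timeleft"] != 0:
        return False
    state["timeleft"] = state["timeslice"]
    state["currtask"] = next_task(state["currtask"])
    load_context(cog, state["currtask"])
    return True


# Demo: give every saved context a recognisable PC value, then run 32 loop passes.
cog = [0] * 512
for slot in range(LOWEST_SLOT, RUNNING_BASE, CONTEXT_LONGS):
    cog[slot] = 0x1000 + slot            # fake PC for the thread stored in this slot

state = {"timeslice": 4, "timeleft": 4, "currtask": TOP_SLOT}
visited = []
for _ in range(32):
    if step(cog, state):
        visited.append(state["currtask"])
```

With a time slice of 4, the 32 passes produce eight switches and the scheduler ends up back on the top slot.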
Yes, it seems to me that those three values must be conveyed in a 3-word structure via PAR.
About threading, I like how you reduced the task-specific working registers to R0..R7 so that only they and three other values must be swapped to make a thread switch. You know, though, if a set of threads is pre-written to run on a single cog, they could be hard-coded to use separate register areas. This swapping is only necessary for un-related-at-design-time threads, right? But, that might be the real value of this thing - threads can start, spawn, and stop without any intimate knowledge of each other. With swapping, someone could write a serial driver and someone else could run a few of them on a single cog. You'd get a lot more bang out of a cog that way.
Here's a small bit of code I propose adding to the kernel. It enables Forth-style threaded-code. (This isn't the same as Bill's multi-threaded code. 'Too bad the two terms are so close. It can be confusing.) Anyway, this kind of threaded code is just a list of word-length addresses, each pointing to a location in the large-model codespace, along with a way to enter and exit such lists, as well as to fetch the next address and jump to that location. The inspiration for this comes from similar instructions which were native to the Zilog Super8 microcontroller. These instructions were all it took for the Super8 to support Forth almost natively. This was 20-some years ago and, unfortunately, Zilog didn't know what they had. So far as I know, it was only available in NMOS, and it got hot. The product had an enthusiastic customer base but died from lack of interest on Zilog's part.
In addition to an instruction pointer register (ip) in cog RAM, here's all the code that's necessary (sp is the same as Mike's stkptr):
'Part of initial jump table.
ienter          long    xienter
iexit           long    xiexit
inext           long    xinext
'Working code.
xienter         wrword  ip,sp           'Push IP on the stack.
                add     sp,#2
                mov     ip,pc           'New IP points to next instruction (a threaded-code word)
                jmp     #xinext         'Go get it.
xiexit          sub     sp,#2           'Pop IP from the stack.
                rdword  ip,sp
xinext          rdword  pc,ip           'Point PC to the next address in the threaded list.
                add     ip,#2           'Increment the instruction pointer.
                jmp     #nxt            'Go execute the next threaded-code word.
Here's a sample of some Forth-like procedures written in a mixture of large-model assembly and threaded-code. tos, ttos, and ep are app-specific registers in the scratch area.
'Large-model assembly procedures:
EXIT            jmp     iexit           'Exit from code thread.
PUSH            wrword  tos,ep          'Push stack top onto expression stack.
                add     ep,#2           'Increment expression stack pointer.
                rdword  tos,ip          'Load the new stack top from thread-code stream.
                add     ip,#2           'Increment instruction pointer.
                jmp     inext           'Go do next instruction.
POP             call    #pop_one
                jmp     inext
SWAP            sub     ep,#2           'Point ep at nos (ep normally points past it).
                rdword  ttos,ep         'Get nos into ttos.
                wrword  tos,ep          'Put tos into nos.
                add     ep,#2           'Restore expression stack pointer.
                mov     tos,ttos        'Copy ttos to tos.
                jmp     inext
ADDW            call    #pop_one        'ttos<-tos tos<-nos
                add     tos,ttos        'Replace tos with sum.
                jmp     inext
pop_one         mov     ttos,tos        'Save stack top for possible use in operation.
                sub     ep,#2           'Pop tos from expression stack.
                rdword  tos,ep
pop_one_ret     ret
'Threaded-code procedures:
LFTROT          jmp     ienter
                word    ROT
                word    ROT
                word    EXIT
OVER            jmp     ienter
                word    SWAP
                word    DUP
                word    ROT
                word    SWAP
                word    EXIT
INCHES          jmp     ienter
                word    PUSH
                word    10
                word    MULT
                word    PUSH
                word    254
                word    DIV
                word    EXIT
MILLIMETERS     jmp     ienter
                word    PUSH
                word    254
                word    MULT
                word    PUSH
                word    10
                word    DIV
                word    EXIT
-Phil
Post Edited (Phil Pilgrim (PhiPi)) : 11/11/2006 8:15:37 AM GMT
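For readers new to Forth-style threaded code, the ienter/iexit/inext mechanism above can be mimicked in a few lines of Python. This is a rough model only: Python lists stand in for the word lists in hub memory, an index pair stands in for ip, and the semantics of MULT and DIV (nos times tos, nos divided by tos) are assumptions, since the post does not define them. The class and method names are invented for the sketch.

```python
# Toy inner interpreter: primitives are Python functions, threaded-code
# words are Python lists of other words and inline literals.

class ThreadedVM:
    def __init__(self):
        self.rstack = []     # return stack of saved instruction pointers
        self.estack = []     # expression stack below the cached top
        self.tos = 0         # top of expression stack, kept in a "register"
        self.ip = None       # (thread, index) pair standing in for ip

    def fetch(self):
        # xinext: read the item ip points to and advance ip.
        thread, i = self.ip
        self.ip = (thread, i + 1)
        return thread[i]

    def run(self, thread):
        self.rstack.append(None)             # sentinel: the outermost EXIT stops the loop
        self.ip = (thread, 0)
        while self.ip is not None:
            word = self.fetch()
            if isinstance(word, list):       # ienter: nest into another thread
                self.rstack.append(self.ip)
                self.ip = (word, 0)
            else:
                word(self)                   # primitive: run it directly
        return self.tos


def EXIT(vm):                                # iexit: pop the saved ip
    vm.ip = vm.rstack.pop()


def PUSH(vm):                                # push tos, load literal from the thread
    vm.estack.append(vm.tos)
    vm.tos = vm.fetch()


def MULT(vm):                                # assumed semantics: nos * tos
    vm.tos = vm.estack.pop() * vm.tos


def DIV(vm):                                 # assumed semantics: nos // tos
    vm.tos = vm.estack.pop() // vm.tos


INCHES = [PUSH, 10, MULT, PUSH, 254, DIV, EXIT]
MILLIMETERS = [PUSH, 254, MULT, PUSH, 10, DIV, EXIT]
ROUNDTRIP = [MILLIMETERS, INCHES, EXIT]      # a thread that calls other threads

vm = ThreadedVM()
vm.tos = 10
mm = vm.run(MILLIMETERS)

vm.tos = 10
back = vm.run(ROUNDTRIP)
```

The point of the sketch is the control flow: a primitive ends by going back to the fetch step, while a threaded word first saves ip so that EXIT can resume the caller's list.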
                ' save current C and Z into FLAGS register
        if_z    or      FLAGS,#zFlag
        if_nz   andn    FLAGS,#zFlag
        if_c    or      FLAGS,#cFlag    ' cFlag must be the lowest bit in the word!
        if_nc   andn    FLAGS,#cFlag
Bill, there are instructions called MUXZ, MUXNZ, MUXC, and MUXNC that will write a flag or its complement to any number of bits in a destination. Here's how to reduce the above code:
                MUXZ    FLAGS,#zFlag
                MUXC    FLAGS,#cFlag
I think you'll need to use MUXNZ for the Z flag so that when later TESTed, it restores properly.
Chip Gracey
Parallax, Inc.
Post Edited (Chip Gracey (Parallax)) : 11/11/2006 8:10:17 AM GMT
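The MUXNZ remark can be checked with a small truth-table model. The Python sketch below assumes zFlag is bit 1 and cFlag is bit 0 (only the latter is stated in the thread), that Z is restored with a TEST of the zFlag bit, and that C is restored by shifting bit 0 out; the restore method is an assumption, since no restore code has been posted. The helper names are hypothetical.

```python
# Model of saving Z and C into a FLAGS register with MUX-style writes.
Z_FLAG = 0b10    # assumed bit position for the saved Z flag
C_FLAG = 0b01    # C must be bit 0 so a one-bit shift can restore it


def mux(dest, mask, bit):
    """MUXC/MUXZ-style write: force every masked bit of dest to `bit`."""
    return (dest | mask) if bit else (dest & ~mask)


def save_flags(flags, z, c, complement_z=True):
    # complement_z=True models MUXNZ, False models MUXZ.
    flags = mux(flags, Z_FLAG, (not z) if complement_z else z)
    flags = mux(flags, C_FLAG, c)            # MUXC FLAGS,#cFlag
    return flags


def restore_flags(flags):
    z = (flags & Z_FLAG) == 0                # TEST FLAGS,#zFlag wz: Z set on a zero result
    c = (flags & C_FLAG) != 0                # e.g. shifting bit 0 out into C
    return z, c


cases = [(z, c) for z in (False, True) for c in (False, True)]
muxnz_ok = all(restore_flags(save_flags(0xFFFF0000, z, c)) == (z, c) for z, c in cases)
muxz_ok = all(restore_flags(save_flags(0xFFFF0000, z, c, complement_z=False)) == (z, c)
              for z, c in cases)
```

In this model the MUXNZ variant round-trips all four flag combinations, while the plain MUXZ variant brings Z back inverted when restored via TEST.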
I reduced it to 12 registers including the new FLAGS (to save C and Z) specifically to be able to compile code without having to statically allocate registers; but you are right, thread-aware, hand-built code could make use of the DCACHE area for more local variables.
You are also correct, the point of a thread specific limited context was support for 'classic' threading functions for spawning, killing, suspending etc threads. Theoretically, threads could even be swapped out to FRAM over SPI!
The multi-threaded kernel would even allow for Unix style fork() and friends.
Eventually I want to be able to allow an FCALL to a system library to spawn a new thread on the calling cog!
For example, I am considering adding an event-watching system, so that when say a 'CTS' signal is detected, a call can be made to 'RS232Receive', which then spawns a thread that FCACHE's the actual receive routine :)
OR,
  FCALL SPI_READ   #disk_block
:-)
Which spawns a thread that uses an FCACHE'd block to read a block AT FULL SPI SPEED!
The thread can then die, and the cog goes back to regular threading.
This allows a FULL FUNCTION multi-tasking OS!
Compilers, shells, interpreters... BRING IT ON!
A full feature BASIC!
A C compiler!
Might have interesting implications for code density where speed is not as important!
I see no reason why there cannot be a forth-kernel variant.
I can see my large model idea spawning all kinds of special purpose kernels.
Personally, I'd love to see a floating point package implemented as FCALLable routines, with FCACHE'd segments for the actual work; and I could even see a special floating point kernel that implemented the following primitives:
FTOI
ITOF
FADD
FSUB
FMUL
FDIV
FMOD
More complex math functions could then be built on them and FCALL'ed at high speed.
IEEE 32 bit math please :)
Let me be VERY clear... I am NOT looking to "submarine patent" anyone.
I do expect to be credited for my ideas; and would expect some kind of arrangement for anyone wanting to do closed source commercial work that they then distribute (internal use does not count)
As long as I am properly credited, I hereby very publicly commit not to go after any open source use of these ideas.
Parallax is further VERY welcome to incorporate these ideas in SPIN and its environment; but I do ask for appropriate credit, which I am certain they would have provided even if I did not ask :)
This also covers the pre-emptive multi-tasking kernel, and anything else I post to this thread related to large model on the propeller.
Post Edited (Bill Henning) : 11/11/2006 8:53:04 AM GMT
1) There will be a SECOND multi-tasking kernel. This (slower) one will store the process table in HUB memory, in order to make cog memory locations $180-$1DF available for use for a loadable library or virtual peripheral code.
This will slow down context switching, but will allow for a "pool" of cogs to load and run the next ready thread, and an effectively unlimited number of threads system wide.
2) The kernel will choose between cog and hub stack during initialization, and patch FCALL and FRET accordingly. The "spare" code can live in the FCACHE area and be overwritten.
3) The "single tasking" kernel can become multi-tasking by FCALLing the FSTART system call
4) The floating point math library I proposed - only FADD / FSUB / FMUL / FDIV / FREM should be primitives (and they should fit in the library area); ITOF, FTOI, ATOF, FTOA, SIN, COS, TAN and friends should be FCALL'able library routines.
Now I'm off to work for a startup downtown, I won't be able to check the forum until I get home around 7pm pacific....
Actually, you can make a self-incrementing address pointer in a cog. Just use the timers plus one pin. Set up counter A to divide the system clock by the right amount (1/16 usually) and have it output on "the pin". Counter B can then be set up to increment by some number once for every low-high transition of "the pin" (say +4 every time "the pin" transitions low-high). Now reading counter B's PHS register (using a source field read) will provide a pointer to hub RAM that auto-increments independent of the current cog code.
Now, I can see some issues with this. First, it needs a pin to link the two counters (a crying shame the counters don't have an input clock divider!). Second, the pointer auto-increments independent of the cog code, so it would be prone to losing synchronization and causing some "interesting" bugs.
Bill, I've spoken to you about this in PM, but you keep banging this drum.
These ideas are so close to STC/DTC that a patent would not stand anyway. You've been posting about a message a day along the lines of "I won't patent this, but please if you do this in commercial software give me credit!"
Bill, I've been doing this in commercial software for nearly 15 years. Your code is clever and I'm really pleased with what you've done, but please, let's focus on the code and quit the ego-stroking.
Comments
The instructions being interpreted/executed sometimes need operands and all of them are in COG memory. Since the use of the cache is indivisible as far as threads are concerned, how about, by convention, putting any working storage at the end of the cache area? Alternatively, we could allocate some small number (say 16) after the end of the jump table?
There also needs to be a primitive to copy the long following the instruction to a specific location (then skip the 2nd long too). This allows a long (often a HUB address) to be loaded without a complex multiple instruction sequence. Let's call it a LOAD. If we use a JMPRET instruction, the destination would be overstored with the data so the fact that a useless return address is stored there first doesn't matter. So, we'd use a "JMPRET <dest>,bhLoad" followed by the long value.
Comments? Thoughts?
Post Edited (Mike Green) : 11/11/2006 2:35:46 AM GMT
Thank you for the enthusiasm, I am now at home, and about to spend the next few hours on this forum
Mike - you've put in an AWFUL lot of thought and work on fleshing out the far model! Thank you!
Chip - thanks for the enthusiasm! And the improved FCACHE!
Phil - Great suggestions! They help a LOT!
Ok, there will be more later, and I want to thank everyone... and I hope to post a LOT more tonight and tomorrow night.
Mike - I like your FLOAD, I think it deserves the next slot in the jump table right after FCACHE. I'm debating the potential merits of a mirroring FSTORE, but I am not sure its needed.
Much more in a bit, but I need to do some typing off-line to edit it properly before posting...
Brilliant idea using JMPRET for the FLOAD function!
Bill,
I agree: FSTORE isn't needed. FLOAD is only needed to grab a constant larger than $1ff so it can be used in an indirect operand. There's no complementary requirement for storing, since the "large number" is already in a local register. Basically, the FLOADs just replace all those defined LONGs that appear at the end of a regular assembly program.
Chip,
'Just got to thinking. If you hadn't made the writing of condition codes optional, none of this would even be possible! (Or at least it would be a lot more awkward!)
-Phil
Mike, I LOVE FLOAD.
I was originally intending to just keep data right at the end of FCACHE'd segments, as obviously the FCACHE'd code will know where it will wind up in the cog; but this is great for regular far code.
To be honest, I don't like 'sub pc,#4' for HALT as there is no way out of it short of restarting a cog; so it may as well be halted so that it won't generate heat.
Phil, thanks for agreeing, no FSTORE.
Now...
I can see we will have several different kernels - that is great, special purpose kernels fit the "philosophy" of the propeller. I will be publishing later tonight the specs for a multi-threaded kernel :-) I already sent Chip a preview earlier today, but I want to clean it up a bit before posting it.
In order to keep all kernels binary compatible, and the idea simple... may I suggest a steering panel?
I'd love to see a panel composed of myself, Mike, Chip and Phil with others of course welcome to contribute ideas at all times; and I hope people would not think it would be too presumptuous of me to assume the role of steering it.
Meanwhile, please find attached a somewhat cleaned up, somewhat commented version of 'kernel.spin'
It still needs a start method, but that will be trivial.
Post Edited (Bill Henning) : 11/11/2006 7:20:43 AM GMT
'Just had a thought while in the shower (second best place for ideas!): Since what's coming from this effort really amounts to an emulator, why not include hooks in a debugging version for breakpoints and single-stepping, too? It would have to be able to accept commands from a debugger (probably written in Spin), to signal when events occur, and to dump the machine state back to hub memory on demand. But I think these things could be accomplished rather easily in the general model we've been discussing.
-Phil
My other ideas (other than threading) include 'virtual peripherals' like on the SX, and user code transparent demand paging of code :)
Except that instead of an emulator, this is really more like a VM on a processor without virtualization, as the vast majority of code will run at full speed, and is not interpreted by code, but is actually executed by the normal fetch-decode-execute-store cycles of the cog.
INLINE
block of large model assembly code follows; if needed throws out part of spin interpreter to use FCACHE functionality to execute the block, then reloads tossed away spin interpreter code to continue
ideally the spin compiler would flag as errors any native cog instructions that break the large model and any FCACHE rules (jumping out of the FCACHE block to anything but legal Fxxx primitives basically)
-Phil
Common to ALL kernel versions, required for binary compatibility!
000: PC
001: SP
002: @next
003: @fcall
004: @fret
005: @fcache
006: @fload
007: @fsystem
008-00f: reserved
010-07f: kernel primitives (also Mike's local stack grows down from $07f in local stack kernels)
080-0ff: FCACHE code buffer
"Standard" kernel - not multi-threaded, stack can be local or hub based
100-17f: DCACHE - yes, what it looks like, proposed Data cache area! One of the system calls will load it
180-1ef: virtual peripheral area
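As a quick sanity check of the map above, the following Python sketch lists the regions of the "standard" kernel and verifies that they tile the cog's register space without gaps or overlaps. The region names are paraphrased from the list; `check_layout` is a hypothetical helper written for this illustration.

```python
# (first, last, name) in cog long addresses, taken from the layout posted above
STANDARD_LAYOUT = [
    (0x000, 0x007, "PC, SP and primitive vectors"),
    (0x008, 0x00F, "reserved"),
    (0x010, 0x07F, "kernel primitives"),
    (0x080, 0x0FF, "FCACHE code buffer"),
    (0x100, 0x17F, "DCACHE"),
    (0x180, 0x1EF, "virtual peripheral area"),
]


def check_layout(regions):
    """Return the total size if regions are contiguous from $000, else raise ValueError."""
    expected = 0
    for first, last, name in regions:
        if first != expected or last < first:
            raise ValueError(f"gap or overlap at {name}")
        expected = last + 1
    return expected


total = check_layout(STANDARD_LAYOUT)
sizes = {name: last - first + 1 for first, last, name in STANDARD_LAYOUT}
```

The total comes to $1F0 (496) longs, which is everything below the cog's 16 special-purpose registers at $1F0-$1FF; FCACHE and DCACHE get 128 longs each.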
"Threaded kernel"
100-17f: DCACHE
The next 112 locations, range $180-$1ef are divided up as follows:
008: @yield - prematurely ends a thread's execution time, passes control to next ready thread
Unfortunately, the "real" PC/SP must be moved under the threaded model.
For compatibility, I am thinking of revising the standard spec to place PC/SP/BP/STATUS starting at $180 even for non-threaded kernels.
Thread local guaranteed local registers for currently executing thread; PC and SP at $0 and $1
180: PC - virtual program counter
181: SP - return stack pointer
182: BP - base pointer for future high level language support
183: STATUS - status register to save Z and C during context switches
184: R0 - registers
185: R1
186: R2
187: R3
188: R4
189: R5
18A: R6
18B: R7
followed by eight groups of twelve registers each; each group stores STATUS, BP, PC, SP, R0-R7 for a potential thread (BP is a reserved register; for future two-instruction "local hub" variable access)
the last four locations are scratch locations for FCACHE blocks
Each thread must have its own stack in hub memory
More on how threading works in another message later; I had to revise the design due to issues I found trying to code the context switch routine
Post Edited (Bill Henning) : 11/11/2006 6:51:50 AM GMT
INLINE is to be a compiler directive to allow us to embed non-standard code; i.e. have SPIN assemble it at the locations where it would end up in sequence in hub memory.
It is NOT intended to inline code into the SPIN cogs!
Think of it like #asm for hub execution friendly code, hopefully it will also emit correct code for FCALL, FCACHE and friends :)
After editing the message above AGAIN, I decided time for a new message.
For compatibility, the threading model is forcing some changes in the standard kernel layout. Even non-threaded kernels are expected to reserve the following range:
1E0-1EF: Currently running context registers; copied to/from here on context switches
1E0: PC
1E1: SP
1E2: BP
1E3: FLAGS
1E4: R0
1E5: R1
1E6: R2
1E7: R3
1E8: R4
1E9: R5
1EA: R6
1EB: R7
1EC: READYTASKS - bitmap of threads that are ready to run
1ED: TIMESLICE - number of "interpreter loop cycles" between forced context switches
1EE: TIMELEFT - "interpreter loop cycles" left for currently running thread
1EF: CURRTASK - pointer to the currently running task's thread context
Yes, this does make this kernel a pre-emptive multitasking kernel.
Threaded kernels also make the following reservations:
1D4-1DF: THREAD0 context
1C8-1D3: THREAD1 context
1BC-1C7: THREAD2 context
1B0-1BB: THREAD3 context
1A4-1AF: THREAD4 context
198-1A3: THREAD5 context
18C-197: THREAD6 context
180-18B: THREAD7 context
The reason for the reverse ordering is that thread contexts grow DOWN from the space reserved for the running context. There is theoretically no reason why more threads cannot be supported; as long as we stay out of the FCACHE area and don't use DCACHE, an additional ten threads are possible, for a maximum of 18 running threads per cog, allowing for a theoretical limit of 8 cogs * 18 threads... 144 threads on one propeller!!!!
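The arithmetic behind the context table and the 18-thread figure can be checked with a few lines of Python. This is only a sketch: `thread_base` is a hypothetical helper, and the assumption that the extra contexts would be carved out of the 128-long DCACHE area at $100-$17F is read from the paragraph above.

```python
CONTEXT_SIZE = 12                       # PC, SP, BP, FLAGS, R0-R7
RUNNING_BASE = 0x1E0                    # context of the currently running thread
DCACHE_FIRST, DCACHE_LAST = 0x100, 0x17F


def thread_base(n):
    """Base address of THREADn's saved context, growing down from RUNNING_BASE."""
    return RUNNING_BASE - CONTEXT_SIZE * (n + 1)


bases = [thread_base(n) for n in range(8)]

# How many more 12-long contexts fit if the DCACHE area is given up for threads?
extra = (DCACHE_LAST - DCACHE_FIRST + 1) // CONTEXT_SIZE
max_threads_per_cog = 8 + extra
max_threads_per_chip = 8 * max_threads_per_cog
```

THREAD0 lands at $1D4 and THREAD7 at $180, matching the table; ten further contexts fit in the DCACHE area, with 8 longs left over.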
Post Edited (Bill Henning) : 11/11/2006 7:49:35 AM GMT
I had a look at your "kernel.spin" posting. It's nice to have the documentation of everything that has transpired. It does cause problems with verbosity in integrating the kernel with my existing I2C routines. Can you distill down the statement of who's to get credit to maybe a paragraph that can be included in my source code, so that I can trim out all the other comments and the Large Memory Model stuff isn't longer than my whole I2C package? Thanks.
Mike
By the way, the comment on the "long f_load" is incorrect now (line 74). Also, relocating the initialization routine to the cache area won't work the way you've done it since the COGINIT instruction loads a block of 512 contiguous longs and the Propeller Tool doesn't pad out the DAT section when it sees the ORGs. Better to leave the initialization code at the end of the "kernel" where it may either be pushed into the cache area as the "kernel" gets bigger or be overwritten by the stack in the non-threaded version.
I like moving the PC and SP to high memory in the cog. What do you want to do with the locations currently occupied by PC and SP?
Mike
If you don't mind, why don't you make the change regarding initialization code, put it back where it would get overwritten by your stack, and fix the long_f comment?
Hmm.. good suggestion, based on which I think perhaps the revision log etc should go to a 'largemodel.credits' file.
I think perhaps the YIELD call should come before SYSTEM though. If running a non-threaded kernel, it can just call f_next
I'm hot-and-heavy into writing (but not testing yet) the threading code...
There are some issues with the current compiler, as well, but those are mainly fixable. I think I need to make a mode, maybe triggered through 'ORG -1' where the compiler will quit worrying about cog ram being exceeded by asm instructions. I think this could be realized by 'org -1' causing the cog address to not be incremented anymore, until a normal org appeared, as would be required for a cached block.
It's true that DAT labels are relative to the object they're in -- they're not absolute, so this would need some patching, as someone pointed out. The compiler loses this data during object mixing, and I know this would be a major (but perhaps eventual) undertaking. In the interim, when a large-model emulator is launched, it could be told what the base address of the object is, and then patch at run-time. It would come down to a single addition, but require a 2-4 instruction sequence to realize within the large-model emulator.
To launch the emulator you'd need to convey at least two pointers. One to the start of the code (@virtualasmcode), and one to the start of the object (@@0) so branch addresses could be resolved. A jump address in virtual asm code would be expressed as @datlabel.
For a full-blown, more native approach, Spin would not have to be a consideration. Any language would do.
Am I following this whole thing correctly?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
And I think we can use the 'BP' base pointer I introduced to be the object base address in hub memory for large model asm code included in spin blocks... what do you think?
For example, in large model asm code, to refer to a VAR long foo, we would write:
INLINE
mov R0,@foo
add R0,BP
rdlong R0,R0
I propose that PC, SP and BP be set up from the PAR passed to COGNEW when initiating a new large model process
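The three-instruction sequence above amounts to "object-relative offset plus base pointer, then read". A minimal Python model of that idea follows; hub memory is modeled as a dict keyed by byte address, and `resolve` and `read_var` are hypothetical names used only for this sketch.

```python
def resolve(offset, bp):
    """Turn an object-relative offset (what @foo yields) into an absolute hub address."""
    return (offset + bp) & 0xFFFF       # hub addresses are 16 bits wide


def read_var(hub, offset, bp):
    """Model of: mov R0,@foo / add R0,BP / rdlong R0,R0."""
    r0 = offset                          # mov    R0,@foo
    r0 = resolve(r0, bp)                 # add    R0,BP
    return hub[r0 & ~3]                  # rdlong R0,R0 (long reads ignore the low two bits)


# The object is assumed to be loaded at hub $0400; foo sits $10 bytes into it.
hub = {0x0410: 1234}
value = read_var(hub, 0x10, 0x0400)
```

The same single addition is what a run-time patch of branch targets would perform, with the object base supplied at launch.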
Chip, any comments on the threading?
The kernel code can execute out of ROM just as easily as out of RAM :)
Think of what you can do to the next version of spin with FAR more code space!
And if at all possible, I'd like to reserve at least 128 of the first 512 bytes of hub memory. 256 would be better.
Why?
FAST Messaging between tasks, fast system calls.
Why?
wrlong req,#cog0request
...
rdlong rep,#cog0result
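The request for the bottom of hub memory presumably comes from the 9-bit immediate source field: a hub address of $1FF or less can be written directly into a WRLONG or RDLONG, as in the two instructions above, with no pointer register to load first. The Python sketch below checks that property for a purely hypothetical mailbox layout (16 bytes per cog, 8 cogs, 128 bytes in all); neither the layout nor the helper names come from the thread.

```python
IMMEDIATE_MAX = 0x1FF      # largest value a 9-bit immediate source field can hold
MAILBOX_BYTES = 16         # hypothetical: four longs of mailbox per cog


def reachable_by_immediate(hub_addr):
    """True if hub_addr can be used as an immediate operand of rdlong/wrlong."""
    return 0 <= hub_addr <= IMMEDIATE_MAX


def mailbox(cog_id):
    """(request, result) hub byte addresses for a cog; the layout is an assumption."""
    base = cog_id * MAILBOX_BYTES
    return base, base + 4


all_reachable = all(
    reachable_by_immediate(addr) for cog_id in range(8) for addr in mailbox(cog_id)
)
```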
Post Edited (Bill Henning) : 11/11/2006 7:30:35 AM GMT
It's a little more complicated than that. For example, if you want to have code addresses (for calls, jumps, etc.), the granularity is wrong (longs vs. bytes). All code offsets have to be multiplied by 4. Also, although it's potentially dangerous, the use of a "here" operand is very helpful with relative jumps of a short distance (maybe up to 8 longs in either direction). Sure you can have local labels, but, for very short distances, they're more of a nuisance than a help.
Bill,
How should the initial information be passed? As 3 longs or 3 words? What order?
Can you define the format of the System "instruction"? I'd like to nail down the code for the non-threaded version. I assume there'll be some parameters following the "instruction". How many longs?
I'll try to edit in your changes (like the context registers) tomorrow and post my version (with the changes) then.
Mike
For now... tada... here is the multi-tasking context switching code!
Right now it assumes all eight threads always run, I will change that tomorrow, I have used up all the registers between $1E0 and $1EF now!
Ok, the end of the unrolled fetch-exec loop used to read:
:inst4 nop
jmp #f_next
In order to have eight threads, this needs to change to the following:
[code]
:inst4          nop
                DJNZ    TIMELEFT,#f_next
                mov     TIMELEFT,TIMESLICE      ' reset slice clock for next thread
                ' save current C and Z into FLAGS register
                MUXZ    FLAGS,#zFlag            ' Thanks Chip! Forgot about MUX instructions
                MUXC    FLAGS,#cFlag            ' you saved everyone 8 cycles on every task switch!
                ' save current context
                movs    ctxsave,#PC
                movd    ctxsave,CURRTASK
                mov     f_temp, #12
ctxsave         mov     0-0,0-0
                add     ctxsave,src_inc_const
                add     ctxsave,dst_inc_const
                DJNZ    f_temp,#ctxsave         ' '#' needed: jump to the label, not through it
                ' go to next context - for now I'm assuming all threads always run
                ' this is proof of concept code; next version will use the last spare
                ' word
                sub     CURRTASK,#12
                cmp     CURRTASK,#$180 wc       ' wc needed so that if_b sees the result
        if_b    mov     CURRTASK,#$1D4          ' go back to top thread
                movs    ctxload,CURRTASK
                movd    ctxload,#PC
                mov     f_temp, #12

ctxload         mov     0-0,0-0
                add     ctxload,src_inc_const
                add     ctxload,dst_inc_const
                DJNZ    f_temp,#ctxload
                ' restore flags

                andn    FLAGS,zFlag wz
                rcr     FLAGS,#1 wc
                jmp     #f_next
[/code]
NOTE: I have NOT run this code yet, there maybe bugs, I could have left some in there...
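For readers who want to see the intended behaviour without the self-modifying code, here is a small Python model of the round-robin switch. It is only a sketch under the same "all eight threads always run" assumption: cog RAM is a plain list, the flag save/restore and the TIMELEFT countdown are left out, and `context_switch` is a hypothetical name.

```python
CONTEXT_SIZE = 12
RUNNING = 0x1E0                 # PC, SP, BP, FLAGS, R0-R7 of the running thread
TOP, BOTTOM = 0x1D4, 0x180      # THREAD0 and THREAD7 context bases


def context_switch(cog, currtask):
    """Save the running context into currtask's slot, load the next slot, return its base."""
    cog[currtask:currtask + CONTEXT_SIZE] = cog[RUNNING:RUNNING + CONTEXT_SIZE]
    currtask -= CONTEXT_SIZE
    if currtask < BOTTOM:       # walked off the bottom of the table
        currtask = TOP          # go back to the top thread
    cog[RUNNING:RUNNING + CONTEXT_SIZE] = cog[currtask:currtask + CONTEXT_SIZE]
    return currtask


cog = [0] * 512
for n in range(8):
    cog[TOP - CONTEXT_SIZE * n] = 0x1000 * (n + 1)   # saved PC of THREADn
cog[RUNNING] = cog[TOP]                              # THREAD0 starts out running

task = TOP
order = []
for _ in range(8):
    task = context_switch(cog, task)
    order.append(task)
```

Eight switches visit THREAD1 through THREAD7 and then wrap back to THREAD0, with each thread's saved PC reappearing at $1E0 when its turn comes.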
Post Edited (Bill Henning) : 11/11/2006 8:23:28 AM GMT
About threading, I like how you reduced the task-specific working registers to R0..R7 so that only they and three other values must be swapped to make a thread switch. You know, though, if a set of threads is pre-written to run on a single cog, they could be hard-coded to use separate register areas. This swapping is only necessary for un-related-at-design-time threads, right? But, that might be the real value of this thing - threads can start, spawn, and stop without any intimate knowledge of each other. With swapping, someone could write a serial driver and someone else could run a few of them on a single cog. You'd get a lot more bang out of a cog that way.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
In addition to an instruction pointer register (ip) in cog RAM, here's all the code that's necessary (sp is the same as Mike's stkptr):
Here's a sample of some Forth-like procedures written in a mixture of large-model assembly and threaded-code. tos, ttos, and ep are app-specific registers in the scratch area.
-Phil
Post Edited (Phil Pilgrim (PhiPi)) : 11/11/2006 8:15:37 AM GMT
                MUXZ    FLAGS,#zFlag
                MUXC    FLAGS,#cFlag
I think you'll need to use MUXNZ for the Z flag so that when later TESTed, it restores properly.
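The reason for MUXNZ can be seen in a tiny Python model. TEST sets Z when the masked bits are all zero, so the saved bit has to be the inverse of Z for a later TEST to reproduce the original flag. The bit position chosen for `Z_FLAG` and the helper names are assumptions made for this sketch.

```python
Z_FLAG = 0b10   # hypothetical bit position for the saved Z flag


def muxz(flags, mask, z):
    """MUXZ: set the masked bits of flags to Z."""
    return (flags | mask) if z else (flags & ~mask)


def muxnz(flags, mask, z):
    """MUXNZ: set the masked bits of flags to NOT Z."""
    return muxz(flags, mask, not z)


def z_after_test(flags, mask):
    """TEST flags,#mask WZ: Z ends up set when (flags AND mask) is zero."""
    return (flags & mask) == 0


restored_with_muxnz = [z_after_test(muxnz(0, Z_FLAG, z), Z_FLAG) for z in (True, False)]
restored_with_muxz = [z_after_test(muxz(0, Z_FLAG, z), Z_FLAG) for z in (True, False)]
```

With MUXNZ the round trip returns the original Z; with MUXZ it comes back inverted.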
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
Post Edited (Chip Gracey (Parallax)) : 11/11/2006 8:10:17 AM GMT
I reduced it to 12 registers including the new FLAGS (to save C and Z) specifically to be able to compile code without having to statically allocate registers; but you are right, thread-aware, hand-built code could make use of the DCACHE area for more local variables.
You are also correct, the point of a thread specific limited context was support for 'classic' threading functions for spawning, killing, suspending etc threads. Theoretically, threads could even be swapped out to FRAM over SPI!
The multi-threaded kernel would even allow for Unix style fork() and friends.
Eventually I want to be able to allow an FCALL to a system library to spawn a new thread on the calling cog!
For example, I am considering adding an event-watching system, so that when say a 'CTS' signal is detected, a call can be made to 'RS232Receive', which then spawns a thread that FCACHE's the actual receive routine :)
OR,
  FCALL SPI_READ
  #disk_block
:-)
Which spawns a thread, that uses an FCACHE'd block to read AT FULL SPI SPEED a block!
The thread can then die, and the cog goes back to regular threading.
This allows a FULL FUNCTION multi-tasking OS!
Compilers, shells, interpreters... BRING IT ON!
A full feature BASIC!
A C compiler!
Might have interesting implications for code density where speed is not as important!
I see no reason why there cannot be a forth-kernel variant.
I can see my large model idea spawning all kinds of special purpose kernels.
Personally, I'd love to see a floating point package implemented as FCALLable routines, with FCACHE'd segments for the actual work; and I could even see a special floating point kernel that implemented the following primitives:
FTOI
ITOF
FADD
FSUB
FMUL
FDIV
FMOD
More complex math functions could then be built on them and FCALL'ed at high speed.
IEEE 32 bit math please :)
Let me be VERY clear... I am NOT looking to "submarine patent" anyone.
I do expect to be credited for my ideas; and would expect some kind of arrangement for anyone wanting to do closed source commercial work that they then distribute (internal use does not count)
As long as I am properly credited, I hereby very publicly commit not to go after any open source use of these ideas.
Parallax is further VERY welcome to incorporate these ideas in SPIN and its environment; but I do ask for appropriate credit, which I am certain they would have provided even if I did not ask :)
This also covers the pre-emptive multi-tasking kernel, and anything else I post to this thread related to large model on the propeller.
Post Edited (Bill Henning) : 11/11/2006 8:53:04 AM GMT
1) There will be a SECOND multi-tasking kernel. This (slower) one will store the process table in HUB memory, in order to make cog memory locations $180-$1DF available for use for a loadable library or virtual peripheral code.
This will slow down context switching, but will allow for a "pool" of cogs to load and run the next ready thread, and an effectively unlimited number of threads system wide.
2) The kernel will choose between cog and hub stack during initialization, and patch FCALL and FRET respectively. The "spare" code can live in the FCACHE area and be overwritten.
3) The "single tasking" kernel can become multi-tasking by FCALLing the FSTART system call
4) The floating point math library I proposed - only FADD / FSUB / FMUL / FDIV / FREM should be primitives (and they should fit in the library area); ITOF, FTOI, ATOF, FTOA, SIN, COS, TAN and friends should be FCALL'able library routines.
Now I'm off to work for a startup downtown, I won't be able to check the forum until I get home around 7pm pacific....
I owe I owe... off to work I go!
Now, I can see some issues with this. First, it needs a pin to link the two counters. (A crying shame the counters don't have an input clock divider!) Second, the pointer auto-increments independent of the Cog code, so it would be prone to losing synchronization and causing some "interesting" bugs.
my two cents,
Marty
These ideas are so close to STC/DTC that a patent would not stand anyway. You've been posting about a message a day along the lines of "I won't patent this, but please if you do this in commercial software give me credit!"
Bill, I've been doing this in commercial software for nearly 15 years. Your code is clever and I'm really pleased with what you've done, but please, let's focus on the code and quit the ego-stroking.