Thank you for the interest then.
(Note: I edited in some detail so you can see why I am pushing for the, um, redundancy.)
I did make some assumptions and adaptations, yes, some of which I did not include in my earlier post
to keep the spec from getting too long.
For one, I figured that any cog should probably be able to write to any free broadcast bus.
I also pretty much stay with the concept that some dedicated cog in each Prop listens for broadcasts most of the time.
OK, I will start with XWRITE (actually called SET):
void SET( ptr32 handle, ulong value ) (8 bytes + cmdsize)
In the protocol, functions that return void don't wait. They just broadcast on an available non-escalated bus.
Props know which ptr32 handles they manage; the listening Props that own the appropriate range in
SRAM or hub memory just update the value. Sets like this are atomic and fairly fast.
The ptr32 handle must be a value returned from "ptr32 CALLOClongs()" or an offset within it,
since CALLOC(n,b) reserves n items of size b each. A long-pointer handle could be listened for in
multiple Props (memory aliasing), any one of which might be a debugger listening for changes to your variable.
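Here is a minimal sketch of what the listening side of SET might look like. The struct layout, the CMD_SET code and the names set_msg_t, owned_range_t and on_broadcast_set are my own assumptions for illustration; the post only fixes "8 bytes + cmdsize" and the rule that a Prop updates values in ranges it owns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CMD_SET = 1 };            /* assumed command code, not from the spec */

/* Hypothetical decoded SET message: command + 8 bytes of payload. */
typedef struct {
    uint8_t  cmd;                /* assumed one-byte command */
    uint32_t handle;             /* ptr32 handle */
    uint32_t value;              /* ulong value */
} set_msg_t;

/* One range of handles that this Prop has agreed to manage. */
typedef struct {
    uint32_t  base;              /* first handle (byte address) in the range */
    uint32_t  nbytes;            /* size of the range in bytes */
    uint32_t *store;             /* local HUB/SRAM memory backing the range */
} owned_range_t;

/* Called by the listening cog for every broadcast it sees.
   Returns 1 if the write was applied here, 0 if it was not ours. */
int on_broadcast_set(const owned_range_t *r, const set_msg_t *m)
{
    assert(r != NULL && m != NULL);
    if (m->cmd != CMD_SET || m->handle < r->base)
        return 0;
    uint32_t off = m->handle - r->base;
    if (r->nbytes < 4u || off > r->nbytes - 4u || (off & 3u) != 0)
        return 0;                /* outside our range, or not long-aligned */
    r->store[off / 4u] = m->value;
    return 1;
}
```

Because every listener makes this decision on its own, a second Prop (say, a debugger) can register the same range and see the same writes, which is the aliasing case mentioned above.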
void *CALLOC( long Nitems, long itemSize) (sends 8 bytes + cmdsize)
This asks the distributed memory manager for room for that many things,
all the same size, all zeroed. Implied in this command is a small exchange
under the hood. Since this one returns a value, it can cause the current cog
to swap out/suspend. Something else to consider is that this version of the command
does not specify which Prop should be used to satisfy the request.
Therefore multiple Props can answer. If there are multiple answers for the same range
within Nitems, the least busy Prop (the first to answer) is accepted. All others are told they
are a dupe() {this is an event hook point, dupe()}.
Advanced detail:
The memory /looks/ contiguous to subsequent SET/GET calls (including walking a pointer
through memory with *p++). Internally a Prop is allowed to partially fill a request
and answer back with a range; however, preference will be given to Props
which can do the most in a single swath.
The caller can still get an out-of-memory condition if no combination of
the Props' HUB, SRAM and backing store can satisfy the request.
(You don't get less than you ask for, even if a given Prop only does a
partial range.) This is all part of the exchange worked out on your behalf
and is intended to satisfy the semantics of the C calloc() function.
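To make the exchange a bit more concrete, here is a rough sketch of how the requester might sort out the answers. It is only an illustration under my own assumptions: offer_t, arbitrate, MAX_ITEMS and NO_OWNER are made-up names, offers are processed strictly in arrival order (so "first to answer wins"), and the preference for the biggest single swath is left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ITEMS 64             /* arbitrary cap for this sketch */
#define NO_OWNER  (-1)

/* Hypothetical answer from one Prop: "I can hold items first..first+count-1". */
typedef struct {
    int    prop_id;
    size_t first;
    size_t count;
    int    is_dupe;              /* set by arbitrate(): 1 means send dupe() */
} offer_t;

/* Walk the offers in arrival order. An item goes to the first Prop that
   offered it. Returns 1 if every item found a home, 0 for out of memory. */
int arbitrate(offer_t *offers, size_t noffers, size_t nitems,
              int owner[MAX_ITEMS])
{
    assert(nitems <= MAX_ITEMS);
    size_t covered = 0;
    for (size_t i = 0; i < nitems; i++)
        owner[i] = NO_OWNER;
    for (size_t k = 0; k < noffers; k++) {
        size_t won = 0;
        for (size_t i = offers[k].first;
             i < nitems && i - offers[k].first < offers[k].count; i++) {
            if (owner[i] == NO_OWNER) {
                owner[i] = offers[k].prop_id;
                won++;
            }
        }
        offers[k].is_dupe = (won == 0);   /* contributed nothing new */
        covered += won;
    }
    return covered == nitems;    /* all or nothing, like calloc() */
}
```

With offers from Props 3 and 7 for items 0..3 and from Prop 5 for items 4..7, Prop 7 arrives second and gets the dupe(), and the request for 8 items succeeds.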
-- What else can be done with a CALLOC() return value? Are there any hidden goodies? --
Yes, of course.
You can call get_size_of(ptr32 handle), which returns itemSize (note: not Nitems x itemSize).
You can call get_type_of(ptr32 handle), which for an itemSize of 1 returns a StringDispatcher();
for other things you allocate it can return ObjectDispatcher() ;)
You can also change the type of an item so that you can have your own dispatchers.
There is a plan to even permit changing the "size" of an item in place (makes realloc() simple).
Since the true location of an item in a remote Prop may be swapped out to SRAM or backing-store
there is the equivalent of a page-table within each distributed object-memory manager.
The page table knows for a given object:
- the base virtual address this object is aliased to,
- the range of valid offsets in the item handled by this Prop/page table,
- where it really is, [HUB,SRAM,SWAP]+offset
- where it goes back to on a save-self (or write-through)
- dirty, swapped, active and other management bits,
- who to call for a dispatcher (aka type),
- a backup list of who to call for a dispatcher (used for inheritance)
- the size in bytes (itemSize)
This is similar to what is maintained in a memory management unit
on an MMU enabled CPU.
What I find interesting is that the local page table can
duplicate this information, so that it is possible to cache an active object
locally at will, which means reads can go much faster.
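For the compiler folks, the page table entry described above might be laid out something like this. Field names, widths, the flag values and the pt_translate helper are all my assumptions; only the list of things the entry has to know comes from the description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { LOC_HUB, LOC_SRAM, LOC_SWAP } location_t;

enum { PT_DIRTY = 1, PT_SWAPPED = 2, PT_ACTIVE = 4 };  /* assumed flag bits */

typedef struct {
    uint32_t   virt_base;     /* base virtual address the object is aliased to */
    uint32_t   first_off;     /* first valid offset handled by this Prop */
    uint32_t   last_off;      /* last valid offset handled (inclusive) */
    location_t where;         /* where it really is: HUB, SRAM or SWAP */
    uint32_t   where_off;     /* ... plus offset */
    location_t home;          /* where it goes back to on save-self */
    uint32_t   home_off;
    uint32_t   flags;         /* dirty, swapped, active, ... */
    uint16_t   dispatcher;    /* who to call for a dispatcher (the type) */
    uint16_t   fallback[2];   /* backup dispatchers, used for inheritance */
    uint32_t   item_size;     /* size in bytes (itemSize) */
} pt_entry_t;

/* Translate a handle into an offset inside e->where.
   Returns 1 if this entry covers the handle, 0 otherwise. */
int pt_translate(const pt_entry_t *e, uint32_t handle, uint32_t *phys_off)
{
    assert(e != NULL && phys_off != NULL);
    if (handle < e->virt_base)
        return 0;
    uint32_t off = handle - e->virt_base;
    if (off < e->first_off || off > e->last_off)
        return 0;             /* some other Prop holds this part of the range */
    *phys_off = e->where_off + (off - e->first_off);
    return 1;
}
```

A local cache of a remote object would then just be a second entry with the same virt_base but a local where/where_off.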
That's a GET call. One would expect v = GET(ptr32 remote_handle); however, GET on the wire
needs to actually cause the remote Prop to do a WRITE/SET back to
v in the local one. Thus what happens is:
GET(&v, ptr32 remote_handle) goes out (8 bytes + cmdsize),
which the remote Prop turns into SET(&v, value) and sends back to us as a broadcast.
Now if v is in SRAM, static memory or on the stack, then a REF to it
is needed. &v is actually another call which sets up a local page table entry
for wherever v is (if it is actually local). It's actually called REF(&v).
Effectively we then have:
// for: auto_y[j] = remote_x[k]
vea = REF(&(auto_y+(j*sizeItem(auto_y[0])))) // compute far address for destination
send: "GET"(vea, &(remote_x + (k*sizeItem(remote_x[0]))))
(optional suspend current cog)
...
recv: "SET"(vea, remote_value)
FRE(vea)
(resume current cog or in a new cog)
Now all we need to be able to do is undo the page table entry we created with REF.
That's done with FRE().
If doing loop optimization, one can compute the REF once outside the loop
and call FRE only once too, when the loop is done and vea goes out of scope.
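The REF / GET / SET / FRE round trip can be mocked up on a single machine as below. This is a sketch under loud assumptions: the slot table, REMOTE_BIT, on_set and remote_on_get are hypothetical stand-ins for the local page table and the remote Prop, the "wire" is just a function call, and the per-slot count is my guess at how REF/FRE could double as reference counting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REF_SLOTS  8
#define REMOTE_BIT 0x80000000u   /* assumed "this is a far handle" marker */

typedef struct { uint32_t *local; int refs; } ref_slot_t;
static ref_slot_t ref_table[REF_SLOTS];   /* stand-in for local page table */

/* REF: publish a local variable, returning a far address (0 if table full). */
uint32_t REF(uint32_t *local)
{
    assert(local != NULL);
    for (uint32_t i = 0; i < REF_SLOTS; i++)      /* already published? */
        if (ref_table[i].refs > 0 && ref_table[i].local == local) {
            ref_table[i].refs++;
            return REMOTE_BIT | i;
        }
    for (uint32_t i = 0; i < REF_SLOTS; i++)      /* take a free slot */
        if (ref_table[i].refs == 0) {
            ref_table[i].local = local;
            ref_table[i].refs = 1;
            return REMOTE_BIT | i;
        }
    return 0;
}

/* FRE: drop one reference; the entry is dead once the count reaches zero. */
void FRE(uint32_t vea)
{
    uint32_t i = vea & ~REMOTE_BIT;
    if ((vea & REMOTE_BIT) && i < REF_SLOTS && ref_table[i].refs > 0)
        ref_table[i].refs--;
}

/* Incoming SET(vea, value): store through the published entry. */
int on_set(uint32_t vea, uint32_t value)
{
    uint32_t i = vea & ~REMOTE_BIT;
    if (!(vea & REMOTE_BIT) || i >= REF_SLOTS || ref_table[i].refs == 0)
        return 0;
    *ref_table[i].local = value;
    return 1;
}

/* Stand-in for the remote Prop: GET(vea, x[k]) becomes SET(vea, value). */
int remote_on_get(const uint32_t *remote_mem, size_t nlongs,
                  uint32_t vea, size_t k)
{
    if (k >= nlongs)
        return 0;
    return on_set(vea, remote_mem[k]);
}
```

For auto_y[1] = remote_x[2] the caller would do vea = REF(&auto_y[1]), send the GET, wait for the SET to land, then FRE(vea). In a loop, the REF and FRE move outside as described.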
Issuing an unadorned GET() can, like
CALLOC(), be a swapout/suspend point if it takes too long for an answer to come back.
That is good if you have more objects than you do cogs.
You can override the timeout event for either GET or CALLOC to call some other function.
The default is "IM_AVAILABLE()". It is called by the cog before it is saved.
By the way, REF() and FRE() can help with reference counting.
-- What's in a handle? --
One of the ways in which a ptr32 handle can be quickly distinguished from any other pointer
is by setting a bit in it which marks it as a remote pointer. Indexing and incrementing can still be
done on it. It can still be a pointer to a structure with members. Perhaps a mask over some of the high-order address bits
indicates directly which page table entry. Again, ++ on the handle should just select a higher offset from the actual base.
You will be tempted to have a 16-bit-wide page table index. Don't do it. In smaller Props
the entire SRAM will get eaten by page table entries. Not worth it. We are counting on adding memory
and secondary swap space. Let's use it right. There is a second reason not to slide that mask down too far:
a later-generation Prop will likely have more internal memory (256K last I heard).
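One possible handle layout, just to show the bit twiddling. The split is my own assumption and not part of the spec: bit 31 marks a remote pointer, the next 8 bits pick the page table entry (deliberately fewer than 16), and the low 23 bits are the offset. The h_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define H_REMOTE_BIT  0x80000000u
#define H_INDEX_BITS  8u                       /* assumed; kept well under 16 */
#define H_OFFSET_BITS 23u                      /* 1 + 8 + 23 = 32 bits */
#define H_INDEX_MASK  ((1u << H_INDEX_BITS) - 1u)
#define H_OFFSET_MASK ((1u << H_OFFSET_BITS) - 1u)

/* Nonzero if the handle carries the remote marker bit. */
int h_is_remote(uint32_t h)
{
    return (h & H_REMOTE_BIT) != 0;
}

/* Which page table entry the handle selects. */
uint32_t h_index(uint32_t h)
{
    return (h >> H_OFFSET_BITS) & H_INDEX_MASK;
}

/* Byte offset from the object's base. */
uint32_t h_offset(uint32_t h)
{
    return h & H_OFFSET_MASK;
}

/* Build a handle from a page table index and an offset. */
uint32_t h_make(uint32_t index, uint32_t offset)
{
    assert(index <= H_INDEX_MASK && offset <= H_OFFSET_MASK);
    return H_REMOTE_BIT | (index << H_OFFSET_BITS) | offset;
}
```

Since the offset sits in the low bits, plain pointer arithmetic on the handle (h += 4, or *p++ in C) only moves the offset and leaves the marker bit and index alone, as long as it stays inside the object.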
These are implementation details which should not concern someone coding in, say, C.
A runtime routine should take care of these details.
But how does that runtime work???
"Taking Addresses."
This is when your local C compiler is your friend.
You can't take the address of a register, so having them in cog RAM is fine.
Likely stack and autos in the large model will be in HUB RAM; this way you don't
have to save them on a suspend of a cog. Statics can be in HUB RAM too.
If they spill over, it's predictable and some can
be relocated to SRAM (same for auto spillover).
If the compiler knows something is not static and not in a register or on the stack,
then it's likely remote/external. Thus anything in a pointer or the address
of anything can be computed.
Thank you for your interesting considerations, Bob. IMHO it needs to be a little bit more complicated, and I have still no idea how LOCKING will come in, but it most likely can be included in some of the dispatcher's tables.
The underlying routing will be handled by the corresponding layer, needing a different set of COGs, I think...
I expect that we will need about nine COGs per Prop busy for the communication, but maybe I am too pessimistic...
a heckler said...
I expect that we will need about nine COGs per Prop busy for the communication, but maybe I am too pessimistic...
How about 8?
Well, that's actually not a fair count. It's meant to get people to giggle
till they actually take the time to read what's really being proposed.
So: one cog listens for GETs, all cogs may issue SETs.
That's how I got to 8.
How did you get to 9?
Now if the code space for I2C is too big and can't be shared, then it's possible
to have a pair of cogs in that role.
It may be possible to reduce that to a single cog for both get()s and set()s --
I have not tried.
This can be trialed with standard I2C routines for backbone access so you
can use existing stuff. It's just slow and doesn't permit an 'overtake' operation to be modeled.
It's OK to not understand what I'm saying. You did ask for implementation,
and my target implementation audience was folks writing compilers,
who often think in terms of ptr+offset for struct members.
At one point the protocol had a separate long for the offset part, but I realized
a way we could do without that.
Bob, I am sure that what you are thinking and posting will be important.
My main problem is that I CANNOT USE IT in any way!
As I said, I experimented a little some time ago and found:
(a) there was too much missing to allow a SIMPLE approach
(b) the cost/benefit ratio was bad compared to a low-cost single-processor PC solution as well as to a high-end chip (Tilera's TILE64), and most likely even compared to proprietary FPGAs, though I have not evaluated their costs.
The main advantage of a "Propeller Stack" will and must be its scalability.
I personally would find your ideas more useful if you could arrange them according to well-established protocol layers:
Physical (incl. Bus Architecture) / Data Link / Network / Transport / Session
But, as I said, I am only interested in the Application Layer: "Distributed Computing".
There are many possibilities...
(a) You can provide a service for setting I/O pin 21 in Propeller #53
(b) You can load a complete COG in Propeller #53 that is allowed to do whatever a COG is allowed to do, e.g. setting pin 21 or deleting the complete HUB of Propeller #53
Considering the timing dimension, (a) will be a rare application, and I see no way to work around (b)
....
Two cents: I like single points of failure, they are way easier to identify and fix. So here's a thought: instead of being Bound-By-The-Bus, let each adjacent pair of props have their own comm protocols. Maximize flexibility by designing a decent vertical physical implementation that puts every i/o pin to the pcb edge, and also lets any combination be linked up or down or both.
Write a program that figures out the i/o configuration and launches the appropriate handlers. Or not. Just write what you need for where you need it.
Post Edited (Fred Hawkins) : 10/8/2007 10:22:59 AM GMT
Post Edited (bobr_) : 10/6/2007 7:47:18 AM GMT
Post Edited (bobr_) : 10/6/2007 10:08:44 AM GMT
Post Edited (deSilva) : 10/7/2007 8:21:02 PM GMT