How large can EEPROM devices be? Can I get a megabyte in an EEPROM?
To me, it's very attractive, given the device can hold enough code, and it runs at SPIN speed or greater. The primary attraction would be to use the HUB more fully as a buffer or operating area, where little to no code exists. 2K for uploading to COGs, some amount for parallel code threads, and a master program that's quite large residing in EEPROM is a potent combination, IMHO.
Thanks! (I did not mean to digress, here's the rest of what I was going to say.)
I've thought some about the complexity argument, and the code size one too.
Say this stuff takes about 20 longs to sort out. A quick look at most drivers shows that number of longs is not uncommon as working variables, meaning Ross is basically on solid ground with the idea that there isn't really any overhead on code SIZE. Some good labels and maybe stacking them (more than one label per long) could sort this out nicely.
On the complexity argument, I do see that as an issue, but one that we could do some work to marginalize. In my own experience, time and goals are often in conflict, as is experience level. On new ground, I just want to get it working. Then comes optimization. I would submit that after that comes integration and publishing for wider-scale re-use.
Templates and documentation can be a big help here. But not everybody is going to use those, or even see them, so...
Most everything that really matters is MIT-licensed, so perhaps we can do some of the integration for them, rather than expect them to do it themselves. After all, they can just go and get the integrated code and update it for new operations. Or maybe they don't, and their updates to their code can be integrated and factored for re-use later too.
For the case of SPIN + PASM, simply having a cog operate through PAR is enough. Ideally that COG does its own computations / init too, though this isn't always possible and/or practical. Translation to the other language will be needed.
For the case of C, I like the idea of a rough, simple, minimum case standard, and personally would make use of template code / documentation. It's often easier to start in a framework like that and build out, than it is to get something done, then perform that work, and I think that's worth noting. Often, "it's working" is where it gets to. Then it's off to doing something else. Providing good starter material would help in this considerably, IMHO.
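To make that "minimum case" concrete, here is a small sketch in C of the sort of parameter block a driver cog might be handed via PAR. The structure, field names and command protocol are purely illustrative assumptions, not an agreed standard, and starting the cog itself (cognew/coginit in Spin, or a loader call in C) is left out - the point is only that everything the cog needs arrives through the one pointer in PAR.

```c
#include <stdint.h>

/* Hypothetical minimum-case parameter block, placed in hub RAM.  The cog
   is started with PAR pointing at this structure, and all communication
   after startup goes through it.  Field names and layout are illustrative
   only - they are not a proposed standard. */
typedef struct {
    volatile uint32_t cmd;      /* 0 = idle; driver clears it when a request is done */
    volatile uint32_t arg1;     /* request arguments                                  */
    volatile uint32_t arg2;
    volatile uint32_t result;   /* written by the driver cog before clearing cmd      */
} driver_mailbox_t;

static driver_mailbox_t uart_mailbox;   /* one instance per driver cog */

/* Issue one request and wait (by polling) for the driver cog to finish. */
static uint32_t driver_request(driver_mailbox_t *m, uint32_t cmd,
                               uint32_t a1, uint32_t a2)
{
    m->arg1 = a1;
    m->arg2 = a2;
    m->cmd  = cmd;                  /* writing cmd last "posts" the request */
    while (m->cmd != 0)             /* driver cog clears cmd when finished  */
        ;
    return m->result;
}

/* e.g. driver_request(&uart_mailbox, 1, 'A', 0); to send one character,
   assuming command 1 means "transmit" in this hypothetical driver.       */
```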
Also, in the context of this discussion, I find this notable:
But this is important - supporting models other than "one main program and a bunch of drivers" allows us to take full advantage of the two "killer" advantages of the Propeller - absurdly simple multi-processing, and amazingly flexible soft peripherals. No other chip on the market seems to (currently) be able to offer this combination.
Agreed.
With SPIN+PASM, I can do those things dead simple easy, robust, fast. (Yes, I don't think SPIN is all that slow, when used as intended.)
With C (and I'm writing generically, using both C's right now), I find I can't do that as easily, but I can make larger programs! Part of this boils down to external storage not aligning well with multi-processing. The other part is just complexity, and likely the need to learn new tricks to bootstrap myself up to that level, which happened with SPIN too (there just wasn't that much bootstrapping in that case).
So my first thought is, "shouldn't this standard include external memory drivers as well?" And if that's a mess, off topic, ugly, feel free to scroll right on by.
Hi potatohead,
I agree with everything in your post - especially the bits you quoted from my posts
On the point above, we currently do have a kind of "de facto" interface - Bill's VMCOG (cache) interface. I think all the current C compilers can use this for XMM programs (Catalina also offers direct XMM access, but that's different - direct XMM access uses a traditional PASM API - i.e. a set of function definitions - and is not a cog-to-cog interface).
Originally, I thought this particular interface might not be appropriate to include (since in many ways it is more like an extension of the LMM kernel than something users would be interested in) - but on second thoughts I figure if we start excluding things for such arbitrary reasons, we will probably not come up with a good general-purpose solution.
On the point above, we currently do have a kind of "de facto" interface - Bill's VMCOG (cache) interface. I think all the current C compilers can use this for XMM programs
No. GCC uses a faster one. Of course it "can" as you say use VMCOG but we wouldn't want to offer a slower interface.
Thanks - I didn't know that. Is your new interface worthy of posting here? i.e. could it be a candidate for general purpose cog-to-cog interaction?
Ross.
The "new" interface predates VMCOG by 3 months. It is in the source trees for GCC, zog, and xbasic. You have seen it. The interface is designed for performance between the main LMM/XMM interpreter and the cache COG when necessary.
The "new" interface predates VMCOG by 3 months. It is in the source trees for GCC, zog, and xbasic. You have seen it. The interface is designed for performance between the main LMM/XMM interpreter and the cache COG when necessary.
I must have missed it. Is there any documentation?
That's probably just one reason you didn't use it in Catalina
Ah, but I did! That mailbox interface is the one Catalina uses. This is what I refer to as "Bill's VMCOG" interface. If I have been attributing it wrongly, please let me know who actually originated it.
I'm not too proud to take good code from any source - provided the license permits it. Why, I have even used some of your code
Ross.
Where does your XMM PASM kernel use the cache line pointer for data fetch/save?
The "line code" in your file is based on VMCOG.
The cache line pointer interface is what I roughly defined in December 2009 and matured into what is used by GCC.
Yes, I remember that thread - but I can't see the actual VMCOG mail box interface defined there. It presumably got added sometime later - it certainly already existed by the time I had added caching to Catalina, and I thought it was Bill who developed it.
The Catalina code that uses that interface is in Cached_XMM.inc (in the Catalina target directory). I'm working on an updated version which improves execution speed but could only ever be used by a single kernel cog. I hope to include this version in the next release of Catalina, and the underlying VMCOG mailbox interface should remain the same. However, I am also working on a version that supports access from multiple cogs.
Yes, I remember that thread - but I can't see the actual VMCOG mail box interface defined there. It presumably got added sometime later - it certainly already existed by the time I had added caching to Catalina, and I thought it was Bill who developed it.
The basic cache line operation was defined in the last code block of the first post. The GCC XMM kernel uses the fully developed model which includes the read/write interface defined here.
Bill defined an interface in his vmcog where there were 6 separate commands for read/write access types - horribly inefficient. I defined 2 commands: read/write cache line pointer. David added the flash/sdcard "extended" commands with zog/xbasic.
The cache line pointer has a drawback for multiple cog access. It uses the write-back method. A write-through method would probably be necessary for multiple cog accesses.
A different type of caching method has been stirring around in a few heads where the cache manager actually lives in the XMM kernel. This would allow just telling the cache cog to load or save a line. We have room for that in our XMM code, but haven't had time to develop it.
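As a rough illustration of the two-command idea (this is not the actual GCC/zog encoding - the command layout, bit assignments and names below are assumptions):

```c
#include <stdint.h>

/* Hypothetical cache-line mailbox in hub RAM, shared between the LMM/XMM
   kernel cog and the cache cog.  Two operations only: ask for a line for
   reading, or ask for a line for writing.  Layout and encoding are
   illustrative assumptions, not the actual GCC/zog interface. */
typedef struct {
    volatile uint32_t cmd;    /* external address + command bit; 0 = done        */
    volatile uint32_t line;   /* cache cog replies with hub address of the line  */
} cache_mailbox_t;

#define CACHE_WRITE_BIT  1u   /* assumed encoding: low bit selects read vs write */

/* Ask the cache cog for the hub-RAM line holding external address 'xaddr'.
   The caller then reads or writes elements within the returned line directly;
   the cache cog writes dirty lines back to external memory later (write-back). */
static uint8_t *cache_line(cache_mailbox_t *mb, uint32_t xaddr, int writing)
{
    mb->cmd = (xaddr & ~CACHE_WRITE_BIT) | (writing ? CACHE_WRITE_BIT : 0u);
    while (mb->cmd != 0)            /* cache cog clears cmd when the line is ready */
        ;
    return (uint8_t *)(uintptr_t)mb->line;
}
```

The point of handing back a whole line is that subsequent accesses within it stay entirely on-cog; only a miss costs another round trip to the cache cog.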
The cache line pointer has a drawback for multiple cog access. It uses the write-back method. A write-through method would probably be necessary for multiple cog accesses.
Write back is a problem only if you have multiple caches - but with one shared cache it should be ok.
A different type of caching method has been stirring around in a few heads where the cache manager actually lives in the XMM kernel. This would allow just telling the cache cog to load or save a line. We have room for that in our XMM code, but haven't had time to develop it.
Yes, I've thought about something similar myself - kind of a "universal" FCACHE. But I was going to save this for the Prop 2, since it has much faster access to Hub RAM.
The Catalina code that uses that interface is in Cached_XMM.inc (in the Catalina target directory).
I see the access code, but I can't find the cache line management code - without which, you may as well be accessing an element at a time. A primary key to performance is not having to always go "off COG" to get the data.
Correct - we have discussed this before. Catalina currently accesses the cache an element at a time, because I want to run C programs on multiple cogs. This technique works, but it is slow (but remember the cache is not Catalina's "normal" XMM access mechanism - the cache is only used with slow XMM RAM).
But I have since decided that it is probably better to have two solutions - one fast one for single kernel access (and here the cache access can indeed be speeded up), and another solution (perhaps slower) for when multiple kernel access is required. I don't think the same solution can work efficiently in both cases (it might be possible, but I have more work to do before I will know for sure - if you already have a solution for this I'd be happy to hear about it!).
Anyway, now that I think we have correctly identified the common cache interface we both use (for potential inclusion in a new "cog to cog" communications standard), this is now getting a bit off-topic. Perhaps we could take the details of caching off to another thread?
I can't, for the life of me, understand why Spin/PASM objects cannot be used in situ by C programs without translation. After all, Spin byte-code programs are nothing more than data for just another PASM program (i.e. interpreter).
I'm sure this has been addressed here already - I've had a hard time following this thread with only a mobile phone - but here is my take on it.
1) Spin is compiled to byte codes which are interpreted, as you know. As a result there is no possible "linkage" between objects/methods in Spin and any other language compiled to LMM or native COG code or whatever. It was just not designed that way. Such a use case was never in mind in the design of Spin. It would require some major mods to the Spin compiler/interpreter to make it so.
2) That leaves communicating between a C program and a Spin program running on a different COG via some shared variables in HUB. Again we have a "linkage" problem. How would the C compiler know where the Spin compiler has put those variables/buffers etc.? The Spin compiler has not been built to provide that information for external users.
3) I'm not sure, but I have an idea that Spin byte code programs are not relocatable. That is to say, the thing is built to be loaded and run from address zero in HUB (anyone know better?). So just including a bunch of bytecodes into a C program and expecting them to be runnable by the interpreter at any old address may not be possible.
4) That leaves the possibility of loading and running a Spin program, as normal, which happens to start up an LMM kernel to run the C code, which we can locate pretty much anywhere in HUB. The Spin program passes pointers to the LMM kernel so as to provide some kind of "linkage" back to the running Spin.
This is actually what Zog does. Spin starts Zog in a COG which runs the C code. Spin provides Zog some pointers as hooks so that C code can send commands to the Spin program for I/O etc.
This is also what happens in GCC. There are a handful of Spin byte codes that get the LMM kernel running, although they are hard coded into the C run time startup sequence rather than being compiled from Spin source. So potentially GCC could adopt the Zog approach as well.
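A minimal sketch of what those "hooks" might look like from the C side (the field names and request codes here are invented for illustration - Zog's real interface is not reproduced):

```c
#include <stdint.h>

/* Hypothetical "hooks" block.  The Spin program fills this in, then starts
   the LMM (or ZPU) kernel cog with PAR pointing at it.  Names, layout and
   request codes are invented for illustration - this is not Zog's actual
   interface. */
typedef struct {
    volatile uint32_t request;   /* 0 = idle; the C side posts a request code here */
    volatile uint32_t param;     /* e.g. character to print, block number, ...     */
    volatile uint32_t result;    /* filled in by the Spin side                     */
} spin_hooks_t;

enum { REQ_NONE = 0, REQ_PUTC = 1, REQ_GETC = 2 };   /* assumed request codes */

/* C-side helper: ask the Spin program (still running in its own cog and
   polling this block) to print one character on its terminal driver. */
static void sys_putc(spin_hooks_t *hooks, char c)
{
    hooks->param   = (uint32_t)c;
    hooks->request = REQ_PUTC;
    while (hooks->request != REQ_NONE)   /* Spin clears it when serviced */
        ;
}
```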
And if GCC can't accommodate such a PASM program/data structure, I have to wonder what the he!! good it is.
The interesting part of most Spin drivers/devices is the PASM part, where you need the speed and would rather be able to use the code than waste time rewriting. Often the Spin parts are thin wrappers to provide an interface API that is easily translated to C/C++.
Problem is, as has been discussed here, that damn "linkage" issue again. Often the Spin part of a driver will set up longs in the DAT section prior to loading it to a COG, for example. Again the C system has no way to know what or where those LONGS are within the binary PASM it wants to load. Upshot is that the PASM part needs reworking to get rid of those data items shared with Spin and make sure everything is passed in and out of the COG code via PAR.
All of which brings us to the point of this thread. How, and should we, agree on a standard for doing that?
Bottom line, in my view, is that for Parallax to expend a great deal of time and effort on getting GCC, or any other compiler, to inter-operate with Spin is pretty much a waste of time. It's just not going to be used. And don't forget, the Prop II will have no Spin interpreter in ROM, so why would we want to add that baggage to a C system?
On the other hand, as I have said before, all "gold standard" Spin/PASM objects should be required to be written in such a way that the PASM can be used without the Spin wrapper.
Problem is, as has been discussed here, that damn "linkage" issue again. Often the Spin part of a driver will set up longs in the DAT section prior to loading it to a COG, for example. Again the C system has no way to know what or where those LONGS are within the binary PASM it wants to load. Upshot is that the PASM part needs reworking to get rid of those data items shared with Spin and make sure everything is passed in and out of the COG code via PAR.
It is possible to do this, but it is not easy. For instance, Catalina has to process the PASM listing file to do some of the things it does with PASM symbols, and it could do the same for Spin symbol information. Mind you, I'm not proposing this - it isn't really germane to the subject of this thread - just pointing out that it can be done.
All of which brings us to the point of this thread. How, and should we, agree on a standard for doing that?
Bottom line, in my view, is that for Parallax to expend a great deal of time and effort on getting GCC, or any other compiler, to inter-operate with Spin is pretty much a waste of time. It's just not going to be used. And don't forget, the Prop II will have no Spin interpreter in ROM, so why would we want to add that baggage to a C system?
On the other hand, as I have said before, all "gold standard" Spin/PASM objects should be required to be written in such a way that the PASM can be used without the Spin wrapper.
Excellent point! I agree there is no need to modify Spin - we just need to come up with a common way to invoke the PASM components of various Spin objects - which would then become independent of the Spin code!
I hadn't made the connection to the "gold standard", but you are quite right there too - that is exactly what we should be aiming at, to make the "gold standard" objects re-usable from any language!
I believe this would really make the Propeller a sure-fire winner!
Catalina has to process the PASM listing file to do some of the things it does with PASM symbols
Cunning and devious.
However it relies on having BST or HomeSpun to compile things. Both closed source programs. Neither from Parallax.
This really does not sit well with GCC.
Should Parallax fix up the Prop Tool to provide listings or symbol tables and then opensource it just for this?
I hadn't made the connection to the "gold standard", but you are quite right there too - that is exactly what we should be aiming at, to make the "gold standard" objects re-usable from any language!
What? You agree with me? (Heater checks that there is not something wrong with his argument, no, OK).
I have been harping on about this Gold Standard issue for ages, ever since I found Zog was going to have this problem.
I believe this would really make the Propeller a sure-fire winner!
Yep. The Prop has no peripheral devices in the normal sense. Many potential users will skip over it just because of that.
BUT with a catalog of such PASM drivers/devices it has an infinitely flexible array of peripherals.
Isn't that what this thread is about?
P.S. I know you are also angling for standardizing general purpose COG to COG communications in terms of "parallel processing" rather than the "main program plus peripherals" model. So far I don't see the need for it. That is not a normal use case for the Prop.
Continuing from where I left off in the last post...
...unique capabilities of the Propeller - i.e. the ability to use multiple cores as "soft" peripherals, or to do parallel processing.
But this is important - supporting models other than "one main program and a bunch of drivers" allows us to take full advantage of the two "killer" advantages of the Propeller - absurdly simple multi-processing, and amazingly flexible soft peripherals. No other chip on the market seems to (currently) be able to offer this combination.
Firstly I think that "one main program and a bunch of drivers" is the mode that the overwhelming majority
of Propeller applications will adopt. Why?
Despite having 8 COGS the Prop is not a general purpose parallel processing machine because:
1) To do work fast you are confined to COG and message passing through HUB. That means each of your parallel
threads has to be very small (496 instructions). Not very useful.
2) To allow a more useful size of code in each thread requires moving to Spin or LMM or such executed from HUB.
Then you have killed your speed and might as well go out and by a faster single processor chip to do the same
work with less hassle.
My feeling about the may change with the Prop II but it looks unlikely. Still only 8 small COGs for parallel processing.
Good for peripherals and driver like code with no interrupts and predictable timing, not so good for big parallel processing.
As for "No other chip on the market seems to (currently) be able to offer this combination." I guess you have
missed the chip that shall not be mentioned here which does all you require with a turbo charger and bells on:)
One thing I'm trying to get away from is the model of "one main program and a bunch of drivers". This is of course an important model that must be supported (it is inherent in all C programs for example) - but if that's all the Propeller offers then it will always be doomed to suffer by comparison with other microcontrollers (which shall remain nameless) which already do that kind of thing - and do it better, faster and cheaper than either the Propeller 1 or the Propeller 2 can.
Perhaps the thread name is misleading -- "communicating with cogs from any language" makes it sound like the cogs are somehow subsidiary. So what you really want is a standard for "inter-program communication"?
It seems to me that there are 3 things that need communicating:
(1) The initial parameters to the new program.
(2) Parameters and results for requests to the program (if the program offers a service).
(3) What services are available in the system.
I think that item (1) is best handled by passing a pointer to a list of parameters in PAR. This is simple, already implemented in many drivers, and can handle pretty much all cases. I know that you've proposed passing a pointer to the registry instead, but I disagree -- I think that if the cog driver needs access to the registry, pass a pointer to the registry as one of the parameters. Otherwise, or at least in the current setup, the cog driver has to figure out its own cogid (simple enough) and then look it up in the registry (not so simple, unless the registry is one service per cog, but that's not a general enough model for all services).
I like the Catalina model of request blocks for item (2). Note that the initial parameter pointer in (1) need not necessarily be the same as the request block/result returned space, although in common cases it will be.
Finally, a registry of some sort makes sense for what services are available in the system. However, I do not think we want to rely on there being a 1-1 mapping between "cogs" and "services". All that clients care about is how to make their requests, so it's perfectly OK for a cog to provide multiple services, or for a service to require multiple cogs.
I'd suggest that the registry be either a NULL terminated list of services, or a linked list. I can see benefits to both; the NULL terminated list consumes slightly less memory, so it's probably the best choice. Each registry entry should consist of:
(1) The length of this entry (1 byte): this allows us to add additional data should it be required. Programs parsing the registry will always use the length field to advance to the next registry entry.
(2) The service type (1 byte?)
(3) A pointer to the request block location for the service (2 bytes in the Propeller 1, since it has to be a hub memory address).
(4) Any additional service type specific information that may be required (length is variable, 0-250 bytes).
We can certainly re-arrange the order or change the sizes of these fields, although using sub-byte sized fields will introduce complexity in the parsing code.
The registry is terminated by an entry with a length of 0.
The remaining open question is how services are told about where the registry is. My suggestion would be that by convention the first parameter (the thing pointed to by PAR on entry) be a long word with 0xFF in the high byte and a pointer to the registry in the low bytes. When the service has finished initializing itself it writes a 0 to that first parameter to indicate this. This matches up closely with the way we handle service requests -- in some sense the initialization is a "request" with type 0xFF and parameter "registry pointer".
Services which care about the registry would be able to read it and parse it. Cog code which doesn't care about the registry would simply skip over that word.
The registry itself would be set up by the service that is loading and running the cogs. This will normally be written in some high level language, so the overhead of registry initialization would not be too important (and we may be able to figure out some clever way of re-using the initialization code space).
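To make the proposed registry format and startup convention concrete, here is a sketch in C of how a client might walk such a registry. The field offsets follow the sizes proposed above; everything else - names, the lookup-by-type helper - is just illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Registry entry layout as proposed above (byte-packed, variable length):
     +0  length of this entry in bytes (0 terminates the registry)
     +1  service type
     +2  request block hub address (2 bytes, little-endian, as on the Prop)
     +4  optional service-specific data (length - 4 bytes)
   Only the sizes come from the proposal; the helper below is illustration. */

/* Find the request block for the first service of the given type,
   or NULL if no such service is registered. */
static volatile uint32_t *find_service(const uint8_t *registry, uint8_t type)
{
    const uint8_t *entry = registry;

    while (entry[0] != 0) {                          /* length 0 ends the list   */
        if (entry[1] == type) {
            uint16_t reqblk = (uint16_t)(entry[2] | (entry[3] << 8));
            return (volatile uint32_t *)(uintptr_t)reqblk;
        }
        entry += entry[0];                           /* length field skips ahead */
    }
    return NULL;
}

/* The proposed startup convention: the long pointed to by PAR holds 0xFF in
   its top byte and the registry address in the low bytes; the cog writes 0
   there once it has finished initialising. */
#define IS_INIT_REQUEST(first_param)  (((first_param) >> 24) == 0xFF)
#define REGISTRY_ADDR(first_param)    ((first_param) & 0x00FFFFFFu)
```

Because every entry carries its own length, a client can step over entries (or trailing fields) it doesn't understand, which is the point of making the length field mandatory.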
P.S. I know you are also angling for standardizing general purpose COG to COG communications in terms of "parallel processing" rather than the "main program plus peripherals" model. So far I don't see the need for it. That is not a normal use case for the Prop.
I agree it is not currently a normal use case for the Prop - but it should be. It is one of the advantages that the Prop offers that is not easily matched (or beaten) by its competitors.
Perhaps the thread name is misleading -- "communicating with cogs from any language" makes it sound like the cogs are somehow subsidiary. So what you really want is a standard for "inter-program communication"?
Perhaps - but the basic case (which is where there is one main cog and the other cogs are subsidiary) is so important - i.e. used by >90% of programs - that it is worth considering that to be the main case. The other cases (i.e. having multiple main cogs and multiple subsidiary cogs, or having all cogs equal) are definitely of less importance to most users.
It seems to me that there are 3 things that need communicating:
(1) The initial parameters to the new program.
(2) Parameters and results for requests to the program (if the program offers a service).
(3) What services are available in the system.
I think that item (1) is best handled by passing a pointer to a list of parameters in PAR. This is simple, already implemented in many drivers, and can handle pretty much all cases. I know that you've proposed passing a pointer to the registry instead, but I disagree -- I think that if the cog driver needs access to the registry, pass a pointer to the registry as one of the parameters. Otherwise, or at least in the current setup, the cog driver has to figure out its own cogid (simple enough) and then look it up in the registry (not so simple, unless the registry is one service per cog, but that's not a general enough model for all services).
I disagree. My main reason is that with your proposal (which is essentially the existing mechanism), you have to decide in advance whether you are going to need the registry. If you decide to change your program later and require registry access, you have to change your parameter block and all your initialization code - nasty! It is better to always pass the registry and let programs get their parameter block from that. There is no overhead to this, and it means every program uses the same initialization code.
I like the Catalina model of request blocks for item (2). Note that the initial parameter pointer in (1) need not necessarily be the same as the request block/result returned space, although in common cases it will be.
Finally, a registry of some sort makes sense for what services are available in the system. However, I do not think we want to rely on there being a 1-1 mapping between "cogs" and "services". All that clients care about is how to make their requests, so it's perfectly OK for a cog to provide multiple services, or for a service to require multiple cogs.
I'd suggest that the registry be either a NULL terminated list of services, or a linked list. I can see benefits to both; the NULL terminated list consumes slightly less memory, so it's probably the best choice. Each registry entry should consist of:
(1) The length of this entry (1 byte): this allows us to add additional data should it be required. Programs parsing the registry will always use the length field to advance to the next registry entry.
(2) The service type (1 byte?)
(3) A pointer to the request block location for the service (2 bytes in the Propeller 1, since it has to be a hub memory address).
(4) Any additional service type specific information that may be required (length is variable, 0-250 bytes).
We can certainly re-arrange the order or change the sizes of these fields, although using sub-byte sized fields will introduce complexity in the parsing code.
The registry is terminated by an entry with a length of 0.
I do like the idea of a variable length registry, where services are registered rather than cogs. My main concern is that it would be much more complex to manage than a simple fixed length registry. Even the Catalina model is too complex for many - do you think the additional complexity is worthwhile, and will not just drive people away?
The remaining open question is how services are told about where the registry is. My suggestion would be that by convention the first parameter (the thing pointed to by PAR on entry) be a long word with 0xFF in the high byte and a pointer to the registry in the low bytes. When the service has finished initializing itself it writes a 0 to that first parameter to indicate this. This matches up closely with the way we handle service requests -- in some sense the initialization is a "request" with type 0xFF and parameter "registry pointer".
Services which care about the registry would be able to read it and parse it. Cog code which doesn't care about the registry would simply skip over that word.
The registry itself would be set up by the service that is loading and running the cogs. This will normally be written in some high level language, so the overhead of registry initialization would not be too important (and we may be able to figure out some clever way of re-using the initialization code space).
As I mentioned above, by far the simplest method is just to pass the registry, and let the cogs figure out everything else for themselves. For cogs that don't really need the registry, the overhead can be zero if you really need it to be, and the advantage is that it makes all cogs potentially equal.
This model is also the simplest to implement, to explain and also - importantly! - to get people to comply with. If you want something more complex, I think you need to be able to justify the potential benefits.
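For contrast with the variable-length registry sketched earlier, here is what the simpler arrangement described here might look like - a small fixed-length registry whose address is always passed via PAR, with each cog finding its own parameter block through it. Again, the layout and names are assumptions, not an agreed format.

```c
#include <stdint.h>

/* Hypothetical fixed-length registry: one long per cog, each holding the hub
   address of that cog's parameter/request block (0 = unused).  This is the
   simple "8 longs of hub RAM" style of registry rather than the variable-
   length one sketched earlier; layout and names are assumptions. */
#define N_COGS 8

/* A driver cog is handed the registry address (e.g. via PAR) and, knowing its
   own cog ID (Spin's cogid, for instance), finds its parameter block through
   the registry instead of being given it directly. */
static volatile uint32_t *my_params(const volatile uint32_t registry[N_COGS],
                                    unsigned my_cogid)
{
    return (volatile uint32_t *)(uintptr_t)registry[my_cogid];
}
```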
The scope of compliance is important. Where does one need to comply? Driver only? O/S? xyzzy design?
I really don't need a registry for most of my projects. If I had an O/S related project I might care about it.
I realize it may sound circular, but I think this is self-defining - i.e. if you want to use a "standard" plugin (driver or otherwise) in your project, and all plugins use a common registry, then you need a registry. Otherwise you don't.
The assumption is (of course) that the overhead of having a registry - even if your project only has one plugin that needs it - is low enough that you don't ever have to even worry about it. How many Propeller programs have ever been written that couldn't spare 8 longs of Hub RAM?
Also, the mechanics of it have to be simple and foolproof - for instance, in Spin (which has no default initialization code and no concept of drivers) you would have to include one standard object and make one call to it (i.e. something like Registry.initialize). Then of course, for each object that needs it you must make a call (i.e. something like Plugin.Register). In other languages - such as C - you wouldn't even need to do that - it is all handled for you by the C runtime.
Just now digging into your early a.m. posts. (Do you ever sleep?) I think the answer to the points you raised about Spin objects accessible to C would be a Spin compiler that produces relocatable object modules instead of monolithic code. That way, exported symbols (CONstants and PUBlic methods) could be expressed in the object module in a way that they could be knitted into a combined load file with C (or whatever) object modules. I hate to suggest two Spin compilers: one for Spin/PASM-only apps and one for use with the GCC environment; but that may be what's necessary and is probably preferable to a Spin-to-C translator.
Just now digging into your early a.m. posts. (Do you ever sleep?)
Yes I sleep from about 11PM to 6AM most nights. I can't figure out what posts of mine you're talking about though.
SPIN/PASM and C/CogC or any other language can easily coexist as long as there are APIs. If there were a way to export symbol information for an API, that might be helpful. I don't think we really need a fully developed relocatable object format though. Something like BSTC's list file can easily be parsed by some Perl guru for exporting symbol data into a shareable header. Whatever open-source SPIN compiler Roy comes up with ... whenever that happens ... would be easier to change to provide symbol info in whatever format is preferred, by whoever is willing to take those bits and do them.
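For instance, a header generated that way might be no fancier than the following (the file name, symbols and values are entirely hypothetical - nothing produces this today):

```c
/* Hypothetical FullDuplexSerial_syms.h, generated by a script that parses the
   Spin compiler's listing file.  The symbols and values below are invented
   for illustration; a real generator would emit whatever the listing says. */
#define FDS_DAT_OFFSET        0x0030    /* start of the PASM image in the object */
#define FDS_DAT_SIZE          0x01F0    /* size of the PASM image in bytes       */
#define FDS_CON_BAUD_DEFAULT  115200    /* an exported CONstant                  */
```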
I must have missed it. Is there any documentation?
Ross.
The conceptual design is described here: http://code.google.com/p/propgcc/wiki/PropGccExternalMemory
Actual implementation description needs to be added.
Just thinking about the multi-processing aspect of things.
I think it would only make sense if they used separate pins - which starts to be expensive in pins if you had to have one per cog.
Ross.
Two Basic Pictures
In the first picture a registry has zero value. In the second it has value.
What causes us to need a registry in your view? O/S or other?