While I think the new CLUT instructions sound great, and it certainly will be nice to have 64 more longs available to each COG, I wonder if it will be as useful as we think for implementing high-level languages. First, a 64-entry stack is kind of small for C, and the memory isn't directly addressable, so it won't be possible to take the address of a variable on the stack. It could certainly be used in the prologue of functions for saving and restoring registers, but using it for function parameters and local variables may not work out well. Has anyone (Ross maybe) thought through how the CLUT would be used in a C/C++ calling sequence?
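To make that concrete, the register-save case might look something like this - a minimal sketch only, since the CLUT stack instructions aren't finalized yet (cpush/cpop below are hypothetical mnemonics, and r4, r5 and lr are assumed LMM kernel registers):

        ' Hypothetical prologue/epilogue that uses the CLUT as a save stack.
func_prologue
        cpush   lr              ' save the caller's return address
        cpush   r4              ' save callee-saved registers
        cpush   r5
        ' ... body: parameters and locals still need a hub frame,
        '     because CLUT entries can't be addressed ...
func_epilogue
        cpop    r5              ' restore in reverse order
        cpop    r4
        cpop    lr              ' then return via the LMM kernel using lr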
I was just told the CLUT is 128 x 32 bits. Still, that's not much for an HLL stack.
Zog is going to love this as a place to keep the bytecode dispatch table. Must be faster than HUB access and tidies everything up into the COG anyway.
Off-topic? Well, it's Prop II and GCC related anyway.
Someone remind me what's on the plans for stacks and/or indexed addressing into HUB memory?
Seems that would be more useful to LMM compiler builders. In fact, it would shave time off emulations, VMs, and plenty of other code.
Oh yeah, I guess that makes more sense. 256 16-bit CLUT entries, which would give us 128 32-bit longs. Is that correct? Seems like 256 32-bit entries would be needed to do 24-bit color, or is the Propeller 2 limited to 16-bit color? Anyway, 128 longs is a bit better than only 64.
It only took a couple of minutes to see how to use it as a FIFO... see post #427 in that thread.
Now if only you and Kye would get the hub/cog pointer-related instructions (i.e. auto inc/dec, offsets, etc. for hub, and pointer access instructions) documented and posted....
The 128-long CLUT is great as a return stack; however, as David notes, it is far too small to use for stack frames (arguments, local vars).
I think ZOG and other VMs will get a huge speed boost from just using it as a return stack, and stack-oriented languages could use it as both a return stack and an expression evaluation stack - which should speed up ZOG and FORTH immensely!
Stack frames for parameters and local variables will still have to be placed in hub or XMM.
David - I wonder how difficult it would be for you to modify your XBASIC VM to separate stack frames from return addresses and expression evaluation. If it is not too difficult, you would get a tremendous speedup from the dual-stack approach I suggest, and it should (famous last words) be good for a return-address stack around 100 calls deep and an evaluation stack up to 28 entries deep.
(A modified Spin VM would get a big speed boost from doing this as well)
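A minimal sketch of what I mean, with clpush/clpop standing in for whatever the final CLUT stack instructions end up being called (everything here is hypothetical - vm_pc, frame_ptr and friends are just assumed VM kernel registers):

        ' VM CALL: the return address goes to the CLUT return stack,
        ' while the argument/local frame stays in hub RAM.
vm_call
        clpush  vm_pc           ' push the bytecode return address
        mov     vm_pc, target   ' branch to the callee's bytecode
        jmp     #vm_loop
        ' VM RET: just pop the return address back.
vm_ret
        clpop   vm_pc
        jmp     #vm_loop
        ' Locals and arguments are still reached through a hub frame pointer.
vm_load_local
        mov     addr, frame_ptr
        add     addr, offset    ' offset of the local within the hub frame
        rdlong  tos, addr       ' locals still come from hub RAM
        jmp     #vm_loop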
xbasic can probably just use the CLUT for all of its stack needs. I don't think most basic programs are heavily recursive nor do they use lots of local variables.
Bill, others:
I am working on getting descriptions of all the new instructions and registers. Chip is feeding me the info and I am writing up descriptions and then having him review it (that is how I did the earlier post with the CLUT stack instructions). Hopefully later tonight I will have a post with most if not all of the new stuff listed with some form of description. It won't be fully detailed (datasheet like) stuff, but enough for you to have an idea what each one does.
Roy
That sounds like just what is needed! Thanks, Roy!
Is anyone working on updating one of the Propeller 1 simulators to handle the new Propeller 2 instructions? I guess spinsim would be an obvious choice? I think it would be great to have a simulator for P2 so that we can actually try running code with the new instructions. I worked at another company where we designed a processor core, and we used a simulator early in development to refine the instruction set in advance of having silicon. It gave us the opportunity to adjust the instructions as needed to allow the code to run efficiently.
I would offer to work on this but I don't want to step on anyone else's toes. For instance Dave Hein may want to update his own simulator. Does anyone know if he is working on this or planning to?
I am planning on adding a Prop 2 mode to SpinSim. I'm waiting for more detail on the instruction set. SpinSim does allow setting a larger RAM size with the -m option. The ROM is implemented as a 32K chunk of RAM that is initialized to the Prop's ROM contents. It remains at address $8000 even if the RAM size is larger than 64K. However, I'll change this for the Prop 2 mode once the details are released about the ROM.
SpinSim does have an undocumented -t option that is used to enable the Prop 2 mode. However, it currently only adjusts the number of instructions executed per cycle and the hub access window for each cog.
At UPEW, during his talk, Chip said that there will only be 512 longs of ROM for a boot loader. I have no idea what hub address (if any) it will have.
I think this is a great approach; I far prefer more HUB RAM, as I don't use the ROM fonts, and the ROM sin/cos etc. tables are made obsolete by the cog tables, CORDIC, etc.
It will be great to have SpinSim simulate the Prop2 once the instruction documentation is available.
Has anyone (Ross maybe) thought through how the CLUT would be used in a C/C++ calling sequence?
Yes, I've given it some thought. It could be used, but there doesn't currently seem to be any great benefit that would not be outweighed by the additional complexity. I'm waiting to see the other instruction changes Chip mentioned, which I think will make using the CLUT as a stack unnecessary in any LMM language that uses stack frames (Pascal/C/C++ etc). It is likely to be more useful in other languages (such as Basic, as others have pointed out). In an LMM language the CLUT may be more useful just as additional storage.
Still, I don't see any need to rush into any decisions on this. My current plan is to first do a minimalist port of Catalina to the Prop II, which should be ready pretty much as soon as there is a SPIN/PASM compiler available, then do "incremental" improvements to Catalina as we learn more about the pitfalls and benefits of the new instructions.
I've been doing some more work on LMM2, and so far the simple kernel skeleton I posted above looks like the best bet.
I did some simulations on a few other approaches:
Paging LMM2:
Essentially paging in large blocks of code, executing within the cog, and fetching the next block when execution branches out of scope or calls a function. I believe Cluso99 and some others on the forum have also tried this approach, as it is an obvious alternative to fetching one instruction at a time the way LMM1 does. I really liked PhiPi's reverse loader variant.
This works great on small, trivial code, returning great benchmark results, but fails horribly on large programs due to thrashing when the locality of reference is not ridiculously high.
Thunk LMM2:
Uses RDQLONG to always read four consecutive longs, and runs linearly through them, potentially in a REPEAT loop.
Works reasonably well for linear code, but wastes a lot of memory due to the "bubbles" of NOPs that would form. This is similar to the problem early VLIW machines had with not being able to use all of their functional units. (VLIW = very long instruction word)
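To illustrate the bubble problem: every hub packet is four longs, so any packet that branches early carries dead padding - roughly like this (the mnemonics and labels are just placeholders for compiled user code):

        ' One four-long packet as fetched by RDQLONG.
packet_n
        add     r0, r1          ' slot 0: useful work
        jmp     #LMM_jump       ' slot 1: branch leaves the packet here
        nop                     ' slot 2: bubble - padding, never executed
        nop                     ' slot 3: bubble - padding, never executed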
LMM2.5:
This is the variation I initially suggested to Chip; however, it is too big a change, and only somewhat faster than my simplified LMM2 posted above. It also has the bubble problem, though not as badly as my Thunk LMM2. Fortunately, my alternate suggestion to Chip may make it in - Chip thought it would fit, and it would greatly improve LMM2. Chip's additions to my suggestion would improve all other VMs.
As I posted earlier, I am not comfortable posting that idea yet: not only has Chip not confirmed it can go in, but I have also found a potentially serious pipeline issue with it (for which I've now found a solution).
The simple LMM2 I posted earlier:
Decent performance, excellent if FCACHE is used to its maximum potential - requires a very thorough understanding of the whole Prop2 architecture to implement properly. I actually figured out how to eliminate almost all of the kernel primitives LMM originally needed! (I am not ready to disclose how yet, as I need a simulator / test chip to test the idea).
I actually figured out another huge speed gain for LMM, but I don't want Ken to send hit men after me for telling Chip and delaying Prop2... it would be huge though, but given the new instruction it would require, I am not sure it would fit - I'd have to hammer it out with Chip first.
The original LMM:
The LMM I designed for Prop1 (which from now on I will call LMM1) is highly inappropriate for Prop2 - I realized this during the initial Prop2 suggestion thread, which I have been tracking for the latest information while updating the LMM2 designs (and trying the alternatives above).
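For anyone who hasn't seen it, the core of LMM1 is nothing more than a fetch-execute loop along these lines (Prop1 PASM; pc and LMM_ins are kernel registers, and real kernels unroll this loop to better match the hub windows):

LMM_loop
        rdlong  LMM_ins, pc     ' fetch the next compiled instruction from hub RAM
        add     pc, #4          ' advance the LMM program counter
LMM_ins nop                     ' the fetched instruction lands here and executes
        jmp     #LMM_loop       ' back for the next one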
Honestly, other than the approaches I outlined above (or a byte/word code interpreter, which would be inherently slower), I cannot see any other way to run C on the Prop without major architectural changes to the cog - essentially turning the cogs into conventional RISC processors, which would delay Prop2 by at least a year.
I am committed to making Prop2 run LMM2 as fast as is possible, so I posted the above so that Ross and others can benefit from my research.
Decent performance, excellent if FCACHE is used to its maximum potential - requires a very thorough understanding of the whole Prop2 architecture to implement properly.
Do you believe any compiler can implement FCACHE calls easily? I fear the reason an instruction cache like FCACHE has not been used so far is the difficulty of getting the compiler to actually emit code that would do the job. ImageCraft implemented a data cache version, but that's different.
I had a blast coming up with LMM in the first place... and seeing you and others adopt it. That was what I wanted... C and other conventional languages on the Prop.
Don't get me wrong, I *REALLY* like the Spin/PASM combination for quick prototyping. If some of Spin's shortcomings are also fixed, it will be even better (second class nature of strings, lack of data structures, lack of full support for floats, lack of #define/#undef/#ifdef/#ifndef/#endif et al)
I've been having a blast with LMM2... which explains why I went a bit overboard with the "RAH RAH" on Prop2/LMM2 - but I've been working on LMM2 since the first differences were posted, and seeing Prop2 finally get close to release...
Ever since you ported Catalina to Morpheus (thank you again) I have been meaning to find time to do a "deep dive" into your version of the LMM kernel to see if I could tweak it for better performance (perhaps with greater FCACHE use <grin>) ... hopefully now that the "Mikronauts Modular System" is here, I can do that soon.
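For anyone following along: FCACHE is just an LMM kernel primitive that copies a short stretch of compiled code (typically an inner loop) from hub RAM into a buffer in cog RAM and jumps to it, so the loop runs at native speed instead of paying the fetch overhead on every instruction. A rough Prop1-style sketch - the label and register names are illustrative only:

FCACHE_load
        rdlong  cnt, pc         ' the long after the FCACHE request holds the block length
        add     pc, #4
        movd    :store, #fcache_buf     ' point the store instruction at the buffer
:copy   rdlong  temp, pc        ' fetch one long of the block from hub
        add     pc, #4
:store  mov     0-0, temp       ' write it into cog RAM (destination patched above)
        add     :store, dest_1  ' advance the destination field by one register
        djnz    cnt, #:copy
        jmp     #fcache_buf     ' run the block natively; its last instruction
                                ' jumps back into the LMM loop
dest_1  long    1 << 9          ' +1 in an instruction's destination field
fcache_buf
        res     64              ' cog RAM buffer for the cached code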
Ross, what was your experience with FCACHE?
I am in the happy position of being able to agree completely with both you and Bill:
Yes, FCACHE could give you substantial performance improvements if you could make effective use of it.
Yes, writing a code generator that could make effective use of FCACHE would be extremely difficult - I am tempted to say impossible, but you know where saying things like that gets you when discussing the Propeller.
I've been having a blast with LMM2... which explains why I went a bit overboard with the "RAH RAH" on Prop2/LMM2
If you were not totally overboard with this particular "Bill Henning brand" of enthusiasm (not intended as an insult), I would be concerned that you didn't care. I'm glad that you care.
http://forums.parallax.com/showthread.php?125543-Propeller-II-update-BLOG&p=1002844&viewfull=1#post1002844
Hi Roy,
Yes, I saw that. I wasn't planning on using any CLUT instructions until we get verification that they will actually make it into the silicon.
However, I'm very interested in the details of the read/write hub instructions.
Ross.
Is this the same Chip who said we would have 256k on the Prop II?
Ross.
Note to self... stay away from all busses... unless driving an Abrams M1 or equivalent
LOL
No rest for nice guys!
Thanks Roy
But no worries, wifey wants Hawaii next April...
That is exactly what I need.
Hi Bill,
Thanks for the update - I look forward to seeing more detail when you get a chance to try some of your ideas out.
It's already clear that your contribution to the development of LMM2 will be as significant as it was to the development of LMM1.
Ross.
Ross, you are correct.
I have learned never to say "impossible" when talking about the Propeller.
PS - As you know, I don't like C - but I do like your contribution to the community of a very good C compiler.