Yes, I'm sure the 432 had a lot fewer transistors and connections than a modern CPU. But from a user perspective it was not simple.
The PC as it is today is horribly complex, both architecturally, in the number of layers and protection schemes, and for compatibility reasons, config is a nightmare. So much has to be hidden away or no-one would be able to use the beast.
The PC's success is not through technology/performance nor innovation nor simplicity. The PC's success is through being the only platform offering hardware competition with your favourite software a disc copy away. Once established and no other competitor in sight, the rules can be changed.
It's the same formula being played out over and over again online these days. Make a new start-up, give away the product to gain a user base, get some serious free advertising by getting on the news about your success, and voilà, you can change the rules and start charging for the product ...
With the ability to run up to 32 'threads' do we need more than the current 8 locks?
I also see hubexec opening the door for some new ways of doing things that might make having more locks available be useful.
One example would be objects that have FIFOs, like a UART.
The object that owns the FIFO would expose hubexec code that a caller executes to add something to the FIFO:
1. Get a lock for access to the FIFO 'Add Character' method.
2. Put the character in the location defined by the object as the parameter location.
3. Execute the 'Add Character' method.
4. Release the lock.
Since these 'exposed methods', 'remote methods', or whatever we call them run in the calling COG, we will need to define some rules so they play nicely and don't hose the caller's register space.
I think one of the things might be that one of the parameters passed with the call is a COG address that defines the beginning of a block of COG space that may be used by the method.
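To make the lock / parameter / call / release sequence concrete, here is a minimal sketch in Python rather than PASM. Everything in it is an assumption for illustration: `UartFifo`, `add_char` and the `param` slot are made-up names, and `threading.Lock` merely stands in for one of the hardware locks.

```python
import threading

class UartFifo:
    """Hypothetical stand-in for an object that owns a FIFO."""
    def __init__(self, size=16):
        self.buf = [0] * size
        self.head = 0                  # index of the oldest entry
        self.count = 0                 # number of entries currently stored
        self.param = 0                 # the 'parameter location' callers write to
        self.lock = threading.Lock()   # stands in for one hardware lock

    def _add_character(self):
        # The 'exposed method': runs in the caller's context and picks the
        # character up from the parameter location.
        if self.count == len(self.buf):
            return False               # FIFO full
        self.buf[(self.head + self.count) % len(self.buf)] = self.param
        self.count += 1
        return True

def add_char(fifo, ch):
    with fifo.lock:                    # get the lock
        fifo.param = ch                # put the character in the parameter location
        return fifo._add_character()   # execute the method; lock is released on exit
```

Without the lock, two callers could overwrite each other's character in `param` before the method runs, which is why a given call has to complete before another caller starts.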
I can't say I'm in favour of the fixed time-slicing being used for general multitasking. Trying to treat the hardware threads as 32 Cogs is going to lead to trouble.
In a general computing sense I would agree.
I do think they will be very useful for letting a single COG act as multiple low level peripherals like UARTS, keypad scanners, etc.
In that situation you have a managerial task, possibly in HubExec mode, and up to three subordinate threads hidden away. The managerial task does the wheeling and dealing.
What does that have to do with the discussion of possibly having more than 8 locks available and the 'external methods'?
The point is that there are potentially more things going on at a time on the P2 than there were on the P1 that might need to use locks, and it might be useful to have more available.
The example of the 'external methods' is one of the cases where a lock would be useful since in most cases a given call would need to complete before another caller calls the same method.
Finally, there really isn't some grand vision or path in place like Intel and others have had.
I could turn that around and say that Intel never had a "grand vision".
They made an 8080 processor to fit the requirements of some calculator manufacturer.
They bolted some hacks on to that to make a 16 bit machine supporting more memory (Segments bletch)
Hack upon hack it grew until we had the 32 bit Pentium class machines.
Meanwhile Intel was trying out the 432, the 860 and the Itanium.
It was AMD who showed how to continue the hacks into the 64 bit world.
evanh,
That's a side effect of x86 having gone RISC as of the 80486.
Nah, there is nothing "RISC" about the 486 instruction set as used by compilers.
Besides, if you write a compiler for a normal block structured language like Fortran, C or Pascal you don't need much instruction set support. Loads and stores, stack ops, arithmetic and logical ops etc. Making use of any other special instructions is a lot more work.
Now we have things like SIMD instructions. They require the compiler to analyse your source carefully to see if they can be used to good effect.
evanh,
The PC as it is today is horribly complex, both architecturally, in the number of layers and protection schemes,
No it is not. Well, no more than it was twenty years ago.
OK back in the 486 days I had to write code that would take a 486 from reset to running in 32 bit protected mode. And then run a bunch of threads in their own memory spaces.
It was not so trivial to do, but not so hard either. I'm sure that same code will run on a 64 bit AMD today.
The PC's success is not through technology/performance nor innovation nor simplicity.
Let's not go there. It's all about Intel and MS and Windows and network effects. The upshot is Intel and AMD have done a good job of pushing the possible performance of silicon based computers to the max. So much so that we are hitting the physical limits now.
ctwardell,
Do we need more locks?
Stop now. I despair. This chip is never going to make it out of the gate.
It may be trivial to increase the number of locks.
One of the things I fear more than not getting it done is not thinking through the impact of all the changes that have been made and finding that their usefulness is hindered for the lack of a few small items.
With the ability to run up to 32 'threads' do we need more than the current 8 locks?
I also see hubexec opening the door for some new ways of doing things that might make having more locks available be useful.
.....
C.W.
+1,
Yes, I am always mindful of the number of locks used in my P1 SPIN projects, which typically use 5 to 7 locks. Maybe with better programming skills on my part I would not need as many locks.
But then, locks are so easy to use and they allow me to spend more time on my algorithm and less time on programming style or efficiency. More Locks would mean I would worry less about accounting of their use.
For P2, I would like maybe, . . 32 locks ? (or more !)
I am always mindful of the number of locks used in my P1 SPIN projects, which typically use 5 to 7 locks,
Are you serious?
A year ago or so Chip asked here if it would be OK to remove the locks altogether. It would save silicon space and he had guessed nobody was using them.
In the ensuing discussion it turned out that a very few people had used locks in a very few cases. Mostly we get by without them.
In most cases we have a producer of data and a consumer. No locks are required for that exchange. For example see FullDuplexSerial and its buffer handling.
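For anyone wondering why one producer and one consumer need no lock, here is a rough sketch of a head/tail ring buffer. It is written in Python for illustration only and is not FullDuplexSerial's actual code; the class and method names are made up. The point is that each index has exactly one writer.

```python
class RingBuffer:
    """Single-producer, single-consumer ring buffer (illustrative names)."""
    def __init__(self, size=16):
        self.buf = [0] * size
        self.head = 0   # written only by the producer
        self.tail = 0   # written only by the consumer

    def put(self, value):
        nxt = (self.head + 1) % len(self.buf)
        if nxt == self.tail:
            return False          # full; the producer retries later
        self.buf[self.head] = value
        self.head = nxt           # publish only after the data is in place
        return True

    def get(self):
        if self.tail == self.head:
            return None           # empty
        value = self.buf[self.tail]
        self.tail = (self.tail + 1) % len(self.buf)
        return value
```

One slot is always left unused so that "full" and "empty" can be told apart from the two indices alone. As soon as there are two producers, both would be writing `head`, and that is where a lock starts to earn its keep.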
In reading the thread I find the following from Chip:
"Good to hear that you guys are getting the intended usage out of the LOCKs. They'll be staying, for sure, then."
and
"mctrivia said...
can we have 32 of them just to round out a long? i can definitely deal with 8. there are ways around getting more through hardware/software combination but they are so convenient and quick this way. not a must but if not difficult would be nice in the rare situation more then 8 are needed.
It's easy to add them. We could."
Most useful (to me) are my locks that allow different COGs to "own and control" fields within a single 20x2 LCD display. Or, allow different COGs to have access to the RTC without conflict. Note: there are about 8 (or so) I2C peripherals on the same bus (see link above).
The way I look at it is: use it if "it" is available or lose it, otherwise use a different programming style.
Perhaps an experienced programmer could show "this hobbyist" a better method.
FWIW +1
I've been using P1 since 2006 and only once have I used a (1) lock in one of my applications.
In hindsight with a few more years of P1 mileage I could probably get away without using it now.
The P2 has Port D and XCH; I can see more flexibility in using this system.
Seeing that the locks are already there, leave them in.
I'm looking forward to the time when "lock" is only used in the sense of "the P2's design is now LOCKED".
I've never used them either, though I have an excellent explanation of how they're used... readily available... just in case. I figure that if I have to use locks then I've programmed myself into a corner. I then spend time trying to get out of that corner.
I'm sure they're useful, just haven't had to use them yet. Could be my inexperience talking. :-)
By all means increase the number of locks, but please don't remove them altogether!
Sure, most people don't need them and have probably never used them, but some of us do and have! If Parallax removed hardware support for locks, then we would all immediately have to write software to emulate them, and I think any emulation of locks in software would require at least 2 hub instructions to implement (whereas lock operations currently take only one).
Also, to emulate them in software would require permanent allocation of some Hub RAM to store the lock state, and this would make it implementation dependent.
My main use case for locks is when a program has multiple producers of multi-long content. I've only had to use a single lock in my programs. It's for a concurrent buffer where multiple cogs may want to write a ~20 long block into a circular buffer.
The LMM interpreter in PropGCC also uses a single hardware lock.
As far as I know it's impossible to make a lock in software without some kind of atomic "test and set" or "swap exchange" instruction that itself performs atomically.
Did the P2 sprout such an instruction yet? If not we need those locks.
No, it's possible on the Prop because of the nature of the hub - but you need at least 2 hub instructions to do it (it can certainly be done in 3 hub instructions, but I suspect there might be a clever way to do it in 2).
The number of locks depends on the number of single resources you need to share between threads and not on the number of possible threads.
A lock is needed if several threads share one resource like an I2C interface. It does not matter if 8 or 32 threads share the single I2C interface, they all wait on the same lock.
The use of locks should be minimized, because it really is a bad thing for parallel processing if two or more threads often have to wait for the same resource. It is more efficient if a hardware resource is accessed by a single thread, and this thread does some preprocessing and then provides the result to other threads if needed.
Perhaps some use them as ordinary flags to signal states between cogs, but for that you can use any hub RAM variable.
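A small sketch of the "one lock per shared resource, not per thread" point, in Python for illustration. The names `i2c_lock`, `bus_log` and `run_workers` are made up, and the list is just a stand-in for traffic on a real bus.

```python
import threading

i2c_lock = threading.Lock()      # one lock for the one shared bus
bus_log = []                     # stand-in for traffic on the bus

def i2c_transaction(worker_id):
    # Every worker, however many there are, waits on the same single lock.
    with i2c_lock:
        bus_log.append(("start", worker_id))
        bus_log.append(("stop", worker_id))

def run_workers(n):
    threads = [threading.Thread(target=i2c_transaction, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Running this with 8 or with 32 workers uses exactly one lock either way. A second lock only becomes necessary when a second independent resource (say, an LCD) is added.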
No, it's possible on the Prop because of the nature of the hub - but you need at least 2 hub instructions to do it (it can certainly be done in 3 hub instructions, but I suspect there might be a clever way to do it in 2).
Ross.
I'll take a stab at the 3 instruction version. In the hub there is a variable whose value is either -1 (free) or cogID (busy). Each cog follows this sequence:
1. (HUB) Read variable.
2. If busy, go to 1.
3. (HUB) Set variable to ID.
4. (HUB) Read variable.
5. If variable is same as ID then access resource. Otherwise go to 1.
6. (HUB) Write -1 to release resource.
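The six steps can be tried out in a toy model. This is only a sketch under stated assumptions: every cog runs identical code, each cog gets exactly one hub access per rotation in fixed order, and cogs start a whole number of rotations apart. Python generators stand in for cogs, each `yield` ends that cog's hub window, and `cog`, `simulate` and `stats` are invented names. It says nothing about cached hub access or any other access mode.

```python
FREE = -1

def cog(cog_id, hub, owners, stats, start_delay, hold):
    for _ in range(start_delay):
        yield                                  # idle rotations before starting
    while True:
        value = hub["lock"]                    # 1. (HUB) read variable
        yield
        if value != FREE:                      # 2. if busy, go to 1
            continue
        hub["lock"] = cog_id                   # 3. (HUB) set variable to ID
        yield
        value = hub["lock"]                    # 4. (HUB) read variable
        if value == cog_id:                    # 5. same as ID: we own the resource
            owners.add(cog_id)
            stats["max_owners"] = max(stats["max_owners"], len(owners))
            stats["acquired"].append(cog_id)
        yield
        if value != cog_id:                    #    otherwise go to 1
            continue
        for _ in range(hold):
            yield                              # use the resource for a few rotations
        owners.discard(cog_id)
        hub["lock"] = FREE                     # 6. (HUB) write -1 to release
        return

def simulate(n_cogs=8, hold=2, delays=None):
    hub = {"lock": FREE}
    owners = set()
    stats = {"max_owners": 0, "acquired": []}
    delays = delays or [0] * n_cogs
    active = [cog(i, hub, owners, stats, delays[i], hold) for i in range(n_cogs)]
    while active:
        for g in list(active):                 # one rotation: one hub window per cog
            try:
                next(g)
            except StopIteration:
                active.remove(g)
    return stats
```

Calling `simulate(...)` and checking that `stats["max_owners"]` never goes above 1 is the interesting part. In this model all competing writes land before the first read-back, so only the last writer sees its own ID.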
Yes, that's essentially the algorithm I had in mind. It can be done in 2 if you only need access to the locked resource for a very limited time (essentially one hub cycle) - and I think there might be a clever way to generalize this.
What if processes A and B both get past steps 1 and 2 when they are claiming a resource which is free?
They both now think that the resource is free.
Process A then does steps 3, 4 and 5 after which it will think it owns the resource.
Process B then also does steps 3, 4 and 5 after which it will think it owns the resource.
Processes A and B then proceed to corrupt the data in the resource.
Now, with knowledge of the round robin nature of the HUB access and assuming we use exactly the same steps to acquire the lock in all processes it might be possible to avoid erroneous interleaving of lock acquisition steps as I have shown above.
But given the prospect of cached hub accesses and a potential "greedy mode" this starts to look impossible again.
What's easy for gurus isn't easy or apparent for the beginner or newcomer to the strange world of the Prop, especially given that the P2 is targeted at the commercial sector (in order to recoup the costs involved; if it doesn't recoup them, forget about the dreamy processor). If the gurus here want to present it in its best light, I'd suggest they sit down and churn out some intro PDFs showing how easy it is to code for the P2 in C and PASM, etc.
I agree, Rod. I'm working with Daniel to make a complete FPGA board that has everything early adopters might need, and nothing more. Once we've got some hardware we can afford to use and distribute we'll all be in a better position to do exactly what you mentioned: create some impressive commercial examples in C, Spin and PASM. Expect us (you, Parallax, community) to put out some challenges to code impressive projects into examples that typical embedded programmers can use, through our Open Propeller Project.
Ken Gracey
There will certainly be some impressive project examples done by some of the P2 gurus. Ozpropdev has done a magnificent job with just one cog! While it's not a commercial program, it certainly exploits the multi-threading of a single cog, and this is prior to hubexec. It is certainly a showpiece of what the P2 can do.
Of course there will be plenty of others too. I know baggers is just itching to show off some graphics.
Nah, there is nothing "RISC" about the 486 instruction set as used by compilers.
RISC is more than an instruction set. It's a philosophy of streamlining that revolves around pipelining. The fixed length instruction sets that were spawned are a side-effect. The way compilers evolved also reflects this RISC focus irrespective of fixed or variable length instructions.
No it is not. Well, no more than it was twenty years ago.
OK back in the 486 days I had to write code that would take a 486 from reset to running in 32 bit protected mode. And then run a bunch of threads in their own memory spaces.
Have you seen the size of a BIOS lately? There is absolutely tons to config! That's part of the deal also.
What if processes A and B both get past steps 1 and 2 when they are claiming a resource which is free?
They both now think that the resource is free.
Process A then does steps 3, 4 and 5 after which it will think it owns the resource.
Process B then also does steps 3, 4 and 5 after which it will think it owns the resource.
Processes A and B then proceed to corrupt the data in the resource.
Now, with knowledge of the round robin nature of the HUB access and assuming we use exactly the same steps to acquire the lock in all processes it might be possible to avoid erroneous interleaving of lock acquisition steps as I have shown above.
But given the prospect of cached hub accesses and a potential "greedy mode" this starts to look impossible again.
Heater,
You are forgetting the deterministic nature of the Prop. What you are proposing simply cannot occur if all cogs are executing the same code.
What I said was "there is nothing "RISC" about the 486 instruction set as used by compilers". Given that the instruction set is much the same as it was on 386, 286, 8086 and those processors were categorized as CISC then I stand by that statement.
Until we come to the x64, which adds 8 additional general purpose registers, that is.
Anyway the RISC/CISC divide has always been a bit woolly and ultimately the debate pointless.
Don't care about the size of the BIOS. We were talking chip architectures.
Comments
The PC as it is today is horribly complex, both architecturally, in the number of layers and protection schemes, and for compatibility reasons, config is a nightmare. So much has to be hidden away or no-one would be able to use the beast.
The PC's success is not through technology/performance nor innovation nor simplicity. The PC's success is through being the only platform offering hardware competition with your favourite software a disc copy away. Once established and no other competitor in sight, the rules can be changed.
It's the same formula being played out over and over again online these days. Make a new start-up, give away the product to gain a user base, get some serious free advertising by getting on the news about your success, and voilà, you can change the rules and start charging for the product ...
With the ability to run up to 32 'threads' do we need more than the current 8 locks?
I also see hubexec opening the door for some new ways of doing things that might make having more locks available be useful.
One example would be objects that have FIFOs, like a UART.
The object that owns the FIFO would expose hubexec code that a caller executes to add something to the FIFO:
1. Get a lock for access to the FIFO 'Add Character' method.
2. Put the character in the location defined by the object as the parameter location.
3. Execute the 'Add Character' method.
4. Release the lock.
Since these 'exposed methods', 'remote methods', or whatever we call them run in the calling COG, we will need to define some rules so they play nicely and don't hose the caller's register space.
I think one of the things might be that one of the parameters passed with the call is a COG address that defines the beginning of a block of COG space that may be used by the method.
C.W.
In a general computing sense I would agree.
I do think they will be very useful for letting a single COG act as multiple low level peripherals like UARTS, keypad scanners, etc.
C.W.
In that situation you have a managerial task, possibly in HubExec mode, and up to three subordinate threads hidden away. The managerial task does the wheeling and dealing.
What does that have to do with the discussion of possibly having more than 8 locks available and the 'external methods'?
The point is that there are potentially more things going on at a time on the P2 than there were on the P1 that might need to use locks, and it might be useful to have more available.
The example of the 'external methods' is one of the cases where a lock would be useful since in most cases a given call would need to complete before another caller calls the same method.
C.W.
They made an 8080 processor to fit the requirements of some calculator manufacturer.
They bolted some hacks on to that to make a 16 bit machine supporting more memory (Segments bletch)
Hack upon hack it grew until we had the 32 bit Pentium class machines.
Meanwhile Intel was trying out the 432, the 860 and the Itanium.
It was AMD who showed how to continue the hacks into the 64 bit world.
evanh, Nah, there is nothing "RISC" about the 486 instruction set as used by compilers.
Besides, if you write a compiler for a normal block structured language like Fortran, C or Pascal you don't need much instruction set support. Loads and stores, stack ops, arithmetic and logical ops etc. Making use of any other special instructions is a lot more work.
Now we have things like SIMD instructions. They require the compiler to analyse your source carefully to see if they can be used to good effect.
evanh, No it is not. Well, no more than it was twenty years ago.
OK back in the 486 days I had to write code that would take a 486 from reset to running in 32 bit protected mode. And then run a bunch of threads in their own memory spaces.
It was not so trivial to do, but not so hard either. I'm sure that same code will run on a 64 bit AMD today. Let's not go there. It's all about Intel and MS and Windows and network effects. The upshot is Intel and AMD have done a good job of pushing the possible performance of silicon based computers to the max. So much so that we are hitting the physical limits now.
ctwardell, Stop now. I despair. This chip is never going to make it out of the gate.
It may be trivial to increase the number of locks.
One of the things I fear more than not getting it done is not thinking through the impact of all the changes that have been made and finding that their usefulness is hindered for the lack of a few small items.
C.W.
+1,
Yes, I am always mindful of the number of locks used in my P1 SPIN projects, which typically use 5 to 7 locks. Maybe with better programming skills on my part I would not need as many locks.
But then, locks are so easy to use and they allow me to spend more time on my algorithm and less time on programming style or efficiency. More Locks would mean I would worry less about accounting of their use.
For P2, I would like maybe, . . 32 locks ? (or more !)
Are you serious?
A year ago or so Chip asked here if it would be OK to remove the locks altogether. It would save silicon space and he had guessed nobody was using them.
In the ensuing discussion it turned out that a very few people had used locks in a very few cases. Mostly we get by without them.
In most cases we have a producer of data and a consumer. No locks are required for that exchange. For example see FullDuplexSerial and its buffer handling.
What are you doing that needs so many locks?
I assume you are referring to this thread:
http://forums.parallax.com/archive/index.php/t-109727.html
In reading the thread I find the following from Chip:
"Good to hear that you guys are getting the intended usage out of the LOCKs. They'll be staying, for sure, then."
and
"mctrivia said...
can we have 32 of them just to round out a long? i can definitely deal with 8. there are ways around getting more through hardware/software combination but they are so convenient and quick this way. not a must but if not difficult would be nice in the rare situation more then 8 are needed.
It's easy to add them. We could."
C.W.
Day #1222
Week #175
This week, the talk turns from reducing COGRAM space to increasing the number of available locks.......
:frown:
Yes, . . . maybe it is my style (or lack thereof). I use locks to coordinate updates to I2C devices from multiple COGs, see: http://wa0uwh.blogspot.com/2012/06/prop-ui-progress.html .
Most useful (to me) are my locks that allow different COGs to "own and control" fields within a single 20x2 LCD display. Or, allow different COGs to have access to the RTC without conflict. Note: there are about 8 (or so) I2C peripherals on the same bus (see link above).
The way I look at it is: use it if "it" is available or lose it, otherwise use a different programming style.
Perhaps an experienced programmer could show "this hobbyist" a better method.
FWIW +1
I've been using P1 since 2006 and only once have I used a (1) lock in one of my applications.
In hindsight with a few more years of P1 mileage I could probably get away without using it now.
The P2 has Port D and XCH; I can see more flexibility in using this system.
Seeing that the locks are already there, leave them in.
I'm looking forward to the time when "lock" is only used in the sense of "the P2's design is now LOCKED".
I've never used them either, though I have an excellent explanation of how they're used... readily available... just in case. I figure that if I have to use locks then I've programmed myself into a corner. I then spend time trying to get out of that corner.
I'm sure they're useful, just haven't had to use them yet. Could be my inexperience talking. :-)
Sandy
By all means increase the number of locks, but please don't remove them altogether!
Sure, most people don't need them and have probably never used them, but some of us do and have! If Parallax removed hardware support for locks, then we would all immediately have to write software to emulate them, and I think any emulation of locks in software would require at least 2 hub instructions to implement (whereas lock operations currently take only one).
Also, to emulate them in software would require permanent allocation of some Hub RAM to store the lock state, and this would make it implementation dependent.
Ross.
The LMM interpreter in PropGCC also uses a single hardware lock.
Did the P2 sprout such an instruction yet? If not we need those locks.
No, it's possible on the Prop because of the nature of the hub - but you need at least 2 hub instructions to do it (it can certainly be done in 3 hub instructions, but I suspect there might be a clever way to do it in 2).
Ross.
A lock is needed if several threads share one resource like an I2C interface. It does not matter if 8 or 32 threads share the single I2C interface, they all wait on the same lock.
The use of locks should be minimized, because it really is a bad thing for parallel processing if two or more threads often have to wait for the same resource. It is more efficient if a hardware resource is accessed by a single thread, and this thread does some preprocessing and then provides the result to other threads if needed.
Perhaps some use them as ordinary flags to signal states between cogs, but for that you can use any hub RAM variable.
Andy
I'll take a stab at the 3 instruction version. In the hub there is a variable whose value is either -1 (free) or cogID (busy). Each cog follows this sequence:
1. (HUB) Read variable.
2. If busy, go to 1.
3. (HUB) Set variable to ID.
4. (HUB) Read variable.
5. If variable is same as ID then access resource. Otherwise go to 1.
6. (HUB) Write -1 to release resource.
Yes, that's essentially the algorithm I had in mind. It can be done in 2 if you only need access to the locked resource for a very limited time (essentially one hub cycle) - and I think there might be a clever way to generalize this.
Ross.
In general I don't think that works.
What if processes A and B both get past steps 1 and 2 when they are claiming a resource which is free?
They both now think that the resource is free.
Process A then does steps 3, 4 and 5 after which it will think it owns the resource.
Process B then also does steps 3, 4 and 5 after which it will think it owns the resource.
Processes A and B then proceed to corrupt the data in the resource.
Now, with knowledge of the round robin nature of the HUB access and assuming we use exactly the same steps to acquire the lock in all processes it might be possible to avoid erroneous interleaving of lock acquisition steps as I have shown above.
But given the prospect of cached hub accesses and a potential "greedy mode" this starts to look impossible again.
I agree, Rod. I'm working with Daniel to make a complete FPGA board that has everything early adopters might need, and nothing more. Once we've got some hardware we can afford to use and distribute we'll all be in a better position to do exactly what you mentioned: create some impressive commercial examples in C, Spin and PASM. Expect us (you, Parallax, community) to put out some challenges to code impressive projects into examples that typical embedded programmers can use, through our Open Propeller Project.
Ken Gracey
Of course there will be plenty of others too. I know baggers is just itching to show off some graphics.
RISC is more than an instruction set. It's a philosophy of streamlining that revolves around pipelining. The fixed length instruction sets that were spawned are a side-effect. The way compilers evolved also reflects this RISC focus irrespective of fixed or variable length instructions.
Have you seen the size of a BIOS lately? There is absolutely tons to config! That's part of the deal also.
Heater,
You are forgetting the deterministic nature of the Prop. What you are proposing simply cannot occur if all cogs are executing the same code.
Ross.
Until we come to the x64, which adds 8 additional general purpose registers, that is.
Anyway the RISC/CISC divide has always been a bit woolly and ultimately the debate pointless.
Don't care about the size of the BIOS. We were talking chip architectures.