I was talking about whole PC hardware when it came to the complexity of booting and coding, and how hard it might be for someone without the infrastructural support now available. The BIOS is part of making that easier by hiding a lot of the scary stuff, as are the OSes and development platforms. Even the CPU is no simple thing now; there is so much of what was exclusively chipset functionality packed in there, not to mention the GPU!
Compiler writers were quite quick to adapt to the 486 RISC features and even more so for the Pentiums with much assistance/encouragement from Intel of course.
You are forgetting the deterministic nature of the Prop.
No. Did you miss the bit about "...with knowledge of the round robin nature of the HUB access and assuming we use exactly the same steps to acquire the lock in all processes..."
Firstly we have no interrupts that could get in the way and upset the timing of accesses to those HUB variables. We are not going to be rescheduled by an OS at any random point.
Secondly we know that after our HUB slot the seven other COGS get a go and then it's our turn again.
With that determinism I believe that the algorithm stands a chance of working.
I suspect this might not fly so well for multi-threaded COG codes. Or what about XMM codes? What do you think?
Of course that is exactly the determinism many here have been arguing we don't need in the P2 any more, what with "greedy COGs" and such, which may well break the whole idea.
In the traditional sense of RISC/CISC what 486 RISC features?
To quote Intel's Andy Grove around the time of the launch of the 486:
We now had two very powerful chips that we were introducing at just about the same time: the 486, largely based on CISC technology and compatible with all the PC software, and the i860, based on RISC technology, which was very fast but compatible with nothing.
Anyway the RISC/CISC debate is somewhat off topic here. Nobody ever agreed which the P1 was, never mind the P2.
No. Did you miss the bit about "...with knowledge of the round robin nature of the HUB access and assuming we use exactly the same steps to acquire the lock in all processes..."
No, but you seemed to be implying (i.e. "it might be possible") that you had to do something special to get this "software lock" to work. You don't - it's guaranteed by the round-robin nature of the hub. That's the beauty of the P1.
However, I do agree that the P2 is beginning to look like a bit of a monster - I hope Parallax comes to its senses and abandons all this stuff that would invalidate the deterministic nature of the P1 - if not, the P2 will become "just another chip" that has to be programmed in a high-level language because the assembly language is too complex for us mere mortals to learn.
...but you seemed to be implying (i.e. "it might be possible") that you had to do something special...
That is true, I did.
That is just my natural tendency not to believe that anything I think might work will actually work until I see it working. If you see what I mean.
Or until I have managed to come up with some mathematical/logical proof that it will work that someone else agrees with (a rare situation).
In this case I think you can write down all the possible interleavings of a few COGs stepping through that code and convince yourself it always works.
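For what it's worth, here is one way to do that exhaustive convincing. This is a Python sketch (we obviously can't run PASM here), and the three-step acquire sequence is my paraphrase of the scheme under discussion, not actual P1 code; the only property it leans on is the strict round-robin hub ordering, with every cog issuing the same step in the same hub rotation.

```python
# Brute check of a "software lock" relying on strict round-robin hub access.
# Each contending cog runs the same three hub operations in lockstep:
#   1. read the lock word
#   2. if it read 0, write its own id (ids start at 1; later writes win)
#   3. read the lock word back; it owns the lock iff it sees its own id
# Because the hub serves cogs in a fixed rotating order and every cog issues
# step k during hub rotation k, the interleaving is fully determined:
# all reads happen, then all writes, then all read-backs.

def round_robin_lock(n_cogs):
    lock = 0
    saw = {}
    # Rotation 1: every cog reads the lock word, in hub-slot order.
    for cog in range(1, n_cogs + 1):
        saw[cog] = lock
    # Rotation 2: every cog that read 0 writes its id.
    for cog in range(1, n_cogs + 1):
        if saw[cog] == 0:
            lock = cog
    # Rotation 3: every cog reads back; it holds the lock iff it sees itself.
    winners = [cog for cog in range(1, n_cogs + 1) if lock == cog]
    return winners

for n in range(1, 9):  # the P1 has up to 8 cogs
    winners = round_robin_lock(n)
    assert len(winners) == 1, f"{n} cogs: expected one winner, got {winners}"
print("exactly one cog acquires the lock for 1..8 contenders")
```

Under those assumptions all the first reads land before any write, and all the read-backs land after the last write, so exactly one cog (the last writer among those that read zero) ever sees its own id. Shift one cog's steps out of phase and that guarantee evaporates - which is exactly the "same steps in all processes" caveat above.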
I start to wonder why the P1 has locks at all now?
I agree, Rod. I'm working with Daniel to make a complete FPGA board that has everything early adopters might need, and nothing more. Once we've got some hardware we can afford to use and distribute we'll all be in a better position to do exactly what you mentioned: create some impressive commercial examples in C, Spin and PASM. Expect us (you, Parallax, community) to put out some challenges to code impressive projects into examples that typical embedded programmers can use, through our Open Propeller Project.
An impressive eval Board is the STM32F429I-DISCO - not so much for the me-too uC, but for the inclusion of a QVGA graphics display (touch?) and 64Mb SDRAM.
The P2 module does not have to ship with a QVGA or SSD1963, but I believe it should include a footprint to allow easy connection to such, and at least 2 SO8 QuadSPI flash devices, allowing byte-streaming.
The term "complex" in reference to the P2 is a little harsh.
The most complex thing about P2 I see is trying to sift your way through 1000's of posts to find the brief description of some of P2's internals. Once you find the scattered fragments of documentation and actually try using them, it's not too "complex" at all.
With Parallax's own FPGA board and an influx of new P2 heads, and with Chip's work nearing completion, the consolidated documentation will materialize.
As for cached hub access, I would assume that hub writes would invalidate other cogs' cached hub data if they intersected. Is this not the case?
Both the data and instruction caches have no awareness of writes to hub memory. They simply read 8 longs when data is needed. There are separate DCACHEX and ICACHEX instructions for invalidating each cache.
I think people may be equating the complexity of the discussion about features and how they are implemented with the complexity of using the features once they are implemented in the chip. These features will be trivial to use.
For sure!! Also there is a huge difference between "available to be used" and "forced to be used".
Without naming names, there are entire families of microcontrollers that require a large amount of tedious and exacting initialization code. They power-up entirely non compos mentis. This is radically different than the P1 and P2. The P2 even powers up with a nice little monitor program running! What could be friendlier than that?
Both the data and instruction caches have no awareness of writes to hub memory. They simply read 8 longs when data is needed. There are separate DCACHEX and ICACHEX instructions for invalidating each cache.
Thanks for the clarification. So in a producer/consumer configuration between two cogs that are communicating through hub RAM, will the producer have to issue DCACHEX after each hub write? I can see a whole class of hair-pulling bugs that could arise for newbies and oldies alike forgetting to clear the cache. Perhaps I should go back and read the discussion again in greater detail.
I would think it would be the other way around. The consumer would need to issue DCACHEX before reading the shared data to make sure it doesn't get an old cached value instead of the values updated by the other COG.
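A toy model makes the hazard concrete. The line size (8 longs) and the no-write-snooping behaviour are as described above; everything else here (the class names, the hub size) is invented purely for illustration:

```python
# Toy model of a read cache with 8-long lines and no write snooping,
# mirroring the described behaviour: a cog's data cache fills a whole
# line on a cached read and never notices other cogs' hub writes.

LINE = 8  # longs per cache line

class Hub:
    def __init__(self, size=64):
        self.mem = [0] * size
    def write(self, addr, value):        # WRLONG-style: straight to hub
        self.mem[addr] = value

class CogDCache:
    def __init__(self, hub):
        self.hub = hub
        self.tag = None                  # which line is cached, if any
        self.line = []
    def rdlongc(self, addr):             # cached read (RDLONGC-style)
        tag = addr // LINE
        if tag != self.tag:              # miss: fill the whole line
            self.tag = tag
            self.line = self.hub.mem[tag * LINE:(tag + 1) * LINE]
        return self.line[addr % LINE]
    def dcachex(self):                   # invalidate (DCACHEX-style)
        self.tag = None

hub = Hub()
consumer = CogDCache(hub)

consumer.rdlongc(0)          # consumer caches the line holding addr 0
hub.write(0, 42)             # producer updates hub behind the cache's back
stale = consumer.rdlongc(0)  # still sees the old value: 0
consumer.dcachex()           # invalidate, forcing a line refill
fresh = consumer.rdlongc(0)  # now sees 42
print(stale, fresh)
```

The producer's write lands in hub RAM immediately; it is the consumer's cached copy that goes stale, which is why the invalidate belongs on the consuming side (or, as noted below, the consumer can simply use the non-cached read variants for shared data).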
If high performance COG interaction is required, the D port is set aside for that purpose.
If you are writing to hub memory, there is no way to avoid the non-cached variants (David). Invalidating the D cache will cost more than a RDXXXX instruction, because there is one clock for the invalidate and the same number of clocks as a RDXXXX to reload the cache, in which that address will be invalid anyway. If you are careful about your RDXXXX and RDXXXXC instructions, you can have your cake and eat it too.
ICACHE for HUBEX
DCACHE for RDXXXXC
NOCACHE for RDXXXX
So you can avoid invalidating the DCACHE by selectively using RDXXXXC variants, using RDXXXX for addresses you know are volatile (in the C keyword sense).
While it is true that fetching a single value from hub memory will be faster with a RDLONG instruction than with a DCACHEX followed by RDLONGC, that may not be true for fetching numerous contiguous values from hub if they are all in one cache line. I guess which makes the most sense depends on the context.
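The break-even point is easy to sketch. The only cycle relationships the thread gives are "one clock for the invalidate" and "a cache reload costs the same as a RDXXXX"; the one-clock cache hit and the 8-clock worst-case hub read below are my assumptions, purely for illustration:

```python
# Back-of-the-envelope cost comparison, using only the relationships
# stated in the thread: invalidating costs 1 clock and a line reload
# costs the same as one uncached read. The per-hit cost of 1 clock and
# the hub read cost R are assumptions for illustration.

def uncached_cost(k, r):
    """k separate uncached RDLONGs, each costing r clocks."""
    return k * r

def cached_cost(k, r, hit=1):
    """DCACHEX (1 clock) + one line fill (r clocks) + k cache hits."""
    return 1 + r + k * hit

R = 8  # assumed worst-case hub read cost in clocks (one full rotation)
for k in range(1, 9):
    u, c = uncached_cost(k, R), cached_cost(k, R)
    print(f"{k} longs: uncached {u:3d} clocks, invalidate+cached {c:3d} clocks")
# For a single long the uncached read wins (8 vs 10 clocks here); once
# even two longs share a cache line, the cached path pulls ahead.
```

So both observations hold at once under these assumptions: a lone fetch favours RDLONG, while walking several contiguous longs in one line favours the invalidate-then-RDLONGC pattern.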
To quote Intel's Andy Grove around the time of the launch of the 486:We now had two very powerful chips that we were introducing at just about the same time: the 486, largely based on CISC technology and compatible with all the PC software, and the i860, based on RISC technology, which was very fast but compatible with nothing.
Marketing, buzzwords. We all know how that works. Multitasking was a terribly geeky term with a precise technical meaning (kernel management of the sharing of computing resources between executing programs) and was only used in computing circles until around 2000 or so, when some head honcho (presumably Mr Jobs or Mr Gates, or both) made the term part of their sales slogans. Since then the general public thinks of the term as human interactivity with computers, or anything else for that matter.
Multitasking as a technical term didn't change but the term has now been borrowed for a non-technical purpose and can be confused as a result.
So you can avoid invalidating the DCACHE by selectively using RDXXXXC variants, using RDXXXX for addresses you know are volatile (in the C keyword sense).
Yup, I missed the RDXXXX/RDXXXXC distinction when I was catching up on this monster thread. Thanks for clearing that up.
If high performance COG interaction is required, the D port is set aside for that purpose.
Do we have any documentation yet for SETXCH?
It seems like there is a lot of hope that the D Port is going to be some kind of magical panacea, but we only have one, so its usage will need to be negotiated among the various objects.
Not really. The question of compilers only using a small selection of a CPU's possible instructions is orthogonal to that of CPU's RISC/CISC architectures.
I do agree, such things do tend to get lost in marketing noise. Andy was right in that quote though.
I do hope there is a stable "P2 Instruction Set For Dummies" soon.
Not really. The question of compilers only using a small selection of a CPU's possible instructions is orthogonal to that of CPU's RISC/CISC architectures.
Really?! That was Brian's very point. Compilers went from a broad usage to a narrow usage as the instructions were streamlined. Hell, Intel later, when going superscalar, even consigned the slow legacy instructions to a single pipe, freeing the "fast pipes" for the RISC-ised instructions.
Yes really... now we are talking at cross-purposes. My point has always been that compilers have always been "narrow usage". They never did transition from "broad usage" to "narrow usage".
I offer as examples the C compilers used for the Z80, which never used the vast majority of its instructions above the 8080 subset. Or the C and PL/M compilers for Intel's 8086.
It is this phenomenon that was the motivation for the RISC idea. They analysed the output of compilers and the run-time usage of different instructions, addressing modes, etc., and realized it might be a good idea to throw away the unused or little-used instructions and dedicate the silicon resources to speeding up execution of what is used. Use those transistors for registers and pipelines rather than instruction decoders.
Of course, as transistors became plentiful, it turned out you could keep the backwards-compatible CISC and have those registers and pipelines too, at the cost of increased power consumption. Enter the ARM for mobile.
It's been my suspicion that this phenomenon of compilers driving CPU design has created a wasteland for would-be assembly language programmers. Can anyone corroborate this?
Because they allow you to do in 1 hub instruction what would otherwise take 3 (plus some reserved Hub RAM).
Ross.
The display does have touch.
PM me if you have anything specific you want/would like.
I agree. It's not that complicated and the documentation won't be very big.
Yup, this software lock is only possible because of the strict time slicing of hub access. Greedy mode would prevent this from working properly.
Thanks for the clarification. So in a producer/consumer configuration between two cogs that are communicating through hub RAM, will the producer have to issue DCACHEX after each hub write? I can see a whole class of hair-pulling bugs that could arise for newbies and oldies alike forgetting to clear the cache. Perhaps I should go back and read the discussion again in greater detail.
RDLONG / RDWORD / RDBYTE non-caching versions, which bypass the cache entirely according to my understanding.
This conversation loops right about now. Brian Fairchild pointed out how compilers tend to use a focused few instructions - http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1236419&viewfull=1#post1236419