...you have a fast processor, artificially slowed in its ability to interact with the real world!
No, it is not slowed. Consider:
Without the I/O synchronization that I describe, any thread's instruction could set an I/O pin at whatever point in the round-robin (or whatever) scheduling cycle it happens to execute. That is apparently faster, at the price of being non-deterministic, since it depends on which thread gets there. BUT the thread still can't change that I/O again until it gets its next "turn". So at least we see that simple toggling speed is not affected by the synchronization.
Similarly, if a thread is waiting on an input change before continuing, it may see the input sooner without the synchronization, BUT it cannot react (perform its next instruction) until at least its next "turn".
Therefore I maintain that the I/O synchronization among such interleaved threads is not slowing anything down.
Re: Cortex
Since not all instructions take the same amount of time, they offer the ability to artificially stretch the interrupt handling time to match the longest possible instruction time.
So what you are saying is that they have arranged for the time from the input signal to the first instruction fetch of the handler, for the highest-priority interrupt, to always be the same, i.e. the worst-case time.
Well, might be useful somewhere, sometime, possibly, maybe, I guess.
With the ARM, instructions are either one or two cycles. Most are one cycle. I don't see what that has to do with interrupts, however. It's the interrupt response time (the number of cycles) that is being stretched slightly, not the clock cycle. That remains the same. As I've said, that is negligible at 100 MHz, if one needs interrupt determinacy. Other operations are fully deterministic.
No, no, no. Bank switching. We all hate bank switching; we've all done it many times before ... It's horrible.
I can't disagree with it being horrible, but when you have an architecture whose instruction register fields can't be enlarged and you need more registers, what are you going to do? You can either say no, no more registers, and accept the consequences of that, or accept, as undesirable as it is, that banked registers do provide a solution. If it would make the difference between commercial success and failure, is the pragmatic choice really that offensive?
For the Prop architecture, banked cog memory would be a much cleaner implementation than it has been on other processors, notably the earlier PICmicros. Above all there's one significant advantage: if you don't want to use it, you don't have to.
The question is not whether it makes us shudder or how ideologically offensive it is, but whether it is needed and whether it would make the Prop more successful. Are the limitations of cog size real or only perceived, and can they be addressed in some other way?
The reality is that cog memory is limited, and people do hit those limitations. I have, and I expect even the Spin interpreter itself would do much more if the memory were larger. Most VMs seem to have involved a battle against those limitations and how to work around them. LMM is a solution, but also a trade-off of speed against memory use, and it can simply displace memory use to the hub. VMs also show up other issues of the Prop PASM design: the lack of indirect register access and limited bit-field decoding. Not insurmountable, but they do have an impact, especially on code size. Other micros may be no more capable, but they don't have the memory size limitation as well.
This isn't in response to any particular comment, just the overall "vibe" shall we say, that interrupts are evil.
I get the whole deterministic timing thing, which in the case of needing to have very precise timing makes sense.
However....a lot of times there *ARE* basically random events that need to be captured. Without interrupts we can (1) sit in a tight loop polling for the event, or (2) in the case of the prop, sit idle at a waitxxx instruction.
(1) We waste time polling, and we also lose deterministic timing: the more useful work we try to accomplish in the loop, the worse the timing gets.
(2) Very deterministic timing, no other work getting done.
It seems to me it would be ideal to have a multicore processor, like the prop, but with interrupts also available.
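To make the trade-off concrete, here is a rough host-side sketch in C of the two options above, with the pin register modeled as a volatile variable (INA and EVENT_MASK are just stand-ins for this illustration, not a real API):

    #include <stdint.h>

    volatile uint32_t INA;                 /* modeled input pin register           */
    #define EVENT_MASK (1u << 5)           /* hypothetical pin of interest         */

    void do_useful_work(void) { /* ... */ }
    void handle_event(void)   { /* ... */ }

    /* (1) Busy polling: useful work gets done, but the latency from the pin
       going high to handle_event() depends on where the loop happens to be. */
    void poll_loop(void)
    {
        for (;;) {
            do_useful_work();              /* more work here => worse jitter      */
            if (INA & EVENT_MASK)
                handle_event();
        }
    }

    /* (2) Blocking wait (waitpxx style): reaction time is fixed and minimal,
       but this "cog" does nothing else while it waits.                       */
    void wait_loop(void)
    {
        for (;;) {
            while (!(INA & EVENT_MASK))    /* stand-in for a hardware waitpeq     */
                ;
            handle_event();
        }
    }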
There are various ways to address more than 512 longs of cog memory. Bank switching is certainly one possibility. It's used on the SX, and it's not that bad. Special instructions could be added to merge the destination field with the source field to provide 18 bits of addressing for jmpx, callx and retx instructions. The return address location for a callx could be the location immediately before the callx target address. It would also be nice to have a relative jump instruction that could jump to the current program counter + 255 to -256 locations.
Locations that are used as registers for add, sub, mov, etc. could be limited to the first 512 longs of memory. Or there could be an index register that adds a bias to the source and destination fields. Wouldn't it be nice to have source and destination index registers that would allow random access of cog memory?
I'm not suggesting that this should be done in Prop 2. The design requirements need to remain frozen for the Prop 2 or it will never get done. However, these suggestions could apply to a Prop 3.
Dave
Edit: How about adding rdlongx, wrlongx, rdwordx, etc. instructions to Prop 3 that would use the next hub slot if its owner is not using it? That way you could get the highest performance by using the "x" instructions, or deterministic timing by using the current hub access instructions.
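For what it's worth, the "use the next hub slot if its owner is idle" idea is easy to model. Below is a small C sketch of such an arbiter - the names, the 'greedy' flag and the scan order are all invented for illustration, not anything Parallax has specified - which also shows why only a cog's own slot stays deterministic:

    #include <stdbool.h>
    #include <stdio.h>

    #define NCOGS 8

    typedef struct {
        bool wants_access;   /* cog has a hub operation pending               */
        bool greedy;         /* pending op is the hypothetical "x" variant    */
    } Cog;

    /* One hub rotation step: the slot belongs to 'owner'. Return the cog that
       actually gets served, or -1. A plain cog is only served in its own slot;
       an "x" cog may also grab a slot whose owner has nothing queued.         */
    int hub_step(const Cog cogs[NCOGS], int owner)
    {
        if (cogs[owner].wants_access)
            return owner;                        /* owner always wins its slot */
        for (int i = 1; i <= NCOGS; i++) {       /* slot idle: offer it to the */
            int c = (owner + i) % NCOGS;         /* next waiting "x" cog       */
            if (cogs[c].wants_access && cogs[c].greedy)
                return c;
        }
        return -1;                               /* slot goes unused           */
    }

    int main(void)
    {
        Cog cogs[NCOGS] = {0};
        cogs[2].wants_access = true;             /* cog 2 uses the "x" form    */
        cogs[2].greedy = true;
        for (int slot = 0; slot < NCOGS; slot++)
            printf("slot of cog %d -> served cog %d\n", slot, hub_step(cogs, slot));
        return 0;
    }

In this toy run cog 2 gets served in every otherwise-idle slot, which is exactly the throughput win - and exactly why, as asked just below, two such "borrowing" objects loaded together would start affecting each other's timing.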
IMHO, improving how LMM works is a better answer. If the speed and efficiency of that is improved, then we go from slightly larger, fast programs to just larger, faster programs, which is scalable in the longer term and not a kludge. If the COG word is 48 bits, or some other width, what impact does that have on LMM systems?
If a hub fetch is more productive, there isn't as much of a reason to break the simple access system. 4 longs has been tossed about as being possible. Breaking the simple access system breaks the portability of the various objects. What happens when one loads two of those "if no other HUB is using it" objects?
The next COG will have instructions that improve its code space use as well, the "REP" instruction being one of those we saw in testing.
Re: Interrupts are evil. Well, I don't think so. I like them on traditional CPUs, mostly because they are necessary. They are not necessary on the Prop, and frankly, that's interesting and potent. Improving on that seems again the right answer, instead of diluting what differentiates the Propeller. It's really powerful to be able to write little chunks of code, and have them interact with other chunks without worrying about their overall impact on the interrupt kernel / OS that is necessary to manage things otherwise.
As Devil's Advocate - the fact that one needs to resort to LMM to overcome limitations of the Cog shows the very limitations. Only if LMM can be an equal to PASM will that limitation cease to exist. It's unlikely to ever be a perfect equal and it's still asking a potential user to jump through hoops they would not have to otherwise - albeit that memory banking or whatever also involves some hoops to jump through.
On Prop I, we need to resort to LMM because it really wasn't part of the initial design; it just happened that it was possible. It took a lot of work to make it practical, though. But now that we know about it, and we've seen how things can go using it, the idea of needing to make the cog bigger seems far less important. On the Prop I, LMM is slower than people would like, and the scale of Prop I in general is small enough that every bit counts, but will that really be true for Prop II? Or does it have to be that way with Prop II? An improved hub fetch and some instructions designed for LMM change the game considerably.
So a cog could be a dedicated peripheral, or it could be a "CPU" with specialized instructions tunable for various tasks running as LMM code, or a DMA type thing, in charge of larger memory spaces, or servicing requests to move data from outside memory to the hub, etc... Or the cog could just run SPIN. Lots of choices, each with trade-offs. LMM could come fairly close to PASM speeds, given a larger hub memory fetch data size per access window for example. What is that worth, compared to adding complexity in the COG via bank selects, word size changes and such?
Seems to me, for those who want interrupts, a cog running LMM code could easily be polling for some condition and interrupting those programs easily enough. Software-defined silicon? Maybe that's a pretty reasonable answer to those regular queries. Just asking at this point, mostly because I think it's worth asking.
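For readers who haven't met LMM (Large Memory Model): the cog keeps only a tiny fetch-execute kernel in cog RAM and streams the actual program from hub RAM one long at a time. The real kernel is a few PASM instructions that execute each fetched long directly; the C sketch below only models the shape of it, with a made-up opcode set, to show where the per-instruction hub fetch cost comes from:

    #include <stdint.h>
    #include <stdio.h>

    /* Made-up "instruction" encoding, for the model only. */
    enum { OP_ADD, OP_PRINT, OP_JMP, OP_HALT };
    typedef struct { uint8_t op; uint32_t arg; } Insn;

    /* "Hub RAM": the big, slower memory holding the LMM program. */
    static const Insn hub_program[] = {
        { OP_ADD, 1 }, { OP_PRINT, 0 }, { OP_JMP, 0 }, { OP_HALT, 0 },
    };

    /* The cog-resident kernel: fetch a long from hub, execute it, repeat. */
    void lmm_kernel(void)
    {
        uint32_t pc = 0, acc = 0;
        for (;;) {
            Insn i = hub_program[pc++];       /* the hub fetch: this is the LMM tax */
            switch (i.op) {
            case OP_ADD:   acc += i.arg;                       break;
            case OP_PRINT: printf("acc=%u\n", (unsigned)acc);  break;
            case OP_JMP:   pc = i.arg;                         break;
            case OP_HALT:  return;
            }
            if (acc > 3)                      /* keep the toy program finite        */
                return;
        }
    }

    int main(void) { lmm_kernel(); return 0; }

The point of a wider hub fetch (four longs per access window, say) is simply that the "LMM tax" line above gets amortized over more instructions.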
About threads: it's useful to do that, as well as to pack a lot of work into a small CPU core - the Pentium 4's Hyper-Threading is the same idea, really. It lets us fit what needs to be done onto a few cores. Eight threads are very helpful in a DSP butterfly calculation, for example (I know there's no FPU, but some of us hack together software IEEE-754 emulation anyway).
And bank switching, I must admit, is an engineer's nightmare. What used to be open to probing eyes is now largely hidden: most CPUs handle it themselves nowadays - the boot firmware usually contains the bank-switching support, so the job is taken out of our hands. So, no more touching bank switching (good news for those who hate it). BUT on some microcontrollers, such as the ARM Cortex parts and a few others, it is still visible to the user - you have to look before you buy, if you really want it that badly.
SRLM, honestly, I really doubt either microcontroller is going to be obsolete any time soon - more like a decade or longer. Some teachers are really ignorant about technology trends - microcontrollers don't die quickly; take the PIC as an example (it's one of the oldest microcontrollers, sold for decades now). If you already know that, sorry - just pointing out the often-ignored obvious. Android, on the other hand, is totally closed-source. It won't be long before Google gets sued by the Linux community: Linux is explicitly licensed under GPL v2 / v3, which stipulates that the source should be openly distributed, not withheld for profit or, worse, for monopoly practices.
About interrupts: they are still a nifty feature that is not available on the Propeller. But we can build software-based interrupt vector tables on the Propeller, similar in spirit to the vectored interrupt layer of the ARM Cortex-M3 core. It takes careful kernel coding, however.
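As a sketch of what such a software "interrupt vector" layer might look like in C: a table of handlers plus a pending-bit word, with a dispatcher that a dedicated supervisor cog (or loop) runs whenever it notices an event. Every name here is invented for illustration; nothing like this exists in the Propeller libraries, and real code would need the pending word to be updated atomically:

    #include <stdint.h>
    #include <stddef.h>

    #define NVECTORS 8

    typedef void (*isr_t)(void);

    static isr_t vector_table[NVECTORS];      /* one handler per "interrupt" source   */
    static volatile uint32_t pending;         /* one bit per source, set by producers */

    void attach_isr(unsigned n, isr_t h) { if (n < NVECTORS) vector_table[n] = h; }
    void raise_irq(unsigned n)           { pending |= (1u << n); }  /* sketch: not atomic */

    /* Dispatcher: lowest bit number = highest priority. On a real Propeller this
       loop would live in a cog watching pins or hub mailboxes.                   */
    void dispatch(void)
    {
        while (pending) {
            for (unsigned n = 0; n < NVECTORS; n++) {
                if (pending & (1u << n)) {
                    pending &= ~(1u << n);    /* acknowledge                          */
                    if (vector_table[n])
                        vector_table[n]();    /* "vectored" call                      */
                    break;                    /* rescan so higher priority runs first */
                }
            }
        }
    }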
Oh, and to point out: I am not centered on just one microcontroller - I use any microcontroller I can get my hands on. TI's (formerly Luminary Micro) ARM microcontrollers are still dirt cheap but potent for the price, and they are surprisingly compatible with uCLinux, which is in C++ and can be used for almost anything, even the LED DC-DC converter in your DIY flashlight.
Well, then why can't I find the source code? I looked everywhere on the official Android webpage. Hmm?
I initially tried Android, but gave up because I had to download the whole thing via the Android Developer Kit software. Troublesome...
I really want to see it as a tarball. It's hard to call something open-source if you're required to download it via the manager.
I will just drop the Android topic for a while - I don't want to start a war. It's pointless anyway - there are plenty of OSes that can be ported to almost anything. What I like about Android is that its GUI is quite nice, and it even has Google Earth, another useful feature for travelers (even on a phone!).
And the Propeller's COG is similar to an ARM Cortex-M0-class core, only without interrupts; it executes an instruction every four cycles (it's a simple pipelined CPU, but plenty powerful when driven in assembly). What's bad about a narrow pipeline in a CPU's integer unit is that it can get bogged down pretty quickly - the Spin interpreter dragging the chip from 160 MIPS down to 20 MIPS is a perfect example. A wider integer pipeline would certainly help - and how much could we squeeze out of it in assembly? A lot more - perhaps 5.2 GIPS out of one chip. Out-of-order execution units in a COG could run Spin nearly instantly, since the core would just park the "insufficient data" threads, run whatever is ready first, and come back to the rest later. In assembly it would go further still. An out-of-order COG would be a nice feature for 3D graphics or some really heavy mathematics. For now, the four-stage design is fine by me.
I see. I don't have git on Windows, though. I do have Python on Windows XP - it certainly helps with most things.
I still have the SDK on my portable 500 GB hard drive - I will reinstall it later and then get it. (I bet it will take forever on my crappy modem. Linux is ridiculously huge nowadays too - 4.7 GB for a DVD... I have a Linux Mandriva 2009 DVD for my 64-bit workstation; I tried it on a Phenom after swapping that CPU in for an Athlon 64 X2 as a test - not too bad. HPET tends to freeze the whole thing: it's a well-known bug on the Phenom Agena core, fixed in the Phenom II Deneb.)
And what's important - in my experience, anyway - is that threads should be carefully allocated, each with its own handler object and some stack RAM for its job. Even if I were to hack the XS1 to run threads out of order, I would still leave each handler as-is, just to be safe. (Why? I have seen a Pentium 4 with Hyper-Threading go bonkers when fed bad code [a weird blue screen with a general protection fault saying "INVALID OPCODE: EAX:00000000 ECX: 000000FF EBX: 0000FF02 EDX: 0A200000" - you know the whole story]. I did it twice - once on purpose, once by accident.)
The XS1 is pretty nice, but as I emphasized earlier, I use what I can get. I don't have the XDK yet, though - I really need one to give me a head start. (I am a slow point-earner, and I don't have my own website - they're expensive nowadays. My account at xcore.com seems to keep forgetting my password - surely that may be happening to others as well.)
Without the I/O synchronization that I describe, any thread's instruction could set an I/O pin at whatever point in the round-robin (or whatever) scheduling cycle it happens to execute. That is apparently faster, at the price of being non-deterministic, since it depends on which thread gets there. BUT the thread still can't change that I/O again until it gets its next "turn". So at least we see that simple toggling speed is not affected by the synchronization.
Similarly, if a thread is waiting on an input change before continuing, it may see the input sooner without the synchronization, BUT it cannot react (perform its next instruction) until at least its next "turn".
Therefore I maintain that the I/O synchronization among such interleaved threads is not slowing anything down.
So ... let me get this straight - you're introducing an artificial synchronization mechanism to slow down the potential of the processor to interact with the real world by a factor of 8, so that you can simulate 8 cogs?
So what you are saying is that they have arranged for the time from the input signal to the first instruction fetch of the handler, for the highest-priority interrupt, to always be the same, i.e. the worst-case time.
Well, might be useful somewhere, sometime, possibly, maybe, I guess.
Exactly. I'm still trying to think of a case where this would be a useful thing to do.
With the ARM, instructions are either one or two cycles. Most are one cycle. I don't see what that has to do with interrupts, however. It's the interrupt response time (the number of cycles) that is being stretched slightly, not the clock cycle. That remains the same. As I've said, that is negligible at 100 MHz, if one needs interrupt determinacy. Other operations are fully deterministic.
Leon,
I honestly don't see why you don't seem to get this - "Deterministic interrupts" is meaningless in the presence of multiple or concurrent interrupts (see this thread, where we discussed this issue previously - specifically, the article by Atmel on "determinism and latency").
To say that "other operations are fully deterministic" is true but superfluous - aren't all operations on all processors determinstic in the absence of interrupts? Or are you aware of a processor that takes a random amount of time to execute a particular instruction?
Attempts to make interrupt driven processors "deterministic" always requires additional artifical synchronization mechanism to be introduced - precisely to overcome the indeterminacy introduced by interrupts.
BTW, this is somewhat akin to Heater's though experiment of attempting to reproduce what cogs can do using threads. To do so, he has to artificially slow down the ability of the system to interact with the real world. The answer to his thought experiment is essentially "yes, you could do that, but why use a fast processor to simulate what a much slower and simpler processor can already do?"
I am not going to buy into the other comments here re interrupts and determinism, except to say that waitcnt and waitpxx reduce power consumption significantly - a real winner. Interrupts can be simulated by dedicating a pin, using it as 1 = interrupt, with a hub address to indicate where the request came from.
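A minimal sketch of that pin-plus-mailbox idea in C, with the shared pin and the hub long modeled as volatile variables (the names are placeholders; on real hardware these would be OUTA/INA bits and a wrlong/rdlong to a hub address):

    #include <stdint.h>

    static volatile int      irq_pin;       /* 1 = "interrupt" requested            */
    static volatile uint32_t irq_source;    /* hub mailbox: who raised it, and why  */

    /* Any cog wanting service does this (a wrlong plus an OUTA bit on real HW). */
    void raise_request(uint32_t source_id)
    {
        irq_source = source_id;
        irq_pin = 1;
    }

    /* The servicing cog: block until the pin goes high (waitpeq on real HW, so
       near-zero power), read the mailbox, handle it, drop the pin, wait again.  */
    void service_cog(void)
    {
        for (;;) {
            while (!irq_pin)
                ;                            /* stand-in for waitpeq                */
            uint32_t who = irq_source;
            irq_pin = 0;                     /* acknowledge                         */
            (void)who;                       /* dispatch on 'who' here              */
        }
    }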
Anyway...
What I wanted to say concerns cog RAM. The reason is, as hippy said, we have had to overcome the minimal cog RAM size. Now, Prop II addresses this somewhat with the extra 512 B or 512 longs - I am not sure which. This can be used in various ways, but not for code, and we will have to wait to find out how. However, Prop II also gives us other great features such as 8x performance, 4x hub transfers and 2x hub access. So, IMHO, overlays will perhaps be far faster than LMM, and they are easier to code. While overlays are not the best solution, they may be better than bank switching and the cog memory space it would consume on the die for those cogs not requiring the extra memory. Hub RAM is again critical, but at least it can be shared as required. And we have the extra pins for SRAM and SDRAM, with some support in hardware as yet not fully revealed.
One last thing: we are all trying to do things that the Prop was not designed for. Why? The Prop must have some advantage for us to be persevering with this solution instead of using another chip that could possibly be better at doing what we want. So, we have found a problem and we are finding ways to work around those limitations. That says heaps for the Prop's design, doesn't it! (Or we would have moved on.) Think about it.
True. Deterministic interrupt handling is often better off being done in software. The CPU can only control what is going on; it is the software that tells it what to do.
Pin-based interrupts can be done as explained in the notes somewhere on Parallax's Propeller PDF page. It helps to have static interrupt handling - but for totally dynamic interrupts, it's always better to do it in software.
And about the ARM Cortex's two-cycle execution: it really DEPENDS on how you write the software - it can hardly be achieved in reality, because the core has to maintain several reference stacks, which take a few cycles to look up and load into L1 cache or to flag in local system memory. Plus, to get anywhere near zero-cycle execution, the code HAS to be clean and bug-free - which is practically impossible (unless you have PC-Lint, which is way too expensive for mere mortals).
And about COG RAM: YAY!!! I really wanted more RAM, as it will really help with branch and instruction storage, along with kernel threads. I knew 2 KB of COG RAM wasn't enough, let alone the planned 4 KB.
Ross and Cluso99 - I can agree with you.
Seeing how the hardware works always helps me with software decisions.
So ... let me get this straight - you're introducing an artificial synchronization mechanism to slow down the potential of the processor to interact with the real world by a factor of 8, so that you can simulate 8 cogs?
No.
I did not say anything about simulating 8 COGs. I'm talking about building, in silicon, a single processor that can have exactly the same functionality/timing as 8 parallel processors.
I'm only suggesting that 8 threads on a single CPU could be equivalent to 8 threads each running on its own CPU.
That is to say, given a suitably fast technology a single processor could be built to perform exactly the same work as 8 processors with exactly the same timing.
There is nothing logically "magic" about parallelism. There is nothing a parallel processor can do that a fast enough single processor cannot.
Now about that timing:
1) I'm going to assume a simple model where one CPU does the work of eight by performing one instruction from each "thread" at a time in a round robin fashion.
2) So we have a cycle time in which all threads get one instruction done.
3) All inputs are sampled at the beginning of that cycle and all outputs are clocked out at the end of the cycle. In this way, for example, if all 8 threads set an I/O bit high within a cycle all the I/O bits will go high at the same time. As we would expect for one instruction cycle on the Prop for example.
Now, I maintain that such an I/O clocking scheme:
1) Makes all I/O timing indistinguishable from a system with 8 processors.
2) Does NOT "slow down the potential of the processor to interact with the real world by a factor of 8".
Why? Well, see my previous posts...
1) If an instruction in a thread sets an I/O bit, it cannot clear that bit until the next cycle. So toggling speed is limited to one eighth of the single-instruction rate anyway, and the synchronization does not slow that down.
2) If a thread is responding to an input during an I/O read instruction, it has to wait a whole round-robin cycle before it can do its next instruction. So synchronization does not slow that down either.
3) Can you think of anything the synchronization does slow down?
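The scheme is easy to model in a few lines of C: one "fast" processor runs one instruction from each of eight threads per cycle, inputs are latched once at the top of the cycle, and outputs are driven once at the bottom - so, externally, it is indistinguishable from eight processors each clocked at one eighth the rate. (A toy model under exactly those assumptions, nothing more.)

    #include <stdint.h>
    #include <stdio.h>

    #define NTHREADS 8

    static uint32_t real_in, real_out;            /* the actual pins                    */

    /* One "thread instruction": each thread copies its own input bit to its own
       output bit, but it only ever sees the snapshot latched at cycle start.     */
    static uint32_t run_one_insn(int t, uint32_t in_latch, uint32_t out_shadow)
    {
        uint32_t bit = 1u << t;
        return (in_latch & bit) ? (out_shadow | bit) : (out_shadow & ~bit);
    }

    int main(void)
    {
        uint32_t out_shadow = 0;
        real_in = 0x0F;                            /* pretend pins 0..3 are high        */
        for (int cycle = 0; cycle < 2; cycle++) {
            uint32_t in_latch = real_in;           /* 1) sample inputs at cycle start   */
            for (int t = 0; t < NTHREADS; t++)     /* 2) one instruction per thread     */
                out_shadow = run_one_insn(t, in_latch, out_shadow);
            real_out = out_shadow;                 /* 3) drive outputs at cycle end     */
            printf("cycle %d: out = 0x%02X\n", cycle, (unsigned)real_out);
        }
        return 0;
    }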
2) So we have a cycle time in which all threads get one instruction done.
3) All inputs are sampled at the beginning of that cycle and all outputs are clocked out at the end of the cycle. In this way, for example, if all 8 threads set an I/O bit high within a cycle all the I/O bits will go high at the same time. As we would expect for one instruction cycle on the Prop for example.
Ok - so if we have (say) 8 threads, then you're sampling once every 8 instructions, and updating once every 8 instructions.
3) Can you think of anything the synchronization does slow down?
Yes - interaction with anything external to the processor occurs at one eighth of the speed it really needs to.
I'm sorry if I'm being thick - but I just don't get why you would want to do this. I have already agreed that this does simulate what would happen on a processor with 8 cogs that runs at 1/8 the speed. But if I had the ability to run 8 times as fast, I certainly wouldn't waste this capability on such a futile simulation exercise - I'd run all 8 cogs at that speed instead!
Ok - so if we have (say) 8 threads, then you're sampling once every 8 instructions, and updating once every 8 instructions.
Exactly that.
I'm sorry if I'm being thick -
That's OK I'm very good at that as well. Still, it's possible my argument has a hole in it somewhere.
I have already agreed that this does simulate what would happen on a processor with 8 cogs that runs at 1/8 the speed
OK: We are on the same page.
But if I had the ability to run 8 times as fast, I certainly wouldn't waste this capability on such a futile simulation exercise
Not simulation, real silicon. The round-robin threading is performed by the processor hardware don't forget.
I'd run all 8 cogs at that speed instead!
Ah, now I see the issue. Yes, of course. But it may not be possible; you may not have the available transistor budget to implement 8 processors. All those barrel shifters, multipliers, etc. can get very large when built for speed. Or perhaps there is a power budget.
Or you might go the route of the chip company that will remain nameless. Each core can run 8 threads pretty much as described, giving you all that COG-like parallel execution goodness and perhaps freedom from interrupts. Or, when running only 4 threads, you get twice the speed each, or two threads for 4 times the speed, etc.
Something of the best of both worlds, speed or multiple independent threads. I believe they do that for transistor budget reasons.
Isn't some assistance with threads like that planned for the Prop II?
P.S. Except the chip company that will remain nameless buggered up the independence of threads - the timing determinism - because a few instructions take longer to execute than normal, mul and div I believe. This means that to get timing determinism on their devices you need to use hardware features like clocked I/O. Sound familiar?
I thought that Heater was referring to the hardware threading used by the chip that must remain nameless, but I left it to him to mention it.
It's only division that causes that problem, because it takes more than one cycle. All the other instructions are OK, AFAIK. The problem might have been fixed in the timing analyser tool; I'll check on that.
I am a Propeller loyalist, no question, but I've tried to find a local group to join. The leader of a local group emailed a link about a competing micro. I include it here but we could easily do the same. We have writers. Click here. I think the title should have said 'Propeller'.
Heater, I just discovered that it is open-source, but I had some dispute with HOW they distribute the source code. I guess I am being an idiot, but oh well. Android is Linux-based, so it's somewhat relevant - it may not matter here, though.
uCLinux is said to be C++, with ASM thrown in for good measure, just like the real thing - not 100% C++, though. I may be mistaken.
I am a Propeller loyalist, no question, but I've tried to find a local group to join. The leader of a local group emailed a link about a competing micro. I include it here but we could easily do the same. We have writers. Click here. I think the title should have said 'Propeller'.
The guy writing the article works for a company that sells the Arduino and is writing for Make, which certainly seems in love with it too. I think it is great, and in a sense it has won - but what it has won is the hearts and minds of a lot of people. Still, it won't do the kinds of things I want to do that the Propeller can. There is no best micro and no best platform; it is horses for courses.
I have still been thinking about the Prop II and have re-read the description of it. The main point I was trying to make is that, from everything I have read, there are two things that would go a long way toward really improving the speed of the system when programming in Spin.
The first is to allow the Spin interpreter to be loaded once for each COG, into a 'shadow' RAM on the cog. Call it paging or something else, but this would free up the main cog memory to hold the Spin tokens directly, rather than having them fetched all the time from main memory. The other improvement would be a Spin-to-assembly compiler, so that one could write in Spin and compile to assembly; fast inline PASM would also be possible if there were directives that allowed for it.
Analysis of processor loading, bus traffic and the partitioning of programs and data movement has always tended toward adding more processors and minimizing bus traffic. So after reading about the Propeller and messing around with it (I am no expert by any means), the things that stand out as areas for real improvement are the way the HUB allocates bus cycles to each COG, the time wasted in not being able to bounce between Spin objects and assembly with near-zero overhead, and the inability to have Spin tokens reside up in cog memory.
My current understanding is that when Spin is running, the tokens have to reside in main memory because the Spin interpreter is taking up the cog RAM. Fetching tokens from main memory for each Spin instruction is not as fast as things could be if there were a second bank/page/shadow RAM in each COG where the Spin interpreter could reside. That would allow the Spin tokens to be local in the real cog memory, which, it seems to me, should be faster. It would also allow inline assembly to be executed without the penalty of reloading the Spin interpreter eating up cycles.
This is what leads me to think there is room for improvement here. Maybe one of you Prop experts can calculate the potential speed-up that might be had. It may not be practical either, as it is really a change in the fundamental way the system operates, going to the core of how instructions are stored, fetched and executed. It would increase the complexity of the logic in the hub memory access controller, and it would also require a few directives and a few new instructions.
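For a rough feel for the speed-up being asked about: the published Prop I figures are 4 clocks per ordinary PASM instruction and 7-22 clocks per hub access; everything else below (in particular the amount of decode work per Spin bytecode) is a pure guess for illustration, so treat the result as a shape, not a measurement.

    #include <stdio.h>

    int main(void)
    {
        const double clk_pasm     = 4.0;    /* clocks per ordinary PASM instruction (datasheet) */
        const double clk_hub      = 14.0;   /* mid-range hub access cost, 7..22 (datasheet)     */
        const double decode_insns = 30.0;   /* GUESS: PASM instructions of decode/execute
                                               work per Spin bytecode                           */

        double now   = clk_hub  + decode_insns * clk_pasm;   /* bytecode fetched from hub RAM  */
        double local = clk_pasm + decode_insns * clk_pasm;   /* bytecode already in cog RAM    */

        printf("hub-fetched bytecode: ~%.0f clocks\n", now);
        printf("cog-local bytecode:   ~%.0f clocks (~%.0f%% faster)\n",
               local, 100.0 * (now / local - 1.0));
        return 0;
    }

On those made-up numbers, moving only the token fetch into cog RAM buys a fairly modest percentage; the bigger unknown is how many additional hub accesses each bytecode triggers (operands, stack, and so on), which is exactly the measurement a Prop expert would need to supply.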
Ah, now I see the issue. Yes, of course. But it may not be possible; you may not have the available transistor budget to implement 8 processors. All those barrel shifters, multipliers, etc. can get very large when built for speed. Or perhaps there is a power budget.
Or you might go the route of the chip company that will remain nameless. Each core can run 8 threads pretty much as described, giving you all that COG-like parallel execution goodness and perhaps freedom from interrupts. Or, when running only 4 threads, you get twice the speed each, or two threads for 4 times the speed, etc.
Something of the best of both worlds, speed or multiple independent threads. I believe they do that for transistor budget reasons.
Isn't some assistance with threads like that planned for the Prop II?
P.S. Except the chip company that will remain nameless buggered up the independence of threads - the timing determinism - because a few instructions take longer to execute than normal, mul and div I believe. This means that to get timing determinism on their devices you need to use hardware features like clocked I/O. Sound familiar?
Aha! Now I see the direction of your argument - if you avoid the mul and div instructions, and also avoid interrupts, then you could use an X___ to achieve the level of determinism of a Propeller? Very clever!
After a (very) quick perusal of the X___ thread model, it looks like you could do this using threads and barriers. It might even be possible for the X___ to approach Propeller I execution speeds (although I'm not sure about this).
I suppose you could take this idea further and actually emulate a Propeller - but of course trying to emulate the Propeller's pin handling model in threads would probably slow the whole thing down dramatically. And you'd still be missing the counters, which can also be used for fast I/O on the Propeller.
Perhaps someone with X___ expertise would like to take this on as a project? - it would certainly demonstrate once and for all that the X___ can actually be used for something useful!
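For what it's worth, the "threads and barriers" part maps directly onto POSIX threads. A minimal sketch under that assumption (eight workers, a barrier at the start and end of each simulated instruction slot, shared "pins" touched only between barriers; POSIX barriers and the GCC atomic builtin are assumed to be available):

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NTHREADS 8
    #define NCYCLES  4

    static pthread_barrier_t bar;
    static uint32_t in_latch, out_shadow;        /* shared "pins", stable within a cycle */

    static void *worker(void *arg)
    {
        int id = (int)(intptr_t)arg;
        for (int c = 0; c < NCYCLES; c++) {
            pthread_barrier_wait(&bar);          /* cycle start: in_latch is now stable  */
            uint32_t bit = 1u << id;
            if (in_latch & bit)                  /* each thread owns exactly one bit     */
                __sync_fetch_and_or(&out_shadow, bit);
            pthread_barrier_wait(&bar);          /* cycle end: outputs may be driven     */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        pthread_barrier_init(&bar, NULL, NTHREADS + 1);   /* workers + the "clock"       */
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);

        for (int c = 0; c < NCYCLES; c++) {
            in_latch = 0x0F;                     /* sample the real inputs here          */
            pthread_barrier_wait(&bar);          /* release the workers for one cycle    */
            pthread_barrier_wait(&bar);          /* wait for them all to finish          */
            printf("cycle %d: out = 0x%02X\n", c, (unsigned)out_shadow);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

As the post above says, the pin-level and counter behaviour of a real Propeller is the part this kind of emulation cannot reproduce cheaply.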
Division is the only problematic instruction, and there is a way around it. Interrupts are only provided for legacy code and aren't necessary; they have been replaced by events.
It's much faster than the Propeller, and can do High-Speed USB and 100 Mbit/s Ethernet in software. The Propeller can just about manage Full-Speed USB.
Yeah, yeah. Don't forget that the Propeller can be programmed to be very deterministic around events, both external and internal - it has been done on many microcontrollers, even x86 / PowerPC CPUs. I know the XS1 is faster, but the more important fact is the price tag: a quad-core XS1 costs $30, while the octo-core Propeller costs only $8. I don't want to insult anyone, but that's the number one thing most people are interested in at the moment. It would be nice if the XS1 dropped in price, but I really doubt it - at least for a while - because 90 nm / 65 nm SOI costs money.
Okay, the most expensive microcontroller (sort of) is Texas Instruments' 45 nm HKMG octo-core C66 DSP - it costs $165.
Just pointing out.
Cluso99 said: One last thing, we are all trying to do things that the prop was not designed for. Why?
All that I/O accomplished with nothing but resistors from a $9 chip is a big attraction. That, and the fact that it's different; it rethinks the whole idea of how resources should be allocated to make a uC. Instead of more memory and faster, faster, faster, we've got an elegant mix of useful stuff with the PLL clock, the cog counter/timers with their PLLs, and the cogs whose memory seems so limiting at first (512 instructions! WTF!) until you realize how powerful those instructions can be, since each one can contain source, destination, instruction, and a bunch of modifiers all in one long. Brilliant. It's a whole different kind of programming; it's like stepping back to the Atari 2600, where, as one of the big retro programmers' sig put it, "every bit is sacred" - only unlike the 2600 the Prop can do stuff that is useful today, like drive VGA displays. More than one of them at a time, even. It was a bellwether moment for me when I was debugging something on the PPDB and the main display output was very busy, and I needed a debug output channel; I just popped the three resistors for the voltage divider onto the breadboard, started a second instance of tv_text, and boom, there I was with two completely separate video outputs. Show me another platform where you can drive four TV sets from one chip while doing real processing at the same time, and that sells for eight bucks, and I'll be impressed.
Comments
No, it is not slowed. Consider:
Without the I/O synchronization that I describe, any thread's instruction could set an I/O pin at whatever point in the round-robin (or whatever) scheduling cycle it happens to execute. That is apparently faster, at the price of being non-deterministic, since it depends on which thread gets there. BUT the thread still can't change that I/O again until it gets its next "turn". So at least we see that simple toggling speed is not affected by the synchronization.
Similarly, if a thread is waiting on an input change before continuing, it may see the input sooner without the synchronization, BUT it cannot react (perform its next instruction) until at least its next "turn".
Therefore I maintain that the I/O synchronization among such interleaved threads is not slowing anything down.
Re: Cortex
So what you are saying is that they have arranged for the time from the input signal to the first instruction fetch of the handler, for the highest-priority interrupt, to always be the same, i.e. the worst-case time.
Well, might be useful somewhere, sometime, possibly, maybe, I guess.
I can't disagree with it being horrible, but when you have an architecture whose instruction register fields can't be enlarged and you need more registers, what are you going to do? You can either say no, no more registers, and accept the consequences of that, or accept, as undesirable as it is, that banked registers do provide a solution. If it would make the difference between commercial success and failure, is the pragmatic choice really that offensive?
For the Prop architecture, banked cog memory would be a much cleaner implementation than it has been on other processors, notably the earlier PICmicros. Above all there's one significant advantage: if you don't want to use it, you don't have to.
The question is not whether it makes us shudder or how ideologically offensive it is, but whether it is needed and whether it would make the Prop more successful. Are the limitations of cog size real or only perceived, and can they be addressed in some other way?
The reality is that cog memory is limited, and people do hit those limitations. I have, and I expect even the Spin interpreter itself would do much more if the memory were larger. Most VMs seem to have involved a battle against those limitations and how to work around them. LMM is a solution, but also a trade-off of speed against memory use, and it can simply displace memory use to the hub. VMs also show up other issues of the Prop PASM design: the lack of indirect register access and limited bit-field decoding. Not insurmountable, but they do have an impact, especially on code size. Other micros may be no more capable, but they don't have the memory size limitation as well.
I get the whole deterministic timing thing, which in the case of needing to have very precise timing makes sense.
However....a lot of times there *ARE* basically random events that need to be captured. Without interrupts we can (1) sit in a tight loop polling for the event, or (2) in the case of the prop, sit idle at a waitxxx instruction.
(1) We waste time polling, and we also lose deterministic timing: the more useful work we try to accomplish in the loop, the worse the timing gets.
(2) Very deterministic timing, no other work getting done.
It seems to me it would be ideal to have a multicore processor, like the prop, but with interrupts also available.
C.W.
Locations that are used as registers for add, sub, mov, etc. could be limited to the first 512 longs of memory. Or there could be an index register that adds a bias to the source and destination fields. Wouldn't it be nice to have source and destination index registers that would allow random access of cog memory?
I'm not suggesting that this should be done in Prop 2. The design requirements need to remain frozen for the Prop 2 or it will never get done. However, these suggestions could apply to a Prop 3.
Dave
Edit: How about adding rdlongx, wrlongx, rdwordx, etc. instructions to Prop 3 that would use the next hub slot if its owner is not using it? That way you could get the highest performance by using the "x" instructions, or deterministic timing by using the current hub access instructions.
If a hub fetch is more productive, there isn't as much of a reason to break the simple access system. 4 longs has been tossed about as being possible. Breaking the simple access system breaks the portability of the various objects. What happens when one loads two of those "if no other HUB is using it" objects?
The next COG will have instructions that improve its code space use as well, the "REP" instruction being one of those we saw in testing.
Re: Interrupts are evil. Well, I don't think so. I like them on traditional CPUs, mostly because they are necessary. They are not necessary on the Prop, and frankly, that's interesting and potent. Improving on that seems again the right answer, instead of diluting what differentiates the Propeller. It's really powerful to be able to write little chunks of code, and have them interact with other chunks without worrying about their overall impact on the interrupt kernel / OS that is necessary to manage things otherwise.
As Devil's Advocate - the fact that one needs to resort to LMM to overcome limitations of the Cog shows the very limitations. Only if LMM can be an equal to PASM will that limitation cease to exist. It's unlikely to ever be a perfect equal and it's still asking a potential user to jump through hoops they would not have to otherwise - albeit that memory banking or whatever also involves some hoops to jump through.
On Prop I, we need to resort to LMM because it really wasn't part of the initial design; it just happened that it was possible. It took a lot of work to make it practical, though. But now that we know about it, and we've seen how things can go using it, the idea of needing to make the cog bigger seems far less important. On the Prop I, LMM is slower than people would like, and the scale of Prop I in general is small enough that every bit counts, but will that really be true for Prop II? Or does it have to be that way with Prop II? An improved hub fetch and some instructions designed for LMM change the game considerably.
So a cog could be a dedicated peripheral, or it could be a "CPU" with specialized instructions tunable for various tasks running as LMM code, or a DMA type thing, in charge of larger memory spaces, or servicing requests to move data from outside memory to the hub, etc... Or the cog could just run SPIN. Lots of choices, each with trade-offs. LMM could come fairly close to PASM speeds, given a larger hub memory fetch data size per access window for example. What is that worth, compared to adding complexity in the COG via bank selects, word size changes and such?
Seems to me, for those who want interrupts, a cog running LMM code could easily be polling for some condition and interrupting those programs easily enough. Software-defined silicon? Maybe that's a pretty reasonable answer to those regular queries. Just asking at this point, mostly because I think it's worth asking.
And bank switching, I must admit, is an engineer's nightmare. What used to be open to probing eyes is now largely hidden: most CPUs handle it themselves nowadays - the boot firmware usually contains the bank-switching support, so the job is taken out of our hands. So, no more touching bank switching (good news for those who hate it). BUT on some microcontrollers, such as the ARM Cortex parts and a few others, it is still visible to the user - you have to look before you buy, if you really want it that badly.
SRLM, honestly, I really doubt either microcontroller is going to be obsolete any time soon - more like a decade or longer. Some teachers are really ignorant about technology trends - microcontrollers don't die quickly; take the PIC as an example (it's one of the oldest microcontrollers, sold for decades now). If you already know that, sorry - just pointing out the often-ignored obvious. Android, on the other hand, is totally closed-source. It won't be long before Google gets sued by the Linux community: Linux is explicitly licensed under GPL v2 / v3, which stipulates that the source should be openly distributed, not withheld for profit or, worse, for monopoly practices.
About interrupts: they are still a nifty feature that is not available on the Propeller. But we can build software-based interrupt vector tables on the Propeller, similar in spirit to the vectored interrupt layer of the ARM Cortex-M3 core. It takes careful kernel coding, however.
Oh, and to point out: I am not centered on just one microcontroller - I use any microcontroller I can get my hands on. TI's (formerly Luminary Micro) ARM microcontrollers are still dirt cheap but potent for the price, and they are surprisingly compatible with uCLinux, which is in C++ and can be used for almost anything, even the LED DC-DC converter in your DIY flashlight.
I initially tried Android, but gave up because I had to download the whole thing via the Android Developer Kit software. Troublesome...
I really want to see it as a tarball. It's hard to call something open-source if you're required to download it via the manager.
I will just drop the Android topic for a while - I don't want to start a war. It's pointless anyway - there are plenty of OSes that can be ported to almost anything. What I like about Android is that its GUI is quite nice, and it even has Google Earth, another useful feature for travelers (even on a phone!).
And the Propeller's COG is similar to an ARM Cortex-M0-class core, only without interrupts; it executes an instruction every four cycles (it's a simple pipelined CPU, but plenty powerful when driven in assembly). What's bad about a narrow pipeline in a CPU's integer unit is that it can get bogged down pretty quickly - the Spin interpreter dragging the chip from 160 MIPS down to 20 MIPS is a perfect example. A wider integer pipeline would certainly help - and how much could we squeeze out of it in assembly? A lot more - perhaps 5.2 GIPS out of one chip. Out-of-order execution units in a COG could run Spin nearly instantly, since the core would just park the "insufficient data" threads, run whatever is ready first, and come back to the rest later. In assembly it would go further still. An out-of-order COG would be a nice feature for 3D graphics or some really heavy mathematics. For now, the four-stage design is fine by me.
I have Android 2.2 on my Dell Streak tablet. It's very nice.
I still have the SDK on my portable 500 GB hard drive - I will reinstall it later and then get it. (I bet it will take forever on my crappy modem. Linux is ridiculously huge nowadays too - 4.7 GB for a DVD... I have a Linux Mandriva 2009 DVD for my 64-bit workstation; I tried it on a Phenom after swapping that CPU in for an Athlon 64 X2 as a test - not too bad. HPET tends to freeze the whole thing: it's a well-known bug on the Phenom Agena core, fixed in the Phenom II Deneb.)
And what's important - in my experience, anyway - is that threads should be carefully allocated, each with its own handler object and some stack RAM for its job. Even if I were to hack the XS1 to run threads out of order, I would still leave each handler as-is, just to be safe. (Why? I have seen a Pentium 4 with Hyper-Threading go bonkers when fed bad code [a weird blue screen with a general protection fault saying "INVALID OPCODE: EAX:00000000 ECX: 000000FF EBX: 0000FF02 EDX: 0A200000" - you know the whole story]. I did it twice - once on purpose, once by accident.)
The XS1 is pretty nice, but as I emphasized earlier, I use what I can get. I don't have the XDK yet, though - I really need one to give me a head start. (I am a slow point-earner, and I don't have my own website - they're expensive nowadays. My account at xcore.com seems to keep forgetting my password - surely that may be happening to others as well.)
So ... let me get this straight - you're introducing an artificial synchronization mechanism to slow down the potential of the processor to interact with the real world by a factor of 8, so that you can simulate 8 cogs?
Isn't that what I said above?
Exactly. I'm still trying to think of a case where this would be a useful thing to do.
Ross.
Leon,
I honestly don't see why you don't seem to get this - "Deterministic interrupts" is meaningless in the presence of multiple or concurrent interrupts (see this thread, where we discussed this issue previously - specifically, the article by Atmel on "determinism and latency").
To say that "other operations are fully deterministic" is true but superfluous - aren't all operations on all processors determinstic in the absence of interrupts? Or are you aware of a processor that takes a random amount of time to execute a particular instruction?
Attempts to make interrupt driven processors "deterministic" always requires additional artifical synchronization mechanism to be introduced - precisely to overcome the indeterminacy introduced by interrupts.
BTW, this is somewhat akin to Heater's though experiment of attempting to reproduce what cogs can do using threads. To do so, he has to artificially slow down the ability of the system to interact with the real world. The answer to his thought experiment is essentially "yes, you could do that, but why use a fast processor to simulate what a much slower and simpler processor can already do?"
Ross.
Anyway...
What I wanted to say concerns cog RAM. The reason is, as hippy said, we have had to overcome the minimal cog RAM size. Now, Prop II addresses this somewhat with the extra 512 B or 512 longs - I am not sure which. This can be used in various ways, but not for code, and we will have to wait to find out how. However, Prop II also gives us other great features such as 8x performance, 4x hub transfers and 2x hub access. So, IMHO, overlays will perhaps be far faster than LMM, and they are easier to code. While overlays are not the best solution, they may be better than bank switching and the cog memory space it would consume on the die for those cogs not requiring the extra memory. Hub RAM is again critical, but at least it can be shared as required. And we have the extra pins for SRAM and SDRAM, with some support in hardware as yet not fully revealed.
One last thing: we are all trying to do things that the Prop was not designed for. Why? The Prop must have some advantage for us to be persevering with this solution instead of using another chip that could possibly be better at doing what we want. So, we have found a problem and we are finding ways to work around those limitations. That says heaps for the Prop's design, doesn't it! (Or we would have moved on.) Think about it.
Pin-based interrupts can be done as explained in the notes somewhere on Parallax's Propeller PDF page. It helps to have static interrupt handling - but for totally dynamic interrupts, it's always better to do it in software.
And about the ARM Cortex's two-cycle execution: it really DEPENDS on how you write the software - it can hardly be achieved in reality, because the core has to maintain several reference stacks, which take a few cycles to look up and load into L1 cache or to flag in local system memory. Plus, to get anywhere near zero-cycle execution, the code HAS to be clean and bug-free - which is practically impossible (unless you have PC-Lint, which is way too expensive for mere mortals).
And about COG RAM: YAY!!! I really wanted more RAM, as it will really help with branch and instruction storage, along with kernel threads. I knew 2 KB of COG RAM wasn't enough, let alone the planned 4 KB.
Ross and Cluso99 - I can agree with you.
Seeing how the hardware works always helps me with software decisions.
No. Android is totally open source: http://source.android.com/source/index.html
Anyway, what has Android got to do with anything discussed here?
I don't think you will find any C++ in uCLinux.
No.
I did not say anything about simulating 8 COGs. I'm talking about building, in silicon, a single processor that can have exactly the same functionality/timing as 8 parallel processors.
I'm only suggesting that 8 threads on a single CPU could be equivalent to 8 threads each running on its own CPU.
That is to say, given a suitably fast technology a single processor could be built to perform exactly the same work as 8 processors with exactly the same timing.
There is nothing logically "magic" about parallelism. There is nothing a parallel processor can do that a fast enough single processor cannot.
Now about that timing:
1) I'm going to assume a simple model where one CPU does the work of eight by performing one instruction from each "thread" at a time in a round robin fashion.
2) So we have a cycle time in which all threads get one instruction done.
3) All inputs are sampled at the beginning of that cycle and all outputs are clocked out at the end of the cycle. In this way, for example, if all 8 threads set an I/O bit high within a cycle all the I/O bits will go high at the same time. As we would expect for one instruction cycle on the Prop for example.
Now, I maintain that such an I/O clocking scheme:
1) Makes all I/O timing indistinguishable from a system with 8 processors.
2) Does NOT "slow down the potential of the processor to interact with the real world by a factor of 8".
Why? Well, see my previous posts...
1) If an instruction in a thread sets an I/O bit, it cannot clear that bit until the next cycle. So toggling speed is limited to one eighth of the single-instruction rate anyway, and the synchronization does not slow that down.
2) If a thread is responding to an input during an I/O read instruction, it has to wait a whole round-robin cycle before it can do its next instruction. So synchronization does not slow that down either.
3) Can you think of anything the synchronization does slow down?
Ok - so if we have (say) 8 threads, then you're sampling once every 8 instructions, and updating once every 8 instructions.
Yes - interaction with anything external to the processor occurs at one eighth of the speed it really needs to.
I'm sorry if I'm being thick - but I just don't get why you would want to do this. I have already agreed that this does simulate what would happen on a processor with 8 cogs that runs at 1/8 the speed. But if I had the ability to run 8 times as fast, I certainly wouldn't waste this capability on such a futile simulation exercise - I'd run all 8 cogs at that speed instead!
Ross.
Exactly that.
That's OK I'm very good at that as well. Still, it's possible my argument has a hole in it somewhere.
OK: We are on the same page.
Not simulation, real silicon. The round-robin threading is performed by the processor hardware don't forget.
Ah, now I see the issue. Yes, of course. But it may not be possible; you may not have the available transistor budget to implement 8 processors. All those barrel shifters, multipliers, etc. can get very large when built for speed. Or perhaps there is a power budget.
Or you might go the route of the chip company that will remain nameless. Each core can run 8 threads pretty much as described, giving you all that COG-like parallel execution goodness and perhaps freedom from interrupts. Or, when running only 4 threads, you get twice the speed each, or two threads for 4 times the speed, etc.
Something of the best of both worlds, speed or multiple independent threads. I believe they do that for transistor budget reasons.
Isn't some assistance with threads like that planned for the Prop II?
P.S. Except the chip company that will remain nameless buggered up the independence of threads - the timing determinism - because a few instructions take longer to execute than normal, mul and div I believe. This means that to get timing determinism on their devices you need to use hardware features like clocked I/O. Sound familiar?
It's only division that causes that problem, because it takes more than one cycle. All the other instructions are OK, AFAIK. The problem might have been fixed in the timing analyser tool; I'll check on that.
somewhat relevant - it may not matter here, though.
uCLinux is said to be C++, with ASM thrown in for good measure, just like the real thing - not 100% C++, though. I may be mistaken.
The guy writing the article works for a company that sells the Arduino and is writing for Make, which certainly seems in love with it too. I think it is great, and in a sense it has won - but what it has won is the hearts and minds of a lot of people. Still, it won't do the kinds of things I want to do that the Propeller can. There is no best micro and no best platform; it is horses for courses.
Graham
re... No No No.. paging....
I have still been thinking about the Prop II and have re-read the description of it. The main point I was trying to make is that, from everything I have read, there are two things that would go a long way toward really improving the speed of the system when programming in Spin.
The first is to allow the Spin interpreter to be loaded once for each COG, into a 'shadow' RAM on the cog. Call it paging or something else, but this would free up the main cog memory to hold the Spin tokens directly, rather than having them fetched all the time from main memory. The other improvement would be a Spin-to-assembly compiler, so that one could write in Spin and compile to assembly; fast inline PASM would also be possible if there were directives that allowed for it.
Analysis of processor loading, bus traffic and the partitioning of programs and data movement has always tended toward adding more processors and minimizing bus traffic. So after reading about the Propeller and messing around with it (I am no expert by any means), the things that stand out as areas for real improvement are the way the HUB allocates bus cycles to each COG, the time wasted in not being able to bounce between Spin objects and assembly with near-zero overhead, and the inability to have Spin tokens reside up in cog memory.
My current understanding is that when Spin is running, the tokens have to reside in main memory because the Spin interpreter is taking up the cog RAM. Fetching tokens from main memory for each Spin instruction is not as fast as things could be if there were a second bank/page/shadow RAM in each COG where the Spin interpreter could reside. That would allow the Spin tokens to be local in the real cog memory, which, it seems to me, should be faster. It would also allow inline assembly to be executed without the penalty of reloading the Spin interpreter eating up cycles.
This is what leads me to think there is room for improvement here. Maybe one of you Prop experts can calculate the potential speed-up that might be had. It may not be practical either, as it is really a change in the fundamental way the system operates, going to the core of how instructions are stored, fetched and executed. It would increase the complexity of the logic in the hub memory access controller, and it would also require a few directives and a few new instructions.
Regards,
Walt,
Aha! Now I see the direction of your argument - if you avoid the mul and div instructions, and also avoid interrupts, then you could use an X___ to achieve the level of determinism of a Propeller? Very clever!
After a (very) quick perusal of the X___ thread model, it looks like you could do this using threads and barriers. It might even be possible for the X___ to approach Propeller I execution speeds (although I'm not sure about this).
I suppose you could take this idea further and actually emulate a Propeller - but of course trying to emulate the Propeller's pin handling model in threads would probably slow the whole thing down dramatically. And you'd still be missing the counters, which can also be used for fast I/O on the Propeller.
Perhaps someone with X___ expertise would like to take this on as a project? - it would certainly demonstrate once and for all that the X___ can actually be used for something useful!
Ross.
It's much faster than the Propeller, and can do High-Speed USB and 100 Mbit/s Ethernet in software. The Propeller can just about manage Full-Speed USB.
Okay, the most expensive microcontroller (sort of) is Texas Instruments' 45 nm HKMG octo-core C66 DSP - it costs $165.
Just pointing out.
All that I/O accomplished with nothing but resistors from a $9 chip is a big attraction. That, and the fact that it's different; it rethinks the whole idea of how resources should be allocated to make a uC. Instead of more memory and faster, faster, faster, we've got an elegant mix of useful stuff with the PLL clock, the cog counter/timers with their PLLs, and the cogs whose memory seems so limiting at first (512 instructions! WTF!) until you realize how powerful those instructions can be, since each one can contain source, destination, instruction, and a bunch of modifiers all in one long. Brilliant. It's a whole different kind of programming; it's like stepping back to the Atari 2600, where, as one of the big retro programmers' sig put it, "every bit is sacred" - only unlike the 2600 the Prop can do stuff that is useful today, like drive VGA displays. More than one of them at a time, even. It was a bellwether moment for me when I was debugging something on the PPDB and the main display output was very busy, and I needed a debug output channel; I just popped the three resistors for the voltage divider onto the breadboard, started a second instance of tv_text, and boom, there I was with two completely separate video outputs. Show me another platform where you can drive four TV sets from one chip while doing real processing at the same time, and that sells for eight bucks, and I'll be impressed.