In a nutshell: the Propeller has WAITCNT and WAITVID; on XS1 you have one WAIT instruction that you can use to wait for any combination of events from various peripherals. And that's only the beginning.
XS1 has a fairly powerful software-controlled interrupt vectoring system. The interrupt vectors are not permanently assigned to peripherals, as they are in many MCUs. Instead, you can assign any vector to any event-generating peripheral (I/O port, etc.). When the event happens, the vector points to the next instruction to be scheduled for the given thread. A vector is specific to a thread, so you have full thread affinity for responding to external events.
The classical problem of what to do when different events all reuse the same interrupt handler (vector) is handled very nicely, too. Normally you have to interrogate status bits to know what happened if the handler could be triggered by different things. On XS1, you assign a so-called environment vector to each event source. It's simply a data word that's available in your interrupt/event handler and lets you adjust your logic according to the interrupt source. You can use it as a bitmask, as a jump offset or table offset, or whatever suits your application. I haven't seen anything like that on any of the mainstream MCUs -- feel free to correct me if I'm wrong. You normally have to emulate this by setting up code to write a value somewhere, then jump to common handler code. This costs precious cycles. On XS1, an event/interrupt handler can be done in a couple of thread cycles -- say in 80ns. That's less than one clock period on some MCUs.
The major difference between XS1 and Propeller (P1 and P2 both) is that Propeller has no interrupt support at all. On XS1, event/interrupt support enables essentially free event-driven switch statements. You can wait on many things to happen, and there's no time penalty for that. Waiting on one event is no different from waiting on 10 events, in terms of latency. Of course if two things happen at the same time, you can't process them concurrently in the same thread, but at least your code doesn't get any slower from trying to wait on many things in the first place. This is IMHO a very sane design decision.
The difference between events and interrupts is fairly simple on XS1. An Event handler does not preserve the PC. You have to be within a WAIT instruction for an event to fire. It is like a hardware-driven switch statement. You have sole control of the execution path after you're done handling an event. An Interrupt does the usual automagical PC/status storage in registers dedicated for that purpose, so there's no memory access overhead for that.
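For a flavour of what this looks like from the programmer's side, here's a rough XC sketch of one thread waiting on several event sources at once (the compiler lowers each case onto the vectored event mechanism described above). The port assignments and constants are made up, so treat it as a sketch, not a working driver:

    #include <xs1.h>

    in port p_button = XS1_PORT_1A;       // made-up port assignments
    in port p_data   = XS1_PORT_8B;

    void event_loop(chanend c) {
        timer tmr;
        unsigned t, v = 0;
        int msg;
        tmr :> t;
        while (1) {
            select {                                       // core sleeps in a WAIT until one case fires
                case p_button when pinseq(1) :> void:      // button pin went high
                    break;
                case p_data when pinsneq(v) :> v:          // data port changed value
                    break;
                case tmr when timerafter(t + 100000) :> void:
                    t += 100000;                           // 1 ms tick at the 100 MHz reference clock
                    break;
                case c :> msg:                             // message from another thread
                    break;
            }
        }
    }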
Yeah, but that was a painful process of discovery. Pulling loose teeth from my daughter's mouth is more pleasant than that, I kid you not.
I hear you. I puzzled long and hard over the scattered documentation until clarity finally set in. Much of what I learned actually came from looking at the assembly code generated by the compiler rather than from the documentation.
When you take that little $49 SparkFun dev board with the 400MHz XS1-L1 chip on it, you have a device that does thread switches every 2.5ns, with no overhead: every consecutive instruction that's executed comes from a different thread. For about $20 you get an XS1-G4 chip that has four of those cores in it.
XS1 is very similar to Propeller in that core-to-core (cog-to-cog) communications have to go through a channel (hub). XS1 does not have hub memory, except for some FIFO channel buffers, though it does have 4 "free" threads per core. So a $20 qty 1 XS1-G4 gives you 16 threads of execution, all of them running at 100 MIPS each, or 1600 MIPS overall. A P2 at probably half that price will give you 8 threads of execution at 160 MIPS each, or 1280 MIPS overall. So those parts are in the same ballpark, really. I'd say that 100 MIPS per thread on XS1 is equivalent to 160 MIPS per thread on P2, since the former has a more powerful instruction set that lets you do more in a single cycle. P2's powerful all-instruction conditional execution could be a counterargument, but I don't think it's that big of a deal. So who wins where, then?
P2 wins IMHO in terms of the usefulness and capability of its I/O pins and counters (plentiful!). It's completely unique in that sense. There are plenty of interfaces where you don't need anything besides what P2 gives you. P2's major shortcoming could be the impedance mismatch between on-chip and off-chip communications: HUB access only works within a chip.
XS1 wins in terms of the flexibility of its architecture: ease of communication between multiple cores, both on- and off-chip, the more powerful instruction set, the more flexible memory architecture, and hardware resource management (resources: timers, clocks, channels/channel ends, ports). Unlike on P2, communicating between threads on a single core is no different from communicating between threads on two chips with a couple of XS-link switches in between them. If you want to cheat within a single core, you can always share RAM areas, with a big CAVEAT EMPTOR attached.
At the moment, the programming tools are IMHO roughly a tie between the two platforms (of course I can only judge what's out for P1). The ViewPort debugger is IMHO a good enough counterbalance to the lack of high-level languages beyond SPIN, when you compare it to the higher-level languages available for XS1 (C, C++ and XC).
XS1 and P2 will be, overall, very similarly performing devices, apart from cases where device-specific features are of a make-or-break kind. I'd personally not attempt to code a complex industrial communications protocol in PASM or SPIN, but I'd gladly use P2 in a controller/sensor application as an I/O front-end and fixed-function DSP processor. The logic and networking would probably be better handled in XC on XS1.
There are some applications where P2 seems like a perfect fit, say filtering of modulator data from multi-channel sigma-delta converters like TI's ADS1278. One COG per channel, at 160 MIPS, would be more than plenty to massage the data if the on-chip filters don't suit you. Of course, the "standard" way of doing it is in a large DSP or an FPGA, but P2 would probably be more cost-effective, at least for R&D and small runs.
For some other applications, XS1 is a perfect fit. Implementing an EtherCAT slave (a fast, 100 Mbit Ethernet-based industrial communications protocol) turned out to be fairly easy on the XMOS device -- and here we're replacing a custom ASIC or FPGA IP with a purely software solution. Adding goodies like Ethernet over EtherCAT was simple once you got the hang of it. A stand-alone fixed-latency realtime Ethernet switch was easy to do as well. I'd probably be in deep dodoland trying to squeeze it all into COGs, even on a yet-to-emerge P2. On XS1-G4, I could partition threads across cores to evenly spread out RAM use, getting the most out of the 64 KB of shared data/code RAM on each core.
There was a rant here that XS1 counters/timers are "16 bit only", that there's no capture, etc. First of all, you don't need anything more. The software has plenty of speed to do the housekeeping needed to simulate longer timer registers. The reason for this was architectural: since you often wait on timers, you want short instructions to set that up, so you get 16 bits for the instruction and 16 bits for the timer value. The important part, though, is that every port can have a timer capture assigned for timestamping.
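To make the "housekeeping" concrete, here's a minimal sketch (XC, names invented) of extending the 100 MHz reference timer in software; the only requirement is that the owning thread reads the timer at least once per wrap of the 32-bit counter (roughly every 43 seconds):

    typedef struct { unsigned hi; unsigned lo; } wide_time_t;

    // Refreshes a 64-bit timestamp (hi:lo) built from the 32-bit hardware timer.
    // Call it at least once per hardware wrap; in real code you'd keep one
    // timer resource around instead of declaring it on every call.
    void wide_now(wide_time_t &state) {
        timer tmr;
        unsigned now;
        tmr :> now;
        if (now < state.lo)        // counter wrapped since the last reading
            state.hi++;
        state.lo = now;
    }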
Maybe this can be another benchmark: http://forums.parallax.com/showthread.php?125543-Propeller-II-update-BLOG&p=1009229&viewfull=1#post1009229
Note with good timer silicon, I can capture either edge, and resolve narrow pulses down to 1 CLK pulse widths.
I can also Divide a 100MHz clock, by N, and capture from that, again to 10ns resolutions.
So, rather than vague 'plenty of speed' claims, I'd like to see someone show me real numbers,
and software that manages the nasty cases, like what happens when an event arrives exactly when the 16 bit timer wraps.
Being right 99.99847% of the time might be OK for Microsoft, but it's not in engineering.
There was also some confusion about multithreading... For 5-8 threads, the thread cycle lengthens proportionally (8ns @ 4 threads, 16ns @ 8 threads). Every thread has a full set of its own registers, including hidden state registers only visible to the ALU. Those are switched into the instruction executor with no delay or overhead.
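For reference, the rule of thumb (as I understand the four-stage pipeline -- double-check against the datasheet) is simply the core clock divided by the number of active threads, floored at four:

    // Per-thread issue rate on XS1, assuming no thread ever stalls:
    //   thread_MIPS = core_MHz / max(4, active_threads)
    // e.g. on a 400 MHz core: 1-4 threads -> 100 MIPS each, 5 -> 80, 8 -> 50.
    unsigned thread_mips(unsigned core_mhz, unsigned active_threads) {
        return core_mhz / (active_threads < 4 ? 4 : active_threads);
    }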
To me that is a flaw.
Sometimes hard real time, means exactly that.
They should have given users the option to 'spread all the load', or to lock in resources for vital threads and then 'spread the rest'.
If you lock in, sure, the slow threads are slowed down more; the premise that everyone shares the time equally may be democratic, but it is not engineering.
The FIFOs are 16 bytes deep.
Timers are 32-bit, but the timers applied to ports ("switch this port at this time") are 16-bit -- two completely different beasts! And by analogy, what happens when your 32-bit timer wraps around?... It wraps less often than a 16-bit one, but it's not 100% either.
Timers also have a prescaler; you do not need to use the 100 MHz clock for that.
I'd really like to see where you crash with this "it is a flaw". There are other more powerful solutions... it all depends on the needs. You can get a 256-bit timer using an FPGA if you want... or build it with interrupts and a few memory locations, or use a real-time clock!... There are many different ways of solving a problem; whether it is a "proper" solution is normally irrelevant to the end user as long as it works reliably (and some end users can even work with less-than-reliable solutions, ahem *desktop computers & associated software* ahem).
I'd really like to see where you crash with this "it is a flaw". There are other more powerful solutions... it all depends on the needs.
Did you read what I was replying to?
The flaw is the lack of user control over the thread scheduling -- it is actually very simple, and not a question of "more powerful solutions".
Rather than stretching all threads equally, all they needed to do was let users choose who gets the time slices.
The hard work is already done -- they can already flip threads very quickly -- so this, for one example, should be a viable user choice for 5 threads:
100MHz,100MHz,100MHz,50MHz,50MHz
and 6 threads
100MHz,100MHz,100MHz,33MHz,33MHz,33MHz
or,
100MHz,100MHz,50MHz,50MHz,50MHz,50MHz
Notice that now 2 or 3 threads have locked time slots and do not drift down in speed (see the sketch below). Or users could choose equal allocation and get the same time-drift as now. Their choice.
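To illustrate (purely hypothetical -- the current silicon does not offer this), a fixed 8-slot rotation would give the first 5-thread example above like so:

    // Hypothetical fixed rotation for the 5-thread case (A,B,C locked, D,E sharing):
    //   slot:    0  1  2  3  4  5  6  7   (then repeat)
    //   thread:  A  B  C  D  A  B  C  E
    // On a 400 MHz core, A, B and C get an issue slot every 4 cycles (100 MHz each),
    // D and E every 8 cycles (50 MHz each) -- and A, B, C never drift.
    static const char slot_rotation[8] = { 'A','B','C','D','A','B','C','E' };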
jmg: I read what you wrote, but it is a bit more complicated, because waiting threads free up resources for running threads (it is written in the manual, although I haven't tested it). I see what you mean, and that is why I'm asking when you need it. I'd really like to see an example (it's not that I don't believe it, it's that I want to see where it is used!)
On XS1, to each event source you assign a so-called environment vector. It's simply a data word that's available in your interrupt/event handler, and lets you adjust your logic according to the interrupt source. [...] I haven't seen anything like that on any of the mainstream MCUs -- feel free to correct me if I'm wrong.
I haven't seen that on MCUs either. But there was something similar on the NORD-10 minicomputer from 1974 (it may have been used on the earlier NORD-1 as well, but I'm not familiar with the NORD-1 architecture). The NORD-10 had 16 interrupt levels, each with its own full set of registers btw, and a diverse arsenal of interrupt handling mechanisms to choose from depending on the speed necessary. Four of those interrupt levels had something not unlike what you describe for the XS1. When multiple devices were using the same interrupt level, a device-specific word would end up in the A register. This would cost one instruction: the first instruction executed after the interrupt was an IDENT, and that's when the word was transferred to the A register. The device-specific word would be either hardcoded on the device (which, hardware-wise, was a board on the bus), set by thumbwheel, or programmable (it had to be unique for the bus and interrupt level, though).
It was certainly fast, for the time. There were faster variants too for some of the other levels, but this was what you would use for your ordinary device / peripheral handling.
jmg: I read what you wrote, but it is a bit more complicated, because waiting threads free up resources for running threads (it is written in the manual, although I haven't tested it). I see what you mean, and that is why I'm asking when you need it. I'd really like to see an example (it's not that I don't believe it, it's that I want to see where it is used!)
I'm not sure I follow. Are you asking why determinism matters?
Plenty of examples of that: frequency synthesis, time measurement that has a software-sampling component, phase modulation...
Something like chroma modulation is relatively easy in a thread, but only if that thread always runs at a known phase velocity.
Modulate that, and you (or your customers) will see the consequences.
Or, one thread might NEED full speed to complete, and if that happens, you can only use half the chip as supplied, whilst I could use all threads in my improved alternative.
I am asking for an example where all the resources have to be used all the time. I can also come up with individual examples... the thing is to have 8 of them on the same chip. BTW, which alternative do you propose to use?
This is similar to the debate about HUB access timing on the Prop. Many have suggested that if some cogs are not in use, the HUB switch should skip them and hence give more frequent access to HUB RAM for the cogs that are running. Or that there should be a mechanism to give more HUB access slots to selected cogs regardless of what else is running. And other variations on this theme.
My take is that it's a complication that may be useful in some cases but in general messes up the determinism of the Prop. No longer could you just throw objects from OBEX or elsewhere into your project and be sure that they don't have bad timing interactions with each other.
I'm inclined to make the same case against such priority schemes in xcore thread scheduling. Possibly useful for some cases where you want to push the thing a little over the current limits, but in general messy and complicating.
I just wanted to see an example where all resources were used. I'm not saying that determinism does not matter, just that there are several ways to achieve precise timing. One way is counting the instructions...
Heater, there are many cases where determinism is not important. In fact, I would say most of the time it is not needed. One big challenge with P2 is to program it to efficiently hit the hub window every 8 cycles. When I run the Spin interpreter on the P2 simulator it seems like it's waiting for its hub access window half the time. It would be nice if P2 had a more sophisticated hub arbitration scheme. There could be a mode bit for each cog that would enforce determinism for that cog. Otherwise, it could use a round-robin scheme, or some other scheme that isn't difficult to implement. It's too late for this to happen in P2, but maybe in P3?
As I mentioned on another thread, a P2 simulator is under development as part of the GCC project. However, the focus of the GCC project has shifted toward the P1 since the P2 chip design is still in progress. I believe there are still a number of open issues in the P2 design, so it is impossible to complete a simulator at this time. It would also be a burden on the Parallax design team to have to provide a detailed spec at this time.
Development of the simulator is on hold right now, pending more detail from Parallax. It does implement some of the basic P2 instructions with a P1-like I/O to provide serial I/O. Most of the P2 I/O, counters, video and cordic instructions are not implemented. Also, the correct handling of the Z and C flags is unknown for the new P2 instructions.
I read that post... I just did not infer at the time that a usable simulator existed, nor that a new Spin interpreter was available, as I find the P1's totally unsuited to the P2 (due to the memory restrictions)...
Has anyone considered how magical the coginit PASM instruction is?
In one instruction we can:
1) Find a spare processor to run some code on.
2) Return an error if there is none left.
3) Load the required code into the core.
4) Give it a parameter, which in turn can point to a whole array of parameters
5) Start the core executing.
Never really thought about it before, but is there any other processor that has such an instruction?
The xmos devices don't have such a thing. Sure, they can start threads, but it's for sure not a single-instruction affair. Also, threads on xmos cannot arbitrarily start child threads on other cores. Chiefly because cores do not share RAM, I guess.
What this means is that if I write some complicated object that wants to run in a number of threads, then either it has to rely on being able to run all the threads in one core, in which case it can start them itself with the XC "par" construct, or I have to leave it up to the user of my object to start the threads on whichever cores he chooses, which can only be done from the application's "main" function. Hmmm... well, perhaps that's the only way to deal with that situation, as multiple threads probably means using lots of I/O, which the user's gadget may have distributed around different cores. Again, that limitation that xcores can only "see" their own pins, not the ones connected to the other cores.
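As far as I can tell it ends up looking something like this in XC (task names and port choices invented, so just a sketch): the object's threads either all go in one "par" on a single core, or the end user has to place them from main():

    #include <platform.h>
    #include <xs1.h>

    on stdcore[0]: out port p_tx  = XS1_PORT_1A;   // ports are tied to a core at declaration
    on stdcore[1]: in  port p_adc = XS1_PORT_8B;

    void uart_task(out port p, chanend c);         // invented example tasks
    void filter_task(in port p, chanend c);

    int main(void) {
        chan c;
        par {
            on stdcore[0]: uart_task(p_tx, c);     // placement can only be done here, and each
            on stdcore[1]: filter_task(p_adc, c);  // thread only sees ports on its own core
        }
        return 0;
    }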
The old Spin interpreter works OK on P2 except for four places in the code. I had to add a NOP after MOVI in two places because of the extra level of pipelining. I also had to mask off the upper 16 bits in a hub address where the old interpreter left non-zero bits. This works OK on the P1 because the hardware ignores the upper 16 bits of a hub address. However, that trick won't work on the P2 because it has a larger hub address space.
The final change was to handle the PAR register. The P2 simulator still memory maps the PAR and CNT registers, so I handled it that way. That will need to change in the interpreter when I remove this from the simulator. I used the more efficient version of the SQRT routine to free up room for the extra instructions used.
The old Spin interpreter can execute code within the first 64K of RAM. It can access data beyond the first 64K by using pointers. A big Spin version that can execute code beyond 64K would require more changes. It would require some code to be executed as LMM PASM or use an FCACHE.
Has anyone considered how magical the coginit PASM instruction is?
Never really thought about it before, but is there any other processor that has such an instruction?
I can't think of any specific examples, but the typical way such complex instructions have been implemented on other processors is to have a look-up table with a sequence of lower-level microcode instructions that the processor executes sequentially to perform the complex operation.
Well I guess somewhere inside modern multi-core x86 and the like there must be such instructions to get cores running. I have not looked at instruction sets for such chips since the 486 era.
If someone were going on holiday in a week or so and wanted to take some documents along to compare the P1/P2 and XMOS, what would you recommend? As they'll be read whilst sipping beer and overlooking boats bobbing around, they'll need to be printed, so not too many pages.
Note with good timer silicon, I can capture either edge, and resolve narrow pulses down to 1 CLK pulse widths.
I can also Divide a 100MHz clock, by N, and capture from that, again to 10ns resolutions.
So, rather than vague 'plenty of speed' claims, I'd like to see someone show me real numbers,
and software that manages the nasty cases, like what happens when an event arrives exactly when the 16 bit timer wraps.
Being right 99.99847% of the time might be OK for Microsoft, but it's not in engineering.
You seem to be under the impression that those claims are in fact vague. I've been developing on XS1 for a year now, and somehow I never got that impression.
XS1 does not lose concurrent events. It'd be useless if it did, and I dislike that you imply otherwise, or even imply that somehow this is a poorly explored corner of XS1's performance. Events are like S-R flip flops. They are Set by hardware, and Reset when an event handler is dispatched. Of course if your handling of events is too slow, you can lose them if they reoccur before you handle them, but that's a problem with any architecture. That's all there is to it.
There is an idiomatic way of processing lots of events on XS1, with a very low fixed overhead, that ensures no events will be lost if you do your homework. You simply dedicate one thread to waiting for events, and the event handlers shoot out a message over a buffered link to another thread. This is fully deterministic in the sense that you can mathematically prove that events will never be missed once you place bounds on event periods, the minimum length of the channel buffer, and the processing time in the thread that receives event notifications. The event handler thread can easily do event coalescence, too, to reduce the load on the other thread. The usual ways to coalesce events are to add up their number, combine their contents, or simply keep the most recent value. Then you use a flush timeout to report the most recent coalesced value to the processing thread. This is how many GUI toolkits deal with mouse events, for example.
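A bare-bones XC sketch of that idiom (everything here -- the names, the port, the flush period -- is invented; it's only meant to show the shape):

    #define FLUSH_PERIOD 100000                    // 1 ms at the 100 MHz reference clock

    // Thread 1: turns hardware events into (count, latest value) messages.
    void event_collector(in port p_in, streaming chanend to_worker) {
        timer tmr;
        unsigned deadline, value = 0, count = 0;
        tmr :> deadline;
        deadline += FLUSH_PERIOD;
        while (1) {
            select {
                case p_in when pinsneq(value) :> value:       // coalesce: count + keep latest
                    count++;
                    break;
                case tmr when timerafter(deadline) :> void:   // flush timeout
                    to_worker <: count;
                    to_worker <: value;
                    count = 0;
                    deadline += FLUSH_PERIOD;
                    break;
            }
        }
    }

    // Thread 2: the slower processing, decoupled from event latency by the buffered link.
    void worker(streaming chanend from_collector) {
        unsigned n, v;
        while (1) {
            from_collector :> n;
            from_collector :> v;
            // ... crunch on (n, v) here ...
        }
    }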
The major win for XS1 is, IMHO, that you can respond to multiple events from a single thread with a fixed latency that has jitter within 10ns p-p. This is not possible on a single COG in Propeller. Say you wait for a dozen events. In one COG, you can only have a loop that checks various event sources in order, and dispatches based on that. The latency in event response depends on where in the loop your code happened to be when the event arrived. Of course if you only have a few events to wait on, you spread them out to individual COGs. On XS1, you can wait for as many events as can be provided by the hardware, and the latency for a thread to react to the event is fixed in terms of thread cycles (but not in terms of core clock). Port timers let you issue a response that's timed based on the time the event came in, in terms of 100MHz reference clock, so you can have very short, fixed latencies, measured in multiples of 10s of nanoseconds. The jitter is well under +/-10ns. For simple event handlers, you can have a reaction in a few tens of nanoseconds, completely deterministic.
A lot of confusion related to determinism in XS1 is precisely because people cannot get rid of the timing loop approach used in architectures like SX48 and Prop I.
If someone were going on holiday in a week or so and wanted to take some documents along to compare the P1/P2 and XMOS, what would you recommend?
You'd be well advised to do a full mirror of the XMOS forums and read those along with all the documents posted in the documentation section of their website. You should also download and install the XDE, as their simulator is cycle-accurate: you can check the timing of I/O waveforms down to the state of the CPU pipelines. That's very powerful.
For Propeller I, all you really need is their manual and datasheet. That's where Prop's documentation shines, I admit, but Prop is a conceptually MUCH simpler architecture. That's not necessarily a weakness if you need to be up to speed quickly, but can be a weakness when you run into the limitations of Prop's architecture.
For PII -- there isn't much to go by really, besides a few forum posts from Parallax employees. For a lot of the things I used to do, Prop I would have been completely adequate if only it had a MAC on each cog. Even a 16x16 MAC would have saved our bacon. It would have delayed us, though, when time came to get realtime ethernet going, as we'd be waiting for PII for that.
I still believe that XS1 architecture could be adequately documented in a single, if long, book, without any need to consult dozens of documents and having to browse through hundreds of forum posts. They seriously suck in that regard, and I'm freely admitting this as their user.
I'm not sure I follow. Are you asking why determinism matters?
Plenty of examples of that: frequency synthesis, time measurement that has a software-sampling component, phase modulation...
Something like chroma modulation is relatively easy in a thread, but only if that thread always runs at a known phase velocity.
Modulate that, and you (or your customers) will see the consequences.
Or, one thread might NEED full speed to complete, and if that happens, you can only use half the chip as supplied, whilst I could use all threads in my improved alternative.
There is some confusion, it seems, as to what determinism means. Determinism means that you can place hard limits on when things will happen (alternatively: how long they will take). On some architectures (say x86 with Windows), the hard limit is "it may take infinite time" -- so while "hard", it's not very reasonable nor useful for some applications.
There is no such thing as a "software sampling" component. In the end, you are doing input and output. It is important that those happen at well-defined times. It does not matter to the external world whether the software routine takes precisely X cycles. It makes your job as a software programmer much easier if you can decouple the timing of your code from the timing of I/O, and that's precisely what XS1 does for you. If your software does no I/O at all, no one will care if it takes an infinite amount of time: how would you even know? You wouldn't.
What XS1 guarantees, through their tools, is that your code will execute in a certain maximum amount of time. When you create a functional block in code (like a virtual peripheral on SX48 or object on Propeller), you can (and should!) add timing constraints into a timing verification script. That way the end user can verify that their particular system setup (number of threads, core clock, etc) will run your software fast enough. That is how you guarantee determinism. If your code happens to require a certain number of MIPS, and the code ends up on a chip that has too slow of a thread clock (e.g. too many threads or too slow core clock), the timing verifier will bomb out and inform you about it. That's how you should approach the issue of determinism and time-criticality in a way that keeps it portable.
For any sort of software-calculated modulation, all you care about is that your calculation is done once per I/O cycle. Since the timing of the inputs and outputs is done in hardware, you know exactly when they'll happen. Then the timing verifier makes sure that your code will, in the worst case, fit between the input and the output. And that's the end of this discussion. It doesn't matter whether your modulator takes 100 thread cycles or 15 thread cycles, as long as those thread cycles fit between the I/O.
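In code, the pattern looks roughly like this (an XC sketch; the ports, the maths and the 500-tick response delay are all made up):

    #include <xs1.h>

    in  port p_sample = XS1_PORT_8A;      // invented ports
    out port p_mod    = XS1_PORT_8B;

    void modulate(void) {
        unsigned x, y, t;
        while (1) {
            p_sample :> x @ t;            // input, timestamped by the port counter in hardware
            y = (x * 3) >> 1;             // whatever the modulator maths is -- the timing
                                          // verifier only has to prove it fits in the window
            p_mod @ (t + 500) <: y;       // output exactly 500 port-clock ticks after the input
        }
    }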
If someone were going on holiday in a week or so and wanted to take some documents along to compare the P1/P2 and XMOS, what would you recommend? As they'll be read whilst sipping beer and overlooking boats bobbing around, they'll need to be printed, so not too many pages.
A NEW travel agent????
(Sorry, I couldn't resist!)