XMOS chips vs. P2

RossH · 2011-08-06 23:52

Batang wrote: »

Such an asinine response.

Wait - It gets worse! Now this thread appears to be about benchmarking an XMOS with a simulated Prop 2 - a chip whose design has not even been finalized yet!

How is this going to tell anyone anything?

Ross.

potatohead · 2011-08-07 00:40

+1 to that.

Leon · 2011-08-07 01:23

RossH wrote: »

Wait - It gets worse! Now this thread appears to be about benchmarking an XMOS with a simulated Prop 2 - a chip whose design has not even been finalized yet!

How is this going to tell anyone anything?

Ross.

Isn't that the sort of thing that Andre was after when he started this thread? I assumed that Parallax had a fully functional simulator for the P2.

A comparison with the P1 using the proposed benchmark will be interesting, anyway, and it should be possible to estimate how the P2 will perform.

Heater. · 2011-08-07 01:51

Yes, the very title of the thread tells you how hopeless it is going to be. Hence my "oops" comment at the start.

No problem that the P2 does not exist yet. We can run the P2 emulation on the Xcore and make the comparisons with that:)

On a serious note, I have no issues with a P1/XMOS comparison. Previously I have said that they are comparable due to the common elements of their fundamental design philosophy. However the resulting devices are vastly different of course.

If you prevent yourself from being blinded by clock speed numbers or emotional allegiances one way or the other, there are a lot of interesting comparisons of design details and choices to be made. I believe both P and X could learn from each other for future devices.

RossH · 2011-08-07 01:58

Leon wrote: »

Isn't that the sort of thing that Andre was after when he started this thread? I assumed that Parallax had a fully functional simulator for the P2.

A fully functional simulator for the P2? C'mon Leon - you've read the same threads as all the rest of us - now you're just being disingenuous.

Anyway, if you re-read the first post in this thread, you will see Andre' was after nothing of the kind. Andre' wanted to hear from anyone who had any actual experience with one of the XMOS. There do seem to be a few, and they seem to be generally negative about their experiences. Andre' hasn't posted again, so I guess he got his answer and has moved on.

I also think he knew how any thread mentioning the XMOS would end up - and I guess we all took the bait!

Ross.

Toby Seckshund · 2011-08-07 03:56

I was with Heater.'s initial response.

It looked like a handy pile of milk bottles, rags and petroleum spirit.

Alan.

jmg · 2011-08-07 03:59

Heater. wrote: »

Jmg,
Taking you up on the quadrature challenge, my first stab at an xcore version, written in XC not assembler, looks like it can sample, decode and count quadrature steps at 8.3 MHz.
That loop contains only 3 lines of XC.

There is a downside in that the xcore only allows pins to be used in groups of 1, 4, 8 or 16. That means when using a 4-pin group for the quadrature inputs we waste two pins.

One could run two decoders from those 4 pins but then there is a speed hit.

I'll post the code when I have more than this phone to chat on if anyone is interested.

I am interested (of course)

Where there is a natural step, it is also useful to note the speed/resource for 2 copies of a Quad counter.

Sometimes that can be less than 2x a single copy, and sometimes more, it depends on how the resources slice-up.
Sounds like your example has 'no additional pin cost for adding 2 ch', and some Quad Counters have an index pulse, which resets the counter. That could use a 'wasted' pin.

I've also seen some encoders with redundant outputs - so there you would poll 4 pins, and (I guess) choose the working pair ?

Ale · 2011-08-08 00:26

Yesterday I was writing a text-mode driver for Rayman's 4.3" display. It uses clocked and buffered ports (where I learned that the data is output on the falling edge of the clock! I RTFM'd), and thus the inner loop has 800 ns (8 pixels @ 10 MHz pixel clock). Of these 320 ticks, 164 are free. And I am using XC. To achieve this I had to crank up the optimization level and use unchecked arrays. I thought it wouldn't be possible without asm...
I hope the P2 brings this kind of clocked pin, both for input and output. It really simplifies software. OTOH >100 MIPS per cog would make many of the 2 or 3 cog drivers a one cog solution!

Even a not-yet-finished specification can be used for a software simulator... that way the interpreter for instance can be written.

Heater. · 2011-08-08 00:58

Jmg,
Strangely enough I had already started thinking about multiple quadrature counters. I might propose that for this little P/X comparison we agree to implement objects with 2 quad counters including 2 zero position pulse inputs. A bonus would be being able to configure the thing for 4 counters at compile time.
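For anyone following along, here is a rough sketch in plain C of the kind of counter being discussed: one quadrature channel decoded with a transition table, plus an index (zero position) input that resets the count. This is not the XC or PASM code from the thread. The names quad_counter, quad_init and quad_sample are made up for illustration, and the error counter is an extra that nobody above asked for.

```c
#include <stdint.h>

/* Transition table indexed by (previous_state << 2) | current_state,
   where a state is the 2-bit value (A << 1) | B.
   +1 / -1 are single steps in either direction; 0 means no change or an
   invalid double transition (both inputs changed at once). */
static const int8_t QUAD_STEP[16] = {
     0, +1, -1,  0,
    -1,  0,  0, +1,
    +1,  0,  0, -1,
     0, -1, +1,  0
};

/* Hypothetical counter object, one per encoder. */
typedef struct {
    uint8_t  prev;    /* last sampled A/B state (2 bits) */
    int32_t  count;   /* signed position in quadrature steps */
    uint32_t errors;  /* invalid transitions seen, i.e. sampling was too slow */
} quad_counter;

static void quad_init(quad_counter *q, uint8_t ab)
{
    q->prev = ab & 3u;
    q->count = 0;
    q->errors = 0;
}

/* Feed one sample: ab = (A << 1) | B, index != 0 resets the count. */
static void quad_sample(quad_counter *q, uint8_t ab, int index)
{
    ab &= 3u;
    if ((q->prev ^ ab) == 3u) {
        q->errors++;              /* both bits flipped: a step was missed */
    }
    q->count += QUAD_STEP[(q->prev << 2) | ab];
    q->prev = ab;
    if (index) {
        q->count = 0;             /* zero position pulse */
    }
}
```

Two counters would just be two quad_counter structs sampled in the same loop, which is where the speed/resource question for 2 copies comes in.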

Heater. · 2011-08-08 01:04

Ale,
Now, clocked and buffered ports are an example of something the Prop could learn from xcore. BUT I have a feeling that it is the clocking and buffering of pins on xcore that requires that the pins be configured into groups of 1, 4, 8, 16. This defeats the Prop's advantage of completely independent single pins usable from any cog.
Not saying it can't be done, but it probably takes more silicon to allow for both arrangements.

Ale · 2011-08-08 01:27

Propeller pins are grouped into 32 bits... masking out what we do not want may prove complicated for the buffering, though...

Heater. · 2011-08-08 01:38

Ah yes. But surely the Prop having pins grouped 32 at a time is a natural consequence of it being a 32 bit machine. On the prop I can wiggle any combination of Pins from 1 pin to 32 pins from any cog. Not so on xcore. Now allowing for say 8 bits to be clocked in/out from a signal on a 9th pin seems to imply to me that grouping pins in hardware becomes the easiest way to do it. Hence the restrictions imposed on pin usage on xcore. Would we really want those restrictions on the Props?

Heater. · 2011-08-08 01:40

Jmg,
Can we agree to run all those multiple quad counters in one cog on prop and using one thread on xcore?

Baggers · 2011-08-08 01:58

don't forget the thread on xcore needs to have its connected thread running also, as it makes it half speed!

Ale · 2011-08-08 02:08

Yes and no. You get 50 MIPS only with 8 threads, more with fewer. Don't forget that instructions like memory reads can take as little as 4 cycles, and multiplications too.

I want to copy a byte into the other 3 bytes of the long.. how do you do it in one instruction ? (maybe two)... with a fast multiply by 0x01010101 for instance (what I use)... is there another way ?

Another problem... I want to expand the bits in a byte to a group of 8 bytes... (not xmos specific but useful for display of text) I use a table...

 firstword = translate[font << 1];
 secondword = translate[(font << 1) + 1];   /* parentheses needed: + binds tighter than << */

The table takes of course 256*8 = 2kbytes... a bit "fat"... maybe a few instructions can solve the problem (I have some time left... like 16 instructions)... Of course a smaller 16 entry table can be used...

  firstword = translate[(font & 0x0f)];

translate[] = { 0x00000000, 0x000000FF, 0x0000FF00, 0x0000FFFF, 0x00FF0000, .. };

The addressing of longs does not need any pre-shifting of addresses, a difference from the byte addressing of HUB memory on the prop (it could be very useful...).

I am sure there is another way...
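One table-free possibility, sketched in plain C rather than XC: the byte copy is the multiply by 0x01010101 already mentioned, and the bit expansion can be done per nibble with one multiply to spread the bits, a mask, and a second multiply to fill each byte. The function names are made up for illustration, and the bit i -> byte i ordering is assumed from the 16-entry table above.

```c
#include <stdint.h>

/* Copy a byte into all four bytes of a 32-bit word. */
static uint32_t replicate_byte(uint8_t b)
{
    return (uint32_t)b * 0x01010101u;
}

/* Expand the low 4 bits of n into 4 bytes: bit i set -> byte i = 0xFF.
   0x00204081 = 1 + 2^7 + 2^14 + 2^21, so the multiply places bit i of the
   nibble at bit position 8*i; the four shifted copies never overlap, so
   there are no carries. */
static uint32_t expand_nibble(uint32_t n)
{
    uint32_t spread = ((n & 0x0Fu) * 0x00204081u) & 0x01010101u;
    return spread * 0xFFu;        /* 0x01 -> 0xFF in each selected byte */
}

/* Expand all 8 font bits into two words, low nibble first
   (the ordering is an assumption). */
static void expand_byte(uint8_t font, uint32_t *firstword, uint32_t *secondword)
{
    *firstword  = expand_nibble(font & 0x0Fu);
    *secondword = expand_nibble((uint32_t)font >> 4);
}
```

That is two multiplies and two ANDs per nibble, so whether it beats the 16-entry table depends on how cheap the multiply is on the target.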

Baggers · 2011-08-08 02:11

so yes, you'd have to have all other threads running, to make it a real comparison!

Ale · 2011-08-08 02:24

Yes, of course. But even clock-for-clock the xmos has faster instructions and a larger program memory (not always needed, judging from what we can squeeze into 496 longs! I am quite amazed by this, but the propeller assembler is really compact even at 32 bits per instruction).

Heater. · 2011-08-08 02:43

Baggers,
No, thread speed is 100 MIPS per thread for 1 to 4 threads; after that it goes down for each extra thread until you have 50 MIPS per thread for 8 threads.
However you now remind me that the speed I quoted for the quad counter should really be divided by two. After all if you were to use the code in a project I have no way to know how many threads you are using in there.

Baggers · 2011-08-08 03:19

Ale, you're right, it is amazing what has been squeezed into 496 longs, and more often a lot less!
Heater, yes, it does need to be 50 MIPS as you don't know what the rest of the project is doing, unlike on the Prop, where one cog's speed is not affected by another cog's app.

Heater. · 2011-08-08 03:41

Yep, I was amazed to find I could fit a whole Intel 8080 emulation into 496 longs (excluding the op code dispatch table in HUB).
I have occasionally argued with the xmos guys that they should not describe the xcore as "fully deterministic" precisely because of that modulation of thread speed with the number of threads. They don't seem to think about the unpredictability of mixing and matching objects from here and there as we do in Propeller land.

Ale · 2011-08-08 03:59

Maybe because they sort of recommend you do not launch things dynamically... and thus when statically declared... you know how many MIPS you will get.

jmg · 2011-08-08 05:10

Heater. wrote: »

Jmg,
Can we agree to run all those multiple quad counters in one cog on prop and using one thread on xcore?

Yes, as that is most likely how these would be coded, but it should include a means to communicate the results to other threads/cogs, and the results should be tabulated in Code Size and Speed Ceiling, for 1, 2, 3 etc. Counters, and a nice design would have a means to flag that the ceiling was reached.

Priority should be speed first and size second (as these are usually small anyway), and if hardware can be used to increase speed, then that should be included as an option. (Like those benchmarks that are Optimise off, and Optimise on.)
The idea is to illustrate what the SW can do and what SW + HW can do.

jmg · 2011-08-08 05:17

Ale wrote: »

Maybe because the sort of recommend you do not launch things dynamically... and thus when static declared... you know how many mips you will get.

I'd be happier if they had thought to allow users to control that elasticity. Thus some threads would always get time, and others can share what is left.
(rather than the OK to 4, then 4+, everyone slows, as now )
That way, critical threads are NOT impacted by any downstream changes.

I just did a coding exercise on the HoChips part, and there they lock in a 3 way slice, but a user choice to 66:33 or 50:50 over two, would allow more MIPS where it mattered.

Ale · 2011-08-08 05:48

Not much gets in the way of you launching things dynamically... even pins can be allocated, not an issue. The software seems a bit rigid in that field (assembler of course has no problems).

HoChips ? what is that (el goog shows not relevant stuff).

kuba · 2011-08-08 13:06

The major conceptual difference between XMOS and Propeller is that in XMOS, your code needs to be fast enough, and that's it. In Propeller, your code has to have exact timings. In XS1 XMOS architecture, the hardware does timings for you, and kicks your thread into action when the right time comes. This usually lets you forget about cycle counting, and I, for one, think it's high time someone came up with this. I've done enough cycle counting in my life -- on SX48 and on Z8 Encore!

The supposed "non-determinism" of the XMOS architecture is a somewhat imaginary problem. Imagine you've coded things for Propeller. Then you decide to have your clock go faster. You usually have to recode things, or at least tweak the code heavily. Nobody will insert NOPs for you if the clock goes faster. On XMOS, if your code runs on a slower device, it will run on any faster device, too, as long as you don't hardcode various clock divisors but slave them from a global clock speed.

A thread (or a set of them) coded for XS1 will usually be specified for a certain number of MIPS available to that thread. Say a software-defined functional block requires two threads, one with 80 MIPS minimum, another with 50 MIPS minimum. If the threads only communicate via channels (like they will, in 99% of cases), it usually would be up to you how you split them between cores (those can be on different chips, too). For example you can run one 400MHz core with 5 threads, and run the 80 MIPS thread there. Another core can be a 400MHz core running 8 threads, and the 50MIPS thread will be just fine on it. Perhaps the only other important factor is latency of the networking fabric: XS1-L devices have more latency than XS1-G devices. I have an application where the lower latency of XS1-G switches is called for.

A properly coded thread will not care at all about having more MIPS thrown at it, nor about having a faster network thrown at it, and the critical I/O timings that you intend to control will stay where you designed them to be.

Adding to the XMOS's determinism is their timing analysis tool (XTA) -- this lets you do static proofs that certain code will execute within a given time. This guarantees that your code will work as planned on the thread(s) you chose, and on any faster thread(s), too.

I do agree that XMOS documentation is fragmented, and some things are not clearly spelled out and that sets you up for initial frustrations. Whatever bad could be said about Zilog, their Encore! product manuals are comprehensive, professionally done and essentially fully document the hardware. Two documents are needed to get everything you'll need to use their parts: the product manual for the particular chip you're using, and the eZ8 architecture manual for the nitty-gritty on the CPU core. Their debugging interface (half-duplex serial line) is also very simple to use from your own code. Even Parallax's documentation for Propeller used not to be that good -- do note that Zilog released their product and documentation together, and the documentation hasn't substantially changed since. Here both Parallax and XMOS were initially lacking, and XMOS is still doing catch-up, while Parallax has got it all straightened out -- I'm happy with Prop's documentation as it stands now. Two documents (manual and datasheet) and you're set. Like with eZ8 -- good.

I'm waiting for someone to write a complete product-manual-book for XS1, so that I'd have everything in one volume, instead of half a dozen of them (plus assorted forum printouts). That's my main gripe with XMOS. I hate poor documentation. I hate it so much that I'd be more than willing to spend a couple hundred USD on a well-written technical book on XS1. It'd still be way cheaper than spending time having to search through 5+ different documents -- there's a good reason I'm not a Chinese scholar, but XMOS almost made me reconsider.

Propeller's IO silicon, especially in P2, has obviously way more tricks up its hat than "bog standard" digital I/O ports of XS1. Once P2 arrives, I won't be ashamed at all to do projects where P2 interfaces with the analog world, and XS1 does number crunching and communications. There is something to be said for having a single chip (less PHYs), software-defined 4-port Ethernet switch with a fixed port-to-port latency with on-the-fly forwarding (no store-and-forward). For anyone who thinks of buying a hundred or so IEEE-1588 supporting (grandmaster) OEM-style switches, two XMOS dev boards instead may be a better bargain if you intend to do some coding. It doesn't take all that many of those for R&D to be paid for in hardware savings (you're looking at ~$2k+ for one IEEE-1588 grandmaster switch).

Alas, while P2 I/O is supposed to provide more of serialization/deserialization, XS1 already provides that on every port in buffered mode. Of course the VIDEO mode in P1 is in similar vein, but it's even more limited than XS1. On XS1 you can access 1, 4, 8, 16 and 32 bit ports using 32 bit buffers. The contents of the long buffer are shifted out in bits, nibbles, bytes or words, as appropriate. You can assign other single-bit ports as strobes for that. This covers the needs when you talk to Ethernet and USB PHYs, and also for various parallel bus options. It's very easy and cheap (cycle-wise), for example, to make an XS1 pretend to a peripheral to be any of the popular bus-oriented CPUs, like Z80, i80xx, 68HC11, etc. It also helps plenty when trying to squeeze multiple UARTs into one thread. Since both timing and deserialization are done in hardware, it's fairly easy to have a couple of 1 Mbaud UART receivers all running from one thread -- and that's all in XC, without need for assembly. And I'm not talking about the simplest proof-of-concept implementation either: functionality of a real UART chip -- triple sampling, false start bit detection, etc.
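To make the buffered-port idea concrete, here is a small software model in plain C of what the hardware does with one 32 bit buffer: it is cut into port-width chunks that go out one per port clock. This is only an illustration, not XC and not any XMOS API. The function name shift_out is made up, and the least-significant-chunk-first order is an assumption of the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of a buffered output port: split a 32-bit buffer into chunks of
   port_width bits, least significant chunk first (assumed ordering).
   out[] receives one value per port clock. Returns the number of chunks,
   or 0 if port_width is not one of 1, 4, 8, 16, 32. */
static size_t shift_out(uint32_t buffer, unsigned port_width, uint32_t out[32])
{
    if (port_width != 1 && port_width != 4 && port_width != 8 &&
        port_width != 16 && port_width != 32) {
        return 0;
    }
    size_t chunks = 32u / port_width;
    uint32_t mask = (port_width == 32) ? 0xFFFFFFFFu
                                       : ((1u << port_width) - 1u);
    for (size_t i = 0; i < chunks; i++) {
        out[i] = buffer & mask;
        if (port_width < 32) {
            buffer >>= port_width;   /* a shift by 32 would be undefined in C */
        }
    }
    return chunks;
}
```

So one write of a long keeps a 1 bit port busy for 32 port clocks, or an 8 bit port for 4, which is where the cycle savings for UARTs and parallel buses come from.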

There was a rant here that XS1 counters/timers are "16 bit only", that there's no capture, etc. First of all, you don't need anything more. The software execution has plenty of speed to do housekeeping to simulate longer timer registers. The reason for this was architectural: since you often wait on timers, you want to have short instructions that set it up, and for that you have 16 bits for the instruction, and 16 bits for timer value. The important part, though, is that every port can have a timer capture assigned for timestamping. If you read something from a port that's not an immediate read, there is a timestamp telling you when a condition was satisfied that caused the port's value to be captured and made available to the software (either an external clock pulse, or a bit pattern match). There are 10 timers per core, and 5 clock blocks on top of the 100MHz reference clock. The clock block is essentially a clock routing resource -- think of it as equivalent to a clock signal line. There can be multiple timers running off one clock block, of course. This is a fairly flexible architecture. Let's not forget: in many MCUs a timer can either use a common system clock or take it from a fixed external pin, and that's it. It's good enough when timers can be internally daisy-chained without having to put external tracks on the board...
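The "housekeeping to simulate longer timer registers" can be as small as this. It is a generic sketch in plain C, not XS1-specific code: ext_timer and its functions are hypothetical names, and it assumes the 16 bit value is polled at least once per 65536 ticks so that no wrap is missed.

```c
#include <stdint.h>

/* Software extension of a free-running 16-bit counter to 32 bits. */
typedef struct {
    uint16_t last;   /* previous raw 16-bit reading */
    uint32_t high;   /* accumulated wraps, already shifted into bits 16..31 */
} ext_timer;

static void ext_timer_init(ext_timer *t, uint16_t now)
{
    t->last = now;
    t->high = 0;
}

/* Call with the current raw 16-bit value; returns the extended 32-bit time.
   A reading lower than the previous one means the counter wrapped once. */
static uint32_t ext_timer_update(ext_timer *t, uint16_t now)
{
    if (now < t->last) {
        t->high += 0x10000u;
    }
    t->last = now;
    return t->high | now;
}
```

The cost is a compare and an occasional add per reading, which is the sort of overhead being waved away above.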

[It's not uncommon to see a professor of Chinese have 10+ dictionaries open when working on a text -- I kid you not. I wouldn't believe it had I not seen it first hand, and had an explanation given why it is so. But I'll leave that for another day, as it'd be a long rant.]

jazzed · 2011-08-08 13:22

kuba wrote: »

...

I really appreciate the amount of detail you put in your post.

The Propeller waitcnt instruction (and others) offer very good alternatives to stuffing NOPs. In most cases, it is the skill of the programmer that determines the outcome with a device. Having hardware that helps do the job is appreciated too.
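The usual pattern with waitcnt is to keep a running target and add the period to it each time round, so the loop rate does not depend on how long the loop body takes (as long as it finishes in time). Here is that deadline arithmetic modelled in plain C. The function names are made up, and a polled comparison stands in for the instruction, which itself blocks in hardware until the system counter matches the target.

```c
#include <stdint.h>

/* Wrap-safe test: has the free-running 32-bit counter passed the deadline?
   Works across counter rollover as long as deadlines are less than 2^31
   ticks in the future. */
static int deadline_reached(uint32_t now, uint32_t deadline)
{
    return (uint32_t)(now - deadline) < 0x80000000u;
}

/* Advance the deadline by one period; unsigned arithmetic wraps naturally. */
static uint32_t next_deadline(uint32_t deadline, uint32_t period)
{
    return deadline + period;
}
```

If the clock frequency changes, only the period constant has to be rescaled, not the instruction counts in the loop.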

kuba · 2011-08-08 13:23

There was also some confusion about multithreading. Thread switching happens on every core clock, assuming that there is more than one thread ready to run. So a 500MHz part with a full pipeline will switch threads at instruction execution rate -- every 2ns. The division instructions do not mess things up, apparently, even if they take long to execute (up to 32 thread cycles). The divider's state registers are per-thread. The only thing that's really shared between threads is a 4-deep pipeline, and that's why adding threads up to count of 4 does not prolong a thread cycle of 8ns. There's no penalty of any sort for having either one or four threads. For 5-8 threads, the thread cycle prolongs proportionally (8ns @ 4 threads, 16ns @ 8 threads). Every thread has a full set of its own registers, including hidden state registers only visible to the ALU. Those are switched into the instruction executor with no delay or overhead.
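The arithmetic above fits in a few lines. This is a sketch in plain C under the model just described (a 4-deep pipeline, one instruction issued per core clock, round-robin between active threads); the function names are made up for illustration.

```c
#include <stdint.h>

#define PIPELINE_DEPTH 4u

/* Core clocks between two successive instructions of the same thread. */
static uint32_t thread_cycle_clocks(uint32_t active_threads)
{
    return active_threads < PIPELINE_DEPTH ? PIPELINE_DEPTH : active_threads;
}

/* Instruction rate available to each thread, in MIPS (rounded down). */
static uint32_t mips_per_thread(uint32_t core_mhz, uint32_t active_threads)
{
    return core_mhz / thread_cycle_clocks(active_threads);
}

/* Thread cycle time in picoseconds. */
static uint32_t thread_cycle_ps(uint32_t core_mhz, uint32_t active_threads)
{
    return thread_cycle_clocks(active_threads) * 1000000u / core_mhz;
}
```

For a 400 MHz core this gives 100 MIPS per thread for 1 to 4 threads, 80 for 5, and 50 for 8; for a 500 MHz core the thread cycle is 8 ns at 4 threads and 16 ns at 8, matching the figures quoted in this thread.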

kuba · 2011-08-08 13:34

The memory is a hot issue in the Propeller architecture, and I personally find it easier to code when you're free to split memory between threads at your leisure. You may have a thread that only uses a few dozen words, the rest of them don't get wasted and are available to other threads. The XS1 has perhaps a couple extra % of RAM size per core of hidden memory in form of channel and network switch buffers. If all you need is a short FIFO (dozens of words), you can use a streaming channel for it without using any RAM at all. I have to do some tests to characterize exactly how deep those FIFOs are: the XMOS documentation is characteristically mute about such details (or they hid it where you won't find it), and yes, I admit it's an unnecessary pain.

A case in point about the dizzy state of their documentation: you'll find the functional specification for the behavior of I/O ports in the language specification, of all places. That's why I stand by my offer of $250 or so for a copy of a good book that fully documents the damn thing -- in the sense that if you take it to an unplugged vacation, you can come back home ready to roll, without having more questions than you began with. I'm sure there'd be enough customers for it that someone could get perhaps $100k (USD) to write it and have it self-published. Shouldn't take longer than a year to pull off. I'd do it myself if I didn't love my job as it is.

kuba · 2011-08-08 13:51

There is an aspect of hardware-enforced "don't shoot yourself in the foot" that's perhaps hard to come by on most other architectures. Propeller's unique design forces communication between cogs via the hub and that's it. You have shared semaphores, but apart from it you're free to devise any length of rope to hang yourself with. Say, a pin read-modify-write race between cogs.

On XS1, the threads similarly have sequential access to RAM. Access to some resources is exclusive to a thread, and that's enforced by the hardware that will raise an exception -- so you can't cheat even in assembly. Some resources can be shared.

Highlights of hardware-generated exceptions are below. All of those help at catching software bugs and some are fairly unique to XS1 and make it a more robust platform.

- trying to send interconnect (link) control tokens in software (analogous to preventing spoofing ethernet packets)
- illegal PC value (on many MCUs, you can execute in lala land as long as the default value in unmapped memory happens to be a valid opcode)
- illegal resource (use of invalid resource identifier/handle, akin to trying to read from a closed file handle), this is methinks unique to XS1 and no other architecture has anything like that
- access to undefined registers (say processor status) - instead of returning some "default" value, it's correctly caught
- resource dependency, when one thread tries to access the same resource within 4 clock cycles of another thread, this can catch race conditions.

Sal Ammoniac · 2011-08-08 14:02

Kuba,

Perhaps you should write the consolidated book on the XC-1 that you mention above. You seem to be well-versed in the architecture.
