Is it time to re-examine the P2 requirements ???

Comments

  • Electrodude Posts: 1,653
    edited 2015-02-04 12:10
    The indeterminism of the rotating hub selector can often be compensated for by calculating the address and starting the hub streamer in advance, doing other stuff for a few instructions, and coming back to find your data there waiting for you (see the sketch at the end of this post).

    EDIT:
    The things that need hub data the fastest (and can't wait for my abovementioned trick), like SD, video, and probably USB, make most of their timing-critical hub accesses sequentially.
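
    A minimal sketch of that pattern in C, purely for illustration: start_hub_read(), hub_read_result(), and do_other_useful_work() are hypothetical stand-ins, not a real Propeller API; only the issue-early, consume-late shape matters here.

    #include <stdint.h>

    /* Hypothetical non-blocking primitives standing in for whatever the real
       streamer/FIFO interface looks like; illustrative only. */
    extern void     start_hub_read(uint32_t hub_addr);  /* kick off the hub access   */
    extern uint32_t hub_read_result(void);               /* collect the data later    */
    extern void     do_other_useful_work(void);          /* unrelated work to overlap */

    uint32_t overlapped_fetch(uint32_t hub_addr)
    {
        start_hub_read(hub_addr);   /* issue the access as early as possible        */
        do_other_useful_work();     /* a few instructions of work hide the hub wait */
        return hub_read_result();   /* by now the data should already be waiting    */
    }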
  • Seairth Posts: 2,474
    edited 2015-02-04 12:11
    Heater. wrote: »
    I kind of disagree with Seairth when he says the "Propeller 2 ... is also less different." Seems to me that with 16 COGS to do what I describe above it's twice as different as the P1 was:)

    I meant "less different than ARM, as compared to the difference between P1 and ARM." Or did you already know what I meant, and was just twisting it a bit? :P
  • potatohead Posts: 10,261
    edited 2015-02-04 12:17
    Unless he means other devices now do more things the way a P2 is likely to do them.

    Seems to me deterministic means being able to understand what is going to happen.

    That is true of the scenarios Heater put out there.

    True for both devices too, but they differ in the approach and difficulty.

    Another word is needed to make deterministic more meaningful.
  • evanh Posts: 15,854
    edited 2015-02-04 12:21
    jmg wrote: »
    I think they mean "Relative to Windows" ;)

    As humorous as that seems, JMG is actually correct. Or at least determinism there means there are no hidden emulations that hog the CPU, as well as having a decent kernel managing the job of task switching.

    ARM is not saying their chips are good at bit-bashing, which is what we are trying to deal with on the Prop. There are plenty of dedicated peripherals on the ARMs to do all that work.

    Dave Hein might have a point, the "egg beater" could be a bit difficult to predict. Time for that FPGA image ...
  • Heater. Posts: 21,230
    edited 2015-02-04 13:22
    @Seairth,
    I meant "less different than ARM, as compared to the difference between P1 and ARM." Or did you already know what I meant, and was just twisting it a bit?
    By "less different" I guess you mean "more the same". Well yeah, with more memory and direct HUB execution that is more like an ARM MCU perhaps.

    But with 16 cores tightly coupled to the I/O that is less like an ARM.

    "more than", "less than" I'm confused. Still an ARM and a Prop 2 are very different beasts. (Well, ignoring the not insignificant difference that one exists and the other does not :) )

    @potatohead.
    Seems to me deterministic means being able to understand what is going to happen. ... Another word is needed to make deterministic more meaningful.
    Yes and yes. It's bugged me for ages that there are two or more different meanings of "deterministic" in use during these discussions and it makes things confusing.

    One meaning is the traditional engineering one of "I want this task started at this time and completed at this other time. Plus or minus some tolerance. I don't want any other random interrupts or task switches to ever upset that requirement"

    A subtle but important variation of that is "I want to run my code and perhaps a bunch of other modules/libraries/objects and be totally sure that they don't have any weird and surprising timing interactions. Oh and I don't want to have to study or adapt any of those other modules to be sure this is true"

    It's that second meaning of determinism that makes reuse of objects, mixing and matching stuff from here and there, in Spin so easy.

    I don't know of any other processors apart from the XMOS that make this determinism possible.
  • potatohead Posts: 10,261
    edited 2015-02-04 14:16
    Precisely.

    BTW, that also speaks to the no OS potential Props bring to the table.

    Anyway, another word is needed, maybe a couple to properly position and differentiate the latter use of deterministic.

    Mini rant: I just migrated from a Droid 4 with keyboard to a galaxy Note 4.

    The Note is actually a great little computer. It has a multi core ARM, 3GB RAM, lots of sensors, an insane 2560x14xx pixel display, and a strong GPU, totally capable of slinging pixels around.

    A little computer this capable would serve quite nicely for a lot of developing work.

    When I connect it to an external display and Bluetooth keyboard, mouse, etc... it rocks.

    But the touch input, good and smart as it is, does not hold a candle to a tactile one... if you have noted typos, that is why. Arrgh!

    I'm rapidly improving, and I think peak skill will make this whole affair OK rather than marginal. Work in progress.

    The voice input is seriously good, and I use it more frequently than I thought I would.

    ...now back to your regular programming.

    @Evanh - value is a slippery thing. I'll boil down what I understand and put it here shortly.
  • User Name Posts: 1,451
    edited 2015-02-04 18:40
    One simple test for determinism could be the WS2812 serial protocol. How easily, precisely, and naturally can you get your processor/language/ISR combination to meet or beat spec? Extra points awarded if you don't have to insert magic NOPs or temporarily disable everything else. ;)
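
    For reference, a rough sketch of the kind of bit-banged loop involved, assuming approximate WS2812 numbers (a '1' roughly 800 ns high / 450 ns low, a '0' roughly 400 ns high / 850 ns low); pin_high(), pin_low(), and delay_ns() are hypothetical helpers, and a real driver has to hit those windows with cycle-exact code or hardware assistance rather than function calls.

    #include <stdint.h>

    /* Hypothetical helpers; illustrative only. */
    extern void pin_high(void);
    extern void pin_low(void);
    extern void delay_ns(uint32_t ns);

    /* Shift out one 24-bit GRB value, MSB first, using approximate WS2812 timing. */
    void ws2812_send(uint32_t grb)
    {
        for (int i = 23; i >= 0; i--) {
            if (grb & (1u << i)) {
                pin_high();  delay_ns(800);   /* '1' bit: long high, short low  */
                pin_low();   delay_ns(450);
            } else {
                pin_high();  delay_ns(400);   /* '0' bit: short high, long low  */
                pin_low();   delay_ns(850);
            }
        }
    }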
  • kwinn Posts: 8,697
    edited 2015-02-04 19:42
    My understanding of deterministic has always been that if it takes x time to perform a function that function should always take x time regardless of what else is happening. By that definition even code running in a cog is not always deterministic, but it is as close to that ideal as anything I have seen to date. If you leave out the few instructions that have variable execution times it is 100% deterministic.

    In the real world I would consider functions that require instructions with variable execution times to be deterministic if they always complete within the maximum possible time required for the fixed- and variable-timing instructions of that function.
  • evanh Posts: 15,854
    edited 2015-02-04 20:54
    kwinn wrote: »
    My understanding of deterministic has always been that if it takes x time to perform a function that function should always take x time regardless of what else is happening.

    A lot of promo talk is about kernel level determinism. What your description equates to at the kernel level is just that the tasks get serviced regularly.

    It's a much coarser granularity than instruction level determinism. After all, a typical multilevel cache system on modern general purpose CPUs these days makes any attempt at instruction determinism pointless. Again, the philosophy is to use dedicated hardware for the timing-critical parts. Software just manages the DMA channels and the like.
  • Brian Fairchild Posts: 549
    edited 2015-02-04 23:10
    evanh wrote: »
    The term "value", I've got qualms about it's use/meaning, have you got a formal definition?

    Price = the amount of money a vendor thinks a product is worth.
    Value = the amount of money a buyer thinks a product is worth.
  • msrobots Posts: 3,709
    edited 2015-02-04 23:43
    Price = the amount of money a vendor thinks a product is worth.
    Value = the amount of money a buyer thinks a product is worth.

    nice.

    Mike
  • David Betz Posts: 14,516
    edited 2015-02-05 04:21
    evanh wrote: »
    A lot of promo talk is about kernel level determinism. What your description equates to at the kernel level is just that the tasks get serviced regularly.

    It's a much coarser granularity than instruction level determinism. After all, a typical multilevel cache system on modern general purpose CPUs these days makes any attempt at instruction determinism pointless. Again, the philosophy is to use dedicated hardware for the timing-critical parts. Software just manages the DMA channels and the like.
    Yes, but isn't the "tightly coupled memory" in the ARM processors supposed to allow instruction level determinism?

    Edit: Well, even if TCM does allow for instruction level determinism, if you only have a single core processor and it is always executing from TCM then I guess you aren't making much use of most of the capability of the chip. That's where the Propeller has an advantage. It has 8 processors so you don't mind so much if some are dedicated to processes that require that level of determinism. We need an 8 core ARM! :-)
  • kwinn Posts: 8,697
    edited 2015-02-05 07:21
    evanh wrote: »
    A lot of promo talk is about kernel level determinism. What your description equates to at the kernel level is just that the tasks get serviced regularly.

    It's a much coarser granularity than instruction level determinism. After all, a typical multilevel cache system on modern general purpose CPUs these days makes any attempt at instruction determinism pointless. Again, the philosophy is to use dedicated hardware for the timing-critical parts. Software just manages the DMA channels and the like.

    I meant deterministic at the instruction/hardware level, not the kernel. A simple instruction that is always executed in x clock cycles is deterministic, as is a hardware function such as DMA that always transfers data at a specific rate for a specific peripheral. That kind of absolute determinism is extremely difficult to achieve when systems become more complex. A purist might object, but for me another level of determinism would be a function that is performed in a period of time that never exceeds the maximum requirement but can be done in less time.
  • Heater. Posts: 21,230
    edited 2015-02-05 07:59
    @kwinn,
    My understanding of deterministic has always been that if it takes x time to perform a function that function should always take x time regardless of what else is happening.
    Your understanding is correct.

    I would add that often we want things to happen in response to some external event, or perhaps at some precise time. So the latency between the event, or that moment in time, and the start of the task is also important.

    @evanh

    kwinn is correct, he said nothing of kernels.

    There is another misleading thing about "deterministic". The little matter of scale.

    Operating system developers may be very proud that they can start and complete a job with timing measured to an accuracy of 1 or 10ms.

    Embedded systems designers may well be wanting micro-second accuracy.
  • Dave Hein Posts: 6,347
    edited 2015-02-05 09:18
    There are very few applications on the Prop where cycle-level determinism is required. In cases where cycle-level determinism is really required, it is provided by hardware, such as for display drivers and other things that are driven by counters. In almost all cases, cycle-level determinism is not required at the software level. There is usually enough hardware buffering to allow for a certain amount of latency.
  • Heater. Posts: 21,230
    edited 2015-02-05 09:35
    High level "application" code no.

    What about drivers? FDS, video, those newfangled LEDs with serial interfaces?

    The Prop has no peripheral hardware so cycle timing and multiple cores are essential.
  • Dave Hein Posts: 6,347
    edited 2015-02-05 11:37
    Video must be cycle-accurate, but that's done with the hardware on the Prop and not software. The software just uses the WAITVID instruction to sync up to the hardware.

    Serial has very loose requirements, and doesn't need high accuracy. With a 10-bit async frame, each bit can jitter by +/- 2.5% if both sender and receiver have loose timing. Of course, above 200 kbps the time for 2.5% is close to a single instruction time on the Prop. Because the Prop's instruction rate is relatively slow, it does require determinism. A processor with a higher instruction rate wouldn't have to be deterministic.
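
    To put rough numbers on that (my arithmetic, assuming an 80 MHz P1 with 4-clock instructions):

    #include <stdio.h>

    int main(void)
    {
        double bit_ns    = 1e9 / 200e3;       /* bit time at 200 kbps: 5000 ns       */
        double budget_ns = bit_ns * 0.025;    /* 2.5% jitter budget: 125 ns          */
        double instr_ns  = 4.0 * 1e9 / 80e6;  /* one P1 instruction at 80 MHz: 50 ns */

        printf("budget = %.0f ns, about %.1f instruction times\n",
               budget_ns, budget_ns / instr_ns);  /* roughly 2.5 instructions */
        return 0;
    }

    At 200 kbps the budget works out to a couple of instruction times; as the bit rate rises it shrinks toward a single instruction time.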
  • evanh Posts: 15,854
    edited 2015-02-05 12:54
    Heater. wrote: »
    kwinn is correct, he said nothing of kernels.

    There is another misleading thing about "deterministic". The little matter of scale.

    Operating system developers may be very proud that they can start and complete a job with timing measured to an accuracy of 1 or 10ms.

    Embedded systems designers may well be wanting micro-second accuracy.

    Yes, the subject started with the quote from ARM about the level of determinism its high-end products are useful for. I was trying to be clear about the differences and how what was a single meaning can be reinterpreted without changing a word.
  • Heater. Posts: 21,230
    edited 2015-02-05 13:20
    evanh,

    Well, yep, I'm still trying to sort the ARM marketing bumf about highly deterministic and deeply embedded into reality and hype.
  • potatohead Posts: 10,261
    edited 2015-02-08 13:42
    Re: value

    http://forums.parallax.com/showthread.php/160030-Value

    Seems better to put that commentary on its own thread.
  • abecedarian Posts: 312
    edited 2015-02-08 15:28
    Heater. wrote: »
    evanh,

    Well, yep, I'm still trying to sort the ARM marketing bumf about highly deterministic and deeply embedded into reality and hype.
    What is your definition of those terms: "deterministic" and "embedded"?
  • Heater. Posts: 21,230
    edited 2015-02-08 17:28
    abecedarian,
    What is your definition of those terms: "deterministic" and "embedded"?
    Deterministic has taken on a few meanings for me.

    1) The computed result is always the same for the same input.

    This is logical determinism. This might sound like a no brainer, dead obvious thing that all computers should be expected to do. It's surprising though how many complex systems have so many side effects, internal state and timing race conditions that this becomes impossible to ensure.

    2) The computed result is always produced within some time span. A deadline. Or perhaps delivered at a particular time, plus or minus some tolerance.

    This is timing determinism, real-time processing. Of course the time scales can vary wildly; those tolerances might be specified in nanoseconds or days. A program bit-banging on some output might have its timing requirements specified in microseconds. Tomorrow's weather forecast had better be ready before tomorrow.

    3) Changing or adding other software functionality to a system that already meets 1) and 2) above should not break that original system's functionality.

    This is a bit more subtle but very important when it comes to software maintenance. It's a particularly striking feature that the Propeller has and many other MCUs do not.

    Basically it's the ability to add features to, or change features around, an existing program and know that those new features cannot break the already working system. It's the ability to mix and match objects, libraries, whatever unit of code, and know that they will work together.

    This is exactly what we get when we add Spin objects from OBEX or elsewhere into our projects. We don't have to worry that they won't play together. Why? Because they do their important time critical work in a COG of their own isolated from each other.

    Most MCUs, with their single processor and their use of interrupts for critical timing, do not offer this.
  • abecedarian Posts: 312
    edited 2015-02-09 00:26
    Heater. wrote: »
    abecedarian,

    Deterministic has taken on a few meanings for me.
    Thanks. I have something to think about now. :)
  • Dave Hein Posts: 6,347
    edited 2015-02-09 07:34
    Heater. wrote: »
    This is exactly what we get when we add Spin objects from OBEX or elsewhere into our projects. We don't have to worry that they won't play together. Why? Because they do their important time critical work in a COG of their own isolated from each other.
    The disadvantage to this is that you can never get more than 100% of a cog's compute even if you're using less than 100% in the other 7 cogs. Let's say that the other 7 cogs are only using 80% of their cycles. So there are 7*20% = 140% of cog cycles that are unused. Unfortunately, you can only run the remaining cog at 100%, and not 100%+140%.

    In a single-core processor that runs at 160 MIPS you could use the 240% * 20 MIPS for the work being done on the "8th" cog. This would result in 48 MIPS for that thread. Of course, on a single processor you would need interrupts, so that will reduce its effectiveness slightly. A nicely tuned RTOS should have less than 1% overhead.

    The problem with building a single core that runs at 160 MIPS is that the logic must run 8 times faster, and there will most likely be more pipelining. However, that's exactly what we are seeing with the P2. So in theory, a single P2 COG with interrupts could run as fast as the P1 with all of its cogs fully loaded. But then you have this weird idea that the processor must be deterministic, so that it always produces the same cycle-accurate timing every time it is run. That's just not needed except in obscure applications that bit-bang video or other real-time signals. With just a small amount of hardware buffering you can eliminate tight software timing constraints.

    Of course, now the question is how can you build a single processor equivalent of a 16-COG P2 running at 200 MHz? That is a real challenge, as we can see from the history of the P2 development. At that point a multicore approach makes a lot of sense.
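
    The arithmetic above, spelled out as a sketch (assuming 8 cogs at 20 MIPS each):

    #include <stdio.h>

    int main(void)
    {
        double per_cog_mips = 20.0;                  /* assumed P1 cog rate           */
        double spare_cogs   = 7 * (1.0 - 0.80);      /* 7 cogs at 80% load -> 1.4     */
        double single_core  = (1.0 + spare_cogs) * per_cog_mips;  /* 240% of 20 MIPS  */

        printf("spare capacity: %.0f%%, single-core equivalent: %.0f MIPS\n",
               spare_cogs * 100.0, single_core);     /* 140%, 48 MIPS */
        return 0;
    }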
  • Heater. Posts: 21,230
    edited 2015-02-09 10:10
    You are right Dave.

    I laid out my requirements for "determinism". How one actually achieves that is another matter. Could be many cores, could be a super fast single CPU.

    Traditionally, if you optimize use of your hardware (CPU and memory), arranging for it to be shared among tasks and such, you end up with a non-deterministic system. Think of most time-sharing operating systems. Overall throughput is optimized at the expense of determinism.

    It's a trade off.

    I don't know how "obscure" the applications are for deterministic systems. Clearly guys use FPGA's to achieve exactly that when a regular CPU cannot deliver. At a lower speed scale they might use XMOS devices. At a lower scale we have the Prop.

    You are right to say that deterministic cycle timing is not required. If you have the speed to meet all the deadlines you are good to go. But let's think about that for a moment. What if you had an infinitely fast processor? What if all computations took zero time? Then you would need some other timing hardware and buffering to clock data in and out of the machine at the required times.

    This is the XMOS approach. Yes they have cycle accurate instructions. Yes they have multiple cores. Yes they have deterministic interleaved instruction threads in a single core. BUT to be really sure they also have external buffers and clocked I/O to get the external timing exactly right.

    Personally I think instruction cycle counting is a dumb way to get timing determinism; as long as your code can be shown to be fast enough, the actual timing should be down to those buffers and clocked I/O external to the CPU.

    Still, we have the Prop and Spin objects. We can mix and match the software "black boxes" acquired from wherever and run them in their own isolated "containers", the COGS, and know that they won't break anything or be broken by being part of our system.

    I'm hard put to think of any other system where that is true. It's not even so easy in the XMOS world.
  • potatohead Posts: 10,261
    edited 2015-02-09 13:21
    That is the feature I enjoy the most. Yes, it costs compute potential, but then everything costs something.

    On the more recent P2 images, having the monitor there made for some interesting experiments. I documented some of those in the monitor document, but the best one was having a program running and driving a TV display. Provided the program is written nicely, it was entirely possible to just hop into the monitor, kill off the display COG, then fire it up again with a different display, say VGA. Or just add another display COG and have them both display at the same time.

    I would also spend considerable time with the chip emulation running, just uploading things, running them, stopping them, etc... without resets. It reminded me of:

    A: Methods I would use on older 8 bit machines, developing in memory, etc...

    B: Methods I would use on a multi-user type system with multi-process type programs communicating through standard and shared channels. CORBA (I know horrible, and I barely understood it), SOCKETS, etc...

    For those things to work, some programming guidelines were needed. Memory addresses as mailboxes, refreshing protocols and such, but it wasn't all that much.

    C: How Forth works.

    That is all possible on a P1 today, with Forth showing it all off nicely, but the truth is a similar setup with a monitor running would do all the same things. In fact, I've done it with the TV / VGA displays in the past.

    To me the COG isolation is a primary, important, and defining feature of the Propeller. We can and should and arguably need to provide HUBEXE, but that's really to maximize the chip for more use cases, not really a replacement for the COG isolation as it exists in the designs now.
  • Heater. Posts: 21,230
    edited 2015-02-09 14:20
    potatohead,

    Did you say "CORBA"?

    CORBA is amazing, a possibly simple idea that was taken as a standard. And then the standard was built up by a whole bunch of vested interests. The end result was an amazingly complicated mess that was impossible to understand, hard to implement, and impossible to make interoperable. And if you ever achieved that it would be slow as hell and very hard to change with changing requirements.

    Then we had XML. Again a possibly very simple idea that was taken as a standard. And then the standard was built up by a whole bunch of vested interests. The end result was an amazingly complicated mess that was impossible to understand, hard to implement and impossible to make interoperable. And if you ever achieved that it would be slow as hell and very hard to change with changing requirements.

    I find it amazing that the web world, whose business is communicating between computers, ended up rejecting these huge and complex standards of CORBA and XML. No, they just get the job done in the simplest way possible with JSON. A one-man standard. Thank you Douglas Crockford!
  • potatohead Posts: 10,261
    edited 2015-02-09 15:29
    Yes, sadly I did. Not impossible to make interoperable, just really, really, painful.

    A CAD system I worked with used CORBA as its standard interprocess communication. CORBA is pretty neat! I could fire off an ORB on any OS, and have it serve up the comms to the program running on any other OS, even just serve it up for a whole host of things running on a pile of OSes. Spiffy!

    Just once, I had a project to write some program to process data into a solid model. God what a thick mess! I made it work, but I really don't want to talk about it much, other than that program ended up at a children's hospital helping fix broken kids. Worth it. My skill in geometry manipulation got coupled with some hacking on a side job to do somebody a solid. That's it.

    It was slow as hell, and changes were difficult, locked to versions of the CAD, CORBA implementations and their versions, and some OS dependencies. Really painful. The "Introduction to CORBA" book was something like 800 freaking pages! Never again. Ever.

    I ended up setting that book aside, got some code working "monkey see, monkey do" style, set up tests, authored the part that actually did something, dropped it in there, and when it worked and passed the tests, called it all done. Not a good experience.

    I have very little experience with JSON, but I can tell you just from a little poking around it's lean as all get out. I can also tell you I very seriously wish it was the tool back then.
  • kwinn Posts: 8,697
    edited 2015-02-09 19:10
    Dave Hein wrote: »
    The disadvantage to this is that you can never get more than 100% of a cog's compute even if you're using less than 100% in the other 7 cogs. Let's say that the other 7 cogs are only using 80% of their cycles. So there are 7*20% = 140% of cog cycles that are unused. Unfortunately, you can only run the remaining cog at 100%, and not 100%+140%.

    In a single-core processor that runs at 160 MIPS you could use the 240% * 20 MIPS for the work being done on the "8th" cog. This would result in 48 MIPS for that thread. Of course, on a single processor you would need interrupts, so that will reduce its effectiveness slightly. A nicely tuned RTOS should have less than 1% overhead.

    The problem with building a single core that runs at 160 MIPS is that the logic must run 8 times faster, and there will most likely be more pipelining. However, that's exactly what we are seeing with the P2. So in theory, a single P2 COG with interrupts could run as fast as the P1 with all of its cogs fully loaded. But then you have this weird idea that the processor must be deterministic, so that it always produces the same cycle-accurate timing every time it is run. That's just not needed except in obscure applications that bit-bang video or other real-time signals. With just a small amount of hardware buffering you can eliminate tight software timing constraints.

    Of course, now the question is how can you build a single processor equivalent of a 16-COG P2 running at 200 MHz? That is a real challenge, as we can see from the history of the P2 development. At that point a multicore approach makes a lot of sense.

    I see no reason why a single 160 MIPS processor could not be divvied up among 16 in the manner you are suggesting, by a method similar to the hub slot assignment/scheduling that was discussed before the "lazy susan" hub came along. Instead of 16 cogs, each with registers and a CPU, a single fast CPU could switch between 16 register blocks using a register block assignment table. It could work exactly like the current cogs do.
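
    A minimal sketch of that idea in C, purely illustrative: sixteen register blocks, a slot assignment table, and a single CPU stepping through them round-robin. The structure layout and the step_one_instruction() hook are hypothetical, not any real P1/P2 mechanism.

    #include <stdint.h>

    #define NUM_BLOCKS 16
    #define NUM_SLOTS  16

    /* One register block per virtual cog: 512 registers plus a program counter. */
    typedef struct {
        uint32_t reg[512];
        uint32_t pc;
    } reg_block_t;

    static reg_block_t blocks[NUM_BLOCKS];

    /* Slot assignment table: which register block runs in which time slot.
       A 1:1 table behaves like today's cogs; remapping entries would let one
       block claim extra slots, like the earlier slot-sharing proposals. */
    static uint8_t slot_table[NUM_SLOTS] = {
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
    };

    /* Hypothetical: execute one instruction using the given register block. */
    extern void step_one_instruction(reg_block_t *rb);

    void scheduler_loop(void)
    {
        for (unsigned slot = 0; ; slot = (slot + 1) % NUM_SLOTS) {
            step_one_instruction(&blocks[slot_table[slot]]);
        }
    }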