
Is it time to re-examine the P2 requirements ???


Comments

  • kwinnkwinn Posts: 8,697
    edited 2015-02-13 08:42
    potatohead wrote: »
    You got the task done right?

    Had there been interrupts a whole pile of things may not have gotten done.

    Now we know a lot more about using concurrent multiprocessors. And those things are robust and easy to combine too.

    Your use case required some effort. Lots of cases don't. And we have accumulated many more that are easy and effective too.

    It is the way they are combined that makes a Prop unique.

    There are always niche cases. Not all warrant a feature, given there are solutions. This is a basic complexity and dilution of the primary feature argument.

    Lots of products end up messy due to this dynamic not well considered.

    Yes, I got the task done, and it did require quite a bit of effort. It's also true that a lot of cases don't require all that much effort. Most of the projects I have done fall into that category.

    Also true that the two projects I wanted interrupts for were niche cases, but the fact is that the propeller is a niche chip. I am not an expert on (or even all that familiar with) the uC marketplace, but I wonder how many more niches the propeller chip could fill if it had interrupts.
  • kwinnkwinn Posts: 8,697
    edited 2015-02-13 08:57
    jmg wrote: »
    I would expand that a little with :
    + 2 bits for Edge / Level control LO,HI,_/= , =\_, maybe one bit for pending/new rules
    +1 bit to the Pin field, to allow a choice of Counter events to trigger. This also allows Watchdog operation.

    Addit: +1 global bit, that can be set/polled by any/all COGS for inter-COG sync tasks.

    Yes, got to thinking along that line after I posted it. May as well make it a regular 32 bit register to be consistent and use bits for added functions.
    Once you have that, you have task switching hardware done.
    eg fire that state toggle every other cycle, and you get 50% cycle sharing.

    Yes, and if you can select a counter as well as a pin for the switching you have variable cycle sharing without using a pin.
  • mindrobotsmindrobots Posts: 6,506
    edited 2015-02-13 09:18
    Heater. wrote: »
    Holy cow, look what I found!

    No more discussing of P2 requirements for me. I'm going to blow all the money saved up for P2's on some of these https://www.96boards.org/products/

    The HiKey Board - 1/96 board.

    8 cores, 64 bit, 1.2 GHz plus GPU plus a ton of interfaces. 130 dollars!

    Software all supported by the wonderful Linaro folks.

    That looks pretty sweet....but I have to wait 7 weeks to get one! A lot can happen in this industry in 7 weeks! ....even more in 7 months ....even more in 7 YEARS!

    Maybe I should get a HiKey for my birthday...something to play with until I can get a P2 for some birthday!
  • kwinnkwinn Posts: 8,697
    edited 2015-02-13 09:21
    @jmg
    Then do not call it an interrupt, call it a hardware event activated task switch instead
    With no stack, it is not quite your grandfather's interrupt anyway.

    That's probably a better name for it, although some purists might disagree. That's all some early minis and microprocessors had for interrupts. It was up to the software to save all the state information.

    @Heater
    By the way, what is the deal with the 8.5 µs latency requirement? I think Kye's Full Duplex Serial driver that works at 250K baud shows how to meet that requirement using WAITCNT and coroutines already.

    WAITCNT and coroutines would have worked if the code already in the cog had allowed for a more modular approach. By the time I added all the JMPRET instructions required to meet the timing constraints the existing code would not fit in the cog.
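
    For anyone curious what the WAITCNT-plus-coroutines style looks like outside of PASM, here is a rough PropGCC-flavoured C sketch of the idea. Kye's actual driver interleaves PASM coroutines with JMPRET; this is only an approximation, and it assumes propeller.h's CNT and CLKFREQ:

    #include <propeller.h>

    static unsigned rx_deadline, tx_deadline;

    static void rx_step(void) { /* sample a pin, shift a bit in, etc. */ }
    static void tx_step(void) { /* shift a bit out, update OUTA, etc. */ }

    /* Two cooperative "coroutines" in one cog, each pacing itself against
     * the cycle counter so neither blocks the other.                     */
    void serial_loop(unsigned baud)
    {
        unsigned bit_ticks = CLKFREQ / baud;
        rx_deadline = tx_deadline = CNT + bit_ticks;

        for (;;) {
            if ((int)(CNT - rx_deadline) >= 0) { rx_step(); rx_deadline += bit_ticks; }
            if ((int)(CNT - tx_deadline) >= 0) { tx_step(); tx_deadline += bit_ticks; }
        }
    }

    The interleaving itself is the easy part; as noted above, the real problem was squeezing the extra yield points into a cog that was already full.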
  • Dave HeinDave Hein Posts: 6,347
    edited 2015-02-13 09:22
    jmg wrote: »
    The P1V is ported to ever-more FPGAs, and with the MAX10, it makes sense to deploy a P1 and a P1V.MAX10 in a design.
    But can a P1V implemented on an FPGA compete against a comparable chip in silicon that costs less? I don't think so. So I'm not sure how much sense your proposal makes.
  • Dave HeinDave Hein Posts: 6,347
    edited 2015-02-13 09:28
    The (harsh) reality is that the world has moved on since the P2 was first mooted. By the time we have real production silicon in 2016 I'm not sure where it will fit.
    I think the P2 will have the same customer base as the P1, plus new people who find that it has enough power and memory to meet their needs. I think a lot of new people check out the P1, and find that its 32K hub RAM and/or its 2K cog RAM is just too small, and anything using external RAM is just too slow. Hopefully, the P2 will fill that niche.
  • kwinnkwinn Posts: 8,697
    edited 2015-02-13 09:44
    Dave Hein wrote: »
    Cycle-level task switching requires zero-overhead hardware support. That is, there can be no performance hit associated with a task switch. However, if you only do 10,000 task switches per second you could allow for some task-switching overhead. A 100 MIPS processor could have a 10 cycle task-switch overhead, which would only take 0.1% of the processor.

    That's the beauty of the cogs. Make the program counter and status flags part of a register (the task register?) that an event or pin can select and there is no overhead to switch tasks. The task register has to be initialized when the task is started, but nothing else needs to be done to it after that. When the task exits, its program counter remains as it was in that task's register.
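
    A toy C model of that idea (purely illustrative, not the real cog hardware; all names are invented): each hardware task is nothing more than a saved PC-plus-flags slot, and the event logic merely selects which slot feeds execution on the next cycle, so the switch itself costs nothing.

    #include <stdint.h>

    typedef struct {
        uint16_t pc;      /* where this task left off            */
        uint8_t  flags;   /* its carry/zero flags travel with it */
    } task_reg_t;

    static task_reg_t task[2];   /* two hardware task slots                */
    static int        active;    /* index chosen by a pin or counter event */

    /* Called by the event logic (pin edge, counter match, ...). */
    void select_task(int which) { active = which; }

    /* One machine cycle: fetch/execute from the active task's slot.  The
     * slot keeps its own PC, so an exiting task resumes where it stopped. */
    void step(void (*execute)(uint16_t pc, uint8_t *flags))
    {
        execute(task[active].pc, &task[active].flags);
        task[active].pc++;
    }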
  • kwinnkwinn Posts: 8,697
    edited 2015-02-13 09:50
    Heater. wrote: »
    So, the crux of the matter is: how to map external events onto code that deals with them. That is what we do in MCUs.

    Throwing a CPU at each possible event handles that nicely. Until you run out of CPUs.

    Having a single CPU deal with multiple events must degrade performance of some code somewhere.

    The nice way to do this is for all events and code to be symmetrical. There is no "background" and "interrupt handler", only code that is activated by events.

    Ergo, no interrupts but thread slicing.

    Exactly. Sorry for my poor choice of name and obtuse description. How does "event initiated thread slicing" sound as a description?
  • Dave HeinDave Hein Posts: 6,347
    edited 2015-02-13 09:54
    kwinn wrote: »
    That's the beauty of the cogs. Make the program counter and status flags part of a register (the task register?) that an event or pin can select and there is no overhead to switch tasks. The task register has to be initialized when the task is started, but nothing else needs to be done to it after that. When the task exits, its program counter remains as it was in that task's register.
    I think there's a bit more hardware required than just that. When Chip did this over a year ago he also had to dedicate a return stack for each task, and there were a few other things needed. The tasks aren't completely independent of each other either. They have to ensure that they don't use the same resources that other tasks are using. And there's no way to ensure cycle accuracy for a task because other tasks will stall the pipeline waiting for hub access or some other kind of wait.
  • potatoheadpotatohead Posts: 10,255
    edited 2015-02-13 10:19
    How many more niches indeed?

    Great question IMHO.

    Would those be filled by a Prop, or others?

    What are those niches and can they not be met with the planned functionality? Why?
  • Mike GreenMike Green Posts: 23,101
    edited 2015-02-13 10:56
    There's no way to do preemptive multi-tasking with deterministic cycle accuracy (except for the highest priority task). On the other hand, you've got 16 cogs. Some of them can be used for deterministic cycle accurate tasks (one per cog) and the other cogs can do preemptive multi-tasking without affecting the deterministic cogs.
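
    A small PropGCC-style sketch of that split (pin number, rate and stack size are arbitrary; it assumes propeller.h's cogstart, waitcnt, CNT, CLKFREQ, DIRA and OUTA): one cog owns a cycle-exact job, the main cog stays free for everything else.

    #include <propeller.h>

    #define PULSE_PIN 16

    static unsigned det_stack[64];

    /* Deterministic task: toggle a pin at an exact 1 kHz rate, forever. */
    static void pulse_cog(void *unused)
    {
        unsigned period = CLKFREQ / 1000;
        unsigned next   = CNT + period;
        DIRA |= 1 << PULSE_PIN;
        for (;;) {
            OUTA ^= 1 << PULSE_PIN;
            waitcnt(next);            /* blocks this cog only */
            next += period;
        }
    }

    int main(void)
    {
        cogstart(pulse_cog, 0, det_stack, sizeof(det_stack));
        for (;;) {
            /* non-deterministic work (UI, logging, ...) runs here without
             * disturbing the timing of the cog above */
        }
    }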
  • Heater.Heater. Posts: 21,230
    edited 2015-02-13 11:10
    On the other hand you can do cycle interleaved multi-tasking and keep determinism. Possibly at the cost of halving the execution rate of two tasks over one.

    With its pipelined architecture the XMOS can run 4 such interleaved tasks all at the same instruction dispatch rate as a single task on its own! Basically having 4 such threads keeps the pipeline busy all the time, rather than wasting empty slots in the pipe.

    How this applies to the current P2 design I have no idea.
  • kubakuba Posts: 94
    edited 2015-02-13 11:11
    tritonium wrote: »
    This old Texas processor was interesting in its time... it used context switching, which I would think would make interrupts fast to service.

    Architecture (from the wiki)

    The TMS9900 has three internal 16-bit registers — Program counter (PC), Status register (ST), and Workspace Pointer register (WP).[1] The WP register points to a base address in external RAM where the processor's 16 general purpose user registers (each 16 bits wide) are kept. This architecture allows for quick context switching; e.g. when a subroutine is entered, only the single workspace register needs to be changed instead of requiring registers to be saved individually.
    That's pretty much like the eZ8 (a.k.a. Z8 Encore!), currently offered by Zilog. Of course it's an 8 bit micro, but a neat one and these days you can get it for a dollar and change. It has the USB FS device etc. I've been using them for way too long, but they do the job they are supposed to do.
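
    A tiny C model of that workspace trick (just an illustration of the quoted description, not real TMS9900 code): the "registers" live in ordinary RAM, so switching context is a pointer assignment instead of a save/restore loop, with the old WP/PC/ST stashed in the new workspace (R13..R15 on the real chip) for the return.

    #include <stdint.h>

    static uint16_t ram[4096];    /* all registers live in ordinary RAM */

    typedef struct {
        uint16_t pc;
        uint16_t st;    /* status register                             */
        uint16_t wp;    /* workspace pointer: index of R0 within ram[] */
    } cpu_t;

    /* Write "register" n of the current workspace. */
    static void reg_set(cpu_t *c, int n, uint16_t v) { ram[c->wp + n] = v; }

    /* Context switch on interrupt or BLWP-style call: no register copying,
     * just repoint WP and stash the old context in the new workspace.     */
    void switch_context(cpu_t *c, uint16_t new_wp, uint16_t new_pc)
    {
        uint16_t old_wp = c->wp, old_pc = c->pc, old_st = c->st;
        c->wp = new_wp;
        reg_set(c, 13, old_wp);
        reg_set(c, 14, old_pc);
        reg_set(c, 15, old_st);
        c->pc = new_pc;
    }

    Returning is the mirror image: reload WP, PC and ST from R13..R15, which is what the real chip's RTWP instruction does.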
  • jmgjmg Posts: 15,155
    edited 2015-02-13 11:18
    Dave Hein wrote: »
    But can a P1V implemented on an FPGA compete against a comparable chip in silicon that costs less? I don't think so. So I'm not sure how much sense your proposal makes.
    That's why I suggest a P1 and a P1V
    Also note this is a Module, not a Chip, and the Module market is a little different.
    ( The 50c customer is not buying a P1 anyway. )

    Look at the volumes the RaspPi ships in, and the volumes Microchip claim for development systems.
    As chips get harder to manage in moderate production runs, and the price-points are lowered, there is a growing trend to use a Module in Production (check out WiFi Modules, GPS modules... etc )

    P2 is going to need a compact Module, as even that 0.4mm EDFP is pushing some production lines.

    Plenty of scope for some long term Module planning here.
    Design the [P1 and P1V] & P2 Module in parallel, so they are interchangeable.
    Result is early design wins for P2.
  • Heater.Heater. Posts: 21,230
    edited 2015-02-13 11:48
    The old Texas 16 bit TMS9900 (or whatever it was called) was a wonderfully elegant device.

    Problem was that it kept all its general purpose registers in main RAM rather than on chip.

    That meant it could do context switches pretty quickly, just move a memory pointer rather than save a bunch of CPU registers to stack.

    BUT it meant it was generally slow. Registers on chip are fast, registers in RAM are slow.
  • kubakuba Posts: 94
    edited 2015-02-13 12:06
    Seairth wrote: »
    Interesting. I was always under the impression that interrupts existed because you need an efficient way for a largely synchronous device to interact with a largely asynchronous outside world. In other words, you need interrupts when you can't know when an event will occur. Once hardware interrupts were added, it was suddenly obvious that this technique could be used to simulate a multi-processing device. But I don't think this later observation would have come along if interrupts hadn't been first added to handle asynchronous events.

    This is actually one area of the Propeller "philosophy" that I struggle with. Because there aren't interrupts (in the traditional sense), you have basically two alternative approaches:
    • Polling
    • Blocking

    Polling provides the opportunity to check more than one input, perform other tasks, etc. But it also requires semi-predictable events. If the event is truly unpredictable, then polling is very expensive, both in terms of power consumption and loss of effective processing time.

    In those cases, you can switch to blocking. In this case, the cog just stalls out and waits for an external event. While this doesn't necessarily make good use of processing resources, it has the benefit that it consumes much less power than polling in a tight loop. Further, you can get more accurate timing, since you aren't performing multi-clock jumps, tests, etc.

    But what solution is there that is both power-efficient and processing-efficient? While interrupts may have their own issues, this is one thing that I think they are very good at.

    Now, here's the thing: you could actually implement very slow interrupts on the Propeller. How?
    1. In COG A, write code that blocks. When it wakes up, the only thing it does is set a flag in the hub.
    2. In COG B, run a modified SPIN interpreter that checks for the flag at the beginning of the instruction loop, then jumps to some other behaviour.

    Yes, it's ugly. No, it's not nearly as efficient as hardware interrupts. But that is essentially what real interrupt hardware is doing: it has a very small circuit that is doing what COG A is doing, and the CPU is essentially doing what COG B is doing.
    Quite interestingly, that's how the XMOS event-driven select operates. The difference is: the "A" cog can be pin hardware; it doesn't have to be any code per se (but could be). The "B" cog's dispatch is handled in hardware, and you can write it all out in a high-level language (XC).

    Given that they sell the XMOS core in one chip with an ARM Cortex-M3, you can have the best of both worlds.

    Frankly said, even if all Parallax offered was an XMOS clone with lighter weight tools, I'd buy it in a heartbeat. I seem to tolerate Eclipse, and that's about it. For a lot of prototyping and quick turnaround experiments, nothing beats the simplicity and responsiveness of Propeller Tool software.
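
    Here is what that two-cog scheme might look like as a PropGCC-flavoured sketch (pin number and names are illustrative, C stands in for the modified Spin interpreter, and it assumes propeller.h's cogstart and waitpeq): cog A blocks on a pin and only raises a hub flag, cog B checks the flag at the top of its main loop.

    #include <propeller.h>

    #define EVENT_PIN 8

    static volatile int event_flag;        /* lives in hub RAM, shared by cogs */
    static unsigned watcher_stack[64];

    /* "COG A": wake only when the pin goes high, set the flag, wait again. */
    static void watcher_cog(void *unused)
    {
        unsigned mask = 1 << EVENT_PIN;
        for (;;) {
            waitpeq(mask, mask);           /* block, drawing little power */
            event_flag = 1;
            waitpeq(0, mask);              /* wait for the pin to drop    */
        }
    }

    /* "COG B": the main loop polls the flag once per pass. */
    int main(void)
    {
        cogstart(watcher_cog, 0, watcher_stack, sizeof(watcher_stack));
        for (;;) {
            if (event_flag) {              /* the "interrupt check" */
                event_flag = 0;
                /* handle the event here */
            }
            /* normal foreground work continues here */
        }
    }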
  • Dave HeinDave Hein Posts: 6,347
    edited 2015-02-13 12:13
    jmg wrote: »
    That's why I suggest a P1 and a P1V
    Also note this is a Module, not a Chip, and the Module market is a little different.
    So how much does this P1 and a P1V module cost? P1V requires a relatively expensive FPGA compared to an $8 P1. Assuming the FPGA is $16, can this $24 module really be competitive against somebody else's $15 high performance chip?
  • kubakuba Posts: 94
    edited 2015-02-13 12:18
    Mike Green wrote: »
    There's no way to do preemptive multi-tasking with deterministic cycle accuracy (except for the highest priority task). On the other hand, you've got 16 cogs. Some of them can be used for deterministic cycle accurate tasks (one per cog) and the other cogs can do preemptive multi-tasking without affecting the deterministic cogs.
    There is a way to do it, and I've been doing it for quite a long time, as do others. The key to a realistic implementation is strict prioritization of run-to-completion tasks. An interrupt handler is simply a task that runs to completion, and can be "long", but is preemptable by higher-priority interrupts. The interrupt handlers can be software interrupts (traps), as long as there's a way to ensure lock-out of lower-priority ones.

    Different architectures support it differently, oftentimes not explicitly. I've often leveraged GPIO interrupts for this - in most designs you have a rather limited number of GPIOs monitored by interrupts, so the unused vectors can be co-opted as long as you can fire them off reasonably. Worst case, you need to flip a real GPIO pin. Usually, though, there's an interrupt register somewhere, and setting a bit in that register has the same effect as if the hardware fired it.

    As long as there are upper bounds on the request rates for higher-priority tasks, everything is fully deterministic in the sense that you can place an upper bound on the latency of a task of any priority. Some help from the interrupt controller is useful; worst case you need to do interrupt prioritization (masking/unmasking) by hand on handler entry and exit. If the architecture is otherwise deterministic, the tasks will all start and finish on a deterministic clock cycle given the same clock-synchronous timing of external or internal stimuli.
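
    A minimal C sketch of the bookkeeping behind that scheme (names are invented; on real hardware the pending bits live in an interrupt controller, and it is the controller's masking that provides the actual preemption of a running lower-priority task - this loop only shows the prioritized run-to-completion dispatch):

    #include <stdint.h>

    #define NUM_PRIO 8                        /* 0 = highest priority */

    static volatile uint8_t pending;          /* one bit per priority level */
    static void (*handler[NUM_PRIO])(void);   /* run-to-completion tasks    */

    /* "Software interrupt": request that the task at 'prio' run soon. */
    void fire(int prio) { pending |= (uint8_t)(1u << prio); }

    /* Always run the highest-priority pending task to completion; a task
     * may itself call fire() to defer lower-priority follow-up work.    */
    void dispatch(void)
    {
        while (pending) {
            int p = 0;
            while (!(pending & (1u << p)))
                p++;                          /* find highest priority */
            pending &= (uint8_t)~(1u << p);
            handler[p]();                     /* runs to completion    */
        }
    }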
  • Heater.Heater. Posts: 21,230
    edited 2015-02-13 12:33
    kuba,

    Thanks for that long explanation of interrupts, their priorities and the difficulty of juggling all that correctly.

    Like you I have been there and done that.

    It's bad enough when all the code on the machine is written by yourself, but when you throw in modules/objects/functionality written by others, as we do on the Propeller, it becomes impossible to determine how all the parts interact, or even if it's possible for them to coexist and work correctly on the same machine.

    Determinism is then lost, unless you want to analyse all the code you are going to use in detail.

    It's exactly why interrupts are a horrible, crude, kludgey idea. Multiple cores replace the need for interrupts and that complication. And why we don't want them on the Propeller.
  • David BetzDavid Betz Posts: 14,511
    edited 2015-02-13 12:50
    Heater. wrote: »
    kuba,

    Thanks for that long explanation of interrupts, their priorities and the difficulty of juggling all that correctly.

    Like you I have been there and done that.

    It's bad enough when all the code on the machine is written by yourself, but when you throw in modules/objects/functionality written by others, as we do on the Propeller, it becomes impossible to determine how all the parts interact, or even if it's possible for them to coexist and work correctly on the same machine.

    Determinism is then lost, unless you want to analyse all the code you are going to use in detail.

    It's exactly why interrupts are a horrible, crude, kludgey idea. Multiple cores replace the need for interrupts and that complication. And why we don't want them on the Propeller.
    I'm not disagreeing with you but I wonder why the rest of the industry (other than maybe XMOS) hasn't figured this out?
  • rod1963rod1963 Posts: 752
    edited 2015-02-13 13:15
    Heater

    But how many hobbyists and commercial apps really need the hard core real-time determinacy that you write about? The Arduino folks have done quite well without the Prop. Same with the people with the BS2.

    Video is the only domain I can see that really needs hard real time for most people. But that's always been the province of dedicated graphic chips/custom logic and memory and has been that way for many decades.

    Beyond that, real time gets into specialties such as flight control computers, specialized high data acquisition systems used in various apps, industrial controllers, medical, drive train systems, etc.
  • Heater.Heater. Posts: 21,230
    edited 2015-02-13 13:16
    David,

    Good question.

    Perhaps it depends on what one means by "the industry".

    In my experience companies developing real-time embedded systems had the money to pay for their own dev teams and create all their code in house. In military applications, avionics applications and in industry.

    Being of the closed source, proprietary code mind set these issues of mixing and matching objects/modules/functionality from here and there would never cross their minds. The horrendous complexity of managing interrupts and task priorities etc whilst getting all their in house code to work was all just part of the job.

    On the other extreme we have the modern world of Arduino users who soon discover that if they want to do more than one thing at a time in their app it's a real pain. They can grab all kinds of Free and Open Source code for their project but will that IMU sensor fusion code work with that Neo Pixel LED driver? How on Earth do I make it so?

    The software world has long touted the ideas of encapsulation, data hiding, separation of concerns, object oriented programming. Just as a way to be sure all the parts don't have too many interdependencies and compose well together.

    Mapping those ideas to the hardware world requires "timing encapsulation" as well. This function here, no matter what it is, cannot upset any other function in the system, timing wise.
  • potatoheadpotatohead Posts: 10,255
    edited 2015-02-13 13:18
    Inertia is one reason. Interrupting a single CPU has been dominant for a long time because concurrent multiprocessing was expensive.

    Another may well be scale.

    The do-it-in-parallel-with-no-OS idea is compelling and, I would submit, indicated in embedded applications. Maybe not all of them, but a lot of them can benefit.

    At some point, it might not make sense.

    Frankly, I hope we get P2 in the can and can maybe think about a full on CPU, with HUB as external RAM and more cores, maybe bigger ones.

    I sometimes think about what a PC class device might look like done the Propeller way.
  • evanhevanh Posts: 15,281
    edited 2015-02-13 13:23
    The Prop isn't targeted to fully utilise all CPU cycles.
  • potatoheadpotatohead Posts: 10,255
    edited 2015-02-13 13:23
    BTW effective code protect may well open up some of the niches Heater mentioned.
  • LeonLeon Posts: 7,620
    edited 2015-02-13 13:24
    Heater. wrote: »
    kuba,

    Thanks for that long explanation of interrupts, their priorities and the difficulty of juggling all that correctly.

    Like you I have been there and done that.

    It's bad enough when all the code on the machine is written by yourself, but when you throw in modules/objects/functionality written by others, as we do on the Propeller, it becomes impossible to determine how all the parts interact, or even if it's possible for them to coexist and work correctly on the same machine.

    Determinism is then lost, unless you want to analyse all the code you are going to use in detail.

    It's exactly why interrupts are a horrible, crude, kludgey idea. Multiple cores replace the need for interrupts and that complication. And why we don't want them on the Propeller.

    XMOS has interrupts, but they are primarily for use with legacy code.
  • evanhevanh Posts: 15,281
    edited 2015-02-13 13:26
    potatohead wrote: »
    The do-it-in-parallel-with-no-OS idea is compelling and, I would submit, indicated in embedded applications. Maybe not all of them, but a lot of them can benefit.

    Right, the Prop ecosystem is operating without a kernel/OS. There are no closed binary libs that hide any hardware. There is no isolation other than the natural multi-core config. The whole development system is based around detailed hardware plus a compiler/assembler combo.

    The assembler is a critical part of soft devices. How many other multi-core ecosystems are working to that level?
  • Heater.Heater. Posts: 21,230
    edited 2015-02-13 13:27
    rod1963,
    But how many hobbyists and commercial apps really need the hard core real-time determinacy that you write about?
    You have missed the point. It's not about "hard core" real-time determinacy (whatever that may mean). It's about software composition and being able to determine the resulting behaviour. See post above.
    The Arduino folks have done quite well without the Prop.
    The Arduino folks have done amazing things without the Prop.

    But I have seen, many times, questions like, "I have this great program that does X, now I want to combine it with this other great program that does Y, how do I do that?"

    All of a sudden you are in a big mess entwining X and Y and getting the timing right.

    The Propeller makes such software composition dead easy. Just drop in the objects, hook them up, and it works.

    Perhaps the whole idea is not perfected yet. But both Chip and XMOS have been working on it for a long time.
  • markmark Posts: 252
    edited 2015-02-13 14:13
    rod1963 wrote: »

    Beyond that, real time gets into specialties such as flight control computers, specialized high data acquisition systems used in various apps, industrial controllers, medical, drive train systems, etc.


    I was actually really surprised when I found out SpaceX was using commodity high-performance CPUs running Linux with all software programmed in C++(!) for all their flight/engine control and DAQ systems. I suppose from a performance standpoint, it's not all that surprising when you consider they're computing in the range of billions of instructions per second while most external components that need to be controlled probably don't require granularity finer than the millisecond range.

    Yet in the audio realm, you have people concerned about sampling rate jitter on the order of picoseconds.
  • Dave HeinDave Hein Posts: 6,347
    edited 2015-02-13 14:20
    The normal mode of use for the Prop is to run the main code on one cog, and use the remaining cogs for device drivers. This works fine as long as there are enough cogs to run the device drivers and the main code can run as Spin code at an effective 1 MIPS speed. C has improved the main code speed up to maybe 4 MIPS. If you can FCACHE small tight loops then you can get the full speed of 20 MIPS.

    It's difficult to break a program up and do real parallel processing by using multiple cogs for the main code. Certain pieces can be programmed in PASM and put into separate cogs, such as floating point or string searches. However, this normally doesn't run in parallel with the main cog; instead it is just a way to achieve 20 MIPS using multiple cogs. So the bottom line is that it's very difficult to utilize the full 20 MIPS out of each of the cogs to achieve 160 MIPS performance.

    It's easier to do this with a single 160 MIPS processor with interrupts. There will be higher latencies, so you'll need to depend on hardware buffering to handle it. In this scenario, it makes sense to have additional hardware support, such as UARTS.

    So maybe the ideal chip is a merger of the two -- a P1 with 8 cogs for device drivers and a single high speed processor with interrupts, such as an ARM core.
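
    For what it's worth, the "separate cog as a 20 MIPS helper" pattern mentioned above usually looks something like this hub-mailbox sketch (PropGCC-flavoured; the mailbox layout and names are invented, and it assumes propeller.h's cogstart):

    #include <propeller.h>
    #include <string.h>

    typedef struct {
        volatile const char *haystack;   /* request: text to search    */
        volatile const char *needle;     /* request: pattern           */
        volatile int         result;     /* reply: index or -1         */
        volatile int         busy;       /* 1 while the worker owns it */
    } mailbox_t;

    static mailbox_t mbox;
    static unsigned  worker_stack[64];

    /* Worker cog: waits for a request, does the string search, replies. */
    static void worker_cog(void *arg)
    {
        mailbox_t *m = arg;
        for (;;) {
            while (!m->busy)
                ;
            const char *text = (const char *)m->haystack;
            const char *hit  = strstr(text, (const char *)m->needle);
            m->result = hit ? (int)(hit - text) : -1;
            m->busy = 0;                 /* hand the reply back */
        }
    }

    /* Main cog: post a request, then wait (or do other work) for the reply. */
    int string_find(const char *text, const char *pat)
    {
        mbox.haystack = text;
        mbox.needle   = pat;
        mbox.busy     = 1;               /* kick the worker */
        while (mbox.busy)
            ;
        return mbox.result;
    }

    int main(void)
    {
        cogstart(worker_cog, &mbox, worker_stack, sizeof(worker_stack));
        return string_find("propeller", "pell");   /* finds "pell" at index 3 */
    }

    Note that the main cog mostly just waits here, which is exactly the point made above: the work moves to another cog, but it isn't true parallelism.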