
Propeller II update - BLOG


Comments

  • potatohead Posts: 10,260
    edited 2013-12-08 12:43
    So when I read "slot sharing", I don't read it as the passive stealing you mentioned, Bill. I think I agree with that case, largely because it moves the problem to code in the needy COG and it doesn't cause hard failures.

    And the memory move is one positive scenario. So would a general loop that can vary in speed, latched at key decision points. Can be written for soft failures, much like we do now.
    It may be useful to have a

    SETSLOT "DETERMINISTIC | HUNGRY"

    In DETERMINISTIC mode, a cog can only get its own slots, and may not use other cogs' slots. The hub goes merrily around every 8 cycles. If it leaves a slot unused, a HUNGRY cog can use it.

    In HUNGRY mode, a cog gets its own guaranteed slot, and shares spare slots with other hungry cogs. It is only guaranteed its 1/8 slot, but may get more.
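
    A minimal usage sketch of the idea (SETSLOT and its two modes are hypothetical names for the proposal above, not existing instructions, and the labels are only illustrative):

    ' hypothetical sketch only - SETSLOT / DETERMINISTIC / HUNGRY are proposed, not real, opcodes
    uart_cog                            ' timing-critical driver cog
            setslot #DETERMINISTIC      ' use only its own 1-of-8 window; its unused windows stay up for grabs
            jmp     #uart_loop

    copy_cog                            ' throughput-hungry memory-move cog
            setslot #HUNGRY             ' keep the guaranteed window, plus take any window left unused
            jmp     #copy_loop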

    So then, in the sharing scenario, "first come, first served by COG order?"

    Probably need better terms. And I really wanted to feel better about the "A" case. Having the product be potentially less effective isn't a trivial concern.

    @Bill How about SETCOG NORMAL / MOOCH :)
  • ctwardell Posts: 1,716
    edited 2013-12-08 12:43
    potatohead wrote: »
    Seems to me, this desire is really rooted in making a Propeller look more like a single-core or dual-core machine with some definable peripherals than it is in any significant and material gain in overall power.

    I think in some use cases that end outcome is desirable.

    In many cases some of the cogs are used for peripherals while one cog runs the 'main' application.

    Wouldn't it be nice to let that 'main' program get some benefit from unused 'horsepower' in the peripheral cogs?

    I think hub slot sharing actually helps with object reuse because otherwise the only way to get some use from the spare 'horsepower' in a peripheral cog would be to try to mix in some other functionality besides its primary function of being a UART, keyboard driver, etc.

    I realize getting extra hub slots doesn't directly gain any compute power, other than letting the receiving cog spend less time blocked waiting on the hub.

    C.W.
  • potatohead Posts: 10,260
    edited 2013-12-08 12:52
    I think hub slot sharing actually helps with object reuse because otherwise the only way to get some use from the spare 'horsepower' in a peripheral cog would be to try to mix in some other functionality besides its primary function of being a UART, keyboard driver, etc.

    Which is why we got tasks, but then it complicates the peripheral COGs. Ok. That's a very good case. :)

    So then, let's say somebody does that, and it underperforms. Will they move to better parallelize, or will they just sort of fall back on something like, "all the COGS should go faster?"
  • Heater. Posts: 21,230
    edited 2013-12-08 12:56
    Guys,

    I have a growing worry in my mind which I don't know how to express succinctly.

    So let's start with some simple questions which should have more or less concrete answers:

    1) How many different instructions are there now in the current P2 instruction set?

    2) How many of those instructions are ever going to be used by high-level language compilers like GCC?

    3) How many of those instructions are going to be used by assembly language programmers who have not spent years researching what they all do?

    4) How many of those instructions in 1) are left after we remove the subsets of 2) and 3)? And why would we want to keep them?

    Why do I ask?

    Because in some way I see history repeating itself here.

    The first microprocessors had things like DAA just because that was good for calculator builders; it was never used much after that, and especially not by compilers.

    The Z80 added a huge pile of instructions to make life easy for assembler programmers; they were never used.

    The RISC guys decided to sweep all that away and have only instructions compilers could use.

    Intel tried different ideas with the 432 and the i860. Nobody could use them. The i860 had a bunch of features to boost performance, but neither compiler writers nor hand-coding assembler programmers could figure out how to use them.

    What is the point of my rambling?

    No idea.

    It's just a worry that a lot of clever minds here are developing a lot of optimizations for all kind of things that will never be used.
  • Bill Henning Posts: 6,445
    edited 2013-12-08 13:08
    quick response, wifey wants me to spend the day with her :)

    Right now, hubexec is a way to give us faster-than-LMM large programs, with 3 - 7 new instructions, plus 1 for 'BIG'.
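
    For anyone who hasn't followed the LMM discussion: LMM (Large Memory Model) is the software fetch-and-execute trick that hubexec replaces. A minimal P1-style LMM kernel looks roughly like this (a sketch; register and label names are illustrative):

    ' classic LMM fetch loop (sketch)
    lmm_loop
            rdlong  lmm_ins, lmm_pc     ' fetch one native instruction from hub RAM
            add     lmm_pc, #4          ' advance the hub program counter
    lmm_ins nop                         ' the fetched instruction lands and executes here
            jmp     #lmm_loop           ' back for the next fetch

    lmm_pc  long    0                   ' points at the 'big' program in hub RAM

    Every instruction of the big program pays for a hub read plus loop overhead, which is why executing directly from hub ('hubexec') is such a win.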

    Good questions; however, I don't think Chip wants to build a conventional microcontroller designed for C.

    Chip is busy adding a little bit of special sauce so C, Spin, and large assembly programs are a lot faster than LMM, widening the potential market. I don't think he wants to get into a CS debate about how the Px should be designed from a compiler point of view.

    Besides, Chip and a lot of us (including you) love programming in assembler :)
  • ctwardell Posts: 1,716
    edited 2013-12-08 13:17
    potatohead wrote: »
    Which is why we got tasks, but then it complicates the peripheral COGs. Ok. That's a very good case. :)

    So then, let's say somebody does that, and it underperforms. Will they move to better parallelize, or will they just sort of fall back on something like, "all the COGS should go faster?"

    They will just want it to go faster; that's almost always going to be the case.

    C.W.
  • David Betz Posts: 14,511
    edited 2013-12-08 13:36
    Besides, Chip and a lot of us (including you) love programming in assembler :)
    I'm in that category as well, although I'm not nearly as good at it as you or Chip. It is fun though. However, Parallax might want to consider what percentage of its audience will actually do significant PASM programming vs. Spin or C. Even if that percentage is small, it still doesn't mean that instructions mostly useful to PASM programmers should be removed. If a few gurus like Chip and Bill and many of the rest of you here can make good use of those instructions to create drivers or even library functions that are useful to the high-level language crowd, that's good enough reason for some or all of those instructions to stay. However, I think it's also important to support code generation for high-level languages, and the changes Chip is currently working on promise to do that.
  • Baggers Posts: 3,019
    edited 2013-12-08 13:48
    Leon, I hate to burst your bubble and complain about XMOS, but it wasn't as deterministic as you think: if you wrote something on one core and then added a thread, it would halve the speed of the chip, as it didn't have as many cores as the Prop :P, and thus throw the timing out. So it wasn't as uber as you always made it out to be; it may have been faster than the P1, but I'd always choose a P1 over the XMOS! Let alone the almighty P2 :D

    Anyway, enough bubble bursting, as I appear to have been thrown into this conversation I might as well respond with my thoughts on the issue.

    I love the Prop, both 1 and 2. One of its many, many awesome features is its deterministic approach, where you can write one piece of code and know it'll run the same way in any application it's thrown into. (Although there was that one issue we found with the sound if it ran on a certain cog, but even that was deterministic. :P) Whereas if you had code running on a device with interrupts, the more interrupts it had, the more certainly it would throw out any timing-critical routines in your application, be it anything from the tight inner loops of a renderer, to PWMs, to bit-banging USB code. Whatever it is, the joy of having it on the Prop is knowing that it will always run at the same speed, ALWAYS, no matter what the other 7 cogs are doing, other than stopping/resetting/reloading the cog obviously.

    Now, on the other hand, I don't mind the ability for a cog to steal unused hub cycles. I could certainly see that being handy where speed is required over precision in timing, or if you didn't use all 8 cogs; it would be a waste if those slots weren't being used.

    PS.
    Whilst everyone is modding instructions, I'd also like to add one if I may (so sorry Ken, although this was a request from a long while back): an instruction that writes each byte of a long into the low byte of each of the 4 longs in a quad, and another instruction that packs the low bytes of a quad back into a single long. No worries if it's too late or it can't be added.
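
    For a sense of what that saves, here is a rough P1-style sketch of the 'spread' half done by hand (register names are illustrative); the requested instruction would do this in one step:

    ' spread the 4 bytes of src into the low bytes of q0..q3, by hand
            mov     q0, src
            and     q0, #$FF            ' byte 0
            mov     q1, src
            shr     q1, #8
            and     q1, #$FF            ' byte 1
            mov     q2, src
            shr     q2, #16
            and     q2, #$FF            ' byte 2
            mov     q3, src
            shr     q3, #24             ' byte 3 ends up in the low byte, upper bits already zero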
  • jmg Posts: 15,155
    edited 2013-12-08 13:59
    Heater. wrote: »
    ...
    2) How many of those instructions are ever going to be used by high-level language compilers like GCC?
    ...
    The first microprocessors had things like DAA just because that was good for calculator builders; it was never used much after that, and especially not by compilers.
    ...
    The Z80 added a huge pile of instructions to make life easy for assembler programmers; they were never used.
    ...

    It's just a worry that a lot of clever minds here are developing a lot of optimizations for all kind of things that will never be used.

    "Used by compilers" is a sweeping term, as there is more than one use.

    Compilers may not generate code using all Opcodes, but Libraries and Inline ASM certainly can, and will.

    That means those opcodes are useful to compiler users, even if the core code generators never spit them out in the normal program flow.
  • potatohead Posts: 10,260
    edited 2013-12-08 14:05
    I'm not too concerned about the instructions.

    Some may go unused. I think that's better than missing out on a few that would have been used a lot. We've tried to get the "used a lot" in known P1 cases covered.

    Having good PASM capability is going to make low level things a lot more fun, and that always "trickles up" in my view. Cog objects, library code, etc... isn't a bad thing!

    Seems to me, having something unused, like DAA etc., is minor league compared to missing out on compiler-friendly instructions and/or known sweet-spot ones intended for assembly language programmers.

    Baggage instructions carry over time to just complicate things. In this case, it's likely P3 will blow out the instruction set again. Non issue then.

    I second what David wrote. It's really important that we get compiler efficiency this time around. Maybe we can somehow get Baggers' instruction in there too. I think he put that one here last time around; it got forgotten then.

    The 6502 had a smallish instruction set, with a few killer cases missed out and a couple of odd ducks added in. Not entirely compiler-friendly, due to it being 8 bits throughout, and a fixed stack. The 6809 was pure joy to program in assembly language. Addressing modes all over the place, big stacks one could place anywhere, etc., and on that one, compiler performance was much improved, though it worked best in assembly language through the usual kinds of abuses assembly language programmers find for it (like using the stack push/pull instructions to facilitate quick memory block moves).

    That's kind of how I see the P2. For the assembly language programmer, a ton of tricks will be possible, and C will be perfectly happy. Edit: And as JMG points out, that can be packaged up, encapsulated and used in bigger C programs with few worries.
  • rod1963 Posts: 752
    edited 2013-12-08 14:14
    Determinism has been around longer than Parallax and has been accomplished on supposedly inferior interrupt-driven systems. Systems that are good enough to keep the space shuttle, jet aircraft and vehicles working, or industrial plants running, etc. Maybe the Prop is better than its interrupt-driven competitors, but the case hasn't been made commercially AFAIK.

    In short it's not a real bragging point and never has been to people who work in professions where such systems are being used every day and where lives depend on them functioning.

    I think if you're gonna market the P2, it will have to be on more than that mantra of "determinism".
  • brucee Posts: 239
    edited 2013-12-08 14:24
    Just to illustrate rod's post about determinism.

    In 1980 we had a product that used video tape recorders to do backup (back when the other option was floppies). It used a Z80 to both produce and read a video signal and insert or extract data from it. Yes, hard coded in assembly language, and it disabled interrupts during the video portion.

    I also believe Woz's floppy disk controller hard timed the data interface with a 6502. I'll give Wendall a plug, as I believe he was the engineer at Apple that wrote the production code (around 1977).

    Before that and even before me, programmers would deterministically write code to time execution to coincide with the spin of a drum memory.

    Even now with multi-core ARMs you can write deterministic code by disabling interrupts and using memory dedicated to that CPU.
  • potatohead Posts: 10,260
    edited 2013-12-08 14:33
    Sure. Cycle counting, and on single processors, not multi-processors.

    The Apple had regular cycles, except for one longer cycle to make the video match up. I think it was every 64th cycle. Shorter bits of code could jitter because of that. The Disk ][ did use a timed interface, incorporating a simple state machine on the drive controller, with the 6502 handling everything else. The very first thing an Apple does is read in that block of RWTS (read/write track/sector) code from track 0, using timed code in the ROM. Once that hand-off happened, the rest of the disk formatting worked with the new code in RAM.

    Cycle counting on those chips wasn't too tough, unless the computer had various DMA activities going on. Interestingly, aside from that one longer cycle, the Apple actually was simple and very deterministic as it did not use interrupts at all for the core design.

    Peripherals could introduce DMA, or if they were clever, could "mooch" cycles from the 6502 transparently. The former required programmers to grant the Disk ][ exclusive access. The latter would break cycle-counted code on the machine, limiting the use of some devices with some programs.

    However, the simple design did make building peripheral cards easy, and that computer saw a lot of specialized use outside of home computer needs because it did have that easy determinism. Not a primary selling point for it in the consumer market, but in niches like test and measurement, development systems for embedded, etc., it was a nice option. Other machines offered up more capability, but interfacing to them was very complicated due to that same capability making things difficult.

    In a multi-processor scenario, having that ability be easy and make reasonable sense seems a strength. Not a primary one, but a very strong one where making lots of little bits of code run together, without needing an interrupt kernel or OS managing things, is a primary strength, IMHO. One that depends heavily, though not entirely, on the lower-level determinism. The core argument here is lean, where there just isn't a whole lot between the programmer and that clock edge.

    Go above that, and it's a performance argument, which strongly suggests the passive cycle "mooching" could be a benefit overall. And of course, that means not marketing entirely on the deterministic behavior, but not ignoring it either...
  • brucee Posts: 239
    edited 2013-12-08 14:51
    For cycle counting on a multi-processor, google Echelon. It had either 3 or 4 processors (2 different versions), and no interrupts. The network was bit-banged by one processor. The user application could be bit-banged on another. Higher-level network functions were done by the third.

    First silicon was out in 1988.
  • jazzed Posts: 11,803
    edited 2013-12-08 15:12
    I could be wrong, but I think the point is that determinism is the realm of dedicated micro-whatever.

    In today's micros we have hardware block peripherals that do deterministic things. Before the idea of the soft peripheral, the only ways to make a deterministic hardware peripheral were to write Verilog (ABEL, PLASM, SystemC, VHDL, whatever) or to waste an entire CPU (like a 6502, Z80, Am2901, or early 80xx).

    Propeller makes deterministic peripherals possible without needing to know Verilog, and you don't have to waste the entire chip doing just one thing.

    If you have an N-Core micro, you could do similar things, but most use hardware arbitration for memory access, and that is only deterministic to a point. Some micros I've worked with like the Cavium series processors have special queuing mechanisms for maintaining performance, but they still arbitrate memory access.

    Propeller has round robin access capability. It is potentially slower than other micros because of that, but timing is always what you make of it. Timing is not up to memory access arbitration.
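
    As a concrete reference point (P1 figures; P2 hub timing was still being settled at this point, so treat these as describing the general scheme rather than P2 numbers): each P1 cog gets a hub window every 16 system clocks, so a hub instruction costs

        t_hub = 7 + t_wait clocks, with 0 <= t_wait <= 15, i.e. 7 to 22 clocks

    and once your code's phase relative to its hub window is fixed, the exact figure is knowable in advance. That is the sense in which round robin access stays deterministic even though it is not free.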
  • Baggers Posts: 3,019
    edited 2013-12-08 15:17
    Programming a Prop, be it the original or the new one, has been and always will be a huge pleasure, for me anyway, as it is for us all, which is why we are all here on the forums ;)

    The problem is, whilst it's great fun for us, many people shy away because of its awesomeness, as they like the comfort zone of C. Parallax, as well as the rest of us, wants the P2 to be a huge success, not just for hobbyists but for business also; I doubt sales from the hobby side of the Prop will pay the dev costs, so they need to make it work for the industry side too. Which unfortunately means going against what I think Chip, I and a few others here really want, by adding too many C-friendly functions to please the C business crowd.

    I would love to see the P2 be used as a better, friendlier version of what the Raspberry Pi was supposed to be: a tool to teach kids how to program. Unfortunately for the Pi, it was far from fun, well, I didn't think it was fun; it was definitely no BBC Model B replacement. To me, the BBC was easy to program, self-contained, just start typing and run. The Pi is heavily Linux based, and I know I'll get Linux defenders replying to this, but what I mean is: when you got a BBC you could just plug it in and go, whereas with the Pi you have to get an OS image and write it to an SD card, and for a noob that's NOT a walk in the park. To me, it would have been perfect if it had its own OS, even if that was just a modified BASIC that could include assembly etc., or even just extra functionality in the BASIC, like sprites, music etc., to help ease the newcomer into it.

    When the P2 comes out, my intention is to get it a stand-alone OS that can also let you program it, run programs, etc., like I think a computer should, and make it fun! Something like how Sphinx was awesome for P1.
  • koehler Posts: 598
    edited 2013-12-08 15:23
    cgracey wrote: »
    With what Cluso99 came up with, then augmented to allow any other cog a slot, a single cog WOULD get all the slots if no other cogs were needing one.

    Uhm, surprised no one has suggested this.

    Why not allow a mode which offers a 'user defined' HUB order?

    You could have the Standard one, the Complementary, or the User-Defined.
  • Ym2413a Posts: 630
    edited 2013-12-08 16:12
    Baggers wrote: »
    When the P2 comes out, my intention is to get it a stand-alone OS that can also let you program it, run programs, etc., like I think a computer should, and make it fun! Something like how Sphinx was awesome for P1.

    Yeah. I agree. I want to see a simple and easy to use OS for the PropII. : ]
    That would make my day!
  • David Betz Posts: 14,511
    edited 2013-12-08 16:14
    jazzed wrote: »
    Propeller makes deterministic peripherals possible without needing to know Verilog, and you don't have to waste the entire chip doing just one thing.
    There is a thing called System C which is C that can be synthesized for an FPGA. I'm not sure if it's used widely, but it might be possible to write "intelligent peripherals" in System C and compile them for an FPGA. I think it's supposed to be easier to use than Verilog.
  • David Betz Posts: 14,511
    edited 2013-12-08 16:15
    Baggers wrote: »
    Programming a Prop, be it the original or the new one, has been and always will be a huge pleasure, for me anyway, as it is for us all, which is why we are all here on the forums ;)
    I agree! As much as I might complain about various aspects of the Propeller, I've never gone back to programming any other MCU since starting to work with the Propeller. I think part of it is the challenge of molding the relatively blank slate of the Propeller chip into a usable system.
  • jazzed Posts: 11,803
    edited 2013-12-08 16:21
    David Betz wrote: »
    There is a thing called System C which is C that can be synthesized for an FPGA. I'm not sure if it's used widely, but it might be possible to write "intelligent peripherals" in System C and compile them for an FPGA. I think it's supposed to be easier to use than Verilog.
    Yes, there is (but it's near-C). I get email from the SystemC consortium at least once a year.

    Please note that I mentioned it in my post :)
  • David Betz Posts: 14,511
    edited 2013-12-08 16:25
    jazzed wrote: »
    Yes, there is (but it's near-C). I get email from the SystemC consortium at least once a year.

    Please note that I mentioned it in my post :)
    Oops, sorry! I missed it when I skimmed your post, I guess. I'm having trouble keeping up with the traffic in this forum lately.
  • SRLM Posts: 5,045
    edited 2013-12-08 16:40
    David Betz wrote: »
    I'm having trouble keeping up with the traffic in this forum lately.

    I think Parallax should create a new podcast: "Intelligence2, Propeller 2 edition" where everybody can listen in on the debate. Maybe with Chip as moderator.
  • Ym2413a Posts: 630
    edited 2013-12-08 16:50
    SRLM wrote: »
    I think Parallax should create a new podcast: "Intelligence2, Propeller 2 edition" where everybody can listen in on the debate. Maybe with Chip as moderator.

    There's just a lot of energy and excitement around here. : ]
    How often do you get to input your ideas into the design of a new microcontroller? Though, I feel that calling the P2 just a microcontroller is an insult to what it is capable of doing!
  • jmg Posts: 15,155
    edited 2013-12-08 17:00
    jazzed wrote: »
    In today's micros we have hardware block peripherals that do deterministic things. Before the idea of the soft peripheral, the only ways to make a deterministic hardware peripheral were to write Verilog (ABEL, PLASM, SystemC, VHDL, whatever) or to waste an entire CPU (like a 6502, Z80, Am2901, or early 80xx).

    Certainly, the new ARM uCs tend to have larger FIFOs on their peripherals, to compensate for the more sluggish expected core response. Some ARM uCs have added one or even two M0 cores, as 'helpers', to try to keep a handle on real-time.

    It is getting easier to add HW (HDL) peripherals, as the languages get more powerful and the silicon gives more bang for the buck.
    Lattice are probably the best placed with 'Peripheral Silicon' offerings and Atmel, Altera and Xilinx have more general CPLD base-lines.

    Smaller CPLDs can start from about $1, but if you want something reprogrammable with a PLL, you are up to $6+ and a 100-pin package.

    The PLL barely gets mentioned, but it will be an important differentiator on the P2.
    jazzed wrote: »
    Propeller makes deterministic peripherals possible without needing to know Verilog, and you don't have to waste the entire chip doing just one thing.

    Yes, and it does that in a TQFP128 package, whilst FPGAs bump quickly to BGA packages.
  • brucee Posts: 239
    edited 2013-12-08 17:09
    Maybe the real issue here is the software. I agree there are a lot of RPis probably sitting in desk drawers. Linux brings a huge amount of open source software. But as a lot of it is free, that is probably what some of it is worth; not to disparage the good stuff that is out there, it is just that for the novice it can be hard to find.

    The P2 seems to be aimed at that market, though its performance and cost are outclassed by the Broadcom ARM. Add to that there is virtually no profit in that board, and it becomes hard to compete with. People here suggest a Prop add-on board, which is probably a hard sell by itself.

    Maybe the key is a $5 app that makes the RPi easy for the novice; once you've captured that, maybe you can offer an add-on board that expands features.
  • KC_Rob Posts: 465
    edited 2013-12-08 17:18
    jmg wrote: »
    It is getting easier to add HW (HDL) peripherals, as the languages get more powerful and the silicon gives more bang for the buck.
    Yes, all the time. Likewise there are more engineers available who do HDL programming. That is the trend anyway.
    Yes, and it does that in a TQFP128 package, whilst FPGAs bump quickly to BGA packages.
    Not yet. But there is a part that does that in a 44-pin QFP/QFN (also available in 40-pin DIP). I believe it goes for around $7. :)

    By the way, wasn't there talk of P2 in a BGA recently?
  • Cluso99 Posts: 18,069
    edited 2013-12-08 17:24
    While I think it would be great to be able to change the slot sequence, I don't think that would be good for the Prop's general audience. I would definitely take advantage of it, and perhaps some commercial users would too. So I do not advocate this change.

    Hub slot YIELD & GIFT are great for pairs of cogs. Take the example of a video generator (the output-to-DACs side) and a video engine (i.e. the game engine). These could be written with cog pairing, where the cogs are separated by 4 (i.e. cog 1 with cog 5) so their hub slots are 4 apart. Now, by pairing the cogs, we can get hub access every 4 clocks for the priority cog.

    Now, with some other hopeful change that Chip is implementing (filling aux directly with the wide reads)...
    Cog 1 (video gen) is given the extra slot by cog 5 yielding. Cog 1 fills its AUX line, then yields its slot back to 5 (and via some mechanism, 5 knows it now has the priority), and 5 performs the video updating quicker than normal, knowing how many total clocks it has before it needs to yield the hub back to 1. So here you can see an improvement could be made. Remember, cog 1 has spare time to perform additional non-hub-intensive tasks while the waitvid part is happening.

    Note that no other cog is affected by this method. Cogs 0,2,3,4,6,7 all are still the same, all get 1:8 slot access, all are deterministic.
    Cogs 1 & 5 are also deterministic while in receive (other cog yields) mode. Because you know when you yield and what that other co-operating cog is doing, it is also possible to calculate the timing of the whole thing deterministically. It's just more complex, and quite frankly most likely not necessary, except for the receive (yielded) case, where it would be necessary to fill AUX within a time period for video gen.
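
    A quick picture of the pairing geometry (assuming one clock per hub slot, as in the 8-clock rotation discussed above):

    hub slots:          1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 ...
    cog 1 alone:        *               *                    (its own window every 8 clocks)
    cog 1 + cog 5's
    yielded slot:       *       *       *       *            (a usable window every 4 clocks)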

    Now, let's look at using unused slots...

    There is quite a high likelihood that all you would be doing is taking the next available slot (i.e. minimising your delay in waiting for your turn), and then you would be processing and not require your next turn, so another cog could use it.

    So this is just speeding up your processing; it's not deterministic, and it is not impacting any normal deterministic cog. Effectively this just changes the hub allocation so that the next cog requiring a slot gets the first one free.

    By implementing this the right way, no cog could become a hog and take all the free slots unless they really are available. The implementation would be like this...
    There could be a 'free' register that stores which cog used the last free slot. If a free slot becomes available, then the next cog# after the one in the free register will be offered the slot, and the free register is updated with the cog that just used a free slot.
    This method may have to be re-thought, but I don't really think this is that important. It's more necessary to permit the use of free slots, according to the overall designer's wishes.

    Remember, changing slot priorities, etc., is all about giving the commercial designer access to these slots, since he/she understands what is happening with all the user cogs and slots. Do not stifle the commercial (or systems) user's ability to utilise otherwise unused power in the P2 just because you don't think you would want it or you think it's too complex. Let the ultimate user decide, with all the caveats it comes with. It's a simple change with no risk if not used.
  • Ramon Posts: 484
    edited 2013-12-08 23:37
    jmg wrote: »
    Smaller CPLDs can start from about $1, but if you want something reprogrammable with a PLL, you are up to $6+ and a 100-pin package.

    The PLL barely gets mentioned, but it will be an important differentiator on the P2.

    Yes, and it does that in a TQFP128 package, whilst FPGAs bump quickly to BGA packages.

    Yes, you said it. I wanted a small and cheap CPLD ($1 to $5) just to provide several programmable clock sources. I have read all the datasheets for CPLDs/FPGAs with a PLL (Altera, Xilinx, Actel/Microsemi, Lattice/SiliconBlue), and the PLL is only available on packages with more than 100 pins or on miniature, extremely difficult to solder packages (BGA).
  • Heater. Posts: 21,230
    edited 2013-12-09 00:04
    Baggers,
    ...XMOS..., it wasn't as deterministic as you think: if you wrote something on one core and then added a thread, it would halve the speed of the chip, as it didn't have as many cores as the Prop, and thus throw the timing out...
    This is not correct.

    Firstly the XMOS chips can have up to 4 cores, so far. Whatever code you throw onto a core will not upset the execution rate or timing of the other cores. XMOS cores are the COGs of those devices. Except they don't share RAM so they are even more isolated.

    Secondly, XMOS cores have hardware thread scheduling. The cores have a four-stage pipeline and can complete an instruction in 4 clocks. This means that a 400MHz core, say, can run a thread at 100 MIPS. If you add another thread, they both run at 100 MIPS. Up to four threads will always run at 100 MIPS each. The threads do not get slowed down.

    The hardware scheduler supports up to 8 threads. So yes, when you add that 5th thread the CPU time is now divvied up five ways and each thread is now running at 80 MIPS. And so it goes for 6, 7, 8 threads, until you end up with all threads running at 50 MIPS. This sounds terribly non-deterministic and bad, but...
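
    In other words, for that 400 MHz example the per-thread rate is simply

        MIPS per thread = f_core / max(4, N_threads)

    which gives 100 MIPS for 1 to 4 threads, 80 MIPS for 5, and 50 MIPS for 8, matching the figures above.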

    The XMOS dev tools include a timing analyser. It knows how many threads you have, and it can tell you exactly how long sections of code take to run. You can put assertions about timing requirements into your project, and the compiler will give errors if your code does not meet them. In this way everything is deterministic again. No cycle or instruction counting required.

    Lastly, the XMOS devices, despite having excellent determinism as described, are designed such that you should not be timing things with your code (instruction counting). Rather, the hardware provides clocked I/O. You can set outputs to change at precise moments, and likewise you can set up inputs to be sampled at accurate points in time. With clocked I/O and timers, one programs in an event-based style. Timing is all hardware controlled. This is a much more robust solution than counting instructions and messing around with NOPs.

    Now, one of the best things about XMOS devices is that they inspired me to suggest to Chip that he could relatively easily implement hardware thread scheduling in the P2. You don't think I get these ideas myself do you? :)