multi core processor design suggestion

Chris Micro · 2009-04-27 12:52

Hello everyone,

Now that I've been playing around a little with the Propeller, some ideas have come to mind on how to improve its design a bit.

1. The Propeller has 8 cores, which can be used to implement peripheral functions like an RS-232 or keyboard interface. For this type of interface an 8 bit core would be sufficient. So why not build a Propeller with 4 x 32 bit cores and 16 x 8 bit cores?
2. The Propeller's global memory access is kind of slow. So why not divide the memory system into two separate systems? The memory access speed would double.

I know that these two suggestions come at the cost of some homogeneity, but they gain speed.

chris

Kye · 2009-04-27 13:02

Well, the whole point of building the Propeller was to avoid stuff like that. If you're really pressed for speed, I suggest you write your own software drivers.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nyamekye,

Peter Jakacki · 2009-04-27 13:46

Chris,

When you end up playing with the Prop "a lot more" then you will appreciate it for the way it is. Sure, sometimes you might just need 8-bit power, but then you may as well have specialized hardware such as UARTs etc. Sure, you could split the memory system, but now you need extra hardware and software for the two halves to communicate. The power of the Prop is in its elegant simplicity and in the fact that each cog is identical and all I/Os are identical, etc., because then it's just a matter of software. Anyway, designing and fabricating silicon is a little (a lot) different from PCBs, so it's not just a matter of saying "let's try this". There is this thing called "money" and also "time"; these two are bad news for engineers (the things I'd love to try!).

Chip's a smart cookie and he has thought it through very well. Think of him as your "Zen master" (don't get too big a head, Chip) and meditate on the way it has been done, and you should come to the same conclusion yourself. Sure, there's room for improvement; that's why he is working on Prop II.

*Peter*

MagIO2 · 2009-04-27 14:06

Because everything you suggest makes development much more expensive!

4x32 and 16x8, so you'd have 20 COGs, which decreases HUB RAM access speed because they all have to share access time! It's 2.5 times slower then.
Do you know what a native 8 bit COG means? Completely different handling of the COG internal RAM. Misalignment of the opcodes is bad because it makes decoding of opcodes, source and destination registers more complex.
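
The 2.5x figure follows from round-robin hub sharing. A minimal Python sketch, assuming every cog gets one fixed-length slot per rotation (the 2-clock slot length and the function name are illustrative assumptions, not something stated in this thread):

```python
def hub_wait_clocks(num_cogs, clocks_per_slot=2):
    """Clocks between two consecutive hub slots of the same cog.

    Assumes a simple round-robin hub where every cog gets one slot per
    rotation; clocks_per_slot=2 is an assumed value for illustration.
    """
    return num_cogs * clocks_per_slot


current = hub_wait_clocks(8)     # 8 identical cogs
proposed = hub_wait_clocks(20)   # 4 x 32 bit + 16 x 8 bit cogs
slowdown = proposed / current    # ratio does not depend on the slot length

print(current, proposed, slowdown)
```

Whatever slot length is assumed, the ratio is simply 20/8, which is where the 2.5 comes from.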

Global memory is slower, but not too bad. Even in the worst-case scenario you can have bulk transfers of ~12MB/sec, and did I mention that is "PER COG"? COG to COG communication can be much faster. From my point of view Parallax did a very good job, and keeping the design simple but useful for lots of different needs is one of the benefits.

Dave Hein · 2009-04-27 14:18

I just tried the Propeller for the first time yesterday. It's a nice chip, and I am looking forward to using it in the future. My initial impression is that the Spin language has many similarities with C, and it would have been nice to base it entirely on the C language. This way newbies wouldn't have to learn a new language with its own special quirks.

The other thought I had is that it would be nice to have a hardware multiplier. I'm guessing that multiplies are done in software using a shift-and-add algorithm, which would take about 100 cycles for a 32-bit multiply. Even if only one cog had a multiplier it would greatly improve the speed of the Propeller for doing DSP algorithms.
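
For readers who have not seen it, the shift-and-add approach can be sketched in a few lines of Python. This is only an illustration of the general algorithm, not the code any Propeller tool actually uses; the function name and the returned step count are made up for the example:

```python
MASK32 = 0xFFFFFFFF


def shift_add_multiply(a, b):
    """Multiply two unsigned 32-bit values with shifts and adds only.

    Returns (product, steps), where steps is the number of loop
    iterations (one per multiplier bit).
    """
    a &= MASK32
    b &= MASK32
    product = 0
    steps = 0
    for bit in range(32):
        if (b >> bit) & 1:        # test one bit of the multiplier
            product += a << bit   # add the shifted multiplicand
        steps += 1
    return product, steps


print(shift_add_multiply(12345, 6789))
```

The loop always runs 32 times for a 32-bit multiplier, and on real hardware each iteration costs a few instructions (bit test, conditional add, shift, loop), so the total depends heavily on how the loop is coded or unrolled.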

That being said, I am very impressed by the propeller and the development software.

Dave


Kye · 2009-04-27 17:09

The mul instruction was not completed on the current silicon. It will be there on the Prop 2.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nyamekye,

Chris Micro · 2009-04-27 18:28

Dave Hein said...
Even if only one cog had a multiplier it would greatly improve the speed of the propeller for doing DSP algorithms.

Yes, that's true. A hardware multiplier would enable much faster filter algorithms, and those are used more and more in information theory. But a hardware multiplier would need a lot of chip space, which means more cost.
It could be useful to implement it in only some of the COGs to reduce cost.

Ale · 2009-04-27 19:05

Chris: You can multiply quite fast using an unrolled loop. Of course it is not going to be 2 cycles, but at 20 MIPS it will take almost 2 µs for a 16x16 mul. The AVR will need 14 instructions, and 5 of them are going to be 2-cycle ones (4 muls and 1 movw). Not a lot faster. When going to 32x32, the Propeller will be faster.

You have to see that this processor has very little program memory, so the code has to be broken into pieces that can work independently. You can have several serial interfaces in 1 COG, or keyboard and mouse drivers in 1 COG. The processes that really need high speed are not that many. The filters you talk about can be implemented, especially if your factors are compatible with mul/div by powers of 2.

Mike Green · 2009-04-27 19:52

One of the major design decisions behind the Propeller was that the cogs are identical. The only differences have to do with small propagation delay differences for I/O signals from the cogs and which cog takes precedence if more than one tries to output to the same I/O pin (the ordering of the I/O pin circuitry to each cog). You won't find anything put just into one cog like a multiplier.

There are easy and very fast ways to multiply by a constant using shifts and adds. If the constant is configurable, the code can be generated dynamically.
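
As a rough illustration of the idea (in Python rather than PASM, with hypothetical function names), a constant can be turned into a list of shift amounts once, and multiplying then needs only one shift and one add per set bit of the constant. A code generator for a configurable constant would emit one shift/add pair per entry in that list:

```python
def shift_add_plan(constant):
    """Return the shift amounts whose shifted copies of x sum to x * constant."""
    if constant < 0:
        raise ValueError("this sketch handles non-negative constants only")
    return [bit for bit in range(constant.bit_length()) if (constant >> bit) & 1]


def multiply_by_plan(x, plan):
    """Apply a plan: one shift and one add per set bit of the constant."""
    total = 0
    for shift in plan:
        total += x << shift
    return total


plan = shift_add_plan(10)            # 10 = 0b1010, so the shifts are 1 and 3
result = multiply_by_plan(7, plan)   # (7 << 1) + (7 << 3)
print(plan, result)
```

Constants with few set bits are the cheapest; a constant with many set bits needs correspondingly more adds.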

Chris Micro · 2009-04-28 01:53

Mike Green said...
There are easy and very fast ways to multiply by a constant using shifts and adds. If the constant is configurable, the code can be generated dynamically.

Of course it is possible to speed up the multiply operation with some limitations on the constants. But for flexibility it would be nice to have a fully working multiply operation. For instance, let's assume you want to implement an IIR band pass filter and you want to alter the center frequency and the bandwidth dynamically. How do you do it with fixed constants? For advanced DSP operations multipliers are necessary.

For a while I've been thinking about ways to create a microprocessor with the smallest possible number of gates. As far as I know the first ARMs needed around 30,000 gates, which is not much for a 32 bit processor. It seems to me that the instruction encoding of the Propeller has an ARM-like structure to reduce the cost of the decoding logic.
For me the question is whether the number of gates could be reduced for an 8 bit processor. If we can reduce the number of gates per processor, it is possible to increase the number of processors per chip and increase the MIPS per chip.
In the example above (4x32 bit and 16x8 bit) this would lead to 4x20+16x20=400 MIPS instead of the 170 MIPS the Propeller has.

Post Edited (Chris Micro) : 4/28/2009 1:59:05 AM GMT

mctrivia · 2009-04-28 02:21

You could write a Spin routine that generates the PASM routine for your filter.

But Prop 2 will have the hardware multiply you want.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Need to make your prop design easier or secure? Get a PropMod has crystal, eeprom, and programing header in a 40 pin dip 0.7" pitch module with uSD reader, and RTC options.

potatohead · 2009-04-28 02:40

The disadvantage of the 8 bit idea is the more distributed nature of the compute power, and its uneven distribution. When you introduce these things, it becomes harder to make effective use of the power (too parallel), and where it's best used becomes an increasingly niche thing (parallel and less symmetry).

As the number of parallel nodes rises, the solution complexity does also, unless the problem is one that is easily factored in a parallel fashion. For problems that don't factor this way, which from what I can see is most problems, intercommunication burdens will diminish the returns. In other words, you may find those 400 MIPS delivering less than the 170, after all the communication is done!

No multiply is a bummer. That's the teaser feature for Prop II!

One other thing that was said early on was breaking symmetry results in a kludge. That kludge would expand the scope of solvable problems open to Prop I. However, scaling that design will scale the kludge, thus limiting the scope of potential solvable problems for Prop II. A quick look at the Intel mess tells that story completely. Ugh...

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Propeller Wiki: Share the coolness!
Chat in real time with other Propellerheads on IRC #propeller @ freenode.net
Safety Tip: Life is as good as YOU think it is!

kwinn · 2009-04-28 03:38

I would much rather have 8 32 bit cogs than 32 8 bit cogs or 16 16 bit cogs. Having 32 bit counters and the ability to input/output up to 32 bits at a time makes a lot of control tasks easier.

As for the multiply, it would be nice to have it in hardware in every cog, but even adding it as a hub function would help since multiplies are not that frequent. Have 2 hub registers to write the numbers to be multiplied and then read them back when the operation is done to get the 64 bit result. The first computer I programmed (Collins 8400) worked in a similar manner. The data to be processed was written to one or two registers and the result was read out from another.
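
To picture the programming model being described, here is a small Python model of such a hub multiplier. Everything in it (class name, method names, 32-bit operand width, returning the result as a high/low pair) is a hypothetical sketch of the idea, not an existing or planned Propeller feature:

```python
class HubMultiplier:
    """Hypothetical shared multiplier: write two operands, read a 64-bit result."""

    MASK32 = 0xFFFFFFFF

    def __init__(self):
        self.operand_a = 0
        self.operand_b = 0

    def write_a(self, value):
        self.operand_a = value & self.MASK32

    def write_b(self, value):
        self.operand_b = value & self.MASK32

    def read_result(self):
        """Return the product as (high 32 bits, low 32 bits)."""
        product = self.operand_a * self.operand_b
        return (product >> 32) & self.MASK32, product & self.MASK32


mult = HubMultiplier()
mult.write_a(0xFFFFFFFF)
mult.write_b(2)
high, low = mult.read_result()
print(hex(high), hex(low))
```

In real hardware the cog would also need some way to know the operation is done, and two cogs using the unit at once would have to be arbitrated; the sketch ignores both.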

Brian Fairchild · 2009-04-28 06:28

Surely if you want to do DSP operations you should use a DSP chip?

heater · 2009-04-28 10:38

@Dave Hein: Depends what you mean by "newbies". Someone new to the Prop coming at it from other processors may well love C. People who've never seen a microcontroller before are going to be much happier with SPIN and the SPIN Tool as an intro into programming.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

Dave Hein · 2009-04-28 13:51

heater said...
@Dave Hein: Depends what you mean by "newbies". Someone new to the Prop coming at it from other processors may well love C. People who've never seen a microcontroller before are going to be much happier with SPIN and the SPIN Tool as an intro into programming.

Yes, I have been a C programmer for 20 years, so I am biased toward C. My comment on the Spin language was in the context of "Why create yet another language?". In my view, Parallax should have chosen an existing language and then added some extensions specific to the Prop architecture. I have programmed on a few other embedded processors, and most of them used C as the basis.

I have also programmed the Stamp and the SX. These both use Parallax's Basic language, so it seems like this would have been another obvious choice for the Prop. This would allow Stamp and SX programmers to immediately come up to speed on the Prop. It would also make it easier to port code written for the Stamp or SX to the Prop.

My main issue is that I hate to learn yet another vendor specific language. However, I do like the Prop, and I intend to use it in one of my next projects, so I'm willing to learn the Spin language so I can use the Prop.

Dave

Dave Hein · 2009-04-28 14:16

Brian Fairchild said...
Surely if you want to do DSP operations you should use a DSP chip?

For dedicated DSP applications, it may be best to use a TI DSP, or something similar. However, there are many applications where a multiply instruction of a few cycles would be handy. An 8x8 multiplier does not require a lot of transistors. One could be dedicated to each cog. This would allow single-cycle 8-bit multiplies and 4-cycle 16-bit multiplies. A full 32-bit multiply would take 16 cycles. I haven't timed the current multiplication operation, but it must take around 100 cycles if it uses a shift-and-add type of algorithm.
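
The 4 and 16 figures match the number of 8x8 partial products needed. A short Python sketch, where `mul8x8` is a stand-in for the proposed hardware unit and both function names are made up for illustration:

```python
def mul8x8(a, b):
    """Stand-in for a hardware 8x8 -> 16-bit multiplier."""
    assert 0 <= a < 256 and 0 <= b < 256
    return a * b


def multiply_with_8x8(a, b, width_bits):
    """Multiply two unsigned width_bits values using only mul8x8.

    Returns (product, uses), where uses counts the 8x8 multiplications.
    """
    num_bytes = width_bits // 8
    product = 0
    uses = 0
    for i in range(num_bytes):
        a_byte = (a >> (8 * i)) & 0xFF
        for j in range(num_bytes):
            b_byte = (b >> (8 * j)) & 0xFF
            # Each partial product is shifted to its byte position and summed.
            product += mul8x8(a_byte, b_byte) << (8 * (i + j))
            uses += 1
    return product, uses


print(multiply_with_8x8(0xABCD, 0x1234, 16))
print(multiply_with_8x8(0xDEADBEEF, 0x12345678, 32))
```

The count covers only the multiplier uses; whether the shifts and additions that combine the partial products fit in the same cycles would depend on the hardware design.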

With a small amount of silicon real estate, the Prop could be used in many more signal and image processing applications. There are lots of algorithms that use IIR and FIR filters, DCTs, FFTs and correlation. The Prop has more than enough MIPS to do some interesting DSP functions, but it is limited by the number of multiplies per second.

Dave

SRLM · 2009-04-28 15:03

Dave Hein said...
My main issue is that I hate to learn yet another vendor specific language. However, I do like the Prop, and I intend to use it in one of my next projects, so I'm willing to learn the Spin language so I can use the Prop.

Of course, for the most part it's not the language that really matters, it's the principles. If you learn how to program well in one language, it's fairly easy to learn the syntax of another and use that. Personally, I'm glad that they stayed away from C. Why? So that if I have to learn some other microcontroller, I won't mix up the extensions of one with another.

jazzed · 2009-04-28 15:10

The more languages one learns, the better programmer one becomes. We all have our favorites though.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve

Propalyzer: Propeller PC Logic Analyzer
http://forums.parallax.com/showthread.php?p=788230

BradC · 2009-04-28 15:21

jazzed said...
The more languages one learns, the better programmer one becomes. We all have our favorites though.

I can't second this enough. When all one knows how to use is a hammer.. everything looks like a nail.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
"VOOM"?!? Mate, this bird wouldn't "voom" if you put four million volts through it! 'E's bleedin' demised!

Carl Hayes · 2009-04-28 18:11

BradC said...
I can't second this enough. When all one knows how to use is a hammer.. everything looks like a nail.

I learned that differently.

When all you know is a hammer, everything is a thumb.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
-- Carl, nn5i@arrl.net

Carl Hayes · 2009-04-28 18:37

I'm amazed at the defensiveness of most of these replies. Any computing system's tasks can be broadly divided into two main classes: (1) Input/Output, and (2) Processing.

These classes of tasks have, in general, very different requirements. Designers historically have designed quite different hardware for them. From the days of the 709 (a vacuum-tube mainframe) right through to the newest mainframes and latest PCs, I/O has been handled by specialized processors (channels, in mainframe-talk) designed especially for I/O, and manipulation of data to create new data has been handled by more versatile processors designed for data manipulation. Not doing it that way is rather like maintaining a fleet of heavy tanks, when many trips need only a Suburban or a motorcycle.

For example, in the PC I'm using at the moment there are four general-purpose processors for doing calculations (it's a dual-Xeon server, with each Xeon having two cores). There are also two specialized display processors (VGAs) driving three displays. There is a sound processor. There are two IDE disk processors, two IDE RAID processors, and a SCSI controller. I've probably omitted some others.

That's twelve processors, of which only four are general-purpose. They could have used twelve general-purpose processors instead, but it would have been poor economy and therefore poor design.

So why such defensiveness against the idea that the Propeller's designers, as it evolves into more and more powerful versions, should reexamine the decision to make all processors identical? I don't suggest that the decision should be changed, but it's reasonable to take another look from time to time, rather than reflexively shooting down the idea of taking another look, as some have done in this thread.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
-- Carl, nn5i@arrl.net

Post Edited (Carl Hayes) : 4/28/2009 6:48:47 PM GMT

ericball · 2009-04-28 19:28

@Chris Micro

I'm not sure you'd save a lot of die space going with an 8 bit ALU. Yes, the adder & shifter are 1/4 the size because they handle only 1/4 the bits, but the control logic doesn't scale in the same way.

And how much RAM per 8 bit COG? Ever try to write an 8 bit program in 512 bytes (code+data)?

Not to mention the SPIN interpreter couldn't use the 8 bit COGs, so only 4 SPIN threads.

@Carl Hayes
A dedicated chip will always be more efficient than a general purpose chip, but will only be cost effective if a zillion of them can be made. (Which is why my TiVo costs a tenth of an HTPC and uses a tenth of the power.) But if a zillion aren't required then you either need high-cost custom hardware or low cost general purpose hardware.

And Chip is revisiting his assumptions for PropII. Things like the 64K HUB RAM limitation, and number of cogs. But changes to those assumptions trickle down to implementation details like how COGINIT works and the PASM interpreter.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Composite NTSC sprite driver: http://forums.parallax.com/showthread.php?p=800114

Carl Hayes · 2009-04-28 20:31

ericball said...

A dedicated chip will always be more efficient than a general purpose chip, but will only be cost effective if a zillion of them can be made.

And Chip is revisiting his assumptions for PropII. Things like the 64K HUB RAM limitation, and number of cogs. But changes to those assumptions trickle down to implementation details like how COGINIT works and the PASM interpreter.

If one is comparing the cost of n of one versus n of the other, as here, the dedicated chip will be more cost-effective whether n is a zillion, or only one. But who's talking about dedicated chips? We're talking about different areas of the same chip.

Of course Chip -- is that a nickname and his real name Silicon? or perhaps Wafer? -- of course Chip, being a man of versatile and inquiring mind, will reexamine every assumption and every decision every time as a normal part of his intellectual functioning. It's not Chip who surprised me -- it's the knee-jerk defensive reactions, which certainly didn't, and couldn't, come from Chip.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
-- Carl, nn5i@arrl.net

Post Edited (Carl Hayes) : 4/28/2009 8:38:37 PM GMT

lonesock · 2009-04-28 20:52

Related to the discussion of many generic machines vs specialized machines, of interest to me is the current trend in graphics cards. In the old days we would do all the rasterizing on the CPU, and update a simple bitmap (Mode 13h FTW!). Then the next generation got much more specialized with rasterizing hardware, then transform and lighting hardware, etc., as the "fixed function" started to get faster and more complicated. Then the trend started to shift to programmable hardware (shaders).

Now we have GPUs that are just amazing many-core processors, and we can use them for graphics or for parallel computing (CUDA, OpenCL, etc.). These special purpose cards are getting more general purpose, with the next step in the evolution looking to be similar to the Larrabee project (en.wikipedia.org/wiki/Larrabee_(GPU)), where it's basically a set of 24/32/48 P54C cores (depending on die yield).

So, coming from the graphics programming arena, the Propeller looks like it is ahead of the curve [8^)

Jonathan

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.

Carl Hayes · 2009-04-29 00:01

Yup, Lonesock, history repeats. The IBM 370/158 had, if I remember correctly, as many as 32 independent I/O processors, or channels. Ours had 16. Each channel was said to be, inside, a recycled 360/65, or some such, with different microcode. It may even have been true.

It may have been false, too. There were too many 370/158s, a very popular model, for all the channels to be any particular earlier model of 360 or 370. But the fact that it was believable shows that the difference between special-purpose and general-purpose processors need not be a wide gulf. Still there can be savings, for you can leave stuff out of a specialized processor.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
-- Carl, nn5i@arrl.net

Post Edited (Carl Hayes) : 4/29/2009 12:10:59 AM GMT

pharseid · 2009-04-29 01:00

Between a software multiply and a full 32x32 flash multiplier, there are a lot of intermediate steps. A multiply instruction which simply uses 1 bit of the multiplier per cycle (essentially a 1x32 multiply) is quite a bit better than using adds and shifts. Actually, I think such an instruction would shift the previous partial product and the multiplier, and conditionally add in the multiplicand based on the value of the lowest bit. Booth multipliers which used 2 to 8 bits of the multiplier were intermediate steps to implementing full flash multipliers. So it isn't all or nothing.
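
A multiply-step instruction of the kind described above could behave roughly like the following Python model. This is a sketch of the classic add-and-shift-right scheme under my own assumptions (a 64-bit high/low register pair, unsigned operands, hypothetical function names), not a description of any actual Propeller instruction:

```python
MASK32 = 0xFFFFFFFF


def multiply_step(high, low, multiplicand):
    """One step: conditionally add the multiplicand, then shift the pair right.

    high holds the partial product, low holds the remaining multiplier bits.
    """
    if low & 1:
        high += multiplicand                  # may carry into bit 32
    low = ((low >> 1) | ((high & 1) << 31)) & MASK32
    high >>= 1                                # the carry bit moves into bit 31
    return high, low


def multiply_32x32(multiplier, multiplicand):
    """Unsigned 32x32 -> 64-bit multiply built from 32 multiply steps."""
    high, low = 0, multiplier & MASK32
    multiplicand &= MASK32
    for _ in range(32):
        high, low = multiply_step(high, low, multiplicand)
    return (high << 32) | low


print(multiply_32x32(3, 5))
```

Each step does the work of the bit test, conditional add and shift in one go, which is why such an instruction would beat a plain add-and-shift loop even though it still needs one step per multiplier bit.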

-phar

mctrivia · 2009-04-29 03:10

I think I have seen on other chips (I think it was the R8C Tiny) a 128x PLL, which was used to allow each instruction to execute up to 128 steps in 1 cycle. This method would allow a 1-cycle multiply instruction with almost no extra hardware.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Need to make your prop design easier or secure? Get a PropMod has crystal, eeprom, and programing header in a 40 pin dip 0.7" pitch module with uSD reader, and RTC options.

Post Edited (mctrivia) : 4/29/2009 3:30:42 AM GMT
