How unique is the propeller chip?

Phil Pilgrim (PhiPi) · 2009-10-17 21:05

Well, my Saturday-morning diversion (NY Times crossword) is no less silly, I suppose, assuming a CISC vs. RISC debate is equally capable of staving off dementia.

-Phil

Phil Pilgrim (PhiPi) · 2009-10-17 21:11

Leon,

It's a six-stage pipeline, resulting in a one-instruction per system clock (i.e. four crystal or PLL clocks — not quite the same, I know) throughput. The key here is that the instructions (except those requiring wait states or flushing the pipe) are neither variable-length nor microcoded and execute (exclusive of fetches, decodes, and stores) in one clock period.

-Phil

Post Edited (Phil Pilgrim (PhiPi)) : 10/17/2009 9:21:30 PM GMT

potatohead · 2009-10-17 22:18

Phil, perhaps that is all it comes down to in the end. Use it or lose it.

My other guilty pleasure besides the prop is online debate on a variety of matters, with a small crowd, in a small online venue. We meet up a few times per year to catch up "in the flesh", but otherwise gather semi-regularly during the week to advance our state of understanding the body politic, religion, tech issues, music, etc... We've been at it for years, and still after all the time, will easily surprise one another.

Sure beats the tube. :)

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Propeller Wiki: Share the coolness!
Chat in real time with other Propellerheads on IRC #propeller @ freenode.net
Safety Tip: Life is as good as YOU think it is!

pharseid · 2009-10-17 22:46

A couple things:

Texas Instruments made a chip in the late 90's with one RISC and 4 or 5 DSP's. I really doubt this is still in production, but I think they should get some credit for the pioneering effort. The Prop vs. XMOS is a pretty good illustration of multiprocessing vs. parallel processing. Prop cogs interact via shared memory, whereas XMOS tiles (or whatever they call them) interact via com channels. I would think multiprocessing is more efficient for closely coupled processes, but parallel processing would expand to multichip systems easier. A language like Occam could handle both types of concurrency, assigning new processes to different processors when a process forks and also directly handling com channels. So I would think the ultimate concurrent hardware to support that has yet to be made.

There were some huge SIMD machines made in the late 80's. The basic units of the Connection Machine and NASA's Massively Parallel Processor both had 16K processor elements. Both integer machines. The Connection Machine could be expanded to 64K PE's, but I don't remember if the MPP was expandable. Since you only have one control unit for the whole machine, it's a very efficient use of hardware for problems with huge amounts of data parallelism, but really not much help for pipelining or agenda parallelism. I think it would be fun to implement a SIMD machine on an FPGA, maybe to use as a graphics processor or for neural networking, but am personally too lazy to learn a new skill just for that.

-phar

Leon · 2009-10-18 10:58

XMOS threads (eight per core) share memory. They are implemented in hardware and look like separate processors to the programmer - each has its own register set. Switching between threads takes one clock. The XMOS XC language handles both threads and links, of course.

Another interesting parallel-processing machine from the 1970s was the ICL DAP:

en.wikipedia.org/wiki/ICL_Distributed_Array_Processor

I remember seeing one at Southampton University around 1985. They were more into transputers by that time, though.

Leon

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Amateur radio callsign: G1HSM
Suzuki SV1000S motorcycle

Post Edited (Leon) : 10/18/2009 11:28:47 AM GMT

localroger · 2009-10-18 14:18

A truly unique aspect of the Prop's architecture that is often overlooked is that Chip has gone where others fear to tread in the direction of less is more. The cogs aren't limited to 512 longs just to fit 8 of them on the die; that limit makes it possible for a single instruction to specify the source, destination, operation, and a bunch of conditions all in one long. This makes a cog instruction comparable to two to four instructions on a more typical 32-bit platform, which makes PASM both much faster and much tighter than other assembly languages. For jobs which are bound by computation rather than data storage, this more than doubles the effective clock speed and cog RAM size.
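To make the "everything in one long" point concrete, here is a minimal Python sketch that splits a 32-bit cog instruction into its fields. The field widths follow the published Propeller 1 instruction format (6-bit opcode, 4 effect/immediate bits, 4-bit condition, 9-bit destination, 9-bit source); the class and function names are this sketch's own, and the example opcode value is only illustrative.

```python
from typing import NamedTuple


class CogInstruction(NamedTuple):
    opcode: int     # bits 31..26 (6 bits): the operation
    zcri: int       # bits 25..22 (4 bits): write-Z, write-C, write-result, immediate flags
    condition: int  # bits 21..18 (4 bits): flag condition under which it executes
    dest: int       # bits 17..9  (9 bits): destination address in cog RAM
    src: int        # bits 8..0   (9 bits): source address, or a literal 0..511


def decode(word: int) -> CogInstruction:
    """Split a 32-bit instruction long into its five fields."""
    return CogInstruction(
        opcode=(word >> 26) & 0x3F,
        zcri=(word >> 22) & 0xF,
        condition=(word >> 18) & 0xF,
        dest=(word >> 9) & 0x1FF,
        src=word & 0x1FF,
    )


def encode(ins: CogInstruction) -> int:
    """Pack the five fields back into one 32-bit long."""
    return (
        ((ins.opcode & 0x3F) << 26)
        | ((ins.zcri & 0xF) << 22)
        | ((ins.condition & 0xF) << 18)
        | ((ins.dest & 0x1FF) << 9)
        | (ins.src & 0x1FF)
    )


# Illustrative values only: write the result, treat src as the literal 3,
# execute unconditionally, destination at cog address 0x010.
example = CogInstruction(opcode=0b100000, zcri=0b0011, condition=0b1111, dest=0x010, src=3)
word = encode(example)
print(f"{word:08X}", decode(word))
```

The two 9-bit address fields are what tie the instruction format to the 512-long cog RAM: 6 + 4 + 4 + 9 + 9 is exactly 32 bits, so a larger address space would not fit in one long.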

This is probably the single boldest statement Chip Gracey made with the Prop; other CPUs are offered with limited memory, but I've never seen one where that memory limit was used so effectively to create such a powerful set of features. And another way of looking at it is that the Prop spends most of its time running one program, the Spin interpreter, so cog capabilities are finely tuned to the needs of that particular program. Other than that and some obvious I/O functions like I²C for the EEPROM and video generation, it seems the fact that PASM is capable of doing anything at all other than running the Spin interpreter and displaying "Hello World" is what we in Louisiana call "lagniappe."

And cogs would most definitely be considered RISC according to the people who invented RISC. The acronym stands for Reduced Instruction Set Computer, and there were several "features" people like John Cocke noticed either didn't get used by compilers or were implemented such that they ran slower than routines based on less complex instructions:

Lots of registers, particularly special purpose registers such as indexing registers, spent a lot of time and code being loaded and unloaded.
Convenience instructions such as string copy were often used inappropriately both by human programmers and compilers in situations where the code to load their special registers made them slower than direct memory manipulation
Complex addressing modes were often not used at all by compilers

Obviously the P8X32 has none of these "features." (Indeed, it pleases me to contemplate the speed at which Dijkstra must be spinning in his grave over the fact that the Prop requires self-modifying code to do even the simplest indirect addressing.) It also has a non-anti-feature introduced for RISC designs, the ability to specify a short constant within the instruction word. CISC designs tend not to provide for this, meaning you burn 16 or 32 bits of storage and a memory read cycle to get a constant like 3 into the ALU.
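For anyone who hasn't seen the self-modifying trick, here is a toy Python model of it, not real PASM. It assumes only that the source address sits in the low 9 bits of the instruction long and the destination in the next 9; the helper names, the cog addresses, and the one-instruction "executor" are all hypothetical. To read a table indirectly, the program rewrites the source field of a fetch instruction and then runs that instruction.

```python
SRC_MASK = 0x1FF   # low 9 bits of the instruction: source address
DEST_SHIFT = 9     # next 9 bits: destination address


def patch_source(word, address):
    """Overwrite the 9-bit source field, leaving the other bits alone."""
    return (word & ~SRC_MASK) | (address & SRC_MASK)


def run_mov(cog, pc):
    """Toy executor: treat cog[pc] as 'mov dest, src' and copy cog[src] to cog[dest]."""
    word = cog[pc]
    dest = (word >> DEST_SHIFT) & 0x1FF
    src = word & SRC_MASK
    cog[dest] = cog[src]


def sum_table(cog, table_start, count, fetch_pc, temp):
    """Sum cog[table_start : table_start + count] using only direct addressing."""
    total = 0
    for i in range(count):
        # Step 1: modify the instruction itself so it points at the next entry.
        cog[fetch_pc] = patch_source(cog[fetch_pc], table_start + i)
        # Step 2: execute the freshly patched instruction.
        run_mov(cog, fetch_pc)
        total += cog[temp]
    return total


cog = [0] * 512                    # one cog's worth of longs
cog[100:104] = [10, 20, 30, 40]    # a small data table (addresses are arbitrary)
FETCH, TEMP = 0, 50                # hypothetical instruction and scratch locations
cog[FETCH] = TEMP << DEST_SHIFT    # 'mov TEMP, <placeholder>' template
print(sum_table(cog, 100, 4, FETCH, TEMP))
```

Instructions and data live in the same 512 longs, which is why the "pointer" here is just a field inside another long.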

The Prop's single instruction pipeline is not an inherently non-RISC feature, particularly since it maintains deterministic timing. A CISC pipeline would be like the very CISC-y 8086/8088, which made a bold statement in the opposite direction by completely separating the fetching of instructions from the process of executing them. This increases the amount of work wasted when a jump occurs and as the pipeline grows longer gets you into all kinds of branch prediction weirdness, and it means you never really know how much time an instruction will take to execute. When the x86 brought these techniques to microcomputers for the first time it messed up a lot of real-time programming methods which had been perfected based on cycle timing, particularly for game programming.

Phil Pilgrim (PhiPi) · 2009-10-18 17:53

One thing to keep in mind with the Prop is that it's a von Neumann architecture. As such, the instruction, source and destination operands, and result all exist in the same memory space. This means that every instruction requires four accesses to this memory. With the Prop I's single-port memory, this dictates a four-step process, and that's where the four ticks per instruction come from. (The instruction decode and execution pipeline steps overlap with other operations.) The Prop II's throughput will be one instruction per tick, which makes the use of single-port RAM impossible. Consequently, it will use a four-port RAM structure so that all four accesses (3 reads and one write) can occur simultaneously.
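The arithmetic behind that can be sketched in a few lines of Python. The 80 MHz clock is an assumption chosen for illustration (it does not come from this thread), and the same clock is used for both cases purely so the two memory structures can be compared.

```python
# The four cog-RAM accesses each instruction needs, per the description above.
ACCESSES = ("fetch instruction", "read source", "read destination", "write result")

CLOCK_HZ = 80_000_000  # assumed system clock, for illustration only


def instructions_per_second(clock_hz, ticks_per_instruction):
    """Throughput when each instruction occupies a fixed number of clock ticks."""
    return clock_hz / ticks_per_instruction


# Single-port RAM: one access per tick, so four ticks per instruction.
single_port = instructions_per_second(CLOCK_HZ, len(ACCESSES))

# Four-port RAM: all four accesses in the same tick, so one tick per instruction.
four_port = instructions_per_second(CLOCK_HZ, 1)

print(f"single-port: {single_port / 1e6:.0f} MIPS per cog")
print(f"four-port:   {four_port / 1e6:.0f} MIPS per cog")
```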

-Phil

pharseid · 2009-10-18 20:02

I'm not knocking the Prop's instruction set, but it's hardly unique in packing functionality into a single instruction. Short literals and direct page addresses have been used forever and the ARM has had conditional instruction execution for a couple decades. I think my all-time favorite was Analog Devices 16-bit DSP model. It had instruction formats that could pack 3 assembly language instructions into a single 24-bit opcode and these weren't particularly simple instructions either. An ALU op might entail 2 data memory reads and a memory write in addition to 2 register reads and a register write. Which could occur concurrently with a shifter op and a multiplier-accumulator op or a program control op. If you could fit a program segment into its queue, you could run short indexed loops with zero overhead. It was sort of a long instruction word processor with a short instruction word. I would give an honorable mention to Charles Moore's Novix chip, which was designed to directly execute the core instructions of the Forth language. It could do an ALU op, a stack manipulation, include a short literal and do an optional subroutine return in the same instruction.

-phar

Leon · 2009-10-18 20:12

XMOS has two-register-with-immediate instructions (three operands), such as ADDI, that are encoded in a 16-bit word.

Leon

Post Edited (Leon) : 10/18/2009 8:19:22 PM GMT

localroger · 2009-10-18 21:10

pharseid: If you want to get like that, well the first computer I ever had access to (granted I was 10 years old) was an HP2100A, which had a number of instruction packing tricks up its 16-bit core memory sleeve, but had no way to specify a memory address as a subpart of a word. Also, IIRC the DEC PDP-10 had a page of "special" RAM at the base of its address space that was both faster than core and could be addressed by special "address as subpart of word" instructions. But...

Leon: I'm aware of processors with instructions such as you mention, but the trick is that they depend on registers, which are treated differently in fundamental ways from program and data main storage. The Prop doesn't have registers, unless you want to say all 512 longs of cog RAM are registers, which really doesn't make much sense.

The very fact, in fact, that people argue about whether the longs of cog RAM are "memory" or "registers" is telling. I can't think of any other CPU's which inspire such arguments. The Prop is more von Neumann than anything even von Neumann probably ever imagined; every instruction storage location is a register, and can even be the special accumulator register for math functions. Nothing else can do that because doing that requires that you pare down the size of RAM to what is normally considered a ridiculously small address space. Yet when you do it -- who would have guessed? -- the result is surprisingly versatile.

Toby Seckshund · 2009-10-18 21:38

I seem to remember the plant at Newport, South Wales, that made the transputers. It was right by the nearest motorway junction within Wales, an alien-looking building for that time and place. As soon as the minimum period had passed before penalties applied, the place was shut down. It made a sad signpost to the future of that area, and now the steel has gone too.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Style and grace : Nil point

waltc · 2009-10-18 22:40

Speaking of instruction packing...

Ultratechnology's F21 was a Forth processor created by Jeff Fox (with an assist from Chuck Moore) that never got out of the prototype stage because of corporate politics. It packed 5 instructions in a 20 bit word. Forth was its native instruction set, with 2ns execution time. It had its own video co-processor, analog and network co-processors, plus an external memory bus.

And this was back in 1998 and prototypes sold for around $20.00. Very elegant and innovative as hell, haven't seen anything like that since in the micro-controller world.

Phil Pilgrim (PhiPi) · 2009-10-19 00:27

Walt,

This would imply that the core of Forth, from which all other words can be composed, consists of only 16 instructions. Do you happen to recall what those 16 basis instructions were? Or was one or more of them an escape to a larger instruction set?

-Phil

Clock Loop · 2009-10-19 00:32

This post was to compare the prop to the micro-controller world.

HollyMinkowski · 2009-10-19 00:43

Multi core with both USB and ethernet would be wonderful.

USB and ethernet have become almost indispensable.

USB, ethernet, 40+ i/o pins, 300mips+, as much ram and flash as possible.
Would really cut down on the parts count and square inches of board space.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
"Where am I? Where am I going? Why am I in a handbasket?"

whicker · 2009-10-19 00:47

Clock Loop said...
This post was to compare the prop to the micro-controller world.

What's fun for me regarding healthy internet forums is how the discussion wanders toward the end.
People have tried, but there's really no stopping it. :) I've been enjoying the flow. 8)

pharseid · 2009-10-19 01:12

F21

Code  Name  Description                           Forth (with a variable named A)
00    else  unconditional jump                    ELSE
01    T0    jump if T0-19 is false w/ no drop     DUP IF
02    call  push PC+1 to R, jump                  :
03    C0    jump if T20 is false                  CARRY? IF
06    RET   pop PC from R (subroutine return)     ;
08    @R+   fetch from address in R, increment R  R@ @ R> 1+ >R
09    @A+   fetch from address in A, increment A  A @ @ 1 A +!
0A    #     fetch from PC+1, increment PC         LIT
0B    @A    fetch from address in A               A @ @
0C    !R+   store to address in R, increment R    R@ ! R> 1+ >R
0D    !A+   store to address in A, increment A    A @ ! 1 A +!
0F    !A    store to address in A                 A @ !
10    com   complement T                          -1 XOR
11    2*    left shift T, 0 to T0                 2*
12    2/    right shift T, T20 to T19             2/
13    +*    add S to T if T0 is true              DUP 1 AND IF OVER + THEN
14    -or   exclusive-or S to T                   XOR
15    and   and S to T                            AND
17    +     add S to T                            +
18    pop   pop R, push to T                      R>
19    A     push A to T                           A @
1A    dup   push T to T                           DUP
1B    over  push S to T                           OVER
1C    push  pop T, push to R                      >R
1D    A!    pop T to A                            A !
1E    nop   delay 2ns                             NOP
1F    drop  pop T                                 DROP
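
Since the codes in the table above run from 00 to 1F, each opcode needs 5 bits, and four of them fill a 20-bit word. Here is a minimal Python sketch of that packing. The slot order (first instruction in the most significant bits) is an assumption, and the sketch ignores whatever encoding details the real hardware used; the opcode values are taken from the table.

```python
# A few opcodes copied from the table above.
OPCODES = {0x1A: "dup", 0x17: "+", 0x1E: "nop", 0x06: "RET"}

SLOTS = 4          # instructions per word
BITS_PER_OP = 5    # 32 possible opcodes


def pack(ops):
    """Pack four 5-bit opcodes into one 20-bit word, first opcode in the top slot."""
    if len(ops) != SLOTS:
        raise ValueError("expected exactly four opcodes")
    word = 0
    for op in ops:
        if not 0 <= op < (1 << BITS_PER_OP):
            raise ValueError(f"opcode {op:#x} does not fit in 5 bits")
        word = (word << BITS_PER_OP) | op
    return word


def unpack(word):
    """Split a 20-bit word back into its four 5-bit opcodes."""
    return [(word >> shift) & 0x1F for shift in (15, 10, 5, 0)]


word = pack([0x1A, 0x17, 0x1E, 0x06])   # dup, +, nop, RET
print(f"{word:05X}", [OPCODES[op] for op in unpack(word)])
```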

Phil Pilgrim (PhiPi) · 2009-10-19 03:49

Ah, I read that wrong. It's four instructions of five bits, not vice-versa. Thanks!

-Phil
_________________

This has been a Special Bulletin. Now back to Clock Loop with our regularly-scheduled programming...

pharseid · 2009-10-19 18:30

I thought Waltc was implying 4-bit instructions too, which got me to checking that. Plus you have a few unused opcodes to play with.

-phar
