
Propeller II update - BLOG


Comments

  • evanh Posts: 15,356
    edited 2012-07-13 06:51
    ... unlike the Prop I, which has a von Neumann architecture ... -Phil

    The Propeller is not Von Neumann. Code fetching from the register set is pretty out there as a non-normal architecture. And the Hub can be architecturally included due to the way the Cogs start up and the desperate need for significant store, making it even weirder.
  • evanh Posts: 15,356
    edited 2012-07-13 06:59
    Not to mention implementing a RISC-like instruction set while at the same time not using pipelining - which is the main feature of RISC.
  • Mike Green Posts: 23,101
    edited 2012-07-13 07:03
    The Propeller is very much Von Neumann. It has a single memory for both instructions and data (cog memory) for each of the 8 separate processors. The hub memory is treated as an I/O device with special instructions dedicated to it. The Spin interpreter also emulates a Von Neumann architecture with both code and data in the same memory (hub) and effectively no access to the cog memory used by the interpreter. Look at the Wikipedia description of the Von Neumann architecture.

    RISC instruction sets do not require pipelining. The main feature of RISC is that instructions are simple (see the Wikipedia again). The Prop I is pipelined.

    Unfortunately, the cogs' memory has been described as "registers". From an architectural standpoint, they're memory, and the Prop has a memory-to-memory, two-address instruction set (destination and source) with one address acting as both source and destination for 3-operand operations.
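
    A minimal PASM sketch of what that looks like (the symbol names are purely illustrative): every operand field addresses cog memory, and the destination address is read as well as written:

            add     total, delta        ' total := total + delta (dest is also a source)
            sub     total, count    wz  ' same two-address form, optionally updating Z

    total   long    0
    delta   long    5
    count   long    3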
  • evanh Posts: 15,356
    edited 2012-07-13 07:09
    The Cog's RAM is very much registers in the pure sense. Each one has full accumulator functionality, in as much as a non-pipelined processor can provide it.

    The Hub Load and Store instructions are dedicated to filling the registers before manipulation, just like original RISC. That leads to the same sort of setup as Harvard, with instruction fetches coming from a separate memory space than the one the Load/Store instructions address. It could be argued that the Prop architecture is just as much Harvard as it is Von Neumann.
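
    For instance, a minimal PASM sketch of that load/modify/store pattern (symbol names are illustrative; hubaddr would be set up elsewhere):

            rdlong  value, hubaddr      ' load a long from hub RAM into a cog register
            add     value, #1           ' manipulate it entirely within the cog
            wrlong  value, hubaddr      ' store the result back out to hub RAM

    value   long    0
    hubaddr long    0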

    PS: The whole point of RISC was pipelining. But sure, there is no denying the way the instruction set is constructed. With pipelining, and the resulting multiported registers, outlandish instructions became a burden to implement, not to mention superfluous given the speed of the generic set. And the improvements in fetching and word alignment and the like gained from having fixed-length instructions just made the whole lot fit like a glove. No need for byte-wide accesses to main memory, for example; do that in the registers instead ... But the nasty little side-effect was bloat, code bloat in particular.
  • Heater. Posts: 21,230
    edited 2012-07-13 07:35
    evanh,

    I think the Prop most certainly is Von Neumann.

    I propose you are looking at it backwards.

    Instead of seeing it execute code from its register set, I see it as a zero-register design: all instructions are memory to memory, and it executes code from good old-fashioned memory space which happens to be limited to 512 longs.

    Now of course there are registers: there must be a program counter, and there may well be an accumulator or such in there, but they are hidden from the instruction set.

    The HUB does not enter into it. That is a device external to the COG, which the COG accesses via special I/O instructions, [rd|wr][long|word|byte].

    I suspect the first Von Neumann machine, the Manchester Mark I, was also COG-like with no registers, as it only had a 96-word memory.

  • 4x5n Posts: 745
    edited 2012-07-13 07:37
    The whole RISC vs CISC thing was a long time ago and I admit I didn't follow it all that closely but I thought the debate was over the size of the instruction set. The idea of RISC (as I remember the debate) was to keep the number of instructions low so that there didn't need to be multi-byte instructions. The idea was that single byte instructions could be loaded in fewer cycles and the increase in instruction execution rate would overcome needing to use more instructions in the code.

    Do I remember it wrong? Like I said, I wasn't paying that much attention to the debate at the time. I was too busy writing machine control code for the 6809 in assembly! :-)
  • Heater. Posts: 21,230
    edited 2012-07-13 07:45
    evanh,

    The Prop is certainly not Harvard architecture.

    The core of the Harvard architecture design is that instructions and data live in different address spaces.
    In Harvard, instructions and data are different things in different places. In Von Neumann, instructions are data.

    As a result early Harvard designs could not boot themselves, as a program could not read/write its own program store space, only the data space. Rather like having an MCU running code from FLASH with data in RAM and it not being able to reprogram its own FLASH.

    Clearly the Prop is not Harvard: it can load its own code and overwrite any code in COG. In fact self-modifying code is an essential trick in PASM programming.
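
    For example, the common movs idiom patches the source field of an instruction before it runs (a minimal sketch; buffer and value are illustrative). Note that on the Prop I the patched instruction must not be the very next one to execute, hence the spacer:

            movs    :fetch, #buffer     ' overwrite the source field of the :fetch instruction
            nop                         ' spacer: the cog fetches ahead of the patch
    :fetch  mov     value, 0-0          ' 0-0 is the conventional placeholder for a patched field

    value   long    0
    buffer  long    0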
  • Heater. Posts: 21,230
    edited 2012-07-13 07:53
    evanh,
    The Cog's RAM is very much registers in the pure sense. Each one has full accumulator functionality, in as much as a non-pipelined processor can provide it.

    Choosing to view those 512 LONGs as registers or RAM makes no difference to the classification as Von Neumann or Harvard.

    At the end of the day, does it matter what we call it? It does what it does and that's it.
  • evanh Posts: 15,356
    edited 2012-07-13 07:57
    4x5n wrote: »
    The whole RISC vs CISC thing was a long time ago and I admit I didn't follow it all that closely but I thought the debate was over the size of the instruction set.

    Wasn't really anything to do with the instruction set as a whole. It was about implementation of the instructions themselves. To make use of pipelines one has to use the data and address paths that are plentiful inside the processor core. To do this a lot of the "complex" instructions become redundant as the "simpler" ones will do the same job just as fast. But at a code size cost.
  • Mike Green Posts: 23,101
    edited 2012-07-13 08:14
    The RISC vs CISC discussion arose from a period in computer design where instruction sets had become more complex and microcoded processors became a standard way to implement them. The IBM 360 line was a good example of this. The internal design was a Harvard architecture (separate instruction and data memory), with the (microcoded) instructions coming from a ROM and the emulated instructions and data coming from a magnetic core RAM. The microcode was essentially a RISC machine with long words, and each field of the instruction directly controlled some piece of logic, whether a selector, counter, arithmetic logic or whatever.

    As high speed RAM became available and cheaper, a number of groups started research into configurable machines where the microcode could be loaded on reset and the machine could emulate any of several instruction sets. The experimenters found that they could code directly in the RISC and, as larger fast RAMs became available, developed fast RISC machines with Von Neumann architectures letting the compilers available at the time do the optimizations that reduced the code bloat.
  • Heater. Posts: 21,230
    edited 2012-07-13 08:21
    re: CISC/RISC

    It seemed to me that as microprocessors were evolving, a lot of programming was still being done in assembler. A lot of it was being done by hardware engineers who were moving their logic designs over to software. As a result, processor designers were throwing in instructions to make life easier for the assembler programmer. The classic case is the DAA instruction (Decimal Adjust for Addition) in the 8080 that made doing BCD arithmetic easy and fast. Then there was the Z80 that was 8080 compatible but tripled the size of the instruction set by adding a lot of things like bit setting, clearing, testing etc. Useful if you are twiddling bits in peripheral registers and such.

    Turned out that a lot of those instructions were never or rarely used. For example most CP/M programs do not require a Z80 to run, they only use the original 8080 opcodes.

    Then we move into the world of high level language programming and find that compiler writers don't get their compilers to generate code that uses a wide range of instructions. It's just too hard to do. Most of the convenience instructions are unused in compiled code.

    The modern x86 is still carrying around DAA and such for no useful purpose.

    So there comes the RISC idea: why not throw away all those unused or rarely used instructions and use the silicon space saved for speed-ups elsewhere, like pipelining or whatever. It helps simplify things if we can make all the operations the same size, so no more one-byte, two-byte, three-byte, etc. instructions.

    Seems that over time this sort of simple distinction has been muddied though.
  • evanh Posts: 15,356
    edited 2012-07-13 08:59
    Microcode is fully decoded, RISC instructions are not.


    Looking at the MM1, http://en.wikipedia.org/wiki/File:MM1Schematic.svg there is a clear-cut accumulator that holds all results from the ALU. Presumably the A register is instruction addressable. Main memory does not appear to be the register set. Or maybe more importantly, main memory can only be one data source.

    Edit: Actually, looking a bit harder, it looks like instruction fetches are coming, via the B register, from the drum store rather than the main memory. I'd almost call that Harvard! Nah, doesn't make sense, must be another undocumented instruction path.
  • Beau Schwabe Posts: 6,552
    edited 2012-07-13 09:05
    Propeller II Status:
    The synthesized logic (inner core) has gone through several iterations to meet timing, as well as to improve functionality. This section is mostly driven by Chip; however, where I come in is connecting all of my existing Power and Ground to the power grid structure of the synthesized logic. Some of the improvements were to address Power/Ground demands due to IR drop (IR drop is basically voltage drop due to current demands over the wire resistance, based on the length and thickness of the wire) ...so the placement of Power and Ground strapping was not completely defined. Now that Power and Ground have been finalized I am currently traversing the perimeter of the synthesized logic and making the Power and Ground connections. In the meantime I expect the synthesized logic to be completed. FYI: When the synthesized logic is changed, it takes about 2 hours to run a script on it to sync all of the stream-out layers. After that it's just a matter of changing the name of the instantiation from the old (previous version) to the newest version.


    Mini Contest:
    To date I have had 20 entrants between E-mail and answering in the forum. I would like to see a larger pool of entrants, so if you still want to throw your guess my direction, please use the E-mail address provided in the link below. Since I didn't set a date, let's give this contest until August 31st. At that point I will call a winner and release the actual numbers.

    http://forums.parallax.com/showthread.php?125543-Propeller-II-update-BLOG&p=1096024&viewfull=1#post1096024
  • potatohead Posts: 10,260
    edited 2012-07-13 09:08
    Unfortunately, the cogs' memory has been described as "registers". From an architectural standpoint, they're memory, and the Prop has a memory-to-memory, two-address instruction set (destination and source) with one address acting as both source and destination for 3-operand operations.

    Thanks for saying this, it's one of the reasons I think the Propeller is beautiful.
  • Mike Green Posts: 23,101
    edited 2012-07-13 09:10
    The MM1 as well as many other designs after it had an accumulator. Other machines may have had two or more for special uses. The accumulator was implied in the instruction, not addressable. Early computers did not have addressable (numbered) registers.

    Microcode was generally fully decoded, but not always. RISC instructions are often fully decoded. The Prop1 with a few exceptions ... mostly the hub instructions ... is fully decoded. All the various fields are used for the same purpose for essentially all instructions, except for some hub instructions. The hub instructions cause the processor to stall until the hub logic can take care of the operation.
  • evanh Posts: 15,356
    edited 2012-07-13 09:19
    Mike Green wrote: »
    The MM1 as well as many other designs after it had an accumulator. Other machines may have had two or more for special uses. The accumulator was implied in the instruction, not addressable. Early computers did not have addressable (numbered) registers.
    That's what I meant by addressable. I.e., the ALU result stops at the A register and another instruction is executed to use it; it's not just a stage in the completion of the previous instruction.
    The Prop1 with a few exceptions ... mostly the hub instructions ... is fully decoded.
    Okay, you'd better show me this one.

    EDIT: I suspect there is probably a difference in definition of what constitutes fully decoded here. By my definition I know RISC is not fully decoded because the instructions are binary encoded. There is still a need for a large amount of logic to generate all the combinations of block selectors.
  • evanh Posts: 15,356
    edited 2012-07-13 09:25
    FYI: When the synthesized logic is changed, it takes about 2 hours to run a script on it to sync all of the stream-out layers. After that it's just a matter of changing the name of the instantiation from the old (previous version) to the newest version.

    Meaning? Any new logic changes/features added don't need any hand layout to have a finished product?
  • Mike Green Posts: 23,101
    edited 2012-07-13 10:20
    Fully decoded ... Yes, that depends on your definition. In "the old days" where there were discrete gates, that may have meant a single line from the instruction register for each controlled gate in the instruction logic. Nowadays, there are functional units that do their own internal decoding and the internal decoder is not accessible externally and may not really have a separate physical existence. The Prop's arithmetic unit(s) may illustrate this. The Propeller's barrel shifter is one major hunk of logic in that it has to handle 32 possible cases times maybe 8 or more shift varieties (SHL, SHR, ASL, ASR, ROL, ROR, REV, ?), all 33 wide.

    The immediate bit applies to all instructions. The conditional execution bits apply to all instructions. The NR/WC/WZ bits apply to essentially all instructions where there's a resulting value or flag value. There's a barrel shifter, so the source field is used to drive the barrel shifter logic along with part of the instruction field. The exceptions mostly are the hub instructions which are handled partly by the hub logic while the cog is stalled.
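
    A short PASM sketch of those bits in action (values are illustrative):

            cmp     a, b        wc,wz   ' update C and Z; cmp never writes a result
    if_z    mov     x, #1               ' executes only if Z was set
    if_nc   add     x, #4               ' executes only if C was clear
            add     a, b        nr,wz   ' do the add just for its Z flag; NR discards the sum

    a       long    10
    b       long    10
    x       long    0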
  • evanh Posts: 15,356
    edited 2012-07-13 10:47
    Hmm, the first six bits, the opcode, is what I'm referring to.

    Without delving into deeper encodings, there are 64 combinations of those six bits which, in turn, decode out to various block selectors that perform data fetches (using the source and destination fields as register addresses) and then a myriad of possible ALU paths, all via individual selectors (decoded bits). And a possible final step of writing back to the destination; this step is also part of the decoded instruction.
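
    A minimal PASM sketch of pulling those fields apart, using the published Prop I layout INSTR[31:26] ZCRI[25:22] CON[21:18] DEST[17:9] SRC[8:0] (symbol names are illustrative):

            mov     opcode, opword
            shr     opcode, #26         ' the 6-bit opcode field: one of 64 combinations
            mov     dfield, opword
            shr     dfield, #9
            and     dfield, #$1FF       ' 9-bit destination register address
            mov     sfield, opword
            and     sfield, #$1FF       ' 9-bit source register address

    opword  long    0                   ' instruction word under inspection, filled in elsewhere
    opcode  long    0
    dfield  long    0
    sfield  long    0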
  • Mike Green Posts: 23,101
    edited 2012-07-13 10:54
    As I said, the accumulator is not addressable. It's not selected as part of the instruction. It's implied by the hardware design and the operation selected by the opcode. The fact that there's a register called an accumulator that's part of the arithmetic unit does not make it addressable. Because most smaller and/or older computers don't have multi-ported memory, the ALU result has to be held somewhere so it can either be used in the next operation or written to memory at another time when the memory is not being used to read an operand. Even a multiported memory has to have some way to handle the case where the same location is being written by one port while a second (or third) port is reading the same location.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2012-07-13 11:02
    The Propeller instruction set can almost be thought of as microcode in its own right. At least that's the way I like to think of it. It's rare, after all, that you get to choose which flags get altered, whether the result gets written, or whether any instruction can be skipped conditionally -- all based upon bits in the instruction. Those bits basically control gates in similar fashion to the way microcode controls a microcoded processor's logic, one microstep at a time. It's not a perfect metaphor, of course, but the Propeller's instruction set is a lot "closer to the metal" than those of most other micros. And that's very empowering for PASM programmers.

    -Phil
  • evanh Posts: 15,356
    edited 2012-07-13 11:15
    The Propeller instruction set can almost be thought of as microcode in its own right. At least that's the way I like to think of it. It's rare, after all, that you get to choose which flags get altered, whether the result gets written, or whether any instruction can be skipped conditionally -- all based upon bits in the instruction. Those bits basically control gates in similar fashion to the way microcode controls a microcoded processor's logic, one microstep at a time. It's not a perfect metaphor, of course, but the Propeller's instruction set is a lot "closer to the metal" than those of most other micros. And that's very empowering for PASM programmers.

    Good try, and from the Hub/Spin/LMM/XMM view I'd 100% agree. :)

    But from a Cog viewpoint, no, Cogs are not that fancy as far as processors go. They lack many features one expects from a CPU. Which in turn points back to it being better viewed from the Hub again. ... One has to be clear which view is being used.

    Now 6:20 AM here and I've got a cold, doh! Nighty-night.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2012-07-13 11:25
    View from the hub? The hub doesn't have an instruction set or a processor to run one. I don't get it.
    evanh wrote:
    Cogs are not that fancy as far as processors go.
    Neither is microcode; hence the aptness of my comparison. The Propeller architecture is not fancy -- by design. It is, however, elegant in its simplicity and flexibility and in its closeness to the hardware metal.

    -Phil
  • Circuitsoft Posts: 1,166
    edited 2012-07-13 12:05
    OT:
    Bobb Fwed wrote: »
    I was surprised that the newest chips have such short pipelines (I always thought Intel maintained the raw-clock-speed-trumps-short-pipelines mentality, but I guess they learned).
    Intel learned when the Pentium M (basically a Pentium 3-M with SSE2) bested the Prescott at half the clock speed and a quarter of the power consumption. Intel (Israel) then took over Intel (USA) and integrated the branch predictor from Prescott into the Pentium M and made a dual-core to make the Core Duo. Add 64-bit instructions and the core width to support them, and you get Conroe.
  • kwinn Posts: 8,697
    edited 2012-07-13 13:23
    evanh wrote: »
    Good try, and from the Hub/Spin/LMM/XMM view I'd 100% agree. :)

    But from a Cog viewpoint, no, Cogs are not that fancy as far as processors go. They lack many features one expects from a CPU. Which in turn points back to it being better viewed from the Hub again. ... One has to be clear which view is being used.

    Now 6:20 AM here and I've got a cold, doh! Nighty-night.

    Have to agree with Phil on this one. Most microcode on the computers I worked with was executed by the hardware (the code bits controlled gates, registers, and counters to execute the instructions). The hardware was not that fancy either.
  • pedward Posts: 1,642
    edited 2012-07-13 14:28
    Chip and I are working on getting the boot ROM code sorted out. We've been talking about the e-fuses and making them idiot proof and tamper resistant. Chip said he added edge capture to the counters. Full details of the capture are up to him to disclose.

    We discussed adding redundant "authenticate" and "master write disable" e-fuses. The idea being that a completely wacky piece of code couldn't brick the chip by just accidentally flipping a bit. The notion is that you would set 5 interspersed bits to indicate the condition.

    For instance, bit 0 would be MWD, bit 1 would be AUTH, skip 32 key bits, then repeat, skip 32 key bits, repeat, etc.

    So the setting bits would bookend the key bits and be interleaved with the key bits, ensuring that you really meant to do it.

    We also discussed the ROM monitor code for programming these bits. There would be a command enable that actually turns on the PFET which blows the e-fuses, so you have to follow a specific order to blow e-fuses.

    Most recently I suggested to Chip that we only disable writing to the authentication e-fuses and leave the remainder for end-user application.

    This suggestion was based on how Motorola implemented the ICS update to the RAZR. When they push the new release, they blow an e-fuse and the authenticated boot loader won't load a release older than a certain revision.

    End-users could use the e-fuses in the same manner to ensure any critical firmware updates can't be undone by older, vulnerable releases. Imagine you have code with a privilege escalation in it, then you release an update. Flashing back to the older version to get the privilege escalation capability would be prevented by using a boot loader that checks the e-fuses and compares them with the contents of your encrypted firmware. In this way you can push firmware releases that are resistant to downgrade attacks.

    Since the e-fuses are OTP, you would have a succession of bits that you blow for each update, starting at 10000.... then 11000... then 111000... then 111100... etc.
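
    If it ends up working that way, the boot-time check is just counting the run of blown bits and comparing it against the firmware's version number. A hedged PASM sketch, assuming the fuse bits have already landed in a cog register somehow (no such mechanism is public, so fusebits, fwver, and reject are all hypothetical):

            mov     minver, #0
    :count  shl     fusebits, #1    wc  ' shift the next fuse bit into C
    if_nc   jmp     #:check             ' first unblown fuse ends the run
            add     minver, #1
            jmp     #:count
    :check  cmp     fwver, minver   wc  ' C set if firmware version < blown-fuse count
    if_c    jmp     #reject             ' older than the fuse floor: refuse to boot it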
  • jmg Posts: 15,155
    edited 2012-07-13 15:45
    pedward wrote: »
    Chip said he added edge capture to the counters. Full details of the capture are up to him to disclose.

    Yahoo!!!! Great news. :):)

    Fingers crossed that he will disclose these (low impact?) details :

    * That the Edge Capture is not exclusive of external clocking, i.e. you can INC on one pin and capture on another.
    With the existing separate A and B pin fields, this should be relatively easy, and A==B allows capture on inc.

    * That atomic control of capture on two counters is possible.
    Doing this would just need a single aliased flag in the CTRA and CTRB registers, or a config option that flips control to the other counter's Enable.

    For symmetry, Atomic Enable control of Count on both counters would also be nice.
  • pedward Posts: 1,642
    edited 2012-07-13 15:50
    Something he mentioned, which isn't clear to everyone, is that each counter is dual channel. He said each cog has 4 counter channels, effectively doubling the number. What limitations there are is unknown.

    Chip said the edge capture was low, high, or full period, and since the result is buffered, you can change from one edge to the other while it's accumulating and get values for both edges.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2012-07-13 15:53
    pedward wrote:
    ... each counter is dual channel. He said each cog has 4 counter channels, ...
    What exactly do you mean by "dual-channel"? Are counters shared between cogs, so that a cog can "borrow" a counter from a neighboring cog?

    -Phil
  • pedward Posts: 1,642
    edited 2012-07-13 15:55
    Chip said each counter has 2 channels, to what extent that is usable is unknown to me.