The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

evanh · 2014-04-17 21:13

potatohead wrote: »

I find it interesting the current design may fall in the 1-2 watt range, which is about what that other design would consume on average at a modest clock, and it would do so with a similar throughput too.

While it's not compulsory to stay under 2 Watts, it would be nice because then it fits in USB's 500 mA supply rating. I don't really want to have an added heat-sink either.

I certainly hadn't contemplated a hot IC even with the earlier Prop2 design.

Brian Fairchild · 2014-04-18 00:22

RossH wrote: »

You are right that very few (if any) of these new instructions will make their way into a high-level language compiler,...

That's very likely to be true.

I've been doing some optimisation work on some code written for a customer on an 8-bit uC. It comprises about 3k code-generating lines of C, which compile (not gcc) down to 16494 opcodes drawn from a pool of 110 different opcodes.

25% of the 110 are never used.
2 opcodes account for 25% of the total usage.
23 opcodes account for 90% of the total usage.
40 opcodes account for 97.6% of the total usage.
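
As a rough illustration of how figures like these can be derived, here is a minimal Python sketch. It assumes you already have a flat list of opcode mnemonics. The function name `coverage` and the sample data are made up for the example and are not the actual customer data.

```python
from collections import Counter


def coverage(opcodes, thresholds=(0.25, 0.90, 0.976), pool_size=None):
    """Summarise static opcode usage.

    Returns (needed, unused): `needed` maps each threshold to the number of
    most-frequent distinct opcodes required to reach that share of all
    occurrences; `unused` is how many opcodes of the pool never appear
    (None when pool_size is not given).
    """
    counts = Counter(opcodes)
    total = sum(counts.values())
    ranked = [n for _, n in counts.most_common()]  # descending frequency

    needed = {}
    for t in thresholds:
        running = 0
        used = 0
        for n in ranked:
            if running >= t * total:
                break
            running += n
            used += 1
        needed[t] = used

    unused = None if pool_size is None else pool_size - len(counts)
    return needed, unused


# Hypothetical sample, not real compiler output.
sample = ["ldi"] * 16 + ["st"] * 9 + ["add"] * 50 + ["mov"] * 25
print(coverage(sample, thresholds=(0.25, 0.90), pool_size=6))
```

This only measures how often opcodes appear in the listing, not how often they execute.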

Ramon · 2014-04-18 00:37

Brian, the million dollar questions are:

1a) what is the opcode that follows those 2 opcodes that account for 25%?
1b) Is there any pattern? Can a new instruction do both instructions at the same time?

2a) what is the opcode that follows those 23 opcodes that account for 90%?
2b) Is there any pattern? Can a new instruction do both instructions at the same time?

....

3a) what are the two next opcodes that follow those 2 opcodes that account for 25%?
3b) Is there any pattern? Can a new instruction substitute for the three instructions at the same time?

4a) what are the three next opcodes that follow those 23 opcodes that account for 90%?
...

Heater. · 2014-04-18 01:05

It's an odd thing this instruction counting business. I remember Motorola explaining how they analysed a lot of existing software for instruction frequency before coming up with the 6809 (I think) instruction set. I always wondered how they actually did that.

One can't simply assume that, because an instruction occurs infrequently in people's source code, we could perhaps not bother implementing it in the CPU.

One could imagine a "rare" instruction that only occurs once in any program but is actually essential, like say a COGINIT or some such.

Or a performance enhancing instruction, MUL16 say, that only ever occurs once in a program but, being key to getting adequate performance, is actually called a lot at run time.

What might be interesting to know is how many times each instruction is called in real running programs. Is it in the critical tight loop that makes the whole program possible?

This is a bit hard to determine from the source code without actually reading it all and finding the time critical parts.
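
To make the static versus run-time distinction concrete, here is a small Python sketch. The program layout, block names and loop counts are entirely hypothetical. The point is only that an instruction appearing once in the source can dominate the executed-instruction count.

```python
from collections import Counter

# Hypothetical program: (block name, instructions, times the block runs).
PROGRAM = [
    ("startup", ["coginit", "mov", "mov"], 1),
    ("inner_loop", ["mul16", "add", "djnz"], 10_000),
    ("shutdown", ["mov"], 1),
]


def static_counts(program):
    """Count how often each mnemonic appears in the source."""
    counts = Counter()
    for _name, instructions, _runs in program:
        counts.update(instructions)
    return counts


def dynamic_counts(program):
    """Count how often each mnemonic is executed at run time."""
    counts = Counter()
    for _name, instructions, runs in program:
        for op in instructions:
            counts[op] += runs
    return counts


print("static :", static_counts(PROGRAM).most_common())
print("dynamic:", dynamic_counts(PROGRAM).most_common())
```

In this made-up case MUL16 appears once in the source, just like COGINIT, yet it accounts for a third of everything executed. Getting real run counts would need profiling or simulation rather than a source scan.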

Brian Fairchild · 2014-04-18 01:16

Hmmm, it's a nice day outside and it's a Public Holiday so I don't think I'll be doing too much work today.

Ramon wrote: »

1a) what is the opcode that follows those 2 opcodes that account for 25%?
1b) Is there any pattern? Can a new instruction do both instructions at the same time?

The most common (16%) opcode is to load an immediate value into a register. So, can be followed by almost anything, so no patterns.
The next most common (9%) opcode stores the contents of a register to the data space using an index register. So it often appears at the end of a block of code so again, no patterns.

If anyone was really interested in writing some scripts to analyse the code, and promised to post the results back here to the forum, I'd be happy to make a file available containing the assembly source output from the compiler, minus comments, strings and C code. This is for an AVR.
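
For anyone tempted, a script along these lines might be a starting point. This is a sketch only. It assumes a plain listing with one instruction per line, optional `label:` prefixes, `;` comments and directives starting with `.`, and the real compiler output may well look different. The function names are made up for the example.

```python
from collections import Counter


def extract_opcodes(asm_text):
    """Pull the mnemonic out of each instruction line of an assembly listing."""
    opcodes = []
    for line in asm_text.splitlines():
        line = line.split(";", 1)[0].strip()  # drop trailing comments
        if ":" in line:                       # drop a leading "label:"
            line = line.partition(":")[2].strip()
        if not line:
            continue
        mnemonic = line.split()[0].lower()
        if mnemonic.startswith("."):          # assembler directive, not an opcode
            continue
        opcodes.append(mnemonic)
    return opcodes


def ngram_counts(opcodes, n=2):
    """Count runs of n consecutive opcodes (n=2 gives pairs, n=3 triples)."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


# Hypothetical fragment, not taken from the real listing.
listing = """
main:
    ldi r24, 0x01
    st X+, r24
    ldi r25, 0x02   ; second value
    st X+, r25
.global foo
loop: rjmp loop
"""

ops = extract_opcodes(listing)
for pair, count in ngram_counts(ops, 2).most_common(3):
    print(pair, count)
```

Note that pairs counted this way ignore labels and branches, so some of them span a jump target and are never actually executed back to back.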

potatohead · 2014-04-18 02:02

Heater. wrote: »

It's an odd thing this instruction counting business. I remember Motorola explaining how they analysed a lot of existing software for instruction frequency before coming up with the 6809 (I think) instruction set. I always wondered how they actually did that.

One can't simply assume that, because an instruction occurs infrequently in people's source code, we could perhaps not bother implementing it in the CPU.

One could imagine a "rare" instruction that only occurs once in any program but is actually essential, like say a COGINIT or some such.

Or a performance enhancing instruction, MUL16 say, that only ever occurs once in a program but, being key to getting adequate performance, is actually called a lot at run time.

What might be interesting to know is how many times each instruction is called in real running programs. Is it in the critical tight loop that makes the whole program possible?

This is a bit hard to determine from the source code without actually reading it all and finding the time critical parts.

Notably, the 6809 sings in assembly language. Also notable is the fact that a few of those instructions, when abused, end up enabling the chip to move memory at pretty amazing speeds for an 8 bit CPU.

The exploit ends up being to point the (S) system stack at a memory address, optionally (U) user stack at another, then use the push / pull register set instructions to just blast values from one place to another. A Tandy CoCo 3 can move memory at about a megabyte per second doing this!

I've always wondered how Moto did that analysis too.

Hitachi extended the chip, as the 6309, which added MORE instructions, and some addressing modes and registers to squeeze even more power out of what was already a beast of an 8 bit CPU!

I don't consider the extra instructions a big deal, unless we end up lumping too many together, get too modal, etc... Those things are typically undesirable, but we don't have a lot of that going on currently.

Props will need good PASM programs, and those good PASM programs will work well with SPIN. Since that's all one unified thing, it is in our best interests to maximize it. Part of the secret sauce.

Not everybody will see it as a strength, and we've got the key instructions in there to maximize C, for example. Having this be as true as killer PASM programs are happens to be precisely why many of us support Chip in doing the work to make sure hub exec makes the most sense it can!

In PASM, we will use the HUB and COG together to get things done. Compilers aren't likely to do that kind of thing, but they will need to make efficient PASM code, and doing that with hub exec makes a lot of sense for everybody!

If Chip includes the ability to start a COG without reloading it, and/or start one executing right from the HUB, we get a very serious gain, in addition to the above. Now, we can point a COG at something, have that COG blast through it, and when done, shut itself down and return to the pool of COGS.

In COG mode, this has some nice uses, when we can start one without reloading. But in hub exec mode, this really adds to what we can do with the COGS! Think of a simple scheduler that simply assigns COGS to tasks, and the COGS go and do their thing, returning to the pool when they need to...

Or maybe we want to try that no loop system you mentioned a while back Heater. Great! Write the code, sans loops. Then blast through it with the COGS doing things in parallel where it makes sense, sequentially where it makes sense. Those kinds of things will be easy on a Propeller.

Anyway, let's say we don't get those two features. We still get hub exec, and that means we get big, efficient, easy to write PASM programs. Easy to write for a compiler, so SPIN builds to native PASM instead of byte code, and so C can run fast and potentially work with SPIN easier, or at least run on par with SPIN.

We need those things.

Or, somebody could use the PASM assembler and write big programs, or maybe use a more advanced macro assembler and do the same, making great use of the instructions...

Or, when using any HLL that allows reasonable inline code, blend the best of both worlds and get easy and fast where needed! This isn't as favorable to reuse, but it has many other advantages, one primary one being the ability to drill down and write just the bit of PASM that adds a lot of value, thus learning PASM and applying it all in one go. It doesn't get any better for building assembly skill than that. Another advantage is being able to more directly balance speed, size and complexity, leaving the HLL as a framework to keep things organized.

Re: The stack

I too wouldn't mind seeing a few more levels, but 4 is good! For a very large number of PASM programs, this will be sufficient. Larger ones can manage return addresses in other ways, as can compilers. No worries there.

From the power discussions we had, the one thing made clear is the number of gates toggling and the frequency they end up toggling at, drives the process physics limits. All of that speed limited the earlier design, leaving it fast like this one will be, even at lower clock speeds and voltages needed to keep the power sane. In this one, optimizing for the higher clock speed means we get better time precision, and it's simpler, but faster overall.

Chip knows this and is going to optimize accordingly. Having a few odd features here and there to maximize something isn't the same as building out beast-like sub-systems like the other design has. That one had video, on chip math, etc... all duplicated in the COGS, and it had the tasks, swapping, trace, PORT D and a bunch of other things designed to maximize throughput and emphasize many different ways that COGS, the HUB and PINS could interact.

That is what drove the heat budget, and the larger number of instructions was somewhat an artifact of controlling all those things, but not entirely. The big driver was the sheer number of gates and the toggling they had to do associated with those instructions.

In this design, there is no pipeline, no big video system, and if we get on-chip math, Chip has already said he's going to put it in the HUB, and have it round robin serve the COGS, which is a smart optimization! Moving some smarts to the pins is another optimization that ensures power is well distributed and that the number of gates being toggled isn't so large as it was with the other design.

All of this means fixating on instruction number isn't all that productive. It's what they do and how that's associated with the toggles and clocks that matter, and the simple, no pipeline nature of this design ensures that only so much can happen per instruction. That alone will change the power dynamics favorably, but you all know Chip is going to take what was learned about the process and keep it in check as things get built.

I do recognize the ease of use argument. Long ago, learning the 6502 was pretty easy! It didn't have that many instructions, and of those instructions, there was a clear, useful set, many favorites, and some odd ducks that got into some clever programs from time to time.

The 6809 was a bit harder to learn. And it too had its core set of useful, clear instructions, and a whole lot more favorites, and actually fewer of those odd ducks because the instructions were more useful generally with all the registers and addressing modes.

The P1 is 6502 like, in that it's got a core set of useful, clear instructions, some favorites and a few odd ducks. There are not very many options on the instructions either, requiring self-modifying code to get things done in the COG, less so when addressing HUB memory. Some of the best 6502 code is self-modifying code, and it's due to the distinct lack of addressing options. Sound familiar?

This design is like the 6809. Same overall instruction profile, better addressing where needed, etc... A bit harder to learn, but not too much so. It's more productive overall, due to having some great instructions the P1 did not have.

The earlier power heavy design was a lot like the 6309, in that it's maximized in nearly every way. Process physics get in the way, which made us choose.

So we split the middle. Keep the things about the P1 that make it shine in an older, larger process physics, and add some of the really good stuff from the power hungry design, leaving us with one that clocks high and that uses moderate power.

What the instruction count ends up being won't matter anywhere near as much as how the things work, and that comes down to toggles, clocks, etc... as discussed already.

If there is any worry, that should be the worry, because that is the real limitation we face in this process. The ease of use issues are being blown out of proportion IMHO. For the most part, people are going to be able to treat this chip like a P1 with more COGS and some spiffy pin capabilities. The other instructions will be there for people who advance in PASM, and they won't matter much for people using SPIN, C or most other HLL environments that compile down to PASM.

Forth?

Yeah, those guys will use the instructions to make a killer kernel, and then use them again to make a core dictionary that really rocks hard, then blast in whatever else they want to carry over from P1 and go! No worries at all, I would imagine. Few worries at best.

Cluso99 · 2014-04-18 02:04

I noticed that Chip has included GETS, GETD, GETI & GETCOND. I thought that was interesting.

Also noticed ROLNIBn, RORNIBn, and same for BYTE & WORD. Does anyone know how they work? I understand the GET & SET versions.

AUGD & AUGS add the top 23 bits to the 9 bits in the following instruction. Would an AUGDS be useful, adding just 9 bits to both 9-bit fields of the following instruction? That would be useful for making 17-bit cog/hub addresses.
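
The 23 + 9 split can be modelled in a few lines of Python. This is only a sketch of the arithmetic, assuming the augmenting instruction supplies the upper 23 bits and the following instruction's 9-bit field supplies the lower 9. The function names are invented, and the actual encoding is not spelled out in this thread.

```python
def augmented_immediate(aug23, imm9):
    """Combine a 23-bit augment value with a 9-bit field into 32 bits."""
    if not 0 <= aug23 < (1 << 23):
        raise ValueError("augment value must fit in 23 bits")
    if not 0 <= imm9 < (1 << 9):
        raise ValueError("immediate field must fit in 9 bits")
    return (aug23 << 9) | imm9


def split_immediate(value32):
    """Split a 32-bit constant into (upper 23 bits, lower 9 bits)."""
    if not 0 <= value32 < (1 << 32):
        raise ValueError("value must fit in 32 bits")
    return value32 >> 9, value32 & 0x1FF


upper, lower = split_immediate(0x12345678)
print(hex(upper), hex(lower), hex(augmented_immediate(upper, lower)))
```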

ozpropdev · 2014-04-18 04:37

Cluso99 wrote: »

I noticed that Chip has included GETS, GETD, GETI & GETCOND. I thought that was interesting.

Also noticed ROLNIBn, RORNIBn, and same for BYTE & WORD. Does anyone know how they work? I understand the GET & SET versions.

AUGD & AUGS add the top 23 bits to the 9 bits in the following instruction. Would an AUGDS be useful, adding just 9 bits to both 9-bit fields of the following instruction? That would be useful for making 17-bit cog/hub addresses.

Ray
ROLNIB shifts the nibble n(0..7) in S into D[3:0]
e.g.
D=$01234567
ROLNIB0 D,#8
D=$12345678

ROLBYTn shifts in the byte n (0..3) in S into D[7:0]
ROLWRDn shifts the word n (0,1) in S into D[15:0]

RORxxx shifts into D[31:28], D[31:24] and D[31:16] respectively.

Basically combines a GETxxx with a shift/or in a single instruction.
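
A quick Python model of the behaviour as described above, for anyone who wants to play with it. This is a sketch of the description in this post, not a statement of what the final silicon does. The helper names are invented, and `width` is 4, 8 or 16 for NIB, BYT and WRD.

```python
MASK32 = 0xFFFFFFFF


def _field(s, n, width):
    """Extract field number n (width bits wide) from S."""
    return (s >> (n * width)) & ((1 << width) - 1)


def rol_field(d, s, n, width):
    """ROLNIBn/ROLBYTn/ROLWRDn: shift D left, insert the S field at the bottom."""
    return ((d << width) | _field(s, n, width)) & MASK32


def ror_field(d, s, n, width):
    """RORNIBn/RORBYTn/RORWRDn: shift D right, insert the S field at the top."""
    return ((d & MASK32) >> width) | (_field(s, n, width) << (32 - width))


# The example from the post: ROLNIB0 D,#8 with D=$01234567
print(hex(rol_field(0x01234567, 8, 0, 4)))
```

The bits shifted out of D are simply discarded in this model, which is why the name suggests a rotate but the operation behaves like a shift with insert.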

Cheers
Brian

Edit: Correction

Ramon · 2014-04-18 06:13

cgracey wrote: »

The only way we could get a spare bit, or two, would be to reduce cog RAM to 256 longs. Then, we could get indirect bits for both D and S. And cog size would be cut in half. It might be viable with hub exec, you know. 32 cogs!

How about going to 64 bits? This will make hub exec pretty straightforward, and no one will ever dare to call it P1+ again!

Heater. · 2014-04-18 06:21

Why is it called ROLNIB?

That implies a rotate operation. Where do the top bits go?

ozpropdev · 2014-04-18 06:33

Heater. wrote: »

Why is it called ROLNIB?

That implies a rotate operation. Where do the top bits go?

It does seem more logical for them to be called SHLNIB, SHLBYT, SHLWRD, SHRNIB, etc.

Roy Eltham · 2014-04-18 10:57

ozpropdev,
Are you sure that's what the ROLXXXn/RORXXXn stuff does? I would have expected them to decide which subpart (nibble,byte,word) of the D register to ROL/ROR with the n part of the opcode, and how much to ROL/ROR with the S part.
So for ROLNIBn, would have 3 bits to select which NIBBLE (the n in the opcode name), and S to select how much to ROL. So you'd have ROLNIB0 through ROLNIB7, etc.

Another way to represent them in source code would be: ROLNIB D, S/#, n

Cluso,
AUGDS D, S <- does AUGD with the D, and AUGS with the S, but all in one instruction.

Baggers · 2014-04-18 13:50

Chip, If it's been said before I must have missed it, but can you tell me ( and all here ) what speed the cogs will run at on the DE2 and DE0?

Cheers,
Jim.

jmg · 2014-04-18 13:57

Baggers wrote: »

.. what speed the cogs will run at on the DE2 and DE0?

If the critical paths are the same, it will be ~ 40MOPS, as the previous 80MHz used 4 port memory,
if the paths improve ~ 25% then maybe ~ 50MOPs and 100MHz Counters

cgracey · 2014-04-18 17:18

Heater. wrote: »

Why is it called ROLNIB?

That implies a rotate operation. Where do the top bits go?

The top bits go into the bit bin, while D shifts left to accommodate a new bottom nibble that was selected from S.

cgracey · 2014-04-18 17:20

Baggers wrote: »

Chip, If it's been said before I must have missed it, but can you tell me ( and all here ) what speed the cogs will run at on the DE2 and DE0?

Cheers,
Jim.

The system clock will be 200MHz. The cogs will take two clocks per instruction, so they'll run at 100 MIPS. It's looking likely that the main clock could go 250MHz. In that case, 125 MIPS x 16 cogs = 2000 MIPS total.
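
The arithmetic behind those figures, as a tiny Python sketch. The function names are made up, and the defaults simply mirror the numbers in this post (two clocks per instruction, 16 cogs).

```python
def cog_mips(sysclk_mhz, clocks_per_instruction=2):
    """Instructions per microsecond for one cog."""
    return sysclk_mhz / clocks_per_instruction


def chip_mips(sysclk_mhz, cogs=16, clocks_per_instruction=2):
    """Aggregate MIPS with every cog running flat out."""
    return cogs * cog_mips(sysclk_mhz, clocks_per_instruction)


for mhz in (200, 250):
    print(mhz, "MHz:", cog_mips(mhz), "MIPS per cog,", chip_mips(mhz), "MIPS total")
```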

David Betz · 2014-04-18 17:28

RossH wrote: »

You are right that very few (if any) of these new instructions will make their way into a high-level language compiler, unless it is specifically designed from the ground up to use them (like SPIN was).

I think it's a little misleading to say Spin was designed to use these instructions. Spin on the P1 compiled to a bytecode virtual machine so it didn't generate any instructions natively. It just used them in the implementation of the VM. That may be true for Spin2 as well if it continues to use a VM. If it generates native code it will be interesting to see how effectively it can use these new instructions itself. It will, of course, allow their use in DAT sections containing PASM2 code but will its native compiler (if it has one) be able to make use of them any better than GCC or Catalina? I hope so because then we can steal the code generation algorithm for GCC! :-)

cgracey · 2014-04-18 17:28

Baggers wrote: »

Chip, If it's been said before I must have missed it, but can you tell me ( and all here ) what speed the cogs will run at on the DE2 and DE0?

Cheers,
Jim.

Right now, it's looking like the Cyclone IV FPGA boards (DE0/DE2) will be able to hit 200MHz. All the ALU stuff was proven yesterday for at least 300MHz before the last couple mux stages. I've been partitioning the ALU into 4 big chunks, each with its own input flipflops, so this should cut ALU power by 75% of what it would otherwise be. And aside from ALU in this design, there's not much circuitry in a cog. Power will go really low during WAITCNT/WAITPxx. So, I think the DE0/DE2 boards will be good for actual silicon speed. And this is from Altera Quartus' own timing analysis. I hope to have the cog design all tied together in another week.

Bill Henning · 2014-04-18 17:34

WOW is all I can say on that.

Actually... can we have all 64 (or 80, whatever you end up building) brought out on the nano to the connectors? However many smart pins, the rest as standard digital I/O

(another WOW on the real silicon hitting 125MIPS... 120MIPS would be nice for USB)

cgracey wrote: »

Right now, it's looking like the Cyclone IV FPGA boards (DE0/DE2) will be able to hit 200MHz. All the ALU stuff was proven yesterday for at least 300MHz before the last couple mux stages. I've been partitioning the ALU into 4 big chunks, each with its own input flipflops, so this should cut ALU power by 75% of what it would otherwise be. And aside from ALU in this design, there's not much circuitry in a cog. Power will go really low during WAITCNT/WAITPxx. So, I think the DE0/DE2 boards will be good for actual silicon speed. And this is from Altera Quartus' own timing analysis. I hope to have the cog design all tied together in another week.

cgracey · 2014-04-18 17:36

I made a better timing diagram for the new memory scheme:

5ns clock period
------------____________------------____________------------____________------------____________------------____________------------____________-

|			|			|			|			|			|			|
|-------+		|	       rdRAM Ic |-------+		|	       rdRAM Id |-------+		|	       rdRAM Ie |
|	|		|			|	|		|			|	|		|			|
|---+	+----> rdRAM Db |------------> latch Db |---+	+----> rdRAM Dc |------------> latch Dc |---+	+----> rdRAM Dd |------------> latch Dd |
|---+	+----> rdRAM Sb |------------> latch Sb |---+	+----> rdRAM Sc |------------> latch Sc |---+	+----> rdRAM Sd |------------> latch Sd |
|---+	+----> latch Ib	|------------> latch Ib	|---+	+----> latch Ic	|------------> latch Ic	|---+	+----> latch Id	|------------> latch Id	|
|   |			|			|   |			|			|   |			|			|
|   +------------------ALU-----------> wrRAM Ra	|   +------------------ALU-----------> wrRAM Rb	|   +------------------ALU-----------> wrRAM Rc	|
|			|			|			|			|			|			|
|			|	<wait a>	|			|	<wait b>	|			|	<wait c>	|
|			|			|			|			|			|			|

Cog RAM register indirection will be achieved by substituting D/S fields within the instruction being read from the RAM, before it feeds the rdRAM D/S inputs and before it's latched. There is time to do this, but there is not enough time to look at the instruction data coming out of the RAM (which arrives near the end of the cycle), make some decision based on it, and then buffer up to drive some mux's to do D/S field substitutions in time to feed the rdRAM D/S inputs. The latter would be required to achieve INDx like we had on Prop2. In the new scheme, substitute D/S values are ready to go and only need to be muxed into the instruction data coming out of the cog RAM.

Note that on that diagram, at the beginning of every clock, two memory accesses are initiated with the dual-port cog RAM. On one cycle you initiate both the instruction read and the result write. On the other cycle, you issue reads for both D and S registers. Also note that there are two full clock periods for the ALU result to settle in, while the ALU inputs are held steady from flipflops. The slowest ALU path is through the rotator (ROR/ROL/RCR/RCL/SHR/SHL/SAR/REV). Quartus is showing that circuit's Fmax at 150MHz, which would mean a 300MHz system clock. Once the last mux's are tied in and data-forwarding is implemented, I think 200MHz shouldn't be a problem.
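
The alternating port usage can be written out as a small Python sketch. This is one reading of the diagram above, with instruction letters a, b, c as used there. The function names are invented, and the exact pairing of accesses per clock is an assumption taken from the labels in the diagram.

```python
def port_schedule(periods):
    """List the two cog RAM accesses started on each clock.

    Each two-clock instruction period starts the D and S reads on one clock
    and the instruction read plus the result write on the other, so both
    ports of the dual-port RAM are busy on every clock.
    """
    def tag(i):
        return chr(ord("a") + i)

    schedule = []
    for p in range(periods):
        schedule.append((2 * p, f"rd D{tag(p + 1)}", f"rd S{tag(p + 1)}"))
        schedule.append((2 * p + 1, f"rd I{tag(p + 2)}", f"wr R{tag(p)}"))
    return schedule


def max_system_clock_mhz(path_fmax_mhz, clocks_available=2):
    """A path allowed N clock periods to settle tolerates an N-times faster clock."""
    return path_fmax_mhz * clocks_available


for clock, port_a, port_b in port_schedule(3):
    print(clock, port_a, port_b)
print(max_system_clock_mhz(150), "MHz")
```

The second function is just the rotator figure restated: a 150MHz path with two clock periods to settle corresponds to a 300MHz system clock.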

jmg · 2014-04-18 17:46

cgracey wrote: »

The system clock will be 200MHz. The cogs will take two clocks per instruction, so they'll run at 100 MIPS. It's looking likely that the main clock could go 250MHz. In that case, 125 MIPS x 16 cogs = 2000 MIPS total.

Are those numbers for Silicon (power envelope permitting), or FPGA ?

cgracey wrote: »

Right now, it's looking like the Cyclone IV FPGA boards (DE0/DE2) will be able to hit 200MHz. All the ALU stuff was proven yesterday for at least 300MHz before the last couple mux stages. I've been partitioning the ALU into 4 big chunks, each with its own input flipflops, so this should cut ALU power by 75% of what it would otherwise be. And aside from ALU in this design, there's not much circuitry in a cog. Power will go really low during WAITCNT/WAITPxx. So, I think the DE0/DE2 boards will be good for actual silicon speed. And this is from Altera Quartus' own timing analysis. I hope to have the cog design all tied together in another week.

Wow, hitting target silicon speeds will be a large jump from P2 - might this decrease as the chip 'fills up' vs a test case ?

jmg · 2014-04-18 17:50

cgracey wrote: »

Power will go really low during WAITCNT/WAITPxx.

I've been thinking about WAITCNT - I think currently, it is a "Wait until Equals", but it may be more flexible to do a "Wait until Equals or Greater than" (identical operation in a non gated design)

I think the small logic extension to allow this is something like
(C >= WP) & (C[MSB] == WP[MSB])

Doing this would allow various power saving COG gating approaches, without having any WAITs miss the strict 'equals' test.

jmg · 2014-04-18 17:54

Bill Henning wrote: »

(another WOW on the real silicon hitting 125MIPS... 120MIPS would be nice for USB)

There is still the Power Envelope lurking, so there may be caveats on this speed. Things like this tend to come down over time.

cgracey · 2014-04-18 17:59

jmg wrote: »

I've been thinking about WAITCNT - I think currently, it is a "Wait until Equals", but it may be more flexible to do a "Wait until Equals or Greater than" (identical operation in a non gated design)

I think the small logic extension to allow this is something like
(C >= WP) & (C[MSB] == WP[MSB])

Doing this would allow various power saving COG gating approaches, without having any WAITs miss the strict 'equals' test.

A bit-level comparison is very quick, while a compare via subtraction would probably break the 5ns budget, since there's carry ripple to accommodate.

cgracey · 2014-04-18 18:04

jmg wrote: »

There is still the Power Envelope lurking, so there may be caveats on this speed. Things like this tend to come down over time.

If you could get the cogs to do WAITCNT/WAITPxx's whenever they weren't needed for some quick processing, you could bring the power way down, while still having a very high clock frequency. It would take some discipline on the programmer's part.

jmg · 2014-04-18 18:15

cgracey wrote: »

A bit-level comparison is very quick, while a compare via subtraction would probably break the 5ns budget, since there's carry ripple to accommodate.

I guess the alternative is some sort of sticky flag within the Wait checker, which means that (tiny) portion is not Clock-gated with the rest of the COG.

Does the NCO block meet timing ok, as it has a similar adder ripple ?

tonyp12 · 2014-04-18 18:23

> but it may be more flexible to do a "Wait until Equals or Greater than"

How would you handle counter overflow?
If at FFFFFF80 you ask the waitcnt to wait until Cnt+$200
2 clks later the current cnt is higher than the $180 value waitcnt has been set for, so it will not wait.

I guess you can set a flag if Cnt+$200 overflows (copy C to flag), and when the real cnt overflows it always clears this flag
And the wait is while waitcnt < cnt & flag=0
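
The overflow case is easy to try in a quick Python model. This sketch compares a strict equals test, a plain "greater or equal" test, and the MSB-guarded expression suggested earlier in the thread, `(C >= WP) & (C[MSB] == WP[MSB])`. The function names are invented and the model ignores real hardware timing.

```python
MASK32 = 0xFFFFFFFF
MSB = 0x80000000


def done_equal(c, wp):
    """Strict "wait until equals"."""
    return c == wp


def done_naive_ge(c, wp):
    """Plain unsigned "equals or greater than"."""
    return c >= wp


def done_msb_guard(c, wp):
    """(C >= WP) & (C[MSB] == WP[MSB])"""
    return c >= wp and (c & MSB) == (wp & MSB)


def clocks_waited(done, start, target, limit=0x1000):
    """Step a 32-bit counter from `start` until `done` fires; None if it never does."""
    c = start
    for n in range(limit):
        if done(c, target):
            return n
        c = (c + 1) & MASK32
    return None


# The case above: target computed at $FFFFFF80 as cnt+$200, wait starts 2 clocks later.
target = (0xFFFFFF80 + 0x200) & MASK32
start = 0xFFFFFF82
for done in (done_equal, done_naive_ge, done_msb_guard):
    print(done.__name__, clocks_waited(done, start, target))
```

In this model the plain compare falls straight through, while the MSB-guarded version waits the same number of clocks as the strict equals test. The guard only covers this particular wrap case, so other start and target combinations would still need checking.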

cgracey · 2014-04-18 18:26

jmg wrote: »

I guess the alternative is some sort of sticky flag within the Wait checker, which means that (tiny) portion is not Clock-gated with the rest of the COG.

Does the NCO block meet timing ok, as it has a similar adder ripple ?

Pin smarts may need to be less than 32 bits.

evanh · 2014-04-18 18:47

For reduced precision (this is the case if the count has already gone), WAITCNT could still use the equality compare but also mask off the lower bits so that multiple consecutive CNTs come true.

EDIT: It would have a peculiar phasing effect on signal generation.
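
A minimal Python sketch of the masked compare, to show the size of the window it opens. The function name and the choice of 4 ignored bits are just for illustration.

```python
MASK32 = 0xFFFFFFFF


def done_masked(c, wp, ignored_low_bits):
    """Equality compare on the upper bits only; the low bits are don't-care."""
    mask = MASK32 & ~((1 << ignored_low_bits) - 1)
    return (c & mask) == (wp & mask)


# With 4 low bits ignored, 16 consecutive counts satisfy the compare.
target = 0x180
hits = [c for c in range(0x170, 0x1A0) if done_masked(c, target, 4)]
print(hex(hits[0]), hex(hits[-1]), len(hits))
```

It stays a pure bit-level compare, so no carry chain is involved, but the wake-up can land anywhere inside the window, which is where the phasing effect comes from.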

Cluso99 · 2014-04-18 18:48

Chip,
This is FANTASTIC news. You seem to have a really good handle on reducing complexity for speed, and making ALU blocks to minimise power. A WIN-WIN.
