The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

potatohead · 2015-07-29 20:45

A simpler way to express this might be to say cog code is memory to memory direct and hub code is a load store model.

Maybe the streamer can blast blocks of data into the COG for a lot of processing, then stream them back out too.

Like say some matrices being transformed. Move em to the COG, transform, move em back.

We get a load store with a ton of registers and a math coprocessor.

Heater. · 2015-07-29 21:04

Roy Eltham, "...hub exec would be equivalent speed to cog exec except on jumps or when trying to use the streamer for data and code (hub exec loses badly here)....I think C code execution speed will be near native pasm speed in most cases."
I believe you are right Roy.
However the whole HUB vs COG performance thing is open to interpretation. For example:
1) Code and data in COG. Runs at full "native" speed. Like a "normal" processor. I'm sure the C compiler will do a good job of matching a hand PASM coder.
2) Code and data in HUB. Again the C compiler will do a good job of matching a hand PASM coder.
But oops, in the second case the code requires all those RD/WR HUB instructions to be added, which kills performance. No matter if you are a compiler or a human. Not "native" speed like a "normal" processor.
Like I said, I was just fishing for any vague idea how this pans out. I know prop-gcc did some amazing things with FCACHE for example. I'm always amazed at what optimizers in compilers can do.

Heater. · 2015-07-29 21:26

potatohead,
I'm willing to bet you that two 4 by 4 matrices of 32 bit signed integers can be multiplied quicker by starting with the input matrices in HUB memory and the result matrix written to HUB than doing the whole thing in COG. On a Propeller 2.
Let's say a hundred dollars?
We can discuss exact conditions of the bet if you like.
How about it? Man enough?

msrobots · 2015-07-29 21:40

@Heater,
I think running code from hub will allow us to use all of cog RAM for data or constants. Now we have 512(?) registers to use.
So chances are good to have data in cog memory and code in the hub.
Fcache makes little or no sense with hubexec compared to LMM; the speed gain is only on branches, which reload the pipeline/streamer. It might help a bit, but not as much as on the P1.
On the other hand, the eggbeater hub provides fast streaming between cog and hub memory in both directions, as long as the data is contiguous.
As you noted in another thread, PropGCC uses just 16 registers. So some stack in the cog seems feasible for all those local vars.
Since I do not know much about compilers and next to nothing about GCC, I can only speculate here about what David and Eric will be able to accomplish.
But I am pretty sure they will figure out some way to use the cog memory for something.
Are those extra FIFO-like cog memories still in the current version of the P2? I somehow lost track over time and the iterations.
Anyway, I think hubexec may be beneficial not just for C/C++ but also for PASM programmers like me.
This seems to be developing into a nice chip, @chip.
Mike

Heater. · 2015-07-29 21:49

msrobots,

I'm no compiler writer but as far as I can tell FCACHE is not needed so much if you can execute code from HUB. The main idea of FCACHE was not to have to fetch all the instructions, one by one, with the LMM loop. Just load a tight loop of code to COG and let it rip. This is not required any more.

Also, I have read many times that compiler writers would love to have more registers. Well, hey, now it looks like they can have hundreds of registers!

I agree, it all sounds good to me.

Roy Eltham · 2015-07-29 22:02

That's a dangerous bet at this stage of the game. I don't see how the hub version can beat the cog version with something that fits entirely in a cog. Maybe if you are counting the time it takes to load the code and data into the cog and start it going as part of the cost?
With the indirect addressing available on P2, accessing a 4x4 array in cog should be trivial.
Heater. said:
"potatohead,
I'm willing to bet you that two 4 by 4 matrices of 32 bit signed integers can be multiplied quicker by starting with the input matrices in HUB memory and the result matrix written to HUB than doing the whole thing in COG. On a Propeller 2.
Let's say a hundred dollars?
We can discuss exact conditions of the bet if you like.
How about it? Man enough?"

MJB · 2015-07-29 22:06

Heater,
your description does not sound correct to me - see below:
>>Roy Eltham, "...hub exec would be equivalent speed to cog exec except on jumps or when trying to use the streamer for data and code (hub exec loses badly here)....I think C code execution speed will be near native pasm speed in most cases."
>>I believe you are right Roy.
>>However the whole HUB vs COG performance thing is open to interpretation. For example:
>>1) Code and data in COG. Runs at full "native" speed. Like a "normal" processor. I'm sure the C compiler will do a good job of matching a hand PASM coder.
If you have a register architecture (like AVR) then COG memory relates to registers. So instead of maybe 32 registers you have a huge number even with some code in COG. So extra instructions to load memory from/to registers are the standard case.

>>2) Code and data in HUB. Again the C compiler will do a good job of matching a hand PASM coder.
>>But oops, in the second case the code requires all those RD/WR HUB instructions to be added, which kills performance. No matter if you are a compiler or a human. Not "native" speed like a "normal" processor.

It is just normal to load registers and store back after the operation.
Good compilers 'see' if operands can be kept in a register for the next operation and make some save/load cycles unnecessary.
And since hubexec is almost as fast as COG and data access is basically the same, all is fine :-)

>>Like I said, I was just fishing for any vague idea how this pans out. I know prop-gcc did some amazing things with FCACHE for example. I'm always amazed at what optimizers in compilers can do.

Heater. · 2015-07-29 22:16

Roy Eltham,

"That's a dangerous bet at this stage of the game."

Yes it is. Isn't it?

"I don't see how the hub version can beat the cog version with something that fits entirely in a cog. Maybe if you are counting the time it takes to load the code and data into the cog and start it going as part of the cost?"

I specifically and purposely said "...than doing the whole thing in COG". That is to say input is already in COG, output is in COG. Of course we will need some way of getting the data in and out to verify the results but the timing can be totally internal.
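Whatever the exact conditions end up being, both sides would need a reference result to check against. A minimal host-side sketch in Python, assuming products and sums wrap to signed 32-bit two's complement (the thread does not say how overflow should be treated); `wrap32` and `matmul4` are names made up for this illustration:

```python
def wrap32(x):
    """Wrap an integer to a signed 32-bit two's-complement value."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x


def matmul4(a, b):
    """Multiply two 4x4 matrices (lists of rows) of signed 32-bit integers.

    Assumption: each result element wraps to 32 bits, as a plain
    32-bit accumulator would.
    """
    return [[wrap32(sum(a[i][k] * b[k][j] for k in range(4)))
             for j in range(4)]
            for i in range(4)]


identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
m = [[1, -2, 3, 4],
     [5, 6, -7, 8],
     [9, 10, 11, -12],
     [-13, 14, 15, 16]]

# Multiplying by the identity must give back the original matrix.
print(matmul4(m, identity) == m)
```

Both a HUB-based and a COG-based version could be run on the same inputs and compared against this output, with the timing measured separately on the chip.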

"With the indirect addressing available on P2, accessing a 4x4 array in cog should be trivial."

I could only hope so.

Where is Spud? Is he up for the challenge?

potatohead · 2015-07-29 22:23

I agree. But I am also unclear on how the streamer works.

Say I have three transforms to do. Local to world and world to view. (Yes I know keeping things in world and doing a single transform to view is possible and faster, but I am just trying to consider an example)

Stream data into COG, perform all three, stream out to HUB.

Compare to load compute store, etc....

I'm not saying HUBEXEC will be faster, nor even on par, but I am suggesting how things get structured in memory may significantly impact what SLOWER looks like.

Early on, the STREAMER seemed to be this thing that can move data quickly without the direct attention of the COG. If that is true, my suggestion seems valid.

If not, then of course. I lose, and no worries there. Just trying to think it through and understand like any of us are.

What I do know is the overall complexity went way up when cache and memory-to-memory ops were made possible in HUBEXEC on the P2-Hot, and this tradeoff is still good, sort of like LMM without having that instruction loop overhead, etc...

My comment earlier should be read as, "might not be as slow as we think" rather than it is faster.

potatohead · 2015-07-29 22:25

And Roy centered on another idea, and that is using a hybrid path. Keep a few things in the COG to be called as needed to augment what may be a lot slower done all hubexec code.

Heater. · 2015-07-29 22:32

Potatohead,
Does that mean you are up to the challenge? 100 dollars. Real US money?
I don't care about "world" and "view". Only numbers in matrices and the correct results, as described above.
If you like we can make it easier for you. I'll use C when prop-gcc is available for the PII.
How about it?

MJB · 2015-07-29 22:42

>>And Roy centered on another idea, and that is using a hybrid path. Keep a few things in the COG to be called as needed to augment what may be a lot slower done all hubexec code.

like inner loops

ozpropdev · 2015-07-30 00:48

That seems to imply a pretty wide bus to HUB

Looking at the Verilog for the new Hub scheme the bus is 512 bits wide!

input [511:0] d, // 16 sets of 32 data inputs

Cluso99 · 2015-07-30 00:53

Hub to/from Cog memory transfers are going to fly in this P2.

Hubexec can be made to fly, but will have stalls waiting for its hub slot when jumping. But hey, there is no LMM engine wasting slots. And waiting for a slot can be no worse than waiting for a slot on P1.

I can see some interesting mixes here where small code loops will execute much faster by loading them into cog and running them, than running in hubexec mode. This is where we can hand craft code to push the P2 to its extreme.

I am hoping with hubexec, we can now issue coginit/cognew without the need to load the cog ram with anything. This would make cog startup extremely fast. And with WAITINT we can have a cog idling in low power mode at the ready to run any code we desire.

Now, just hoping Chip can get the P2 to run at 200MHz

might be an impossible task, but we can hope.

David Betz · 2015-07-30 01:00

>>I can see some interesting mixes here where small code loops will execute much faster by loading them into cog and running them, than running in hubexec mode. This is where we can hand craft code to push the P2 to its extreme.

With the LMM kernel code out of COG memory there will also be more space for fcache (tm Bill Henning) for speeding up PropGCC code.

jmg · 2015-07-30 02:10

>>Now, just hoping Chip can get the P2 to run at 200MHz

>>might be an impossible task, but we can hope.

Isn't the new target ~160MHz? - and even that may be a little vague, for now?
IIRC Chip was passing on the present verilog for a die-area sanity check at the FAB team, which may also give a MHz indicator.

potatohead · 2015-07-30 03:09

The target did seem to be 160. As for starting a COG, I swear I read Chip saying the COG start with no load is in there with the HUBEXEC.

@Heater, we may be talking past one another. I hope you are right actually.

>>I'm willing to bet you that two 4 by 4 matrices of 32 bit signed
integers can be multiplied quicker by starting with the input matrices
in HUB memory and the result matrix written to HUB than doing the whole
thing in COG.

What I was thinking is the Streamer moves a long per clock. Went back to look for that and the FIFO discussion, because I thought it could move data into COG RAM, but there is a lot of discussion...

My thought was to set up the multiply, unrolled in the COG, so that it just blasts through the ops, saving over the RDLONG/WRLONG instructions needed to operate with the data entirely in the HUB. Or, if a common operand is in the picture, get it into the COG or do that with some setup data, then leave it there while a lot of other data to be operated on stays in the HUB, much like holding a value in a register can make sense on most CPUs.

Of course, results need to go back into the HUB.

And I'm pretty sure the COG can do small multiplies all day long, just not 32 bits. For some things, that's enough, and it may change the balance of things compared to waiting for the pipelined math unit to do its thing. In that case, there is more than enough time to load store while results are pending.

Though there may also be frequent savings where some results can just stay in the COG, sort of like when using registers in a CPU to hold intermediate results rather than doing more read write cycles.

As for man enough... lol. Sure, why not when we get an image to bang around on?

When we get PropGCC, I'll be using it off and on with PASM, just like last time.

Programs will be able to jump right into the COG and right back out again, and data can live in the HUB, but the operations will always happen in the COG. Remember all those "but what is a register?" discussions early on? Well, when running from the HUB, the entire COG RAM really is just registers, some of which can hold code, so there is bound to be some nice cases where a hybrid makes sense.

A lot depends on how the FIFO and Streamer work. And that means smart data setup too. When we get some more detail on that, and the pointers, which I think at least one set are still in there, how it plays out might be a lot more clear.

We shall see soon enough.

If it's fast, then doing those matrices may be worth it too. We might see some polys on the P2, and that's just fun. However it happens.

potatohead · 2015-07-30 03:24

Yes, 160MHz cited here:

http://forums.parallax.com/discussion/comment/1334165/#Comment_1334165

evanh · 2015-07-30 03:35

That seems to imply a pretty wide bus to HUB

Looking at the Verilog for the new Hub scheme the bus is 512 bits wide!

input [511:0] d, // 16 sets of 32 data inputs

And that's just the 32 data lines. There are also 19 address lines, plus read, write, and latch type signals. 16 ports of this for the Hub and 16 ports for the Cogs. So a ~55x16x16=14000 way switching matrix with some longish metal wire runs.
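The arithmetic behind that estimate can be spelled out. A small Python sketch, assuming about 4 control signals per port to make up the "~55" (the post only says read, write and latch type signals, so that count is a guess):

```python
DATA_BITS = 32      # data lines per port, from the Verilog comment
ADDRESS_BITS = 19   # address lines per port
CONTROL_BITS = 4    # assumed: read, write, latch and similar signals
PORTS = 16          # 16 hub RAM ports and 16 cog ports

signals_per_port = DATA_BITS + ADDRESS_BITS + CONTROL_BITS
crosspoints = signals_per_port * PORTS * PORTS
hub_data_width = PORTS * DATA_BITS
addressable_bytes = 2 ** ADDRESS_BITS

print("signals per port:", signals_per_port)
print("switch crosspoints:", crosspoints)
print("total data width:", hub_data_width)
print("bytes addressable with 19 bits:", addressable_bytes)
```

The 512-bit figure matches the `input [511:0] d` declaration, and 19 address bits cover exactly 512KB of byte addresses.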

I don't know why this resulted in lower power consumption than a single 256bit wide Hub bus.

Seairth · 2015-07-30 04:07

Does the hub "rotate" every clock cycle or every instruction cycle? The instructions are now 2 cycles long (pipelined). So, does the hub take a full 32 cycles to wrap around?

evanh · 2015-07-30 04:46

Each clock, 16 clocks, 8 instructions for a full rotation. Same number clocks as the Prop1.

Cluso99 · 2015-07-30 06:32

I thought the multiplies, etc were now pipelined in the hub and took something in the order of 40 clocks which meant 3 hub cycles. However this unit can accept a new calculation every clock. So the cog waits for its slot to issue a calculation.

The lower power consumption is because the hub is now built in small blocks, and only those blocks that are active in a clock cycle are enabled. Up to 16 hub blocks could be enabled per clock. I think previously the hub design meant the whole hub was active for each clock.

evanh · 2015-07-30 06:46

There are 16 HubRAM blocks. All 16 can be actively transferring in parallel and still be less power hungry than the slower P2-Hot, is the way I understood the situation.

Yep, the cordic is something like 36-40 clocks long but can accept a new command from each Cog every 16 clocks, I think. I.e., it's a full pipeline that can take a new command every clock, and therefore does an execute cycle on every clock as well.
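To put rough numbers on what a pipelined unit buys a single cog, here is a back-of-envelope Python model. It assumes a latency of 38 clocks (picked from the middle of the "36-40" guess), one issue slot per cog every 16 clocks, and it ignores the wait for the first slot; none of this is confirmed by the design, and the function names are made up for the sketch:

```python
SLOT_PERIOD = 16        # clocks between one cog's issue slots (as guessed above)
PIPELINE_LATENCY = 38   # assumed, "something like 36-40 clocks"


def pipelined_clocks(n_commands, latency=PIPELINE_LATENCY, period=SLOT_PERIOD):
    """Clocks from the first issue to the last result, issuing once per slot."""
    if n_commands == 0:
        return 0
    return (n_commands - 1) * period + latency


def serial_clocks(n_commands, latency=PIPELINE_LATENCY):
    """Clocks if each command had to finish before the next could start."""
    return n_commands * latency


for n in (1, 4, 16):
    print(n, pipelined_clocks(n), serial_clocks(n))
```

Under these assumptions a cog that keeps issuing on every one of its slots pays the full latency only once, which is why there would be time for other work while results are pending.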

MJB · 2015-07-30 11:57

>>There are 16 HubRAM blocks. All 16 can be actively transferring in parallel and still be less power hungry than the slower P2-Hot, is the way I understood the situation.

>>Yep, the cordic is something like 36-40 clocks long but can accept a new command from each Cog every 16 clocks I think. Ie: It's a full pipeline that can take a new command every clock. And therefore does an execute cycle on every clock also.

hm -- but each COG has its associated timeslice - it is not possible that one COG uses other COGs' free slices - right?

evanh · 2015-07-30 12:20

Since Prop2 HubRAM is every clock for every Cog then I presume you are talking about the Cordic? Yes, my impression is it'll be a phase shifted 1/16th timing slot per Cog.

I could be wrong though, the details of how the egg beater similarly works are also pretty sketchy.

Cluso99 · 2015-07-30 14:43

The cog hub access was neatly described a few posts ago.
Cog 0 uses its slot 0 to access hub long (or word or byte) with addresses ending in 0000_xx. Next is slot 1 where cog 1 accesses addresses ending in 0000_xx and in parallel to this cog 0 can access addresses ending in 0001_xx. This continues.
Now, what that means is on every clock, it is possible for each of the 16 cogs to access one of the addresses ending in 0000_xx through 1111_xx respectively.
So cog 0 can access 16 successive long addresses in 16 successive hub slots. The same applies for cog 1 except its address endings are offset by a long with respect to cog 0.

Together with the earlier post by ???, I hope this explains why each cog can read 16 longs in 16 slots.
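The rotation described here can be written down as a small model. This is a Python sketch of the description in this post only, not of the actual Verilog; `slice_of`, `slice_for_cog` and `wait_clocks` are hypothetical helper names:

```python
NUM_COGS = 16  # also the number of hub RAM slices


def slice_of(byte_addr):
    """Hub slice for a byte address: the four bits above the 2-bit byte offset."""
    return (byte_addr >> 2) & 0xF


def slice_for_cog(cog, clock):
    """Slice a cog can reach on a given clock: cog 0 sees slice 0 at clock 0,
    cog 1 sees slice 0 one clock later, and so on."""
    return (clock - cog) % NUM_COGS


def wait_clocks(cog, clock, byte_addr):
    """Clocks the cog waits, starting at `clock`, until the slice comes round."""
    return (slice_of(byte_addr) + cog - clock) % NUM_COGS


# Cog 0 reading 16 successive longs, one per clock, never has to wait.
print(all(wait_clocks(0, t, 4 * t) == 0 for t in range(16)))

# Cog 1 gets the same run of longs, shifted by one clock.
print(all(wait_clocks(1, t + 1, 4 * t) == 0 for t in range(16)))

# A random single access waits anywhere from 0 to 15 clocks.
print(sorted(wait_clocks(0, 0, 4 * n) for n in range(16)) == list(range(16)))
```

In this model sequential longs arrive at one per clock once the first one is in step with the rotation, while an isolated access to an arbitrary address has to wait for its slice to come round.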

potatohead · 2015-07-30 15:22

http://forums.parallax.com/discussion/155675/new-hub-scheme-for-next-chip/p1

Bill Henning · 2015-07-30 18:41

Chip,

Can I put in a request for a non-long-aligned access interrupt?

it would allow a SYSCALL with a 19 bit constant argument!

People who did not like it could not enable it.

cgracey said:
@Chip,
From your perspective do you think the pace of the development has improved now you're back on the forums? The reason I ask is that I can imagine taking a project on like this as the sole architect must be a struggle at times and take some real tenacity to get through. I know in my profession it is better to be able to bounce ideas off others when I can't get around the challenges that I'm faced with. That often leads to much better solutions in the end....
Just curious.
Regards,
Coley

Yes! I'll work a long time on my own and pretty much get done what I had planned, but when we discuss things on the forum, there are blasts of productivity that really surprise me.

I'm just one person with limited thoughts, but all you guys have your own wealths of experience and ideas that are foreign to me, but enrich the heck out of the Propeller effort.

Prop2-Hot was a Colossus of awesome ideas that would never have occurred to me, working alone. Your contributions amounted to probably 80% of the overall design. My job has been implementer and refiner, which has been really exciting. In fact, much of the refining came from you guys in the form of suggestions and incidental discussion.

Nobody could hire a group of idea people that could better your casual efforts here.

I don't think misaligned instructions will ever cost more than a clock after each jump, if that. My problem with misaligned instructions is mainly that it complicates the assembler by making all addresses byte-aligned, even for cogexec code, unnecessarily complicating every compile-time calculated register address.

In practice, you will probably never do anything like '$1F4*4' because your register symbols will be symbolic, and stepping by 4 when you declare them.

There will probably be some development in the assembler's semantics that will further simplify cog vs hub address reckoning.

I don't feel like the hardware address paradigm is flawed, myself. It feels cleaner to me than Prop2-Hot. And we have constant JMP, CALL, and locating instructions which are 20-bit-range and byte-address-granular. I hope we've got everything covered, anyway.

Bill Henning · 2015-07-30 18:49

Quick Drive-By predictions:

1) I expect hubexec code to run at roughly half the speed of code that runs in cog only, so FCACHE still may be useful. This is based on roughly 1:6 instructions being a branch forcing FIFO reload with egg-beater synch.

Math: on average 8 cycles to sync to the beater. Average 1:6 for a branch. hubexec will be roughly 1/2 speed of cog only code (or just a bit faster)

2) Stream-loading 4x4 matrices before multiplying, stream writing the result

For column-major matrix ops, WAY faster to stream into/out of cog

For row-major matrix ops, best guess is 1/2-1/3 the speed if direct hub access to matrices

3) GCC will be able to make great use of the tons of registers & stacks
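The estimate in prediction 1 can be reproduced in a few lines. A Python sketch, assuming 2 clocks per instruction (as stated earlier in the thread), one branch in every 6 instructions, an average 8-clock stall per branch, and no other hubexec overhead; `hubexec_relative_speed` is a name made up here:

```python
def hubexec_relative_speed(instructions_per_branch=6,
                           stall_clocks=8,
                           clocks_per_instruction=2):
    """Speed of hubexec relative to cog-only code under the stated assumptions."""
    cog_clocks = instructions_per_branch * clocks_per_instruction
    hub_clocks = cog_clocks + stall_clocks  # one FIFO reload per branch
    return cog_clocks / hub_clocks


print(hubexec_relative_speed())

# Straight-line code with rare branches closes most of the gap.
print(hubexec_relative_speed(instructions_per_branch=60))
```

With the numbers above, 12 clocks of work pick up 8 clocks of stall, which lands near the "roughly half, or just a bit faster" guess. Branch-heavy code does worse and unrolled code does better.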

I'll try to jump in more often, but I am in the middle of a product launch.

potatohead · 2015-07-30 18:57

Make that launch awesome Bill.
