The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

Dave Hein · 2014-04-12 07:46

Bill Henning wrote: »

Last I heard (and deduced from the instruction set and register map post by Chip)

- four lines of four longs as icache
- four lines of four longs as dcache (this is an improvement over one line in P2 - it will help compiled code a LOT)
...
- fcache-like mechanism would allow full speed inner loops

Bill, I thought there was only 1 line of icache and 1 line of dcache. Where did you see that there are 4 lines?

EDIT: If you are referring to the four ICACHEx and DCACHEx values in the register map, I believe these represent the 4 longs in each of the caches.

EDIT2: I haven't seen any hardware support for an fcache-like mechanism. Wouldn't this have to be done in software like in the P1 LMM interpreter?
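To make the "one line of four longs" reading concrete, here is a minimal sketch in C of a single-line read cache. It is only a model under stated assumptions: the names `line_cache` and `cache_read` are hypothetical, the refill policy (reload the whole aligned quad on a miss) is assumed, and nothing here is taken from Chip's actual design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_LONGS 4u   /* one cache line = four longs (a quad) */

/* Hypothetical model of a one-line read cache; not the real hardware. */
typedef struct {
    int      valid;              /* 0 until the first fill */
    uint32_t base;               /* long address of the first long in the line */
    uint32_t data[LINE_LONGS];   /* stands in for the four DCACHEx values */
    unsigned misses;             /* number of line fills so far */
} line_cache;

/* Read one long through the cache, refilling the whole line on a miss. */
static uint32_t cache_read(line_cache *c, const uint32_t *hub,
                           size_t hub_longs, uint32_t long_addr)
{
    uint32_t base = long_addr & ~(LINE_LONGS - 1u);  /* align down to a quad */

    assert(base + LINE_LONGS <= hub_longs);          /* stay inside hub RAM */

    if (!c->valid || c->base != base) {
        for (uint32_t i = 0; i < LINE_LONGS; i++)
            c->data[i] = hub[base + i];
        c->base = base;
        c->valid = 1;
        c->misses++;
    }
    return c->data[long_addr - base];
}
```

With only one line, reads that alternate between two different quads miss every time, which is why the number of lines matters so much for compiled code.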

Invent-O-Doc · 2014-04-12 08:16

Although it started out well, I don't think this thread is contributing anything to getting the new propeller done anymore. In fact, the opposite may be true. I submit that enough direction is known to produce a workable prototype and propose an end to the discussion until a completed design can be tested.

It may not be fun, but helpful 'advice' is, in my opinion, not helping. Remember how we got into this pickle.

Bill Henning · 2014-04-12 08:28

Thanks Dave! I'll go back and correct my post - you are right. (re 4 vs 1)

Yep, fcache would have to be done in software, but would still help greatly

Dave Hein wrote: »

Bill, I thought there was only 1 line of icache and 1 line of dcache. Where did you see that there are 4 lines?

EDIT: If you are referring to the four ICACHEx and DCACHEx values in the register map, I believe these represent the 4 longs in each of the caches.

EDIT2: I haven't seen any hardware support for an fcache-like mechanism. Wouldn't this have to be done in software like in the P1 LMM interpreter?

jazzed · 2014-04-12 08:48

Invent-O-Doc wrote: »

Although it started out well, I don't think this thread is contributing anything to getting the new propeller done anymore. In fact, the opposite may be true. I submit that enough direction is known to produce a workable prototype and propose an end to the discussion until a completed design can be tested.

It may not be fun, but helpful 'advice' is, in my opinion, not helping. Remember how we got into this pickle.

+1

Chip needs to publish a spec, stop reading the forums, and work to the spec until a bit-file is ready to be posted for testing.

Dave Hein · 2014-04-12 08:56

Invent-O-Doc, I think most of the posts in this thread have been in response to Chip's questions about the necessity of certain instructions. The P1+ hasn't changed much since Chip's post #290 in this thread. Parallax seems more focused on completing the chip in a reasonable time now. I believe the P1+ still looks like this:

16-COG (was 8-COG) P1 Core
2-port (was 4-port) cog memory
16-bit (was 20-bit) multiplier
512K (was 256K) RAM/ROM
32-bit Multiply/Divide Engine in the hub (was in each cog)
Cordic Engine in the hub (was in each cog)
PTRA/PTRB
INDA/INDB
256-long CLUT/FIFO
PTRX/PTRY
Data Cache
170+ (was 360+) Instructions
128-bit (was 256-bit) Hub Bus
4 tasks
hubex
1 Instruction Cache
serial I/O
Pre-emptive threads

mindrobots · 2014-04-12 08:57

jazzed wrote: »

+1

Chip needs to publish a spec, stop reading the forums, and work to the spec until a bit-file is ready to be posted for testing.

+10

...and +10 to Invent-O-Doc

I'm trying real hard not to participate anymore until there is something to test, but this idea makes a lot of sense!!

kwinn · 2014-04-12 13:31

jazzed wrote: »

+1

Chip needs to publish a spec, stop reading the forums, and work to the spec until a bit-file is ready to be posted for testing.

I agree 1000%

Got a little bit busy this past week and could not keep up with the number of posts for the new P2 (P1+ ?) so I am trying to picture the current status of the chip by creating a block diagram of what goes where. Please comment on any errors or omissions. One possible error might be the 16 bit multiply in the cog and the 32 bit multiply/divide in the hub. Are these both going to be included or only one of them? My earlier impression was that it would be a multiply/divide and cordic in the hub that would be fast enough to provide results to the cogs without waiting.

David Betz · 2014-04-12 17:26

jazzed wrote: »

+1

Chip needs to publish a spec, stop reading the forums, and work to the spec until a bit-file is ready to be posted for testing.

I would also favor some conclusion to this whole process. It would be nice to have a spec that is frozen except for fixing bugs along with an FPGA image so we can all test to find any bugs that might be there before it is committed to silicon.

Dave Hein · 2014-04-12 17:47

Kwinn, your diagram looks good, except PTRX and PTRY do not exist since there is no Aux RAM. I think the plan is to put the 32-bit multiplier in the hub, but it may not make sense since a 32-bit multiply can be done with 3 16-bit multiplies and a few shifts and adds. The 32-bit divider would still be useful, even if it's in the hub.
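The three-multiply decomposition can be sketched in C as follows. This is a minimal illustration, assuming an unsigned 16x16 multiply that returns a 32-bit result; `mul16`, `mul32_from_16` and `mul32_check` are hypothetical names, and `mul16` merely stands in for whatever the cog multiplier ends up providing. Only the low 32 bits of the product are produced; a full 64-bit result would need a fourth multiply for the two high halves.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a 16x16 -> 32 hardware multiply (assumed unsigned). */
static uint32_t mul16(uint16_t a, uint16_t b)
{
    return (uint32_t)a * (uint32_t)b;
}

/* Low 32 bits of a 32x32 product from three 16x16 multiplies. */
static uint32_t mul32_from_16(uint32_t a, uint32_t b)
{
    uint16_t al = (uint16_t)(a & 0xFFFFu), ah = (uint16_t)(a >> 16);
    uint16_t bl = (uint16_t)(b & 0xFFFFu), bh = (uint16_t)(b >> 16);

    uint32_t low   = mul16(al, bl);
    /* The sum may wrap mod 2^32; only its low 16 bits survive the shift. */
    uint32_t cross = mul16(ah, bl) + mul16(al, bh);

    /* ah*bh is skipped: it would land entirely above bit 31. */
    return low + (cross << 16);
}

/* Sanity check against the compiler's native 32-bit multiply. */
static void mul32_check(uint32_t a, uint32_t b)
{
    assert(mul32_from_16(a, b) == a * b);
}
```

That works out to three multiplies, one shift and two adds per 32-bit result.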

potatohead · 2014-04-12 18:18

Chip needs to publish a spec, stop reading the forums, and work to the spec until a bit-file is ready to be posted for testing.

I think Chip needs to do what Chip needs to do.

RossH · 2014-04-12 18:19

Bill Henning wrote: »

Re: hubexec
...

Btw, the reason I have not posted more about hubexec is I want to see what Chip comes up with - no point in speculating more until I see what infrastructure he puts in.

...

Thanks for the details. Like you, I think it best to wait till Chip confirms the design before worrying about it too much.

Ross.

RossH · 2014-04-12 18:20

Dave Hein wrote: »

Ross, how did you get the 32kb number? There are applications where each cog could be executing the same HUB code, and may only require a small amount of extra HUB RAM for local storage or stack space. I can't wait to try out my threaded chess program on P1+. It uses 1.2K per cog, but that's because each cog keeps a copy of the chess board on the stack for each level it evaluates. A P1+ should be able to go one more level deeper than a P1, and do it in less time.

Just a worst case. Of course, there will be many programs that do not need all 16 cogs to be executing 2k of code.

Ross.

evanh · 2014-04-12 18:26

Like Chip somehow needs tuition on head down bum up.

David Betz · 2014-04-12 18:27

Dave Hein wrote: »

Ross, how did you get the 32kb number? There are applications where each cog could be executing the same HUB code, and may only require a small amount of extra HUB RAM for local storage or stack space. I can't wait to try out my threaded chess program on P1+. It uses 1.2K per cog, but that's because each cog keeps a copy of the chess board on the stack for each level it evaluates. A P1+ should be able to go one more level deeper than a P1, and do it in less time.

If you mean that more than one COG can run in hubexec mode and run the same hub code with different data that may be difficult to arrange unless the data is accessed by pointers. Having a single copy of the code but multiple copies of global data would require at least a simple MMU with a code base register and a data base register for each COG or even task. Of course, I guess that could be done in software by using, say, PTRB as the data segment pointer and having the compiler generate all data references as offsets from that address.
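A rough sketch of the software approach, written in plain C rather than as compiler output: the "globals" are grouped into a struct and every access goes through a base pointer, which plays the role PTRB would play as a data segment pointer. The names `cog_data` and `bump` are made up for illustration, and how propgcc would actually generate such code is not something this thread establishes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-COG "globals", grouped so one pointer selects a copy. */
typedef struct {
    uint32_t counter;
    uint32_t scratch[4];
} cog_data;

/* Shared code: each "global" is reached as an offset from the base pointer,
   so one copy of the code can serve any number of data copies. */
static uint32_t bump(cog_data *base, uint32_t step)
{
    assert(base != 0);
    base->counter += step;
    base->scratch[0] = base->counter;
    return base->counter;
}
```

Two COGs calling `bump` with different `cog_data` instances never touch each other's state, while anything that is meant to be shared would still live at a fixed hub address.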

evanh · 2014-04-12 18:32

Invent-O-Doc wrote: »

It may not be fun, but helpful 'advice' is, in my opinion, not helping. Remember how we got into this pickle.

What is this pickle? Impatience maybe?

Dave Hein · 2014-04-12 18:46

David Betz wrote: »

If you mean that more than one COG can run in hubexec mode and run the same hub code with different data that may be difficult to arrange unless the data is accessed by pointers. Having a single copy of the code but multiple copies of global data would require at least a simple MMU with a code base register and a data base register for each COG or even task. Of course, I guess that could be done in software by using, say, PTRB as the data segment pointer and having the compiler generate all data references as offsets from that address.

An MMU and/or pointers are not required. The threaded chess program just uses pthreads and variables defined on the stack. All of the pthreads run the same code.

David Betz · 2014-04-12 18:48

Dave Hein wrote: »

An MMU and/or pointers are not required. The threaded chess program just uses pthreads and variables defined on the stack. All of the pthreads run the same code.

True, if you use only stack variables then I guess you can share code. You have to stay away from globals though unless they are shared among all threads. I thought you were talking about essentially running the same main program on multiple COGs at the same time.

Heater. · 2014-04-12 19:16

David Betz,

If you mean that more than one COG can run in hubexec mode and run the same hub code with different data that may be difficult to arrange unless the data is accessed by pointers. Having a single copy of the code but multiple copies of global data would require at least a simple MMU with a code base register and a data base register for each COG or even task. Of course, I guess that could be done in software by using, say, PTRB as the data segment pointer and having the compiler generate all data references as offsets from that address.

You raise a very interesting point there.

But no MMU has been required to do this already with LMM code on the P1.

For example in C you run the same code in many cores by using OpenMP. See code below that runs the inner loop of an FFT on four COGs on a P1.

Well, as you say, when you fire up code in a core it gets its own address space where it can keep its own thread local variables.

That same code run from RAM will fail. Unless it is compiled differently to have some kind of pointer to its own thread local variables.

Is propgcc going to handle this and how?

But wait, thinking about it, this is the same as using pthreads in C. This already works in propgcc as far as I know. Never used it directly but I think OMP sits on top of pthreads. The local, or thread local, variables are just on a different stack for each thread aren't they?

Those writing PASM in their Spin code would need to take care of this manually of course.

    // Parallelize over 4 COGS.
    slices = 4;
    lastLevel = LOG2_FFT_SIZE - 3;
 
    firstLevel = 0;
    for ( ; slices >= 1; slices = slices / 2)
    {
        #pragma omp parallel for default (none) \
                                 shared (bx, by) \
                                 private (slice, s, slen, tid) \
                                 firstprivate(slices, firstLevel, lastLevel) 
        for (slice = 0; slice < slices; slice++)
        {
            s = FFT_SIZE * slice / slices;
            slen = FFT_SIZE / slices;
            butterflies(&bx[s], &by[s], firstLevel, lastLevel, slices, slen);
        }
        lastLevel = lastLevel + 1;
        firstLevel = lastLevel;
    }

Heater. · 2014-04-12 19:18

"Who is there?"... "It's me Dave".

Sorry David, Dave got there first

Heater. · 2014-04-12 19:26

evanh,

helpful 'advice' is, in my opinion, not helping.

No, no, no, you don't want to do it like that, you want to do it like this : http://www.youtube.com/watch?v=nkZdTHmX0TQ

David Betz · 2014-04-12 19:27

Heater. wrote: »
David Betz,

You raise a very interesting point there.

But no MMU has been required to do this already with LMM code on the P1.

For example in C you run the same code in many cores by using OpenMP. See code below that runs the inner loop of an FFT on four COGs on a P1.

Well, as you say, when you fire up code in a core it gets its own address space where it can keep its own thread local variables.

That same code run from RAM will fail. Unless it is compiled differently to have some kind of pointer to its own thread local variables.

Is propgcc going to handle this and how?

But wait, thinking about it, this is the same as using pthreads in C. This already works in propgcc as far as I know. Never used it directly but I think OMP sits on top of pthreads. The local, or thread local, variables are just on a different stack for each thread aren't they?

Those writing PASM in their Spin code would need to take care of this manually of course.
    // Parallelize over 4 COGS.
    slices = 4;
    lastLevel = LOG2_FFT_SIZE - 3;
 
    firstLevel = 0;
    for ( ; slices >= 1; slices = slices / 2)
    {
        #pragma omp parallel for default (none) \
                                 shared (bx, by) \
                                 private (slice, s, slen, tid) \
                                 firstprivate(slices, firstLevel, lastLevel) 
        for (slice = 0; slice < slices; slice++)
        {
            s = FFT_SIZE * slice / slices;
            slen = FFT_SIZE / slices;
            butterflies(&bx[s], &by[s], firstLevel, lastLevel, slices, slen);
        }
        lastLevel = lastLevel + 1;
        firstLevel = lastLevel;
    }

You can certainly run the same code in multiple threads. What I meant is that you can't start the same main program in multiple COGs at the same time because they would share the same globals and get into trouble. I guess that isn't what the original poster was looking to do though so my comment is probably irrelevant.

Cluso99 · 2014-04-12 19:33

FWIW the Instruction and Data Caches are both 4 longs wide (Quad). They are built using the shadow registers of each cog. (see the register map over on the instruction thread, post #2)

Each instruction takes 2 clocks (100 MIPS @ 200MHz) and there is no pipeline.

Hub access is currently 1:16 clocks (no smarts). This gives a hub access every 8 instructions (as per P1).

Hub-Cog transfers can be quad longs. (4 bytes * 200MHz / 16 = 50MB/s per cog, 800MB/s hub)
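The arithmetic in the parentheses counts 4 bytes (one long) per hub slot. As a hedged sketch, the helper below reproduces that figure and also shows what the same formula gives if a full quad (16 bytes) moves in each slot. The function name `cog_bytes_per_sec` is hypothetical, and whether a quad really transfers in a single slot is an assumption, not something confirmed here.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per second one cog can move if it gets one hub slot every
   clocks_per_slot clocks and transfers bytes_per_slot bytes in that slot. */
static uint64_t cog_bytes_per_sec(uint64_t clock_hz, uint32_t clocks_per_slot,
                                  uint32_t bytes_per_slot)
{
    assert(clocks_per_slot > 0);
    return clock_hz / clocks_per_slot * bytes_per_slot;
}
```

At 200 MHz with one slot per 16 clocks, `cog_bytes_per_sec(200000000, 16, 4)` is 50,000,000 (50MB/s per cog, 800MB/s across 16 cogs), while `cog_bytes_per_sec(200000000, 16, 16)` is 200,000,000 (200MB/s per cog) under the quad-per-slot assumption.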

Roy Eltham · 2014-04-12 19:53

Chip will do what he wants to do, which so far has been the process we have been through. Like it or not, it's Chip's choice, not yours.

As Ken alluded to before, he's seen Chip go through the complex back to simple phases on previous projects. I think the end result of this "new P1+" chip is going to be worthy of calling a P2. It's a massive upgrade to the P1, and includes many of the major elements of what was the P2, just in a more refined way. Ultimately, we've ended up where we were destined to be all along. Remember this Verilog and synthesis method was a learning process for Chip. Knowing what he knows now, he can better plan how to do things for the next chip (after the P2 (what you all are calling P1+)).

I doubt what was the P2 will stay in its current form (even on a smaller process). It'll probably go through a phase or two of restructuring and end up being worthy of being called P3.

potatohead · 2014-04-12 20:04

That is pretty spot on Roy.

Rayman · 2014-04-12 20:09

With analog pins, hubexec, cordic, multiplier, 16 cogs, 200 MHz and 512kB, I certainly think it's worthy of being called P2.

I just hope these things can stay. Seemed like last time the forum maybe helped to bloat the feature set into something they couldn't produce economically.

For me, if it were just regular P1 cores with the above features added, I'd be very happy.

RossH · 2014-04-12 20:12

Rayman wrote: »

With analog pins, hubexec, cordic, multiplier, 16 cogs, 200 MHz and 512kB, I certainly think it's worthy of being called P2.

I just hope these things can stay. Seemed like last time the forum maybe helped to bloat the feature set into something they couldn't produce economically.

For me, if it were just regular P1 cores with the above features added, I'd be very happy.

I think everyone here is in furious agreement!

evanh · 2014-04-12 20:18

Well said Roy.

Rayman wrote: »

Seemed like last time the forum maybe helped to bloat the feature set into something they couldn't produce economically.

The issue was not price but thermal generation (and by extension, power consumption was more than USB could supply). It was an issue that was present for a long time but hadn't been accounted for.

kwinn · 2014-04-12 20:31

Dave Hein wrote: »

Kwinn, your diagram looks good, except PTRX and PTRY do not exist since there is no Aux RAM. I think the plan is to put the 32-bit multiplier in the hub, but it may not make sense since a 32-bit multiply can be done with 3 16-bit multiplies and a few shifts and adds. The 32-bit divider would still be useful, even if it's in the hub.

Thanks Dave, I've removed PTRX and Y. Now to see if I can add the I/O pin block.

potatohead · 2014-04-12 20:32

Yep.

The current design is far more optimized for power. Chip will see to that. I'm pretty excited really. Seeing the development process like we did was kind of a let down at first, but then it's obvious how much of an improvement we can get. Good times coming everybody.

With all that we learned on the P2+ design, and I'm calling the current one P2 because it deserves it, the P3 we get someday is going to be really something!

Heater. · 2014-04-12 21:02

Of course it's P2.

I'm furiously agreeing with Rayman as well.
