
The New 16-Cog, 512KB, 64 analog I/O Propeller Chip


Comments

  • Dave Hein Posts: 6,347
    edited 2014-04-12 07:46
    Last I heard (and deduced from the instruction set and register map post by Chip)

    - four lines of four longs as icache
    - four lines of four longs as dcache (this is an improvement over one line in P2 - it will help compiled code a LOT)
    ...
    - fcache-like mechanism would allow full speed inner loops
    Bill, I thought there was only 1 line of icache and 1 line of dcache. Where did you see that there are 4 lines?

    EDIT: If you are referring to the four ICACHEx and DCACHEx values in the register map, I believe these represent the 4 longs in each of the caches.

    EDIT2: I haven't seen any hardware support for an fcache-like mechanism. Wouldn't this have to be done in software like in the P1 LMM interpreter?
  • Invent-O-Doc Posts: 768
    edited 2014-04-12 08:16
    Although it started out well, I don't think this thread is contributing anything to getting the new propeller done anymore. In fact, the opposite may be true. I submit that enough direction is known to produce a workable prototype and propose an end to the discussion until a completed design can be tested.

    It may not be fun, but helpful 'advice' is, in my opinion, not helping. Remember how we got into this pickle.
  • Bill Henning Posts: 6,445
    edited 2014-04-12 08:28
    Thanks Dave! I'll go back and correct my post - you are right. (re 4 vs 1)

    Yep, fcache would have to be done in software, but would still help greatly :)

    Dave Hein wrote: »
    Bill, I thought there was only 1 line of icache and 1 line of dcache. Where did you see that there are 4 lines?

    EDIT: If you are referring to the four ICACHEx and DCACHEx values in the register map, I believe these represent the 4 longs in each of the caches.

    EDIT2: I haven't seen any hardware support for an fcache-like mechanism. Wouldn't this have to be done in software like in the P1 LMM interpreter?
  • jazzed Posts: 11,803
    edited 2014-04-12 08:48
    Invent-O-Doc wrote: »
    Although it started out well, I don't think this thread is contributing anything to getting the new propeller done anymore. In fact, the opposite may be true. I submit that enough direction is known to produce a workable prototype and propose an end to the discussion until a completed design can be tested.

    It may not be fun, but helpful 'advice' is, in my opinion, not helping. Remember how we got into this pickle.

    +1

    Chip needs to publish a spec, stop reading the forums, and work to the spec until a bit-file is ready to be posted for testing.
  • Dave Hein Posts: 6,347
    edited 2014-04-12 08:56
    Invent-O-Doc, I think most of the posts in this thread have been in response to Chip's questions about the necessity of certain instructions. The P1+ hasn't changed much since Chip's post #290 in this thread. Parallax seems more focused on completing the chip in a reasonable time now. I believe the P1+ still looks like this:

    Feature              P2                    P1+
    Cogs                 8                     16 (P1 core)
    Cog memory           4-port                2-port
    Multiplier           20-bit                16-bit
    RAM/ROM              256K                  512K
    Multiply/Divide      32-bit, in each cog   32-bit, in the hub
    Cordic engine        in each cog           in the hub
    Instructions         360+                  170+
    Hub bus              256-bit               128-bit

    Also: PTRA/PTRB, INDA/INDB, 256-long CLUT/FIFO, PTRX/PTRY, data cache,
    4 tasks, hubexec, 1 instruction cache, serial I/O, pre-emptive threads
  • mindrobots Posts: 6,506
    edited 2014-04-12 08:57
    jazzed wrote: »
    +1

    Chip needs to publish a spec, stop reading the forums, and work to the spec until a bit-file is ready to be posted for testing.

    +10

    ...and +10 to Invent-O-Doc

    I'm trying real hard not to participate anymore until there is something to test, but this idea makes a lot of sense!!
  • kwinn Posts: 8,697
    edited 2014-04-12 13:31
    jazzed wrote: »
    +1

    Chip needs to publish a spec, stop reading the forums, and work to the spec until a bit-file is ready to be posted for testing.

    I agree 1000%

    Got a little busy this past week and could not keep up with the number of posts about the new P2 (P1+?), so I am trying to picture the current status of the chip by creating a block diagram of what goes where. Please comment on any errors or omissions. One possible error might be the 16-bit multiply in the cog and the 32-bit multiply/divide in the hub. Are both going to be included, or only one of them? My earlier impression was that it would be a multiply/divide and cordic in the hub, fast enough to provide results to the cogs without waiting.
  • David Betz Posts: 14,511
    edited 2014-04-12 17:26
    jazzed wrote: »
    +1

    Chip needs to publish a spec, stop reading the forums, and work to the spec until a bit-file is ready to be posted for testing.
    I would also favor some conclusion to this whole process. It would be nice to have a spec that is frozen except for bug fixes, along with an FPGA image, so we can all test for any bugs before the design is committed to silicon.
  • Dave Hein Posts: 6,347
    edited 2014-04-12 17:47
    Kwinn, your diagram looks good, except PTRX and PTRY do not exist since there is no Aux RAM. I think the plan is to put the 32-bit multiplier in the hub, but it may not make sense since a 32-bit multiply can be done with three 16-bit multiplies and a few shifts and adds. The 32-bit divider would still be useful, even if it's in the hub.
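
    For illustration, here's a minimal C sketch of that trick (mul32_lo is just a made-up name, not something from a Propeller toolchain); it returns the low 32 bits of the product:

    #include <stdint.h>

    /* Low 32 bits of a 32x32 product from three 16x16 multiplies:
       a*b mod 2^32 = a_lo*b_lo + ((a_lo*b_hi + a_hi*b_lo) << 16)      */
    uint32_t mul32_lo(uint32_t a, uint32_t b)
    {
        uint32_t a_lo = a & 0xFFFF, a_hi = a >> 16;
        uint32_t b_lo = b & 0xFFFF, b_hi = b >> 16;

        uint32_t low  = a_lo * b_lo;          /* 16x16 multiply #1 */
        uint32_t mid1 = a_lo * b_hi;          /* 16x16 multiply #2 */
        uint32_t mid2 = a_hi * b_lo;          /* 16x16 multiply #3 */

        return low + ((mid1 + mid2) << 16);   /* a_hi*b_hi only affects bits above 31 */
    }

    A full 64-bit product would need the fourth multiply (a_hi*b_hi) plus carry handling, and division has no such cheap substitute, which is why the hub divider still earns its keep.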
  • potatohead Posts: 10,255
    edited 2014-04-12 18:18
    jazzed wrote: »
    Chip needs to publish a spec, stop reading the forums, and work to the spec until a bit-file is ready to be posted for testing.

    I think Chip needs to do what Chip needs to do.
  • RossH Posts: 5,360
    edited 2014-04-12 18:19
    Re: hubexec
    ...

    Btw, the reason I have not posted more about hubexec is I want to see what Chip comes up with - no point in speculating more until I see what infrastructure he puts in.

    ...

    Thanks for the details. Like you, I think it best to wait until Chip confirms the design before worrying about it too much.

    Ross.
  • RossH Posts: 5,360
    edited 2014-04-12 18:20
    Dave Hein wrote: »
    Ross, how did you get the 32kb number? There are applications where each cog could be executing the same HUB code, and may only require a small amount of extra HUB RAM for local storage or stack space. I can't wait to try out my threaded chess program on P1+. It uses 1.2K per cog, but that's because each cog keeps a copy of the chess board on the stack for each level it evaluates. A P1+ should be able to go one more level deeper than a P1, and do it in less time.

    Just a worst case. Of course, there will be many programs that do not need all 16 cogs to be executing 2k of code.

    Ross.
  • evanh Posts: 15,263
    edited 2014-04-12 18:26
    Like Chip somehow needs tuition on head down bum up.
  • David Betz Posts: 14,511
    edited 2014-04-12 18:27
    Dave Hein wrote: »
    Ross, how did you get the 32kb number? There are applications where each cog could be executing the same HUB code, and may only require a small amount of extra HUB RAM for local storage or stack space. I can't wait to try out my threaded chess program on P1+. It uses 1.2K per cog, but that's because each cog keeps a copy of the chess board on the stack for each level it evaluates. A P1+ should be able to go one more level deeper than a P1, and do it in less time.
    If you mean that more than one COG can run in hubexec mode and run the same hub code with different data, that may be difficult to arrange unless the data is accessed by pointers. Having a single copy of the code but multiple copies of global data would require at least a simple MMU with a code base register and a data base register for each COG or even task. Of course, I guess that could be done in software by using, say, PTRB as the data segment pointer and having the compiler generate all data references as offsets from that address.
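
    A rough C sketch of that software approach, with purely made-up names (cog_data_t, bump, start_cog are illustrative, not propgcc API):

    /* One shared copy of code; every "global" reference is generated as an
       offset from a per-cog data base pointer rather than a fixed hub address. */
    typedef struct {
        int  count;
        char buf[32];
    } cog_data_t;

    static cog_data_t cog_blocks[16];     /* one data block per cog */

    static void bump(cog_data_t *base)    /* base would live in, say, PTRB */
    {
        base->count++;                    /* compiled as a base-relative access */
    }

    void start_cog(int cogid)
    {
        bump(&cog_blocks[cogid]);         /* same code, this cog's data */
    }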
  • evanh Posts: 15,263
    edited 2014-04-12 18:32
    Invent-O-Doc wrote: »
    It may not be fun, but helpful 'advice' is, in my opinion, not helping. Remember how we got into this pickle.

    What is this pickle? Impatience maybe?
  • Dave Hein Posts: 6,347
    edited 2014-04-12 18:46
    David Betz wrote: »
    If you mean that more than one COG can run in hubexec mode and run the same hub code with different data, that may be difficult to arrange unless the data is accessed by pointers. Having a single copy of the code but multiple copies of global data would require at least a simple MMU with a code base register and a data base register for each COG or even task. Of course, I guess that could be done in software by using, say, PTRB as the data segment pointer and having the compiler generate all data references as offsets from that address.
    An MMU and/or pointers are not required. The threaded chess program just uses pthreads and variables defined on the stack. All of the pthreads run the same code.
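
    Not the chess program itself, just a minimal pthreads sketch of that pattern (worker, NTHREADS and the board array are placeholders): every thread runs the same function, and its working data lives on its own stack.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    /* All threads execute the same code; "board" is on each thread's own
       stack, so they never collide even though the code is shared. */
    static void *worker(void *arg)
    {
        int id = *(int *)arg;
        int board[64];                    /* per-thread copy, on this stack */

        for (int i = 0; i < 64; i++)
            board[i] = id;                /* stand-in for the real work */

        printf("thread %d done, board[0] = %d\n", id, board[0]);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        int ids[NTHREADS];

        for (int i = 0; i < NTHREADS; i++) {
            ids[i] = i;
            pthread_create(&tid[i], NULL, worker, &ids[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }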
  • David Betz Posts: 14,511
    edited 2014-04-12 18:48
    Dave Hein wrote: »
    An MMU and/or pointers are not required. The threaded chess program just uses pthreads and variables defined on the stack. All of the pthreads run the same code.
    True, if you use only stack variables then I guess you can share code. You have to stay away from globals though unless they are shared among all threads. I thought you were talking about essentially running the same main program on multiple COGs at the same time.
  • Heater. Posts: 21,230
    edited 2014-04-12 19:16
    David Betz,
    If you mean that more than one COG can run in hubexec mode and run the same hub code with different data, that may be difficult to arrange unless the data is accessed by pointers. Having a single copy of the code but multiple copies of global data would require at least a simple MMU with a code base register and a data base register for each COG or even task. Of course, I guess that could be done in software by using, say, PTRB as the data segment pointer and having the compiler generate all data references as offsets from that address.
    You raise a very interesting point there.

    But no MMU has been required to do this already with LMM code on the P1.

    For example in C you run the same code in many cores by using OpenMP. See code below that runs the inner loop of an FFT on four COGs on a P1.

    Well, as you say, when you fire up code in a core it gets its own address space where it can keep its own thread-local variables.

    That same code run from RAM will fail, unless it is compiled differently to have some kind of pointer to its own thread-local variables.

    Is propgcc going to handle this, and how?

    But wait, thinking about it, this is the same as using pthreads in C. This already works in propgcc as far as I know. I've never used it directly, but I think OMP sits on top of pthreads. The local, or thread-local, variables are just on a different stack for each thread, aren't they?

    Those writing PASM in their Spin code would need to take care of this manually of course.
        // Parallelize over 4 COGS.
        slices = 4;
        lastLevel = LOG2_FFT_SIZE - 3;
     
        firstLevel = 0;
        for ( ; slices >= 1; slices = slices / 2)
        {
            #pragma omp parallel for default (none) \
                                     shared (bx, by) \
                                     private (slice, s, slen, tid) \
                                     firstprivate(slices, firstLevel, lastLevel) 
            for (slice = 0; slice < slices; slice++)
            {
                s = FFT_SIZE * slice / slices;
                slen = FFT_SIZE / slices;
                butterflies(&bx[s], &by[s], firstLevel, lastLevel, slices, slen);
            }
            lastLevel = lastLevel + 1;
            firstLevel = lastLevel;
        }
    
  • Heater. Posts: 21,230
    edited 2014-04-12 19:18
    "Who is there?"... "It's me Dave".

    Sorry David, Dave got there first :)
  • Heater. Posts: 21,230
    edited 2014-04-12 19:26
    evanh,
    helpful 'advice' is, in my opinion, not helping.
    No, no, no, you don't want to do it like that, you want to do it like this : http://www.youtube.com/watch?v=nkZdTHmX0TQ
  • David Betz Posts: 14,511
    edited 2014-04-12 19:27
    Heater. wrote: »
    David Betz,

    You raise a very interesting point there.

    But no MMU has been required to do this already with LMM code on the P1.

    For example in C you run the same code in many cores by using OpenMP. See code below that runs the inner loop of an FFT on four COGs on a P1.

    Well, as you say, when you fire up code in a core it gets its own address space where it can keep its own thread-local variables.

    That same code run from RAM will fail, unless it is compiled differently to have some kind of pointer to its own thread-local variables.

    Is propgcc going to handle this, and how?

    But wait, thinking about it, this is the same as using pthreads in C. This already works in propgcc as far as I know. I've never used it directly, but I think OMP sits on top of pthreads. The local, or thread-local, variables are just on a different stack for each thread, aren't they?

    Those writing PASM in their Spin code would need to take care of this manually of course.
        // Parallelize over 4 COGS.
        slices = 4;
        lastLevel = LOG2_FFT_SIZE - 3;
     
        firstLevel = 0;
        for ( ; slices >= 1; slices = slices / 2)
        {
            #pragma omp parallel for default (none) \
                                     shared (bx, by) \
                                     private (slice, s, slen, tid) \
                                     firstprivate(slices, firstLevel, lastLevel) 
            for (slice = 0; slice < slices; slice++)
            {
                s = FFT_SIZE * slice / slices;
                slen = FFT_SIZE / slices;
                butterflies(&bx[s], &by[s], firstLevel, lastLevel, slices, slen);
            }
            lastLevel = lastLevel + 1;
            firstLevel = lastLevel;
        }
    
    You can certainly run the same code in multiple threads. What I meant is that you can't start the same main program in multiple COGs at the same time because they would share the same globals and get into trouble. I guess that isn't what the original poster was looking to do though so my comment is probably irrelevant.
  • Cluso99 Posts: 18,069
    edited 2014-04-12 19:33
    FWIW, the Instruction and Data Caches are both 4 longs wide (quad). They are built using the shadow registers of each cog (see the register map over on the instruction thread, post #2).

    Each instruction takes 2 clocks (100 MIPs @ 200MHz) and there is no pipeline.

    Hub access is currently 1:16 clocks (no smarts). At 2 clocks per instruction, that gives a hub access every 8 instructions (as per P1).

    Hub-Cog transfers can be quad longs. (4 bytes * 200MHz / 16 = 50MB/s per cog, 800MB/s hub)
  • Roy Eltham Posts: 2,996
    edited 2014-04-12 19:53
    Chip will do what he wants to do, which so far has been the process we have been through. Like it or not, it's Chip's choice, not yours.

    As Ken alluded to before, he's seen Chip go through the complex-back-to-simple phases on previous projects. I think the end result of this "new P1+" chip is going to be worthy of being called a P2. It's a massive upgrade to the P1, and includes many of the major elements of what was the P2, just in a more refined way. Ultimately, we've ended up where we were destined to be all along. Remember, this Verilog and synthesis method was a learning process for Chip. Knowing what he knows now, he can better plan how to do things for the next chip (after the P2 (what you all are calling P1+)).

    I doubt what was the P2 will stay in its current form (even on a smaller process). It'll probably go through a phase or two of restructuring and end up being worthy of being called P3.
  • potatohead Posts: 10,255
    edited 2014-04-12 20:04
    That is pretty spot on Roy.
  • Rayman Posts: 14,021
    edited 2014-04-12 20:09
    With analog pins, hubexec, cordic, multiplier, 16 cogs, 200 MHz and 512kB, I certainly think it's worthy of being called P2.

    I just hope these things can stay. Seemed like last time the forum maybe helped to bloat the feature set into something they couldn't produce economically.

    For me, if it were just regular P1 cores with the above features added, I'd be very happy.
  • RossH Posts: 5,360
    edited 2014-04-12 20:12
    Rayman wrote: »
    With analog pins, hubexec, cordic, multiplier, 16 cogs, 200 MHz and 512kB, I certainly think it's worthy of being called P2.

    I just hope these things can stay. Seemed like last time the forum maybe helped to bloat the feature set into something they couldn't produce economically.

    For me, if it were just regular P1 cores with the above features added, I'd be very happy.

    I think everyone here is in furious agreement! :smile:
  • evanh Posts: 15,263
    edited 2014-04-12 20:18
    Well said Roy.
    Rayman wrote: »
    Seemed like last time the forum maybe helped to bloat the feature set into something they couldn't produce economically.

    The issue was not price but heat generation (and, by extension, power consumption greater than USB could supply). It was an issue that had been present for a long time but hadn't been accounted for.
  • kwinn Posts: 8,697
    edited 2014-04-12 20:31
    Dave Hein wrote: »
    Kwinn, your diagram looks good, except PTRX and PTRY do not exist since there is no Aux RAM. I think the plan is to put the 32-bit multiplier in the hub, but it may not make sense since a 32-bit multiply can be done with 3 16-bit multiplies and a few shifts and adds. The 32-bit divider would still be useful, even if it's in the hub.

    Thanks Dave, I've removed PTRX and Y. Now to see if I can add the I/O pin block.
  • potatohead Posts: 10,255
    edited 2014-04-12 20:32
    Yep.

    The current design is far more optimized for power; Chip will see to that. I'm pretty excited, really. Seeing the development process play out like we did was kind of a letdown at first, but then it's obvious how much of an improvement we can get. Good times coming, everybody.

    With all that we learned on the P2+ design, and I'm calling the current one P2 because it deserves it, the P3 we get someday is going to be really something!
  • Heater. Posts: 21,230
    edited 2014-04-12 21:02
    Of course it's P2.

    I'm furiously agreeing with Rayman as well.