New Hub Scheme For Next Chip - Page 10 — Parallax Forums

New Hub Scheme For Next Chip


Comments

  • markmark Posts: 252
    edited 2014-05-14 20:52
    jmg wrote: »
    some cycles need to be stolen to move into the dest register in COG memory.
    I haven't thought of this, and you're probably right. If that is indeed the case, then it would need to be split into two instructions. Still worthwhile, I think.
  • Lawson Posts: 870
    edited 2014-05-14 21:00
    jmg wrote: »
    Yes, Not pretty.

    It's looking like a split opcode & more choices (I think someone else mentioned split opcodes?) would be best
    RDREQ - issues Read Address - (resets RDGET flags ?)
    ...
    RDGET - Fetches RdReq result, stalls if not ready yet (must be paired with RDREQ)

    WRREQ Send Register to Write Buffer. Stalls if Buffer not empty.

    WRDIR Direct Write, stalls until Nibble==
    RDDIR Direct Read, stalls until Nibble==


    I think those would allow higher average bandwidths, (worst case is the same) but not solve cycle-determinism, which needs SNAPCNT or similar.
    SNAPCNT could optionally 'attach' to those opcodes.
    Tools could advise on spacing, but the auto-stall protects users.

    Nice simple description of the op-codes that would be needed to support a buffered read and write. I actually think the RDREQ/RDGET pair will improve determinism of the code, assuming you put 16-18 clocks between them. Basically this pair allows you to do a hub read without a hub sync as long as you know the read address soon enough (just like the RDBLOCK instruction). WRREQ would be the same. Keep your write frequency down, and your code would never have to sync with the hub. Biggest cost I see for code is that using RDREQ, RDGET, and WRREQ would have a higher minimum read-modify-write latency.

    Marty
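    To make the timing concrete, here is a toy Python model of the proposed RDREQ/RDGET pair. The 16-slice hub, the one-slice-per-clock rotation, and the function names are all assumptions for illustration; the opcodes themselves are only a proposal in this thread, not anything implemented.

    ```python
    # Toy model of the proposed split hub read (RDREQ/RDGET).
    # Assumptions (not from any datasheet): 16 hub slices, the rotor
    # advances one slice per system clock, and a read completes on the
    # clock the rotor passes the addressed slice.

    def hub_read_latency(issue_clock, addr_slice, n_slices=16):
        """Clocks from RDREQ issue until the addressed slice comes around."""
        return (addr_slice - issue_clock) % n_slices

    def rdget_stall(issue_clock, addr_slice, gap):
        """Extra stall clocks RDGET sees if issued `gap` clocks after RDREQ."""
        latency = hub_read_latency(issue_clock, addr_slice)
        return max(0, latency - gap)

    # Worst case: the slice just passed, so the latency is 15 clocks.
    # With 16 clocks of useful work between RDREQ and RDGET, the stall
    # in this toy model is always zero -- the real pipeline may need
    # the 16-18 clocks Lawson mentions above.
    worst = max(rdget_stall(t, s, 16) for t in range(16) for s in range(16))
    print(worst)  # 0 -> a 16-clock gap hides the rotation entirely
    ```

    The same model shows why determinism improves: the stall is a pure function of the gap you schedule, not of when the loop happens to start.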
  • jmg Posts: 15,140
    edited 2014-05-14 21:11
    [QUOTE=mark
  • Cluso99 Posts: 18,066
    edited 2014-05-14 21:11
    Please forget all these RDREQ etc. The perceived benefits are not really there because you still require the clocks to do the transfer anyway. It is not like it's free, because it has to access the cog to perform the transfer.

    It is much better for Chip to spend the time getting basic hubexec working which will bring far more benefits from this new hub scheme.
  • Lawson Posts: 870
    edited 2014-05-14 21:16
    cgracey wrote: »
    It takes ~9,000 LE's, which is equivalent to ~2 cogs. It's hefty, for sure, but buys a lot of performance.

    Yep, that's a muck'n big crossbar switch. :cool: Wonder why it's not compiled as run-time reconfiguring of the FPGA routing fabric? (too slow?)

    The Wikipedia entry for Nonblocking minimal spanning switches shows that there are ways to shrink cross-bar switches at the cost of more complexity. (lots of useful looking hyper-links and search terms in that Wiki article too)

    Marty
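    For a sense of the numbers behind that Wikipedia pointer, here is a quick sketch comparing crosspoint counts for a full crossbar against a strictly non-blocking three-stage Clos network. These are textbook formulas; nothing here is specific to the P2 hub mux or its LE count.

    ```python
    # Crosspoint counts: full N x N crossbar vs. a strictly
    # non-blocking three-stage Clos network with N = r * n inputs
    # and m = 2n - 1 middle switches (the classic construction).

    def crossbar(N):
        """An N x N crossbar needs N^2 crosspoints."""
        return N * N

    def clos(n, r):
        """r ingress (n x m), m middle (r x r), r egress (m x n)."""
        m = 2 * n - 1
        return r * n * m + m * r * r + r * m * n

    # For 36 inputs (n = 6, r = 6) the Clos form is already smaller,
    # and the savings grow with N -- the "more complexity" cost the
    # Wiki article describes is the extra routing stages.
    print(crossbar(36), clos(6, 6))  # 1296 1188
    ```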
  • jmg Posts: 15,140
    edited 2014-05-14 21:17
    Cluso99 wrote: »
    Please forget all these RDREQ etc. The perceived benefits are not really there because you still require the clocks to do the transfer anyway. It is not like it's free, because it has to access the cog to perform the transfer.

    Sure, the cycles per HUB access are not changed, but what has changed is the other code you can run while the HUB is rotating to the needed slot.
    Think of it as boosting opcode bandwidth. In operation, rather like a buffered UART where you can do other stuff while waiting for new data, which you know will be along eventually.
  • kwinn Posts: 8,697
    edited 2014-05-14 21:56
    That would be your pay rate wouldn't it? Me, I'm just a poor hardware guy who dabbles in programming. A veritable neophyte.
    RossH wrote: »
    Ha! This explains why you guys insist on coming up with such complicated proposals - it's all a massive conspiracy to raise programmer's pay rates! :lol:
  • RossH Posts: 5,333
    edited 2014-05-14 22:34
    kwinn wrote: »
    That would be your pay rate wouldn't it? Me, I'm just a poor hardware guy who dabbles in programming. A veritable neophyte.

    Don't let it get you down.
  • kwinn Posts: 8,697
    edited 2014-05-14 22:36
    I'm not opposed to a simple wait or the snapcnt instruction. As you say, it is useful in and out of hub loops. That does not make a R/W that lets execution continue less useful or more risky. The only risk I can see is that the programmer tries to use the data being read before it arrives, or overwrites the data in a register before it is written (if the implementation does not buffer the address and data). Bad programmer... don't do that.

    As to the no gain/stolen cycles argument, the data has to be written to the register so the time is lost either way, and what you gain is the execution of several instructions instead of a stalled cog for several cycles. If the loop is shorter than the time between access to the hub block it would automatically synchronize to that hub block access with no fuss or muss.

    jmg wrote: »
    See my expanded description of SNAPCNT here
    http://forums.parallax.com/showthread.php/155692-Nibble-Carry-Higher-speed-Buffers-FIFOs-using-new-HUB-Rotate

    It does pretty much what you say - ie has a bit to attach the wait to WR or RD, so those opcodes apply the snap, but you need to define the delay value, so that is why a separate opcode (or config register) is suggested.

    The SNAPCNT is quite useful outside of HUB loops, which is another reason to have an opcode for it.
  • kwinn Posts: 8,697
    edited 2014-05-14 22:44
    RossH wrote: »
    Don't let it get you down.

    Doesn't get me down at all. I enjoy my work and dabbling in programming. I did write software for a number of years but after getting a taste of field work I just could not go back to sitting in an office all day. Thanks for the thought though.
  • jmg Posts: 15,140
    edited 2014-05-14 22:52
    kwinn wrote: »
    I'm not opposed to a simple wait or the snapcnt instruction. As you say, it is useful in and out of hub loops. That does not make a R/W that lets execution continue less useful or more risky. The only risk I can see is that the programmer tries to use the data being read before it arrives, or overwrites the data in a register before it is written (if the implementation does not buffer the address and data). Bad programmer... don't do that.

    The split read manages that, with low programmer risk, and no extra cost (as the transfer has to use cycles).
    Relying on hidden transfers is a very bad idea, and not HLL friendly.
    Any non-stalling solution would have to include a single buffer level, so a write is 'self-protecting'.
  • Brian Fairchild Posts: 549
    edited 2014-05-14 23:09
    My head hurts. I also find it amusing that in another topic two people, whose work I admire, have difficulty communicating how this new scheme works and the timing implications. Can you imagine what someone browsing datasheets for a new processor will think?

    This new processor is in danger of becoming a case study in how not to do things. Spot a problem, focus in on it, solve it by adopting a clever complex solution, rinse, repeat. Nowhere is anyone standing back and looking at the bigger picture. There's a good reason most mainstream chip makers have moved to linear memory maps and why they do things like analyse how C compilers work.

    It's been stated that C is a major 'must-have' for this chip. I've been using it, and other embedded languages, for over 30 years whilst keeping an eye on what goes on under the hood. I used to be able to decompile PL/M80 and PL/M86 by eye. And I'll make this prediction...C compilers for this chip will not generate anything like optimal code without a lot of work on the part of the compiler writers and the programmer. It's not going to happen. Are programmers really expected to have to use a tool to check how their data access timings will work out? How will the compiler optimise code to make best use of this new access scheme? It's madness.

    The good news is I've discovered an even better access scheme and have got a picture which illustrates it...

    [attachment: load-reactions.jpg]
  • RossH Posts: 5,333
    edited 2014-05-14 23:23
    My head hurts. I also find it amusing that in another topic two people, whose work I admire, have difficulty communicating how this new scheme works and the timing implications. Can you imagine what someone browsing datasheets for a new processor will think?

    This new processor is in danger of becoming a case study in how not to do things. Spot a problem, focus in on it, solve it by adopting a clever complex solution, rinse, repeat. Nowhere is anyone standing back and looking at the bigger picture. There's a good reason most mainstream chip makers have moved to linear memory maps and why they do things like analyse how C compilers work.

    It's been stated that C is a major 'must-have' for this chip. I've been using it, and other embedded languages, for over 30 years whilst keeping an eye on what goes on under the hood. I used to be able to decompile PL/M80 and PL/M86 by eye. And I'll make this prediction...C compilers for this chip will not generate anything like optimal code without a lot of work on the part of the compiler writers and the programmer. It's not going to happen. Are programmers really expected to have to use a tool to check how their data access timings will work out? How will the compiler optimise code to make best use of this new access scheme? It's madness.

    Agreed 110%. If you can't explain it in a couple of paragraphs and one diagram, then it ain't gonna fly.

    I do think Chip's basic scheme has merit - but if he adopts it then we don't need all these additional instructions and complications. For those cogs that need absolute determinism, an additional instruction or two fixes the problem (but loses any speed advantage).

    Ross.
  • Roy Eltham Posts: 2,996
    edited 2014-05-14 23:34
    I don't think we need any additions or changes from what Chip has implemented and originally described.

    I think people just need to stop thinking it's more complex than it is, and stop trying to make it more complex than it is with new stuff that really doesn't buy you anything worthwhile.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2014-05-14 23:38
    Roy,

    +1

    -Phil
  • Cluso99 Posts: 18,066
    edited 2014-05-14 23:55
    Roy,
    Yes it is simple, and easy to explain, especially with the lazy susan concept. And no, it doesn't need to be any more complex.
    And it certainly delivers on throughput - each cog could achieve 800MB/s in parallel (with tricks of course).

    But, by the same token, it doesn't hurt for some of us to explore other possibilities. After all, if you and Chip hadn't explored other ideas, you would not have come up with this. In fact, many of us have been trying to work out ways to increase the hub bandwidth (while Chip was off in other parts of the design anyway) while many, IIRC you included, have tried to shut down the ideas discussion.
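    For anyone who wants the lazy susan in code rather than pictures, here is a minimal sketch. It assumes 16 cogs, 16 RAM slices, and one slice advanced per clock; the slice-selection detail is my assumption for illustration, not a spec.

    ```python
    # Minimal model of the "lazy susan" hub: every clock the 16 RAM
    # slices rotate past the 16 cogs, so each cog faces a different
    # slice each clock and all 16 can transfer in parallel without
    # colliding.

    N = 16

    def slice_facing(cog, clock):
        """RAM slice that cog `cog` can access on clock `clock`."""
        return (cog + clock) % N

    # At any instant, no two cogs face the same slice...
    for t in range(N):
        assert len({slice_facing(c, t) for c in range(N)}) == N

    # ...and over 16 clocks each cog sweeps every slice exactly once,
    # which is where the 16-longs-per-rotation bandwidth comes from.
    assert sorted(slice_facing(0, t) for t in range(N)) == list(range(N))
    print("no collisions; full sweep every 16 clocks")
    ```

    This is also why the cogs stay equal: the rotation is the same for every cog, just phase-shifted.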
  • Brian Fairchild Posts: 549
    edited 2014-05-15 00:04
    Cluso99 wrote: »
    ...each cog could achieve 800MB/s in parallel (with tricks of course).

    But that's one of the things that concerns me. How on earth do you put that as a bullet point on a datasheet?
    # core to memory access rate 800MB/s *1
    
    *1 with tricks of course
    

    ;)
  • Roy Eltham Posts: 2,996
    edited 2014-05-15 00:09
    Cluso,
    I am fine with exploring ideas. I'm not fine with adding complications to this when it's not warranted. I was against all the hub sharing ideas because they all involved making the cogs unequal (and most of them were complicated messes).

    This solution came about from a discussion with Chip while he was explaining to me how things were currently working in the ALU/main cog pipeline, and how the memory stuff was arranged and split apart. A big part of the reason we "went with it" was because it simplified things, allowed for going down to 32bit lower power memory setups, and because it kept with the spirit of the Propeller where all the cogs are equal and independent of each other. It also gave us better overall bandwidth between cog/hub than we had even for the old P2 design. It was one of those things where Chip was super excited the whole time, which in my experience means very cool things. :)
  • Brian Fairchild Posts: 549
    edited 2014-05-15 00:09
    Cluso99 wrote: »
    Yes it is simple, and easy to explain, especially with the lazy susan concept.

    So, let's assume I'm a vegetarian (I am) and all the veggie dishes are placed next to each other. How long will I have to wait for my meal which consists of 4 different dishes given...

    a) that it's constantly rotating
    b) it takes me a finite time to transfer the food from the dish to my plate
    c) different dishes take different amounts of time to transfer
    d) I like some things more than others

    Oh wait, I know, there's an app for that. :)
  • rod1963 Posts: 752
    edited 2014-05-15 00:17
    I have to agree with Brian; for an outsider this scheme is a bit of a brain bruiser. I think I get it, but it's complicated. Complicated enough that until I see a working FPGA image and a C compiler that generates code for it without forcing the coder to be aware of the underlying architecture, I'll remain skeptical.
  • Brian Fairchild Posts: 549
    edited 2014-05-15 00:19
    RossH wrote: »
    If you can't explain it in a couple of paragraphs and one diagram, then it ain't gonna fly.
    The first thing that has to go is any attempt to explain it by way of analogy. Gears, hubs, lazy susans, Ferris wheels: they all have to go. It's a big rectangular box in the middle of the block diagram that people have to explain.

    [attachment: p16x64a.jpg]
  • Roy Eltham Posts: 2,996
    edited 2014-05-15 00:20
    There's no trick with the cog/hub memory bandwidth. It's 16 longs every 18 clocks, which is about 711MB/s. In real-world practical use cases (where you are actually doing work to read/write the data from/to pins) you will probably not realize the full bandwidth except in short bursts.

    The 800MB/s number comes from the fact that during the 16-long transfer period it is going at the 800MB/s rate, but then you have 2 clock gaps between each one of those for the RDBLOC/WRBLOCK instruction.

    The REAL kicker is that if multiple cogs are working together you could easily realize much higher overall throughput (up to around 12GB/s actually being possible).
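    Roy's arithmetic checks out, assuming a 200 MHz system clock (the clock rate is my assumption here; the 16-longs-per-18-clocks figure is his):

    ```python
    # Bandwidth numbers from the post above, at an assumed 200 MHz
    # system clock with 4-byte (long) transfers.
    F = 200e6   # system clock, Hz (assumption)
    LONG = 4    # bytes per long

    sustained = 16 * LONG * F / 18   # 16 longs every 18 clocks
    burst     = 1 * LONG * F         # 1 long per clock inside a block
    all_cogs  = 16 * burst           # every cog bursting at once

    print(f"{sustained/1e6:.0f} MB/s")  # 711 MB/s sustained per cog
    print(f"{burst/1e6:.0f} MB/s")      # 800 MB/s during the burst
    print(f"{all_cogs/1e9:.1f} GB/s")   # 12.8 GB/s, "around 12GB/s"
    ```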
  • jmg Posts: 15,140
    edited 2014-05-15 00:26
    Roy Eltham wrote: »
    I don't think we need any additions or changes from what Chip has implemented and originally described.

    I think people just need to stop thinking it's more complex than it is, and stop trying to make it more complex than it is with new stuff that really doesn't buy you anything worthwhile.

    Sounds great.

    Now show me how what Chip has described, (no improvements), can stream Data continually into HUB, at 3 SysCLKs per sample, no jitter.

    Or is that sort of speed gain, over what we have now, what you meant by 'worthwhile' ?
  • Roy Eltham Posts: 2,996
    edited 2014-05-15 00:31
    jmg wrote: »
    Sounds great.

    Now show me how what Chip has described, (no improvements), can stream Data continually into HUB, at 3 SysCLKs per sample, no jitter.

    Or is that sort of speed gain, over what we have now, what you meant by 'worthwhile' ?

    3 SysCLKs per sample? That's slower than WRBLOC is now. It's effectively 1.125 SysCLKs per long now. Jitter only matters when hitting the pins, not when hitting HUB.
  • jmg Posts: 15,140
    edited 2014-05-15 00:33
    The first thing that has to go is any attempt to explain it by way of any analogy. Gears, hubs, lazy susan's, Ferris wheels, they all have to go. It's a big rectangular box in the middle of the block diagram that people have to explain.

    Why "a big rectangular box", when the rotation model is exactly how this actually works?
    But that's one of the things that concerns me. How on earth do you put that as a bullet point on a datasheet?

    That number is easy, but of more interest to customers is simple examples of what the device can actually DO.
    For example,
    "be configured so 3 COGs can stream Pin information into Main Memory at 200Ms/s (3 SysCLKs each)"
    and
    "be configured so 3 more COGs can stream Pin from Main Memory at 200Ms/s (3 SysCLKs each)"
    and
    Do this at the same time. With no Bandwidth impact on the 10 COGS left.
    What would you like to do with the remaining 10 COGs ?
  • jmg Posts: 15,140
    edited 2014-05-15 00:35
    Roy Eltham wrote: »
    3 SysCLKs per sample? That's slower than WRBLOC is now. It's effectively 1.125 SysCLKs per long now. Jitter only matters when hitting the pins, not when hitting HUB.

    Yes, I want to sample the pins at 3SysCLKs, no jitter. (or Drive the Pins, at 3 SysCLKs, no jitter.)
  • Roy Eltham Posts: 2,996
    edited 2014-05-15 00:44
    Instructions are 2 sysclks each. Reading INA is possible every 2 sysclks, so you could in theory read pins at 2 sysclks per sample in bursts on a single cog, and then burst that out to HUB. With 2 cogs doing it, you could achieve continuous reading of pins at that rate. Not sure what you are advocating changing or doing for your 3-sysclk thing, but 2 sysclks seems doable already...
  • jmg Posts: 15,140
    edited 2014-05-15 00:47
    Roy Eltham wrote: »
    Instructions are 2 sysclks each. Reading INA is possible every 2 sysclks, so you could in theory read pins at 2 sysclks per sample in bursts on a single cog, and then burst that out to HUB. With 2 cogs doing it, you could achieve continuous reading of pins at that rate. Not sure what you are advocating changing or doing for your 3-sysclk thing, but 2 sysclks seems doable already...

    No cigar. This is continual: 500k samples, no pauses or jitter.
  • Brian Fairchild Posts: 549
    edited 2014-05-15 00:47
    jmg wrote: »
    Why "a big rectangular box", when the rotation model is exactly how this actually works ?.

    So, what exactly is rotating, and how does that answer my question in post #290?

    I'm not against analogies, I even use them myself to explain technical concepts to non-technical people, but when you have to have one in a datasheet then it's time to worry.
  • Roy Eltham Posts: 2,996
    edited 2014-05-15 00:51
    jmg wrote: »
    No Cigar, This is continual, 500k samples, no pauses or jitter.
    Like I said, with 2 cogs you could get continuous reading at 2 sysclks per sample. One would read 16 and write them to HUB; the other would read 16 while the first was writing to HUB. They would alternate. They could be synced up easily, so no jitter and continuous reading...

    People have done similar setups on the P1 with multiple cogs synced up to read data quickly to HUB. The same will be possible on P2, with MUCH higher rates possible.
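    A sketch of that two-cog ping-pong in Python, with illustrative timing constants (2 sysclks per sample, and 18 clocks per 16-long block write, per the figures quoted in this thread; the scheduling model itself is my assumption):

    ```python
    # Two-cog ping-pong: while cog A samples 16 longs from the pins,
    # cog B writes its previous 16 longs to HUB, then they swap.

    SAMPLE_CLKS = 2 * 16   # 16 samples at 2 sysclks each
    WRITE_CLKS  = 18       # one 16-long block write (16 + 2 overhead)

    def schedule(blocks):
        """Return (start_clock, cog, phase) events for an alternating
        two-cog pipeline: each block is sampled, then written out
        while the other cog samples the next block."""
        events = []
        for n in range(blocks):
            start = n * SAMPLE_CLKS
            cog = n % 2
            events.append((start, cog, "sample"))
            events.append((start + SAMPLE_CLKS, cog, "write"))
        return events

    # The write window (18 clks) fits inside the partner's sample
    # window (32 clks), so sampling never pauses: continuous capture
    # at 2 sysclks per sample, with the hub writes fully hidden.
    assert WRITE_CLKS <= SAMPLE_CLKS
    print("writes hidden under sampling -> gapless 2-clk sampling")
    ```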