The New 16-Cog, 512KB, 64 analog I/O Propeller Chip - Page 116 — Parallax Forums

Comments

  • jmgjmg Posts: 15,140
    edited 2015-07-27 08:40
    For cog exec code initially residing in hub RAM, all that matters is that those instructions share a common hub byte offset, so that when they are loaded into the cog RAM, everything is long-aligned. It also does not matter for hub-exec purposes whether instructions fall on absolute long boundaries, or not. It is true that it takes one more clock to begin a hub exec instruction stream if it is not absolutely long-aligned. However, this can be avoided by long-aligning your hub exec code in the assembler. This is a small price to pay for allowing data structures of mixed word lengths in hub memory. There is no reason that I see to enforce long-alignment rules in hub memory. All that would do is introduce unnecessary strictures.

    Given there is a speed penalty for not being aligned, will the Assembler have an automatic/default method whereby code is long aligned? Seems this is not the sort of thing you would want a novice to trigger accidentally - a sudden speed change in code they did not intentionally modify.
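For illustration only, here is a minimal Python sketch of the arithmetic such a long-alignment directive would have to do. The names `align_long` and `pad_bytes` are made up for this example and do not come from any Parallax tool (the assembler did not exist yet at this point in the thread); the only assumption is that a long is 4 bytes.

```python
LONG_BYTES = 4  # a Propeller long is 32 bits


def align_long(addr: int) -> int:
    """Round a hub byte address up to the next long boundary."""
    return (addr + LONG_BYTES - 1) & ~(LONG_BYTES - 1)


def pad_bytes(addr: int) -> int:
    """Number of filler bytes an assembler would emit to reach that boundary."""
    return align_long(addr) - addr


if __name__ == "__main__":
    for addr in (0x400, 0x401, 0x402, 0x403):
        print(hex(addr), "->", hex(align_long(addr)), "pad", pad_bytes(addr))
```

An assembler that applied this before every hub exec entry point by default would keep the extra start-up clock from appearing by accident, at a cost of at most three filler bytes per entry point.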
  • evanhevanh Posts: 15,126
    Ease up a little there JMG. The assembler isn't written yet ... It'll just be an extra directive or similar. It's something that can be dealt with at any time.
  • Cluso99Cluso99 Posts: 18,066
    edited 2015-07-27 10:44
    Thanks for clearing up the hub instruction boundary issue.
    I will definitely be enforcing any of my code to be on a long boundary in hub. IMHO that gives consistency and I would not want it any other way.

    Executing hubexec from non long boundaries just seems plain silly IMHO.

    However, if others have use for it otherwise then fine.
  • evanhevanh Posts: 15,126
    PS: This is really only a HubExec issue, and HubExec ain't gonna be that consistent anyway.
  • Hi All.

    Looking at it from the hardware side, byte boundaries are the only way to go.

    Looking at it from the programming side, all the necessary LONG, WORD and BYTE alignment is just a matter of simple assembler DIRECTIVES.
  • cgraceycgracey Posts: 14,133
    PS: This is really only a HubExec issue, and HubExec ain't gonna be that consistent anyway.

    That's right. There are delays of ~5 to ~20 clocks for reloading the instruction FIFO after branches. Once loaded, contiguous execution goes full speed.
  • MJBMJB Posts: 1,235
    edited 2015-07-27 15:38
    It is true that it takes one more clock to begin a hub exec instruction stream if it is not absolutely long-aligned.


    Given there is a speed penalty for not being aligned, will the Assembler have an automatic/default method whereby code is long aligned? Seems this is not the sort of thing you would want a novice to trigger accidentally - a sudden speed change in code they did not intentionally modify.



    One clock at the start of the COG load, then the egg-beater streaming architecture just delivers full speed.
    At least that's how I understand it ... ;-)
  • potatoheadpotatohead Posts: 10,253
    edited 2015-07-27 16:16
    That's how I understand it too.  Given the ~20 cycles on branch taken, this little bit of variance is not all that significant.

    And there is time for pnut to see the right operators.



  • It is true that it takes one more clock to begin a hub exec instruction stream if it is not absolutely long-aligned.


    Given there is a speed penalty for not being aligned, will the Assembler have an automatic/default method whereby code is long aligned? Seems this is not the sort of thing you would want a novice to trigger accidentally - a sudden speed change in code they did not intentionally modify.



    One clock at the start of the COG load, then the egg-beater streaming architecture just delivers full speed.
    At least that's how I understand it ... ;-)


    Sorry to enter this discussion in the middle and without having read the history but I don't see any real advantage to being able to execute instructions at non-aligned addresses especially if it costs an extra cycle. Nor do I see any real advantage to having rdlong or rdword be able to fetch from non-aligned addresses. Is this something that falls out of the current implementation and we get for free? If there is any noticeable cost to this it can probably be removed.
  • cgraceycgracey Posts: 14,133
    edited 2015-07-27 17:52
    One thing that's been bugging me about the non-maskable super interrupt for debugging is that it can't step through REP sections. There is too much context to save and restore within REPs to make it practical. It could be done, but it would cost a bunch of extra hardware.
    I had an idea, though, and it tests out just fine: Have the debug interrupt routine look at the instruction at the LINK return address and if it's a REP instruction, just synthesize it in the single-stepper! This is pretty simple and I see that I am able to sense REP's, no problem. The debug interrupt will let you step through all INT0/INT1/INT2 code, regardless of whether interrupts are being stalled or allowed via STALLI/ALLOWI.
    In a similar fashion to how REPs can be synthesized during single-stepping, so can ALTDS/AUGS/AUGD. This means the single-stepper can go anywhere!
    To make sure this feature isn't routinely abused in people's applications, it will be made good for debugging, only, and will not serve any other purposes well. This way, it should always be able to be employed without disrupting people's existing code.
  • If I emulate any old CPU, that gives me a big advantage.
    The same goes if I have big tables with mixed Words and Bytes ---->

    <
    And so on, and so on



    Sorry to enter this discussion in the middle and without having read the history but I don't see any real advantage to being able to execute instructions at non-aligned addresses especially if it costs an extra cycle. Nor do I see any real advantage to having rdlong or rdword be able to fetch from non-aligned addresses. Is this something that falls out of the current implementation and we get for free? If there is any noticeable cost to this it can probably be removed.


  • If I emulate any old CPU, that gives me a big advantage.
    The same goes if I have big tables with mixed Words and Bytes ---->

    <
    And so on, and so on



    Sorry to enter this discussion in the middle and without having read the history but I don't see any real advantage to being able to execute instructions at non-aligned addresses especially if it costs an extra cycle. Nor do I see any real advantage to having rdlong or rdword be able to fetch from non-aligned addresses. Is this something that falls out of the current implementation and we get for free? If there is any noticeable cost to this it can probably be removed.




    I said "any real advantage". I'm not sure emulating old CPUs is a big target market for Propeller chips although I know we hobbyists like to do it sometimes. Anyway, you're probably right that rdlong and rdword are more useful than the ability to execute instructions from non-aligned addresses. My point was that they can be left out if there is a significant cost to having them.
  • jmgjmg Posts: 15,140
    I had an idea, though, and it tests out just fine: Have the debug interrupt routine look at the instruction at the LINK return address and if it's a REP instruction, just synthesize it in the single-stepper! This is pretty simple and I see that I am able to sense REP's, no problem. The debug interrupt will let you step through all INT0/INT1/INT2 code, regardless of whether interrupts are being stalled or allowed via STALLI/ALLOWI.

    In a similar fashion to how REPs can be synthesized during single-stepping, so can ALTDS/AUGS/AUGD. This means the single-stepper can go anywhere!


    Sounds good, (I thought you had meant exactly that in an earlier comment), but nice to read it has been tested with REP and double opcodes.
    This can be a SW optional feature - in many designs the granular step is ok.
    Most debuggers have a Step-into and Step-over,  (some also have Step-ret and Step-loop) so the Step into can drill into a REP and Step over can not bother with the extra work.
  • jmgjmg Posts: 15,140
    edited 2015-07-27 19:53
    Sorry to enter this discussion in the middle and without having read the history but I don't see any real advantage to being able to execute instructions at non-aligned addresses especially if it costs an extra cycle.

    That is why an assembler directive is there, to avoid that extra cycle. I think the HW does the fetch 'for free', because of the DATA handling mentioned below (ignoring the time hit for now).

    Nor do I see any real advantage to having rdlong or rdword be able to fetch from non-aligned addresses.

    Really? Being able to R/W DATA as Bytes/Words and Longs is built into the opcodes, and that means being able to read any record in memory. Users do not want to have to repack records just because some reads are not aligned, and I think having some DATA IO opcodes not byte-granular, and some that are, is going to take more hardware.
  • David BetzDavid Betz Posts: 14,511
    edited 2015-07-27 20:30
    Sorry to enter this discussion in the middle and without having read the history but I don't see any real advantage to being able to execute instructions at non-aligned addresses especially if it costs an extra cycle.

    That is why an assembler directive is there, to avoid that extra cycle. I think the HW does the fetch 'for free', because of the DATA handling mentioned below (ignoring the time hit for now).

    Nor do I see any real advantage to having rdlong or rdword be able to fetch from non-aligned addresses.

    Really? Being able to R/W DATA as Bytes/Words and Longs is built into the opcodes, and that means being able to read any record in memory. Users do not want to have to repack records just because some reads are not aligned, and I think having some DATA IO opcodes not byte-granular, and some that are, is going to take more hardware.

    Yeah, I was probably wrong about the data operations. I admitted that in a later message. Even if you don't care about CPU emulation, being able to access unaligned words and longs makes it much easier to parse byte streams with unaligned fields. Now all we need are some endian swapping instructions. :-)
    How about instructions for:
    htonl, htons, ntohl, ntohs
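To make the byte-stream point concrete, here is a small Python sketch of parsing a record whose multi-byte fields do not sit on word or long boundaries. The packet layout (a 1-byte tag, a big-endian 32-bit length at offset 1, a big-endian 16-bit port at offset 5) is invented purely for this example, and `swap32` is a hypothetical stand-in for an htonl/ntohl-style byte swap on a little-endian machine.

```python
import struct

# Hypothetical 7-byte packet: tag (1 byte), length (4 bytes, big-endian,
# starting at offset 1), port (2 bytes, big-endian, starting at offset 5).
PACKET = bytes([0x7E, 0x00, 0x00, 0x01, 0x02, 0x1F, 0x90])


def parse(buf: bytes):
    """Read the unaligned fields directly; '>' means big-endian with no padding."""
    tag, length, port = struct.unpack_from(">BIH", buf, 0)
    return tag, length, port


def swap32(value: int) -> int:
    """Reverse the four bytes of a 32-bit value (what htonl does on a little-endian host)."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")


if __name__ == "__main__":
    print(parse(PACKET))                     # (126, 258, 8080)
    print(format(swap32(0x11223344), "08X"))  # 44332211
```

Without unaligned reads, the 32-bit length at offset 1 would have to be assembled from four separate byte reads plus shifts, which is exactly the repacking work being discussed.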
  • cgraceycgracey Posts: 14,133
    edited 2015-07-27 20:57
    We do have an endian-swapping instruction: MOVBYTS D,S/#

    S is treated as four 2-bit fields which select one of four D bytes into their S-respective positions. For example:

    MOVBYTS D,#%00_01_10_11 - endian-swap bytes in D
    MOVBYTS D,#%01_00_11_10 - endian-swap words in D
    MOVBYTS D,#%00_00_00_00 - copy lower D byte to all bytes in D
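As a way to check the three selector values above by hand, here is a small Python model of MOVBYTS. It is written only from the description in this post (four 2-bit fields in S, each picking the D byte that lands in that field's position) and is a sketch, not a statement about the final silicon; the function name `movbyts` is just a label for the example.

```python
def movbyts(d: int, s: int) -> int:
    """Model of MOVBYTS D,S as described: bits [2n+1:2n] of S choose which
    byte of D ends up in byte position n of the result."""
    result = 0
    for pos in range(4):
        sel = (s >> (2 * pos)) & 0b11      # 2-bit selector for this position
        byte = (d >> (8 * sel)) & 0xFF     # the chosen source byte of D
        result |= byte << (8 * pos)
    return result


if __name__ == "__main__":
    d = 0x11223344
    print(format(movbyts(d, 0b00_01_10_11), "08X"))  # 44332211, bytes reversed
    print(format(movbyts(d, 0b01_00_11_10), "08X"))  # 33441122, words swapped
    print(format(movbyts(d, 0b00_00_00_00), "08X"))  # 44444444, low byte copied
```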
  • Oh, that's nifty.
  • We do have an endian-swapping instruction: MOVBYTS D,S/#

    S is treated as four 2-bit fields which select one of four D bytes into their S-respective positions. For example:

    MOVBYTS D,#%00_01_10_11 - endian-swap bytes in D
    MOVBYTS D,#%01_00_11_10 - endian-swap words in D
    MOVBYTS D,#%00_00_00_00 - copy lower D byte to all bytes in D


    Very nice!
  • Cluso99Cluso99 Posts: 18,066
    I have to say all these Interrupts are nice, BUT, time is dragging on! Now a few weeks have been devoted to this and still further parts are being added.
  • I have to say all these Interrupts are nice, BUT, time is dragging on! Now a few weeks have been devoted to this and still further parts are being added.

    When some of my projects drag on my wife always reminds me that they don't have to be perfect but they do have to 'be'.

    Sandy


  • cgraceycgracey Posts: 14,133
    I have to say all these Interrupts are nice, BUT, time is dragging on! Now a few weeks have been devoted to this and still further parts are being added.

    I suppose if I had an FPGA image out, there would be less concern about things like this.

    I almost finished the debug super interrupt today, but got interrupted by a funeral trip we've been planning. Now, I'm working on the laptop with Quartus while my wife drives and kids cycle through their singing, shrieking, and quarreling behind us. I'm really frustrated that I don't have my FPGA board and scope because I left things in a broken state at the last minute.
  • TubularTubular Posts: 4,620
    edited 2015-07-28 02:20
    Hey Chip, what you're proposing will be really neat, and worth the effort.  Have a safe trip.  
    Out of interest, what's your compile time like?   
  • cgraceycgracey Posts: 14,133
    Hey Chip, what you're proposing will be really neat, and worth the effort.  Have a safe trip.  
    Out of interest, what's your compile time like?   

    I'm still compiling for the Cyclone IV on the DE2-115 board. For one cog + hub, the compile time is about 5 minutes.

    I was able to find out what my problem was by looking at the cog Verilog in Quartus. I had to add a bit to the link instruction generator, but I forgot to take another bit away, so the link instruction turned out to be something else that wasn't having any branch effect. I am relieved I found that! I hate the feeling of things being broken.
  • Cluso99Cluso99 Posts: 18,066
    Chip, have a safe trip. Hope the kids have plenty to keep them occupied :) We've done lots of long trips when our kids were just kids. Now they all have their own kids ;)

    True, there would be less pressure if we had an FPGA image to test. But there is still pressure, as interrupts weren't even on the plan a month ago. And it seems the never-ending requests continue. It would be easier if we were off testing and the smart pins were done. While I do like the interrupt scheme that has evolved, I am still concerned at the time it has taken.

    There's not going to be a Chipmas this year :(
  • cgraceycgracey Posts: 14,133
    While travelling, I decided to target the Prop2 Quartus compilation to the newer Cyclone V-A9 device which we are using on the Prop-123 board.
    The design was taking forever to compile, and once the fitter started running, it reported 25k registers! It turned out that 16k of them were in the dual-port cog RAM, which didn't make any sense, at first. Turns out it was NOT inferring a dual-port RAM from the Verilog code, but was building a huge array of flops!
    I searched the web and found an Altera example of inferable dual-port RAM in Verilog. There was some rather subtle difference in how it worked from mine.
    Here is what I was doing at first, and then what I had to change it to, in order to make it infer as a dual-port RAM...


    This makes a giant array of flops:
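    // Port X: the read is unconditional, so on a write qx still captures the old ram[ax].
    always @(posedge clkx)
    begin
      if (wex) ram[ax] <= dx;
      qx <= ram[ax];
    end
    // Port Y: same pattern on its own clock.
    always @(posedge clky)
    begin
      if (wey) ram[ay] <= dy;
      qy <= ram[ay];
    end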


    This makes a dual-port RAM:
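    // Port X: on a write, qx is loaded with the data being written; otherwise it reads ram[ax].
    always @(posedge clkx)
    if (wex)
    begin
      ram[ax] <= dx;
      qx <= dx;
    end
    else
      qx <= ram[ax];
    // Port Y: same pattern on its own clock.
    always @(posedge clky)
    if (wey)
    begin
      ram[ay] <= dy;
      qy <= dy;
    end
    else
      qy <= ram[ay];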


    Get the difference? It makes perfect sense, but I sure wasn't anticipating it.


  • cgraceycgracey Posts: 14,133
    edited 2015-07-28 07:44
    Quartus just finished compiling a one-cog Prop2 for Cyclone V-A9 and the Fmax came in at 98MHz, whereas it was hitting 111MHz on the Cyclone IV. This confirms my previous experience that the Cyclone V is slower than the Cyclone IV, though some thought it should have been about 11% faster. The compile time was in line with what my faster desktop does. I think those horrendous compile times I had initially reported were due to the un-inferable-ness of my dual-port RAMs on the Cyclone V, which had worked on the Cyclone IV.
    I'll kick off a whole-chip compile now and report the device utilization and compile-time in the morning.
  • jmgjmg Posts: 15,140
    Get the difference? It makes perfect sense, but I sure wasn't anticipating it.


    What does that fix do to A9 compile times and resource usage (and MHz reports)?
  • cgraceycgracey Posts: 14,133
    edited 2015-07-28 08:38
    Get the difference? It makes perfect sense, but I sure wasn't anticipating it.


    What does that fix do to A9 compile times and resource usage (and MHz reports)?

    Well, the design is not viable, at all, with flops for cog RAM, so there's no comparison. Given time to fully compile, it would be twice as big and half the speed, probably.


    P.S. Giving this further thought, it would be more like 5 times as big and 1/5 as fast. Trying to make RAM from flops is horribly inefficient.
    I just started the whole-chip compile and with all 16 cogs and the hub there are 49k flops. It finished the synthesis in only 9 minutes on my older laptop. The fitter is going to take some time, now.

  • JRetSapDoogJRetSapDoog Posts: 954
    edited 2015-07-28 12:44
    This is like watching a good suspense movie. I've got my popcorn at the ready and am sitting near the screen. Discovering that it wasn't inferring the dual-port RAM was like dodging a bullet. Thank goodness for the wherewithal to search the web and find a relevant Altera example (sometimes when one is clever enough to do most things on one's own, one can neglect to take advantage of the resources out there). Sometimes it's good to hit a wall, I guess. Anyway, I'm amazed that work can continue to any extent on the road in a car filled with traveling family. That's going all out!
    So, for those of us that aren't familiar with FPGA design, if the synthesis process is the step that takes the Verilog code from written form to gates (logical connections), is the "fitter" process the part that lays out all the circuitry? If so, I can imagine that arranging things in a somewhat optimized pattern could be time consuming for the computer (not to mention the designer waiting on it). And if I recall correctly, slightly different but presumably functionally-equivalent results can happen from one or both of those processes. That sounds a little bit scary to the uninitiated like myself, but I guess one learns to trust the tools, at least for the most part.  
  • Get the difference? It makes perfect sense, but I sure wasn't anticipating it.




    Those look different to me. Your original code captures old_value to qx, while the new code captures new_value.
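To spell out that observation, here is a small Python model of one clock edge on a single port for each coding style. It is only a behavioural sketch of the two Verilog fragments in this thread (the function names are made up, and nothing here models Quartus or any particular FPGA memory block): the first style returns the old contents on a write, the second returns the data being written.

```python
def port_old_data(ram, addr, we, din):
    """First style: 'q <= ram[addr]' runs unconditionally, so with non-blocking
    assignments q captures the contents from before the write."""
    q = ram[addr]          # sampled before the write takes effect
    if we:
        ram[addr] = din
    return q


def port_new_data(ram, addr, we, din):
    """Second style: on a write, q is loaded with din; otherwise it reads the RAM."""
    if we:
        ram[addr] = din
        return din
    return ram[addr]


if __name__ == "__main__":
    ram_a = [0x00, 0xAA, 0x00, 0x00]
    ram_b = list(ram_a)
    print(hex(port_old_data(ram_a, 1, True, 0x55)))  # 0xaa, the old value
    print(hex(port_new_data(ram_b, 1, True, 0x55)))  # 0x55, the new value
    print(ram_a == ram_b)                            # True, stored contents agree
```

The two styles only differ in what appears on the read output during a write cycle on the same port; plain reads and the stored contents are the same.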