Optimizing HUB OP speed? — Parallax Forums

XlogicX Posts: 18
edited 2009-09-28 02:06 in Propeller 1
I apologize in advance if this has already been discussed. I've searched around for bits of help, but there is an overwhelming amount of info on these forums (usually a great thing), so it's possible my searching was too shallow.

I wanted to know if any of you have any cool hacks for optimizing the timing of several HUB OPs in a row. I have some pretty small-footprint loops in assembly that repeat many times (to continuously check for updated variable data in main memory, controlled/changed by high-level code), and the HUB OPs I have to do are killing me with how much they slow the loop down. I think I just ran out of mental energy, but I'm looking for a way to sync the cog up before the first hub op, then do a batch of them while synced. Does this sound pointless? Possible? Or is there a better approach to the main issue of speed?

Comments

  • Kye Posts: 2,200
    edited 2009-09-26 14:50
    Um, so, your code should be like this.

    hubop

    regular op
    regular op

    hubop

    ... etc.

    If you read the manual section on timing it explains this fully.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Nyamekye,
  • potatohead Posts: 10,261
    edited 2009-09-26 17:20
    I'm going through this right now too. Got some good help, and here it is:

    Your first hubop syncs your loop. That first case could take up to 23 cycles. From there, consider each HUB op to take two instructions worth of time, then there is a two instruction delay before the next one, making the loop Kye just posted for you the best case.
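
The cost of that first, unsynced hub op can be sketched in Python. This is only a model, assuming the cog's hub window recurs every 16 clocks and that a hub op costs 8 clocks once it lands exactly on the window. `hubop_clocks` and `phase` are illustrative names, not Propeller terms.

```python
HUB_PERIOD = 16   # assumed: each cog's hub window recurs every 16 clocks
HUB_MIN = 8       # assumed: cost of a hub op that hits its window exactly

def hubop_clocks(phase):
    """Clocks a hub op takes when it starts `phase` clocks after a window."""
    wait = (-phase) % HUB_PERIOD   # clocks until the next window opens
    return wait + HUB_MIN

costs = [hubop_clocks(p) for p in range(HUB_PERIOD)]
print(min(costs), max(costs))   # 8 23
```

Under these assumptions the worst case is a hub op that starts one clock after its window: it waits 15 clocks and then spends 8, which gives the 23-cycle figure.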

    I would add that, if you can't make it happen in two instructions, then you might as well do four because the window comes around at multiples of the two instruction window access time.

    So:

    hub

    instruction
    instruction
    instruction

    hub

    is the same as

    hub

    instruction
    instruction
    instruction
    nop

    hub.

    You can get a 20 percent increase in speed, on average, with good instruction placement. In the video code I was working on, a longer loop, where operations are done out of order and buffered to take advantage of the window, ended up considerably faster than just trying to pack instructions in at the right times. Write your loop and debug it, then rewrite it, potentially holding some values in a few spare longs so that the best-case window opportunities can be used. This is tough, but well worth it.

    Look at the code you will find Cluso99 added to mine in the "Improved NTSC Driver" thread here. That's a great example of how it's done.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Propeller Wiki: Share the coolness!
    Chat in real time with other Propellerheads on IRC #propeller @ freenode.net
    Safety Tip: Life is as good as YOU think it is!

    Post Edited (potatohead) : 9/26/2009 5:25:32 PM GMT
  • potatohead Posts: 10,261
    edited 2009-09-26 17:29
    Another approach is to consider how you are fetching data. In your case, just checking on something probably won't work any other way, but...

    If it might, consider that a byte HUB op costs just as much as a long HUB op does. In my case, fetching a long allowed for a longer and much faster 4 HUB op loop, compared to just fetching bytes and using shorter loops.
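
A rough Python sketch of why the wider fetch helps, assuming a synced loop that issues one hub op per 16-clock window. It only counts hub windows and ignores the cog instructions needed to unpack each long, so treat the numbers as a best case. `fetch_clocks` and `unpack_long` are hypothetical helpers for illustration.

```python
HUB_PERIOD = 16  # assumed: one hub access per cog every 16 clocks

def fetch_clocks(n_bytes, bytes_per_hubop):
    """Clocks to fetch n_bytes in a synced loop, one hub op per window."""
    hubops = -(-n_bytes // bytes_per_hubop)   # ceiling division
    return hubops * HUB_PERIOD

def unpack_long(value):
    """Split a 32-bit long into four bytes, lowest byte first."""
    return [(value >> shift) & 0xFF for shift in (0, 8, 16, 24)]

print(fetch_clocks(64, 1))        # byte reads: 1024
print(fetch_clocks(64, 4))        # long reads: 256
print(unpack_long(0x44332211))    # [17, 34, 51, 68]
```

The unpacking work still has to fit into the instruction slots between hub ops, which is where a longer, reordered loop comes in.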

  • jazzed Posts: 11,803
    edited 2009-09-26 17:45
    Expanding on potatohead's comment:

    'pseudo code
    :loop
    hubop
    instruction
    instruction
    jmp #:loop
    
    


    ... is the same as
    'pseudo code
    :loop
    hubop
    instruction
    instruction
    nop
    nop
    nop
    jmp #:loop
    
    

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    --Steve

    Propeller Tools
  • Cluso99 Posts: 18,069
    edited 2009-09-27 03:38
    potatohead slightly missed the operation theory.

    Count the hubop as 2 instructions
    Then you can perform 2, 6, 10, etc instructions in between.
    Next hubop
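
The 2, 6, 10 rule can be checked with a small Python model. It assumes 4 clocks per ordinary instruction, 8 clocks for an aligned hub op, and a hub window every 16 clocks. `loop_period` is an illustrative name, not anything from the Propeller tools.

```python
HUB_PERIOD = 16  # assumed hub window spacing in clocks
HUB_MIN = 8      # assumed cost of an aligned hub op
INSTR = 4        # assumed cost of an ordinary instruction

def loop_period(n_instr):
    """Clocks from one hub op to the next with n_instr instructions between."""
    busy = HUB_MIN + n_instr * INSTR
    windows = -(-busy // HUB_PERIOD)   # round up to a whole window
    return windows * HUB_PERIOD

for n in range(11):
    print(n, loop_period(n))
```

In this model 0 to 2 instructions give a 16-clock period, 3 to 6 give 32, and 7 to 10 give 48, so 2, 6 and 10 are the largest counts that fit before each successive window.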

    A taken jump takes 1 instruction but a non-taken jump takes 2 instructions (because of the pipeline being flushed). However, I believe this is not true for a conditional jump (based on condition codes), as the instruction is ignored if the condition is not met, so it would still be one instruction.

    What does this mean? Well, you can re-order your code to save wasted cycles. I gained 20% by reordering potatohead's code in the scanline section of text rendering.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Links to other interesting threads:

    · Home of the MultiBladeProps: TriBlade, RamBlade, RetroBlade, TwinBlade, SixBlade, website
    · Single Board Computer: 3 Propeller ICs and a TriBladeProp board (ZiCog Z80 Emulator)
    · Prop Tools under Development or Completed (Index)
    · Emulators: Micros eg Altair, and Terminals eg VT100 (Index) ZiCog (Z80), MoCog (6809)
    · Search the Propeller forums (uses advanced Google search)
    My cruising website is: www.bluemagic.biz  MultiBladeProp is: www.bluemagic.biz/cluso.htm
  • potatohead Posts: 10,261
    edited 2009-09-27 04:08
    Ok, so I want to sort this out, because I'm working on this right now. No time like the present, right?

    The simple case of:

    hubop
    instruction
    hubop

    equals

    hubop
    instruction
    nop
    hubop

    Easy cheezy.

    so then:

    hubop
    instruction
    instruction
    instruction
    hubop

    does not equal

    hubop
    instruction
    instruction
    instruction
    nop
    hubop

    Instead it equals

    hubop
    instruction
    instruction
    instruction
    nop
    nop
    nop
    hubop?

    So, why not the second case? If the window hits at two instructions, wouldn't it hit at four?

  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2009-09-27 04:25
    It's because the hubop requires at least eight clocks, whereas a normal instruction requires four. The "sweet spot" comes 'round every 16 clocks, which is one hubop + two instructions. 32 clocks is one hubop + six instructions.

    -Phil
  • kuroneko Posts: 3,623
    edited 2009-09-27 04:26
    Assuming instruction at 4 cycles.

    hubop          +0     16n   slot 0
    instruction    +8
    instruction    +12
    instruction    +16          slot 1
    nop            +20
    hubop          +24    16m+8
    


    hubop          +0     16n   slot 0
    instruction    +8
    instruction    +12
    instruction    +16          slot 1
    nop            +20
    nop            +24
    nop            +28
    hubop          +32    16m   slot 2
    


    See the difference? HUB ops need to be aligned at 16n (relative to each other) which isn't the case for the 1st example.
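
The two listings above can be reproduced with a short Python trace. This is a sketch under the same assumptions as the listings (4 clocks per instruction, 8 clocks for an aligned hub op, windows at multiples of 16 clocks). `trace` is a hypothetical helper, and the op names are just labels.

```python
HUB_PERIOD = 16  # assumed hub window spacing in clocks
HUB_MIN = 8      # assumed cost of an aligned hub op
INSTR = 4        # assumed cost of an ordinary instruction

def trace(ops):
    """Return (op, start_clock, window_clock) rows; window is None for non-hub ops."""
    clock = 0
    rows = []
    for op in ops:
        if op == "hubop":
            window = -(-clock // HUB_PERIOD) * HUB_PERIOD  # next 16n boundary
            rows.append((op, clock, window))
            clock = window + HUB_MIN
        else:
            rows.append((op, clock, None))
            clock += INSTR
    return rows

short = ["hubop", "instr", "instr", "instr", "nop", "hubop"]
padded = ["hubop", "instr", "instr", "instr", "nop", "nop", "nop", "hubop"]
print(trace(short)[-1])    # ('hubop', 24, 32)
print(trace(padded)[-1])   # ('hubop', 32, 32)
```

Both final hub ops are served at clock 32. The shorter sequence simply stalls for 8 clocks, so the two extra nops in the padded version cost nothing.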
  • potatohead Posts: 10,261
    edited 2009-09-27 05:28
    Got it. Thanks!!

  • XlogicX Posts: 18
    edited 2009-09-28 02:06
    You guys are incredible! This was exactly the type of discussion I was looking for, and you all provided a lot of inspiration and help. I will probably take some of these strategies, run my code in GEAR, and watch the clock count very closely as it runs.

    By the way, I'm working on a sample-based 4-channel audio driver. It can do everything you would expect, like playing 4 independent samples at the same time at different frequencies, durations, and volumes, and it can even change bit quality on the fly for the whole mix (for example, going from 16-bit to something arbitrary like 5-bit). The cool thing is that it completely works; it's the most challenging thing I have ever done. The downside is that I am only really getting up to 1 kHz, which is fine for now (especially since the sounds I'm interested in are more bass oriented anyway), but the tighter and quicker I get the loops in the cog that is doing the real work, the higher these frequencies will get. Of course, another hack is to have separate samples that play the wave shape two or more times in that sample. (I have two other cogs crunching other stuff in parallel; one mixes audio data, another one decompresses it.)