Optimizing HUB OP speed?

XlogicX · 2009-09-26 14:38

I apologize in advance if this concept has already been talked about. I’ve searched around to try to find bits of help, but there is an overwhelming amount of info on these forums (usually a great thing), so it’s possible my searching was too shallow.

I wanted to know if any of you have any cool hacks for optimizing the timing of going through several HUB OPs. I have some pretty small-footprint loops going on in assembly that repeat several times (to continuously check for updated variable data in main memory, controlled/changed by high-level code); the HUB OPs that I have to do are killing me (with how much they can slow the loop down). I think I just ran out of mental energy, but I think I’m looking for a way to sync the cog up before the first hub op, then do a batch of them while synced. Does this sound pointless? Possible? Or is there any better approach for the main issue of speed?

Kye · 2009-09-26 14:50

Um, so, your code should be like this.

hubop

regular op
regular op

hubop

... etc.

If you read the manual section on timing it explains this fully.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nyamekye,

potatohead · 2009-09-26 17:20

I'm going through this right now too. Got some good help, and here it is:

Your first hubop syncs your loop. That first case could take up to 23 cycles. From there, consider each HUB op to take two instructions worth of time, then there is a two instruction delay before the next one, making the loop Kye just posted for you the best case.
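As a rough illustration of why the first hub op can cost so much more than the later ones, here is a minimal Python sketch of a simplified timing model. It assumes the cog's hub window comes around every 16 clocks and that a hub op takes 8 clocks once its window arrives; both constants and the function name `hub_op_clocks` are assumptions made for this sketch, not something measured on hardware.

```python
HUB_WINDOW = 16  # clocks between one cog's hub access windows (assumed model)
HUB_EXEC = 8     # clocks a hub op takes once its window arrives (assumed)

def hub_op_clocks(phase):
    """Clocks a hub op costs when issued `phase` clocks after the last window."""
    wait = (-phase) % HUB_WINDOW  # clocks spent stalled until the next window
    return wait + HUB_EXEC

# Cost for every possible alignment of the cog relative to its window.
costs = [hub_op_clocks(p) for p in range(HUB_WINDOW)]
print(min(costs), max(costs))  # prints "8 23"
```

Under these assumptions an unsynchronized hub op costs anywhere from 8 to 23 clocks, and every hub op after it starts from a known alignment.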

I would add that, if you can't make it happen in two instructions, then you might as well do four because the window comes around at multiples of the two instruction window access time.

So:

hub

instruction
instruction
instruction

hub

is the same as

hub

instruction
instruction
instruction
nop

hub.

You can get a 20 percent increase in speed, on average, with good instruction placement. In the case of the video stuff I was doing, a longer loop, where operations are done out of order and buffered to take advantage of the window, ended up considerably faster than just trying to pack instructions in at the right times. Write your loop and debug it, then rewrite it, potentially holding some values in a few spare longs so that best-case window opportunities can be taken advantage of. This is tough, but well worth it.
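To see how reordering alone can change loop speed, the following Python sketch compares two loops that contain the same two hub ops and the same four regular instructions, placed differently. It is a simplified model, assuming 4-clock regular instructions, a hub window every 16 clocks, and 8 clocks per aligned hub op; `loop_period` and the example loops are hypothetical and not taken from any real driver.

```python
HUB_WINDOW = 16  # clocks between hub access windows (assumed model)
HUB_EXEC = 8     # clocks for a hub op that hits its window (assumed)
INSTR = 4        # clocks for a regular instruction (assumed)

def loop_period(gaps):
    """Clocks per pass of a loop, starting aligned to a hub window.

    gaps[i] is the number of regular instructions that follow hub op i
    (the closing jmp counts as one of them).
    """
    clock = 0
    for n in gaps:
        clock += (-clock) % HUB_WINDOW  # stall until the window opens
        clock += HUB_EXEC + n * INSTR   # hub op, then n regular instructions
    clock += (-clock) % HUB_WINDOW      # next pass's first hub op stalls too
    return clock

print(loop_period([3, 1]))  # hub, 3 instructions, hub, 1 instruction
print(loop_period([2, 2]))  # same work, rebalanced around the windows
```

In this model the unbalanced loop takes 48 clocks per pass and the rebalanced one takes 32. The exact gain in real code depends on which instructions can legally be moved.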

Look at the code you will find Cluso99 added to mine in the "Improved NTSC Driver" thread here. That's a great example of how it's done.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Propeller Wiki: Share the coolness!
Chat in real time with other Propellerheads on IRC #propeller @ freenode.net
Safety Tip: Life is as good as YOU think it is!

Post Edited (potatohead) : 9/26/2009 5:25:32 PM GMT

potatohead · 2009-09-26 17:29

Another approach is to consider how you are fetching data. In your case, just checking on something probably won't work any other way, but...

If it might, consider that a byte HUB op costs just as much as a long HUB op does. In my case, fetching a long allowed for a longer and much faster 4 HUB op loop, compared to just fetching bytes and using shorter loops.
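A small Python sketch of the byte-versus-long trade-off, under stated assumptions: hub ops from one cog can be no closer together than 16 clocks, and the Propeller stores longs little-endian, so the byte at the lowest address is the least significant byte. The helper names are hypothetical, and the clock figure is only a lower bound on hub traffic because it ignores the shift and mask instructions needed to unpack.

```python
HUB_WINDOW = 16  # minimum clocks between hub ops from one cog (assumed model)

def unpack_long(value):
    """Split a 32-bit long into four bytes, lowest address first."""
    return [(value >> shift) & 0xFF for shift in (0, 8, 16, 24)]

def min_fetch_clocks(n_bytes, bytes_per_hub_op):
    """Lower bound on clocks spent in hub windows to fetch n_bytes."""
    hub_ops = -(-n_bytes // bytes_per_hub_op)  # ceiling division
    return hub_ops * HUB_WINDOW

print(unpack_long(0x44332211))   # the four bytes of one long
print(min_fetch_clocks(4, 1))    # four byte-sized hub ops
print(min_fetch_clocks(4, 4))    # one long-sized hub op
```

Fetching a long moves four bytes per window instead of one, which frees the other three windows for useful work.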

jazzed · 2009-09-26 17:45

Expanding on potatohead's comment:

'pseudo code
:loop
hubop
instruction
instruction
jmp #:loop

... is the same as

'pseudo code
:loop
hubop
instruction
instruction
nop
nop
nop
jmp #:loop

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve

Cluso99 · 2009-09-27 03:38

potatohead slightly missed the operation theory.

Count the hubop as 2 instructions
Then you can perform 2, 6, 10, etc instructions in between.
Next hubop
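The 2, 6, 10 rule can be checked with a few lines of Python. This is a sketch of a simplified model, assuming a hub op occupies 8 clocks when aligned, regular instructions take 4 clocks each, and the hub window repeats every 16 clocks; the function name `clocks_between_hub_ops` is made up for this example.

```python
HUB_WINDOW = 16  # clocks between hub access windows (assumed model)
HUB_EXEC = 8     # aligned hub op, counted as two instructions (assumed)
INSTR = 4        # clocks per regular instruction (assumed)

def clocks_between_hub_ops(n):
    """Clocks from one hub op to the next with n instructions in between."""
    busy = HUB_EXEC + n * INSTR
    return -(-busy // HUB_WINDOW) * HUB_WINDOW  # round up to the next window

for n in range(11):
    total = clocks_between_hub_ops(n)
    wasted = total - (HUB_EXEC + n * INSTR)
    print(n, total, wasted)

# Instruction counts that fill the gap exactly, with no stall clocks.
sweet_spots = [n for n in range(11)
               if clocks_between_hub_ops(n) == HUB_EXEC + n * INSTR]
print(sweet_spots)  # prints "[2, 6, 10]"
```

Any count between two sweet spots costs the same as the next sweet spot, so the spare slots can be filled with useful instructions for free.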

A taken jump takes 1 instruction but a non-taken jump takes 2 instructions (because of the pipeline being flushed). However, I believe this is not true for a conditional jump (based on condition codes), as the instruction is ignored if the condition is not met, so it would still be one instruction.

What does this mean? Well, you can re-order your code to save wasted cycles. I gained 20% by reordering potatohead's code in the scanline section of text rendering.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Links to other interesting threads:

· Home of the MultiBladeProps: TriBlade, RamBlade, RetroBlade, TwinBlade, SixBlade, website
· Single Board Computer: 3 Propeller ICs and a TriBladeProp board (ZiCog Z80 Emulator)
· Prop Tools under Development or Completed (Index)
· Emulators: Micros eg Altair, and Terminals eg VT100 (Index) ZiCog (Z80), MoCog (6809)
· Search the Propeller forums (uses advanced Google search)
My cruising website is: www.bluemagic.biz  MultiBladeProp is: www.bluemagic.biz/cluso.htm

potatohead · 2009-09-27 04:08

Ok, so I want to sort this out, because I'm working on this right now. No time like the present, right?

The simple case of:

hubop
instruction
hubop

equals

hubop
instruction
nop
hubop

Easy cheezy.

so then:

hubop
instruction
instruction
instruction
hubop

does not equal

hubop
instruction
instruction
instruction
nop
hubop

Instead it equals

hubop
instruction
instruction
instruction
nop
nop
nop
hubop?

So, why not the second case? If the window hits at two instructions, wouldn't it hit at four?

Phil Pilgrim (PhiPi) · 2009-09-27 04:25

It's because the hubop requires at least seven clocks, whereas a normal instruction requires four. The "sweet spot" comes 'round every 16 clocks, which is one hubop + two instructions. 32 clocks is one hubop + six instructions.

-Phil

kuroneko · 2009-09-27 04:26

Assuming instruction at 4 cycles.

hubop          +0     16n   slot 0
instruction    +8
instruction    +12
instruction    +16          slot 1
nop            +20
hubop          +24    16m+8

hubop          +0     16n   slot 0
instruction    +8
instruction    +12
instruction    +16          slot 1
nop            +20
nop            +24
nop            +28
hubop          +32    16m   slot 2

See the difference? HUB ops need to be aligned at 16n (relative to each other) which isn't the case for the 1st example.
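The two timelines above can be reproduced with a short Python trace. It is a sketch under the same assumptions as the table (4-clock instructions, 8-clock aligned hub ops, windows at multiples of 16), and the function name `trace` is hypothetical. One difference from the table: for a hub op, the trace records the clock at which it actually gets its window, so the misaligned hub op that is issued at +24 shows up at +32.

```python
HUB_WINDOW = 16  # hub ops can only start at multiples of 16 (assumed model)
HUB_EXEC = 8     # clocks for an aligned hub op (assumed)
INSTR = 4        # clocks for a regular instruction or nop (assumed)

def trace(ops):
    """Return (op, clock) pairs giving the clock at which each op starts."""
    clock = 0
    rows = []
    for op in ops:
        if op == "hubop":
            clock += (-clock) % HUB_WINDOW  # stall until the window opens
            rows.append((op, clock))
            clock += HUB_EXEC
        else:
            rows.append((op, clock))
            clock += INSTR
    return rows

one_nop = ["hubop", "instruction", "instruction", "instruction",
           "nop", "hubop"]
three_nops = ["hubop", "instruction", "instruction", "instruction",
              "nop", "nop", "nop", "hubop"]

for name, ops in (("one nop", one_nop), ("three nops", three_nops)):
    print(name)
    for op, clock in trace(ops):
        print(f"  {op:<12} +{clock}")
```

Both traces end with the second hub op starting at +32, which is why the one-nop and three-nop versions run at the same speed.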

potatohead · 2009-09-27 05:28

Got it. Thanks!!

XlogicX · 2009-09-28 02:06

You guys are incredible! This was exactly the type of discussion I was looking for, and you all provided a lot of inspiration and help. I will probably take some of these strategies and watch my code run in GEAR and look very closely at the clock count at the same time.

By the way, I'm working on a 4-channel audio driver that is sample based. It can do everything you should expect, like playing 4 independent samples at the same time at different frequencies, durations, and volumes, and it can even change bit quality on the fly for the whole mix (for example, going from 16-bit to something arbitrary like 5-bit). The cool thing is that it completely works; it's the most challenging thing I have ever done. The down side is that I am only really getting up to 1 kHz, which is fine for now (especially since the sounds I'm interested in are more bass oriented anyway), but the tighter and quicker I get the loops in the cog that is actually doing the real work, the higher these frequencies will get. Of course, another hack is to have separate samples that play the wave shape two or more times in that sample (I have two other cogs crunching other stuff in parallel; one mixes audio data, another one decompresses it).
