Optimizing HUB OP speed?
XlogicX
Posts: 18
I apologize in advance if this has already been discussed. I've searched around for help, but there is an overwhelming amount of info on these forums (usually a great thing), so it's possible my search was too shallow.
I wanted to know if any of you have any cool hacks for optimizing the timing of going through several HUB OPs. I have some pretty small-footprint loops in assembly that repeat many times (to continuously check for updated variable data in main memory, controlled/changed by high-level code); the HUB OPs that I have to do are killing me (with how much they can slow the loop down). I think I just ran out of mental energy, but I'm looking for a way to sync the cog up before the first hub op, then do a batch of them while synced. Does this sound pointless? Possible? Or is there a better approach to the main issue of speed?
Comments
hubop
regular op
regular op
hubop
... etc.
If you read the manual section on timing it explains this fully.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nyamekye,
Your first hubop syncs your loop. That first case could take up to 23 cycles. From there, consider each HUB op to take two instructions worth of time, then there is a two instruction delay before the next one, making the loop Kye just posted for you the best case.
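The "up to 23 cycles" figure can be reproduced with a small timing model. This is only a sketch, assuming Propeller 1 timing as I read it from the manual: each cog gets a hub window once every 16 clocks, and a hub op takes 8 clocks once its window arrives. The function name and the convention that a hub op issued at a multiple of 16 needs no wait are made up for illustration.

```python
HUB_PERIOD = 16   # assumed: each cog's hub window recurs every 16 clocks
HUB_MIN = 8       # assumed: a perfectly aligned hub op takes 8 clocks

def hub_op_cost(start_clock):
    """Clocks spent by a hub op issued at start_clock (hypothetical model).

    Convention: a hub op issued at a multiple of 16 hits its window at once.
    """
    wait = (-start_clock) % HUB_PERIOD   # stall until the next window
    return wait + HUB_MIN

# Try every possible alignment of the first (unsynced) hub op.
costs = [hub_op_cost(t) for t in range(HUB_PERIOD)]
print(min(costs), max(costs))
```

Under these assumptions the first hub op costs anywhere from 8 to 23 clocks, and every hub op after it starts from a known alignment.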
I would add that, if you can't make it happen in two instructions, then you might as well do four because the window comes around at multiples of the two instruction window access time.
So:
hub
instruction
instruction
instruction
hub
is the same as
hub
instruction
instruction
instruction
nop
hub.
You can get a 20 percent increase in speed, on average, with good instruction placement. In the case of the video stuff I was doing, a longer loop, where operations are done out of order and buffered to take advantage of the window, ended up considerably faster than just trying to pack instructions in at the right times. Write your loop and debug it, then rewrite it, potentially holding some values in a few spare longs so that best-case window opportunities can be taken advantage of. This is tough, but well worth it.
Look at the code you will find Cluso99 added to mine in the "Improved NTSC Driver" thread here. That's a great example of how it's done.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Propeller Wiki: Share the coolness!
Chat in real time with other Propellerheads on IRC #propeller @ freenode.net
Safety Tip: Life is as good as YOU think it is!
Post Edited (potatohead) : 9/26/2009 5:25:32 PM GMT
If it might, consider that a byte HUB op costs just as much as a long HUB op does. In my case, fetching a long allowed for a longer and much faster 4 HUB op loop, compared to just fetching bytes and using shorter loops.
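To illustrate why fetching a long can beat fetching bytes: one hub read of a long delivers four bytes, which can then be pulled apart with shifts and masks in cog RAM between hub windows. The sketch below only models the unpacking arithmetic; the function name is hypothetical, and it assumes the Propeller's little-endian layout (lowest address is the least significant byte).

```python
def unpack_bytes(long_value):
    """Split a 32-bit long into 4 bytes, lowest-addressed byte first.

    Assumes little-endian layout; this mirrors what a shift-and-mask
    sequence in the cog would do after a single long read from hub.
    """
    return [(long_value >> (8 * i)) & 0xFF for i in range(4)]

hub_reads_as_bytes = 4   # four separate byte reads
hub_reads_as_long = 1    # one long read, then unpack in the cog

print(unpack_bytes(0x44332211))
print(hub_reads_as_bytes // hub_reads_as_long, "x fewer hub reads")
```

The shifts and masks are regular instructions, so they can fill the slots between hub windows instead of waiting on them.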
... is the same as
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve
Propeller Tools
Count the hubop as 2 instructions
Then you can perform 2, 6, 10, etc instructions in between.
Next hubop
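The 2, 6, 10 pattern falls out of simple arithmetic. Here is a rough sketch, assuming 4 clocks per regular instruction, 8 clocks for an aligned hub op, and a hub window every 16 clocks (my reading of the Propeller 1 timing; the function names are invented for illustration).

```python
CLOCKS_PER_INSTR = 4   # assumed: regular instruction time
HUB_PERIOD = 16        # assumed: hub window spacing per cog
HUB_MIN = 8            # assumed: aligned hub op time ("2 instructions")

def loop_period(n_instr):
    """Clocks per iteration of: hub op + n_instr regular instructions."""
    busy = HUB_MIN + n_instr * CLOCKS_PER_INSTR
    # Round up to the next hub window (integer ceiling division).
    return -(-busy // HUB_PERIOD) * HUB_PERIOD

def wasted_clocks(n_instr):
    """Clocks spent stalled waiting for the next hub window."""
    return loop_period(n_instr) - (HUB_MIN + n_instr * CLOCKS_PER_INSTR)

sweet_spots = [n for n in range(12) if wasted_clocks(n) == 0]
print(sweet_spots)
```

Any instruction count between two sweet spots costs the same as the next sweet spot up, so the spare slots are free to use.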
A taken jump takes 1 instruction but a non-taken jump takes 2 instructions (because of the pipeline being flushed). However, I believe this is not true for a conditional jump (based on condition codes), as the instruction is ignored if the condition is not met, so it would still be one instruction.
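As a rough way to budget a counted loop with those jump costs: the sketch below assumes 4 clocks per regular instruction, 4 clocks for a taken DJNZ-style jump, and 8 clocks when it falls through, as described in the post above. The function name is hypothetical, and hub ops are left out to keep the arithmetic simple.

```python
INSTR_CLOCKS = 4        # assumed: regular instruction
JUMP_TAKEN = 4          # assumed: DJNZ-style jump that loops back
JUMP_NOT_TAKEN = 8      # assumed: same jump falling through on exit

def counted_loop_clocks(iterations, body_instructions):
    """Total clocks for a counted loop with no hub ops in the body."""
    body = iterations * body_instructions * INSTR_CLOCKS
    # The jump is taken on every pass except the last one.
    jumps = (iterations - 1) * JUMP_TAKEN + JUMP_NOT_TAKEN
    return body + jumps

print(counted_loop_clocks(10, 3))
```

The extra cost of the fall-through is paid once per loop exit, so it matters most for short loops that are entered often.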
What does this mean? Well, you can re-order your code to save wasted cycles. I gained 20% by reordering potatohead's code in the scanline section of text rendering.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Links to other interesting threads:
· Home of the MultiBladeProps: TriBlade, RamBlade, RetroBlade, TwinBlade, SixBlade, website
· Single Board Computer: 3 Propeller ICs and a TriBladeProp board (ZiCog Z80 Emulator)
· Prop Tools under Development or Completed (Index)
· Emulators: Micros eg Altair, and Terminals eg VT100 (Index) ZiCog (Z80), MoCog (6809)
· Search the Propeller forums (uses advanced Google search)
My cruising website is: www.bluemagic.biz  MultiBladeProp is: www.bluemagic.biz/cluso.htm
The simple case of:
hubop
instruction
hubop
equals
hubop
instruction
nop
hubop
Easy cheezy.
so then:
hubop
instruction
instruction
instruction
hubop
does not equal
hubop
instruction
instruction
instruction
nop
hubop
Instead it equals
hubop
instruction
instruction
instruction
nop
nop
nop
hubop?
So, why not the second case? If the window hits at two instructions, wouldn't it hit at four?
-Phil
See the difference? HUB ops need to be aligned at 16n (relative to each other) which isn't the case for the 1st example.
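The variants argued over in this thread can be compared with a tiny simulation. It is a sketch under assumed Propeller 1 timing (hub window every 16 clocks, 8 clocks for an aligned hub op, 4 clocks per regular instruction or nop); the function and variable names are made up for illustration.

```python
def hub_window_clocks(sequence, hub_period=16, hub_min=8, op_clocks=4):
    """Return the clock at which each hub op in sequence gets its window.

    sequence is a list of "hub" / "op" strings; clock 0 is taken to be
    an aligned start, so the first hub op never waits.
    """
    t = 0
    windows = []
    for item in sequence:
        if item == "hub":
            t += (-t) % hub_period   # stall until the next 16n boundary
            windows.append(t)
            t += hub_min
        else:
            t += op_clocks           # regular instruction or nop
    return windows

# hub, N regular instructions (nops included), hub
for n in (2, 3, 4, 6):
    seq = ["hub"] + ["op"] * n + ["hub"]
    print(n, hub_window_clocks(seq))
```

In this model, 3 instructions, 3 instructions plus a nop, and 3 instructions plus 3 nops all reach the second hub op at clock 32; only the 2-instruction case reaches it at clock 16.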
By the way, I'm working on a 4-channel, sample-based audio driver. It can do everything you would expect, like playing 4 independent samples at the same time at different frequencies, durations, and volumes, and it can even change bit quality on the fly for the whole mix (for example, going from 16-bit to something arbitrary like 5-bit). The cool thing is that it completely works; it's the most challenging thing I have ever done. The downside is that I am only really getting up to 1 kHz, which is fine for now (especially since the sounds I'm interested in are more bass oriented anyway), but the tighter and quicker I get the loops in the cog that is doing the real work, the higher these frequencies will get. Of course, another hack is to have separate samples that play the wave shape two or more times in that sample (I have two other cogs crunching other stuff in parallel; one mixes audio data, another one decompresses it).