Hub Access Window and Hub Instructions
BlackSoldierB
Posts: 45
I was reading the P8X32A Datasheet.
In 4.4 they say that "hub instructions" take 8 cycles to complete.
First of all, what are "hub instructions"? Are they special instructions to read/write the EEPROM or RAM?
Since "hub instructions" take 8 cycles, I don't understand how cog 0 and cog 1 can execute them without interfering with each other or causing problems, because they both execute "hub instructions" at the same time.
This is the last paragraph of chapter 4.4 of the datasheet.
"Keep in mind that a particular cog's hub instructions do not, in any way, interfere with other cogs' instructions because of the Hub mechanism. Cog 1, for example, may start a hub instruction during System Clock cycle 2, in both of these examples, possibly overlapping its execution with that of Cog 0 without any ill effects. Meanwhile, all other cogs can continue executing non-hub instructions, or awaiting their individual hub access windows regardless of what the others are doing."
Could someone explain why they can execute hub instructions at the same time?
Comments
Clearly, not all processors can access shared memory at the same time. To organize the accesses, the "HUB" mechanism lets each processor access RAM in turn, in "round robin" fashion.
So a processor (a "COG") has to wait for the HUB to grant access before it can complete such an instruction.
All other instructions work entirely within the COG, with no shared memory accesses, so they run at full speed.
See the manual for the nice diagrams and explanations of how this works including timing.
Note 1: Other shared resources include the LOCKS.
Note 2: In case it's not clear, "HUB instructions" do not execute from shared memory; they only read and write data to it. The instructions themselves are always executed from within the COG.
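To make the round robin concrete, here is a small Python sketch of a simplified model. The numbers are assumptions pieced together from this thread (a 16-clock rotation over 8 cogs, so a 2-clock slot per cog, with cog 0 at clock 0 and cog 1 at clock 2); the name hub_owner is made up for illustration and is not anything from the Propeller tools.

```python
HUB_ROTATION = 16   # assumed: clocks for the hub to visit all 8 cogs once
SLOT_CLOCKS = 2     # assumed: clocks each cog owns the hub per rotation

def hub_owner(clock):
    """Return the cog number that owns the hub on the given system clock."""
    return (clock % HUB_ROTATION) // SLOT_CLOCKS

# Print one full rotation: every cog gets its own slot, and no two cogs share one.
for clock in range(0, HUB_ROTATION, SLOT_CLOCKS):
    print(f"clocks {clock:2d}-{clock + 1:2d}: cog {hub_owner(clock)}")
```

In this model two cogs can both be "inside" an 8-clock hub instruction at once, but the moments at which they actually touch shared RAM never coincide.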
Let's say that Cog 0 wants to read a variable and Cog 1 wants to write to the same variable.
Cog 0 starts reading the variable; at this moment t = 0.
When t = 2, Cog 1 starts to write to the variable (1 Hub Cycle has passed).
At t = 8 Cog 0 is done with RDLONG; another instruction can now execute.
At t = 10 Cog 1 is done with the WRLONG.
This will cause problems and can't be solved by the round robin, is that correct?
A read or a write of a byte, word, or long to any location is totally atomic. No other COG is making any access to HUB at the same time. It would not matter if all 8 COGs were hammering on the same variable. Whatever a COG writes gets written; it will then be overwritten by another COG when that COG gets its turn. A COG that reads will correctly read whatever was written last.
If there were any problem with this we would have known about it a long time ago.
The only issues are:
1) What if you want to write a collection of variables that should all be updated together, and you don't want another COG taking a turn in between those individual writes and so reading some new data and some old? If that is a problem, it can be fixed by using locks, although in many cases simple in-memory flags will do.
2) If you want to read or write data to HUB as fast as possible, you might want to make sure there is a sufficiently small number of instructions between each RD/WRLONG so that they don't miss HUB slots and have to wait a whole other HUB cycle to get access again.
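Issue 1 above can be illustrated with a toy Python model. This is only a sketch: the variable names x and y, the play function and the slot list are all hypothetical, and each list entry stands for one atomic hub access granted in round-robin order.

```python
def play(slots):
    """Apply hub accesses one at a time, in the order the hub grants them."""
    hub = {"x": 0, "y": 0}      # two shared values that are meant to change together
    observed = {}
    for cog, op, name, value in slots:
        if op == "write":
            hub[name] = value            # one atomic write (think WRLONG)
        else:
            observed[name] = hub[name]   # one atomic read (think RDLONG)
    return observed

slots = [
    (0, "write", "x", 1),     # rotation 1: cog 0 updates x
    (1, "read",  "y", None),  # rotation 1: cog 1 reads y, which is still old
    (0, "write", "y", 1),     # rotation 2: cog 0 updates y
    (1, "read",  "x", None),  # rotation 2: cog 1 reads x, which is already new
]
observed = play(slots)
print(observed)
```

Every single access is atomic, yet cog 1 ends up with x = 1 and y = 0, a pair that cog 0 never intended to publish. That is the case where a lock or a flag is needed.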
Normally when you have two tasks reading/writing the same value you surround them with a semaphore.
Bottom line is "It just works". If you are using RD/WR BYTE, WORD, LONG, then all of those operations are done atomically. That is what the HUB round robin access mechanism is for.
Normally you don't need any such semaphore around single values. Think of the four cores, or whatever, in the Intel chip in your PC. A programmer never has to worry about the atomicity of reading or writing single values, no matter if they are 8 bit bytes all the way up to 64 bit floating point numbers. The chip hardware sorts it out for you.
They might get you a little deeper.
And it is that wait that is described by the variable number of cycles required for a HUB operation to complete. No matter what state of instruction execution a COG is in, the HUB operation will happen exclusive to that COG.
As mentioned before, you don't need locks for single values. They are atomic because of the wait, which makes sure the COG gets it done without worrying about what other COGs may be doing.
If you want to make optimal use of the HUB operations, you need to put instructions between the HUB operations:
WRLONG
WRLONG
There is time for two instructions between these two, making:
WRLONG
nop
nop
WRLONG
...take the same amount of time, meaning the nop instructions are free! This is fastest, and it will always take the same amount of time, once it has started. Say you were to put that into a loop. The very first HUB instruction may take 9-22 cycles to complete. After that, the loop will iterate perfectly, in sync with the HUB round robin access cycles.
The next longer window is:
WRLONG
nop
nop
nop
nop
nop
nop
WRLONG
Edited to show next window properly. I made this same darn mistake earlier: http://forums.parallax.com/showthread.php/116313-Optimizing-HUB-OP-speed
It's 2, 6, 10 instructions. Not 4, and I really go for that 4 instructions...
Again, if this were in a loop, the very first HUB instruction may wait, but after that the instructions all run at speed, synchronized with every other HUB access window.
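The 2, 6, 10 pattern falls out of simple arithmetic, sketched here in Python. The model is an assumption, not taken from the manual: a synchronized hub operation costs 8 clocks, each ordinary cog instruction costs 4 clocks, and a cog's hub window comes around every 16 clocks. The function name loop_timing is made up for illustration.

```python
HUB_ROTATION = 16   # assumed: clocks between one cog's hub windows
HUB_OP_MIN = 8      # assumed: clocks for a hub op that hits its window exactly
COG_INSTR = 4       # assumed: clocks for an ordinary cog instruction

def loop_timing(n_between):
    """Return (period, idle) in clocks for a steady-state loop of one hub op
    followed by n_between ordinary instructions."""
    busy = HUB_OP_MIN + COG_INSTR * n_between
    # The next hub op can only start on a window, so round up to a full rotation.
    period = -(-busy // HUB_ROTATION) * HUB_ROTATION
    return period, period - busy

for n in range(0, 11):
    period, idle = loop_timing(n)
    print(f"{n:2d} instructions: loop takes {period} clocks, {idle} spent waiting")
```

Under these assumptions only 2, 6 and 10 instructions give zero waiting; with 3 instructions the loop jumps to 32 clocks and 12 of them are spent waiting for the hub.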
Since a nop takes four clock cycles, you need four of them to bring you to the next 16 clock cycle hub window. (Of course the cog would have just waited for the access anyway but I think you were illustrating the time passing by using nop instructions.)
I think the premise is wrong. The Propeller Manual says:
"Hub instructions require 8 to 23 clock cycles to execute depending on the relation between the cog’s hub access window and the instruction’s moment of execution."
(Look at any of the HUB assembly instructions, or the HUB description on page 24).
That variation is because the cog might have to wait for HUB access to execute the instruction. Access to HUB is granted in a round robin fashion - like a distributor on a car engine (maybe I'm showing my age...). If the cog wants to execute a HUB instruction it has to wait for the distributor button to point at it. When it gets HUB access it might well only take 8 cycles to complete, but it might have to wait 15 cycles before it gets HUB access.
In fact, look at page 21 of the Propeller Manual - that diagram looks just like a distributor from an 8 cylinder engine...
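The 8 to 23 range can be reproduced with a few lines of Python. This is a sketch of the simplified picture in this post, not of the real silicon: it assumes a hub operation costs 8 clocks once it is lined up with the cog's window, and that the window recurs every 16 clocks. The parameter clocks_late (how many clocks after the ideal moment the instruction starts) is a hypothetical name.

```python
HUB_ROTATION = 16   # assumed: clocks between one cog's hub windows
HUB_OP_MIN = 8      # assumed: clocks for a hub op that is perfectly lined up

def hub_op_cycles(clocks_late):
    """Clocks a hub instruction takes if it starts clocks_late clocks after
    the moment that would have lined it up with this cog's window."""
    wait = (HUB_ROTATION - clocks_late) % HUB_ROTATION  # clocks until the next window
    return HUB_OP_MIN + wait

costs = [hub_op_cycles(late) for late in range(HUB_ROTATION)]
print(min(costs), max(costs))
```

Starting exactly on time gives the 8-clock best case; starting one clock late is the worst case, because the cog has just missed the distributor and waits 15 clocks for it to come around again.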
YES!
And the nops were to show time passing. They represent instructions that could be done between hub access windows. Made the edits needed. Thank you.
The first one will take up to 23 clocks, but if you lay out your code so that you have 2, 6 or 10 regular cog instructions in between, the next one will only take 8 clocks.
Try not to put 3, 7 or 11 cog instructions in between: as you can see below, the xxxx is unused cog power, because the hubop will take 12+8 cycles to complete.
It's simple: the instruction takes 2 cycles at the hub, but 8 cycles at the cog (several of which are just the cog waiting for the hub). If the cog isn't already synchronized to the hub, it will wait up to another 15 cycles, hence the worst case being 23 cycles, 15 of which are waiting for the robin to come around (!), and several of which are waiting for the hub to do its thing. The rest are normal read/execute/write cycles in the cog itself.