Understanding the Eggbeater
Surac
Posts: 176
in Propeller 2
Hello
I have some problems understanding hub slices, the FIFO interface and the block transfer modes in and out of the hub RAM.
Is there some documentation? I have already read the Google document and the instruction table, but I think I need some more
explanation.
Thank you
Surac
Comments
I think it's not much different from P1 … Except for the FIFO.
Basically, the HUB is divided into 8 blocks of 64KB (16K x 32 bits). The blocks are interleaved on the lowest "long" address bits, so consecutive longs live in consecutive blocks. We have 1MB of hub space addressing capability although only 512KB exists. Therefore we have 20-bit addressing.
x_aaaaaaaaaaaaaa_kkk_bb
bb = address of bytes within each block
kkk = address selecting the respective block
aaaaaaaaaaaaaa = address used within the chosen block
x = basically a don't care.
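To make the field layout concrete, here is a small Python sketch that splits a 20-bit hub byte address into the x / a / kkk / bb fields described above. The function name is made up for illustration, and the bit positions are simply read off the x_aaaaaaaaaaaaaa_kkk_bb pattern, so treat it as a model rather than anything from the official docs.

```python
def decode_hub_address(addr):
    """Split a 20-bit hub byte address into (x, a, kkk, bb).

    Layout assumed from the post: x_aaaaaaaaaaaaaa_kkk_bb
      bb  = bits 1:0   byte within the long
      kkk = bits 4:2   which of the 8 blocks
      a   = bits 18:5  long address inside that block (14 bits)
      x   = bit 19     don't care (only 512KB is populated)
    """
    addr &= 0xFFFFF              # keep 20 bits
    bb = addr & 0x3
    kkk = (addr >> 2) & 0x7
    a = (addr >> 5) & 0x3FFF
    x = (addr >> 19) & 0x1
    return x, a, kkk, bb


# Consecutive longs land in consecutive blocks; after 8 longs, 'a' increments.
for address in (0x00000, 0x00004, 0x0001C, 0x00020, 0x00023):
    x, a, kkk, bb = decode_hub_address(address)
    print(f"${address:05X}: x={x} a={a} kkk={kkk} bb={bb}")
```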
Now each cog has access to one 32-bit hub bus. There are 8 cogs and 8 x 32-bit hub busses, with each 32-bit hub bus going to a single hub block.
Here comes the egg-beater..........
On the first hub cycle...
cog 0 is connected via the hub bus 0 to block 0
cog 1 via bus 1 to block 1
etc.
On the next hub cycle...
cog 0 is connected via the hub bus 1 to block 1
cog 1 via hub bus 2 to block 2
etc.
So you can see, at each hub cycle, every cog is connected to one of the eight 64KB (i.e. 16K x 32-bit) hub blocks.
Because the block is selected by the "kkk" bits (address bits a4:a2), on each successive hub clock the long that a cog can access moves up by one 32-bit address, or in other words the address is incremented by a "long".
This means that to do a block transfer between cog 0 and hub RAM, a full 32-bit (long) transfer can take place on each and every hub clock, provided the address is incrementing. Note that on the same clocks, cog 1 can access the hub long address plus 1.
This means that all 8 blocks can be accessed in parallel by the 8 cogs, provided each cog's block address is 1 long more than the next lower cog's hub address.
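The rotation described above can be modelled in a few lines of Python. This is only a sketch of the description in this post: the rule block = (cog + cycle) mod 8 and the function names are my assumptions, and the real silicon may rotate in the other direction.

```python
NUM_COGS = 8  # also the number of hub blocks


def block_for(cog, cycle):
    """Block that `cog` is connected to on hub cycle `cycle` (assumed rotation)."""
    return (cog + cycle) % NUM_COGS


def clocks_until_block(cog, cycle, block):
    """How many clocks `cog` must wait, starting at `cycle`, to reach `block`."""
    return (block - block_for(cog, cycle)) % NUM_COGS


# One row per hub cycle, one column per cog.
for cycle in range(3):
    print(cycle, [block_for(cog, cycle) for cog in range(NUM_COGS)])
```

On every cycle each cog sees a different block, so no two cogs ever contend for the same block, and the longest wait for any particular block is 7 clocks.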
Hope you can understand my wording here
And, byte resolution...
The hub accesses rotate by clock cycle and lower address bits.
Clock 0, Cog 0 has access to addresses ending in $0
Clock 0, Cog 1 has access to addresses ending in $1
etc...
Clock 1, Cog 0 has access to addresses ending in $F
Clock 1, Cog 1 has access to addresses ending in $0
etc...
The diagram @ozpropdev posted was for 16 cogs. I think the only difference is there are only 8 cogs to rotate addresses through, but what I can't remember is whether the addresses are still 4 bits of rotation or 3 bits now.
I think the answer is 3.
Not really, it's also relevant for block moves if the FIFO isn't in use at the time (e.g. executing from Cog or LUT Ram).
On P1 if you want to read eight consecutive addresses from Hub you need to wait for eight Hub cycles - up to 128 clocks (1.6us @ 80MHz.)
On P2 if you want to read eight consecutive addresses from Hub with RDLONG you need to wait for the address alignment (up to 7 clocks) and then you can read them all within 8 clocks - using a prior SETQ - for a worst case of 15 clocks (187.5ns @ 80MHz, 83.3ns @ 180MHz, 50ns @ 300MHz.)
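The nanosecond figures here are just clock counts divided by the clock frequency. A quick Python sketch to reproduce them (the helper name is made up for illustration):

```python
def clocks_to_ns(clocks, freq_mhz):
    """Convert a number of system clocks to nanoseconds at freq_mhz."""
    return clocks * 1000.0 / freq_mhz


# P1: up to 128 clocks for eight hub reads at 80MHz.
print("P1 128 clocks @ 80MHz:", clocks_to_ns(128, 80), "ns")

# P2: worst case of 15 clocks for a SETQ + RDLONG burst of eight longs.
for mhz in (80, 180, 300):
    print(f"P2 15 clocks @ {mhz}MHz: {clocks_to_ns(15, mhz):.1f} ns")
```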
The fact that all 8 Cogs can do this simultaneously may be largely irrelevant from a single Cog perspective, but from a system wide perspective it is relevant.
Then add the remaining 7 longwords on for the prop2 example SETQ'd burst read, to make worst case of 23 sysclocks for all 8 consecutive longwords.
For the prop1, up to 23 sysclocks for the first longword then 16 sysclocks per longword thereafter. 7 x 16 + 23 = 135 sysclocks for the same 8 consecutive longwords.
For random addresses, prop1 is no change. Prop2 is worst case of 8 x 16 = 128 sysclocks, similar to the prop1, but could be as low as 8 x 9 = 72 sysclocks, which is notably better.
Thank you all so much. These explanations helped me.
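The totals in this post can be checked with a little arithmetic. The sketch below only replays the numbers quoted here; the function names are hypothetical, and the 16-sysclock figure for the first longword of the prop2 burst is my inference from "23 total minus 7 remaining longwords", not something stated outright.

```python
def prop1_sequential(n_longs, first=23, per_long=16):
    """Prop1 worst case: up to `first` sysclocks, then `per_long` each."""
    return first + (n_longs - 1) * per_long


def prop2_burst(n_longs, first=16, per_long=1):
    """Prop2 SETQ burst worst case: `first` is inferred (23 - 7), then 1 per long."""
    return first + (n_longs - 1) * per_long


def prop2_random(n_longs, per_access):
    """Prop2 random addresses: every access pays `per_access` sysclocks."""
    return n_longs * per_access


print("prop1, 8 consecutive longs:", prop1_sequential(8))
print("prop2 burst, 8 consecutive longs:", prop2_burst(8))
print("prop2 random, worst / best:", prop2_random(8, 16), "/", prop2_random(8, 9))
```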
In the P1, each cog only had an access window every 16 clocks (for 8 cogs).
In the P2, each cog has an access window every clock, but that access can only be to the one of the 8 long slices that is currently connected to this cog.
So for P1, each read/write of byte/word/long could only occur every 16th clock.
On P2, each read/write of byte/word/long can occur on every clock, but then subsequent accesses must be incremented by 1 long address.
So P2, when accessing successive longs by program (i.e. not SETQ + RDLONG/WRLONG), can only do so every 9 clocks (being 8 to get back to the same address, +1 for the next address).
Is there any information about this SETQ + RDLONG/WRLONG behavior? I can't find any in the Google table of instructions.
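Here is a rough Python model of why separate instructions end up 9 clocks apart while a burst gets one long per clock. It assumes the slice a cog sees is (cog + cycle) mod 8 and that a separately issued instruction cannot be ready on the very next clock (the min_gap parameter). Both are illustrative assumptions and the names are hypothetical, not taken from the P2 documentation.

```python
NUM_SLICES = 8


def next_slot(ready_cycle, cog, long_index):
    """First cycle >= ready_cycle at which `cog` reaches the slice of long_index."""
    target = long_index % NUM_SLICES
    current = (cog + ready_cycle) % NUM_SLICES
    return ready_cycle + (target - current) % NUM_SLICES


def sequential_access_cycles(cog, first_long, count, min_gap=2):
    """Cycles at which `count` successive longs are accessed.

    min_gap is the assumed minimum number of clocks between two accesses:
    1 models a SETQ burst, anything larger models separate instructions.
    """
    cycles = []
    ready = 0
    for i in range(count):
        t = next_slot(ready, cog, first_long + i)
        cycles.append(t)
        ready = t + min_gap
    return cycles


print("separate instructions:", sequential_access_cycles(0, 0, 4))
print("burst (min_gap=1):    ", sequential_access_cycles(0, 0, 4, min_gap=1))
```

Once the cog misses the slot for the next long by even one clock, it has to wait for the rotation to come all the way round again, which is where the 8 + 1 = 9 comes from.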
found it
Thanks
Surac
On P1, the address is simply rounded down to the nearest aligned address (i.e. it ignores the bottom bits).
On P2, it actually reads the unaligned data as you might expect, but at the cost of an additional clock cycle if the data being accessed crosses the boundary of aligned longs (because then the latter part of the data resides in the next memory block).
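A small Python sketch of the two behaviours, as I understand them from the description above. The function names are made up; size is the access size in bytes (1, 2 or 4).

```python
def p1_effective_address(addr, size):
    """P1 model: the bottom address bits are ignored, so the access is aligned."""
    return addr & ~(size - 1)


def p2_crosses_long_boundary(addr, size):
    """P2 model: True if the access spills into the next aligned long,
    which is the case described as costing one extra clock."""
    return (addr & 3) + size > 4


for addr, size in ((0x1000, 4), (0x1002, 4), (0x1002, 2), (0x1003, 2), (0x1003, 1)):
    print(f"${addr:04X} size {size}: "
          f"P1 reads from ${p1_effective_address(addr, size):04X}, "
          f"P2 extra clock: {p2_crosses_long_boundary(addr, size)}")
```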
IIRC Hub memory instructions are a bit slower in hubexec because of the FIFO being in use.
I will try to draw an actual diagram of the cog/hub interface based on the info you all gave me.
Surac