
Understanding the Eggbeater

Hello

I have some problems understanding the hub slices, the FIFO interface, and the block transfer modes in and out of hub RAM.
Is there any documentation? I have already read the Google document and the instruction table, but I think I need some more
explanation.

Thank you
Surac

Comments

  • Rayman Posts: 13,901
    Trying to remember...
    I think it's not much different from P1 … Except for the FIFO.
  • Cluso99 Posts: 18,069
    If you search the very old P2 forum threads about the hub, you'll find more discussion. It was probably around 12 months after the P2-HOT, so 2016 maybe?

    Basically, the HUB is divided into 8 blocks of 64KB (16K x 32 bits). The blocks are interleaved on the lowest "long" address bits. We have 1MB of hub address space capability although only 512KB exists, so we have 20-bit addressing.
    x_aaaaaaaaaaaaaa_kkk_bb
    bb = byte address within the selected long
    kkk = address bits selecting the respective block
    aaaaaaaaaaaaaa = long address within the chosen block
    x = basically a don't care.

    Now each cog has access to one 32-bit hub bus. There are 8 cogs and 8 x 32-bit hub buses, with each bus going to a single hub block.
    Here comes the egg-beater..........
    On the first hub cycle...
    cog 0 is connected via the hub bus 0 to block 0
    cog 1 via bus 1 to block 1
    etc.
    On the next hub cycle...
    cog 0 is connected via the hub bus 1 to block 1
    cog 1 via bus 2 to block 2
    etc.

    So you can see, at each hub cycle, every cog is connected to one of the hub 64KB (ie 16Kx32bit) blocks.
    Because the block decoding uses the "kkk" bits (address bits a4:a2), on each successive hub clock the 32 bits a cog can access move up by one 32-bit address; in other words, the address is incremented by a "long". (There's a small sketch of this decoding at the end of this post.)

    This means that to do a block transfer between cog 0 and hub RAM, a full 32-bit (long) transfer can take place on each and every hub clock, provided the address is incrementing. Note that on the same clocks, cog 1 can access the hub long address plus 1.
    This means that each of the 8 blocks can be accessed in parallel by the 8 cogs, providing each cog's block address is 1 long more than the next lower cog's hub address.

    Hope you can understand my wording here ;)
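
    To make the slice selection concrete, here is a minimal PASM2 sketch of the address split above (untested; hubaddr, slice and offset are just placeholder register names):

            mov     slice, hubaddr      ' hubaddr = hub byte address
            shr     slice, #2           ' drop bb (byte-within-long bits)
            and     slice, #7           ' keep kkk = which of the 8 hub blocks (0..7)

            mov     offset, hubaddr
            shr     offset, #5          ' drop bb and kkk, leaving the long
                                        ' address within the chosen 16K-long block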
  • ozpropdev
    Here's a diagram from 2014 when we had 16 cogs, but it shows the concept.
    P2_hubram.jpg (attached diagram)
  • Rayman Posts: 13,901
    Isn't it the same as P1? Except for FIFO...

    And, byte resolution...
  • No, it's not like P1.

    The hub accesses rotate by clock cycle and lower address bits.

    Clock 0, Cog 0 has access to addresses ending in $0
    Clock 0, Cog 1 has access to addresses ending in $1

    etc...

    Clock 1, Cog 0 has access to addresses ending in $F
    Clock 1, Cog 1 has access to addresses ending in $0

    etc...

    The diagram @ozpropdev posted was for 16 cogs. I think the only difference is that there are only 8 cogs to rotate addresses through, but what I can't remember is whether the rotation is still 4 address bits or 3 bits now.

    I think the answer is 3.

  • Not the same as P1 because each COG in the P2 can now do a read at the same time to different slices of the memory and all get a result on the same clock cycle. On P1 you had to wait for your hub window to appear once per hub-cycle (only one COG got it at a time) and you could only get your result on that clock cycle. It was one transfer every 16 P1 clocks for each P1 COG, but is now potentially 8 transfers every 8 P2 clocks on ALL P2 COGs at the same time.
  • Rayman Posts: 13,901
    But the difference from P1 only matters for the FIFO, right?
  • AJL Posts: 515
    edited 2020-05-15 01:50
    Rayman wrote: »
    But the difference from P1 only matters for the FIFO, right?

    Not really, it's also relevant for block moves if the FIFO isn't in use at the time (e.g. executing from Cog or LUT Ram).

    On P1 if you want to read eight consecutive addresses from Hub you need to wait for eight Hub cycles - up to 128 clocks (1.6us @ 80MHz.)
    On P2 if you want to read eight consecutive addresses from Hub with RDLONG you need to wait for the address alignment (up to 7 clocks) and then you can read them all within 8 clocks - using a prior SETQ - for a worst case of 15 clocks (187.5ns @ 80MHz, 83.3ns @ 180MHz, 50ns @ 300MHz.) A quick sketch of the SETQ form is at the end of this post.

    The fact that all 8 Cogs can do this simultaneously may be largely irrelevant from a single Cog perspective, but from a system wide perspective it is relevant.
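
    A minimal PASM2 sketch of that SETQ burst read (untested; buf is a block of cog registers and hubaddr holds the hub byte address - both names are placeholders):

            setq    #7                  ' count-1, so transfer 8 longs
            rdlong  buf, hubaddr        ' burst read: buf..buf+7 fill at one long per
                                        ' clock once the starting slice comes around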


  • evanh Posts: 15,192
    The numbers are a little worse. You've forgotten the +8 sysclocks of initial fetch time, which applies to both the prop1 and prop2. The first longword will arrive after 8-23 sysclocks on the prop1, and 9-16 sysclocks on an 8-cog prop2.

    Then add the remaining 7 longwords for the prop2 SETQ'd burst read example, to make a worst case of 16 + 7 = 23 sysclocks for all 8 consecutive longwords.

    For the prop1, up to 23 sysclocks for the first longword, then 16 sysclocks per longword thereafter. 7 x 16 + 23 = 135 sysclocks for the same 8 consecutive longwords.

    For random addresses, the prop1 numbers don't change. The prop2 worst case is 8 x 16 = 128 sysclocks, similar to the prop1, but could be as low as 8 x 9 = 72 sysclocks, which is notably better.
  • Hello

    Thank you all so much. These explanations helped me.

  • Cluso99 Posts: 18,069
    ozpropdev wrote: »
    Here's a diagram from 2014 when we had 16 cogs, but it shows the concept.
    P2_hubram.jpg
    The concept is right, but the addresses are not quite right - they are shown for long accesses with the byte/word address bits missing.

    In the P1, each cog only had an access window every 16 clocks (for 8 cogs).
    In the P2, each cog has an access window every clock, but that access can only be to the one hub block (slice) currently rotated to this cog.

    So for P1, each read/write of byte/word/long could only occur every 16th clock.
    On P2, a read/write of a byte/word/long can occur on every clock, but to keep that rate the subsequent accesses must increment by 1 long address.

    So on P2, a program accessing successive longs individually (ie not SETQ + RDLONG/WRLONG) can only do so every 9 clocks (being 8 to get back around to the same slice, +1 for the next address). The sketch below shows the case I mean.
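
    Something like this naive per-long loop (untested sketch; the names are placeholders) is the slow case - each RDLONG waits for the hub rotation to come around again:

            mov     ptr, ##buffer       ' hub byte address of the first long
            mov     cnt, #8
    .loop   rdlong  val, ptr            ' waits for ptr's slice to rotate to this cog
            add     ptr, #4             ' the next long lives in the next slice
            djnz    cnt, #.loop

    Whereas the SETQ + RDLONG form keeps up with the rotation and moves one long per clock.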
  • Surac Posts: 176
    edited 2020-05-15 08:12
    - So running code from COG/LUT and HUB should run at the same speed if no jumps are used?
    - Is there any information about this SETQ + RD/WRLONG behavior? I can't find any in the Google table of instructions.
    Edit: found it.

    Thanks
    Surac
  • Also somewhat relevant: Unaligned WORD/LONG sized memory access behaves differently from P1.
    On P1, the address is simply rounded down to the next aligned address (i.e. it ignores the bottom bits)
    On P2, it actually reads the unaligned data as you might expect, but at the cost of an additional clock cycle if the data being accessed crosses the boundary of aligned longs (because then the latter part of the data resides in the next memory block)
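
    For example (untested; the address is just an illustrative value):

            mov     ptr, ##$01001       ' hub byte address that is not a multiple of 4
            rdlong  val, ptr            ' returns the 4 bytes starting at $01001; crossing
                                        ' into the next aligned long costs one extra clock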
  • Rayman Posts: 13,901
    I think hubexec does run same speed as cogexec if there are no jumps.
  • Rayman wrote: »
    I think hubexec does run same speed as cogexec if there are no jumps.

    IIRC Hub memory instructions are a bit slower in hubexec because of the FIFO being in use.
  • evanh Posts: 15,192
    edited 2020-05-15 11:03
    Rayman is correct. Hubexec matches cogexec when not branching. In a straight line, the FIFO is twice as fast as the cog can go. The FIFO is that fast to satisfy maximum streamer throughput.
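
    For the FIFO-interface part of the original question, a minimal sketch of using the FIFO directly (untested; the labels are placeholders, and since hubexec uses the same FIFO this pattern is for code running from cog/LUT RAM):

            rdfast  #0, ##source        ' start a fast sequential hub read at source
            mov     cnt, #16
            mov     sum, #0
    .loop   rflong  val                 ' pull the next sequential long from the FIFO
            add     sum, val            ' do something with each long
            djnz    cnt, #.loop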
  • Wow

    I'll try to draw an actual diagram of the cog/hub interface based on the info you all gave me.

    Surac