Having trouble visualizing the HUB access modes

frank freedman · 2022-05-10 06:03

I'm trying to understand the hub RAM in the P2 and not quite getting it. I get that there are 8 RAM areas, but if I am using a particular slice, do I still have to wait for my turn to access that slice again? I don't get the advantage of cycling through the slices. I tend to think in terms of schematics and pictures, and I just can't find enough in the descriptions so far to build a picture I can understand. Can anyone point to some threads that can help with this? A Google search turns up some info, but nothing coherent enough to put it all together without wading through random (meaning search results in no particular order) pages of discussion.

Would it be possible for one of the web discussions to include a topic on all the ways the hub can be made to work in general, perhaps with some specific examples?

Thanks for any threads that may help.

pik33 · 2022-05-10 07:24

I don't get the advantage of cycling through the slices.

The cog has access to one slice every clock. This means it can transfer data from the hub at 1 long per cycle without disturbing the other cogs, which can each access 1 long per cycle at the same time. The setq/setq2 + rdlong/wrlong/wmlong combination enables fast block moves of data between the hub and cog RAM or LUT.

I cannot imagine any simpler architecture to achieve this. If the hub were not sliced, there would have to be an overcomplicated bus control circuit instead, with priority settings and the rest of the evil stuff we know from bigger computers. Or, if the hub were used like on a P1, a cog would transfer one long every 8 clocks instead of every clock.

evanh · 2022-05-10 09:17

Address mapping wise, each slice occupies every eighth longword in linear address space.

byte address $00 is slice 0, longword 0
byte address $04 is slice 1, longword 0
byte address $08 is slice 2, longword 0
byte address $0c is slice 3, longword 0
byte address $10 is slice 4, longword 0
byte address $14 is slice 5, longword 0
byte address $18 is slice 6, longword 0
byte address $1c is slice 7, longword 0

byte address $20 is slice 0, longword 1
byte address $24 is slice 1, longword 1
byte address $28 is slice 2, longword 1
byte address $2c is slice 3, longword 1
byte address $30 is slice 4, longword 1
byte address $34 is slice 5, longword 1
byte address $38 is slice 6, longword 1
byte address $3c is slice 7, longword 1

byte address $40 is slice 0, longword 2
byte address $44 is slice 1, longword 2
byte address $48 is slice 2, longword 2
byte address $4c is slice 3, longword 2
byte address $50 is slice 4, longword 2
byte address $54 is slice 5, longword 2
byte address $58 is slice 6, longword 2
byte address $5c is slice 7, longword 2

... [and so on]
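
The table above follows a simple rule, sketched here in Python. This is only an illustration of the mapping described in this post; the function name `slice_and_longword` is made up for the example and is not part of any Parallax tool.

```python
NUM_SLICES = 8       # one slice per cog on the P2, as described above
BYTES_PER_LONG = 4

def slice_and_longword(byte_addr):
    """Return (slice, longword-within-slice) for a hub byte address."""
    long_index = byte_addr // BYTES_PER_LONG   # which long in linear hub space
    return long_index % NUM_SLICES, long_index // NUM_SLICES

# Reproduce a few rows of the table
for addr in (0x00, 0x04, 0x1C, 0x20, 0x5C):
    s, lw = slice_and_longword(addr)
    print(f"byte address ${addr:02x} is slice {s}, longword {lw}")
```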

The arrangement provides incremental burst transfers: it is optimised for sequential throughput, with simple symmetrical access for all cores. The sacrifice is random access performance. Average latency is much higher than a single-processor architecture would see.

Wuerfel_21 · 2022-05-10 12:11

Do note that the average latency (for writes in particular) is actually lower than on the P1, on a cycle-by-cycle basis. Though you can't as effectively design your code around hiding the latency because it now depends on the low bits of the address.

SaucySoliton · 2022-05-20 04:32

I think the hub should have been called "Memory Go 'Round." There are 8 people standing around the Memory Go 'Round. Everyone wants to touch a horse. On the P1 you have 1 horse that goes around. Only one person at a time can touch the horse.

With the P2 there are 8 horses. Every one of the 8 people can touch a horse at the same time. But it might not be the horse they want to touch. If you want to read a bunch of data, the "horses" are arranged so you can touch one horse after another as they go by.

These animations are shown on the left side of the forum. I have them blocked because I don't like to see unnecessary animations.
https://www.parallax.com/wp-content/uploads/2020/11/p1-hub-ram-interface.gif
https://www.parallax.com/wp-content/uploads/2020/11/p2-hub-ram-interface.gif

Cluso99 · 2022-05-21 04:31

@SaucySoliton said:
I think the hub should have been called "Memory Go 'Round." There are 8 people standing around the Memory Go 'Round. Everyone wants to touch a horse. On the P1 you have 1 horse that goes around. Only one person at a time can touch the horse.

With the P2 there are 8 horses. Every one of the 8 people can touch a horse at the same time. But it might not be the horse they want to touch. If you want to read a bunch of data, the "horses" are arranged so you can touch one horse after another as they go by.

These animations are shown on the left side of the forum. I have them blocked because I don't like to see unnecessary animations.
https://www.parallax.com/wp-content/uploads/2020/11/p1-hub-ram-interface.gif
https://www.parallax.com/wp-content/uploads/2020/11/p2-hub-ram-interface.gif

Excellent analogy

frank freedman · 2022-05-21 08:16

Ok, so good so far. If I want to touch a particular horse, not just any, I need to wait until it shows up again. Questions I have concern what happens at the other end of the horse. So, let's change the horse into a pipe. My cog is at one end of a specific pipe, say pipe number one. What is at the other end of the pipe? In other words, how is the space at the other end of the pipe arranged? Does each pipe have a dedicated range of hub space at the end of the pipe that goes around with that pipe giving each cog a chance at a taste of the hub space at the end of the pipe, or is the pipe able to connect the cog to any hub space?

@Cluso99 I am liking the retroblade2, stuck a rPi heat sink on the prop. May have to do the same to the regulators. Running the ADC-VGA_millivolts demo with 8 inputs will run the heat sink to about 104F, and the regulators out to about 135F or so as measured with an IR temp tester. An oddness with the demo, one of the inputs is in the ground bus, and there is a pretty consistent -40 to -50 mV value read out to the VGA display.

evanh · 2022-05-21 08:28

@"frank freedman" said:
Does each pipe have a dedicated range of hub space at the end of the pipe that goes around with that pipe giving each cog a chance at a taste of the hub space at the end of the pipe, or is the pipe able to connect the cog to any hub space?

The former. If the standing people are Cogs then each horse is a solid block of SRAM containing 512kB / 4 / 8 = 16384 longwords. On each sysclock tick the horses rotate. Eight ticks for all horses to pass by all people.

PS: Horse == slice.

EDIT: Having said that, just to confuse you again, the "pipe" does kind of get to see all of hubRAM as well. A Cog's pipe is less a single piece and more a switching system, as it sequentially joins all Cogs to all Slices.

This is known as a crosspoint switch or crossbar switch. High-performance "data fabrics" use them. This is the first time I've seen one used in a lower end product. I expect IBM has used an advanced version in its latest PowerZ CPU to manage the newly virtualised L3/L4 caching across the vast L2 physical.

frank freedman · 2022-05-21 19:48

Thank you @evanh. A block diagram and/or a drawing showing this would make it a bit more clear as to what is going on. A drawing similar to the Smart pins would be good if there is one I have not found yet.

evanh · 2022-05-21 20:01

Hmm, the switching would just be an 8 x 8 grid with Cogs on one axis and Slices on the other, with control lines going off to an ethereal sequencer. And in practice there is a layer of latency-adding buffering as well.

Not very informative unless it was implementation exact. Chip doesn't have a diagram because it was completely done in Verilog.

pik33 · 2022-05-21 20:13

Slice #0 covers addresses (in longs) 0, 8, 16, 24...
Slice #1 covers 1, 9, 17...

... etc

So the cog first has to wait for its selected long to appear. Then, on every following cycle, it has access to the next long in the hub address space.

At the time when cog #0 can access long #0, cog #1 can access long #1, etc. This means there are no bus conflicts between cogs, and all of them can transfer one long per clock at the same time. This is an awesome solution to the problem, and maybe the simplest possible.
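
To make the rotation concrete, here is a minimal Python sketch. It assumes that at tick t cog c is connected to slice (c + t) mod 8; the exact phase and direction in the real silicon are an assumption here, and `slice_seen` and `wait_ticks` are names invented for this illustration.

```python
NUM_COGS = 8
NUM_SLICES = 8

def slice_seen(cog, tick):
    # ASSUMPTION: cog c is connected to slice (c + tick) mod 8.
    # The real chip's phase/direction may differ; the principle is the same.
    return (cog + tick) % NUM_SLICES

def wait_ticks(cog, tick, long_index):
    """Ticks until the slice holding long_index comes around to this cog."""
    target = long_index % NUM_SLICES
    return (target - slice_seen(cog, tick)) % NUM_SLICES

for tick in range(NUM_SLICES):
    seen = [slice_seen(c, tick) for c in range(NUM_COGS)]
    # On every tick each cog is on a different slice, so there are no conflicts.
    assert len(set(seen)) == NUM_COGS
    print(f"tick {tick}: cogs 0..7 see slices {seen}")
```

In this model a random access waits anywhere from 0 to 7 ticks depending on the low bits of the long address, while a sequential access always finds the next long already arriving.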

That's why rflong and setq/rdlong work the way they do: they have to wait first until the proper long comes around for the cog, but then they transfer one long per clock, as many as needed. That is also how hubexec works, using a FIFO. If this FIFO is empty enough, it waits for the proper slot, then fills at 1 long per cycle while the cog executes what is already stored.
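
A rough Python model of that cost, as a sketch only: it counts just the slot wait plus one tick per long, and ignores any fixed instruction or buffering overhead the real chip adds. The function names and the `slice_now` parameter (the slice currently facing the cog) are hypothetical.

```python
NUM_SLICES = 8

def burst_ticks(start_long, count, slice_now):
    """Ticks to move `count` sequential longs: initial wait, then 1 per tick."""
    wait = (start_long - slice_now) % NUM_SLICES   # 0..7 ticks for the first long
    return wait + count

def one_at_a_time_ticks(count):
    """Unsliced hub shared round-robin by 8 cogs: one long per 8 ticks."""
    return NUM_SLICES * count

print(burst_ticks(start_long=3, count=512, slice_now=5))
print(one_at_a_time_ticks(512))
```

Under these assumptions the initial wait is a small fixed cost, so the longer the block, the closer the transfer gets to one long per clock.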

evanh · 2022-05-22 01:18

Same info again, labelling use of Cog's hubRAM address bits:

 512 kB hubRAM addressing
==========================
 A18..A5 (16k longwords per slice)
 A4..A2 (slice index)
 A1..A0 (byte sub-address)

And if there was 16 Cogs there would also be 16 slices and the result would be this (assuming the full 1 MB total):

 1 MB hubRAM addressing
========================
 A19..A6 (16k longwords per slice)
 A5..A2 (slice index)
 A1..A0 (byte sub-address)
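
The two layouts above can be sketched as one bit-field split in Python. This is an illustration only; `split_hub_address` and its `slice_bits` parameter are hypothetical names, with 3 slice bits for the real 8-cog P2 and 4 for the imagined 16-cog part.

```python
def split_hub_address(addr, slice_bits=3):
    """Split a hub byte address into (longword in slice, slice index, byte)."""
    byte_in_long = addr & 0b11                            # A1..A0
    slice_index = (addr >> 2) & ((1 << slice_bits) - 1)   # A4..A2 (or A5..A2)
    long_in_slice = addr >> (2 + slice_bits)              # remaining upper bits
    return long_in_slice, slice_index, byte_in_long

print(split_hub_address(0x5D))                # 8 slices
print(split_hub_address(0x5D, slice_bits=4))  # hypothetical 16 slices
print(split_hub_address(0x7FFFF))             # top of 512 kB
```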

Reminiscing time ... https://forums.parallax.com/discussion/164364/prop2-family/p1

Cluso99 · 2022-05-22 03:42

@"frank freedman" said:
Ok, so good so far. If I want to touch a particular horse, not just any, I need to wait until it shows up again. Questions I have concern what happens at the other end of the horse. So, let's change the horse into a pipe. My cog is at one end of a specific pipe, say pipe number one. What is at the other end of the pipe? In other words, how is the space at the other end of the pipe arranged? Does each pipe have a dedicated range of hub space at the end of the pipe that goes around with that pipe giving each cog a chance at a taste of the hub space at the end of the pipe, or is the pipe able to connect the cog to any hub space?

@Cluso99 I am liking the retroblade2, stuck a rPi heat sink on the prop. May have to do the same to the regulators. Running the ADC-VGA_millivolts demo with 8 inputs will run the heat sink to about 104F, and the regulators out to about 135F or so as measured with an IR temp tester. An oddness with the demo, one of the inputs is in the ground bus, and there is a pretty consistent -40 to -50 mV value read out to the VGA display.

I've never seen it that hot. I now have the RPi heatsink on the P2, and the next smaller size fits over the 3 regulators. I found a 25x25x5 DF2507S5M 5VDC fan on eBay and plan to fit it on top of the P2 heatsink.

evanh · 2022-05-22 03:53

It's an F, not a C. Not going to be cooking any eggs that way.

Cluso99 · 2022-05-24 02:56

@evanh said:
It's an F, not a C. Not going to be cooking any eggs that way.

Thanks. Missed that
