What is needed to integrate a dual/quad P2 system?

samuell · 2019-12-08 14:52

Hi,

I'm studying the possibility of building a system that integrates two, or even four, P2 chips. My idea is to seamlessly emulate a 16- or 32-core MCU. Power delivery is not an issue (it is easy for me to design DC-DC converter modules that can supply 5A or more). Also, I expect to design the clock distribution system so that the chips can be clocked simultaneously. I'll probably have to disable the PLL on the chips and clock them directly (300MHz or more, if possible). I'll implement active heat sinking, of course.

As for the power delivery, I'll supply 12V to the system, which will be down-converted to 3.3V and 1.8V for the chips. The DC-DC converter for 1.8V will have to be beefy, for sure. The 3.3V will probably be supplied by a very low noise DC-DC converter, as local LDOs are not really needed (except for any pins that are to be analog in nature).

My question is, what pins have to be connected to integrate both (or four) chips into a single 16-core (or 32-core) emulated MCU?

Kind regards, Samuel Lourenço

kwinn · 2019-12-08 16:43

I don't think there is any way of truly integrating 2/4 P2s to produce a 16/32-core MCU. However, you could use a bus or direct connections between the P2s so that they can communicate and emulate a 16/32-core MCU. Which interconnection scheme you use would depend on the speed and/or latency you need. As always, there are trade-offs to consider, and they will depend on your goals.

How many free I/O pins do you want?
Do you want all cogs to have access to all hub memories?
Will inter P2 communications be cog to cog or simply P2 to P2?

An interesting idea and I'm sure there are a lot more things to decide.

Mickster · 2019-12-08 16:44

Hey, I'd buy one of these just for the heck of it

samuell · 2019-12-08 16:57

kwinn wrote: »

I don't think there is any way of truly integrating 2/4 P2s to produce a 16/32-core MCU. However, you could use a bus or direct connections between the P2s so that they can communicate and emulate a 16/32-core MCU. Which interconnection scheme you use would depend on the speed and/or latency you need. As always, there are trade-offs to consider, and they will depend on your goals.
...

I think someone mentioned that the integration could be done via the streamer. I'm aware that the communication between the P2s will be a bottleneck, but that can be minimized.

kwinn wrote: »

...
How many free I/O pins do you want?
...

I don't need to have many free I/O pins. Probably, one of the P2 chips will have to dedicate some pins to USB communication. I don't know whether the others should share access to one external EEPROM or each have a dedicated EEPROM, for that matter. Anyway, the more pins they share, the better.

kwinn wrote: »

...
Do you want all cogs to have access to all hub memories?
...

Definitely. That would be more seamless.

kwinn wrote: »

...
Will inter P2 communications be cog to cog or simply P2 to P2?
...

Cog to cog, definitely, if possible.

Mickster wrote: »

Hey, I'd buy one of these just for the heck of it

This is just an expensive experiment, and it is still in the thought phase (not even in the design phase). But, I'm glad you are interested. All options are open, for now.

Kind regards, Samuel Lourenço

jmg · 2019-12-08 18:59

samuell wrote: »

... Also, I'm expecting to design the clock distribution system so that the chips can be clocked simultaneously. Probably I'll have to disable the PLL on the chips, and clock them directly (300MHz or more, if possible). ..

The PLLs should lock with jitter well under 1ns, meaning a single oscillator module would likely be OK. If you wanted a more active clock system, though, a clock synth chip that includes skew/delay adjust on each of 4 clock outputs could be interesting to try.
That would allow you to phase shift the P2s precisely.

samuell · 2019-12-08 19:41

jmg wrote: »

samuell wrote: »

... Also, I'm expecting to design the clock distribution system so that the chips can be clocked simultaneously. Probably I'll have to disable the PLL on the chips, and clock them directly (300MHz or more, if possible). ..

The PLLs should lock with jitter well under 1ns, meaning a single oscillator module would likely be OK. If you wanted a more active clock system, though, a clock synth chip that includes skew/delay adjust on each of 4 clock outputs could be interesting to try.
That would allow you to phase shift the P2s precisely.

If the PLL on all chips is guaranteed to lock well and in sync, that would definitely simplify the clock distribution system. It would then only require a 20MHz oscillator (not just a simple crystal) and a clock distributor (something along the lines of the CDCLVC1102; skew is hardly an issue with that). Distributing high frequencies is a more complex subject.

Kind regards, Samuel Lourenço

jmg · 2019-12-08 19:58

samuell wrote: »

If the PLL on all chips is guaranteed to lock well and in sync, that would definitely simplify the clock distribution system. It would then only require a 20MHz oscillator (not just a simple crystal) and a clock distributor (something along the lines of the CDCLVC1102; skew is hardly an issue with that). Distributing high frequencies is a more complex subject.

You could look at the Si5351A, which comes in QFN20 with up to 8 clock outputs or in MSOP10 with up to 3 outputs, and can synthesise up to 200MHz.
There is a 10-pin version on the latest P2D2s that has 3 clock outs; even that could drive 4 P2s.

kwinn · 2019-12-08 22:43

Using the streamer to communicate between P2's would certainly make data transfers faster, and having more pins available makes it faster still. It's an interesting idea that has a lot of potential. After thinking about it I have come to the conclusion that the specific hardware/software approach would have to be tailored to the problem at hand.

dMajo · 2019-12-09 08:53

I would also consider an ESP32 in the project. It gives you a WiFi interface immediately and, with only an RJ45 connector with integrated magnetics, wired Ethernet as well.
Plus it has an integrated 4-bit SD interface.

I also think that an Si5351A is more than enough to deliver the clock to the P2s.

samuell · 2019-12-09 12:33

jmg wrote: »

samuell wrote: »

If the PLL on all chips is guaranteed to lock well and in sync, that would definitely simplify the clock distribution system. It would then only require a 20MHz oscillator (not just a simple crystal) and a clock distributor (something along the lines of the CDCLVC1102; skew is hardly an issue with that). Distributing high frequencies is a more complex subject.

You could look at the Si5351A, which comes in QFN20 with up to 8 clock outputs or in MSOP10 with up to 3 outputs, and can synthesise up to 200MHz.
There is a 10-pin version on the latest P2D2s that has 3 clock outs; even that could drive 4 P2s.

dMajo wrote: »

I would also consider an ESP32 in the project. It gives you a WiFi interface immediately and, with only an RJ45 connector with integrated magnetics, wired Ethernet as well.
Plus it has an integrated 4-bit SD interface.

I also think that an Si5351A is more than enough to deliver the clock to the P2s.

Regarding that, I was going to consider another alternative, but then it hit me that the P2 has an integrated clock multiplier. I think I could use the Si5351A to generate 150MHz, and then multiply that by two.

As long as we don't need fractional derived clocks, clock skew between the P2s should not be a problem. And transmitting a 150MHz clock is hardly an issue, I think.
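
As a rough sanity check on the multiplier arithmetic discussed here, the following Python sketch lists the integer divider/multiplier pairs that turn a reference clock into a target clock. It models a generic integer PLL only; the function name `pll_settings` and the limits `max_div` and `max_mul` are assumptions for illustration and are not taken from the P2 documentation, which should be consulted for the real ranges and VCO limits.

```python
from fractions import Fraction

def pll_settings(f_ref_hz, f_target_hz, max_div=64, max_mul=1024):
    """Return (div, mul) pairs with f_ref / div * mul == f_target exactly.

    max_div and max_mul are assumed limits for a generic integer PLL.
    """
    ratio = Fraction(f_target_hz, f_ref_hz)
    results = []
    for div in range(1, max_div + 1):
        mul = ratio * div
        # Keep only settings where the multiplier is a whole number in range.
        if mul.denominator == 1 and 1 <= mul <= max_mul:
            results.append((div, int(mul)))
    return results

# 150MHz from the clock generator, doubled on chip.
print(pll_settings(150_000_000, 300_000_000)[0])  # (1, 2)
# A plain 20MHz oscillator would need a multiplier of 15 instead.
print(pll_settings(20_000_000, 300_000_000)[0])   # (1, 15)
```

Both routes reach 300MHz with integer settings, so the choice between them comes down to how easy each reference frequency is to distribute on the board.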

kwinn wrote: »

Using the streamer to communicate between P2's would certainly make data transfers faster, and having more pins available makes it faster still. It's an interesting idea that has a lot of potential. After thinking about it I have come to the conclusion that the specific hardware/software approach would have to be tailored to the problem at hand.

I have yet to figure out which pins to use. Certainly, one of the P2s will act as a master: it will be interfaced via USB, manage the clock generator and access the flash memory. The other could be programmed via the first. The bus between them would use dedicated pins. I wonder if it is doable.

Kind regards, Samuel Lourenço

kwinn · 2019-12-09 17:03

samuell wrote: »

jmg wrote: »

samuell wrote: »

If the PLL on all chips is guaranteed to lock well and in sync, that would definitely simplify the clock distribution system. It would then only require a 20MHz oscillator (not just a simple crystal) and a clock distributor (something along the lines of the CDCLVC1102; skew is hardly an issue with that). Distributing high frequencies is a more complex subject.

You could look at the Si5351A, which comes in QFN20 with up to 8 clock outputs or in MSOP10 with up to 3 outputs, and can synthesise up to 200MHz.
There is a 10-pin version on the latest P2D2s that has 3 clock outs; even that could drive 4 P2s.

dMajo wrote: »

I would also consider an ESP32 in the project. It gives you a WiFi interface immediately and, with only an RJ45 connector with integrated magnetics, wired Ethernet as well.
Plus it has an integrated 4-bit SD interface.

I also think that an Si5351A is more than enough to deliver the clock to the P2s.

Regarding that, I was going to consider another alternative, but then it hit me that the P2 has an integrated clock multiplier. I think I could use the Si5351A to generate 150MHz, and then multiply that by two.

As long as we don't need fractional derived clocks, clock skew between the P2s should not be a problem. And transmitting a 150MHz clock is hardly an issue, I think.

kwinn wrote: »

Using the streamer to communicate between P2's would certainly make data transfers faster, and having more pins available makes it faster still. It's an interesting idea that has a lot of potential. After thinking about it I have come to the conclusion that the specific hardware/software approach would have to be tailored to the problem at hand.

I have yet to figure out which pins to use. Certainly, one of the P2s will act as a master: it will be interfaced via USB, manage the clock generator and access the flash memory. The other could be programmed via the first. The bus between them would use dedicated pins. I wonder if it is doable.

Kind regards, Samuel Lourenço

Oh, I am pretty sure it's doable using either a shared bus between all the P2's or direct connections between the individual chips. Took a quick look at routing a shared 16 pin bus for 4 P2's last night and it does not appear to be too hard. Will do the same for direct P2-P2 connections tonight if time permits.

msrobots · 2019-12-10 00:37

I am working on and off on the same problem.

Currently I have something running on two P2 rev A chips, but I have not had time to convert it to P2 rev B yet.

There is a long winding thread about it, but not up to date anymore.

I call it Ringbuffer. It is not the fastest solution, but the basic idea goes like this.

You dedicate a certain amount of HUB RAM as shared. It does not have to be at the same location in each P2, but it has to be the same size. This buffer is sent around and is protected by a P2 software lock.

Each P2 gives up one COG and 2 to 64 pins (64 makes no sense, but could work theoretically).

Half of the pins are inputs, the other half outputs. All P2s need to be daisy-chained, the last back to the first, so that all P2s are connected as a ring.

One master COG needs to start the communication, and then the buffer gets sent from P2 to P2. When a new buffer is received, the lock gets released so that other COGs can take possession of the buffer (locking it), do their respective reads or writes, and unlock the buffer. The communication COG then locks it again and sends it to the next P2.

So there is ONE buffer circulating round robin from P2 to P2.

But basically one has now a shared HUB protected by a lock across multiple P2s.

It could be used like the normal mailboxes on the P1, except that a COG that wants to read or write needs to acquire a lock before doing so. Otherwise it is completely self-contained in the communication COG, and the other COGs can be completely unaware of it, except for the need to use a lock when using the shared buffer.

Theoretically, at least, since the streamer modes have changed.
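
The circulating-buffer scheme described above can be sketched as a small Python simulation. This is only an illustration of the locking order, not the PASM implementation: the `Node` class, the `circulate` function and the `work` callback are hypothetical names, a plain loop stands in for the streamer link, and `threading.Lock` stands in for the P2 lock.

```python
import threading

class Node:
    """One P2 in the ring: a local copy of the shared buffer plus a lock."""
    def __init__(self, name, size):
        self.name = name
        self.buffer = bytearray(size)
        self.lock = threading.Lock()

def circulate(nodes, laps, work):
    """Pass a single buffer image around the ring `laps` times.

    `work(node)` stands in for the local COGs, which must take the
    node's lock themselves before touching the shared buffer.
    """
    image = bytes(len(nodes[0].buffer))
    for _ in range(laps):
        for node in nodes:
            with node.lock:               # comm COG holds the lock while receiving
                node.buffer[:] = image
            work(node)                    # lock released: local COGs may use it
            with node.lock:               # comm COG re-locks before sending on
                image = bytes(node.buffer)
    return image

nodes = [Node(f"P2-{i}", size=4) for i in range(4)]

def work(node):
    slot = nodes.index(node)              # each P2 owns one mailbox byte
    with node.lock:
        node.buffer[slot] += 1

final = circulate(nodes, laps=3, work=work)
print(list(final))  # [3, 3, 3, 3]
```

Each node only ever edits its local copy while holding the lock, so a change becomes visible to the other P2s one hop at a time as the buffer travels around the ring.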

Mike

samuell · 2019-12-10 00:45

I'm thinking on connecting four P2s, using 16 pins on each to connect to the 16 pins of the adjacent one. The same could be done with just two P2s.
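
A quick pin budget for this layout, as a Python sketch. It assumes each link to an adjacent chip takes its own 16 pins and that each P2 has 64 I/O pins in total; the function name `ring_pin_budget` is made up for the example, and pins reserved for boot, USB or flash are not counted.

```python
def ring_pin_budget(chips, link_width=16, pins_per_chip=64):
    """Return (pins used for links, pins left) for one chip in the ring."""
    if chips < 2:
        raise ValueError("need at least two chips to form a link")
    # Two chips share a single link; in a larger ring each chip has two neighbours.
    neighbours = 1 if chips == 2 else 2
    used = neighbours * link_width
    return used, pins_per_chip - used

print(ring_pin_budget(2))  # (16, 48)
print(ring_pin_budget(4))  # (32, 32)
```

Under these assumptions a quad ring spends half of each chip's pins on the interconnect, while a dual setup keeps 48 pins free per chip.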

Kind regards, Samuel Lourenço

Peter Jakacki · 2019-12-10 01:32

I designed a system once that had 2 P1s on a sensor pcb, another as an I/O controller, and another as the main processor. However I have never had a need for 4 Prop chips interconnected as if it were some multi-core compute engine. I can do that easily with plenty of other chips that are available or even with a GA144 (144 CPUs), or even in FPGA which I can do with an array of J1 CPUs for instance.

So my question is this: What do you intend to do with a quad array of P2s, and how is this more effective than currently available solutions?

msrobots · 2019-12-10 02:39

Well, my idea is to be able to program each P2 as usual, but be able to access a COG in another P2 as transparently as possible.

The usual way to talk to COGs on the P1 is a mailbox interface in HUB RAM. So I share the mailboxes from multiple P2s in a buffer that gets sent around.

Say I have one P2 with @rogloh's video driver using a lot of HUB RAM and sharing a mailbox buffer with the rest of the P2s, and I have another P2 talking to the video driver as if it were on its own HUB. Maybe the video driver is the wrong example here; it would be more the graphics engine attached to the video driver that I would need to talk to via the ring buffer.

The basic idea is to have more COGs, more RAM and more pins in an expandable way, but relatively transparent to the software. Each P2 has access to the buffer, one at a time, and it gets sent around by the streamer, so the code is pretty much the same regardless of the data bus width.
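
To get a feel for how the bus width affects how long the buffer takes to go around, here is a back-of-the-envelope Python sketch. Every number in it is an assumption: a 1024-byte shared buffer, 8 output pins per link (16 pins per P2, half of them outputs), one transfer per clock at 300MHz, and no allowance for the time local COGs hold the lock or for protocol overhead. The function name `lap_time_us` is hypothetical.

```python
def lap_time_us(buffer_bytes, data_pins, chips, clock_hz, clocks_per_transfer=1):
    """Estimate the time in microseconds for the buffer to visit every chip once."""
    bits = buffer_bytes * 8
    transfers = -(-bits // data_pins)     # ceiling division
    seconds_per_hop = transfers * clocks_per_transfer / clock_hz
    return chips * seconds_per_hop * 1e6

# Four P2s, 8 data pins per link, 1024-byte buffer, 300MHz.
print(f"{lap_time_us(1024, 8, 4, 300_000_000):.2f}")   # 13.65
# Doubling the link width halves the transfer part of the lap time.
print(f"{lap_time_us(1024, 16, 4, 300_000_000):.2f}")  # 6.83
```

Real figures would be higher, because the buffer also has to rest on each P2 long enough for the local COGs to use it.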

So I have one P2 with keyboard, mouse and video (even two?), connected to some other P2s that I can program, and they can share the keyboard, mouse and screens by accessing their own HUB RAM.

It would even work transparently with a TAQOZ P2. Once loaded, the Ringbuffer COG is independent, and TAQOZ could read and write its local copy of the buffer in HUB, provided it uses the lock that the PASM Ringbuffer uses to avoid collisions and to synchronize all buffers.

Is there a need for it? Not sure, but why not?

Mike

samuell · 2019-12-10 23:22

msrobots wrote: »

Well, my idea is to be able to program each P2 as usual, but be able to access a COG in another P2 as transparently as possible.

The usual way to talk to COGs on the P1 is a mailbox interface in HUB RAM. So I share the mailboxes from multiple P2s in a buffer that gets sent around.

Say I have one P2 with @rogloh's video driver using a lot of HUB RAM and sharing a mailbox buffer with the rest of the P2s, and I have another P2 talking to the video driver as if it were on its own HUB. Maybe the video driver is the wrong example here; it would be more the graphics engine attached to the video driver that I would need to talk to via the ring buffer.

The basic idea is to have more COGs, more RAM and more pins in an expandable way, but relatively transparent to the software. Each P2 has access to the buffer, one at a time, and it gets sent around by the streamer, so the code is pretty much the same regardless of the data bus width.

So I have one P2 with keyboard, mouse and video (even two?), connected to some other P2s that I can program, and they can share the keyboard, mouse and screens by accessing their own HUB RAM.

It would even work transparently with a TAQOZ P2. Once loaded, the Ringbuffer COG is independent, and TAQOZ could read and write its local copy of the buffer in HUB, provided it uses the lock that the PASM Ringbuffer uses to avoid collisions and to synchronize all buffers.

Is there a need for it? Not sure, but why not?

Mike

Same idea here. This would make the P2 even more capable. The GA144 is very limited in memory per core. The P2 doesn't have that limitation, and it is more readily available. Sure, a GPU built around the P2 could use the GA144 for the menial tasks, with its 144 cores in essence acting like a bunch of CUDA cores. But a cluster of 2/4 P2s would be far more flexible.

And, why not?

Kind regards, Samuel Lourenço

Peter Jakacki · 2019-12-11 00:37

You wanted a quad P2 array?

samuell · 2019-12-11 01:02

Peter Jakacki wrote: »

You wanted a quad P2 array?

Will buy it.

Kind regards, Samuel Lourenço

kwinn · 2019-12-11 03:17

Peter Jakacki wrote: »

You wanted a quad P2 array?

Can you assemble them without separating them? I might start saving my pennies if you can.

Peter Jakacki · 2019-12-11 03:51

The pcbs are in a frameless v-grooved panel of 4x4 and that is how they are assembled. I suppose if you are really keen you could use this as a single pcb. Personally I would rather design a special pcb for 4 P2s wired up the way I would want and simplify some of the other stuff.
