Prospects for Chip Interleaving

rjo__ · 2016-05-31 03:57

One of the ways that some of the gurus use to improve bandwidth for the P1 is to interleave cogs.
A question that I can't get out of my head is:

Given the right board architecture, is it going to be possible to interleave 2 or more P2's?

evanh · 2016-05-31 05:44

HubRAM reads pretty much occur naturally for internally clocked timing. The only trick was to align the start times of each Cog so that they output the data in order.

The difficult part comes when trying to align with an external clock source. The Prop1 would have had to asynchronously oversample for this. The exception being where the data is in short frames that can be resync'd to for each frame.

The Prop2 has much better opportunity to run more in-sync with a continuous external data source.

rjo__ · 2016-05-31 06:20

One of the fun parts of interdisciplinary work is trying to express what is supposed to be a precise idea in rather imprecise terms.
The process depends upon the expert understanding the question... which invariably is not really the question, but some
reasonable approximation:)

Talking to engineers isn't easy:)

What I imagine is a clock source that does the work distributing alternating clocks between two or more chips...
So... is that enough to roughly double the signal bandwidth and analysis? Accepting that the results will be delayed
by the process of communication between the chips...?

Still doesn't sound right.

Cluso99 · 2016-05-31 06:31

The question is confused???

You give an example of how we interleave cogs on the P1, then seem to ask about the same concept using multiple P2's.

Do you mean interleaving cogs on P2? If so, yes we can still do that.

If you mean use 2 x P2's then there will be the same problems as synchronising 2 x P1's.
You are not likely to keep 2 x P1's or P2's in full sync because the xtal is multiplied up using a PLL. There will likely be some drift so you will not keep in absolute sync. At least that is what I understand from the PLL circuitry - perhaps someone with better knowledge can correct this if I am wrong.
However, to perform various processing using 2 chips will still be possible, depending of course on what you are doing. I use 3 x P1's in a commercial circuit. If I redesigned it with P2, then I could reduce it to 2 x P2's. There is no point in replacing the other P1 as it only does a mundane job that pretty much any cheap micro could do - I just wanted to keep my code using the prop.

ErNa · 2016-05-31 08:18

rjo__ wrote: »

One of the ways that some of the gurus use to improve bandwidth for the P1 is to interleave cogs.
A question that I can't get out of my head is:

Given the right board architecture, is it going to be possible to interleave 2 or more P2's?

This is certainly a more academic question, because in general simplicity is destroyed. But I can imagine creating an external phase-shifted clock chain so that the Props' execution cycles are interleaved.

jmg · 2016-05-31 08:47

rjo__ wrote: »

What I imagine is a clock source that does the work distributing alternating clocks between two or more chips...
So... is that enough to roughly double the signal bandwidth and analysis? Accepting that the results will be delayed
by the process of communication between the chips...?

Only in very special cases.

The SiLabs Si5351 for example, can phase-set to 333ps, so you could use that to clock 2 P2's in any fine phase adjustment.

Perhaps that fine-phase control could be used by someone wanting better PWM precision, but they might use SI5351 + P2 + Simple Gate for that, rather than 2 x P2.

A more general use of multiple P2's could use the same-phase clocks, and sync between those to have all 32 COGs time-locked.
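A rough sketch of the phase arithmetic behind that idea, in Python. It only converts a desired per-chip phase offset into a whole number of 333ps steps; it does not model how an Si5351 is actually programmed. The 20MHz clock and the function name are hypothetical, chosen purely for illustration.

```python
PHASE_STEP_PS = 333  # phase step size quoted in the post above

def phase_steps_for_interleave(clock_hz, chips, step_ps=PHASE_STEP_PS):
    """For each chip k, return (k, steps, achieved_offset_ps).

    Chip k should ideally lag chip 0 by k/chips of a clock period;
    the offset is rounded to the nearest whole phase step.
    """
    period_ps = 1e12 / clock_hz
    plan = []
    for k in range(chips):
        target_ps = k * period_ps / chips
        steps = round(target_ps / step_ps)
        plan.append((k, steps, steps * step_ps))
    return plan

# Hypothetical example: two chips sharing a 20MHz (50ns) clock.
for chip, steps, offset_ps in phase_steps_for_interleave(20_000_000, 2):
    print(f"chip {chip}: {steps} steps, {offset_ps} ps offset")
```

With these assumed numbers the second chip ends up 75 steps (24975ps) behind the first, 25ps short of an exact half period.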

evanh · 2016-05-31 11:15

rjo__ wrote: »

One of the fun parts of interdisciplinary work is trying to express what is supposed to be a precise idea in rather imprecise terms.
The process depends upon the expert understanding the question... which invariably is not really the question, but some
reasonable approximation:)

Talking to engineers isn't easy:)

Sorry, I was being a bit blasé. I shouldn't have been peeking while at work.

My first sentence - HubRAM reads pretty much occur naturally for internally clocked timing. - needs some explaining. It applies specifically to sending data from the Prop. The important detail is that the Prop generates the framing timing, e.g. video out. The receiving device at the other end automatically syncs up to the Prop's timing.

This means that the Prop only has to get the timing lengths right. When it starts sending is not important.

How this then fits in with the HubRAM timing is important because the Prop is free to choose to line up the start timing with when the first Cog is able to access its Hub data. Each successive Cog will naturally fall into line because the Hub reads are predictably ordered.

Second sentence of first paragraph - The only trick was to align the start times of each Cog so that they output the data in order. - was pointing out that there is detail in spacing the Hub reads to get the natural timing fit.
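To make the "predictably ordered" point concrete, here is a minimal Python sketch of Prop1 hub arbitration. It assumes the usual Prop1 arrangement of 8 cogs with one 2-clock hub slot each, so a given cog's window recurs every 16 system clocks. Clock numbers are relative to cog 0's first window, and the function names are made up for the example.

```python
# Simplified model of Prop1 hub arbitration (assumption: 8 cogs, one
# 2-clock hub slot per cog, so each cog's window recurs every 16 clocks).
P1_COGS = 8
CLOCKS_PER_SLOT = 2
ROTATION = P1_COGS * CLOCKS_PER_SLOT  # 16 system clocks per full rotation

def hub_window(cog, turn):
    """System clock at which `cog` gets its hub window on rotation `turn`."""
    return turn * ROTATION + cog * CLOCKS_PER_SLOT

def interleaved_schedule(cogs, turns):
    """Hub windows for a group of cogs, sorted into time order."""
    return sorted((hub_window(c, t), c) for t in range(turns) for c in cogs)

for clock, cog in interleaved_schedule([0, 1, 2, 3], 2):
    print(f"clock {clock:3d}: cog {cog} reads hub")
```

In this model four adjacent cogs get their windows 2 clocks apart, then the pattern repeats 16 clocks later, which is why the output order falls out naturally once the start times are aligned.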

First sentence of second paragraph - The difficult part comes when trying to align with an external clock source. - is a very loose phrase. The term "clock source" is not meant literally here. The signal may or may not come with an actual clock. A better phrase might have been - The difficult part comes when trying to sync up with an incoming synchronous data stream.

Second sentence of second paragraph - The Prop1 would have had to asynchronously oversample for this. - What I think I was thinking about here was getting the collected data into HubRAM. There is no way, in the Prop1, for a continuous synchronous datastream to run at a rate even close to a Cog's best Hub write speed. The demands of the two out-of-step parts, Hub rotation and receive frame, will clash.

Third sentence of second paragraph - The exception being where the data is in short frames that can be resync'd to for each frame. - The WAITxxx instructions are very handy for precisely finding the synchronous edge of a frame, as is very well demonstrated with the UART soft devices.

However, this relies on having breaks in the datastream so doesn't always suit.

Third paragraph - The Prop2 has much better opportunity to run more in-sync with a continuous external data source. - The FIFO is one excellent example of a new feature that provides greater flexibility to sync up to an external synchronous datastream. It specifically allows the poking of received data into HubRAM asynchronously to the Hub rotation.

LUT sharing is the latest in this category. The secondary Cog can handle the buffering, letting the primary I/O Cog focus on holding the best synchronisation with the outside device.

Smartpins/Streamer helps with automating although this may only extend to Prop sourced clocking. SDRAM/HyperRAM comes to mind here.

rjo__ · 2016-06-01 19:38

"A more general use of multiple P2's could use the same-phase clocks, and sync between those to have all 32 COGs time-locked."

I might have to try this... I haven't scratched my head all over the recent P2v9 enhancement, so this could change, but at the moment I think I am running out of cogs for my PropCam array:)

rjo__ · 2016-06-01 19:40

Evanh,

I'm all into the v9 release. Please allow me to get back to you later with a few questions.

Rich

evanh · 2016-06-02 21:35

The short answer is a big YES. The Prop2 not only makes it easier to align multiple Cogs to the I/O timing but also lets them readily use HubRAM without having to worry about its rotation timing.

evanh · 2016-06-03 01:24

I'm guessing if you want to make use of the Hub rotation in a synchronously short read or write then it would be similar to the Prop1. Eg: Reading the same location over and over with a RDLONG will be the same repeating 16-clock timing. But reading consecutive longs will have a +1 (17-clock) lagging precession in that timing.
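A toy Python model of that guess, under stated assumptions rather than anything confirmed in this thread: the hub has 16 slices (matching the 16-cog FPGA builds of the time), long `addr` lives in slice `addr % 16`, a cog is connected to slice `(cog + t) % 16` at clock `t`, and `MIN_GAP` is a hypothetical stand-in for the instruction overhead that stops back-to-back accesses on adjacent clocks.

```python
# Simplified rotating-hub model. All constants are assumptions for illustration.
SLICES = 16   # assumed number of hub slices
MIN_GAP = 2   # hypothetical minimum clocks between two hub accesses by one cog

def access_times(cog, addrs, slices=SLICES, min_gap=MIN_GAP):
    """Clock at which each long in `addrs` is read, in order, by `cog`."""
    times = []
    earliest = 0
    for addr in addrs:
        # Wait until the rotation connects this cog to the slice holding addr.
        wait = (addr % slices - (cog + earliest)) % slices
        t = earliest + wait
        times.append(t)
        earliest = t + min_gap
    return times

def gaps(times):
    """Clock spacing between successive accesses."""
    return [b - a for a, b in zip(times, times[1:])]

print(gaps(access_times(0, [5, 5, 5, 5])))  # same long repeatedly
print(gaps(access_times(0, [5, 6, 7, 8])))  # consecutive longs
```

In this model, re-reading one long repeats every 16 clocks, while stepping through consecutive longs gives the 17-clock spacing described above, because each next long sits one slice further round the rotation.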

rjo__ · 2016-06-03 03:23

I mixed apples and oranges. The first question is more important... can we impose a fixed phase delay on two different P2's (to roughly double the frequencies of signals that we can measure and decode)? This might be of no interest in the practical world... it might be true, but useless for other reasons, sort of a Rube Goldberg approach... or it might be true but too technically complicated and costly to actually implement.

As to the second question, phase sync'd P2's on separate P123 boards... YES!!!

I do have a practical demo in mind and when I get far enough with it, I'll no doubt have some questions.

Thanks guys

Rich

jmg · 2016-06-03 03:44

rjo__ wrote: »

The first question is more important... can we impose a fixed phase delay on two different P2's ...

Yes, you can do that externally - see my note about Clock Generator chip above.
Just how much practical use that is, depends on the application.

You can do that with other parts too.
I was looking at the Microchip 1M Serial RAM: it has a 20MHz (50ns) max spec but gives 10/10ns tsu/th, which means you could clock a pair with phase-shifted clocks and thus sample every 25ns (40MHz).
That's the easy bit - once you have captured half the data in each chip, you then have to extract it, and re-merge it....

Or, the HyperRAM has a DDR 100MHz spec (5ns sample rate) but a 1.0ns/1.0ns tsu/th, so a pair could be loaded at 400MHz on an interleaved basis.
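The arithmetic and the re-merge step can be sketched in a few lines of Python. This is only an illustration: the function names are made up, the sample data is invented, and it assumes the devices are clocked with evenly spaced phases and each captured the same number of samples.

```python
def interleaved_interval_ns(device_period_ns, devices):
    """Effective sample interval when `devices` parts are clocked with evenly spaced phases."""
    return device_period_ns / devices

def merge_captures(*captures):
    """Re-merge per-device captures into one stream, in time order.

    Sample i of device k lands at index i * len(captures) + k.
    Extra samples beyond the shortest capture are dropped.
    """
    merged = []
    for group in zip(*captures):
        merged.extend(group)
    return merged

# Serial RAM pair: 50ns per device, two devices.
print(interleaved_interval_ns(50, 2))          # ns per sample
# HyperRAM pair: 5ns sample rate per device, two devices.
print(1000 / interleaved_interval_ns(5, 2))    # effective MHz

# Invented data: device A caught the even samples, device B the odd ones.
print(merge_captures([0, 2, 4], [1, 3, 5]))
```

The merge is the "extract it, and re-merge it" cost mentioned above: the capture itself is cheap, but every sample still has to be read back out of both devices and put back in order.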

rjo__ · 2016-06-03 04:32

I saw that but then got confused after it.

SO... the first rule holds: If you can't do it with one P2... you can keep adding P2's until you are satisfied.
There for a minute, I thought I had to add an exception.

Thank you very much!!!!
