Prospects for Chip Interleaving
rjo__
Posts: 2,114
in Propeller 2
One of the ways that some of the gurus use to improve bandwidth for the P1 is to interleave cogs.
A question that I can't get out of my head is:
Given the right board architecture, is it going to be possible to interleave 2 or more P2's?
Comments
The difficult part comes when trying to align with an external clock source. The Prop1 would have had to asynchronously oversample for this. The exception being where the data is in short frames that can be resync'd to for each frame.
The Prop2 has much better opportunity to run more in-sync with a continuous external data source.
The process depends upon the expert understanding the question... which invariably is not really the question, but some reasonable approximation:)
Talking to engineers isn't easy:)
What I imagine is a clock source that does the work distributing alternating clocks between two or more chips...
So... is that enough to roughly double the signal bandwidth and analysis? Accepting that the results will be delayed by the process of communication between the chips...?
Still doesn't sound right.
You give an example of how we interleave cogs on the P1, then seem to ask about the same concept using multiple P2's.
Do you mean interleaving cogs on P2? If so, yes we can still do that.
If you mean use 2 x P2's then there will be the same problems as synchronising 2 x P1's.
You are not likely to keep 2 x P1's or P2's in full sync because the xtal is multiplied up using a PLL. There will likely be some drift so you will not keep in absolute sync. At least that is what I understand from the PLL circuitry - perhaps someone with better knowledge can correct this if I am wrong.
However, to perform various processing using 2 chips will still be possible, depending of course on what you are doing. I use 3 x P1's in a commercial circuit. If I redesigned it with P2, then I could reduce it to 2 x P2's. There is no point in replacing the other P1 as it only does a mundane job that pretty much any cheap micro could do - I just wanted to keep my code using the prop.
The SiLabs Si5351 for example, can phase-set to 333ps, so you could use that to clock 2 P2's in any fine phase adjustment.
Perhaps that fine-phase control could be used by someone wanting better PWM precision, but they might use Si5351 + P2 + a simple gate for that, rather than 2 x P2.
A more general use of multiple P2's could use the same-phase clocks, and sync between those to have all 32 COGs time-locked.
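To put rough numbers on the phase-offset idea above, here is a minimal sketch. It only does the arithmetic: the 333ps step comes from the post, while the 20MHz clock and the function name are assumptions picked for illustration. It does not model the Si5351's actual registers or limits.

```python
def phase_steps(clock_mhz, n_chips, step_ps=333.0):
    """For each chip, return (step_count, residual_error_ps) that spreads
    the chips evenly across one clock period.

    step_ps is the phase resolution quoted in the post (333 ps);
    clock_mhz is a hypothetical clock frequency fed to every chip.
    """
    period_ps = 1e6 / clock_mhz                     # one clock period in ps
    result = []
    for k in range(n_chips):
        target_ps = k * period_ps / n_chips         # ideal offset for chip k
        steps = round(target_ps / step_ps)          # nearest whole step
        result.append((steps, target_ps - steps * step_ps))
    return result


if __name__ == "__main__":
    # Two chips on an assumed 20 MHz clock: second chip wants a 25 ns offset.
    for chip, (steps, err) in enumerate(phase_steps(20, 2)):
        print(f"chip {chip}: {steps} steps, residual {err:.1f} ps")
```

With those assumed numbers the second chip lands 75 steps later, about 25ps short of a perfect half period.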
My first sentence - HubRAM reads pretty much occurs naturally for internally clocked timing. - needs some explaining. It applies specifically to sending of data from the Prop. And the important detail being that the Prop is generating the timing of the framing, eg: Video out. The other end, receiving device, automatically syncs up to the Prop's timing.
This means that the Prop only has to get the timing lengths right. When it starts sending is not important.
How this then fits in with the HubRAM timing is important because the Prop is free to choose to line up the start timing with when the first Cog is able to access its Hub data. Each successive Cog will naturally fall into line because the Hub reads are predictably ordered.
Second sentence of first paragraph - The only trick was to align the start times of each Cog so that they output the data in order. - was pointing out that there is detail in spacing the Hub reads to get the natural timing fit.
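A toy model of the "predictably ordered" Hub reads described above, as a sketch only. It assumes a plain round-robin hub with 8 Cogs and 2 clocks per slot (a 16-clock rotation, as on the Prop1). The function names are made up for illustration and nothing here is real Prop code.

```python
def hub_slots(n_cogs=8, clocks_per_slot=2, rotations=2):
    """Return (clock, cog) pairs in time order for a round-robin hub.

    Assumed model: each cog gets one hub window per rotation, and the
    windows of successive cogs are clocks_per_slot apart.
    """
    rotation_len = n_cogs * clocks_per_slot
    slots = []
    for r in range(rotations):
        for cog in range(n_cogs):
            slots.append((r * rotation_len + cog * clocks_per_slot, cog))
    return slots


def interleaved_order(cogs, rotations=2):
    """Hub windows, in time order, for just the cogs taking part."""
    return [(clk, cog) for clk, cog in hub_slots(rotations=rotations)
            if cog in cogs]


if __name__ == "__main__":
    # Cogs 0 and 4 sit half a rotation apart, so their windows alternate
    # at an even 8-clock spacing.
    print(interleaved_order({0, 4}))
```

In this model, picking Cogs that are evenly spaced around the rotation gives evenly spaced reads, which is the "natural timing fit" the post refers to.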
First sentence of second paragraph - The difficult part comes when trying to align with an external clock source. - is a very loose phrase. The term "clock source" here is not meant literally. There may or may not be an actual clock accompanying the signal. A better phrase might have been - The difficult part comes when trying to sync up with an incoming synchronous data stream.
Second sentence of second paragraph - The Prop1 would have had to asynchronously oversample for this. - What I think I was thinking about here was getting the collected data into HubRAM. There is no way, in the Prop1, for a continuous synchronous datastream to run at a rate even close to a Cog's best Hub write speed. The demands of the two out-of-step parts, Hub rotation and receive frame, will clash.
Third sentence of second paragraph - The exception being where the data is in short frames that can be resync'd to for each frame. - The WAITxxx instructions are very handy for precisely finding the synchronous edge of a frame, as is very well demonstrated with the UART soft devices.
However, this relies on having breaks in the datastream so doesn't always suit.
Third paragraph - The Prop2 has much better opportunity to run more in-sync with a continuous external data source. - The FIFO is one excellent example of a new feature that provides greater flexibility to sync up to an external synchronous datastream. It specifically allows the poking of received data into HubRAM asynchronously to the Hub rotation.
LUT sharing is the latest in this category. The secondary Cog can handle the buffering, letting the primary I/O Cog focus on holding the best synchronisation with the outside device.
Smartpins/Streamer helps with automating although this may only extend to Prop sourced clocking. SDRAM/HyperRAM comes to mind here.
I might have to try this... I haven't scratched my head all over the recent P2v9 enhancement, so this could change, but at the moment I think I am running out of cogs for my PropCam array:)
I'm all into the v9 release. Please allow me to get back to you later with a few questions.
Rich
As to the second question, phase sync'd P2's on separate P123 boards... YES!!!
I do have a practical demo in mind and when I get far enough with it, I'll no doubt have some questions.
Thanks guys
Rich
Yes, you can do that externally - see my note about Clock Generator chip above.
Just how much practical use that is, depends on the application.
You can do that with other parts too.
I was looking at the Microchip 1M Serial RAM, and it has a 20MHz (50ns) max spec, but gives 10/10ns tsu/th, which means you could clock a pair with phase-shifted clocks, and thus sample every 25ns (40MHz).
That's the easy bit - once you have captured half the data in each chip, you then have to extract it, and re-merge it....
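The re-merge step is simple enough once the data is back in one place. A minimal sketch, assuming each chip's capture has already been read out into a list and that chip 0 took the first sample (the function name and list layout are hypothetical):

```python
def merge_interleaved(captures):
    """Re-merge samples captured alternately by several chips.

    captures[k] holds the samples taken by chip k, where chip k sampled
    at slots k, k + n, k + 2n, ... for n chips. If the lists differ in
    length, zip() stops at the shortest one.
    """
    merged = []
    for group in zip(*captures):   # one sample from each chip per round
        merged.extend(group)       # chip order is time order within a round
    return merged


if __name__ == "__main__":
    chip_a = [0, 2, 4, 6]          # even-numbered sample slots
    chip_b = [1, 3, 5, 7]          # odd-numbered sample slots
    print(merge_interleaved([chip_a, chip_b]))
```

The hard part in practice is the extraction from each chip, which this sketch skips.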
Or, the HyperRAM has a DDR 100MHz spec (5ns sample rate), but a 1.0ns/1.0ns Tsu/Th, so that could load a pair at 400MHz on an interleaved basis.
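The arithmetic behind both examples can be checked with a short sketch. The rule used here, that the setup plus hold window must fit inside the combined sample period, is my reading of the reasoning above and not a datasheet guarantee. The function name is hypothetical.

```python
def interleaved_timing(rate_mhz, n_chips, tsu_ns, th_ns):
    """Return (combined_period_ns, combined_rate_mhz, window_fits).

    rate_mhz is the per-chip sample rate in millions of samples per
    second. window_fits is True when tsu + th is no longer than the
    combined sample period (assumed criterion, see lead-in).
    """
    per_chip_period_ns = 1000.0 / rate_mhz
    combined_period_ns = per_chip_period_ns / n_chips
    combined_rate_mhz = rate_mhz * n_chips
    window_fits = (tsu_ns + th_ns) <= combined_period_ns
    return combined_period_ns, combined_rate_mhz, window_fits


if __name__ == "__main__":
    # Serial RAM figures from the post: 20 MHz, 10/10 ns tsu/th, a pair.
    print(interleaved_timing(20, 2, 10, 10))
    # HyperRAM figures from the post: DDR at 100 MHz is 200 Msamples/s
    # (5 ns per sample), 1.0/1.0 ns tsu/th, a pair.
    print(interleaved_timing(200, 2, 1.0, 1.0))
```

Both cases leave some margin on paper: 20ns of window in a 25ns slot, and 2ns in a 2.5ns slot. Clock skew and board layout are not accounted for.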
SO... the first rule holds: If you can't do it with one P2... you can keep adding P2's until you are satisfied.
There for a minute, I thought I had to add an exception.
Thank you very much!!!!