Parallel Prop2s
fixmax
Posts: 91
in Propeller 2
Something I'm curious about: with the Prop2 locked at 8 cores running 120 MHz, are there any plans for linking multiple Prop2 chips together to get more than 8 cores? I know several folks have built 2-, 4-, or more-chip boards for the Prop1.
Is there a mechanism built in, or planned, that will provide a (relatively speaking) seamless method of using more than 8 cores in a design? Something along the lines of the XMOS approach, where the link mechanism lets you use more cores than a single chip has?
Even if it's just an application note or something, I would be interested in hearing about this. I had been interested in seeing this done using Forth (I can't remember whose version), and with the Prop2 having Tachyon in some form, maybe this will be easy to do? Has anyone thought about this?
What would really be cool is a compiler with a define so you could have multiple Prop targets in one project file; it would load each Prop's portion of the code, with a mechanism to pass data between them and a synchronizing method (a clock or something).
Comments
I suspect multiple P2s will need to bypass each one's PLL and drive from a common 160 MHz SysCLK in order to have predictable smart pin phase from unit to unit.
That's not trivial, and it raises a lot of EMC issues...
Lower speed links might be able to send CLK:DATA and use a slave clock setup, and so allow the PLL to be used.
True, but that's primarily a concern if you are trying to achieve maximum transfer rates. In many cases, simple two-wire async serial will be more than adequate. If I recall, you had run some numbers at one point for the maximum reasonable baud rate between two P2s running at 160 MHz (or maybe it was 120 MHz, at the time). Or maybe @cgracey came up with those numbers. Either way, those numbers (4+ MBps?) seemed quite fast enough for a lot of use cases. And if the thing you're trying to do with the P2 is massively parallel processing, I suspect you've picked the wrong architecture in the first place.
Standard UARTs are quite comfortable at SysCLK/8, which means 20 MBd at 160 MHz. You might push SysCLK/6 or SysCLK/4, but you are also likely to bump into SW timing at those speeds.
IIRC the P2 does support 32-bit async, so that drops the 'char' rate the SW has to support.
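To make the numbers above concrete, here's a quick back-of-the-envelope sketch of what SysCLK/8 works out to with 8-bit versus 32-bit async framing. The 1-start/1-stop framing is the usual async convention; treat the figures as rough estimates, not measured throughput.

```python
# Rough throughput estimate for async serial at SysCLK/8, as discussed above.
SYSCLK = 160_000_000          # Hz
baud = SYSCLK // 8            # "comfortable" UART rate -> 20 MBd

# 8-bit frames: 1 start + 8 data + 1 stop = 10 bits per char
char_rate_8 = baud // 10      # chars per second the SW must service

# 32-bit async frames: 1 start + 32 data + 1 stop = 34 bits per word,
# so the per-word interrupt/polling rate drops by roughly 3.4x
word_rate_32 = baud // 34
payload_bytes_per_s = word_rate_32 * 4

print(baud, char_rate_8, word_rate_32, payload_bytes_per_s)
```

So a 32-bit framed link at 20 MBd carries a bit over 2 MB/s of payload while only asking the software to handle ~588k events per second instead of 2 million.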
I2C has speed specs at 1 MHz, 3.4 MHz, and 5 MHz.
A DDR I2C variant might work well on the P1: use standard I2C for the address, and once the ACK is seen, either keep standard I2C or change gears to DDR for the data packets?
BUT...
you have to admit, the idea of just plugging a bunch of P2 boards together with in-built LVDS seems pretty attractive.
I have 2M baud 9-bit async half-duplex comms for networking multiple P1s, which can use RS485 or just a single I/O line if they are relatively close together. The network simulates full-duplex, so I can even chat with or download code to an individual P1, address a group, etc.
It works nicely on a P1 and should do the same on a P2.
The basic idea is to share some HUB RAM buffer by sending it continuously from one Prop to the next in a ring of Propellers.
Every Propeller can read the whole buffer in its own HUB RAM and modify its content. If I remember correctly, @Beau used one cog for sending and one for receiving the HUB buffer continuously.
This allows, with careful planning of the buffer structure, the common mailbox system used with PASM on one Prop to be extended across multiple Props.
I guess that using the smart pins, the continuous sending and receiving can be done in one cog instead of the two @Beau used.
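The ring idea above can be sketched in a few lines. This is just a simulation of the data flow (not Prop code): each node keeps a local copy of the shared buffer, forwards it around the ring, and treats its own mailbox slot as authoritative. The buffer size, slot assignment, and `Node` class are all hypothetical, purely for illustration.

```python
# Minimal simulation of the shared-buffer ring: each Prop holds a local copy
# of the HUB buffer, forwards it to the next node, and owns one slot in it.
BUF_WORDS = 8

class Node:
    def __init__(self, slot):
        self.slot = slot               # the one slot this node may write
        self.buf = [0] * BUF_WORDS     # local copy of the shared HUB buffer

    def on_receive(self, incoming):
        # Merge the incoming buffer, keeping our own slot authoritative.
        own = self.buf[self.slot]
        self.buf = list(incoming)
        self.buf[self.slot] = own
        return self.buf                # what gets sent on around the ring

def circulate(nodes, laps=2):
    frame = list(nodes[0].buf)
    for _ in range(laps):
        for n in nodes:
            frame = n.on_receive(frame)
    return frame

nodes = [Node(i) for i in range(3)]
for i, n in enumerate(nodes):
    n.buf[n.slot] = 100 + i            # each Prop posts to its mailbox slot
print(circulate(nodes))                # after the laps, every slot has propagated
```

After enough laps for the frame to make it around, every node's local buffer holds every other node's slot, which is what lets the one-Prop mailbox convention extend across the ring.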
Certainly I will try that whenever I can get hold of multiple P2 boards.
Mike
A compact wait-test-rotate-djnz loop works out to roughly a 4-4.44 MHz data rate.
You could send a frame of [Address.R/W] as standard I2C, and then send [VarAdr16][Data32]..[Data32] for a compact but highly flexible P1-to-P1 interface.
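For what it's worth, the proposed frame layout is easy to picture as a byte-packing sketch. This is just an illustration of the [Address.R/W][VarAdr16][Data32].. structure; the function name, big-endian byte order, and field widths beyond what the post states are assumptions.

```python
# Sketch of the proposed frame: an I2C-style address byte, a 16-bit variable
# address, then a run of 32-bit data words. Byte order is assumed big-endian.
import struct

def pack_frame(addr7, read, var_adr, words):
    hdr = bytes([(addr7 << 1) | (1 if read else 0)])      # [Address.R/W]
    body = struct.pack(">H", var_adr)                     # [VarAdr16]
    body += b"".join(struct.pack(">I", w) for w in words) # [Data32]..[Data32]
    return hdr + body

frame = pack_frame(0x50, False, 0x0100, [0xDEADBEEF, 0x12345678])
print(frame.hex())  # a00100deadbeef12345678
```

Three bytes of overhead per frame, and the 16-bit variable address means either side can target any location in a 64 KB window, which is plenty for HUB-sized mailboxes.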
It required one cog to be dedicated to producing a clock signal; each processor used two pins, one for the clock and one for data, with a pull-down resistor on the data line.
I never took the time to optimize the code in assembly — the Spin version worked fine for my application — but the basics of the scheme were:
A sync pulse of some length (I don't recall the count... 4-8 clock cycles) was sent at the beginning of a packet transfer, followed by 384 clock "ticks". When the other processors needed to "listen" or "talk", they would wait for this low sync pulse; then, as soon as the clock went high again, that marked the start of the packet for processor A, with a bit transferred at the start of each clock tick. After 128 ticks it was processor B's turn to transfer data, and after 256 ticks, processor C's turn.
Each processor had to have a slightly different version of the program so it knew where its block of data started, but with only three processors it wasn't a big deal to change the code for each. As its "turn" to talk came around, it could pull the data line high or leave it low for each data bit. The other processors would "listen" for whatever data from the others was relevant to their needs.
This wasn't really elegant code... I wouldn't use it to connect more than two or three processors, but it worked fine to give me a controller board with lots of I/O and 23 available cogs working together as a team.
Depending on the application, the buffer could be smaller or larger, or more data lines could be added to increase transfer speed if needed.