A truly audacious project by some very young guys. With Facebook money already.
Parallel processing has been tried over and over but always fell short of the gains in MIPS, FLOPS, or whatever provided by the single-CPU advances predicted by Moore.
Now that Moore's law has hit the wall, going parallel is the only way to go. Perhaps the young generation will finally figure out how to harness such parallel architectures programmatically.
This is far away from the Propeller idea, though, which dictates many processors tightly coupled to the outside world for real-time, real-world interaction. More like an FPGA built with CPUs instead of simple logic blocks.
The central issue with any multiprocessor scheme is the communication bottleneck. The CPU cited has a tessellation topology (which is nothing new, BTW), wherein each processor can communicate directly only with its adjacent neighbors. So getting a message from one corner of the array to the other is a bit more complicated. The Propeller, with its hub-and-spoke topology, is more tightly coupled, but it's a design that does not scale as well when adding more processors.
There's no magic bullet for communication, and there will always be design compromises to accommodate inter-processor comms.
-Phil
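Just to put the topology trade-off in rough numbers, here is a toy hop-count comparison in C (my own sketch, not any particular chip's routing): in an R x C mesh a worst-case message crosses (R-1)+(C-1) links, while a hub design is always two hops but every transfer contends for the same shared hub.

/* Toy comparison of worst-case message hops: mesh (tessellation) vs hub-and-spoke.
   Illustrative only - real chips add routing, arbitration and link-width details. */
#include <stdio.h>

/* Mesh: a corner-to-corner message travels Manhattan-distance hops. */
static int mesh_hops(int rows, int cols) {
    return (rows - 1) + (cols - 1);
}

/* Hub-and-spoke: every transfer is core -> hub -> core (2 hops),
   but all cores share the hub's bandwidth, so contention grows with core count. */
static int hub_hops(void) {
    return 2;
}

int main(void) {
    int sizes[][2] = { {2, 4}, {8, 18}, {16, 16} };
    for (int i = 0; i < 3; i++) {
        int r = sizes[i][0], c = sizes[i][1];
        printf("%2d x %-2d grid: mesh worst case %2d hops; hub always %d hops, shared by %d cores\n",
               r, c, mesh_hops(r, c), hub_hops(), r * c);
    }
    return 0;
}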
Parallel processing may be considered the next great panacea, but as Phil mentioned... the communication bottleneck makes a lot of solutions non-starters.
When I tried to use the GreenArrays GA-144 with its 144 parallel Forth processors, it became quite obvious that the Propeller's architecture was superior.
Just claiming a multi-core capacity means little. And more cores might actually make things worse rather than better if the architecture isn't completely thought out.
I'll take a hub-and-spoke design any day. It works enormously well with Forth, whereas the tessellation topology really requires a lot of additional planning about where data is and where you want it to go.
Those guys spending millions on supercomputers for weather forecasting, physics simulations, whatever they do, already have hundreds or thousands of parallel processors. They already have some idea how to partition their problems for such machines.
Now they want to go from teraflops to petaflops. It seems that doing that with current supercomputer node technology is not feasible: those Intel processors and all the hardware support they need mean the energy consumption requirements are astronomical.
Enter the idea of putting a lot of much simpler 64-bit floating-point processors on a chip with much tighter, on-chip communications. Instant savings in power consumption.
The question is, how easy is it to map those big supercomputer apps to such architectures, where each node has a small memory space and communication between nodes dominates?
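For what it's worth, those apps are usually partitioned by plain domain decomposition with halo exchange. A serial sketch in C of the idea (generic, nothing specific to this chip): each node holds a small slab of the grid plus one 'halo' cell per side, and every time step it has to swap those edge cells with its neighbours before it can compute.

/* Sketch: how a grid problem gets partitioned across small-memory nodes.
   Serial emulation of domain decomposition with halo exchange; on real
   hardware the memcpy()s / copies become neighbor-to-neighbor messages. */
#include <stdio.h>
#include <string.h>

#define NODES   4          /* hypothetical processing nodes      */
#define LOCAL   8          /* interior cells each node can hold  */

static double slab[NODES][LOCAL + 2];   /* +2 = one halo cell per side */

static void exchange_halos(void) {
    for (int n = 0; n < NODES; n++) {
        int left  = (n + NODES - 1) % NODES;
        int right = (n + 1) % NODES;
        slab[n][0]         = slab[left][LOCAL];   /* copy from left neighbor  */
        slab[n][LOCAL + 1] = slab[right][1];      /* copy from right neighbor */
    }
}

int main(void) {
    for (int n = 0; n < NODES; n++)
        for (int i = 1; i <= LOCAL; i++)
            slab[n][i] = n * LOCAL + i;           /* initial condition */

    for (int step = 0; step < 10; step++) {
        exchange_halos();                          /* comms: 2 cells per node per step */
        for (int n = 0; n < NODES; n++) {
            double next[LOCAL + 2];
            for (int i = 1; i <= LOCAL; i++)       /* compute: simple 3-point average */
                next[i] = (slab[n][i - 1] + slab[n][i] + slab[n][i + 1]) / 3.0;
            memcpy(&slab[n][1], &next[1], LOCAL * sizeof(double));
        }
    }
    printf("node 0 after 10 steps: %f ... %f\n", slab[0][1], slab[0][LOCAL]);
    return 0;
}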
NUMA systems already employ the local node memory concept. The scale would be different in terms of RAM per core, but we are already seeing GPU-type solutions where there are essentially a lot of cores on a small memory, relative to the problem and the number of cores.
Some problems map well. Others have lots of dependencies, or are all associative, like some forms of simulation. Comms will be needed for those, and the question in my mind really is how something like NUMA interfaces with devices like this.
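As a small aside, the local-memory idea is already visible on an ordinary Linux NUMA box. A minimal sketch using libnuma (link with -lnuma), assuming a machine where it is available: a buffer is placed on a chosen node so that node's cores get local-speed access while remote cores pay extra latency.

/* Minimal libnuma sketch of 'local node memory' on a conventional NUMA machine. */
#include <stdio.h>
#include <numa.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }
    int nodes = numa_max_node() + 1;
    printf("NUMA nodes visible: %d\n", nodes);

    size_t bytes = 1 << 20;                          /* 1 MB per node, just to show placement */
    for (int n = 0; n < nodes; n++) {
        double *buf = numa_alloc_onnode(bytes, n);   /* memory bound to node n's local RAM */
        if (!buf) { perror("numa_alloc_onnode"); return 1; }
        for (size_t i = 0; i < bytes / sizeof(double); i++)
            buf[i] = (double)i;                      /* touching the pages faults them in on node n */
        numa_free(buf, bytes);
    }
    return 0;
}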
I suppose with an ample amount of local memory, the tessellation approach can work quite well. I am not at all clear about the power savings.
The CEO is having some casual discussion here:
https://news.ycombinator.com/item?id=8658631
I just found that the Propeller makes what seems to be a superior parallel-processing Forth machine compared with what the founder of Forth created with the GA-144.
Yes, we have parallel processing doing a lot of slick things that I am not very aware of. Apparently, NVidia video cards are used by Russian hackers to actually break encryption. But that is way beyond what I can do.
I still wonder about the 8x18 grid of the GA-144 and how best to manage communications. I loaded the ColorForth demo and tried to learn, but it just made me realize how much easier Tachyon, PropForth, and pfth are on a Propeller. I can actually get all 8 cogs to be useful, and I can retask cogs as required.
Forth in a microcontroller... read again what I posted. I am not a supercomputer guy, never will be.
The hub-and-spoke architecture allows a shared dictionary as well as shared RAM; one can add an SD card to extend the dictionary, and different cogs can use I/O independently or shared in just about any way one desires.
I just wish the Propeller 2 had arrived by now so that I could explore that with Forth.
The trick to make it functional will be in how the OS allocates CPU cores to the waiting tasks.
A simple 'first free core to the waiting process' won't cut it at all.
A program might instead have a 'map' of desired core usage, and a list of which task goes in which core. Then the task scheduler can compare the map to available areas of the core map.
It'll also need to have a mechanism for moving a task from one core to another while running (so that a running program can have its tasks moved into a more optimal placement when cores open up).
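Something along these lines, perhaps - a toy placement routine in C (my own sketch, not any real OS scheduler) where a task requests a rectangular block of cores and the scheduler scans the free-core grid for a spot where the whole request fits:

/* Sketch of the 'core map' idea: a program asks for a block of cores and the
   scheduler looks for a placement where the whole request fits, instead of
   handing out the first free core. */
#include <stdio.h>
#include <stdbool.h>

#define ROWS 8
#define COLS 8

static bool busy[ROWS][COLS];     /* current allocation state of the core grid */

/* Does a req_r x req_c block of free cores exist with its corner at (r,c)? */
static bool fits(int r, int c, int req_r, int req_c) {
    if (r + req_r > ROWS || c + req_c > COLS) return false;
    for (int i = 0; i < req_r; i++)
        for (int j = 0; j < req_c; j++)
            if (busy[r + i][c + j]) return false;
    return true;
}

/* Find and claim a placement; returns true and reports the corner on success. */
static bool place(int req_r, int req_c, int *out_r, int *out_c) {
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            if (fits(r, c, req_r, req_c)) {
                for (int i = 0; i < req_r; i++)
                    for (int j = 0; j < req_c; j++)
                        busy[r + i][c + j] = true;
                *out_r = r; *out_c = c;
                return true;
            }
    return false;
}

int main(void) {
    int r, c;
    if (place(2, 3, &r, &c)) printf("task A: 2x3 block at (%d,%d)\n", r, c);
    if (place(4, 4, &r, &c)) printf("task B: 4x4 block at (%d,%d)\n", r, c);
    return 0;
}

Migration would then amount to claiming a better-placed block, moving the task's state across, and releasing the old one.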
This is all out of my league as well. But Andreas Olofsson of Adapteva, the Parallella guy, put it very well when he described his time working on a DSP chip. He was dismayed that only 1% of the final chip area, his FPU design, was actually doing any real calculating work. That means that 99% of the chip and its energy consumption is being used for caches and interconnects and God knows what else.
So it goes with current supercomputers with thousands of Intel processors. Each has multiple levels of caches, pipelines, instruction-reordering hardware, etc., etc. Most of the silicon does not do a + b.
Who knows where this all leads?
Interesting - you have to love naive youth !!
First this
["Sohmers was inspired by Adapteva's Epiphany chip, on which he based his first prototypes. But the chip lacked the memory bandwidth and double precision support he wanted."]
then
["The 3W Neo chip packs into 80 mm2 256 cores (one of which is shown above), each consisting of a 64-bit ALU, IEEE floating point unit, and 128 KBytes of SRAM scratch pad memory. Each core has a 16 Gbyte/s link to its neighbors with about 384 Gbytes/s of aggregate bandwidth between chips."]
So the existing leading edge (because it started design some years ago) is not good enough, but the process that gives those numbers also has phone-number$ for mask and NRE charges.
One drawback is they will have a decent 32 MBytes of RAM in this device, but no easy way to allow (say) access to half of that by one core - so the price tag is high, and the available uses are few...
Without shared RAM that is equally accessible to all the CPU cores, the 'mapping' and the 'passing of tasks' described above add quite a bit of work to the programming.
And you only have I/O on the edges of the grid, with a given I/O often located at one and only one CPU.
++++++++++
One scheme that I came up with is a 'checkerboard arrangement' where 50% of the CPUs migrate tasks and data to the other 50% that do the actual computation. It seems to be easier to envision and deploy, but obviously that means that 50% of your CPUs are NEVER working on the actual computational goal ... they just buffer and pass left, right, up, or down.
Even then, the 'checkerboard' doesn't really free up the entire 144 CPUs in the 8 x 18 grid.
There simply seems to be a great divide between the tessellation scheme and the hub scheme.
Another option in tessellation is to create pathways to distribute data and tasks, but that seems to leave less than 50% of the CPUs actually available for tasks.
Thus, the Propeller's hub scheme seems to keep more CPUs actually able to be dedicated to tasks rather than data transfers.
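Just to put numbers on that checkerboard idea (assuming the GA-144's 8 x 18 layout), a quick tally in C of compute cores versus route-only cores:

/* Count compute vs route-only cores under the checkerboard split described above:
   cores whose (row + col) is even do arithmetic, the odd ones only forward data. */
#include <stdio.h>

int main(void) {
    const int rows = 8, cols = 18;        /* assumed GA-144 style layout */
    int compute = 0, routers = 0;
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++) {
            if ((r + c) % 2 == 0) compute++;   /* does the arithmetic        */
            else                  routers++;   /* only buffers and forwards  */
        }
    printf("%d cores total: %d compute, %d route-only (%.0f%% doing the real work)\n",
           rows * cols, compute, routers, 100.0 * compute / (rows * cols));
    return 0;
}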
Wow, that is 32MB of SRAM just for the scratch pad. They seriously need to use MRAM.
I think it's important to highlight that the massively parallel simulations are, like GPUs, massaging a lot of data with tiny repetitive algorithms. The code is not so diverse, and the raw data is not shuffling back and forth between cores; rather, just the control parameters get shared around.
Loading and unloading the bulk data is still an issue, but much of the time is computational. A grid of cores suits that. The Prop, with its global HubRAM, is better suited to control systems.
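A tiny sketch of that pattern (generic C, not any particular simulator): the bulk data stays put in each core's local chunk, and only a small parameter block gets passed around each pass.

/* Bulk data stays local; only a tiny set of control parameters is broadcast. */
#include <stdio.h>

#define CORES 4
#define CHUNK 1024

struct params { double gain; double offset; };    /* the only thing that moves */

static double local_data[CORES][CHUNK];           /* bulk data, never shuffled */

static void kernel(double *chunk, int n, const struct params *p) {
    for (int i = 0; i < n; i++)
        chunk[i] = chunk[i] * p->gain + p->offset; /* small repetitive algorithm */
}

int main(void) {
    for (int c = 0; c < CORES; c++)
        for (int i = 0; i < CHUNK; i++)
            local_data[c][i] = i;                  /* load the bulk data once */

    struct params p = { 0.5, 1.0 };                /* broadcast a few bytes, not megabytes */
    for (int c = 0; c < CORES; c++)                /* on real hardware these run in parallel */
        kernel(local_data[c], CHUNK, &p);

    printf("core 0, element 10: %f\n", local_data[0][10]);
    return 0;
}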