I keep reading that on the net, but I refuse to believe it's etched in stone that when "they" added more cores, the efficiency went down. I can only conclude that modern developments in multi-core design and programming will change this.
Well, as far as the early x86 and PowerPC systems were concerned, they had crappy northbridge chips, making it a tad difficult to add more than two CPUs, as they were limited to whatever bandwidth the northbridge alone could handle.
Now times have changed - processors have much better northbridges, along with smaller lithography nodes. AMD's on-die northbridge fabric can handle several terabytes per second easily, compared to the old Intel 430TX or even i820 chipsets. It's now easier to multiply the cores, like the streaming DSP cores on a GPU die.
Lastly, the transputers proved that it's entirely possible to put more than a couple of CPU cores in a PC case, even back in 1989 (or today, if you still have pulled 486 CPUs lying around) - and that also motivated IBM, AMD and eventually Intel to create better and faster northbridge chipsets so multi-core processing could be done easily.
And, Humanoido - you don't have to believe it. There are no hard and fast rules in the computing world; the only rules you follow are the ones you write in your own machine code, or build into the hardware you design.
That is amazing modern technology, with AMD processing power in teraFLOPS and terabytes of throughput. I was also thinking along the lines of many Parallax Propellers running many tasks simultaneously in parallel. Of course, at this stage I only count the work performed by the many jobs running in parallel, not the time it takes to retrieve the results.
Assign one AMD to each cog, set up 100 Props with 800 cogs, and code 920 x 800 streaming processors for 736,000 cores. Of course, AMDs will need to come down in price before we can afford it. Nevertheless, the price of these boards in China is dropping like a rock. I have my eye on the 5,000+ GPU board. I wonder when the number of GPUs per board will plateau out? Something like this may be feasible for someone willing to make it work.
Almost. The Phenom II Deneb in my computer here has 2.6 TB/s of peak bandwidth, yet it can only spit out 78 GFLOPS from all four cores - a big difference. How so? About 50 to 70% of that bandwidth is raw code: x86 machine code yet to be decoded, plus the internal micro-ops it gets translated into. That stream has to keep the pipeline flooded, or else the CPU stalls. And that's the peak figure - what's the sustained bandwidth? Figure the four cores together at maybe 400 GB/s max and 50 GB/s minimum sustained for most intended purposes. Core i7 (Nehalem) has somewhat higher internal northbridge bandwidth, but the figures are similar; both it and Deneb are full-blown out-of-order processors.
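To make that concrete, here is a quick back-of-envelope in C using the figures quoted above (78 GFLOPS and roughly 50 GB/s sustained are the numbers from the post, not measurements). It shows how much arithmetic you must do per byte fetched before the cores outrun the memory system.

#include <stdio.h>

/* Back-of-envelope check using the figures quoted above (assumptions,
   not measurements): 78 GFLOPS peak compute, ~50 GB/s sustained memory
   bandwidth.  The ratio tells you how many floating-point operations you
   must perform per byte fetched before the cores outrun the memory. */
int main(void)
{
    const double peak_gflops    = 78.0;   /* quoted four-core peak  */
    const double sustained_gbps = 50.0;   /* quoted sustained GB/s  */
    double flops_per_byte   = peak_gflops / sustained_gbps;
    double flops_per_double = flops_per_byte * 8.0;  /* 8 bytes per double */

    printf("FLOPs needed per byte fetched:   %.1f\n", flops_per_byte);
    printf("FLOPs needed per double fetched: %.1f\n", flops_per_double);
    return 0;
}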
We need to keep the work and calculations inside the machine. The pipeline out is too small.
In the normal run of things, and grossly oversimplified, as a processor executes it fetches some code and/or data, making an access to RAM. Then it does some processing on that data internally, with no RAM access required. Then it writes back some results, accessing RAM again.
Clearly there is scope to have two processors operating with that RAM: when one is processing, the other is accessing, and vice versa. Bingo - you have perhaps doubled your processing power.
That's good, let's add some more.
Oops, that does not work - performance is going down, not up!
As you add more processors, contention for RAM starts to creep in and you end up with processors waiting for their chance to access the RAM.
In the extreme, with an infinite number of CPUs, you have zero performance: once a processor has made an access to RAM it has to wait forever to get its turn again.
So that graph in the Transputer video simply shows this effect.
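A toy model in C makes the saturation concrete (a sketch only - the assumption that each processor needs the shared bus 25% of the time is arbitrary, and real arbitration overhead makes things worse, which is why the graph actually turns downward).

#include <stdio.h>

/* Toy model of the shared-RAM bottleneck described above (a sketch, not a
   simulation of any real chip).  Each processor spends a fraction M of its
   time needing the one shared memory bus; those accesses serialise, so the
   bus caps aggregate throughput at 1/M processors' worth of work no matter
   how many CPUs you bolt on. */
int main(void)
{
    const double mem_fraction = 0.25;   /* assumed: 25% of time on the bus */
    for (int n = 1; n <= 16; n++) {
        double ideal   = (double)n;
        double cap     = 1.0 / mem_fraction;    /* bus saturation point */
        double speedup = ideal < cap ? ideal : cap;
        printf("%2d processors -> speedup %.2f\n", n, speedup);
    }
    return 0;
}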
What's the solution?
Clearly you need to increase memory bandwidth in line with the number of processors. The Transputer did this by having separate RAM for each processor - hence the need for high-speed comms links between processors, as they could not communicate through shared RAM. Of course, the more chips you have, the more comms link bandwidth you have as well, so all is good.
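As a rough analogue of that arrangement, here is a minimal sketch in C: fork() gives each "processor" its own private memory, and a POSIX pipe stands in for a Transputer link. Illustrative only - this is not how a real Transputer network was programmed.

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Each process has its own private memory after fork(), so the only way
   to share a result is an explicit message over a "link" - here a pipe. */
int main(void)
{
    int chan[2];
    if (pipe(chan) != 0) { perror("pipe"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {                        /* "worker" processor          */
        close(chan[0]);
        int result = 6 * 7;                /* do some work in private RAM */
        write(chan[1], &result, sizeof result);
        close(chan[1]);
        _exit(0);
    }

    close(chan[1]);                        /* "host" processor            */
    int result = 0;
    read(chan[0], &result, sizeof result);
    close(chan[0]);
    wait(NULL);
    printf("received over the link: %d\n", result);
    return 0;
}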
In the modern world I'm not even sure how Intel et al. stuff 4 or 8 or more cores into a single chip operating on a single RAM space. But I imagine a lot of the memory contention problems are resolved by having giant on-chip caches for each core, from which the work is done. Fast burst access to the outside RAM then gets you the bandwidth required so core performance isn't hindered.
Still, even that must have its limits, and we come back to Transputer-like ideas again. Is anyone here building real modern supercomputers? I'd like a clue how those thousands of Opterons or whatever are hooked up.
The 5k GPU plateau effect... boy, that's where it gets fun.
Look at a constant-crunching benchmark like y-cruncher. Notice that it needs RAM? 28 million iterations will need about 512MB.
Now consider the bandwidth ceiling of whatever FPGA you choose (expensive or cheap), through its soft-core PCIe interface.
As you scale the hardware count up: one video card per PCIe 2.0 x16 slot gives you roughly 8 GB/s in each direction. Double that and you get 16 GB/s - keep building up to 5,000 cards and you're having to handle... let's say 5,000 x 8 GB/s, on the order of 40 terabytes per second in aggregate.
Alright, you've got Herculean bandwidth. How about the point-to-point delays? That's where it gets really dirty. An inch of PCB trace costs you roughly 170 picoseconds of propagation delay - enough, at DDR2-800 speeds, to warrant strict trace-length matching between the memory chips and the processors. Now say you have 6 feet of cabling carrying the PCIe lanes: that adds on the order of 10 nanoseconds more. Fiber optics won't beat the propagation delay, but it is beneficial in that it doesn't suffer crosstalk or inductively-generated EMI. Fiber isn't cheap, though.
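For anyone who wants to redo the arithmetic, a small sketch with assumed figures (about 8 GB/s per direction for a PCIe 2.0 x16 slot, and roughly 170 ps of propagation delay per inch of FR-4 trace or cable):

#include <stdio.h>

/* Quick sanity numbers for the scaling argument above.  The per-card
   bandwidth and the propagation-delay figure are assumptions, not
   measurements of any particular board. */
int main(void)
{
    const double gb_per_card = 8.0;     /* PCIe 2.0 x16, one direction */
    const int    cards       = 5000;
    const double ps_per_inch = 170.0;   /* ~propagation delay in FR-4  */

    printf("aggregate host bandwidth: %.0f GB/s (%.1f TB/s)\n",
           cards * gb_per_card, cards * gb_per_card / 1000.0);
    printf("1 inch of trace : %6.0f ps\n", 1.0  * ps_per_inch);
    printf("6 feet of cable : %6.1f ns\n", 72.0 * ps_per_inch / 1000.0);
    return 0;
}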
Lastly, there are also hardware/software delays in the FPGA itself - chasing those down is pretty much a wild-goose chase, as they vary wildly with firmware revision, die revision and FPGA brand.
Coming back to the benchmark's memory requirements: each GPU also needs to keep things in its own RAM - stacks for I/O planes and the data it's been asked to work on - and that eats into the bandwidth as the number of cards gets larger.
And, Heater, you got that correct.
Lastly, Humanoido, you needn't worry too much about the pipeline out: HyperTransport can move tens of gigabytes per second (around 50 GB/s at the top end), good enough for most supercomputer work. Cray does exactly this - lots of Opterons tied together over HyperTransport boards, into hub/switch boards, and finally out to a few PCs for control. Here's a reference to a recent design (they've been doing the same thing for a few years now): http://www.linuxfordevices.com/c/a/News/Cray-XK6/
Still, it's pretty good hardware, BUT I've got a bone to pick with it: four cores, and it costs $20 a pop - and look, an $8 microcontroller, the Propeller, is already octo-core.
There are many bones to pick with that device, but lack of cores and cost are not among them.
You should compare the Prop to the single or dual core devices for equivalent cost.
A single core gives you 8 hardware-scheduled threads with deterministic timing, which is logically equivalent to 8 separate cores. All of those threads are faster than COGs simply due to the clock speed, but they're also much faster for larger programs, as you don't need to resort to a byte-code interpreter, LMM or overlays to get the code in, all of which slow things down dramatically.
Don't forget that if you do go to the 2 or 4 core devices you also multiply the amount of RAM, and as each core has its own RAM you multiply the total RAM bandwidth and scale total performance linearly.
On the down side I do have other bones to pick with those devices:
Power consumption is higher.
Dual power supplies are a pain.
No DIP versions.
The pins come in groups of various sizes; within a group, all pins are either inputs or outputs, which is not as flexible as the Prop's individual pin settings.
For the multi-core devices, each core has its own pins - you cannot use just any pin from any core as you can on the Prop. Again, not as flexible.
The dev tools and languages are a lot more to deal with than the Spin tool.
Horses for courses, as usual.
It was always so sad to drive past the defunct Inmos factory in South Wales. It was hailed as the new high-tech path for the area - no longer were the heavy industries of coal and steel the only way. It seemed that as soon as the EU and regional grants were out of their penalty phases, the place got mothballed.
Yep. And ST then compounded Inmos's misery (the T9000 cost ST too much to keep in production... Sheesh, just keep selling, and you - ST - would have been fine. Look at Intel... they have been doing "limb-threatening" deals for many years and they are still in business...).
If the transputer had kept being developed up to today, it would probably be doing several teraFLOPS / TIPS of computing, only cheaper than video cards - which would be nice to have in a home computer to assist with Blu-ray Disc authoring and the other things that need more raw computing power.
Heater, again, good point. There is also plenty to pick at with other microcontrollers. They have to be compromised to keep the price down. Very few microcontrollers (the Renesas RX600, for example) contain a superscalar-class CPU core - not something you see every day in the microcontroller world - and the Propeller II's COGs could potentially be similar to the RX600, differing only in how they handle their workloads.
EDIT: I can't be too sure about the Propeller II COG's execution unit (the CPU engine), although from the rough die sketch it looks like it could be a superscalar design - I will need to look at a picture of a real, shiny Propeller II die to confirm, and of course the datasheet would be nice too.
Personally, it would make me a bit happy if Inmos were resurrected (probably won't happen, though) - I would definitely welcome the transputer's presence. I don't care if it's a superscalar, out-of-order-execution monster with 1,000+ solder balls; it would have a quality that the unmentionable chip doesn't have.
Sure, the original first- and second-generation transputers and that forbidden processor were both designed by David May, but they don't feel the same, even at a glance at the die shots (I have become a sort of expert at telling what a chip is like from a photo of its die).
And the forbidden chip kept the Vogon Banana Code ISA. Sure, that would have appealed to most transputer hackers, BUT it isn't enough, for some reason. It's not the tools; it probably has something to do with the CPU internals. That chip has definitely blurred the line between regular CPUs and transputers. Personally, I would like to play with an official transputer CPU chip.
And if a current-day transputer were to be made as an out-of-order processor, would it support speculative execution? Personally, I would say NO. The Inmos engineers preferred their CPUs to give complete, dead-on truthful answers to whatever they pondered. Why not? Speculation isn't always worth implementing on a silicon die meant for serious, mission-critical business, as the original T4 / T8 CPUs were. Yet out-of-order execution would improve how the chip arbitrates between competing threads - no need to kill a thread that is taking too long, just let it slip down the queue, move the higher-priority code up, cut through both threads quickly and get them over with (thread suspension would still occur, of course, as part of the power-saving features, along with power gating and frequency throttling on a newer 45nm multi-core transputer, if one were ever made).
Alright, time to put the transputer talk aside...
I want to ask you guys a question: has anyone worked with the FTDI Vinculum II USB host controller before? Do you think I should use the SPI bus to link it to the Propeller chip if I want to use a hard drive (formatted with FAT32) for data storage? I am going to try and wrap up the Dendou Oni prototype board. I have been too busy at the moment... (I apologize for keeping some of you waiting...)
I've been thinking about what board form factor I should stick with... Would it be good to lay it out to fit underneath a 3.5 inch IDE hard drive (or a SATA one, if you can find a good SATA-to-USB bridge chip - available at both Mouser and Digi-Key, and at other parts stores; completed SATA-to-USB adapters can be found at computer shops too)? Or not? Just wondering.
Why a hard drive? I need something resistant to data wear (flash is only good for several hundred thousand write cycles before it finally dies, and that includes SD cards - not that everyone cares, they're cheap...). Other than that there's MRAM, but it's a bit pricey and the largest Everspin SPI version is only 128KB - that's something I will use with the Propeller II chips for boot firmware. I have a Western Digital 40GB IDE Caviar hard drive I am thinking about using with the P8X32A transputer-like board.
Why will I use MRAM for firmware ROM storage on the Propeller II? At $9.05 for 128KB of MRAM (at Digi-Key), I can rewrite the firmware indefinitely, which is VERY important for firmware debugging - and it makes me wish the P8X32A could use SPI flash as a boot device, like it can I2C parts. FRAM is okay too, as long as I don't bombard it with X-rays or gamma rays (which I won't, unless I want a corrupted firmware image). The biggest MRAM, about 2 megabytes, costs $37.37 and is parallel; SRAM costs more than that MRAM, though. (I wouldn't be surprised if that 2MB MRAM eventually costs $10 or so in a few years, because MRAM is also very useful as the basis of the mass storage circuitry in SSDs, once the trouble with smaller lithography nodes is solved.)
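Putting those two prices side by side, just as illustrative arithmetic (the part prices are the ones quoted above):

#include <stdio.h>

/* Cost-per-kilobyte comparison of the two MRAM parts quoted above
   ($9.05 for 128KB SPI, $37.37 for 2MB parallel).  Illustrative only. */
int main(void)
{
    double spi_cost = 9.05,  spi_kb = 128.0;
    double par_cost = 37.37, par_kb = 2048.0;
    printf("128KB SPI MRAM : %.1f cents/KB\n", 100.0 * spi_cost / spi_kb);
    printf("2MB parallel   : %.1f cents/KB\n", 100.0 * par_cost / par_kb);
    return 0;
}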
Dr. Mario,
Thanks for this memory review. I'm also looking at hard drive systems for interacting with the Propeller chip. It's easy to replace the low-cost boot EEPROM on the Propeller, so it's not a concern when it's used only for boot and low-overhead memory storage. I've used these like pluggable memory devices, with a specific development program in each. I was thinking of adding a switch and keeping more on board, but plug-in/unplug used fewer parts.
http://www.computerworlduk.com/news/it-business/3290494/jp-morgan-supercomputer-offers-risk-analysis-in-near-real-time/
That article lacks some interesting details. It looks like someone chopped bits of information together and mixed up the paragraphs... "We support Excel and Linux"... ehhh... someone needs to rewrite the whole thing!
Yes, that article is horribly woolly. If you want the serious low-down, have a look at the PDF you can find here: http://www.stanford.edu/class/ee380/Abstracts/110511.html
Seriously impressive stuff.
On the other hand, yes, the Brain is looking at ways to compress the space, time and energy required to perform tasks.
No, the Brain does not do pipelining; rather it does the new, faster ParaP as previously discussed, which includes the Brain-invented AtOnce (AO) Technology. See page 48, post 942 for the intro: http://forums.parallax.com/showthread.php?124495-Fill-the-Big-Brain/page48
Yes, the Brain, in moving up to over 800 combo nodes, compares favorably, as it has basically become a supercomputer. I fully expect the Brain's TeraFLOPS power will migrate towards the next level, PetaFLOPS.
And yes, very similar to the hybrid computer design, though not with Xeon but with AMD, and not with two TB drives but three.
And finally no, not with 8 cores but with over 1,500 cores (over 800 {Propeller} + 4 {INTEL QUAD} + 720 {AMD GPU}).
What is "AtOnce"?
I see the announcement about it there but absolutely no description.
Presumably those teraflops are going to be provided by your AMDs, not your Propellers, because a single COG runs at, say, 100,000 FLOPS (if we are generous and assume it is doing nothing but float ops). So you would need 1E12 / 100,000 = 10 million COGs to reach that, or about 1.25 million Propellers. This machine could be rather expensive :)
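Spelled out as code, with that generous 100,000 FLOPS-per-COG figure (an assumption, not a benchmark):

#include <stdio.h>

/* The estimate above, spelled out: how many COGs and Propeller chips
   would be needed to reach 1 teraFLOP if each COG managed ~100,000 FLOPS. */
int main(void)
{
    double target_flops  = 1e12;   /* 1 TFLOP                         */
    double flops_per_cog = 1e5;    /* assumed, software float ops     */
    double cogs_per_chip = 8.0;
    double cogs  = target_flops / flops_per_cog;
    double chips = cogs / cogs_per_chip;
    printf("COGs needed : %.0f\n", cogs);    /* 10,000,000 */
    printf("Props needed: %.0f\n", chips);   /* 1,250,000  */
    return 0;
}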
If the AMDs are doing all the work, what are the Props in there for?
Hmm... didn't we go through this calculation once before?
http://forums.parallax.com/showthread.php?124495-Fill-the-Big-Brain&p=1017318&viewfull=1#post1017318
I don't want to hijack Dr. Mario's thread with all the Big Brain stuff.
JP Morgan's risk assessment is seriously hardcore stuff, requiring something better than Propeller cores, but I've got to say, if used properly, then yes, it's possible. Then you run into trouble with the 110 or 1,200 volt AC feeds, grounding, and cooling - in other words, you would have to build a tera-scale P8X32A supercomputer the size of a professional office.
And, Humanoido, the firmware is required. "It's easy to replace the low cost boot EEPROM on the Propeller so it's not a concern when used only for boots and low overhead memory storage" - it won't be that easy unless you load the firmware into the P8X32A chips, which is handled by a file such as "PropellerLoader.spin". With no firmware, the chip can't establish a solid connection over the SPI bus to the Vinculum II, and it has to understand how FAT32 is handled - you can use a modified version of an SD card driver like "SD-MMC_FATEngine.spin" and omit the SD card pin setup. I have to find a way to convert it to assembly so it executes much faster - a hard drive can talk to a P8X32A at about 1.2 to 20 MB/s with UDMA at the low level (documented in the ATA specification whitepaper - just skip the SATA part unless you're mucking with one; I am using a PATA drive).
You can use ferroelectric RAM for the boot firmware, as I do - it's good for around 100 trillion rewrites, while MRAM simply won't care (as in INDEFINITE rewrites) - if you're worried about limited rewrite endurance. *bows apologetically* Sorry I had to correct this - I just wanted to be clear on the firmware point, but that's also what the hard drives are for: getting rid of the SD card and its pesky limitations. Lastly, the physical storage of the firmware image doesn't matter much - FRAM and EEPROM use the same I2C call address, so FRAM is feasible as a boot device, while if you want to use the SPI bus on the Propeller II and you have an itchy trigger finger on the F7 key, then magnetic RAM is for you.
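As an aside, the core arithmetic any FAT32 driver (SD-MMC_FATEngine.spin included) has to do is simple enough. Here is a sketch in C, with made-up BPB values rather than ones read from a real volume:

#include <stdio.h>
#include <stdint.h>

/* FAT32 bookkeeping: map a cluster number to its first sector, counted
   from the start of the partition.  The BPB values below are examples. */
static uint32_t cluster_to_lba(uint32_t cluster,
                               uint32_t reserved_sectors,
                               uint32_t num_fats,
                               uint32_t sectors_per_fat,
                               uint32_t sectors_per_cluster)
{
    uint32_t first_data_sector = reserved_sectors + num_fats * sectors_per_fat;
    return first_data_sector + (cluster - 2) * sectors_per_cluster;
}

int main(void)
{
    uint32_t lba = cluster_to_lba(/*cluster*/ 1234,
                                  /*reserved sectors*/ 32,
                                  /*FAT copies*/ 2,
                                  /*sectors per FAT*/ 3800,
                                  /*sectors per cluster*/ 8);
    printf("cluster 1234 starts at sector %u of the partition\n",
           (unsigned)lba);
    return 0;
}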
Also, if within the two-year deadline (hopefully I can finish in that time; if not, I will extend it) Dendou Oni is finished and running as anticipated, I will try JP Morgan's risk assessment workload, as it also makes an excellent burn-in stress test (to ensure all the Propeller II BGA chips are properly seated and electrically functional under thermal load, since all the superscalar COGs get stressed heavily by whatever nasty stuff you toss at them - a P8X32A may also get warm the same way, although to a lesser extent, as it has about the same transistor count as a Pentium Pro minus the L2 cache RAM).
I am going to read the complete JPMC PDF from Heater's link to see if it's feasible on a Propeller II supercomputer with a limited number of CPU cores; if so, I will try it out as a stress test - nothing as serious as their voltage-sucking behemoth supercomputer arrays, where they can afford HUGE electric bills, usually for figuring out whether the stock market will go up or south by analyzing past and current market trends. Even money can walk out of the house...
I have read the slides. Very interesting - it does "pick or eliminate" probabilities on the FPGAs; basically it's like picking between your favorite snacks or video games, to name a few. They have to collect very old data and have a basic understanding of what past trends were like; armed with that information, it can decide whether things on Wall Street will turn sour or not. And I noticed it uses very complicated math, like Pi-constant coding. To me, it's pretty much a Turing machine on steroids, and quite intelligent in many ways.
The Westmere-EX Xeon CPUs just deal with storage and data hosting - not the real stars in this kind of benchmark - but they also provide the crucial interface between the FPGAs and the outside world, for the brokers and stock traders. Of course, most banks also use JP Morgan's machines, as they're quite wary of stock trends.
And a majority-voting technique could also be used by JP Morgan; that can be accomplished with brain-dead-simple microcontrollers like this one: http://www.microchip.com/wwwproducts/Devices.aspx?dDocName=en019863 - arranged in clusters, all they have to do is read a small piece of code and then vote "yea" or "nay". A Propeller II, or that magic chip, could do it much faster, though.
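A minimal sketch of that majority-vote idea, with made-up sample votes (a real cluster would gather these over its comms links):

#include <stdio.h>

/* Toy illustration of majority voting: each "node" reports yea (1) or
   nay (0) and the cluster's answer is whatever more than half said. */
int main(void)
{
    int votes[] = { 1, 0, 1, 1, 0, 1, 1 };   /* one entry per node */
    int n = sizeof votes / sizeof votes[0];
    int yea = 0;
    for (int i = 0; i < n; i++)
        yea += votes[i];
    printf("result: %s (%d of %d voted yea)\n",
           2 * yea > n ? "yea" : "nay", yea, n);
    return 0;
}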
Can't wait to see it actually do something besides generate pictures and buzz-words.
Jazzed: Please don't wait any longer. The project is happy to announce, as of today, it has reached the goal of a completed picture and buzz-word generating machine.
Dr. Mario: Thanks for your excellent rundown of memory - indeed, sometimes we need to go with what's on hand or what is more cost-effective. With the cost of terabyte drives dropping like a rock, it was time to go for it. I'm not interested in replacing the Prop boot code but rather supplementing it. The real gains may be in program and data storage when large volumes are required. This could include a large database of learned knowledge for reference, the storage subspace for a trillion neurons, or the day-to-day memory structures in a machine life form. These large volumes could include the sum total of the arrays where many redistribution modules are involved.
As you have noted, power must also be kept down. The project has chosen supplementary and peripheral parts that draw minimal current whenever possible, and even disabled twenty PPPB LEDs because they were overly hungry. A new study is ongoing, because judicious programming can affect power draw in Propeller chips significantly - perhaps not much in one chip, but 100x in a hundred chips and 1000x in a thousand, where the significance adds up.
Humanoido, you can actually use the boot firmware as a BIOS (or EFI) to load software from the hard drive and boot the entire system. uClinux would definitely like it there too. An external USB hard drive (a USB-compatible one like the WD My Passport) can also be used, although for a HUGE drive you would want exFAT (what SDXC cards use) for 64-bit addressing on larger volumes, to keep things simple. USB hard drives are also treated as Bulk-Only Transport mass storage, which is good news for Vinculum II users. And yes, we will see 4 TB hard drives soon.
If you care about reliability (neuron computing is quite torturous for a modern hard drive: it has to pull the heads off the parking ramp onto the rapidly spinning platters and sweep them back and forth to read or write, and I know the AI software will be hitting it every few hundred milliseconds to every few minutes), I would recommend Western Digital internal hard drives, as they have better controllers (some Caviar Blacks have a dual-core superscalar VLIW processor soldered onto the board) and very high-end servos (apparently manufactured to Japanese electronics and electromechanical standards, MIL-SPECs, and a couple of other standards that encourage the use of better parts - though of course, even in your sparkling new Macintosh, or in my PC, the hard drive is hardly treated like a queen).
You can pay a bit more and get a WD Caviar Black, or just a Caviar Blue - I have a 500GB Caviar Blue and it still works fine after two years. An SSD is also doable, as long as you don't overuse it; however, for AI I would personally prefer regular noisy hard drives, as they are still superior in terms of the life expectancy of the storage itself - only wear on the motor or the magnetic surfaces does a number on a hard drive, and a brand-new drive with head parking ramps can last up to 22 years of heavy use, as in server environments.
And, of course, don't drop it while it's spinning - you don't want head slap, which in rare cases can shatter the heads (they're made entirely of silicon), rendering ALL the platters unrecoverable, even by a data-recovery lab.
Are you about to get uClinux running on one or more Propellers?
If so I might suggest that you stop work on all other parts of this ever growing super-microcontroller project and concentrate on that.
I'm sure just doing that would consume all your free time for quite a while. Many of us out here would like to see it done but don't have the time or skill to tackle it. It might not be useful for the Prop, but it could well be for the Prop II and its successors.
Well, I'm only suggesting it's possible. I am playing with the Propeller P8X32A for now, and I don't have access to the Prop II virtual machine, nor the silicon die (I prefer real silicon, as emulators sometimes get it wrong - it depends on how badly they're written; too many variables...), but when I do have a Prop II soldered down and enough time to port uClinux, I will attempt it - I'll probably have to write an MMU driver so it can access SDRAM. The reason I said "uClinux will like it there" is simply that it gets pretty large when compiled, and Linux tends to write its cache to disk twice, which is not pretty - I killed a CompactFlash card that way by writing to it too many times... uClinux can also be compacted down to the point where it fits in 512KB of flash on a PowerPC microcontroller such as the MPC555 without too much difficulty.
The P8X32A? I'll probably try it after I get my P8X32A prototype done, as I'll need some spare memory to throw at the uClinux threads.
If successful, I may publish a tarball of the Spin or C++ files for the modified version of uClinux, either immediately or once Dendou Oni is completely assembled and running.
I may have to write uClinux onto the FAT32-formatted hard drive I'm giving the P8X32A as its host volume, so I have space to test it (40GB is generous, even for a tiny operating system), and I will have to put boot firmware on the ferroelectric RAM to tell it what to do - it's easier to load bigger code from a small bootstrap, like a PC's BIOS (the motherboard firmware) does.
Finally, I am able to use Catalina! The previous versions were just plain annoying - I kept getting an error message about an XML mismatch, or, many times, Code::Blocks would simply pretend that Catalina did not exist at all. Now it has gotten so much better that I can actually use it. Thanks, Ross, for making such plug-ins!
However, I am wondering whether it's possible to write an operating system other than Catalina's own generic firmware package when using this compiler - maybe Ross can answer this question - and anyone else who has done that, please tell me! ^_______^
EDIT: I just asked RossH, and the Catalyst OS package is just an option - so now I can try compiling my own, probably uClinux.