
An idea for p3(?) - 2-bank cog ram

Comments

  • KC_Rob Posts: 465
    edited 2013-10-14 15:02
    Sapieha wrote: »
    Hi Heater.

    I think you need to revise that.

    I think Silicon Labs learned from Parallax!!

    EFM32 Zero Gecko 32-bit ARM Cortex-M0+ Microcontroller
    Like Heater, I don't quite get the connection in this context. But I will say that the folks at Silicon Labs are good, both on the business and technical sides. They've built a nice little — actually, not-so-little now — business, in large part simply by doing cool things with 8051 cores. I'm sure we can expect many more ARM-based products from them. Although, saying that about almost any semiconductor company nowadays borders on cliché.
  • KC_Rob Posts: 465
    edited 2013-10-14 15:19
    ... Possibly (tangentially) what SiLabs calls its Peripheral Reflex System: another concept gaining popularity. I believe that Atmel already has something similar for its Xmega and Cortex-M0+ parts.
  • potatohead Posts: 10,254
    edited 2013-10-14 15:51
    After re-reading this interesting thread, I am coming around to the idea that perhaps the same thinking process that gave us the Propeller as an MCU needs to be applied to a CPU.

    And, do that kind of thinking with the understanding that a P2 is there to assist.

    The notable thing about the P1 and P2 designs is that no OS is needed to employ multi-processing. Those same ideas get in the way of larger programs, unless the COG is treated as micro-code, and LMM (or simply external RAM as HUB, running LMM style) is considered normal.

    One hangup is the speed difference between PASM and LMM. If that could be reduced, does that change things? At sub-GHz speeds, that speed difference is often notable. However, at GHz+ speeds, maybe it's not such a big deal, particularly given that the COGs can be dedicated to specific tasks/devices.

    If LMM were put into silicon directly, do we really need to fundamentally change things?

    A COG could start up and just be a COG running PASM. Maybe it provides some peripherals, or maybe it sets itself up to act as a CPU instead?

    The LMM kernel in silicon could be optimized for speed, and there would be hardware support for processing instructions and perhaps exception handling too, like say an illegal instruction jumping to some COG code that handles that one, or that allows for user instructions at peak speed.
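
    For reference, the software loop that would be cast into silicon is tiny. Here is a minimal sketch of Bill Henning's classic LMM fetch loop, assuming P1 mnemonics (the label names are mine):

        LMM_loop  rdlong  LMM_ins, LMM_pc   ' fetch the next instruction long from hub
                  add     LMM_pc, #4        ' advance the LMM program counter
        LMM_ins   nop                       ' the fetched instruction lands and executes here
                  jmp     #LMM_loop         ' back for the next one
        LMM_pc    long    0                 ' hub address of the LMM code

    Branching in hub space is done by reloading LMM_pc (or calling a kernel primitive) rather than by a native jmp, and hiding that fetch/advance overhead is most of what hardware LMM would buy.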
  • brucee Posts: 239
    edited 2013-10-14 17:02
    Silicon Labs jumped ship on the 8051 for the same reason a number of other suppliers did a number of years back. When you get to 150 nm processes, the size of the CPU starts becoming insignificant in terms of space on the die. So going to a 32-bit CPU doesn't cost much to add, especially when it is something like an ARM Cortex-M0+.

    The only reason to keep producing 8051s is for the installed code base, and you're probably not going to care much about high performance peripherals like Ethernet.

    When most people have ARM products, you get to differentiate yourself on peripherals. In many cases analog peripherals do not scale with process, sometimes they even grow in size.
  • jmg Posts: 15,148
    edited 2013-10-14 18:43
    brucee wrote: »
    Silicon Labs jumped ship on the 8051 for the same reason a number of other suppliers did a number of years back. When you get to 150 nm processes, the size of the CPU starts becoming insignificant in terms of space on the die. So going to a 32-bit CPU doesn't cost much to add, especially when it is something like an ARM Cortex-M0+.

    The only reason to keep producing 8051s is for the installed code base, and you're probably not going to care much about high performance peripherals like Ethernet.

    Yes and no. SiLabs have just released new, lower-cost 8051s and also dropped the price of their USB models.
    There certainly is life there, and I would not call that 'jumped ship'.

    The latest M0+ comes from Energy Micro, which SiLabs bought.
    brucee wrote: »
    When most people have ARM products, you get to differentiate yourself on peripherals. In many cases analog peripherals do not scale with process, sometimes they even grow in size.

    You are right that you can fit more into a shrink process, and that peripherals now matter more than the core.

    Which is why I cannot fathom why Energy Micro put 16-bit timers into that 32-bit chip?!?

    Why would anyone make a 32-bit CPU and then put 16-bit timers into it? At a 32 MHz clock, a 16-bit timer wraps every ~2 ms, while a 32-bit timer runs for over two minutes before overflowing.

    Contrast that with the low-cost XMC1000 family (ARM Cortex-M0) from Infineon, whose documentation says this:

    Several combinations for timer concatenation can be made inside a CCU4 module:
    • one 64-bit timer
    • one 48-bit timer plus one 16-bit timer
    • two 32-bit timers
    • one 32-bit timer plus two 16-bit timers
  • jmg Posts: 15,148
    edited 2013-10-14 18:59
    potatohead wrote: »
    If LMM were put into silicon directly, do we really need to fundamentally change things?

    A COG could start up and just be a COG running PASM. Maybe it provides some peripherals, or maybe it sets itself up to act as a CPU instead?

    The problem here is that the present ROM is patched RAM (done for speed), and ROM is expensive to change.

    What should follow on is variants of LMM, for either (or both) Java / .NET bytecodes.

    The worthwhile future-proofing I see would be to add HW support for Quad SPI, and even DDR Quad SPI.
    That opens the Prop to execute-in-place from cheap/small flash memory. That may yet make the cut into P2.
  • David Betz Posts: 14,511
    edited 2013-10-14 20:03
    potatohead wrote: »
    If LMM were put into silicon directly, do we really need to fundamentally change things?
    At one point I suggested to Chip that the PC could be increased to 32 bits, or even just 15 bits to address 128K of hub, and that any value above 2K could do the equivalent of rdlongc (thanks Bill!) to fetch the next instruction. This would allow us to run code directly from hub memory. Unfortunately, this isn't compatible with the way the instruction pipeline works and would require fairly extensive changes. This is something I wanted to play with if the P2 RTL were made available, but I guess that isn't going to happen.
  • Heater. Posts: 21,230
    edited 2013-10-14 21:34
    potatohead,

    It had occurred to me that direct hardware execution of LMM code would be neat. But is it really?

    In the extreme, a COG can execute one instruction every clock. Given exclusive access to the HUB, it could perhaps also execute one LMM instruction per clock. BUT it does not have exclusive access; it has to share with seven other COGs. So LMM execution rate is only one eighth of full speed at best.

    Is it actually of any benefit to cast the LMM loop into silicon? Or of sufficient benefit? Given the pipelining and such of the P2, what speed gain would hardware LMM actually give us?

    jmg,
    What should follow on is variants of LMM, for either (or both) Java / .NET bytecodes.
    At that point you will find me out in the car park throwing up. The entire raison d'être
  • jmg Posts: 15,148
    edited 2013-10-14 21:53
    Heater. wrote: »
    At that point you will find me out in the car park throwing up. The entire raison d'être

    Why? LMM is also a fetch-then-execute solution.
  • msrobots Posts: 3,704
    edited 2013-10-14 22:04
    no, no.

    It is just that Heater does not like Microsoft, so .NET is out of the question...

    Enjoy!

    Mike
  • potatohead Posts: 10,254
    edited 2013-10-14 22:51
    I don't either in this context. Lots of cruft there. However, nothing prevents it from getting done.

    Currently, I'm thinking in the context of post-P2. Just FWIW. @JMG, yeah! Maybe. Let's hope Chip got that sorted down to a general, clever circuit he likes. I think that would be a big gain. Very large, very cheap memory spaces. It doesn't need to be super fast.

    @Heater: Well, by putting it into silicon, something could get executed every HUB access window. Seems to me that by incorporating something like a hardware jump table for user-defined opcodes, speed might come up further, due to the COG being able to do things like run the nice math while also running LMM.
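
    Something like this hypothetical dispatch is what I have in mind (the escape encoding and names here are invented for illustration, not anything Chip has proposed):

        ' kernel lands here when a fetched long matches the "user opcode" escape pattern
        dispatch  and     op, #$0F         ' keep the (assumed) 4-bit primitive index
                  add     op, #optable     ' turn the index into a cog address in the table
                  jmp     op               ' indirect jump onto the table entry
        optable   jmp     #op_div32        ' user opcode 0: hypothetical 32-bit divide
                  jmp     #op_sqrt         ' user opcode 1: hypothetical square root
                  jmp     #op_fjmp         ' user opcode 2: hypothetical LMM far jump
        op        long    0

    That way the COG keeps fast native handlers for the nice math while still running LMM.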

    "full speed" is worth some discussion. Again, if the COG is seen as micro-code, then "full speed" at Ghz type clocks LMM style isn't going to be all that slow. Couple that with being able to run large programs concurrently? We may find it's more potent than we think. At the least, it's worth discussion. Right now, it takes a ton of complexity in caching to really make use of higher clocks. Maybe concurrency has a place in that mess.

    For a P3, or maybe just post P2 type design, having external RAM be "the HUB" basically means thinking on a CPU scale, which is where people appear to want to go. Right now, large memory XMM style is complicated. Caching will take some real work to make productive, and it's convoluted anyway.

    With an external RAM HUB, XMM sort of goes away, or at least becomes just an option, leaving fast, concurrent LMM code as the default target. COGs running PASM may fire off SPIN, or perform as peripherals, or run LMM kernels that are silicon assisted.

    One nice advantage would be some sort of tuning being possible. A math heavy program could run one kernel, a decision heavy one may run a different one, or they could run together, etc.... just depending.

    In any case, focusing on extending PASM to run big memory may not actually be the sweet spot, if the HUB were external memory and the hardware supported LMM-style programming. The trade would be complicated caching schemes that bridge the gap between a fast CPU and slower RAM, exchanged for simpler LMM programs that can run concurrently and be deterministic, with the COGs able to provide support services in PASM.

    At GHz-plus speeds, firing up a COG to perform a task isn't such a big deal. It may be that having a pool of a few of them able to load up the task at hand, blow through it, and return to the pool makes sense. All the while, other COGs are crunching on big LMM programs, with their silicon-assisted kernels helping to manage user-defined instructions and/or perform complex operations potentially present in the LMM programs.
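
    A COG pool could be as simple as a mailbox per worker in HUB. A rough sketch of the worker side, assuming P1-style PASM and a hypothetical run_lmm kernel routine:

        worker    rdlong  task, mbox  wz   ' poll our mailbox in hub RAM
          if_z    jmp     #worker          ' zero means no work yet
                  mov     LMM_pc, task     ' point the (hypothetical) LMM kernel at the task's code
                  call    #run_lmm         ' crunch until the task signals completion
                  wrlong  zero, mbox       ' mark ourselves free again
                  jmp     #worker          ' back to the pool
        zero      long    0
        task      long    0
        mbox      long    0                ' hub address of this COG's mailbox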
  • rod1963 Posts: 752
    edited 2013-10-14 23:05
    No, because NETMF is a dog on all but the fastest microcontrollers, and it's a resource hog. Real-time goes out the window and you've got something that's worse than running Basic. I looked closely at GHI's offering and I wouldn't touch them with a 10 ft pole. As far as I'm concerned they are living proof of the concept: if you put a rocket up a pig's rear end you can make it fly, but it doesn't make it right.

    I remember seeing a YouTube video of someone running an STM32 M4 with NETMF, with a 4.3" LCD display, running a game that would have been fine on a Timex/Sinclair. In short, it was pathetic. This isn't progress; it's a showcase of what happens when you saddle a state-of-the-art processor with the software equivalent of Fatb*****d.
  • Heater. Posts: 21,230
    edited 2013-10-14 23:12
    jmg,
    Why? LMM is also a fetch-then-execute solution.
    Good point. My extreme reaction does deserve some explanation.

    Firstly, LMM is not why the Prop was made. It was just a happy discovery by Bill Henning long after the Prop went into production. Far from the raison d'être
  • rod1963 Posts: 752
    edited 2013-10-14 23:22
    Why the need for GHz speeds on a microcontroller?

    Can someone tell me what embedded applications need this speed? I don't mean multi-media or OS support, etc., but plain old-fashioned, unsexy industrial, aerospace, medical, and automotive applications that require it.

    The reason I'm asking this question is that most vendors serving this market aren't going for speed but for peripheral support instead. Infineon has a wide array of microcontrollers aimed directly at the embedded market; their XMC4000 line tops out at 120 MHz. Sure, the TI Sitara line clocks in near a gigahertz, but it's not a real-time controller; it needs its own version of a COG, called a PRU. (I'm sure if Chip made a COG specially for TI they might be quite interested in it.) And even then it's limited, and the PRUs are a black box to many.
  • Heater. Posts: 21,230
    edited 2013-10-14 23:32
    potatohead,
    ...may be that having a pool of a few of them [COGs] able to load up the task at hand, blow through it, and return to the pool makes sense
    Perhaps that has its uses. But now we are designing a general-purpose computer for high-performance parallel computing. This is no longer the realm of the Propeller which, so far, is all about tight integration with the real world outside. The machine you are now describing is already far better built by Intel. There is no point in chasing that for Parallax.

    rod1963.
    No, because NETMF is a dog on all but the fastest microcontrollers
    True. And normally my discussion of such things would end there.

    I say "normally" because clearly if you have a run of the mill single core MCU things like Java or .Net byte codes are going to make it run like molasses.

    However, the Prop has 8 cores. One could argue that it does not matter if you have a large bulk of "management" code churning around slowly on its byte code engine. You still have 7 other cores for the high-speed, real-time parts of your application: the I/O interfacing, the FFT calculations, and so on.

    However, if one accepts that argument, I would still rather see the "big code" as JavaScript than as that legacy Java/C# stuff.
    Why the need for GHz speeds on a microcontroller?
    I'm not sure about the "need", but clearly things have been getting smaller and faster (that's why we can have a PII), so the question is how to exploit that.
    I'm sure if Chip made a COG specially for TI they might be quite interested in it
    Yes, exactly. Parallax could think about licensing the Prop II or some future version for use by others in ARM based SoCs.

    Mind you, TI already have their PRUs, so I suspect they would not be interested.
  • jmg Posts: 15,148
    edited 2013-10-15 03:03
    rod1963 wrote: »
    Why the need for GHz speeds on a microcontroller?

    Can someone tell me what embedded applications need this speed? I don't mean multi-media or OS support, etc., but plain old-fashioned, unsexy industrial, aerospace, medical, and automotive applications that require it.

    A good question, with more than one answer: time.
    There are already parts that offer sub-100 MHz or ~100 MHz CPU speeds that can also give 1 ns or even 150 ps of PWM time resolution.
    That is useful, which is why they take the trouble to do it.

    Nice would be capture at 1 ns or below granularity; I've not seen that yet, but it is likely coming.

    So most embedded systems do not need GHz CPUs, or GFLOPS, but if the silicon can do it without breaking a sweat, I'd like a controller that can time to GHz precision.
  • David Betz Posts: 14,511
    edited 2013-10-15 03:48
    Heater. wrote: »
    In the extreme, a COG can execute one instruction every clock. Given exclusive access to the HUB, it could perhaps also execute one LMM instruction per clock. BUT it does not have exclusive access; it has to share with seven other COGs. So LMM execution rate is only one eighth of full speed at best.
    Remember that with rdlongc you get 4 longs at a time so you aren't really limited to a single instruction per hub cycle. Of course that presumes that you aren't using the quad cache for anything else.
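
    Roughly, the unrolled loop might look like this (a sketch only; I'm assuming P2-hot style mnemonics, and that pc starts quad-aligned):

        lmm_loop  rdlongc ins0, pc     ' first fetch fills the 4-long quad cache
                  add     pc, #4
        ins0      nop                  ' fetched instruction executes in place
                  rdlongc ins1, pc     ' the next three fetches come from the cache
                  add     pc, #4
        ins1      nop
                  rdlongc ins2, pc
                  add     pc, #4
        ins2      nop
                  rdlongc ins3, pc
                  add     pc, #4
        ins3      nop
                  jmp     #lmm_loop

    Only the first rdlongc of each aligned group of four should pay for a hub window.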
  • Seairth Posts: 2,474
    edited 2013-10-15 07:32
    I get the feeling that we are at a point where we are trying to redesign the Propeller to meet the needs of the tools instead of the other way around. And considering that some of these tools (GCC being a prime example) were not designed with the Propeller in mind, this really seems like a case of the tail wagging the dog. I understand the desire to leverage the value of what already exists, but at what cost to the potential of what the Propeller could otherwise be (or become)?

    Edit: or maybe this is just the normal order of things and I am naive to think otherwise.
  • ctwardell Posts: 1,716
    edited 2013-10-15 07:38
    Heater. wrote: »
    By the way, isn't it possible to make an LMM-style loop that executes code from the stack RAM now? (Or whatever that RAM is called nowadays.)

    Yeah, I played with that back when the FPGA code first came out.

    Here is the thread:

    http://forums.parallax.com/showthread.php/144675-Challenge-Execute-Prop-II-code-from-it-s-CLUT-space-(CLMM)

    It comes from a time when we were having some 'issues'; let's leave that part in the past, please.

    Chris Wardell
  • brucee Posts: 239
    edited 2013-10-15 07:40
    I think this thread is starting to get to the issue I see with the Prop. Everybody seems to agree an RPi with a PROP or an ARM with a PROP is a good thing.

    The reason being that you want the PRU or COG to do the bit-banging and not be interfered with by the main CPU.

    This makes sense from the IO side: you want multiple COGs, and for IO tasks 2K is probably adequate.

    The problem with the PROP is: OK, so what do we do for the main CPU? That becomes a COG with 1/8th access to HubRAM. So now you have a CPU running at AVR speeds. Then, if you have to go off-chip, you both give up a COG and end up running at Z80 speeds.

    I think the memory balance is wrong: 32K for a shared memory is probably too big in the shared-memory function, as you normally don't send that much data simultaneously to a peripheral. And 32K is too small for any but the smallest applications; this is where the lack of Flash hurts, as the equivalent in Si space would be 256K. Doing quad fetches from HubRAM helps a little, but it gets either the next few instructions of code or the next few words of data, but generally not both. Most ARMs have independent data and code paths, and the quad fetches there are more effective. The way the multi-core ARMs handle the situation is that CPUs and memories are arranged in matrices, so that there are multiple RAMs and Flashes, and the CPUs have independent paths to access them; there is only interference when 2 CPUs are sharing a memory.

    Getting back to a JS engine, Heater could probably answer this one: how big a memory would the interpreter need for the majority of cases? I assume the answer is bigger than 2K.
  • KC_Rob Posts: 465
    edited 2013-10-15 08:29
    Wow! This thread has really taken off. Much to chew on, thanks in good part to jmg, Heater, potatohead, rod1963... et al.
    brucee wrote: »
    I think this thread is starting to get to the issue I see with the Prop. Everybody seems to agree an RPi with a PROP or an ARM with a PROP is a good thing.
    Yes, I think most here agree that this is one niche the Prop is well suited for. I can also see it working well in deeply embedded IO-intensive apps lacking a robust UI and the other goodies that a general-purpose CPU is better equipped to handle. In both cases, though, the requirements are very nearly the same.
    I think the memory balance is wrong: 32K for a shared memory is probably too big in the shared-memory function, as you normally don't send that much data simultaneously to a peripheral. And 32K is too small for any but the smallest applications; this is where the lack of Flash hurts, as the equivalent in Si space would be 256K. Doing quad fetches from HubRAM helps a little, but it gets either the next few instructions of code or the next few words of data, but generally not both.
    I was pondering this the other evening, and I have to agree with you, brucee: once one accepts the above (i.e., the sort of application it's realistic to target), the Prop's memory balance does seem out of whack; and I believe the P2's might be even more so.

    Again, I think it *very* important that you scale the design appropriately for the expected market. By which I mean: unit cost, package size, power requirements, and overall complexity.
    Most ARMs have independent data and code paths, and the quad fetches there are more effective. The way the multi-core ARMs handle the situation is that CPUs and memories are arranged in matrices, so that there are multiple RAMs and Flashes, and the CPUs have independent paths to access them; there is only interference when 2 CPUs are sharing a memory.
    Yes, I think that arrangement gives the best tradeoffs for memory allocation/sharing in such situations: multiple memories.
    Getting back to a JS engine, Heater could probably answer this one: how big a memory would the interpreter need for the majority of cases? I assume the answer is bigger than 2K.
    I've said before that it's entirely possible that, even once what a Prop *should* (or rather, most likely *would*) be used for is taken into account, more than 2K per cog may yet be found desirable.
  • Heater. Posts: 21,230
    edited 2013-10-15 08:50
    brucee,

    As you may know, the V8 JS engine inside the Chrome browser, and the SpiderMonkey engine used by Firefox, consume many megabytes. They have to, because they are doing a lot of on-the-fly compiling and recompiling to get the speed we see nowadays. They also use a lot of RAM for the run-time heap and such.

    On the other hand there is Espruino http://www.espruino.com/ which fits in the 1 MB flash space of my STM32F4. In fact, the binary I built for it is slightly less than 200 KBytes. Espruino can work in less than 40K of RAM. (My F4 has 192 KBytes.)

    It is of course rather slow, as it basically has to parse the text of the program as it runs: no byte codes or just-in-time compilation. That might seem useless until we realize it can be combined with whatever high-speed code we have running in the other COGs. The JS only has to orchestrate things.

    This won't fly so well on the Prop, but I am trying to get it built for the PII using that huge external RAM that seems to be planned for the dev boards when they come.

    It may only ever be a curiosity.

  • brucee Posts: 239
    edited 2013-10-15 10:47
    Espruino looks like a pretty good target these days, considering most software can't get out of bed without 1 GB of RAM.

    But that is about the right scale, though it seems bloated when you compare it to Apple DOS at <10K.

    With 32K for SPIN, if you could squeeze it into there or a bit bigger, that would be a tempting target, though I'd really like to see it running at COG speed and not Hub speed. Only 1 or maybe 2 COGs would need access to it, and at that point those HID COGs don't care much about code predictability; last I checked, I could only notice delays in the high-msec range :)

    Anyway, the question that still needs to be answered is: what is the target application/market? A quick JS interpreter in a small footprint would be a nifty hobbyist tool, but millions? Doubtful. And it would only get there with RPi-type pricing, i.e., yes, we lose money on every one but we'll make it up in volume.
  • jmg Posts: 15,148
    edited 2013-10-15 12:00
    brucee wrote: »
    Getting back to a JS engine, Heater could probably answer this one: how big a memory would the interpreter need for the majority of cases? I assume the answer is bigger than 2K.

    This can become a software problem, along the lines of what is already in the mix for RAM-Spin.
    There is a plan to make Spin adaptive, so it does not load what is never called, and that means it can draw on more than one COG's worth of resources. Users either get spare RAM for their own callable functions, or more extras.


    You can run one or even two COGs (I'd start with two): one has the tight kernel and debug support and is 'safe', and the other is for swappable functions.
    Off-chip compiler tools need to do the heavy lifting of choosing what to compile into functions, what to interpret, and where it all goes.
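
    The swappable-function COG could use the classic overlay trick (a rough sketch in P1-style PASM; the sizes and labels are placeholders):

        ' pull a function image from hub into cog RAM, then run it
        load_fn   movd    :copy, #fn_space  ' aim the destination at the overlay area
                  mov     count, #FN_LONGS  ' length of the overlay, in longs
        :copy     rdlong  0-0, hub_ptr      ' destination field is patched each pass
                  add     hub_ptr, #4       ' next long in hub
                  add     :copy, d_inc      ' bump the rdlong's destination field
                  djnz    count, #:copy
                  jmp     #fn_space         ' enter the freshly loaded function
        d_inc     long    1 << 9            ' +1 in the instruction's D field
        count     long    0
        hub_ptr   long    0

    The 'safe' kernel COG never changes; only the overlay area gets rewritten.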

    Another alternative is an "Arduino Morph" like Intel have done: one that swallows Arduino sources, but a lot of software in the background finally runs it on their Pentium core. Only moderate sizes (from a Pentium viewpoint) seem to be currently supported, but the environment is simple and learner-friendly.
    A good remote debug is going to be important.
  • David Betz Posts: 14,511
    edited 2013-10-15 13:09
    Did anyone ever make a P1 "pie plate" or whatever they call RaspberryPi add-on boards?
  • Heater. Posts: 21,230
    edited 2013-10-15 14:03
    David,
    Did anyone ever make a P1 "pie plate" or whatever they call RaspberryPi add-on boards?
    Good grief, as far as I know it did not happen. Despite the fact it has been discussed on these forums at length.
    I tried my best to push things along by enabling SimpleIDE to program a Prop via the Raspi UART directly.
    Perhaps it's time to try again.
    I really think Parallax should look into this "pie plate" (or whatever it is called) idea.

    Oh yeah, I should also publish my Raspi Propeller loader mods somewhere.
  • mindrobots Posts: 6,506
    edited 2013-10-15 14:12
    David Betz wrote: »
    Did anyone ever make a P1 "pie plate" or whatever they call RaspberryPi add-on boards?

    Nick Lordi (nglordi) designed one and had it on BatchPCB (they closed down and moved to OSH Park). I could never find it on their site. There were a couple of threads where plates were discussed.

    I sent Nick a PM back in May about it but he never replied.
  • rod1963 Posts: 752
    edited 2013-10-15 15:04
    Microchip is already selling a PIC32 variant of the Pi plate.

    http://www.microchipdirect.com/ProductSearch.aspx?keywords=TCHIP020
  • David Betz Posts: 14,511
    edited 2013-10-15 18:32
    rod1963 wrote: »
    Microchip is already selling a PIC32 variant of the Pi plate.

    http://www.microchipdirect.com/ProductSearch.aspx?keywords=TCHIP020
    That looks kind of nice and fairly inexpensive as well. It would be nice to have a similar board with a P1 chip on it.
  • KC_Rob Posts: 465
    edited 2013-10-15 19:20
    rod1963 wrote: »
    Microchip is already selling a PIC32 variant of the Pi plate.

    http://www.microchipdirect.com/ProductSearch.aspx?keywords=TCHIP020
    Not too shabby! It would be nice to see a Prop version.