The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

evanh · 2014-04-11 20:18

Phil Pilgrim (PhiPi) wrote: »

... whatever actual silicon results from this very flawed and much too-open dev process.

I'll take exception to that. I find the whole process has been a great ride and I very much appreciate the level of openness.

The fact that it hasn't gone completely smoothly can't be seen as flawed at all.

EDIT: To give the thermal issue as an example. I've been through that sort of process before where I didn't account for something right at the start and only later did I realise how important it was. And, as was the case here, there is often a reason why it was not addressed initially, ie: didn't have the tools at hand.

All this really shows is that the warts are on display at the same time.

rjo__ · 2014-04-11 20:18

Phil,

I was typing as you were writing... the more you can say about the technical issue, the better.

Thanks,

Rich

jazzed · 2014-04-11 20:19

RossH wrote: »

I'm assuming there will be an OBEX equivalent for high level languages other than Spin. Then it will become quite important to know which objects use COG mode and which use HUB mode.

Ross.

Here's the way I see it.

OBEX is all source of course (no binaries).

In the current Propeller C program, we can choose one of the various "TLA" memory models in the SimpleIDE project menu for the main program. Then we can add Spin sources for extracting PASM or cogc sources for linking into the main program ... the file-type defines how things are put together. I may add spinwarp support to SimpleIDE, but that is also just file-type dependent.

In a follow on P16X512LC release, we can remove the memory model box entirely because the file extension defines how the code is built in SimpleIDE. If someone wants to make a program written entirely in COGC or have it built as Compact code that can be selected from the Tools menu. XMM models are not mentioned because the new chip is big enough ... so far.

As far as new spin code goes, Chip has already said that "spin2" code will have the .spin2 extension.
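A rough sketch of the file-type dispatch described in this post, for illustration only. The `BUILD_RULES` table and the `build_rule` helper are hypothetical names, not SimpleIDE code, and the rule texts only paraphrase what the post says about each file type (the `.c` entry is an assumption).

```python
import os

# Hypothetical table: file extension -> how the file is treated in the build.
BUILD_RULES = {
    ".c": "compile as part of the main program",  # assumed main-program type
    ".spin": "extract PASM for linking into the main program",
    ".cogc": "compile as cog code and link into the main program",
    ".spin2": "treat as new-style Spin (spin2) source",
}

def build_rule(filename):
    """Return the build rule for a file, chosen only by its extension."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in BUILD_RULES:
        raise ValueError(f"no build rule for extension {ext!r}")
    return BUILD_RULES[ext]
```

With a table like this there is no separate mode setting to choose: `build_rule("vga.cogc")` picks the cog-code rule from the name alone.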

Hopefully that all explains why I think it is not important to know which objects use which mode.

Just for the sake of following through with this, I hope you can describe your vision and concerns in more detail.

jmg · 2014-04-11 20:20

rjo__ wrote: »

I know of all kinds of University level research that depend upon RF modulation. This means that there are academic markets that this chip cannot penetrate.
It also means that there is research that won't be done because budgets won't allow it.

The P1 is full custom, and it managed to include an (Analog?) PLL with every Timer.
The P1+ is fully Synthesised from HDL in the COGs, and has a single custom PLL cell for the master clock in the Pad ring area.

That design flow difference makes getting a PLL into P1+ COGs quite a challenge, and Chip has not yet said it is included.

That said, most RF designs need more serious PLL and performance than a P1 PLL anyway, and there is an expanding range of external RF synthesis chips that the P1+ could easily drive.

ie by moving the PLL to a device that does it properly, you get better performance, and remove some design risks.

Phil Pilgrim (PhiPi) · 2014-04-11 20:34

rjo__ wrote:

Phil ... the more you can say about the technical issue, the better.

No. The P1+ needs to evolve and morph with a singular vision. The more I opine, the more I distract from that vision -- assuming Chip is misguided enough to listen. I had zero input on the P1 and am thrilled with the result. It manifests a simple, versatile elegance that only a focused, undisturbed mind could produce. It's what I wish for the P1+ but apprehend may have gotten flung out of reach by our own various well-meaning but entropic visions.

-Phil

rjo__ · 2014-04-11 20:39

jmg,

Thanks. Most of the RF work that I know about in the biological sciences is rather cost sensitive, where the rf is sort of commodity and the basic design is whatever is most available.

Rich

RossH · 2014-04-11 20:41

jazzed wrote: »

... the file-type defines how things are put together. I may add spinwarp support to SimpleIDE, but that is also just file-type dependent...

Using the file extension is no different to explicitly identifying the mode. In fact, it would make sense to use ".cog" and ".hub" as the extensions (or equivalent - as I've already said, I don't care what the actual names are that you use). But the fact remains that in many cases you are going to need to know what mode the object uses to know whether it is going to be able to be used in your program.

jazzed wrote: »

... In a follow on P16X512LC release, we can remove the memory model box entirely because the file extension defines how the code is built in SimpleIDE ....

So Catalina will be left as the only compiler that supports XMM? Great - because that mode is going to rock on this new chip!

jazzed wrote: »

Just for the sake of following through with this, I hope you can describe your vision and concerns in more detail.

I am happy to leave further discussion until we see the fine detail of how Chip plans to implement HUB mode. I believe it will become clear at that point that HUB mode and COG mode are both valid native modes, and that they are not as interchangeable as some people seem to think they will be.

Ross.

Cluso99 · 2014-04-11 20:42

Seairth wrote: »

That's "IN", not "IND".

Of course! Must be having another senior moment.

jmg · 2014-04-11 20:44

rjo__ wrote: »

Thanks. Most of the RF work that I know about in the biological sciences is rather cost sensitive, where the rf is sort of commodity and the basic design is whatever is most available.

What sort of RF - are these communication links, for portable sensing ?

The P1+ is not going to be battery-friendly, nor especially cheap, or small, so it may be better to choose a smaller Micro with inbuilt RF Synthesis Block - and the choices here seem to be expanding all the time.

rjo__ · 2014-04-11 20:47

Phil,

Producing a new chip has levels of complexity that are rarely faced by a single human being. You have often said that there is no such thing as failure in this process...at every step something is learned, and that the final product benefits from every missed step along the way. Nothing could be more true.

The job is so daunting that I can just barely imagine it being possible. Chip is human. The only possible way he could get anything done is just to do it... the inter-related issues, complexities, and opportunities demand experimentation.

Chip would have had to experiment, whether we were here talking about it or not. It isn't a neat and tidy process... by the nature of that process, not just by the nature of this forum.

Fatigue is the issue that concerns me the most. If Chip doesn't change his schedule...and substantially, he is going to make himself sick.

Rich

jmg · 2014-04-11 20:48

Phil Pilgrim (PhiPi) wrote: »

No. The P1+ needs to evolve and morph with a singular vision. The more I opine, the more I distract from that vision -- assuming Chip is misguided enough to listen. I had zero input on the P1 and am thrilled with the result. It manifests a simple, versatile elegance that only a focused, undisturbed mind could produce. It's what I wish for the P1+ but apprehend may have gotten flung out of reach by our own various well-meaning but entropic visions.

Some of what Chip is doing on the P1+ is dictated by the design flows, so not everything in P1 can come along.

Where he has explained what he will be doing, I would think some use-case examples would not disturb, but rather help focus on what is needed to cover those use cases.

Cluso99 · 2014-04-11 20:52

Phil Pilgrim (PhiPi) wrote: »

I was rather hoping for three. My signal front-ends-plus-I/Q-mixers use three counters (actually five, including the local oscillators), and I've had to start an extra cog in the P1, just to accommodate the third one.

A single counter per cog is useless for the kind of stuff I do. But, again, if more can't be accommodated, I'm content to keep using the P1. It'll be lower-power and less expensive anyway. At some point, it would be nice to see a multi-core chip from somebody that's optimized for RF apps, without having to resort to FPGAs.

-Phil

Phil, can't the smart pins do most of what you want? I know we don't have a full spec yet.
Is there anything simple that the smart pins could do that would allow the counters to be completely removed from the cog? It would also save register space too.
At least with 16 cogs, you can afford to throw another cog or two at your job.

BTW When you processed the IQ data, I read all about that on the internet. I would never have learnt that had you not published your code and described it nicely. Thanks!

jmg · 2014-04-11 21:10

Cluso99 wrote: »

Is there anything simple that the smart pins could do that would allow the counters to be completely removed from the cog? It would also save register space too.

You mean besides the PLL in a P1 Counter ?

The smart pins need just [an adder and MUX and carry tacked on], to emulate the core digital sections of a P1 Adder/Counter.

As I mentioned above, that has some trade offs - and probably needs a set of P&R runs on OnSemi process to lock down actual numbers for MHz and area.
It is not hard to run some parallel builds, with 3 or 4 combinations, and check the reports.

- but I think the bigger questions are less technical-details, and more along the lines of

* How important is having Software-backward-compatible Counters (minus PLL) ?
( This is a different question from Hardware-compatible Counters )

* Is the slower communication to the Pin Counter Cells going to be an application issue ?
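For reference, a minimal sketch of the "adder plus carry" core mentioned in this post, modelled on the P1 counter's NCO mode, where FRQ is added to PHS every clock and the pin follows PHS bit 31. Only that one mode is modelled (no PLL, no mode MUX), the function names are hypothetical, and the 80 MHz clock in the usage note is just an example value.

```python
MASK32 = 0xFFFFFFFF

def nco_step(phs, frq):
    """One clock: add FRQ to PHS with 32-bit wrap.

    Returns (new_phs, carry_out, pin_out), where pin_out is PHS bit 31.
    """
    total = phs + frq
    new_phs = total & MASK32
    carry = total >> 32
    return new_phs, carry, (new_phs >> 31) & 1

def nco_run(frq, clocks, phs=0):
    """Run the accumulator for a number of clocks; return the pin levels."""
    outs = []
    for _ in range(clocks):
        phs, _carry, out = nco_step(phs, frq)
        outs.append(out)
    return outs

def nco_frequency(frq, clk_hz):
    """Average output frequency of the accumulator: frq * clk / 2**32."""
    return frq * clk_hz / 2**32
```

With `frq = 1 << 30` the pin repeats every four clocks, so `nco_frequency(1 << 30, 80_000_000)` gives 20 MHz.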

evanh · 2014-04-11 21:28

Software compatibility is a non-issue. Instruction speed would likely be a problem though. We could do with some numbers - how many counters in total are being proposed in the I/O ring? Can't possibly expect one per pin.

evanh · 2014-04-11 21:29

Hmm, I guess the current tally for the P1+ as it stands is 32 counters.

jmg · 2014-04-11 21:37

evanh wrote: »

Hmm, I guess the current tally for the P1+ as it stands is 32 counters.

Chip is talking about being able to do one counter cell on every pin.
( but die area numbers on that, are still up in the air, so that 64 may yet modify to 32 or ? )

evanh · 2014-04-11 21:58

Ah, I've been getting a bit behind. I see the counter topic is over here - http://forums.parallax.com/showthread.php/155145-Putting-smarts-into-the-I-O-pins and Chip is speculating on having a shifter in there also ...

Heater. · 2014-04-12 01:20

RossH,

I'm assuming there will be an OBEX equivalent for high level languages other than Spin.

I hope so too. And it's called github.

It would be nice for Prop programmers to get into collaborative development with such a system. Or at least make use of version control, issue tracking, the wiki pages for documentation etc etc.

Then it will become quite important to know which objects use COG mode and which use HUB mode.

Why?

What I do in my module or object should be of no concern to the users of that code. As long as it does actually do what it says on the box. Just like using any of the billions of libraries in Linux or Windows when programming there.

What a user will want to know is: How many processors does it expect to be able to use. So that he knows it will fit in his project.

How it uses them should not matter. Run from RAM or core registers, who cares?

Same story as with RAM space. A user needs to know the RAM usage, not how it is used.

What am I missing here?

Brian Fairchild · 2014-04-12 02:00

re: Compilers and Execution Models

FWIW, I see the ideal, friendly and easy to start with, C world something like this...

1) You write some ANSI vanilla code and give it to your compiler. It outputs straight P1+ instructions which go into the 512k of RAM and are executed by core0. So much like any other chip out there which is 32-bits and runs at 100 MIPS.

2) In other micros you then have something like the 'interrupt' keyword which tells a compiler that this code is special and can run at any time. In the P1+ this makes no sense but an equivalent is needed which tells the compiler 'I want to run this code in its own core with the code coming from RAM'. The writer decides which core to use and the compiler warns if it's already in use or you've run out. So much like any other chip out there which is 32-bit, runs at 100MIPS and has interrupts. Except executing interrupt code does not slow down the main code at all.

3) The coder now reads up on the ability to run his 'core' code directly from the core's registers. He finds the additional keyword to add to his qualifier and, with no other changes, the compiler now outputs code which gets loaded into the core's registers. It warns him if it won't fit in 496 words. So much like any other processor out there which is 32-bits, runs at 100 MIPS, has interrupts and runs the interrupt code without slowing down the main code. Except it now runs the interrupt code at 200 MIPS.

4) Our coder has been very productive and written a huge amount of main-line code, plus he has some huge data arrays. So he needs more memory. He turns to the section in the compiler help file and sees that he has a number of options. The help file lists his options and summarises the trade-offs. Does he want to add an external chip? Can he sacrifice some speed in his main code? He's an engineer and engineers are always making trade-offs. So he decides to use LMM/XMM/CMM/SMM/BMM or any of the 26 different xMM variants.

Am I missing something?
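The size check in step 3 can be sketched as follows. This is only an illustration: `place_code` is a hypothetical helper, and falling back to hub RAM when the code is too big is an assumption of the sketch (the post only says the compiler warns).

```python
COG_CODE_LIMIT_LONGS = 496  # limit quoted in step 3

def place_code(size_longs, want_cog_registers):
    """Pick where a core's code would run.

    Returns (placement, warning), where warning is None if all is well.
    """
    if not want_cog_registers:
        return "hub", None
    if size_longs <= COG_CODE_LIMIT_LONGS:
        return "cog", None
    over = size_longs - COG_CODE_LIMIT_LONGS
    # Assumed behaviour: warn and leave the code in hub RAM.
    return "hub", f"code is {over} longs too big for cog registers; left in hub RAM"
```

For example, `place_code(300, True)` returns `("cog", None)`, while a 500-long routine stays in hub RAM with a warning.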

Roy Eltham · 2014-04-12 02:00

Phil,
I'm really getting tired of you calling this process that Chip & Ken decided to use "flawed". It's insulting to Chip and Ken, and the people involved here. Chip has said on numerous occasions that the P2/P1+ has been made much better by getting input from outside folks. You want him to go off on his own and make the new chip without any outside input, but he doesn't want to do that. The chip would not be as good as it's going to be if he had done that, and quite frankly I think you are crazy to think otherwise.

Heater. · 2014-04-12 02:30

Brian,

I see where you are going with the "easy route" to Prop programming. That's pretty much what happens with propgcc already.

One little quibble though - "The writer decides which core to use and the compiler warns if it's already in use or you've run out."

I don't think so. The author should not need to know or care which core gets used to run bits of his code in parallel. Why should he? Just ask for code to be run in parallel and if there is a free core it is. If not there is an error. That is what happens in Spin and C now.

Think of it like using RAM. Mostly you don't know or care where your variables get put in memory. That's for the compilers to worry about.
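A toy model of that "run this on any free core" behaviour, following the Spin `cognew` convention of returning the new cog's ID, or -1 when no cog is free. The `CogPool` class is hypothetical, and always handing out the lowest-numbered free cog is an assumption of the sketch.

```python
class CogPool:
    """Toy model of 'start this on any free core' allocation."""

    def __init__(self, n_cogs=16):
        self.busy = [False] * n_cogs

    def cognew(self):
        """Claim a free cog; return its ID, or -1 if none is free."""
        for cog_id, in_use in enumerate(self.busy):
            if not in_use:
                self.busy[cog_id] = True
                return cog_id
        return -1

    def cogstop(self, cog_id):
        """Release a cog so it can be claimed again."""
        self.busy[cog_id] = False
```

The caller never names a core; it only checks whether the request succeeded.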

Brian Fairchild · 2014-04-12 02:58

Heater. wrote: »

One little quibble though - "The writer decides which core to use and the compiler warns if it's already in use or you've run out."

Quibble accepted.

Phil Pilgrim (PhiPi) · 2014-04-12 03:06

Roy Eltham wrote:

Phil, I'm really getting tired of you calling this process that Chip & Ken decided to use "flawed". It's insulting to Chip and Ken, and the people involved here. ...

Roy, I'm sorry you feel that way, and I certainly mean no disrespect to Chip, Ken, or anyone on the forum. I'm sorry if anyone has taken personal offence at my point of view, and I have taken pains to offer a mea culpa along with the pointed finger when I say that the forum input seems to be part of the problem.

I'm sure the motives for an open process were pure in the beginning and well-meaning. But that does not mean that I see what the process has become producing a finished product. It seems like there's always one more better way to do it, and we've been better-way-to-doing it for much too long. "Better" really can be the enemy of "good" sometimes.

From a technical standpoint, I think part of the problem is that the Propeller that we know and love is just a singular, nearly-perfect jewel that won't scale as simply as everyone hoped it would. Hence biggerbetterfastermore turned into 5 watts. What we're struggling with now is something that may, by necessity, not be as Propeller-like as everyone might have hoped. That's not necessarily bad, but it has caused a surfeit of well-intended but misguided "help" in the forum which, in my mind, has become a distraction from the goal of producing working silicon. This thread alone has passed 16,000 views and 650 replies. And that's just crazy.

I personally find myself torn between saying what I'd like to see -- witness the counter discussion -- and staying the heck out of Chip's way. I think the biggest thing he needs now is space. And if the forum proves to be too much of a temptation for him, we can help by turning off the bubble machine and giving him a chance to let things gel.

Let's face it: there are a lot of egos at work here, and it's a badge of honor to have one's ideas incorporated into the final product. I'm no more immune to that than anyone else. But it's way past time for "final product" to be our goal, rather than pushing for this idea or that suggestion.

Anyway, that's my take on the situation, and I'm not offended by anyone who might disagree.

-Phil

Heater. · 2014-04-12 03:41

Phil,

I think you have some good points there Phil.

This whole thing has been, and is, a big experiment in collaborative design. The likes of which the world has never seen before. Can you think of another micro-controller that has been developed with so much user input?

As such it has been long winded and chaotic. As is the nature of "community".

Of course one could say, and I think you are saying, "I didn't want an experiment in social chip design, all I wanted was the chip".

That's fair enough. I wanted "chipmas" years ago. But Chip has instigated and continued this experiment and it's his baby.

If it's any consolation many here have been counselling limiting "feature creep" for a long time. Those 500 different instructions scared the socks off of me for one.

Looks like Chip is now steaming full speed towards a much more streamlined and "Propeller like" Propeller II design. Using what he has learned along the way with the previous approach.

This makes most of us here very happy judging by the tone of the posts.

Brian Fairchild · 2014-04-12 05:04

Phil Pilgrim (PhiPi) wrote: »

...it's a badge of honor to have one's ideas incorporated into the final product.

I'm hoping some ideas don't make it into the final chip and I'll claim those as my badge of dishonour.

evanh · 2014-04-12 05:28

Phil Pilgrim (PhiPi) wrote: »

... It seems like there's always one more better way to do it, ...

I haven't seen anything really get bogged down at all. Chip appears to be very quick at marching through everything that has been thrown at him. The design has really benefited, and I'm talking about the P2 as much as I'm talking about the P1+.

and we've been better-way-to-doing it for much too long. "Better" really can be the enemy of "good" sometimes. ...

I'd disagree. Progress has been impressive. The enhanced feature list has made the next Prop designs so much better. It'll be worth the extra time ten times over. The fact that one important factor was overlooked for a while is not a reason to dis the whole process.

evanh · 2014-04-12 05:32

Phil Pilgrim (PhiPi) wrote: »

Let's face it: there are a lot of egos at work here, and it's a badge of honor to have one's ideas incorporated into the final product. I'm no more immune to that than anyone else.

Please encourage Chip to offer up more of what he's done with the counters in that thread, although I suspect he's not quite on to them just yet. I'm keen to see what you might be able to contribute to them. There is currently a big question mark over shifting them to the pin ring.

RossH · 2014-04-12 06:46

Heater. wrote: »

What am I missing here?

At least a little, I think.

Has the detail of the HubExec scheme the P16X32 will implement been nailed down yet? If it has, post a link and we can discuss it further - there are so many threads on this subject I'm not sure any more whether Chip is proposing to include Cluso's scheme, Bill's scheme, some combination of the two, or something else entirely.

But at the very least, 16 cogs executing in HUB mode (rather than COG mode) requires at least 32kb more Hub RAM, and it will run slower (and under some schemes would also be non-deterministic).

Whether this is going to be important is difficult to assess without knowing the specifics of both the application and the HubExec implementation - but it is clear that some people are assuming that 512Kb Hub RAM is "inexhaustible" - which it isn't, of course. In fact, it is really only enough to do some quite low-resolution VGA displays before you run out of Hub RAM yet again. It is easy to come up with an example program that would "work" if all the plugins it used executed in COG mode, but not if they executed in HUB mode.
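To put rough numbers on the VGA point, here is a small sketch of frame buffer sizes against hub RAM. It assumes "512Kb" means 512 KiB (524,288 bytes) and an uncompressed bitmap; the helper names and the chosen resolutions are illustrative only.

```python
HUB_RAM_BYTES = 512 * 1024  # assumes "512Kb" means 512 KiB of hub RAM

def framebuffer_bytes(width, height, bits_per_pixel):
    """Bytes needed for one uncompressed frame buffer."""
    return width * height * bits_per_pixel // 8

def fits_in_hub(width, height, bits_per_pixel, reserved_bytes=0):
    """True if the frame buffer plus reserved space fits in hub RAM."""
    needed = framebuffer_bytes(width, height, bits_per_pixel) + reserved_bytes
    return needed <= HUB_RAM_BYTES
```

Under these assumptions 640x480 at 8 bits per pixel takes 307,200 bytes and fits, 640x480 at 16 bits per pixel (614,400 bytes) does not, and 800x600 at 8 bits per pixel (480,000 bytes) fits only if less than about 44 KB is needed for everything else.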

Ross.

Bill Henning · 2014-04-12 07:37

Re: hubexec

Last I heard (and deduced from instruction set and register map post by chip)

- one line of four longs as icache
- one line of four longs as dcache
- 200MB/sec (ie every 8 clocks) hub limits execution rate to 50MIPS (unless 16 instruction loop fits in the icache)
- CALL/JMP/LOCPTR with embedded hub addresses to save space
- four flavors of call (LR, 4 long LIFO, hub stack based on PTRA/PTRB)
- AUGS and AUGD for 32 bit constants
- fcache-like mechanism would allow full speed inner loops
- it is much easier to write fully deterministic code for cog-only mode due to no caching effects which are unpredictable
- it should be possible to write fully deterministic code to a larger grain (ie 1us) in hubexec that would hide the caching effects

Chip is considering a 256 bit bus to speed hubexec back up to nearly cog speed; if that takes too many transistors he may look at my hub slot scheme.

We need either 256 bit bus or slot scheme to be able to get nearly 100MIPS, as both 256 bit bus and slot mapping would give the bandwidth to pre-fetch.
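The 200MB/sec, 50MIPS and "nearly 100MIPS" figures in this post can be reproduced with a short calculation. The sketch below rests on assumptions not stated in the thread: a native cog rate of 100 MIPS, "every 8 clocks" read as one hub window per 8 instruction times, one 4-long line fetched per window today, and a 256 bit bus delivering 8 longs per window. `hubexec_rate` is a hypothetical helper name.

```python
BYTES_PER_LONG = 4

def hubexec_rate(line_longs, window_instr_times=8, native_mips=100):
    """Return (bandwidth in bytes/s, instruction rate in MIPS) for one cog.

    Assumes one line of `line_longs` instructions arrives per hub window and
    a window lasts `window_instr_times` instruction times at `native_mips`.
    """
    windows_per_second = native_mips * 1_000_000 // window_instr_times
    bandwidth = line_longs * BYTES_PER_LONG * windows_per_second
    mips = line_longs * native_mips / window_instr_times
    return bandwidth, mips
```

Under those assumptions `hubexec_rate(4)` gives 200,000,000 bytes/s and 50 MIPS, and `hubexec_rate(8)` gives 100 MIPS, which matches the point that a wider bus is what allows fetching to keep up with execution.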

Based on significantly fewer posts by Chip I deduce he is working on the Verilog, hopefully we get new FPGA code soon.

Btw, the reason I have not posted more about hubexec is I want to see what Chip comes up with - no point in speculating more until I see what infrastructure he puts in.

(I have not examined Cluso's scheme in sufficient detail to see how it differs from my scheme - Ray did nicely document the 8 instruction clock hub cycle access though)

Edit: Thanks Dave, I mis-read the effect of the register map. You are correct, it shows a single icache and a single dcache line. I need more coffee! I corrected my text above.

Dave Hein · 2014-04-12 07:43

RossH wrote: »

But at the very least, 16 cogs executing in HUB mode (rather than COG mode) requires at least 32kb more Hub RAM, and it will run slower (and under some schemes would also be non-deterministic).

Ross, how did you get the 32kb number? There are applications where each cog could be executing the same HUB code, and may only require a small amount of extra HUB RAM for local storage or stack space. I can't wait to try out my threaded chess program on P1+. It uses 1.2K per cog, but that's because each cog keeps a copy of the chess board on the stack for each level it evaluates. A P1+ should be able to go one more level deeper than a P1, and do it in less time.
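The thread does not say how the 32kb figure was derived. One reading that produces it, offered here only as an assumption, is that each of the 16 cogs needs its own 512-long (2 KB) code image in hub RAM. The sketch compares that with the shared-code case described in this post; the function names are hypothetical and 1.2K is taken as roughly 1200 bytes.

```python
COGS = 16
COG_IMAGE_LONGS = 512   # assumed: one full cog-sized code image per cog
BYTES_PER_LONG = 4

def hub_cost_private_images(cogs=COGS):
    """Hub bytes if every cog keeps its own full-size code image in hub RAM."""
    return cogs * COG_IMAGE_LONGS * BYTES_PER_LONG

def hub_cost_shared_code(shared_code_bytes, per_cog_bytes, cogs=COGS):
    """Hub bytes if all cogs run one shared code copy plus private stack/data."""
    return shared_code_bytes + cogs * per_cog_bytes
```

`hub_cost_private_images()` comes to 32,768 bytes, while 16 cogs at about 1200 bytes each need about 19,200 bytes plus a single copy of the shared code.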
