Ease of use - P2 vs P1 variant

W9GFO · 2014-04-05 09:23

How will all the new features of the P2 affect its ease of programming? The P1 was/is wonderfully simple to comprehend and powerful; it makes me look like a programming wiz - though I am certainly not. Will the P2 be as user friendly to users like myself as the P1? I feel like a good portion of the success of the P1 was how simple it is to use.

mindrobots · 2014-04-05 09:44

Rich,

What do you like to program in? Spin, PASM, Forth? (OK, kidding about Forth.)

Spin we haven't really seen yet, but hopefully it should work the same with some added keywords and features. PASM should be similar as you transition, and you can incrementally add new features as you need them and learn them. The daunting part will be the 495+ PASM instructions and some of the new concepts: hardware multi-tasking, software multi-threading, hub execution, PTRX/PTRY, PTRA/PTRB, 256KB of HUBRAM, the new I/O pins, the new video, and more I'm forgetting.

Forth programming will be more awesome than ever!

C will be C with a bunch of new libraries...because that's what C always is.

Other languages will bloom and flourish or sprout, wither and die.

Should be fun times for all!!!

AntoineDoinel · 2014-04-05 09:55

mindrobots wrote: »

Rich,
C will be C with a bunch of new libraries...because that's what C always is.

Other languages will bloom and flourish or sprout, wither and die.

Should be fun times for all!!!

Cleaning the laptop screen, putting back the moka pot on the burner, and redoing coffee... and it's ALL YOUR FAULT!
LOL

W9GFO · 2014-04-05 10:07

I've been using Spin. My question is more about understanding how to use the chip. Multithreading, hub execution, hardware multitasking... the addition of these things, and being able to use them effectively, makes me wonder if the P2 will be substantially different to understand and use than the P1. Or will these features be invisible to a regular user, being rolled up into new commands?

Put another way, a P1 variant with more pins and cogs would satisfy all my needs in a microcontroller. The P2 is sounding like extreme overkill for things that I do, and I am guessing, most people that use microcontrollers. If there is to be no P1 variant then I will have to use the P2, which may well be an order of magnitude more power (or more) than I will ever use. Won't there be a huge gap between the P1 and the P2 in terms of capabilities and complexity?

mindrobots · 2014-04-05 10:14

AntoineDoinel wrote: »

Cleaning the laptop screen, putting back the moka pot on the burner, and redoing coffee... and it's ALL YOUR FAULT!
LOL

Mea culpa!!

mindrobots · 2014-04-05 10:23

W9GFO wrote: »

I've been using Spin. My question is more about understanding how to use the chip. Multithreading, hub execution, hardware multitasking... the addition of these things, and being able to use them effectively, makes me wonder if the P2 will be substantially different to understand and use than the P1. Or will these features be invisible to a regular user, being rolled up into new commands?

Put another way, a P1 variant with more pins and cogs would satisfy all my needs in a microcontroller. The P2 is sounding like extreme overkill for things that I do, and I am guessing, most people that use microcontrollers. If there is to be no P1 variant then I will have to use the P2, which may well be an order of magnitude more power (or more) than I will ever use. Won't there be a huge gap between the P1 and the P2 in terms of capabilities and complexity?

I would think (strongly hope) that Spin on a Px will look the same until you start to engage new features. What and how it is done under the hood shouldn't make a difference. I have a feeling Chip will be able to wrap Spin around the new architecture in a way that isn't painful to learn and start using. Some features like hub execution may really have no need to be exposed to a Spin user - it's more an implementation issue for the language that could be exploited by some clever person as a trick or tip. The "Multi-" features I think would be exposed for use in Spin. More just a mental abstraction - if you have Multi-COG'd, then you should be able to multi-task and multi-thread to some degree rather easily.

I hope new features are always invisible until you need to use them.

I had started a couple threads for discussion of Spin2 features and implementation but nobody seemed to care about discussing it in separate threads. There is some talk in the BLOG thread that I just haven't clipped out yet to put in the Spin threads.

I'm probably wrong on 1/2 of what I'm saying but these are my best guesses or hopeful opinions.

It would be nice to be able to take a Spin program and carry it from P1 through Px without any significant changes.

potatohead · 2014-04-05 10:38

In PASM, you can treat a P2 a lot like a P1, and it sings, even at 80 MHz!

When you need more capability, you can explore some options, modes, instructions and doing that will pay off for you multiple times.

C is going to be fast, and with some reasonable libraries, will run code you find out there nicely.

In SPIN, P2 SPIN will very likely run as fast as P1 PASM did. Somewhere in that range. Tasks will likely be a part of SPIN and I don't see those as a big deal. However, INLINE PASM will be a part of P2 SPIN, and this means you can do little bits of PASM where you need to and grow your capability, knowledge and chip output a little at a time. This is my favorite capability because it brings PASM to people in the easiest form possible for an assembly language.

Phil Pilgrim (PhiPi) · 2014-04-05 11:07

PASM programming will be a bit more complicated, due to the execution pipeline. In a pipeline, instructions are fetched and begin processing before prior instructions are completed. So you have to be more careful about changing things that affect subsequent instructions, because it might be too late if those instructions have already begun fetching operands and executing.

-Phil

potatohead · 2014-04-05 11:21

Yes, if one wants seriously optimized code, or is writing self-modifying code. There are some cases, like with the delayed instructions, that require one to know a few things.

However, baseline "just write it in PASM" code isn't significantly more difficult than it is with P1. The nice thing is we have pointers and such that very seriously reduce the need to author self-modifying code. Secondly, the additional throughput means a whole lot of things we have to optimize on P1 just to make them possible can be written simply on P2. They will run more slowly than nicely optimized code would, but they still get the task done.

PASM inline with SPIN is going to change things in that people can use SPIN as they would normally, then author small bits of PASM where needed. The benefits of this are difficult to pack down into a short post. Like I do short... lol

A primary benefit is being able to think about just the bit of the problem that PASM helps! Once that bit is done, SPIN holds the framework and that eliminates a lot of the difficulty people have with assembly language in general.

cgracey · 2014-04-05 21:21

Phil Pilgrim (PhiPi) wrote: »

PASM programming will be a bit more complicated, due to the execution pipeline. In a pipeline, instructions are fetched and begin processing before prior instructions are completed. So you have to be more careful about changing things that affect subsequent instructions, because it might be too late if those instructions have already begun fetching operands and executing.

-Phil

Phil, there is data-forwarding in the pipeline to overcome these supposed difficulties. The only thing one must know about the pipeline is that during multitasking there will probably be other tasks' instructions in the pipeline that will space your tasks' instructions apart, slowing them down, maybe erratically if they are doing hub instructions.
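As a rough illustration of what data forwarding means in general (a textbook-style toy model, not a description of the actual P2 pipeline), the sketch below assumes a result reaches the register file one instruction late. Without forwarding, the next instruction reads a stale operand; with forwarding, it takes the value straight from the in-flight result. The function name `run` and the one-instruction write-back delay are assumptions made for the example.

```python
def run(program, regs, forwarding):
    """Execute (dest, src_a, src_b) adds; write-back lags by one instruction."""
    regs = dict(regs)
    in_flight = None  # (dest, value) computed but not yet written back

    def read(reg):
        # With forwarding, an operand that matches the in-flight destination
        # is taken from the pipeline instead of the (stale) register file.
        if forwarding and in_flight is not None and in_flight[0] == reg:
            return in_flight[1]
        return regs[reg]

    for dest, src_a, src_b in program:
        value = read(src_a) + read(src_b)
        if in_flight is not None:          # the older result now lands in the register file
            regs[in_flight[0]] = in_flight[1]
        in_flight = (dest, value)

    if in_flight is not None:              # drain the last pending write
        regs[in_flight[0]] = in_flight[1]
    return regs


program = [("r1", "r0", "r0"),   # r1 = r0 + r0
           ("r2", "r1", "r1")]   # r2 = r1 + r1, depends on the previous result
start = {"r0": 1, "r1": 0, "r2": 0}
with_fwd = run(program, start, forwarding=True)
without_fwd = run(program, start, forwarding=False)
```

In this toy model the dependent instruction only sees the new `r1` when forwarding is on, which is the hazard Phil describes and the remedy Chip mentions.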

Phil Pilgrim (PhiPi) · 2014-04-05 21:27

Chip, thanks, I've never heard of data forwarding. How does it work?

-Phil

cgracey · 2014-04-05 21:32

Phil Pilgrim (PhiPi) wrote: »

Chip, thanks, I've never heard of data forwarding. How does it work?

-Phil

Pretty much like this:

Roy Eltham · 2014-04-06 03:27

For me the optimal fun coding for P2 will be with no multitasking or multithreading. Those two things bring on MANY headaches and late nights of hair ripping out finding bugs. Not to mention that if you want it to be properly deterministic, then it all has to fit in one cog's memory.

What I really want is a P2 without all the multitasking or multithreading, and 8 or even more cogs. For me multitasking and multithreading are the path of despair and destruction. The path to happiness and fun is 8+ cogs each doing their thing with no worries at all about the others. They can just happily, deterministically, hum along doing what they do best. Ah, such bliss...

cgracey · 2014-04-06 03:36

Roy Eltham wrote: »

For me the optimal fun coding for P2 will be with no multitasking or multithreading. Those two things bring on MANY headaches and late nights of hair ripping out finding bugs. Not to mention that if you want it to be properly deterministic, then it all has to fit in one cog's memory.

What I really want is a P2 without all the multitasking or multithreading, and 8 or even more cogs. For me multitasking and multithreading are the path of despair and destruction. The path to happiness and fun is 8+ cogs each doing their thing with no worries at all about the others. They can just happily, deterministically, hum along doing what they do best. Ah, such bliss...

I've actually found multitasking to be pretty worthwhile, as you can have a few separate things going on in the cog at once. I'd say it's pretty fun.

However, there's nothing so deadpan simple and certain as using WAITCNT to resume on a certain clock. If the cog stays simple, it is not a waste of resources to casually do something like that. The problem with Prop2 is that the simple cog got loaded way up with all kinds of things that made it powerful, but not expendable for simple tasks. You're obliged to give it some heavy, ongoing work. It's a little like managing a worker who's on the clock and you want to keep him busy, as long as you're paying lots of money for him. The simple cogs let you pay by the job, not by the minute, which is way more relaxing to deal with.

I just thought of something... rather than put a CORDIC in every cog, making them expensive like the Prop2 cogs are, maybe we could just make a pipelined CORDIC that any cog could make a deposit into and then pick up the results umpteen clocks later. That way, the cog stays clean and doesn't get overloaded. I'm thinking of what I'd miss in a Prop1-based chip that the Prop2 has. CORDIC is really useful for signal processing and would be missed. We could make the divider work this way, too, as well as the big multiplier.

Roy Eltham · 2014-04-06 03:44

I am fine with the chip having the multitasking feature, I know many will love it and use it. I would probably use other people's code that used it. I personally do not like it, and will likely rarely code for it myself. I'd much rather just use more cogs and not worry about other tasks or the limitations of the multitasking stuff, or fitting into less memory.

I think the only place where I would use it is with hubexec, where one cog would run 2 or more tasks that are all hubexec. They would be for things that are not deterministic. They'd probably all be C/C++ code made with propgcc. I'd have one cog doing that, and the rest running PASM drivers, one each.

jmg · 2014-04-06 03:46

Roy Eltham wrote: »

For me the optimal fun coding for P2 will be with no multitasking or multithreading. Those two things bring on MANY headaches and late nights of hair ripping out finding bugs. Not to mention that if you want it to be properly deterministic, then it all has to fit in one cog's memory.

What I really want is a P2 without all the multitasking or multithreading, and 8 or even more cogs. For me multitasking and multithreading are the path of despair and destruction. The path to happiness and fun is 8+ cogs each doing their thing with no worries at all about the others. They can just happily, deterministically, hum along doing what they do best. Ah, such bliss...

In P2 Multitasking is one way to allow SW to share the extensive maths and other COG support, but that maths etc does not really need 8 copies.

I think Chip has given numbers that show the Logic of a P1 COG is much smaller than the (compact) 2 Port COG memory it uses, whilst the Logic of a P2 COG is much larger than the (larger) 4 Port COG memory it uses.
Numbers show a P2 COG uses significant power at 180MHz and Sim Power numbers are not yet available on a P1 COG.

All that suggests a P1 COG is a little underpowered, (by a silicon ratio yardstick) and can be made a little smarter, whilst the number of P2 COGs needs to be limited to keep Area and Power under control.

A P1E could certainly benefit from a P2 Timer block ( true PWM, Quadrature, capture), and the Logic cost of that should have P1 Logic still less than the COG memory.
Power and Area Numbers on a build with that simple change, would be very useful. Chip ?

jmg · 2014-04-06 03:50

cgracey wrote: »

The simple cogs let you pay by the job, not by the minute, which is way more relaxing to deal with.

How long would it take to merge the smarter P2 Timer block ( true PWM, Quadrature, capture), into the Simpler 2 Clock P1E and get area and Power Sim values ?

What about (present) P2 SerDes, as a ratio of the P1 COG memory area ?

cgracey · 2014-04-06 03:54

jmg wrote: »

In P2 Multitasking is one way to allow SW to share the extensive maths and other COG support, but that maths etc does not really need 8 copies.

I think Chip has given numbers that show the Logic of a P1 COG is much smaller than the (compact) 2 Port COG memory it uses, whilst the Logic of a P2 COG is much larger than the (larger) 4 Port COG memory it uses.
Numbers show a P2 COG uses significant power at 180MHz and Sim Power numbers are not yet available on a P1 COG.

All that suggests a P1 COG is a little underpowered, (by a silicon ratio yardstick) and can be made a little smarter, whilst the number of P2 COGs needs to be limited to keep Area and Power under control.

A P1E could certainly benefit from a P2 Timer block ( true PWM, Quadrature, capture), and the Logic cost of that should have P1 Logic still less than the COG memory.
Power and Area Numbers on a build with that simple change, would be very useful. Chip ?

The P2 CTR is almost as big as the whole P1 cog.

Brian Fairchild · 2014-04-06 03:57

cgracey wrote: »

I just thought of something... pipelined CORDIC that any cog could make a deposit into...could make the divider work this way, too, as well as the big multiplier.

Odd. I've just been out for breakfast at the local greasy spoon and had the same idea.

jmg · 2014-04-06 03:58

cgracey wrote: »

The P2 CTR is almost as big as the whole P1 cog.

Do you mean with all the other counter-coupled stuff like sine lookups ?

Surely adding true PWM, Quadrature, capture to a Counter does not add that much logic ?
ie just fixing the limitations of P1, nothing extra.

cgracey · 2014-04-06 03:59

jmg wrote: »

Do you mean with all the other counter-coupled stuff like sine lookups ?

Surely adding true PWM, Quadrature, capture to a Counter does not add that much logic ?
ie just fixing the limitations of P1, nothing extra.

You're right.

jmg · 2014-04-06 04:02

cgracey wrote: »

I just thought of something... rather than put a CORDIC in every cog, making them expensive like the Prop2 cogs are, maybe we could just make a pipelined CORDIC that any cog could make a deposit into and then pick up the results umpteen clocks later. That way, the cog stays clean and doesn't get overloaded. I'm thinking of what I'd miss in a Prop1-based chip that the Prop2 has. CORDIC is really useful for signal processing and would be missed. We could make the divider work this way, too, as well as the big multiplier.

That pathway has been suggested before, and it certainly is a good idea.
A system does not really need 8x that 'Big Maths' resource.
It could either be split off, or included as a single P2 Full_COG, with other smaller cogs for the rest.

cgracey · 2014-04-06 04:07

Brian Fairchild wrote: »

Odd. I've just been out for breakfast at the local greasy spoon and had the same idea.

Entanglement!

If you made a 16-stage CORDIC solver that was shared among 16 cogs, each cog could get an opportunity every 16 clocks to stick something in it during his turn. The hardware cost per cog, aside from the conduit, would be just 1 stage worth of the CORDIC circuitry - three adders. He would have to wait his turn, but the result would come out as soon afterwards as if the CORDIC was his own.

Each cog would need just one set of conduit. The central math server would perform whatever function was requested and return a result in as many stages as there are cogs. CORDIC, MUL, DIV, SQRT, all on the cheap.
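The slot timing described here can be sketched as a toy simulation. It models timing only, not the math itself, and assumes one deposit slot per clock handed out round-robin; the class name `SharedPipeline` and its methods are made up for illustration.

```python
from collections import deque


class SharedPipeline:
    """Timing model of one math pipeline shared round-robin by n_cogs cogs."""

    def __init__(self, n_cogs=16, n_stages=16):
        self.n_cogs = n_cogs
        self.clock = 0
        self.stages = deque([None] * n_stages)  # left = entry, right = exit
        self.pending = {}   # cog -> (operand, clock at which it was requested)
        self.results = []   # (cog, operand, request_clock, done_clock)

    def request(self, cog, operand):
        """A cog asks for a computation; it must wait for its own slot."""
        self.pending[cog] = (operand, self.clock)

    def tick(self):
        """Advance one clock: retire the last stage, admit the slot owner."""
        done = self.stages.pop()
        if done is not None:
            cog, operand, requested = done
            self.results.append((cog, operand, requested, self.clock))
        owner = self.clock % self.n_cogs         # whose turn it is this clock
        job = self.pending.pop(owner, None)
        self.stages.appendleft((owner, *job) if job else None)
        self.clock += 1


pipe = SharedPipeline()
pipe.request(cog=5, operand=123)   # requested at clock 0
for _ in range(40):
    pipe.tick()
results = pipe.results
```

In this run cog 5 waits 5 clocks for its slot and its result leaves the pipeline 16 clocks after the deposit, at clock 21, which is again cog 5's slot.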

jmg · 2014-04-06 04:13

cgracey wrote: »

He would have to wait his turn, but the result would come out as soon afterwards as if the CORDIC was his own.

Now that is sounding very good indeed

cgracey · 2014-04-06 04:23

jmg wrote: »

Now that is sounding very good indeed

It might be good to set the number of stages a few less than the number of cogs, so that you could get the results and start another computation before missing the window. Then, it would be just as fast as if you owned it.
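A back-of-the-envelope sketch of why fewer stages than cogs helps, assuming a cog gets one deposit slot every `n_cogs` clocks and needs some `turnaround` clocks to collect a result and prepare the next request (both the helper and the `turnaround` parameter are hypothetical):

```python
def clocks_per_result(n_cogs, n_stages, turnaround):
    """Clocks between successive deposits by one cog issuing back-to-back requests."""
    ready = n_stages + turnaround          # clocks after a deposit until the next request is ready
    slots_needed = -(-ready // n_cogs)     # ceiling division: slots that pass before it can go in
    return slots_needed * n_cogs


same_depth = clocks_per_result(n_cogs=16, n_stages=16, turnaround=2)
shallower = clocks_per_result(n_cogs=16, n_stages=13, turnaround=2)
```

With 16 stages and a 2-clock turnaround the cog just misses its window and gets one result every 32 clocks; with 13 stages it catches the very next slot and gets one every 16.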

Brian Fairchild · 2014-04-06 04:46

cgracey wrote: »

...set the number of stages a few less than the number of cogs...

Based on the work done on the P2, do you have a feel for how many stages the CORDIC would have?

cgracey · 2014-04-06 04:51

Brian Fairchild wrote: »

Based on the work done on the P2, do you have a feel for how many stages the CORDIC would have?

It does up to 31 iterations in the P2, even 35 for LOG/EXP, but we could pack maybe 2..3 levels of computation per flop'd stage, since the 'shifters' would be hardwired.

This kind of technique would solve the lack of higher math in Prop1 cogs. They could just do a D,S/# instruction that would output the values, then stall the cog until the result came back, writing it into some register.
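For readers unfamiliar with CORDIC, here is a minimal rotation-mode sketch showing why one iteration costs about three adders (the shifts are fixed per stage, so in hardware they are just wiring), plus the stage-count arithmetic for packing 2..3 iterations per flop'd stage. This is the generic textbook algorithm, not Chip's circuit; the 30-bit fixed-point format and all names are assumptions.

```python
import math

ITERATIONS = 31      # iteration count mentioned for the P2 CORDIC
FRAC_BITS = 30       # assumed fixed-point fraction width
ONE = 1 << FRAC_BITS

# Per-iteration rotation angles atan(2^-i), as fixed-point constants.
ANGLES = [round(math.atan(2.0 ** -i) * ONE) for i in range(ITERATIONS)]

# CORDIC gain: each micro-rotation stretches the vector by sqrt(1 + 2^-2i).
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_stage(x, y, z, i):
    """One iteration: three add/subtracts and two fixed shifts by i."""
    if z >= 0:
        return x - (y >> i), y + (x >> i), z - ANGLES[i]
    return x + (y >> i), y - (x >> i), z + ANGLES[i]


def cordic_sin_cos(theta):
    """Return (sin, cos) for theta in radians, |theta| below about 1.74."""
    x = round(ONE / GAIN)   # pre-scale so the gain cancels out
    y = 0
    z = round(theta * ONE)
    for i in range(ITERATIONS):
        x, y, z = cordic_stage(x, y, z, i)
    return y / ONE, x / ONE


def flopped_stages(iterations, levels_per_stage):
    """Pipeline registers needed if several iterations share one stage."""
    return -(-iterations // levels_per_stage)   # ceiling division


sin_half, cos_half = cordic_sin_cos(0.5)
stages_2 = flopped_stages(ITERATIONS, 2)
stages_3 = flopped_stages(ITERATIONS, 3)
```

Under these assumptions, 31 iterations packed 2 or 3 per stage need 16 or 11 flop'd stages respectively, which is in the same range as the cog counts being discussed.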

Brian Fairchild · 2014-04-06 04:59

cgracey wrote: »

This kind of technique would solve the lack of higher math in Prop1 cogs.

A microcontroller with higher math. Nice. What order code do I email to sales to buy some?

Baggers · 2014-04-06 05:03

Just throwing this in here!

P1 caught my attention, because it was fun to program, and very easy to re-use objects, carefree throwing a new cog at a device/routine

P2 will have to have ALL its objects intermingle with each other, due to threading, and also lots of locking around specific tasks that need unsharable resources, throwing out all the deterministic behaviour we're used to, let alone running at 25 MIPS when using 4 threads all at full speed (yes, I know not all threads will need full 1/4 speed usage).

P1+ will all be running at 100 MIPS. Yes, HUB will round it down a bit, but you don't always get data from HUB, plus there will be 16 cogs. It will keep all the determinism; objects will be usable anywhere, without worrying if they'll affect anything, and each will use all its own cog-ram happily and freely.

Don't get me wrong, I'm not dissing the awesomeness of the P2, it truly is an amazing achievement, let alone Chip's Pride and Joy! I say this with a heavy heart, but I think having 4 cogs will be too restricting, not just for my own usage, but for a lot of people's requirements. Not only that, but forcing pretty much every application to intermingle code on a P2 is going to make it less fun to code, as not everyone can think like that. The original P1 made programming in parallel an awesome fun experience, without all the hassle of what's really involved in doing so on normal systems.

Oh and adding the single Maths to the P1+ is a great idea with slotting it!

Bill Henning · 2014-04-06 09:28

That sounds very good.

Basically, they become hub operations, and only that hardware is pipelined.

Nice!

cgracey wrote: »

Entanglement!

If you made a 16-stage CORDIC solver that was shared among 16 cogs, each cog could get an opportunity every 16 clocks to stick something in it during his turn. The hardware cost per cog, aside from the conduit, would be just 1 stage worth of the CORDIC circuitry - three adders. He would have to wait his turn, but the result would come out as soon afterwards as if the CORDIC was his own.

Each cog would need just one set of conduit. The central math server would perform whatever function was requested and return a result in as many stages as there are cogs. CORDIC, MUL, DIV, SQRT, all on the cheap.

dMajo · 2014-04-06 14:52

cgracey wrote: »

I've actually found multitasking to be pretty worthwhile, as you can have a few separate things going on in the cog at once. I'd say it's pretty fun.

However, there's nothing so deadpan simple and certain as using WAITCNT to resume on a certain clock. If the cog stays simple, it is not a waste of resources to casually do something like that. The problem with Prop2 is that the simple cog got loaded way up with all kinds of things that made it powerful, but not expendable for simple tasks. You're obliged to give it some heavy, ongoing work. It's a little like managing a worker who's on the clock and you want to keep him busy, as long as you're paying lots of money for him. The simple cogs let you pay by the job, not by the minute, which is way more relaxing to deal with.

I just thought of something... rather than put a CORDIC in every cog, making them expensive like the Prop2 cogs are, maybe we could just make a pipelined CORDIC that any cog could make a deposit into and then pick up the results umpteen clocks later. That way, the cog stays clean and doesn't get overloaded. I'm thinking of what I'd miss in a Prop1-based chip that the Prop2 has. CORDIC is really useful for signal processing and would be missed. We could make the divider work this way, too, as well as the big multiplier.

If you go to a 4-cog P2, perhaps it is worthwhile to extend the cog RAM. You can have 2048 longs, with one bank of 512 dedicated to each task, no register remapping, and each task starting execution from register 0 of its bank. If a task is not used then its bank is wasted. This is like having 16 slower P2-capable cogs, like a P1E with 16 cogs, but more capable ones.
Perhaps in each 512 bank the higher 32 longs can be common, some of them used for in/out/dir, counter handling ... the remaining for inter-task comms. The PC can roll over at 512-32, thus making the higher longs (the common ones) data only.
Your idea of a common CORDIC can be developed in the same way for tasks, so in the end you will have 16 virtual cogs, with one CORDIC for each group of four.
Cog start/stop/init can really act on tasks. I mean the default behavior is 4 tasks enabled, and if not initialized they are halted; task 1 will still receive 1/4 of the clock pulses, unless the task scheduler is changed from its default setting.
Perhaps each task can have its own hub window, and the task scheduler, besides allocating task slots, can also allocate the hub windows, so if one cog has only one task running at full speed it will also have 4 hub windows. This presumes 16 hub windows for 4 cogs.
Hubexec can take two tasks' resources. I mean it normally uses one block of 512, while the other 512 can be used as cache; it will have at least 1/2 of the cog clocks and 2 hub windows. Max 2 hubexec per cog.
This can be a solution for a 4-cog P2, which with virtual cogs can reuse most of the obex object concepts.
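The banking scheme above can be sketched as an address mapping. Where the 32 common longs physically live is not stated, so this sketch assumes they sit at the top of bank 0; that placement and the helper names are guesses for illustration only.

```python
BANK_SIZE = 512                  # longs per task bank
N_TASKS = 4                      # tasks per cog, 4 * 512 = 2048 longs
COMMON = 32                      # shared longs at the top of each task's view
PRIVATE = BANK_SIZE - COMMON     # 480 private longs; the PC wraps here


def physical_address(task, reg):
    """Map a task's register number (0..511) to a long in the 2048-long cog RAM."""
    if not (0 <= task < N_TASKS and 0 <= reg < BANK_SIZE):
        raise ValueError("task or register out of range")
    if reg >= PRIVATE:
        return reg                       # ASSUMPTION: common longs live in bank 0
    return task * BANK_SIZE + reg        # private longs live in the task's own bank


def next_pc(pc):
    """The PC rolls over at 512-32, so the common longs are data only."""
    return (pc + 1) % PRIVATE


task2_start = physical_address(2, 0)
shared_a = physical_address(0, 500)
shared_b = physical_address(3, 500)
```

Every task sees the same long at register 500, while register 0 of each task lands in a different bank, which is the "no register remapping" property described above.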

Sometimes when you have something done and later need to add things, you start adding upper layers and making tweaks to somehow reuse the bottom ones. This can be good because you've tested the results and learned the possibilities, but the whole thing can become messy.
Sometimes restarting the development from the ground up can bring you to the same results (now you know what you want and where the pitfalls were) with a different, more clever, more suitable solution, avoiding compromises and patches to change something that wasn't designed from the beginning to work this way.
