Is LUT sharing between adjacent cogs very important?

cgracey · 2016-05-13 17:32

I just did a big -A9 compile last night and the byte/word/long data-node modes in the smart pins made no logic increase that I can see. It was just a matter of a few gates of steering logic.

Heater. · 2016-05-13 17:41

Seairth,

I think that was a joke...

Why?

COGNEW cannot restart the COG that is calling it with new code.

When the P1 starts up you end up with a COG running the Spin interpreter that starts your Spin code.

If you want to program in C, using Zog or prop-gcc or Catalina or whatever, you want to get your C system started and throw away the redundant Spin COG as soon as possible.

To be honest I don't remember if I just started a new COG running Zog and let the old one die, or if I actually restarted the original COG with Zog.

What does prop-gcc do nowadays?

Of course things are a bit different with the P2 which has no Spin interpreter built in.

Heater. · 2016-05-13 17:44

David,

Yup, there is no way to solve the general case. The dual-COG drivers would have to be handled as an exception.

Quite so.

As soon as the gurus start writing drivers and other objects that require "dual-COG" the "exception" infects the entire Propeller object world. Soon there is no exception, only complexity.

Cluso99 · 2016-05-13 18:02

Heater. wrote: »

David,

Yup, there is no way to solve the general case. The dual-COG drivers would have to be handled as an exception.

Quite so.

As soon as the gurus start writing drivers and other objects that require "dual-COG" the "exception" infects the entire Propeller object world. Soon there is no exception, only complexity.

Heater,
What are you saying...

So gurus best not use the P1 or P2 ???

Maybe it is some of them that sold P1's in enough volume to support the P2 design ???

BTW I showed a perfectly legitimate way to use COGNEW to get pairs or otherwise sets of cogs. Some method or variation is already required to minimise latency if we transfer via the egg beater.

The caveats seem to be much more substantial in using smart pins and also the smart pin signalling and transfers between cogs than using shared LUT.

Heater. · 2016-05-13 18:35

Cluso,

So gurus best not use the P1 or P2 ???

A very good question.

I notice in your posts about this that you make a distinction between "hobbyist" users, who need an easy life, and "professionals" who have the skill and time to do trick stuff and craft every detail of their project the way they like it. I think there is no such distinction.

On the one hand hobbyists can be very skilful and dedicated to what they do. Producing ingenious works of art.

On the other hand professionals can be incompetent twats who just want to get a job done ASAP.

On top of that I see no reason why a casual programmer, who does not know much about anything, can't expect to be using code produced by gurus. Without having to understand how it all works.

This leads me to believe that we should level the playing field, such that gurus can get trick stuff done, without making life complicated for casual users/programmers.

The fact that the P1 does not have the complication of interrupts is a fine example of this approach.

Maybe it is some of them that sold P1's in enough volume to support the P2 design ???

No idea. Do you have any figures on that? I could be wrong but I suspect the P2 effort is funded more from everything else Parallax does than P1 sales.

BTW I showed a perfectly legitimate way to use COGNEW to get pairs or otherwise sets of cogs.

Perhaps I missed that. As far as I can work out it is not possible in the general case. There will be race conditions that bust it.

I think your opinion is that "the general case" does not matter.
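The race being argued about here can be pictured with a toy model. This is only a sketch under stated assumptions: it assumes a COGNEW-style request starts the lowest-numbered free cog, and the check-then-start pairing scheme and the names `cognew` and `find_free_pair` are hypothetical, not the method shown earlier in the thread.

```python
# Toy model of a 16-cog chip. Assumption: a COGNEW-style request starts the
# lowest-numbered free cog. The pairing scheme below is hypothetical.
NUM_COGS = 16


def cognew(busy):
    """Start the lowest-numbered free cog; return its id, or -1 if none is free."""
    for cog in range(NUM_COGS):
        if not busy[cog]:
            busy[cog] = True
            return cog
    return -1


def find_free_pair(busy):
    """Return the first cog n where n and n+1 are both free, or -1."""
    for cog in range(NUM_COGS - 1):
        if not busy[cog] and not busy[cog + 1]:
            return cog
    return -1


busy = [False] * NUM_COGS
busy[0] = True   # cog 0 runs the main program
busy[3] = True   # cog 3 already runs some other driver

# Driver A checks for an adjacent pair and sees cogs 1 and 2 free.
pair_base = find_free_pair(busy)

# Before A acts on that, unrelated code B starts a cog of its own.
b_cog = cognew(busy)

# A now starts its two cogs, still believing cogs 1 and 2 are available.
a_first = cognew(busy)
a_second = cognew(busy)
got_adjacent_pair = (a_second == a_first + 1)
```

In this interleaving B takes cog 1, so A ends up with cogs 2 and 4, which are not adjacent. Unless the check and the two starts happen as one atomic step, the check tells A nothing it can rely on.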

David Betz · 2016-05-13 18:47

Heater. wrote: »

Cluso,

So gurus best not use the P1 or P2 ???

A very good question.

I notice in your posts about this that you make a distinction between "hobbyist" users, who need an easy life, and "professionals" who have the skill and time to do trick stuff and craft every detail of their project the way they like it. I think there is no such distinction.

On the one hand hobbyists can be very skilful and dedicated to what they do. Producing ingenious works of art.

On the other hand professionals can be incompetent twats who just want to get a job done ASAP.

On top of that I see no reason why a casual programmer, who does not know much about anything, can't expect to be using code produced by gurus. Without having to understand how it all works.

This leads me to believe that we should level the playing field, such that gurus can get trick stuff done, without making life complicated for casual users/programmers.

The fact that the P1 does not have the complication of interrupts is a fine example of this approach.

Maybe it is some of them that sold P1's in enough volume to support the P2 design ???

No idea. Do you have any figures on that? I could be wrong but I suspect the P2 effort is funded more from everything else Parallax does than P1 sales.

BTW I showed a perfectly legitimate way to use COGNEW to get pairs or otherwise sets of cogs.

Perhaps I missed that. As far as I can work out it is not possible in the general case. There will be race conditions that bust it.

I think your opinion is that "the general case" does not matter.

But COGNEW2 does handle the general case, except when you fragment your COG space so much that you can't get two adjacent COGs. That same sort of thing could happen if you use a memory allocator that fragments memory and your use case is extreme. There are going to be limits no matter what we do. Why impose this artificial one? Or maybe we should require that malloc on the P2 always return 1024 bytes and it is up to the user to manage larger buffers in 1K chunks...

Heater. · 2016-05-13 19:07

David,

But COGNEW2 does handle the general case...

I guess it would, if there was one.

But what about COGNEW3, or COGNEW4, for times when you want to get really weird?

Or maybe we should require that malloc....

Well there is a good point.

Maybe there should be no COGNEW in hardware. Maybe there should be a COGALLOC(numberOfCogs) that gets you a sequence of COGS. Like malloc() gets you a sequential bunch of memory.

But that requires an operating system, or at least run-time support in the programming language, to manage the COG resources and take care of atomicity etc.

Then you are into the mysterious world of things failing at random when resources cannot be found. Like we have in Windows or Linux.
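A minimal sketch of what such a COGALLOC(numberOfCogs) might look like, assuming the free cogs are tracked in a 16-bit mask. The function name `cog_alloc`, the mask convention (bit n set means cog n is free) and the return values are illustrative assumptions, and the sketch ignores the atomicity problem entirely.

```python
NUM_COGS = 16


def cog_alloc(free_mask, count):
    """Find `count` consecutive free cogs in a 16-bit mask (bit n set = cog n free).

    Returns (base_cog, new_free_mask), or (-1, free_mask) when no such run exists.
    """
    if count < 1 or count > NUM_COGS:
        return -1, free_mask
    run = (1 << count) - 1                      # `count` one-bits
    for base in range(NUM_COGS - count + 1):
        window = run << base                    # candidate run starting at `base`
        if (free_mask & window) == window:      # every cog in the window is free
            return base, free_mask & ~window    # claim them
    return -1, free_mask


# All 16 cogs free: a pair is found at cog 0.
pair_on_empty_chip = cog_alloc(0xFFFF, 2)

# Eight cogs free, but only every second one: no adjacent pair exists.
pair_on_fragmented_chip = cog_alloc(0x5555, 2)
```

The second call is the fragmentation case: half the cogs are free, yet the request for two adjacent ones fails, which is the "resources cannot be found" situation described above.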

jmg · 2016-05-13 19:16

cgracey wrote: »

I just did a big -A9 compile last night and the byte/word/long data-node modes in the smart pins made no logic increase that I can see. It was just a matter of a few gates of steering logic.

Cool. Sounds well worth doing, classically flexible & at close to zero cost !!

cgracey wrote: »

The cog that wants to get others' attention just does 'COGATN mask16'. It's a 2-clock instruction that generates a 1-clock strobe to all other cogs in the mask. Each cog has an ATN event that will register the strobe. There's no data involved. The cogs have to know why they are being strobed for attention.
The 'SETPINX byte,pin' instruction only takes 4 clocks and sends a byte.

...

cgracey wrote: »

The 'COGATN mask16' instruction can set the attention flag in any number of cogs at once, without using up their edge detectors. Maybe what is needed, instead, is simply a second edge event. That would be universally useful.

Maybe some more people could comment on this. I'm too tired to trust my own judgment at the moment.

I'm not sure exactly what you are asking, but I have always thought simply expanding the attention flags to 32 would be useful.
That gives you two edge events per COG, if you wanted it ?
- and, it gives a limited data path, for those using single flags.
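For readers trying to picture the `mask16` operand quoted above, here is a small software model, not the hardware. It assumes one bit per cog and one attention flag per cog; the helper names `atn_mask` and `cogatn` are made up for this sketch.

```python
NUM_COGS = 16


def atn_mask(*cogs):
    """Build a 16-bit mask with one bit set per target cog."""
    mask = 0
    for cog in cogs:
        if not 0 <= cog < NUM_COGS:
            raise ValueError(f"cog id out of range: {cog}")
        mask |= 1 << cog
    return mask


def cogatn(atn_flags, mask):
    """Model the strobe: return new flags with ATN set for every cog in `mask`."""
    return [flag or bool((mask >> cog) & 1) for cog, flag in enumerate(atn_flags)]


flags = [False] * NUM_COGS
flags = cogatn(flags, atn_mask(1, 5))   # strobe cogs 1 and 5 in one go
```

As the quoted text says, no data travels with the strobe. The model carries one bit per cog and nothing else, so the receivers have to know in advance what the attention means.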

Rayman · 2016-05-13 19:23

Was half-joking about getting rid of cogid.
I think best practice should be to make code that doesn't need it...

Maybe we need a new version of Cognew that reloads the current cog...
Say CogReload or something...

jmg · 2016-05-13 19:26

Rayman wrote: »

I like the data via smartpin mode.
I assume it doesn't take much logic to add that.

Only down side there is that it uses a pin.
Can that pin be used for anything else?

Can it be an output only pin maybe?

I think you said it couldn't be a DAC output pin too, right?

It clearly cannot be a DAC pin, but I think it also cannot be any This-Pin-Cell feature, as the mode is fixed, and the Pin Cell is active and enabled.
That does do many things to the normal pin access.

Perhaps a list of what is left is needed ?

Someone mentioned when in Pin-Cell mode, the only READ path is via an adjacent Pin Cell (which implies mode caveats ?)
Write paths are usually used for trigger ?

One obvious location for a few Pin-Cell links is under the SPI boot pins.
If that is done, is the SPI still accessible in normal COG code ?
( this stuff is why I asked about bumping to > 64 in address fields, above real pins)

jmg · 2016-05-13 19:31

Heater. wrote: »

David,

But COGNEW2 does handle the general case...

I guess it would, if there was one.

But what about COGNEW3, or COGNEW4, for times when you want to get really weird?

From earlier discussions, it needs to be COGNEW {N}, but once you have > 1, N is going to be almost natural.

David Betz wrote: »

But COGNEW2 does handle the general case, except when you fragment your COG space so much that you can't get two adjacent COGs. That same sort of thing could happen if you use a memory allocator that fragments memory and your use case is extreme. There are going to be limits no matter what we do.

Very true, and I think anyone who is doing COG gymnastics is already going to have a COG Software Manager; otherwise, what stops them asking for a 17th COG, among all those 'balls in the air'? There are already finite limits.

Heater. · 2016-05-13 19:49

jmg,

Exactly.

You know there is this strange thing going on in Linux/Unix/Windows land. If your program needs a gigabyte of memory you might start out by doing a malloc(oneGigabyte), which might return "OK" and a pointer to 1 gigabyte of memory. But then as you run you might find your program crashes. Out of memory.

Why?

Because malloc() only reserves one gigabyte of "virtual" memory space. Not until you actually try to use it do you find it does not actually exist. Some other processes may have occupied that space in the meantime.

I don't want to see this kind of "wobblyness" come to Propeller land.
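The behaviour described here is usually called overcommit. A toy model, with made-up class and method names and no connection to any real operating system API, might look like this:

```python
class OvercommitModel:
    """Toy model: reservations always succeed; physical pages are claimed on first touch."""

    def __init__(self, physical_pages):
        self.free_pages = physical_pages
        self.touched = set()

    def reserve(self, pages):
        # Only address space is handed out; nothing physical is claimed yet.
        return range(pages)

    def touch(self, owner, page):
        if (owner, page) in self.touched:
            return                      # already backed by a physical page
        if self.free_pages == 0:
            raise MemoryError("no physical page left to back this access")
        self.free_pages -= 1
        self.touched.add((owner, page))


mem = OvercommitModel(physical_pages=4)
a = mem.reserve(4)          # "OK"
b = mem.reserve(4)          # also "OK", although only 4 pages exist in total
for page in b[:3]:
    mem.touch("B", page)    # process B uses three pages first
mem.touch("A", a[0])        # process A gets the last physical page
try:
    mem.touch("A", a[1])    # A's second access has nothing to back it
    crashed = False
except MemoryError:
    crashed = True
```

Both reservations report success, and the failure only shows up later, at a point that depends on what the other process happened to do in between.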

rod1963 · 2016-05-13 20:02

Can someone please tell me why we need LUT sharing, as so far all the gurus wanting it are quite vague on their applications for it?

Is it because the Cogs don't have the processing power of a MIPS32 or ARM M4 core, and they figure that if they tag team a pair they can do more, or have it try to punch above its weight in video applications?

I don't get it.

Seairth · 2016-05-13 20:05

Heater. wrote: »

Seairth,

I think that was a joke...

Why?

No, I mean that I think @rayman was joking about getting rid of COGID.

Edit: okay. So he was half joking.

jmg · 2016-05-13 20:14

rod1963 wrote: »

Can someone please tell me why we need LUT sharing, as so far all the gurus wanting it are quite vague on their applications for it?

The issue is as fundamental as it is simple : message latency.

That has a great many applications: any time you need more than one COG, and thus have to pass data to another, latency matters.

With 16 COGS, passing data & signals between them, is going to be very common.

Signal Delay issues have just been fixed, that leaves Data ...

You can use the HUB, but that is slow and jittery (see numbers above).
There are times when that is not going to be good enough.
For slower applications, HUB will be fine, but P2 is not an inherently slow device.
If you can collect and send large Blocks, HUB will probably be ok.

The new DAC-Data mode helps, (and it comes for almost free, what is not to like!) but that is still not as fast as Dual Port LUT.

The total end-end delay numbers for DAC-Data are still coming. Around 10+ clocks ?

Heater. · 2016-05-13 20:17

rod1963,

Is it because the Cogs don't have the processing power of a MIPS32 or ARM M4 core, and they figure that if they tag team a pair they can do more, or have it try to punch above its weight in video applications?

I think you get it.

A P2 COG will never have the processing power of a common MIPS or ARM core.

A P2 COG will never handle USB, Ethernet, etc. the way the dedicated hardware you find on such MIPS and ARM SoCs does.

But a P2 has 16 cores, so the dream is to be able to stitch them together to achieve similar performance in some cases.

Perhaps.

Personally I think it's like trying to enhance a 555 timer chip into an ATMEL tinyAVR. Sure you could do that but then you don't have the simplicity of the 555 for those situations where you need it.

jmg · 2016-05-13 20:23

Heater. wrote: »

Personally I think it's like trying to enhance a 555 timer chip into an ATMEL tinyAVR. Sure you could do that but then you don't have the simplicity of the 555 for those situations where you need it.

Thanks for the chuckle of the day.
A P2 is about as far removed from a 555, as I can imagine !

David Betz · 2016-05-13 20:34

jmg wrote: »

Heater. wrote: »

Personally I think it's like trying to enhance a 555 timer chip into an ATMEL tinyAVR. Sure you could do that but then you don't have the simplicity of the 555 for those situations where you need it.

Thanks for the chuckle of the day.
A P2 is about as far removed from a 555, as I can imagine !

In that case we should remove most of the P2 features so they don't complicate the task of using it as a 555 replacement!

kwinn · 2016-05-13 20:40

I am not in favor of imposing limits on what the P2 can do for the sake of protecting programming newcomers and/or hobbyists, and that seems to be what most if not all the arguments against having the LUT shared between adjacent cogs amount to. It is the programmer's job to write good working code that takes things like this into account.

No need to change from the current cognew or coginit either, if all one has to do is start the cogs that need to be adjacent first. In fact, since the P2 has more memory and more cogs to work with, the main code routine probably should be responsible for allocating and tracking cogs, memory, and assigned tasks.

Heater. · 2016-05-13 20:41

jmg,

A P2 is about as far removed from a 555, as I can imagine !

Yes, yes. It was an analogy.

Should we bend a 555 into a tinyAVR and throw away the beauty and utility of the 555?

Should we bend a Propeller into a MIPS/ARM SoC and throw away the beauty and utility of the Propeller?

Let's see what happens....

jmg · 2016-05-13 20:54

Heater. wrote: »

Personally I think it's like trying to enhance a 555 timer chip into an ATMEL tinyAVR. Sure you could do that but then you don't have the simplicity of the 555 for those situations where you need it.

There are lessons to be learned from the small MCU market, where it is already happening: small MCUs do displace simpler parts, which also have poor specs.

When you can buy an EFM8BB1 for 31c/1k, 12b ADC, why would anyone design in a more expensive, less flexible, ADC (82c/1k @ 12b) ?

As for that 555 you mentioned, let's see : 10mA (ouch), up to 10% (ouch again), SO8 (hmmm)
Single component, 2% timing ? Ah, Nope.

Seems that 'simplicity' is more of an illusion, as it comes with many caveats.

Does anyone use a 555 in a new commercial design ?

*wait* I find a Microchip 555 variant that is smaller and 350uA, - oh, it is 60c/3k
guess that underlines my point.

jmg · 2016-05-13 20:55

David Betz wrote: »

In that case we should remove most of the P2 features so they don't complicate the task of using it as a 555 replacement!

hehe yes, or maybe it needs two P2 Manuals, one for those 555 users still out there ?

Rayman · 2016-05-13 21:06

Hey I'm about to use a 555 for something. You don't always want an mcu

jmg · 2016-05-13 21:10

Rayman wrote: »

Hey I'm about to use a 555 for something. You don't always want an mcu

In a commercial design ? Which one did you choose ?

jmg · 2016-05-13 21:12

Heater. wrote: »

Should we bend a 555 into a tinyAVR and throw away the beauty and utility of the 555?

Should we bend a Propeller into a MIPS/ARM SoC and throw away the beauty and utility of the Propeller?

I think your metaphor bending has gone way too far here...

Heater. · 2016-05-13 21:12

Grrr....

My 555 timer analogy is being taken all wrong here. As analogies often are.

Sure new improved 555's are produced, lower power, higher frequency, quad packages, etc. But the essence of the idea is still there.

But hey, 555 timers still sell by the billions: https://en.wikipedia.org/wiki/555_timer_IC

potatohead · 2016-05-13 21:14

For what it's worth, many video related things that make the shared LUT attractive can be done on multiple COGS anyway. Signaling and the DAC mode to move bits of data at low latency are most important. Screens can be divided into zones, draw schemes can be used, data prefetched, etc...

I personally would use the feature, but am not at all needing it. If it's best we leave it out, fine by me.

And it looks like it's out now, sans a meaningful solution to the cog race problem.

Just being clear here. I'm not pushing the feature at all, for any reason.

Much prefer we get real chips!

Heater. · 2016-05-13 21:19

jmg,

I think your metaphor bending has gone way too far here...

Perhaps.

At the end of the day we have to wait and see what Chip ends up with.

The P1 and Spin/PASM and the Propeller Tool indicate to me that he prefers a simpler user experience.

Edit: Ick, did I say "user experience"?

jmg · 2016-05-13 21:25

Heater. wrote: »

Grrr....

My 555 timer analogy is being taken all wrong here. As analogies often are.

It was too much to resist

Heater. wrote: »

But hey, 555 timers still sell by the billions: https://en.wikipedia.org/wiki/555_timer_IC

Maybe w-a-y back in 2003 ? - but today, I rather doubt it.

A quick check at Digikey shows even the 'better monostable' in Tiny Logic of 1G123 has higher single-part code stock levels than the 555.
There is a whole page of MCU part numbers, with higher stock numbers than 555, including a few SOT23 ones...

rjo__ · 2016-05-13 21:28

kwinn wrote: »

I am not in favor of imposing limits on what the P2 can do for the sake of protecting programming newcomers and/or hobbyists, and that seems to be what most if not all the arguments against having the LUT shared between adjacent cogs amount to. It is the programmer's job to write good working code that takes things like this into account.

Agreed.
This is how I currently understand things:

I still think people are mixing apples and oranges. My impression is that "LUT sharing" is pretty much out of the running... not only because of the race conditions that Heater mentions but also because when Chip dumped a version of it into Quartus, Quartus rebelled... possibly because no-one at Altera ever thought that someone would try to do what Chip tried to do :)

What Chip has done is to implement fast signals (strobes) from cog to cog. It looks more flexible than I thought would be possible at this late stage. I have looked but I don't see the bit depth of the messaging system...?
