Is LUT sharing between adjacent cogs very important?

Heater. · 2016-05-13 21:46

jmg,

Maybe w-a-y back in 2003 ? - but today, I rather doubt it.

I know, those are old figures now. I have no idea what the situation is today.

But hey, the 555 had a very good run don't you think?

That does not change the fact that we can still make a monostable or bistable, etc., more quickly with a 555 than by pulling up an IDE, learning to program, and getting a tinyAVR or PIC to do it. They are different things.

When I want chips for lunch I don't want the hassle of booking a table at a 5 star restaurant, negotiating with the waiter, tasting the wine etc, etc. Just give me chips now!

Oops, another analogy...

Oh yeah, chips. Chip, I want chips...now!

jmg · 2016-05-13 21:48

rjo__ wrote: »

What Chip has done is to implement fast signals (strobes) from cog to cog. It looks more flexible than I thought would be possible at this late stage. I have looked but I don't see the bit depth of the messaging system...?

There are choices of 8/16/32b, with increasing clock counts in the new DAC-Data path.

The total end-to-end time is still tbf, and may come in a little lower than the simple sum of PINSET and PINGET ?
Handshake details are also tbf, so I think some use cases are needed to shake out the actual operational numbers.

Then we can better compare where this sits between LUT sharing, and HUB timings.

Rayman · 2016-05-13 21:51

Digikey has about half a million 555 timer chips in stock. Seems like a lot...

dMajo · 2016-05-13 22:01

jmg wrote: »

cgracey wrote: »

If all cogs waited for attention (WAITATN) and then did a PINGETZ, they would all get the same data. They would be in identical phase.

I think this can also be used for a 'collected printf' style debug reporting ?

In this, a master COG would manage PC link, (USB or UART) and it would watch signals from multiple slaves, which would use an available pin-channel to send data.

This would be a Wait on one of N signals, then select that Pin cell for GET.

For highest speed (lowest impact), one channel per COG would be used, but for the least resources consumed, a queued method could be OK.

If another pin has Pin Cell connect, do other slaves get a 'busy' flag, so they can wait?
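The "wait on one of N signals, then GET" idea can be modelled in a few lines of Python. This is only a toy sketch: `Channel`, `put`, `get` and `service_once` are hypothetical names, and returning "busy" when a value is still unread is one possible answer to the question above, not confirmed P2 behaviour.

```python
class Channel:
    """Toy model of one pin-cell mailbox: a data slot plus a 'full' flag."""

    def __init__(self):
        self.full = False
        self.data = None

    def put(self, value):
        """Slave side: returns False ('busy') if the previous value is unread."""
        if self.full:
            return False
        self.data = value
        self.full = True
        return True

    def get(self):
        """Master side: read the value and clear the flag."""
        value = self.data
        self.data = None
        self.full = False
        return value


def service_once(channels):
    """Master side: find the first signalled channel and read it.

    Returns (channel_index, value), or None if nothing is pending.
    """
    for index, channel in enumerate(channels):
        if channel.full:
            return index, channel.get()
    return None


channels = [Channel() for _ in range(3)]   # assumption: one channel per slave cog
channels[2].put("cog 2: hello")
print(service_once(channels))
```

With one channel per slave, a slave only ever blocks on its own unread message; a shared, queued channel would save pins at the cost of slaves waiting on each other.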

Cluso99 wrote: »

My preference is to have shared LUT, whether or not we get another way using smart pins etc.

Since shared LUT has been done already, could we get a set of fpga code to test?

Yes, LUT sharing will have the highest bandwidth, and so complements the Pin-cell connect, which is intermediate in speed, but very general in use.

There is another thing to consider: sync. When managing comms/protocols with the external world/field, the cog must be synced with the pin; if it writes to the hub, it needs to sync with the hub; and now, using smart pins, there is yet another source to sync to.

From my understanding, using the shared LUT, cog1 can remain in sync with pin I/O while dealing with the LUT. At the same time, cog2 can maintain its sync with the hub while interacting with cog1 through the LUT. Or two cogs can each be synced to its own I/O pin (not necessarily in phase) while interacting with each other through the LUT, without either losing its sync.
How much simpler it is to write code without always worrying about sync.

BTW: @cgracey, what happens presently if, let's say, two or three cogs are synced (waiting on a pin or the system counter) and all of them execute cognew on exactly the same clock? Will all of them try to start the same next available cog? Or will the cognews be executed sequentially, based on the cogid precedence of the executing cogs?

jmg · 2016-05-13 22:03

Heater. wrote: »

That does not change the fact that we can still make a monostable or bistable, etc., more quickly with a 555 than by pulling up an IDE, learning to program, and getting a tinyAVR or PIC to do it. They are different things.

Sure, you can get one working quickly (assuming you already have the values in stock).
- but if you need lower power, better precision and lower parts count, for higher volume use, sometimes you do need to apply some engineering design.
Simple can be an illusion.

jmg · 2016-05-13 22:09

dMajo wrote: »

There is another thing to consider: sync. When managing comms/protocols with the external world/field, the cog must be synced with the pin; if it writes to the hub, it needs to sync with the hub; and now, using smart pins, there is yet another source to sync to.

From my understanding, using the shared LUT, cog1 can remain in sync with pin I/O while dealing with the LUT. At the same time, cog2 can maintain its sync with the hub while interacting with cog1 through the LUT. Or two cogs can each be synced to its own I/O pin (not necessarily in phase) while interacting with each other through the LUT, without either losing its sync.
How much simpler it is to write code without always worrying about sync.

Yes, certainly, avoiding HUB jitter is going to make many things simpler, and in some cases, even possible.

dMajo · 2016-05-13 22:20

David Betz wrote: »

Bill Henning wrote: »

(I have not read the whole thread, so this may have already been suggested)

what about a

COGNEW2 wc

that starts a cog pair from the same cog image, with same par?

The code in the cog could then use the lowest bit of its cogid (ideally placed in the carry flag) to run odd/even cog code...

Didn't I just say that? :-)

Yes, you and many others, but always slightly different.
I like Bill's idea a lot. It's very simple and elegant. COGNEW2 just starts two cogs, without any register, mask, or other requirements. Just the next two cogs. Then whether a cog is the odd or the even one determines which code it executes. If the code space/image is not enough for both cogs' operation, one of them can coginit itself with different code.
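A rough Python model of the idea, to make the odd/even split concrete. Nothing here is real P2 behaviour: `find_adjacent_pair` and `role` are hypothetical names, and the sketch assumes the pair never wraps around from cog 15 to cog 0.

```python
NUM_COGS = 16  # the thread mentions 16 cogs


def find_adjacent_pair(in_use):
    """Return the first (n, n + 1) with both cogs free, or None.

    Assumption: no wrap-around from cog 15 to cog 0.
    """
    for first in range(NUM_COGS - 1):
        if not in_use[first] and not in_use[first + 1]:
            return first, first + 1
    return None


def role(cog_id):
    """Pick the code path from the lowest bit of the cog id."""
    return "odd" if cog_id & 1 else "even"


in_use = [False] * NUM_COGS
in_use[0] = True   # e.g. cog 0 runs the main program
in_use[2] = True   # e.g. some other driver
pair = find_adjacent_pair(in_use)
print(pair, [role(cog) for cog in pair])
```

Any two adjacent cog ids differ in their lowest bit, so the same image can always tell its two copies apart, whichever pair the allocator hands out.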

Heater. · 2016-05-13 22:30

jmg,

Sure, you can get one working quickly (assuming you already have the values in stock).
- but if you need lower power, better precision and lower parts count, for higher volume use, sometimes you do need to apply some engineering design.
Simple can be an illusion.

Yes, yes, it was only an analogy.

But now you bring to mind a guy I worked with years ago that had been involved with designing the flight controls of the Concorde. He was then our software QA lead on digital flight control systems. Sometimes over a beer he would say things like "Why do we have a dozen software engineers, and the same again in hardware, thousands of lines of code, building things that, back then, were done by one guy with a few op-amps and capacitors?"

I could never honestly answer those questions....

Cluso99 · 2016-05-13 23:05

You better kill the egg beater design because without shared LUT, the cog allocation will be much worse!!!!!!!!

You see, the egg beater (or hub for that matter too) has inbuilt variable latency between successive cogs.

To minimise that latency, we will need to be about 2-3 cogs apart for a single data direction, or 8 cogs apart for equal latency in both directions. So depending on the application, we may be forced to tweak the cog numbers to get the best latency figures for the task at hand.
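As a toy illustration of why 8 apart gives equal latency both ways: if the rotation is modelled as a 16-slot ring, the offset from one cog to another and the offset back always add up to 16. The function name `slot_offset`, the direction of rotation, and the omission of instruction overhead are all assumptions, so these are not real hub timings.

```python
NUM_COGS = 16


def slot_offset(sender, receiver):
    """Ring distance from sender to receiver in rotation slots (toy model)."""
    return (receiver - sender) % NUM_COGS


for gap in (1, 2, 3, 8):
    forward = slot_offset(0, gap)
    backward = slot_offset(gap, 0)
    print(f"{gap} apart: forward {forward}, backward {backward}")
```

In this model a small gap is cheap in one direction and expensive in the other, and only a gap of 8 is symmetric.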

We will be doing this because we have 16 cogs. There will be I/O that we want to run that does not fit the smart pins properly and by using cooperating cogs we will be able to increase the possibilities. Video processing is an example, as is perhaps Ethernet. USB FS has been used as a real example that would use these cooperating cogs were it not for the specific smart pin USB mode.

I sure would rather deal with an order of magnitude less latency (Chip's words) with shared LUT, than with egg beater hub. Cog allocation is far simpler with shared LUT. There is nothing complex to allocating cogs to tasks. In reality, many of us do it now without realising it.

I totally acknowledge the P2 is not an ARM. But quite frankly, they are looking better every day... a plethora of peripherals, substantially better clock frequencies, multi-cored, lots of ram, flash and eeprom, cheap. Bugbears... Interrupts and complex instruction set but they are more suited to high level languages, and lots of them. Oh, and in silicon already.

evanh · 2016-05-13 23:05

rjo__ wrote: »

What Chip has done is to implement fast signals (strobes) from cog to cog. It looks more flexible than I thought would be possible at this late stage. I have looked but I don't see the bit depth of the messaging system...?

Strobing is using the event system so doesn't contain data itself (and doesn't use explicit acknowledge/reset either).

The new data passing is via the Smartpins. For a single byte it's a mere four clocks for write and read each. This is close to an ordinary instruction time and just as consistent, no variability. And can be directly waited on too. Total of 8 clocks when waited.

Compare that with LUT sharing and it's not so bad. To pass a byte via LUT it would have required writing (2 clocks) then strobing (2 clocks, maybe 3 for the receiving Cog) then reading (3 clocks). 7(8?) clocks when waited.

PS: There is discussion about what to do about resetting the IN bit of the Smartpin being waited on. - http://forums.parallax.com/discussion/164260/why-is-pingetz-rdpin-whatever-not-an-implicit-pinack/p1
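The byte tallies above, written out as arithmetic. The clock counts are the estimates quoted in this post, not measured figures, and the dictionary and function names are only for illustration.

```python
# Clock counts as quoted in the post above (forum estimates, not measurements).
SMARTPIN_BYTE = {"write": 4, "read": 4}
LUT_BYTE_BEST = {"write": 2, "strobe": 2, "read": 3}
LUT_BYTE_WORST = {"write": 2, "strobe": 3, "read": 3}  # 3-clock strobe on the receiver


def total_clocks(steps):
    """Sum the per-step clock counts for one waited byte transfer."""
    return sum(steps.values())


for name, steps in [
    ("Smartpin", SMARTPIN_BYTE),
    ("LUT (best)", LUT_BYTE_BEST),
    ("LUT (worst)", LUT_BYTE_WORST),
]:
    print(f"{name}: {total_clocks(steps)} clocks")
```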

evanh · 2016-05-13 23:16

Cluso99 wrote: »

To minimise that latency, we will need to be about 2-3 cogs apart for a single data direction, or 8 cogs apart for equal latency in both directions. So depending on the application, we may be forced to tweak the cog numbers to get the best latency figures for the task at hand.

We will be doing this because we have 16 cogs. There will be I/O that we want to run that does not fit the smart pins properly and by using cooperating cogs we will be able to increase the possibilities. Video processing is an example, as is perhaps Ethernet. USB FS has been used as a real example that would use these cooperating cogs were it not for the specific smart pin USB mode.

You're talking nonsense, Cluso. You can't, on the one hand, argue for sub rotation latencies that apply to single byte performance, then, on the other hand, say that the Smartpin comms is worthless because its max throughput can't compete.

jmg · 2016-05-13 23:28

evanh wrote: »

The new data passing is via the Smartpins. For a single byte it's a mere four clocks for write and read each. This is close to an ordinary instruction time and just as consistent, no variability. And can be directly waited on too. Total of 8 clocks when waited.

Do you have a link to that ? I've not seen any specs yet on actual end-end performance ?
The lowest clock counts were also for Bytes only, not Words, whilst LUT times are for 32b transfers.

evanh · 2016-05-13 23:35

4, 6 and 10 clocks for byte, short and long respectively. So a long via the Smartpins is 20 clocks total .... Details. :P
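Spelled out, assuming (as the 20-clock figure implies) that a read costs the same as a write. The names `CLOCKS_PER_TRANSFER` and `write_plus_read` are just for this sketch.

```python
# Per-direction clock counts quoted above for each transfer size.
CLOCKS_PER_TRANSFER = {"byte": 4, "short": 6, "long": 10}


def write_plus_read(size):
    """Clocks for one write plus one read, assuming both cost the same."""
    return 2 * CLOCKS_PER_TRANSFER[size]


for size in CLOCKS_PER_TRANSFER:
    print(f"{size}: {write_plus_read(size)} clocks")
```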

Rayman · 2016-05-13 23:55

Wonder if flag on ina could be set at start of smart pin write instead of end

evanh · 2016-05-14 00:00

Ah, damn, there is some variability, it's 4-6 clocks for a byte read, because there is no start/stop data control or something. The read data from Z just keeps repeating. I'm not sure if this info is out of date or not.

evanh · 2016-05-14 00:05

Rayman wrote: »

Wonder if flag on ina could be set at start of smart pin write instead of end

That sounded crazy at first glance. But you're right, the data can't leave Z faster than it arrives and it's guaranteed to complete the write in a fixed number of clocks. And IN can be raised in-sync with the packet loop start also, so, in theory, eliminating the variability at the same time.

EDIT: Doh! Nope, all fail. The 4-bit bus is just a typical databus, ie: half-duplex, so can't perform both a read and write in parallel.

But the variability may not be there when waited on. That would be nice.

evanh · 2016-05-14 00:23

evanh wrote: »

The 4-bit bus is just a typical databus, ie: half-duplex, so can't perform both a read and write in parallel.

Or is it? There is a chance read and write are kept on completely separate buses given the nature of the OR merging at, I presume, both ends. Top and tail style.

Tor · 2016-05-14 00:53

The concept of cog direct communication with their immediate neighbours _is_ elegant, as I see it. It has been done before elsewhere (that is, something similar in other architectures) and was considered elegant also in practice.

The problem then boils down to the cog allocation issue only. Tricky, but if all the effort (the discussions) could be focused on that problem and not mentioning the lut sharing at all, maybe something could come up.

evanh · 2016-05-14 01:24

Tor wrote: »

The concept of cog direct communication with their immediate neighbours _is_ elegant, as I see it. It has been done before elsewhere (that is, something similar in other architectures) and was considered elegant also in practice.

This does need to be discussed, methinks. The Prop architecture has data sharing features already. Neighbour sharing is not one but I don't think it's really any more elegant than what the Prop2 already has. Large data sets do and should run in parallel on their respective sections of data. It doesn't involve masses of thrashing. Small data sets, like a servo loop for example, are done by a single core.

LUT sharing would be abused. That I'm sure of because I'd abuse it too. I know, the argument that it's our right to do so has its validity too.

The problem then boils down to the cog allocation issue only. Tricky, but if all the effort (the discussions) could be focused on that problem and not mentioning the lut sharing at all, maybe something could come up.

That question is whether dynamic Cog allocation continues to be supported in hardware, and therefore is naturally provided in all languages too. Or do we leave that for a kernel/library feature that only some things use? Which, in turn, changes the OBEX to a snippets/example repository for patching into projects.

Rayman · 2016-05-14 01:34

As I've said before, there are 2 problems with LUT sharing:

First, it's against Propeller (and multi core in general) precepts.
Second, we don't have enough logic in A9 chip to test it with everything else...

evanh · 2016-05-14 01:36

evanh wrote: »

Large data sets do and should run in parallel on their respective sections of data. It doesn't involve masses of thrashing. Small data sets, like a servo loop for example, are done by a single core.

Protocols are done in layers, concurrently, so can be spread across multiple cores without needing tight timing between layers.

kwinn · 2016-05-14 01:45

Rayman wrote: »

Digikey has about half a million 555 timer chips in stock. Seems like a lot...

555s are still in fairly common use in equipment. What other chip provides two comparators, a frequency/pulse width control, a reset input, an output that can provide up to 250mA of current, and operates over a wide voltage range for under a dollar?
The only improvement I can see is a couple of more pins so both comparators have control inputs and a complementary output signal.

jmg · 2016-05-14 01:52

Rayman wrote: »

As I've said before, there are 2 problems with LUT sharing:

First, it's against Propeller (and multi core in general) precepts.

Going as far back as Transputer, there were local connections... so I do not follow that as a genuine 'problem'.
Genuine problems are technical in nature.

Rayman wrote: »

Second, we don't have enough logic in A9 chip to test it with everything else...

That is also not a genuine problem, Chip has said the die-area is significantly above an A9, so the real ceiling to worry about, is the true die area.

Agreed, it is nice to be able to test all features in one build, but the Pin-cells are highly repeatable, so testing 32 of those gives good coverage.

jmg · 2016-05-14 01:57

evanh wrote: »

4, 6 and 10 clocks for byte, short and long respectively. So a long via the Smartpins is 20 clocks total .... Details. :P

Rayman wrote: »

Wonder if flag on ina could be set at start of smart pin write instead of end

This is why real end-to-end numbers are sought, including any jitter effects.

There might be ways to compress that transit time, from that 20+ clocks (Chip mentioned read took more clks than write), but if it really is in the region of 20 clks to send 32b, that makes a stronger technical case for LUT overlay.

rjo__ · 2016-05-14 02:00

evanh wrote: »

rjo__ wrote: »

What Chip has done is to implement fast signals (strobes) from cog to cog. It looks more flexible than I thought would be possible at this late stage. I have looked but I don't see the bit depth of the messaging system...?

Strobing is using the event system so doesn't contain data itself (and doesn't use explicit acknowledge/reset either).

The new data passing is via the Smartpins. For a single byte it's a mere four clocks for write and read each. This is close to an ordinary instruction time and just as consistent, no variability. And can be directly waited on too. Total of 8 clocks when waited.

Compare that with LUT sharing and it's not so bad. To pass a byte via LUT it would have required writing (2 clocks) then strobing (2 clocks, maybe 3 for the receiving Cog) then reading (3 clocks). 7(8?) clocks when waited.

PS: There is discussion about what to do about resetting the IN bit of the Smartpin being waited on. - http://forums.parallax.com/discussion/164260/why-is-pingetz-rdpin-whatever-not-an-implicit-pinack/p1

Thanks. I understand the issue of implicit-pinack now. Didn't understand where THAT came from:)

K2 · 2016-05-14 07:06

Is LUT sharing between adjacent cogs very important? YES!

Can't express how much I wish it would be put back into the P2. Yes, I understand there is no COGNEW 2 in hardware, and putting it in would be complicated. So DON'T PUT IT IN. Let LUT sharing be the province solely of those who map their cogs. Those who willy-nilly allocate and free cogs all day long don't use LUT sharing. Problem solved. Please don't let Nanny State policies creep into the P2!

Frankly, the cog-mapping folks are also those who are much more likely to need the ultra-low latency no-jitter cog-to-cog value passing.

To arbitrarily prevent the P2 from doing something extremely valuable like this is akin to permanently welding training wheels onto every bicycle. Seriously, are we such infants that we can't wrap our heads around statically allocating two cogs at start-up for those cases where we need LUT sharing?

evanh · 2016-05-14 07:27

K2 wrote: »

Yes, I understand there is no COGNEW 2 in hardware, and putting it in would be complicated. So DON'T PUT IT IN.

I want to see this statement from everyone that thinks the same. However ....

Let LUT sharing be the province solely of those who map their cogs. Those who willy-nilly allocate and free cogs all day long don't use LUT sharing. Problem solved.

... it seems you may not understand at all - You are requiring everyone to change with you. What we have with the OBEX relies heavily on dynamic Cog mapping. That will not continue if shared LUTs come to be.

Heater. · 2016-05-14 07:30

It's amazing that the application of a few transistors, or not, can give rise to such disagreement and, well, friction between people.

potatohead · 2016-05-14 07:51

It is the classic I vs "us" or "we" scenario.

IMHO, unresolvable to the satisfaction of everyone.

P1 was designed in this way, and we all got good benefit.

P2 will be no different. That we get to see one of these choices makes it harder to see. Chip could have never done it, and we all would be thinking about making the most of what is in there.

I'm good with keeping the COGS atomic, like they are.

The feature is out, messaging and signaling is in. Great!

Now we move on to testing and building the on board boot ROM.

We all could be writing some PASM to use what there is and make sure it is free of bugs.

This design trades throughput for the easy and slow HUB scheme in P1. Learning to make the most of it will need to be done just like it was for P1 too. All choices made a long time ago.

Nobody was happy then either, but everyone also knew that "cogs are equal" matters more than maximizing any one of them does. This idea is so foundational that we went for the egg beater rather than mess with cog equality. IMHO, good call.

No surprise here, other than Chip exploring it briefly.

Time to move on, IMHO.

BTW, interrupt events, done the way they are in P2, actually do keep COGS atomic, unless one really puts in some effort to break that. Turns out, we can do interrupts and still preserve that object portability. The key was keeping them in COG and not system wide. IMHO, that does mean exploring things is good. We will find those to be very useful, and easily ignored while "a cog is a cog" still exists. Nice.

But it also means letting go of a few things, or doing them differently too. This is gonna be one of those things.

We got a lot more than I was ever expecting after seeing the hot one and the massive heat, etc... associated with it.

Again, IMHO, time to move on.

MJB · 2016-05-14 08:47

evanh wrote: »

You're talking nonsense, Cluso.

EDIT: interesting - someone changed the original and the citation changed with it as well - not so harsh this way - thanks moderators? or who is watching us ??

Gentlemen - please do all remember

everybody here is giving his time and energy into this project to make it better.

Even if it might look like an idea is in conflict with your own thinking.

And a language which acknowledges this is more helpful thanks
