Is LUT sharing between adjacent cogs very important?

cgracey · 2016-05-10 05:36

Cluso99 wrote: »

Chip,
I am extremely disappointed that the scaremongers have succeeded in derailing a fantastically simple mechanism that would provide a fast efficient mechanism for two cooperating cogs.

I'm scared, all on my own.

Code to find two sequential cogs is quite simple and could be handled simply by the start routine in the object. Failing to find two adjacent cogs when we have 16 cogs is highly unlikely. It is way more likely that we cannot locate two cogs to minimise latency in the egg beater hub mechanism, and the latency will always be higher!

The additional mechanism of signalling a cog is great, but does not resolve fast simple cog-cog Comms.

I suspect that cooperating cogs using the egg beater hub will need to be spaced 3-4 cogs apart. This is a much more difficult setup than with adjacent cogs. With LUT sharing, the second cog could be in a tight loop waiting for the long to become non-zero, at which time we have the byte/word/long! Almost the same mechanism we use as mailboxes in P1, but without the delays due to hub latency, so extremely efficient.

Please reconsider your stance, because it will be way more complex and less efficient to have cooperating cog objects under the egg beater mechanism than with shared LUT.

The way I see it, for objects to be universally usable, they need to be written for worst-case timing, period. If someone wants to make a custom app that juxtaposes cogs in some pattern, they can do that, but objects cannot be written anticipating such order. They have to work in the wild where any object can run in any cog, along with other objects running in any other cogs.

I've been thinking that what would be handy for cog-to-cog messaging would be for the sender cog to use setq+wrlong, then 'attention', and then the receiver cog doing a setq+rdlong. There you can move an N-long message pretty quickly, without any FIFO uncertainties. You would just have waits to get to the initial egg-beater position. By the way, this could be optimized at run-time, once you know the relative cog numbers, by picking the starting long offsets of sender and receiver that result in the lowest latency. The LUT-sharing would be better at single-long back-and-forth messaging, but it has the cost of application-level minding. Keeping COGNEW as unrestrained as an XOR instruction has high value.
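One rough way to reason about the setq+wrlong, 'attention', setq+rdlong handoff described above is to model the egg-beater as a rotating connection between cogs and hub RAM slices. The sketch below is an illustration under stated assumptions, not Parallax's timing: it assumes 16 slices, that long index `a` lives in slice `a mod 16`, that cog `c` is connected to slice `(c + t) mod 16` at clock `t`, that a block transfer moves one long per clock once its first slot arrives, and that 'attention' costs nothing. All function names are hypothetical.

```python
N_SLICES = 16  # assumption: one hub RAM slice per cog, 16 cogs as in this thread


def wait_clocks(cog, long_index, t):
    """Clocks until `cog` is connected to the slice holding `long_index`, from clock t."""
    return (long_index - cog - t) % N_SLICES


def transfer_latency(sender, receiver, start_long, n_longs, t0=0):
    """Clocks from t0 until the receiver has read the whole n-long message."""
    t = t0
    t += wait_clocks(sender, start_long, t)    # sender waits for its first slot
    t += n_longs                               # burst write, one long per clock
    # 'attention' is modelled as instantaneous (assumption)
    t += wait_clocks(receiver, start_long, t)  # receiver waits for its first slot
    t += n_longs                               # burst read, one long per clock
    return t - t0


def best_start_offset(sender, receiver, n_longs, t0=0):
    """Start long (0..15) that gives the lowest modelled latency for this phase t0."""
    return min(range(N_SLICES),
               key=lambda a: transfer_latency(sender, receiver, a, n_longs, t0))


if __name__ == "__main__":
    print(transfer_latency(sender=0, receiver=1, start_long=0, n_longs=4))
    print(best_start_offset(sender=0, receiver=1, n_longs=4, t0=3))
```

In this toy model the start offset only changes the sender's initial wait; real hardware adds instruction and 'attention' latencies that the model ignores, so the numbers are only useful for comparing choices, not for predicting cycle counts.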

Electrodude · 2016-05-10 05:41

jmg wrote: »

cgracey wrote: »

There is no universal guarantee that they will launch sequentially, though. That's the whole problem.

I'm not following - in order to test the P2, you surely must be able to launch specific COGs on command ?

That will also be pretty much required for Debug too.

Yes, but that's only for controlled testing. The test code can be specially written to start specific cogs. But real life programs often use objects out of the obex that start their own cogs. These objects must use COGNEW to start their cogs and not COGINIT, since they could be initialized in any order and don't know what other cogs are being used by other objects, and are sometimes even started and stopped in the middle of normal operation.

I've written programs on the P1 that start and stop cogs on the fly, and I've done plenty of debugging on the Propeller, and I've always used COGNEW and never had to care about COGINITing specific cogs.

cgracey · 2016-05-10 05:41

jmg wrote: »

cgracey wrote: »

There is no universal guarantee that they will launch sequentially, though. That's the whole problem.

I'm not following - in order to test the P2, you surely must be able to launch specific COGs on command ?

That will also be pretty much required for Debug too.

Of course you can do that, but when bringing lots of unknown objects together, you can't have any of them demanding a certain seat. They need to take whatever seat is available at the time they board. Things have to work at run-time without any possibility of faulting.

evanh · 2016-05-10 05:47

Chip,
Can a DAC value be set without the physical output being driven?

With the new DAC buses in place this becomes a fast set of registers that all Cogs can both read and write, right?

cgracey · 2016-05-10 05:50

Electrodude wrote: »

jmg wrote: »

cgracey wrote: »

There is no universal guarantee that they will launch sequentially, though. That's the whole problem.

I'm not following - in order to test the P2, you surely must be able to launch specific COGs on command ?

That will also be pretty much required for Debug too.

Yes, but that's only for controlled testing. The test code can be specially written to start specific cogs. But real life programs often use objects out of the obex that start their own cogs. These objects must use COGNEW to start their cogs and not COGINIT, since they could be initialized in any order and don't know what other cogs are being used by other objects, and are sometimes even started and stopped in the middle of normal operation.

I've written programs on the P1 that start and stop cogs on the fly, and I've done plenty of debugging on the Propeller, and I've always used COGNEW and never had to care about COGINITing specific cogs.

That's right. That's how it needs to work. The only way you could dictate certain cog numbers would be to make sure your code runs before any of the objects' code. That can be done in your application, but no objects can be written like that, unless the user knows to initialize them first. This would suck everybody into an application framework, just to accommodate LUT-sharing. That's a whole level of mental overhead that isn't worth the benefit.

cgracey · 2016-05-10 05:54

evanh wrote: »

Chip,
Can a DAC value be set without the physical output being driven?

With the new DAC buses in place this becomes a fast set of registers that all Cogs can both read and write, right?

You can set any pin's DAC to either an 8-bit value or a 16-bit PWM'd or dithered value. If you want your cog's DAC channels to be routed to a pin, you configure that pin as a DAC output which is receiving your cog's DAC channel. Then, your cog can update that 8-bit DAC value on every single clock, if it wants to.

Wait... I just understood the real meaning of your question. No, they are not readable, though they are writeable. They cannot serve as messaging systems.

jmg · 2016-05-10 06:08

cgracey wrote: »

Wait... I just understood the real meaning of your question. No, they are not readable, though they are writeable. They cannot serve as messaging systems.

Others have posited using the Smart pins as message conduits.
What are the caveats and what timings could result ?
Can only one COG connect to a Smart pin at a time ?
How many SysCLKs to send then receive a read-back signal ?

jmg · 2016-05-10 06:10

cgracey wrote: »

Of course you can do that, but when bringing lots of unknown objects together, you can't have any of them demanding a certain seat. They need to take whatever seat is available at the time they board. Things have to work at run-time without any possibility of faulting.

Using the seat analogy, some have seat numbers, and some do not.
Usually, those with numbers get first dibs on the seats ( which, if all goes well, will match their numbers).

Electrodude · 2016-05-10 06:19

jmg wrote: »

cgracey wrote: »

Of course you can do that, but when bringing lots of unknown objects together, you can't have any of them demanding a certain seat. They need to take whatever seat is available at the time they board. Things have to work at run-time without any possibility of faulting.

Using the seat analogy, some have seat numbers, and some do not.
Usually, those with numbers get first dibs on the seats ( which, if all goes well, will match their numbers).

Except cog allocation happens in the order in which COGNEW and COGINIT instructions are executed, not in some sort of priority order. Cog initialization might never complete if you're starting and stopping cogs on the fly.

cgracey · 2016-05-10 06:19

jmg wrote: »

cgracey wrote: »

Of course you can do that, but when bringing lots of unknown objects together, you can't have any of them demanding a certain seat. They need to take whatever seat is available at the time they board. Things have to work at run-time without any possibility of faulting.

Using the seat analogy, some have seat numbers, and some do not.
Usually, those with numbers get first dibs on the seats ( which, if all goes well, will match their numbers).

With boarding rules, you could do that. Getting rid of the boarding rules really simplifies everything, though.

jmg · 2016-05-10 06:22

cgracey wrote: »

I've been thinking that what would be handy for cog-to-cog messaging would be for the sender cog to use setq+wrlong, then 'attention', and then the receiver cog doing a setq+rdlong. There you can move an N-long message pretty quickly, without any FIFO uncertainties. You would just have waits to get to the initial egg-beater position.

By the way, this could be optimized at run-time, once you know the relative cog numbers, by picking the starting long offsets of sender and receiver that result in the lowest latency.

That's quite a few hoops you have jumped through there, but the idea is quite interesting.

Taking this further, you could (with moderately complex tools) decide at Compile time the relative COG placements, and then use all cogs allocated (no COGNEW in sight) and you now have defined the EggBeater offsets, that can have compile-time fix-ups applied (some 'allocation waste' is implicit here - reserve 16 locations, and pick the fastest one, for each direction ).

This would report in the MAP file, but the user would not see the level of housekeeping being done.

Anyone wanting speed would probably be fine using a block of 16 to always get a fastest-path.

This can also work between more than just 2 COGS, should anyone be that ambitious.
It is unlikely the optimal slot would be the same address for target COGS. Were that ever to happen, the next slot is only 1 SysCLK worse.

Could you ever get a result, driven by COG offset, that wanted the same slot for read and write ?
( I think the ideal phase will be less than 8 slots, so this should never occur ?)

More up-front leg-work, (done by SW) but it saves run-time 'auto calibrate' code being needed.

Of course, the COGNEW had to be thrown overboard very early on, to make this deterministic & user controlled.

cgracey wrote: »

The LUT-sharing would be better at single-long back-and-forth messaging..

... and also better at block and record passing, of variable sizes...

cgracey · 2016-05-10 06:23

Electrodude wrote: »

jmg wrote: »

cgracey wrote: »

Of course you can do that, but when bringing lots of unknown objects together, you can't have any of them demanding a certain seat. They need to take whatever seat is available at the time they board. Things have to work at run-time without any possibility of faulting.

Using the seat analogy, some have seat numbers, and some do not.
Usually, those with numbers get first dibs on the seats ( which, if all goes well, will match their numbers).

Except cog allocation happens in the order in which COGNEW and COGINIT instructions are executed, not in some sort of priority order. Cog initialization might never complete if you're starting and stopping cogs on the fly.

And some objects may re-COGINIT already COGNEW'd cogs. COGINIT is really only useful like that in object-based applications.

cgracey · 2016-05-10 06:30

jmg wrote: »

cgracey wrote: »

I've been thinking that what would be handy for cog-to-cog messaging would be for the sender cog to use setq+wrlong, then 'attention', and then the receiver cog doing a setq+rdlong. There you can move an N-long message pretty quickly, without any FIFO uncertainties. You would just have waits to get to the initial egg-beater position.

By the way, this could be optimized at run-time, once you know the relative cog numbers, by picking the starting long offsets of sender and receiver that result in the lowest latency.

That's quite a few hoops you have jumped through there, but the idea is quite interesting.

Taking this further, you could (with moderately complex tools) decide at Compile time the relative COG placements, and then use all cogs allocated (no COGNEW in sight) and you now have defined the EggBeater offsets, that can have compile-time fix-ups applied (some 'allocation waste' is implicit here - reserve 16 locations, and pick the fastest one, for each direction ).

This would report in the MAP file, but the user would not see the level of housekeeping being done.
Anyone wanting speed would probably be fine using a block of 16 to always get a fastest-path.

More up-front leg-work, but it saves run-time 'auto calibrate' code being needed.

Of course, the COGNEW had to be thrown overboard very early on, to make this deterministic & user controlled.

cgracey wrote: »

The LUT-sharing would be better at single-long back-and-forth messaging..

... and also better at block and record passing, of variable sizes...

If the compiler knew that all COGNEWs were performed initially at application start, it could do that, but one of the neat things about objects is that they can de-allocate cogs using COGSTOP(oneofmycogs). This COGNEW stuff won't be all known at compile-time, any more than something like MALLOC(size) can be anticipated.

About that jiggery-pokery involving long start addresses associated with cog numbers... I would never write an object to bother with that. It's too complicated, though it would afford some possible improvement. I would make things work with worst-case timing and leave it at that.

Electrodude · 2016-05-10 06:32

jmg wrote: »

cgracey wrote: »

I've been thinking that what would be handy for cog-to-cog messaging would be for the sender cog to use setq+wrlong, then 'attention', and then the receiver cog doing a setq+rdlong. There you can move an N-long message pretty quickly, without any FIFO uncertainties. You would just have waits to get to the initial egg-beater position.

By the way, this could be optimized at run-time, once you know the relative cog numbers, by picking the starting long offsets of sender and receiver that result in the lowest latency.

That's quite a few hoops you have jumped through there, but the idea is quite interesting.

Taking this further, you could (with moderately complex tools) decide at Compile time the relative COG placements, and then use all cogs allocated (no COGNEW in sight) and you now have defined the EggBeater offsets, that can have compile-time fix-ups applied (some 'allocation waste' is implicit here - reserve 16 locations, and pick the fastest one, for each direction ).

This would report in the MAP file, but the user would not see the level of housekeeping being done.
Anyone wanting speed would probably be fine using a block of 16 to always get a fastest-path.

More up-front leg-work, but it saves run-time 'auto calibrate' code being needed.

Of course, the COGNEW had to be thrown overboard very early on, to make this deterministic & user controlled.

cgracey wrote: »

The LUT-sharing would be better at single-long back-and-forth messaging..

... and also better at block and record passing, of variable sizes...

Good luck dynamically starting and stopping cogs on the fly with that method. Dynamic cog allocation can only be done at runtime, since which cogs are running is dependent upon inputs to the chip, something the compiler cannot possibly predict ahead of time.

cgracey · 2016-05-10 06:36

Yes, Electrodude, and in some applications, like I believe Erna has worked on, cogs are being started up and shut down on a thread-like basis. Cog allocation being dynamic enables lots of neat things.

jmg · 2016-05-10 06:49

Electrodude wrote: »

Good luck dynamically starting and stopping cogs on the fly with that method. Dynamic cog allocation can only be done at runtime, since which cogs are running is dependent upon inputs to the chip, something the compiler cannot possibly predict ahead of time.

Sure, but COGNEW will not allocate on top of an already started COG, so I do not see the issue ?
If someone wanted to fully define 14 COGS and 'float' 2, I see nothing preventing that ?
In this case, the simple rule is cogstop applies to a float-pool COG.

cgracey · 2016-05-10 07:13

jmg wrote: »

Electrodude wrote: »

Good luck dynamically starting and stopping cogs on the fly with that method. Dynamic cog allocation can only be done at runtime, since which cogs are running is dependent upon inputs to the chip, something the compiler cannot possibly predict ahead of time.

Sure, but COGNEW will not allocate on top of an already started COG, so I do not see the issue ?
If someone wanted to fully define 14 COGS and 'float' 2, I see nothing preventing that ?
In this case, the simple rule is cogstop applies to a float-pool COG.

In all the objects I wrote for Prop1, I had 'start' and 'stop' methods. The 'start' method would start the cog and save the cog id. The 'stop' method would shut the started cog down. The idea is, of course, that you can have cogs work on whatever is needed at the time. Erna would just keep allocating cogs as they came available and the cogs would shut themselves down when finished. It's a different way of doing things that allows different approaches.
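The start/stop pattern described above, together with first-come allocation, can be sketched as follows. This is a minimal illustration, not actual Spin or Prop2 behaviour: `CogPool` and `Driver` are hypothetical names, and the pool assumes that COGNEW hands out the lowest-numbered idle cog, as on the Prop1. It shows how a pair of cogs requested after some start/stop churn need not be adjacent.

```python
class CogPool:
    """Toy model of cog allocation (assumption: lowest-numbered idle cog wins)."""

    def __init__(self, n_cogs=16):
        self.busy = [False] * n_cogs

    def cognew(self):
        """Claim the lowest idle cog; return its id, or -1 if none is free."""
        for cog, in_use in enumerate(self.busy):
            if not in_use:
                self.busy[cog] = True
                return cog
        return -1

    def cogstop(self, cog):
        self.busy[cog] = False


class Driver:
    """Hypothetical object with 'start' and 'stop' methods that remembers its cog id."""

    def __init__(self, pool):
        self.pool = pool
        self.cog = -1

    def start(self):
        self.stop()                      # release any cog we already hold
        self.cog = self.pool.cognew()
        return self.cog >= 0

    def stop(self):
        if self.cog >= 0:
            self.pool.cogstop(self.cog)
            self.cog = -1


pool = CogPool()
drivers = [Driver(pool) for _ in range(4)]
for d in drivers:
    d.start()                            # takes cogs 0, 1, 2, 3 in order
drivers[1].stop()                        # cog 1 becomes free again

# An object that wants two cooperating cogs now gets whatever is free.
first = pool.cognew()
second = pool.cognew()
print(first, second)
```

Under these assumptions the pair lands on cogs 1 and 4, which is why an object cannot count on adjacency unless it runs before everything else.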

Heater. · 2016-05-10 07:41

Cluso99,

Code to find two sequential cogs is quite simple and could be handled simply by the start routine in the object.

Not quite.

In the general case there are many COGS running many "processes" asynchronously. It is impossible for software to find a free COG and then make use of it without hardware support for atomic operations. You can never be sure that some other process does not also find that same free COG and try starting it as well.

That is why the Propeller has COGNEW.
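The hazard described above can be sketched in a few lines. This is a toy model, not Propeller code: the `Hub` class and its method names are hypothetical, the lock merely stands in for the hub's atomic arbitration, and the model assumes the lowest-numbered idle cog is chosen. Two "processes" that each scan for a free cog and then start it can both end up on the same cog, whereas an atomic find-and-claim cannot collide.

```python
import threading


class Hub:
    """Toy hub model contrasting a software scan with an atomic COGNEW."""

    def __init__(self, n_cogs=16):
        self.running = [False] * n_cogs
        self._lock = threading.Lock()    # stands in for hardware atomicity

    def find_free(self):
        """Software-only scan: report the lowest idle cog without claiming it."""
        for cog, busy in enumerate(self.running):
            if not busy:
                return cog
        return -1

    def coginit(self, cog):
        """Start a specific cog unconditionally, even if it is already running."""
        self.running[cog] = True
        return cog

    def cognew(self):
        """Find and claim the lowest idle cog as one indivisible step."""
        with self._lock:
            cog = self.find_free()
            if cog >= 0:
                self.running[cog] = True
            return cog


# Racy version: both processes scan before either one starts its cog.
racy = Hub()
a = racy.find_free()
b = racy.find_free()
racy.coginit(a)
racy.coginit(b)          # silently restarts the cog that process A just claimed

# Atomic version: each call scans and claims in a single step.
safe = Hub()
c = safe.cognew()
d = safe.cognew()
print(a, b, c, d)
```

The interleaving is written out by hand here so the collision is deterministic; on real hardware it would depend on timing, which is what makes it hard to test for.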

Given the idea of "objects", as in all that Spin code in OBEX and elsewhere but also applicable to other languages, users don't generally want to know or care what the code in an object is doing as long as it provides the driver or whatever functionality it advertises. COGNEW makes it possible to do this. Users should not have to worry about COG allocation any more than they have to worry about which addresses their variables live at.

You can't just push this problem of finding sequential COGs to the start of an object. You have to push it all the way to the start of the user's program. At that point you have broken the object isolation and ease of reuse referred to above.

As Chip says:

"There is no universal guarantee that they will launch sequentially, though. That's the whole problem. To make such a guarantee, we would either have to have an extended dual-cog COGNEW or some application framework in which startup code is called for all objects before regular runtime code. Everything will get wrapped around that axle because of that single LUT-sharing feature which many apps may not care about. I just don't think it's worth it.

It would turn what is now an atomic hardware function into an application-level paradigm."

It is argued that such resource sharing between COGs is there to be used if wanted but can otherwise be ignored. Perhaps true, but as soon as code that does use it hits OBEX all code would become polluted with the problem.

cgracey · 2016-05-10 08:07

Good explanation, Heater.

I hope everyone understands why sequential LUT sharing is problematic.

I don't want to kill any useful feature, but anything that keeps us from being able to share objects without caveats is a bit like kryptonite to the Prop2.

Since we freed the cogs' DAC channels from fixed pins and got rid of the LUT sharing, objects can now be simple and free again, like they are on the Prop1. This has huge import to how people will experience Prop2.

jmg · 2016-05-10 09:21

Heater. wrote: »

COGNEW makes it possible to do this. Users should not have to worry about COG allocation any more than they have to worry about which addresses their variables live at.

Yup, COGNEW is like a High level language - it allows most users the luxury of knowing less.

However, I also know no one that makes an MCU with no Assembler support, and sometimes, yes, you DO need to "worry about which addresses your variables live at", or which SysCLK two COGS may fire at.

Objects from OBEX already have to follow some project imposed rules (they are not usually Clock Frequency agnostic, for example), nor can each one freely choose pins, so I do not see that implementing LUT sharing has to cause any problems with OBEX.
It simply becomes another rule.

Roy Eltham · 2016-05-10 09:47

cgracey wrote: »

Good explanation, Heater.

I hope everyone understands why sequential LUT sharing is problematic.

I don't want to kill any useful feature, but anything that keeps us from being able to share objects without caveats is a bit like kryptonite to the Prop2.

Since we freed the cogs' DAC channels from fixed pins and got rid of the LUT sharing, objects can now be simple and free again, like they are on the Prop1. This has huge import to how people will experience Prop2.

I'm extremely happy about this Chip. I have been worried about this for a while. I really think it's a key feature of the Prop, and should be for the Prop 2 as well.

cgracey · 2016-05-10 09:58

Roy Eltham wrote: »

cgracey wrote: »

Good explanation, Heater.

I hope everyone understands why sequential LUT sharing is problematic.

I don't want to kill any useful feature, but anything that keeps us from being able to share objects without caveats is a bit like kryptonite to the Prop2.

Since we freed the cogs' DAC channels from fixed pins and got rid of the LUT sharing, objects can now be simple and free again, like they are on the Prop1. This has huge import to how people will experience Prop2.

I'm extremely happy about this Chip. I have been worried about this for a while. I really think it's a key feature of the Prop, and should be for the Prop 2 as well.

I just implemented the cog-to-cog(s) 'attention' mechanism, which can be used to alert multiple cogs on the same clock that they need to do something. There is nothing else that needs to be added, I feel. I just need to update the documentation and get a new release out, so that people can start using what should be the final Prop2 logic. There's still the boot ROM to do, but that's different from the hardware. I think the hardware is done!

Is there anything else that has been bothering you about the design?

We've got the reciprocal counters that jmg's been talking about for a while implemented into the smart pins. They are really neat. We also have the streamer doing 1/2/4-bit modes. I don't think anything is missing now. Do you see anything? Does anyone else see anything? (okay, aside from LUT sharing)

Heater. · 2016-05-10 10:22

jmg,

Yup, COGNEW is like a High level language - it allows most users the luxury of knowing less.

COGNEW is more than that.

COGNEW allows for the claiming of a resource, a COG, in a multi-processor system in an atomic way. This cannot be done in a sequence of assembler instructions without a race hazard, unless you write a resource manager that wraps critical parts of its code in locks to achieve atomicity.

Objects from OBEX already have to follow some project imposed rules (they are not usually Clock Frequency agnostic, for example), nor can each one freely choose pins, so I do not see that implementing LUT sharing has to cause any problems with OBEX.

Clock frequency and pin selection are externally imposed constraints. Not much we can do about that. The LUT sharing can cause problems due to the race conditions described above.

It simply becomes another rule.

Don't like rules.

evanh · 2016-05-10 10:31

I've been mulling over that write buffering I've been harking on about and realised that anything more than one level deep falls into a similar trap that caches do. It can only write to single block spaced addresses per hub rotation.

But a single depth buffer would be effective because it would provide a reliable non-stalling single write per rotation without having to align to the Hub timing.

On the other hand, a clear description of minimum HubRAM latency with this eggbeater might help me put this one to rest.

evanh · 2016-05-10 10:32

JMG, read this again:

Heater. wrote: »

You can't just push this problem of finding sequential COGs to the start of an object. You have to push it all the way to the start of the user's program. At that point you have broken the object isolation and ease of reuse referred to above.

Which means the rule would become that OBEX programs can't start their own Cogs.

Rayman · 2016-05-10 10:44

Very glad to hear the hardware is done.

cgracey · 2016-05-10 10:45

evanh wrote: »

I've been mulling over that write buffering I've been harking on about and realised that anything more than one level deep falls into a similar trap that caches do. It can only write to single block spaced addresses per hub rotation.

But a single depth buffer would be effective because it would provide a reliable non-stalling single write per rotation without having to align to the Hub timing.

On the other hand, a clear description of minimum HubRAM latency with this eggbeater might help me put this one to rest.

I'm trying to visualize all the ramifications this might have on software. It would eliminate WRxxxx delays if they were paced at ~16+ clocks. If another hub memory instruction tried to execute, it would have to wait for the buffered write to finish, first. I'm sure there's more.
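The pacing argument above can be checked with a small simulation. It is a sketch under loud assumptions, not a description of the real hub: 16 slices, cog `c` connected to slice `(c + t) mod 16` at clock `t`, each write instruction costing one clock plus any stall, and only writes modelled (no reads competing for the buffer). `total_stall` and its arguments are hypothetical names.

```python
N_SLICES = 16  # assumption: 16 hub slices, one rotation every 16 clocks


def slot_wait(cog, long_index, t):
    """Clocks until `cog` reaches the slice holding `long_index`, from clock t."""
    return (long_index - cog - t) % N_SLICES


def total_stall(cog, writes, buffered):
    """Sum of clocks the cog spends stalled.

    writes: list of (work_clocks_before_write, long_index) pairs.
    buffered: True models a single-depth posted-write buffer.
    """
    t = 0
    stall = 0
    buffer_free_at = 0                   # clock at which the posted write has drained
    for work, long_index in writes:
        t += work                        # unrelated work between writes
        if buffered:
            if t < buffer_free_at:       # previous write still pending: must wait
                stall += buffer_free_at - t
                t = buffer_free_at
            buffer_free_at = t + slot_wait(cog, long_index, t) + 1
            t += 1                       # posting the write costs one clock
        else:
            wait = slot_wait(cog, long_index, t)
            stall += wait                # cog waits for its hub slot
            t += wait + 1
    return stall


paced = [(20, 5), (20, 9), (20, 2)]      # writes spaced 20 clocks apart
tight = [(0, 5), (0, 9)]                 # back-to-back writes
print(total_stall(0, paced, buffered=False), total_stall(0, paced, buffered=True))
print(total_stall(0, tight, buffered=False), total_stall(0, tight, buffered=True))
```

In this model, writes spaced at least a full rotation apart never stall once buffered, while back-to-back writes still pay for the pending one, which matches the "paced at ~16+ clocks" remark.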

Tubular · 2016-05-10 10:49

cgracey wrote: »

We've got the reciprocal counters that jmg's been talking about for a while implemented into the smart pins. They are really neat. We also have the streamer doing 1/2/4-bit modes. I don't think anything is missing now. Do you see anything? Does anyone else see anything? (okay, aside from LUT sharing)

A way for a smartpin to output an NCO clock > 40 MHz (for an 80 MHz fsys clock). At the moment the max NCO frequency that can be generated is 40 MHz ($8000_0000 added each cycle), and if you try to push it further it folds back: if you ask for 50 MHz it'll output 30 MHz, etc. The streamer can go all the way to 80 MHz, so there's a mismatch between 40 and 80 MHz, and we'll probably want this for LCD/HDMI/external DACs.
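The fold-back arithmetic above can be reproduced with a short calculation. This sketch assumes a 32-bit phase accumulator whose top bit drives the pin, which is consistent with the $8000_0000 figure quoted but is otherwise an assumption; the function names are hypothetical.

```python
def nco_increment(requested_hz, fsys_hz, bits=32):
    """Phase increment added each system clock for a requested frequency."""
    return ((requested_hz << bits) // fsys_hz) % (1 << bits)


def nco_output_hz(requested_hz, fsys_hz):
    """Frequency actually seen on the accumulator's top bit (folds above fsys/2)."""
    f = requested_hz % fsys_hz
    return f if 2 * f <= fsys_hz else fsys_hz - f


FSYS = 80_000_000
print(hex(nco_increment(40_000_000, FSYS)))   # increment for the 40 MHz limit
print(nco_output_hz(50_000_000, FSYS))        # asking for 50 MHz folds back
```

With an 80 MHz system clock, a 40 MHz request maps to an increment of $8000_0000, and a 50 MHz request folds back to 30 MHz, matching the figures in the post.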

What needs to happen (I think) is for the smart pin output to be ANDed with the system clock, for NCO frequency outputs above $8000_0000, so the output pulses shorten to 12.5ns.

I was thinking about this, and I don't think it should break your synthesis rules at all. I'm not after a higher frequency, just a shorter high pulse length.

Let me know if you need a diagram to illustrate this better!

evanh · 2016-05-10 11:00

cgracey wrote: »

I'm trying to visualize all the ramifications this might have on software. It would eliminate WRxxxx delays if they were paced at ~16+ clocks. If another hub memory instruction tried to execute, it would have to wait for the buffered write to finish, first. I'm sure there's more.

I think that's all. It doesn't help with maximum throughput but that never was the primary objective.

It should even be fine with a tight 'attention' strobe.

ozpropdev · 2016-05-10 11:15

cgracey wrote: »

I just need to update the documentation and get a new release out, so that people can start using what should be the final Prop2 logic.... I think the hardware is done!

BEST quote of the year!
