Best practice for the fast case is to avoid multiple cogs writing to the same cog's LUT at the same time.
In return we get:
Write broadcast to all cogs, potentially.
Fast any-cog-to-any-cog data transfer, with no time restriction.
Write-happened events.
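A rough C sketch of that usage pattern, modeling each cog's LUT as an array and the write-happened event as a flag; lut, wr_event, and lut_write are illustrative names for this model only, not actual P2 registers or instructions:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NUM_COGS 16
#define LUT_SIZE 512

/* Toy model: each cog has a LUT and a "write happened" event flag. */
static uint32_t lut[NUM_COGS][LUT_SIZE];
static bool     wr_event[NUM_COGS];

/* One cog writes a long into another cog's LUT and raises that cog's event. */
static void lut_write(int dst_cog, int addr, uint32_t data)
{
    lut[dst_cog][addr] = data;
    wr_event[dst_cog]  = true;
}

int main(void)
{
    /* Cog 0 drops a message into cog 3's LUT mailbox at address 0x1F0. */
    lut_write(3, 0x1F0, 0xCAFE1234);

    /* Cog 3 polls its write-happened event, then reads the mailbox. */
    if (wr_event[3]) {
        wr_event[3] = false;
        printf("cog 3 received %08X\n", (unsigned)lut[3][0x1F0]);
    }
    return 0;
}
```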
This is no different from the event system and its potential complications. We have safe, simple, good performance, and we have the go-ahead-and-abuse-it options too.
And people have plenty of options to ignore events entirely.
Same with this thing. Ignore it, use the safe mode, or abuse it. Your call.
And with all that, cogs are still just cogs. People can author very complicated stuff, and the impact of all of it is limited to their own code.
IMHO, these are smart features with teeth for those who want them.
As with the events, people will author things not otherwise possible and/or practical. Throughout this whole design process, that argument has been used to promote powerful features that still work within a reasonable power budget.
This implementation does a lot more than the simpler near-cog sharing did, and the cost is modest, with the keep-out-of-trouble options present and accounted for.
All without really impacting one of our core and very attractive design ideas.
Strange that no one seems to be considering or analyzing the READ-path variant of this?
I know the feeling. I even started a whole separate thread on the expanded ATN mechanism, but it's going largely uncommented upon. Which is surprising for this group.
As for the READ variant, I think the data corruption issue is a red herring. Data written to the wrong address or data read from the wrong address will still potentially cause problems; it's just a matter of which cog it will cause problems for. One issue with READ that I do see is that you have to do more work to get the same one-to-many effect that WRITE allows. In other words, with WRITE, a single cog distributes a value to several LUTs at once. But, with READ, multiple cogs must work to get the same single value from the first cog.
What am I missing? Isn't the P2 intended to be a microcontroller, NOT a microprocessor running a multi-user operating system? Aren't there honestly a very limited number of potential attack vectors for malicious code beyond what you write and what you liberate as objects from OBEX2? These are very advanced features that won't be used by the faint of heart. Any OBEX2 code using them will be well tested and documented, or it will not be very popular among object borrowers. You embed the code you want into your P2; nobody is going to wander up and drop a virus into it that runs willy-nilly through your shared LUT space.
Strange that no one seems to be considering or analyzing the READ-path variant of this?
The Read version has the same ultimate concerns, does it not? However, it seems harder to wrap one's head around than the write variant.
Not really; they are both one-way paths that need some inter-COG rules.
Read avoids the nasty side effect of permanently changing memory that was never addressed.
Read still has same-slot effects, but now they are immediate and much easier to debug.
Read gains some very useful debug and trace features, but it does lose the carefully timed OR-merge of Write. Is there a real use case for that?
Personally, I prefer anything that makes debug & development easier.
Strange that no one seems to be considering or analyzing the READ-path variant of this?
I know the feeling. I even started a whole separate thread on the expanded ATN mechanism, but it's going largely uncommented upon. Which is surprising for this group.
I think the ATN needs more examples, to show where and how it can be used.
As for the READ variant, I think the data corruption issue is a red herring. Data written to the wrong address or data read from the wrong address will still potentially cause problems; it's just a matter of which cog it will cause problems for.
Yes and no. Read corruption occurs immediately, on the opcode used, so it is easier to identify and debug.
Corruption of unexpected, unaddressed memory is far harder to identify and debug, and likely triggers multiple failures.
Not only did you not put data where you expected, thus giving your own process bad info, but you have also corrupted some other, probably tested, code or data.
If that other code crashes first, where do you naturally start looking ?
One issue with READ that I do see is that you have to do more work to get the same one-to-many effect that WRITE allows. In other words, with WRITE, a single cog distributes a value to several LUTs at once. But, with READ, multiple cogs must work to get the same single value from the first cog.
Yes, there are some trade-offs.
Read gains some great, non-intrusive, debug and trace visibility.
Strange that no one seems to be considering or analyzing the READ-path variant of this?
I know the feeling. I even started a whole separate thread on the expanded ATN mechanism, but it's going largely uncommented upon. Which is surprising for this group.
I think the ATN needs more examples, to show where and how it can be used.
Yeah, "attack vectors" and "malicious code" is far out of the picture.
On the other hand "multi-user operating system" is not so crazy a view of things.
We have multiple cores. Some will run code that the user has written. Some will run code snagged from OBEX or wherever. Perhaps code the user knows nothing about internally.
The trick is to make this soup of code play together easily and nicely.
The P1 does this. We hope the P2 continues that idea.
I agree. If you have multiple cogs writing to a particular LUT at potentially the same time, use WRLUTS. That will eliminate all uncertainty. There will be no possibility of OR'd data at OR'd addresses.
Maybe we should just think of this as a bonus feature.
Doesn't have to be ideal. Just makes something out of spare capacity that would otherwise feel wasted...
Strange that no one seems to be considering or analyzing the READ-path variant of this?
If one cog needs to tell 15 other cogs something, you want shared writing, not reading. Writing can happen instantly from one to all. Reading would have to be staggered in time. Shared writing just seems a lot more valuable than shared reading.
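A minimal C model of that contrast, assuming a broadcast write selected by a cog mask versus individual remote reads; broadcast_write and remote_read are illustrative names for this sketch, not P2 instructions:

```c
#include <stdio.h>
#include <stdint.h>

#define NUM_COGS 16
#define LUT_SIZE 512

static uint32_t lut[NUM_COGS][LUT_SIZE];

/* WRITE variant: one sender pushes the same long into every cog's LUT
   selected by the mask -- a single action from the sender's side.     */
static void broadcast_write(uint16_t cog_mask, int addr, uint32_t data)
{
    for (int cog = 0; cog < NUM_COGS; cog++)
        if (cog_mask & (1u << cog))
            lut[cog][addr] = data;
}

/* READ variant: each receiver must fetch the value from the sender's
   LUT itself, so the transfers are staggered in time.                 */
static uint32_t remote_read(int src_cog, int addr)
{
    return lut[src_cog][addr];
}

int main(void)
{
    broadcast_write(0xFFFE, 0x100, 12345);        /* cog 0 tells cogs 1..15 */
    printf("cog 7 sees %u\n", (unsigned)lut[7][0x100]);

    lut[0][0x100] = 67890;                        /* value lives in cog 0   */
    for (int cog = 1; cog < NUM_COGS; cog++)      /* 15 separate reads      */
        printf("cog %d read %u\n", cog, (unsigned)remote_read(0, 0x100));
    return 0;
}
```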
Well, yes... but I'm unclear about what applies to what is there now, and what applies to what you propose?
With all these changes I've lost track of what exactly ATN now does, but Chip did confirm it is still there, separate, and not merged into LUT_Any.
All that's implemented is COGATN D, which sets the ATN event on the cogs specified in the D mask. From the perspective of the receiving cog, its ATN event is set to the ORed input signals from the other cogs. All the receiving cog can do is WAITATN, POLLATN, or set an interrupt. The receiving cog has no ability to know which other cog or cogs signaled it.
That's where my proposal comes in. It adds a 16-bit ATN register that tracks where the signals came from.
Take a look at the 5 examples in the other thread and answer this: could they be done as easily, efficiently, and concisely if you only had the event (i.e. no ATN register)? If someone can come up with another approach that's just as effective, using the existing hardware, I'd like to see those examples re-written to show it. (On the other thread, please.)
(And, yes, I know this could be expanded to 32 bits, but I think that adds marginal gain for the added complexity. Whereas, I think adding the basic ATN register itself adds significant gain for the added complexity.)
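A rough C model of the above: the existing ATN event as described, plus the proposed 16-bit register that accumulates the senders' bits; cogatn, atn_event, and atn_reg are illustrative names for this sketch only:

```c
#include <stdio.h>
#include <stdint.h>

#define NUM_COGS 16

/* Per-cog attention state: the existing event flag, plus the proposed
   16-bit register recording which cogs raised the signal.             */
static int      atn_event[NUM_COGS];
static uint16_t atn_reg[NUM_COGS];

/* Model of COGATN D: raise ATN on every cog named in the destination mask. */
static void cogatn(int from_cog, uint16_t dest_mask)
{
    for (int cog = 0; cog < NUM_COGS; cog++) {
        if (dest_mask & (1u << cog)) {
            atn_event[cog] = 1;                  /* what exists today     */
            atn_reg[cog]  |= 1u << from_cog;     /* the proposed addition */
        }
    }
}

int main(void)
{
    cogatn(2, 1u << 5);      /* cog 2 signals cog 5 */
    cogatn(9, 1u << 5);      /* cog 9 signals cog 5 */

    /* Without the register, cog 5 only knows that "someone" signaled it.
       With it, cog 5 can see exactly which cogs did.                     */
    printf("cog 5 event=%d, sources=%04X\n", atn_event[5], (unsigned)atn_reg[5]);
    return 0;
}
```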
If you have a collision in the READ version, then all of the colliding cogs will get incorrect data. This could produce similarly hard to diagnose bugs as a collision in the WRITE version. In either case you will mess up seemingly unrelated processes.
It's not quite so 'unrelated':
To get bad data, both cogs need to be talking to another LUT, which in itself is rare, and gives you a massive clue.
Contrast that with the LUT corruption case: the passive COG that was being written to has a LUT memory location corrupted that is unrelated to the COG-to-COG link's memory areas.
With enough big print in the manual, a user might think where to look, but of course they will think any bug of their own making is also the 'write corruption' effect....
Maybe we should just think of this as a bonus feature.
Doesn't have to be ideal. Just makes something out of spare capacity that would otherwise feel wasted...
Strange that no one seems to be considering or analyzing the READ-path variant of this?
While I share your concerns about this possibly creating difficult-to-find bugs, I have to point out that the read path shares the same problem. Because the address bits from all the cogs are OR'ed into the LUT, having two cogs output different addresses at the same time will result in reading the wrong LUT location. Reading incorrect data from the LUT is no better than writing incorrect data to a LUT as far as having working code is concerned.
PS Even two cogs reading data simultaneously from the same locations in two LUTs will result in data corruption unless those locations had identical data.
I don't think this is true. If two COGs each write to a different LUT then no corruption will occur even if they write to different addresses. As I understand it, the problem only occurs if two COGs try to write to the same LUT at the same time. In that case, both the address and the data can be corrupted since the addresses and data values from each COG are ORed together.
That is it exactly. The difficult case is two cogs writing to the same LUT area on the same clock.
Personally, I believe somebody will find something great to do with the tough case. We need to keep it.
Well, if a bunch of COGs write to the same LUT at the same time and each uses a different address bit then the owner of the LUT can tell what combination of COGs did the writing by noticing which address was changed. Not sure how that would be useful though. :-)
A conflict only exists when multiple cogs write the same LUT on the same clock cycle using WRLUTX. In this case, addresses and write data are each OR'd together, causing errant data and an errant address. By using WRLUTS, this problem can be completely avoided, since each cog's write output will only occur during its unique 1-of-16 timeslot, thereby singulating in time the various writes.
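A small C model of that behaviour as described above: same-clock WRLUTX writes to one LUT OR their addresses and data together, while WRLUTS writes land in separate per-cog timeslots; the function names here are illustrative, not real instruction encodings:

```c
#include <stdio.h>
#include <stdint.h>

#define LUT_SIZE 512

static uint32_t lut[LUT_SIZE];   /* one target cog's LUT */

/* Two cogs hitting the same LUT with WRLUTX on the same clock:
   the addresses and the data are each OR'd before the write,
   producing an errant value at an errant address.              */
static void wrlutx_same_clock(int addr_a, uint32_t data_a,
                              int addr_b, uint32_t data_b)
{
    lut[addr_a | addr_b] = data_a | data_b;
}

/* WRLUTS: each cog only drives its write in its own 1-of-16 timeslot,
   so the writes are singulated in time and nothing gets OR'd.         */
static void wrluts(int addr, uint32_t data)
{
    lut[addr] = data;
}

int main(void)
{
    wrlutx_same_clock(0x020, 0x11111111, 0x041, 0x22222222);
    printf("WRLUTX clash: LUT[%03X] = %08X\n",
           0x020 | 0x041, (unsigned)lut[0x020 | 0x041]);

    wrluts(0x020, 0x11111111);   /* cog A's timeslot            */
    wrluts(0x041, 0x22222222);   /* cog B's timeslot, later on  */
    printf("WRLUTS: LUT[020] = %08X, LUT[041] = %08X\n",
           (unsigned)lut[0x020], (unsigned)lut[0x041]);
    return 0;
}
```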
Well, say the LUT holds some image data, or color definitions. Four other COGs could contribute their part of the definition: red, green, blue, alpha. They all use the same target address and each contributes its portion of the bitfield.
That is just one use I could think of. There will be much better ones in time.
Until then, we observe best practice, or use the safe form of the feature.
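A tiny C sketch of that RGBA idea, assuming the OR-merge behaviour described above; the byte-lane assignments are purely illustrative:

```c
#include <stdio.h>
#include <stdint.h>

/* Four cogs deliberately write the SAME LUT address on the same clock,
   each driving a different byte lane, so the OR-merge assembles the
   full RGBA value in one clock.                                        */
int main(void)
{
    uint32_t red   = 0x40u << 24;   /* cog 1 contributes red   */
    uint32_t green = 0x80u << 16;   /* cog 2 contributes green */
    uint32_t blue  = 0xC0u << 8;    /* cog 3 contributes blue  */
    uint32_t alpha = 0xFFu;         /* cog 4 contributes alpha */

    uint32_t lut_entry = red | green | blue | alpha;      /* hardware OR-merge */
    printf("merged RGBA = %08X\n", (unsigned)lut_entry);  /* 4080C0FF          */
    return 0;
}
```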
PS Even two cogs reading data simultaneously from the same locations in two LUTs will result in data corruption unless those locations had identical data.
I don't think this is true. If two COGs each write to a different LUT then no corruption will occur even if they write to different addresses. As I understand it, the problem only occurs if two COGs try to write to the same LUT at the same time. In that case, both the address and the data can be corrupted since the addresses and data values from each COG are ORed together.
I think you may be right, depending on how the muxing is done for the read case.
A conflict only exists when multiple cogs write the same LUT on the same clock cycle using WRLUTX. In this case, addresses and write data are each OR'd together, causing errant data and an errant address. By using WRLUTS, this problem can be completely avoided, since each cog's write output will only occur during its unique 1-of-16 timeslot, thereby singulating in time the various writes.
Close - by everyone using WRLUTS this problem can be completely avoided.
Of course, that gives a large speed hit, but it does completely avoid the issue.
If 14 or 15 COGs use WRLUTS and 1 or 2 use WRLUTX, you cannot be sure you are OK until you check that the destination COGs can never overlap.
It's not quite so 'unrelated':
To get bad data, both cogs need to be talking to another LUT, which in itself is rare, and gives you a massive clue.
If Cog 1 tries to read LUT address A in Cog 2,
and Cog 3 tries to read LUT address B in Cog 2,
on the same clock cycle, without the RDLUTS option in place,
won't both Cog 1 and Cog 3 receive data that is from Cog 2's LUT address (A OR B)?
Yes, which is what I said.
Notice they are both accessing another COG (#2), at the time they get bad data.
Also, target COG 2 is not affected in any way here.
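A small C model of that case, assuming the read addresses OR together on a same-clock collision as described; cog2_lut and the chosen addresses are illustrative:

```c
#include <stdio.h>
#include <stdint.h>

#define LUT_SIZE 512

static uint32_t cog2_lut[LUT_SIZE];   /* the target cog's LUT */

int main(void)
{
    int addr_a = 0x010, addr_b = 0x101;

    cog2_lut[addr_a]          = 0xAAAAAAAA;   /* what cog 1 wanted      */
    cog2_lut[addr_b]          = 0xBBBBBBBB;   /* what cog 3 wanted      */
    cog2_lut[addr_a | addr_b] = 0xDEADBEEF;   /* what they actually get */

    /* Same-clock reads without RDLUTS: the address lines OR together,
       so both readers receive the word at (A OR B) = 0x111, and cog 2's
       LUT itself is left untouched.                                     */
    uint32_t seen_by_cog1 = cog2_lut[addr_a | addr_b];
    uint32_t seen_by_cog3 = cog2_lut[addr_a | addr_b];

    printf("cog 1 sees %08X, cog 3 sees %08X\n",
           (unsigned)seen_by_cog1, (unsigned)seen_by_cog3);
    return 0;
}
```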
A conflict only exists when multiple cogs write the same LUT on the same clock cycle using WRLUTX. In this case, addresses and write data are each OR'd together, causing errant data and an errant address. By using WRLUTS, this problem can be completely avoided, since each cog's write output will only occur during its unique 1-of-16 timeslot, thereby singulating in time the various writes.
Close - by everyone using WRLUTS this problem can be completely avoided.
Of course, that gives a large speed hit, but it does completely avoid the issue.
If 14 or 15 COGs use WRLUTS and 1 or 2 use WRLUTX, you cannot be sure you are OK until you check that the destination COGs can never overlap.
But nobody should be so careless to write software which mixes WRLUTX and WRLUTS when writing the same LUT(s), right?
Ideally no, but proving this is avoided in a design situation may not be so simple.
Especially with OBEX having any mix of approaches, you will need to take care to track which COG does what.
Or, someone may be after 'just a bit more speed':
"Hey, look, WRLUTX is faster, cool."
"Seems to work fine too, let's ship that !"
In a system where multiple writers write to a LONG at the same time chaos ensues.
In this case the atomic unit is not a LONG but a whole LUT.
Not so different.
Personally, I'd be happy to see all this LUT business scrapped so we can get on with making a chip.
I know the feeling. I even started a whole separate thread on the expanded ATN mechanism, but it's going largely uncommented upon. Which is surprising for this group.
As for the READ variant, I think the data corruption issue is a red herring. Data written to the wrong address or data read from the wrong address will still potentially cause problems; it's just a matter of which cog it will cause problems for. One issue with READ that I do see is that you have to do more work to get the same one-to-many effect that WRITE allows. In other words, with WRITE, a single cog distributes a value to several LUTs at once. But, with READ, multiple cogs must work to get the same single value from the first cog.
At least, that is what I'm seeing and expecting.
Let's get on with the show!!
The Read version has the same ultimate concerns, does it not? However, it seems harder to wrap one's head around than the write variant.
Not really; they are both one-way paths that need some inter-COG rules.
Read avoids the nasty side effect of permanently changing memory that was never addressed.
Read still has same-slot effects, but now they are immediate and much easier to debug.
Read gains some very useful debug and trace features, but it does lose the carefully timed OR-merge of Write. Is there a real use case for that?
Personally, I prefer anything that makes debug & development easier.
Yes and no. Read corruption occurs immediately, on the opcode used, so it is easier to identify and debug.
Corruption of unexpected, unaddressed memory is far harder to identify and debug, and likely triggers multiple failures.
Not only did you not put data where you expected, thus giving your own process bad info, but you have also corrupted some other, probably tested, code or data.
If that other code crashes first, where do you naturally start looking ?
Yes, there are some trade-offs.
Read gains some great, non-intrusive, debug and trace visibility.
I thought that's exactly what I did.
Just say no to LUT sharing. In any form.
Get me it, the chip, already!
I agree. If you have multiple cogs writing to a particular LUT at potentially the same time, use WRLUTS. That will eliminate all uncertainty. There will be no possibility of OR'd data at OR'd addresses.
If one cog needs to tell 15 other cogs something, you want shared writing, not reading. Writing can happen instantly from one to all. Reading would have to be staggered in time. Shared writing just seems a lot more valuable than shared reading.
All that's implemented is COGATN D, which sets the ATN event on the cogs specified in the D mask. From the perspective of the receiving cog, its ATN event is set to the ORed input signals from the other cogs. All the receiving cog can do is WAITATN, POLLATN, or set an interrupt. The receiving cog has no ability to know which other cog or cogs signaled it.
That's where my proposal comes in. It adds a 16-bit ATN register that tracks where the signals came from.
Take a look at the 5 examples in the other thread and answer this: could they be done as easily, efficiently, and concisely if you only had the event (i.e. no ATN register)? If someone can come up with another approach that's just as effective, using the existing hardware, I'd like to see those examples re-written to show it. (On the other thread, please.)
(And, yes, I know this could be expanded to 32 bits, but I think that adds marginal gain for the added complexity. Whereas, I think adding the basic ATN register itself adds significant gain for the added complexity.)
It's not quite so 'unrelated':
To get bad data, both cogs need to be talking to another LUT, which in itself is rare, and gives you a massive clue.
Contrast that with the LUT corruption case: the passive COG that was being written to has a LUT memory location corrupted that is unrelated to the COG-to-COG link's memory areas.
With enough big print in the manual, a user might think where to look, but of course they will think any bug of their own making is also the 'write corruption' effect....
Besides, two cogs can signal one another with the writes, and the attention system.