It's not quite so 'unrelated':
To get bad data, both cogs need to be talking to the same other COG's LUT, which in itself is rare, and gives you a massive clue.
If Cog 1 tries to read LUT address A in Cog 2,
and Cog 3 tries to read LUT address B in Cog 2,
on the same clock cycle, without the RDLUTS option in place,
won't both Cog 1 and Cog 3 receive data that is from Cog 2 LUT address (A or B)?
Yes, which is what I said.
Notice they are both accessing another COG (#2) at the time they get bad data.
Also, target COG 2 is not affected in any way here.
OK, but the "ultimate problem" is similar in both cases. However, the WRLUT approach, to me, is much more intuitive than sucking data through RDLUTs from the other end.
But yes, I accept that in effect the RDLUT-empowered "parasitic" cogs (Cogs 1 and 3) can be paralleled up without affecting the operation of the original Cog 2. Is it worth the less intuitive approach for purist peace of mind? Not sure.
Perhaps if this had been framed in terms of a debug tool from the start, rather than some way of, what, sharing sprites, we'd be thinking differently.
It seems strange that Heater, the poster child for "A Boy and his Soldering Iron", should be so quick to emasculate the P2 architecture and instruction set in order to protect us from ourselves.
The only reason I mentioned image data was to share a use case. There will be better ones, I am sure. Multicog sync is another easy one mentioned by Chip.
Another use case might just be using a COG to fetch code, maybe in tandem with another one fetching from SDRAM. We may find an advanced kind of LMM is possible. Running code "in place" from big external RAM. It may be possible to get this to really perform. The safe mode won't go as fast.
Overlays?
If we frame it as anything, it's advanced. Something we know will get used for X, Y, Z, but like the events, we have it in there to improve on the possible. We don't know what that will look like today, but we will be glad we did sometime in the future.
The same argument was used for a number of features in this design. I believe we are making the right call on them too.
You're probably right, PH; it doesn't have to be framed as 'debug', but doing that does provide a different starting point that (imho) leans towards a different evaluation.
You're also right that we need to evaluate usage cases against these proposals. I think the best way to achieve this would be for Chip to release an image with "something", and let's see how it works in practice.
In general, getting data out of the prop is really well catered for, between streamers and smart pins. It's further up the dataflow that the benefits of something like this are really needed.
A conflict only exists when multiple cogs write the same LUT on the same clock cycle using WRLUTX. In this case, addresses and write data are each OR'd together, causing errant data and an errant address. By using WRLUTS, this problem can be completely avoided, since each cog's write output will only occur during its unique 1-of-16 timeslot, thereby singulating in time the various writes.
But nobody should be so careless to write software which mixes WRLUTX and WRLUTS when writing the same LUT(s), right?
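To make that failure mode concrete, here is a minimal C sketch of the behaviour described above (purely an illustrative software model, with made-up names and a plain array standing in for a cog's 512-long LUT; it is not the actual hardware logic). Two unslotted writes that hit the same LUT on the same clock get their addresses and data OR'd into a single errant write:

#include <stdio.h>
#include <stdint.h>

/* Toy model of one target cog's 512-long LUT. */
static uint32_t lut[512];

/* Model of the conflict described above: all write requests that arrive
   for this LUT on the same clock are OR'd into one address and one data
   value before the single write port is used. */
static void same_clock_writes(const uint32_t *addr, const uint32_t *data, int n)
{
    uint32_t a = 0, d = 0;
    for (int i = 0; i < n; i++) {
        a |= addr[i];
        d |= data[i];
    }
    lut[a & 0x1FF] = d;               /* one errant write lands */
}

int main(void)
{
    /* Cog 1 wants lut[$010] = $AAAA, Cog 3 wants lut[$101] = $5555,
       both on the same clock, both using the fast unslotted write. */
    uint32_t addr[] = { 0x010, 0x101 };
    uint32_t data[] = { 0xAAAA, 0x5555 };

    same_clock_writes(addr, data, 2);

    /* Neither intended location holds the intended value; instead
       lut[$111] now holds $FFFF - errant address AND errant data. */
    printf("lut[0x010]=%X lut[0x101]=%X lut[0x111]=%X\n",
           (unsigned)lut[0x010], (unsigned)lut[0x101], (unsigned)lut[0x111]);
    return 0;
}

With the slotted write, each cog's request would instead be presented only during its own 1-of-16 timeslot, so the two writes land on different clocks and both locations end up correct.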
To be on the safe side the code should use WRLUTS, but doesn't that vanish or greatly reduce any advantage of the LUT sharing? WRLUTX and its speed advantage should be used only when the code knows exactly what the others are doing, and even in that case it could happen that two COGs use WRLUTX at the same time, conflicting with each other. Is that right?
Looks to me like these two instructions are a great source of trouble.
To be on the safe side the code should use WRLUTS, but doesn't that vanish or greatly reduce any advantage of the LUT sharing? WRLUTX and its speed advantage should be used only when the code knows exactly what the others are doing, and even in that case it could happen that two COGs use WRLUTX at the same time, conflicting with each other. Is that right?
It's quite simple. The primary purpose of this feature is to provide for any two Cogs to have fast, low-latency, and deterministic dataflow as a one-on-one arrangement. Up to eight pairs of Cogs. Anything else is icing. For this singular ability you use WRLUTX. For everything else just use HubRAM instead ... although it would be possible to use WRLUTX in a daisy-chain with more than two Cogs.
But Chip was nice and added an extra instruction, WRLUTS, to allow the extra multi-write abilities too.
WRLUTX and its speed advantage should be used only when the code knows exactly what the others are doing, and even in that case it could happen that two COGs use WRLUTX at the same time, conflicting with each other. Is that right?
Close. The critical element is hitting both Same-Target-COG_LUT and Same-SysCLK.
Two closely co-operating and isolated COGs will likely be quite OK.
A mish-mash from OBEX? Weellll...
Looks to me like these two instructions are a great source of trouble.
Quite possibly, but Chip says this: "But nobody should be so careless to write software which mixes WRLUTX and WRLUTS when writing the same LUT(s), right?"
Which I guess will go on page 1 of the Manual, that everyone reads...?
WRLUTS is still much lower latency than going out to HUB and back from there into the other COG. So, I wouldn't say that it "vanishes or greatly reduces" the advantages.
WRLUTX and its speed advantage should be used only when the code knows exactly what the others are doing, and even in that case it could happen that two COGs use WRLUTX at the same time, conflicting with each other. Is that right?
Close. The critical element is hitting both Same-Target-COG_LUT and Same-SysCLK.
Two closely co-operating and isolated COGs will likely be quite OK.
Maybe a truth table explaining what happens with each instruction would help me and others better understand the interactions. Something with clock cycle, cog1 instruction, cog2 instruction, cog3 target, cog4 target, with all possible combinations of WRLUTS/X and the effects on the targets. Would that be possible to do?
I think this will turn out to be a case like coginit vs. cognew, but much worse: "Yes, wrlutx is there, and you could use it, but please just don't." Anything whose misuse might cause problems will get misused and will cause problems. How many times have we had to correct programmers for their use of coginit? I think it's important to consider the customer-support consequences of every feature embodied in the P2. Features designed for the priesthood will get used by the laity, whether they fully understand them or not. And this one is particularly pernicious, since the bad consequences of misuse will occur only rarely, making them almost impossible to debug.
I say just get rid of it. Take the performance hit. That matters way less than the support issues.
Maybe a truth table explaining what happens with each instruction would help me and others better understand the interactions. Something with clock cycle, cog1 instruction, cog2 instruction, cog3 target, cog4 target, with all possible combinations of WRLUTS/X and the effects on the targets. Would that be possible to do?
Yes, ideally such a table could include HUB and the new DAC_Data pathways too.
Chip has yet to define the end-to-end delays in DAC_Data, but they are in the region of 10 cycles for a byte payload, increasing for 16- and 32-bit loads.
WRLUTS can take 16 clocks to slot-sync, then add the LUT read time, whilst a HUB write/read could need 16+16. Some may call that 'much lower', but I'd call that 'about half', and a lot slower than the 2 cycles a direct WRLUTX takes.
I would call a change from 2 to 16 "greatly reduced".
I think this will turn out to be a case like coginit vs. cognew, but much worse: "Yes, wrlutx is there, and you could use it, but please just don't." Anything whose misuse might cause problems will get misused and will cause problems. How many times have we had to correct programmers for their use of coginit? I think it's important to consider the customer-support consequences of every feature embodied in the P2. Features designed for the priesthood will get used by the laity, whether they fully understand them or not. And this one is particularly pernicious, since the bad consequences of misuse will occur only rarely, making them almost impossible to debug.
I say just get rid of it. Take the performance hit. That matters way less than the support issues.
-Phil
It seems to me that this feature is only going to be used by objects which start those cogs whose LUTs will be written to. I don't think anyone would use it as a general interface, as they would have no idea of the context.
It seems to me that this feature is only going to be used by objects which start those cogs whose LUTs will be written to. I don't think anyone would use it as a general interface, as they would have no idea of the context.
That's a good point - the tools will need something along the lines of PUBLIC and EXTERN that common linkers use, to track and allocate memory referenced in more than one module.
Ideally, compile-time knowledge of which COGs will be using this will allow smaller code in those objects which start those cogs whose LUTs will be written to.
I've got 16 cogs compiling now with this new LUT writing. It's been two hours, so far. I pulled out 24 smart pins, which left 40.
Do you get any 'trajectory indicators' on these big compiles, or does it simply cough at the end?
If it doesn't work, it just never finishes. If it goes over two hours, it's probably not going to finish. I just took out 8 more smart pins and restarted it.
That's a good point - the tools will need something along the lines of PUBLIC and EXTERN that common linkers use, to track and allocate memory referenced in more than one module.
Ideally, compile-time knowledge of which COGs will be using this will allow smaller code in those objects which start those cogs whose LUTs will be written to.
That's the opposite of what Chip said. There are no memory allocations; it's just a LUT.
In fact it doesn't make much sense beyond custom bound assembly. HubRAM works fine for everything else.
I started a Quartus compile one night and left it to do its thing overnight.
I was surprised that 9 hours later it was still going.
It did finish but the resulting image was very erratic.
I made quite a few changes and had another attempt; 1 hour later, success.
The joy of using Quartus.
... "Yes, wrlutx is there, and you could use it, but please just don't." Anything whose misuse might cause problems will get misused and will cause problems. How many times have we had to correct programmers for their use of coginit? ...
I say just get rid of it. Take the performance hit. That matters way less than the support issues.
WRLUTX is the only reason this hardware investment is there. If WRLUTS becomes the only option then the new hardware fails its primary purpose.
On that note, I would recommend removing WRLUTS and relegating the hardware to its sole purpose. HubRAM works much better for less demanding uses.
That's the opposite of what Chip said. There are no memory allocations; it's just a LUT.
Perhaps we read different posts? Or use quite different tools?
Chip said "I don't think anyone would use it as a general interface, as they would have no idea of the context", which to me means each COG has to understand which addresses each is going to use - aka context.
He is right, they are going to be naturally paired, and developed that way.
Cog A has to write to some part of LUT.B and likewise Cog B has to write to some agreed part of LUT.A - just like function parameter passing.
Now, they are going to have to agree on names for those, and if they (say) want to add two more, it is nice to have the tools allocate those.
I have no idea how you imagine doing this without some memory allocation context; even something as primitive as a series of equates is still user memory allocation.
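As a sketch of what that 'series of equates' might amount to in practice (C-style constants and a PC-side model here, since nothing about the final syntax is fixed, and every name is made up): the two paired cogs share one agreed layout that pins down which LUT locations each side owns, which is exactly the PUBLIC/EXTERN-style contract mentioned earlier:

#include <stdio.h>
#include <stdint.h>

/* Hypothetical agreed layout between a paired Cog A and Cog B.
 * Cog A writes into Cog B's LUT at these locations, and vice versa,
 * just like an agreed function-parameter block.                    */
enum {
    /* region of LUT.B that Cog A is allowed to write */
    B_CMD    = 0x1F0,   /* command word                 */
    B_ARG0   = 0x1F1,   /* first argument               */
    B_ARG1   = 0x1F2,   /* second argument              */
    B_GO     = 0x1F3,   /* written last, acts as "go"   */

    /* region of LUT.A that Cog B is allowed to write */
    A_RESULT = 0x1F0,   /* result value                 */
    A_DONE   = 0x1F1    /* written last, acts as "done" */
};

/* Toy model of the two LUTs so the layout can be exercised on a PC. */
static uint32_t lut_a[512], lut_b[512];

int main(void)
{
    /* Cog A side: deposit a request directly into Cog B's LUT. */
    lut_b[B_CMD]  = 1;   /* e.g. "add" */
    lut_b[B_ARG0] = 2;
    lut_b[B_ARG1] = 3;
    lut_b[B_GO]   = 1;

    /* Cog B side: read its own LUT locally, reply into Cog A's LUT. */
    if (lut_b[B_GO]) {
        lut_a[A_RESULT] = lut_b[B_ARG0] + lut_b[B_ARG1];
        lut_a[A_DONE]   = 1;
    }

    printf("result=%u done=%u\n",
           (unsigned)lut_a[A_RESULT], (unsigned)lut_a[A_DONE]);
    return 0;
}

The point is only the shared header of agreed offsets; whether the tools generate it or the two objects simply include the same constants is the tooling question raised above.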
A conflict only exists when multiple cogs write the same LUT on the same clock cycle using WRLUTX. In this case, addresses and write data are each OR'd together, causing errant data and an errant address. By using WRLUTS, this problem can be completely avoided, since each cog's write output will only occur during its unique 1-of-16 timeslot, thereby singulating in time the various writes.
Close - by everyone using WRLUTS this problem can be completely avoided.
Of course, that gives a large speed hit, but it does completely avoid the issue.
If 14 or 15 COGs use WRLUTS and 1 or 2 use WRLUTX, you are not sure you are OK until you check that the destination COGs can never overlap.
But nobody should be so careless to write software which mixes WRLUTX and WRLUTS when writing the same LUT(s), right?
That is precisely the argument used against shared LUT. The assumption is no one will check the OBEX code they use.
I love the new LUT concept. What concerns me is the random corruption that may occur, without any seeming relationship to what was done.
Would it take much logic to prevent a write, or only allow one write, if more than one cog attempts to write to the same cog? Alternatively, delay to the next clock, even if it becomes an indeterminate number of clocks?
However, IMHO the shared LUT is far simpler, way less silicon (might give us some more hub RAM), and easy to implement. I am sorry, but I just do not agree that we cannot allocate cogs for shared-LUT code early in the initialisation. If there are no adjacent cogs, then the system can fail gracefully. It is neater than what happens when you don't allocate enough stack space, or, IMHO, when you corrupt an unknown LUT address because two cogs write to another cog's LUT simultaneously.
BTW Chip, I keep thinking there is likely a mix of these busses that may yield a simpler and superior hub/LUT mix. I think I need to draw a diagram and look at the paths.
The LUT is almost equivalent to a multi-port hub RAM.
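On the 'only allow one write' question a few lines up: one conceivable behaviour (not something Chip has proposed, just a software model for comparison with the OR'ing) is a fixed-priority pick, where the lowest-numbered requesting cog wins that clock and the other writes are simply dropped. That trades errant data for silently lost writes:

#include <stdio.h>
#include <stdint.h>

static uint32_t lut[512];

/* One conceivable arbiter for same-clock writes to the same LUT:
 * lowest-numbered requesting cog wins, the rest are dropped.  This is
 * only a comparison model; it is NOT how the proposed hardware works. */
static void same_clock_writes_priority(const int *cog, const uint32_t *addr,
                                       const uint32_t *data, int n)
{
    int winner = 0;
    for (int i = 1; i < n; i++)
        if (cog[i] < cog[winner])
            winner = i;

    lut[addr[winner] & 0x1FF] = data[winner];

    for (int i = 0; i < n; i++)
        if (i != winner)
            printf("write from cog %d silently lost\n", cog[i]);
}

int main(void)
{
    int      cog[]  = { 3, 1 };
    uint32_t addr[] = { 0x101, 0x010 };
    uint32_t data[] = { 0x5555, 0xAAAA };

    same_clock_writes_priority(cog, addr, data, 2);
    printf("lut[0x010]=%X lut[0x101]=%X\n",
           (unsigned)lut[0x010], (unsigned)lut[0x101]);
    return 0;
}

Whether a lost write is any easier to debug than a corrupted one is, of course, part of the same argument.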
I just took out 8 more smart pins and restarted it.
Bugger, that's half of 'em! 400 x 32 = 12800 ALMs. I suspect ALMs are not the problem. Long run routes maybe? EDIT: The failed case is actually 400 x 24 = 9600 ALMs. Maybe that much is needed ... we'll find out soon enough I suppose. We're up to 90 minutes now ...
This might be the death of the feature right here.
OK, but the "ultimate problem" is similar in both cases. However, the WRLUT approach, to me, is much more intuitive than sucking data through RDLUTs from the other end.
But yes, I accept that in effect the RDLUT-empowered "parasitic" cogs (Cogs 1 and 3) can be paralleled up without affecting the operation of the original Cog 2. Is it worth the less intuitive approach for purist peace of mind? Not sure.
Perhaps if this had been framed in terms of a debug tool from the start, rather than some way of, what, sharing sprites, we'd be thinking differently.
Your call, Chip
Quite possibly, but Chip says this: "But nobody should be so careless to write software which mixes WRLUTX and WRLUTS when writing the same LUT(s), right?"
Which I guess will go on page 1 of the Manual, that everyone reads...?
It just has to be in the instruction description, is all.
Maybe a truth table explaining what happens with each instruction would help me and others better understand the interactions. Something with clock cycle, cog1 instruction, cog2 instruction, cog3 target, cog4 target, with all possible combinations of WRLUTS/X and the effects on the targets. Would that be possible to do?
The rule is: if there is any possibility of more than one simultaneous writer to a specific LUT, then all writers to that LUT have to use WRLUTS.
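Not the full truth table asked for, but a rough sketch of the same-clock combinations for two writer cogs aimed at one target LUT, as implied by Chip's description and the rule above (the mixed row is my reading of why the rule exists):

  Writer 1   Writer 2   Can both hit the target on the same clock?   Result
  WRLUTX     WRLUTX     yes                                          addresses and data OR'd: errant address and errant data
  WRLUTX     WRLUTS     yes, if the X write lands on the S cog's     still a potential collision, hence the rule above
                        timeslot
  WRLUTS     WRLUTS     no                                           serialised by the 1-of-16 timeslots, no conflict
  any        any        different target LUTs or different clocks    no conflict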
WRLUTS can take 16 clocks to slot-sync, then add the LUT read time, whilst a HUB write/read could need 16+16. Some may call that 'much lower', but I'd call that 'about half', and a lot slower than the 2 cycles a direct WRLUTX takes.
Going to hub and back into another cog is not just 16+16. You are smart enough to know that...
I was talking about the times to slot-sync, but the P2 DOCS say this about added wait states:
"When a cog wants to read or write the hub RAM, it must wait up to 15 clocks to access the RAM slice of interest."
Add that wait to the opcode time, plus any pipelines, plus any handshakes/flags, but those tend to be common, no matter what transport path is chosen.
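Pulling the numbers quoted in this thread together (worst case, and ignoring the opcode, pipeline and handshake overhead that, as noted above, is common to every path):
WRLUTX, direct into the other cog's LUT: about 2 clocks.
WRLUTS, time-slotted: up to about 16 clocks waiting for the writing cog's 1-of-16 slot, plus the receiving cog's own RDLUT.
Via hub RAM: up to about 15 clocks for the writer to reach its hub slice, then up to another 15 for the reading cog to reach that slice, plus the hub write and read instructions themselves.
So roughly 2 versus ~16 versus 30-plus clocks worst case, which is where both the 'about half' and the 'greatly reduced' characterisations above come from.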