I see you have come to the same conclusion. The "mix master" scheme does indeed give rise to COG ID and HUB address dependency. This is not good.
As I said above, it seems possible, quite by accident, to write code that fails when moved around. Worse still, it will fail in unpredictable, hard-to-debug ways!
How would I ensure that my code does not have such COG/HUB addressing dependencies?
Is it really an issue in practice?
After all, that first HUB access is best assumed to occur at a random time, even on the P1. Also, when responding to external events and consequently writing to HUB (e.g. a UART Rx), we can assume random HUB access delays because the external events are not synced to the HUB.
I was going to suggest the new scheme be called the "Monte Carlo HUB Arbiter". After all it is a variant of my crazy suggestion to randomize HUB access when more than one cog is in play. Only here the "randomness" is being supplied by the compiler deciding what addresses things sit at.
My original "Monte Carlo HUB Arbiter" would give even greater HUB performance if it were feasible.
It comes down to treating it like Roy said. If you want a fast transfer, use a block. If you are working on longs, words, bytes, then those will happen when they happen, but you can assume it will take no more than 16 cycles to happen. If you want to 'sync' up, write an address, then you know where you are, and work from there, and that will work on any COG.
Interesting split, if you ask me.
People can get complicated, and that's going to lock code in many cases. It needs to be COG / address, but only addresses modulo 16 matter: the lowest nibble. Or people can not get complicated, and then it's not going to lock to a COG and it will work anywhere.
Choices, which is exactly what people asked for, isn't it?
And that means some learning is going to have to happen. Remember the confusion around hub syncs early on in the P1? Well, we got those sorted. We will get this one sorted.
I'm wanting to run some code, and so that's as far as I'm taking the discussion. When we get the FPGA, it's fun time! Maybe it's horrible. Maybe it's awesome.
It WILL be AWESOME !!!
There will be gotchas that can stay subtly hidden and be difficult to explain simply.
But there will be ways to avoid it. We will not be able to time windows precisely like we could on the P1. There will be slop depending on your cogid and the memory address you are accessing.
potatohead: remember your video code, where I helped sort instructions in between hub accesses? Well, you won't be able to do that as precisely now.
Here's the difference. With the round robin, we know exactly how many instructions to pack in between hub accesses. So we do that, putting things out of order, whatever it takes to maximize those access cycles.
With this scheme, we know it could take up to 16 cycles. So we put the instructions NEEDED between hub accesses, and on average it's going to perform very well, and we know the worst case.
Or, if we know something about the addresses and/or COGs, we can write to one, know where we are, and then do it exactly.
In return for that, we get killer fast blocks, and we get a lot more COGS, and the COGS are performing very well, at the same time!
When more than one COG is on a task, there are more hubops going on overall, meaning the task gets done a lot quicker, despite the slop per COG.
Cluso,
Why do you think cogid matters at all in this scheme? Because it absolutely does not!
The same code will run the same way no matter what cog it runs in. HUB access works the same way for all cogs. You need to explain why you think otherwise, so we can set you straight, because there is nothing that will make code written for one cog fail when it gets moved to another cog. NOTHING. WHY do you think this, and WHY are you stating it as fact for all to read and be confused by?
potatohead is right, this hub scheme is still deterministic. You people are just not thinking about it clearly.
When you write/read an address, you know exactly how many clocks to have available until you can read any given address, the difference is that it's not the same for every address like it was on P1. It's still known and can be exploited if you need the performance. It only takes a little extra care and handling, but you've always had to have extra care and handling for HUB access when you wanted precise timing / best performance.
The added bonus is that if your access order is chosen well, you can deterministically go faster than before.
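Roy's "known clocks" point can be sketched with a toy model. Assume, as described in this thread, 16 hub slices selected by the low nibble of the long address, with the access window rotating one slice per clock; these are this discussion's working assumptions, not chip-verified timing.

```python
# Toy model of the rotating-hub wait times discussed in this thread.
# Assumption: 16 slices keyed by the low nibble of the long address,
# with the window advancing one slice per clock. Illustrative only.

def wait_clocks(prev_nibble: int, next_nibble: int) -> int:
    """Clocks from one granted access until the slice holding
    next_nibble comes around (a full 16 for the same slice)."""
    gap = (next_nibble - prev_nibble) % 16
    return gap if gap else 16

print(wait_clocks(0x1, 0x5))   # 4  (read xxx1, then xxx5)
print(wait_clocks(0x2, 0xC))   # 10 (read xxx2, then xxx12)
print(wait_clocks(0x7, 0x8))   # 1  (consecutive longs stream back-to-back)
```

Under this model the wait is always computable once the previous access's slice is known, which is exactly the "extra care and handling" being described.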
Roy, even Chip admits this technique will suffer jitter when accessing random addresses. Yes, you can fix it - but at the cost of added code complexity and the loss of all speed improvements.
It still looks worthwhile - but let's not pretend it is a "magic bullet". It will make this chip harder to program deterministically.
RossH,
I didn't say there would not be jitter with random access. I said that, with extra handling and care, you could deterministically read in orders other than sequential. Truly random access will have jitter/waits for hub window alignment, but code that needs performance/precise timing usually isn't doing truly random access. It may not be sequential, or even near enough for block access, but you can write code to sync those reads up with the hub.
The hub windows are deterministic.
I think we are actually agreeing. You are just putting a more positive "spin" on the deficiencies than me.
I do not want to derail the new hub method. I think it's fantastic. But it is not consistent access for all cogs and all hub addresses.
So, unfortunately it is not absolutely deterministic down to a clock cycle. It will not matter to most programs, but definitely not all.
Here is a link to where I showed that the hub access is not consistent. See item 3.
http://forums.parallax.com/showthread.php/155675-New-Hub-Scheme-For-Next-Chip?p=1267543&viewfull=1#post1267543
Yes, and that task is made easier with the suggested simple WAIT variant I've called SyncCNT.
I would agree, but would phrase it a little more carefully, as there are two components to the process.
Hub access spokes are 100% cycle deterministic, as Roy says.
The time the COG waits for one of those spokes varies with the address nibble and the present spoke value.
If you are careful with your addresses those waits can become cycle deterministic.
If you do not know the address, then you do not know the precise wait, without applying a snap correction.
The suggestion is to optionally attach/detach that snap correction to HUB opcodes at run time.
Cluso,
You have a different definition of deterministic than I do. You think because the number of wait clocks changes at all, it's not deterministic. To me determinism means that you can predict accurately how many clocks it will take for a given sequence of instructions. The HUB runs continuously from chip startup at the same rate, so it's deterministic, period. Once you have done a read/write, future read/write timing is predictable accurately, even if it has different sized wait times.
Just because random read/write accesses have variable wait times, doesn't mean it'll be random wait times. A given sequence will have the same wait time pattern every time it runs, except, of course, the first access, just like P1.
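A sketch of that claim, under the same assumed model used throughout this thread (16 slices by low address nibble, one slice per clock): replaying a fixed sequence of address nibbles yields the same wait pattern on every run after the first sync, which is the sense of "deterministic" Roy is using.

```python
# Sketch of "same sequence, same wait pattern", under the thread's
# assumed model: waits are set solely by the gap between successive
# low address nibbles, so a fixed sequence repeats its pattern.

def wait_pattern(nibbles):
    """Per-access waits after the first (unpredictable) sync."""
    waits = []
    for prev, nxt in zip(nibbles, nibbles[1:]):
        gap = (nxt - prev) % 16
        waits.append(gap if gap else 16)
    return waits

seq = [0x1, 0x5, 0x2, 0xC, 0xC]
print(wait_pattern(seq))   # [4, 13, 10, 16] -- identical on every run
```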
Roy,
re your second paragraph....
Say I write an object that uses a hub buffer of 'x' bytes and/or operates in hubexec mode.
When the object is called, the hub buffer would typically be passed in 'par' or 'ptra/b?'.
This address will likely vary from my original testing.
Likewise, the hubexec code may have shifted, or been relocated.
The cog may be a different cogid.
All these differences can affect the wait times that result for the various hub accesses. The waits are fixed only if all of the above remain the same, and you cannot guarantee this, not even between compiles while testing. Thus, you cannot have absolutely deterministic code.
A given sequence will have the same wait time pattern every time it runs, except, of course, the first access, just like P1.
I don't know if this is a serious issue or not, but in general this is not true.
Consider:
1) This new HUB arbiter scheme introduces a HUB access delay that depends on the addresses being accessed. True, it has a known upper bound, and the average access delay falls between zero and that maximum. All well and good.
2) If I am a normal code monkey working in an HLL, like C, I neither know nor care what actual addresses things live at. Addresses are effectively randomized and can change from compile to compile as I add and remove code and move things around. Heck, that is true if I am working in PASM.
Ergo, the HUB access time is effectively randomized.
Now, this leads to an interesting problem.
Code that I write and test and works fine for me in my project may fail when I add or remove code and data or link things in a different order.
Why? Because I have inadvertently relied on some nice access pattern that hits the HUB at opportune moments and thus achieves the speed I'm looking for. When I move things around and addresses change that HUB access speed also changes. Perhaps leading to failure.
An extreme example:
Clearly this HUB arbiter can give increased block transfer rates if I am reading/writing consecutive addresses. So I blindly write my code, it does that, and all is well.
But, what happens when my same code is dropped into an environment where the variables I'm addressing are in reverse order or scattered around at random?
Poof, my speed drops, my code fails!
Not only that, it fails in very unexpected, random ways that are hard to debug. This can cause a lot of head-scratching.
Questions:
1) Is this a likely, common scenario for someone who, like the average code monkey, is not thinking about this possibility?
2) Is there a way to warn that this might happen?
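To make the failure mode concrete, here is a hypothetical comparison under the same toy model assumed earlier in this thread (16 slices by low nibble, one slice per clock): the identical block of longs accessed in linker-forward versus linker-reversed order.

```python
# Hypothetical illustration of the concern above: the same 16 longs
# accessed forward vs. reversed, under the thread's assumed model.
# Nothing here is chip-verified timing; it shows the shape of the risk.

def total_wait(nibbles):
    """Sum of inter-access waits over a sequence of address nibbles."""
    total = 0
    for prev, nxt in zip(nibbles, nibbles[1:]):
        gap = (nxt - prev) % 16
        total += gap if gap else 16
    return total

forward  = list(range(16))      # consecutive longs, nibbles 0..15
backward = forward[::-1]        # same data, laid out the other way

print(total_wait(forward))      # 15  -- streams at one clock per long
print(total_wait(backward))     # 225 -- 15 clocks per long, ~15x slower
```

A relink that merely reverses the layout turns a streaming loop into a worst-case one, which is the "poof, my speed drops" scenario.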
And external events will change the delays too.
For instance, it depends on when a signal/byte/event happens and its relationship to a ring-buffer position in hub (an address) and, of course, the cogid too. All of these play together to vary the hub delay.
This is why I see this as overcomplicated for process-control timing.
With the old-style hub you had 16 clocks, plus getting in sync for your first data read. Once you did that, you KNOW it is 16 clocks back around, so you can program your code to do work or wait until the next access. That is NOT dependent on the memory address you just read or the one you are going to read next.
With the new system this is no longer true, unless you are doing a block read.
Example:
I want to read a value at xxx1 and then xxx5. I wait to sync with xxx1. Now, instead of a known 16 clocks until the next read, it is 4. Want to read xxx2 then xxx12... oh, now my wait is 10... you NEVER know how much time you have. You cannot use the time in between to do other things.
To say it again: The memory address I am randomly reading is going to change the time I have between reads. I can no longer program other things during that wait because I no longer KNOW how many instructions I have until I have to get back to issuing the next read instruction. Every random hub read is going to throw us out of sync.
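For what it's worth, both positions fit the toy model assumed in this thread: when the nibble sequence is known, the gaps, and therefore the spare instruction slots, are still computable; only unknown or computed addresses break the count. A sketch, additionally assuming 2-clock instructions:

```python
# Sketch: if the address pattern IS known, you can still count the
# spare instruction slots between hub ops; the complaint above applies
# to unknown/computed addresses. Assumes 2-clock instructions and the
# thread's 16-slice, one-slice-per-clock model (not chip-verified).

def spare_instructions(nibbles, clocks_per_insn=2):
    """Free instruction slots in each inter-access gap."""
    slots = []
    for prev, nxt in zip(nibbles, nibbles[1:]):
        gap = (nxt - prev) % 16
        slots.append((gap if gap else 16) // clocks_per_insn)
    return slots

print(spare_instructions([0x1, 0x5, 0x2]))   # [2, 6] free 2-clock slots
```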
Cluso,
Why do you keep saying the cogid matters? It does not.
Also, if your access pattern stays the same, your wait pattern will be the same after the first read/write. Of course, if you have other wait instructions (waitpne/waitcnt) then your next wait will vary, but that's true on the P1 too. waitxxx is not deterministic.
I'm tired of saying the same thing over and over, and being ignored and told I am wrong without any evidence of why... Also, again, why do you insist the cogid matters? It will only matter for the size of the initial wait to sync with hub, and that is already unpredictable.
The first read/write is an unknown wait, but that is true on the P1 also. So I dunno what you guys are hung up on...
Yes with this scheme HUB access speed is effectively randomised. So knowing how to interleave instructions with HUB accesses is basically impossible.
On the other hand, we have 16 times more HUB access bandwidth!
P.S. Somewhere here I proposed a scheme that purposely randomized HUB arbitration in the event that two or more COGS wanted to hit it at the same time. I thought I was kind of joking, but here we are. If that randomized scheme were feasible it would yield even higher performance.
Kerry S,
I love how you just stated the wait times between your given examples, then said you never know the wait time. Which is it? You clearly know... but say that you don't.
You only don't know the wait time when it's a truly random access pattern. If you know the access pattern, then you know the wait times.
If you know the access pattern, then you know the wait times.
I think this is key. We don't know the access pattern. Do read my post #45 above, where I tried to lay out these concerns in detail. No, I'm not worried too much about COG IDs. Do you have answers to the questions I posed there? Personally I don't worry about randomized HUB access times, but it seems they can cause unexpected issues.
Not unless we are going to extraordinary lengths to check the addresses that the compiler assigns to everything. Perhaps not even then, as these things can be determined at run time. In general, we have a random HUB access latency.
Seems to me the best case is to put only the NEEDED instructions in with hub accesses, for an overall best-case execution with a worst case that is also known.
My answer: whatever mix gets it out the door yesterday. Other than that, I do not care, and neither should anyone else. If you can't trust Parallax with this simple matter, maybe you should be looking at Arduinos.
LOL Roy, I am sure you realize that is not serious...
So to optimize/time my code I have to sit there with the address of EVERY variable I am going to use and write my code around those? Hmmm, need to read variable myVar1; let's see, that is xxx5, and the next one I need is going to be xxx9, so OK, I need to put 4 waits here. OK, next one is xxx7; let's see, how many spaces is that... can I get some code squeezed in there? Wait a sec, I am pulling that address from a look-up table based on math... now how do I figure out where the heck my timing is?
Your idea is BRILLIANT for block reads. Really it is.
However it is going to cause headaches to those who do assembly drivers where they want/need to time it all out.
One of the great things about the P1 is that you can easily write time dependent code. Now that is gone?
Yes it is, but I bet it would be difficult to implement.
Don't be an old stick in the mud now.
The 'difficulty' in implementing it is part of the 'fun'.
Fun is all that matters and meeting simple Customer-driven feature sets simply isn't 'fun' enough.