Synchronization puzzle
Phil Pilgrim (PhiPi)
Posts: 23,514
Here's a brain teaser related to synchronizing multiple assembly language cogs. I don't yet know the answer or whether it even has one. Bear in mind, I've added constraints for the sake of the puzzle that one wouldn't have to suffer in the real world. So here goes:
Suppose you start multiple cogs, one right after the other, and you want them synchronized to each other. In other words, you want them all to execute a preselected instruction simultaneously, at which point they'll be considered "in sync". We'll assume that they all execute their first instructions within a known value x clocks of each other and that x is a very small fraction of 232. Here's the added constraint: the cogs can't communicate with each other or with the cog that started them, either via hub memory or the port pins. The only common element they have for reference is the master counter, cnt. We'll also assume that a cog's cogid provides no information about when it was started relative to the others. And finally, the time limit for getting into synchronizaiton is 233 clocks.
The question is, given these assumptions and constraints, is it possible for all the cogs to read cnt and come up with a single, identical, future counter value to use in a waitcnt instruction? If so, how would you compute such a number? If not, what makes it impossible?
Cheers!
Phil
Suppose you start multiple cogs, one right after the other, and you want them synchronized to each other. In other words, you want them all to execute a preselected instruction simultaneously, at which point they'll be considered "in sync". We'll assume that they all execute their first instructions within a known value x clocks of each other and that x is a very small fraction of 232. Here's the added constraint: the cogs can't communicate with each other or with the cog that started them, either via hub memory or the port pins. The only common element they have for reference is the master counter, cnt. We'll also assume that a cog's cogid provides no information about when it was started relative to the others. And finally, the time limit for getting into synchronizaiton is 233 clocks.
The question is, given these assumptions and constraints, is it possible for all the cogs to read cnt and come up with a single, identical, future counter value to use in a waitcnt instruction? If so, how would you compute such a number? If not, what makes it impossible?
Cheers!
Phil
Comments
I haven't tried this, but couldn't each cog wait until the cnt rolled over?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Beau Schwabe
IC Layout Engineer
Parallax, Inc.
No, because the cluster of initial cnt readings could straddle the rollover point. So some cogs would execute the sync instruction right away; the others would wait nearly a full cycle of cnt.
-P.
restart the initial cog again at the same interval.
This way ALL cogs would at least be in sync with cnt
Finally within each cog wait for cnt to rollover.
The "specific interval" would relate to the execution loop that each cog needs to wait during the cnt rollover detection.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Beau Schwabe
IC Layout Engineer
Parallax, Inc.
I can't think of one, though, there might be some way. CNT rollover is the gotcha' here. I can't make myself think about this too much, though, because even 2^32 clocks is nearly a minute at 80MHz - seems too long for practical purposes.
You've probably figured that you could pre-compute an adequately-advanced static CNT value and store it into·either global RAM or the COG assembly code that is going to get loaded into every COG. Then, even though the COGs come on line at slightly different times due to·short latencies between launch starts, they could all begin with a WAITCNT instruction that would sync them to the exact same CNT value. That sync would hold perfectly until HUB instructions executed, causing HUB sync's to occur, introducing 2-COG-clock offsets between COGs.
In the Propeller tester program, the same assembly program gets loaded into every COG and they more-less run in unison, though there are HUB-related clock offsets which remain static. To do this, COG #0 would execute the code:
PUB start | i
· repeat i from 0 to 7
··· initx[noparse][[/noparse]i] := @sync << 4 + i··· 'set initx's to point to sync routine and specify COG
· coginit(0, @go, 0)············· 'relaunch COG 0 (this COG) with the fast launcher
DAT···· org
go····· coginit initx+7··· 'launch COG7
······· coginit initx+6··· 'launch COG6
······· coginit initx+5··· 'launch COG5
······· coginit initx+4··· 'launch COG4
······· coginit initx+3··· 'launch COG3
······· coginit initx+2··· 'launch COG2
······· coginit initx+1··· 'launch COG1
······· coginit initx+0··· 'relaunch COG0 (me)
initx·· long··· 0 [noparse][[/noparse]8]
······· org
sync··· 'assembly language code here that will be nearly sync'd
When a COG executes a COGINIT instruction, it begins the launch process for whichever COG it selected. In the above example, all the COGs will be loading simultaneously, then COG7 through COG0 will come on line with the 'sync' program, each staggered by ~16 COG clock cycles.·I used '~16' because there is still some HUB order that each COG locks to after it begins launching. This is hard to think about, though. Feel free to resolve this extra detail.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
Unless I am missing something, I believe the answer to this is simple.........
Call the time to launch one cog "Launch", hence the time to launch 8 cogs is 8 times "Launch", or "All Launch". (This need not even be exact.)
After it is launched, let each cog after pick the next highest binary number which has all its least significant bits found in "All Launch" equal to zero, and then add to that the current value of "cnt". Then simply "waitcnt" until this future number is matched.
They should then all trigger at exactly the same time.
Post Edit.....
Oops, forgot to say to clear the least significant N bits (as found in "All Launch")·of the "future" count. This finds the next "boundary" of "cnt" that "All Launch" will fit into. In fact, now that I think of it, the "All Launch" number needs to be doubled.
Sorry about that.
An algorithm that should work would be:
Calculate "All Launch" into a 32 bit long.
Left shift to double it. (By definition the number is not large)
Set all bits from 31 to the most significant bit of the number.
Clear all bits from the most significant number through bit 0.
And "cnt" into it.
Then "waitcnt" this number.
This should yield a trigger value for the counter with the right hand N bits cleared; that "right-hand" portion being larger than "All Launch". Thus "All Launch" or more ticks will pass before the cogs all sync.
Cheers,
Peter (pjv)
Post Edited (pjv) : 4/12/2006 1:25:33 AM GMT
In fact it is one of the things that an operating system might include as a feature. A lot of it hinges on cnt and how accessing it relates to accessing main memory. I don't have the details of how the system actually works -- everything is speculative just now because there is no documentation.
I'd probably do this by having the cogs run asm programs to establish the relative delay they see from an increment of cnt. For example, you could have cog7 write the value of an internal loop counter to hub memory slot as fast as it could, starting the counter from zero when it sees cnt increment. Other cogs repeatedly read that value as well as cnt value, waiting for it to click over. The value that they have when they see cnt click is their offset from cog7. They each write that offset into a memory location in main memory. Now you have a table that gets you synchronized to the leading edge of cnt. If you want two cogs to start something simultneously they each load their code and then wait for cnt to increment and then an additional number of cycles equal to the increment (actually the inverse -- that is, the number of cycles in a cnt - the increment). Then off they go. Of course this is indeterminate below the size of the cogs' window time slot access to central memory.
Interestingly, this works even if the cogs have different clock speeds -- since each one is keeping track of its own count you could have a slow processor and a fast one use the same code and get results that are good for itself. so long as the clock doesn't change. his is also a pretty nice way to test to see if cogs are running appropriately. If you make a convention that when the table is all zeros that you perform the synch task again it is fairly easy to see which cogs are running by starting the synch and waiting for 2 cnt increments and examining the table. Any cog's entry that remains zero didn't get the message and isn't running the synch code.
Can anyone tell me if cnt is rigorously regularly incremented? When is that increment relative to the beginning of a hub cycle? Is the timing for a main memory read deterministic and constant?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Richard "Still thinks the PDP-8 was a great machine" Cook
The 32-bit system counter increments every clock cycle and powers up in an unknown state. All COGs use the same clock as the counter. Each COG has an internal register CNT which shows the current state of the system counter. There is also a WAITCNT instruction which can wait for a target value to appear in CNT and then release, so that execution resumes.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
It seems to me to be easier to synchronize any number of cogs from one single "waitpeq" event. And that event can itself be run by a cog, which then can be pre-adjusted to trigger at a specified instant.
Only if you are running in assembly can cogs be kept synchronous. In Spin it's impossible. That does NOT mean that in Spin they can't be made to be synchronous for a moment, but after a few instructions and/or global memory accesses they would again all be random.
Not to say that some programs can't be contrived to remain synchronous, but then they would likely all be doing the same thing, and what might be the practical point of that.
To answer your last questions, there is one system counter, and it runs all the time totally "rigorously". In fact it counts every oscillator tick, so that's typically 80 MHz, and it typically takes 4 of those ticks for one assembler (machine) instruction (a few take more than 4 cycles), and perhaps 400 to 1000 of them for one Spin instruction.
The hub permits accesses to each cog for exactly 2 cycles out of every 16 cycles; totally deterministic.
Cheers,
Peter (pjv)
Well, this has certainly provoked some interesting responses!
Beau, if I had thought of your solution (i.e. wait for the right moment relative to cnt to start the cogs, so all have the same relation to the next rollover) I probably wouldn't have wound up posing this conundrum. Even though it "colors outside the lines" of the problem as posed, I like it!
Chip, I had to slap my head with a big "d'oh!" when I read your comment! What brought this all on was an earlier thread about synchronizing cogs. I was going to post examples of synchronizing both Spin and assembler cogs using port bits as squarewave outputs to demonstrate the multiple-of-25nS offsets one gets with Spin programs vs. assembler programs. I used the "pick-a-big-enough-offset-from-cnt" idea and wrote it to common memory, then started the Spin cogs with an argument telling it which port bit to use. But when I got to the assembler cog demo, I realized I had to pass the cog two arguments: the port bit number and the cnt-plus-offset number. Since par (not being a quantum register) can only hold one number at a time, the demo looked like it might get complicated. That's when it hit me that it might be possible to compute a value for waitcnt just by reading cnt in each cog and considering a few plausible assumptions. That way, I'd only have to pass the port number (<< 2) in par. But I had completely overlooked the fact that anything written into the hub RAM DAT area before calling cognew would get copied into each of the cogs! That goes into the T&T section for sure!
Peter, I got excited when I started to read your post, because I thought you might be onto someting. But when I got to the part about adding something to cnt, you lost me. What concerns me is the value read from cnt in each cog will be different, so adding some constant value to each will yield a different sync time for each one. Remember, the constraints of the puzzle state that the cogs can't communicate with each other or with the cog that launched them. That includes communicating a common cnt value. But I'm sure I'm reading your post wrong. Perhaps you could provide an example to help explain.
Richard, you're right in that writing a value to hub memory would solve the problem -- but not within the constraints as posed. And, yes, cnt increments by one every oscillator cycle -- just like clockwork!
My own thinking on this puzzle has been circular, at best. I thought that dividing the entire cycle of cnt into overlapping half-cycle "epochs" would provide a sure-fire way to avoid the rollover "straddle" problem mentioned earlier. For example, three sets of epochs could be defined as ($00000000, $80000000), ($2AAAAAAA, $AAAAAAAA), ($55555555, $D55555555). The idea was for each cog to read cnt, then pick the epoch that all the cogs are closest to the center of. Since all the readings will be close together relative to the length of one epoch, this should be pretty easy -- right? Then all you have to do is waitcnt for the end of the chosen epoch, and you'll be in sync. Well, this argument is just as wrong as it is seductive. The problem is that you can always come up with a counterexample, wherein one cog's cnt reading will be closest to the center of one epoch, while another is closer to the center of a different epoch. And it doesn't matter how tight the cluster of cnt readings is. So I suspect the puzzle, as stated, deosn't have a solution. But I'd still enjoy reading any counterproposals!
Peter, your comment about Spin programs losing synchronicity is pretty interesting -- and sounds right -- even though a program I wrote to demonstrate synchronizing Spin cogs stayed in sync within the expected 175nS (25nS minimum offset per cog x 7 cogs). The fact that the cog routines were identical may have something to do with that...
-Phil
Post Edited (Phil Pilgrim (PhiPi)) : 4/12/2006 4:01:41 AM GMT
I've been doing a LOT of work in Propeller assembly lately, in fact almost full time since I got back from Parallax in February, and I think I'm getting the hang of it...... assembler that is.
My particular interest is assembler since it is the only way for (practical) deterministic code, and for high performance and high speed stuff, my personal favourite, it is the only possible answer. Not to say that Spin is no good; it's phenomenal.... just not deterministic.
Tomorrow I will test my proposed solution to your cog puzzle (I believe my answer is correct), and then spend the balance of the week with high speed comms.
At this point I have UARTs running at 1 bit per instruction (20 Mbits/sec), and Manchester comms running at 1 bit per 2 instructions (10 Mbits/sec). I hope to have that receiving to 10 Mb/s Ethernet packets next week, of course without the obligatory PHY. Currently I am running continuously synchronized multiple cogs for generating and decoding this high speed data.
I am now able to select at which of the 4 clocks in one instruction the data sample gets taken. Also, we can now generate pulse trains without the PLL upscaling feature at 40 Mhz. All this is done with using the counters in very weird ways and synchronizing cogs to one clock cycle (12.5 nano seconds) of each other. There is lots of interesting interplay to be discovered here.
Man, this is a hoot!!!
Cheers,
Peter (pjv)
-Phil
Awesome!
I'm·excited that you've found ways around the obvious limitations and that you've got the CTRs doing unanticipated things now. That's also very neat that you've got multiple COGs on the same job. I had to read a few of your sentences several times before they could sink in, because they sounded outlandish to me. I finally thought "okay, it MUST BE possible if he's done it, so how did he do it?" Then, it·starts to·unfold (not to suggest I understand completely). Andre, the Hyrda guy, said that in video game programming you start with the assumption that EVERYTHING is possible, then you just have to figure out how to do it. I'm starting to believe that, myself.
One thing that you probably know: if you want evenly-staggered access to global memory, maybe for dumping high-speed samples into to RAM at, say, twice a COG's HUB access rate, you can start two COGs with a difference of 4 in their IDs (ie. 1 and 5, or 2 and 6, or for 4x HUB rate use 1, 3, 5, 7 or 0, 2, 4, 6).
Have you read Andre's Hydra manual about the video generator in each COG? This can certainly spit stuff out very quickly. Of course, inputting is a much more complex matter which you are now working on.
I suppose I have a lot of questions, but I'll wait for you to show us some of this stuff, if and when you're interested.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
Post Edited (Chip Gracey) : 4/12/2006 6:22:05 AM GMT
That's good advice. When I was still in college, I got a summer job programming (in Fortran) for a small company that made truck bodies, propane cylinders, and milking machines. The data processing department consisted of Verus, the manager, Verda, a grandmotherly type who ran the computer, two keypunch operators, and me. Verda ran a tight ship in the computer room, and didn't like to deviate from routine. Everything was on a schedule and had to be done just so. Occasionally Verus would ask Verda to do something a little differently, and she would say, "Verus, can we do that?" And Verus would always reply, "Verda, we're computerized. We can do anything!"
Words to live by!
-Phil
Thanks for your kind words of encouragement.
Interesting you mention Andre's approach. It is exactly how I tackle things (within some level of reason of course). I do get the other engineers who work for me quite upset at times.
Oh, well.......
As far as the video generator is concerned, I have been able to send some VERY fast data... 128 Mb/s UART and 64 Mb/s Manchester if I recall correctly. Now, none of these things are fully fleshed out with comparable feed rates etc, and just now are kind of "bursty" in nature. But I do expect folks to be able to find novel uses for them.
Oh, how I DO wish the video also had a serial IN pin............. and that the counters' FREQ and PHAS registers were general memory mapped registers.
Keep up the good work you guys!
Cheers,
Peter (pjv)
Next generation, we'll augment the peripherals to be able to do data serialization in and out w/manchester. Also, I plan to make the counters able to add AND subtract conditionally. BTW, the FRQ registers are just RAM shadowed by a storage latch. It's the PHS registers that are more complicated. I've also been thinking about putting a CORDIC peripheral into each cog to do fast coordinate rotation.
Also, the PLL's are designed to work up to 160MHz across temperature, voltage, and process. We spec'd 128 just to be very conservative. At room temperature, I see them peak at ~230 MHz.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
NOW YOU'RE TALKIN' !!!!!!!!!
Can't wait for THAT puppy!
Regarding the PLL's, at one point I actually had mine running at 320 MHz. Not too stable but at least it was showing the effort.
Cheers,
Peter (pjv)
I tried my solution, using it to sync 4 cogs that were launched in rapid succession....... I only have 4 channels on my scope, but the concept is the same whether it is for one or eight cogs.
So, I find that it works..... mostly. There are occasional mis-fires, that I suspect may have someting to do with cnt roll-over, but is happening more often than that would predict, so it could be some flaw in my thinking.
Anyhow when things do fire right, I can see each cog launch (about 10 uSec apart), and then fire an output pin to within 1 or 2 nano Seconds of each other, indicating synchronism.
I'm sure I can find the source of the mis-fires, but don't want to spend the time just now.
For my own convenience in testing, I pre-calculated the mask as a constant; a trivial change from my algorithm.
I suspect there are other ways of dealing with your poser, but this one came to mind immediately though it may not be the most elegant.
Cheers,
Peter (pjv)
Consider what happens when one cog reads a cnt of $1FFF, and the other a cnt of $2000:
··($2000 + $1FFF) & $FFFFE000 == $2000
··($2000 + $2000) & $FFFFE000 == $4000
It's the ol' rollover straddle problem again, and I'm not sure there's a way around it. I keep thinking there ought to be -- that there has to be a way for each cog to choose the epoch, based on its own cnt reading, that it and all the other cogs are members of; then wait until the end of that epoch. In your solution, this would be equivalent to choosing an adder based on the cnt reading, instead of using $2000 in all cases. In my example above, adding $3000 would work. But the trick is to make sure that all the cogs use the same adder. This seems as if it should be easy, since the cnt readings are tightly clustered. It should be a simple matter to say, "Well, my reading is pretty close to a rollover point, so all the others must be, too. Therefore, I'll add an extra half cycle before computing the sync time." But I can think of no way to define "pretty close" in a way that it holds true either for all the cogs or for none of them.
By stretching the constraints a tad, I believe there's an easy solution if the top-level program is allowed to provide all the cogs the same single bit of information: 0 or 1. This would be based on the top-level's own reading of cnt just before it launches the cogs and would tell the cogs whether to use %0000... or %1000... as the end of the epoch, thus avoiding the straddle. This would be equivalent to waiting for the "right" moment, relative to cnt, to launch the cogs.
Cheers!
Phil
Post Edited (Phil Pilgrim (PhiPi)) : 4/12/2006 8:40:44 PM GMT
I'm chasing another counter problem just now, and I'll give the issue some thought later. Perhaps I'm wrong in my assessment, but would like to consider it more before I give up.
Cheers,
Peter (pjv)
You've been a great sport with what is barely more than an intellectual diversion. The stuff you're doing with the counters and comms sounds way more interesting!
-Phil
I consider all this to be one of the best learning experiences! And so much fun to boot!
Cheers,
Peter (pjv)
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
1+1=10
····synci = f(cnti)
where i is the cogid and sync0 = sync1 = ... = syncn for n cogs. But this is impossible unless f is a constant function, because there will always be at least two boundaries in the domain of f, either side of which f will return different values. And there's nothing to guarantee two of the cnti won't straddle one of these boundaries, thus returning different sync times. But we've already seen that if f is a constant function, straddle at the rollover point is a problem. So I would have to conclude that knowing only the local value for cnt is insufficient to determine a common sync time.
-Phil
It's non existent as far as the physical I/O, but you can read and write to it.
Could this be used to "sync" the cogs by way of waitpeq or waitpne?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Beau Schwabe
IC Layout Engineer
Parallax, Inc.
Chip Gracey
Parallax, Inc.
Post Edited (Chip Gracey) : 4/13/2006 4:22:17 AM GMT
I seem to remember Chip saying in another thread that Port B exists only as RAM in the individual cogs and that the intercog "wiring" is non-existent.
-Phil
Update: Whoops, Chip beat me to it!
But I suppose it could be scratch pad within each cog then.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Beau Schwabe
IC Layout Engineer
Parallax, Inc.
Chip Gracey
Parallax, Inc.
Gotcha!
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Beau Schwabe
IC Layout Engineer
Parallax, Inc.
Wouldnt it be as simple as having cog(0) initialise the others, but only when cnt value is a very low number such as 10, to avoid rollover during new cog startups.
If you made·each cogs first instruction a·'repeat until cnt = n'·that made each cog wait until a preset higher cnt value· such as 5000 was reached ?
Cheers,
Chris