
an open discussion about precision external clocks for propeller

Ari Posts: 63
edited 2011-03-14 18:56 in Propeller 1
Once again, I am trying to create an open discussion topic.....this one is very near and dear to my heart.

The topic in question is the theoretical and possible real world accuracy of various types of clock sources.

I say this topic is near and dear to me because one of the ultimate disciplines in my line of work is precision clocks for digital audio. Much debate and trial and error has gone on in my world around this topic. The idea is that a tighter clock reference means less quantization distortion and jitter in a digital signal, which in the real world equates to a more accurate and linear analog waveform (when converted D/A).

However, the topic here is that application to the Propeller: will a precision clock reach a point of diminishing returns in the Propeller because of its own internal PLL limitations?

If so, what would be regarded as the tightest possible tolerance, and what would be the acceptable thermal constraints to govern such "overclocking"?

With these items in consideration, what would be the most accurate external clock source...

1. a precision clock ref (oscillator): this could include oven regulation, atomic, crystal, etc.
2. a multi-clock synched array, with a comparator and corrector

The sky is the limit here...including the conversion of an analog clock reference via dedicated A/D sub-systems.

And no, this is not a topic specifically about "overclocking"...it's more about precision and what values would equal the highest precision for the Propeller.

:innocent:

here is an excerpt about what could be possible in the near future

"In August 2004, NIST scientists demonstrated a chip-scaled atomic clock.[12] According to the researchers, the clock was believed to be one-hundredth the size of any other. It was also claimed that it requires just 75 mW, making it suitable for battery-driven applications. This device could conceivably become a consumer product."


here is a relevant diagram about what happens when a clock cycle is missed.....also can the best case scenario be improved upon?

[Image: clock.jpg, the hub access best-case / worst-case timing diagram from page 7 of the Propeller datasheet]

Comments

  • Mike Green Posts: 23,101
    edited 2011-03-13 20:18
    Any precision clock will reach a point of diminishing returns with any PLL system. You always do better if the reference clock is a submultiple of the output clock and you do better than that with direct clock drive when that's allowed.

    The Prop can be clocked with an external clock anywhere over its clock range (DC to 80MHz). You can certainly use a precision temperature regulated clock. An atomic clock would be pricey, but possible otherwise. As far as maximum clock speeds, look at the graph in the Propeller datasheet that gives maximum clock speed vs. environment temperature and supply voltage. In practice, unselected chips seem to work fine at 100MHz and 104MHz at room temperatures and 3.3V. There are delays across the surface of the chip. Some cogs are closer than others to certain I/O pins and there is a chain of OR gates involved. There was a thread recently on the EFX-TEK AP-16+ WAV player which uses a Prop to play WAV files and has a built-in 20W amplifier. The firmware was changed during development to use different I/O pins due to differences in signal propagation across the chip. We're talking about times on the order of a few nanoseconds. Apparently the difference was audible at high volume levels in quiet passages. You might want to browse for the thread.
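    For reference, here is a minimal Spin clock-configuration sketch for the two approaches Mike describes (the 80 MHz value is only a placeholder for whatever the external source actually runs at; xinput drives the system clock straight from the XI pin with no PLL in the path, while xtal1 + pll16x uses the crystal oscillator and the on-chip PLL):

        CON
          ' Option 1: the usual crystal plus on-chip PLL
          '_clkmode = xtal1 + pll16x
          '_xinfreq = 5_000_000           ' 5 MHz crystal x 16 = 80 MHz system clock

          ' Option 2: direct drive from an external precision oscillator on XI (PLL bypassed)
          _clkmode = xinput
          _xinfreq = 80_000_000           ' example only: set this to the oscillator's real frequency

    With xinput the system clock is exactly the external source, so the only jitter in the Prop's timebase is whatever the oscillator and the XI input path contribute.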
  • Ari Posts: 63
    edited 2011-03-13 21:15
    Mike Green wrote: »
    Any precision clock will reach a point of diminishing returns with any PLL system. You always do better if the reference clock is a submultiple of the output clock and you do better than that with direct clock drive when that's allowed.

    The Prop can be clocked with an external clock anywhere over its clock range (DC to 80MHz). You can certainly use a precision temperature regulated clock. An atomic clock would be pricey, but possible otherwise. As far as maximum clock speeds, look at the graph in the Propeller datasheet that gives maximum clock speed vs. environment temperature and supply voltage. In practice, unselected chips seem to work fine at 100MHz and 104MHz at room temperatures and 3.3V. There are delays across the surface of the chip. Some cogs are closer than others to certain I/O pins and there is a chain of OR gates involved. There was a thread recently on the EFX-TEK AP-16+ WAV player which uses a Prop to play WAV files and has a built-in 20W amplifier. The firmware was changed during development to use different I/O pins due to differences in signal propagation across the chip. We're talking about times on the order of a few nanoseconds. Apparently the difference was audible at high volume levels in quiet passages. You might want to browse for the thread.

    The datasheet suggests a maximum of 128 MHz and 3.6 V.

    I think with some thermal management that could be bested at a consistent level....(wish the prop had a better heat spreader)

    I will go buy a few props and see what maximums I can get......

    however my concerns weren't speed....prop is plenty fast for what i want to do and overclocking would seem very pointless for my applications

    my concern is more about precision in relation to cog timing.....I suppose with propview you could easily calculate the offsets of each cog in relation to the master clock input....

    I am not sure why the problem with the WAV player couldn't have been solved with a buffer and offsets....maybe this is my lack of understanding about the Propeller, but I know in my world....you have a variable shift option in all clock sources in the chain....it's called a pull-up or pull-down rate....it originally is a concept from the synchronization of analog tape machines

    so i wonder then if a direct drive clock, with a secondary clock and a comparator/corrector is really the way to go for my needs....hmmmm

    a significant error rate in clock would be catastrophic to the way i would like to manage my data and process

    also thanks for the thread suggestion, you are very helpful Mike
  • jazzed Posts: 11,803
    edited 2011-03-13 21:48
    Ari wrote: »
    .....also can the best case scenario be improved upon?
    Unfortunately, the Propeller cannot perform a HUB cycle faster than 7 clocks (kuroneko has better details on some instructions). RdByte and friends need 8 clocks (practically speaking) for in-sync HUB access.
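    For anyone who wants to see those numbers on real hardware, a rough sketch like the following can bracket a single hub read with CNT snapshots (illustrative only: the two mov instructions add roughly 4 clocks of their own, and the result varies with where the read lands in the hub rotation):

        CON
          _clkmode = xtal1 + pll16x
          _xinfreq = 5_000_000            ' 80 MHz system clock

        VAR
          long elapsed

        PUB Main
          cognew(@measure, @elapsed)      ' PAR carries the hub address used for the read and the result
          waitcnt(cnt + clkfreq / 100)    ' give the cog ~10 ms to finish
          ' "elapsed" now holds the clock count for one rdlong plus the mov overhead

        DAT
                org     0
        measure mov     t0, cnt           ' snapshot the system counter
                rdlong  t2, par           ' one hub read: 7..22 clocks depending on window phase
                mov     t1, cnt
                sub     t1, t0            ' elapsed clocks for the read (plus ~4 clocks of mov)
                wrlong  t1, par           ' report the result back through hub RAM
                cogid   t0
                cogstop t0

        t0      res     1
        t1      res     1
        t2      res     1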
  • Mike Green Posts: 23,101
    edited 2011-03-13 21:51
    I wouldn't push the frequency nor the supply voltage. You need very tight control of the supply voltage if you're going close to the Absolute Maximum ratings. If you put the Propeller on a thermoelectric chiller, you could do fine at 128MHz. The graph was done using a forced air chamber and at -50C, you've got a little headroom at 3.3V.

    The issue with the WAV player was that the time offsets are all sub-clock speed. The resolution of the timebase is 12.5ns and the delays are less than that, I think 1-2ns per gate with up to 8 gates in the chain. The Propeller block diagram in the datasheet gives you the idea. If you want to characterize these offsets, you need a good scope with nanosecond or sub-nanosecond resolution not to mention careful control of signal path lengths off chip.
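    (The 12.5 ns figure is just one period of the 80 MHz system clock: 1 / 80 MHz = 12.5 ns, and that is the finest step the CNT register or waitcnt can resolve.)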

    Frankly, if you're worrying about sub-clock timing, you should consider another solution, maybe some kind of simple processor in a fast gate array or a state machine. It's not like the hardware doesn't exist at those speeds. The Propeller was simply not designed to handle things faster than maybe 4-10MHz with the exception of the video hardware and the cog counters and those have specific uses up to maybe 128MHz. By synchronizing multiple cogs, you can do short bursts of I/O at the system clock rate, but that's about it.
  • Ari Posts: 63
    edited 2011-03-13 21:57
    Mike Green wrote: »
    I wouldn't push the frequency nor the supply voltage. You need very tight control of the supply voltage if you're going close to the Absolute Maximum ratings. If you put the Propeller on a thermoelectric chiller, you could do fine at 128MHz. The graph was done using a forced air chamber and at -50C, you've got a little headroom at 3.3V.

    The issue with the WAV player was that the time offsets are all sub-clock speed. The resolution of the timebase is 12.5ns and the delays are less than that, I think 1-2ns per gate with up to 8 gates in the chain. The Propeller block diagram in the datasheet gives you the idea. If you want to characterize these offsets, you need a good scope with nanosecond or sub-nanosecond resolution not to mention careful control of signal path lengths off chip.


    Mike do you see why I was so concerned about the schematic issue now?

    I hope it's all falling into place =)
  • Mike Green Posts: 23,101
    edited 2011-03-13 22:29
    The block diagram that's in the datasheet has existed on this website since the Propeller went on the market. It's only an approximation for what's actually laid out on the chip. In fact, the optimal I/O pins to use for something like the WAV player are not the ones you'd expect from the block diagram. You have to look at the actual chip layout which is what was done.

    You're making assumptions about what is going to be crucial information for your project. You know what sorts of information may be critical, but you're asking for documentation that won't necessarily answer your questions. This is where you have to formulate your questions clearly, thoroughly, and in your problem domain, then ask people "in the know" who might be able to answer them. Several people who frequent these forums might be able to help. It's always helpful to frame your question in a way to stimulate others' interest.
  • Ari Posts: 63
    edited 2011-03-13 22:49
    Mike Green wrote: »
    The block diagram that's in the datasheet has existed on this website since the Propeller went on the market. It's only an approximation for what's actually laid out on the chip. In fact, the optimal I/O pins to use for something like the WAV player are not the ones you'd expect from the block diagram. You have to look at the actual chip layout which is what was done.

    You're making assumptions about what is going to be crucial information for your project. You know what sorts of information may be critical, but you're asking for documentation that won't necessarily answer your questions. This is where you have to formulate your questions clearly, thoroughly, and in your problem domain, then ask people "in the know" who might be able to answer them. Several people who frequent these forums might be able to help. It's always helpful to frame your question in a way to stimulate others' interest.


    it's why I was somewhat disappointed....I expected to find the data I need in the data sheet....which is typically where I have found such things....

    I need to learn more about the propellers architecture before I even know what to ask, much less whom....

    thanks for your help Mike...it really has been very nice of you to lend your ear...
  • Ari Posts: 63
    edited 2011-03-13 22:57
    jazzed wrote: »
    Unfortunately, the Propeller cannot perform a HUB cycle faster than 7 clocks (kuroneko has better details on some instructions). RdByte and friends need 8 clocks (practically speaking) for in-sync HUB access.


    so theoretically the cog access speed by the hub is going to be the maximum clock ref. x7? Hope I am reading that correctly....

    thank you for this reply

    i don't suppose you would happen to know the physical distance from each cog to the hub would you?
  • Ari Posts: 63
    edited 2011-03-13 23:16
    Mike Green wrote: »
    The resolution of the timebase is 12.5ns and the delays are less than that, I think 1-2ns per gate with up to 8 gates in the chain. The Propeller block diagram in the datasheet gives you the idea. If you want to characterize these offsets, you need a good scope with nanosecond or sub-nanosecond resolution not to mention careful control of signal path lengths off chip.


    it's starting to come together.....=)

    I will look for that thread to try and find the physical distance from the hub to each cog....using that I should be able to calculate the offset based on clock ref....

    I will try to get access to a scope that can do 8 layovers to confirm the predictions....I had mentioned this in another thread....

    the reason I want to know this is for my concept of "program bouncing" (I have touched on it in a few threads now)

    I have taken your earlier advice to heart and am re-reading all of my posts and replies in all threads....I am sure a lot of answers and clarifications are buried in there already
  • Heater. Posts: 21,230
    edited 2011-03-14 04:18
    Ari,

    I think we are all intrigued by this idea of "program bouncing" and look forward to finding out what it is all about.

    I do worry that any software solution that relies on knowing the distance from COG to HUB memory is in for some surprises. What happens when a Prop 1.5 comes out that is a die shrink and changes all the timing?

    Thinking about it I don't see how knowing the distance from COG to HUB is going to help anyway. Data is clocked in and out of different parts of the chip when it is clocked. Nothing you can do about that.
  • jazzed Posts: 11,803
    edited 2011-03-14 10:21
    Ari wrote: »
    so theoretically the cog access speed by the hub is going to be the maximum clock ref. x7? Hope I am reading that correctly....

    thank you for this reply

    i don't suppose you would happen to know the physical distance from each cog to the hub would you?
    Essentially the access rate will be Ra=SYSCLK/16 (the system clock frequency / 16) in the best case because of COG round-robin access. The actual hub instruction takes cycles to execute and has to wait for its next turn on the hub.
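    Putting rough numbers on that for the usual 80 MHz setup (a sketch; the method name is made up):

        CON
          _clkmode = xtal1 + pll16x
          _xinfreq = 5_000_000            ' 80 MHz system clock

        PUB HubWindowRate : ra
          ra := clkfreq / 16              ' 80_000_000 / 16 = 5_000_000 hub windows per cog per second,
                                          ' i.e. one window every 200 ns for a given cog; the hub
                                          ' instruction itself costs 7..22 clocks (87.5..275 ns)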

    I've never considered the distance from each cog to the hub important ... there may have been one practical case regarding audio that JonnyMac found. The only example where I've really needed clarification on timing is covered in this thread. If you have a specific timing question, just post it. People here love digging into things in great detail.

    Regarding "bouncing": You should probably start a new topic and describe your ideas. Lots of stuff has already been discovered about what is possible with Propeller. I would love to see something completely new, but according to The Teacher (NIV): "What has been will be again, what has been done will be done again; there is nothing new under the sun."
  • Ari Posts: 63
    edited 2011-03-14 11:52
    Heater. wrote: »
    Ari,

    I think we are all intrigued by this idea of "program bouncing" and look forward to finding out what it is all about.

    I do worry that any software solution that relies on knowing the distance from COG to HUB memory is in for some surprises. What happens when a Prop 1.5 comes out that is a die shrink and changes all the timing?

    Thinking about it I don't see how knowing the distance from COG to HUB is going to help anyway. Data is clocked in and out of different parts of the chip when it is clocked. Nothing you can do about that.


    I am also intrigued by the concept. Since I have no provable examples of this, it can be considered nothing more than a concept at this point. I think the only thing to stop this, though, would be the process becoming non-linear. This is why the timing is so very critical, at least from a physics perspective. In reality, who knows what may happen. To further the problem, maybe program execution in a linear fashion is also a bit of a misnomer.

    I think I must find a way to do this in its simplest state first, and then possibly find a way to execute in a non-linear fashion. If the process could be compiled in a random timeline, then the physical attributes of timing and voltage would be totally arbitrary....let's hope that winds up being the case

    I know that data is clocked based on the 7-cycle run time and then a 9-cycle dwell time....I can't change that....but if I know the exact amount of time, I can predict a repeatable offset into the timing algo.

    This process would never survive a chip migration....the offsets would have to be changed, based on the new physical topology....or erased altogether....but that would be a good thing....the concept would stay intact...just as long as there remain 2 or more "cogs" or processing cores....
  • Ari Posts: 63
    edited 2011-03-14 12:09
    jazzed wrote: »
    Essentially the access rate will be Ra=SYSCLK/16 (the system clock frequency / 16) in the best case because of COG round-robin access. The actual hub instruction takes cycles to execute and has to wait for its next turn on the hub.

    I've never considered the distance from each cog to the hub important ... there may have been one practical case regarding audio that JonnyMac found. The only example where I've really needed clarification on timing is covered in this thread. If you have a specific timing question, just post it. People here love digging into things in great detail.

    Regarding "bouncing": You should probably start a new topic and describe your ideas. Lots of stuff has already been discovered about what is possible with Propeller. I would love to see something completely new, but according to The Teacher (NIV): "What has been will be again, what has been done will be done again; there is nothing new under the sun."


    I understand the issue of round-robin timing access, but I am curious about the delay between each COG and hub interaction, not the grand total of a RR cycle. It seems from the visual example and the datasheet available that the access time is 7 times the clock cycle, with a 9-cycle dwell time between the ability to grab the next high-order sync pulse

    I wonder if the 9-cycle dwell time is necessary for the hub to move on to the next cog....if that is the case then the offset for each hub/cog interaction is a log function, so the offset between each hub/cog interaction could be longer than the simple notion I had earlier

    I think Mike is right, the only way to really know is to employ a scope and strap it to a pin....then repeat the same pulse operation across all cogs, and measure the waveform offset on a consistent timeline

    The problem with this is finding a scope that will do a sub-ns timeline, and will allow for 8 or more overlays....I would need to see the same data from each I/O pin

    This sounds like a big problem....I wonder if I should just try to execute the concept first and see what happens....I could just be making more trouble for myself than is necessary (that seems to be the recurring theme in my life)

    I think my concept is just a twist on common functions....it's more of an academic effort than anything else....I highly doubt I am going to discover something monumental....I will probably just discover my own naivety and be humbled by my peers....that is a worthwhile outcome in itself

    I will do a proper write up when I have some more answers and have a better understanding of the prop terminology and architecture....I think to be helpful to anyone else, I have to understand the specific language of propeller....I can't go around calling an apple an orange....
  • Mike Green Posts: 23,101
    edited 2011-03-14 12:24
    The Propeller datasheet has a nice diagram that shows the best case and worst case for cog access to the hub functions. Look at page 7 for this and page 28 for a description of this OR logic chain that's part of the I/O structure.
  • Ari Posts: 63
    edited 2011-03-14 12:57
    Mike Green wrote: »
    The Propeller datasheet has a nice diagram that shows the best case and worst case for cog access to the hub functions. Look at page 7 for this and page 28 for a description of this OR logic chain that's part of the I/O structure.


    Mike the graphic I inserted in my first post is the diagram from page 7.....*scratches head*
  • Mike Green Posts: 23,101
    edited 2011-03-14 13:33
    Yes, I see that graphic now. For some reason I didn't see it originally.

    There's no way to improve on the best case for successive references to hub memory from a cog. That's the way the hardware was designed. Each cog gets a 2-clock window in a 16-clock cycle and the cog is forced to stall until its window comes up.

    On the other hand, you can synchronize several cogs together. If they're not being used already for something else, you can pick specific cogs to be synchronized together such that they use successive windows so that every other hub access slot is used or every slot is used for part of the cycle ... unless you want to tie up all the cogs synchronized and all accessing the hub, doing nothing else for a time. If the amount of data is small, it could be buffered in the cogs themselves and, for short bursts, data could be input or output every clock cycle (using 4 cogs) with no access to the hub at all (except for filling / emptying the buffers at the end of the burst).
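    A minimal sketch of that kind of lock-step start (illustrative only, not from any particular object): the launching cog picks a start time comfortably in the future, each PASM cog reads it through PAR and sits in waitcnt, so they all resume on the same clock; since each cog's slot in the 16-clock hub rotation is tied to its cog number, their hub accesses then land in known, successive windows.

        CON
          _clkmode = xtal1 + pll16x
          _xinfreq = 5_000_000

        VAR
          long startTime

        PUB Main | i
          startTime := cnt + clkfreq / 10 ' a shared start point ~100 ms in the future
          repeat i from 0 to 3
            cognew(@synced, @startTime)   ' every cog gets the same start time via PAR

        DAT
                org     0
        synced  rdlong  release, par      ' fetch the shared start time from hub RAM
                waitcnt release, #0       ' stall until CNT reaches it; all cogs release together
                ' ... time-aligned bursts go here, each cog in its own hub slot ...
                cogid   release
                cogstop release

        release res     1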
  • Ari Posts: 63
    edited 2011-03-14 13:56
    Mike Green wrote: »
    Yes, I see that graphic now. For some reason I didn't see it originally.

    There's no way to improve on the best case for successive references to hub memory from a cog. That's the way the hardware was designed. Each cog gets a 2-clock window in a 16-clock cycle and the cog is forced to stall until its window comes up.

    On the other hand, you can synchronize several cogs together. If they're not being used already for something else, you can pick specific cogs to be synchronized together such that they use successive windows so that every other hub access slot is used or every slot is used for part of the cycle ... unless you want to tie up all the cogs synchronized and all accessing the hub, doing nothing else for a time. If the amount of data is small, it could be buffered in the cogs themselves and, for short bursts, data could be input or output every clock cycle (using 4 cogs) with no access to the hub at all (except for filling / emptying the buffers at the end of the burst).


    so if I can synch cogs, then the offset is predictable by my method of choosing....

    the sliding window concept is exactly how TCP/IP works....it's the determining factor in frame size....SYN/ACK....aka handshake

    I get this....thanks for that excellent description

    so the cogs could pick up chunks from hub ram in a random order...since no processing would be going on, just buffer filling

    then once the chunks are loaded into cogs they could process the chunk in synch....since the reliance for my concept is linearity, I could just program a predictable offset into each cog

    ex.

    cog 0 = 0
    cog 1 = 0.1
    cog 2 = 0.2
    etc
    etc
    etc

    ok this could work....
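    In Spin terms that stagger could look something like the sketch below (everything here is hypothetical: Worker and OFFSET_TICKS are made-up names, and the offsets are whole system-clock ticks rather than the 0.1 steps above):

        CON
          _clkmode = xtal1 + pll16x
          _xinfreq = 5_000_000

          OFFSET_TICKS = 800                        ' example stagger: 800 clocks = 10 us at 80 MHz

        VAR
          long stack[3 * 32]
          long baseTime

        PUB Main | i
          baseTime := cnt + clkfreq / 10            ' common reference point ~100 ms ahead
          repeat i from 1 to 3
            cognew(Worker(i), @stack[(i - 1) * 32]) ' cogs 1..3 run Worker with their own index
          Worker(0)                                 ' this cog plays the role of "cog 0"

        PRI Worker(index)
          waitcnt(baseTime + index * OFFSET_TICKS)  ' release staggered by a fixed, known offset
          ' ... each chunk then gets processed on its own predictable slice of the timeline ...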

    once the smaller nested operations are executed (as well as the larger ones) the collective processed objects could be fed back to memory, or back to a cog for processing....

    with this method you could nest thousands of operations in to a few directives

    I will draw a diagram of my concept....thanks again Mike
  • jazzed Posts: 11,803
    edited 2011-03-14 14:47
    Hmm. I can see an LMM pipelined design for one thread with N COGs. The problem is controlling the pipeline and scaling the COGs. That is a different topic though.
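    For anyone following along, the heart of an LMM (Large Memory Model) kernel is just a tiny PASM fetch/execute loop, roughly like this classic form (a sketch, not any particular implementation; real kernels also need conventions for jumps and constants, which is where most of the complexity lives):

        DAT
                org     0
        lmm     rdlong  lmm_ins, pc       ' fetch the next native instruction from hub RAM
                add     pc, #4            ' advance the "large memory" program counter
        lmm_ins nop                       ' the fetched instruction executes right here
                jmp     #lmm              ' back for the next one

        pc      long    0                 ' hub address of the LMM code, filled in by the loader

    Each pass costs a hub access plus a few native instructions, so one cog runs LMM code at a fraction of native speed (unrolling the loop is the usual remedy); the pipeline idea is to spread that fetch/execute stream across several cogs.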
  • Ari Posts: 63
    edited 2011-03-14 18:17
    jazzed wrote: »
    Hmm. I can see an LMM pipelined design for one thread with N COGs. The problem is controlling the pipeline and scaling the COGs. That is a different topic though.


    scaling would be done by the chunks....the singular object (process) could be broken up into 8 chunks...then 16....then 32....and on and on and on....control of the pipeline is an issue though....and not a small one

    If the divisions beyond 8 were necessary, the cumulative output could be dumped back to memory for compilation once all chunks are re-assembled....

    let me clarify....my concept is not to increase speed....it's to allow for more things to happen in limited amounts of code.....eventually you would just wind up with a single program that would consist of thousands of nested objects...essentially a framework which can be processed by PASM....with no translation necessary whatsoever....
  • jazzed Posts: 11,803
    edited 2011-03-14 18:43
    Ari wrote: »
    let me clarify....my concept is not to increase speed....it's to allow for more things to happen in limited amounts of code.....eventually you would just wind up with a single program that would consist of thousands of nested objects...essentially a framework which can be processed by PASM....with no translation necessary whatsoever....

    If that's your goal, there are already two solutions: 1) Catalina C allows multi-threading per COG with a pthread-like interface, and 2) PJV wrote a scheduler that can be expanded to many PASM threads per COG.

    I just offered something I thought no one had tried before that could be beneficial to someone.

    A Propeller could then use pipelined LMM code across multiple cogs to run a single program at more than 100MIPS.

    Cheers,
    --Steve
  • Ari Posts: 63
    edited 2011-03-14 18:56
    jazzed wrote: »
    If that's your goal, there are already two solutions: 1) Catalina C allows multi-threading per COG with a pthread-like interface, and 2) PJV wrote a scheduler that can be expanded to many PASM threads per COG.

    I just offered something I thought no one had tried before that could be beneficial to someone.

    A Propeller could then use pipelined LMM code across multiple cogs to run a single program at more than 100MIPS.

    Cheers,
    --Steve


    that would be the ideal, Steve.....but I can't figure out a way to control the flow...the only solution I could come up with would be a second Propeller

    also what I am talking about is not just multi-threading....it's multi-threading stacked on top of thousands of concurrent operations....so when one process completes it has actually compiled hundreds of other processes....

    I think my last reply was confusing....your initial reply to me was more what I had in mind....

    simply breaking up a process to run in parallel is a byproduct, and not the core concept....the core concept is the nesting of that concept to the capacity of available memory....so one process is actually the compilation of thousands of others....kind of like a neural net or AC current

    my goal is to use multiple or ALL cogs to process the broken-up objects concurrently....the reason I say that I don't think this will increase speed is because of the complexity and the fact that the output along the timeline will have to be re-buffered and then processed