Clocks and clock ratios, Core and Timers.

jmg · 2014-03-04 21:05

Thinking about the P2 clocks to Core and Timers.
The P1 has an 80MHz timer clock and a 20MHz effective core clock, and WAIT resolves to the 80MHz value.

Users may want to get maximum precision from P2 timers, but run the COGs at some lower MHz to save power.

Or, it may transpire the Silicon can do 200MHz+ easily on timers/capture, but only get to some not-quite-there value on the Core.
It would be a real pity to limit the timers to that lower core ceiling.

That would need a simple prescale design, to allow (amongst others) a P1 equivalent setting, where the CPU is PLL/4 and the Timers and WAIT engines and Baud Gen are PLL based.

For symmetry and control, a converse prescaler could allow Core to run at PLL speed, and Timers at PLL/N.
That would allow (eg) 100MHz / 10ns timers, and a wider gear range on the CPU speed.
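
A minimal sketch of the prescaler arithmetic above, in Python. The function name `clock_plan` and the divider parameters are hypothetical, and the 200MHz PLL value in the second case is an assumption chosen only to reproduce the 100MHz / 10ns figure; nothing here describes actual P2 registers.

```python
def clock_plan(pll_mhz, core_div=1, timer_div=1):
    """Derive core and timer clocks from one PLL through two independent dividers."""
    core_mhz = pll_mhz / core_div
    timer_mhz = pll_mhz / timer_div
    return {
        "core_mhz": core_mhz,
        "timer_mhz": timer_mhz,
        "timer_tick_ns": 1000.0 / timer_mhz,  # one timer period in nanoseconds
    }

# P1-equivalent setting: CPU at PLL/4, timers at PLL
p1_like = clock_plan(80, core_div=4)

# Converse setting: core at PLL, timers at PLL/2 (assumed 200MHz PLL)
converse = clock_plan(200, timer_div=2)

print(p1_like)
print(converse)
```

The first case gives a 20MHz core with 12.5ns timer ticks, the second a 200MHz core with 10ns timer ticks.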

This control should also suit FPGAs, where the core's complex pathways impose lower limits than the timers could get away with.
It would allow FPGA development to work to higher timing MHz values.

potatohead · 2014-03-04 21:43

I'm interested in what Chip thinks about the implications of these design choices, given he's exploited things in various ways throughout the design.

cgracey · 2014-03-04 23:01

Well, this is good food for thought, but the big inhibitor to doing something like this is that the counters' critical paths are just under those of the cogs. I think we cannot even run the counters 10% faster than the cogs, let alone 200% or 400%. I understand the desire for increased precision, though. I think at 200MHz, we'll have exceptional precision already. The really solid way to get more precision is to go to a 40nm process, or smaller. Then we can have 600% more.

Sapieha · 2014-03-04 23:39

Hi Chip

I understand the part about the counters.

BUT there is still one other question in jmg's post ---->

Can't the PLL be run faster in REAL silicon, with an added PRESCALER (user controlled) that brings it down to the standard frequency, BUT which we can disable to test higher frequencies on the same crystal ----> maybe higher than you have planned for.

In other words, GIVE us overclocking.

jmg · 2014-03-05 00:58

cgracey wrote: »

Well, this is good food for thought, but the big inhibitor to doing something like this is that the counters' critical paths are just under those of the cogs. I think we cannot even run the counters 10% faster than the cogs, let alone 200% or 400%.

Do you mean in the ASIC process ?

I found these specs in tables for Lattice XP2 & ECP3 devices, which I think are behind Cyclone parts.
Altera does not seem to publish the same tables.

LatticeXP2 (90-nm)
16-bit Adder 507 MHz
64-bit Adder 293 MHz
16-bit Counter 541 MHz
32-bit Counter 440 MHz
64-bit Counter 321 MHz
64-bit Accumulator 261 MHz

LatticeECP3 (65-nm)
16-bit adder 500 MHz
64-bit adder 336 MHz
16-bit counter 500 MHz
32-bit counter 500 MHz
64-bit counter 350 MHz
64-bit accumulator 326 MHz

Cyclone IV (60 nm)
Cyclone V (28 nm)

Virtex-5 (65 nm)
http://ece.gmu.edu/coursewebpages/ECE/ECE645/S11/projects/project_1_resources/Adders_MELECON_2010.pdf
gives
32b Adder circuit
Proposed      521 MHz   98 FF  41 LUT
Standard RCA  483 MHz   98 FF  31 LUT
64b Adder circuit
Proposed      397 MHz  194 FF  81 LUT
Standard RCA  358 MHz  194 FF  65 LUT

Xilinx WP245, Virtex-4 (90 nm) and Virtex-5 (65 nm)
Adder, 64-bit: Virtex-4 FPGA 3.5 ns, Virtex-5 FPGA 2.5 ns
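
To compare the WP245 delay figures with the MHz numbers in the other tables, the delays can be converted to an equivalent clock rate. This is a rough sketch in Python: the helper `fmax_mhz` is a hypothetical name, and it assumes the adder delay is the whole clock period, ignoring register setup and clock-to-output time, so the results are optimistic upper bounds.

```python
def fmax_mhz(delay_ns):
    """Clock rate whose period equals the given delay (1000 / ns = MHz)."""
    return 1000.0 / delay_ns

for name, delay in (("Virtex-4", 3.5), ("Virtex-5", 2.5)):
    print(f"{name}: {delay} ns -> {fmax_mhz(delay):.1f} MHz")
```

Under that assumption the 64-bit adder figures correspond to roughly 286 MHz and 400 MHz.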

cgracey · 2014-03-05 01:12

jmg wrote: »

Do you mean in the ASIC process ?

In both. The granularity of the FPGA is small enough that it tracks the ASIC pretty closely, as relative speed limits go for different circuits.

cgracey · 2014-03-05 01:13

Sapieha wrote: »

Hi Chip

I understand the part about the counters.

BUT there is still one other question in jmg's post ---->

Can't the PLL be run faster in REAL silicon, with an added PRESCALER (user controlled) that brings it down to the standard frequency, BUT which we can disable to test higher frequencies on the same crystal ----> maybe higher than you have planned for.

In other words, GIVE us overclocking.

You can set a multiplier for the main PLL from 2,3,4,5,...16. With a 20MHz crystal, you could try to make it go 320MHz, or just set the PLL multiplier to 10 and go 200MHz.
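
The multiplier range above can be tabulated with a few lines of Python. This is only an arithmetic sketch; `pll_options` is a hypothetical helper, and whether the silicon actually runs at the higher settings is a separate question.

```python
def pll_options(crystal_mhz, multipliers=range(2, 17)):
    """Map each PLL multiplier (2..16) to the resulting clock in MHz."""
    return {m: crystal_mhz * m for m in multipliers}

opts = pll_options(20)  # 20MHz crystal, as in the post above
for mult, mhz in opts.items():
    print(f"x{mult:2d} -> {mhz} MHz")
```

With a 20MHz crystal, multiplier 10 gives 200 MHz and multiplier 16 gives 320 MHz.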

Sapieha · 2014-03-05 01:20

Hi Chip.

That answers my question.

Thanks

LewisD · 2014-03-05 09:27

Hi Chip,
I have been lurking for a few years now, and the white-knuckle roller coaster ride of the last ~6 months has been AWESOME. This thread has me thinking, and I would pose this question (for P3 or P4).

With 8 very powerful COGs, would it be possible to run each at its own clock?
The hub, I/O and inter-cog communication would run at the fastest clock, but each COG would have its own clock (domain or fractional ratio).
So you could run one each at 192MHz (USB), 49.152 MHz (digital audio and reference clock for FireWire), 36.864 MHz (UART clock), and even 4.194304 MHz (real-time clocks).
The idea is that most of the COGs are software-programmable I/O, so matching the clock to the I/O task would help (minimizing jitter and simplifying programs).
For the best jitter-free operation we may need a separate crystal for each COG on a unique frequency, so we would need pairs of I/O pins that could run a crystal and feed that to a PLL for a COG...

This is not meant to be the monkey wrench that broke the camel’s back, Just an idea.

Lewis

jmg · 2014-03-05 10:37

Completely independent fractional clocks would make HUB access 'challenging' !
Likewise for a separate PLL per COG.
A separate prescaler per COG is more practical, as that just scales when the COG fires. Works as a Clock enable.
There could be some impact on HUB access - but HUB slots round-robin now.

Cluso99 · 2014-03-05 13:57

The only useful reason I can see to slow a cog down is to reduce power. However, it needs to maintain the hub window.

I don't know how much power would be saved if the cog were able to use 1/8..8/8 clocks (ie 1/8 would only use 1 in every 8 clocks, but the clock would still be the same original frequency, eg 200MHz).

This could sort of be achieved by something similar to...
SETTASK %11111111_11111100_11111111_11111100
So that Task 0 gets 2:16 clocks and Task 3 gets 14:16 clocks
But Task 3 implements a special "IDLE" instruction which does nothing (PC does not advance, etc) - ie an instruction designed for minimum power usage.

Maybe there is some other way to achieve this (minimise power consumption).
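
The slot counts quoted for that SETTASK pattern can be checked with a short Python sketch. It assumes the 32-bit value is read as sixteen 2-bit fields, each naming the task (0..3) that owns one time slot, which is consistent with the 2:16 and 14:16 split stated above; `task_shares` is a hypothetical helper, and the order in which slots execute is not modeled.

```python
from collections import Counter

def task_shares(pattern):
    """Count how many of the 16 two-bit slots each task owns."""
    bits = pattern.lstrip("%").replace("_", "")
    if len(bits) != 32:
        raise ValueError("expected a 32-bit binary pattern")
    slots = [int(bits[i:i + 2], 2) for i in range(0, 32, 2)]
    return Counter(slots)

shares = task_shares("%11111111_11111100_11111111_11111100")
for task in range(4):
    print(f"task {task}: {shares[task]}/16 slots")
```

Task 0 owns 2 slots and task 3 owns 14, matching the figures in the post.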

Bill Henning · 2014-03-05 14:18

- set the slow processing task to 1/16

- set the power-saving task to 15/16, and have it wait on a pin.

jmg · 2014-03-05 15:05

Cluso99 wrote: »

The only useful reason I can see to slow a cog down is to reduce power.

There are other advantages as well, but power is the main one.

Another advantage is that with a clock enable on a COG clocked at 200MHz, you gain more timing margin, so if the silicon comes in at something under 200MHz on the CPU but above 200MHz on the timers, you can still use 200MHz timing.
ie it is a sort of insurance policy to reach that important psychological number of 200MHz.

I'm trying to reality check the numbers above, so I took an XO2 design I have, installed ECP3 libraries and built it.
It reports after place and route, MAX 317.158MHz on ECP3 ( vs MAX 133.316MHz on XO2)
This is a 4 channel, 32 bit Quadrature Up/Down counter, with serial readout.

So it looks about right, but I have no means to test that.
( That 317.158MHz is on a similar process node to Cyclone IV, Cyclone V should have more headroom ? )

Cluso99 wrote: »

However, it needs to maintain the hub window.

I think COGs already wait for their window, so it does not have to be 8N locked, but if you want to reduce beat effects, then yes, choosing an 8N divide value would make the delays consistent.
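
The beat effect can be illustrated with a toy model in Python. All of it is assumption: the cog is clock-enabled on every Nth system clock, its hub window opens once every 8 system clocks, and a hub request issued on an enabled clock simply waits for the next window. `hub_waits` is a hypothetical helper and does not describe the real hub logic.

```python
def hub_waits(divide_n, cog_id=0, hub_period=8, samples=64):
    """Return the distinct hub-window waits (in system clocks) seen by a
    cog that is clock-enabled once every divide_n system clocks."""
    waits = set()
    for k in range(samples):
        t = k * divide_n                        # system clock of the k-th enabled tick
        waits.add((cog_id - t) % hub_period)    # clocks until this cog's window
    return sorted(waits)

for n in (3, 6, 8, 16):
    print(f"divide by {n:2d}: waits = {hub_waits(n)}")
```

In this model a divide value that is a multiple of 8 always sees the same wait, while other values cycle through several different waits, which is the beat effect.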

Cluso99 wrote: »

I don't know how much power would be saved if the cog were able to use 1/8..8/8 clocks (ie 1/8 would only use 1 in every 8 clocks, but the clock would still be the same original frequency, eg 200MHz).

This could sort of be achieved by something similar to...
SETTASK %11111111_11111100_11111111_11111100
So that Task 0 gets 2:16 clocks and Task 3 gets 14:16 clocks
But Task 3 implements a special "IDLE" instruction which does nothing (PC does not advance, etc) - ie an instruction designed for minimum power usage.

It could use a WAITCNT loop, which should be the lowest power still alive ?

Tubular · 2014-03-05 15:35

Video PLL for transmit?

Another way would be to have the supervisor task "drip" a single slot each time a passcnt happens in the supervisor task. You could use a DDS style counter in the supervisor cog to keep track of when to "drip" (setting the passcnt target)
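
A DDS-style counter of the kind mentioned here is a phase accumulator: a tuning word is added on every tick, and each overflow marks a moment to "drip" one slot. The Python sketch below shows only that accumulator; the name `drip_schedule`, the 32-bit width and the tuning words are illustrative assumptions, and the link to PASSCNT is not modeled.

```python
def drip_schedule(tuning_word, acc_bits=32, clocks=16):
    """Return a list with 1 on each tick where the phase accumulator overflows."""
    mask = (1 << acc_bits) - 1
    acc = 0
    drips = []
    for _ in range(clocks):
        total = acc + tuning_word
        drips.append(total >> acc_bits)  # carry out of the accumulator: 1 on overflow
        acc = total & mask               # keep only the low acc_bits
    return drips

quarter = drip_schedule(1 << 30)        # tuning word = 1/4 of full scale
three_eighths = drip_schedule(3 << 29)  # tuning word = 3/8 of full scale
print(quarter)
print(three_eighths, sum(three_eighths))
```

The average drip rate equals tuning_word / 2^acc_bits, so fractional ratios such as 3/8 come out evenly spread rather than bunched.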

jmg · 2014-03-05 15:50

Tubular wrote: »

Another way would be to have the supervisor task "drip" a single slot each time a passcnt happens in the supervisor task. You could use a DDS style counter in the supervisor cog to keep track of when to "drip" (setting the passcnt target)

Sure, single step debug will work almost exactly like this.

Cluso99 · 2014-03-05 16:22

Bill & jmg,

In multi-task mode you have to use PASSCNT instead of WAITCNT. Not so sure about WAITPxx.

However, from what I understand, these instructions do not put the P2 into a type of sleep mode like they did on P1. So you are not going to get greatly reduced power - there will be some as some of the ALU will not be ticking over.

cgracey · 2014-03-05 16:34

LewisD wrote: »

Hi Chip,
I have been lurking for a few years now, and the white-knuckle roller coaster ride of the last ~6 months has been AWESOME. This thread has me thinking, and I would pose this question (for P3 or P4).

With 8 very powerful COGs, would it be possible to run each at its own clock?
The hub, I/O and inter-cog communication would run at the fastest clock, but each COG would have its own clock (domain or fractional ratio).
So you could run one each at 192MHz (USB), 49.152 MHz (digital audio and reference clock for FireWire), 36.864 MHz (UART clock), and even 4.194304 MHz (real-time clocks).
The idea is that most of the COGs are software-programmable I/O, so matching the clock to the I/O task would help (minimizing jitter and simplifying programs).
For the best jitter-free operation we may need a separate crystal for each COG on a unique frequency, so we would need pairs of I/O pins that could run a crystal and feed that to a PLL for a COG...

This is not meant to be the monkey wrench that broke the camel’s back, Just an idea.

Lewis

As jmg pointed out, it would make hub access difficult. It would also make pin OR'ing among cogs difficult, as there must be flipflops in these paths to meet timing requirements.

I'm glad you've been lurking and have been enjoying all this.

jmg · 2014-03-05 16:37

Cluso99 wrote: »

Bill & jmg,

In multi-task mode you have to use PASSCNT instead of WAITCNT. Not so sure about WAITPxx.

However, from what I understand, these instructions do not put the P2 into a type of sleep mode like they did on P1. So you are not going to get greatly reduced power - there will be some as some of the ALU will not be ticking over.

Good point, I was using WAITCNT in the generic sense.
The ALU will not be used, but the jump to self may take more power than a P1 WAITCNT.
Very few FF's will be changing, so power may come down to the clock tree, and how much of that is gated in the various modes.

potatohead · 2014-03-05 16:46

Shouldn't we be discussing this for P3?

jmg · 2014-03-05 17:09

potatohead wrote: »

Shouldn't we be discussing this for P3?

That's up to Chip.

It recently occurred to me, that the silicon may not quite make 200MHz, in the CPU, but should make > 200MHz in the Timers.
Prescalers are relatively simple and many uC now have these, plus they give that insurance factor, and can give users power control.
Quite a few micros now run timers faster than the CPU.

Prescalers may also allow more FPGA testing coverage, as the FPGA counter info above looks like 200MHz+ should be practical, even if the CPU is at ~80MHz.

potatohead · 2014-03-05 17:17

Yeah OK. Let's just say I'm not a fan of even MORE things being crammed into what is now a very crowded space. And yes, each one of them "is only a little bit..." but they add up.

Bill Henning · 2014-03-05 17:25

Hi Ray,

from what I understand, the reason not to use WAITCNT is that it stops all other tasks while it is waiting.

Thinking about it, that makes it perfect for low speed / low power running

task 0 - the code you want to run low speed, low energy
task 1 - the power saver

Set task 0 to run at 1/2 cycles

For extra savings, have task 1 wait for N cycles (N probably has to be bigger than 10 or so - whatever delta is the minimum)

powersaver_task

      reps     #16363,#1
      getcnt  now
      waitcnt now,#200 ' wait for 1us

      jmp   #powersaver_task

Between controlling the % of slices the two tasks get, and the delta for waitcnt, you can extremely finely tune the time spent at almost no power vs. power needed.

No need for funky clock modes, no need for more hardware or instructions.

Just a cute use of the task feature and existing instructions.

Cluso99 wrote: »

Bill & jmg,

In multi-task mode you have to use PASSCNT instead of WAITCNT. Not so sure about WAITPxx.

However, from what I understand, these instructions do not put the P2 into a type of sleep mode like they did on P1. So you are not going to get greatly reduced power - there will be some as some of the ALU will not be ticking over.

jmg · 2014-03-05 19:45

jmg wrote: »

I'm trying to reality check the numbers above, so I took an XO2 design I have, installed ECP3 libraries and built it.
It reports after place and route, MAX 317.158MHz on ECP3 ( vs MAX 133.316MHz on XO2)
This is a 4 channel, 32 bit Quadrature Up/Down counter, with serial readout.

More P&R info for LatticeECP3 ( 65 nm) 32b U/D Counter x 4

That 317.158MHz was with no particular CLK targets, so I tried 385MHz and it reported 328.623MHz, and then tried 330MHz, which gives
Report: 330.688MHz is the maximum frequency for this preference.
Seems the SW tries a little harder if it is close to making the target than if you over-spec the target?

cgracey · 2014-03-06 02:03

jmg wrote: »

More P&R info for LatticeECP3 ( 65 nm) 32b U/D Counter x 4

That 317.158MHz was with no particular CLK targets, so I tried 385MHz and it reported 328.623MHz, and then tried 330MHz, which gives
Report: 330.688MHz is the maximum frequency for this preference.
Seems the SW tries a little harder if it is close to making the target than if you over-spec the target?

Timing closure is a peculiar thing. You can ask for some impossibly-high Fmax and it will wind up compiling logic slower than what is actually possible to achieve. When you give it a realistic goal, it can achieve better, because it is actually possible. This goes for compiling standard cells for an ASIC, as well as logic elements for an FPGA. One thing that can happen in an ASIC, where there is no fixed routing grid like in an FPGA, is that you can get a marginally-higher Fmax by climbing the wall with additional signal buffering, but it comes at the expense of silicon area and power. It's best to back down the curve a bit to where power, speed, and area are more balanced.
