Clocks and clock ratios, Core and Timers.
jmg
Thinking about the P2 clocks to Core and Timers.
The P1 has an 80MHz timer clock and a 20MHz effective core clock, and WAIT resolves to the 80MHz value.
Users may want to get maximum precision from P2 timers, but run the COGs at some lower MHz to save power.
Or, it may transpire the Silicon can do 200MHz+ easily on timers/capture, but only get to some not-quite-there value on the Core.
It would be a real pity to limit the timers to that lower core ceiling.
That would need a simple prescaler design, allowing (amongst others) a P1-equivalent setting, where the CPU runs at PLL/4 and the Timers, WAIT engines and Baud Gen run directly from the PLL.
For symmetry and control, a converse prescaler could allow the Core to run at PLL speed, and the Timers at PLL/N.
That would allow (eg) 100MHz / 10ns timers, and a wider gear range on the CPU speed.
This control should also suit FPGAs, where the CORE's complex pathways impose lower limits than the timers could get away with.
It would allow FPGA development to work to higher timing MHz values.
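To make the ratios concrete, here is a minimal C sketch (not Propeller code) modelling separate core and timer prescalers running off one PLL; the 200MHz PLL figure and the divider values are illustrative assumptions, not a spec.

#include <stdint.h>
#include <stdio.h>

#define PLL_HZ    200000000ULL  /* assumed PLL output                     */
#define CORE_DIV  4             /* P1-equivalent setting: core = PLL / 4  */
#define TIMER_DIV 1             /* timers/WAIT/baud stay at full PLL rate */

int main(void)
{
    uint64_t core_ticks = 0, timer_ticks = 0;

    /* walk one microsecond of PLL edges and count what each domain sees */
    for (uint64_t edge = 0; edge < PLL_HZ / 1000000; edge++) {
        if (edge % TIMER_DIV == 0) timer_ticks++;  /* timer/WAIT resolution */
        if (edge % CORE_DIV  == 0) core_ticks++;   /* core clock enables    */
    }

    printf("per us: %llu timer ticks (%.1f ns), %llu core ticks\n",
           (unsigned long long)timer_ticks, 1000.0 / timer_ticks,
           (unsigned long long)core_ticks);
    return 0;
}

Swapping the two divider values gives the converse case above: Core at PLL speed, Timers at PLL/N.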
Comments
The part about counters I understand.
But there is still one other point in jmg's question:
Can't the PLL be run faster in real silicon, with an added prescaler (user controlled) that divides it down to the standard rate, but which we can disable to test higher frequencies on the same crystal, maybe higher than you have planned for?
In other words, give us overclocking.
Do you mean in the ASIC process ?
I found these specs in tables for Lattice XP2 & ECP3 devices, which I think are behind Cyclone parts.
Altera does not seem to publish the same tables.
In both. The granularity of the FPGA is small enough that it tracks the ASIC pretty closely, as relative speed limits go for different circuits.
You can set a multiplier for the main PLL from 2,3,4,5,...16. With a 20MHz crystal, you could try to make it go 320MHz, or just set the PLL multiplier to 10 and go 200MHz.
That answers my question.
Thanks
I have been lurking for a few years now, and the white knuckle roller coaster ride of the last ~6 months has been AWESOME. This thread has me thinking and I would pose this question. (P3 or P4)
With 8 Very powerful COGs would it be possible to run each at its own clock?
The hub, i/o and inter-cog communication would run at the fastest clock, but each COG would have its own clock (domain or fractional ratio).
So you could run one each at 192MHz (USB), 49.152MHz (Digital audio and reference clock for Firewire), 36.864MHz (UART clock), and even 4.1934MHz (Real-time clocks).
The idea is that most of the COGs are software-programmable i/o (to minimize jitter and simplify programs), so matching the clock to the i/o task would help.
For the best jitter free operation we may need separate crystals for each COG on a unique frequency, so we would need pairs of i/o that could run a crystal and feed that to a PLL for a COG...
This is not meant to be the monkey wrench that broke the camel’s back, Just an idea.
Lewis
Likewise for a separate PLL per COG.
A separate prescaler per COG is more practical, as that just scales when the COG fires. Works as a Clock enable.
There could be some impact on HUB access - but HUB slots round-robin now.
I don't know how much power would be saved if the cog were able to use 1/8..8/8 clocks (ie 1/8 would only use 1 in every 8 clocks, but the clock would still be the same original frequency, eg 200MHz).
This could sort of be achieved by something similar to...
SETTASK %11111111_11111100_11111111_11111100
So that Task 0 gets 2:16 clocks and Task 3 gets 14:16 clocks
But Task 3 implements a special "IDLE" instruction which does nothing (PC does not advance, etc) - ie an instruction designed for minimum power usage.
Maybe there is some other way to achieve this (minimise power consumption).
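As a sanity check on that slot split, here is a small host-side C sketch (illustration only, not P2 code) that decodes a SETTASK-style mask of 16 two-bit fields into per-task slot counts, using the mask from the example above.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t mask = 0xFFFCFFFC;   /* %11111111_11111100_11111111_11111100 */
    int slots[4] = {0, 0, 0, 0};

    for (int i = 0; i < 16; i++)  /* 16 time slots, 2 bits per slot */
        slots[(mask >> (2 * i)) & 3]++;

    for (int t = 0; t < 4; t++)
        printf("task %d gets %d of 16 slots\n", t, slots[t]);
    return 0;
}

It prints 2 of 16 slots for Task 0 and 14 of 16 for Task 3, matching the 2:16 / 14:16 split above.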
- have the power-saving cog at 15/16, and have it wait on a pin.
There are other advantages as well, but power is the main one.
Another advantage is that with a clock enable on a COG clocked at 200MHz, you gain more timing margin, which means if the silicon comes in at something under 200MHz on the CPU, but above 200MHz on the timers, you can still use 200MHz timing.
ie it is a sort of insurance policy to reach that important psychological number of 200MHz.
I'm trying to reality check the numbers above, so I took a XO2 design I have, installed ECP3 libraries and built it.
It reports after place and route, MAX 317.158MHz on ECP3 ( vs MAX 133.316MHz on XO2)
This is a 4 channel, 32 bit Quadrature Up/Down counter, with serial readout.
So it looks about right, but I have no means to test that.
( That 317.158MHz is on a similar process node to Cyclone IV, Cyclone V should have more headroom ? )
I think COGs already wait for their window, so it does not have to be locked to 8N, but if you want to reduce beat effects, then yes, choosing an 8N divide value would make the delays consistent.
It could use a WAITCNT loop, which should be the lowest power while still alive?
Another way would be to have the supervisor task "drip" a single slot each time a PASSCNT happens in the supervisor task. You could use a DDS-style counter in the supervisor cog to keep track of when to "drip" (setting the PASSCNT target).
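A rough C sketch of that DDS-style bookkeeping; the 16-bit accumulator width and the RATE value are arbitrary assumptions, and the real work of setting the PASSCNT target is only marked by a comment.

#include <stdint.h>
#include <stdio.h>

#define RATE 4096u   /* 4096/65536 = drip one slot in every 16 supervisor passes */

int main(void)
{
    uint16_t acc = 0;
    int granted = 0;

    for (int pass = 0; pass < 256; pass++) {
        uint16_t prev = acc;
        acc += RATE;              /* wraps modulo 2^16 */
        if (acc < prev) {         /* carry out => drip one slot on this pass */
            granted++;
            /* the real supervisor would set the PASSCNT target here so the
               worker task wakes for exactly one slot */
        }
    }

    printf("granted %d slots in 256 passes (expected ~%d)\n",
           granted, (int)(256 * RATE / 65536));
    return 0;
}

Changing RATE scales the drip rate fractionally, which is the attraction of the DDS approach.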
Sure, single step debug will work almost exactly like this.
In multi-task mode you have to use PASSCNT instead of WAITCNT. Not so sure about WAITPxx.
However, from what I understand, these instructions do not put the P2 into a type of sleep mode like they did on P1. So you are not going to get greatly reduced power - there will be some as some of the ALU will not be ticking over.
As jmg pointed out, it would make hub access difficult. It would also make pin OR'ing among cogs difficult, as there must be flipflops in these paths to meet timing requirements.
I'm glad you've been lurking and have been enjoying all this.
Good point, I was using WAITCNT in the generic sense.
The ALU will not be used, but the jump to self may take more power than a P1 WAITCNT.
Very few FF's will be changing, so power may come down to the clock tree, and how much of that is gated in the various modes.
That's up to Chip.
It recently occurred to me, that the silicon may not quite make 200MHz, in the CPU, but should make > 200MHz in the Timers.
Prescalers are relatively simple and many uC now have these, plus they give that insurance factor, and can give users power control.
Quite a few micros now run timers faster than the CPU.
Prescalers may also allow more FPGA testing coverage, as the FPGA counter info above looks like 200MHz+ should be practical, even if the CPU is at ~80MHz.
From what I understand, the reason not to use WAITCNT is that it stops all other tasks while it is waiting.
Thinking about it, that makes it perfect for low speed / low power running
task 0 - the code you want to run low speed, low energy
task 1 - the power saver
Set task 0 to run at 1/2 cycles
For extra savings, have task 1 wait for N cycles (N probably has to be bigger than 10 or so - whatever delta is the minimum)
Between controlling the % of slices the two tasks get, and the delta for waitcnt, you can extremely finely tune the time spent at almost no power vs. power needed.
No need for funky clock modes, no need for more hardware or instructions.
Just a cute use of the task feature and existing instructions.
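A back-of-the-envelope C sketch of that trade-off, under the unverified assumption that the power saver's WAITCNT stalls the whole cog for its delta while the worker task otherwise runs in its share of the slots; all the numbers are placeholders.

#include <stdio.h>

int main(void)
{
    double worker_share = 0.5;    /* fraction of slots given to task 0     */
    int    work_insns   = 100;    /* worker instructions per wake period   */
    int    wait_delta   = 10000;  /* task 1's WAITCNT delta, in clocks     */

    /* clocks of real activity needed to retire work_insns at that share */
    double active  = work_insns / worker_share;
    double stalled = wait_delta;
    double duty    = active / (active + stalled);

    printf("active %.1f%% of the time, stalled (low power) %.1f%%\n",
           100.0 * duty, 100.0 * (1.0 - duty));
    return 0;
}

Shrinking worker_share or growing wait_delta pushes the duty cycle down further, which is the fine tuning described above.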
More P&R info for LatticeECP3 ( 65 nm) 32b U/D Counter x 4
That 317.158MHz was with no particular CLK targets, so I tried 385MHz and it reported 328.623MHz; then, trying 330MHz, it gives
Report: 330.688MHz is the maximum frequency for this preference.
Seems the SW tries a little harder if it is close to making the target than if you over-spec the target?
Timing closure is a peculiar thing. You can ask for some impossibly-high Fmax and it will wind up compiling logic slower than what is actually possible to achieve. When you give it a realistic goal, it can achieve better, because it is actually possible. This goes for compiling standard cells for an ASIC, as well as logic elements for an FPGA. One thing that can happen in an ASIC, where there is no fixed routing grid like in an FPGA, is that you can get a marginally-higher Fmax by climbing the wall with additional signal buffering, but it comes at the expense of silicon area and power. It's best to back down the curve a bit to where power, speed, and area are more balanced.