Prop2 Analog Test Chip Arrived!

JRetSapDoog · 2016-11-26 15:36

cgracey wrote: »

The risk would be very low, but to do this properly, it needs to be a forethought, not an afterthought.

But self-hosted development is still doable, right, just not in as protected a way? I say that because this is Chip's chip and it is his desire to be able to do that and offer that, such as in school environments or wherever, so it'd be a shame not to meet that design desire if it were within reach.

I know that it is somewhat ironic that some desire having a simple development option that doesn't depend on a PC, considering that these complex chips couldn't be designed without the aid of complex operating systems and development software, despite all the baggage that comes with them. But it is still kind of comforting to have self-hosted development as one of our options, even if most will elect to use PCs and civilization doesn't collapse any time soon. The simplicity of it--the lack of the dependencies of another PC, its operating system, other software--is enticing: just the mind, a keyboard, a monitor and electricity (and a P2 board, of course).

Rayman · 2016-11-26 15:44

I'm still thinking a shared LUT can be used as I/O interface for cog being programmed.

The onboard compiler would just not generate any code that writes directly to hub or messes with I/O pins. You'd instead write your request to shared LUT and let conjoined cog do it.
Maybe some latency, but should be fine for blinking LEDs or whatever...

jmg · 2016-11-26 19:57

cgracey wrote: »

The 0.8V peak-peak XI input could even be tested on the test chip. I could run my function generator at low-voltage and high-frequency and observe its output to come up with a sensitivity curve.

Yes, that's a good idea.
You can check the chain output by looking at the VCO frequency.

cgracey wrote: »

If the maximum VCO frequency is 320 MHz, is someone really going to want to divide that by 256? I could see only 4 bits being useful.

I'm going by what even small MCUs offer, which is 7-8 bits of prescaler.
e.g. the new Nuvoton 8-bit parts have this formula:
fSYS = fOSC while CKDIV = 00H; otherwise fSYS = fOSC / (2 * CKDIV) for CKDIV = 01H to FFH.

The appeal of this is a means to snooze and save power, and even /256 can still give a decent 0.5 MIPS.

Can you get useful leakage numbers from the test chip?
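
To make the quoted CKDIV formula concrete, here is a minimal Python sketch of it. The function name and the 320 MHz source are assumptions for illustration only (320 MHz is the VCO maximum mentioned above); this is not a description of any actual P2 divider.

```python
def divided_clock_hz(f_osc_hz: float, ckdiv: int) -> float:
    """Apply the quoted rule: CKDIV = 0 passes the clock through,
    CKDIV = 1..255 divides it by 2 * CKDIV."""
    if not 0 <= ckdiv <= 0xFF:
        raise ValueError("CKDIV must fit in 8 bits")
    if ckdiv == 0:
        return f_osc_hz
    return f_osc_hz / (2 * ckdiv)


# Hypothetical 320 MHz source, purely for illustration.
for ckdiv in (0x00, 0x01, 0x08, 0x80, 0xFF):
    mhz = divided_clock_hz(320e6, ckdiv) / 1e6
    print(f"CKDIV={ckdiv:#04x} -> {mhz:.3f} MHz")
```

Note that with this rule an 8-bit field reaches a divide ratio of 510 rather than 256, because every non-zero setting divides by an even number.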

potatohead · 2016-11-26 20:43

But self-hosted development is still doable, right, just not in as protected a way?

Totally. What we can do is set up everything, and provide storage. At the sizes and clock rates being discussed, it's not going to take much to archive the state of RAM, run something, and recover if needed.

While I think a "local" kind of COG mode is worthy, we also really do have to balance changes. What we have right now will be just fine.

I'm going to make a bench computer. One can develop on it, and for little, "Hey I just want this" type stuff, making a signal, data logging, etc... short, powerful little programs make sense.

But, that same device can do a bunch of basic test / measure, and it would make a great general purpose instrument too.

jmg · 2016-11-26 21:25

Rayman wrote: »

I'm still thinking a shared LUT can be used as I/O interface for cog being programmed.

The onboard compiler would just not generate any code that writes directly to hub or messes with I/O pins. You'd instead write your request to shared LUT and let conjoined cog do it.
Maybe some latency, but should be fine for blinking LEDs or whatever...

That's workable too, but does dictate conjoined pairing.

A simple data-space limit in HUB, which protects the larger CODE areas by making them read-only, would allow a COG co-processor design that was proven safe.
It could only write to the exchange area, and never be able to write into what could be HUBEXEC space.

There could also be compile-time options that bound array writes to within array limits, for a very slight speed cost.
We do this safe-run-time stuff now, in tiny MCUs, with binary-sized arrays and a binary AND or OR.
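
As a rough illustration of the binary-sized array trick, the sketch below masks every index with a binary AND so a write can never land outside the buffer. The names `BUFFER_SIZE`, `INDEX_MASK` and `bounded_write` are hypothetical, and only the AND variant is shown.

```python
BUFFER_SIZE = 16              # must be a power of two for the mask to work
INDEX_MASK = BUFFER_SIZE - 1  # 0b1111


def bounded_write(buffer: list, index: int, value: int) -> int:
    """Write value at index wrapped into the buffer; return the slot used."""
    slot = index & INDEX_MASK  # one AND replaces a compare-and-branch
    buffer[slot] = value
    return slot


buf = [0] * BUFFER_SIZE
print(bounded_write(buf, 3, 0xAA))   # in range, lands in slot 3
print(bounded_write(buf, 19, 0xBB))  # out of range, wraps to 19 & 15
```

The cost is one extra AND per access, and the trade-off is that an out-of-range index silently wraps instead of raising an error.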

ozpropdev · 2016-11-26 23:37

cgracey wrote: »

So, I've got only two things to complete:

1) Finish testing new PLL - two days
2) Change J/K reporting in USB smart pin per Garryl - 15 seconds

and don't forget

3) rel9 issues with CALLPA/PB/D and Jxxx/JNxxx instructions

cgracey · 2016-11-27 00:47

ozpropdev wrote: »

cgracey wrote: »

So, I've got only two things to complete:

1) Finish testing new PLL - two days
2) Change J/K reporting in USB smart pin per Garryl - 15 seconds

and don't forget

3) rel9 issues with CALLPA/PB/D and Jxxx/JNxxx instructions

Yes!!!

cgracey · 2016-11-27 06:05

ozpropdev wrote: »

cgracey wrote: »

So, I've got only two things to complete:

1) Finish testing new PLL - two days
2) Change J/K reporting in USB smart pin per Garryl - 15 seconds

and don't forget

3) rel9 issues with CALLPA/PB/D and Jxxx/JNxxx instructions

I found the problem. The early instruction decoding for #rel9 branches was not including CALLPA/CALLPB and JINT..JNQMT in the block of instructions whose S values are sign-extended to 20 bits.

Are you sure CALLD was having trouble, or was it an assembler issue?
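
For readers following the #rel9 fix, this is a small Python sketch of what sign-extending a 9-bit S field to 20 bits means. It models only the arithmetic, not the Verilog, and the function name is an assumption.

```python
def sign_extend(value: int, from_bits: int, to_bits: int) -> int:
    """Widen a two's-complement field from from_bits to to_bits."""
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):  # sign bit set: negative offset
        # Fill every bit between from_bits and to_bits with ones.
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value


# A 9-bit offset of -2 is stored as 0x1FE.
print(hex(sign_extend(0x1FE, 9, 20)))  # stays -2 when read as 20 bits
print(hex(sign_extend(0x002, 9, 20)))  # positive offsets are unchanged
```

Without the extension, 0x1FE would be read as +510, so a backward branch would be taken as a long forward one.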

ozpropdev · 2016-11-27 08:46

cgracey wrote: »

Are you sure CALLD was having trouble, or was it an assembler issue?

Chip
Your post here might help.

cgracey · 2016-11-27 09:07

ozpropdev wrote: »

cgracey wrote: »

Are you sure CALLD was having trouble, or was it an assembler issue?

Chip
Your post here might help.

I remember that, but I'm wondering if maybe I didn't make a mistake on the assembler, after all, and the problem you were having was really Verilog related.

Can you tell me if CALLD was, indeed, failing, or was it just CALLPA, CALLPB, and the event jumps?

ozpropdev · 2016-11-27 11:55

Chip
In the example below I had to hand encode the CALLD D,#rel9 instruction to get it to work.

dat	org

'CCCC 1011001 CZI DDDDDDDDD SSSSSSSSS        CALLD   D,S/#rel9   {WC,WZ}

	setnib	dirb,#$f,#0
'	calld	myret,#mycode	'fail
'	calld	myret,@mycode	'fail
'	calld	myret,#@mycode	'fail
	long	$fb240c02	'works

	outh	#32
me	jmp	#me

mycode	outh	#33
	jmp	myret

myret	res	1
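
The hand-encoded long can be checked against the field layout given in the comment line (CCCC 1011001 CZI DDDDDDDDD SSSSSSSSS). The Python sketch below just slices the bits; the function and key names are assumptions for illustration.

```python
def decode_calld(word: int) -> dict:
    """Split a 32-bit instruction into the fields named in the layout comment."""
    return {
        "cond":   (word >> 28) & 0xF,    # CCCC
        "opcode": (word >> 21) & 0x7F,   # 1011001 for CALLD
        "c":      (word >> 20) & 1,
        "z":      (word >> 19) & 1,
        "i":      (word >> 18) & 1,      # 1 = S is an immediate (#rel9)
        "d":      (word >> 9) & 0x1FF,   # DDDDDDDDD
        "s":      word & 0x1FF,          # SSSSSSSSS
    }


fields = decode_calld(0xFB240C02)
print(fields)
```

This gives D = 6 and S = 2 with I = 1. Assuming the listing starts at cog address 0 and each instruction takes one long, myret sits at address 6 and mycode at address 4, which is two longs past the instruction following the CALLD. That is consistent with S being an offset relative to the next instruction, although the listing itself does not state this.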

cgracey · 2016-11-27 12:52

ozpropdev wrote: »

Chip
In the example below I had to hand encode the CALLD D,#rel9 instruction to get it to work.

dat	org

'CCCC 1011001 CZI DDDDDDDDD SSSSSSSSS        CALLD   D,S/#rel9   {WC,WZ}

	setnib	dirb,#$f,#0
'	calld	myret,#mycode	'fail
'	calld	myret,@mycode	'fail
'	calld	myret,#@mycode	'fail
	long	$fb240c02	'works

	outh	#32
me	jmp	#me

mycode	outh	#33
	jmp	myret

myret	res	1

Ok. Sorry. I see the same problem here. There was a Verilog mistake, too, that didn't allow the newer #rel9 branch instructions to go backwards.

There is definitely an assembler error, as well. I will address this soon. Sorry about the wait.

Seairth · 2016-11-27 18:46

I've got a question about the PLL dividers. I don't think this question is specific to the P2, but I'm primarily curious about the answer as it relates to the P2. The question is this: Given the dividers (divide by 2..64 and multiply by 2..1024), it would be possible to specify a number of combinations that are effectively the same, such as 2/2, 16/16, 64/64 or 32/2, 256/16, 1024/64. In cases like this, is there any reason to choose one combination over another?

jmg · 2016-11-27 19:08

Seairth wrote: »

I've got a question about the PLL dividers. I don't think this question is specific to the P2, but I'm primarily curious about the answer as it relates to the P2. The question is this: Given the dividers (divide by 2..64 and multiply by 2..1024), it would be possible to specify a number of combinations that are effectively the same, such as 2/2, 16/16, 64/64 or 32/2, 256/16, 1024/64. In cases like this, is there any reason to choose one combination over another?

The PFD frequency varies in those cases, so you might want to avoid certain PFD frequencies, but unless you were getting to the limits, I'd not expect much operational difference (e.g. a 4 MHz or 8 MHz PFD would give a slightly faster lock time, see the plots Chip did above, but once locked, any PFD element reduces to chip-leakage-difference compensation levels).
There will also be some practical minimum on PFD frequency, so at lower Xtal/ClkIn speeds, /64 may be too large.
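
To see how the equivalent settings differ, the Python sketch below lists every multiply/divide pair in the stated ranges (divide 2..64, multiply 2..1024) that gives the same overall ratio, along with the PFD frequency each one produces. The 20 MHz crystal is an assumed value, and the function name is hypothetical.

```python
from fractions import Fraction


def equivalent_settings(ratio: Fraction, xtal_hz: float):
    """Yield (multiply, divide, pfd_hz) for every in-range pair with this ratio."""
    for divide in range(2, 65):
        multiply = ratio * divide
        if multiply.denominator == 1 and 2 <= multiply <= 1024:
            yield int(multiply), divide, xtal_hz / divide


XTAL_HZ = 20e6  # assumed crystal frequency, for illustration only
for mult, div, pfd_hz in equivalent_settings(Fraction(16), XTAL_HZ):
    if div in (2, 16, 64):  # the three combinations named in the question
        print(f"x{mult} /{div}: PFD = {pfd_hz / 1e3:.1f} kHz, "
              f"VCO = {pfd_hz * mult / 1e6:.0f} MHz")
```

All the pairs give the same VCO frequency, but the PFD frequency falls as the input divider grows, which is the quantity that affects lock time and the practical minimum mentioned above.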

potatohead · 2016-11-27 20:50

There is some risk hedging here too.

We may find any number of little things, specific frequency resonances or noise, etc...

Having the multiple combinations allows both a wider range of frequency selection (I feel is important), and options to work around things we may find that simulation may not.

dMajo · 2016-11-28 08:28

jmg wrote: »

cgracey wrote: »

The 0.8V peak-peak XI input could even be tested on the test chip. I could run my function generator at low-voltage and high-frequency and observe its output to come up with a sensitivity curve.

Yes, that's a good idea.
You can check the chain output by looking at the VCO frequency.

cgracey wrote: »

If the maximum VCO frequency is 320 MHz, is someone really going to want to divide that by 256? I could see only 4 bits being useful.

I'm going by what even small MCUs offer, which is 7-8 bits of prescaler.
e.g. the new Nuvoton 8-bit parts have this formula:
fSYS = fOSC while CKDIV = 00H; otherwise fSYS = fOSC / (2 * CKDIV) for CKDIV = 01H to FFH.

The appeal of this is a means to snooze and save power, and even /256 can still give a decent 0.5 MIPS.

Can you get useful leakage numbers from the test chip?

@jmg and @cgracey
I think that if you need to save overall power you can develop the application with a lower clock source, and that what has been done so far is more than enough.

If there is some room/Verilog space regarding pre/post dividers, I would much prefer a "sort of" cog sysclock prescaler.

I mean an N value that will skip the next N*16 clocks. Or, to be compatible with smaller devices too, skip the next N*NumCogsInTheChip clocks.
In this way the clock pulse/period width and speed are the same across the whole chip. Only a specific cog, by setting its prescaler, runs for 16 clocks (the number of cogs in the chip = the base time unit) and then skips the next N*NumCogsInTheChip clocks, i.e. the next N base time units.
This makes the specific cog sleep for N complete hub rotations, without needing any wait and while preserving its state during the sleep, thus lowering its bandwidth/MIPS and so saving power. There are many times when you need a quick response to pin events or to fast/short events, and this requires higher chip clocks. But then, e.g., a serial transmission, logging, or other things can require significantly less time. This allows different cogs to run at different (averaged) speeds.
Since I think that the cog clock_in is already gated, this requires only one more enable on that gate, driven by the prescaler, which is synced with the hub turnaround.

cgracey · 2016-11-28 17:51

dMajo wrote: »

jmg wrote: »

cgracey wrote: »

The 0.8V peak-peak XI input could even be tested on the test chip. I could run my function generator at low-voltage and high-frequency and observe its output to come up with a sensitivity curve.

Yes, that's a good idea.
You can check the chain output by looking at the VCO frequency.

cgracey wrote: »

If the maximum VCO frequency is 320 MHz, is someone really going to want to divide that by 256? I could see only 4 bits being useful.

I'm going by what even small MCUs offer, which is 7-8 bits of prescaler.
e.g. the new Nuvoton 8-bit parts have this formula:
fSYS = fOSC while CKDIV = 00H; otherwise fSYS = fOSC / (2 * CKDIV) for CKDIV = 01H to FFH.

The appeal of this is a means to snooze and save power, and even /256 can still give a decent 0.5 MIPS.

Can you get useful leakage numbers from the test chip?

@jmg and @cgracey
I think that if you need to save overall power you can develop the application with a lower clock source, and that what has been done so far is more than enough.

If there is some room/Verilog space regarding pre/post dividers, I would much prefer a "sort of" cog sysclock prescaler.

I mean an N value that will skip the next N*16 clocks. Or, to be compatible with smaller devices too, skip the next N*NumCogsInTheChip clocks.
In this way the clock pulse/period width and speed are the same across the whole chip. Only a specific cog, by setting its prescaler, runs for 16 clocks (the number of cogs in the chip = the base time unit) and then skips the next N*NumCogsInTheChip clocks, i.e. the next N base time units.
This makes the specific cog sleep for N complete hub rotations, without needing any wait and while preserving its state during the sleep, thus lowering its bandwidth/MIPS and so saving power. There are many times when you need a quick response to pin events or to fast/short events, and this requires higher chip clocks. But then, e.g., a serial transmission, logging, or other things can require significantly less time. This allows different cogs to run at different (averaged) speeds.
Since I think that the cog clock_in is already gated, this requires only one more enable on that gate, driven by the prescaler, which is synced with the hub turnaround.

Yes, a cog-clock enable could be issued every clock, every 2nd clock, every 3rd clock, etc. It does get a little complicated, though, when interfacing to the hub. I wish we would have thought of this earlier! Good idea.

Rayman · 2016-11-28 18:37

Seems like you could do this using the WAIT and maybe interrupts...
If you just want to wait for a pin event, maybe it's even lower power...

jmg · 2016-11-28 19:08

Rayman wrote: »

Seems like you could do this using the WAIT and maybe interrupts...
If you just want to wait for a pin event, maybe it's even lower power...

Yes, the key question with a feature like this is how it compares in power with WAIT opcodes.
As you say, you can interrupt and wait, to generate an IDLE form, with any desired duty cycle.
During WAIT very little is clocking.

Even with a COG-swallow prescaler, you probably would want WAIT to run at SysCLK granularity (or certainly have the option to do that).

I'd still rate a (global) VCO post divider ahead of a COG-swallow prescaler, as the VCO post divider gives insurance on VCO range, can give more frequency precision, and lowers the clock rate of the global clock tree, which will be a significant power hog.

It also allows lower MHz SysCLKs, which are outside the VCO range, and it could simplify the design of things like Logic Analysers and Scopes, where a clock prescaler can set the sample rate, and save power on battery apps.
Clock sources are trending up in MHz, with many TCXOs now not available under 16 MHz, and more stocked around 25/26/27 MHz, so a Post Divider also covers those to lower SysCLKs.

e.g. imagine a P2-powered DSO203-type instrument, with a Logic Analyser included.

pedward · 2016-11-28 19:40

Chip, could you implement an ECC algorithm in hardware, which can then be used when reading the security bits? The idea being that we could have some built-in ECC correction to hedge against silicon failures. I'm sure that a widely used ECC algorithm (like Reed-Solomon, maybe) would have uses outside of just verifying the security bits.

Rayman · 2016-11-28 20:11

I was thinking about ECC earlier too...

But, was just thinking that there might be an even better way to account for bad fuse bits.
After the 128 bits of key, you have a table of bad key bits. Just use 8-bits per bad fuse, I think.
Empty table (no blown fuses) would mean all fuse bits are good.

So, you burn the key, read it, then write the bad fuse table.
Maybe could even be recursive to correct table errors...
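
A minimal Python sketch of how such a bad-fuse table might work, under loudly stated assumptions: each 8-bit entry is taken to hold a "used" flag in its top bit and a 7-bit fuse index below it, and a listed fuse is corrected by inverting the bit that was read. None of this encoding comes from the thread beyond "8 bits per bad fuse", and the recursive correction of table errors is not modeled.

```python
KEY_BITS = 128
ENTRY_VALID = 0x80  # hypothetical: top bit marks a used table entry


def make_table(intended_key: int, readback_key: int) -> list:
    """After burning, list every key bit that read back wrong."""
    diff = intended_key ^ readback_key
    return [ENTRY_VALID | bit for bit in range(KEY_BITS) if (diff >> bit) & 1]


def apply_table(readback_key: int, table: list) -> int:
    """Invert each key bit named by a used table entry."""
    key = readback_key
    for entry in table:
        if entry & ENTRY_VALID:
            key ^= 1 << (entry & 0x7F)
    return key


intended = 0x0123456789ABCDEF0123456789ABCDEF
readback = intended ^ (1 << 5) ^ (1 << 100)  # pretend two fuses misbehaved
table = make_table(intended, readback)
print([hex(e) for e in table])
print(apply_table(readback, table) == intended)
```

With the flag bit, an all-zero entry means "unused", which matches the idea that an empty table means all fuse bits are good.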

jmg · 2016-11-28 20:29

Rayman wrote: »

Maybe could even be recursive to correct table errors...

hehe, yes this quickly becomes a challenge of how to cover failures...

ECC has limited bad-bit recovery, and Chip has proposed a dual-fuse redundancy approach to try to help yields.

With more info from OnSemi, hopefully the fuse-yield can be improved - it seems using Electro Migration as the main mechanism, with some tail-end-rounding from local melting, but avoiding the splat effects of metal vaporization, should give better long term yields.
A P2 can manage time-domain PWM control of fuse energy reasonably well, and even better if a smart pin can be used.

Rayman · 2016-11-28 22:17

Dual fuse redundancy is simplest ECC, but costs the most bits... Leaves none for user.
ECC is better.

Guess it depends on the expected error rate. If it's really low, then I think this table will cost the least bits. If one expects at most one or two, then just need 16 bits for table, leaving 112 for user...

dMajo · 2016-11-28 23:53

cgracey wrote: »

dMajo wrote: »

jmg wrote: »

cgracey wrote: »

The 0.8V peak-peak XI input could even be tested on the test chip. I could run my function generator at low-voltage and high-frequency and observe its output to come up with a sensitivity curve.

Yes, that's a good idea.
You can check the chain output by looking at the VCO frequency.

cgracey wrote: »

If the maximum VCO frequency is 320 MHz, is someone really going to want to divide that by 256? I could see only 4 bits being useful.

I'm going by what even small MCUs offer, which is 7-8 bits of prescaler.
e.g. the new Nuvoton 8-bit parts have this formula:
fSYS = fOSC while CKDIV = 00H; otherwise fSYS = fOSC / (2 * CKDIV) for CKDIV = 01H to FFH.

The appeal of this is a means to snooze and save power, and even /256 can still give a decent 0.5 MIPS.

Can you get useful leakage numbers from the test chip?

@jmg and @cgracey
I think that if you need to save overall power you can develop the application with a lower clock source, and that what has been done so far is more than enough.

If there is some room/Verilog space regarding pre/post dividers, I would much prefer a "sort of" cog sysclock prescaler.

I mean an N value that will skip the next N*16 clocks. Or, to be compatible with smaller devices too, skip the next N*NumCogsInTheChip clocks.
In this way the clock pulse/period width and speed are the same across the whole chip. Only a specific cog, by setting its prescaler, runs for 16 clocks (the number of cogs in the chip = the base time unit) and then skips the next N*NumCogsInTheChip clocks, i.e. the next N base time units.
This makes the specific cog sleep for N complete hub rotations, without needing any wait and while preserving its state during the sleep, thus lowering its bandwidth/MIPS and so saving power. There are many times when you need a quick response to pin events or to fast/short events, and this requires higher chip clocks. But then, e.g., a serial transmission, logging, or other things can require significantly less time. This allows different cogs to run at different (averaged) speeds.
Since I think that the cog clock_in is already gated, this requires only one more enable on that gate, driven by the prescaler, which is synced with the hub turnaround.

Yes, a cog-clock enable could be issued every clock, every 2nd clock, every 3rd clock, etc. It does get a little complicated, though, when interfacing to the hub. I wish we would have thought of this earlier! Good idea.

The idea is not to divide the sysclock for a given cog, but instead not allow some clock pulses to reach it.
It will consistently skip as many clock periods as are needed for a complete hub turnaround. How many clocks depends on how many cogs are in the chip. So basically you decide how many complete turnarounds to skip (0..255). This should completely avoid toggling of any cog logic and hub address/data (enable, cs) lines currently under control of the sleeping cog.
Such a cog, if N>0, will run in bursts of 16 (CogsInTheChip) clocks every N hub rotations.
During sleep, ideally, this should reduce chip power by 1/16, if you do not consider the static losses of the cog logic and 1/16 of hub RAM.

I thought that forcing this kind of sleep changes nothing with regard to the hub and other cogs. The hub windows still rotate; only the sleeping cog is not accessing/using them.
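
The burst-and-skip behaviour can be sketched as a simple enable pattern, shown here in Python. This is only a model of the proposal as I read it (run for one hub rotation, then swallow the next N rotations, so a burst arrives every N+1 rotations); the function names are hypothetical and nothing here reflects actual P2 logic.

```python
COGS = 16  # clocks per hub rotation on a 16-cog chip


def cog_enable_pattern(n: int, rotations: int, cogs: int = COGS) -> list:
    """One entry per system clock: 1 = clock reaches the cog, 0 = swallowed."""
    pattern = []
    for rotation in range(rotations):
        active = rotation % (n + 1) == 0  # run one rotation, skip the next n
        pattern.extend([1 if active else 0] * cogs)
    return pattern


def duty_cycle(n: int) -> float:
    """Fraction of system clocks that reach the cog."""
    return 1 / (n + 1)


pattern = cog_enable_pattern(n=3, rotations=8)
print(sum(pattern), "of", len(pattern), "clocks delivered")
print("duty cycle:", duty_cycle(3))
```

Because whole rotations are skipped, the cog always wakes at the same hub phase, which is why the other cogs' hub windows would be unaffected in this model.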

cgracey · 2016-11-29 00:51

Rayman wrote: »

I was thinking about ECC earlier too...

But, was just thinking that there might be an even better way to account for bad fuse bits.
After the 128 bits of key, you have a table of bad key bits. Just use 8-bits per bad fuse, I think.
Empty table (no blown fuses) would mean all fuse bits are good.

So, you burn the key, read it, then write the bad fuse table.
Maybe could even be recursive to correct table errors...

I like this idea. But, wait... there's a chance that the bad-block indicator fuse will fail, too.

jmg · 2016-11-29 00:57

cgracey wrote: »

I like this idea. But, wait... there's a chance that the bad-block indicator fuse will fail, too.

Yes, anything that applies a band-aid to try to tag errors, is itself exposed to errors...

That means you really do need to try to get the fuse yield up, otherwise blowing an average of 128 of them, has a quite low chance of success.
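
To put rough numbers on this, the Python sketch below computes the chance that every one of 128 fuse operations succeeds. It assumes each fuse succeeds independently with the same probability, which is a simplification, and the per-fuse yields are made-up inputs rather than measured data.

```python
def all_fuses_ok(per_fuse_yield: float, fuses: int = 128) -> float:
    """Probability that every fuse programs correctly, assuming independence."""
    return per_fuse_yield ** fuses


# Hypothetical per-fuse yields, for illustration only.
for per_fuse in (0.90, 0.99, 0.999, 0.9999):
    print(f"per-fuse yield {per_fuse}: "
          f"whole-key success {all_fuses_ok(per_fuse):.4f}")
```

Even a 99% per-fuse yield leaves well under a one-in-three chance of a clean 128-bit key in this model, which is why raising the fuse yield matters more than patching around failures.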

jmg · 2016-11-29 01:04

dMajo wrote: »

The idea is not to divide the sysclock for a given cog, but instead not allow some clock pulses to reach it.

I think this has been brought up before.
This approach requires a gated clock, with a counter-gate-enable, and I'm not sure the ASIC flow supports this.
This adds a skew to the clock tree, and lowers the total SysCLK.

Worse, a fully gated COG clock, also forces WAIT to be far more granular, which limits the use cases.
The 1 SysCLK granularity of the P2 is an important differentiation from other MCUs.

A VCO post divider, goes before the global clock buffer, so has no design-wide impact.

Rayman · 2016-11-29 01:31

cgracey wrote: »

I like this idea. But, wait... there's a chance that the bad-block indicator fuse will fail, too.

That's what I meant about the "recursive" feature. You would write the table. Then, you'd read it and if there's a bad bit, you'd write another table entry to account for it.

If the failure rate is low, this should be fine.

Rayman · 2016-11-29 01:33

BTW: The flash memory ECC is pretty robust, but they have to worry about bits flipping over time. I think that's a fundamental difference here. We just have to worry about bit flips at write time, not later...

cgracey · 2016-11-29 13:41

With XTAL and VCO dividers feeding the phase-frequency detector in the PLL with low ~300 kHz signals, there is a big need for very stable 1.8V power, so that the power supply does not change faster than the PLL can compensate. After a couple of harebrained schemes, I found a reliable way to make stable 1.8V power.

I first thought of using the VDD pin near the XO pin to become a dedicated VDD_PLL input, but that was kind of messy. Then, I thought of using a bandgap to generate a reference voltage from which a good 1.8V supply could be made from the local VIO (3.3V), but I remembered that those things need offset trimming, and that gets into fuses. Not good.

Finally, I figured that the internal 1.8V supply, on average, is a good reference, but it needs to be heavily filtered. So, I made this passive VDD filter that slows the voltage WAY down. It can take 400mV steps at 10KHz and reduce them to a gentle 5mV ripple:

Then, I used an instance of the wide-output transconductance amplifier that is used as a comparator in each I/O pin. I hooked the VDD filter into it and made a voltage regulator. This repurposes those 160 20x20um NMOS caps that are already in the clock pin layout. Here it is with a simulation of it taking a 500mV-stepped VIO and a 250mV-stepped VDD, and regulating a nice, clean 1.8V power supply that is delivering 2.5mA. You can see the regulated 1.8V in red (VDDQ), along with the crazy stepped VDD and VIO supplies that feed the regulator. In reality, VIO will never transition so abruptly (500mV in 10ns), and those 50ns/25mV VDDQ spikes will be much lower:

We will run both the PLL and the RC oscillator from this internal 1.8V power supply. I just need to corner test this now.
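
As a back-of-envelope check on the passive filter figures (400mV steps at 10KHz reduced to 5mV ripple), the Python sketch below converts that ratio to decibels and estimates the corner frequency a single-pole RC low-pass would need to achieve it. The single-pole model and the sinusoidal treatment of the steps are my assumptions; the actual filter topology is not described in the post.

```python
import math


def attenuation_db(v_in: float, v_out: float) -> float:
    """Voltage attenuation expressed in decibels."""
    return 20 * math.log10(v_in / v_out)


def single_pole_corner_hz(f_hz: float, ratio: float) -> float:
    """Corner frequency of a first-order low-pass that attenuates f_hz by ratio.

    From |H| = 1 / sqrt(1 + (f / fc)^2) = 1 / ratio.
    """
    return f_hz / math.sqrt(ratio ** 2 - 1)


ratio = 0.400 / 0.005
print(f"attenuation: {attenuation_db(0.400, 0.005):.1f} dB")
print(f"single-pole corner: {single_pole_corner_hz(10e3, ratio):.0f} Hz")
```

Under those assumptions the filter behaves like a low-pass with a corner a little above 100 Hz, which fits the description of slowing the voltage "WAY down".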
