New P2 Silicon Observations

Cluso99 · 2018-09-29 22:35

IMHO: There are about 1500 dies that should hopefully be packaged and that would make a decent test and small production run. Wait and see the results from that before a logic respin.

BTW If those are the only bugs then that is an amazing result really. A total credit to Chip. Especially when you see some of the AVR, etc, errata from a companies that do hundreds of designs!

Chip, you might ask Wendy what it would take to try and push for a 300MHz 200MHz critical path ??? Just a thought

BTW I am amazed at the speed anyway, but it's always nice to push the limits even further.

PostEdit: Meant try to push critical path from 180MHz??? to 200MHz

pedward · 2018-09-29 23:01

Something tells me these are going to be "ringer" chips and the respin is going to have a Fmax of 200Mhz before it nutters. I suspect rebuilding the chip with "proper" IQ modulation and lowered power draw is going to contributed to a lower critical speed.

jmg · 2018-09-29 23:20

Cluso99 wrote: »

Chip, you might ask Wendy what it would take to try and push for a 300MHz critical path ??? Just a thought BTW I am amazed at the speed anyway, but it's always nice to push the limits even further.

Remember, the target here was 180MHz, (-10% Vcc, 125'C) and that needed a lot of delay tweaks so 300MHz is never going to be a viable synthesis target.
Of course, there is always over-clocking, and Chip has been testing this at 1.8V, so a nudge up on Vcc and aggressive cooling, you might hit 300MHz 'bench test' numbers.

To me 200MHz is more realistic, perhaps with -4% Vcc, and lower TMax ? Other vendors spec that way, where highest MHz has tightest Vcc spec.

pedward wrote: »

I suspect rebuilding the chip with "proper" IQ modulation and lowered power draw is going to contributed to a lower critical speed.

Clock gating can lower the partial-clock powers, but it adds logic, so it is more likely to worsen the Max MHz and Max mA slightly, than improve them.

Roy Eltham · 2018-09-30 00:24

Honestly, I would be fine if the P2 never had NTSC in it. However, as Chip said the faulty part is useful for several other things that I think would be good to have.

Parallax does need to be careful shipping P2's with known issues that they plan to fix in a future revision. It can lead to support problems in the future. I would suggest that all these chips be shipped on boards that clearly indicate that they are prototypes or early versions.
If it's possible to have the on package markings adjust for this batch to indicate early version or prototype, then that would be even better. At least then, no matter where they end up, support for them can easily identify them.

Bob Lawrence (VE1RLL) · 2018-09-30 00:46

cgracey wrote: »

If this does well and we have the capital, we could use a 28nm process and get well over 1GHz. Power dissipation would be only 5%-10% of the current design.

re:If this does well and we have the capital, we could use a 28nm

Now we're talking

jmg · 2018-09-30 01:19

Roy Eltham wrote: »

Honestly, I would be fine if the P2 never had NTSC in it. However, as Chip said the faulty part is useful for several other things that I think would be good to have.

Parallax does need to be careful shipping P2's with known issues that they plan to fix in a future revision. It can lead to support problems in the future. I would suggest that all these chips be shipped on boards that clearly indicate that they are prototypes or early versions.
If it's possible to have the on package markings adjust for this batch to indicate early version or prototype, then that would be even better. At least then, no matter where they end up, support for them can easily identify them.

Should be no problem - Many vendors have a revision letter A,B,C etc that shows what die revision is inside, and the Chip Errata explains the differences.

Cluso99 · 2018-09-30 01:46

jmg wrote: »

Cluso99 wrote: »

Chip, you might ask Wendy what it would take to try and push for a 300MHz critical path ??? Just a thought BTW I am amazed at the speed anyway, but it's always nice to push the limits even further.

Remember, the target here was 180MHz, (-10% Vcc, 125'C) and that needed a lot of delay tweaks so 300MHz is never going to be a viable synthesis target.
Of course, there is always over-clocking, and Chip has been testing this at 1.8V, so a nudge up on Vcc and aggressive cooling, you might hit 300MHz 'bench test' numbers.

To me 200MHz is more realistic, perhaps with -4% Vcc, and lower TMax ? Other vendors spec that way, where highest MHz has tightest Vcc spec.

pedward wrote: »

I suspect rebuilding the chip with "proper" IQ modulation and lowered power draw is going to contributed to a lower critical speed.

Clock gating can lower the partial-clock powers, but it adds logic, so it is more likely to worsen the Max MHz and Max mA slightly, than improve them.

Oops. Yes I meant 200MHz, which would easily get to the next USB sweet spot of 192MHz. That may also allow us to push to 300MHz under pristine conditions

I just don't know the ramifications of trying this - I presume it's just a longer compile time to see if it's possible.

kbash · 2018-09-30 01:53

Watching the birth of a new "Baby" 13 years in the making was great. I too missed the exact moment of it's first breath, but appreciate greatly the opportunity to be a part of the process win, lose or draw. Looks like a win this time.

I agree with Cluso about getting the first chips out. Unless some MAJOR error pops up. getting the 1500 chips out and software/hardware hammered by fools like me before doing costly respins might be the best idea. I've found that knowing too much about how something SHOULD work sometimes works against you in finding all the failure modes of a product. It's amazing the first time you throw a new product out after "extensive" testing only find out how many ways to break something customers can come up with that you never thought of.

As to the "Trying for 300Mhz"... I am much less enthused about ideas like this. "Just one more thing" got us the "P2 HOT" and years more waiting. Enough is enough... and 180Mhz is plenty.

What I WOULD like to see, After this version is "Put to bed" are the LESS capable (less costly) variants that would be justified designing into places where the PIC's , AVRs, ATtiny and such are going.

I've had a fast 6 axis CNC program stuffed into a single cog for several years now, however, with the total cost of a P1 compared to the cost of an ARM or Arduino board. It was never justified to try to market it as a 3d printer controller.

Promised Code protection... I was waiting for the P2 to build the "Ultimate" 6-9 axis CNC controller... but honestly, not many of us NEED more than 4 axis machines. A couple of P2 cogs and 20-30K of memory would probably do just fine for a killer 3D printer controller.

Wrap up:
1. get the first chips onto boards and out into the hands of your loyal supporters (those who bought and used the FPGA boards).

2. If some are left over... get them out to fools like me who will figure out new ways to torture them.

3. After a (predetermined) few weeks of torture tests, make any known changes necessary and start turning out cheap, minimum footprint demo boards. ( that center pad might be the death of even those of us who are comfortable soldering fine-pitch chips). You might even consider a module similar to the ESP32.

4. take a nice vacation, bask in your glory.

5. get back to work turning out a (limited) few variants that can be designed into products that sell by the millions.

(my 2 cents.)

I'd be happy to purchase a few of the first boards as-is. ( I found one of my P1 boards that uses the SX-Key yesterday, I would CHERISH one of the first P2 sample chips but won't push my luck.)

Gotta get back to re-re-re designing my P2 CNC control board.

Congratulations!

Ken Gracey · 2018-09-30 03:13

Roy Eltham wrote: »

Honestly, I would be fine if the P2 never had NTSC in it. However, as Chip said the faulty part is useful for several other things that I think would be good to have.

Parallax does need to be careful shipping P2's with known issues that they plan to fix in a future revision. It can lead to support problems in the future. I would suggest that all these chips be shipped on boards that clearly indicate that they are prototypes or early versions.
If it's possible to have the on package markings adjust for this batch to indicate early version or prototype, then that would be even better. At least then, no matter where they end up, support for them can easily identify them.

Yes, yes. We have discussed markings and version numbers.

Ken Gracey

Ken Gracey · 2018-09-30 03:21

kbash wrote: »

As to the "Trying for 300Mhz"... I am much less enthused about ideas like this. "Just one more thing" got us the "P2 HOT" and years more waiting. Enough is enough... and 180Mhz is plenty.

Agreed. Wet blanket just arrived again.

You can count on me not to support the addition of any new features beyond those we designed for, even if they are "free". They'd take time to design and test and document. We need to learn from this device first and make improvements in time, leaving room for future releases along the way.

It may be easy for the community to make suggestions, not being aware of the actual impact it has on Parallax. These suggestions may impact our finances, staff levels, and ROI. Parallax has backed this project very well and the company support provided through the years warrants completion.

Ken Gracey

Roy Eltham · 2018-09-30 03:28

Ken,
As someone who has a even just little insight into the impact it's already had on Parallax, I wholeheartedly agree with you here. Just get this one to shipping, anything else can come in the future.
Get some ROI, with some (hopefully) associated growth, then think about revisions as finances/budget allow.

THere is so much potential in what we have now; Can't wait to see it realized.

Tubular · 2018-09-30 03:43

I know its not per the intended design, but with all the P2 features including LUT and streaming I'm pretty sure we could almost bitbang a decent NTSC driver if we had to

For my purposes its mainly greyscale I'm interested in. I'm absolutely sure we could bitbang greyscale.

Anyway I concur with the thoughts here, lets get it out and properly tested in real world applications before even contemplating a respin.

Roy Eltham · 2018-09-30 03:48

Tubular,
It's not just NTSC that it breaks. In case you missed that, go back and read Chip's replies earlier.
It's possible those other things could be bitbanged too.

Anyway, I agree, let's get this one out and fully tested.

KeithE · 2018-09-30 04:15

cgracey wrote: »

There's definitely something wrong with the IQ modulator in the colorspace converter. I was using "$signed(<term>)" to sign-extend some values to add to other values which already had the full number of bits, but there seems to be a problem coming from these sections. I've asked Wendy to look at it.

If you post a verilog snippet maybe a forum member can point out the problem. I know some people just avoid using that feature.

Here are two papers of potential interest:

http://www.sutherland-hdl.com/papers/2006-SNUG-Boston_standard_gotchas_presentation.pdf see page 16
http://www.tumbush.com/published_papers/Tumbush DVCon 05.pdf

Anyways congratulations on having many things working so soon. I've been involved in many chip bring-ups, and I can't image live streaming one. I can image what my ex-manager (New Yorker) would have said to someone filming him during bring-up ;-)

Edited to add:

For what it's worth, I've always seen sign extension done using the method in the first answer here:

https://stackoverflow.com/questions/4176556/how-to-sign-extend-a-number-in-verilog

extended[15:0] <= { {8{extend[7]}}, extend[7:0] };

cgracey · 2018-09-30 05:07

KeithE wrote: »

cgracey wrote: »

There's definitely something wrong with the IQ modulator in the colorspace converter. I was using "$signed(<term>)" to sign-extend some values to add to other values which already had the full number of bits, but there seems to be a problem coming from these sections. I've asked Wendy to look at it.

If you post a verilog snippet maybe a forum member can point out the problem. I know some people just avoid using that feature.

Here are two papers of potential interest:

http://www.sutherland-hdl.com/papers/2006-SNUG-Boston_standard_gotchas_presentation.pdf see page 16
http://www.tumbush.com/published_papers/Tumbush DVCon 05.pdf

Anyways congratulations on having many things working so soon. I've been involved in many chip bring-ups, and I can't image live streaming one. I can image what my ex-manager (New Yorker) would have said to someone filming him during bring-up ;-)

Edited to add:

For what it's worth, I've always seen sign extension done using the method in the first answer here:

https://stackoverflow.com/questions/4176556/how-to-sign-extend-a-number-in-verilog

extended[15:0] <= { {8{extend[7]}}, extend[7:0] };

Thanks, KeithE. I usually do sign extension as you showed in your last example, but in this part of the modulator, I used something maybe a little goofy. I need to figure out how it probably failed and then do an FPGA compile with it forced the wrong way, in order to see if the FPGA behaves the same as the silicon then.

Tubular · 2018-09-30 05:11

Roy Eltham wrote: »

Tubular,
It's not just NTSC that it breaks. In case you missed that, go back and read Chip's replies earlier.

True. Actually this brings up something possibly important; the final modulation of waveform in P2-Hot was a bit broken - from what I remember pure sine wave generation on one channel A was OK, but waveform on channel B (or indeed, modulation) had a bug. Earlier versions of P2-Hot were OK. I wonder what got cut and pasted across

Roy Eltham wrote: »

It's possible those other things could be bitbanged too.
Anyway, I agree, let's get this one out and fully tested.

Yes I fully agree. And cordic will help a lot too

evanh · 2018-09-30 05:19

Chip,
Did you know, unlike the Prop1, that repeated RDLONGs can only access on every second hub rotation? There is an implicit pending delay that always adds at least one rotation to each read access. Write accesses are fine.

PS: I assume it has something to do with simplifying the FIFO sequencing.

Cluso99 · 2018-09-30 05:27

Ken Gracey wrote: »

kbash wrote: »

As to the "Trying for 300Mhz"... I am much less enthused about ideas like this. "Just one more thing" got us the "P2 HOT" and years more waiting. Enough is enough... and 180Mhz is plenty.

Agreed. Wet blanket just arrived again.

You can count on me not to support the addition of any new features beyond those we designed for, even if they are "free". They'd take time to design and test and document. We need to learn from this device first and make improvements in time, leaving room for future releases along the way.

It may be easy for the community to make suggestions, not being aware of the actual impact it has on Parallax. These suggestions may impact our finances, staff levels, and ROI. Parallax has backed this project very well and the company support provided through the years warrants completion.

Ken Gracey

Sorry Ken.
My 300MHz was a typo - I meant 200MHz.

There's no hardware/software changes in this.

It's a question for Wendy (if there's a new respin) as to whether she thinks the tools might manage 200MHz (which is up from 180MHz).
I have no idea how hard the tools worked to achieve the 180MHz.

cgracey · 2018-09-30 05:40

200MHz was just beyond possibility. Wendy really tried, and knew that 200MHz would be the perfect number, but it was not to be. We could only get to 180MHz.

cgracey · 2018-09-30 05:44

evanh wrote: »

Chip,
Did you know, unlike the Prop1, that repeated RDLONGs can only access on every second hub rotation? There is an implicit pending delay that always adds at least one rotation to each read access. Write accesses are fine.

PS: I assume it has something to do with simplifying the FIFO sequencing.

Good observation, Evanh. I didn't realize that, but it makes sense, due to all the clocked stages used in hub read, in order to meet timing.

evanh · 2018-09-30 06:03

cgracey wrote: »

... but it makes sense, due to all the clocked stages used in hub read, in order to meet timing.

Oh, yep, I'm seeing it now. With one instruction inserted between two RDLONGs it misses it's next window and has to wait another round. But if, for example, the two RDLONGs are adjacent and reading consecutive ram blocks then it just makes it within the consecutive window. Immediately rereading the same address always has the problem.

EDIT: Okay, that makes sense with the minimum of 9 clocks specified in the docs. Reading consecutive blocks fits the 9 clocks because that has an inherent +1 timing progression in the eggbeater sequence.

EDIT2: I note the minimum of 9 clocks is the same for 16 Cog build. So, I was wrong about it being a whole rotation extra. Heh, it was 8 clocks on the Prop1. It just feels harsh I guess because the instructions are faster.

Heater. · 2018-09-30 06:24

Sorry if I'm off on a tangent here but this talk of sign extension in Verilog got me worrying about it my RISC V processor design attempt. I have been writing my CPU in Scala using SpinalHDL rather than Verilog because Verilog gives me headache. That leaves me wondering what Verilog code gets generated by Spinal for sign extension. So bear with me:

I tried an example component with a 16 bit unsigned internal state like so:

val counter = Reg(UInt(16 bits)) init(0)

We try to output it to a signed 32 bit bus, converting it to signed and extending it like so:

io.state32 := counter.asSInt.resize(32)

The Verilog code generated for this sign extension is this:

wire [15:0] _zz_1;
assign _zz_1 = counter;
assign io_state32 = {{16{_zz_1[15]}}, _zz_1};

Which if I understand correctly takes bit 15 of our counter, replicates it 16 times then concatenates the result with the counter value. Which looks like the correct thing to me and leaves no wiggle room for the compiler to get it wrong.

It's a nice example of how Verilog gives me headache. The Spinal expression is much nicer and spells out what it is doing.

Are we doing right here?

evanh · 2018-09-30 06:31

There is a group of missing timing values in the docs: SETQ/SETQ2 + RD/WRLONG

[/me goes reads the details of SETQ ...]

EDIT: Okay, it'll just be RD/WRLONG timings plus however many loadwords are read/written.

Cluso99 · 2018-09-30 08:15

cgracey wrote: »

200MHz was just beyond possibility. Wendy really tried, and knew that 200MHz would be the perfect number, but it was not to be. We could only get to 180MHz.

Thanks for that info Chip. I didn't know 200MHz had been tried (or I forgot if you said so). Anyway, what we've got is great!

evanh · 2018-09-30 14:44

Chip,
I just tested out RDFAST with and without bit31 of #D set. Using a RFBYTE immediately following the RDFAST, I can't get even a single data error!

I'm measuring a difference of 10 clocks between with and without bit31 set. But the data stream is perfect no matter what I try. I've redirecting the start address with an extra RDFAST before targeting the valid data. Somehow, even with as small as 2 clocks from RDFAST to RFBYTE it still works correctly.

It's working too perfect! I must be doing something wrong. O_o

Here's the core test code:

...
diag1		byte    "Sysclock = ",0
diag2		byte    ", SmartPin mode = %11111, Async Receive:  config = ",0
diag3		byte    13,10," rd-same   rd-next   rd-prev     wr-same   wr-next   wr-prev",13,10,0
diag4		byte    13,10,"Duration in ticks = ",0



ORG   $200                                    'longword addressing
'*******************************************************************************
'  LUT Code  (Has to be copied from hubram to lutram)
'*******************************************************************************
LUT_code
		mov     i, ##$8000_0001
		mov     j, ##diag3
'		waitx   #0
		getct   ticks
		rdfast  ##$8000_0000, #0      'start with garbage
		rfbyte  char
		rdfast  ##$000_0000, ##diag1
'		fblock  i, j                  'shift print to next block

		rfbyte  char                  'get next byte from FIFO
		call    #putch                'emit character

		mov     i, #140
		getct   j
.putsf
		rfbyte  char                  'get next byte from FIFO
		call    #putch                'emit character
		djnz    i, #.putsf            'increment pointer (assumes non-zero address)
		getct   ticke

		mov     parm, ##diag4
		call    #puts
		mov     parm, j
		sub     parm, ticks
		call    #itoa
		call    #putsp
		mov     parm, ticke
		sub     parm, ticks
		call    #itoa
		call    #putnl
		jmp     #$

evanh · 2018-09-30 15:04

Opps, I see. It was all nulls so wasn't printing until the valid data came through. Fool myself.

KeithE · 2018-09-30 16:34

Heater. wrote: »

Are we doing right here?

I don't understand why you would sign extend an unsigned counter.

On the subject of $signed() causing problems in the Prop2, I find that when these things happen it's good to investigate how they fell through the cracks. Look through the various log files for warnings. That way you'll know how to catch this particular issue in the future, and assure yourself that there aren't other yet undiscovered issues.

Simulators which are probably available to Parallax {ModelSim, Icarus, Verilator}
Synthesizers {Quartus, whatever was used for ASIC synthesis}
Formal Equivalence Checker {Formality, Conformal, or ???}
Linter {SpyGlass or ???}

All of these may flag issues that are mentioned in those papers. I'm not sure.

Heater. · 2018-09-30 17:00

Quite so, KeithE, one might rarely want to sign extend a counter. It was just an example I had open in my editor that I adapted to show sign extension. Imagine it is called "offset" or something, it does not matter.

In the RISC V one has to pull out some bit fields of an instruction, which are basically just a bunch unsigned bits, and assemble them into a signed quantity. Say and offset for a relative jump, you need to jump backwards as well as forwards, or an immediate value to add to register. So I worry that my sign extension works as it should.

So far I have only verified this by inspection of the generated Verilog, and with the Icarus and Verilator simulators. It has yet to hit synthesis to a real FPGA via Quartus. Likely there are traps I'm not aware of waiting for me. Hence my question.

Just now I'm wondering why the SpinalHDL Verilog generator uses that bit concatenation I showed rather than the $signed()

Looking at this paper: http://www.tumbush.com/published_papers/Tumbush DVCon 05.pdf $signed() seems to be able to cause confusions all of it's own!

KeithE · 2018-09-30 19:14

Heater - o.k. It looks like what you are doing should work.

Did you see this release? Multithreaded!

https://www.veripool.org/news/241-Verilator-Verilator-4-002-Released

Heater. · 2018-09-30 19:33

That sounds great.

I haven't felt the need for speed with multiple threads over multiple cores yet.

Good to know that development is going on full bore though.

New P2 Silicon Observations

Comments