List of Changes in Next P2 Silicon

cgracey · 2018-11-16 06:57

I thought it would be helpful to list all changes made to the P2 source Verilog, so that everyone could anticipate what is coming next. I will maintain this list.

Bugs in initial silicon fixed (all known bugs):

(a) Sign-extension problem that caused IQ modulator, quadrature decoder, and ALTx negative deltas to not work. Fixed sources to follow all Verilog signed-expression rules.
(b) 1/2/4-bit output in the streamer's RFBYTE mode didn't output DAC data if the pins were disabled. Getting redesigned.
(c) DIR transitioning after OUT causing negative pin glitches. Timing constraints being added.

Completed improvements for next silicon:

(1) XORO32 improved with better settings. No source code or tool impact.
(2) POP now returns Z=1 if result=0 (it used to return Z=result[30]). Source code impact, no tool impact.
(3) BITL/BITH/BITC/BITNC/BITZ/BITNZ/BITRND/BITNOT can now work on a span of bits (+S[9:5] bits). Prior SETQ overrides S[9:5]. Source code impact, no tool impact.
(4) DIRx/OUTx/FLTx/DRVx can now work on a span of pins (+D[10:6] pins). Prior SETQ overrides D[10:6]. Source code impact, no tool impact.
(5) WRPIN/WXPIN/WYPIN/AKPIN can now work on a span of pins (+S[10:6] pins). Prior SETQ overrides S[10:6]. Source code impact, no tool impact.
(6) BIT_DAC output now has 4-bit settings for low and high states, instead of single 8-bit setting vs. GND. Source code impact, no tool impact.
(7) RDxxxx/WRxxxx+PTRx expressions now index -16..+16 with updating and -32..+31 without updating. No source code impact, but assembler impact.
(8) RDLUT/WRLUT now take PTRx expressions. Source code and assembler impact.
(9) HDMI added to streamer with ascending and descending pinouts. No source code or tool impact.
(10) Sensible PTRx behavior for 'SETQ(2)+RD/WR/WMLONG' operations. Source code impact, no tool impact.
(11) System counter extended to 64 bits. GETCT WC retrieves upper 32-bits of 64-bit system counter. No source code impact, but assembler impact.
(12) SINC2/SINC3 filters added to smart pins for doubling the effective number of bits in ADC conversions.
(13) Each cog has four 8-bit-sample-per-clock ADC scope channels. No source code impact, but assembler impact.
(14) New streamer modes. SINC1/SINC2 supported for Goertzel. Source code impact, no tool impact.
(15) Clock-gating to reduce dynamic power, achieved by tool configuration - 1,830 clock gates added, which eliminated lots of ENA muxes.
(16) LUT sharing is now glitch-free.
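
Items (3) to (5) pack a span length into the bits just above the base bit/pin number. A minimal Python sketch of that field split, assuming the base field is S[4:0] for the BITx instructions and D[5:0] or S[5:0] for the pin instructions (inferred from the "+S[9:5]" and "+D[10:6]" notation above). The function names are hypothetical, and how a span behaves at the top of a register or pin group is not modeled:

```python
def decode_span(operand, base_width):
    """Split an operand into (base, additional).

    base_width = 5 models BITx: base bit in [4:0], extra bits in [9:5].
    base_width = 6 models the pin instructions: base pin in [5:0],
    extra pins in [10:6]. Both field layouts are assumptions taken
    from the list notation, not from the Verilog.
    """
    base = operand & ((1 << base_width) - 1)
    additional = (operand >> base_width) & 0x1F  # 5-bit "+N" field
    return base, additional


def span_length(operand, base_width):
    """Total number of bits/pins affected: the base plus the additional ones."""
    _, additional = decode_span(operand, base_width)
    return additional + 1


# Pin 8 plus 3 more pins, encoded as base | (additional << 6)
pins = decode_span(8 | (3 << 6), 6)
# Bit 4 plus 7 more bits, encoded as base | (additional << 5)
bits = decode_span(4 | (7 << 5), 5)
```

With these encodings, `pins` describes a 4-pin span starting at pin 8 and `bits` an 8-bit span starting at bit 4.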

Planned improvements for next silicon:

(17) Install latest ROM code. - DONE
(18) Reduce ADC integrator caps by 50% to increase ADC bandwidth. - NOT DONE
(19) Be able to output system CLK via smart pins, must explore with ON Semi. - NOT DONE

ozpropdev · 2018-11-16 07:34

Does the streamer update include the 1/2/4 bit bug fix?

jmg · 2018-11-16 07:39

ozpropdev wrote: »

Does the streamer update include the 1/2/4 bit bug fix?

Maybe a list of fixed bugs needs to be included too? (e.g. those affected by the Verilog syntax issue)

cgracey · 2018-11-16 07:52

ozpropdev wrote: »

Does the streamer update include the 1/2/4 bit bug fix?

Of course.

cgracey · 2018-11-16 07:52

jmg wrote: »

ozpropdev wrote: »

Does the streamer update include the 1/2/4 bit bug fix?

Maybe a list of fixed bugs needs to be included too? (e.g. those affected by the Verilog syntax issue)

Good idea. I'll add that.

Cluso99 · 2018-11-16 08:07

15) Discuss power usage with OnSemi.

Why doesn't power drop when hub is not being accessed? Perhaps the HUB RAM is continually being accessed?

Why doesn't power drop when cogs are not running?

Anything else hogging the power usage?

jmg · 2018-11-16 08:15

Cluso99 wrote: »

15) Discuss power usage with OnSemi.

Why doesn't power drop when hub is not being accessed? Perhaps the HUB RAM is continually being accessed?

Why doesn't power drop when cogs are not running?

Anything else hogging the power usage?

Yes, the clock tree is the main power hog.
There is no clock gating, so that results in nanofarads of power-dissipation capacitance (Cpd).
Clock gating was considered, but has been deferred to later revisions.

cgracey · 2018-11-16 08:17

I need to investigate clock-gating. I'll add that to the list.

evanh · 2018-11-16 08:25

Cluso,
The power curve looks roughly the same as that of the Prop1. It runs hotter due to more transistors in the active path. There's nothing unusual.

jmg · 2018-11-16 08:31

evanh wrote: »

Cluso,
The power curve looks roughly the same as that of the Prop1. It runs hotter due to more transistors in the active path. There's nothing unusual.

It's not quite the same as P1 when you drill into the wait modes and N cogs:

* In P1, wait is granular: the COG clock pauses while just the minimal wait hardware spins, which gives quite low Cpd values for WAIT.
* In P1, it behaves on a per-COG basis, so an inactive COG truly is inactive.

P2 does neither of those; the clock tree drives all the time, feeding all those registers. Active COGs add only slightly to Cpd, mainly due to register-out nodes also toggling.

cgracey · 2018-11-16 08:36

Yeah, it has the feel of an FPGA, not a micro.

evanh · 2018-11-16 09:06

Reducing clock rate is the biggie; that was true on the Prop1 too.

EDIT: On the other hand, static leakage on the Prop2 is, surprisingly, only about 10x the current of the Prop1. Maybe clock gating can do more here, particularly at higher clock rates. Max logic power for the Prop1 was 330 mW (I first wrote 100 mW, having mixed up current with power). Max logic power for the Prop2 is spec'd at 1.0 W, but given the 300 MHz clock rates we're hitting it's more like 2.0 W, maybe up to 3.0 W while hammering hubram and cordic.

EDIT2: For reference: Prop1 static leakage is about 3.2 uA (11 uW). Prop2 is about 37 uA (67 uW). I'm surprised how low power the Prop2 could be.
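
The microwatt figures above follow from P = I × V. A quick Python check, assuming a 3.3 V supply for the Prop1 and a 1.8 V core supply for the Prop2; neither voltage is stated in this thread, they are simply the values that reproduce the quoted numbers:

```python
P1_SUPPLY_V = 3.3  # assumption: Prop1 supply voltage
P2_CORE_V = 1.8    # assumption: Prop2 core (VDD) voltage


def static_power_uw(current_ua, supply_v):
    """Static power in microwatts from leakage current in microamps."""
    return current_ua * supply_v


p1_uw = static_power_uw(3.2, P1_SUPPLY_V)   # Prop1 leakage figure from the post
p2_uw = static_power_uw(37.0, P2_CORE_V)    # Prop2 leakage figure from the post
current_ratio = 37.0 / 3.2                  # the "about 10x" current comparison

print(f"Prop1: {p1_uw:.1f} uW, Prop2: {p2_uw:.1f} uW, ratio: {current_ratio:.1f}x")
```

Under those assumed voltages the results round to the 11 uW and 67 uW quoted, and the current ratio comes out a little above 11x.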

evanh · 2018-11-16 09:45

I can't believe the Prop1 ever ran that hot really.

cgracey · 2018-11-16 16:20

I just finished getting the PTRx behavior straightened out for SETQ(2)+RD/WR/WMLONG.

It works like this:

	SETQ	#16-1		'ready to transfer 16 longs
	RDLONG	base,PTRA	'read at PTRA

	SETQ	#10-1		'ready to transfer 10 longs
	RDLONG	base,++PTRB	'read at PTRB+10<<2, PTRB += 10<<2

	SETQ	#100-1		'ready to transfer 100 longs
	RDLONG	base,--PTRA	'read at PTRA-100<<2, PTRA -= 100<<2

	SETQ	#8-1		'ready to transfer 8 longs
	RDLONG	base,PTRA++	'read at PTRA, PTRA += 8<<2

	SETQ	#5-1		'ready to transfer 5 longs
	RDLONG	base,PTRB--	'read at PTRB, PTRB -= 5<<2

Only the MSB of the encoded index is used to increment or decrement PTRx by the block size. This way, you can keep loading or storing memory sequentially.

One more thing off the list.
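
The pointer arithmetic in the comments above can be modeled in a few lines of Python. This is only a sketch of the behavior as described in this post, not of the Verilog; `block_transfer` and its mode strings are hypothetical names used for illustration:

```python
LONG_BYTES = 4  # one long is 4 bytes, so "n<<2" in the comments is n * 4


def block_transfer(ptr, longs, mode):
    """Return (start_address, new_ptr) for a SETQ block transfer of `longs` longs.

    mode is one of "PTR", "++PTR", "--PTR", "PTR++", "PTR--", mirroring
    the five PTRx expressions in the listing above.
    """
    size = longs * LONG_BYTES
    if mode == "PTR":       # no update: read at PTR
        return ptr, ptr
    if mode == "++PTR":     # pre-increment: read at PTR+size, PTR += size
        return ptr + size, ptr + size
    if mode == "--PTR":     # pre-decrement: read at PTR-size, PTR -= size
        return ptr - size, ptr - size
    if mode == "PTR++":     # post-increment: read at PTR, PTR += size
        return ptr, ptr + size
    if mode == "PTR--":     # post-decrement: read at PTR, PTR -= size
        return ptr, ptr - size
    raise ValueError(f"unknown mode: {mode}")


# Two back-to-back 8-long PTR++ transfers cover adjacent blocks of memory.
ptr = 0x1000
first_start, ptr = block_transfer(ptr, 8, "PTR++")
second_start, ptr = block_transfer(ptr, 8, "PTR++")
```

The second transfer starts exactly where the first one ended, which is the sequential load/store pattern the post describes.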

Rayman · 2018-11-16 17:00

List looks not so dramatic as it was in my mind, that really helps.

What is "CLK output mode in smart pins"?

cgracey · 2018-11-16 17:07

Rayman wrote: »

List looks not so dramatic as it was in my mind, that really helps.

What is "CLK output mode in smart pins"?

Jmg has been pointing out that we ought to get the internal clock out onto pins so that it can clock things and coordinate with the streamer.

cgracey · 2018-11-16 17:08

I'm wondering if I forgot anything on the list. Maybe someone here remembers more than I do.

TonyB_ · 2018-11-16 17:42

cgracey wrote: »

- The streamer modes now output proper DAC data for the 1/2/4-bit RFBYTE modes, in case the pins aren't enabled (e=0).

cgracey · 2018-11-16 17:44

TonyB_ wrote: »

cgracey wrote: »

- The streamer modes now output proper DAC data for the 1/2/4-bit RFBYTE modes, in case the pins aren't enabled (e=0).

That is (b).

TonyB_ · 2018-11-16 17:54

Oops, nothing forgotten then.

cgracey · 2018-11-16 18:00

TonyB_ wrote: »

Oops, nothing forgotten then.

I knew you would remember anything that had changed. I went through the Verilog to remind myself of what had changed. I think it's all listed there.

twm47099 · 2018-11-16 18:08

What about ADC noise?

TonyB_ · 2018-11-16 18:23

cgracey wrote: »

TonyB_ wrote: »

Oops, nothing forgotten then.

I knew you would remember anything that had changed. I went through the Verilog to remind myself of what had changed. I think it's all listed there.

I remembered there was another list from a little while ago:
http://forums.parallax.com/discussion/comment/1450738/#Comment_1450738

I checked just the numbers in the first post because a-c were off the top of the screen!

Rayman · 2018-11-16 19:14

Maybe new ROM should be on list?

potatohead · 2018-11-16 19:21

Yes.

pedward · 2018-11-16 19:23

Chip, what do you think about getting the Verilog into a proper version control system, so you can better track changes and enter notes with changes? Tools like GitHub allow you to visualize code changes, see who changed a line of code and in what commit, and annotate lines of code with the notes that were committed with that change.

It would also allow you to move towards using a bug tracker to track and classify bugs, feature requests, etc.

At work we also have another system called WorkLogs, which tracks features from the kernel state to full implementation, it's like a bug tracker, but is more of a formalized way of fleshing out designs and tracking those features to implementation.

jmg · 2018-11-16 20:16

cgracey wrote: »

Rayman wrote: »

List looks not so dramatic as it was in my mind, that really helps.

What is "CLK output mode in smart pins"?

Jmg has been pointing out that we ought to get the internal clock out onto pins so that it can clock things and coordinate with the streamer.

Yes, the issue here is that the streamer can pump data at SysCLK speeds, which is very impressive, but you cannot connect it to any part that requires a clock at that rate.
Currently, the highest CLK output is SysCLK/2.

One real example: there are SPI LCD displays designed for the Raspberry Pi that are CPLD based and spec'd to operate up to 128 MHz (that's where the Pi stops); the CPLD can go faster.
If the P2 could output SysCLK alongside the streamer, it could hit that SPI speed with a 128 MHz PLL and save a whole lot of power (as well as staying inside the actual spec!). At the 180 MHz spec, it has scope to out-pace the Pi.
A P2 updating an SPI display faster than a Pi will get people's attention.

Being able to simply connect to already existing infrastructure, like this fast SPI LCD, will be important for P2 sales.
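
To put numbers on the SysCLK/2 limit, here is a small Python sketch. The function name is hypothetical, and a divider of 1 stands for the proposed SysCLK output mode, which does not exist in current silicon:

```python
def required_sysclk_mhz(target_clk_mhz, sysclk_divider):
    """System clock needed to produce target_clk_mhz when the pin clock is SysCLK / divider."""
    return target_clk_mhz * sysclk_divider


SPI_LCD_CLK_MHZ = 128  # display clock figure quoted above

today = required_sysclk_mhz(SPI_LCD_CLK_MHZ, 2)     # fastest CLK is SysCLK/2
proposed = required_sysclk_mhz(SPI_LCD_CLK_MHZ, 1)  # hypothetical SysCLK output

print(f"SysCLK needed today: {today} MHz, with a SysCLK output: {proposed} MHz")
```

With the current SysCLK/2 limit the chip would have to run at 256 MHz to clock that display at full speed, versus 128 MHz if SysCLK itself could be driven out.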

Tubular · 2018-11-16 20:20

I'd add a look at 'pin pulldown' to the list. It's been observed that floating inputs tend towards '1' rather than '0', perhaps thanks to the interleaving positive VIO and VDD pins being closest.

I know an analog pad block respin isn't on the cards for the next iteration, and that's absolutely fine, but we do have other options, such as whether to engage the 150 kohm pulldown resistors, or look at a GND "guard ring" on the PCB that might tend things back towards '0'.

This is all really low priority. If nothing at all is done, all we have to do is manage user expectations for why their inputs show '1' when nothing is connected, but while we're making a list it may as well go on it.

Rayman · 2018-11-16 20:30

Ok, I can see the "CLK output mode" useful when using external HDMI encoder too...
For 1080p, you could run at 150 MHz and output the pixels...

Otherwise, I guess you'd need 300 MHz for digital 1080p60 video...

jmg · 2018-11-16 20:36

Rayman wrote: »

Maybe new ROM should be on list?

Good point, that has already been upgraded.
If there is ROM space, I'd like to see SPI Dual IO read attempt on Flash (with 1 bit SPI fallback). Dual IO has zero added pin cost, but doubles the data speeds.
Candidate commands are 0BBH and 03BH

Cluso99 · 2018-11-16 21:56

cgracey wrote: »
I just finished getting the PTRx behavior straightened out for SETQ(2)+RD/WR/WMLONG.

It works like this:
	SETQ	#16-1		'ready to transfer 16 longs
	RDLONG	base,PTRA	'read at PTRA

	SETQ	#10-1		'ready to transfer 10 longs
	RDLONG	base,++PTRB	'read at PTRB+10<<2, PTRB += 10<<2

	SETQ	#100-1		'ready to transfer 100 longs
	RDLONG	base,--PTRA	'read at PTRA-100<<2, PTRA -= 100<<2

	SETQ	#8-1		'ready to transfer 8 longs
	RDLONG	base,PTRA++	'read at PTRA, PTRA += 8<<2

	SETQ	#5-1		'ready to transfer 5 longs
	RDLONG	base,PTRB--	'read at PTRB, PTRB -= 5<<2
Only the MSB of the encoded index is used to increment or decrement PTRx by the block size. This way, you can keep loading or storing memory sequentially.

One more thing off the list.

Am I understanding this correctly for --PTRx?

	SETQ	#100-1		'ready to transfer 100 longs
	RDLONG	base,--PTRA	'read at PTRA-100<<2, PTRA -= 100<<2

First location read is
RDLONG base, PTR-100<<2
next is
RDLONG base+1, PTR-99<<2 (postedit correction)
etc, and when done
PTR=PTR-100<<2
