List of Changes in Next P2 Silicon
cgracey
I thought it would be helpful to list all changes made to the P2 source Verilog, so that everyone could anticipate what is coming next. I will maintain this list.
Bugs in initial silicon fixed (all known bugs):
(a) Sign-extension problem that caused IQ modulator, quadrature decoder, and ALTx negative deltas to not work. Fixed sources to follow all Verilog signed-expression rules.
(b) 1/2/4-bit output in the streamer's RFBYTE mode didn't output DAC data if the pins were disabled. Getting redesigned.
(c) DIR transitioning after OUT causing negative pin glitches. Timing constraints being added.
Completed improvements for next silicon:
(1) XORO32 improved with better settings. No source code or tool impact.
(2) POP now returns Z=1 if result=0; it used to return Z=result[30]. Source code impact, no tool impact.
(3) BITL/BITH/BITC/BITNC/BITZ/BITNZ/BITRND/BITNOT can now work on a span of bits (+S[9:5] bits). Prior SETQ overrides S[9:5]. Source code impact, no tool impact.
(4) DIRx/OUTx/FLTx/DRVx can now work on a span of pins (+D[10:6] pins). Prior SETQ overrides D[10:6]. Source code impact, no tool impact.
(5) WRPIN/WXPIN/WYPIN/AKPIN can now work on a span of pins (+S[10:6] pins). Prior SETQ overrides S[10:6]. Source code impact, no tool impact.
(6) BIT_DAC output now has 4-bit settings for low and high states, instead of single 8-bit setting vs. GND. Source code impact, no tool impact.
(7) RDxxxx/WRxxxx+PTRx expressions now index -16..+16 with updating and -32..+31 without updating. No source code impact, but assembler impact.
(8) RDLUT/WRLUT now take PTRx expressions. Source code and assembler impact.
(9) HDMI added to streamer with ascending and descending pinouts. No source code or tool impact.
(10) Sensible PTRx behavior for 'SETQ(2)+RD/WR/WMLONG' operations. Source code impact, no tool impact.
(11) System counter extended to 64 bits. GETCT WC retrieves upper 32-bits of 64-bit system counter. No source code impact, but assembler impact.
(12) SINC2/SINC3 filters added to smart pins for doubling the effective number of bits in ADC conversions.
(13) Each cog has four 8-bit-sample-per-clock ADC scope channels. No source code impact, but assembler impact.
(14) New streamer modes. SINC1/SINC2 supported for Goertzel. Source code impact, no tool impact.
(15) Clock-gating to reduce dynamic power, achieved by tool configuration: 1,830 clock gates added, which eliminated lots of ENA muxes.
(16) LUT sharing is now glitch-free.
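For anyone trying to picture the span operands in (3), (4) and (5), here is a rough Python model of how the fields could be packed. The only thing taken from the list is that the extra count sits in D[10:6] (or S[10:6]) for pins and S[9:5] for bits. That the base pin sits in bits [5:0] and the base bit in bits [4:0] is an assumption, and the helper names are made up for illustration. What happens when a span runs past pin 63 or bit 31 isn't stated, so the sketch just refuses those cases.

```python
def pin_span(base_pin: int, extra_pins: int = 0) -> int:
    """Pack a pin-span operand: extra pin count in bits [10:6].

    Assumption: the base pin number occupies bits [5:0].
    """
    if not 0 <= base_pin <= 63:
        raise ValueError("base_pin must be 0..63")
    if not 0 <= extra_pins <= 31:
        raise ValueError("extra_pins must be 0..31")
    if base_pin + extra_pins > 63:
        raise ValueError("span past pin 63 is not modelled here")
    return (extra_pins << 6) | base_pin


def bit_span(base_bit: int, extra_bits: int = 0) -> int:
    """Pack a bit-span operand: extra bit count in bits [9:5].

    Assumption: the base bit number occupies bits [4:0].
    """
    if not 0 <= base_bit <= 31:
        raise ValueError("base_bit must be 0..31")
    if not 0 <= extra_bits <= 31:
        raise ValueError("extra_bits must be 0..31")
    if base_bit + extra_bits > 31:
        raise ValueError("span past bit 31 is not modelled here")
    return (extra_bits << 5) | base_bit


def bit_mask(operand: int) -> int:
    """Return the 32-bit mask of bits a bit-span operand would touch."""
    base = operand & 0x1F
    extra = (operand >> 5) & 0x1F
    if base + extra > 31:
        raise ValueError("span past bit 31 is not modelled here")
    return ((1 << (extra + 1)) - 1) << base


# Pins 8..11: base pin 8 plus 3 additional pins.
print(hex(pin_span(8, 3)))
# Bits 4..11: base bit 4 plus 7 additional bits.
print(hex(bit_mask(bit_span(4, 7))))
```

With no extra count the operand is just the plain pin or bit number, so old single-pin code keeps the same encoding in this model.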
Planned improvements for next silicon:
(17) Install latest ROM code. - DONE
(18) Reduce ADC integrator caps by 50% to increase ADC bandwidth. - NOT DONE
(19) Be able to output system CLK via smart pins, must explore with ON Semi. - NOT DONE
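On item (11): when a 64-bit counter is read as two 32-bit halves, software often guards against the low half rolling over between the two reads. Below is a generic Python sketch of that guard (read high, read low, read high again, retry on mismatch). It is not a description of what the silicon does. The list does not say whether GETCT WC followed by GETCT needs such a guard, and `FakeCounter`, `read_hi` and `read_lo` are hypothetical stand-ins for the two reads.

```python
MASK64 = (1 << 64) - 1


class FakeCounter:
    """Hypothetical free-running 64-bit counter that advances on every read."""

    def __init__(self, start: int, step: int = 1):
        self.value = start & MASK64
        self.step = step

    def _sample(self) -> int:
        v = self.value
        self.value = (self.value + self.step) & MASK64
        return v

    def read_hi(self) -> int:
        # Stand-in for reading the upper 32 bits.
        return self._sample() >> 32

    def read_lo(self) -> int:
        # Stand-in for reading the lower 32 bits.
        return self._sample() & 0xFFFFFFFF


def read_ct64(read_hi, read_lo) -> int:
    """Combine two 32-bit reads into one coherent 64-bit value."""
    while True:
        hi_before = read_hi()
        lo = read_lo()
        hi_after = read_hi()
        if hi_before == hi_after:
            # The low half did not roll over between the reads.
            return (hi_before << 32) | lo


# Start one count before the low half rolls over, to exercise the retry.
ct = FakeCounter(0x1_FFFF_FFFF)
print(hex(read_ct64(ct.read_hi, ct.read_lo)))
```

Without the second high read, the first pass in this example would pair the old upper half with the new lower half and report a value about 2^32 counts too small.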
Comments
Maybe a list of fixed bugs needs to be included too? (e.g. those affected by the Verilog syntax issue)
Of course.
Good idea. I'll add that.
Why doesn't power drop when hub is not being accessed? Perhaps the HUB RAM is continually being accessed?
Why doesn't power drop when cogs are not running?
Anything else hogging the power usage?
Yes, the clock tree is the main power hog.
There is no clock gating, which results in a power dissipation capacitance (Cpd) in the nanofarad range.
Clock gating was considered, but has been deferred to later revisions.
The power curve looks roughly the same as that of the Prop1. It runs hotter due to more transistors in the active path. There's nothing unusual.
* In P1, wait is granular, and the COG clock pauses while just the minimal wait hardware spins. This gives quite low Cpd values for WAIT.
* In P1, it behaves on a per-COG basis, so an inactive COG truly is inactive.
P2 does neither of those, the clock tree drives all the time, feeding all those registers. Active COGs add only slightly to Cpd, mainly due to register-out nodes also toggling.
EDIT: On the other hand, static leakage on the Prop2 is, surprisingly, only about 10x the current of the Prop1. Maybe clock gating can do more here, particularly at higher clock rates. Max logic power for the Prop1 was 330 mW (I first wrote 100 mW, having mixed up current with power). Max logic power for the Prop2 is spec'd at 1.0 W, but given the 300 MHz clock rates we're hitting it's more like 2.0 W, maybe up to 3.0 W while hammering hubram and cordic.
EDIT2: For reference: Prop1 static leakage is about 3.2 uA (11 uW). Prop2 is about 37 uA (67 uW). I'm surprised how low power the Prop2 could be.
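The figures above can be cross-checked with a few lines of Python. This is only a back-of-envelope sketch: the supply voltages (3.3 V for Prop1 logic, 1.8 V for Prop2 logic) are assumptions inferred from the quoted current/power pairs, the 100 mA Prop1 figure is inferred from the "mixed up current with power" remark, and the Cpd estimate uses the usual P = Cpd * V^2 * f relation while pretending the whole 2.0 W is switching power, so it is an upper bound.

```python
P1_VDD = 3.3  # volts, assumed Prop1 logic supply
P2_VDD = 1.8  # volts, assumed Prop2 logic supply


def power_watts(current_a: float, volts: float) -> float:
    """Static power from supply current and voltage: P = I * V."""
    return current_a * volts


def cpd_farads(watts: float, volts: float, freq_hz: float) -> float:
    """Effective switched capacitance from P = Cpd * V^2 * f."""
    return watts / (volts ** 2 * freq_hz)


# Static leakage, in microwatts.
p1_leak_uw = power_watts(3.2e-6, P1_VDD) * 1e6
p2_leak_uw = power_watts(37e-6, P2_VDD) * 1e6

# Leakage current ratio, Prop2 over Prop1.
leak_ratio = 37 / 3.2

# Prop1 max logic power if the 100 figure was milliamps.
p1_max_w = power_watts(0.100, P1_VDD)

# Effective Prop2 Cpd if all of 2.0 W at 300 MHz were dynamic power.
p2_cpd_nf = cpd_farads(2.0, P2_VDD, 300e6) * 1e9

print(f"Prop1 leakage: {p1_leak_uw:.1f} uW")
print(f"Prop2 leakage: {p2_leak_uw:.1f} uW")
print(f"Leakage current ratio: {leak_ratio:.1f}x")
print(f"Prop1 max logic power: {p1_max_w * 1e3:.0f} mW")
print(f"Prop2 effective Cpd: {p2_cpd_nf:.2f} nF")
```

Under these assumptions the numbers line up with the rounded values quoted in the thread, and the Cpd estimate comes out at a couple of nanofarads.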
It works like this:
Only the MSB of the encoded index is used to increment or decrement PTRx by the block size. This way, you can keep loading or storing memory sequentially.
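As I read that description, the pointer update for a SETQ block transfer could be modelled like the Python sketch below. It rests on assumptions: "block size" is taken to mean the number of longs transferred times 4 bytes, the MSB is treated as a sign bit (1 = decrement), and `ptr_after_block` is a made-up helper name, not anything from the Verilog.

```python
BYTES_PER_LONG = 4


def ptr_after_block(ptr: int, index_msb: int, block_longs: int) -> int:
    """Model the PTRx value after one block transfer.

    Only the MSB of the encoded index is consulted: 0 moves the pointer
    up by the block size, 1 moves it down by the block size.
    """
    if index_msb not in (0, 1):
        raise ValueError("index_msb must be 0 or 1")
    block_bytes = block_longs * BYTES_PER_LONG  # assumed meaning of "block size"
    return ptr - block_bytes if index_msb else ptr + block_bytes


# Three back-to-back 4-long block loads walking upward through memory.
ptr = 0x1000
starts = []
for _ in range(3):
    starts.append(ptr)
    ptr = ptr_after_block(ptr, index_msb=0, block_longs=4)
print([hex(a) for a in starts], hex(ptr))
```

Each block starts where the previous one ended, which is the sequential load/store behaviour described above.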
One more thing off the list.
What is "CLK output mode in smart pins"?
Jmg has been pointing out that we ought to get the internal clock out onto pins so that it can clock things and coordinate with the streamer.
That is (b).
I knew you would remember anything that had changed. I went through the Verilog to remind myself of what had changed. I think it's all listed there.
I remembered there was another list from a little while ago:
http://forums.parallax.com/discussion/comment/1450738/#Comment_1450738
I checked just the numbers in the first post because a-c were off the top of the screen!
It would also allow you to move towards using a bug tracker to track and classify bugs, feature requests, etc.
At work we also have another system called WorkLogs, which tracks features from the kernel state to full implementation, it's like a bug tracker, but is more of a formalized way of fleshing out designs and tracking those features to implementation.
Yes, the issue here is the streamer can pump at SysCLK speeds, which is very impressive - but you cannot connect to any part that requires a clock with that.
Currently, highest CLK is SysCLK/2
One real example: there are SPI LCD displays designed for the RaspPi that are CPLD-based and spec'd to operate up to 128MHz (that's where the Pi stops); the CPLD itself can go faster.
If P2 can output a SysCLK alongside the streamer, it could hit that SPI speed with a 128MHz PLL and save a whole lot of power (as well as be inside the actual spec!). At the 180MHz spec it would also have scope to out-pace the Pi.
P2 updating an SPI display faster than the Pi will get people's attention.
Being able to simply connect to already existing infrastructure, like this fast SPI LCD, will be important for P2 sales.
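To put numbers on the clock argument, here is a small Python sketch. It assumes the SPI clock is simply SysCLK divided by an integer (2 today, 1 if SysCLK itself could be driven onto a pin); the function names are made up for illustration.

```python
def sysclk_needed_mhz(spi_clk_mhz: float, sysclk_per_spi_clk: int) -> float:
    """SysCLK required to produce a given SPI clock at a given divider."""
    return spi_clk_mhz * sysclk_per_spi_clk


def max_spi_clk_mhz(sysclk_mhz: float, sysclk_per_spi_clk: int) -> float:
    """Fastest SPI clock available from a given SysCLK and divider."""
    return sysclk_mhz / sysclk_per_spi_clk


# Reaching the 128 MHz SPI display clock.
today = sysclk_needed_mhz(128, 2)         # highest CLK is SysCLK/2
with_clk_out = sysclk_needed_mhz(128, 1)  # SysCLK output on a pin

# What a 180 MHz SysCLK could offer in each case.
spi_at_180_today = max_spi_clk_mhz(180, 2)
spi_at_180_clk_out = max_spi_clk_mhz(180, 1)

print(f"SysCLK for 128 MHz SPI today: {today:.0f} MHz")
print(f"SysCLK for 128 MHz SPI with CLK out: {with_clk_out:.0f} MHz")
print(f"Max SPI clock at 180 MHz: {spi_at_180_today:.0f} vs {spi_at_180_clk_out:.0f} MHz")
```

In this model, matching the 128 MHz display with the SysCLK/2 limit needs a 256 MHz SysCLK, well above the 180 MHz spec, while a SysCLK output needs only 128 MHz.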
I know an analog pad block respin isn't on the cards for the next iteration, and that's absolutely fine, but we do have other options, such as whether to engage the 150 kohm pulldown resistors, or to look at a GND "guard ring" on the pcb that might tend things back towards '0'.
This is all really low priority. If nothing at all is done, all we have to do is manage user expectations for why their inputs show '1' when nothing is connected, but while we're making a list it may as well go on it.
For 1080p, you could run at 150 MHz and output the pixels...
Otherwise, I guess you'd need 300 MHz for digital 1080p60 video...
Good point, that has already been upgraded....
If there is ROM space, I'd like to see SPI Dual IO read attempt on Flash (with 1 bit SPI fallback). Dual IO has zero added pin cost, but doubles the data speeds.
Candidate commands are 0BBH and 03BH
Am I understanding this correctly for --PTRx? The first location read is
RDLONG base, PTR-100<<2
next is
RDLONG base+1, PTR-99<<2 (postedit correction)
etc, and when done
PTR=PTR-100<<2