OK, I can see the "CLK output mode" being useful when using an external HDMI encoder too...
For HDMI you get lucky, as that is a more complex interface: it includes a clock PLL, and the clock is sent at a lower rate (data rate / 10).
It's the simpler synchronous interfaces like SPI that need CLK at the data rate.
I'd add a look at 'pin pulldown' to the list. It's been observed that floating inputs tend toward '1' rather than '0', perhaps because the interleaved positive VIO and VDD pins are closest.
I know an analog pad block respin isn't on the cards for the next iteration, and that's absolutely fine, but we do have other options, such as whether to engage the 150 kohm pulldown resistors, or a GND "guard ring" on the PCB that might tend things back toward '0'.
This is all really low priority; if nothing at all is done, all we have to do is manage user expectations as to why their inputs show '1' when nothing is connected. But while we're making a list, it may as well go on it.
Any SW fix is only a partial answer to the floating-pin issue, as during RESET=L the P2 pins are floating, and until the reset-exit delays are done, plus the serial ROM load time, things are undefined.
Then, do you pull down, or pull up?
SPI chip selects are active LOW, and so are UARTs and I2C, so it is common for MCUs to reset with light pullups: light enough that any pin that needs to be LOW during reset can be pulled down with a resistor.
Guys, the floating pull-up current is due to incidental leakage, and I'm sure it measures in mere nanoamps. There's always going to be leakage, and either VIO leakage or GND leakage will dominate.
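A quick back-of-envelope check supports this. The sketch below (leakage figures are illustrative assumptions, not P2 measurements) shows how little voltage nanoamp-scale leakage develops across the 150 kohm pulldown:

```python
# Voltage a given leakage current develops across the 150 kohm pulldown
# (V = I * R). Leakage values here are illustrative assumptions.
R_PULLDOWN = 150e3  # ohms

for leakage in (1e-9, 10e-9, 100e-9):  # 1 nA, 10 nA, 100 nA
    v_pin = leakage * R_PULLDOWN
    print(f"{leakage * 1e9:5.1f} nA -> {v_pin * 1e3:7.3f} mV at the pin")
```

Even 100 nA only lifts the pin to about 15 mV, orders of magnitude below any CMOS input threshold, so an engaged pulldown would define the level decisively.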
I just finished getting the PTRx behavior straightened out for SETQ(2)+RD/WR/WMLONG.
It works like this:
SETQ #16-1 'ready to transfer 16 longs
RDLONG base,PTRA 'read at PTRA
SETQ #10-1 'ready to transfer 10 longs
RDLONG base,++PTRB 'read at PTRB+10<<2, PTRB += 10<<2
SETQ #100-1 'ready to transfer 100 longs
RDLONG base,--PTRA 'read at PTRA-100<<2, PTRA -= 100<<2
SETQ #8-1 'ready to transfer 8 longs
RDLONG base,PTRA++ 'read at PTRA, PTRA += 8<<2
SETQ #5-1 'ready to transfer 5 longs
RDLONG base,PTRB-- 'read at PTRB, PTRB -= 5<<2
Only the MSB of the encoded index is used to increment or decrement PTRx by the block size. This way, you can keep loading or storing memory sequentially.
One more thing off the list.
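For readers following along, the five cases above can be condensed into a small model (Python standing in for the hardware behavior; `block_xfer` is a made-up helper name, not a real instruction):

```python
def block_xfer(mode, ptr, n):
    """Return (first_hub_address, new_ptr) for SETQ #n-1 + RDLONG/WRLONG,
    per the five PTRx modes listed above. Block size is (Q+1)*4 = n*4."""
    blk = n << 2                 # n longs of 4 bytes each
    if mode == "PTRx":           # no pointer update
        return ptr, ptr
    if mode == "++PTRx":         # pre-increment: access at PTRx+blk
        return ptr + blk, ptr + blk
    if mode == "--PTRx":         # pre-decrement: access at PTRx-blk
        return ptr - blk, ptr - blk
    if mode == "PTRx++":         # post-increment: access at PTRx
        return ptr, ptr + blk
    if mode == "PTRx--":         # post-decrement: access at PTRx
        return ptr, ptr - blk
    raise ValueError(mode)

# Mirror two of the examples above:
assert block_xfer("--PTRx", 0x1000, 100) == (0x1000 - 400, 0x1000 - 400)
assert block_xfer("PTRx++", 0x1000, 8)   == (0x1000, 0x1000 + 32)
```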
Am I understanding this correctly for --PTRx?
SETQ #100-1 'ready to transfer 100 longs
RDLONG base,--PTRA 'read at PTRA-100<<2, PTRA -= 100<<2
First location read is
RDLONG base, PTR-100<<2
next is
RDLONG base+2, PTR-99<<2
etc, and when done
PTR=PTR-100<<2
That's right. It will track based on block size, which is (Q+1)×4.
Wait, your example is wrong, in that the first register written is base, followed by base+1, etc.
Multiple 'SETQ+RDLONG base,PTRA++' operations will read blocks in order.
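The corrected sequence can be sketched like this, with a Python dict standing in for hub RAM (addresses and contents are made up for illustration):

```python
# 100 longs stored just below PTRA; hub[addr] holds the long at that address.
ptra = 0x1000
n = 100
hub = {ptra - (n << 2) + 4 * i: i for i in range(n)}

start = ptra - (n << 2)                          # first address read: PTRA-100<<2
regs = [hub[start + 4 * i] for i in range(n)]    # written to base, base+1, ...
ptra = start                                     # PTRA -= 100<<2

assert regs[0] == 0 and regs[-1] == n - 1        # ascending register order
assert ptra == 0x1000 - 400
```

Note that registers fill in ascending order (base, base+1, ...) from ascending hub addresses; the pointer moves down, but the data within the block is not reversed.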
ON comes to Parallax in ten days with their ECO and associated agreement ($$$$). Their schedule will depend on your schedule.
How much time are you allocating to customer testing once the P2 Eval Board is shipped?
How much time do you need to complete the list of changes you posted? When you make changes, will they need to be tested in FPGA by users?
Only when these two variables are known will we be able to consider signing their agreement and pressing them for a schedule. We need to be ready to talk specifics and provide a schedule when they arrive at our office.
Ken Gracey
Thanks Chip. Corrected the error in my post (and here too).
It's as I thought you explained. A bit confusing to effectively reverse the data, but there is probably good reason, and we can use that reversal to advantage elsewhere too I'm sure.
Think of 'SETQ(2)+RDLONG base,--PTRA' as a POP and it makes sense.
'SETQ(2)+WRLONG base,PTRA++' is like a PUSH.
You've also got PTRA, ++PTRA, and PTRA-- to use. They all conform to the block size of (Q+1)*4.
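The PUSH/POP analogy can be sketched as follows (Python again standing in for the PASM sequences; `push` and `pop` are hypothetical helper names):

```python
hub = {}        # hub RAM, addressed by byte address of each long
ptra = 0x2000   # stack pointer

def push(longs):
    """Like SETQ #len-1 + WRLONG base,PTRA++ : write block at PTRA, then advance."""
    global ptra
    for i, v in enumerate(longs):
        hub[ptra + 4 * i] = v
    ptra += len(longs) << 2

def pop(n):
    """Like SETQ #n-1 + RDLONG base,--PTRA : retreat PTRA, then read block."""
    global ptra
    ptra -= n << 2
    return [hub[ptra + 4 * i] for i in range(n)]

push([1, 2, 3])
push([4, 5])
assert pop(2) == [4, 5]        # last block pushed comes off first
assert pop(3) == [1, 2, 3]
```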
I talked to Wendy at ON Semi today, who is doing the synthesis and place-and-route.
She said that even though our cell instance count has gone from 630k to 780k in the new silicon, the max-power test I gave her is showing a reduction from 1.2W to 1.0W. This is due to clock-gating she enabled in the synthesis tool. It makes the clock tree more complicated, but allows a lot of flops to lose their enable circuits which mux the Q output back into the D input. And 180MHz is still no problem.
I'm waiting for her to take a simulated power measurement during downloading, when only one cog is enabled. I think we'll see the current 77mA drop to under 10mA.
Clock gating means the chip will take power in proportion to the functionality in use, while the current silicon dissipates almost all of its power in the clock tree itself. The new clock tree will have many levels and take a lot less power.
That would be a great improvement. I hope that 250MHz is still an achievable overclock too.
So in that max-power test, are all clocks gated ON?
If so, it's nice that everything-running power has fallen slightly (to 83.3% of before), even though the cell count has increased by 23.8%.
Has that come at any cost to the indicated MHz, relative to the P2-ES?
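Those percentages check out against the figures quoted above:

```python
# Sanity-check the ratios against the figures in Chip's post.
power_ratio = 1.0 / 1.2                # new max power vs old: ~83.3%
cell_growth = 780_000 / 630_000 - 1    # instance-count growth: ~23.8%
print(f"power: {power_ratio:.1%}, cell growth: {cell_growth:.1%}")
```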
I thought it would be helpful to list all changes made to the P2 source Verilog, so that everyone could anticipate what is coming next. I will maintain this list.
...
(18) Be able to output system CLK via smart pins, must explore with ON Semi.
Did (18) make the cut?
No.
We are skew-banding the DIR, OUT, and IN signals to within 1ns across the chip, though.
It will be nice to see that reduction in power when fewer functions/cogs are in use. I've never used the full functionality of P1, so I am not expecting to on P2.
If anything causes a respin of the Verilog, WAITPAT could do with a look.
P2 waits for the pattern to go from not met to met before setting the event, whereas IIRC the P1 would return immediately if the pattern was already met. Therefore I am having to hand-code the check instead of using WAITPAT.
We had a conference call today with ON Semi to discuss the tapeout status.
With all the new logic in the next P2, ON has been having a very hard time closing timing.
Wendy did some compiles last night at 180, 170, and 160 MHz Fmax targets, in order to see how different speed goals affect the instance count.
Get a load of this... By dropping the goal from 180MHz to 160MHz, the instance count went from ~780k to ~680k, which is less than the current silicon contains. And we were only reaching 172MHz, anyway, with those ~780k instances. Those extra ~100k instances were certainly buffers to speed up signalling between flops. Those buffers were not only taking power, but increasing the routing congestion, which was the main impediment to meeting timing. A big metal-density/cell-scarcity hurricane-eye pattern was forming in the middle of the die from 100% routing utilization, pushing cells out of the middle of the logic area.
I told them today to just go with 160MHz, since 180MHz was not going to happen and even 172MHz was taking ~100k instances of extra cells.
So, 160MHz will close timing fine and it will take considerably less power at any frequency. If the chip runs cooler, it can go faster, so I think we'll probably be not much slower than the current silicon is. In addition to these ~100k cells going away, clock-gating is going to make a huge contribution to power reduction.
Over the next few days, simulations will reveal what the power levels will be.
Cluso,
We already know, and have the hard evidence. If the die temperature is held low then it can go faster. It's called over-clocking.
I think the question relates more to a commercial use specification rather than what one sample, from one run, might be able to achieve on a test bench.
Volume users need guaranteed specifications to work to.
That was always a risk of the newly added logic...
Comments
Yes. Just added that.
And data not reversed.
Sounds really good! Do we have a rough ETA to first samples if all goes smoothly from here forward?
If anything, the next P2 should run even a little faster, because it will self-heat less.
Are they still trying to get 250MHz as a formal spec point, even if that is at a reduced Tmax, and maybe 5% Vdd specs?
Same target as before: 180MHz from -40C to +85C at 1.8V +- 5%.
I thought you were going to shoot for 200MHz? Did I get this wrong or was that not on the table?
We tried, but the biggest inhibitor was the hub memory.
When done, is it possible to ask what a lesser spec would give as Fmax?
Say 0C to +70C at 1.8V +- 2.5%.
Thanks for sharing this, Chip.
What numbers does 165MHz give?