OK, thanks. I guess my brain missed the "right after the hub memories" part. Will a DE2 FPGA image be available? I would like to update my development tools, and run my tests on the latest version.
I am at my parents' house today, but when I get home, I will compile a DE2-115 version. I'll cover the other boards, also.
The debugging is much improved. I changed the criteria for when a breakpoint can occur for each int1/int2/int3/main filter, so that each is exactly what you'd expect it to be. For example, you don't see the pipeline CALLD injections for interrupts, nor the cancelled instructions that trail branches. You see only what constitutes int1, int2, int3, or main code.
You now get one debug interrupt for each instruction executed, according to the filter settings.
You can also break on the first instruction of each interrupt, by int1/2/3. Because these trigger on the pipeline CALLD injections, the breakpoint is before the first ISR instruction executes, so the first subsequent single-step would execute the first instruction of the ISR. Everything is how you would ideally imagine it being. The exceptions are address breakpoints and BRK instruction breakpoints. They stop after the address match or BRK.
Chip,
I'm still worried there is an issue with constraints on I/O pin timing.
My previous tests, using the Smartpins to count clocks between an OUT and the returning input transition, gave differing measurements for different pins at 80 MHz. Some pins read 1 count while others read 2 counts; it is pin specific. This hasn't changed.
Hopefully you can remember me separately checking the speed of the FPGA pin buffers with a tiny ring circuit schematic. They came out at less than 2.5 ns propagation out and back.
If you change clock speed, how does that affect the results?
In that thread, you mentioned 4 ns/pin, and Chip mentions 10 ns routing delays, so it follows that you could/should see a difference between 80 MHz and 20 MHz?
I later refined that down to what was routing and what was pin buffers, 2.5 ns being the conclusion.
Chip's 10 ns appears to be a requested rule that extended those timings out. I never really understood it.
Maybe the question here is: what are the routing delays in the final P2, and at what MHz will they add another clock?
Can OnSemi indicate those delays, from the script testing they have already done?
I'm mostly worried that OnSemi are basing their constraints on the constraints Chip set for the FPGA auto-routing. Of course, I don't know if that's how things work at all.
We can figure out why the FPGA is taking a long time, but the actual chip I/O timing is well-constrained. On Semi has used some special cyclical SPICE simulator that sweeps clock against data to resolve setup and hold times in nine process corners.
Are OnSemi able to answer the question: what are the routing delays in the final P2, and at what MHz will they add another clock?
The routing delays are all individual, between instances.
I don't understand the second question you asked.
The FPGA has relatively long routing delays, and I think you calculated ~71 MHz as the threshold where the (pin + routing) delay approaches the clock period, which bumps another clock delay into the mix.
The P2 has lower delays, but still finite ones, so there will still be some (higher) frequency where the P2 delay approaches the clock period.
Is that P2 number 100 MHz, 150 MHz, 180 MHz, or something else?
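To pin down what "add another clock" means numerically, here is a minimal sketch of the arithmetic. The 14 ns round-trip figure is an assumed illustration chosen to be consistent with the ~71 MHz threshold mentioned above, not a measured value:

```python
import math

def extra_clock_threshold_mhz(round_trip_delay_ns: float) -> float:
    """Frequency above which a fixed I/O round-trip delay no longer
    fits inside one clock period, so the measured count bumps by one."""
    return 1000.0 / round_trip_delay_ns

def smartpin_counts(round_trip_delay_ns: float, clock_mhz: float) -> int:
    """Whole clock periods needed to cover the delay -- roughly what
    the Smartpin OUT-to-input test counts."""
    period_ns = 1000.0 / clock_mhz
    return math.ceil(round_trip_delay_ns / period_ns)

print(extra_clock_threshold_mhz(14.0))   # ~71.4 MHz threshold
print(smartpin_counts(14.0, 80.0))       # 2 counts at 80 MHz
print(smartpin_counts(2.5, 180.0))       # 1 count for a 2.5 ns pad delay
```

So a delay near 14 ns would explain a 2-count reading at 80 MHz, while the 2.5 ns pad figure alone stays at 1 count even at 180 MHz.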
The hub RAMs are the current Fmax limiters. We could go with four 1/4-size RAMs in place of each to get the speed up, but it would come at the expense of 4 square mm of area and jack up the power by some amount. Not worth it, at this point. We are targeting 180 MHz now, which is easily doable. The I/O pin signals will have no problem meeting that timing. Worst-case propagation times (in and out), setup times, and data delays are all about 2.0 ns within the I/O pad.
That's cool, so the P2 is free of any clock-bump effects at all expected MHz speeds.
Also, if the part does over-clock a little, those delays are likely to be shorter than worst case anyway.
Just posted a new v32a at the top of this thread. Only difference from v32 is that XORO32 has faster logic and there are now files for all FPGA boards.
This may be a dumb question, but is there a reason the max freq cannot be a nice round 200 MHz?
This would make the clock an even 5 ns instead of 5.555... ns.
Bean,
The recommended top frequency is still to be finalised. 200 MHz is hoped for by all, but the final spec depends on many factors that aren't completely worked out yet.
I'm guessing there will be a simulated spec generated once the layout is set to go. But even after that I'm guessing there will be test chips from a shuttle run to verify all matches the simulations before a final spec is assigned.
What's limiting us to under 200 MHz are the hub RAMs. Their Q outputs are immediately registered, but must first go through an AND-OR circuit which forms the JTAG scan chain that the tester will use. This knocks their 207 MHz worst-case performance down to just under 200 MHz. To go faster, we'd need to use two or four half-size or quarter-size RAMs in place of each current RAM instance. That is going to eat several square mm of silicon and jack up the power more. We looked into doing it already, but decided against it. So, we are missing 200 MHz by something like 180 ps. Timing will stretch out a little more in the final layout.
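As a sanity check on those numbers, a quick back-of-envelope calculation. The ~193 MHz result below is my own arithmetic from the 180 ps figure, not a number from the thread:

```python
def period_ns(freq_mhz: float) -> float:
    """Clock period in ns for a frequency in MHz."""
    return 1000.0 / freq_mhz

def fmax_mhz(critical_path_ns: float) -> float:
    """Highest clock frequency a given critical-path delay allows."""
    return 1000.0 / critical_path_ns

# 200 MHz asks for a 5.0 ns period; missing it "by something like 180 ps"
# implies a critical path of about 5.18 ns:
achieved = fmax_mhz(period_ns(200.0) + 0.18)
print(round(achieved, 1))   # ~193.1 MHz, i.e. just under 200 MHz
```

That lines up with "just under 200 MHz" and with the 207 MHz raw RAM figure being eaten into by the scan-chain gating.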
Shame the JTAG can't be integral to those first registers.
That's what I thought, too.
We tried using the special integrated scan flops and the timing got worse!
There is a need, in either case, for an AND-OR gate before the actual flop. Note that the AND-OR gate will always have a buffered output which has potentially better driving ability than tired signals coming over long wires. By allowing the place-and-route to build the scan chain from discrete ao221 gates and flops, it has freedom to place the ao221 gate away from the flop, to improve timing, rather than suffer that the ao221 and flop are always glued together, forcing a worse compromise in the wire-delay tug-of-war. I really scratched my head, at first.
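For readers unfamiliar with scan insertion: an ao221 is a 2-2-1 input AND-OR gate, which can serve as the data/scan mux in front of the flop. The exact P2 wiring isn't given in this thread, so the following is only a plausible model of the usual form, with made-up signal names:

```python
def ao221(a: int, b: int, c: int, d: int, e: int) -> int:
    """2-2-1 input AND-OR gate: (a AND b) OR (c AND d) OR e."""
    return (a & b) | (c & d) | e

def flop_d_input(q_ram: int, scan_in: int, scan_en: int) -> int:
    """Mux built from the ao221: passes functional RAM data when
    scan_en=0, scan-chain data when scan_en=1 (third input unused)."""
    return ao221(q_ram, 1 - scan_en, scan_in, scan_en, 0)

assert flop_d_input(1, 0, 0) == 1   # normal mode passes RAM Q
assert flop_d_input(0, 1, 1) == 1   # scan mode passes scan_in
```

Modeled this way, it's easy to see why letting place-and-route separate the gate from the flop can help: the gate's buffered output can be placed to break up a long wire, which a fused scan flop cannot do.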
I got the XBYTE documentation updated in the Google Doc.
Tomorrow, I will cover the new debug scheme.
I did a big test tonight, where I ran some bytecode loop in the Spin interpreter that incremented a bitfield in the OUTA register. I single-stepped and traced its execution. It was really neat to see only the instructions of interest from the bytecode routines executing. You don't see XBYTE execute, because it is a location-less instruction that spontaneously happens when a RET/_RET_ executes with $1FF on the hardware stack. So, all you see are the distilled SKIPF-filtered instructions executing. It's kind of magical. I really look forward to having nice debug windows.
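As a rough software analogue of that fetch-and-dispatch behavior (the bytecode values, table layout, and instruction names below are invented for illustration; the real XBYTE/SKIPF mechanism lives in hardware):

```python
trace = []

def op(name):
    """Return a pseudo-instruction that just records its execution."""
    return lambda: trace.append(name)

# Hypothetical bytecode table: each entry is (instruction list, skip mask).
# A set bit in the mask cancels the corresponding instruction, as SKIPF does.
table = {
    0x01: ([op("setq"), op("muxq"), op("or_outa")], 0b010),
    0x02: ([op("add"), op("ret")], 0b00),
}

def xbyte_run(bytecodes):
    for bc in bytecodes:                      # XBYTE: fetch byte, vector via table
        instrs, skip_mask = table[bc]
        for i, instr in enumerate(instrs):
            if not (skip_mask >> i) & 1:      # SKIPF: cancelled instructions never run
                instr()

xbyte_run([0x01, 0x02])
print(trace)   # only the survivors: ['setq', 'or_outa', 'add', 'ret']
```

The trace contains only the instructions that actually executed, which mirrors what the debugger now shows: the distilled, SKIPF-filtered instruction stream.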
Comments
Any chance of getting some updated docs or notes on the debug changes?
That's right.
Yes, after getting this XORO32 thing fixed, I'll get on the documentation.
What happens from here, Chip? When do they need the ROM code, or does that go with the verilog?
All 7 FPGA images flashed and running OK.
Thanks, Brian!
After I get the documentation caught up, I'll get into the ROM code. We need to have that ready by the 22nd, I believe.