
Prop2 FPGA files!!! - Updated 2 June 2018 - Final Version 32i


Comments

  • evanh Posts: 15,091
    I've been distracted with finding easy ways to move and resize multi-volume Linux installs between drives, assigning new UUIDs, and convincing Grub to boot them cleanly.

    Turns out the non-bootable volumes are a simple case of editing the UUIDs in the fstab file to match the full bootup needs. At least that part is easy.
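    For reference, it's just the UUID= field at the start of each /etc/fstab entry that has to match the volume's new UUID (blkid lists the current ones); a made-up example line:

        UUID=0a1b2c3d-1111-2222-3333-444455556666  /  ext4  errors=remount-ro  0  1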

    cgracey wrote: »
    evanh wrote: »
    Where in the menus is the report?

    I used some selection to show all I/O timing. It was near the top of the list.

    The 10.0 is nanoseconds. That is the timing goal I gave it.

    Okay, I've had a look around and it seems TimeQuest is installed even though it was unselected at install time. It looks like this is where you were looking.

    I'm not getting much reported due to lack of any clock. But I did find something good, see attached ... The way I read that report is:
    0.694 ns propagation for p[8]'s input buffer
    3.163 ns propagation for internal FPGA routing back to p[9]'s output buffer (type IC, presumably meaning interconnect)
    1.850 ns is the output slew time
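    Those three sum to 0.694 + 3.163 + 1.850 = 5.707 ns, pin to pin.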

  • evanh Posts: 15,091
    edited 2018-02-06 10:13
    My understanding is that report details Quartus's conservative estimate (5.7 ns) of what I measured on my scope to be about 4.2 ns.
  • evanh Posts: 15,091
    Ah, heh, I'm learning ... Worked out how to generate more than one path in the report. This time I've reported on everything in my design and selected the fastest one for comparison, the interconnect here is down to 1 ns!

    And the sum of all 12 together is 45.1 ns at 0°C. Actually faster than I measured. So not as conservative as I first imagined.

  • evanh Posts: 15,091
    edited 2018-02-06 21:55
    I can't see any sign of the inverter in the paths report. Presumably it's a freebie in all I/O cells.

    PS: The useful part from the above reports is that the I/O buffers are even quicker than measured, taking only about 2.5 ns each. The remaining average of 1.7 ns is interconnect time.

  • evanh Posts: 15,091
    edited 2018-02-10 08:56
    A small update with clocking in the picture and a bit of fun learning DDR tricks. This one clocks nice and stable at exactly 10 MHz. It has two fewer pin stages in the loop than before. The unclocked version, with 12 pins, was wobbly, crossing back and forth across 10 MHz.

    Note: When using all 11 pins, this design also steps back and forth between 50 ns and 60 ns transitions.

    The second snapshot has the path timings. I'm not sure how useful it is, though. I've selected the slowest one of relevance: it has a long interconnect time routing from pin 9's output to the clocked flip-flops. Of note, I guess, is that it has no other logic; it's just an extra 3 ns for the signal to thread its way to the internal FPGA macrocells.
  • evanh Posts: 15,091
    edited 2018-02-10 23:57
    Chip,
    The global buffers on clock signals are critical. Make sure you are using them for your clocks. Without one, the timings are screwed. It seems okay at first, but at certain beat patterns the two flip-flops misalign their sampling. That loop above can easily get 10 ns holes in it when the global buffer is removed.

    It also runs slower: the delays around each side of the flip-flops blow out to 5-6 ns, according to TimeQuest.

  • evanh Posts: 15,091
    A belated correction, I guess. That global clock isn't so critical if a real flip-flop is used in place of that latch. The latch is apparently merged into the combinational logic ... so either the synchronising of the two uses of the positive clock edge is superior when locked into the global clock network, or maybe the fitter then understands what delays have to be added to align the latching with the positive-clocked flip-flop ...

  • cgracey Posts: 14,131
    edited 2018-03-08 13:18
    I've decided that I need to get some kind of debugger working to prove that our debugging scheme doesn't have any holes in it.

    While making some textual mods for On Semi, I changed the way debugging works:

    GETINT D/# - generates async breakpoint in cog D/# (if enabled in cog D/# via SETBRK)
    GETINT D WC - writes {CORDIC_inventory[4:0], Last_XBYTE_SETQ[9:0], LUT_share, Event[15:0]} into D, clears C
    GETINT D WZ - writes {8'b0, CALL_depth_during_SKIP[3:0], INT_select[3:1][3:0], INT_state[3:1][1:0], STALLI, SKIP_mode} into D, clears Z
    GETINT D WCZ - writes SKIP_pattern[31:0] into D, clears C and Z

    SETBRK no longer returns any value during debug interrupt, as GETINT can report all data at any time. SETBRK is only used to set the next break condition during debug interrupt. GETINT now generates async breakpoints at any time.
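    As a minimal sketch of how a debug ISR might sample all three forms back to back (dbg_misc, dbg_state and dbg_skip are just spare cog registers I've named for illustration):

        getint  dbg_misc  wc    ' CORDIC inventory, last XBYTE SETQ, LUT share, events
        getint  dbg_state wz    ' SKIP call depth, INT select/state, STALLI, SKIP mode
        getint  dbg_skip  wcz   ' the full 32-bit SKIP pattern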

    I will do a new compile and call it v32. Also, I'll update the documentation.

    These changes should open up debugging to where it needs to be. Now, I need to get some code written to quickly prove everything.

    Here is what the debug ISR should be able to do, in order:

    1) dump all COG and LUT registers
    2) report PC, flags, and status data from GETINT
    3) receive next command to single-step, run to address, or run until an async breakpoint gets asserted from another cog
    4) restore all registers and resume execution

    If it can do that, we'll have a nice basis for debugging.

    I've been thinking that a really nice, open debugger that shows lots of context will go a long way in helping people to understand and get comfortable with the architecture. I thought it was important to be able to get all 32 SKIP bits out for that reason. I think each cog could be handled in a separate, but identical, window.

    I thought about making a breakpoint for bytecode execution (XBYTE), but it's really not needed. A better way to do that is to swap out bytecodes where you'd like to have breakpoints and then break on the 'debug' code-snippet address, patching the bytecode afterwards. That is cleaner than interrupting every XBYTE and gives you some nice high-level control. Plus, you can have unlimited breakpoints. And the PA register always lets you know where the bytecode came from, as GETPTR gets written to PA on each XBYTE.
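    A rough sketch of that swap (bc_addr, saved_bc and DBG_BC are all hypothetical names; DBG_BC being a bytecode whose snippet is the break target):

        rdbyte  saved_bc,##bc_addr      ' remember the user's bytecode
        wrbyte  #DBG_BC,##bc_addr       ' swap in the debug bytecode
        ' ...run; break on the debug snippet's address; PA holds bc_addr...
        wrbyte  saved_bc,##bc_addr      ' patch the original back afterwards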

    One other thing I'm thinking about is automatically bypassing the last 16KB of RAM protection within debug ISRs. That keeps memory protected from user activity, but allows debug code secure access. That should keep the debugger alive, even if the user code goes crazy and writes all over hub RAM.
  • All that makes a ton of sense. And yes, a program like that will definitely get used.

  • potatohead Posts: 10,253
    edited 2018-03-08 14:22
    As for write protect on ISR, is that true for all COGS, or just any COG in ISR?
  • cgracey Posts: 14,131
    potatohead wrote: »
    As for write protect on ISR, is that true for all COGS, or just any COG in ISR?

    Any cog executing a debug ISR (which takes some initial setup) could access the last 16KB of hub RAM, regardless of its protect status.

    I just added breakpoints for hub read and write via the fast FIFO, with address match. This will let us trap RFxxxx/WFxxxx accesses as well as streamer reads and writes.
  • Nice.

    That implies the write inhibit is global, then? Any COG is write-inhibited while any COG is in a debug ISR?

  • cgracey Posts: 14,131
    There is a global switch that protects the last 16KB of hub RAM. I want to make it so that debug ISR code, which is not user code, can bypass the write-protect without any special effort. This will enable the top 16KB to be used for debugging and housekeeping with some safety.
  • TonyB_ Posts: 2,099
    edited 2018-03-12 17:24
    cgracey wrote: »
    I've decided that I need to get some kind of debugger working to prove that our debugging scheme doesn't have any holes in it.

    While making some textual mods for On Semi, I changed the way debugging works:

    GETINT D/# - generates async breakpoint in cog D/# (if enabled in cog D/# via SETBRK)
    GETINT D WC - writes {CORDIC_inventory[4:0], Last_XBYTE_SETQ[9:0], LUT_share, Event[15:0]} into D, clears C
    GETINT D WZ - writes {8'b0, CALL_depth_during_SKIP[3:0], INT_select[3:1][3:0], INT_state[3:1][1:0], STALLI, SKIP_mode} into D, clears Z
    GETINT D WCZ - writes SKIP_pattern[31:0] into D, clears C and Z

    SETBRK no longer returns any value during debug interrupt, as GETINT can report all data at any time. SETBRK is only used to set the next break condition during debug interrupt. GETINT now generates async breakpoints at any time. <snip>

    I've been thinking that a really nice, open debugger that shows lots of context will go a long way in helping people to understand and get comfortable with the architecture. I thought it was important to be able to get all 32 SKIP bits out for that reason. I think each cog could be handled in a separate, but identical, window.

    Being able to read all 32 skip bits should mean that nested skipping is possible, something I suggested a long time ago. If so, that's excellent.

    Chip, I think this would be a very good time to consider my XBYTE idea, as there are implications for debugging. I know you were about to do this before your last trip and you can find full details in about half a dozen posts starting at
    https://forums.parallax.com/discussion/comment/1431401/#Comment_1431401

    I assume the Last_XBYTE_SETQ[9:0] means either XBYTE_SETQ[9:0] or XBYTE_SETQ2[9:0]. For my idea, Last_XBYTE_SETQ[9:0] would become XBYTE_SETQ[9:0] xor XBYTE_SETQ2[9:0].

    One of my innovations is that the low three bits of the return address on the stack would be ignored for the next XBYTE only. In order for debugging to deal with this, three of the extra XBYTE status bits that used to be readable in GETINT, now called GETBRK, would be needed again.

    If the break-code is read by BRK D, similarly to how SETBRK D used to work, then CORDIC_inventory[4:0] could be moved to the high byte of GETBRK D WZ. Assuming the new (actually modified old) XBYTE status bits are called XBYTE_STK[2:0], then GETBRK could become

    GETBRK D WC - writes {2'b0,XBYTE_STK[2:0], Last_XBYTE_SETQ[9:0], LUT_share, Event[15:0]} into D, clears C
    GETBRK D WZ - writes {3'b0, CORDIC_inventory[4:0], CALL_depth_during_SKIP[3:0], INT_select[3:1][3:0], INT_state[3:1][1:0], STALLI, SKIP_mode} into D, clears Z

    Note there is room for five more hidden state bits to be readable.

    EDIT:
    GETINT changed to GETBRK.
  • Chip, I agree. Without that, the special area is gonna be clobbered. Was just wanting to understand.

    Have fun at OnSemi. Thanks for sharing stuff with us.



  • jmg Posts: 15,140
    cgracey wrote: »
    ..
    These changes should open up debugging to where it needs to be. Now, I need to get some code written to quickly prove everything.

    Here is what the debug ISR should be able to do, in order:

    1) dump all COG and LUT registers
    2) report PC, flags, and status data from GETINT
    3) receive next command to single-step, run to address, or run until an async breakpoint gets asserted from another cog
    4) restore all registers and resume execution

    If it can do that, we'll have a nice basis for debugging.
    ...
    Sounding good.
    One quality measure of a debugger is its footprint, i.e. how much resource the debugger needs to operate.
    When you have the code written, can you list the debug overhead, as in Pins/Stack/Code (COG/HUB split)/Registers ...

  • cgracey Posts: 14,131
    edited 2018-03-08 20:07
    TonyB_, I'm going to need some rest before I spool up your idea. I remember reading those posts before, but I didn't get it. I'm way too tired at the moment.

    I have some questions for everyone, though:

    a) Aside from single-step, break-on-address, and external async breakpoints, what else would be good to have?
    b) We could break on instruction by using the 66 SETPAT flops to mask and compare the 32-bit instruction being executed. We could use the extra two bits to detect register write. Is this worthwhile?
    c) I thought I had break-on-hub-read/write, but it's way too complicated with the FIFO and other mechanisms in play. Big loss?
    d) Would break-on-write-to-selectable-register be very useful? This would just be a subset of (b).
  • cgracey Posts: 14,131
    edited 2018-03-08 20:03
    jmg wrote: »
    cgracey wrote: »
    ..
    These changes should open up debugging to where it needs to be. Now, I need to get some code written to quickly prove everything.

    Here is what the debug ISR should be able to do, in order:

    1) dump all COG and LUT registers
    2) report PC, flags, and status data from GETINT
    3) receive next command to single-step, run to address, or run until an async breakpoint gets asserted from another cog
    4) restore all registers and resume execution

    If it can do that, we'll have a nice basis for debugging.
    ...
    Sounding good.
    One quality measure of a debugger is its footprint, i.e. how much resource the debugger needs to operate.
    When you have the code written, can you list the debug overhead, as in Pins/Stack/Code (COG/HUB split)/Registers ...

    It should only use a stretch of hub RAM. The debug ISR sits in hub RAM. First, it saves off maybe 64 registers ($000..$03F, perhaps), loads some code into that place for fast execution and workspace, takes care of business, jumps back out, restores the RAM and exits the debug interrupt with a RETI0. Shouldn't leave a trace.
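    A minimal sketch of that shape (DBG_SAVE and DBG_CODE are made-up hub addresses; the restore has to run from hub exec so it doesn't overwrite itself):

    dbg_isr setq    #$3F                    ' block-save cog $000..$03F to hub
            wrlong  0,##DBG_SAVE
            setq    #$3F                    ' overlay debugger code/workspace
            rdlong  0,##DBG_CODE
            jmp     #0                      ' run the overlay; it jumps back to 'restore'
    restore setq    #$3F
            rdlong  0,##DBG_SAVE            ' put the user's registers back
            reti0                           ' exit the debug interrupt, no trace left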

    I want to prove all the conduit, because it's going to set the parameters for what is possible. I want to make a nice interface that helps people get comfortable right away with what is going on and visibly shows them the constraints they are working under. The better they comprehend the machine, the quicker they can attain mastery and work confidently, without any shadows plaguing their understanding.
  • cgracey Posts: 14,131
    edited 2018-03-08 20:21
    I see a window for each cog being debugged and a separate, singular window for the smart pins, where the RQPIN (read 'quiet', without ACK) values are constantly updated and any WRPIN, WXPIN, WYPIN can be forced. All the INA/INB signals should be shown, too. And, it will be necessary to control DIRA/DIRB and OUTA/OUTB. Maybe a single cog will be needed for top-level debug coordination back to the host system.

    Anyway, the innards should be displayed in a well-lit room, where everything is plainly visible and tangible.
  • cgracey Posts: 14,131
    On second thought, I don't think a central cog will be needed to coordinate. We could just use LOCK[15], say, as the baton for who gets to talk over the serial comm. That would work, and ALL the cogs could be under live debugging. The only need for a central coordinator cog would be to issue an async breakpoint to get some other cog's attention. As long as some cog is periodically breakpoint-ing, it could be used to get any other cog's attention.
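    As code, that baton could be as simple as this (send_report stands in for whatever routine drives the serial comm):

    .wait   locktry #15 wc                  ' C=1 means we own the baton
      if_nc jmp     #.wait                  ' someone else is talking; spin
            call    #send_report            ' use the shared serial link
            lockrel #15                     ' hand the baton back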
  • I rarely use a debugger, thus my opinion may not be of much help.

    But Hanno's P1 debugger/visualizer ViewPort was a nice help at the beginning of programming the P1.

    What is missing in your dump list is the direction and current state of the pins, and maybe the current state of the smart pin subsystems.

    Would it be possible to break on part of an instruction? Say, break on all WRLONGs or all RDFASTs? That would allow finding hub mistakes in the absence of break-on-read/write of hub RAM.

    Break on selectable-register-write would be very helpful to nail down unintended self-modifying code or to trace some data register. Usually one sets breakpoints on code addresses; this would allow setting a breakpoint on a data address.

    Since the debug interrupt overrules all other ones, maybe break on interrupt X?

    Is there an instruction for code in the COG itself to create conditional breakpoints when needed, say if_c BRK?

    Enjoy!

    Mike
  • Chip,
    Something I use a lot when debugging is essentially "set IP to X" when on a debug break, allowing you to either skip over some instructions or skip back and redo some, in the code you are debugging.

    Also, I think we can already do "edit and continue", meaning editing instructions in place (while on a debug break) and then continuing to run.

    ---

    When debugging a multi-threaded Windows app, when you hit a breakpoint, all threads are stopped wherever they are. If you step, they all run for that bit of time. This isn't really possible, as far as I can tell, with how debugging is set up on the P2, but if it were possible to stop multiple cogs when you hit a breakpoint in one of them, it would be useful, especially for stepping and seeing the interactions between them.
  • jmg Posts: 15,140
    cgracey wrote: »
    I have some questions for everyone, though:

    a) Aside from single-step, break-on-address, and external async breakpoints, what else would be good to have?
    How many breakpoints are supported?
    cgracey wrote: »
    b) We could break on instruction by using the 66 SETPAT flops to mask and compare the 32-bit instruction being executed. We could use the extra two bits to detect register write. Is this worthwhile?
    c) I thought I had break-on-hub-read/write, but it's way too complicated with the FIFO and other mechanisms in play. Big loss?
    d) Would break-on-write-to-selectable-register be very useful? This would just be a subset of (b).

    Break on Instruction is rare, as you know what opcodes to expect.

    Break on Address is more common/useful, as that's errant pointer stuff.
    Break on Value is another.

    An intermediate (SW) way to manage this is to quickly read a few locations and fast-step / slow-run the code. A compromise.
    That may need a local MCU (another P2?) to help with debug speed, so you are not trying to shuffle data over USB all the time.

    If it is a one-location test, maybe the debug stub could manage this?

    Another feature is break-after-count, i.e. set a break, but only actually (fully) break after N hits; see the sketch below.
    That can probably also be managed in the debug stub, with a small impact on speed, as every tested break costs a time slice, but much less time than a full break/dump.
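    In the stub, that could be as little as (pass_cnt preloaded with N; all names mine):

    brk_hit djnz    pass_cnt,#.skip         ' not the Nth hit yet: cheap resume
            call    #full_break             ' Nth hit: full dump and interaction
    .skip   reti0                           ' either way, resume the user code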
  • jmg Posts: 15,140
    Roy Eltham wrote: »
    When debugging a multi-threaded Windows app, when you hit a breakpoint, all threads are stopped wherever they are. If you step, they all run for that bit of time.
    This isn't really possible, as far as I can tell, with how debugging is set up on the P2, but if it were possible to stop multiple cogs when you hit a breakpoint in one of them, it would be useful, especially for stepping and seeing the interactions between them.
    Good point; some MCUs have an option to step, or to free-run peripherals, during debug.
    I guess a 'Run N SysCLKs' type command would be useful here, allowing bursts/pauses of activity.
  • cgracey Posts: 14,131
    There is one address-compare breakpoint. I could have the SETBRK instruction (which doesn't do anything except when in a debug ISR) serve as a BREAK instruction for regular code. That way, we could have lots of breakpoints - wherever you put a SETBRK. Patch it with the real instruction afterwards and execute from there.
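    If SETBRK does double as a BREAK in regular code, that would also give Mike's conditional breakpoint for free, e.g.:

      if_c  setbrk  #0                      ' hypothetical: break only when C is set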

    Single-stepping everyone in unison sounds like a mess. Likely to give errant results just the same, as slowed-down timing still won't match real time exactly.
  • jmg Posts: 15,140
    cgracey wrote: »
    There is one address-compare breakpoint. I could have the SETBRK instruction (which doesn't do anything except when in a debug ISR) serve as a BREAK instruction for regular code. That way, we could have lots of breakpoints - wherever you put a SETBRK. Patch it with the real instruction afterwards and execute from there.
    Yes, do that, if not there already.
    That software break (replace on step) is common, and it allows many breakpoints. No problems at all in a RAM-based system.
    (the flash based MCUs have wear caveats on their SW breaks, in the fine print....)
  • cgracey Posts: 14,131
    jmg wrote: »
    cgracey wrote: »
    There is one address-compare breakpoint. I could have the SETBRK instruction (which doesn't do anything except when in a debug ISR) serve as a BREAK instruction for regular code. That way, we could have lots of breakpoints - wherever you put a SETBRK. Patch it with the real instruction afterwards and execute from there.
    Yes, do that, if not there already.
    That software break (replace on step) is common, and it allows many breakpoints. No problems at all in a RAM-based system.
    (the flash based MCUs have wear caveats on their SW breaks, in the fine print....)

    In thinking about this, we DO want to keep the address-compare breakpoint, because that enables a certain cog to stop on what might be a lot of public-access hub-exec code.
  • cgracey Posts: 14,131
    I'm thinking that maybe we need some external-pin debug breakpoint, just to wake up some cog without requiring another cog to do the job, in case they are all busy. What do you think?
  • evanh Posts: 15,091
    I call that the RESET pin. :D

  • evanh Posts: 15,091
    cgracey wrote: »
    There is a global switch that protects the last 16KB of hub RAM. I want to make it so that debug ISR code, which is not user code, can bypass the write-protect without any special effort. This will enable the top 16KB to be used for debugging and housekeeping with some safety.

    It's a slippery slope to protected mode!