Right now, any cog can generate an asynchronous debug breakpoint in any other cog, assuming the target cog has that enabled.
If all cogs are running code, there is no cog free to issue that asynchronous debug breakpoint. This is where a pin-based debug interrupt would come in handy.
Then there is the problem of how to toggle some arbitrary pin from the debugger link. We could make a blanket option where all cogs can be interrupted by a transition on P63. This would certainly work, except it would disturb all cogs, which is not good. An alternative would be to have one cog running as a dedicated debugger interface.
Electing P63 as the conveyor of the BRK stimulus would require a slight modification to its structure, creating a somewhat different version of a smart pin.
Taking it a step further...
Why not use its long repository to hold up to 16 selectors, one for each COG that could be affected by the external BRK input it would become when programmed to behave this way?
Chip
I must be missing something here in regards to async breaks.
When I execute a BRK #/D outside of a debug ISR, an async break is triggered for cog #/D.
This works OK, but your Verilog and posts also show that an 8-bit value can be passed to the ISR and retrieved via a GETINT D wz.
This way, when BRK is used outside of the debug ISR (in user code) it can generate a breakpoint AND pass an 8-bit value to the debug ISR (BRK #/D, D[7:0] is passed). This should allow for some flexibility in how debugging can be implemented.
There seems to be a conflict here.
If the #/D value is the cog number to be triggered, how can a user value be passed?
A GETINT D wz from within a debug ISR always seems to return $00 for the user value?
<:(
BRK {#}D used outside of a debug ISR will set brk_code to D[7:0] and trigger a software breakpoint which must have been given permission via a prior BRK {#}D instruction WITHIN a debug ISR that had bit 5 set. That is needed to recognize BRK {#}D outside of the debug ISR; otherwise the BRK will just be a NOP.
To generate an asynchronous breakpoint in another cog, you would execute GETINT {#}D (with D[3:0] = cog) and that other cog would need to have had that type of breakpoint enabled by having bit 6 set in a BRK {#}D instruction from within its own debug ISR.
This is simple, but seems complex for me to explain. Some untangling of names is maybe needed, I think.
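As an illustration only, the rules described above can be written down as a toy Python model. The class and function names are hypothetical, and the bit positions (bit 5 to permit software breakpoints, bit 6 to permit asynchronous breakpoints from other cogs) are taken from the description above rather than from the Verilog. The cross-cog request is called cog_break here purely to keep it apart from the other two operations.

```python
# Toy model of the breakpoint rules described above; not the real hardware logic.
SW_BRK_ENABLE = 1 << 5     # permits BRK executed outside the debug ISR (per the post)
ASYNC_BRK_ENABLE = 1 << 6  # permits asynchronous breakpoints from other cogs (per the post)


class Cog:
    def __init__(self):
        self.in_debug_isr = False
        self.brk_condition = 0   # last value written by BRK inside the debug ISR
        self.brk_code = 0        # 8-bit value passed by BRK outside the debug ISR
        self.pending = False     # a debug interrupt has been requested

    def brk(self, d):
        """BRK {#}D: the meaning depends on whether we are inside the debug ISR."""
        if self.in_debug_isr:
            self.brk_condition = d        # set the next breakpoint condition
        elif self.brk_condition & SW_BRK_ENABLE:
            self.brk_code = d & 0xFF      # D[7:0] is handed to the debug ISR
            self.pending = True
        # otherwise the instruction behaves as a NOP

    def async_request(self):
        """Called on the target when another cog asks for an asynchronous break."""
        if self.brk_condition & ASYNC_BRK_ENABLE:
            self.pending = True


def cog_break(cogs, d):
    """Request an asynchronous breakpoint in cog D[3:0]."""
    cogs[d & 0xF].async_request()
```

In this model a BRK outside the debug ISR does nothing until the debug ISR has granted permission, and the user value and the target cog number travel through different operations, which is the distinction that caused the confusion.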
It's funny how you can stare at the smallest piece of code for hours and see nothing wrong, then one little clue and all becomes clear.
The pitfalls of assumption. What I gave you was confusing and you were inclined to see it differently than I thought you would.
I've renamed the instructions:
GETBRK  D WC/WZ/WCZ     'Get breakpoint-related data according to WC/WZ/WCZ
COGBRK  {#}D            'Generate asynchronous breakpoint in cog D[3:0]
BRK     {#}D            'Trigger breakpoint if outside of debug ISR, or set next breakpoint condition if inside of debug ISR
I would expand that one further
GETBRK  D WC/WZ/WCZ     'Get breakpoint-related data according to WC/WZ/WCZ
COGBRK  {#}D            'Generate asynchronous breakpoint in cog D[3:0]
BREAK   {#}D            'Trigger breakpoint if outside of debug ISR
SETBRK  {#}D            'Set next breakpoint condition if inside of debug ISR
If someone has downloaded a working image and they want to insert a breakpoint, what are the steps required from the PC and P2 ends to do that?
I've been running a lot of acid tests on the debugging circuitry. Everything is looking good.
I'm able to generate all types of debug interrupts, within all levels of ISRs and main code, including XBYTE, and pull out the SKIP(F) settings, tweak them, restore them, and resume.
There was one issue I had to rediscover and I'm thinking about how to handle it: Hub-exec takes control of the hub RAM FIFO and clobbers whatever RDFAST/WRFAST activity was underway. There's no way around this, except to wholly locate the debug ISR activity within the cog RAM. This doesn't mean we need much space, as simple SETQ+RDLONG/WRLONG combos can shuttle blocks of data between hub RAM and cog RAM. This means the debug stub code only needs to be several longs. I'm now trying to figure out how to make it zero longs. Maybe a tiny patchable "ROM" that executes whenever code execution is in the special-registers range of $1F8..$1FF is the way to achieve this. I just need to think a bit more. I think that's how we can do it. Then, we are not dependent upon any pre-agreement with the code being debugged. It just works!
Does the debug interrupt consume stack, meaning you need to have one stack level spare if you want to debug the deepest stack code?
There was one issue I had to rediscover and I'm thinking about how to handle it: Hub-exec takes control of the hub RAM FIFO and clobbers whatever RDFAST/WRFAST activity was underway. There's no way around this, except to wholly locate the debug ISR activity within the cog RAM. This doesn't mean we need much space, as simple SETQ+RDLONG/WRLONG combos can shuttle blocks of data between hub RAM and cog RAM. This means the debug stub code only needs to be, maybe, two longs. I'm now trying to figure out how to make it zero longs. Maybe a tiny patchable "ROM" that executes whenever code execution is in the special-registers range of $1F8..$1FF is the way to achieve this. I just need to think a bit more. I think that's how we can do it. Then, we are not dependent upon any pre-agreement with the code being debugged. It just works!
I was pondering something similar (only RAM-based) in the special register area, alive for opcodes like DJNZ, to allow debug pass counters that have zero footprint.
Of course, once you document such 'free DJNZ' locations, someone else will use them, making debug harder.
What is the speed cost of the muxed PC_Read ROM, and does it need to be ROM, or can it be RAM?
Such a ROM (or RAM?) could possibly be larger than 8 longs, if needed, if the rule is that entry is by increment from the special register area.
That sounds more like a small FIFO, which could work; it just means jumps inside that FIFO are not supported.
What does the small debug stub code look like? Is 2 longs enough?
Would making an 8-long parallel code RAM allow easy code roll-over between COG and HUB?
The ROM needs to do two things.
ISR entry:
1) save cog $000..$00F to some space in hub RAM (SETQ+WRLONG)
2) load cog $000..$00F from some space in hub RAM (SETQ+RDLONG)
3) jump to $000
ISR exit:
1) load cog $000..$00F from some space in hub RAM (SETQ+RDLONG)
2) RETI0
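The entry and exit steps above amount to swapping a 16-long block between cog RAM and hub RAM. A minimal Python sketch of that data movement follows, with both memories modelled as lists of longs. The function names and the hub locations (save_addr, debugger_addr) are hypothetical, and this says nothing about how the ROM itself would be implemented.

```python
# Toy model of the ISR entry/exit block moves; hub locations are hypothetical
# list indices (one entry per long), not real P2 hub addresses.
BLOCK = 16  # cog registers $000..$00F


def isr_entry(cog_ram, hub_ram, save_addr, debugger_addr):
    # 1) save cog $000..$00F to hub RAM (SETQ+WRLONG on the real chip)
    hub_ram[save_addr:save_addr + BLOCK] = cog_ram[0:BLOCK]
    # 2) load cog $000..$00F from hub RAM (SETQ+RDLONG on the real chip)
    cog_ram[0:BLOCK] = hub_ram[debugger_addr:debugger_addr + BLOCK]
    # 3) jump to $000: return the address where execution continues
    return 0x000


def isr_exit(cog_ram, hub_ram, save_addr):
    # 1) restore cog $000..$00F from the saved copy in hub RAM
    cog_ram[0:BLOCK] = hub_ram[save_addr:save_addr + BLOCK]
    # 2) RETI0 is modelled simply by returning to the caller
```

Only registers $000..$00F are disturbed while the debugger runs, and they are put back before the return, which is why the code being debugged needs no prior agreement with the debugger.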
If someone wanted to add a fast Break pass count, they might prefix a
0) DJNZ PassCtrCogReg,AdrRETI0
in order to minimize the debug overhead of each break.
Does that still fit ?
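To make the pass-count idea concrete, here is a small Python sketch of what such a DJNZ prefix would do: each break decrements the counter and returns immediately until the counter reaches zero, at which point execution falls through into the debugger. The function names are hypothetical, and DJNZ is modelled only as "decrement, jump if the result is not zero".

```python
def djnz(value):
    """Model of DJNZ: decrement a 32-bit value and report whether the jump is taken."""
    value = (value - 1) & 0xFFFFFFFF
    return value, value != 0


def breaks_until_debugger(pass_count):
    """Count how many breakpoint hits occur before the debugger is actually entered."""
    hits = 0
    while True:
        hits += 1
        pass_count, jump_to_reti0 = djnz(pass_count)
        if not jump_to_reti0:
            return hits  # counter reached zero: fall through into the debugger
```

With the counter preset to 3, the first two hits return straight away and the third enters the debugger.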
Unless you are concerned about the number of clock cycles debug ISR will take when entered, even the logic needed to inject the 'jump to $000' could be spared.
If your synthesized logic will generate two longs (not counting the jump itself) then replicate it four times, in that address space ($1F8...$1FF).
It'll be a four-fold execution block, but, who will care?
At least some equations (and area too) will be spared.
If the debug entry code execution overhead is a concern, a possible solution would be to simultaneously force the higher address bits to 1.
This way, for a two-long ISR entry length, the COG fetch logic would only ever see an execution at $1FE, followed by another at $1FF, before arriving at $000.
...
I think more than 2 longs is needed for Chip's code above?
Does $1FF wrap to COG:000, or into the LUT exec region?
There was one issue I had to rediscover and I'm thinking about how to handle it: Hub-exec takes control of the hub RAM FIFO and clobbers whatever RDFAST/WRFAST activity was underway. There's no way around this, except to wholly locate the debug ISR activity within the cog RAM..
.. any RDFAST/WRFAST activity has a short time limit, so the choices might be to delay the exec part on BREAK, or to have just the WAIT:JMP code in COG?
Chip,
Not sure if this will help. I did a "Zero Footprint Debugger" for P1 that resided in the shadow ram. It installed an LMM routine in the shadow ram and ran the debugger from Hub.
I was able to single step PASM and/or SPIN Interpreter from this.
Does $1FF wrap to COG:000, or into LUT exec region ?
In this test, execution continues into the LUT region, BUT code doesn't appear to execute when at the top of cog RAM: the DRVNOT #2 instruction never fires.
dat             org
                loc     ptra,#@loop
                setq2   #4
                rdlong  0,ptra          'copy code to lutram
                mov     $1ff,topcode    'copy instruction to cogram top
                jmp     #$1ff           'jump to top of cogram
topcode         drvnot  #2
                orgh    $400
                org     $200
loop            drvnot  #0
                drvnot  #1
                jmp     #$1ff
Interesting, I wonder what opcode it fetches ?
I guess this effect is why Chip was thinking of "Maybe a tiny patchable "ROM" that executes whenever code execution is in the special-registers range of $1F8..$1FF"
- though mapped RAM that could deliver opcodes seems safer than ROM?
Sometimes mistakes (or even gross errors, like my previous ones) are the best counsellors. Remembrances too, sometimes.
While I was driving to the local university to bring my wife home at the end of her workday, I was also thinking...
How hard (or mux-prone/resource-consuming) would it be if, during an active BRK, fetching instructions from the interval $01F8 - $01FF had the net effect of returning $F1F8 - $F1FF (the ADD and SUB group of instructions) as the instructions to be executed?
Supposing that there is some bit flag indicating that a BRK ISR must be executed, its first job would be to jump to $01F8 without resetting itself.
If the flag is kept alive until it is reset by the last needed instruction, it can be used as a discriminator to enable the decoding of the partial alternate instruction set.
This scheme would produce up to eight new instructions, whose individual meanings would be determined by the needs of the ISR-entry procedure.
Only a thought.
P.S. I forgot to say that I've mentally shifted the three least significant bits left, so the new instructions would become $1111 0001000 CZI DDDDDDDDD SSSSSSSSS through $1111 0001111 CZI DDDDDDDDD SSSSSSSSS, with the CZI, D and S fields to be filled/forced by the alternative decoding scheme.
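For readers who want to see the bit pattern from the P.S. spelled out, here is a small Python sketch that builds the 32-bit word for each of the eight addresses. The function name is hypothetical, and the field layout (4 condition bits, 7 opcode bits, CZI, 9-bit D, 9-bit S) is taken only from the pattern written above; which real instructions those opcodes correspond to is not checked here.

```python
def alt_opcode(address, czi=0, d=0, s=0):
    """Build the 32-bit word described in the post for a fetch from $1F8..$1FF."""
    if not 0x1F8 <= address <= 0x1FF:
        raise ValueError("address must be in $1F8..$1FF")
    cond = 0b1111                            # leading 1111 from the pattern
    opcode = 0b0001000 | (address & 0b111)   # 0001000..0001111, low 3 bits from the address
    return (
        (cond << 28)
        | (opcode << 21)
        | ((czi & 0b111) << 18)
        | ((d & 0x1FF) << 9)
        | (s & 0x1FF)
    )


# Print the eight words with the CZI, D and S fields left at zero.
for addr in range(0x1F8, 0x200):
    print(f"${addr:03X} -> ${alt_opcode(addr):08X}")
```

The eight addresses map to eight consecutive opcode slots, so the "new instructions" are selected by nothing more than the low three address bits plus the active-BRK flag.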
Seems quite restrictive, for little gain ?
You now have to carefully (re)craft the opcode map, to put the ones you need into those slots.
The small ROM saving here, could easily be lost in larger decode logic.
Perhaps it's a fault of my poor written English skills.
I was trying to describe the use of an active BRK ISR-request flag as bit 32 of a 33-bit long opcode.
P.S. You must be right.
One bit is not just one more bit. It's not just another single wire in the mix.
It's one more input term, in almost each and every equation.
I often miss the era of fusible-link PROMs and GALs. Tri-stateable buses are not that bad either.
And nanoseconds, hundreds of them, to react.
Becoming short on input terms? Just add another gate, and keep going.
OGT!
Yes, we will improve the naming for these things.
Here is the video on new ideas in "debug", as in reverse engineering without as much, or defined contexts being needed.
It's working great now.
I've renamed the instructions:
Here is the spreadsheet:
https://docs.google.com/spreadsheets/d/1usUcCCQVp3liAqENX9rvX-XVqJomMREhKYExM_taG0A/edit?usp=sharing
See lines 330, 331, and 332.
I haven't played with P2 for a couple of weeks so it was good to catch up again.
Okay. Good. I've got PNut.exe using those names now. I just need to do some complete compiles and update the Google Doc.
The ROM needs to do two things.
ISR entry:
1) save cog $000..$00F to some space in hub RAM (SETQ+WRLONG)
2) load cog $000..$00F from some space in hub RAM (SETQ+RDLONG)
3) jump to $000
ISR exit:
1) load cog $000..$00F from some space in hub RAM (SETQ+RDLONG)
2) RETI0
Good catch (yours)!