Strange that one pass would be ok.
(maybe not so strange, as a different ADD is used )
Do both FPGA boards do the same, and is it easy to lower the CLK speed?
This needs to read / add / save / compare to 64 bits, and the upper 32 bits are likely pipelined, so there are long paths here.
The details thus far suggest a 'bonus 1' is added at bit 32, somehow, but not always?
The DOCs show 32 bit & 64 bit WAITCNT, does the 64 bit version work OK?
The problem may be in the serial part. You are assuming the serouta calls take less than a second. To verify, replace those calls with a blinking LED or some other output. You could also try repeating the getcnt approach (after the serouta, in this case) to make sure you are waiting one second.
I am finding the CLRP/SETP/OFFP/NOTP confusing/misleading. I went looking for a SETDIR or SETPDIR instruction to set the pin 1..127 direction.
Perhaps it's just me, but I thought I would mention it anyway. See post below for further info.
Can anyone see my bug???
The ">" is output, but then it hangs on the waitcnt so "A" is never seen.
I am getting a ~53s delay between each set ">", "AB", "AB". Is there something I am missing with the new 64bit waitcnt?
DAT
              orgh    $00E00                  ' start of hub ram
              org     0
Entry
              getcnt  waitx
              add     waitx, delay5s          ' 5 secs
              waitcnt waitx, delay1s
              setsera _serenable, _period
              CLRP    #_txpin                 ' make txpin an output, SERA drives it high
              nop
              nop
              nop
              serouta #">"
:again
              waitcnt waitx, delay1s
              serouta #"A"
              serouta #"B"
              jmp     #:again
' txid rxid mask r t en rxpin# en txpin#
_serenable    long    00_0000_0000_0_0_10_0000000_10_0000000 | (_rxpin<<9) | _txpin
_period       long    _bitrate
waitx         long    0
delay500ms    long    _xinfreq / 2            ' 0.5 sec
delay1s       long    _xinfreq * 1            ' 1s
delay5s       long    _xinfreq * 5            ' 5 sec
_xinfreq should be the clock frequency (80_000_000). If it is a large number and this is not working, there is a bug in the Verilog code. I'm at my parents' place for Thanksgiving, so I don't have access to my work, but I'll be back on it tomorrow.
Chip: I hope you take the day off to enjoy "Thanksgiving" with your family!
I am finding the CLRP/SETP/OFFP/NOTP a bit confusing/misleading. I went looking for a SETDIR or SETPDIR instruction to set the pin 1..127 direction.
Perhaps it's just me, but I thought I would mention it anyway.
IIRC, these are the current instructions...
ZCL- 1111111 ZC L CCCC DDDDDDDDD x00110000 GETP D/# (pin# into !Z/C via WZ/WC)
ZCL- 1111111 ZC L CCCC DDDDDDDDD x00110001 GETNP D/# (pin# into Z/!C via WZ/WC)
--L- 1111111 xx L CCCC DDDDDDDDD x10011000 OFFP D/# (pin#=0??? , dir#=0)
--L- 1111111 xx L CCCC DDDDDDDDD x10011001 NOTP D/# (pin#=!pin# , dir#=1)
--L- 1111111 xx L CCCC DDDDDDDDD x10011010 CLRP D/# (pin#=0 , dir#=1)
--L- 1111111 xx L CCCC DDDDDDDDD x10011011 SETP D/# (pin#=1 , dir#=1)
--L- 1111111 xx L CCCC DDDDDDDDD x10011100 SETPC D/# (pin#=C , dir#=1)
--L- 1111111 xx L CCCC DDDDDDDDD x10011101 SETPNC D/# (pin#=!C , dir#=1)
--L- 1111111 xx L CCCC DDDDDDDDD x10011110 SETPZ D/# (pin#=Z , dir#=1)
--L- 1111111 xx L CCCC DDDDDDDDD x10011111 SETPNZ D/# (pin#=!Z , dir#=1)
[I]where D/# specifies a pin 0..127[/I]
Questions: (obviously I can check the following with sw)
Does OFFP modify/reset the out# value?
Does SETPC/SETPNC/SETPZ/SETPNZ also set DIR#=1?
Is there just one big 128 DIR# register per cog that sets the output direction ?
Are the port DIRA/DIRB/DIRC/DIRD registers just a window of 32 bit banks on the DIR# register ?
So if I do SETP #32, will a read of DIRB (defaults to b32..63) show bit0=1 ?
Depending on the above answers, perhaps the addition of the following instruction might be nice...
DIRP [#]D 'set pin# to an output (out# unchanged)
As I mentioned in the post above, some/all of the above instructions could be expanded similarly to also drive pin# pairs (by utilising the WZ/WC bits, although I notice there is plenty of opcode space for this format of instruction).
This would give us a simple method of bit-banging differential (really they are just complementary) pairs for protocols such as USB, RS422, etc.
They would utilise the same pin/pair mux logic that I described previously for the special GETXP instruction.
Would these be useful? Is there the space? Do you have the time? If so, I can document them for you. But I don't want to open a can of worms either.
Hey Chip,
Take the day off and enjoy yourself and your family!
BTW Yes, I do have _xinfreq set to 80_000_000. I didn't post the complete con/var sections.
All those OFFP/NOTP/SETP/CLRP/SETPC/SETPNC/SETPZ/SETPNZ instructions affect both the pin's OUT and DIR bits. OFFP clears the OUT and DIR bits, while all the rest set the DIR bit. There are no instructions which affect just the DIR bit, because that can be achieved via SETB/CLRB/etc on the DIR register, directly. The pin instructions, however, always affect both OUT and DIR. We could make them support Z and C in various ways.
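The OUT/DIR behaviour described in this reply can be written down as a small Python model, which may help when checking the questions above by experiment. This is only a sketch: the class name `PinBank`, the single 128-bit integers for OUT and DIR, and the method names are assumptions made for illustration, and nothing here says how DIRA/DIRB/DIRC/DIRD map onto those bits.

```python
class PinBank:
    """Toy model of one cog's OUT and DIR bits for pins 0..127 (layout assumed)."""

    PINS = 128

    def __init__(self):
        self.out = 0  # bit n holds the OUT state of pin n
        self.dir = 0  # bit n = 1 means pin n is driven as an output

    def _write(self, pin, out_bit, dir_bit):
        # Every pin instruction writes both the OUT bit and the DIR bit.
        if not 0 <= pin < self.PINS:
            raise ValueError("pin must be 0..127")
        mask = 1 << pin
        self.out = (self.out & ~mask) | (mask if out_bit else 0)
        self.dir = (self.dir & ~mask) | (mask if dir_bit else 0)

    def get(self, pin):
        """Return (out_bit, dir_bit) for a pin."""
        return (self.out >> pin) & 1, (self.dir >> pin) & 1

    def offp(self, pin):
        self._write(pin, 0, 0)          # OFFP clears OUT and DIR

    def clrp(self, pin):
        self._write(pin, 0, 1)          # OUT=0, DIR=1

    def setp(self, pin):
        self._write(pin, 1, 1)          # OUT=1, DIR=1

    def notp(self, pin):
        out_bit, _ = self.get(pin)
        self._write(pin, out_bit ^ 1, 1)  # toggle OUT, DIR=1

    def setpc(self, pin, c):
        self._write(pin, c, 1)          # OUT=C, DIR=1

    def setpnc(self, pin, c):
        self._write(pin, c ^ 1, 1)      # OUT=!C, DIR=1

    def setpz(self, pin, z):
        self._write(pin, z, 1)          # OUT=Z, DIR=1

    def setpnz(self, pin, z):
        self._write(pin, z ^ 1, 1)      # OUT=!Z, DIR=1
```

In this model there is deliberately no DIR-only operation, matching the statement that direction alone is changed with SETB/CLRB on the DIR register.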
I have found the problem with WAITCNT.
After a WAITCNT reg1,delay the reg1 value is always zero.
This was tested on a DE2 board.
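A zeroed target register would also explain the ~53s delay reported earlier: at 80 MHz a 32-bit counter wraps every 2^32 / 80_000_000 ≈ 53.7 seconds. The following Python sketch models that arithmetic only. It assumes that just the low 32 bits of the counter take part in the compare, and `work_ticks` is a made-up figure standing in for the time spent in the serial calls; it is not derived from the Verilog.

```python
MASK32 = (1 << 32) - 1


def wake_times(clock_hz, passes, buggy, work_ticks=1000):
    """Return the times (in seconds) at which each WAITCNT in the loop releases.

    buggy=True models the reported fault: the target register reads back
    as zero after every WAITCNT instead of target + delta.
    """
    now = 0                                   # free-running tick count
    target = (now + 5 * clock_hz) & MASK32    # getcnt + delay5s
    times = []
    for _ in range(passes):
        ticks = (target - now) & MASK32       # ticks until CNT == target (mod 2**32)
        now += ticks
        times.append(now / clock_hz)
        # Correct hardware adds the 1 s delta; the faulty result mux leaves zero.
        target = 0 if buggy else (target + clock_hz) & MASK32
        now += work_ticks                     # assumed time spent in serouta etc.
    return times


if __name__ == "__main__":
    print("expected:", wake_times(80_000_000, 4, buggy=False))
    print("faulty:  ", wake_times(80_000_000, 4, buggy=True))
```

With the fault modelled, the first wait still releases after 5 seconds, and every later wait releases only when the counter next passes zero, so the loop settles at one pass per 32-bit wrap.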
I bet I left WAITCNT out of the mux select equation. Its result comes out of the main adder, along with ADD/SUB/etc. This is probably just a matter of adding a symbol into a big OR equation. Sorry about this. Thanks for discovering the problem.
Thanks Chip and no worries. Now go and enjoy that turkey!
About having a signal to single-step cogs: The trace signals are going to appear to other cogs a few cycles after the fact. Single-stepping will not be very meaningful.
This wouldn't be the first micro to have that limitation. My coworkers say this is a commonly-used code snippet on the pic24:
if(some_condition)
{
    noop(); // Put breakpoint here
    noop(); // The extra noops are necessary because breakpoints have latency.
    noop();
    noop();
    noop();
    noop();
    noop();
}
Either way, would the cycles of latency continue while in PAUSE mode? Could I have a debugging cog turn on pause on a debugged cog when it got close to where it might need to break, then look at the trace data, and single-step with several clocks of wait time until the trace data lined up with where the breakpoint was supposed to be?
Also some micros have an option to disable the caches for debug, but if the cache is a little smarter as in the Prop 2, this changes code enough you are no longer really testing what you ship.
I think the new serial opcodes, plus a Prop2 as a Debug-Bridge, should be able to give good debug operations, with a quite small debug-stub.
This is what I've been thinking, too. Trace is useful, but single-stepping might be quite superfluous with all else that is doable.
To the contrary, with taskswitching, I think there are a lot more things necessary. I am just trying to think of a simple way to implement a stall in the stage 4 pipe that can be controlled by another cog (or external gating?).
Currently, I think a two line interface would work as follows:
0x=normal
1x=step on each falling edge of x
This permits a slower interface and removes the timing between cogs.
This mode would be enabled by setrace with an extra bit option.
We would need to specify the pin pair to be used, or use the next pair above the current trace pins.
I'm only half following this - you do not need to fiddle at the pipeline level, as single step is largely an illusion.
On most micros, there are actually thousands of cycles behind the scenes of each 'step', and a good many are dedicated to trying to stay invisible.
If you give one COG a single time-slot, then disable and dump the entire memory map for single clock taken, you have what looks like single step, with a full device image.
Some small amount of support in the Time Slot manager, along the lines of a single-shot mono-flop, is all the hardware this would need. That may already be there ?
COG memory dump is down to what now ? - inside half a dozen lines of code ?
What I am thinking of is a single flip-flop. Its output is gated with the single-step enable bit set by the setrace instruction, and an external "single" line (from an I/O). The output of the gate causes the stage 4 pipeline to stall.
The flip-flop is set by clocking the external "step" line (from an I/O) and reset as soon as an instruction is executed by stage 4 (i.e. 1 clock).
This is all I am after, and it would permit another cog (or an external device, if the I/O lines were external, i.e. P0..92) to control the stepping.
This way, we "can" see each instruction that passes thru the stage 4 pipeline. This is sufficient to see what is going on in your program - not entirely, but good enough, and certainly better than what we have.
So, when the "single" line is low, we have what we have now. Once a trigger point is seen, another cog/device sets the "single" line active which stalls the pipeline at stage 4. Then, to advance 1 clock (stage 4 cycle), the other cog/device toggles the "step" line from low/high/low which clocks the flip flop.
Now we can see what instructions (their cog addresses and flags) run through stage 4. When we are happy to let it free-run, the "single" line is brought low.
Otherwise, we can only read a certain number of stage 4 addresses before we fill memory.
Cog space is short enough, and potentially with multi-tasks running, it may be even more precious. To actually see what is happening inside the cog, we need to place some code inside the cog space. There is no way, other than by running the cog under LMM, to step through code. I have done this in the P1, and almost have my P2 Debugger doing this. There is no other way I know of to intercept running code to examine what is happening.
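The "single"/"step" proposal above can be sketched as a clock-by-clock Python model. The gate polarity is an assumption (stall when enabled, "single" is high and the flip-flop is clear), as are the class and argument names; the point is only to show that one rising edge on "step" releases exactly one stage 4 instruction regardless of how long the pulse is held.

```python
class StepGate:
    """Toy model of the proposed single-step flip-flop (polarity assumed)."""

    def __init__(self, enable=True):
        self.enable = enable      # the extra setrace option bit
        self.ff = False           # set by a step edge, cleared on execute
        self.prev_step = False    # last sampled state of the step line

    def clock(self, single, step):
        """Advance one system clock; return True if stage 4 executes."""
        if step and not self.prev_step:
            self.ff = True        # rising edge on "step" arms one instruction
        self.prev_step = bool(step)
        stall = self.enable and bool(single) and not self.ff
        executed = not stall
        if executed:
            self.ff = False       # reset as soon as an instruction executes
        return executed


def run(gate, lines):
    """Feed a list of (single, step) samples and collect the execute flags."""
    return [gate.clock(single, step) for single, step in lines]
```

For example, feeding `[(0,0), (0,0), (1,0), (1,1), (1,1), (1,0), (1,1), (0,0)]` gives free-running execution while "single" is low, a stall once it goes high, and one executed instruction per rising edge of "step".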
Since P2 is designed to operate down to 0Hz, how difficult would it be to use externally controlled, variable clocking to provide the single-step capability?
That bit is easy - but what you also need to do, is dump a snapshot of the whole cog, between every single step.
Ideally done with zero visible impact, but in SW based [step+dump], usually some small impact is tolerated. ( < 1%?)
Yes, it is the ability to stall while you look through and think about what is happening. But then you may want to run through at faster speed again until the next pseudo breakpoint, or just step to the next instruction. So with the two lines, the other cog could release the "stall" line and delay a number of clocks and again "stall", all the time capturing what went on via the setrace output. With just these two simple lines and a flipflop and extra setrace command bit, a whole flexible tracing system can be built in software using another cog (or even external hardware = another P2, etc).
Then the only things missing are examining cog space, the instruction and its operand values and results. It's not possible in this regime to view cog space unless you have another task running that you can use to do this. Chip was alluding to having these types of options available now, which is true, providing you don't need to run anything real-time.
I am sure we can build lots of debugging within the current framework. I just believe that it would be easier and simpler with these basic additions. Unfortunately, Chip opened the door with setrace, and some of us have seized the opportunity to make it soooo much better. I know there are other things we could ask for. When we have real P2s I am sure we can ask Chip for some additions for the FPGA only so we can use that to test our code and use it to aid debugging.
Gating the oscillator would stall all cogs, so we would have to use another P2 for debugging. And what of the ramifications of stalling all the cogs?
I originally thought of just a single input to stall, but then realised that the driving cog would have to output a single clock pulse to create single-stepping, and timing delays might result in a variable number of steps (0, 1 or 2). That is why I asked for 2 lines and a flip-flop.
This is how I see things, too. You can stop and single-step a cog, but unless you can get register dumps, it's not better than taking a full-speed log via SETRACE. To get a register dump, you would have to execute code to facilitate it, which can be very small, but would take hundreds of cycles. Single-stepping only slows down what SETRACE would do in real-time. The way I see it, you'd just want to put a JMP to the breakpoint handler at your breakpoint location and not bother trying to coerce SETRACE into doing something it's not going to be good at.
Not sure if that is what you are after, but you can single-step through code very easily with 2 tasks:
        org
        jmp     #Task0
Task1   setp    #0              'single-stepped code
        clrp    #0
        jmp     #Task1
Task0   waitpf  #91             'wait for <space key> on Terminal
        waitpr  #91
        waitpr  #91             'wait until space finished
        settask #%%1            'do one step of Task1
        settask #%%0            'disable Task1
        'optional debug output here
        jmp     #Task0
This executes one instruction of Task1 every time you press the space key. Also a debug output for every step can be added.
Instead of the WAITPF/R you can also use SERINx but that needs more setup.
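The stepping behaviour claimed for the two-task code can be pictured with a very small Python model: each keypress lets Task1 advance by one instruction, and the stepping task can log where it is. This ignores the pipeline and time-slot details entirely; the instruction strings and the function name `run_stepper` are illustrative assumptions only.

```python
def run_stepper(task1_program, keypresses):
    """Execute one Task1 instruction per keypress and return the trace.

    task1_program is a list of instruction strings; a 'jmp' is assumed
    to loop back to the start of the list, as in the Task1 loop above.
    """
    pc = 0
    trace = []
    for _ in range(keypresses):
        instr = task1_program[pc]
        trace.append((pc, instr))      # stands in for the optional debug output
        pc = 0 if instr.startswith("jmp") else pc + 1
    return trace


if __name__ == "__main__":
    program = ["setp #0", "clrp #0", "jmp #Task1"]
    for pc, instr in run_stepper(program, 4):
        print(pc, instr)
```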
This is a very interesting idea. I'm thinking it could be extended further for more functionality (at the expense of extra COG overhead).
Often during debugging it is convenient to run some of the code at full speed until a certain condition at a PC address is reached, at which point you want to single-step through (or resume) the following code to narrow down and find the culprit. A settask instruction could effectively become a breakpoint instruction by turning the debugger task back on. That task has a PC already prepared and waiting, and its first instruction immediately turns off the task under test so it can dump the COG state to the serial port or wherever. Something like that could be very useful, but it would obviously require extra COG resources, and it would become very difficult or impossible to debug code that already uses multi-tasking. Not all COG code uses multi-tasking, though, so some would be a candidate for this debugging method.
Having self-modifiable COG code could also allow us to rewind the debugged task's PC and put back the original instruction at the breakpoint position, then execute it again as we single-step. This may allow dynamic breakpoints and would be very useful during an interactive debugging session. One main issue is the extra COG space used by this "internal debugger", but perhaps it could be driven from a small VM running debugger code from hub memory to keep the COG overhead small.
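The dynamic-breakpoint idea (patch in a breakpoint, then put the original instruction back and rewind the PC) can be sketched in Python as bookkeeping over a list that stands in for cog RAM. The `BREAK` placeholder string and the `Breakpoints` class are hypothetical names for illustration, not real opcodes or an existing tool.

```python
BREAK = "settask <debugger>"   # hypothetical placeholder for the patched-in instruction


class Breakpoints:
    """Track instructions overwritten by breakpoints in a cog RAM model."""

    def __init__(self, code):
        self.code = code       # list of instruction strings (stand-in for cog RAM)
        self.saved = {}        # address -> original instruction

    def set(self, addr):
        """Replace the instruction at addr with the breakpoint, once."""
        if addr not in self.saved:
            self.saved[addr] = self.code[addr]
            self.code[addr] = BREAK

    def resume_pc(self, hit_addr):
        """Restore the original instruction and return the PC to rewind to."""
        self.code[hit_addr] = self.saved.pop(hit_addr)
        return hit_addr
```

After `resume_pc` the debugged task would restart at the returned address, so the original instruction runs as the first single step.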
I already have a P2 Debugger that uses LMM and 16 longs of cog. It's not running yet on the latest instruction set - it could just be the waitcnt issue.
In maybe 6 instructions (using REPS), you could dump the cog RAM to hub and in a few more, read in some debugger code. Afterwards, swap the old code back in and resume execution. Something like this has to be done in order to get a register dump and allow the I/Os to be modified. It would be impractical to build all that functionality into hardware. This is a job for software.
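The swap-out/swap-in sequence described here can be modelled in Python with plain lists for cog and hub memory. It is a sketch of the data movement only: the 512-long cog size, the `save_base` hub address and the function names are assumptions for illustration, and the real thing would be a handful of PASM instructions.

```python
COG_LONGS = 512  # assumed cog RAM size in longs


def enter_debugger(cog, hub, save_base, debugger_image):
    """Dump the whole cog image to hub, then overlay the debugger code."""
    hub[save_base:save_base + len(cog)] = cog       # register dump lives in hub
    cog[:len(debugger_image)] = debugger_image      # read in debugger code


def leave_debugger(cog, hub, save_base):
    """Swap the saved image back in so the original code can resume."""
    cog[:] = hub[save_base:save_base + len(cog)]


if __name__ == "__main__":
    cog = list(range(COG_LONGS))
    hub = [0] * 4096
    enter_debugger(cog, hub, 1024, [0xDEAD] * 16)
    leave_debugger(cog, hub, 1024)
    print(cog == list(range(COG_LONGS)))
```

While the debugger overlay is resident, the saved copy in hub is what a host tool would read for the register dump, and any edits made to that copy would take effect when it is swapped back in.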
Tools and methods like this will be excellent for debugging PASM knowing all the pain I've been through when debugging my own driver code. If I can dynamically setup a breakpoint (using settask instructions) in PASM, run full speed, stop then single step, reading register state along the way, and then seamlessly resume execution of the original code I will be a very happy and more productive developer. Not all PASM code will be a candidate for doing this type of thing, but certainly quite a lot would be. I really like it.
Thanks for making that fix-it list, Cluso99. It reminded me about this WAITCNT problem that was, indeed, caused by a missing term in the result selector mux.
I got rid of the 'SETPIX0/1/2/3 D/#,S/#' instructions, freeing up those four 'D,S' opcodes. I added a SETPIXW instruction that sets all 8 PIX terms from the WIDEs.
All the QUAD naming has been changed to WIDE now. I changed SETWIDE/SETWIDZ (was SETQUAD/SETQUAZ) so that the address must be %X_XXXX_X000 to map the quads into cog RAM. If the three LSBs are not %000, WIDE is not mapped into cog RAM. This saved a whole 1% of flipflops, since all 8 addresses were previously registered for fast comparisons. Now, we only compare bits 8..3, which means WIDEs must start at $000/$008/$010/etc.
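The alignment rule for WIDEs can be expressed as two small checks in Python. This is a sketch under the assumption of 9-bit cog addresses; the function names are made up for illustration.

```python
ADDR_MASK = 0x1FF  # 9-bit cog address (assumed)


def wide_is_mapped(base):
    """WIDE is mapped into cog RAM only if the three LSBs of the base are %000."""
    return (base & 0b111) == 0


def wide_hit(addr, base):
    """True if addr falls inside the 8-long WIDE window starting at base.

    Only bits 8..3 are compared, which is why the window must start on
    an 8-long boundary ($000/$008/$010/...).
    """
    addr &= ADDR_MASK
    base &= ADDR_MASK
    return wide_is_mapped(base) and (addr >> 3) == (base >> 3)
```

For instance, with the base at $008, addresses $008..$00F hit the window and $010 does not, while a base of $00A maps nothing.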
Great work Chip!
Yes, WIDEs must be on 000 boundaries. Doesn't really make sense otherwise, just like long and word boundaries.
1% is another nice saving. Every little bit counts.
Comments
(maybe not so strange, as a different ADD is used )
Do both FPGA boards do the same, and is it easy to lower the CLK speed ?
This needs to read / add / save / compare to 64 bits, and the upper 32 bits is likely pipelined, so there are long paths here.
The details this far suggest a 'bonus 1' is added at bit 32, somehow, but not always ?
The DOCs show 32 bit & 64 bit WAITCNT, does the 64 bit version work ok ?
_xinfreq should be the clock frequency (80_000_000). If it is a large number and this is not working, there is a bug in the Verilog code. I'm at my parents' place for Thanksgiving, so I don't have access to my work, but I'll be back on it tomorrow.
I am finding the CLRP/SETP/OFFP/NOTP a bit confusing/misleading. I went looking for a SETDIR or SETPDIR instruction to set the pin 1..127 direction.
Perhaps its just me, but thought I would mention it anyway.
IIRC, these are the current instructions... Questions: (obviously I can check the following with sw)
Does OFFP modify/reset the out# value?
Does SETPC/SETPNC/SETPZ/SETPNZ also set DIR#=1?
Is there just one big 128 DIR# register per cog that sets the output direction ?
Are the port DIRA/DIRB/DIRC/DIRD registers just a window of 32 bit banks on the DIR# register ?
So if I do SETP #32, will a read of DIRB (defaults to b32..63) show bit0=1 ?
Depending on the above answers, perhaps the addition of the following instruction might be nice...
DIRP [#]D 'set pin# to an output (out# unchanged)
As I mentioned in the post above, some/all of the above instructions could be expanded similarly to also drive pin# pairs (by utilising the WZ/WC bits, although I notice there are plenty of opcode space for this format instruction).
This would give us a simple method of bit-banging differential (really they are just complementary) pairs for protocols such as USB, RS422, etc.
They would utilise the same pin/pair mux logic that I described previously for the special GETXP instruction
Would these be useful? Is there the space? Do you have the time? If so, I can document them for you. But I don't want to open a can of worms either.
Take the day off and enjoy yourself and your family!
BTW Yes, I do have _xinfreq set to 80_000_000. I didn't post the complete con/var sections.
All those OFFP/NOTP/SETP/CLRP/SETPC/SETPNC/SETPZ/SETPNZ instructions affect both the pin's OUT and DIR bits. OFFP clears the OUT and DIR bits, while all the rest set the DIR bit. There are no instructions which affect just the DIR bit, because that can be achieved via SETB/CLRB/etc on the DIR register, directly. The pin instructions, however, always affect both OUT and DIR. We could make them support Z and C in various ways.
I have found the problem with WAITCNT
After a WATCNT reg1,delay the reg1 value is always zero.
This was tested on a DE2 board.
I bet I left WAITCNT out of the mux select equation. Its result comes out of the main adder, along with ADD/SUB/etc. This is probably just a matter of adding a symbol into a big OR equation. Sorry about this. Thanks for discovering the problem.
Also some micros have an option to disable the caches for debug, but if the cache is a little smarter as in the Prop 2, this changes code enough you are no longer really testing what you ship.
I think the new serial opcodes, plus a Prop2 as a Debug-Bridge, should be able to give good debug operations, with a quite small debug-stub.
This is what I've been thinking, too. Trace is useful, but single-stepping might be quite superfluous with all else that is doable.
Currently, I think a two line interface would work as follows:
0x=normal
1x=step on each falling edge of x
This permits a slower interface and removes the timing between cogs.
This mode would be a enabled by settrace with an extra bit option.
We would need to specify the pin pair to be used, or use the next pair above the current trace pins.
I'm only half following this - you do not need to fiddle at the pipeline level, as single step is largely an illusion.
On most micros, there are actually thousands of cycles behind the scenes of each 'step', and a good many are dedicated to trying to stay invisible.
If you give one COG a single time-slot, then disable and dump the entire memory map for single clock taken, you have what looks like single step, with a full device image.
Some small amount of support in the Time Slot manager, along the lines of a single-shot mono-flop, is all the hardware this would need. That may already be there ?
COG memory dump is down to what now ? - inside half a dozen lines of code ?
The flip-flop is set by clocking the external "step" line (from an I/O) and reset as soon as an instruction is executed by stage 4 (ie 1 clock).
This is all I am after, and it would permit another cog (or external device if the I/O line(s) were external - ie P0..92).
This way, we "can" see each instruction that passes thru the stage 4 pipeline. This is sufficient to see what is going on in your program - not entirely, but good enough, and certainly better than what we have.
So, when the "single" line is low, we have what we have now. Once a trigger point is seen, another cog/device sets the "single" line active which stalls the pipeline at stage 4. Then, to advance 1 clock (stage 4 cycle), the other cog/device toggles the "step" line from low/high/low which clocks the flip flop.
Now we can see what instructions (their cog addresses and flags) run through stage 4. When we are happy to let it free-run, the "single" line is brought low.
Otherwise, we can only read a certain number of stage 4 addresses before we fill memory.
Cog space is short enough, and potentially with multi-tasks running, it may be even more precious. To actually see what is happening inside the cog, we need to place some code inside the cog space. There is no way, other than by running the cog under LMM, to step through code. I have done this in the P1, and almost have my P2 Debugger doing this. There is no way I know of otherwise intercept running code to examine what is happening.
That bit is easy - but what you also need to do, is dump a snapshot of the whole cog, between every single step.
Ideally done with zero visible impact, but in SW based [step+dump], usually some small impact is tolerated. ( < 1%?)
Then the only things missing are examining cog space, the instruction and its operand values and results. Its not possible in this regime to view cog space unless you have another task running that you can use to do this. Chip was alluding to having these types of options available now which is true, providing you don't need to run anything real-time.
I am sure we can build lots of debugging within the current framework. I just believe that it would be easier and simpler with these basic additions. Unfortunately, Chip opened th door with setrace, and some of us have seized the opportunity to make it soooo much better. I know there are other things we could ask for. When we have real P2s I am sure we can ask Chip for some additions for the FPGA only so we can use that to test our code and use it to aid debugging.
Gating the oscillator would stall all cogs so we would have to use another P2 for debugging. And what of the ramifications of stalling all the cogs.
I originally though of just a single input to stall, but then thought that the driving cog to create single-stepping would have to output a single clock pulse and there may be timing delays that may result in a variable number of steps of 0-1-2. That is why I asked for 2 lines and a flipflop.
This is how I see things, too. You can stop and single-step a cog, but unless you can get register dumps, it's not better than taking a full-speed log via SETRACE. To get a register dump, you would have to execute code to facilitate it, which can be very small, but would take hundreds of cycles. Single-stepping only slows down what SETRACE would do in real-time. The way I see it, you'd just want to put a JMP to the breakpoint handler at your breakpoint location and not bother trying to coerce SETRACE into doing something it's not going to be good at.
This executes one instruction of Task1 every time you press the space key. Also a debug output for every step can be added.
Instead of the WAITPF/R you can also use SERINx but that needs more setup.
Andy
This is a very interesting idea. I'm thinking it could be extended further for more functionality (at the expense of extra COG overhead).
Often during debugging it is convenient to run some of the code at full speed until a certain condition at a PC address is reached, at which point you want to single-step through (or resume) the following code to narrow down the culprit. We could use a SETTASK instruction as an effective breakpoint instruction: it turns the debugger task back on, which has a PC already prepared and waiting, and whose first instruction immediately turns off the task under test before dumping COG state etc. to the serial port or wherever. Something like that could be very useful, but it would obviously require extra COG resources and would be very difficult or impossible to use on code that already uses multi-tasking. Not all COG code uses multi-tasking, though, so some could be candidates for this debugging method.
Having self-modifiable COG code would even allow us to rewind the debugged task's PC and put back the original instruction at the breakpoint position, so it executes again as we single-step. This may allow dynamic breakpoints and would be very useful during an interactive debugging session. One main issue is the extra COG space used by this "internal debugger", but perhaps it could be driven from a small VM running debugger code from hub memory to keep the COG overhead small.
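A minimal sketch of the patch-and-restore idea, in Python rather than PASM since the instruction set is still changing. Cog RAM is modeled as a plain list, and the ("JMP", addr) tuple, `set_breakpoint` and `clear_breakpoint` are made-up names for illustration only, not real P2 encodings or instructions.

```python
# Toy model of a dynamic software breakpoint.
# Cog RAM is a list of instruction "words"; the tuples are placeholders,
# not real P2 opcodes.

def set_breakpoint(cog_ram, addr, handler_addr):
    """Replace the instruction at addr with a jump to the handler.

    Returns the original instruction so it can be restored later.
    """
    original = cog_ram[addr]
    cog_ram[addr] = ("JMP", handler_addr)
    return original


def clear_breakpoint(cog_ram, addr, original):
    """Put the original instruction back.

    Returns the PC to resume at: the breakpoint address itself, so the
    restored instruction executes when the task is resumed.
    """
    cog_ram[addr] = original
    return addr


cog_ram = [("ADD", 1), ("SUB", 2), ("XOR", 3), ("NOP", 0)]
saved = set_breakpoint(cog_ram, 2, handler_addr=0x1F0)
patched = cog_ram[2]                      # what the task under test would now fetch
resume_pc = clear_breakpoint(cog_ram, 2, saved)
```

The point is only that the debugger needs two pieces of state per breakpoint (the address and the saved instruction), which is cheap enough to keep in hub memory.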
In maybe 6 instructions (using REPS), you could dump the cog RAM to hub and in a few more, read in some debugger code. Afterwards, swap the old code back in and resume execution. Something like this has to be done in order to get a register dump and allow the I/Os to be modified. It would be impractical to build all that functionality into hardware. This is a job for software.
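The swap sequence described above can be modeled roughly as follows. This is a Python sketch, not PASM: the 512-long cog size is an assumption based on 9-bit register addresses, and `enter_debugger`, `leave_debugger` and the hub offsets are hypothetical names chosen for the example.

```python
COG_LONGS = 512  # assumed cog RAM size in longs (9-bit addresses)


def enter_debugger(cog, hub, save_base, debugger_base, debugger_len):
    """Dump all of cog RAM to hub, then overlay the debugger code."""
    hub[save_base:save_base + COG_LONGS] = cog
    cog[:debugger_len] = hub[debugger_base:debugger_base + debugger_len]


def leave_debugger(cog, hub, save_base):
    """Swap the saved cog image back in so execution can resume."""
    cog[:] = hub[save_base:save_base + COG_LONGS]


cog = list(range(COG_LONGS))          # stand-in for the task's code and data
hub = [0] * 2048
hub[1024:1028] = [9, 9, 9, 9]         # stand-in for a tiny debugger image

enter_debugger(cog, hub, save_base=0, debugger_base=1024, debugger_len=4)
overlay = cog[:5]                     # debugger code followed by untouched cog data
leave_debugger(cog, hub, save_base=0)
```

While the debugger is resident, the saved image in hub is what gets inspected or modified, so a register dump is just a read of the hub copy.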
Thanks for making that fix-it list, Cluso99. It reminded me about this WAITCNT problem that was, indeed, caused by a missing term in the result selector mux.
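For reference, the ~53 s delay reported earlier in the thread is about what one full wrap of a 32-bit counter takes at 80 MHz, which would be consistent with the wait only being satisfied once per wrap of the lower long. That reading is an inference from the numbers, not a description of the actual Verilog. The arithmetic, as a small Python check:

```python
CLOCK_HZ = 80_000_000  # clock frequency quoted earlier in the thread


def wrap_seconds(bits, clock_hz=CLOCK_HZ):
    """Seconds for a free-running counter of the given width to wrap once."""
    return (1 << bits) / clock_hz


low_long_wrap = wrap_seconds(32)   # about 53.7 seconds
```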
I got rid of the 'SETPIX0/1/2/3 D/#,S/#' instructions, freeing up those four 'D,S' opcodes. I added a SETPIXW instruction that sets all 8 PIX terms from the WIDEs.
All the QUAD naming has been changed to WIDE now. I changed SETWIDE/SETWIDZ (was SETQUAD/SETQUAZ) so that the address must be %X_XXXX_X000 to map the WIDEs into cog RAM. If the three LSBs are not %000, the WIDEs are not mapped into cog RAM. This saved a whole 1% of flipflops, since all 8 addresses were previously registered for fast comparisons. Now, we only compare bits 8..3, which means WIDEs must start at $000/$008/$010/etc.
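A rough Python model of the alignment rule and the bits 8..3 comparison. The function names are invented for illustration, and the sketch only mirrors the description above, not the actual logic in the FPGA.

```python
ADDR_MASK = 0x1FF  # 9-bit cog register address


def wide_is_mapped(base):
    """True if the base address has its three LSBs clear (%X_XXXX_X000)."""
    return (base & 0b111) == 0


def hits_wide(reg_addr, base):
    """True if reg_addr falls in the 8-long window at an aligned base.

    Only bits 8..3 are compared; bits 2..0 select the long within the WIDE.
    """
    if not wide_is_mapped(base):
        return False
    return ((reg_addr & ADDR_MASK) >> 3) == ((base & ADDR_MASK) >> 3)


aligned = [a for a in range(0x20) if wide_is_mapped(a)]  # $000, $008, $010, $018
```

With an aligned base, a single 6-bit compare replaces the eight full-address compares that were needed when the window could start anywhere.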
Yes, WIDEs must be on 000 boundaries. Doesn't really make sense otherwise, just like long and word boundaries.
1% is another nice saving. Every little bit counts.
Did this get solved (I missed it)?
Chip, regarding the INDx/WAITVID issue
http://forums.parallax.com/showthread.php/151904-Here-is-the-update-from-the-Big-Change!!!?p=1222350&viewfull=1#post1222350
re SETSLOT D/# - this was the latest proposal. It could be simpler, but this is nice.
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223254&viewfull=1#post1223254
re AUXA/AUXB pointers, etc
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223795&viewfull=1#post1223795
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223798&viewfull=1#post1223798
re Random Number generation
http://forums.parallax.com/showthread.php/151904-Here-is-the-update-from-the-Big-Change!!!?p=1222443&viewfull=1#post1222443
Is there any possibility of the RDWIDE instruction being able to read multiple 8*Long blocks into hub using a tiny state m/c?
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223880&viewfull=1#post1223880
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223930&viewfull=1#post1223930
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223936&viewfull=1#post1223936
Is there any possibility of mapping more than one 8*Long block into Cog?
I think I understand what Bill and David were discussing regarding effectively executing from HUB without LMM. There are a couple of posts here that I think explain it better with code
http://forums.parallax.com/showthread.php/152079-Hub-Execution-Model-Thread-(split-from-blog)?p=1224716&viewfull=1#post1224716
HJMP/HCALL/HRET
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223938&viewfull=1#post1223938
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223940&viewfull=1#post1223940
I noticed a couple of gotchas you mentioned, so here are the links just in case
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223918&viewfull=1#post1223918
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223922&viewfull=1#post1223922
Thanks for posting that summary. That's a lot of stuff!
I've got to sleep now, but I'll address all these tomorrow. Thanks for all your perseverance in keeping track of these issues.