Prop2 FPGA files!!! - Updated 2 June 2018 - Final Version 32i

cgracey · 2018-03-16 20:55

jmg wrote: »

Yanomani wrote: »

IMHO, being capable of time-stamping the exact moment that a debug interrupt was entered could be useful to solve some intricate problems, including keeping a number of COGs in sync.

But, before any debug processing can take place, there are many HUB accesses that must be done, to transfer registers to/from HUB ram.
...
In such situation, is there any way to freeze a copy of CT for each COG, to be later retrieved during each one individual debug processing?

Good point.
The time jitter is not large, so it may be possible to save a smaller copy of CT, to save logic resource ? Debug code sends both values to the PC-side for adjustments.
Is 5~6 bits sufficient here ?

From when an interrupt trigger occurs, there's always some potentially-variable number of clocks to get past things like SETQ+instruction and WAITX, etc, before the interrupt CALLD can be inserted into the pipeline. Some CT capture could be done, but its value is questionable to begin with, I think.

I think people will realize early on that running full-speed without debug is where their timing will regulate nicely. Debug interrupts steal time.

What I've noticed, so far, that presents a big headache, is that ADDCTn-equaled-CT interrupts can occur during debug interrupts and wind up getting ignored, only to resume 2^32 clocks later (~53 seconds later at 80MHz). Now, this is a problem that is hard to solve, because even if we notice from the event bits that we missed those interrupts, and we then retrigger them, the timing accounting is already lost in the user's program. The only way I can figure to resolve this is to place a programmable mask on the ADDCTn-equaled-CT detector. For example, a mask of $00000FFF would limit match detection to the 12 LSBs, allowing interrupts to resume soon. That's not really good, either, though, because now absolute registration has been lost that might have been correlating one cog's activity with another's. Maybe there are some things that cannot be gotten around.
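The figures in the paragraph above can be checked with a few lines of arithmetic. This is only a minimal sketch in Python, not P2 code: `wrap_seconds` and `masked_match` are hypothetical helper names, and the 80 MHz clock is the one quoted in the post.

```python
CLOCK_HZ = 80_000_000  # clock rate quoted in the post


def wrap_seconds(match_bits, clock_hz=CLOCK_HZ):
    """Seconds until a counter compared on `match_bits` LSBs repeats a value."""
    return (1 << match_bits) / clock_hz


def masked_match(ct, target, mask=0xFFFFFFFF):
    """Equality test restricted to the bits set in `mask`."""
    return ((ct ^ target) & mask) == 0


full_wait = wrap_seconds(32)    # roughly 53.7 seconds for a full 32-bit match
masked_wait = wrap_seconds(12)  # 51.2 microseconds with a $00000FFF mask

# With the 12-bit mask, two very different CT values hit the same target,
# which is the loss of absolute registration the post mentions.
same_low_bits = masked_match(0x12345ABC, 0x00000ABC, 0x00000FFF)
```

The mask shortens the worst-case wait by a factor of 2^20, at the cost of no longer distinguishing CT values that share their low 12 bits.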

Roy Eltham · 2018-03-16 21:04

Chip,
Why not LOCKTRY and UNLOCK ?

Maybe LOCKEND, or LOCKFIN, or LOCKREL, or RELEASE or LOCKRLS or....

cgracey · 2018-03-16 21:16

Roy Eltham wrote: »

Chip,
Why not LOCKTRY and UNLOCK ?

Maybe LOCKEND, or LOCKFIN, or LOCKREL, or RELEASE or LOCKRLS or....

TRYLOCK and UNLOCK?

The LOCKxxx naming convention is a bit stifling.

cgracey · 2018-03-16 21:19

...And you're not really 'unlocking', but giving up the lock that you hold - which isn't really a lock, as much as it is a baton in a relay race.

jmg · 2018-03-16 21:26

cgracey wrote: »

...
What I've noticed, so far, that presents a big headache, is that ADDCTn-equaled-CT interrupts can occur during debug interrupts and wind up getting ignored, only to resume 2^32 clocks later (~53 seconds later at 80MHz). Now, this is a problem that is hard to solve, because even if we notice from the event bits that we missed those interrupts, and we then retrigger them, the timing accounting is already lost in the user's program. The only way I can figure to resolve this is to place a programmable mask on the ADDCTn-equaled-CT detector. For example, a mask of $00000FFF would limit match detection to the 12 LSBs, allowing interrupts to resume soon. That's not really good, either, though, because now absolute registration has been lost that might have been correlating one cog's activity with another's. Maybe there are some things that cannot be gotten around.

There is always some real-time trade off with debug. Some MCU vendors with plentiful SFRs add bits to control what pauses during debug.
eg high speed serial receive is going to drop characters when debug fires.

I'm not following this exact case tho - if you have an ADDCTn-equaled-CT detector & true interrupt, is that not going to set a HW flag - so you may skip multiple rapid sets, but should not hang for 53 sec ?
I thought some HW tests were >= instead of == , in order to avoid such long hang times ?

cgracey · 2018-03-16 21:50

jmg wrote: »

cgracey wrote: »

...
What I've noticed, so far, that presents a big headache, is that ADDCTn-equaled-CT interrupts can occur during debug interrupts and wind up getting ignored, only to resume 2^32 clocks later (~53 seconds later at 80MHz). Now, this is a problem that is hard to solve, because even if we notice from the event bits that we missed those interrupts, and we then retrigger them, the timing accounting is already lost in the user's program. The only way I can figure to resolve this is to place a programmable mask on the ADDCTn-equaled-CT detector. For example, a mask of $00000FFF would limit match detection to the 12 LSBs, allowing interrupts to resume soon. That's not really good, either, though, because now absolute registration has been lost that might have been correlating one cog's activity with another's. Maybe there are some things that cannot be gotten around.

There is always some real-time trade off with debug. Some MCU vendors with plentiful SFRs add bits to control what pauses during debug.
eg high speed serial receive is going to drop characters when debug fires.

I'm not following this exact case tho - if you have an ADDCTn-equaled-CT detector & true interrupt, is that not going to set a HW flag - so you may skip multiple rapid sets, but should not hang for 53 sec ?
I thought some HW tests were >= instead of == , in order to avoid such long hang times ?

It requires the user's program to set the next target. It doesn't advance automatically by some value each time. That's why it gets lost.

If I changed to a 'CT >= target' test, that would help in the first timer interrupt after debug, but the next target would likely wind up below CT, waiting for the CT wrap again.

The only kind of timer interrupts that could get around this kind of problem are interrupts that fire every N clocks. The problem with those is that they cannot be used to match arbitrary CT values.
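To make the "gets lost" case concrete, here is a small Python sketch of an equality-only match against a free-running 32-bit counter. The function name and the example numbers are illustrative assumptions, not anything taken from the P2 design.

```python
MASK32 = 0xFFFFFFFF


def clocks_until_equal(ct, target):
    """Clocks until a free-running 32-bit CT next equals `target`."""
    return (target - ct) & MASK32


target = 5000

# Normal case: the target is still 100 clocks ahead.
on_time = clocks_until_equal(4900, target)

# Debug case: the cog was held until CT was 200 clocks past the target,
# so an equality-only detector has to wait for CT to wrap all the way around.
after_debug = clocks_until_equal(5200, target)
```

Being only 200 clocks late turns a 100-clock wait into one of 2^32 - 200 clocks, which is the roughly 53-second stall described earlier in the thread.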

cgracey · 2018-03-16 22:00

One kind of kludge approach could be a selectable configuration, where you agree to put your CT targets and adder values into certain cog registers. Then, the debugger could update the targets to CT-plus-adders on the way out.

Or, you could provide an 'exit' overlay that gets executed each time the debugger returns to your program. That would work, but it would require more than just your normal source code.

jmg · 2018-03-16 22:55

cgracey wrote: »

One kind of kludge approach could be a selectable configuration, where you agree to put your CT targets and adder values into certain cog registers. Then, the debugger could update the targets to CT-plus-adders on the way out..

or, the debugger just needs to know where they are, like a special type of Watch Var. That seems tolerable, to tell the debugger which vars to CT-Adjust on step/exit ?

jmg · 2018-03-16 23:03

cgracey wrote: »

It requires the user's program to set the next target. It doesn't advance automatically by some value each time. That's why it gets lost.

The DOCs say
Set CT1 event to trigger on CT = D + S. Adds S into D.

Doesn't that CT1 event then fire during Debug, so Debug exit will enter the CT1 interrupt immediately ?
I guess an issue is the NEXT addition, done to setup the following interrupt, is then behind - how much code is needed to make that debug-safe ? A few lines ?

cgracey · 2018-03-16 23:25

jmg wrote: »

cgracey wrote: »

One kind of kludge approach could be a selectable configuration, where you agree to put your CT targets and adder values into certain cog registers. Then, the debugger could update the targets to CT-plus-adders on the way out..

or, the debugger just needs to know where they are, like a special type of Watch Var. That seems tolerable, to tell the debugger which vars to CT-Adjust on step/exit ?

The adder values might be computed dynamically, though, not lending themselves to a static debug-patching approach.

cgracey · 2018-03-16 23:28

jmg wrote: »

cgracey wrote: »

It requires the user's program to set the next target. It doesn't advance automatically by some value each time. That's why it gets lost.

The DOCs say
Set CT1 event to trigger on CT = D + S. Adds S into D.

Doesn't that CT1 event then fire during Debug, so Debug exit will enter the CT1 interrupt immediately ?
I guess an issue is the NEXT addition, done to setup the following interrupt, is then behind - how much code is needed to make that debug-safe ? A few lines ?

Yes, the NEXT addition would likely be behind. I don't know what to do about that. Who knows how they are computing their CT targets? There's no way to 'solve' that 'problem' without constraining their usage.

In some cases, the value of the debug interrupt will just be to STOP and dump state, as the timing interruption has already blown up their application.

cgracey · 2018-03-16 23:35

The other thing we can do is switch to count-down-and-reload timers. They, at least, wouldn't lose registration, but then you wouldn't be able to interrupt at any arbitrary CT value, which is an important capability.

TonyB_ · 2018-03-17 01:36

Is changing P2 functionality at this very late stage just so that debugging might be a little easier a good thing to do?

I think there's a good thing to do that's not being done!

cgracey · 2018-03-17 02:06

TonyB_ wrote: »

Is changing P2 functionality at this very late stage just so that debugging might be a little easier a good thing to do?

I think there's a good thing to do that's not being done!

I know. I'm quite over a barrel on all this debugging stuff, so I still haven't investigated the XBYTE change.

Yanomani · 2018-03-17 02:42

Every state machine has its particular constraints, and timing, perhaps, is one of the most stringent.

Thus, the best debugger should depend on the least-interfering behavioral observation, when possible.

When you can't grow it (the state machine) beyond a certain limit, perhaps the solution would be to have the option to run some observer, tracking it, in parallel.

Sometimes, a COG will need to rely on a LUT-sharing scheme, relegating to the code that runs inside the paired COG the duty of broadcasting the innards of its sausage-like states.

Perhaps, when paired, one of the COGs could fully forward its debug interrupts to its (now) twin, using the least resource-consuming means to share its internals, thus avoiding disruption of timing-sensitive procedures in their own routines.

Only a thought.

Yanomani · 2018-03-17 03:03

Pushing the limits; at the end of the day, having all of your most powerful resources consumed by an application, paves the way towards the need of some heavy-loaded, resource prone JTAG debug port.

Or at least some blazing-fast serdes channels, relying on LVDS comms.

Both will consume silicon area, power and extra pins.

Both seem like good candidates to fall under the "Banker's no!" statement.

jmg · 2018-03-17 03:11

cgracey wrote: »

Yes, the NEXT addition would likely be behind. I don't know what to do about that. Who knows how they are computing their CT targets? There's no way to 'solve' that 'problem' without constraining their usage.

If you can test for MSB(SetPoint-CounterTimer), that seems to cover wrap cases, and is valid for Debug delays of up to 50% of the 53s CT frame.

Does that mean a subtraction, instead of a compare ?

That also solves the general hang case when another interrupt prevents the next-calc quite making it, and will produce a series of rapid catch-up interrupts, until the added value is ahead of CT.
Ticks are not lost, but they can bunch if disturbed greatly. Seems preferable to a brutal 53s hang ?
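A literal reading of the MSB(SetPoint-CounterTimer) test can be sketched in Python as below. This is an assumption-laden model, not the hardware: `is_past` is a hypothetical name, and how the exact-equality clock is treated (under this literal reading the MSB is still clear when CT equals the setpoint) is left open in the post.

```python
MASK32 = 0xFFFFFFFF


def is_past(ct, setpoint):
    """MSB of the 32-bit difference (setpoint - ct); set once CT has passed setpoint."""
    return (((setpoint - ct) & MASK32) >> 31) == 1


late = is_past(5200, 5000)               # CT is 200 clocks past the setpoint
early = is_past(4900, 5000)              # setpoint is still ahead
wrapped = is_past(0x00000015, 0xFFFFFFF0)  # CT passed the setpoint across the 32-bit wrap
pending = is_past(0xFFFFFFF0, 0x00000010)  # setpoint lies just beyond the wrap, not reached yet
```

The subtraction keeps giving the right answer for lateness of up to 2^31 clocks, which is the "50% of the 53s CT frame" limit, about 26.8 seconds at 80 MHz.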

cgracey · 2018-03-17 03:32

jmg wrote: »

cgracey wrote: »

Yes, the NEXT addition would likely be behind. I don't know what to do about that. Who knows how they are computing their CT targets? There's no way to 'solve' that 'problem' without constraining their usage.

If you can test for MSB(SetPoint-CounterTimer), that seems to cover wrap cases, and is valid for Debug delays of up to 50% of the 53s CT frame.

Does that mean a subtraction, instead of a compare ?

That also solves the general hang case when another interrupt prevents the next-calc quite making it, and will produce a series of rapid catch-up interrupts, until the added value is ahead of CT.
Ticks are not lost, but they can bunch if disturbed greatly. Seems preferable to a brutal 53s hang ?

Yes, I was thinking the same thing.

We actually have a special CMPM instruction which puts the MSB of the comparison result into C, just for these kinds of tests.

All things considered, this is probably the best way to handle it. Instead of XOR comparators, I'll need real subtractors which are going to take more area and power.

Thanks for bringing this up, Jmg. I had dismissed the idea earlier, but it is the best way, after considering all the alternatives.

cgracey · 2018-03-17 04:42

I just got those CT1/CT2/CT3 events redone to do subtractive comparisons against CT with MSB checks, instead of the original equality tests. It's working fine. You can make the timer interrupts eat up all the cog bandwidth now, if you want. They are dogged in their persistence, never missing the bus and waiting 53 seconds for it to come around again. Timer interrupts are graceful under debug now. This is a huge improvement to debugging.

This was the right thing to do, Jmg. Thanks!
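The catch-up behavior described in this post can be modeled roughly as follows. It is a hedged Python sketch only: `catch_up`, its parameters, and the `limit` guard are hypothetical, and the handler is assumed to simply add a fixed period to its target each time it runs.

```python
MASK32 = 0xFFFFFFFF


def is_past(ct, target):
    """MSB of the 32-bit difference (target - ct); set once CT has passed target."""
    return (((target - ct) & MASK32) >> 31) == 1


def catch_up(ct, target, period, handler_clocks=0, limit=1000):
    """Count back-to-back timer interrupts until the target is ahead of CT again.

    `handler_clocks` is the assumed cost of one handler pass; `limit` guards
    against the case where the handler is slower than the period.
    """
    fires = 0
    while is_past(ct, target) and fires < limit:
        fires += 1
        target = (target + period) & MASK32   # handler schedules the next tick
        ct = (ct + handler_clocks) & MASK32   # time spent in the handler
    return fires, target


# A 1000-clock timer whose target was 5000 when debug took over; the cog
# resumes with CT at 8500.
fires, next_target = catch_up(ct=8500, target=5000, period=1000)
```

In this example the four overdue ticks (5000, 6000, 7000, 8000) are delivered in a bunch and the next target lands at 9000, so the schedule keeps its original phase. If `handler_clocks` is at least as large as `period`, the loop only stops at `limit`, which corresponds to the interrupts eating all the cog bandwidth.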

cgracey · 2018-03-17 08:45

I added C and Z flags for when you read breakpoint status using 'GETBRK D WC/WZ/WCZ' in debug ISRs.

On debug ISR entry, 'GETBRK D WC' can be used to find out if the cog has just been (re)started (C=0). This is critical because it lets us know whether we need to give it the general introductory treatment, like letting the host know that it's now running and getting some commands back, or whether we get to carry on with whatever breakpoint pattern is underway, like a pass counter, for example.

With this advent, it makes sense again to have not just one ISR program for all cogs to execute on their initial debug interrupt, but custom programs for each cog, with ALL programs checking first for startup, in which case they'll overlay the startup code into the low registers and execute it, instead of whatever they would have done, otherwise, with their custom code. So, while each cog can have its own debug ISR that gets swapped in and out via the 8-instruction ROM in $1F8..$1FF (where the special function registers are for D and S access), all debug ISRs must be sensitive to new startup. This just takes one instruction (GETBRK D WC) to discover and another few to load in the startup overlay and run it.

I've decided now to, again, shrink the debug ISR buffers back down. I'm going to 16 longs, instead of 64. This will allow for much faster pass counters and what not, but still provide for all other purposes with overlays. The register and ISR buffers are located as follows, where CCCC is !cogid:

%1111_1111_1CCC_C0xx_xxxx - register buffers
%1111_1111_1CCC_C1xx_xxxx - current ISR buffers

So, this will accommodate whole-app debugging, as well as fast, simple debug ISRs for whatever you want. 16 instructions is plenty for the kinds of quick checks we are likely to code by hand.
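The two bit patterns above can be turned into concrete hub addresses per cog. A minimal Python sketch, assuming a 20-bit hub address and a 4-bit cogid field as the patterns suggest; `debug_buffers` is a hypothetical helper name.

```python
def debug_buffers(cogid):
    """Return (register_buffer, isr_buffer) hub byte addresses for a cog.

    Follows %1111_1111_1CCC_C0xx_xxxx and %1111_1111_1CCC_C1xx_xxxx,
    where CCCC is the bitwise inverse of the cogid.
    """
    cccc = (~cogid) & 0xF
    base = 0xFF800 | (cccc << 7)   # bits 19..11 set, CCCC in bits 10..7
    return base, base | 0x40       # bit 6 selects register vs ISR buffer


BUFFER_BYTES = 1 << 6              # the six x bits: 64 bytes, i.e. 16 longs

cog0_regs, cog0_isr = debug_buffers(0)
```

For cog 0 this gives $FFF80 and $FFFC0, which agrees with the addresses used in the debug example later in the thread. Because CCCC is inverted, higher-numbered cogs sit lower in hub memory.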

The big debugger can also be used to take over and handle the whole affair, without the programmer needing to think about anything. I think this is the best of everything: fast, small, flexible, fun. And heavy-duty with memory protection, if desired.

cgracey · 2018-03-17 09:02

I've enhanced the new LOCKs.

LOCKTRY {#}D WC 'Attempt to become sole owner of LOCK D[3:0]. If you win (or had already won) the round-robin lottery, then C=1.

LOCKBYE {#}D 'Dispose of LOCK D[3:0] (in case you have it).
LOCKBYE D WC 'Dispose of LOCK D[3:0] (in case you have it), get current or last LOCK owner's cogid into D and current LOCK status into C.

So, by using 'LOCKBYE D WC', you can find out the disposition of a LOCK: Is it in use and which cog has it, or last had it?
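As a rough mental model of these two instructions, here is a Python sketch. Everything in it is an assumption drawn from the one-line descriptions above: the class and method names are hypothetical, the round-robin arbitration between simultaneous requests is not modeled, and the status is assumed to be reported after the release attempt.

```python
class LockModel:
    """Toy model of one hub LOCK as described for LOCKTRY / LOCKBYE."""

    def __init__(self):
        self.owner = None   # cogid of the current or last owner
        self.held = False   # current LOCK status

    def locktry(self, cogid):
        """Models 'LOCKTRY D WC': returns C, True if this cog now owns (or already owned) the lock."""
        if not self.held:
            self.owner, self.held = cogid, True
        return self.held and self.owner == cogid

    def lockbye(self, cogid):
        """Models 'LOCKBYE D WC': release if this cog holds it, then return (owner, status)."""
        if self.held and self.owner == cogid:
            self.held = False
        return self.owner, self.held


lock = LockModel()
first = lock.locktry(1)       # cog 1 wins the free lock
second = lock.locktry(2)      # cog 2 is refused while cog 1 holds it
probe = lock.lockbye(2)       # cog 2 has nothing to release, but learns who holds it
released = lock.lockbye(1)    # cog 1 gives up the lock
```

Used this way, a cog that does not hold the lock can call the release form purely as a query, which is the "disposition of a LOCK" use mentioned above.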

Roy Eltham · 2018-03-17 09:14

LOCKBYE is a terrible name for it. Please use one of the other suggested names for it.

All this debugging stuff is great, but I too am worried about what this does to the timeline. I thought the verilog had to be locked down because OnSemi was working out the synthesizing? Isn't this causing them to have to redo stuff?

cgracey · 2018-03-17 09:57

Roy Eltham wrote: »

LOCKBYE is a terrible name for it. Please use one of the other suggested names for it.

All this debugging stuff is great, but I too am worried about what this does to the timeline. I thought the verilog had to be locked down because OnSemi was working out the synthesizing? Isn't this causing them to have to redo stuff?

We have until April 2 for final drop. I want to get this out of the way ASAP and then focus on fixing critical paths involving I/O pins.

cgracey · 2018-03-17 09:58

Here is a debug example:

DAT

' Set up debug ISR for cog0

	org

	setq	#$0F		'install cog0 debug ISR
	rdlong	buff,#@isr0
	setq	#$0F
	wrlong	buff,##$FFFC0

	hubset	##$20000001	'enable debugging for cog0

	coginit	#0,#@cog0	'restart cog0 (this cog)

buff	res	16


' Cog0 debug ISR at $FFFC0
'
'	on debug interrupt, jmp #$1F8 executes
'	registers $000..$00F copied to hub $FFF80..$FFFBF
'	registers $000..$00F load from hub $FFFC0..$FFFFF
'	jmp #0 (isr0) executes
'	isr0 code runs, ending with jmp #$1FD
'	registers $000..$00F load from hub $FFF80..$FFFBF
'	reti0 executes, returns to cog0 code

	org

isr0	drvnot	#40		'toggle led

	waitx	##20_000_000/4	'1/4 second delay

'	brk	#$40		'enable async breakpoint
'	brk	#$20		'enable break instruction
'	brk	##bp<<8 + $10	'enable address breakpoint
	brk	#$0F		'enable single-stepping in int3/int2/int1/main

	jmp	#$1FD		'restore $000..$00F, reti0

'
'**********
'*  cog0  *
'**********
'
	org

cog0	drvnot	#32		'leds on
	drvnot	#33
	drvnot	#34
	drvnot	#35
	drvnot	#36
	drvnot	#37
	drvnot	#38
bp	drvnot	#39		'breakpoint address

	brk	#0		'break instruction

	drvnot	#32		'leds off
	drvnot	#33
	drvnot	#34
	drvnot	#35
	drvnot	#36
	drvnot	#37
	drvnot	#38
	drvnot	#39

	jmp	#cog0		'loop

This single steps the cog0 program at 4 instructions per second. You can see the activity on the Prop123's green LEDs.

cgracey · 2018-03-17 10:38

Roy Eltham wrote: »

LOCKBYE is a terrible name for it. Please use one of the other suggested names for it.

All this debugging stuff is great, but I too am worried about what this does to the timeline. I thought the verilog had to be locked down because OnSemi was working out the synthesizing? Isn't this causing them to have to redo stuff?

Ah, no, it's not causing them to redo stuff. They are running many trial compilations to get the scripts optimized. After that, we can run the final Verilog code through it. That's not much change by their metrics.

David Betz · 2018-03-17 11:19

cgracey wrote: »

Roy Eltham wrote: »

LOCKBYE is a terrible name for it. Please use one of the other suggested names for it.

All this debugging stuff is great, but I too am worried about what this does to the timeline. I thought the verilog had to be locked down because OnSemi was working out the synthesizing? Isn't this causing them to have to redo stuff?

Ah, no, it's not causing them to redo stuff. They are running many trial compilations to get the scripts optimized. After that, we can run the final Verilog code through it. That's not much change by their metrics.

Is there any chance that all of these debugging changes might break something in non-debugging execution? It seems there will be essentially no time to even try the new Verilog before it gets sent to synthesis. I would have thought you'd want the Verilog locked down and extensive testing on the FPGA done long before releasing it for synthesis.

Cluso99 · 2018-03-17 12:46

Chip,
You are adding in 8 longs of ROM overlaying the special registers. It doesn't make sense to add silicon just so the debug routine doesn't take cog space. We have LUT and HUBEXEC so to take 8 longs for debug (or more if we later come up with better ways) doesn't really matter like it did on P1. To lock it as ROM likewise doesn't make sense.

I could argue that it would be better for this to be a resource available normally to the cog if not running debug. I could also argue that this bit of silicon would be better making the stack a bit longer. But we don't need more changes because everything comes with risk. Please keep it simple.

We have a BRK Interrupt instruction and that's all we need. No one knows what program the user will be running, and hence what resources will be used and not used, what will interfere with debugging, etc. We have lots of possibilities with just BRK. We can pass debug info to another cog via hub, via the lut dual porting to adjacent cog, or via I/O, and perhaps even via SmartPins.

In the same way, let's not tie down Hub memory for debugging. We should be able to put code anywhere.

One thing that concerns me is the 16KB of Hub being mapped to the top of Hub ram. Previously it was to be dual mapped. ie for 512KB, the 16KB at 496-512KB would also appear at 1008-1024KB too. That gave us the ability to have a whole contiguous block of 512KB (or 256KB etc). Removing 16KB from the block disrupts the contiguous block concept. We also have a disruption at the bottom end where cog and lut addresses interfere with the mapping. If you cannot dual map the 16KB block, could you simply add an extra 4/8/16KB block at the end? Alternately, why not just ignore the top address bit(s) and therefore map the 512KB into both lower and upper 512KB spaces? Same if 256KB - map it 4x by ignoring 2 top address bits?
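The "ignore the top address bit(s)" alternative amounts to masking the address down to the installed RAM size. A minimal Python sketch, assuming a 1 MB (20-bit) hub address space and power-of-two RAM sizes; `physical_address` is a hypothetical name.

```python
HUB_SPACE = 1 << 20   # 1 MB hub address space


def physical_address(addr, ram_bytes):
    """Fold a hub address onto `ram_bytes` of RAM by ignoring the upper address bits."""
    assert ram_bytes & (ram_bytes - 1) == 0, "RAM size must be a power of two"
    return (addr % HUB_SPACE) & (ram_bytes - 1)


# With 512 KB fitted, the top of the 1 MB space aliases the top of the RAM.
top_512k = physical_address(0xFFFC0, 512 * 1024)

# With 256 KB fitted, the same RAM appears four times across the space.
top_256k = physical_address(0xFFFC0, 256 * 1024)
```

Under this scheme a fixed address such as $FFFC0 always reaches the last 64 bytes of whatever RAM is present, while the low addresses still form one contiguous block.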

One last thought. The main reason for needing the 16KB at top of Hub space is for the Interrupt vectors. Could these be relocatable as a block per cog by writing a SETQ3 instruction to store the block address bits? Since Interrupts have to be setup by software, setting an additional SETQ register for the Interrupt Table Block Address would be a simple addition, and be far more flexible. This would negate the requirement for having 16KB at the top of Hub ram altogether. Just copy the boot rom into the bottom 16KB of Hub. If we want it elsewhere, we can move it by software. Then all P2 versions would boot the same.

Roy Eltham · 2018-03-17 16:38

Chip,
Okay, it's good to know this isn't resetting things with OnSemi, but as Cluso99 says, don't we want some significant time testing things? Hopefully you can get everything finalized with at least some time for testing after.

Also, is April 2nd when the ROM has to be final also?

jmg · 2018-03-17 20:37

Cluso99 wrote: »

You are adding in 8 longs of ROM overlaying the special registers. It doesn't make sense to add silicon just so the debug routine doesn't take cog space. We have LUT and HUBEXEC so to take 8 longs for debug (or more if we later come up with better ways) doesn't really matter like it did on P1. To lock it as ROM likewise doesn't make sense.

I'm not following exactly. The 8 longs are a debug keyhole, and from what Chip has said, they mean the Streamer is not disturbed.
Debug design should always strive to have least disturbance.
I suggested RAM for those 8 longs, but ROM is ok, as they are quite stable, and I think that is quite a bit smaller. It may come down to OnSemi's smallest memory compiler size, and how much area cost that has.
Also, given the time lines, I think compiled ROM can arrive later than hand placed memory areas...
Registers for RAM may be another option, not sure what the silicon logic size of that is ?
A RAM keyhole does require another load step.

Cluso99 wrote: »

I could argue that it would be better for this to be a resource available normally to the cog if not running debug. I could also argue that this bit of silicon would be better making the stack a bit longer. But we don't need more changes because everything comes with risk. Please keep it simple.

Valid point about the stack, I don't think I got an answer to the question of can you Debug a Stack-Max COG ? As it is a CALL, it looks to consume one stack level itself ?
With interrupts and debug and HLL, that stack really does look small....

Cluso99 wrote: »

In the same way, let's not tie down Hub memory for debugging. We should be able to put code anywhere.

It was already tied down for boot loader use ?

Cluso99 wrote: »

One thing that concerns me is the 16KB of Hub being mapped to the top of Hub ram. Previously it was to be dual mapped. ie for 512KB, the 16KB at 496-512KB would also appear at 1008-1024KB too. That gave us the ability to have a whole contiguous block of 512KB (or 256KB etc). Removing 16KB from the block disrupts the contiguous block concept. We also have a disruption at the bottom end where cog and lut addresses interfere with the mapping. If you cannot dual map the 16KB block, could you simply add an extra 4/8/16KB block at the end? Alternately, why not just ignore the top address bit(s) and therefore map the 512KB into both lower and upper 512KB spaces? Same if 256KB - map it 4x by ignoring 2 top address bits?

'Adding on the end' was suggested before, but that requires new, skewed size memory compile, and it needs another address bit.
Allowing the area to dual map seems like an ok thing to do, most programmers should grasp that - you just recommend to use the top copy, as the lower ones may change on 1MB variants.

Cluso99 wrote: »

One last thought. The main reason for needing the 16KB at top of Hub space is for the Interrupt vectors. Could these be relocatable as a block per cog by writing a SETQ3 instruction to store the block address bits? Since Interrupts have to be setup by software, setting an additional SETQ register for the Interrupt Table Block Address would be a simple addition, and be far more flexible. This would negate the requirement for having 16KB at the top of Hub ram altogether. Just copy the boot rom into the bottom 16KB of Hub. If we want it elsewhere, we can move it by software. Then all P2 versions would boot the same.

That's more logic than fixed vectors, but the idea of all P2 versions booting the same has appeal.

jmg · 2018-03-17 20:40

cgracey wrote: »

I added C and Z flags for when you read breakpoint status using 'GETBRK D WC/WZ/WCZ' in debug ISRs.

On debug ISR entry, 'GETBRK D WC' can be used to find out if the cog has just been (re)started (C=0).

What does Z bit encode ?

cgracey wrote: »

I've decided now to, again, shrink the debug ISR buffers back down. I'm going to 16 longs, instead of 64. This will allow for much faster pass counters and what not, but still provide for all other purposes with overlays. The register and ISR buffers are located as follows, where CCCC is !cogid:

%1111_1111_1CCC_C0xx_xxxx - register buffers
%1111_1111_1CCC_C1xx_xxxx - current ISR buffers

So, this will accommodate whole-app debugging, as well as fast, simple debug ISRs for whatever you want. 16 instructions is plenty for the kinds of quick checks we are likely to code by hand.

Seems OK, but you would want to have real Debug engines running in the wild, before locking down the actual buffer sizes ? Choices I think are 16,32,64 ? ( not 48 )
