CNT extension to 64-bit

evanh · 2018-11-14 23:02

cgracey wrote: »

Wait! without interrupt shielding, you can't get a reliable count.

Needs a holding register.

cgracey · 2018-11-14 23:05

It just compiled. Now, I'll test it to make sure I can get the following readings by waiting one more clock in each case:

$0000_0000_FFFF_FFFE
$0000_0000_FFFF_FFFF
$0000_0001_0000_0000
$0000_0001_0000_0001

This added only 80 LE's to the 2-cog compile.

evanh · 2018-11-14 23:07

Chip,
I'm of the mind that this is really something easily managed by software. GETCT by itself isn't very useful. Even long event jitter can be eliminated by smartly re-arming consecutive events.

cgracey · 2018-11-14 23:19

evanh wrote: »

Chip,
I'm of the mind that this is really something easily managed by software. GETCT by itself isn't very useful. Even long event jitter can be eliminated by smartly re-arming consecutive events.

I agree, but there is value in having a 64-bit elapsed-time counter that no cog has to maintain.

Peter Jakacki · 2018-11-14 23:23

Chip, consecutive reads instead of a holding register are fine too, but of course you have to hold off interrupts during the sequence, which may be of consequence. A holding register just needs a variant of GETCT that, as suggested, perhaps uses WZ to select the holding register. I'm not sure of the usefulness of a WZ or even WCZ by itself, which is why I suggested this one.

Doing this simple thing in hardware means we don't have to worry about interrupts if we need more than a 32-bit count. If it's simple and useful, DO IT.

I was only interested in a simple full 64-bits for a reference count but if anyone needs more than GETCT then why not put forward some examples of how you would use it then.

BTW, we would never ever use the full 64-bits so 48-bits is all that is really required and as Cluso mentioned, just reading the top or bottom 32-bits of that 48-bits is quite practical and useful. No holding register or interrupt holdoff required.

Cluso99 · 2018-11-14 23:33

1. With 2 instructions (can use existing GETCT instruction as CZI are not used) if you read the high first, that can disable Interrupts for one instruction. If you only need the lower then all is fine with no penalty.

2. With a smaller, say 48 bits, this provides advantages to do higher granularity simply. I see this as a more useful solution. For the rarer case where full granularity is required, then it's solvable by software. A 64 bit doesn't give this option without further software fiddling

evanh · 2018-11-14 23:35

Sorry to be a pain, Chip, but I've realised the holding register isn't such a great idea. It actually has a pitfall that might be better avoided:
There is a good chance of GETCT being used in various ISRs; with this comes possible dual use, which will corrupt data for the non-ISR code.

Go back to your first approach.

cgracey · 2018-11-14 23:43

Peter Jakacki wrote: »

Chip, consecutive reads instead of a holding register are fine too, but of course you have to hold off interrupts during the sequence, which may be of consequence. A holding register just needs a variant of GETCT that, as suggested, perhaps uses WZ to select the holding register. I'm not sure of the usefulness of a WZ or even WCZ by itself, which is why I suggested this one.

Doing this simple thing in hardware means we don't have to worry about interrupts if we need more than a 32-bit count. If it's simple and useful, DO IT.

I was only interested in a simple full 64-bits for a reference count but if anyone needs more than GETCT then why not put forward some examples of how you would use it then.

BTW, we would never ever use the full 64-bits so 48-bits is all that is really required and as Cluso mentioned, just reading the top or bottom 32-bits of that 48-bits is quite practical and useful. No holding register or interrupt holdoff required.

Interrupts aren't allowed on GETCT now, just like they aren't allowed on SETQ. There's no problem.

By doing a double-GETCT, you will always get a clean snapshot of the 64-bit counter, whether in main code or in an interrupt service routine.

TonyB_ · 2018-11-14 23:47

Moved to correct thread.

cgracey wrote: »

TonyB_ wrote: »

cgracey wrote: »

TonyB_ wrote: »

Question:

Does the eggbeater use the low bits of CT for its slice addresses?

No, but there is a fixed relationship between the two. They both start cycling from reset.

Thanks, Chip. So we could deduce the phase difference and it will never change from one reset to the next?

Would a 64-bit CT require another 32-bit bus from hub to cog? If so, I have an idea.

Yes, there's another 32-bit bus involved. We can't mux high and low longs, though, because the timer events are still looking at the lower long in the background. What was your idea?

It might not help but the idea is that the hub sends 32 bits of count data CTx, where CTx[0]=CT[0] always and CTx[31:1]=CT[31:1]/CT[62:32] when CT[0]=0/1. The cogs do the demuxing. We lose CT[63] but nobody will notice!
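
A minimal Python sketch of the multiplexing idea above, assuming a toy model in which the hub sends one CTx word per clock and the cog latches the two halves. The names `mux_ct`, `CogDemux`, `find_stale_counts` and the reduced width are illustrative assumptions, not anything from the P2 design. In this model the rebuilt count is stale for one clock right after the lower long rolls over, which suggests where some extra handling might be needed.

```python
WIDTH = 4  # bits per "long" in this toy model (32 on the real counter)
FIELD_MASK = (1 << (WIDTH - 1)) - 1


def mux_ct(ct):
    """Build CTx: bit 0 is CT[0]; the upper bits carry CT[WIDTH-1:1]
    when CT[0]=0 and CT[2*WIDTH-2:WIDTH] when CT[0]=1."""
    b0 = ct & 1
    if b0 == 0:
        field = (ct >> 1) & FIELD_MASK
    else:
        field = (ct >> WIDTH) & FIELD_MASK
    return (field << 1) | b0


class CogDemux:
    """Hypothetical cog-side latches that rebuild the count from CTx words."""

    def __init__(self):
        self.low_field = 0
        self.high_field = 0

    def feed(self, ctx):
        b0 = ctx & 1
        field = ctx >> 1
        if b0:
            self.high_field = field   # odd clock: upper bits are on the bus
        else:
            self.low_field = field    # even clock: lower bits are on the bus
        return (self.high_field << WIDTH) | (self.low_field << 1) | b0


def find_stale_counts(limit):
    """Return every count in range(limit) that the cog rebuilds incorrectly."""
    demux = CogDemux()
    return [ct for ct in range(limit) if demux.feed(mux_ct(ct)) != ct]
```

Calling `find_stale_counts(128)` lists only the counts where the lower half has just wrapped to zero; every other clock is rebuilt exactly.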

cgracey · 2018-11-14 23:48

It checks out fine.

Here's the code that I used to get snapshots around the 32-bit rollover point. The "+0" adds can be changed to "+1" to check for $0000_0001_0000_0000, instead of $0000_0000_FFFF_FFFF.

dat		org

		hubset	#$FF			'select 80MHz on FPGA

.msb		getct	lo			'wait for ct msb
		tjns	lo,#.msb

		addct1	x,#0			'set ct target near rollover

		waitct1				'wait for target

		getct	lo			'capture lower ct
		getct	hi			'capture upper ct

		cmp	lo,##$FFFF_FFFF+0 wz	'check 64-bit ct value
	if_z	cmp	hi,##$0000_0000+0 wz

		drvz	#0			'good on p0
		drvh	#1			'done on p1

		jmp	#$


x		long	$FFFF_FFFB+0		'$FFFF_FFFB gets to $0000_0000_FFFF_FFFF

lo		res	1
hi		res	1

cgracey · 2018-11-14 23:58

TonyB_ wrote: »

Moved to correct thread.

cgracey wrote: »

TonyB_ wrote: »

cgracey wrote: »

TonyB_ wrote: »

Question:

Does the eggbeater use the low bits of CT for its slice addresses?

No, but there is a fixed relationship between the two. They both start cycling from reset.

Thanks, Chip. So we could deduce the phase difference and it will never change from one reset to the next?

Would a 64-bit CT require another 32-bit bus from hub to cog? If so, I have an idea.

Yes, there's another 32-bit bus involved. We can't mux high and low longs, though, because the timer events are still looking at the lower long in the background. What was your idea?

It might not help but the idea is that the hub sends 32 bits of count data CTx, where CTx[0]=CT[0] always and CTx[31:1]=CT[31:1]/CT[62:32] when CT[0]=0/1. The cogs do the demuxing. We lose CT[63] but nobody will notice!

That's a neat idea. It would require registering the upper 31 bits of the lower long, though, to keep the timer events going. Makes me realize that a cog could maintain maybe just 6 LSBs of a continuously-running counter, while receiving the upper bits serially, along with a registration pulse. I think it's less logic to just brute-force it centrally from the hub, like is done now.

Rayman · 2018-11-15 00:00

This looks pretty good. Minimal impact. For some reason, I really like having a 64-bit clock...

Hopefully the C compiler will be able to do 64-bit unsigned int math...

jmg · 2018-11-15 00:06

cgracey wrote: »

evanh wrote: »

I'd have a second instruction because it costs nothing being a single operand variety, and also you get the freedom to use it at a later point along with no interrupt shielding.

Good point. I've been avoiding adding new instructions.

Wait! without interrupt shielding, you can't get a reliable count. I think it's maybe best the way it is, in that case.

I think evanh was meaning to read the upper 32b, at any time, via a second opcode. Not sure of the use cases where you only want upper checks ?

TonyB_ · 2018-11-15 00:07

cgracey wrote: »

TonyB_ wrote: »

Moved to correct thread.

cgracey wrote: »

TonyB_ wrote: »

cgracey wrote: »

TonyB_ wrote: »

Question:

Does the eggbeater use the low bits of CT for its slice addresses?

No, but there is a fixed relationship between the two. They both start cycling from reset.

Thanks, Chip. So we could deduce the phase difference and it will never change from one reset to the next?

Would a 64-bit CT require another 32-bit bus from hub to cog? If so, I have an idea.

Yes, there's another 32-bit bus involved. We can't mux high and low longs, though, because the timer events are still looking at the lower long in the background. What was your idea?

It might not help but the idea is that the hub sends 32 bits of count data CTx, where CTx[0]=CT[0] always and CTx[31:1]=CT[31:1]/CT[62:32] when CT[0]=0/1. The cogs do the demuxing. We lose CT[63] but nobody will notice!

That's a neat idea. It would require registering the upper 31 bits of the lower long, though, to keep the timer events going. Makes me realize that a cog could maintain maybe just 6 LSBs of a continuously-running counter, while receiving the upper bits serially, along with a registration pulse. I think it's less logic to just brute-force it centrally from the hub, like is done now.

Is a register really needed? Couldn't the timers look at CT[31:1] every other clock and delay one cycle if required?

TonyB_ · 2018-11-15 00:13

May need a register for GETCT as we don't want it taking 3 cycles. Anyway, another notion has been let loose.

cgracey · 2018-11-15 00:13

TonyB_ wrote: »

cgracey wrote: »

TonyB_ wrote: »

Moved to correct thread.

cgracey wrote: »

TonyB_ wrote: »

cgracey wrote: »

TonyB_ wrote: »

Question:

Does the eggbeater use the low bits of CT for its slice addresses?

No, but there is a fixed relationship between the two. They both start cycling from reset.

Thanks, Chip. So we could deduce the phase difference and it will never change from one reset to the next?

Would a 64-bit CT require another 32-bit bus from hub to cog? If so, I have an idea.

Yes, there's another 32-bit bus involved. We can't mux high and low longs, though, because the timer events are still looking at the lower long in the background. What was your idea?

It might not help but the idea is that the hub sends 32 bits of count data CTx, where CTx[0]=CT[0] always and CTx[31:1]=CT[31:1]/CT[62:32] when CT[0]=0/1. The cogs do the demuxing. We lose CT[63] but nobody will notice!

That's a neat idea. It would require registering the upper 31 bits of the lower long, though, to keep the timer events going. Makes me realize that a cog could maintain maybe just 6 LSBs of a continuously-running counter, while receiving the upper bits serially, along with a registration pulse. I think it's less logic to just brute-force it centrally from the hub, like is done now.

Is a register really needed? Couldn't the timers look at CT[31:1] every other clock and delay one cycle if required?

I suppose they could, but then you'd need two 32-bit comparators (with different LSBs) to know when the event was. That's a lot of logic.

jmg · 2018-11-15 00:15

cgracey wrote: »

In order to get time-aligned reads 2 clocks apart (GETCT takes 2 clocks), the upper long increments when the lower long is $0000_0001, not $FFFF_FFFF. This means that on reset, the counter must be initialized to $0000_0000_0000_0002 to avoid an early increment in the upper long. By the time user code starts running, the counter is already into the tens of thousands.

That all sounds ok.
An alternative is to use carry chains, which are faster in FPGA, but I'm not sure about ASIC compilers.

Terminal count (D-FF) is then the roll over from $FFFF_FFFF to $0000_0000, and it appears on the first clock, when LSB is 00.
Add one more D-FF delay to move that TC to allow for the 2 sysclk delay of GETCT twice.
Counter can be initialized to 0000, and because terminal count is registered, and only fires on overflow, there is no early increment in the upper long effect.
Not sure if that will be any smaller/faster in ASIC ?
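
One reading of the delayed terminal-count idea above can be sketched in Python. This is a cycle-level toy model under stated assumptions: the flop names `tc1` and `tc2`, the function names, and the reduced width are all hypothetical, and the real hardware may differ. It checks that a low read at clock t paired with a high read at clock t+2 gives a consistent count, starting from a reset value of zero.

```python
def simulate_delayed_carry(width, clocks):
    """Return a list of (lo, hi) register values, one entry per clock."""
    mask = (1 << width) - 1
    lo = hi = 0
    tc1 = tc2 = False  # registered terminal count and its one-clock delay
    trace = []
    for _ in range(clocks):
        trace.append((lo, hi))
        # All flops update together from the current values.
        next_lo = (lo + 1) & mask
        next_hi = (hi + 1) & mask if tc2 else hi
        next_tc1 = lo == mask   # set on the clock where lo wraps to zero
        next_tc2 = tc1          # one more flop of delay
        lo, hi, tc1, tc2 = next_lo, next_hi, next_tc1, next_tc2
    return trace


def double_read(trace, t, width):
    """Model two back-to-back reads: low half at t, high half at t + 2."""
    lo = trace[t][0]
    hi = trace[t + 2][1]
    return (hi << width) | lo
```

With `width=4` the rollover happens every 16 clocks, so a 100-clock trace crosses it several times without any early increment after reset.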

cgracey · 2018-11-15 00:24

jmg wrote: »

cgracey wrote: »

evanh wrote: »

I'd have a second instruction because it costs nothing being a single operand variety, and also you get the freedom to use it at a later point along with no interrupt shielding.

Good point. I've been avoiding adding new instructions.

Wait! without interrupt shielding, you can't get a reliable count. I think it's maybe best the way it is, in that case.

I think evanh was meaning to read the upper 32b, at any time, via a second opcode. Not sure of the use cases where you only want upper checks ?

If you want to just read the upper long of CT, you'll need to do two GETCT's. They could be to the same register.

I think, though, that won't be commonly done. Most of the time, you'll want the whole enchilada because you can divide it by, say, 250,000,000 to get seconds @250MHz:

		getct	lo			'get 64-bit count
		getct	hi

		setq	hi			'convert to seconds @250MHz
		qdiv	lo,##250_000_000
		getqx	seconds			'tops out at ~136 years of seconds

Now that's what I'm talkin' about.
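
The conversion above can be modeled in a few lines of Python. This is a sketch, assuming the divide takes the 64-bit value {hi, lo} and returns a 32-bit quotient; the function name `ct_to_seconds` is hypothetical, and raising an error when the quotient does not fit is a choice made for this model, since the thread does not say what the hardware returns in that case.

```python
CLOCK_HZ = 250_000_000  # clock frequency used in the example above


def ct_to_seconds(lo, hi, clock_hz=CLOCK_HZ):
    """Divide the 64-bit count {hi, lo} by clock_hz, as SETQ + QDIV would."""
    count = (hi << 32) | lo
    seconds = count // clock_hz
    if seconds >= 1 << 32:
        # Assumption for this model only: refuse results beyond 32 bits.
        raise OverflowError("quotient does not fit in 32 bits")
    return seconds


def seconds_to_years(seconds):
    """Convert seconds to years of 365.25 days."""
    return seconds / (365.25 * 24 * 3600)
```

The largest quotient that fits, 2**32 - 1 seconds, is reached when hi is just below the divisor, and `seconds_to_years` puts that at a little over 136 years.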

jmg · 2018-11-15 00:31

cgracey wrote: »

It checks out fine.

Did you also check interrupt hold-off ?

cgracey wrote: »

Interrupts aren't allowed on GETCT now, just like they aren't allowed on SETQ. There's no problem.
By doing a double-GETCT, you will always get a clean snapshot of the 64-bit counter, whether in main code or in an interrupt service routine.

Are they deferred only on the first GETCT, or do both GETCT defer interrupts ? (the second defer is not really needed)

TonyB_ · 2018-11-15 00:42

cgracey wrote: »

TonyB_ wrote: »

cgracey wrote: »

TonyB_ wrote: »

It might not help but the idea is that the hub sends 32 bits of count data CTx, where CTx[0]=CT[0] always and CTx[31:1]=CT[31:1]/CT[62:32] when CT[0]=0/1. The cogs do the demuxing. We lose CT[63] but nobody will notice!

That's a neat idea. It would require registering the upper 31 bits of the lower long, though, to keep the timer events going. Makes me realize that a cog could maintain maybe just 6 LSBs of a continuously-running counter, while receiving the upper bits serially, along with a registration pulse. I think it's less logic to just brute-force it centrally from the hub, like is done now.

Is a register really needed? Couldn't the timers look at CT[31:1] every other clock and delay one cycle if required?

I suppose they could, but then you'd need two 32-bit comparators (with different LSBs) to know when the event was. That's a lot of logic.

This might be all theoretical but it's a way of avoiding an extra 32-bit bus if adding one would be problematic. The cost is the loss of one bit that wouldn't change for 1000+ years @ 250MHz.

I think a single high-31-bit compare would work, with the low bit specifying immediate or delay by one. CT[31:1] won't change when CT[0] goes high and CT[62:32] are on the bus. We're in a strange 31+1 bit world now.

cgracey · 2018-11-15 00:44

jmg wrote: »

cgracey wrote: »

In order to get time-aligned reads 2 clocks apart (GETCT takes 2 clocks), the upper long increments when the lower long is $0000_0001, not $FFFF_FFFF. This means that on reset, the counter must be initialized to $0000_0000_0000_0002 to avoid an early increment in the upper long. By the time user code starts running, the counter is already into the tens of thousands.

That all sounds ok.
An alternative is to use carry chains, which are faster in FPGA, but I'm not sure about ASIC compilers.

Terminal count (D-FF) is then the roll over from $FFFF_FFFF to $0000_0000, and it appears on the first clock, when LSB is 00.
Add one more D-FF delay to move that TC to allow for the 2 sysclk delay of GETCT twice.
Counter can be initialized to 0000, and because terminal count is registered, and only fires on overflow, there is no early increment in the upper long effect.
Not sure if that will be any smaller/faster in ASIC ?

Good idea. Just delay the carry by two clocks. That would be one flop used as an enable to the upper long flops. That would be smaller than a 32'h0000_0001 detector. I'll do it that way. Wait. What I've got reads very clearly and this only appears once in the design. Here's what I've got:

// system counter

reg [31:0] ctlx, cthx;

`regscan (ctlx, 32'h2, 1'b1,	      ctlx + 1'b1)	// lower long of ct
`regscan (cthx, 32'h0, ctlx == 32'h1, cthx + 1'b1)	// upper long of ct, delayed by 2 clocks for 2nd-getct retrieval

wire cth_update	=	ctlx == 32'h2;

`regscan (ctl[3], 32'b0, |cog_ena[15:12], ctlx)
`regscan (ctl[2], 32'b0, |cog_ena[11:08], ctlx)
`regscan (ctl[1], 32'b0, |cog_ena[07:04], ctlx)
`regscan (ctl[0], 32'b0, |cog_ena[03:00], ctlx)

`regscan (cth[3], 32'b0, |cog_ena[15:12] && cth_update, cthx)
`regscan (cth[2], 32'b0, |cog_ena[11:08] && cth_update, cthx)
`regscan (cth[1], 32'b0, |cog_ena[07:04] && cth_update, cthx)
`regscan (cth[0], 32'b0, |cog_ena[03:00] && cth_update, cthx)

Each set of 4 cogs gets its own ctl and cth, in order to cut down wire delays.
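
To see why the reset value of 2 and the `ctlx == 1` detector line up, here is a cycle-level Python sketch of the Verilog above. It assumes each `regscan line is a flop with a reset value, an enable, and a next value, that the cog group is always enabled, and it uses a reduced width so the rollover is reached quickly. The function names are hypothetical.

```python
def simulate_ct(width, clocks):
    """Return (ctl, cth) for one cog group, one entry per clock."""
    mask = (1 << width) - 1
    ctlx, cthx = 2, 0   # reset values of the central counter halves
    ctl, cth = 0, 0     # registered copies sent to one group of cogs
    trace = []
    for _ in range(clocks):
        trace.append((ctl, cth))
        # All flops update together from the current values.
        next_ctlx = (ctlx + 1) & mask
        next_cthx = (cthx + 1) & mask if ctlx == 1 else cthx
        next_ctl = ctlx
        next_cth = cthx if ctlx == 2 else cth   # cth_update
        ctlx, cthx, ctl, cth = next_ctlx, next_cthx, next_ctl, next_cth
    return trace


def double_getct(trace, t, width):
    """Model GETCT lo at clock t followed by GETCT hi two clocks later."""
    lo = trace[t][0]
    hi = trace[t + 2][1]
    return (hi << width) | lo
```

In this model, once the copies have left reset, `double_getct` returns a value that advances by exactly one per clock, including across the rollover of the lower half.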

cgracey · 2018-11-15 00:51

TonyB_ wrote: »

cgracey wrote: »

TonyB_ wrote: »

cgracey wrote: »

TonyB_ wrote: »

It might not help but the idea is that the hub sends 32 bits of count data CTx, where CTx[0]=CT[0] always and CTx[31:1]=CT[31:1]/CT[62:32] when CT[0]=0/1. The cogs do the demuxing. We lose CT[63] but nobody will notice!

That's a neat idea. It would require registering the upper 31 bits of the lower long, though, to keep the timer events going. Makes me realize that a cog could maintain maybe just 6 LSBs of a continuously-running counter, while receiving the upper bits serially, along with a registration pulse. I think it's less logic to just brute-force it centrally from the hub, like is done now.

Is a register really needed? Couldn't the timers look at CT[31:1] every other clock and delay one cycle if required?

I suppose they could, but then you'd need two 32-bit comparators (with different LSBs) to know when the event was. That's a lot of logic.

This might be all theoretical but it's a way of avoiding an extra 32-bit bus if adding one would be problematic. The cost is the loss of one bit that wouldn't change for 1000+ years @ 250MHz.

I think a single high-31-bit compare would work, with the low bit specifying immediate or delay by one. CT[31:1] won't change when CT[0] goes high and CT[62:32] are on the bus. We're in a strange 31+1 bit world now.

Oh, I see how that could work. It's somewhat complicated, and inspecting the full counter at any time offset might need some patching. What we've got just now is simple and readily understandable. If it gets complicated, I can't remember later why it works.

cgracey · 2018-11-15 00:53

jmg wrote: »

cgracey wrote: »

It checks out fine.

Did you also check interrupt hold-off ?

cgracey wrote: »

Interrupts aren't allowed on GETCT now, just like they aren't allowed on SETQ. There's no problem.
By doing a double-GETCT, you will always get a clean snapshot of the 64-bit counter, whether in main code or in an interrupt service routine.

Are they deferred only on the first GETCT, or do both GETCT defer interrupts ? (the second defer is not really needed)

I started making it so that the 2nd GETCT would not shield interrupts, but then I figured it was more trouble than it was worth. More to explain, at least.

jmg · 2018-11-15 00:56

cgracey wrote: »

.... Makes me realize that a cog could maintain maybe just 6 LSBs of a continuously-running counter, while receiving the upper bits serially, along with a registration pulse. I think it's less logic to just brute-force it centrally from the hub, like is done now.

It is less logic, but is more routing, and more power dissipation capacitance...

TonyB_ wrote: »

This might be all theoretical but it's a way of avoiding an extra 32-bit bus if adding one would be problematic. The cost is the loss of one bit that wouldn't change for 1000+ years @ 250MHz.

I think a single high-31-bit compare would work, with the low bit specifying immediate or delay by one. CT[31:1] won't change when CT[0] goes high and CT[62:32] are on the bus. We're in a strange 31+1 bit world now.

If you are going down a MUX path, you do not need to lose any bits, and do not need to fully serialize either.

eg There are always at least 3 bits of eggbeater count available in all COGs, which should be timing sync with the lower 3 bits of CT.
That means you could send mux'd 64b actually as 30b+31b(+3b local), ie get the LSB from the eggbeater, but this would add a possible 1 sysclk delay to wait for the correct LH pair, but drops BUS width significantly.
To save some small dynamic energy, the system could emit the high 32b only when any COG has used GETCT and paused INT?

cgracey · 2018-11-15 01:01

jmg wrote: »

cgracey wrote: »

.... Makes me realize that a cog could maintain maybe just 6 LSBs of a continuously-running counter, while receiving the upper bits serially, along with a registration pulse. I think it's less logic to just brute-force it centrally from the hub, like is done now.

It is less logic, but is more routing, and more power dissipation capacitance...

So, how many wires, on average, are changing state in the 64-bit counter output on each clock cycle?

Yanomani · 2018-11-15 02:35

Not exactly the mathematical answer, but as a term of comparison, from EE Times:

"In addition to preventing intermediate states, Gray code counters consume only half the power of an equivalent binary counter and they generate correspondingly less noise. Actually, while the power and average noise difference between a Gray and a binary counter asymptotically approaches two, the peak noise difference is equal to the number of bits, since a Gray counter toggles only one bit at a time while a binary counter toggles all of its bits simultaneously two times over the course of a full-count cycle with fewer bits toggling proportionally more times."

https://eetimes.com/document.asp?doc_id=1278827
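
For the arithmetic side of the question, a short Python sketch can count the toggles directly. It assumes a plain binary up-counter in which bit k flips once every 2**k counts, and it brute-forces a small width to check the closed form; the function names are illustrative only.

```python
from fractions import Fraction


def popcount(x):
    """Number of set bits in x."""
    return bin(x).count("1")


def binary_toggles_per_cycle(bits):
    """Total bit flips of a binary counter over one full count cycle."""
    mask = (1 << bits) - 1
    return sum(popcount(i ^ ((i + 1) & mask)) for i in range(1 << bits))


def gray_toggles_per_cycle(bits):
    """Total bit flips of a Gray-coded counter over one full count cycle."""
    mask = (1 << bits) - 1

    def gray(b):
        return b ^ (b >> 1)

    return sum(popcount(gray(i) ^ gray((i + 1) & mask)) for i in range(1 << bits))


def average_binary_toggles(bits):
    """Closed form: sum of 1 / 2**k over all bit positions k."""
    return sum(Fraction(1, 1 << k) for k in range(bits))
```

For 64 bits the closed form gives 2 - 2**-63, so on average just under two wires change per clock, against exactly one for a Gray-coded counter.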

cgracey · 2018-11-15 02:42

Inside the P2, one more periodic banging of a garbage can lid will not be noticed.

The thing about Gray code is that it is painful to add numbers to.

Yanomani · 2018-11-15 02:51

But not a problem when it comes to sequential addressing and counting.

FIFO and Streamers would benefit immediately.

Egg-beater doesn't suffer, only a matter of doing the right decoding.

Only random things remain .... random, as they ever are.

Data is random by nature, addressing doesn't need to be.

Transforming from binary to gray is immediate, kind of a one-level xoring, excluding bit position 0.

Gray to binary is kind of a rippling thing, but, who needs to transform a raw address this fast, from gray to binary?
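
The two conversions described above, as a minimal Python sketch. The function names are illustrative. Binary to Gray is a single XOR with the value shifted right by one, so in LSB-0 numbering the top bit passes through unchanged; Gray to binary folds in every higher bit, which is the rippling part.

```python
def binary_to_gray(b):
    """One level of XOR: each bit is XORed with the bit above it."""
    return b ^ (b >> 1)


def gray_to_binary(g):
    """Ripple conversion: each output bit is the XOR of all Gray bits at or above it."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b
```

For example, `binary_to_gray(5)` gives 7 and `gray_to_binary(7)` gives 5 again.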

Cluso99 · 2018-11-15 03:04

Chip,
You seemed to miss this post...
It's

Cluso99 wrote: »

1. With 2 instructions (can use existing GETCT instruction as CZI are not used) if you read the high first, that can disable Interrupts for one instruction. If you only need the lower then all is fine with no penalty.

2. With a smaller, say 48 bits, this provides advantages to do higher granularity simply. I see this as a more useful solution. For the rarer case where full granularity is required, then it's solvable by software. A 64 bit doesn't give this option without further software fiddling

From 1. above: So, with 64bits, only if you read the hi

		getct	lo			' current normal use (no hold interrupts)
...
		getct	hi			' just read the hi bits (holds off interrupts until next instruction executed)
		xxxx	  			' 
...
		getct	hi			' read the hi bits (holds off interrupts until next instruction executed)
		getct	lo			' read the lo bits

		setq	hi			'convert to seconds @250MHz
		qdiv	lo,##250_000_000
		getqx	seconds			'tops out at ~136 years of seconds

The hi now needs to +1 early rather than late (eg ~$FFFF_FFFE)

Cluso99 · 2018-11-15 03:10

cgracey wrote: »
jmg wrote: »

cgracey wrote: »

evanh wrote: »

I'd have a second instruction because it costs nothing being a single operand variety, and also you get the freedom to use it at a later point along with no interrupt shielding.

Good point. I've been avoiding adding new instructions.

Wait! without interrupt shielding, you can't get a reliable count. I think it's maybe best the way it is, in that case.

I think evanh was meaning to read the upper 32b, at any time, via a second opcode. Not sure of the use cases where you only want upper checks ?

If you want to just read the upper long of CT, you'll need to do two GETCT's. They could be to the same register.

I think, though, that won't be commonly done. Most of the time, you'll want the whole enchilada because you can divide it by, say, 250,000,000 to get seconds @250MHz:

		getct	lo			'get 64-bit count
		getct	hi

		setq	hi			'convert to seconds @250MHz
		qdiv	lo,##250_000_000
		getqx	seconds			'tops out at ~136 years of seconds

Now that's what I'm talkin' about.

BTW I make that ~2,338 years !!!!!
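
The two figures can be reconciled with a quick Python calculation, assuming a 250 MHz clock and years of 365.25 days; the helper name `years_to_reach` is hypothetical. The ~2,338 years is how long the full 64-bit counter takes to wrap, while ~136 years is how long it takes for the seconds value to exceed 32 bits.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_to_reach(ticks, clock_hz=250_000_000):
    """Years needed for a counter clocked at clock_hz to reach `ticks`."""
    return ticks / clock_hz / SECONDS_PER_YEAR


counter_wrap_years = years_to_reach(2**64)                  # full 64-bit count
top_bit_years = years_to_reach(2**63)                       # first change of bit 63
seconds_limit_years = years_to_reach(2**32 * 250_000_000)   # 2**32 whole seconds
```

The middle figure, a little under 1,170 years, is the time before bit 63 changes for the first time.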
