Chip,
I'm of the mind that this is really something easily managed by software. GETCT by itself isn't very useful. Even long event jitter can be eliminated by smartly re-arming consecutive events.
I agree, but there is value in having a 64-bit elapsed-time counter that no cog has to maintain.
Chip, consecutive reads instead of a holding register are fine too, but of course you have to hold off interrupts during the sequence, which may be of consequence. A holding register just needs a variant of GETCT that, as suggested, perhaps uses WZ to select the holding register. I'm not sure of the usefulness of WZ or even WCZ by itself, which is why I suggested this one.
Doing this simple thing in hardware means we don't have to worry about interrupts if we need more than a 32-bit count. If it's simple and useful, DO IT.
I was only interested in a simple full 64-bits for a reference count but if anyone needs more than GETCT then why not put forward some examples of how you would use it then.
BTW, we would never ever use the full 64-bits so 48-bits is all that is really required and as Cluso mentioned, just reading the top or bottom 32-bits of that 48-bits is quite practical and useful. No holding register or interrupt holdoff required.
1. With 2 instructions (can use existing GETCT instruction as CZI are not used) if you read the high first, that can disable Interrupts for one instruction. If you only need the lower then all is fine with no penalty.
2. With a smaller counter, say 48 bits, this provides advantages to do higher granularity simply. I see this as a more useful solution. For the rarer case where full granularity is required, it's solvable by software. A 64-bit counter doesn't give this option without further software fiddling.
Sorry to be a pain, Chip, but I've realised the holding register isn't such a great idea. It actually has a pitfall that might be better avoided:
There is a good chance of GETCT being used in various ISRs, and with that comes possible dual use of the holding register, which would corrupt data for the non-ISR code.
Interrupts aren't allowed on GETCT now, just like they aren't allowed on SETQ. There's no problem.
By doing a double-GETCT, you will always get a clean snapshot of the 64-bit counter, whether in main code or in an interrupt service routine.
Does the eggbeater use the low bits of CT for its slice addresses?
No, but there is a fixed relationship between the two. They both start cycling from reset.
Thanks, Chip. So we could deduce the phase difference and it will never change from one reset to the next?
Would a 64-bit CT require another 32-bit bus from hub to cog? If so, I have an idea.
Yes, there's another 32-bit bus involved. We can't mux high and low longs, though, because the timer events are still looking at the lower long in the background. What was your idea?
It might not help but the idea is that the hub sends 32 bits of count data CTx, where CTx[0]=CT[0] always and CTx[31:1]=CT[31:1]/CT[62:32] when CT[0]=0/1. The cogs do the demuxing. We lose CT[63] but nobody will notice!
Here's the code that I used to get snapshots around the 32-bit rollover point. The "+0" adds can be changed to "+1" to check for $0000_0001_0000_0000, instead of $0000_0000_FFFF_FFFF.
dat             org
                hubset  #$FF                    'select 80MHz on FPGA
.msb            getct   lo                      'wait for ct msb
                tjns    lo,#.msb
                addct1  x,#0                    'set ct target near rollover
                waitct1                         'wait for target
                getct   lo                      'capture lower ct
                getct   hi                      'capture upper ct
                cmp     lo,##$FFFF_FFFF+0  wz   'check 64-bit ct value
        if_z    cmp     hi,##$0000_0000+0  wz
                drvz    #0                      'good on p0
                drvh    #1                      'done on p1
                jmp     #$
x               long    $FFFF_FFFB+0            '$FFFF_FFFB gets to $0000_0000_FFFF_FFFF
lo              res     1
hi              res     1
That's a neat idea. It would require registering the upper 31 bits of the lower long, though, to keep the timer events going. Makes me realize that a cog could maintain maybe just 6 LSBs of a continuously-running counter, while receiving the upper bits serially, along with a registration pulse. I think it's less logic to just brute-force it centrally from the hub, like is done now.
I'd have a second instruction because it costs nothing being a single operand variety, and also you get the freedom to use it at a later point along with no interrupt shielding.
Good point. I've been avoiding adding new instructions.
Wait! without interrupt shielding, you can't get a reliable count. I think it's maybe best the way it is, in that case.
I think evanh was meaning to read the upper 32b, at any time, via a second opcode. Not sure of the use cases where you only want upper checks ?
Is a register really needed? Couldn't the timers look at CT[31:1] every other clock and delay one cycle if required?
I suppose they could, but then you'd need two 32-bit comparators (with different LSBs) to know when the event was. That's a lot of logic.
In order to get time-aligned reads 2 clocks apart (GETCT takes 2 clocks), the upper long increments when the lower long is $0000_0001, not $FFFF_FFFF. This means that on reset, the counter must be initialized to $0000_0000_0000_0002 to avoid an early increment in the upper long. By the time user code starts running, the counter is already into the 10's of thousands.
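One rough way to check the timing described above is to model it in software. The sketch below is a Python model, not the actual Verilog. The names `run_counter` and `double_getct` are made up for illustration, and the lower word is shrunk to 8 bits so the rollover is reached quickly. It assumes the first GETCT samples the lower word and the second GETCT samples the upper word two clocks later.

```python
def run_counter(clocks, width=8, reset_lo=2):
    """Return per-clock (lo, hi, true_count) for a counter whose upper
    word increments on the clock after the lower word equals 1."""
    mask = (1 << width) - 1
    lo, hi, true_count = reset_lo, 0, reset_lo
    trace = []
    for _ in range(clocks):
        trace.append((lo, hi, true_count))
        if lo == 1:              # late carry: visible two clocks after rollover
            hi += 1
        lo = (lo + 1) & mask
        true_count += 1
    return trace


def double_getct(trace, t, width=8):
    """Model GETCT lo at clock t followed by GETCT hi at clock t+2."""
    lo = trace[t][0]
    hi = trace[t + 2][1]
    return (hi << width) | lo


trace = run_counter(1000)
# Every snapshot matches the true count at the moment of the first read.
assert all(double_getct(trace, t) == trace[t][2] for t in range(len(trace) - 2))

# Resetting to 0 instead of 2 lets the lower word pass through 1 right
# after reset, so the upper word increments early and snapshots are wrong.
bad = run_counter(1000, reset_lo=0)
assert double_getct(bad, 2) != bad[2][2]
```

In this model the reset value of 2 simply skips the one lower-word value (1) that would otherwise trigger a carry before any real rollover has happened.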
That all sounds ok.
An alternative is to use carry chains, which are faster in FPGA, but I'm not sure about ASIC compilers.
Terminal count (D-FF) is then the roll over from $FFFF_FFFF to $0000_0000, and it appears on the first clock, when LSB is 00.
Add one more D-FF delay to move that TC to allow for the 2 sysclk delay of GETCT twice.
The counter can be initialized to 0000, and because the terminal count is registered and only fires on overflow, there is no early-increment effect in the upper long.
Not sure if that will be any smaller/faster in ASIC?
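The delayed-carry alternative can be sketched the same way. This is only one reading of the description above, written as a Python model with hypothetical names (`run_delayed_carry`, `tc`, `tc_delayed`) and an 8-bit lower word for speed: one flop registers the carry out of the lower word, a second flop delays it by one more clock, and the delayed flop enables the upper-word increment.

```python
def run_delayed_carry(clocks, width=8):
    """Return per-clock (lo, hi, true_count) with a registered, delayed carry."""
    mask = (1 << width) - 1
    lo = hi = true_count = 0     # counter can reset to zero in this scheme
    tc = tc_delayed = 0          # two D-FFs, both cleared at reset
    trace = []
    for _ in range(clocks):
        trace.append((lo, hi, true_count))
        # All registers update together on the clock edge, so compute the
        # next values from the current ones before assigning.
        next_hi = hi + 1 if tc_delayed else hi
        next_tc_delayed = tc
        next_tc = 1 if lo == mask else 0   # carry out of the lower word
        lo = (lo + 1) & mask
        hi, tc, tc_delayed = next_hi, next_tc, next_tc_delayed
        true_count += 1
    return trace


trace = run_delayed_carry(1000)
# lo read at clock t, hi read at clock t+2, as with two back-to-back GETCTs.
assert all(
    ((trace[t + 2][1] << 8) | trace[t][0]) == trace[t][2]
    for t in range(len(trace) - 2)
)
```

Under these assumptions the upper word changes on the same clock as in the "increment when the lower long is 1" scheme, but nothing fires after reset because both carry flops start cleared.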
If you want to just read the upper long of CT, you'll need to do two GETCT's. They could be to the same register.
I think, though, that won't be commonly done. Most of the time, you'll want the whole enchilada because you can divide it by, say, 250,000,000 to get seconds @250MHz:
        getct   lo                      'get 64-bit count
        getct   hi
        setq    hi                      'convert to seconds @250MHz
        qdiv    lo,##250_000_000
        getqx   seconds                 'tops out at ~136 years of seconds
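The arithmetic behind that snippet can be checked with a few lines of Python. This is a sketch, with `ct_to_seconds` a made-up helper name; it assumes the quotient is limited to 32 bits, which is what the ~136-year figure implies, and uses a 365.25-day year.

```python
CLOCK_HZ = 250_000_000


def ct_to_seconds(hi, lo, clock_hz=CLOCK_HZ):
    """Divide the 64-bit count {hi, lo} by the clock rate, as SETQ/QDIV would."""
    count = (hi << 32) | lo
    seconds = count // clock_hz
    if seconds > 0xFFFF_FFFF:
        # Assumed limit: the result register is one 32-bit long.
        raise OverflowError("quotient does not fit in 32 bits")
    return seconds


one_second = ct_to_seconds(0, 250_000_000)
low_long_wrap = ct_to_seconds(1, 0)           # seconds elapsed when lo wraps
max_years = 2**32 / (365.25 * 24 * 3600)      # span of a 32-bit seconds count
```

With these numbers the lower long alone wraps after about 17 seconds at 250MHz, which is why the upper long is needed for anything longer.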
Are they deferred only on the first GETCT, or do both GETCT defer interrupts ? (the second defer is not really needed)
This might be all theoretical but it's a way of avoiding an extra 32-bit bus if adding one would be problematic. The cost is the loss of one bit that wouldn't change for 1000+ years @ 250MHz.
I think a single high-31-bit compare would work, with the low bit specifying immediate or delay by one. CT[31:1] won't change when CT[0] goes high and CT[62:32] are on the bus. We're in a strange 31+1 bit world now.
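To make the 31+1 bit scheme concrete, here is a small Python sketch of the hub-side mux and the cog-side demux. The function names `hub_mux` and `cog_demux` are invented for this example, and the model only covers reassembling the count, not the timer-event comparators.

```python
MASK31 = (1 << 31) - 1


def hub_mux(ct):
    """32-bit word the hub would put on the bus for a given 64-bit count."""
    bit0 = ct & 1
    if bit0 == 0:
        upper31 = (ct >> 1) & MASK31      # CT[31:1]
    else:
        upper31 = (ct >> 32) & MASK31     # CT[62:32]
    return (upper31 << 1) | bit0


def cog_demux(word_even, word_odd):
    """Rebuild CT[62:0] as of the odd clock from two consecutive bus words."""
    assert word_even & 1 == 0 and word_odd & 1 == 1
    low31 = word_even >> 1    # CT[31:1], unchanged while CT[0] goes 0 -> 1
    high31 = word_odd >> 1    # CT[62:32]
    return (high31 << 32) | (low31 << 1) | 1


ct = 0x0000_0001_FFFF_FFFE                 # even count just before a rollover
rebuilt = cog_demux(hub_mux(ct), hub_mux(ct + 1))
```

The even/odd pairing works because the upper long can only change on an odd-to-even step of the lower long, so the two halves captured across an even-to-odd step always belong together.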
Good idea. Just delay the carry by two clocks. That would be one flop used as an enable to the upper long flops. That would be smaller than a 32'h0000_0001 detector. I'll do it that way. Wait. What I've got reads very clearly and this only appears once in the design. Here's what I've got:
Oh, I see how that could work. It's somewhat complicated and wouldn't allow inspection of the full counter at any time offset without some patching maybe needed. What we've got just now is simple and readily understandable. If it gets complicated, I can't remember later why it works.
I started making it so that the 2nd GETCT would not shield interrupts, but then I figured it was more trouble than it was worth. More to explain, at least.
.... Makes me realize that a cog could maintain maybe just 6 LSB's of continously-running counter, while receiving the upper bits serially, along with a registration pulse. I think it's less logic to just brute-force it centrally from the hub, like is done now.
It is less logic, but is more routing, and more power dissipation capacitance...
If you are going down a MUX path, you do not need to lose any bits, and do not need to fully serialize either.
eg There are always at least 3 bits of eggbeater count available in all COGs, which should be timing sync with the lower 3 bits of CT.
That means you could send mux'd 64b actually as 30b+31b(+3b local), ie get the LSB from the eggbeater, but this would add a possible 1 sysclk delay to wait for the correct LH pair, but drops BUS width significantly.
To save some small dynamic energy, the system could emit the high 32b only when any COG has used GETCT and paused INT?
So, how many wires, on average, are changing state in the 64-bit counter output on each clock cycle?
Not exactly the mathematical answer, but as a term of comparison, from EE Times:
"In addition to preventing intermediate states, Gray code counters consume only half the power of an equivalent binary counter and they generate correspondingly less noise. Actually, while the power and average noise difference between a Gray and a binary counter asymptotically approaches two, the peak noise difference is equal to the number of bits, since a Gray counter toggles only one bit at a time while a binary counter toggles all of its bits simultaneously two times over the course of a full-count cycle with fewer bits toggling proportionally more times."
1. With 2 instructions (can use existing GETCT instruction as CZI are not used) if you read the high first, that can disable Interrupts for one instruction. If you only need the lower then all is fine with no penalty.
2. With a smaller, say 48 bits, this provides advantages to do higher granularity simply. I see this as a more useful solution. For the rarer case where full granularity is required, then it's solvable by software. A 64 bit doesn't give this option without further software fiddling
From 1. above: So, with 64bits, only if you read the hi
        getct   lo              ' current normal use (no hold interrupts)
        ...
        getct   hi              ' just read the hi bits (holds off interrupts until next instruction executed)
        xxxx                    '
        ...
        getct   hi              ' read the hi bits (holds off interrupts until next instruction executed)
        getct   lo              ' read the lo bits
        setq    hi              'convert to seconds @250MHz
        qdiv    lo,##250_000_000
        getqx   seconds         'tops out at ~136 years of seconds
The hi now needs to +1 early rather than late (eg ~$FFFF_FFFE)
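The "early rather than late" point for a hi-then-lo read order can be checked with a quick Python model. It is a sketch under assumptions: the names `run_early_carry` and `getct_hi_then_lo` are hypothetical, the lower word is 8 bits for speed, and the upper word is taken to change on the clock where the lower word becomes ...FE.

```python
def run_early_carry(clocks, width=8):
    """Return per-clock (lo, hi, true_count) with the upper word leading by 2."""
    mask = (1 << width) - 1
    lo = hi = true_count = 0
    trace = []
    for _ in range(clocks):
        trace.append((lo, hi, true_count))
        if lo == mask - 2:       # hi changes as lo becomes ...FE
            hi += 1
        lo = (lo + 1) & mask
        true_count += 1
    return trace


def getct_hi_then_lo(trace, t, width=8):
    """Model GETCT hi at clock t followed by GETCT lo at clock t+2."""
    hi = trace[t][1]
    lo = trace[t + 2][0]
    return (hi << width) | lo


trace = run_early_carry(1000)
# The snapshot now matches the true count at the moment of the second read.
assert all(
    getct_hi_then_lo(trace, t) == trace[t + 2][2] for t in range(len(trace) - 2)
)
```

In this ordering the pair describes the instant of the lo read, so a lone lo read and the low half of a paired read mean the same thing.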
Comments
Needs a holding register.
$0000_0000_FFFF_FFFE
$0000_0000_FFFF_FFFF
$0000_0001_0000_0000
$0000_0001_0000_0001
This added only 80 LE's to the 2-cog compile.
Go back to your first approach.
Hopefully the C compiler will be able to do 64-bit unsigned int math...
Now that's what I'm talkin' about.
Did you also check interrupt hold-off ?
Each set of 4 cogs gets its own ctl and cth, in order to cut down wire delays.
"In addition to preventing intermediate states, Gray code counters consume only half the power of an equivalent binary counter and they generate correspondingly less noise. Actually, while the power and average noise difference between a Gray and a binary counter asymptotically approaches two, the peak noise difference is equal to the number of bits, since a Gray counter toggles only one bit at a time while a binary counter toggles all of its bits simultaneously two times over the course of a full-count cycle with fewer bits toggling proportionally more times."
https://eetimes.com/document.asp?doc_id=1278827
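For the question of how many counter outputs change per clock, a brute-force count over a small counter gives a feel for the numbers in that quote. This is a Python sketch with made-up names (`average_toggles`, `to_gray`), using an 8-bit counter so the full cycle can be enumerated.

```python
def average_toggles(bits, encode=lambda n: n):
    """Average number of output bits changing per clock over one full cycle."""
    states = 1 << bits
    total = 0
    for n in range(states):
        a = encode(n)
        b = encode((n + 1) % states)      # includes the wrap back to zero
        total += bin(a ^ b).count("1")
    return total / states


def to_gray(n):
    """Standard reflected binary Gray code."""
    return n ^ (n >> 1)


binary_avg = average_toggles(8)           # equals 2 - 2/256
gray_avg = average_toggles(8, to_gray)    # one bit per step
```

For an n-bit binary counter the average works out to 2 - 2^(1-n), so a 64-bit binary counter changes just under two outputs per clock on average, against exactly one for Gray code.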
The thing about Gray code is that it is painful to add numbers to.
FIFO and Streamers would benefit, immediately.
Egg-beater doesn't suffer, only a matter of doing the right decoding.
Only random things remain .... random, as they ever are.
Data is random by nature, addressing doesn't need to be.
Transforming from binary to gray is immediate, kind of a one-level xoring, excluding bit position 0.
Gray to binary is kind of a rippling thing, but, who needs to transform a raw address this fast, from gray to binary?
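The two conversions described above look like this in a short Python sketch (function names are illustrative). Binary to Gray is a single XOR with a shifted copy, while Gray to binary has to fold every higher bit down, which is the rippling part.

```python
def binary_to_gray(n):
    """One XOR level: each Gray bit is a binary bit XOR its upper neighbour."""
    return n ^ (n >> 1)


def gray_to_binary(g):
    """Ripple from the top: each binary bit is the XOR of all Gray bits above it."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


# Round trip over a small range, and a check that neighbours differ in one bit.
assert all(gray_to_binary(binary_to_gray(i)) == i for i in range(1024))
assert all(
    bin(binary_to_gray(i) ^ binary_to_gray(i + 1)).count("1") == 1
    for i in range(1023)
)
```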
You seemed to miss this post...
It's
From 1. above: So, with 64bits, only if you read the hi The hi now needs to +1 early rather than late (eg ~$FFFF_FFFE)
BTW I make that ~2,338 years !!!!!
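The year figures quoted in the thread can be reproduced with a few lines of Python. This is a sketch: `rollover_seconds` is a made-up helper, the clock is assumed to be 250MHz throughout, and a year is taken as 365.25 days.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600     # Julian year, an assumption


def rollover_seconds(bits, clock_hz=250_000_000):
    """Seconds until a free-running counter of the given width wraps."""
    return 2**bits / clock_hz


def rollover_years(bits, clock_hz=250_000_000):
    return rollover_seconds(bits, clock_hz) / SECONDS_PER_YEAR


# Print the wrap time for the widths mentioned in the discussion.
for bits in (32, 48, 63, 64):
    print(f"{bits}-bit: {rollover_seconds(bits):,.1f} s, {rollover_years(bits):,.4f} years")
```

The 64-bit case comes out a little over 2,338 years and the 63-bit case at half that, which agrees with the "1000+ years" estimate given for losing CT[63].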