I started making it so that the 2nd GETCT would not shield interrupts, but then I figured it was more trouble than it was worth. More to explain, at least.
Cluso99's reverse read above, with a new opcode, avoids that double-delay effect (but it costs the new opcode decode, and TC is now on $FFFF_FFFE; maybe that can be a 31-bit carry plus a 1-bit compare, to keep speed and reduce logic?).
Good idea. Just delay the carry by two clocks. That would be one flop used as an enable to the upper long flops. That would be smaller than a 32'h0000_0001 detector. I'll do it that way.
I know FPGAs take special care to have fast carry for counters, and I expect ASIC compilers can do the same thing too: 'see' a counter and optimize the design for Terminal Count?
It would be nice to have the CT -> 64b well away from the critical path, for those keen on overclocking.
1. With 2 instructions (can use existing GETCT instruction as CZI are not used) if you read the high first, that can disable Interrupts for one instruction. If you only need the lower then all is fine with no penalty.
2. With a smaller counter, say 48 bits, this provides advantages to do higher granularity simply. I see this as a more useful solution. For the rarer case where full granularity is required, it's solvable by software. A 64-bit counter doesn't give this option without further software fiddling.
From 1. above: with 64 bits, interrupts are held off only if you read the hi:
getct lo ' current normal use (no hold interrupts)
...
getct hi ' just read the hi bits (holds off interrupts until next instruction executed)
xxxx '
...
getct hi ' read the hi bits (holds off interrupts until next instruction executed)
getct lo ' read the lo bits
setq hi 'convert to seconds @250MHz
qdiv lo,##250_000_000
getqx seconds 'tops out at ~136 years of seconds
The hi now needs to increment early rather than late (e.g. at ~$FFFF_FFFE).
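Why the hi word has to carry early when it is read first can be seen in a rough Python model. This is only a sketch: `GAP`, `read_hi_then_lo` and the assumption that the two back-to-back reads are exactly two clocks apart are made up for illustration, and it is not the actual P2 logic.

```python
MASK32 = 0xFFFF_FFFF
GAP = 2  # assumed clocks between the hi read and the lo read

def read_hi_then_lo(ct, early_carry):
    """Model reading hi at clock ct and lo at clock ct + GAP."""
    # With early carry, hi already shows what it will be GAP clocks later.
    hi_source = ct + GAP if early_carry else ct
    hi = (hi_source >> 32) & MASK32
    lo = (ct + GAP) & MASK32
    return (hi << 32) | lo

ct = 0x0000_0000_FFFF_FFFE              # lo is about to roll over
torn = read_hi_then_lo(ct, early_carry=False)
good = read_hi_then_lo(ct, early_carry=True)
print(hex(torn), hex(good), hex(ct + GAP))
```

In this model the plain read pairs an old hi with a rolled-over lo, while the early-carry read matches the counter value at the moment lo is sampled.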
I like reading the top first and only shielding interrupts for that instruction. I don't understand how you get more time, though.
.... I don't understand how you get more time, though.
Do you mean the years? Both numbers can apply, but to different things...
I think one number is the reach of 64 bits, whilst the other one is the 32-bit reach, in seconds,
i.e. because when you divide 64 bits by 250M, the result does not all fit into 32 bits.
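A small Python sketch of the 32-bit seconds limit, assuming a 250 MHz clock and a divide that returns a 32-bit quotient. The names `ct_to_seconds`, `CLOCK_HZ` and the overflow check are hypothetical, chosen for this illustration only.

```python
CLOCK_HZ = 250_000_000
MAX32 = 0xFFFF_FFFF
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def ct_to_seconds(hi, lo, clock_hz=CLOCK_HZ):
    """Whole seconds for a 64-bit tick count given as two 32-bit halves."""
    if hi >= clock_hz:
        # The quotient would need more than 32 bits.
        raise OverflowError("quotient would not fit in 32 bits")
    return ((hi << 32) | lo) // clock_hz

# Largest tick count whose seconds value still fits in 32 bits.
largest = ct_to_seconds(CLOCK_HZ - 1, MAX32)
years_32bit_seconds = (MAX32 + 1) / SECONDS_PER_YEAR
print(largest, round(years_32bit_seconds, 1))
```

So the ~136 years is the reach of a 32-bit seconds result, not the reach of the 64-bit counter itself.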
And we've shielded stateful things from interrupts all over the place. IMHO, very, very good calls all of them, because overall chip complexity would have gone off the charts. If you ask me, that's one of the very best trade-offs made in this design cycle, and it's going to make P2 a notable chip, once people see it in action.
This simple thing fits right in with a whole lot of other simple things. Has my vote.
Another instruction would expand on this, and it's a different sort of simple thing, but not the very simplest thing. Honestly, people will pick up on how it is right now, and that's the dominant use case too.
... I don't understand how you get more time, though.
64 bits = 4,294,967,296 * 4,294,967,296 = 1.844674407370955e+19
Now, in seconds at 250MHz = 1.844674407370955e+19 / 250,000,000 = 73,786,976,294.83821
In minutes = 73,786,976,294.83821 / 60 = 1,229,782,938.247303
In hours = 1,229,782,938.247303 / 60 = 20,496,382.30412172
In days = 20,496,382.30412172 / 24 = 854,015.9293384052
In years = 854,015.9293384052 / 365.25 (century non-leap years not accounted for) = 2,338.168184362506 years
Chip,
As I understand this, there is a single 64-bit counter running (it was 32 bits) and this has to feed every cog. That is one hell of a bus of wires for little use. And it's clocking all the time on those long wires. I do hope the extra congestion doesn't cause timing or routing issues for OnSemi.
I am wondering how this could be simplified. As I said previously, I think 48 bits would be fine.
I wonder if the counter could be gated onto the I/O bus, or the HUB RAM bus, going to the cog(s) when being read?
Alternatively, what about each cog having its own counter, or at least most of the counter's bits? I know the silicon for a big counter is a lot of flops, but surely this would be relevant compared to the bus, and would relieve the routing congestion.
To me, this no longer seems the simple solution we all thought it would be, as the ramifications seem much bigger than first thought.
There are some wires, but out of 64, only 2 change state, on average. It all goes into the wash. Nothing to get concerned about.
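The "only 2 change state, on average" figure is easy to check with a short Python count of bit flips per increment. This is a sketch with a hypothetical helper `average_toggles`; it says nothing about the actual wire lengths or power.

```python
def average_toggles(width, samples):
    """Average number of bits that flip per increment of a binary counter."""
    mask = (1 << width) - 1
    total = 0
    for n in range(samples):
        # Bits that differ between n and n + 1 are the ones that toggle.
        total += bin((n ^ (n + 1)) & mask).count("1")
    return total / samples

avg = average_toggles(64, 1 << 16)
print(avg)  # just under 2
```

Bit 0 flips every tick, bit 1 every second tick, and so on, so the sum 1 + 1/2 + 1/4 + ... approaches 2 regardless of counter width.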
With CT incrementing on every tick of a 320MHz clock, we will have to wait until the year 3844 for it to roll over.
-1 U. 18446744073709551615 ok
$7FFFFFFFFFFFFFFF DUP . 9223372036854775807 ok
DUP 320000000 / DUP . 28823037615 ok
3600 / DUP . 8006399 ok
24 / DUP . 333599 ok
365 / DUP . 913 ok
913 2018 + . 2931 ok
913 2* . 1826 ok
1826 2018 + . 3844 ok
So no worries about a Y2K type problem for a while then.
Not questioning the reliability of the P2, but I seriously doubt that one will run for more than a thousand years between reset events :-)
I think the part would fail from eventual electromigration effects after 100 years, if it were run at high speed. If you ran it at 1MHz, you might make it to the end of the 64-bit counter. Of course, that would take even longer.
And that's the absurdity of having a full 64 bits, since we would never use it in a thousand years, literally.
It can of course be made smaller, with small but finite routing and register savings.
10 years is probably too small, but ~100 could be ok, which comes in at a round 60 bits.
If the local eggbeater spins 3 LSBs, always in sync, you can also save routing those 3 bits, so there is scope to maybe shave 4+3 = 7 bits off the total route needs.
Well, that comment was made with tongue firmly in cheek. Peter does have a point as far as 64 bits providing an absurdly long count. An additional 16 or even 8 bits would have been more than adequate for most things, although I suspect either one would not be all that much simpler than the 32 bit version. Perhaps "better to have it and not need it..." applies in this case.
... An additional 16 or even 8 bits would have been more than adequate for most things...
Not really. If you want a useful time-since-reset, you do not want that to wrap inside any sensible time. With another 8 bits, up-time wraps about every hour!
Even 16 bits only nudges you out to 10 days. Both would need additional software and some time-manager COG allocated.
As my numbers indicated, you can decrease from 64b, but not by very much (~ 60 bits).
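For reference, the wrap times for the widths mentioned can be tabulated with a few lines of Python. This sketch assumes a 250 MHz clock; `wrap_seconds` is a hypothetical helper.

```python
CLOCK_HZ = 250_000_000
HOUR = 3600
DAY = 24 * HOUR
YEAR = 365.25 * DAY

def wrap_seconds(bits, clock_hz=CLOCK_HZ):
    """Seconds until a free-running counter of the given width wraps."""
    return (1 << bits) / clock_hz

for bits in (32, 40, 48, 60, 64):
    s = wrap_seconds(bits)
    print(f"{bits} bits: {s:,.0f} s = {s / HOUR:,.1f} h = {s / DAY:,.1f} d = {s / YEAR:,.2f} y")
```

At this clock, 40 bits wraps in a little over an hour, 48 bits in about 13 days (the same order as the 10 days quoted), 60 bits in roughly 146 years, and 64 bits in about 2,338 years.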
If it were any less than 64 bits, it would seem miserly. The next step after 32 is 64.
We could present it in the API as if it were 64 bits, but leave the top N bits hardcoded to 0. (This would have to be documented, but I doubt anyone would complain if the counter were restricted to, say, 100 years worth of cycles.) That's probably not going to be a huge saving, but it's something to consider if routing turns out to be tricky.
Those three bits are a drop in the ocean, amid everything else in there.
We could present it in the API as if it were 64 bits, but leave the top N bits hardcoded to 0. (This would have to be documented, but I doubt anyone would complain if the counter were restricted to, say, 100 years worth of cycles.) That's probably not going to be a huge saving, but it's something to consider if routing turns out to be tricky.
Exactly. They have to prove it is not 64 bits first.
There is no law that says you have to implement in quanta of 32 bits; e.g. I see parts with 24-bit counters.
Those three bits are a drop in the ocean, amid everything else in there.
Perhaps, but it looks like you can save 4+3 bits of routing, and it all adds up. At some stage, all the added stuff will start to push down system clock speeds.
Maybe there's something clever that can be done with upper bits?
Perhaps upper 5 bits get inc'd whenever an interrupt is called?
There must be something fun here...
There might be, but that would dictate adding masking for normal use comparisons.
I'd be fine with saving routing to 60 (or 57) lines, and reading undefined as 0000.
If it were any less than 64 bits, it would seem miserly. The next step after 32 is 64.
Oh, please! This is just nuts! In the P1, we've been more than happy with a 53-second rollover. I still submit that 32 bits is plenty, regardless of the clock speed. It has nothing to do with real time, only the number of clock ticks it takes to deal with rollover in software. And 2^32 ticks is more than enough.
At this rate, the P2 is never going to get finished! Chip, and his forumista enablers (yes, "enablers", since mission creep is a form of addiction for Chip), what the hell are you thinking?!!
Phil, the P2 will be finished at some point, but it may take an extra round of silicon to fix the bugs that may get introduced with the new features. In the meantime, we'll be able to play around with the P2 from the first round of silicon. This will be obsoleted by the second round of silicon, which may end up being obsoleted by the third round of silicon. Eventually, there will be a stable version of the P2.
I'm not overly concerned. I'm also not really advocating for new features. But I'm not going to balk at it. I think Chip knows which domains worked well and which didn't, and frankly what he wrote worked, minus interpretation differences in the tools.
Worst case, we do have a working mask set. If people get impatient, or use cases are presented, OnSemi can be asked to make those, and it can work.
So we have one revision of a P2. It can be made into additional chips.
Comments
Cluso, thanks for re-editing the thread title!
No, because then we have to recreate the counter logic in each cog. Better to send wires.
If we were to leave for a while at near-light-speed, we'd like to come back later and see that it was still working, though.
Maybe there's something clever that can be done with upper bits?
Perhaps upper 5 bits get inc'd whenever an interrupt is called?
There must be something fun here...
Or, maybe it gets inc'd whenever in a wait state?
Ok, maybe don't need to be 100% efficient. Maybe 2000+ years of counter life is cool...
Or, upper bits get inc'd by internal RC oscillator?
hehe, if that were possible, we could measure RC osc using the crystal.... Sadly, no..
Maybe more amazing will be a 2k-year-old P2 system showing the count on an LCD...
Wait: the forever clock! Does it already exist?