CNT extension to 64-bit

Cluso99 · 2018-11-13 22:56

I say from the outset, I am concerned about feature creep.
Postedit: Thought about it. It's feature creep and not warranted. Forget it

But here I raised whether it might be simple to extend the CNT register beyond 32 bits:
forums.parallax.com/discussion/comment/1453238/#Comment_1453238

Requirement: Minimal logic, minimal change, Minimal risk

For reference, here is the GETCT instruction
EEEE 1101011 000 DDDDDDDDD 000011000 GETCT D

Note CZL=000 which allows for expansion

Tony suggested this as an extension to my suggestion...

TonyB_ wrote: »
Could GETCT set Z if CT has passed through zero since the last GETCT? This rollover bit would be cleared after GETCT. Could also set C if CT[31]=1 for completeness. The 64-bit count code could be:
	MOV	CTHI,#0
	...
	GETCT	CTLO		WZ
  IF_Z	ADD	CTHI,#1
If Z is difficult, then use C.

Further refinement...
* CNT[31:0] is extended to be CNT[63:0]
* When CNT[31] overflows (ie goes from 1 to 0), it sets an internal flag "FLAG31"
* GETCT instruction extended to be GETCNT D and GETCNTH D {WC}
* When GETCNT executes, it clears "FLAG31"
* When GETCNTH D {WC} executes, C is optionally set if "FLAG31" is set (ie an overflow of CNT[31] occurred since the last GETCNT executed)

        GETCNT  countl
        GETCNTH counth     wc
 if_c   sub     counth,#1      'CNT[31:0] rolled over after countl was read, so CNTH is one too high and needs adjusting
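
To make the FLAG31 idea concrete, here is a small Python model of it. This is only a sketch of the proposal as described above, not real hardware behaviour: the class `CounterModel`, its methods and `read64` are hypothetical names, and interrupts are not modelled.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

class CounterModel:
    """Toy model of the proposed 64-bit CNT with the FLAG31 rollover flag."""

    def __init__(self, start=0):
        self.cnt = start & MASK64
        self.flag31 = False

    def tick(self, n=1):
        """Advance by n ticks (n < 2**32); set FLAG31 if CNT[31:0] rolls over."""
        low_before = self.cnt & MASK32
        self.cnt = (self.cnt + n) & MASK64
        if (self.cnt & MASK32) < low_before:
            self.flag31 = True

    def getcnt(self):
        """Model of GETCNT: return CNT[31:0] and clear FLAG31."""
        self.flag31 = False
        return self.cnt & MASK32

    def getcnth(self):
        """Model of GETCNTH WC: return (CNT[63:32], C), where C mirrors FLAG31."""
        return (self.cnt >> 32) & MASK32, self.flag31

def read64(counter, ticks_between=2):
    """Read low then high, subtracting 1 from high if C reports a rollover."""
    low = counter.getcnt()
    counter.tick(ticks_between)  # assumed delay between the two instructions
    high, c = counter.getcnth()
    if c:
        high = (high - 1) & MASK32
    return (high << 32) | low

if __name__ == "__main__":
    c = CounterModel(start=0x1FFFFFFFF)  # low half is about to roll over
    print(hex(read64(c)))
```

The returned value is the 64-bit count as it stood when the low half was read, which is what the `if_c sub` line is for.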

I would expect that in many instances requiring longer count times, only CNTH (CNT[63:32]) would be needed.

EEEE 1101011 000 DDDDDDDDD 000011000 GETCNT D
EEEE 1101011 C01 DDDDDDDDD 000011000 GETCNTH D {WC}

Cluso99 · 2018-11-13 23:10

For reference, 333MHz clock gives 3ns, so 32-bits gives a time of ~12.9s

Extending to 64 bits gives ~1,753 years !!!

An extra 8-bits CNT[39:0] gives ~55m
An extra 16-bits CNT[47:0] gives ~9.7 days

So perhaps we only need an extra 8-bits ???
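
As a quick check of the figures above, here is a small Python sketch that computes the wrap time for a few counter widths. It assumes exactly 3 ns per tick (the rounded 333MHz figure used above) and a 365.25-day year; the name `wrap_seconds` is just for illustration.

```python
TICK_SECONDS = 3e-9  # assumed: 3 ns per tick (roughly 333 MHz)

def wrap_seconds(bits, tick=TICK_SECONDS):
    """Seconds until a free-running counter of the given width wraps."""
    return (2 ** bits) * tick

if __name__ == "__main__":
    print(f"32 bits: {wrap_seconds(32):.1f} s")
    print(f"40 bits: {wrap_seconds(40) / 60:.1f} min")
    print(f"48 bits: {wrap_seconds(48) / 86400:.1f} days")
    print(f"64 bits: {wrap_seconds(64) / (86400 * 365.25):.0f} years")
```

Passing a different `tick` (for example `1 / 250e6`) gives the numbers for other clock rates.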

Cluso99 · 2018-11-13 23:18

jmg pointed out that Interrupts can cause a problem.

So to simplify, just extend CNT to CNT[39:0] and change the GETCT D instruction to

EEEE 1101011 000 DDDDDDDDD 000011000 GETCT D 'reads CNT[31:0]
EEEE 1101011 001 DDDDDDDDD 000011000 GETCTX D 'reads CNT[39:8]

jmg · 2018-11-13 23:20

Cluso99 wrote: »

* When GETCNTH D {WC} executes, C is optionally set if "FLAG31" is set (ie an overflow of CNT[31] occurred since the last GETCNT executed)

A problem here is interrupts. With split reads, if an INT and then an overflow occur, by the time it returns overflow info is lost.

One solution would be to have GETCNTH RegAdr read 64 bits in an atomic manner to 2 adjacent locations. Taking 3 or 4 SysCLKs to do so ?
Another is to have GETCNT always pause any INTs for 2 more sysclks, to effectively allow atomic 64b read. This could be a second GETCNT read, if opcode space is tight.

pedward · 2018-11-13 23:25

Since CNT overflows every ~17 seconds at 250 MHz, 64 bits seems like a very reasonable ask.

That extends the useful range of CNT to just 2339 years, but should be sufficient for most applications.

Roy Eltham · 2018-11-13 23:33

This is a feature I think we absolutely need.
I think this is way more important than convenience stuff that could just be handled in the compiler/documentation (the WRPIN things).

This is something that people will run into and have to handle all the time.

Cluso99 · 2018-11-13 23:44

Thought about it. It's feature creep and not warranted. Forget it

Cluso99 · 2018-11-13 23:46

Oh Roy. Just when I was convinced there was no reason to do it ???

Isn't 12 seconds enough ???

evanh · 2018-11-13 23:49

pedward wrote: »

Since CNT overflows every ~17 seconds at 250 MHz

The CT events range is half that because they operate on high-bit signed compare, rather than rollover equality. So, it's actually worse.
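
A rough Python sketch of why a signed compare halves the usable range. This is my reading of "high-bit signed compare" (test bit 31 of the 32-bit difference); the real hardware comparison may differ in detail, and `has_passed` is a hypothetical name.

```python
MASK32 = 0xFFFFFFFF

def has_passed(now, target):
    """True if `now` is at or past `target`, judged by a signed 32-bit difference."""
    diff = (now - target) & MASK32
    return diff < 0x80000000  # bit 31 clear means a non-negative difference

if __name__ == "__main__":
    now = 0
    # A target just under 2**31 ticks ahead is correctly seen as still pending.
    print(has_passed(now, 0x7FFFFFFF))
    # A target just over 2**31 ticks ahead already looks like it is in the past.
    print(has_passed(now, 0x80000001))
```

So a timeout can only be set up to about 2^31 ticks ahead, half of the full 32-bit wrap time, but the compare keeps working across the rollover itself.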

evanh · 2018-11-13 23:56

Cluso99 wrote: »

Oh Roy. Just when I was convinced there was no reason to do it ???

Isn't 12 seconds enough ???

Hehe, everyone has their priorities.

I'm personally okay with 6 seconds. Software can do the rest.

msrobots · 2018-11-14 00:01

+1

I stumbled a lot over the P1 overflow at ~57 seconds. 55 minutes would be better, but if it is not prohibitive I would like to see a 64-bit cnt.

Just setting a flag at 32-bit overflow and using that to count overflows by oneself in another register would need constant attention.

I am quite unsure about the getx, gety things, but if it were possible to take a snapshot of both cnt values at the same time and then read them one after the other, that would avoid any interrupt problems.

Mike

jmg · 2018-11-14 00:02

Cluso99 wrote: »

jmg pointed out that Interrupts can cause a problem.

So to simplify, just extend CNT to CNT[39:0] and change the GETCT D instruction to

EEEE 1101011 000 DDDDDDDDD 000011000 GETCT D 'reads CNT[31:0]
EEEE 1101011 001 DDDDDDDDD 000011000 GETCTX D 'reads CNT[39:8]

See above, that is still not atomic.

GETCT could pause any pending INT for 2 SysCLKS, allowing any (optional) second GETCT to immediately follow.
If one GETCT immediately follows another (now safe from INTs), the second GETCT reads upper part of 40~64b, and does not add any INT pause. (A tiny state engine)

With a now known and locked 2 Sysclk delay on the 2 readings, I think a simple fixed 2 sysclk overflow pipeline is enough to give a perfect 40~64b capture, no upper holding buffer needed & no housekeeping.

Cluso99 wrote: »

For reference, 333MHz clock gives 3ns, so 32-bits gives a time of ~12.9s

Extending to 64 bits gives ~1,753 years !!!

An extra 8-bits CNT[39:0] gives ~55m
An extra 16-bits CNT[47:0] gives ~9.7 days

So perhaps we only need an extra 8-bits ???

There is only one CNT, so logic cost is not great. 8b means user-code still needs to manage rollovers.
24b is just under 7 years wrap time.

David Betz · 2018-11-14 00:06

jmg wrote: »

Cluso99 wrote: »

jmg pointed out that Interrupts can cause a problem.

So to simplify, just extend CNT to CNT[39:0] and change the GETCT D instruction to

EEEE 1101011 000 DDDDDDDDD 000011000 GETCT D 'reads CNT[31:0]
EEEE 1101011 001 DDDDDDDDD 000011000 GETCTX D 'reads CNT[39:8]

See above, that is still not atomic.

GETCT could pause any pending INT for 2 SysCLKS, allowing any (optional) second GETCT to immediately follow.
If one GETCT immediately follows another (now safe from INTs), the second GETCT reads upper part of 40~64b, and does not add any INT pause. (A tiny state engine)

With a now known and locked 2 Sysclk delay on the 2 readings, I think a simple fixed 2 sysclk overflow pipeline is enough to give a perfect 40~64b capture, no upper holding buffer needed & no housekeeping.

Cluso99 wrote: »

For reference, 333MHz clock gives 3ns, so 32-bits gives a time of ~12.9s

Extending to 64 bits gives ~1,753 years !!!

An extra 8-bits CNT[39:0] gives ~55m
An extra 16-bits CNT[47:0] gives ~9.7 days

So perhaps we only need an extra 8-bits ???

There is only one CNT, so logic cost is not great. 8b means user-code still needs to manage rollovers.
24b is just under 7 years wrap time.

Yeah, but aren't there lots of wires involved to connect that one CNT register with all 8 COGs?

jmg · 2018-11-14 00:07

evanh wrote: »

pedward wrote: »

Since CNT overflows every ~17 seconds at 250 MHz

The CT events range is half that because they operate on high-bit signed compare, rather than rollover equality. So, it's actually worse.

Yes, those are waitcnts, which test for >=. The CNT extension was not going to affect those. If someone is waiting, they can do that in a loop.

This 64b read of CNT means you do not have to ensure that some SW somewhere in the system is alive and tracking those 6~17s quanta.

cgracey · 2018-11-14 00:08

What about some kind of alternative, where the main counter is 64 bits, but each cog can choose its own window of 32 bits? That way, all the 32-bit math doesn't get trashed.
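
A minimal Python sketch of what such a window would return, assuming a per-cog bit offset in the range 0..32. The function name `window32` and the offset parameter are hypothetical; nothing here reflects an actual register layout.

```python
def window32(cnt64, offset):
    """Return the 32-bit window CNT[offset+31:offset] of a 64-bit count."""
    if not 0 <= offset <= 32:
        raise ValueError("offset must be in 0..32")
    return (cnt64 >> offset) & 0xFFFFFFFF

if __name__ == "__main__":
    cnt = 0x123456789ABCDEF0
    print(hex(window32(cnt, 0)))   # full resolution, shortest wrap time
    print(hex(window32(cnt, 8)))   # 1/256 resolution, 256x longer wrap time
    print(hex(window32(cnt, 32)))  # upper half only
```

Each cog would still see an ordinary 32-bit count, so existing 32-bit compare and subtract code would keep working on whichever window it picked.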

jmg · 2018-11-14 00:11

cgracey wrote: »

What about some kind of alternative, where the main counter is 64 bits, but each cog can choose its own window of 32 bits? That way, all the 32-bit math doesn't get trashed.

Isn't that per-COG shifted window view of 64b more complex, in logic and in use, than just reading twice if you want 64b?

macrobeak · 2018-11-14 00:17

cgracey wrote: »

What about some kind of alternative, where the main counter is 64 bits, but each cog can choose its own window of 32 bits? That way, all the 32-bit math doesn't get trashed.

Chip, that sounds like a very good solution.
Having to continuously handle cnt overflow for long duration timing operations is tedious.

pedward · 2018-11-14 00:20

Hmm, the window idea is great. You can choose what precision you want to count at, then set your mask.

For high precision timers, you choose the lowest 32bits, for seconds you'd bump it up a bit.

The window idea nicely handles the need for high/low precision, and for atomic transfers of the count.

Best of all, the count stays large precision and monotonic. CNT would never overflow, even in the worst cases.

You can easily sample the lower 32 bits, then the upper 32 bits and based on the lower 32bit value, sample it again if there was an overflow imminent.

jmg · 2018-11-14 00:26

pedward wrote: »

Hmm, the window idea is great. You can choose what precision you want to count at, then set your mask.

You do need to consider the logic cost here - if you expect a nice-to-have 1 bit shifting granularity onto a 64b counter, that is a lot of config (5 offset bits per COG) and many MUXes .... (all this duplicated 8 times...)

potatohead · 2018-11-14 00:27

Chip, that is a fine idea. If it is simple, my vote is to do it.

evanh · 2018-11-14 00:32

pedward wrote: »

You can easily sample the lower 32 bits, then the upper 32 bits and based on the lower 32bit value, sample it again if there was an overflow imminent.

Chip, this is good enough. Software can make good use of repeated reads of high/low halves. Leave the events as is.

So there is just one added instruction this way. Nothing else changes for the compilers.

PS: This also means you can implement CT as two cascaded 32-bit counters, eliminating any risk of critical-path timing problems later.

Cluso99 · 2018-11-14 00:33

cgracey wrote: »

What about some kind of alternative, where the main counter is 64 bits, but each cog can choose its own window of 32 bits? That way, all the 32-bit math doesn't get trashed.

That was what I was trying to get at with the GETCTX D.

With a 40-bit counter, GETCTX D would return CNT[39:8]. This gives ~55mins.
With a 44-bit counter, GETCTX D would return CNT[43:12]. This would give ~14.7 hours.
With a 48-bit counter, GETCTX D would return CNT[47:16]. This gives ~9.7 days.

With any case (just implement one case) there would be no need to have rollover.

Rayman · 2018-11-14 00:45

Can this counter be in the hub and use CORDIC-like commands to get all 64 bits in x and y?

evanh · 2018-11-14 00:52

Cluso,
Chip is talking about a shiftable window. Where you could specify how far up the 64 bits you want to read the 32 bits or, for events, compare your 32-bits against. More flexible but more costly too.

Rayman,
It kind of is in the hub in that all cogs share the one counter, but, since it is read only, it can be read by all cogs simultaneously so it acts like it's part of each cog.

Cluso99 · 2018-11-14 00:58

Evan,
Yes. But I don't think it needs to be this flexible.
Just choose one of 40/44/48 bits and implement this. Definitely no need to be any larger than 48 bits. Keep it simple.

There are software ways around this if better granularity is also required by a specific user.

evanh · 2018-11-14 01:03

Agreed, one new instruction to fetch the high half of the 64-bits is the simplest improvement. A 64-bit counter, especially as dual 32-bit cascaded, can be done with small resources. EDIT: And even an extra hidden register in each cog for holding a latched copy of high half is reasonably small resource cost.

Cluso99 · 2018-11-14 01:11

evanh wrote: »

Agreed, one new instruction to fetch the high half of the 64-bits is the simplest improvement. A 64-bit counter, especially as dual 32-bit cascaded, can be done with small resources. EDIT: And even an extra hidden register in each cog for holding a latched copy of high half is reasonably small resource cost.

By making the total bits less than 64, software can read both halves if necessary, and compare the overlapping bits to see if rollover occurred. This is why I suggested say 48 bits.
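
A Python sketch of the overlap check, assuming a 48-bit counter where the first read returns CNT[31:0] and the second returns CNT[47:16], so that CNT[31:16] appears in both. The helper names are hypothetical, and it assumes far fewer than 2^32 ticks pass between the two reads.

```python
def read_low(cnt48):
    """CNT[31:0], as a plain GETCT would return it."""
    return cnt48 & 0xFFFFFFFF

def read_window(cnt48):
    """CNT[47:16], as the proposed GETCTX might return it on a 48-bit counter."""
    return (cnt48 >> 16) & 0xFFFFFFFF

def combine48(low32, win32):
    """Rebuild the 48-bit count at the time of the first (low) read."""
    overlap_low = low32 >> 16      # CNT[31:16] at the first read
    overlap_win = win32 & 0xFFFF   # CNT[31:16] at the second read
    high16 = win32 >> 16           # CNT[47:32] at the second read
    if overlap_win < overlap_low:
        # The low half rolled over between the reads, so undo the carry.
        high16 = (high16 - 1) & 0xFFFF
    return (high16 << 32) | low32

if __name__ == "__main__":
    t1 = 0x0001FFFFFFFE            # low half about to roll over
    t2 = t1 + 4                    # a few ticks later, after the rollover
    print(hex(combine48(read_low(t1), read_window(t2))))
```

The overlapping bits only move backwards when the low half has wrapped, which is what makes the comparison a usable rollover test.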

evanh · 2018-11-14 01:22

If the high half has changed then the low half has rolled over. Just have to have an earlier reading is all.
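
In software this usually ends up as a high/low/high read loop. A Python sketch, assuming two separate 32-bit reads are available; `FakeCounter` and `read64_retry` are hypothetical names, and the fake counter simply advances one tick per read to stand in for elapsed time.

```python
class FakeCounter:
    """Stand-in for a 64-bit counter that advances one tick on every read."""

    def __init__(self, start=0):
        self.cnt = start

    def _advance(self):
        self.cnt = (self.cnt + 1) & 0xFFFFFFFFFFFFFFFF

    def get_high(self):
        value = (self.cnt >> 32) & 0xFFFFFFFF
        self._advance()
        return value

    def get_low(self):
        value = self.cnt & 0xFFFFFFFF
        self._advance()
        return value

def read64_retry(get_high, get_low):
    """Read high, low, high again; retry if the high half changed in between."""
    while True:
        high = get_high()
        low = get_low()
        if get_high() == high:
            return (high << 32) | low

if __name__ == "__main__":
    c = FakeCounter(start=0x1FFFFFFFF)  # low half rolls over during the first pass
    print(hex(read64_retry(c.get_high, c.get_low)))
```

The loop only repeats when a rollover lands between the reads, and it also copes with an interrupt in the middle, since a changed high half is caught by the final compare.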

cgracey · 2018-11-14 01:26

jmg wrote: »

pedward wrote: »

Hmm, the window idea is great. You can choose what precision you want to count at, then set your mask.

You do need to consider the logic cost here - if you expect a nice-to-have 1 bit shifting granularity onto a 64b counter, that is a lot of config (5 offset bits per COG) and many MUXes .... (all this duplicated 8 times...)

This is true!

I'm not through reading comments here, but maybe a 48-bit counter with 16, or even 8, selectable window offsets would be good.

jmg · 2018-11-14 01:33

cgracey wrote: »

jmg wrote: »

pedward wrote: »

Hmm, the window idea is great. You can choose what precision you want to count at, then set your mask.

You do need to consider the logic cost here - if you expect a nice-to-have 1 bit shifting granularity onto a 64b counter, that is a lot of config (5 offset bits per COG) and many MUXes .... (all this duplicated 8 times...)

This is true!

I'm not through reading comments here, but maybe a 48-bit counter with 16, or even 8, selectable window offsets would be good.

That's more config, logic and granularity compromise than a queued 32+32 read.
A 2 SysClk L32-> H32 delay (matches reading delay) would mean the upper 32 bits do not even need a holding register, and a 2 SysCLK monostable is used to hold-off interrupts to avoid INT aperture effects.
All up, quite simple logic, done once.

cgracey · 2018-11-14 01:40

jmg wrote: »

cgracey wrote: »

jmg wrote: »

pedward wrote: »

Hmm, the window idea is great. You can choose what precision you want to count at, then set your mask.

You do need to consider the logic cost here - if you expect a nice-to-have 1 bit shifting granularity onto a 64b counter, that is a lot of config (5 offset bits per COG) and many MUXes .... (all this duplicated 8 times...)

This is true!

I'm not through reading comments here, but maybe a 48-bit counter with 16, or even 8, selectable window offsets would be good.

That's more config, logic and granularity compromise than a queued 32+32 read.
A 2 SysClk L32-> H32 delay (matches reading delay) would mean the upper 32 bits do not even need a holding register, and a 2 SysCLK monostable is used to hold-off interrupts to avoid INT aperture effects.
All up, quite simple logic, done once.

But if you want a long timeout using built-in timeout mechanisms, you are limited to ~8s.

I need to look through the Verilog code and see how CNT is used everywhere. That will indicate what approaches might be best.
