The need for CTRx addressable registers

Seairth · 2015-03-02 16:15

(I vaguely recall prior discussion on this topic, but don't remember what was said....)

There are three registers per counter:

CTRx : The control register
PHSx : The counter accumulator
FRQx : The value to conditionally add to PHSx

Do all of these need to be implemented as addressable registers? I understand the need for PHSx, and possibly FRQx, to be addressable. But what about CTRx? And, if CTRx does need to be kept addressable, how necessary is it to have the fields aligned for use with MOVI, MOVD, and MOVS?
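For reference, the field alignment in question can be illustrated with a small model. This is a minimal Python sketch, not PASM. It assumes the P1 CTRx layout (CTRMODE in bits 30..26, PLLDIV in 25..23, BPIN in 14..9, APIN in 5..0) and that MOVI, MOVD, and MOVS replace bits 31..23, 17..9, and 8..0 of the destination. The function names are made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def movs(dest, src):
    """Model of MOVS: replace bits 8..0 of dest with the low 9 bits of src."""
    return (dest & ~0x1FF & MASK32) | (src & 0x1FF)

def movd(dest, src):
    """Model of MOVD: replace bits 17..9 of dest with the low 9 bits of src."""
    return (dest & ~(0x1FF << 9) & MASK32) | ((src & 0x1FF) << 9)

def movi(dest, src):
    """Model of MOVI: replace bits 31..23 of dest with the low 9 bits of src."""
    return (dest & ~(0x1FF << 23) & MASK32) | ((src & 0x1FF) << 23)

# Build a CTRx value one field at a time, without touching the other fields.
ctr = 0
ctr = movi(ctr, 0b0_00100_000)  # CTRMODE = %00100, PLLDIV = %000
ctr = movd(ctr, 3)              # BPIN = 3
ctr = movs(ctr, 5)              # APIN = 5
print(f"{ctr:032b}")
```

Because each field sits inside one of the three 9-bit windows, a single instruction can change the mode or a pin number, which is the "more compact code" argument for keeping the alignment.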

jmg · 2015-03-02 20:39

I'm not sure I follow - you still need to write to CTRx ?
If you are thinking of freeing more code space, which is relatively expensive 4 port memory, then you could add indirect access dual port memory, where CTRs could sit, but that breaks a lot of code for not much gain ?

Field alignment is useful because it allows more compact code - there are certainly potential uses for the unused bits.

Things like Quadrature Counting modes and bits for Atomic (optionally Armed/triggered) Capture are obvious ones.
To avoid consuming memory the Capture result could be aliased at PHSx

Addit: Ozpropdev has mentioned extra (optional) counters in his COGs - would be interesting to see how he has chosen to map these. Certainly it would be preferable to find ways to avoid eating more into CODE space.
Aliasing some control bits across CTRx registers could allow steering, and atomic triggering of multiple counters?

Seairth · 2015-03-03 04:31

jmg wrote: »

I'm not sure I follow - you still need to write to CTRx ?

Indeed. The trade-offs between having the registers be addressable or not are:

Allows more direct manipulation of the registers.
Provides maximal performance (i.e. can manipulate the registers in tighter loops)
Limits the ability to add more registers, both in terms of addressable space and in terms of maintaining compatibility.
Reduces usable address space for instructions and data.

The alternative to addressable registers would obviously be dedicated instructions. A single GET/SET instruction pair would open up the ability to have up to 512 special registers (not just CTRx-related), while also freeing up some COG RAM. The down side to this is that any existing code that manipulates a register in a single instruction (other than MOV) would end up taking 3 additional registers (2 instructions and 1 working register) to accomplish the same goal, as well as add 8 clock cycles to the process.
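The cost quoted above can be checked with a bit of arithmetic. This is a rough sketch that assumes 4 clocks per instruction (the usual P1 figure) and a hypothetical GET / modify / SET sequence using one scratch register; the helper name is invented for illustration.

```python
CLOCKS_PER_INSTRUCTION = 4  # assumption: typical P1 cog instruction time

def cost(instructions, working_registers=0):
    """Return (cog longs used, clock cycles) for a code sequence."""
    longs = instructions + working_registers
    clocks = instructions * CLOCKS_PER_INSTRUCTION
    return longs, clocks

# Memory-mapped register: one instruction modifies it in place.
mapped = cost(instructions=1)
# Dedicated instructions: GET into tmp, modify tmp, SET from tmp.
dedicated = cost(instructions=3, working_registers=1)

extra_longs = dedicated[0] - mapped[0]
extra_clocks = dedicated[1] - mapped[1]
print(extra_longs, extra_clocks)
```

Under these assumptions the difference comes out to 3 extra longs and 8 extra clocks per read-modify-write, matching the figures above.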

jmg wrote: »

If you are thinking of freeing more code space, which is relatively expensive 4 port memory, then you could add indirect access dual port memory, where CTRs could sit, but that breaks a lot of code for not much gain ?

Actually, I am not thinking about this in P2 terms, at least not right now. As it stands, the P1V assumes single-ported memory. However, I have also been considering a slight rewrite to take advantage of the dual-ported RAM on the FPGAs. With that in mind, I suppose you *could* change the system to use indirect addressing. If you replaced $1F8 through $1FF with 8 indirect registers, you could still provide some semblance of backwards-compatibility (by initializing the indirect registers to reference the original shadow registers) while opening up the opportunity for more registers.

As an added bonus, you could divide the indirect address space into two, where the lower 256 addresses would be special registers and the upper 256 addresses would reference the upper 256 cog registers. This would allow indirect addressing for other purposes as well. This wouldn't be as efficient as P2's version of indirect registers, but it may still be worth having.

Of course, if you need to access more than the original registers, or want to use indirect addressing, you will still end up taking a hit both in terms of code size and maximal performance. But the trade-off may be worth it (make the shadow register space expandable and get indirect addressing).

jmg wrote: »

Field alignment is useful because it allows more compact code - there are certainly potential uses for the unused bits.

Originally, I was thinking of using some of those bits to provide a sort of "index" into a set of CTRx registers. However, the current design would make this difficult to implement, so I'm abandoning that idea.

ksltd · 2015-03-03 08:24

Without providing some context, there are an infinity of correct answers; it's essentially arbitrary.

If this is a question about general purpose P1 evolution, then code compatibility might be the defining consideration. If it's an open field question, then there's probably nothing about how the counters work that really matters. Your situation is likely somewhere in between, but we don't know ...

Also, the other "changes" being considered intersect the fate of the control registers presently located in the per-core address map. If you're thinking of an indirect addressing scheme as you suggest, then it becomes increasingly important to get the control registers out of the memory map, because suddenly software bugs have the ability to corrupt hardware state, drive bus conflicts and damage external devices. Avoiding those possibilities should probably be high on your list of goals.

jmg · 2015-03-03 12:34

Seairth wrote: »

...
The alternative to addressable registers would obviously be dedicated instructions. A single GET/SET instruction pair would open up the ability to have up to 512 special registers (not just CTRx-related), while also freeing up some COG RAM. The down side to this is that any existing code that manipulates a register in a single instruction (other than MOV) would end up taking 3 additional registers (2 instructions and 1 working register) to accomplish the same goal, as well as add 8 clock cycles to the process.

That's sounding costly, and eating into that code space we are trying to save in the first place...

Seairth wrote: »

.... As it stands, the P1V assumes single-ported memory. However, I have also been considering a slight rewrite to take advantage of the dual-ported RAM on the FPGAs.

That makes good sense, as that memory is essentially free in a FPGA

Seairth wrote: »

As an added bonus, you could divide the indirect address space into two, where the lower 256 addresses would be special registers and the upper 256 addresses would reference the upper 256 cog registers. This would allow indirect addressing for other purposes as well. This wouldn't be as efficient as P2's version of indirect registers, but it may still be worth having.

I agree. Some form of indirect memory would take buffers OUT of valuable code space, giving a real gain.
It would even be possible to share buffers between adjacent COGS

Seairth wrote: »

Originally, I was thinking of using some of those bits to provide a sort of "index" into a set of CTRx registers. However, the current design would make this difficult to implement, so I'm abandoning that idea.

? That seems the simplest approach, and it is backward compatible, and could allow a form of Counter stacking.
ie default visible are the standard counters, and spare-bits access/remap others and things like capture registers.

evanh · 2015-03-14 14:18

Seairth wrote: »

The alternative to addressable registers would obviously be dedicated instructions. A single GET/SET instruction pair would open up the ability to have up to 512 special registers (not just CTRx-related), while also freeing up some COG RAM. The down side to this is that any existing code that manipulates a register in a single instruction (other than MOV) would end up taking 3 additional registers (2 instructions and 1 working register) to accomplish the same goal, as well as add 8 clock cycles to the process.

Typical uses are sampling in nature, and by extension are unidirectional accesses, so thereby requiring a software buffer anyway. To reliably massage the data it needs to be an atomic operation.

I don't see much of a downside to having dedicated I/O load/store instructions so as to have the extra dedicated address space.

evanh · 2015-03-14 14:29

Hmm, I wonder if this matter should be raised with Chip for the Prop2 also. After all, it's not like there aren't already dedicated I/O instructions; he's already got a bunch of special instructions lined up for smartpins, not to mention waitvid and the like.

He's laid out an extra 8 I/O registers into the Prop2 Cog address space. It would be nice to get all 512 locations back as program/data space and have room for lots more I/O at the same time.

Seairth · 2015-03-17 04:32

Since this topic is also being discussed on the P2 forum, I thought I'd throw out another approach that I have been thinking about. Suppose the following changes (to P1V):

Add Indirect Registers. I think you could do this by:

Taking advantage of 2-port memory on the FPGAs to make room for indirect register resolution in the current 4-cycle timing.
Taking over the top 16 addresses for indirect registers. Yes, that means we would have 16 indirect registers to use. Sort of.

For the 16 existing memory-mapped registers, relocate them to start at indirect address zero. In other words, all special registers would be accessible via the indirect addressing registers.

To fill this out, here are a couple additional rules:

Initialize each cog such that the 16 indirect registers point to the 16 special registers (i.e. indirect addresses $000-$00F)
Use the top bit of the indirect address to break the indirect space into two groups: the lower half is dedicated to special registers and the upper half is dedicated to indirectly referencing the upper half of cog memory. This would allow the addition of more IN/OUT/DIR, CTRx, etc. It would also give indirect addressing to 256 cog registers where data is often placed (i.e. after the code) anyhow.

So, when a cog first starts, the addressing behaviour would be exactly as the P1V is now. Addresses $1F0-$1FF would indirectly resolve to the same special registers that had originally been mapped to that location.
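To make the rules above concrete, here is a minimal Python model of the proposed address resolution. It is only a sketch: the class and method names are hypothetical, `setind` stands in for whatever new instruction would change a pointer, and the model assumes that an indirect address pointing into $1F0-$1FF reaches the underlying cog RAM rather than being resolved a second time.

```python
class CogModel:
    def __init__(self):
        self.ram = [0] * 512      # general cog registers $000-$1FF
        self.special = [0] * 256  # special registers, indirect addresses $000-$0FF
        # Start-up state: cog address $1F0+n points at special register n.
        self.ind = list(range(16))

    def _resolve(self, addr):
        """Return (storage, index) for a 9-bit cog address."""
        if addr < 0x1F0:
            return self.ram, addr
        target = self.ind[addr - 0x1F0]
        if target & 0x100:            # top bit set: upper half of cog RAM
            return self.ram, target
        return self.special, target   # top bit clear: special register

    def read(self, addr):
        store, index = self._resolve(addr)
        return store[index]

    def write(self, addr, value):
        store, index = self._resolve(addr)
        store[index] = value & 0xFFFFFFFF

    def setind(self, addr, target):
        """Hypothetical SETINDx: repoint one of $1F0-$1FF."""
        self.ind[addr - 0x1F0] = target & 0x1FF


cog = CogModel()
cog.write(0x1F4, 0xFF)     # default mapping: lands in OUTA (special $004)
cog.setind(0x1F0, 0x120)   # reuse PAR's slot as a pointer to cog register $120
cog.write(0x1F0, 42)       # now an indirect write into cog RAM
```

Code that never calls `setind` sees the same sixteen registers at $1F0-$1FF as before, which is the backwards-compatibility claim.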

From there, you can change the indirect registers (with new instructions, I'm assuming) as you see fit. For instance, if you aren't using video or counters, you free up 8 indirect registers for other uses. Or, if you want to re-arrange the special register (e.g. INA, OUTA, DIRA, INB, OUTB, DIRB instead of INA, INB, OUTA, OUTB, DIRA, DIRB), you can do that too. Or, if you add Port C or inter-cog communication through a "PortD", you can map that in. Or maybe you want to throw some dedicated SPI hardware in and need access to a control register and shift register. You could do that as well.

Anyhow, I think you get the idea. The only time you are going to have a challenge is when you want to use all of the ports AND you also want to use indirect addressing. Of course, any time you are pushing the limits of the resources, you have to start making compromises. And that doesn't seem like a bad deal considering what you gain in return.

evanh · 2015-03-17 14:18

I'd call that "bank switching" rather than "indirection". I'm now suspicious that bank switching is what you had intended in the opening post too.

Yep, it's a perfectly reasonable way of dealing with a limited address space. The downside is when wanting to use two items in two mutually exclusive banks. It requires constant flipping back and forth, but as long as that is rare and is a quick operation then it shouldn't be a big deal.

Seairth · 2015-03-17 18:45

evanh wrote: »

I'd call that "bank switching" rather than "indirection". I'm now suspicious that bank switching is what you had intended in the opening post too.

Yep, it's a perfectly reasonable way of dealing with a limited address space. The downside is when wanting to use two items in two mutually exclusive banks. It requires constant flipping back and forth, but as long as that is rare and is a quick operation then it shouldn't be a big deal.

No, this isn't bank switching. You are not swapping out the entire set of 16 addresses at a time. Each is separately settable, just as the INDx registers in the P2. The difference is that, depending on the address, they refer to special registers (e.g. INA, CTRB, etc.) or indirectly to the top half of the 512 general cog registers. The second part is absolutely indirection.

In fact, I had not originally thought of this idea when I posted the original question. I was originally thinking of having dedicated instructions to load/store those registers. At the same time, I was wondering about adding indirect registers, but couldn't think of a way to fit them in other than to do what Chip had done with the P2. But I didn't want to take up even more address space. Which is when the idea hit me. I could use indirection in such a way that a cog would start up exactly as it does now, with the special registers all located at $1F0-1FF. In reality, these would be indirect registers and the cog would default to indirectly referencing the same special registers that were originally at that location. All existing code would work exactly as it does now. No changes required.

Only if you wanted to take advantage of the indirection would you then have to decide which special registers you wanted to share the indirect addressing functionality with. In reality, I suspect most cogs do not make use of *all* of the top 16 addresses (i.e. counters, but not vid; port a and b, but not counters; etc.). In which case, you would typically have at least 2-3 registers that could be used for indirection. At the very least, you could usually reassign $1F0 (PAR) once you have read it. And, as you point out, if you really needed to use *all* of those special registers, you could switch back and forth as a worst case scenario. My guess is that this is a scenario that would rarely happen.

Actually, you would also have to make trade-offs if you started adding more special registers to the P1V build. As I suggested earlier, you might choose to add 32 more pins (Port C), which might be mapped to $010-012 (remember that $000-00F are the original 16 special registers). In which case, if you needed to use Ports A-C, you would have to remap three of the other indirect registers to make them available.

Or, maybe it turns out that your DIRx registers are all statically set (i.e. don't change during the course of operation). In which case, you could do the following:

Redirect $1F4 (normally OUTA at $004) to DIRC (at $012).
Set DIRA, DIRB, and DIRC appropriately (through $1F6, $1F7, and $1F4 respectively).
Redirect $1F2-1F7 (normally, INA, INB, OUTA, OUTB, DIRA, DIRB) to $002, $003, $010, $004, $005, and $011 (INA, INB, INC, OUTA, OUTB, OUTC).

And voila! You've mapped in Port C without losing PAR, CNT, CTRx, FRQx, PHSx, VCFG, or VSCL!
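The three steps can be traced with a short Python sketch. Everything here is illustrative: the Port C addresses ($010-$012) are the hypothetical ones suggested above, the direction values are arbitrary, and the `ind` table stands in for the proposed indirect registers.

```python
# Special registers by indirect address; Port C at $010-$012 is hypothetical.
SPECIAL = {0x000: "PAR", 0x001: "CNT", 0x002: "INA", 0x003: "INB",
           0x004: "OUTA", 0x005: "OUTB", 0x006: "DIRA", 0x007: "DIRB",
           0x010: "INC", 0x011: "OUTC", 0x012: "DIRC"}

# Start-up state: cog address $1F0+n refers to special register n.
ind = {0x1F0 + n: n for n in range(16)}
regs = {}  # values written to special registers, keyed by name

def write(addr, value):
    """Write through whatever special register addr currently points at."""
    regs[SPECIAL[ind[addr]]] = value

# Step 1: redirect $1F4 (normally OUTA) to DIRC.
ind[0x1F4] = 0x012
# Step 2: set the three direction registers once (example values).
write(0x1F6, 0x0000FFFF)  # DIRA
write(0x1F7, 0x00000000)  # DIRB
write(0x1F4, 0xFFFFFFFF)  # DIRC, via the redirected slot
# Step 3: redirect $1F2-$1F7 to INA, INB, INC, OUTA, OUTB, OUTC.
targets = [0x002, 0x003, 0x010, 0x004, 0x005, 0x011]
for addr, target in zip(range(0x1F2, 0x1F8), targets):
    ind[addr] = target

visible = [SPECIAL[ind[a]] for a in range(0x1F0, 0x1F8)]
print(visible)
```

After step 3 the DIRx registers are no longer reachable, but they keep the values written in step 2, while $1F8-$1FF still point at the counter and video registers.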

You get the idea, I'm sure. The point is, I don't see this as bank switching at all. Also, it's backwards compatible, allows room for expansion, and gives us indirect registers!

jmg · 2015-03-17 19:08

There seem to be a number of ideas alive now.
The semantics have me somewhat lost, but when discussing opcodes it is useful to show what bits of the current opcodes are used for the new features.
Any flip schemes to give more memory have to be traded off against the code needed to do that flipping.
- you can go backwards, if more lines/data are needed on a running pgm, than what you 'freed up'.

evanh · 2015-03-17 22:36

Seairth wrote: »

No, this isn't bank switching. You are not swapping out the entire set of 16 addresses at a time. Each is separately settable, just as the INDx registers in the P2. ...

... And voila! You've mapped in Port C without losing PAR, CNT, CTRx, FRQx, PHSx, VCFG, or VSCL!

You get the idea, I'm sure. The point is, I don't see this as bank switching at all. Also, it's backwards compatible, allows room for expansion, and gives us indirect registers!

Whatever you are proposing, it's not indirection. Indirection provides pointers for C, it does not extend addressing. Remapping extends addressable space. As does an added I/O map, albeit via new instructions.

Remapping is the more generic name I guess. Though, it's the same hardware as bank switching.

Segmentation would be another option but that adds more complexity to the pipeline, as does indirection. I think Intel/AMD have 20 odd stages these days.

Seairth · 2015-03-18 07:59

jmg wrote: »

There seem to be a number of ideas alive now.
The semantics have me somewhat lost, but when discussing opcodes it is useful to show what bits of the current opcodes are used for the new features.
Any flip schemes to give more memory have to be traded off against the code needed to do that flipping.
- you can go backwards, if more lines/data are needed on a running pgm, than what you 'freed up'.

Assuming you are talking about the semantics of my proposal, let me put it another way:

We are all familiar with the indirect register scheme designed for the P2, correct? In that case, Chip dedicated two cog addresses to accessing the data in those indirection registers. If you recall, when the cog first starts up, those registers are pointing to the cog memory that's actually at that location.

Now, on the P1V, what I'm proposing is to extend the indirect registers to take up all 16 of the top addresses, which are currently hard-coded to map to special registers. Just like the P2 INDx registers pointing to their own address (rather than the actual cog memory at that address), the P1V INDx registers would point to the special registers that were originally accessible at those addresses. But, because it would be nice to use the INDx as actual indirection registers (i.e. point to cog memory), I suggested that the special registers could have effective indirect addresses from $000 to $0FF, while cog memory from $100 to $1FF would be indirectly accessible via indirect addresses $100 to $1FF.

No, this indirection ability is not as flexible as on the P2, since all manipulation of the INDx registers requires explicit instructions (i.e. no instruction format that pre/post-increments the INDx value). These would look something like:

SETINDx S/# 'Sets indirect register to S/#
GETINDx D 'Gets current value of indirect register and store in D
INCINDx S/# 'Adds S/# to current value of indirect register (signed add)

You might even be able to get in an instruction like:

MODINDx S

- S is a value in the format of 0000XX_00000000_LLLLLLLLL_HHHHHHHHH (note: layout allows use of MOVI/MOVD/MOVS)
- X is the amount to inc/dec INDx for testing against L and H
- If INDx + X is less than L, set to H. If INDx + X is greater than H, set to L.
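The wrap rule for MODINDx can be sketched in Python as follows. This is a guess at the semantics, not a specification: it assumes X is a 2-bit two's-complement step (-2 to +1), which the description leaves open, and that INDx simply becomes INDx + X when the result stays within L..H. The function names are hypothetical.

```python
def decode_modind(s):
    """Split S (0000XX_00000000_LLLLLLLLL_HHHHHHHHH) into (step, L, H)."""
    x = (s >> 26) & 0x3
    lo = (s >> 9) & 0x1FF
    hi = s & 0x1FF
    # Assumption: X is a signed 2-bit value, so %11 means -1 and %10 means -2.
    step = x - 4 if x & 0x2 else x
    return step, lo, hi

def modind(ind, s):
    """Step INDx by X and wrap it to stay within L..H."""
    step, lo, hi = decode_modind(s)
    nxt = ind + step
    if nxt < lo:
        return hi
    if nxt > hi:
        return lo
    return nxt

# A 4-entry circular buffer at $100-$103, stepping forward by 1.
forward = (1 << 26) | (0x100 << 9) | 0x103
# The same buffer, stepping backward by 1 (X = %11).
backward = (3 << 26) | (0x100 << 9) | 0x103
```

With these values, `modind(0x103, forward)` wraps to $100 and `modind(0x100, backward)` wraps to $103, which is the circular-buffer behaviour the L and H fields are meant to provide.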

Yes, this is all manual, compared to what the P2 does. But I don't see any way around it without reworking the instruction format like Chip did with the P2.

And, to reiterate, the proposed change does not break *any* existing code. If you never call one of the above instructions, the default behavior is *exactly* the same as the current P1V.

Seairth · 2015-03-18 08:01

evanh wrote: »

Whatever you are proposing, it's not indirection. Indirection provides pointers for C, it does not extend addressing. Remapping extends addressable space. As does an added I/O map, albeit via new instructions.

Remapping is the more generic name I guess. Though, it's the same hardware as bank switching.

Segmentation would be another option but that adds more complexity to the pipeline, as does indirection. I think Intel/AMD have 20 odd stages these days.

I was using the term as it is used in the P2 architecture. The only difference is that some of the indirect addresses map to special registers instead of cog memory. From the code perspective, it is functionally identical. But that's not the point!

The question is: would this be a good solution (regardless of what you call it) to the problem of adding new special registers (including INDx) without taking up more cog memory address space?

jmg · 2015-03-18 12:48

Seairth wrote: »

.... But, because it would be nice to use the INDx as actual indirection registers (i.e. point to cog memory), I suggested that the special registers could have effective indirect addresses from $000 to $0FF, while cog memory from $100 to $1FF would be indirectly accessible via indirect addresses $100 to $1FF.

No, this indirection ability is not as flexible as on the P2, since all manipulation of the INDx registers requires explicit instructions (i.e. no instruction format that pre/post-increments the INDx value). These would look something like:

SETINDx S/# 'Sets indirect register to S/#
GETINDx D 'Gets current value of indirect register and store in D
INCINDx S/# 'Adds S/# to current value of indirect register (signed add)

The code cost of this indirection is quite high, and what you propose is very close to what an 8051 already does.
However, in the 8051, the direct (SFR) space is always in front, as that mostly needs fast and random access, and the indirect memory (@Ri) in the 8051 always needs the MOV Ri,#nn & MOV @Ri pair you have above.
I think some variants have a MOV @Ri++ which they enable separately.

On a P1V, you could use one bit of the SETINDx mode change instruction to control ++/--/ maybe size of ++ etc ( see * MOVR below )

In an 8051, the register is only 8 bits, not 32b as in a Prop, so the Prop can access much more memory with the @Rn opcode.
Also in an 8051, to squeeze the most out of the opcode sizes, they have two memory planes.
Indirect goes 00..FF and direct is 00..7F (identical to indirect up to here); from 80..FF direct splits to SFR space
(no indirect access to this) and another 128 bytes is only indirect (so 384 total bytes can be reached).
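The two memory planes described here can be modelled in a few lines of Python. This is a simplified sketch of the layout as described (256 bytes of internal RAM plus 128 SFR bytes); the class and method names are made up, and real parts differ in which SFR addresses actually exist.

```python
class Mcs51Ram:
    def __init__(self):
        self.iram = bytearray(256)  # 00..FF via @Ri; 00..7F also via direct
        self.sfr = bytearray(128)   # 80..FF via direct addressing only

    def read_direct(self, addr):
        """Direct addressing: RAM below 80h, SFR space from 80h up."""
        return self.iram[addr] if addr < 0x80 else self.sfr[addr - 0x80]

    def write_direct(self, addr, value):
        if addr < 0x80:
            self.iram[addr] = value & 0xFF
        else:
            self.sfr[addr - 0x80] = value & 0xFF

    def read_indirect(self, addr):
        """Indirect (@Ri) addressing: always internal RAM, never SFRs."""
        return self.iram[addr & 0xFF]

    def write_indirect(self, addr, value):
        self.iram[addr & 0xFF] = value & 0xFF

    def total_bytes(self):
        return len(self.iram) + len(self.sfr)


mem = Mcs51Ram()
mem.write_direct(0x90, 0xAA)    # an SFR location
mem.write_indirect(0x90, 0x55)  # a different byte: upper RAM
mem.write_direct(0x30, 7)       # lower RAM, visible both ways
```

The same 8-bit address 90h names two different bytes depending on the addressing mode, which is how 384 bytes fit behind an 8-bit address.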

The P1V lends itself to this same approach, 512 words can be as-now, and another area of COG-local memory can be added (128/256/512 ie whatever makes FPGA sense, and size can even be a compile-time choice) that is @Ri accessible, used for buffers and arrays.
@Ri index values above COG space, can remap into HUB, (now hub-slot-paced) and even off-chip, with some fast SPI memory HW support. (slower again than HUB, but not by much -eg the HyperRAM specs 36ns random access times at full clock speed)
Code could not run in that memory, but it needs fewer ports, and can be physically anywhere.

Notice the big gains made here are in the additional local memory, and in the ease of COG/HUB/External handling.
Users need know nothing other than the address, the HW does the rest.
There is less gain from borrowing the memory space used by the present IO registers, as something like the feature of @Ri++ can free more CODE space.

An indirect flag would likely fit well into a FPGA, there are actually 4 'free' flags in a FPGA.
hardware wise, if code that was doing a R/W to a register saw that @Ri flag set, it would flip it indirect, likely adding another opcode cycle.

Looking at those 4 free flags, we have something like
[IDss]
I = Set for @Ri use, default clear, is normal memory
D = Direction of Auto-INC
ss - size 0x0 = +/- 0, 0x1 = +/-1, 0x2 = +/-2 0x3 = +/-4 & also sets 8b/16b/32b read alignments
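As an illustration of how those four bits might be decoded, here is a small Python sketch. The bit order (I in bit 3, D in bit 2, ss in bits 1..0) and the meaning of D (set means decrement) are assumptions, since neither is fixed above; the 8b/16b/32b alignment side effect is not modelled.

```python
STEP_SIZES = {0b00: 0, 0b01: 1, 0b10: 2, 0b11: 4}

def decode_flags(flags):
    """Decode a 4-bit [IDss] field into (indirect, signed_step)."""
    indirect = bool(flags & 0b1000)   # I: use this register as @Ri
    decrement = bool(flags & 0b0100)  # D: assumed 1 = auto-decrement
    size = STEP_SIZES[flags & 0b0011] # ss: step of 0, 1, 2 or 4
    step = -size if decrement else size
    return indirect, step

print(decode_flags(0b1011))  # indirect, post-step by +4
print(decode_flags(0b1101))  # indirect, post-step by -1
print(decode_flags(0b0000))  # plain register, no stepping
```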

That allows any COG memory at all to be used this way, a slight over-kill, but the bits are 'free' in FPGA, and any-of follows the COG philosophy, and allows fairly free user cut and paste.
In contrast, the 8051 has only 2 @Ri registers, and of course, you always run out of them and worry about what else is using them...

* Looking at the Prop, I can see a MOVR could be added to MOVI/MOVD/MOVS - each of those loads 9 bits, and a 4th opcode nicely completes the set, and now loads all/any of 36 bits, which is exactly a FPGA width.
That opcode would become your SETINDx

evanh · 2015-03-19 06:09

Seairth wrote: »

I was using the term as it is used in the P2 architecture. The only difference is that some of the indirect addresses map to special registers instead of cog memory. From the code perspective, it is functionally identical. But that's not the point!

Fair call. Apologies, I wasn't being very imaginative.

The question is: would this be a good solution (regardless of what you call it) to the problem of adding new special registers (including INDx) without taking up more cog memory address space?

Would need to be tested is about all I should say.
