The need for CTRx addressable registers
Seairth
(I vaguely recall prior discussion on this topic, but don't remember what was said....)
There are three registers per counter:
- CTRx : The control register
- PHSx : The counter accumulator
- FRQx : The value to conditionally add to PHSx
Do all of these need to be implemented as addressable registers? I understand the need for PHSx, and possibly FRQx, to be addressable. But what about CTRx? And, if CTRx does need to be kept addressable, how necessary is it to have the fields aligned for use with MOVI, MOVD, and MOVS?
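For reference, here is a minimal Python sketch of why the alignment matters. It models the three P1 field-move instructions on a 32-bit value and builds a CTRx value from them. The helper names (`movi`, `movd`, `movs`, `make_ctr`) are only illustrative; the field positions are those of the existing P1 CTRx layout (CTRMODE in bits 30..26, PLLDIV in 25..23, BPIN in 14..9, APIN in 5..0).

```python
MASK32 = 0xFFFFFFFF
FIELD9 = 0x1FF  # MOVI/MOVD/MOVS each write a 9-bit field


def movs(reg, value):
    """Model of MOVS: replace bits 8..0 with the low 9 bits of value."""
    return ((reg & ~FIELD9) | (value & FIELD9)) & MASK32


def movd(reg, value):
    """Model of MOVD: replace bits 17..9 with the low 9 bits of value."""
    return ((reg & ~(FIELD9 << 9)) | ((value & FIELD9) << 9)) & MASK32


def movi(reg, value):
    """Model of MOVI: replace bits 31..23 with the low 9 bits of value."""
    return ((reg & ~(FIELD9 << 23)) | ((value & FIELD9) << 23)) & MASK32


def make_ctr(ctrmode, plldiv, bpin, apin):
    """Build a CTRx value using one field-move per group of fields."""
    ctr = 0
    ctr = movi(ctr, (ctrmode << 3) | plldiv)  # CTRMODE + PLLDIV share the I field
    ctr = movd(ctr, bpin)                     # BPIN sits in the low bits of the D field
    ctr = movs(ctr, apin)                     # APIN sits in the low bits of the S field
    return ctr


if __name__ == "__main__":
    # NCO single-ended mode (%00100) on pin 5
    print(hex(make_ctr(0b00100, 0, 0, 5)))
```

With the fields aligned like this, each group can be patched with a single instruction and no explicit masking, which is the compactness being asked about.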
Comments
If you are thinking of freeing more code space, which is relatively expensive 4-port memory, then you could add indirect-access dual-port memory, where the CTRs could sit, but that breaks a lot of code for not much gain?
Field alignment is useful because it allows more compact code - there are certainly potential uses for the unused bits.
Things like Quadrature Counting modes and bits for Atomic (optionally Armed/triggered) Capture are obvious ones.
To avoid consuming memory the Capture result could be aliased at PHSx
Addit: Ozpropdev has mentioned extra (optional) counters in his COGs - would be interesting to see how he has chosen to map these. Certainly it would be preferable to find ways to avoid eating more into CODE space.
Aliasing some control bits across CTRx registers could allow steering, and atomic triggering of multiple counters?
Indeed. The trade-offs between having the registers be addressable or not are:
The alternative to addressable registers would obviously be dedicated instructions. A single GET/SET instruction pair would open up the ability to have up to 512 special registers (not just CTRx-related), while also freeing up some COG RAM. The down side to this is that any existing code that manipulates a register in a single instruction (other than MOV) would end up taking 3 additional registers (2 instructions and 1 working register) to accomplish the same goal, as well as add 8 clock cycles to the process.
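The overhead figures above can be checked with a few lines of Python. This is only back-of-envelope arithmetic, assuming the usual 4 clocks per P1 cog instruction and a hypothetical GET/modify/SET sequence that needs one scratch long; `cost` is an illustrative helper, not anything from the P1V source.

```python
CLOCKS_PER_INSTRUCTION = 4  # typical P1 cog instruction timing (assumed for all three)


def cost(instructions, working_registers=0):
    """Return cog longs consumed and clocks taken by a code sequence."""
    return {
        "longs": instructions + working_registers,
        "clocks": instructions * CLOCKS_PER_INSTRUCTION,
    }


# Today: one read-modify-write instruction directly on the special register,
# e.g. ADD FRQA, delta
direct = cost(instructions=1)

# With dedicated instructions (hypothetical): GET tmp / ADD tmp, delta / SET tmp
via_get_set = cost(instructions=3, working_registers=1)

extra_longs = via_get_set["longs"] - direct["longs"]
extra_clocks = via_get_set["clocks"] - direct["clocks"]

if __name__ == "__main__":
    print(extra_longs, extra_clocks)
```

That gives 3 extra longs and 8 extra clocks per access, matching the numbers quoted.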
Actually, I am not thinking about this in P2 terms, at least not right now. As it stands, the P1V assumes single-ported memory. However, I have also been considering a slight rewrite to take advantage of the dual-ported RAM on the FPGAs. With that in mind, I suppose you *could* change the system to use indirect addressing. If you replaced $1F8 through $1FF with 8 indirect registers, you could still provide some semblance of backwards-compatibility (by initializing the indirect registers to reference the original shadow registers) while opening up the opportunity for more registers.
As an added bonus, you could divide the indirect address space into two, where the lower 256 addresses would be special registers and the upper 256 addresses would reference the upper 256 cog registers. This would allow indirect addressing for other purposes as well. This wouldn't be as efficient as P2's version of indirect registers, but it may still be worth having.
Of course, if you need to access more than the original registers, or want to use indirect addressing, you will still end up taking a hit both in terms of code size and maximal performance. But the trade-off may be worth it (make the shadow register space expandable and get indirect addressing).
Originally, I was thinking of using some of those bits to provide a sort of "index" into a set of CTRx registers. However, the current design would make this difficult to implement, so I'm abandoning that idea.
If this is a question about general purpose P1 evolution, then code compatibility might be the defining consideration. If it's an open field question, then there's probably nothing about how the counters work that really matters. Your situation is likely somewhere in between, but we don't know ...
Also, the other "changes" being considered intersect the fate of the control registers presently located in the per-core address map. If you're thinking of an indirect addressing scheme as you suggest, then it becomes increasingly important to get the control registers out of the memory map, because suddenly software bugs have the ability to corrupt hardware state, drive bus conflicts and damage external devices. Those possibilities should probably be high on your list of non-goals.
That makes good sense, as that memory is essentially free in an FPGA.
I agree. Some form of indirect memory would take buffers OUT of valuable code space, giving a real gain.
It would even be possible to share buffers between adjacent COGS
That seems the simplest approach, and it is backward compatible, and could allow a form of Counter stacking.
ie default visible are the standard counters, and spare-bits access/remap others and things like capture registers.
Typical uses are sampling in nature, and by extension are unidirectional accesses, so thereby requiring a software buffer anyway. To reliably massage the data it needs to be an atomic operation.
I don't see much of a downside to having dedicated I/O load/store instructions so as to have the extra dedicated address space.
He's laid out an extra 8 I/O registers into the Prop2 Cog address space. It would be nice to get all 512 locations back as program/data space and have room for lots more I/O at the same time.
Add Indirect Registers. I think you could do this by:
For the 16 existing memory-mapped registers, relocate them to start at indirect address zero. In other words, all special registers would be accessible via the indirect addressing registers.
To fill this out, here are a couple additional rules:
So, when a cog first starts, the addressing behaviour would be exactly as the P1V is now. Addresses $1F0-$1FF would indirectly resolve to the same special registers that had originally been mapped to that location.
From there, you can change the indirect registers (with new instructions, I'm assuming) as you see fit. For instance, if you aren't using video or counters, you free up 8 indirect registers for other uses. Or, if you want to re-arrange the special registers (e.g. INA, OUTA, DIRA, INB, OUTB, DIRB instead of INA, INB, OUTA, OUTB, DIRA, DIRB), you can do that too. Or, if you add Port C or inter-cog communication through a "PortD", you can map that in. Or maybe you want to throw some dedicated SPI hardware in and need access to a control register and shift register. You could do that as well.
Anyhow, I think you get the idea. The only time you are going to have a challenge is when you want to use all of the ports AND you also want to use indirect addressing. Of course, any time you are pushing the limits of the resources, you have to start making compromises. And that doesn't seem like a bad deal considering what you gain in return.
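To make the start-up behaviour concrete, here is a small Python model of the sixteen slots at $1F0-$1FF. It is a sketch only: the slot table, `reset_slots`, `remap` and `name_at` are made-up names, and the only thing taken from the real P1 is the order of the sixteen special registers.

```python
# The sixteen P1 special registers in their existing order at $1F0-$1FF
SPECIAL = ["PAR", "CNT", "INA", "INB", "OUTA", "OUTB", "DIRA", "DIRB",
           "CTRA", "CTRB", "FRQA", "FRQB", "PHSA", "PHSB", "VCFG", "VSCL"]
SLOT_BASE = 0x1F0


def reset_slots():
    """Cog start: slot i (cog address $1F0+i) points at special register i."""
    return list(range(16))


def remap(slots, cog_address, special_index):
    """Point one slot at a different special register (hypothetical operation)."""
    slots[cog_address - SLOT_BASE] = special_index


def name_at(slots, cog_address):
    """Which special register a given cog address currently resolves to."""
    return SPECIAL[slots[cog_address - SLOT_BASE]]


if __name__ == "__main__":
    slots = reset_slots()
    # Re-arrange the port registers as INA, OUTA, DIRA, INB, OUTB, DIRB
    for address, target in zip(range(0x1F2, 0x1F8), [2, 4, 6, 3, 5, 7]):
        remap(slots, address, target)
    print([name_at(slots, a) for a in range(0x1F2, 0x1F8)])
```

Until something calls `remap`, every address resolves exactly as it does on the current P1V, which is the backwards-compatibility argument.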
Yep, it's a perfectly reasonable way of dealing with a limited address space. The downside is when wanting to use two items in two mutually exclusive banks. It requires constant flipping back and forth, but as long as that is rare and is a quick operation then it shouldn't be a big deal.
No, this isn't bank switching. You are not swapping out the entire set of 16 addresses at a time. Each is separately settable, just like the INDx registers in the P2. The difference is that, depending on the address, they refer to special registers (e.g. INA, CTRB, etc.) or indirectly to the top half of the 512 general cog registers. The second part is absolutely indirection.
In fact, I had not originally thought of this idea when I posted the original question. I was originally thinking of having dedicated instructions to load/store those registers. At the same time, I was wondering about adding indirect registers, but couldn't think of a way to fit them in other than to do what Chip had done with the P2. But I didn't want to take up even more address space. Which is when the idea hit me. I could use indirection in such a way that a cog would start up exactly as it does now, with the special registers all located at $1F0-$1FF. In reality, these would be indirect registers and the cog would default to indirectly referencing the same special registers that were originally at that location. All existing code would work exactly as it does now. No changes required.
Only if you wanted to take advantage of the indirection would you then have to decide which special registers you wanted to share the indirect addressing functionality with. In reality, I suspect most cogs do not make use of *all* of the top 16 addresses (i.e. counters, but not vid; port a and b, but not counters; etc.) In which case, you would typically have at least 2-3 registers that could be used for indirection. At the very least, you could usually reassign $1F0 (PAR) once you have read it. And, as you point out, if you really needed to use *all* of those special registers, you could switch back and forth as a worst case scenario. My guess is that this is a scenario that would rarely happen.
Actually, you would also have to make trade-offs if you started adding more special registers to the P1V build. As I suggested earlier, you might choose to add 32 more pins (Port C), which might be mapped to $010-012 (remember that $000-00F are the original 16 special registers). In which case, if you needed to use Ports A-C, you would have to remap three of the other indirect registers to make them available.
Or, maybe it turns out that your DIRx registers are all statically set (i.e. don't change during the course of operation). In which case, you could do the following:
And voila! You've mapped in Port C without losing PAR, CNT, CTRx, FRQx, PHSx, VCFG, or VSCL!
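As an illustration, here is one possible sequence that produces that result, simulated in Python. It is a guess at the kind of steps meant, not the original listing: it assumes Port C's registers sit at indirect addresses $010 (INC), $011 (OUTC) and $012 (DIRC) in that order, and `Cog`, `setind`, `write` and `map_in_port_c` are hypothetical names.

```python
NAMES = dict(enumerate(
    ["PAR", "CNT", "INA", "INB", "OUTA", "OUTB", "DIRA", "DIRB",
     "CTRA", "CTRB", "FRQA", "FRQB", "PHSA", "PHSB", "VCFG", "VSCL"]))
NAMES.update({0x10: "INC", 0x11: "OUTC", 0x12: "DIRC"})  # assumed Port C layout


class Cog:
    """Toy model: 16 remappable slots at $1F0-$1FF in front of the special registers."""

    def __init__(self):
        self.slots = list(range(16))           # start-up mapping, same as the P1V today
        self.special = {index: 0 for index in NAMES}

    def setind(self, cog_address, target):
        """Point the slot at cog_address to special register `target`."""
        self.slots[cog_address - 0x1F0] = target

    def write(self, cog_address, value):
        """Write through the slot to whatever special register it points at."""
        self.special[self.slots[cog_address - 0x1F0]] = value & 0xFFFFFFFF

    def visible(self):
        return [NAMES[target] for target in self.slots]


def map_in_port_c(cog, dira, dirb, dirc):
    cog.write(0x1F6, dira)    # set DIRA once through its default slot
    cog.write(0x1F7, dirb)    # set DIRB once through its default slot
    cog.setind(0x1F6, 0x12)   # borrow the DIRA slot to reach DIRC
    cog.write(0x1F6, dirc)    # set DIRC once
    cog.setind(0x1F6, 0x10)   # the DIRA slot now shows INC
    cog.setind(0x1F7, 0x11)   # the DIRB slot now shows OUTC


if __name__ == "__main__":
    cog = Cog()
    map_in_port_c(cog, dira=0x000000FF, dirb=0, dirc=0x000000F0)
    print(cog.visible())
```

The DIRx registers keep the values written to them; they are simply no longer visible in the cog address map, which is fine if they never change again.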
You get the idea, I'm sure. The point is, I don't see this as bank switching at all. Also, it's backwards compatible, allows room for expansion, and gives us indirect registers!
The semantics have me somewhat lost, but when discussing opcodes it is useful to show what bits of the current opcodes are used for the new features.
Any flip schemes to give more memory have to be traded off against the code needed to do that flipping.
- you can go backwards, if more lines/data are needed on a running pgm, than what you 'freed up'.
Whatever you are proposing, it's not indirection. Indirection provides pointers for C, it does not extend addressing. Remapping extends addressable space. As does an added I/O map, albeit via new instructions.
Remapping is the more generic name I guess. Though, it's the same hardware as bank switching.
Segmentation would be another option but that adds more complexity to the pipeline, as does indirection. I think Intel/AMD have 20 odd stages these days.
Assuming you are talking about the semantics of my proposal, let me put it another way:
We are all familiar with the indirect register scheme designed for the P2, correct? In that case, Chip dedicated two cog addresses to accessing the data in those indirection registers. If you recall, when the cog first starts up, those registers are pointing to the cog memory that's actually at that location.
Now, on the P1V, what I'm proposing is to extend the indirect registers to take up all 16 of the top addresses, which are currently hard-coded to map to special registers. Just like the P2 INDx registers pointing to their own address (or rather, the actual cog memory at that address), the P1V INDx registers would point to the special registers that were originally accessible at those addresses. But, because it would be nice to use the INDx as actual indirection registers (i.e. point to cog memory), I suggested that the special registers could have effective indirect addresses from $000 to $0FF, while cog memory from $100 to $1FF would be indirectly accessible via indirect addresses $100 to $1FF.
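The address split can be written down as a tiny Python function. This is a sketch of the proposal as described, with `resolve` being an illustrative name; what happens when an indirect address itself falls in $1F0-$1FF is not specified here and is not modelled.

```python
def resolve(ind_value):
    """Decide what a 9-bit INDx value refers to under the proposed split.

    $000-$0FF -> special register space (INA, CTRB, future additions, ...)
    $100-$1FF -> ordinary cog memory at that same address
    """
    ind_value &= 0x1FF  # INDx values are treated as 9-bit addresses
    if ind_value < 0x100:
        return ("special", ind_value)
    return ("cog", ind_value)


if __name__ == "__main__":
    print(resolve(0x008))  # CTRA under the start-up numbering
    print(resolve(0x180))  # a long in the upper half of cog memory
```

So one bit of the INDx value selects between the (expandable) special register space and true indirection into the upper 256 cog longs.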
No, this indirection ability is not as flexible as on the P2, since all manipulation of the INDx registers requires explicit instructions (i.e. no instruction format that pre/post-increments the INDx value). These would look something like:
SETINDx S/# 'Sets indirect register to S/#
GETINDx D 'Gets current value of indirect register and store in D
INCINDx S/# 'Adds S/# to current value of indirect register (signed add)
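A behavioural sketch of those three instructions in Python, for one INDx register. The 9-bit width and the wrap-around on overflow are assumptions (the post only says "signed add"), and the class and method names are illustrative.

```python
class IndReg:
    """Toy model of one proposed INDx register."""

    def __init__(self, reset_value):
        self.value = reset_value & 0x1FF   # assumed 9-bit indirect address

    def setind(self, s):
        """SETINDx S/#: load the indirect register."""
        self.value = s & 0x1FF

    def getind(self):
        """GETINDx D: return the current value (would be stored in D)."""
        return self.value

    def incind(self, s):
        """INCINDx S/#: signed add, wrapping within 9 bits (assumed)."""
        self.value = (self.value + s) & 0x1FF


if __name__ == "__main__":
    ind = IndReg(reset_value=0x008)   # e.g. the slot that starts out as CTRA
    ind.setind(0x100)                 # point at cog memory instead
    ind.incind(3)
    ind.incind(-1)
    print(hex(ind.getind()))
```

Walking a buffer then costs one INCINDx per element on top of the access itself, which is the manual overhead being compared against the P2.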
You might even be able to get in an instruction like:
MODINDx S
- S is a value in the format of 0000XX_00000000_LLLLLLLLL_HHHHHHHHH (note: layout allows use of MOVI/MOVD/MOVS)
- X is the amount to inc/dec INDx for testing against L and H
- If INDx + X is less than L, set to H. If INDx + X is greater than H, set to L.
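Here is how that wrap rule might behave, sketched in Python. The S layout is taken from the format above (XX in bits 27..26, L in bits 17..9, H in bits 8..0). Treating XX as a 2-bit two's complement step (-2..+1) is an assumption made here so that both inc and dec are expressible; `pack_modind` and `modind` are illustrative names.

```python
def pack_modind(step, low, high):
    """Build S as 0000XX_00000000_LLLLLLLLL_HHHHHHHHH."""
    return ((step & 0x3) << 26) | ((low & 0x1FF) << 9) | (high & 0x1FF)


def modind(ind, s):
    """Return the new INDx value after MODINDx S (as modelled here)."""
    step = (s >> 26) & 0x3
    if step & 0x2:            # assumed: XX is a signed 2-bit value
        step -= 4
    low = (s >> 9) & 0x1FF
    high = s & 0x1FF
    candidate = ind + step
    if candidate < low:       # stepped below the window: wrap to the top
        return high
    if candidate > high:      # stepped above the window: wrap to the bottom
        return low
    return candidate


if __name__ == "__main__":
    s = pack_modind(step=1, low=0x100, high=0x103)
    ind = 0x102
    for _ in range(3):
        ind = modind(ind, s)
        print(hex(ind))
```

Because L and H land in the D and S field positions, the window bounds could be patched at run time with MOVD and MOVS, and the step with MOVI.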
Yes, this is all manual, compared to what the P2 does. But I don't see any way around it without reworking the instruction format like Chip did with the P2.
And, to reiterate, the proposed change does not break *any* existing code. If you never call one of the above instructions, the default behavior is *exactly* the same as the current P1V.
I was using the term as it is used in the P2 architecture. The only difference is that some of the indirect addresses map to special registers instead of cog memory. From the code perspective, it is functionally identical. But that's not the point!
The question is: would this be a good solution (regardless of what you call it) to the problem of adding new special registers (including INDx) without taking up more cog memory address space?
The code cost of this indirection is quite high, and what you propose is very close to what an 8051 already does.
However, in the 8051, the direct (SFR) space is always in front, as that mostly needs fast and random access, and the indirect memory (@Ri) in the 8051 always needs the MOV Ri,#nn and MOV @Ri pair you have above.
I think some variants have a MOV @Ri++ which they enable separately.
On a P1V, you could use one bit of the SETINDx mode change instruction to control ++/--/ maybe size of ++ etc ( see * MOVR below )
In an 8051, the register is only 8 bits, not 32b as in a Prop, so the Prop can access much more memory with the @Rn opcode.
Also, in an 8051, to squeeze the most out of the opcode sizes, they have two memory planes.
Indirect goes 00..FF and direct is 00..7F (identical to indirect up to here); from 80..FF, direct splits off to SFR space
(no indirect access to this), and another 128 bytes is indirect-only (so 384 total bytes can be reached).
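The two-plane arrangement is easy to see in a few lines of Python. This is just a model of the addressing rule described above (as found on 8052-style parts with 256 bytes of internal RAM); `resolve_8051` is an illustrative name.

```python
def resolve_8051(address, indirect):
    """Which physical byte an 8-bit address reaches, by addressing mode."""
    address &= 0xFF
    if address < 0x80:
        return ("ram_low", address)    # same byte for direct and @Ri
    if indirect:
        return ("ram_high", address)   # upper 128 bytes: @Ri only
    return ("sfr", address)            # direct 80..FF: special function registers


# Count every distinct byte reachable through either mode
reachable = {
    resolve_8051(address, indirect)
    for address in range(256)
    for indirect in (False, True)
}

if __name__ == "__main__":
    print(len(reachable))
```

The same 8-bit address reaches 384 distinct bytes because the addressing mode acts as an extra address bit for the top half.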
The P1V lends itself to this same approach, 512 words can be as-now, and another area of COG-local memory can be added (128/256/512 ie whatever makes FPGA sense, and size can even be a compile-time choice) that is @Ri accessible, used for buffers and arrays.
@Ri index values above COG space can remap into HUB (now hub-slot-paced) and even off-chip, with some fast SPI memory HW support (slower again than HUB, but not by much, e.g. HyperRAM specs 36ns random access times at full clock speed).
Code could not run in that memory, but it needs fewer ports, and can be physically anywhere.
Notice the big gains made here are in the additional local memory, and in the ease of COG/HUB/External handling.
Users need know nothing other than the address, the HW does the rest.
There is less gain from borrowing the memory space used by the present IO registers, as something like the feature of @Ri++ can free more CODE space.
An indirect flag would likely fit well into an FPGA; there are actually 4 'free' flags in an FPGA.
Hardware-wise, if code that was doing a R/W to a register saw that @Ri flag set, it would flip it to indirect, likely adding another opcode cycle.
Looking at those 4 free flags, we have something like
[IDss]
I = Set for @Ri use, default clear, is normal memory
D = Direction of Auto-INC
ss = Size: 0x0 = +/-0, 0x1 = +/-1, 0x2 = +/-2, 0x3 = +/-4; also sets 8b/16b/32b read alignments
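A quick Python sketch of decoding those four flag bits. The bit order follows [IDss] with I as the top bit; the polarity of D (1 = decrement) and the 32-bit pointer width are assumptions made for the example, and the function names are illustrative.

```python
STEP_FOR_SIZE = {0b00: 0, 0b01: 1, 0b10: 2, 0b11: 4}


def decode_flags(flags):
    """Split the 4 spare bits [IDss] into (indirect, signed step)."""
    indirect = bool(flags & 0b1000)    # I: treat the long as an @Ri pointer
    decrement = bool(flags & 0b0100)   # D: assumed 1 = count down, 0 = count up
    step = STEP_FOR_SIZE[flags & 0b0011]
    return indirect, (-step if decrement else step)


def next_pointer(flags, pointer):
    """Pointer value after one access, applying the auto-inc/dec if enabled."""
    indirect, delta = decode_flags(flags)
    if not indirect:
        return pointer                 # plain memory: nothing changes
    return (pointer + delta) & 0xFFFFFFFF


if __name__ == "__main__":
    print(decode_flags(0b1011))
    print(hex(next_pointer(0b1001, 0x10)))
```

Since the flags travel with the long itself, any cog location could act as an auto-stepping pointer without a separate setup instruction per access.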
That allows any COG memory at all to be used this way, a slight over-kill, but the bits are 'free' in FPGA, and any-of follows the COG philosophy, and allows fairly free user cut and paste.
In contrast, the 8051 has only 2 @Ri registers, and of course, you always run out of them and worry about what else is using them...
* Looking at the Prop, I can see a MOVR could be added to MOVI/MOVD/MOVS - each of those loads 9 bits, and a 4th opcode nicely completes the set, and now loads all/any of 36 bits, which is exactly an FPGA width.
That opcode would become your SETINDx
Would need to be tested is about all I should say.