Hmmm ... how many clocks does it save? A full 2 clock cycles, or even 4? If so, it'd be a tremendous speed benefit, e.g. for a high-velocity closed-loop controller: a higher loop frequency without needing a higher clock rate. I'd buy it.
We only have one streamer per cog, so to achieve multi-channel DDS we use bit-banging.
A SETDAC instruction would tighten things up, so timer-interrupt DDS performance would benefit.
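For context, here is roughly what each bit-banged DDS channel computes per timer interrupt. This is a minimal sketch of a standard phase-accumulator DDS, not code from this thread: the 32-bit accumulator, the 256-entry sine table, and the names `tuning_word` and `dds_step` are assumptions for illustration.

```python
import math

# Assumed sizes (illustrative only): 32-bit phase accumulator, 256-entry table.
ACC_BITS = 32
TABLE_BITS = 8
ACC_MASK = (1 << ACC_BITS) - 1

# One sine cycle scaled to unsigned 8-bit DAC values (0..255).
SINE_TABLE = [
    round(127.5 + 127.5 * math.sin(2 * math.pi * i / (1 << TABLE_BITS)))
    for i in range(1 << TABLE_BITS)
]


def tuning_word(f_out, f_update):
    """Phase increment that yields f_out when dds_step runs at f_update Hz."""
    return round(f_out * (1 << ACC_BITS) / f_update)


def dds_step(phase, increment):
    """Advance one channel's accumulator; return (new_phase, dac_value)."""
    phase = (phase + increment) & ACC_MASK
    index = phase >> (ACC_BITS - TABLE_BITS)  # top bits select the table entry
    return phase, SINE_TABLE[index]


# One simulated timer interrupt: every channel is stepped once, and with
# bit-banging each resulting DAC value must then be written out by software.
phases = [0, 0, 0, 0]
increments = [tuning_word(f, 1_000_000) for f in (1000, 2000, 3000, 4000)]
dac_values = []
for ch in range(4):
    phases[ch], value = dds_step(phases[ch], increments[ch])
    dac_values.append(value)
```

The per-channel DAC write at the end of each step is the part a dedicated SETDAC-style instruction would shorten.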
Option 1 above is a sort of GETBYTE for DACs, with optional inversion for AC outputs. Data could be packed into a long to save space with no time penalty.
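A minimal Python model of the "GETBYTE for DACs" idea. It assumes four 8-bit DAC samples are packed into one long with byte 0 in bits 7..0, and that "inversion" means complementing the byte (255 - x) to drive the opposite phase of an AC output. `pack_dac_long` and `dac_byte` are hypothetical names, and the actual Option 1 encoding is not reproduced here.

```python
def pack_dac_long(b0, b1, b2, b3):
    """Pack four 8-bit DAC samples into one 32-bit long (b0 in bits 7..0)."""
    samples = (b0, b1, b2, b3)
    if any(not 0 <= b <= 0xFF for b in samples):
        raise ValueError("each DAC sample must fit in 8 bits")
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)


def dac_byte(packed, n, invert=False):
    """GETBYTE-style: select byte n (0..3), optionally complemented."""
    if not 0 <= n <= 3:
        raise ValueError("byte index must be 0..3")
    value = (packed >> (8 * n)) & 0xFF
    return value ^ 0xFF if invert else value  # x ^ 0xFF == 255 - x
```

For example, `pack_dac_long(0x10, 0x20, 0x30, 0x40)` gives `0x40302010`. From that long, `dac_byte(..., 2)` gives `0x30`, and the inverted form gives `0xCF`. Selecting a byte costs the same no matter which byte is chosen, which matches the "no time penalty" point.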
Being able to WRPIN/WXPIN/WYPIN more than one pin per instruction is a neat idea.
Bits 13..8 of S (normally 0) could express how many sequential pins, 1..64, are going to receive the same W?PIN operation, beginning at pin S[5:0].
I've noticed in my code that I'll often need to have four identical instructions on subsequent pins to set up four cog DACs, or something.
Also, these instructions could benefit from bits 13..8 of S expressing HOW MANY pins to operate on, but these would always take only two cycles, since we could use a thermometer decoder to get the pattern:
That's certainly flexible, but is the logic cost starting to climb here?
Adding a variable width to a variable base, and extracting a field up to 64 bits wide, does not feel cheap.
Maybe, if a lot of opcodes can leverage this, it can be worth the logic cost?
I'm realizing it could only be made to work on pins 0..31 or pins 32..63, due to data-forwarding limitations on the instructions which affect the DIR and/or OUT bits.
So, the range of pins needs to be bound within DIRA/OUTA or DIRB/OUTB.
How to specify that in some fool-proof way? It means that we'd only need 5 bits for the span, not 6.
WRPIN/WXPIN/WYPIN could span 64 pins, though.
How to unify all this?
Logic cost is just a thermometer decoder for the DIR/OUT-affecting instructions.
W?PIN would require a 6-bit counter, is all.
Might as well limit W?PIN to 32-pin spans, too, for consistency, with consistent rules regarding wrap-around.
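To illustrate the claim that the DIR/OUT logic cost is "just a thermometer decoder", here is a minimal Python model. It assumes a 5-bit field holding the number of extra bits (0..31), and a rotate so that a span running past bit 31 wraps to bit 0 of the same 32-bit register. The function names are hypothetical; in hardware this would be combinational logic.

```python
def thermometer(extra):
    """Thermometer decode: extra (0..31) -> mask with the low extra+1 bits set."""
    if not 0 <= extra <= 31:
        raise ValueError("span field is 5 bits (0..31 extra)")
    return (1 << (extra + 1)) - 1


def span_mask(base, extra):
    """32-bit mask covering extra+1 bits starting at base (0..31).

    The thermometer pattern is rotated left by base, so bits past 31
    wrap around to bit 0 within the same register.
    """
    if not 0 <= base <= 31:
        raise ValueError("base must be 0..31 within one register")
    m = thermometer(extra)
    return ((m << base) | (m >> (32 - base))) & 0xFFFFFFFF
```

For example, `span_mask(8, 7)` is `0x0000FF00` and `span_mask(30, 3)` is `0xC0000003`. This covers only the single-cycle DIR/OUT case. The W?PIN case would instead step a counter through the pins one at a time.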
It's easy enough to specify, as either a 32- or 64-pin total reach.
Kind of a Ninja warrior, dispensing a lot of shuriken, right at the targets, and in perfect sequence!
But, to date, there haven't been Ninja warriors capable of doing it that fast!
If that unit could be encapsulated into some state machine (a kind of limited-function mini-streamer), then once started it could be left unattended, running on its own until the burst is exhausted.
If applicable, IN could be raised at the end, so it would be available to be sampled, during the burst, by any RDPIN or RQPIN whose destination is within the limits of the interval.
The above could enable another cog to wait for the end of the burst and take over from it, almost dovetailing the next burst.
Instead of using d[12:8], it would be better to use d[10:6], since that would enable 1..8 pins to be called out in a 9-bit immediate.
FLTL #%111_001000 'float pins 8..15
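The arithmetic behind the 9-bit immediate can be checked with a short Python sketch. It assumes S[5:0] holds the base pin and S[10:6] the number of extra pins, as proposed here. The helper names are hypothetical.

```python
IMM9_MAX = 0x1FF  # largest value a plain #immediate can hold (9 bits)


def encode_pin_span(base_pin, extra_pins):
    """Pack base pin into S[5:0] and extra-pin count into S[10:6]."""
    if not (0 <= base_pin <= 63 and 0 <= extra_pins <= 31):
        raise ValueError("base pin is 6 bits, extra-pin count is 5 bits")
    return (extra_pins << 6) | base_pin


def decode_pin_span(s):
    """Return (base_pin, extra_pins) from S, ignoring bits above S[10]."""
    return s & 0x3F, (s >> 6) & 0x1F


def fits_short_immediate(s):
    """True if S fits a 9-bit #immediate, i.e. no ## prefix is needed."""
    return s <= IMM9_MAX
```

`encode_pin_span(8, 7)` reproduces `%111_001000` (pins 8..15). Only S[8:6] fit alongside the 6-bit pin number, so the short form covers spans of 1..8 pins. Anything longer needs `##` or a register; for example, `encode_pin_span(0, 8)` is 512.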
Yes, that makes sense.
What happens to interrupts with this new variable-time opcode?
This FLTL example is a 2-clock instruction, always.
The W?PIN instruction would take 2 clocks, plus 1 more clock for each extra pin.
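Taking the two timing statements above at face value, a small Python helper gives the clock count per span instruction. That also bounds how long a single such instruction runs. It assumes a 5-bit span field (0..31 extra pins). The function name and family labels are hypothetical.

```python
SINGLE_CYCLE_FAMILIES = {"DIR", "OUT", "FLT", "DRV"}  # mask from a thermometer decoder
SEQUENTIAL_FAMILIES = {"WRPIN", "WXPIN", "WYPIN"}      # one extra pin per extra clock


def span_op_clocks(family, extra_pins):
    """Clocks for one span instruction, per the figures stated in this thread."""
    if not 0 <= extra_pins <= 31:
        raise ValueError("extra_pins is a 5-bit count")
    if family in SINGLE_CYCLE_FAMILIES:
        return 2
    if family in SEQUENTIAL_FAMILIES:
        return 2 + extra_pins
    raise ValueError(f"unknown instruction family: {family}")
```

With a full 32-pin span, `span_op_clocks("WXPIN", 31)` is 33 clocks, versus 2 for an FLT instruction of any span. That spread is what the jitter question hinges on.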
So that means very large interrupt jitter is possible when using this opcode?
The pins will also update sequentially, rather than all on the same clock cycle?
I presume there is still some means to update all pins/configs on the same SysCLK?
The smart pins are released from reset when their DIRs go high. That can be done all at once to align their states.
Which of the following ideas would be better for the pin instructions (DIRx/OUTx/FLTx/DRVx), which use S[5:0] to call out a (base) pin?
a) Use S[10:6] to specify how many extra pins above pin S[5:0] will be affected by the operation:
DRVRND #7<<6+16 'drive P[23:16] to random states
b) Ignore bits above S[5:0], but become sensitive to a just-prior SETQ whose D[4:0] would specify the number of extra pins to affect:
SETQ #7
DRVRND #16 'drive P[23:16] to random states
For pins, I think it's pretty safe to use (a), though (b) could always act as an override.
For BITL/BITH/BITC/BITNC/BITZ/BITNZ/BITRND/BITNOT, which use S[4:0] to specify a (base) bit, using S[9:5] to specify how many extra bits to operate on would be problematic: S might hold a bit address that spans several registers (used with ALTB), so S[9:5] may already contain meaningful bits that should not be interpreted as 'extra bits'. There are two ways to get around this:
a) Always interpret bits S[9:5] as 'number of extra bits to affect', unless WCZ is used to write the old bit to C and Z, since there'd be little reason to copy what amounts to the LSB of a span of bits into C and Z:
BITRND reg,#15<<5+16 'set bits 31..16 to random states
BITRND reg,#15<<5+16 WCZ 'set only bit 16 to random state, get prior bit 16 into C and Z
b) Ignore bits above S[4:0], but become sensitive to a just-prior SETQ whose D[4:0] would specify the number of extra bits to affect:
SETQ #15
BITRND reg,#16 'set bits 31..16 to random states
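Option (a) for the BITx instructions can be modeled in a few lines of Python. The model assumes S[4:0] is the base bit and S[9:5] the extra-bit count, and that bits past 31 wrap to bit 0 of the same register. `bit_span` is a hypothetical name. The examples above rely on `<<` binding tighter than `+`, but Python parses `15 << 5 + 16` as `15 << 21`, so the Python versions are parenthesized.

```python
def bit_span(s, wcz=False):
    """Bit positions touched by a BITx instruction under option (a).

    S[4:0] is the base bit and S[9:5] the number of extra bits, except
    that WCZ forces a single-bit operation (so the old bit can go to C/Z).
    Positions past bit 31 wrap around to bit 0 of the same register.
    """
    base = s & 0x1F
    extra = 0 if wcz else (s >> 5) & 0x1F
    return [(base + i) % 32 for i in range(extra + 1)]
```

`bit_span((15 << 5) + 16)` returns bits 16..31, while the same operand with `wcz=True` returns only `[16]`, mirroring the two BITRND lines under option (a).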
How should all this be handled?
Sometimes, it might be more convenient to use a separate SETQ, instead of having a compound value in a register or needing to use a ##, anyway.
SETQ preceding a pin/bit instruction always overrides the span. This way, if you've got random junk in the bits that specify the span in the pin/bit instruction, you can precede it with 'SETQ #0' and always operate on a single pin/bit, if that is your intent. That would keep two simple rules:
a) The five bits above the pin/bit number ALWAYS control the span of a pin/bit instruction.
b) SETQ before a pin/bit instruction ALWAYS overrides the span bits.
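The two rules reduce to a tiny decision, sketched here in Python. The sketch makes three assumptions. For pin instructions, the span field sits just above the pin number (S[10:6]). For BITx, it sits just above the bit number (S[9:5]). `setq` is `None` when no SETQ immediately precedes the instruction. `effective_extra` is a hypothetical helper.

```python
PIN_SPAN_SHIFT = 6  # pin instructions: pin in S[5:0], span in S[10:6]
BIT_SPAN_SHIFT = 5  # BITx instructions: bit in S[4:0], span in S[9:5]


def effective_extra(s, shift, setq=None):
    """Number of extra pins/bits an instruction will affect.

    Rule (b): a just-prior SETQ always overrides, using Q[4:0].
    Rule (a): otherwise the five bits above the pin/bit number decide.
    """
    if setq is not None:
        return setq & 0x1F
    return (s >> shift) & 0x1F
```

On its own, a register holding junk such as `0x1FF` would mean 7 extra pins. `effective_extra(0x1FF, PIN_SPAN_SHIFT, setq=0)` is 0, which is the 'SETQ #0' escape hatch.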
I am a bit late to the party, but here goes anyway...
Coming from the P1, we use a 32-bit mask, although that is obviously limited to 32 successive pins.
What if for DIR/OUT/FLT/DRV we had...
Same as current silicon: DIR/OUT/FLT/DRV S[6:5]=0 & S[4:0] (yes it's actually D)
New SETM & SETM2 D/# instructions to set the mask(s) for pins [31:0] & [63:32].
When DIR/OUT/FLT/DRV S[6:5] = 00: no mask, single pin; 01: use mask[31:0]; 10: use mask[63:32]; 11: use masks [63:32]+[31:0]
The SETM & SETM2 load internal 32-bit registers which remain set until modified. They must be set before first use (i.e. they are not necessarily zeroed on COGINIT).
Not sure of the silicon cost here. What it gives us is the concept of masks like P1. We can set the mask(s) once, so no need to repeat them each time the DIR/OUT/FLT/DRV is used.
I wonder if this could be used with SETPAT?
The trouble with registers like this is that they'd need to be readable and restorable on interrupts, which is expensive to do.
In the next silicon rev, these instructions will be able to affect spans of 1..32 bits by using S[9:5] or Q[4:0] as an additional bit count:
Can you add the timing equations for these opcodes?
To me, they are cute, but not in the 'must have' column, and I hope they do not push the logic down to lower SysCLK speeds due to more congested routing.
More fundamental and widely useful, I think, are small details like being able to output a SysCLK clock alongside the streamer data flow at SysCLK speeds,
i.e. being able to generate a clock that actually matches the top speed at which the P2 can emit data.
Outputting SysCLK would require some special timing assignments. Not sure how viable that is, but when we get into the respin work with ON Semi, I'll ask about it.
A SETDAC instruction would tighten things up, so timer-interrupt DDS performance would benefit.
Being able to WRPIN/WXPIN/WYPIN more than one pin per instruction is a neat idea.
Bits 13..8 of S (normally 0) could express how many sequential pins, 1..64, are going to receive the same W?PIN operation, beginning at pin S[5:0].
I've noticed in my code that I'll often need to have four identical instructions on subsequent pins to set up four cog DACs, or something.
Also, these instructions could benefit from bits 13..8 of S expressing HOW MANY pins to operate on, but these would always take only two cycles, since we could use a thermometer decoder to get the pattern:
dirl/dirh/dirc/dirnc/dirz/dirnz/dirrnd/dirnot
outl/outh/outc/outnc/outz/outnz/outrnd/outnot
fltl/flth/fltc/fltnc/fltz/fltnz/fltrnd/fltnot
drvl/drvh/drvc/drvnc/drvz/drvnz/drvrnd/drvnot
Would you like to enable pins 32..47 for 256-step triangle PWM output and start them up, initialized to 0? Just do this:
Adding a variable width to a variable base, and extracting a field up to 64 bits wide, does not feel cheap.
Maybe, if a lot of opcodes can leverage this, it can be worth the logic cost?
I'm realizing it could only be made to work on pins 0..31 or pins 32..63, due to data-forwarding limitations on the instructions which affect the DIR and/or OUT bits.
So, the range of pins needs to be bound within DIRA/OUTA or DIRB/OUTB.
How to specify that in some fool-proof way? It means that we'd only need 5 bits for the span, not 6.
WRPIN/WXPIN/WYPIN could span 64 pins, though.
How to unify all this?
Logic cost is just a thermometer decoder for the DIR/OUT-affecting instructions.
W?PIN would require a 6-bit counter, is all.
Might as well limit W?PIN to 32-pin spans, too, for consistency, with consistent rules regarding wrap-around.
Counter? Is this a multi-clock opcode taking up to 32, 64, or 128 sysclks? Is that interruptible?
To make these instructions...
dirl/dirh/dirc/dirnc/dirz/dirnz/dirrnd/dirnot
outl/outh/outc/outnc/outz/outnz/outrnd/outnot
fltl/flth/fltc/fltnc/fltz/fltnz/fltrnd/fltnot
drvl/drvh/drvc/drvnc/drvz/drvnz/drvrnd/drvnot
...work on not just single pins, but on up to 32 pins within either DIRA/OUTA or DIRB/OUTB, this line of Verilog...
Just needs to be changed to this...
That is almost nothing, in the big picture.
The programmer would have to be aware that D[10:6] are now used for the span, and that D[5:0] is now the 'base' pin rather than the only pin.
Yes, that makes sense.
What happens to interrupts with this new variable-time opcode?
This FLTL example is a 2-clock instruction, always.
The W?PIN instruction would take 2 clocks, plus 1 more clock for each extra pin.
But, to date, there haven't been Ninja warriors capable of doing it that fast!
If that unit could be encapsulated into some state machine (a kind of limited-function mini-streamer), then once started it could be left unattended, running on its own until the burst is exhausted.
If applicable, IN could be raised at the end, so it would be available to be sampled, during the burst, by any RDPIN or RQPIN whose destination is within the limits of the interval.
The above could enable another cog to wait for the end of the burst and take over from it, almost dovetailing the next burst.
Just a thought...
So that means very large interrupt jitter is possible when using this opcode?
The pins will also update sequentially, rather than all on the same clock cycle?
I presume there is still some means to update all pins/configs on the same SysCLK?
The smart pins are released from reset when their DIRs go high. That can be done all at once to align their states.
a) Use S[10:6] to specify how many extra pins above pin S[5:0] will be affected by the operation:
b) Ignore bits above S[5:0], but become sensitive to a just-prior SETQ whose D[4:0] would specify the number of extra pins to affect:
For pins, I think it's pretty safe to use (a), though (b) could always act as an override.
For BITL/BITH/BITC/BITNC/BITZ/BITNZ/BITRND/BITNOT, which use S[4:0] to specify a (base) bit, using S[9:5] to specify how many extra bits to operate on would be problematic: S might hold a bit address that spans several registers (used with ALTB), so S[9:5] may already contain meaningful bits that should not be interpreted as 'extra bits'. There are two ways to get around this:
a) Always interpret bits S[9:5] as 'number of extra bits to affect', unless WCZ is used to write the old bit to C and Z, since there'd be little reason to copy what amounts to the LSB of a span of bits into C and Z:
b) Ignore bits above S[4:0], but become sensitive to a just-prior SETQ whose D[4:0] would specify the number of extra bits to affect:
How should all this be handled?
Sometimes, it might be more convenient to use a separate SETQ, instead of having a compound value in a register or needing to use a ##, anyway.
I agree. Should we permit the compact form, though? It is twice as fast and half the code.
Both would be nice. The SETQ appealed because the span can be easily controlled by a register.
SETQ preceding a pin/bit instruction always overrides the span. This way, if you've got random junk in the bits that specify the span in the pin/bit instruction, you can precede it with 'SETQ #0' and always operate on a single pin/bit, if that is your intent. That would keep two simple rules:
a) The five bits above the pin/bit number ALWAYS control the span of a pin/bit instruction.
b) SETQ before a pin/bit instruction ALWAYS overrides the span bits.
How would that be?
That would work fine.
This feels like the right solution. You get the best of both possibilities, without being stuck if you only want one pin/bit to be affected.
What does SETDAC/PINDAC or what I call WRDAC look like now?
Yes, you brought up the notion of doing a span of operations via a single instruction.
No movement on SETDAC stuff, yet. Still working on this pin/bit-span stuff. W?PIN is next.
I am a bit late to the party, but here goes anyway...
Coming from the P1, we use a 32-bit mask, although that is obviously limited to 32 successive pins.
What if for DIR/OUT/FLT/DRV we had...
Same as current silicon: DIR/OUT/FLT/DRV S[6:5]=0 & S[4:0] (yes it's actually D)
New SETM & SETM2 D/# instructions to set the mask(s) for pins [31:0] & [63:32].
When DIR/OUT/FLT/DRV S[6:5] = 00: no mask, single pin; 01: use mask[31:0]; 10: use mask[63:32]; 11: use masks [63:32]+[31:0]
The SETM & SETM2 load internal 32-bit registers which remain set until modified. They must be set before first use (i.e. they are not necessarily zeroed on COGINIT).
Not sure of the silicon cost here. What it gives us is the concept of masks like P1. We can set the mask(s) once, so no need to repeat them each time the DIR/OUT/FLT/DRV is used.
I wonder if this could be used with SETPAT?
The trouble with registers like this is that they'd need to be readable and restorable on interrupts, which is expensive to do.
These instructions will be able to affect spans of 1..32 pins by using S[10:6] or Q[4:0] as an additional pin count:
In both sets of instructions, the extra bits that exceed the MSB will wrap around, starting at the LSB, in the register(s) being affected.
The pin-modifying instructions (2nd set) will not span between DIRA/OUTA and DIRB/OUTB, but will wrap within DIRA/OUTA or DIRB/OUTB.
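The wrap rule for the pin instructions can be expressed as a short Python sketch. It assumes S[5:0] is the base pin and S[10:6] the extra-pin count. Bit 5 of the base pin picks DIRA/OUTA (pins 0..31) or DIRB/OUTB (pins 32..63), and the span wraps inside that group. `affected_pins` is a hypothetical name.

```python
def affected_pins(base_pin, extra_pins):
    """Pins touched by DIRx/OUTx/FLTx/DRVx for a given base pin and span.

    The span never crosses from DIRA/OUTA into DIRB/OUTB (or back);
    it wraps around within the 32-pin group that holds the base pin.
    """
    if not (0 <= base_pin <= 63 and 0 <= extra_pins <= 31):
        raise ValueError("base pin is 6 bits, extra-pin count is 5 bits")
    group = base_pin & 0x20   # 0 -> DIRA/OUTA, 32 -> DIRB/OUTB
    offset = base_pin & 0x1F  # position within that 32-bit register
    return [group | ((offset + i) & 0x1F) for i in range(extra_pins + 1)]
```

For example, a 4-pin span starting at pin 62 touches pins 62, 63, 32 and 33, not 62..65.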
To me, they are cute, but not in the 'must have' column, and I hope they do not push the logic down to lower SysCLK speeds due to more congested routing.
More fundamental and widely useful, I think, are small details like being able to output a SysCLK clock alongside the streamer data flow at SysCLK speeds,
i.e. being able to generate a clock that actually matches the top speed at which the P2 can emit data.
Outputting SysCLK would require some special timing assignments. Not sure how viable that is, but when we get into the respin work with ON Semi, I'll ask about it.