ROL D,{#}S {WC/WZ/WCZ} Syntax question

Bob Drury · 2021-07-21 18:52

ROL D,{#}S {WC/WZ/WCZ} Rotate left. D = [63:32] of ({D[31:0], D[31:0]} << S[4:0]). C = last bit shifted out if S[4:0] > 0, else D[31]. *

I understand the shifting of B31 -> B0, and that when WC is used C = 1 when B31 = 1. S[4:0] allows a shift of 1 up to 31 bits at one time, the registers being 32 bits.
But what does D = [63:32] of ({D[31:0], D[31:0]} << S[4:0]) mean?
Regards and Thanks
Bob (WRD)

evanh · 2021-07-21 19:06

It's an ordered bitwise assignment from D on rightside of "of", as a 64-bit word, to D on leftside of "of", as a 32-bit word, with S specifying the alignment.

evanh · 2021-07-21 19:08

That's the functional description. In reality, the variability of S means the logic gates needed to achieve that, in what's called a "barrel-shifter", is a pretty large structure.

AJL · 2021-07-21 23:40

@"Bob Drury" said:
ROL D,{#}S {WC/WZ/WCZ} Rotate left. D = [63:32] of ({D[31:0], D[31:0]} << S[4:0]). C = last bit shifted out if S[4:0] > 0, else D[31]. *

I understand the shifting of B31 -> B0, and that when WC is used C = 1 when B31 = 1. S[4:0] allows a shift of 1 up to 31 bits at one time, the registers being 32 bits.
But what does D = [63:32] of ({D[31:0], D[31:0]} << S[4:0]) mean?
Regards and Thanks
Bob (WRD)

In an effort to explain this as simply as possible:

Imagine you have taken the register to be shifted and placed a copy of that beside it so now you have 64 bits, with the top half the same as the bottom half.
You then use the S register value to shift the whole thing left, and you then take the top half, which gives the same effect as a circular shift of 32 bits.
It should be noted that S can hold 0, which results in no shift, but can still set the flags which would put D[31] into the C flag.

While a barrel shifter is fairly large compared to a simple shifter, it gives a speed advantage that scales with register size (a 31 bit shift takes the same time as a single bit shift, giving a x31 speed up)
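
The copy, shift, and take-the-top-half description above can be sketched in Python. This is only a behavioural model of the documented formula, not a statement about how the P2 hardware is built; the name `rol` and the constant `MASK32` are made up for illustration.

```python
MASK32 = 0xFFFFFFFF  # hypothetical helper constant: keeps values to 32 bits

def rol(d, s):
    """Model of ROL D,S: D = [63:32] of ({D[31:0], D[31:0]} << S[4:0])."""
    d &= MASK32
    n = s & 0x1F                       # only S[4:0] is used, so 0..31
    doubled = (d << 32) | d            # {D[31:0], D[31:0]}: two copies side by side
    shifted = doubled << n             # shift the whole 64-bit value left
    result = (shifted >> 32) & MASK32  # keep bits [63:32] only
    # C = last bit shifted out if S[4:0] > 0, else D[31]
    if n > 0:
        c = (d >> (32 - n)) & 1
    else:
        c = (d >> 31) & 1
    return result, c

result, c = rol(0xAAAAAAAA, 1)
print(f"{result:08X} C={c}")
```

With S = 0 the value is unchanged, but C still reports D[31], which matches the note about the flags above.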

Bob Drury · 2021-07-22 01:54

EVANH
I am confused about how you would use a single register of 32 bits to create a 64-bit register. Say the instruction is written ROL Pr0,#1 where Pr0 = 10101010_10101010_10101010_10101010. After the instruction
executes, Pr0 = 01010101_01010101_01010101_01010101 (B0->B1 ... B30->B31, B31->B0). I would assume B31 would be stored prior to the shift and then moved to B0. So how would the code look if double shifting D (Pr0)?
Thanks for your Time
Bob (WRD)

AJL · 2021-07-22 02:54

@"Bob Drury"

Starting with your example Pr0, the hardware latches two copies of it to give 64 bits = 10101010_10101010_10101010_10101010__10101010_10101010_10101010_10101010
It then shifts the 64 bits left by the value in S giving 01010101_01010101_01010101_01010101__01010101_01010101_01010101_0101010x
The result is then taken as bits 63 to 32 of this (the left-hand half, before the double underscore) giving 01010101_01010101_01010101_01010101

For a double shift your example value is awkward, as it appears to make no change. To illustrate it better I'll show a variety of bit shifts with a different starting value
Pr0 = DEADBEEF = 11011110_10101101_10111110_11101111
64 bits in the shifter (preshift) is 11011110_10101101_10111110_11101111__11011110_10101101_10111110_11101111
A single shift gives 10111101_01011011_01111101_11011111__10111101_01011011_01111101_1101111x
A double shift gives 01111010_10110110_11111011_10111111__01111010_10110110_11111011_101111xx
A 20 bit shift gives 11101110_11111101_11101010_11011011__11101110_1111xxxx_xxxxxxxx_xxxxxxxx
A 31 bit shift gives 11101111_01010110_11011111_01110111__1xxxxxxx_xxxxxxxx_xxxxxxxx_xxxxxxxx

Please note that the bits at the right hand end are x (don't care) values, because I don't know what the chip sets these to, and it doesn't matter because they never leave the shifter. In fact, bits 31:0 out of the shifter may not be implemented in the silicon to save gates, power, and heat.

All of these shifts take the same amount of time to occur, as they are all performed in parallel with only the correct one being passed to the result register.
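
The "all shifts in parallel, one selected" idea can be modelled as follows. This is a rough sketch of the description in this post only; it assumes nothing about the real P2 silicon, which may well use a different barrel-shifter structure. The function name `rol_by_selection` is hypothetical.

```python
MASK32 = 0xFFFFFFFF  # hypothetical helper constant: keeps values to 32 bits

def rol_by_selection(d, s):
    """Build every fixed rotate of d (0..31 bits), then pick the one S selects."""
    d &= MASK32
    doubled = (d << 32) | d  # the 64 bits in the shifter before shifting
    # Each fixed shift is just a rewiring; all 32 candidates exist side by side.
    candidates = [((doubled << n) >> 32) & MASK32 for n in range(32)]
    return candidates[s & 0x1F]  # S[4:0] acts as the selector

for n in (1, 2, 20, 31):
    print(f"{n:2d}-bit shift: {rol_by_selection(0xDEADBEEF, n):032b}")
```

The printed bit patterns can be compared against the top halves in the DEADBEEF table above.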

Does this help?

Bob Drury · 2021-07-22 03:24

EVANH
Yes. I am going to go through this and add it to the document notes. My understanding is that the two temporary copies of D are what store the bits being shifted, from 1 to 31 bits. For some reason
I was hung up on the 64 bits. The D register is copied into two registers inside the instruction operation and then shifted by the shift count; the result is then copied from the upper temporary shift register
back to the D register. Essentially this is how the upper bits are remembered during the shift. I am going to detail this in my notes for the next revision.
Thanks EVANH Very Helpful
Bob (WRD)

Bob Drury · 2021-07-22 03:32

EVANH
Just wondering about shift right, ROR. Wouldn't the double temporary register also be used? Shouldn't the D[63:32] notation also be shown? The PASM spreadsheet has
D = [31:0] of ({D[31:0], D[31:0]} >> S[4:0]). C = last bit shifted out if S[4:0] > 0, else D[0]. *
Regards
Bob (WRD)

AJL · 2021-07-22 04:50

Well, I’ll get this one too.

Yes, it would use the 64-bit temporary register, but because the shift is going in the opposite direction you need to pick the results up from the other end too.
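
Picking the result up "from the other end" can be sketched the same way as for ROL. This is a behavioural model of the spreadsheet formula quoted above, D = [31:0] of ({D[31:0], D[31:0]} >> S[4:0]); the names `ror` and `MASK32` are illustrative only.

```python
MASK32 = 0xFFFFFFFF  # hypothetical helper constant: keeps values to 32 bits

def ror(d, s):
    """Model of ROR D,S: D = [31:0] of ({D[31:0], D[31:0]} >> S[4:0])."""
    d &= MASK32
    n = s & 0x1F                    # only S[4:0] is used, so 0..31
    doubled = (d << 32) | d         # same doubled 64-bit value as for ROL
    result = (doubled >> n) & MASK32  # shift right, then keep the LOW half
    # C = last bit shifted out if S[4:0] > 0, else D[0]
    if n > 0:
        c = (d >> (n - 1)) & 1
    else:
        c = d & 1
    return result, c

result, c = ror(0xDEADBEEF, 1)
print(f"{result:08X} C={c}")
```

Rotating right by 1 gives the same value as rotating left by 31, which is the last row of the DEADBEEF table earlier in the thread.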

evanh · 2021-07-22 06:36

The doubling up of bits, to make 64-bits, doesn't need a register as such. They're just two wires split from each bit of the D source operand.

Bob Drury · 2021-07-22 13:05

EVANH
Not knowing how instructions actually execute, I assume I should describe this register operation as an analogy drawn from discrete components, and that in reality the D register and S shift
registers are inputs to combinational logic whose result is eventually stored in the D register. If this is true, that is a lot of combinations just for this one instruction. Are all PASM instructions really a massive amount of combinational logic, and should I include this in my document, or is there a better description that I should include?
Regards and really thanks for responding
Bob (WRD)

msrobots · 2021-07-22 19:14

Well this is how chips and instructions work. And yes it is a massive amount of combinational logic.

I find it quite mind-boggling that all of this decoding and moving of bits happens in two clock cycles per instruction. This is only possible because of some pipelining, where the next instruction gets decoded while the current instruction is writing its results back. Or something like that.

Enjoy!

Mike

evanh · 2021-07-23 02:59

The amount of logic for each instruction varies widely. AND/OR/XOR are as simple and primitive as they seem. A fixed shift is even simpler still, it just remaps the inputs to different specific outputs. A single cycle multiplication, on the other hand, is a fat blob of combinations; not unlike a barrel shifter. The multiply instruction is often the reference when optimising "critical path" in the design of a processor. Ie: Anything slower needs attention to sort out why it's so slow.

evanh · 2021-07-23 04:23

@"Bob Drury" said:
... the D register and S shift registers are inputs to a combinational logic which has the result eventually stored in the D register.

Many RISC Instruction Sets have a 3-operand architecture. The so-called third operand is not an input to the ALU but rather the register number to put the ALU result into. It's a very handy feature to have because then the D input register doesn't get over-written with the result. And the names 'S' and 'D' are replaced with r1 and r2, short for register1, register2.

Prop2 doesn't directly have this feature in the regular instructions but it can provide it as part of the ALTx instructions. If you are using an ALTI instruction for indexing a table/buffer in cogRAM, then the third operand is also there for free. Named 'R'. Also, ALTR is specifically just this feature but at the cost of adding the prefixing instruction.

evanh · 2021-07-23 04:49

On that note, there is an unrelated instruction SETR. It doesn't have anything to do with ALU results. I don't know what the 'R' signifies. SETS and SETD on the other hand are named because they replace the data in the S and D bit-fields of a cogRAM register. SETR replaces the unnamed 9-bit field above D field - most of the opcode if it were an instruction.

msrobots · 2021-07-23 05:44

Isn't SETR providing an overwrite for the result destination?

I am slowly losing track of the instruction set.

Enjoy!

Mike

evanh · 2021-07-23 06:10

ALTR does that.

EDIT: The three SETx instructions are effectively 9-bit MOVs covering the lower 27 bits between them. Similar to SETNIB, SETBYTE and SETWORD being 4-bit, 8-bit and 16-bit MOVs respectively.

AJL · 2021-07-23 06:54

As I recall, there was some question about what to call those three instructions. SETD and SETS were fairly straightforward, the last one wasn’t.
If I recall correctly SETR was chosen because the predicted primary use case was for setting the result register field in longs that would be used with ALTI instructions.

evanh · 2021-07-23 07:35

@AJL said:
... that would be used with ALTI instructions.

Ahhhh! I see now. I've not needed to do that so far. My limited use of ALTI has only used its R field as a fixed accumulator, ie: For a FIR filter.

evanh · 2021-07-30 10:24

Huh, except for carry flag treatment, SUBR reg1, #0 is same as NEG reg1. And I'm not sure what use NEG's C = bit31 has.

NEG could be an alias of SUBR.

TonyB_ · 2021-07-30 12:40

@evanh said:
Huh, except for carry flag treatment, SUBR reg1, #0 is same as NEG reg1. And I'm not sure what use NEG's C = bit31 has.

NEG could be an alias of SUBR.

Probably best to add or move this post to NEGx discussion at
https://forums.parallax.com/discussion/170380/new-p2-silicon/p28

evanh · 2021-08-06 05:22

What would've been cool is having two special purpose cogRAM mapped registers for the QX/QY Cordic results, eliminating the GETQX and GETQY instructions. That alone would notably speed up any parallel coded use of the Cordic. The 8 clocks per issued command are consumed all too easily by load and store ops. There is no time remaining for inline massaging without skipping to 16-clock spacings.

Cluso99 · 2021-08-06 08:50

@evanh said:
What would've been cool is having two special purpose cogRAM mapped registers for the QX/QY Cordic results, eliminating the GETQX and GETQY instructions. That alone would notably speed up any parallel coded use of the Cordic. The 8 clocks per issued command are consumed all too easily by load and store ops. There is no time remaining for inline massaging without skipping to 16-clock spacings.

The problem here is that in order to write to those registers the ALU would need to suspend the instruction execution to steal the clocks required to write them. This would mess up determinism.

evanh · 2021-08-06 09:13

I don't think that'd be an issue. For the ALU, QX/QY registers would be treated no different to INA/INB registers. They'd exist inside the cog.

There might be a bus'ing issue getting the data from the Cordic (in the hub) though. The GETQX/Y instructions will use a common 'hub' bus that's shared with all the hub accessing instructions of that cog.
