Propeller II update - BLOG - Page 194 — Parallax Forums


Comments

  • cgracey Posts: 14,133
    edited 2014-03-04 14:54
    Ariba wrote: »
    Because on the DE0-Nano the big multiplier and divider are removed, I had to find an alternative.
    To my surprise the mult32 solution I found is even faster than the big hardware multiplier! And it should be multitasking-safe.

    The following subroutine multiplies two 32 bit factors in register A and B and returns a 32 bit result in RES. The trick is to split the 32bit numbers into two 16 bits and use the MUL instruction which can do 16x16 in 2 clock cycles. With 3 MULs and a bit of shifting and adding we get a 32x32 bit result:
    mult32  'a * b -> res
            getword res,a,#0    '1
            getword tmp,b,#0    '1
            getword a,a,#1      '1
            getword b,b,#1      '1
            mul  a,tmp          '2
            mul  b,res          '2
            mul  res,tmp        '2
            shl  a,#16          '1
             retd               '1
            shl  b,#16          '1
            add  res,a          '1
            add  res,b          '1 = 15 cycles
    
    This seems to work also with signed numbers in A and B.
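    For reference, the decomposition can be modeled in a few lines of Python (a sketch of the 3-MUL trick, not the PASM itself; the function name is invented for illustration):

    ```python
    def mult32(a, b):
        """Low 32 bits of a 32x32 product from three 16x16 multiplies.

        Mirrors the PASM routine above: the AH*BH term only affects
        bits 32..63, so it can be dropped when keeping the low 32 bits.
        """
        mask = 0xFFFFFFFF
        al, ah = a & 0xFFFF, (a >> 16) & 0xFFFF   # split a into 16-bit halves
        bl, bh = b & 0xFFFF, (b >> 16) & 0xFFFF   # split b into 16-bit halves
        return (al * bl + ((ah * bl) << 16) + ((bh * al) << 16)) & mask
    ```

    Because only the low 32 bits are kept, the result matches for signed inputs too: e.g. mult32(0xFFFFFFFF, 0xFFFFFFFF) gives 1, the correct low half of (-1)*(-1).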

    If you do the same subroutine with the big hardware multiplier and task-lock it takes 7 cycles more:
    mult32  tlock               '1
             retd               '1
            mul32   a,b         '1
            getmull res         '18
            tfree               '1 = 22 cycles
    

    Andy


    Andy! That's nuts! This goes to show that necessity is the mother of invention. How come this never occurred to me?

    I wonder if we could make some conduit around the multipliers to feed these values in automatically and make an instruction out of it, getting rid of the big hardware multiplier. There are two 20x20 bit multipliers, so that they can be switched between for a 1-clock MAC, while each multiplier actually takes two clocks. Maybe we could make a 5-clock 32x32 multiplier this way.
  • Seairth Posts: 2,474
    edited 2014-03-04 14:55
    cgracey wrote: »
    Well, it turned out to be less than glorious, after all. Quartus was eliminating duplicate registers. I just switched that feature back 'off'. It's recompiling now. I don't know how that switch had been turned on, because I know when I was designing the CORDIC circuit, I was seeing the benefit of duplicate registers. We'll see what happens with this compile.

    Is that a global switch? If so, isn't it quite possible that more than just the CORDIC and ALU MUX could be affected? Crossing fingers that this doesn't increase the core size too much...
  • Heater. Posts: 21,230
    edited 2014-03-04 14:58
    Ariba,

    Wow! That's stunning.

    Downside is that it eats code if working in COG. Doesn't the big multiplier get us a 64-bit result?
  • Sapieha Posts: 2,964
    edited 2014-03-04 15:06
    Hi Chip.

    I made modules from your code that can be compiled separately.
    Then I used both REG and WIRE --->
    That gives me the same circuit results.

    Look in the attachments.
    //
    //    Here is how a mux might be coded with a single set of mux driver signals:
    //
    module ChipMux(muxs,muxr,muxd);
    
    input                [ 5:0] muxs;        // mux selector flops
    input                [31:0] muxd;        // mux inputs
    output    [63:0][31:0] muxr;        // mux result
    
      always @(muxs, muxd, muxr)
        begin
        end
    //---------------------------------------------------------------------------------------------------------
    //reg         [5:0] muxs;        // mux selector flops                    Use of REG/WIRE compiles to the same circuitry
    wire         [5:0] muxs;        // mux selector flops
    
    //---------------------------------------------------------------------------------------------------------
    //wire [63:0][31:0] muxd;        // mux inputs
    wire [ 3:0][31:0] muxd;            // mux inputs
    wire       [31:0] muxr;            // mux result
    
    assign muxr = muxd[muxs];
    endmodule
    //
    //
    //--------------------------------------------------------------------
    //
    //    Here it is with 4x the drivers:
    //
    //--------------------------------------------------------------------
    //
    module ChipMux4 (bmuxs,bmuxr,bmuxd);
    
    input        [3:0] [ 5:0] bmuxs;        // mux selector flops, 4 copies
    input                [31:0] bmuxd;        // mux data inputs
    //output    [63:0][31:0] bmuxr;
    output    [ 3:0][31:0] bmuxr;        // mux result outputs
    
      always @(bmuxs, bmuxr, bmuxd)
    
      begin
    
    reg   [3:0] [5:0] bmuxs;        // mux selector flops, 4 copies
        end
    
    //wire [63:0][31:0] bmuxd;        // mux inputs
    wire    [ 3:0][31:0] bmuxd;        // mux inputs
    wire            [31:0] bmuxr;        // mux result
    
    
    assign bmuxr[31:24] = bmuxd[bmuxs[3]][31:24];
    assign bmuxr[23:16] = bmuxd[bmuxs[2]][23:16];
    assign bmuxr[15: 8] = bmuxd[bmuxs[1]][15: 8];
    assign bmuxr[ 7: 0] = bmuxd[bmuxs[0]][ 7: 0];
    
    endmodule
    
    //-------- Org Code ----------
    //reg   [3:0] [5:0] muxs;        // mux selector flops, 4 copies
    //wire [63:0][31:0] muxd;        // mux data inputs
    //wire       [31:0] muxr;        // mux result outputs
    
    //assign muxr[31:24] = muxd[muxs[3]][31:24];
    //assign muxr[23:16] = muxd[muxs[2]][23:16];
    //assign muxr[15:8]  = muxd[muxs[1]][15:8];
    //assign muxr[7:0]   = muxd[muxs[0]][7:0];
    
  • cgracey Posts: 14,133
    edited 2014-03-04 15:21
    Sapieha wrote: »
    Hi Chip.

    I made modules from your code that can be compiled separately.
    Then I used both REG and WIRE --->
    That gives me the same circuit results.

    Look in the attachments.
    //
    //    Here is how a mux might be coded with a single set of mux driver signals:
    //
    module ChipMux(muxs,muxr,muxd);
    
    input                [ 5:0] muxs;        // mux selector flops
    input                [31:0] muxd;        // mux inputs
    output    [63:0][31:0] muxr;        // mux result
    
      always @(muxs, muxd, muxr)
        begin
        end
    //---------------------------------------------------------------------------------------------------------
    //reg         [5:0] muxs;        // mux selector flops                    Use of REG/WIRE compiles to the same circuitry
    wire         [5:0] muxs;        // mux selector flops
    
    //---------------------------------------------------------------------------------------------------------
    //wire [63:0][31:0] muxd;        // mux inputs
    wire [ 3:0][31:0] muxd;            // mux inputs
    wire       [31:0] muxr;            // mux result
    
    assign muxr = muxd[muxs];
    endmodule
    //
    //
    //--------------------------------------------------------------------
    //
    //    Here it is with 4x the drivers:
    //
    //--------------------------------------------------------------------
    //
    module ChipMux4 (bmuxs,bmuxr,bmuxd);
    
    input        [3:0] [ 5:0] bmuxs;        // mux selector flops, 4 copies
    input                [31:0] bmuxd;        // mux data inputs
    //output    [63:0][31:0] bmuxr;
    output    [ 3:0][31:0] bmuxr;        // mux result outputs
    
      always @(bmuxs, bmuxr, bmuxd)
    
      begin
    
    reg   [3:0] [5:0] bmuxs;        // mux selector flops, 4 copies
        end
    
    //wire [63:0][31:0] bmuxd;        // mux inputs
    wire    [ 3:0][31:0] bmuxd;        // mux inputs
    wire            [31:0] bmuxr;        // mux result
    
    
    assign bmuxr[31:24] = bmuxd[bmuxs[3]][31:24];
    assign bmuxr[23:16] = bmuxd[bmuxs[2]][23:16];
    assign bmuxr[15: 8] = bmuxd[bmuxs[1]][15: 8];
    assign bmuxr[ 7: 0] = bmuxd[bmuxs[0]][ 7: 0];
    
    endmodule
    
    //-------- Org Code ----------
    //reg   [3:0] [5:0] muxs;        // mux selector flops, 4 copies
    //wire [63:0][31:0] muxd;        // mux data inputs
    //wire       [31:0] muxr;        // mux result outputs
    
    //assign muxr[31:24] = muxd[muxs[3]][31:24];
    //assign muxr[23:16] = muxd[muxs[2]][23:16];
    //assign muxr[15:8]  = muxd[muxs[1]][15:8];
    //assign muxr[7:0]   = muxd[muxs[0]][7:0];
    
    


    I didn't look too carefully, but it seemed that your compiler recognized the replication of the input signals and therefore reduced them.

    To get the speed-up effect, you'd need to use 'always @(posedge clk)' for the muxs signals. That will cause them to be flops. And you need to make sure your compiler is not eliminating duplicate registers.
  • Sapieha Posts: 2,964
    edited 2014-03-04 15:24
    Hi Chip.

    Thanks
  • Ariba Posts: 2,685
    edited 2014-03-04 15:26
    Heater. wrote: »
    Ariba,

    Wow! That's stunning.

    Down side is that it eats code if working in COG. Doesn't the big multiplier get us a 64 bit result ?

    Yes, from the big multiplier you get a 64-bit result with 1 instruction more (GETMULH). But you can also get a 64-bit result with 4 16x16 MULs combined.
    Now I calculate: AH*BL<<16 + BH*AL<<16 + AL*BL
    where H and L are the high and low word of the 32-bit register.

    If you do: AH*BH<<32 + AH*BL<<16 + BH*AL<<16 + AL*BL
    .. and do all the additions in 64 bits, then you get a 64-bit result.
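    The four-product identity can be checked numerically with a small Python model (a sketch only, not the PASM; it assumes the high halves are sign-extended, which is what the SAR instructions in the draft below do):

    ```python
    def mult64(a, b):
        """Signed 64-bit product of two signed 32-bit integers via the
        four 16x16 products AH*BH<<32 + AH*BL<<16 + BH*AL<<16 + AL*BL.

        Low words are unsigned; high words keep their sign (arithmetic
        shift), so a == (ah << 16) + al holds exactly.
        """
        al, bl = a & 0xFFFF, b & 0xFFFF   # unsigned low words
        ah, bh = a >> 16, b >> 16         # sign-extended high words
        r = (ah * bh << 32) + (ah * bl << 16) + (bh * al << 16) + al * bl
        return r & 0xFFFFFFFFFFFFFFFF     # wrap to a 64-bit two's-complement value
    ```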

    Here is a draft which I have not tested yet:
    mult64  'a * b -> resH/L
            getword resL,a,#0   '1
            getword tmp,b,#0    '1
            sar  a,#16          '1
            sar  b,#16          '1
            mov  resH,a         '1
            mul  resH,b         '2
            mul  a,tmp          '2
            mul  b,resL         '2
            mul  resL,tmp       '2
            getword tmp,a,#1    '1
            shl  a,#16          '1
            add  resL,a  wc     '1
            addx resH,tmp       '1
            getword tmp,b,#1    '1
             retd               '1
            shl  b,#16          '1
            add  resL,b  wc     '1
            addx resH,tmp       '1 = 22 cy
    '
    
    So it takes the same number of clocks as the big multiplier.
    The big multiplier still has the advantage that it runs in parallel with the CPU, so you can execute up to 17 instructions before you read the result (in single-task mode).

    Andy
  • cgracey Posts: 14,133
    edited 2014-03-04 15:34
    Seairth wrote: »
    Is that a global switch? If so, it's quite possible that more than just the CORDIC and ALU MUX could be affected? Crossing fingers that this doesn't increase the core size too much...


    The compiler just finished and I see no improvement in Fmax, just some growth in cell count. Darn!

    I'll leave these extra flops in the design, though, because I know it's going to help with the standard-cell synthesis.
  • cgracey Posts: 14,133
    edited 2014-03-04 15:40
    About coding your way around the big multiplier: In order to do signed multiplication, you'd have to make the inputs absolute, then possibly negate the 64-bit result.
  • Ariba Posts: 2,685
    edited 2014-03-04 15:41
    cgracey wrote: »
    Andy! That's nuts! This goes to show that necessity is the mother of invention. How come this never occurred to me?

    I wonder if we could make some conduit around the multipliers to feed these values in automatically and make an instruction out of it, getting rid of the big hardware multiplier. There are two 20x20 bit multipliers, so that they can be switched between for a 1-clock MAC, while each multiplier actually takes two clocks. Maybe we could make a 5-clock 32x32 multiplier this way.

    Thanks Chip (I hope "that's nuts" is something positive).

    There are other CPUs that have, for example, a 16x32-bit multiply in one cycle and then provide a 2-cycle instruction which uses the same hardware twice for a 32x32-bit multiply.
    What I don't really understand is why this works with signed numbers; I thought I needed to do the 3 MULs with absolute values and add the sign later.

    Yeah, with the help of the two 20x20 multipliers, a much faster 32x32-bit MUL instruction in hardware should be possible.

    Andy
  • Heater. Posts: 21,230
    edited 2014-03-04 15:45
    Sounds like it's time to remove the big multiplier.

    Yes it can execute parallel with other code but I'm convinced that will almost never be used in practice. Firstly because it's hard to do. Secondly because often it's impossible. Compiler writers are almost certain to not make use of that parallelism feature in code generation especially in the face of threading. (Am I right compiler writers?).

    With the problems it causes for threading it seems we could do without the big multiplier.

    Unless of course it shares a lot of its logic with other functions in the CORDIC or elsewhere.
  • cgracey Posts: 14,133
    edited 2014-03-04 15:53
    Ariba wrote: »
    What I don't really understand is why this works with signed numbers; I thought I needed to do the 3 MULs with absolute values and add the sign later.


    Multiplying signed numbers without consideration of their signs works if you are only using the bottom half of the result.

    -1 * -1 with bytes:

    $FF * $FF = $FE01, the bottom half ($01) is correct.
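    The same check in two lines of Python (bytes here, but the argument is identical for words and longs):

    ```python
    # Multiply -1 * -1 as if both operands were unsigned bytes ($FF each).
    prod = 0xFF * 0xFF                        # $FE01: top half is "wrong" for signed
    assert prod & 0xFF == (-1 * -1) & 0xFF    # but the low byte, $01, is correct
    ```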
  • cgracey Posts: 14,133
    edited 2014-03-04 15:56
    Heater. wrote: »
    Sounds like it's time to remove the big multiplier.


    What about getting a 64-bit result? Maybe we'd need to make separate multiply and scale instructions.
  • Sapieha Posts: 2,964
    edited 2014-03-04 15:59
    Hi Chip.

    MUL32 D,S

    D+D1 = Result


    cgracey wrote: »
    What about getting a 64-bit result? Maybe we'd need to make separate multiply and scale instructions.
  • cgracey Posts: 14,133
    edited 2014-03-04 16:07
    Sapieha wrote: »
    Hi Chip.

    MUL32 D,S

    D+D1 = Result


    We only have a single register-write opportunity. We'd have to write the top long off to some hidden register that we can read via special instruction. That would create other problems, like inter-task conflict.
  • Sapieha Posts: 2,964
    edited 2014-03-04 16:12
    Hi Chip.

    Thanks

    Understand
  • ctwardell Posts: 1,716
    edited 2014-03-04 16:13
    Heater. wrote: »
    Sounds like it's time to remove the big multiplier.

    Yes it can execute parallel with other code but I'm convinced that will almost never be used in practice. Firstly because it's hard to do. Secondly because often it's impossible. Compiler writers are almost certain to not make use of that parallelism feature in code generation especially in the face of threading. (Am I right compiler writers?).

    With the problems it causes for threading it seems we could do without the big multiplier.

    Unless of course it shares a lot of its logic with other functions in cordic or elsewhere.

    Put the axe away there George!

    IMO this makes no sense.

    All the fits being thrown over moving ahead and now we start doing knee jerk pruning?

    The software multiply may be the best option in code meant for preemptive tasking sometimes but that is not a valid reason to kill off the BHM.

    C.W.
  • cgracey Posts: 14,133
    edited 2014-03-04 16:22
    ctwardell wrote: »
    Put the axe away there George!

    IMO this makes no sense.

    All the fits being thrown over moving ahead and now we start doing knee jerk pruning?

    The software multiply may be the best option in code meant for preemptive tasking sometimes but that is not a valid reason to kill off the BHM.

    C.W.


    I feel this way, too. For now.
  • cgracey Posts: 14,133
    edited 2014-03-04 16:56
    I figured out a way a thread can invoke a yield that a scheduler can detect:

    The thread executes SETTASK #0, turning all the time slots back over to the scheduler, which otherwise is only getting every 16th slot. The scheduler can execute:
    GETCNT  time
    SUBCNT  time
    CMP     time,#1      wz
    

    If Z then a yield occurred. A couple instructions would have executed after the yield (SETTASK #0) in the thread, though.

    I could add a simple instruction that would set Z if single-task mode was enabled. That would be simpler.
  • jmg Posts: 15,155
    edited 2014-03-04 17:15
    cgracey wrote: »
    ... A couple instructions would have executed after the yield (SETTASK #0) in the thread, though.

    I could add a simple instruction that would set Z if single-task mode was enabled. That would be simpler.

    Wouldn't a read of the Task mapping holding register do the same thing?
    Can Threads read that register now?
  • David Betz Posts: 14,514
    edited 2014-03-04 17:23
    Roy Eltham wrote: »
    I imagine much of the size win of "byte code" could be achieved with making a kernel of functions that the main code calls (the opposite of inlining small stuff). Thinking more about it, we could quite easily just have one of those functions be a "byte code" interpreter in the form of treating the next X bytes as jump table indices. So the Spin2 code would just be PASM2 along with this byte based jump table function, which is essentially the byte code interpreter, but thinking about it this way lends itself to thinking about that table being custom made for the program being compiled based on what it uses (which circles back around to the things we talked about way back when). Ultimately, it always generates PASM2 code, it just has some helpers to allow it to be small.

    I could see the GCC stuff doing the same thing for its CMM model (or just when you optimize for code size vs speed).
    Yes, this is very possible (combining hub execution with a more compact interpreter). I mentioned this possibility recently to the GCC team. We couldn't combine LMM with CMM in P1 because both kernels won't fit in hub memory at the same time, but hub execution mode doesn't really need anywhere near as much COG space, so it might be very doable. Not to mention the fact that the bytecode interpreter could be executed from hub memory itself. This is worth considering. Use the slower, more compact CMM code for stuff that doesn't need the speed and reserve hub execution for fast inner loops. You might even allow the CMM mode to execute from external memory through a cache like XMM does now and put mostly time-critical code in hub memory. Lots of possibilities. Not all of them will end up making sense, but they're all worth considering.
  • David Betz Posts: 14,514
    edited 2014-03-04 17:24
    I've got a simple question but one that has been worrying me a bit lately. How will I know when Chip has posted a new FPGA image if I don't have the time to read the zillions of messages posted to this thread every day? Will a new thread be created to announce the new FPGA release? Please! :-)
  • cgracey Posts: 14,133
    edited 2014-03-04 17:29
    David Betz wrote: »
    I've got a simple question but one that has been worrying me a bit lately. How will I know when Chip has posted a new FPGA image if I don't have the time to read the zillions of messages posted to this thread every day? Will a new thread be created to announce the new FPGA release? Please! :-)


    Okay. New thread.
  • cgracey Posts: 14,133
    edited 2014-03-04 17:31
    jmg wrote: »
    Wouldn't a read of the Task mapping holding register do the same thing?
    Can Threads read that register now?


    We don't have an instruction to read the task register and I don't think it's a good idea to have one, either, as the thing rotates two bits for some variable-length loop. I think it would encourage bad practices to make it available. That's just my gut feeling.
  • jmg Posts: 15,155
    edited 2014-03-04 17:43
    cgracey wrote: »
    We don't have an instruction to read the task register and I don't think it's a good idea to have one, either, as the thing rotates two bits for some variable-length loop. I think it would encourage bad practices to make it available. That's just my gut feeling.

    I was meaning one register back, the holding register, not the rotating register itself.
    Can that holding register be read now ?
    This would give across-thread info about where they were in the (new) pecking order.


    The LOCK based force 100% case would be invisible, but that is inherent anyway, as Threads set to 0 cannot run, plus the Thread forcing itself should know it has just done so :)
  • David Betz Posts: 14,514
    edited 2014-03-04 17:47
    cgracey wrote: »
    Okay. New thread.
    Thanks Chip!
  • Cluso99 Posts: 18,069
    edited 2014-03-04 17:56
    cgracey wrote: »
    In the CORDIC block, it was necessary to make a special arithmetic shifter that could produce a result fast enough to feed the adder in time, so that the CORDIC block could operate as fast as the rest of the cog. This was done by pre-decoding the mux selector bits that fed the shifter, and then storing them in flipflops. Not only were those signals going to be ready at the start of the next cycle, when they would be needed, but I even made multiple copies of them, so that for every 8 bits that needed mux'ing, I had a unique flipflop Q output to drive their mux selector inputs. This made things way faster because it reduced the load that each signal would have to drive, enabling them to toggle their loads faster.

    There are two things that take time in silicon logic: computation through gates (AND/OR/ADD/MUL) and buffering (or amplification) of signals, so that they can, in turn, drive many other inputs (fan-out), or even just long wires en route to other inputs. Buffering is accomplished by simply running a signal through a series of inverters of increasing size, and for optimal speed, an increase factor of e (~2.7, the base of the natural log) can be used. Practically, buffering inverter chains just use a factor of 3, because it's an integer close to e.

    Anyway, the ALU result mux's in the Prop2 are huge, with the largest (slowest) having upwards of 64 inputs, each 34 bits wide (Z override, Z, C, 32-bit result). For some reason, I never thought to apply the idea of increasing parallelism in these mux's selector signals, while these mux's have long formed the bulk of critical paths in the design. It's very inexpensive to do this, too, since even the biggest mux just uses 6 unique signals to select 1 of 64 results. So, at the cost of a mere 48 new flipflops, I increased the mux driver signals by a factor of 4, so that instead of each signal routing 34 bits, four signals will route ~9 bits each. This should increase the Fmax of the design. It's compiling right now. I'm anxious to see how the timing improves on the FPGA, as this will translate directly into an improvement in the speed of the synthesized logic in the final chip.

    In doing all that synthesis work before our last tape-out, I realized that speeding up even non-critical-path circuits makes the overall design faster because it allows the design compiler to spread them out more (while still meeting timing requirements), so that critical-path circuits can be grouped more closely together, reducing their need for signal-buffering and long wiring.
    WTG Chip. These are the brilliant little things that are the real gems that many designers never really understand or implement.

    Postedit: Shame it didn't result in a speed improvement, but easing the layout constraints is nevertheless worthwhile.
  • Roy Eltham Posts: 2,996
    edited 2014-03-04 17:57
    David,
    I imagined the "byte code interpreter" as just being a function you call, passing a buffer and count, and it calls functions in a table using that buffer data. So the program is just PASM2 code that happens to call that function with different buffers as needed. In the extreme case, the PASM2 code consists of just one call to that function with a single buffer that is the whole program. Functions in the table can read data from the buffer (and increment the pointer into the buffer accordingly), as needed. So the buffer can contain indices into the table and data. It's a generic idea that could work for any compiled code.
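    The dispatch idea can be sketched in a dozen lines of Python (a hypothetical model only: the opcodes, handlers, and run function are invented for illustration, not Spin2 or PASM2 specifics):

    ```python
    def run(buf):
        """Tiny dispatch-table interpreter: each byte indexes a handler table.

        Handlers may consume operand bytes from the buffer and return the
        updated program counter, just as Roy describes.
        """
        stack, pc = [], 0

        def op_push(pc):                       # operand byte follows the opcode
            stack.append(buf[pc])
            return pc + 1

        def op_add(pc):                        # pop two values, push their sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
            return pc

        table = [op_push, op_add]              # byte 0 -> push, byte 1 -> add

        while pc < len(buf):
            op = buf[pc]                       # fetch the "byte code"
            pc = table[op](pc + 1)             # dispatch through the table
        return stack

    # run(bytes([0, 2, 0, 3, 1])) pushes 2 and 3, then adds them -> [5]
    ```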
  • cgracey Posts: 14,133
    edited 2014-03-04 18:25
    Roy Eltham wrote: »
    David,
    I imagined the "byte code interpreter" as just being a function you call, passing a buffer and count, and it calls functions in a table using that buffer data. So the program is just PASM2 code that happens to call that function with different buffers as needed. In the extreme case, the PASM2 code consists of just one call to that function with a single buffer that is the whole program. Functions in the table can read data from the buffer (and increment the pointer into the buffer accordingly), as needed. So the buffer can contain indices into the table and data. It's a generic idea that could work for any compiled code.


    That's a really neat way to handle it. You could do the CALL with the byte codes directly following, so that the routine called pops the address and starts interpreting the bytes, until it hits a return byte code, at which point it returns to the next long. Like you implied, PASM could be the default mode for Spin.

    P.S. I'm adding PTRX/PTRY to all tasks so that each task can exercise every type of CALL/RET/PUSH/POP. This is something I thought should be done before the final coding for the thread save/restore happens. This adds 48 flops per cog.
  • cgracey Posts: 14,133
    edited 2014-03-04 18:29
    jmg wrote: »
    I was meaning one register back, the holding register, not the rotating register itself.
    Can that holding register be read now ?
    This would give across-thread info about where they were in the (new) pecking order.


    The LOCK based force 100% case would be invisible, but that is inherent anyway, as Threads set to 0 cannot run, plus the Thread forcing itself should know it has just done so :)


    There is no static holding register, just a 2-bit wide variable-length loop of flops.

    The thing about looking at or tweaking the task order is that it could change at any time by some other task (maybe a scheduler?) doing a SETTASK. It's kind of like worrying about what cogs are running, because it's very fluid. If you start or stop a cog, though, that is a definitive action that will have a certain effect.