System Verilog Translation

Kye · 2014-08-20 13:59

I was wondering if anyone would want to use a version of the P1 core translated from Verilog to System Verilog.

System Verilog is to Verilog what C++ is to C. System Verilog adds a ton of new features to Verilog that make writing Verilog code almost as easy as writing C. Translating the P1 core to System Verilog will make it substantially easier for the community to modify the core.

Below is an example of a System Verilog FIFO. The logic of the FIFO more or less reads like C code:

// By: Kwabena W. Agyeman

module SREG_FIFO #(parameter WIDTH = 8, parameter DEPTH = 8)
(
    input logic clk, rst_n,
    
    input logic fifo_enqueue,
    input logic fifo_dequeue,
    
    output logic [$clog2(DEPTH):0] fifo_free_space, // one extra bit so the count can reach DEPTH
    output logic [$clog2(DEPTH):0] fifo_used_space,
    
    output logic fifo_full,
    output logic fifo_empty,


    input logic [WIDTH-1:0] fifo_data_in,
    output logic [WIDTH-1:0] fifo_data_out
);
    
    logic [$clog2(DEPTH):0] counter, counter_q;
    
    dff #(.WIDTH($bits(counter))) counter_register
    (
        .clk(clk), .rst_n(rst_n),
        .d(counter),
        .q(counter_q)
    );
    
    logic [DEPTH-1:0][WIDTH-1:0] fifo_register, fifo_register_q;
    
    genvar k;
    
    generate
        for(k = (DEPTH-1); k >= 0; k--) begin : fifo
            dff #(.WIDTH($bits(fifo_register[k]))) register
            (
                .clk(clk), .rst_n(rst_n),
                .d(fifo_register[k]),
                .q(fifo_register_q[k])
            );
        end : fifo
    endgenerate


    always_comb begin : fifo_logic
        
        enum logic [1:0] { HOLD, ENQUEUE, DEQUEUE, BOTH } action; 
        
        action = HOLD;
        
        counter = counter_q;
        fifo_register = fifo_register_q;


        case({fifo_dequeue, fifo_enqueue})
             
            1: begin
                if(counter_q < DEPTH) begin
                    action = ENQUEUE;
                    counter = counter_q + 1;
                end
            end
            
            2: begin
                if(counter_q > 0) begin
                    action = DEQUEUE;
                    counter = counter_q - 1;
                end    
            end
                
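            3: begin
                // Enqueue and dequeue together leave the count unchanged.
                // An empty FIFO has nothing to dequeue, so treat that case as a plain enqueue.
                if(counter_q > 0) begin
                    action = BOTH;
                end else begin
                    action = ENQUEUE;
                    counter = counter_q + 1;
                end
            end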
            
        endcase
        
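        for(int i = (DEPTH-1); i >= 0; i--) begin 

            // Shift toward the head. Read from the registered copy (fifo_register_q)
            // so a slot never picks up a neighbor that was already rewritten in this pass.
            if((action == DEQUEUE) || (action == BOTH)) begin
                if(i != (DEPTH-1)) begin
                    fifo_register[i] = fifo_register_q[i+1]; 
                end else begin
                    fifo_register[i] = '0; 
                end
            end
             
            // Write at the tail. After a simultaneous dequeue the tail sits one slot lower.
            if(((action == ENQUEUE) && (i == counter_q)) ||
               ((action == BOTH) && (i == (counter_q - 1)))) begin
                fifo_register[i] = fifo_data_in;     
            end
            
        end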


        fifo_free_space = DEPTH - counter_q;
        fifo_used_space = counter_q;
        
        fifo_full = !fifo_free_space;
        fifo_empty = !fifo_used_space;


        fifo_data_out = fifo_register_q[0];
        
    end : fifo_logic
    
endmodule : SREG_FIFO


module dff #(parameter WIDTH = 8)
(
    input logic clk, rst_n,
    
    input logic [WIDTH-1:0] d,
    output logic [WIDTH-1:0] q
);
        
    always_ff @(posedge clk or negedge rst_n) begin : dff_logic
        if(!rst_n) begin
            q <= '0;    
        end else begin
            q <= d; 
        end
    end : dff_logic
    
endmodule : dff
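
A quick way to exercise the FIFO above is a small smoke test like the following. This is only a sketch and not part of the original post: the testbench name `SREG_FIFO_tb`, the depth of 4, the 10-time-unit clock, and the `8'hA0` test pattern are all assumptions picked for illustration. It assumes the module compiles under your simulator with every register wired to `rst_n`.

```systemverilog
// Hypothetical smoke test for SREG_FIFO; names and values are illustrative only.
module SREG_FIFO_tb;

    localparam int WIDTH = 8;
    localparam int DEPTH = 4;

    logic clk, rst_n;
    logic fifo_enqueue, fifo_dequeue;
    logic [$clog2(DEPTH):0] fifo_free_space, fifo_used_space;
    logic fifo_full, fifo_empty;
    logic [WIDTH-1:0] fifo_data_in, fifo_data_out;

    // Signal names match the port names, so .* connects everything.
    SREG_FIFO #(.WIDTH(WIDTH), .DEPTH(DEPTH)) dut (.*);

    // Free-running clock with an assumed period of 10 time units.
    initial clk = 1'b0;
    always #5 clk = ~clk;

    initial begin
        rst_n = 1'b0;
        fifo_enqueue = 1'b0;
        fifo_dequeue = 1'b0;
        fifo_data_in = '0;

        repeat (2) @(negedge clk);
        rst_n = 1'b1;
        assert (fifo_empty) else $error("FIFO should be empty after reset");

        // Fill the FIFO. Inputs change on the falling edge, away from the sampling edge.
        for (int i = 0; i < DEPTH; i++) begin
            fifo_enqueue = 1'b1;
            fifo_data_in = 8'hA0 + i;
            @(negedge clk);
        end
        fifo_enqueue = 1'b0;
        assert (fifo_full) else $error("FIFO should be full after DEPTH pushes");

        // Drain it and check that data comes out in the order it went in.
        for (int i = 0; i < DEPTH; i++) begin
            assert (fifo_data_out == 8'hA0 + i)
                else $error("entry %0d: got %h", i, fifo_data_out);
            fifo_dequeue = 1'b1;
            @(negedge clk);
        end
        fifo_dequeue = 1'b0;
        assert (fifo_empty) else $error("FIFO should be empty after DEPTH pops");

        $finish;
    end

endmodule : SREG_FIFO_tb
```

The test only covers fill-then-drain. Simultaneous enqueue and dequeue, and pushes into a full FIFO, would need their own cases.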

Writing hardware code in this high level style makes it very easy to add new features at the cost of having to trust the compiler a lot more. Ten years ago you wouldn't want to write Verilog code this way because the compilers weren't that good. But, now you can do this. In fact, compilers actually prefer the code being written in high level style now as it gives them more flexibility with how to implement your logic. The more specific you are the more you tie the tool's hands.

I'm interested in doing this because I want to make architectural modifications to the P1 core. For example, two things I want to do are to slice the hub up into 8 pieces like the P2 hub and make the hub 32-bit accessible. Chip's style is super optimized which is great! But... this also means that changing the P1 architecture will be much trickier.

Anyway, so, would doing this be of value? Will anyone be interested in it even if it's not the official Parallax source? Note that the code will inherit the GPL license like the Parallax source. My modifications will also be GPL.

NOTE: The System Verilog translation will only be functionally equivalent to the official P1 code. It will not be the same down to the gate level.

mindrobots · 2014-08-20 15:52

Kye,

I say "go for it!"

It looks like a valuable addition to the project. Certainly a leading tool and if it inspires other flights of imagination or is something that clicks better than plain Verilog for some people, it is certainly a worthy endeavor!

I for one would be interested in seeing the differences.

Seairth · 2014-08-21 04:56

I think this is a very good idea, and I'd be happy to help.

As for its equivalence to the P1, I don't think that should be a concern. If a future SV-based design is worth converting into an ASIC, gate-level issues (if there are any) can be dealt with then.

Kye · 2014-08-21 05:29

I worked on this last night and got about half way done converting the ALU. It's way easier to read the code now. For example:

            8: begin : ror_op
                logic [63:0] temp;
                temp = {in.d, in.d} >> in.s[4:0];
                out.wr = 1'b1;
                out.r = temp[31:0];
                out.co = in.d[0];
                out.zo = !out.r;
            end : ror_op


            9: begin : rol_op
                logic [63:0] temp;
                temp = {in.d, in.d} << in.s[4:0];
                out.wr = 1'b1;
                out.r = temp[63:32];
                out.co = in.d[31];
                out.zo = !out.r;
            end : rol_op


            10: begin : shr_op
                out.wr = 1'b1;
                out.r = in.d >> in.s[4:0];
                out.co = in.d[0];
                out.zo = !out.r;
            end : shr_op


            11: begin : shl_op
                out.wr = 1'b1;
                out.r = in.d << in.s[4:0];
                out.co = in.d[31];
                out.zo = !out.r;
            end : shl_op


            12: begin : rcr_op
                logic [63:0] temp;
                temp = {{32{in.ci}}, in.d} >> in.s[4:0];
                out.wr = 1'b1;
                out.r = temp[31:0];
                out.co = in.d[0];
                out.zo = !out.r;
            end : rcr_op


            13: begin : rcl_op
                logic [63:0] temp;
                temp = {in.d, {32{in.ci}}} << in.s[4:0];
                out.wr = 1'b1;
                out.r = temp[63:32];
                out.co = in.d[31];
                out.zo = !out.r;
            end : rcl_op


            14: begin : sar_op
                logic [63:0] temp;
                temp = {{32{in.d[31]}}, in.d} >> in.s[4:0];
                out.wr = 1'b1;
                out.r = temp[31:0];
                out.co = in.d[0];
                out.zo = !out.r;
            end : sar_op


            15: begin : rev_op
                logic [63:0] temp;
                temp = {{32{1'b0}}, {<<{in.d}}} << in.s[4:0];
                out.wr = 1'b1;
                out.r = in.s ? temp[63:32] : temp[31:0];
                out.co = in.d[0];
                out.zo = !out.r;
            end : rev_op

The ALU is now just one big mux with the result of every possible instruction going into it. Only the selected instruction gets outputted.

I'm not sure if the tool will optimize common logic. I'm used to using Design Compiler at work where this is the case.
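
For readers wondering what surrounds those case items, here is a rough guess at the skeleton. It is a sketch, not the code from the post: the package, module, and type names (`alu_sketch_pkg`, `alu_sketch`, `alu_in_t`, `alu_out_t`) and the 6-bit `op` field are assumptions. Only the field names `d`, `s`, `ci`, `wr`, `r`, `co`, and `zo` come from the snippets above.

```systemverilog
// Hypothetical skeleton; every name except the struct field names is invented.
package alu_sketch_pkg;

    typedef struct packed {
        logic [5:0]  op;  // selects the case item (width is an assumption)
        logic [31:0] d;   // destination operand
        logic [31:0] s;   // source operand
        logic        ci;  // carry in
    } alu_in_t;

    typedef struct packed {
        logic        wr;  // result should be written back
        logic [31:0] r;   // result
        logic        co;  // carry out
        logic        zo;  // zero out
    } alu_out_t;

endpackage : alu_sketch_pkg


module alu_sketch
    import alu_sketch_pkg::*;
(
    input  alu_in_t  in,
    output alu_out_t out
);

    always_comb begin : alu_logic

        // Default every output so unlisted opcodes do not infer latches.
        out = '0;

        // Each case item computes its own result; the case statement
        // is the "one big mux" that picks which one reaches the output.
        case(in.op)

            10: begin : shr_op
                out.wr = 1'b1;
                out.r = in.d >> in.s[4:0];
                out.co = in.d[0];
                out.zo = !out.r;
            end : shr_op

            11: begin : shl_op
                out.wr = 1'b1;
                out.r = in.d << in.s[4:0];
                out.co = in.d[31];
                out.zo = !out.r;
            end : shl_op

            default: ; // remaining opcodes omitted from this sketch

        endcase

    end : alu_logic

endmodule : alu_sketch
```

Written this way, each opcode describes its own shifter, so any sharing between them is left entirely to the synthesis tool.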

mindrobots · 2014-08-21 05:48

It will be interesting to see the Quartus build numbers from Verilog versus System Verilog.

The code sure is going to look different!!

cgracey · 2014-08-21 08:29

Kye wrote: »

I worked on this last night and got about half way done converting the ALU. It's way easier to read the code now. For example:

            8: begin : ror_op
                logic [63:0] temp;
                temp = {in.d, in.d} >> in.s[4:0];
                out.wr = 1'b1;
                out.r = temp[31:0];
                out.co = in.d[0];
                out.zo = !out.r;
            end : ror_op


            9: begin : rol_op
                logic [63:0] temp;
                temp = {in.d, in.d} << in.s[4:0];
                out.wr = 1'b1;
                out.r = temp[63:32];
                out.co = in.d[31];
                out.zo = !out.r;
            end : rol_op


            10: begin : shr_op
                out.wr = 1'b1;
                out.r = in.d >> in.s[4:0];
                out.co = in.d[0];
                out.zo = !out.r;
            end : shr_op


            11: begin : shl_op
                out.wr = 1'b1;
                out.r = in.d << in.s[4:0];
                out.co = in.d[31];
                out.zo = !out.r;
            end : shl_op


            12: begin : rcr_op
                logic [63:0] temp;
                temp = {{32{in.ci}}, in.d} >> in.s[4:0];
                out.wr = 1'b1;
                out.r = temp[31:0];
                out.co = in.d[0];
                out.zo = !out.r;
            end : rcr_op


            13: begin : rcl_op
                logic [63:0] temp;
                temp = {in.d, {32{in.ci}}} << in.s[4:0];
                out.wr = 1'b1;
                out.r = temp[63:32];
                out.co = in.d[31];
                out.zo = !out.r;
            end : rcl_op


            14: begin : sar_op
                logic [63:0] temp;
                temp = {{32{in.d[31]}}, in.d} >> in.s[4:0];
                out.wr = 1'b1;
                out.r = temp[31:0];
                out.co = in.d[0];
                out.zo = !out.r;
            end : sar_op


            15: begin : rev_op
                logic [63:0] temp;
                temp = {{32{1'b0}}, {<<{in.d}}} << in.s[4:0];
                out.wr = 1'b1;
                out.r = in.s ? temp[63:32] : temp[31:0];
                out.co = in.d[0];
                out.zo = !out.r;
            end : rev_op

The ALU is now just one big mux with the result of every possible instruction going into it. Only the selected instruction gets outputted.

I'm not sure if the tool will optimize common logic. I'm used to using Design Compiler at work where this is the case.

Kye, do you have any relative LE counts between this and the original? I bet this compiles under Quartus to be several times bigger. I don't believe that it will see the commonality among all those huge shifters.

Ramon · 2014-08-21 08:30

Kye wrote: »

Ten years ago you wouldn't want to write Verilog code this way because the compilers weren't that good. But, now you can do this. [...] Anyway, so, would doing this be of value?

Hi Kye, looks like a good idea. What are the free tools to work with System Verilog?

Seairth · 2014-08-21 08:39

Ramon wrote: »

Hi Kye, looks like a good idea. What are the free tools to work with System Verilog?

Quartus II works with SV just fine, as does ModelSim.

Kye · 2014-08-21 11:11

@Chip

Yeah, I'm concerned about that. At work, using Design Compiler the tool will optimize the logic completely. However, I don't know if Quartus II does this. I haven't compiled the ALU yet, I need to finish coding it first. But, if the tool's optimizer is good then I should be able to continue with this approach. Otherwise I'll have to optimize the code myself.

However, the whole point of writing the code in this high level style is to make it easy to modify. If the tool requires me to heavily optimize everything like you have already done then this exercise is moot.

@Ramon

Use DVKit = http://dvkit.sourceforge.net/

It's Eclipse with System Verilog support. It can do all the fancy stuff like call tips and ctrl+click navigation. That said, it's just an editor. You have to use Quartus to compile and ModelSim to simulate.

@Seairth

Thanks for the offer. If writing the ALU in high level System Verilog style produces good synthesis results then I'll create a GitHub repo for others to work on the code.

cgracey · 2014-08-21 12:06

Kye wrote: »

@Chip

Yeah, I'm concerned about that. At work, using Design Compiler the tool will optimize the logic completely. However, I don't know if Quartus II does this. I haven't compiled the ALU yet, I need to finish coding it first. But, if the tool's optimizer is good then I should be able to continue with this approach. Otherwise I'll have to optimize the code myself.

However, the whole point of writing the code in this high level style is to make it easy to modify. If the tool requires me to heavily optimize everything like you have already done then this exercise is moot.

Perhaps Design Compiler has some intermediate output that could be fed into Quartus, so that it could be used to make these far-flung inferences. Then, Quartus could just map the distilled logic into its fabric.

One benefit of thinking about the low-level implementation is that you tend to design things that have a lot of overlap and that will always result in smaller circuits. You actually morph your problem to allow a simpler hardware solution. I don't think any compiler can be smart enough to understand the goal in such a way.
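
As an illustration of the kind of overlap described here, the sketch below funnels ROR, ROL, SHR, and SHL through a single rotate-right network instead of giving each opcode its own shifter. It is not taken from the P1 source: the module and signal names are invented, and the carry and zero flags and the other shift variants are left out. The small testbench compares it against the one-expression-per-opcode style using random inputs.

```systemverilog
// Hypothetical sketch, not the P1 implementation. All names are invented.
module shared_shifter
(
    input  logic [31:0] d,
    input  logic [4:0]  s,
    input  logic        left,    // 1 = ROL/SHL, 0 = ROR/SHR
    input  logic        rotate,  // 1 = rotate, 0 = zero-filling shift
    output logic [31:0] r
);

    logic [31:0] d_rev, d_in, rot, res, res_rev;
    logic [63:0] doubled;

    always_comb begin : shifter_logic
        // A left operation is a right operation on the bit-reversed word.
        d_rev = {<<{d}};
        d_in  = left ? d_rev : d;

        // The single rotate-right network that all four operations share.
        doubled = {d_in, d_in} >> s;
        rot     = doubled[31:0];

        // A shift is a rotate with the wrapped-around bits masked to zero.
        res = rotate ? rot : (rot & (32'hFFFF_FFFF >> s));

        // Undo the reversal for left operations.
        res_rev = {<<{res}};
        r       = left ? res_rev : res;
    end : shifter_logic

endmodule : shared_shifter


// Self-checking comparison against the straightforward expressions.
module shared_shifter_tb;

    logic [31:0] d, r, expected;
    logic [63:0] wide;
    logic [4:0]  s;
    logic        left, rotate;

    shared_shifter dut (.*);

    initial begin
        repeat (1000) begin
            d      = $urandom();
            s      = $urandom_range(31);
            left   = $urandom_range(1);
            rotate = $urandom_range(1);

            wide = left ? ({d, d} << s) : ({d, d} >> s);
            case({left, rotate})
                2'b00: expected = d >> s;       // SHR
                2'b01: expected = wide[31:0];   // ROR
                2'b10: expected = d << s;       // SHL
                2'b11: expected = wide[63:32];  // ROL
            endcase

            #1; // let the combinational logic settle
            assert (r === expected)
                else $error("left=%b rotate=%b d=%h s=%0d: got %h, expected %h",
                            left, rotate, d, s, r, expected);
        end
        $finish;
    end

endmodule : shared_shifter_tb
```

Whether a given synthesis tool would find this structure on its own from independent per-opcode shifts is exactly the open question in this thread. Writing it out by hand removes the guesswork, at some cost in readability.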

Kye · 2014-08-21 17:08

Design Compiler can generate Verilog file outputs. However, these Verilog files are mapped to the chip technology library. So, it's not full of simple gate connections.

It looks like my design uses up about 3051 LE vs your ALU of 613 LEs. Given the short compile time of about 10 seconds I don't think Quartus II did much work. Design Compiler in ultra mode would normally spend 15 to 20 minutes compiling the ALU I wrote. It's very serious about making the smallest possible circuit.

cgracey wrote: »

One benefit of thinking about the low-level implementation is that you tend to design things that have a lot of overlap and that will always result in smaller circuits. You actually morph your problem to allow a simpler hardware solution. I don't think any compiler can be smart enough to understand the goal in such a way.

I really like how efficient the Propeller ALU is. It's really quite impressive.

Kye · 2014-08-21 18:11

Sigh... I guess I got ahead of myself. I don't think I'll spend time translating the source code to System Verilog in a high level style if the tool can't optimize it. There's not much point in doing this if I have to spend a lot of time optimizing everything (also, I don't have a lot of time to spend optimizing everything).

I can still add modifications to the code however!

cgracey · 2014-08-21 18:57

Kye wrote: »

Sigh... I guess I got ahead of myself. I don't think I'll spend time translating the source code to System Verilog in a high level style if the tool can't optimize it. There's not much point in doing this if I have to spend a lot of time optimizing everything (also, I don't have a lot of time to spend optimizing everything).

I can still add modifications to the code however!

Wait! Let's find out how smart Design Compiler really is.

Would it be too much trouble to compile both versions with Design Compiler using some standard CMOS library you've got handy? I'm really curious to see if it makes those combined shifter inferences.

(For those of you that don't know, Design Compiler is pretty much the industry-standard workhorse for ASIC logic compilation. It's made by Synopsys and costs about $120k/year, though the price is always secret. It's renowned for its thoroughness.)

Kye · 2014-08-21 20:11

Doh!

I already deleted all the code I wrote. Too late. However, just for kicks I can compile the P1 source using Design Compiler.

...

I suppose I can also compare the code I posted here with the P1 shifter code too.

cgracey wrote: »

(For those of you that don't know, Design Compiler is pretty much the industry-standard workhorse for ASIC logic compilation. It's made by Synopsys and costs about $120k/year, though the price is always secret. It's renowned for its thoroughness.)

Per seat

Tor · 2014-08-21 22:57

Ouch! Never delete code you have written, even if it turned out not to be immediately useful. Something I started to appreciate as I got older is that information must never be deleted; too much stuff has left the earth by now. That is why I'm now into mirroring web sites with information: you never know when they will go away, or what will be wanted. For example, a few years ago there was a site with an interesting, obscure article about a very unusual 6502 trick, with two pictures of schematics. The site disappeared, the Wayback Machine has the text but not the schematics, the site owner disappeared off the earth, and people keep asking if someone had mirrored that site. Nobody had. No copies to be found anywhere. This must not happen. Fortunately bandwidth and TB disks are cheap now, so I mirror all I see that is of any interest.

For my own stuff, everything I write, even test stubs, goes into its own Git repository, which is extremely easy to do. It is also easy to back up and save, and takes very little space. Occasionally I want to do something in Perl or C or TeX or whatever and realise that I probably have a stub somewhere that did something similar - and it's there. I didn't always do this in the past. But sometime in 1997 I was analysing a tricky problem in handling satellite data in realtime, and remembered that I had already solved that problem in 1985, and by luck I found an old CCT (reel-to-reel tape) with the code. I had to dig out an old CCT drive and whip up a tool to read the old format, but I found the source code and could incorporate it almost directly into the new system. Saved a lot of time. Now I'm more careful and store everything in a more accessible way. I have lost much more than I have kept of the old code, back when I just made random backups now and then, with no real thought for the future.

The short version of the above is that if you put effort into creating something, don't delete it.

KeithE · 2014-08-22 10:49

If you have access to Synplify Pro then you could turn on "resource sharing" and see if it helps. I think that Quartus has a similar option?

One thing ASIC designers on tight schedules need to consider is the ECO-ability of code. e.g. Will the tool Conformal ECO be able to handle ECOs efficiently? This isn't really a Verilog versus SystemVerilog issue - it's about coding style. I wonder if the coding style above would hinder this or not? Once you get to the ECO level you need to have a mental model of the structure of the design.

Kye · 2014-08-22 12:21

I did turn on resource sharing. Didn't seem to change the output. I don't think that works for everything...

...

As for ECOs, writing in a higher level style is not a hindrance. As a designer you have to write code that meets timing (and area). So, this means you always have an idea of what you're generating. The goal of the higher level style is just to leave the optimizing work to the tool. No matter the code style you write in... once the technology mapping process takes place the output gate level netlist will look nothing like what you wrote. Technology libraries have all kinds of crazy gates that the tool will use. Even if you code up a simple addition of two values together the tool may decide that it wants to replace the adder with complex gates that take into account the logic feeding the adder. The tool will generally output a netlist that looks nothing at all like you think it should be.

EDIT: We don't use FPGAs at my job so I think I approached this whole thing with the wrong mind set. My primary goal was to write easy to understand code that is very fast. Modern processes more or less give you unlimited resources to play with. I've gotten used to just focusing on performance and not thinking about resource costs (within reason).

KeithE · 2014-08-23 21:00

In our experience at work, coding style does indeed influence ECO-ability. There are tools like Conformal ECO that have proved invaluable at times, and useless at other times. Wouldn't you love to have the tools do your ECOs for you? We have a guy that really knows Conformal and he wrote up coding guidelines for ECO-ability based on many examples of things falling apart. Coding style can also affect the time to run formal equivalence checking, and you tend to do this many times near the end of a project. When you're close to tapeout on a critical project, which happens often in consumer electronics with set product cycles, these do matter. I work on chips that ship in the hundreds of millions. If you slip a tapeout you might miss a product launch. (I'm not recommending anyone work on this stuff - consumer electronics in the valley is a real grind. And of course resources and power matter.)

Maybe if you're trusting Formality and getting side files sent by DC things are better, but in our group there are those that want formal checks done by a non-Synopsys "third party" tool.

Kye · 2014-08-24 16:30

Hi KeithE,

I'm a new college hire, one year of experience right now. I only know a little bit about what they do at my job for ECOs. I bet you know way more about this stuff.

I'm working in the non-consumer sector. While power matters for us, our envelope is in the tens of watts. Latency and throughput are the most important factors.

KeithE · 2014-08-24 19:06

I'm a fellow CMU alumnus - class of '89, with Bruce Krogh as my advisor ;-) I think that you picked an interesting area - consumer stuff can burn you out due to the schedules, and endless variations on a theme. And a lot of people end up just connecting up a bunch of existing IP which isn't all that interesting either. I've been fortunate to be working on new IPs. So I hope you're having fun in your area! In all fairness we hadn't even tried the Conformal ECO flow until the last couple of years, and there's no way I would have gotten it to work. Synopsys had a similar product "ECO Compiler" but ended support - probably because it wasn't profitable due to the support costs. We had run into long runtimes before that with formal equivalence checks, mostly debugged by the same expert.
