FPGA P1 DSP hybrid
Dr_Acula
Posts: 5,484
Inside Quartus is a treasure trove of DSP functions that can be dropped into an emulation as pre-written modules. Hardware multiply, divide, log, sin, cosine, square root, multiple floating point functions, filters, transforms, video processing and fft.
Of course, much of this exists on the propeller in software, but the difference here is the speed. FPGA chips have hardware multipliers (even the humble cyclone II has a number of these).
I'm working on creating a hybrid Z80/DSP chip in FPGA and some of the principles are going to be very similar to doing this for the propeller.
Open Quartus and go to Tools and then, about 3/4 of the way down, MegaWizard Plug-In Manager. Create new custom function. Select the language - you can choose AHDL, VHDL or Verilog. On the left side are all the functions available. Let's try something really simple - an 8x8 bit multiplier. Select Arithmetic, then LPM_MULT. Create a test name for the file. Click on Next and it asks how many bits to use. The default is 8x8 but you can use more - there are 18-bit multipliers on the Cyclone II. It works out for you that 8x8 will create a 16 bit output. You can select signed or unsigned. You can also select if one of the values is a constant. Select the defaults on the next screen for speed/area. Click next a few times and click finish.
There is a pop up box at this stage about adding this to your existing project. Click No to this. But just for reference, you can add it later with Project/Add remove files in project.
Browse to where the file was saved, and the core of the file is
module testmul2 (
    dataa,
    datab,
    result);

    input  [7:0]  dataa;
    input  [7:0]  datab;
    output [15:0] result;

    wire [15:0] sub_wire0;
    wire [15:0] result = sub_wire0[15:0];

    lpm_mult lpm_mult_component (
        .dataa (dataa),
        .datab (datab),
        .result (sub_wire0),
        .aclr (1'b0),
        .clken (1'b1),
        .clock (1'b0),
        .sum (1'b0));
    defparam
        lpm_mult_component.lpm_hint = "MAXIMIZE_SPEED=5",
        lpm_mult_component.lpm_representation = "UNSIGNED",
        lpm_mult_component.lpm_type = "LPM_MULT",
        lpm_mult_component.lpm_widtha = 8,
        lpm_mult_component.lpm_widthb = 8,
        lpm_mult_component.lpm_widthp = 16;

endmodule
An 8x8 multiply is fairly simple - the code for floating point functions is more complex. But they all have a common structure, in both VHDL and Verilog, with inputs named "dataa" and "datab" and an output named "result".
So now you can plug these into existing modules.
How to do this? I'm not entirely sure for the propeller, but I'll describe what I'm doing for a Z80 and I think the process should be similar. First, the Z80 comes with 256 ports. On a real chip, the address lines select the port (technically there are actually 65536 ports but 256 ports was always enough in hardware). And there is an iorq (io request) line and there is /wr and /rd. So to output to a port, use an instruction OUT (portnumber),A and the iorq line would go low, /wr would go low and the portnumber would appear on A0-A7.
For an FPGA emulation these ports can exist internally. So we can decode port numbers, decode address ranges, and latch values, all inside the chip. Well, 8 bits are a bit limiting, so the first thing is to create an internal 32 bit bus. Write out to four 8 bit ports and store that value in a 32 bit latch which can be dataa. Write out to another four ports and that can be datab. Read back through four more ports and that can be result. So now with 32 bits we can start to play around with large multiplication and division, and proper floating point maths.
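To make the byte-to-long idea concrete, here is a minimal Verilog sketch of the latching. It is not the actual design: the module name, the signal names, the single-cycle io_wr strobe and the least-significant-byte-first ordering are all assumptions for illustration.

```verilog
// Sketch only: names, strobe timing and byte order are assumptions.
// Twelve byte-wide I/O ports map onto three 32-bit words:
//   offset 0-3  -> dataa  (write), least significant byte first
//   offset 4-7  -> datab  (write)
//   offset 8-11 -> result (read)
module ports32_sketch (
    input  wire        clk,
    input  wire        reset,
    input  wire        io_wr,   // high for one clock when the CPU writes a port
    input  wire [3:0]  offset,  // port address minus the (assumed) base address
    input  wire [7:0]  din,     // byte coming from the CPU
    input  wire [31:0] result,  // 32-bit answer from the function block
    output reg  [31:0] dataa,
    output reg  [31:0] datab,
    output reg  [7:0]  dout     // byte returned to the CPU on a port read
);
    // Each write replaces one byte lane of the selected 32-bit latch.
    always @(posedge clk) begin
        if (reset) begin
            dataa <= 32'd0;
            datab <= 32'd0;
        end else if (io_wr) begin
            case (offset[3:2])
                2'b00:   dataa[8*offset[1:0] +: 8] <= din;
                2'b01:   datab[8*offset[1:0] +: 8] <= din;
                default: ;  // the result ports are read-only
            endcase
        end
    end

    // Reads pick one byte lane out of the 32-bit result.
    always @(*) begin
        if (offset[3:2] == 2'b10)
            dout = result[8*offset[1:0] +: 8];
        else
            dout = 8'h00;
    end
endmodule
```

The upper two bits of the offset pick the word and the lower two pick the byte lane, so the decode stays tiny whatever base address is chosen.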
Then we need a way of grouping all these functions. This can also be done through a port that selects the function. Output to an 8 bit port the number zero, and for testing purposes, dataa loops back as result. Output the number 1, and datab loops back. Output 2, and the result of dataa*datab is returned. If a hardware multiply is being used, this is pretty much instant.
So we need a series of if..then..else type statements to multiplex the result values from all of the megafunctions.
It seems crazy, but these can all happen in parallel. You might have two functions, one does 8x8 multiply, and one does 16x16 bit, and you output the data and both of these return the result at the same time. It is just that you only read back one.
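A rough Verilog sketch of that multiplexer, to show the "everything runs, one answer is read" idea. The module and signal names are made up, and function number 3 for the 16x16 multiply is an assumption. The `*` operator stands in for the generated LPM_MULT instances (Quartus will usually infer hardware multipliers from it, but that is worth checking in the compilation report).

```verilog
// Sketch only: names are hypothetical; function 3 is an assumed number.
module func_mux_sketch (
    input  wire [7:0]  func,   // value last written to the function-select port
    input  wire [31:0] dataa,
    input  wire [31:0] datab,
    output reg  [31:0] result
);
    // Both multipliers compute continuously and in parallel.
    // In a real build these could be megafunction instances instead of "*".
    wire [15:0] mul8  = dataa[7:0]  * datab[7:0];    // 8x8   -> 16 bits
    wire [31:0] mul16 = dataa[15:0] * datab[15:0];   // 16x16 -> 32 bits

    // Only the selected answer is routed back to the result ports.
    always @(*) begin
        case (func)
            8'd0:    result = dataa;           // loopback test for dataa
            8'd1:    result = datab;           // loopback test for datab
            8'd2:    result = {16'd0, mul8};   // 8x8 multiply
            8'd3:    result = mul16;           // 16x16 multiply (assumed number)
            default: result = 32'd0;           // unused function numbers read as zero
        endcase
    end
endmodule
```

Adding a function is then one more wire and one more case entry. The cost is that every function occupies logic all the time, whether or not it is selected.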
I'm writing this in vhdl but verilog would be similar. Each function can be thought of as a box with inputs and outputs, much like a TTL chip. So the top level is the main microprocessor code, and lower down in the tree are these megafunctions, and so between these there needs to be a new box, which multiplexes all the answers. And then there needs to be another box that handles the ports so that the answer comes out only when a certain port is selected.
It is the ports box that I'm not sure how to do for the propeller. For the Z80 the io lines are already there, coming out of the Z80 box as address and iorq and /rd and /wr. For the propeller, I think someone has added Port B. Maybe that idea could be extended so there are many more virtual ports, all with different addresses. Also I'm not sure what the instruction would be. In pseudo code, it would be "move 32 bits from register i to port j"
The inputs to this box might be clock, address, data, read, write, chip select and possibly reset. Output would be 32 bit dataout.
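As a starting point for discussion, here is one way that box could look in Verilog with a full 32 bit data path. Everything in it is an assumption: the module name, the four-entry register map, and the use of a 16x16 multiply for function 2. It does not reflect how the P1 emulation actually wires up its registers.

```verilog
// Sketch only: register map and names are hypothetical.
//   addr 0 = dataa (write)   addr 1 = datab (write)
//   addr 2 = result (read)   addr 3 = function select (write)
module dsp_port_box_sketch (
    input  wire        clk,
    input  wire        reset,
    input  wire        cs,       // chip select
    input  wire        wr,
    input  wire        rd,
    input  wire [1:0]  addr,
    input  wire [31:0] datain,
    output reg  [31:0] dataout
);
    reg [31:0] dataa;
    reg [31:0] datab;
    reg [7:0]  func;

    // Stand-in for a megafunction: 16x16 unsigned multiply.
    wire [31:0] product = dataa[15:0] * datab[15:0];

    // Function multiplexer.
    reg [31:0] result;
    always @(*) begin
        case (func)
            8'd0:    result = dataa;     // loopback dataa
            8'd1:    result = datab;     // loopback datab
            8'd2:    result = product;   // multiply
            default: result = 32'd0;
        endcase
    end

    // Register writes.
    always @(posedge clk) begin
        if (reset) begin
            dataa <= 32'd0;
            datab <= 32'd0;
            func  <= 8'd0;
        end else if (cs && wr) begin
            case (addr)
                2'd0:    dataa <= datain;
                2'd1:    datab <= datain;
                2'd3:    func  <= datain[7:0];
                default: ;  // addr 2 is read-only
            endcase
        end
    end

    // Drive the answer only when this box is selected for a result read.
    always @(*) begin
        if (cs && rd && addr == 2'd2)
            dataout = result;
        else
            dataout = 32'd0;
    end
endmodule
```

With a 32 bit bus the whole interface shrinks to four addresses, so the open question is only what generates cs, rd, wr and addr on the propeller side.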
Thoughts would be most appreciated!
Comments
The virtual ports idea is interesting too. It would let things be connected at run time. For some applications it might be useful to be able to arrange the port fabric without having to recompile everything every time.
In the P1 we could do this by using fixed registers, say $1EC-$1EF for data, datab, result, and a result2 or another input.
If there were just 1 set for all cogs, then window it into hub ram/rom.
Or, use special instructions to write/read the registers.
Point is, there are ways to do it, and meanwhile we can play with the blocks.
Thanks for posting, and do keep us up to date
Ok, I spent a few hours coding and I got something working! The following is vhdl code, so apologies for that, but it should be fairly easy to translate. Much of this code is the part that translates 8 bit ports into 32 bits. But down the bottom is the part that interfaces to a megawizard function. Talking in 8 bit for the moment, there are 4 ports from 0x60 to 0x63 for dataa, 4 ports from 0x64 to 0x67 for datab, 4 ports from 0x68 to 0x6B for the result, and then port 0x6F defines what function you want to do.
I started with function 0, a simple loopback test that returns dataa. Function 1 is a loopback for datab.
But the clever bit is function 2 which does an 8x8 multiply. I started with the megafunction and created this multiply. Then I added "Multiply8x8 : entity work.Mul8x8" to tell quartus that it exists. I also checked Mul8x8 was in the project list (I added it at the end of the wizard).
Then compiled and fixed a few errors. When the compilation was finished, the design tree now correctly includes the 8x8 multiply as a branch of ports32.
Testing was done in CP/M, and while this is an ancient operating system, it has MBASIC which is interpreted and means you can test things very quickly.
I deliberately chose an 8x8 multiply as it is one of the simplest functions available. But since these just drop into the code, it is very easy to add many more. Maybe till it runs out of hardware multiplies or something.
The question for the propeller - what is the easiest way to interface with this? Do you add virtual ports? Or maybe reserve a few longs of memory (would only need four) and these are the DSP interface?
Rapid DSP functions would have to be useful for something. Real time .jpg decompression for instance.
Compilation says it is using three hardware 9x9 multiplies which is correct - one for the 8x8 multiply and two for the 16x16 multiply.
I think these IP cores are free. There is an option in the list of IP cores that says "click to open IP megastore". I think they are the ones you pay for (the IP megastore has a .jpeg decompression algorithm for instance and if I was guessing, I'd expect a multiply to be free but a jpg to cost money).
But for multiply and floating point etc I think they are free. In any case, the file is downloaded permanently to the fpga board, the jtag programmer is disconnected and it still seems to work just fine.
Found a catch though - a 16 bit divide used up almost all the resources on a cyclone II. And by comparison, a multiply seems to use less than 1%. So need to think of other ways to do a divide. Multiply by the inverse. Bit shifts. Divide by fixed numbers. Lots of mathematical tricks there.
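As an example of the "multiply by the inverse" trick for a fixed divisor, here is a small Verilog sketch that divides a 16 bit unsigned number by 10 using one 16x16 multiply and a shift. The module name is made up, and the constant only covers this one divisor; other divisors need their own constant and shift, checked over the full input range.

```verilog
// Sketch only: unsigned 16-bit divide by the constant 10.
// 52429 = ceil(2^19 / 10), so (x * 52429) >> 19 equals floor(x / 10)
// for every 16-bit x, and the product still fits in 32 bits.
module div10_sketch (
    input  wire [15:0] x,
    output wire [15:0] q    // floor(x / 10)
);
    // The 32-bit target makes this a full-width 16x16 -> 32 multiply.
    wire [31:0] prod = x * 16'd52429;

    // Taking bits 31..19 is the shift right by 19; upper bits of q are zero.
    assign q = prod[31:19];
endmodule
```

For example x = 65535 gives prod = 3435934515, and prod >> 19 = 6553. The shift is free in hardware since it is just wiring, so the cost is one multiplier rather than a full divider.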
I'm very happy with this. It is really cool to send in two 16 bit numbers and instantly get a 32 bit result.
I think he used some queuing scheme that tags the operands and results to a COG.
The same could be done in P1V, and you can also pipeline Divides to reduce the silicon.