FPGA P1 DSP hybrid

Dr_Acula · 2014-08-16 18:31

Inside Quartus is a treasure trove of DSP functions that can be dropped into an emulation as pre-written modules: hardware multiply, divide, log, sin, cosine, square root, multiple floating point functions, filters, transforms, video processing and FFT.

Of course, much of this exists on the Propeller in software, but the difference here is the speed. FPGA chips have hardware multipliers (even the humble Cyclone II has a number of these).

I'm working on creating a hybrid Z80/DSP chip in FPGA and some of the principles are going to be very similar to doing this for the propeller.

Open Quartus and go to Tools and then, about 3/4 of the way down, MegaWizard Plug-In Manager. Create new custom function. Select the language - you can choose AHDL, VHDL or Verilog. On the left side are all the functions available. Let's try something really simple - an 8x8 bit multiplier. Select Arithmetic, then LPM_MULT. Enter a test name for the file. Click Next and it asks how many bits to use. The default is 8x8 but you can use more - there are 18 bit multipliers on the Cyclone II. The wizard works out that 8x8 will create a 16 bit output. You can select signed or unsigned. You can also select whether one of the values is a constant. Select the defaults on the next screen for speed/area. Click Next a few times and click Finish.

There is a pop up box at this stage about adding this to your existing project. Click No to this. But just for reference, you can add it later with Project/Add remove files in project.

Browse to where the file was saved. The core of the file is:

module testmul2 (
	dataa,
	datab,
	result);

	input	[7:0]  dataa;
	input	[7:0]  datab;
	output	[15:0]  result;

	wire [15:0] sub_wire0;
	wire [15:0] result = sub_wire0[15:0];

	lpm_mult	lpm_mult_component (
				.dataa (dataa),
				.datab (datab),
				.result (sub_wire0),
				.aclr (1'b0),
				.clken (1'b1),
				.clock (1'b0),
				.sum (1'b0));
	defparam
		lpm_mult_component.lpm_hint = "MAXIMIZE_SPEED=5",
		lpm_mult_component.lpm_representation = "UNSIGNED",
		lpm_mult_component.lpm_type = "LPM_MULT",
		lpm_mult_component.lpm_widtha = 8,
		lpm_mult_component.lpm_widthb = 8,
		lpm_mult_component.lpm_widthp = 16;


endmodule

An 8x8 multiply is fairly simple - the code for floating point functions is more complex. But they all have a common structure, in both VHDL and Verilog, with inputs named "dataa" and "datab" and an output named "result".

So now you can plug these into existing modules.

How to do this? I'm not entirely sure for the propeller, but I'll describe what I'm doing for a Z80 and I think the process should be similar. First, the Z80 comes with 256 ports. On a real chip, the address lines select the port (technically there are actually 65536 ports but 256 ports was always enough in hardware). And there is an iorq (io request) line and there is /wr and /rd. So to output to a port, use an instruction OUT (portnumber),A and the iorq line would go low, /wr would go low and the portnumber would appear on A0-A7.

For a fpga emulation these ports can exist internally. So we can decode port numbers, decode address ranges, and latch values, all inside the chip. Well, 8 bits are a bit limiting, so the first thing is to create an internal 32 bit bus. Write out to four 8bit ports and store that value in a 32 bit latch which can be dataa. Write out to another four ports and that can be datab. Read back the result through four ports and that can be result. So now with 32 bits we can start to play around with large multiplication and division, and proper floating point maths.

Then we need a way of grouping all these functions. This can also be done through a port that selects the function. Output to an 8 bit port the number zero, and for testing purposes, dataa loops back as result. Output the number 1, and datab loops back. Output 2, and the result of dataa*datab is returned. If a hardware multiply is being used, this is pretty much instant.

So we need a series of if..then..else type statements to multiplex the result values from all of the megafunctions.

It seems crazy, but these can all happen in parallel. You might have two functions, one does 8x8 multiply, and one does 16x16 bit, and you output the data and both of these return the result at the same time. It is just that you only read back one.
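A minimal sketch of that result multiplexer might look like the following. The entity name result_mux and its port names are made up for illustration and are not from a working project; the function numbers follow the 0/1/2 scheme described above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical result multiplexer. Names are illustrative only.
entity result_mux is
  port (
    func       : in  std_logic_vector(7 downto 0);   -- selected function number
    dataa      : in  std_logic_vector(31 downto 0);
    datab      : in  std_logic_vector(31 downto 0);
    mul_result : in  std_logic_vector(31 downto 0);  -- output of a multiplier megafunction
    result     : out std_logic_vector(31 downto 0)
  );
end result_mux;

architecture rtl of result_mux is
begin
  -- Every function computes its answer all the time, in parallel.
  -- This statement only picks which answer is presented as the result.
  with func select
    result <= dataa           when x"00",   -- 0: loopback dataa
              datab           when x"01",   -- 1: loopback datab
              mul_result      when x"02",   -- 2: dataa * datab
              (others => '0') when others;  -- unassigned function numbers read as zero
end rtl;
```

Adding another megafunction then means adding one more input port and one more line in the select.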

I'm writing this in vhdl but verilog would be similar. Each function can be thought of as a box with inputs and outputs, much like a TTL chip. So the top level is the main microprocessor code, and lower down in the tree are these megafunctions, and so between these there needs to be a new box, which multiplexes all the answers. And then there needs to be another box that handles the ports so that the answer comes out only when a certain port is selected.

It is the ports box that I'm not sure how to do for the propeller. For the Z80 the io lines are already there, coming out of the Z80 box as address and iorq and /rd and /wr. For the propeller, I think someone has added Port B. Maybe that idea could be extended so there are many more virtual ports, all with different addresses. Also I'm not sure what the instruction would be. In pseudo code, it would be "move 32 bits from register i to port j"

The inputs to this box might be clock, address, data, read, write, chip select and possibly reset. Output would be 32 bit dataout.
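As a rough sketch only, such a box could be declared like this. Everything here is an assumption for discussion: the name dsp_ports, the active-high strobes, the 2-bit address map and the synchronous reset are invented, and the "DSP function" is just a loopback placeholder where real megafunctions would be multiplexed in.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical 32-bit ports box. Names and address map are assumptions.
entity dsp_ports is
  port (
    clk     : in  std_logic;
    reset   : in  std_logic;                      -- synchronous, active high
    cs      : in  std_logic;                      -- chip select, active high
    wr      : in  std_logic;                      -- write strobe, active high
    rd      : in  std_logic;                      -- read strobe, active high
    addr    : in  std_logic_vector(1 downto 0);   -- 0 = dataa, 1 = datab, 2 = function, 3 = result
    datain  : in  std_logic_vector(31 downto 0);
    dataout : out std_logic_vector(31 downto 0)
  );
end dsp_ports;

architecture rtl of dsp_ports is
  signal dataa  : std_logic_vector(31 downto 0);
  signal datab  : std_logic_vector(31 downto 0);
  signal func   : std_logic_vector(7 downto 0);
  signal result : std_logic_vector(31 downto 0);
begin
  -- Latch the operands and the function number on writes.
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        dataa <= (others => '0');
        datab <= (others => '0');
        func  <= (others => '0');
      elsif cs = '1' and wr = '1' then
        case addr is
          when "00"   => dataa <= datain;
          when "01"   => datab <= datain;
          when "10"   => func  <= datain(7 downto 0);
          when others => null;                    -- result is read-only
        end case;
      end if;
    end if;
  end process;

  -- Placeholder: function 0 loops back dataa, anything else loops back datab.
  result <= dataa when func = x"00" else datab;

  -- The answer only comes out when the result port is selected for reading.
  dataout <= result when (cs = '1' and rd = '1' and addr = "11")
             else (others => '0');
end rtl;
```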

Thoughts would be most appreciated!

Tubular · 2014-08-16 19:50

Sounds like a whole new playground, Dr A. The hardware floating point could be really useful

The virtual ports idea is interesting too. It would let things be connected at run time. For some applications it might be useful to be able to arrange the port fabric without having to recompile everything every time.

pmrobert · 2014-08-16 20:05

In Quartus II 14.0, Altera calls it "IP Catalog". Good stuff!

Cluso99 · 2014-08-16 20:22

Truly interesting, Drac.

In the P1 we could do this by using fixed registers, say $1EC-$1EF for dataa, datab, result, and a result2 or another input.
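To keep to the VHDL used elsewhere in this thread, here is a hedged sketch of what decoding that register window could look like. It only assumes that a cog register address is 9 bits wide; the entity name dsp_reg_decode and the signal names are hypothetical, and the P1 Verilog source would need its own equivalent.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical decoder for a DSP register window at $1EC-$1EF.
entity dsp_reg_decode is
  port (
    reg_addr : in  std_logic_vector(8 downto 0);  -- 9-bit cog register address
    dsp_hit  : out std_logic;                     -- '1' when the address is $1EC-$1EF
    dsp_sel  : out std_logic_vector(1 downto 0)   -- 0 = dataa, 1 = datab, 2 = result, 3 = result2
  );
end dsp_reg_decode;

architecture rtl of dsp_reg_decode is
begin
  -- $1EC is 1_1110_1100 in binary, so the top seven address bits
  -- are "1111011" for all four registers $1EC-$1EF.
  dsp_hit <= '1' when reg_addr(8 downto 2) = "1111011" else '0';
  dsp_sel <= reg_addr(1 downto 0);
end rtl;
```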

If there were just 1 set for all cogs, then window it into hub ram/rom.

Or, use special instructions to write/read the registers.

Point is, there are ways to do it, and meanwhile we can play with the blocks.

Thanks for posting, and do keep us up to date

Dr_Acula · 2014-08-17 07:16

Thanks guys - great to hear there could be a way to interface to the P1.

Ok, I spent a few hours coding and I got something working! The following is vhdl code, so apologies for that, but it should be fairly easy to translate. Much of this code is the part that translates 8 bit ports into 32 bits. But down the bottom is the part that interfaces to a megawizard function. Talking in 8 bit for the moment, there are 4 ports from 0x60 to 0x63 for dataa, 4 ports from 0x64 to 0x67 for datab, 4 ports from 0x68 to 0x6B for the result, and then port 0x6F defines what function you want to do.

I started with function 0 as a simple loopback test that returns dataa. Function 1 is a loopback for datab.

But the clever bit is function 2 which does an 8x8 multiply. I started with the megafunction and created this multiply. Then I added "Multiply8x8 : entity work.Mul8x8" to tell quartus that it exists. I also checked Mul8x8 was in the project list (I added it at the end of the wizard).

Then compiled and fixed a few errors. When the compilation was finished, the design tree now correctly includes the 8x8 multiply as a branch of ports32.

Testing was done in CP/M, and while this is an ancient operating system, it has MBASIC which is interpreted and means you can test things very quickly.

OUT (&h60),3 ' data a
OUT (&h64),4 ' data b
OUT (&h6F),2 ' function to call - multiply 8x8
PRINT INP(&h68) ' read back the low byte of the answer
  12

I deliberately chose an 8x8 multiply as it is one of the simplest functions available. But since these just drop into the code, it is very easy to add many more. Maybe till it runs out of hardware multiplies or something.

The question for the propeller - what is the easiest way to interface with this? Do you add virtual ports? Or maybe reserve a few longs of memory (would only need four) and these are the DSP interface?

Rapid DSP functions would have to be useful for something. Real time .jpg decompression for instance.

-- connect to the Altera DSP megafunction library Tools/Megawizard Plugin Manager
-- multiply, divide, sin, tan, log, fft, video processing, filters etc

-- many functions like multiply and divide use more than 8 bits
-- eg a 16x16 bit multiply with a 32 bit output would use 8 standard Z80 ports
-- with lots of DSP megafunctions would run out of Z80 ports
-- so instead, define some 32 bit ports and virtual port numbers
-- n is regAddr (one of 16 ports)
-- n,n+1,n+2, n+3 are dataa (lsb first) 0x60,61,62,63
-- n+4,n+5,n+6,n+7 are datab 0x64,65,66,67
-- n+8,n+9,n+10,n+11 are result 0x68,69,6A,6B
-- n+12,n+13,n+14 are not used
-- n+15 is the DSP function to use 0x6F


library ieee;
	use ieee.std_logic_1164.all;
	use ieee.numeric_std.all;
	use ieee.std_logic_unsigned.all;
	
entity ports32 is
	port (
	   clk : in std_logic;
		n_wr : in  std_logic;
		n_rd : in  std_logic;
	   dataIn : in std_logic_vector(7 downto 0); -- data from the cpu
		dataOut : out std_logic_vector(7 downto 0); -- data back to the CPU
		RegAddr : in std_logic_vector(3 downto 0) -- one of 16 Z80 ports
   );
	
end ports32;

architecture rtl of ports32 is

signal dataa : std_logic_vector(31 downto 0);
signal datab : std_logic_vector(31 downto 0);
signal result : std_logic_vector(31 downto 0);
signal Mul8x8result : std_logic_vector(15 downto 0); -- only needs 16 bits as 8x8 multiply
  
begin
--    IF   and CASE can only be used inside a process.
--    WHEN and WITH can only be used outside a process.
--    IF   corresponds to WHEN
--    CASE corresponds to WITH


process (n_wr,regAddr,dataIn,dataa) begin -- dataa - pass write
  if rising_edge(n_wr) then
    if      regAddr = "0000" then dataa(7 downto 0)   <= dataIn(7 downto 0);
	   elsif regAddr = "0001" then dataa(15 downto 8)  <= dataIn(7 downto 0);
      elsif regAddr = "0010" then dataa(23 downto 16) <= dataIn(7 downto 0);
	   elsif regAddr = "0011" then dataa(31 downto 24) <= dataIn(7 downto 0);
	 end if;
  end if;
end process;

process (n_wr,regAddr,dataIn,datab) begin -- datab - pass write
  if rising_edge(n_wr) then
    if      regAddr = "0100" then datab(7 downto 0)   <= dataIn(7 downto 0);
	   elsif regAddr = "0101" then datab(15 downto 8)  <= dataIn(7 downto 0);
      elsif regAddr = "0110" then datab(23 downto 16) <= dataIn(7 downto 0);
	   elsif regAddr = "0111" then datab(31 downto 24) <= dataIn(7 downto 0);
	 end if;
  end if;
end process;

process (n_rd,regAddr,result) begin -- result - pass read
  if n_rd = '0' then
    if      regAddr = "1000" then dataOut <= result(7 downto 0);
	   elsif regAddr = "1001" then dataOut <= result(15 downto 8);
      elsif regAddr = "1010" then dataOut <= result(23 downto 16);
	   elsif regAddr = "1011" then dataOut <= result(31 downto 24);
		else                        dataOut <= "00000000";
	 end if;
  end if;
end process;

process (n_wr,regAddr,dataIn) begin -- detect a write to port n+15, if happens then calculate answer
  if rising_edge(n_wr) and regAddr = "1111" then -- output a number to port 0x6F 
    if dataIn(7 downto 0) = "00000000" then -- value 0 is loopback dataa
	   result <= dataa; -- debug loopback dataa 
	 end if;
	 if dataIn(7 downto 0) = "00000001" then -- value 1 is loopback datab
	   result <= datab; -- debug loopback datab
    end if;	
	 if dataIn(7 downto 0) = "00000010" then -- value 2 is 8x8 multiply
	   result <= x"0000" & Mul8x8result; -- returns 16 bits, upper 16 bits cleared so no stale data is left
	 end if;
   
  end if;
end process;  

-- add megawizard plugin manager functions here
Multiply8x8 : entity work.Mul8x8
port map(
dataa(7 downto 0) => dataa(7 downto 0), 
datab(7 downto 0) => datab(7 downto 0), 
result(15 downto 0) => Mul8x8result(15 downto 0)
);

-- _____________

end rtl;

nutson · 2014-08-17 08:03

Some of the IP blocks shown come free only with the Quartus subscription edition; for others you need to buy a license. You can evaluate all blocks, even in the free Quartus web edition, but Quartus then makes the IP block time-limited: it stops working after some time or when the connection with the computer running Quartus is broken. This prevents you from putting a licensed IP block into a stand-alone product. I have been unable to locate a comprehensive list of which functions need a license and which are free. The only reference is in the document describing the differences between the Quartus subscription edition and the free web edition.

Dr_Acula · 2014-08-17 16:12

Thanks nutson. I'm reading through all the rules now. I'm still trying to work out how it all fits together, as the actual code to do the multiply (which is one line of vhdl, with a * in it) does not appear in the created megafunction. Instead, it refers off to an IP core, and then they can have different rules etc. Another option is OpenCores, where many of these DSP functions also exist. Either way, at least this provides a framework for interfacing to DSP functions without having to change a microcontroller's instruction set. I'll do some more experiments.

Dr_Acula · 2014-08-18 18:22

Just added a 16x16 = 32 bit multiply.

Compilation says it is using three hardware 9x9 multiplies which is correct - one for the 8x8 multiply and two for the 16x16 multiply.

I think these IP cores are free. There is an option in the list of IP cores that says "click to open IP megastore". I think they are the ones you pay for (the IP megastore has a .jpeg decompression algorithm for instance and if I was guessing, I'd expect a multiply to be free but a jpg to cost money).

But for multiply and floating point etc I think they are free. In any case, the file is downloaded permanently to the fpga board, the jtag programmer is disconnected and it still seems to work just fine.

Found a catch though - a 16 bit divide used up almost all the resources on a cyclone II. And by comparison, a multiply seems to use less than 1%. So need to think of other ways to do a divide. Multiply by the inverse. Bit shifts. Divide by fixed numbers. Lots of mathematical tricks there.

I'm very happy with this. It is really cool to send in two 16 bit numbers and instantly get a 32 bit result.
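On the divide problem, here is a minimal sketch of the "multiply by the inverse" trick for one fixed divisor. It divides a 16 bit unsigned number by 10 using a single 16x16 multiply and a shift, relying on the identity x / 10 = (x * 52429) >> 19 for 16 bit unsigned x. The entity name div_by_10 is made up, and this is plain VHDL rather than a megafunction.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical divide-by-10 using a reciprocal multiply instead of a divider.
entity div_by_10 is
  port (
    x : in  std_logic_vector(15 downto 0);  -- unsigned dividend
    q : out std_logic_vector(15 downto 0)   -- floor(x / 10)
  );
end div_by_10;

architecture rtl of div_by_10 is
  -- 52429 = ceil(2^19 / 10), the scaled reciprocal of 10
  constant RECIP : unsigned(15 downto 0) := to_unsigned(52429, 16);
  signal product : unsigned(31 downto 0);
begin
  -- 16 x 16 -> 32 bit product, which should map onto a hardware multiplier
  product <= unsigned(x) * RECIP;

  -- Dropping the low 19 bits undoes the 2^19 scaling
  q <= std_logic_vector(resize(product(31 downto 19), 16));
end rtl;
```

For example, x = 65535 gives a product of 3435934515, and shifting right by 19 gives 6553. Other constant divisors need their own reciprocal and shift, and the accuracy should be checked over the full input range.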

-- connect to the Altera DSP megafunction library Tools/Megawizard Plugin Manager
-- multiply, divide, sin, tan, log, fft, video processing, filters etc

-- many functions like multiply and divide use more than 8 bits
-- eg a 16x16 bit multiply with a 32 bit output would use 8 standard Z80 ports
-- with lots of DSP megafunctions would run out of Z80 ports
-- so instead, define some 32 bit ports and virtual port numbers
-- n is regAddr (one of 16 ports)
-- n,n+1,n+2, n+3 are dataa (lsb first) 0x60,61,62,63
-- n+4,n+5,n+6,n+7 are datab 0x64,65,66,67
-- n+8,n+9,n+10,n+11 are result 0x68,69,6A,6B
-- n+12,n+13,n+14 are not used
-- n+15 is the DSP function to use 0x6F

-- divides use a lot of fpga space - more than 80% of a cyclone II for a 16 bit divide


library ieee;
	use ieee.std_logic_1164.all;
	use ieee.numeric_std.all;
	use ieee.std_logic_unsigned.all;
	
entity ports32 is
	port (
	   clk : in std_logic;
		n_wr : in  std_logic;
		n_rd : in  std_logic;
	   dataIn : in std_logic_vector(7 downto 0); -- data from the cpu
		dataOut : out std_logic_vector(7 downto 0); -- data back to the CPU
		RegAddr : in std_logic_vector(3 downto 0) -- one of 16 Z80 ports
   );
	
end ports32;

architecture rtl of ports32 is

signal dataa : std_logic_vector(31 downto 0);
signal datab : std_logic_vector(31 downto 0);
signal result : std_logic_vector(31 downto 0);
signal Mul8x8result : std_logic_vector(15 downto 0); -- only needs 16 bits as 8x8 multiply
signal Mul16x16result :std_logic_vector(31 downto 0); 
  
begin
--    IF   and CASE can only be used inside a process.
--    WHEN and WITH can only be used outside a process.
--    IF   corresponds to WHEN
--    CASE corresponds to WITH

process (n_wr,regAddr,dataIn,dataa) begin -- dataa - pass write
  if rising_edge(n_wr) then
    if      regAddr = "0000" then dataa(7 downto 0)   <= dataIn(7 downto 0);
	   elsif regAddr = "0001" then dataa(15 downto 8)  <= dataIn(7 downto 0);
      elsif regAddr = "0010" then dataa(23 downto 16) <= dataIn(7 downto 0);
	   elsif regAddr = "0011" then dataa(31 downto 24) <= dataIn(7 downto 0);
	 end if;
  end if;
end process;

process (n_wr,regAddr,dataIn,datab) begin -- datab - pass write
  if rising_edge(n_wr) then
    if      regAddr = "0100" then datab(7 downto 0)   <= dataIn(7 downto 0);
	   elsif regAddr = "0101" then datab(15 downto 8)  <= dataIn(7 downto 0);
      elsif regAddr = "0110" then datab(23 downto 16) <= dataIn(7 downto 0);
	   elsif regAddr = "0111" then datab(31 downto 24) <= dataIn(7 downto 0);
	 end if;
  end if;
end process;

process (n_rd,regAddr,result) begin -- result - pass read
  if n_rd = '0' then
    if      regAddr = "1000" then dataOut <= result(7 downto 0);
	   elsif regAddr = "1001" then dataOut <= result(15 downto 8);
      elsif regAddr = "1010" then dataOut <= result(23 downto 16);
	   elsif regAddr = "1011" then dataOut <= result(31 downto 24);
		else                        dataOut <= "00000000";
	 end if;
  end if;
end process;

process (n_wr,regAddr,dataIn) begin -- detect a write to port n+15, if happens then calculate answer
  if rising_edge(n_wr) and regAddr = "1111" then -- output a number to port 0x6F 
    if dataIn(7 downto 0) = "00000000" then -- value 0 is loopback dataa
	   result <= dataa; -- debug loopback dataa 
	 end if;
	 if dataIn(7 downto 0) = "00000001" then -- value 1 is loopback datab
	   result <= datab; -- debug loopback datab
    end if;	
	 if dataIn(7 downto 0) = "00000010" then -- value 2 is 8x8 multiply
	   result <= x"0000" & Mul8x8result; -- returns 16 bits, upper 16 bits cleared so no stale data is left
	 end if;
    if dataIn(7 downto 0) = "00000011" then -- value 3 is 16x16 multiply
	   result (31 downto 0) <= Mul16x16result(31 downto 0);
	 end if;	
  end if;
end process;  

-- add megawizard plugin manager functions here
Multiply8x8 : entity work.Mul8x8
port map(
dataa(7 downto 0) => dataa(7 downto 0), 
datab(7 downto 0) => datab(7 downto 0), 
result(15 downto 0) => Mul8x8result(15 downto 0)
);

Multiply16x16 : entity work.Mul16x16 
port map(
dataa(15 downto 0) => dataa(15 downto 0),
datab(15 downto 0) => datab(15 downto 0),
result(31 downto 0) => Mul16x16result(31 downto 0)
);

-- _____________

end rtl;

jmg · 2014-08-18 19:08

Dr_Acula wrote: »

Found a catch though - a 16 bit divide used up almost all the resources on a cyclone II. And by comparison, a multiply seems to use less than 1%. So need to think of other ways to do a divide. Multiply by the inverse. Bit shifts. Divide by fixed numbers. Lots of mathematical tricks there.

IIRC Chip was moving some of the hungrier mathops to a common/shared resource in P2.
I think he used some queuing scheme that tags the operands and results to a COG.
The same could be done in P1V, and you can also pipeline Divides to reduce the silicon.
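One way to trade time for logic is an iterative shift-and-subtract divider that produces one quotient bit per clock, so a 16 bit divide takes 16 clocks but needs only one subtractor. The following is a sketch under that assumption, written in VHDL to match the listings above. It is not the Altera divide megafunction and not a description of the P2 scheme, and the name serial_div16 and its ports are invented.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 16 bit unsigned restoring divider, one quotient bit per clock.
-- Divide by zero is not handled specially (the quotient comes out as all ones).
entity serial_div16 is
  port (
    clk       : in  std_logic;
    start     : in  std_logic;                      -- pulse high for one clock to begin
    dividend  : in  std_logic_vector(15 downto 0);
    divisor   : in  std_logic_vector(15 downto 0);
    quotient  : out std_logic_vector(15 downto 0);  -- valid when busy = '0'
    remainder : out std_logic_vector(15 downto 0);  -- valid when busy = '0'
    busy      : out std_logic
  );
end serial_div16;

architecture rtl of serial_div16 is
  signal quo   : unsigned(15 downto 0) := (others => '0'); -- dividend shifts out, quotient shifts in
  signal rem_r : unsigned(16 downto 0) := (others => '0'); -- partial remainder with one spare bit
  signal dvsr  : unsigned(16 downto 0) := (others => '0');
  signal count : integer range 0 to 16 := 0;
begin
  process (clk)
    variable r : unsigned(16 downto 0);
  begin
    if rising_edge(clk) then
      if start = '1' then
        quo   <= unsigned(dividend);
        dvsr  <= resize(unsigned(divisor), 17);
        rem_r <= (others => '0');
        count <= 16;
      elsif count /= 0 then
        -- Shift the next dividend bit (MSB first) into the partial remainder
        r := rem_r(15 downto 0) & quo(15);
        if r >= dvsr then
          r   := r - dvsr;                 -- divisor fits: quotient bit is 1
          quo <= quo(14 downto 0) & '1';
        else
          quo <= quo(14 downto 0) & '0';   -- divisor does not fit: quotient bit is 0
        end if;
        rem_r <= r;
        count <= count - 1;
      end if;
    end if;
  end process;

  quotient  <= std_logic_vector(quo);
  remainder <= std_logic_vector(rem_r(15 downto 0));
  busy      <= '0' when count = 0 else '1';
end rtl;
```

With the port scheme from earlier in the thread, the CPU would write the operands, trigger the divide, and either wait a fixed number of clocks or poll a busy flag before reading the result.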

nutson · 2014-08-19 01:34

The multipliers are FPGA chip resources, you only need to tell the Megacore function how to configure them. A divide is implemented using multiple shifters and adders, that take a lot of LE's. I used the Megacore "rotate" function in my Soft-Cog, it took 250 LE's!! Tried to use the FFT function once, ran into the licensing problem then.
