Xilinx port started...

overclocked · 2014-08-09 15:01

I started porting the Propeller 1 tonight. Ouch, what a job.
I began with the Xilinx tool called Xport, which actually tries to convert AHDL to Verilog.
Some of it worked.
Beyond that, the syntax of several things is not compatible. Maybe because of SystemVerilog, I don't know.

1) Wires/regs are used before they are declared; Xilinx Verilog doesn't like that!
2) The syntax for multi-dimensional wires/regs differs.
3) Assigning to multi-dimensional regs differs... Haven't solved that yet.

And other stuff!
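
The three issues listed above can be illustrated in plain Verilog-2001. This is only a minimal sketch, not the P1 source: the module name, signal names and widths are made up, and the SystemVerilog form shown in the comment is an assumption about what the original style looks like.

```verilog
// Hypothetical example, not taken from the P1 sources.
module bus_example (
    input  wire        clk,
    input  wire [2:0]  sel,
    input  wire [31:0] din,
    output wire [31:0] dout
);
    // 1) Declare every wire/reg before its first use, so a tool that
    //    insists on declaration-before-use has nothing to complain about.

    // 2) SystemVerilog packed form (assumed original style):
    //        reg [7:0][31:0] bus;
    //    Verilog-2001 alternative: one flat 8 x 32 = 256-bit vector.
    reg [8*32-1:0] bus;

    // 3) Write one 32-bit element with an indexed part-select,
    //    which is part of Verilog-2001.
    always @(posedge clk)
        bus[sel*32 +: 32] <= din;

    // Read one element back the same way.
    assign dout = bus[sel*32 +: 32];
endmodule
```

Flattening keeps the element count and width visible in the declaration, at the cost of writing the `index*32 +: 32` arithmetic at every access.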

Cluso99 · 2014-08-09 15:35

Ouch!
I thought that the Verilog files would have been fine and would therefore be quite a simple job.
Please keep us posted with your progress.
When you have success, could you please post the usage info on the comparison thread? I am trying to work out which FPGA group fits the P1 best.
http://forums.parallax.com/showthread.php/156822-What-is-the-most-suitable-FPGA-type...-Cyclone-V-Cyclone-IV-Spartan-6-Spartan-3A

overclocked · 2014-08-09 16:10

Absolutely. I'll keep you posted. It already compiles, but don't get your hopes up too high. A lot of changes have been made, and there is a very big chance that I introduced bugs with all the cut/paste. It will probably all need debugging to get going later.

And although it compiles in the first synthesis step, it still reports 0 usage. I think I just need to implement a new PLL, Xilinx style.
I'm taking a break now (it's night in Sweden) and may get some time tomorrow to at least see the size of it. I think there are several days of work for me to get anything going, even in simulation.

I should probably start by getting something going on my BeMicroSDK, which is much lower-hanging fruit!

David Betz · 2014-08-09 16:13

With all of the changes you've had to do, do you think there is a chance of coming up with a common code base that can be built for both the Altera and Xilinx chips?

overclocked · 2014-08-09 16:29

The best bet for that would be to start by disabling SystemVerilog in Quartus from the beginning. Then do a full test run of the changed code on real hardware and take it from there. I think it should be doable. I've done similar stuff before, but this was a little more work than I initially thought.

Cluso99 · 2014-08-09 19:52

I also downloaded Vivado today. But I'm going to use Quartus first.

overclocked · 2014-08-09 22:17

I can't use Vivado for any of my boards; they are too old. I only have Spartan-3 family boards. After these came Spartan-6, and even those can't be handled in Vivado. It only supports the latest 7 series, like Artix, Kintex and so on. A pity, because from what I've read Vivado is a much better tool. I'll keep hammering away with ISE.

pik33 · 2014-08-09 22:45

overclocked wrote: »

The best bet for that would be to start by disabling SystemVerilog in Quartus from the beginning.

This thing needs SystemVerilog enabled in Quartus to compile, or you get an error at the for statement that generates the 8 cogs.

overclocked · 2014-08-09 22:48

I think that needs to change before it can be portable to other free tools.

overclocked · 2014-08-09 22:58

OK, short update here:

As a next step I've moved down from the top level, and by using just the core-dig as the top level it actually compiles.
DISCLAIMER NOTE: I have no idea if this works!

The core-dig level includes the HUB and COGs but lacks the PLL and the clkgen/tim logic.
Setting clkgen as the top level adds another 12 slices, so that is probably a good guess for the overhead of that part; it won't be anything measurable.
The PLL is just a DCM in Xilinx language, and that's a non-logic part that just needs to be hooked up.

Remember that this is without connecting to actual pins, so the logic usage is probably about right for the dig level, but the performance can shift when starting to route to specific pins on specific packages.

I have the bigger Digilent MicroBlaze Starter Kit, which includes a Spartan-3E 1600E, so this is the target below.

FULL TARGET:

topconv Project Status (08/10/2014 - 07:34:12)

Project File:          Xilinx.xise
Parser Errors:         No Errors
Module Name:           dig
Implementation State:  Placed and Routed
Target Device:         xc3s1600e-4fg320
Errors:                No Errors
Product Version:       ISE 14.5
Warnings:              914 Warnings (320 new)
Design Goal:           Balanced
Routing Results:       All Signals Completely Routed
Design Strategy:       Xilinx Default (unlocked)
Timing Constraints:    All Constraints Met
Environment:           System Settings
Final Timing Score:    0

Device Utilization Summary

Logic Utilization                                    Used  Available  Utilization
Number of Slice Flip Flops                          4,913     29,504          16%
Number of 4 input LUTs                             11,081     29,504          37%
Number of occupied Slices                           7,139     14,752          48%
  Number of Slices containing only related logic    7,139      7,139         100%
  Number of Slices containing unrelated logic           0      7,139           0%
Total Number of 4 input LUTs                       11,233     29,504          38%
  Number used as logic                             11,081
  Number used as a route-thru                         152
Number of bonded IOBs                                 115        250          46%
Number of RAMB16s                                      24         36          66%
Number of BUFGMUXs                                     10         24          41%
Average Fanout of Non-Clock Nets                     3.52

Performance Summary

Final Timing Score:    0 (Setup: 0, Hold: 0)
Routing Results:       All Signals Completely Routed
Timing Constraints:    All Constraints Met

Detailed Reports

Synthesis Report:      Current

overclocked · 2014-08-09 23:38

pik33 wrote: »

This thing needs SystemVerilog enabled in Quartus to compile, or you get an error at the for statement that generates the 8 cogs.

Yeah, actually I handled that part by just replacing:

for (i=0; i<8; i++)

with

for (i=0; i<8; i=i+1)

The generate construct is part of Verilog-2001.
As a parallel project I have now started to look at Quartus with SystemVerilog disabled.
The next error is the multi-dimensional array problem, where the declarations need to be restated in Verilog-2001 syntax. So I think there is a good possibility of getting a working Verilog-2001 version for both Xilinx and Altera. But it will take some time, and I'm not sure that I will make it.
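
For reference, a Verilog-2001 generate loop with the `i=i+1` step might look like the sketch below. It is self-contained but purely illustrative: `cog_stub`, its `ID` parameter and the flattened `ids` bus are hypothetical stand-ins, not the real cog module or its ports.

```verilog
// Hypothetical stand-in for one cog; the real cog module has many more ports.
module cog_stub #(
    parameter [2:0] ID = 3'd0
) (
    input  wire       clk,
    output reg  [2:0] id_q
);
    always @(posedge clk)
        id_q <= ID;
endmodule

module cogs_example (
    input  wire        clk,
    output wire [23:0] ids              // 8 cogs x 3 bits, flattened
);
    genvar i;
    generate
        // Verilog-2001 has no ++ operator, so the step is written i = i + 1.
        // The named block (cog_gen) is required for a generate-for loop.
        for (i = 0; i < 8; i = i + 1) begin : cog_gen
            cog_stub #(.ID(i)) u_cog (
                .clk  (clk),
                .id_q (ids[i*3 +: 3])
            );
        end
    endgenerate
endmodule
```

Each instance ends up with a hierarchical name such as `cog_gen[3].u_cog`, which is handy when writing per-cog placement constraints.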

Things that can make it a little bit hard are:
1) No Propeller knowledge and no Propeller hardware.
2) No Prop programmer, so for now I have no idea how to load a program into the FPGA once it's working. I probably need the USB-UART cable that someone mentioned.
3) No DE0-Nano, BUT a BeMicroSDK.
4) Not sure how to hook up hardware to the FPGA Prop to actually test what I'm building.

Well, I just have to take one step at a time.

Cluso99 · 2014-08-10 01:02

oooh! Not handling i++ is a surprise.

There's a nice lot of info there. Thanks for posting.

BTW It's a bit late, but welcome to the forums.

overclocked · 2014-08-10 06:06

Thanks for the welcome!!

I think I retract my statement about Spartan-3 being a viable platform for Propeller 1.
It may be possible, but for now it seems hard to get the COG to actually run at 80 MHz.

With strict timing constraints and some 10 minutes of compiling I've got the 8-COG design up to 65 MHz.

A single COG and some simple floor-planning gave 70 MHz+. That design took only 7% of my 1600E. Nice!

Before I really turn Spartan-3 down, I must create the DCM/PLL so I can connect real pins. Will do that now.

Probably the Spartan-3 is just too old to be able to run the COGs at 80 MHz. Spartan-6 is my next try, but sorry to say I don't own any boards with LX parts or other Spartan-6.

UPDATE: After creating the DCM/PLL and doing even more floor-planning, everything seems to be OK.

For anyone interested in floor-planning, see below how it looks when viewing the Spartan-3E 1600E chip from above. Each COG gets its own corner, with the HUB in the middle. I don't know if this is optimal, but it saves TONS of compile time and seems to work great.

Tubular · 2014-08-10 18:08

Thanks for posting all this, Overclocked. It's really interesting seeing how a familiar design spreads out like that. Chip posted a similar 'pizza diagram' for an earlier P2, once, with 8 roughly even slices, but a bit of variability too.

In your report some cogs take 303 and some 317 slices. It'll be interesting to see whether there is a difference in performance between them.

jmg · 2014-08-10 19:45

overclocked wrote: »

UPDATE: After creating the DCM/PLL and doing even more floor-planning, everything seems to be OK.

For anyone interested in floor-planning, see below how it looks when viewing the Spartan-3E 1600E chip from above. Each COG gets its own corner, with the HUB in the middle. I don't know if this is optimal, but it saves TONS of compile time and seems to work great.

Can you expand on "seems to work great"? Does that mean you get 80 MHz?
How much does floor planning (and what amount of it) improve compile times?

overclocked · 2014-08-11 01:54

jmg wrote: »

Can you expand on "seems to work great"? Does that mean you get 80 MHz?
How much does floor planning (and what amount of it) improve compile times?

If I remember correctly, a full compile took maybe 10 minutes, and adding constraints (both clocks specified, plus the floor-plan) brought the compile time down to maybe 3-4 minutes. These are interim numbers from my experiments, but I've seen similar numbers before. By helping the router out, you can shorten the compile time a lot.

SOURCE INCOMPLETE
Yeah. I'll start with a disclaimer. Up until now, NOT all of the source code has been included in the build. There are several parts of the circuit that I commented out with a TODO notice to get it to build initially, so don't take anything as final until it actually starts and runs Propeller code in a Xilinx chip. I started from the DE0-Nano source code, so the CharRom was commented out.
From now on I think that everything in that project is included... But I could still be missing something.

PLL CREATED
I've created a Xilinx DCM via CoreGen that hopefully matches the Altera counterpart. I don't understand why the original has several clock outputs, but the project only seems to use one, so that's what I've created.
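
As a rough idea of what such a clock block boils down to, here is a hand-instantiated `DCM_SP` (the Xilinx library primitive on Spartan-3E) instead of a CoreGen wrapper. Treat it as a sketch under assumptions: the 160 MHz target, the 16/5 ratio from a 50 MHz input, and all module and signal names are chosen for illustration and are not taken from the poster's project. It needs the Xilinx primitive library to build.

```verilog
// Sketch only: DCM_SP and BUFG are Xilinx library primitives.
// Names and the 50 MHz -> 160 MHz ratio are assumptions for illustration.
module clk_160_example (
    input  wire clk_50,     // board oscillator
    input  wire rst,
    output wire clk_160,    // 50 MHz * 16 / 5 = 160 MHz
    output wire locked
);
    wire clk0, clk0_buf, clkfx;

    DCM_SP #(
        .CLKIN_PERIOD   (20.0),   // ns, i.e. 50 MHz in
        .CLK_FEEDBACK   ("1X"),
        .CLKFX_MULTIPLY (16),
        .CLKFX_DIVIDE   (5)
    ) dcm_i (
        .CLKIN    (clk_50),
        .CLKFB    (clk0_buf),
        .RST      (rst),
        .PSEN     (1'b0),         // phase shifting unused
        .PSCLK    (1'b0),
        .PSINCDEC (1'b0),
        .CLK0     (clk0),
        .CLKFX    (clkfx),
        .LOCKED   (locked)
    );

    // Global buffers for the feedback path and the synthesised clock.
    BUFG fb_buf  (.I(clk0),  .O(clk0_buf));
    BUFG out_buf (.I(clkfx), .O(clk_160));
endmodule
```

Only one output clock is produced here, in line with the observation that the project seems to use just one.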

MIF INIT FILE CONVERTED
Because the init syntax failed, the other ROM (interpreter + others) was also replaced by a static =0000 by the compiler. I noticed the comment from the compiler the other night. I've converted both HEX files into a format that can be used to initialize memory in Verilog in the Xilinx environment, which needed a different syntax both in the Verilog source and in the actual hex/text file. This way the memory doesn't get optimized away anymore.
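
One common way to get an initialized ROM in Verilog is `$readmemh`. The sketch below assumes that approach; the post does not say exactly which syntax was used. The file name `rom_init.hex`, the module name and the 4096-long depth are placeholders.

```verilog
// Sketch of a ROM initialised from a text file with $readmemh.
// "rom_init.hex" is a placeholder name; the file is assumed to hold one
// 32-bit hex word per line (e.g. 0000ABCD), with no MIF or Intel-HEX framing.
module rom_example #(
    parameter ADDR_BITS = 12            // 4096 longs, chosen only for illustration
) (
    input  wire                 clk,
    input  wire [ADDR_BITS-1:0] addr,
    output reg  [31:0]          data
);
    reg [31:0] mem [0:(1<<ADDR_BITS)-1];

    // Loads the array contents when the design is elaborated or simulated.
    initial $readmemh("rom_init.hex", mem);

    // Registered read, the form that normally maps onto block RAM.
    always @(posedge clk)
        data <= mem[addr];
endmodule
```

If the file cannot be found or parsed, the array keeps no useful contents, and the synthesis log is the place to look for the kind of warning mentioned in the post.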

ROUTED PINS
I've added pin constraints to the design to match my MicroBlaze Starter Kit, with the 50 MHz clock and the I/Os going to suitable places. Specific I/Os will probably have to be reconfigured later to match different Propeller designs, such as VGA, sound, joystick and other external connections.

CURRENT RESULT
With everything above in place, the compilation seems to work much better, even with floor-planning disabled. The floor-plan was mostly useful as guidance when all other guidance was missing. The floor-plan was also outgrown when I chose to add the CharRom.

FMAX
From what I can see now, it reports meeting all timing with the default 80/160 MHz clocks without problems.

RESOURCES
The size reported now is 80%+ of the 1600E chip (10,000+ slices). The reason it became so much bigger is that it runs out of BRAMs and uses logic as RAM to fit all the memory (8x COG RAM + CharRom + interpreter ROM).

overclocked · 2014-08-11 13:27

Short update:

Using the Spartan-3 500E Starter Kit as a base, the following Propeller can possibly fit:

`define HUB_MEM_SIZE 4096  // default is 8192
`define NUMBER_OF_COGS 4  // default is 8
`define COG_RAM_SIZE 256  // default is 512

Is this still a usable Propeller, or just a bastard that won't work for anything?
My guess is that everything that fits within the spec will work? Or are there other things that will break here?
Is there any way in the Propeller hardware to tell the spec of the current chip before trying to load stuff? Like a HW register that one can check early in the program?

Cluso99 · 2014-08-11 13:56

You really require the whole 2KB of cog ram. The registers are in $1f0-1ff.

Bill Henning · 2014-08-11 14:22

If you want to run Spin or propgcc programs, you need at least one of the cogs to have 512 longs of memory.

overclocked · 2014-08-11 14:53

Cluso99 wrote: »

You really require the whole 2KB of cog ram. The registers are in $1f0-1ff.

OK, thanks for the info! Good to know.
A variant would be to map just those 32*16 bits against LUT-RAM and leave a hole @ 0x100-0x1ef. But like I said, that's a bastard that may be usable for specific apps that fit, and probably not worth the trouble.

But a good hack would be a conditional define that gets those registers mapped to the correct address while leaving one free to use different memory sizes.
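
A minimal sketch of that idea, assuming the special registers stay fixed at $1F0-$1FF as stated earlier in the thread, while the RAM size comes from a define. The module and signal names are hypothetical and this is not how the original cog code is structured.

```verilog
// Hypothetical address decode, not from the P1 sources.
`define COG_RAM_SIZE 256                // longs of real RAM; 512 in a full cog

module cog_decode_example (
    input  wire [8:0] a,                // 9-bit cog address, $000-$1FF
    output wire       sel_sfr,          // special registers, always at $1F0-$1FF
    output wire       sel_ram,          // address backed by real RAM
    output wire       sel_hole          // unimplemented when RAM is reduced
);
    assign sel_sfr  = (a >= 9'h1F0);
    assign sel_ram  = !sel_sfr && (a < `COG_RAM_SIZE);
    assign sel_hole = !sel_sfr && !sel_ram;
endmodule
```

With `COG_RAM_SIZE` set to 256 the hole covers $100-$1EF; with 512 the hole disappears and the decode matches a full-size cog.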

jmg · 2014-08-11 15:25

There is some merit in dynamic sizing, and if doing that, I'd allow each COG size to vary too.

Then the front end tools can compile and report HUB Size, COG count and COG sizes to a small config file, and then that file can be used to BUILD just the right sized FPGA image.

Usually users would round up and change this rarely, but there are CPLD targets where fewer COGs and less RAM may still give a workable system, e.g. Lattice MachXO2/XO3/iCE40, and I'm not sure what Altera MAX 10 looks like.

One combination that would have common-tools appeal, would be a Genuine Prop1 Chip, and a 1-2COG CPLD, with focused peripherals. The total system works like a 9-10 COG design, but with tight user control of the 'special use' COGs. Most OBEX and generic code would go into the Prop1, and the special stuff into P1v.

Anyone seen early release info on the Altera MAX 10 resource ?

overclocked · 2014-08-11 16:13

OK, I'll keep on posting my small experiments here. Just let me know if you get bored:

Spartan-6 LX9 good fit:

- CharRom Disabled

`define HUB_MEM_SIZE 8192 //8192
`define NUMBER_OF_COGS 4 //8
`define COG_RAM_SIZE 512 //512

So full size memories for HUB/COG but half the COG count.
We're talking 99% slice usage, so for anything more to fit, something else must be sacrificed.

But the clock seems to go high: I gave it a 300 MHz clock in and it reported 360 MHz as max. But the actual clock resource does not like to clock from 50 => 320 MHz. Some type of DRC violation.

But maybe 4 COGs at 150 MHz should be possible in an LX9.

I wouldn't buy an LX9 based on this theoretical discussion, but it could be a good cheap buy.

jmg · 2014-08-11 16:23

overclocked wrote: »

`define HUB_MEM_SIZE 8192 //8192
`define NUMBER_OF_COGS 4 //8
`define COG_RAM_SIZE 512 //512

So full size memories for HUB/COG but half the COG count.
We're talking 99% slice usage, so for anything more to fit, something else must be sacrificed.

What about a combination like

`define HUB_MEM_SIZE 14848 // 8192 ( I think 14848 fills a 3 COG LX9 ?)
`define NUMBER_OF_COGS 3 // 8
`define COG_RAM_SIZE 512 // 512

to see how much resource is left for the user.... ?

rogloh · 2014-08-11 18:06

@overclocked,
Thanks, this is all very useful information for those people like me without boards/tools yet and who are still deciding on suitable platforms but don't have good data on how it would all fit. It would be interesting to see how the P1 resource usage is on the LX25 Spartan 6 variant as well, as used on one of these guys http://www.xess.com/shop/product/xula2-lx25/. Any chance of a compile attempt on that?
Cheers,
rogloh

Cluso99 · 2014-08-11 18:57

Register/RAM usage seems to be the main sticking point on these smaller FPGAs.
But where logic is a problem, the VGA section could be omitted except for say one cog. Not sure how much that saves as I haven't checked.

overclocked · 2014-08-11 22:27

It is not possible to go beyond a 14-bit address bus without changing several parts of the original code. That is not interesting until we see that the original code actually executes Propeller code correctly.

But taking your idea of 3 COGs, the LX9 does seem to be a good fit for this:

`define HUB_MEM_SIZE 8192 // 8192
`define NUMBER_OF_COGS 3 // 8
`define COG_RAM_SIZE 512 // 512

And with the added bonus that the CharROM can be included.
The resource usage result is approx:
96% 1385/1430 slices
100% RAMB16BWER 32/32

overclocked · 2014-08-11 23:14

I think I've got a correct setup for the XuLA2 board, but it needs to be checked again.
The special thing about this board is that it only has a 12 MHz clock. So to get close to 160 MHz you need a combined PLL+DCM solution for the clock. I think it landed on 162 MHz => 81 MHz COG.

Fmax seems to be 300 MHz for the PLL.

`define NUMBER_OF_COGS 8 //8
`define HUB_MEM_SIZE 8192 //8192
`define COG_RAM_SIZE 512 //512

Both ROMs enabled, I think.

71% Slices used
76% S6-BRAM used (40/52)

Seems like a good fit if it works.

rogloh · 2014-08-11 23:45

This is good news indeed. Thanks for checking it out for me. There appears to still be a bit of slice resource left, hopefully enough for a SDRAM memory controller. Looks like quite a nice fit for an embeddable/breadboardable prop with gobs of SDRAM fitted and you get the SD card reader built in too to hold programs/data. This could make a nice 16/32 bit logic analyzer platform as it brings out a clockIO line too - maybe one could coax it to talk to a host via USB, or just sacrifice a few pins for VGA.

David Betz · 2014-08-12 04:02

Cool! I think I may have one of those boards somewhere. Do you have a bitfile I can try?

Edit: Never mind. I have the XuLA-200 which is much smaller. I wonder if one or two COGs might fit?

overclocked · 2014-08-12 07:44

David Betz wrote: »

Cool! I think I may have one of those boards somewhere. Do you have a bitfile I can try?

Edit: Never mind. I have the XuLA-200 which is much smaller. I wonder if one or two COGs might fit?

OK, let's try it.
The board you mention hosts a Spartan-3A 200K, which includes:
4,032 LC
288 Kbit BRAM => 36 KB memory => 16 BRAM resources

I've tried to sum up what a Xilinx FPGA needs in order to host a Propeller 1 of any type:

Full P1 demands:
- 8x 512-long COG RAM (16 KB)
- Both Interpreter and Char ROMs enabled (32 KB)
- HUB RAM memory (32 KB)

- 80 KB (640 kbit) total memory demand

XILINX:
8k-10k LC => about 3500 S3 slices and about 1250 S6 slices
36 S3/S6 BRAM resources

Minimum P1 demands:
- Single 512-long COG RAM (2 KB)
- CharRom disabled, Interpreter ROM enabled (16 KB)
- HUB RAM memory (32 KB)

- 50 KB (400 kbit) total memory demand

XILINX:
8000-(7*310) = 5800 LC
23 S3/S6 BRAM resources

Because the mapper is smart, it can use uncommitted logic as RAM instead, but that is expensive: it takes a lot of logic for small memories.
As an example, if I remember correctly, one COG's logic takes about 700 LCs, and the 2 KB COG memory (if built from logic) takes about 2000 LCs!

But in the case of the Spartan-3/3A 200K FPGA, both BRAM and logic resources fall short even for the minimal Propeller 1 core:
5800 LC needed vs 4032 LC available (approx. 143% full)
23 BRAM needed vs 16 BRAM available (approx. 143% full)

So it's not worth using as a Propeller 1 platform, not even if it were possible to use external RAM.

Most Spartan-3/3E/3A parts are tight fits for the full Propeller 1 core. I own the S3E-1600E and even that is quite a tight fit on BRAM resources.
Maybe parts like the Spartan-3 1000K/1500K/2000K/5000K can work, but only from 2000K upwards do the internal BRAM resources match the demands, so on the 1000K/1500K logic will have to be used as RAM, which is non-optimal.
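
The BRAM counts above can be reproduced with a few lines of constant arithmetic. This simulation-only sketch assumes each block RAM is counted as 18 Kbit, which is what the 36 and 23 figures in the post appear to be based on; the module and parameter names are made up.

```verilog
// Simulation-only arithmetic check; nothing here is meant for synthesis.
module p1_bram_budget;
    localparam BRAM_BITS = 18 * 1024;   // one block, counted as 18 Kbit (assumption)

    // Full P1: 8 cogs x 2 KB, 32 KB hub RAM, 32 KB ROM (interpreter + char).
    localparam FULL_BITS = (8*2048 + 32768 + 32768) * 8;
    // Minimum P1: 1 cog x 2 KB, 32 KB hub RAM, 16 KB interpreter ROM.
    localparam MIN_BITS  = (1*2048 + 32768 + 16384) * 8;

    // Ceiling division gives the number of blocks needed.
    localparam FULL_BRAMS = (FULL_BITS + BRAM_BITS - 1) / BRAM_BITS;
    localparam MIN_BRAMS  = (MIN_BITS  + BRAM_BITS - 1) / BRAM_BITS;

    initial begin
        $display("full P1:    %0d bits -> %0d BRAMs", FULL_BITS, FULL_BRAMS);
        $display("minimum P1: %0d bits -> %0d BRAMs", MIN_BITS,  MIN_BRAMS);
        // 16 is the BRAM count quoted above for the Spartan-3A 200K part.
        $display("XuLA-200 BRAM load: %0d%%", (MIN_BRAMS * 100) / 16);
    end
endmodule
```

Run in a simulator, this works out to 36 blocks for the full configuration and 23 for the minimum one, with the 200K part at 143% of its BRAM, in line with the numbers in the post.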

David Betz · 2014-08-12 07:49

Thanks for the analysis. I guess I can forget about using my XuLA-200 for P1 work, but fortunately that isn't what I bought it for. Now if I can only remember what I *did* buy it for... :-)
