P1V for Digilent Arty FPGA board (Xilinx Artix-7)

jac_goudsmit Posts: 348
edited November 2016 in Propeller 1
So a few weeks ago I had a chance to go to the Hackaday.com SuperConference in Pasadena CA, and I was one of the 40(?) lucky ones who did the FPGA workshop, by Sam Bobrowicz of Digilent Inc.

All the participants in the workshop got to take home an Arty FPGA board, and a new Multi-Touch Display Shield with MicroSD adapter that is not available yet, as far as I can tell. The price for the workshop was probably about half of the value of this hardware. Thanks Digilent, Hackaday and sponsors!

The Arty is a great FPGA board for makers, with a Xilinx Artix-7 35T FPGA. It has 4 "PMOD" connectors with 8 signal pins each (PMOD is the standard that Digilent uses to make peripherals for this kind of board), and it also has a shield header for Arduino and ChipKit. There is Flash ROM on board and DDR3 RAM but unfortunately no I2C EEPROM.

As you may know, I've been maintaining the Propeller Hardware source code on Github (https://github.com/jacgoudsmit/P1V) and I thought this would be an excellent opportunity to see if I could port the P1V sources to the Xilinx tools again. When @Mindrobots and @Heater were working on our initial Github repo of the Propeller HDL sources ("P8X32A_Emulation"), we got some great submissions from Magnus Karlsson and others who ported the sources to the ISE tool that was in use for Xilinx targets back then. But this code hasn't been ported to the new "P1V" repo (which is the one that's based on the official Parallax Github release) until now.

Xilinx now has a new tool called Vivado, which (as I understand it) integrates the steps of their development process better than the previous tools. I can't confirm that because I never used the older tools, but thanks to the Digilent workshop, I knew enough to be "dangerous" with Vivado. I imported the ISE project for the Pipistrello from the old P8X32A_Emulation repo, and it looks like it builds just fine. However, that source code is very different from the current HDL sources in the head revision of the Release branch of P1V, because we basically started over and Magnus made a lot of changes to the code to get everything to compile in ISE. So I haven't integrated Magnus' code into P1V just yet. There are other problems too: for example, the way he implemented hub memory is super straightforward but can't be used for the DE0-Nano because it doesn't have enough internal memory. But that's a whole different story.

I started a new Vivado project based on the current HDL sources in the head revision of my development branch, with the Arty as a target. It took me a day or so to figure out some minor problems, which were basically me getting used to Vivado. For example, in the Altera (now Intel) Quartus software, I ignored pretty much all warnings, but I couldn't get Vivado to compile the source correctly without fixing all of them. Most warnings were caused by wires and regs being used before they were declared; apparently this is an important thing to get right in Vivado. (I saw this afternoon that Chip has done some work recently in the P1V source code to move some declarations around, so I guess he's aware of the problem too :-)

I eventually got the P1V source to compile for the Arty, but the bad news is that this FPGA is only big enough for 2 cogs.

[Image: Screenshot 2016-11-21 23.12.04.png]

The above is an image of the Artix-7 with a 2-cog P1V. The blue areas are the gates that are in use, and the green lines show how the I/O pins are connected. I guess I'm too spoiled with my Arrow BeMicroCV and BeMicroCV-A9 having plenty of space for the full Propeller with room to spare :-)

Anyway, I thought I'd share with you guys. This hasn't been checked in to my Github repo yet at this time; I need to do some housekeeping before I'll merge it, probably in the next few days. I'll keep you informed.

Thanks for reading!

===Jac
Rancho Cucamonga, CA

Comments

  • I eventually got the P1V source to compile for the Arty, but the bad news is that this FPGA is only big enough for 2 cogs.

    Good to hear.
    How much spare is there, after those 2 COGs fit ?
    Any MHz indications on this FPGA yet ?



  • Something must be wrong if you get only 2 Cogs.
    From the Arty Description:
    Xilinx Artix-35T FPGA:
     - 33,280 logic cells in 5200 slices (each slice contains four 6-input LUTs and 8 flip-flops);
     - 1,800 Kbits of fast block RAM;
     ...
    
    These should allow 10..15 Cogs with >180 kByte HubRam.

    Maybe the RAM got implemented with LUTs instead of BlockRams?

    Andy
  • Nice work jac.
    My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBlade
    Prop OS (also see Sphinx, PropDos, PropCmd, Spinix)
    Website: www.clusos.com
    Prop Tools (Index) , Emulators (Index) , ZiCog (Z80)
  • jmg wrote: »
    How much spare is there, after those 2 COGs fit ?
    Any MHz indications on this FPGA yet ?

    I don't remember the exact numbers but I think with 2 cogs, it uses about 80% of this particular FPGA.

    The specs for the Artix-7 say that it's supposed to be able to do 450+ MHz internally; I don't know if that's possible with the P1V but if it is, the result would be a Propeller emulator at 225MHz (the internal clock runs twice the speed of the simulated frequency feeding CNT).
    Ariba wrote: »
    Something must be wrong if you get only 2 Cogs.

    ...

    These should allow 10..15 Cogs with >180 kByte HubRam.

    Maybe the RAM got implemented with LUTs instead of BlockRams?

    I checked, and it looks like it uses block RAMs (it does give warnings about block RAMs being set/reset by an asynchronous signal but I don't think those are severe).

    Yes, there is significant RAM left over and I'm not even using the on-board DDR3 chip. But the number of LUTs in use is pretty high; when I tried to do 8 cogs I think it said it needed something like 245% of the LUTs.

    Once I clean it up, I'll put it online and you (or someone else) can look into what can be improved to fit more cogs; I tried removing the video modules and that wasn't enough.

    I hope to update the repo tonight.

    ===Jac

    Rancho Cucamonga, CA
  • andrewsi Posts: 59
    edited September 1
    I found your problem by comparing Vivado synthesis results from this version with one from my much older fork. Your version is causing the synthesizer to treat every cog memory bit as a flip-flop register rather than just a bit in a block RAM. The explosion of flops required overflows the available resources in the chip.

    Somewhere along the line, this line in cog_ram.v:
    reg [31:0] r [511:0];
    
    got changed to:
    reg [511:0] [31:0] r;
    

    If you use the first version, the synthesizer will correctly synthesize the cog RAM using block RAM, and many more cogs will fit in the chip. I have a Nexys4 with a 100T version of the Artix-7, and it goes from not being able to fit 8 cogs at all to only using a fraction of the available resources. See how your Arty does after making this change. I believe the second version is interpreted as a 2-D matrix of single-bit registers, rather than a 1-D array of 32-bit registers.
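
    For anyone reading along, here is a minimal sketch of the first declaration style inside a RAM template that synthesizers generally recognize as block RAM. The module name cog_ram_sketch and the port names are assumptions for illustration, not a copy of the actual cog_ram.v. The parts that matter are the unpacked "reg [31:0] r [511:0]" declaration and the fact that both the write and the read happen on the clock edge.

    ```verilog
    // Hypothetical illustration; not the actual P1V cog_ram.v.
    // 512 words x 32 bits = 16,384 bits per cog.
    module cog_ram_sketch
    (
        input  wire        clk,   // cog clock
        input  wire        ena,   // memory enable
        input  wire        w,     // write strobe
        input  wire [8:0]  a,     // word address, 0..511
        input  wire [31:0] d,     // write data
        output reg  [31:0] q      // registered read data
    );

        // Unpacked array: 512 elements, each 32 bits wide.
        reg [31:0] r [511:0];

        always @(posedge clk)
        begin
            if (ena && w)
                r[a] <= d;        // synchronous write
            if (ena)
                q <= r[a];        // synchronous read (old data on a write)
        end

    endmodule
    ```

    At 16,384 bits per cog, 8 cogs would need 131,072 flip-flops if every memory bit became a register. That is far more than the 5200 slices x 8 flip-flops = 41,600 in the Arty description quoted earlier in the thread.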
  • jac_goudsmit Posts: 348
    edited September 1
    Hey thanks, @andrewsi!

    I'll check this out later today, otherwise this weekend.

    Could you post a pull request for your Nexys4 version? I'd be happy to put that into the repository too.

    === Jac
    Rancho Cucamonga, CA
  • As soon as I've verified it's working fully I'll send a pull request. I haven't looked into how the PLLs are implemented in the current version, though. In my older version I think that I had fully implemented/constrained the per-cog PLLs to be asynchronous from each other and from the rest of the design that relies on the 160 MHz clock. Any light you want to shed on that topic?
  • I went ahead and provided a pull request for just the cog ram fix so as not to hold things up.
  • That Arty board looks like a very attractive deal. Can we use that with free tools?

    It's great to see such progress going on with the P1V since I put up my P1V repo. Which sadly I have not had time to manage since. Jac is doing a great job there.
  • andrewsi wrote: »
    I went ahead and provided a pull request for just the cog ram fix so as not to hold things up.

    Thanks! That will help me give you credit for the fix.

    ===Jac
    Rancho Cucamonga, CA
  • jac_goudsmit Posts: 348
    edited September 1
    Heater. wrote: »
    That Arty board looks like a very attractive deal. Can we use that with free tools?

    (Edit) Yes, it can be used with Vivado WebPACK which is free, and available for Windows (x64) and Linux.

    Personally, though I have limited experience so far, I think I like Vivado better than Quartus (the tool for Altera/Intel FPGAs, also free by the way). It takes significantly less time in Vivado to generate a P1V image (on my system about 5 times(!) as fast as compiling on Quartus), and Vivado is better at hinting at what the next step is that you might want to take.

    Also, I think I like the Arty better than the Altera boards because it's much more tuned to real-world hardware applications: it has Arduino headers and "PMod" headers for which you can get hardware at Digilent. The Altera boards are more generic, clearly intended to be used by e.g. students who are learning to program an FPGA, not by people who actually intend to use one in a hardware project.
    It's great to see such progress going on with the P1V since I put up my P1V repo. Which sadly I have not had time to manage since. Jac is doing a great job there.

    Even though not much work has happened over the last 2 or 3 years or so, I still support it. If more people (and especially Parallax) would pay attention to the project and contribute code, adding new features would be higher up on my priority list.

    And thanks for your work too, Heater!

    ===Jac
    Rancho Cucamonga, CA
  • jac_goudsmit Posts: 348
    edited September 1
    Well, that looks much better: 40% in use. And this picture is after I changed the number of cogs back from 2 to 8.

    I will test this, and merge and upload the code shortly to my Github repo. I'll also spread the word at Digilent.

    Thank you very much, Andy!

    [Image: Screenshot 2017-09-01 14.30.26.png]

    ===Jac
    Rancho Cucamonga, CA
  • You're very welcome. A quick test of my Nexys4 version seems to be working fine with the Serial Terminal demo, I'll try some VGA output next. I see that most of the code that I previously had in the top module for emulating the PLL and config has moved down into tim.v, so I was able to remove all of that. My version happens to instantiate the MMCM rather than a PLL for overall clocking (at Xilinx' recommendation), but other than that there's not too much difference from your Arty code.

    The one thing I would love to figure out how to fix is the 40ish DRC warnings that are produced regarding signals with asynchronous resets that are connected to Block RAM. After inspecting the code it seems to be some of the hub control signals, so I'm not sure how easily fixed it'd be, nor am I sure the Prop behavior would be accurately emulated if I mess with it.

    The other thing that I think may not be fixable is the potential for output bus skew given all of the possible output sources for each pin, but at the speeds likely to be coming out of a Prop 1 it may not matter enough to worry about. Normally you fix skew by forcing the "last" flop before the output to be placed into the IOB with a constraint or source code directive, but with all of the wired OR stuff, there's quite a bit more than one "last" flop for each pin.
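
    To illustrate what the source code directive looks like in the simple case, here is a minimal sketch using the Vivado IOB attribute on a single output register. The module and signal names are made up for the example. It only covers the easy situation where one flop drives the pin; it does not solve the wired-OR case described above.

    ```verilog
    // Hypothetical single-source output; names are illustrative only.
    module iob_reg_sketch
    (
        input  wire clk,
        input  wire d,        // value to drive onto the pin
        output wire pin_out   // connects to a top-level output port
    );

        // Ask Vivado to pack this register into the I/O block, so the
        // clock-to-pad delay is the same for every pin done this way.
        (* IOB = "TRUE" *) reg out_q;

        always @(posedge clk)
            out_q <= d;

        assign pin_out = out_q;

    endmodule
    ```

    Applying this to the P1V pins would presumably mean registering the OR of all cog outputs one more time, which adds a clock of latency. Whether that is acceptable for accurate Prop emulation is an open question.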

    As long as everything keeps working I'll just do some cleanup on what I have and send another pull.
  • Nice work andrew.
    I tried a while ago to get rid of the warnings but I just couldn't work out what to do with some of them.
  • jac_goudsmit Posts: 348
    edited September 3
    andrewsi wrote: »
    You're very welcome. A quick test of my Nexys4 version seems to be working fine with the Serial Terminal demo, I'll try some VGA output next.

    I didn't have a chance to test the Arty any further, other than changing the source to implement 8 cogs again, and building and downloading it. I think I had the LEDs on the Arty hooked up so they would show cog activity and there is a small test program to start all cogs with a 1 second interval in the tree, but I haven't even run that.

    Once I'm satisfied that the Arty works (and it looks like it is!), I'll merge the Xilinx branch into the Rel branch, which is where releases live. Then Xilinx will be officially supported in my P1V repo, which I'm sure a lot of people would be very pleased with!
    I see that most of the code that I previously had in the top module for emulating the PLL and config has moved down into tim.v, so I was able to remove all of that. My version happens to instantiate the MMCM rather than a PLL for overall clocking (at Xilinx' recommendation), but other than that there's not too much difference from your Arty code.

    I didn't know about the MMCM vs. PLL thing. If you have any suggestions for improvements, they would be very welcome.

    My architecture is different from the original architecture as Chip made it: Basically I added an extra layer above the top module, to represent the hardware-specific parts. Because of this, anything that's specific to some target hardware had to be moved out of the other modules, for example the way that the main clock was generated with a PLL is different between Altera and Xilinx, so it's in the hardware-specific top level module. The old top.v is deprecated and will eventually be removed.
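
    A rough sketch of that layering, for those who haven't looked at the repo: the board-specific module owns the pins and the clock source, and the hardware-independent core only sees a clock, a reset and generic signals. Both module names and all port names here are hypothetical stand-ins, not the actual P1V modules.

    ```verilog
    // Hypothetical hardware-independent core (stand-in for the real one).
    module core_stub
    (
        input  wire       clk,
        input  wire       res_n,   // active-low reset
        output wire [7:0] ledg     // generic "cog activity" outputs
    );
        reg [7:0] cnt = 8'b0;

        always @(posedge clk)
            if (!res_n)
                cnt <= 8'b0;
            else
                cnt <= cnt + 1'b1;

        assign ledg = cnt;
    endmodule

    // Hypothetical board-specific top level: everything that depends on
    // the target hardware (pins, clock generation) lives here.
    module board_top_sketch
    (
        input  wire       osc,        // on-board oscillator
        input  wire       btn_res_n,  // reset button
        output wire [7:0] led         // on-board LEDs
    );
        // A vendor-specific PLL or MMCM would be instantiated here;
        // the sketch just passes the oscillator through.
        wire clk_sys = osc;

        core_stub core (.clk(clk_sys), .res_n(btn_res_n), .ledg(led));
    endmodule
    ```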

    The PLLs in each cog aren't implemented with FPGA PLL's (some targets would run out of PLL's if we would do that). I think Chip rewrote the timing module especially for the P1V distribution so that it didn't need actual PLL's (and there was at least one bug in that code that has since been fixed, see http://forums.parallax.com/discussion/157387). Maybe there's something to be said about moving the implementation of the cog PLLs to the hardware-specific module too, but for now, there are other things I'd like to work on, especially the features that were in the old repository (before Parallax started their own Github repo) and haven't been carried over into the P1V repo yet. I would also like to port features such as the "B" port and giant hub memories.
    The one thing I would love to figure out how to fix is the 40ish DRC warnings that are produced regarding signals with asynchronous resets that are connected to Block RAM. After inspecting the code it seems to be some of the hub control signals, so I'm not sure how easily fixed it'd be, nor am I sure the Prop behavior would be accurately emulated if I mess with it.

    I thought those asynchronous resets were coming from the actual reset, and if that's true, it would be easy to synchronize the resets with a clock to eliminate the warning. But I haven't studied the problem closely. I have to say, the Vivado warnings seem to be more clear than the Quartus warnings, and also appear to be easier to fix. Maybe Vivado is smarter at figuring out what we want, or maybe Quartus is better at finding actual subtle problems; who knows.

    Anyway, as a software engineer I prefer to write my code so that it doesn't generate any warnings, but as I understand it, this may not be a realistic goal for an HDL project.
    The other thing that I think may not be fixable is the potential for output bus skew given all of the possible output sources for each pin, but at the speeds likely to be coming out a prop 1 it may not matter enough to worry about. Normally you fix skew by forcing the "last" flop before the output to be placed into the IOB with a constraint or source code directive, but with all of the wired OR stuff, there's quite a bit more than one "last" flop for each pin.

    I understand how this could be a problem, why it would be a thing that gets flagged, and how it would be fixed, and how that would be difficult in our case. However my HDL Fu is not powerful enough to judge whether this is a problem to worry about. I would think that if the skew is less than one Prop clock (12.5ns), it would probably be irrelevant. Maybe eventually I'll try P1V with one of my 6502 projects which are very timing-critical, and see if things look different on my logic analyzer between P1 and P1V.
    As long as everything keeps working I'll just do some cleanup on what I have and send another pull.

    I appreciate it! Thanks!

    ===Jac

    Rancho Cucamonga, CA

  • The PLLs in each cog aren't implemented with FPGA PLL's (some targets would run out of PLL's if we would do that). I think Chip rewrote the timing module especially for the P1V distribution so that it didn't need actual PLL's (and there was at least one bug in that code that has since been fixed, see http://forums.parallax.com/discussion/157387). Maybe there's something to be said about moving the implementation of the cog PLLs to the hardware-specific module too, but for now, there are other things I'd like to work on, especially the features that were in the old repository (before Parallax started their own Github repo) and haven't been carried over into the P1V repo yet.
    A true P1 has an unusually large number of PLLs, but more practical designs could use far fewer.

    To me, it makes sense to support at least one PLL for Main SysCLK, if one exists in the FPGA.

    Possibly a 2+, could be COG available, if someone needs a Video-type PLL, but I'm less clear on how P1V manages the clock domains here ?

    I think it is also a good idea to have any PLL conditional, so 'no-PLL' builds are possible. That covers more FPGAs and is more vendor portable.
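
    A conditional PLL could be as simple as a compile-time macro around the clock generation. The sketch below is only an assumption about how that might look; USE_PLL and vendor_pll_stub are made-up names, and the stub would have to be replaced by a real vendor PLL wrapper in a PLL build.

    ```verilog
    // Hypothetical wrapper; USE_PLL and vendor_pll_stub are made-up names.
    module sysclk_sketch
    (
        input  wire clk_in,    // board oscillator
        output wire clk_sys    // clock handed to the rest of the design
    );

    `ifdef USE_PLL
        // PLL build: a vendor-specific PLL/MMCM wrapper goes here.
        vendor_pll_stub pll (.clk_in(clk_in), .clk_out(clk_sys));
    `else
        // No-PLL build: the oscillator drives the design directly.
        assign clk_sys = clk_in;
    `endif

    endmodule
    ```

    With USE_PLL undefined, the design has no vendor primitives in its clock path, so an external oscillator at the target frequency is all it needs.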

    I notice CMOS MEMS oscillators all the way to 150MHz are available, and not that expensive.
    There is also a programmable Si5351A, which can go to 200MHz, but that does need i2c programming.
  • jmg wrote: »
    A true P1 has an unusually large number of PLLs, but more practical designs could use far fewer.

    To me, it makes sense to support at least one PLL for Main SysCLK, if one exists in the FPGA.

    Possibly a 2+, could be COG available, if someone needs a Video-type PLL, but I'm less clear on how P1V manages the clock domains here ?

    I think it is also a good idea to have any PLL conditional, so 'no-PLL' builds are possible. That covers more FPGAs and is more vendor portable.

    In most use cases that I can think of, only a few PLL's (if any) are ever in use. I think an important improvement to P1V's architecture would be to implement some sort of dynamic allocation algorithm for the PLL's and other features like video that might be customized for a specific task. For example, on an FPGA with 2 hardware PLL's, one would be used for the main clock and the other would be assigned to implement a cog PLL; the other cog PLL's would be Chip's logic simulation that we have now.

    === Jac

    Rancho Cucamonga, CA
  • andrewsi Posts: 59
    edited September 3
    I didn't know about the MMCM vs. PLL thing. If you have any suggestions for improvements, they would be very welcome.
    xilinx.com/support/documentation/user_guides/ug472_7Series_Clocking.pdf is the bible for the 7 series clocking capabilities. Truth be told there's not much difference in the 7's between the PLL and the MMCM other than some additional features that probably don't matter too much for P1V purposes.
    Basically I added an extra layer above the top module, to represent the hardware-specific parts. Because of this, anything that's specific to some target hardware had to be moved out of the other modules, for example the way that the main clock was generated with a PLL is different between Altera and Xilinx, so it's in the hardware-specific top level module. The old top.v is deprecated and will eventually be removed.
    Yeah, I did figure that out and have used it essentially the same way. I have avoided touching anything in lower level modules (aside from the 1 bug fix already submitted.)
    I thought those asynchronous resets were coming from the actual reset, and if that's true, it would be easy to synchronize the resets with a clock to eliminate the warning. But I haven't studied the problem closely. I have to say, the Vivado warnings seem to be more clear than the Quartus warnings, and also appear to be easier to fix. Maybe Vivado is smarter at figuring out what we want, or maybe Quartus is better at finding actual subtle problems; who knows.

    Anyway, as a software engineer I prefer to write my code so that it doesn't generate any warnings, but as I understand it, this may not be a realistic goal for an HDL project.
    The async resets in question appear to be from various bus specific flags - I traced the properties of the flagged registers to see where in the code they originate (the properties include the source file and line, if you look closely enough) and it appears to be the if (<signal>) then <reset assignment> clauses in various places in hub.v. They don't seem to be asynchronous resets coming from the top level reset logic. (Clocking the top-level reset doesn't appear to cure anything either, as confirmation.) However, I haven't yet figured out why it thinks these signals are asynchronous in the first place. More inspection is required. :-)
  • jac_goudsmit Posts: 348
    edited September 4
    I'm pleased to report that I got the P1V code to work on the Arty with the on-board FTDI chip acting as Prop Plug. The cogledtest.spin program can be downloaded with the Propeller Tool and works, but I haven't tested anything else.
    andrewsi wrote: »
    xilinx.com/support/documentation/user_guides/ug472_7Series_Clocking.pdf is the bible for the 7 series clocking capabilities. Truth be told there's not much difference in the 7's between the PLL and the MMCM other than some additional features that probably don't matter too much for P1V purposes.

    I will look at that. I copied the PLL code for the master clock from somewhere else and modified it to generate 160MHz and it appears that's working, though it looks messy in the source code. I'll improve that before I integrate the source into the main branch ("rel").

    Also, and perhaps more important, I noticed a lot of synth warnings where Vivado apparently decides that a bunch of parts aren't necessary so it throws them away. I'll look into that.

    ===Jac
    Rancho Cucamonga, CA
  • Also, and perhaps more important, I noticed a lot of synth warnings where Vivado apparently decides that a bunch of parts aren't necessary so it throws them away. I'll look into that.

    ===Jac
    I'm glad to hear you're going to look into the warnings. I wanted to try tweaking the P1v RTL but I hesitated when I saw all of the warnings generated by the unmodified code. I wasn't sure how I would know which warnings were caused by my changes after modifying the code so I abandoned my plans to experiment with the RTL. Having a clean source to start with would make things a lot easier for novices like me.
  • David Betz wrote: »
    I'm glad to hear you're going to look into the warnings. I wanted to try tweaking the P1v RTL but I hesitated when I saw all of the warnings generated by the unmodified code. I wasn't sure how I would know which warnings were caused by my changes after modifying the code so I abandoned my plans to experiment with the RTL. Having a clean source to start with would make things a lot easier for novices like me.

    I have a funny feeling that many (if not all) of the problems were already solved in the Xilinx port in the old P8X32A_Emulation repo...

    ===Jac

    Rancho Cucamonga, CA
  • andrewsi Posts: 59
    edited September 5
    I found a thread regarding Vivado 2017 that seems to indicate that it throws a lot of spurious "Unused sequential element x was removed" warnings. (Synth 8-6014).

    See: https://forums.xilinx.com/xlnx/board/crawl_message?board.id=SYNTHBD&message.id=21691

    Until they fix it, probably not worth trying to investigate too deeply, but this sort of warning is fairly common anytime it finds things that can be functionally merged together, or logic that is equivalent to a constant value.
  • andrewsi Posts: 59
    edited September 5
    Incidentally, I have done some reading up on Verilog synthesis and the huge wad of DRC errors coming out of Vivado. Long story short, synthesizers assume that if multiple signals are included in the sensitivity list, the first one is the clock, and the next one is an asynchronous reset. If the reset signal is left out of the sensitivity list, then it's synchronous (because only the clock signal matters.) This may have made sense in the actual Prop1 design, perhaps, where various hub bus signals, the external reset, cog enable lines, etc., may actually be truly asynchronous, but I get the sense that in FPGA implementations async resets are generally frowned upon.

    However, the code in many of the core files is rife with "always @(posedge <clk> or negedge <signal>)" constructs. The synthesizer, as per the above, will infer that the second signal listed is meant as an asynchronous reset (which it generally is, given the code which follows). I would suggest that the code can probably be changed to remove the sensitivities to the second signals, and thus make the whole design synchronous to the real clocks. All the reset signals, both internal and external, are still acted on, just not until the next clock edge at which the reset is asserted.
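
    To make the difference concrete, here is a small side-by-side sketch of the two styles. The module and signal names are invented for the example and are not taken from the P1V sources; the only thing that differs between the two always blocks is the sensitivity list.

    ```verilog
    // Hypothetical example registers; signal names are illustrative only.
    module reset_style_sketch
    (
        input  wire       clk,
        input  wire       ena,      // active-high enable; low means "reset"
        input  wire [7:0] d,
        output reg  [7:0] q_async,
        output reg  [7:0] q_sync
    );

        // Asynchronous style: ena is in the sensitivity list, so the
        // synthesizer infers flops with an asynchronous clear.
        always @(posedge clk or negedge ena)
            if (!ena)
                q_async <= 8'b0;
            else
                q_async <= d;

        // Synchronous style: only the clock is in the sensitivity list,
        // so the reset takes effect on the next rising clock edge.
        always @(posedge clk)
            if (!ena)
                q_sync <= 8'b0;
            else
                q_sync <= d;

    endmodule
    ```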

    It appears that the intent is to make the reset tree synchronous anyway, given the procedure in p1v.v that takes the external reset signals and brings them into the cog_clk domain (it should probably do so through a double-synchronizer, since the external resets truly are asynchronous!). So this change is unlikely to have any meaningful effect on the behavior of the simulated Prop.
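
    For reference, a double-synchronizer is just two flops in series in the destination clock domain. The sketch below assumes an active-low external reset and uses made-up module and signal names; it is not the code in p1v.v. ASYNC_REG is a Vivado attribute that tells the tools the flops form a synchronizer chain.

    ```verilog
    // Hypothetical two-flop synchronizer for an active-low external reset.
    module reset_sync_sketch
    (
        input  wire clk,        // destination clock domain
        input  wire res_n_in,   // asynchronous, active-low reset from a pin
        output wire res_n_out   // same reset, synchronized to clk
    );

        // Keep the two flops together and mark them as a synchronizer.
        // Starting at 00 holds the design in reset right after configuration.
        (* ASYNC_REG = "TRUE" *) reg [1:0] sync = 2'b00;

        always @(posedge clk)
            sync <= {sync[0], res_n_in};   // shift the pin value through

        assign res_n_out = sync[1];        // two clock edges behind the pin

    endmodule
    ```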

    As a test, I made all of the necessary changes, rebuilt the bitfile, and:
    a) The Vivado DRC errors related to the async resets vanish completely.
    b) A quick check of the Serial Terminal Demo still functions perfectly normally.
  • andrewsi wrote: »
    I found a thread regarding Vivado 2017 that seems to indicate that it throws a lot of spurious "Unused sequential element x was removed" warnings. (Synth 8-6014).

    I think I've seen that before. I'm already not too worried about it but it keeps nagging me because in software, warnings usually indicate that the compiler thinks you might not have entered what you think you entered, and there's usually a way to make it 100% clear to a compiler that tells it that you know what you're doing.
    andrewsi wrote: »
    If the reset signal is left out of the sensitivity list, then it's synchronous (because only the clock signal matters.)

    Thanks! Good to know!
    However, the code in many of the core files is rife with "always @(posedge <clk> or negedge <signal>)" constructs.

    Yes. There's a lot of "set this register to <such and such> when a cog clock pulse happens, but reset it to <something else> when the cog is started with cog_ena". Am I right in concluding that even though cog_ena is ultimately generated synchronously, apparently the synth isn't smart enough to see this?
    As a test, I made all of the necessary changes, rebuilt the bitfile, and:
    a) The Vivado DRC errors related to the async resets vanish completely.
    b) A quick check of the Serial Terminal Demo still functions perfectly normally.

    I would be interested to see that!

    By the way, I did a diff against the old Xilinx P1V code to find out what the differences are. I forgot, Andy, your name was in there too, together with Magnus Karlsson. It looks like the changes are mostly a matter of moving declarations around, but there are several places where I saw what looked like adding extra bits to registers to make them an even 32 or 64 or whatever number of bits wide. Any idea why those changes were made?

    Elsewhere I saw more declarations that changed from "reg [a:b] [c:d] ident" to "reg [c:d] ident [a:b]". Your one-line Arty fix has already proved that those can make a big difference, so I'll see if I can integrate the Xilinx changes from the old P8X32A_Emulation repo into the P1V repo. The board support files from Digilent include the Nexys4 so I should already be able to create a target for that, even though I don't have that hardware. And from what I can see, the Pipistrello is not that different (only the clock speed?) from the Nexys4?

    Anyway, I also have another Altera target on its way to my mailbox. I think it can be a P1V target too. I may look at that later this week if I have time.

    ===Jac

    Rancho Cucamonga, CA
  • There's a lot of "set this register to <such and such> when a cog clock pulse happens, but reset it to <something else> when the cog is started with cog_ena". Am I right in concluding that even though cog_ena is ultimately generated synchronously, apparently the synth isn't smart enough to see this?
    You've just defined the nature of a flop reset, i.e. "use the clock signal to latch all inputs, and use the special set/reset inputs to force a starting condition when they're asserted." The flops do support asynchronous reset, so if you tell the synth to use them by including it in the sensitivity list, the flop will treat it as async even if upstream it's coming from a synchronous source. However, the reason async resets are "bad" in a design like this is that there are no guarantees that an async reset signal will arrive at all the flops on the same clock depending on the routing skew to each flop, and more importantly, affected flops might come out of reset at different times. The timing tool isn't going to attempt to check the setup/holds on async resets, obviously. So better to just make it all synchronous. Good article at fpga-dev.com/resets-make-them-synchronous-and-local/
    I forgot, Andy, your name was in there too, together with Magnus Karlsson. It looks like the changes are mostly a matter of moving declarations around, but there are several places where I saw what looked like adding extra bits to registers to make them an even 32 or 64 or whatever number of bits wide. Any idea why those changes were made?
    Yeah, I contributed one bugfix early on for incorrect logic on the config register decoding for the PLL. I have no idea why bits were added to registers, other than possibly to make the code cleaner and let the synth do the job of ignoring unused bits.
    Elsewhere I saw more declarations that changed from "reg [a:b] [c:d] ident" to "reg [c:d] ident [a:b]". Your one-line Arty fix has already proved that those can make a big difference, so I'll see if I can integrate the Xilinx changes from the old P8X32A_Emulation repo into the P1V repo. The board support files from Digilent include the Nexys4 so I should already be able to create a target for that, even though I don't have that hardware. And from what I can see, the Pipistrello is not that different (only the clock speed?) from the Nexys4?
    If the RAMs being inferred are small enough it'll still use distributed RAM (LUT-based) rather than block RAMs and it probably won't make much of a difference, although it may be able to optimize them better if it sees them as n-bit wide rather than 2-D matrices of single-bit registers. Couldn't hurt to be syntactically correct.

    Anyway, I will submit my Nexys4 stuff soon, since changing all of the output constraints to fit the board is a pain and I've already done the work. :-) Also hooks up the VGA output to some of the prop pins so video can actually get generated, uses the long array of switch LEDs as the COG LEDs, etc. You can save yourself the trouble.

  • I've submitted a pull request which includes verified constraints, top, and Vivado project for the Nexys4, plus synchronous-ization of core reset signals that were previously synthesizing as asynchronous and causing DRC warnings.
  • For those reading along: over the past few days, @Andrewsi and I have been working together on Github to make some improvements for the Arty and to add support for the Nexys4 (also by Digilent, but using a larger Artix-7 FPGA).

    Andy, if you feel confident about the current state of the Nexys4 code, let me know. I'll merge the Xilinx branch to Rel in my repository to make support for Arty and Nexys4 "official".

    This weekend, if I have time, I may add another Altera target. Stay tuned!

    ===Jac

    Rancho Cucamonga, CA
  • jac, etc

    For Quartus, adding this line to xxxxxx.qsf prevents a warning in the latest Quartus V17.0 (my laptop has an Intel CPU with 2 cores, hence the 2):
    set_global_assignment -name NUM_PARALLEL_PROCESSORS 2
    

    I have been getting various versions running (number of COGs, size of HUB RAM, video in/out, CTRB in/out, etc).
    Here is my DE0-Nano.qsf section
    set_global_assignment -name FMAX_REQUIREMENT "80 MHz" -section_id clock_cog
    set_global_assignment -name FMAX_REQUIREMENT "160 MHz" -section_id clock_pll
    set_global_assignment -name VERILOG_MACRO "NOTSCRAMBLED=1"
    set_global_assignment -name VERILOG_MACRO "DISABLE_FONT_ROM=1"
    set_global_assignment -name VERILOG_MACRO "CLUSO_ROMHI=1"
    set_global_assignment -name VERILOG_MACRO "COGS4=1"
    set_global_assignment -name VERILOG_MACRO "HUB_SINGLE_CLOCK=1"
    set_global_assignment -name VERILOG_MACRO "INVERT_LEDS=0"
    
    set_global_assignment -name VERILOG_MACRO "NO_VIDEO=1"
    set_global_assignment -name VERILOG_MACRO "NO_CTRA=1"
    set_global_assignment -name VERILOG_MACRO "NO_CTRB=1"
    set_global_assignment -name VERILOG_MACRO "NO_ROTATES=1"
    

    In my dig.v I have
    `ifdef COGS4                                    // defined = 4 cogs, not defined = 8 cogs (`ifdef ignores the macro's value)
    	parameter		NCOGS   		= 4;						// number of cogs
    `else
    	parameter		NCOGS   		= 8;
    `endif
    
    ...followed by this later on
    // bus select
    
    reg [NCOGS-1:0] bus_sel;
    
    always @(posedge clk_cog or negedge nres)
    if (!nres)
        bus_sel <= {NCOGS{1'b0}};                   // clear all NCOGS bits explicitly
    else if (ena_bus)
        bus_sel <= {bus_sel[NCOGS-2:0], ~|bus_sel[NCOGS-2:0]};  // rotates hub slot
    
    
    // cogs
    
    wire [NCOGS-1:0]          bus_r;
    wire [NCOGS-1:0]          bus_e;
    wire [NCOGS-1:0]          bus_w;
    wire [NCOGS-1:0]  [1:0]   bus_s;
    wire [NCOGS-1:0] [15:0]   bus_a;
    wire [NCOGS-1:0] [31:0]   bus_d;
    wire [NCOGS-1:0]          pll;
    wire [NCOGS-1:0] [31:0]   outx;
    wire [NCOGS-1:0] [31:0]   dirx;
    
    genvar cogid;
    generate
        for (cogid=0; cogid<NCOGS; cogid++)
        begin : coggen
            cog #(cogid) cog_(
                        .nres       (nres),
                        .clk_cog    (clk_cog),
                        .clk_pll    (clk_pll),
                        .ena_bus    (ena_bus),
                        .ptr_w      (ptr_w[cogid]),
                        .ptr_d      (ptr_d),
                        .ena        (cog_ena[cogid]),
                        .bus_sel    (bus_sel[cogid]),
                        .bus_r      (bus_r[cogid]),
                        .bus_e      (bus_e[cogid]),
                        .bus_w      (bus_w[cogid]),
                        .bus_s      (bus_s[cogid]),
                        .bus_a      (bus_a[cogid]),
                        .bus_d      (bus_d[cogid]),
                        .bus_q      (bus_q),
                        .bus_c      (bus_c),
                        .bus_ack    (bus_ack[cogid]),
                        .cnt        (cnt),
                        .pll_in     (pll),
                        .pll_out    (pll[cogid]),
                        .pin_in     (pin_in),
                        .pin_out    (outx[cogid]),
                        .pin_dir    (dirx[cogid])   );
        end
    endgenerate
    
    This works fine except that I have not found a way of globally defining NCOGS such as NCOGS=3.
    Does anyone know how to do this?

    I would ultimately like to do something like this
    //parameter	byte    COGSIZE [7:0]   = '{2,2,2,2,2,2,2,2};		// cog sizes in KB
    //parameter			HUBSIZE  		= 32;						// hub ram size in KB
    //parameter	bit 	VIDEO	[7:0]	= '{0,0,0,0,0,0,0,0}; 		// 1=video circuitry, 0=none
    
    Again, I don't know how to achieve this.

    And...
    `ifdef COGS4
    wire  [1:0] hub_bus_s   = bus_s[3] | bus_s[2] | bus_s[1] | bus_s[0];
    wire [15:0] hub_bus_a   = bus_a[3] | bus_a[2] | bus_a[1] | bus_a[0];
    wire [31:0] hub_bus_d   = bus_d[3] | bus_d[2] | bus_d[1] | bus_d[0];
    `else
    wire  [1:0] hub_bus_s   = bus_s[7] | bus_s[6] | bus_s[5] | bus_s[4] | bus_s[3] | bus_s[2] | bus_s[1] | bus_s[0];
    wire [15:0] hub_bus_a   = bus_a[7] | bus_a[6] | bus_a[5] | bus_a[4] | bus_a[3] | bus_a[2] | bus_a[1] | bus_a[0];
    wire [31:0] hub_bus_d   = bus_d[7] | bus_d[6] | bus_d[5] | bus_d[4] | bus_d[3] | bus_d[2] | bus_d[1] | bus_d[0];
    `endif
    
    How do I simplify this using NCOGS ?
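
    One possible way to fold these three cases into something that scales with NCOGS is a small OR-reduction helper with a loop. This is only a sketch: the or_reduce module and its N and W parameters are hypothetical names, and it relies on the same SystemVerilog packed-array syntax already used for bus_s, bus_a and bus_d above.

    ```verilog
    // Hypothetical helper: OR together N packed words of W bits each.
    module or_reduce #(
        parameter N = 8,       // number of words (cogs)
        parameter W = 32       // bits per word
    ) (
        input  wire [N-1:0][W-1:0] in,
        output reg  [W-1:0]        out
    );
        integer i;
        always @* begin
            out = {W{1'b0}};
            for (i = 0; i < N; i = i + 1)
                out = out | in[i];
        end
    endmodule

    // Possible use in place of the `ifdef block (declarations assumed):
    //   wire  [1:0] hub_bus_s;
    //   wire [15:0] hub_bus_a;
    //   wire [31:0] hub_bus_d;
    //   or_reduce #(.N(NCOGS), .W(2))  or_s_ (.in(bus_s), .out(hub_bus_s));
    //   or_reduce #(.N(NCOGS), .W(16)) or_a_ (.in(bus_a), .out(hub_bus_a));
    //   or_reduce #(.N(NCOGS), .W(32)) or_d_ (.in(bus_d), .out(hub_bus_d));
    ```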

    BTW I have a Lattice iCE40 FPGA that I would like to have a go at getting running. So I will take a look at your latest code to see if I can get it running on the iCE40.

    My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBlade
    Prop OS (also see Sphinx, PropDos, PropCmd, Spinix)
    Website: www.clusos.com
    Prop Tools (Index) , Emulators (Index) , ZiCog (Z80)
  • jac_goudsmit Posts: 348
    edited September 9
    @Cluso99, please start a new thread for this. Thanks!

    (Short answer: You already found how to set parameters for the top level module; the top level module can propagate the parameters to lower-level modules so that's how you can set the number of cogs)
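
    A minimal sketch of that propagation, assuming made-up module names (top_example and dig_example are stand-ins, not the real top and dig modules): the top level declares NCOGS once and hands it down in the instantiation, so the default only needs to be overridden in one place.

    ```verilog
    // Hypothetical lower-level module standing in for dig.v (needs NCOGS >= 2).
    module dig_example #(
        parameter NCOGS = 8
    ) (
        input  wire             clk,
        input  wire             nres,
        output reg  [NCOGS-1:0] bus_sel
    );
        always @(posedge clk)
            if (!nres)
                bus_sel <= {NCOGS{1'b0}};
            else
                bus_sel <= {bus_sel[NCOGS-2:0], ~|bus_sel[NCOGS-2:0]};
    endmodule

    // Hypothetical top level: its NCOGS is passed down unchanged.
    module top_example #(
        parameter NCOGS = 8
    ) (
        input  wire             clk,
        input  wire             nres,
        output wire [NCOGS-1:0] bus_sel
    );
        dig_example #(.NCOGS(NCOGS)) dig_ (
            .clk     (clk),
            .nres    (nres),
            .bus_sel (bus_sel)
        );
    endmodule
    ```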

    ===Jac
    Rancho Cucamonga, CA
  • I'm good with the current state of the Nexys4 code and merging everything up to rel should be fine.

    Once it's been merged I'll probably have another small pull request - it's much smaller and easier to create the I/O assignments in the top module with a generate block than to individually list every single output port. :-)

    There's another thing I've been working on, which is to bring all of the input bus signals into the design through a multi-flop synchronizer, but that leads down a rabbit hole:

    Reading the value of an input port when another cog has it set to output mode is supposed to immediately return whatever the current value of the port is at the time. This means creating a shortcut from the output bus back to the input bus that avoids the synchronizer flops when the direction of the port is set to output mode, because otherwise there would be several clocks of latency between the output and its return to the input bus.

    Writing this part is easy, but it shoots the timing all to heck: there appear to be fully combinational looping paths within the cog logic, so the timing analysis finds paths that lead out of various cogs onto the output bus and then back into the input bus of different cogs, going to the fake PLLs or around to a different cog. All of that takes too much time, and lots of paths aren't fast enough. I basically need to find some way around this via constraints, or by design. The router just flails badly when I try to enable any of this. I may never really find a solution. It wouldn't matter, except I don't know if there's any Prop software out there relying heavily on this behavior (and its timing).
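
    For anyone following along, the idea described above can be sketched roughly as follows. The module name, port names and the two-stage depth are assumptions for illustration and this is not the code in my branch; the final assign is the combinational shortcut that gives the timing analysis its long paths.

    ```verilog
    // Hypothetical example: multi-flop input synchronizer with an output bypass.
    module pin_sync #(
        parameter W      = 32, // number of pins
        parameter STAGES = 2   // synchronizer depth (assumed)
    ) (
        input  wire         clk,
        input  wire [W-1:0] pin_raw,   // asynchronous signals from the pads
        input  wire [W-1:0] pin_out,   // combined output bus from all cogs
        input  wire [W-1:0] pin_dir,   // 1 = pin is driven by a cog
        output wire [W-1:0] pin_in     // what the cogs read back
    );
        reg [W-1:0] sync [0:STAGES-1];
        integer i;

        always @(posedge clk) begin
            sync[0] <= pin_raw;
            for (i = 1; i < STAGES; i = i + 1)
                sync[i] <= sync[i-1];
        end

        // Pins in output mode read back the driven value with no latency;
        // pins in input mode read the synchronized pad value.
        assign pin_in = (pin_dir & pin_out) | (~pin_dir & sync[STAGES-1]);
    endmodule
    ```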