P1 Verilog implementation: Alpha testers required
Ale
Hi, I have a small project in the works, a Verilog implementation of the P1, and I'd like some alpha/beta testers.
What you need:
just ONE of the following
. Lattice MachXO2 breakout board with MachXO2-7000HE (with extra 40 MHz oscillator, you can solder it to the pads or on the prototyping area)
. Arrow's Bemicro CV (Cyclone V-based)
. Xilinx Spartan3E starter kit (XC3S500E)
. (It may run on a XC3S200 too, I don't know yet, tell me if you have one of these)
and ONE PropPlug
Any takers?
The code is on GitHub now: https://github.com/raps500/AProp
For the time being you need Icarus Verilog and GTKWave. The Altera and Lattice projects need updating, and a Xilinx project is coming soon too.
Features and differences as of 1.04.2014:
All opcodes except HUB ops are 4 cycles long, including djnz, tjnz and tjz (MUL on Lattice is the exception and is still in the works).
HUB opcodes take between 7 and 8 cycles with 4 cogs, plus 1+ cycles for each extra pair of cogs.
HUB memory is implemented as dual-ported memory for faster access. I'll also add an external memory interface coupled to the HUB.
Video circuitry is still missing. You can't reconfigure PLLs on the fly, but some fixed frequencies could be implemented.
Counters are still missing.
PORTA and PORTB are implemented (PORTB can be shut down if needed).
If Cluso agrees, we could use his Spin interpreter; I have to test this.
A bootloader is still needed.
There is no hazard when writing a memory location and fetching the next instruction from it.
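The HUB timing above can be sketched as a tiny Python model. It is only a guess at the behaviour: it assumes the 7-cycle best case stays fixed and that each extra pair of cogs beyond 4 adds exactly one cycle to the worst case. The name `hub_op_cycles` is hypothetical and is not taken from the AProp sources.

```python
HUB_BASE_CYCLES = 7  # best case quoted in the post for 4 cogs


def hub_op_cycles(cogs):
    """Return (best, worst) cycle counts for one HUB opcode.

    Assumption: worst case is 8 cycles with 4 cogs and grows by one
    cycle for every extra pair of cogs. The real design may differ.
    """
    if cogs < 4 or cogs % 2:
        raise ValueError("model only covers an even number of cogs, 4 or more")
    extra_pairs = (cogs - 4) // 2
    return HUB_BASE_CYCLES, HUB_BASE_CYCLES + 1 + extra_pairs


for cogs in (4, 6, 8):
    best, worst = hub_op_cycles(cogs)
    print(f"{cogs} cogs: {best} to {worst} cycles per HUB op")
```

Under these assumptions an 8-cog build would see HUB ops take between 7 and 10 cycles.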
Some quick statistics:
           | Altera      | Lattice
-----------+-------------+------------------------
Resources  | 1200 ALM    | 1300 slices              for each cog (no counters, video)
Frequency  | 80 MHz (-7) | 40 MHz (-4)
HUB Memory | <= 160 kBi  | <= 22 kBi
MUL        | 1 cycle hw  | 16 cycles Booth's algo
The CV A2 has 176 kbytes block RAM, you need 16 kbytes for 8 cogs, has hw multipliers.
The MachXO2-7000 has 26 kbytes of block RAM. Only 2 cogs fit. It does not have hw multipliers, but Booth's algorithm could be used.
The XC3S500E has 40 kbytes of block RAM; 4 cogs should fit comfortably there, and it has hw multipliers.
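The block RAM figures can be cross-checked with a little arithmetic. The sketch below assumes 2 kbytes of cog RAM per cog (16 kbytes for 8 cogs, as stated for the CV A2) and that everything left over is available as HUB memory. The function name `hub_budget_kbytes` is made up for illustration.

```python
COG_RAM_KBYTES = 2  # assumption: 16 kbytes / 8 cogs, i.e. 512 longs per cog


def hub_budget_kbytes(block_ram_kbytes, cogs):
    """Block RAM left for the HUB once every cog has its own RAM."""
    return block_ram_kbytes - cogs * COG_RAM_KBYTES


# (block RAM in kbytes, number of cogs) as quoted in the post
targets = {
    "Cyclone V A2": (176, 8),
    "MachXO2-7000": (26, 2),
    "XC3S500E": (40, 4),
}

for name, (block_ram, cogs) in targets.items():
    left = hub_budget_kbytes(block_ram, cogs)
    print(f"{name}: {left} kbytes left for HUB with {cogs} cogs")
```

With these inputs the model gives 160 and 22 kbytes for the Altera and Lattice parts, and 32 kbytes for the XC3S500E with 4 cogs.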
The cog has a 4-stage pipeline which can be stalled in stage 3 (read/execute) to wait for the HUB or for a compare/match (waitxxx):
1. Fetch (fetch from cog RAM)
2. Decode (decode)
3. Read/execute (read S & D and precalculate some values used by the ALU)
4. Write back (D is written)
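To make the stall behaviour concrete, here is a rough Python trace of the four stages. It assumes instructions do not overlap (the next fetch starts only after write back, which would fit the 4-cycle opcodes and the absence of a write/fetch hazard) and that every stall cycle is spent in stage 3. The names `trace` and `total_cycles` and the sample program are invented for the example.

```python
STAGES = ("fetch", "decode", "read/execute", "write back")
STALL_STAGE = 2  # index of stage 3, the only stage assumed to stall


def trace(program):
    """Yield (cycle, mnemonic, stage) for a list of (mnemonic, stall_cycles).

    Assumption: no overlap between instructions, one stage per clock,
    and stall cycles repeat the read/execute stage.
    """
    cycle = 0
    for mnemonic, stall in program:
        for index, stage in enumerate(STAGES):
            repeats = 1 + stall if index == STALL_STAGE else 1
            for _ in range(repeats):
                yield cycle, mnemonic, stage
                cycle += 1


def total_cycles(program):
    """Number of clocks needed to run the whole program."""
    return sum(1 for _ in trace(program))


# Hypothetical program: rdlong waits 3 extra cycles for the HUB.
program = [("add", 0), ("rdlong", 3), ("djnz", 0)]
for cycle, mnemonic, stage in trace(program):
    print(f"cycle {cycle:2d}: {mnemonic:6s} {stage}")
print("total:", total_cycles(program))
```

In this model a HUB op with 3 stall cycles takes 7 clocks, which matches the lower bound quoted for 4 cogs.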
The code has minimal explicit resource duplication. Some values are pre-computed between decode and read/execute to shorten the critical path.
I still have a couple of uncommitted updates for mul and for the projects.
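For anyone curious about the 16-cycle MUL on targets without hardware multipliers, here is a Python model of radix-2 Booth multiplication. It is a generic textbook version, not the code from the repository, and it assumes signed 16 x 16 bit operands with a 32-bit result. The accumulator is one bit wider than the operands so that the most negative multiplicand does not overflow.

```python
def booth_multiply(multiplicand, multiplier, bits=16):
    """Signed multiply using radix-2 Booth recoding, one step per bit."""
    q_mask = (1 << bits) - 1
    a_bits = bits + 1                # extra guard bit in the accumulator
    a_mask = (1 << a_bits) - 1
    m = multiplicand & a_mask        # two's complement, sign extended
    a = 0                            # accumulator (upper half of product)
    q = multiplier & q_mask          # lower half, holds the multiplier
    q_prev = 0                       # bit shifted out in the previous step
    for _ in range(bits):            # 16 steps for 16-bit operands
        q0 = q & 1
        if (q0, q_prev) == (1, 0):
            a = (a - m) & a_mask     # start of a run of ones: subtract
        elif (q0, q_prev) == (0, 1):
            a = (a + m) & a_mask     # end of a run of ones: add
        # Arithmetic shift right of the combined a:q register.
        q_prev = q0
        q = (q >> 1) | ((a & 1) << (bits - 1))
        sign = a >> (a_bits - 1)
        a = (a >> 1) | (sign << (a_bits - 1))
    product = ((a << bits) | q) & ((1 << (2 * bits)) - 1)
    if product >> (2 * bits - 1):    # reinterpret as a signed value
        product -= 1 << (2 * bits)
    return product


print(booth_multiply(1234, -567))
```

If the hardware performs one loop iteration per clock, the 16 iterations would line up with the 16 cycles in the table, but that mapping is an assumption.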
Comments
I have a Bemicro CV (and several prop plugs)
I was going to order one, I'll just advance that schedule a bit!
I'll post some code then... I need to do a bit more testing and a bit of PASM coding.
So you've got an FPGA emulation of a P1 variant?? Sounds exciting!
I have an older BASYS board (Xilinx Spartan XC3S100E) - I'm guessing it's not enough. I have, however, been looking for an excuse to get one of the new FPGA boards to get a jump with the P2.
Doc
It doesn't happen to run on the Terasic DE0-Nano too, that would be really cool and pick up a lot of testers for you!!
You could only fit 1 cog in that many LEs.
I don't have a DE0-Nano... I'll publish the code at some point... it needs a bit of work and cleanup.
It is an FPGA implementation of a P1. It has a couple of enhancements (djnz/tjz/tjnz always take 4 cycles), MUL & MULS are there too, and PORTB is also there. The HUB is not yet implemented but the layout is there; I'm taking advantage of dual-ported memory to split HUB accesses. I want to implement external memory, too. The demo is not ready yet. I wanted Tachyon as the demo, but it needs a couple of extra things that I am working on.
Sounds interesting! I have a Spartan 3A but no time currently.
I've been busy adding the HUB and loading the COG's memory. I ended up using a small PASM routine to load from HUB to COG; it saves some LUTs but needs more cycles, a bit less than 160000. I'll put the code on GitHub so everyone can have a look at it... I think the code is quite usable...
I pushed the project to GitHub. You can check it out at https://github.com/raps500/AProp.
For the time being, you can try it with Icarus Verilog, using the scripts in 05_Testbenches. The Lattice and Altera projects haven't been updated for the HUB yet.
If you have questions, comments, please post them here.
Ale
Can you add a quick fit summary (maybe in post #1, updated with releases) of the
logic resources used for each target and the MHz reached/reported on that target (maybe with/without MUL?),
and how much HUB memory is available on each platform?
How many CLK cycles per opcode?
I've been playing around with FPGAs recently - getting Z80 emulations running on older cyclone II chips http://zx80.netai.net/grant/Multicomp/index.html Looking at the Z80 code and the internal memory of the cyclone series, there would be lots of possibilities. Even down to something like one cog and external ram for the 32k hub.
That might not seem useful, but consider that many of the things that end up running in cogs (VGA video, serial ports, ethernet, SPI ports, SD cards, keyboard etc) can all be done as blocks of VHDL/Verilog on a FPGA, so you may only need one cog for many applications.
Chip has P1 working on DE0 but not 64KB hub. Plus more!
The P1 would fit nicely on the DE0-nano, and could have also both ports!
I keep working on my P1. HUB ops, I'll push an update today or tomorrow.
It sounds like you are moving along nicely. I don't really know anything about Verilog or really even about building the images.
Is the Altera version that runs on the BeMicro (or anything other than Lattice) build-able and loadable?
Is there anything we can do to help you along? What I don't know, I can probably learn without blowing up too much stuff!
I updated the whole project, thus the programming file is also there, ready to download.
I see Chip's P1 is a 2 clock design, which uses the Dual Port memory well on cog opcodes.
Is it easy to morph from 4 to 2 ?
Kuroneko: I read the manual, I did the test (and checked the code of the test) and they seem to be reversed. Do you have a test I could run? In my code they are like in the manual; maybe one of my props is faulty...
I get the following for NEGNZ