P8X32A Emulation on Tang20K FPGA

ti85 · 2024-04-23 14:54

I came across a newer, low-cost FPGA called the Tang20K. I'm wondering if anyone has experience with it and the feasibility of running P8X32A emulation on this platform. Parallax has provided the Verilog files, but I'm unfamiliar with Verilog or hardware coding in general.

My question is: what steps would be necessary to get the Propeller 1 running on the Tang20K (if possible at all)? I'm interested in pursuing this because I'd like to explore options for increasing the Propeller 1's processing power, such as expanding hub RAM or adding more cogs.

jmg · 2024-04-23 21:42

parts look ok, but google finds this

https://www.reddit.com/r/GowinFPGA/comments/zvkswc/tang_nano_20k_announced/

Nice board, with MS5351 clock generator. (unsure how they configure that ?)

and this
https://github.com/juj/gowin_flipflop_drainer
suggests GOWIN FPGAs may not be able to cope with too many clocked registers - node changes ? (too much supply bounce ?)
A MCU design like P1, should have a low change-node count, but it suggests having another platform you know is good would be useful here.

Christof Eb. · 2024-04-30 17:24

@ti85 said:
I came across a newer, low-cost FPGA called the Tang20K. I'm wondering if anyone has experience with it and the feasibility of running P8X32A emulation on this platform. Parallax has provided the Verilog files, but I'm unfamiliar with Verilog or hardware coding in general.

My question is: what steps would be necessary to get the Propeller 1 running on the Tang20K (if possible at all)? I'm interested in pursuing this because I'd like to explore options for increasing the Propeller 1's processing power, such as expanding hub RAM or adding more cogs.

I had been looking for low cost FPGA boards for a while now. During covid they seemed to have vanished. It is great, that there are now these possibilities. So I have bought Tang Nano20k now and am trying to learn....

It would be interesting to know the number of lut4 needed for a full P1? Or an estimation?

Christof

Christof Eb. · 2024-05-16 07:27

Ah, found it: 9500 LUT6 for P8X32A: https://github.com/jimbrake/cpu_soft_cores
So it looks possible with 8 cores and plenty of RAM.

Rayman · 2024-05-16 22:56

Once got P1V working on Xyloni

https://forums.parallax.com/discussion/174191/p1v-on-xyloni-trion-t8-fpga

Main problem seemed to be getting clock to higher freq.

cheezus · 2024-05-25 04:56

I've actually been trying to port to this fpga and currently stuck. I've got it compiling but nothing runs yet. Fmax tops out in the high 70s but there's plenty of room left in the part. Currently compiling the Nth test run as i write this post.

I'm at the stage of wanting to include my own booter for testing but not quite being capable of accomplishing that. The P1 is a beautiful processor but the Verilog is very dense. Still, happy to have what we have. I missed the bus back in the day so trying to catch up with what's already been done (quite a lot !!) I need to get a stable fmax of 64mhz but would love to get that 80 mhz, I just don't see it happening. My last compile utilization

9 gowin_pll/your_instance_name/rpll_inst/CLKOUT.default_gen_clk 40.179(MHz) 291.477(MHz) 6 TOP
10 gowin_pll/your_instance_name/rpll_inst/CLKOUTD.default_gen_clk 20.089(MHz) 77.305(MHz) 16 TOP

Resource                 Usage                                  Utilization
Logic                    15469 (13687 LUT, 1782 ALU) / 20736    75%
Register                 5250 / 15750                           34%
  --Register as Latch    0 / 15750                              0%
  --Register as FF       5250 / 15750                           34%
BSRAM                    40 / 46                                87%

Resource Usage Summary

Resource        Usage
I/O Port        41
I/O Buf         40
  OBUF          8
  IOBUF         32
Register        5250
  DFF           41
  DFFE          3877
  DFFSE         40
  DFFR          73
  DFFRE         1192
  DFFPE         1
  DFFCE         26
LUT             13668
  LUT2          640
  LUT3          5289
  LUT4          7739
ALU             1782
  ALU           1782
INV             19
  INV           19
BSRAM           40
  SP            14
  SDPB          26
CLOCK           2
  OSC           1
  rPLL          1

Tool Version V1.9.9.01 (64-bit)
Part Number GW2AR-LV18QN88C8/I7
Device GW2AR-18

SaucySoliton · 2024-05-27 04:39

@cheezus said:
I've actually been trying to port to this fpga and currently stuck. I've got it compiling but nothing runs yet. Fmax tops out in the high 70s but there's plenty of room left in the part. Currently compiling the Nth test run as i write this post.

I'm at the stage of wanting to include my own booter for testing but not quite being capable of accomplishing that. The P1 is a beautiful processor but the Verilog is very dense. Still, happy to have what we have. I missed the bus back in the day so trying to catch up with what's already been done (quite a lot !!) I need to get a stable fmax of 64mhz but would love to get that 80 mhz, I just don't see it happening. My last compile utilization

Sounds like what happened when I tried to port P1v to IceStorm. I have a hunch that it is a sign extension issue. That was a problem with the P2 rev A silicon. The same Verilog works fine in Verilator. Go figure. The P1v was originally SystemVerilog, but I had to use a version ported to Verilog 2001 for IceStorm. That was in 2017, so it could be fixed now.

I did some work to pre-load the program into the RAM blocks. That could eliminate problems related to the IO pins and serial port.
https://github.com/SaucySoliton/P1V/commit/44e26fb5efbf85c70bb875ccb3eac1e802a68a22 Also it just makes sense to have the program pre-loaded after the FPGA bitstream is loaded.

cheezus · 2024-05-31 03:14

I think I have the Verilog working on the Tang Nano 20K. It needs a serious code cleanup, and then I want to backtrack and see if I can get the first attempt working. I was able to identify it in Propeller Tool and load / run the cogtest. I'd need to look at some better tests to verify things are actually working. For the 4 cog build I was able to compile @ 78ish, now waiting to see what the 8 cog compiles to. The built-in LA is really helpful, the Tang Nano 20K is a pretty cool board. If you were willing to go to a 4 cog / no font ROM build you might be able to fit in the Tang Nano 9K. Pretty cool development package and finally something within my budget. I've been wanting to play with the P1V since its release but couldn't justify it at the time.

ti85 · 2024-06-04 12:18

@cheezus

That is awesome...are you planning on releasing your finished files on github or on the forums here? I just ordered my tang nano 20K

cheezus · 2024-06-09 04:25

I'm more than willing to post what i've got here! This port is based on Rayman's release, great source as always... It's pretty straightforward, although it's still very much a work in progress. I have the PLL making a 58mhz PLLx16 clock from the onboard OSC. I would love to leave it at 80mhz and try to see what breaks but without all my test equipment it would be difficult. I did comment all the portB stuffs out. The compiler is real dumb and didn't want to compile with it iirc.

Most of my compilations remove the video generator since I won't be using it. I have wondered what it would take to get the HDMI port working using the video generator? I'm sure someone much smarter than I could make short work of that. With the video in, it takes forever to compile and the video paths are the first ones to fail. It also really beats down the fmax, can't seem to break the 60s with videogen and 8 cogs.

What would be really nice is to get the source for the BL616 chip (the usb interface) and use it to program the P1V but without access to dtr/rts I'm kinda at a loss... I have an M0S board that i was playing with but couldn't quite figure it out. I only spent a few days and i may revisit at some point.

I HAVE decided I really do not like the Gowin tools, the compiler output is very basic and error messages can sometimes be non-existent. I guess that's why these parts are dirt cheap and others are not. There's probably all kinds of edge cases where this thing breaks but, hey, isn't that part of the fun?

Cluso99 · 2025-06-07 02:00

I've just dug out my last P1V code (from 2017/10/19) and taken a look thru it. I had been trying to fix the code to remove all warnings.

Anyway, what/where is the latest reliable P1V code?
BTW I am going to use my BeMicroCV-A9.

Andrey Demenev · 2025-06-12 16:19

I have been playing with the Nano 20K recently, and I had the idea that it would be nice to use the Propeller as an embedded processor. So I went to the forums to tell about it, and of course it had already been implemented.

Here are some of my ideas :

instructions can be made to execute in 2 cycles instead of 4. In cycle 0, fetch the instruction and write the result of the previous one. In cycle 1, read S and D. This would require dual port RAM. Did not give this much thought yet
same goes for the hub access window: the window can probably be made to come around more often
video generator probably becomes useless, since having 2 PLLs per cog is unaffordable luxury, and most likely the freq is too high anyway
unused opcodes could be used to interact with FPGA (read/write from/to registers)
unused opcodes could be used for SDRAM access. Similar to hub, but with different timing
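
The first idea in the list can be laid out as a clock-by-clock timeline. This is only a sketch of the overlap as described in the post (fetch plus write-back of the previous result in cycle 0, S and D reads in cycle 1); the function name and the labels are made up for illustration, and nothing here comes from the P1V source.

```python
def two_cycle_schedule(n_instructions):
    """Return (clock, actions) pairs for the overlapped 2-cycle scheme.

    Hypothetical helper: it only models which cog RAM accesses fall on
    which clock, not what the ALU does.
    """
    timeline = []
    for i in range(n_instructions):
        # Cycle 0 of instruction i: fetch it, and write back the previous result.
        first = [f"fetch I{i}"]
        if i > 0:
            first.append(f"write result I{i - 1}")
        timeline.append((2 * i, first))
        # Cycle 1 of instruction i: read source and destination operands.
        timeline.append((2 * i + 1, [f"read S,D I{i}"]))
    if n_instructions > 0:
        # The last result is written in the clock after the final operand read.
        timeline.append((2 * n_instructions, [f"write result I{n_instructions - 1}"]))
    return timeline


for clock, actions in two_cycle_schedule(3):
    print(clock, ", ".join(actions))
```

In this model every even clock after the first carries both a read (the fetch) and a write (the previous result), which is presumably why the post says dual port RAM would be needed.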

Christof Eb. · 2025-06-12 17:37

Hi,
In my opinion the first question is, what you want to achieve in the end?
To me, the only point of an emulator is being able to run existing software, so you have to stay software-compatible.
If the goal is to have an improved p1, then much bigger Hub Ram would be good. To be good it should run at 80 MHz.
Christof

Andrey Demenev · 2025-06-12 17:48

@"Christof Eb." said:
In my opinion the first question is, what you want to achieve in the end?

The goal is always the same: to have fun.

JonnyMac · 2025-06-12 17:49

The goal is always the same: to have fun.

+1

Cluso99 · 2025-06-12 22:41

The goal is always the same: to have fun.

+1

There have been options to:
Set the number of cogs, which cogs have CtrA, CtrB, Video.
Hub ram 64KB of ram, preloaded with part/all of P1 rom.
And thoughts to:
Dual port cog ram for 2 clock instructions
Reduce hub cycles from 2 clocks per cog (16 clocks per rotation with 8 cogs) to 1 clock per cog.
Add hubexec mode - possibly requires a new JMP/CALL instruction.
Faster clocking, to see what the BeMicroCV-A9 can do.
And a few others.
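
For the hub cycle item above, the arithmetic is simple enough to sketch. The 8 cogs and the 2-versus-1 clocks per cog come from the list; the 80 MHz clock is just the target mentioned elsewhere in the thread, and the function name is made up for illustration.

```python
def hub_rotation(cogs, clocks_per_slot, clock_mhz):
    """Return (clocks per full hub rotation, rotation time in microseconds)."""
    clocks = cogs * clocks_per_slot
    return clocks, clocks / clock_mhz


# Stock arrangement: 2 clocks per cog, 8 cogs.
print(hub_rotation(8, 2, 80.0))
# Proposed arrangement: 1 clock per cog.
print(hub_rotation(8, 1, 80.0))
```

Halving the slot width halves the time a cog can spend waiting for its next hub window, provided hub RAM can actually be accessed every clock.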

I have been playing with the P1V code over the past few days. Nothing running yet, it only compiles. Looking at the available IP (free stuff with Quartus Lite v24.1).

BTW There was a P2HOT which had the dual port cog RAM for a 2-clock cycle (circa 2013?). It would have been a beast at the time!

Christof Eb. · 2025-06-13 05:47

@"Andrey Demenev" said:

@"Christof Eb." said:
In my opinion the first question is, what you want to achieve in the end?

The goal is always the same: to have fun.

Yeah!

Ideas
If you had a very big Hub Ram, then you could perhaps make pages of it private to one or two cogs for faster access.
Also you could add memory mapped peripherals in Hub space.
Some registers could be hidden hardware stacks. If you mov from them, it's a pull. If you mov to them, it's a push.
Multiply with add, saturated!
😀 Christof

Rayman · 2025-06-13 14:43

Just FYI... Think had Efinix FPGA doing P1 recently...

Couple years ago, it didn't work right, was stuck at 20 MHz.
https://forums.parallax.com/discussion/174191/p1v-on-xyloni-trion-t8-fpga

But, a few months ago, tried again and new software made it work.

Andrey Demenev · 2025-06-13 14:48

Turns out that dual port RAM idea is not very practical when applied to the GW2 chip.

Dual port RAM with 32 bits data width will waste half of the address space, because block RAM in true dual port mode supports data widths only up to 16 bits. That would consume 16 BSRAM blocks for 8 cores, leaving only 30 for the hub RAM. That makes 60 KiB of hub RAM. The remaining 4 KiB can be made using shadow RAM. So this consumes all available block RAM and most of the shadow RAM, plus over 2000 MUX.

With 32 bit data bus, block RAM can be used in semi dual port mode, where one port is used only for writes, and the other only for reads, without wasting address space. That makes it possible to execute instructions in 3 cycles.
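
The block RAM budget above can be checked with a few lines of arithmetic. The 46 blocks, the 16-bit width limit, the 8 cogs and the 64 KiB hub target are taken from the thread; the 2 KiB of usable data per block is an assumption chosen because it matches the "30 blocks makes 60 KiB" figure in the post.

```python
TOTAL_BSRAM = 46        # blocks on the part, from the utilization report in the thread
COGS = 8
COG_RAM_LONGS = 512     # P1 cog RAM size in 32-bit longs
LONG_BITS = 32
DP_MAX_WIDTH = 16       # widest data port in true dual port mode, per the post
KIB_PER_BLOCK = 2       # assumption: 2 KiB usable per block
HUB_TARGET_KIB = 64

# Two 16-bit-wide blocks side by side are needed to make one 32-bit cog RAM.
blocks_per_cog = LONG_BITS // DP_MAX_WIDTH
cog_blocks = blocks_per_cog * COGS

# Each block is 1024 entries deep at 16 bits, but a cog only has 512 longs.
block_depth = KIB_PER_BLOCK * 1024 * 8 // DP_MAX_WIDTH
depth_used = COG_RAM_LONGS / block_depth

hub_blocks = TOTAL_BSRAM - cog_blocks
hub_kib = hub_blocks * KIB_PER_BLOCK
shortfall_kib = HUB_TARGET_KIB - hub_kib

print(f"cog RAM: {cog_blocks} blocks, {depth_used:.0%} of each block's depth used")
print(f"hub RAM: {hub_blocks} blocks = {hub_kib} KiB, {shortfall_kib} KiB short of {HUB_TARGET_KIB} KiB")
```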

Christof Eb. · 2025-06-13 15:51

What about SDRAM for HUB? There are 64Mbit, as far as I understand. I have not yet used them.

Andrey Demenev · 2025-06-13 16:40

@"Christof Eb." said:
What about SDRAM for HUB? There are 64Mbit, as far as I understand. I have not yet used them.

Probably that could work. But SDRAM is only fast when doing large sequential transfers. With random access, it is too slow. With generated SDRAM controller, single read or write would take at least 11 RAM clock cycles. Maximum SDRAM clock is 166.(6) MHz. So a bit over 15 MHz maximum Propeller clock if HUB access window is every cycle, or just a bit over 30 MHz if using "standard" 2 cycles access window. Maybe a custom SDRAM controller could squeeze another SDRAM clock cycle, but that's it. And all these calculations do not account for periodic refresh, so instead of 11 that will be 20 RAM cycles per access. It does not have to be that much slower, but the access time predictability requirement demands that refresh is done strictly periodically. If we ditch 1 cog and use its access window for refresh, it is still just 30 MHz clock frequency. Also, SDRAM will be used exclusively for hub. There is no way to use it for other purposes.
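
The clock limits quoted above follow from dividing the SDRAM clock by the cycles needed per access. A rough sketch of that calculation, using only the figures from the post (166.(6) MHz SDRAM clock, 11 cycles per access, 20 with refresh allowed for); the function name is made up, and real controller timing may differ.

```python
SDRAM_CLOCK_MHZ = 500 / 3   # 166.(6) MHz, the maximum quoted in the post


def max_propeller_clock_mhz(sdram_cycles_per_access, prop_clocks_per_window):
    """Highest Propeller clock at which one SDRAM access fits in one hub window."""
    accesses_per_us = SDRAM_CLOCK_MHZ / sdram_cycles_per_access
    return accesses_per_us * prop_clocks_per_window


cases = [
    ("hub window every clock, 11 cycles", 11, 1),
    ("standard 2-clock window, 11 cycles", 11, 2),
    ("custom controller saving one cycle", 10, 2),
    ("2-clock window with refresh, 20 cycles", 20, 2),
]
for label, cycles, window in cases:
    print(f"{label}: {max_propeller_clock_mhz(cycles, window):.2f} MHz")
```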
