P1V with 2MB of hub visible RAM and now 32MB of SDRAM
rogloh
Hi all,
I was wondering if anyone has done anything like this before (perhaps I'm the first to try going down this path)...
I have now built myself a P1V setup with 2MB of hub SRAM on a MAX10 FPGA. I have modified the P1V Verilog so this extra RAM can be accessed directly via the hub, from both SPIN and COG PASM, at full hub speed from each COG. It appears at $10000-$1FFFFF and works perfectly well with RDLONG/WRLONG when reading/writing in this address range.
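To illustrate what that direct access looks like from C: a minimal sketch, assuming PropGCC on this setup, where an ordinary pointer dereference into the window compiles down to RDLONG/WRLONG (the name EXT_HUB_BASE and the test value are just mine):

/* Minimal sketch: touching the extended hub window from PropGCC C.
   No driver is needed - plain loads/stores reach $10000-$1FFFFF. */
#include <stdint.h>

#define EXT_HUB_BASE 0x10000  /* start of the extended hub RAM window */

int test_ext_hub(void) {
    volatile uint32_t *ext = (volatile uint32_t *)EXT_HUB_BASE;
    ext[0] = 0xDEADBEEF;          /* WRLONG into extended hub RAM */
    return ext[0] == 0xDEADBEEF;  /* RDLONG back to verify        */
}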
I am now trying to get PropGCC code to run on it using the LMM model, even though I have more than 32kB of hub RAM. I don't intend to use any cache driver with the XMM model, because of the unnecessary extra penalty involved. I've downloaded and installed PropGCC on Ubuntu and got the basic fibo demo loaded and working on my setup, but that only uses the lower 32kB by design. I'm also still trying to learn more about PropGCC.
So I am guessing I have to do something like this...
1) modify a linker script somewhere to make the LMM model use this extra hub RAM (but where? I only see XMM linker scripts)
2) use an SD card to hold my PropGCC programs, and somehow get the files onto this SD card, perhaps using the propeller-load tool
3) figure out a way to boot my large C LMM programs and run them. I'm imagining I need to develop some sort of initialization boot code that gets loaded into the low 32kB hub RAM when the Prop boots and then loads more code into higher hub RAM from the SD card, like some basic OS would do (a rough sketch follows this list), but there might be alternatives, or PropGCC may be able to do some of this work for me... TBD.
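For step 3, here is a rough, purely hypothetical sketch of the kind of second-stage boot stub I have in mind (sd_read is a made-up placeholder, not a real PropGCC or loader API, and the entry convention is glossed over):

/* Hypothetical second-stage boot stub: it lives in the lower 32kB,
   pulls the large LMM image from SD into extended hub RAM, then
   jumps to the loaded image. sd_read() is a placeholder for
   whatever SD block driver ends up being used. */
#include <stdint.h>

#define EXT_HUB_BASE 0x10000  /* start of the extended hub RAM window */

extern int sd_read(uint32_t offset, void *dst, uint32_t len);  /* placeholder */

void boot_second_stage(uint32_t image_len) {
    sd_read(0, (void *)EXT_HUB_BASE, image_len);    /* copy image up high */
    void (*entry)(void) = (void (*)(void))EXT_HUB_BASE;
    entry();                                        /* enter loaded image */
}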
Has anyone done anything similar or used PropGCC like this, and have ideas of what else I might need to do, or see other issues here? Are there lots of 32kB dependencies in PropGCC for LMM code generation?
Cheers,
Roger.
Comments
I cannot help you, but I'm sure Eric or David could chime in to give you some input on PropGCC modifications. This is a really interesting project you are working on! I hope you can finish it. Please keep us updated, because I could use your work too. :-)
Regards,
Christian
After mulling it over for a bit, I am thinking there are a few different approaches...
1) Compile code with the LMM model but hack the linker script to move the output code/data up into my high hub memory address space above 64kB, then develop some loader that would load the code/data into higher memory, somewhat like booting XMM code does. The loader tools need work here, as they don't expect LMM ELF code to be loaded like this, only XMM kernel variants.
2) Use XMMC mode and hack the linker script to also put data into higher hub memory, but then actually run the executable code with the LMM VM kernel, not the XMMC VM kernel (the XMMC code should be compatible, from what I can tell so far). Develop a simple cache interface driver that is only used by the loader code to write into high memory at load time (see the sketch after this list), and then don't use it when running the actual app, because the generated code will not need it.
3) A slightly simpler approach than 2) might be to use XMMC mode as-is and live with data/stack in the lower 32kB hub RAM. This is a little bit limiting for what I want to do, but it may get me going faster, as I won't need to copy data into high memory (just the code), and it appears I can leverage more of what is already in place in the loader. I could probably use this as a stepping stone to get to 2).
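Since the "external" memory in approach 2 is really just more hub RAM, the load-time write helper could be almost trivial. A minimal sketch in C (illustrative only; this is not the real PropGCC cache/XMEM interface, and the function name is mine):

/* Load-time-only block write into the extended hub window. On this
   P1V the "external" memory is plain hub RAM, so a straight copy
   through a pointer is all the helper has to do. */
#include <stdint.h>
#include <string.h>

static void xmem_write_block(uint32_t ext_addr, const void *src, uint32_t len) {
    memcpy((void *)ext_addr, src, len);  /* hub-to-hub copy */
}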
No matter what, I'm convinced I will need to modify parts of the loader tools or write my own from scratch, but at least the source is available, so I think it is probably doable given sufficient time. There is a 2MB SPI flash on my MAX10 board, so I might be able to get my large program on/off that (ozpropdev has already got something going there), or, if not, at least from an external SD card, which I ultimately want to add.
Roger.
I would expect very few; modern compilers use 32-bit ints for most variables.
Your issue might be resolved simply by using a warning instead of a hard ceiling and error.
A little more work could make the code-ceiling warning threshold user-defined, so that it gives useful information on larger, but still finite, systems too.
1) I created a hub_xmem.spin file, using the prescribed interface, that simply reads/writes from hub RAM to hub RAM, so a custom board config file can nominate this driver as its XMEM driver and the propeller-load tool can then use it in its regular way to copy into RAM.
2) I copied one of the XMM linker scripts to the filename used by LMM-compiled programs (i.e. created a propeller.x from propeller_xmm_single.x) and modified its RAM segment addresses to start at 0x10000, which is the start of my extended hub RAM. I also changed the segment name to .xmmkernel instead of .lmmkernel, so the propeller loader detects this and does an external image load instead of an internal image load.
3) I modified the crtbegin.S startup code so that LMM programs also perform segment relocation/clearing at init time, like the XMM programs already do (see the sketch after this list).
4) I modified the serial_helper2.spin code to begin RAM loads at $10000 (it was hardcoded to $20000000).
5) I rebuilt libgcc and the propeller-load tool with my changes.
6) I compiled the fibo LMM demo, also changing its Makefile to force it to use my special propeller.x in case it didn't pick it up by default.
7) I downloaded the ELF file to my MAX10 board using propeller-load -b roger fibo.elf -r -t
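For reference, here is what the step-3 init-time relocation amounts to, modeled in C (the real code is PASM in crtbegin.S, and the linker symbol names below are the conventional ones, assumed rather than copied from the actual scripts):

/* Conceptual model of the added startup work: copy the initialized
   .data image to its run address and zero .bss, using boundary
   symbols the linker script provides. Names are illustrative. */
extern char __data_load_start[], __data_start[], __data_end[];
extern char __bss_start[], __bss_end[];

static void init_segments(void) {
    char *src = __data_load_start;
    for (char *dst = __data_start; dst < __data_end; )
        *dst++ = *src++;                 /* relocate .data into place */
    for (char *p = __bss_start; p < __bss_end; p++)
        *p = 0;                          /* clear .bss */
}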
After all these steps it seems to download, but it doesn't output anything... :frown:
At this point I'm not sure what else would have been required or what was wrong with what I did and I am still trying to track it down.
I know there are lots of software steps involved in getting it all to download and run but I feel tantalizingly close to getting it working.
Perhaps I will try to use a debug kernel or another COG to see what exactly is being put into memory and if anything is actually running.
As $10000 is also the default beginning of my video memory framebuffer, I can at least confirm that something is being copied to this address, on the VGA screen my MAX10 video code renders to.
Yes, I am using the LMM kernel; I had just tried out the XMMC stuff to see if it would work. The loader tools only create an external image if they think they are running XMM (so I kind of fake it with the segment naming). The problem is that the top of memory is $8000 with LMM, and I have a hole after that until $10000 (above where the ROMs sit). I would like code to go into $10000 and be contiguous. I might be able to play more with the linker scripts to achieve this, but I am still learning how it all goes together (not an expert).
Edit: Never mind. I'm pretty sure that was a change to the CMM kernel, not the LMM kernel.
The harder part, as I guess you've surmised, is getting the loader to download larger images. I'm not sure where to look to modify that. Probably the loader built into the P1V ROM image needs to be modified to understand the larger memory. I don't know if propeller-load needs changing or not.
Yeah, I know exactly what you mean here, and that is pretty much what I've been trying. A major problem is that the propeller-load tool doesn't expect/like loading LMM images this way, and I've been trying to figure out the remaining problems. There are still some requirements on being able to load portions into low memory (e.g. when starting COGs), so some things still need to go into low hub memory. There is also this .boot section which might be getting in the way and affecting the entry point; LMM generates it but XMM does not. There is some relocation stuff in crtbegin.S that runs for XMM but not LMM, which I require. So I am finding and patching these things one by one. Eventually I'm gonna make it work, and I know it's fundamentally possible. I'm feeling more confident today to really beat this thing into submission. It's not gonna break me down! Dammit.
A few issues found...
1) When running LMM in extended hub memory, you need to have USE_START_KEREXT defined in the virtual machine so it won't attempt to load the init kernel extension from as-yet-unloaded hub memory. This was not being defined, so I was loading in garbage at bootup/initialization time. Fixed that.
2) I am seeing something weird in the _memcpy library code, where a djnz instruction is followed by an address in non-fcached code. I didn't think that was allowed by design, but GCC has generated this particular code. The address is thankfully being treated as a NOP when the DJNZ exits, but that purely depends on the address value of where memcpy is located; I think it was a bit of a fluke that this didn't hit me. Perhaps for XMM code at the standard linker addresses of $20000000 or $30000000 there is usually a cleared NR result bit, so this address data doesn't do anything when interpreted as an opcode. But it is still a risk, especially in my code, which starts at $00010000 - addresses in that range can decode to WRBYTE, which will write random data into my system (see the decode sketch after this list). I wonder if this "feature" was an oversight or by design???
3) I just found something else that is a big problem (hopefully the last one). It seems the fcache loader code is using a special, diabolical feature where the upper address bits are used to indicate when to finish the loop. It's from something Kuroneko did a while back to optimize hub loading, and it was used by Eric in the GCC code. I think this is causing a problem for me because it generates addresses outside of my decoded 2MB range of $10000-$1FFFFF. I don't decode with SRAM address foldover, because I am worried SPIN code that relies on foldover will no longer work, so I have an absolute address window only (see the second sketch after this list). I might have to patch the fcache loader to fix this.
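On the NOP luck in issue 2: I'm fairly sure another mechanism at work is the P1 condition field. Any long executed as an instruction has its bits 21:18 treated as the condition, and 0000 means if_never, i.e. a NOP. A tiny sketch that checks that field for a few sample address values (the sample addresses come from the posts above; the code itself is mine, for illustration):

/* Why some address longs are harmless when a DJNZ falls through onto
   them: P1 treats bits 21:18 of a long as the condition field, and
   0b0000 ("never") turns the whole long into a NOP. */
#include <stdio.h>
#include <stdint.h>

static unsigned cond_field(uint32_t insn) { return (insn >> 18) & 0xF; }

int main(void) {
    uint32_t samples[] = { 0x20000000, 0x30000000, 0x00010000, 0x00040000 };
    for (int i = 0; i < 4; i++)
        printf("$%08X -> cond %X (%s)\n",
               (unsigned)samples[i], cond_field(samples[i]),
               cond_field(samples[i]) ? "may execute!" : "NOP");
    return 0;
}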
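And on issue 3, here is a C model of what my SRAM address decode does, in both forms (the window and mask come from the $10000-$1FFFFF range above; the real logic is of course Verilog):

/* Model of the hub address decode for the extended SRAM window.
   Absolute decode: anything outside $10000-$1FFFFF is simply not
   decoded. Foldover decode: upper address bits wrap back into the
   2MB space, so the fcache loader's bit-twiddled addresses still
   land inside the RAM. */
#include <stdint.h>

static int in_window_absolute(uint32_t addr) {
    return addr >= 0x10000 && addr <= 0x1FFFFF;
}

static uint32_t fold_address(uint32_t addr) {
    return addr & 0x1FFFFF;  /* keep only the low 21 address bits */
}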
It's a slow process, as unfortunately I am not even able to use the debugger to track what is going on at bootup time in the VM; it doesn't appear the code gets to the point where it can read in the debugger extension and execute another VM from hub RAM, because the hub RAM has not been copied into yet. Also, what is really frustrating me is that I can't easily find a way to rebuild things with just my own debug changes to the run time code crtbegin.S, without resorting to remaking the entire thing, which takes ages, trust me. I happened to do a "make clean loader" instead of a "make clean-loader" by mistake at midnight - DOH! - and then waited about an hour or two for the damn thing to remake everything. So I know how slow it is, and I learned the hard way.
I'm very happy now, as I just fixed issue 3) above by adding hub memory address foldover support to my SRAM Verilog code, and I now get the following output when running PropGCC's LMM fibo demo in extended hub memory on my MAX10 P1V setup...
Awesome!
hello, world!
fibo(00) = 000000 (00000ms) (1024 ticks)
fibo(01) = 000001 (00000ms) (1024 ticks)
fibo(02) = 000001 (00000ms) (1408 ticks)
fibo(03) = 000002 (00000ms) (1792 ticks)
fibo(04) = 000003 (00000ms) (2544 ticks)
fibo(05) = 000005 (00000ms) (3680 ticks)
fibo(06) = 000008 (00000ms) (5568 ticks)
fibo(07) = 000013 (00000ms) (8592 ticks)
fibo(08) = 000021 (00000ms) (13504 ticks)
fibo(09) = 000034 (00000ms) (21440 ticks)
fibo(10) = 000055 (00000ms) (34288 ticks)
fibo(11) = 000089 (00000ms) (55072 ticks)
fibo(12) = 000144 (00001ms) (88704 ticks)
fibo(13) = 000233 (00001ms) (143120 ticks)
fibo(14) = 000377 (00002ms) (231168 ticks)
fibo(15) = 000610 (00004ms) (373632 ticks)
fibo(16) = 000987 (00007ms) (604144 ticks)
fibo(17) = 001597 (00012ms) (977120 ticks)
fibo(18) = 002584 (00019ms) (1580608 ticks)
fibo(19) = 004181 (00031ms) (2557072 ticks)
fibo(20) = 006765 (00051ms) (4137024 ticks)
fibo(21) = 010946 (00083ms) (6693440 ticks)
fibo(22) = 017711 (00135ms) (10829808 ticks)
fibo(23) = 028657 (00219ms) (17522592 ticks)
fibo(24) = 046368 (00354ms) (28351744 ticks)
fibo(25) = 075025 (00573ms) (45873680 ticks)
fibo(26) = 121393 (00927ms) (74224768 ticks)
That change will automatically boost the tightest LMM loop performance by 9/8, i.e. 12.5%: the main execution loop will now hit every single hub slot (5 MIPS max @ 80MHz) instead of 8 slots in every 9. It also makes the execution deterministic, so running the same LMM code will now take the same time, which will work well for LMM driver code timings.
It should also boost the multiple-register stack push and pop sequences, because registers can then be pushed and popped on every hub cycle instead of every second hub cycle (a 100% performance boost). That should help any function calls that have to save/restore lots of registers.
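A quick back-of-envelope check of those figures (a sketch; it assumes the standard P1 hub arrangement where each COG gets one hub slot every 16 system clocks):

/* Verify the quoted numbers: 5 MIPS max at 80MHz when every hub slot
   is hit, versus 8 instructions per 9 slots before - a 12.5% gain. */
#include <stdio.h>

int main(void) {
    double clk    = 80e6;
    double slots  = clk / 16.0;           /* hub slots/sec per COG: 5M  */
    double before = slots * 8.0 / 9.0;    /* 8 instructions per 9 slots */
    double after  = slots;                /* one instruction per slot   */
    printf("before: %.2f MIPS, after: %.2f MIPS, gain: %.1f%%\n",
           before / 1e6, after / 1e6, (after / before - 1.0) * 100.0);
    return 0;
}

This prints before: 4.44 MIPS, after: 5.00 MIPS, gain: 12.5%, matching the 9/8 figure above.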
Roger.
This is really pretty big: 2MB, perhaps soon to be 8MB, of hub RAM that can be accessed easily from any cog.
What clock rate are you running at? The fibo results would indicate 80 MHz but is that accurate?
When I get the SDRAM going, the additional 8MB will likely be limited to one COG only, but that is fine by me for doing a big main program. The other COGs can still do plenty of stuff in the lower 2MB. I also want to hide the SDRAM refresh cycle so it will still be hub-synchronous and deterministic for the COG that gets access to it.
This is AWESOME!! Propforth, and possibly pfth, could make use of a 2MB (potentially 8MB) hub RAM for dictionary space. Not that anyone cares, but it makes for an interesting application of this P1V variant!
It certainly removes the memory barriers for good-sized LMM applications.
Bravo!!!!
It looks like he is using a BeMicro MAX 10, but I'm not sure where he is getting the 2MB SRAM from. The MAX 10 board has 8MB of SDRAM, which Roger mentions a couple of posts back: "When I get the SDRAM going, the additional 8MB will likely be limited to one COG only...."
We need more details so we can play along at home!
It's the ultimate BIG HUB Propeller!!
It would be very interesting when the 10M50 parts come out, allowing for 8 cogs.
http://forums.parallax.com/discussion/161655/sram-expansion-board-with-svga-15bpp-color-graphics-and-text-on-p1v#latest
Not merely larger Forth dictionaries: 2MB of hub RAM allows for Propeller-driven 3D printers. They need that extra code space.
And if one doesn't care for 3D printers, there is a whole world of CNC milling and lathe work where the Propeller 1 has had only limited horizons... until now.