P1V with 2MB of hub visible RAM and now 32MB of SDRAM

rogloh · 2015-07-11 07:33

Hi all,
was wondering if anyone has done anything like this before (perhaps I'm the first to try to go down this path) -

I have now built myself a P1V setup with 2MB of hub SRAM on a MAX10 FPGA. I have made modifications to the P1V Verilog and can access this extra RAM directly via the hub from both SPIN and COG PASM at full hub speed from each COG. It appears from $10000-$1FFFFF and works perfectly well with RDLONG/WRLONG when reading/writing in this address range.

I am now trying to get PropGCC code to run on it using the LMM model even though I have more than 32kB of HUB RAM. I don't intend to use any cache driver with the XMM model because of the additional unnecessary penalty involved. I've downloaded/installed PropGCC on Ubuntu and got the basic fibo demo loaded and working on my setup but that only uses the lower 32kB by design. I'm also still trying to learn more about PropGCC.

So I am guessing I have to do something like this...

1) modify a linker script somewhere to make the LMM mode make use of this external hub RAM (but where? I only see XMM linker scripts)
2) use an SD card to hold my PropGCC programs and somehow get the files onto this SD card, using propeller-load tool perhaps.
3) figure out a way to boot my large C LMM programs and run them - I'm imagining I need to develop some sort of initialization boot code that can be loaded in the low 32kB hub RAM when the Prop boots and then have a way to load more code into higher hub RAM from the SD card like some basic OS would do but there might be alternatives or PropGCC may be able to do some of this work for me... TBD.

Has anyone done anything similar or used PropGCC like this and have ideas of what else I might need to do, or sees other issues here? Are there lots of 32kB dependencies in PropGCC for LMM code generation?

Cheers,
Roger.

trancefreak · 2015-07-13 07:11

Hi Roger,
I cannot help you but I'm sure Eric or David could chime in to give you some input on PropGCC modifications. This is a really interesting project you are working on! I hope you can finish it. Please keep us updated because I could use your work, too. :-)

Regards,
Christian

rogloh · 2015-07-13 08:14

Thanks trancefreak, yeah I'm hoping one of those guys might see this and respond or give some useful tips or guidance. I spent most of yesterday wading through it all and trying to figure out how to achieve this. It will be really good if I can get the outcome I want and use my external memory in this way. I could then extend this from my 2MB SRAM and try to use onboard 8MB SDRAM too (hopefully coming next once I integrate my SDRAM controller to the hub Verilog...TBD).

After mulling it over for a bit I am thinking there's a few different approaches....

1) compile code with the LMM model but hack the linker script to move the output code/data up to my high hub memory address space above 64kB, then develop some loader that would load the code/data into higher memory somewhat like booting XMM code does. The loader tools need work as they don't expect LMM ELF code to be loaded like this, only XMM kernel variants.

2) Use XMMC mode and hack the linker script to also put data into higher hub memory but then actually run the executable code with the LMM VM kernel, not the XMMC VM kernel (the XMMC code should be compatible from what I can tell so far). Develop a simple cache interface loader that is only used by the loader code to write into high memory at load time and then don't use it when running the actual App because generated code will not need it.

3) A slightly simpler approach to 2) might be to use XMMC mode as is and live with data/stack in the lower 32kB hub RAM. This is a little bit limiting for what I want to do but may get me going faster as I won't need to copy data into high memory (just the code) and it would appear I can leverage more of what is in place so far in the loader. I could probably use this as a stepping stone to get to 2).

No matter what, I'm convinced I will need to modify parts of the loader tools or write my own from scratch but at least the source is available so I think it is probably doable given sufficient time. There is a 2MB SPI flash on my MAX10 board so I might be able to get my large program on/off that (ozpropdev has already got something going there), or if not, at least from an external SD which I ultimately want to add.

Roger.

jmg · 2015-07-13 10:21

rogloh wrote: »

Are there lots of 32kB dependencies in PropGCC for LMM code generation?

I would expect very few; modern compilers use 32-bit ints for most variables.

Your issue may be able to be simply resolved by using a warning instead of a hard ceiling and error.
A little more work could make the code-ceiling warning threshold user-defined, so that means it gives useful information on larger, but still finite systems too.

David Betz · 2015-07-13 11:06

CMM is definitely limited to 16 bits of addressing. That was done to cram as much code as possible into hub memory. I am not aware of anything that limits LMM to 16 bit addressing. You would want to use LMM anyway because it gives significantly better performance.

rogloh · 2015-07-13 13:11

Thanks David. I am hoping to get it to work with LMM in some way or another. Having lots of new extended hub RAM and then using a cache to access (ie. via XMM) would be an unnecessary burden and impact performance.

rogloh · 2015-07-14 03:37

An update to what I tried today...

1) I created a hub_xmem.spin file using the prescribed interface that simply reads/writes from hub RAM to hub RAM so a custom board config file can nominate this driver as its XMEM driver and the propeller-load tool can then use it in its regular way to copy into RAM.

2) I copied one of the XMM linker scripts to the filename used by LMM compiled programs (i.e. created a propeller.x from propeller_xmm_single.x) and modified its RAM segment addresses to start at 0x10000 which is the start of my extended hub RAM. I also changed the segment name to .xmmkernel instead of .lmmkernel so the propeller loader detects this and does an external image load instead of an internal image load.

3) I modified the crtbegin.S startup code so that LMM programs also perform segment relocation/clearing at init time like the XMM programs already do.

4) I modified the serial_helper2.spin code to begin RAM loads at $10000 (it was hardcoded to $20000000).

5) I rebuilt libgcc and the propeller-load tool with my changes.

6) I compiled the fibo LMM demo, also changing its Makefile to force it to use my special propeller.x in case it didn't want to use that by default.

7) I downloaded the elf file to my MAX10 board using propeller-load -b roger fibo.elf -r -t

After all these steps it seems to download but not output anything... :frown:

At this point I'm not sure what else would have been required or what was wrong with what I did and I am still trying to track it down.

I know there are lots of software steps involved in getting it all to download and run but I feel tantalizingly close to getting it working.

Perhaps I will try to use a debug kernel or another COG to see what exactly is being put into memory and if anything is actually running.

David Betz · 2015-07-14 03:49

I don't understand why you're looking at XMM. You should just be able to modify things a bit to get LMM to generate more hub code. Did you try just adjusting the top memory address in the linker script?

rogloh · 2015-07-14 04:14

Some partial success! I recompiled fibo with the XMMC model using my same modified linker script and I now get a "hello world" being output to the terminal before it freezes up and stops displaying the remaining output. So something is possible to run - this is proving my code is being loaded and it can also access data from the hub memory. I think I can work with this.

As $10000 is also the default beginning of my video memory framebuffer, I can confirm something being copied to this address on the VGA screen being rendered to by my MAX10 video code.

rogloh · 2015-07-14 05:25

@David
Yes, I am using the LMM kernel, I had just tried out the XMMC stuff to see if it would work. The loader tools only create an external image if they think they are running XMM (so I kinda fake it with segment naming). The problem is that top of memory is $8000 with LMM and I have a hole after that until $10000 (above where the ROMs sit). I would like code to go into $10000 and be contiguous. I might be able to play more with the linker scripts to achieve this, but I am still learning how it all goes together (not an expert).

David Betz · 2015-07-14 09:44

The linker scripts are a bit opaque. If you find any changes that might help you with your efforts let me know. We can certainly add stuff that is backward compatible with what we have.

rogloh · 2015-07-14 10:24

Yeah linker scripts are painful. I've been mucking about with them all day with no true success so far. I thought it was going to be easier to follow. I get the general idea of how they are meant to work but there are specifics buried in there with how the loader tool does things and different virtual/physical address range translation and relocations etc. After several hours digging through the code I'm going a bit crazy trying to figure it all out and will have to give up on it for a little bit to recover I think. It's a fairly complex loading sequence that happens from start to finish for XMM. LMM seems easier but the tools don't expect to write to external memory for these internal images (for achieving that you need to pretend it contains an XMM kernel), and they are intended to be loaded/run directly as sub 32kB binary propeller images.

David Betz · 2015-07-14 10:30

We had LMM working for the old P2 design and that had more than 32k of hub RAM. I don't recall it being that hard to do. I'll have to look back to see how we did it although most of that P2 code has been removed from PropGCC by now.

David Betz · 2015-07-14 10:59

Now that I think about it a little, it seems to me that I had to modify the GCC code generator to use 32 bit branch offsets instead of just 16 like are used for the P1. That change might need to be made to target a P1v with more than 64k of hub RAM.
Edit: Never mind. I'm pretty sure that was a change to the CMM kernel not the LMM kernel.

ersmith · 2015-07-14 17:50

As far as gcc and the linker are concerned, I think the only change you should need is to modify the definition of "hub" in the MEMORY sections of propeller.x (that's the linker script for LMM; that's located in /opt/parallax/propeller-elf/lib/ldscripts/propeller.x for Linux, some similar location for Windows). Right now hub has ORIGIN=0 LENGTH=32K, you'll need to change it to ORIGIN=0x10000 LENGTH=2048K or something like that. That would have the drawback that the first 32K of RAM wouldn't be used, but if you have 2M that's not too big a loss.
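A quick check of the numbers for that MEMORY change, as a minimal sketch. It assumes the extended window is exactly the $10000-$1FFFFF range rogloh described; the helper names are made up for illustration. Under that assumption a LENGTH of 2048K starting at 0x10000 would run past the top of the window, so something closer to 1984K would be the fit.

```python
# Sketch: size of the extended hub window described in this thread.
# Assumption: the extended RAM is decoded from $10000 through $1FFFFF inclusive.
EXT_START = 0x10000
EXT_END = 0x1FFFFF  # last valid byte address


def window_length(start, end):
    """Number of bytes in the inclusive address range [start, end]."""
    return end - start + 1


def fits(origin, length, end=EXT_END):
    """True if a region of `length` bytes at `origin` stays inside the window."""
    return origin + length - 1 <= end


length = window_length(EXT_START, EXT_END)
print(hex(length), length // 1024, "KB")
print("2048K at 0x10000 fits:", fits(EXT_START, 2048 * 1024))
print("1984K at 0x10000 fits:", fits(EXT_START, length))
```

The same check applies to any other ORIGIN/LENGTH pair tried in the linker script.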

The harder part, as I guess you've surmised, is getting the loader to download larger images. I'm not sure where to look to modify that. Probably the loader built into the P1V ROM image needs to be modified to understand the larger memory. I don't know if propeller-load needs changing or not.

rogloh · 2015-07-15 04:20

@ersmith,
Yeah I know exactly what you mean here and that is pretty much what I've been trying. A major problem is that the propeller-load tool doesn't expect/like loading LMM images this way and I've been trying to figure out the remaining problems. There are still some requirements on being able to load some portions into low memory (e.g. when running COGs). So some things still need to go into low hub memory. There is also this .boot section which might be getting in the way and affecting the entry point; LMM generates it but XMM does not. There is some relocation stuff in crtbegin.S that runs for XMM but not LMM that I require. So I am finding/patching these things one by one. Eventually I'm gonna make it work and I know it's fundamentally possible. I'm feeling more confident today to really beat this thing into submission. It's not gonna break me down! Dammit.

rogloh · 2015-07-15 05:46

I just found the address order is very important in the loader tool software. You need to have the RAM data segments located below the code segments or it won't work properly when it counts through all its segments before downloading. This fact very well might have been messing me up too as at one point I was putting code first then data starting at $10000.

rogloh · 2015-07-16 08:18

So this thing has been a bit of a nightmare tracking it down but I might have just figured something out...

Few issues found..
1) When running LMM in extended hub memory you need to have USE_START_KEREXT defined in the virtual machine so it won't attempt to load the init kernel extension from as yet unloaded hub memory. This was not being defined so I was loading in garbage at bootup/initialization time. Fixed that.

2) I am seeing something weird in the _memcpy library code where a djnz instruction is followed by an address in non-fcached code. I didn't think that was allowed by design but GCC has generated this particular code. The address is thankfully being treated as a NOP when the DJNZ exits but it purely depends on the address value of where memcpy is located. I think it was a bit of a fluke that this didn't hit me. Perhaps for XMM code at standard linker addresses of $20000000 or $30000000 respectively there is usually a NR result bit cleared and this address data doesn't do anything when interpreted as an opcode. But it is still a risk, especially in my code which starts at $00010000 - that can decode to WRBYTE which will write random data to my system. I wonder if this "feature" was an oversight or by design???

3) I just found something else that is a big problem (hopefully the last one). It seems the fcache loader code is using a special diabolical feature where upper address bits are used to indicate when to finish the loop. It's from something Kuroneko did a while back to optimize hub loading and was used by Eric in the GCC code. I think this is causing a problem for me because it is generating addresses outside of my decoded 2MB range from $10000 - $1FFFFF. I don't decode with SRAM address foldover because I am worried spin code that relies on it will not work any more so I have an absolute address window only. I might have to patch the fcache loader to fix this.
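To make issue 2) concrete, here is a small sketch that splits a 32-bit word into fields, assuming the standard Propeller 1 instruction layout (6-bit opcode, ZCRI flags, 4-bit condition, 9-bit destination, 9-bit source). The function names are illustrative only. Under that layout a condition field of 0000 means "never execute", which would explain why $20000000 and $30000000 are harmless, while hub addresses of $40000 and above have a non-zero condition field with opcode 000000 and R=0, the WRBYTE encoding.

```python
# Sketch: decode a 32-bit word as a Propeller 1 instruction.
# Assumed layout: iiiiii zcri cccc ddddddddd sssssssss (bits 31..0).
def decode_p1(word):
    """Split a 32-bit word into P1 instruction fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "zcri": (word >> 22) & 0xF,
        "cond": (word >> 18) & 0xF,
        "dest": (word >> 9) & 0x1FF,
        "src": word & 0x1FF,
    }


def can_execute(word):
    """cond == 0 means 'never', so the word acts as a NOP regardless of flags."""
    return decode_p1(word)["cond"] != 0


for addr in (0x20000000, 0x30000000, 0x00010000, 0x00040000, 0x001FFFFC):
    f = decode_p1(addr)
    print(f"${addr:08X} opcode={f['opcode']:06b} zcri={f['zcri']:04b} "
          f"cond={f['cond']:04b} can_execute={can_execute(addr)}")
```

A non-zero condition still depends on the C and Z flags at run time, so whether the stray word actually writes anything would vary from call to call.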

It's a slow process as unfortunately I am not able to even use the debugger to track what is going on at bootup time in the VM because it doesn't appear the code is getting to a point where it can read in the debugger extension and execute another VM from hub RAM because the hub RAM has not been copied into yet. Also what is really frustrating me is I can't easily find a way to remake things with just the run time code crtbegin.S getting my own debug changes without resorting to remaking the entire thing which takes ages, trust me. I happened to do a "make clean loader" instead of "make clean-loader" by mistake at midnight DOH! and then waited about an hour or two for the damn thing to remake everything - so I know how slow it is and learned the hard way.

rogloh · 2015-07-16 10:00

SUCCESS!!!!

I'm very happy now as I just fixed the issue 3) above by adding hub memory address foldover support to my SRAM Verilog code and now get the following output below when running PropGCC's LMM fibo demo in extended hub memory on my MAX10 P1V setup...

Awesome!

hello, world!

fibo(00) = 000000 (00000ms) (1024 ticks)
fibo(01) = 000001 (00000ms) (1024 ticks)
fibo(02) = 000001 (00000ms) (1408 ticks)
fibo(03) = 000002 (00000ms) (1792 ticks)
fibo(04) = 000003 (00000ms) (2544 ticks)
fibo(05) = 000005 (00000ms) (3680 ticks)
fibo(06) = 000008 (00000ms) (5568 ticks)
fibo(07) = 000013 (00000ms) (8592 ticks)
fibo(08) = 000021 (00000ms) (13504 ticks)
fibo(09) = 000034 (00000ms) (21440 ticks)
fibo(10) = 000055 (00000ms) (34288 ticks)
fibo(11) = 000089 (00000ms) (55072 ticks)
fibo(12) = 000144 (00001ms) (88704 ticks)
fibo(13) = 000233 (00001ms) (143120 ticks)
fibo(14) = 000377 (00002ms) (231168 ticks)
fibo(15) = 000610 (00004ms) (373632 ticks)
fibo(16) = 000987 (00007ms) (604144 ticks)
fibo(17) = 001597 (00012ms) (977120 ticks)
fibo(18) = 002584 (00019ms) (1580608 ticks)
fibo(19) = 004181 (00031ms) (2557072 ticks)
fibo(20) = 006765 (00051ms) (4137024 ticks)
fibo(21) = 010946 (00083ms) (6693440 ticks)
fibo(22) = 017711 (00135ms) (10829808 ticks)
fibo(23) = 028657 (00219ms) (17522592 ticks)
fibo(24) = 046368 (00354ms) (28351744 ticks)
fibo(25) = 075025 (00573ms) (45873680 ticks)
fibo(26) = 121393 (00927ms) (74224768 ticks)
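
The difference between the absolute address window and the foldover decode can be sketched as below. This is only a model of the idea, not the actual Verilog: the 2MB mask matches the range given in the thread, but the bit position used for the loop count (bit 24 and up) is an arbitrary choice for illustration, since the exact bits the fcache loader uses are not stated here.

```python
# Sketch: absolute-window decode vs. foldover decode for a 2MB hub space.
HUB_SIZE = 2 * 1024 * 1024   # 2MB decoded space, $000000-$1FFFFF
EXT_START = 0x10000          # start of the extended SRAM


def decode_absolute(addr):
    """Absolute window: only $10000-$1FFFFF selects the SRAM, else no hit."""
    if EXT_START <= addr < HUB_SIZE:
        return addr
    return None


def decode_foldover(addr):
    """Foldover: ignore address bits above the 2MB space so aliases map back in."""
    return addr & (HUB_SIZE - 1)


# Hypothetical pointer with a loop count of 3 packed into upper address bits.
tagged = 0x00010400 | (3 << 24)
print(hex(tagged), decode_absolute(tagged), hex(decode_foldover(tagged)))
```

With the absolute window the tagged pointer misses the SRAM entirely, while with foldover it lands on the same cell as the untagged address.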

jmg · 2015-07-16 10:09

Well done !

rogloh · 2015-07-16 10:12

Thanks jmg. Next thing is to port my earlier P1V Verilog autoincrement/autodecrement changes where for example you can do a "RDLONG dst, src WC" which will also increment src reg by 4 in the same instruction and can be used for autoincrementing reads and stack pops, similarly "WRLONG dst, src WC" for decrementing stack pushes.

That change will boost the tightest LMM performance automatically by 9/8 or 12.5% to hit every single hub slot now (5 MIPS max @80MHz) in the main execution loop instead of 8 in every 9 and this makes the execution deterministic as well. So running the same LMM code will now take the same time to run which will work well for LMM driver code timings.

It should also boost the stack push and pop multiple register sequences because registers then can be pushed and popped in every hub cycle instead of every second hub cycle (100% performance boost). That should help any function calls that have to save/restore lots of registers.
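
The arithmetic behind those figures, as a short sketch. It assumes one hub slot per cog every 16 clocks, which is what the 5 MIPS @80MHz figure implies; the variable names are illustrative.

```python
# Sketch: LMM throughput before and after the autoincrement change.
CLOCK_HZ = 80_000_000
CLOCKS_PER_HUB_SLOT = 16  # assumption implied by 5 MIPS max at 80MHz

hub_slots_per_sec = CLOCK_HZ / CLOCKS_PER_HUB_SLOT  # one LMM fetch per slot at best
lmm_before = hub_slots_per_sec * 8 / 9              # 8 useful slots in every 9
lmm_after = hub_slots_per_sec                       # every slot used
lmm_speedup = lmm_after / lmm_before                # 9/8

# Push/pop: one register every second hub slot vs. one register every slot.
stack_speedup = 1 / (1 / 2)

print(f"before: {lmm_before / 1e6:.2f} MIPS, after: {lmm_after / 1e6:.2f} MIPS")
print(f"LMM speedup: {(lmm_speedup - 1) * 100:.1f}%")
print(f"push/pop speedup: {(stack_speedup - 1) * 100:.0f}%")
```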

Roger.

Tubular · 2015-07-16 10:17

Awesome!, congratulations
This is really pretty big. 2MB, perhaps soon to be 8MB of hub ram that can be accessed easily from any cog.
What clock rate are you running at? The fibo results would indicate 80 MHz but is that accurate?

rogloh · 2015-07-16 10:25

Thanks Tubular. Yeah it's at 80MHz, that's a sweet spot for my synchronous SVGA design doing its 40Mpixels 800x600x15bpp framebuffer rendering which uses this SRAM as well - the 3 MAX10 COGs can share this SRAM synchronously and use it for video data reads/writes and/or large programs and data.

When I get the SDRAM going the additional 8MB will likely be limited to one COG only but that is fine by me for doing a big main program. The other COGs can still do plenty of stuff in the lower 2MB. I also want to hide the SDRAM refresh cycle so it will be still hub synchronous and deterministic to the COG that gets access to it.
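
The 80MHz figure can be cross-checked against the fibo listing earlier in the thread. A minimal sketch, assuming the demo prints whole milliseconds by truncating ticks divided by ticks-per-millisecond; the sample pairs are copied from that listing.

```python
# Sketch: check that the printed ms values match the tick counts at 80MHz.
CLOCK_HZ = 80_000_000
TICKS_PER_MS = CLOCK_HZ // 1000

# (printed ms, ticks) pairs taken from the fibo output above.
samples = [(1, 88704), (1, 143120), (2, 231168), (4, 373632),
           (51, 4137024), (927, 74224768)]

for printed_ms, ticks in samples:
    computed_ms = ticks // TICKS_PER_MS  # truncating division (assumed)
    print(f"{ticks:>9} ticks -> {computed_ms:>4} ms (printed {printed_ms})")

matches = all(ticks // TICKS_PER_MS == ms for ms, ticks in samples)
print("consistent with 80MHz:", matches)
```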

mindrobots · 2015-07-16 13:41

Roger,

This is AWESOME!! Propforth and possibly pfth could make use of a 2MB (potentially 8MB) HUBRAM for dictionary space. Not that any one cares, but it makes for an interesting application of this P1V variant!

It certainly removes the memory barriers from good sized LMM applications.

Bravo!!!!

David Betz · 2015-07-16 16:58

Wow! That's quite cool. Congratulations! What FPGA board are you using for this work?

mindrobots · 2015-07-16 17:12

David,

It looks like he is using a BE Micro Max10 but I'm not sure where he is getting the 2MB SRAM from. The Max10 has 8MB of SDRAM which Roger mentions a couple posts back, "When I get the SDRAM going the additional 8MB will be likely be limited to one COG only...."

We need more details so we can play along at home!

It's the ultimate BIG HUB Propeller!!

potatohead · 2015-07-16 17:21

Awesome work!

Kerry S · 2015-07-16 21:38

That is an amazing project!

Would be very interesting when the 10M50 parts come out allowing for 8 cogs.

rogloh · 2015-07-17 06:31

@mindrobots. I have started another thread with more of the HW details you are probably interested in. See
http://forums.parallax.com/discussion/161655/sram-expansion-board-with-svga-15bpp-color-graphics-and-text-on-p1v#latest

LoopyByteloose · 2015-08-23 19:31

mindrobots wrote: »

Roger,

This is AWESOME!! Propforth and possibly pfth could make use of a 2MB (potentially 8MB) HUBRAM for dictionary space. Not that any one cares, but it makes for an interesting application of this P1V variant!

It certainly removes the memory barriers from good sized LMM applications.

Bravo!!!!

Not merely larger Forth dictionaries, a 2MB hubram to allow Propeller driven 3D printers. They need that extra code space.

And if one doesn't care for 3D printers, there is a whole world of CNC milling and lathe work where the Propeller 1 has only had limited horizons... until now.

rogloh · 2015-08-26 12:02

The real benefit of expanding the P1V RAM like this is that it's mapped into the hub space so the access software is easy and you don't lose all the I/O pins for doing fast memory expansion like you would if you used a real P1 chip. The 32 pins on P1 were always pretty limiting if you wanted to use higher performance SRAMs or wider buses. OK, so the SPI RAMs don't need so many pins but they are SLOW and you need to access memory indirectly via specialized drivers etc. Having the fast large memory seen by the hub would be ideal for all sorts of applications using larger data sets even if their executable code remains small. And getting 2MB LMM working with GCC was fun also. I hope to get back to this again soon and dig into the SDRAM to push it even further.
