I got rid of the byte swapping of the binary file and I get a bit further. Unfortunately, it still ends up in the weeds eventually when it tries to branch to zero. I'm not sure why that is happening. Here is the single step trace.
It looks like an endian issue. At least the instructions at 100024 to 100027 are in reverse. After that it looks a bit odd.
I just figured out a bit of additional information. This seems to get lost when executing a 'callpcrel' instruction. Is there any reason why I might have trouble with that instruction?
Never mind this question. The problem seems to be that the callpcrel instruction is trying to call _init which the linker has placed at location zero for some reason. I'll have to track down why that is happening. Any idea where the source to _init is?
I can't for the life of me find out where _init comes from.
I was already trying to find it a week ago when I was trying to link with customized newlib.
Anyway, looking at my fibo listing I see _init is sitting in a ".init" section all of its own that is not catered for in your linker script. Perhaps it just needs adding to the ROM area.
That's assuming _init does not actually do anything useful for us.
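As a purely illustrative sketch (the section and region names .text and rom are guesses, not the actual Zog script), pulling the orphaned input sections into the ROM area of a GNU ld script could look something like this:

SECTIONS
{
    .text :
    {
        *(.text)
        *(.init)    /* pulls _init into the ROM image */
        *(.fini)    /* and its matching shutdown code, if wanted */
    } > rom
}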
Thanks for the suggestion. I tried adding the .init input section to the .text output section and that created some bizarre error messages about .data and .bss not being in ram. Looks like I have a bit more work to do...
I can't for the life of me find out where _init comes from.
I'm not sure if _init comes from the same place but all of the .init and .fini sections come from a file called toolchain/gcc/gcc/crtstuff.c that gets used to build crtbegin.o and crtend.o.
I finally got code running from the SPI flash on the C3! This was after lots of mucking around with the linker script. Here are the fibo results. Speed isn't very impressive but I can now run up to 1MB of code with 64K of data on a C3! Even though the SD card gets mounted at the start of this run, it isn't actually needed. The code can run from previously programmed flash. Another nice thing about this is that I now have enough code space for an interactive basic bytecode compiler and vm and probably also an editor if I can get one written.
I finally got code running from the SPI flash on the C3! This was after lots of mucking around with the linker script. Here are the fibo results. Speed isn't very impressive but I can now run up to 1MB of code with 64K of data on a C3!
Congratulations David!
You just have to tell me one way or another how many seconds it takes to run all of fibo from the text "Opening ZPU" to the end at "BREAKPOINT." Please.
I've been working on a TV demo on and off. Hopefully, we can merge efforts. I really want to have a reasonable GUI running soon. I have a few days between boards to get serious with that again.
As it turns out, I'm still having some trouble with the _init function that is called as part of the runtime library initialization. I commented it out to get the fibo run that I posted earlier. When I put it back into the build, I end up crashing at location zero. It seems that something gets compiled into the _init function that ends up trying to call a function at zero. I'm still trying to figure out how code gets added to _init.
UPDATE: I figured out why I was having a problem with _init. I had forgotten to place the .ctors and .dtors sections in the initialized RAM. All is working well now.
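For anyone hitting the same thing, a minimal sketch of the kind of placement that cures it, assuming the initialized-RAM output section is called .data and the memory regions are called ram and rom (none of these names come from the real script):

.data :
{
    . = ALIGN(4);
    KEEP(*(.ctors))    /* constructor pointer table used by the crtbegin/crtend startup code */
    KEEP(*(.dtors))    /* matching destructor pointer table */
    *(.data)
} > ram AT > rom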
I finally got code running from the SPI flash on the C3! This was after lots of mucking around with the linker script. Here are the fibo results. Speed isn't very impressive but I can now run up to 1MB of code with 64K of data on a C3! Even though the SD card gets mounted at the start of this run, it isn't actually needed. The code can run from previously programmed flash. Another nice thing about this is that I now have enough code space for an interactive basic bytecode compiler and vm and probably also an editor if I can get one written.
I notice your number for fibo(26) looks a little odd. Presumably this has wrapped around?
Ross.
PropCade can use SPI Flash too
You mean the time for fibo(26)? Yes, this memory is so slow that the time wraps around. I'm using a direct mapped cache and hooking into ZOG using the same protocol as jazzed uses with his SdramCache.spin driver. What is needed to get Catalina C working with another memory driver? I guess all of your memory access code is inline with your LMM runtime code, right? Have you tried interfacing with SdramCache or VMCOG or some other extended memory driver like that?
What is needed to get Catalina C working with another memory driver? I guess all of your memory access code is inline with your LMM runtime code, right? Have you tried interfacing with SdramCache or VMCOG or some other extended memory driver like that?
Right - anything else and you lose too much speed. I started doing some design work on a caching SPI driver optimized for the access pattern of a typical Catalina program which would execute in another cog - I was hoping clever caching would help compensate for the slow SPI access time. However, I didn't think I would be able to finish it in time for the C3 release, so I stopped. As it turns out, I probably would have had time, but c'est la vie - I've spent the time doing other things instead.
Now, it might be simpler just to interface with VMCOG and live with the performance hit. This is also on my "to do" list, but I'd be perfectly happy for someone else to do it.
Well done. That's fantastic.
Much head slapping going on here. It occurred to me to suggest looking for sections missing from the linker script with constructors/destructors in mind. Then, well, it went out of my mind.
Speed isn't very impressive but
Actually I think it is. Just looking around for fibo(20) results:
ROM on C3 - 3050ms
Jazzed's 32MB RAM - 2376ms
Catalina on a RAMBLADE - 921ms
Not so shabby and anyway fibo is a really bad test for this as described many times before.
I can now run up to 1MB of code with 64K of data on a C3!
Now here I'd missed a point about your C3 memory map. Am I correct in understanding that you are running fibo with data and stack in an external RAM? If so the speed is even more impressive. Looks like the C3 is another board I need around here.
In your linker script you had a memory map like so:
Did you fix up Zog to handle all those different areas?
No, ZOG still thinks that everything starting at zero up until the hub memory at 0x10000000 is RAM managed by the cache driver. Internally, the cache driver decides that addresses starting at 0x00100000 should be mapped to the flash. Actually, this is just an extension of what I already had to do to map different address ranges to the two SRAM chips. The cache code doesn't handle the flash any differently from another big SRAM chip, except that it ignores writes to that space.
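In C terms the decode amounts to something like the sketch below. The 0x00100000 flash boundary follows the description above, but the SRAM split, the names and the is_writable() idea are assumptions rather than the actual PASM:

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical decode used by the cache driver when filling or flushing a
 * cache line.  The ZPU sees one flat space below hub RAM (0x10000000);
 * the region from FLASH_BASE upward is assumed to be backed by the SPI
 * flash and treated as read-only, everything below it by the two SPI
 * SRAM chips.  All names and the SRAM split are assumptions. */

#define FLASH_BASE  0x00100000u
#define SRAM_SPLIT  0x00008000u   /* assumed boundary between the two SRAM chips */

typedef enum { DEV_SRAM0, DEV_SRAM1, DEV_FLASH } backing_dev_t;

static backing_dev_t decode(uint32_t zpu_addr)
{
    if (zpu_addr >= FLASH_BASE) return DEV_FLASH;
    if (zpu_addr >= SRAM_SPLIT) return DEV_SRAM1;
    return DEV_SRAM0;
}

/* Dirty lines that decode to the flash are simply dropped on eviction. */
static bool is_writable(uint32_t zpu_addr)
{
    return decode(zpu_addr) != DEV_FLASH;
}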
I'm starting to think we are going to need a family of different Zogs to handle all the different memory schemes.
I setup that memory map because I thought it matched what ZOG was already doing. I'm happy to change my map to follow whatever standard you define.
I'm hoping to tackle modifying the syscall code so I can have the stdio functions open, close, read, write, and lseek call the dosfs.c code directly rather than sending messages back to the SPIN COG. In fact, I'm hoping to start using your run_zog.spin program instead of debug_zog.spin so I can take over all of the COGs. How did you handle the COGINIT instruction to allow ZOG programs to start other COGs without having to go back to SPIN code?
ROM on C3 - 3050ms
Jazzed's 32MB RAM - 2376ms
Catalina on a RAMBLADE - 921ms
My latest SDRAM cache fibo(20) calculation time is around 2060ms.
FIBO is a bad test if calculation time is the only metric considered. Overall test time is just as important, and I suspect it is dismal in all cases because of iprintf and other functions' memory distribution. I'll look at the dhrystone timing again later.
My latest SDRAM cache fibo(20) calculation time is around 2060ms.
FIBO is a bad test if calculation time is the only metric considered. Overall test time is just as important, and I suspect it is dismal in all cases because of iprintf and other functions' memory distribution. I'll look at the dhrystone timing again later.
It could be that my fibo numbers are wrong anyway because I've never modified the source to match my clock speed. Isn't the default setup for 100MHz operation? I'm running my C3 at the standard 80MHz clock rate.
It could be that my fibo numbers are wrong anyway because I've never modified the source to match my clock speed. Isn't the default setup for 100MHz operation? I'm running my C3 at the standard 80MHz clock rate.
Yes, good observation. I never changed the fibo clock constants either until now.
With the adjustment the sum of the fibo calculation times is very close to the total run time. Calculated is 126943 ms, or about 2m7s, vs. the actual 2m13s.
I assume the missing 6 seconds is attributed to startup time, program printing, and cache swap overhead.
Cache swap overhead is a big problem with larger buffers. No doubt it is also an issue with VMCOG.
I guess it makes sense that any division of the memory map between different external devices is handled by the VMCOG/cache driver COG rather than Zog itself.
I have no plans to change the memory map at this moment.
I'm hoping to tackle modifying the syscall code so I can have the stdio functions open, close, read, write, and lseek call the dosfs.c code directly rather than sending messages back to the SPIN COG
I presume you mean the syscalls.c module in libgloss not the syscall handler in Zog.
So we end up with: Application => libgloss => dosfs => SD block driver.
What to do about that lowest level, the SD block driver?
Ideally that would be a stand alone PASM SD driver that could be used from C, dosfs in this case, through a HUB memory mailbox/buffer interface.
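Something along the lines of this C view of a hub mailbox is what is meant; the field layout, command codes and names are invented for the sketch and are not an existing Zog or dosfs interface:

#include <stdint.h>

/* Rough sketch of the kind of hub-memory mailbox a stand-alone PASM SD
 * block driver could expose to C code. */
typedef volatile struct {
    uint32_t cmd;        /* 0 = idle, 1 = read block, 2 = write block (assumed) */
    uint32_t block;      /* 512-byte block number on the card */
    uint32_t buffer;     /* hub address of a 512-byte buffer */
    uint32_t status;     /* driver's result code */
} sd_mailbox_t;

/* The PASM cog polls 'cmd'; C (dosfs's sector-read hook) would do this: */
static int sd_read_block(sd_mailbox_t *mbox, uint32_t block, uint8_t *buf)
{
    mbox->block  = block;
    mbox->buffer = (uint32_t)buf;
    mbox->cmd    = 1;                /* kick the driver cog */
    while (mbox->cmd != 0)           /* wait for it to clear the command */
        ;
    return (int)mbox->status;
}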
I'm hoping to start using your run_zog.spin program instead of debug_zog.spin so I can take over all of the COGs. How did you handle the COGINIT instruction to allow ZOG programs to start other COGs without having to go back to SPIN code?
Ah. This is a bit complicated...
If you look in the lib directory in zog you will find propeller.c. In there a few Spin-like functions have been defined. Significant here is cognew() which loads and starts a COG from a binary blob.
You will find FullDuplexSerialPlus.cpp and VMCog.cpp which are C++ implementations of the respective Spin objects. They use cognew() to get their PASM blobs started.
All this is built into a library libzog.a and a test program "test_libzog". It is that test I have had running under run_zog using all of HUB RAM or with the possibility to continue running some Spin code.
Where do the FDX and VMCOG PASM binary blobs come from?
They are compiled and extracted from the respective Spin modules using BSTC.
Then they are corrected for endianness with objcopy, and then converted to ELF object files again with objcopy. These objects are then linked into the test_libzog program from which they can be loaded via cognew. Have a look in the Makefile to see how it all hangs together.
Phew...
This is all by way of an experimental hack but it does show what can be done. This works from HUB RAM and needs a bit of effort to be able to use it from EXT memory.
I would like to think that an SD block driver could be added to this scheme. The float32 cog is working this way already.
Now, what would be better is to not have the PASM blobs taking up space in the ZPU HUB space.
My simple approach would be:
a) run_zog loads and starts cogs for FullDuplexSerial, VMCog/CACHE_RAM, SD block, Float32, video, keyboard. Whatever a particular app needs. These devices have memory interfaces/buffers defined somewhere in high HUB.
b) run_zog then starts a Zog interpreter on the application's C code, moving it to address zero first.
c) The C code runs and uses the available hardware drivers through the high HUB memory interfaces.
d) The C code would be able to start a Zog running ZPU code in external memory.
I presume you mean the syscalls.c module in libgloss not the syscall handler in Zog.
So we end up with: Application => libgloss => dosfs => SD block driver.
What to do about that lowest level, the SD block driver?
Ideally that would be a stand alone PASM SD driver that could be used from C, dosfs in this case, through a HUB memory mailbox/buffer interface.
Yes, that's what I'd like to get to. At the moment my application calls dosfs directly and then hijacks the read/write syscalls to interface with the fsrw driver through your SPIN syscalls handler to do raw sector I/O. Unfortunately, this means I can't use the normal open/close/read/write/lseek functions in my C code to handle file I/O. I have to call dosfs directly. I will probably attempt to access the mailbox interface to my SPI driver directly from C code. Before I do that I have to add SD support to go along with my SPI flash and SPI SRAM support.
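As a rough sketch of that direction, the libgloss-style read stub might end up looking like this; the descriptor table, serial_rx() and the dispatch are placeholders, and the actual dosfs call (something like DFS_ReadFile on the FILEINFO kept for the descriptor) would go where the comment indicates:

#include <errno.h>

/* Placeholder console input and a minimal descriptor table; in the real
 * thing each slot would hold the dosfs FILEINFO for an open file. */
extern int serial_rx(void);

#define MAX_FILES 8
static struct { int in_use; /* FILEINFO would live here */ } fd_table[MAX_FILES];

int _read(int fd, char *buf, int len)
{
    int i;
    if (fd == 0) {                       /* stdin comes from the serial console */
        for (i = 0; i < len; i++)
            buf[i] = (char)serial_rx();
        return len;
    }
    if (fd < 3 || fd >= MAX_FILES + 3 || !fd_table[fd - 3].in_use) {
        errno = EBADF;
        return -1;
    }
    /* Call dosfs here (e.g. DFS_ReadFile on this descriptor's FILEINFO)
     * and return the number of bytes actually transferred. */
    return 0;
}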
Ah. This is a bit complicated...
Thanks for the overview of how things work today. I may just stick with debug_zog.spin for the moment since my first priority is to get my basic system running on the C3.
Perhaps you have a better plan.
No, I'm just fumbling along at the moment trying to achieve a short-term goal of getting stuff running on the C3. I'll let you do all of the deep thinking! :-)
Now that I have GCC/ZOG running on the C3 using the SPI flash for code and the SPI SRAM for data I was finally able to compile my simple BASIC bytecode compiler and interpreter to run on the C3. The good news is that it works fine. The bad news is that it is unbearably slow. It isn't really usable. I'm not sure if this is because of the bad performance of my cache code or if it is just that running interpreted C code from external RAM doesn't have good enough performance for my application. I'm not sure what I'll try next...
Which part is unusable, is it the compilation to byte code stage or is it the actual running of the BASIC byte code, or both?
When the actual BASIC bytecode program is running I guess we have:
1) An interpreter in PASM executing ZPU byte codes.
2) Those ZPU byte codes are an interpreter written in C executing your BASIC byte codes.
3) Those BASIC byte codes are the actual BASIC program.
We might expect this to be rather sluggish:)
One obvious way out is to have the BASIC compiler generate ZPU byte codes rather than your current BASIC byte codes, as you have mentioned before. This removes a whole layer of interpretation and hence a speed up by a huge factor.
Before going down that road it might be worth guesstimating what speed up is achievable. For example, one would imagine what ZPU byte code sequence would be generated for typical BASIC constructs like assignments, for/next loops, etc. and compare that with the number of ZPU instructions currently executed for the same constructs. I'm sure you have thought about all this anyway.
As for the performance of your cache code, I have no idea. Does it have multiple pages in HUB at a time and a replacement policy like VMCOG?
It would be interesting to know how the speed using SPI RAM / ROM compares to Jazzed's 32MB RAM CACHE solution or the VMCOG solution.
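As an aside on the three layers above, the middle one is just a fetch/dispatch loop. This is not David's interpreter, only a generic C sketch with made-up opcodes, but every pass through it is itself executed as a run of ZPU byte codes, which is exactly the layer that compiling BASIC straight to ZPU code would remove:

#include <stdint.h>

enum { OP_PUSH, OP_ADD, OP_STORE, OP_HALT };     /* hypothetical BASIC opcodes */

int32_t run(const uint8_t *code, int32_t *vars)
{
    int32_t stack[32];
    int sp = 0;
    for (;;) {
        uint8_t op = *code++;                    /* every BASIC opcode pays for */
        switch (op) {                            /* a fetch and a dispatch...   */
        case OP_PUSH:  stack[sp++] = (int32_t)*code++;    break;
        case OP_ADD:   sp--; stack[sp - 1] += stack[sp];  break;
        case OP_STORE: vars[*code++] = stack[--sp];       break;
        case OP_HALT:  return sp ? stack[sp - 1] : 0;
        default:       return -1;                /* ...all of it in ZPU codes   */
        }
    }
}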
Which part is unusable, is it the compilation to byte code stage or is it the actual running of the BASIC byte code, or both?
When the actual BASIC bytecode program is running I guess we have:
1) An interpreter in PASM executing ZPU byte codes.
2) Those ZPU byte codes are an interpreter written in C executing your BASIC byte codes.
3) Those BASIC byte codes are the actual BASIC program.
We might expect this to be rather sluggish:)
Yup. :-(
One obvious way out is to have the BASIC compiler generate ZPU byte codes rather than your current BASIC byte codes, as you have mentioned before. This removes a whole layer of interpretation and hence a speed up by a huge factor.
This had been my plan but even the compiler is far too slow. Speeding up the execution of the compiled basic program won't help that.
Before going down that road it might be worth guesstimating what speed up is achievable. For example, one would imagine what ZPU byte code sequence would be generated for typical BASIC constructs like assignments, for/next loops, etc. and compare that with the number of ZPU instructions currently executed for the same constructs. I'm sure you have thought about all this anyway.
Yes, doing that analysis before jumping into generating ZPU code would make sense. Also, I could target the SPIN VM. Unfortunately, neither of those options address the problem that the bytecode compiler itself is too slow to be usable.
As for the performance of your cache code, I have no idea. Does it have multiple pages in HUB at a time and a replacement policy like VMCOG?
It would be interesting to know how the speed using SPI RAM / ROM compares to Jazzed's 32MB RAM CACHE solution or the VMCOG solution.
I can certainly do some more optimization of my cache code. It uses a direct-mapped cache with 32 byte cache lines. Making it multi-way might improve things a lot. I just don't have a lot of confidence that it will be enough to make the compiler usable. I may do it anyway just because it will make GCC/ZOG more usable on the C3 even if my on-board basic development environment isn't. I have been having fun with the Propeller and I don't want to give it up.
This had been my plan but even the compiler is far too slow...
Ah.
I could target the SPIN VM...
That's blasphemy round these parts:)
Of course you could also target an LMM kernel. Design one of your own or borrow one from Catalina or such.
Actually we have never discussed your BASIC's byte codes. For example, if they were few enough and simple enough could you create an interpreter for them in PASM? Along the lines of Zog but different. Who said the BASIC had to be run by the same interpreter as the compiler? That would remove a layer of run time interpretation at the cost of having to maintain yet another virtual machine.
...address the problem that the bytecode compiler itself is too slow...
Hmmm...where does the time go?
Now I'm not really familiar with the world of caches and virtual memory so you might need to explain more about what goes on there. "direct-mapped", "multi-way" etc.
If we look at ZPU execution we have something like this for every instruction:
1) A byte code fetch.
2) One or two accesses to the stack.
3) Perhaps an access to the data area.
So it would seem that as a minimum it would be beneficial to have some part of each of the code, data and stack areas sitting in on-chip RAM as a "working set". If we have to reload the cache every time we go from code to data to stack that is an awful lot of thrashing.
Of course you could be using Catalina or ICC instead but without getting the memory system up to speed there probably aren't many gains to be had.
I would be interested in how the compiler performs under VMCOG or SDRAMCACHE. I guess VMCOG does not support sufficient space yet.
I guess yours looks something like that. 128 slots of 32 bytes, 4096 bytes in on-chip RAM at any moment.
Given that you are decoding RAM/ROM space in your memory driver I was wondering if there is any advantage to be gained from having separate caches for code and data. At least that way there would be no cache collisions between stack/data and opcode fetches.
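In C terms I'd picture the lookup going something like the sketch below (32-byte lines, 128 slots, 4 KB of hub RAM as guessed above); the real driver is PASM, and the tag width, valid bit and miss handling here are assumptions. The miss path is also where a 2- or 4-way arrangement would behave differently:

#include <stdint.h>

#define LINE_SIZE   32u
#define LINE_COUNT  128u

typedef struct {
    uint32_t tag;                 /* address bits above the index (addr >> 12) */
    uint8_t  valid;
    uint8_t  data[LINE_SIZE];
} line_t;

static line_t cache[LINE_COUNT];

uint8_t read_byte(uint32_t addr)
{
    uint32_t index = (addr / LINE_SIZE) % LINE_COUNT;   /* (addr >> 5) & 0x7F */
    uint32_t tag   = addr / (LINE_SIZE * LINE_COUNT);   /* addr >> 12         */
    line_t  *l     = &cache[index];

    if (!l->valid || l->tag != tag) {
        /* Miss: with a direct-mapped cache, code, data and stack accesses
         * that alias to the same index evict each other.  A 2- or 4-way
         * design keeps several tags per index and picks a victim instead. */
        /* fill l->data from external memory here (omitted) */
        l->tag   = tag;
        l->valid = 1;
    }
    return l->data[addr % LINE_SIZE];
}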
I could certainly have separate code and data caches but I wonder if that would be better than just having a multi-way unified cache? Also, I guess I could try using more than 4k of hub RAM as a cache. I will probably play with some of these ideas to see if they help.
Just for a point of reference, what do you consider the fastest external RAM version of ZOG that is available at the moment? What platforms can it run on? I'd try my VMCOG driver but it is currently limited to only 64K and my C code compiles to almost that amount of code, never mind data. If I can figure out how to get VMCOG to address both the 64K of SRAM and the 1MB of flash I could try that to see if it's any better than my cache driver.
David, I am sorry to hear the compiler runs too slowly.
Unfortunately VMCOG does not support >64K VMs right now, but it will as soon as I have some time to work on it.
If your program would run in 128K, you could run two copies of VMCOG (in a separate cog each), with different mailbox addresses, and use one for code and one for data. You could add the additional SPI RAMs to the C3's SPI expansion ports.
If not, maybe you can profile your compiler, and see if you can speed the compiler up.
Where is it spending most of its time?
Is it a recursive descent compiler?
Regards,
Bill
p.s.
One of my upcoming boards will support 8 SPI RAMs, so it needs a large VMCOG for proper operation.
Quickly runs away to read up on "multi-way unified cache....".
As you see I'm in no position to advise. Where is Bill when you need him?
Increasing the size of the cache must surely help, as a brute force approach.
...what do you consider the fastest external RAM version of ZOG...
I don't know yet. I have here a TriBlade and VMCOG which means nothing in your case due to the 64K limit. I also have a 32MB GadgetGangster card which due to work/flu/more work I have yet to find time to get running. Most of the timing experiments we have done were using fibo which is somewhat useless for this task.
What platforms can it run on?
Anywhere that is supported by VMCOG, the GadgetGangster 32MB setup, and now your C3 effort.
I know Bill has plans to greatly increase the address space handled by VMCOG, no idea how far along that has come.
I'm somewhat tempted to do what I always said I would not do, create a Zog with direct hardware access to the 512KBytes on the TriBlade card. Perhaps another for the DracBlade. This is the way to go for raw speed on those cards.
None of this will happen until I have the GadgetGangster running here, and floating point and...