For sure, splitting all of the code out to FLASH and leaving data in RAM would be an easier first step.
Earlier I was confusing myself re: possible memory maps for ZOG. What with having external RAM or not. Running from external RAM or not.
I can see that with all the different memory options, SPI, SDRAM, FLASH this is going to result in the possible combinations multiplying. How to manage all that?
Create a huge memory mapped VMCOG.
VMCOG tags and virtual pages are in HUB memory.
VMCOG routines must be small so that large drivers (SDRAM) can fit in one COG.
Convince Brad that a #include FILE is really important to eliminate #ifdef clutter.
The #ifdefs should be in the #include files and not the VMCOG file.
If Bill is unwilling to do the huge VMCOG, one of us can deliver it.
Deprecate non-VMCOG drivers.
Are you convinced that the VMCOG approach is better than the scheme used by SdramCache? The SdramCache solution eliminates a memory copy from the memory COG buffer to the VM COG buffer. Doesn't that improve performance? Also, SdramCache uses a direct-mapped cache I think (my clone of it does anyway). Does the scheme used by VMCOG perform better? Its page replacement algorithm still worries me a little.
Let's just say I was throwing out an olive branch. Combining the best aspects of all the drivers would be good.
I believe the SdramCache buffer approach is better/faster than the one-at-a-time element approach used in VMCOG. Maybe that's the best interface to use.
Maybe an efficient N-way set-associative cache could work better - I've tried, but could not get any better performance than direct-mapped.
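To make the trade-off concrete, here is a rough C sketch of the two lookup styles - the sizes and names (LINE_SIZE, tags[] and so on) are invented for illustration, not taken from SdramCache or VMCOG:

#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE 64   /* bytes per cache line (illustrative only) */
#define NUM_LINES 32   /* lines resident in hub RAM                */

static uint32_t tags[NUM_LINES];   /* which virtual line each slot holds */
static bool     valid[NUM_LINES];

/* Direct mapped: a line can live in exactly one slot, so the lookup is one
   modulo and one compare - but two hot lines that alias collide forever.  */
static int lookup_direct(uint32_t vaddr)
{
    uint32_t line = vaddr / LINE_SIZE;
    uint32_t slot = line % NUM_LINES;
    return (valid[slot] && tags[slot] == line) ? (int)slot : -1;   /* -1 = miss */
}

/* 2-way set associative: the same index now names a pair of slots, so two
   aliasing lines can coexist, at the cost of a second compare per access. */
static int lookup_2way(uint32_t vaddr)
{
    uint32_t line = vaddr / LINE_SIZE;
    uint32_t set  = (line % (NUM_LINES / 2)) * 2;
    if (valid[set]     && tags[set]     == line) return (int)set;
    if (valid[set + 1] && tags[set + 1] == line) return (int)(set + 1);
    return -1;                                                      /* miss */
}

Every extra way is another tag fetch and compare in the hot path, which is presumably why going much beyond 2-way stops paying off in a cog.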
I've had doubts about the replacement algorithm too, but I believe getting that working perfectly would be best for page (or cache line) management. What collision resolution algorithm is used in VMCOG ... leaky-bucket, quadratic probe, other ... ?
I think the VMCOG tag LRU counter bits could be moved to HUB RAM with little difference in performance, but it would allow a huge virtual space. The tag table is fairly small anyway. The other thing that could change is the VMCOG page size - 512 bytes takes a long time to load from any backing store. Pages of 64 or 128 bytes are probably better.
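To put a number on the hub-table idea, here is a minimal sketch of what a hub-resident tag entry might look like - the field widths, names and sizes are guesses for illustration, not VMCOG's actual layout:

#include <stdint.h>

/* One entry per virtual page, kept in hub RAM so the table can cover a large
   virtual space; packed into one long so a cog can fetch it with one rdlong. */
typedef union {
    uint32_t raw;
    struct {
        uint32_t frame : 8;    /* which resident hub page frame holds it */
        uint32_t valid : 1;    /* page is resident at all                */
        uint32_t dirty : 1;    /* page modified since it was loaded      */
        uint32_t lru   : 22;   /* access counter used to pick a victim   */
    } f;
} vm_tag_t;

/* Table size is the catch: with 512-byte pages, 1 MB of virtual space needs
   2048 entries = 8 KB of hub RAM; 64-byte pages need 16384 entries = 64 KB,
   which is more than the Prop's 32 KB hub - one reason small pages aren't free. */
#define PAGE_SIZE 512u
#define VIRT_SIZE (1u << 20)
static vm_tag_t page_table[VIRT_SIZE / PAGE_SIZE];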
In any case, I think containing the #ifdef mess needs to be addressed. Having one good interface helps, but having #include tools would help more.
I'm still too tied up with other things to get into improving all this right now especially since things are working fairly well. I'll get back to it after a bit though.
One biggie for me at some point is making Catalina work with some kind of cache or VM ... that is one tough project though unless Ross does it.
There is of course still the ultimate LMM thing left to do ...
Bill will be resuming work on the huge VMCOG as soon as the current crop of boards is tested and sent off for production runs.
I TOTALLY agree about the #ifdef clutter; Brad said he would add #ifdef a month or two ago when I asked him.
Current status:
- PropCade - production boards available for sale
- Morpheus (pcb rev.2) - production boards tested, just have to do new BOM and start selling
- Mem+ (pcb rev.2) - production boards tested, just have to do new BOM and start selling
- 485Plug - production boards tested, just have to do BOM and start selling it
- CPUModule - prototype built and tested
- Mem+ built, driver partially tested, one unsolved bug
- pPLC (formerly PLC-G) - prototype >50% tested and verified, should be done tomorrow
- Morpheus+ ... needs to be built and tested
- IO* ... needs to be built and tested
- approx 10 small/easy I/O boards: 6 built, 4 to be built, all to be tested
I suppose I could put off building/testing Morpheus+/IO*/unbuilt I/O boards for a week to get back to VMCOG first.
Note: PropCade and pPLC both require the large VMCOG due to the number of SPI memory sockets on them, so I guarantee I will finish the large VMCOG, as some of my products are built around it.
The current page replacement policy is basically "Least Recently Used", however that can be replaced. I was more worried about getting a page replacement policy working than squeezing the last ounce of performance out at this stage. The current VMCOG was basically a prototype/testbed for the upcoming large VM version, plus I wanted something capable of running CP/M on PropCade (and now, C3)
I think that VMCOG will be faster with slow memories (think SPI) and that SdramCache *may* be better with parallel SDRAM or SRAM. The proof will be in testing large numbers of different applications, and benchmarking the performance. I'd bet that VMCOG will be faster in some situations, and a cache in others. Best guess: code with greater "locality of reference" will work better with VMCOG, and more scattered/random code will work better with a small cache line cache.
A very large number of papers have been written on page replacement policies... each is better for different application access patterns.
Are you convinced that the VMCOG approach is better than the scheme used by SdramCache? The SdramCache solution eliminates a memory copy from the memory COG buffer to the VM COG buffer. Doesn't that improve performance? Also, SdramCache uses a direct-mapped cache I think (my clone of it does anyway). Does the scheme used by VMCOG perform better? Its page replacement algorithm still worries me a little.
I think that SdramCache should use exactly the same interface as VMCOG, for {RD|WR}{B|W|L}, and ignore the other commands. Then the user can choose SdramCache or VMCOG, depending on which performs better for his application - they would be interchangeable.
The problem with giving the application access to a whole small page is that the application then has to know when it needs to grab a handle to a different page.
With the VMCOG approach, the application just asks for the content of an address, and it gets it (or value gets written) without the application having to know if the memory is really in core or not.
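Roughly, the difference in client code looks like this - a C-flavoured sketch where vm_mbox, the command encoding and cache_line_of() are all made-up placeholders, not the real VMCOG or SdramCache interfaces:

#include <stdint.h>

/* VMCOG style: one call per element, the VM cog hides hit/miss entirely. */
extern volatile uint32_t vm_mbox[2];   /* [0] = command + address, [1] = data (hypothetical) */

static uint8_t vm_readbyte(uint32_t vaddr)
{
    vm_mbox[0] = (1u << 24) | (vaddr & 0x00FFFFFFu);   /* hypothetical "read byte" command */
    while (vm_mbox[0] != 0)                            /* wait for the VM cog to clear it  */
        ;
    return (uint8_t)vm_mbox[1];                        /* caller never knows if it was a miss */
}

/* Cache style: the client locates the resident line itself, then indexes it. */
extern uint8_t *cache_line_of(uint32_t vaddr);         /* placeholder: tag check + fetch on miss */

static uint8_t cache_readbyte(uint32_t vaddr)
{
    uint8_t *line = cache_line_of(vaddr);
    return line[vaddr & 63];                           /* assuming 64-byte lines */
}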
Currently there is no page collision possible, but I am addressing that with two-way associativity in the big version. My calculations showed a slowdown with anything greater than 2-way associativity.
My current page replacement (a version of LRU) works, so I intend to leave it alone until the large VM has been working well. VMCOG is pretty modular, it will be easy to try different policies.
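For reference, a counter-based LRU victim pick can be as small as this - a generic sketch, not VMCOG's actual code:

#include <stdint.h>

#define NUM_FRAMES 32

static uint32_t last_used[NUM_FRAMES];   /* stamped from a global tick on every access */
static uint32_t tick;

static void touch(int frame)             /* call on every hit */
{
    last_used[frame] = ++tick;
}

static int pick_victim(void)             /* call on a miss: evict the oldest frame */
{
    int victim = 0;
    for (int i = 1; i < NUM_FRAMES; i++)
        if (last_used[i] < last_used[victim])
            victim = i;
    return victim;
}

Because the policy sits behind two tiny calls like these, swapping in clock/second-chance or even random replacement later is a local change.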
Minis, mainframes, etc. standardized on 512-byte pages a long time ago. For the time being, I am sticking with them, because the idea is that if you get your hit rate up high enough, you have to swap pages in/out <1% of the time. Many small pages will cause many more page swaps, and on a smaller number of bytes, the initial address setup overhead will be more significant.
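To put rough numbers on that: at, say, 1 MB/s of effective SPI transfer (an assumed figure), a 512-byte page fill costs about 0.5 ms while a 64-byte fill costs about 64 us plus the same command/address setup every time. Eight 64-byte fills move the same data as one 512-byte fill but pay that setup overhead eight times - so small pages only win if they genuinely avoid loading bytes you never touch.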
Moving the hit counters to the hub would cause a HUGE performance hit with the current VMCOG "ask for a memory unit (B/W/L)" as it would add two additional hub accesses to every single access.
The SdramCache way of making a cache line available works around this, but requires significantly more complex memory access code in the application. In VMCOG, you simply ask for the data at the address (or give data to write to an address). In a cache, you have to check if it is in-core, then modify the hub byte, in every application.
I *TOTALLY* agree about the ifdef mess, it is a huge headache for me too.
One of the major goals for VMCOG was to present a simple, easy to use, unified interface, and keep the client code VERY simple.
I think that SdramCache should use exactly the same interface as VMCOG
I'll try this when I have time - that is, after I've answered some big questions and have more hardware in the manufacturing pipe. My initial experiments with this revealed that VMCOG is incompatible with reasonable SDRAM performance.
The problem with standardizing on the VMCOG interface, from all the data I've seen so far, is that SdramCache today outperforms VMCOG. Trouble is I can't tell if that's mainly because of the block-based interface or the high data-burst rate using SDRAM (2 instructions per byte peak). The only way to know for sure is to isolate a variable, which seems impossible without a merged approach ... VMCOG with tags intra-COG is too big to allow SDRAM long bursts.
In general, if you give a cog access to a block of hub memory that it can access directly, it will be faster than if the cog has to request each byte/word/long individually - BUT - the coding for the cog is far simpler with the VMCOG approach.
The fastest approach will generally be to code directly to the hardware interface - i.e., maintain a small cache within the application cog which talks directly to the RAM - however, that requires significant programming in each application cog.
Your cache approach is somewhat easier for the application cogs, however the application cogs now have to check if the address they are looking for is in one of the resident cache lines - again, more coding for the application writers.
I designed VMCOG so that it is very easy for application cogs to interface to it, and it requires little memory in the application cogs to use. The complexity of presence checking, replacement policies, etc. is application-transparent - albeit with a performance hit.
Computer science history shows us that with a good enough hit ratio, approaches like mine will get close to 99% of the performance of not having a larger VM (assuming a big enough working set).
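The usual back-of-envelope form is: effective access time ~ hit time + miss rate x miss penalty. With an illustrative 1 us hub-mailbox hit and a 0.5 ms SPI page fill, a 99.9% hit rate gives about 1 + 0.001 x 500 = 1.5 us per access, while 99% already balloons to about 6 us - the hit rate, not the raw backing-store speed, dominates once the working set fits.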
I have actually started adding provisions for "locking" VM pages into memory, which would allow a very similar approach to what you do with cache lines - however, I intended to use locked pages for screen buffers.
I think your benchmarks showed that using VMCOG was roughly 1/4 the speed of using the hub directly - which is a pretty good result considering the hub handshaking going on, and I suspect it is fairly close speed-wise to using the SDRAM one byte (or 2 / 4) at a time, not using long bursts - if not a bit faster.
The problem with many small pages is that the TLB gets huge - and that when loading/saving small pages (or cache lines) the setup time is a larger fraction of the total transfer time.
Frankly, for larger projects it could easily be that the application programmers will follow a roadmap like:
- initially implement a VMCOG interface, so they can talk to all memory boards that support it
- later for a few systems add a Cache interface (which involves more code writing for the app programmer)
- for really critical systems, implement an optimized direct interface for each significant memory hardware interface
Mind you, in extreme cases such as single channel SPI RAM, I would be very surprised if a caching approach significantly outperformed the easier to implement (for an app writer) VMCOG approach.
Okay, I think I'm getting my head wrapped around this endian issue. Even though the Propeller is a little-endian machine, ZOG has turned it into a big-endian machine. If I write an array of 32 bit values into a file on the PC in little-endian format expecting it to work on the Propeller, it doesn't work if I read it under ZOG. I have to write it in big-endian format for use in ZOG even though the Propeller is little-endian. This is kind of a pain since a Spin program can't directly write a file and have it read by a ZOG program without taking endian issues into account even though both programs are running on the same CPU.
Thing is, one never writes an array of 32-bit values to a file. One writes an array of bytes which may happen to represent 32-bit values somehow. Luckily the bits in the bytes always seem to come out the right way around.
This problem has been around forever: Intel is little-endian, Motorola is big-endian, etc.
The only way around it is to be sure you know the on-file data format and take steps to ensure it is read/written the right way around.
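In C terms, "take steps" just means serialising byte by byte in a defined order instead of dumping in-memory longs - a small sketch:

#include <stdint.h>
#include <stdio.h>

/* Write a 32-bit value big-endian (the order the ZOG/ZPU image expects),
   no matter what the CPU running this code happens to be. */
static void put_be32(FILE *f, uint32_t v)
{
    fputc((int)((v >> 24) & 0xFF), f);
    fputc((int)((v >> 16) & 0xFF), f);
    fputc((int)((v >>  8) & 0xFF), f);
    fputc((int)( v        & 0xFF), f);
}

/* Read it back the same way - no raw memcpy of longs, so there is nothing
   to come out "the wrong way around". */
static uint32_t get_be32(FILE *f)
{
    uint32_t v = 0;
    for (int i = 0; i < 4; i++)
        v = (v << 8) | (uint32_t)(fgetc(f) & 0xFF);
    return v;
}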
By the way, did you know that the Prop and Spin are very confused about their orientation as well?
Consider this Spin code:
CON
  c = $DEADBEEF

VAR
  long a
  long b

PUB start
  a := $DEADBEEF
  b := c

DAT
d       long    $DEADBEEF
Now have a look at the compiled listing:
DCURR : 0040
|===========================================================================|
|===========================================================================|
Object Untitled1
Object Base is 0010
|===========================================================================|
Object Constants
|===========================================================================|
Constant c = DEADBEEF (3735928559)
|===========================================================================|
|===========================================================================|
VBASE Global Variables
|===========================================================================|
VBASE : 0000 LONG Size 0004 Variable a
VBASE : 0004 LONG Size 0004 Variable b
|===========================================================================|
Object DAT Blocks
|===========================================================================|
0018(0000) EF BE AD DE | d long $DEADBEEF
|===========================================================================|
|===========================================================================|
Spin Block start with 0 Parameters and 0 Extra Stack Longs. Method 1
PUB start
Local Parameter DBASE:0000 - Result
|===========================================================================|
9 a := $DEADBEEF
Addr : 001C: 3B DE AD BE EF : Constant 4 Bytes - DE AD BE EF - $DEADBEEF 3735928559
Addr : 0021: 41 : Variable Operation Global Offset - 0 Write
10 b := c
Addr : 0022: 3B DE AD BE EF : Constant 4 Bytes - DE AD BE EF - $DEADBEEF 3735928559
Addr : 0027: 45 : Variable Operation Global Offset - 1 Write
Addr : 0028: 32 : Return
See what I mean? The long in the DAT section is stored little-endian (EF BE AD DE), but the Spin bytecode pushes that very same constant big-endian (DE AD BE EF).
Just to be sure here is a hex dump of the binary produced:
$ hexdump -C endian.binary
00000000 00 1b b7 00 00 6e 10 00 2c 00 3c 00 1c 00 40 00 |.....n..,.<...@.|
00000010 1c 00 02 00 0c 00 00 00 ef be ad de 3b de ad be |............;...|
00000020 ef 41 3b de ad be ef 45 32 00 00 00 |.A;....E2...|
0000002c
Do we really gain that much by using this pseudo-big-endian mode in ZOG? I guess I've asked this before but couldn't we just add a little-endian mode to the GCC code generator? Seems like it would play better with native Spin/PASM code.
This problem has been around forever: Intel is little-endian, Motorola is big-endian, etc.
Yes, I'm well aware of these endian issues when moving from one CPU to another. What makes this case different is that we're now talking about different endian modes on the same CPU. I guess there are CPUs that can be configured for either endian mode, but I think that configuration happens statically at initialization, not between running one program and another. On the other hand, I guess this is probably similar to running a JVM on a machine that has the opposite endianness as Java.
...couldn't we just add a little-endian mode to the GCC code generator
I'm sure it's possible. I have no idea how, might as well ask me to design a space shuttle:) Not only would the code generator need reversing but probably the assembler and linker as well.
We could take the other route and reverse all longs read/written by ZOG leaving the binary images as they are. But that is sure to be a performance hit as I have mentioned before.
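The swap itself is trivial - it's doing it on every single memory access in the interpreter loop that hurts. For reference:

#include <stdint.h>

/* Reverse the byte order of one 32-bit long. */
static inline uint32_t swap32(uint32_t v)
{
    return  (v >> 24)
         | ((v >>  8) & 0x0000FF00u)
         | ((v <<  8) & 0x00FF0000u)
         |  (v << 24);
}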
In that respect ZOG plays very well with Spin/PASM at the moment. We don't have to turn all the longs around, like PC, SP, etc., when going from one to the other. There is no reversing necessary when working with a floating point COG for example.
For sure if one is emulating a Motorola 68000 on a PC one has the same issues.
I guess the bottom line is that I need to treat ZOG as a big-endian platform regardless of the fact that it is running on the little-endian Propeller. I guess that's fine. I just hadn't thought of it that way before, so I have to do some rearrangement of my VM code. Or, I can just get my bytecode compiler to generate big-endian images.
By the way, I have my bytecode compiler running under ZOG. It is split into three phases: tokenize, parse, and generate. It writes a correct little-endian image but the VM gets confused by it since it's really big-endian. If I just change the code generator to create a big-endian image then I think everything should be fine.
A next step, as I've mentioned before, would be for me to change the bytecode compiler to target the ZOG VM instead of my Basic VM. That will take longer but would probably be more useful in the end.
Right now I'm trying to think about how to write a shell that would allow execution of ZOG programs from a command line or screen menu. It's easy to load and run a ZOG image but I need to pass argc/argv to the new image or have some other way to pass parameters. I guess I could just write the parameters to an SD card file but I'd like to find a memory-based way of handling parameters. Have you thought about this at all?
Thanks,
David
That looks like great progress on the BASIC compiler. Does it have a name? "Betz BASIC" sounds catchy:)
It would be great if you could target Betz BASIC at ZPU byte codes. That would enable Betz BASIC programs to be run on a PC under the ZPU VM in C or even on real FPGAs like this http://zpuino.blogspot.com/search?updated-max=2010-09-24T10%3A22%3A00-07%3A00&max-results=7
Heater: I have another question. I've been using the syscalls interface to handle raw sector reads/writes to the SD card. The rest of the FAT filesystem is handled by C code under ZOG. Unfortunately, when I enable syscalls, they get used for every call including terminal I/O. How hard would it be to add a low level sector I/O interface that would allow the remaining syscalls to be handled by C code? I know I'd have to edit the syscall code in the libgloss directory but at least the terminal I/O would still go through the normal terminal interface and I could handle open/create/close/read/write using the C FAT code running under ZOG.
Are you wanting to create your "shell" in Spin prior to loading and starting Zog on a ZPU image? I did briefly try to figure out where argc/argv were coming from but did not pursue it. There must be a place in the loaded ZPU image where parameters can be "POKED" prior to starting ZOG. Might need to have a look at the libgloss crtxxx routines and even make some small changes to _premain or whatever.
What we need here is a little OS for ZPU/ZOG. I believe Bill has one up his sleeve somewhere.
Have you had a look in zpugcc/toolchain/gcc/libgloss/zpu/syscalls.c ?
There you will find all the stuff you want to mess with I think, all the functions for file handling: open(), close(), read(), write() etc.
Now the nice thing about this is that all the functions there are defined with the "weak" attribute like so:
int __attribute__ ((weak))
_DEFUN (write, (fd, buf, nbytes),
int fd _AND
char *buf _AND
int nbytes)
{
........
}
This means that you can write new versions of those functions and just include them in your program. When the program is built the linker will take your new versions without any "duplicate" errors and drop the "weak" versions.
So, for example you could make a new version of write() that changed the way _use_syscall is used, or ignores it altogether. You would add tests of the file descriptor parameter "fd" to decide whether output goes to the serial port for stdio or whether it goes to the file writing function of your FAT file system code.
Similarly for the other file functions.
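As a concrete (untested) sketch of that - serial_tx() and fat_write() here are just placeholders for whatever your own drivers provide:

extern void serial_tx(char c);                            /* placeholder: UART output      */
extern int  fat_write(int fd, const char *buf, int n);    /* placeholder: your FAT FS code */

/* A plain (non-weak) definition: the linker picks this over the weak one in libgloss. */
int write(int fd, char *buf, int nbytes)
{
    if (fd == 1 || fd == 2) {                /* stdout/stderr keep going to the serial port */
        for (int i = 0; i < nbytes; i++)
            serial_tx(buf[i]);
        return nbytes;
    }
    return fat_write(fd, buf, nbytes);       /* everything else goes to the FAT code */
}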
By the way syscalls.c also contains _premain(). This is the place to look for figuring out where to get command line args from.
Ultimately I'd like to have a SD block driver COG that is used via a mail box so that Zog C FS code can use it without going through a syscall or any Spin code.
Thanks for all of your advice on syscall and argc/argv. It looks like argc and argv are hard-coded in syscalls.c in the _premain() function. I guess I can modify that code to pick up arguments stuffed in by the shell. Do you have a description of the memory layout of a ZOG binary image?
Do you have a description of the memory layout of a ZOG binary image
No. And that's probably an unanswerable question.
zpu-gcc produces an executable in ELF format which is full of code, data, debug sections. That is probably inspectable somehow but then that ELF is turned into a raw binary file for loading which has no formatting in it.
Thinking about it a bit I would proceed like this:
At location zero in the bin file, which is loaded to location zero in the ZPU memory, there is the _start code that sets up a couple of things and then jumps to _premain().
Following that little _start routine is the ZPU interrupt vector (not used yet) and following that is about 1K of vectors and instruction emulation code that ZOG does not use because it implements the full ZPU instruction set in PASM.
All this is in libgloss crt0.S. It has been my intention for a long while to provide an alternate crt0.S which drops all that emulation code and results in 1K smaller binaries.
So, here is a possibility: throw out all the unused vector code from crt0.S and instead provide a small area for a command line string, say 80 bytes or whatever, initialized to zeros.
Then the _premain() function is modified to find the command line string at that fixed location and create the argc/argv parameters for main().
When you want to set a command line just POKE the string into that fixed location in crt0 after loading the ZOG binary.
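A rough sketch of what that modified _premain() could look like - ARGS_ADDR, the buffer size and MAX_ARGS are invented for the example, and the real crt0/libgloss details would need checking:

/* Fixed, known location inside the image where the loader POKEs the
   command line (invented address for the sake of the example). */
#define ARGS_ADDR ((char *)0x40)
#define ARGS_SIZE 80
#define MAX_ARGS  8

extern int main(int argc, char *argv[]);

void _premain(void)
{
    static char *argv[MAX_ARGS];
    char *p    = ARGS_ADDR;
    char *end  = ARGS_ADDR + ARGS_SIZE;
    int   argc = 0;

    /* Split the zero-terminated command line on spaces, in place. */
    while (p < end && *p && argc < MAX_ARGS) {
        while (p < end && *p == ' ')
            *p++ = '\0';                       /* turn separators into terminators   */
        if (p < end && *p)
            argv[argc++] = p;                  /* remember the start of the next arg */
        while (p < end && *p && *p != ' ')
            p++;                               /* run to the end of it               */
    }

    main(argc, argv);
}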
I'm hoping we don't have to rebuild the libraries. Not at this stage anyway.
It should be possible to create a new crt0.S in our programs and tell zpu-gcc to use that instead of its own. It may need a compiler switch like -mno-crt.
Alternatively I'm hoping all those symbols defined in crt0.S have "weak" linkage so we can just write our own and the linker will drop the standard library ones.
I have only built the compiler on Debian Linux like so:
$ cd wherever/zpugcc/toolchain
$ source env.sh
$ ./fixperm.sh
$ ./build.sh
You may need to make the scripts executable first.
$ chmod +x fixperm.sh
$ chmod +x build.sh
Jazzed and myself have had difficulty building the compiler under Ubuntu. Also there are problems building on Windows with cygwin. There is a discussion about these issues going on now on the ZPU mailing list: http://mail.zylin.com/pipermail/zylin-zpu_zylin.com/2010-November/thread.html
Just now I have failed to build the hello world test with an alternative crt0.S and have asked about it on the mailing list.
Of course one could modify the crt0.S prior to doing the compiler build. But that seems a bit extreme to me.
I just finished building GCC using the instructions you provided. I haven't tried running the resulting code yet though. I think I also figured out how to build the library by itself as well. Try this:
# modify syscalls.c or some other part of libgloss
cd toolchain/gccbuild
make all-target-libgloss
make install-target-libgloss
You will then find the new libbcc.a in toolchain/install/zpu-elf/lib/libbcc.a.
Started adding some things to my debug_zog.spin for doing a demo.
Decided to use Baggers' TV_Half_Height.spin 40x30 driver. I wish I could adjust the vertical start position in the driver - will look at that tomorrow.
For the time being, just printing whatever goes to the serial port in on_output is reasonable. Not sure what to do about the keyboard in on_input yet, though. I'll probably just put in another #ifdef.
I'm adding these fragments:
userdefs.spin:
con '' console tv settings
  TvPin      = 20 '' the TV DAC start pin to use for TV out
  TvNTSCPAL  = 0  '' the TV NTSC constant = 0, PAL = 1
  TvINTRLACE = 1  '' the TV interlace setting (0 or 1)
  TvHiCOLS   = 40 '' the TV HALF HEIGHT max columns
  TvHiROWS   = 28 '' the TV HALF HEIGHT max rows
debug_zog.spin:
OBJ
  zog : "zog"
  ser : "FullDuplexSerialPlus"
#ifdef USE_TV_TEXT
  tv  : "TV_Text_Half_Height"
#endif
...
PUB start : okay | n
  ser.start(def#conRxPin, def#conTxPin, def#conMode, def#conBaud) 'Start the debug Terminal
  ser.str(string("ZOG v1.6"))
#ifdef USE_TV_TEXT 'Start the TV Terminal
  tv.start(def#TvPin,def#TvNTSCPAL,def#TvINTRLACE,def#TvHiCOLS,def#TvHiROWS)
#endif
...
PRI on_output
  case zog_mbox_port
    UART_TX_PORT:
      ser.tx(zog_mbox_data)
#ifdef USE_TV_TEXT
      case zog_mbox_data
        $a: tv.out($d)
        $d: 'tv.out($d)
        other:
          tv.out(zog_mbox_data)
#endif
    other:
      ser.str(string("Write to unknown output port "))
      ser.hex(zog_mbox_port, 8)
      ser.str(string(" : "))
      ser.hex(zog_mbox_data, 8)
      crlf
  zog_mbox_command := 0
I've added the TwiKeyboard.spin driver to my debug_zog under #define USE_TWI_KEYBOARD; it works fine with the hello.c program. Thanks to Baggers I have a better vertical start position for TV_Half_Height.spin now.
I'll start working on a more interesting demo now. Wish I had a mouse cursor in the TV text driver
Have you tried running the ZPU on an FPGA? I've got a Digilent Spartan 3 board with an S400, and tried compiling and synthesising the small ZPU core. It built OK (albeit with lots of warnings) and I might try simulating it later.