Maybe I'm wrong here, but I assume all the compilers out there still have bits that are not open source? If so, and one has to start from scratch, that Pascal compiler mentioned earlier might be where I'd start. A somewhat daunting, but not impossible, task.
There are plenty of completely open source compilers. Start with something "Spin-like". Given its expression syntax, an open source C compiler might be a good start (happy to be corrected by those that have already done work on parsing Spin).
Your main problem with adapting an existing compiler to compile SPIN is going to be the indentation style. Languages that use indentation for block scope (like Python and Spin) are notoriously hard to parse. Of course, if you're going to start from scratch, you could also fix that particular design defect as well! (ducks for cover )
Well, when it comes down to actually writing a multipass compiler, call me crazy, but I was thinking one of those passes would be an 'indent' parser, whose sole job would be to go through the code and replace indents with { and }. That is an internal step of the compiler, so nobody need see it. [also ducks for cover]. But by making the code more C-like, the next step could be a C compiler. Do you have any links to such open source C compilers? I'm thinking out loud about whether it would be easier to build a compiler for Spin from scratch vs translating Spin to C and compiling that.
I once saw an example of 'obfuscated C', a flight simulator whose source, when printed out, was in the shape of a plane. What this highlights is that C does not need to worry about indents. If the braces and the semicolons are there, the compiler knows what to do next. So you could add semicolons as part of the compilation process too. Then the indent problem goes away.
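The 'indent pass' idea sketched above can be illustrated in a few lines of Python. The helper name `braceify` and the behavior (tabs already expanded to spaces, blank lines skipped) are assumptions for this sketch, not any real compiler's pass:

```python
def braceify(lines):
    """Replace indentation-based blocks with explicit { and } markers.

    Tracks a stack of indent levels: emit '{' when the indent deepens
    and '}' for each level it returns past. Assumes tabs are already
    expanded to spaces and skips blank lines.
    """
    out, levels = [], [0]
    for line in lines:
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip(' '))
        if indent > levels[-1]:
            levels.append(indent)
            out.append('{')
        while indent < levels[-1]:
            levels.pop()
            out.append('}')
        out.append(line.strip())
    out.extend('}' for _ in levels[1:])  # close any blocks still open
    return out
```

A later pass (or a C compiler, per the idea above) then never has to think about whitespace at all.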
Any links for more open source compilers would be most appreciated!
Dr_Acula, you really should learn to use Google! Googling "open source" and "C compiler" gives you this as the first result. Any entry that says anything other than "Proprietary" could be considered "Open Source" (it depends on your definition).
I'll leave you to google for "difficulty in parsing indented languages"
Yes I have looked at some of those. The problem with some appears to be that 'freeware' means that the compiled program is free. Not necessarily that the source for the compiler is free, though I am sure that code exists.
Some of Spin seems more like Pascal, so that Pascal compiler will be very handy.
Anything that says GPL or BSD is free source code. Some of the freeware ones (notably LCC, on which Catalina is based) are also free source code.
You should recognize that each SPIN block (DAT, PUB, PRI etc) is actually a completely different syntax. DAT blocks are very simple to parse - any of the many open source configurable assemblers can compile PASM reasonably easily (I actually started on one before I found Homespun!). PUB and PRI blocks are more complex - that's where you need a real parser (or compiler). I don't see the Pascal bit - to me (and apart from the indenting) Spin looks a lot like a subset of C - certainly the expression syntax and operators are derived from C.
Yes, you certainly find the syntax differences when you think about a compiler, e.g. a = 5 in a CON section vs a := 5 in a PUB.
The indenting problem needs a little work in that I think tabs need to be converted to spaces, then spaces to blocks with end markers.
Mathematical expressions are the fun part. BODMAS etc. I started thinking about Reverse Polish Notation, and that led me to this link http://en.wikipedia.org/wiki/Shunting-yard_algorithm by Edsger Dijkstra which I think you can generalise for Spin expressions. That link even comes with some example C code!
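The shunting-yard algorithm from that link can be sketched in a few lines of Python. The operator table here is a toy subset for illustration, not Spin's full precedence table:

```python
# Illustrative precedences only -- Spin's real operator table is larger.
PREC = {'+': 1, '-': 1, '*': 2, '/': 2}

def to_rpn(tokens):
    """Convert infix tokens to Reverse Polish Notation (Dijkstra's
    shunting-yard algorithm): numbers go straight to the output,
    operators wait on a stack until a lower-precedence one arrives."""
    output, ops = [], []
    for tok in tokens:
        if tok.isdigit():
            output.append(tok)
        elif tok in PREC:
            # pop operators of higher-or-equal precedence (left-associative)
            while ops and ops[-1] in PREC and PREC[ops[-1]] >= PREC[tok]:
                output.append(ops.pop())
            ops.append(tok)
        elif tok == '(':
            ops.append(tok)
        elif tok == ')':
            while ops[-1] != '(':
                output.append(ops.pop())
            ops.pop()  # discard the '('
    while ops:
        output.append(ops.pop())
    return output
```

Once in RPN, code generation is trivial: each operator becomes a bytecode that pops its operands off the stack.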
Have a read of the Crenshaw paper. A few points about that:
1) He describes a method of parsing and compiling expressions that does not use any complicated expression and operator stacks as in the shunting yard algorithm.
Rather he does it via a technique called "recursive descent". Basically you have a function to handle expressions. As you come to the various operators you call other appropriate functions to handle the sub-expressions, and whilst in those functions you call further functions to handle more sub-expressions. For expressions with nested sub-expressions in brackets and such, you can end up calling the same functions again, recursively.
The idea is to use the stack inherent in the language your compiler is written in to keep track of where you are rather than messing with manually maintained stacks. It's dead easy and I suspect it is the kind of technique used by Chip in the Spin compiler.
2) I suspect that the white space block demarcation of Spin can be handled in a similar way to the way comments are handled. That is when parsing a line you know what indentation level you are on and what language construct you are in, if, repeat etc, and you know you should introduce a new level or jump out of one by the amount of space on the current line. This is helped by the fact that statements cannot cover multiple lines in Spin. You have to maintain a "state" of where you are, much like you do for block comments. I can see that white space indentation might be harder for traditional parsers, lex, bison and co.
3) Crenshaw manages to parse the source and generate code all in one go. Without a lexical analysis and tokenizing pass. I bet Chip, being a minimalist, does this in the Spin compiler.
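The recursive descent technique from point 1 can be sketched as a toy Python evaluator. It handles only integers, + - * / and parentheses, so it is an illustration of the technique, not Spin's grammar or Crenshaw's code generation:

```python
def parse_expr(tokens):
    """Recursive-descent expression evaluator.

    Each precedence level gets its own function; parentheses simply
    recurse back into expr(), so the host language's call stack tracks
    where we are, with no explicit operator/operand stacks as in the
    shunting-yard approach. The token list is consumed left to right.
    """
    def factor():
        tok = tokens.pop(0)
        if tok == '(':
            val = expr()
            tokens.pop(0)       # consume the ')'
            return val
        return int(tok)

    def term():
        val = factor()
        while tokens and tokens[0] in '*/':
            op = tokens.pop(0)
            rhs = factor()
            val = val * rhs if op == '*' else val // rhs
        return val

    def expr():
        val = term()
        while tokens and tokens[0] in '+-':
            op = tokens.pop(0)
            rhs = term()
            val = val + rhs if op == '+' else val - rhs
        return val

    return expr()
```

In a real compiler each function would emit code instead of computing a value, which is exactly how Crenshaw gets away without a separate tokenizing pass.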
Any compiler experts out there? Is it common to use recursive descent compilers nowadays, or is the shunting-yard style still in use?
Anyone know if my guesses re: Chip's approach are even remotely correct?
Recursive descent is still widely used. There are also tools such as yacc and bison that automatically generate C code for a parser (a table-driven LALR parser, in their case) from a description of the language. Together with a tool for lexical analysis such as lex or flex, a compiler front end can be created in a few hours.
The language has to be suitable, of course. There are some grammars (left-recursive ones, for instance) that recursive descent can't handle without rewriting.
A rudimentary compiler can be written in a page of code using the SNOBOL language.
OK. My question was the other way round really. I could be wrong, but I got the impression from various books and papers I've seen that early compilers used auxiliary operator and operand stacks, mostly because they were written in assembler, I guess, and that was easier to manage. Then later efforts took advantage of the high level language they were written in, hence recursive descent.
lex and yacc. Yuk. I get the feeling it's as easy to parse Spin with a manually created parser, especially since it has white space block delimiters and is essentially a line-based syntax like good old BASIC.
Somewhere I have a half-baked PASM assembler that parses most of the Spin syntax for expressions, needed because it was intended to assemble PASM from Spin-compatible source files including all normal immediate value expressions, DAT definitions, constants etc. That project stalled because I was writing it in object-oriented Ada (yes, I'm crazy), which ultimately ran away from me. And then came Homespun and BST, so the motivation evaporated.
I was curious about which Spin bytecodes are used the most, so I added some code to SpinSim to accumulate a count of the opcodes that were executed. I used four different programs to accumulate the statistics. They were the CLIB demo program, a dhrystone benchmark program, a floating point demo from the OBEX and a Trig Pack demo, also from the OBEX.
The results are shown below. I combined the percentages from each of the four programs with equal weights. The ldllc (Load long local compact) instruction was used the most. This is the instruction that pushes the value of one of the first 8 local stack variables onto the stack. jz (Jump if zero) was the next most frequent opcode. These two opcodes account for 25% of the opcodes executed.
50% of the opcodes executed were done with only 8 opcodes. These include 3 that load constants -- ldbi (Load byte immediate), ldli1 (Load long immediate one) and ldlip (Load long immediate packed). Two more instructions address the first 8 local stack variables -- exllc (Execute long local compact) and stllc (Store long local compact). The remaining instruction was ldba (Load byte absolute), which pops an address off of the stack and loads a byte from memory.
95% of the opcodes executed were either math ops, non-indexed memory ops, or program flow such as jmp, call, ret and ldfrm (Load stack frame). The indexed memory ops and most of the other lower ops were rarely used. Of course, other programs will use a different mix of instructions, which could include more of the indexed memory ops and lower ops.
If a new Spin interpreter is developed, it might be good to optimize the more frequently used ops. Opcodes that are rarely used could be overlaid from hub RAM or executed as LMM instructions.
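The counting described above amounts to tallying opcode frequencies per program and averaging the percentage mixes with equal weights. A minimal sketch (the trace format is assumed for illustration, not SpinSim's actual output):

```python
from collections import Counter

def tally(traces):
    """Combine per-program opcode frequencies with equal weights.

    Each trace is a list of executed opcode names. Each program
    contributes its percentage mix rather than its raw counts, so a
    long benchmark run doesn't swamp a short demo program.
    """
    combined = Counter()
    for trace in traces:
        counts = Counter(trace)
        total = sum(counts.values())
        for op, n in counts.items():
            combined[op] += 100.0 * n / total / len(traces)
    return combined      # opcode -> weighted percentage
```

Sorting the result by value gives the "optimize from the top of the list" order discussed later in the thread.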
Nice work, Dave.
This is pretty typical of all compilers, and was supposed to be one of the original factors driving RISC vs CISC architectures. The conventional wisdom is that CISC architectures were popular when building good software (i.e. compilers) was more difficult than building good hardware, and RISC architectures became popular as software became easier to build than hardware.
Personally, I think there is a case to be made for exactly the opposite - i.e. CISC was popular when software was cheap and cheerful - and was often written by the same engineers who designed the hardware. When software became more complex (and hence more expensive to build and maintain, and also the bottleneck on developing new architectures) then CISC had to be abandoned because compiler construction was simply too cumbersome and difficult.
In any case, even on modern RISC architectures compilers tend to use only a fraction of the available instruction set (but probably more than in the days of CISC architectures - anybody remember the VAX instruction set? - I think you got a medal if you could even figure out what some of the instructions did!).
Seven of the first eight instructions are very simple, and each of them could be coded in two or three lines of PASM. exllc is more complicated because it performs multiple functions. It fetches a local variable, performs an operation on it based on the next opcode, stores the result back to the local variable, and then it conditionally pushes the result onto the stack. The next opcode can be one of the 32 math ops, or it can be a special operation, such as auto-increment, auto-decrement, sign extension, and a small number of other operations.
It would be interesting to start with an LMM PASM version of the interpreter, and then optimize each instruction starting from the top of the list. I'm guessing that the execution time would break even after optimizing the first 40 instructions. Everything after that would be a speed improvement over the current interpreter.
I read the whole thread before replying, and I agree with you that writing a Spin byte code interpreter that used straight PASM for the most used spin byte codes and LMM for the rest should be a win overall - especially since it would be possible to add new byte codes for single precision floating point support right into it!
I know float is not needed for most applications, but it can be nice.
Are you guys saying that an LMM Spin interpreter would outperform the current interpreter and possibly provide extra features like floating point and unsigned LONG support?
(I'm talking regular small Spin at the moment)
If so:
1) Something like this should go into the Prop II where the LMM kernel and LMM code of the interpreter could live in the extra PROM space it will have.
2) This has repercussions for Parallax's idea for a C to byte code compiler.
As for 2): the current plan for a Prop C compiler seems a lot less than ideal, as it is targeting current Spin byte codes and hence looking to drop C features like unsigned ints. That is to say, it is NOT C. Bad idea. But with an LMM interpreter the byte codes could be extended to provide full C support.
The LMM Spin interpreter could outperform the current interpreter if frequently used opcodes are implemented in pure PASM within the cog. Additional opcodes for floating point and unsigned LONG could be added if there is room in the cog memory. There would be a tradeoff between the number of new opcodes added and the speedup of the current ones: with fewer new opcodes you could fit more of the old opcodes in cog RAM.
New opcodes could be added by pairing up the unused $3f code with another byte. So the byte pair $3f $00 could be a floating-point multiply, $3f $01 could be a divide, $3f $02 could be an unsigned long divide, and so on. There should be a byte pair reserved for an LMM instruction so that user LMM code could be executed.
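The proposed escape-byte scheme is easy to sketch. The extension numbers and mnemonics below are just the suggestions from this post, not anything in the real interpreter:

```python
# Hypothetical two-byte extended opcodes: $3f is the escape byte and
# the following byte selects the new operation (per the suggestion above).
EXTENDED = {0x00: 'fmul', 0x01: 'fdiv', 0x02: 'udiv', 0x03: 'lmm'}

def decode(bytecode, pc):
    """Return (mnemonic, next_pc) for the opcode at pc."""
    op = bytecode[pc]
    if op == 0x3F:                          # escape: next byte is the extension
        return EXTENDED[bytecode[pc + 1]], pc + 2
    return 'op_%02x' % op, pc + 1           # ordinary single-byte opcode
```

The interpreter's fetch loop pays the extra byte only when an extended op is actually used, so the common single-byte path is unaffected.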
I think a fully compliant C compiler could be written using the current Spin bytecode interpreter. Signed ints can be used for most operations, since sign doesn't matter for them; the only operations where sign matters are >, <, >>, multiply, divide and mod. Function pointers can be implemented using the technique I suggested for method pointers, where the PBASE, VBASE, PCURR and stack variable size are stored in a struct somewhere, and the compiler just presents a pointer to the struct. The compiler can implement C jumps with the jmp opcode; it just has to ensure that the stack is maintained properly. Even a longjmp can be implemented by using the abort opcode, though it would require that the compiler route all the calls through code associated with the setjmp routine.
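As an illustration of handling the sign-sensitive ops on a signed-only bytecode: an unsigned comparison can be synthesized by flipping the sign bit of both operands and then doing an ordinary signed compare. A Python sketch (the real compiler would of course emit bytecodes, not call a function):

```python
def u32_less(a, b):
    """Unsigned 32-bit < using only a signed comparison.

    XORing both operands with $8000_0000 maps unsigned order onto
    signed order, so a compiler targeting a signed-only bytecode can
    emit the XOR followed by the existing signed compare opcode.
    """
    def to_signed(x):
        x &= 0xFFFFFFFF
        return x - 0x100000000 if x & 0x80000000 else x
    return to_signed(a ^ 0x80000000) < to_signed(b ^ 0x80000000)
```

Similar small fix-ups cover unsigned >> (mask after shifting) and unsigned divide/mod (adjust operands and result), which is why dropping unsigned ints entirely is unnecessary.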
Parallax's approach is to write a pre-processor that converts C source to Spin source. This will be hard to do unless they add some extensions to the Spin language to support the C features I mentioned above. However, there shouldn't be a problem if they compile directly to bytecodes.
I created an LMM version of the Spin interpreter. It's about 7 to 8 times slower than the normal Spin interpreter, which I expected since the LMM loop takes 8 instruction times. I then implemented the more frequently used instructions within the COG to speed up the interpreter.
After I implemented 38 of the 256 instructions inside the cog I got it down to about 1.5 times slower. I quit at this point because I realized that I would have to implement many more instructions inside the cog to break even, and then even more to get a significant gain.
I then tried another approach by using a jump table stored in the hub RAM, which I believe Cluso and jazzed have also done. I also optimized six of the more frequently used instructions. This produced a 25% speed improvement, which Cluso has reported as well. The biggest barrier to getting any further improvement is the lack of cog memory. If the cog memory had an extra 100 or 200 longs we could probably achieve a doubling of the speed.
My next thought was to split the interpreter between two cogs. The motivation isn't to use the extra processing capability of the second cog, but to use its memory. The multi-cycle instructions could be moved to the second cog to reduce the effect of the inter-cog overhead. These instructions would be sqrt, strsize, strcomp, bytemove, wordmove, longmove, bytefill, wordfill, longfill, encode, randf and randr.
In SpinLMM I've gotten better performance on strsize and strcomp for large string sizes. These use tighter loops that run under the FCACHE instruction. Some of the other multi-cycle instructions can be improved also, and others will be about the same. The main advantage is that this will free up more space in the main cog so that the other instructions can be optimized.
This approach would also allow implementation of language extensions that were discussed earlier, such as floating point operators and an LMM interpreter.
Is it too late to add more cog memory to the Prop 2 design?
This sounds fascinating. 1.5x slower might not even be noticeable.
If you do go for the second cog, is there any chance you could keep a version that is a little slower but is a one cog version? Maybe some way of selecting which one you want to use?
What sort of code would "LMM Spin" run? Presumably that lives in a larger memory space, and if so, how does a compiler cope with things like large arrays (>32k) and jumps to distant code? Does this need a different compiler?
If we want to just access external RAM we can probably modify the current Spin interpreter and fit it into one cog. There are several places where the interpreter does a "rdbyte op, pcurr" followed by "add pcurr, #1" and "sub dcurr, #4" followed by "rdlong x, dcurr". This can be replaced with "call #readpcurr" and "call #readdcurr", which would fetch the data from hub RAM or external RAM depending on the address. We can do a few more things to free up more memory for the memory access routines.
If we want to speed up the Spin interpreter I think more cog memory is the best way to go. Using a second cog would achieve this. The LMM approach would work for a single cog, but I don't think the speedup will be as much. LMM Spin could run user LMM code and we could add intrinsic support for floating point. The compilers don't support adding new intrinsic functions, but BST does have a "bytecode" function that would allow inserting new codes.
I did this last year. The problems are: 1) some language functionality must be sacrificed to provide memory access (I removed sqrt, encode/decode, and random), 2) there is only room to modify the fetch; the stack and variable spaces remain in HUB, and as a result writing programs becomes very difficult because of losing VAR access and STRING references. Now if you can give up more language features, some of the new problems could be resolved by putting everything into external memory and getting HUB access via a special address, but then SPIN loses more features.
A similar exercise with a Cache COG for external memory access, using overlays, and not giving up any features results in a SPIN that's at least 3 times slower than normal.
Conclusion? Use Catalina or some other language. ZOG works on C3 and the SDRAM solutions. Hopefully I can resume working on Catalina SDRAM solutions very soon. I'm thinking that Bean's LMM BASIC could also get "BIG" with external memory.
I spent a little more time on the dual-cog interpreter and achieved a 33% speed improvement. I moved strsize, strcomp, bytemove, bytefill, wordmove, wordfill, longmove and longfill to the second cog. The strxxxx and xxxxfill routines use tighter loops, so these operations should be faster for block sizes above a certain amount.
I also moved ops that are not used as much, such as the lockxxxx and cogxxxx instructions. The second cog uses about 150 longs at this point. The first cog implements most of the instructions and the more frequently used ops are optimized for speed. I used a jump table that is stored in hub RAM.
Up to this point the two cogs run effectively as a single cog, but with more cog memory. The second cog waits for the first cog to post an instruction, and then the first cog waits until the second completes the operation. It may be possible to get some additional speedup by overlapping the processing of the two cogs. The first cog could fetch the next instruction and decode it while the second cog is working on the previous instruction. My goal is to achieve a speedup factor of 2 to 1.
The 2 cog approach could be used to implement Big Spin. There is enough room in the second cog to move more code from the first cog. This should allow external memory routines to be added.
That is great news. Would you need a special compiler? Or use that emulator you have with a flat memory space?
If you want an executable binary file larger than 32K you would need a special compiler. However, it would be possible to use the existing compilers to create objects that fit within 32K, and then link them together to form programs that are much bigger than 32K. This would be done by creating stub routines that would perform a 32-bit call to a method in another object.
As an example, let's say a program consists of a top object and a single child object. This is split into two programs. The first program consists of the top object and a stubbed version of the child object. The stubbed version contains only the PUB and PRI statements from the child object, plus code that performs a 32-bit jump into the correct method of the child program. The linker would just copy the parent program and the child program together, and it would add a 32-bit pointer that is used by the child object stubs to call the methods in the child program. It sounds complicated, but it's actually pretty straightforward to implement.
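At toy scale the linking scheme looks like this. The names `link` and `stub_call`, and the list-of-slots image model, are hypothetical illustrations of the idea, not real tool code:

```python
# Toy model: an "image" is a list of code slots. The linker concatenates
# the parent and child images and records the 32-bit base pointer that
# the child-object stubs will use.
def link(parent_image, child_image):
    """Concatenate the two images; return (combined, child_base)."""
    child_base = len(parent_image)      # the 32-bit pointer patched in
    return parent_image + child_image, child_base

def stub_call(image, child_base, method_offset):
    """What a stub does: a far 'jump' to a method inside the child."""
    return image[child_base + method_offset]
```

The key point is that each program still compiles on its own within 32K; only the stubs and the patched base pointer cross the boundary.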
SpinSim could be modified to support a larger memory to test it out. I'm planning on adding a Prop 2 mode to SpinSim anyhow. I could just add an option to specify the size of hub RAM. A size of 128K would simulate the Prop 2. I could also simulate external RAM. What type of external RAM is normally used, and how is it interfaced?
Edit: I looked at the driver for the HX512 memory card. It has an 8-bit interface, and it clocks out an 8 or 16-bit address in one or two bytes. It then writes a byte or reads a byte depending on how the control bits are set. It has an auto-increment/decrement feature so that successive bytes can be accessed without writing a new address. Bytes located beyond $ffff are addressed by auto-incrementing with dummy reads (Yuck!!!!). What do people really use for external RAM?
There are many options
Here is my sales-pitch for the "C3 / SDRAM ZOG Cache" interface:
Advantages:
Cache abstracts the hardware type. Just replace the object.
Fast access to external memory using buffer pointers.
No need to tell the Cache COG the type of data access.
Cache is write-back. Writes only happen when a block is swapped.
Allows practically any size external memory.
Supports refresh required for cheap dynamic RAM.
Cache store size is flexible and can use as little as 2KB of HUB.
Disadvantages:
Cache requires a second COG.
Cache for some designs is less efficient than directly accessing the hardware.
The C3 / SDRAM ZOG Cache interface is based on a cache model which lives in a second COG and supports > 64KB memory. This allows using the same routines for any memory type, thus abstracting away the hardware details for the many options.
The cache model is fairly easy to use. For example, the zog.spin interpreter needs access to byte, word, and long data and there is only one instruction difference between the different accesses. Below I've provided a sample of reading a word from external memory.
The cache model is FUNCTIONAL. There are variations on the lower level implementation. For example SDRAM uses a direct mapped cache where C3 uses a VM cache based on VMCOG routines. SDRAM is so fast compared to the SPI FLASH/RAM that little performance is lost by using direct mapped cache - there are optimizations still to be realized.
One optimization in consideration for the cache model is to use two cache pointers, since ZOG commonly accesses different memory areas that may collide back to back. That optimization would make the cache even more efficient by requiring fewer off-COG accesses.
Here is a sample code fragment for reading a word from external memory. Reading a byte or a long would mean simply replacing the "rdword" instruction.
'------------------------------------------------------------------------------
' read a word from external memory
'
mov addr, address
xor addr, #%10 'XOR here is an endianness fix - ZOG ONLY
call #zpu_cache ' check if addr is in local cache - sets Z if available
if_ne call #cache_read ' if not in local cache, read from second cache COG
rdword data, memp 'the address in cache to read is returned in memp
The supporting zpu_cache and cache_read routines look like this:
'------------------------------------------------------------------------------
zpu_cache
mov memp, addr 'ptr + mboxdat = hub address of byte to load
and memp, #(cache#LINELEN-1)
mov temp, addr 'ptr + mboxdat = hub address of byte to load
andn temp, #(cache#LINELEN-1)
cmp cacheaddr,temp wz 'if cacheaddr == addr, just pull from cache
add memp, mboxptr 'add ptr to memp to get data address
'memp gets overwritten on a miss
zpu_cache_ret ret
'------------------------------------------------------------------------------
cache_write mov memp, addr 'save address for index
andn addr, #cache#CMD_MASK 'ensure a write is not a read
or addr, #cache#WRITE_CMD
jmp #cache_access
'------------------------------------------------------------------------------
cache_read mov memp, addr 'save address for index
or addr, #cache#READ_CMD 'read must be 3 to avoid needing andn addr
'------------------------------------------------------------------------------
cache_access ' if cacheaddr <> addr, load new cache line
wrlong addr, mboxcmd
mov cacheaddr,addr 'Save new cache address. it's free time here
andn cacheaddr,#cache#LINELEN-1 'Kill command bits in free time
:waitres rdlong temp, mboxcmd
tjnz temp, #:waitres 'We have room for tjnz 4 or 8 cycles
rdlong mboxptr, mboxdat 'Get new buffer
and memp, #(cache#LINELEN-1) 'memp is index into buffer
add memp, mboxptr 'memp is now HUB buf address of data to read
cache_read_ret
cache_write_ret
cache_access_ret ret
'------------------------------------------------------------------------------
The external SDRAM direct-mapped Cache COG code that manages the buffers is very small, allowing for long lines of buffer reads and writes. With an 8-bit data interface it can swap 64 bytes in 14us at an 80MHz system clock, i.e. roughly 5MB/s. A 16-bit interface used with MicroPropPC is 20% faster than the 8-bit interface at 80MHz.
The C3 Cache COG uses the VM swap algorithm, which has larger management code but is more efficient in determining swap requirements because of cache line aging.
Just for completeness: an alternative is the VMCOG interface, which accomplishes basically the same thing, but the interface is marginally slower because it always consults the second COG (adding overhead to every access), is more complicated for each type of data access, and supports only 64KB of external memory at this time.
Just a short reply: congratulations to you all for running this discussion, and thanks for spending time doing all the experiments and development. I am sorry that I do not have the time to follow, understand fully and make use of this. But it is great to watch you thinking!
ErNa
Hi Dave,
You can update the CPLD firmware on the HX512 to make all 512KB directly accessible. See the following threads:
Ross, thanks for the information. I should have assumed someone would have figured out how to directly address all 512K.
jazzed, the information on the cache software is very interesting. Has anyone tried using a hash code to control the cache? The hash code maps the full address space into the smaller cache space, and it may improve the efficiency of the cache.
This is something I should be able to simulate in SpinSim. This would allow me to gather statistics on cache efficiency.
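Such a simulation is straightforward. Here is a minimal sketch of measuring the hit rate of a direct-mapped (modulo-N hash) cache; the line size and line count are illustrative parameters, not the real driver's values:

```python
def hit_rate(addresses, lines=32, line_len=64):
    """Simulate a direct-mapped cache and return its hit rate.

    The index is (addr / line_len) mod lines -- the same modulo-N hash
    the SDRAM cache COG uses. Each tag records which external-memory
    block currently occupies that cache line.
    """
    tags = [None] * lines
    hits = 0
    for addr in addresses:
        block = addr // line_len
        idx = block % lines
        if tags[idx] == block:
            hits += 1
        else:
            tags[idx] = block       # miss: swap the line in
    return hits / len(addresses)
```

Feeding it an address trace from an interpreter run would show exactly the pathological case discussed below: two hot regions whose blocks map to the same index evict each other on every access.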
Cache efficiency is obviously a big concern, and I would appreciate having a way to measure the effectiveness over a range of addresses. Zog tends to access stack and .text code areas often, so that is of particular interest. I would appreciate any efforts to improve performance.
The cache I'm using for the SDRAM COG is a modulo-N hash, i.e. direct-mapped with no replacement policy, which is fine because, depending on the model, lines can be swapped in less than 14us (8-bit data) or 10us (16-bit data). David Betz uses exactly the same interface as given above with Bill's VMCOG replacement-policy cache for C3 - which makes sense because SPI access time is glacial by comparison to SDRAM.
A 2 way set associative cache would improve performance over direct mapped for SDRAM.
A general improvement would be to keep 2 line pointers in the zpu_cache routine which runs in the interpreter COG and swap there so that there is less need to interface with the second COG. There is also a small optimization for zpu_cache that i did not include.
I did not present the direct mapped (modulo N hash) cache code before. Here is the main fragment - some routines and constants are omitted for focus, but can be found in the attachment.
'====================================================================
' mailbox command spinner loop
' user will get a block pointer back and access the block with a mod index such as:
'
' mov index,addr ' user must do something like this for returned array index
' and index,#(LINELEN-1) ' index can be only up to LINELEN-1
'
sdramDone
cmdloop djnz refresh,#norefresh ' check refresh every time ... djnz here keeps window
call #sdram_refresh ' if refresh timer expired, reload and refresh
norefresh rdlong addr,cmdptr wz ' get command/address ... command is in bit 0..1
if_z jmp #cmdloop ' if zero, do nothing
mov clineptr,addr ' get the cache line pointer offset from address
shr addr,#LINESHFT wc ' test for read/write into carry
movs readtag,addr ' (addr / LINESIZE) mod TAGCOUNT is tag / cache line offset
andn readtag,#$200-TAGCOUNT ' user sends full address. we only load blocks
' add readtag,#8 ' for debugger only
muxnc dirty,_TAG_DIRTY ' save new tag dirty state - fill pipe slot
readtag mov tag,0-0 ' get tag line contents
cmpsub tag,_TAG_DIRTY wc ' get the old dirty tag
cmp addr,tag wz
muxc tag,_TAG_DIRTY ' restore the old dirty tag
if_e wrlong zero,cmdptr ' if match let user know we're done early ...
' we rely on the fact that the user is looking for 0 cmdptr
' and must use another HUB access to get data.
' problems may come if the user's cog is lower than this cog number
and clineptr,_CLP_MOD ' user sends full address. we only load blocks
add clineptr,cacheptr ' get cache line
wrlong clineptr,datptr ' send cache buffer to user - data may change before we're done
if_ne call #flush ' if bad tag, flush - cache code never changes Z flag
if_ne wrlong zero,cmdptr ' let user know we're ready after flush ...
writeTag movd tagit,readtag
or tag,dirty ' save new tag dirty state
tagit mov 0-0,tag
jmp #sdramDone
'--------------------------------------------------------------------
' flush method saves dirty page and loads page from backstore
'
flush ' if dirty bit set, write old buffer before read
if_nc jmp #flush_rdonly
flush_write mov saveaddr,addr ' save addr for read
mov addr,tag ' get cacheline tag addr
shl addr,#LINESHFT ' physical address to write cacheline
call #sdramBuffWrite ' save cacheline
mov addr,saveaddr ' restore addr for read
flush_rdonly mov tag,addr ' move address clears dirty/count
shl addr,#LINESHFT ' physical address to read cacheline
call #sdramBuffRead ' if zero, load cache line
flush_ret ret
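For what it's worth, the write-back behaviour of the flush routine above can be modelled in a few lines of Python. The dict-based backing store and field names are illustrative only, not part of the driver:

```python
# Rough Python model of the write-back flush above: a dirty line is written
# to the backing store (the sdramBuffWrite step) before the wanted line is
# read in (the sdramBuffRead step). Store and field names are illustrative.

LINESHFT = 6            # assumed: 64-byte cache lines

def flush(cacheline, backstore, want_addr):
    if cacheline["dirty"]:                        # write old buffer before read
        backstore[cacheline["tag"]] = cacheline["data"]
    new_tag = want_addr >> LINESHFT
    cacheline["data"] = backstore.get(new_tag, bytes(1 << LINESHFT))
    cacheline["tag"] = new_tag                    # new tag; dirty state cleared
    cacheline["dirty"] = False

store = {}
line = {"tag": 0, "dirty": True, "data": b"old line contents"}
flush(line, store, 5 << LINESHFT)                 # ask for line 5
print(store[0], line["tag"])                      # prints: b'old line contents' 5
```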
I've attached a full 8bit SDRAM access implementation for you to consider (a 16bit implementation is used on MicroPropPC that is much smaller and faster). Please pardon some of the mess. One could easily replace all of the SDRAM specific stuff (sdram_refresh, send_initmode, sdcmd_*, send_address, sdramBuffRead, sdramBuffWrite) with another set of code.
If it's of value, I can make this driver more flexible for adding other hardware using Homespun's #include feature or just provide a shell with fill-in-the-blank hardware routines.
I could also simulate external RAM. What type of external RAM is normally used, and how is it interfaced?
I use a 512k memory chip so it is a lot simpler than things like caches. Set up a cog that loops looking for an input. Put the address in a location plus a flag to say 'read' or 'write' and it returns the byte. The actual code in pasm will be different for different hardware. I also have a slower version written in spin.
For LMM models, the code to handle the RAM is part of the LMM code.
The code for the Z80 emulations worked by reading and writing bytes. For Spin, one could make a very valid case for a pasm driver that reads and writes longs rather than bytes. So you could maybe abstract that to two longs, one for the address and a flag bit for read or write, and the other long is the data to write, or the data that has been read.
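That two-long mailbox could be modelled like this. The flag bit position (bit 31) and the function name are assumptions for illustration, not Dr_Acula's actual driver:

```python
# Illustrative model of the two-long mailbox: one long holds the address
# plus a read/write flag bit, the other holds the data. The flag position
# (bit 31) and the function name are assumptions.

WRITE_FLAG = 1 << 31

ram = bytearray(512 * 1024)             # 512k external RAM model

def ram_cog_service(cmd, data):
    """One pass of the driver cog's polling loop."""
    addr = cmd & 0x7FFFFFFF
    if cmd & WRITE_FLAG:
        ram[addr] = data & 0xFF         # write request: store the byte
        return None
    return ram[addr]                    # read request: return the byte

ram_cog_service(WRITE_FLAG | 0x1234, 0xAB)   # write 0xAB to $1234
print(hex(ram_cog_service(0x1234, 0)))       # prints: 0xab
```

A long-oriented version would differ only in transferring four bytes per request instead of one.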
If you want an executable binary file larger than 32K you would need a special compiler. However, it would be possible to use the existing compilers to create objects that fit within 32K, and then link them together to form programs that are much bigger than 32K. This would be done by creating stub routines that would perform a 32-bit call to a method in another object.
Great to hear that might be possible. I was wondering what sort of code is generated by the compiler when it is asked to compile spin code that is then fed into a cog with a cognew. Is this code more relocatable?
Or could you do this in 32k blocks which have their own var/dat etc and link them? Maybe put them in external memory in 32k groups, so that code ends up in known places with an offset you can calculate knowing which block it is?
I use a 512k memory chip so it is a lot simpler than things like caches.
It is very likely you are missing a point, which is that (considering how long it takes to set up an access on your board) the performance of applications on your board would most likely increase by more than 50% with a cache.
Comments
There's plenty of completely open source compilers. Start with something "Spin-like". Given its expression syntax, an open source C compiler might be a good start (happy to be corrected by those that have already done work on parsing Spin).
Your main problem with adapting an existing compiler to compile SPIN is going to be the indentation style. Languages that use indentation for block scope (like Python and Spin) are notoriously hard to parse. Of course, if you're going to start from scratch, you could also fix that particular design defect as well! (ducks for cover )
Ross.
Well when it comes down to actually writing a multipass compiler, call me crazy but one of those passes I was thinking of being the 'indent' parser, and its sole job would be to go through the code and replace indents with { and }. That is an internal step to the compiler so nobody need see this. [also ducks for cover]. But by making it more C like, the next step could be a C compiler. Do you have any links to such open source C compilers? I'm thinking out loud whether it will be easier to build a compiler for spin from scratch vs translate spin to C and compile that?
I saw once an example of 'obfuscated C' which was a flight simulator which when printed out, the program was the shape of a plane. What this highlights is that C does not need to worry about indents. If the braces and the semicolons are there, the compiler knows what to do next. So you could add semicolons as part of the compilation process too. Then the indent problem goes away.
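That "indent pass" idea can be sketched roughly as follows. This is a hypothetical pre-pass (the function name and the naive semicolon handling are my own simplifications; real Spin needs more care with comments, DAT sections and tab expansion):

```python
# Hypothetical "indent pass": turn indentation into explicit { } blocks and
# add trailing semicolons, so a later C-style pass needn't care about
# whitespace. Tabs are assumed to be expanded to spaces already.

def braceify(source):
    out, stack = [], [0]
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        indent = len(line) - len(line.lstrip(" "))
        if indent > stack[-1]:          # deeper indent opens a block
            out.append("{")
            stack.append(indent)
        while indent < stack[-1]:       # shallower indent closes block(s)
            out.append("}")
            stack.pop()
        out.append(stripped + ";")
    while len(stack) > 1:               # close blocks still open at EOF
        out.append("}")
        stack.pop()
    return "\n".join(out)

code = "repeat i from 0 to 7\n  outa[0] := 1\n  outa[0] := 0\nwaitcnt(x)"
print(braceify(code))
```

The output is entirely internal to the compiler, as suggested above - the user never sees the braces.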
Any links for more open source compilers would be most appreciated!
Dr_Acula, you really should learn to use Google! Googling "open source" and "C compiler" gives you this as the first result. Any entry that says anything other than "Proprietary" could be considered "Open Source" (it depends on your definition).
I'll leave you to google for "difficulty in parsing indented languages"
Ross.
Some of spin seems more like pascal, so that pascal compiler will be very handy.
Ross.
Yes you certainly find the syntax differences when you think about a compiler. eg a=5 in the CON section vs a := 5 in a PUB.
The indenting problem needs a little work in that I think tabs need to be converted to spaces, then spaces to blocks with end markers.
Mathematical expressions are the fun part. BODMAS etc. I started thinking about Reverse Polish Notation, and that led me to this link http://en.wikipedia.org/wiki/Shunting-yard_algorithm by Edsger Dijkstra which I think you can generalise for Spin expressions. That link even comes with some example C code!
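For reference, a bare-bones shunting-yard conversion looks something like this - integer literals and four binary operators only, nothing Spin-specific:

```python
# Bare-bones shunting-yard conversion from infix tokens to RPN, per the
# algorithm in the Wikipedia article linked above. Only four binary
# operators; Spin's full operator table and unary ops are left out.

PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def to_rpn(tokens):
    output, ops = [], []
    for tok in tokens:
        if tok.isdigit():
            output.append(tok)          # operands go straight to the output
        elif tok in PREC:
            while ops and ops[-1] in PREC and PREC[ops[-1]] >= PREC[tok]:
                output.append(ops.pop())    # pop higher/equal precedence ops
            ops.append(tok)
        elif tok == "(":
            ops.append(tok)
        elif tok == ")":
            while ops[-1] != "(":
                output.append(ops.pop())
            ops.pop()                   # discard the "("
    while ops:
        output.append(ops.pop())
    return output

print(to_rpn("3 + 4 * ( 2 - 1 )".split()))
# prints: ['3', '4', '2', '1', '-', '*', '+']
```

Once in RPN, code generation is just a matter of emitting a push per operand and an operation per operator.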
Have a read of the Crenshaw paper. A few points about that:
1) He describes a method of parsing and compiling expressions that does not use any complicated expression and operator stacks as in the shunting yard algorithm.
Rather he does it via a technique called "recursive descent". Basically you have a function to handle expressions. Whilst in there, as you come to the various operators, you call other appropriate functions to handle the sub-expressions. Whilst in those functions you call other functions to handle more sub-expressions. Potentially, for expressions with nested sub-expressions in brackets and such, you end up calling the same functions again, recursively.
The idea is to use the stack inherent in the language your compiler is written in to keep track of where you are rather than messing with manually maintained stacks. It's dead easy and I suspect it is the kind of technique used by Chip in the Spin compiler.
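A minimal recursive descent expression parser in the spirit of Crenshaw's approach might look like this. This sketch evaluates directly rather than emitting code, and the grammar is deliberately tiny:

```python
# Minimal recursive-descent expression parser in the spirit of Crenshaw:
# one function per precedence level, recursion via the host language's own
# call stack rather than explicit operator/operand stacks.

import re

def parse(src):
    tokens = re.findall(r"\d+|[-+*/()]", src)
    pos = [0]

    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None

    def take():
        tok = peek()
        pos[0] += 1
        return tok

    def factor():                       # factor := number | "(" expr ")"
        if peek() == "(":
            take()
            val = expression()
            take()                      # consume the ")"
            return val
        return int(take())

    def term():                         # term := factor {("*"|"/") factor}
        val = factor()
        while peek() in ("*", "/"):
            val = val * factor() if take() == "*" else val // factor()
        return val

    def expression():                   # expr := term {("+"|"-") term}
        val = term()
        while peek() in ("+", "-"):
            val = val + term() if take() == "+" else val - term()
        return val

    return expression()

print(parse("3 + 4 * (2 - 1)"))         # prints: 7
```

A compiler would replace each arithmetic step with emitted push/operate bytecodes, but the call structure stays the same.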
2) I suspect that the white space block demarcation of Spin can be handled in a similar way to the way comments are handled. That is when parsing a line you know what indentation level you are on and what language construct you are in, if, repeat etc, and you know you should introduce a new level or jump out of one by the amount of space on the current line. This is helped by the fact that statements cannot cover multiple lines in Spin. You have to maintain a "state" of where you are, much like you do for block comments. I can see that white space indentation might be harder for traditional parsers, lex, bison and co.
3) Crenshaw manages to parse the source and generate code all in one go. Without a lexical analysis and tokenizing pass. I bet Chip, being a minimalist, does this in the Spin compiler.
Any compiler experts out there? Is it common to use recursive descent compilers nowadays or is the shunting yard style still in use?
Anyone know if my guesses re: Chips approach are even remotely correct?
What style does HomeSpun or BST use?
The language has to be suitable, of course. There are some for which recursive descent can't be used.
A rudimentary compiler can be written in a page of code using the SNOBOL language.
OK. My question was the other way round really. I could be wrong but I got the impression from various books and papers I've seen that early compilers used auxiliary operator and operand stacks, mostly because they were written in assembler, I guess, and that was easier to manage. Then later efforts took advantage of the high level language they were written in, hence recursive descent.
lex and yacc. yuk. I get the feeling it's as easy to parse Spin with a manually created parser. Especially since it has white space block delimiters and is essentially a line based syntax like good old BASIC.
Somewhere I have a half-baked PASM assembler that parses most of the Spin syntax for expressions, needed because it was intended to assemble PASM from Spin-compatible source files including all normal immediate value expressions, DAT definitions, constants etc. That project stalled because I was writing it in object-oriented Ada (yes, I'm crazy), which ultimately ran away from me. And then came HomeSpun and BST, so the motivation evaporated.
The results are shown below. I combined the percentages from each of the four programs with equal weights. The ldllc (Load long local compact) instruction was used the most. This is the instruction that pushes the value of one of the first 8 local stack variables onto the stack. jz (Jump if zero) was the next most frequent opcode. These two opcodes account for 25% of the opcodes executed.
50% of the opcodes executed were done on only 8 opcodes. These include 3 that load constants --- ldbi (Load byte immediate), ldli1 (Load long immediate one) and ldlip (Load long immediate packed). Two more instructions address the first 8 local stack variables -- exllc (Execute long local compact) and stllc (Store long local compact). The remaining instruction was ldba (Load byte absolute), which pops an address off of the stack and loads a byte from memory.
95% of the opcodes executed were either math ops, non-indexed memory ops, or program flow such as jmp, call, ret and ldfrm (Load stack frame). The indexed memory ops and most of the other lower ops were rarely used. Of course, other programs will use a different mix of instructions, which could include more of the indexed memory ops and lower ops.
If a new Spin interpreter is developed, it might be good to optimize the more frequently used ops. Opcodes that are rarely used could be overlaid from hub RAM or executed as LMM instructions.
Dave
Nice work, Dave.
This is pretty typical of all compilers, and was supposed to be one of the original factors driving RISC vs CISC architectures. The conventional wisdom is that CISC architectures were popular when building good software (i.e. compilers) was more difficult than building good hardware, and RISC architectures became popular as software became easier to build than hardware.
Personally, I think there is a case to be made for exactly the opposite - i.e. CISC was popular when software was cheap and cheerful - and was often written by the same engineers who designed the hardware. When software became more complex (and hence more expensive to build and maintain, and also the bottleneck on developing new architectures) then CISC had to be abandoned because compiler construction was simply too cumbersome and difficult.
In any case, even on modern RISC architectures compilers tend to use only a fraction of the available instruction set (but probably more than in the days of CISC architectures - anybody remember the VAX instruction set? - I think you got a medal if you could even figure out what some of the instructions did!).
Ross.
It would be interesting to start with an LMM PASM version of the interpreter, and then optimize each instruction starting from the top of the list. I'm guessing that the execution time would break even after optimizing the first 40 instructions. Everything after that would be a speed improvement over the current interpreter.
I read the whole thread before replying, and I agree with you that writing a Spin byte code interpreter that used straight PASM for the most used spin byte codes and LMM for the rest should be a win overall - especially since it would be possible to add new byte codes for single precision floating point support right into it!
I know, float is not needed for most applications, but it can be nice
(I'm talking regular small Spin at the moment)
If so:
1) Something like this should go into the Prop II where the LMM kernel and LMM code of the interpreter could live in the extra PROM space it will have.
2) This has repercussions for Parallax's idea for a C to byte code compiler.
As for 2) The current plan for a Prop C compiler seems a lot less than ideal as it is targeting current Spin byte codes and hence looking to drop C features like unsigned ints. That is to say it is NOT C. Bad idea. But with an LMM interpreter the byte codes could be extended to provide full C support.
New opcodes could be added by pairing up the unused $3f code with another byte. So the byte pair $3f $00 could be a floating-point multiply, $3f $01 could be a divide, $3f $02 could be an unsigned long divide, and so on. There should be a byte pair reserved for an LMM instruction so that user LMM code could be executed.
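That escape-byte scheme could be modelled like this. The opcode numbers and the operations in the tables are placeholders for illustration, not actual Spin bytecodes:

```python
# Toy dispatch model of the $3F escape-byte idea: the unused $3F opcode
# makes the next byte select an operation from a second, extended table.
# The opcode assignments and operations here are placeholders.

def run(bytecode, stack):
    pc = 0
    while pc < len(bytecode):
        op = bytecode[pc]
        pc += 1
        if op == 0x3F:                  # escape: next byte picks the table entry
            ext = bytecode[pc]
            pc += 1
            EXTENDED[ext](stack)
        else:
            CORE[op](stack)
    return stack

CORE = {
    0x01: lambda s: s.append(2),        # placeholder "push 2"
    0x02: lambda s: s.append(3),        # placeholder "push 3"
}
EXTENDED = {
    0x00: lambda s: s.append(s.pop() * s.pop()),      # $3F $00: multiply
    0x01: lambda s: s.append(s.pop(-2) // s.pop()),   # $3F $01: divide
}

print(run(bytes([0x01, 0x02, 0x3F, 0x00]), []))       # prints: [6]
```

The scheme costs one extra byte fetch only on the rare extended codes, leaving the hot single-byte dispatch path untouched.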
Sounds like a good idea.
I think a fully compliant C compiler could be written using the current Spin bytecode interpreter. Signed ints can be used for most operations, where sign does not matter. The only operations where sign matters are >, <, >>, multiply, divide and mod. Function pointers can be implemented using the technique I suggested for method pointers, where the PBASE, VBASE, PCURR and stack variable size are stored in a struct somewhere, and the compiler just presents a pointer to the struct. The compiler can implement C jumps with the jmp opcode. It just has to ensure that the stack is maintained properly. Even a longjmp can be implemented by using the abort opcode. It would require that the compiler route all the calls through code associated with the setjmp routine.
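One standard trick behind that point: an unsigned comparison can be synthesized from a signed compare by flipping the sign bit of both operands first. A quick sketch (the helper names are illustrative):

```python
# Sketch of the sign-sensitivity point: with only signed 32-bit compares
# available, unsigned a < b can be synthesized by XOR-ing both operands
# with $8000_0000 before comparing. Helper names are illustrative.

BIAS = 0x8000_0000

def to_signed(x):
    """Reinterpret a 32-bit pattern as the signed value the VM would see."""
    return x - 0x1_0000_0000 if x & BIAS else x

def unsigned_lt(a, b):
    """Unsigned a < b using only a signed comparison."""
    return to_signed(a ^ BIAS) < to_signed(b ^ BIAS)

print(to_signed(0xFFFF_FFFF) < 1)        # prints: True  (signed view: -1 < 1)
print(unsigned_lt(0xFFFF_FFFF, 1))       # prints: False (unsigned: huge > 1)
```

A compiler could emit the two XORs before a signed compare only where the C types demand unsigned semantics.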
Parallax's approach is to write a pre-processor that converts C source to Spin source. This will be hard to do unless they add some extensions to the Spin language to support the C features I mentioned above. However, there shouldn't be a problem if they compile directly to bytecodes.
Dave
After I implemented 38 of the 256 instructions inside the cog I got it down to about 1.5 times slower. I quit at this point because I realized that I would have to implement many more instructions inside the cog to break even, and then even more to get a significant gain.
I then tried another approach by using a jump table stored in the hub RAM, which I believe Cluso and jazzed have also done. I also optimized six of the more frequently used instructions. This produced a 25% speed improvement, which Cluso has reported as well. The biggest barrier to getting any further improvement is the lack of cog memory. If the cog memory had an extra 100 or 200 longs we could probably achieve a doubling of the speed.
My next thought was to split the interpreter between two cogs. The motivation isn't to use the extra processing capability of the second cog, but to use its memory. The multi-cycle instructions could be moved to a second cog to reduce the effect of the inter-cog overhead. These instructions would be sqrt, strsize, strcomp, bytemove, wordmove, longmove, bytefill, wordfill, longfill, encode, randf and randr.
In SpinLMM I've gotten better performance on strsize and strcomp for large string sizes. These use tighter loops that run under the FCACHE instruction. Some of the other multi-cycle instructions can be improved also, and others will be about the same. The main advantage is that this will free up more space in the main cog so that the other instructions can be optimized.
This approach would also allow implementation of language extensions that were discussed earlier, such as floating point operators and an LMM interpreter.
Is it too late to add more cog memory to the Prop 2 design?
Dave
If you do go for the second cog, is there any chance you could keep a version that is a little slower but is a one cog version? Maybe some way of selecting which one you want to use?
What sort of code would "LMM Spin" run? Presumably that lives in a larger memory space, and if so, how does a compiler cope with things like large arrays (>32k) and jumps to distant code? Does this need a different compiler?
If we want to speed up the Spin interpreter I think more cog memory is the best way to go. Using a second cog would achieve this. The LMM approach would work for a single cog, but I don't think the speedup will be as much. LMM Spin could run user LMM code and we could add intrinsic support for floating point. The compilers don't support adding new intrinsic functions, but BST does have a "bytecode" function that would allow inserting new codes.
I did this last year. The problems are: 1) some language functionality must be sacrificed to provide memory access (I removed sqrt, encode/decode, and random), 2) there is only room to modify the fetch; the stack and variable spaces remain in HUB, and as a result, writing programs becomes very difficult because of losing VAR access and STRING references. Now if you can give up more language features, some of the new problems could be resolved by putting everything into external memory and getting HUB access via a special address, but then SPIN loses more features.
A similar exercise with a Cache COG for external memory access, using overlays, and not giving up any features results in a SPIN that's at least 3 times slower than normal.
Conclusion? Use Catalina or some other language. ZOG works on C3 and the SDRAM solutions. Hopefully I can resume working on Catalina SDRAM solutions very soon. I'm thinking that Bean's LMM BASIC could also get "BIG" with external memory.
I also moved ops that are not used as much, such as the lockxxxx and cogxxxx instructions. The second cog uses about 150 longs at this point. The first cog implements most of the instructions and the more frequently used ops are optimized for speed. I used a jump table that is stored in hub RAM.
Up to this point the two cogs run effectively as a single cog, but with more cog memory. The second cog waits for the first cog to post an instruction, and then the first cog waits until the second completes the operation. It may be possible to get some additional speedup by overlapping the processing of the two cogs. The first cog could fetch the next instruction and decode it while the second cog is working on the previous instruction. My goal is to achieve a speedup factor of 2 to 1.
The 2 cog approach could be used to implement Big Spin. There is enough room in the second cog to move more code from the first cog. This should allow external memory routines to be added.
That is great news. Would you need a special compiler? Or use that emulator you have with a flat memory space?
As an example, let's say a program consists of a top object and a single child object. This is split into two programs. The first program consists of the top object and a stubbed version of the child object. The stubbed version of the child object contains only the PUB and PRI statements from the child object, plus code that performs a 32-bit jump into the correct method of the child program. The linker would just copy the parent program and the child program together, and it will add a 32-bit pointer that will be used by the child object stubs to call the methods in the child program. It sounds complicated, but it's actually pretty straightforward to implement.
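The stub-and-link scheme might be modelled like this. The method tables, sizes, and offsets are all invented for illustration:

```python
# Toy model of the stub/linker scheme: the parent image carries stubs for
# the child's methods, and the linker patches in a base address once the
# child's load position is known. All tables and offsets are invented.

parent_image = {"size": 0x3000}                       # hypothetical 12K parent
child_image = {"size": 0x1200,                        # hypothetical child
               "methods": {"start": 0x0010, "stop": 0x0040}}

def link(parent, child):
    base = parent["size"]               # child is appended after the parent
    # each stub now holds a full 32-bit target address into the child
    return {name: base + off for name, off in child["methods"].items()}

stubs = link(parent_image, child_image)
print(hex(stubs["start"]), hex(stubs["stop"]))        # prints: 0x3010 0x3040
```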
SpinSim could be modified to support a larger memory to test it out. I'm planning on adding a Prop 2 mode to SpinSim anyhow. I could just add an option to specify the size of hub RAM. A size of 128K would simulate the Prop 2. I could also simulate external RAM. What type of external RAM is normally used, and how is it interfaced?
Here is my sales-pitch for the "C3 / SDRAM ZOG Cache" interface:
Advantages:
- Cache abstracts the hardware type. Just replace the object.
- Fast access to external memory using buffer pointers.
- No need to tell the Cache COG the type of data access.
- Cache is write-back. Writes only happen when a block is swapped.
- Allows practically any size external memory.
- Supports refresh required for cheap dynamic RAM.
- Cache store size is flexible and can use as little as 2KB of HUB.
Disadvantages:
The C3 / SDRAM ZOG Cache interface is based on a cache model which lives in a second COG and supports > 64KB memory. This allows using the same routines for any memory type, thus abstracting away the hardware details for the many options.
The cache model is fairly easy to use. For example, the zog.spin interpreter needs access to byte, word, and long data and there is only one instruction difference between the different accesses. Below I've provided a sample of reading a word from external memory.
The cache model is FUNCTIONAL. There are variations on the lower level implementation. For example SDRAM uses a direct mapped cache where C3 uses a VM cache based on VMCOG routines. SDRAM is so fast compared to the SPI FLASH/RAM that little performance is lost by using direct mapped cache - there are optimizations still to be realized.
One optimization in consideration for the cache model is to use two cache pointers since ZOG commonly accesses different memory which may collide back to back. Using that optimization would make the cache even more efficient by requiring fewer off COG accesses.
Here is a sample code fragment for reading a word from external memory. Reading a byte or a long would mean simply replacing the "rdword" instruction.
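The PASM fragment itself isn't reproduced here, but the access pattern it describes - get a cache line pointer back from the cache cog, then index into the line with addr AND (LINELEN-1) - can be modelled like this (the line length is an assumed value, and the lookup callback stands in for the mailbox exchange):

```python
# Python model of the word-read access pattern: ask the cache cog for a
# line pointer, then index into the line with addr AND (LINELEN-1).
# LINELEN is assumed; the lookup callback stands in for the mailbox.

LINELEN = 64                            # assumed cache line length in bytes

def cache_read_word(line_lookup, addr):
    line = line_lookup(addr)            # mailbox call to the cache cog
    offset = addr & (LINELEN - 1)       # index within the returned line
    return line[offset] | (line[offset + 1] << 8)     # the "rdword" step

# fake cache line whose bytes equal their offsets, for demonstration
fake_line = bytes(range(LINELEN))
print(hex(cache_read_word(lambda a: fake_line, 0x106)))   # prints: 0x706
```

Reading a byte or a long would change only the last line of the function, matching the one-instruction difference mentioned above.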
The supporting zpu_cache and cache_read routines look like this:
The external SDRAM direct-mapped Cache COG code that manages the buffers is very small, allowing for long lines of buffer reads and writes which, with an 8-bit data interface, can swap 64 bytes in 14us with an 80MHz system clock: that is about 5MB/s. A 16-bit interface used with MicroPropPC is 20% faster than the 8-bit interface at 80MHz.
The C3 Cache COG uses the VM swap algorithm, which has larger management code, but is more efficient in determining swap requirements because of cache line aging.
Just for completeness: An alternative is the VMCOG interface, which accomplishes basically the same thing, but the interface is marginally slower because it always consults the second COG (which adds more overhead for every access), is more complicated for each type of data access, and supports only 64KB of external memory at this time.
ErNa
Hi Dave,
You can update the CPLD firmware on the HX512 to make all 512KB directly accessible. See the following threads:
PROJECT: Extended addressing for Extreme 512K
http://forums.parallax.com/showthread.php?t=95106
Building the Lattice ISP Programmer (aka CPLD programming cable) for the Hydra HX512
http://forums.parallax.com/showthread.php?t=113554
The 'linker' idea sounds very interesting.