I've been evaluating writing a JIT compiler for the 68K for that other processor, you know. But I was thinking that for something simpler such as the Z80 (easier addressing modes), it could be an interesting exercise to try on the Propeller. Using pure LMM code, I thought it might provide a similar instruction throughput to the one you have now, or lower. Did you consider such an approach? Sorry if someone asked before. I do not know how buffer length relates to the speed gained over a "transform", "execute", "transform" and so on model.
What are your thoughts?
@heater "Just add the Z80 ops as LMM to the old 1 COG emulator......"
Maybe do that. I've just finished taking all the files needed to build a ROM image, including CP/M and up to 1 MB of files, and wrapping them up into a single binary file ready for programming into an EPROM. It takes about 10 seconds and a few keypresses now. In the process I've looked at heaps of assembly code, and the bottom line is that the whole thing is essentially 8080 except for one extra Z80 opcode (LDIR). And even that can be written in 8080. So you can take all the Z80-specific opcodes and leave them out if you like. Or emulate them really slowly with some code you copy in for the purpose; they are hardly ever called, so it won't matter if those opcodes are 10x slower.
The thing that seems much more important is emulating the full 64K of RAM space. So much of both the 8080 and Z80 assumed this much space. Given Vince now has the terminal side on a simple board http://www.brielcomputers.com/wik/index.php?title=Image:Pocketermrev1.jpg (I'm getting 4), that means you could ignore the VGA and the keyboard for the moment and just build something that does the emulation and a serial port, which maybe frees up some pins for the interface to a RAM chip?
@Ale: I have no idea really about what is involved in a JIT compiler apart from rumours I've heard about how Java byte codes are compiled to native x86, say, and then executed. You might have to elaborate on what you have in mind.
My simple-minded interpretation of JIT is that you take a bunch of byte codes, in our case 8080 opcodes, and compile or translate them into a sequence of, in our case, Propeller PASM or LMM instructions for execution. Then you move on to the next lot of byte codes. This gains you speed when/if you loop back to the original byte code sequence, as you already have a high-speed translated version stashed away in memory. Eventually the entire program is "compiled" and runs as native code.
Am I right so far?
Now all this only gains you anything if you have memory to keep the compiled/translated codes. If you have to keep re-compiling sequences because you had "recycled" the memory they were in then there is no point. As the COG space is so small the compiled sequences would have to live in HUB as LMM code which is going to be terribly slow. Besides there still won't be much space.
What happens when we get into self-modifying code? For example the boot loader or, more extreme, the way the BIOS applies patches to the CP/M binary in memory. Sounds horribly complicated.
To work in the available memory it would seem to be a case of "transform", "execute", "transform", as you say. In the extreme, on a one-opcode-at-a-time basis, which would be terribly slow.
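To make the memory argument concrete, here is a minimal Python sketch of a translation cache. It is not Propeller code and not taken from any emulator in this thread: the toy two-instruction guest program, the `TranslationCache` class and the `translate` and `run` helpers are all made up for illustration. It only shows that when the cache is too small to hold the working set, every block gets re-translated each time round, and that a write into translated code has to throw the translation away.

```python
from collections import OrderedDict

# Toy guest code: address -> (mnemonic, operand). Not real 8080 encodings.
PROGRAM = {
    0: ("INR", None), 1: ("JMP", 10),
    10: ("INR", None), 11: ("JMP", 0),
}


def translate(addr, memory):
    """Translate one straight-line block, ending at the first JMP."""
    ops, covers = [], set()
    while True:
        mnemonic, operand = memory[addr]
        covers.add(addr)
        if mnemonic == "INR":
            # Stand-in for emitting native code for "INR A".
            ops.append(lambda s: s.update(a=(s["a"] + 1) & 0xFF))
            addr += 1
        elif mnemonic == "JMP":
            return {"ops": ops, "covers": covers, "next": operand}
        else:
            raise ValueError(f"unknown mnemonic {mnemonic!r}")


class TranslationCache:
    """Translated blocks keyed by guest address, least recently used evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.translations = 0  # how often we had to (re)translate

    def lookup(self, addr, memory):
        if addr in self.blocks:
            self.blocks.move_to_end(addr)
            return self.blocks[addr]
        block = translate(addr, memory)
        self.translations += 1
        self.blocks[addr] = block
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # recycle the oldest slot
        return block

    def invalidate(self, addr):
        """Self-modifying code: drop every block covering a written address."""
        stale = [start for start, blk in self.blocks.items() if addr in blk["covers"]]
        for start in stale:
            del self.blocks[start]


def run(memory, cache, blocks_to_run):
    state, pc = {"a": 0}, 0
    for _ in range(blocks_to_run):
        block = cache.lookup(pc, memory)
        for op in block["ops"]:
            op(state)
        pc = block["next"]
    return state


if __name__ == "__main__":
    for capacity in (2, 1):
        cache = TranslationCache(capacity)
        run(PROGRAM, cache, 10)
        print(f"capacity {capacity}: {cache.translations} translations for 10 blocks")
```

With room for both blocks the translator runs twice in total; with room for only one it runs on every block, which is the "transform", "execute", "transform" case.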
I have come across so many "swings and roundabouts" in implementing the 8080 emulator on the Prop I still don't know where the optimum answer is. Seems that every time an idea comes up to improve speed it has a penalty that takes speed away again. For example:
a) Keep 8080 registers in COG for fast access. Fine, but now you need some extra ops to ensure they stay in 8 bit range. For registers in HUB rd/wrbyte does that for you. Worse, you need extra ops to access register pairs as WORDs. For registers in HUB they can be arranged so that wr/rdword will do that for you.
b) Have all or part of the dispatch table in COG for fast access. Fine, now you have used up precious COG space and need to go to multiple COGs or LMM both of which slow you down.
c) Use multiple COGs so that you can have lots of fast straight line code to implement the 8080 ops. Fine, but now you HAVE to keep registers in HUB for sharing. And scheduling from COG to COG is slow.
And so it goes on....
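Point a) can be sketched in a few lines of Python. This is only a model: a `bytearray` stands in for HUB RAM, the `rdbyte`/`wrbyte`/`rdword`/`wrword` helpers imitate the truncating, little-endian behaviour of the Propeller instructions of the same names, and the register offsets `REG_L` and `REG_H` are hypothetical, not the layout of any real emulator.

```python
# COG-style: each 8080 register sits in its own 32-bit long,
# so every result must be masked back to 8 bits by hand.
def inr_cog(reg):
    return (reg + 1) & 0xFF          # the extra op mentioned in a)


def pair_cog(h, l):
    return ((h & 0xFF) << 8) | (l & 0xFF)   # shift + or to build HL


# HUB-style: registers are bytes in shared memory. Storing L at the lower
# address and H right after it lets one little-endian word read fetch HL.
hub = bytearray(8)
REG_L, REG_H = 0, 1                  # hypothetical offsets


def wrbyte(value, addr):
    hub[addr] = value & 0xFF         # byte write keeps only the low 8 bits


def rdbyte(addr):
    return hub[addr]


def rdword(addr):
    return int.from_bytes(hub[addr:addr + 2], "little")


def wrword(value, addr):
    hub[addr:addr + 2] = (value & 0xFFFF).to_bytes(2, "little")


if __name__ == "__main__":
    wrbyte(0x1FF, REG_L)             # overflowing value is truncated for free
    wrbyte(0x12, REG_H)
    print(hex(rdword(REG_L)))        # register pair HL in one access
    print(hex(pair_cog(0x12, 0xFF))) # same value, built with extra ops
```

The trade-off is the one described above: the HUB version needs no masking or shifting, but every access is a HUB access.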
By way of some examples:
The 8080 increment instruction "INR A" is implemented as LMM code in the one COG emulator and as fast PASM code in the 4 COG. A test program of 34 INR's followed by a "JMP start" is run on both emulators. Result:
1 COG LMM - 381 KIPS
4 COG PASM - 414 KIPS
A loop of 8080 "INR A, OUT 0, JMP start". Here OUT and JMP are PASM in the 1 COG emulator. For the 4 COG emulator each op code is handled by a different COG, so a lot of "swapping" is going on. Result:
1 COG - 348 KIPS
4 COG - 384 KIPS
The complete T8080 CPU diagnostic program run repeatedly. Result:
1 COG - 392 KIPS
4 COG - 406 KIPS
These results make me think that the small gains in speed from using 4 COGs instead of 1 are just not worth it. Unless, possibly, someone can point out a sneaky way to speed up all the COG swapping.
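For anyone who wants the relative gains rather than raw KIPS, a quick Python check using only the figures quoted above (the labels and the `gain_percent` helper are just names picked for this snippet):

```python
# KIPS figures from the three tests: (1 COG, 4 COG).
RESULTS = {
    "34 x INR A + JMP": (381, 414),
    "INR A, OUT 0, JMP": (348, 384),
    "T8080 diagnostic": (392, 406),
}


def gain_percent(one_cog, four_cog):
    """Speed gain of the 4 COG emulator over the 1 COG one, in percent."""
    return (four_cog / one_cog - 1.0) * 100.0


for name, (one, four) in RESULTS.items():
    print(f"{name}: {gain_percent(one, four):.1f}% faster on 4 COGs")
```

That works out to roughly 8.7%, 10.3% and 3.6% respectively.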
There is one last avenue I would like to explore: Cluso has a clever technique in his Cluso interpreter of putting the COG addresses (9 bits each) of three routines into each LONG dispatch table entry.
Now given most 8080/z80 ops have a source, an operation, and a destination. One could code up a bunch of "source" functions e.g. get_reg_a, get_reg_b, get_reg_pair_hl, get_immediate_byte, get_immediate_word etc etc. Then one could code up a bunch of "operation" functions, add, add_with_carry, shift, rotate, 16 bit add, push, pop etc. Then code up the "destination" functions similar to the source set.
Having done that one proceeds to create 8080 op code emulations by specifying, for each op code, the source, operation, and destination functions in the dispatch table.
Seems like a way of creating a very small emulator in one COG without any LMM. At least enough to do 8080 ops.
But of course for every "swing" there is a "roundabout". Perhaps the overhead of reading three addresses from each dispatch table entry and calling them takes away any speed gains.
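A rough Python model of that idea, to show the shape of it. Everything here is an assumption made for illustration: indices into a `ROUTINES` list stand in for 9-bit COG addresses, the bit layout chosen in `pack` is arbitrary, flags are ignored, and only four opcodes are filled in. It is not Cluso's code and not PASM.

```python
def pack(src, op, dst):
    """Pack three 9-bit routine addresses into one 32-bit table entry."""
    for addr in (src, op, dst):
        if not 0 <= addr < 512:
            raise ValueError("address does not fit in 9 bits")
    return (dst << 18) | (op << 9) | src


def unpack(entry):
    return entry & 0x1FF, (entry >> 9) & 0x1FF, (entry >> 18) & 0x1FF


# "Source" routines fetch a value.
def get_reg_a(state, memory): return state["a"]
def get_reg_b(state, memory): return state["b"]
def get_immediate_byte(state, memory):
    value = memory[state["pc"]]
    state["pc"] += 1
    return value

# "Operation" routines transform it (flags left out of this sketch).
def op_pass(state, value): return value
def op_inc(state, value): return (value + 1) & 0xFF
def op_add_a(state, value): return (state["a"] + value) & 0xFF

# "Destination" routines store the result.
def put_reg_a(state, value): state["a"] = value
def put_reg_b(state, value): state["b"] = value


ROUTINES = [get_reg_a, get_reg_b, get_immediate_byte,
            op_pass, op_inc, op_add_a, put_reg_a, put_reg_b]
ADDR = {fn.__name__: i for i, fn in enumerate(ROUTINES)}


def entry(src, op, dst):
    return pack(ADDR[src], ADDR[op], ADDR[dst])


DISPATCH = {
    0x3E: entry("get_immediate_byte", "op_pass", "put_reg_a"),  # MVI A,d8
    0x47: entry("get_reg_a", "op_pass", "put_reg_b"),           # MOV B,A
    0x3C: entry("get_reg_a", "op_inc", "put_reg_a"),            # INR A
    0x80: entry("get_reg_b", "op_add_a", "put_reg_a"),          # ADD B
}


def step(state, memory):
    opcode = memory[state["pc"]]
    state["pc"] += 1
    src, op, dst = unpack(DISPATCH[opcode])
    value = ROUTINES[src](state, memory)
    value = ROUTINES[op](state, value)
    ROUTINES[dst](state, value)


if __name__ == "__main__":
    memory = bytes([0x3E, 0x05, 0x47, 0x3C, 0x80])  # MVI A,5 / MOV B,A / INR A / ADD B
    state = {"a": 0, "b": 0, "pc": 0}
    while state["pc"] < len(memory):
        step(state, memory)
    print(state)
```

Each opcode costs one table read, three field extractions and three calls, which is exactly the overhead in question.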
Gosh, that was a long post !
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
@Dr_A: I have been on the verge of hand wiring a Prop + 64K RAM for a long time. Each time I delay it because someone is designing a Prop + RAM PCB. This time it looks like we are on to a winner with Cluso's three propeller board.
The way things are going I will probably go back to single COG emulation as I said in the last post. After all, when we get the Prop II, 1 COG using the internal HUB RAM will be just perfect :)
I'll see what I can do about adding the minimum required Z80 ops to PropAltair.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
The bad news:
I have decided to cancel any further work on the Z80 emulator in 4 COGs. The reasons being:
1) It has not realized any significant speed gains over a 1 COG solution. Ten or twenty percent at best, only about 4% on the T8080 CPU diagnostic.
This is just an awful waste of COGs and RAM for little benefit.
2) It turns out to not be much help in allowing easy expansion from 8080/8085 emulation to full Z80. The required code will not fit in the space remaining in each COG and I refuse to blow more COGs on it.
In hindsight this was inevitable. Yes, you can write big, long, inline, fast code spread over 4 COGs, but when you have to thrash from COG to COG, as in an emulator, spending only a short time in each, all the speed gains are lost. I had this gut feeling before I started but just had to prove it to myself, I guess.
The GOOD NEWS:
For those interested in a fast, streamlined, elegant, easy to understand, full-up Z80 emulation in only ONE COG: see the new thread "Z80 emulator object in 1 COG", coming soon :)
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
Heater - shame it didn't provide the gains you expected. Nice to see you are continuing the 1 COG version. Using a 6.0MHz xtal will give you an instant 20% (if you are not already doing this). Hopefully my TriBladeProp will be off to manufacture soon - you will be surprised at the additions I've made (no time to explain till it's off). Keep up the excellent work :-)
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔ Links to other interesting threads:
I tried a 6MHz crystal a long time ago, along with a whole bunch of others, some faster some slower. Strange thing is I could only get the 10MHz crystals in my junk box to work. For sure not the 5MHz which was odd.
Probably something to do with my home made, hand wired PropDemo board style board. I should post a picture of it one day. It is "art", my homage to the Flying Spaghetti Monster.
As such I'm really looking forward to the TriBladeProp.
Any chance of getting ImageCraft to fix up their compiler to run LMM from TriBlade's RAM? Or is hacking their LMM kernel to do that possible?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
Glad to hear that you are still working on the single COG version.
OBC
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Are you on msn???
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
You said
"I tried a 6MHz crystal a long time ago, along with a whole bunch of others, some faster some slower. Strange thing is I could only get the 10MHz crystals in my junk box to work. For sure not the 5MHz which was odd."
If you have an old VGA card from a PC, it has a 14.31818 MHz crystal on it. Test that in your system, or else a 12 MHz one.
On all my Props it behaves better with PLL 8 than with a single-frequency 6 MHz crystal, or else even more with PLL 16.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.
For every stupid question there is at least one intelligent answer.
Don't guess - ask instead.
If you don't ask you won't know.
If you're gonna construct something, make it as simple as possible yet as versatile as possible.
Sapieha