Cluso: A ZiCog version 1 may be nice. With all the current discussion going on I'm starting to think that the magic version 1 should be reserved for the first version that passes all the EX exerciser tests for documented instructions and flags. It's starting to look like that is in sight.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
>> ".... how is the DAA code in the middle of the cog code? Why doesn't this displace the cog code after
>> this point so it can't then be loaded?"
Thing is the DAA code is NOT in the middle of COG code. It is at the end. There is nothing to displace. All the code you see after the DAA code is overlays that sit in HUB waiting to be loaded into COG at some point. When they are loaded they overwrite the DAA code. But DAA itself is an overlay so it can be reloaded from HUB again when needed.
>> "So as an extreme example, running a Z80 overlay could rewrite almost the entire cog, then do
>> something, then reload the cog."
Yep, you have it exactly. But it's this reloading part that starts me thinking it would slow things down to LMM speeds and that LMM may just be easier/more elegant.
I don't think we want to get into loading COG blocks (overlays) from external memory. Too slow and complicated. Although I see that it may be a solution to some problems. It starts to require an operating system and a development environment to support it. As you say, how do you make your loaded code aware of variables and functions already in the system? That's what linkers and locators do.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
LMM is great for "straight line" code. Speed is up there with overlays. It has a problem with jumps. When LMM hits a jump it has to branch out of the load/execute loop to some LMM kernel function that reads a new LMM program counter from somewhere. This puts a hiccup in the load/execute loop. An LMM jump might look like this:
JMP #lmm_jump
LONG some_address_to_jump_to
Where lmm_jump is a routine in the LMM kernel that increments the LMM PC, reads the target address "some_address_to_jump_to", sets the LMM PC to that address then gets back to the load/execute loop.
Now for the idea:
Make all LMM jumps program counter relative with an offset of only 9 bits:
jmpr some_offset, #lmm_jump
Now the jmpr instruction does not exist. Or does it?
jmpr is a normal JMP instruction. Normally the destination field of a JMP is left empty and not used but if we could somehow fill it in with a 9 bit offset then the lmm_jump routine could extract that offset and add it to the LMM program counter. BINGO we have done an LMM jump in only one LMM instruction. Even if we still have a "hiccup" in the execute loop it is less severe.
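The field juggling behind this idea can be sketched in C. This is only an illustration of the bit manipulation, not kernel code: it assumes the usual Propeller 1 instruction layout (opcode in bits 31..26, dest in bits 17..9, src in bits 8..0), treats the offset as a signed count of longs, and the names pack_offset, unpack_offset and lmm_jump are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed Propeller 1 layout: opcode[31:26] zcri[25:22] cond[21:18] dest[17:9] src[8:0] */
#define DEST_SHIFT 9
#define FIELD_MASK 0x1FFu

/* Store a signed offset (in longs, -256..255) in the dest field of an instruction. */
static uint32_t pack_offset(uint32_t instr, int offset_longs)
{
    assert(offset_longs >= -256 && offset_longs <= 255);
    uint32_t field = (uint32_t)offset_longs & FIELD_MASK;
    return (instr & ~(FIELD_MASK << DEST_SHIFT)) | (field << DEST_SHIFT);
}

/* Recover the signed offset, sign-extending from 9 bits. */
static int unpack_offset(uint32_t instr)
{
    int raw = (int)((instr >> DEST_SHIFT) & FIELD_MASK);
    return raw >= 256 ? raw - 512 : raw;
}

/* Hypothetical kernel routine: pc is a byte address that already points
   past the jump instruction when the routine runs. */
static uint32_t lmm_jump(uint32_t pc, uint32_t instr)
{
    return pc + (uint32_t)(unpack_offset(instr) * 4);
}
```

For instance, packing -3 into a word standing in for an assembled jmp (0x5C7C0040 here, an assumed encoding) leaves the src field untouched, and lmm_jump(100, word) then lands on byte address 88. A tool that cannot emit the dest field directly could, in principle, build such words with a LONG directive and the same arithmetic.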
Problem is: how to manufacture such a jump instruction in the prop Tool or BST with the dest field filled in?
I leave that as an exercise for the reader (That is, I don't know yet :) )
My apologies to Bill Henning and all working with LMM who have probably thought of all this before and come up with better techniques.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
Now an example of our problem is that Pullmoll has proposed a nice solution to get DAA working accurately. That solution requires a 33 LONG overlay and does not fit.
I think I can make it shorter with self modifying code. The :das and :daa sections are basically the same, except one uses SUB, the other ADD with the same constants. Modifying these two opcodes based on the N flag ought to be shorter than 6 longs.
ADD is %100000 and SUB is %100001, so the N flag would just have to be copied to bit #26 of the two instructions.
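As a rough model of that patching step, the C sketch below flips bit 26 of an instruction word according to the Z80 N flag, which is what a MUXC with a one-bit mask would do in PASM. The opcode values are the ones quoted above; the helper names and the idea of modelling an instruction as a plain 32-bit word are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define OPCODE_SHIFT 26      /* opcode occupies bits 31..26 */
#define OP_ADD 0x20u         /* %100000 */
#define OP_SUB 0x21u         /* %100001 */
#define Z80_NF 0x02u         /* N flag, bit 1 of the Z80 F register */

static unsigned opcode_of(uint32_t instr)
{
    return instr >> OPCODE_SHIFT;
}

/* Turn an ADD into a SUB (or back) by copying the N flag into bit 26. */
static uint32_t select_add_sub(uint32_t instr, uint8_t flags)
{
    assert(opcode_of(instr) == OP_ADD || opcode_of(instr) == OP_SUB);
    uint32_t bit = 1u << OPCODE_SHIFT;
    return (flags & Z80_NF) ? (instr | bit) : (instr & ~bit);
}
```

Only the opcode changes; the operand fields, and so the shared constants $06 and $60, stay as they were.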
Just to throw in an idea that would use a completely different approach: DRC - dynamically recompiling code.
You know what each Z80 instruction does and so would be able to write down the PASM equivalent for it, and put it in the hub RAM.
Now executing Z80 code would on the first run compile the PASM into the cog RAM right after the compiler code, like the overlay loader does.
Whenever a Z80 instruction jumps, returns, or alters the program flow other than jumping back to already compiled code, you would flush the compiled space and start over compiling the instructions from the new location - of course after first executing what had been compiled. Flushing and recompilation would also be necessary if some straight forward Z80 program flow overruns the cog space.
Let's assume some imaginary Z80 subroutine is called and returns at some point:
        push bc             c5
        push de             d5
        push hl             e5
        push af             f5
        ld   de, 2000h      11 00 20
nextblock:
        ld   hl, 1000h      21 00 10
        and  0fh            e6 0f
        add  a, h           84
        ld   h, a           67
        ld   bc, 100h       01 00 01
        ldir                ed b0
        inc  a              3c
        cp   10h            fe 10
        jr   c, nextblock   38 ef
        pop  af             f1
        pop  hl             e1
        pop  de             d1
        pop  bc             c1
        ret                 c9
Now you would in some sense disassemble the Z80 code, that is looking up the opcodes in the hub RAM tables and paste the code it points to sequentially into cog RAM. The JR opcode would have a flag bit set to indicate that you have to verify it jumps inside the compiled code, or otherwise end the execution block. After you compiled the instructions, you would just let the PASM code run and then jump back to the compiler to handle the next block of code.
The compilation process might seem to introduce a lot of overhead, but imagine what speed you get from running all these instructions in native PASM. The conditional jump destination address would have to be translated to the corresponding cog RAM address during compilation, which means you would have to keep track of the locations where the previous Z80 opcodes began. Perhaps a table could be kept at the end of the cog RAM growing down and thus decreasing the number of longs of compiled code that can be pasted in the cog.
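The bookkeeping described here might look roughly like this in C. It is a sketch of the lookup only, not of a recompiler: BlockEntry, jr_target and cog_addr_for are invented names, and the table is a plain array rather than something growing down from the top of cog RAM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per compiled Z80 opcode: where it started in Z80 memory
   and where its translated code begins in the cog. */
typedef struct {
    uint16_t z80_addr;
    uint16_t cog_addr;
} BlockEntry;

/* Z80 JR: target = address of the byte after the 2-byte JR, plus a signed 8-bit displacement. */
static uint16_t jr_target(uint16_t jr_addr, uint8_t disp)
{
    int offset = disp >= 0x80 ? (int)disp - 0x100 : (int)disp;
    return (uint16_t)(jr_addr + 2 + offset);
}

/* Return the cog address for a Z80 address, or -1 if it is outside the
   compiled block (the compiler would then have to end the block). */
static int cog_addr_for(const BlockEntry *map, size_t count, uint16_t z80_addr)
{
    assert(map != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (map[i].z80_addr == z80_addr)
            return map[i].cog_addr;
    }
    return -1;
}
```

With the subroutine above placed at address 0, the JR sits at offset 22 with displacement $EF, and jr_target gives 7, which is where nextblock starts.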
I have myself not yet written a DRC core, but I know they can be very efficient. MAME contains DRC CPU cores for e.g. the PowerPC and some MIPS chips and they do very well. Of course you have lot more RAM on a PC and can use clever caching algorithms to avoid recompilation of already compiled blocks. For the Prop I think it doesn't make sense to think about caching, because there's just not enough space to have more than a dozen or so opcodes lingering in the cog RAM.
I might try to do something like this once my Propeller board arrives... *sigh*
Juergen
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
He died at the console of hunger and thirst.
Next day he was buried. Face down, nine edge first.
heater: certainly worth a try with LMM instead of overlays.
Drac: The overlay loader does not care where the block resides in cog, nor its length. It is just a block that is loaded into the correct space. It loads in reverse, so as soon as it is complete the loader loop is overwritten and causes the overlay to execute. The loader catches the hub sweet spot at 16 clocks.
Forward LMM takes 32 but also gets to execute the 1 instruction. Reverse LMM is of course way more complicated. However, once we have the instructions executing in forward LMM, it could be put into reverse LMM. It is not as if we are trying to do flexible compilation.
So, we can either add the extra instructions in forward LMM and see how it goes. Alternately, add them in overlays and see how it goes. Either way once it is all complete, we can have a good look at the result. Then we can do some timing to see which way may work best, even a mix maybe.
Forget putting the code in SRAM.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔ Links to other interesting threads:
Reverse LMM is harder, but hey, I did it already in PropAltair. In that case you write your LMM normally forwards and PropAltair reverses it all before starting.
Admittedly that LMM was straight line code but if we had relative LMM jumps implemented as I described above it would make reverse LMM with jumps easier.
I think a normal unrolled LMM loop will do just fine for starters.
Pullmoll: Dynamic compilation seems to work fine for Java and C#. I guess it needs the RAM space to keep a cache of compiled stuff. I can't see it working out on the humble Prop.
Also people have suggested something similar before, not dynamic but a pre-translation of Z80 programs to equivalent PASM. I never figured out how such a scheme would deal with self modifying Z80 code.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
Juergen - we have got to get you a zicog board! As soon as possible.
@heater - "I never figured out how such a scheme would deal with self modifying Z80 code."
Self modifying code is pure evil. I have never written or used such a thing. D'Oh! I just realised, DDT uses self-modifying code to emulate Z80 instructions, as DDT is an 8080 program. Darn it, DDT is one of the first programs you use when testing out some simple Z80 code...
@Cluso = you are way ahead of me there. Overlays that hit the sweet spot of 16 clocks? Yes, of course it does, but only if you know it does. And that is brilliant.
Re where the overlays sit, I'm looking at the code and maybe it is at the end, maybe it isn't. Logically it has to be. Right after the overlay code is a CON section and that is going to be turned into internal data for the compiler, not actual code. So ok there. But after that is a long list of DAT code that is the list of opcodes for the Z80. Does this not end up in the cog? Is it a list of bytes/longs in hub that the cog can reference?
If so, gee there isn't much code for the cog. How did you fit it all in, heater/cluso?
Is the bit that actually fits in the cog at startup just the code from the beginning up till the end of the DAA $fit ? If so, that is only about a third of the zicog object.
Yikes, ignore the commented out code, the ram routines, the #ifdef conditional code that isn't compiled, and the actual zicog in the cog is not much at all. Which block of code would you read in/out to make room for bigger overlays?
Do you indeed need bigger overlays? DAA is big, but others seem smaller and even big things like my lousy EXX code could be shrunk with Juergen's clever xor trick.
Despite it being messy to bring in overlays from external ram, I may be forced to on the dracblade. It shouldn't be too hard. All code has to fit in the space DAA fits in. Pass just one variable - the location of the first register. Ok, it will be inefficient in that instead of "get value in register H" it will be "add 5 to location of reg_base and then get value in that location". But these are instructions that are not used very often.
I still don't quite understand LMM. I looked up "LMM Large Memory Model" on Google and the first link tells me that it is "an alternative programming environment suggested by Bill Henning".
I'm still digesting Juergen's DRC. If you trap things like DDT poking values into ram that then become instructions, then maybe it is possible to precompile sections of code. Could definitely be some gains there in reading from external ram as that is more efficient reading in small blocks than individual bytes (much more so with dracblade than cluso's ramblade). I wonder how much you could recompile Z80 code to PASM and still allow for pokes into code and all the traps for jmps and calls and rets. It is intriguing as it potentially gets Z80 up to pasm speeds, but I just can't quite get my head around it.
We have heaps of ram space for precompiled code. 512k. DRC seems to me to be like 'super smart caching' where it is a bit smarter than just getting 128 byte blocks of data. It is an intriguing idea.
Dr_A:
No, self-modifying code is brilliant. You can't do anything complex with PASM without it :)
"....Does this not end up in the cog"
When you start a COG it always gets loaded with 496 instructions from wherever you told it to start from. If your PASM is only, say, 100 LONGs then a load of random junk from whatever comes next will also be loaded.
As you see fitting ZiCog in is tight, it's taken two years to get to this point.
Yep, everything from the beginning to the end of DAA gets loaded at start up. And it is only about a third of ZiCog, the rest is overlays and the dispatch tables that live in HUB. As you see selecting some of that "resident" code to push out to an overlay is tricky.
The EX and EXX code can be shrunk by using my pointer manipulation suggestion :)
LMM is conceptually simple. And like all simple things it took a genius, Bill Henning, to think of it. Basically instead of loading a bunch of LONGs and executing them as the overlay system does you have a tight loop which fetches a single instruction from HUB into the COG, executes it and then loops around to fetch the next one.
Putting overlays into external RAM is I guess possible if you are running out of HUB space. It's going to be slow and hard to engineer. If we switch to LMM the same applies.
The whole idea of pre-compiling or dynamically compiling Z80 into PASM is just too big and complicated for me to even start thinking about. I'd rather use the heaps of external RAM space for Zog :)
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
Drac: I just looked. The overlay area is from overlay_start until the end of cog. This is not a requirement of the overlay loader code, but it does make more sense.
Re self-modifying code. It is brilliant for doing things fast. As you can see, it removes the need for a stack. It does have its problems and you need to be careful. I programmed a mini that had only 16 assembler risc instructions including self-modifying code. We ran online order entry, inventory and accounting on a $100K machine with 10KB common (aka hub code) and 5-10KB in each partition (aka cog - we had up to 20). This was mid 70s. 10KB of core memory was ~$20K. Now this takes PCs with GBs of memory and while the interface looks nice and pretty in colors, it runs slower.
heater: I just noted that there is an area between the ZiCog cog code and the DAA overlay which is only used for the ProtoBoard. Probably this bit should be shifted from here.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔ Links to other interesting threads:
Dr_Acula said...
Juergen - we have got to get you a zicog board! As soon as possible.
I might add something to the board I'll be receiving soon!? I have seen the circuit diagram of the TriBlade. It's just that I am broke right now, or otherwise I would have opted to buy one ;-P
Actually, the reason why I started looking at the Propeller is I expect to get a contract for the design of a data logger with SD cards. Getting this done should hopefully solve my financial issues for some time.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
He died at the console of hunger and thirst.
Next day he was buried. Face down, nine edge first.
I'll leave the jump and call handlers to you, so you can get 4 more longs in total. Your DAA is 23 longs, mine is 27 in the self modifying version:
'
' DAA implementation according to the (well tested) MAME Z80 cpu core.
'
daa
                rdbyte  data_8, a_reg
                mov     nibble, data_8
                mov     data_16, data_8             ' save for auxiliary flag test
                and     nibble, #15                 ' isolate low nibble of accumulator
                test    flags, #%00000010 WC        ' test negative flag
                muxc    :op1, :add_sub_bit
                muxc    :op2, :add_sub_bit
                test    flags, #aux_bit WZ
        if_nz   cmp     nibble, #$09+1 WC           ' aux isn't set; is nibble > $9 ?
:op1    if_ae   add     data_8, #$06                ' aux was set or nibble is > $9 => add $6
                test    flags, #carry_bit WZ
        if_nz   cmp     data_8, #$99+1 WC           ' carry isn't set; is data > $99 ?
:op2    if_ae   add     data_8, #$60                ' carry was set or data is > $99 => add $60
                and     flags, #%00000011           ' preserve old CF and NF
                cmp     data_8, #$99+1 WC           ' compare with $99 + 1
                muxnc   flags, #carry_bit           ' set carry if result > 99
                xor     data_16, data_8             ' original value ^ result
                and     data_16, #aux_bit           ' if bit #4 of result changed
                or      flags, data_16              ' set the aux flag
                and     data_8, #$FF WZ, WC
                muxz    flags, #zero_bit            ' set Z80 zero flag from props zero
                muxnc   flags, #parity_bit
                test    data_8, #$80 WZ
                muxnz   flags, #sign_bit
                wrbyte  data_8, a_reg
                jmp     #fetch
:add_sub_bit    long    %000001_0000_0000_000000000_000000000
Just a thought while looking at homespun's output of zicog.spin
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
He died at the console of hunger and thirst.
Next day he was buried. Face down, nine edge first.
With the LMM kernel unrolled four times (like in my original posting from 2006), an LMM instruction takes 20 clocks on average, as three of four instructions will execute in 16 clocks, and one will execute in 32 clocks.
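The arithmetic behind that figure can be written down as a small C helper. It assumes, as a simplification, that exactly one fetch per pass through the unrolled loop misses its hub window and costs the slow time while the others hit it; avg_clocks is a hypothetical name and the result is truncated to whole clocks.

```c
#include <assert.h>

/* Average clocks per LMM instruction for a kernel unrolled `unroll` times,
   assuming one fetch per pass costs `slow` clocks and the rest cost `fast`. */
static int avg_clocks(int unroll, int fast, int slow)
{
    assert(unroll >= 1);
    return ((unroll - 1) * fast + slow) / unroll;
}
```

avg_clocks(4, 16, 32) gives the 20 clocks quoted above, and avg_clocks(1, 16, 32) gives the 32 clocks of a loop that is not unrolled at all.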
Cluso99 said...
heater: Using LMM for the extra instructions may well be a nice solution. I would like to get a v1.0 & v2.0 out first, but like you, I have been on other things. I am still in a mind that 2 cogs could work nicely, but as you know, depending on the hardware, we may not have enough cogs available.
Am I correct (Bill can answer this) in saying that each LMM instruction takes 32 clocks as it is not possible to catch each hub cycle? If so, then I suspect overlaying is still likely to be quicker. Obviously this depends on the exact code being executed. For example, a loop will be much faster in overlays, but inline would likely be marginal if an overlay had to be reloaded at the end.
No need for apologies :) but you did miss some of the central ideas in LMM...
1) Jumps take 36 cycles on average (32 cycles 75% of the time, 48 cycles 25% of the time)
A "FJMP" is nothing more than:
rdlong pc, pc
long dest_addr
With 23 address bits or less, just set the top 9 bits to zero, which turns the address into a NOP - which means that conditional far jumps work by putting the conditions on the rdlong instruction
2) Relative jumps that are within -508 bytes to +508 bytes of the current PC take 16 (75%) ..32 (25%) cycles
Of course branches can be conditional :)
Just remember that when the sub/add executes, PC points to the instruction AFTER it
Backwards branch:
sub pc,#num_longs*4
Forward branch:
add pc,#num_longs*4
3) The only really expensive op is FCALL, but that should be rare.
4) Also read up on FCACHE, which is basically inline pasm code (overlay)
5) It is trivial to set up an "on n goto" equivalent primitive, assuming N is a number between 0 and N-1
' lmm code...
mov t,N
JMP #ONGOTO
long 0addr
long 1addr
...
long Nminusone_addr
...
ONGOTO:
shl t,#2
add pc,t
jmp #next
ONGOSUB is left as an exercise for the reader :)
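For readers more at home in C, points 1) and 2) can be modelled as below. This is a sketch under assumptions: hub memory is a plain array of longs, pc is a byte address that already points at the long after the current instruction, and a word counts as a NOP when its condition field (bits 21..18, as I read the Propeller 1 encoding) is zero, which holds for any address that fits below bit 18. The function names are invented.

```c
#include <assert.h>
#include <stdint.h>

#define COND_SHIFT 18
#define COND_MASK  0xFu

/* A word whose condition field is %0000 never executes, so it acts as a NOP. */
static int is_nop(uint32_t word)
{
    return ((word >> COND_SHIFT) & COND_MASK) == 0;
}

/* Model of "rdlong pc, pc": pc points at the long holding the destination. */
static uint32_t fjmp(const uint32_t *hub, uint32_t pc)
{
    return hub[pc / 4];
}

/* Model of "add pc, #num_longs*4" / "sub pc, #num_longs*4". */
static uint32_t rel_branch(uint32_t pc, int num_longs)
{
    int bytes = num_longs * 4;
    assert(bytes >= -508 && bytes <= 508);   /* must fit a 9-bit immediate */
    return pc + (uint32_t)bytes;
}
```

When a conditional FJMP is not taken, the kernel fetches the address long as if it were an instruction, which is why is_nop matters for the destination word.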
heater said...
LMM for ZiCog idea:
LMM is great for "straight line" code. Speed is up there with overlays. It has a problem with jumps. When LMM hits a jump it has to branch out of the load/execute loop to some LMM kernel function that reads a new LMM program counter from somewhere. This puts a hiccup in the load/execute loop. An LMM jump might look like this:
JMP #lmm_jump
LONG some_address_to_jump_to
Where lmm_jump is a routine in the LMM kernel that increments the LMM PC, reads the target address "some_address_to_jump_to", sets the LMM PC to that address then gets back to the load/execute loop.
Now for the idea:
Make all LMM jumps program counter relative with an offset of only 9 bits:
jmpr some_offset, #lmm_jump
Now the jmpr instruction does not exist. Or does it?
jmpr is a normal JMP instruction. Normally the destination field of a JMP is left empty and not used but if we could somehow fill it in with a 9 bit offset then the lmm_jump routine could extract that offset and add it to the LMM program counter. BINGO we have done an LMM jump in only one LMM instruction. Even if we still have a "hiccup" in the execute loop it is less severe.
Problem is: how to manufacture such a jump instruction in the prop Tool or BST with the dest field filled in?
I leave that as an exercise for the reader (That is, I don't know yet :) )
My apologies to Bill Henning and all working with LMM who have probably thought of all this before and come up with better techniques.
You know, some time ago I posted here my feeble attempt at a compiler for Jack Crenshaw's TINY language that generates LMM for the Prop. Feeble but it worked and I'm sure it did LMM jumps just as you describe.
That TINY did not have relative jumps. Relative (and conditional) jumps is probably all we will need in ZiCog.
Seems to me that with the 7 instruction start up overhead of the overlay loader, one of which is a HUB access in ZiCog, the difference in performance between LMM and overlays is marginal for small routines. If we get into having to reload "resident" code after an overlay is done then the difference gets even smaller, perhaps with LMM winning.
To be honest I'm having trouble making the calculations here and weighing up the swings and roundabouts.
If I can find the time I will take up Cluso's idea and just try it. If nothing else we get to finish up the missing Z80 ops in a rather elegant way I think.
@Cluso: "...Now this takes PCs with GBs of memory .... it runs slower."
Don't forget, in the old days bits were much bigger than they are today. You could do a lot more with the bigger bits :)
That code between OVERLAY_START and daa_ovl is there for a reason. When running on the demo board the Z80 RAM lives in the place where the ZiCog PASM was before starting.
@pullmoll: Great finds on the LONG hunt. I'll try and get them in and retest everything.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
Re Juergen "I might add something to the board I'll be receiving soon!? I have seen the circuit diagram of the TriBlade. It's just that I am broke right now, or otherwise I would have opted to buy one ;-P"
I might rummage around in the shed and see if I can find a spare one.
@heater, I was only joking re self modifying code. It is brilliant but the problem is you have to really understand the assembly codes to write it.
Re overlays, now I'm starting to understand how they work. I think I also understand now how you can keep adding them and not add any more bytes to the cog. Now I have to work out how to add them without adding to the hub either. That is my little problem to solve on the dracblade though. It depends on how many more bytes a full Z80 implementation comes to, and we don't know that yet.
Is not the problem just one of adding overlays one at a time for all the Z80 instructions we can think of?
Drac said...
Is not the problem just one of adding overlays one at a time for all the Z80 instructions we can think of?
Yes. It does take hub space. It also takes 1 extra hub long for each overlay (the pointer).
I should try a cutdown Sphinx to load the I/O, memory, and SD drivers before running ZiCog. That would save quite a bit of hub code. I will see what I can do - I will be busy with other things next week.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔ Links to other interesting threads:
But as we see the Prop cannot live without it and it is essential for all interesting programs.
With these and a host of other commonly accepted TRUTHs (especially when we get to memory management) we see how programmers have had to produce bigger and bigger, slower and slower code until we end up with the bloat we see all around today :)
>> "...how to add them without adding to the hub either."
What you want is an overlay loader that loads from disk, like we had in the old PC days. That's how we got 2Mb programs plus a huge pile of data to fit in the humble PC with 640K RAM. It's going to be complicated and awful slow.
>> "...Is not the problem just one of adding overlays one at a time..."
Well that's one approach. It will work provided we can find room for the dreaded DAA overlay. Pullmoll's hopefully correct DAA implementation won't fit unless we find some longs.
I forget who, but someone wanted to get ZiCog up on a Hydra which needs bigger memory access functions. They don't fit. But perhaps that can be handled by Bill's VMCog.
@Bill: Once again I woke up from a fitful sleep with an odd idea. One thing I have not seen mentioned is a JMPRET instruction for LMM. After all many times we don't want recursive code and a stack.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
I think I must have missed a post there. Is there a problem with the DAA code we have - ie is it not working properly and needs recoding and the new code now doesn't fit?
The current DAA code works as per an Intel 8080. As far as I can tell. It was written to the description of DAA in the 8080 manuals and I did a lot of testing on it comparing to DAA in the AltairZ80 simulator. It works well enough for the SURVEY program which is the only program I have found that uses it so far.
But it fails the EX test which is testing as per a real Z80.
I seem to remember that the Z80 DAA is a fixed up version of the original 8080 DAA in that it also works correctly after subtractions. Using it after SUB is something I did not test.
PullMoll has experience getting DAA correct in the MAME emulator and has made a PASM version for us. It's a bit bigger than what we have now. We might be able to squeeze it in as he has also found a few LONGs we can save elsewhere.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
For anyone who needs 8 or 10 longs immediately I just noticed the 8 instructions with a DDCB prefix are not implemented (anything using IX/IY with a displacement). That means the variables "address_stash" and "data_stash" can be removed along with any instructions that deal with them.
There are some shift and rotate instructions in that DDCB group that are a pain because they set the flags differently than the similar ones for 8080 which we already have in place.
Just been looking over the ZiCog code. Seems to me by going over to LMM instead of overlays we can save 22 LONGs.
I'm also pondering the register to register moves MOV B,A etc. From looking at CP/M code I'd guess that those moves form a bulk of the instructions and should be optimized for speed. Currently we do them with micro-ops called from the dispatch tables. A long time ago Hippy showed a routine on the old PropAltair thread that would do them more directly by decoding the instruction bits. It would be a lot quicker if we found the space for it.
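The decoding idea rests on the regular 8080 encoding of the register moves, 01 ddd sss, with registers numbered B, C, D, E, H, L, (HL), A. The C sketch below shows only that bit extraction; it is not Hippy's routine, the helper names are made up, and code 6 (the memory operand via HL) would still need its own path in a real emulator.

```c
#include <assert.h>
#include <stdint.h>

/* 8080/Z80 register codes as they appear in the opcode bit fields. */
enum { REG_B, REG_C, REG_D, REG_E, REG_H, REG_L, REG_M, REG_A };

/* Opcodes $40..$7F are the MOV/LD r,r' group, except $76 which is HALT. */
static int is_reg_move(uint8_t op)
{
    return (op & 0xC0) == 0x40 && op != 0x76;
}

static int move_dst(uint8_t op)
{
    assert(is_reg_move(op));
    return (op >> 3) & 7;
}

static int move_src(uint8_t op)
{
    assert(is_reg_move(op));
    return op & 7;
}
```

For MOV B,A (opcode $47) this yields destination REG_B and source REG_A, so one routine can serve the whole group instead of one dispatch table entry per move.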
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
I seem to remember that the Z80 DAA is a fixed up version of the original 8080 DAA in that it also works correctly after subtractions. Using it after SUB is something I did not test.
Yep. It's the sole purpose of the Z80 N flag (bit #1) to keep track of the most recent add/adc/inc or sub/sbc/dec instruction and switch handling of the DAA. The 8086, which in many regards resembles the Z80 design, does not have an N flag and instead has separate DAA and DAS instructions.
heater said...
PullMoll has experience getting DAA correct in the MAME emulator and has made a PASM version for us. It's a bit bigger than what we have now. We might be able to squeeze it in as he has also found a few LONGs we can save elsewhere.
For reference and to verify my PASM version is right, here's the piece of C code that does DAA in MAME:
UINT8 a = (Z)->A;
if ((Z)->F & NF) {
    if (((Z)->F&HF) | (((Z)->A&0xf)>9)) a-=6;
    if (((Z)->F&CF) | ((Z)->A>0x99)) a-=0x60;
}
else {
    if (((Z)->F&HF) | (((Z)->A&0xf)>9)) a+=6;
    if (((Z)->F&CF) | ((Z)->A>0x99)) a+=0x60;
}
(Z)->F = ((Z)->F&(CF|NF)) | ((Z)->A>0x99) | (((Z)->A^a)&HF) | SZP[a];
(Z)->A = a;
Ooops! I just saw I have to add/sub #$60, not #$66, which is what was in my post of yesterday.
(Z) is the current Z80 cpu context (there are games running on 3 Z80s! simultaneously).
NF is the negative flag. HF is the auxiliary (half) carry flag. CF is carry flag. SZP[] is an array of precalculated sign, zero and parity bits for a byte value.
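To try the fragment above outside MAME, it can be wrapped in a small stand-alone C function. This is a sketch under assumptions: the flag masks are the standard Z80 bit positions, the CPU context is reduced to two byte pointers, and szp() computes sign, zero and parity on the fly instead of using a table, leaving out whatever undocumented bits the real table may carry.

```c
#include <assert.h>
#include <stdint.h>

/* Standard Z80 flag bit positions. */
enum { CF = 0x01, NF = 0x02, PF = 0x04, HF = 0x10, ZF = 0x40, SF = 0x80 };

/* Sign, zero and (even) parity flags for a byte value. */
static uint8_t szp(uint8_t v)
{
    uint8_t f = (uint8_t)(v & SF);
    int ones = 0;
    for (int i = 0; i < 8; i++)
        ones += (v >> i) & 1;
    if (v == 0)
        f |= ZF;
    if ((ones & 1) == 0)
        f |= PF;
    return f;
}

/* DAA following the structure of the quoted fragment. */
static void daa(uint8_t *a_reg, uint8_t *f_reg)
{
    assert(a_reg && f_reg);
    uint8_t A = *a_reg, F = *f_reg, a = A;
    int fix_low  = (F & HF) || (A & 0x0F) > 9;
    int fix_high = (F & CF) || A > 0x99;
    if (F & NF) {                       /* last operation was a subtraction */
        if (fix_low)  a = (uint8_t)(a - 0x06);
        if (fix_high) a = (uint8_t)(a - 0x60);
    } else {                            /* last operation was an addition */
        if (fix_low)  a = (uint8_t)(a + 0x06);
        if (fix_high) a = (uint8_t)(a + 0x60);
    }
    *f_reg = (uint8_t)((F & (CF | NF)) | (A > 0x99 ? CF : 0) | ((A ^ a) & HF) | szp(a));
    *a_reg = a;
}
```

As a check on the logic: $15 + $27 leaves $3C in A, and DAA turns that into $42; $42 - $15 leaves $2D with N and H set, and DAA turns that into $27.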
HTH,
Juergen
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
He died at the console of hunger and thirst.
Next day he was buried. Face down, nine edge first.
I had realized that the Z80 DAA also works following subtract and I have seen other emulator code that maintains an N flag to keep track of whether the last arithmetic op was an addition or subtraction.
ZiCog does not have an N flag because it is undocumented and because it's another instruction to slow us down.
What I had not realized is that it is possible to get the DAA right without an N flag. Is it really so?
What is ->F in your MAME code example ? Is that the secret ADD/SUB flag?
By the way the more I think about changing from overlays to LMM code for the Z80 ops the more I like it. The thing is it is not possible to use self-modifying code in LMM, well not in the normal way one would with normal PASM. So a working version of your first, longer DAA, would be just great.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
heater said...
pullmoll: OK I have to study this some more.
I had realized that the Z80 DAA also works following subtract and I have seen other emulator code that maintains an N flag to keep track of whether the last arithmetic op was an addition or subtraction.
ZiCog does not have an N flag because it is undocumented and because it's another instruction to slow us down.
I thought I had seen that you are maintaining #%00000010 in the add/sub opcodes, no? This is the N flag! You just have no name for it. Here's MAME's flag list:
heater said...
What I had not realized is that it is possible to get the DAA right without an N flag. Is it really so?
Well, you can get the non-N-flag case right and thus most programs, since most of the time it's decimal addition that's used. One example is the Z80 hexadecimal conversion code that uses DAA twice. It's using just additions. The DAS case is rarely used, except of course by BCD arithmetic libraries - and obviously by Z80 proofing programs.
heater said...
What is ->F in your MAME code example ? Is that the secret ADD/SUB flag?
That's the flags byte, i.e. the LSB of AF.
heater said...
By the way the more I think about changing from overlays to LMM code for the Z80 ops the more I like it. The thing is it is not possible to use self-modifying code in LMM, well not in the normal way one would with normal PASM. So a working version of your first, longer DAA, would be just great.
Will you try yourself? You'd just have to go one of two paths, one for the true DAA, one for the DAS with the SUBs instead of the ADDs with #$6 and #$60.
BTW: I think there's a flaw in my translation. The final test if the result is > $99 should only _set_ the carry flag, but not clear it. The "muxc" will also clear the carry, so this has to be changed to a conditional "or flags, #carry_flag" instead.
Juergen
P.S: Just for fun - attached the current code of my try to write a recompiling Z80 in PASM. It looks like I will be left with just a small block to compile code into, at least if I do arithmetic and logic flag calculations in the kernel code to save hub RAM space... darn. When is the Prop2 scheduled?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
He died at the console of hunger and thirst.
Next day he was buried. Face down, nine edge first.
Pullmoll: "I thought I had seen that you are maintaining #%00000010 in the add/sub opcodes, no?"
Good grief you are right. I really don't remember doing that. That will teach me not to use unnamed magic numbers in the code. Must fix that.
Perhaps I was already heading towards a full up DAA, I do remember studying the AltairZ80 simulator's implementation prior to realizing I could not get it to fit.
Or was it because our Z80 exerciser program checks that flag even if we configure it to do documented tests only?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
heater said...
Another "Good grief!". For someone who does not have a Prop to test on that is a lot of PASM code in pm80.spin.
Well, I'm practicing swimming on the dry so to speak. If all goes well, the post man should ring my door bell in the next couple of hours. I have everything set up to start right away. I guess I should double check my AC adapter has the correct polarity in order to not blow my Prop to the happy hunting grounds of silicon.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
He died at the console of hunger and thirst.
Next day he was buried. Face down, nine edge first.
Comments
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
>> ".... how is the DAA code in the middle of the cog code? Why doesn't this displace the cog code after
>> this point so it can't then be loaded?"
Thing is the DAA code is NOT in the middle of COG code. It is at the end. There is nothing to displace. All the code you see after the DAA code is overlays that sit in HUB waiting to be loaded into COG at some point. When they are loaded they overwrite the DAA code. But DAA itself is an overlay so it can be reloaded from HUB again when needed.
>> "So as an extreme example, running a Z80 overlay could rewrite almost the entire cog, then do
>> something, then reload the cog."
Yep, you have it exactly. But it's this reloading part that starts me thinking it would slow things down to LMM speeds and that LMM may just be easier/more elegant.
I don't think we want to get into loading COG blocks (overlays) from external memory. Too slow and complicated. Although I see that may be a solution to some problems. It starts to require an operating system and a development environment to support it. As you say, how do you make your loaded code aware of variables and functions in the system already? That's what linkers and locators do.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
LMM is great for "straight line" code. Speed is up there with overlays. It has a problem with jumps. When LMM hits a jump it has to branch out of the load/execute loop to some LMM kernel function that reads a new LMM program counter from somewhere. This puts a hiccup in the load/execute loop. An LMM jump might look like this:
JMP #lmm_jump
LONG some_address_to_jump_to
Where lmm_jump is a routine in the LMM kernel that increments the LMM PC, reads the target address "some_address_to_jump_to", sets the LMM PC to that address then gets back to the load/execute loop.
Now for the idea:
Make all LMM jumps program counter relative with an offset of only 9 bits:
Now the jmpr instruction does not exist. Or does it?
jmpr is a normal JMP instruction. Normally the destination field of a JMP is left empty and not used but if we could somehow fill it in with a 9 bit offset then the lmm_jump routine could extract that offset and add it to the LMM program counter. BINGO we have done an LMM jump in only one LMM instruction. Even if we still have a "hiccup" in the execute loop it is less severe.
Problem is: how to manufacture such a jump instruction in the Prop Tool or BST with the dest field filled in?
I leave that as an exercise for the reader (That is, I don't know yet :) )
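One way to picture the dest-field trick is to build the 32-bit instruction word by hand. The sketch below is a Python model, not Prop Tool syntax; make_jmpr and this lmm_jump are hypothetical helpers, and counting the offset in longs is an assumption. The only hard facts used are the PASM field layout (dest in bits 17..9, source in bits 8..0) and that "jmp #0" assembles to $5C7C0000.

```python
JMP_IMM = 0x5C7C0000  # 'jmp #0': opcode %010111, no result written, immediate source

def make_jmpr(kernel_addr, offset_longs):
    """Build 'jmp #kernel_addr' with a signed 9-bit offset tucked into the dest field."""
    assert 0 <= kernel_addr < 512
    assert -256 <= offset_longs <= 255
    return JMP_IMM | ((offset_longs & 0x1FF) << 9) | kernel_addr

def lmm_jump(instr, lmm_pc):
    """What the kernel routine would do: sign-extend bits 17..9 and add them to the PC."""
    offset = (instr >> 9) & 0x1FF
    if offset & 0x100:
        offset -= 0x200
    return lmm_pc + 4 * offset  # hub addresses count bytes, instructions are longs

word = make_jmpr(0x010, -3)
print(hex(word))
print(hex(lmm_jump(word, 0x1000)))  # 0xff4, three longs back from 0x1000
```

In assembler source the same word could presumably be emitted with a LONG directive that ORs the shifted offset into the JMP encoding, which would sidestep the question of how to write the instruction itself.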
My apologies to Bill Henning and all working with LMM who have probably thought of all this before and come up with better techniques.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I think I can make it shorter with self modifying code. The :das and :daa sections are basically the same, except one uses SUB, the other ADD with the same constants. Modifying these two opcodes based on the N flag ought to be shorter than 6 longs.
ADD is %100000 and SUB is %100001, so the N flag would just have to be copied to bit #26 of the two instructions.
I'm thinking of something like
Hmm.. it isn't too much shorter, just 3 longs because I need this add_sub_bit somewhere.
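The opcode patch itself is just one bit. As a sanity check, here is a Python model of it; ADD_SUB_BIT follows the name used in the post, while patch_add_sub and the sample instruction word are illustrative assumptions. In PASM the same effect would come from a MUXC or MUXNZ on the instruction long, depending on how the N flag is tested.

```python
ADD_SUB_BIT = 1 << 26      # lowest bit of the opcode field (bits 31..26)
ADD_IMM_6 = 0x80FC0006     # encoding of 'add 0, #6' (dest register 0 chosen arbitrarily)

def opcode(instr):
    """Extract the 6-bit opcode field of a PASM instruction word."""
    return (instr >> 26) & 0x3F

def patch_add_sub(instr, n_flag):
    """Copy the Z80 N flag into bit 26: 0 gives ADD, 1 gives SUB."""
    return (instr | ADD_SUB_BIT) if n_flag else (instr & ~ADD_SUB_BIT)

print(bin(opcode(patch_add_sub(ADD_IMM_6, False))))  # 0b100000 (ADD)
print(bin(opcode(patch_add_sub(ADD_IMM_6, True))))   # 0b100001 (SUB)
```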
Would a 30 long DAA fit?
Juergen
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Sadly not.
I seriously think we should give up on all this scrimping and saving for LONGs if we ever want to get to 100% correct documented emulation behavior.
I'm leaning towards going LMM instead of overlays at this moment. No time to look at this until the weekend I'm afraid.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
You know what each Z80 instruction does and so would be able to write down the PASM equivalent for it, and put it in the hub RAM.
Now executing Z80 code would on the first run compile the PASM into the cog RAM right after the compiler code, like the overlay loader does.
Whenever a Z80 instruction jumps, returns, or alters the program flow other than jumping back to already compiled code, you would flush the compiled space and start over compiling the instructions from the new location - of course after first executing what had been compiled. Flushing and recompilation would also be necessary if some straightforward Z80 program flow overruns the cog space.
Let's assume some imaginary Z80 subroutine is called and returns at some point:
Now you would in some sense disassemble the Z80 code, that is looking up the opcodes in the hub RAM tables and paste the code it points to sequentially into cog RAM. The JR opcode would have a flag bit set to indicate that you have to verify it jumps inside the compiled code, or otherwise end the execution block. After you compiled the instructions, you would just let the PASM code run and then jump back to the compiler to handle the next block of code.
The compilation process might seem to introduce a lot of overhead, but imagine what speed you get from running all these instructions in native PASM. The conditional jump destination address would have to be translated to the corresponding cog RAM address during compilation, which means you would have to keep track of the locations where the previous Z80 opcodes began. Perhaps a table could be kept at the end of the cog RAM growing down and thus decreasing the number of longs of compiled code that can be pasted in the cog.
I have myself not yet written a DRC core, but I know they can be very efficient. MAME contains DRC CPU cores for e.g. the PowerPC and some MIPS chips and they do very well. Of course you have a lot more RAM on a PC and can use clever caching algorithms to avoid recompilation of already compiled blocks. For the Prop I think it doesn't make sense to think about caching, because there's just not enough space to have more than a dozen or so opcodes lingering in the cog RAM.
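A toy version of the compile loop may make the idea easier to judge. This is a Python sketch under heavy assumptions: the TEMPLATES table, the register names and the PASM strings are invented for illustration, Z80 flag handling is ignored, and only JR NZ and RET are treated as control flow. It is not the code attached to the thread.

```python
# Hypothetical table: Z80 opcode -> (length in bytes, PASM longs, changes program flow?)
TEMPLATES = {
    0x3C: (1, ["add reg_a, #1"], False),        # INC A
    0x3D: (1, ["sub reg_a, #1 wz"], False),     # DEC A
    0x47: (1, ["mov reg_b, reg_a"], False),     # LD B,A
    0x20: (2, ["if_nz jmp #<target>"], True),   # JR NZ,e
    0xC9: (1, [], True),                        # RET, always handled by the compiler
}

def compile_block(mem, pc, space_longs):
    """Paste templates into a block until flow leaves it or cog space runs out.

    Returns (block, next_pc); next_pc is the first Z80 address not translated.
    """
    block = []
    starts = {}                                  # Z80 address -> index in block
    while mem[pc] in TEMPLATES:
        length, longs, is_flow = TEMPLATES[mem[pc]]
        if len(block) + len(longs) + 1 > space_longs:
            break                                # keep one long for the exit jump
        if mem[pc] == 0x20:
            disp = mem[pc + 1] - 256 if mem[pc + 1] > 127 else mem[pc + 1]
            target = pc + 2 + disp
            if target not in starts:
                break                            # jumps outside the compiled code
            starts[pc] = len(block)
            block.append("if_nz jmp #block+%d" % starts[target])
        elif is_flow:
            break
        else:
            starts[pc] = len(block)
            block.extend(longs)
        pc += length
    block.append("jmp #compiler")                # run the block, then come back
    return block, pc

program = [0x3C, 0x47, 0x3D, 0x20, 0xFD, 0xC9]   # INC A / LD B,A / DEC A / JR NZ,-3 / RET
block, next_pc = compile_block(program, 0, 16)
print(block)
print(next_pc)
```

The starts dictionary plays the role of the address table mentioned above; in a real cog it would eat into the space available for compiled longs.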
I might try to do something like this once my Propeller board arrives... *sigh*
Juergen
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Post Edited (pullmoll) : 3/4/2010 10:56:08 AM GMT
Drac: The overlay loader does not care where the block resides in cog, nor its length. It is just a block that is loaded into the correct space. It loads in reverse, so as soon as it is complete the loader loop is overwritten and causes the overlay to execute. The loader catches the hub sweet spot at 16 clocks.
Forward LMM takes 32 but also gets to execute the 1 instruction. Reverse LMM is of course way more complicated. However, once we have the instructions executing in forward LMM, it could be put into reverse LMM. It is not as if we are trying to do flexible compilation.
So, we can either add the extra instructions in forward LMM and see how it goes or, alternatively, add them in overlays and see how it goes. Either way, once it is all complete, we can have a good look at the result. Then we can do some timing to see which way may work best, even a mix maybe.
Forget putting the code in SRAM.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Links to other interesting threads:
· Home of the MultiBladeProps: TriBlade,·RamBlade,·SixBlade, website
· Single Board Computer:·3 Propeller ICs·and a·TriBladeProp board (ZiCog Z80 Emulator)
· Prop Tools under Development or Completed (Index)
· Emulators: CPUs Z80 etc; Micros Altair etc;· Terminals·VT100 etc; (Index) ZiCog (Z80) , MoCog (6809)·
· Prop OS: SphinxOS·, PropDos , PropCmd··· Search the Propeller forums·(uses advanced Google search)
My cruising website is: ·www.bluemagic.biz·· MultiBlade Props: www.cluso.bluemagic.biz
Post Edited (Cluso99) : 3/4/2010 11:07:16 AM GMT
Admittedly that LMM was straight line code but if we had relative LMM jumps implemented as I described above it would make reverse LMM with jumps easier.
I think a normal unrolled LMM loop will do just fine for starters.
Pullmoll: Dynamic compilation seems to work fine for Java and C#. I guess it needs the RAM space to keep a cache of compiled stuff. I can't see it working out on the humble Prop.
Also people have suggested something similar before, not dynamic but a pre-translation of Z80 programs to equivalent PASM. I never figured out how such a scheme would deal with self modifying Z80 code.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
@heater - "I never figured out how such a scheme would deal with self modifying Z80 code."
Self modifying code is pure evil. I have never written or used such a thing. D'Oh! I just realised, DDT uses modifying code to emulate Z80 instructions as DDT is an 8080 program. Darn it, DDT is one of the first programs you use when testing out some simple Z80 code...
@Cluso = you are way ahead of me there. Overlays that hit the sweet spot of 16 clocks? Yes, of course it does, but only if you know it does. And that is brilliant.
Re where the overlays sit, I'm looking at the code and maybe it is at the end, maybe it isn't. Logically it has to be. Right after the overlay code is a CON section and that is going to be turned into internal data for the compiler, not actual code. So ok there. But after that is a long list of DAT code that is the list of opcodes for the Z80. Does this not end up in the cog? Is it a list of bytes/longs in hub that the cog can reference?
If so, gee there isn't much code for the cog. How did you fit it all in, heater/cluso?
Is the bit that actually fits in the cog at startup just the code from the beginning up till the end of the DAA $fit ? If so, that is only about a third of the zicog object.
Yikes, ignore the commented out code, the ram routines, the #ifdef conditional code that isn't compiled, and the actual zicog in the cog is not much at all. Which block of code would you read in/out to make room for bigger overlays?
Do you indeed need bigger overlays? DAA is big, but others seem smaller and even big things like my lousy EXX code could be shrunk with Juergen's clever xor trick.
Despite it being messy to bring in overlays from external ram, I may be forced to on the dracblade. It shouldn't be too hard. All code has to fit in the space DAA fits in. Pass just one variable - the location of the first register. Ok, it will be inefficient in that instead of "get value in register H" it will be "add 5 to location of reg_base and then get value in that location". But these are instructions that are not used very often.
I still don't quite understand LMM. I looked up "LMM Large Memory Model" on Google and the first link tells me that it is "an alternative programming environment suggested by Bill Henning"
I'm still digesting Juergen's DRC. If you trap things like DDT poking values into ram that then become instructions, then maybe it is possible to precompile sections of code. Could definitely be some gains there in reading from external ram as that is more efficient reading in small blocks than individual bytes (much more so with dracblade than cluso's ramblade). I wonder how much you could recompile Z80 code to PASM and still allow for pokes into code and all the traps for jmps and calls and rets. It is intriguing as it potentially gets Z80 up to pasm speeds, but I just can't quite get my head around it.
We have heaps of ram space for precompiled code. 512k. DRC seems to me to be like 'super smart caching' where it is a bit smarter than just getting 128 byte blocks of data. It is an intriguing idea.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.smarthome.viviti.com/propeller
No, self-modifying code is brilliant. You can't do anything complex with PASM without it :)
"....Does this not end up in the cog"
When you start a COG it always gets loaded with 496 instructions from wherever you told it to start from. If your PASM is only, say, 100 LONGs then a load of random junk from whatever comes next will also be loaded.
As you see fitting ZiCog in is tight, it's taken two years to get to this point.
Yep, everything from the beginning to the end of DAA gets loaded at start up. And it is only about a third of ZiCog, the rest is overlays and the dispatch tables that live in HUB. As you see selecting some of that "resident" code to push out to an overlay is tricky.
The EX and EXX code can be shrunk by using my pointer manipulation suggestion :)
LMM is conceptually simple. And like all simple things it took a genius, Bill Henning, to think of it. Basically instead of loading a bunch of LONGs and executing them as the overlay system does you have a tight loop which fetches a single instruction from HUB into the COG, executes it and then loops around to fetch the next one.
Putting overlays into external RAM is I guess possible if you are running out of HUB space. It's going to be slow and hard to engineer. If we switch to LMM the same applies.
The whole idea of pre-compiling or dynamically compiling Z80 into PASM is just too big and complicated for me to even start thinking about. I'd rather use the heaps of external RAM space for Zog :)
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Re self-modifying code. It is brilliant for doing things fast. As you can see, it removes the need for a stack. It does have its problems and you need to be careful. I programmed a mini that had only 16 assembler risc instructions including self-modifying code. We ran online order entry, inventory and accounting on a $100K machine with 10KB common (aka hub code) and 5-10KB in each partition (aka cog - we had up to 20). This was mid 70s. 10KB of core memory was ~$20K. Now this takes PCs with GBs of memory and while the interface looks nice and pretty in colors, it runs slower.
heater: I just noted that there is an area between the ZiCog cog code and the DAA overlay which is only used for the ProtoBoard. Probably this bit should be shifted from here.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Actually, the reason why I started looking at the Propeller is I expect to get a contract for the design of a data logger with SD cards. Getting this done should hopefully solve my financial issues for some time.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Why not reverse the condition?
And then the relative jump:
This should do as well:
I'll leave the jump and call handlers to you, so you can get 4 more longs in total. Your DAA is 23 longs, mine is 27 in the self modifying version:
Just a thought while looking at homespun's output of zicog.spin
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Post Edited (pullmoll) : 3/5/2010 9:14:07 AM GMT
With the LMM kernel unrolled four times (like in my original posting from 2006), an LMM instruction takes 20 clocks on average, as three of four instructions will execute in 16 clocks and one will execute in 32 clocks.
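The 20-clock figure is a weighted average, which is easy to check. The helper below is a throwaway Python calculation; the 80 MHz clock used for the instructions-per-second line is an assumption (the usual 5 MHz crystal with the 16x PLL), not something stated in the post.

```python
def average_clocks(cases):
    """cases is a list of (fraction_of_instructions, clocks) pairs."""
    assert abs(sum(fraction for fraction, _ in cases) - 1.0) < 1e-9
    return sum(fraction * clocks for fraction, clocks in cases)

unrolled_x4 = average_clocks([(0.75, 16), (0.25, 32)])
print(unrolled_x4)                    # 20.0 clocks per LMM instruction
print(80_000_000 / unrolled_x4)       # 4000000.0 LMM instructions per second at 80 MHz
```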
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
Las - Large model assembler Largos - upcoming nano operating system
No need for apologies :) but you did miss some of the central ideas in LMM...
1) Jumps take 36 cycles on average (32 cycles 75% of the time, 48 cycles 25% of the time)
A "FJMP" is nothing more than:
rdlong pc, pc
long dest_addr
With 23 address bits or less, just set the top 9 bits to zero, which turns the address into a NOP. That means conditional far jumps work by putting the conditions on the rdlong instruction.
2) Relative jumps that are within -508 bytes to +508 bytes of the current PC take 16 (75%) to 32 (25%) cycles
Of course branches can be conditional :)
Just remember that when the sub/add executes, PC points to the instruction AFTER it
Backwards branch:
sub pc,#num_longs*4
Forward branch:
add pc,#num_longs*4
3) The only really expensive op is FCALL, but that should be rare.
4) Also read up on FCACHE, which is basically inline pasm code (overlay)
5) It is trivial to set up an "on n goto" equivalent primitive, assuming n is a number between 0 and N-1
ONGOSUB is left as an exercise for the reader :)
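Points 1) and 2) can be tried out in a small simulator. The following Python model is only a sketch: instructions are tuples rather than real 32-bit encodings, "fjmp" stands for the rdlong pc, pc primitive, the condition names and the "sub wz" spelling are my own shorthand, and clock counts are not modelled. What it does preserve is that PC already points past the instruction when it executes, and that a target-address long is harmless if it is ever fetched as an instruction.

```python
def is_nop(long_value):
    """A long whose condition field (bits 21..18) is %0000 never executes."""
    return (long_value >> 18) & 0xF == 0

def cond_ok(cond, z):
    return cond == "always" or (cond == "if_z" and z) or (cond == "if_nz" and not z)

def run_lmm(hub, regs, max_steps=1000):
    """Fetch one instruction from hub, advance pc, execute it, repeat."""
    for _ in range(max_steps):
        instr = hub[regs["pc"] // 4]          # rdlong instr, pc
        regs["pc"] += 4                       # add pc, #4 (pc now points past instr)
        if isinstance(instr, int):            # a raw data long, e.g. an FJMP target
            assert is_nop(instr), "data long would execute as a real instruction"
            continue
        cond, op, dst, src = instr
        if not cond_ok(cond, regs["z"]):
            continue
        if op == "halt":
            return regs
        if op == "fjmp":                      # rdlong pc, pc
            regs["pc"] = hub[regs["pc"] // 4]
        elif op == "mov":
            regs[dst] = src
        elif op == "add":
            regs[dst] += src
        elif op in ("sub", "sub wz"):
            regs[dst] -= src
            if op == "sub wz":
                regs["z"] = regs[dst] == 0
    raise RuntimeError("step limit reached")

program = [
    ("always", "mov", "count", 3),        # address 0
    ("always", "add", "total", 10),       # 4: loop body
    ("always", "sub wz", "count", 1),     # 8
    ("if_nz", "sub", "pc", 3 * 4),        # 12: backwards branch, 3 longs back from 16
    ("always", "fjmp", None, None),       # 16: far jump
    32,                                   # 20: long dest_addr
    ("always", "add", "total", 1000),     # 24: skipped by the far jump
    ("always", "add", "total", 1000),     # 28: skipped by the far jump
    ("always", "halt", None, None),       # 32
]
regs = run_lmm(program, {"pc": 0, "z": False, "count": 0, "total": 0})
print(regs["total"])                      # 30
```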
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
You know, some time ago I posted here my feeble attempt at a compiler for Jack Crenshaw's TINY language that generates LMM for the Prop. Feeble but it worked and I'm sure it did LMM jumps just as you describe.
That TINY did not have relative jumps. Relative (and conditional) jumps are probably all we will need in ZiCog.
Seems to me that with the 7 instruction start up overhead of the overlay loader, one of which is a HUB access in ZiCog, the difference in performance between LMM and overlays is marginal for small routines. If we get into having to reload "resident" code after an overlay is done then the difference gets even smaller, perhaps with LMM winning.
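A back-of-envelope model helps with the swings and roundabouts. Everything in it is an assumption except the 16 clocks per loaded long and the 20-clock LMM average quoted earlier in the thread: the 40-clock setup cost is a guess at seven loader instructions including one hub access, each overlay instruction is assumed to take 4 clocks, and the code is assumed to be straight-line and run once.

```python
def overlay_clocks(n_longs, setup=40, reload_longs=0):
    """Load an overlay, run it once, optionally reload displaced resident code."""
    load = setup + 16 * n_longs
    run = 4 * n_longs
    reload = setup + 16 * reload_longs if reload_longs else 0
    return load + run + reload

def lmm_clocks(n_longs, average=20):
    """Run the same instructions through a four-way unrolled LMM loop."""
    return average * n_longs

for n in (5, 23):
    print(n, overlay_clocks(n), overlay_clocks(n, reload_longs=n), lmm_clocks(n))
```

Under these assumptions LMM is never slower for run-once code. An overlay only pulls ahead when its code loops internally, since repeated passes then cost roughly 4 clocks per instruction instead of 20.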
To be honest I'm having trouble making the calculations here and weighing up the swings and roundabouts.
If I can find the time I will take up Cluso's idea and just try it. If nothing else we get to finish up the missing Z80 ops in a rather elegant way I think.
@Cluso: "...Now this takes PCs with GBs of memory .... it runs slower."
Don't forget, in the old days bits were much bigger than they are today. You could do a lot more with the bigger bits :)
That code between OVERLAY_START and daa_ovl is there for a reason. When running on the demo board the Z80 RAM lives in the place where the ZiCog PASM was before starting.
@pullmoll: Great finds on the LONG hunt. I'll try and get them in and retest everything.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I might rummage around in the shed and see if I can find a spare one.
@heater, I was only joking re self modifying code. It is brilliant but the problem is you have to really understand the assembly codes to write it.
Re overlays, now I'm starting to understand how they work. I think I also understand now how you can keep adding them and not add any more bytes to the cog. Now I have to work out how to add them without adding to the hub either. That is my little problem to solve on the dracblade though. It depends on how many more bytes a full Z80 implementation comes to, and we don't know that yet.
Is not the problem just one of adding overlays one at a time for all the Z80 instructions we can think of?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I should try a cut-down Sphinx to load the I/O, memory, and SD drivers before running ZiCog. That would save quite a bit of hub code. I will see what I can do - I will be busy with other things next week.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Of course all us BASIC and assembler programmers know that GOTO is essential.
Others have had a downer on self-modifying code en.wikipedia.org/wiki/Self-modifying_code#Disadvantages
But as we see the Prop cannot live without it and it is essential for all interesting programs.
With these and a host of other commonly accepted TRUTHs (especially when we get to memory management) we see how programmers have had to produce bigger and bigger, slower and slower code until we end up with the bloat we see all around today :)
>> "...how to add them without adding to the hub either."
What you want is an overlay loader that loads from disk, like we had in the old PC days. That's how we got 2Mb programs plus a huge pile of data to fit in the humble PC with 640K RAM. It's going to be complicated and awful slow.
>> "...Is not the problem just one of adding overlays one at a time..."
Well that's one approach. It will work provided we can find room for the dreaded DAA overlay. Pullmoll's hopefully correct DAA implementation won't fit unless we find some longs.
I forget who, but someone wanted to get ZiCog up on a Hydra which needs bigger memory access functions. They don't fit. But perhaps that can be handled by Bill's VMCog.
@Bill: Once again I woke up from a fitful sleep with an odd idea. One thing I have not seen mentioned is a JMPRET instruction for LMM. After all many times we don't want recursive code and a stack.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
But it fails the EX test which is testing as per a real Z80.
I seem to remember that the Z80 DAA is a fixed up version of the original 8080 DAA in that it also works correctly after subtractions. Using it after SUB is something I did not test.
PullMoll has experience getting DAA correct in the MAME emulator and has made a PASM version for us. It's a bit bigger than what we have now. We might be able to squeeze it in as he has also found a few LONGs we can save elsewhere.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
There are some shift and rotate instructions in that DDCB group that are a pain because they set the flags differently than the similar ones for 8080 which we already have in place.
Just been looking over the ZiCog code. Seems to me by going over to LMM instead of overlays we can save 22 LONGs.
I'm also pondering the register to register moves MOV B,A etc. From looking at CP/M code I'd guess that those moves form a bulk of the instructions and should be optimized for speed. Currently we do them with micro-ops called from the dispatch tables. A long time ago Hippy showed a routine on the old PropAltair thread that would do them more directly by decoding the instruction bits. It would be a lot quicker if we found the space for it.
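The bit-decoding approach relies on the regular layout of the register-to-register opcodes: binary 01 ddd sss, with 0x76 reserved for HLT. A Python sketch of the decode step follows; it is not Hippy's routine, which is not quoted in this thread, and the function name is made up.

```python
REGS = ["B", "C", "D", "E", "H", "L", "(HL)", "A"]   # 3-bit register codes 0..7

def decode_mov(opcode):
    """Return (dst, src) for an 8080 MOV / Z80 LD r,r' opcode, or None otherwise."""
    if opcode & 0xC0 != 0x40 or opcode == 0x76:       # 0x76 is HLT, not a move
        return None
    return REGS[(opcode >> 3) & 7], REGS[opcode & 7]

print(decode_mov(0x47))   # ('B', 'A'), i.e. MOV B,A
print(decode_mov(0x7E))   # ('A', '(HL)'), i.e. MOV A,M
```

A direct decoder along these lines trades one table lookup per move for a couple of shifts and masks, which is where the hoped-for speed gain would come from.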
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Yep. It's the sole purpose of the Z80 N flag (bit #1) to keep track of the most recent add/adc/inc or sub/sbc/dec instruction and switch handling of the DAA. The 8086, which in many regards resembles the Z80 design, does not have an N flag and instead has separate DAA and DAS instructions.
For reference and to verify my PASM version is right, here's the piece of C code that does DAA in MAME:
Ooops! I just saw I have to add/sub #$60, not #$66, which is what was in my post of yesterday.
(Z) is the current Z80 cpu context (there are games running on 3 Z80s! simultaneously).
NF is the negative flag. HF is the auxiliary (half) carry flag. CF is the carry flag. SZP[] is an array of precalculated sign, zero and parity bits for a byte value.
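For readers without the MAME source to hand, the same logic can be written out as a short Python model. This is a sketch of the widely documented Z80 DAA behaviour for the documented flags only, using the flag names explained above; it is not the MAME listing, and the undocumented X and Y bits are ignored.

```python
CF, NF, PF, HF, ZF, SF = 0x01, 0x02, 0x04, 0x10, 0x40, 0x80

def szp(value):
    """Sign, zero and (even) parity flags for an 8-bit value."""
    flags = value & SF
    if value == 0:
        flags |= ZF
    if bin(value).count("1") % 2 == 0:
        flags |= PF
    return flags

def daa(a, f):
    """Return (new_a, new_f) after DAA, given accumulator a and flags byte f."""
    adjust = 0
    if (f & HF) or (a & 0x0F) > 9:
        adjust |= 0x06
    if (f & CF) or a > 0x99:
        adjust |= 0x60
    result = (a - adjust if f & NF else a + adjust) & 0xFF
    flags = (f & (CF | NF)) | szp(result) | ((a ^ result) & HF)
    if a > 0x99:
        flags |= CF                     # carry is only ever set here, never cleared
    return result, flags

print(hex(daa(0x3C, 0)[0]))             # 0x42, from 0x15 + 0x27
print(hex(daa(0x2D, NF | HF)[0]))       # 0x27, from 0x42 - 0x15
```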
HTH,
Juergen
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I had realized that the Z80 DAA also works following subtract and I have seen other emulator code that maintains an N flag to keep track of whether the last arithmetic op was an addition or subtraction.
ZiCog does not have an N flag because it is undocumented and because it's another instruction to slow us down.
What I had not realized is that it is possible to get the DAA right without an N flag. Is it really so?
What is ->F in your MAME code example? Is that the secret ADD/SUB flag?
By the way the more I think about changing from overlays to LMM code for the Z80 ops the more I like it. The thing is it is not possible to use self-modifying code in LMM, well not in the normal way one would with normal PASM. So a working version of your first, longer DAA, would be just great.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I thought I had seen that you are maintaining #%00000010 in the add/sub opcodes, no? This is the N flag! You just have no name for it. Here's MAME's flag list:
#define CF 0x01
#define NF 0x02
#define PF 0x04
#define VF PF
#define XF 0x08
#define HF 0x10
#define YF 0x20
#define ZF 0x40
#define SF 0x80
Well, you can get the non-N-flag case right and thus most programs, since most of the time it's decimal addition that's used. One example is the Z80 hexadecimal conversion code that uses DAA twice. It's using just additions. The DAS case is rarely used, except of course by BCD arithmetic libraries - and obviously by Z80 proofing programs.
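The conversion meant here is presumably the well-known nibble-to-ASCII trick, ADD A,90h / DAA / ADC A,40h / DAA; that identification is my assumption, since the post does not list the code. A Python walk-through of that sequence, with only the addition half of DAA modelled and all helper names invented:

```python
def add8(a, value, carry_in=0):
    """8-bit add returning (result, carry, half_carry)."""
    total = a + value + carry_in
    half = ((a & 0x0F) + (value & 0x0F) + carry_in) > 0x0F
    return total & 0xFF, total > 0xFF, half

def daa_after_add(a, carry, half):
    """Decimal adjust for the addition case only (N flag clear)."""
    adjust = 0
    if half or (a & 0x0F) > 9:
        adjust |= 0x06
    if carry or a > 0x99:
        adjust |= 0x60
    return (a + adjust) & 0xFF, carry or a > 0x99

def nibble_to_hex(n):
    a, c, h = add8(n & 0x0F, 0x90)       # ADD A,90h
    a, c = daa_after_add(a, c, h)        # DAA
    a, c, h = add8(a, 0x40, int(c))      # ADC A,40h
    a, c = daa_after_add(a, c, h)        # DAA
    return chr(a)

print("".join(nibble_to_hex(i) for i in range(16)))   # 0123456789ABCDEF
```

Since neither step subtracts, the sequence never exercises the N-flag path, which fits the point that add-only DAA handling covers most programs.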
That's the flags byte, i.e. the LSB of AF.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Post Edited (pullmoll) : 3/5/2010 10:16:04 AM GMT
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
{CB} long get_mem_xy_disp + (z80_ddcb << 9)
and
z80_DDCB                mov     data_stash, data_8              'Z80 DDCB prefix
                        mov     address_stash, address
                        mov     address, pc
                        call    #read_memory_byte               'Read the instruction byte there
                        add     pc, #1                          'Increment the program counter
                        shr     data_8, #3                      'DDCB table only has 32 entries so discard 3 low bits
                        mov     dispatch_tab_addr, dispatch_tab_ddcb_addr
                        jmp     #dispatch_2
data_stash              long    0
which is the part that could be removed?
@Juergen - I seem to have found a spare Dracblade/zicog board. I might see if I can send it to you.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
So I'm pretty sure that the LONGs "address_stash" and "data_stash" can be removed and hence any instructions that read/write them.
Pruning it back further, the entry:
{CB} long get_mem_xy_disp + (z80_ddcb << 9)
might as well be replaced with:
{CB} long not_implemented
then the routines get_mem_xy_disp and z80_ddcb can probably be removed entirely.
Then you have loads of space :)
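For anyone else puzzling over entries like get_mem_xy_disp + (z80_ddcb << 9): the entry appears to pack two 9-bit cog addresses into one long, and the shr data_8, #3 reduces the opcode to one of 32 table slots. A Python sketch of that packing, assuming a zero field marks the end of the handler list (my assumption, as are the function names and the sample addresses):

```python
def pack_entry(*handlers):
    """Pack 9-bit cog addresses into one long, first handler in the low bits."""
    entry = 0
    for slot, addr in enumerate(handlers):
        assert 0 < addr < 512
        entry |= addr << (9 * slot)
    return entry

def unpack_entry(entry):
    """Peel 9-bit handler addresses off the entry until nothing is left."""
    handlers = []
    while entry:
        handlers.append(entry & 0x1FF)
        entry >>= 9
    return handlers

def ddcb_index(opcode):
    """Discard the 3 low bits so 256 opcodes map onto a 32-entry table."""
    return opcode >> 3

entry = pack_entry(0x123, 0x045)          # made-up addresses for the two routines
print(hex(entry))                         # 0x8b23
print([hex(h) for h in unpack_entry(entry)])
print(ddcb_index(0xFF))                   # 31
```

Replacing such an entry with a single not_implemented address, as suggested above, simply leaves the upper fields zero.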
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔