Yep, somewhere in all this the idea of using the Z80 as a counter to step through addresses and letting the Prop "poke" a program into RAM has been discussed along with a link to a guy's project page where he has done exactly that.
Still a couple of address lines should probably go to the Prop so it can respond to I/O access correctly when the Z80 program is running.
I haven't read through all the posts yet as I'm just back from holiday, so this might have already been discussed, but...
does the Z80 need to connect its address lines to the Prop at all? Surely, if the Prop is controlling the Z80 clock and reset lines, and can stuff data onto the Z80 data bus, then the address lines are superfluous?
On reset we know exactly what addresses the Z80 thinks it's getting data from. As the Prop is feeding data/opcodes to the Z80 it knows what those opcodes are and can implement a virtual Z80 PC register internally. Opcodes will either increment the PC or reload the PC. The Prop just needs the PC control logic from a Z80 emulator. It can then fetch opcodes and data from 'somewhere' and stuff them onto the Z80 data bus.
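To make the "virtual PC" idea concrete, here is a rough Python model (not Prop code) of a tracker that predicts the next fetch address from the opcodes it has fed. The opcode table and the name `run_virtual_pc` are my own assumptions and cover only NOP, LD A,n and JP nn. Conditional jumps depend on the flags, so a complete tracker would probably need much more of an emulator than just the PC logic.

```python
# Hypothetical sketch: predict the Z80 program counter without address lines.
# Only a tiny, assumed subset of opcodes is handled.
OPCODE_LENGTH = {
    0x00: 1,  # NOP
    0x3E: 2,  # LD A,n
    0xC3: 3,  # JP nn
}

def run_virtual_pc(memory, steps, pc=0x0000):
    """Return the PC value predicted at the start of each instruction."""
    trace = []
    for _ in range(steps):
        trace.append(pc)
        opcode = memory[pc]
        length = OPCODE_LENGTH[opcode]  # KeyError for anything outside the subset
        if opcode == 0xC3:
            # JP nn reloads the PC from the two operand bytes (low byte first)
            pc = memory[pc + 1] | (memory[pc + 2] << 8)
        else:
            pc = (pc + length) & 0xFFFF
    return trace

memory = bytearray(0x10000)
memory[0:4] = bytes([0x00, 0xC3, 0x00, 0x01])  # NOP ; JP 0100h
memory[0x0100:0x0102] = bytes([0x3E, 0x2A])    # LD A,2Ah
trace = run_virtual_pc(memory, 4)
```

Here `trace` lists the addresses the Prop would expect the Z80 to fetch from, with no address pins connected.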
We have two threads running on this topic, so I'm getting a little lost as to what was where...
The boot file for the Z80 would be stored on SD card - or in the upper part of the Prop's eeprom.
(Personally, I'd put a machine monitor in eeprom - or a BASIC interpreter - for when the SD card
isn't inserted)
At reset the Z80 starts reading from address 0000h.
So we will start stuffing instructions at 0000h.
The Prop (without connecting all the address lines to the prop!) can NOT control addresses.
The Z80 is doing that.
So... The Z80 is doing an opcode fetch at 0000h.
We stuff an instruction there. Say a JMP (first byte of it anyway).
The Z swallows that (when the clock ticks again)
and advances the address to 0001h for the next byte of the first instruction.
When it tries to fetch the next byte of that instruction, it is again held until the Prop
can stuff another byte on the data bus and tick the clock again.
We are basically just stuffing instructions down the Z's throat.
BUT
Nothing ever gets put into MEMORY that way...
So, to actually write instructions to Z memory - using the Z as the address counter?
I think the idea was something like this:
The Z puts address 0000h on the address bus and the Prop stops time.
While the Z is stopped, the Prop puts an instruction on the data bus (the address lines are stable during all this)
and toggles the /WR signal low/high (to write that byte into memory).
Then the Prop puts a NO-OP on the data bus and allows the Z to execute that (by ticking the clock).
That gets something written into Z-RAM and increments the address bus by 1 (next address).
For CP/M, we want to load the jump table vectors in low memory (first so many addresses) - not instructions.
Then comes some of the BIOS (buffers, etc) and so on up to address 100h.
So we probably want to load the first 256 bytes this way.
THEN - if we want to load anything into higher addresses, we force feed a JMP instruction (3 bytes)
and let the Z execute that to set the address counter to the correct location, and go at it again.
When it's all done, issue a (software driven) /RST (reset) to the Z and get outta the way!
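The poke-then-NOP procedure above can be sketched as a toy Python model. This is only an illustration of the sequencing, not real bus timing; the class `CounterModel` and the function `load` are made-up names.

```python
class CounterModel:
    """Toy model: the Z80 is used purely as an address counter."""

    def __init__(self):
        self.address = 0x0000          # the Z80 fetches from 0000h after reset
        self.ram = bytearray(0x10000)

    def poke(self, value):
        # Prop drives the data bus and pulses /WR while the clock is stopped
        self.ram[self.address] = value & 0xFF

    def feed_nop(self):
        # Prop puts 00h on the bus and clocks; the Z80 moves to the next address
        self.address = (self.address + 1) & 0xFFFF

    def feed_jump(self, target):
        # Prop force-feeds the 3 jump bytes; nothing is written to RAM
        self.address = target & 0xFFFF

def load(model, start, data):
    """Write data into RAM beginning at start, one byte per fed NOP."""
    if model.address != start:
        model.feed_jump(start)
    for value in data:
        model.poke(value)
        model.feed_nop()

model = CounterModel()
load(model, 0x0000, bytes(range(256)))   # low memory: the first 256 bytes
load(model, 0xE000, b"\xC3\x00\xE0")     # something higher up, after a forced jump
```

After both calls the model's address counter sits just past the last byte written, which is where the real Z80 would be waiting for its next opcode.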
Heater, I'm still thinking only one address line is needed at the Prop,
but I'd also want to have a 138 or such decode some of the address bits and send a /CS (/ChipSelect) signal to the Prop.
It wouldn't be absolutely mandatory, but it would allow the Z80 to retain some I/O space for fast user I/O.
I'm still trying to wrap my mind around the signals that the Prop would want to monitor...
/MREQ and /M1 for instruction fetch cycles
/IORQ
A0
/CSprop - if we want to keep any I/O addresses for the Z.
/RD and or /WR ??
/Halt - under some special circumstances?
The Prop would ISSUE:
/MREQ
/WR and or /RD (if it ever wants to read back something from Z memory - might be poor mans DMA like mechanism?)
CLK
/Wait - maybe? - but probably not with the CLK trick.
/Z-RESET
1) Let the Z80 run up through RAM addresses (It thinks it's executing NOPs) and as it goes the Prop "pokes" zeros into RAM until you get to 256 bytes below the top. There the Prop "pokes" a 256 byte boot loader.
2) Then the Z80 is reset and let rip.
3) The Z80 executes all those zeros in RAM as NOPs and then runs into the boot loader code at the top of memory.
4) The boot loader does whatever it needs to do to read CP/M from an HD port (maintained by the Prop) and set up whatever vectors and things CP/M needs.
Steps 2), 3) and 4) are basically how the Altair SIMH emulation works and also ZiCog.
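As a rough illustration of steps 1) to 3), this Python sketch builds a 64K image of zeros with a loader in the top 256 bytes and estimates how long the Z80 spends running through the NOPs. The two-byte "loader" is only a placeholder (DI; HALT), and the function names are assumptions for the example. A NOP takes 4 T-states.

```python
RAM_SIZE = 0x10000
LOADER_SIZE = 256
NOP_T_STATES = 4

def build_image(loader):
    """Return a 64K image: zeros everywhere, loader in the top 256 bytes."""
    if len(loader) > LOADER_SIZE:
        raise ValueError("loader does not fit in the top 256 bytes")
    image = bytearray(RAM_SIZE)
    base = RAM_SIZE - LOADER_SIZE
    image[base:base + len(loader)] = loader
    return image

def first_non_nop(image):
    """Address of the first byte the Z80 will not execute as a NOP."""
    for address, value in enumerate(image):
        if value != 0x00:
            return address
    return None

def nop_run_seconds(image, clock_hz):
    """Time spent executing NOPs from 0000h up to the loader."""
    return first_non_nop(image) * NOP_T_STATES / clock_hz

image = build_image(bytes([0xF3, 0x76]))    # placeholder loader: DI ; HALT
entry = first_non_nop(image)                # 0xFF00
at_4mhz = nop_run_seconds(image, 4_000_000)
at_50khz = nop_run_seconds(image, 50_000)
```

With these numbers the NOP run is about 65 ms at 4 MHz, but a little over 5 seconds if the Z80 is only being clocked at 50 kHz.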
I finally managed to breadboard my ratsnest circuit using a minimal setup of (Prop, SRAM, 5V Z80). With this I am now able to get a Z80 boot program loaded into the SRAM using the prop with only 15 pins used, leaving 17 pins for other purposes. I used Z80 pins in this order D0-D7, A0, /RD, /WR, /IORQ, /WAIT, /RESET, CLOCK.
I did have a few problems on the way but it was mainly related to the 5V CPU clocking in the end. I also needed the logic analyzer and scope at times. Found a few bugs in my code this way too. :frown: Debugging PASM can be tough.
As I have an old 5V NMOS Z80, I found it is rather difficult to drive the clock with the prop output. The 3.3V is not sufficient drive voltage, but apparently only for the clock pin. The Vih threshold for the clock pin input on the Z8400AB1 device seems to be spec'd at a minimum of (Vcc-0.6V), which is 4.4V with a 5V supply. I tried to use a simple transistor inverter circuit hack but had problems on my breadboard with capacitance: my 547 transistors would not switch off cleanly and gave me a highly asymmetric clock output which upset the Z80. So I temporarily resorted to driving it with a 74HC04 inverter output running at 5V until a cleaner circuit can be realized. I don't want the extra 74HC04. Any 3.3V Z80s (if they exist) shouldn't have this issue.
Attached is the Spin program that ultimately worked for me. It simply loads up a Hello World type of program into the Z80 using the push method I discussed earlier in this thread. Yeah, it's still a bit messy and not particularly optimal, but it seems to work for now. I'm currently running the Z80 at 2.5MHz, which I know the NCO counter mode will output cleanly. I may want to experiment more to see if I can get 4MHz working this way; I might need to use another crystal on the prop for accurate 4MHz clocking.
All those jumper wires! It reminds me of the DracBlade I did on the "LunchBox". That had 6" of internal fly-wires and the all of the 6" breadboard jumpers. I was gob-smacked that it ran.
I posted some code on the other Z80 thread that seems to work sending a small bootup program to the Z80.
Your pinout (D0-D7, A0, /RD, /WR, /IORQ, /WAIT, /RESET, CLOCK) is very similar to mine, except I didn't use /WAIT. I have a vague feeling that /WAIT may not be needed if the clock is under propeller control, but I need to think about that a bit more.
Yeah the wiring is a bit of a mess. Thankfully I didn't seem to make any errors with the connections and they all worked. Once you get into it it's actually quite fast to connect up like this. The biggest issue I had is that all the wires make accessing signals for probing a real nightmare. I did add some labels on the chips to help but the wires covered most of it. It would be better to have a larger board with more room and use flat wires, but I didn't have that handy.
Yeah the /WAIT would not be needed when doing the clock from the prop, but it comes in rather handy for slow I/O accesses. I like the /WAIT method at 4MHz but for modern faster Z80s the prop probably wouldn't have fast enough response time to control it if coupled with a waitpeq on the /IORQ + /WR or /RD lines as I had in my code.
Just ran a quick test and dropped the voltage to my 5V Z80 and 5V SRAM down to 3.3V. Still testing but my Hello world program actually seems to be running even on my NMOS CPU! That also let me fix the clock drive problem I had before. But this is still at 2.5MHz, 4MHz may not be possible with 3.3V, not sure yet.
Yes, I'm not sure about /WAIT either. I'm writing all the code first in Spin and will later move over to Pasm. A benchmark test on this site http://bytecruft.blogspot.com.au/2012/08/rudimentary-benchmarks-for-parallax.html suggests that the fastest Spin can clock the Z80 is about 50kHz. That is plenty fast enough for small bootloaders, but initial tests suggest that CP/M would take far too long to load. Pasm is over 100x faster.
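A back-of-envelope estimate in Python of why 50 kHz is too slow for loading CP/M. It assumes the Z80 side uses a receive loop of IN A,(n) / LD (HL),A / INC HL / DJNZ, as in the bootstrap listing further down, and it counts only Z80 T-states. Any extra clocks the Prop inserts while handling each I/O cycle are ignored, so treat the result as a lower bound.

```python
# T-states per byte for: IN A,(n) / LD (HL),A / INC HL / DJNZ (taken)
T_STATES_PER_BYTE = 11 + 7 + 6 + 13   # = 37

def transfer_seconds(byte_count, clock_hz):
    """Lower-bound time for the Z80 to receive byte_count bytes."""
    return byte_count * T_STATES_PER_BYTE / clock_hz

slow = transfer_seconds(8192, 50_000)      # 8K at a Spin-driven 50 kHz clock
fast = transfer_seconds(8192, 2_500_000)   # same 8K at a 2.5 MHz clock
```

That works out to roughly 6 seconds for 8K at 50 kHz against about 0.12 seconds at 2.5 MHz.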
I'm not quite at the stage yet of writing a tight pasm clock, but I suspect it is going to have artificial delays rather than a raw clockhigh,clocklow,waitpeq.
If using /WAIT then is the standard solution a latch on /IORQ? Hmm -but then you need an address decoder as well, otherwise any port IN or OUT instruction will trigger a wait. I'll need to think more about this as there is so much flexibility being able to start and stop the clock.
Great news on the 3V3. It is such a radical idea and it simplifies things so much to have everything running at the same voltage.
Ok, I have some more code working. @rogloh, I got some boards made so it is harder for me to change pins. Your pinout is very similar - how hard would it be for you to change your pinout slightly and then we have the same pinout and can share code?
You are using 7 control pins and I am using 6 control pins. However, the next two pins above my 6 are the stereo sound out, so you could use one of those for /WAIT and we can just have mono sound on the other one?
If your pinout is D0-D7, A0, /RD, /WR, /IORQ, /WAIT, /RESET, CLOCK.
and mine is 6 pins but in a different order, would it work if we put /WAIT as the last one and go
D0-D7, /IORQ, /WR, /RD, Clock, A0, /RESET /WAIT and then P15 can be audio out?
This is a bootstrap program I wrote. I wanted to be able to dump 128 byte blocks of data to the Z80 and read them back. I'll probably use this to then load a more sophisticated program into high memory, and then use that program to load CP/M.
Debugging is to the propeller terminal so there is a 1 second delay and you hit F10 to download, then when it is finished, hit F12.
' Minimalist Z80 Demo
CON
  _clkmode = xtal1 + pll16x
  _xinfreq = 5_000_000
  _baudRateSpeed = 115200               ' Teraterm does not work at this speed. Use hyperterminal or proptool terminal
  _receiverPin = 31
  _transmitterPin = 30
OBJ
  delay : "Timing"                      ' for millisecond delays
  sio   : "pcFullDuplexSerial2FC"       ' serial port
  str   : "ASCII0_STREngine"            ' strings
VAR
PUB Main
  sio.AddPort(0, _receiverPin, _transmitterPin, -1, -1, 0, 0, _baudRateSpeed)
  sio.start
  delay.pause1ms(1000)                  ' time to press F12 to start the terminal
  PrintStringCR(string("start"))
  Z80Boot
  repeat                                ' endless loop
PUB Z80Boot | i
  ' P0-7 is Z80 data bus 0-7
  ' P8 = Z80 /IORQ
  ' P9 = Z80 /WR
  ' P10 = Z80 /RD
  ' P11 = Z80 Clk
  ' P12 = Z80 A0
  ' P13 = Z80 /RESET
  Z80ResetLow                                     ' set reset low
  Z80Clock(5)                                     ' clock several times to register the reset
  Z80ResetHigh                                    ' set reset high
  Z80WaitForRdLow                                 ' clock until rd goes low, at this point address should be 00000000_00000000
  repeat i from 0 to 63                           ' make sure number is bigger than DAT program
    Z80WriteByteToRam(byte[@Z80BootProgram][i])   ' move a byte into ram
    Z80DataBusNOP                                 ' sets data bus to zero, Z80 reads this as a NOP and proceeds to next instruction
    Z80Clock(2)                                   ' two clocks
    Z80WaitForRdLow                               ' wait until reads next value
  Z80ResetLow                                     ' reset the Z80 ready to run the program
  Z80Clock(5)                                     ' wait till responds
  Z80ResetHigh                                    ' reset high
  'Z80Clock(3)                                    ' run the little program, test pin status with different number of clock pulses
  ReceiveBlock($0000)                             ' read back the program
  SendBlock($0100,$55)                            ' send a block of data to the Z80
  SendBlock($0200,$AA)                            ' send a block of data to the Z80
  ReceiveBlock($0100)
  ReceiveBlock($0200)                             ' read the block back
  ' test the speed, send 64 blocks of 128 bytes = 8192 bytes
  crlf
  PrintStringCR(string("Write 8192 bytes"))
  repeat 64
    SendBlock($0100,$44)
    PrintChar(".")
  PrintStringCR(string("Finished"))
  repeat                                          ' do nothing
DAT
Z80BootProgram
        byte $31,$FF,$FF,$DB,$01,$FE,$80,$CA,$17,$00,$FE,$81,$CA,$25,$00,$FE
        byte $82,$CA,$3A,$00,$C3,$03,$00,$CD,$33,$00,$06,$80,$DB,$00,$77,$23
        byte $10,$FA,$C3,$03,$00,$CD,$33,$00,$06,$80,$7E,$D3,$00,$23,$10,$FA
        byte $C3,$03,$00,$DB,$00,$6F,$DB,$00,$67,$C9,$C3,$F0,$FF
        byte $00,$00,$00        ' pad the 61 byte program to 64 bytes so a 64 byte load loop stays inside the table
{
; minimalist bootstrap program - routines to transfer data to and from memory
; port 0 is data and port 1 is commands
; for something like CP/M that sits in low memory, need to get a bootstrap
; program running in high memory so can transfer CP/M as one memory image
; scan port 1 for a command as below:
; number  command
; 128     block move 128 bytes from propeller to Z80 ram
; 129     block move 128 bytes from Z80 to propeller
; 130     jump to location 0FFF0H
            .Z80
            org   0H
start:      ld    sp,0FFFFH     ; set stack pointer to top of ram for calls
main:       in    a,(1)         ; read command byte from port 1
            cp    128           ; test the command byte
            jp    z,proptoz80   ; data from prop to z80
            cp    129
            jp    z,z80toprop   ; data from Z80 to prop
            cp    130
            jp    z,highmem
            jp    main          ; keep testing
proptoz80:  call  getaddress    ; get two bytes from the propeller and put in HL
            ld    b,128         ; set up counter for 128 bytes to transfer
loop1:      in    a,(0)         ; get byte from data port
            ld    (hl),a        ; store to memory
            inc   hl            ; increment memory counter
            djnz  loop1         ; do 128 times
            jp    main
z80toprop:  call  getaddress    ; get two bytes from the propeller and put in HL
            ld    b,128         ; set up counter for 128 bytes to transfer
loop2:      ld    a,(hl)        ; get byte from memory
            out   (0),a         ; send to the propeller
            inc   hl            ; increment memory counter
            djnz  loop2         ; do 128 times
            jp    main          ; return to main loop
getaddress: in    a,(0)         ; collect lsb
            ld    l,a           ; move to register l
            in    a,(0)         ; collect msb
            ld    h,a           ; move to register h
            ret
highmem:    jp    0FFF0H        ; jump to a high memory location
            end
}
PUB Z80Clock(n)
  DIRA |= %00000000_00000000_00001000_00000000    ' enable the clock pin as an output
  repeat n
    OUTA |= %00000000_00000000_00001000_00000000  ' clock high
    'delay.pause1ms(1)                            ' delay if needed
    OUTA &= %11111111_11111111_11110111_11111111  ' clock low
    'delay.pause1ms(1)                            ' delay
PUB Z80ResetLow
  DIRA &= %11111111_11111111_11000000_00000000    ' data bus and /iorq /wr /rd clk A0 and /reset all HiZ (pulled high with 10k)
  DIRA |= %00000000_00000000_00100000_00000000    ' reset pin = output
  OUTA &= %11111111_11111111_11011111_11111111    ' set reset pin low
PUB Z80ResetHigh
  DIRA |= %00000000_00000000_00100000_00000000    ' reset pin = output
  OUTA |= %00000000_00000000_00100000_00000000    ' set reset pin high
PUB Z80WaitForRdLow                               ' keep clocking until the /rd pin goes low
  repeat while (INA & %00000000_00000000_00000100_00000000)
    Z80Clock(1)
PUB Z80WriteByteToRam(n)
  n &= %00000000_00000000_00000000_11111111       ' mask off all but lower 8 bits
  DIRA |= %00000000_00000000_00000110_11111111    ' set P0-P7 as outputs and also /RD and /WR as outputs
  OUTA &= %11111111_11111111_11111111_00000000    ' mask outa lower 8 bits to zero
  OUTA |= n                                       ' combine with the data byte
  OUTA &= %11111111_11111111_11111101_11111111    ' force /WR low
  OUTA |= %00000000_00000000_00000100_00000000    ' force /RD high
  'delay.pause1ms(1)                              ' delay, make as short as possible
  OUTA |= %00000000_00000000_00000010_00000000    ' write high
  DIRA &= %11111111_11111111_11111001_00000000    ' set /rd /wr and the data bits to inputs (highZ)
PUB Z80DataBusNOP                                 ' put zero (NOP) on data bus
  DIRA |= %00000000_00000000_00000000_11111111    ' set data bus to outputs
  OUTA &= %11111111_11111111_11111111_00000000    ' set data bus to zero
PUB ReceiveBlock(address)                         ' fetch 128 bytes from the Z80 at address
  WaitforIORQlow
  ' Z80 is now expecting a command. Prop physical pins 13 up (P8 up) are /iorq, /wr, /rd, clk, A0, reset
  ' status of prop physical pins 13 to 18 (P8-P13) should be LHLLHH ie a read from port 1
  SendByte(129)                                   ' send command 129 - fetch block
  SendAddress(address)                            ' send as a word
  repeat 128
    WaitforIORQlow                                ' this should be a Z80 out command - /wr low
    PrintHexByte(INA)                             ' send this byte to the terminal
    Z80Clock(2)                                   ' so iorq high again
PUB SendBlock(address,data)                       ' replace later with a block from, say SD card file
  WaitforIORQlow
  ' Z80 is now expecting a command. Prop physical pins 13 up (P8 up) are /iorq, /wr, /rd, clk, A0, reset
  ' status of prop physical pins 13 to 18 (P8-P13) should be LHLLHH ie a read from port 1
  SendByte(128)                                   ' send command 128 - send block
  SendAddress(address)                            ' send as a word
  repeat 128
    WaitforIORQlow
    SendByte(data)                                ' same data for the moment
PUB SendAddress(address)
  WaitforIORQlow                                  ' wait until ready for next command
  ' Z80 is now expecting the first byte of the address. Port is zero so A0 is low
  ' status of prop physical pins 13 to 18 (P8-P13) should be LHLLLH
  SendByte(address & %00000000_00000000_00000000_11111111) ' send low address byte
  WaitforIORQlow                                  ' ready for high byte of address
  SendByte(address >> 8)                          ' send high byte
PUB NextInstruction                               ' run until next instruction
  Z80Clock(2)                                     ' clock to clear last instruction
  WaitforRDlow
PUB BusHiZ
  DIRA &= %11111111_11111111_11111111_00000000    ' set data pins to HiZ so doesn't clash with Z80
PUB SendByte(n)                                   ' put a byte on the Z80 bus, clock x2 then set bus HiZ
  n &= %00000000_00000000_00000000_11111111       ' keep only the lower byte
  DIRA |= %00000000_00000000_00000000_11111111    ' set data pins as outputs
  OUTA &= %11111111_11111111_11111111_00000000    ' set data bus to zero
  OUTA |= n                                       ' merge with the byte to send
  Z80Clock(2)                                     ' two clocks to send
  BusHiZ                                          ' hand control of the bus back to the Z80
PUB WaitforIORQlow                                ' clock until /iorq is low
  repeat while (INA & %00000000_00000000_00000001_00000000) ' clock until /iorq goes low
    Z80Clock(1)
PUB WaitforRDlow
  repeat while (INA & %00000000_00000000_00000100_00000000) ' clock until /rd goes low
    Z80Clock(1)
PUB PrintStringCR(pstring)                        ' sends pstring to the serial port followed by CRLF
  PrintString(pstring)                            ' print the string
  crlf
PUB PrintString(pstring)                          ' sends pstring to the serial port with no CRLF
  sio.str(0,pstring)                              ' send to com port
PUB PrintChar(c)                                  ' sends c to the serial port
  sio.tx(0,c)                                     ' send to com port
PUB CRLF
  PrintChar(13)                                   ' carriage return
  PrintChar(10)                                   ' line feed
PUB PrintDecimal(number)                          ' print the decimal value, useful for debugging
  PrintString(str.integerToDecimal(number, 10))
PUB PrintHexByte(number)                          ' print hex byte
  PrintString(str.integerToHexadecimal(number,2))
There is a pile of commented out code and old obsolete code in the attached zip file, including a pasm clock driver I'll be using later. But the above is the essential code needed to bootstrap a Z80 and do some data transfers.
My PASM code should be reasonably fast. When clocking at 2.5MHz I think I can roughly load around 8.4us per word or 232kB/s. So any Z80 bootloaders should load in a split second and the dominant factor will be the prop boot time itself.
In my system I force /WAIT low as soon as I decode the /IORQ, and /WR signals both low using the waitpeq instruction. I expect there is some latency of at least ~5 prop cycles (~62.5ns) before the /WAIT signal is output after the detection. This is fine for 4MHz systems but to speed this up further on say 20MHz Z80 systems you could use a latch but as you say it will need to be fully decoded if you plan to use other I/O devices on the same system. That adds more complexity which isn't nice on a minimal system.
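To put the latency figure in context, here is a small Python calculation comparing 5 Prop clocks at 80 MHz (the 5 MHz crystal with pll16x) against one Z80 T-state at 4 MHz and at 20 MHz. It only compares magnitudes. Whether /WAIT really arrives in time depends on where in the I/O cycle the Z80 samples it, which you would need to check against the datasheet.

```python
PROP_CLOCK_HZ = 80_000_000   # 5 MHz crystal x 16 PLL

def prop_latency_ns(clocks):
    """Duration of a number of Prop system clocks, in nanoseconds."""
    return clocks * 1e9 / PROP_CLOCK_HZ

def z80_t_state_ns(clock_hz):
    """Length of one Z80 T-state, in nanoseconds."""
    return 1e9 / clock_hz

latency = prop_latency_ns(5)              # 62.5 ns
t_4mhz = z80_t_state_ns(4_000_000)        # 250 ns
t_20mhz = z80_t_state_ns(20_000_000)      # 50 ns
fits_4mhz = latency < t_4mhz              # well inside one T-state
fits_20mhz = latency < t_20mhz            # longer than a whole T-state
```

So at 4 MHz the response takes about a quarter of a T-state, while at 20 MHz it is already longer than one T-state, which is where an external latch would start to earn its keep.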
Yes, my pinout is very flexible - I have not made any boards myself and so can readily rearrange things. I only thought to use A0 as P8 because I could get the data and address pin in a single "movs xx, ina" instruction which could come in handy. But having /IORQ as P8 is good too as it will always be 0 during an I/O operation to the prop and this can be useful as well to get at the byte with the top (9th) bit automatically cleared to 0 with the movs. I want to keep the /WAIT in my setup and am happy with mono audio.
Well done! It's nice to see some 3.3V Z80 stuff working fine!
With your efforts, and Shael's too, we have a lot more chances to learn many new and creative ways to use Propeller's power and flexibility.
I'll need to write some pasm to test this idea, but what I am thinking is a tight clock loop, and as soon as /IORQ goes low (no need to test /rd or wr in the tight loop), then jump out of the clock loop. Jump to a slower routine that toggles pins up and down same as the spin code above - ie test /rd or /wr, put data on bus or read from bus, add in two clocks, make the data bus HiZ again, jump back into the clock loop.
That is the general idea. The code is a bit more complex though. For example, the Z80 sends a byte out. The pasm routine captures the byte, puts it in a hub location and sets a flag. What happens to the data after that? Presumably it has to go somewhere, so a master Spin program or something needs to be checking for that flag and then gets the data, maybe sends it out the serial port or whatever. Then resets the flag. Which the pasm routine has been polling for all this time. Only then can the pasm routine restart the clock. (well, it could have gone straight back to clocking, but then bytes might be missed). So ultimately the Z80 is not going to be allowed to continue until any bytes it has output have been sent to the correct place. That might be a master spin routine initially, or for faster transfers, maybe another cog polling the flags.
That should be fine for keyboard and some displays. However, for the SD card, and maybe also for moving a whole sprite somewhere on a graphics display, it makes more sense to have block data transfer. So the flag gets set, n bytes get transferred, and then control is handed back to the Z80.
So the cog that has the clock loop in it also will need some code to do data transfers to hub. I guess one needs to think of a block of hub ram, some bytes set aside for data and some for flags. In a simplistic way, one could set aside 256 bytes for the data for port 0 to 255, and then another 256 bytes for the flags for those ports. So 512 bytes. And then set aside 128 bytes which mirror the FCB in CP/M. And one flag byte for that too. So 256+256+128+1=641 bytes. That simplifies passing data to and from the clock cog. We don't define yet what those ports are going to be. Just set aside a block of memory in hub and pass one pointer to that memory when starting the cog. Now all the other cogs can talk to the Z80 as they need to by interacting with this block of hub ram. Keyboard cog has captured a byte? Ok, Keyboard port is x, so put the byte in location x, and set a flag in location 256 plus x, and the keyboard cog then just carries on.
Maybe it is a bit more complex - what if the Z80 doesn't gobble up the byte by the time the next one arrives? So... each object might need a small circular buffer (most do have this, eg keyboard, serial objects), so the keyboard object can only start emptying its buffer if the port flag is clear.
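A Python sketch of that hub block, just to pin down the offsets and the flag handshake. The layout follows the 256 + 256 + 128 + 1 split described above. The names `post_byte` and `take_byte` are made up for the example, and the "refuse if the flag is still set" rule is one assumed answer to the buffer question.

```python
PORT_COUNT = 256
FCB_SIZE = 128

DATA_BASE = 0                          # one data byte per port
FLAG_BASE = DATA_BASE + PORT_COUNT     # one flag byte per port
FCB_BASE = FLAG_BASE + PORT_COUNT      # 128 byte mirror of the CP/M FCB
FCB_FLAG = FCB_BASE + FCB_SIZE         # single flag for the FCB block
MAILBOX_SIZE = FCB_FLAG + 1            # 641 bytes in total

def post_byte(mailbox, port, value):
    """Producer side (e.g. a keyboard cog). Returns False if the
    previous byte has not been consumed yet, so the caller keeps it
    in its own circular buffer and tries again later."""
    if mailbox[FLAG_BASE + port]:
        return False
    mailbox[DATA_BASE + port] = value & 0xFF
    mailbox[FLAG_BASE + port] = 1
    return True

def take_byte(mailbox, port):
    """Consumer side (the clock cog serving a Z80 IN). Returns None
    when nothing is waiting on that port."""
    if not mailbox[FLAG_BASE + port]:
        return None
    value = mailbox[DATA_BASE + port]
    mailbox[FLAG_BASE + port] = 0
    return value

mailbox = bytearray(MAILBOX_SIZE)
first = post_byte(mailbox, 0x10, 0x41)    # accepted
second = post_byte(mailbox, 0x10, 0x42)   # refused, flag still set
got = take_byte(mailbox, 0x10)            # 0x41, flag cleared
empty = take_byte(mailbox, 0x10)          # None
```

On the real Prop each of these would be a couple of hub byte reads and writes, with only one pointer to the start of the block passed to the cog.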
This starts to define what should be in the clock cog code.
Sorry about going off on a tangent here....
So in pseudo code in pasm
start:
clock high
delay
clock low
delay
is /iorq low
if not, jump to start
' iorq routines
/wr low and A0 high - Z80 is writing the port number, so pasm saves this in variable 'portnumber' and jumps to start
/rd low and A0 low - Z80 wants data from port in 'portnumber', fetch data from hub, put on P0-P7, jump to start
/wr low and A0 low - Z80 is writing to port in 'portnumber' read data from P0-7, put data in hub, jump to start
special case - if portnumber = x, this is a FCB write, so next 128 iorqs transfer data
special case - if portnumber = y, this is a FCB read, so next 128 iorqs transfer data
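The iorq routines above can be modelled in a few lines of Python to check the logic before writing any pasm. This is only a behavioural sketch: `ClockCogModel` and `io_cycle` are invented names, the hub is a plain 256 byte array, and the FCB special cases are left out.

```python
class ClockCogModel:
    """Behavioural model of the I/O dispatch: A0 high selects the
    port-number register, A0 low selects the data for that port."""

    def __init__(self, hub_ports):
        self.hub = hub_ports        # 256 bytes, one per port
        self.port_number = 0

    def io_cycle(self, wr_low, rd_low, a0_high, bus=0):
        """Handle one /IORQ cycle. Returns the byte to drive onto the
        data bus for a read, otherwise None."""
        if wr_low and a0_high:
            self.port_number = bus & 0xFF       # Z80 writes the port number
            return None
        if rd_low and not a0_high:
            return self.hub[self.port_number]   # Z80 reads data from that port
        if wr_low and not a0_high:
            self.hub[self.port_number] = bus & 0xFF  # Z80 writes data to that port
            return None
        return None

hub = bytearray(256)
cog = ClockCogModel(hub)
cog.io_cycle(wr_low=True, rd_low=False, a0_high=True, bus=5)       # select port 5
cog.io_cycle(wr_low=True, rd_low=False, a0_high=False, bus=0x41)   # write 41h to it
value = cog.io_cycle(wr_low=False, rd_low=True, a0_high=False)     # read it back
```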
If half the ports are designated for data and half are for flags say even numbers are data, odd numbers are flags, there are only 128 ports effectively. So port 0 is data, port 1 is the flag for port 0 etc.
OR
each transfer could be 3 bytes. First the Z80 puts A0 high and sends the port number. Then it sends the data for that port. And then sends the flag for that port.
Hmm - is it better to have 128 ports and 2 bytes per transfer, or 256 ports and 3 bytes per transfer?
I know what you mean here. Larger block type transfers may work better if done in a complete batch especially if running the Z80 at 20MHz. For the slower 4MHz systems I expect it really won't be much of a penalty to wait for the internal prop hub access on the I/O reads/writes.
Depending on how you wish to arrange things for accessing SD cards it could be possible to do the SPI pin control in the same COG as the main Z80 I/O decoder COG, and set aside a 512 byte (or 128 byte) COG RAM transfer buffer for this purpose. This approach may require more of a burden for the Z80 CPU in low level sector access and initializing the card etc - more CP/M BDOS/BIOS work. If however you want to leverage the existing SD drivers with FAT filesystem support etc you probably need another COG doing the work and will need to come up with inter-COG comms for this purpose.
I expect keyboards, UARTs, audio control and other low bit rate devices can probably just wait without too much concern on performance. Also depending on your model for video memory reads and writes, you could choose either block transfers (bitmaps/sprites etc) or individual I/O byte accesses (color consoles).
Lots of ways to skin the same cat.
One idea I have is to leverage the higher speed OTIR Z80 instruction and indirect register I/O addressing in the prop. What you can have is a special internal register in the Z80 COG (initially indexed by a write to the address register) that autoincrements and fills an internal COG RAM buffer on each I/O write to the data register from the Z80. This does not write to hub RAM each time and so fills quickly with minimal delay. Then you can write to a different command register (by changing the address register first) to initiate the transfer to hub RAM and another worker COG. Reads could work the same way. That is the benefit of using an indirect address register + data register scheme in the prop with two I/O registers on the Z80 side, as you can do useful things like this. You could have one autoincrementing register for video reads, one for video writes, one for SD writes etc., and each could maintain its own current address so you can even interleave them if required.
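Roughly what I mean, as a Python model rather than pasm. `AutoIncrementPort` is a made-up name, the buffer size is arbitrary, and `flush` stands in for whatever command register write would kick off the copy to hub RAM.

```python
class AutoIncrementPort:
    """Model of an address register + data register pair where every
    data write lands in a local buffer and bumps the index."""

    def __init__(self, size=128):
        self.buffer = bytearray(size)   # stands in for COG RAM
        self.index = 0

    def write_address(self, value):
        # Z80 OUT to the address register: choose where the next byte goes
        self.index = value % len(self.buffer)

    def write_data(self, value):
        # Z80 OUT (or each step of OTIR) to the data register
        self.buffer[self.index] = value & 0xFF
        self.index = (self.index + 1) % len(self.buffer)

    def flush(self):
        # Command register write: hand the whole buffer to a worker
        return bytes(self.buffer)

port = AutoIncrementPort(size=8)
port.write_address(2)
for value in b"\x10\x20\x30\x40":       # what a 4 byte OTIR would send
    port.write_data(value)
snapshot = port.flush()
```

Several of these objects, one per device, would give the independent current addresses mentioned above.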
Since your present setup allows for a lot of experimental work and you previously cited your intent of decoding the I/O address space in some different manner than the previously suggested ones, I would like to suggest a perhaps new way to do this.
If your setup has an unused inverter gate and some available area to allow mounting an 'HC573 style transparent latch, I believe I've found at least one way to retrieve the full eight lower address lines, and enjoy full 256 port I/O address space discrimination, without having to commit eight more Propeller pins.
My suggestion is to use an inverted /IORQ signal to drive the latch EN input, then, the actual A1 connection between the Z80 and the Propeller can be dismissed from address decoding tasks.
Then, the spared Propeller pin can be used to drive the latch /OE input.
Latch inputs routed to Z80's A7-A0, outputs to D7-D0.
Sure, it looks trivial, though the controlling signals and operational behavior may differ a bit from the ones we've seen before.
The intended circuit operation, in my view, will be as follows:
- Before the Prop detects the beginning of an I/O cycle, the output pin that controls the latch /OE will be presenting a HIGH state. So the latch outputs are tri-stated and don't drive D7-D0.
- As soon as the Z80 asserts /IORQ, the latch will hold A7-A0, pending /OE to enable its outputs. At the same time, the Propeller will detect the beginning of an I/O cycle and start reacting to it.
- Since the Z80's /WR and /RD will be valid at the same time, cycle type discrimination is a breeze. If neither is active during a valid /IORQ, this can perhaps be interpreted as an interrupt acknowledge cycle in progress.
- If /RD is low, then the Z80's D7-D0 will be inputs, so it's safe to enable the address latch outputs and let the Propeller read the port address. The latch outputs are then disabled, allowing the Propeller to drive the data lines itself.
I/O space discrimination and data retrieval during read cycles can proceed.
- If /WR is low, then the Z80 will be driving the data bus, so we must first capture its contents by reading the Propeller's input pins. After the input data (from the Propeller's viewpoint) is saved internally, the Z80's clock is advanced, cycle by cycle, until its /WR output is negated.
Irrespective of the kind of I/O write instruction the Z80 is executing, including the block ones, the next cycle will be one of two possibilities:
1- An instruction fetch cycle, or
2- An interrupt acknowledge cycle, if interrupts are being used and enabled.
Then the I/O write operation can be internally completed by the Propeller program, during the clock cycle that follows the I/O write one.
The data bus will again be tri-stated, so it's time for the Propeller to control the latch /OE and retrieve the already stored write port address; then the latch /OE is returned to its inactive state, releasing the D7-D0 lines.
From that point, routine procedures can deal with every intended operation.
The I/O write could now be called an I/O "deferred write", but it will not harm anything, from either the circuit or the logic behavior perspective.
It is a very interesting idea to be able to access more of the Z80 I/O space from the propeller and not adding any extra prop pins to do so. It does however go against my own personal desire for minimalism, as it adds two more devices to the mix. If I had more breadboard jumpers I could just about try your scheme out but I have actually used them all up right now in getting what I wanted tested and working. I need to order some more first.
From what you are saying the delayed (pipelined) writes should work in principle, the only issue I can see might be if you had an I/O read immediately following a write and the write took longer to do its job, it may have to block the next read for a bit. But I'd expect at 4MHz this wouldn't be an issue, maybe it would for 20MHz systems. If you are fully in control of the clock, extra delays can be inserted as required.
Do you know if these newer 20MHz Z80's have the same bus timing as the original 2-4MHz ones (albeit scaled up), or do they do more work per bus clock cycle? In other words, are there the exact same number of T states and machine cycles etc per instruction and it is just clocked faster, or is there some other architectural improvement to achieve faster speeds?
I'm going to add another one to the mix. Idea up to this point was to get a simple bootloader into the start of memory, use that to put another bootloader into high memory, and then that bootloader loads in CP/M.
I think it is possible to do away with the middle bootloader. Use the bootloader we have but replace all the absolute jumps with relative jumps. When the program runs, the first thing it does is copy itself to high memory. The Z80 has some block move instructions - it might be possible to do this with just one instruction.
Re the 20MHz Z80 chips, I'm not sure. I am using a 4MHz chip but it is only running at 50kHz. Will be fun to ramp up the speed and then drop in 6 and 8MHz chips and see how they respond.
1) Whilst stuffing code into Z80 RAM space just cycle through all 64K, placing zeros everywhere until the top 256 bytes where the bootloader goes.
2) Reset the Z80 which will cause it to execute NOPs (those zeros) from address zero and eventually run into the bootloader at the top where things start to boot the normal Z80 way.
This is basically what ZiCog does when running CP/M. No messing with Z80 reset vectors at address zero etc. It's simple and it does not add any noticeable time to the boot up of the system.
So forget your simple boot loader at the start of memory and just use a simple CP/M bootloader at the top of memory.
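A minimal Python sketch of that memory layout, plus the time the Z80 spends running through the NOPs. The function names are invented for this example, the 0x76 (HALT) byte is only a stand-in for a real bootloader, and the timing assumes the standard 4 T-states per NOP.

```python
RAM_SIZE = 0x10000      # 64K Z80 address space
BOOT_SIZE = 0x100       # top 256 bytes hold the bootloader
NOP_T_STATES = 4        # a Z80 NOP (opcode 00h) takes 4 T-states


def build_boot_image(bootloader: bytes) -> bytes:
    """Return a 64K image: zeros (NOPs) everywhere, bootloader at 0xFF00."""
    if len(bootloader) > BOOT_SIZE:
        raise ValueError("bootloader does not fit in the top 256 bytes")
    image = bytearray(RAM_SIZE)              # all zeros, i.e. all NOPs
    start = RAM_SIZE - BOOT_SIZE             # 0xFF00
    image[start:start + len(bootloader)] = bootloader
    return bytes(image)


def nop_sled_seconds(clock_hz: float) -> float:
    """Time to execute the NOPs from 0000h up to the bootloader."""
    return (RAM_SIZE - BOOT_SIZE) * NOP_T_STATES / clock_hz


if __name__ == "__main__":
    image = build_boot_image(bytes([0x76]))  # 0x76 = HALT, placeholder only
    print(hex(image.index(0x76)))
    print(f"{nop_sled_seconds(4e6) * 1e3:.1f} ms at 4 MHz")
```

At 4 MHz the sled costs roughly 65 ms, which fits the remark that it adds no noticeable time to boot.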
Now what is going to be done about getting disk blocks from Prop to Z80 RAM and back again in a timely manner?
Streaming blocks serially through an I/O port seems a bit slow.
If you go with my idea above and do block transfers for disk I/O the Z80 INIR and OTIR instructions run around 21 cycles per byte read/written, and can transfer up to 256 bytes at a time. For 20MHz devices this is just under 1 MB/s transfer speed. Yes this doesn't include the extra Hub->Cog RAM transfer time, but it is still not too shabby and not far away from raw SPI SD performance on the prop for example. 4MHz will be proportionally slower but still reasonably fast IMHO.
The block I/O instructions are great at the Z80 end but the transfer speed limit here will surely be the Propeller responding to the I/O read or write request on the bus, putting or fetching the data from wherever the sector buffer is in HUB, reading or presenting data on the data bus, and signalling that the Z80 can continue.
My gut instinct tells me that using Z80 block I/O instructions will not actually be noticeably quicker than a tight Z80 loop doing normal I/O instructions.
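For reference, the Z80-side numbers alone can be put side by side. This is only a sketch: it ignores the Propeller's response time entirely (which is the bottleneck being argued about here), and the "manual loop" is a hypothetical LD A,(HL) / OUT (n),A / INC HL / DJNZ sequence chosen for illustration, using the standard T-state counts for those instructions.

```python
OTIR_T_STATES = 21                     # per byte while OTIR/INIR is repeating
# Hypothetical manual loop: LD A,(HL)=7, OUT (n),A=11, INC HL=6, DJNZ taken=13
MANUAL_LOOP_T_STATES = 7 + 11 + 6 + 13


def bytes_per_second(clock_hz: float, t_states_per_byte: int) -> float:
    """Peak Z80-side transfer rate with no wait states and no Prop delays."""
    return clock_hz / t_states_per_byte


if __name__ == "__main__":
    for mhz in (4, 20):
        block = bytes_per_second(mhz * 1e6, OTIR_T_STATES)
        manual = bytes_per_second(mhz * 1e6, MANUAL_LOOP_T_STATES)
        print(f"{mhz} MHz: block {block / 1e3:.0f} kB/s, loop {manual / 1e3:.0f} kB/s")
```

Whether that difference survives once the Propeller has to service every byte is exactly the open question.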
Agree in general, but if you combine the block I/O opcodes with an autoincrementing indirect register in the prop that reads/writes an internal COG buffer, and then trigger the entire block read/write from/to some external COG for the actual SD transfer, I am thinking you can get reasonable performance. I know it won't be competing with a DMA bus master or anything like that but it will suffice.
Any single byte I/O transfers directly to/from the hub RAM will be quite a bit slower and would certainly not be the best approach for fast Z80s and disk transfers. If CP/M uses 128 byte records then a small sector buffer should fit into the COG RAM. That alone will help greatly. The COG can do raw data transfers to/from the hub approaching up to 20MB/s if using 32 bit longs - so that won't be a huge bottleneck compared to the SPI and Z80 transfers which will be around an order of magnitude slower.
I don't think a buffer in COG is going to help you.
When accessing your buffer in COG you have to work in 32 bit longs, you can't address bytes.
When getting or putting I/O to the Z80 you have to use bytes.
That results in a load of shifting and logic to extract/insert bytes to the longs in your buffer.
I think it has been argued that having such a buffer in HUB can be faster as accessing it with RD/WRBYTE does all that byte mangling stuff for you for free. That was my experience with playing around in my Z80 emulator code on the Prop.
That raw bandwidth from HUB to COG does not help you as you have to:
a) Hoover data from HUB to COG buffer, or vice versa.
b) Fiddle around inserting/removing the bytes.
That's two passes over the data instead of one and a bunch of logic for each byte you don't need.
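To make the "byte mangling" concrete, here is a small Python model of a buffer held as 32-bit longs, which is all COG RAM offers. The helper names are invented for this sketch and the byte order is assumed little-endian; in PASM each of these steps is a separate shift/mask instruction, whereas RDBYTE/WRBYTE on HUB do the equivalent in one go.

```python
def get_byte(longs: list, index: int) -> int:
    """Extract byte `index` from a buffer stored as 32-bit longs."""
    word = longs[index >> 2]          # which long holds the byte
    shift = (index & 3) * 8           # position of the byte inside it
    return (word >> shift) & 0xFF


def put_byte(longs: list, index: int, value: int) -> None:
    """Insert a byte into a buffer stored as 32-bit longs."""
    shift = (index & 3) * 8
    mask = 0xFF << shift
    cleared = longs[index >> 2] & ~mask & 0xFFFFFFFF   # clear the old byte
    longs[index >> 2] = cleared | ((value & 0xFF) << shift)


if __name__ == "__main__":
    buf = [0] * 32                    # 32 longs = 128 bytes
    put_byte(buf, 5, 0xAB)
    print(hex(buf[1]), hex(get_byte(buf, 5)))
```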
@Heater,
Good point, I see what you mean now. We can't achieve the full Hub bandwidth without a 32 bit long buffer in COG RAM, and yet shuffling Z80 byte data in and out of this long buffer will slow us down too.
What if the Z80 I/O COG does the SPI accesses directly??? That was another idea I was thinking about but certainly it will involve more Z80 CPU workload.
Edit:
Some estimated performance numbers for a 4MHz Z80, and a 128 byte temporary buffer in the COG RAM (stored as 1 byte per long) are added below.
A burst of 128 bytes transferred using OTIR @ 21 clocks/byte and say (conservatively for example) two extra wait states during I/O writes to COG RAM takes 128 x 23 / 4 = 736us. That is inherently 174kB/s before we account for COG RAM transfer delays and SD writes.
The transfer from COG RAM to hub RAM using bytes can be done at 5MB/s. This adds 128/5 = 25.6us to the number above. So approximately we are talking about 761us before we begin the SD write process. That will take its own time to transfer via SPI. Let's assume we can get ~10Mbps or around 100us for the 128 byte sector. But we probably have to write 4x this number of bytes if using 512 byte sectors and that makes it 400us. The total is then 736 + 25.6 + 400 = 1161 microseconds to write 128 bytes, or 110kB/s. It's still not a bad number for loading programs when you are talking about only a 64kB address space in the Z80. Large file copies will take a hit however.
If you pipeline it or can use 2 buffers you might be able to have an SD write in progress while the intermediate COG RAM is being filled with the next sector data. This might bring the rate back up again towards 168kB/s in the best case. A 20MHz Z80 will change these numbers, but at least at 4MHz you can see the dominant amount of time is spent by the Z80 writing to the I/O COG registers, and that can't really be avoided. Also if you don't choose to use the OTIR and instead do individual port writes, read data from HL, increment HL, decrement counter, loop etc, it won't be 21 clocks/byte, so this number will be quite a bit worse. We really want to use the OTIR, INIR opcodes for best performance.
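The arithmetic behind these estimates can be reproduced with a few lines of Python. All inputs are the assumed figures from the post (21 T-states per byte, 2 wait states, 5MB/s COG-to-hub, 400us for the SD write); the function names are made up for this sketch, and 1 kB is taken as 1000 bytes.

```python
def otir_burst_us(n_bytes: int, z80_mhz: float,
                  t_states: int = 21, wait_states: int = 2) -> float:
    """Time in microseconds for the Z80 to push n_bytes with OTIR."""
    return n_bytes * (t_states + wait_states) / z80_mhz


def kb_per_s(n_bytes: int, elapsed_us: float) -> float:
    """Bytes per microsecond is MB/s; scale to kB/s."""
    return n_bytes / elapsed_us * 1000.0


def estimate(n_bytes: int = 128, z80_mhz: float = 4.0,
             hub_mb_per_s: float = 5.0, sd_us: float = 400.0) -> dict:
    burst = otir_burst_us(n_bytes, z80_mhz)        # 736 us at 4 MHz
    hub = n_bytes / hub_mb_per_s                   # 25.6 us
    return {
        "burst_only": kb_per_s(n_bytes, burst),
        "serial": kb_per_s(n_bytes, burst + hub + sd_us),
        "pipelined": kb_per_s(n_bytes, burst + hub),  # SD write overlapped
    }


if __name__ == "__main__":
    for name, rate in estimate().items():
        print(f"{name}: {rate:.0f} kB/s")
```

Changing `z80_mhz` shows how quickly the SD write becomes the dominant term on a faster Z80.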
If reaching some place at a higher memory address is your sole intent, you can safely feed the Z80 a direct jump to the target address and get there without even needing to write to the RAM at 0000h, which avoids disturbing any previously loaded content.
So, when jumping to, for example, 0FF00h, one needs to stuff only a three-byte sequence for the Z80 to execute directly as its first instruction, just after all the /RESET-related procedures complete:
C3h, 00h, FFh
The first byte, valued C3h, will be sent from the Propeller to be executed by the Z80, beginning at the very first fetch cycle.
Then, under Propeller control, using Z80's /RD signal active/inactive states as a handshaking control, the following two address components can also be 'stuffed' at the right time.
Since, just after reset, no spurious interrupt can be accepted or processed (maskable interrupts are disabled by reset), and in CP/M's context there is no easy way to allow for non-maskable interrupts either, the code's execution sequence will be absolutely deterministic.
In fact, since the whole 'cold boot' will be a completely deterministic task, all you have to do is carefully 'tell' (or maybe 'insidiously suggest', 'whisper' without leaving a single clue or footprint) the Z80 any sequence of events you want it to execute, writing only to specific addresses the real code and data you intend to reside anywhere in the RAM.
Finally, as deterministic procedures are among the things a Propeller does best, and since the whole boot sequence, as seen from the Z80's perspective, will almost always be deterministic, I believe there is a perfect match between the two.
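The three-byte sequence is just the Z80 JP nn instruction: opcode C3h followed by the target address, low byte first. A minimal Python helper (the name `encode_jp` is made up for this sketch) that produces the bytes the Propeller would stuff onto the bus:

```python
def encode_jp(target: int) -> bytes:
    """Encode Z80 'JP nn': opcode C3h, then address low byte, then high byte."""
    if not 0 <= target <= 0xFFFF:
        raise ValueError("target must be a 16-bit address")
    return bytes([0xC3, target & 0xFF, target >> 8])


if __name__ == "__main__":
    print(" ".join(f"{b:02X}h" for b in encode_jp(0xFF00)))
```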
The problem with putting SPI access code in the Z80 is that that in itself will be a lot slower than doing it in PASM from a COG as usual.
Also that kind of code will eat Z80 RAM space which is better left free for CP/M apps and such.
Do you know if these newer 20MHz Z80's have the same bus timing as the original 2-4MHz ones (albeit scaled up), or do they do more work per bus clock cycle? In other words, are there the exact same number of T states and machine cycles etc per instruction and it is just clocked faster, or is there some other architectural improvement to achieve faster speeds?
Roger.
@rogloh
Sorry for the late reply, my fault!
Among many others, for reference purposes, I'm using Zilog's #PS017801-0602 product specification, which deals with all but one of Z80's versions.
Although this paper doesn't cover the oldest, long-retired 2.5 MHz NMOS part, all the other ones are within its scope, from 4 to 8 MHz NMOS parts and from 4 to 20 MHz CMOS ones. It even lists the obsolete 4 MHz CMOS version.
To spare you reading a bunch of technology-related details, I will focus on the prominent ones, whose significance may or may not matter to anything but the most stringently specified designs (the ones dealing with wave shapes, clock and bus capacitances, signal ringing, etc...).
The main differences you can note are the existence of a power-down mode, only available in CMOS versions, and package availability.
Back on track to our present focus: by carefully reading that paper, you'll notice the subtle variations among them all, notably the clear (and mandatory) connection between clock speed limits and signal setup and hold times.
Every other aspect, notably the ones that deal with instruction cycle timing, is held unchanged through the entire family.
The first byte, valued C3h, will be sent from the Propeller to be executed by the Z80, beginning at the very first fetch cycle.
Yes that should work. I'll try it when I get home from work.
I must say that the Propeller makes a perfect debugging tool for things like this. I don't know exactly how many clock cycles the above is going to take, but I know that it is very easy to write Spin code and download and test things quickly and when the /rd line goes low and all the higher address lines go high, it has made the jump.
Among many others, for reference purposes, I'm using Zilog's #PS017801-0602 product specification, which deals with all but one of Z80's versions.
@Yanomani - I found that data sheet, thanks for the link. It contains useful timing information.
I am looking at the response time for the prop to detect the Z80's I/O accesses and respond, either by issuing /WAIT to buy more time or by stopping the Z80 clock. I want to see how this compares with Z80 clock speed. The data sheet only shows the information for 5V +/- 10% voltage and the 0-70C temperature range, not 3.3V, so take these numbers with a grain of salt. But they can still be used to get a very rough indication of where we stand and what might be possible.
For the wait method: You need to drop /WAIT within a maximum amount of time after /RD (or /WR) and /IORQ both go low.
There is a small delay before /RD (or /WR) goes low after a clock edge. You need to have /WAIT issued before 1.5 x the Z80 clock period minus a setup time following this same clock edge. This table computes the maximum response time you have after detecting the /IORQ and /RD edge in the prop and the number of propeller clocks this corresponds to at 80MHz before the /WAIT must go low. Times are in nanoseconds.
There is no way 20MHz is possible without external hardware doing the /WAIT.
For 10MHz you might be able to generate it if the combination of the "waitpeq" and "and outa, mask" operations can complete within 6 clocks after the edge. It might just be possible - it will be very close but worth a go. 6 MHz down should be very doable.
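The budget calculation for the /WAIT method can be sketched as below. This is only the formula from the description (1.5 Z80 clock periods, minus the /WAIT setup time, minus the delay before the strobe goes low), converted to 80MHz Propeller clocks. The setup and delay values in the example call are placeholders chosen to make the arithmetic visible, NOT datasheet figures; substitute the real numbers for your part.

```python
PROP_CLOCK_NS = 12.5   # one Propeller system clock at 80 MHz


def wait_budget_ns(z80_mhz: float, setup_ns: float, strobe_delay_ns: float) -> float:
    """Time left to assert /WAIT after the prop sees /IORQ and /RD (or /WR) low."""
    period_ns = 1000.0 / z80_mhz
    return 1.5 * period_ns - setup_ns - strobe_delay_ns


def budget_in_prop_clocks(budget_ns: float) -> int:
    """Whole Propeller clocks that fit in the budget."""
    return int(budget_ns // PROP_CLOCK_NS)


if __name__ == "__main__":
    # Placeholder timings (hypothetical): 50 ns setup, 75 ns strobe delay.
    budget = wait_budget_ns(4.0, setup_ns=50.0, strobe_delay_ns=75.0)
    print(budget, "ns =", budget_in_prop_clocks(budget), "prop clocks")
```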
For the clock stopping method: On I/O reads (which is the tighter case compared to I/O writes) you need to stop the clock before the Z80 latches its read data from the data bus.
Without waits, you have to stop the clock within 2.5 x the Z80 clock period minus the /RD low delay. The Data setup time is also shown but it should not factor into the computation if you stop the clock in time, however once the prop presents the I/O read data on the bus you will need to wait at least this amount of time before you can start the clock again. Times are in nanoseconds.
If you are controlling the clock in software as fast as you can, and polling /IORQ to be low with something like this code below you can only clock the Z80 as fast as 20M/6 = 3.3MHz if the prop runs at 80MHz and you want a symmetric clock output.
loop    xor     outa, clockmask       ' toggle the Z80 clock pin
        test    iorq_mask, ina wz     ' Z is set when /IORQ reads low
  if_nz jmp     #loop                 ' /IORQ still high: keep clocking
        ' .. pause toggling the clock pin and do the IORQ here
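The 3.3MHz figure follows directly from the loop length. A quick check in Python, assuming each of the three PASM instructions takes 4 system clocks and that the clock pin toggles once per pass through the loop (the function name is made up for this sketch):

```python
def software_clock_hz(sysclk_hz: float = 80_000_000,
                      clocks_per_instr: int = 4,
                      instrs_per_toggle: int = 3) -> float:
    """Z80 clock frequency produced by a toggle-and-poll loop."""
    instr_rate = sysclk_hz / clocks_per_instr      # 20 MIPS at 80 MHz
    # One loop pass per toggle, two toggles per full clock period.
    return instr_rate / (2 * instrs_per_toggle)


if __name__ == "__main__":
    print(f"{software_clock_hz() / 1e6:.2f} MHz")
```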
If instead you are controlling the clock using a counter you will need to do a WAITPEQ on the /IORQ, then stop the counter clocking as soon as possible. I expect this takes at least 5 propeller clocks, (at best) 1 for the waitpeq detection to complete if it was already running, and 4 for the write to CTRA register to stop it, assuming no other internal delays inside the prop itself (which there might be - I don't know). I'm not sure that 20MHz is achievable - it's really tight. Probably 10-12MHz is more reasonable to expect?
Roger.
ps. you can always buy a 20MHz rated Z80 CPU and clock it slower. The numbers above indicate using each particular MHz rated Z80 but there is no reason you can't underclock a faster one, that will help too.
For 10MHz you might be able to generate if the combination of the "waitpeq" and "and outa, mask" operations can complete within 6 clocks after the edge.
FWIW, after a pin change waitpxx has an exit path of 3 cycles. Which suggests that you won't meet your 6 cycle requirement here.
Yep, somewhere in all this the idea of using the Z80 as a counter to step through addresses and letting the Prop "poke" a program into RAM has been discussed along with a link to a guy's project page where he has done exactly that.
Still a couple of address lines should probably go to the Prop so it can respond to I/O access correctly when the Z80 program is running.
We have two threads running on this topic, so I'm getting a little lost as to what was where...
The boot file for the Z80 would be stored on SD card - or in the upper part of the Prop's eeprom.
(Personally, I'd put a machine monitor in eeprom - or a BASIC interpreter - for when the SD card
isn't inserted)
At reset the Z80 starts reading from address 0000h.
So we will start stuffing instructions at 0000h.
The Prop (without connecting all the address lines to the prop!) can NOT control addresses.
The Z80 is doing that.
So... The Z80 is doing an opcode fetch at 0000h.
We stuff an instruction there. Say a JMP (first byte of it any way).
The Z swallows that (when the clock ticks again)
and advances the address to 0001h for the next byte of the first instruction.
When it tries to fetch the next byte of that instruction, it is again held until the Prop
can stuff another byte on the data bus and tick the clock again.
We are basically just stuffing instructions down the Z's throat.
BUT
Nothing ever gets put into MEMORY that way...
So, to actually write instructions to Z memory - using the Z as the address counter?
I think the idea was something like this:
The Z puts address 0000h on the address bus and the Prop stops time.
While the Z is stopped, the Prop puts an instruction on the data bus (the address lines are stable during all this)
and toggles the /WR signal low/high (to write that byte into memory).
Then the Prop puts a NO-OP on the data bus and allows the Z to execute that (by ticking the clock).
That gets something written into Z-RAM and increments the address bus by 1 (next address).
For CP/M, we want to load the jump table vectors in low memory (first so many addresses) - not instructions.
Then comes some of the BIOS (buffers, etc) and so on up to address 100h.
So we probably want to load the first 256 bytes this way.
THEN - if we want to load anything into higher addresses, we force feed a JMP instruction (3 bytes)
and let the Z execute that to set the address counter to the correct location, and go at it again.
When it's all done, issue a (software driven) /RESET to the Z and get outta the way!
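The loading scheme can be modelled in a few lines of Python. This is a toy model only: the "Z80" here is nothing more than a 16-bit address counter, and the class and function names are invented for the sketch. It writes the first 256 bytes, force-feeds a JMP, and pokes one more byte at the new address.

```python
class AddressCounterZ80:
    """Toy stand-in for the Z80: it only supplies the current address."""

    def __init__(self) -> None:
        self.pc = 0                      # reset: fetching starts at 0000h

    def feed_nop(self) -> None:
        self.pc = (self.pc + 1) & 0xFFFF # executing a NOP advances by one

    def feed_jp(self, target: int) -> None:
        self.pc = target & 0xFFFF        # a force-fed 3-byte JMP reloads the address


def poke_block(z80: AddressCounterZ80, ram: bytearray, data: bytes) -> None:
    """For each byte: write it at the held address, then feed a NOP."""
    for byte in data:
        ram[z80.pc] = byte               # Prop pulses /WR while the clock is stopped
        z80.feed_nop()                   # then ticks the clock with a NOP on the bus


if __name__ == "__main__":
    ram = bytearray(0x10000)
    z80 = AddressCounterZ80()
    poke_block(z80, ram, bytes(range(256)))   # low 256 bytes (vectors etc.)
    z80.feed_jp(0xFF00)                       # move the "counter" to high memory
    poke_block(z80, ram, b"\x76")             # placeholder byte (HALT)
    print(hex(z80.pc), ram[0xFF], hex(ram[0xFF00]))
```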
Heater, I'm still thinking only one address line is needed at the Prop,
but I'd also want to have a 138 or such decode some of the address bits and send a /CS (/ChipSelect) signal to the Prop.
It wouldn't be absolutely mandatory, but it would allow the Z80 to retain some I/O space for fast user I/O.
I'm still trying to wrap my mind around the signals that the Prop would want to monitor...
/MREQ and /M1 for instruction fetch cycles
/IORQ
A0
/CSprop - if we want to keep any I/O addresses for the Z.
/RD and or /WR ??
/Halt - under some special circumstances?
The Prop would ISSUE:
/MREQ
/WR and or /RD (if it ever wants to read back something from Z memory - might be poor mans DMA like mechanism?)
CLK
/Wait - maybe? - but probably not with the CLK trick.
/Z-RESET
Anybody???
1) Let the Z80 run up through RAM addresses (It thinks it's executing NOPs) and as it goes the Prop "pokes" zeros into RAM until you get to 256 bytes below the top. There the Prop "pokes" a 256 byte boot loader.
2) Then the Z80 is reset and let rip.
3) The Z80 executes all those zeros in RAM as NOPs and then runs into the boot loader code at the top of memory.
4) The boot loader does whatever it needs to do to read CP/M from an HD port (maintained by the Prop) and set up whatever vectors and things CP/M needs.
Steps 2), 3) and 4) are basically how the Altair SIMH emulation works and also ZiCog.
I finally managed to breadboard my ratsnest circuit using a minimal setup of (Prop, SRAM, 5V Z80). With this I am now able to get a Z80 boot program loaded into the SRAM using the prop with only 15 pins used, leaving 17 pins for other purposes. I used Z80 pins in this order D0-D7, A0, /RD, /WR, /IORQ, /WAIT, /RESET, CLOCK.
I did have a few problems on the way but it was mainly related to the 5V CPU clocking in the end. I also needed the logic analyzer and scope at times. Found a few bugs in my code this way too. :frown: Debugging PASM can be tough.
As I have an old 5V NMOS Z80, I found it is rather difficult to drive the clock with the prop output. The 3.3V is not sufficient drive voltage but apparently only for the clock pin. The Vih threshold for the clock pin input on the Z8400AB1 device seems to be spec'd to be a minimum of (Vcc-0.6V) which is then 4.4V level. I tried to use a simple transistor inverter circuit hack but had problems on my breadboard doing so with capacitance and my 547 transistors would not switch off cleanly, and gave me a highly asymmetric clock output which upset the Z80. So I temporarily resorted to driving it with an 74HC04 inverter output running at 5V until a cleaner circuit can be realized. I don't want the extra 74HC04. Any 3.3V Z80's (if they exist) shouldn't have this issue.
Attached is the spin program that ultimately worked for me. It simply loads up a Hello World type of program into the Z80 using my push method I discussed earlier in this thread. Yeah it's still a bit messy and not particularly optimal but seems to work for now. I'm currently running the Z80 at 2.5MHz which I know the NCO counter mode will output cleanly. I may want to experiment more to see if I can get 4MHz working this way; might need to use another crystal on the prop for accurate 4MHz clocking.
Roger.
All those jumper wires! It reminds me of the DracBlade I did on the "LunchBox". That had 6" of internal fly-wires and all of the 6" breadboard jumpers. I was gob-smacked that it ran.
I posted some code on the other Z80 thread that seems to work sending a small bootup program to the Z80.
Very similar to this except I didn't use /WAIT. I have a vague feeling that /WAIT may not be needed if the clock is under propeller control but I need to think about that a bit more.
Yeah the wiring is a bit of a mess. Thankfully I didn't seem to make any errors with the connections and they all worked. Once you get into it it's actually quite fast to connect up like this. The biggest issue I had is that all the wires make accessing signals for probing a real nightmare. I did add some labels on the chips to help but the wires covered most of it. It would be better to have a larger board with more room and use flat wires, but I didn't have that handy.
Yeah the /WAIT would not be needed when doing the clock from the prop, but it comes in rather handy for slow I/O accesses. I like the /WAIT method at 4MHz but for modern faster Z80s the prop probably wouldn't have fast enough response time to control it if coupled with a waitpeq on the /IORQ + /WR or /RD lines as I had in my code.
I'm not quite at the stage yet of writing a tight pasm clock, but I suspect it is going to have artificial delays rather than a raw clockhigh,clocklow,waitpeq.
If using /WAIT then is the standard solution a latch on /IORQ? Hmm -but then you need an address decoder as well, otherwise any port IN or OUT instruction will trigger a wait. I'll need to think more about this as there is so much flexibility being able to start and stop the clock.
Great news on the 3V3. It is such a radical idea and it simplifies things so much to have everything running at the same voltage.
Ok, I have some more code working. @rogloh, I got some boards made so it is harder for me to change pins. Your pinout is very similar - how hard would it be for you to change your pinout slightly and then we have the same pinout and can share code?
You are using 7 control pins and I am using 6 control pins. However, the next two pins above my 6 are the stereo sound out, so you could use one of those for /WAIT and we can just have mono sound on the other one?
If your pinout is D0-D7, A0, /RD, /WR, /IORQ, /WAIT, /RESET, CLOCK.
and mine is 6 pins but in a different order, would it work if we put /WAIT as the last one and go
D0-D7, /IORQ, /WR, /RD, Clock, A0, /RESET, /WAIT and then P15 can be audio out?
This is a bootstrap program I wrote. I wanted to be able to dump 128 byte blocks of data to the Z80 and read them back. I'll probably use this to then load a more sophisticated program into high memory, and then use that program to load CP/M.
Debugging is to the propeller terminal so there is a 1 second delay and you hit F10 to download, then when it is finished, hit F12.
There is a pile of commented out code and old obsolete code in the attached zip file, including a pasm clock driver I'll be using later. But the above is the essential code needed to bootstrap a Z80 and do some data transfers.
My PASM code should be reasonably fast. When clocking at 2.5MHz I think I can roughly load around 8.4us per word or 232kB/s. So any Z80 bootloaders should load in a split second and the dominant factor will be the prop boot time itself.
In my system I force /WAIT low as soon as I decode the /IORQ, and /WR signals both low using the waitpeq instruction. I expect there is some latency of at least ~5 prop cycles (~62.5ns) before the /WAIT signal is output after the detection. This is fine for 4MHz systems but to speed this up further on say 20MHz Z80 systems you could use a latch but as you say it will need to be fully decoded if you plan to use other I/O devices on the same system. That adds more complexity which isn't nice on a minimal system.
Yes, my pinout is very flexible - I have not made any boards myself and so can readily rearrange things. I only thought to use A0 as P8 because I could get the data and address pin in a single "movs xx, ina" instruction which could come in handy. But having /IORQ as P8 is good too as it will always be 0 during an I/O operation to the prop and this can be useful as well to get at the byte with the top (9th) bit automatically cleared to 0 with the movs. I want to keep the /WAIT in my setup and am happy with mono audio.
Well done! It's nice to see some 3.3V Z80 stuff working fine!
With your efforts, and Shael's too, we have a lot more chances to learn many new and creative ways to use Propeller's power and flexibility.
Thank you all for doing this!
Yanomani
@rogloh,
I'll need to write some pasm to test this idea, but what I am thinking is a tight clock loop, and as soon as /IORQ goes low (no need to test /rd or wr in the tight loop), then jump out of the clock loop. Jump to a slower routine that toggles pins up and down same as the spin code above - ie test /rd or /wr, put data on bus or read from bus, add in two clocks, make the data bus HiZ again, jump back into the clock loop.
That is the general idea. The code is a bit more complex though. For example, the Z80 sends a byte out. The pasm routine captures the byte, puts it in a hub location and sets a flag. What happens to the data after that? Presumably it has to go somewhere, so a master Spin program or something needs to be checking for that flag and then gets the data, maybe sends it out the serial port or whatever. Then resets the flag. Which the pasm routine has been polling for all this time. Only then can the pasm routine restart the clock. (well, it could have gone straight back to clocking, but then bytes might be missed). So ultimately the Z80 is not going to be allowed to continue until any bytes it has output have been sent to the correct place. That might be a master spin routine initially, or for faster transfers, maybe another cog polling the flags.
That should be fine for keyboard and some displays. However, for the SD card, and maybe also for moving a whole sprite somewhere on a graphics display, it makes more sense to have block data transfer. So the flag gets set, n bytes get transferred, and then control is handed back to the Z80.
So the cog that has the clock loop in it also will need some code to do data transfers to hub. I guess one needs to think of a block of hub ram, some bytes set aside for data and some for flags. In a simplistic way, one could set aside 256 bytes for the data for port 0 to 255, and then another 256 bytes for the flags for those ports. So 512 bytes. And then set aside 128 bytes which mirror the FCB in CP/M. And one flag byte for that too. So 256+256+128+1=641 bytes. That simplifies passing data to and from the clock cog. We don't define yet what those ports are going to be. Just set aside a block of memory in hub and pass one pointer to that memory when starting the cog. Now all the other cogs can talk to the Z80 as they need to by interacting with this block of hub ram. Keyboard cog has captured a byte? Ok, Keyboard port is x, so put the byte in location x, and set a flag in location 256 plus x, and the keyboard cog then just carries on.
Maybe it is a bit more complex - what if the Z80 doesn't gobble up the byte by the time the next one arrives? So... each object might need a small circular buffer (most do have this, eg keyboard, serial objects), so the keyboard object can only start emptying its buffer if the port flag is clear.
This starts to define what should be in the clock cog code.
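A sketch of that hub block in Python, with the offsets adding up to the 641 bytes mentioned above. The constant and function names are invented for this example, and the flag convention (non-zero means "byte waiting") is an assumption; the thread does not fix one.

```python
PORT_COUNT = 256
DATA_BASE = 0                          # one data byte per port
FLAG_BASE = DATA_BASE + PORT_COUNT     # 256: one flag byte per port
FCB_BASE = FLAG_BASE + PORT_COUNT      # 512: 128-byte FCB mirror
FCB_SIZE = 128
FCB_FLAG = FCB_BASE + FCB_SIZE         # 640: single flag for the FCB
BLOCK_SIZE = FCB_FLAG + 1              # 641 bytes in total


def post_to_port(hub: bytearray, port: int, value: int) -> bool:
    """A device cog offers a byte; False means the previous one is still pending."""
    if hub[FLAG_BASE + port]:
        return False                   # caller keeps the byte in its own buffer
    hub[DATA_BASE + port] = value & 0xFF
    hub[FLAG_BASE + port] = 1
    return True


def take_from_port(hub: bytearray, port: int):
    """The clock cog consumes a pending byte, or returns None if there is none."""
    if not hub[FLAG_BASE + port]:
        return None
    value = hub[DATA_BASE + port]
    hub[FLAG_BASE + port] = 0
    return value


if __name__ == "__main__":
    hub = bytearray(BLOCK_SIZE)
    print(post_to_port(hub, 3, ord("A")), post_to_port(hub, 3, ord("B")))
    print(take_from_port(hub, 3), take_from_port(hub, 3))
```

The `False` return is where a keyboard or serial object would fall back on its own circular buffer.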
Sorry about going off on a tangent here....
So in pseudocode in pasm
start:
clock high
delay
clock low
delay
is /iorq low
if not, jump to start
' iorq routines
/wr low and A0 high - Z80 is writing the port number, so pasm saves this in variable 'portnumber' and jumps to start
/rd low and A0 low - Z80 wants data from port in 'portnumber', fetch data from hub, put on P0-P7, jump to start
/wr low and A0 low - Z80 is writing to port in 'portnumber' read data from P0-7, put data in hub, jump to start
special case - if portnumber = x, this is a FCB write, so next 128 iorqs transfer data
special case - if portnumber = y, this is a FCB read, so next 128 iorqs transfer data
If half the ports are designated for data and half are for flags say even numbers are data, odd numbers are flags, there are only 128 ports effectively. So port 0 is data, port 1 is the flag for port 0 etc.
OR
each transfer could be 3 bytes. First Z80 puts A0 high and sends the port number. Then it sends the data for that port. And then sends the flag for that port.
Hmm - is it better to have 128 ports and 2 bytes per transfer, or 256 ports and 3 bytes per transfer?
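For what it's worth, the basic A0 scheme from the pseudocode (A0 high selects the port number, A0 low moves data) can be modelled like this in Python. The class and method names are invented for the sketch, a plain 256-byte array stands in for the hub block, and the flag handling and FCB special cases are left out.

```python
class PortDecoder:
    """Model of the clock cog's /IORQ handling with a single address line."""

    def __init__(self) -> None:
        self.port_number = 0
        self.ports = bytearray(256)    # stand-in for the hub data bytes

    def io_write(self, a0: int, data: int) -> None:
        if a0:
            self.port_number = data & 0xFF            # /WR low, A0 high: select port
        else:
            self.ports[self.port_number] = data & 0xFF  # /WR low, A0 low: store data

    def io_read(self) -> int:
        """/RD low, A0 low: return the data for the selected port."""
        return self.ports[self.port_number]


if __name__ == "__main__":
    cog = PortDecoder()
    cog.io_write(1, 0x10)      # Z80 selects port 10h
    cog.io_write(0, 0x55)      # Z80 writes 55h to it
    print(hex(cog.io_read()))
```

So each logical port access costs the Z80 two I/O instructions with this scheme.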
I know what you mean here. Larger block type transfers may work better if done in a complete batch especially if running the Z80 at 20MHz. For the slower 4MHz systems I expect it really won't be much of a penalty to wait for the internal prop hub access on the I/O reads/writes.
Depending on how you wish to arrange things for accessing SD cards it could be possible to do the SPI pin control in the same COG as the main Z80 I/O decoder COG, and set aside a 512 byte (or 128 byte) COG RAM transfer buffer for this purpose. This approach may require more of a burden for the Z80 CPU in low level sector access and initializing the card etc - more CP/M BDOS/BIOS work. If however you want to leverage the existing SD drivers with FAT filesystem support etc you probably need another COG doing the work and will need to come up with inter-COG comms for this purpose.
I expect keyboards, UARTs, audio control and other low bit rate devices can probably just wait without too much concern on performance. Also depending on your model for video memory reads and writes, you could choose either block transfers (bitmaps/sprites etc) or individual I/O byte accesses (color consoles).
Lots of ways to skin the same cat.
One idea I have is to leverage the higher speed OTIR Z80 instruction and indirect register I/O addressing in the prop. What you can have is a special internal register in the Z80 COG (initially indexed by a write to the address register) that autoincrements and fills an internal COG RAM buffer on each I/O write to the data register from the Z80. This does not write to hub RAM each time and so fills quickly with minimal delay. Then you can write to a different command register (by changing the address register first) to initiate the transfer to hub RAM and another worker COG. Reads could work the same way. That is the benefit of using an indirect address register + data register scheme in the prop with two I/O registers on the Z80 side, as you can do useful things like this. You could have one autoincrementing register for video reads, one for video writes, one for SD writes etc etc, and each could maintain its own current address so you can even interleave them if required.
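A rough Python model of that address register + data register idea, just to show the bookkeeping. The port numbers, class name and buffer size are assumptions for illustration, and the real thing would be PASM in the I/O COG.

```python
class IndirectIoCog:
    """Toy model: one address port selects a register, one data port fills it."""

    ADDR_PORT = 0   # hypothetical port: selects the internal register
    DATA_PORT = 1   # hypothetical port: autoincrementing data access

    def __init__(self, buffer_size=128):
        self.buffer_size = buffer_size
        self.selected = 0
        self.buffers = {}    # register index -> bytearray (the "COG RAM" buffer)
        self.pointers = {}   # register index -> current write position

    def io_write(self, port, value):
        if port == self.ADDR_PORT:
            self.selected = value & 0xFF
            return
        buf = self.buffers.setdefault(self.selected, bytearray(self.buffer_size))
        ptr = self.pointers.get(self.selected, 0)
        buf[ptr] = value & 0xFF
        # Each register keeps its own pointer, so streams can be interleaved.
        self.pointers[self.selected] = (ptr + 1) % self.buffer_size
```

Writing to register 3, switching to register 5, then coming back to register 3 carries on where register 3 left off, which is the interleaving mentioned above.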
Roger.
Since your present setup allows for a lot of experimental work and you previously cited your intent of decoding the I/O address space in some different manner than the previously suggested ones, I would like to suggest a perhaps new way to do this.
If your setup has an unused inverter gate and some available area for mounting an 'HC573 style transparent latch, I believe I've found at least one way to retrieve all eight lower address lines and enjoy full 256 port I/O address space discrimination, without having to give up eight more Propeller pins.
My suggestion is to use an inverted /IORQ signal to drive the latch EN input. The A1 connection between the Z80 and the Propeller is then no longer needed for address decoding.
The spared Propeller pin can then be used to drive the latch /OE input.
Latch inputs are routed to the Z80's A7-A0, outputs to D7-D0.
Sure, it looks trivial, but the controlling signals and operational behavior will differ a bit from the ones we've seen before.
The intended circuit operation, in my view, will be as follows:
- Before the Prop detects the beginning of an I/O cycle, the output pin that controls the latch /OE will be HIGH, so the latch outputs are tri-stated and don't drive D7-D0.
- As long as the Z80 asserts /IORQ, the latches will store A7-A0, pending /OE to enable their outputs. At the same time, the Propeller will detect the beginning of an I/O cycle and start reacting to it.
- Since the Z80's /WR and /RD will be valid at the same time, cycle type discrimination is a breeze. If neither is active during a valid /IORQ, this can perhaps be interpreted as an interrupt acknowledge cycle in progress.
- If /RD is low, then the Z80's D7-D0 will be in the input state, so it's safe to enable the address latch outputs and let the Propeller read the port address. Then the outputs are disabled again, allowing the Propeller to drive the data lines itself.
I/O space discrimination and data retrieval during read cycles can then proceed.
- If /WR is low, then the Z80 will be driving the data bus, so we must first capture its contents by reading the Propeller's input pins. After the input data (from the Propeller's viewpoint) is saved internally, the Z80's clock is advanced, cycle by cycle, until /WR is negated.
Irrespective of the kind of I/O write instruction the Z80 is executing, including the block ones, the next cycle will be one of two possibilities:
1- An instruction fetch cycle, or
2- An interrupt acknowledge cycle, if interrupts are being used and enabled.
The I/O write operation can then be completed internally by the Propeller program, during the clock cycle that follows the I/O write one.
The data bus will again be tri-stated, so it's time for the Propeller to assert the latch /OE and retrieve the stored write port address. The latch /OE is then returned to the inactive state, and so are the D7-D0 lines.
From that point, routine procedures can deal with every intended operation.
The I/O write could now be called an I/O "deferred write", but this will not harm anything, from either the circuit or the logic behavior perspective.
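The order of events in the latch scheme can be sketched in Python as below. This is only a behavioural model under my own assumptions (the class and function names are invented, and clock stepping is reduced to a comment); it is meant to show why the write has to be deferred until the data bus is free.

```python
class AddressLatch:
    """Stand-in for the 'HC573: holds A7-A0, drives D7-D0 only while /OE is low."""

    def __init__(self):
        self.stored = 0
        self.oe_n = 1                  # /OE idles high: outputs tri-stated

    def capture(self, a7_a0):
        self.stored = a7_a0 & 0xFF     # happens in hardware during /IORQ

    def bus_value(self):
        return self.stored if self.oe_n == 0 else None


def read_port_address(latch):
    latch.oe_n = 0                     # enable latch outputs onto D7-D0
    port = latch.bus_value()
    latch.oe_n = 1                     # release the bus again
    return port


def io_read(latch, ports):
    # /RD low: Z80 data pins are inputs, so the latch may use the bus at once.
    port = read_port_address(latch)
    return ports.get(port, 0xFF)       # value the Prop would then drive


def io_write_deferred(latch, ports, data_on_bus):
    saved = data_on_bus & 0xFF         # 1. grab data while the Z80 drives the bus
    # 2. (hardware) step the Z80 clock until /WR is negated
    port = read_port_address(latch)    # 3. bus is free, fetch the port address
    ports[port] = saved                # 4. complete the write
    return port
```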
I hope it will help in some way
Yanomani
It is a very interesting idea to be able to access more of the Z80 I/O space from the propeller and not adding any extra prop pins to do so. It does however go against my own personal desire for minimalism, as it adds two more devices to the mix. If I had more breadboard jumpers I could just about try your scheme out but I have actually used them all up right now in getting what I wanted tested and working. I need to order some more first.
From what you are saying the delayed (pipelined) writes should work in principle. The only issue I can see might be if you had an I/O read immediately following a write and the write took longer to do its job; it may have to block the next read for a bit. But I'd expect at 4MHz this wouldn't be an issue, though maybe it would for 20MHz systems. If you are fully in control of the clock, extra delays can be inserted as required.
Do you know if these newer 20MHz Z80's have the same bus timing as the original 2-4MHz ones (albeit scaled up), or do they do more work per bus clock cycle? In other words, are there the exact same number of T states and machine cycles etc per instruction and it is just clocked faster, or is there some other architectural improvement to achieve faster speeds?
Roger.
I'm going to add another one to the mix. Idea up to this point was to get a simple bootloader into the start of memory, use that to put another bootloader into high memory, and then that bootloader loads in CP/M.
I think it is possible to do away with the middle bootloader. Use the bootloader we have but replace all the absolute jumps with relative jumps. When the program runs, the first thing it does is copy itself to high memory. The Z80 has some block move instructions - it might be possible to do this with just one instruction.
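As a sketch of what the self-copying front end might look like, here is Python that emits the Z80 bytes. LDIR does the copy in a single instruction, but HL, DE and BC have to be loaded first. The addresses and length in the example are assumptions, not values anyone has settled on.

```python
def word(n):
    """16-bit value as two bytes, low byte first (Z80 order)."""
    return bytes((n & 0xFF, (n >> 8) & 0xFF))


STUB_LEN = 14  # 3 + 3 + 3 + 2 + 3 bytes, see below


def relocating_stub(src, dst, length):
    """Copy `length` bytes from src to dst, then jump into the high copy."""
    stub = (
        b"\x21" + word(src)               # LD HL,src
        + b"\x11" + word(dst)             # LD DE,dst
        + b"\x01" + word(length)          # LD BC,length
        + b"\xED\xB0"                     # LDIR
        + b"\xC3" + word(dst + STUB_LEN)  # JP to the code after the stub, high copy
    )
    assert len(stub) == STUB_LEN
    return stub
```

For example `relocating_stub(0x0000, 0xFF00, 0x0100)` copies the first 256 bytes to FF00h and continues at FF0Eh. Anything after the stub still has to avoid absolute references to its low address, which is where the relative jumps come in.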
Re the 20MHz Z80 chips, I'm not sure. I am using a 4MHz chip but it is only running at 50kHz. Will be fun to ramp up the speed and then drop in 6 and 8MHz chips and see how they respond.
1) Whilst stuffing code into Z80 RAM space just cycle through all 64K, placing zeros everywhere until the top 256 bytes where the bootloader goes.
2) Reset the Z80 which will cause it to execute NOPs (those zeros) from address zero and eventually run into the bootloader at the top where things start to boot the normal Z80 way.
This is basically what ZiCog does when running CP/M. No messing with Z80 reset vectors at address zero etc. It's simple and it does not add any noticeable time to the boot up of the system.
So forget your simple boot loader at the start of memory and just use a simple CP/M bootloader at the top of memory.
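A small sketch of the image the Prop would stuff into RAM, plus how long the Z80 spends sliding through the NOPs. It assumes 4 T-states per NOP with no wait states; the function names are invented for illustration.

```python
def build_image(bootloader, size=0x10000, loader_space=0x100):
    """Zero-filled (NOP) memory image with the bootloader in the top bytes."""
    if len(bootloader) > loader_space:
        raise ValueError("bootloader does not fit in the reserved space")
    image = bytearray(size)                  # 00h is the Z80 NOP opcode
    start = size - loader_space
    image[start:start + len(bootloader)] = bootloader
    return image, start


def nop_slide_ms(start, clock_hz):
    """Time to execute `start` NOPs at 4 T-states each, in milliseconds."""
    return start * 4 / clock_hz * 1000.0
```

With the loader at FF00h that is 65280 NOPs, or about 65 ms at 4MHz, which fits the "no noticeable time" remark. At a 50kHz test clock it would be closer to 5 seconds.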
Now what is going to be done about getting disk blocks from Prop to Z80 RAM and back again in a timely manner?
Streaming blocks serially through an I/O port seems a bit slow.
My gut instinct tells me that using Z80 block I/O instructions will not actually be noticeably quicker than a tight Z80 loop doing normal I/O instructions.
Any single byte I/O transfers directly to/from the hub RAM will be quite a bit slower and would certainly not be the best approach for fast Z80s and disk transfers. If CP/M uses 128 byte records then a small sector buffer should fit into the COG RAM. That alone will help greatly. The COG can do raw data transfers to/from the hub approaching up to 20MB/s if using 32 bit longs - so that won't be a huge bottleneck compared to the SPI and Z80 transfers which will be around an order of magnitude slower.
When accessing your buffer in COG you have to work in 32 bit longs, you can't address bytes.
When getting or putting I/O to the Z80 you have to use bytes.
That results in a load of shifting and logic to extract/insert bytes to the longs in your buffer.
I think it has been argued that having such a buffer in HUB can be faster as accessing it with RD/WRBYTE does all that byte mangling stuff for you for free. That was my experience with playing around in my Z80 emulator code on the Prop.
That raw bandwidth from HUB to COG does not help you as you have to:
a) Hoover data from HUB to COG buffer, or vice versa.
b) Fiddle around inserting/removing the bytes.
That's two passes over the data instead of one and a bunch of logic for each byte you don't need.
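For anyone who hasn't hit this yet, the per-byte fiddling looks roughly like this when modelled in Python. The list of ints stands in for 32 bit COG longs, and the byte order is little-endian, as on the Prop.

```python
def get_byte(longs, index):
    """Extract byte `index` from a buffer of 32 bit longs."""
    shift = (index & 3) * 8                  # which byte lane inside the long
    return (longs[index >> 2] >> shift) & 0xFF


def put_byte(longs, index, value):
    """Insert a byte into a buffer of 32 bit longs."""
    shift = (index & 3) * 8
    cleared = longs[index >> 2] & ~(0xFF << shift) & 0xFFFFFFFF
    longs[index >> 2] = cleared | ((value & 0xFF) << shift)
```

Every byte costs an index split, a shift and a mask (plus a read-modify-write on the way in), which is the overhead RDBYTE/WRBYTE on a hub buffer gives you for free.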
Good point, I see what you mean now. We can't achieve the full Hub bandwidth without a 32 bit long buffer in COG RAM, and yet shuffling Z80 byte data in and out of this long buffer will slow us down too.
What if the Z80 I/O COG does the SPI accesses directly??? That was another idea I was thinking about but certainly it will involve more Z80 CPU workload.
Edit:
Some estimated performance numbers for a 4MHz Z80, and a 128 byte temporary buffer in the COG RAM (stored as 1 byte per long) are added below.
A burst of 128 bytes transferred using OTIR @ 21 clocks/byte and say (conservatively for example) two extra wait states during I/O writes to COG RAM takes 128 x 23 / 4 = 736us. That is inherently 174kB/s before we account for COG RAM transfer delays and SD writes.
The transfer from COG RAM to hub RAM using bytes can be done at 5MB/s. This adds 128/5 = 25.6us to the number above. So approximately we are talking about 761us before we begin the SD write process. That will take its own time to transfer via SPI. Let's assume we can get ~10Mbps or around 100us for the 128 byte sector. But we probably have to write 4x this number of bytes if using 512 byte sectors and that makes it 400us. The total is then 736 + 25.6 + 400 = about 1161 microseconds to write 128 bytes, or 110kB/s. It's still not a bad number for loading programs when you are talking about only a 64kB address space in the Z80. Large file copies will take a hit however.
If you pipeline it or can use 2 buffers you might be able to have an SD write in progress while the intermediate COG RAM is being filled with the next sector data. This might bring the rate back up again towards 168kB/s in the best case. A 20MHz Z80 will change these numbers, but at least at 4MHz you can see the dominant amount of time is spent by the Z80 writing to the I/O COG registers, and that can't really be avoided. Also if you don't choose to use the OTIR and instead do individual port writes, read data from HL, increment HL, decrement counter, loop etc, it won't be 21 clocks/byte, so this number will be quite a bit worse. We really want to use the OTIR, INIR opcodes for best performance.
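The estimates above are easy to play with if you put them in a few lines of Python. All the inputs are the assumed figures from the post (21 clocks/byte for OTIR, two wait states, 5MB/s hub copy, 100us of SPI per 128 bytes), not measurements.

```python
SECTOR = 128  # bytes per CP/M record


def burst_us(nbytes=SECTOR, otir_clocks=21, wait_states=2, z80_mhz=4.0):
    """Time for the Z80 to push nbytes through OTIR, in microseconds."""
    return nbytes * (otir_clocks + wait_states) / z80_mhz


def hub_copy_us(nbytes=SECTOR, mb_per_s=5.0):
    """Byte-wise COG RAM to hub RAM copy time, in microseconds."""
    return nbytes / mb_per_s


def rate_kb_per_s(nbytes, total_us):
    """Bytes per microsecond is MB/s; scale to kB/s."""
    return nbytes / total_us * 1000.0


z80_us = burst_us()                          # 736.0
hub_us = hub_copy_us()                       # 25.6
sd_us = 4 * 100.0                            # 512 byte SD sector at ~100us per 128 bytes
serial_total = z80_us + hub_us + sd_us       # SD write not overlapped
pipelined_total = z80_us + hub_us            # SD write hidden behind the next fill
```

`rate_kb_per_s(SECTOR, serial_total)` comes out at about 110kB/s and the pipelined case at about 168kB/s. Changing `z80_mhz` or `wait_states` shows quickly how much of the total is the Z80 side.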
If reaching some place at a higher memory address is your sole intent, you can safely present a direct jump to the targeted address for the Z80 to execute and get there without even needing to write to the RAM at 0000h, which avoids disturbing any previously loaded content.
So, when jumping to, say, 0FF00h, one only needs to stuff a three byte sequence for the Z80 to execute directly as its first instruction, just after completing all /RESET related procedures:
C3h, 00h, FFh
The first byte, valued C3h, will be sent from the Propeller to be executed by the Z80, beginning at the very first fetch cycle.
Then, under Propeller control, using Z80's /RD signal active/inactive states as a handshaking control, the following two address components can also be 'stuffed' at the right time.
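In Python terms, the stuffing sequence amounts to the following. The function names are made up, and the fetch addresses assume the Z80 starts at 0000h after reset and steps through the three bytes of the instruction in order.

```python
def jp(addr):
    """Encode the Z80 absolute jump JP nn: opcode C3h, then low and high byte."""
    return bytes((0xC3, addr & 0xFF, (addr >> 8) & 0xFF))


def feed_from_reset(code):
    """Yield (fetch_address, byte) pairs, one per /RD cycle after reset."""
    for offset, value in enumerate(code):
        yield offset, value
```

`list(feed_from_reset(jp(0xFF00)))` gives the three bytes C3h, 00h, FFh against fetch addresses 0000h to 0002h. After the third byte the Z80 is at FF00h and RAM at 0000h has not been touched.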
Since just after reset there is no possibility of any spurious interrupt being accepted or processed, and in CP/M's context there is also no easy way to allow for non maskable interrupts either, the code's execution sequence will be absolutely deterministic.
In fact, since the whole 'cold boot' will be a completely deterministic task, all you have to do is carefully 'tell' (or maybe 'insidiously suggest' or 'whisper', without a single clue or footprint) the Z80 any sequence of events you want it to execute, writing only at specific addresses the real code and data you intend to reside anywhere in RAM.
Finally, as deterministic procedures are among the things a Propeller does best, and the whole boot sequence, as seen from the Z80's perspective, will be almost entirely deterministic, I believe there is a perfect match between the two.
I hope to help a bit.
Yanomani
Also that kind of code will eat Z80 RAM space which is better left free for CP/M apps and such.
Something like this?
@rogloh
Sorry for the late reply, my fault!
Among many others, for reference purposes, I'm using Zilog's #PS017801-0602 product specification, which deals with all but one of Z80's versions.
Although this paper doesn't cover the oldest and long retired 2.5 MHz NMOS part, all the other ones are within its scope, from 4 to 8 MHz NMOS parts and from 4 to 20 MHz CMOS ones. It even lists the obsolete 4MHz CMOS version.
To spare you from reading a bunch of technology linked details, I will focus on the prominent ones, whose significance can mess or not with all but the stringently specified designs (the ones dealing with wave shapes, clock and bus capacitances, signal ringing, etc...).
The main difference you can note is the existence of a power-down mode, only available in CMOS versions, and encapsulation availability.
Back on the track to our present focus, by carefully reading that paper, you'll notice all the subtle variations among them all, notably the clear (and mandatory) connection between clock speed limits and signaling setup and hold times.
Every other aspect, notably those that deal with instruction cycle timing, is held unchanged through the entire family.
Hope it can help a bit
Yanomani
Yes that should work. I'll try it when I get home from work.
I must say that the Propeller makes a perfect debugging tool for things like this. I don't know exactly how many clock cycles the above is going to take, but I know that it is very easy to write Spin code and download and test things quickly and when the /rd line goes low and all the higher address lines go high, it has made the jump.
Thanks Yanomani!
I am looking at the response time for the prop to detect the Z80's I/O accesses and respond, either by issuing /WAIT to buy more time or by stopping the Z80 clock. I want to see how this compares with Z80 clock speed. The data sheet only shows the information for 5V +/- 10% voltage and 0-70C temperature range, not 3.3V, so take the numbers with a grain of salt. But they can still be used to get a very rough indication of where we stand and what might be possible.
For the wait method:
You need to drop /WAIT within a maximum amount of time after /RD (or /WR) and /IORQ both go low.
There is a small delay before /RD (or /WR) goes low after a clock edge. You need to have /WAIT issued before 1.5 x the Z80 clock period minus a setup time following this same clock edge. This table computes the maximum response time you have after detecting the /IORQ and /RD edge in the prop and the number of propeller clocks this corresponds to at 80MHz before the /WAIT must go low. Times are in nanoseconds.
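The /WAIT budget described above can be written as a small Python function. The delay and setup arguments are placeholders that you would fill in from the data sheet for your part; the values in the example below are round numbers picked only to exercise the formula, not real data sheet figures. Subtracting the /RD delay is my reading of "after detecting the edge".

```python
import math

PROP_CLOCK_NS = 12.5   # one Propeller clock at 80MHz


def wait_budget(z80_mhz, rd_delay_ns, wait_setup_ns):
    """Return (budget_ns, whole_prop_clocks) available to pull /WAIT low.

    Measured from the moment /RD (or /WR) and /IORQ are seen low:
    1.5 Z80 clock periods, minus the /WAIT setup time, minus the delay
    from the clock edge to /RD going low.
    """
    period_ns = 1000.0 / z80_mhz
    budget_ns = 1.5 * period_ns - wait_setup_ns - rd_delay_ns
    return budget_ns, math.floor(budget_ns / PROP_CLOCK_NS)
```

With the placeholder values `wait_budget(4, 50, 50)` gives 275ns, or 22 prop clocks. The same placeholders at 20MHz give a negative budget, i.e. no time at all.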
There is no way 20MHz is possible without external hardware doing the /WAIT.
For 10MHz you might be able to generate it if the combination of the "waitpeq" and "and outa, mask" operations can complete within 6 clocks after the edge. It might just be possible - it will be very close but worth a go. 6 MHz down should be very doable.
For the clock stopping method:
On I/O reads (which is the tighter case compared to I/O writes) you need to stop the clock before the Z80 latches its read data from the data bus.
Without waits, you have to stop the clock within 2.5 x the Z80 clock period minus the /RD low delay. The data setup time is also shown but it should not factor into the computation if you stop the clock in time. However, once the prop presents the I/O read data on the bus you will need to wait at least this amount of time before you can start the clock again. Times are in nanoseconds.
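The same kind of sketch works for the clock stopping case. Again the /RD delay is a placeholder to be taken from the data sheet, and the example value is only there to show the arithmetic.

```python
import math

PROP_CLOCK_NS = 12.5   # one Propeller clock at 80MHz


def clock_stop_budget(z80_mhz, rd_delay_ns):
    """Return (budget_ns, whole_prop_clocks) available to stop the Z80 clock.

    On an I/O read without waits: 2.5 Z80 clock periods minus the delay
    from the clock edge to /RD going low.
    """
    period_ns = 1000.0 / z80_mhz
    budget_ns = 2.5 * period_ns - rd_delay_ns
    return budget_ns, math.floor(budget_ns / PROP_CLOCK_NS)
```

With a placeholder 50ns delay this gives 575ns (46 prop clocks) at 4MHz and 75ns (6 prop clocks) at 20MHz, which shows how quickly the margin shrinks.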
If you are controlling the clock in software as fast as you can, and polling /IORQ to be low with something like this code below you can only clock the Z80 as fast as 20M/6 = 3.3MHz if the prop runs at 80MHz and you want a symmetric clock output.
If instead you are controlling the clock using a counter you will need to do a WAITPEQ on the /IORQ, then stop the counter clocking as soon as possible. I expect this takes at least 5 propeller clocks: (at best) 1 for the waitpeq detection to complete if it was already running, and 4 for the write to the CTRA register to stop it, assuming no other internal delays inside the prop itself (which there might be - I don't know). I'm not sure that 20MHz is achievable - it's really tight. Probably 10-12MHz is more reasonable to expect?
Roger.
ps. you can always buy a 20MHz rated Z80 CPU and clock it slower. The numbers above indicate using each particular MHz rated Z80 but there is no reason you can't underclock a faster one, that will help too.