@evanh said:
Thinking about it a little more, that 33 ticks may not be right either. The 5 post-hubRAM stages won't matter since they are just like a pipeline so don't need to be emptied first. Then, 28 ticks after a non-blocking RDFAST should be clear.
In reply to last sentence, yes. Maximum time to fill the FIFO from start of RDFAST is 17+18-5 = 30 cycles, therefore the FIFO is guaranteed to be filled 28 cycles after a 2-cycle no-wait RDFAST, assuming no longs are consumed by RFxxxx during filling, as I mentioned earlier. A random hub read/write needs to be at least 28-3 = 25 cycles after the no-wait RDFAST to avoid any delay.
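The arithmetic above can be sanity-checked in a few lines; the constants are just the figures quoted in this post, nothing measured independently:

```python
# Sanity-check of the FIFO fill figures quoted above (constants from the post).
RDFAST_NOWAIT = 2                 # a no-wait RDFAST takes 2 cycles to execute
WORST_FILL = 17 + 18 - 5          # max cycles from start of RDFAST to FIFO full
SAFE_AFTER_RDFAST = WORST_FILL - RDFAST_NOWAIT   # cycles after RDFAST until FIFO is full
PRE_ACCESS = 3                    # pre-read/pre-write time of a random hub access
min_gap = SAFE_AFTER_RDFAST - PRE_ACCESS         # minimum safe RDFAST-to-RDxxxx gap

print(WORST_FILL, SAFE_AFTER_RDFAST, min_gap)    # 30 28 25
```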
Right, yeah, can take off the RDLONG's 2 for execute and 1 for setup. A WRLONG would need 26 ticks after the RDFAST since it doesn't appear to have the 1 setup.
@evanh said:
Right, yeah, can take off the RDLONG's 2 for execute and 1 for setup. A WRLONG would need 26 ticks after the RDFAST since it doesn't appear to have the 1 setup.
The 3 cycles deducted from 28 to give 25 is the pre-read/pre-write time at the start of random read/write instruction and nothing to do with what happens during the RDFAST instruction.
@evanh said:
Right, yeah, can take off the RDLONG's 2 for execute and 1 for setup. A WRLONG would need 26 ticks after the RDFAST since it doesn't appear to have the 1 setup.
The 3 cycles deducted from 28 to give 25 is the pre-read/pre-write time at the start of random read/write instruction and nothing to do with what happens during the RDFAST instruction.
Yep. But it's only 2 for WRLONG - the execution time. The other 1 is the write cycle of one longword. Or more accurately, 1..8 for hub slot alignment and write. ie: 3..10 ticks total for an aligned WRLONG.
@evanh said:
Right, yeah, can take off the RDLONG's 2 for execute and 1 for setup. A WRLONG would need 26 ticks after the RDFAST since it doesn't appear to have the 1 setup.
The 3 cycles deducted from 28 to give 25 is the pre-read/pre-write time at the start of random read/write instruction and nothing to do with what happens during the RDFAST instruction.
Yep. But it's only 2 for WRLONG - the execution time. The other 1 is the write cycle of one longword. Or more accurately, 1..8 for hub slot alignment and write. ie: 3..10 ticks total for an aligned WRLONG.
The actual hub write occurs after the random write instruction has finished. A fixed pre-write time of 3 cycles and a variable time of 0-7 agree with my test results, whereas a pre-write of 2 and a variable of 1-8 do not. The easiest way to check this is to write to slice X immediately after a read from slice X.
EDIT:
Also, from testing, there must be at least 10 cycles after a RDFAST with wait before both random read and write, which shows pre-read and pre-write are effectively the same, namely 3 cycles.
Write to slice X takes 10 cycles after read from slice X and the next slice after the write is X. (I used to think this write takes 3 cycles but that is wrong.)
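The write-timing model above (a fixed 3-cycle pre-write plus a 0..7 cycle wait for the target hub slot) can be expressed as a tiny helper. This is my own illustrative sketch of the numbers in the post, not something run on hardware:

```python
# Model of the random-write timing described above: fixed 3-cycle pre-write
# plus a 0..7 cycle wait for the target hub slot to come around.
def wrlong_cycles(slot_wait):
    """slot_wait: 0..7 cycles until the target slice's hub slot arrives."""
    return 3 + slot_wait

print(wrlong_cycles(0))  # 3  - best case, slot immediately available
print(wrlong_cycles(7))  # 10 - worst case, e.g. writing slice X right
                         #      after reading slice X, as described above
```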
@TonyB_ said:
Write to slice X takes 10 cycles after read from slice X and the next slice after the write is X. (I used to think this write takes 3 cycles but that is wrong.)
That long burst testing we did together says otherwise. If there were a large turnaround it would've been a huge factor in the stats. What you'll be dealing with here is just the regular slot aligning delays.
Well I do now! Fixed.
Weirdly, trying this again without the ASMCLK but starting the code from a SPIN2 method doesn't work. I thought the PLL should at least be set in that case with flex. What stupidity of mine is causing this problem? I'm using 5.9.12 beta.
CON
_clkfreq = 325_000000
'...etc
PUB startup()
coginit(cogid(), @gigatron, 0)
DAT
orgh
gigatron
org 0
' emulator initialization code
' asmclk 'setup PLL
mov pa, ##@lutram
sub pa, ##@gigatron
@TonyB_ said:
@rogloh said:
Well that's useful information @TonyB_. Here's the timing I think we have between the relevant instructions in my code.
24 clocks in third row should be 25+, to be 100% sure of timing.
Yeah, I think I can muck a little with the branch code sequence below. I can probably move some of the accumulator test instructions and part of the branch calculations before the RDBYTE, and push the RDFAST down a little further into the code as it has more headroom. This should buy a few extra clocks. If I was to test this for a failure I guess I should do an ALU operation followed by a branch that uses a memory lookup result for the destination. I can probably simulate that as a separate test with the same instruction spacing and see if the RDBYTE is ever delayed from that position after a prior RDFAST with no waiting, with all the different hub cycle offsets tried. Then we'll know for sure.
branch_ops '46 clocks max (XBYTE uses this code in MEMORY mode)
' JMP
' BGT
' BLT
' BNE
' BEQ
' BGE
' BLE
' BRA
' 8 bit opcode pattern:
' iiijjjss - iii is 3 bit instruction (111 for JMP/Bxx)
' - jjj is 3 bit jump type
' - ss is 2 bit input source
rfbyte d 'read immediate parameter
' 00 01 10 11 "ss" values
add d, rambase ' | bus=[mem] | |
rdbyte bus, d ' | bus=[mem] | |
mov bus, d ' bus=d | | |
mov bus, ac ' | | bus=ac |
getbyte bus, input, #0 ' | | | bus=input
if_z rdfast nowait, branchaddr 'delayed branch if needed
' 000 001 010 011 100 101 110 111 "jjj" values
' JMP BGT BLT BNE BEQ BGE BLE BRA
' far > < <> = >= <= always
getptr branchaddr ' | b c d e f g h
sets branchaddr, bus ' a b c d e f g h
add branchaddr, bus wz ' a b c d e f g h
setd branchaddr, y ' a | | | | | | |
add branchaddr, rombase wz ' a | | | | | | |
test ac wz ' | b c d e f g |
testb ac, #7 wc ' | b c d e f g |
modz _c_or_z wz ' | b | | | | | |
modz _nc_or_z wz ' | | c | | | | |
modz _z wz ' | | | d | | | |
modz _nz wz ' | | | | e | | |
modz _c wz ' | | | | | f | |
modz _nz_and_nc wz ' | | | | | | g |
modz _nz wz 'lazy hack to flip polarity for now
' call #dump
xcont egaout, output
xor CLOCK_PORT, clk 'toggle clock pin if enabled
wxpin output, #OUT_REPO
wxpin ac, #XOUT_REPO
_ret_ rqpin input, #IN_REPO
Maybe I don't have to mess with the branch timing. I can't seem to delay the RDBYTE using the simulated FIFO access test here. It always takes 9-16 cycles in this test program, so hopefully that means we are okay as-is, unless this test is somehow not representative enough.
CON
_clkfreq = 100000000
DEBUG_BAUD = 115200
DAT
org
asmclk
mov r1, ##10000 ' test loop length
loop
getrnd r0
and r0, #$1f ' randomize requested read address offset
waitx #$f wc ' randomize timing before changing the fifo a bit
rdfast nowait, ##@fifo ' load up the fifo with some default data and don't wait
long 0[7] ' space by 7 NOP instructions
rfbyte r2 ' simulate the _ret_ consumption of a byte from the fifo with xbyte
nop
nop
nop ' simulate XBYTE timing with 3 NOPs
rfbyte r2 ' simulate the rfbyte consumption of a byte from the fifo
getct t_init
rdbyte r2, r0 ' instruction to measure
getct time
sub time, t_init ' compute time interval
sub time, #2 ' account for getct overhead
cmp time, maxtime wc
if_nc mov maxtime, time ' keep track of slowest case
DEBUG(UDEC_LONG(time))
djnz r1, #loop ' repeat
DEBUG(UDEC_LONG(maxtime))
cogid pa ' shutdown
cogstop pa
maxtime long 0
r0 long 0
r1 long 0
r2 long 0
nowait long $80000000
t_init long 0
time long 0
orgh
fifo long $55555555
long $FEEDFACE
Roger,
Those RFBYTEs won't trigger any extra FIFO fetching from hubRAM. The initial prefetching means you have to consume 6 longwords from the FIFO before it'll go fetch any more from hubRAM.
And with the measured RDBYTE sitting at 28 ticks from the RDFAST, which I presume is intentional, it's enough to clear the initial prefetching.
PS: And generally the FIFO will fetch another 6 consecutive longwords at a time from hubRAM. The exception being when consumption is faster than 1-in-6, or thereabouts. Give or take some for the slot delays.
@rogloh said:
Just hooked up a SNES style game controller and got the input working in the game controller mode (not serial port data quite yet). I just ran the controller I had at 3.3V (likely a CMOS 4021 etc) and it worked directly wired to the P2 pins without any resistor needed. I could run the snake and car racing game with it. Works fine. Select seems to cycle the actively rendered scan lines per pixel and holding Start down resets the system as expected. No evidence of sync corruption so far, very solid with the P2 @ 325MHz/6.25MHz Gigatron emulation. That Mandelbrot set is a pretty slow to render at this speed though. I'm watching it crawl along until I increase the CPU time by dropping the actively drawn scan lines.
Yeah, and they even cheat by doing half and mirroring it, I think.
Yeah, Select or the equivalent key on a keyboard should cycle between the four video "modes" in anything you are doing. It can skip lines to render faster.
Good news then @evanh. It seems that we shouldn't have a way to deplete the FIFO this way causing RDBYTE slowness. There are plenty of extra cycle times to keep the FIFO topped up too whenever the branching is not occurring in a given instruction. So I'm not sure there is going to be a way for the FIFO to actually affect the emulator performance, which is great if true. Can you see any other potential issue with the code's current timing?
@PurpleGirl said:
Yeah, Select or the equivalent key on a keyboard should cycle between the four video "modes" in anything you are doing. It can skip lines to render faster.
I'm trying to get the host serial keyboard interface working in my code but am having timing issues making it fit into the single IO COG. I thought a RDPIN x, #PIN wc would set C if the smartpin had received a character, but it looks like you have to test it first with TESTP #PIN wc, thus needing an additional instruction now. Also I think they translate keyboard arrows into buttons in Babelfish, so I'm not sure it will recognize any regular characters to move the arrows and perform program selection like the gamepad does in Gigatron itself. I know that Gigatron must at least recognize some characters in the BASIC programs, but without button control translation I don't think I can actually launch that first.
What I might do is accept both game controller input and also some extra serial host input from a third COG which itself outputs onto a pin like the game controller does (under the same clock and latch/sync timing). The IO COG can then monitor both the input source data pins and accept either whenever they are active, with one of them taking priority perhaps if both pins are active at the same time - actually that could be tricky, maybe I'll just have to AND them together. This third COG could eventually become a PS/2 or USB keyboard COG, or whatever emulates Babelfish more completely.
The FIFO is built to handle the streamer doing 32-bit words at sysclock/1 without hiccup. There is no chance of ever depleting it. I thought the question was how much it stalls single data read/writes on collisions ... and how close the first RFxxx can be after non-blocking RDFAST.
@evanh said:
The FIFO is built to handle the streamer doing 32-bit words at sysclock/1 without hiccup. There is no chance of ever depleting it. I thought the question was how much it stalls single data read/writes on collisions ... and how close the first RFxxx can be after non-blocking RDFAST.
Yes. I'm not seeing evidence of collisions with the spacing in this code and nor am I seeing problems with the gap I have between RDFAST (nowait) and the RET to XBYTE. So I think we can say we're pretty good now, and this emulator should run 100% fine.
Interestingly, the P2 still needed 52 clocks (~23 native instructions + XBYTE) to emulate a very minimal processor such as this one (no flags, no stack). A 325MHz P2 = 6.25MHz Gigatron. This probably means we're not going to get too much faster at emulating very many processors using XBYTE with 100% cycle accuracy if they have an instruction set that requires HUB RAM accesses and if they also do branching, so this 6.25MHz emulation speed is probably getting close to the limit of the P2 (though you could run the P2 faster at ~350MHz on a good day or higher with more cooling).
EDIT: actually that's probably not quite true if you can keep your RDFAST branching handled in one place where there are no other reads or writes to compete. Then you could increase the instruction rate. We needed to pipeline the branches so that made it tricky because branching could happen in any instruction handler. Also if your branch is allowed to take more than one cycle while simpler ones only take one, you could make the other non-branching instructions run faster as well, assuming they are simple enough to execute.
Oh, by deplete, you were meaning hitting the FIFO's low-water mark maybe? The low-water triggers at 13 longwords remaining. The FIFO doesn't refill until down to this level, and then pulls in 6 longwords in a quick single rotation burst.
So on the rare occasion of a collision with a single data read/write, that single access will be stalled for +8 ticks to its next available slot.
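The low-water refill behaviour described here can be sketched as a toy model. The 19-longword depth is my assumption for illustration; the low-water mark of 13 and the 6-longword burst are the figures from the post:

```python
# Toy model of the FIFO refill behaviour described above: the FIFO refills
# in bursts of 6 longwords once it drains to the 13-longword low-water mark.
class ToyFifo:
    LOW_WATER = 13   # refill triggers at this many longwords remaining
    BURST = 6        # longwords fetched per refill burst

    def __init__(self, depth=19):          # depth is an assumed value
        self.level = depth                 # initial prefetch fills the FIFO
        self.hub_bursts = 0                # refill bursts issued so far

    def consume_long(self):
        self.level -= 1
        if self.level <= self.LOW_WATER:   # hit low water: burst-refill
            self.level += self.BURST
            self.hub_bursts += 1

f = ToyFifo()
for _ in range(6):
    f.consume_long()
print(f.level, f.hub_bursts)  # 19 1 - first refill only after 6 longs consumed
```

This matches the earlier observation that you must consume 6 longwords before the FIFO goes back to hubRAM for more.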
I managed to free a couple of longs and got the serial interface working with the game controller in parallel in the IO COG to send ASCII text I type into Gigatron. It seems to be working. Unfortunately some applications require a LF instead of CR, so for that you need to use a different terminal emulator instead of loadp2's simple console. However I did manage to get the included MSBASIC responding directly from loadp2 with CR only.
This updated code will combine game controller input in parallel with serial input to feed the input buffer (using bitwise AND so don't press both at the same instant in time). Only two COGs are used. I also removed the XOUT repository by merging the OUTPUT and extended OUTPUT ports into the same 32 bit repository, and now have two of the LEDs pins share with these repositories to save more pins.
Right now it doesn't do any keyboard-to-button translation in the COG (it just sends raw data), so to send a game controller button without having one to press you would need to send characters with bit(s) cleared when you want some controller button to be activated (they are active low). The following character values are treated as single button presses. You might be able to hold down ALT on a Windows keyboard and type the decimal character number to effectively press the button. I think PC keyboards used to be able to do that; whether or not that still works these days, I'm not sure.
01111111 = 7f = 127 BUTTON A (also delete key)
10111111 = bf = 191 BUTTON B
11011111 = df = 223 SELECT
11101111 = ef = 239 START
11110111 = f7 = 247 UP
11111011 = fb = 251 DOWN
11111101 = fd = 253 LEFT
11111110 = fe = 254 RIGHT
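Because the codes above are active-low masks, ANDing the input sources together (as described earlier) merges button presses from several devices: a cleared bit from any source stays cleared. A quick illustration:

```python
# The button codes are active-low bit masks (from the table above), so
# AND-combining multiple input sources keeps any pressed button pressed.
BUTTON_A = 0x7F   # 0111_1111
SELECT   = 0xDF   # 1101_1111
IDLE     = 0xFF   # nothing pressed

def combine(*sources):
    result = 0xFF
    for s in sources:
        result &= s       # a 0 bit from any source survives the AND
    return result

print(hex(combine(IDLE, IDLE)))        # 0xff - nothing pressed anywhere
print(hex(combine(BUTTON_A, IDLE)))    # 0x7f - one source pressing A
print(hex(combine(BUTTON_A, SELECT)))  # 0x5f - both pressed at the same instant
```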
Here's the latest beta (v0.6). Anyone with a P2-EVAL or JonnyMac board and the A/V adapter should now be able to get it to work even without a game controller if they use these character values as button presses (and when using the v5a ROM downloaded from github). https://github.com/kervinck/gigatron-rom
@evanh said:
The FIFO is built to handle the streamer doing 32-bit words at sysclock/1 without hiccup. There is no chance of ever depleting it. I thought the question was how much it stalls single data read/writes on collisions ... and how close the first RFxxx can be after non-blocking RDFAST.
Yes. I'm not seeing evidence of collisions with the spacing in this code and nor am I seeing problems with the gap I have between RDFAST (nowait) and the RET to XBYTE. So I think we can say we're pretty good now, and this emulator should run 100% fine.
Interestingly, the P2 still needed 52 clocks (~23 native instructions + XBYTE) to emulate a very minimal processor such as this one (no flags, no stack). A 325MHz P2 = 6.25MHz Gigatron. This probably means we're not going to get too much faster at emulating very many processors using XBYTE with 100% cycle accuracy if they have an instruction set that requires HUB RAM accesses and if they also do branching, so this 6.25MHz emulation speed is probably getting close to the limit of the P2 (though you could run the P2 faster at ~350MHz on a good day or higher with more cooling).
Most older CPUs probably don't do branching in a single cycle. It might be possible to do a GETCT during horizontal blanking to check total cycles/line.
Worst-case RDFAST of 17 cycles occurs when slice difference between previous read and RDFAST = +1 where slice = hub address[4:2], e.g. RDBYTE from $00000 then RDFAST from $00004. Best-case RDFAST of 10 cycles when RDFAST - RDxxxx slice diff = +2. (Also, best-case / worst-case random RDxxxx of 9 / 16 cycles when RDxxxx - RDxxxx slice diff = +1 / 0.)
I have not seen partial FIFO filling cause a delay to random reads or writes, yet. In my tests, I wait 64 cycles using WAITX after RDFAST to ensure FIFO is full and filling over, then do four or five or six RFLONGs, then one or two RDBYTE or WRBYTE. I choose slices so all hub accesses should be best-case and they are, i.e. no delays.
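If the in-between cases step linearly between the quoted best and worst cases (my assumption; only the end points are given above), the slice-difference rules could be encoded like this, with hypothetical helper names:

```python
# Hypothetical encoding of the slice-difference timing rules above, where
# slice = hub_address[4:2] and diff = (target_slice - prev_slice) mod 8.
# The linear step between best and worst cases is an assumption.
def rdxxxx_cycles(diff):
    # best case 9 at diff = +1, worst case 16 at diff = 0
    return 9 + (diff - 1) % 8

def rdfast_cycles(diff):
    # best case 10 at diff = +2, worst case 17 at diff = +1
    return 10 + (diff - 2) % 8

print(rdxxxx_cycles(1), rdxxxx_cycles(0))   # 9 16
print(rdfast_cycles(2), rdfast_cycles(1))   # 10 17
```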
If you actually reserved this much hub RAM, that could be accurate, but if you only have 64K allocated, then you could be seeing this bug.
The memory test in the ROM truly is dynamic. So if you use the A15 breakout on the board and do a socket mod to use a larger chip, it will report 64K on boot.
So you might want to try one of the DevROM versions. I think those start with X. They might add more vCPU opcodes and syscalls too, but I'm not sure. The syscall thing is interesting. While I think that has more overhead than the opcodes, it provides a way to poke a hole in the vCPU interpreter and call native routines. So that's likely better for longer tasks.
@PurpleGirl said:
So you might want to try one of the DevROM versions. I think those start with X. They might add more vCPU opcodes and syscalls too, but I'm not sure.
I've now added support for USB keyboards via the USB breakout board, as well as controlling directly via loadp2 console (or other terminal consoles). For now a real game controller is no longer implemented but one could potentially be added via direct IO or via the USB interface with a USB game pad in a future release with support for that.
The following shows the mapping for the USB keyboard keys to game pad buttons.
usb.setEmuPadMapping( encod RIGHT, usb.KEY_RIGHT) ' Right
usb.setEmuPadMapping( encod LEFT, usb.KEY_LEFT) ' Left
usb.setEmuPadMapping( encod DOWN, usb.KEY_DOWN) ' Down
usb.setEmuPadMapping( encod UP, usb.KEY_UP) ' Up
usb.setEmuPadMapping( encod START, usb.KEY_ESC) ' Start
usb.setEmuPadMapping( encod SELECT, usb.KEY_TAB) ' Select
usb.setEmuPadMapping( encod BTN_B, usb.KEY_INSERT)' Button B
usb.setEmuPadMapping( encod BTN_A, usb.KEY_DELETE)' Button A
If you hold ESC down for a couple of seconds on the keyboard, it will reboot Gigatron.
For the serial terminal interface, the ANSI cursor keys are mapped to Gigatron Up/Down/Left/Right buttons as well, and the escape key, if pressed 3 times in succession, reboots Gigatron. You can't hold it down like on the regular keyboard as it interferes with the ANSI escape sequences. The auto-repeat time is not ideal for the games when using the serial port console but does sort of work (badly). It does work okay for entering text into the BASIC interpreters. The tilde "`" is the button A character in the console. Didn't bother with button B yet, nothing uses it AFAIK. The software expects carriage return but translates it into linefeed for Gigatron. Backspace is converted to delete as well.
Pressing TAB cycles through the displayed scan lines for both the terminal console and the real keyboard interface, and increases/decreases the available processor time per frame.
Enjoy the retro and rather lame minimalism of this tiny TTL processor, if you can.
Okay, that makes the keyboard mappings different from the Gigatron. I thought maybe F10 did the cycling and the keys above the arrow keys do what most of the game buttons other than what the direction buttons do. There is an online Gigatron emulator somewhere, and I think the keys on it are what the real one uses.
@PurpleGirl said:
So you might want to try one of the DevROM versions. I think those start with X. They might add more vCPU opcodes and syscalls too, but I'm not sure.
@PurpleGirl said:
Okay, that makes the keyboard mappings different from the Gigatron. I thought maybe F10 did the cycling and the keys above the arrow keys do what most of the game buttons other than what the direction buttons do. There is an online Gigatron emulator somewhere, and I think the keys on it are what the real one uses.
Yeah, for now I don't care too much about that for initial tests but it can obviously be changed. I know there are still issues with the auto-repeat that could be fixed too. Ideally an ISR could be run in the SPIN2 COG to maintain full serial input rate to avoid needing yet another COG to do serial input, but this is not possible AFAIK, however perhaps some serial polling and local buffering could be done during the busy wait times to avoid that issue.
What makes sense to me would be to extend the SPIN2 code I have started and make it work 100% the same as a Babelfish AVR image so the P2 serial port could be hooked up to a PC and either a USB or direct Famicon style gamepad could be used. If someone writes a PS/2 keyboard COG we could use that too, instead of USB.
@rogloh said:
Yeah, for now I don't care too much about that for initial tests but it can obviously be changed. I know there are still issues with the auto-repeat that could be fixed too. Ideally an ISR could be run in the SPIN2 COG to maintain full serial input rate to avoid needing yet another COG to do serial input, but this is not possible AFAIK, however perhaps some serial polling and local buffering could be done during the busy wait times to avoid that issue.
What makes sense to me would be to extend the SPIN2 code I have started and make it work 100% the same as a Babelfish AVR image so the P2 serial port could be hooked up to a PC and either a USB or direct Famicon style gamepad could be used. If someone writes a PS/2 keyboard COG we could use that too, instead of USB.
Well, if you knew what address the keyboard handler stores to, an interrupt or DMA approach would work, since you'd stuff it in there and let the handler in ROM take it from there, if it even works that way. No interrupts are used on the Gigatron side, since the syncs are what function as interrupts and cause it to poll the In port. And you wouldn't even need to emulate the shift register on the Gigatron side, just a regular register.
Do you have an intention to get a P2 and start messing about with this stuff yourself, PurpleGirl? Now that I've done the processor emulator part I'm sort of satisfied. That was the interesting aspect, for me anyway: trying to get it to fit in the clocks available.
I've added USB gamepad support for the Gigatron emulator. This should work with any gamepads already supported by NeoYume (not tested) plus a couple I added myself for these types of devices below which I do have here and did test out:
You can plug a USB keyboard into one USB port and a USB gamepad into the other. It should be hot pluggable now.
The serial interface also still works. Any device will feed input into the Gigatron and you can have any of them connected. I didn't try it with two USB keyboards at the same time or two USB gamepads though, so if you do that, who knows how it will respond.
Really the only other thing would be to add back the support for a direct game controller input (Famicon 3-pin CLOCK/LATCH/DATA interface), and maybe a PS/2 keyboard (although with a USB keyboard now, why bother?).
@rogloh said:
Do you have an intention to get a P2 and start messing about with this stuff yourself, PurpleGirl? Now that I've done the processor emulator part I'm sort of satisfied. That was the interesting aspect, for me anyway: trying to get it to fit in the clocks available.
I am unsure, TBH, but I am interested. I still would like to see an I/O coprocessor, and your code would be nice to study if I really want to build that. I think most things could be done in RAM. To effectively make it work, one could do a respin of the original machine. I don't know if the P2 I/O could go faster than 6.25 MHz. If it could and the I/O is split out across multiple cogs, then it would be nice if the base machine could be modded to 12.5 MHz or faster.
I know an ALU mod that could be done to the base system that nobody has tried yet that adds a few more chips. A bottleneck there is that the adders in the ALU are 4-bit "fast adders." They never made an 8-bit version. So while the 4-bit adders are fast adders internally, they use ripple carries between the chips. So that serializes the operation. I mean, while both adders work at the same time, the result won't be stable until twice the latency of the first one. But if you use 3 adders, you can calculate the low nibble and both possibilities of the upper nibble all in parallel. One upper adder's carry-in signal is tied to "Gnd" and the other to "Vcc." As for the lower nibble's carry-out line, that can drive a mux and choose which upper nibble is used. Yes, the multiplexer adds latency, but considerably less than an adder. Of course, there is another option if one can find it. That is to use a 16-bit adder and ignore half of the lines. I don't know if the latency would be better. But with that and other board mods, 15 MHz could be possible.
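The carry-select trick described here (three 4-bit adders plus a mux) can be modelled in a few lines to show why it works. This is just an illustrative software model of the logic, not a netlist:

```python
# Model of a carry-select 8-bit adder: compute the low nibble, compute both
# possible high nibbles in parallel, then use the low carry-out to pick one
# with a mux instead of waiting for the carry to ripple through.
def add4(a, b, cin):
    s = a + b + cin
    return s & 0xF, s >> 4            # (4-bit sum, carry-out)

def carry_select_add8(a, b):
    lo, c = add4(a & 0xF, b & 0xF, 0)
    hi0, _ = add4(a >> 4, b >> 4, 0)  # upper adder with carry-in tied to Gnd
    hi1, _ = add4(a >> 4, b >> 4, 1)  # upper adder with carry-in tied to Vcc
    hi = hi1 if c else hi0            # the mux, driven by the low carry-out
    return (hi << 4) | lo

print(hex(carry_select_add8(0x0F, 0x01)))  # 0x10 - carry selects the hi1 result
print(hex(carry_select_add8(0xFF, 0xFF)))  # 0xfe - 8-bit wrap, like the hardware
```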
Part of me would love to learn how to mod the control unit some. A bottleneck in the vCPU interpreter is the absence of a carry flag. That makes condition testing a little more involved. The vCPU is 16-bit, so you really need a carry flag; without one, some additional work is necessary. That is why the 6502 emulation is so slow. While superficially things appear to run at the same speed as an Apple, that illusion is because the frame rate is the same. The 6502 emulation is about 1/3 the original speed. Another reason for this discrepancy, in addition to that and the way the video is produced, is the lack of shift instructions. Then again, I think the original 6502 made those into illegal instructions, but the Gigatron adds those to the 6502 emulator. The native code has a single-place left shift, but the right shift is missing altogether and is provided through a trampoline routine in the ROM. So shifts and a carry flag could increase native code efficiency.
The above is why part of me wants to do the 4-stage 75 MHz similar machine that I came up with. With the LUT ROMs and shadow RAMs, I could add such instructions and more. I'm unsure how to do the video on something that runs that fast. I mean, on one hand, bit-banging would work just fine with far more bandwidth. So if the vCPU interpreter could be rewritten to work in the extra 11 cycles per pixel, you'd still have a 6.25 MHz I/O clock. That should work since there is a register there. So the OUT register should hold things for 12 cycles. Only the code feeding it would need to be modified to update it less.
But really, since I am a sucker for speed, I'd like to see an I/O controller there. Snooping could still be used, but I don't think the P2 could help in that case. That would be more of a job for an FPGA with deep pipelining, multiple snoopers, caches, etc. But I wouldn't know what to do about coordinating a parallel-running coprocessor.
I guess if going that far, and a new ROM would be absolutely necessary, the ports could be repurposed to help feed the I/O controller. With another bus feeding the coprocessor, interrupts could be emulated. So if you can pass syncs and keyboard usage back to the native ROM, spinlock code could be added if one must slow processing to match the video. I'd imagine that such an arrangement may be needed to prevent dropped frames. Spinlocks are not really an option now since they are non-predictable even if they are still deterministic. But if you split out the I/O, spinlocks could be used. And that could simulate DMA, since it is a Harvard design and you can put the ROM in a loop and do what you want with the RAM. So it is a programmatic halt. Generally, DMA is negotiated on the device side; however, in this case, due to the design, the device could ask for DMA access and the ROM would cooperate, or the ROM could expect it from the device after an operation where it was solicited.
In this sense, that would make repurposing the ports more useful. So the status could be polled every so many native instructions and possibly also by any native call or vCPU opcode that takes advantage of this.
But then again, as an intermediate step, I could also get a P2 and try the other idea I had in mind. The idea there would be to just do vCPU and not emulate twice, so the majority of the ROM would be native code and the rest would be the ROM-included program storage. There would likely need to be a supervisory cog so that the job of the native Gigatron code that isn't done could just be native P2 code. Then I could adapt your helper cog.
I could, if I wanted behavior closer to the original, add a halt line (or "register") to the vCPU interpreter. That gives an option to use the H-Sync to gate the halt. I likely wouldn't strictly apply that, meaning that 1 vCPU instruction overlapping the active sync time would be allowable. That would make vCPU emulation more efficient in that a cycle count mechanism and an instruction retry mechanism would not be needed, since that would be a "hardware" thing. Obviously, bit-banging would require that, but if you have a controller, you won't need to keep track of time slices. The Gigatron allows so many cycles per vCPU instruction, and if an instruction starts and is interrupted due to running out of time, it restarts fresh in the next time slice. So it wouldn't matter if the P2 vCPU emulator and the halt mechanism I've proposed weren't fully strict. Perfection isn't what matters here, just having a governor of some sort to make things closer to the original when necessary and for preventing frame buffer races. So I wouldn't care if it ran for 41-45 of the 6.25 MHz cycles on occasion instead of 40. And of course, I could have a mode with selective vCPU gating where only I/O region transfers are limited to running during the porches, to give some missed-frame protection while letting other code run wide open. I'd have to play around with that to see how necessary any of that is.
In that case, I could dispense with the command protocol if I wanted to and just have private registers for inter-process communication. Plus this might be a friendlier design for SPI RAM since I could design in the ability to have wait states. If one wants to negotiate limited pins with better throughput, one could use QSPI RAM instead if it exists. Thus nibble I/O traffic would be faster than 2 unidirectional lines. That probably would be negligible in difference from the hub in terms of delay, though the delay would be more predictable.
Wow, okay. Lots of ideas for you to try out. I think it's probably best to get yourself a P2 and simply start experimenting.
By the way - I just saved another instruction by moving over to use LUT RAM for my VGA colour map table in the IO helper COG, so I could put back the code to read the game controller input pin and logically AND it with other data inputs to be shifted synchronously in the vertical sync interval.
The IO helper COG essentially does these three things now and is fully utilized and locked to the Gigatron COG's instruction timing, there's no time for anything else to happen:
captures the 4 LED and 4 bit audio extended output and sends to IO pins and a DAC pin, respectively.
drives the analog VGA output as a mirror of the EGA parallel RGB data and generates its sync signals
captures game controller input and mixes with input data from other COG(s) via a repository which gets read at the right time to send during the Vsync interval
So basically right now I have 4 control input sources running in parallel to control that Tetris game and they can all work together, it's pretty freaky:
USB keyboard's arrow keys attached to P2 USB port 0
USB game pad attached to P2 USB port 1
a serial port console, arrow keys sent from my MacBook Pro
Famicom/SNES style game pad directly attached to VSYNC, HSYNC and GAME input pins on the P2 (operating at 3.3V)
The next update I send out will have the game controller feature put back in.
Comments
In reply to last sentence, yes. Maximum time to fill FIFO from start of RDFAST is 17+18-5 = 30 cycles therefore FIFO is guaranteed to be filled 28 cycles after 2-cycle no-wait RDFAST, assuming no longs consumed by RFxxxx during filling as I mentioned earlier. A random hub read/write needs to be at least 28-3 = 25 cycles after no-wait RDFAST to avoid any delay.
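The arithmetic above can be sanity-checked with a tiny script. The constants are the ones quoted in this thread (worst-case blocking RDFAST, fill time, no-wait RDFAST execution, pre-access time), not independently derived:

```python
# Cycle constants as quoted in the discussion above (P2 FIFO timing).
WORST_CASE_RDFAST = 17    # worst-case blocking RDFAST duration
FILL_REMAINDER = 18 - 5   # remaining fill time, minus the 5 post-hubRAM pipeline stages
NO_WAIT_RDFAST = 2        # execution time of a non-blocking (no-wait) RDFAST
PRE_ACCESS = 3            # pre-read/pre-write time at the start of a random hub access

fill_from_start = WORST_CASE_RDFAST + FILL_REMAINDER   # max fill time from start of RDFAST
after_rdfast = fill_from_start - NO_WAIT_RDFAST        # cycles after the 2-cycle RDFAST ends
min_gap = after_rdfast - PRE_ACCESS                    # minimum spacing to a random read/write

print(fill_from_start, after_rdfast, min_gap)  # 30 28 25
```

This is only the bookkeeping of the numbers stated above; it assumes no longs are consumed by RFxxxx during filling, as noted.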
24 clocks in third row should be 25+, to be 100% sure of timing.
Right, yeah, can take off the RDLONG's 2 for execute and 1 for setup. A WRLONG would need 26 ticks after the RDFAST since it doesn't appear to have the 1 setup.
The 3 cycles deducted from 28 to give 25 is the pre-read/pre-write time at the start of random read/write instruction and nothing to do with what happens during the RDFAST instruction.
Yep. But it's only 2 for WRLONG - the execution time. The other 1 is the write cycle of one longword. Or more accurately, 1..8 for hub slot alignment and write. ie: 3..10 ticks total for an aligned WRLONG.
The actual hub write occurs after random write instruction has finished. A fixed pre-write time of 3 cycles and variable time of 0-7 agree with my test results, whereas pre-write of 2 and variable of 1-8 do not. The easiest way to check this is to write to slice X immediately after read from slice X.
EDIT:
Also, from testing, must have at least 10 cycles after RDFAST with wait before both random read and write, which shows pre-read and pre-write are effectively the same, namely 3 cycles.
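The 3..10 tick WRLONG figure discussed above can be modeled as a fixed 3-cycle pre-write plus a 0..7 cycle wait for the target slice in the 8-slice rotation. This is a simplified sketch of that accounting only, not a timing spec:

```python
def wrlong_ticks(current_slice: int, target_slice: int) -> int:
    """Model of the aligned-WRLONG cost discussed above: a fixed
    3-cycle pre-write plus 0..7 cycles waiting for the target hub
    slice (address bits [4:2]) to come around in the 8-slice rotation.
    Illustrative only; egg-beater details are simplified."""
    PRE_WRITE = 3
    slot_wait = (target_slice - current_slice) % 8   # 0..7 cycles
    return PRE_WRITE + slot_wait

# Every possible alignment lands in the 3..10 tick range quoted above.
ticks = [wrlong_ticks(0, t) for t in range(8)]
print(min(ticks), max(ticks))  # 3 10
```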
Okay, that just means you are including the read and write cycles in with the pre-read/write, with the dividing line at post-read/write.
This can be demonstrated via a block read/write using SETQ prefix.
Write to slice X takes 10 cycles after read from slice X and the next slice after the write is X. (I used to think this write takes 3 cycles but that is wrong.)
That long-burst testing we did together says otherwise. If there were a large turnaround, it would've been a huge factor in the stats. What you'll be dealing with here is just the regular slot-aligning delays.
Well I do now! Fixed.
Weirdly trying this again without the ASMCLK, but starting the code from a SPIN2 method, doesn't work. I thought the PLL should at least be set in that case with flex. What stupidity of mine is causing this problem? Am using 5.9.12 beta.
Yeah I think I can muck a little with the branch code sequence below. I can probably move some of the accumulator test instructions and part of the branch calculations before the RDBYTE and push the RDFAST down a little more into the code as it has more headroom. This should buy a few extra clocks. If I was to test this for a failure I guess I should do an ALU operation followed by a branch that is using a memory lookup result for the destination. I can probably simulate that as separate test with the same instruction spacing and see if the RDBYTE is ever delayed from that position after a prior RDFAST with no waiting, with all the different hub cycle offsets tried. Then we'll know for sure.
Maybe I don't have to mess with the branch timing. I can't seem to delay the RDBYTE using the simulated FIFO access test here. It always just takes from 9-16 cycles in this test program, so hopefully that means we are okay as is, unless this test is somehow not representative enough.
Roger,
Those RFBYTE's won't trigger any extra FIFO fetching from hubRAM. The initial prefetching means you have to consume 6 longwords from the FIFO before it'll go fetch any more from hubRAM.
And with the measured RDBYTE sitting at 28 ticks from the RDFAST, which I presume is intentional, it's enough to clear the initial prefetching.
PS: And generally the FIFO will fetch another 6 consecutive longwords at a time from hubRAM. The exception being when consumption is faster than 1-in-6, or thereabouts. Give or take some for the slot delays.
Yeah, and they even cheat by doing half and mirroring it, I think.
Yeah, Select or the equivalent key on a keyboard should cycle between the four video "modes" in anything you are doing. It can skip lines to render faster.
Good news then @evanh. It seems that we shouldn't have a way to deplete the FIFO this way causing RDBYTE slowness. There are plenty of extra cycle times to keep the FIFO topped up too whenever the branching is not occurring in a given instruction. So I'm not sure there is going to be a way for the FIFO to actually affect the emulator performance, which is great if true. Can you see any other potential issue with the code's current timing?
I'm trying to get the host serial keyboard interface working in my code but am having timing issues making it fit into the single IO COG. I thought a RDPIN x, #PIN wc would set C if the Smartpin has received a character, but it looks like you have to test it first with TESTP #PIN wc, thus needing an additional instruction now. Also I think they translate keyboard arrows into buttons in Babelfish, so I'm not sure it will recognize any regular characters to move the arrows and perform program selection like the gamepad does in Gigatron itself. I know that Gigatron must at least recognize some characters in the BASIC programs, but without button control translation I don't think I can actually launch that first. What I might do is accept both game controller input and also some extra serial host input from a third COG which itself outputs onto a pin like the game controller does (under the same clock and latch/sync timing). The IO COG can then monitor both the input source data pins and accept either whenever they are active, with one of them perhaps taking priority if both pins are active at the same time - actually that could be tricky, maybe I'll just have to AND them together. This third COG could eventually become a PS/2 or USB keyboard COG, or whatever emulates Babelfish more completely.
The FIFO is built to handle the streamer doing 32-bit words at sysclock/1 without hiccup. There is no chance of ever depleting it. I thought the question was how much it stalls single data read/writes on collisions ... and how close the first RFxxx can be after non-blocking RDFAST.
Yes. I'm not seeing evidence of collisions with the spacing in this code and nor am I seeing problems with the gap I have between RDFAST (nowait) and the RET to XBYTE. So I think we can say we're pretty good now, and this emulator should run 100% fine.
Interestingly, the P2 still needed 52 clocks (~23 native instructions + XBYTE) to emulate a very minimal processor such as this one (no flags, no stack). A 325MHz P2 = 6.25MHz Gigatron. This probably means we're not going to get too much faster at emulating very many processors using XBYTE with 100% cycle accuracy if they have an instruction set that requires HUB RAM accesses and if they also do branching, so this 6.25MHz emulation speed is probably getting close to the limit of the P2 (though you could run the P2 faster at ~350MHz on a good day or higher with more cooling).
EDIT: actually that's probably not quite true if you can keep your RDFAST branching handled in one place where there are no other reads or writes to compete. Then you could increase the instruction rate. We needed to pipeline the branches so that made it tricky because branching could happen in any instruction handler. Also if your branch is allowed to take more than one cycle while simpler ones only take one, you could make the other non-branching instructions run faster as well, assuming they are simple enough to execute.
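The clock budget above works out exactly: at 52 P2 clocks per emulated Gigatron instruction, a 325 MHz P2 hits the 6.25 MHz target.

```python
# Figures quoted above: ~23 native instructions plus XBYTE per emulated instruction.
p2_hz = 325_000_000
clocks_per_emulated_insn = 52

gigatron_hz = p2_hz / clocks_per_emulated_insn
print(gigatron_hz)  # 6250000.0, i.e. the 6.25 MHz Gigatron clock
```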
Oh, by deplete, you were meaning hitting the FIFO's low-water mark maybe? The low-water triggers at 13 longwords remaining. The FIFO doesn't refill until down to this level. And then it pulls in 6 longwords in a quick single-rotation burst.
So on the rare occasion of collision with single data read/write, that single will be stalled for +8 ticks to its next available slot.
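The refill rule above (low-water at 13 longwords, 6-longword burst) can be put into a toy model. The full level of 19 is an inference from the earlier point that 6 longwords must be consumed before the FIFO fetches more; this is an illustration of the described rule, not a hardware spec:

```python
class FifoModel:
    """Toy model of the refill behaviour described above: the FIFO
    refills only once drained to the 13-longword low-water mark, then
    pulls in 6 longwords in one burst. Illustrative only."""
    FULL = 13 + 6       # inferred: 6 longwords consumable before any refetch
    LOW_WATER = 13
    BURST = 6

    def __init__(self):
        self.level = self.FULL

    def rflong(self):
        self.level -= 1                  # consume one longword
        if self.level <= self.LOW_WATER:
            self.level += self.BURST     # burst refill from hub RAM

f = FifoModel()
for _ in range(6):
    f.rflong()
print(f.level)  # 19: the 6th read hit low water and triggered a burst
```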
I managed to free a couple of longs and got the serial interface working with the game controller in parallel in the IO COG to send ASCII text I type into Gigatron. It seems to be working. Unfortunately some applications require a LF instead of CR, so for that you need to use a different terminal emulator instead of loadp2's simple console. However I did manage to get the included MSBASIC responding directly from loadp2 with CR only.
This updated code will combine game controller input in parallel with serial input to feed the input buffer (using bitwise AND so don't press both at the same instant in time). Only two COGs are used. I also removed the XOUT repository by merging the OUTPUT and extended OUTPUT ports into the same 32 bit repository, and now have two of the LEDs pins share with these repositories to save more pins.
Right now it doesn't do any keyboard-to-button translation in the COG (it just sends raw data), so to send a game controller button without having one to press you would need to send characters with bit(s) cleared when you want some controller button to be activated (they are active low). The following character values are treated as single button presses. You might be able to hold down ALT on a Windows keyboard and type the decimal character number to effectively press the button. I think PC keyboards used to be able to do that; whether or not that still works these days, not sure.
01111111 = 7f = 127 BUTTON A (also delete key)
10111111 = bf = 191 BUTTON B
11011111 = df = 223 SELECT
11101111 = ef = 239 START
11110111 = f7 = 247 UP
11111011 = fb = 251 DOWN
11111101 = fd = 253 LEFT
11111110 = fe = 254 RIGHT
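The table above encodes one active-low bit per button, which is why combining simultaneous presses is just a bitwise AND (as the IO COG does with its input sources). A quick check:

```python
# Active-low button codes from the table above.
BUTTONS = {
    "A":      0x7F,   # also delete key
    "B":      0xBF,
    "SELECT": 0xDF,
    "START":  0xEF,
    "UP":     0xF7,
    "DOWN":   0xFB,
    "LEFT":   0xFD,
    "RIGHT":  0xFE,
}

# Each code clears exactly one bit of $FF.
for name, code in BUTTONS.items():
    assert bin(0xFF ^ code).count("1") == 1, name

# Two buttons at once: AND the codes, both bits end up cleared.
print(hex(BUTTONS["UP"] & BUTTONS["LEFT"]))  # 0xf5
```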
Here's the latest beta (v0.6). Anyone with a P2-EVAL or JonnyMac board and the A/V adapter should now be able to get it to work even without a game controller if they use these character values as button presses (and when using the v5a ROM downloaded from github). https://github.com/kervinck/gigatron-rom
Most older CPUs probably don't do branching in a single cycle. It might be possible to do a GETCT during horizontal blanking to check total cycles/line.
Worst-case RDFAST of 17 cycles occurs when slice difference between previous read and RDFAST = +1 where slice = hub address[4:2], e.g. RDBYTE from $00000 then RDFAST from $00004. Best-case RDFAST of 10 cycles when RDFAST - RDxxxx slice diff = +2. (Also, best-case / worst-case random RDxxxx of 9 / 16 cycles when RDxxxx - RDxxxx slice diff = +1 / 0.)
I have not seen partial FIFO filling cause a delay to random reads or writes, yet. In my tests, I wait 64 cycles using WAITX after RDFAST to ensure FIFO is full and filling over, then do four or five or six RFLONGs, then one or two RDBYTE or WRBYTE. I choose slices so all hub accesses should be best-case and they are, i.e. no delays.
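The slice rule above can be written out directly. The cycle counts are the ones quoted in the post, not derived here:

```python
def slice_of(addr: int) -> int:
    """Hub slice = address bits [4:2], i.e. which of the 8 egg-beater
    slices holds this longword."""
    return (addr >> 2) & 7

# The worked example above: RDBYTE from $00000 then RDFAST from $00004
# gives a slice difference of +1, the worst case.
print(slice_of(0x00000), slice_of(0x00004))  # 0 1

# Quoted RDFAST cycle counts keyed by slice difference vs previous RDxxxx:
RDFAST_CYCLES = {1: 17, 2: 10}   # +1 = worst case, +2 = best case
assert RDFAST_CYCLES[slice_of(0x00004) - slice_of(0x00000)] == 17
```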
According to this thread, that could be a bug in the V5A ROM:
https://forum.gigatron.io/viewtopic.php?p=3383#p3383
If you actually reserved this much hub RAM, that could be accurate, but if you only have 64K allocated, then you could be seeing this bug.
The memory test in the ROM truly is dynamic. So if you use the A15 breakout on the board and do a socket mod to use a larger chip, it will report 64K on boot.
So you might want to try one of the DevROM versions. I think those start with X. They might add more vCPU opcodes and syscalls too, but I'm not sure. The syscall thing is interesting. While I think that has more overhead than the opcodes, it provides a way to poke a hole in the vCPU interpreter and call native routines. So that's likely better for longer tasks.
You'll need to build the latest from https://github.com/lb3361/gigatron-rom/blob/exp/Core/dev.asm.py if you're trying to support SPI expander SD Browser. DEVROM is probably closest to the goals that Marcel listed originally for ROMv5 before he passed.
I've now added support for USB keyboards via the USB breakout board, as well as controlling directly via loadp2 console (or other terminal consoles). For now a real game controller is no longer implemented but one could potentially be added via direct IO or via the USB interface with a USB game pad in a future release with support for that.
The following shows the mapping for the USB keyboard keys to game pad buttons.
If you hold ESC down for a couple of seconds on the keyboard, it will reboot Gigatron.
For the serial terminal interface, the ANSI cursor keys are mapped to Gigatron Up/Down/Left/Right buttons as well and the escape key if pressed 3 times in succession reboots Gigatron. You can't hold it down like on the regular keyboard as it interferes with the ANSI escape sequences. The auto-repeat time is not ideal for the games when using the serial port console but does sort of work (badly). It does work okay for entering text into the BASIC interpreters. The tilde "`" is the button A character in the console. Didn't bother with button B yet, nothing uses it AFAIK. The software expects carriage return but translates it into linefeed for Gigatron. Backspace is converted to delete as well.
Pressing TAB cycles through the displayed scan lines for both the terminal console and the real keyboard interface, and increases/decreases the available processor time per frame.
Enjoy the retro and rather lame minimalism of this tiny TTL processor, if you can.
As usual the needed ROM v5a is maintained here..
https://github.com/kervinck/gigatron-rom
Okay, that makes the keyboard mappings different from the Gigatron. I thought maybe F10 did the cycling and the keys above the arrow keys handled most of the game buttons other than the directions. There is an online Gigatron emulator somewhere, and I think the keys on it are what the real one uses.
Thank you for popping in here! I saw your comments in the thread on the official site.
Yeah, for now I don't care too much about that for initial tests but it can obviously be changed. I know there are still issues with the auto-repeat that could be fixed too. Ideally an ISR could be run in the SPIN2 COG to maintain full serial input rate to avoid needing yet another COG to do serial input, but this is not possible AFAIK, however perhaps some serial polling and local buffering could be done during the busy wait times to avoid that issue.
What makes sense to me would be to extend the SPIN2 code I have started and make it work 100% the same as a Babelfish AVR image so the P2 serial port could be hooked up to a PC and either a USB or direct Famicon style gamepad could be used. If someone writes a PS/2 keyboard COG we could use that too, instead of USB.
Well, if you knew what address the keyboard handler stores to, an interrupt or DMA approach would work. You'd stuff it in there and let the handler in ROM take it from there, if it even works that way. No interrupts are used on the Gigatron side since the syncs are what function as interrupts and cause it to poll the In port. And you wouldn't even need to emulate the shift register on the Gigatron side, just a regular register.
Do you have an intention to get a P2 and start messing about with this stuff yourself PurpleGirl? Now I've done the processor emulator part I'm sort of satisfied. That was the interesting aspect, for me anyway. Trying to get it to fit in the clocks available.
I've added USB gamepad support for the Gigatron emulator. This should work with any gamepads already supported by NeoYume (not tested) plus a couple I added myself for these types of devices below which I do have here and did test out:
You can plug a USB keyboard into one USB port and a USB gamepad into the other. It should be hot pluggable now.
The serial interface also still works. Any device will feed input into the Gigatron and you can have any of them connected. Didn't try it with two USB keyboards at the same time or two USB gamepads though, so if you do that, who knows how it will respond.
Really the only other thing would be to add back the support for a direct game controller input (Famicom 3-pin CLOCK/LATCH/DATA interface), and maybe PS/2 keyboard (although with a USB keyboard now, why bother).
I am unsure, TBH, but I am interested. I still would like to see an I/O coprocessor, and your code would be nice to study if I really want to build that. I think most things could be done in RAM. To effectively make it work, one could do a respin of the original machine. I don't know if the P2 I/O could go faster than 6.25 MHz. If it could and the I/O is split out across multiple cogs, then it would be nice if the base machine could be modded to 12.5 or faster.
I know an ALU mod that could be done to the base system that nobody has tried yet, though it adds a few more chips. A bottleneck there is that the adders in the ALU are 4-bit "fast adders"; they never made an 8-bit version. So while the 4-bit adders are fast internally, the carry between them ripples, and that serializes the operation. I mean, while both adders work at the same time, the result won't be stable until twice the latency of the first one. But if you use 3 adders, you can calculate the low nibble and both possibilities of the upper nibble all in parallel. One upper adder's carry-in signal is tied to Gnd and the other to Vcc. The lower nibble's carry-out line can then drive a mux and choose which upper nibble is used. Yes, the multiplexer adds latency, but considerably less than an adder. Of course, there is another option if one can find it: use a 16-bit adder and ignore half of the lines. I don't know if the latency would be better. But with that and other board mods, 15 MHz could be possible.
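The carry-select trick described above is easy to model in software: compute the low nibble once and both upper-nibble candidates in parallel, then let the low carry-out pick one. Chip counts and gate delays aside, this is the logic:

```python
def add4(a: int, b: int, cin: int):
    """One 4-bit adder: returns (sum & 0xF, carry_out)."""
    s = a + b + cin
    return s & 0xF, s >> 4

def carry_select_add8(a: int, b: int):
    """8-bit add from three 4-bit adders, as described above: the low
    nibble once, the upper nibble twice (carry-in tied to Gnd and to
    Vcc), with the low carry-out driving the selecting mux."""
    lo, c_lo = add4(a & 0xF, b & 0xF, 0)
    hi0, c0 = add4(a >> 4, b >> 4, 0)    # upper adder, carry-in = Gnd
    hi1, c1 = add4(a >> 4, b >> 4, 1)    # upper adder, carry-in = Vcc
    hi, cout = (hi1, c1) if c_lo else (hi0, c0)   # mux on low carry-out
    return (hi << 4) | lo, cout

# Matches a plain 8-bit add for every operand pair.
assert all(carry_select_add8(a, b) == ((a + b) & 0xFF, (a + b) >> 8)
           for a in range(256) for b in range(256))
print(carry_select_add8(0x9C, 0x77))  # (19, 1) i.e. 0x13 with carry out
```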
Part of me would love to learn how to mod the control unit some. A bottleneck in the vCPU interpreter is the absence of a carry flag, which makes condition testing a little more involved. The vCPU is 16-bit, so you really need a carry flag; without one, some additional work is necessary. That is why the 6502 emulation is so slow. While superficially things appear to run at the same speed as an Apple, that illusion is because the frame rate is the same; the 6502 emulation is about 1/3 the original speed. Another reason for this discrepancy, in addition to that and the way the video is produced, is the lack of shift instructions. Then again, I think the original 6502 made those into illegal instructions, but the Gigatron adds those to the 6502 emulator. The native code has a single-place left shift, but the right shift is missing altogether and is provided through a trampoline routine in the ROM. So shifts and a carry flag could increase native code efficiency.
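The missing-carry problem above is the classic one: a 16-bit add built from byte adds has to recover the carry by comparison instead of reading a flag. A sketch of the kind of workaround an interpreter ends up doing (the helper name is hypothetical, not from the Gigatron ROM):

```python
def add16_no_carry_flag(x: int, y: int) -> int:
    """16-bit add from two 8-bit adds with no hardware carry flag:
    the low-byte carry is recovered by noting that an unsigned 8-bit
    add wrapped iff the result is smaller than either operand."""
    lo = ((x & 0xFF) + (y & 0xFF)) & 0xFF
    carry = 1 if lo < (x & 0xFF) else 0          # wrapped => carry out of bit 7
    hi = ((x >> 8) + (y >> 8) + carry) & 0xFF
    return (hi << 8) | lo

assert add16_no_carry_flag(0x00FF, 0x0001) == 0x0100
assert add16_no_carry_flag(0xFFFF, 0x0001) == 0x0000   # wraps at 16 bits
```

The extra compare-and-branch per 16-bit add is the sort of overhead a carry flag in the native ISA would eliminate.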
The above is why part of me wants to do the 4-stage 75 MHz similar machine that I came up with. With the LUT ROMs and shadow RAMs, I could add such instructions and more. I'm unsure how to do the video on something that runs that fast. I mean, on one hand, bit-banging would work just fine with far more bandwidth. So if the vCPU interpreter could be rewritten to work in the extra 11 cycles per pixel, you'd still have a 6.25 MHz I/O clock. That should work since there is a register there. So the OUT register should hold things for 12 cycles. Only the code feeding it would need to be modified to update it less. But really, since I am a sucker for speed, I'd like to see an I/O controller there. Snooping could still be used, but I don't think the P2 could help in that case. That would be more of a job for an FPGA with deep pipelining, multiple snoopers, caches, etc. But I wouldn't know what to do about coordinating a parallel-running coprocessor. I guess if going that far, and a new ROM would be absolutely necessary, the ports could be repurposed to help feed the I/O controller. With another bus feeding the coprocessor, interrupts could be emulated. So if you can pass syncs and keyboard usage back to the native ROM, spinlock code could be added if one must slow processing to match the video. I'd imagine that such an arrangement may be needed to prevent dropped frames. Spinlocks are not really an option now since they are non-predictable even if they are still deterministic. But if you split out the I/O, spinlocks could be used. And that could simulate DMA, since it is a Harvard design and you can put the ROM in a loop and do what you want with the RAM. So it is a programmatic halt. Generally, DMA is negotiated on the device side; however, in this case, due to the design, the device could ask for DMA access and the ROM would cooperate, or the ROM could expect it from the device after an operation where it was solicited.
In this sense, that would make repurposing the ports more useful. So the status could be polled every so many native instructions and possibly also by any native call or vCPU opcode that takes advantage of this.