Could the Propeller 2 be used as an I/O controller for a Gigatron TTL computer? — Parallax Forums


Introduction

The Gigatron TTL computer has no ASIC CPU; it uses 74xx-series ICs to build a Harvard RISC CPU. All native instructions are single-cycle, and a stock machine runs at 6.25 MHz. Because it is a Harvard machine and cannot execute out of data RAM, it uses the vCPU interpreter to run user code. It is a minimalist machine in that the CPU handles everything via bit-banging. The Gigatron has no interrupts and no easy, obvious way to do DMA, though I have ideas in mind for that. The native code uses time-slices, and everything there is cycle-exact.

For video production, there are specialized instructions. The most-used ones can read RAM, perform a logical operation, send the data to the Out port, and increment the memory index, all in a single cycle. Since the VGA pixel clock is around 25 MHz, the Gigatron's 6.25 MHz clock gives 1/4 of the horizontal resolution, and to keep the aspect ratio each line is sent 4 times. The result is 160x120 pixels, or 1/16 of VGA. There are only 64 colors because the upper two bits of the port are the sync bits. The video output logic therefore ignores the upper 2 bits of each frame buffer byte, and coders can use those 2 bits as they see fit.

As an optimization, there is an indirection table for the frame buffer. That is just a list of 120 page and offset addresses. That makes it easy to do scrolling, screen flipping, line substitution, etc. It's easier to virtualize lines than to blit them. Each line gets its own page, and the remaining 96 bytes of that page can be used for either program code or additional graphics data. This makes it easy to do a Pole Position sort of game, and changing the offset in the indirection table lets you use the additional data past the lines. Since the video uses the Out++ instructions, and they only increment the X register, the lines can wrap around within their page.
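The numbers in the two paragraphs above can be checked with a few lines of arithmetic. This is only a sketch: it assumes standard 640x480 VGA as the reference mode and 256-byte pages, and `pixel_address` is a hypothetical helper for illustration, not a routine from the Gigatron ROM.

```python
# Assumptions: 640x480 VGA reference mode, 256-byte RAM pages.
VGA_WIDTH, VGA_HEIGHT = 640, 480
VGA_PIXEL_CLOCK_MHZ = 25.0
GIGATRON_CLOCK_MHZ = 6.25
LINE_REPEAT = 4      # each Gigatron line is sent 4 times
PAGE_SIZE = 256      # X is an 8-bit register, so a page holds 256 bytes
SYNC_BITS = 2        # upper two Out-port bits carry the sync signals


def video_geometry():
    """Derive resolution, color count and spare bytes per page."""
    h_div = round(VGA_PIXEL_CLOCK_MHZ / GIGATRON_CLOCK_MHZ)
    width = VGA_WIDTH // h_div
    height = VGA_HEIGHT // LINE_REPEAT
    return {
        "width": width,
        "height": height,
        "colors": 2 ** (8 - SYNC_BITS),
        "spare_bytes_per_page": PAGE_SIZE - width,
        "vga_pixels_per_gigatron_pixel": (VGA_WIDTH * VGA_HEIGHT) // (width * height),
    }


def pixel_address(page, offset, x):
    """Hypothetical helper: address of pixel x on a line whose indirection
    entry is (page, offset). Only the low byte advances, so it wraps."""
    return (page << 8) | ((offset + x) & 0xFF)


if __name__ == "__main__":
    print(video_geometry())
    print(hex(pixel_address(8, 200, 100)))
```

The wrap in `pixel_address` models the point about Out++ only incrementing X: a line that starts at offset 200 runs off the end of its page and continues from offset 0 of the same page.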

Sound is produced from the horizontal frequency. For each actual line (1/4 of a virtual line), 1 sound channel is handled. The channels are added and shifted to create the final output. Due to this mechanism, sounds above 3900 Hz are not possible. That makes sense if you think about it: the Nyquist frequency is 1/2 the horizontal sync rate, and since there are 4 channels, the maximum for each channel is no more than 1/4 of that. Now, while there are 6-bit sound samples in the sample table, the code reduces this to 4 bits and sends that to half of the X-Out register.
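As a quick check of the 3900 Hz figure, here is the same reasoning as arithmetic. The 31.25 kHz line rate is an assumption (6.25 MHz divided by an assumed 200 clocks per scan line); a different line rate scales the result proportionally.

```python
def channel_nyquist_hz(line_rate_hz=31_250.0, channels=4):
    """One channel is serviced per scan line, so each channel is updated
    at line_rate / channels; the Nyquist limit is half of that."""
    per_channel_sample_rate = line_rate_hz / channels
    return per_channel_sample_rate / 2


if __name__ == "__main__":
    print(channel_nyquist_hz())
```

With these assumed inputs the limit comes out just above 3900 Hz, which agrees with the figure quoted above.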

Blinkenlights is done using the other half of the X-Out register and is handled independently of the sound.

The game controller (and keyboard port) goes to a shift register, so the In port is a serial port. To make a keyboard work, the Arduino in the PluggyMcPlugface adapter handles that, and I assume it does ASCII conversion too. Pluggy also adds some limited storage I/O abilities, and at least a version 3 ROM is needed to handle that. The In port is polled every vertical pulse and, I assume, written to a memory location.


What I want the Propeller 2 to do and questions

I need the Propeller to snoop the SRAM bus asynchronously and shadow what is on the Gigatron into the hub RAM, at least at specific locations, though it may be easier to just shadow the first 32K. Then I would need the cogs to read from the hub RAM to support video output, sound, Blinkenlights output, keyboard/game input, and maybe an SD card.

Will I be able to send to and from the hub memory fast enough? I imagine I can. If that is a challenge, I guess I could use 2 cogs for that. Use one cog to copy from the hub to LUT memory and one to copy from the LUT to the hub.

Now, how would I produce VGA on the P2 in assembly? Does the P2 have the same VGA abilities and character table as the P1? Or would I need to manually bit-bang it from a cog?

As for transferring from the Gigatron, that would mostly be via SRAM bus snooping. For the keyboard/game input, I could probably do lazy writes to the Gigatron's RAM. Being a Harvard machine, the RAM won't be in use all the time, so writes should be able to occur during those times.

On the snooping, I am not sure what the best way to cross clock domains would be. There probably needs to be spinlocks or something so the I/O lines would be read long enough. I figure that maybe I'd need the Gigatron's clock (probably the Clock 2 line) as an input. And I'd probably need to read the /OE and /WE lines. The /WE would be the main line to pay attention to, but /OE can't also be low. For other add-on boards, both control lines being low would be used to pass commands directly from the CPU, and maybe separate hub memory could be used in that case, but it shouldn't be allowed to overwrite existing memory.
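To make the control-line rules above concrete, here is a small software model of the decision the snooper would make on each sampled bus cycle. It is only a sketch: the names `BusAction`, `classify_bus_cycle` and `snoop` are hypothetical, both control inputs are treated as active-low, and the 32K mask matches the "shadow the first 32K" idea rather than any confirmed hardware detail.

```python
from enum import Enum


class BusAction(Enum):
    IGNORE = "ignore"
    SHADOW_WRITE = "shadow_write"
    EXT_COMMAND = "ext_command"


def classify_bus_cycle(oe_n, we_n):
    """Both inputs are active-low: 0 means asserted."""
    if we_n == 0 and oe_n == 0:
        return BusAction.EXT_COMMAND    # normally invalid, used for commands
    if we_n == 0:
        return BusAction.SHADOW_WRITE   # ordinary RAM write to mirror
    return BusAction.IGNORE             # reads and idle cycles


def snoop(shadow, samples):
    """Apply sampled (oe_n, we_n, addr, data) tuples to the shadow copy.
    Extended commands are collected separately and never touch shadow RAM."""
    commands = []
    for oe_n, we_n, addr, data in samples:
        action = classify_bus_cycle(oe_n, we_n)
        if action is BusAction.SHADOW_WRITE:
            shadow[addr & 0x7FFF] = data & 0xFF
        elif action is BusAction.EXT_COMMAND:
            commands.append((addr, data))
    return commands


if __name__ == "__main__":
    ram = bytearray(32768)
    print(snoop(ram, [(1, 0, 0x1234, 0xAB), (0, 0, 0x0010, 0x55)]))
    print(hex(ram[0x1234]))
```

Keeping the command path separate from the shadow write is what prevents an extended command from overwriting existing memory.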


Comments

  • Will I be able to send to and from the hub memory fast enough? I imagine I can. If that is a challenge, I guess I could use 2 cogs for that. Use one cog to copy from the hub to LUT memory and one to copy from the LUT to the hub.

    If the memory bus runs at 6.25 MHz you can get away with pretty much anything. That's on the order of 40..50 P2 clock cycles per bus cycle, depending on where you clock at. A single hub write op is 10 cycles worst case.

    Now, how would I produce VGA on the P2 in assembly? Does the P2 have the same VGA abilities and character table as the P1? Or would I need to manually bit-bang it from a cog?

    There's no character ROM on P2, because there's enough RAM. VGA is done using the "streamer" hardware and the associated DAC channels. I think there's some simple example code for VGA, NTSC and HDMI that comes with PNut. (The values in the NTSC example are somewhat dodgy though). It is the same "hardware-assisted bitbanging" type logic as on P1, but better (automatic DMA, palette lookup, color conversion).

    On the snooping, I am not sure what the best way to cross clock domains would be. There probably needs to be spinlocks or something so the I/O lines would be read long enough. I figure that maybe I'd need the Gigatron's clock (probably the Clock 2 line) as an input. And I'd probably need to read the /OE and /WE lines. The /WE would be the main line to pay attention to, but /OE can't also be low. For other add-on boards, both control lines being low would be used to pass commands directly from the CPU, and maybe separate hub memory could be used in that case, but it shouldn't be allowed to overwrite existing memory.

    Yes, at those sorta speeds you can just use the bus clock as an input and wait on it. Alternatively (this is a bit tricky if you aren't doing a custom PCB), you could feed the P2 clock input from the bus clock and configure the PLL such that each bus cycle equals an integer number of P2 cycles.
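For a feel of the numbers in this reply, the sketch below computes P2 clocks per bus cycle and checks whether a given P2 clock is an exact multiple of the 6.25 MHz bus clock. The 250 and 320 MHz values are just assumed examples of "where you clock at", and the 10-clock hub write is the worst-case figure quoted above.

```python
from fractions import Fraction

BUS_MHZ = Fraction(25, 4)        # 6.25 MHz Gigatron bus clock
HUB_WRITE_WORST_CASE = 10        # P2 clocks, figure quoted in the reply


def clocks_per_bus_cycle(sysclk_mhz):
    """Exact number of P2 clocks in one Gigatron bus cycle."""
    return Fraction(sysclk_mhz) / BUS_MHZ


def is_integer_ratio(sysclk_mhz):
    """True if every bus cycle spans a whole number of P2 clocks."""
    return clocks_per_bus_cycle(sysclk_mhz).denominator == 1


def worst_case_hub_writes_per_bus_cycle(sysclk_mhz):
    """How many worst-case hub writes fit in one bus cycle."""
    return int(clocks_per_bus_cycle(sysclk_mhz) // HUB_WRITE_WORST_CASE)


if __name__ == "__main__":
    for mhz in (250, 320):
        print(mhz, float(clocks_per_bus_cycle(mhz)), is_integer_ratio(mhz),
              worst_case_hub_writes_per_bus_cycle(mhz))
```

Exact fractions are used so that the integer-ratio test is not affected by floating-point rounding.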

  • Right now, I'm just asking. I have no experience with a P2 or really even a Gigatron. My plan would likely be to spin a new one. A stock one could do at least 8 MHz, but the reason it is 6.25 is that the VGA is bit-banged and tightly coupled.

    On speeds, 6.25 MHz goes into 100 exactly 16 times. So at 200 MHz, it would be 32 times, and at 300, 48 times. But then you'd only have the pulses (half of each cycle), so you'd have time for 16 Propeller clocks @ 200, and 24 @ 300. And of course, most cog-specific operations take 2 cycles. So when you look at it that way, you see why I was asking if I'd need to copy to the LUT and then commit it to the hub in another cog.

    See, I wouldn't mind clocking the Gigatron faster. A stock one could do 8 MHz, and you could go a little faster if you change all the 74LS chips to 74F chips, change the banks of diodes to faster ones, reduce the size of the pull-up resistors on the diodes, use faster RAM, etc. Then closer to 12.5 MHz would be an option. On a respin, I might want to add 2 more chips (adder and mux) to the upper nibble of the ALU adder to optimize the ripple carry between the 2. In that case, you'd wire one upper adder's carry-in line to Ground and the other upper adder's to Vcc, then use the carry from the low nibble to switch between them. Thus all 3 adders would finish at the same time, and the multiplexer would be faster than waiting for the upper one to propagate again. I don't know how many MHz that could add. At the least, it would make the ALU more stable at higher speeds.
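The adder change described above is the arrangement usually called a carry-select adder. Here is a behavioural sketch to show that selecting between the two precomputed upper nibbles gives the same result as a plain 8-bit add. The function names are hypothetical, and this models logic only, not propagation delay.

```python
def add4(a, b, carry_in):
    """One 4-bit adder stage: returns (sum nibble, carry out)."""
    total = (a & 0xF) + (b & 0xF) + (carry_in & 1)
    return total & 0xF, total >> 4


def add8_carry_select(a, b, carry_in=0):
    """8-bit add built from three 4-bit adders and a multiplexer."""
    lo, lo_carry = add4(a, b, carry_in)
    hi0, cout0 = add4(a >> 4, b >> 4, 0)   # upper adder, carry-in tied to Ground
    hi1, cout1 = add4(a >> 4, b >> 4, 1)   # upper adder, carry-in tied to Vcc
    # The low nibble's carry drives the mux that picks one upper result.
    hi, carry_out = (hi1, cout1) if lo_carry else (hi0, cout0)
    return (hi << 4) | lo, carry_out


if __name__ == "__main__":
    print(add8_carry_select(0x0F, 0x01))
    print(add8_carry_select(0xFF, 0x01))
```

All three 4-bit adders work in parallel here, so in hardware the extra wait after the low nibble settles is one multiplexer delay rather than a second adder's ripple.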

    (I think I know how to make a redesign that could do 75 MHz, but that would require working with SMDs and using a 4-stage pipeline, and a P2 couldn't keep up with it. Going that fast, keeping the bit-banging and adding more registers to work on the vCPU between pixels might be a better option, and if one still wants to offload I/O, then I'd be looking at an FPGA for that. The strategy there might be to pipeline that and have multiple snoopers to split things into different processes and put things in their own BRAM regions and use external RAM for video RAM.)

    And yes, I'd want to feed in the Gigatron clock (the #2 clock, since that is the delayed one and closest to the memory) and use the P2 clock management features to derive the P2 clock from it.

    I don't know how compatible things would still be with the .GT1 files. I mean, if you have at least 19,200 extra cycles from removing the video bit-banging code from the Gigatron's ROM, then it is possible to have software races. Like if you overwrite the memory too fast, then you'd have missing frames. So one might need to add a compatibility mode, maybe gate the number of interpreted instructions allowed or add in loops and somehow sync them or whatever. Since the P2 portion would have its own ports, the In and Out port could be repurposed and could pass data to/from the P2 or simulate interrupts. The CPU in this case has no interrupts, but if you need to sync with external video, there could be code to read when vSync is active.

    On the video, I might want to do some sort of bit-banging for flexibility reasons. Like it would be nice to be able to do 320x240, add a text mode, etc. And yes, it would probably need to add a segment register. And keeping an eye on the memory control lines would be needed since it only needs to pay attention to writes, and it would need to know when both lines are low (normally an invalid state) for taking the extended commands. Different SRAMs handle both main control lines being low differently. Some would fail to respond, some would commingle the data on the bus with the contents and write that back, and some might just ignore /OE and write. The Gigatron has such invalid instructions, and those who designed add-on cards use that to take the memory off the bus and use the bus data as commands. So even the undefined native instructions are useful.

  • Once you get more familiar with the P2, another alternative method to do this is to have an entire Gigatron running inside the P2 itself with HUB RAM holding the Gigatron ROM and your program RAM and other instruction look up information held in LUT RAM to help decode the instructions. It's a simple enough machine by the looks of the schematic. Most of it should fit inside a single COG I would expect. Maybe even the video part too if you can synchronize it perfectly, or you could do video on a secondary COG. Read an instruction word from HUB fifo, decode control data via LUT lookup, branch to handler, execute, increment PC and return, etc.

    Basically it's a Gigatron-on-a-chip. Audio and video and controller IO would still work via P2 pins directly. You could load your programs into P2 HUB RAM via the serial port or you could type them in with that Pluggy PS/2 board. You would just need to add some series resistors to adapt to 3.3V from 5V on the input pins. Would be a fun project (though less hardware oriented).

  • pik33 Posts: 2,366

    The one and only problem here is a logic level shifter, 5V TTL <-> 3.3V P2. 6.25 MHz is slow enough for a P2 to handle.

    As for VGA, the only restriction is the RAM amount. With external RAM (P2-EC32), VGA is possible up to fullHD, 1920x1080 at 8 bpp. At 640x480 you can even output these pixels with HDMI and VGA at the same time (2 cogs needed of course).

    P2 can also filter and upsample the audio data, so while the 3900 Hz Nyquist limit will still apply, you can get rid of audible aliases in the acoustic band.

  • As Marty McFly would ask: "What the hell is a Gigatron?" :smile:

    Craig

  • TonyB_ Posts: 2,178
    edited 2022-08-10 10:10

    @rogloh said:
    Once you get more familiar with the P2, another alternative method to do this is to have an entire Gigatron running inside the P2 itself with HUB RAM holding the Gigatron ROM and your program RAM and other instruction look up information held in LUT RAM to help decode the instructions. It's a simple enough machine by the looks of the schematic. Most of it should fit inside a single COG I would expect. Maybe even the video part too if you can synchronize it perfectly, or you could do video on a secondary COG. Read an instruction word from HUB fifo, decode control data via LUT lookup, branch to handler, execute, increment PC and return, etc.

    Basically it's a Gigatron-on-a-chip.

    This is how to do it. The Gigatron instruction set is small and doesn't even include CALL and RET. A single cog should be able to emulate everything quite easily using interrupt-driven video and audio.
    https://gigatron.io/

  • rogloh Posts: 5,786
    edited 2022-08-10 10:27

    LOL, speaking of which, @TonyB_ ...
    Here's an idea for emulating a Gigatron on the P2. There's not much to it really, and the P2 would be capable of fully emulating it. This code is incomplete as it still needs the execf skip masks fully figured out but it does compile in flexspin. Perhaps this could get close to the 6.25MHz with very fast P2's if some optimizations are added. More of the ALU opcode code handling could be expanded directly if its outer code is replicated 6 times (which removes the call/ret overhead).

    Also the input and output ports are updated every cycle, while the input port could be read only when required. Ideally a waitct1 instruction could be put in to lock instruction execution to a fixed rate, then video output could reliably be bit banged on P2 pins. Perhaps others can help optimize this sample emulator's instruction timing down further. The branching case could fall through to main_loop if the code is repositioned, eliminating a jump and saving 4 clocks.

    Maybe something like 5MHz is still achievable, which reduces the video refresh rate slightly but could still work if the output is synchronous. At 325MHz you'd only have 52 P2 clocks (~26 instructions) to execute each Gigatron instruction to emulate a 6.25MHz machine, but 13 more clocks if you do it at 5MHz. The outer overhead loop takes 21 clocks, leaving 31 for instructions (@6.25MHz). Non-memory ALU instructions take around 16 clocks if further optimized; stores are 14 clocks plus the write instruction. Branches are 22 clocks if they don't involve memory. I think in general there's just about enough time for everything apart from the HUB memory cases. The problem there is going to be the variable rdbyte and wrbyte instruction timing on the P2 (and FIFO reload); the rest is very deterministic.
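The budget in the paragraph above can be laid out as a small calculation. This is only a sketch that reuses the clock counts quoted in the post; the dictionary keys are made-up labels, and the store figure deliberately excludes the hub write itself because that timing is variable.

```python
LOOP_OVERHEAD = 21                  # outer loop clocks, from the post
HANDLER_COSTS = {                   # per-handler clocks, from the post
    "alu_no_memory": 16,
    "store_before_write": 14,       # excludes the wrbyte itself
    "branch_no_memory": 22,
}


def p2_clocks_per_instruction(sysclk_mhz, gigatron_mhz):
    """P2 clocks available to emulate one Gigatron instruction."""
    return sysclk_mhz / gigatron_mhz


def slack(sysclk_mhz, gigatron_mhz):
    """Clocks left over per handler after the loop overhead is paid."""
    available = p2_clocks_per_instruction(sysclk_mhz, gigatron_mhz) - LOOP_OVERHEAD
    return {name: available - cost for name, cost in HANDLER_COSTS.items()}


if __name__ == "__main__":
    for target in (6.25, 5.0):
        print(target, p2_clocks_per_instruction(325, target), slack(325, target))
```

A negative slack value would mean that handler cannot finish inside one emulated instruction period at the chosen clock rates.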

    ' GIGATRON EMULATOR CONCEPT FOR P2
    ' (C) Roger Loh 2022
    
    CON 
            OUTP = 0        ' which group of 8 pins on the P2 is the gigatron output port
            INP  = 1        ' which group of 8 pins on the P2 is the gigatron input port
    
            BRGT = %0001 'c=0 z=0
            BRLT = %0100 'c=1 z=0
            BRNE = %0101 'c=? z=0
            BREQ = %1010 'c=? z=1
            BRGE = %0011 'c=0 z=?
            BRLE = %1110 'c=1 or z=1
    
    DAT
                                orgh
    
    gigatron
    
                                org     0
    
                                mov     pa, ##@lutram
                                sub     pa, ##@gigatron
                                add     ptrb, pa            'find start address of LUT data
                                setq2   #255                'read 256 longs to LUT
                                rdlong  0, ptrb
                                mov     rombase, ##@romdata 'setup ROM start
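                                mov     rambase, ##@ramdata 'setup RAM start
                                rdfast  #0, rombase         'start the FIFO at the first ROM word (assumes execution begins at ROM address 0)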
                                altsb   outputport, #dira   'enable output pins
                                setbyte 0-0, #$ff, #0
    
    main_loop
                                push    #main_loop          'where we will return (unless we branch)
    { ' can probably skip checking for wrap around, it's only a problem if there are bugs in the ROM
                                getptr  d                   'get FIFO pointer address
                                cmp     d, rambase wc       'RAM follows ROM
            if_nc               rdfast  #0, rombase         'wrap to start of ROM if needed
    }
    
    next_instr                  'sync the input/output ports here
                                altsb   outputport, #outa   
                                setbyte 0-0, output, #0
                                altgb   inputport, #ina     'read the input pins (INA), not the output latch
                                getbyte input, 0-0, #0
    
                                rfbyte  instr               'read the 8 bit instruction
                                rfbyte  d                   'read the 8 bit operand
                                rdlut   code, instr         'lookup the LUT tables
                                execf   code                'execute the instruction
    
    alu_ops
                                mov     addr, d             'execute if RAM address = 0d or yd
                                mov     addr, x             'execute if RAM address = 0x or yx
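                                setbyte addr, y, #1         'execute if RAM address = yd or yx
                                zerox   addr, #14           'restrict to 32kB (same as st_op)
                                add     addr, rambase       'offset in HUB RAM (same as st_op)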
    
                                rdbyte  bus, addr           'bus=[mem]   |         |       |
                                mov     bus, d              '   |      bus=d       |       |
                                mov     bus, ac             '   |        |       bus=ac    |
                                mov     bus, input          '   |        |         |     bus=input
    
                                call    #ld_op              '   LD   |    |    |    |    |
                                call    #and_op             '   |   AND   |    |    |    |
                                call    #or_op              '   |    |    OR   |    |    |
                                call    #xor_op             '   |    |    |   XOR   |    |
                                call    #add_op             '   |    |    |    |   ADD   |
                                call    #sub_op             '   |    |    |    |    |   SUB
    
                                mov     ac, alu             '  ac=alu    |      |          |
                                mov     x, alu              '    |     x=alu    |          |
                                mov     y, alu              '    |       |    y=alu        |
                                mov     output, alu         '    |       |      |       out=alu
                                incmod  x, #255             ' optionally increment x in zero page
                                ret     
    
    ld_op
            _ret_               mov     alu, bus
    
    and_op
                                mov     alu, ac
                                and     alu, bus
            _ret_               zerox   alu, #7
    
    or_op
                                mov     alu, ac
                                or      alu, bus
            _ret_               zerox   alu, #7
    
    xor_op
                                mov     alu, ac
                                xor     alu, bus
            _ret_               zerox   alu, #7
    
    add_op
                                mov     alu, ac
                                add     alu, bus
            _ret_               zerox   alu, #7
    
    sub_op
                                mov     alu, ac
                                sub     alu, bus
            _ret_               zerox   alu, #7
    
    
    st_op
                                mov     addr, d                 'execute if RAM address = 0d or yd
                                mov     addr, x                 'execute if RAM address = 0x or yx
                                setbyte addr, y, #1             'execute if RAM address = yx or yd
                                zerox   addr, #14               'restrict to 32kB
                                add     addr, rambase           'offset in HUB RAM
    
    
                                wrbyte  d, addr                 'RAM[addr]=d
                                wrbyte  ac, addr                '       RAM[addr]=ac
                                wrbyte  input, addr             '                 RAM[addr]=input
    
                                incmod  x, #255                 'optionally increment x
                                mov     x, ac                   'x=ac     |
                                mov     y, ac                   '  |     y=ac
                                ret
    
    ctrl_op                     ret                             ' reserved
    
                                                                '    >   <   <>  =  >=  <=  always never
    branch_ops                  test    ac  wz                  '    a   b   c   d   e   f
                                testb   ac, #7 wc               '    a   b   c   d   e   f
                                modz    BRGT wz                 '    a   |   |   |   |   | 
                                modz    BRLT wz                 '    |   b   |   |   |   |
                                modz    BRNE wz                 '    |   |   c   |   |   |
                                modz    BREQ wz                 '    |   |   |   d   |   |
                                modz    BRGE wz                 '    |   |   |   |   e   |
                                modz    BRLE wz                 '    |   |   |   |   |   f
            if_nz               ret                             'if z=0 no need to branch so return
                                add     d, rambase              '  bus=RAM[d]   
                                rdbyte  bus, d                  '  bus=RAM[d]
                                mov     bus, d                  '         bus=d
                                mov     bus, ac                 '               bus=ac
                                mov     bus, input              '                        bus=input
    farjmp                      setbyte bus, y, #1    
                                add     bus, bus                'ROM addresses words not bytes
                                add     bus, rombase            'offset with ROM base address
                                rdfast  #0, bus                 'setup new fifo address to read from
                                jmp     #next_instr             'no return here but that's ok
    
    ' registers
    ac                          long    0
    x                           long    0
    y                           long    0
    
    ' emulator state
    
    rombase                     long    0
    rambase                     long    0
    outputport                  long    OUTP
    inputport                   long    INP
    
    ' temporary variables
    d                           res     1
    addr                        res     1
    bus                         res     1
    instr                       res     1
    code                        res     1
    input                       res     1
    output                      res     1
    alu                         res     1
    
                                fit
    
                                orgh
    
    lutram
    
                                org     $200
    
    {
    'execf table goes here, e.g. something like
    
    load_d_ac_d_in              long        (%xxxxxxxxx << 10) + alu_ops
    load_d_ac_ram_in            long        (%xxxxxxxxx << 10) + alu_ops
    load_d_ac_ac_in             long        (%xxxxxxxxx << 10) + alu_ops
    load_d_ac_input_in          long        (%xxxxxxxxx << 10) + alu_ops
    
    load_x_ac_d_in              long        (%xxxxxxxxx << 10) + alu_ops
    load_x_ac_ram_in            long        (%xxxxxxxxx << 10) + alu_ops
    load_x_ac_ac_in             long        (%xxxxxxxxx << 10) + alu_ops
    load_x_ac_input_in          long        (%xxxxxxxxx << 10) + alu_ops
    
    load_y_ac_d_in              long        (%xxxxxxxxx << 10) + alu_ops
    load_y_ac_ram_in            long        (%xxxxxxxxx << 10) + alu_ops
    load_y_ac_ac_in             long        (%xxxxxxxxx << 10) + alu_ops
    load_y_ac_input_in          long        (%xxxxxxxxx << 10) + alu_ops
    
    load_yx_ac_d_in             long        (%xxxxxxxxx << 10) + alu_ops
    load_yx_ac_ram_in           long        (%xxxxxxxxx << 10) + alu_ops
    load_yx_ac_ac_in            long        (%xxxxxxxxx << 10) + alu_ops
    load_yx_ac_input_in         long        (%xxxxxxxxx << 10) + alu_ops
    
    and_d_ac_d_in               long        (%xxxxxxxxx << 10) + alu_ops
    and_d_ac_ram_in             long        (%xxxxxxxxx << 10) + alu_ops
    and_d_ac_ac_in              long        (%xxxxxxxxx << 10) + alu_ops
    and_d_ac_input_in           long        (%xxxxxxxxx << 10) + alu_ops
    
    and_x_ac_d_in               long        (%xxxxxxxxx << 10) + alu_ops
    and_x_ac_ram_in             long        (%xxxxxxxxx << 10) + alu_ops
    and_x_ac_ac_in              long        (%xxxxxxxxx << 10) + alu_ops
    and_x_ac_input_in           long        (%xxxxxxxxx << 10) + alu_ops
    
    and_y_ac_d_in               long        (%xxxxxxxxx << 10) + alu_ops
    and_y_ac_ram_in             long        (%xxxxxxxxx << 10) + alu_ops
    and_y_ac_ac_in              long        (%xxxxxxxxx << 10) + alu_ops
    and_y_ac_input_in           long        (%xxxxxxxxx << 10) + alu_ops
    
    and_yx_ac_d_in              long        (%xxxxxxxxx << 10) + alu_ops
    and_yx_ac_ram_in            long        (%xxxxxxxxx << 10) + alu_ops
    and_yx_ac_ac_in             long        (%xxxxxxxxx << 10) + alu_ops
    and_yx_ac_input_in          long        (%xxxxxxxxx << 10) + alu_ops
    
    
     ' continue with execf tables for all the remaining instructions (table is 256 longs)
    or_d...
    xor_d...
    add_d...
    sub_d...
    
    store's ...                 long        (%xxxxxxxxx << 10) + st_op
                                ....
    
    branches...                 long        (%xxxxxxxxx << 10) + branch_ops
                                ....
    }
    
                                orgh
    
    romdata    '    file    "romfile.bin"  ' PUT THE ROM FILE here
    ramdata         byte    0[32768]
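The six branch constants in the CON section can be sanity-checked in software. The sketch below models `test ac wz`, `testb ac, #7 wc` and `modz` and compares the outcome with a signed 8-bit comparison for every AC value. It assumes the MODZ operand is a 4-bit truth table indexed by C*2+Z, which is my reading of the P2 documentation and should be confirmed there.

```python
# Constants copied from the CON section of the listing above.
BRANCH_MASKS = {
    "BRGT": 0b0001, "BRLT": 0b0100, "BRNE": 0b0101,
    "BREQ": 0b1010, "BRGE": 0b0011, "BRLE": 0b1110,
}

# What each branch should mean for AC viewed as a signed 8-bit value.
EXPECTED = {
    "BRGT": lambda v: v > 0, "BRLT": lambda v: v < 0,
    "BRNE": lambda v: v != 0, "BREQ": lambda v: v == 0,
    "BRGE": lambda v: v >= 0, "BRLE": lambda v: v <= 0,
}


def flags(ac):
    """Model of 'test ac wz' (Z = AC is zero) and 'testb ac, #7 wc' (C = bit 7)."""
    ac &= 0xFF
    return (ac >> 7) & 1, 1 if ac == 0 else 0


def modz(mask, c, z):
    """Assumed MODZ behaviour: new Z is bit (C*2 + Z) of the 4-bit operand."""
    return (mask >> ((c << 1) | z)) & 1


def signed8(ac):
    ac &= 0xFF
    return ac - 256 if ac & 0x80 else ac


def branch_taken(name, ac):
    c, z = flags(ac)
    return modz(BRANCH_MASKS[name], c, z) == 1


def mismatches():
    """Every (branch, AC) pair where the mask disagrees with the comparison."""
    return [(name, ac) for name in BRANCH_MASKS for ac in range(256)
            if branch_taken(name, ac) != EXPECTED[name](signed8(ac))]


if __name__ == "__main__":
    print(mismatches())
```

Under that assumption the list comes back empty, so the constants agree with the comments beside them.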
    
  • rogloh Posts: 5,786
    edited 2022-08-10 10:19

    Also rather than this

                                call    #sub_op            
    ...
                                mov     ac, alu          
    ....
    sub_op
                                mov     alu, ac
                                sub     alu, bus
            _ret_               zerox   alu, #7
    

    We could simply use getbyte ac, alu, #0 instead of zerox alu, #7 and mov ac, alu, saving an instruction.

    Yes there were some extra zerox's where we don't need them too (cut/paste).

  • How about immediate video output and XBYTE?

  • rogloh Posts: 5,786
    edited 2022-08-10 10:40

    Sometimes XBYTE doesn't help. Also the Gigatron code expects to send fixed 8 bit data to a VGA port (like P1) and to output synchronous audio. Maybe the streamer can be used for video to accept some amount of timing slop for single instructions....not sure. Better if fully synchronous if possible with a waitct1 somewhere.

    Update: Using the streamer (immediate to LUT mapping perhaps so the FIFO is not interrupted, or a table in COGRAM for this colour mapping step) would at least let you use the A/V breakout for video output.

    Update2 : forgot about the extended output port for audio DAC and LEDs. This is latched on an edge of HSYNC. That would likely take a few more instructions to realize as well.

  • My understanding is that if the streamer is using an immediate data mode, then the FIFO is available for XBYTE, which speeds up instruction fetch and decoding considerably, and the _RET_ prefix at the end of each instruction adds no cycles.

  • rogloh Posts: 5,786
    edited 2022-08-10 10:42

    I would think that would be the case, but haven't tried it. Am already using the FIFO in this example anyway. Whether xbyte helps or not is TBD. You may wish to code that one up.

  • @rogloh said:
    Once you get more familiar with the P2, another alternative method to do this is to have an entire Gigatron running inside the P2 itself with HUB RAM holding the Gigatron ROM and your program RAM and other instruction look up information held in LUT RAM to help decode the instructions. It's a simple enough machine by the looks of the schematic. Most of it should fit inside a single COG I would expect. Maybe even the video part too if you can synchronize it perfectly, or you could do video on a secondary COG. Read an instruction word from HUB fifo, decode control data via LUT lookup, branch to handler, execute, increment PC and return, etc.

    Basically it's a Gigatron-on-a-chip. Audio and video and controller IO would still work via P2 pins directly. You could load your programs into P2 HUB RAM via the serial port or you could type them in with that Pluggy PS/2 board. You would just need to add some series resistors to adapt to 3.3V from 5V on the input pins. Would be a fun project (though less hardware oriented).

    I had considered that. Just cut out the Gigatron native code and just have the vCPU emulator. Some over at the Gigatron forum said that the memory map would be hard to set up in that case. I don't think it is that hard. I mean, if nothing else, if you don't have the native code and go straight to the 16-bit Von Neumann machine that the vCPU interpreter implements, then you'd need to handle much of the memory map during initialization, such as copying some data ROM space into memory.

    As for the LUTs, I'd say that isn't enough RAM for program RAM. And going this far, you wouldn't have Gigatron ROM anymore, just Propeller ROM. I mean, the Gigatron is a Harvard RISC machine, so it requires an emulator or interpreter as part of normal operation. So instead of going from native Gigatron to vCPU, it would be Propeller to vCPU, and the Gigatron's schematics would be irrelevant.

    As for I/O, one could use a cog for SPI support and handle an SD card. That cog could do the other stuff needed like FAT32 support, etc. Just as long as the vCPU interpreter has the proper hooks to this. I wouldn't bother with a Pluggy, just handle the keyboard directly. You only have 2 signals of importance there, and a keyboard should handle 3.3 volts. One could add a game board socket too. The Famicom controller that the Gigatron uses is serial. With those, you'd likely need to use a level shifter, or look for the 3.3-volt variations.

    And in that case, I'd want all the hardware to have its own cogs. While emulating the vCPU portion could be harder on a P2, you can make up for it by making more of the "hardware" have its own cogs. It would likely take leveraging as many of P2's special features as possible. For instance, I don't quite care for the Gigatron's memory map for a few reasons, including not being alignment-friendly, but with the P2, you can do wider reads from the hub without regard to alignment. So the fact that the most important variables are at odd addresses won't be a problem for the P2.

    Then the question arises, how do you do the vCPU emulation? The Gigatron does that by making use of instructions that jump into the ROM based on the contents of RAM. That is why the vCPU instruction set is so haphazard and why so many possible opcodes are skipped. The native code blindly jumps based on the vCPU opcodes and doesn't need polling or jump trees. Would the P2's bytecode executor be able to do this? That pretty much needs a jumplist or some sort of instruction lookup to run the routines for each vCPU opcode. I ran into this problem on the PC years ago, where I tried to do some simple AI, and the larger the vocabulary, the slower it got, since I used lots of polling. You don't want to spend hundreds of cycles just evaluating the opcodes.
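To illustrate the difference between polling and a jump list, here is a minimal table-driven interpreter loop. It is a sketch of the general technique only: the opcode values and handler names are made up for illustration and are not real vCPU opcodes, and it says nothing about how the P2's own mechanisms would implement the lookup.

```python
# Hypothetical opcode values, chosen only for this illustration.
OP_HALT, OP_LOAD_IMM, OP_ADD_IMM = 0x00, 0x10, 0x20


def op_load_imm(state, operand):
    state["acc"] = operand


def op_add_imm(state, operand):
    state["acc"] = (state["acc"] + operand) & 0xFFFF   # 16-bit accumulator


def op_illegal(state, operand):
    raise ValueError("undefined opcode")


def build_table():
    """256-entry jump list: one slot per possible opcode byte."""
    table = [op_illegal] * 256
    table[OP_LOAD_IMM] = op_load_imm
    table[OP_ADD_IMM] = op_add_imm
    return table


def run(program):
    """Fetch opcode/operand pairs and dispatch with a single table lookup."""
    table = build_table()
    state = {"acc": 0, "pc": 0}
    while True:
        opcode = program[state["pc"]]
        operand = program[state["pc"] + 1]
        state["pc"] += 2
        if opcode == OP_HALT:
            return state["acc"]
        table[opcode](state, operand)   # same cost whatever the opcode


if __name__ == "__main__":
    print(run(bytes([OP_LOAD_IMM, 5, OP_ADD_IMM, 7, OP_HALT, 0])))
```

The dispatch cost is one indexed lookup no matter how many opcodes are defined, which is the property that avoids the slowdown described above with a growing vocabulary.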

    The vCPU registers are just memory locations. For the most part, programs tend to use the correct opcodes to handle them rather than using "Peek" and "Poke" to access them. So, for the most part, even if they are actual hardware, they should work.

    For some aspects, a P2 would be easier to work with. For instance, since the P2 has DACs, you would be freer to play with things like sound and video resolution without worrying about changing resistor ladders. You'd handle that in code.

    Speaking of the P2, how easy is it to read a double word and then deal with the individual bytes? Are there byte and word instructions? Or do you have to do a lot of swapping, shifting/rotating, etc?

  • @pik33 said:
    The one and only problem here is a logic level shifter, 5V TTL <-> 3.3V P2. 6.25 MHz is slow enough for a P2 to handle.

    As for VGA, the only restriction is the RAM amount. With external RAM (P2-EC32), VGA is possible up to fullHD, 1920x1080 at 8 bpp. At 640x480 you can even output these pixels with HDMI and VGA at the same time (2 cogs needed of course).

    P2 can also filter and upsample the audio data, so while the Nyquist 3900 Hz will still apply, you can get rid of audible aliases in the acoustic band.

    Actually, you can get around level shifters. For the keyboard, just have a cog to replace Pluggy and wire the keyboard directly. For a game port, you might need a shifter. But for sound and video, I guess one can use the internal DACs. Now if one wants to wire in external memory, they'd either need to use shifters or 3.3V memory.

    On video, the Gigatron only does 160x120.

  • @rogloh said:
    LOL, speaking of which, @TonyB_ ...
    Here's an idea for emulating a Gigatron on the P2. There's not much to it really, and the P2 would be capable of fully emulating it. This code is incomplete as it still needs the execf skip masks fully figured out but it does compile in flexspin. Perhaps this could get close to the 6.25MHz with very fast P2's if some optimizations are added. More of the ALU opcode code handling could be expanded directly if its outer code is replicated 6 times (which removes the call/ret overhead).

    Also the input and output ports are updated every cycle, while the input port could be read only when required. Ideally a waitct1 instruction could be put in to lock instruction execution to a fixed rate, then video output could reliably be bit banged on P2 pins. Perhaps others can help optimize this sample emulator's instruction timing down further. The branching case could fall through to main_loop if the code is repositioned, eliminating a jump and saving 4 clocks.

    Maybe something like 5MHz is still achievable, which reduces video refresh rate slightly but could still work if the output is synchronous. At 325MHz you'd only have 52 P2 clocks (~26 instructions) to execute each Gigatron instruction to emulate a 6.25MHz machine, but 13 more clocks if you do it at 5MHz. The outer overhead loop takes 21 clocks, leaving 31 for instructions (@6.25MHz). Non memory ALU instructions takes around 16 clocks if further optimized, stores are 14 clocks plus the write instruction. Branches are 22 clocks if they don't involve memory. I think in general there's just about enough time for everything apart from the HUB memory cases, the problem there is going to be the variable rdbyte and wrbyte instruction timing on the P2 (and fifo reload), the rest is very deterministic.
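
    The budget arithmetic above can be checked with a few lines of Python. This is only a sketch that restates the figures quoted in the post (325 MHz P2, 21-clock outer loop); nothing here is measured.

```python
def clocks_per_instruction(p2_hz, gigatron_hz):
    """P2 clocks available for each emulated Gigatron instruction."""
    return p2_hz / gigatron_hz

P2_HZ = 325_000_000
LOOP_OVERHEAD = 21          # outer-loop clocks quoted in the post

budget_625 = clocks_per_instruction(P2_HZ, 6_250_000)   # at 6.25 MHz
budget_500 = clocks_per_instruction(P2_HZ, 5_000_000)   # at 5 MHz
left_625 = budget_625 - LOOP_OVERHEAD
left_500 = budget_500 - LOOP_OVERHEAD

print(budget_625, budget_500, left_625, left_500)
```

    With these inputs the 22-clock branch case fits inside the 31 clocks left at 6.25 MHz, leaving 9 clocks of slack for hub timing variation.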

    Actually, you would only need to emulate the vCPU instruction set, not the native Gigatron code, and split the video out into its own cog. The whole idea of my proposed project of adding an external video controller to the Gigatron was to get more performance. 100% compatibility isn't my concern, just making it faster. See, vCPU instructions write video information to RAM in a simple, raw bitmap format, so no color conversion or anything is needed. On the Gigatron, during the active lines, the native code reads the frame buffer and sends it to the port. Then during VGA porch times, the vCPU code runs. So since vCPU writes the video to memory, external hardware could just read it there (forget port usage and native cycles for bit-banging). So that means you'd have 4-5 times the performance if you offload video. And even emulating the function of the Gigatron (i.e., emulating the vCPU instruction set, not the native one), you can do the video in another cog. So even if you are a tad slower, if you split out the video and sound into their own cogs, that is allowing the vCPU code more time to run.

    So, it would be more efficient to emulate vCPU and not the Gigatron specifically as a whole. You set things up, run vCPU code in a cog, run video in a cog, run other devices in their cogs, etc. You'd have the vCPU and coprocessors for the soft hardware that is currently run as native threads. So if you forget native Gigatron code altogether and do each vCPU instruction the best you can on the P2, taking advantage of the multiple cogs and being able to split out the hardware the Gigatron emulates, you'd be better off.
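
    As a rough sketch of what a separate video cog would do, here is the scanout logic modeled in Python. The table address `0x0100` and the two-byte (page, x offset) entry layout are assumptions made for the example, as is treating the offset as absolute; the 120 lines, 160 pixels, 6-bit color and 8-bit X wrap-around come from the description of the Gigatron's video scheme.

```python
VIDEO_TABLE = 0x0100    # assumed location of the 120 (page, x offset) pairs
LINES, WIDTH = 120, 160

def render_frame(ram):
    """Return 120 rows of 160 six-bit pixels read straight from emulated RAM."""
    frame = []
    for line in range(LINES):
        page = ram[VIDEO_TABLE + 2 * line]
        x = ram[VIDEO_TABLE + 2 * line + 1]
        row = []
        for _ in range(WIDTH):
            row.append(ram[(page << 8) | x] & 0x3F)   # drop the two upper bits
            x = (x + 1) & 0xFF                        # X wraps within the page
        frame.append(row)
    return frame

# Tiny self-check: line 0 shows page 8 starting at offset 250, so it wraps.
ram = bytearray(65536)
for line in range(LINES):
    ram[VIDEO_TABLE + 2 * line] = 8 + line
ram[VIDEO_TABLE + 1] = 250
ram[0x08FA] = 0xC1      # offset 250, upper bits set
ram[0x0800] = 0x2A      # reached after wrapping past offset 255
frame = render_frame(ram)
```

    Nothing in this loop touches the vCPU state, which is why it can run in its own cog while the interpreter keeps going.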

    And I'd want the code to be in assembly.

  • rogloh Posts: 5,786
    edited 2022-08-10 15:04

    Here's a somewhat optimized update to my code from above...I think it can (just) keep up at 6.25MHz (@325MHz P2) with worst case HUB access times if I have the branch test stuff computed correctly (I might not have the ALU tests correct). It uses the streamer to pace the video output to an 8 bit digital IO port and locks the instruction rate to the streamer output whenever its sequence runs faster than it should. It just doesn't have that extended output port included yet, but otherwise it's mostly there. I might try to get something to work with it if I do the execf table.

    To free a few more cycles for that extended port logic maybe the XBYTE may save us...?
    Update: I think I need 56 clocks per instruction total to handle the extended port. Without other optimizations that would require a 350MHz P2. Hopefully XBYTE helps.

    Update2: looking at XBYTE and counting clocks, it seems like probably even 300MHz could be used for 6.25MHz cycle accurate Gigatron emulation with extended port output as well. I think its longest, worst case hub timing path is only 47 clocks, :smile: , or it could be 49 clocks, not 100% sure yet.
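
    For reference, the minimum P2 clock for a given worst-case path is just clocks times 6.25 MHz. A small Python sketch using only the clock counts mentioned in these updates:

```python
GIGATRON_HZ = 6_250_000

def min_p2_mhz(clocks_per_instruction):
    """Lowest P2 clock (MHz) that fits this many clocks in one 6.25 MHz cycle."""
    return clocks_per_instruction * GIGATRON_HZ / 1_000_000

for clocks in (47, 49, 52, 56):
    print(clocks, min_p2_mhz(clocks))
```

    A 300MHz P2 gives exactly 48 clocks per Gigatron cycle, so a 47 clock worst case fits, while a 49 clock worst case would need about 306MHz.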

    ' GIGATRON EMULATOR CONCEPT FOR P2
    ' (C) Roger Loh 2022
    
    CON 
    
            _clkfreq = 325000000
    
            OUTP = 0        ' which group of 8 pins on the P2 is the gigatron output port
            INP  = 1        ' which group of 8 pins on the P2 is the gigatron input port
    
            BRGT = %0001 'c=0 z=0
            BRLT = %0100 'c=1 z=0
            BRNE = %0101 'c=? z=0
            BREQ = %1010 'c=? z=1
            BRGE = %0011 'c=0 z=?
            BRLE = %1110 'c=1 or z=1
    
    DAT
                                orgh
    
    gigatron
    
                                org     0
    
                                mov     pa, ##@lutram
                                sub     pa, ##@gigatron
                                add     ptrb, pa            'find start address of LUT data
                                setq2   #255                'read 256 longs to LUT
                                rdlong  0, ptrb
                                mov     rombase, ##@romdata 'setup ROM start
                                mov     rambase, ##@ramdata 'setup RAM start
                                altsb   outputport, #dira   'enable output pins
                                setbyte 0-0, #$ff, #0           
                                alts    inputport, #input0  'patch input port
                                mov     doinput, 0-0        
                                setxfrq vgafreq             'setup streamer frequency
                                xinit   vgaout, #0          'start video output
                                jmp     #main_loop          'begin emulator
    
    branch_ops '31 clks max to return from here             '    >   <   <>  =  >=  <=  
                                test    ac  wz              '    a   b   c   d   e   f 
                                testb   ac, #7 wc           '    a   b   c   d   e   f
                                modz    BRGT wz             '    a   |   |   |   |   | 
                                modz    BRLT wz             '    |   b   |   |   |   |
                                modz    BRNE wz             '    |   |   c   |   |   |
                                modz    BREQ wz             '    |   |   |   d   |   |
                                modz    BRGE wz             '    |   |   |   |   e   |
                                modz    BRLE wz             '    |   |   |   |   |   f
    branch_always               modz    $f wz               '    |   |   |   |   |   |   always
    
    branch_far_op               add     d, rambase          '  bus=RAM[d]   |        |         |
                                rdbyte  bus, d              '  bus=RAM[d]   |        |         |
                                mov     bus, d              '     |        bus=d     |         |
                                mov     bus, ac             '     |         |      bus=ac      |
                                mov     bus, input          '     |         |        |     bus=input
                                setbyte bus, y, #1          '     ?         ?        ?         ?      
    
                                add     bus, bus            'The ROM addresses words not bytes
                                add     bus, rombase        'offset from the ROM base address in HUB
            if_z                rdfast  #0, bus             'setup new fifo address to read from
    
    main_loop ' loop overhead is 17 clocks
                                push    #main_loop          'where we will return (unless we branch)
    {
        ' drop this code that handles wrap around to save 6 clocks
                                getptr  d                   'get FIFO pointer address
                                cmp     d, rambase wc       'RAM follows ROM
            if_nc               rdfast  #0, rombase         'wrap to start of ROM if needed
    }
    
                                'sync the input/output ports here
    dooutput                    xcont   vgaout, output
    doinput                     getbyte input, 0-0, #0      'patched instruction to read input
    
                                rfbyte  instr               'read the 8 bit instruction
                                rfbyte  d                   'read the 8 bit operand
                                rdlut   code, instr         'lookup the LUT tables
                                execf   code                'execute the instruction
    
    ld_op ' 27 clks max to return from here
                                mov     addr, d             '   0d   |    yd   |   |
                                mov     addr, x             '   |    0x   |   yx  yx++
                                setbyte addr, y, #1         '   |    |    yd  yx  yx++
    
                                rdbyte  bus, addr           'bus=[mem]   |         |       |
                                mov     bus, d              '   |      bus=d       |       |
                                mov     bus, ac             '   |        |       bus=ac    |
                                mov     bus, input          '   |        |         |     bus=input
    
                                mov     alu, bus
    
            _ret_               getbyte ac, alu, #0         '  ac=alu    |      |          |
            _ret_               getbyte x, alu, #0          '          x=alu    |          |
            _ret_               getbyte y, alu, #0          '                 y=alu        |
                                incmod  x, #255             '                           increment?
            _ret_               getbyte output, alu, #0     '                           out=alu
    
    and_op  ' 29 clks max to return from here
                                mov     addr, d             '   0d   |    yd   |   |
                                mov     addr, x             '   |    0x   |   yx  yx++
                                setbyte addr, y, #1         '   |    |    yd  yx  yx++
    
                                rdbyte  bus, addr           'bus=[mem]   |         |       |
                                mov     bus, d              '   |      bus=d       |       |
                                mov     bus, ac             '   |        |       bus=ac    |
                                mov     bus, input          '   |        |         |     bus=input
    
                                mov     alu, ac
                                and     alu, bus
    
            _ret_               getbyte ac, alu, #0         '  ac=alu    |      |          |
            _ret_               getbyte x, alu, #0          '          x=alu    |          |
            _ret_               getbyte y, alu, #0          '                 y=alu        |
                                incmod  x, #255             '                           increment?
            _ret_               getbyte output, alu, #0     '                           out=alu
    
    or_op
                                mov     addr, d             '   0d   |    yd   |   |
                                mov     addr, x             '   |    0x   |   yx  yx++
                                setbyte addr, y, #1         '   |    |    yd  yx  yx++
    
                                rdbyte  bus, addr           'bus=[mem]   |         |       |
                                mov     bus, d              '   |      bus=d       |       |
                                mov     bus, ac             '   |        |       bus=ac    |
                                mov     bus, input          '   |        |         |     bus=input
    
                                mov     alu, ac 
                                or      alu, bus
    
            _ret_               getbyte ac, alu, #0         '  ac=alu    |      |          |
            _ret_               getbyte x, alu, #0          '          x=alu    |          |
            _ret_               getbyte y, alu, #0          '                 y=alu        |
                                incmod  x, #255             '                           increment?
            _ret_               getbyte output, alu, #0     '                           out=alu
    
    xor_op
                                mov     addr, d             '   0d   |    yd   |   |
                                mov     addr, x             '   |    0x   |   yx  yx++
                                setbyte addr, y, #1         '   |    |    yd  yx  yx++
    
                                rdbyte  bus, addr           'bus=[mem]   |         |       |
                                mov     bus, d              '   |      bus=d       |       |
                                mov     bus, ac             '   |        |       bus=ac    |
                                mov     bus, input          '   |        |         |     bus=input
    
                                mov     alu, ac 
                                xor     alu, bus
    
            _ret_               getbyte ac, alu, #0         '  ac=alu    |      |          |
            _ret_               getbyte x, alu, #0          '          x=alu    |          |
            _ret_               getbyte y, alu, #0          '                 y=alu        |
                                incmod  x, #255             '                           increment?
            _ret_               getbyte output, alu, #0     '                           out=alu
    
    add_op
                                mov     addr, d             '   0d   |    yd   |   |
                                mov     addr, x             '   |    0x   |   yx  yx++
                                setbyte addr, y, #1         '   |    |    yd  yx  yx++
    
                                rdbyte  bus, addr           'bus=[mem]   |         |       |
                                mov     bus, d              '   |      bus=d       |       |
                                mov     bus, ac             '   |        |       bus=ac    |
                                mov     bus, input          '   |        |         |     bus=input
    
                                mov     alu, ac 
                                add     alu, bus
    
            _ret_               getbyte ac, alu, #0         '  ac=alu    |      |          |
            _ret_               getbyte x, alu, #0          '          x=alu    |          |
            _ret_               getbyte y, alu, #0          '                 y=alu        |
                                incmod  x, #255             '                           increment?
            _ret_               getbyte output, alu, #0     '                           out=alu
    
    sub_op
                                mov     addr, d             '   0d   |    yd   |   |
                                mov     addr, x             '   |    0x   |   yx  yx++
                                setbyte addr, y, #1         '   |    |    yd  yx  yx++
    
                                rdbyte  bus, addr           'bus=[mem]   |         |       |
                                mov     bus, d              '   |      bus=d       |       |
                                mov     bus, ac             '   |        |       bus=ac    |
                                mov     bus, input          '   |        |         |     bus=input
    
                                mov     alu, ac 
                                sub     alu, bus
    
            _ret_               getbyte ac, alu, #0         '  ac=alu    |      |          |
            _ret_               getbyte x, alu, #0          '          x=alu    |          |
            _ret_               getbyte y, alu, #0          '                 y=alu        |
                                incmod  x, #255             '                           increment?
            _ret_               getbyte output, alu, #0     '                           out=alu
    
    
    st_op  ' 24 clks max to return from here
                                mov     addr, d             '   0d   |    yd   |   |
                                mov     addr, x             '   |    0x   |   yx  yx++
                                setbyte addr, y, #1         '   |    |    yd  yx  yx++
                                incmod  x, #255             '   |    |    |    |  yx++
    
                                zerox   addr, #14           'restrict to 32kB
                                add     addr, rambase       'offset in HUB RAM
    
                                mov     x, ac               'x=ac     |      |
                                mov     y, ac               '  |     y=ac    |
    
            _ret_               wrbyte  d, addr             'RAM[addr]=d     |              |
            _ret_               wrbyte  ac, addr            '      |    RAM[addr]=ac        |
            _ret_               wrbyte  input, addr         '      |         |        RAM[addr]=input
    
    ctrl_op                     ret                         ' reserved
    
    input0                      getbyte input, ina, #0
    input1                      getbyte input, ina, #1
    input2                      getbyte input, ina, #2
    input3                      getbyte input, ina, #3
    input4                      getbyte input, inb, #0
    input5                      getbyte input, inb, #1
    input6                      getbyte input, inb, #2
    input7                      getbyte input, inb, #3
    
    ' registers
    ac                          long    0
    x                           long    0
    y                           long    0
    
    ' emulator state
    
    rombase                     long    0
    rambase                     long    0
    outputport                  long    OUTP
    inputport                   long    INP
    
    vgafreq                     long    $80000000 / (4*13) + 1          '6.25MHz output rate
    vgaout                      long    $608E0001 + (OUTP & 7)<<20      '4x8 bit output 
    
    ' temporary variables
    d                           res     1
    addr                        res     1
    bus                         res     1
    instr                       res     1
    code                        res     1
    input                       res     1
    output                      res     1
    alu                         res     1
    
                                fit
    
                                orgh
    
    lutram
    
                                org     $200
    
    {
    'execf table goes here
    
    }
    
                                orgh
    
    romdata    '    file    "romfile.bin"  ' PUT THE ROM FILE here
    ramdata         byte    0[32768]
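
    As a sanity check on the `vgafreq` constant in the listing, here is the same expression evaluated in Python. It assumes the streamer NCO adds the SETXFRQ value every clock and issues one transfer each time the sum crosses 2**31; treat that as my reading of the streamer, not something verified on hardware.

```python
SYSCLK = 325_000_000
vgafreq = 0x80000000 // (4 * 13) + 1     # same expression as the listing

# Assumption: the streamer NCO advances by vgafreq each clock and emits
# one transfer every time it crosses 2**31.
rate_hz = SYSCLK * vgafreq / 2**31
clocks_per_pixel = 2**31 / vgafreq

print(vgafreq, rate_hz, clocks_per_pixel)
```

    Under that assumption the output rate lands within a fraction of a hertz of 6.25MHz, or one pixel per 52 P2 clocks.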
    
  • rogloh Posts: 5,786
    edited 2022-08-10 13:39

    @PurpleGirl said:
    Speaking of the P2, how easy is it to read a double word and then deal with the individual bytes? Are there byte and word instructions? Or do you have to do a lot of swapping, shifting/rotating, etc?

    It's very easy with the P2 instruction set. You can read/write bytes/words/longs and extract bytes/words and set/get bytes from longs and words or swap byte orders with the P2, as well as sign/zero extend sub-long sized elements.

    Actually, you would only need to emulate the vCPU instruction set, not the native Gigatron code. And splitting the video out into its own cog. The whole idea of my proposed project of adding an external video controller to the Gigatron was to get more performance. 100% compatibility isn't my concern, just making it faster.

    Yeah that's a different approach to what I was mentioning but something is likely doable. You can try out whatever you want on the P2. Just get stuck into it and you'll figure it out. Best way to learn. :smile:

  • @PurpleGirl said:
    Speaking of the P2, how easy is it to read a double word and then deal with the individual bytes? Are there byte and word instructions? Or do you have to do a lot of swapping, shifting/rotating, etc?

    Welcome to the forum, PurpleGirl!

    The P2 has instructions to read and write an individual nibble, byte or word in a long/double word in cog/register RAM. There are other useful reg instructions, e.g. for swapping or copying bytes, doubling bits, reversing bit order. LUT RAM data width is long-only. Hub RAM is byte-, word- or long-addressable.
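
    For anyone who thinks better in a high-level language, here is a small Python model of what byte extraction and insertion on a long amount to. The function names mirror the P2 mnemonics only for readability; this is a sketch of the semantics, not P2 code, and `rdlong` here is a plain little-endian read used to show that the byte address need not be aligned.

```python
def getbyte(value, n):
    """Return byte n (0 = least significant) of a 32-bit long."""
    return (value >> (8 * n)) & 0xFF

def setbyte(value, byte, n):
    """Return value with byte n replaced by the low 8 bits of byte."""
    shift = 8 * n
    return (value & ~(0xFF << shift) & 0xFFFFFFFF) | ((byte & 0xFF) << shift)

def rdlong(mem, addr):
    """Little-endian 32-bit read at any byte address (no alignment needed)."""
    return int.from_bytes(mem[addr:addr + 4], "little")

mem = bytes([0x10, 0x11, 0x22, 0x33, 0x44, 0x55])
long_at_1 = rdlong(mem, 1)                 # read starting at an odd address
third = getbyte(long_at_1, 2)
patched = setbyte(long_at_1, 0xAB, 1)
```

    On the P2 each of these is a single instruction, so no shift-and-mask sequences are needed in the emulator itself.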

  • TonyB_ Posts: 2,178
    edited 2022-08-11 13:11

    The P2 can do fast block moves: reading N longs from hub RAM to cog or LUT RAM takes only 1 cycle per long after the first and max time = 16+N-1 cycles. Similarly for writing max time = 10+N-1 cycles. Assuming streamer not accessing same hub slices at same time.

    I think the best option for video would be to read each line of 160 bytes / 40 longs from hub RAM to LUT or cog RAM. Video output interrupt could read one long containing four pixels from LUT RAM in 3 cycles with auto-incrementing pointer @ 6.25/4 MHz. The HS and VS bits would need zeroing, taking 2 cycles. By good design these are bits 6 and 7 of each byte. LUT RAM would contain 256 longs of XBYTE EXECF table (one long for each opcode), plus 64 longs for video LUT, plus 40 longs for video line.

    The latter could be in cog RAM instead and if so reading each four-pixel long would be a bit slower, however zeroing HS and VS bits would not be needed at a cost of increasing video LUT to 256 longs. Overall times for both options would be almost identical.
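
    A quick Python tally of the numbers above, as a sketch only. The 512-long LUT size is the P2's LUT RAM capacity; the rest restates figures from this post.

```python
LUT_LONGS = 512                      # P2 LUT RAM size in longs

opcode_table, color_lut, line_buffer = 256, 64, 160 // 4
used = opcode_table + color_lut + line_buffer
free = LUT_LONGS - used

def max_block_read_cycles(n_longs):
    """Worst-case hub-to-cog/LUT block read time quoted above: 16 + N - 1."""
    return 16 + n_longs - 1

def clear_sync_bits(pixels4):
    """Zero bits 6 and 7 of each of the four pixel bytes in a long."""
    return pixels4 & 0x3F3F3F3F

line_read = max_block_read_cycles(line_buffer)
masked = clear_sync_bits(0xFFC07F80)
```

    So the first layout uses 360 of the 512 LUT longs, and fetching one video line costs at most 55 cycles.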

  • Another way to improve performance is to use cog RAM for page zero.

  • rogloh Posts: 5,786
    edited 2022-08-11 15:00

    I have updated my Gigatron emulator concept to fill in most of the functionality - this code is almost ready now to test/debug. Just need a way to figure out the serial shift register and how it can be mapped to receive ASCII from the P2 serial port. Then there would be a way to interact with this Gigatron.

    The Gigatron guys use an ATTiny85 or other Arduinos to access multiplexed input from a game controller, host PC serial console and a PS/2 keyboard using AVR software called Babelfish that all talks over the Famicom game controller protocol. This is certainly something that the P2 could do in time for a total emulation at a "full" 6.25MHz clock speed....you could plug in a game controller, PS/2 (or USB) keyboard into the P2 as well as the A/V breakout.

    I've optimized the emulator's timing far more and do use XBYTE now (slowest path is a branch at 39 P2 clocks plus the XBYTE overhead which I think adds 6 more), leaving me still under 48 P2 clock cycles per emulated instruction to be able to run a P2 at 300MHz. This is not accounting for FIFO reload wait after branches, although I have 4 further instructions (8 cycles) after changing the FIFO before I return to XBYTE so maybe the RDFAST wait can be skipped by setting bit31 in the RDFAST D argument... or I can run at 325MHz to help a bit there. TBD.

    A second IO helper COG is now used for driving the status LEDs, audio DAC, VGA video (on the A/V breakout) and is meant to handle the input port once I figure out that aspect of this system. The other main Gigatron COG still generates its 6 bit "EGA" format digital video output (like P1) on 8 IO pins and runs its ROM program code from HUB RAM (Harvard architecture style). The IO COG reads this data using a repository-mode Smartpin and looks up the RGB DAC values which it sends to its own video streamer output. These two COGs hopefully will remain locked once synced up.
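
    The lookup the IO COG performs can be sketched in Python like this. The bit assignment (red in bits 1:0, green in 3:2, blue in 5:4) and the `$RRGGBB00` output word layout are assumptions for the example, so check them against the real hardware before relying on them.

```python
def expand2(level):
    """Scale a 2-bit level (0..3) to 8 bits (0, 85, 170, 255)."""
    return level * 85

def build_color_lut():
    lut = []
    for value in range(64):
        r = expand2(value & 3)            # assumed: bits 1:0 = red
        g = expand2((value >> 2) & 3)     # assumed: bits 3:2 = green
        b = expand2((value >> 4) & 3)     # assumed: bits 5:4 = blue
        lut.append((r << 24) | (g << 16) | (b << 8))   # assumed $RRGGBB00 layout
    return lut

COLOR_LUT = build_color_lut()

def to_rgb(port_byte):
    """Look up the DAC word for a port byte, ignoring the two sync bits."""
    return COLOR_LUT[port_byte & 0x3F]
```

    A 64-entry table is enough because the top two bits of the port byte carry sync, not color.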

    I think this is probably eventually going to work out... :smile:

    Here's a sample of the ALU code...and branch/store ops, etc. These 68 instructions are the core of it all.

    alu_ops  ' 37 clks max to return to XBYTE loop from here
                                rfbyte  d                   'read immediate parameter
                                mov     addr, d             '   0d   |    yd   |   0d   0d   0d   |
                                mov     addr, x             '   |    0x   |   yx   |    |    |   yx++
                                setbyte addr, y, #1         '   |    |    yd  yx   |    |    |   yx++
    
                                add     addr, rambase       '   |    bus=[mem]     |       |
                                rdbyte  alu, addr           '   |    bus=[mem]     |       |
                                mov     alu, d              ' bus=d      |         |       |
                                mov     alu, ac             '   |        |       bus=ac    |
                                mov     alu, input          '   |        |         |     bus=input
    
                                                            '  LD (empty)
                                and     alu, ac             '  |   and   |     |     |     |
                                or      alu, ac             '  |    |    or    |     |     |
                                xor     alu, ac             '  |    |    |    xor    |     |
                                add     alu, ac             '  |    |    |     |    add    |
                                subr    alu, ac             '  |    |    |     |     |    sub
    
                                getbyte ac, alu, #0         '  ac=alu    |      |          |
                                getbyte x, alu, #0          '    |     x=alu    |          |
                                getbyte y, alu, #0          '    |       |    y=alu        |
                                incmod  x, #255             '    |       |      |       increment?
                                getbyte output, alu, #0     '    |       |      |       out=alu
    
                                xcont   vgaout, output
                                wxpin   output, #OUT_REPO
                                wxpin   ac, #XOUT_REPO
                _ret_           rqpin   input, #IN_REPO
    
    branch_ops '39 clks max to return to XBYTE loop from here             
                                rfbyte  d                   'read immediate parameter   
                                add     d, rambase          '   |    bus=[mem]     |       |
                                rdbyte  bus, d              '   |    bus=[mem]     |       |
                                mov     bus, d              ' bus=d      |         |       |
                                mov     bus, ac             '   |        |       bus=ac    |
                                mov     bus, input          '   |        |         |     bus=input
    
                                                            '    >   <   <>  =  >=  <=   always  farjmp
                                getptr  addr                '    a   b   c   d   e   f     g       |
                                sets    addr, bus           '    a   b   c   d   e   f     g       h
                                add     addr, bus           '    a   b   c   d   e   f     g       h
                                setd    addr, y             '    |   |   |   |   |   |     |       h
                                add     addr, rombase       '    |   |   |   |   |   |     |       h
                                test    ac wz               '    a   b   c   d   e   f     |       |
                                testb   ac, #7 wc           '    a   b   c   d   e   f     |       |
                                rdfast  wait, addr          '    |   |   |   |   |   |     g       h  
                if_00           rdfast  wait, addr          '    a   |   |   |   |   |     |       |
                if_10           rdfast  wait, addr          '    |   b   |   |   |   |     |       |
                if_nz           rdfast  wait, addr          '    |   |   c   |   |   |     |       |
                if_z            rdfast  wait, addr          '    |   |   |   d   |   |     |       |
                if_nc           rdfast  wait, addr          '    |   |   |   |   e   |     |       |
                if_c_or_z       rdfast  wait, addr          '    |   |   |   |   |   f     |       |
    
                                xcont   vgaout, output
                                wxpin   output, #OUT_REPO
                                wxpin   ac, #XOUT_REPO
                _ret_           rqpin   input, #IN_REPO
    
    st_ops  ' 30 clks max to return to XBYTE loop from here
                                rfbyte  d                   'read immediate parameter
                                mov     addr, d             '   0d   |    yd   |   0d   0d   0d   |
                                mov     addr, x             '   |    0x   |   yx   |    |    |   yx++
                                setbyte addr, y, #1         '   |    |    yd  yx   |    |    |   yx++
                                incmod  x, #255             '   |    |    |    |   |    |    |   yx++
                                mov     x, ac               '   |    |    |    |  x=ac  |    |    |
                                mov     y, ac               '   |    |    |    |   |   y=ac  |    |
    
                                zerox   addr, #14           'restrict to 32kB
                                add     addr, rambase       'offset in HUB RAM
    
                                wrbyte  d, addr             'RAM[addr]=d       |              |
                                wrbyte  ac, addr            '      |      RAM[addr]=ac        |
                                wrbyte  input, addr         '      |           |        RAM[addr]=input
    
                                xcont   vgaout, output
                                wxpin   output, #OUT_REPO
                                wxpin   ac, #XOUT_REPO
                _ret_           rqpin   input, #IN_REPO
    
    
    ctrl_op ' 10 clocks max to return to XBYTE loop from here
                                rfbyte  d                   ' reserved (nop)
                                xcont   vgaout, output
                                wxpin   output, #OUT_REPO
                                wxpin   ac, #XOUT_REPO
                _ret_           rqpin   input, #IN_REPO
    
    
    
    
  • @TonyB_ said:
    The P2 can do fast block moves: reading N longs from hub RAM to cog or LUT RAM takes only 1 cycle per long after the first and max time = 16+N-1 cycles. Similarly for writing max time = 10+N-1 cycles. Assuming streamer not accessing same hub slices at same time.

    I think the best option for video would be to read each line of 160 bytes / 40 longs from hub RAM to LUT or cog RAM. Video output interrupt could read one long containing four pixels from LUT RAM in 3 cycles with auto-incrementing pointer @ 6.25/4 MHz. The HS and VS bits would need zeroing, taking 2 cycles. By good design these are bits 6 and 7 of each byte. LUT RAM would contain 256 longs of XBYTE EXECF table (one long for each opcode), plus 64 longs for video LUT, plus 40 longs for video line.

    The latter could be in cog RAM instead and if so reading each four-pixel long would be a bit slower, however, zeroing HS and VS bits would not be needed at a cost of increasing video LUT to 256 longs. Overall times for both options would be almost identical.

    That was what I was thinking. With the video portion running at 6.25 MHz, separate from everything else, you have 160 cycles for the line and 40 for the horizontal porches. The vertical porch would be about 200 cycles.
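
As a rough check of the figures quoted above, the sketch below applies the worst-case block-move formulas (16+N-1 cycles to read, 10+N-1 to write) to one 40-long video line and adds up the proposed LUT allocation. The function names are hypothetical; the 512-long LUT size is the P2's.

```python
# Worst-case P2 clock counts for hub block moves, using the formulas quoted above.
def block_read_cycles(n_longs):
    return 16 + n_longs - 1

def block_write_cycles(n_longs):
    return 10 + n_longs - 1

LINE_PIXELS = 160
LINE_LONGS = LINE_PIXELS // 4          # four 8-bit pixels per long

# LUT allocation proposed in the quote: EXECF table + video LUT + one video line.
LUT_LONGS = 512                        # size of the P2 LUT RAM in longs
lut_used = 256 + 64 + LINE_LONGS

print(block_read_cycles(LINE_LONGS))   # 55
print(LUT_LONGS - lut_used)            # 152 longs left over
```

So a whole line costs at most 55 P2 clocks to fetch, and the three tables together fit in LUT RAM with room to spare.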

    The Gigatron CPU has no interrupts, but the P2 has a few. The Gigatron does the video in the firmware with OR immediate [Y:X++] Out instructions. For bit-banging, you can't get more efficient than that.

    But things are slightly more complicated since it also has to look up the indirection table. The Gigatron can put any page of the frame buffer anywhere it can reach. While a stock one comes with 32K, it can have 64K, and with some of the expansion board mods that plug into the RAM socket, you could have much more memory. In that case, the firmware would be modded to be able to recognize such boards and pass commands via the invalid memory control line combination. If both lines are high, it is idle. If /WE is low, that is a write. If /OE is low, that is a read, and if both are low that could be a command (and/or rudimentary DMA).
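
The four combinations of the two active-low memory control lines described above can be written out as a small decoder. This is only an illustrative model; `decode_bus` and the returned strings are hypothetical names, not part of any Gigatron firmware.

```python
def decode_bus(we_n, oe_n):
    """Classify the active-low /WE and /OE lines (1 = high/inactive, 0 = low/active)."""
    if we_n and oe_n:
        return "idle"
    if not we_n and not oe_n:
        # Invalid for plain SRAM; an expansion board could treat it as a command.
        return "command"
    return "write" if not we_n else "read"

for we_n in (1, 0):
    for oe_n in (1, 0):
        print(we_n, oe_n, decode_bus(we_n, oe_n))
```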

    I mentioned that because the controller would need to be indirection table aware. It would need to know where to get the page (and if a segment is needed for the modded ones, it would have to get it from the "Z" register that a fancy controller might have), and what the offset is. This system fragments the memory map since you need 160 pixels but each line is addressed as a 256-byte page. That leaves 96 bytes at the end of each line. The Racer game takes full advantage of that. So the grass on the sides of the track is in that 96 extra bytes region. And since X is an 8-bit register and X++ only affects the X register, if you overflow, it wraps around. Thus the grass can be on either side of the track. There is one more peculiarity to the indirection table. I believe that the offset is ignored in some cases. If the software changes it in one place in the table, it is used for the rest of the frame. So you can change the offset in just the first entry, and it should scroll the entire screen.
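
To make the wrap-around concrete, here is a minimal sketch of the addresses a controller would read for one displayed line, given a page and offset from the indirection table. `line_addresses` is a hypothetical helper, and the sketch models only the page/offset lookup and the 8-bit wrap of X, not the offset peculiarity mentioned above.

```python
def line_addresses(page, offset, width=160):
    """Addresses of one displayed line: the page (Y) is fixed, X starts at
    `offset` and wraps at 8 bits, so it never leaves the 256-byte page."""
    return [(page << 8) | ((offset + i) & 0xFF) for i in range(width)]

addrs = line_addresses(page=0x08, offset=200)
print(hex(addrs[0]), hex(addrs[55]), hex(addrs[56]))  # 0x8c8 0x8ff 0x800
```

With an offset of 200 the line runs off the end of the page after 56 pixels and continues from the start of the same page.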

    I'd love to be able to add other modes, like 320x240 and maybe a text mode (maybe 40x30 and/or 20x15). One way I could do it for a custom build would be to use the indirection table as parameter space. Obviously, a frame buffer won't be at Page 0, so a page of 0 could put the controller in standby mode (with video and sound or whatever doing its thing), and allow you to put commands there. For instance, that could be a way to add a math coprocessor. The P2 can certainly do that. Just put it in standby, load the parameters, and put the math opcode last. Then retrieve what is needed and then put it back in normal mode.

  • @PurpleGirl said:
    The Gigatron CPU has no interrupts, but the P2 has a few. The Gigatron does the video in the firmware with OR immediate [Y:X++] Out instructions. For bit-banging, you can't get more efficient than that.

    Do these OR immediate [Y:X++] Out instructions modify the video data read from RAM?

  • @rogloh said:
    I have updated my Gigatron emulator concept to fill in most of the functionality - this code is almost ready now to test/debug. Just need a way to figure out the serial shift register and how it can be mapped to receive ASCII from the P2 serial port. Then there would be a way to interact with this Gigatron.

    The Gigatron guys use an ATTiny85 or other Arduinos to access multiplexed input from a game controller, host PC serial console and a PS/2 keyboard using AVR software called Babelfish that all talks over the Famicom game controller protocol. This is certainly something that the P2 could do in time for a total emulation at a "full" 6.25MHz clock speed....you could plug in a game controller, PS/2 (or USB) keyboard into the P2 as well as the A/V breakout.

    I've optimized the emulator's timing far more and do use XBYTE now (slowest path is a branch at 39 P2 clocks plus the XBYTE overhead which I think adds 6 more), leaving me still under 48 P2 clock cycles per emulated instruction to be able to run a P2 at 300MHz. This is not accounting for FIFO reload wait after branches, although I have 4 further instructions (8 cycles) after changing the FIFO before I return to XBYTE so maybe the RDFAST wait can be skipped by setting bit31 in the RDFAST D argument... or I can run at 325MHz to help a bit there. TBD.
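
The clock budget behind these numbers is simple division: the P2 clock divided by the Gigatron's 6.25 MHz gives the P2 clocks available per emulated instruction. A small sketch, with hypothetical names:

```python
GIGATRON_HZ = 6_250_000

def p2_clocks_per_instruction(p2_hz):
    """P2 clocks available for each emulated Gigatron instruction."""
    return p2_hz // GIGATRON_HZ

XBYTE_OVERHEAD = 6
slowest_path = 39 + XBYTE_OVERHEAD     # branch path plus XBYTE overhead = 45

for mhz in (300, 325):
    budget = p2_clocks_per_instruction(mhz * 1_000_000)
    print(mhz, budget, budget - slowest_path)
# 300 48 3
# 325 52 7
```

The same arithmetic applies to the later 43+6 = 49 clock figure, which fits the 52-clock budget at 325 MHz but not the 48-clock budget at 300 MHz.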

    A second IO helper COG is now used for driving the status LEDs, audio DAC, VGA video (on the A/V breakout) and is meant to handle the input port once I figure out that aspect of this system. The other main Gigatron COG still generates its 6 bit "EGA" format digital video output (like P1) on 8 IO pins and runs its ROM program code from HUB RAM (Harvard architecture style). The IO COG reads this data using a repo Smartpin and looks up the RGB DAC values which it sends to its own video streamer output. These two COGs hopefully will remain locked once synced up.
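
The colour lookup done by the IO COG can be pictured as a 64-entry table that expands each 2-bit channel to 8 bits. The sketch below is an assumption-laden model, not the emulator's actual table: it assumes red in bits 0-1, green in bits 2-3 and blue in bits 4-5 of the output byte, and `build_rgb_lut` is a hypothetical name.

```python
def build_rgb_lut():
    """64-entry table mapping a 6-bit colour to (r, g, b) with 8 bits per channel.
    Assumed bit layout: bits 0-1 red, 2-3 green, 4-5 blue."""
    lut = []
    for value in range(64):
        r = (value & 0x03) * 85            # 0, 85, 170, 255
        g = ((value >> 2) & 0x03) * 85
        b = ((value >> 4) & 0x03) * 85
        lut.append((r, g, b))
    return lut

lut = build_rgb_lut()
out_byte = 0xFF                            # both sync bits set, all colour bits set
print(lut[out_byte & 0x3F])                # (255, 255, 255)
```

Masking with `0x3F` before the lookup drops the two sync bits, which is why 64 entries are enough.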

    I think this is probably eventually going to work out... :smile:

    Here's a sample of the ALU code...and branch/store ops, etc. These 68 instructions are the core of it all.

    Well, with that port, if nothing else, use a Famicom controller. Or make your own Pluggy in another cog or something. At that rate, you could remove the shift register and have a regular "register" there. Then you don't have to decode the keyboard, then assemble and convert, then serialize it again. The shift register is ancillary and not a part of the core CPU anyway.

    As for 68 instructions, you'd need even the crazy instructions for full compatibility since newer ROMs use those. There are maybe 16 undefined instructions, and maybe 2 of those need to be used. They are "undefined" because they drive both the /WE and /OE lines active. SRAM tends to not like that, but in that case, newer add-on cards use that invalid mode to take the memory off the bus and speak directly to the bus. So commands get passed that way. The native instructions are rather orthogonal, but the vCPU ones are haphazard. The Gigatron is 8-bit, 1-cycle, Harvard RISC, but the vCPU interpreter is 16-bit, multi-length, Von Neumann.

    Marcel did modify a Gigatron to work at 12.5 MHz. However, to simplify things, he wrote a test ROM that only wrote to the left half of the screen. Everything was faster, though sound and video worked the same since he kept the syncs the same. I get why he had to write to half the screen. If you wanted to fill it out, that would take a NOP every other screen instruction. If there were 2 more index registers and maybe another accumulator, you could weirdly tangle the vCPU interpreter with the screen-writing.

    And if you use some sort of DMA or snooping scheme to split the video from the CPU, you'd need to do what the Gigatron does and look up the indirection table to find each line's frame buffer address then read and display the 160 relevant ones. Then one would need to rewrite the ROM. If you don't move all the I/O, then you'd need to keep the syncs so the other I/O can work, and maybe a mechanism to sync the syncs.

  • PurpleGirl Posts: 152
    edited 2022-08-11 23:50

    @TonyB_ said:

    @PurpleGirl said:
    The Gigatron CPU has no interrupts, but the P2 has a few. The Gigatron does the video in the firmware with OR immediate [Y:X++] Out instructions. For bit-banging, you can't get more efficient than that.

    Do these OR immediate [Y:X++] Out instructions modify the video data read from RAM?

    Only the sync bits. That is a single instruction that reads RAM at Y:X, ORs it with 192 (11000000b) or whatever to toggle the syncs, sends it to the Out register, and increments X.

    User software is free to use those 2 bits of every location in the frame buffer as it sees fit as they are ignored by the video.
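
A minimal model of that instruction, assuming the behaviour exactly as described in this post (read RAM at Y:X, OR with the immediate, latch the result in Out, increment X). The register dictionary and the name `out_or_imm` are hypothetical.

```python
def out_or_imm(ram, regs, imm):
    """Model of 'OR immediate [Y:X++] Out': RAM is only read, never written."""
    addr = (regs["y"] << 8) | regs["x"]
    regs["out"] = (ram[addr] | imm) & 0xFF
    regs["x"] = (regs["x"] + 1) & 0xFF     # only X increments, so it wraps in-page
    return regs["out"]

ram = bytearray(65536)
ram[0x0810] = 0x2A                         # a pixel in the frame buffer
regs = {"x": 0x10, "y": 0x08, "out": 0}
print(hex(out_or_imm(ram, regs, 0xC0)))    # 0xea
print(hex(ram[0x0810]), hex(regs["x"]))    # 0x2a 0x11
```

The byte in RAM is untouched; only the value sent to the Out register carries the sync bits.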

    I also imagine that the context switch (changing the index registers between vCPU, video, and other things) would happen during the syncs. Plus reading the relevant entry in the indirection table for the next page of pixels.

  • I just read about pipelining in the Gigatron which I didn't realize before. Instructions after branches (branch delay slots) are allowed and apparently used here. Looks like my emulator concept can't work without alterations.

    To remedy I guess I need to save off the branch address and use it later at the end of the next instruction. This means that the ALU operations need to be extended by more cycles that test for a branching condition and deal with it. Don't think it will allow full 6.25MHz speed operation now. :(

  • PurpleGirl Posts: 152
    edited 2022-08-12 03:08

    @rogloh said:
    I just read about pipelining in the Gigatron which I didn't realize before. Instructions after branches (branch delay slots) are allowed and apparently used here. Looks like my emulator concept can't work without alterations.

    To remedy I guess I need to save off the branch address and use it later at the end of the next instruction. This means that the ALU operations need to be extended by more cycles that test for a branching condition and deal with it. Don't think it will allow full 6.25MHz speed operation now. :(

    Well, the fetch stage would need to be kept a cycle behind the execution stage. So yes, you have 1 branch delay slot.

    The Gigatron ROM embraces the delay slot, thus the next instruction after a branch will execute before it reaches the new location. Thus that makes trampoline code possible for lookup tables, image/program storage, etc. All user code, including the startup program, is run as vCPU code.
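
The delay slot can be modelled by recording the branch target and applying it one instruction late, which is the "save off the branch address and use it later" idea quoted above. This is a toy interpreter with made-up opcodes (`ld`, `bra`, `halt`), not Gigatron code.

```python
def run(program, max_steps=20):
    """program: list of (op, arg) tuples.
    'ld' appends arg to the trace, 'bra' branches to index arg, 'halt' stops."""
    pc = 0
    pending = None                 # branch target recorded by the previous instruction
    trace = []
    for _ in range(max_steps):
        op, arg = program[pc]
        # A branch taken by the previous instruction only redirects the fetch now.
        next_pc = pending if pending is not None else pc + 1
        pending = None
        if op == "halt":
            break
        if op == "bra":
            pending = arg
        elif op == "ld":
            trace.append(arg)
        pc = next_pc
    return trace

program = [
    ("bra", 3),
    ("ld", "delay slot"),          # still executes after the branch
    ("ld", "skipped"),
    ("ld", "target"),
    ("halt", None),
]
print(run(program))                # ['delay slot', 'target']
```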

  • rogloh Posts: 5,786
    edited 2022-08-12 03:51

    Yeah I think I might possibly be okay again with any luck (just)... :smile:

    I found I could pipeline the branching quite easily by saving the branch address and using it at the start of the next instruction to prepare the FIFO, it also gives the FIFO longer to refill and if I position it nicely hopefully it won't interfere with the RDBYTE and WRBYTE too much now either.

    It pushes out the branching to 43 clock cycles total and the ALU operations to the same, but this may still allow 6.25MHz at 325MHz P2 even when you add the 6 clocks of XBYTE overhead. There's probably one other optimization left here too that could combine the accumulator and output register in the same long repository write operation and that affects all operations.

    ' XBYTE outer loop overhead gets down to 6 clocks.
    
    ' Slowest code path is 43+6 = 49 clocks. A P2@325MHz in theory could yield a video pixel rate of 6.25MHz, though this is excluding any fifo stealing or reconfig cycles during branching if they occur.
    
    
    alu_ops  ' 43 clocks max to return to XBYTE loop from here
                                rfbyte  d                   'read immediate parameter
    
    
                                mov     addr, d             '   0d   |    yd   |   0d   0d   0d   |
                                mov     addr, x             '   |    0x   |   yx   |    |    |   yx++
                                setbyte addr, y, #1         '   |    |    yd  yx   |    |    |   yx++
    
                                add     addr, rambase       '   |    bus=[mem]     |       |
                                rdbyte  alu, addr           '   |    bus=[mem]     |       |
                                mov     alu, d              ' bus=d      |         |       |
                                mov     alu, ac             '   |        |       bus=ac    |
                                mov     alu, input          '   |        |         |     bus=input
    
                                test    branchaddr wz       'check if we branch after this instruction
                if_nz           rdfast  nowait, branchaddr  'branch
                                mov     branchaddr, #0      'clear for next time
    
                                                            '  LD (empty)
                                and     alu, ac             '  |   and   |     |     |     |
                                or      alu, ac             '  |    |    or    |     |     |
                                xor     alu, ac             '  |    |    |    xor    |     |
                                add     alu, ac             '  |    |    |     |    add    |
                                subr    alu, ac             '  |    |    |     |     |    sub
    
                                getbyte ac, alu, #0         '  ac=alu    |      |          |
                                getbyte x, alu, #0          '    |     x=alu    |          |
                                getbyte y, alu, #0          '    |       |    y=alu        |
                                incmod  x, #255             '    |       |      |       increment?
                                getbyte output, alu, #0     '    |       |      |       out=alu
    
    
                                xcont   vgaout, output
                                wxpin   output, #OUT_REPO
                                wxpin   ac, #XOUT_REPO
                _ret_           rqpin   input, #IN_REPO
    
    
    ctrl_op ' 16+ clocks max to return to XBYTE loop from here (waits for FIFO because it has time)
    st_ops  ' 36 clocks max to return to XBYTE loop from here
                                rfbyte  d                   'read immediate parameter                  
                                test    branchaddr wz       'check if we branch after this instruction
    
                if_nz           rdfast  nowait, branchaddr  'RAM[addr]=d   RAM[addr]=ac  RAM[addr]=input |
                if_nz           rdfast  #0, branchaddr      '      |           |              |        ctrl_op
                                mov     branchaddr, #0      'clear for next time
    
                                mov     addr, d             '   0d   |    yd   |   0d   0d   0d   |      |
                                mov     addr, x             '   |    0x   |   yx   |    |    |   yx++    |
                                setbyte addr, y, #1         '   |    |    yd  yx   |    |    |   yx++    |
                                incmod  x, #255             '   |    |    |    |   |    |    |   yx++    |
                                mov     x, ac               '   |    |    |    |  x=ac  |    |    |      |
                                mov     y, ac               '   |    |    |    |   |   y=ac  |    |      |
    
                                zerox   addr, #14           'restrict to 32kB                            |
                                add     addr, rambase       'offset in HUB RAM                           |
    
                                wrbyte  d, addr             'RAM[addr]=d       |              |          |
                                wrbyte  ac, addr            '      |      RAM[addr]=ac        |          |
                                wrbyte  input, addr         '      |           |        RAM[addr]=input  |
    
                                xcont   vgaout, output
                                wxpin   output, #OUT_REPO
                                wxpin   ac, #XOUT_REPO
                _ret_           rqpin   input, #IN_REPO
    
    
    
    branch_ops '43 clocks max to return to XBYTE loop from here             
                                rfbyte  d                   'read immediate parameter   
                                add     d, rambase          '   |    bus=[mem]     |       |
                                rdbyte  bus, d              '   |    bus=[mem]     |       |
                                mov     bus, d              ' bus=d      |         |       |
                                mov     bus, ac             '   |        |       bus=ac    |
                                mov     bus, input          '   |        |         |     bus=input
    
                                test    branchaddr wz       'check if we branch after this instruction
                if_nz           rdfast  nowait, branchaddr  'branch
    
                                                            '    >   <   <>  =  >=  <=   always  farjmp
                                getptr  branchaddr          '    a   b   c   d   e   f     g       |
                                sets    branchaddr, bus     '    a   b   c   d   e   f     g       h
                                add     branchaddr, bus     '    a   b   c   d   e   f     g       h
                                setd    branchaddr, y       '    |   |   |   |   |   |     |       h
                                add     branchaddr, rombase '    |   |   |   |   |   |     |       h
                                test    ac wz               '    a   b   c   d   e   f     |       |
                                testb   ac, #7 wc           '    a   b   c   d   e   f     |       |
                if_c_or_z       mov     branchaddr, #0      '    a   |   |   |   |   |     |       |
                if_nc_or_z      mov     branchaddr, #0      '    |   b   |   |   |   |     |       |
                if_z            mov     branchaddr, #0      '    |   |   c   |   |   |     |       |
                if_nz           mov     branchaddr, #0      '    |   |   |   d   |   |     |       |
                if_c            mov     branchaddr, #0      '    |   |   |   |   e   |     |       |
                if_00           mov     branchaddr, #0      '    |   |   |   |   |   f     |       |
    
    
                                xcont   vgaout, output
                                wxpin   output, #OUT_REPO
                                wxpin   ac, #XOUT_REPO
                _ret_           rqpin   input, #IN_REPO
    
    
  • rogloh Posts: 5,786
    edited 2022-08-12 07:36

    @PurpleGirl said:
    As for 68 instructions, you'd need even the crazy instructions for full compatibility since newer ROMs use those. There are maybe 16 undefined instructions, and maybe 2 of those need to be used. They are "undefined" because they drive both the /WE and /OE lines active. SRAM tends to not like that, but in that case, newer add-on cards use that invalid mode to take the memory off the bus and speak directly to the bus. So commands get passed that way.

    I was just reading this on HaD, about the expansion technique of Gigatron. It's rather clever.

    https://hackaday.io/project/164176-gigatron-io-and-ram-expander/details
    https://cdn.hackaday.io/files/1641767024105984/Expander-schematic.pdf

    I think we can potentially also achieve this type of expansion in this emulator code, certainly the memory expansion to 128kB is achievable, that's a no brainer doing the RAM base address change on receiving the special CTRL instruction. In theory the P2 would be able to control the SPI devices the same by mapping the control register outputs to real IO pins. The only concern I have is reading in those 4 bits from different SPI/IO devices. If the IO read gets done during the control instruction itself, then perhaps it's possible, but if the normal ALU instructions have to select between RAM vs IO based on the state of the SPI clock latch bit then perhaps that's going to be harder to achieve in a timely way...need to mull it over more and see if/how it could fit.

    EDIT: Just thought about it, and ok, it might be possible if we switch over to a different XBYTE instruction set when IO is enabled. This way all the instructions can be remapped to an alternative implementation that reads data from the 4 IO pins not RAM. This can work at the rated speed, assuming it can be done with the pipelining as well (TBD). If different IO pins are used we may need to collate their independent pin states into a nibble, but if all 4 SPI MISO pins are contiguous, starting on a nibble boundary it will be easier to achieve with a single access. Eg.

    getnib bus, miso_pin_base, #1 ' all 4 pins are together
    

    vs

    setbyte bus, #0, #0 ' or the default state of unused bus bits (maybe $ff)
    testp pin1 wc
    testp pin2 wz
    rczl bus
    testp pin3 wc
    testp pin4 wz
    rczl bus
    

    That is 14 cycles, still 1 clock less than a HUB memory read, which is good, so it sort of still fits in.
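
The two approaches above can be compared with a small bit-level model. It assumes RCZL shifts the destination left by two and brings C and Z into bits 1 and 0, and that each TESTP copies one pin state into C or Z; the Python helper names are hypothetical.

```python
def getnib(value, n):
    """Extract nibble n (0 = least significant) from a register value."""
    return (value >> (4 * n)) & 0xF

def rczl(d, c, z):
    """Assumed RCZL behaviour: shift D left by two, C into bit 1, Z into bit 0."""
    return ((d << 2) | (c << 1) | z) & 0xFFFFFFFF

def collect_pins(pin1, pin2, pin3, pin4):
    """Gather four scattered pin states into one nibble, as in the second sequence."""
    bus = 0                        # stands in for clearing the low byte of bus
    bus = rczl(bus, pin1, pin2)    # testp pin1 wc / testp pin2 wz / rczl bus
    bus = rczl(bus, pin3, pin4)    # testp pin3 wc / testp pin4 wz / rczl bus
    return bus

print(bin(collect_pins(1, 0, 1, 1)))   # 0b1011
print(hex(getnib(0xA5, 1)))            # 0xa
```

Under these assumptions pin1 ends up as the most significant bit of the nibble, so the pin order in the TESTP sequence sets the bit order seen by the emulated Gigatron.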

  • TonyB_ Posts: 2,178
    edited 2022-08-12 10:38

    @PurpleGirl said:

    @TonyB_ said:

    @PurpleGirl said:
    The Gigatron CPU has no interrupts, but the P2 has a few. The Gigatron does the video in the firmware with OR immediate [Y:X++] Out instructions. For bit-banging, you can't get more efficient than that.

    Do these OR immediate [Y:X++] Out instructions modify the video data read from RAM?

    Only the sync bits. That is a single instruction that reads RAM at Y:X, ORs it with 192 (11000000b) or whatever to toggle the syncs, sends it to the Out register, and increments X.

    User software is free to use those 2 bits of every location in the frame buffer as it sees fit as they are ignored by the video.

    I also imagine that the context switch (changing the index registers between vCPU, video, and other things) would happen during the syncs. Plus reading the relevant entry in the indirection table for the next page of pixels.

    That's good news. The video interrupt could read an indirection table word from hub RAM with page and offset (16 P2 cycles max) then read an entire video line (55 cycles max) during horizontal blanking. This would take less time than two Gigatron blanking pixels of which there are 40. No Gigatron instructions would be needed for video bit-banging. During the active display the video interrupt frequency would be 6.25/4 MHz.
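
The "less than two blanking pixels" claim can be checked by converting Gigatron pixel times into P2 clocks. The P2 clock rates below are the 300 MHz and 325 MHz figures discussed earlier in the thread; the helper name is hypothetical.

```python
GIGATRON_HZ = 6_250_000

def p2_clocks_for_pixels(pixels, p2_hz):
    """P2 clocks that elapse during `pixels` Gigatron pixel periods."""
    return pixels * p2_hz // GIGATRON_HZ

fetch = 16 + 55                    # indirection table entry + full line, worst case

for mhz in (300, 325):
    two_pixels = p2_clocks_for_pixels(2, mhz * 1_000_000)
    blanking = p2_clocks_for_pixels(40, mhz * 1_000_000)
    print(mhz, fetch, two_pixels, blanking)
# 300 71 96 1920
# 325 71 104 2080
```

At either clock rate the 71-clock worst case fits inside two pixel periods, leaving nearly all of the 40-pixel blanking interval free.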
