Could the Propeller 2 be used as an I/O controller for a Gigatron TTL computer? - Page 2 — Parallax Forums


Comments

  • TonyB_ Posts: 2,196
    edited 2022-08-12 11:03

    @rogloh said:
    Yeah I think I might possibly be okay again with any luck (just)... :smile:

    I found I could pipeline the branching quite easily by saving the branch address and using it at the start of the next instruction to prepare the FIFO, it also gives the FIFO longer to refill and if I position it nicely hopefully it won't interfere with the RDBYTE and WRBYTE too much now either.

    The delayed branching is actually helpful.

    I think the quickest way to handle branching is to set the C flag if branching is needed. C is preserved between XBYTEs if CZ writing not enabled. (I insisted that the latter be optional.) A MOV instruction could clear C after the no-wait if_c RDFAST .... You might need if_c WAITX ... before the MOV. The branch instruction code should use C only.

  • rogloh Posts: 5,837
    edited 2022-08-12 11:44

    @PurpleGirl, do you happen to know what are the particular CTRL instructions required? In the instruction encoding I seem to have 8 of them (all the ones that STORE by reading from memory and writing at the same time), as it's just the 8 addressing modes that change in that situation. Are all 8 of these mapped to the same control instructions or just a couple of them like you mentioned earlier? If a subset, which ones? I can't find that documented but it's probably important.

    @TonyB_ said:

    @rogloh said:

    The delayed branching is actually helpful.

    It sort of is now for the FIFO reload time, yes. Even though it adds some more clocks. Do you know how long it takes to refill a FIFO before the next RFBYTE is valid with the no wait option to RDFAST? If it's more than some number of instructions worth I might have issues maintaining the full speed. I'm hoping it's only around a hub window cycle delay but it might be more. Also an intervening RDBYTE or WRBYTE might affect this time I suspect. Hopefully the P2 is smart enough to keep refilling in spare time while the data from other slots arrives before the RDBYTE gets its address requested, but it might be fully pre-empted until RDBYTE completes and starved during this time, not sure about that. That's sort of why I put it after the RDBYTE, but it would be nice to move it earlier in the sequence to give it the most time to complete.

    I think the quickest way to handle branching is to set the C flag if branching is needed. C is preserved between XBYTEs if CZ writing not enabled. (I insisted that the latter be optional.) A MOV instruction could clear C after the no-wait if_c RDFAST .... You might need if_c WAITX ... before _RET_. The branch instruction code should use C only.

    I wasn't sure about flags usage in XBYTE between instructions, but if I can make use of them to save a cycle or more I'll be happy.

    I was able to create a handler for this CTRL instruction. It takes 44 clocks, which means it likely needs 325MHz operation or so unless I can shave off another instruction.

    The 128kB RAM banking feature in the STORE instruction has added 2 clocks but it was only 36 clocks before so not an issue going up to 38 clocks. More problematic is how to deal with it in the ALU operations that read from HUB RAM. I hadn't accounted for any RAM banking there, and the extra two instructions needed for that adds 4 clocks, pushing the count to 47 clocks. With the 6 XBYTE clocks it exceeds my self imposed 325MHz operation limits. :( That was almost the last thing to put in too. Maybe the flag feature will save me an instruction or I can find another optimization? I mean we can always go to 337.5MHz which is almost what NeoYume uses, but it's nicer to run a bit slower if you can. Ideally it would run at 297MHz and you could upscale to HDTV output. Wouldn't that 160x120 resolution look mighty nice on a 1080p panel...perfect integer scaling too! :wink:
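The clock arithmetic above can be checked with a short sketch. It assumes the 6.25 MHz Gigatron instruction rate mentioned elsewhere in this thread and treats the 6 XBYTE clocks as a fixed overhead added to each handler; the function names are hypothetical and not part of the emulator.

```python
# Sketch of the per-instruction cycle budget (assumed model, not emulator code).
GIGATRON_MHZ = 6.25   # Gigatron instruction rate quoted in the thread
XBYTE_CLOCKS = 6      # XBYTE overhead per instruction quoted in the thread

def clocks_per_instruction(sysclk_mhz):
    """P2 clocks available per Gigatron instruction at a given sysclk."""
    return int(sysclk_mhz // GIGATRON_MHZ)

def min_sysclk_mhz(handler_clocks):
    """Lowest sysclk at which a handler plus the XBYTE overhead still fits."""
    return (handler_clocks + XBYTE_CLOCKS) * GIGATRON_MHZ

# Worst-case handler clock counts as listed in the code comments of this post.
for name, clocks in (("ctrl_op", 44), ("st_ops", 38), ("alu_ops", 47)):
    total = clocks + XBYTE_CLOCKS
    print(f"{name}: {total} clocks -> {min_sysclk_mhz(clocks)} MHz minimum")

print("budget at 325 MHz:", clocks_per_instruction(325), "clocks")
print("budget at 337.5 MHz:", clocks_per_instruction(337.5), "clocks")
```

Under these assumptions the 47-clock ALU path needs 53 clocks in total, one more than the 52 available at 325MHz, while 337.5MHz gives 54.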

    Latest code:

    ctrl_op ' 44 clocks max to return to XBYTE loop from here
    ' EXPANSION IO MAPPING
    '  A15 - SPI MOSI OUTPUT
    '  A7 - RAM BANK
    '  A6 - RAM BANK
    '  A5:A2 - SPI CS OUTPUTS
    '  A1 - UNUSED
    '  A0 - SPI CLK & RAM/IO READ SELECT BIT
                                rfbyte  d                   'read immediate parameter
                                test    branchaddr wz       'check if we branch after this instruction
                if_nz           rdfast  nowait, branchaddr  'delayed branch
                                mov     branchaddr, #0      'clear for next time
    '  I need to check which of these address modes below is valid for CTRL (maybe it's all of them)
                                mov     addr, d             '   0d   |    yd   |   0d   0d   0d   |
                                mov     addr, x             '   |    0x   |   yx   |    |    |   yx++
                                setbyte addr, y, #1         '   |    |    yd  yx   |    |    |   yx++
    
                                testb   addr, #7 wc         'get RAM bank bit 1
                                testb   addr, #6 wz         'get RAM bank bit 0
                                bitc    rambasehi, #14
                                bitz    rambasehi, #15
                                rczr    addr wcz            'get clock pin into Z and shift down two bits
                                testb   addr, #13 wc        'test original A15 bit (after shift)
                                drvc    #MOSI_PIN           'drive MOSI data output
                                setnib  #CS_PORT, addr, #CS_NIB 'drive 4x!CS pin outputs from original A5:A2 bits
                                drvz    #CLKPIN             'output SPI clock to pin
                                bitz    xbytetable, #8      'select XBYTE decode table for RAM or IO reading
                                wxpin   ac, #XOUT_REPO      
                                xcont   vgaout, output  
                                wxpin   output, #OUT_REPO
                                rqpin   input, #IN_REPO
                _ret_           setq    xbytetable          'flip to alternate XBYTE table if needed
    
    st_ops  ' 38 clocks max to return to XBYTE loop from here 
                                rfbyte  d                   'read immediate parameter                  
                                test    branchaddr wz       'check if we branch after this instruction
    
                if_nz           rdfast  nowait, branchaddr  'RAM[addr]=d   RAM[addr]=ac  RAM[addr]=input |
                if_nz           rdfast  #0, branchaddr      '      |           |              |        ctrl_op
                                mov     branchaddr, #0      'clear for next time
    
                                mov     addr, d             '   0d   |    yd   |   0d   0d   0d   |      |
                                mov     addr, x             '   |    0x   |   yx   |    |    |   yx++    |
                                setbyte addr, y, #1         '   |    |    yd  yx   |    |    |   yx++    |
                                incmod  x, #255             '   |    |    |    |   |    |    |   yx++    |
                                mov     x, ac               '   |    |    |    |  x=ac  |    |    |      |
                                mov     y, ac               '   |    |    |    |   |   y=ac  |    |      |
    
                                bitl    addr, #15 wcz        'check/clear A15
                                add     addr, rambase
                if_z            sub     addr, rambasehi
    
                '  OLDER CODE
                '               zerox   addr, #14           'restrict to 32kB                            |
                '               add     addr, rambase       'offset in HUB RAM                           |
    
                                wrbyte  d, addr             'RAM[addr]=d       |              |          |
                                wrbyte  ac, addr            '      |      RAM[addr]=ac        |          |
                                wrbyte  input, addr         '      |           |        RAM[addr]=input  |
    
                                xcont   vgaout, output
                                wxpin   output, #OUT_REPO
                                wxpin   ac, #XOUT_REPO
                _ret_           rqpin   input, #IN_REPO
    
    alu_ops  ' 47 clocks max to return to XBYTE loop from here
                                rfbyte  d                   'read immediate parameter
    
                                mov     addr, d             '   0d   |    yd   |   0d   0d   0d   |
                                mov     addr, x             '   |    0x   |   yx   |    |    |   yx++
                                setbyte addr, y, #1         '   |    |    yd  yx   |    |    |   yx++
    
                                bitl    addr, #15 wcz       '   |    bus=[mem]     |       |
                                add     addr, rambase       '   |    bus=[mem]     |       |
                if_z            sub     addr, rambasehi     '   |    bus=[mem]     |       |
    
                                rdbyte  alu, addr           '   |    bus=[mem]     |       |
                                mov     alu, d              ' bus=d      |         |       |
                                mov     alu, ac             '   |        |       bus=ac    |
                                mov     alu, input          '   |        |         |     bus=input
    
                                test    branchaddr wz       'check if we branch after this instruction
                if_nz           rdfast  nowait, branchaddr  'branch
                                mov     branchaddr, #0      'clear for next time
    
                                                            '  LD (empty)
                                and     alu, ac             '  |   and   |     |     |     |
                                or      alu, ac             '  |    |    or    |     |     |
                                xor     alu, ac             '  |    |    |    xor    |     |
                                add     alu, ac             '  |    |    |     |    add    |
                                subr    alu, ac             '  |    |    |     |     |    sub
    
                                getbyte ac, alu, #0         '  ac=alu    |      |          |
                                getbyte x, alu, #0          '    |     x=alu    |          |
                                getbyte y, alu, #0          '    |       |    y=alu        |
                                incmod  x, #255             '    |       |      |       increment?
                                getbyte output, alu, #0     '    |       |      |       out=alu
    
                                xcont   vgaout, output
                                wxpin   output, #OUT_REPO
                                wxpin   ac, #XOUT_REPO
                _ret_           rqpin   input, #IN_REPO
    

    UPDATE: @TonyB_ If I can use the Z flag to signal between the XBYTE instructions, I could set it when clearing branchaddr (moved down towards the end of each instruction handler snippet after we're done using flags) and also when cancelling these jumps in the branch_ops code. This avoids that "test branchaddr wz" each time because Z would be valid already. Z flag would be cleared (only) by the branch taken case in the branch_ops code, and set otherwise. This could save a critical cycle in the alu_ops case. :smile:

  • TonyB_ Posts: 2,196
    edited 2022-08-14 11:42

    @rogloh said:
    Do you know how long it takes to refill a FIFO before the next RFBYTE is valid with the no wait option to RDFAST? If it's more than some number of instructions worth I might have issues maintaining the full speed. I'm hoping it's only around a hub window cycle delay but it might be more. Also an intervening RDBYTE or WRBYTE might affect this time I suspect. Hopefully the P2 is smart enough to keep refilling in spare time while the data from other slots arrives before the RDBYTE gets its address requested, but it might be fully pre-empted until RDBYTE completes and starved during this time, not sure about that. That's sort of why I put it after the RDBYTE, but it would be nice to move it earlier in the sequence to give it the most time to complete.

    You need 17 cycles from start of no-wait RDFAST to the zero-cycle _RET_ that starts a new XBYTE, or 13 cycles between them excluding both instructions assuming latter is two cycles. I've never tried a RDBYTE/WRBYTE when I know FIFO is refilling but I don't recommend it. My experience with the streamer using the FIFO is that simultaneous random hub accesses take far longer than expected. This was the case for streamer running at sysclk/20, anyway. I recommend placing RDFAST immediately after RDBYTE/WRBYTE.

    EDIT:
    Changed cycle values.

  • TonyB_ Posts: 2,196
    edited 2022-08-12 12:40

    @rogloh said:
    UPDATE: @TonyB_ If I can use the Z flag to signal between the XBYTE instructions, I could set it when clearing branchaddr (moved down towards the end of each instruction handler snippet after we're done using flags) and also when cancelling these jumps in the branch_ops code. This avoids that "test branchaddr wz" each time because Z would be valid already. Z flag would be cleared (only) by the branch taken case in the branch_ops code, and set otherwise. This could save a critical cycle in the alu_ops case. :smile:

    C and Z are both preserved between XBYTEs if CZ writing disabled, so you could use either. I thought C might be better because MOV with byte values could clear C for free cycle-wise.

  • rogloh Posts: 5,837
    edited 2022-08-12 12:40

    @TonyB_ said:

    @rogloh said:
    Do you know how long it takes to refill a FIFO before the next RFBYTE is valid with the no wait option to RDFAST? If it's more than some number of instructions worth I might have issues maintaining the full speed. I'm hoping it's only around a hub window cycle delay but it might be more. Also an intervening RDBYTE or WRBYTE might affect this time I suspect. Hopefully the P2 is smart enough to keep refilling in spare time while the data from other slots arrives before the RDBYTE gets its address requested, but it might be fully pre-empted until RDBYTE completes and starved during this time, not sure about that. That's sort of why I put it after the RDBYTE, but it would be nice to move it earlier in the sequence to give it the most time to complete.

    You need 17 cycles between no-wait RDFAST and the zero-cycle _RET_ that starts a new XBYTE, or 15 cycles between them excluding both instructions assuming latter is two cycles. I've never tried a RDBYTE/WRBYTE when I know FIFO is refilling but I don't recommend it. My experience with the streamer using the FIFO is that simultaneous random hub accesses take far longer than expected. This was the case for streamer running at sysclk/20, anyway. I recommend placing RDFAST immediately after RDBYTE/WRBYTE.

    Ok thanks...I think the only way to get that much time is to put the RDFAST right after the RFBYTE and see whatever happens, although any hold up will mess up the video. The current time is only about 10 clocks according to the number of intervening instructions I have, but the xcont will wait a bit as well and that helps too. This thing is really tight, which makes it so interesting to me...

    I guess another way to think about it is that in the cycle budget, let's say it's 52 clocks total, if we don't initiate the RDFAST within 52-17-6 = 29 clocks of starting the EXECF handler, there won't be enough time left to complete before the next XBYTE loop starts reading from the FIFO. It's only going to be those intervening RDBYTE/WRBYTEs that might mess up things there though. Doing HUB memory access per instruction is the real MIPS killer here.
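As a sketch of that deadline calculation: the function names are hypothetical, the 52-clock budget, 17-clock refill time and 6 XBYTE clocks are the figures quoted above, and any extra delay caused by an intervening RDBYTE/WRBYTE is ignored.

```python
def latest_rdfast_start(total_clocks=52, fifo_refill_clocks=17, xbyte_clocks=6):
    """Latest clock offset into the handler at which a no-wait RDFAST can begin
    and still leave the FIFO refill time before the next XBYTE fetch."""
    return total_clocks - fifo_refill_clocks - xbyte_clocks

def rdfast_in_time(start_offset, **budget):
    """True if an RDFAST issued at start_offset clocks meets the deadline."""
    return start_offset <= latest_rdfast_start(**budget)

print("latest RDFAST start:", latest_rdfast_start(), "clocks into the handler")
print("issued at clock 2 :", rdfast_in_time(2))    # early in the handler (assumed offset)
print("issued at clock 35:", rdfast_in_time(35))   # late in the handler (assumed offset)
```

The same function can be re-run with a different total, for example `latest_rdfast_start(total_clocks=54)` for a 337.5MHz budget.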

    @TonyB_ said:
    C and Z are both preserved between XBYTEs if CZ writing disabled, so you could use either. I thought C might be better because MOV with byte values could clear C for free cycle-wise.

    Yeah I'll take a look.

    By the way @PurpleGirl, if we get this thing to work at 325MHz that is actually a good frequency for 1024x768 video output too. It would be possible to have another COG doing analog video rendered from a text or graphics buffer that is visible to the Gigatron's RAM memory space. Then you can have multiple screens, the normal one at 160x120 over VGA/EGA ports, and also including a high resolution display with minimal overhead needed to control it. This is starting to approach something that is more in line with what you wanted initially I think. If the Gigatron can wind down its own video output to lower the number of active scan lines (or disable it entirely) you could have a higher performing emulator outputting to text/graphics video displays at much higher resolutions, still running your Gigatron application code.

    Also if my memory driver mailboxes are mapped into the Gigatron's RAM you can control external memory and you should be able to get HD graphics framebuffers output by the Gigatron using a P2 Edge with PSRAM. That's probably a first too. Lots of things will open up. :smile:

  • @rogloh said:
    Doing HUB memory access per instruction is the real MIPS killer here.

    Exactly. Have you considered reading one hub long every 4th OUT instruction instead of one hub byte every instruction? Also using cog RAM for page zero, as I mentioned earlier?

  • rogloh Posts: 5,837
    edited 2022-08-12 13:59

    @TonyB_ said:

    @rogloh said:
    Doing HUB memory access per instruction is the real MIPS killer here.

    Exactly. Have you considered reading one hub long every 4th OUT instruction instead of one hub byte every instruction? Also using cog RAM for page zero, as I mentioned earlier?

    No, because it seems that may only help a subset of memory cases that use page 0 and may add further complexity to manage if the zero page is accessed indirectly with YX or YD, if Y=0 (which needs an additional test to prevent regular HUB RAM access etc). The long every 4 Gigatron cycles (if you mean for video) could give some instruction timing slop and stat muxing gains but I am trying to avoid it if we can go 100% synchronous. Plus, the XCONT per Gigatron instruction potentially gives us the delay we need to avoid FIFO address change underflow when instruction sequences get run (too) fast for that.

    By the way, I've now coded up the IO mode sequences as well that allow the SPI input pins to be read when the XBYTE table is flipped to operate in this IO mode when sourcing the bus from "IO memory" instead of from actual memory reads. Only the BRANCH ops and ALU ops need a different sequence, as the STORE ops and CTRL ops don't ever read from memory so the existing sequences for them can be retained in both IO and MEMORY modes. Here's the code. I coded it assuming the Z flag is used for branching and that an aligned nibble of SPI MISO pins is used on the P2. There was enough processing time, but just not enough SKIPF bits to read all 4 from disparate pins, so they are packed into a nibble.

    alu_io_ops  ' 24 clocks max to return to XBYTE loop from here
                                rfbyte  d                   'read immediate parameter
                if_nz           rdfast  nowait, branchaddr  'branch
    
                                getnib  alu, MISOPORT, #MISONIB'|    bus=[io]      |       |
                                or      alu, #$f0           '   |    bus=[io]      |       |
                                mov     alu, d              ' bus=d      |         |       |
                                mov     alu, ac             '   |        |       bus=ac    |
                                mov     alu, input          '   |        |         |     bus=input
                                                            '  LD (empty)
                                and     alu, ac             '  |   and   |     |     |     |
                                or      alu, ac             '  |    |    or    |     |     |
                                xor     alu, ac             '  |    |    |    xor    |     |
                                add     alu, ac             '  |    |    |     |    add    |
                                subr    alu, ac             '  |    |    |     |     |    sub
    
                                getbyte ac, alu, #0         '  ac=alu    |      |          |
                                getbyte x, alu, #0          '    |     x=alu    |          |
                                getbyte y, alu, #0          '    |       |    y=alu        |
                                incmod  x, #255             '    |       |      |       increment?
                                getbyte output, alu, #0     '    |       |      |       out=alu
    
                                mov     branchaddr, #0 wz   'clear for next time
                                xcont   vgaout, output
                                wxpin   output, #OUT_REPO
                                wxpin   ac, #XOUT_REPO
                _ret_           rqpin   input, #IN_REPO
    
    branch_io_ops '28 clocks max to return to XBYTE loop from here             
                                rfbyte  d                   'read immediate parameter   
                if_nz           rdfast  nowait, branchaddr  'branch
    
                                getnib  bus, MISOPORT, #MISONIB'|    bus=[io]      |       |
                                or      bus, #$f0           '   |    bus=[io]      |       |
                                mov     bus, d              ' bus=d      |         |       |
                                mov     bus, ac             '   |        |       bus=ac    |
                                mov     bus, input          '   |        |         |     bus=input
    
                                                            '    >   <   <>  =  >=  <=   always  farjmp
                                getptr  branchaddr          '    a   b   c   d   e   f     g       |
                                sets    branchaddr, bus     '    a   b   c   d   e   f     g       h
                                add     branchaddr, bus     '    a   b   c   d   e   f     g       h
                                setd    branchaddr, y       '    |   |   |   |   |   |     |       h
                                add     branchaddr, rombase '    |   |   |   |   |   |     |       h
                                test    ac wz               '    a   b   c   d   e   f     |       |
                                testb   ac, #7 wc           '    a   b   c   d   e   f     |       |
                if_c_or_z       mov     branchaddr, #0      '    a   |   |   |   |   |     |       |
                if_nc_or_z      mov     branchaddr, #0      '    |   b   |   |   |   |     |       |
                if_z            mov     branchaddr, #0      '    |   |   c   |   |   |     |       |
                if_nz           mov     branchaddr, #0      '    |   |   |   d   |   |     |       |
                if_c            mov     branchaddr, #0      '    |   |   |   |   e   |     |       |
                if_00           mov     branchaddr, #0      '    |   |   |   |   |   f     |       |
    
                                xcont   vgaout, output
                                wxpin   output, #OUT_REPO
                                wxpin   ac, #XOUT_REPO
                _ret_           rqpin   input, #IN_REPO
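The conditional cancelling at the end of branch_io_ops can be cross-checked with a small model. This is only a sketch: it assumes AC is treated as a signed 8-bit value, that `test ac wz` sets Z when AC is zero, that `testb ac, #7 wc` copies bit 7 into C, and that the SKIPF pattern leaves exactly one of the conditional `mov branchaddr, #0` lines active per branch type. The Python names are hypothetical.

```python
import operator

# Signed comparison against zero that each branch type is expected to perform.
EXPECTED = {">": operator.gt, "<": operator.lt, "<>": operator.ne,
            "=": operator.eq, ">=": operator.ge, "<=": operator.le}

def signed8(ac):
    """Interpret the low 8 bits of ac as a two's complement value."""
    ac &= 0xFF
    return ac - 256 if ac & 0x80 else ac

def branch_cancelled(cond, ac):
    """Model of the conditional 'mov branchaddr, #0' for one branch type."""
    z = (ac & 0xFF) == 0        # test  ac wz
    c = bool(ac & 0x80)         # testb ac, #7 wc
    return {
        ">":  c or z,               # if_c_or_z
        "<":  (not c) or z,         # if_nc_or_z
        "<>": z,                    # if_z
        "=":  not z,                # if_nz
        ">=": c,                    # if_c
        "<=": (not c) and (not z),  # if_00 (C=0 and Z=0)
    }[cond]

# A mismatch is any case where the branch is cancelled although the comparison
# holds, or kept although the comparison fails.
mismatches = [(cond, ac) for cond, compare in EXPECTED.items()
              for ac in range(256)
              if branch_cancelled(cond, ac) == compare(signed8(ac), 0)]
print("mismatches:", mismatches)
```

An empty mismatch list means the six condition codes cancel the branch exactly when the corresponding signed comparison fails, under the assumptions stated above.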
    
  • @TonyB_ said:
    That's good news. The video interrupt could read an indirection table word from hub RAM with page and offset (16 P2 cycles max) then read an entire video line (55 cycles max) during horizontal blanking. This would take less time than two Gigatron blanking pixels of which there are 40. No Gigatron instructions would be needed for video bit-banging. During the active display the video interrupt frequency would be 6.25/4 MHz.

    If you rely on the ROM, you don't need to do any of this, but if you want to have your own controller, then yeah. And one little arcane bit is that abrupt changes to the offset are carried over somehow. I'd have to look at the ROM to see how. There is a shortcut here: once user software changes the offset at one point in the table, it carries on for the rest of that frame. So if the first offset is 1 and the rest are zero, they are all read at 1. That makes it easier on the user end, since the vCPU code only has to change 1 byte to scroll the entire screen to the side. But I'm not sure how that is evaluated, since it supposedly can scroll both ways.

    As an optimization, you could set aside some cog RAM to hold an entire line and read from that for 4 actual lines. (If I were doing this on an FPGA, I'd write to the output and a block of BRAM on the first line in a group, then read from the BRAM the next 3 lines. Or start a line early since BRAM is dual-ported and always read from the BRAM line buffer. In that case, one might need to do the writes starting a cycle later than the reads to prevent collisions or BRAM malfunction -- you can read and write BRAM at the same time, but only at different addresses.)

  • @rogloh said:
    @PurpleGirl, do you happen to know what are the particular CTRL instructions required? In the instruction encoding I seem to have 8 of them (all the ones that STORE by reading from memory and writing at the same time), as it's just the 8 addressing modes that change in that situation. Are all 8 of these mapped to the same control instructions or just a couple of them like you mentioned earlier? If a subset, which ones? I can't find that documented but it's probably important.

    The Control codes I was referring to are I/O control codes and have nothing to do with the "control unit" of the CPU. Those are extra codes that are sent in ROM using the undefined instructions (the ones driving both the /OE and /WE low) for the new expansion boards. They happen when you write to the undefined instructions and pass the right parameters to things. As long as you have the complete instruction set and reach the undefined ones the same way they do, the ROM will be able to do this. This is only needed for expansion (adding over 64K or SPI sockets).

  • rogloh Posts: 5,837
    edited 2022-08-13 13:14

    Here's a sample alpha release of a possible Gigatron emulator (code is still untested but should be complete now I think).

    It includes IO expansion support and 128kB RAM with banking, using the special CTRL instruction, along with quad LED output, the Gigatron digital 6 bit "EGA" output with a duplicated P2 VGA video output, and an audio DAC output, and a game controller port option.

    My only concern is whether it can work 100% synchronously at 6.25 MIPS because the P2 FIFO reconfiguration with the no wait option needs to be used. I don't know if it is going to work with my XBYTE sequences in the presence of other RDBYTE/WRBYTE operations. Also I really want to place my RDFAST right at the start of the EXECF blocks (just after the RFBYTE) as that could save me another cycle in the ALU by clearing flags as a side effect of other executed instructions. That brings the slowest path ALU opcode timing to 47 clocks (which should work with 300MHz). But doing so will incur the collision with a RDBYTE.

    This whole FIFO reconfigure time is the potential dealbreaker for the entire concept and needs to be tested. If the FIFO's partially refilled pipeline gets fully cancelled with a RDBYTE and has to start again from scratch after the RDBYTE completes then this is a problem, but if it is only suspended and still allows reading of at least the first instruction loaded into the FIFO's contents while the rest of the data is still arriving then maybe it might be okay. Time will tell.

  • TonyB_ Posts: 2,196
    edited 2022-08-13 13:31

    @rogloh said:
    This whole FIFO reconfigure time is the potential dealbreaker for the entire concept and needs to be tested. If the FIFO's partially refilled pipeline gets fully cancelled with a RDBYTE and has to start again from scratch after the RDBYTE completes then this is a problem, but if it is only suspended and still allows reading of at least the first instruction loaded into the FIFO's contents while the rest of the data is still arriving then maybe it might be okay. Time will tell.

    I think RDBYTE will yield to the FIFO, but testing would be useful.

  • TonyB_ Posts: 2,196
    edited 2022-08-14 00:58

    Further thought: RDBYTE will be delayed until FIFO filling has finished, some time after RDFAST with wait.

  • TonyB_ Posts: 2,196
    edited 2022-08-14 19:23

    Thoughts confirmed by testing using code below:
    RDBYTE after RDFAST with wait takes 10 cycles more than normal RDBYTE.

        rdbyte  $1ff,addr0
        getct   t0
        rdbyte  $1ff,addr0  '   16 cycles
        rdbyte  $1ff,addr1  ' 9-16 cycles
        getct   t1          '* 
        rdfast  #0,addr2    '10-17 cycles
        getct   t2          '*
        rdbyte  $1ff,addr3  '19-26 cycles   
        getct   t3
    
    '* remove for two fastest slice differences
    '  otherwise adds 8 cycles to next instr
    '
    'quickest execution if
    ' addr1 - addr0 = 8*N + 4*1 (slice diff = +1)
    ' addr2 - addr1 = 8*N + 4*2 (slice diff = +2)
    ' addr3 - addr2 = 8*N + 4*3 (slice diff = +3)
    

    Above timings apply generally.

    EDIT:
    WRBYTE after RDFAST with wait takes 10 cycles more than normal WRBYTE.

  • Just did an experiment with RDFAST not waiting...test results look like my Gigatron emulation code might just make it as coded (for reads anyway), but that still has to be proven definitively. I randomized the egg-beater, then did a RDFAST with no wait, then immediately read successive RFBYTE values from the FIFO back to back until I saw the first valid result. Results indicate the data becomes valid somewhere between the 4th read and the 8th read. Also I found that a FIFO read too early just returns zeroes, not what is already in the FIFO.
    In my current Gigatron ALU handling, the RDFAST has up to 7 instructions before the last _ret_ instruction which would need the FIFO data. The XCONT will thankfully lock the code to always take this long when its execution path is shorter.
    For STOREs I have anywhere between 5 and 7 instructions after the RDFAST to the WRBYTE. I could potentially delay this further if it has detrimental effects (I need to test that too).

    Read results and test program are included:

    ( Entering terminal mode.  Press Ctrl-] or Ctrl-Z to exit. )
    Cog0  INIT $0000_0000 $0000_0000 load
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE, $0000_00FA, $0000_00ED
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE, $0000_00FA
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE, $0000_00FA
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE, $0000_00FA
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE, $0000_00FA, $0000_00ED
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE, $0000_00FA
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE, $0000_00FA
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE, $0000_00FA
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE, $0000_00FA
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE, $0000_00FA, $0000_00ED
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE, $0000_00FA
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE, $0000_00FA, $0000_00ED
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE, $0000_00FA, $0000_00ED
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE, $0000_00FA
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE, $0000_00FA, $0000_00ED
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055
    Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055
    

    Test program:

    CON
        _clkfreq = 100000000
    
        DEBUG_BAUD = 115200
    
    DAT
        orgh 
    start
        org
    
        mov r1, #50  'loop 50 times
    loop
        rdfast #0, ##@pattern1 ' load up the fifo with some default data
        rfbyte r0
        rfbyte r0
        waitx #15 wc ' randomize egg beater timing
        rdfast nowait, ##@pattern2
        rfbyte d0
        rfbyte d1
        rfbyte d2
        rfbyte d3   ' * earliest valid data
        rfbyte d4   
        rfbyte d5
        rfbyte d6
        rfbyte d7   ' * latest valid data
        rfbyte d8
        rfbyte d9
        rfbyte d10
        rfbyte d11
        rfbyte d12
        rfbyte d13
        rfbyte d14
        rfbyte d15
        ' DEBUG(UHEX_REG_ARRAY(#r0, #2))
        DEBUG(UHEX_REG_ARRAY(#d0, #10))
        djnz   r1, #loop
        cogid pa
        cogstop pa
    
    r0  long 0
    r1  long 0
    nowait long $80000000
    
    d0 long 0
    d1 long 0
    d2 long 0
    d3 long 0
    d4 long 0
    d5 long 0
    d6 long 0
    d7 long 0
    d8 long 0
    d9 long 0
    d10 long 0
    d11 long 0
    d12 long 0
    d13 long 0
    d14 long 0
    d15 long 0
    
        orgh
    
    pattern1    long  $AABBCC00
                long  $11111111[15]
    
    pattern2    long $55555555
                long $FEEDFACE
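
    A rough way to summarise a log like the one above (a Python sketch added for illustration, not part of the test program): since pattern2 starts with $55 and a too-early FIFO read returns zero, the 1-based position of the first non-zero value in each DEBUG line gives the read on which data became valid. The helper name `first_valid_read` is made up here, and only two lines of the log are pasted in as sample input.

```python
import re

# Two lines copied from the log above; paste the whole log here to summarise it.
LOG = """\
Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055, $0000_0055, $0000_00CE, $0000_00FA, $0000_00ED
Cog0  #d0 = $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0000, $0000_0055, $0000_0055, $0000_0055
"""

def first_valid_read(line):
    """Return the 1-based index of the first non-zero RFBYTE result, or None."""
    values = [int(v.replace("_", ""), 16)
              for v in re.findall(r"\$([0-9A-Fa-f_]+)", line)]
    for index, value in enumerate(values, start=1):
        if value != 0:
            return index
    return None

reads = [first_valid_read(line) for line in LOG.splitlines() if "#d0" in line]
print("earliest valid read:", min(reads))
print("latest valid read:  ", max(reads))
```

    This treats any non-zero byte as valid, which only works because the test pattern contains no zero bytes.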
    
  • TonyB_TonyB_ Posts: 2,196
    edited 2022-08-14 12:05

    @rogloh said:
    In my current Gigatron ALU handling, the RDFAST has up to 7 instructions before the last ret instruction which would need the FIFO data. The XCONT will thankfully lock the code to always take this long when its execution path is shorter.
    For STOREs I have anywhere between 5 and 7 instructions after the RDFAST to the WRBYTE. I could potentially delay this further if it has detrimental effects (I need to test that too).

    Yes, there must be a minimum of seven 2-cycle instructions between the no-wait RDFAST and the _RET_ instruction that starts a new XBYTE, or five 2-cycle and one 3-cycle.

    I've re-done last night's testing and RDBYTE + RDBYTE + RDFAST with wait + RDBYTE takes 54 cycles minimum. This confirms that final RDBYTE has a 10 cycle delay due to FIFO filling, which I think is completed during this RDBYTE when the seven other slices are read that RDBYTE does not use. (FIFO depth is 19 longs.)

  • roglohrogloh Posts: 5,837
    edited 2022-08-14 12:16

    I believe in my various tests (randomizing the egg-beater timing and differences between FIFO and random read/writes) I have found a path that should work. It will require 46 clocks + 6 for XBYTE = 52 clocks per Gigatron instruction. This means it will need to run at 325 MHz. It is yet to be proven in the real emulator but unless there are some more buried quirks in the FIFO I've not found (which there could be), it should have a chance of working.

    Due to the XCONT use, which is syncing the emulator to 6.25 MIPS, the write operation will also have enough space between the RDFAST and the ret instructions. In fact, despite looking like 7 instructions, it is now going to be at least 9 instructions in there because we are locked to a cycle of 52 clocks per emulated instruction.
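
    The clock budget works out as follows (a quick Python check added for illustration, using only the figures quoted in this post):

```python
GIGATRON_MIPS = 6.25e6   # emulated instruction rate the XCONT locks to
HANDLER_CLOCKS = 46      # worst-case handler path quoted above
XBYTE_OVERHEAD = 6       # XBYTE overhead quoted above

clocks_per_instruction = HANDLER_CLOCKS + XBYTE_OVERHEAD
sysclk_hz = clocks_per_instruction * GIGATRON_MIPS

print(clocks_per_instruction)    # 52
print(sysclk_hz / 1e6, "MHz")    # 325.0 MHz
```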

    st_ops  ' 42 clocks max to return to XBYTE loop from here 
                                rfbyte  d                   'read immediate parameter
    
                                mov     addr, d             '   0d   |    yd   |   0d   0d   0d   |
                                mov     addr, x             '   |    0x   |   yx   |    |    |   yx++
                                setbyte addr, y, #1         '   |    |    yd  yx   |    |    |   yx++
                                incmod  x, #255             '   |    |    |    |   |    |    |   yx++
                                mov     x, ac               '   |    |    |    |  x=ac  |    |    |
                                mov     y, ac               '   |    |    |    |   |   y=ac  |    |
    
                                cmpsub  addr, banksize wc   'check & clear A15
                if_c            add     addr, ramoffset     'add bank offset
                                add     addr, rambase       'add RAM base address
    
                                wrbyte  d, addr             'RAM[addr]=d       |              |
                                wrbyte  ac, addr            '      |      RAM[addr]=ac        |
                                wrbyte  input, addr         '      |           |        RAM[addr]=input
    
                if_nz           rdfast  nowait, branchaddr  'delayed branch if needed
                                mov     branchaddr, #0 wz   'clear for next time
                                nop
                                nop
                                nop
                                xcont   egaout, output
                                wxpin   output, #OUT_REPO
                                wxpin   ac, #XOUT_REPO
                _ret_           rqpin   input, #IN_REPO
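
    For anyone following the address arithmetic in st_ops, here is a small Python model of the cmpsub / if_c add / add sequence (a sketch only). BANKSIZE is assumed to be $8000 from the "check & clear A15" comment; the RAMOFFSET and RAMBASE values are placeholders chosen for the example, not taken from the real source.

```python
BANKSIZE = 0x8000          # assumed from the "check & clear A15" comment
RAMOFFSET = 3 * BANKSIZE   # placeholder: pretend an upper bank sits 3 banks in
RAMBASE = 0x10000          # placeholder: hub address where Gigatron RAM starts

def hub_address(addr):
    """Model of the three address instructions for a 16-bit Gigatron address."""
    carry = addr >= BANKSIZE      # cmpsub addr, banksize wc
    if carry:
        addr -= BANKSIZE          # ...cmpsub subtracts only when addr >= banksize
        addr += RAMOFFSET         # if_c add addr, ramoffset
    return addr + RAMBASE         # add addr, rambase

print(hex(hub_address(0x1234)))   # low 32 KB: only the base is added
print(hex(hub_address(0x9234)))   # A15 set: bank offset applied as well
```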
    

    Here's the latest code (with a few more fixes and missing things I found).

  • TonyB_TonyB_ Posts: 2,196
    edited 2022-08-14 12:38

    @rogloh said:
    I believe in my various tests (randomizing the egg-beater timing and differences between FIFO and random read/writes) I have found a path that should work. It will require 46 clocks + 6 for XBYTE = 52 clocks per Gigatron instruction. This means it will need to run at 325 MHz. It is yet to be proven in the real emulator but unless there are some more buried quirks in the FIFO I've not found (which there could be), it should have a chance of working.

    FIFO will need refilling for XBYTE if there are not fairly frequent branches [see edit] and the 16-bit instructions consume FIFO data twice as fast as 8-bit. Doc says:

    The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously whenever less than (cogs+7) stages are filled, after which point, up to 5 more longs may stream in, potentially filling all (cogs+11) stages. These metrics ensure that the FIFO never underflows, under all potential reading scenarios.
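
    For the 8-cog P2 those formulas work out as below (a small Python sketch added for illustration, not from the documentation). The 2-bytes-per-instruction figure assumes each emulated Gigatron instruction takes an opcode byte and an operand byte from the FIFO.

```python
COGS = 8                                 # the P2 has 8 cogs

fifo_stages = COGS + 11                  # total FIFO depth in longs
refill_threshold = COGS + 7              # FIFO keeps loading while fewer stages than this are filled
fifo_bytes = fifo_stages * 4             # 4 bytes per long
instructions_buffered = fifo_bytes // 2  # assumed 2 bytes (opcode + operand) per instruction

print(fifo_stages, refill_threshold, fifo_bytes, instructions_buffered)   # 19 15 76 38
```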

    EDIT:
    Video bit-banging requires 160 consecutive OUT instructions during active display.

  • Yeah I was concerned about that refilling aspect. However, in my experiment, no matter how quickly I drained the FIFO with repeated RFBYTEs after I issued the RDFAST with a 7 instruction delay before the ret (plus 6 clocks for the XBYTE overhead), I could not cause a RDBYTE to slow down; it never took more than 16 clocks to complete. This is important for the branch case, which does a RDBYTE as soon as 2 clocks after its initial RFBYTE completes.

    I have a feeling Chip may have designed this FIFO to not typically impede regular RDBYTE/WRBYTEs once the FIFO is filled and operating normally (at least in the byte access cases), hence the deep buffer stage that deals with that. That can't be said about the streamer stuff you guys investigated a while back but that's a different issue and thankfully doesn't apply here.

  • Precisely when FIFO starts refilling is a mystery to me, probably immediately after RFxxxx. If FIFO refilling does mess up your very exact timing then do RDFAST every instruction, with either branch or next address as start address.

  • TonyB_TonyB_ Posts: 2,196
    edited 2022-08-14 13:24

    @rogloh said:
    I have a feeling Chip may have designed this FIFO to not typically impede regular RDBYTE/WRBYTEs once the FIFO is filled and operating normally (at least in the byte access cases), hence the deep buffer stage that deals with that. That can't be said about the streamer stuff you guys investigated a while back but that's a different issue and thankfully doesn't apply here.

    I think the FIFO always comes first and anything else comes second. Partial FIFO refilling probably won't delay a random hub access provided the latter does not occur soon after RFxxxx.

  • I don't know whether the following is helpful or not. If you swap instruction and operand even/odd addresses, latter could be read at end of an instruction instead of the start.

  • I don't know if things would be more efficient if you only made things vCPU compatible. I mean, if any vCPU instructions are multiplication or division, for instance, you'd need far less code to make it since the P2 can do stuff like that natively. Plus the hub can be accessed irrespective of alignment, so a number of 16-bit ops would be truly 16-bit. You'd have more overhead from the P2 in the instruction timings, but you'd need fewer instructions. You wouldn't need to increment an index and then do a second move. You'd only need to do that with 8-bit external memory (or even 16-bit memory during unaligned operations).

    As for how to access I/O devices, they need to use "DMA" of some sort, which the P2 is ready for. On sound, there would be no issues. Since it would provide its own bus and not be a part of the video (unless you want it to be), there would be no hardware races. On the Gigatron, it uses the software video syncs to also act as a crude PIA. The Gigatron gets data directly from the accumulator during Out operations that occur during the syncs.

    Now, if you have vCPU without the Gigatron, you'd need to do some things in other ways. You'd likely need to use native P2 code to launch a startup menu and loader as well as initialize RAM variables. So you'd need some sort of "management engine." The ME would start up, maybe test memory, write all the system variables, create LUTs, display a menu, initiate sounds (the startup G chord if you want the same feel), act as the loader, and start vCPU.

    If you do that, then you'd need to change the vCPU a little bit too. For instance, when the Halt opcode is encountered, the supervisor cog could jump in and provide a prompt to continue. And then the supervisor cog resets. As for function calls, since there is no native Gigatron code, you'd need to emulate those too.

  • evanhevanh Posts: 16,039
    edited 2022-08-16 04:49

    Some testing done here - https://forums.parallax.com/discussion/comment/1536211/#Comment_1536211
    Not everything was definitively proven but the effects of FIFO fetches on RDLONG/WRLONG are clear enough. And those effects were what was used to build the stats. Namely, when the FIFO intervenes it forces the RDLONG/WRLONG to skip its timing slot, which delays the instruction completion by 8 clocks each occurrence.

    The core of the test code is this:

            setq    pa      ' set NCO divider
            xinit   xmod, #0    ' Streamer startup
    ' go!
            mov pa, ##(blocks << 9) - 1 ' to suit SETQ2
    
            wrlong  pb, #(8 << 2)   ' align first GETCT for zero gap in start time
    '       rdlong  pb, #(3 << 2)   ' align first GETCT for zero gap in start time
            getct   pb
            setq2   pa
            wrlong  0, #0       ' Repeated block copy all of lutRAM (512 longwords) sequentially into hubRAM
    '       rdlong  0, #0       ' Repeated block copy sequentially from hubRAM into all of lutRAM (512 longwords)
            getct   pa
    ' done!
            xstop           ' Streamer shutdown
    

    So I was using a bursting WRLONG over a large range to measure how many times the FIFO delayed the cog's writes to hubRAM.
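
    The counting idea can be sketched in Python like this (an illustration only, and not necessarily how the stats in the linked thread were computed). It assumes a baseline run with the streamer idle is available for comparison; the tick values below are invented purely to show the arithmetic.

```python
STALL_CLOCKS = 8   # clocks lost per skipped hub slot, as stated above

def elapsed(start_ct, end_ct):
    """Clocks between two GETCT samples, allowing for 32-bit counter wrap."""
    return (end_ct - start_ct) & 0xFFFFFFFF

def fifo_stalls(measured_clocks, baseline_clocks):
    """Estimate how many times the FIFO pushed the burst back by one hub slot."""
    extra = measured_clocks - baseline_clocks
    if extra < 0 or extra % STALL_CLOCKS:
        raise ValueError("extra time is not a whole number of 8-clock stalls")
    return extra // STALL_CLOCKS

# Hypothetical numbers, purely to show the arithmetic:
baseline = elapsed(1_000, 5_096)        # run with the streamer idle
measured = elapsed(1_000, 5_136)        # run with the streamer fetching via the FIFO
print(fifo_stalls(measured, baseline))  # 5
```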

    Attached is the test code as it was left in February.

  • @PurpleGirl said:
    I don't know if things would be more efficient if you only made things vCPU compatible. I mean, if any vCPU instructions are multiplication or division, for instance, you'd need far less code to make it since the P2 can do stuff like that natively. Plus the hub can be accessed irrespective of alignment, so a number of 16-bit ops would be truly 16-bit. You'd have more overhead from the P2 in the instruction timings, but you'd need fewer instructions. You wouldn't need to increment an index and then do a second move. You'd only need to do that with 8-bit external memory (or even 16-bit memory during unaligned operations).

    I've not looked into the vCPU instructions but I think it's a different beast to build that vs a pure Gigatron capable machine. It may be faster but if you want fast, why not just write in native P2 PASM?

    As for how to access I/O devices, they need to use "DMA" of some sort, which the P2 is ready for. On sound, there would be no issues. Since it would provide its own bus and not be a part of the video (unless you want it to be), there would be no hardware races. On the Gigatron, it uses the software video syncs to also act as a crude PIA. The Gigatron gets data directly from the accumulator during Out operations that occur during the syncs.

    I've tried to do IO the Gigatron way with the expansion port for SPI. I think it would be possible to map these channels to both the flash and SD cards on P2 boards via additional COG(s). Other non-SPI peripherals can be memory mapped into the P2 address space from Gigatron's 128kB RAM (with more bank bits you could possibly double this). It could then output to higher resolution video modes and to external memory with other COGs.

    A sample pin mapping with IO could be a system targeting the P2-EDGE or P2-EVAL boards like this:

    • P0-P5 6 bit EGA colour port
    • P6 HSYNC (Game port CLK)
    • P7 VSYNC (Game port LATCH)
    • P8 Game port data
    • P9 reserved (could be additional audio DAC out)
    • P10 SPI MOSI
    • P11 SPI CLK
    • P12 LED 1 (shared with a smartpin repository)
    • P13 LED 2 (ditto)
    • P14 LED 3 (ditto)
    • P15 LED 4
    • P16-P19 - 4 channels of SPI CS pins
    • P20-P23 - 4 channels of SPI MISO pins
    • P24-P31 - VGA & AUDIO breakout
    • P32-P39 - optional DVI output or USB/PS/2 pins or second VGA breakout
    • P40-P57 - (optional PSRAM) or GPIO
    • P58-P61 FLASH & SD card
    • P62-P63 TX/RX serial
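
    As a sanity check on a mapping like this, a few lines of Python can confirm that every pin from P0 to P63 is assigned exactly once (a sketch added for illustration; the table simply restates the list above and `check_pin_map` is a made-up helper name):

```python
# (first pin, last pin, function) restating the list above
PIN_MAP = [
    (0, 5, "6 bit EGA colour port"),
    (6, 6, "HSYNC (game port CLK)"),
    (7, 7, "VSYNC (game port LATCH)"),
    (8, 8, "game port data"),
    (9, 9, "reserved"),
    (10, 10, "SPI MOSI"),
    (11, 11, "SPI CLK"),
    (12, 15, "LEDs 1-4"),
    (16, 19, "SPI CS x4"),
    (20, 23, "SPI MISO x4"),
    (24, 31, "VGA & audio breakout"),
    (32, 39, "optional DVI / USB / PS/2 / second VGA"),
    (40, 57, "optional PSRAM or GPIO"),
    (58, 61, "flash & SD card"),
    (62, 63, "TX/RX serial"),
]

def check_pin_map(pin_map, total_pins=64):
    """Return (owners, missing); raise if any pin is assigned twice."""
    owners = {}
    for first, last, name in pin_map:
        for pin in range(first, last + 1):
            if pin in owners:
                raise ValueError(f"P{pin} used by both {owners[pin]} and {name}")
            owners[pin] = name
    missing = [pin for pin in range(total_pins) if pin not in owners]
    return owners, missing

owners, missing = check_pin_map(PIN_MAP)
print(len(owners), missing)   # 64 []
```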

    If you want VGA or DVI (replicating Gigatron) as well as a secondary DVI/VGA port for higher resolution screens you can also drop the EGA colour output data and reuse those 6 pins for USB and/or PS/2, or use the PSRAM pins if they are free. As well as pin 9 there's also a free pin on the VGA breakout you could use to gain a total of 2 extra IO pins if you built your own setup. Plenty of possibilities, and up to 4 SPI peripherals as well for RTCs or network connections etc. The P2 is ideal for this sort of system flexibility.

    Now, if you have vCPU without the Gigatron, you'd need to do some things in other ways. You'd likely need to use native P2 code to launch a startup menu and loader as well as initialize RAM variables. So you'd need some sort of "management engine." The ME would start up, maybe test memory, write all the system variables, create LUTs, display a menu, initiate sounds (the startup G chord if you want the same feel), act as the loader, and start vCPU.

    If you do that, then you'd need to change the vCPU a little bit too. For instance, when the Halt opcode is encountered, the supervisor cog could jump in and provide a prompt to continue. And then the supervisor cog resets. As for function calls, since there is no native Gigatron code, you'd need to emulate those too.

    Yep, lots of software could do lots of things, especially if other COGs get spawned for higher performance audio etc. A 6.25MHz machine could have fairly low latency control of these smart P2 COG peripherals that do all the work.

    I've been looking into this Babelfish thing that lets a Gigatron core accept serial data via its input buffer. I think I've probably done enough now to somewhat emulate the serial byte transfers from P2 host IO to Gigatron via its shift register. There's quite a lot to this to be able to get it to sync up correctly and have the right bits ready at the right polling time in the video sync intervals. The P2 driver code is getting really tight but I think I have my IO COG fitting in the 52 cycle budget now. I had to resort to a couple of tricks with a REP loop to get various branch paths to work.

    I think a good way to go would be to have a third P2 COG that takes input from either PS/2 or USB keyboard, a real attached game controller, or the host serial port and runs the same way as Babelfish and feeds data into the IO helper COG via the P2 IO pins (just like it was connected to a real PluggyMcPlugFace device on the game controller port). I also expect the code I already have written below should still be able to talk to that Pluggy device over the game controller pins (if you have a 5V to 3.3V dropper resistor etc, or can run the Arduino at 3.3V).

    '-----------------------------------------------------------------------------------------------------
    ' The IO helper COG is used to assist the Gigatron COG and starts here.
    '
    ' 3 P2 Smartpins are used in repository mode for fast transfers between COGs.
    ' 2 repositories send output port and extended output port data from the Gigatron COG into the IO COG
    ' 1 repository sends the input port data back to the Gigatron COG from the IO COG.
    '
    ' This COG reads the 6 bit digital video + 2 sync IO data output by the Gigatron COG and transforms it into
    ' an analog VGA signal using the P2 colour space converter and video DACs and regenerates both sync signals.
    '
    ' It also drives the LED "blinkenlights" data onto pins and audio DAC data from the extended output port 
    ' to P2 pins, making use of the audio DAC capabilities in the P2.
    '
    ' It reads in the data from a game controller pin via the shift register on a S/NES joystick interface.  
    ' This input data can potentially be overridden by the P2's host serial port data (TODO).
    '-----------------------------------------------------------------------------------------------------
                                orgh
    
    iocog
                                org
    
                                ' enable LED outputs
                                test        invertleds wz
                if_nz           wrpin       ledconfig, #LED_PINGROUP   ' invert LEDs
                                drvl        #LED_PINGROUP   ' start cleared
    
                                ' init SPI pins
                                drvh        #SPI_CS_PINGROUP
                                drvh        #SPI_MOSI_PIN
                                drvh        #SPI_CLK_PIN
    
                                'setup DAC pin outputs for 4 bit audio
                                fltl        #AUDIO_DAC_PIN
                                wrpin       dacconfig, #AUDIO_DAC_PIN
                                drvh        #AUDIO_DAC_PIN ' enable audio DACs
    
                                ' setup async serial RX on a Smartpin @115200bps
                                fltl        #SERIAL_RX_PIN
                                wrpin       serialconfig, #SERIAL_RX_PIN
                                wxpin       serialmode, #SERIAL_RX_PIN
                                drvl        #SERIAL_RX_PIN
    
                                ' setup RGB video output pins, streamer frequency and colour space converter
                                wrpin       vgaconfig, #(VGA_BASE_PIN+1) addpins 2
                                wrpin       vgasyncconfig, #VGA_BASE_PIN 
                                setxfrq     vgafreq
                                setcy       ##$5A000000 + 76
                                setcy       ##$005A0000 + 76
                                setcy       ##$00005A00 + 76
                                setcmod     #$20                    ' setup vga RGB mode
                                drvl        #VGA_VSYNC_PIN          ' TODO init with correct sync state?
                                xinit       vgaout, #0
    
                                ' setup special condition to wait on (rising repo pin)
                                setse1      #OUT_REPO addpins 1 ' rising edge
                                pollse1     ' clear any initial state
    
    reploop                     rep         #26, #0                 ' 52 clock cycle loop max
                                waitse1                             ' wait for repository
                                rdpin       outputdata, #OUT_REPO   ' get video data
    
                                testbn      outputdata, #VSYNC wz   ' get vsync pin state
                                alts        outputdata, #colourmap  ' transform colour data to DAC values
                                xcont       vgaout, 0-0             ' send pixel to VGA port
                                drvnz       #VGA_VSYNC_PIN          ' copy vsync state
    
                                testb       outputdata, #HSYNC wc   ' test if current hsync is high
                                testbn      last, #HSYNC andc       ' and the last hsync was low
                                testb       last, #VSYNC andz       ' z=1 if falling high->low edge 
                                mov         last, outputdata        ' save state for next time
    
                if_nc           jmp         #reploop                 ' we are done if not the rising hsync edge
                                rdpin       extended, #XOUT_REPO    ' latch extended output data
    
                                ' light LEDs
                                setnib      LED_PORT, extended, #LED_NIB ' drive LEDs 
    
                                'output audio DAC data
                                and         extended, #$F0          ' remove LED data
                                or          extended, dacconfig     ' retain pin config
                                wrpin       extended, #AUDIO_DAC_PIN' update DAC value
    
                                'write previous last input data to repo (1 shift clock lag)
                                wxpin       inputdata, #IN_REPO
    
    ' - This code below manages the shift register for both serial ASCII or game controller button transfers.
    ' - In the future we may need another COG to do all the input mapping from game controller, serial and PS/2
    '   or USB keyboards to feed Gigatron.  It could replace the Babelfish AVR application. TBD.
    ' - For now just take serial data from host PC (or a game controller) to help test or debug the emulator.
    ' - As with Babelfish the serial received characters to Gigatron are repeated 4 times.
    
                                'We have up to 9 more instructions and 18 clocks max to prepare input data
                if_z            tjnz        repeats, #repeating     ' continue all repeats
                                tjnz        enablecontroller, #game ' using game controller or serial input
                if_nz           jmp         #shiftdata              ' not yet time to latch, only shift
                                rdpin       rxchar, #SERIAL_RX_PIN wc 'check for new data
                                or          rxchar, ffffff          ' set trailing bits high in new data
    
    repeating   if_c            incmod      repeats, #4             ' repeat 4 times if we have new data or repeats
                if_c            mov         rxbuffer, rxchar        ' copy over data from original char to buffer
    
    shiftdata                   rcl         rxbuffer, #1 wc         ' extract top bit, pad with another 1 bit
                                rcl         inputdata, #1           ' repeat loop ends here if no branching done
                                jmp         #reploop                ' if we do fall through, repeat the loop again
    
    game
                                testp       #CTRL_DATA_PIN wc       ' test raw game input pin
                                rcl         inputdata, #1           ' shift into input register (1 bit shift delay)
                                jmp         #reploop
    
    '.....................................................................................................
    
    ' variables
    rxchar                      long    -1
    rxbuffer                    long    0
    last                        long    0
    outputdata                  long    0
    inputdata                   long    -1
    extended                    long    0
    repeats                     long    0
    
    ' LED, video, audio, serial configuration
    ffffff                      long    $FFFFFF
    invertleds                  long    INVERT_LEDS                             '$F to invert LEDs, 0 to not invert
    ledconfig                   long    %001_000000_00000000                    'invert leds
    
    vgaout                      long    $7F010001 + (VGA_BASE_PIN & $38)<<17    'immediate RGBS pixel output command
    vgafreq                     long    $80000000 +/ NCO_DIVIDER + 1            '6.25MHz pixel output rate
    vgasyncconfig               long    %10100_00000000_01_00000_0              '3.3V DAC pin mode for hsync
    vgaconfig                   long    %10111_00000000_01_00000_0              '2V DAC 75ohms pin mode for RGB
    
    dacconfig                   long    %10111_00000000_00_00000_0              '2V DAC pin mode for audio
    
    serialconfig                long    %11111_0                                'asynchronous serial RX smart pin mode
    serialmode                  long    SERIAL_DIVIDER << 16 + 7
    enablecontroller            long    ENABLE_CTRL
    
    ' look up table to map EGA port output state to RGBS data (VSYNC done separately)
    colourmap
                                '        R  G  B  S
                                long    $00_00_00_01[64]  ' during both syncs, blank video
                                long    $00_00_00_00[64]  ' during vsync, blank video
                                long    $00_00_00_01[64]  ' during hsync, blank video
    
                                long    $00_00_00_00
                                long    $55_00_00_00
    ...
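
    To show what the two RCL instructions at shiftdata do over eight hsync edges, here is a small Python model (a sketch only, not part of the driver). It assumes the received serial byte sits in the top 8 bits of rxchar, which is how I read the `or rxchar, ffffff` line, and it starts with C=1, which is an arbitrary choice that does not affect the first eight bits shifted out.

```python
MASK32 = 0xFFFFFFFF

def rcl1(d, c_in):
    """P2 'RCL D,#1 WC': shift D left one bit, C fills bit 0, old bit 31 becomes C."""
    return ((d << 1) | c_in) & MASK32, (d >> 31) & 1

def shift_in_char(char, inputdata=MASK32, carry=1):
    """Shift one received byte, MSB first, into the emulated input register."""
    rxbuffer = ((char & 0xFF) << 24) | 0xFFFFFF        # or rxchar, ffffff / mov rxbuffer, rxchar
    for _ in range(8):                                 # one bit per rising hsync edge
        rxbuffer, carry = rcl1(rxbuffer, carry)        # rcl rxbuffer, #1 wc
        inputdata, _unused = rcl1(inputdata, carry)    # rcl inputdata, #1 (C left unchanged)
    return inputdata

print(hex(shift_in_char(ord("A"))))   # 0xffffff41
```

    After eight shifts the low byte of inputdata holds the character, and the bits shifted in after that are the 1s padded into the low bits of rxbuffer.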
    
  • roglohrogloh Posts: 5,837
    edited 2022-08-16 10:19

    The latest code is, I think, complete... so it's probably ready to test now. I need to figure out a good way to test it.
    Maybe write some dummy ROM program sequences to exercise each instruction type and bus source etc, printing results with DEBUG instructions, looking for any gross instruction opcode errors. Then move to examine instruction timing is met in all cases, test the RAM banking, I/O and the video output (probably needs real Gigatron apps then, or some hacked up test code).

    I've also included a clock output pin that can be defined to provide a pin toggling at the instruction rate of 6.25MHz (3.125MHz output). This was also conveniently used to clear the flag for the jump condition using an XOR pattern that keeps the port output non-zero (so I got it for free). I thought it could be useful for finding cases where the instruction timing fails to meet the expected timing, perhaps using another monitoring COG. It would need to flag a failure whenever the measured interval between edges is not 52 clocks.
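
    As a rough illustration of that check, here is a small host-side model in Python (not PASM2, so it can be run anywhere). It assumes a monitoring COG has captured the system-clock counter at each edge of the clock pin into a list; the list, the function name and the 32-bit wrap handling are assumptions made for the sketch, not part of the actual code.

```python
# Offline model of the timing check: flag any gap between clock-pin edges
# that is not exactly 52 system clocks (the figure quoted in the post).
EXPECTED_INTERVAL = 52

def find_timing_failures(edge_times, expected=EXPECTED_INTERVAL):
    """Return (index, interval) for every edge-to-edge gap != expected.

    edge_times: hypothetical list of system-clock counter values captured
    at successive edges of the clock output pin. The counter is assumed
    to be 32 bits wide, so differences are taken modulo 2**32.
    """
    failures = []
    for i in range(1, len(edge_times)):
        interval = (edge_times[i] - edge_times[i - 1]) & 0xFFFFFFFF
        if interval != expected:
            failures.append((i, interval))
    return failures

good = [1000 + 52 * n for n in range(5)]   # every gap is 52 clocks
bad = [0, 52, 104, 160, 212]               # third gap is 56 clocks
wrapped = [0xFFFFFFF0, 0x00000024]         # 52 clocks across a counter wrap

print(find_timing_failures(good))
print(find_timing_failures(bad))
print(find_timing_failures(wrapped))
```

    Only the `bad` capture reports a failure (at edge 3, with a 56 clock gap); the wrapped capture passes because the subtraction is masked to 32 bits.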

  • Well, the reason to attempt a vCPU without Gigatron native (and use P2 native instead) is to run .GT1 files. You wouldn't need to gate the vCPU core to be equivalent to 6.25 MHz. Just let it run however fast it does, so long as you have an I/O controller complex that writes the pixels at 6.25 MHz (even if you must use more instructions). So the vCPU would have at least a cog, and video could have a cog. Sound could be done in the video cog, though you could provide better frequency response with its own cog, and you might need to recalculate the audio table values or convert on the fly. Thus you could reach higher notes. And since the sound would be produced independently, you wouldn't need to use the X-out since it could have its own accumulator/adder/shifter and do things based on what is in memory. Now, if you need more compatibility with the original programs, for example because it runs too fast when working in parallel with the video/sound, then provide a "halt line" on the vCPU to disable it during active lines, and maybe a key combination to enable/disable this mode.

    And if you implement your own keyboard controller, you can omit the shift register and just emulate a standard register, or use DMA and just write directly to memory. When file I/O is sent that way, it would provide a wider pipe.

    In fact, since everything could be done in separate cogs and speak through the hub and/or LUT memory, you could really do without the In/Out ports or repurpose them (such as another way to send/receive commands from the I/O controllers and/or to facilitate file I/O).

    The virtual vCPU "registers" could be dedicated cog memory. That shouldn't break compatibility for well-behaving programs. By well behaving, I mean using the correct opcodes to access the "registers" and not treating them as memory. You could use Peek/Poke and Deek/Doke directly on the vAc, vSP, etc., but I don't think that is good practice in code. One could even go as far as to add a "private" stack through the stack instructions. (The vCPU has stack support though Gigatron native does not).

    Now, if I wanted to spin my own retro platform using the P2 as a base, I'd probably also limit things to 16 bits, though maybe have some larger instructions, such as providing block moves. So it would have its own ISA, but mainly for interfacing with other things and dealing with bottlenecks. It would be good to make most things translate 1:1 except where combination instructions are needed.

  • @rogloh said:
    A sample pin mapping with IO could be a system targeting the P2-EDGE or P2-EVAL boards like this:

    • P0-P5 6 bit EGA colour port
    • P6 HSYNC (Game port CLK)
    • P7 VSYNC (Game port LATCH)
    • P8 Game port data
    • P9 reserved (could be additional audio DAC out)
    • P10 SPI MOSI
    • P11 SPI CLK
    • P12 LED 1 (shared with a smartpin repository)
    • P13 LED 2 (ditto)
    • P14 LED 3 (ditto)
    • P15 LED 4
    • P16-P19 - 4 channels of SPI CS pins
    • P20-P23 - 4 channels of SPI MISO pins
    • P24-P31 - VGA & AUDIO breakout
    • P32-P39 - optional DVI output or USB/PS/2 pins or second VGA breakout
    • P40-P57 - (optional PSRAM) or GPIO
    • P58-P61 FLASH & SD card
    • P62-P63 TX/RX serial

    I wonder about a number of things here. Why would color map conversion be needed? This is just raw bitmap data that is used verbatim. Or would it be to make the internal DAC work properly with fewer bits feeding it?

    I notice you only have one MOSI, but 4 MISO lines. Multiple CS lines are good even if one has only 1 MOSI/MISO pair, since multiple devices can then be used on the same lines and multicast would be possible. So that would be host hub splitter mode. SPI standards are pretty relaxed in that you can do whatever you want in regards to interrupts and have multiple /CS lines to split into multiple channels, off of a single host. And there are similar standards that make both data pins bidirectional (DSPI), or do that and add 2 more data lines (QSPI).

    And having a keyboard controller in a cog might be nice. PS/2 only takes 2 pins, "clock" and "data." The other 2 (Vcc, Gnd) come from the mainboard. Any "5th pin" is just an alignment key. I believe that is an 11-bit protocol: start, 8 data, parity, and stop. Many controller designs skip the parity check, but it is nice to have in case you lose sync. The keyboard generates the clock pulses. If you must write to the keyboard, there is a protocol for pulling the clock line low for so long, then the data line for so long, then syncing with the keyboard's clock when it starts and sending. But the Gigatron has little or no mechanism for sending back up the In port, though one way it does that is manipulating the vertical sync. That is a kludge at best and very slow (a couple of bytes per second). It also causes screen wiggle artifacts.

    To allow for possible bit expansion, one could probably drive audio and video off of the internal DACs. So you could have 4-8 bit audio off of a single wire, and 6+ bits of video off of just 5 wires (R DAC, G DAC, B DAC, H, V). There are multiple options to save GPIO lines.

  • @PurpleGirl said:
    I wonder about a number of things here. Why would color map conversion be needed? This is just raw bitmap data that is used verbatim. Or would it be to make the internal DAC work properly with fewer bits feeding it?

    The color map I put in just converts 8 bit VHBBGGRR format data that Gigatron outputs into 32 bit R:G:B:S data in the P2 format that is used for driving the VGA DAC outputs. The colours are still going to come out the same on the monitor (though they could be changed if you wanted another look or a monochrome palette for example). But the code I wrote will still output both sets of signals (on different pins), and they should be in lock step. The parallel "EGA" is output on 8 IO pins, and VGA on 5 IO pins (as RGBHV) via analog DACs.
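
    To make the mapping concrete, here is a rough Python model of how such a 256-entry table could be generated. It is a sketch based on the `colourmap` fragment in the earlier listing, under some assumptions: the sync bits in the VHBBGGRR byte are active low, each 2-bit colour field is scaled by $55 to fill 8 bits, and the S byte is $01 whenever HSYNC is active. `build_colourmap` is a name made up for the illustration.

```python
def build_colourmap():
    """Build a 256-entry table indexed by the Gigatron's 8-bit VHBBGGRR output.

    Each entry is a 32-bit R:G:B:S value. Assumptions (not confirmed by the
    full source): sync bits are active low, colour fields are scaled by $55,
    and S is $01 while HSYNC is active (VSYNC is handled separately).
    """
    table = []
    for value in range(256):
        vsync_active = (value & 0x80) == 0
        hsync_active = (value & 0x40) == 0
        sync_byte = 0x01 if hsync_active else 0x00
        if vsync_active or hsync_active:
            r = g = b = 0                       # blank video during any sync
        else:
            r = (value & 0x03) * 0x55           # RR field, bits 1:0
            g = ((value >> 2) & 0x03) * 0x55    # GG field, bits 3:2
            b = ((value >> 4) & 0x03) * 0x55    # BB field, bits 5:4
        table.append((r << 24) | (g << 16) | (b << 8) | sync_byte)
    return table

colourmap = build_colourmap()
print(f"{colourmap[0x00]:08X}")   # both syncs active -> 00000001
print(f"{colourmap[0xC1]:08X}")   # no sync, RR=1     -> 55000000
```

    The first 192 entries come out as the three blanked groups of 64, and entry 193 is `$55_00_00_00`, which agrees with the part of the table that is visible in the listing.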

    I notice you only have one MOSI, but 4 MISO lines. Multiple CS lines are good even if one has only 1 MOSI/MISO pair, since multiple devices can then be used on the same lines and multicast would be possible. So that would be host hub splitter mode. SPI standards are pretty relaxed in that you can do whatever you want in regards to interrupts and have multiple /CS lines to split into multiple channels, off of a single host. And there are similar standards that make both data pins bidirectional (DSPI), or do that and add 2 more data lines (QSPI).

    Yep. The Gigatron can bit bang these pins as needed via its IO expansion interface. I think some SD card data has already been accessed over SPI in this way, I saw that in the Gigatron forums, but I suspect it would be rather slow.
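
    For illustration only, this is roughly what a bit-banged exchange over one shared MOSI and several MISO lines looks like, modelled in Python. The device callbacks are hypothetical stand-ins for whatever is attached to each MISO pin, and the CS lines are left out; it says nothing about how the Gigatron ROM actually drives its expansion port.

```python
def spi_exchange_byte(tx_byte, devices):
    """Shift one byte out on the shared MOSI and sample every MISO line.

    devices: hypothetical list of callables, one per MISO line. Each takes
    the MOSI bit for the current clock and returns that device's MISO bit.
    Bits are sent MSB first, one bit per clock (SPI mode 0 style).
    Returns one received byte per MISO line.
    """
    rx = [0] * len(devices)
    for bit_index in range(7, -1, -1):
        mosi = (tx_byte >> bit_index) & 1
        for n, device in enumerate(devices):
            # All MISO lines are sampled on the same clock edge.
            rx[n] = (rx[n] << 1) | (device(mosi) & 1)
    return rx

def loopback(mosi_bit):
    """Hypothetical device that echoes MOSI straight back."""
    return mosi_bit

def inverter(mosi_bit):
    """Hypothetical device that returns the inverted MOSI bit."""
    return 1 - mosi_bit

received = spi_exchange_byte(0xA5, [loopback, inverter])
print([f"{b:02X}" for b in received])
```

    With eight clocks per byte and several port accesses per clock, it is easy to see why this would be slow on a Gigatron.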

    And having a keyboard controller in a cog might be nice. PS/2 only takes 2 pins, "clock" and "data." The other 2 (Vcc, Gnd) come from the mainboard. Any "5th pin" is just an alignment key. I believe that is an 11-bit protocol: start, 8 data, parity, and stop. Many controller designs skip the parity check, but it is nice to have in case you lose sync. The keyboard generates the clock pulses. If you must write to the keyboard, there is a protocol for pulling the clock line low for so long, then the data line for so long, then syncing with the keyboard's clock when it starts and sending. But the Gigatron has little or no mechanism for sending back up the In port, though one way it does that is manipulating the vertical sync. That is a kludge at best and very slow (a couple of bytes per second). It also causes screen wiggle artifacts.

    Yep PS/2 is not too hard to achieve on a P2 though surprisingly I've not seen a PS/2 COG yet (haven't hunted in detail for one though so it might already be available). We used to have a good PS/2 Keyboard SPIN1 COG on the P1 chip. We do already have a USB COG working for keyboards that has recently been reworked and can now also take USB gamepad information (NeoYume uses it). This would be an easy way to use modern USB keyboards and gamepads with a Gigatron emulation. We just need a COG to adapt and drive these inputs into the INPUT port at the correct time.
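
    For anyone writing such a COG, the keyboard-to-host frame is simple enough to model in a few lines. The Python sketch below assumes the standard 11-bit frame (start bit 0, 8 data bits LSB first, odd parity, stop bit 1); the function names are made up for the example, and a real COG would sample the data pin on each falling edge of the keyboard's clock.

```python
def decode_ps2_frame(bits):
    """Decode one 11-bit PS/2 device-to-host frame.

    bits: list of 11 ints (0 or 1) in arrival order:
    start (0), 8 data bits LSB first, odd parity, stop (1).
    Returns the data byte, or None if framing or parity is wrong.
    """
    if len(bits) != 11:
        return None
    start, data_bits, parity, stop = bits[0], bits[1:9], bits[9], bits[10]
    if start != 0 or stop != 1:
        return None
    # Odd parity: data bits plus the parity bit hold an odd number of ones.
    if (sum(data_bits) + parity) % 2 != 1:
        return None
    return sum(bit << i for i, bit in enumerate(data_bits))

def encode_ps2_frame(byte):
    """Build the 11-bit frame a keyboard would send (test helper)."""
    data_bits = [(byte >> i) & 1 for i in range(8)]
    parity = 1 - (sum(data_bits) % 2)
    return [0] + data_bits + [parity, 1]

frame = encode_ps2_frame(0x1C)          # 0x1C: scan code set 2 make code for 'A'
print(hex(decode_ps2_frame(frame)))     # 0x1c

corrupted = list(frame)
corrupted[1] ^= 1                       # flip one data bit
print(decode_ps2_frame(corrupted))      # None: parity check fails
```

    The parity check is what catches the corrupted frame, which is the loss-of-sync case mentioned in the quote.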

    To allow for possible bit expansion, one could probably drive audio and video off of the internal DACs. So you could have 4-8 bit audio off of a single wire, and 6+ bits of video off of just 5 wires (R DAC, G DAC, B DAC, H, V). There are multiple options to save GPIO lines.

    Yeah that is exactly what I'm doing in the code. In time I'll probably add more configuration options to be able to custom disable other unwanted signal outputs (like the EGA data, if you are only wanting to connect a VGA monitor), or if you don't want any SPI ports. There's plenty of pins though on the P2 for this stuff.

  • roglohrogloh Posts: 5,837
    edited 2022-08-16 15:30

    By the way, I took a look at those vCPU instructions... here's a quick snippet showing roughly the P2 instructions needed in XBYTE mode to emulate them. For brevity I've left out the ret at the end of each instruction handler, as this was only a rough sketch to gauge how easy/hard it might be, and I've left out some stuff I don't yet understand or didn't get to yet...
    EDIT: updated to show skipf options...

    st
        rfbyte d
        add d, rambase
        wrbyte ac, d
    
    stw
        rfbyte d
        mov ptra, rambase
        setbyte ptra, d, #0
        wrbyte ac, ptra++
        cmp d, #255 wz
    if_z sub ptra, #$100
        getbyte data, ac, #1 
        wrbyte data, ptra
    
    stlw
        rfbyte  d
        add d, rambase
        add d, sp
        wrword ac, d
    
    ld
        rfbyte d
        add d, rambase
        rdbyte ac, d
    
    
    ldlw
        rfbyte  d
        add d, rambase
        add d, sp
        rdword ac, d
    
    ldi
    addi
    subi
    andi
    ori
    xori
        rfbyte d
        mov ac, d   'LD
        add ac, d   '|    ADD
        sub ac, d   '|     |    SUB
        and ac, d   '|     |     |    AND
        or ac, d    '|     |     |     |    OR
        xor ac, d   '|     |     |     |     |    XOR
    
    lslw
        shl ac, #1
        setword ac, #0, #1
    
    inc
        rfbyte d
        add d, rambase
        rdbyte data, d
        add data, #1
        wrbyte data, d
    
    
    ldw
    andw
    orw
    xorw
    addw
    subw
        rfbyte d
        mov ptra, rambase
        setbyte ptra, d, #0
        rdbyte tmp, ptra++
        cmp d, #255 wz
    if_z sub ptra, #$100
        rdbyte data, ptra
        setbyte tmp, data, #1
        mov ac, tmp       'LD   |     |     |     |     |
        and ac, tmp       '    AND    |     |     |     |
        or ac, tmp        '     |    OR     |     |     |
        xor ac, tmp       '     |     |    XOR    |     |
        add ac, tmp       '     |     |     |    ADD    |
        sub ac, tmp       '     |     |     |     |    SUB
        setword ac, #0, #1 '    |     |     |    ADD   SUB
    
    
    peek
        mov addr, rambase
        add addr, ac
        rdbyte ac, addr
    
    deek
        mov addr, rambase
        add addr, ac
        rdword ac, addr
    
    poke
        rfbyte d
        add d, rambase
        rdword d, d
        add d, rambase
        wrbyte ac, d
    
    doke
        rfbyte d
        add d, rambase
        rdword d, d
        add d, rambase
        wrword ac, d
    
    lup
        rfbyte d
        add d, rombase
        add d, ac
        rdword ac, d
    
    bra
        rfbyte d
        add d, d
        getptr addr
        sets addr, d
        rdfast #0, addr
    
    bcc
        rfbyte cond
        alts cond, #table
        getbyte op, 0-0, #0
        setnib testop, op, #7
        rfbyte d
        testb ac, #15 wc
        test ac wz
    testop if_00 ret ' patched
        add d, d
        getptr addr
        sets addr, d
        rdfast #0, addr
    
    
    call        
        rfbyte d
        add d, rambase
        rdword d, d
        getptr lr
        sub d, #2
        add d, rambase
        rdfast #0, d
    
    ret
        mov tmp, #2
        subr tmp, lr
        rdfast #0, tmp
    
    push
       ...
    
    pop
        ...
    
    alloc
        rfbyte d
        add sp, d
    
    sys
        ?
    
    halt
        ?
    
    def
        getptr tmp
        rfbyte d
        sub tmp, rambase
        mov addr, tmp
        add tmp, #1
        mov ac, tmp
        setbyte addr, d, #0
        rdfast #0, addr
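
    One detail in the `stw` and `ldw` handlers above is the `cmp d, #255` / `sub ptra, #$100` pair: when the operand is $FF the high byte wraps around to address $00 instead of spilling into the next page. A small Python model of that behaviour follows, with RAM as a plain bytearray; the function names are made up for the sketch, and it only mirrors what the handlers above do.

```python
def store_word_zero_page(ram, d, ac):
    """Model of the stw handler: write 16-bit ac at zero-page address d.

    The high byte goes to (d + 1) modulo 256, so an operand of $FF wraps
    to address $00 rather than touching $100.
    """
    ram[d & 0xFF] = ac & 0xFF
    ram[(d + 1) & 0xFF] = (ac >> 8) & 0xFF

def load_word_zero_page(ram, d):
    """Model of the ldw handler: read a 16-bit word with the same wrap."""
    return ram[d & 0xFF] | (ram[(d + 1) & 0xFF] << 8)

ram = bytearray(65536)
store_word_zero_page(ram, 0xFF, 0x1234)
print(hex(ram[0xFF]), hex(ram[0x00]), hex(ram[0x100]))   # 0x34 0x12 0x0
print(hex(load_word_zero_page(ram, 0xFF)))               # 0x1234
```

    A plain `wrword`/`rdword` would be shorter, but it would write the high byte to $100 in the wrap case, which is presumably why the handlers split the access into two byte operations.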
    
  • @rogloh said:
    We do already have a USB COG working for keyboards that has recently been reworked and can now also take USB gamepad information (NeoYume uses it). This would be an easy way to use modern USB keyboards and gamepads with a Gigatron emulation.

    FTFY: "This would be an easy way to use one modern USB keyboard or gamepad with a Gigatron emulation." ;3
