Fastest one-way inter-Propeller transfer rate with 8 data lines — Parallax Forums

escher Posts: 143
edited 2025-02-18 15:09 in Propeller 1

Hello all, here are the facts:

I have a 0.5965 ms window in which to transfer 10,368 bytes via 8 data lines directly from one Propeller to another, both running on 104 MHz system clocks (9.6153 ns/cycle).

I wrote some code which does a basic bit-banging routine (rdlong, mov outa, ror, mov outa,..., rdlong, etc.) that took 76 cycles per long, but that's more than 3x too slow, taking ~1.9 ms to transfer the entire payload.
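The numbers above can be turned into a per-long cycle budget. This is a rough sketch using only the figures from the post; the variable names are made up for illustration.

```python
# Timing budget for the transfer, using the values quoted in the post.
CLOCK_HZ = 104_000_000     # system clock of both Propellers
PAYLOAD_BYTES = 10_368     # bytes to move per frame
WINDOW_S = 0.5965e-3       # preferred transfer window, in seconds
CYCLES_PER_LONG = 76       # measured cost of the bit-banged loop

cycle_ns = 1e9 / CLOCK_HZ                  # length of one clock, in ns
window_cycles = WINDOW_S * CLOCK_HZ        # clocks available in the window
longs = PAYLOAD_BYTES // 4                 # payload size in longs
budget_per_long = window_cycles / longs    # clocks we can afford per long
bitbang_ms = longs * CYCLES_PER_LONG / CLOCK_HZ * 1e3  # time the 76-cycle loop needs

print(f"{cycle_ns:.4f} ns/clock, {budget_per_long:.1f} clocks per long allowed, "
      f"bit-bang takes {bitbang_ms:.3f} ms")
```

Under these figures the budget works out to a little under 24 clocks per long, which is why a 76-clock loop lands at roughly 3.2 times the window.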

If push comes to shove I can make the transfer window as wide as 0.9143 ms, but no longer. I'd much rather find a novel way to achieve my goals in software, potentially by using a phsa/phsb clock trick.

Any help would be greatly appreciated!

Comments

  • evanh Posts: 16,234
    edited 2025-02-18 08:15

    A common master clock fed to both Propellers? Or do they have a crystal each?

  • They have a crystal each

  • 10368/.0005965 = 17,381,391 bytes/sec * 8 = 139,051,132 bits/sec.

    Counter tricks would have a peak transfer rate of 26,000,000 bits/sec.

    104 MHz / 4 = 26 MIPS / cog. So 26 MB/sec peak transfer rate.
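The arithmetic in this reply can be checked in a few lines. This is only a sketch of the same calculation; it assumes, as the post does, that the best a single cog could do is one byte per 4-clock instruction.

```python
# Required throughput versus the single-cog instruction rate.
CLOCK_HZ = 104_000_000
PAYLOAD_BYTES = 10_368
WINDOW_S = 0.5965e-3

required_bytes_per_s = PAYLOAD_BYTES / WINDOW_S     # about 17.38 MB/s
required_bits_per_s = required_bytes_per_s * 8      # about 139 Mbit/s
instr_per_s = CLOCK_HZ / 4                          # 26 MIPS at 4 clocks per instruction
peak_bytes_per_s = instr_per_s                      # assumption: one byte per instruction
headroom = peak_bytes_per_s / required_bytes_per_s  # how much slack the peak leaves

print(f"need {required_bytes_per_s / 1e6:.2f} MB/s, peak {peak_bytes_per_s / 1e6:.0f} MB/s, "
      f"headroom x{headroom:.2f}")
```

The headroom of about 1.5x is before any hub access, loop, or byte-packing overhead, so very little of it survives in a real loop.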

    The video generator can send 8 bits at a time.
    Sender

    4 mov addr2,addr
    4 add addr2,#4
    8 rdlong L1,addr
    7 waitvid L1,#%%0123 
    8 rdlong L2,addr2
    7 waitvid L2,#%%0123 
    4 add addr,#8
    4 djnz LC,#
    

    48 clocks to send 8 bytes = 17.3MB/sec. There is a trick to improve WAITVID by using CMP instead at the exact right time. Reduces 6-7 cycles to 4 and allows another instruction to fit between hub operations. May want to add another WAITVID to send a known idle pattern between runs. That will resync in case the clocks are slightly different between chips.
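As a quick check of the sender figure, the sketch below takes the 48-clock loop at face value and compares the resulting transfer time with the two windows mentioned in the original question. The loop overhead before and after the burst is ignored, which is an assumption.

```python
# Throughput of the WAITVID sender loop, taking the post's 48 clocks per pass as given.
CLOCK_HZ = 104_000_000
PAYLOAD_BYTES = 10_368
LOOP_CLOCKS = 48          # clocks per pass of the sender loop (from the post)
BYTES_PER_LOOP = 8        # two longs per pass
PREFERRED_WINDOW_S = 0.5965e-3
WIDEST_WINDOW_S = 0.9143e-3

sender_bytes_per_s = CLOCK_HZ / LOOP_CLOCKS * BYTES_PER_LOOP
passes = PAYLOAD_BYTES // BYTES_PER_LOOP
send_time_s = passes * LOOP_CLOCKS / CLOCK_HZ

fits_preferred = send_time_s <= PREFERRED_WINDOW_S
fits_widest = send_time_s <= WIDEST_WINDOW_S

print(f"{sender_bytes_per_s / 1e6:.2f} MB/s, {send_time_s * 1e3:.4f} ms, "
      f"preferred window: {fits_preferred}, widest window: {fits_widest}")
```

With these inputs the unimproved loop misses the 0.5965 ms window by a couple of microseconds and fits the 0.9143 ms window comfortably, which is consistent with the suggestion to tighten WAITVID timing.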

    Receiver

    waitpne 
    mov B1,ina
    mov B2,ina
    mov B3,ina
    mov B4,ina
    mov B5,ina
    mov B6,ina
    mov B7,ina
    mov B8,ina
    
    ' first long: B1..B4, first sample in the top byte
    shl B1,#24
    and B2,#$FF
    shl B2,#16
    and B3,#$FF
    shl B3,#8
    and B4,#$FF
    or B1,B2
    or B1,B3
    or B1,B4
    
    ' second long: B5..B8 (B4 already belongs to the first long)
    shl B5,#24
    and B6,#$FF
    shl B6,#16
    and B7,#$FF
    shl B7,#8
    and B8,#$FF
    or B5,B6
    or B5,B7
    or B5,B8
    wrlong B1,addr
    add addr,#4
    wrlong B5,addr
    
    

    128+ clocks to receive 8 bytes. 6.5 MB/sec.
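The shift-and-OR packing in the receiver can be modeled off-chip to check the intended result: eight sampled bytes become two longs, each built from four distinct samples. This is a Python model, not Propeller code; the function names are hypothetical, and the byte order (first sample in the top byte) is assumed from the `shl B1,#24` in the listing.

```python
def pack_long(samples):
    """Pack four 8-bit samples into one 32-bit value, first sample in the top byte."""
    if len(samples) != 4:
        raise ValueError("expected exactly four samples")
    value = 0
    for s in samples:
        # Mask to the 8 data lines, like `and Bn,#$FF`, then shift into place.
        value = (value << 8) | (s & 0xFF)
    return value


def pack_burst(samples):
    """Split eight samples into two longs, four distinct samples each."""
    if len(samples) != 8:
        raise ValueError("expected exactly eight samples")
    return pack_long(samples[:4]), pack_long(samples[4:])


first, second = pack_burst([0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08])
print(hex(first), hex(second))
```

Comparing a captured buffer against this model is one way to confirm that no sample is used twice or dropped.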

    Alternate method without all the byte packing:

    4 waitpne ' synchronize
    4 mov T,ina 
    8 wrbyte T,addr
    4 add addr,#1
    
    4 mov T,ina 
    8 wrbyte T,addr
    4 add addr,#1
    

    16 clocks per 1 byte, still 6.5MB/sec. So receiving the data at the desired rate will take 3 cogs. The P2 has ROLBYTE (and the streamer and FIFO) and would easily handle this with one cog.
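The three-cog figure follows from dividing the required rate by the per-cog receive rate. A minimal sketch of that division, assuming the cogs could be interleaved perfectly:

```python
import math

# How many receiver cogs the byte-at-a-time loop would need.
CLOCK_HZ = 104_000_000
REQUIRED_BYTES_PER_S = 10_368 / 0.5965e-3   # from the original question
CLOCKS_PER_BYTE = 16                        # mov + wrbyte + add, per the post

per_cog_bytes_per_s = CLOCK_HZ / CLOCKS_PER_BYTE
cogs_needed = math.ceil(REQUIRED_BYTES_PER_S / per_cog_bytes_per_s)

print(f"{per_cog_bytes_per_s / 1e6:.1f} MB/s per cog, {cogs_needed} cogs needed")
```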

  • evanh Posts: 16,234
    edited 2025-02-18 19:28

    Never mind, I'm coming to the realisation the Prop1 can't go anywhere near fast enough for that to be an issue. A frame start handshake keeps them plenty close enough in sync.

    As for throughput solutions, the video hardware could output fast but there isn't an equivalent way to input at the other end.
    Bursting into cogRAM might be faster than using sequenced WRLONGs but I'm not too convinced. It may require more than one cog interleaved to keep up with the input needs.

    EDIT: There we go, James has done the analysis.

  • escher Posts: 143
    edited 2025-02-19 05:31

    Thanks for the outstanding analysis James. Unfortunately, it looks like I'm in something of a pickle here then...

    I do not have a single extra cog to spare on the receiver Prop (GPU), as I'm already dedicating 6 to rendering the raw video data from graphics memory and one to displaying the raw video data in a VGA format. This leaves a single cog available to field data transfers from the transmitter Prop (CPU).

    The timing limits and transfer size stem from the nature of my system and the VGA signal timing itself: the CPU needs to make changes to a large set of graphics data, and the GPU needs to be made aware of those changes before it starts rendering the next frame of video, so the data needs to be transferred quickly during the blanking period. This graphics data set is called a name table in NES terms. Basically, it's the tile map canvas that the virtual camera is rendering data from.

    The effective name table size on my GPU would be 40x30 1-WORD tiles visible on screen, times 4 in a worst-case scenario where a game requires both horizontal AND vertical scrolling, for a total of 9600 BYTEs/2400 LONGs.

    So, worst-case, my CPU would need a way to transfer all 2400 LONGs in less than about 1 ms. I have 8 data lines for sure, but I could increase it to 16 total if I absolutely had to.
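The name table size and the rate it implies can be written out as follows. This is just the post's arithmetic; the 1 ms figure is the approximate limit stated above.

```python
# Worst-case name table size and the throughput needed to move it in about 1 ms.
TILES_X, TILES_Y = 40, 30   # visible tiles
BYTES_PER_TILE = 2          # one WORD per tile
SCREENS = 4                 # horizontal and vertical scrolling preload
WINDOW_S = 1e-3             # "about 1 ms"

table_bytes = TILES_X * TILES_Y * BYTES_PER_TILE * SCREENS
table_longs = table_bytes // 4
required_bytes_per_s = table_bytes / WINDOW_S

print(f"{table_bytes} bytes, {table_longs} longs, {required_bytes_per_s / 1e6:.1f} MB/s")
```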

    If this is not possible, then I need to start considering compromises, like reducing the extra scrolling preload regions to a minimum (this however would start impacting features I've already developed such as being able to parallax individual lines, reducing the max parallax amount by however much I reduce the preload area).

    There is the possibility of allowing the render cogs to branch to data transfer routines during their otherwise idle vertical blanking periods, however I believe I'm already pushing the 2K cog RAM limit.

  • evanh Posts: 16,234
    edited 2025-02-19 00:06

    @escher said:
    The effective name table size on my GPU would be 40x30 1-WORD tiles visible on screen, times 4 in a worst-case scenario where a game requires both horizontal AND vertical scrolling, for a total of 9600 BYTEs/2400 LONGs.

    So, worst-case, my CPU would need a way to transfer all 2400 LONGs in less than about 1 ms. I have 8 data lines for sure, but I could increase it to 16 total if I absolutely had to.

    That would just fit. 16 bit is peak 13 MB/s.

    frame_start
        mov addr, bufferstart
        mov count, bufferlength
    
        waitpeq ' synchronize
        waitpne ' synchronize
    frame_loop
        wrword ina, addr    '8
        add addr, #2    '4
        djnz count, #frame_loop    '4
    
        jmp #frame_start
    
    
    bufferstart
        long 0    ' buffer start address
    bufferlength
        long 4800    ' shortword count
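The "just fit" claim can be checked from the cycle counts annotated in the loop. This sketch assumes the loop really does run at 16 clocks per word, which is what the comments in the listing add up to.

```python
# Throughput of the 16-bit receive loop, from the per-instruction clocks in the listing.
CLOCK_HZ = 104_000_000
CLOCKS_PER_WORD = 8 + 4 + 4   # wrword + add + djnz
WORDS = 4800                  # bufferlength from the listing
WINDOW_S = 1e-3               # "about 1 ms"

bytes_per_s = CLOCK_HZ / CLOCKS_PER_WORD * 2
transfer_ms = WORDS * CLOCKS_PER_WORD / CLOCK_HZ * 1e3
fits = transfer_ms / 1e3 <= WINDOW_S

print(f"{bytes_per_s / 1e6:.0f} MB/s, {transfer_ms:.4f} ms, fits in 1 ms: {fits}")
```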
    
  • evanh Posts: 16,234

    The initial hubRAM access jitter will probably ruin it though. :(

  • You can't use INA (or CNT or PAR or PHSA/PHSB) like that in the D slot. That just reads the shadow register.
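One way around the shadow-register problem would be to copy INA into a scratch register first and write that register to the hub, but the extra instruction has a cost. The sketch below estimates it with a simplified timing model that is an assumption, not a measurement: each cog gets a hub window every 16 clocks, a hub write occupies 8 clocks, and every other instruction takes 4. The function name is hypothetical.

```python
import math

CLOCK_HZ = 104_000_000
HUB_WINDOW = 16   # assumed: one hub window per cog every 16 clocks


def loop_clocks(n_regular, hub_clocks=8, instr_clocks=4):
    """Clocks per loop pass with one hub write and n_regular ordinary instructions.

    Simplified model: the pass is rounded up to the next hub window.
    """
    busy = hub_clocks + n_regular * instr_clocks
    return math.ceil(busy / HUB_WINDOW) * HUB_WINDOW


original = loop_clocks(2)   # wrword + add + djnz
with_copy = loop_clocks(3)  # extra mov from INA into a scratch register

rate_with_copy = CLOCK_HZ / with_copy * 2          # bytes per second, 16-bit bus
time_ms = 4800 * with_copy / CLOCK_HZ * 1e3        # 4800 words, as in the listing

print(f"{original} -> {with_copy} clocks per word, "
      f"{rate_with_copy / 1e6:.1f} MB/s, {time_ms:.3f} ms")
```

Under this model the extra instruction misses the hub window and the rate halves, so the loop would have to be unrolled (dropping the DJNZ from most passes) to stay near 13 MB/s.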

  • escher Posts: 143
    edited 2025-02-19 05:40

    Bedtime is a bad time to chime back in as I refuse to open the calculator app, but let's increase the transfer window to the full 0.9143 ms by not waiting for VSYNC to go low explicitly and using the front porch of the vertical blanking period as well (why don't you do that anyway you ask; using the already tasked VSYNC pin as a transfer signal means I don't need either another pin dedicated or slew compensation).

    If we go that route, with almost 2X the transfer time, does a solution become viable? A question that tomorrow must answer.

    Parting thoughts: the NES and other systems also had multiple different video modes depending on the developer's use case. If a game only required horizontal scrolling, or needed scrolling in both directions, or needed some special high color mode, etc etc etc, you would be able to use a specific rendering mode that would support your desired features while inevitably forcing tradeoffs.

    If I am considering a dual-direction scrolling game as the worst case scenario, then maybe I just have to force those games to have less extra nametable space. A game scrolling in a single direction would only need half the transfer size for example.

    Ultimately, I'd like to try as hard as possible to avoid those tradeoffs, but it's starting to look like I'll need to really weigh requirements against technical feasibility here.

  • evanh Posts: 16,234

    The Prop1 ain't too good at this. My advice is to fit it all in one Propeller.

  • That is unfortunately even less of an option as the feature set I'm supporting - 328x240 60 Hz, 16+ sprites per line, sprite mirroring, double high/sprite rendering, full 8-bit RRRGGGBB color, omni directional fine pixel scrolling, multi-line parallax scrolling, etc - requires 2 props worth of processing power. Ironically, I have been able to successfully implement all of these features and more relatively easily, but the bottleneck is turning into the data transfer rate. Need to send over that whole name table to support the full omni directional scrolling. Could do half for unidirectional scrolling, or reduce the preload area, etc to try and solve this, but I'd really really like to explore every possible option before hobbling my feature set. Even started considering a shared SRAM module between the props.

  • @escher said:
    That is unfortunately even less of an option as the feature set I'm supporting - 328x240 60 Hz, 16+ sprites per line, sprite mirroring, double high/sprite rendering, full 8-bit RRRGGGBB color, omni directional fine pixel scrolling, multi-line parallax scrolling, etc - requires 2 props worth of processing power.

    No, it can be made to work (though 328 wide and 16 line sprites may be pushing it). But, yeah, certainly easier/better if you have a dedicated graphics-only prop.

    @escher said:
    Bedtime is a bad time to chime back in as I refuse to open the calculator app, but let's increase the transfer window to the full 0.9143 ms by not waiting for VSYNC to go low explicitly and using the front porch of the vertical blanking period as well (why don't you do that anyway you ask; using the already tasked VSYNC pin as a transfer signal means I don't need either another pin dedicated or slew compensation).

    IDK what math gets you 0.9143 ms. Vertical blanking is at least 1.2 ms on a normal video signal, even if you are rendering the full 240 active lines (actual consoles/arcade boards tend not to; 224 active is more common). 262 total lines - 240 active lines is 22 blanking lines. Times ~0.063 ms per line -> 1.386 ms.
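The blanking estimate can be reproduced directly. The line counts and the ~0.063 ms line time are the ones quoted in this reply; the 9600-byte payload is the worst-case name table from earlier in the thread.

```python
# Vertical blanking time from the line counts quoted in the reply.
TOTAL_LINES = 262
ACTIVE_LINES = 240
LINE_MS = 0.063            # approximate time per scan line, from the post
TABLE_BYTES = 9600         # worst-case name table size from earlier in the thread

blank_lines = TOTAL_LINES - ACTIVE_LINES
blank_ms = blank_lines * LINE_MS
needed_bytes_per_s = TABLE_BYTES / (blank_ms / 1e3)

print(f"{blank_lines} lines, {blank_ms:.3f} ms, {needed_bytes_per_s / 1e6:.2f} MB/s needed")
```

If the whole blanking interval were usable, the worst-case table would need roughly 6.9 MB/s instead of 9.6 MB/s.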

    For updating a scrolling tilemap, transferring the entire thing every frame is really inefficient. Regardless of what you do (arbitrary logical position each frame), you never need more than one screen's worth of updated tiles per frame (+ one extra row/column for non-aligned positions). In any reasonably programmed game, you likely will not jump randomly around the logical tile map space and not scroll more than one tile out per frame, so you really only need one row + one column to update (and by that logic your tilemap doesn't need to be much larger than a single screen, regardless of scroll direction, unless you want to preload an entire area, in which case the tilemap never/rarely needs updating.)
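To put rough numbers on the incremental approach, the sketch below compares a full-table transfer with the two cases described above. The 40x30 grid and 2-byte tiles come from earlier in the thread; padding each dimension by one tile for non-aligned scroll positions is an assumption about how the extra row/column would be counted.

```python
# Bytes per frame: full name table versus incremental tile updates.
TILES_X, TILES_Y = 40, 30
BYTES_PER_TILE = 2
SCREENS = 4

full_table_bytes = TILES_X * TILES_Y * BYTES_PER_TILE * SCREENS

# Arbitrary jump: one screen of tiles plus one extra row and column.
one_screen_bytes = (TILES_X + 1) * (TILES_Y + 1) * BYTES_PER_TILE

# Smooth scrolling: at most one new row and one new column per frame.
# The shared corner tile is counted twice, so this is a slight overestimate.
row_plus_column_bytes = ((TILES_X + 1) + (TILES_Y + 1)) * BYTES_PER_TILE

print(full_table_bytes, one_screen_bytes, row_plus_column_bytes)
```

Even the arbitrary-jump case is about a quarter of the full table, and the smooth-scrolling case is small enough that the byte-at-a-time receiver would have ample time.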
