Fastest one-way inter-Propeller transfer rate with 8 data lines

escher · 2025-02-18 07:44

Hello all, here are the facts:

I have a 0.5965 ms window in which to transfer 10,368 bytes via 8 data lines directly from one Propeller to another, both running on 104 MHz system clocks (9.6153 ns/cycle).

I wrote some code which does a basic bit-banging routine (rdlong, mov outa, ror, mov outa,..., rdlong, etc.) that took 76 cycles per long, but that's more than 3x too slow, taking ~1.9 ms to transfer the entire payload.
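To put numbers on that, here is a quick budget check in Python. It is only arithmetic on the figures above; the variable names are illustrative and not taken from the actual cog code.

```python
# Cycle budget for the transfer window, using only the figures quoted above.
CLOCK_HZ = 104_000_000        # system clock of both Propellers
WINDOW_S = 0.5965e-3          # available transfer window
PAYLOAD_BYTES = 10_368
CYCLES_PER_LONG_NOW = 76      # measured cost of the bit-banging loop

longs = PAYLOAD_BYTES // 4                       # payload size in longs
window_cycles = CLOCK_HZ * WINDOW_S              # clocks available in the window
budget_per_long = window_cycles / longs          # clocks allowed per long
current_ms = longs * CYCLES_PER_LONG_NOW / CLOCK_HZ * 1e3

print(f"{longs} longs, {window_cycles:.0f} clocks available")
print(f"budget: {budget_per_long:.1f} clocks/long, current: {CYCLES_PER_LONG_NOW}")
print(f"current transfer time: {current_ms:.3f} ms")
```

By this estimate the loop has to come down to roughly 24 clocks per long, i.e. about 6 clocks per byte.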

If push comes to shove I can make the transfer window as wide as 0.9143 ms, but no longer. I'd much rather find a novel way to achieve my goals in software, potentially by using a phsa/phsb clock trick.

Any help would be greatly appreciated!

evanh · 2025-02-18 08:14

A common master clock fed to both Propellers? Or do they have a crystal each?

escher · 2025-02-18 15:07

They have a crystal each

SaucySoliton · 2025-02-18 18:58

10368/.0005965 = 17,381,391 bytes/sec * 8 = 139,051,132 bits/sec.

Counter tricks would have a peak transfer rate of 26,000,000 bits/sec.

104 MHz / 4 = 26 MIPS / cog. So 26 MB/sec peak transfer rate.

The video generator can send 8 bits at a time.

Sender (the leading number on each line is its clock count)

4 mov addr2,addr
4 add addr2,#4
8 rdlong L1,addr
7 waitvid L1,#%%0123 
8 rdlong L2,addr2
7 waitvid L2,#%%0123 
4 add addr,#8
4 djnz LC,#

48 clocks to send 8 bytes = 17.3MB/sec. There is a trick to improve WAITVID by using CMP instead at the exact right time. Reduces 6-7 cycles to 4 and allows another instruction to fit between hub operations. May want to add another WAITVID to send a known idle pattern between runs. That will resync in case the clocks are slightly different between chips.
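A quick check of how the 48-clock loop compares with the original 0.5965 ms window, as a sketch in Python. It assumes the loop really does settle at 48 clocks per pass and ignores setup and any idle/resync WAITVID.

```python
# Sender loop throughput versus the transfer window (arithmetic only).
CLOCK_HZ = 104_000_000
WINDOW_S = 0.5965e-3
PAYLOAD_BYTES = 10_368
CLOCKS_PER_ITERATION = 48     # one pass of the sender loop
BYTES_PER_ITERATION = 8       # two longs per pass

rate_mb_s = CLOCK_HZ / CLOCKS_PER_ITERATION * BYTES_PER_ITERATION / 1e6
iterations = PAYLOAD_BYTES // BYTES_PER_ITERATION
send_clocks = iterations * CLOCKS_PER_ITERATION
window_clocks = CLOCK_HZ * WINDOW_S
overrun_us = (send_clocks - window_clocks) / CLOCK_HZ * 1e6

print(f"rate: {rate_mb_s:.2f} MB/sec")
print(f"send: {send_clocks} clocks, window: {window_clocks:.0f} clocks")
print(f"overrun: {overrun_us:.2f} us")
```

On these numbers the sender alone lands a hair over the window (under 2 µs), so it is borderline even before the receiver is considered.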

Receiver

waitpne 
mov B1,ina
mov B2,ina
mov B3,ina
mov B4,ina
mov B5,ina
mov B6,ina
mov B7,ina
mov B8,ina

shl B1,#24
and B2,#$FF
shl B2,#16
and B3,#$FF
shl B3,#8
and B4,#$FF
or B1,B2
or B1,B3
or B1,B4

shl B5,#24
and B6,#$FF
shl B6,#16
and B7,#$FF
shl B7,#8
and B8,#$FF
or B5,B6
or B5,B7
or B5,B8
wrlong B1,addr
add addr,#4
wrlong B5,addr

128+ clocks to receive 8 bytes. 6.5 MB/sec.
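The packing step can be modelled in Python to check the shifts and masks. This is a sketch only: it assumes the 8 data lines sit on P0..P7 (so `& 0xFF` isolates them) and that the first sample goes in the top byte, as in the listing above. `pack_samples` is a hypothetical helper name.

```python
def pack_samples(samples):
    """Pack eight INA samples into two 32-bit longs, first sample in the top byte."""
    if len(samples) != 8:
        raise ValueError("expected exactly eight samples")
    data = [s & 0xFF for s in samples]   # keep only the 8 data pins

    def pack4(b):
        return (b[0] << 24) | (b[1] << 16) | (b[2] << 8) | b[3]

    return pack4(data[:4]), pack4(data[4:])


# Upper bits of each sample stand in for unrelated pins and must be ignored.
samples = [0xABCD0011, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0xFF000088]
first, second = pack_samples(samples)
print(f"{first:08X} {second:08X}")
```

Each input sample ends up in exactly one output byte, which is the property the shift/mask/or sequence needs to preserve.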

Alternate method without all the byte packing:

4 waitpne ' synchronize
4 mov T,ina 
8 wrbyte T,addr
4 add addr,#1

4 mov T,ina 
8 wrbyte T,addr
4 add addr,#1

16 clocks per 1 byte, still 6.5MB/sec. So receiving the data at the desired rate will take 3 cogs. The P2 has ROLBYTE (and the streamer and FIFO) and would easily handle this with one cog.
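The cog count follows from dividing the required rate by the per-cog rate. A small Python sketch of that arithmetic, using only the numbers already given in this thread:

```python
import math

CLOCK_HZ = 104_000_000
PAYLOAD_BYTES = 10_368
WINDOW_S = 0.5965e-3
CLOCKS_PER_BYTE = 16          # mov + wrbyte + add in the alternate receiver

required_mb_s = PAYLOAD_BYTES / WINDOW_S / 1e6
per_cog_mb_s = CLOCK_HZ / CLOCKS_PER_BYTE / 1e6
cogs_needed = math.ceil(required_mb_s / per_cog_mb_s)

print(f"required: {required_mb_s:.2f} MB/sec, per cog: {per_cog_mb_s:.2f} MB/sec")
print(f"receiver cogs needed: {cogs_needed}")
```

This ignores the extra work of interleaving the cogs, so treat it as a lower bound.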

evanh · 2025-02-18 18:58

Never mind, I'm coming to the realisation the Prop1 can't go anywhere near fast enough for that to be an issue. A frame start handshake keeps them plenty close enough in sync.

As for throughput solutions, the video hardware could output fast but there isn't an equivalent way to input at the other end.
Bursting into cogRAM might be faster than using sequenced WRLONGs but I'm not too convinced. It may require more than one cog interleaved to keep up with the input needs.

EDIT: There we go, James has done the analysis.

escher · 2025-02-18 23:33

Thanks for the outstanding analysis James. Unfortunately, it looks like I'm in something of a pickle here then...

I do not have a single cog to spare on the receiver PROP (GPU), as I'm already dedicating 6 to rendering the raw video data from graphics memory and one to displaying the raw video data in a VGA format. This leaves a single cog available to field data transfers from the transmitter Prop (CPU).

The timing limits and transfer size come from the nature of my system and the VGA signal timing itself: the CPU needs to make changes to a large set of graphics data, and the GPU needs to be made aware of those changes before it starts rendering the next frame of video, so the data has to be transferred quickly during the blanking period. This graphics data set is called a name table in NES terms. Basically, it's the tile map canvas that the virtual camera is rendering data from.

The effective name table size on my GPU would be 40x30 1-WORD tiles visible on screen, times 4 in a worst-case scenario where a game requires both horizontal AND vertical scrolling, for a total of 9600 BYTEs/2400 LONGs.

So, worst-case, my CPU would need a way to transfer all 2400 LONGs in less than about 1 ms. I have 8 data lines for sure, but I could increase it to 16 total if I absolutely had to.
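For reference, the per-transfer clock budget that implies, as a rough Python sketch (arithmetic only, taking "about 1 ms" as exactly 1 ms):

```python
CLOCK_HZ = 104_000_000
WINDOW_S = 1.0e-3             # "about 1 ms", taken as exactly 1 ms here
LONGS = 2400

window_clocks = CLOCK_HZ * WINDOW_S
clocks_per_long = window_clocks / LONGS

# Clocks available per bus sample for each bus width.
budget = {}
for bus_bits in (8, 16):
    samples_per_long = 32 // bus_bits
    budget[bus_bits] = clocks_per_long / samples_per_long
    print(f"{bus_bits}-bit bus: {budget[bus_bits]:.1f} clocks per sample")
```

So an 8-line bus leaves under 11 clocks per byte, while 16 lines leaves a little under 22 clocks per word.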

If this is not possible, then I need to start considering compromises, like reducing the extra scrolling preload regions to a minimum (this however would start impacting features I've already developed such as being able to parallax individual lines, reducing the max parallax amount by however much I reduce the preload area).

There is the possibility of allowing the render cogs to branch to data transfer routines during their otherwise idle vertical blanking periods, however I believe I'm already pushing the 2K cog RAM limit.

evanh · 2025-02-19 00:05

@escher said:
The effective name table size on my GPU would be 40x30 1-WORD tiles visible on screen, times 4 in a worst-case scenario where a game requires both horizontal AND vertical scrolling, for a total of 9600 BYTEs/2400 LONGs.

So, worst-case, my CPU would need a way to transfer all 2400 LONGs in less than about 1 ms. I have 8 data lines for sure, but I could increase it to 16 total if I absolutely had to.

That would just fit. 16 bit is peak 13 MB/s.

frame_start
    mov addr, bufferstart
    mov count, bufferlength

    waitpeq ' synchronize
    waitpne ' synchronize
frame_loop
    wrword ina, addr    '8
    add addr, #2    '4
    djnz count, #frame_loop    '4

    jmp #frame_start


bufferstart
    long 0    ' buffer start address
bufferlength
    long 4800    ' shortword count
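A sanity check of the "just fit" claim, sketched in Python. It assumes the loop runs at exactly 16 clocks per word (8 + 4 + 4, locked to the hub window) and ignores the initial hub alignment.

```python
CLOCK_HZ = 104_000_000
WORDS = 4800                  # 9600 bytes over a 16-bit bus
CLOCKS_PER_WORD = 16          # wrword (8) + add (4) + djnz (4)

rate_mb_s = CLOCK_HZ / CLOCKS_PER_WORD * 2 / 1e6
transfer_ms = WORDS * CLOCKS_PER_WORD / CLOCK_HZ * 1e3

print(f"peak rate: {rate_mb_s:.1f} MB/sec")
print(f"transfer time: {transfer_ms:.4f} ms")
for window_ms in (0.5965, 0.9143, 1.0):
    verdict = "fits" if transfer_ms <= window_ms else "does not fit"
    print(f"  {window_ms} ms window: {verdict}")
```

On these assumptions the transfer fits the ~1 ms and 0.9143 ms windows but not the original 0.5965 ms one.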

evanh · 2025-02-19 00:11

The initial hubRAM access jitter will probably ruin it though.

Wuerfel_21 · 2025-02-19 02:27

You can't use INA (or CNT or PAR or PHSA/PHSB) like that in the D slot. That just reads the shadow register.

escher · 2025-02-19 05:30

Bedtime is a bad time to chime back in as I refuse to open the calculator app, but let's increase the transfer window to the full 0.9143 ms by not waiting for VSYNC to go low explicitly and using the front porch of the vertical blanking period as well (why don't you do that anyway you ask; using the already tasked VSYNC pin as a transfer signal means I don't need either another pin dedicated or slew compensation).

If we go that route, with roughly 1.5X the transfer time, does a solution become viable? A question that tomorrow must answer.
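For whoever does open the calculator, a rough Python sketch of what the wider window buys. The 6.5 MB/sec (8-bit, one cog) and 13 MB/sec (16-bit) receive rates are the estimates given earlier in the thread, not measurements.

```python
import math

NARROW_S = 0.5965e-3
WIDE_S = 0.9143e-3
RX_8BIT_MB_S = 6.5            # single-cog 8-bit receive estimate
RX_16BIT_MB_S = 13.0          # single-cog 16-bit receive estimate

window_gain = WIDE_S / NARROW_S
needed = {}
for payload in (10_368, 9_600):
    needed[payload] = payload / WIDE_S / 1e6
    cogs_8bit = math.ceil(needed[payload] / RX_8BIT_MB_S)
    fits_16bit = needed[payload] <= RX_16BIT_MB_S
    print(f"{payload} bytes: {needed[payload]:.2f} MB/sec, "
          f"{cogs_8bit} cogs at 8-bit, 16-bit single cog ok: {fits_16bit}")
print(f"window gain: {window_gain:.2f}x")
```

By these estimates the wider window still needs two receive cogs on 8 lines, or one cog on 16 lines.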

Parting thoughts: the NES and other systems also had multiple different video modes depending on the developer's use case. If a game only required horizontal scrolling, or needed scrolling in both directions, or needed some special high color mode, etc etc etc, you would be able to use a specific rendering mode that would support your desired features while inevitably forcing tradeoffs.

If I am considering a dual-direction scrolling game as the worst case scenario, then maybe I just have to force those games to have less extra nametable space. A game scrolling in a single direction would only need half the transfer size for example.

Ultimately, I'd like to try as hard as possible to avoid those tradeoffs, but it's starting to look like I'll need to really weigh requirements against technical feasibility here.

evanh · 2025-02-19 06:08

The Prop1 ain't too good at this. My advice is to fit it all in one Propeller.

escher · 2025-02-19 13:05

That is unfortunately even less of an option as the feature set I'm supporting - 328x240 60 Hz, 16+ sprites per line, sprite mirroring, double high/sprite rendering, full 8-bit RRRGGGBB color, omni directional fine pixel scrolling, multi-line parallax scrolling, etc - requires 2 props worth of processing power. Ironically, I have been able to successfully implement all of these features and more relatively easily, but the bottleneck is turning into the data transfer rate. Need to send over that whole name table to support the full omni directional scrolling. Could do half for unidirectional scrolling, or reduce the preload area, etc to try and solve this, but I'd really really like to explore every possible option before hobbling my feature set. Even started considering a shared SRAM module between the props.

Wuerfel_21 · 2025-02-19 14:49

@escher said:
That is unfortunately even less of an option as the feature set I'm supporting - 328x240 60 Hz, 16+ sprites per line, sprite mirroring, double high/sprite rendering, full 8-bit RRRGGGBB color, omni directional fine pixel scrolling, multi-line parallax scrolling, etc - requires 2 props worth of processing power.

No, can be made to work (Though 328 wide and 16 line sprites may be pushing it). But, yeah, certainly easier/better if you have a dedicated graphics-only prop.

@escher said:
Bedtime is a bad time to chime back in as I refuse to open the calculator app, but let's increase the transfer window to the full 0.9143 ms by not waiting for VSYNC to go low explicitly and using the front porch of the vertical blanking period as well (why don't you do that anyway you ask; using the already tasked VSYNC pin as a transfer signal means I don't need either another pin dedicated or slew compensation).

IDK what math gets you 0.9143 ms, Vertical blanking is at least 1.2ms on a normal video signal, even if you are rendering the full 240 active lines (actual consoles/arcade boards tend not to, 224 active is more common). 262 total lines - 240 active lines is 22 blanking lines. Times ~0.063 -> 1.386ms.

For updating a scrolling tilemap, transferring the entire thing every frame is really inefficient. Regardless of what you do (arbitrary logical position each frame), you never need more than one screen's worth of updated tiles per frame (+ one extra row/column for non-aligned positions). In any reasonably programmed game, you likely will not jump randomly around the logical tile map space and not scroll more than one tile out per frame, so you really only need one row + one column to update (and by that logic your tilemap doesn't need to be much larger than a single screen, regardless of scroll direction, unless you want to preload an entire area, in which case the tilemap never/rarely needs updating.)
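Rough numbers for that, as a Python sketch. Tile counts come from the 40x30 1-WORD layout described above, and 6.5 MB/sec is the single-cog 8-bit receive estimate from earlier in the thread. The exact row/column counts depend on how the scroll is implemented, so these are assumptions.

```python
TILE_BYTES = 2                 # 1-WORD tiles
COLS, ROWS = 40, 30            # visible screen in tiles
RECEIVE_BYTES_PER_S = 6.5e6    # single-cog 8-bit receive estimate

updates = {
    "full 4-screen name table": COLS * ROWS * 4,
    "one screen + extra row/column": (COLS + 1) * (ROWS + 1),
    "one row + one column": COLS + ROWS,
}
sizes = {name: tiles * TILE_BYTES for name, tiles in updates.items()}
times_ms = {name: size / RECEIVE_BYTES_PER_S * 1e3 for name, size in sizes.items()}

for name in sizes:
    print(f"{name}: {sizes[name]} bytes, {times_ms[name]:.3f} ms")
```

Even the whole-screen case comes in under the 0.5965 ms window with one receive cog on these figures.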

escher · 2025-02-28 20:12

@Wuerfel_21 said:

@escher said:
That is unfortunately even less of an option as the feature set I'm supporting - 328x240 60 Hz, 16+ sprites per line, sprite mirroring, double high/sprite rendering, full 8-bit RRRGGGBB color, omni directional fine pixel scrolling, multi-line parallax scrolling, etc - requires 2 props worth of processing power.

No, can be made to work (Though 328 wide and 16 line sprites may be pushing it). But, yeah, certainly easier/better if you have a dedicated graphics-only prop.

A single Prop could be made to support a very large graphical feature set on paper, while sacrificing significantly in every other area that matters for my project. By offloading the rendering to a dedicated GPU, I am able to parallel process six scanlines of video data concurrently and separately from the CPU. This allows me to support very complex game environments with advanced effects like per-scanline parallaxing, while also leaving dedicated CPU cogs for e.g. sound chip emulation, I/O, external memory access, fast (PASM) implementations of frequently used game logic functions, etc.

@escher said:
Bedtime is a bad time to chime back in as I refuse to open the calculator app, but let's increase the transfer window to the full 0.9143 ms by not waiting for VSYNC to go low explicitly and using the front porch of the vertical blanking period as well (why don't you do that anyway you ask; using the already tasked VSYNC pin as a transfer signal means I don't need either another pin dedicated or slew compensation).

IDK what math gets you 0.9143 ms, Vertical blanking is at least 1.2ms on a normal video signal, even if you are rendering the full 240 active lines (actual consoles/arcade boards tend not to, 224 active is more common). 262 total lines - 240 active lines is 22 blanking lines. Times ~0.063 -> 1.386ms.

Because I am rendering six scanlines at a time, I need to have a six-scanline buffer of time before I start displaying a video frame to actually render it. I am also listening to the VSYNC pin on the CPU to trigger the start of data transfer representing the next video frame, in order to reduce the number of pins or other redundant signaling. The drawback is that VSYNC doesn't go (active-)low until after the vertical sync front porch is complete (10 lines, ~.318 ms). The entire vertical sync period, minus the front porch, minus the six scanlines of render-ahead time, equals 0.5965 ms. If I am able to recover that lost front porch time, then I get the "full" 0.9143 ms of usable transfer time. There are all sorts of tricks I could do to account for that delay, but I'm still in the fact-finding stage on core feasibility for design so I don't want to prematurely Barnum & Bailey my codebase.
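The relationship between the two windows can be checked with the figures in this post alone. A minimal Python sketch, with the per-line time derived from the quoted ~.318 ms for 10 front porch lines rather than from any timing spec:

```python
FRONT_PORCH_LINES = 10
FRONT_PORCH_MS = 0.318        # approximate, as quoted
NARROW_WINDOW_MS = 0.5965     # window when waiting for VSYNC to go low

line_time_us = FRONT_PORCH_MS / FRONT_PORCH_LINES * 1e3
full_window_ms = NARROW_WINDOW_MS + FRONT_PORCH_MS

print(f"line time: {line_time_us:.1f} us")
print(f"window with front porch recovered: {full_window_ms:.4f} ms")
```

The result agrees with the 0.9143 ms figure to within the rounding of the front porch time.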

That all being said, you're definitely correct that the current paradigm is inefficient. However, if doable, it would provide me with the maximum control possible over manipulating the name table by both the CPU and the GPU, enabling some really cool visual effects that you wouldn't even see in the late 80s/early 90s arcade systems. If I had silicon like what the NES uses - where you can have the CPU and GPU effectively treat the same region of memory as being natively addressable as if it existed on the processors themselves - then we wouldn't even be having this discussion.

Ultimately, I'm going to take a step back and identify which features are the most important to my project goals, weighed against how much a PITA they are to actually support. I think I can just manage the name table on the CPU, and only transmit the visible window of the game scene to the GPU, but I will need to find an alternative solution to enable the kind of parallaxing I've already been able to achieve via per-scanline control.

See the individual lines of the ocean in the background at the end of this video:

Or a close-up here:

evanh · 2025-02-28 21:15

Front porch doesn't have to be so big. Monitors are okay with Vsync being early in the blanking. Vertical shift is always much more adjustable than horizontal shift. So vertical timings could be set to 240:1:1:20 for example.
