You know, I wonder if the P1 could be coaxed into generating a DVI signal with 4 COGs doing WAITVID simultaneously? It'll shift 32 bits at a whack, and since some monitors don't care about TMDS, it could be possible to do it.
You might even get away without a differential signal.
The P1's max PLL frequency is about 230 MHz (experimentally) - this is somewhat too low to generate the digital signal, which needs 250 MHz for 640x480, and they said it should be more than 225 MHz.
Thanks for the interest. It would be good if someone with a P2 could test this out. I've done all the really hard work - the thinking. One cog would stream out bytes from a line buffer at 250 MHz, which Chip has proven already. The other two would read from the frame buffer and write to the line buffer. There are 8000 clock cycles per line at 250 MHz and during this time the two cogs must do the following:
1. Read 80 longs from hub RAM to cog RAM.
2. For each of these 80 longs, read each even or odd nibble in turn and use it as an index to a table of 16 longs containing TMDS data.
3. Write the TMDS long to hub RAM and add 5 to the hub RAM pointer.
The hub RAM write will always cross a long boundary for one cog but never for the other. If the above is feasible, then I'll post further details tomorrow. If not, I won't!
Summary:
Read 80 longs from hub to cog using SETQ+RDLONG, then write 320 longs from cog to hub as described using WRLONG (plus one extra cycle for one cog), all within 8000 cycles. Could someone please confirm this is doable?
I don't want to spend a lot of time describing the method if it isn't.
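A rough cycle-budget sketch of the summary above, in Python. The 2+80+8 block-read figure and the 5-clock spacing for 5-long-stride WRLONGs are the numbers quoted in this thread; the 4 clocks of lookup work per write is purely a placeholder guess.

```python
# Rough cycle-budget check: read 80 longs, then issue 320 spaced WRLONGs,
# all inside one 8000-clock scanline at 250 MHz.
LINE_CLOCKS = 8000            # clocks per scanline at 250 MHz
block_read  = 2 + 80 + 8      # SETQ+RDLONG of 80 longs
writes      = 320 * 5         # WRLONGs landing 5 longs apart
lookup_work = 320 * 4         # placeholder: ~2 instructions per lookup
total = block_read + writes + lookup_work
print(total, total <= LINE_CLOCKS)   # 2970 True
```

Even with generous slack for loop overhead, the raw transfer cost is well under the 8000-clock line budget.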
I am not following properly.
I get the first point.
Second There are 8 nibbles in each of the 80 longs. So that gives 640 longs.
Third writes back 320 longs while incrementing the hub by 5? Is this by 5 bytes or longs???
On the P1 I would try 5 data bits at 1/2 the clock rate, which would just be sampled as each bit repeated. The clock line was already 10x slower at 25 MHz. Are there special codes that need to be sent for blanking or H- or V-Sync, though, that would require the full 10-bits of data?
SETQ+RDLONG of 80 longwords takes 2+80+8=90 clocks minimum.
SETQ+WRLONG of 320 longwords takes 2+320+2=324 clocks minimum.
Could possibly use fifo+WFLONG instead of SETQ+WRLONG and use it inline with the linebuffer encoding.
EDIT: Corrected for SETQ's preceding two clocks. And also realised the documents are 1+8 clocks and 1+2 clocks for RDLONG/WRLONG, respectively.
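The corrected transfer timings above can be written as small helpers: SETQ adds 2 clocks up front, each long moves in 1 clock, and RDLONG/WRLONG add 8 or 2 clocks of overhead respectively.

```python
# Clock counts for SETQ block transfers, per the figures quoted above.
def setq_rdlong_clocks(n):
    return 2 + n + 8          # 2 for SETQ, n transfers, 8 overhead

def setq_wrlong_clocks(n):
    return 2 + n + 2          # 2 for SETQ, n transfers, 2 overhead

print(setq_rdlong_clocks(80), setq_wrlong_clocks(320))   # 90 324
```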
The WRLONG addresses are separated by 5 longs and therefore cannot use SETQ & FIFO. With some other code in between, I think the 320 x WRLONG instructions could be:
WRLONG D,PTRA++[5]
Based on the table you published before for the 3bpp output, it looks like you need to write longs 3, 5, 8, 10, etc of the buffer.
So in total you are reading 160 longs (640 pixels) and you need to output 640 longs, not contiguously, but also not strictly every fifth long.
If the time was there to do it all in one core you'd process each input nibble, perform the lookup, then write the third or fifth long depending on even or odd nibble.
with 2 cores, you read 80 longs in, process every even or odd nibble, perform the lookup, then write one long and advance by 5.
with 4 cores, you could read 80 longs in, process only one nibble, perform the lookup, then write one long and advance 10.
However, in reading 80 longs into each core and then writing every fifth output long you are introducing an interleaved data format with even pixels in one block and odd pixels in the other.
To avoid this, you'd need to read 80 longs, then for each pixel pair loop through:
Lookup even pixel word
WRLONG D,PTRA++[2]
Lookup odd pixel word
WRLONG D,PTRA++[3]
Maybe even unrolling further so that each long gets a set of four writes.
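The pixel-pair loop above can be sketched numerically: write the even-pixel long and advance PTRA by 2, write the odd-pixel long and advance by 3, for a net stride of 5 longs per pair.

```python
# Model of the WRLONG D,PTRA++[2] / WRLONG D,PTRA++[3] write pattern:
# addresses are long offsets from the start of the output buffer.
addr, writes = 0, []
for _ in range(4):                   # four pixel pairs
    writes.append(addr); addr += 2   # even pixel, WRLONG D,PTRA++[2]
    writes.append(addr); addr += 3   # odd pixel,  WRLONG D,PTRA++[3]
print(writes)   # [0, 2, 5, 7, 10, 12, 15, 17]
```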
It seems to me that each core can also have more time to process the data as they can add around half the display field time to their processing allocation.
There might also be an opportunity to streamline further using LUT sharing so that one core simply reads in 160 longs to LUT, and both cores process from there.
Instead of reading 160 longs into each cog with a conventionally linear screen buffer, why not arrange the screen buffer in a format suitable for reading less data. Then use a mapping function in the screen update cog to convert from linear text and attributes buffers to the specific cog buffers.
So far the suggestions have been 80 into each, or 160 into one with LUT sharing. This is also a Rev A silicon issue only, as the Rev B silicon will have the hardware assistance to do this work (and more) without the need for such processing.
As the Rev B silicon will benefit from a linear screen buffer it's probably best to retain that structure at this point.
Thanks for all the replies.
One pixel is one nibble, so 640 pixels are 80 longs, not 160. If two cogs are needed to compose the line buffer, they would both read all 80 longs from hub RAM using SETQ+RDLONG, which is very fast, but only look at either the even or odd nibbles.
I'm certain that two line buffer cogs will do the job with time to spare for audio perhaps or some sprites. However, single, discontinuous writes to hub RAM using WRLONG waste a lot of clock cycles and I'm now looking at only one line buffer cog to do the following:
1. Read 80 longs from hub RAM to cog RAM using SETQ+RDLONG.
2. For each group of 20 longs (160 pixels), read each nibble in turn and use it as an index to a table of 16 longs containing TMDS data.
3. Write the 160 TMDS longs to particular addresses in LUT RAM in range 0-399.
4. Write 400 longs of TMDS data (1600 bytes=160 pixels) from LUT RAM to hub RAM using SETQ2+WRLONG.
5. Iterate 2-4 four times in total.
The LUT RAM would be pre-configured with static TMDS data as 60% of it would never change.
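A sketch of steps 2-3 above for one 20-long (160-pixel) chunk. The TMDS codes here are placeholders, and the slot mapping (the two variable longs of each 5-long group at offsets 5p+2 and 5p+4, following the "longs 3, 5, 8, 10" pattern mentioned earlier) is an assumption; the other three longs of each group hold the static, pre-filled TMDS data.

```python
# Hypothetical model of the nibble -> TMDS lookup filling the LUT buffer.
tmds_table = list(range(0x200, 0x210))   # placeholder, not real TMDS codes

def encode_chunk(longs, lut):
    pair = 0
    for value in longs:
        for shift in range(0, 32, 8):    # one byte = one pixel pair
            even = (value >> shift) & 0xF
            odd  = (value >> (shift + 4)) & 0xF
            lut[5 * pair + 2] = tmds_table[even]   # even pixel slot
            lut[5 * pair + 4] = tmds_table[odd]    # odd pixel slot
            pair += 1

lut = [0] * 400                          # one 400-long LUT chunk
encode_chunk([0x10325476] * 20, lut)
print(sum(1 for x in lut if x))          # 160 variable longs written
```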
Great,
My maths was off, but the concept was there. Given the size of LUTRAM both the partial line buffer and the lookup table will fit with space to spare.
So this gives a two core HDMI solution on Rev A silicon? One to form the line buffer and one to control the streamer to the pins?
Then for Rev B you can free up the line buffer core, and get extra colours into the bargain.
I need to check how long writing the 640 longs to LUT RAM takes, but provisionally the answers are yes and yes. It would be quite an achievement to use only one more cog on the P2 Rev A than the Rev B with its dedicated HDMI logic.
Each row above represents four bytes that are shifted right by one byte at a frequency of 10x the pixel clock, i.e. 250MHz for 640x480. The low six bits for every pixel are the same and only the top four bits can vary, as indicated by asterisks. The table above can be written in byte format as shown below. The 20 bytes/five longs are replicated 320 times in the 6400 byte/1600 long active display line buffer.
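The figures in the paragraph above, worked through: a 640x480 pixel clock of 25 MHz shifted at 10 bits per pixel gives 250 MHz, and 320 copies of the 20-byte/5-long group fill the active line buffer.

```python
# Line buffer sizing for 640x480 DVI at 10 bits per pixel.
bit_clock_mhz = 25 * 10      # 10x the 25 MHz pixel clock
buffer_bytes  = 20 * 320     # 20-byte group replicated 320 times
print(bit_clock_mhz, buffer_bytes, buffer_bytes // 4)   # 250 6400 1600
```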
RGBI video thus requires byte values for which the TMDS codes are identical in the low six bits. The recommended values are as follows:
#Dec, Hex, TMDS (hex), TMDS (binary), single output?
16, 10, 1F0, 0111110000, yes
47, 2F, 2B0, 1010110000, no
80, 50, 130, 0100110000, no
111, 6F, 270, 1001110000, no
144, 90, 170, 0101110000, no
175, AF, 230, 1000110000, no
208, D0, 1B0, 0110110000, no
239, EF, 2F0, 1011110000, yes
A yes to single output means the 8-bit input is always encoded, regardless of the disparity, to one 10-bit output with five '1's and five '0's. As can be seen, 10h and EFh produce balanced outputs. The other byte values encode differently depending on the disparity, but only the output that matches bits[5:0] of 10h and EFh is used. Strictly speaking, this is against the DVI/HDMI spec but TMDS decoders know nothing about disparity and will decode correctly.
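A quick check of the code table above: every chosen 10-bit code shares the same low six bits (110000), and in the table as given the two "single output" codes (for 10h and EFh) are the only ones with exactly five 1s, i.e. DC-balanced.

```python
# byte value -> chosen 10-bit TMDS code, from the table above
codes = {0x10: 0x1F0, 0x2F: 0x2B0, 0x50: 0x130, 0x6F: 0x270,
         0x90: 0x170, 0xAF: 0x230, 0xD0: 0x1B0, 0xEF: 0x2F0}
# all codes end in the same six bits
assert all(c & 0x3F == 0b110000 for c in codes.values())
# only 10h and EFh map to balanced codes (five 1s in ten bits)
balanced = [b for b, c in codes.items() if bin(c).count("1") == 5]
print([hex(b) for b in balanced])   # ['0x10', '0xef']
```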
Here is the CGA-style palette:
#N, Rh, Gh, Bh, Colour
0, 10, 10, 10, Black
1, 10, 10, AF, Dark Blue
2, 10, AF, 10, Dark Green
3, 10, AF, AF, Dark Cyan
4, AF, 10, 10, Dark Red
5, AF, 10, AF, Dark Magenta
6, AF, 6F, 10, Brown
7, AF, AF, AF, Light Grey
8, 50, 50, 50, Dark Grey
9, 10, 10, EF, Blue
10, 10, EF, 10, Green
11, 10, EF, EF, Cyan
12, EF, 10, 10, Red
13, EF, 10, EF, Magenta
14, EF, EF, 10, Yellow
15, EF, EF, EF, White
Note that nibble values N = 0,9-15 use only balanced outputs that meet the spec. For true CGA compatibility, bytes 10/50/6F/AF/EF would be 00/55/55/AA/FF. 6F was chosen as it makes a brown that is more distinct from dark red than using 50.
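The note above can be verified against the palette table: entries 0 and 9-15 use only the two balanced component values, 10h and EFh.

```python
# CGA-style palette from the table above: N -> (R, G, B)
palette = {
    0: (0x10, 0x10, 0x10),  1: (0x10, 0x10, 0xAF),
    2: (0x10, 0xAF, 0x10),  3: (0x10, 0xAF, 0xAF),
    4: (0xAF, 0x10, 0x10),  5: (0xAF, 0x10, 0xAF),
    6: (0xAF, 0x6F, 0x10),  7: (0xAF, 0xAF, 0xAF),
    8: (0x50, 0x50, 0x50),  9: (0x10, 0x10, 0xEF),
    10: (0x10, 0xEF, 0x10), 11: (0x10, 0xEF, 0xEF),
    12: (0xEF, 0x10, 0x10), 13: (0xEF, 0x10, 0xEF),
    14: (0xEF, 0xEF, 0x10), 15: (0xEF, 0xEF, 0xEF),
}
# entries built only from the balanced byte values 10h and EFh
balanced_only = [n for n, rgb in palette.items() if set(rgb) <= {0x10, 0xEF}]
print(balanced_only)   # [0, 9, 10, 11, 12, 13, 14, 15]
```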
How the TMDS data would be created has been mentioned already.
I like this concept TonyB_ and it is looking very worthy of trying out. A 2 COG HDMI line buffer driver on Rev A silicon sounds pretty good, though perhaps you might need an additional COG if you wanted to generate colour text from a screen buffer.
At first glance it seems there should be enough memory bandwidth for the transfers, and it may work if steps 2 and 3 are tightly coded. Each line has to be done in ~31.75 us for VGA. If the 1680 hub transfers you require (80 + 4*400) are clocked on every cycle at 250 MHz, that takes 6.72 us, leaving about 25 us to do the other steps plus loop overhead etc. So you get roughly 25*250 / 640 ~= 9 clocks per pixel for the inner loop. Hopefully that is enough to do the work; it's only 4.5 instructions per pixel, or 9 instructions per pixel pair if this critical inner loop can be expanded to work on pairs (it should be, because the pixel data in the buffer is nibble oriented anyway).
Any P2 PASM experts care to show if steps 2,3 can be coded in 9 instructions or fewer per data byte containing 2 pixels?
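The budget arithmetic from the post above, worked through at 250 MHz:

```python
# Per-pixel clock budget for the one-cog line buffer scheme.
line_us      = 31.75                # one 640x480 scanline
transfers    = 80 + 4 * 400         # hub longs moved per line
transfer_us  = transfers / 250      # one long per clock at 250 MHz
clocks_per_pixel = (line_us - transfer_us) * 250 / 640
print(round(transfer_us, 2), round(clocks_per_pixel, 1))   # 6.72 9.8
```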
2. For each group of 20 longs (160 pixels), read each nibble in turn and use it as an index to a table of 16 longs containing TMDS data.
3. Write the 160 TMDS longs to particular addresses in LUT RAM in range 0-399.
My first thought is to consider using bytes rather than nibbles (assuming there is enough space in cog+lut). Then you'd go straight from byte (=nibble-pair) to looking up a 16x16 matrix (256 longs). So there would be a lot of redundant waste in the 256-element LUT, but does that really matter?
Only problem is it seems you really want to read 32 bits per nibble (not nibble pair) and then write these into LUT, with the even nibble needing two 16-bit writes split over two LUT longs, and the odd nibble just a single 32-bit write to a LUT long.
So I think the basic algorithm is that the 9 instruction budget needs to do this, using 3x16-entry lookup tables in this case (two for the even pixels, which split the 16 bits of variable data into two separate halves, and one for the odd pixel):
-extract next pixel nibble from a long data value
-read LUT/COGRAM from table 1 base with even nibble index offset, get 32 bit data result
-write 32 bit result into LUT line buffer
-advance LUT line buffer write pointer by 1
-read LUT/COGRAM from table 2 base with even nibble index offset, get 32 bit data result
-write 32 bit result into LUT line buffer
-advance LUT line buffer write pointer by 2
-get next nibble from long data
-read LUT/COGRAM from table 3 base with odd nibble index offset, get 32 bit data result
-write 32 bit result to LUT line buffer
-advance LUT line buffer write pointer by 2
-advance to next nibble
Looks like it might be a bit too hard to fit this into 9 instructions using this straightforward approach, at least. Maybe other table data format optimizations are possible by merging tables 1 & 2 and rotating/masking, or with other self-modifying code etc. to tighten it up.
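The step list above can be modelled as a write-count sketch: per pixel pair, the even pixel takes two LUT writes (pointer +1, then +2) and the odd pixel one write (+2), a net stride of 5 longs per pair.

```python
# Write pattern for the three-table even/odd lookup scheme described above.
ptr, writes = 0, []
for _ in range(3):                   # three pixel pairs
    writes.append(ptr); ptr += 1     # table 1 result (even, first half)
    writes.append(ptr); ptr += 2     # table 2 result (even, second half)
    writes.append(ptr); ptr += 2     # table 3 result (odd)
print(writes)   # [0, 1, 3, 5, 6, 8, 10, 11, 13]
```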
If the LUT addresses are hardcoded and the inner loop unrolled to work on at least 32 bits (8 pixels) at a time, it might be possible to skip the LUT buffer pointer stuff and save on some pointer advancing, that could save 3 instructions. Depending on the length of this inner loop, you might have to then do more frequent LUT writes as well, ie. smaller LUT transfers to HUB which would need closer synchronization with the HDMI output COG.
Browsing the P2 document I find there are some nice nibble- and cog-RAM-table-oriented instructions which could be put to good use. If this per-byte code is repeated four times per source long and the LUT can be directly addressed, it might be possible to achieve a budget of fewer than 9 instruction cycles for two pixels. I don't know if my #xx, #yy, #zz WRLUT forms are allowed/correct, but if not, then hard-coded registers containing constant addresses might suffice. RDLUT/WRLUT don't seem very well documented unfortunately, so I can't tell how to use them.
(repeat 3 more times for all other nibbles of source long)
UPDATE: A sequence like this but writing to COGRAM instead, combined with some FIFO based writes from COGRAM nicely timed hitting hub windows might even suffice so perhaps LUT use can be avoided.
Don't discount a WRLONG as too slow. Instruction counting the spacing of WRLONGs can be reliably coded right down to its minimum of 3 clocks.
For 5-longword-spaced hub RAM writes: it's a mere 5 clocks from one write to the next write window. This is because advancing 5 longs (20 addresses) shifts the timing window by the same amount. The apparently shortened 5 clocks can work because the instruction itself only needs 3 clocks to execute. Any more is stalling time.
The second window for the second write then is 13 clocks, the third is 21 clocks, and so on. The window interval of 8 clocks is specific to an 8-cog Prop2.
Upon making the second write: the new window for the third write is again only 5 clocks away, because the new address is again another 5 longwords further on. This is general to all Prop2s.
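The window spacing described above, for an 8-cog P2: advancing the address by 5 longs brings its hub slice round 5 clocks later, and a missed window repeats every 8 clocks.

```python
# Hub write windows for addresses spaced 5 longs apart on an 8-cog P2.
def write_windows(offset_longs=5, cogs=8, n=4):
    first = offset_longs % cogs          # slice rotates one long per clock
    return [first + k * cogs for k in range(n)]

print(write_windows())   # [5, 13, 21, 29]
```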
Yeah, it would be nice if normal WRLONGs are found to be fast enough; however, in this case 3 long writes are needed for each pair of pixels. If each write takes 3 clock cycles at best, then that is already half our budget of 9 instructions (18 clocks) per two pixels, before we do any work. Perhaps the FIFO method that bursts multiple longs would help, though. In any case the LUT method seems to be a possibility now.
Missed your recent post ozpropdev, seems we cross-posted, but that looks interesting too. I need to see what this ALTI does, seems like it may be handy.
I'm using ALTI to substitute D with the contents of R1 and S with the contents of INDEX.
ALTI also increments INDEX in this configuration.