You know, I wonder if the P1 could be coaxed into generating a DVI signal with 4 COGs doing WAITVID simultaneously? It'll shift 32 bits at a whack, and since some monitors don't care about TMDS, it could be possible to do it.
You might even get away without a differential signal.
The P1's max PLL frequency is about 230 MHz (experimentally) - this is somewhat too low to generate the digital signal, which needs 250 MHz for 640x480, and they said it should be more than 225 MHz.
Thanks for the interest. It would be good if someone with a P2 could test this out. I've done all the really hard work - the thinking. One cog would stream out bytes from a line buffer at 250 MHz, which Chip has proven already. The other two would read from the frame buffer and write to the line buffer. There are 8000 clock cycles per line at 250 MHz and during this time the two cogs must do the following:
1. Read 80 longs from hub RAM to cog RAM.
2. For each of these 80 longs, read each even or odd nibble in turn and use it as an index to a table of 16 longs containing TMDS data.
3. Write the TMDS long to hub RAM and add 5 to the hub RAM pointer.
The hub RAM write will always cross a long boundary for one cog but never for the other. If the above is feasible, then I'll post further details tomorrow. If not, I won't!
Summary:
Read 80 longs from hub to cog using SETQ+RDLONG, then write 320 longs from cog to hub as described using WRLONG (plus one extra cycle for one cog), all within 8000 cycles. Could someone please confirm this is doable?
I don't want to spend a lot of time describing the method if it isn't.
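A rough cycle-budget sketch of the summary above, in Python. The 2+80+8 block-read figure and the 5-clock spacing for 5-long-stride WRLONGs are the numbers quoted in this thread; the 4 clocks of lookup work per write is purely a placeholder guess.

```python
# Rough cycle-budget check: read 80 longs, then issue 320 spaced WRLONGs,
# all inside one 8000-clock scanline at 250 MHz.
LINE_CLOCKS = 8000            # clocks per scanline at 250 MHz
block_read  = 2 + 80 + 8      # SETQ+RDLONG of 80 longs
writes      = 320 * 5         # WRLONGs landing 5 longs apart
lookup_work = 320 * 4         # placeholder: ~2 instructions per lookup
total = block_read + writes + lookup_work
print(total, total <= LINE_CLOCKS)   # 2970 True
```

Even with generous slack for loop overhead, the raw transfer cost is well under the 8000-clock line budget.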
I am not following properly.
I get the first point.
Second There are 8 nibbles in each of the 80 longs. So that gives 640 longs.
Third writes back 320 longs while incrementing the hub by 5? Is this by 5 bytes or longs???
On the P1 I would try 5 data bits at 1/2 the clock rate, which would just be sampled as each bit repeated. The clock line was already 10x slower at 25 MHz. Are there special codes that need to be sent for blanking or H- or V-Sync, though, that would require the full 10-bits of data?
SETQ+RDLONG of 80 longwords takes 2+80+8=90 clocks minimum.
SETQ+WRLONG of 320 longwords takes 2+320+2=324 clocks minimum.
Could possibly use fifo+WFLONG instead of SETQ+WRLONG and use it inline with the linebuffer encoding.
EDIT: Corrected for SETQ's preceding two clocks. And also realised the documents are 1+8 clocks and 1+2 clocks for RDLONG/WRLONG, respectively.
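The corrected transfer timings above can be written as small helpers: SETQ adds 2 clocks up front, each long moves in 1 clock, and RDLONG/WRLONG add 8 or 2 clocks of overhead respectively.

```python
# Clock counts for SETQ block transfers, per the figures quoted above.
def setq_rdlong_clocks(n):
    return 2 + n + 8          # 2 for SETQ, n transfers, 8 overhead

def setq_wrlong_clocks(n):
    return 2 + n + 2          # 2 for SETQ, n transfers, 2 overhead

print(setq_rdlong_clocks(80), setq_wrlong_clocks(320))   # 90 324
```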
The WRLONG addresses are separated by 5 longs and therefore cannot use SETQ & FIFO. With some other code in between, I think the 320 x WRLONG instructions could be:
WRLONG D,PTRA++[5]
Based on the table you published before for the 3bpp output, it looks like you need to write longs 3, 5, 8, 10, etc of the buffer.
So in total you are reading 160 longs (640 pixels) and you need to output 640 longs, not contiguously, but also not strictly every fifth long.
If the time was there to do it all in one core you'd process each input nibble, perform the lookup, then write the third or fifth long depending on even or odd nibble.
with 2 cores, you read 80 longs in, process every even or odd nibble, perform the lookup, then write one long and advance by 5.
with 4 cores, you could read 80 longs in, process only one nibble, perform the lookup, then write one long and advance 10.
However, in reading 80 longs into each core and then writing every fifth output long you are introducing an interleaved data format with even pixels in one block and odd pixels in the other.
To avoid this, you'd need to read 80 longs, then for each pixel pair loop through:
Lookup even pixel word
WRLONG D,PTRA++[2]
Lookup odd pixel word
WRLONG D,PTRA++[3]
Maybe even unrolling further so that each long gets a set of four writes.
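The pixel-pair loop above can be sketched numerically: write the even-pixel long and advance PTRA by 2, write the odd-pixel long and advance by 3, for a net stride of 5 longs per pair.

```python
# Model of the WRLONG D,PTRA++[2] / WRLONG D,PTRA++[3] write pattern:
# addresses are long offsets from the start of the output buffer.
addr, writes = 0, []
for _ in range(4):                   # four pixel pairs
    writes.append(addr); addr += 2   # even pixel, WRLONG D,PTRA++[2]
    writes.append(addr); addr += 3   # odd pixel,  WRLONG D,PTRA++[3]
print(writes)   # [0, 2, 5, 7, 10, 12, 15, 17]
```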
It seems to me that each core can also have more time to process the data as they can add around half the display field time to their processing allocation.
There might also be an opportunity to streamline further using LUT sharing so that one core simply reads in 160 longs to LUT, and both cores process from there.
Instead of reading 160 longs into each cog with a conventionally linear screen buffer, why not arrange the screen buffer in a format suitable for reading less data. Then use a mapping function in the screen update cog to convert from linear text and attributes buffers to the specific cog buffers.
So far the suggestions have been 80 into each, or 160 into one with LUT sharing. This is also a Rev A silicon issue only, as the Rev B silicon will have the hardware assistance to do this work (and more) without the need for such processing.
As the Rev B silicon will benefit from a linear screen buffer it's probably best to retain that structure at this point.
Thanks for all the replies.
One pixel is one nibble, so 640 pixels are 80 longs, not 160. If two cogs are needed to compose the line buffer, they would both read all 80 longs from hub RAM using SETQ+RDLONG, which is very fast, but only look at either the even or odd nibbles.
I'm certain that two line buffer cogs will do the job with time to spare for audio perhaps or some sprites. However, single, discontinuous writes to hub RAM using WRLONG waste a lot of clock cycles and I'm now looking at only one line buffer cog to do the following:
1. Read 80 longs from hub RAM to cog RAM using SETQ+RDLONG.
2. For each group of 20 longs (160 pixels), read each nibble in turn and use it as an index to a table of 16 longs containing TMDS data.
3. Write the 160 TMDS longs to particular addresses in LUT RAM in range 0-399.
4. Write 400 longs of TMDS data (1600 bytes=160 pixels) from LUT RAM to hub RAM using SETQ2+WRLONG.
5. Iterate 2-4 four times in total.
The LUT RAM would be pre-configured with static TMDS data as 60% of it would never change.
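A sketch of steps 2-3 above for one 20-long (160-pixel) chunk. The TMDS codes here are placeholders, and the slot mapping (the two variable longs of each 5-long group at offsets 5p+2 and 5p+4, following the "longs 3, 5, 8, 10" pattern mentioned earlier) is an assumption; the other three longs of each group hold the static, pre-filled TMDS data.

```python
# Hypothetical model of the nibble -> TMDS lookup filling the LUT buffer.
tmds_table = list(range(0x200, 0x210))   # placeholder, not real TMDS codes

def encode_chunk(longs, lut):
    pair = 0
    for value in longs:
        for shift in range(0, 32, 8):    # one byte = one pixel pair
            even = (value >> shift) & 0xF
            odd  = (value >> (shift + 4)) & 0xF
            lut[5 * pair + 2] = tmds_table[even]   # even pixel slot
            lut[5 * pair + 4] = tmds_table[odd]    # odd pixel slot
            pair += 1

lut = [0] * 400                          # one 400-long LUT chunk
encode_chunk([0x10325476] * 20, lut)
print(sum(1 for x in lut if x))          # 160 variable longs written
```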
Great,
My maths was off, but the concept was there. Given the size of LUTRAM both the partial line buffer and the lookup table will fit with space to spare.
So this gives a two core HDMI solution on Rev A silicon? One to form the line buffer and one to control the streamer to the pins?
Then for Rev B you can free up the line buffer core, and get extra colours into the bargain.
I need to check how long writing the 640 longs to LUT RAM takes, but provisionally the answers are yes and yes. It would be quite an achievement to use only one more cog on the P2 Rev A than the Rev B with its dedicated HDMI logic.
Each row above represents four bytes that are shifted right by one byte at a frequency of 10x the pixel clock, i.e. 250MHz for 640x480. The low six bits for every pixel are the same and only the top four bits can vary, as indicated by asterisks. The table above can be written in byte format as shown below. The 20 bytes/five longs are replicated 320 times in the 6400 byte/1600 long active display line buffer.
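The figures in the paragraph above, worked through: a 640x480 pixel clock of 25 MHz shifted at 10 bits per pixel gives 250 MHz, and 320 copies of the 20-byte/5-long group fill the active line buffer.

```python
# Line buffer sizing for 640x480 DVI at 10 bits per pixel.
bit_clock_mhz = 25 * 10      # 10x the 25 MHz pixel clock
buffer_bytes  = 20 * 320     # 20-byte group replicated 320 times
print(bit_clock_mhz, buffer_bytes, buffer_bytes // 4)   # 250 6400 1600
```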
RGBI video thus requires byte values for which the TMDS codes are identical in the low six bits. The recommended values are as follows:
#Dec, Hex, TMDS (hex), TMDS (binary), single output?
16, 10, 1F0, 0111110000, yes
47, 2F, 2B0, 1010110000, no
80, 50, 130, 0100110000, no
111, 6F, 270, 1001110000, no
144, 90, 170, 0101110000, no
175, AF, 230, 1000110000, no
208, D0, 1B0, 0110110000, no
239, EF, 2F0, 1011110000, yes
A yes to single output means the 8-bit input is always encoded, regardless of the disparity, to one 10-bit output with five '1's and five '0's. As can be seen, 10h and EFh produce balanced outputs. The other byte values encode differently depending on the disparity, but only the output that matches bits[5:0] of 10h and EFh is used. Strictly speaking, this is against the DVI/HDMI spec but TMDS decoders know nothing about disparity and will decode correctly.
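A quick check of the code table above: every chosen 10-bit code shares the same low six bits (110000), and in the table as given the two "single output" codes (for 10h and EFh) are the only ones with exactly five 1s, i.e. DC-balanced.

```python
# byte value -> chosen 10-bit TMDS code, from the table above
codes = {0x10: 0x1F0, 0x2F: 0x2B0, 0x50: 0x130, 0x6F: 0x270,
         0x90: 0x170, 0xAF: 0x230, 0xD0: 0x1B0, 0xEF: 0x2F0}
# all codes end in the same six bits
assert all(c & 0x3F == 0b110000 for c in codes.values())
# only 10h and EFh map to balanced codes (five 1s in ten bits)
balanced = [b for b, c in codes.items() if bin(c).count("1") == 5]
print([hex(b) for b in balanced])   # ['0x10', '0xef']
```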
Here is the CGA-style palette:
#N, Rh, Gh, Bh, Colour
0, 10, 10, 10, Black
1, 10, 10, AF, Dark Blue
2, 10, AF, 10, Dark Green
3, 10, AF, AF, Dark Cyan
4, AF, 10, 10, Dark Red
5, AF, 10, AF, Dark Magenta
6, AF, 6F, 10, Brown
7, AF, AF, AF, Light Grey
8, 50, 50, 50, Dark Grey
9, 10, 10, EF, Blue
10, 10, EF, 10, Green
11, 10, EF, EF, Cyan
12, EF, 10, 10, Red
13, EF, 10, EF, Magenta
14, EF, EF, 10, Yellow
15, EF, EF, EF, White
Note that nibble values N = 0,9-15 use only balanced outputs that meet the spec. For true CGA compatibility, bytes 10/50/6F/AF/EF would be 00/55/55/AA/FF. 6F was chosen as it makes a brown that is more distinct from dark red than using 50.
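The note above can be verified against the palette table: entries 0 and 9-15 use only the two balanced component values, 10h and EFh.

```python
# CGA-style palette from the table above: N -> (R, G, B)
palette = {
    0: (0x10, 0x10, 0x10),  1: (0x10, 0x10, 0xAF),
    2: (0x10, 0xAF, 0x10),  3: (0x10, 0xAF, 0xAF),
    4: (0xAF, 0x10, 0x10),  5: (0xAF, 0x10, 0xAF),
    6: (0xAF, 0x6F, 0x10),  7: (0xAF, 0xAF, 0xAF),
    8: (0x50, 0x50, 0x50),  9: (0x10, 0x10, 0xEF),
    10: (0x10, 0xEF, 0x10), 11: (0x10, 0xEF, 0xEF),
    12: (0xEF, 0x10, 0x10), 13: (0xEF, 0x10, 0xEF),
    14: (0xEF, 0xEF, 0x10), 15: (0xEF, 0xEF, 0xEF),
}
# entries built only from the balanced byte values 10h and EFh
balanced_only = [n for n, rgb in palette.items() if set(rgb) <= {0x10, 0xEF}]
print(balanced_only)   # [0, 9, 10, 11, 12, 13, 14, 15]
```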
How the TMDS data would be created has been mentioned already.
I like this concept TonyB_ and it is looking very worthy of trying out. A 2 COG HDMI line buffer driver on Rev A silicon sounds pretty good, though perhaps you might need an additional COG if you wanted to generate colour text from a screen buffer.
At first glance it seems there should be enough memory bandwidth for the transfers, and it may work if steps 2 and 3 are tightly coded. Each line has to be done in ~31.75 us for VGA. If the 1680 hub transfers you require (80 + 4*400) are clocked on every cycle at 250 MHz, that takes 6.72 us, leaving about 25 us to do the other steps plus loop overhead etc. So you get roughly 25*250 / 640 ~= 9 clocks per pixel for the inner loop. Hopefully that is enough to do the work; it's only 4.5 instructions per pixel, or 9 instructions per pixel pair if this critical inner loop can be expanded to work on pairs (it should be, because the pixel data in the buffer is nibble oriented anyway).
Any P2 PASM experts care to show if steps 2,3 can be coded in 9 instructions or fewer per data byte containing 2 pixels?
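The budget arithmetic from the post above, worked through at 250 MHz:

```python
# Per-pixel clock budget for the one-cog line buffer scheme.
line_us      = 31.75                # one 640x480 scanline
transfers    = 80 + 4 * 400         # hub longs moved per line
transfer_us  = transfers / 250      # one long per clock at 250 MHz
clocks_per_pixel = (line_us - transfer_us) * 250 / 640
print(round(transfer_us, 2), round(clocks_per_pixel, 1))   # 6.72 9.8
```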
2. For each group of 20 longs (160 pixels), read each nibble in turn and use it as an index to a table of 16 longs containing TMDS data.
3. Write the 160 TMDS longs to particular addresses in LUT RAM in range 0-399.
My first thought is to consider using bytes rather than nibbles (assuming there is enough space in cog+lut). Then you'd go straight from byte (=nibble-pair) to looking up a 16x16 matrix (256 longs). So there would be a lot of redundant waste in the 256-element LUT, but does that really matter?
Only problem is it seems you really want to read 32 bits per nibble (not nibble pair) and then write these into LUT, with the even nibble needing two 16-bit writes split over two LUT longs, and the odd nibble just a single 32-bit write to a LUT long.
So I think the basic algorithm is that the 9 instruction budget needs to do this, using 3x16-entry lookup tables in this case (two for the even pixels, which split the 16 bits of variable data into two separate halves, and one for the odd pixel):
-extract next pixel nibble from a long data value
-read LUT/COGRAM from table 1 base with even nibble index offset, get 32 bit data result
-write 32 bit result into LUT line buffer
-advance LUT line buffer write pointer by 1
-read LUT/COGRAM from table 2 base with even nibble index offset, get 32 bit data result
-write 32 bit result into LUT line buffer
-advance LUT line buffer write pointer by 2
-get next nibble from long data
-read LUT/COGRAM from table 3 base with odd nibble index offset, get 32 bit data result
-write 32 bit result to LUT line buffer
-advance LUT line buffer write pointer by 2
-advance to next nibble
Looks like it might be a bit too hard to fit this into 9 instructions using this straightforward approach, at least. Maybe other table data format optimizations are possible by merging tables 1 & 2 and rotating/masking, or with other self-modifying code etc. to tighten it up.
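The step list above can be modelled as a write-count sketch: per pixel pair, the even pixel takes two LUT writes (pointer +1, then +2) and the odd pixel one write (+2), a net stride of 5 longs per pair.

```python
# Write pattern for the three-table even/odd lookup scheme described above.
ptr, writes = 0, []
for _ in range(3):                   # three pixel pairs
    writes.append(ptr); ptr += 1     # table 1 result (even, first half)
    writes.append(ptr); ptr += 2     # table 2 result (even, second half)
    writes.append(ptr); ptr += 2     # table 3 result (odd)
print(writes)   # [0, 1, 3, 5, 6, 8, 10, 11, 13]
```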
If the LUT addresses are hardcoded and the inner loop unrolled to work on at least 32 bits (8 pixels) at a time, it might be possible to skip the LUT buffer pointer stuff and save on some pointer advancing, that could save 3 instructions. Depending on the length of this inner loop, you might have to then do more frequent LUT writes as well, ie. smaller LUT transfers to HUB which would need closer synchronization with the HDMI output COG.
Browsing the P2 document I find there are some nice nibble- and cog-RAM-table-oriented instructions which could be put to good use. If this per-byte code is repeated four times per source long and the LUT can be directly addressed, it might be possible to achieve a budget of fewer than 9 instruction cycles for two pixels. I don't know if my #xx, #yy, #zz WRLUT forms are allowed/correct, but if not, then hard-coded registers containing constant addresses might suffice. RDLUT/WRLUT don't seem very well documented unfortunately, so I can't tell how to use them.
(repeat 3 more times for all other nibbles of source long)
UPDATE: A sequence like this but writing to COGRAM instead, combined with some FIFO based writes from COGRAM nicely timed hitting hub windows might even suffice so perhaps LUT use can be avoided.
Don't discount a WRLONG as too slow. Instruction counting the spacing of WRLONGs can be reliably coded right down to its minimum of 3 clocks.
For 5-longword-spaced hub RAM writes: it's a mere 5 clocks from one write to the next write window. This is because advancing 5 longs (20 addresses) shifts the timing window by the same amount. The apparently shortened 5 clocks can work because the instruction itself only needs 3 clocks to execute. Any more is stalling time.
The second window for the second write then is 13 clocks, the third is 21 clocks, and so on. The window interval of 8 clocks is specific to an 8-cog Prop2.
Upon making the second write: the new window for the third write is again only 5 clocks away, because the new address is again another 5 longwords further on. This is general to all Prop2s.
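The window spacing described above, for an 8-cog P2: advancing the address by 5 longs brings its hub slice round 5 clocks later, and a missed window repeats every 8 clocks.

```python
# Hub write windows for addresses spaced 5 longs apart on an 8-cog P2.
def write_windows(offset_longs=5, cogs=8, n=4):
    first = offset_longs % cogs          # slice rotates one long per clock
    return [first + k * cogs for k in range(n)]

print(write_windows())   # [5, 13, 21, 29]
```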
Yeah, it would be nice if normal WRLONGs are found to be fast enough; however, in this case 3 long writes are needed for each pair of pixels. If each write takes 3 clock cycles at best, then that is already half our budget of 9 instructions (18 clocks) per two pixels, before we do any work. Perhaps the FIFO method that bursts multiple longs would help, though. In any case the LUT method seems to be a possibility now.
Missed your recent post ozpropdev, seems we cross-posted, but that looks interesting too. I need to see what this ALTI does, seems like it may be handy.
I'm using ALTI to substitute D with the contents of R1 and S with the contents of INDEX.
ALTI also increments INDEX in this configuration.