I don't think I'm going to do anything special for 'SETQ(2)+RD/WR/WMLONG' regarding PTRx behavior.
To do it all the way correctly would require a 6*18-bit multiplier, because the unit size would not be 1/2/4 for byte/word/long, but (Q[17:0]+1)*4, and the index would have to be multiplied by that. That's a little much.
I could constrain 'SETQ(2)+RD/WR/WMLONG' PTRx behavior to {++/--}PTRx{++/--}, where the index gets interpreted only as 0/1/-1. That would work, and allow for proper sequencing of sequential block transfers. I am going to look into that now.
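To make the tradeoff concrete, here is a rough Verilog sketch of what the constrained update might look like; the signal names (q, index, ptr) are placeholders of mine, not from the actual design:

// q[17:0] = SETQ value, index[5:0] = signed PTRx index field, ptr[19:0] = PTRA/PTRB
// A fully general update would need ptr + index*(q+1)*4, i.e. a 6*18-bit multiply.
// Constraining the index to 0/+1/-1 reduces it to an add/subtract of the block size:
wire [20:0] block_bytes = (q + 21'd1) << 2;                          // (Q[17:0]+1)*4 bytes moved
wire [19:0] ptr_next    = (index == 6'b000001) ? ptr + block_bytes   // index +1: PTRx advances past the block
                        : (index == 6'b111111) ? ptr - block_bytes   // index -1: PTRx backs up by the block
                        :                        ptr;                // index  0: PTRx unchanged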
The PTRx encoding seems to have a spare bit that could be used to increase the index range when PTRx does not change. A larger index means there is less need to modify the base pointer. I could explain it, but it's easier just to show it:
Existing:
1SUPNNNNN Expression Address used PTRx after instr INDEX
1000NNNNN PTRA[INDEX] PTRA + INDEX*SCALE PTRA -16..+15
1100NNNNN PTRB[INDEX] PTRB + INDEX*SCALE PTRB -16..+15
1011NNNNN PTRA++[INDEX] PTRA PTRA ++= INDEX*SCALE -16..+15
1111NNNNN PTRB++[INDEX] PTRB PTRB ++= INDEX*SCALE -16..+15
1010NNNNN ++PTRA[INDEX] PTRA + INDEX*SCALE PTRA ++= INDEX*SCALE -16..+15
1110NNNNN ++PTRB[INDEX] PTRB + INDEX*SCALE PTRB ++= INDEX*SCALE -16..+15
S = 0 for PTRA, 1 for PTRB
U = 0 to keep PTRx same, 1 to update PTRx (PTRx += INDEX*SCALE)
P = 0 to use PTRx + INDEX*SCALE, 1 to use PTRx (post-modify)
NNNNN = INDEX (signed)
--------------------------------------------------------------------------------
Proposed:
1SUXNNNNN Expression Address used PTRx after instr INDEX
100NNNNNN PTRA[INDEX] PTRA + INDEX*SCALE PTRA -32..+31
110NNNNNN PTRB[INDEX] PTRB + INDEX*SCALE PTRB -32..+31
1011NNNNN PTRA++[INDEX] PTRA PTRA ++= INDEX*SCALE -16..+15
1111NNNNN PTRB++[INDEX] PTRB PTRB ++= INDEX*SCALE -16..+15
1010NNNNN ++PTRA[INDEX] PTRA + INDEX*SCALE PTRA ++= INDEX*SCALE -16..+15
1110NNNNN ++PTRB[INDEX] PTRB + INDEX*SCALE PTRB ++= INDEX*SCALE -16..+15
If U = 0, then P = 0 and INDEX[5] = X
If U = 1, then P = X and INDEX[5] = INDEX[4]
Good find, TonyB_!
If we are not updating PTRx, we can force pre-type addressing (as if P=0) and use the P bit as an additional index bit. I'm looking into how this would affect the Verilog.
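A rough sketch of the repurposing, using placeholder names rather than anything from the actual Verilog (Chip's real implementation appears later in the thread):

// s[8:0] = the 9-bit pointer expression field, %1_S_U_P_NNNNN
wire       upd    = s[6];                      // U: 1 = update PTRx
wire [5:0] index6 = upd ? {s[4], s[4:0]}       // U=1: 5-bit signed index, sign-extended; s[5] stays the P bit
                        : s[5:0];              // U=0: pre-type forced, former P bit becomes index[5]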
TonyB_, I'm wondering if there's something better we could do with that bit, rather than asymmetrically extend the index range for non-PTRx-updating operations.
Regarding 'SETQ(2)+RD/WR/WMLONG': the objective is PTRx holding the correct address at the end of the fast block move, to avoid a separate add instruction. If PTRx is correct with an index of 0/1/-1, but incorrect/unchanged with a larger index, that would be OK with me.
I'll give it some thought. We've got to use this spare bit now that we've found it. If nothing better turns up, then we have a pretty good use for it.
I agree.
There's also the case of updating with index=0. That makes no sense.
Could we have +16 instead of +0?
Good idea. The two ideas together mean that you could expand the common range to +16.
One crazy-assed idea might be to use those unused cases to also update the D field in the next opcode, like ALTI might normally do. Then you could do block transfers faster, perhaps. It would break all the nice symmetry, however, and you would need a final dummy instruction. Kinda weird...
eg.
MOVD $+2,#0
REP #1, #100
RDLUT 0-0, ptr++[0] ' this also increments next D field.
NOP ' this is also needed I guess, (pity).
Now it takes only 300 clocks to transfer 100 longs from LUT to COG instead of 500. Big savings.
That would mean modifying the instruction, while our write opportunity is being taken by the RDLUT.
Maybe we could do it by modifying the result destination on the fly, but at what offset? We'd have to borrow the LSBs of PTRx, or get them from somewhere. I see the goal, though.
Does RDLUT taking three cycles give us a cycle for something extra?
Yes, my D-field plan was not very well thought through...very messy.
What's holding me back is that this kind of a change is going to bifurcate the tool effort. This will create two ways of assembling code. I'll need a switch in my assembler. I'm not sure that I want to go there. Wait, I could make renamed instructions for the new RDBYTE, RDLONG, etc. where their names end with an "n", or something, to indicate "new". That's how to do it. Then, I could support the current silicon while having special names for the new-silicon instructions that run only on the FPGA, for now. That will keep things simple.
In the soft HDMI, cog A will load a 256-long palette for cog B into LUT RAM, then cog B will copy it to cog RAM with the low eight read and write address bits the same. Although this particular LUT-cog transfer is not time-critical, it might be in other cases. Cog A can load its own different palette directly into cog RAM, but cog B cannot access hub RAM once the streamer is running.
EDIT:
Taking this a stage further, presumably cog B could start copying as soon as the first long is in the LUT RAM, and cog A could finish loading its palette and start on other code before cog B has finished copying.
Would it make sense to combine a preceding SETQ with RDLUT or WRLUT (with or without pointers?) to allow faster block transfers between LUT and COG RAM? Does that buy us anything, or simplify any of this?
I don't want to do that because I could go insane. I haven't found a need to move between cog and LUT. You can move cog/LUT between hub really easily, though, at 1-clock-per-long.
Fair enough. Too many possibilities, and this is yet another distraction we don't want.
In general, though, it does seem relatively slow to block-copy from LUT to COG memory at 4-5 clocks per long. Hopefully it won't need to be done very often.
Update: We can also achieve a 2-clocks-per-long transfer between the two RAMs via the hub, assuming we don't run the streamer at the same time.
Proposed:
1SUXNNNNN Expression Address used PTRx after instr INDEX
100NNNNNN PTRA[INDEX] PTRA + INDEX*SCALE PTRA -32..+31
110NNNNNN PTRB[INDEX] PTRB + INDEX*SCALE PTRB -32..+31
1011NNNNN PTRA++[INDEX] PTRA PTRA ++= INDEX*SCALE -16..-1,+1..+16
1111NNNNN PTRB++[INDEX] PTRB PTRB ++= INDEX*SCALE -16..-1,+1..+16
1010NNNNN ++PTRA[INDEX] PTRA + INDEX*SCALE PTRA ++= INDEX*SCALE -16..-1,+1..+16
1110NNNNN ++PTRB[INDEX] PTRB + INDEX*SCALE PTRB ++= INDEX*SCALE -16..-1,+1..+16
S = 0 for PTRA, 1 for PTRB
U = 0 to keep PTRx same, 1 to update PTRx (PTRx += INDEX*SCALE)
P = 0 to use PTRx + INDEX*SCALE, 1 to use PTRx (post-modify)
NNNNN = INDEX (signed)
If U = 0, then P = 0 and INDEX[5] = X
If U = 1, then P = X and INDEX[5] = INDEX[4]
I just implemented this. Here is how it turned out, exactly:
// ptra/ptrb writes
// #s 00000000000000000000000_0aaaaaaaa = address $00000..$000FF
// #s 00000000000000000000000_1s0iiiiii = address ptra/ptrb-static with -32..+31 scaled index (iiiiii)
// #s 00000000000000000000000_1s1p00000 = address ptra/ptrb-update with +16 scaled index (010000)
// #s 00000000000000000000000_1s1p0iiii = address ptra/ptrb-update with +1..+15 scaled index (00xxxx)
// #s 00000000000000000000000_1s1p10000 = address ptra/ptrb-update with -16 scaled index (110000)
// #s 00000000000000000000000_1s1p1iiii = address ptra/ptrb-update with -15..-1 scaled index (11xxxx)
// #s+augs xxxxxxxxxxxxaaaaaaaaaaa_aaaaaaaaa = address $00000..$FFFFF (x != 000000001???)
// #s+augs 000000001s0xiiiiiiiiiii_iiiiiiiii = address ptra/ptrb-static with 20-bit unscaled index
// #s+augs 000000001s1piiiiiiiiiii_iiiiiiiii = address ptra/ptrb-update with 20-bit unscaled index
wire [5:0] ptr_nx = ix[s6] ? {ix[s4], ix[s4] || ~|ix[s3:s0], ix[s3:s0]} // ptra/ptrb-update with -16..+16 scaled index
: ix[s5:s0]; // ptra/ptrb-static with -32..+31 scaled index
Now I need to modify the assembler to handle this. Any differently-assembling instructions have "_" after their name now.
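A tiny self-contained sanity check of the update-arm mapping, using local bit numbering instead of the s0..s6 symbols from the real source, might look like this:

module ptr_nx_check;
    reg  [4:0] i;                                        // raw 5-bit index field
    wire [5:0] nx = {i[4], i[4] || ~|i[3:0], i[3:0]};    // same expression as ptr_nx's update arm
    integer k;
    initial
        for (k = 0; k < 32; k = k + 1) begin
            i = k[4:0];
            #1 $display("field %%%b -> %0d", i, $signed(nx));  // expect +16, +1..+15, -16..-1
        end
endmodule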
Do you have conditional defines, or include-file ability, in your assembler?
It would be nice to keep source code unchanged where opcodes have the same names but need to produce slightly different binaries.
A target name would then be defined as a constant/pragma somewhere, and users could quickly use Rev A test files on Rev B with just a rebuild.
You can do a quick kludge by pre-defining PrefixStr_P2_Rev_A or similar and then checking whether it ever appears in user source (even as a label).
That is then used as a switch; it's less flexible than a full conditional/pragma, but faster to get going...
Thanks, Chip!
I think there won't be differently-assembling instructions if:
(a) +index of +0 is always assembled to the static version
(b) The new index[5] bit is used as follows:
index = index[4:0]+16*index[5] if index[4]=0 or
index = index[4:0]-16*index[5] if index[4]=1
index[5] is not the msb of a 6-bit index
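One possible reading of those two formulas, sketched with placeholder names (this is not what was implemented):

// index[4] = 0:  value = index[4:0] + 16*index[5]   ->  0..+31
// index[4] = 1:  value = index[4:0] - 16*index[5]   -> -32..-1
// i.e. index[5] pushes the old 5-bit value a further 16 away from zero,
// so existing 5-bit encodings would assemble unchanged.
wire signed [6:0] value = {{2{index[4]}}, index[4:0]}
                        + (index[5] ? (index[4] ? -7'sd16 : 7'sd16) : 7'sd0);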
I've got it all implemented in the Verilog and assembler and I just finished testing it. It all checks out okay.
I am intrigued by what you are saying here, but I can't figure it out.
Here is what I'm doing now, and it's working fine. If the 'u'pdate bit is 1 and index = %00000, then the final index = %010000, or +16. So, for PTRx updates, 0 is never allowed; index values 0..31 go: +16, 1..15, -16..-1. There wasn't much to do in the assembler to make it happen, but there are some differences between the old and new: index %00000 on an update now means +16, and bit 5 is repurposed from post/pre to an extra index bit when there's no update. All this took only several gates to implement.
This was a very simple change. Almost nothing. And the LE count dropped, somehow.
It seems to be working just fine.
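For the assembler side, the encode direction is just as small; here is a sketch (not PNut source), with 'req' as a placeholder for the requested update index of -16..+16 (0 not allowed):

wire [4:0] field = (req == 6'sd16) ? 5'b00000     // +16 reuses the old "0" code
                                   : req[4:0];    // +1..+15 and -16..-1 encode as before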
Thanks, Chip. LUT addressing is much better now.
Right, I don't have conditional defines or includes in the assembler yet. Will someday.
Okay. Thanks.