RDLUT/WRLUT with auto-incrementing address

Comments

  • PTRx expressions are added to RDLUT/WRLUT now.

    This was a very simple change. Almost nothing. And the LE count dropped, somehow.

    It seems to be working just fine.
    dat		org
    
    		mov	ptra,#6		'write pattern
    		wrlut	#7,ptra--
    		wrlut	#6,ptra--
    		wrlut	#5,ptra--
    		wrlut	#4,ptra--
    		wrlut	#3,ptra--
    		wrlut	#2,ptra--
    		wrlut	#1,ptra--
    		wrlut	#8,ptra--
    
    		rdlut	x,#6
    		cmp	x,#7	wz
    		rdlut	x,#5
    	if_z	cmp	x,#6	wz
    		rdlut	x,#4
    	if_z	cmp	x,#5	wz
    		rdlut	x,#3
    	if_z	cmp	x,#4	wz
    		rdlut	x,#2
    	if_z	cmp	x,#3	wz
    		rdlut	x,#1
    	if_z	cmp	x,#2	wz
    		rdlut	x,#0
    	if_z	cmp	x,#1	wz
    		rdlut	x,last
    	if_z	cmp	x,#8	wz
    
    		drvz	#32		'p32 = good
    		drvh	#33		'p33 = done
    
    		jmp	#$
    
    last		long	$1FF
    
    x		res	1
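For readers without the FPGA image, the write half of the test above can be modelled in a few lines of Python. This is only a sketch: it assumes the LUT holds 512 longs (so addresses wrap at 9 bits, which is why the last read uses $1FF) and that PTRA behaves as a 20-bit pointer. The helper names are made up for illustration.

```python
LUT_SIZE = 512  # assumption: LUT RAM is 512 longs, so the address wraps at 9 bits


def wrlut_post_dec(lut, ptra, value):
    """Model `wrlut #value,ptra--`: write at PTRA, then decrement PTRA."""
    lut[ptra % LUT_SIZE] = value
    return (ptra - 1) & 0xFFFFF  # assumption: PTRA is a 20-bit pointer


def run_write_pattern():
    """Replay the eight WRLUT instructions from the test, starting at PTRA = 6."""
    lut = [0] * LUT_SIZE
    ptra = 6
    for value in (7, 6, 5, 4, 3, 2, 1, 8):
        ptra = wrlut_post_dec(lut, ptra, value)
    return lut, ptra


lut, ptra = run_write_pattern()
for addr in (6, 5, 4, 3, 2, 1, 0, 0x1FF):
    print(f"LUT[${addr:03X}] = {lut[addr]}")
```

The eighth write lands at $1FF because PTRA has stepped below zero by then, which matches the `rdlut x,last` / `cmp x,#8` pair at the end of the test.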
    
  • I don't think I'm going to do anything special for 'SETQ(2)+RD/WR/WMLONG' regarding PTRx behavior.

    To do it all the way correctly would require a 6*18-bit multiplier, where the unit size is not 1/2/4 for byte/word/long, but (Q[17:0]+1)*4. That's a little much when you consider it must be multiplied.

    I could constrain 'SETQ(2)+RD/WR/WMLONG' PTRx behavior to {++/--}PTRx{++/--}, where the index gets interpreted only as 0/1/-1. That would work, and allow for proper sequencing of sequential block transfers. I am going to look into that now.
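To make the constrained version concrete, here is a rough Python model of where PTRx would end up after a block transfer if the index is only ever read as 0, +1 or -1. The function names and the 20-bit wrap are assumptions for illustration; only the unit size, (Q[17:0]+1)*4, comes from the post above.

```python
def block_bytes(q):
    """Bytes moved by SETQ + RD/WR/WMLONG: (Q[17:0] + 1) longs of 4 bytes each."""
    return ((q & 0x3FFFF) + 1) * 4


def ptr_after_block(ptr, q, index):
    """Hypothetical PTRx value after the transfer, index limited to -1, 0 or +1."""
    if index not in (-1, 0, 1):
        raise ValueError("index is only interpreted as 0, +1 or -1")
    return (ptr + index * block_bytes(q)) & 0xFFFFF  # assumption: 20-bit address


# Three back-to-back 16-long transfers, each stepping PTRx past its block.
ptr = 0x1000
for _ in range(3):
    ptr = ptr_after_block(ptr, q=15, index=1)
print(hex(ptr))
```

Because the index is limited to three values, the hardware only needs to add or subtract the block size once, which is what avoids the 6*18-bit multiplier.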
  • cgracey Posts: 11,646
    edited 2018-11-15 - 23:16:36
    TonyB_ wrote: »
    The PTRx encoding seems to have a spare bit that could be used to increase the index when PTRx does not change. A larger index means there is less need to modify the base pointer. I could explain it but it's easier to see it:
    Existing:
    
    1SUPNNNNN   Expression      Address used         PTRx after instr       INDEX
    
    1000NNNNN   PTRA[INDEX]     PTRA + INDEX*SCALE   PTRA                   -16..+15
    1100NNNNN   PTRB[INDEX]     PTRB + INDEX*SCALE   PTRB                   -16..+15
    1011NNNNN   PTRA++[INDEX]   PTRA                 PTRA ++= INDEX*SCALE   -16..+15
    1111NNNNN   PTRB++[INDEX]   PTRB                 PTRB ++= INDEX*SCALE   -16..+15
    1010NNNNN   ++PTRA[INDEX]   PTRA + INDEX*SCALE   PTRA ++= INDEX*SCALE   -16..+15
    1110NNNNN   ++PTRB[INDEX]   PTRB + INDEX*SCALE   PTRB ++= INDEX*SCALE   -16..+15
    
    S = 0 for PTRA, 1 for PTRB 
    U = 0 to keep PTRx same, 1 to update PTRx (PTRx += INDEX*SCALE) 
    P = 0 to use PTRx + INDEX*SCALE, 1 to use PTRx (post-modify) 
    NNNNN = INDEX (signed)
    
    --------------------------------------------------------------------------------
    
    Proposed:
    
    1SUXNNNNN   Expression      Address used         PTRx after instr       INDEX
    
    100NNNNNN   PTRA[INDEX]     PTRA + INDEX*SCALE   PTRA                   -32..+31
    110NNNNNN   PTRB[INDEX]     PTRB + INDEX*SCALE   PTRB                   -32..+31
    1011NNNNN   PTRA++[INDEX]   PTRA                 PTRA ++= INDEX*SCALE   -16..+15
    1111NNNNN   PTRB++[INDEX]   PTRB                 PTRB ++= INDEX*SCALE   -16..+15
    1010NNNNN   ++PTRA[INDEX]   PTRA + INDEX*SCALE   PTRA ++= INDEX*SCALE   -16..+15
    1110NNNNN   ++PTRB[INDEX]   PTRB + INDEX*SCALE   PTRB ++= INDEX*SCALE   -16..+15
    
    If U = 0, then P = 0 and INDEX[5] = X
    If U = 1, then P = X and INDEX[5] = INDEX[4]
    

    Good find, TonyB_!

    If we are not updating PTRx, we can force pre-type addressing (as if P=0) and use the P bit as an additional index bit. I'm looking into how this would affect the Verilog.
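A small Python decoder may help when checking the proposed table by hand. It is a sketch of the two rules quoted above (U = 0 gives a 6-bit index with P forced to 0, U = 1 keeps P and sign-extends the 5-bit index), not the Verilog; the function names are invented here.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode_proposed(field):
    """Decode a 9-bit 1SUXNNNNN field into (pointer, update, post, index)."""
    if not field & 0x100:
        raise ValueError("bit 8 must be 1 for a PTRx expression")
    s = (field >> 7) & 1
    u = (field >> 6) & 1
    x = (field >> 5) & 1
    n = field & 0x1F
    if u == 0:
        post = 0                              # If U = 0, then P = 0 ...
        index = sign_extend((x << 5) | n, 6)  # ... and INDEX[5] = X
    else:
        post = x                              # If U = 1, then P = X ...
        index = sign_extend(n, 5)             # ... and INDEX[5] = INDEX[4]
    return ("PTRB" if s else "PTRA", bool(u), bool(post), index)


print(decode_proposed(0b100011111))  # PTRA[31], no update
print(decode_proposed(0b111111111))  # PTRB++[-1], post-modify
```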
  • cgracey Posts: 11,646
    edited 2018-11-15 - 23:39:41
    TonyB_, I'm wondering if there's something better we could do with that bit, rather than asymmetrically extend the index range for non-PTRx-updating operations.
  • TonyB_ Posts: 1,223
    edited 2018-11-16 - 00:18:03
    cgracey wrote: »
    PTRx expressions are added to RDLUT/WRLUT now.

    This was a very simple change. Almost nothing. And the LE count dropped, somehow.

    It seems to be working just fine.

    Thanks, Chip. LUT addressing is much better now.
    cgracey wrote: »
    I don't think I'm going to do anything special for 'SETQ(2)+RD/WR/WMLONG' regarding PTRx behavior.

    To do it all the way correctly would require a 6*18-bit multiplier, where the unit size is not 1/2/4 for byte/word/long, but (Q[17:0]+1)*4. That's a little much when you consider it must be multiplied.

    I could constrain 'SETQ(2)+RD/WR/WMLONG' PTRx behavior to {++/--}PTRx{++/--}, where the index gets interpreted only as 0/1/-1. That would work, and allow for proper sequencing of sequential block transfers. I am going to look into that now.

    The objective is PTRx with the correct address at the end of the fast block move, to avoid a separate add instruction. If PTRx is correct with an index of 0/1/-1, but incorrect/unchanged with a larger index that would be OK with me.
    Formerly known as TonyB
  • cgracey wrote: »
    TonyB_, I'm wondering if there's something better we could do with that bit, rather than asymmetrically extend the index range for non-PTRx-updating operations.

    I'll give it some thought. We've got to use this spare bit now that we've found it. If nothing better turns up, then we have a pretty good use for it.
    Formerly known as TonyB
  • TonyB_ wrote: »
    cgracey wrote: »
    TonyB_, I'm wondering if there's something better we could do with that bit, rather than asymmetrically extend the index range for non-PTRx-updating operations.

    I'll give it some thought. We've got to use this spare bit now that we've found it. If nothing better turns up, then we have a pretty good use for it.

    I agree.

    There's also the case of updating with index=0. That makes no sense.
  • cgracey wrote: »
    TonyB_ wrote: »
    cgracey wrote: »
    TonyB_, I'm wondering if there's something better we could do with that bit, rather than asymmetrically extend the index range for non-PTRx-updating operations.

    I'll give it some thought. We've got to use this spare bit now that we've found it. If nothing better turns up, then we have a pretty good use for it.

    I agree.

    There's also the case of updating with index=0. That makes no sense.

    Could we have +16 instead of +0?
    Formerly known as TonyB
  • TonyB_ wrote: »
    cgracey wrote: »
    TonyB_ wrote: »
    cgracey wrote: »
    TonyB_, I'm wondering if there's something better we could do with that bit, rather than asymmetrically extend the index range for non-PTRx-updating operations.

    I'll give it some thought. We've got to use this spare bit now that we've found it. If nothing better turns up, then we have a pretty good use for it.

    I agree.

    There's also the case of updating with index=0. That makes no sense.

    Could we have +16 instead of +0?

    Good idea. The two ideas together mean that you could expand the common range to +16.
  • rogloh Posts: 1,167
    edited 2018-11-16 - 01:00:10
    One crazy-assed idea might be to use those unused cases to also update the D field in the next opcode like ALTI might normally do. Then you could do block transfers faster perhaps. It would break all the nice symmetry however and you would need a final dummy instruction. Kinda weird..

    eg.
    MOVD $+2,#0 
    REP #1, #100
    RDLUT 0-0, ptr++[0] ' this also increments next D field.
    NOP ' this is also needed I guess, (pity).
    
    Now it takes only 300 clocks to transfer 100 longs from LUT to COG instead of 500. Big savings.
  • cgracey Posts: 11,646
    edited 2018-11-16 - 01:03:35
    rogloh wrote: »
    One crazy-assed idea might be to use those unused cases to also update the D field in the next opcode like ALTI might normally do. Then you could do block transfers faster perhaps. It would break all the nice symmetry however and need a final instruction. Kinda weird..

    eg.
    MOVD $+2,#0 
    REP #1, #100
    RDLUT 0-0, ptr++[0] ' this also increments next D field.
    NOP ' this is also needed I guess, (pity).
    
    Now it takes only 300 clocks to transfer 100 longs from LUT to COG instead of 500. Big savings.

    That would mean modifying the instruction, while our write opportunity is being taken by the RDLUT.

    Maybe we could do it by modifying the result destination on the fly, but at what offset? We'd have to borrow the LSBs of PTRx, or get them from somewhere. I see the goal, though.
  • TonyB_ Posts: 1,223
    edited 2018-11-19 - 19:14:15
    Here's how both changes would look:
    Proposed:
    
    1SUXNNNNN   Expression      Address used         PTRx after instr       INDEX
    
    100NNNNNN   PTRA[INDEX]     PTRA + INDEX*SCALE   PTRA                   -32..+31
    110NNNNNN   PTRB[INDEX]     PTRB + INDEX*SCALE   PTRB                   -32..+31
    1011NNNNN   PTRA++[INDEX]   PTRA                 PTRA ++= INDEX*SCALE   -16..-1,+1..+16
    1111NNNNN   PTRB++[INDEX]   PTRB                 PTRB ++= INDEX*SCALE   -16..-1,+1..+16
    1010NNNNN   ++PTRA[INDEX]   PTRA + INDEX*SCALE   PTRA ++= INDEX*SCALE   -16..-1,+1..+16
    1110NNNNN   ++PTRB[INDEX]   PTRB + INDEX*SCALE   PTRB ++= INDEX*SCALE   -16..-1,+1..+16
    
    S = 0 for PTRA, 1 for PTRB 
    U = 0 to keep PTRx same, 1 to update PTRx (PTRx += INDEX*SCALE) 
    P = 0 to use PTRx + INDEX*SCALE, 1 to use PTRx (post-modify) 
    NNNNN = INDEX (signed)
    
    If U = 0, then P = 0 and INDEX[5] = X
    If U = 1, then P = X and INDEX[5] = INDEX[4]
    
    Formerly known as TonyB
  • cgracey wrote: »
    rogloh wrote: »
    One crazy-assed idea might be to use those unused cases to also update the D field in the next opcode like ALTI might normally do. Then you could do block transfers faster perhaps. It would break all the nice symmetry however and need a final instruction. Kinda weird..

    eg.
    MOVD $+2,#0 
    REP #1, #100
    RDLUT 0-0, ptr++[0] ' this also increments next D field.
    NOP ' this is also needed I guess, (pity).
    
    Now it takes only 300 clocks to transfer 100 longs from LUT to COG instead of 500. Big savings.

    That would mean modifying the instruction, while our write opportunity is being taken by the RDLUT.

    Maybe we could do it by modifying the result destination on the fly, but at what offset? We'd have to borrow the LSBs of PTRx, or get them from somewhere. I see the goal, though.

    Does RDLUT taking three cycles give us a cycle for something extra?
    Formerly known as TonyB
  • cgracey wrote: »
    That would mean modifying the instruction, while our write opportunity is being taken by the RDLUT.

    Maybe we could do it by modifying the result destination on the fly, but at what offset? We'd have to borrow the LSBs of PTRx, or get them from somewhere. I see the goal, though.

    Yes, my plan was not very well thought through...very messy. :(
  • cgracey Posts: 11,646
    edited 2018-11-16 - 01:23:45
    That is excellent, TonyB_!

    What's holding me back is that this kind of a change is going to bifurcate the tool effort. This will create two ways of assembling code. I'll need a switch in my assembler. I'm not sure that I want to go there. Wait, I could make renamed instructions for the new RDBYTE, RDLONG, etc. where their names end with an "n", or something, to indicate "new". That's how to do it. Then, I could support the current silicon while having special names for the new-silicon instructions that run only on the FPGA, for now. That will keep things simple.
  • TonyB_ Posts: 1,223
    edited 2018-11-16 - 13:57:53
    cgracey wrote: »
    rogloh wrote: »
    One crazy-assed idea might be to use those unused cases to also update the D field in the next opcode like ALTI might normally do. Then you could do block transfers faster perhaps. It would break all the nice symmetry however and need a final instruction. Kinda weird..

    eg.
    MOVD $+2,#0 
    REP #1, #100
    RDLUT 0-0, ptr++[0] ' this also increments next D field.
    NOP ' this is also needed I guess, (pity).
    
    Now it takes only 300 clocks to transfer 100 longs from LUT to COG instead of 500. Big savings.

    That would mean modifying the instruction, while our write opportunity is being taken by the RDLUT.

    Maybe we could do it by modifying the result destination on the fly, but at what offset? We'd have to borrow the LSBs of PTRx, or get them from somewhere. I see the goal, though.

    In the soft HDMI, cog A will load a 256-long palette for cog B into LUT RAM, then cog B will copy it to cog RAM with the low eight read and write address bits the same. Although this particular LUT-cog transfer is not time-critical, it might be in other cases. Cog A can load its own different palette directly into cog RAM, but cog B cannot access hub RAM once the streamer is running.

    EDIT:
    Taking this a stage further, presumably cog B could start copying as soon as the first long is in the LUT RAM and cog A could finish loading its palette and start on other code before cog B has finished copying.
    Formerly known as TonyB
  • Would it make sense to combine a preceding SETQ with RDLUT or WRLUT (with or without pointers) to allow faster block transfers between LUT and COG RAMs? Does that buy us anything or simplify any of this?
  • rogloh wrote: »
    Would it make sense to combine a preceding SETQ with RDLUT or WRLUT (with or without pointers) to allow faster block transfers between LUT and COG RAMs? Does that buy us anything or simplify any of this?

    I don't want to do that because I could go insane. I haven't found a need to move between cog and LUT. You can move data between cog/LUT and hub really easily, though, at 1-clock-per-long.
  • rogloh Posts: 1,167
    edited 2018-11-16 - 03:25:20
    Fair enough. Too many possibilities, and this is yet another distraction we don't want.

    In general though it does seem relatively slow to block copy from LUT to COG memory at 4-5 clocks per long. Hopefully it won't need to be done very often.

    Update: We can also achieve a 2-clock-per-long transfer between the two different RAMs via hub, assuming we don't run the streamer at the same time.
  • cgracey Posts: 11,646
    edited 2018-11-16 - 04:14:28
    TonyB_ wrote: »
    Here's how a simple, non-crazy change would look:
    Proposed:
    
    1SUXNNNNN   Expression      Address used         PTRx after instr       INDEX
    
    100NNNNNN   PTRA[INDEX]     PTRA + INDEX*SCALE   PTRA                   -32..+31
    110NNNNNN   PTRB[INDEX]     PTRB + INDEX*SCALE   PTRB                   -32..+31
    1011NNNNN   PTRA++[INDEX]   PTRA                 PTRA ++= INDEX*SCALE   -16..-1,+1..+16
    1111NNNNN   PTRB++[INDEX]   PTRB                 PTRB ++= INDEX*SCALE   -16..-1,+1..+16
    1010NNNNN   ++PTRA[INDEX]   PTRA + INDEX*SCALE   PTRA ++= INDEX*SCALE   -16..-1,+1..+16
    1110NNNNN   ++PTRB[INDEX]   PTRB + INDEX*SCALE   PTRB ++= INDEX*SCALE   -16..-1,+1..+16
    
    S = 0 for PTRA, 1 for PTRB 
    U = 0 to keep PTRx same, 1 to update PTRx (PTRx += INDEX*SCALE) 
    P = 0 to use PTRx + INDEX*SCALE, 1 to use PTRx (post-modify) 
    NNNNN = INDEX (signed)
    
    If U = 0, then P = 0 and INDEX[5] = X
    If U = 1, then P = X and INDEX[5] = INDEX[4]
    

    I just implemented this. Here is how it turned out, exactly:
    // ptra/ptrb writes
    
    // #s		00000000000000000000000_0aaaaaaaa	= address $00000..$000FF
    // #s		00000000000000000000000_1s0iiiiii	= address ptra/ptrb-static with -32..+31 scaled index (iiiiii)
    // #s		00000000000000000000000_1s1p00000	= address ptra/ptrb-update with      +16 scaled index (010000)
    // #s		00000000000000000000000_1s1p0iiii	= address ptra/ptrb-update with  +1..+15 scaled index (00xxxx)
    // #s		00000000000000000000000_1s1p10000	= address ptra/ptrb-update with      -16 scaled index (110000)
    // #s		00000000000000000000000_1s1p1iiii	= address ptra/ptrb-update with  -15..-1 scaled index (11xxxx)
    // #s+augs	xxxxxxxxxxxxaaaaaaaaaaa_aaaaaaaaa	= address $00000..$FFFFF (x != 000000001???)
    // #s+augs	000000001s0xiiiiiiiiiii_iiiiiiiii	= address ptra/ptrb-static with 20-bit unscaled index
    // #s+augs	000000001s1piiiiiiiiiii_iiiiiiiii	= address ptra/ptrb-update with 20-bit unscaled index
    
    wire [5:0] ptr_nx	= ix[s6] ? {ix[s4], ix[s4] || ~|ix[s3:s0], ix[s3:s0]}	// ptra/ptrb-update with -16..+16 scaled index
    				 : ix[s5:s0];					// ptra/ptrb-static with -32..+31 scaled index
    

    Now I need to modify the assembler to handle this. Any differently-assembling instructions have "_" after their name now.
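For anyone who wants to check the `ptr_nx` expression without a simulator, the following Python function mirrors it bit for bit. It is a sketch written from the Verilog line above, taking bit 6 of the 9-bit field as the update bit and bits 5..0 as the index bits; the Python names are not from the source.

```python
def ptr_nx(ix):
    """Signed scaled index selected by a 9-bit PTRx field, modelled on ptr_nx."""
    update = (ix >> 6) & 1
    if update:
        b4 = (ix >> 4) & 1
        low = ix & 0xF
        hi = b4                                # ix[s4]
        mid = 1 if (b4 or low == 0) else 0     # ix[s4] || ~|ix[s3:s0]
        raw = (hi << 5) | (mid << 4) | low     # ptra/ptrb-update, -16..+16
    else:
        raw = ix & 0x3F                        # ptra/ptrb-static, -32..+31
    return raw - 64 if raw & 0x20 else raw     # read the 6 bits as signed


# Update forms: index field values 0..31 in order.
print([ptr_nx(0x140 | n) for n in range(32)])
```

The only special case is the all-zero index on an update, where the OR term sets bit 4 and turns what would be +0 into +16.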
  • jmg Posts: 13,840
    cgracey wrote: »
    Now I need to modify the assembler to handle this. Any differently-assembling instructions have "_" after their name now.

    Do you have conditional defines, or include file ability in your assembler ?
    It would be nice to have source code unchanged where opcodes have the same names, but need to produce slightly differing binaries.

    A target name is then defined as a constant / pragma somewhere, and users can quickly use Rev A test files on Rev B, with just a rebuild.

  • jmg wrote: »
    cgracey wrote: »
    Now I need to modify the assembler to handle this. Any differently-assembling instructions have "_" after their name now.

    Do you have conditional defines, or include file ability in your assembler ?
    It would be nice to have source code unchanged where opcodes have the same names, but need to produce slightly differing binaries.

    A target name is then defined as a constant / pragma somewhere, and users can quickly use Rev A test files on Rev B, with just a rebuild.

    Right. I don't have that. Will someday.
  • jmg Posts: 13,840
    cgracey wrote: »
    Right. I don't have that. Will someday.

    You can do a quick kludge by pre-defining PrefixStr_P2_Rev_A or similar and then checking if it ever appears in user source. (even as a label)
    That is then used as a switch, this is less flexible than full conditional/pragma, but faster to get going...
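As a rough illustration of that kludge, a tool could scan the source text for the marker symbol before assembling and pick the encoding rules from that. The sketch below is hypothetical: the marker name is the placeholder from the post, and the "rev_a"/"rev_b" return values are made up for the example.

```python
import re

REV_A_MARKER = "PrefixStr_P2_Rev_A"  # placeholder symbol name from the post


def select_target(source):
    """Return 'rev_a' if the marker symbol appears anywhere in the source, else 'rev_b'."""
    pattern = r"\b" + re.escape(REV_A_MARKER) + r"\b"
    return "rev_a" if re.search(pattern, source) else "rev_b"


print(select_target("PrefixStr_P2_Rev_A\n\t\tmov\tptra,#6\n"))
print(select_target("\t\tmov\tptra,#6\n"))
```

The word-boundary match means the marker works even when it only appears as a label, but a longer identifier that merely contains it does not trigger the switch.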


  • TonyB_ Posts: 1,223
    edited 2018-11-16 - 12:41:38
    cgracey wrote: »
    TonyB_ wrote: »
    Here's how a simple, non-crazy change would look:
    Proposed:
    
    1SUXNNNNN   Expression      Address used         PTRx after instr       INDEX
    
    100NNNNNN   PTRA[INDEX]     PTRA + INDEX*SCALE   PTRA                   -32..+31
    110NNNNNN   PTRB[INDEX]     PTRB + INDEX*SCALE   PTRB                   -32..+31
    1011NNNNN   PTRA++[INDEX]   PTRA                 PTRA ++= INDEX*SCALE   -16..-1,+1..+16
    1111NNNNN   PTRB++[INDEX]   PTRB                 PTRB ++= INDEX*SCALE   -16..-1,+1..+16
    1010NNNNN   ++PTRA[INDEX]   PTRA + INDEX*SCALE   PTRA ++= INDEX*SCALE   -16..-1,+1..+16
    1110NNNNN   ++PTRB[INDEX]   PTRB + INDEX*SCALE   PTRB ++= INDEX*SCALE   -16..-1,+1..+16
    
    S = 0 for PTRA, 1 for PTRB 
    U = 0 to keep PTRx same, 1 to update PTRx (PTRx += INDEX*SCALE) 
    P = 0 to use PTRx + INDEX*SCALE, 1 to use PTRx (post-modify) 
    NNNNN = INDEX (signed)
    
    If U = 0, then P = 0 and INDEX[5] = X
    If U = 1, then P = X and INDEX[5] = INDEX[4]
    

    I just implemented this. Here is how it turned out, exactly:
    // ptra/ptrb writes
    
    // #s		00000000000000000000000_0aaaaaaaa	= address $00000..$000FF
    // #s		00000000000000000000000_1s0iiiiii	= address ptra/ptrb-static with -32..+31 scaled index (iiiiii)
    // #s		00000000000000000000000_1s1p00000	= address ptra/ptrb-update with      +16 scaled index (010000)
    // #s		00000000000000000000000_1s1p0iiii	= address ptra/ptrb-update with  +1..+15 scaled index (00xxxx)
    // #s		00000000000000000000000_1s1p10000	= address ptra/ptrb-update with      -16 scaled index (110000)
    // #s		00000000000000000000000_1s1p1iiii	= address ptra/ptrb-update with  -15..-1 scaled index (11xxxx)
    // #s+augs	xxxxxxxxxxxxaaaaaaaaaaa_aaaaaaaaa	= address $00000..$FFFFF (x != 000000001???)
    // #s+augs	000000001s0xiiiiiiiiiii_iiiiiiiii	= address ptra/ptrb-static with 20-bit unscaled index
    // #s+augs	000000001s1piiiiiiiiiii_iiiiiiiii	= address ptra/ptrb-update with 20-bit unscaled index
    
    wire [5:0] ptr_nx	= ix[s6] ? {ix[s4], ix[s4] || ~|ix[s3:s0], ix[s3:s0]}	// ptra/ptrb-update with -16..+16 scaled index
    				 : ix[s5:s0];					// ptra/ptrb-static with -32..+31 scaled index
    

    Now I need to modify the assembler to handle this. Any differently-assembling instructions have "_" after their name now.

    Thanks, Chip!

    I think there won't be differently-assembling instructions if

    (a) +index of +0 is always assembled to the static version
    (b) The new index[5] bit is used as follows

    index = index[4:0]+16*index[5] if index[4]=0 or
    index = index[4:0]-16*index[5] if index[4]=1

    index[5] is now not the msb of a 6-bit index
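One literal reading of the two formulas, sketched in Python. It assumes index[4:0] is a signed 5-bit value and index[5] is the extra bit; this is an interpretation for illustration, not a confirmed encoding.

```python
def sign_extend5(value):
    """Read the low five bits as a signed value in -16..+15."""
    value &= 0x1F
    return value - 32 if value & 0x10 else value


def extended_index(field6):
    """Apply index[4:0] +/- 16*index[5], with the sign taken from index[4]."""
    base = sign_extend5(field6)
    extra = (field6 >> 5) & 1
    if base >= 0:                 # index[4] = 0
        return base + 16 * extra
    return base - 16 * extra      # index[4] = 1


# With index[5] = 0 every existing 5-bit encoding keeps its old meaning.
print([extended_index(n) for n in range(32)])
# With index[5] = 1 the same fields reach +16..+31 and -32..-17.
print([extended_index(0x20 | n) for n in range(32)])
```

Under this reading the 64 field values still cover -32..+31 exactly once, which would be why no existing instruction assembles differently.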
    Formerly known as TonyB
  • Above post edited for clarity regarding behaviour of new index bit.
    Formerly known as TonyB
  • cgracey Posts: 11,646
    edited 2018-11-16 - 12:48:07
    TonyB_ wrote: »
    Thanks, Chip!

    I think there won't be differently-assembling instructions if:

    (a) +index of +0 is always assembled to the static version
    (b) The new index[5] bit is used as follows:

    index = index[4:0]+16*index[5] if index[4]=0 or
    index = index[4:0]-16*index[5] if index[4]=1

    index[5] is not the msb of a 6-bit index

    I've got it all implemented in the Verilog and assembler and I just finished testing it. It all checks out okay.

    I am intrigued by what you are saying here, but I can't figure it out.

    Here is what I'm doing now and it's working fine. If the 'u'pdate bit is 1 and the index = %00000, then the final index = %010000, or +16. So, for PTRx updates, 0 is never allowed. Index values 0..31 go: +16, 1..15, -16..-1. There wasn't much to do in the assembler to make it happen, but there are some differences between the old and new. The differences are utilizing index %00000 on update as +16 and repurposing bit 5 from post/pre to an extra index bit when there's no update. All this took only several gates to implement.
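On the assembler side, the mapping described here might look something like the Python sketch below. It is not the actual assembler code; the function names are invented, and it only covers the 5-bit index field of the update forms.

```python
def encode_update_index(index):
    """5-bit field for a PTRx-update index in -16..-1 or +1..+16."""
    if index == 0 or not -16 <= index <= 16:
        raise ValueError("update index must be -16..-1 or +1..+16")
    if index == 16:
        return 0b00000        # %00000 is reused to mean +16
    return index & 0x1F       # two's complement in five bits


def decode_update_field(field):
    """Inverse mapping: field values 0..31 go +16, 1..15, -16..-1."""
    if field == 0:
        return 16
    return field if field < 16 else field - 32


print([decode_update_field(f) for f in range(32)])
```

Rejecting an index of 0 on an update is the one place where old source could assemble differently, since %00000 now carries +16.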
  • Chip, I'll create some #S examples to show you what I mean.
    Formerly known as TonyB
  • TonyB_ wrote: »
    Chip, I'll create some #S examples to show you what I mean.

    Okay. Thanks.