I don't think I'm going to do anything special for 'SETQ(2)+RD/WR/WMLONG' regarding PTRx behavior.
To do it all the way correctly would require a 6*18-bit multiplier, because the unit size would not be 1/2/4 for byte/word/long, but (Q[17:0]+1)*4, and the index would have to be multiplied by that. That's a little much.
I could constrain 'SETQ(2)+RD/WR/WMLONG' PTRx behavior to {++/--}PTRx{++/--}, where the index gets interpreted only as 0/1/-1. That would work, and allow for proper sequencing of sequential block transfers. I am going to look into that now.
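To make the tradeoff concrete, here is a rough Verilog sketch of what the constrained update might look like; the signal names (q, index, ptr) are placeholders of mine, not from the actual design:

// q[17:0] = SETQ value, index[5:0] = signed PTRx index field, ptr[19:0] = PTRA/PTRB
// A fully general update would need ptr + index*(q+1)*4, i.e. a 6*18-bit multiply.
// Constraining the index to 0/+1/-1 reduces it to an add/subtract of the block size:
wire [20:0] block_bytes = (q + 21'd1) << 2;                          // (Q[17:0]+1)*4 bytes moved
wire [19:0] ptr_next    = (index == 6'b000001) ? ptr + block_bytes   // index +1: PTRx advances past the block
                        : (index == 6'b111111) ? ptr - block_bytes   // index -1: PTRx backs up by the block
                        :                        ptr;                // index  0: PTRx unchanged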
The PTRx encoding seems to have a spare bit that could be used to increase the index range when PTRx does not change. A larger index means there is less need to modify the base pointer. I could explain it, but it's easier just to show it:
Existing:
1SUPNNNNN Expression Address used PTRx after instr INDEX
1000NNNNN PTRA[INDEX] PTRA + INDEX*SCALE PTRA -16..+15
1100NNNNN PTRB[INDEX] PTRB + INDEX*SCALE PTRB -16..+15
1011NNNNN PTRA++[INDEX] PTRA PTRA ++= INDEX*SCALE -16..+15
1111NNNNN PTRB++[INDEX] PTRB PTRB ++= INDEX*SCALE -16..+15
1010NNNNN ++PTRA[INDEX] PTRA + INDEX*SCALE PTRA ++= INDEX*SCALE -16..+15
1110NNNNN ++PTRB[INDEX] PTRB + INDEX*SCALE PTRB ++= INDEX*SCALE -16..+15
S = 0 for PTRA, 1 for PTRB
U = 0 to keep PTRx same, 1 to update PTRx (PTRx += INDEX*SCALE)
P = 0 to use PTRx + INDEX*SCALE, 1 to use PTRx (post-modify)
NNNNN = INDEX (signed)
--------------------------------------------------------------------------------
Proposed:
1SUXNNNNN Expression Address used PTRx after instr INDEX
100NNNNNN PTRA[INDEX] PTRA + INDEX*SCALE PTRA -32..+31
110NNNNNN PTRB[INDEX] PTRB + INDEX*SCALE PTRB -32..+31
1011NNNNN PTRA++[INDEX] PTRA PTRA ++= INDEX*SCALE -16..+15
1111NNNNN PTRB++[INDEX] PTRB PTRB ++= INDEX*SCALE -16..+15
1010NNNNN ++PTRA[INDEX] PTRA + INDEX*SCALE PTRA ++= INDEX*SCALE -16..+15
1110NNNNN ++PTRB[INDEX] PTRB + INDEX*SCALE PTRB ++= INDEX*SCALE -16..+15
If U = 0, then P = 0 and INDEX[5] = X
If U = 1, then P = X and INDEX[5] = INDEX[4]
Good find, TonyB_!
If we are not updating PTRx, we can force pre-type addressing (as if P=0) and use the P bit as an additional index bit. I'm looking into how this would affect the Verilog.
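A rough sketch of the repurposing, using placeholder names rather than anything from the actual Verilog (Chip's real implementation appears later in the thread):

// s[8:0] = the 9-bit pointer expression field, %1_S_U_P_NNNNN
wire       upd    = s[6];                      // U: 1 = update PTRx
wire [5:0] index6 = upd ? {s[4], s[4:0]}       // U=1: 5-bit signed index, sign-extended; s[5] stays the P bit
                        : s[5:0];              // U=0: pre-type forced, former P bit becomes index[5]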
TonyB_, I'm wondering if there's something better we could do with that bit, rather than asymmetrically extend the index range for non-PTRx-updating operations.
Regarding 'SETQ(2)+RD/WR/WMLONG': the objective is PTRx holding the correct address at the end of the fast block move, to avoid a separate add instruction. If PTRx is correct with an index of 0/1/-1, but incorrect/unchanged with a larger index, that would be OK with me.
I'll give it some thought. We've got to use this spare bit now that we've found it. If nothing better turns up, then we have a pretty good use for it.
I agree.
There's also the case of updating with index=0. That makes no sense.
Could we have +16 instead of +0?
Good idea. The two ideas together mean that you could expand the common range to +16.
One crazy-assed idea might be to use those unused cases to also update the D field in the next opcode, like ALTI might normally do. Then you could do block transfers faster, perhaps. It would break all the nice symmetry, however, and you would need a final dummy instruction. Kinda weird...
eg.
MOVD $+2,#0
REP #1, #100
RDLUT 0-0, ptr++[0] ' this also increments next D field.
NOP ' this is also needed I guess, (pity).
Now it takes only 300 clocks to transfer 100 longs from LUT to COG instead of 500. Big savings.
That would mean modifying the instruction, while our write opportunity is being taken by the RDLUT.
Maybe we could do it by modifying the result destination on the fly, but at what offset? We'd have to borrow the LSBs of PTRx, or get them from somewhere. I see the goal, though.
Does RDLUT taking three cycles give us a cycle for something extra?
Yes, my D-field plan was not very well thought through...very messy.
What's holding me back is that this kind of a change is going to bifurcate the tool effort. This will create two ways of assembling code. I'll need a switch in my assembler. I'm not sure that I want to go there. Wait, I could make renamed instructions for the new RDBYTE, RDLONG, etc. where their names end with an "n", or something, to indicate "new". That's how to do it. Then, I could support the current silicon while having special names for the new-silicon instructions that run only on the FPGA, for now. That will keep things simple.
In the soft HDMI, cog A will load a 256-long palette for cog B into LUT RAM, then cog B will copy it to cog RAM with the low eight read and write address bits the same. Although this particular LUT-cog transfer is not time-critical, it might be in other cases. Cog A can load its own different palette directly into cog RAM, but cog B cannot access hub RAM once the streamer is running.
EDIT:
Taking this a stage further, presumably cog B could start copying as soon as the first long is in the LUT RAM, and cog A could finish loading its palette and start on other code before cog B has finished copying.
Would it make sense to combine a preceding SETQ with RDLUT or WRLUT (with or without pointers?) to allow faster block transfers between LUT and COG RAM? Does that buy us anything, or simplify any of this?
I don't want to do that because I could go insane. I haven't found a need to move between cog and LUT. You can move cog/LUT between hub really easily, though, at 1-clock-per-long.
Fair enough. Too many possibilities, and this is yet another distraction we don't want.
In general, though, it does seem relatively slow to block-copy from LUT to COG memory at 4-5 clocks per long. Hopefully it won't need to be done very often.
Update: We can also achieve a 2-clocks-per-long transfer between the two RAMs via the hub, assuming we don't run the streamer at the same time.
Proposed:
1SUXNNNNN Expression Address used PTRx after instr INDEX
100NNNNNN PTRA[INDEX] PTRA + INDEX*SCALE PTRA -32..+31
110NNNNNN PTRB[INDEX] PTRB + INDEX*SCALE PTRB -32..+31
1011NNNNN PTRA++[INDEX] PTRA PTRA ++= INDEX*SCALE -16..-1,+1..+16
1111NNNNN PTRB++[INDEX] PTRB PTRB ++= INDEX*SCALE -16..-1,+1..+16
1010NNNNN ++PTRA[INDEX] PTRA + INDEX*SCALE PTRA ++= INDEX*SCALE -16..-1,+1..+16
1110NNNNN ++PTRB[INDEX] PTRB + INDEX*SCALE PTRB ++= INDEX*SCALE -16..-1,+1..+16
S = 0 for PTRA, 1 for PTRB
U = 0 to keep PTRx same, 1 to update PTRx (PTRx += INDEX*SCALE)
P = 0 to use PTRx + INDEX*SCALE, 1 to use PTRx (post-modify)
NNNNN = INDEX (signed)
If U = 0, then P = 0 and INDEX[5] = X
If U = 1, then P = X and INDEX[5] = INDEX[4]
I just implemented this. Here is how it turned out, exactly:
// ptra/ptrb writes
// #s 00000000000000000000000_0aaaaaaaa = address $00000..$000FF
// #s 00000000000000000000000_1s0iiiiii = address ptra/ptrb-static with -32..+31 scaled index (iiiiii)
// #s 00000000000000000000000_1s1p00000 = address ptra/ptrb-update with +16 scaled index (010000)
// #s 00000000000000000000000_1s1p0iiii = address ptra/ptrb-update with +1..+15 scaled index (00xxxx)
// #s 00000000000000000000000_1s1p10000 = address ptra/ptrb-update with -16 scaled index (110000)
// #s 00000000000000000000000_1s1p1iiii = address ptra/ptrb-update with -15..-1 scaled index (11xxxx)
// #s+augs xxxxxxxxxxxxaaaaaaaaaaa_aaaaaaaaa = address $00000..$FFFFF (x != 000000001???)
// #s+augs 000000001s0xiiiiiiiiiii_iiiiiiiii = address ptra/ptrb-static with 20-bit unscaled index
// #s+augs 000000001s1piiiiiiiiiii_iiiiiiiii = address ptra/ptrb-update with 20-bit unscaled index
wire [5:0] ptr_nx = ix[s6] ? {ix[s4], ix[s4] || ~|ix[s3:s0], ix[s3:s0]} // ptra/ptrb-update with -16..+16 scaled index
: ix[s5:s0]; // ptra/ptrb-static with -32..+31 scaled index
Now I need to modify the assembler to handle this. Any differently-assembling instructions have "_" after their name now.
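A tiny self-contained sanity check of the update-arm mapping, using local bit numbering instead of the s0..s6 symbols from the real source, might look like this:

module ptr_nx_check;
    reg  [4:0] i;                                        // raw 5-bit index field
    wire [5:0] nx = {i[4], i[4] || ~|i[3:0], i[3:0]};    // same expression as ptr_nx's update arm
    integer k;
    initial
        for (k = 0; k < 32; k = k + 1) begin
            i = k[4:0];
            #1 $display("field %%%b -> %0d", i, $signed(nx));  // expect +16, +1..+15, -16..-1
        end
endmodule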
Do you have conditional defines, or include-file ability, in your assembler?
It would be nice to keep source code unchanged where opcodes have the same names but need to produce slightly different binaries.
A target name would then be defined as a constant/pragma somewhere, and users could quickly use Rev A test files on Rev B with just a rebuild.
You can do a quick kludge by pre-defining PrefixStr_P2_Rev_A or similar and then checking whether it ever appears in user source (even as a label).
That is then used as a switch; it's less flexible than a full conditional/pragma, but faster to get going...
Thanks, Chip!
I think there won't be differently-assembling instructions if:
(a) +index of +0 is always assembled to the static version
(b) The new index[5] bit is used as follows:
index = index[4:0]+16*index[5] if index[4]=0 or
index = index[4:0]-16*index[5] if index[4]=1
index[5] is not the msb of a 6-bit index
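One possible reading of those two formulas, sketched with placeholder names (this is not what was implemented):

// index[4] = 0:  value = index[4:0] + 16*index[5]   ->  0..+31
// index[4] = 1:  value = index[4:0] - 16*index[5]   -> -32..-1
// i.e. index[5] pushes the old 5-bit value a further 16 away from zero,
// so existing 5-bit encodings would assemble unchanged.
wire signed [6:0] value = {{2{index[4]}}, index[4:0]}
                        + (index[5] ? (index[4] ? -7'sd16 : 7'sd16) : 7'sd0);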
I've got it all implemented in the Verilog and assembler and I just finished testing it. It all checks out okay.
I am intrigued by what you are saying here, but I can't figure it out.
Here is what I'm doing now, and it's working fine. If the 'u'pdate bit is 1 and index = %00000, then the final index = %010000, or +16. So, for PTRx updates, 0 is never allowed; index values 0..31 go: +16, 1..15, -16..-1. There wasn't much to do in the assembler to make it happen, but there are some differences between the old and new: index %00000 on an update now means +16, and bit 5 is repurposed from post/pre to an extra index bit when there's no update. All this took only several gates to implement.
This was a very simple change. Almost nothing. And the LE count dropped, somehow.
It seems to be working just fine.
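For the assembler side, the encode direction is just as small; here is a sketch (not PNut source), with 'req' as a placeholder for the requested update index of -16..+16 (0 not allowed):

wire [4:0] field = (req == 6'sd16) ? 5'b00000     // +16 reuses the old "0" code
                                   : req[4:0];    // +1..+15 and -16..-1 encode as before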
Thanks, Chip. LUT addressing is much better now.
Right, I don't have conditional defines or includes in the assembler yet. Will someday.
Okay. Thanks.