Yeah, I think so now too. Originally, Chip had just said not to use a branch as the last instruction of the block. But since this issue has cropped up a few times already, it's clearly going to keep recurring over the life of the Prop2.
Another alternative to using an absolute branch is to compensate the REP offset by adding the block length to the relative branch distance.
Using GETPTR instead of GETCT should shave off two instructions, and it doesn't require STALLI and ALLOWI:
rdfast #0, a
rep #2, #0 ' loop 2 next instructions forever until we branch out
rfbyte x wcz
if_z_or_c jmp #exit ' leave loop
exit getptr b
sub b,a
sub b, #2 ' adjust length for extra rfbyte, done just before loop exits
P.S. my fault... xor a,b xor b,a xor a,b
getptr a
ret ' now, on ret, a contains next string start address
' b contains last string length, excluding end==0 or >127
' c and z would reflect the value of next string first character,
' useful to get each string length, from a list terminated by a null
' string, which uses the other possible terminator char.
I hope I've got it right after adapting it, following evanh's warning about the extra pass caused by using a relative branch to exit the REP block, and extracting another useful result or two along the way.
"If you only have some sausage and bread, then be sure to make the best sausage sandwich you can."
Using GETPTR instead of GETCT should shave off two instructions, and it doesn't require STALLI and ALLOWI:
rdfast #0,a
rep #2,#0 ' loop 2 next instructions forever until we branch out
rfbyte x wcz
if_z_or_c jmp #exit ' leave loop
exit getptr b
sub b,a
sub b,#1 ' don't count trailing null byte
ret
Is there any reason not to use _ret_ ?
It's possible to replace jmp with execf to jump with skipping, if the latter is ever needed.
After adapting the getptr method I found that there is some peculiar inconsistency.
This is my code:
code LEN ' ( str -- len )
rdfast #0, a
rep #2, #0 ' loop 2 next instructions forever until we branch out
rfbyte x wcz
{c|z} ijnz a,#l0 ' leave loop but compensate length calc
.l0 getptr x
_ret_ subr a, x ' a = x-a'
end
I then load this into cog memory and create an alias LEN that points to this code. Now I try it out on a string:
TAQOZ# " ABCDEFGHIJKLMNOPQRSTUVWXYZ" LEN . --- 26 ok
So I tried it on a 64kB block by first filling the block with a valid character and then terminating and checking it.
IJNZ is a clever way to save an instruction, which I tried to do this morning but failed. How about this:
code LEN ' ( str -- len )
mov b, #.l0
rdfast #0, a
rep #2, #0 ' loop 2 next instructions forever until we branch out
rfbyte x wcz
{c|z} ijnz a,b ' leave loop but compensate length calc
.l0 getptr x
_ret_ subr a, x ' a = x-a'
end
code LEN ' ( str -- len )
rdfast #0, a
add a, #1 ' compensate length calc
rep #2, #0 ' loop 2 next instructions forever until we branch out
rfbyte x wcz
{c|z} jmp #\.l0 ' leave loop
.l0 getptr x
_ret_ subr a, x ' a = x-a'
end
code LEN ' ( str -- len )
rdfast #0, a
rep #2, #0 ' loop 2 next instructions forever until we branch out
rfbyte x wcz
{c|z} ijnz a,#l0+2 ' leave loop but compensate length calc
.l0 getptr x
_ret_ subr a, x ' a = x-a'
end
Could a new REP loop break out of an existing REP loop, like this?
code LEN ' ( str -- len )
rdfast #0, a
rep #2, #0 ' loop 2 next instructions forever until we branch out
rfbyte x wcz
{c|z} rep #1, #1
getptr x
_ret_ subr a, x ' a = x-a'
end
That code is a bit different from what I have: it is missing the conditional I had there to hold off the second REP until the final character, the one matching 0 or >127, breaks out.
Just a quick check again this morning before I rush out, but using the same routine I get varying results. With a 32 character string it will report a length of 33 at times depending upon the memory area.
See if you can follow what I'm doing (haven't enough time to comment this, but $! will copy a string at an address to an address):
It's certainly not going to do what you want.
OK, it sounds like the whole end-of-REP-loop thing is going to be an issue to deal with when exiting loops.
Three options:
- Add a compensating block length offset to the relative immediate.
- or, use register direct, which is always an absolute address.
- or, use an instruction that can encode absolute immediate, like JMP.
Nobody has commented on the weird discrepancies I'm finding. Testing with this routine:
TAQOZ# code LEN ' ( str -- len )
071E0 FC78_0022 rdfast #0, a
071E4 FCDC_0400 rep #2, #0 ' loop 2 next instructions forever until we branch out
071E8 FD78_1610 rfbyte x wcz
071EC EB8C_45FF {c|z} ijnz a,#l0 ' leave loop but compensate length calc
071F0 FD60_1634 .l0 getptr x
071F4 02C0_440B _ret_ subr a, x ' a = x-a'
end --- ok
(The ijnz forward reference lists before it is resolved but it's fine: TAQOZ# $71EC @ .L --- $EB8C_4400 ok)
Now I store a 32 character string and check it out:
Try terminating the terminator and see what happens:
TAQOZ# $A4 $1.0021 C! --- ok
TAQOZ# $1.0000 LEN . --- 32 ok
TAQOZ# 0 $1.0021 C! --- ok
TAQOZ# $1.0000 LEN . --- 32 ok
TAQOZ# $41 $1.0021 C! --- ok
TAQOZ# $1.0000 LEN . --- 33 ok
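Those three results (32, 32, 33) are what the "extra pass" explanation from earlier in the thread would predict. The Python sketch below is only a toy model, not documented P2 behaviour: it assumes that the uncompensated relative branch in the last slot of the REP block lands back on the RFBYTE, so one more byte is fetched and the conditional IJNZ gets a second chance to increment `a`. The names `model_len` and `make_buffer` are made up for illustration.

```python
def model_len(mem, start, compensated):
    """Toy model of the RFBYTE / conditional-IJNZ string length loop."""

    def is_terminator(byte):
        # rfbyte x wcz: Z is set for 0, C is set for bytes above 127
        return byte == 0 or byte > 127

    ptr = start  # FIFO read pointer, i.e. what GETPTR would report
    a = start    # register holding the string start address

    # REP block: fetch bytes until a terminator, then IJNZ bumps a and branches
    while True:
        byte = mem[ptr]
        ptr += 1
        if is_terminator(byte):
            a += 1
            break

    if not compensated:
        # ASSUMPTION: the uncompensated branch re-enters the block once,
        # so RFBYTE and the conditional IJNZ each run one more time
        byte = mem[ptr]
        ptr += 1
        if is_terminator(byte):
            a += 1

    return ptr - a  # subr a, x


def make_buffer(next_byte):
    # 32 'A' characters, a null terminator, then the byte poked at $1.0021
    return bytes([0x41] * 32 + [0x00, next_byte])


for poked in (0xA4, 0x00, 0x41):
    buf = make_buffer(poked)
    print(hex(poked), model_len(buf, 0, False), model_len(buf, 0, True))
```

In this model the uncompensated loop only reports the right length when the byte after the terminator also happens to be a terminator, while the compensated branch reports 32 in all three cases.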
When that works, you can try the unrolled speedups if you care to spare the extra long(s). In theory it could boost performance by up to 1.6x for really long strings with 4 rfbyte unrolls vs single rfbyte in the loop. Only tiny strings are faster without unrolling.
I charted the following cases for the speedup gain with 2,3,4 rfbyte unrolls including the execution time of this subroutine itself (including ret and call overhead etc, plus an assumption of the average rdfast execution time of 15.5 clocks) and got this result.
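The 1.6x figure can be reproduced with a rough cost model. This Python sketch is not the charted data: it assumes each RFBYTE costs 2 clocks and that one 2-clock exit test is shared by every unrolled group of RFBYTEs, which matches the 4 clocks per character seen with the simple loop. The function names are hypothetical, and the call/ret overhead must be supplied by the caller because the thread does not give a number for it.

```python
import math


def cycles_per_char(unroll, rfbyte_clocks=2, test_clocks=2):
    # ASSUMPTION: one exit test is shared by `unroll` RFBYTEs
    return (unroll * rfbyte_clocks + test_clocks) / unroll


def asymptotic_speedup(unroll):
    # Speedup over the single-RFBYTE loop for very long strings
    return cycles_per_char(1) / cycles_per_char(unroll)


def estimated_clocks(length, unroll, call_ret_clocks, rdfast_clocks=15.5):
    # The terminator is read too, and reads happen in whole groups
    groups = math.ceil((length + 1) / unroll)
    loop_clocks = groups * unroll * cycles_per_char(unroll)
    return call_ret_clocks + rdfast_clocks + loop_clocks


for unroll in (2, 3, 4):
    print(unroll, round(asymptotic_speedup(unroll), 2))
```

Under these assumptions the gain approaches 1.6x with four unrolls, and a very short string comes out slower unrolled because a whole group is always read.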
The reality is that strings are normally very short, but the other place I would use this is to scan a 512-byte sector buffer for a terminator. The easy peasy 6 cycle method is fine for this, but still I want to see what's going on.
@evanh - I can't see how your +2 on the ijnz helps and it definitely locks up when I try it.
code LEN ' ( str -- len )
rdfast #0, a
rep #2, #0 ' loop 2 next instructions forever until we branch out
rfbyte x wcz
{c|z} ijnz a,#l0+2 ' leave loop but compensate length calc
.l0 getptr x
_ret_ subr a, x ' a = x-a'
end
You are essentially saying that it should jump to just after the last instruction, whatever that is.
Oh, is that code hubexec? Maybe it needs to be +8 instead of +2. Hmm, that'll suck having to customise ...
It's compiled in the hub memory but I copy it to cog for this test and run it as there is no way you can use rdfast otherwise, even if rep limps along.
Everybody has these variations. I haven't had time to check them all.
If you don't like having to adjust the IJNZ jump, here's one I posted earlier that adds two cycles overall:
code LEN ' ( str -- len )
rdfast #0, a
add a, #1 ' compensate length calc
rep #2, #0 ' loop 2 next instructions forever until we branch out
rfbyte x wcz
{c|z} jmp #\.l0 ' leave loop
.l0 getptr x
_ret_ subr a, x ' a = x-a'
end
I noticed that some of the pasted code above uses the number "1" followed by zero instead of a lower case letter "L" followed by zero for the label. Be careful what you type into the code. It could be crashing if you are doing a jump to #10 (ten) instead of "L" zero.
Comments
So I tried it on a 64kB block by first filling the block with a valid character and then terminating and checking it. Then I ran it, expecting 65536.
Timing wise it works out to 4 cycles per character of course with some overhead:
edit: It seems to be to do with the memory area. I've dropped right down to 32 bytes terminated in $10000 and compared that against a 32 byte string.
http://forums.parallax.com/discussion/comment/1491515/#Comment_1491515
IJNZ is a clever way to save an instruction, which I tried to do this morning but failed. How about this:
MOV adds a long that IJNZ removes, though.
EDIT: Ah, eek, it does something extra with the second REP. It seems to concatenate the two REPs together. You get a mix of both.
Wha?
misc2 increments 4 times
misc3 increments 3 times
I wonder if using
would behave any differently?
The above examples that Tony and myself posted should all work correctly - https://forums.parallax.com/discussion/comment/1491558/#Comment_1491558
Now I store a 32 character string and check it out: How long is the 32 character string? Wrong!
Try terminating the terminator and see what happens: Weird? Right?
Fix your code as per the discussion, then try again.
@evanh - I can't see how your +2 on the ijnz helps and it definitely locks up when I try it.
You are essentially saying that it should jump to just after the last instruction, whatever that is.
The +2 should be working. Does the missing dot from the label name matter? Fastspin certainly complains about that.