Yeah, I think so now too. Originally, Chip had just said don't use a branch as the last instruction of the block. But, since this issue has cropped up a few times already, it's clearly going to keep reoccurring over the life of the prop2.
Another alternative to using an absolute branch is to compensate the REP offset by adding the block length to the relative branch distance.
Using GETPTR instead of GETCT should shave off two instructions, and it doesn't require STALLI and ALLOWI:
        rdfast  #0, a
        rep     #2, #0          ' loop 2 next instructions forever until we branch out
        rfbyte  x wcz
if_z_or_c jmp   #exit           ' leave loop
exit    getptr  b
        sub     b, a
        sub     b, #2           ' adjust length for extra rfbyte, done just before loop exits
P.S. my fault... xor a,b xor b,a xor a,b
getptr a
        ret                     ' now, on ret, a contains next string start address
                                ' b contains last string length, excluding end==0 or >127
                                ' c and z would reflect the value of next string first character,
                                ' useful to get each string length, from a list terminated by a null
                                ' string, which uses the other possible terminator char.
I hope I've got it right. I adapted it after evanh's warning about the extra pass caused by using a relative branch to exit the REP block, and extracted another useful result or two along the way...
"If you only have some sausage and bread, then be sure to make the best sausage sandwich you can."
Using GETPTR instead of GETCT should shave off two instructions, and it doesn't require STALLI and ALLOWI:
        rdfast  #0,a
        rep     #2,#0           ' loop 2 next instructions forever until we branch out
        rfbyte  x wcz
if_z_or_c jmp   #exit           ' leave loop
exit    getptr  b
        sub     b,a
        sub     b,#1            ' don't count trailing null byte
        ret
Is there any reason not to use _ret_ ?
It's possible to replace jmp with execf to jump with skipping, if the latter is ever needed.
After adapting the getptr method I found that there is some peculiar inconsistency.
This is my code:
code LEN ' ( str -- len )
        rdfast  #0, a
        rep     #2, #0          ' loop 2 next instructions forever until we branch out
        rfbyte  x wcz
{c|z}   ijnz    a,#l0           ' leave loop but compensate length calc
.l0     getptr  x
_ret_   subr    a, x            ' a = x-a
end
I then load this into cog memory and create an alias LEN that points to this code. Now I try it out on a string:
TAQOZ# " ABCDEFGHIJKLMNOPQRSTUVWXYZ" LEN . --- 26 ok
So I tried it on a 64kB block by first filling the block with a valid character and then terminating and checking it.
TAQOZ# $1.0000 $1.0000 'A' FILL --- ok
TAQOZ# 0 $2.0000 C! --- ok
TAQOZ# $1.FFF0 $20 DUMP ---
1FFF0: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 'AAAAAAAAAAAAAAAA'
20000: 00 5A 82 53 C3 A7 C7 A2 86 89 28 D8 E5 1E E1 E1 '.Z.S......(.....' ok
Then I run it expecting 65536
TAQOZ# $1.0000 LEN . --- 65537 ok
Timing wise it works out to 4 cycles per character of course with some overhead:
TAQOZ# $1.0000 LAP LEN LAP .LAP --- 262,208 cycles= 1,311,040ns @200MHz ok
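As a quick sanity check of those figures, here is a small Python sketch (variable names are illustrative, not from the thread). It assumes exactly 4 cycles for each of the 65536 filler characters, treats whatever is left of the reported count as fixed overhead, and converts the total to time at 200 MHz:

```python
CLOCK_HZ = 200_000_000      # 200 MHz, as printed by .LAP
CYCLES_PER_CHAR = 4         # per-character cost quoted in the post
CHARS_SCANNED = 0x10000     # 64 kB of 'A' ahead of the terminator

measured_cycles = 262_208   # cycle count printed by .LAP in the post

# Cycles spent in the loop, under the 4-cycles-per-character assumption
loop_cycles = CHARS_SCANNED * CYCLES_PER_CHAR
# Everything else (call, rdfast, terminator pass, return) is lumped together
overhead_cycles = measured_cycles - loop_cycles
# 1e9 ns per second divided by the clock rate gives 5 ns per cycle
elapsed_ns = measured_cycles * 1_000_000_000 // CLOCK_HZ

print(loop_cycles, overhead_cycles, elapsed_ns)
```

Under that assumption the loop accounts for all but 64 cycles of the measurement, and the nanosecond figure agrees with the one .LAP printed.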
edit: It seems to be to do with the memory area. I've dropped right down to 32 bytes terminated in $10000 and compared that against a 32 byte string.
TAQOZ# $1.0000 $40 DUMP ---
10000: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 'AAAAAAAAAAAAAAAA'
10010: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 'AAAAAAAAAAAAAAAA'
10020: 00 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 '.AAAAAAAAAAAAAAA'
10030: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 'AAAAAAAAAAAAAAAA' ok
TAQOZ# --- ok
TAQOZ# " ABCDEFGHIJKLMNOPQRSTUVWXYZ123456" LEN . --- 32 ok
TAQOZ# $1.0000 LEN . --- 33 ok
IJNZ is a clever way to save an instruction, which I tried to do this morning but failed. How about this:
code LEN ' ( str -- len )
        mov     b, #.l0
        rdfast  #0, a
        rep     #2, #0          ' loop 2 next instructions forever until we branch out
        rfbyte  x wcz
{c|z}   ijnz    a,b             ' leave loop but compensate length calc
.l0     getptr  x
_ret_   subr    a, x            ' a = x-a
end
code LEN ' ( str -- len )
        rdfast  #0, a
        add     a, #1           ' compensate length calc
        rep     #2, #0          ' loop 2 next instructions forever until we branch out
        rfbyte  x wcz
{c|z}   jmp     #\.l0           ' leave loop
.l0     getptr  x
_ret_   subr    a, x            ' a = x-a
end
code LEN ' ( str -- len )
        rdfast  #0, a
        rep     #2, #0          ' loop 2 next instructions forever until we branch out
        rfbyte  x wcz
{c|z}   ijnz    a,#l0+2         ' leave loop but compensate length calc
.l0     getptr  x
_ret_   subr    a, x            ' a = x-a
end
Could a new REP loop break out of an existing REP loop, like this?
code LEN ' ( str -- len )
        rdfast  #0, a
        rep     #2, #0          ' loop 2 next instructions forever until we branch out
        rfbyte  x wcz
{c|z}   rep     #1, #1
        getptr  x
_ret_   subr    a, x            ' a = x-a
end
That code is a bit different from what I have: it is missing the conditional I had there, which holds off the second REP until the final character (0 or >127) that breaks out of the loop.
Just a quick check again this morning before I rush out, but using the same routine I get varying results. With a 32 character string it will report a length of 33 at times depending upon the memory area.
See if you can follow what I'm doing (haven't enough time to comment this, but $! will copy a string at an address to an address):
TAQOZ# $1.0000 PRINT$ --- ABCDEFGHIJKLMNOPQRSTUVWXYZ123456 ok
TAQOZ# $1.0000 LEN . --- 32 ok
TAQOZ# $1.0000 $30 DUMP ---
10000: 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F 50 'ABCDEFGHIJKLMNOP'
10010: 51 52 53 54 55 56 57 58 59 5A 31 32 33 34 35 36 'QRSTUVWXYZ123456'
10020: 00 D8 CA 0B B0 34 F1 C3 02 B6 40 9C 4A 45 50 63 '.....4....@.JEPc' ok
TAQOZ# $1.0000 $7.8000 $! --- ok
TAQOZ# $7.8000 PRINT$ --- ABCDEFGHIJKLMNOPQRSTUVWXYZ123456 ok
TAQOZ# $7.8000 LEN . --- 32 ok
TAQOZ# $7.0000 $1000 'A' FILL --- ok
TAQOZ# 0 $7.0020 C! --- ok
TAQOZ# $7.0000 LEN . --- 33 ok
TAQOZ# $7.0000 $30 DUMP ---
70000: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 'AAAAAAAAAAAAAAAA'
70010: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 'AAAAAAAAAAAAAAAA'
70020: 00 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 '.AAAAAAAAAAAAAAA' ok
TAQOZ# $7.8000 $7.0000 $! --- ok
TAQOZ# $7.0000 $30 DUMP ---
70000: 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F 50 'ABCDEFGHIJKLMNOP'
70010: 51 52 53 54 55 56 57 58 59 5A 31 32 33 34 35 36 'QRSTUVWXYZ123456'
70020: 00 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 '.AAAAAAAAAAAAAAA' ok
TAQOZ# $7.0000 LEN . --- 33 ok
TAQOZ# $7.0000 $7.1080 $! --- ok
TAQOZ# $7.1080 LEN . --- 33 ok
TAQOZ# $7.8000 LEN . --- 32 ok
TAQOZ# $7.8000 $30 DUMP ---
78000: 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F 50 'ABCDEFGHIJKLMNOP'
78010: 51 52 53 54 55 56 57 58 59 5A 31 32 33 34 35 36 'QRSTUVWXYZ123456'
78020: 00 C8 6A 0C 58 C6 41 C7 00 B2 31 02 1E DD 96 C6 '..j.X.A...1.....' ok
That code is a bit different from what I have: it is missing the conditional I had there, which holds off the second REP until the final character (0 or >127) that breaks out of the loop.
It's certainly not going to do what you want.
Ok it sounds like the whole end of REP loop thing is going to be an issue to deal with on exiting loops.
Three options:
- Add a compensating block length offset to the relative immediate.
- or, use register direct, which is always an absolute address.
- or, use an instruction that can encode absolute immediate, like JMP.
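The first option comes down to a little address arithmetic. The Python sketch below is a model only, not P2 code. It assumes, as this thread describes, that a relative branch in the last slot of a REP block has its offset applied after the program counter has already wrapped back to the start of the block. The function and parameter names are hypothetical, and addresses are counted in instructions, as in cog exec.

```python
def relative_target(branch_addr, offset, rep_start=None, rep_len=0):
    """Model where a taken relative branch lands.

    Normally the offset is applied to the address of the next
    instruction. ASSUMPTION (taken from the thread, not a datasheet):
    if the branch is the last instruction of an active REP block, the
    "next" address has already wrapped to the start of the block.
    """
    next_addr = branch_addr + 1
    in_last_rep_slot = (
        rep_start is not None and branch_addr == rep_start + rep_len - 1
    )
    if in_last_rep_slot:
        next_addr = rep_start
    return next_addr + offset


# Layout of the LEN routine, in instruction units counted from rdfast:
#   0 rdfast, 1 rep #2,#0, 2 rfbyte, 3 ijnz, 4 .l0 getptr, 5 subr
REP_START, REP_LEN, BRANCH, L0 = 2, 2, 3, 4

plain_offset = L0 - (BRANCH + 1)   # what a plain relative "#l0" encodes: 0
naive = relative_target(BRANCH, plain_offset, REP_START, REP_LEN)
fixed = relative_target(BRANCH, plain_offset + REP_LEN, REP_START, REP_LEN)

print(naive, fixed)
```

In this model the uncompensated branch lands on the rfbyte at the top of the block, while adding the block length (2 here) lands on .l0. The other two options avoid the problem because they never use the wrapped address as a base.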
Nobody has commented on the weird discrepancies I'm finding. Testing with this routine:
TAQOZ# code LEN ' ( str -- len )
071E0 FC78_0022 rdfast #0, a
071E4 FCDC_0400 rep #2, #0 ' loop 2 next instructions forever until we branch out
071E8 FD78_1610 rfbyte x wcz
071EC EB8C_45FF {c|z} ijnz a,#l0 ' leave loop but compensate length calc
071F0 FD60_1634 .l0 getptr x
071F4 02C0_440B _ret_ subr a, x ' a = x-a'
end --- ok
(The ijnz forward reference lists before it is resolved but it's fine: TAQOZ# $71EC @ .L --- $EB8C_4400 ok)
Now I store a 32 character string and check it out:
TAQOZ# " AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA" $1.0000 $! --- ok
TAQOZ# $1.0000 $30 DUMP ---
10000: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 'AAAAAAAAAAAAAAAA'
10010: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 'AAAAAAAAAAAAAAAA'
10020: 00 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 '.AAAAAAAAAAAAAAA' ok
How long is the 32 character string?
TAQOZ# $1.0000 LEN . --- 33 ok
Wrong!
Try terminating the terminator and see what happens:
TAQOZ# $A4 $1.0021 C! --- ok
TAQOZ# $1.0000 LEN . --- 32 ok
TAQOZ# 0 $1.0021 C! --- ok
TAQOZ# $1.0000 LEN . --- 32 ok
TAQOZ# $41 $1.0021 C! --- ok
TAQOZ# $1.0000 LEN . --- 33 ok
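One way to reason about these results is a small model of the routine. The Python sketch below is not P2 code. It assumes the behaviour described elsewhere in this thread: the taken relative IJNZ in the last REP slot lands back on the RFBYTE for one extra pass, now outside the REP, instead of on .l0. All names are illustrative.

```python
def is_terminator(byte):
    # The {c|z} condition in the routine: byte is 0 or greater than 127
    return byte == 0 or byte > 127


def len_model(mem, start, compensated):
    """Model of LEN: GETPTR value minus the (incremented) start address."""
    a = start        # register a: string address, bumped by each taken IJNZ
    ptr = start      # FIFO read pointer as later reported by GETPTR

    # REP block: rfbyte, then the conditional ijnz
    while True:
        byte = mem[ptr]
        ptr += 1
        if is_terminator(byte):
            a += 1   # ijnz increments a and branches out
            break

    if not compensated:
        # ASSUMED fault: the branch lands on rfbyte again, REP now inactive
        byte = mem[ptr]
        ptr += 1
        if is_terminator(byte):
            a += 1   # second ijnz, this time reaching .l0 as intended

    return ptr - a   # subr a, x


def make_mem(after_terminator):
    # 32 'A' characters, a null terminator, then the byte under test
    return bytes([0x41] * 32 + [0x00, after_terminator] + [0x41] * 14)


for nxt in (0xA4, 0x00, 0x41):
    mem = make_mem(nxt)
    print(hex(nxt), len_model(mem, 0, False), len_model(mem, 0, True))
```

Under these assumptions the uncompensated model reports 32 when the byte after the terminator is itself a terminator ($A4 or 0) and 33 when it is $41, which is the same pattern as the console session above. The compensated model reports 32 in all three cases.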
When that works, you can try the unrolled speedups if you care to spare the extra long(s). In theory it could boost performance by up to 1.6x for really long strings with 4 rfbyte unrolls vs single rfbyte in the loop. Only tiny strings are faster without unrolling.
I charted the following cases for the speedup gain with 2,3,4 rfbyte unrolls including the execution time of this subroutine itself (including ret and call overhead etc, plus an assumption of the average rdfast execution time of 15.5 clocks) and got this result.
The reality is that strings are normally very short, but the other place I would use this is to scan a 512-byte sector buffer for a terminator. The easy peasy 6 cycle method is fine for this, but still I want to see what's going on.
@evanh - I can't see how your +2 on the ijnz helps and it definitely locks up when I try it.
code LEN ' ( str -- len )
        rdfast  #0, a
        rep     #2, #0          ' loop 2 next instructions forever until we branch out
        rfbyte  x wcz
{c|z}   ijnz    a,#l0+2         ' leave loop but compensate length calc
.l0     getptr  x
_ret_   subr    a, x            ' a = x-a
end
You are essentially saying that it should jump to just after the last instruction, whatever that is.
Oh, is that code hubexec? Maybe it needs to be +8 instead of +2. Hmm, that'll suck having to customise ...
It's compiled in the hub memory but I copy it to cog for this test and run it as there is no way you can use rdfast otherwise, even if rep limps along.
Everybody has these variations. I haven't had time to check them all.
If you don't like having to adjust the IJNZ jump, here's one I posted earlier that adds two cycles overall:
code LEN ' ( str -- len )
        rdfast  #0, a
        add     a, #1           ' compensate length calc
        rep     #2, #0          ' loop 2 next instructions forever until we branch out
        rfbyte  x wcz
{c|z}   jmp     #\.l0           ' leave loop
.l0     getptr  x
_ret_   subr    a, x            ' a = x-a
end
I noticed that some of the code pasted above uses the digit "1" followed by zero instead of a lower-case letter "L" followed by zero for the label. Be careful what you type into the code. It could be crashing if you are jumping to #10 (ten) instead of "L" zero.
http://forums.parallax.com/discussion/comment/1491515/#Comment_1491515
MOV adds a long that IJNZ removes, though.
EDIT: Ah, eek, it does something extra with the second REP. It seems to concatenate the two REPs together. You get a mix of both.
Wha?
misc2 increments 4 times
misc3 increments 3 times
        ...
        rep     #3, #0
        add     misc2, #1
        nop
        rep     #1, #3
        add     misc3, #1
        nop
        nop
        nop
        ...
I wonder if using
if_c_or_z REP #0, #1
would behave any differently?
The above examples that Tony and I posted should all work correctly: https://forums.parallax.com/discussion/comment/1491558/#Comment_1491558
Weird? Right?
Fix your code as per the discussion, then try again.
The +2 should be working. Does the missing dot from the label name matter? Fastspin certainly complains about that.