CALL problem with PNut binaries

bradzone · 2020-01-20 16:45

I’ve been transitioning from PNut (33p) to FlexGui/fastspin (4.0.6) and ran into a problem: a CALLD instruction did not work as I expected in PNut-created binaries, while the FlexGui binary worked fine. Comparing the two binaries, I noticed that PNut and FlexGui produce different machine code for the same CALLD instruction. Both use the #rel20 encoding, but the PC offset generated by PNut is in longs instead of bytes. As a result, when the CALLD instruction is executed, the PC is not set to the target label’s address (see code below).

PNut and FlexGui also differ in their handling of the wc/wz/wcz effects on CALLD’s #rel9 encoding format. PNut generates an “effect is not allowed” error when a wc, wz, or wcz effect is requested, while FlexGui does not. From the documentation, the new c and z values are taken from S[31] and S[30] where the S value would be PC + rel9. When I ran the FlexGui binary the c and z flags were set to zero when the wcz effect was requested (see code below). It doesn’t seem that the S[31:30] bits can actually be set using a 9-bit relative offset.

As a relative newcomer, I would appreciate it if those with more experience could confirm this CALLD behavior. Also, does anyone know the reasoning for using the #rel9 encoding for short branches? For relative CALLD instructions, it seems that always using the #rel20 encoding when the D value is PA, PB, PTRA, or PTRB would avoid the confusion over what to do with the wc, wz, and wcz effects.

' Demo for CALLD differences under fastspin and PNut
'  fastspin: all tests pass
'  PNut: change commented instructions to assemble/pass tests
'   removes wcz effect and changes expected values
' LEDs 56-58: test results (on == pass)
'  56: rel9 encoding with negative value
'  57: rel9 encoding with positive value
'  58: rel20 encoding with positive value
DAT
	org
	hubset 	#0			' use RCFAST (20Mhz)
	setq	#2			' move target code for rel20 branches
	rdlong	cog_r20,##@cog_r20
main
	call	#test_cog
.L1	waitx	##2_500_000
	jmp	#.L1
	cogstop	#0

cog_r9m	mov	arg00,#1		' CALLD target: negative rel9 value
	rczl	arg00 wcz
	jmp	PA
test_cog
	drvl	#56			' LEDs 56-58 for test - assume success
	drvl	#57			'  for PNut change commented instructions
	drvl	#58
	modcz	_SET,_SET wcz
	calld	PA,#cog_r9m wcz		' fastspin: use wcz
	cmp	arg00,#4 wz		' fastspin: success
'	calld	PA,#cog_r9m		' PNut: avoid effect is not allowed error
'	cmp	arg00,#7 wz		' PNut: w/o wcz, arg00 is #7
 if_nz	drvh	#56	
	modcz	_SET,_SET wcz
	calld	PA,#cog_r9 wcz		' fastspin: use wcz
	cmp	arg00,#8 wz		' fastspin: success
'	calld	PA,#cog_r9		' PNut: avoid effect is not allowed error
'	cmp	arg00,#11 wz		' PNut: w/o wcz, arg00 is #11
 if_nz	drvh	#57	
	modcz	_CLR,_CLR wcz
	calld	PA,#cog_r20		' fastspin: FE100400; PNut: FE100100
'	long	$FE100100		'   fastspin: emulate PNut behavior
	cmp	arg00,#$10 wz		' fastspin arg00 == $10
'	cmp	arg00,#$20 wz		' PNut: arg00 == $20
 if_nz	drvh	#58	
	ret	

cog_r9	mov	arg00,#2		' CALLD target: positive rel9 target
	rczl	arg00
	jmp	PA
	long	0[58]			' array of NOPs
cog_r20A				' CALLD target for PNut FE100100 encoding
	mov	arg00,#8
	rczl	arg00
	jmp	PA
result1
arg00	long	0
arg01	long	0
_pa	long	0
	org	cog_r9+$100-3
cog_r20	mov	arg00,#4		' CALLD target: rel20 value
	rczl	arg00
	jmp	PA
cog_code_end

AJL · 2020-01-20 21:39

It looks like CALLD PA could be encoded two different ways depending on the decisions made by the assembler when parsing the assembly code:

1. If the parser looks for PA/PB/PTRA/PTRB first and if found always encodes for rel20 then the wc, wz, and wcz effects would never be allowed.
2. If the parser sees these as symbols for registers $1F6 to $1F9, and then chooses the encoding based on the reach of the target, you'll get rel9 with effects allowed.

This is because the syntax doesn't specifically disambiguate these cases.

The effects can only clear the flags with rel9, as bits 31 and 30 of the source are clear.

I'd suggest that PNut is calculating the rel20 offset in longs because this is cog code (long addressed).
The routine you use to relocate the rel20 code is reading from hub rather than cog, so is that having an effect?

I can't understand why rczl should shift arg00 by a different amount in cog_r20 depending on the assembler; surely the result for both should be $20?

I agree that for PA/PB/PTRA/PTRB defaulting to rel20 (as PNut does) makes sense.

bradzone · 2020-01-21 15:14

AJL wrote: »

It looks like CALLD PA could be encoded two different ways depending on the decisions made by the assembler when parsing the assembly code:

1. If the parser looks for PA/PB/PTRA/PTRB first and if found always encodes for rel20 then the wc, wz, and wcz effects would never be allowed.
2. If the parser sees these as symbols for registers $1F6 to $1F9, and then chooses the encoding based on the reach of the target, you'll get rel9 with effects allowed.

This is because the syntax doesn't specifically disambiguate these cases.

The effects can only clear the flags with rel9, as bits 31 and 30 of the source are clear.

Agreed. That's what I think.

I'd suggest that PNut is calculating the rel20 offset in longs because this is cog code (long addressed).
The routine you use to relocate the rel20 code is reading from hub rather than cog, so is that having an effect?

I'm confident that relocating the rel20 code by reading a copy from the hub is not having an effect on the test. I only posted a small portion of the code from my testing. I also tested and got the same behavior when the code was executing from the lut or hub. Also, the same machine instructions are generated by both assemblers for the CALLD PA,#rel20 instruction (with spinsim offset in bytes and PNut offset in longs). The code I posted for fastspin is consistent with Chip’s revB documentation and shows that the revB hardware expects the #rel20 offset to be in bytes. One oddity is that the revB document highlights the CALLD instruction as changed for revB but I couldn't find any differences.

I can't understand why rczl should shift arg00 by a different amount in cog_r20 depending on the assembler; surely the result for both should be $20?

The rczl instruction shifts arg00 by 2 and inserts the current CZ values in bits [1:0]. I used the rczl instruction to capture the state of the CZ flags after a CALLD instruction using the wcz effect. The cog_r20 subroutine is not shifting the arg00 by a different amount. For PNut created binaries, the $20 arg00 value results because the cog_r20 subroutine (mov arg00,#4) is not being invoked. Instead, based on the relative offset in longs, CALLD sets the PC to cog_r20A subroutine (mov arg00,#8).
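
To make the address arithmetic concrete, here is a rough Python sketch (not part of my test program). It assumes cog addresses counted in longs relative to the CALLD itself, assumes the hardware reads the rel20 field as a byte offset from the already-incremented PC, and uses made-up helper names (landing, rczl) purely for illustration.

```python
# Cog addresses in longs, measured from the "calld PA,#cog_r20" instruction.
CALLD = 0
PC_NEXT = CALLD + 1            # PC already points past the CALLD
COG_R9 = CALLD + 4             # cmp, drvh and ret sit between CALLD and cog_r9
COG_R20A = COG_R9 + 3 + 58     # 3 instructions plus "long 0[58]"
COG_R20 = COG_R9 + 0x100 - 3   # from "org cog_r9+$100-3"

def landing(field):
    """Cog address reached when the rel20 field is taken as a byte offset."""
    return PC_NEXT + field // 4

def rczl(value, c, z):
    """Model of RCZL: shift left by 2, C goes into bit 1 and Z into bit 0."""
    return ((value << 2) | (c << 1) | z) & 0xFFFFFFFF

fastspin_target = landing(0x400)   # field from FE100400
pnut_target = landing(0x100)       # field from FE100100

fastspin_arg00 = rczl(4, 0, 0)     # cog_r20 runs "mov arg00,#4"
pnut_arg00 = rczl(8, 0, 0)         # cog_r20A runs "mov arg00,#8"
```

Under those assumptions the fastspin field lands on cog_r20 and the PNut field lands on cog_r20A, which matches the $10 versus $20 values the test compares against.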

I agree that for PA/PB/PTRA/PTRB defaulting to rel20 (as PNut does) makes sense.

PNut does not default to rel20 and also uses the rel9 encoding for short branches. I think using longs instead of bytes for the rel20 offset is a serious bug in PNut that needs to be fixed. As a developer, this bug is particularly pernicious because your code works for a long time until you add some code that moves the target subroutine beyond 256 longs (the limit for the rel9 format) and all hell breaks loose.

To demonstrate the PNut behavior, I’ve attached a file with my code modified to compile under PNut. This version compiles under both fastspin and PNut. LED 58 is off under fastspin (cog_r20 invoked), but on for PNut (cog_r20A invoked).

AJL · 2020-01-21 21:49

bradzone wrote: »

AJL wrote: »

It looks like CALLD PA could be encoded two different ways depending on the decisions made by the assembler when parsing the assembly code:

1. If the parser looks for PA/PB/PTRA/PTRB first and if found always encodes for rel20 then the wc, wz, and wcz effects would never be allowed.
2. If the parser sees these as symbols for registers $1F6 to $1F9, and then chooses the encoding based on the reach of the target, you'll get rel9 with effects allowed.

This is because the syntax doesn't specifically disambiguate these cases.

The effects can only clear the flags with rel9, as bits 31 and 30 of the source are clear.

Agreed. That's what I think.

It might not be what was originally intended, as the docs say that the 9-bit immediate is sign-extended, which could set those bits if the extension was to bit 31 on a backward branch. Your results suggest that the sign-extension is only to bit 19 though.

bradzone wrote: »

I'd suggest that PNut is calculating the rel20 offset in longs because this is cog code (long addressed).
The routine you use to relocate the rel20 code is reading from hub rather than cog, so is that having an effect?

I'm confident that relocating the rel20 code by reading a copy from the hub is not having an effect on the test. I only posted a small portion of the code from my testing. I also tested and got the same behavior when the code was executing from the lut or hub. Also, the same machine instructions are generated by both assemblers for the CALLD PA,#rel20 instruction (with spinsim offset in bytes and PNut offset in longs). The code I posted for fastspin is consistent with Chip’s revB documentation and shows that the revB hardware expects the #rel20 offset to be in bytes. One oddity is that the revB document highlights the CALLD instruction as changed for revB but I couldn't find any differences.

I'm not seeing the indication that CALLD changed for revB. I might have missed it in my searching, so could you point me to it?

bradzone wrote: »

I can't understand why rczl should shift arg00 by a different amount in cog_r20 depending on the assembler; surely the result for both should be $20?

The rczl instruction shifts arg00 by 2 and inserts the current CZ values in bits [1:0]. I used the rczl instruction to capture the state of the CZ flags after a CALLD instruction using the wcz effect. The cog_r20 subroutine is not shifting the arg00 by a different amount. For PNut created binaries, the $20 arg00 value results because the cog_r20 subroutine (mov arg00,#4) is not being invoked. Instead, based on the relative offset in longs, CALLD sets the PC to cog_r20A subroutine (mov arg00,#8).

So, PEBCAK: I didn't follow your code properly.

bradzone wrote: »

I agree that for PA/PB/PTRA/PTRB defaulting to rel20 (as PNut does) makes sense.

PNut does not default to rel20 and also uses the rel9 encoding for short branches. I think using longs instead of bytes for the rel20 offset is a serious bug in PNut that needs to be fixed. As a developer, this bug is particularly pernicious because your code works for a long time until you add some code that moves the target subroutine beyond 256 longs (the limit for the rel9 format) and all hell breaks loose.

To demonstrate the PNut behavior, I’ve attached a file with my code modified to compile under PNut. This version compiles under both fastspin and PNut. LED 58 is off under fastspin (cog_r20 invoked), but on for PNut (cog_r20A invoked).

Ok, I now understand better what you were saying.
So PNut's error message should really be a warning 'effect will clear relevant flag' or similar.

Rereading the Prop2 docs more carefully, the wording suggests that the relative offset is always meant to be in instructions for rel9, while from the examples given rel20 offsets should always be in bytes:

The cases below illustrate use of the 20-bit immediate-address instructions and "\" and "@":

        ORGH    $01000
        ORG     0       'cog code

cog     JMP     #cog    '$FD9FFFFC      cog to cog, relative

I'd suggest this is to account for the fact that hubexec code needn't be long aligned.
It also suggests that with rel9, if you have short data segments inline with your hubexec code you might need to ensure instructions before and after the data blocks fall on the same byte offset from long-alignment if you use a rel9 branch to skip over the block, while it seems rel20 could handle any difference.

ersmith · 2020-01-21 22:00

@cgracey, I think @bradzone has found a real PNut bug, at least in PNut v33m. Here's another program to illustrate it:

DAT
	org 0
	jmp	#go_to_hub
	orgh $800
go_to_hub
	calld	ptra, #blinky
	waitx	##10_000_000
	jmp	#go_to_hub
	orgh $1800
blinky
	drvnot	#56
	jmp	ptra

This should blink pin 56, and indeed it does if I compile it with fastspin, or if I remove the second "orgh". With the second orgh in, and compiled with PNut v33m it does nothing. The difference is in the encoding of the "calld" instruction. fastspin outputs:

00800     FC 0F 50 FE | 	calld	ptra, #blinky

but it looks like PNut 33m outputs

00800- FF 03 50 FE 4B 4C 80 FF 1F 00 65 FD F0 FF 9F FD   '..P.KL....e.....'
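
The two encodings can be reproduced with a little arithmetic. This is only a Python sketch under my own assumptions: the bit layout is inferred from the FE1xxxxx and FE5xxxxx words seen in this thread, the offset is taken from the address just after the CALLD, and calld_rel20 is a made-up helper name rather than anything from either assembler.

```python
import struct

# Register select bits for CALLD PA/PB/PTRA/PTRB,#rel20 (assumed layout).
WW = {"PA": 0b00, "PB": 0b01, "PTRA": 0b10, "PTRB": 0b11}

def calld_rel20(reg, field):
    """Build a relative CALLD word: EEEE=1111, 11100, WW, R=1, 20-bit field."""
    return ((0xF << 28) | (0b11100 << 23) | (WW[reg] << 21)
            | (1 << 20) | (field & 0xFFFFF))

calld_addr = 0x800    # hub address of "calld ptra, #blinky"
target = 0x1800       # hub address of blinky

byte_offset = target - (calld_addr + 4)   # offset in bytes
long_offset = byte_offset // 4            # the same distance in longs

fastspin_word = calld_rel20("PTRA", byte_offset)
pnut_word = calld_rel20("PTRA", long_offset)

# Little-endian byte order, as the words appear in the listings above.
fastspin_bytes = struct.pack("<I", fastspin_word).hex(" ").upper()
pnut_bytes = struct.pack("<I", pnut_word).hex(" ").upper()
```

With a byte offset the sketch gives the fastspin bytes, and with the same distance expressed in longs it gives the PNut bytes, which is consistent with PNut using longs for rel20 here.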

Dave Hein · 2020-01-21 22:12

I checked p2asm to see how it handles CALLD, and it seems to use the same method for selecting rel9 versus rel20. Basically, rel9 is used if the range is small enough, otherwise rel20 is used. PNut should allow WCZ since it is using rel9, so that is a bug in PNut. I found that p2asm allows WCZ for both rel9 and rel20, which is a bug for rel20 that I will fix.

When using WCZ with CALLD the C and Z flags will always be cleared if an immediate source is used since all the bits above bit 8 are zero. The source would have to specify a cog memory location to be able to set C and Z to nonzero values.
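
A tiny Python sketch of that point, assuming (as the test results in this thread suggest) that a 9-bit immediate reaches the flag logic with the upper bits zero. The helper name wcz_flags is made up for illustration.

```python
def wcz_flags(s_value):
    """Return (C, Z) as CALLD ... WCZ would load them: C = S[31], Z = S[30]."""
    return (s_value >> 31) & 1, (s_value >> 30) & 1

# Assumption: an immediate source is at most 9 bits wide, zero-extended.
largest_imm9 = 0x1FF
flags_from_immediate = wcz_flags(largest_imm9)

# A register source can hold any 32-bit value, so bits 31 and 30 may be set.
flags_from_register = wcz_flags(0xC000_0000)
```

So no 9-bit immediate can produce anything but cleared flags, while a value read from a cog register can set either one.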

As you discovered, PNut is generating a long address instead of a byte address when using rel20. This is clearly a bug. This bug may have been introduced at the time when there was discussion about how the LOC instruction should work within cog memory. LOC also uses rel20 encoding.

BTW, your cog_r20 routine will be located immediately after _pa in cog memory. So when you call cog_r20, you're not actually jumping to the correct location in memory. You can use ORGF to pad the code with zeros so that cog_r20 is in the correct place. Using multiple ORG instructions is useful if you want to create multiple cog images in the same file, or you want to create an image for LUT memory, or if you want to create labels for cog locations.

evanh · 2020-01-21 22:47

Chip,
While you're at this, please check if you've fixed the old LOC bug as well - https://forums.parallax.com/discussion/comment/1457051/#Comment_1457051

bradzone · 2020-01-22 17:40

Thanks for the responses confirming a PNut bug. I really didn’t want others to have to go through the same time-consuming, trouble-shooting effort I experienced trying to figure out why the code stopped working.

AJL wrote: »

I'm not seeing the indication that CALLD changed for revB. I might have missed it in my searching, so could you point me to it?

Well, this is embarrassing. I agree that CALLD was not changed in revB. I mistook the green highlights from my search for CALLD occurrences for Chip’s red highlights for revB changes. In my defense, at the beginning of my long trouble-shooting effort, I had assumed that PNut’s use of long offsets for rel20 was likely correct and that something must have changed in revB. Also, in searching for revB changes, I noticed some CALLD-related comments in forum discussions of the revB changes. Clearly, I was predisposed towards seeing CALLD changes in the documentation that weren’t really there.

I'd suggest this is to account for the fact that hubexec code needn't be long aligned.
It also suggests that with rel9, if you have short data segments inline with your hubexec code you might need to ensure instructions before and after the data blocks fall on the same byte offset from long-alignment if you use a rel9 branch to skip over the block, while it seems rel20 could handle any difference.

Your alignment observation got me thinking and I wrote some test code with calls to subroutines that are not long aligned.

DAT
        ORGH    $400
'        ORG     0       'cog code
entry	CALLD     PA,#test_rel9
	CALLD     PA,#test_rel20
	cogstop	#0
	byte 1
test_rel9 jmp	PA
	long	$05040302[$100]
	byte	6
test_rel20 jmp	PA

Under fastspin, the listing shows that the rel20 encoding is used for short and long branches. This binary should work correctly when executed from the hub, but should not be copied to a cog, because the instructions are not long-aligned.

003fc     00 00 00 00 |         ORGH    $400
00400                 | '        ORG     0       'cog code
00400                 | entry	CALLD     PA,#test_rel9
00400     09 00 10 FE 
00404     0A 04 10 FE | 	CALLD     PA,#test_rel20
00408     03 00 64 FD | 	cogstop	#0
0040c     01          | 	byte 1
0040d                 | test_rel9 jmp	PA
0040d     2C EC 63 FD

$ od -A x -t x4 calld_syntax.binary
000000 00000000 00000000 00000000 00000000
*
000400 fe100009 fe10040a fd640003 63ec2c01
000410 040302fd 04030205 04030205 04030205
000420 04030205 04030205 04030205 04030205
*
000810 ec2c0605 0000fd63 00000000 00000000

When I enabled the ORG line for cog mode, the instructions were forced to be long aligned and the rel9 encoding was used for the short branch. This code could be executed from the hub and work correctly.

003fc     00 00 00 00 |         ORGH    $400
00400 000             |         ORG     0       'cog code
00400 000             | entry	CALLD     PA,#test_rel9
00400 000 03 EC 27 FB 
00404 001 10 04 10 FE | 	CALLD     PA,#test_rel20
00408 002 03 00 64 FD | 	cogstop	#0
0040c 003 01          | 	byte 1
0040d 003             | test_rel9 jmp	PA
0040d 004 00 00 00 2C 
00411 005 EC 63 FD

$ od -A x -t x4 calld_syntax.binary
000000 00000000 00000000 00000000 00000000
*
000400 fb27ec03 fe100410 fd640003 00000001
000410 fd63ec2c 05040302 05040302 05040302
000420 05040302 05040302 05040302 05040302
*
000810 05040302 00000006 fd63ec2c 00000000
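
As a cross-check on both fastspin listings, here is a rough Python sketch that decodes the four CALLD words and computes where each one lands. It assumes the opcode fields inferred from the words in this thread, that rel9 counts longs and rel20 counts bytes, and that both are relative to the address just after the instruction. The function names are made up, and the decoder ignores the I and R bits for brevity.

```python
def decode_calld(word):
    """Return (kind, signed offset) for the two CALLD forms seen above."""
    if (word >> 23) & 0x1F == 0b11100:        # CALLD PA/PB/PTRA/PTRB,#rel20
        field = word & 0xFFFFF
        return "rel20", field - 0x100000 if field & 0x80000 else field
    if (word >> 21) & 0x7F == 0b1011001:      # CALLD D,#rel9
        field = word & 0x1FF
        return "rel9", field - 0x200 if field & 0x100 else field
    raise ValueError("not a CALLD word handled by this sketch")

def target_hub(addr, word):
    """Hub address reached by the CALLD stored at hub address addr."""
    kind, offset = decode_calld(word)
    scale = 4 if kind == "rel9" else 1        # rel9 in longs, rel20 in bytes
    return addr + 4 + offset * scale

# Hub mode (no ORG): test_rel9 at $40D, test_rel20 at $812.
hub_mode = [target_hub(0x400, 0xFE100009), target_hub(0x404, 0xFE10040A)]

# Cog mode (ORG 0): test_rel9 at $410, test_rel20 at $818.
cog_mode = [target_hub(0x400, 0xFB27EC03), target_hub(0x404, 0xFE100410)]
```

The computed targets line up with where the jmp PA words (fd63ec2c) sit in the two od dumps.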

I tried this under PNut and got similar behavior, but was surprised to see that in hub mode (without the ORG line) the rel20 offsets were in bytes. In cog mode the rel20 offset was in longs.

I find all this complexity in the handling of the CALLD instruction somewhat confusing and still think it would make more sense to always use the #rel20 encoding for all cases when the D value is PA, PB, PTRA, or PTRB.

cgracey · 2020-02-25 09:14

Guys, sorry it's taken me so long to get to this. It just wasn't on my radar for some reason. I'll get to this tomorrow as soon as I can. Thanks for logging all this information about the problem.
