CALLD D,{#}S {WC/WZ/WCZ}

Cluso99 · 2018-12-11 21:02

Question: Is there a case where only WC or only WZ would be used?

I can see the case for setting WCZ and the case for not setting either.

If not, could that be re-purposed?
I have been looking for a way to implement the P1 JMPRET instruction.

If there were an option to just overwrite the return address into D[19:0] without saving CZ to D[31:30] and without clearing D[29:20] then we could use the
JMP #A
instruction as the D register, which would work as the RET instruction variant of JMPRET.
The compiler would need to set the C & Z bits on the CALLD instruction to either 10 or 01 (i.e. WC or WZ).

Might this be possible?

Here are the two instructions involved. Only CALLD needs a silicon tweak.

EEEE 1011001 CZI DDDDDDDDD SSSSSSSSS  CALLD D,{#}S {WC/WZ/WCZ}  Call to S** by writing {C,Z,10'b0,PC[19:0]} to D. C=S[31],Z=S[30].

EEEE 1101100 RAA AAAAAAAAA AAAAAAAAA  JMP   #A                  Jump to A.  If R=1, PC+=A, else PC=A.
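To make the proposal concrete, here is a minimal Python sketch of the two bit layouts quoted above. It models the value CALLD writes to D today and the hypothetical variant that would replace only D[19:0]. The function names are made up for illustration, and the "always" condition value 1111 for EEEE is an assumption not stated in this thread.

```python
MASK20 = (1 << 20) - 1          # PC[19:0] / A[19:0] field
MASK32 = 0xFFFFFFFF

def calld_write(c, z, pc):
    """Value CALLD stores in D per the table: {C, Z, 10'b0, PC[19:0]}."""
    return ((c & 1) << 31) | ((z & 1) << 30) | (pc & MASK20)

def proposed_write(d_old, pc):
    """Hypothetical variant from the post: overwrite only D[19:0],
    leaving D[31:20] (condition, opcode and R bit of a JMP #A) intact."""
    return (d_old & ~MASK20 & MASK32) | (pc & MASK20)

def jmp_abs(addr, cond=0b1111):
    """Encode JMP #A with R=0 per the layout EEEE 1101100 R A[19:0].
    cond=0b1111 as 'always' is an assumption, not taken from the thread."""
    return ((cond & 0xF) << 28) | (0b1101100 << 21) | (addr & MASK20)

ret_slot = jmp_abs(0)                          # the "JMP #A" sitting in D
after_today = calld_write(1, 0, 0x123)         # current CALLD destroys the opcode
after_proposal = proposed_write(ret_slot, 0x123)
```

With the current behaviour the JMP opcode bits in D are replaced by the flags and zeros, so D no longer decodes as a JMP. With the proposed variant only the address field changes, which is what would let the same long act as the return instruction.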

BTW - I think #S is #rel9. I am unsure when using S if it is relative (20bits) or not???

cgracey · 2018-12-11 21:28

Let's think about this a little later. Head full right now.

Cluso99 · 2018-12-11 21:43

Sure

Plenty of P2 to keep me occupied

evanh · 2018-12-11 23:49

** If #S and cogex, PC += signed(S). If #S and hubex, PC += signed(S*4). If S, PC = register S.
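The three cases in that footnote can be sketched in Python as follows. This is only a model of the rule as worded above: the function and parameter names are hypothetical, and pc stands for whatever base address the hardware adds the offset to, which the footnote does not specify.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value

def calld_target(pc, s, immediate, hubexec, regs=None):
    """Branch target per the footnote:
       #S in cog exec : PC += signed(S)
       #S in hub exec : PC += signed(S) * 4
       S (register)   : PC = contents of register S"""
    if not immediate:
        return regs[s] & 0xFFFFF
    offset = sign_extend(s, 9)
    if hubexec:
        offset *= 4                  # hub addresses are byte addresses
    return (pc + offset) & 0xFFFFF
```

For example, a 9-bit field of $1FF is -1, so it steps back one long in cog exec and four bytes in hub exec.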

Ah, just done some testing and I've discovered that any immediate number entered in source is treated as an absolute address. The assembler then converts it to relative. This answers a puzzle I'd had for some time. I'd previously tried to hand code immediate offsets but it had always been rejected with error of must be within #0-511.

evanh · 2018-12-11 23:59

It was probably a REP instruction where I'd struck that.

ersmith · 2018-12-12 00:05

Couldn't you implement jmpret as:

   calld sub_ret+1, #sub
   ...
sub
   ' do some stuff
   ...
sub_ret
    jmp sub_ret+1
    long 0

It is an extra long (bad) but doesn't require any hardware or instruction set changes (good).

evanh · 2018-12-12 01:02

Cluso99 wrote: »

BTW - I think #S is #rel9. I am unsure when using S if it is relative (20bits) or not???

Any PC-relative encoding is resolved to absolute address upon execution. So, in your hypothetical change, when CALLD is executed it will write an absolute address to the JMP instruction. This would be okay as long as the JMP itself was specified to be an absolute in the first place.

EDIT: Third edit!

evanh · 2018-12-12 01:05

Lol, that's slightly amusing. I edited my previous post twice. The initial posting was about 20 minutes back.

Cluso99 · 2018-12-12 06:00

ersmith wrote: »
Couldn't you implement jmpret as:
   calld sub_ret+1, #sub
   ...
sub
   ' do some stuff
   ...
sub_ret
    jmp sub_ret+1
    long 0
It is an extra long (bad) but doesn't require any hardware or instruction set changes (good).

That works, but every subroutine requires the extra long. Some PASM programs don't have that many spare longs eg spin Interpreter.

It does cover the sets (movs) tho'.

ersmith · 2018-12-12 09:31

Cluso99 wrote: »
ersmith wrote: »
Couldn't you implement jmpret as:
   calld sub_ret+1, #sub
   ...
sub
   ' do some stuff
   ...
sub_ret
    jmp sub_ret+1
    long 0
It is an extra long (bad) but doesn't require any hardware or instruction set changes (good).
That works, but every subroutine requires the extra long. Some PASM programs don't have that many spare longs eg spin Interpreter.

It does cover the sets (movs) tho'.

The subroutine only requires the additional long if you really need the full power of jmpret. Most uses of jmpret can be changed into call/ret. (Distinguishing between the two automatically is probably non-trivial, but for a human programmer it shouldn't be too hard to manually convert code to use the appropriate form.)

For the spin1 interpreter, can you move some of the code into HUB or LUT to free up space?

Cluso99 · 2018-12-12 11:00

Yes, there are some places where call/ret can be used. They are often easily recognised.

But others are much more difficult and sometimes involve inline instruction modification by conditionally modifying the return address.

For specifics, I am using the example of my spin Interpreter. It will use half of LUT for jump tables (3 cog 9-bit addresses and 5 single flags/bits) for each bytecode. Speed is attained by unrolling and tweaking the PASM. Only some of this can work in LUT.

The thing is, this is the first real conversion of P1 PASM I have tried. Only by trying real code do you find the problems in converting P1 code. Most has converted ok, but there are a few stumbling blocks, and some are much more difficult, requiring substantial understanding of the code. I'd like to be able to automate as much as possible. JMPRET stood out as the real bugbear, and of course every program has a substantial number of them.

There are already a few P1 instructions that require more than one P2 instructions to simulate. But these usually only occur a few times in the code, so an auto converter should work here.

I know some P2 instructions can replace multiple P1 instructions. This again requires an understanding of the original code.

My hope is to be able to convert a reasonable number of P1 programs (objects) quickly with minimal resources (ie time). The quicker we get these done the better the P2 will be received. IMHO JMPRET is the biggest stumbling block to achieving this by far. I am looking at each and every possible solution to aid the conversion.

ersmith · 2018-12-12 12:20

I guess we all have our own bugbears. For me, when converting the ZPU and Risc-V interpreters from P1 to P2, it was the "movs" and "movd" instructions that caused me the biggest headaches. I don't think I used jmpret at all, except in the restricted forms like jmp, call, and ret that translated trivially.

I just grepped through the "spin-standard-library" project from github.com/parallaxinc, and there were 77 instances of "jmpret", 1340 of "call", and 3155 of "jmp". Some of those may be in comments, but I think it's clear that uses of the full power of jmpret are actually pretty rare. The Spin interpreter is kind of a special case there. But even so, looking at Chip's original interpreter it looks like most of the time the jmpret dest register is the same (there are just "getret" and "pushret"). So I think some variant of my proposal should work, since you'll only need the extra long after those two labels (the actual "jmpret" gets translated into just "calld").

For your new Spin interpreter, have you looked at P2 XBYTE mode? That'll definitely give a big speed improvement.

Cluso99 · 2018-12-12 13:23

I classify the JMPRET instruction as all the variants. They are all used. I can get around the Z & C flags being set as they are much rarer, but of course require another long instruction.
In reality, it's the CALL form of JMPRET that causes the most problem. This writes a 9-bit cog return address into the 9 bits (the S bits) of the D register, leaving all other bits intact. But it's also the RET (which is in fact a JMP) since this is the recipient of those 9-bit cog return address writes. Obviously the other bits must remain intact.
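The write described here can be modelled in a few lines of Python. This is a sketch only: the function name is made up, and the constant used for the P1 RET long ($5C7C0000, a JMP #0 with no result written) is an assumption about the P1 encoding rather than something stated in this thread.

```python
S_MASK = 0x1FF                  # 9-bit S field, bits 8..0
MASK32 = 0xFFFFFFFF

P1_RET = 0x5C7C0000             # assumed P1 encoding of "xyz_ret  ret" (JMP #0)

def p1_jmpret_write(d_old, ret_addr):
    """P1 JMPRET/CALL: store the 9-bit cog return address in D's S field
    and leave bits 31..9 of D untouched."""
    return (d_old & ~S_MASK & MASK32) | (ret_addr & S_MASK)

patched = p1_jmpret_write(P1_RET, 0x012)   # caller's return address is $012
```

Because only the low 9 bits change, the patched long is still a JMP, now pointing back at the caller. That is the property the P2 CALLD write, which replaces all 32 bits, does not preserve.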

The problem is self-modifying code doesn't work with the new call and ret instructions. That's where the MOVS/D fail. So you really do understand the problem, but are just seeing it from a different perspective.

As for XBYTE, that's too big a change to implement in my interpreter's case. I am not up for the rewrite that this would entail. You see, I broke down each bytecode into one to three subroutines. This is what my table houses: up to 3 subroutine calls in the form of 3 x 9-bit cog addresses. And there is one of these for each bytecode.

evanh · 2018-12-13 05:12

Cluso99 wrote: »

In reality, it's the CALL form of JMPRET that causes the most problem.

Something doesn't quite fit in that statement. If you only want a fast and compact CALL/RET function then the Prop2's hardware-stack-based CALL/RET should be an easy drop-in replacement.

Cluso99 · 2018-12-13 05:45

An example...

        call    xyz
        ....
xyz     ....
        ....
xyz_ret ret

However, within the xyz routine, it may jump out of this routine on certain conditions, or it may replace the return address within the xyz_ret ret instruction. All these things need to be examined thoroughly. You just cannot globally replace with the P2 call/ret using the internal stack. You need to understand what the code is doing because we all took advantage of PASM tricks to make things faster. Remember, there were only something less than 496 lines of code.

Phil Pilgrim (PhiPi) · 2018-12-13 06:05

I frequently use

jmp xyz_ret (without the #)

to return, skipping the actual jump to the return instruction.

-Phil

evanh · 2018-12-13 06:39

Heh, Phil, unlikely to work for Prop2 unless there was a special cogexec only JMPRET. A RET does the job now though. Or _RET_ even.

Cluso99 · 2018-12-13 06:48

With only an 8-deep internal stack, you need to be really careful you don't overflow.

And recursive subroutine calls are a no-go using the internal stack. I do it in the spin interpreter.

Phil,
Yes, I have used it sometimes, and also seen it used by others too.

evanh · 2018-12-13 07:25

Let me know how many times you strike an overflow.
