Question about register indirection - ALTI

Christof Eb. · 2022-12-12 08:35

Hi,
looking for a fast way to implement stacks in cog memory.
In Forth the command "+" adds the value at the top of the stack (T) to the second value (S) and then decrements the stack pointer (sp), so that it now points to what was the second element.

So I would like to have indirection and an offset and a post-decrement together with ADD. :-)

Something like:
add [sp-1],[sp--]

If TOP is a register:
add T,[sp--]

If I hold TOP and Second as registers
add T,S
mov S,[sp--]

There might be ways to achieve this efficiently either by using ptra/ptrb or ALTI or ????
Thank you for some hints!
Christof
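
For reference, the three variants above can be modelled off-chip. This is a minimal Python sketch, not PASM2: the list `mem`, the index `sp` and the function names are all hypothetical, and it assumes the stack grows upward so that `sp--` moves towards older entries.

```python
# Hypothetical model: mem stands in for the cog RAM that holds the stack.

def plus_in_memory(mem, sp):
    """add [sp-1],[sp--]: both operands live in memory, sp indexes the top."""
    mem[sp - 1] += mem[sp]
    return sp - 1              # sp now indexes the old second, which holds the sum

def plus_top_cached(mem, sp, t):
    """add T,[sp--]: T is a register, sp indexes the second element."""
    t += mem[sp]
    return sp - 1, t

def plus_top_and_second_cached(mem, sp, t, s):
    """add T,S then mov S,[sp--]: T and S are registers, sp indexes the third."""
    t += s
    s = mem[sp]                # refill S from memory
    return sp - 1, t, s
```

All three leave the same sum on top; they only differ in how many of the top elements are held in fixed registers instead of indirectly addressed memory.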

evanh · 2022-12-12 11:27

PTRA/PTRB ops are for hubRAM. ALTx instructions are for cogRAM.

Not sure what you're up to with TOP and Second but the add [sp-1],[sp--] is roughly doable I think ...

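        alti    sp_1, #%000_110_110     ' patch D and S of the next instruction from sp_1, then decrement both fields
        add     0-0, 0-0                ' executes as add [sp-1],[sp]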
        ret

sp_1        long     (stackbase - 1)<<9 | stackbase

PS: There is also the option to have a separate result register, eg:

        alti    sp_1, #%100_110_110
        add     0-0, 0-0
        ret

sp_1        long     result<<19 | (stackbase - 1)<<9 | stackbase    ' result register field is D[27..19]
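
To see what the first ALTI variant does to the packed pointer, here is a small Python model. It is only a sketch of my reading of the field layout (S field in bits 8..0, D field in bits 17..9, each decremented independently with 9-bit wrap); `alti_dec_d_and_s`, `forth_plus` and the `STACKBASE` value are hypothetical names and numbers, not anything from the P2 tools.

```python
FIELD_MASK = 0x1FF  # cog register addresses are 9 bits wide

def alti_dec_d_and_s(ptr):
    """Model of ALTI ptr, #%000_110_110.

    Returns (d_addr, s_addr, new_ptr): the addresses patched into the next
    instruction's D and S fields, and ptr with both fields decremented.
    Assumes each 9-bit field wraps on its own, with no borrow between fields.
    """
    s_addr = ptr & FIELD_MASK
    d_addr = (ptr >> 9) & FIELD_MASK
    new_s = (s_addr - 1) & FIELD_MASK
    new_d = (d_addr - 1) & FIELD_MASK
    new_ptr = (ptr & ~0x3FFFF) | (new_d << 9) | new_s
    return d_addr, s_addr, new_ptr

def forth_plus(cog, ptr):
    """alti followed by add 0-0,0-0: add [sp-1],[sp], then step both fields down."""
    d_addr, s_addr, ptr = alti_dec_d_and_s(ptr)
    cog[d_addr] = (cog[d_addr] + cog[s_addr]) & 0xFFFFFFFF
    return ptr

STACKBASE = 0x103                  # hypothetical address of the current top of stack
cog = [0] * 512
cog[STACKBASE - 2], cog[STACKBASE - 1], cog[STACKBASE] = 1, 2, 3
sp_1 = ((STACKBASE - 1) << 9) | STACKBASE
sp_1 = forth_plus(cog, sp_1)       # 2 + 3
sp_1 = forth_plus(cog, sp_1)       # 1 + 5
```

After the two calls the sum sits in `cog[STACKBASE - 2]` and both fields of `sp_1` have moved down by two, which is the post-decrement behaviour asked for in the first post.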

ManAtWork · 2022-12-12 11:34

Unfortunately, the hardware and assembler manuals are still not finished, so I cannot tell whether RDLUT/WRLUT also support the indexed and pre/post increment/decrement addressing modes like RD/WRBYTE/WORD/LONG do. If they do (which might be assumed from the shortcuts "{#}S/PTRx" in the instruction tables) then I'd recommend using the LUT RAM for the stack. It would have the advantage of some protection against overwriting code in the case of a stack overflow. And you wouldn't need ALTx instructions. But as I said, I can't tell if this actually works...

evanh · 2022-12-12 11:38

The silicon manual has the details for PTRA/PTRB, as per forever.

EDIT: Oh, I see, you're right, even the silicon manual is not particularly clear on this. It just says RDLUT/WRLUT can now handle PTRx expressions. - As of 2019-04-11. One of the RevB late changes.

TonyB_ · 2022-12-12 12:41

@ManAtWork said:
Unfortunately, the hardware and assembler manuals are still not finished, so I cannot tell whether RDLUT/WRLUT also support the indexed and pre/post increment/decrement addressing modes like RD/WRBYTE/WORD/LONG do. If they do (which might be assumed from the shortcuts "{#}S/PTRx" in the instruction tables) then I'd recommend using the LUT RAM for the stack.

Yes, PTRx does work with RDLUT/WRLUT, and LUT RAM is a much better place for a stack than cog RAM. I suggested the option of an auto-incrementing address for LUT RAM and Chip made PTRx work in the same way as for hub RAM. The cost is that an immediate address can only access the bottom half of LUT RAM, but that's a small price to pay. I helped with the indexing bits and here is the post with the final scheme:
https://forums.parallax.com/discussion/comment/1453712/#Comment_1453712
SCALE is 1 for LUT RAM. Index is -32 to +31 if PTRx does not change, or -16 to -1 and +1 to +16 if PTRx changes before or after RDLUT/WRLUT.
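
The index ranges quoted above, together with a LUT-based stack, can be sketched in Python. This is an off-chip model under stated assumptions: `ptrx_index_ok` and `LutStack` are hypothetical names, push is modelled as write-then-post-increment and pop as pre-decrement-then-read, and the pointer simply wraps inside the 512 longs, which is a simplification of the model rather than a statement about the hardware.

```python
LUT_LONGS = 512  # LUT RAM size in longs

def ptrx_index_ok(index, updates_ptr):
    """Check an index against the ranges given for RDLUT/WRLUT with PTRx."""
    if updates_ptr:
        # PTRx changes before or after the access: -16..-1 and +1..+16
        return index != 0 and -16 <= index <= 16
    # PTRx stays unchanged: -32..+31
    return -32 <= index <= 31

class LutStack:
    """Stack kept in (modelled) LUT RAM, pointer scaled by 1 long."""

    def __init__(self, base=0):
        self.lut = [0] * LUT_LONGS
        self.ptr = base

    def push(self, value):
        # write, then post-increment the pointer by 1
        self.lut[self.ptr] = value & 0xFFFFFFFF
        self.ptr = (self.ptr + 1) % LUT_LONGS

    def pop(self):
        # pre-decrement the pointer by 1, then read
        self.ptr = (self.ptr - 1) % LUT_LONGS
        return self.lut[self.ptr]
```

Because the stack lives in its own memory, an overflow in this model can only run over other LUT contents, never over code in cog RAM, which is the protection ManAtWork mentioned.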

Wuerfel_21 · 2022-12-12 15:02

@ManAtWork said:
Unfortunately, the hardware and assembler manuals are still not finished, so I cannot tell whether RDLUT/WRLUT also support the indexed and pre/post increment/decrement addressing modes like RD/WRBYTE/WORD/LONG do.

Not to advertise, but....

https://p2docs.github.io/lutmem.html#wrlut

Christof Eb. · 2022-12-12 16:52

Thank you very much for the inputs!
(I had been aware that the pointers can be used with LUT, but I am "astonished" that you cannot use them with cog RAM. And it would be good to have more than 2 pointers....)
I was wondering if I could make a Forth engine that compiles to inlined "native assembler" code, so that it gets the speed benefit of the microcache for hubexec. The downside is that Tachyon only needs a code size of 2 bytes per wordcode, while this now needs at least 8 bytes, and more with the stack in LUT (as it is in Tachyon).
So ALTI seems to be able to do the post-decrement, but the pre-increment for push operations raises new questions....

Ada, your place was the first where I looked, but for guys like me you (or Chip !!!) would have to add some more explanations for ALTI. "MEGA ULTRA MECHA TODO" sounds right, whatever it means. I had no idea that you have to code the S and D of the next instruction into a combined pointer for the D of ALTI. Thanks @evanh !
Examples like the ones for MOVBYTS are very helpful, great!
Sometimes I do not understand the concept, for example PA and PB with CALLD. (What is a coroutine?)

Wuerfel_21 · 2022-12-12 17:33

@"Christof Eb." said:
Ada, your place was the first where I looked, but for guys like me you (or Chip !!!) would have to add some more explanations for ALTI. "MEGA ULTRA MECHA TODO" sounds right, whatever it means.

"TODO" means that the lack of sufficient explanation has been noted...

Simonius · 2022-12-13 22:03

I did some experiments already.
I will just throw this file in here, maybe it is of some help
Please note that it is just a mere sketch, to play around with.

Christof Eb. · 2022-12-14 10:44

@Simonius said:
I did some experiments already.
I will just throw this file in here, maybe it is of some help
Please note that it is just a mere sketch, to play around with.

Ui, you just did it, while I am still wondering and keep getting sidetracked....
As you seem interested in Forth: the idea of directly inlining code came from some study of bytecode engines and from Forth09: https://colorcomputerarchive.com/repo/Documents/Manuals/Programming/Forth09 (D.P. Johnson).pdf
Page 1-9.

So with the main stack in cog memory it is possible to code "+" in 2 longs. Let's assume that we want one additional instruction to prevent stack underflow (FGT?), which makes 3 instructions. If this is not inlined with other code, the first instruction would need 4 cycles on average and the other two 1 each, so 6 cycles on average.

TAQOZ needs 7 instructions for "+" and 3 instructions for doNEXT, including one rdword, which will need 4 cycles on average. So 13 cycles on average.
If we manage to inline 2 consecutive words like "+", they would need 9 cycles, compared to 26 cycles for TAQOZ.
So it looks as if it might be possible to write a Forth which is 2.5(???) times faster than TAQOZ. Code size would be much larger, similar to compiled code.
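
The figures above can be reproduced with a very simple cost model. This Python sketch is only one reading of the estimate, not a measurement: it assumes the first instruction of an inlined sequence costs 4 units on average, every further instruction costs 1 unit, and the rdword in doNEXT costs 4 units. The constant and function names are made up for illustration.

```python
FIRST_FETCH = 4   # assumed average cost of the first instruction of a sequence
FOLLOWING = 1     # assumed cost of each further instruction
RDWORD = 4        # assumed average cost of the rdword in doNEXT

def inlined_cost(instructions):
    """Cost of a run of inlined instructions under the assumptions above."""
    return FIRST_FETCH + (instructions - 1) * FOLLOWING

def taqoz_cost(words):
    """Cost of executing `words` words like "+" through doNEXT."""
    plain = 7 + 3 - 1                      # "+" body plus doNEXT, minus the rdword
    return words * (plain * FOLLOWING + RDWORD)

one_plus = inlined_cost(3)                 # ALTI + ADD + underflow guard
two_plus = inlined_cost(6)                 # two inlined "+" in a row
ratio_one = taqoz_cost(1) / one_plus
ratio_two = taqoz_cost(2) / two_plus
```

Under these assumptions a single "+" comes out at 6 versus 13 (about 2.2 times faster) and two consecutive "+" at 9 versus 26 (about 2.9 times faster), so the guessed factor of 2.5 lies between the two cases.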
