Discussion about using LUT as a STACK

Cluso99 · 2015-10-07 20:34

The discussion about the possibility of using LUT as a stack originated in Peter's Tachyon thread.
Here is the basis of that discussion...

evanh wrote: »

Peter,
I'm not able to decipher what stacking levels are in use there. Was it intended to help answer Chip's question of whether there is a strong case for having the LUT for stacking or not?

Chip,
Another stacking variant that might be more palatable to all is for the CALLA/B, PUSHA/B instructions being able to stack to CogRAM and/or LUTRAM. Would that be feasible? I guess that cuts off a chunk of Hub addresses though.

cgracey wrote: »

Wow! Make CALLA/CALLB/RETA/RETB address sensitive, just like cog/hub instruction fetching is, so that addresses $200..$3FF use the lut, instead of the hub. I really like that, because it means we don't need a whole extra set of instructions and pointers to use lut for stack. Excellent!

Are there any other sleeper purposes for address sensitivity?

P.S. This scheme would work only for CALLA/CALLB/RETA/RETB, since they are dedicated instructions that could easily be redirected. PUSHA/PUSHB/POPA/POPB are actually cases of WRLONG/RDLONG, though, and to redirect them, as well, would mean that hub $200..$3FF would not be reachable for R/W. So, this would only work for calls and returns.

cgracey wrote: »

David Betz wrote: »

Umm... Doesn't that mean that if I have PTRA pointing into LUT and do a CALLA to some function and that function does a POPA to get its return address, it will fetch the wrong value?

Yes, because PUSHA/PUSHB/POPA/POPB are actually WRLONG/RDLONG, only CALLA/CALLB/RETA/RETB could have this lut functionality. Maybe it's too complicated, in that sense, as it would make people suppose that PUSHx and POPx would work, too. To do a pop without caring about the data, you'd just 'SUB PTRx,#1', while a push would be 'WRLUT data,PTRx' plus 'ADD PTRx,#1'.
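The mismatch confirmed above can be sketched with a small behavioural model. This is only an illustration in Python, not P2 hardware behaviour: the names `calla`, `popa_as_rdlong`, `lut` and `hub` are hypothetical, and the pointer steps (1 per long in lut, 4 bytes per long in hub) are assumptions.

```python
# Hypothetical model of the first proposal: only CALLx/RETx look at the
# PTRx range, while POPx stays an alias of RDLONG and always reads hub.
LUT_FIRST, LUT_LAST = 0x200, 0x3FF   # range quoted in the thread

lut = [0] * 512   # 512 longs of lut RAM
hub = {}          # sparse stand-in for hub RAM, keyed by byte address


def in_lut(ptr):
    """True when the pointer falls in the range redirected to lut."""
    return LUT_FIRST <= ptr <= LUT_LAST


def calla(ptra, return_addr):
    """Address-sensitive call: store the return address, return new PTRA."""
    if in_lut(ptra):
        lut[ptra - LUT_FIRST] = return_addr
        return ptra + 1          # assumed: lut pointer steps by one long
    hub[ptra] = return_addr
    return ptra + 4              # assumed: hub pointer steps by four bytes


def popa_as_rdlong(ptra):
    """POPA as a plain RDLONG with pre-decrement: never redirected to lut."""
    ptra -= 4
    return hub.get(ptra, 0), ptra


ptra = calla(0x200, 0x1234)          # return address lands in lut[0]
popped, _ = popa_as_rdlong(ptra)     # reads hub, so it misses that value
```

In this model `popped` is not the return address that CALLA stored, which is the surprise David Betz points out.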

It would be easy to support PTRx expressions for RDLUT/WRLUT, where only the lower 9 bits are used for lut address. So, a pop from lut would be 'RDLUT data,--PTRx' and a push would be 'WRLUT data,PTRx++'. If we made discrete PUSHx/POPx instructions which weren't just aliases for WRLONG/RDLONG, we could handle this better by invoking either WRLONG/RDLONG or WRLUT/RDLUT, depending on the PTRx range being within lut, or not. That would be pretty easy.

We would have these discrete instructions which would use hub or lut, based on the PTRx address:

PUSHA D/#
PUSHB D/#
POPA D
POPB D
CALLA D/#/@
CALLB D/#/@
RETA
RETB

Meanwhile RDBYTE/RDWORD/RDLONG and WRBYTE/WRWORD/WRLONG would always access hub, only.
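A minimal Python sketch of that dispatch, assuming the $200..$3FF window mentioned earlier in the thread. The class `StackModel`, its method names and the pointer step sizes (1 for lut, 4 for hub) are hypothetical choices made for illustration only.

```python
# Hypothetical model: discrete PUSHx/POPx pick lut or hub from the PTRx value.
LUT_BASE = 0x200     # start of the window redirected to lut (from the thread)
LUT_LONGS = 512
MASK32 = 0xFFFFFFFF


class StackModel:
    def __init__(self):
        self.lut = [0] * LUT_LONGS
        self.hub = {}            # sparse hub model keyed by byte address

    @staticmethod
    def in_lut(ptr):
        return LUT_BASE <= ptr < LUT_BASE + LUT_LONGS

    def push(self, ptr, value):
        """PUSHx: write at ptr, then post-increment. Returns the new pointer."""
        if self.in_lut(ptr):
            self.lut[ptr & 0x1FF] = value & MASK32   # WRLUT path, low 9 bits
            return ptr + 1
        self.hub[ptr] = value & MASK32               # WRLONG path
        return ptr + 4

    def pop(self, ptr):
        """POPx: pre-decrement, then read. Returns (value, new pointer)."""
        if self.in_lut(ptr - 1):
            ptr -= 1
            return self.lut[ptr & 0x1FF], ptr        # RDLUT path
        ptr -= 4
        return self.hub[ptr], ptr                    # RDLONG path


model = StackModel()
p = model.push(0x200, 0xDEADBEEF)    # lands in lut[0]
value, p = model.pop(p)              # comes back from lut[0]
```

In this model CALLx and RETx would be a `push` of the return address and a `pop` into the PC, so a POPx after a CALLx reads the same memory the call wrote to.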

Cluso99 · 2015-10-07 20:41

There is another advantage to using LUT for the stack...
It is a full 32-bits, so pushing 32-bit values could/would work.

Chip,
Since you are using PTRA/PTRB to point to HUB or LUT, could it also be used to point to COG (stack in cog)?

Seems to me that the internal 8 level 22-bit stack would now be redundant. Can we get rid of it and its supporting instructions? IMHO it is much better to be able to put the stack in COG/LUT/HUB just by setting the PTRA/PTRB. And it saves the argument for either wider 32-bit or deeper than 8.

jmg · 2015-10-07 20:46

chip wrote:

We would have these discrete instructions which would use hub or lut, based on the PTRx address:

PUSHA D/#
PUSHB D/#
POPA D
POPB D
CALLA D/#/@
CALLB D/#/@
RETA
RETB

Switching by PTRx alone could work, but using the PC address (as I think proposed) sounds dangerous.

This actually makes a further good case for justifying the LUT to Max_Adr (which has the added plus of allowing 1K of unbroken code), as that keeps it clear of any accidental PTR roll into LUT space.
LUT may be used for other tasks, so this needs to be an explicit user decision.

cgracey · 2015-10-07 21:21

jmg wrote: »

...Switching by PTRx alone could work, but using the PC address (as I think proposed) sounds dangerous...

No, the PC address wouldn't have anything to do with this.

jmg · 2015-10-07 21:37

cgracey wrote: »

No, the PC address wouldn't have anything to do with this.

Phew, when you said earlier
[".... Make CALLA/CALLB/RETA/RETB address sensitive, just like cog/hub instruction fetching is, so that addresses $200..$3FF use the lut, instead of the hub. "]
I thought the mention of 'just like cog/hub instruction fetching' meant the PC got into the mix here, like it does to change cog/hub operation.

If the code is PC agnostic, and only PTRx determines the action, then that is good.

mindrobots · 2015-10-07 22:16

The built in hardware stack (8 deep) is nice because you can use it anywhere and it is the least destructive for debugging and forensic code. You don't need to touch COGRAM, LUT or any pointers. It's just there ready to be used to capture as much COG state as you can.

cgracey · 2015-10-07 22:32

mindrobots wrote: »

The built in hardware stack (8 deep) is nice because you can use it anywhere and it is the least destructive for debugging and forensic code. You don't need to touch COGRAM, LUT or any pointers. It's just there ready to be used to capture as much COG state as you can.

Yes, I think that needs to stay, no matter what, because it's bullet-proof and fast. Eight levels is enough for any cog/lut program. Hub programs might as well use hub for their stacks.

We'll see about some intermediate solution.

Cluso99 · 2015-10-07 22:35

mindrobots wrote: »

The built in hardware stack (8 deep) is nice because you can use it anywhere and it is the least destructive for debugging and forensic code. You don't need to touch COGRAM, LUT or any pointers. It's just there ready to be used to capture as much COG state as you can.

You cannot guarantee any space is free to use.
The safest way for debugging is to use hubexec and any registers required need to be backed up in hub first.

FYI I built a P1 single stepper that worked on both PASM and SPIN. It used zero footprint in the cog because it resided totally in the shadow ram register space (used for an LMM engine). Shadow ram is being used in the P2 but we now have hubexec.

Seairth · 2015-10-07 22:46

Instead of pushing the address on the stack, could you instead store it to the "return" register used by the breakpoint interrupt?

cgracey · 2015-10-07 22:49

Cluso99 wrote: »

...FYI I built a P1 single stepper that worked on both PASM and SPIN. It used zero footprint in the cog because it resided totally in the shadow ram register space (used for an LMM engine). Shadow ram is being used in the P2 but we now have hubexec.

INA and INB shadow registers serve as the debug interrupt jump and return vectors. They can only be read and written via the SETBRK command or within a debug ISR.

Cluso99 · 2015-10-07 23:23

cgracey wrote: »

INA and INB shadow registers serve as the debug interrupt jump and return vectors. They can only be read and written via the SETBRK command or within a debug ISR.

That is what I meant by the shadow registers now being used in the P2.

The use of INA & INB shadow registers as the debug interrupt jump and return vectors is nice.
Since they can only be read and written using SETBRK, then perhaps when a COGINIT is performed, if the Cog was already running, could it save the PC plus C&Z flags in the INB shadow register? If it was not running, could it clear the INB shadow register?

This would permit an errant cog to be force-interrupted, and the cog to be interrogated. The INA would not be used because the COGINIT would start in hubexec debug code. By examining the contents of INB, the debugger could determine if the cog was previously running or stopped. It could be set to continue by returning via INB (RET0 ???). Of course there are caveats like if the cog was in a wait or rep loop, etc. However, debuggers always have caveats.

cgracey · 2015-10-08 00:44

Cluso99 wrote: »

That is what I meant by the shadow registers now being used in the P2.

The use of INA & INB shadow registers as the debug interrupt jump and return vectors is nice.
Since they can only be read and written using SETBRK, then perhaps when a COGINIT is performed, if the Cog was already running, could it save the PC plus C&Z flags in the INB shadow register? If it was not running, could it clear the INB shadow register?

This would permit an errant cog to be force-interrupted, and the cog to be interrogated. The INA would not be used because the COGINIT would start in hubexec debug code. By examining the contents of INB, the debugger could determine if the cog was previously running or stopped. It could be set to continue by returning via INB (RET0 ???). Of course there are caveats like if the cog was in a wait or rep loop, etc. However, debuggers always have caveats.

It's going to work a lot better to just push C/Z/PC onto the hardware stack when it gets COGINIT'd. No special mux is needed, just the signal to do it.

jmg · 2015-10-08 00:51

cgracey wrote: »

It's going to work a lot better to just push C/Z/PC onto the hardware stack when it gets COGINIT'd. No special mux is needed, just the signal to do it.

An obvious question...

If the stack is going to expand to allow storing C/Z, can users pass 2 booleans back from functions, using the stack?

Cluso99 · 2015-10-08 01:12

jmg,
The stack has always been wide enough to store the C&Z flags plus the address, hence 22-bits while hub addresses are in bytes.
So yes, you could put 2 Booleans instead provided you did the appropriate pop/push.

potatohead · 2015-10-08 01:13

Wouldn't the flags just be bits on the long pushed to capture the PC?

cgracey · 2015-10-08 01:48

potatohead wrote: »

Wouldn't the flags just be bits on the long pushed to capture the PC?

A PUSH pushes D[21:0], while a RET pops bit 21 into C if WC, bit 20 into Z if WZ, and bits 19:0 into the PC.
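The bit layout described here can be written out as a short Python sketch, assuming the 22-bit stack width mentioned earlier in the thread (C in bit 21, Z in bit 20, PC in bits 19:0). The function names `pack`, `push_value` and `ret` are hypothetical and only illustrate the packing, not actual hardware.

```python
# Hypothetical model of one 22-bit hardware stack entry.
C_BIT, Z_BIT = 21, 20
PC_MASK = (1 << 20) - 1       # bits 19:0
STACK_MASK = (1 << 22) - 1    # assumed 22-bit stack width


def pack(c, z, pc):
    """Build a stack entry from the two flags and a 20-bit PC."""
    return (int(c) << C_BIT) | (int(z) << Z_BIT) | (pc & PC_MASK)


def push_value(d):
    """A PUSH keeps only the low 22 bits of D in this model."""
    return d & STACK_MASK


def ret(entry, c, z, wc=False, wz=False):
    """RET: restore C only if WC, Z only if WZ; the PC is always loaded.

    Returns the resulting (c, z, pc).
    """
    if wc:
        c = bool((entry >> C_BIT) & 1)
    if wz:
        z = bool((entry >> Z_BIT) & 1)
    return c, z, entry & PC_MASK


entry = pack(True, False, 0x12345)
flags_restored = ret(entry, False, True, wc=True, wz=True)
flags_kept = ret(entry, False, True)
```

Under these assumptions, two booleans can ride along in bits 21 and 20 of a pushed value and be picked up by a RET that specifies WC and WZ, which is the usage jmg asks about.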

Seairth · 2015-10-08 02:18

cgracey wrote: »

It's going to work a lot better to just push C/Z/PC onto the hardware stack when it gets COGINIT'd. No special mux is needed, just the signal to do it.

So, what's the idea? Would the COGINIT just translate into executing a CALL (instead of a JMP) on the target COG?

cgracey · 2015-10-08 02:49

Seairth wrote: »

So, what's the idea? Would the COGINIT just translate into executing a CALL (instead of a JMP) on the target COG?

This feature only exists so that you can find out where the cog was executing before the COGINIT. The new program would have to do a POP to find out where the last program left off.

Yanomani · 2015-10-08 03:44

Hi Chip

Have you ever considered doing the LUT space as a 256 x 64 bit true dual ported ram, with independent LONG select RD and WR controls?

Could an approach like this one, enable the simultaneous and uncommitted use of the streamer and code+stack operations, provided that each class runs segregated to its own longs?

Sure, if only code+stack usage is intended to satisfy some application, A0 could be used to select the appropriate long, and 512 contiguous longs are accessible.

But, when simultaneous use of the streamer is intended, and since it could be writing into LUT space at the same time, then segregating their access by dividing the ram into upper/lower long halves is needed, to avoid any interference.

I'm not sure if it's feasible or really worth the effort, but if it is and solves the problem, why not?

Henrique
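The two addressing modes suggested in this post can be sketched in Python as address mappings onto a 256 x 64-bit array. This is only an illustration: the function names and the choice of which half goes to code and which to the streamer are assumptions, not part of the proposal.

```python
# Hypothetical address mapping for a 256-row by 64-bit LUT.
ROWS = 256


def contiguous(addr):
    """Code+stack only: A0 selects the long, giving 512 contiguous longs.

    Returns (row, long_select).
    """
    if not 0 <= addr < 2 * ROWS:
        raise ValueError("address outside the 512-long LUT")
    return addr >> 1, addr & 1


def segregated(index, user):
    """Shared use: each user gets 256 longs in its own half of every row.

    The assignment of halves to users is an assumption for illustration.
    """
    if not 0 <= index < ROWS:
        raise ValueError("index outside the 256 rows")
    half = {"code": 0, "streamer": 1}[user]
    return index, half


first = contiguous(0)
second = contiguous(1)
last = contiguous(511)
```

In the segregated mapping the same index from the two users always resolves to different longs of a row, so their reads and writes never touch the same long.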

ozpropdev · 2015-10-08 04:39

Thinking about using the LUT as a stack I came up with this.
We already have the ALTDS instruction which can modify the following instruction in many ways.
From the ALTDS notes for the SSS field we have

 110 = use D's SSSSSSSSS field as the SSSSSSSSS field for the next instruction, decrement D's SSSSSSSSS field
 111 = use D's SSSSSSSSS field as the SSSSSSSSS field for the next instruction, increment D's SSSSSSSSS field

So if we had a couple of new assembler aliases like

	'pushl	myreg 
	'becomes
	altds	adra,#%000_000_111
	wrlut	myreg,#0-0

	'popl	myreg				
	'becomes
	altds	adra,#%000_000_110
	rdlut	myreg,#0-0

Simply initialize the ADRA reg with the lut stack base address.
Before "popping" from the LUT a simple sub adra,#1 adjusts for no pre decrement.

We almost have all the pieces we need to implement a stack except for pre-decrement operation. Chip?
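Why the missing pre-decrement matters can be shown with a small Python model of the pointer arithmetic. The function names are hypothetical, the 9-bit wrap mirrors the SSSSSSSSS field width, and nothing here models ALTDS itself.

```python
# Hypothetical model of post-increment push versus post/pre-decrement pop.
PTR_MASK = 0x1FF   # 9-bit pointer field


def push_post_inc(lut, ptr, value):
    """Write at ptr, then increment (ALTDS mode 111 in the post)."""
    lut[ptr] = value
    return (ptr + 1) & PTR_MASK


def pop_post_dec(lut, ptr):
    """Read at ptr, then decrement (ALTDS mode 110 in the post)."""
    return lut[ptr], (ptr - 1) & PTR_MASK


def pop_pre_dec(lut, ptr):
    """Decrement, then read: the operation the post says is missing."""
    ptr = (ptr - 1) & PTR_MASK
    return lut[ptr], ptr


lut = [0] * 512
ptr = push_post_inc(lut, 0, 111)
ptr = push_post_inc(lut, ptr, 222)        # ptr now points at empty slot 2

naive, _ = pop_post_dec(lut, ptr)         # reads the empty slot, not 222

adjusted = (ptr - 1) & PTR_MASK           # the one-off 'sub adra,#1'
a, adjusted = pop_post_dec(lut, adjusted)
b, adjusted = pop_post_dec(lut, adjusted)

c, ptr2 = pop_pre_dec(lut, ptr)           # no adjustment needed
```

In this model the one-off subtract makes a run of pops return the values in the right order, but it leaves the pointer one slot below where the next push expects it, so mixing pushes and pops needs a matching add. A pre-decrementing pop avoids both adjustments.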

cgracey · 2015-10-08 05:06

ozpropdev wrote: »
Thinking about using the LUT as a stack I came up with this.
We already have the ALTDS instruction which can modify the following instruction in many ways.
From the ALTDS notes for the SSS field we have
 110 = use D's SSSSSSSSS field as the SSSSSSSSS field for the next instruction, decrement D's SSSSSSSSS field
 111 = use D's SSSSSSSSS field as the SSSSSSSSS field for the next instruction, increment D's SSSSSSSSS field
So if we had a couple of new assembler aliases like
	'pushl	myreg 
	'becomes
	altds	adra,#%000_000_111
	wrlut	myreg,#0-0

	'popl	myreg				
	'becomes
	altds	adra,#%000_000_110
	rdlut	myreg,#0-0
Simply initialize the ADRA reg with the lut stack base address.
Before "popping" from the LUT a simple sub adra,#1 adjusts for no pre decrement.

We almost have all the pieces we need to implement a stack except for pre-decrement operation. Chip?

Wouldn't these be even simpler, made into macros?

	'pop: pre-decrement the pointer, then read
	sub     lutptr,#1
	rdlut   data,lutptr

-and-

	'push: write, then post-increment the pointer
	wrlut   data,lutptr
	add     lutptr,#1

ozpropdev · 2015-10-08 05:26

Yikes! I was way too focused on the ALTDS thing rather than getting back to basics.
I feel better now.
