Question on hardware stack "PUSH/POP" data width

ozpropdev · 2015-10-02 05:56

Hi All
I've noticed that if I PUSH a value onto the hardware stack only 22 bits are stored.
Is this the "hard wired" data width or some other issue?
If it's hard wired that way then that needs to be documented in "the rules of play"

87654321  'pushed on stack
00254321  'pops this result

Test code

con
	sys_clk = 50_000_000
	baud_rate = 115_200
	rx_pin = 63
	tx_pin = 62

dat		orgh	1

launch		setq	#$1F7
		rdlong	cogregs,ptrb[(code-launch)>>2]
		jmp	#mycode

'============================================================

code
		org

cogregs		long	0[16]


mycode		setb	outb,#tx_pin	'set tx output high
		setb	dirb,#tx_pin

wait4start	testb	inb,#rx_pin wz	'press a key to start in PST
	if_nz	jmp	#wait4start


		mov	val,test_pattern	'show pattern
		call	#show_hex
		call	#newline

		push	test_pattern		'push/pop val, show result
		pop	val
		call	#show_hex
		call	#newline
	
here		jmp	@here

test_pattern	long	$87654321
bit_time	long	sys_clk / baud_rate
val		long	0
timer		long	0
dx		long	0
nibc		long	0

'============================================================		
newline		mov	dx,#13

send_byte	setb	dx,#8
		shl	dx,#1
		getcnt	timer
		rep	@endrep,#10
		testb 	dx,#0 wz
		setbnz	outb,#tx_pin
		addcnt	timer,bit_time
		waitcnt
		shr	dx,#1
endrep
		ret
'============================================================

show_hex	mov	nibc,#8		'8 x nibbles

show_hex2	mov	dx,val
		shr	dx,#28
		cmp	dx,#9 wz,wc
	if_a	add	dx,#"A"-10
	if_be	add	dx,#"0"
		call	@send_byte
		shl	val,#4
		djnz	nibc,@show_hex2
		ret

cgracey · 2015-10-02 08:40

It's 22 bits: {C, Z, PC[19:0]}

It's useful for CALL/RET. The PUSH/POP are there so that you can inspect and modify return addresses.
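As a rough illustration of the 22-bit layout above, here is a small Python model (not P2 code) that masks a 32-bit value down to what a stack entry would keep and splits it into fields. The bit positions are an assumption taken from reading {C, Z, PC[19:0]} as a Verilog-style concatenation (C in bit 21, Z in bit 20); the names `unpack_entry` and `pack_entry` are hypothetical helpers.

```python
STACK_MASK = 0x3FFFFF  # 22 bits: {C, Z, PC[19:0]}
PC_MASK = 0xFFFFF      # low 20 bits hold the return address


def unpack_entry(value):
    """Return (stored, c, z, pc) for a 32-bit value pushed onto a 22-bit stack.

    Assumes C sits in bit 21 and Z in bit 20 (an interpretation, not confirmed).
    """
    stored = value & STACK_MASK
    c = (stored >> 21) & 1
    z = (stored >> 20) & 1
    pc = stored & PC_MASK
    return stored, c, z, pc


def pack_entry(c, z, pc):
    """Build a 22-bit entry from flag bits and a 20-bit address."""
    return ((c & 1) << 21) | ((z & 1) << 20) | (pc & PC_MASK)


stored, c, z, pc = unpack_entry(0x87654321)
print(f"{stored:08X} C={c} Z={z} PC={pc:05X}")  # 00254321 C=1 Z=0 PC=54321
```

Under that assumed layout, the $00254321 seen in the test is just the low 22 bits of $87654321.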

Rayman · 2015-10-02 12:53

So CALL and RET use a hardware stack now instead of modifying the destination's RET instruction, right?

I guess this means that Z and C flags don't come back to the caller.
But it sounds like a subroutine could pop, change the flag settings, then push before returning, so that a flag is set on return to the caller. Is this the idea (along with changing where it returns to)?

Seairth · 2015-10-02 13:08

You don't even have to go through that much effort:

func    pop adra
        ... ' any code that sets C/Z
        jmp adra

Bean · 2015-10-02 13:19

Any chance that the stack could be made 32 bits ?
It might be useful to pass parameters to subroutines.

Bean

ozpropdev · 2015-10-02 13:34

The PUSHA/PUSHB/POPA and POPB are 32 bit stack operations in HUB.

Roy Eltham · 2015-10-02 16:40

Bean,
The hardware stack is only 8 or 16 deep; it's just meant for return addresses and flags. It's implemented inside the cog as hidden registers, so it doesn't take up any actual memory.

jmg · 2015-10-02 20:00

ozpropdev wrote: »

The PUSHA/PUSHB/POPA and POPB are 32 bit stack operations in HUB.

So that means the stack is physically 32b wide ?

Seems quirky/confusing to have a variance in POP/PUSH by context ?
(even on a small stack)

Is fixing that a zero cost fix ? (if Stack is already 32b seems so ?)

Seairth · 2015-10-02 20:14

jmg wrote: »

ozpropdev wrote: »

The PUSHA/PUSHB/POPA and POPB are 32 bit stack operations in HUB.

So that means the stack is physically 32b wide ?

Seems quirky/confusing to have a variance in POP/PUSH by context ?
(even on a small stack)

Is fixing that a zero cost fix ? (if Stack is already 32b seems so ?)

If I recall correctly, those instructions are using PTRA/B, hence the A/B suffix.

Maybe PUSH/POP could be renamed to PUSHR/POPR to indicate the internal return stack.

jmg · 2015-10-02 20:22

Seairth wrote: »

Maybe PUSH/POP could be renamed to PUSHR/POPR to indicate the internal return stack.

or even PUSH22/POP22 ?

cgracey · 2015-10-03 10:38

Those hardware stacks are 8 levels deep. We could expand them from 22 bits wide to 32 at a cost of 80 flops per cog. Then, they could be used for data, as well. Should we do it?
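The 80-flop figure is consistent with simple arithmetic, assuming one flip-flop per stored bit per stack level (that per-bit cost is an assumption; the post only gives the total). A tiny Python check, with `extra_flops` as a hypothetical helper name:

```python
def extra_flops(depth, old_width, new_width):
    """Extra storage bits needed to widen every level of the stack."""
    return depth * (new_width - old_width)


# 8 levels, widened from 22 to 32 bits: 8 * 10 extra bits per cog
print(extra_flops(8, 22, 32))  # 80
```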

evanh · 2015-10-03 10:45

LUT RAM maybe?

evanh · 2015-10-03 10:46

LUTExec only uses every second access.

Seairth · 2015-10-03 10:53

cgracey wrote: »

Those hardware stacks are 8 levels deep. We could expand them from 22 bits wide to 32 at a cost of 80 flops per cog. Then, they could be used for data, as well. Should we do it?

I wouldn't, at least for now. At only 8 deep, you'll get into trouble real quick if you're also pushing/popping data.

Maybe rename them, though. I agree that PUSH/POP may be a bit too generic (implying generic 32-bit width).

Seairth · 2015-10-03 10:54

evanh wrote: »

LUT RAM maybe?

Don't you mean "aux" ram?

evanh · 2015-10-03 10:58

Yeah, it's already looking to be used more for code than any sorts of data simply because the code fetches are free.

Seairth · 2015-10-04 19:33

I've been pondering the whole push/pop data thing while waiting for the next FPGA drop. The first thing that came to mind was:

Add PTRA/PTRB behavior to RDLUT/WRLUT. However, it might make more sense to use ADRA/ADRB, but the functionality would be basically the same. This would provide an easy means to maintain stacks in the LUT.

(I suspect you Forth programmers would really like this change.)

From there, you could also add a LUT version of CALLA/B and RETA/B.

note: with those two changes, it may also diminish the need for the hidden 8-deep stack. Just a thought...

jmg · 2015-10-04 19:59

cgracey wrote: »

Those hardware stacks are 8 levels deep. We could expand them from 22 bits wide to 32 at a cost of 80 flops per cog. Then, they could be used for data, as well. Should we do it?

hmm. 32 bits is more consistent and expected and then PUSH/POP work like any other MCU, and do not need to be renamed.

Can users read the stack pointer, and what happens on stack overflow/underflow ?

Can debug read the stack contents - eg by POP 8x should leave SP back where it was, and extract the whole stack ?

evanh · 2015-10-04 20:31

Seairth wrote: »

note: with those two changes, it may also diminish the need for the hidden 8-deep stack. Just a thought...

That's certainly what was on my mind. That and I feel even more address registers would be good. And since the memory addressing has been flattened somewhat there's a natural tendency to want to use the same indirection capabilities everywhere.

Seairth · 2015-10-04 20:52

One down side to adding PTRx-like capability is that you limit immediate references to only half of the LUT registers ($00-$FF). This could probably be overcome by using one of the extra opcodes, leaving the original RDLUT/WRLUT alone:

cccc 1011011 CZ0 DDDDDDDDD SSSSSSSSS : RDLUT D, ADRx
cccc 1011011 CZ1 DDDDDDDDD SSSSSSSSS : WRLUT D, ADRx

Incidentally, this would return one bit to the index field, allowing -32 to +31. Unfortunately, that wouldn't quite be consistent with the PTRx formatting. If there were one more spare opcode, the memory hubops could have been re-arranged in the same way, though.

cgracey · 2015-10-04 21:16

It would be great to get rid of the hardware stack, but if we used the lut for that, we would cause glitches in the streamer, assuming we gave priority to CALL/RET.

jmg · 2015-10-04 21:32

cgracey wrote: »

It would be great to get rid of the hardware stack, but if we used the lut for that, we would cause glitches in the streamer, assuming we gave priority to CALL/RET.

Would a more conventional stack be able to map into either COG or LUT ?

Rayman · 2015-10-04 21:41

cgracey wrote: »

Those hardware stacks are 8 levels deep. We could expand them from 22 bits wide to 32 at a cost of 80 flops per cog. Then, they could be used for data, as well. Should we do it?

This sounds like something that could be changed later, once the more fundamental things are worked out...

Seairth · 2015-10-04 21:44

jmg wrote: »

cgracey wrote: »

It would be great to get rid of the hardware stack, but if we used the lut for that, we would cause glitches in the streamer, assuming we gave priority to CALL/RET.

Would a more conventional stack be able to map into either COG or LUT ?

I agree, but figured there wasn't an available slot in the pipeline for that. But! If there *were*, then I'd suggest adding the two instructions (maybe call them something else... RDADR/WRADR maybe) and use the MSB of the S field to select cog or LUT. Then the rest of the field would still be identical to PTRx usage.

Seairth · 2015-10-04 21:44

cgracey wrote: »

It would be great to get rid of the hardware stack, but if we used the lut for that, we would cause glitches in the streamer, assuming we gave priority to CALL/RET.

Glitch how? Just curious...

Baggers · 2015-10-04 21:58

Seairth, like the stack overwriting the streamer's lut buffer.

cgracey · 2015-10-04 22:09

Baggers wrote: »

Seairth, like the stack overwriting the streamer's lut buffer.

...or a streamer read being usurped by a CALL/RET, causing a bad pixel to be output.

jmg · 2015-10-04 22:22

cgracey wrote: »

Baggers wrote: »

Seairth, like the stack overwriting the streamer's lut buffer.

...or a streamer read being usurped by a CALL/RET, causing a bad pixel to be output.

That sounds like a bad outcome, so such a stack would need to be able to be placed elsewhere too....(in COG)

Unless that is easy to do, sounds best to keep the present design ?

My question above was can a debug routine POP the HW stack 8 times, and 'end up back where it was' and so make the Stack visible during debug ?

cgracey · 2015-10-04 22:25

Jmg, the hardware stack can be fully popped and pushed back, so that you can inspect and restore it.

I think, too, that the current stack design is adequate.
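To make the pop-everything-then-push-back idea concrete, here is a minimal Python model of it (again, not P2 code). It assumes the stack behaves like a fixed 8-entry, 22-bit-wide shift structure with zeros shifted in on a pop; the real underflow/overflow behavior is not stated in this thread. `HardwareStackModel` and `inspect_and_restore` are hypothetical names.

```python
class HardwareStackModel:
    """Toy model: 8 levels, 22 bits wide, index 0 is the top of stack."""

    DEPTH = 8
    MASK = 0x3FFFFF

    def __init__(self):
        self.levels = [0] * self.DEPTH

    def push(self, value):
        # New entry goes on top; the bottom entry falls off (assumed).
        self.levels = [value & self.MASK] + self.levels[:-1]

    def pop(self):
        # Top entry comes off; a zero is shifted in at the bottom (assumed).
        top = self.levels[0]
        self.levels = self.levels[1:] + [0]
        return top


def inspect_and_restore(stack):
    """Pop all levels (top first), then push them back in reverse order."""
    snapshot = [stack.pop() for _ in range(stack.DEPTH)]
    for entry in reversed(snapshot):
        stack.push(entry)
    return snapshot


stack = HardwareStackModel()
for addr in (0x100, 0x200, 0x300):
    stack.push(addr)

before = list(stack.levels)
snapshot = inspect_and_restore(stack)
print([f"{entry:06X}" for entry in snapshot])
print(stack.levels == before)
```

The key point is the order: entries come off top-first, so they have to go back bottom-first for the stack to end up where it started.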

Yanomani · 2015-10-04 22:30

A possible solution to the problem of streamer interference could be doubling the last 64 longs (or even 128) of LUT to enable that space to be dual ported.
Then you can keep streamer activity in the low half (0 thru FF) and simultaneously have stack (or plain RDLUT/WRLUT) activity in the top half, in parallel.

My 0.02

Henrique

jmg · 2015-10-04 22:32

cgracey wrote: »

Jmg, the hardware stack can be fully popped and pushed back, so that you can inspect and restore it.

Cool, I'm sure novice users will get lost with the stack...as they do on all MCUs as they JMP out of a routine when they should RET...
Stack inspect will be important for Data use too...
(and I can see COP watchdog uses as well.. )

cgracey wrote: »

I think, too, that the current stack design is adequate.

Do you mean with the nudge to 32b wide ?
