Question on hardware stack "PUSH/POP" data width

evanh · 2015-10-04 23:28

cgracey wrote: »

It would be great to get rid of the hardware stack, but if we used the lut for that, we would cause glitches in the streamer, assuming we gave priority to CALL/RET.

Streamer definitely has priority. 99% of streaming activities will not be using more than 50% LUT bandwidth so PUSH/CALL can happily co-exist with the occasional extra clock inserted on some stack accesses. Jitter will still be small and any software that is that timing sensitive with the streamer active will have to be careful is all. Streaming isn't going to be that common an activity anyway.

The bonus of having all of LUTRAM available for super fast stacking will make a lot of people happy.

jmg · 2015-10-04 23:48

A proper stack would be nice, but streaming is also important, and I'm not sure a COG that is streaming also needs deep stacking.

Of more concern are the cases where full streaming speed is needed (and 100% will be used, to reduce total system power) not working at all with the Stack.
That means the stack needs an option....

LUT used to be 256, and was bumped to 512 to match LUT-EXEC
- can LUT-EXEC co-operate with streaming ? or does the same problem apply ?
- or can LUT have (optional) two planes, so 256 can stream and 256 is LUT Code and/or Stack ?

evanh · 2015-10-05 00:09

jmg wrote: »

Of more concern are the cases where full streaming speed is needed (and 100% will be used, to reduce total system power) not working at all with the Stack.

If a Cog is being pushed that hard already there isn't much need for any stack at that stage.

Seairth · 2015-10-05 00:09

cgracey wrote: »

Baggers wrote: »

Seairth, like the stack overwriting the streamer's lut buffer.

...or a streamer read being usurped by a CALL/RET, causing a bad pixel to be output.

Oh! It didn't occur to me that people would try to use the LUT as general purpose memory at the same time they were using it as a LUT or stream buffer. That seems like an all around bad idea.

Anyhow, you'd still have hub stacks with PTRx when LUT was otherwise engaged. From my perspective, when you aren't using the LUT as a LUT or stream buffer, this gives the LUT (AUX???) more use cases.

jmg · 2015-10-05 00:16

evanh wrote: »

jmg wrote: »

Of more concern are the cases where full streaming speed is needed (and 100% will be used, to reduce total system power) not working at all with the Stack.

If a Cog is being pushed that hard already there isn't much need for any stack at that stage.

That was my point, but there could still be some CALL/RET stack need, and the present HW stack covers that.
Telling users they have zero CALL/RET stack in some use cases is very constraining.
Best to keep the HW stack.

Electrodude · 2015-10-05 00:30

Whatever happened to JMPRET? Sure, it might not work with hubexec, but there's no reason it isn't useful anymore in cogexec. It would be nice to have, especially for streamer code, which is already running in cog mode anyway.

evanh · 2015-10-05 00:46

jmg wrote: »

That was my point, but there could still be some CALL/RET stack need, and the present HW stack covers that.

My point is there is no reason to expect that much and still expect normal behaviour. You are talking about an extreme situation that requires careful crafting to achieve anything more. It won't just be stack limitations you are bumping into, the Hub will be inaccessible as well.

potatohead · 2015-10-05 00:49

I think we should keep the current stack, and allow for the LUT storage and stream case. People may attempt it in extreme scenarios and we saw similar extremes put to good use on P1.

The stack exists for call return and it's not a blocking path. This seems well worth it.

evanh · 2015-10-05 01:58

Either way, with or without a hardware stack, the Cog in question can only cycle within itself when you're down to 100% bandwidth utilisation. Only a crafted bit of code will do anyway.

evanh · 2015-10-05 02:29

I just don't see such a small stack really being used either. Why was it added in the first place?

Roy Eltham · 2015-10-05 02:36

evanh, 8 deep is pretty far for a function chain excluding recursion. The hardware stack is just for return address and flags, not any data.

jmg · 2015-10-05 02:36

evanh wrote: »

I just don't see such a small stack really being used either. Why was it added in the first place?

I think the interrupts pushed this. There are other MCUs with small stacks, and P1 has no HW stack at all, so relative to a P1 this is huge.

evanh · 2015-10-05 02:45

The Prop2 has a proper main memory based stack (and LINK instruction) for HLL use. The oddball hardware stack will never be used for that.

Interrupts are interrupts, they can be fit to whatever.

potatohead · 2015-10-05 02:50

The stack will see PASM use. And having 8 levels is great! Given the size of COG code, it's a good amount of stack, and it will allow for denser programs too.

I'm a fan, and definitely think it's worth it.

evanh · 2015-10-05 03:00

Roy Eltham wrote: »

evanh, 8 deep is pretty far for a function chain excluding recursion. The hardware stack is just for return address and flags, not any data.

Hmm ... except, of course, there are now calls to allow it to handle generic data too. Before long it'll be deeper ...

evanh · 2015-10-05 03:02

I'll keep quiet if it's left as is.

Cluso99 · 2015-10-05 03:16

IMHO there is quite a bit of simplification available here.

But for now, how about we just live with this part that Chip has given us, and try it out.

If the LUT usage is defined as STREAMER or LUT-EXEC then everything should be fine.
Then, the combined usage can be discussed in a separate article on how to do it, with caveats and all.

jmg · 2015-10-05 04:20

Cluso99 wrote: »

But for now, how about we just live with this part that Chip has given us, and try it out.

Yes, I think the HW Stack is fine, with a 32b wide modifier to allow limited COG-local ASM Data use.
I agree it is less likely to be coded into any C Compiler back end code generator, but there will be a lot of PASM code crafted for COGs.

evanh · 2015-10-05 06:02

jmg wrote: »

... with 32b wide modifier to allow limited COG-local ASM Data use.

Bad idea, and will always be a bad idea.

jmg · 2015-10-05 07:06

evanh wrote: »

jmg wrote: »

... with 32b wide modifier to allow limited COG-local ASM Data use.

Bad idea, and will always be a bad idea.

hmm .. just a claim, no facts ?

I can equally well state that 22b wide is "Bad idea, and will always be a bad idea", only I can add facts too:
A push width of just 68.75% of the data width (22 of 32 bits) will be considered 'quite strange' by many users.

Can you point to any other PUSH and POP opcodes that cannot push DATA?
Fact: There are a LOT of examples out there that CAN push/pop Data, as well as Address.
Fact: 22b contradicts the principle of least surprise.

evanh · 2015-10-05 07:27

jmg wrote: »

hmm .. just a claim, no facts ?

Everything's been said already.

potatohead · 2015-10-05 14:09

The P2 includes an 8 level CALL and RETURN stack, not intended for general data use.

Not strange at all.

Bean · 2015-10-05 14:37

This is just my opinion, but many subroutines need only a single parameter. It would be nice to be able to put it on the stack.

I would think, for the sake of 80 flops, why not make them 32 bits wide so they are "standard", or what would be expected? I think the P2 is going to have some quirks as it is, so why add another one?

Bean
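
The width figures quoted in the thread can be checked with a little arithmetic. This is a minimal sketch that assumes only the numbers stated above (8 levels, 22 bits as proposed, 32 bits as requested); the variable names are illustrative and do not come from any P2 documentation.

```python
# Figures taken from the thread; the names are illustrative only.
LEVELS = 8          # depth of the hardware stack
CURRENT_BITS = 22   # proposed stack width (return address and flags)
FULL_BITS = 32      # width of a cog register (a long)

# Fraction of a 32-bit value that a 22-bit push can hold.
coverage = CURRENT_BITS / FULL_BITS

# Extra flip-flops needed to widen every level from 22 to 32 bits.
extra_flops = LEVELS * (FULL_BITS - CURRENT_BITS)

print(f"{coverage:.2%} of a long, {extra_flops} extra flops")
```

Both numbers match the ones used in the posts: 68.75% and 80 flops.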

Seairth · 2015-10-05 14:53

Bean wrote: »

This is just my opinion, but many subroutines need only a single parameter. It would be nice to be able to put it on the stack.

I would think, for the sake of 80 flops, why not make them 32 bits wide so they are "standard", or what would be expected? I think the P2 is going to have some quirks as it is, so why add another one?

Bean

Here's the problem, though. To use this, you will want to push the data *after* the return address. But that makes CALL/RET impossible to use. On the other hand, you could push the data before the return address, but this would then require every subroutine to do some POP/PUSH juggling.

Note also that there is little value in pushing data onto the stack without also having instructions that can access positions in the stack. Otherwise, every subroutine must pop the data to access it. At which point, you haven't gained anything over just leaving the data in cog registers. In other words, you're not really using the stack as a stack (at least for the data).
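
The ordering problem described above can be sketched with a toy model. This assumes the hardware stack behaves as a plain LIFO and that CALL pushes the return address and jumps in a single step; the function names and the operation counts are purely illustrative, not measurements of real hardware.

```python
def param_on_stack(param, return_addr):
    """Parameter pushed before CALL, so the return address ends up on top."""
    stack, ops = [], 0
    stack.append(param); ops += 1         # caller: PUSH param
    stack.append(return_addr); ops += 1   # caller: CALL pushes return address
    # Callee: the return address is in the way and must be juggled.
    saved = stack.pop(); ops += 1         # POP return address
    x = stack.pop(); ops += 1             # POP parameter
    stack.append(saved); ops += 1         # PUSH return address back
    target = stack.pop(); ops += 1        # RET pops the return address
    return x, target, ops


def param_in_register(param, return_addr):
    """Parameter left in a cog register; the stack only sees CALL and RET."""
    stack, ops = [], 0
    x = param                             # caller: parameter already in a register
    stack.append(return_addr); ops += 1   # caller: CALL pushes return address
    target = stack.pop(); ops += 1        # RET pops the return address
    return x, target, ops


print(param_on_stack(42, 0x100))      # six stack operations
print(param_in_register(42, 0x100))   # two stack operations
```

In this model the stack-passed parameter costs four extra stack operations per call compared with leaving it in a register, which is the juggling referred to above.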

Electrodude · 2015-10-05 15:05

If you add a "pushloc @label" instruction that pushes the return address, then you'll be able to use the stack for subroutine parameters.

                        pushloc @:ret                  ' push return address (problem: annoying label)
                        push    param                  ' push parameter
                        jmp     @subroutine            ' jump to subroutine
:ret                    ...                            ' continue

subroutine
                        pop     x                      ' pop parameter
                        ...
                        ret                            ' return to parent routine
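
The sequence above can be traced with a toy LIFO model to see why a plain RET works after a single POP. This is a sketch only: `pushloc` is the hypothetical instruction proposed in this post, and the Python names are illustrative.

```python
def pushloc_sequence(param, ret_label):
    """Trace the proposed caller/callee sequence on a plain LIFO stack."""
    stack = []
    stack.append(ret_label)   # pushloc @:ret  (return address goes in first)
    stack.append(param)       # push param     (parameter ends up on top)
    # jmp @subroutine         (no stack effect)
    x = stack.pop()           # pop x          (callee takes its parameter)
    target = stack.pop()      # ret            (return address is now on top)
    return x, target, stack


x, target, leftover = pushloc_sequence(7, "ret_label")
print(x, target, leftover)
```

Because the return address is pushed first, the parameter sits above it and the callee needs no POP/PUSH juggling; the cost is the extra label and instruction on the caller's side.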

Rayman · 2015-10-05 16:08

No pressure, but I hope this all gets finalized and we get some new images soon...

cgracey · 2015-10-05 17:39

Rayman wrote: »

No pressure, but I hope this all gets finalized and we get some new images soon...

I'm on it. I'm moving the special registers back to the top of cog memory and adding a cog-load option to COGINIT.

jmg · 2015-10-05 18:46

cgracey wrote: »

I'm on it. I'm moving the special registers back to the top of cog memory and adding a cog-load option to COGINIT.

Can you flip LUT:COG in PC _CODE space ? (Special registers & interrupts still move to top of cog memory)
See my post in other thread

http://forums.parallax.com/discussion/comment/1347591/#Comment_1347591

potatohead · 2015-10-05 21:36

Please don't do this. Cog at 0 makes a lot of sense.

jmg · 2015-10-05 21:50

potatohead wrote: »

Please don't do this. Cog at 0 makes a lot of sense.

In what way does it make a lot of sense ?
PC is arbitrary and LUT : COG order is also arbitrary. Just the 10th bit changes.
Data wise, each is its own 512 space, so this has no effect on data.

I think the new JMPREL opcode can index across LUT_COG, and imposing an unnatural break in the middle of that jump range is what makes no sense to me.
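
The "10th bit" remark can be illustrated with a small sketch. It assumes a 1024-long PC space made of two 512-long halves, with COG at 0 as the current order (as stated in this exchange); the constant and function names are illustrative only.

```python
HALF = 0x200   # 512 longs per half; this is bit 9, i.e. the 10th bit


def flip_lut_cog(pc):
    """Map a PC address to its position if the COG and LUT halves were swapped."""
    return pc ^ HALF


print(hex(flip_lut_cog(0x000)))   # first COG address moves to the upper half
print(hex(flip_lut_cog(0x3FF)))   # last LUT address moves to the lower half
```

Only bit 9 of the address differs between the two orderings; the offset within each 512-long half is unchanged.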
