It would be great to get rid of the hardware stack, but if we used the LUT for that, we would cause glitches in the streamer, assuming we gave priority to CALL/RET.
Streamer definitely has priority. 99% of streaming activities will not be using more than 50% LUT bandwidth so PUSH/CALL can happily co-exist with the occasional extra clock inserted on some stack accesses. Jitter will still be small and any software that is that timing sensitive with the streamer active will have to be careful is all. Streaming isn't going to be that common an activity anyway.
The bonus of having all of LUTRAM available for super fast stacking will make a lot of people happy.
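To make the "occasional extra clock" idea concrete, here is a rough Python sketch of a streamer-first arbiter. It is only a toy model: it assumes the LUT serves one access per clock and that the streamer claims a fixed slice of each period, and the function name and parameters are made up for illustration, not taken from the actual design.

```python
def extra_clocks(request_clock, streamer_period, streamer_busy, limit=1000):
    """Clocks a stack access waits before the LUT is free.

    Hypothetical model: the streamer owns the LUT on the first
    `streamer_busy` clocks of every `streamer_period` clocks and always wins.
    Returns None if no free slot appears within `limit` clocks.
    """
    for wait in range(limit):
        if (request_clock + wait) % streamer_period >= streamer_busy:
            return wait
    return None


# 50% utilisation: the streamer takes every other clock, so a stack
# access either goes straight through or waits a single clock.
half = [extra_clocks(t, 2, 1) for t in range(4)]

# 100% utilisation: the stack access never gets a slot in this model.
full = extra_clocks(0, 1, 1)
```

In this model the 50% case costs at most one clock per stack access, while the 100% case starves the stack entirely, which is the scenario raised in the replies.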
A proper stack would be nice, but streaming is also important, and I'm not sure a COG that is streaming also needs deep stacking.
Of more concern are the cases where full streaming speed is needed (and 100% will be used, to reduce total system power): those would not work at all with the stack.
That means the stack needs an option....
LUT used to be 256, and was bumped to 512 to match LUT-EXEC.
- Can LUT-EXEC co-operate with streaming, or does the same problem apply?
- Or can LUT have (optional) two planes, so 256 can stream and 256 is LUT code and/or stack?
Of more concern are the cases where full streaming speed is needed (and 100% will be used, to reduce total system power): those would not work at all with the stack.
If a Cog is being pushed that hard already there isn't much need for any stack at that stage.
Seairth, like the stack overwriting the streamer's LUT buffer.
...or a streamer read being usurped by a CALL/RET, causing a bad pixel to be output.
Oh! It didn't occur to me that people would try to use the LUT as general purpose memory at the same time they were using it as a LUT or stream buffer. That seems like an all around bad idea.
Anyhow, you'd still have hub stacks with PTRx when LUT was otherwise engaged. From my perspective, when you aren't using the LUT as a LUT or stream buffer, this gives the LUT (AUX???) more use cases.
Of more concern are the cases where full streaming speed is needed (and 100% will be used, to reduce total system power): those would not work at all with the stack.
If a Cog is being pushed that hard already there isn't much need for any stack at that stage.
That was my point, but there could still be some CALL/RET stack need, and the present HW stack covers that.
Telling users they have zero CALL/RET stack in some use cases is very constraining.
Best to keep the HW stack.
Whatever happened to JMPRET? Sure, it might not work with hubexec, but there's no reason it isn't useful anymore in cogexec. It would be nice to have, especially for streamer code, which is already running in cog mode anyway.
That was my point, but there could still be some CALL/RET stack need, and the present HW stack covers that.
My point is there is no reason to expect that much and still expect normal behaviour. You are talking about an extreme situation that requires careful crafting to achieve anything more. It won't just be stack limitations you are bumping into; the Hub will be inaccessible as well.
I think we should keep the current stack, and allow for the LUT storage and stream case. People may attempt it in extreme scenarios and we saw similar extremes put to good use on P1.
The stack exists for call return and it's not a blocking path. This seems well worth it.
Either way, with or without a hardware stack, the Cog in question can only cycle within itself when you're down to 100% bandwidth utilisation. Only a crafted bit of code will do anyway.
The stack will see PASM use. And having 8 levels is great! Given the size of COG code, it's a good amount of stack, and it will allow for denser programs too.
IMHO there is quite a bit of simplification available here.
But for now, how about we just live with this part that Chip has given us, and try it out.
If the LUT usage is defined as STREAMER or LUT-EXEC then everything should be fine.
Then, the combined usage can be discussed in a separate article on how to do it, with caveats and all.
But for now, how about we just live with this part that Chip has given us, and try it out.
Yes, I think the HW Stack is fine, with 32b wide modifier to allow limited COG-local ASM Data use.
I agree it is less likely to be coded into any C compiler back-end code generator, but there will be a lot of PASM code crafted for COGs.
... with 32b wide modifier to allow limited COG-local ASM Data use.
Bad idea, and will always be a bad idea.
Hmm... just a claim, no facts?
I can equally well state that 22b wide is "Bad idea, and will always be a bad idea.", only I can add facts too:
A PUSH width of just 68.75% of the data width will be considered 'quite strange' by many users.
Can you point to any other PUSH and POP opcodes that cannot push DATA?
Fact: There are a LOT of examples out there that CAN push/pop data, as well as addresses.
Fact: 22b contradicts the principle of least surprise.
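For anyone checking the 68.75% figure, it is simply 22 bits out of a 32-bit long. A quick Python check follows; the `through_stack` helper is hypothetical and assumes that bits above the stack width would simply be dropped, which nobody in the thread has confirmed.

```python
STACK_BITS = 22   # stack width discussed in the thread
DATA_BITS = 32    # width of a cog long

# Share of a 32-bit long that fits through a 22-bit-wide stack entry.
fraction_percent = STACK_BITS / DATA_BITS * 100


def through_stack(value):
    """Assumption: bits above the stack width are simply dropped."""
    return value & ((1 << STACK_BITS) - 1)


# A full 32-bit value does not survive a push/pop round trip in this model.
survives = through_stack(0xDEADBEEF) == 0xDEADBEEF
```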
This is just my opinion, but many subroutines need only a single parameter. It would be nice to be able to put it on the stack.
I would think for the 80-flops why not make them 32-bits wide so they are "standard" or what would be expected. I think the P2 is going to have some quirks as it is, why add another one?
Bean
Here's the problem, though. To use this, you will want to push the data *after* the return address. But that makes CALL/RET impossible to use. On the other hand, you could push the data before the return address, but this would then require every subroutine to do some POP/PUSH juggling.
Note also that there is little value in pushing data onto the stack without also having instructions that can access positions in the stack. Otherwise, every subroutine must pop the data to access it. At which point, you haven't gained anything over just leaving the data in cog registers. In other words, you're not really using the stack as a stack (at least for the data).
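The POP/PUSH juggling can be sketched with a small Python model of the stack. This is only an illustration: the `HwStack` class and helper names are invented here, the depth of 8 is the figure mentioned earlier in the thread, and the overflow handling is a choice of this model rather than anything known about the hardware.

```python
class HwStack:
    """Hypothetical model of a small hardware LIFO (depth 8, per the thread)."""

    def __init__(self, depth=8):
        self.depth = depth
        self.items = []

    def push(self, value):
        if len(self.items) == self.depth:
            # Model choice only: refuse rather than guess what hardware does.
            raise OverflowError("stack model is full")
        self.items.append(value)

    def pop(self):
        return self.items.pop()


def call_with_param(stack, param, return_addr):
    """Caller pushes data first, then CALL pushes the return address on top."""
    stack.push(param)
    stack.push(return_addr)


def subroutine_fetch_param(stack):
    """The juggling: lift the return address to reach the parameter under it."""
    return_addr = stack.pop()
    param = stack.pop()
    stack.push(return_addr)   # put it back so RET still works
    return param


stack = HwStack()
call_with_param(stack, param=0x1234, return_addr=0x040)
param = subroutine_fetch_param(stack)
resume_at = stack.pop()       # what RET would do
```

Every subroutine that takes a stacked parameter pays those three extra operations, and the parameter ends up in a register anyway, which is the point made above.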
Please don't do this. Cog at 0 makes a lot of sense.
In what way does it make a lot of sense?
PC is arbitrary and LUT:COG order is also arbitrary. Just the 10th bit changes.
Data-wise, each is its own 512 space, so this has no effect on data.
I think the new JMPREL opcode can index across LUT_COG, and imposing an unnatural break in the middle of that jump range is what makes no sense to me.
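As a sketch of what "just the 10th bit changes" means, the Python below splits a 10-bit PC into a region and an offset under either ordering. The function name and the `lut_first` flag are made up for illustration, and it assumes bit 9 is the only thing that selects between the two 512-long spaces.

```python
def decode_pc(pc, lut_first):
    """Split a 10-bit PC into (region, offset) under either ordering."""
    if not 0 <= pc < 1024:
        raise ValueError("PC must fit in 10 bits")
    upper_half = bool(pc & 0x200)   # the 10th bit (bit 9)
    offset = pc & 0x1FF             # position within the 512-long region
    if lut_first:
        region = "COG" if upper_half else "LUT"
    else:
        region = "LUT" if upper_half else "COG"
    return region, offset


# Same PC value, opposite region, identical offset.
cog_at_zero = decode_pc(0x200, lut_first=False)
lut_at_zero = decode_pc(0x200, lut_first=True)
```

Either way the offset within each 512 space is unchanged, so only code addresses are affected by the ordering, not data.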
Comments
Interrupts are interrupts, they can be fit to whatever.
I'm a fan, and definitely think it's worth it.
Hmm ... except, of course, there are now calls to allow it to handle generic data too. Before long it'll be deeper ...
Everything's been said already.
Not strange at all.
I'm on it. I'm moving the special registers back to the top of cog memory and adding a cog-load option to COGINIT.
Can you flip LUT:COG in PC _CODE space? (Special registers & interrupts still move to top of cog memory)
See my post in other thread
http://forums.parallax.com/discussion/comment/1347591/#Comment_1347591