Chip,
Awesome! That is an excellent way to handle it! Glad I said that, even though it was worded oddly and wasn't the actual way. It made you think of one!
It's nice to have at least some form of indirection available!
Chip,
I thought you were implementing hubexec as an extremely simple model. This is what I thought was happening...
The JMP/CALL/RET (4 level lifo, PTRA, PTRB) instructions can mix cog and hub execution. PUSH & POP versions supported.
The instruction cache is 4 longs (hence 128 bit wide).
There would be an address tag holding what quad hub address is stored in the Icache.
When a new instruction fetch was required, we'd wait for the hub and read the quad into the Icache.
Then up to 4 instructions could execute from this Icache. Unless there was a loop within this quad, a new hub fetch would be required.
Self-modifying hubexec instructions would not be possible. We have the same problem with LMM anyway.
The LOCxxxx instructions let us load embedded constants (17 bits) into registers. BTW, I have suggested you make it 18 bits for future 1MB expansion.
So, presuming sequential hubexec instructions, the speed would be 1/2 that of cog mode (4 instructions per hub access). Non-sequential or hub data access will slow this down.
If you can do this then this would be fantastic.
We can get up to other tricks to improve speed, but this would permit the user to run from hub simply, and so the benefit is that we are no longer tied to the 2KB cog.
Chip,
Awesome! That is an excellent way to handle it! Glad I said that, even though it was worded oddly and wasn't the actual way. It made you think of one!
I'm really glad we've got a solution now. What a relief! In some ways, this is better than INDA/INDB.
Chip,
I thought you were implementing hubexec as an extremely simple model. This is what I thought was happening...
The JMP/CALL/RET (4 level lifo, PTRA, PTRB) instructions can mix cog and hub execution. PUSH & POP versions supported.
The instruction cache is 4 longs (hence 128 bit wide).
There would be an address tag holding what quad hub address is stored in the Icache.
When a new instruction fetch was required, we'd wait for the hub and read the quad into the Icache.
Then up to 4 instructions could execute from this Icache. Unless there was a loop within this quad, a new hub fetch would be required.
Self-modifying hubexec instructions would not be possible. We have the same problem with LMM anyway.
The LOCxxxx instructions let us load embedded constants (17 bits) into registers. BTW, I have suggested you make it 18 bits for future 1MB expansion.
So, presuming sequential hubexec instructions, the speed would be 1/2 that of cog mode (4 instructions per hub access). Non-sequential or hub data access will slow this down.
If you can do this then this would be fantastic.
We can get up to other tricks to improve speed, but this would permit the user to run from hub simply, and so the benefit is that we are no longer tied to the 2KB cog.
That all still stands. It just seemed to me that without INDA/INDB, it wouldn't be that practical. Kind of myopic on my part.
Anyway, now we've got a solution to indirection: ALTD/ALTS/ALTDS
We already essentially have indirection for reading and writing hub memory with rdxxxx/wrxxxx since they use a register as the address to read/write.
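For example, a minimal sketch of that hub-side indirection in P1-style syntax (val and hubptr are illustrative register names):

        RDLONG  val, hubptr        ' read the long at the hub address held in hubptr
        ADD     hubptr, #4         ' advance the pointer to the next long
        WRLONG  val, hubptr        ' write it to the next hub long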
Now we have a solution for indirection in cog memory accesses.
The problems arise from hub exec needing INDA/INDB-type functionality to overcome the impracticality of self-modifying code in hub memory.
Wait up. The Prop I does LMM. It has no such demands. The LMM is just a loop to fetch and execute instructions, with some other kernel support functionality.
What we need is for the PII to support doing what that 4 instruction LMM does, as fast as possible.
A hardware LMM loop with software support would do the trick.
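For reference, that 4-instruction LMM loop on the P1 is essentially the following (label and register names are illustrative):

lmm_loop      RDLONG  lmm_instr, lmm_pc   ' fetch the next instruction long from hub
              ADD     lmm_pc, #4          ' advance the hub program counter
lmm_instr     NOP                         ' the fetched instruction executes in this slot
              JMP     #lmm_loop           ' fetch the next one
lmm_pc        LONG    0                   ' hub byte address of the LMM code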
Is hub exec worth slowing the cog down for?
My gut tells me "no way". When you need the speed you need the speed.
For example, the Fast Fourier Transform fits in and runs from a single COG. Such things as FFT are never fast enough.
Or think about the fact that even when executing from HUB, propgcc can cache tight code loops into COG for maximum speed. In fact it does this so well that the C version of the FFT runs nearly as fast as the hand-written PASM version.
Taking 30% off of all of this sounds really bad.
But do they both have to be in hub? That is my question - could one be in cog?
Remember, every CALLA/CALLB/RETA/RETB will perform a wr/rd of a hub long, which causes a hub access (in addition to the instruction fetching), and so will slow things down a lot.
What you said made me realize that we could do something like AUGS/AUGD, but instead of augmenting the next S/D constant, we could alter the S/D field in the next instruction.
So this is 'in-line self-modification' that can also be hidden from the programmer?
(Self-modifying code is not easy to read months later, or by someone else.)
jmg,
My understanding is that it doesn't actually modify the memory of the MOV instruction. It just changes its behavior, the same way the AUGS instruction does.
It accomplishes the same end result as self-modifying code without actually modifying the code in memory, and it's actually easier to use, since it doesn't require the extra instruction(s) between the modify and the execution of the modified instruction.
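For comparison, a sketch of both approaches in P1-style syntax (ptr and the labels are illustrative; the ALTS behavior is as proposed in this thread):

' traditional self-modifying form: patches cog RAM and needs a spacer instruction
              MOVS    patched, ptr     ' write ptr's value into the S field of 'patched'
              NOP                      ' at least one instruction must separate modify and use
patched       MOV     OUTA, 0-0        ' S field is supplied at run time

' proposed form: nothing in memory is modified and no spacer is needed
              ALTS    ptr              ' substitute ptr's value into the next instruction's S field
              MOV     OUTA, 0          ' behaves as MOV OUTA,[ptr]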
jmg,
My understanding is that it doesn't actually modify the memory of the MOV instruction. It just changes its behavior, the same way the AUGS instruction does.
It accomplishes the same end result as self-modifying code without actually modifying the code in memory,
Yes, that is what I meant by 'in-line self modification' - it occurs in-line, in the execution unit, not in the memory.
The P1+ really does need an 'indirect opcode' group, and this seems a good way to achieve that. I also prefer the second handling form Chip gives, which 'hides' the 'in-line self modification', so it looks more normal to anyone reading the code.
A number of micros are slower/larger for indirect access opcodes, so that is ok.
I realized tonight that it is hard to support hub exec and not get mired back in the complexities of the Prop2.
The problems arise from hub exec needing INDA/INDB-type functionality to overcome the impracticality of self-modifying code in hub memory. INDA/INDB require an effective pipeline stage, unto themselves, in order to take the instruction, recognize INDA/INDB usage, and then substitute the INDA/INDB pointers into the S and D fields of the instruction before reading S and D. The Prop1's simpler architecture just feeds the S and D fields of the instruction straight into the address inputs of the cog RAM to read S and D on the next clock, requiring no extra stage.
This pipeline stage needed to support INDA/INDB increases the number of instructions trailing a branch (which will need cancellation) from one to two. That, in turn, has the effect of requiring Prop2-type INDA/INDB backtracking circuits to accommodate various cancellation scenarios. This pipeline stage is also required to make the RDxxxxC cached reads work.
To implement hub exec, we're going to be complicating the cog quite a bit. Delayed branches will become more necessary to regain looping performance. I think this is the wrong road, in light of what became of the P2 at 180nm.
There is a way around this, and that is to resolve all the INDA/INDB stuff on the same cycle as the instruction read, tacking the INDA/INDB logic time onto the cog RAM access time. This would have the effect of slowing the clock down by ~30%, I estimate. It would require fewer flops and not introduce a new pipeline stage, but would instantly create what will remain the critical path in the cog.
In summary, if we don't pursue hub exec and INDA/INDB, we're much simpler, which means smaller and faster. There is little cost and no extra pipelining required to implement PTRA/PTRB, though, so that is viable. Same with a hardware LIFO stack for CALL/RET. And 128-bit hub transfers are no problem, either. Same with keeping the four tasks - those make a lot possible and are almost free.
Is hub exec worth slowing the cog down for?
I guess I haven't been following this closely enough. What do INDA and INDB have to do with executing code from hub memory? I thought all you were doing was extending the PC to 17 bits and fetching from the i-cache if the high 8 bits were non-zero. That, and adding the JMP and CALL instructions that take 17-bit addresses, is all there is to hub execution, isn't it?
What you said made me realize that we could do something like AUGS/AUGD, but instead of augmenting the next S/D constant, we could alter the S/D field in the next instruction. This is the way to achieve indirection for S and D! This is REALLY simple.
Along with augmenting D and S constants, we could alter D and S registers:
ALTD D/#
ALTS S/#
ALTDS D/#,S/#
This:
ALTS ptr
MOV OUTA,0
Could also be coded as:
MOV OUTA,[ptr]
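Presumably the D-side form works the same way (assuming symmetric behavior for ALTD):
ALTD ptr
MOV 0,INA
which would behave as MOV [ptr],INA.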
By the way, if you get rid of the ability to execute code from the hub, you probably don't need AUGS and AUGD anymore either, since the original "BIG" instruction was mostly there to replace the LMM macro that fetches 32-bit constants when running from the hub directly.
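For reference, a 32-bit constant load with AUGS would look something like this (assuming AUGS simply supplies the upper bits of the next instruction's 9-bit S immediate; the exact bit split depends on the final encoding):

        AUGS    #$12345678 >> 9         ' queue the upper bits of the constant
        MOV     reg, #$12345678 & $1FF  ' low 9 bits; AUGS completes the 32-bit value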
I don't understand this proposed hubexec model. Can someone please explain it in a single message rather than scattered across the entire thread? I don't believe we need anything other than the following:
1) PC extended to 17 bits to address 512K of hub memory
2) When the high 8 bits of the PC are zero, it points to COG memory
3) When the high 8 bits of the PC are non-zero, it points to hub memory
4) CALL/JMP instructions with 17 bit address fields
5) A CALL instruction that stores its full 17 bit return address in a register
6) A way to load a 32 bit constant (AUGS)
7) A RDLONGC-like facility that allows 128 bits to be fetched at once and used as a one-line i-cache
What else is needed? In fact, number 7 could be left out in a really simple implementation but I think Bill determined that would only be 25% faster than LMM.
What is all of this about INDA, INDB, ALTD, ALTS, etc? What am I missing here?
Edit: Okay, I think I'm beginning to understand. Chip said INDA-like functionality. I guess the problem is with my points 2 and 3 where the value in PC is sometimes treated as a hub address and sometimes as a COG address. Is that the issue?
That all still stands. It just seemed to me that without INDA/INDB, it wouldn't be that practical. Kind of myopic on my part.
Anyway, now we've got a solution to indirection: ALTD/ALTS/ALTDS
I don't understand the need for indirect COG address. Can someone post an example of how this would be needed in conjunction with hubexec? It seems like an orthogonal feature to me.
Chip,
What we need is for the PII to support doing what that 4 instruction LMM does, as fast as possible.
A hardware LMM loop with software support would do the trick.
Do you mean like the REPS instruction we had in P2?
Do you mean like the REPS instruction we had in P2?
REPS is too hard to work with in compiler code generators. It works really well for straight line code but very badly with branching code. Unfortunately, I think most code tends to branch a lot.
I don't understand the need for indirect COG address. Can someone post an example of how this would be needed in conjunction with hubexec? It seems like an orthogonal feature to me.
David
If INDA/B are removed from P2 and you can't self-modify code in hub-exec, it makes it difficult to index through COG RAM.
The only way would be to call a COG-mode program that self-modifies and then return to hub-exec mode.
Here's a P3 example.
Adding ALTD/ALTS/ALTDS allows indexing of COG RAM from within hub-exec.
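A hypothetical sketch of that kind of indexing (summing a table held in cog RAM while executing from hub), assuming the proposed ALTS semantics and a P1-style DJNZ; table, idx, sum and count are illustrative names:

        MOV     idx, #table        ' idx holds a cog register address
        MOV     sum, #0
loop    ALTS    idx                ' substitute idx into the next instruction's S field
        ADD     sum, 0             ' effectively ADD sum,[idx]
        ADD     idx, #1            ' step to the next cog register
        DJNZ    count, #loop       ' repeat for 'count' entries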
Brian
Okay, I understand. That sounds like a cool facility but I don't think it will be needed by PropGCC. As I said, it seems like it's orthogonal to hub execution. Without INDA and INDB we are in no worse shape than we were in with P1 which also had no indirect COG memory instructions.
Comments
Awesome! That is an excellent way to handle it! Glad I said that, even though it was worded oddly and wasn't the actual way. It made you think of one!
It's nice to have at least some form of indirection available!
I'm not sure what (D) and (S) mean.
I thought you were implementing hubexec as an extremely simple model. This is what I thought was happening...
The JMP/CALL/RET (4 level lifo, PTRA, PTRB) instructions can mix cog and hub execution. PUSH & POP versions supported.
The instruction cache is 4 longs (hence 128 bit wide).
There would be an address tag holding what quad hub address is stored in the Icache.
When a new instruction fetch was required, we'd wait for the hub and read the quad into the Icache.
Then up to 4 instructions could execute from this Icache. Unless there was a loop within this quad, a new hub fetch would be required.
Self-modifying hubexec instructions would not be possible. We have the same problem with LMM anyway.
The LOCxxxx instructions let us load embedded constants (17 bits) into registers. BTW, I have suggested you make it 18 bits for future 1MB expansion.
So, presuming sequential hubexec instructions, the speed would be 1/2 that of cog mode (4 instructions per hub access). Non-sequential or hub data access will slow this down.
If you can do this then this would be fantastic.
We can get up to other tricks to improve speed, but this would permit the user to run from hub simply, and so the benefit is that we are no longer tied to the 2KB cog.
He means like your [ptr] example. Using the ALTDS instruction you could have MOV [ptr1],[ptr2].
I'm really glad we've got a solution now. What a relief! In some ways, this is better than INDA/INDB.
Sure!
ALTDS ptr1,ptr2
MOV 0,0
same as:
MOV [ptr1],[ptr2]
That all still stands. It just seemed to me that without INDA/INDB, it wouldn't be that practical. Kind of myopic on my part.
Anyway, now we've got a solution to indirection: ALTD/ALTS/ALTDS
Why would you do
ALTDS ptr1, ptr2
MOV 0,0
It is compiled code, so couldn't you just do
MOV ptr1, ptr2 ???
Or does
ALTDS ptr1, ptr2
read the values of ptr1 and ptr2 (let's say they store $20 and $30)
and then when
MOV 0,0
executes it performs
MOV $20, $30
Just realised this is what it is doing, hence both D & S can be accessed via indirection. Nice!
The latter is true.
Would one cog stack be a nice alternative?
Or could PTRA & PTRB use the address to determine cog or hub mode?
Nice!
I think two stacks is a very nice thing to have.
Now we have a solution for indirection in cog memory accesses.
What we need is for the PII to support doing what that 4 instruction LMM does, as fast as possible.
A hardware LMM loop with software support would do the trick.
Is hub exec worth slowing the cog down for?
My gut tells me "no way". When you need the speed you need the speed.
For example, the Fast Fourier Transform fits in and runs from a single COG. Such things as FFT are never fast enough.
Or think about the fact that even when executing from HUB, propgcc can cache tight code loops into COG for maximum speed. In fact it does this so well that the C version of the FFT runs nearly as fast as the hand-written PASM version.
Taking 30% off of all of this sounds really bad.
You're a bit behind; we have a solution now that doesn't slow the cogs down and keeps hubexec. We won't have INDA/INDB.
Remember, every CALLA/CALLB/RETA/RETB will perform a wr/rd of a hub long, which causes a hub access (in addition to the instruction fetching), and so will slow things down a lot.
So I see! Great work guys. I'll go back to sleep.
So this is 'in-line self-modification' that can also be hidden from the programmer?
(Self-modifying code is not easy to read months later, or by someone else.)
The second form is much preferable, even if it means that 'opcode' takes 4 cycles/2 words in how it is managed.