I have just thought of a potential "gotcha". What if the two instructions (xxx and BIG/AUGI) straddled a WIDE break (i.e. were spread across an 8-long boundary)?
Will that really matter? Won't the pipeline just stall until the next WIDE read has completed and then resume as if nothing happened?
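To make the boundary case concrete, here is a small Python sketch (Python only because it is easy to run). The function name and the assumption that addresses are counted in longs are mine, not anything Chip has specified. It reports whether a two-instruction pair starting at a given long address would fall across two different 8-long WIDE blocks.

```python
WIDE_LONGS = 8  # one WIDE read fetches 8 longs, as discussed in this thread


def pair_straddles_wide(long_addr):
    """Return True if the longs at long_addr and long_addr + 1 sit in different WIDE blocks."""
    return long_addr // WIDE_LONGS != (long_addr + 1) // WIDE_LONGS


if __name__ == "__main__":
    # Only a pair starting in the last slot of a block (7, 15, ...) straddles.
    for addr in range(16):
        print(addr, pair_straddles_wide(addr))
```

So only one starting slot in eight is affected, and in that case the second instruction simply arrives after the next WIDE read.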
Regarding a visible BIG register... it would only show up in hubexec mode, and newbies should not use it. It is more for assembly gurus and compilers. It could save a fairly significant amount of longs in the hub, and cycles of execution.
Your potential gotcha is a good reason for using BIG as a prefix, in which case it would not matter. I bet Chip could make it work for postfix as well, as the pipeline would stall until the next octal fetch.
I love macro assemblers too
TOTALLY agreed about massive capability increase!!!!!
Bill, while I can see the advantages of automatically placing the 32-bit immediate result into a register (be it $1F1 or wherever), I think it's more of a kludge that will cause a lot of misunderstanding. Personally, I'd rather not have that feature because of its obscurity.
Oh, and I love the assembler emitting the pair of instructions automatically if the constant is >9 bits.
This last week has certainly seen some massive advances in P2's abilities. Thanks go to "Thanksgiving Holidays" (and we don't have it here in Oz).
I think compilers would rather emit something like:
mov TEMP, #imm_lo
BIG #imm_hi
rdlong A, TEMP
... calculate on A
wrlong A, TEMP
This is a common idiom on other processors, and will work correctly even if the "calculate on A" stuff needs a different BIG (it might!) or if the code spans multiple octo blocks.
The visible BIG would save that move, thus saving hub space. Also, BIG should hold the low bits, see my earlier example for David.
The point is, your example would work as-is, and assembly coders & compilers which can support the visible BIG and embedded addresses in the 8-long window can benefit, even if it is too much work to support it in GCC at this time.
Not unless we have specifically entered a mode that automatically executes this WIDE loop. I am unsure if Chip is doing this???
I guess we need to clarify this if the hub execution mode is actually going to be implemented. If it isn't automatic, we may need to evaluate whether hub execution mode is still worth doing. It may be but it isn't quite as obvious.
In thinking about this more I'm almost positive that Chip said it would happen automatically. Otherwise, what does hub execution mode do? We can already do a RDLONGC to fill the cache and then jump to the 9-long window. It's the automatic filling when jumping outside of the window that is what defines hub execution mode, no?
BILL,
OK, I took a look at your post about reversing the BIG to use lower 23 bits and the XXX #S using the high 9 bits. I really do not like this idea. It goes against all the #S immediate values. I know you can cheat and use it for other obscure results, but for me IMHO it is a real no-no. This is irrespective of whether you automatically save the resultant bits to $1F1 or not.
Here's a radical idea. Should we consider getting rid of CALL/RET entirely and always call functions using the AUX stack or a register to hold the return address? That would mean that there would be enough space for a full hub address for any type of CALL. The existing CALL instruction could be changed to store its full 32 bit return address in the D register instead of in the S field and then the function could just jmp indirect through that register to return. No more self-modifying code for CALL/RET and CALL would work even in hub execute mode so the PC could be used instead of PTRA.
CALL LR, #my_function
...
my_function
...
JMP LR
LR long 0
I guess it uses one more long than the current approach though.
Actually, if the cog address was pushed on stack A or B, it would not take an extra long, work well, quickly, and remove the need for the self modifying JMPRET.
Don't know if Chip will like it.
Like you, I thought it might be nice to use $00000-$007FF for cog space and $00800-$3FFFF for hub addresses, but it may have potential issues. Heck, we could say that anything above 16MB maps to external memory and let the hardware manage the different memory levels. It would certainly simplify life for compilers, but I feel that is a P2.1+ discussion.
A problem with this is it enforces the use of the AUX as a stack for all COGs using any call instructions when you might want to use it exclusively for other purposes such as video buffers, or 1kB of other data and can't spare any more of it, even though you want to still make CALLs (via existing JMPRET approach) in COGs. Not sure a single register approach allows nesting either.
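On the nesting question, a toy Python model may help. It is only a sketch: the function name, the string "addresses" and the use of a Python list as the stack are all made up for illustration. It models main calling outer, which in turn calls inner, with a single link register, and shows that outer has to save LR somewhere before making its own call.

```python
def nested_call_single_lr(save_lr):
    """Model main -> outer -> inner with one link register.

    Returns the address that outer finally jumps back to.
    """
    stack = []               # stands in for any save area, e.g. a stack in AUX
    lr = "main+1"            # main:  CALL LR,#outer
    if save_lr:
        stack.append(lr)     # outer saves LR before calling again
    lr = "outer+1"           # outer: CALL LR,#inner overwrites LR
    # inner: JMP LR returns to outer+1, which is correct either way
    if save_lr:
        lr = stack.pop()     # outer restores LR
    return lr                # outer: JMP LR


if __name__ == "__main__":
    print(nested_call_single_lr(save_lr=True))   # outer returns into main
    print(nested_call_single_lr(save_lr=False))  # outer returns into itself
```

In other words, a single register does allow nesting, but only if every non-leaf function spends extra instructions saving and restoring it.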
We don't need to modify CALL to use the AUX stack because we already have CALLA and CALLB. I was just suggesting the modification to CALL for cases where you might not want to use the stack.
I like this idea too, you can use the ROM hole to map addresses for COGs, and possibly stack RAM. Today we have almost 4kB of space (actually $E00) in the ROM hole, that is enough to hold addresses for the COG and stack RAM combined.
I wonder whether, in HUBEXEC mode, any hub addresses read that are less than $E00 could be used to index into stack RAM and/or COG RAM instead of the hub. That could help fix your pointer dereference problem. It probably requires extra address checks, which may slow down the RAM access critical path, but if it was possible it might be a nice way to help...
If that second opcode is context dependent on Any instruction having an immediate S or D, then the Assembler should check that, and give an error. ( another reason for the simpler, clearer one line syntax )
A smart assembler could even support this as well
ADD reg,#AnyConstant
and spawn one of two opcode sets (just like many ASMs now do automatically with JMP/CALL)
The LIST file should make it clear when 32 bit promotion occurred.
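As a rough illustration of that assembler behaviour, here is a Python sketch. The function name, the listing text and the choice to emit BIG after the instruction are my assumptions; whether BIG ends up as a prefix or a postfix is still being discussed in this thread. It emits one line when the constant fits the 9-bit immediate field and two lines, with a note for the LIST file, when it does not.

```python
IMM9_MAX = 0x1FF  # largest value that fits a 9-bit immediate field


def assemble_add(reg, value):
    """Return the listing lines for ADD reg,#value (hypothetical syntax)."""
    if not 0 <= value <= 0xFFFF_FFFF:
        raise ValueError("constant does not fit in 32 bits")
    if value <= IMM9_MAX:
        return [f"ADD {reg},#{value}"]
    low9 = value & IMM9_MAX   # stays in the instruction's immediate field
    high23 = value >> 9       # carried by the extra BIG long
    return [
        f"ADD {reg},#{low9}",
        f"BIG #{high23}    ' promoted to 32-bit constant",
    ]


if __name__ == "__main__":
    for constant in (5, 511, 512, 100_000):
        print("\n".join(assemble_add("reg", constant)))
```

The user writes the same source line either way, and only the listing shows that a second long was spent.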
If I've understood your example correctly, it relies on using the cached BIG instruction as a hub address. That will only work if the original RDLONG, calculations, and WRLONG all fit within the same 8 long window, which is a fairly specific case. It also relies on the code snippet not being moved around -- adding a single long will shift the relative positions of instructions in the window. That's not a problem for compilers, but could be for programmers writing assembly by hand.
I think the benefits in simplicity of always using the S field as the low 9 bits (whether BIG is present or not) outweigh the tricky optimization available if S is sometimes the upper 9 bits. Consistency in the architecture is important too! Obviously others will have different opinions on the relative values, which is why we're having this discussion :-).
David and others,
Do you see that the HUBEXEC model would also be used to generate video (as in AUX being used to drive the video DACs)? If so, would it be a problem if this was not permitted??? I have an idea.
I think that in hub-execute mode, PTRA will become the program counter and its three LSBs will dictate which long in WIDEs is executing. There should be no problem with 8-long boundaries, as they should seamlessly connect. There will be pipeline stalls, though, when WIDEs are reloading. Those stalls don't break up the continuity in the pipeline, though.
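A quick Python sketch of that addressing scheme, assuming the program counter counts longs (so its three LSBs select the slot inside the current WIDE). The function names are invented for illustration. It also counts how many WIDE loads a straight-line run of code would need, which is where the pipeline stalls would occur.

```python
def decode_pc(pc_long):
    """Split a long-granular PC into (first long of its WIDE block, slot 0..7)."""
    return pc_long & ~7, pc_long & 7


def count_wide_loads(start, n_instructions):
    """Number of WIDE blocks touched by n sequential instructions with no branches."""
    if n_instructions <= 0:
        return 0
    first_block = start >> 3
    last_block = (start + n_instructions - 1) >> 3
    return last_block - first_block + 1


if __name__ == "__main__":
    print(decode_pc(13))
    print(count_wide_loads(6, 4))
```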
I don't know if having a visible BIG register is good. It seems kind of dirty. I think it would be better to sublimate the function by the trailing pipeline AUGD/AUGS instructions. Also, I understand Bill's argument about BIG being reusable, but couldn't the same thing be achieved just by moving a 32-bit constant into a cog register and using that? I might be missing the boat on a few things here.
About hub-execute mode: To make this really seamless and not like some half-baked LMM mode from before, there needs to be an overall mode awareness, or awareness by task. To support multitasking, plus hub-execute for a particular task, we'd need the latter, plus a task-aware BIG constant mechanism which would have to be some kind of a prefix.
Maybe HUBEXEC can happen on part of a scan line. Sometimes graphics engines have exceeded what we can get into a COG. On P1, we simply didn't do them, or we had to pack things into two COGS, etc...
At clock speed, quick dips into HUBEXEC mode while AUX is being drawn to the screen could be VERY useful. Nice to have the execute space frankly. Would be a shame to give it up before we've even written any code!
If AUX were linked to HUBEXEC, it would preclude using it in this way. COG mode used to fill AUX, for example.
Then those pixels take a LONG time to render to the screen at the speed of the P2. Drop into HUBEXEC to get various things done, return to COG mode for more feeding of WAITVIDS, and on it all goes. IMHO, people will do this. I'm thinking of it.
I think the benefits in simplicity of always using the S field as the low 9 bits (whether BIG is present or not) outweigh the tricky optimization available if S is sometimes the upper 9 bits. Consistency in the architecture is important too!
I agree with this overall. So many optimizations done already. Digging too deep may well overcomplicate things and limit some of the potential we've already got.
What I was thinking, is if HUBEXEC mode was mutually exclusive to:
(1) Video Generation (ie video to DACs)
The Video path of AUX Ram could be used to supply the Instruction to the ALU instead of from the hub side of AUX or the Cache.
My thinking here is using the AUX to be the cache too.
This way the RDWIDE could execute concurrently with cog execution. Aux could mean that the cache is now multiples of 8*long clocks, meaning no hub stalls for sequential hub (non-branching) code. A double 8*long aux/cache would result in no cog stalls.
And, AUX can still be used for PUSH/POP/RDAUX/WRAUX without stalling the ALU from fetching instructions. Perhaps this is not a problem with the caching?
Alternately, perhaps the cache could be expanded to two blocks of 32*longs?
In fact, the cache does not even need to be mapped/windowed into the cog at all for HUBEXEC to work?
(2) Single threading (ie don't use the multi-tasking hw options)
Just trying to keep it simple. Wouldn't the AUGD/AUGS have problems in multi-threading - because they would not be sequential in the pipe?
Could the cache be two blocks of 8 longs? While executing from the first, the second could be loaded without stalling the first, and vice versa. This would prevent stalling due to WIDEs reloading, unless there was a JMP/CALL/RET.
For HUBEXEC mode, Windowing into the cog isn't necessary. Do you agree?
Optionally, if a REP block was fewer than 9 instructions, the reloading of the cache could be suppressed (simplest implementation), permitting fast small REP loops to run without reloading the cache.
For Video Gen or Aux data, can WIDEs load into AUX directly without having to be moved by sw from cache to aux?
David and others,
Do you see that the HUBEXEC model would also be used to generate video (as in AUX being used to drive the video DACs)? If so, would it be a problem if this was not permitted??? I have an idea.
I guess I'll leave this to the "others" to answer. I'm not that familiar with generating video and its requirements.
I'm thinking that there should be two BIG instructions: AUGD and AUGS. This way, in case both D and S of the initial instruction are immediate, you can control which one gets augmented without any precedence rules. Also, 32-bit constants can be denoted in PASM by ##:
ADD reg,##100_000
...becomes...
ADD reg,#100_000 & $1FF
AUGS #100_000 >> 9
...and this...
SETSERA ##configvalue,#baud
...becomes...
SETSERA #configvalue & $1FF, #baud
AUGD #configvalue >> 9
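The arithmetic behind those expansions can be checked with a few lines of Python. This is a sketch of the split only, with function names of my own choosing; it says nothing about how the hardware recombines the fields. The 9-bit immediate keeps the low 9 bits and the AUGS/AUGD long carries the remaining upper 23 bits.

```python
def split_low9(value):
    """Split a 32-bit constant into (9-bit immediate, 23-bit augment value)."""
    value &= 0xFFFF_FFFF
    return value & 0x1FF, value >> 9


def join_low9(imm9, aug23):
    """Rebuild the 32-bit constant from the two fields."""
    return ((aug23 & 0x7F_FFFF) << 9) | (imm9 & 0x1FF)


if __name__ == "__main__":
    imm9, aug23 = split_low9(100_000)
    print(imm9, aug23, join_low9(imm9, aug23))
```

For 100_000 this gives 160 in the immediate field and 195 in the augment long, since 195 * 512 + 160 = 100_000.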
I think the benefits in simplicity of always using the S field as the low 9 bits (whether BIG is present or not) outweigh the tricky optimization available if S is sometimes the upper 9 bits. Consistency in the architecture is important too! Obviously others will have different opinions on the relative values, which is why we're having this discussion :-).
Eric
BTW Chip, Bill is advocating the following usage, and also saving the final 32-bit result in a cog register (a fixed location such as $1EF or $1F1).
AUGS #(value & $007F_FFFF) 'lower 23 bits
XXXX xxxx,#(value >> 23) 'upper 9 bits
Your thoughts?
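For comparison, here is the same kind of Python sketch for the reversed layout, in which the AUGS long carries the lower 23 bits and the instruction's 9-bit immediate carries the upper 9 bits (the upper 9 bits of a 32-bit value come from a shift by 23). The function names are illustrative only.

```python
def split_high9(value):
    """Split a 32-bit constant into (upper 9 bits for S, lower 23 bits for AUGS)."""
    value &= 0xFFFF_FFFF
    return value >> 23, value & 0x7F_FFFF


def join_high9(imm9, aug23):
    """Rebuild the 32-bit constant from the two fields."""
    return ((imm9 & 0x1FF) << 23) | (aug23 & 0x7F_FFFF)


if __name__ == "__main__":
    print(split_high9(100_000))
    print(split_high9(0xFFFF_FFFF))
```

With this layout, any value below 2^23 (which covers every hub address) lives entirely in the AUGS long and the immediate field is zero. That is presumably what makes the long reusable as an embedded address in Bill's scheme.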
An option to force a double-opcode could be useful, (for strictest time management) but I'd still expect an assembler to accept this :
ADD reg,#ConstantName
and allow users to change the value of ConstantName elsewhere, and still have it assemble as expected.
The assembler does the housekeeping, so the user does not have to.
I agree with this overall. So many optimizations done already. Digging too deep may well overcomplicate things and limit some of the potential we've already got.
I agree, too.
I'm thinking of doing a lot of dipping!