I don't care about the name BIG. As I mentioned in another post, when we did this at VM Labs we didn't actually require the programmer to code this instruction at all. There may not even have been a mnemonic for it at all. The assembler generated it automatically when the programmer used an immediate constant that wouldn't fit in the original instruction. That eliminates the problem of having to shift values right by 9 in order to form the 23-bit constant. The assembler does that for you.
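To make the "assembler does that for you" idea concrete, here is a minimal sketch in Python of how an assembler might expand an oversized immediate. The function name, the text format of the output, and the choice to emit BIG before the instruction are all assumptions for illustration; the thread has not settled on prefix versus postfix.

```python
IMM_BITS = 9    # width of the immediate S field
BIG_BITS = 23   # width of the BIG payload (9 + 23 = 32)

def assemble_immediate(mnemonic, dest, value):
    """Return the instruction list for `mnemonic dest, #value`.

    Hypothetical helper: emits a BIG prefix only when the constant
    does not fit in the 9-bit immediate field.
    """
    value &= 0xFFFFFFFF
    if value < (1 << IMM_BITS):
        return [f"{mnemonic} {dest}, #${value:X}"]
    upper = value >> IMM_BITS               # the 23 bits carried by BIG
    lower = value & ((1 << IMM_BITS) - 1)   # the 9 bits left in the instruction
    return [f"BIG #${upper:X}", f"{mnemonic} {dest}, #${lower:X}"]

print(assemble_immediate("ADD", "r0", 0x1FF))
print(assemble_immediate("ADD", "r0", 0x12345678))
```

The programmer only ever writes the full constant; the shift by 9 and the mask happen inside the assembler.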
In thinking about this more I realize now why I was thinking that BIG would be useful even in LMM mode. It's because I originally proposed two new instructions, LDHI and LDLO. These instructions would take a 16-bit immediate argument and a D register and would load the high and low 16 bits of the register respectively. That would allow a 32-bit constant to be loaded in two instructions and would work for LMM as well as with the "hub execution" feature. The BIG instruction is really only useful in "hub execution" mode.
The other proposal was:
LDHI D, #$nnnn
LDLO D, #$nnnn
I suspect these instructions might be harder to fit into the instruction encoding though and they are less flexible than the BIG instruction which can extend the S field for any instruction. Unfortunately, the BIG instruction won't work in LMM mode because the BIG instruction and the instruction it is intended to affect will be separated by the LMM execution loop.
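As a rough model of what LDHI and LDLO would do, here is a short Python sketch. It assumes each instruction replaces only its own 16-bit half and leaves the other half of D untouched; that detail is my assumption, not something stated in the proposal.

```python
def ldhi(reg, imm16):
    """Model of LDHI D, #imm16: replace the high 16 bits of the register."""
    return ((imm16 & 0xFFFF) << 16) | (reg & 0x0000FFFF)

def ldlo(reg, imm16):
    """Model of LDLO D, #imm16: replace the low 16 bits of the register."""
    return (reg & 0xFFFF0000) | (imm16 & 0xFFFF)

# Loading a 32-bit constant takes two instructions, in either order.
d = 0
d = ldhi(d, 0x1234)
d = ldlo(d, 0x5678)
print(hex(d))
```

Because neither instruction depends on its neighbour in the instruction stream, the pair survives being split up by an LMM loop.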
I'll update post#1 in the other thread after breakfast... Chip also freed three more full D/S opcodes!
Basically, with what Chip planned for RDWIDE, and the H* instructions, there is no longer any need for LMM - the hardware really will execute out of the data cached from the hub, advancing the PC, fetching the next thunk etc. More details in an hour or two in my thread.
Regarding BIG - I now prefer the post-instruction Chip version, as it also should meet my tricky optimization needs.
That is what I was trying to convey, with some optional extensions. Does it need to be for the immediately following instruction? I know that is your current use, but as long as it gets set beforehand, it would be zeroed when it gets used.
Here are a couple of usage examples...
SETBIG #(hubadr >> 9)
RDWORD cogreg, #(hubadr & $1FF) 'read word into cogreg from hubaddr
...
SETBIG #(longvalue >> 9)
XOR cogreg, #(longvalue & $1FF) 'xor cogreg with longvalue
Of course both these could be done differently. David or Bill will need to explain their actual intended use.
I have found at least 2. One will be for the "BIG" instruction - need a better name IMHO.
That's providing Chip hasn't found a use for those instruction slots
You will need to explain to me more about HJMP/HCALL - I will take another look at the Hub Execution Model thread.
In a way, I am a bit nostalgic - I came up with LMM to *somehow* execute code from the hub on a P1... but hubexec is so much more elegant, faster, produces smaller code... and is easier to compile to!
Basically, I am thrilled with the possibilities hubexec opens up, and BIG will help it quite nicely
Regarding BIG - I now prefer the post-instruction Chip version, as it also should meet my tricky optimization needs.
I really don't care if it is prefix or postfix as long as it works. I think the prefix instruction has the slight advantage that by adding four 23-bit registers you could get it to work in multiple threads at the same time. But since we've pretty much decided that BIG isn't useful except in the presence of hub execution, and that is limited to a single thread, there is no real advantage to the 23-bit register.
However, if for some reason we don't get hub execution in P2, I'd like to request something like LDHI and LDLO instead of BIG. They would be useful in LMM mode, where BIG would be useless: the BIG prefix/suffix would get separated from the instruction it is supposed to affect because of the way the LMM loop works.
In summary:
If we get hub execution, then I'd also like BIG however Chip thinks it best to implement it.
If we don't get hub execution, I'd like LDHI and LDLO in place of BIG or maybe in addition to it.
In a way, I am a bit nostalgic - I came up with LMM to *somehow* execute code from the hub on a P1... but hubexec is so much more elegant, faster, produces smaller code... and is easier to compile to!
LMM was a big help for P1 even if it does become obsolete in P2 or P3.
One more point about BIG and threads. It seems to me that we won't be able to support any other threads if we're using hub execution mode even if those threads are just executing code in COG memory since the BIG feature will get confused by the other threads sharing the pipeline. So any COG that is executing code from hub memory can only have a single thread running.
Edit: This might be a place where the 23 bit internal register would help. Then other COG threads could run along with the hub thread as long as only one thread uses the BIG instruction. And, having four of those registers would mean that the hub thread could run along with up to three COG threads without any interference even if they all used the BIG instruction.
There are ways to extend hubexec to the four hardware tasks, however it would complicate it needlessly for P2. We can explore that for P3; until then, pthreads will do nicely to provide user threads in hubexec mode.
Regarding cog only mode, multi-tasked or not, I don't think we need BIG. Simply place the full 32 bits in a long somewhere in the cog, and reference it.
If you see another use for BIG in cog-only mode (other than LMM), which would not be handled by using a long in cog space, then I am missing something, and would love to see it.
Having said that, the following may be simple enough, and do what you ask:
- in hubexec mode, BIG behaves as per Chip's post
- in cog only mode, BIG writes the value to a known location, I suggest $1F1 (I'd like to reserve $1F0 for the LR you and Eric need for PropGCC)
p.s.
I said other than LMM as the LMM case can be handled by using PTRA or PTRB as the PC.
' MVI reg,#const32
RDLONGC reg,ptra++
' RDLONG reg,#hubaddr23
RDLONGC reg,ptra++
RDLONG reg,reg
The auto increment on ptra++ gets rid of the problem of executing constants.
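Here is a small Python simulation of that point, under some simplifying assumptions of mine: hub memory is a list indexed by long, instructions are tuples, and the `RDLONGC_PTRA` and `HALT` names are made up for the model. It shows that when PTRA doubles as the LMM program counter, a `RDLONGC reg,ptra++` both loads the inline constant and steps the fetch pointer past it.

```python
def run_lmm(hub):
    """Run a toy LMM loop that uses PTRA as its program counter."""
    regs = {}
    fetched = []   # hub indices that were fetched as instructions
    ptra = 0
    while True:
        fetched.append(ptra)
        op = hub[ptra]
        ptra += 1                        # the LMM loop's own fetch: ptra++
        if op[0] == "HALT":
            return regs, fetched
        if op[0] == "RDLONGC_PTRA":      # RDLONGC reg, ptra++
            regs[op[1]] = hub[ptra]      # read the inline constant...
            ptra += 1                    # ...and step over it
        elif op[0] == "RDLONG":          # RDLONG reg, reg
            regs[op[1]] = hub[regs[op[2]]]
        else:
            raise ValueError(f"unknown op {op!r}")

PROGRAM = [
    ("RDLONGC_PTRA", "r0"), 0x12345678,   # MVI r0, #const32
    ("RDLONGC_PTRA", "r1"), 6,            # RDLONG r1, #hubaddr (long index 6)
    ("RDLONG", "r1", "r1"),
    ("HALT",),
    0xCAFE,                               # data at long index 6
]

regs, fetched = run_lmm(PROGRAM)
print(regs, fetched)
```

The constants at indices 1 and 3 never show up in `fetched`, which is the whole trick.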
I anticipate a trend toward heavier use of port D in the hub execution model, driven by the need to avoid as many hub operations as possible (other than the WIDE instruction fetches) in order to take full advantage of the available slots.
Interrupting the straight-line fetching of hub-resident instructions just to access some hub-dependent semaphore or data can cause a lot of pipeline stalls while the access path switches, partly losing the benefits of the hub execution model itself.
Now I really feel the lack of some multiport message-box area, but perhaps it will become a well-agreed P3 feature.
Then we could have up to three cog-resource-dependent tasks peacefully sharing the execution unit with one fully responsive hub-resource-dependent task.
There are ways to extend hubexec to the four hardware tasks, however it would complicate it needlessly for P2. We can explore that for P3; until then, pthreads will do nicely to provide user threads in hubexec mode.
Regarding cog only mode, multi-tasked or not, I don't think we need BIG. Simply place the full 32 bits in a long somewhere in the cog, and reference it.
I think you're missing what I was trying to say. If BIG is implemented the way Chip suggests, I don't think it will be possible to have more than one thread running on a COG that is executing code from hub memory. Is that acceptable? I would think it would be nice to be able to run one hub mode thread and one or more COG mode threads at the same time. I don't think that will be possible using Chip's postfix approach since that relies on finding the BIG instruction in the pipeline and with multiple threads the pipeline will be filled with instructions from other threads not the next instruction in the current thread. The 23 bit register idea could get around this and allow COG threads to coexist with a single hub thread.
If you see another use for BIG in cog-only mode (other than LMM), which would not be handled by using a long in cog space, then I am missing something, and would love to see it.
No, I don't see a use for BIG in COG mode. That's why I asked for LDHI and LDLO if we for some reason don't get hub execution in P2.
Thanks - now I understand. I think what threw me was the reference to threads, instead of hardware tasks. Or lack of breakfast and more coffee
You are absolutely correct that Chip's approach limits a hubexec cog's use of BIG for other hardware tasks, however as other tasks would interfere with the hub slots available to the hubexec task, it would slow it greatly.
Personally, I don't consider it an issue because:
- I'd like hubexec cogs running as fast as possible - which implies not using the hardware tasks in cogs (that are in hubexec mode). This way, many drivers that won't fit in a cog are possible. Think TCP/IP, USB stack etc.
- for user level threads, pthreads work fine (as demonstrated by propgcc for P1) and without more cache lines would be more efficient at the macro level
- I see the up to four hardware tasks as a way of getting more hardware drivers (that don't need every cycle a single non-tasking cog can provide)
- any cog not running in hubexec mode can run up to four hardware tasks
OK, think I can fit the BIG and LDHI & LDLO into the one instruction. BUT
Thinking some more, this is what you want. Maybe BIG has to be done before, not after. We just need to explain to Chip what we want.
0: INSTR D,#(hubaddr & $1FF)
1: thread1 instr
0: BIG #(hubaddr >> 9)
1: thread1 instr
If this works (BIG before or after) will that remove the LDHI/LDLO requirement?
I don't think LDHI and LDLO are needed at all unless Chip decides not to do the hub execute mode for P2. Also, I think you're right that BIG has to be before the instruction it modifies if we're going to allow it to be used with multiple hardware tasks. However, Bill seems to think it would be okay to restrict the use of hub execution mode and the BIG instruction to COGs that only execute a single hardware task. In that case, Chip's solution will work just as well. Also, if you really want to support the use of BIG in multiple tasks, you'd need to also have 4 copies of the hidden register that remembers the 23 bits from the most recent BIG instruction. I think it is probably not very important to support BIG in more than one hardware task but I think it would be good to support it in one task running in hub execute mode and to also allow other COG mode tasks to run at the same time.
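To illustrate why interleaved hardware tasks need their own copy of the hidden 23-bit register, here is a toy Python model. It assumes BIG is a prefix holding the upper 23 bits, that the register is zeroed once consumed, and (as a simplification) that every non-BIG instruction carries an immediate. The function and its stream format are hypothetical.

```python
def run_interleaved(stream, per_task_registers):
    """Resolve immediates for an interleaved stream of (task, op, imm9_or_big23)."""
    big = [0, 0, 0, 0]          # hidden 23-bit registers, one per hardware task
    results = []
    for task, op, imm in stream:
        slot = task if per_task_registers else 0   # shared register uses slot 0
        if op == "BIG":
            big[slot] = imm & 0x7FFFFF
        else:
            results.append((task, op, (big[slot] << 9) | (imm & 0x1FF)))
            big[slot] = 0       # consumed, so clear it
    return results

# Task 0 loads $12345678 while task 1 slips an instruction in between.
STREAM = [(0, "BIG", 0x91A2B), (1, "ADD", 1), (0, "MOV", 0x78)]

print(run_interleaved(STREAM, per_task_registers=True))
print(run_interleaved(STREAM, per_task_registers=False))
```

With one shared register, task 1's ADD swallows task 0's prefix; with four registers each task sees only its own BIG.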
The reason I am not interested in running other cog tasks in the same cog that has one task in hubexec mode is very simple.
It will slow hubexec down, and cause very heavy octal-cache thrashing if the other tasks use any RDxxxxC or WRxxxxC instructions.
As other cogs can still execute four tasks per cog, I personally find it perfectly reasonable to maximise the performance of the hubexec cog.
This also minimizes the hardware required ... which is why I have not described how to have cogs running each of the four tasks in hubexec mode with pretty good performance... the changes required are non-trivial, and are best postponed for at least a P2.1 if not P3.
Ray,
In other news... I have no preference regarding pre-fix or post-fix, however right justifying the big value (ie lowest 23 bits are in the big, the high nine bits in the related instruction), making the big result visible at $1F1, LR fixed at $1F0, and the cache visible at $1E0-$1E7 are quite important for future optimization, Spin, assembly language etc.
I understand the issues with running multiple hub mode tasks at the same time. The only thing I was suggesting is that it might be nice to be able to run one hub mode task and up to three COG mode tasks at the same time. As you say though, the COG mode tasks would have to avoid the RDxxxxC and WRxxxxC instructions.
I agree, that would be useful - it is just that I worry about people ignoring the advice to avoid RDxxxxC, WRxxxxC, and other hub access, and bringing the hubexec task to its knees.
That may happen whether or not we try to allow multiple hardware tasks while one is doing hub mode. In fact, there isn't likely to be anything in the hardware that would prevent trying to do multiple hub mode tasks. It just won't work very well, if at all.
If the other tasks avoided BIG, and RDxxxxC / WRxxxxC, it would simply work, but would slow down the hubexec code - which would probably still be plenty fast enough for a lot of applications.
Hmmm... I wonder what ozpropdev can stuff in besides a hubexec task...
Aside from BIG issues, wouldn't it also work to have multiple hubexec tasks? I know it would be horribly slow but if instruction fetch just does the same thing as RDLONGC, the only problem would be that the cache would be invalidated on every instruction fetch so all of the hubexec tasks would revert to hub speed. The same would be true if one of the COG tasks used one of the RDxxxxC or WRxxxxC instructions. That would just force a new 8-long fetch on the next instruction fetch. Or am I missing something?
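A quick way to see the effect is to count refills of a single 8-long cache for different fetch patterns. This Python sketch assumes long addresses, lines aligned to 8 longs, and strict round-robin interleaving of tasks; the addresses are arbitrary examples.

```python
LINE_LONGS = 8   # one cache line holds 8 longs

def count_refills(fetch_addresses):
    """Count how often a single-line cache must be reloaded."""
    loaded_line = None
    refills = 0
    for addr in fetch_addresses:
        line = addr // LINE_LONGS
        if line != loaded_line:
            loaded_line = line
            refills += 1
    return refills

# One hubexec task running 32 sequential longs.
single = list(range(0x800, 0x820))

# Two hubexec tasks, 16 longs each, interleaved fetch by fetch.
task0 = range(0x800, 0x810)
task1 = range(0x900, 0x910)
interleaved = [addr for pair in zip(task0, task1) for addr in pair]

print(count_refills(single), count_refills(interleaved))
```

The same 32 fetches cost 4 refills for one task and 32 refills for two interleaved tasks, so every fetch pays a hub access in the second case.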
The reason I am not interested in running other cog tasks in the same cog that has one task in hubexec mode is very simple.
It will slow hubexec down, and cause very heavy octal-cache thrashing if the other tasks use any RDxxxxC or WRxxxxC instructions.
As other cogs can still execute four tasks per cog, I personally find it perfectly reasonable to maximise the performance of the hubexec cog.
This also minimizes the hardware required ... which is why I have not described how to have cogs running each of the four tasks in hubexec mode with pretty good performance... the changes required are non-trivial, and are best postponed for at least a P2.1 if not P3.
My thoughts too.
Ray,
In other news... I have no preference regarding pre-fix or post-fix, however right justifying the big value (ie lowest 23 bits are in the big, the high nine bits in the related instruction), making the big result visible at $1F1, LR fixed at $1F0, and the cache visible at $1E0-$1E7 are quite important for future optimization, Spin, assembly language etc.
Do you really mean lowest 23 bits in BIG and 9 high in #S/#D?
I think the cache is mappable to any 8*long cog window???
BTW I am hoping that $1F0-$1F1 ultimately get used for AUXA/AUXB just like INDA/INDB are used, but to access AUX rather than COG. Either now or later.
I am not sure if Chip is intending to implement a HUBEXEC mode where the instructions will automatically fill the cache, execute the 8 instructions from the cache window, and then automatically loop back to fill the cache with the next line, etc. Currently we can do this with instructions, but a bit slower. Currently, the compiler may have to avoid the use of BIG/AUGI and XXX instruction pairs spread over a WIDE boundary. Again, a small price to pay for the astonishing improvement over LMM.
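The boundary check a compiler or assembler would need is small. A sketch in Python, assuming long addresses and windows aligned to 8 longs (the helper names are hypothetical):

```python
def pair_straddles_window(first_addr, window_longs=8):
    """True if a two-instruction pair starting at first_addr crosses a window boundary."""
    return first_addr % window_longs == window_longs - 1

def padding_before_pair(first_addr, window_longs=8):
    """Number of filler longs to emit so the pair stays inside one window."""
    return 1 if pair_straddles_window(first_addr, window_longs) else 0

print([addr for addr in range(16) if pair_straddles_window(addr)])
```

Only a pair whose first long lands in the last slot of a window needs a filler long in front of it.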
HJMP/HCALL/HRET:
Am I correctly understanding that currently, we would execute these by saving the new hub address into say INDA, and then jumping to the next
RDWIDE cachewindow, INDA++
and continue to execute from the loaded 8*long cache window?
So what we are trying to do, is make this automatic?
The reason I was not asking for each task to be able to hubexec is that it would require 4x(BIG,mapped 8-long cache,LR) - and more logic, multiplexers etc. I have many ideas for P2.1++++ however I am trying to minimize the risk for P2 even. The risk goes way up if we need four of the above.
With one, if it does not work, and is not fixed, P2 will still work with old-style LMM.
Besides, pthreads will run very nicely on hubexec, without the memory cache thrashing. It would actually be faster at the macro level than four hubexec tasks (that can be fixed with a lot more hardware, but not in P2)
My original idea was to just use PC and add enough bits to allow it to address every long in hub memory. Any address between 0x0 and 0x7ff would be treated as a COG address and any other address would be treated as a hub address. That means that you can't execute code from the first 2k of hub memory but I don't think that's a big problem. Then, any PC fetch that was >= 0x800 would be handled as if it were a RDLONGC making use of the cache to speed up execution. It would all be automatic. However, Bill pointed out that in that model a CALL instruction would possibly have to store a hub address in the 9 bits of the corresponding RET instruction, obviously impossible. So we end up using PTRA instead of PC when executing from hub. I would still expect the cache filling to happen automatically as if a RDLONGC had been executed for each instruction. Is that difficult?
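The address split itself is trivial to express. In this Python sketch I am assuming the PC holds a byte address, so that 0x000-0x7FF covers the 512 cog longs (which matches "the first 2k of hub memory"); the function name and return format are made up for illustration.

```python
COG_LIMIT = 0x800   # first address treated as hub (assumed byte addressing)

def decode_pc(pc):
    """Classify a fetch address as a cog register index or a hub byte address."""
    if pc < COG_LIMIT:
        return ("cog", pc >> 2)   # 4 bytes per long -> register 0..$1FF
    return ("hub", pc)            # would be fetched like a RDLONGC

print(decode_pc(0x7FC), decode_pc(0x800))
```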
Do you really mean lowest 23 bits in BIG and 9 high in #S/#D?
Yes. See my extensive discussion with David about this; lowest 23 in BIG can save a surprising amount of memory for assembly programmers and smart optimizing compilers.
I am certain gcc could use it too, but it is not worth the effort to support it in GCC in the first pass.
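For anyone following along, the two layouts being argued over differ only in which end of the 32-bit value the 9 instruction bits supply. A Python sketch (function names are mine):

```python
def combine_big_low(big23, imm9):
    """Right-justified layout: BIG holds bits 22..0, the instruction holds bits 31..23."""
    return ((imm9 & 0x1FF) << 23) | (big23 & 0x7FFFFF)

def combine_big_high(big23, imm9):
    """Alternative layout: BIG holds bits 31..9, the instruction holds bits 8..0."""
    return ((big23 & 0x7FFFFF) << 9) | (imm9 & 0x1FF)

value = 0x12345678
print(hex(combine_big_low(value & 0x7FFFFF, value >> 23)))
print(hex(combine_big_high(value >> 9, value & 0x1FF)))
```

One property of the right-justified layout is that any value below 2^23 sits whole in BIG, with zero in the instruction's 9 bits.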
I think the cache is mappable to any 8*long cog window???
That's what Chip intends, but there are many advantages for hubex mode to keep it at $1E0-$1E7
When used to feed the video engine, or other non-hubex use, any place is valid.
Advantages:
- leaves $000-$1DF free for "FCACHE/FLIB" like use
- totally predictable known address for compiler, see tricks for saving hub code space in my discussion with David
BTW I am hoping that $1F0-$1F1 ultimately get used for AUXA/AUXB just like INDA/INDB are used, but to access AUX rather than COG. Either now or later.
I'd like that too, in which case, $1EF and $1EE could be used for BIG and LR, or Chip could add SETBIG regno and SETLR regno. I wonder if it is worth it for P2, as it changes known working (on FPGA) instructions. I leave that up to Chip :-)
The reasons I prefer $1F0/$1F1 for now are:
- known addresses, no linking issues, better for pasm code if they are at a fixed known address
- code in the 8 long window knows where to find the last assembled BIG value
- eliminates need for SETLR / SETBIG
Having them as exposed registers at known locations allows a lot of very nice optimization tricks.
I am not sure if Chip is intending to implement a HUBEXEC mode where the instructions will automatically fill the cache, execute the 8 instructions from the cache window, and then automatically loop back to fill the cache with the next line, etc.
I distinctly remember Chip posting exactly that, auto refilling. Heck, he can schedule a refill at the first instruction, and postpone it if the instructions in the 8-long need hub cycles. I leave the implementation to Chip.
Currently we can do this with instructions, but a bit slower. Currently, the compiler may have to avoid the use of BIG/AUGI and XXX instruction pairs spread over a WIDE boundary. Again, little price to pay for the astonishing improvement over LMM.
Agreed, but no need to pay the price if BIG is a prefix; then it does not matter if the 8-long cache is reloaded before the instruction that completes the upper bits.
HJMP/HCALL/HRET:
Am I correctly understanding that currently, we would execute these by saving the new hub address into say INDA, and then jumping to the next
RDWIDE cachewindow, INDA++
and continue to execute from the loaded 8*long cache window?
So what we are trying to do, is make this automatic?
Not quite.
The cog's PC is expanded to 16 bits (the low two bits are implied and zero for hub access), and yes, this makes it automatic.
HJMP
- switch to hubexec mode if in cog mode
- load PC with embedded constant (16 bits, plus two implied zero)
- load cache if necessary, then start executing at the index within the cache indicated by the 16 bit address
- when you run past the 8 long cache, fetch next 8 longs (this may be overlapped with executing non-hub instructions in the cache)
HCALL/HCALLA/HCALLB
- save PC+1 (in long addresses, scaled) to LR / push on A stack / push on B stack (depends on op code)
- rest is the same as HJMP
HRET / HRETA / HRETB
- load PC with return address from one of LR / top of A stack / top of B stack (depends on op code)
- continue executing hub code
This way, there is no need to use PTRA as a PC, and takes less logic. And automagic reloads
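The steps above can be sketched as a tiny Python state machine. This is only a model under my own assumptions: the PC is a 16-bit long address, the cache is one line aligned to 8 longs, only the LR variants of call and return are shown, and the class and method names are invented.

```python
class HubExec:
    """Toy model of HJMP / HCALL / HRET with an auto-refilled 8-long cache."""

    def __init__(self):
        self.pc = 0          # 16-bit long address
        self.lr = 0          # link register
        self.line = None     # which 8-long line is currently cached
        self.refills = 0

    def _ensure_line(self):
        line = self.pc >> 3             # 8 longs per cache line
        if line != self.line:
            self.line = line
            self.refills += 1           # fetch the next 8 longs

    def hjmp(self, target):
        self.pc = target & 0xFFFF
        self._ensure_line()

    def hcall(self, target):
        self.lr = (self.pc + 1) & 0xFFFF    # save PC+1, in long addresses
        self.hjmp(target)

    def hret(self):
        self.hjmp(self.lr)

    def step(self):
        self.pc = (self.pc + 1) & 0xFFFF
        self._ensure_line()

    def byte_address(self):
        return self.pc << 2             # two implied zero bits

m = HubExec()
m.hjmp(0x100)
for _ in range(8):
    m.step()              # runs off the end of the first line at 0x108
m.step()                  # now at 0x109
m.hcall(0x200)
m.hret()
print(hex(m.pc), hex(m.byte_address()), m.refills)
```

Slot selection inside the line is just `pc & 7`, so a jump into the middle of a line starts executing at that index.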
This way, there is no need to use PTRA as a PC, and takes less logic. And automagic reloads
If you do this you won't be able to use CALL to call COG resident code since there isn't enough space in the 9 bit S field of the RET instruction for a full hub address. I guess you could store the address of the instruction in the 8-long window instead of what is in the PC but that's kind of a big change. Also, as we discussed before, it won't work if the CALL instruction is in the last long of the 8-long window.
CALL only calls cog code by definition, and is an alias for JMPRET - so it will work.
Okay, I guess you're saying that once you enter hub mode you can no longer call any functions that are COG-resident. That means there is no way to use helper functions in COG memory like what you typically call FCACHE.
:-)
Yes. See my extensive discussion with David about this; lowest 23 in BIG can save a surprising amount of memory for assembly programmers and smart optimizing compilers.
I am certain gcc could use it too, but it is not worth the effort to support it in GCC in the first pass.
I still don't like it. But Chip is the one to implement it and I am not sure how much extra work that is. The ALU would have to be able to swap where the 9bit immediate bits go (bottom or top).
That's what Chip intends, but there are many advantages for hubex mode to keep it at $1E0-$1E7
When used to feed the video engine, or other non-hubex use, any place is valid.
Advantages:
- leaves $000-$1DF free for "FCACHE/FLIB" like use
- totally predictable known address for compiler, see tricks for saving hub code space in my discussion with David
It makes no difference to Chip's implementation where the cache maps to. So GCC would just set $1E0-$1E7.
I'd like that too, in which case, $1EF and $1EE could be used for BIG and LR, or Chip could add SETBIG regno and SETLR regno. I wonder if it is worth it for P2, as it changes known working (on FPGA) instructions. I leave that up to Chip :-)
The reasons I prefer $1F0/$1F1 for now are:
- known addresses, no linking issues, better for pasm code if they are at a fixed known address
- code in the 8 long window knows where to find the last assembled BIG value
- eliminates need for SETLR / SETBIG
Having them as exposed registers at known locations allows a lot of very nice optimization tricks.
Yes, true.
I distinctly remember Chip posting exactly that, auto refilling. Heck, he can schedule a refill at the first instruction, and postpone it if the instructions in the 8-long need hub cycles. I leave the implementation to Chip.
Yes, would be nice. But, if I understand correctly, the data bus to the cache is shared between cog and hub (not dual ported), so the cog would stall while it waits for the next cache line to be filled. It's a slowdown but still way better than LMM.
Agreed, but no need to pay the price if BIG is a prefix; then it does not matter if the 8-long cache is reloaded before the instruction that completes the upper bits.
Not quite.
the cog's PC is expanded to 16 bits (low two bits are implied and zero for hub access), and yes, this makes it automatic
It depends on how Chip implements it. Currently postfix doesn't require any additional registers which was what was so nice.
HJMP
- switch to hubexec mode if in cog mode
- load PC with embedded constant (16 bits, plus two implied zero)
- load cache if necessary, then start executing at the index within the cache indicated by the 16 bit address
- when you run past the 8 long cache, fetch next 8 longs (this may be overlapped with executing non-hub instructions in the cache)
HCALL/HCALLA/HCALLB
- save PC+1 (in long addresses, scaled) to LR / push on A stack / push on B stack (depends on op code)
- rest is the same as HJMP
HRET / HRETA / HRETB
- load PC with return address from one of LR / top of A stack / top of B stack (depends on op code)
- continue executing hub code
This way, there is no need to use PTRA as a PC, and takes less logic. And automagic reloads
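To make the HJMP / HCALL / HRET steps above concrete, here is a rough Python sketch of the bookkeeping. It is only a toy model under stated assumptions, not Chip's design: the class name, the `refills` counter and the single `lr` field are made up for illustration, the PC is kept as a 16-bit long address (byte address = PC << 2), and the A/B stack variants are left out.

```python
LINE_LONGS = 8  # the cache window holds 8 longs


class HubExecModel:
    """Toy model of the proposed HJMP/HCALL/HRET behaviour (hypothetical)."""

    def __init__(self):
        self.pc = 0            # 16-bit long address; byte address = pc << 2
        self.lr = 0            # link register (the thread suggests $1F0)
        self.line_base = None  # long address of the cached line, None = empty
        self.refills = 0       # how many times the 8-long line was loaded

    def _ensure_line(self):
        # Reload the 8-long window only when the PC leaves the cached line.
        base = self.pc & ~(LINE_LONGS - 1)
        if base != self.line_base:
            self.line_base = base
            self.refills += 1

    def cache_index(self):
        # Index of the current instruction inside the 8-long window.
        self._ensure_line()
        return self.pc & (LINE_LONGS - 1)

    def hjmp(self, target):
        self.pc = target & 0xFFFF
        self._ensure_line()

    def hcall(self, target):
        self.lr = (self.pc + 1) & 0xFFFF  # save PC+1, then behave like HJMP
        self.hjmp(target)

    def hret(self):
        self.hjmp(self.lr)                # HRET as an alias for "HJMP LR"

    def step(self):
        # Straight-line execution: running past the line triggers a refill.
        self.pc = (self.pc + 1) & 0xFFFF
        self._ensure_line()
```

Jumping to long address $0105 loads the line at $0100 and starts at index 5; stepping past $0107 causes the next line to be fetched automatically, which is the "automagic reload" part.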
OK, I understand this now. It is premised on Chip doing the HUBEXEC mode to cater for this.
The other proposal was:
LDHI D, #$nnnn
LDLO D, #$nnnn
I suspect these instructions might be harder to fit into the instruction encoding though and they are less flexible than the BIG instruction which can extend the S field for any instruction. Unfortunately, the BIG instruction won't work in LMM mode because the BIG instruction and the instruction it is intended to affect will be separated by the LMM execution loop.
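As a sanity check on the LDHI/LDLO idea, here is a small Python sketch of how the pair would build a 32-bit constant in two steps. One assumption that is not spelled out in the proposal: each instruction is taken to replace only its own half of D and leave the other half alone. The function names are just illustrative.

```python
def ldhi(d, imm16):
    """Model of LDHI D,#imm16: replace the high 16 bits, keep the low 16."""
    return ((imm16 & 0xFFFF) << 16) | (d & 0xFFFF)


def ldlo(d, imm16):
    """Model of LDLO D,#imm16: replace the low 16 bits, keep the high 16."""
    return (d & 0xFFFF0000) | (imm16 & 0xFFFF)


def load_const32(value):
    """Two-instruction sequence that loads a full 32-bit constant."""
    d = 0
    d = ldhi(d, value >> 16)
    d = ldlo(d, value & 0xFFFF)
    return d
```

Because each instruction carries its own immediate, nothing has to stay adjacent to anything else, which is why this pair would survive an LMM loop where a BIG prefix or suffix would not.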
Basically, with what Chip planned for RDWIDE, and the H* instructions, there is no longer any need for LMM - the hardware really will execute out of the data cached from the hub, advancing the PC, fetching the next thunk etc. More details in an hour or two in my thread.
Regarding BIG - I now prefer the post-instruction Chip version, as it also should meet my tricky optimization needs.
BIG not working in LMM is not an issue.
With Chip putting in hubexec, LMM is history.
In a way, I am a bit nostalgic - I came up with LMM to *somehow* execute code from the hub on a P1... but hubexec is so much more elegant, faster, produces smaller code... and is easier to compile to!
Basically, I am thrilled with the possibilities hubexec opens up, and BIG will help it quite nicely
However, if for some reason we don't get hub execution in P2, I'd like to request that we get something like LDHI and LDLO instead of BIG, because they would be useful in LMM mode. BIG would be useless in LMM code: the BIG prefix/suffix would get separated from the instruction it is supposed to affect because of the way the LMM loop works.
In summary:
If we get hub execution, then I'd also like BIG however Chip thinks it best to implement it.
If we don't get hub execution, I'd like LDHI and LDLO in place of BIG or maybe in addition to it.
Sounds good to me!
Edit: This might be a place where the 23 bit internal register would help. Then other COG threads could run along with the hub thread as long as only one thread uses the BIG instruction. And, having four of those registers would mean that the hub thread could run along with up to three COG threads without any interference even if they all used the BIG instruction.
There are ways to extend hubexec to the four hardware tasks, however it would complicate it needlessly for P2. We can explore that for P3, until then, pthreads will do nicely to provide user threads in hubexec mode.
Regarding cog only mode, multi-tasked or not, I don't think we need BIG. Simply place the full 32 bits in a long somewhere in the cog, and reference it.
If you see another use for BIG in cog-only mode (other than LMM), which would not be handled by using a long in cog space, then I am missing something, and would love to see it.
Having said that, the following may be simple enough, and do what you ask:
- in hubexec mode, BIG behaves as per Chip's post
- in cog only mode, BIG writes the value to a known location, I suggest $1F1 (I'd like to reserve $1F0 for the LR you and Eric need for PropGCC)
p.s.
I said other than LMM as the LMM case can be handled by using PTRA or PTRB as the PC.
' MVI reg,#const32
RDLONGC reg,ptra++
' RDLONG reg,#hubaddr23
RDLONGC reg,ptra++
RDLONG reg,reg
The auto increment on ptra++ gets rid of the problem of executing constants.
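A rough Python sketch of why the post-increment avoids executing constants, for an LMM-style loop that uses PTRA as its PC. Everything here is a simplification for illustration: hub memory is a Python list indexed in longs, instructions are tuples, and the `MVI`/`HALT` names are stand-ins rather than real opcodes.

```python
def run_lmm(hub, start):
    """Hypothetical LMM-style loop where ptra plays the role of the PC."""
    regs = {}
    ptra = start
    while True:
        op = hub[ptra]           # fetch the next instruction...
        ptra += 1                # ...with post-increment, like ptra++
        if op[0] == "MVI":
            # RDLONGC reg,ptra++ : read the inline constant that follows
            regs[op[1]] = hub[ptra]
            ptra += 1            # the same increment steps over the constant
        elif op[0] == "HALT":
            return regs, ptra
        else:
            raise ValueError("unknown op in this toy model")


hub = [("MVI", "r0"), 0x12345678,
       ("MVI", "r1"), 0xCAFEBABE,
       ("HALT",)]
regs, ptra = run_lmm(hub, 0)
```

The fetch never lands on a constant, because the read that consumes the constant also moves the PC past it.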
Interrupting the straight-line fetching of hub-resident instructions just to access some hub-dependent semaphore or data can cause a lot of pipeline stalls while the access path switches, partly losing the benefits of the hub execution model itself.
Now I'm really feeling the lack of some multiport message box area, but perhaps it will become a well-agreed P3 feature.
Then we could have up to three cog-resource-dependent tasks peacefully sharing the execution unit with a fully responsive hub-resource-dependent one.
Yanomani
Thanks - now I understand. I think what threw me was the reference to threads, instead of hardware tasks. Or lack of breakfast and more coffee
You are absolutely correct that Chip's approach limits a hubexec cog's use of BIG for other hardware tasks, however as other tasks would interfere with the hub slots available to the hubexec task, it would slow it greatly.
Personally, I don't consider it an issue because:
- I'd like hubexec cogs running as fast as possible - which implies not using the hardware tasks in cogs (that are in hubexec mode). This way, many drivers that won't fit in a cog are possible. Think TCP/IP, USB stack etc.
- for user level threads, pthreads work fine (as demonstrated by propgcc for P1) and without more cache lines would be more efficient at the macro level
- I see the up to four hardware tasks as a way of getting more hardware drivers (that don't need every cycle a single non-tasking cog can provide)
- any cog not running in hubexec mode can run up to four hardware tasks
Thanks, I was worried I was missing a usage case!
If they fit, I can see other uses for LDHI and LDLO.
Thinking some more, this is what you want. Maybe BIG has to be done before, not after. We just need to explain to Chip what we want.
0: INSTR D,#(hubaddr & $1FF)
1: thread1 instr
0: BIG #(hubaddr >> 9)
1: thread1 instr
If this works (BIG before or after) will that remove the LDHI/LDLO requirement?
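For reference, the split used in the snippet above (low 9 bits in the instruction's S field, upper 23 bits in BIG) is what an assembler would do behind the scenes. A minimal Python sketch, with made-up function names:

```python
def split_for_big(value):
    """Split a 32-bit constant as in the snippet: BIG gets value >> 9,
    the instruction's S field gets value & $1FF."""
    value &= 0xFFFFFFFF
    return value >> 9, value & 0x1FF


def combine(big23, s9):
    """What the hardware would reassemble from the two pieces."""
    return ((big23 & 0x7FFFFF) << 9) | (s9 & 0x1FF)


big23, s9 = split_for_big(0x12345678)
```

For $12345678 this gives a BIG operand of $91A2B and an S field of $078, and combining the two returns the original value regardless of whether BIG is issued before or after the instruction.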
Excellent summary.
The reason I am not interested in running other cog tasks in the same cog that has one task in hubexec mode is very simple.
It will slow hubexec down, and cause very heavy octal-cache thrashing if the other tasks use any RDxxxxC or WRxxxxC instructions.
As other cogs can still execute four tasks per cog, I personally find it perfectly reasonable to maximise the performance of the hubexec cog.
This also minimizes the hardware required ... which is why I have not described how to have cogs running each of the four tasks in hubexec mode with pretty good performance... the changes required are non-trivial, and are best postponed for at least a P2.1 if not P3.
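The thrashing argument can be illustrated with a tiny Python simulation. It assumes a single shared 8-long cache line serves both the hubexec instruction fetches and another task's cached reads; that is a simplification for illustration, and the addresses are arbitrary.

```python
def count_line_loads(accesses, line_longs=8):
    """Count reloads of a single-line cache for a sequence of long addresses."""
    cached = None
    loads = 0
    for addr in accesses:
        base = addr - addr % line_longs   # start of the line holding addr
        if base != cached:
            cached = base
            loads += 1
    return loads


code = list(range(0x100, 0x108))  # 8 sequential instruction fetches, one line
data = list(range(0x400, 0x408))  # 8 cached reads by another task, another line

alone = count_line_loads(code)
interleaved = count_line_loads([a for pair in zip(code, data) for a in pair])
```

Running the hubexec task alone needs 1 line load for these 8 instructions; interleaving it with the other task's reads needs 16, one per access, because each side keeps evicting the other's line.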
Ray,
In other news... I have no preference regarding pre-fix or post-fix, however right justifying the big value (ie lowest 23 bits are in the big, the high nine bits in the related instruction), making the big result visible at $1F1, LR fixed at $1F0, and the cache visible at $1E0-$1E7 are quite important for future optimization, Spin, assembly language etc.
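The right-justified layout (lowest 23 bits in BIG, high nine bits in the related instruction) looks like this as a Python sketch. The function names are illustrative only.

```python
def split_right_justified(value):
    """BIG holds the lowest 23 bits, the instruction holds the high 9 bits."""
    value &= 0xFFFFFFFF
    return value & 0x7FFFFF, value >> 23


def combine_right_justified(big23, s9):
    """Reassemble the 32-bit value from the two fields."""
    return ((s9 & 0x1FF) << 23) | (big23 & 0x7FFFFF)


addr_big, addr_s = split_right_justified(0x0001F400)
```

Any value below 2^23, which covers hub addresses, ends up entirely in the BIG part with a zero S field, so a BIG value made visible at a fixed register would already be the usable address without any shifting.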
If the other tasks avoided BIG, and RDxxxxC / WRxxxxC, it would simply work, but would slow down the hubexec code - which would probably still be plenty fast enough for a lot of applications.
Hmmm... I wonder what ozpropdev can stuff in besides a hubexec task...
I think the cache is mappable to any 8*long cog window???
BTW I am hoping that $1F0-$1F1 ultimately get used for AUXA/AUXB just like INDA/INDB are used, but to access AUX rather than COG. Either now or later.
I am not sure if Chip is intending to implement a HUBEXEC mode where the instructions will automatically fill the cache, execute the 8 instructions from the cache window, and then automatically loop back to fill the cache with the next line, etc. Currently we can do this with instructions, but a bit slower. Currently, the compiler may have to avoid the use of BIG/AUGI and XXX instruction pairs spread over a WIDE boundary. Again, little price to pay for the astonishing improvement over LMM.
HJMP/HCALL/HRET:
Am I correctly understanding that currently, we would execute these by saving the new hub address into say INDA, and then jumping to the next
RDWIDE cachewindow, INDA++
and continue to execute from the loaded 8*long cache window?
So what we are trying to do, is make this automatic?
With one, if it does not work, and is not fixed, P2 will still work with old-style LMM.
Besides, pthreads will run very nicely on hubexec, without the memory cache thrashing. It would actually be faster at the macro level than four hubexec tasks (that can be fixed with a lot more hardware, but not in P2)
See my long response to Ray.
Odds are, Chip will improve it more
HCALLx only calls hubexec code, and stores the return address in LR if HCALL, stack A if HCALLA, stack B if HCALLB
HRETx only returns from hubexec code, and returns to the address in LR (HRET), top of stack A (HRETA), top of stack B (HRETB)
This deliberately mirrors the way cog only mode works.
By having LR at $1F0, no need for the linker to be involved, gcc / other compilers can simply emit "HRET" which has an embedded $1F0.
HRET could actually be an alias for HJMP LR, as a simple cog-mode jmp could be used to return to cog mode.
Simple, symmetric, efficient - addresses LR for GCC, and AUX stacking for VM's, assembly language, and other compilers.