Propeller II update - BLOG

David Betz · 2013-12-05 03:45

ozpropdev wrote: »

What about EXTEND or EXPAND?

I don't care about the name BIG. As I mentioned in another post, when we did this at VM Labs we didn't actually require the programmer to code this instruction at all. There may not even have been a mnemonic for it at all. The assembler generated it automatically when the programmer used an immediate constant that wouldn't fit in the original instruction. That eliminates the problems with having to shift values right by 9 in order to form the 23 bit constant. The assembler does that for you.
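A minimal sketch of what such an assembler step might look like, assuming a 9 bit S field and a 23 bit BIG payload as discussed in this thread. The function names and the tuple encoding are hypothetical, and BIG is emitted as a prefix here only for illustration; the thread has not settled on prefix or postfix.

```python
# Hypothetical assembler helper. The field widths come from the thread
# (9-bit S field, 23-bit BIG payload); everything else is illustrative.
S_BITS = 9
S_MASK = (1 << S_BITS) - 1  # $1FF


def emit_with_immediate(mnemonic, dest, value):
    """Return the instruction list for `mnemonic dest, #value`.

    A BIG is generated automatically only when the immediate does not
    fit in the 9-bit S field, so the programmer never writes the shift.
    """
    value &= 0xFFFFFFFF
    if value <= S_MASK:
        return [(mnemonic, dest, value)]
    return [("BIG", None, value >> S_BITS), (mnemonic, dest, value & S_MASK)]


def combine(big_payload, s_field):
    """Rebuild the 32-bit constant the hardware would see."""
    return (big_payload << S_BITS) | s_field


if __name__ == "__main__":
    for instr in emit_with_immediate("XOR", "cogreg", 0x12345678):
        print(instr)
```

Small constants still assemble to a single instruction, so existing code would not grow.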

David Betz · 2013-12-05 05:47

David Betz wrote: »

I don't care about the name BIG. As I mentioned in another post, when we did this at VM Labs we didn't actually require the programmer to code this instruction at all. There may not even have been a mnemonic for it at all. The assembler generated it automatically when the programmer used an immediate constant that wouldn't fit in the original instruction. That eliminates the problems with having to shift values right by 9 in order to form the 23 bit constant. The assembler does that for you.

In thinking about this more I realize now why I was thinking that BIG would be useful even in LMM mode. It's because I originally proposed two new instructions, LDHI and LDLO. These instructions would take a 16 bit immediate argument and a D register and would load the high and low 16 bits of the register respectively. That would allow a 32 bit constant to be loaded in two instructions and would work for LMM as well as with the "hub execution" feature. The BIG instruction is really only useful in "hub execution" mode.

The other proposal was:

LDHI D, #$nnnn
LDLO D, #$nnnn

I suspect these instructions might be harder to fit into the instruction encoding though and they are less flexible than the BIG instruction which can extend the S field for any instruction. Unfortunately, the BIG instruction won't work in LMM mode because the BIG instruction and the instruction it is intended to affect will be separated by the LMM execution loop.
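A rough model of how the LDHI/LDLO pair would build a 32 bit constant. It assumes each instruction replaces only its own half of D and leaves the other half untouched; the proposal above does not spell that out, so treat it as an assumption.

```python
# Sketch of the proposed LDHI / LDLO semantics (assumed, not confirmed):
# each one writes a 16-bit immediate into one half of the D register
# and preserves the other half.

def ldhi(reg, imm16):
    """LDHI D, #$nnnn : replace the high 16 bits of D."""
    return ((imm16 & 0xFFFF) << 16) | (reg & 0xFFFF)


def ldlo(reg, imm16):
    """LDLO D, #$nnnn : replace the low 16 bits of D."""
    return (reg & 0xFFFF0000) | (imm16 & 0xFFFF)


def load_const32(value):
    """Load a full 32-bit constant with the two-instruction sequence."""
    reg = 0
    reg = ldhi(reg, value >> 16)
    reg = ldlo(reg, value & 0xFFFF)
    return reg


if __name__ == "__main__":
    print(hex(load_const32(0xDEADBEEF)))
```

Because each instruction carries its own immediate, nothing depends on two instructions staying adjacent, which is why the pair would survive an LMM loop.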

Bill Henning · 2013-12-05 07:11

I'll update post#1 in the other thread after breakfast... Chip also freed three more full D/S opcodes!

Basically, with what Chip planned for RDWIDE, and the H* instructions, there is no longer any need for LMM - the hardware really will execute out of the data cached from the hub, advancing the PC, fetching the next thunk etc. More details in an hour or two in my thread.

Regarding BIG - I now prefer the post-instruction Chip version, as it also should meet my tricky optimization needs.

Cluso99 wrote: »
That is what I was trying to convey, with some optional extensions. Does it need to be for the immediately following instruction? I know that is your current use, but as long as it gets set, and when it gets used, it would be zero'd.

Here are a couple of usage examples...
  SETBIG #(hubadr >> 9)
  RDWORD cogreg, #(hubadr & $1FF)   'read word into cogreg from hubadr
...
  SETBIG #(longvalue >> 9)
  XOR    cogreg, #(longvalue & $1FF)  'xor cogreg with longvalue
Of course both these could be done differently. David or Bill will need to explain their actual intended use.

I have found at least 2. One will be for the "BIG" instruction - need a better name IMHO.
That's providing Chip hasn't found a use for those instruction slots

You will need to explain to me more about HJMP/HCALL - I will take another look at the Hub Execution Model thread.

Bill Henning · 2013-12-05 07:15

David,

BIG not working in LMM is not an issue.

With Chip putting in hubexec, LMM is history.

In a way, I am a bit nostalgic - I came up with LMM to *somehow* execute code from the hub on a P1... but hubexec is so much more elegant, faster, produces smaller code... and easier to compile to!

Basically, I am thrilled with the possibilities hubexec opens up, and BIG will help it quite nicely

David Betz wrote: »

In thinking about this more I realize now why I was thinking that BIG would be useful even in LMM mode. It's because I originally proposed two new instructions, LDHI and LDLO. These instructions would take a 16 bit immediate argument and a D register and would load the high and low 16 bits of the register respectively. That would allow a 32 bit constant to be loaded in two instructions and would work for LMM as well as with the "hub execution" feature. The BIG instruction is really only useful in "hub execution" mode.

The other proposal was:

LDHI D, #$nnnn
LDLO D, #$nnnn

I suspect these instructions might be harder to fit into the instruction encoding though and they are less flexible than the BIG instruction which can extend the S field for any instruction. Unfortunately, the BIG instruction won't work in LMM mode because the BIG instruction and the instruction it is intended to affect will be separated by the LMM execution loop.

David Betz · 2013-12-05 07:18

Bill Henning wrote: »

Regarding BIG - I now prefer the post-instruction Chip version, as it also should meet my tricky optimization needs.

I really don't care if it is prefix or postfix as long as it works. I think the prefix instruction has the slight advantage that by adding four 23 bit registers you could get it to work in multiple threads at the same time, but since we've pretty much decided that BIG isn't useful except in the presence of hub execution, and that is limited to a single thread, there is no real advantage to the 23 bit register.

However, if for some reason we don't get hub execution in P2, I'd like to request that we get something like LDHI and LDLO instead of BIG because it will be useful in LMM mode where BIG would be useless in LMM code since the BIG prefix/suffix would get separated from the instruction it is supposed to affect because of the way the LMM loop works.

In summary:

If we get hub execution, then I'd also like BIG however Chip thinks it best to implement it.

If we don't get hub execution, I'd like LDHI and LDLO in place of BIG or maybe in addition to it.

David Betz · 2013-12-05 07:19

Bill Henning wrote: »

In a way, I am a bit nostalgic - I came up with LMM to *somehow* execute code from the hub on a P1... but hubexec is so much more elegant, faster, produces smaller code... and easier to compile to!

LMM was a big help for P1 even if it does become obsolete in P2 or P3.

Bill Henning · 2013-12-05 07:20

David Betz wrote: »

I really don't care if it is prefix or postfix as long as it works. I think the prefix instruction has the slight advantage that by adding four 23 bit registers you could get it to work in multiple threads at the same time, but since we've pretty much decided that BIG isn't useful except in the presence of hub execution, and that is limited to a single thread, there is no real advantage to the 23 bit register.

However, if for some reason we don't get hub execution in P2, I'd like to request that we get something like LDHI and LDLO instead of BIG because it will be useful in LMM mode where BIG would be useless in LMM code since the BIG prefix/suffix would get separated from the instruction it is supposed to affect because of the way the LMM loop works.

In summary:

If we get hub execution, then I'd also like BIG however Chip thinks it best to implement it.

If we don't get hub execution, I'd like LDHI and LDLO in place of BIG or maybe in addition to it.

Sounds good to me!

David Betz · 2013-12-05 07:27

One more point about BIG and threads. It seems to me that we won't be able to support any other threads if we're using hub execution mode even if those threads are just executing code in COG memory since the BIG feature will get confused by the other threads sharing the pipeline. So any COG that is executing code from hub memory can only have a single thread running.

Edit: This might be a place where the 23 bit internal register would help. Then other COG threads could run along with the hub thread as long as only one thread uses the BIG instruction. And, having four of those registers would mean that the hub thread could run along with up to three COG threads without any interference even if they all used the BIG instruction.
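A toy simulation of the concern, under stated assumptions: tasks are issued round-robin, BIG is modelled as a prefix for simplicity (a postfix version would be the mirror image), and "adjacent" means BIG only affects whatever sits in the next pipeline slot. All function names are hypothetical.

```python
# Toy model only: real pipeline timing is not represented.

def interleave(streams):
    """Round-robin issue order, as (task, instruction) pairs."""
    order = []
    longest = max(len(s) for s in streams)
    for i in range(longest):
        for task, stream in enumerate(streams):
            if i < len(stream):
                order.append((task, stream[i]))
    return order


def run_adjacent(order):
    """BIG affects the instruction in the very next slot, whoever owns it."""
    results = []
    pending = None
    for task, (op, imm) in order:
        if op == "BIG":
            pending = imm
        else:
            high = pending if pending is not None else 0
            results.append((task, op, (high << 9) | imm))
            pending = None
    return results


def run_per_task(order, tasks=4):
    """Each task has its own hidden 23-bit register, cleared after use."""
    latched = [0] * tasks
    results = []
    for task, (op, imm) in order:
        if op == "BIG":
            latched[task] = imm
        else:
            results.append((task, op, (latched[task] << 9) | imm))
            latched[task] = 0
    return results


if __name__ == "__main__":
    hub_task = [("BIG", 0x12345678 >> 9), ("MOV", 0x12345678 & 0x1FF)]
    cog_task = [("ADD", 1), ("ADD", 2)]
    order = interleave([hub_task, cog_task])
    print(run_adjacent(order))   # the COG task's ADD picks up the BIG bits
    print(run_per_task(order))   # each task sees its own constant
```

In the adjacent model the COG task's first ADD absorbs the hub task's BIG payload; with one register per task the streams stay independent.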

Bill Henning · 2013-12-05 07:54

David,

There are ways to extend hubexec to the four hardware tasks, however it would complicate it needlessly for P2. We can explore that for P3; until then, pthreads will do nicely to provide user threads in hubexec mode.

Regarding cog only mode, multi-tasked or not, I don't think we need BIG. Simply place the full 32 bits in a long somewhere in the cog, and reference it.

If you see another use for BIG in cog-only mode (other than LMM), which would not be handled by using a long in cog space, then I am missing something, and would love to see it.

Having said that, the following may be simple enough, and do what you ask:

- in hubexec mode, BIG behaves as per Chip's post
- in cog only mode, BIG writes the value to a known location, I suggest $1F1 (I'd like to reserve $1F0 for the LR you and Eric need for PropGCC)

p.s.

I said other than LMM as the LMM case can be handled by using PTRA or PTRB as the PC.

' MVI reg,#const32

RDLONGC reg,ptra++

' RDLONG reg,#hubaddr23

RDLONGC reg,ptra++
RDLONG reg,reg

The auto increment on ptra++ gets rid of the problem of executing constants.
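To illustrate why the auto-increment avoids executing constants, here is a small simulation of an LMM-style loop that uses a PTRA-like pointer as its program counter. The tuple instruction format and the `run_lmm` name are made up for this sketch; only the `RDLONGC reg,ptra++` idea comes from the post.

```python
# Hypothetical LMM loop model. Hub memory is a list whose entries are
# either instruction tuples or raw inline 32-bit constants.

def run_lmm(hub, start=0):
    """Run until HALT; return (registers, list of fetched instructions)."""
    regs = {}
    ptra = start
    executed = []
    while True:
        instr = hub[ptra]          # loop fetch, like RDLONGC instr, ptra++
        ptra += 1
        executed.append(instr)
        op = instr[0]
        if op == "HALT":
            break
        if op == "MVI":            # RDLONGC reg, ptra++ reads the constant
            regs[instr[1]] = hub[ptra]
            ptra += 1              # ...and steps the PC past it
    return regs, executed


if __name__ == "__main__":
    program = [
        ("MVI", "r0"), 0x12345678,
        ("MVI", "r1"), 0xCAFEBABE,
        ("HALT",),
    ]
    regs, executed = run_lmm(program)
    print(regs)
    print(executed)
```

The inline constants are consumed as data by the MVI step, so the fetch side of the loop never sees them.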

David Betz wrote: »

One more point about BIG and threads. It seems to me that we won't be able to support any other threads if we're using hub execution mode even if those threads are just executing code in COG memory since the BIG feature will get confused by the other threads sharing the pipeline. So any COG that is executing code from hub memory can only have a single thread running.

Edit: This might be a place where the 23 bit internal register would help. Then other COG threads could run along with the hub thread as long as only one thread uses the BIG instruction. And, having four of those registers would mean that the hub thread could run along with up to three COG threads without any interference even if they all used the BIG instruction.

Yanomani · 2013-12-05 08:09

I anticipate a trend toward increased port D usage in the Hub execution model, driven by the need to avoid as many Hub operations as possible, other than the WIDE instruction fetches, in order to take full advantage of the available slots.
Interrupting the straight-line gathering of Hub-resident instructions just to access some Hub-dependent semaphore or data can cause a lot of pipeline stalls while the access route switches, partly losing the benefits of the Hub execution model itself.
Now I'm really feeling the lack of some multiport message box area, but perhaps it will become a well-agreed P3 feature.
Then we could have up to three Cog-resource-dependent tasks peacefully sharing the execution unit with a fully responsive Hub-resource-dependent one.

Yanomani

David Betz · 2013-12-05 08:22

Bill Henning wrote: »

David,

There are ways to extend hubexec to the four hardware tasks, however it would complicate it needlessly for P2. We can explore that for P3; until then, pthreads will do nicely to provide user threads in hubexec mode.

Regarding cog only mode, multi-tasked or not, I don't think we need BIG. Simply place the full 32 bits in a long somewhere in the cog, and reference it.

I think you're missing what I was trying to say. If BIG is implemented the way Chip suggests, I don't think it will be possible to have more than one thread running on a COG that is executing code from hub memory. Is that acceptable? I would think it would be nice to be able to run one hub mode thread and one or more COG mode threads at the same time. I don't think that will be possible using Chip's postfix approach since that relies on finding the BIG instruction in the pipeline and with multiple threads the pipeline will be filled with instructions from other threads not the next instruction in the current thread. The 23 bit register idea could get around this and allow COG threads to coexist with a single hub thread.

If you see another use for BIG in cog-only mode (other than LMM), which would not be handled by using a long in cog space, then I am missing something, and would love to see it.

No, I don't see a use for BIG in COG mode. That's why I asked for LDHI and LDLO if we for some reason don't get hub execution in P2.

Bill Henning · 2013-12-05 08:36

David Betz wrote: »

I think you're missing what I was trying to say. If BIG is implemented the way Chip suggests, I don't think it will be possible to have more than one thread running on a COG that is executing code from hub memory. Is that acceptable? I would think it would be nice to be able to run one hub mode thread and one or more COG mode threads at the same time. I don't think that will be possible using Chip's postfix approach since that relies on finding the BIG instruction in the pipeline and with multiple threads the pipeline will be filled with instructions from other threads not the next instruction in the current thread. The 23 bit register idea could get around this and allow COG threads to coexist with a single hub thread.

Thanks - now I understand. I think what threw me was the reference to threads, instead of hardware tasks. Or lack of breakfast and more coffee

You are absolutely correct that Chip's approach limits a hubexec cog's use of BIG for other hardware tasks, however as other tasks would interfere with the hub slots available to the hubexec task, it would slow it greatly.

Personally, I don't consider it an issue because:

- I'd like hubexec cogs running as fast as possible - which implies not using the hardware tasks in cogs (that are in hubexec mode). This way, many drivers that won't fit in a cog are possible. Think TCP/IP, USB stack etc.
- for user level threads, pthreads work fine (as demonstrated by propgcc for P1) and without more cache lines would be more efficient at the macro level
- I see the up to four hardware tasks as a way of getting more hardware drivers (that don't need every cycle a single non-tasking cog can provide)
- any cog not running in hubexec mode can run up to four hardware tasks

David Betz wrote: »

No, I don't see a use for BIG in COG mode. That's why I asked for LDHI and LDLO if we for some reason don't get hub execution in P2.

Thanks, I was worried I was missing a usage case!

If they fit, I can see other uses for LDHI and LDLO.

David Betz · 2013-12-05 08:38

Bill Henning wrote: »

Thanks - now I understand. I think what threw me was the reference to threads, instead of hardware tasks. Or lack of breakfast and more coffee

Sorry I got the terminology wrong. I guess I'd better have some more coffee myself!

Cluso99 · 2013-12-05 13:38

OK, think I can fit the BIG and LDHI & LDLO into the one instruction. BUT

Thinking some more, this is what you want. Maybe BIG has to be done before, not after. We just need to explain to Chip what we want.
0: INSTR D,#(hubaddr & $1FF)
1: thread1 instr
0: BIG #(hubaddr >> 9)
1: thread1 instr

If this works (BIG before or after) will that remove the LDHI/LDLO requirement?

David Betz · 2013-12-05 13:54

Cluso99 wrote: »

OK, think I can fit the BIG and LDHI & LDLO into the one instruction. BUT

Thinking some more, this is what you want. Maybe BIG has to be done before, not after. We just need to explain to Chip what we want.
0: INSTR D,#(hubaddr & $1FF)
1: thread1 instr
0: BIG #(hubaddr >> 9)
1: thread1 instr

If this works (BIG before or after) will that remove the LDHI/LDLO requirement?

I don't think LDHI and LDLO are needed at all unless Chip decides not to do the hub execute mode for P2. Also, I think you're right that BIG has to be before the instruction it modifies if we're going to allow it to be used with multiple hardware tasks. However, Bill seems to think it would be okay to restrict the use of hub execution mode and the BIG instruction to COGs that only execute a single hardware task. In that case, Chip's solution will work just as well. Also, if you really want to support the use of BIG in multiple tasks, you'd need to also have 4 copies of the hidden register that remembers the 23 bits from the most recent BIG instruction. I think it is probably not very important to support BIG in more than one hardware task but I think it would be good to support it in one task running in hub execute mode and to also allow other COG mode tasks to run at the same time.

Bill Henning · 2013-12-05 14:02

Ray & David,

Excellent summary.

The reason I am not interested in running other cog tasks in the same cog that has one task in hubexec mode is very simple.

It will slow hubexec down, and cause very heavy octal-cache thrashing if the other tasks use any RDxxxxC or WRxxxxC instructions.

As other cogs can still execute four tasks per cog, I personally find it perfectly reasonable to maximise the performance of the hubexec cog.

This also minimizes the hardware required

... which is why I have not described how to have cogs running each of the four tasks in hubexec mode with pretty good performance... the changes required are non-trivial, and are best postponed for at least a P2.1 if not P3.

Ray,

In other news... I have no preference regarding pre-fix or post-fix, however right justifying the big value (ie lowest 23 bits are in the big, the high nine bits in the related instruction), making the big result visible at $1F1, LR fixed at $1F0, and the cache visible at $1E0-$1E7 are quite important for future optimization, Spin, assembly language etc.

David Betz wrote: »

I don't think LDHI and LDLO are needed at all unless Chip decides not to do the hub execute mode for P2. Also, I think you're right that BIG has to be before the instruction it modifies if we're going to allow it to be used with multiple hardware tasks. However, Bill seems to think it would be okay to restrict the use of hub execution mode and the BIG instruction to COGs that only execute a single hardware task. In that case, Chip's solution will work just as well. Also, if you really want to support the use of BIG in multiple tasks, you'd need to also have 4 copies of the hidden register that remembers the 23 bits from the most recent BIG instruction. I think it is probably not very important to support BIG in more than one hardware task but I think it would be good to support it in one task running in hub execute mode and to also allow other COG mode tasks to run at the same time.

David Betz · 2013-12-05 14:05

Bill Henning wrote: »

Excellent summary.

The reason I am not interested in running other cog tasks in the same cog that has one task in hubexec mode is very simple.

It will slow hubexec down, and cause very heavy octal-cache thrashing if the other tasks use any RDxxxxC or WRxxxxC instructions.

As other cogs can still execute four tasks per cog, I personally find it perfectly reasonable to maximise the performance of the hubexec cog.

This also minimizes the hardware required ... which is why I have not described how to have cogs running each of the four tasks in hubexec mode with pretty good performance... the changes required are non-trivial, and are best postponed for at least a P2.1 if not P3.

I understand the issues with running multiple hub mode tasks at the same time. The only thing I was suggesting is that it might be nice to be able to run one hub mode task and up to three COG mode tasks at the same time. As you say though, the COG mode tasks would have to avoid the RDxxxxC and WRxxxxC instructions.

Bill Henning · 2013-12-05 14:10

I agree, that would be useful - it is just that I worry about people ignoring the advice to avoid RDxxxxC, WRxxxxC, and other hub access, and bringing the hubexec task to its knees.

David Betz wrote: »

I understand the issues with running multiple hub mode tasks at the same time. The only thing I was suggesting is that it might be nice to be able to run one hub mode task and up to three COG mode tasks at the same time. As you say though, the COG mode tasks would have to avoid the RDxxxxC and WRxxxxC instructions.

David Betz · 2013-12-05 14:12

Bill Henning wrote: »

I agree, that would be useful - it is just that I worry about people ignoring the advice to avoid RDxxxxC, WRxxxxC, and other hub access, and bringing the hubexec task to its knees.

That may happen anyway, whether or not we try to allow multiple hardware tasks while one is running in hub mode. In fact, there isn't likely to be anything in the hardware that would prevent trying to do multiple hub mode tasks. It just won't work very well if at all.

Bill Henning · 2013-12-05 14:25

True!

If the other tasks avoided BIG, and RDxxxxC / WRxxxxC, it would simply work, but would slow down the hubexec code - which would probably still be plenty fast enough for a lot of applications.

Hmmm... I wonder what ozpropdev can stuff in besides a hubexec task...

David Betz wrote: »

That may happen anyway, whether or not we try to allow multiple hardware tasks while one is running in hub mode. In fact, there isn't likely to be anything in the hardware that would prevent trying to do multiple hub mode tasks. It just won't work very well if at all.

David Betz · 2013-12-05 14:34

Bill Henning wrote: »

True!

If the other tasks avoided BIG, and RDxxxxC / WRxxxxC, it would simply work, but would slow down the hubexec code - which would probably still be plenty fast enough for a lot of applications.

Hmmm... I wonder what ozpropdev can stuff in besides a hubexec task...

Aside from BIG issues, wouldn't it also work to have multiple hubexec tasks? I know it would be horribly slow but if instruction fetch just does the same thing as RDLONGC, the only problem would be that the cache would be invalidated on every instruction fetch so all of the hubexec tasks would revert to hub speed. The same would be true if one of the COG tasks used one of the RDxxxxC or WRxxxxC instructions. That would just force a new 8-long fetch on the next instruction fetch. Or am I missing something?
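The thrashing effect described here can be approximated with a tiny model: a single 8-long cache line, refilled whenever a fetch falls outside the currently cached line. This assumes instruction fetch behaves like RDLONGC, as the post supposes; the addresses and names are illustrative only.

```python
# Simplified model: one shared 8-long cache line, addresses in longs.
LINE_LONGS = 8


def count_refills(fetch_addresses):
    """Count how many 8-long line fills a fetch sequence would trigger."""
    refills = 0
    cached_line = None
    for addr in fetch_addresses:
        line = addr // LINE_LONGS
        if line != cached_line:
            refills += 1
            cached_line = line
    return refills


# One hubexec task fetching 16 sequential longs.
one_task = list(range(0, 16))

# Two hubexec tasks alternating fetches from different hub regions.
two_tasks = [a for pair in zip(range(0, 16), range(1000, 1016)) for a in pair]

if __name__ == "__main__":
    print(count_refills(one_task))   # one fill per 8 longs
    print(count_refills(two_tasks))  # one fill per fetch
```

With a single task the line is refilled once per eight instructions; with two interleaved tasks every fetch evicts the other task's line, so each instruction costs a hub access.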

Cluso99 · 2013-12-05 14:53

Bill Henning wrote: »

Ray & David,

Excellent summary.

The reason I am not interested in running other cog tasks in the same cog that has one task in hubexec mode is very simple.

It will slow hubexec down, and cause very heavy octal-cache thrashing if the other tasks use any RDxxxxC or WRxxxxC instructions.

As other cogs can still execute four tasks per cog, I personally find it perfectly reasonable to maximise the performance of the hubexec cog.

This also minimizes the hardware required ... which is why I have not described how to have cogs running each of the four tasks in hubexec mode with pretty good performance... the changes required are non-trivial, and are best postponed for at least a P2.1 if not P3.

My thoughts too.

Ray,

In other news... I have no preference regarding pre-fix or post-fix, however right justifying the big value (ie lowest 23 bits are in the big, the high nine bits in the related instruction), making the big result visible at $1F1, LR fixed at $1F0, and the cache visible at $1E0-$1E7 are quite important for future optimization, Spin, assembly language etc.

Do you really mean lowest 23 bits in BIG and 9 high in #S/#D?

I think the cache is mappable to any 8*long cog window???

BTW I am hoping that $1F0-$1F1 ultimately get used for AUXA/AUXB just like INDA/INDB are used, but to access AUX rather than COG. Either now or later.

I am not sure if Chip is intending to implement a HUBEXEC mode where the instructions will automatically fill the cache, execute the 8 instructions from the cache window, and then automatically loop back to fill the cache with the next line, etc. Currently we can do this with instructions, but a bit slower. Currently, the compiler may have to avoid the use of BIG/AUGI and XXX instruction pairs spread over a WIDE boundary. Again, little price to pay for the astonishing improvement over LMM.

HJMP/HCALL/HRET:
Am I correctly understanding that currently, we would execute these by saving the new hub address into say INDA, and then jumping to the next
RDWIDE cachewindow, INDA++
and continue to execute from the loaded 8*long cache window?
So what we are trying to do, is make this automatic?

Bill Henning · 2013-12-05 14:57

The reason I was not asking for each task to be able to hubexec is that it would require 4x(BIG,mapped 8-long cache,LR) - and more logic, multiplexers etc. I have many ideas for P2.1++++ however I am trying to minimize the risk for P2 even. The risk goes way up if we need four of the above.

With one, if it does not work, and is not fixed, P2 will still work with old-style LMM.

Besides, pthreads will run very nicely on hubexec, without the memory cache thrashing. It would actually be faster at the macro level than four hubexec tasks (that can be fixed with a lot more hardware, but not in P2)

David Betz wrote: »

Aside from BIG issues, wouldn't it also work to have multiple hubexec tasks? I know it would be horribly slow but if instruction fetch just does the same thing as RDLONGC, the only problem would be that the cache would be invalidated on every instruction fetch so all of the hubexec tasks would revert to hub speed. The same would be true if one of the COG tasks used one of the RDxxxxC or WRxxxxC instructions. That would just force a new 8-long fetch on the next instruction fetch. Or am I missing something?

David Betz · 2013-12-05 15:00

Cluso99 wrote: »

My thoughts too.

Do you really mean lowest 23 bits in BIG and 9 high in #S/#D?

I think the cache is mappable to any 8*long cog window???

BTW I am hoping that $1F0-$1F1 ultimately get used for AUXA/AUXB just like INDA/INDB are used, but to access AUX rather than COG. Either now or later.

I am not sure if Chip is intending to implement a HUBEXEC mode where the instructions will automatically fill the cache, execute the 8 instructions from the cache window, and then automatically loop back to fill the cache with the next line, etc. Currently we can do this with instructions, but a bit slower. Currently, the compiler may have to avoid the use of BIG/AUGI and XXX instruction pairs spread over a WIDE boundary. Again, little price to pay for the astonishing improvement over LMM.

HJMP/HCALL/HRET:
Am I correctly understanding that currently, we would execute these by saving the new hub address into say INDA, and then jumping to the next
RDWIDE cachewindow, INDA++
and continue to execute from the loaded 8*long cache window?
So what we are trying to do, is make this automatic?

My original idea was to just use PC and add enough bits to allow it to address every long in hub memory. Any address between 0x0 and 0x7ff would be treated as a COG address and any other address would be treated as a hub address. That means that you can't execute code from the first 2k of hub memory but I don't think that's a big problem. Then, any PC fetch that was >= 0x800 would be handled as if it were a RDLONGC making use of the cache to speed up execution. It would all be automatic. However, Bill pointed out that in that model a CALL instruction would possibly have to store a hub address in the 9 bits of the corresponding RET instruction, obviously impossible. So we end up using PTRA instead of PC when executing from hub. I would still expect the cache filling to happen automatically as if a RDLONGC had been executed for each instruction. Is that difficult?
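A sketch of the address decode in that original idea. The 0x800 boundary is from the post; interpreting the PC as a byte address, so that a COG register index is the address shifted right by two, is an assumption made here for illustration.

```python
# Hypothetical decode for the "extended PC" idea: low addresses are COG,
# everything else is hub and would go through the RDLONGC-style cache.
COG_LIMIT = 0x800  # 0x000..0x7FF are treated as COG addresses (from the post)


def classify_fetch(pc):
    """Return ('cog', register_index) or ('hub', byte_address)."""
    if pc < COG_LIMIT:
        return ("cog", pc >> 2)   # assumed: byte address -> long index
    return ("hub", pc)


if __name__ == "__main__":
    for pc in (0x000, 0x7FC, 0x800, 0x1234):
        print(hex(pc), classify_fetch(pc))
```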

Bill Henning · 2013-12-05 15:12

Cluso99 wrote: »

My thoughts too.

:-)

Cluso99 wrote: »

Do you really mean lowest 23 bits in BIG and 9 high in #S/#D?

Yes. See my extensive discussion with David about this; lowest 23 in BIG can save a surprising amount of memory for assembly programmers and smart optimizing compilers.

I am certain gcc could use it too, but it is not worth the effort to support it in GCC in the first pass.
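For comparison, the two ways of splitting a 32 bit value between BIG and the 9 bit instruction field can be written out as below. The function names are hypothetical; the bit widths are the ones used in this thread.

```python
# Two candidate layouts for a 32-bit value split across BIG (23 bits)
# and the instruction's 9-bit immediate field.

def split_big_high(value):
    """BIG holds the upper 23 bits (value >> 9), as in the SETBIG examples."""
    return value >> 9, value & 0x1FF


def join_big_high(big, field):
    return (big << 9) | field


def split_big_low(value):
    """Right-justified: BIG holds the lower 23 bits, the field the top 9."""
    return value & 0x7FFFFF, value >> 23


def join_big_low(big, field):
    return (field << 23) | big


if __name__ == "__main__":
    addr = 0x3FFFF
    print(split_big_high(addr))  # both parts are non-zero
    print(split_big_low(addr))   # instruction field is zero
```

One visible consequence of the right-justified layout is that any value below 2^23 sits entirely in BIG, leaving the instruction's 9 bit field at zero.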

Cluso99 wrote: »

I think the cache is mappable to any 8*long cog window???

That's what Chip intends, but there are many advantages for hubex mode to keep it at $1E0-$1E7

When used to feed the video engine, or other non-hubex use, any place is valid.

Advantages:

- leaves $000-$1DF free for "FCACHE/FLIB" like use
- totally predictable known address for compiler, see tricks for saving hub code space in my discussion with David

Cluso99 wrote: »

BTW I am hoping that $1F0-$1F1 ultimately get used for AUXA/AUXB just like INDA/INDB are used, but to access AUX rather than COG. Either now or later.

I'd like that too, in which case, $1EF and $1EE could be used for BIG and LR, or chip could add SETBIG regno and SETLR regno. I wonder if it is worth it for P2, as it changes known working (on FPGA) instructions. I leave that up to Chip :-)

The reasons I prefer $1F0/$1F1 for now are:

- known addresses, no linking issues, better for pasm code if they are at a fixed known address
- code in the 8 long window knows where to find the last assembled BIG value
- eliminates need for SETLR / SETBIG

Having them as exposed registers at known locations allows a lot of very nice optimization tricks.

Cluso99 wrote: »

I am not sure if Chip is intending to implement a HUBEXEC mode where the instructions will automatically fill the cache, execute the 8 instructions from the cache window, and then automatically loop back to fill the cache with the next line, etc.

I distinctly remember Chip posting exactly that, auto refilling. Heck, he can schedule a refill at the first instruction, and postpone it if the instructions in the 8-long need hub cycles. I leave the implementation to Chip

Cluso99 wrote: »

Currently we can do this with instructions, but a bit slower. Currently, the compiler may have to avoid the use of BIG/AUGI and XXX instruction pairs spread over a WIDE boundary. Again, little price to pay for the astonishing improvement over LMM.

Agreed, but no need to pay the price if BIG is a prefix; then it does not matter if the 8-long cache is reloaded before the instruction that completes the upper bits.

Cluso99 wrote: »

HJMP/HCALL/HRET:
Am I correctly understanding that currently, we would execute these by saving the new hub address into say INDA, and then jumping to the next
RDWIDE cachewindow, INDA++
and continue to execute from the loaded 8*long cache window?
So what we are trying to do, is make this automatic?

Not quite.

The cog's PC is expanded to 16 bits (low two bits are implied and zero for hub access), and yes, this makes it automatic

HJMP
- switch to hubexec mode if in cog mode
- load PC with embedded constant (16 bits, plus two implied zero)
- load cache if necessary, then start executing at the index within the cache indicated by the 16 bit address
- when you run past the 8 long cache, fetch next 8 longs (this may be overlapped with executing non-hub instructions in the cache)

HCALL/HCALLA/HCALLB

- save PC+1 (in long addresses, scaled) to LR / push on A stack / push on B stack (depends on op code)
- rest is the same as HJMP

HRET / HRETA / HRETB

- load PC with return address from one of LR / top of A stack / top of B stack (depends on op code)
- continue executing hub code

This way, there is no need to use PTRA as a PC, and takes less logic. And automagic reloads
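The control-flow rules listed above can be modelled roughly as follows. This is a sketch of the described semantics only: cache refills and the cog/hub address encoding are left out, the PC is kept as a plain long address, and the class and method names are invented for the example.

```python
# Rough model of the proposed HJMP / HCALLx / HRETx behaviour.

class HubExecModel:
    def __init__(self):
        self.pc = 0            # long address of the current instruction
        self.hub_mode = False
        self.lr = 0            # link register (suggested at $1F0)
        self.stack_a = []
        self.stack_b = []

    def hjmp(self, target):
        """Switch to hubexec mode if needed and load the PC."""
        self.hub_mode = True
        self.pc = target

    def hcall(self, target, link="LR"):
        """Save PC+1 to LR, stack A or stack B, then behave like HJMP."""
        ret = self.pc + 1
        if link == "LR":
            self.lr = ret
        elif link == "A":
            self.stack_a.append(ret)
        elif link == "B":
            self.stack_b.append(ret)
        else:
            raise ValueError("link must be 'LR', 'A' or 'B'")
        self.hjmp(target)

    def hret(self, link="LR"):
        """Load PC from LR, top of stack A or top of stack B."""
        if link == "LR":
            self.pc = self.lr
        elif link == "A":
            self.pc = self.stack_a.pop()
        elif link == "B":
            self.pc = self.stack_b.pop()
        else:
            raise ValueError("link must be 'LR', 'A' or 'B'")


if __name__ == "__main__":
    m = HubExecModel()
    m.hjmp(0x100)
    m.hcall(0x200, "A")   # nested call via stack A
    m.hcall(0x300)        # leaf call via LR
    m.hret()
    m.hret("A")
    print(hex(m.pc))
```

Using LR for leaf calls and a stack for nested ones is how the model keeps the two return paths from overwriting each other.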

Bill Henning · 2013-12-05 15:13

That was similar to my original thought, but a bit more thinking made it better

See my long response to Ray.

Odds are, Chip will improve it more

David Betz wrote: »

My original idea was to just use PC and add enough bits to allow it to address every long in hub memory. Any address between 0x0 and 0x7ff would be treated as a COG address and any other address would be treated as a hub address. That means that you can't execute code from the first 2k of hub memory but I don't think that's a big problem. Then, any PC fetch that was >= 0x800 would be handled as if it were a RDLONGC making use of the cache to speed up execution. It would all be automatic. However, Bill pointed out that in that model a CALL instruction would possibly have to store a hub address in the 9 bits of the corresponding RET instruction, obviously impossible. So we end up using PTRA instead of PC when executing from hub. I would still expect the cache filling to happen automatically as if a RDLONGC had been executed for each instruction. Is that difficult?

David Betz · 2013-12-05 15:19

Bill Henning wrote: »

This way there is no need to use PTRA as a PC, and it takes less logic. And the reloads are automagic.

If you do this you won't be able to use CALL to call COG resident code since there isn't enough space in the 9 bit S field of the RET instruction for a full hub address. I guess you could store the address of the instruction in the 8-long window instead of what is in the PC but that's kind of a big change. Also, as we discussed before, it won't work if the CALL instruction is in the last long of the 8-long window.

Bill Henning · 2013-12-05 15:25

CALL only calls cog code by definition, and is an alias for JMPRET - so it will work. This is why I resisted the siren song of making a bigger change.

HCALLx only calls hubexec code, and stores the return address in LR (HCALL), on stack A (HCALLA), or on stack B (HCALLB).

HRETx only returns from hubexec code, and returns to the address in LR (HRET), on top of stack A (HRETA), or on top of stack B (HRETB).

This deliberately mirrors the way cog only mode works.

With LR at $1F0 there is no need for the linker to be involved; gcc and other compilers can simply emit "HRET", which has an embedded $1F0.

HRET could actually be an alias for HJMP LR, as a simple cog-mode jmp could be used to return to cog mode.

Simple, symmetric, efficient - addresses LR for GCC, and AUX stacking for VMs, assembly language, and other compilers.
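To make the three call/return flavours concrete, here is a small Python model of where each variant keeps its return address. It is a sketch under stated assumptions: the class CallModel and its method names are invented for the example, and the A and B stacks are modelled as plain lists rather than real AUX stack hardware.

```python
class CallModel:
    """Toy model of HCALL/HCALLA/HCALLB and HRET/HRETA/HRETB (hypothetical)."""

    def __init__(self):
        self.pc = 0          # 16-bit long address of the current instruction
        self.lr = 0          # link register (proposed to live at $1F0)
        self.stack_a = []    # stand-in for stack A
        self.stack_b = []    # stand-in for stack B

    def hcall(self, target, variant=""):
        """Save PC+1 according to the variant, then jump like HJMP."""
        ret = (self.pc + 1) & 0xFFFF
        if variant == "":
            self.lr = ret                 # HCALL: return address goes to LR
        elif variant == "A":
            self.stack_a.append(ret)      # HCALLA: push on stack A
        elif variant == "B":
            self.stack_b.append(ret)      # HCALLB: push on stack B
        else:
            raise ValueError("variant must be '', 'A' or 'B'")
        self.pc = target & 0xFFFF

    def hret(self, variant=""):
        """Load PC from LR or from the top of stack A or B."""
        if variant == "":
            self.pc = self.lr
        elif variant == "A":
            self.pc = self.stack_a.pop()
        elif variant == "B":
            self.pc = self.stack_b.pop()
        else:
            raise ValueError("variant must be '', 'A' or 'B'")
```

The LR form holds a single return address, so a nested plain HCALL overwrites it unless the callee saves LR first, which is the usual link-register convention compilers such as gcc already deal with. The stack forms nest without extra work.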

David Betz wrote: »

If you do this you won't be able to use CALL to call COG resident code since there isn't enough space in the 9 bit S field of the RET instruction for a full hub address. I guess you could store the address of the instruction in the 8-long window instead of what is in the PC but that's kind of a big change. Also, as we discussed before, it won't work if the CALL instruction is in the last long of the 8-long window.

David Betz · 2013-12-05 15:28

Bill Henning wrote: »

CALL only calls cog code by definition, and is an alias for JMPRET - so it will work.

Okay, I guess you're saying that once you enter hub mode you can no longer call any functions that are COG-resident. That means there is no way to use helper functions in COG memory like what you typically call FCACHE.

Cluso99 · 2013-12-05 15:46

Bill Henning wrote: »

:-)
Yes. See my extensive discussion with David about this; keeping the lowest 23 bits in BIG can save a surprising amount of memory for assembly programmers and smart optimizing compilers.

I am certain gcc could use it too, but it is not worth the effort to support it in GCC in the first pass.

I still don't like it. But Chip is the one to implement it, and I am not sure how much extra work that is. The ALU would have to be able to swap where the 9-bit immediate bits go (bottom or top).
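The two layouts being argued over here can be written out as a Python sketch. The function names are hypothetical; the only facts taken from the thread are the 23-bit BIG value, the 9-bit immediate, and the question of whether the 9 bits sit at the bottom or the top of the 32-bit constant.

```python
S_BITS = 9       # immediate bits carried by the instruction itself
BIG_BITS = 23    # bits carried by BIG
MASK32 = 0xFFFFFFFF


def split_high_in_big(value):
    """BIG holds the upper 23 bits; the 9-bit immediate supplies the bottom."""
    value &= MASK32
    return value >> S_BITS, value & ((1 << S_BITS) - 1)


def join_high_in_big(big, s):
    return ((big << S_BITS) | s) & MASK32


def split_low_in_big(value):
    """BIG holds the lowest 23 bits; the 9-bit immediate supplies the top."""
    value &= MASK32
    return value & ((1 << BIG_BITS) - 1), value >> BIG_BITS


def join_low_in_big(big, s):
    return ((s << BIG_BITS) | big) & MASK32
```

Both layouts round-trip any 32-bit constant. They differ only in which end of the constant the instruction's own 9 bits land on, which is the swap the ALU would need to support.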

That's what Chip intends, but there are many advantages for hubex mode to keep it at $1E0-$1E7

When used to feed the video engine, or other non-hubex use, any place is valid.

Advantages:

- leaves $000-$1DF free for "FCACHE/FLIB" like use
- totally predictable known address for compiler, see tricks for saving hub code space in my discussion with David

It makes no difference to Chip's implementation where the cache maps to. So GCC would just set $1E0-$1E7.

I'd like that too, in which case $1EF and $1EE could be used for BIG and LR, or Chip could add SETBIG regno and SETLR regno. I wonder if it is worth it for P2, as it changes known working (on FPGA) instructions. I leave that up to Chip :-)

The reasons I prefer $1F0/$1F1 for now are:

- known addresses, no linking issues, better for pasm code if they are at a fixed known address
- code in the 8 long window knows where to find the last assembled BIG value
- eliminates need for SETLR / SETBIG

Having them as exposed registers at known locations allows a lot of very nice optimization tricks.

Yes, true.

I distinctly remember Chip posting exactly that, auto refilling. Heck, he can schedule a refill at the first instruction, and postpone it if the instructions in the 8-long need hub cycles. I leave the implementation to Chip.

Yes, would be nice. But, if I understand correctly, the data bus to the cache is shared between cog and hub (not dual ported), so the cog would stall while it waits for the next cache line to be filled. It's a slowdown but still way better than LMM.

Agreed, but no need to pay the price if BIG is a prefix; then it does not matter if the 8-long cache is reloaded before the instruction that completes the upper bits.

Not quite.

The cog's PC is expanded to 16 bits (the low two bits are implied and zero for hub access), and yes, this makes it automatic.

It depends on how Chip implements it. Currently postfix doesn't require any additional registers which was what was so nice.

HJMP
- switch to hubexec mode if in cog mode
- load PC with embedded constant (16 bits, plus two implied zero)
- load cache if necessary, then start executing at the index within the cache indicated by the 16 bit address
- when you run past the 8 long cache, fetch next 8 longs (this may be overlapped with executing non-hub instructions in the cache)

HCALL/HCALLA/HCALLB

- save PC+1 (in long addresses, scaled) to LR / push on A stack / push on B stack (depends on op code)
- rest is the same as HJMP

HRET / HRETA / HRETB

- load PC with return address from one of LR / top of A stack / top of B stack (depends on op code)
- continue executing hub code

This way there is no need to use PTRA as a PC, and it takes less logic. And the reloads are automagic.

OK, I understand this now. It is premised on Chip doing the HUBEXEC mode to cater for this.
