How is an HUBEXEC FCACHE implemented without a COG interpreter or extra instructions?
We (mouse in one pocket, ring in another) don't want an interpreter, just fetch and execute. An interpreter means wasted code space and slower execution.
What we can do is use a tight reps loop (it will fit in an 8-long cache unit) to load a longer block of code into the cog, and use a regular cog call to execute it, then use a HJMP to resume the hub code.
Same idea as an FCACHE, without the interpreter overhead. I think it should be in-lined, as it would take roughly the same amount of code to call a cog-subroutine to do it.
How is an HUBEXEC FCACHE implemented without a COG interpreter or extra instructions?
We (mouse in one pocket, ring in another) don't want an interpreter, just fetch and execute. An interpreter means wasted code space and slower execution.
What we can do is use a tight reps loop (it will fit in an 8-long cache unit) to load a longer block of code into the cog, and use a regular cog call to execute it, then use a HJMP to resume the hub code.
No need for an interpreter!
Ok, but if a fetch gets 8 longs at a time to execute, why do you need FCACHE?
To allow larger loops (that don't fit in 8 longs) to run at full speed, and to take advantage of the REPxxx instructions, which can freely use RDxxxxxC for better performance without trashing the octal instruction cache.
Could we run any HUB execution model instructions/modes/registers/stack past some compiler writers?
I would hate it if GCC or Clang could not use what's built effectively.
Segment registers? (starts shaking uncontrollably) As far as I know, Linux does not make use of segments; last I heard it sets all segments to encompass all of memory in one big linear space, with all the virtual memory/protection handled by pages.
However, it seems Google employs the long-forgotten segments to wrap around code, data and stack in its native execution sandbox for in-browser apps.
Could we run any HUB execution model instructions/modes/registers/stack past some compiler writers?
I would hate it if GCC or Clang could not use what's built effectively.
Segment registers? (starts shaking uncontrollably) As far as I know, Linux does not make use of segments; last I heard it sets all segments to encompass all of memory in one big linear space, with all the virtual memory/protection handled by pages.
However, it seems Google employs the long-forgotten segments to wrap around code, data and stack in its native execution sandbox for in-browser apps.
I think that what is being discussed would work with PropGCC but Eric would be a better person to comment on this since he wrote the GCC code generator for the Propeller.
The three instructions I proposed are the minimum set that allows the hub execution Chip started musing on.
I can think of MANY additions to make P3 better for compilers; however, we are in a tight time window :-(
The good news is that support for this model should be quite easy to retrofit into PropGCC, and should offer tremendous speed gains.
I did not want to propose a full MMU for P3, as that opens too many cans of worms. However, segment/limit registers (as long as they are not crippled like on the x86 and the 286) are a viable solution: they give each process linear address spaces for code/data/stack, and allow relocating them anywhere in the (DDR2+) memory. KISS principle.
Could we run any HUB execution model instructions/modes/registers/stack past some compiler writers?
I would hate it if GCC or Clang could not use what's built effectively.
Segment registers? (starts shaking uncontrollably) As far as I know, Linux does not make use of segments; last I heard it sets all segments to encompass all of memory in one big linear space, with all the virtual memory/protection handled by pages.
However, it seems Google employs the long-forgotten segments to wrap around code, data and stack in its native execution sandbox for in-browser apps.
That is all totally out of my league. I just worry about "pulling an Intel". Intel's hardware gurus have three times come up with super-duper architectures and instruction sets that compilers could not deal with, too complex to generate performant code for: the 432, the i860 and the Itanium.
Will this really work? Since the COG is pipelined, what value will ptra have when the rdlong instruction actually executes? Won't several more instructions have been fetched by then? In fact, won't the constant $12345678 end up in the pipeline so that it gets executed?
OK, maybe it's too late to implement multi-level hardware caching for P2, but P3 should have this. The current proposals for extending P2 are just becoming silly. Hardware caching would simplify the whole thing. In my opinion, if P2 supports the software-caching that has been proposed recently it will start to become very kludgy. Extending this method even further for P3 would be crazy. Hardware caching would just make things easier from the software perspective.
I find it very interesting that Chip is even considering this. I think I asked about using the RDLONGC cache to allow hub execution back when he first published the initial P2 instruction set and I don't think he thought it was possible. Maybe he just hadn't had enough time to think about it. :-)
Anyway, I'm glad it is being considered now. Even if it doesn't make it into P2, it will probably be considered again for P3.
Will this really work? Since the COG is pipelined, what value will ptra have when the rdlong instruction actually executes? Won't several more instructions have been fetched by then? In fact, won't the constant $12345678 end up in the pipeline so that it gets executed?
Honestly, I think the HUB RAM expansion makes the WIDE fetch big enough to hold more than a pipe full of instructions. He probably sees that as more viable than the 4 long QUAD was. Just a guess...
I would be agreeing with you if this was a regular ARM SOC, or a conventional processor.
I am thinking more of the microcontroller applications where this would help a lot, without adding delays or waiting for P3 - essentially to make the P2 even more competitive in the controller market space.
For those applications, this would be incredibly useful.
It is also far less of a kludge than LMM, and far easier to target a compiler to.
Now for the P3, I'd love to see each cog get a "proper" L1 cache, fed by DDR2 for large application code.
I'd still keep this P2 HUBEXEC model, because for programs that fit, it will be faster and more deterministic than XMM/cache.
The propeller processors are not conventional processors, so they need some unconventional "tricks".
OK, maybe it's too late to implement multi-level hardware caching for P2, but P3 should have this. The current proposals for extending P2 are just becoming silly. Hardware caching would simplify the whole thing. In my opinion, if P2 supports the software-caching that has been proposed recently it will start to become very kludgy. Extending this method even further for P3 would be crazy. Hardware caching would just make things easier from the software perspective.
If Chip increments PTRA for every cache slot, instead of for the whole cache line (8 longs), it will work.
I don't think I stated my question clearly enough. There is a four stage pipeline in P2. I assume PTRA will be incremented on each instruction fetch but that the instruction won't actually be executed until a later pipeline stage. That means that any additional instruction fetches that happen before the RDLONGC gets executed will also increment PTRA. Also, the long after the RDLONGC instruction will have been fetched and will be in the pipeline for execution. This seems like it will cause problems.
I don't mention that aspect of my skills very often, even though it is what made me come up with LMM in the first place.
My first compiler-compiler experience was porting VALGOL to the Amiga; numerous BASIC variants (not published) followed, along with some crazy language experiments, including Forth-like languages.
I don't mention that aspect of my skills very often, even though it is what made me come up with LMM in the first place.
My first compiler-compiler experience was porting VALGOL to the Amiga; numerous BASIC variants (not published) followed, along with some crazy language experiments, including Forth-like languages.
When can we expect a VALGOL compiler for the Propeller? :-)
I was not clear enough either, you have an excellent question.
Basically, I expect Chip to increment once for each slot, on the first pass through the eight longs. The code in the 8-long cache should not modify PTRA, other than using it to skip embedded constants.
Seeing the last instruction update PTRA would let the Verilog know not to execute that long (skip it, if you will), without having to worry about 32-bit constants executing.
Chip will know how to implement this best (assuming he implements HUBEXEC at all) - but the above is how I see it working.
I don't think I stated my question clearly enough. There is a four stage pipeline in P2. I assume PTRA will be incremented on each instruction fetch but that the instruction won't actually be executed until a later pipeline stage. That means that any additional instruction fetches that happen before the RDLONGC gets executed will also increment PTRA. Also, the long after the RDLONGC instruction will have been fetched and will be in the pipeline for execution. This seems like it will cause problems.
It was an easy port though, I loved the orthogonal 68000 instruction set.
As I recall, VALGOL was a teaching subset of the full ALGOL language. The original source I used had a code generator for the PDP11, so I found it an easy port.... this was back in '86 or '87.
I was not clear enough either, you have an excellent question.
Basically, I expect Chip to increment once for each slot, on the first pass through the eight longs. The code in the 8-long cache should not modify PTRA, other than using it to skip embedded constants.
Seeing the last instruction update PTRA would let the Verilog know not to execute that long (skip it, if you will), without having to worry about 32-bit constants executing.
Chip will know how to implement this best (assuming he implements HUBEXEC at all) - but the above is how I see it working.
Just so I understand what you're saying, by "slot" you mean the entire set of 8 longs fetched in one operation? I don't think that's how it would work if he mimics the action of RDLONGC in the instruction fetch logic. That will increment the PTRx on every fetch of a long. If it only gets incremented for each set of 8 longs then HCALL will be difficult since it will have to calculate the return address based on the value of PTRx and the offset of the HCALL instruction in the window.
I expect HJMP/HCALL/HRET to load the block-aligned (32-byte-aligned, 8-long) chunk, then skip the instructions before the finer (long-grained) address.
Let me try to illustrate, to clear my mind as well
8-long aligned chunk, loaded into the RDOCTL cache
0:
1:
2:
3:
4:
5:
6:
7:
- When it runs off the end, the next eight are fetched
- A HJMP/HCALL/HRET address is encoded as
dddddddddsssssssss[00]
with the 00 being implied as it is a long address; scaled like PTR references
the hub blocks are actually fetched from octal boundaries
dddddddddssssss[000][00]
the lowest three bits of 's' in the hub address correspond to the cache slot, so execution can resume at a long grain.
How Chip decides to increment PTRA, or whether he keeps a separate internal PTRA' that works on 32-byte boundaries, is up to him. Heck, he'll probably find a better way!
Ideally, by convention, the 8-long cache should be fixed at $1E0, for a few reasons:
- allows expanding to a 16 long cache on the P3
- allows directly addressing cache lines, which leads to much faster hub memory references
- or allows growing I/O registers down from $1F2
This would leave $000-$1DF free for cog subroutines, or a HUGE FCACHE
Just so I understand what you're saying, by "slot" you mean the entire set of 8 longs fetched in one operation? I don't think that's how it would work if he mimics the action of RDLONGC in the instruction fetch logic. That will increment the PTRx on every fetch of a long. If it only gets incremented for each set of 8 longs then HCALL will be difficult since it will have to calculate the return address based on the value of PTRx and the offset of the HCALL instruction in the window.
I'd be very cautious about making such a big change at this late stage in the design. Faster LMM execution would be a good thing, but putting it in hardware is adding a fairly complicated block which we very well might get wrong. I think aiming for P3 (and having a chance to test it extensively in FPGA) would be wiser.
Also, we already have a code generator for the Propeller (and the old Prop2), so again any changes in hardware will mean that software work is required to modify the compiler. That's not a reason not to do it, of course, but it's a cost/risk that needs to be taken into account.
From the point of view of GCC, here are a few notes:
(1) The closer any hardware implementation can be to the existing LMM, the better. In particular, we need to be able to access any instruction in the cache line with a jump (I think Bill's proposal addresses this), and we need to be able to load 32 bit constants into registers, perhaps with something like "rdlong reg,PTRA++ / long xxxx". Loading 32 bit constants into a register is an extremely common operation in the compiled code, it needs to be simple and easy.
(2) The AUX stack is not really very useful for a C compiler, at least not one that's running in HUB (it'd probably do for smaller programs that fit entirely in a COG). Even a single printf call is likely to use up the whole space. Moreover, if multiple threads are to run on the same COG then they each need their own stack.
(3) AUX is also used as a CLUT for graphics operations, so C programs that do graphics would probably be better not touching it.
(4) With point (2) and (3) in mind, I think a better form for CALL would be "HCALL retreg, destreg"; the return then just becomes "HJMP retreg", so we can save an opcode. OTOH Bill's original suggestion is very nice because the opcode can contain an 18 bit immediate, enough to access all of HUB memory. Perhaps we could have HCALL save the return address in a fixed register instead of pushing it on the AUX stack. Or, I guess we can save it on the AUX stack and then have the subroutine pop it right off on entry, although that seems a bit messy.
(5) We definitely need ways to switch between hub and cog operation (e.g. CALL from the hub code into COG memory). This seems conceptually simple, but the details could get tricky, especially if you look at nested calls.
(4) With point (2) and (3) in mind, I think a better form for CALL would be "HCALL retreg, destreg"; the return then just becomes "HJMP retreg", so we can save an opcode. OTOH Bill's original suggestion is very nice because the opcode can contain an 18 bit immediate, enough to access all of HUB memory. Perhaps we could have HCALL save the return address in a fixed register instead of pushing it on the AUX stack. Or, I guess we can save it on the AUX stack and then have the subroutine pop it right off on entry, although that seems a bit messy.
Popping it off the stack right away is what I figured would be necessary for everything other than leaf functions. This isn't so bad I guess because for non-leaf functions we have to save LR anyway and popping the return address off the stack is a single instruction just like moving LR to another register. I guess it might be a bit slower if the current code generator can do a WRLONG LR, -SP rather than moving it into a register first though.
I'd be very cautious about making such a big change at this late stage in the design. Faster LMM execution would be a good thing, but putting it in hardware is adding a fairly complicated block which we very well might get wrong.
Caution is good, but adding opcodes is relatively safe; older LMM designs still work fine. This does not close any doors.
The P2 code base already in FPGA testing would quickly catch any unexpected issues, should any occur with existing opcodes.
I'd be very cautious about making such a big change at this late stage in the design. Faster LMM execution would be a good thing, but putting it in hardware is adding a fairly complicated block which we very well might get wrong. I think aiming for P3 (and having a chance to test it extensively in FPGA) would be wiser.
This snowballed from the DAC bus's necessary removal; 256KB of hub turned out to be the best, least-risk use of the freed-up space. Chip then wanted to replace RDQUAD with RDOCTL, for double the potential bandwidth.
Also, we already have a code generator for the Propeller (and the old Prop2), so again any changes in hardware will mean that software work is required to modify the compiler. That's not a reason not to do it, of course, but it's a cost/risk that needs to be taken into account.
I believe the compiler changes should be fairly minimal; I tried to craft the suggested instructions in such a way as to minimize compiler changes, and risk.
From the point of view of GCC, here are a few notes:
(1) The closer any hardware implementation can be to the existing LMM, the better. In particular, we need to be able to access any instruction in the cache line with a jump (I think Bill's proposal addresses this), and we need to be able to load 32 bit constants into registers, perhaps with something like "rdlong reg,PTRA++ / long xxxx". Loading 32 bit constants into a register is an extremely common operation in the compiled code, it needs to be simple and easy.
Agreed.
HJMP / HCALL / HRET are very close to the original LMM FJMP / FCALL / FRET
The HUBEXEC implementation used precisely "rdlong reg,PTRA++ / long xxxx" - I really need to copy that over from
(2) The AUX stack is not really very useful for a C compiler, at least not one that's running in HUB (it'd probably do for smaller programs that fit entirely in a COG). Even a single printf call is likely to use up the whole space. Moreover, if multiple threads are to run on the same COG then they each need their own stack.
Here we disagree to some extent. For microcontroller-type programs it would suffice, and be faster.
(4) With point (2) and (3) in mind, I think a better form for CALL would be "HCALL retreg, destreg"; the return then just becomes "HJMP retreg", so we can save an opcode. OTOH Bill's original suggestion is very nice because the opcode can contain an 18 bit immediate, enough to access all of HUB memory. Perhaps we could have HCALL save the return address in a fixed register instead of pushing it on the AUX stack. Or, I guess we can save it on the AUX stack and then have the subroutine pop it right off on entry, although that seems a bit messy.
I dislike the old IBM (now ARM) convention of a link register. It was originally introduced to avoid a main-memory stack access; however, the AUX stack would work here admirably.
For programs which don't need a large stack, pushing the address on an AUX stack is much faster (a single cycle) than moving to a register, then writing to the hub. 8x+ faster.
Same for returning: 1 cycle vs. 8+.
Saving it on the AUX stack, then moving it off-chip if needed (i.e. a hub-stack model) is a perfectly legitimate option, which is faster by one cycle than saving it in LR, then pushing it.
(5) We definitely need ways to switch between hub and cog operation (e.g. CALL from the hub code into COG memory). This seems conceptually simple, but the details could get tricky, especially if you look at nested calls.
We do have to watch out for switching, but I do not see it as a large problem because:
Most common usage cases will be:
- cog-only, native cog code, high performance drivers
- hubexec larger code, using FCACHE-like loading of tight loops, where we need not support returning to hubexec until the tight cog code is done
- cog-only code, like the Spin interpreter, which may HCALL small routines, where we do not need to support calling cog routines, or support only a single level.
Just supporting the common usages above will allow a huge win.
We (mouse in one pocket, ring in another) don't want an interpreter, just fetch and execute. An interpreter means wasted code space and slower execution.
What we can do is use a tight reps loop (it will fit in an 8-long cache unit) to load a longer block of code into the cog, and use a regular cog call to execute it, then use a HJMP to resume the hub code.
Same idea as an FCACHE, without the interpreter overhead. I think it should be in-lined, as it would take roughly the same amount of code to call a cog-subroutine to do it.
No need for an interpreter!
I would hate it if GCC or Clang could not use what's built effectively.
Segment registers? (starts shaking uncontrollably) As far as I know, Linux does not make use of segments; last I heard it sets all segments to encompass all of memory in one big linear space, with all the virtual memory/protection handled by pages.
However, it seems Google employs the long-forgotten segments to wrap around code, data and stack in its native execution sandbox for in-browser apps.
The three instructions I proposed are the minimum set that allows the hub execution Chip started musing on.
I can think of MANY additions to make P3 better for compilers, however we are in a tight time window :-(
The good news is that support for this model should be quite easy to retrofit into PropGCC, and should offer tremendous speed gains.
I did not want to propose a full MMU for P3, as that opens too many cans of worms. However, segment/limit registers (as long as they are not crippled like on the x86 and the 286) are a viable solution: they give each process linear address spaces for code/data/stack, and allow relocating them anywhere in the (DDR2+) memory. KISS principle.
Not trying to put you off, Bill. Your enthusiasm is always contagious.
Will this really work? Since the COG is pipelined, what value will ptra have when the rdlong instruction actually executes? Won't several more instructions have been fetched by then? In fact, won't the constant $12345678 end up in the pipeline so that it gets executed?
FYI, even without GCC support this mode would be extremely useful.
BTW, this is infinitely easier to implement than a VLIW-style RDQUAD-based LMM.
Oops...sorry.
I'll get my coat.
Anyway, I'm glad it is being considered now. Even if it doesn't make it into P2, it will probably be considered again for P3.
Honestly, I think the HUB RAM expansion makes the WIDE fetch big enough to hold more than a pipe full of instructions. He probably sees that as more viable than the 4 long QUAD was. Just a guess...
I would be agreeing with you if this was a regular ARM SOC, or a conventional processor.
I am thinking more of the microcontroller applications where this would help a lot, without adding delays or waiting for P3 - essentially to make the P2 even more competitive in the controller market space.
For those applications, this would be incredibly useful.
It is also far less of a kludge than LMM, and far easier to target a compiler to.
Now for the P3, I'd love to see each cog get a "proper" L1 cache, fed by DDR2 for large application code.
I'd still keep this P2 HUBEXEC model, because for programs that fit, it will be faster and more deterministic than XMM/cache.
The propeller processors are not conventional processors, so they need some unconventional "tricks".
I don't mention that aspect of my skills very often, even though it is what made me come up with LMM in the first place.
My first compiler-compiler experience was porting VALGOL to the Amiga; numerous BASIC variants (not published) followed, along with some crazy language experiments, including Forth-like languages.
Also, how is VALGOL different from ALGOL?
Basically, I expect Chip to increment once for each slot, on the first pass through the eight longs. The code in the 8-long cache should not modify PTRA, other than using it to skip embedded constants.
Seeing the last instruction update PTRA would let the Verilog know not to execute that long (skip it, if you will), without having to worry about 32-bit constants executing.
Chip will know how to implement this best (assuming he implements HUBEXEC at all) - but the above is how I see it working.
NEVER!
It was an easy port though, I loved the orthogonal 68000 instruction set.
As I recall, VALGOL was a teaching subset of the full ALGOL language. The original source I used had a code generator for the PDP11, so I found it an easy port.... this was back in '86 or '87.
I expect HJMP/HCALL/HRET to load the block-aligned (32-byte-aligned, 8-long) chunk, then skip the instructions before the finer (long-grained) address.
Let me try to illustrate, to clear my mind as well
8-long aligned chunk, loaded into the RDOCTL cache
0:
1:
2:
3:
4:
5:
6:
7:
- When it runs off the end, the next eight are fetched
- A HJMP/HCALL/HRET address is encoded as
dddddddddsssssssss[00]
with the 00 being implied as it is a long address; scaled like PTR references
the hub blocks are actually fetched from octal boundaries
dddddddddssssss[000][00]
the lowest three bits of 's' in the hub address correspond to the cache slot, so execution can resume at a long grain.
How Chip decides to increment PTRA, or whether he keeps a separate internal PTRA' that works on 32-byte boundaries, is up to him. Heck, he'll probably find a better way!
Ideally, by convention, the 8-long cache should be fixed at $1E0, for a few reasons:
- allows expanding to a 16 long cache on the P3
- allows directly addressing cache lines, which leads to much faster hub memory references
- or allows growing I/O registers down from $1F2
This would leave $000-$1DF free for cog subroutines, or a HUGE FCACHE
Later, I think I'll add this to the FAQ post.
Again, great questions David.
Also, we already have a code generator for the Propeller (and the old Prop2), so again any changes in hardware will mean that software work is required to modify the compiler. That's not a reason not to do it, of course, but it's a cost/risk that needs to be taken into account.
From the point of view of GCC, here are a few notes:
(1) The closer any hardware implementation can be to the existing LMM, the better. In particular, we need to be able to access any instruction in the cache line with a jump (I think Bill's proposal addresses this), and we need to be able to load 32 bit constants into registers, perhaps with something like "rdlong reg,PTRA++ / long xxxx". Loading 32 bit constants into a register is an extremely common operation in the compiled code, it needs to be simple and easy.
(2) The AUX stack is not really very useful for a C compiler, at least not one that's running in HUB (it'd probably do for smaller programs that fit entirely in a COG). Even a single printf call is likely to use up the whole space. Moreover, if multiple threads are to run on the same COG then they each need their own stack.
(3) AUX is also used as a CLUT for graphics operations, so C programs that do graphics would probably be better not touching it.
(4) With point (2) and (3) in mind, I think a better form for CALL would be "HCALL retreg, destreg"; the return then just becomes "HJMP retreg", so we can save an opcode. OTOH Bill's original suggestion is very nice because the opcode can contain an 18 bit immediate, enough to access all of HUB memory. Perhaps we could have HCALL save the return address in a fixed register instead of pushing it on the AUX stack. Or, I guess we can save it on the AUX stack and then have the subroutine pop it right off on entry, although that seems a bit messy.
(5) We definitely need ways to switch between hub and cog operation (e.g. CALL from the hub code into COG memory). This seems conceptually simple, but the details could get tricky, especially if you look at nested calls.
Caution is good, but adding opcodes is relatively safe; older LMM designs still work fine. This does not close any doors.
The P2 code base already in FPGA testing would quickly catch any unexpected issues, should any occur with existing opcodes.
This snowballed from the DAC bus's necessary removal; 256KB of hub turned out to be the best, least-risk use of the freed-up space. Chip then wanted to replace RDQUAD with RDOCTL, for double the potential bandwidth.
I believe the compiler changes should be fairly minimal; I tried to craft the suggested instructions in such a way as to minimize compiler changes, and risk.
Agreed.
HJMP / HCALL / HRET are very close to the original LMM FJMP / FCALL / FRET
The HUBEXEC implementation used precisely "rdlong reg,PTRA++ / long xxxx" - I really need to copy that over from
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223934&viewfull=1#post1223934
Here we disagree to some extent. For microcontroller-type programs it would suffice, and be faster.
High-resolution graphics drivers will still need to be written in cog PASM. Low/medium-resolution drivers would be possible in hubexec mode.
I dislike the old IBM (now ARM) convention of a link register. It was originally introduced to avoid a main-memory stack access; however, the AUX stack would work here admirably.
For programs which don't need a large stack, pushing the address on an AUX stack is much faster (a single cycle) than moving to a register, then writing to the hub. 8x+ faster.
Same for returning: 1 cycle vs. 8+.
Saving it on the AUX stack, then moving it off-chip if needed (i.e. a hub-stack model) is a perfectly legitimate option, which is faster by one cycle than saving it in LR, then pushing it.
We do have to watch out for switching, but I do not see it as a large problem because:
Most common usage cases will be:
- cog-only, native cog code, high performance drivers
- hubexec larger code, using FCACHE-like loading of tight loops, where we need not support returning to hubexec until the tight cog code is done
- cog-only code, like the Spin interpreter, which may HCALL small routines, where we do not need to support calling cog routines, or support only a single level.
Just supporting the common usages above will allow a huge win.