How is an HUBEXEC FCACHE implemented without a COG interpreter or extra instructions?
We (mouse in one pocket, ring in another) don't want an interpreter, just fetch and execute. An interpreter means wasted code space and slower execution.
What we can do is use a tight reps loop (it will fit in an 8-long cache unit) to load a longer block of code into the cog, and use a regular cog call to execute it, then use a HJMP to resume the hub code.
Same idea as an FCACHE, without the interpreter overhead. I think it should be in-lined, as it would take roughly the same amount of code to call a cog-subroutine to do it.
How is an HUBEXEC FCACHE implemented without a COG interpreter or extra instructions?
We (mouse in one pocket, ring in another) don't want an interpreter, just fetch and execute. An interpreter means wasted code space and slower execution.
What we can do is use a tight reps loop (it will fit in an 8-long cache unit) to load a longer block of code into the cog, and use a regular cog call to execute it, then use a HJMP to resume the hub code.
No need for an interpreter!
Ok, but if a fetch gets 8 longs at a time to execute, why do you need FCACHE?
To allow larger loops (that don't fit in 8 longs) to run at full speed, and to take advantage of the REPxxx instructions, which can freely use RDxxxxxC for better performance without trashing the octal instruction cache.
Could we run any HUB execution model instructions/modes/registers/stack past some compiler writers?
I would hate it if GCC or Clang could not use what's built effectively.
Segment registers? (starts shaking uncontrollably) As far as I know, Linux does not make use of segments; last I heard it sets all segments to encompass all of memory in one big linear space, with all the virtual memory/protection handled by pages.
However, it seems Google employs the long-forgotten segments to wrap around code, data and stack in its native execution sandbox for in-browser apps.
Could we run any HUB execution model instructions/modes/registers/stack past some compiler writers?
I would hate it if GCC or Clang could not use what's built effectively.
Segment registers? (starts shaking uncontrollably) As far as I know, Linux does not make use of segments; last I heard it sets all segments to encompass all of memory in one big linear space, with all the virtual memory/protection handled by pages.
However, it seems Google employs the long-forgotten segments to wrap around code, data and stack in its native execution sandbox for in-browser apps.
I think that what is being discussed would work with PropGCC but Eric would be a better person to comment on this since he wrote the GCC code generator for the Propeller.
The three instructions I proposed are the minimum set that allows the hub execution Chip started musing on.
I can think of MANY additions to make P3 better for compilers; however, we are in a tight time window :-(
The good news is that support for this model should be quite easy to retrofit into PropGCC, and should offer tremendous speed gains.
I did not want to propose a full MMU for P3, as that opens too many cans of worms. However, segment/limit registers (as long as they are not crippled like on the x86 and the 286) are a viable solution: they give each process linear address spaces for code/data/stack, and allow relocating them anywhere in the (DDR2+) memory. KISS principle.
Could we run any HUB execution model instructions/modes/registers/stack past some compiler writers?
I would hate it if GCC or Clang could not use what's built effectively.
Segment registers? (starts shaking uncontrollably) As far as I know, Linux does not make use of segments; last I heard it sets all segments to encompass all of memory in one big linear space, with all the virtual memory/protection handled by pages.
However, it seems Google employs the long-forgotten segments to wrap around code, data and stack in its native execution sandbox for in-browser apps.
That is all totally out of my league. I just worry about "pulling an Intel". Intel's hardware gurus have three times come up with super-duper architectures and instruction sets that compilers could not deal with, too complex to generate performant code for: the 432, the i860 and the Itanium.
Will this really work? Since the COG is pipelined, what value will ptra have when the rdlong instruction actually executes? Won't several more instructions have been fetched by then? In fact, won't the constant $12345678 end up in the pipeline so that it gets executed?
OK, maybe it's too late to implement multi-level hardware caching for P2, but P3 should have this. The current proposals for extending P2 are just becoming silly. Hardware caching would simplify the whole thing. In my opinion, if P2 supports the software-caching that has been proposed recently it will start to become very kludgy. Extending this method even further for P3 would be crazy. Hardware caching would just make things easier from the software perspective.
I find it very interesting that Chip is even considering this. I think I asked about using the RDLONGC cache to allow hub execution back when he first published the initial P2 instruction set and I don't think he thought it was possible. Maybe he just hadn't had enough time to think about it. :-)
Anyway, I'm glad it is being considered now. Even if it doesn't make it into P2, it will probably be considered again for P3.
Will this really work? Since the COG is pipelined, what value will ptra have when the rdlong instruction actually executes? Won't several more instructions have been fetched by then? In fact, won't the constant $12345678 end up in the pipeline so that it gets executed?
Honestly, I think the HUB RAM expansion makes the WIDE fetch big enough to hold more than a pipe full of instructions. He probably sees that as more viable than the 4 long QUAD was. Just a guess...
I would be agreeing with you if this was a regular ARM SOC, or a conventional processor.
I am thinking more of the microcontroller applications where this would help a lot, without adding delays or waiting for P3 - essentially to make the P2 even more competitive in the controller market space.
For those applications, this would be incredibly useful.
It is also far less of a kludge than LMM, and far easier to target a compiler to.
Now for the P3, I'd love to see each cog get a "proper" L1 cache, fed by DDR2 for large application code.
I'd still keep this P2 HUBEXEC model, because for programs that fit, it will be faster and more deterministic than XMM/cache.
The propeller processors are not conventional processors, so they need some unconventional "tricks".
OK, maybe it's too late to implement multi-level hardware caching for P2, but P3 should have this. The current proposals for extending P2 are just becoming silly. Hardware caching would simplify the whole thing. In my opinion, if P2 supports the software-caching that has been proposed recently it will start to become very kludgy. Extending this method even further for P3 would be crazy. Hardware caching would just make things easier from the software perspective.
If Chip increments PTRA for every cache slot, instead of for the whole cache line (8 longs), it will work.
I don't think I stated my question clearly enough. There is a four stage pipeline in P2. I assume PTRA will be incremented on each instruction fetch but that the instruction won't actually be executed until a later pipeline stage. That means that any additional instruction fetches that happen before the RDLONGC gets executed will also increment PTRA. Also, the long after the RDLONGC instruction will have been fetched and will be in the pipeline for execution. This seems like it will cause problems.
I don't mention that aspect of my skills very often, even though it is what made me come up with LMM in the first place.
My first compiler-compiler experience was porting VALGOL to the Amiga; numerous BASIC variants (not published) followed, along with some crazy language experiments, including Forth-like languages.
I don't mention that aspect of my skills very often, even though it is what made me come up with LMM in the first place.
My first compiler-compiler experience was porting VALGOL to the Amiga; numerous BASIC variants (not published) followed, along with some crazy language experiments, including Forth-like languages.
When can we expect a VALGOL compiler for the Propeller? :-)
I was not clear enough either, you have an excellent question.
Basically, I expect Chip to increment once for each slot, on the first pass through the eight longs. The code in the 8-long cache should not modify PTRA, other than using it to skip embedded constants.
Seeing the last instruction update PTRA would let the Verilog know not to execute that long (skip it, if you will), without having to worry about 32-bit constants executing.
Chip will know how to implement this best (assuming he implements HUBEXEC at all) - but the above is how I see it working.
I don't think I stated my question clearly enough. There is a four stage pipeline in P2. I assume PTRA will be incremented on each instruction fetch but that the instruction won't actually be executed until a later pipeline stage. That means that any additional instruction fetches that happen before the RDLONGC gets executed will also increment PTRA. Also, the long after the RDLONGC instruction will have been fetched and will be in the pipeline for execution. This seems like it will cause problems.
It was an easy port though, I loved the orthogonal 68000 instruction set.
As I recall, VALGOL was a teaching subset of the full ALGOL language. The original source I used had a code generator for the PDP11, so I found it an easy port.... this was back in '86 or '87.
I was not clear enough either, you have an excellent question.
Basically, I expect Chip to increment once for each slot, on the first pass through the eight longs. The code in the 8-long cache should not modify PTRA, other than using it to skip embedded constants.
Seeing the last instruction update PTRA would let the Verilog know not to execute that long (skip it, if you will), without having to worry about 32-bit constants executing.
Chip will know how to implement this best (assuming he implements HUBEXEC at all) - but the above is how I see it working.
Just so I understand what you're saying, by "slot" you mean the entire set of 8 longs fetched in one operation? I don't think that's how it would work if he mimics the action of RDLONGC in the instruction fetch logic. That will increment the PTRx on every fetch of a long. If it only gets incremented for each set of 8 longs then HCALL will be difficult since it will have to calculate the return address based on the value of PTRx and the offset of the HCALL instruction in the window.
I expect HJMP/HCALL/HRET to load the block-aligned (32-byte-aligned, 8-long) chunk, then skip the instructions before the finer (long-grained) address.
Let me try to illustrate, to clear my mind as well
8-long aligned chunk, loaded into the RDOCTL cache
0:
1:
2:
3:
4:
5:
6:
7:
- When it runs off the end, the next eight are fetched
- A HJMP/HCALL/HRET address is encoded as
dddddddddsssssssss[00]
with the 00 being implied as it is a long address; scaled like PTR references
the hub blocks are actually fetched from octal boundaries
dddddddddssssss[000][00]
the lowest three bits of 's' in the hub address correspond to the cache slot, so execution can resume at a long grain.
How Chip decides to increment PTRA, or whether he keeps a separate internal PTRA' that works on 32-byte boundaries, is up to him. Heck, he'll probably find a better way!
Ideally, by convention, the 8-long cache should be fixed at $1E0, for a few reasons:
- allows expanding to a 16 long cache on the P3
- allows directly addressing cache lines, which leads to much faster hub memory references
- or allows growing I/O registers down from $1F2
This would leave $000-$1DF free for cog subroutines, or a HUGE FCACHE
Just so I understand what you're saying, by "slot" you mean the entire set of 8 longs fetched in one operation? I don't think that's how it would work if he mimics the action of RDLONGC in the instruction fetch logic. That will increment the PTRx on every fetch of a long. If it only gets incremented for each set of 8 longs then HCALL will be difficult since it will have to calculate the return address based on the value of PTRx and the offset of the HCALL instruction in the window.
I'd be very cautious about making such a big change at this late stage in the design. Faster LMM execution would be a good thing, but putting it in hardware is adding a fairly complicated block which we very well might get wrong. I think aiming for P3 (and having a chance to test it extensively in FPGA) would be wiser.
Also, we already have a code generator for the Propeller (and the old Prop2), so again any changes in hardware will mean that software work is required to modify the compiler. That's not a reason not to do it, of course, but it's a cost/risk that needs to be taken into account.
From the point of view of GCC, here are a few notes:
(1) The closer any hardware implementation can be to the existing LMM, the better. In particular, we need to be able to access any instruction in the cache line with a jump (I think Bill's proposal addresses this), and we need to be able to load 32 bit constants into registers, perhaps with something like "rdlong reg,PTRA++ / long xxxx". Loading 32 bit constants into a register is an extremely common operation in the compiled code, it needs to be simple and easy.
(2) The AUX stack is not really very useful for a C compiler, at least not one that's running in HUB (it'd probably do for smaller programs that fit entirely in a COG). Even a single printf call is likely to use up the whole space. Moreover, if multiple threads are to run on the same COG then they each need their own stack.
(3) AUX is also used as a CLUT for graphics operations, so C programs that do graphics would probably be better not touching it.
(4) With point (2) and (3) in mind, I think a better form for CALL would be "HCALL retreg, destreg"; the return then just becomes "HJMP retreg", so we can save an opcode. OTOH Bill's original suggestion is very nice because the opcode can contain an 18 bit immediate, enough to access all of HUB memory. Perhaps we could have HCALL save the return address in a fixed register instead of pushing it on the AUX stack. Or, I guess we can save it on the AUX stack and then have the subroutine pop it right off on entry, although that seems a bit messy.
(5) We definitely need ways to switch between hub and cog operation (e.g. CALL from the hub code into COG memory). This seems conceptually simple, but the details could get tricky, especially if you look at nested calls.
(4) With point (2) and (3) in mind, I think a better form for CALL would be "HCALL retreg, destreg"; the return then just becomes "HJMP retreg", so we can save an opcode. OTOH Bill's original suggestion is very nice because the opcode can contain an 18 bit immediate, enough to access all of HUB memory. Perhaps we could have HCALL save the return address in a fixed register instead of pushing it on the AUX stack. Or, I guess we can save it on the AUX stack and then have the subroutine pop it right off on entry, although that seems a bit messy.
Popping it off the stack right away is what I figured would be necessary for everything other than leaf functions. This isn't so bad I guess because for non-leaf functions we have to save LR anyway and popping the return address off the stack is a single instruction just like moving LR to another register. I guess it might be a bit slower if the current code generator can do a WRLONG LR, -SP rather than moving it into a register first though.
I'd be very cautious about making such a big change at this late stage in the design. Faster LMM execution would be a good thing, but putting it in hardware is adding a fairly complicated block which we very well might get wrong.
Caution is good, but adding opcodes is relatively safe; older LMM designs still work fine. This does not close any doors.
The P2 code base already in FPGA testing would quickly catch any unexpected issues, should any occur with existing opcodes.
I'd be very cautious about making such a big change at this late stage in the design. Faster LMM execution would be a good thing, but putting it in hardware is adding a fairly complicated block which we very well might get wrong. I think aiming for P3 (and having a chance to test it extensively in FPGA) would be wiser.
This snowballed from the DAC bus's necessary removal; 256KB of hub turned out to be the best, least-risk use of the freed-up space. Chip then wanted to replace RDQUAD with RDOCTL, for double the potential bandwidth.
Also, we already have a code generator for the Propeller (and the old Prop2), so again any changes in hardware will mean that software work is required to modify the compiler. That's not a reason not to do it, of course, but it's a cost/risk that needs to be taken into account.
I believe the compiler changes should be fairly minimal; I tried to craft the suggested instructions in such a way as to minimize compiler changes, and risk.
From the point of view of GCC, here are a few notes:
(1) The closer any hardware implementation can be to the existing LMM, the better. In particular, we need to be able to access any instruction in the cache line with a jump (I think Bill's proposal addresses this), and we need to be able to load 32 bit constants into registers, perhaps with something like "rdlong reg,PTRA++ / long xxxx". Loading 32 bit constants into a register is an extremely common operation in the compiled code, it needs to be simple and easy.
Agreed.
HJMP / HCALL / HRET are very close to the original LMM FJMP / FCALL / FRET
The HUBEXEC implementation used precisely "rdlong reg,PTRA++ / long xxxx" - I really need to copy that over from
(2) The AUX stack is not really very useful for a C compiler, at least not one that's running in HUB (it'd probably do for smaller programs that fit entirely in a COG). Even a single printf call is likely to use up the whole space. Moreover, if multiple threads are to run on the same COG then they each need their own stack.
Here we disagree to some extent. For microcontroller-type programs it would suffice, and be faster.
(4) With point (2) and (3) in mind, I think a better form for CALL would be "HCALL retreg, destreg"; the return then just becomes "HJMP retreg", so we can save an opcode. OTOH Bill's original suggestion is very nice because the opcode can contain an 18 bit immediate, enough to access all of HUB memory. Perhaps we could have HCALL save the return address in a fixed register instead of pushing it on the AUX stack. Or, I guess we can save it on the AUX stack and then have the subroutine pop it right off on entry, although that seems a bit messy.
I dislike the old IBM (now ARM) convention of a link register. It was originally introduced to avoid a main-memory stack access; however, the AUX stack would work here admirably.
For programs which don't need a large stack, pushing the address on an AUX stack is much faster (a single cycle) than moving to a register, then writing to the hub. 8x+ faster.
Same for returning: 1 cycle vs. 8+.
Saving it on the AUX stack, then moving it off-chip if needed (i.e. a hub-stack model) is a perfectly legitimate option, which is faster by one cycle than saving it in LR, then pushing it.
(5) We definitely need ways to switch between hub and cog operation (e.g. CALL from the hub code into COG memory). This seems conceptually simple, but the details could get tricky, especially if you look at nested calls.
We do have to watch out for switching, but I do not see it as a large problem because:
Most common usage cases will be:
- cog-only, native cog code, high performance drivers
- hubexec larger code, using FCACHE-like loading of tight loops, where we need not support returning to hubexec until the tight cog code is done
- cog-only code, like the Spin interpreter, which may HCALL small routines, where we do not need to support calling cog routines, or support only a single level.
Just supporting the common usages above will allow a huge win.
We (mouse in one pocket, ring in another) don't want an interpreter, just fetch and execute. An interpreter means wasted code space and slower execution.
What we can do is use a tight reps loop (it will fit in an 8-long cache unit) to load a longer block of code into the cog, and use a regular cog call to execute it, then use a HJMP to resume the hub code.
Same idea as an FCACHE, without the interpreter overhead. I think it should be in-lined, as it would take roughly the same amount of code to call a cog-subroutine to do it.
No need for an interpreter!
I would hate it if GCC or Clang could not use what's built effectively.
Segment registers? (starts shaking uncontrollably) As far as I know, Linux does not make use of segments; last I heard it sets all segments to encompass all of memory in one big linear space, with all the virtual memory/protection handled by pages.
However, it seems Google employs the long-forgotten segments to wrap around code, data and stack in its native execution sandbox for in-browser apps.
The three instructions I proposed are the minimum set that allows the hub execution Chip started musing on.
I can think of MANY additions to make P3 better for compilers, however we are in a tight time window :-(
The good news is that support for this model should be quite easy to retrofit into PropGCC, and should offer tremendous speed gains.
I did not want to propose a full MMU for P3, as that opens too many cans of worms. However, segment/limit registers (as long as they are not crippled like on the x86 and the 286) are a viable solution: they give each process linear address spaces for code/data/stack, and allow relocating them anywhere in the (DDR2+) memory. KISS principle.
Not trying to put you off, Bill. Your enthusiasm is always contagious.
Will this really work? Since the COG is pipelined, what value will ptra have when the rdlong instruction actually executes? Won't several more instructions have been fetched by then? In fact, won't the constant $12345678 end up in the pipeline so that it gets executed?
FYI, even without GCC support this mode would be extremely useful.
BTW, this is infinitely easier to implement than a VLIW-style RDQUAD-based LMM.
Oops...sorry.
I'll get my coat.
Anyway, I'm glad it is being considered now. Even if it doesn't make it into P2, it will probably be considered again for P3.
Honestly, I think the HUB RAM expansion makes the WIDE fetch big enough to hold more than a pipe full of instructions. He probably sees that as more viable than the 4 long QUAD was. Just a guess...
I would be agreeing with you if this was a regular ARM SOC, or a conventional processor.
I am thinking more of the microcontroller applications where this would help a lot, without adding delays or waiting for P3 - essentially to make the P2 even more competitive in the controller market space.
For those applications, this would be incredibly useful.
It is also far less of a kludge than LMM, and far easier to target a compiler to.
Now for the P3, I'd love to see each cog get a "proper" L1 cache, fed by DDR2 for large application code.
I'd still keep this P2 HUBEXEC model, because for programs that fit, it will be faster and more deterministic than XMM/cache.
The propeller processors are not conventional processors, so they need some unconventional "tricks".
I don't mention that aspect of my skills very often, even though it is what made me come up with LMM in the first place.
My first compiler-compiler experience was porting VALGOL to the Amiga; numerous BASIC variants (not published) followed, along with some crazy language experiments, including Forth-like languages.
Also, how is VALGOL different from ALGOL?
Basically, I expect Chip to increment once for each slot, on the first pass through the eight longs. The code in the 8-long cache should not modify PTRA, other than using it to skip embedded constants.
Seeing the last instruction update PTRA would let the Verilog know not to execute that long (skip it, if you will), without having to worry about 32-bit constants executing.
Chip will know how to implement this best (assuming he implements HUBEXEC at all) - but the above is how I see it working.
NEVER!
It was an easy port though, I loved the orthogonal 68000 instruction set.
As I recall, VALGOL was a teaching subset of the full ALGOL language. The original source I used had a code generator for the PDP11, so I found it an easy port.... this was back in '86 or '87.
I expect HJMP/HCALL/HRET to load the block-aligned (32-byte-aligned, 8-long) chunk, then skip the instructions before the finer (long-grained) address.
Let me try to illustrate, to clear my mind as well
8-long aligned chunk, loaded into the RDOCTL cache
0:
1:
2:
3:
4:
5:
6:
7:
- When it runs off the end, the next eight are fetched
- A HJMP/HCALL/HRET address is encoded as
dddddddddsssssssss[00]
with the 00 being implied as it is a long address; scaled like PTR references
the hub blocks are actually fetched from octal boundaries
dddddddddssssss[000][00]
the lowest three bits of 's' in the hub address correspond to the cache slot, so execution can resume at a long grain.
How Chip decides to increment PTRA, or whether he keeps a separate internal PTRA' that works on 32-byte boundaries, is up to him. Heck, he'll probably find a better way!
Ideally, by convention, the 8-long cache should be fixed at $1E0, for a few reasons:
- allows expanding to a 16 long cache on the P3
- allows directly addressing cache lines, which leads to much faster hub memory references
- or allows growing I/O registers down from $1F2
This would leave $000-$1DF free for cog subroutines, or a HUGE FCACHE
Later, I think I'll add this to the FAQ post.
Again, great questions David.
Also, we already have a code generator for the Propeller (and the old Prop2), so again any changes in hardware will mean that software work is required to modify the compiler. That's not a reason not to do it, of course, but it's a cost/risk that needs to be taken into account.
From the point of view of GCC, here are a few notes:
(1) The closer any hardware implementation can be to the existing LMM, the better. In particular, we need to be able to access any instruction in the cache line with a jump (I think Bill's proposal addresses this), and we need to be able to load 32 bit constants into registers, perhaps with something like "rdlong reg,PTRA++ / long xxxx". Loading 32 bit constants into a register is an extremely common operation in the compiled code, it needs to be simple and easy.
(2) The AUX stack is not really very useful for a C compiler, at least not one that's running in HUB (it'd probably do for smaller programs that fit entirely in a COG). Even a single printf call is likely to use up the whole space. Moreover, if multiple threads are to run on the same COG then they each need their own stack.
(3) AUX is also used as a CLUT for graphics operations, so C programs that do graphics would probably be better not touching it.
(4) With point (2) and (3) in mind, I think a better form for CALL would be "HCALL retreg, destreg"; the return then just becomes "HJMP retreg", so we can save an opcode. OTOH Bill's original suggestion is very nice because the opcode can contain an 18 bit immediate, enough to access all of HUB memory. Perhaps we could have HCALL save the return address in a fixed register instead of pushing it on the AUX stack. Or, I guess we can save it on the AUX stack and then have the subroutine pop it right off on entry, although that seems a bit messy.
(5) We definitely need ways to switch between hub and cog operation (e.g. CALL from the hub code into COG memory). This seems conceptually simple, but the details could get tricky, especially if you look at nested calls.
Caution is good, but adding opcodes is relatively safe; older LMM designs still work fine. This does not close any doors.
The P2 code base already in FPGA testing would quickly catch any unexpected issues, should any occur with existing opcodes.
This snowballed from the DAC bus's necessary removal; 256KB of hub turned out to be the best, least-risk use of the freed-up space. Chip then wanted to replace RDQUAD with RDOCTL, for double the potential bandwidth.
I believe the compiler changes should be fairly minimal; I tried to craft the suggested instructions in such a way as to minimize compiler changes, and risk.
Agreed.
HJMP / HCALL / HRET are very close to the original LMM FJMP / FCALL / FRET
The HUBEXEC implementation used precisely "rdlong reg,PTRA++ / long xxxx" - I really need to copy that over from
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223934&viewfull=1#post1223934
Here we disagree to some extent. For microcontroller-type programs it would suffice, and be faster.
High-resolution graphics drivers will still need to be written in cog PASM. Low/medium-resolution drivers would be possible in hubexec mode.
I dislike the old IBM (now ARM) convention of a link register. It was originally introduced to avoid a main-memory stack access; however, the AUX stack would work here admirably.
For programs which don't need a large stack, pushing the address on an AUX stack is much faster (a single cycle) than moving to a register, then writing to the hub. 8x+ faster.
Same for returning: 1 cycle vs. 8+.
Saving it on the AUX stack, then moving it off-chip if needed (i.e. a hub-stack model) is a perfectly legitimate option, which is faster by one cycle than saving it in LR, then pushing it.
We do have to watch out for switching, but I do not see it as a large problem because:
Most common usage cases will be:
- cog-only, native cog code, high performance drivers
- hubexec larger code, using FCACHE-like loading of tight loops, where we need not support returning to hubexec until the tight cog code is done
- cog-only code, like the Spin interpreter, which may HCALL small routines, where we do not need to support calling cog routines, or support only a single level.
Just supporting the common usages above will allow a huge win.