Hub exec is going to put me into an early grave at this point. It's overwhelming me. I wish we could figure a way to achieve the same thing without all the current instructions. At least, I need it completely off the plate for now. Can we rationalize that LMM is now quite viable, with streaming cog loads?
Chip,
Forget all the other unnecessary (nice but unnecessary) instructions for hubexec.
Over on the additional cog thread I just posted how to do it, on the basis that the LUT can be used as cog instructions as I suggested.
It only requires JMP/CALL/RET to support relative and absolute 17-bit addresses, and the PC to be widened to 17 bits.
Forget caching, at least for now until this other FIFO is resolved because it may be able to be used.
Hub exec is going to put me into an early grave at this point. It's overwhelming me. I wish we could figure a way to achieve the same thing without all the current instructions. At least, I need it completely off the plate for now. Can we rationalize that LMM is now quite viable, with streaming cog loads?
Certainly, you could release a FPGA variant and let the LMM Wizards "have at it", and see what they can do, with Streaming COG loads.
The bandwidth is a lot higher now.
How much stack space would be required to support GCC ?
ie Currently GCC uses hub and I am wondering what size is considered adequate.
Cluso, but this is a bit of a meaningless question. C is a stack-based language. It can and will use as much stack space as is available. If you try to restrict the stack space, you may be able to program in a C-like language, but you will not be able to program in C.
With the way it is now, you will be able to stream data into cog RAM at great speed. We already have a lot of LMM-esque code on P1 that won't need much tweaking to use the even better speed this will offer. It also helps to KISS, and as a further benefit Chip won't go to an early grave, and we can move on and get the chip out sooner!
I'm not saying we should forget HUB-EXEC altogether, just that I think it should wait for another iteration of the chip.
How much stack space would be required to support GCC ?
ie Currently GCC uses hub and I am wondering what size is considered adequate.
I don't have enough time to follow the P2 threads anymore. I don't feel like I even understand the direction P2 is going in at this point. C can use any size stack you give it of course. It just depends on what sorts of applications you want to run. Define the application domain you intend to address and it will be possible to get an idea of how much stack space is necessary for those sorts of applications. If you want to leave the set of applications open ended, you should probably plan on providing support for large hub-based stacks. If you are willing to limit the scope you can probably get away with a COG-RAM-based stack. Anyway, I'm not going to be a good resource for this sort of question because I don't have the time to follow all of the P2 design twists and turns. Just a quick scan of the read/write FIFO discussion makes me wonder if this new P2 design will be good for traditional high-level languages at all. However, I'm sure a version of Spin can be made to perform well on it. It seems clear to me that the primary design goal of the P2 is to produce video efficiently so maybe every other consideration should be put aside and the design should be focussed on that. We can look at whatever comes out of that effort and make a decision if it is a good target for C or not.
Cluso, but this is a bit of a meaningless question. C is a stack-based language. It can and will use as much stack space as is available. If you try to restrict the stack space, you may be able to program in a C-like language, but you will not be able to program in C.
Ross.
I don't agree with this. You can program it in C, but the set of applications that will be practical will be limited, maybe severely.
That's what I meant. Who would use a C compiler that you can't use to compile a C program - even when there is enough RAM to run it - just because of some arbitrary stack size limitation? It makes no sense.
Ross.
You are absolutely right but my sense is that C isn't really a focus for this chip. I'm beginning to think that it should be taken off the table until Chip has put together a design that meets his goals. Then we can look at it and decide if it is worth supporting with a C compiler. It seems to me that P2 is going in the direction of being a fabulous chip for doing high-performance assembly language programming. Nothing wrong with that. And if there is a little high-level glue required to hold together the PASM drivers then Spin can handle that. Since Parallax defines Spin, it can be tailored to work well no matter what the P2 architecture looks like. So I say take C off the table and make the chip that works best for Chip's target applications.
How much stack space would be required to support GCC ?
Up to and including all that you have, 500 odd K in this case.
C/C++ allows the programmer to declare structures, classes, and arrays of arbitrary size on the stack.
Then, of course, recursive code will blow up the stack.
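To put rough numbers on that, here is a minimal sketch in C. The 64-byte local buffer and the 16-byte per-call overhead (`FRAME_OVERHEAD_BYTES`) are made-up illustrative values, not figures from GCC or from any P2 design; real frame sizes depend on the compiler and the ABI.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-call overhead (return address, saved registers), in bytes. */
#define FRAME_OVERHEAD_BYTES 16u

/* Each active call keeps its own 64-byte scratch buffer on the stack
 * until it returns, so stack use grows with the recursion depth. */
unsigned sum_digits(unsigned n)
{
    volatile unsigned char scratch[64];
    scratch[0] = (unsigned char)(n % 10u);
    if (n < 10u)
        return scratch[0];
    return scratch[0] + sum_digits(n / 10u);
}

/* Estimated stack bytes consumed by `depth` nested calls,
 * each holding `local_bytes` of local variables. */
size_t stack_bytes_for_recursion(size_t local_bytes, size_t depth)
{
    return depth * (local_bytes + FRAME_OVERHEAD_BYTES);
}

/* Deepest recursion that fits in a stack of `stack_bytes`. */
size_t max_depth_for_stack(size_t local_bytes, size_t stack_bytes)
{
    assert(stack_bytes > 0u);
    return stack_bytes / (local_bytes + FRAME_OVERHEAD_BYTES);
}
```

Under those assumed sizes, a 3 KB stack (3072 bytes) holds 38 nested calls of a function like `sum_digits`, which is why the answer to "how much stack" depends so heavily on the application.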
@Brian,
IIRC the optimal C processor has something like an accumulator, a secondary register, an index register, and a hardware stack.
Compiler writers love lots of registers in a CPU. The compiler optimizers will try and keep as much working data in registers as possible. What you are describing may be true when all optimizations are switched off. This is not generally the case.
David & Ross,
I asked because if we get 4KB as I suggested, then running a C program would likely yield a 3KB stack = 768 longs.
Would this be enough?
If the routine pushing onto the stack checked the used depth, then the overflow could be placed into hub.
But if the stack requirements >>768 then it might not be beneficial.
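The overflow idea above could be modelled roughly like this in C. It is only a simulation sketch: the `SpillStack` type, the function names and the 1024-long hub spill area are hypothetical, and the real thing would be PASM in the push/pop routines. The only number taken from the post is the 768-long cog stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COG_STACK_LONGS 768u   /* 3 KB / 4 bytes per long, as in the post */
#define HUB_SPILL_LONGS 1024u  /* hypothetical size of the hub overflow area */

typedef struct {
    uint32_t cog[COG_STACK_LONGS];  /* fast stack held in cog RAM */
    uint32_t hub[HUB_SPILL_LONGS];  /* overflow area held in hub RAM */
    size_t   depth;                 /* longs currently on the stack */
    size_t   hub_accesses;          /* how often the slow path was taken */
} SpillStack;

void ss_init(SpillStack *s)
{
    s->depth = 0;
    s->hub_accesses = 0;
}

/* Push one long; returns 0 on success, -1 if both areas are full. */
int ss_push(SpillStack *s, uint32_t value)
{
    if (s->depth < COG_STACK_LONGS) {
        s->cog[s->depth] = value;                    /* fast path */
    } else if (s->depth < COG_STACK_LONGS + HUB_SPILL_LONGS) {
        s->hub[s->depth - COG_STACK_LONGS] = value;  /* spill to hub */
        s->hub_accesses++;
    } else {
        return -1;
    }
    s->depth++;
    return 0;
}

/* Pop one long into *out; returns 0 on success, -1 if the stack is empty. */
int ss_pop(SpillStack *s, uint32_t *out)
{
    if (s->depth == 0)
        return -1;
    s->depth--;
    if (s->depth < COG_STACK_LONGS) {
        *out = s->cog[s->depth];
    } else {
        *out = s->hub[s->depth - COG_STACK_LONGS];
        s->hub_accesses++;
    }
    return 0;
}
```

The `hub_accesses` counter is there to show the trade-off: as long as the depth stays under 768 longs it never moves, but a program whose stack is routinely much deeper takes the slow path on most pushes and pops and pays for the depth check as well.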
You are absolutely right but my sense is that C isn't really a focus for this chip. I'm beginning to think that it should be taken off the table until Chip has put together a design that meets his goals. Then we can look at it and decide if it is worth supporting with a C compiler. .....
I'm not following the thrust here - there is already a C Compiler for the P1, which means C is already 'on the table', and able to run at least as well as C on P1, when run on any P1 superset.
- ie any better device will be worth supporting with a C compiler
You may have been meaning to see how LMM runs on a FPGA image, to see if the gains of HubExec are worth supporting ?
heater,
You can hardly have a 500KB stack with only a total of 512KB altogether. 12KB of program is not going to achieve much!
Unless, of course, your program has a recursive bug, and then who cares.
David,
Taking C off the table would be a major blunder IMHO. And that is from someone who detests C, so I am certainly not being biased here.
Video isn't even on Ken's list.
I'm not following the thrust here - there is already a C Compiler for the P1, which means C is already 'on the table', and able to run at least as well as C on P1, when run on any P1 superset.
- ie any better device will be worth supporting with a C compiler
You may have been meaning to see how LMM runs on a FPGA image, to see if the gains of HubExec are worth supporting ?
I'm not sure if it can run as well as C on the P1 if the only path to hub memory is through read/write FIFOs.
Hub exec is going to put me into an early grave at this point.
Many of us out here are going to our graves, early or otherwise, just trying to keep up with the increasing complexity of things.
For myself I would say:
1) Forget hubexec. I don't see it increasing speed over LMM much.
2) Forget FIFOs and HUB streaming. Most code accesses random locations most of the time.
3) Forget messing with increasing COG memory size by whatever tortuous means.
4) Heck, most applications won't use video or CORDIC, forget all that.
I'd love an FPGA build of a plain "vanilla" Propeller II, with the new HUB arbiter, and what is done so far.
Nice simple, understandable, easy to program, free of baggage that won't get used most of the time. Perhaps a tad less performant than some theoretical maximum but so what?
FIFO could be very nice sometimes... But, there also needs to be a simple way to do simple things...
Maybe there can be assembly macro commands that can do the regular, stalling reads and writes?
I think Chip has agreed to provide direct Opcodes. (#658), and there is also the 16 sized BLOCK opcodes, plus these new larger-block ones.
HUB2REG D/#,S/# - read S[8:0]+1 longs from read-FIFO starting at reg D[8:0]
HUB2LUT D/#,S/# - read S[7:0]+1 longs from read-FIFO starting at LUT D[7:0]
That covers quite a few options ?
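For anyone trying to picture what those two block opcodes would do, here is a small C model of the descriptions above. It is a sketch of the one-line descriptions only, not of real hardware: the `ReadFifo` type and function names are made up, and wrapping the destination index at the end of cog RAM or the LUT, and stopping early when the FIFO runs dry, are both assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COG_REGS  512u  /* 9-bit register address, D[8:0] */
#define LUT_LONGS 256u  /* 8-bit LUT address, D[7:0] */

/* Hypothetical stand-in for the hub read-FIFO: a buffer and a read position. */
typedef struct {
    const uint32_t *data;
    size_t len;
    size_t pos;
} ReadFifo;

/* Copy (s & mask) + 1 longs from the FIFO into dest, starting at index
 * (d & mask). ASSUMPTIONS: the index wraps at the end of dest, and the
 * copy stops early if the FIFO is empty. Returns the longs copied. */
static size_t fifo_block_read(uint32_t *dest, uint32_t mask,
                              ReadFifo *f, uint32_t d, uint32_t s)
{
    size_t count = (size_t)(s & mask) + 1u;
    size_t start = (size_t)(d & mask);
    size_t n = 0;
    while (n < count && f->pos < f->len) {
        dest[(start + n) & mask] = f->data[f->pos++];
        n++;
    }
    return n;
}

/* HUB2REG D,S: read S[8:0]+1 longs from the read-FIFO starting at reg D[8:0]. */
size_t hub2reg(uint32_t regs[COG_REGS], ReadFifo *f, uint32_t d, uint32_t s)
{
    return fifo_block_read(regs, COG_REGS - 1u, f, d, s);
}

/* HUB2LUT D,S: read S[7:0]+1 longs from the read-FIFO starting at LUT D[7:0]. */
size_t hub2lut(uint32_t lut[LUT_LONGS], ReadFifo *f, uint32_t d, uint32_t s)
{
    return fifo_block_read(lut, LUT_LONGS - 1u, f, d, s);
}
```

Note the "+1" in the count: an S value of 0 still moves one long, so a single opcode covers anything from 1 to 512 longs for the registers, or 1 to 256 for the LUT.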
Maybe. At this point I'm going to wait for a coherent description of the instruction set and architecture. It's impossible for me to follow this discussion. I don't know what's in or out at this point so I can't comment on whether it will be a good target for C or not.
You can hardly have a 500KB stack with only a total of 512KB altogether. 12KB of program is not going to achieve much!
Of course that is an extreme case. For illustrative purposes. Point is putting any artificial limit there is not a good idea. Still I can imagine algorithms that will fit in 12K of code and consume 500K of stack.
Taking C off the table would be a major blunder
Yes and yes. This is 2014. Designing a processor that is not C friendly would be nuts.
Historically it has turned out that what is complicated for assembler programmers is also too much for compilers. See the history of the Intel i860 and Itanium designs.
Anyway, how are we going to get that JavaScript engine running on the P2 without good C support?
Many of us out here are going to our graves, early or otherwise, just trying to keep up with the increasing complexity of things.
For myself I would say:
1) Forget hubexec. I don't see it increasing speed over LMM much.
2) Forget FIFOs and HUB streaming. Most code accesses random locations most of the time.
3) Forget messing with increasing COG memory size by whatever tortuous means.
4) Heck, most applications won't use video or CORDIC, forget all that.
I'd love an FPGA build of a plain "vanilla" Propeller II, with the new HUB arbiter, and what is done so far.
Nice simple, understandable, easy to program, free of baggage that won't get used most of the time. Perhaps a tad less performant than some theoretical maximum but so what?
Hear, hear. I second that too, and vanilla is still a flavor, a rather nice one too.
David,
Taking C off the table would be a major blunder IMHO. And that is from someone who detests C, so I am certainly not being biased here.
Video isn't even on Ken's list.
I think you're probably correct that the market for the P2 will be more limited if it doesn't support C or supports it badly. However, if you think C support is important then maybe it would be best to forget video and LUTs and streaming and lay out a good architecture for high level language support. Then the other stuff can be added if there is time and space. The approach here seems to be the reverse of that. So I'm saying that if the focus is really on video then let it be that and don't pretend otherwise. Ending up with a chip that can only do LMM or can only do hubexec badly would be a major blunder too in my opinion.
I think we can get by without many of the extra hubexec instructions that were added in the old P2.
But we will require...
* DJNZ, etc jumps to be Relative +/-127
* JMP/CALL/RET to do Relative and Absolute 17 bits immediate, as was done in P2.
* The return address placed in a fixed-location register (GCC requirement, and PASM could live with it too)
* LOADIMM to load the following long into a register (simpler to do than AUGS, AUGD)
Anything else is a bonus
JMP/CALL/RET and DJNZ, etc. should be relative anyway, for relocatable code.
Anyway, looks like something has to give unless I missed something about die space.
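The jump requirements listed above come down to a little address arithmetic, sketched here in C. Two things are assumptions rather than anything stated in the thread: that the short DJNZ-style offset is an 8-bit signed field, and that offsets are added to whatever PC value the caller passes in (whether that is the branch itself or the following instruction is left open). The 17 bits presumably come from 512 KB of hub being 2^17 longs.

```c
#include <assert.h>
#include <stdint.h>

#define PC_BITS 17u
#define PC_MASK ((1u << PC_BITS) - 1u)  /* 0x1FFFF: 2^17 longs = 512 KB */

/* Sign-extend the low `bits` bits of `value` (bits must be 1..31). */
static int32_t sign_extend(uint32_t value, unsigned bits)
{
    uint32_t sign = 1u << (bits - 1u);
    value &= (1u << bits) - 1u;
    return (int32_t)(value ^ sign) - (int32_t)sign;
}

/* Absolute JMP/CALL: the 17-bit immediate simply becomes the new PC. */
uint32_t jump_absolute(uint32_t target17)
{
    return target17 & PC_MASK;
}

/* Relative JMP/CALL: add a signed 17-bit offset to the PC, wrapping at 17 bits. */
uint32_t jump_relative(uint32_t pc, uint32_t offset17)
{
    int32_t next = (int32_t)(pc & PC_MASK) + sign_extend(offset17, PC_BITS);
    return (uint32_t)next & PC_MASK;
}

/* Short relative branch for DJNZ and friends.
 * ASSUMPTION: an 8-bit signed field, which covers the +/-127 in the list. */
uint32_t branch_short(uint32_t pc, uint32_t offset8)
{
    int32_t next = (int32_t)(pc & PC_MASK) + sign_extend(offset8, 8u);
    return (uint32_t)next & PC_MASK;
}
```

The relocation point falls out of the arithmetic: a relative branch encodes only the distance, so the same instruction bits work wherever the code is loaded, while an absolute target has to be patched if the code moves.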
I think you're probably correct that the market for the P2 will be more limited if it doesn't support C or supports it badly. However, if you think C support is important then maybe it would be best to forget video and LUTs and streaming and lay out a good architecture for high level language support. Then the other stuff can be added if there is time and space. The approach here seems to be the reverse of that. So I'm saying that if the focus is really on video then let it be that and don't pretend otherwise. Ending up with a chip that can only do LMM or can only do hubexec badly would be a major blunder too in my opinion.
What I've seen of this FIFO looks pretty good to me. I don't see how it could adversely affect C support... It makes writing to HUB RAM from a cog faster, even for a single long.
If I'm seeing it correctly, then sure you need two instructions to do a write, but they don't stall execution, so you save many clocks on average.
Read is the same way, but I'm not completely sure how the timing would work....
Presumably, you'd still need 2 instructions to read a long, but maybe you could do other things in between these two instructions, if you didn't want to have a chance of stalling execution. Hopefully, if you read too early execution just stalls instead of giving you bad data...
So it seems that you still win big even if just reading or writing a single long. But, it takes two instructions instead of one...
So with the FIFO approach we'll be able to do random address reads and writes without hub stalls? I don't think so. C programs will encounter lots of hub stalls. This new chip will be for expert assembly programmers only.
Comments
Wouldn't they use the same 4 byte write enables, and so interface identically to the streaming FIFO?
Chip, go with your gut feeling on this!
Ross.
Just my 2c worth.
^^^^^What he said. IIRC the optimal C processor has something like an accumulator, a secondary register, an index register, and a hardware stack.
@Brian, Compiler writers love lots of registers in a CPU. The compiler optimizers will try and keep as much working data in registers as possible. What you are describing may be true when all optimizations are switched off. This is not generally the case.
So they'll tell you although once you look at the generated code, it seems they make do quite happily with around 16 of them.
Anyway, how are we going to get that JavaScript engine running on the P2 without good C support?
Yes and yes. Drop it already.
If I'm seeing it correctly, then sure you need two instructions to do a write, but they don't stall execution, so you save many clocks on average.
Doesn't that make writes non-atomic?