I am not opposed to HubExec. I just point out that no-one has made a compelling case for it.
But what I am opposed to is turning the Propeller into a complex monstrosity in the vague and misguided hope of competing with "NXP, TI, FreeScale, Atmel" or (these days) even relatively low-end ARM processors!
I think you are missing my point ... the faster we get the compiled code to run, the bigger the potential market is.
This is why hubexec, slot mapping, mooch all matter ... as do good graphics, fast I/O, etc., to cover a larger application space.
Correct.
Any 'argument' that faster does not help, strangely because other devices are faster already, is not an 'argument' at all.
When I want code faster, or more deterministic, it is the problem in front of me I worry about, not what some other processor might clock at.
Analog Devices makes 400-600 MHz 32-bit DSPs, and I've not seen ARM vendors halting speed improvements because they think MHz does not matter, citing that ADI will always be faster.
.. even at 50 MIPS, the Propeller is already outperformed by other processors costing less.
Why is it "the end of the world" to narrow the gap to the "others"?
PASM is the "heart and soul" of Propeller and being able to capitalize on the larger HUB space to run large programs seems the next logical step in the Propeller's evolution.
It is not the end of the world - as I said, we all want a faster Propeller... but it also gets you no additional revenue unless you can take market share away from those that already have it.
The Propeller has to win in the markets that it is particularly well suited for. Trying to address other markets is pointless when you cannot even equal the entry level offering of other chip manufacturers.
Ross.
I agree with you that the Prop is not going to take market share in many of the areas ARM and others play for.
Many of those areas simply require more advanced technology than is in the Prop, and much greater resources for R&D, newer foundry process, etc.
However to have any chance to win, expand, etc, you have to play. And play HARD, bringing the BEST you can muster.
You sure aren't going to win any additional market by choosing to remain slow while even little 8-pin DIP LPC810s at $1.35 retail keep getting faster, cheaper, and more power efficient. Not to mention the BeagleBone Black's Sitara AM3358 with an A8, dual 32-bit PRU cores and 720p video. It too is expensive; however, it will be trickling down to the lower families in some fashion or other.
If Parallax can keep the lights on and everything, then that's fine; it's their money.
The Prop can stay in its market for as long as it wants, or until the 'others' start dual/quad coring their M0/M4s, adding video, etc. for $5.
Once that starts to happen, it's going to get ugly, and quickly.
That's why it would probably make a lot of sense to take a step back, see that the 'others' are already going multi-core, and consider whether it's even 2 years before all of a sudden Parallax is fighting some dual/quad-core M0/M4s on its home turf.
The Uber-P2 didn't pan out. Though a 4 Cog version now doesn't look so bad.
The P1+ seemed to have a lot of potential, however that has also apparently hit some bumps.
The new P16/P2 is somewhere between the P1 and the P1+.
I fail to see what all the yelling that the 'sky is falling', and "We have to have something out stat!" is accomplishing, other than pushing Parallax to rush ahead with the P16 instead of looking more than 1-2 years ahead, and seeing if the bumps can't actually be worked out at the expense of a couple of months....
I also don't see much way that a spare $1M in profit is suddenly going to materialize in that time, with the current P2, for any P3.
Ken or Chip said they welcomed 'brutal', so that's my take.
But what level of simplicity is acceptable? PTRA/PTRB are a bit of a pain. Hub exec is more of a pain. Prop1 was square in the zone of beautiful simplicity. What I have, so far, is like Prop1, but improved.
What is going to disrupt simplicity in this design is wide data paths (4 longs) and hub exec. I think those features are important enough, though, that the disruption they cause must be accommodated.
So while we would like to have fast hubexec, what about a simple version that doesn't use quads?
Sure it would only run at hub speed, but it would still be faster than LMM and code could be smaller than LMM if Chip includes some of the previous hub instructions.
It would also respond well to slot sharing if any of that happens.
I know a lot of folks will dismiss this as being too slow, but if Quads prove to be too much for this version then this is better than just falling back to pure LMM.
I'd like to hear from the GCC guys to see what minimal things they would need for this to be useful for GCC.
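To put rough numbers on "hub speed", here is a back-of-envelope sketch. None of it comes from Chip's design: it assumes a 32-slot hub rotation, one clock per slot, and one instruction fetched per slot a cog owns. The function name `hubexec_mips`, the 100 MHz clock and the 16-cog split are all made up for illustration.

```python
from fractions import Fraction

def hubexec_mips(clock_mhz, slots_owned, total_slots=32, clocks_per_slot=1):
    """Estimate millions of instructions per second for fetch-per-slot execution.

    Assumption: each hub slot owned by the cog delivers exactly one
    instruction, and nothing else stalls execution.
    """
    if not 0 <= slots_owned <= total_slots:
        raise ValueError("slots_owned must be between 0 and total_slots")
    rotation_clocks = total_slots * clocks_per_slot
    return Fraction(clock_mhz) * slots_owned / rotation_clocks

# Hypothetical 100 MHz clock, 16 cogs sharing 32 slots evenly (2 slots each).
even_share = hubexec_mips(100, 2)
# The same cog after being handed 6 more slots through slot sharing.
boosted = hubexec_mips(100, 8)
print(float(even_share), float(boosted))
```

Under these assumptions the rate scales linearly with the number of slots a cog owns, which is the sense in which a simple hubexec would "respond well to slot sharing".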
The hub slot sharing, 32 slots with 2 level tables (first and second priority cog offerings) is IMHO quite simple to implement. I am sorry jmg, I don't buy your timing analysis - Chip would solve this simply, of that I am certain.
With slot sharing, this could help solve hub bandwidth for the few cogs that may require it. Would this negate the requirement for the more complex QUAD hub-cog interface? Quite likely at least aid it. The quads are more costly (time and complexity) to design - Chip said so!
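As a rough illustration of the two-level table idea, here is a minimal software sketch, not a description of any real hardware. It assumes each of the 32 slots names a first-priority and a second-priority cog, and that the slot goes to the first of the two that is actually requesting the hub. The names `grant_slot`, `primary` and `secondary`, and the 16-cog layout, are hypothetical.

```python
TOTAL_SLOTS = 32

def grant_slot(slot, primary, secondary, requesting):
    """Return the cog that gets hub access in this slot, or None.

    primary / secondary: 32-entry lists of cog numbers (None = unassigned).
    requesting: set of cogs that want a hub access right now.
    """
    index = slot % TOTAL_SLOTS
    for candidate in (primary[index], secondary[index]):
        if candidate is not None and candidate in requesting:
            return candidate
    return None

# Hypothetical layout: 16 cogs, plain round-robin as first priority,
# with cog 0 offered every slot as second priority.
primary = [slot % 16 for slot in range(TOTAL_SLOTS)]
secondary = [0] * TOTAL_SLOTS

# Only cogs 0 and 3 want the hub; cog 0 picks up every slot cog 3 does not own.
grants = [grant_slot(s, primary, secondary, {0, 3}) for s in range(TOTAL_SLOTS)]
print(grants.count(0), grants.count(3))
```

In this layout cog 3 keeps its guaranteed slots, so its timing stays deterministic, while cog 0 soaks up bandwidth that would otherwise go unused.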
With slot sharing, this could help solve faster hubexec for the few cogs that may require it. Could Chip implement a simpler hubexec? Quite likely, because it may not require an ICache, and it is much simpler without one. How could it work? Each time an instruction is required from hub, it is fetched directly into the ALU and executed in place. Much simpler IMHO. We can still support relative & direct jmp/call/ret, the jmpret required by the GCC guys, and maybe a few other simple helper instructions (AUGD/S, LOC, etc.)?
With a simpler implementation, could we have a deeper internal LIFO? Maybe a depth of 16 for call/ret/push/pop. Could the top 16 longs of hub ram be used for this??? (I suggested using the hub ram for ICache and Chip was going to use it that way).
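To make the depth question concrete, here is a small model of a 16-deep LIFO for call/ret/push/pop. The overflow and underflow behaviour below is an assumption for illustration only (nothing in the thread specifies it), and the class name `ReturnStack` is hypothetical.

```python
class ReturnStack:
    """Fixed-depth LIFO for call/ret/push/pop (depth 16 as floated above)."""

    def __init__(self, depth=16):
        self.depth = depth
        self.entries = []

    def push(self, value):
        # Assumption: pushing onto a full stack drops the oldest entry,
        # as a simple hardware shift register would. Real silicon may differ.
        if len(self.entries) == self.depth:
            self.entries.pop(0)
        self.entries.append(value & 0xFFFFFFFF)  # entries are 32-bit longs

    def pop(self):
        # Assumption: popping an empty stack returns 0 rather than trapping.
        return self.entries.pop() if self.entries else 0

stack = ReturnStack()
for return_address in range(20):  # 20 nested calls into a 16-deep stack
    stack.push(return_address)
unwound = [stack.pop() for _ in range(16)]
```

With these assumptions, the four oldest return addresses are silently lost once nesting goes past 16 levels. That is the trade-off a compiler writer would need to know about before relying on an internal LIFO instead of a stack in hub RAM.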
Yes, it can meet speeds, with some assumptions, and the FPGA topologies using memories are faster, but they impose some restrictions on write, i.e. a loop with shift + nibble or byte write would be needed.
Yes, Slot mapping helps allocate BW for all use cases.
My reading is Chip has said Hubexec does make the cut, so until he states explicitly otherwise, assume it is 'in'.
Comments
I want to run large code without the LMM overhead, which has nothing to do with ARM etc. (other than that they don't have that overhead).
You miss my point - just saying you want the chip to run faster is not an argument. We all want that.
What market segment is it that you think Parallax can successfully capture with this feature?
Because I can't see any at all - even at 50 MIPS, the Propeller is already outperformed by other processors costing less.
Ross.
Chris Wardell
I totally agree.
It is not 'my' timing analysis; I simply report what the FPGA P&R says.
You might want to refresh with the latest numbers, with different FPGA topologies, in the thread
http://forums.parallax.com/showthread.php/155561-A-32-slot-Approach-(was-An-interleaved-hub-approach)?p=1266305&viewfull=1#post1266305
Will there be ICache, and if so, how many levels?
Will there be the internal LIFOs, and if so, how many levels?
Will the AUGx, LOCx, etc be included?