In an earlier message, I pointed out that by combining the bits of D:S, we can have 18 bits for a long address, and append two 00's (scaling it for long) - allowing addressing 1MB of hub.
HJMP would jump to a 20-bit address (i.e. do an RDOCTL, then resume execution at the proper sub-long)
HCALL would first push the address of the next hub instruction, then jump to a 20-bit address (same RDOCTL-and-resume sequence as HJMP)
HRET would pop the return address into the PC
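A quick sanity check of the address arithmetic, as a Python sketch. It assumes 9-bit D and S fields concatenated as D:S; the function and constant names are made up for illustration.

```python
# Sketch only: field widths follow the post (9-bit D, 9-bit S).
# Names are hypothetical, not part of any real toolchain.
D_BITS = 9
S_BITS = 9

def hub_byte_address(d: int, s: int) -> int:
    """Concatenate D:S into an 18-bit long address, then append two zero bits."""
    long_address = ((d & 0x1FF) << S_BITS) | (s & 0x1FF)
    return long_address << 2  # scale long index to a byte address

# 18 address bits plus 2 scaling bits gives a 20-bit byte address.
HUB_REACH_BYTES = 1 << (D_BITS + S_BITS + 2)
```

The highest reachable long starts at byte $FFFFC, and the total reach is exactly 1MB.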
It would be a lot faster & less complicated to use AUX as the stack for now, however the future P3 could have a dedicated stack pointer
(I realize for large UI apps a bigger stack would be nicer, but it would be a lot slower - and local variable access would be a lot slower as well)
(P3 should have two stack models - aux and hub - with proper I&D caches - but for now I want a fast P2 :-) )
If PTRA (or PTRB) were used as the hubPC, then loading a constant simply becomes "RDLONG reg,ptra++"
If we know the address of the OCTL area, *AND* it can be guaranteed that opcode %11111100 is a NOP, then
Assume OCTL is mapped to $1E0
any of the eight slots can be used for a hub address constant of up to 23 bits (which could be scaled)
1E0: rdlong r1,$1E1 ' in one instruction we read
1E1: long $FE000000 + address of some hub variable ' will be skipped
1E2: rdlong r2,ptra++
1E3: long $xxxxxxxx ' 32 bit constant
1E4:
1E5:
1E6:
1E7:
All of a sudden the GCC code generator is much simpler, and the code becomes much smaller - and MUCH faster!
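To make the two constant-loading tricks concrete, here is a rough Python model. It is a sketch under assumptions: hub memory is a little-endian byte array, the NOP marker occupies the top 9 bits of the long (which is what leaves 23 bits for the address), and every name in it is invented for illustration.

```python
import struct

NOP_MARKER = 0xFE000000  # assumed: top bits that make the long execute as a NOP
ADDR_MASK = 0x007FFFFF   # low 23 bits carry the hub address

class HubCog:
    """Toy model of a cog that uses PTRA as its hub program counter."""

    def __init__(self, hub: bytearray, ptra: int = 0):
        self.hub = hub
        self.ptra = ptra

    def rdlong_ptra_postinc(self) -> int:
        """Model of 'RDLONG reg,ptra++': read one long, then step PTRA past it."""
        (value,) = struct.unpack_from("<I", self.hub, self.ptra)
        self.ptra += 4
        return value

def pack_address_constant(hub_address: int) -> int:
    """Build a long that is both a NOP and a hub address constant."""
    if hub_address & ~ADDR_MASK:
        raise ValueError("address does not fit in 23 bits")
    return NOP_MARKER | hub_address

def unpack_address_constant(long_value: int) -> int:
    """Recover the hub address; the marker bits are ignored."""
    return long_value & ADDR_MASK
```

The point of the model: the inline 32-bit constant costs one read and the PC skips it for free, while the address constant can sit in the instruction stream because it is harmless when executed.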
I think I missed something. What do these instructions do? I assume that HCALL and HRET don't actually modify instructions like CALL/RET do? Or maybe you're expecting HCALL to use all 18 bits of D+S and push its return address on an AUX memory stack?
So then the product of this is PASM sitting in the HUB, directly executed with HEXEC ptra with ptra the address?
Don't use some instructions, hubops, etc... which would be reserved for LMM and COG PASM.
This gets called HUB PASM, and we now have PASM, LMM, XMM execute models. Wow.
Return to the COG via a standard JMP instruction, canceling the hardware HUB execution mode. And carry on at top speed and full use of HUB operations.
Bill, I think that's about the simplest model there is, given the state of things right now.
Well yes, I have to agree with Kerry.
1MB PASM programs @ ~90 percent of native with few restrictions.
Holy Buckets! All of a sudden, that 256MB HUB makes a big difference. Plenty of room to be paging in fairly large programs from external memory.
Can't wait to play with an FPGA image.
@JMG, well OK. Here we are this morning. I'm going to concede your point. Maximizing it right now makes perfect sense. On the assumption this all makes sense. I'm thinking it will.
Oh man... Yeah, I picked up on no need for HEXEC. That's just going to be a mode name, or something, in comments, etc... at best.
Thanks Bill!
In this way, P2 can bridge the gap to a full on CPU for P3. I like how this is shaping up a lot. I'm stunned at the speed of it to be perfectly honest.
Execute in place at some respectable speed is worth this effort. Seriously worth it.
And to think, several of these can be banging around all at once, or even running the same PASM from the HUB!
As a Propeller lover here I want to wish you guys here doing the hard work all the best.
It's a pleasure to be able to follow the whole process of a chip in creation with the help of a community; it brings wonderful things to life.
Like Chip said, we need things we can hack, and this chip will more than make all our hacked stuff do what we want it to do like it never did before.
The P1 is my one-size-fits-all problem solver; almost anything is possible if you can imagine it,
for me not to try to run the world on the P2 (P3 sounds great) just yet.
I stayed quiet and didn't buy any board to emulate, but can't wait to get the real thing in my hands once it's that time.
Have done some hacking around with the P1, and have always had great help from the community here.
Sounds like a winner to me.
Wish you guys all the best here in accomplishing this masterpiece.
Note - these instructions have normal conditional execution bits :-)
Any jump outside of the OCTL window exits hub execution mode, with PTRA pointing to next hub instruction
This would make P2 very competitive (actually, due to 8 cores, totally outclass) arm chips without hardware floating point that run at up to 160MHz
It would also save Parallax the development cost of a quad-long based VLIW style GCC port (at a guess, about $250K)
HJMP D/#addr
TTTTTTT ZC I CCCC AAAAAAAAA AAAAAAAAA
Enters hub-exec mode if in cog mode
If immediate address, jumps to AAAAAAAAAAAAAAAAAA00 (sets ptra to scaled address, fetches OCTL, jumps to first instruction in octl)
If not immediate address, jumps to address in D (sets ptra to D, fetches OCTL, jumps to first instruction in octl)
Setting C and Z does not make sense for HJMP, so could be used as additional address bits, or relative jumps
C could indicate add AAAAAAAAAAAAAAAAAA00 to PTRA (forward relative jump)
Z could indicate subtract AAAAAAAAAAAAAAAAAA00 from PTRA (backward relative jump)
Relative jumps would be helpful for position independent code.
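Here is how the HJMP encoding and target selection could be modeled in Python. This is only a sketch: the bit positions are read straight off the "TTTTTTT ZC I CCCC A..." pattern, the C/Z relative-jump behavior is the suggestion from this post rather than anything implemented, and all names are hypothetical.

```python
from dataclasses import dataclass

HUB_MASK = 0xFFFFF  # 20-bit hub byte address

@dataclass
class HJmpFields:
    opcode: int  # TTTTTTT, bits 31..25
    z: int       # bit 24
    c: int       # bit 23
    imm: int     # I, bit 22
    cond: int    # CCCC, bits 21..18
    addr18: int  # 18-bit long address, bits 17..0

def decode(instr: int) -> HJmpFields:
    """Split a 32-bit instruction into the fields of the proposed layout."""
    return HJmpFields(
        opcode=(instr >> 25) & 0x7F,
        z=(instr >> 24) & 1,
        c=(instr >> 23) & 1,
        imm=(instr >> 22) & 1,
        cond=(instr >> 18) & 0xF,
        addr18=instr & 0x3FFFF,
    )

def hjmp_target(f: HJmpFields, ptra: int, d_value: int = 0) -> int:
    """Return the new PTRA for an HJMP, per the rules sketched in the post."""
    scaled = f.addr18 << 2           # append the two zero bits
    if not f.imm:
        return d_value & HUB_MASK    # register form: jump to address in D
    if f.c:
        return (ptra + scaled) & HUB_MASK  # forward relative
    if f.z:
        return (ptra - scaled) & HUB_MASK  # backward relative
    return scaled                    # absolute
```

The field widths add up to 7 + 2 + 1 + 4 + 18 = 32 bits, so the full 18-bit address fits without stealing from the condition bits.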
HCALL D/#addr
TTTTTTT ZC I CCCC AAAAAAAAA AAAAAAAAA
AUX = ++PTRA
Saves next hub instruction address value onto the AUX stack using --SPA, then
It would also be very desirable to be able to enter hub-exec mode with HCALL, as then cog-only mode would be able to execute library code from the hub
This would largely eliminate the cog memory limitation; note all cogs could share hub subroutine libraries.
If immediate address, jumps to AAAAAAAAAAAAAAAAAA00 (sets ptra to scaled address, fetches OCTL, jumps to first instruction in octl)
If not immediate address, jumps to address in D (sets ptra to D, fetches OCTL, jumps to first instruction in octl)
WC could be applied to set a flag in case stack wraps around
WZ could be applied to set a flag if there is a stack collision with SPB
HRET {#offset}
TTTTTTT ZC I CCCC AAAAAAAAA AAAAAAAAA
PTRA = AUX[SPA++] + offset
execute instruction in cog memory right after the HJMP that entered hub-exec mode
It would be highly desirable that if hub code was invoked with HCALL, that the HRET would go back to cog execution mode - see explanation in HCALL
Offset is scaled by 4, normally 0, but could be used to pop up several levels - think exceptions; of course SUBSPA #offset would do the same
WC could be applied to set a flag in case stack wraps around
WZ could be applied to set a flag if there is a stack collision with SPB
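The HCALL/HRET stack behavior can be sketched in Python like this. Assumptions: AUX holds 256 longs, HCALL pushes with --SPA and HRET pops with SPA++, and the wrap and collision flags stand in for the WC/WZ results suggested above. The HRET offset is left out because the post leaves its exact effect open. All names are hypothetical.

```python
AUX_LONGS = 256  # assumed size of AUX memory, in longs

class AuxReturnStack:
    """Toy model of the AUX return stack addressed through SPA."""

    def __init__(self, spa: int = 0, spb=None):
        self.aux = [0] * AUX_LONGS
        self.spa = spa
        self.spb = spb          # optional second stack pointer to collide with
        self.wrapped = False    # what WC could report
        self.collided = False   # what WZ could report

    def _move_spa(self, step: int) -> None:
        new_spa = (self.spa + step) % AUX_LONGS
        self.wrapped = (new_spa - self.spa) != step
        self.spa = new_spa
        self.collided = self.spb is not None and self.spa == self.spb

    def hcall(self, return_address: int, target: int) -> int:
        """Push the next hub instruction address (--SPA), return the new PTRA."""
        self._move_spa(-1)
        self.aux[self.spa] = return_address
        return target

    def hret(self) -> int:
        """Pop the return address (SPA++) and return it as the new PTRA."""
        return_address = self.aux[self.spa]
        self._move_spa(+1)
        return return_address
```

Nested calls just stack up, so a second HCALL from hub code works the same way as the first one.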
Limitations
- REPxxx loops must fit in the 8 long OCTL cache
- DJNZ and friends must fit in the 8 long window
- any type of jump/loop or call that is not HJMP / HCALL / HRET exits hub execution mode
- RDxxxxC and WRxxxxC instructions must not be used in hub execute model
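A tool could check the first two limitations mechanically. The Python sketch below assumes the OCTL window is aligned on an 8-long (32-byte) boundary, which the post implies but does not state outright; the helper names are hypothetical.

```python
OCTL_BYTES = 8 * 4  # eight longs per window

def octl_window(hub_address: int) -> int:
    """Base address of the (assumed aligned) 8-long window holding hub_address."""
    return hub_address & ~(OCTL_BYTES - 1)

def loop_fits_in_window(first_addr: int, last_addr: int) -> bool:
    """True if a REPxxx/DJNZ loop body lies entirely inside one window."""
    return octl_window(first_addr) == octl_window(last_addr)
```

For example, a loop from $100 to $11C stays in one window, while one from $118 to $120 straddles two and would have to be moved or padded.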
Possible Improvements
- it should be possible to support calling cog subroutines from hub execution mode using JMPRET, as long as they can return to hub execution mode
- adding a CSEG register that is added to all HJMP/HCALL addresses would eliminate the need for relative jumps
- adding a DSEG register for non-HJMP/HCALL/HRET hub references would also allow relocatable data
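The CSEG/DSEG idea boils down to adding a base register before the address leaves the cog. A minimal Python sketch, assuming a 20-bit hub address space and hypothetical helper names:

```python
HUB_MASK = 0xFFFFF  # 20-bit hub byte address

def effective_code_address(cseg: int, addr18: int) -> int:
    """HJMP/HCALL target: scaled immediate plus the code segment base."""
    return (cseg + (addr18 << 2)) & HUB_MASK

def effective_data_address(dseg: int, offset: int) -> int:
    """Hub data reference: offset plus the data segment base."""
    return (dseg + offset) & HUB_MASK
```

The same binary then runs at any load address: with CSEG = $4000 an immediate of $10 lands at $4040, and with CSEG = $8000 it lands at $8040, with no relative jumps needed.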
Folks, with this the P2 is no longer just a microcontroller - it is also a full fledged microprocessor!
Holy Buckets! All of a sudden, that 256MB HUB makes a big difference. Plenty of room to be paging in fairly large programs from external memory.
Oh man, don't tease! If only we were talking MB for memory here... Have to wait for the NEXT Propeller to get there, but I am sure we will.
And at the pace all of you are generating great ideas and solving problems with this group effort I really think the limitation will be funding not time.
I would bet that from the time the P2 goes to fab until the sample chips come back Chip, and the rest of you, will have the first revision P3 already running in FPGA mode.
Note - these instructions have normal conditional execution bits :-)
Any jump outside of the OCTL window exits hub execution mode, with PTRA pointing to next hub instruction
When you say any jump I assume you mean any JMP? An HJMP or HCALL or even HRET when in hub mode would remain in hub mode wouldn't it? Or do I have to carve my hub program into 8-long pieces? Also, is the only difference between COG mode and HUB mode whether there is lookahead done to fill the 8-long icache?
Sounds good. Is there any reason why the PC can't just be extended to 18 bits rather than making use of PTRA? Is it because the PC can't be used as an address in a hub transfer?
I seriously doubt we would use AUX as stack with GCC for fetch/exec from HUB or any other model for that matter. We could certainly make use of it for other things in user code via inline ASM though.
Hopefully Eric and David can provide input on implementation things.
I figure it would be okay for HCALL/HRET to push/pop an AUX memory stack because we could always pop the return address in the prologue to the function being called and then store the return address in a hub stack for non-leaf functions. This isn't really that much different from what we do now since the LMM CALL instruction always puts the return address in the LR register which has to be moved to the stack for non-leaf functions.
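The prologue idea can be modeled in a few lines of Python. This is a sketch of the calling convention only, with hypothetical names: HCALL pushes onto the AUX stack, and a non-leaf function immediately moves its return address to a software stack in hub, restoring it before HRET.

```python
class CallModel:
    """Toy model: AUX holds return addresses only briefly; hub holds the rest."""

    def __init__(self):
        self.aux_stack = []     # fast hardware return stack (AUX)
        self.hub_stack = []     # software stack in hub memory
        self.max_aux_depth = 0

    def hcall(self, return_address: int) -> None:
        self.aux_stack.append(return_address)
        self.max_aux_depth = max(self.max_aux_depth, len(self.aux_stack))

    def prologue_non_leaf(self) -> None:
        """Move the just-pushed return address from AUX to the hub stack."""
        self.hub_stack.append(self.aux_stack.pop())

    def epilogue_non_leaf(self) -> None:
        """Put the return address back on AUX so HRET can pop it."""
        self.aux_stack.append(self.hub_stack.pop())

    def hret(self) -> int:
        return self.aux_stack.pop()
```

With this convention the AUX stack never grows beyond one entry no matter how deep the call chain goes, and leaf functions pay nothing extra.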
This is going to be the iPod of micro-controllers. Bill said it first, but this change bridges the gap between Propeller as MCU and Propeller as CPU.
!!!
I'm not sure I agree with this. We're still executing out of on-chip memory even in hub mode. What would take us out of the realm of an MCU would be adding XMM to the mix.
See end of post#1 in the HUBEXEC thread. This model extends fairly easily to executing code out of external DDR2+ memory, and by adding segment/limit registers as I suggest, would even have the functional equivalent of an MMU.
It is more than that. When rev B comes out with that large DDR2 ram stacked on top it will be absolutely revolutionary.
It would give any ARM SoC a run for its money as an application processor while kicking ... on Pic micros for RT process control at the same time All Without an Operating System!
I sure hope Parallax has a patent app in for "OS-less independent parallel utility processor with..."
I don't think we have room to even double the AUX memories. It's not a matter of area, exactly, but of placement. Also, that is a custom memory that we designed. To modify it is a big task, unlike most of these Verilog changes.
There's been some interesting proposals about increasing AUX's accessibility, but I don't have the room in my head at the moment to think about them clearly. I need to get the USB pin instructions implemented next, and come to some rest point on executing from the hub.
This makes perfect sense. And in the meantime we can discuss it and perhaps come up with something even better. I already have thought of a few ideas overnight, and it also sort of gives LMM hub execution some new possibilities as well.
I noticed Bill has a new thread but haven't read that yet.
Do you have your head around the 2 usb instructions required (nrzi using C flag and pin, and 1 bit crc)?
Also there was pin pairs?
Then simple SERDES first. Can we just get it really simple to start with? Then maybe there will be something to add (bitstuffing/inversion/chaining/buffering/etc) later. Let's get something to test first. That's really when the ideas flow from everyone as they understand what we have and what might be. And I have noted that you have a great mind to dig out the requirements, and simplify it, once we express it.
Now, to catch up on the rest of the overnight hysteria
Looks like we're coming down to the wire (trace?) and running out of die space (although perhaps many small things or changes could still be done). IIRC, recently, Chip mentioned the possibility of baking in some kind of true (or reasonably true) random number generator (not the pseudo kind). That might be the kind of thing that could bring unexpected benefits, or at least be convenient. Does anyone know if that got implemented? Is there some source of natural "jitter" in the chip to derive that from?
Yes, it could be argued that the Prop2 is so late by some metrics that it's not even worth making. But, how many other chips out there have been (or are being) painstakingly designed by people (me, you, others here) who've loved programming for perhaps 30 years, on average, who remember the old feelings of confidence and sheer joy that came from developing on/for systems that were reliable and responded logically to all their efforts, allowing them to build machines that worked perfectly? That kind of experience is long gone, buried by impenetrable layers of mucky-muck that have sapped almost all the fun out of engineering.
Agreed. Let's make sure, though, that we're not deviating from that here and now as well. One of P1's big advantages (without which it probably would never have been marketable) is its relative simplicity and ease of use. With it arrived a multi-core micro that is suitable for rapid prototyping and development, even for hobbyists. It isn't terribly exotic, certainly not for what you're getting, not in terms of power requirements, unit cost, packaging, or ability to understand and predict how it will work. This is why jmg's observation is correct with regard to the P1 as well - it's not dead, not even close. Engineers and hobbyists will be using it for some time to come - just as many still use 8051s and PICs.
Granted, the P2 cannot remain all those things to the same extent and truly evolve. However, stray too far and you've almost certainly killed any market viability for the part. That is, in an effort to create a one-chip supercomputer that can compete with (or simply outperform) PRU-outfitted ARMs and the many other options developers have at their disposal nowadays - many of which demonstrably are *not* junk - the P2 could easily be priced out of the market, in terms not only of $$'s but other factors as well.
Markets are fragmented, specialized nowadays, more so than in the past it seems. It's not going to do to simply design something that is "cool" or that does many things fairly well (but in reality costs too much or is too exotic for practical use). Therefore, potential markets and applications should be considered early and often in the design phase.
Frankly, I don't even care anymore. I don't care how many hub slots a cog gets, how many cogs there are, or even whether multitasking is in the mix, for that matter. What I do care -- and care deeply about -- is seeing a company and the people who work there who do matter to me -- and that I depend upon to a large extent for my livelihood -- get dragged down by an overly expensive, insanely long development cycle that has no end in sight and that's open to way too much input from people who will have virtually no impact on the chip's ultimate sales. The P2's development has to be more than an expensive, seven-year-and-counting hobby, or before long we won't even have a P1 to talk about.
This is my concern as well. I'm not so much "worried" about the P2 at this moment; honestly, as it stands, I don't know that I'd be able to find use for the thing in a real design. But I do worry that the high-stakes business side (if nothing else) could go sour and take the P1 out with it as well. Only recently I got a PO for a P1-based project, and I have plans for more. P1 is a very handy part to have in my arsenal that makes doing certain things much easier. I'd hate to lose it.
I think everyone here agrees that whatever is best for the health of Parallax is what should be done. These changes that are being discussed are supposedly work to fill in a gap before another foundry run can be attempted. When they start delaying that foundry run, some serious thought should be given as to their value relative to the delay in production availability of P2.
I have no enthusiasm for getting to know the ins and outs of ARM...
Wise choice. I've had to get to know the ins and outs of the ARM because it is my living. But I enjoy it about as much as I do the drive to work...on a cold day...when I have to scrape the windows...and the roads are slick...and there are lots of cars jamming the streets.
There is little comparison between the grind-it-out nature of the ARM and the fun and flexibility of the Prop. So when I see what the P2 is becoming, it's really mind-blowing. I predict it will become a legend in its own time.
Comments
So we know what HEXEC does, and we know an ordinary JMP returns to the COG.
You have it dead-on.
Except we don't need HEXEC - a simple HJMP would enter hub execution mode if executing in cog mode.
Read my later response to David, it shows how 32 bit constants, hub variable references, and local variables to functions become super fast too
Thanks Bill!
In this way, P2 can bridge the gap to a full on CPU for P3. I like how this is shaping up a lot. I'm stunned at the speed of it to be perfectly honest.
Execute in place at some respectable speed is worth this effort. Seriously worth it.
And to think, several of these can be banging around all at once, or even running the same PASM from the HUB!
It's a pleasure to be able to follow the whole process of a chip in creation with the help of a community; it brings wonderful things to life.
Like Chip said, we need things we can hack, and this chip will more than make all our hacked stuff do what we want it to do like it never did before.
The P1 is my one-size-fits-all problem solver; almost anything is possible if you can imagine it,
for me not to try to run the world on the P2 (P3 sounds great) just yet.
I stayed quiet and didn't buy any board to emulate, but can't wait to get the real thing in my hands once it's that time.
Have done some hacking around with the P1, and have always had great help from the community here.
Sounds like a winner to me.
Wish you guys all the best here in accomplishing this masterpiece.
Igor
Re: AUX as stack. I think that makes a lot of sense. And that's still a nice stack!
While in HUBEXEC mode, would a second HCALL just work, adding to that stack, etc... as expected? Seems that it would.
Any cog-destination jump or call exits.
http://forums.parallax.com/showthread.php/152079-Hub-Execution-Model-Thread-%28split-from-blog%29
It would be better I think to discuss it there, and leave the blog thread for P2 updates.
I'll move the preceding relevant info and add a few FAQs
This gets me even more excited about the P2!
It should be fairly easy to modify GCC for this mode (famous last words)
(Let's take this discussion to the HUBEXEC thread)
Hopefully Eric and David can provide input on implementation things.
!!!
And that would preserve the fast stack mode for hard real-time code; placing the return stack in the hub would slow things down greatly.
When this one is in the can, it's going to be CPU like, but not quite...
Then P3 can be a full on CPU, with micro-controller type features.
P2 is a micro-controller with some CPU like features.
P1 is a micro-controller.
That's how I see it anyway.
@Kerry
All Without an Operating System!
Word. Seriously.
Hopefully P2 will be profitable enough for Parallax to make a 90nm P3
~ 4 times the transistor budget (16 cogs, 1MB hub)
~ 4 times the speed
==> 16 cores @ ~800MHz
==> 12,800 MIPS
Does anyone know of an ARM with that many MIPS?
And it would be a trustable CPU.
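The MIPS figure above is just cores times clock, assuming every cog retires one instruction per clock (a peak number, not a measured one). As a Python one-liner with a hypothetical helper name:

```python
def peak_mips(cores: int, clock_mhz: float, instr_per_clock: float = 1.0) -> float:
    """Aggregate peak MIPS, assuming each core sustains instr_per_clock."""
    return cores * clock_mhz * instr_per_clock

P3_GUESS_MIPS = peak_mips(cores=16, clock_mhz=800)
```

That reproduces the 12,800 figure; any stall on hub access would pull the real number down.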