It's really a P2, not a P1+ by any stretch of the imagination.
Chip has done some major consolidation and he's keeping hubexec, but simpler. It's not quite as efficient (not as fast), but it's a major piece that overcomes the 2KB cog limit. 16 cores give us the ability to do a lot of intelligent (but simple-code) peripherals, as well as some nice main program cores too. Increasing hub to 512KB is a big improvement too. I cannot wait to get running.
Me neither. That nice, roomy HUB will really open things up for all of us.
I'm also thinking we made a few choices on the other design we could revisit now that we understand the process dynamics better. I'm thinking along the lines of full on System On A Chip. This device is gonna get close. Compared to P1, it's gonna be awesome!
The whole no OS thing, extended just a bit like the other design was starting to do, really could be something, and the longer timeline is very well aligned with that idea.
Good times!
I also like that Chip is refactoring right now. It being simple is compelling for everybody. And we've a lot of COGS! Can't wait to see how that plays out. We may find some of our thinking changes a little too. For the better.
If we could get serial-in and serial-out, with clock, that would really help. Are we getting threads too, or just bare cogs? The bare cogs would be better, I think.
This thread is again too long. :/
With analog pins, hubexec, cordic, multiplier, 16 cogs, 200 MHz and 512kB, I certainly think it's worthy of being called P2.
I just hope these things can stay. Seemed like last time the forum maybe helped to bloat the feature set into something they couldn't produce economically.
For me, if it were just regular P1 cores with the above features added, I'd be very happy.
I realized tonight that it is hard to support hub exec and not get mired back in the complexities of the Prop2.
The problems arise from hub exec needing INDA/INDB-type functionality to overcome the impracticality of self-modifying code in hub memory. INDA/INDB require an effective pipeline stage, unto themselves, in order to take the instruction, recognize INDA/INDB usage, and then substitute the INDA/INDB pointers into the S and D fields of the instruction before reading S and D. The Prop1's simpler architecture just feeds the S and D fields of the instruction straight into the address inputs of the cog RAM to read S and D on the next clock, requiring no extra stage. This pipeline stage needed to support INDA/INDB increases the number of instructions trailing a branch (which will need cancellation) from one to two. That, in turn, has the effect of requiring Prop2-type INDA/INDB backtracking circuits to accommodate various cancellation scenarios. This pipeline stage is also required to make the RDxxxxC cached reads work. To implement hub exec, we're going to be complicating the cog quite a bit. Delayed branches will become more necessary to regain looping performance. I think this is the wrong road, in light of what became of the P2 at 180nm.
There is a way around this, and that is to resolve all the INDA/INDB stuff on the same cycle as the instruction read, tacking the INDA/INDB logic time onto the cog RAM access time. This would have the effect of slowing the clock down by ~30%, I estimate. It would require fewer flops and not introduce a new pipeline stage, but would instantly create what will remain the critical path in the cog.
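For a rough feel of what that trade-off costs, here is a back-of-the-envelope sketch. It assumes the 200 MHz figure quoted earlier in the thread and the two-clocks-per-instruction figure Chip gives further down; reading "slowing the clock down by ~30%" as a 30% drop in clock frequency is my assumption, not something Chip spelled out.

```python
BASE_CLOCK_HZ = 200_000_000      # 200 MHz, the figure quoted earlier in the thread
CLOCKS_PER_INSTRUCTION = 2       # per Chip: instructions take two clocks
SLOWDOWN_PERCENT = 30            # Chip's rough estimate (assumed to apply to frequency)


def slowed_clock(clock_hz, percent=SLOWDOWN_PERCENT):
    """Clock frequency after slowing it by the given percentage (integer math)."""
    return clock_hz * (100 - percent) // 100


def mips(clock_hz, clocks_per_instruction=CLOCKS_PER_INSTRUCTION):
    """Millions of instructions per second for a single cog."""
    return clock_hz / clocks_per_instruction / 1_000_000


fast_mips = mips(BASE_CLOCK_HZ)
slow_mips = mips(slowed_clock(BASE_CLOCK_HZ))
print(f"per-cog throughput: {fast_mips:.0f} MIPS vs {slow_mips:.0f} MIPS with same-cycle INDA/INDB")
```

Under those assumptions every cog pays the penalty on every instruction, whether or not it ever touches INDA/INDB.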
In summary, if we don't pursue hub exec and INDA/INDB, we're much simpler, which means smaller and faster. There is little cost and no extra pipelining required to implement PTRA/PTRB, though, so that is viable. Same with a hardware LIFO stack for CALL/RET. And 128-bit hub transfers are no problem, either. Same with keeping the four tasks - those make a lot possible and are almost free.
Ouch, not having hubexec means the speed and selling point of 2-512KB at high speed is gone.
It would break symmetry, but is there a chance of giving one core additional slot(s) on the round robin?
It would be nice to still advertise 2-512KB at X MIPS, even if it were hardware defined, as in Core 1 or 16.
The program would need to use some of the COG registers as address registers, which means a load/store paradigm for hub exec programs.
If we had a load effective address kind of instruction, we could simply load a cog register with the address we want to work with.
LEA D, S/#S
With S, the contents of the hub address go into D; with #S, the absolute address goes into D.
Now D contains the address desired, and most things work like we think they would, only the instructions come from the hub, not the cog. In effect, it's a CPU with a whole pile of registers! More than anybody would need really.
This slows down some hub exec operations, but we still get big programs really easily, I think it's still faster than LMM, and it's not complicated.
If we added access to the hub exec PC, or made that PC some register in the COG, conditional jumps happen easily too.
SETHPC D/#D
With those two, we could do everything from a big program. We just use COG registers as pointers to things.
Why not just have a simpler version of hubexec? It doesn't need to resolve all the problems to make everything work from hub like you were trying to do.
Just make something that reads hub memory for instructions and executes them. It can have a register that is the instruction pointer, and regular instructions can manipulate that register to do jumps/calls/returns for hub exec mode.
I imagine it just being a hardware assisted version of LMM mode, instead of full on handling all the call/jumps whatever between cog/hubexec like you did for P2.
I think you could just fire it off with a single instruction that is "hubjump" or something like that. Perhaps have a cogrun/cognew variant/flag that makes it do hubexec at startup.
Options would also be
a) Only some COGs need HUBEXEC, i.e. it is unlikely a system would ever run HUBEXEC in all 16.
b) If fetches can be 128 bits, maybe some form of 'HW-LMM', where code is copied into and runs in the COG, but managed in HW rather than by the usual SW manager.
The cost of this is likely to be > 30%, but it may be a HW trade-off worth making?
Seconded.
Doing that will still be fast, and we only need a couple of instructions. Keep the COG fast. Just do the bare minimum to execute instructions from the HUB. We don't need the pointers, etc... COG registers work just fine for those.
Good points. We COULD have hub exec without INDA/INDB - the programmer would just need to have some routine he could call in the register space to do 'indirect' addressing via self-modifying code that lives there. That's not optimal, but it allows everything else to remain fast.
For me, Hubexec was only ever a "nice to have", and is not fundamental to the design or utility of the Propeller.
We can do much of what Hubexec was intended to do via software - all we ever really needed was multi-long loads.
Even without those, it just means that high-level languages will run a little slower - but still much faster on this new chip than they did on the P1.
Ross.
With PTRA/PTRB, we could probably get an LMM loop running as fast as hub exec. It would just have 4-instruction granularity. It's much easier to write code that's meant to execute directly from the hub, though. LMM is totally free.
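For anyone who hasn't written one, a software LMM kernel is just a tiny fetch/execute loop living in the cog. The sketch below models the classic one-long-per-fetch loop (rdlong, bump the pointer, execute), not the four-long variant Chip mentions. The function name `run_lmm`, the register names, and representing hub instructions as Python callables are all illustrative choices of mine.

```python
def run_lmm(hub_code, regs, max_steps=1000):
    """Model of a software LMM kernel: fetch a long from hub, bump pc, execute it."""
    steps = 0
    while regs["pc"] is not None and steps < max_steps:
        instr = hub_code[regs["pc"] // 4]   # rdlong instr, pc
        regs["pc"] += 4                     # add    pc, #4
        instr(regs)                         # run the fetched instruction in the cog
        steps += 1
    return steps


# "Instructions" stored in hub. A jump is an ordinary write to the pc register.
def init(regs):
    regs["n"] = 3

def accumulate(regs):
    regs["total"] += regs["n"]

def decrement(regs):
    regs["n"] -= 1

def loop_if_nonzero(regs):
    if regs["n"] != 0:
        regs["pc"] = 4                      # hub address of `accumulate`

def halt(regs):
    regs["pc"] = None                       # stop the model; real code would keep going


hub_code = [init, accumulate, decrement, loop_if_nonzero, halt]
regs = {"pc": 0, "n": 0, "total": 0}
steps = run_lmm(hub_code, regs)
print(regs["total"], steps)
```

Note that the program counter is just another cog register here, which is the same idea Roy describes for a simplified hub exec.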
Or you could simplify Hubexec, and provide a better way to do indirect addressing?
Is self-modifying code the only way P1+ can do 'indirect' addressing now?
By all means do what you can - I'm sure it will get used one way or another, most likely in ways we don't currently anticipate.
But as you say, LMM is already here, and is essentially "free" - so don't get yourself bogged down adding stuff that we just don't need.
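' Copy count bytes (a multiple of 4) from addr_a to addr_b
        LEA     addr_a, #$3C000     ' source hub address
        mov     count, #$40         ' bytes to copy ($40 = 16 longs)
        LEA     addr_b, #$2000      ' destination hub address
:loop   rdlong  temp, addr_a        ' read one long from the source
        add     addr_a, #4
        wrlong  temp, addr_b        ' write it to the destination
        add     addr_b, #4
        sub     count, #4 wz        ' Z is set when count reaches zero
if_nz   SETHPC  #:loop              ' jump back while bytes remain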
Could be a hybrid too, where the cog registers are just registers to the hub exec program. And if we have AUGS, then we don't need LEA... or SETHPC type instructions at all.
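To check the logic of the copy loop a few lines up, here is a line-for-line Python model of it. This is only a sketch: LEA and SETHPC are proposed instructions, so the model simply treats them as "load a cog register" and "branch". The function name `copy_longs`, the bytearray standing in for hub RAM, and the argument check are my additions.

```python
HUB_SIZE = 512 * 1024                       # the 512KB hub discussed in the thread


def copy_longs(hub, src, dst, nbytes):
    """Copy nbytes (a positive multiple of 4) from src to dst, one long at a time."""
    if nbytes <= 0 or nbytes % 4:
        raise ValueError("nbytes must be a positive multiple of 4")
    # Cog registers used as plain load/store pointers.
    regs = {"addr_a": src, "addr_b": dst, "count": nbytes, "temp": 0}   # LEA / mov
    while True:
        a, b = regs["addr_a"], regs["addr_b"]
        regs["temp"] = int.from_bytes(hub[a:a + 4], "little")           # rdlong temp, addr_a
        regs["addr_a"] += 4                                             # add addr_a, #4
        hub[b:b + 4] = regs["temp"].to_bytes(4, "little")               # wrlong temp, addr_b
        regs["addr_b"] += 4                                             # add addr_b, #4
        regs["count"] -= 4                                              # sub count, #4 wz
        if regs["count"] == 0:                                          # if_nz SETHPC #:loop
            break
    return regs


hub = bytearray(HUB_SIZE)
hub[0x3C000:0x3C040] = bytes(range(0x40))   # recognizable test pattern at the source
final_regs = copy_longs(hub, 0x3C000, 0x2000, 0x40)
print(hub[0x2000:0x2008].hex())
```

The model needs nothing beyond ordinary registers and a settable program counter, which is the point of the proposal.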
If you can do all of the hubexec stuff minus the inda/indb support and it'll be fast, then I say do that. We can work around the inda/indb limitations with hubexec mode in software.
Do you mean slower in HUB mode only, or do you mean all opcodes become slower?
...
So are those opcodes 3 clocks long, or 4 clocks?
To do INDA/INDB, the cogs, themselves, would either slow down or become very complicated.
Instructions are going to be two clocks, no matter what. INDA/INDB would increase the pipeline depth from two (which is so simple, there's not much to label as 'pipeline') to three (which has all kinds of uglier ramifications).
Hmmm - "ugly" is probably a good sign you are heading down the wrong path.
But I also agree with koehler - take some time out and come back to the problem afresh sometime later.
When the cog is addressing cog memory, a P1 needs to self-modify. But, when the cog is accessing hub memory, it doesn't.
It can branch and do relative jumps in hub memory, but to access a random cog register, self-modifying code must execute.
I will be going to bed soon. This was just a window of time that I had to push things a little further. The last two days have been taken up by social things, so I'm feeling antsy. I don't work on Sundays, but it's hard not to think about work, as there's a lot of interesting things happening.
Correct. I've had this problem in Catalina - when executing LMM code from Hub, you cannot use the usual method of indirect addressing - it just won't work even though the instructions look the same. But you can always do so by other means - it is just a bit slower.
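A small model may make the difference concrete. On the P1 the S field is bits 8..0 of the instruction and the D field is bits 17..9, and MOVS rewrites the S field of an instruction sitting in cog RAM. The sketch below models only those two fields and a bare MOV. The helper names, the addresses, and the `cog_scratch` parameter (standing in for wherever a cog-style patch ends up landing) are hypothetical.

```python
FIELD_MASK = 0x1FF          # S and D are 9-bit fields
D_SHIFT = 9                 # D occupies bits 17..9, S occupies bits 8..0


def encode_mov(d, s):
    """Build a word holding only the D and S fields of a MOV."""
    return ((d & FIELD_MASK) << D_SHIFT) | (s & FIELD_MASK)


def movs(word, new_s):
    """Return word with its S field replaced, as MOVS does to a cog instruction."""
    return (word & ~FIELD_MASK) | (new_s & FIELD_MASK)


def execute_mov(cog, word):
    """cog[D] := cog[S]"""
    cog[(word >> D_SHIFT) & FIELD_MASK] = cog[word & FIELD_MASK]


def indirect_read_in_cog(cog, instr_addr, index_value):
    cog[instr_addr] = movs(cog[instr_addr], index_value)     # self-modify in cog RAM
    execute_mov(cog, cog[instr_addr])                        # patched instruction runs


def indirect_read_from_hub(cog, hub, hub_addr, cog_scratch, index_value):
    cog[cog_scratch] = movs(cog[cog_scratch], index_value)   # patch lands in cog RAM only
    execute_mov(cog, hub[hub_addr])                          # hub copy keeps its original S


TABLE, DEST, INSTR = 100, 50, 10
cog = [0] * 512
cog[TABLE:TABLE + 4] = [11, 22, 33, 44]

cog[INSTR] = encode_mov(DEST, 0)                 # S is a placeholder to be patched
indirect_read_in_cog(cog, INSTR, TABLE + 2)
cog_result = cog[DEST]                           # picks up table entry 2

cog[DEST] = 0xFFFFFFFF                           # sentinel so the next write is visible
hub = {0x1000: encode_mov(DEST, 0)}
indirect_read_from_hub(cog, hub, 0x1000, cog_scratch=20, index_value=TABLE + 2)
hub_result = cog[DEST]                           # reads register 0, not the table
print(cog_result, hub_result)
```

The instructions look the same in both cases, but only the cog-resident one ever sees the patch.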
Chip,
What about for cog land having a MOVR instruction that would use the value in the D register as the index of another register to be the actual destination for where the copy of S goes? So it's just a simple indirection. Or is that extra read to be used for final write the problem? There would be no auto incrementing or anything.
It is good that you took some days off from work.
When I worked on some control systems, I didn't take days off like that, and I ended up running around in circles.
I then needed some weeks to reset my mind before I could start working again.
On HubExec: even the simplest possible version would give a big advantage for the new IC. I can't say how you should build it, as I don't know anything about how things are made in Verilog.
It's easy to redirect writes. It's just those initial reads that are complicated.
Comments
+1
Is hub exec worth slowing the cog down for?
Doesn't sound good. How about a double-speed LMM then? I.e., instruction pairing.
If it's an either/or choice then hub exec has to stay.
No!
The main goal should be to build reliable, fast and deterministic PASM COGs.
Yes. Add anything to make life easier for Spin2, C and Forth, but do not go too far.
It should be a microcontroller not a microprocessor.
my 2 cents
Mike
By all means do what you can - I'm sure it will get used one way or another, most likely in ways we don't currently anticipate.
But as you say, LMM is already here, and is essentially "free" - so don't get yourself bogged down adding stuff that we just don't need.
Ross.
Having the code run from the hub, using cog registers as more traditional load / store registers would be easy.
Do not sacrifice the simple fast cogs for it.
Yep. To go beyond this we must take a huge bite, either in speed or complexity.
Hubexec is, arguably, of course, of greater import than even 16 cores.
It's 0130, Chip.
This is a setback, but there are probably a number of stopgaps that can be entertained.
Get some sleep, spend Sunday with the family having fun.
Monday will come soon enough.
Hmmm - "ugly" is probably a good sign you are heading down the wrong path.
But I also agree with koehler - take some time out and come back to the problem afresh sometime later.
Ross.
Correct. I've had this problem in Catalina - when executing LMM code from Hub, you cannot use the usual method of indirect addressing - it just won't work even though the instructions look the same. But you can always do so by other means - it is just a bit slower.
Ross.
What you said made me realize that we could do something like AUGS/AUGD, but instead of augmenting the next S/D constant, we could alter the S/D field in the next instruction. This is the way to achieve indirection for S and D! This is REALLY simple.
Along with augmenting D and S constants, we could alter D and S registers:
ALTD D/#
ALTS S/#
ALTDS D/#,S/#
This:
ALTS ptr
MOV OUTA,0
Could also be coded as:
MOV OUTA,[ptr]
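Here is a minimal sketch of how that could behave, assuming ALTS simply substitutes the low 9 bits of the named register's value into the S field of the next instruction and affects that one instruction only. That reading is mine; the exact semantics aren't pinned down in the post. The tuple encoding and the `run` helper are illustrative, and OUTA is placed at $1F4 as on the P1.

```python
FIELD_MASK = 0x1FF
OUTA = 0x1F4                # P1's OUTA address, used here purely for illustration


def run(cog, program):
    """Execute a list of ("ALTS", reg) and ("MOV", d, s) tuples against cog RAM."""
    pending_s = None                         # S override armed by a preceding ALTS
    for instr in program:
        op = instr[0]
        if op == "ALTS":
            pending_s = cog[instr[1]] & FIELD_MASK
        elif op == "MOV":
            _, d, s = instr
            if pending_s is not None:
                s = pending_s                # substitute the S field for this instruction
                pending_s = None             # the override applies once
            cog[d] = cog[s]
        else:
            raise ValueError(f"unknown op {op!r}")
    return cog


PTR = 20
cog = [0] * 512
cog[0] = 0x11               # what a plain MOV OUTA,0 would read
cog[101] = 0x55             # the register ptr points at
cog[PTR] = 101

run(cog, [("ALTS", PTR), ("MOV", OUTA, 0)])  # behaves like MOV OUTA,[ptr]
indirect_value = cog[OUTA]

run(cog, [("MOV", OUTA, 0)])                 # no ALTS: S field is used as written
direct_value = cog[OUTA]
print(hex(indirect_value), hex(direct_value))
```

Because the substitution is applied to the fetched instruction rather than to memory, it would work the same whether that instruction came from cog RAM or from the hub.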