It would be of interest to know more details about the remaining work and projected time frames, for example regarding the following:
"Once I'm [able to download from PNut], I'm going to get hubexec working. After that, the only [thing] I'll need to add to the core is the hub-streaming I/O stuff. Then, there should just be some cog refinement work, the hub CORDIC system, and the pin smarts." <--Chip, from link above
A bit more frequent updating of the status of the chip wouldn't be rejected (but maybe that's just me, lol).
While we all appreciate updates, every time Chip does that we all fall into the trap of asking more questions, and that takes Chip away from the main task at hand.
So, let's just be patient and wait for Chip.
I've talked to Chip about three times in the last week and I didn't ask any questions about P2. He volunteered that he's making progress, though. I'm sitting on a number of e-mail inquiries from customers who've got designs in the queue but I'm no longer making any claims to schedule.
Thanks, Ken. I guess that's wise for now until things solidify further. It does seem like things are starting to gel, though. Hence the "Are we there yet?"
Erna,
Yes we are but: A frog crosses a pond by jumping half the remaining distance at each hop. How long does it take the frog to cross the pond?
I suggest the P2 is shipped 99% or so of the way to completion:)
Don't think so. Chip stated that, with the new crosspoint switching multi-hub structure, he has started over completely from fresh with an eye toward power efficiency. It's a big haul to redo everything. He didn't do that for the previous Prop2 variants. So, we ain't getting the big Chipmas this year.
Ken puts a round number of about five months from first fab submittal to full product sales, assuming nothing fails. For a Christmas launch, there's just too much still to finish up and do, including FPGA testing and evaluation time, before submitting to the fab.
I posted concerns about the eggbeater architecture when it was first presented, and I still have concerns about it. I suspect that some of the timing complexities it introduces are part of the reason the design is taking so long. The eggbeater scheme does provide high-speed memory access without using wide buses, so I hope it works out OK. It should make the P2 a "unique" chip in the microcontroller market.
On the one extreme we could conceive that every COG gets access to HUB as and when it likes (let's ignore contention for a moment).
On the other extreme every COG shares the access time equally among all the others. As in the P1.
Each of these extremes yields a fixed upper HUB bandwidth limit, or, more interestingly, a fixed HUB execution rate.
The "egg beater" thing lies somewhere in between those extremes.
Has anyone calculated what the average HUB bandwidth or HUB execution rate might be, assuming access to random locations in HUB for data and code? (Yes, I know code tends to be more linear, but never mind.)
Thinking about the eggbeater and its resulting average HUB bandwidth gives me a headache. It is totally probabilistic except in certain carefully crafted use cases.
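For the random-access case, here is a minimal Python sketch under an assumed model. This is not a published design: it assumes 16 hub RAM slices, with the slice chosen by the long address modulo 16, and assumes that each cog's window advances one slice per system clock. `slice_of`, `wait_clocks`, `mean_random_wait` and `sequential_read_clocks` are hypothetical names. Under those assumptions a random address waits (N-1)/2 = 7.5 clocks on average, which is consistent with the "about 8 clocks" figures that come up in this thread. A cog reading consecutive longs can, in principle, pick up one long per clock once it is lined up.

```python
from fractions import Fraction

N_SLICES = 16  # assumed: 16 hub RAM slices, one per cog


def slice_of(byte_addr: int) -> int:
    """Assumed mapping: consecutive longs live in consecutive slices."""
    return (byte_addr >> 2) % N_SLICES


def wait_clocks(cog: int, clock: int, byte_addr: int) -> int:
    """Clocks until `cog` is connected to the slice holding `byte_addr`.

    Assumed rotation: at system clock t, cog c sees slice (c + t) mod N.
    """
    return (slice_of(byte_addr) - cog - clock) % N_SLICES


def mean_random_wait(cog: int = 0) -> Fraction:
    """Exact mean wait over every start clock and every slice (uniform random access)."""
    total = 0
    count = 0
    for clock in range(N_SLICES):
        for slot in range(N_SLICES):
            total += wait_clocks(cog, clock, slot * 4)
            count += 1
    return Fraction(total, count)


def sequential_read_clocks(cog: int, clock: int, start: int, n_longs: int) -> int:
    """Clocks to read n consecutive longs starting at byte address `start`."""
    t = clock
    for i in range(n_longs):
        t += wait_clocks(cog, t, start + 4 * i)  # wait for the right slice
        t += 1  # the access itself occupies one clock
    return t - clock


if __name__ == "__main__":
    print("mean random wait:", float(mean_random_wait()))  # 7.5
    print("16 sequential longs:", sequential_read_clocks(0, 0, 0, 16))  # 16
```

With an unlucky start (for example clock 5), the first long costs an 11-clock wait and the remaining 15 then stream at one per clock. That is the "carefully crafted use case" side of the probabilistic picture.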
Chip has not posted much about it; however, based on his past postings, I expect something like:
1) if hubexec uses the FIFO
- jumps/calls/ret on average will take 8 system clocks / 4 instruction cycles (as it needs to re-initialize the FIFO)
- consecutive instructions will execute every instruction cycle
- RDxxx/WRxxx do not affect the FIFO, and will on average take 8 clocks / 4 instruction cycles
2) if hubexec does not use the FIFO
- jumps/calls/ret on average will take 8 system clocks / 4 instruction cycles
- subsequent instructions will take an unknown number of clocks; it depends on the design: if a small (4-long) FIFO is provided, non-hub instructions should execute at full rate
- RDxxx/WRxxx will on average take 8 clocks / 4 instruction cycles
I think the easiest path is (1), and I suspect that is what Chip will end up doing.
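Scenario (1) can be turned into a back-of-the-envelope cost model. This is a sketch only: it takes the guessed figures above at face value (1 instruction cycle = 2 system clocks, about 8 clocks per branch and per RDxxx/WRxxx). The instruction mix you feed it is your own assumption. `Mix` and `avg_clocks_per_instr` are hypothetical names.

```python
from dataclasses import dataclass

CLOCKS_PER_INSTR_CYCLE = 2  # from the estimate: 8 system clocks = 4 instruction cycles


@dataclass
class Mix:
    """Fractions of executed instructions in each class (must sum to 1)."""
    sequential: float  # ordinary instructions streaming from the FIFO
    branch: float      # jumps/calls/ret that re-initialize the FIFO
    hub_rw: float      # RDxxx/WRxxx data accesses


def avg_clocks_per_instr(mix: Mix, branch_clocks: float = 8.0,
                         hub_rw_clocks: float = 8.0) -> float:
    """Average system clocks per instruction for hubexec through the FIFO."""
    total = mix.sequential + mix.branch + mix.hub_rw
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"fractions sum to {total}, expected 1")
    return (mix.sequential * CLOCKS_PER_INSTR_CYCLE
            + mix.branch * branch_clocks
            + mix.hub_rw * hub_rw_clocks)


if __name__ == "__main__":
    # Hypothetical mix: 70% straight-line code, 15% branches, 15% hub reads/writes.
    cpi = avg_clocks_per_instr(Mix(sequential=0.7, branch=0.15, hub_rw=0.15))
    print(f"about {cpi:.1f} clocks per instruction")
```

With that made-up mix, the model lands at roughly 3.8 clocks per instruction, versus 2 clocks for pure straight-line code. Branch and hub-access density dominate, which is why the FIFO refill cost matters so much.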
Really important to get a current (power) estimate on the "egg beater", or we'll end up at square one. The efficiency of being able to do so much at once could have a nasty flipside.
The main power usage on the failed P2 was in the cogs' ALUs. Chip is very much aware of that, and he has divided the ALU paths so that not all are active for every instruction. He is truly aware of power and I think this will be a really mean chip. It's not going to be as powerful as the failed P2, but it may make up for some of that by being able to clock faster (shorter and simpler pathways). It will be simpler to describe and easier to understand, perhaps with the exception of the hub. And yes, the hub on the failed P2 also used power because the whole hub RAM was active for every hub access. By dividing it into smaller blocks and activating only the required block(s), the power will be reduced. Of course, the hub is going to be harder to explain.
I was stuck in the mud, so to speak, for about a week because I had to come up with some new rules for address handling in the assembler. Because the new architecture does not require alignment for hub longs and words, there is now a significant difference between hub and cog memory. After much thinking (in mind-numbing circles, mostly), I came to the final conclusion that ALL memory addresses (both hub and cog) need to be byte-wise to make the various addressing schemes agree. This means that each cog instruction takes 4 'bytes', which must be long-aligned in cog memory. This makes the assembler's guts work harmoniously together, but what was register $1F8 must now be called out as $7E0. This only presents an oddity when referencing a hard address, as symbols are properly stored when encountered during assembly. I finished all these changes just before dinner and I was able to test them and they worked. Whew! I'll be doing some reality checks next to be sure of where I'm at.
The next thing to implement is the hardware stack and then hub exec. I've had PNut downloading for about a week, but with the final addressing scheme in place, I'll be able to write programs with a lot less effort now.
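To illustrate why byte-wise addressing keeps the assembler simple, here is a minimal first-pass sketch. Every cog entry is assumed to occupy exactly one long, so the location counter advances by 4 and labels land on multiples of 4. A hard-coded P1-style register number has to be multiplied by 4, which is how $1F8 becomes $7E0. `assign_cog_labels` and `hard_register_address` are hypothetical names, not PNut internals.

```python
LONG_BYTES = 4


def assign_cog_labels(lines, origin=0):
    """First assembler pass (sketch): give each label a byte address in cog space.

    `lines` is a list of (label_or_None, source_text); every entry is assumed to
    occupy exactly one long, so the location counter advances by 4 bytes.
    """
    if origin % LONG_BYTES:
        raise ValueError("cog origin must be long-aligned")
    symbols = {}
    loc = origin
    for label, _text in lines:
        if label is not None:
            if label in symbols:
                raise ValueError(f"duplicate label {label!r}")
            symbols[label] = loc
        loc += LONG_BYTES
    return symbols


def hard_register_address(old_register: int) -> int:
    """A hard-coded P1-style register number, restated as a byte address."""
    return old_register * LONG_BYTES


if __name__ == "__main__":
    program = [("entry", "mov x, #1"), (None, "add x, #1"),
               ("loop", "jmp #loop"), ("x", "long 0")]
    print({k: hex(v) for k, v in assign_cog_labels(program).items()})
    print(hex(hard_register_address(0x1F8)))  # 0x7e0
```

Symbolic references never notice the change, because labels pick up the scaled address automatically. Only literal register numbers typed by hand need restating.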
So 4 bytes just before dinner meant that you weren't that hungry anymore and got straight back into it again?
So does that mean cog memory can now be byte addressed easily as in the case of using it for strings and buffers?
... This makes the assembler's guts work harmoniously together, but what was register $1F8 must now be called out as $7E0.
I guess this will be invisible to the bit-fields of register opcodes, as they will still only need 9-bit fields, as they word-align.
To keep the docs consistent, will those 9 bits now be entered in ASM as $7E0, with an error generated by the ASM if the user types $7E1, $7E2, or $7E3 by mistake?
So 4 bytes just before dinner meant that you weren't that hungry anymore and got straight back into it again?
So does that mean cog memory can now be byte addressed easily as in the case of using it for strings and buffers?
No. Cog memory is accessible only as longs. For the program counter, which can now execute longs from any offset in hub memory, some standard reckoning was needed that unified cog and hub addressing; hence, all longs span 4 'byte' addresses - even in the cog space. The PC always steps by 4 in the hardware.
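A small model of the unified PC helps show the difference between the two spaces. Hub fetches may start at any byte offset, while cog fetches must be long-aligned, and the PC steps by 4 in both. This sketch assumes little-endian byte order, as on the P1. `fetch_hub_long`, `fetch_cog_long` and `next_pc` are hypothetical helpers, not hardware definitions.

```python
def fetch_hub_long(hub: bytes, pc: int) -> int:
    """Read a long at any byte offset (assumed little-endian; no alignment in hub)."""
    if pc < 0 or pc + 4 > len(hub):
        raise IndexError(f"hub address ${pc:X} out of range")
    return int.from_bytes(hub[pc:pc + 4], "little")


def fetch_cog_long(cog: list, pc: int) -> int:
    """Read a cog register; cog byte addresses must be long-aligned."""
    if pc % 4:
        raise ValueError(f"cog address ${pc:X} is not long-aligned")
    return cog[pc >> 2]  # register index = byte address / 4


def next_pc(pc: int) -> int:
    """The PC always steps by 4, whether executing from cog or hub."""
    return pc + 4


if __name__ == "__main__":
    hub = bytes(range(12))
    pc = 1  # odd hub offset: allowed
    for _ in range(2):
        print(hex(fetch_hub_long(hub, pc)))
        pc = next_pc(pc)
```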
I guess this will be invisible to the bit-fields of register opcodes, as they will still only need 9-bit fields, as they word-align.
To keep the docs consistent, will those 9 bits now be entered in ASM as $7E0, with an error generated by the ASM if the user types $7E1, $7E2, or $7E3 by mistake?
All register addresses are checked to be sure the two LSBs are %00, indicating a valid register offset. Otherwise, an assembly error occurs.
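The check described here can be sketched as a single conversion step. It assumes the D/S fields stay 9 bits wide, as jmg guessed, so a byte-wise register address must have %00 in its two LSBs and lie below 512 × 4 = $800. `register_field` is a hypothetical name.

```python
COG_BYTES = 512 * 4  # assumed: 512 registers x 4 'bytes' = $800


def register_field(addr: int) -> int:
    """Convert a byte-wise cog register address to a 9-bit D/S field value.

    The two LSBs must be %00; otherwise this raises, mirroring the assembly error.
    """
    if addr & 0b11:
        raise ValueError(f"${addr:X} is not a valid register address (LSBs must be %00)")
    if not 0 <= addr < COG_BYTES:
        raise ValueError(f"${addr:X} is outside cog memory")
    return addr >> 2  # 0..$1FF, fits the 9-bit field


if __name__ == "__main__":
    print(hex(register_field(0x7E0)))  # 0x1f8
    for bad in (0x7E1, 0x7E2, 0x7E3):
        try:
            register_field(bad)
        except ValueError as err:
            print("error:", err)
```

Under this reading the instruction encoding itself does not change: the low two bits are implied, and only the assembler's view of the address is scaled.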
This is great news for PropGCC. We struggled with byte vs. long addressing when implementing the assembler and linker. This will make the Propeller more like other architectures and will probably allow the GNU assembler, gas, to match PASM more closely. Thanks!
The assembler used with PropGCC uses byte addresses for cog memory. It simplifies things when creating object files and linking, but it does make it confusing when moving cog addresses into registers. There's a special opcode that automatically converts the byte address to a long address. Maybe a special character can be prefixed to a symbol to generate a long address instead of a byte address, such as `label.
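The prefix idea can be sketched as a symbol-resolution rule. A plain symbol yields its cog byte address, and a backtick-prefixed symbol yields the long (register) index. The backtick syntax is only the suggestion made here, not an existing gas or PNut feature. `resolve_operand` is a hypothetical helper.

```python
def resolve_operand(token: str, symbols: dict) -> int:
    """Resolve a symbol operand under the hypothetical backtick syntax.

    'label'  -> the symbol's cog byte address (what the PropGCC toolchain stores)
    '`label' -> the same address as a long (register) index, i.e. byte address >> 2
    """
    as_long = token.startswith("`")
    name = token[1:] if as_long else token
    addr = symbols[name]
    if not as_long:
        return addr
    if addr % 4:
        raise ValueError(f"{name} at ${addr:X} is not long-aligned")
    return addr >> 2


if __name__ == "__main__":
    syms = {"buf": 0x40}
    print(hex(resolve_operand("buf", syms)), hex(resolve_operand("`buf", syms)))
```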
I'm not sure if that will be necessary. If all addresses are known to be byte addresses, the individual linker fixups that store values in S and D can automatically lop off the two low bits and produce an error if they are not zero.
The complication happens when you do something like mov temp, #100 versus mov temp, #label. If the compiler automatically shifts the value of label right by 2 bits, what does it do with the constant 100? How would the compiler know whether 100 is actually a cog byte address that I also want shifted down by 2 bits? The GAS assembler has a special pseudo-op called mova that shifts values down by 2 bits. However, there are no equivalent pseudo-ops for add, sub, and, or, etc. These instructions require the programmer to explicitly shift the address down by 2 bits, and the linker doesn't handle this correctly, so it becomes very messy.
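A toy model makes the ambiguity concrete. This is not the real GNU linker: `Immediate`, `parse_immediate` and `link_immediate` are hypothetical, and only decimal literals and bare symbols are parsed. Only `mova` attaches a "shift right by 2" to its operand. A literal like 100 stays as-is, and `mov`/`add` with `#label` get the raw byte address, which overflows the 9-bit field for labels near the top of cog memory.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Immediate:
    value: int                    # literal value, or addend when a symbol is present
    symbol: Optional[str] = None  # symbol to resolve at link time
    shift: int = 0                # right-shift applied at link time (2 = byte -> long)


def parse_immediate(mnemonic: str, operand: str) -> Immediate:
    """Only 'mova' implies >>2 on its operand; decimal literals and bare symbols only."""
    text = operand.lstrip("#")
    shift = 2 if mnemonic == "mova" else 0
    if text.isdigit():
        return Immediate(int(text), shift=shift)
    return Immediate(0, symbol=text, shift=shift)


def link_immediate(imm: Immediate, symbols: dict) -> int:
    """Resolve a 9-bit immediate at link time (toy model)."""
    value = imm.value
    if imm.symbol is not None:
        value += symbols[imm.symbol]
    if imm.shift:
        if value & ((1 << imm.shift) - 1):
            raise ValueError(f"${value:X} loses low bits when shifted")
        value >>= imm.shift
    if not 0 <= value < 512:
        raise ValueError(f"immediate ${value:X} does not fit in 9 bits")
    return value


if __name__ == "__main__":
    syms = {"label": 0x40}
    for mnem, op in [("mov", "#100"), ("mov", "#label"), ("mova", "#label"), ("add", "#label")]:
        print(mnem, op, "->", link_immediate(parse_immediate(mnem, op), syms))
```

Here `add temp, #label` silently gets 64 (a byte address) where a register index of 16 was probably wanted. That is the messy case described above.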
I assume that the problem you're trying to solve is later using that address in movs or movd? That is a tricky problem. Maybe P2 should add indirect addressing of COG memory and remove self-modifying code? (ducking and running away....)
movs and movd is another thing that gets complicated with cog byte addresses. Maybe there's really no issue. I'll just wait to see what Chip comes up with rather than speculating about possible problems.
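The movs/movd problem can be shown with P1 semantics. On the P1, MOVS replaces bits 8..0 (the S field) of the destination long, and MOVD replaces bits 17..9 (the D field). The P2 encoding is not given in this thread, so the P1 layout is an assumption here. If a label's byte address is patched in directly, the 9-bit truncation quietly selects the wrong register.

```python
S_MASK = 0x1FF  # bits 8..0: S field (P1 layout, assumed here)
D_SHIFT = 9     # bits 17..9: D field (P1 layout, assumed here)


def movs(instr: int, value: int) -> int:
    """P1-style MOVS: replace the 9-bit S field of a 32-bit instruction long."""
    return (instr & (0xFFFFFFFF ^ S_MASK)) | (value & S_MASK)


def movd(instr: int, value: int) -> int:
    """P1-style MOVD: replace the 9-bit D field of a 32-bit instruction long."""
    mask = S_MASK << D_SHIFT
    return (instr & (0xFFFFFFFF ^ mask)) | ((value & S_MASK) << D_SHIFT)


if __name__ == "__main__":
    byte_addr = 0x7E0  # register $1F8 under byte-wise addressing
    print(hex(movs(0, byte_addr)))       # 0x1e0: wrong register, no error
    print(hex(movs(0, byte_addr >> 2)))  # 0x1f8: what was intended
```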
and... you follow the trail of links down youtube and inevitably reach "that part" of youtube.
Ah, I see what you mean: top-less chips:
https://www.youtube.com/watch?v=5KcNwer2q-g <-- mechanical, brute-force method, short-and-sweet
https://www.youtube.com/watch?v=0Z4aF-qiziM <-- chemical method, 1:41:26, adult-language/references
Seriously, though, the updates on the status of the chip are really appreciated!
I've talked to Chip about three times in the last week and I didn't ask any questions about P2. He volunteered that he's making progress, though. I'm sitting on a number of e-mail inquiries from customers who've got designs in the queue but I'm no longer making any claims to schedule.
Ken Gracey
To put that into perspective, I was reading Ken's earlier estimated timeline, which was before the restart, and, well, not surprisingly, the milestones are not happening accordingly - http://www.parallax.com/news/2014-02-24/propeller-2-schedule-update-february-2014-read-schedule-completion-propeller-2
I hope all this going forwards and restarting and going forwards... is not some kind of chaotic sequence that never terminates!
I do however appreciate the new leaner, meaner, streamlined, efficient approach. It's not easy to find the simple way to do things.
I think that was a brilliant change, and I love the bandwidth it provides to every cog.
I guess this will be invisible to the bit-fields of register opcodes, as they will still only need 9-bit fields, as they word-align.
To keep the docs consistent, will those 9 bits now be entered in ASM as $7E0, with an error generated by the ASM if the user types $7E1, $7E2, or $7E3 by mistake?
All register addresses are checked to be sure the two LSBs are %00, indicating a valid register offset. Otherwise, an assembly error occurs.
I assume this means that regular D/S instructions (with the 9-bit fields) have implied %00 LSBs?