Something he mentioned, which isn't clear to everyone, is that each counter is dual channel. He said each cog has 4 counter channels, effectively doubling the number. What limitations there are is unknown.
Sounding better all the time !!
My take on Dual Channel / 4 counters, is that it is one way to keep the Register mapping, and not break any existing code.
So you have the same CTRA/CTRB as now, but with a parallel set, aka dual channels, which you re-map with a single flag, and possibly a double consecutive read would deliver one of each pair allocated to that register.
Chip said the edge capture was low, high, or full period, and since the result is buffered, you can change from one edge to the other while it's accumulating and get values for both edges.
That allows one counter, but does have an upper frequency limit.
With 4 counters, you may choose to have one capture @rise edge, and the other capture @ fall edge.
Now, the upper limit is purely HW, and with the suggested Atomic Capture control, on both captures, you can reliably read time intervals / phase skews, all the way down to a single clock cycle.
ie you have full hardware dynamic range and your SW is simpler. The best of both worlds.
No, that got broken long before dual-channel counters showed up. Access to SFRs in the Prop II is way different than that of the Prop I.
-Phil
Oh ok, - then I'll modify this slightly
"My take on Dual Channel / 4 counters, is that is one way to keep the present Dual CTRA / CTRB Register mapping, and not add Memory-map, but still allow access to 4 channels, through two names. "
Neither is microcode; hence the aptness of my comparison. The Propeller architecture is not fancy -- by design. It is, however, elegant in its simplicity and flexibility and in its closeness to the hardware metal.
Absolutely. I wasn't shooting you down. Just clarifying your view is all.
If one compares the Cog's instruction set with reloadable microcode then the view of the Cog processor being a virtualiser unit is also apt. At this point what is instructions and what is data is blurry, since, for example, all LMM instructions are data fetches from Hub. Which allows such things as extending the LMM model to have instructions that perform other useful functions, like various LMM addressing modes, beyond native Cog instructions.
Most microcode on the computers I worked with was executed by the hardware ( the code bits controlled gates, registers, and counters to execute the instructions ).
Yep, that's the fully decoded part.
In capture mode the result of the capture is 0 if a value is accumulating, -1 if overflow, and positive if an edge was detected. This is according to Chip. Obviously this is all semi-reliable at this point, since he can change things at a whim. The unit of measurement is the clock period, so 6.25 ns at 160 MHz.
I have no clue about the synthesis hurdles of speed at this point, so I'm just relaying what he said.
Interesting, I hope he is not trying to be 'too clever' in that fancy result handling, or at least makes any cleverness optional.
We just looked at a chip that has Auto-clear on capture (no feature disable), which at first glance sounds ok, but it has a real fish-hook in fast capture situations, and can result in hidden losses.
I do not mind doing the difference maths at all, I'd prefer hardware that I KNOW is simple and solid.
Jmg's right. With an auto-clear the hardware needs additional buffering on the capture signal to prevent edge-case losses. Not worth the extra resources when a non-clearing counter does the job fine.
The capture register read triggers a transfer/clear, so when you read the register it clears to 0 until the next event is captured.
Clear of what ?
If it clears the main counter, then you now have a Software-read-time determined Zero, not a hard-time value. Not ideal.
If it clears the capture register, then I guess that saves a flag, but does expose you to the risk of a valid capture of 0000H, being flagged as 'no capture yet '.
0 is not valid: since the result is measured in clock counts, there must be at least one clock to capture. I'm uncertain which register is cleared, however the possible values are 0 = in progress, -1 = timeout, any other = time value. The result register is cleared upon read, so a new result is immediately available on the next clock. You could not capture high-resolution signals without clear on read. Further, you could not get both edges at high speed without auto-clear.
Eg:
Set rising edge
Read value
Set falling edge
Read value
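As a sanity check on the result convention described above, here is a minimal Python sketch of a reader for that scheme. The function name, the 160 MHz clock, and the exact encodings are assumptions pulled from the posts, not the actual Prop II interface.

```python
# Hypothetical sketch of the capture-result convention described above
# (0 = still accumulating, -1 = overflow/timeout, positive = tick count).
# All names and semantics here are assumptions, not the real Prop II API.

CLK_HZ = 160_000_000  # assumed 160 MHz system clock


def decode_capture(raw):
    """Classify a 32-bit capture register value per the described convention."""
    raw &= 0xFFFFFFFF
    if raw == 0:
        return ("in_progress", None)
    if raw == 0xFFFFFFFF:               # -1 viewed as an unsigned 32-bit value
        return ("overflow", None)
    return ("captured", raw / CLK_HZ)   # ticks -> seconds


# A 400-tick half period reads back as 2.5 microseconds:
state, seconds = decode_capture(400)
```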
Anyone care to take a guess at what the cost of the Prop II might be? I can't imagine something so much more powerful will still be $8. But I haven't been in this part of the world long enough to know how this stuff trends. Do you think we might be looking in the $15 range?
0 is not valid, since it is measured in clock counts, there must be at least one clock to capture. I'm uncertain which register is cleared, however the possible values are 0 = in progress, -1 = timeout, any other = time value. The result register is cleared upon read, so a new result is immediately available next clock.
If zero is not valid, then that suggests some maths on capture, because the main counter certainly can be zero ?
Timeout means what ? that the time-ticks since last capture exceeded 2^32 ? That may not be a drop-dead situation, and a real value could be more use than a -1.
Some designs will want to do relative capture (eg if you want phase between a Rise Capture on one counter, and a Fall capture on another, then you need to be reading real-time, not some last-read-zeroed time).
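To put a number on the timeout question: if -1 really means a 32-bit tick count wrapped, then at an assumed 160 MHz clock the wrap period is long enough that a real value would rarely be ambiguous. A quick back-of-the-envelope check:

```python
# How long until a free-running 32-bit tick counter overflows?
# 160 MHz is an assumed system clock, per earlier posts in the thread.

CLK_HZ = 160_000_000

ticks_to_wrap = 2 ** 32
wrap_seconds = ticks_to_wrap / CLK_HZ   # ~26.84 s between forced overflows

print(f"{wrap_seconds:.2f} s")
```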
You could not capture high resolution signals without clear on read. Further, you could not get both edges at high speed without auto clear.
Eg:
Set rising edge
Read value
Set falling edge
Read value
That is 2 clocks per edge maximum resolution.
That is highly software dependent, and assumes your COG is ready at the right time. It also cannot tell if it overflowed, and so gave a false reading.
Better is to use two counters, one @ Rise, one @Fall, now your capture values are what you hope they are, and are down to ONE clock, and also do not rely on the COG being ready at exactly the right time.
With 4 counters now present, this more robust approach is likely to be more common.
Conclusion: I hope these features you mention, are optional.
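For what it's worth, the difference maths for the two-counter approach is trivial as long as both captures come from one free-running counter: an unsigned modulo-2^32 subtraction makes the counter's zero point irrelevant. A sketch, with illustrative names only:

```python
# Sketch of the two-counter scheme above: one free-running 32-bit tick
# counter, a rise capture on one channel and a fall capture on another.
# The difference maths is modulo-2^32 subtraction, so the zero point of
# the counter never matters.

MASK = 0xFFFFFFFF


def ticks_between(first, second):
    """Interval from `first` to `second` capture, tolerating wraparound."""
    return (second - first) & MASK


# Rise captured near the top of the count, fall captured after the wrap:
rise = 0xFFFFFFF0
fall = 0x00000010
high_time = ticks_between(rise, fall)   # 32 clocks of high time
```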
Boy howdy, you nickel-and-dime everything! I specifically said that the exact implementation is unknown at this time. Furthermore, you don't seem to be happy with how Chip has implemented any feature. The last issue you harped on was "gotta have SPI in HW", now you are going on about how things *can't* be done with the counter.
Why don't you wait until it's released, program it, then say how it *can't* be done, then the next day kurenoko will release some code that does it!
The devil is always in the details, and I do this stuff almost every day in CPLDs / FPGAs, so I am very used to focusing on the details.
It is not complex, but it IS important to get the details right, and I DID say hardware edge capture was very good news, as also is 4 counter channels.
So too is SPI in hardware, given it is not hard to actually implement. The present Prop 2 spec suggests a form of that is there.
Anyone care to take a guess at what the cost of the Prop II might be? I can't imagine something so much more powerful will still be $8. But I haven't been in this part of the world long enough to know how this stuff trends. Do you think we might be looking in the $15 range?
With a lot more pins, and a larger die (even in the shrink process), it will never be close to Prop 1.
I recall predictions between 1.5x and 2x a Prop 1, and the testing time will also be longer.. Yields ??
Atmel's AVR claims this : 131 powerful instructions – most single clock cycle execution
and they do not mention the word pipeline.
...
Of course, they may use both edges of the clock to do this, so it becomes something of a semantics exercise.
The AVR uses a two-stage pipeline with separate fetch and execute cycles. The reason it only requires two cycles is that it uses a Harvard architecture in which the program and data spaces are separate and can be accessed simultaneously -- unlike the Prop I, which has a von Neumann architecture and single-ported RAM. The Prop II will get a 4x speed boost by using multi-ported RAM that supports simultaneous access from several "stations" in its pipeline.
Check out the attached pdf. Instructions have a whole cycle for fetch time. On the following clock cycle: instruction decode (undocumented) followed by register and/or immediate data fetch are fed through the ALU, and apparently the result is asynchronously placed on the general data bus all in one fell swoop. At the third rising clock the result is stored.
It's a cool trick one can do with smaller slower parts. It may be technically a two stage pipe but it's not providing the equivalent two steps that would be expected when just saying "two stage".
Eg: On figure 6.4, the 3rd line should say 1st Instruction Write, 2nd Instruction Execute, 3rd Instruction Fetch.
Conclusion: The AVR doesn't shorten the pipeline by being Harvard. The saving is in the execution stage as per above. The number of internal buses is still four, to feed a standard ALU plus instruction decode. Three of those accesses can be to the register set, so it has to be triple-ported. And the AVR pays for its low latency with reduced max clock rate ... which I guess is not a problem when power efficiency is important.
EDIT: Added the architectural block diagram attachment
EDIT2: Added decode phase to execution step
I know they think they need them to protect their code. This is of course a stupid kind of thinking.
But if they are implemented, they have to be foolproof, and the best way to do it is to make a pin, "fuse write enable", which has to be, for example, set to zero. While it is held low, normal operation is disabled; you have to connect this pin to +3.3V to get a Propeller working. And even in fuse-write-enable state, there should be a complex set of conditions to blow them.
Simply: to blow a fuse you have to set your chip pin to low (for example: with a jumper). Then you have to run a "fuse burner application" on your PC connected to the Propeller, and this application will send a long sequence of bits to blow a fuse, and if even one of these bits is wrong, nothing can happen.
And there should be a "fuse disable" command which can blow all of these fuses at once, after which a Propeller can work normally with unprotected code, and only unprotected code in EEPROM, so there is no way to brick it with these stupid fuses.
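A rough sketch of that unlock-sequence idea, with a made-up key; the point is just that a single wrong bit aborts the whole burn:

```python
# Illustrative sketch of the proposed unlock protocol: the programmer
# streams a long key, and any single wrong bit refuses the burn. The
# key value and length are invented for the example.

UNLOCK_KEY = 0xDEADBEEFCAFEF00D   # hypothetical 64-bit unlock sequence
KEY_BITS = 64


def try_unlock(bitstream):
    """Return True only if every received bit matches the key, MSB first."""
    if len(bitstream) != KEY_BITS:
        return False
    expected = [(UNLOCK_KEY >> (KEY_BITS - 1 - i)) & 1 for i in range(KEY_BITS)]
    return bitstream == expected


# The correct key stream, as the burner application would send it:
good = [(UNLOCK_KEY >> (KEY_BITS - 1 - i)) & 1 for i in range(KEY_BITS)]
```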
Conclusion: The AVR doesn't shorten the pipeline by being Harvard.
Actually it does. It's because instructions are fetched from a separate memory space that the next instruction fetch can occur simultaneously with register read and write from the current instruction. If instructions and data shared the same memory, this overlap would not be possible without multi-ported memory.
True. I should rephrase my assertion: "The AVR's Harvard architecture is what makes pipelining possible, without resorting to multi-port memory." How's that?
Actually it does. It's because instructions are fetched from a separate memory space that the next instruction fetch can occur simultaneously with register read and write from the current instruction. If instructions and data shared the same memory, this overlap would not be possible without multi-ported memory.
-Phil
Kind of makes me wonder why no CPUs take this to its limit? I.e. divide the registers into four banks. Instructions come from one bank, the two data sources come from two other banks, and the result is written to the last bank. It wouldn't be too painful to code for if all the banks were mapped into a common address space. In that case it'd be a trade between 4-port registers or a 4x4 crosspoint switch and 4x read/write register banks. The area trade might pay off with a big register space?
Lawson
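A sketch of how that common address space could map onto four banks; the 256-register size and the select-by-upper-bits scheme are illustrative assumptions, not any real CPU's layout:

```python
# Sketch of the four-bank idea: one flat register address space, with
# the upper address bits selecting the bank. The 256-register space
# split four ways is an assumption for illustration only.

REGS_PER_BANK = 64


def bank_of(addr):
    """Map a flat register address to (bank, offset-within-bank)."""
    return addr // REGS_PER_BANK, addr % REGS_PER_BANK


# An instruction like  dest = src1 op src2  touches three banks at once,
# with the instruction itself streaming from the fourth:
banks_touched = {bank_of(70)[0], bank_of(150)[0], bank_of(200)[0]}
```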
True. I should rephrase my assertion: "The AVR's Harvard architecture is what makes pipelining possible, without resorting to multi-port memory." How's that?
An explanation for why pipelining is not mentioned would be that the only actual latch (stage) is the instruction register itself. They have managed to achieve pipeline functionality without any dedicated latches.
Kind of makes me wonder why no CPU's take this to it's limit? I.e. divide the registers into four banks. Instructions come from one bank, the two data sources come from two other banks, and the result is written to the last bank.
They do. It is common for any register-register opcode CPU to use multi-port memory for the registers. These can read and write the same location, using different ports, on a single clock edge.
Such multi-port memory speeds operation, but does have a die-size cost.
Most register-register designs have relatively small register areas, but the idea of larger register areas is not new: the Intel MCS96 had 3-operand opcodes, with 256 registers.
Some Infineon parts I believe allow moving the active register area, within a 1-2K byte area, and I recall mention of a Sparc variant that allowed a register offset, so a procedure call could pass params in half the registers, and a new half-set was 'created' for local variables.
In that case it'd be a trade between 4-port registers or a 4x4 crosspoint switch and 4x read/write register banks. The area trade might pay off with a big register space?
Multiporting a small group of registers is no big deal. It will be interesting to know how much of a difference there is between the Prop1 and Prop2 though. A big problem with large register space is that the size of the instructions balloons as the number of registers increases.
DSPs have gone down the route of multiple RAM/Flash blocks so as to allow simultaneous table lookup.
Propeller II Status:
The synthesized logic (inner core) continues to go through several iterations to meet timing, as well as to improve functionality. This section is mostly driven by Chip; however, where I came in before was to connect all of my existing Power and Ground to the power grid structure of the synthesized logic. Some of the improvements were to address Power/Ground demands due to IR drop (IR drop is basically voltage drop due to current demands over the wire resistance, based on the length and thickness of the wire) ...so the placement of Power and Ground strapping was not completely defined. Now that Power and Ground have been finalized, I have traversed the perimeter of the synthesized logic making the Power/Ground connections.

FYI) When the synthesized logic is changed, it takes about 2 hours to run a script on it to sync all of the stream-out layers. After that it's just a matter of changing the name of the instantiation from the old (previous version) to the newest version.

As far as layout, I am done with the exception of converting parts of a RAM into a ROM (see the Propeller II thread). Right now this is a critical path. Based on the ROM decision, this delays when I start making those physical changes to the RAM. The changes also cascade to the inner core, where the logic must be re-synthesized to communicate with the ROM structure and sequence.
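For readers unfamiliar with IR drop, the arithmetic is just Ohm's law over the strap geometry. The numbers below are illustrative, not taken from the actual Prop II power grid, and copper resistivity is used purely to give a feel for the magnitudes:

```python
# Back-of-the-envelope IR-drop calculation: drop = current * resistance,
# with resistance = resistivity * length / cross-section. Every number
# here is invented for illustration; real on-chip metal layers differ.

RHO_CU = 1.68e-8        # resistivity of bulk copper, ohm*m


def ir_drop(current_a, length_m, width_m, thickness_m):
    """Voltage drop across a straight metal strap of uniform cross-section."""
    resistance = RHO_CU * length_m / (width_m * thickness_m)
    return current_a * resistance


# 100 mA over a 2 mm strap, 10 um wide and 0.5 um thick:
drop = ir_drop(0.1, 2e-3, 10e-6, 0.5e-6)
```

Even this toy example yields a drop of hundreds of millivolts, which is why strap placement and thickness matter.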
Mini Contest Results!!:
I had 21 entrants between E-mail and answering in the forum, and two distinct winners.
Congratulations to:
Darren Olafson, for having the best guess for the number of transistors inside the core and the total number of transistors in the design.
Michael Jassowski, for having the best guess for the number of transistors outside of the core ...and a neck-and-neck guess with Darren for the total number of transistors in the design.
Each will receive $100 towards their next Parallax purchase.
(I will contact both of you by E-mail or PM for further details)
Actual Numbers:
number of transistors inside of the core: 4,892,765
number of transistors outside of the core: 10,771,134
number of total transistors in the design: 15,663,899
Note: The values used represent the totals at the beginning of the contest. Actual numbers to date will be released in the Propeller II datasheet.
Attached is a screenshot of the entries in the order that I received them. A "T" score has been applied to each guess versus the actual number. The closer the T score is to 1, the closer the match; the closer it is to 0, the further off the guess.
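The exact "T" score formula wasn't stated; one simple reconstruction that behaves as described (exactly 1 for a perfect guess, tending toward 0 as the guess drifts) is the ratio of the smaller number to the larger:

```python
# A plausible reconstruction of the "T" score; the actual formula used
# for the contest was not given, so treat this only as one candidate.


def t_score(guess, actual):
    """Score a guess: 1.0 for an exact hit, tending to 0 for a wild miss."""
    if guess <= 0 or actual <= 0:
        return 0.0
    return min(guess, actual) / max(guess, actual)


exact = t_score(4_892_765, 4_892_765)   # a perfect guess scores 1.0
half = t_score(2_446_382, 4_892_765)    # half the actual scores ~0.5
```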
Just for the useless trivia file, do you know the count for the Propeller (Propeller 1, Propeller Classic)? It would be an interesting and rather irrelevant numerical comparison! :0)
I /think/ the total transistor count for the entire Propeller I chip was something like 5,500,000 .... considering the Propeller II occupies roughly the same die space as the Propeller I, and the difference in process dimensions gives about 3.78 times the density going from 350nm to 180nm .... that figure seems about right to me.
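The scaling arithmetic in that estimate checks out: a 350 nm to 180 nm shrink scales linear dimensions by 350/180, and density by the square of that, which is the quoted ~3.78x:

```python
# Verifying the process-shrink arithmetic from the post above.

density_gain = (350 / 180) ** 2       # ~3.78x more transistors per area
p2_total = 15_663_899                 # total count from the contest numbers
implied_p1 = p2_total / density_gain  # same-area count back-projected to 350nm

print(round(density_gain, 2))         # ~3.78
```

That back-projection implies roughly 4.1 million transistors for a same-area 350 nm design, in the same ballpark as the ~5.5 million recollection, given the two chips don't fill the die identically.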
number of transistors inside of the core: 4,892,765
number of transistors outside of the core: 10,771,134
number of total transistors in the design: 15,663,899
Interesting ratios, are those actual counts, or Area-equivalent counts ?
I would expect the outer to have more area, but as the geometries there include R/C/ESD/IO/PSU ring etc, the real transistor count would be of a lower density ?