Well, it seems that all this wide-data-path stuff only increased the Prop123-A9 16-cog/18-smart-pin logic utilization from 89% to 91%, a surprisingly small increase. These changes got rid of 68 somewhat complicated state machines: a sender and receiver in each smart pin and a sender and receiver in each cog. The hub grew to accommodate the 32-bit mux/demux circuits, while those old cog and smart-pin state machines each became 32-bit flop arrays. This is a huge net improvement.
So, each cog has a 32-bit-wide input/output data path that feeds a mux to the 64 smart-pin I/Os.
That is one heck of a wide set of buses!
It seems the FPGA (and hopefully ASIC) software can shift the bus section onto a metal layer effectively over the top of the other logic. I am imagining this like a multilayer PCB, where some layers have parts, other layers connect the parts, other layers are bus layers, and last of all there are power and ground layers.
I don't let my PCB software autoroute, but my boards are mostly only small/tiny 2-layer PCBs. The software is not that expensive.
Guess this is another reason the IC Layout software is sooooo expensive - few users, lots of decisions to make.
Guessing here, but I suspect the Prop2 is targeting the smallest number of layers to keep such a small run cost-effective. This might be the reason why MRAM-based HubRAM was ruled out at the start. If I understood Beau at all, he seemed to indicate MRAM required more metal layers (four) than the Prop2 (unspecified) has.
It's not just metal layer count: the 'M' in MRAM stands for magnetic, so highly specialized materials handling is needed for the magnetic side of this, certainly a long way from any low-cost trailing-edge fab.
...
Assuming that the 3b fraction applies correctly, this gives an AutoBaud quantize error of 0.392% max.
The capture resolution on a 10-bit-time sample:
at 1.5MBd, is around 0.8%
at 3MBd ~1.59%
at 4MBd ~2%
Those initial measure/capture numbers for a 10-bit base are significant, and now that fractional baud is included, this makes a 15b base capture more attractive. (15b comes from an AutoBaud char of "?" and summing tRR, tFF.)
Ultimate fractional baud errors depend on the design details.
With 10 possible sample times each needing a +0/+1 adjust, a simpler 3-bit fraction can only cover 7 of the up-to-9 possible adder counts.
4b BCD can cover 0..9, but might not co-operate as well with user-selected data length choices up to 32b?
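The +0/+1 adjust can be sketched with a small accumulator model. This is only an illustration under my own assumptions: a 3-bit fraction is added once per bit time, a carry-out stretches that bit by one clock, and the accumulator starts at zero. None of this is taken from the P2 Verilog, and `stretch_pattern` is a made-up name.

```python
def stretch_pattern(frac, frac_bits=3, nbits=10):
    """Return a list of 0/1 adjusts, one per bit time (hypothetical model)."""
    modulus = 1 << frac_bits
    acc = 0
    pattern = []
    for _ in range(nbits):
        acc += frac
        pattern.append(acc // modulus)  # carry-out: 1 means this bit gets +1 clock
        acc %= modulus
    return pattern


# Number of +1 adders each 3-bit fraction value produces over a 10-bit frame.
adders = {frac: sum(stretch_pattern(frac)) for frac in range(8)}
reachable = sorted(set(adders.values()))
print(adders)
print(reachable)
```

In this model the seven non-zero fraction values give 1, 2, 3, 5, 6, 7 and 8 adders, so counts of 4 and 9 are never produced. That is one possible reading of the "7 of up to 9" remark, not a confirmed one.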
So, each cog has a 32-bit-wide input/output data path that feeds a mux to the 64 smart-pin I/Os.
That is one heck of a wide set of buses!
So does this mean that there is a total of 32 lines or 512 lines (i.e., 16 cogs x 32 data lines) involved? Guess it has to be 512 total, but I'm guessing that the lines are not in a ring-like bus. I mean, even if there are a total of 512 lines that get mux'ed to the smart pins, perhaps the lines aren't routed such that each cog's lines pass reasonably close by all the other cogs to be available for "tapping" (as below).
But if there are 512 lines in a ring-like configuration (which seems like wishful thinking), I'd be curious to know if such a configuration could be tapped (or piggybacked on) to let each cog rather directly write any other cog's LUT, not just its adjacent neighbor(s). And in the unlikely event that there is a ring, would the further provision of a "cog selection bus" with an additional 16 lines be worth the cost, in that any one cog could then write the same data to any desired combination of cogs' LUTs? Don't know if it'd be worth it. Sounds super flexible, though, and it would integrate the cogs more than them just being in the same chip and sharing the same pins and hub (which is a heck of a lot already).
Whatever the case, a big CONGRATS!!! to Chip on being courageous enough to consider the merits of ripping up/out all those state machines, as the new way sounds pretty beneficial and also simpler. It's really hard to toss out stuff you've worked hard on and thought out well, but sometimes it's the right way to go (unless the deadline is today). Apparently, he was able to do so reasonably quickly once he made the decision and got into it. And for some reason, the router worked with him and the new design, such that it didn't eat up so much logic/real estate. [Hmm...if that's the case for this and this doesn't involve a ring, would it be the case for a full ring? No, that'd put us well over the top, wouldn't it?]
This is what I currently understand. It might be totally wrong though!
There are 64 smart pins, and each has a 32-bit bus. That is a 2048-line bus ring from what I understand.
Now each cog (16 of them) has a 32-bit bus out to connect to this 2048-line bus ring. Each cog has 6 select lines to select which smart pin, and hence which block of 32 from the 2048, to "OR" its 32-bit data with. So if two cogs talk to the same smart pin at once, their data will be "OR"ed.
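A toy model of that understanding, which may be just as wrong as the description it follows: each active cog presents a 6-bit pin select and 32 bits of data, and each smart pin sees the OR of everything aimed at it. `route_to_pins` and the sample values are invented for illustration.

```python
NUM_PINS = 64
MASK32 = 0xFFFFFFFF


def route_to_pins(cog_outputs):
    """cog_outputs: list of (pin_select, data32) pairs, one per active cog.

    Returns 64 values, each the OR of every cog's data aimed at that pin.
    """
    pins = [0] * NUM_PINS
    for select, data in cog_outputs:
        pins[select & 0x3F] |= data & MASK32  # 6 select lines -> pins 0..63
    return pins


# Two cogs address pin 5 at once, a third addresses pin 63.
pins = route_to_pins([(5, 0x0000FF00), (5, 0x000000FF), (63, 0x80000000)])
print(hex(pins[5]), hex(pins[63]))
```

Pin 5 ends up with both cogs' bits merged, which is the collision behaviour described above.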
Those initial measure/capture numbers for a 10-bit base are significant, and now that fractional baud is included, this makes a 15b base capture more attractive. (15b comes from an AutoBaud char of "?" and summing tRR, tFF.)
I've checked into more details on the sum, and my instinct that this might not quite be a 'free lunch' was right....
Turns out the average improves with more samples, but the worst case rounding effects worsen, and worst case is what worries us more here.
Worst case is a just-too-slow baud that gets a +1 edge into both the 7b and 8b windows, giving a result of 17b, not 15
That's worse than a single capture error.
Conclusion is the optimal capture T-axis is the maximum possible single capture of 9 bits.
i.e. this has a capture quantum of 1.67% at 3MBd
This gives these AutoBAUD roundings, at 3MBd (again, assumes idealized fractional baud, 3 bit)
First AutoBAUD char is not such an issue, as that is all P2 is doing...
INT_RISE:  // First char, then disable this code - Absolute time readings
  oFall = nFall
  nFall = CaptFall()  // Absolute time captures, 32b
  oRise = nRise
  nRise = CaptRise()
  IF nRise-nFall < oRise-oFall THEN  // needs wrap-safe maths
    // Add code here for tighter checks, and can do One-Pin / Duplex decision here.
    t9 = nRise-oFall  // time for 9 Baud Bits
    TxChar(AutoBaudEcho)  // ack the AutoBaud char
    Disable INT_RISE  // Repeats until Done (correct Phase)
  END
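On the "wrap-safe maths" comment: with 32-bit absolute time captures, masking the difference to 32 bits gives the right interval even when the counter rolls over between the two captures. A minimal sketch follows; `elapsed` and the sample capture values are made up, and it assumes a free-running 32-bit counter with the true interval shorter than 2^32 clocks.

```python
MASK32 = 0xFFFFFFFF


def elapsed(later, earlier):
    """Clocks from `earlier` to `later` on a free-running 32-bit counter."""
    return (later - earlier) & MASK32


# Falling edge captured just before the counter wraps, rising edge just after.
o_fall = 0xFFFFFFF0
n_rise = 0x0000002C
t9 = elapsed(n_rise, o_fall)
print(t9)
```

On hardware with native 32-bit registers a plain unsigned subtract does the same thing; the mask is only needed here because Python integers do not wrap.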
Then, Active RX could run this pair of interrupts
INT_FALL:  // Active during Rx, queues 2 falls, Occurs EVERY =\_ in
  oFall = nFall
  nFall = CaptFall()
  RETI
INT_RX:
  nRise = CaptRise()  // Stop bit, discards any mid-char-rise. Capt 32b absolute time.
  IF RxChar=AutoBaud THEN
    t9 = nRise-oFall  // time for 9 Baud Bits, needs wrap-safe maths, 32b
    TxChar(AutoBaudEcho)  // ack the AutoBaud char
  END
That's a little less than ideal, as the INT_FALL queue steals P2 bandwidth, but it is needed if active retrim is required.
An alternative is to use some other Smart Pin mode that can time from the next =\_ to the second _/=
Maybe that is already possible, as I see cycle-counts as params ? **
This HW assist slashes the live retrim to the simpler
INT_RX:
  nRise = CaptF_2ndR()  // read and re-prime.
  IF Char=AutoBaud THEN
    t9 = nRise
    TxChar(AutoBaudEcho)  // ack the AutoBaud char
  END
** eg Smart Pin docs say things like : X[31:0] establishes how many A-input rise/edge to B-input rise/edge periods are to be measured.
sounding close, but docs lack explicit detail on what controls what....
Better would be a simple list of
* Start condition: (here, we want A Fall to start ), once started, further falls are ignored.
* Repeat & Count condition: (here, we X-- on B(A) rise)
* Capture condition (here we want rise count = 2 for " " 0x20)
* Arm/reprime condition - on read ? or some Arm command ?
It might be the SmartPin can already do this t9 capture ?
Other AutoBAUD / ReSync chars :
Alternative in this mode would be 0x55, with a rises count of 5, to capture t9.
With this, a simple check for (t9 < 1.1 * Old_t9) would fire only on a new 0x55, and it does not require a valid Rx, so has a wider capture range.
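The rise counts quoted for these characters can be checked by listing the edges of a standard UART frame. The sketch below assumes an idle-high line, one start bit, 8 data bits sent LSB first and one stop bit; `edge_times` is a hypothetical helper, and times are in bit periods measured from the start-bit fall.

```python
def edge_times(char, data_bits=8):
    """Return (falls, rises) in bit times from the start-bit fall.

    Assumes idle-high line, one start bit, LSB-first data, one stop bit.
    """
    levels = [1, 0]  # idle level, then start bit
    levels += [(char >> i) & 1 for i in range(data_bits)]
    levels.append(1)  # stop bit
    falls, rises = [], []
    for t in range(1, len(levels)):
        if levels[t - 1] == 1 and levels[t] == 0:
            falls.append(t - 1)
        elif levels[t - 1] == 0 and levels[t] == 1:
            rises.append(t - 1)
    return falls, rises


for ch in (0x20, 0x55, 0x3F):
    falls, rises = edge_times(ch)
    print(hex(ch), "falls", falls, "rises", rises)
```

Under those framing assumptions 0x20 has its second rise at 9 bit times, 0x55 has its fifth rise at 9 bit times, and "?" (0x3F) gives a fall-to-fall of 7 bits and a rise-to-rise of 8 bits, which is where the 15b sum comes from.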
It's trivial to have your download process include a tiny initial download at 115kbps, then jump to something much higher from there to load the full 512kB+.
This has to be the best approach. The first part could be fully automatic by inspecting the code you want to load for the PLL settings.
If the PLL doesn't work, then you'll be using either the internal RC or an external oscillator. Those would both be settings in the code you're wanting to launch and could be set automatically.
Dual down-load is always a user choice, but that 115kbps limit has long since gone away.
See Chip's various posts. There are clear benefits to not being forced to do a dual-download, so faster AutoBAUD is important.
The discussion now is around what speeds you can practically autobaud and send at.
I'm thinking 3MBaud is practical.
I have always thought that the boot process was going to be a dual boot approach. Once you have xtal freq locked in, the limit becomes extremely high, and a protocol and checksum can also be used.
In fact, the first boot code could then be a standard release.
Comments
Should consider stacking hub ram
I can see Moore's Law changing....
Same geometry, but nearly double the silicon layers, with double the levels of transistors. Should keep the software gurus busy for some time, while still meeting the essence of Moore's Law.
They are already doing this with stacked FLASH up to 48 silicon/metal layers.
More practical could be something like
http://www.prnewswire.com/news-releases/magnachip-announces-cost-competitive-013-micron-slim-flash-process-technology-300337369.html
Though their claims here are rather vague and not very engineering-like...
"MagnaChip announced today the availability of a new 0.13 micron Slim Flash process technology, based on 0.13 micron EEPROM. While maintaining the same performance characteristics of the existing EEPROM process, Slim Flash process technology is highly cost competitive because it reduces the number of layers to be embedded by 20 percent and cuts the manufacturing turnaround time by 15 percent."
Even that is quite a jump.
Better suited to the P2 could be something like a Dual-die, which the FPGA vendors already manage.
Allowing for that may not be too involved.
Beau only mentioned the number of metal layers as an issue.
Also, Flash can't do HubRAM.
Those initial measure/capture numbers for a 10-bit base are significant, and now that fractional baud is included, this makes a 15b base capture more attractive. (15b comes from an AutoBaud char of "?" and summing tRR, tFF.)
Reduced Measurement quanta effects:
4MBd = 1.333%
3MBd = 1%
2MBd = 0.667%
1.5MBd = 0.5%
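These quanta look like one capture clock divided by the number of clocks in the measured window. The sketch below assumes a 20 MHz capture clock; that figure is my inference from the percentages quoted, not something stated in the thread, and `capture_quantum_pct` is a made-up name.

```python
def capture_quantum_pct(baud, window_bits, clock_hz=20_000_000):
    """One-clock capture quantum as a percentage of the measured window."""
    clocks_in_window = window_bits * clock_hz / baud
    return 100.0 / clocks_in_window


# 15-bit window (tRR + tFF of "?") at the baud rates listed above.
for baud in (4_000_000, 3_000_000, 2_000_000, 1_500_000):
    print(f"{baud / 1e6:g}MBd: {capture_quantum_pct(baud, 15):.3f}%")
```

With that assumed clock the function reproduces the 15-bit figures listed above, and `capture_quantum_pct(3_000_000, 9)` gives the 1.67% quoted for a single 9-bit capture at 3MBd.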
From Chip's P2 Verilog code. The router will be busy.
Which brings us to how to capture 9 bits....