You used pusha and popa for the spin stack. Will there be a way to extend stack space into hub ram if the whole stack doesn't fit in stack ram?
Not in this first version. There are huge speed advantages in using the stack RAM as the Spin stack space. Basically, you get 256 stack locations, which, if your average routine requires 8 levels, gives you a calling depth of 32, which is excessive. Where this could become a practical problem is in recursive calling.
The foundry we are using offers multi-layer reticles, not just single-layer. This means that a mask will have images for more than one process step. It's not ideal for mass production because they expose only half the area (or 1/3 or 1/4, etc.) at a time that a single-layer reticle would.
What's the advantage to this? Does this let you replace parts of the design (while leaving other parts unchanged) if you find a defect?
I assume that with "multi-layer reticles" you mean multiple reticles for the same layer. I worked at ASML for a while and I have some idea of the accuracy at which a wafer scanner works, and it would seem to me that it would be at least impractical, but more likely impossible to expose multiple reticles (on top of each other) at the same time... Am I wrong?
...this could become a practical problem is in recursive calling.
Could have sworn I posted a comment here like "Oh dear there goes our favorite benchmark, the recursive fibo()". Seems my intermittent phone connection bungled it.
But now that I'm here I reckon that, despite the recursive fibo() being very recursive, it does not actually use much stack. fibo(10) only needs to recurse to a maximum depth of 10. So it looks like the on-COG stack will give us a very quick fibo() result.
All of which makes this post redundant. Just ignore me.
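For the record, the depth claim is easy to check. Here's a sketch (in JavaScript, like the other examples in this thread, and obviously not the actual Spin benchmark) that tracks how deep the recursive fibo() actually goes:

```javascript
// Illustrative sketch only: recursive fibo() with depth tracking.
let maxDepth = 0;

function fibo(n, depth = 0) {
    maxDepth = Math.max(maxDepth, depth);
    if (n < 2) return n;
    return fibo(n - 1, depth + 1) + fibo(n - 2, depth + 1);
}

// fibo(10) returns 55, and maxDepth ends up at 9: ten stack
// frames at most, counting the initial call at depth 0.
```

So even at several longs per frame, fibo(10) fits comfortably in a 256-long stack.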
Making ICs is the equivalent of a series of photographic exposures. Each exposure uses a mask or reticle. As line densities go up, these masks become exponentially more expensive. For a small run (like the initial test of an IC) the overriding cost is in these mask charges. To save on that, a number of masks are shared. For instance, 4 masks can be combined on a single one, with each exposure offset by one die, so you get 4 exposures or mask steps for the price of one.
Yes that also means that only 1/4 of the die are built correctly (exposures out of order won't work). But it is a lot cheaper than paying for a full mask set.
Another way to lower mask costs is a shared project (like PCBpool, but for ICs), such as MOSIS.
Not in this first version. There are huge speed advantages in using the stack RAM as the Spin stack space. Basically, you get 256 stack locations, which, if your average routine requires 8 levels, gives you a calling depth of 32, which is excessive. Where this could become a practical problem is in recursive calling.
Thanks. I was thinking about recursive calling and functions with lots of local variables when asking this.
Will you be able to tell PNUT.exe to use your own Spin interpreter, so you can have different versions of the interpreter for different applications? For example, one with the stack in stack RAM and another with the stack in hub RAM, plus custom application-specific ones that make misuses of commands like cognew(-1,10) do something useful. It would be really cool if users could invent their own operators and tell the compiler their syntax and what bytecode they get translated to. This information would be project-specific, or specified in the main Spin file, or something similar.
What's the advantage to this? Does this let you replace parts of the design (while leaving other parts unchanged) if you find a defect?
I assume that with "multi-layer reticles" you mean multiple reticles for the same layer. I worked at ASML for a while and I have some idea of the accuracy at which a wafer scanner works, and it would seem to me that it would be at least impractical, but more likely impossible to expose multiple reticles (on top of each other) at the same time... Am I wrong?
===Jac
These multi-layer reticles contain more than one image. The single image of interest is selectively exposed during lithography. I assume they either don't shine light through the unwanted areas, or crop the transmission window to just expose the area of interest. At no point does the wafer get exposed to multiple images at once.
Many thanks for keeping us abreast of the latest developments, Parallax! It's getting exciting!
Based on the latest news from Chip (as quoted below), it sounds like there have been some changes to the original plan of using a shared shuttle run:
Post 2005 above: "If this new chip works, we'll be able to make 100k more units from the same mask set, so there won't be much delay for actual chips." Post 2011 above: "The foundry we are using offers multi-layer reticles, not just single-layer. This means that a mask will have images for more than one process step. It's not ideal for mass production because they expose only half the area (or 1/3 or 1/4, etc.) at a time that a single-layer reticle would."
That apparently signals a likely welcome change (from the standpoint of us forum members, anyway) from:
Well, the total time until Parallax has sample chips in hand might be similar, but if the chip works, it sounds like it could mean less time for getting chips into our grubby hands (though likely at a bit of a premium). I heard from a sales manager at a fabless chip company (not a "fabulous" one like Parallax, though!) that TSMC has a waiting period of at least three months for production. But perhaps this calculated risk Parallax is taking with its choice of mask sets could speed that up somewhat (possibly depending on which fab they are using).
This apparent change is not completely unexpected: back in May, Ken Gracey mentioned the possibility of using an alternative of some sort (though whether at TSMC or a different fab was not absolutely specified, at least as I read it):
"This particular foundry [TSMC] has a much higher startup cost for shuttle and mask set, but a lower unit cost. We are presently looking at alternatives which might increase our unit cost per die, but have a lower shuttle/mask cost. Our recent submittal had a poly layer error, so we are evaluating an alternative for the next foundry run." See Post #7 on 05-19-2013 by Ken Gracey: http://forums.parallax.com/showthread.php/148098-Propeller-2-Release-Date
There is, of course, a lot of info about mask sets available on the web, but perhaps see the following link for a quick comparison of the Multi-Project Wafer (MPW), also called a "shuttle run" (I think), Multi Layer Mask (MLM), also called Multi Layer Reticle (MLR), which Chip references, and Full Maskset, apparently also called a Single Maskset. There's a comparison diagram at the end. In particular, it states:
"MLM (Multi Layer Mask) or MLR (Multi Layer Reticle) services help reduce the tapeout NRE cost (full maskset cost). This method allows combining up to 4 masks into one, and hence reducing the total number of masks that need to be created. As the number of masks is reduce[d] the NRE [costs are] reduced as well.
What is the catch with MLM? The drawback is the wafer price. While the NRE price decreases the wafer price increases. Therefore MLM is recommended for projects that require low or medium production volume.
You can think of MLM as a financial service. You pay lower NRE (vs. full maskset), but on the other hand, you pay a higher wafer price" http://anysilicon.com/understanding-maskset-type-mpw-mlm-mlr-and-single-maskset/
By the way, in the section about Multi Project Wafers (MPW), I'm wondering if the writer intended the text I added in brackets for the following sentence: "Distributing the maskset price among projects drive the cost down, the typical price of MPW shuttle is 90% [below that?] of [the] full maskset price." If an MPW run "only" gave a 10% savings, the advantages of going it alone (such as maybe more scheduling flexibility and more sample chips) might outweigh that 10% savings.
Anyway, Parallax is going with an MLM/MLR mask process, at least at the present time (for samples and, from what it sounds like, at least some initial production if the samples pan out).
OT - SPIN2:
Because it is soft, we can also tweak it. So if a larger stack is required, we can use a hub-stack version of the interpreter. This is going to be fun.
The prototype chips are made in the USA; if those masks work, the first 100k chips could also be made in the USA.
The fab is relatively local; that's why the MLR process. They don't do shuttle runs in the conventional sense: one reticle with multiple images, instead of multiple reticles with multiple images.
Where this could become a practical problem is in recursive calling.
There are various optimizations that can solve this problem. With tail recursion, the recursive call is the last operation in the routine, so the return can be safely optimized away and your call depth doesn't increase.
Most if not all recursive algorithms can be reworked to be non-recursive, and often will perform better after reworking.
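As a sketch of both points, here is the same trivial sum written three ways in JavaScript (an invented example, not from the thread): plain recursion that grows the stack, a tail-recursive form that a compiler with tail-call optimization can flatten, and the hand-reworked loop:

```javascript
// 1. Plain recursion: call depth grows linearly with n.
function sumRec(n) {
    return n === 0 ? 0 : n + sumRec(n - 1);
}

// 2. Tail-recursive form: the recursive call is the last action,
//    so a compiler with tail-call optimization can reuse the frame.
function sumTail(n, acc = 0) {
    return n === 0 ? acc : sumTail(n - 1, acc + n);
}

// 3. Reworked as a loop: no stack growth at all.
function sumLoop(n) {
    let acc = 0;
    for (let i = 1; i <= n; i++) acc += i;
    return acc;
}

// All three return 5050 for n = 100.
```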
I generally try to avoid recursion in my code designs.
The one really neat (even indispensable) use of recursion is in compiling equations, written from left to right with operator precedence and parentheses, into "reverse Polish notation" for solving with a simple math stack. The Spin compiler makes extensive use of recursion for compiling Spin source into byte codes. Otherwise, I avoid recursion, as it's a brain bender and feels dangerous.
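To make the idea concrete, here is a minimal recursive-descent sketch in JavaScript (an invented toy, not the Spin compiler's actual code) that turns an infix expression with +, * and parentheses into RPN, with precedence falling naturally out of the mutual recursion:

```javascript
// Toy infix-to-RPN compiler: illustrative sketch only.
function toRPN(src) {
    let pos = 0;
    const out = [];

    function peek() { return src[pos]; }

    function factor() {               // a number or a parenthesized expr
        if (peek() === '(') {
            pos++;                    // consume '('
            expr();
            pos++;                    // consume ')'
        } else {
            let num = '';
            while (/[0-9]/.test(peek())) num += src[pos++];
            out.push(num);
        }
    }

    function term() {                 // '*' binds tighter than '+'
        factor();
        while (peek() === '*') { pos++; factor(); out.push('*'); }
    }

    function expr() {
        term();
        while (peek() === '+') { pos++; term(); out.push('+'); }
    }

    expr();
    return out.join(' ');
}

// toRPN('1+2*3')   -> '1 2 3 * +'
// toRPN('(1+2)*3') -> '1 2 + 3 *'
```

Each grammar rule becomes a function, and nesting in the input simply becomes nesting of calls, which is why recursion fits this job so well.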
Yeah, recursion for compilation is often the best route to take, but usually doesn't involve very deep recursion. Although you could probably turn equation processing into a multi-pass setup instead of recursing and get it working well.
I've seen code that relies on recursion that goes many hundreds of levels deep. It's scary.
For a long time it was thought that recursion was the only way to calculate Ackermann's function, but I seem to remember that someone came up with a non-recursive way to calculate it.
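Whether or not that memory is right, the general point holds: any recursive function can be run iteratively by managing an explicit data stack yourself. A sketch for Ackermann in JavaScript (a standard textbook transformation, not anyone's code from this thread):

```javascript
// Ackermann's function computed with an explicit stack of pending
// m-values instead of call recursion. The data stack still grows,
// but there are no nested calls.
function ackermann(m, n) {
    const stack = [m];
    while (stack.length > 0) {
        m = stack.pop();
        if (m === 0) {
            n = n + 1;                  // A(0, n) = n + 1
        } else if (n === 0) {
            stack.push(m - 1);          // A(m, 0) = A(m - 1, 1)
            n = 1;
        } else {
            stack.push(m - 1);          // A(m, n) = A(m - 1, A(m, n - 1))
            stack.push(m);
            n = n - 1;
        }
    }
    return n;
}

// ackermann(2, 3) === 9, ackermann(3, 3) === 61
```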
If you want to deal with tree-like data structures it's easy to think in terms of recursion. Visiting all the nodes of a binary tree, for example (warning: JavaScript):
function traverse(node) {
    print(node.data);
    if (node.left) {
        traverse(node.left);
    }
    if (node.right) {
        traverse(node.right);
    }
}
For these kinds of problems you are going to need a stack somewhere anyway.
Parsing programming languages is basically a case of traversing a tree structure, so recursive algorithms are very handy.
Using recursion you can program without loops. For example, this prints out an array:
function visit(list, n) {
    var data = list[n];
    if (data) {
        print(data);
        visit(list, n + 1);
    }
}

visit(someList, 0);
Now that of course is very inefficient what with all that calling overhead and needing a stack.
But some compilers can optimize that code into something like this (pseudo code):
n := 0
visit:
    data := list[n]
    if (data) {
        print(data)
        n := n + 1
        goto visit
    }
This is the magical tail recursion that optimizes the calls and returns into a jump. Now you can write your algorithms clearly and concisely in a recursive style without the overheads.
The prototype chips are made in the USA; if those masks work, the first 100k chips could also be made in the USA.
The fab is relatively local; that's why the MLR process. They don't do shuttle runs in the conventional sense: one reticle with multiple images, instead of multiple reticles with multiple images.
It sounds like all new masks were made rather than correcting the offending layer(s) of the old mask set. It makes me wonder if any changes at all were made to the P2 verilog code. Also, I'd love to know what sorts of tools or techniques were used to check for unintended shorts between layers.
Recursion can make some tasks simpler. I used recursion in the chess program I wrote recently to evaluate moves and counter-moves. It simplified the bookkeeping quite a bit. The routine iterated over all the possible moves and called itself to generate all of the possible counter moves, and so on until it reached the maximum depth. Of course, it's also possible to do this without recursion.
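The move/counter-move recursion described above has a well-known general shape (negamax). Here is a toy sketch in JavaScript; the "game" (positions are numbers, each move adds 1 or 3, the static score is the position value) and all the helper names are invented for illustration and have nothing to do with the poster's actual chess program:

```javascript
// Invented toy game, standing in for a real move generator.
function legalMoves(pos) { return [1, 3]; }
function applyMove(pos, move) { return pos + move; }
function score(pos) { return pos; }

// Negamax: iterate over moves, recurse for counter-moves,
// stop at maximum depth with a static evaluation.
function negamax(pos, depth) {
    if (depth === 0) return score(pos);    // leaf: static evaluation
    let best = -Infinity;
    for (const move of legalMoves(pos)) {
        // Counter-moves are scored from the opponent's point of view,
        // hence the sign flip.
        best = Math.max(best, -negamax(applyMove(pos, move), depth - 1));
    }
    return best;
}
```

The sign flip is the whole trick: each level scores the position from the player to move, so a good reply for the opponent counts against us, and the recursion does all the bookkeeping.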
Now that of course is very inefficient what with all that calling overhead and needing a stack.
!!!
It's a little more than inefficient! Downright wasteful, imho. If you aren't going to be coming back and reusing the stacked data then you shouldn't be using recursion.
Yes downright wasteful. And that's why some languages and their compilers have tail call optimization as shown.
Oddly, I have seen it a couple of times on this forum: somebody new to programming asks why their code does not work, and it looks like this:
PUB doSomething
  'bla bla
  'bla bla
  doSomething
Clearly they have not totally grasped the idea of method calls, stacks, local variables, etc.; they just want to repeat their code sequence in the most obvious way. It's not their fault that the compiler does that very inefficiently.
There are some algorithms, like tree searches, algebraic expression evaluation, and game move evaluation which are so much easier with recursion you'd be crazy to do them any other way. Yes, you can do them without, but it will take twice the code and be ten times harder to debug.
Then there are others, like Heater's example of "programming without loops" which are simply parlor tricks and are not practical either for implementation or maintenance. Yes it's cool to show that you can use recursion instead of a loop but it's not very useful.
Recursion is a powerful tool. You should use the right tool for the job.
Has the revised Prop2 layout been sent to the fab yet ?
Yes. Wafers should be back from the fab in two weeks. Then, several days will be needed for transit and packaging. We'll be able to try it out once we get back packaged parts.
Comments
Well we know we have to have lively debate around here, so...
¡Viva la recursividad!
C.W.
Parallax doesn't do preorders very often, and so far hasn't offered the option for the Propeller 2.
I guess I should get my checkbook ready...