Anyway, Chip has the numbers, and as I remember multipliers are pretty big pieces of logic compared to ALUs, so it would not surprise me if one doubled the size of a COG.
I wonder how many times peripheral-type code actually needs a multiply. Heck, lots of us here came from an era where all we had was 8-bit adders.
I think a 16x16 multiplier would be good, per cog. Any thoughts on whether that would be precise enough? 16x16 yields a convenient 32-bit result, at least.
Yes 16x16 is good enough for most applications. You always can make 16x32 or 32x32 with the 16x16 and some shifting and adding.
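To illustrate the "shifting and adding" remark, here is a minimal Python sketch of building wider unsigned products from 16x16 partial products. `mul16` is only a stand-in for a hypothetical 16x16 hardware multiplier, and all names are illustrative assumptions, not anything from the actual chip design.

```python
MASK16 = 0xFFFF

def mul16(a, b):
    """Stand-in for a hypothetical 16x16 -> 32-bit unsigned multiplier."""
    assert 0 <= a <= MASK16 and 0 <= b <= MASK16
    return a * b

def mul16x32(a16, b32):
    """16x32 -> 48-bit unsigned product from two 16x16 multiplies."""
    b_lo, b_hi = b32 & MASK16, (b32 >> 16) & MASK16
    return mul16(a16, b_lo) + (mul16(a16, b_hi) << 16)

def mul32x32(a32, b32):
    """32x32 -> 64-bit unsigned product from four 16x16 multiplies."""
    a_lo, a_hi = a32 & MASK16, (a32 >> 16) & MASK16
    b_lo, b_hi = b32 & MASK16, (b32 >> 16) & MASK16
    low = mul16(a_lo, b_lo)                          # weight 2^0
    middle = mul16(a_lo, b_hi) + mul16(a_hi, b_lo)   # weight 2^16
    high = mul16(a_hi, b_hi)                         # weight 2^32
    return low + (middle << 16) + (high << 32)
```

In real hardware or PASM the intermediate sums would need carry handling across 32-bit registers; Python's unbounded integers hide that detail here.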
This may actually be good news if the new Spin is source compatible with the old Spin. It may mean that almost all OBEX code will work on the new processor even if the assembly language changes a bit.
I see the biggest incompatibility not in the bit encoding or in some missing instructions, but in the changed timing with 2-cycle instructions. Quite a few objects rely on the exact timing of the instructions. For example, they use a counter to generate the bit clock while the instructions shift the data out, and similar things.
All I am saying is that you can't expect all the objects to work without modification. I certainly want the faster instructions.
Andy
Good point. Unless the code is paced by using waitcnt it will likely require revision for this new chip.
I think a 16x16 multiplier would be good, per cog. Any thoughts on whether that would be precise enough? 16x16 yields a convenient 32-bit result, at least.
What happened to the single, shared maths-block idea ? That seemed to have great merit.
You could check the Power & Die costs with/without that easily enough ?
VGA is going to simply be a use case of a shifter that drives four DAC outputs to a set of fixed pins attached to each cog. The four channels from highest to lowest can be used as: R, G, B, HSYNC. In some modes, the shifter handles data in ways that is obviously for video, but, otherwise, it's a generic circuit that can simultaneously update four DACs with unique 8-bit data.
I presume this will also include a Parallel mode (just skips the DACS, drives up to 32 pins in parallel , with strobe ) ?
Good point. Unless the code is paced by using waitcnt it will likely require revision for this new chip.
Or users could use COG Gearing set to 50%, to emulate a 4 cycle opcode. That could be the default.
This COG gearing also controls the power envelope, which I am sure is still going to be an issue.
As per conversation with Chip, here is a rough I/O floor-plan that follows the image Chip posted at the top of this thread.
Note:
- XI and XO in the current IO library are made into one pad (upper left block)
- Since the GND is connected to the bottom of the package, the GND's still needed to be brought out from the die. So for every VIO and VDD there is a corresponding GIO and GND on the attached image. From the Package perspective this is still a 100 pin package, but from the die itself there are a total of 132 pads. 32 of which are either GND or GIO and are connected directly to the bottom of the package.
I'm going to speculate that every piece of assembler has a MOV in it. But there are only 130 files counted with a MOV.
Is it really the case that we have 1335 files with no assembler in the OBEX?
This is not believable. I think Roy made some mistake there. OK, he stated (emphasis from me):
Here is my grep analysis of the OBEX files I use for compiler testing. This consists of 1465 spin files. My counts below are just for number of spin files that contain the instruction. Also, I can't easily account for SPIN keywords that match PASM ones, but it shouldn't matter for this purpose.
But even then this would come down to just 130 files containing PASM in the OBEX. I can not imagine a PASM file without MOV.
As per conversation with Chip, here is a rough I/O floor-plan that follows the image Chip posted at the top of this thread.
Beau and I discussed that we could stuff the video PLLs into the VDD/GND pin pairs that precede each set of 4 I/O's that an adjacent cog can drive the DACs of. This keeps things tidy.
Beau had the great idea of building 4 of the security fuses into each I/O pad structure, yielding 256, of which 128 are needed for the HMAC/SHA256 which validates the loader (code protect).
The corner pads for XI and XO will implement the chip-wide PLL into themselves, maybe taking the corner.
This all gives a nice, big, open square for all the synthesized logic to drop into, with four sets per side of VDD and GND.
Now that we have the honking ground pad, is it really necessary to have a separate vio and vcore for almost every two I/O pins?
The reason I ask is that if it would be OK to halve that, with one Vio and Vcore for every four pins, we would get 4 sides * 6 pins... 24 I/O's back, and be at 88 80 I/O's - MUCH better.
I am NOT trying to be a pain, I just don't understand why a separate Vio and Vcore is needed for every two I/O pins. And I love to learn.
Also not wanting to be a pain but having 12 I/O left after external RAM, VGA and a couple of serial ports pretty much means you have to use two chips to make a complete system. An extra 24 I/O would bring us back up to P1 level of available I/O with the extra RAM which I think, in a lot of controller applications, would be a must.
Now that we have the honking ground pad, is it really necessary to have a separate vio and vcore for every two I/O pins?
It is also common these days to have VCCIO spread around the part, and to have groups of those allowed to be different.
Is that the case here ? or are all the VCCIO internally bussed up on the die ?
It is not so common to have a lot of sprinkled VCore pins, usually just enough to handle the expected current.
Sprinkled VCore pins makes PCB routing harder.
Beau and I discussed that we could stuff the video PLLs into the VDD/GND pin pairs that precede each set of 4 I/O's that an adjacent cog can drive the DACs of. This keeps things tidy.
Beau had the great idea of building 4 of the security fuses into each I/O pad structure, yielding 256, of which 128 are needed for the HMAC/SHA256 which validates the loader (code protect).
The corner pads for XI and XO will implement the chip-wide PLL into themselves, maybe taking the corner.
This all gives a nice, big, open square for all the synthesized logic to drop into, with four sets per side of VDD and GND.
Chip,
This was checking 1465 spin files, which was the entire obex as of august 2012. The counts are for files that contained the instruction, not for how many times the instructions occur. There are a lot of spin files that don't contain any of the instructions. I figured that it was more important how many different objects used the instruction at all.
I was surprised to find that TJZ was used a fair bit, even more than TJNZ. Of course, DJNZ was used a lot more than either T variant. I was also happy to see that I was right about my other choices, ABSNEG & SUBABS are never used, and ADDABS was used once. You were right about CMPSX, but also ADDSX, SUBSX, & CMPX are never used.
It is also common these days to have VCCIO spread around the part, and to have groups of those allowed to be different.
Is that the case here ? or are all the VCCIO internally bussed up on the die ?
It is not so common to have a lot of sprinkled VCore pins, usually just enough to handle the expected current.
Sprinkled VCore pins makes PCB routing harder.
That's just 12 pins for RST.X1.X0.VCore and VccIO
One VccIO per 8 io pins needs 11, VCore probably needs at least 4, maybe 8.
Something is simply not adding up ?
Every four I/O pins have their own VIO supply pin (3.3V) that is separate from all other groups. This keeps analog signal groups partitioned so that there's no crosstalk outside of groups of four. This is a little excessive, but will be very safe. We may be able to get by with fewer VDD pins, too, but this arrangement is very conservative and will help minimize switching noise on the internal power grid. If we cut all the VIO and VDD pins in half, we could get 16 more I/O pins. That could mean another 4 cogs, too, for 20, total. I do feel the pull for more I/O to support SDRAM.
Does anyone have any data on how other chips apportion VDD pins based on their core current? I know on very high-current chips, they place power pads right in the middle of the die and attach the package directly to them.
Chip,
This was checking 1465 spin files, which was the entire obex as of august 2012. The counts are for files that contained the instruction, not for how many times the instructions occur. There are a lot of spin files that don't contain any of the instructions. I figured that it was more important how many different objects used the instruction at all.
I was surprised to find that TJZ was used a fair bit, even more than TJNZ. Of course, DJNZ was used a lot more than either T variant. I was also happy to see that I was right about my other choices, ABSNEG & SUBABS are never used, and ADDABS was used once. You were right about CMPSX, but also ADDSX, SUBSX, & CMPX are never used.
I remember putting those in there because they completed the adder functions and were free, so to speak. We could reuse many of those spaces now without much fear. This was a really valuable bit of research you did here. Thank you!
I think a 16x16 multiplier would be good, per cog. Any thoughts on whether that would be precise enough? 16x16 yields a convenient 32-bit result, at least.
This (And hardware CORDIC) are what I was looking for to leverage all of that potential analog I/O!
Every four I/O pins have their own VIO supply pin (3.3V) that is separate from all other groups. This keeps analog signal groups partitioned so that there's no crosstalk outside of groups of four. This is a little excessive, but will be very safe. We may be able to get by with fewer VDD pins, too, but this arrangement is very conservative and will help minimize switching noise on the internal power grid. If we cut all the VIO and VDD pins in half, we could get 16 more I/O pins. That could mean another 4 cogs, too, for 20, total. I do feel the pull for more I/O to support SDRAM.
Does anyone have any data on how other chips apportion VDD pins based on their core current? I know on very high-current chips, they place power pads right in the middle of the die and attach the package directly to them.
Confused again.
Why is the number of Cogs bound to the number of pins?
Like Bill, I would be fine with just having more pins brought to the outside.
Enjoy!
Mike
There is no hard rule for cogs-to-pins, but each cog can drive four DAC outputs very quickly, so it would be a shame to have pins that weren't matched to cogs. That's all.
If we had 32 cogs, we could have two cogs for each set of four I/O pins, and their DAC output values could be OR'd together, enabling them to tag team on the DACs. This could be good for video. We'd need a finer process for that, though, because the power could get out of control.
As it is shaping up, 70% of the synthesized block will be RAM, with 30% left for logic. This is why we can't go to 1MB - there's no room.
It is also common these days to have VCCIO spread around the part, and to have groups of those allowed to be different.
Is that the case here ? or are all the VCCIO internally bussed up on the die ?
It is not so common to have a lot of sprinkled VCore pins, usually just enough to handle the expected current.
Sprinkled VCore pins makes PCB routing harder.
That's just 12 pins for RST.X1.X0.VCore and VccIO
One VccIO per 8 io pins needs 11, VCore probably needs at least 4, maybe 8.
This is great news Chip! It's exactly what so many of us had hoped for.
As to code compatibility with the P1 - I don't believe binary level compatibility is necessary, but assembler level compatibility is extremely desirable.
As to what new functionality you choose to add in and leave out, I'm sure you will make the correct final decisions if you keep in mind the principles that the P1 embodied - i.e. simplicity, orthogonality, versatility and fun!
Also not wanting to be a pain but having 12 I/O left after external RAM, VGA and a couple of serial ports pretty much means you have to use two chips to make a complete system. An extra 24 I/O would bring us back up to P1 level of available I/O with the extra RAM which I think, in a lot of controller applications, would be a must.
So because of several people's reactions to the numbers, I went and looked some more, and it appears that grep is failing me. I used grep -i -w -l instruction *.spin but it's clearly failing to list files that have mov in them.
I suspect that the issue is that most of the files are in unicode form, and grep is not handling that. I need to find a grep that can read unicode.
So, I apologize, but the data is wrong. I'll see if I can get new data asap.
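For anyone wanting to reproduce the count without depending on grep's encoding handling, here is a rough Python sketch. It assumes, as suspected above, that the problem files are UTF-16 with a byte-order mark; the function names and the Latin-1 fallback are my own illustrative choices, not part of Roy's tooling.

```python
import re
from pathlib import Path

def decode_spin(raw: bytes) -> str:
    """Decode a Spin source file that may be UTF-16 (with BOM) or 8-bit text."""
    if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
        # Assumption: a UTF-16 BOM marks the "unicode form" files.
        return raw.decode("utf-16", errors="replace")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")  # never fails; a guess for legacy files

def files_containing(instruction: str, folder: str) -> list[str]:
    """List *.spin files that contain the instruction as a whole word,
    ignoring case (roughly what grep -i -w -l does)."""
    pattern = re.compile(rf"\b{re.escape(instruction)}\b", re.IGNORECASE)
    hits = []
    for path in sorted(Path(folder).glob("*.spin")):
        if pattern.search(decode_spin(path.read_bytes())):
            hits.append(path.name)
    return hits
```

Like the original grep, this still cannot tell Spin keywords from PASM ones, and it will also match the word inside comments.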
Comments
I wonder how many times peripheral-type code actually needs a multiply. Heck, lots of us here came from an era where all we had was 8-bit adders.
Yes 16x16 is good enough for most applications. You always can make 16x32 or 32x32 with the 16x16 and some shifting and adding.
I see the biggest incompatibility not in the bit encoding or in some missing instructions, but in the changed timing with 2-cycle instructions. Quite a few objects rely on the exact timing of the instructions. For example, they use a counter to generate the bit clock while the instructions shift the data out, and similar things.
All I am saying is that you can't expect all the objects to work without modification. I certainly want the faster instructions.
Andy
What happened to the single, shared maths-block idea ? That seemed to have great merit.
You could check the Power & Die costs with/without that easily enough ?
I presume this will also include a Parallel mode (just skips the DACS, drives up to 32 pins in parallel , with strobe ) ?
Or users could use COG Gearing set to 50%, to emulate a 4 cycle opcode. That could be the default.
This COG gearing also controls the power envelope, which I am sure is still going to be an issue.
Note:
- XI and XO in the current IO library are made into one pad (upper left block)
- Since the GND is connected to the bottom of the package, the GND's still needed to be brought out from the die. So for every VIO and VDD there is a corresponding GIO and GND on the attached image. From the Package perspective this is still a 100 pin package, but from the die itself there are a total of 132 pads. 32 of which are either GND or GIO and are connected directly to the bottom of the package.
Reference Image from Top Post
http://forums.parallax.com/attachment.php?attachmentid=108014&d=1396730333
This is not believable. I think Roy made some mistake there. OK, he stated (emphasis from me):
But even then this would come down to just 130 files containing PASM in the OBEX. I can not imagine a PASM file without MOV.
confused!
Mike
Beau and I discussed that we could stuff the video PLLs into the VDD/GND pin pairs that precede each set of 4 I/O's that an adjacent cog can drive the DACs of. This keeps things tidy.
Beau had the great idea of building 4 of the security fuses into each I/O pad structure, yielding 256, of which 128 are needed for the HMAC/SHA256 which validates the loader (code protect).
The corner pads for XI and XO will implement the chip-wide PLL into themselves, maybe taking the corner.
This all gives a nice, big, open square for all the synthesized logic to drop into, with four sets per side of VDD and GND.
I have a dumb question.
I see each group of 25 pins has:
- 1 misc pin
- 6 Vio
- 6 Vcore
- 12 I/O pins
Correction: (Thanks jmg for catching the error!)
- 1 misc pin
- 4 Vio
- 4 Vcore
- 16 I/O pins
Now that we have the honking ground pad, is it really necessary to have a separate vio and vcore for almost every two I/O pins?
The reason I ask is that if it would be OK to halve that, with one Vio and Vcore for every four pins, we would get 4 sides * 6 pins... 24 I/O's back, and be at 88 80 I/O's - MUCH better.
I am NOT trying to be a pain, I just don't understand why a separate Vio and Vcore is needed for every two I/O pins. And I love to learn.
TQFP-100 with 88 80 I/O... I would love that.
Also not wanting to be a pain but having 12 I/O left after external RAM, VGA and a couple of serial ports pretty much means you have to use two chips to make a complete system. An extra 24 I/O would bring us back up to P1 level of available I/O with the extra RAM which I think, in a lot of controller applications, would be a must.
I think we should keep you. Pain or NOT. But 24 more pins would be marvelous.
Very good question.
Enjoy!
Mike
It is also common these days to have VCCIO spread around the part, and to have groups of those allowed to be different.
Is that the case here ? or are all the VCCIO internally bussed up on the die ?
It is not so common to have a lot of sprinkled VCore pins, usually just enough to handle the expected current.
Sprinkled VCore pins makes PCB routing harder.
That's just 12 pins for RST.X1.X0.VCore and VccIO
One VccIO per 8 io pins needs 11, VCore probably needs at least 4, maybe 8.
Something is simply not adding up ?
This was checking 1465 spin files, which was the entire obex as of august 2012. The counts are for files that contained the instruction, not for how many times the instructions occur. There are a lot of spin files that don't contain any of the instructions. I figured that it was more important how many different objects used the instruction at all.
I was surprised to find that TJZ was used a fair bit, even more than TJNZ. Of course, DJNZ was used a lot more than either T variant. I was also happy to see that I was right about my other choices, ABSNEG & SUBABS are never used, and ADDABS was used once. You were right about CMPSX, but also ADDSX, SUBSX, & CMPX are never used.
Every four I/O pins have their own VIO supply pin (3.3V) that is separate from all other groups. This keeps analog signal groups partitioned so that there's no crosstalk outside of groups of four. This is a little excessive, but will be very safe. We may be able to get by with fewer VDD pins, too, but this arrangement is very conservative and will help minimize switching noise on the internal power grid. If we cut all the VIO and VDD pins in half, we could get 16 more I/O pins. That could mean another 4 cogs, too, for 20, total. I do feel the pull for more I/O to support SDRAM.
Does anyone have any data on how other chips apportion VDD pins based on their core current? I know on very high-current chips, they place power pads right in the middle of the die and attach the package directly to them.
I remember putting those in there because they completed the adder functions and were free, so to speak. We could reuse many of those spaces now without much fear. This was a really valuable bit of research you did here. Thank you!
So as of August 2012, just 130 files out of 1465 contain ANY PASM? (All files with MOV were counted. Is any PASM without MOV thinkable?)
Could you please recheck that? I still cannot grok it.
Enjoy!
Mike
This (And hardware CORDIC) are what I was looking for to leverage all of that potential analog I/O!
I could sell a couple dozen of these tomorrow....
Confused again.
Why is the number of Cogs bound to the number of pins?
Like Bill, I would be fine with just having more pins brought to the outside.
Enjoy!
Mike
There is no hard rule for cogs-to-pins, but each cog can drive four DAC outputs very quickly, so it would be a shame to have pins that weren't matched to cogs. That's all.
If we had 32 cogs, we could have two cogs for each set of four I/O pins, and their DAC output values could be OR'd together, enabling them to tag team on the DACs. This could be good for video. We'd need a finer process for that, though, because the power could get out of control.
As it is shaping up, 70% of the synthesized block will be RAM, with 30% left for logic. This is why we can't go to 1MB - there's no room.
I must have messed up the math somewhere! Or mis-read the tiny print on the pic.
Hmm... 64 I/O/4 means I must be wrong, there has to be 16 I/O per side.
25 pins - 16 = 9 pins, 1 misc leaves eight.
Got it! I mis-counted Vio + Vcore, there are 4 + 4, NOT 6 + 6.
Thank you for catching my error.
If we can reduce to 2 + 2, then we get 4 more I/O per side, 16 total.
80 I/O total.
I'll go fix my silly mistake now.
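The pin arithmetic above can be checked with a few lines of Python. This is only a sketch of the counting in this post: 25 pins per side, 4 sides, and 1 misc pin per side are taken from the discussion, while the function name is made up for illustration.

```python
PINS_PER_SIDE = 25   # TQFP-100: 100 pins over 4 sides
SIDES = 4
MISC_PER_SIDE = 1    # one misc pin per side, as counted in the post

def total_io(vio_per_side: int, vcore_per_side: int) -> int:
    """I/O pins left on the whole package for a given power-pin count per side."""
    io_per_side = PINS_PER_SIDE - MISC_PER_SIDE - vio_per_side - vcore_per_side
    return io_per_side * SIDES

print(total_io(4, 4))  # current floor-plan: 16 I/O per side
print(total_io(2, 2))  # halved power pins: 20 I/O per side
```

With 4 + 4 power pins per side this gives 64 I/O, and with 2 + 2 it gives 80, which is the 16 extra I/O mentioned above.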
As to code compatibility with the P1 - I don't believe binary level compatibility is necessary, but assembler level compatibility is extremely desirable.
As to what new functionality you choose to add in and leave out, I'm sure you will make the correct final decisions if you keep in mind the principles that the P1 embodied - i.e. simplicity, orthogonality, versatility and fun!
Ross.
I think this will be a lot of fun!
ONLY 16 extra I/O... which would still hugely help!
Bad news: I goofed reading the pinout pic.
Good news: 80 I/O's may still be possible
I suspect that the issue is that most of the files are in unicode form, and grep is not handling that. I need to find a grep that can read unicode.
So, I apologize, but the data is wrong. I'll see if I can get new data asap.