Yes, but becoming niche...
For the last monitor I bought, I had to pounce on the last run-out deal with a VGA connector; the newer ones no longer had the HD15.
NTSC/PAL are still big in security and small camera (car backing camera) uses.
Well, I also prefer having 8 cogs and keeping the 512KB of RAM. 16 cogs would be overkill for just 128KB of RAM. It is more important to fit lots of code, plus the variables generated during program execution. The processing power per cog will be a significant improvement in itself, and if the hub latency decreases, better yet.
I just sent OnSemi the new Verilog file set with the CORDIC's k-factor adjustment in 16 new stages, instead of being lumped in with the iteration stages, which was slowing things way down.
Yesterday, they sent me a timing report for the synthesis at 166MHz. I attached it so you can see. It shows the slowest paths. They all exactly met timing because the tool optimized them to exactly that point. If you want 180MHz, the tool works harder and takes longer. Kind of like Play-Doh. If you ask for something unrealistic, it will run for a long time before giving up. The slower your timing requirement, the faster it gets the job done. It seemed to handle 166MHz pretty quickly, which is a good sign. By the way, the timing figures in that report are integers in picoseconds.
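For anyone reading a report like that, here is a minimal sketch of how a clock target translates into a picosecond budget. It assumes the report lists path delays as plain integers in picoseconds, as described above; the function names `period_ps` and `slack_ps` are made up for illustration and are not taken from any synthesis tool.

```python
def period_ps(freq_mhz: float) -> int:
    """Clock period in whole picoseconds for a target frequency in MHz."""
    # A 1 MHz clock has a period of 1,000,000 ps.
    return round(1_000_000 / freq_mhz)


def slack_ps(freq_mhz: float, path_delay_ps: int) -> int:
    """Positive slack means the path meets timing; negative means it fails."""
    return period_ps(freq_mhz) - path_delay_ps


if __name__ == "__main__":
    for f in (160, 166, 180, 200):
        print(f"{f} MHz -> {period_ps(f)} ps per clock")
```

A path that "exactly met timing" at 166MHz would show up here with a slack of zero, and the same path would be several hundred picoseconds short at 180MHz.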
Hi Chip,
Does this mean that 180MHz operation is not feasible, or just compromised? By the way, is it still possible to do 1 instruction per cycle, instead of the P1's 1 instruction per 4 cycles? This post really changed the game for me in two aspects: the reduced number of cogs, and now the clock frequency.
Samuel, before we're done with this, we will push the synthesizer to the limit of what is possible. I am hoping for 200 megahertz, but maybe 180 is more realistic. If we can even be 160, that is great.
In this architecture, instructions take two clocks.
Hmm, several sources mention one cycle per instruction. But, assuming 8 cogs at 160MHz, this gives us 80 MIPS per cog, which is 4x as fast as the current Propeller. Overall, not as big an improvement as initially expected (compared to 16 cogs at 200MHz, and 200 MIPS per cog according to some sources).
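The arithmetic behind those figures can be sketched as follows. It assumes the P1 runs at its usual 80MHz with 4 clocks per instruction, and takes the 2-clocks-per-instruction figure from Chip's reply; `mips_per_cog` is just an illustrative helper name.

```python
def mips_per_cog(clock_mhz: float, clocks_per_instruction: int) -> float:
    """Peak instruction rate of one cog, in millions of instructions per second."""
    return clock_mhz / clocks_per_instruction


p1 = mips_per_cog(80, 4)            # P1: assumed 80 MHz, 4 clocks per instruction
p2 = mips_per_cog(160, 2)           # P2 as discussed: 160 MHz, 2 clocks per instruction
rumoured = mips_per_cog(200, 1)     # the "200 MIPS per cog" figure from some sources

if __name__ == "__main__":
    print(f"P1 per cog: {p1} MIPS")
    print(f"P2 per cog: {p2} MIPS ({p2 / p1}x the P1)")
    print(f"8 cogs at 160 MHz: {8 * p2} MIPS total")
    print(f"16 cogs at 200 MHz, single cycle: {16 * rumoured} MIPS total")
```

These are peak numbers only; they ignore hub waits and any instructions that take more than the base number of clocks.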
Oh well, at least we have smart pins. Perhaps Parallax could consider releasing a version without smart pins, but having 16 cogs, because the raw processing power is also interesting.
200 MIPS/Cog was always dreaming, even for the Prop2-Hot ... maybe with nitrogen cooling. EDIT: That said, I was probably guilty of dreaming it way back then.
The main reason to have more Cogs is to farm out the timing sensitive tasks and just make it an easy job to build robots with. A compute platform, the Prop ain't.
Samuel,
Also keep in mind that the instruction set has been substantially (wonderfully!) improved. This can mean that what took many P1 instructions might only take 1 or 2 P2 instructions (your mileage may vary, of course), with a corresponding increase in throughput.
I would suspect that if you need 16 P2 cogs to process your data, then you might run into I/O bottlenecks first (you simply won't be able to get the data in to/out of the Propeller fast enough). This is assuming you're not having to do extensive processing on a relatively small data set.
Personally, I'll take 8 cogs with a side of smart pins.
And one thing I've always liked about the Propeller is that there are no variations. If your program runs on your Propeller, it will run on every other Propeller.
Walter
Hi Walter,
It always depends on your end application and which resources you want. I don't know if there are concrete plans, but I saw 5 variations of the P2, with varying numbers of pins (smart or not) and cogs, and different RAM sizes. I would prefer having fewer cogs if the RAM size is to be reduced from 512KB to just 128KB, but I can live without the smart pins for some of my applications. Nevertheless, having them allows applications like oscilloscopes and the like on a single chip. Let's say it is just another flavor. But if it is to be just one flavor, then it should have smart pins.
I'm not really seeing having only 8 cogs as a significant downside.
Sure, with the P1, I did run into problems of not having enough cogs multiple times. But that was largely because I had several COGs dedicated to managing I2C, UART, SPI, 1W, etc. These were such a problem for me that I recall once stumbling onto a P1 Object that could handle 4 full duplex serial buses on one COG. I nearly cried. I don't know what was involved in making that code but I am forever indebted to the author.
The P2 is a totally different story. Smart pins plus interrupts suddenly means that a single cog can manage multiple buses without having to do some kind of timing gymnastics that I assume a blood pact is involved in figuring out. Plus, integrated ADC probably means one less bus to worry about because an external ADC is not needed.
I have some ambitious plans for this microprocessor but I still don't really know what I'd do with more than maybe 5 cogs now that buses are going to be so much easier to manage.
Samuel, that was Prop2-Hot from a few years ago that had single-cycle execution. The design was very power-hungry and required three dual-port RAMs per cog, instead of one, for cog memory.
I miss that speed, too. What we have now, though, is more balanced.
I compiled our final 8-cog / 512KB-hub image for the Prop123_A9 board and this is the resulting slack histogram:
The paths in red are exceeding the time limit I've set of 10ns (100MHz). The Altera tool optimizes things pretty well, but you can see there are some paths, just a few, that inhibit the 10ns goal from being met. These are due not to logic delays, but interconnect delays. The ASIC tools in use for the actual Prop2 layout have much finer granularity and can smoosh these red paths to the right, getting them into the blue zone, and creating a vertical wall at the goal line. Pretty neat, I think.
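To make the idea of a slack histogram concrete, here is a small sketch that bins path delays against the 10ns goal mentioned above. The delay values are made up purely for illustration (they are not from the actual compile), and `slack_histogram` is a hypothetical helper, not part of the Altera tools.

```python
from collections import Counter

GOAL_PS = 10_000  # 10 ns timing goal (100 MHz), as stated in the post


def slack_histogram(path_delays_ps, bin_ps=500):
    """Count paths per slack bin; bins that start below zero miss the goal."""
    hist = Counter()
    for delay in path_delays_ps:
        slack = GOAL_PS - delay
        # Floor each slack to the start of its bin (works for negatives too).
        hist[(slack // bin_ps) * bin_ps] += 1
    return dict(sorted(hist.items()))


# Made-up path delays in picoseconds, for illustration only.
delays = [7200, 8100, 9400, 9900, 10000, 10300, 10800]
hist = slack_histogram(delays)
failing = sum(count for start, count in hist.items() if start < 0)

if __name__ == "__main__":
    for start, count in hist.items():
        tag = "red" if start < 0 else "blue"
        print(f"{start:>6} ps: {'#' * count} ({tag})")
    print(f"{failing} paths miss the 10 ns goal")
```

In these terms, the "vertical wall at the goal line" means the ASIC tools leave no paths in the negative bins and pile the critical ones up right at zero slack.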
These variants all use the 16-slot egg-beater, so there's really no need to constrain cogs to a power of two. We could have three cogs, if we wanted. Keeping the eggbeater as is keeps things simple and objects' timing consistent.
So, if we can't have 16, can we have 10 or 12 or 13?
See above, where the question "add a 9th cog if there is spare space" has this reply:
That would cause the framework to expand up to the 16-cog level, so it wouldn't be worth it. We are pretty much stuck with powers of 2.
Memory blocks == slot count, and I guess you might be able to have 16 memory cells, MUX'd on the least-significant nibble, with only (say) 9 COGs allocated and 7 empty/wasted slots. That doubles the HUB slot time for only one more COG?
More intriguing might be an egg-beater mapping with 8 COGs mapped to the 8 even slots (same as the 16-COG design) and the 9th given every odd slot, for a large boost in HUB memory bandwidth on that COG.
The smarter question here is probably how much extra memory that 9th/10th COG equates to.
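As a toy model of the even/odd mapping suggested above, the sketch below treats the egg-beater as nothing more than a 16-entry rotation of slot owners and counts how many slots each cog gets per rotation. This is a deliberate simplification for illustration: the real hub interleaves memory slices and is more involved, and the names `even_odd_schedule` and `slots_per_rotation` are invented here.

```python
SLOTS = 16  # slot count of the 16-slot egg-beater discussed above


def even_odd_schedule():
    """Cogs 0..7 own the even slots; a hypothetical 9th cog (id 8) owns every odd slot."""
    schedule = []
    for slot in range(SLOTS):
        if slot % 2 == 0:
            schedule.append(slot // 2)
        else:
            schedule.append(8)
    return schedule


def slots_per_rotation(schedule):
    """Number of hub slots each cog receives in one full rotation."""
    counts = {}
    for cog in schedule:
        counts[cog] = counts.get(cog, 0) + 1
    return counts


schedule = even_odd_schedule()
counts = slots_per_rotation(schedule)

if __name__ == "__main__":
    print("slot owners:", schedule)
    for cog, n in sorted(counts.items()):
        print(f"cog {cog}: {n} slot(s) per {SLOTS}-clock rotation")
```

Under this model the first eight cogs each see one slot per 16 clocks, the same as in a 16-cog design, while the ninth sees eight.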
Comments
Not as significant, but still as useful. Besides, there is gonna be some fun in all of that.
Hey, someone did a VR 3D display with a C64: https://hackaday.com/2017/09/14/hacked-headset-brings-vr-to-the-commodore-64/
Hilarious, and pretty awesome at the same time.
IMHO, those features will find good uses. It's like most things that ended up in there.
I'm super excited about the ADC/DAC capabilities myself.
With what noise floor numbers ?
Yes, but becoming niche...
For the last monitor I bought, I had to pounce on the last run-out deal with a VGA connector; the newer ones no longer had the HD15.
NTSC/PAL are still big in security and small camera (car backing camera) uses.
I meant aside from that.
Well, I also prefer having 8 cogs and keeping the 512KB of RAM. 16 cogs would be overkill for just 128KB of RAM. It is more important to fit lots of code, plus the variables generated during program execution. The processing power per cog will be a significant improvement in itself, and if the hub latency decreases, better yet.
Kind regards, Samuel Lourenço
Chip had previously gone through a number of iterations along those lines I believe. Won't be anything trivial remaining.
Maybe referring to the streamer?
Regards,
Rich
He's going to do some power estimations now.
"Cogs can start CORDIC operations every 8 clocks (previously 16) and get results 55 clocks later (previously 39)"
So cog slot comes around twice as often but each operation lasts longer. Overall a bit slower than before?
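Whether it is slower overall depends on how many operations are in flight. A rough sketch, assuming the quoted figures mean the start interval changed from 16 to 8 clocks and the latency from 39 to 55 clocks, and that a cog can issue independent CORDIC operations back to back at every slot (both are assumptions here, not confirmed figures):

```python
def total_clocks(n_ops: int, interval: int, latency: int) -> int:
    """Clocks from the first start until the last result, for n pipelined operations."""
    return (n_ops - 1) * interval + latency


OLD = (16, 39)  # assumed old figures: start every 16 clocks, result 39 clocks later
NEW = (8, 55)   # assumed new figures: start every 8 clocks, result 55 clocks later

if __name__ == "__main__":
    for n in (1, 2, 3, 4, 8):
        old = total_clocks(n, *OLD)
        new = total_clocks(n, *NEW)
        print(f"{n} op(s): old {old} clocks, new {new} clocks")
```

Under these assumptions a single operation is slower (55 versus 39 clocks), three operations break even, and anything longer than that finishes sooner with the new timing.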
That's 833mA at 1.8V.
It's going to be really cool.
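For scale, the current figure quoted above works out to roughly 1.5W. A one-line check, assuming the 833mA is drawn entirely from the 1.8V supply (`power_w` is just an illustrative helper):

```python
def power_w(current_ma: float, voltage_v: float) -> float:
    """Power in watts from a current in milliamps and a supply voltage in volts."""
    return current_ma / 1000 * voltage_v


core_power = power_w(833, 1.8)

if __name__ == "__main__":
    print(f"{core_power:.2f} W")
```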