Consulting the crystal ball: What comes after the Prop2?
markaeric
Posts: 282
First of all, it's great to be back again. I wish I could say that I travelled around the world, or had been abducted by aliens, but unfortunately that wasn't the case. Besides that, I've been spending the past several days finishing up a propeller project where I left off about a year ago, which I hope to share with you all soon.
Anyway, I've read about the progress made on the P2, and it looks like it's going to be quite a beast. So beastly in fact, that I don't understand half the features! It's intimidating to say the least. Now, I know it's almost blasphemy to talk about a follow-up to a product that hasn't even been released yet, but bear in mind that this discussion is intended only in good fun, so please don't demand my head on a platter!
Of course, the natural assumption is that the next silicon after the P2 would be a bigger and badder P3 (that is, if Chip isn't sick of making processors by then, and instead picks up gardening), but I would actually like to sort of see the opposite. If you didn't already want to see my head roll, I'm sure you do now, but hear me out. I think there is a comfy spot in between the P1 and the P2, which I'll refer to as the P1.5. Nothing too fancy, but a few modest upgrades. In a perfect world where we could all get what we want, here would be the things that I would want:
Essentially a good ol' P1 manufactured on a smaller process. This would allow for a higher clock speed, and a bunch more hub ram.
64 P1-esque GPIO
one or two 32-bit SERDES per cog with clock generation/capture
Upgraded video engines - up to 16-bit color VGA/composite
inter-cog comm port like on the P2, maybe wider
A shared single cycle math engine similar to that on the P2 accessible via the hub, or more preferably its own hub out of phase with the memory hub (so you don't have to choose between memory or math engine access). Possibly with programmable bandwidth.
And possibly the much hated programmable hub bandwidth.
That's my happy medium. Now if I could only sell a billion propeller-based products, maybe I could convince Parallax to make my custom chip!
Comments
I think there is a definite place in the market for P1.5 on the low power 180nm process having port B, MUL, and DIV implemented, but I'm not sure it would be worth Parallax's while to move in that direction post-P2.
And P3? Maybe 64-bit architecture? Then source and destination PASM instruction fields could have 24 bits - up to 16 M entries in cog ram. And then some GB of hub ram... And 1 GHz clock... all of this with 1W power consumption.
In 2020 this may be easy...
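As a rough sketch of what such a wider instruction word could look like: the two 24-bit source and destination fields come from the post above, while the 16-bit opcode/flags field, the field order, and the names `pack` and `unpack` are assumptions made purely for illustration, not any real Parallax encoding.

```python
# Hypothetical 64-bit instruction layout. Only the 24-bit source and
# destination widths come from the post; everything else is assumed.
ADDR_BITS = 24
ADDR_MASK = (1 << ADDR_BITS) - 1
OPCODE_BITS = 64 - 2 * ADDR_BITS          # 16 bits left for opcode/flags

# Number of cog RAM entries a 24-bit field can address (16 M).
ADDRESSABLE_ENTRIES = 1 << ADDR_BITS


def pack(opcode_and_flags, dst, src):
    """Pack an opcode/flags field and two 24-bit addresses into one 64-bit word."""
    if not 0 <= opcode_and_flags < (1 << OPCODE_BITS):
        raise ValueError("opcode/flags field must fit in 16 bits")
    if not (0 <= dst <= ADDR_MASK and 0 <= src <= ADDR_MASK):
        raise ValueError("addresses must fit in 24 bits")
    return (opcode_and_flags << (2 * ADDR_BITS)) | (dst << ADDR_BITS) | src


def unpack(word):
    """Split a 64-bit word back into (opcode_and_flags, dst, src)."""
    opcode_and_flags = (word >> (2 * ADDR_BITS)) & ((1 << OPCODE_BITS) - 1)
    dst = (word >> ADDR_BITS) & ADDR_MASK
    src = word & ADDR_MASK
    return opcode_and_flags, dst, src
```

With this split, 64 - 24 - 24 leaves 16 bits for everything else, which is slightly more than the P1's 14 non-address bits.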
P2 needs dual power supplies, 3v3 for the I/O and 1v8 for the core. Parallax has already said there won't be a DIP version.
P2 designs also won't have the battery life of P1, which is one thing that will keep the P1 a viable product for some time.
Why should larger be the 'natural assumption' ?
That is jumping in two directions at once - the present Prop 2 is die-limited in the 128 pin package, so throwing in MORE features on a wish list, is going to make the die even larger.
There is Room for a Prop 2-Tiny, but to get there, you need to decide what can be trimmed, not added.
** 64 P1-esque GPIO
Less I/O is a natural step, for a P2-Tiny.
** one or two 32-bit SERDES per cog with clock generation/capture
We all hope this (in some form) has made it into the p2 silicon. Cost is very low, and it could even share/overlay Timer-SFR's if needed.
** Upgraded video engines - up to 16-bit color VGA/composite
The P2 DACs should be there already ?
I'll add one more :
** Wide supply support, and on Chip core regulator.
The latest Microcontroller example of this I've seen supports 5V Vcc / 5V Io, and a 1.5V FLASH core, with 0.9V RAM-keep option
Now we come to what can be removed, to shrink a die ?
a) High speed DAC on every pin makes cell duplication easy, and could make for a great Universal Tester (with wider Vcc), but realistically, what high volume designs are going to use more than a handful of all those DACs?
I'd explore the die savings of limiting the number of pins having the full-suite of Area costly Analog Features.
b) Pseudo Static (hidden refresh DRAM) is much smaller than SRAM, so is a candidate for Main memory. This is already time-sliced, so even if another slot was added purely for refresh, that is a small overhead to pay for a SHIPLOAD more RAM.
c) It would be nice to make COGs asymmetric, but I'm unsure of the RAM:Opcode resource splits.
A problem here is that the COG memory has many access ports (so is large), but if it was under 50% of the total Cog, then it could be practical to Thread/Time slice one Cog's OpCode engine, to give a number of "Virtual Prop1 Cogs"
Given Prop 2 is x4 x2 (i.e. 8x) the speed of Prop1, you might even manage 1 P2.COG sliced to emulate 8 P1.Cogs. It needs 8x the Memory, so that each virtual prop1 cog runs in its own memory; just the core-resource is swapped on a deterministic basis.
Legacy code support becomes very easy.
What you gain here is that you need FEWER of these Smarter COGs, so hopefully you do make a die-gain.
That swap-allocate could even be Soft controlled, allowing a run-time choice of Number, or Speed - using feature d) would allow access to the 'spare' RAM if you chose speed.
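To make point c) a bit more concrete, here is a toy model of one shared opcode engine rotating across eight virtual cogs on a fixed schedule. It is only a sketch: `VirtualCog`, `run_time_sliced`, the 512-long RAM per virtual cog, and the one-instruction-per-slice rotation are all assumptions for illustration, and `step()` merely advances a program counter rather than executing real PASM.

```python
class VirtualCog:
    """One virtual P1-style cog: private RAM and program counter."""

    def __init__(self, cog_id, ram_longs=512):
        self.cog_id = cog_id
        self.ram = [0] * ram_longs   # each virtual cog runs in its own memory
        self.pc = 0
        self.executed = 0

    def step(self):
        # Stand-in for executing one instruction on the shared engine.
        self.pc = (self.pc + 1) % len(self.ram)
        self.executed += 1


def run_time_sliced(cogs, engine_cycles):
    """Hand the single shared engine to each cog in a fixed rotating order."""
    for cycle in range(engine_cycles):
        cogs[cycle % len(cogs)].step()


cogs = [VirtualCog(i) for i in range(8)]
run_time_sliced(cogs, 80)
# Every virtual cog received exactly one eighth of the engine's cycles.
```

Because the rotation is fixed, each virtual cog sees the same share of the engine on every pass, which is what keeps the timing deterministic.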
d) Prop COGs are powerful, but a little insular, and with a hard ceiling.
I'd look at some means to cross-port a small section of COG memory, with a sibling, to give high bandwidth task splitting.
In the simplest form, the read/write access of a small piece, would simply swap with a sibling.
e) To crack the ceiling issue, some means of hardware fast-reload, at a function call level, of any/all of COG memory.
Extending on c) those COG planes, could become like a cache, and one could even pre-load to better balance bandwidth.
Good software support is important for this type of approach, and you can divide-and-conquer this, with a slower SW patch solution, before you lock-in the faster hardware.
There are many wasted (unused) hub cycles, so refreshing using these unused cycles should be easy. I know of no code that can use all cogs simultaneously and require every hub access slot (8 cycles on the P2). Only a test program would be capable of this.
A refresh cycle could be made to occur from 80% onwards, waiting for a free slot.
Another idea would be that the hub be sectioned into 2 halves - a lower X bytes and an upper X bytes (eg 1MB & 1MB). Then the refresh could occur in the half that was not being accessed. Again, not all cogs are going to be accessing the same half.
I am sure there is a solution here just waiting to be found.
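One way to picture the "from 80% onwards, waiting for a free slot" idea is a small simulation over a trace of hub slots. This is a sketch under assumptions: the 100-slot refresh interval, the fallback of forcing a refresh at the deadline, and the name `schedule_refresh` are all invented here for illustration.

```python
def schedule_refresh(slot_used, interval=100, start_percent=80):
    """Count hidden and forced refreshes for a trace of hub slots.

    slot_used[i] is True when some cog claims hub slot i.
    A refresh becomes pending once start_percent of the interval has
    elapsed and is hidden in the first idle slot after that point.
    If no idle slot turns up before the interval expires, the refresh
    is forced and that slot is taken from its cog.
    Returns (opportunistic, forced).
    """
    threshold = interval * start_percent // 100
    opportunistic = forced = 0
    since_refresh = 0
    for used in slot_used:
        since_refresh += 1
        if since_refresh >= threshold and not used:
            opportunistic += 1       # refresh hidden in an unused slot
            since_refresh = 0
        elif since_refresh >= interval:
            forced += 1              # deadline reached: take the slot anyway
            since_refresh = 0
    return opportunistic, forced


idle_hub = schedule_refresh([False] * 800)   # no cog ever touches the hub
busy_hub = schedule_refresh([True] * 800)    # every slot is claimed
```

Only the fully busy trace, which is the "test program" case described above, ever pays for a forced refresh.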
1) For some time I have advocated for a "Super Cog". It would have banked cog memory. The upper 1KB could be bank switched. I no longer think it needs more hub cycles with the P2. But perhaps with the P2 architecture, overlaying from hub and LMM would be fine instead of requiring banked cog memory. Maybe this is moot on the P2.
2) Only 4 cogs really need all the extra video logic, though I am hoping that can be used in reverse to clock data in.
Maybe if the last 2 cogs could be replaced with 4 - 8 simpler cogs with just basic counters (similar to the P1 but without the video and tv logic, but with the added bonus of being able to clock in or out 32 bits, similar to the vga style logic). These cogs would have lower hub access cycles - the 2 accesses would be equally shared between these cogs. I think some extra simpler cogs capable of doing some serial type peripherals (serial, I2C, SPI, etc) would be a bonus.
I have mentioned this too. A fifo style of access would be fine.
This can be achieved in the P2 by using one of the extra internal pins. A small piece of code could be loaded into the cog that does a waitpeq on the internal pin (minimise power). By setting this pin from the main cog, the slave cog(s) could determine when to load a new piece of code to execute. Now, by software, the length of code to be loaded can be controlled. In fact, it could just use LMM from hub which would be really fast. Once done, the cog could return to its idle waitpeq loop.
Power: All the newer chips have found a way to minimise power usage. I am not sure how they are achieving this, but it needs to be investigated for the P3 (or P2 variant). Certainly the new vertical (3D) transistors Intel is pioneering are a big breakthrough.
Anyway, let's just get our hands on the P2 first. Once we can see what it can do we will have a better idea of what can be achieved.
Once you add more RAM, I think it is best to do that in COG sized amounts, and then the bank-switch you mention becomes a speed question.
It is a small (?) step to switch at high speed, and so time-slice all the more complex Opcode logic, to create two (or more) sub-speed COGS.
That slice % would be under user control.
Just have the ARM core, no serial, no SPI etc, etc. Just have lots of cogs in one or more Prop cores.
All peripherals would then be soft. That is the Prop philosophy after all.
I'm sort of glad to see the Raspi queue getting longer, even if it means I have to wait. That means that eventually there will be a huge number of users with the same hardware base and the same Linux platform, all sharing their experiences and solving all those niggly problems that will crop up. Not to mention inspiring the growth of hardware gizmos to add to it, along with the software to drive them. A bit like a grown-up Arduino ecosystem.
How about increasing cog ram and using a 32 bit register to provide indirect addressing as 16 bit source and destination addresses?
Why? I'd be worried if there weren't long discussions here on the forums. That means nobody cares.
so not an ideal match for DRAM (unlike main memory).
Discussions are indeed healthy, and there is no doubt we all care passionately about our Propellers around here. I'm just registering my observation that all the necessary elements are present for a very long thread (open ended question, Prop2 dreaming, and it's been a while since we forumistas have vented)...
Anyway better join in I suppose. What I like about the approach to Prop 2 design is the ability to throw computing time to make a faster chip. Just give it to the ex-cray guys and gals and keep crunching for faster results. That's really neat. Who's to say they couldn't keep crunching away using idle time and have a chip 30-50% faster a year or two down the track, with minimal additional design effort?
I think Chip mentioned a couple of things that may not be added to Prop2 that we might be expecting. I take that (kind of compromise) as a good sign that we are getting closer to real product. But perhaps Prop 3 can look at adding those features back in again
How long will it take us to find out all the things we could do with a P2??? We have not even found all the neat things we can do with the P1 yet.
- a DOS box emulation with Gravis Ultrasound
- Amiga and Atari ST emulation
- USB and Ethernet ports
- mp3/flac/ape/ogg decompressing
- ATA HD interface
- GEM/KDE/Win like GUI OS
- fpc/Lazarus
- first since Atari ST MIDI system without delays
- wavetable synth
- ...
- ...
Instead let's add as much dynamic RAM to each cog as possible. Dynamic is much smaller and with a dynamic ram area per cog we don't need all that quad porting hardware either.
Ah, you say, cog instructions only have 9 bit address fields and that sets the cog ram size limit.
Well not really. Without the #, cog memory access works through another register with a 32 bit range, and so 4G longs of data can be accessed easily.
Ah, you say, code execution won't work; jmps and such need those 9 bit addresses unless we use another long for each jump address, which is wasteful.
No problem. Perhaps restrict code execution to 512 longs as usual and use an lmm style to fetch code to execute from outside that range. Fetching will be faster than lmm from hub as there is no hub waiting. With an auto-incrementing register to use as the program counter, things start to fly. Or better yet have that lmm loop done in hardware.
Allow for cogs to access external RAM via a proper hardware bus. Probably only one cog would be granted this power.
Ah, you say, without hub ram cogs cannot communicate.
OK, have a small, 1K, hub for that or put in some cog to cog comms hardware.
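The lmm-style fetch loop with an auto-incrementing program counter can be sketched as a tiny interpreter. Everything concrete here is an assumption for illustration: the three-operation instruction set, the `(op, arg)` tuple format, the dict used as sparse memory, and the name `run_lmm` have nothing to do with real PASM.

```python
COG_WINDOW = 512   # longs directly executable, limited by 9-bit address fields


def run_lmm(big_ram, start, max_steps=1000):
    """Fetch and execute instructions stored outside the 512-long window.

    big_ram maps a long address to an (op, arg) tuple.
    Returns the accumulator value when a "halt" is reached.
    """
    pc = start
    acc = 0
    for _ in range(max_steps):
        op, arg = big_ram[pc]
        pc += 1                      # auto-incrementing program counter
        if op == "add":
            acc += arg
        elif op == "jmp":
            pc = arg                 # full-width target, no 9-bit limit
        elif op == "halt":
            return acc
        else:
            raise ValueError(f"unknown op {op!r}")
    raise RuntimeError("step limit reached without halt")


# Code placed far beyond the 512-long window, with a long-range jump.
program = {
    100_000: ("add", 5),
    100_001: ("jmp", 2_000_000),
    2_000_000: ("add", 7),
    2_000_001: ("halt", 0),
}
result = run_lmm(program, 100_000)   # 5 + 7 = 12
```

In software this loop costs several instructions per fetched instruction, which is why doing it in hardware, as suggested above, is attractive.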
- Larger cog memory
- More efficient and flexible hub access mode
- 1080p60 display capability with external DRAM
@localroger
I do vaguely recall Chip or Beau mentioning that the P1 wouldn't scale to a smaller process, though I imagine that his AHDL code (or whatever he used) could be used for the automated layout like they're doing with the P2. I also recall Chip asking the guys that run the layout tools to amuse him and tell him what the P2 on a 45 nm process would run at. Obviously, 1GHz is amazing, but that puts it in a whole other ballgame where it's more along the lines of an application processor than a microcontroller. Besides probably locking out hobbyists and even some developers, I can't see how well its architecture would stack up against its new competition.
RE: pin reduced P2.. Could definitely be interesting. Let's say a 64 IO device, which is a reduction of 28 pins. I believe Beau said there are about 6000 transistors per pin, which would translate to more than 5.2k of hub memory assuming 4 transistors per bit (I say more because there's also all the die space used for routing). It may not sound like much, but I'm sure many of us would be happy with an additional 5k of hub memory on the P1.
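For anyone who wants to check the arithmetic, here it is spelled out. All three inputs are the figures quoted in the post (the per-pin transistor count is recalled from memory there, so treat it as approximate); the variable names are just for this sketch.

```python
PINS_REMOVED = 28            # trimming down to a 64 I/O device
TRANSISTORS_PER_PIN = 6000   # approximate figure attributed to Beau
TRANSISTORS_PER_BIT = 4      # the post's assumption for one bit of RAM

freed_transistors = PINS_REMOVED * TRANSISTORS_PER_PIN    # 168,000
extra_bits = freed_transistors // TRANSISTORS_PER_BIT     # 42,000
extra_bytes = extra_bits // 8                             # 5,250

print(f"{extra_bytes} bytes, about {extra_bytes / 1000:.2f} kB")
```

That gives 5,250 bytes before counting any die area recovered from routing.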
Probably not, but Chip did mention that at least the core logic was more efficient than the P1 because he was utilizing some sort of gating.
Because that's the American way!
I didn't say anything about adding features to the P2, but rather to the P1, which if manufactured on the same process would probably have a MUCH smaller die size. So adding maybe some of the features from the P2 (like one shared math engine, and cog-based SERDES) shouldn't add too much. So if the process and available die size was the same, aside from the little bit of real estate taken up by a few extra features, the rest could go into hub memory, which I would guess would be substantial amount.
I think DRAM is a scary proposition (no pun intended) for a microcontroller that should generally be energy efficient. Then again, the massive gains in RAM would be fantastic. I suppose hub memory could be split into two chunks - one chunk static, and the other dynamic, with the dynamic portion having the option to be disabled.
Time slicing with shadow cog ram would only work if you didn't need all the counters/video generators/etc. that are shoved in each cog. Having to replicate those as well would be unlikely to result in any substantial savings.
I had previously been a fan of the super-cog idea, but I just don't see any practical way to accomplish this without there being some significant kludge. Banked memory? Ugh! Different architecture with its own unique opcodes? Ugh! I don't see any winning.
Question: How much cog ram would make us happy? I know, 75847584TB. lol.. It's been mentioned before that the easiest way would be to increase the LONGS to a non-standard 33 or 34-bits. The 4-port cog ram would severely cut down on hub space, but I guess it could be partially made up using dram.
I have always advocated the inter-cog comm port. I would prefer it to be more than 32-bits wide though, so we can have async communications between cogs with at least one of the channels being 32-bits wide.
I wish someone would make a vanilla ARM chip (Preferably the Cortex A5) with nothing but external memory and DMA interface. I guess there's really no substantial need for that in the market. Some of the ARM9 parts do come close, but I doubt they would perform substantially better than a P2 utilizing LMM.
Hey now! I like hub ram!
If we have too much cog ram and under utilize it, then it's an even bigger "waste" than hub ram. Ideally, we would have memory that's 8x the core clock so all our problems disappear! Of course, that's not going to happen.
How so? No matter how fast the HUB RAM is we have 8 cogs accessing it. They have to wait for each other.
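A quick way to see the waiting is to model the hub as a fixed rotation of slots. This is only a sketch: `clocks_until_slot`, the default of one clock per slot, and the assumption that cog N owns slot N are illustrative simplifications, not measured P1 or P2 timing.

```python
def clocks_until_slot(now, cog_id, n_cogs=8, clocks_per_slot=1):
    """Clocks a cog waits from clock `now` until its next hub slot begins.

    The hub is modelled as a fixed rotation in which cog `cog_id`
    owns the slot starting at cog_id * clocks_per_slot in each rotation.
    """
    rotation = n_cogs * clocks_per_slot
    slot_start = cog_id * clocks_per_slot
    return (slot_start - now) % rotation


# Worst case over one full rotation for cog 3.
worst_8 = max(clocks_until_slot(t, 3) for t in range(8))
# Same cog when each slot lasts two clocks (a 16-clock rotation).
worst_16 = max(clocks_until_slot(t, 3, clocks_per_slot=2) for t in range(16))
```

In this model the worst case is always one clock short of a full rotation, so a faster hub RAM shortens the rotation but never removes the turn-taking.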
My next choice after that would be a P4x32A, with 32 I/O pins, and another 32 I/O pins to make it look like an SRAM so it could be connected to a larger CPU as a peripheral. I imagine the external CPU taking every other slot on the hub. In a lot of ways, the Propeller makes CPLD-like functions accessible to programmers.
Xilinx has FPGAs with Cortex-A9's in the middle, and even Microchip has some (minimal) programmable logic now (Google PIC CLC).
?? This is pulling in three different directions at once.
If you keep the same large process, and same timing, you have gone nowhere ?
Also, the P2 tools and IP, and months of tuning, is all based on a NEW process and 1 CLK core, so that has to be the 'point of commonality'.
You can easily allow for optional slow-down timing 'compatible' modes - that's very common now in core upgrades.
Something like a 'Take a P2 and keep throwing stuff overboard until it fits in a P1 package, and see what results' path could yield a P1.5 ?
The 1.2V core gives you Energy efficiency, and an 80MHz Prop1 is drawing significant power.
If you meant Static Sleep modes, then yes, DRAM is not as good as SRAM (but could be as good as any WAITxx in Prop1)
With 90+ pins supporting DACs, and a 128 pin part optimized for speed, it is not clear what the P2 static Icc could be anyway, and the general trend these days is to not ask too much of one chip.
If you really need deep sleep, add a tiny micro that does that properly, and re-start the big-iron as needed.
A 32 bit counter register is tiny, and the Adder can be Mux'd - the real cost in video generation, is at the pins.
If the final chip could clock up higher because of this change, that would be nice, but I wouldn't count that as a design goal. Reimplement P1 with P2 design process, work on gating/process to get power draw (static and dynamic) as low as possible, then see how high that'll clock.
There are a few concepts that have been learnt from P2 that could likely be ported back to a P1 derivative.
1) Hub cycle reduction from 1 in 16 to 1 in 8.
2) Perhaps a small speed improvement by fixing/changing the constraints that are limiting the overclocking. We are achieving ~110MHz in reality, but the PLL or internal clocking seems to be one of the limiting issues.
3) Implement the B port in an 84 PLCC for 64 I/Os. This requires consideration of how to access the second 32 I/O, as using the C flag may break existing code. Perhaps an instruction to flip between them (for each cog), somewhat similar to P2. Presuming the die fits, make a 40 DIP which could use the B port for internal comms (like the additional internal port on the P2).
4) Add a few gates to give the VGA circuit the additional ability to input the bits from a pin(s) (i.e. deserialiser)
5) Add more hub RAM & delete most ROM - minimum of 62KB+2KB ROM (similar concept to P2) and boot from SPI (Flash or SD card).
6) Add another 32bit CNT register for overflow from the first CNT register, and an instruction to access it.
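Item 6 raises a practical point: a 64-bit count read through two 32-bit registers can tear if the low word rolls over between the two reads. A common software workaround (not something stated in the list above) is to read high, low, then high again and retry if the high word changed. `Counter64`, `read64`, and the `between_reads` hook are hypothetical names used only to simulate that here.

```python
MASK32 = 0xFFFF_FFFF
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


class Counter64:
    """Simulated CNT register plus a hypothetical overflow register."""

    def __init__(self, value=0):
        self.value = value & MASK64

    def tick(self, n=1):
        self.value = (self.value + n) & MASK64

    def read_low(self):
        return self.value & MASK32

    def read_high(self):
        return self.value >> 32


def read64(counter, between_reads=lambda: None):
    """Read high, low, high; retry if the high word changed mid-read."""
    while True:
        high = counter.read_high()
        between_reads()              # simulation hook: time passes here
        low = counter.read_low()
        if counter.read_high() == high:
            return (high << 32) | low


# Simulate the low word rolling over between the first two reads.
counter = Counter64(0xFFFF_FFFF)
pending_ticks = [1]


def advance_once():
    if pending_ticks:
        counter.tick(pending_ticks.pop())


value = read64(counter, advance_once)
```

A naive high-then-low read in the same situation would combine the old high word with the new low word and come back as zero.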