P3 ideas

jmg · 2019-02-23 19:44

Maybe P3 needs Phase-Change memory ? - looks like MRAM has been skipped over... ?

https://www.st.com/content/st_com/en/about/innovation---technology/PCM.html
https://www10.edacafe.com/nbc/articles/1/1651034/STMicroelectronics-Introduces-Safe-Real-Time-Microcontrollers-Next-Generation-Automotive-Domain-Architectures

Impressive parts: "six Arm Cortex-R52 cores clocked at 400MHz, 16Mbytes of PCM, and 8Mbytes of RAM, all in a BGA516 package." plus "three Arm Cortex-M4 cores with a floating-point unit and DSP extensions to provide application-specific acceleration."

They've used RAM here to run the code (same as P1/P2/P3?) and use PCM as the boot storage.

evanh · 2019-02-23 22:10

PCM is an alternative to Flash. Both wear out on writes. Density is the priority. Its advantage over Flash will be speed.
MRAM replaces DRAM (also with a speed advantage) and large-block SRAM in places like CPU caches or embedded main memory. MRAM can perform all of these functions, but only where the highest density isn't important.

Cluso99 · 2019-02-23 22:33

New memory types are a matter of what OnSemi can offer, at what feature size, and at what royalty cost.

Currently OnSemi seem to not offer Flash or eeprom for the P2, so what hope is there for the cutting edge technology?

What I see as more interesting is the 1T or 1.5T RAM cells. That's a huge silicon saving if they can get it mainstream - but again, at what royalty cost?

evanh · 2019-02-23 22:41

I don't think JMG was thinking Prop3 really. I certainly wasn't replying in that fashion.

jmg · 2019-02-23 22:44

Cluso99 wrote: »

Currently OnSemi seem to not offer Flash or eeprom for the P2, so what hope is there for the cutting edge technology?

OnSemi can do Flash and EE, but those need more process steps, so add to the price, and worse, they are slower than SRAM.
To get the better MHz speeds, most vendors are going to loaded-RAM, and the smaller processes mean the cost of that RAM is at least tolerable.
I've seen others offer stacked die, where a common/vanilla/low cost SPI flash part is 'inside the plastic'.

Cluso99 wrote: »

What I see as more interesting is the 1T or 1.5T RAM cells. Thats a huge silicon saving if they can get it mainstream - but again, at what royalty cost?

Are those fast SRAM cells, or DRAM ?

Rayman · 2019-02-23 23:11

It'd be nice to have jpg and mp3 encode/decode ability in real time...

Maybe we do already with P2, not sure yet...

Cluso99 · 2019-02-23 23:41

jmg wrote: »

Cluso99 wrote: »

Currently OnSemi seem to not offer Flash or eeprom for the P2, so what hope is there for the cutting edge technology?

OnSemi can do Flash and EE, but those need more process steps, so add to the price, and worse, they are slower than SRAM.
To get the better MHz speeds, most vendors are going to loaded-RAM, and the smaller processes mean the cost of that RAM is at least tolerable.
I've seen others offer stacked die, where a common/vanilla/low cost SPI flash part is 'inside the plastic'.

Cluso99 wrote: »

What I see as more interesting is the 1T or 1.5T RAM cells. Thats a huge silicon saving if they can get it mainstream - but again, at what royalty cost?

Are those fast SRAM cells, or DRAM ?

Chip implied that Flash/EEPROM were not offered, not that there were additional process steps.
As for speed, the flash/eeprom could be serial loaded as they are now so that's not necessarily an issue.
Personally, I couldn't see why a 24C256 couldn't have been added to the P2 die, as it is made by OnSemi in the same ONC18 process. Cost didn't seem to be the issue here.

IIRC they are touting the 1-1.5T RAM cells as DRAM & SRAM replacements - they are static.

What I'd really like to see is the SRAM stacked in layers on top of the cpu layers. More production cost for sure but it's all relative to what's inside

evanh · 2019-02-23 23:48

I get the impression that layer count is a major issue for cost. So when Chip said something wasn't available he probably meant not within the minimum number of layers.

evanh · 2019-02-24 02:24

MRAM could excel as primary capture memory in digital storage scopes. 64 MB is a nice manageable amount that should fit on the same die along with management/display processor, sampling buffers and filtering compute units. No need for giant external SDRAM bus with its need for masses of SRAM buffers.

jmg · 2019-02-24 03:11

Some very recent MRAM news from Intel is here
https://www.extremetech.com/computing/286084-intel-confirms-its-22nm-finfet-mram-is-production-ready
https://www.tomshardware.com/news/intel-stt-mram-mass-production,38665.html

Still not shipping in mass-market MCUs, and Tom's link has 'weeks' in the retention-without-power column for MRAM, plus a finite read ceiling?

Cluso99 · 2019-02-24 03:40

Might just be ready for P4 in about 20 years

evanh · 2019-02-24 04:33

JMG, those numbers on Tom's are stupid. It has DRAM and Flash at the same density!

evanh · 2019-02-24 05:00

In terms of density, they're gaining ground. Everspin has announced 1 Gb parts on a 28 nm process (presumably by GlobalFoundries), which is not too far off the common 8 Gb DRAM parts in PCs.

evanh · 2019-02-24 06:08

I guess SLC Flash possibly is similar in density to DRAM. But Flash moved on from those days maybe 20 years ago. With a limited number of very slow, power-hungry writes, density is king.

Rayman · 2019-08-02 20:56

I guess we don't have to worry about derailing the P2 progress by suggesting things now...

A post in P1 forum about a bad ENC28J60 driver got me wondering if it should be possible to put limits on what a driver cog can do.
Maybe limiting access to certain pins and parts of HUB RAM would be a good idea...

Wuerfel_21 · 2019-08-02 21:27

Like a mask for DIRx that could be passed into the cognew?
Memory protection would prob. require an MMU of sorts to be flexible enough.

potatohead · 2019-08-02 22:16

evanh wrote: »

Chip! HDMI doesn't have analogue. You could call it full-HD video I guess.

EDIT: I wonder if component video on TV's might handle that. I note my cheapo TV doesn't have component video inputs.

Component TV officially goes to 1080i. Many take 1080p anyway.

3 pins, optionally just one for grey scale monochrome! This is my personal favorite. Super lean on resources, high performance.

I have yet to do it, but one can clock color at a different rate. Maybe save RAM.

It may make sense to send analog into an HDMI chip for high resolution use cases.

Analog sets allow you to play all kinds of games with the pixel clocks and resolutions. And it's all pixel perfect. Digital sets, given a stable signal, work really well at the various standard resolutions.

Ramon · 2019-08-04 13:13

Phil Pilgrim (PhiPi) wrote: »

For the Prop 3, I just want what could've been offered years ago in Prop 1.5: more counters per cog and more counter modes. Maybe higher speed. Don't even want more pins. Just that sweet, simple elegance of the Prop 1 architecture that makes programming such a pleasure!

-Phil

Yes, that would be the perfect P3 - If ever done - for a 110nm process.

But why on earth are we even thinking about something new?
I can hardly sleep while dreaming about some 4 cogs variants of P2 on 44-PLCC and 48-TQFP/64-TQFP.
... or a 2 cogs version in 32-TSOP/48-TSOP

MJB · 2019-08-04 17:40

Ramon wrote: »

Phil Pilgrim (PhiPi) wrote: »

For the Prop 3, I just want what could've been offered years ago in Prop 1.5: more counters per cog and more counter modes. Maybe higher speed. Don't even want more pins. Just that sweet, simple elegance of the Prop 1 architecture that makes programming such a pleasure!

-Phil

Yes, that would be the perfect P3 - If ever done - for a 110nm process.

But why on earth are we even thinking about something new?
I can hardly sleep while dreaming about some 4 cogs variants of P2 on 44-PLCC and 48-TQFP/64-TQFP.
... or a 2 cogs version in 32-TSOP/48-TSOP

yea - those smaller (and cheaper) ones would make great super powerful real-time IO-Co-Processors with their SmartPins.
Don't know how we would like memory to scale with less COGs. Keep it / reduce it ...
I am sure there are applications for both - so find the right mix.

but now let's play with the BIG one ... :-)
waiting for P2D2 (new) plus dev board from Peter ...

Rayman · 2019-08-06 14:38

Maybe it would be nice to have a switch that would prevent cogs from being restarted.
That way, one cog could function as a supervisor and couldn't be restarted by bad code....

Wuerfel_21 · 2019-08-06 14:47

Maybe a privileged mode vs. user mode thing, separate for each cog. In user mode, cogs can't stop other cogs, can't start cogs with access to pins that the starting cog doesn't have, can't change clock or hub settings.
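
Editor's sketch of the rule set described above, written in Python as a model only. The `CogRights` type, its `privileged` flag and `pin_mask` field, and both function names are hypothetical; nothing like this exists in the P2, and the 64-bit mask simply assumes one bit per I/O pin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CogRights:
    privileged: bool  # hypothetical supervisor/user flag, one per cog
    pin_mask: int     # hypothetical: bit n set means the cog may drive pin n


def may_start(parent: CogRights, child: CogRights) -> bool:
    """Return True if `parent` may start a cog holding `child` rights."""
    if parent.privileged:
        return True
    # A user-mode cog cannot create a privileged cog...
    if child.privileged:
        return False
    # ...and cannot hand out pins it does not hold itself.
    return child.pin_mask & ~parent.pin_mask == 0


def may_stop_other(cog: CogRights) -> bool:
    """Only privileged cogs may stop (or restart) other cogs."""
    return cog.privileged
```

For example, a user-mode driver cog holding pins 8..11 could start a helper on pins 8..9, but not one on pin 20, and could not stop the supervisor.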

Rayman · 2020-02-24 23:26

Maybe it would be nice to have a couple more counters that auto-increment only in certain situations...

One could be for any type of wait situation.
Another could be for an xcont wait.
Another could be for during interrupt handling.

Seems that would be useful info for maximizing the utility of some assembly drivers...

Electrodude · 2020-02-25 01:20

Wuerfel_21 wrote: »

Maybe a privileged mode vs. user mode thing, separate for each cog. In user mode, cogs can't stop other cogs, can't start cogs with access to pins that the starting cog doesn't have, can't change clock or hub settings.

Debug-interrupt mode already does many things like this, so it could just be an extension of debug mode.

To avoid using up debug mode for a non-debug purpose, one or two more intermediate levels of debug mode (basically protection rings) could be added. If there are four total levels, including normal mode, you could use them e.g. like so:

1. Debug protected mode
2. Normal protected mode
3. Debug normal mode
4. Normal mode

rogloh · 2020-02-25 01:49

What I think could be interesting for some future P3 is something like a process-shrunk, even faster P2, allowing a lot more internal memory on die etc., but using dual-port HUB RAMs per COG. One port of each of these RAMs goes to the dedicated COG it serves; the other port is egg-beater accessible by all COGs (including its own COG), to allow for standard hub memory sharing as we have today. So basically each COG gets its own high-performance memory it can rapidly/randomly access and which all the other COGs can still see and use when they access it via their hub interface. I think this would allow for some even larger low-latency applications while still retaining a lot of the benefits of the shared hub model of the existing P2. Hub-exec mode code could probably benefit when executing directly from it.

You may need slightly different memory access instructions to get to the dedicated RAM directly versus via the egg-beater interface, or maybe this could be mapped by memory address etc. Think of this a bit like the LUT-sharing we have today but with the egg-beater on the other side of the LUTRAM instead of another COG.

Obviously the address size would need to change to support larger memory and that impacts the instruction set, but a P3 would be a significantly different beast anyway with many other changes.

You'd still have to come up with a mechanism to deal with simultaneous writes to this dual port HUB RAM at the same address by the COG and the egg-beater. Perhaps that means the COG side access gets delayed in that case which could start to make things look non-deterministic if it is a free for all, but with appropriate coding where COG writers know to respect other memory areas it could still remain deterministic. Even then I'm sure there would be plenty of issues to resolve to make it even work...
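
Editor's sketch of the "COG side gets delayed" policy mentioned above, as a Python model of one clock. The `arbitrate` function and the `(address, value)` tuples are hypothetical; this is only one possible policy, chosen here because the post suggests it, and it ignores reads entirely.

```python
def arbitrate(cog_write, hub_write):
    """Resolve one clock of writes to a hypothetical dual-port hub RAM.

    Each argument is None or an (address, value) tuple.
    Returns (writes_committed_this_clock, cog_stalled).
    """
    if cog_write and hub_write and cog_write[0] == hub_write[0]:
        # Same address on both ports: the egg-beater side wins and the
        # COG-side write is held over to a later clock.
        return [hub_write], True
    # No collision: both ports commit independently in the same clock.
    writes = [w for w in (hub_write, cog_write) if w is not None]
    return writes, False
```

Under this policy, timing on the COG side stays deterministic as long as writers keep to disjoint address ranges, which is the coding discipline the post describes.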

evanh · 2020-02-25 03:08

Tricky, a cross-point switch with every cog. Say, 256 kB per eggbeater. Each would need to reconfigure for the number of cogs slicing at any one time. The maximum number of slices per eggbeater would probably be limited also.

With 16 cogs, say, that's already 4 MB of hubRAM total. What's neat is that each eggbeater can be an addressable bank and with a limit of 4 banks per cog then the address map of each cog can be contained within 1 MB range still.
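
Editor's sketch of the arithmetic above, in Python. The sizes are evanh's example figures; the `bank_table` idea (a per-cog list of which physical bank sits in each of its four slots) and the function names are hypothetical, since the post does not say how banks would be selected.

```python
BANK_BYTES = 256 * 1024  # example size per eggbeater, from the post
NUM_COGS = 16            # example cog count, from the post
BANKS_PER_COG = 4        # example limit on banks visible to one cog


def total_hub_bytes() -> int:
    """One bank per cog: 16 x 256 kB."""
    return NUM_COGS * BANK_BYTES


def cog_window_bytes() -> int:
    """Address range one cog can see: 4 x 256 kB."""
    return BANKS_PER_COG * BANK_BYTES


def translate(bank_table, addr):
    """Map a cog-local address to (physical bank number, offset in bank)."""
    if not 0 <= addr < len(bank_table) * BANK_BYTES:
        raise ValueError("address outside this cog's window")
    slot, offset = divmod(addr, BANK_BYTES)
    return bank_table[slot], offset
```

With these figures the total comes to 4 MB and each cog's window to 1 MB, so a 20-bit byte address per cog still suffices.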

PS: I'd stick with single porting.

jmg · 2020-02-25 04:17

Rayman wrote: »

Maybe it would be nice to have a couple more counters that auto-increment only in certain situations...

One could be for any type of wait situation.
Another could be for an xcont wait.
Another could be for during interrupt handling.

Seems that would be useful info for maximizing the utility of some assembly drivers...

So that could be via an alternate Clock-source to a smart pin counter ? There are smart pin cells under the boot pins, that are somewhat wasted currently.

Cluso99 · 2020-02-25 07:47

At 90nm there would be 4x chip area.

What would be interesting is
8 P2 cores with 512KB egg-beater hub (ie current P2), plus
2 P3 super cores with 2MB hub (see below), plus
a shared 16KB dual port hub with one port in each hub space located at the top of hub.

P3 super cores and 2MB hub:
no COG/LUT, but running full speed out of the 2MB hub.
The hub port would be 64-bit wide and organised to support 3 read and one write 32-bit on each clock allowing both P3's to run full speed using opposite clocks. Would be an interesting hub design but I think it could be done.

evanh · 2020-04-01 23:12

Another future enhancement would be a RD/WRLUT with prefixed SETQ for block copy between cogRAM and lutRAM.

rogloh · 2020-04-02 01:07

Yeah @evanh that would have been handy and in hindsight we probably should have requested it. I wonder if the existing P2 architecture could have supported this and copy at a rate of say 2 (or perhaps 1) clocks per long or if something internal might have stopped us anyway?

Right now you can do a REP with a 3-clock RDLUT x, ptra++, but the address of x can't be changed without an ALTD/ALTI, which adds another 2 clocks in the loop, making 5 clocks per long, and 4 clocks per long in the reverse direction with WRLUT. You can unroll and get down to 3 and 2 clocks per long transferred, but this burns instruction RAM.
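
Editor's sketch of the clock counts quoted above, in Python. The per-instruction figures (3 for RDLUT, 2 for WRLUT, 2 for ALTD/ALTI) are taken from the post; REP setup overhead is ignored, and the function names are hypothetical.

```python
RDLUT_CLOCKS = 3  # per the post
WRLUT_CLOCKS = 2  # per the post
ALT_CLOCKS = 2    # ALTD/ALTI needed per iteration when looping


def clocks_per_long(op: str, unrolled: bool) -> int:
    """Clocks to move one long with 'rdlut' or 'wrlut'."""
    base = {"rdlut": RDLUT_CLOCKS, "wrlut": WRLUT_CLOCKS}[op]
    # Unrolled code hard-codes each register address, so no ALTx is needed.
    return base if unrolled else base + ALT_CLOCKS


def copy_clocks(n_longs: int, op: str, unrolled: bool = False) -> int:
    """Total clocks to copy n_longs, ignoring loop setup."""
    return n_longs * clocks_per_long(op, unrolled)
```

Copying a full 512-long LUT this way costs 2560 clocks looped with RDLUT, against 512 clocks if a block copy ran at one long per clock.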

evanh · 2020-04-02 01:23

Prop2 architecture would happily handle a longword per clock cycle between lutRAM and cogRAM.
