P2 - New Instruction Ideas, Discussions and Requests - Page 6 — Parallax Forums

P2 - New Instruction Ideas, Discussions and Requests


Comments

  • potatoheadpotatohead Posts: 10,253
    edited 2014-03-29 22:42
    Yep! Interesting isn't it?

    Yes, my question too. Honestly, I've been itching to poke at that display some. The analog channels can totally deliver the pixels. It's all about whether or not the display samples at 4K or just scales...

    Maybe the home theatre geeks know the answer to this...

    Not much out there yet. There is this:

    Supports 4K 60p (3840X2160p (50/59.94/60Hz) YCbCr 4:2:0 8bit, 4096X2160p (50/59.94/60Hz) YCbCr 4:2:0 8bit)

    Looks like the analog inputs will be there on some devices at least! That was from a new SONY 4K QHDTV spec sheet.

    We are gonna have to figure out a way to do this when we get real silicon. It's a nice show 'n tell. We may end up really happy we pestered Chip for component support in the waitvid color engine.
  • potatoheadpotatohead Posts: 10,253
    edited 2014-03-29 22:50
    Ha! I take that as a challenge. I will henceforth lobby Chip to include the mooching feature so that I can take it up.


    Looks like you have won with that :smile:

    THAT IS ALL IT TOOK?

    Sheesh.

    Well, I want it for compiled code. Not much else. That remains the case that won me over. And if it helps us with XMM EXEC? Yeah, gotta do it.
    One should care very much about those "incompetents". They are the majority. They are potential customers. The chip should help them.

    No matter what, this is now a consideration. We've added a lot. Much work to be done packaging and presenting things in ways that can make sense one layer at a time, IMHO.
  • potatoheadpotatohead Posts: 10,253
    edited 2014-03-29 23:06
    Re: Musing about 4K...

    Well, I found this document. http://www.synopsys.com/Company/Publications/SynopsysInsight/Pages/Art3-hdmi-2.0-IssQ3-13.aspx?cmp=Insight-I3-2013-Art3

    I was wondering what YCbCr 4:2:0 was. Turns out they are interlacing the color info to reduce the overall throughput required for HDMI 2.0, obviously leaving headroom for 8K displays this time.

    [attached image: one.jpg]


    Looks like they are sharing color across a 2x2 pixel grid for a 2048x1080 pixel color display, capable of 4096X2160 intensity.

    Now I don't know what the analog encoding would be. :( One thing I want to try with component video is to run the color at a reduced resolution from the intensity. It appears viable at higher resolutions, as that is precisely what they are doing. It's a spin on NTSC, which took the same approach, putting the resolution toward intensity for the perception of higher definition. One could degrade the blue channel even more with very little perceptible loss. Our eyes have only a fraction as many blue detector cells as they have for intensity and the other colors. Low resolution blue is cheap compression.

    It may be that simple in analog land. Run the Y channel at a 4K dot clock, and the other two at a fraction dot clock...
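    The savings from that 2x2 chroma sharing are easy to put numbers on. A minimal sketch, assuming raw uncompressed planes and 8-bit samples (illustrative figures only, not taken from the HDMI spec):

    ```python
    # Rough bandwidth arithmetic for chroma subsampling at 4K, as a sanity
    # check on the idea above. All figures are illustrative, not from any spec.

    def frame_bytes(width, height, y_bits=8, chroma_h=1, chroma_v=1, c_bits=8):
        """Bytes per frame: one luma plane plus two chroma planes,
        with each chroma plane subsampled by chroma_h x chroma_v."""
        luma = width * height * y_bits // 8
        chroma = 2 * (width // chroma_h) * (height // chroma_v) * c_bits // 8
        return luma + chroma

    full = frame_bytes(4096, 2160)                         # 4:4:4 - full-res chroma
    sub = frame_bytes(4096, 2160, chroma_h=2, chroma_v=2)  # 4:2:0 - 2x2 shared chroma

    print(full, sub, sub / full)
    ```

    At 4096x2160 that works out to roughly 26.5 MB per frame with full-resolution chroma versus half that with 2x2-shared chroma; running the blue-difference channel at an even coarser grid, as suggested above, would shave off a bit more.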
  • Heater.Heater. Posts: 21,230
    edited 2015-07-01 22:47
    Potatohead,
    ...we don't depend on the moocher. That's got to be baked in to the dialog from the start.
    Oh yes.

    Bill,
    Are you telling me that you would not test the four link case if you needed four links?
    I might test one as a proof of concept. Let's say I'm lazy and I get the code from someone like you, so I'm not familiar with what's in it. Then I build my complete system with four links and all my other stuff, quite rightly expecting that they will work. They are all clones of each other on identical hardware, right? After months of work I do my four-link integration tests. Oh dear, it fails, my project is doomed. I'm grumpy. I never buy a PII again.

    I'm OK with this as long as I have been slapped in the face with warnings of potential moocher conflicts from the outset. As Potatohead put so well above.

    How do we do that? (RTFM is not the answer)

    Re: XMOS:

    The restrictions on shared memory in XC are quite fine. If you want to be able to move processes around from core to core they are essential. It at least ensures you have no data race conditions.

    The I/O mapping is a pain. The fact that all pins are not equal is a pain. The fact that not all pins are reachable from all cores is a pain. The latter seems to contradict the process migrating idea anyway: sure, you can move threads to different cores and they can still communicate easily with the original core, but now they can't reach the I/O ports they need.

    The clocked I/O is great. Timing should not depend on counting the instructions in your program.

    I got ****** when XMOS decided to rename "cores" as "tiles" and hardware scheduled threads as "logical cores". No doubt for marketing reasons. To the casual reader it makes it sound like it has more cores than it actually has. It's dishonest.
  • Heater.Heater. Posts: 21,230
    edited 2014-03-29 23:19
    Potatohead,
    THAT IS ALL IT TOOK?...Sheesh.

    Bill's kind of sneaky that way isn't he :)
  • potatoheadpotatohead Posts: 10,253
    edited 2014-03-29 23:22
    Oh yes he is. Clever as all get out too.
  • RossHRossH Posts: 5,344
    edited 2014-03-29 23:49
    Mooching is a great feature. As long as it is possible to write cog code that depends on it, but do so in such a way that that code can be put in an OBEX (or whatever the P2 equivalent ends up being) without others having to be aware of it.

    And I personally don't see the main benefit of mooching being in the compiled code. It will be in hand-crafted cog code that requires more hub slots than are available to a single cog.

    Ross.
  • Heater.Heater. Posts: 21,230
    edited 2014-03-29 23:56
    Presumably mooching will produce better scores for high level benchmarks like Dhrystone. All good marketing stuff I guess.
  • potatoheadpotatohead Posts: 10,253
    edited 2014-03-30 00:09
    Actually, it is likely to be the reverse. Hand coded PASM will maximize the throughput possible and it's likely to do so on a number of fronts: smart data representations, key loops written for max performance, etc...

    Compiled code is going to have lots of operations in it that won't be planned out to the degree hand code would be. Those will benefit from more frequent access to the hub. Think smaller and somewhat more diverse transactions, not the near-pathological cases we will come up with to make it really sing.

    And to be honest, this chip is fast enough to not require the ultimate in PASM for drivers to be useful and reusable.

    After doing some video related work, I'm kind of stunned at what we get at 80MHz.

    IMHO, the big challenge will be combined drivers that offer significant reuse. That case will likely involve tasking mode, polling, HUBEXEC and potentially the task swap capability we've got in there.

    Either we make the right, most common packages, or we find ways to make combining things more practical...

    That task swap capability is starting to look very interesting to me now. Perhaps we may end up with a pseudo-object of sorts that can be swapped in.
  • RossHRossH Posts: 5,344
    edited 2014-03-30 00:30
    Heater. wrote: »
    Presumably mooching will produce better scores for high level benchmarks like Dhrystone. All good marketing stuff I guess.

    Quite likely - but nobody benchmarks on programs small enough to fit in the entire P2 hub memory any more. Haven't done for years. Modern benchmarks are designed to exceed the size of level 2 and even level 3 caches - and even on small micros, these caches are measured in megabytes, not kilobytes.

    I know potatohead disagrees, and I respect his opinion - but I still think the most interesting uses of mooching will be in things we haven't even thought of yet. Yes, compiled code is going to benefit - perhaps quite a lot for simple linear programs. But not much in programs that are already using multiple cogs, or which depend heavily on functionality provided in cog programs, since there won't be that many spare cycles to mooch.

    But that's not where the Propeller has ever attracted any interest anyway. There are cheaper chips that are 10 times faster than the P2 will ever be at executing straight-line compiled code of just about any high-level language. SPIN is a success as a language on the P1 even though it is as slow as a wet weekend in Melbourne (I don't know the US equivalent - a wet weekend in New Jersey maybe?) because it gives you a quick and easy way of exploiting what you can accomplish in hand-crafted PASM in the cogs. And the P2 will be the same - it is what you will be able to accomplish in cog programs that will attract interest, and this (hopefully!) will give the P2 some unique appeal.

    Ross.
  • potatoheadpotatohead Posts: 10,253
    edited 2014-03-30 00:59
    It's not so much disagreement as it is perception of likelihoods at this time, Ross. As you say, we've not really gone all that far yet. I'm thinking in terms of what people would want to do. Over time, that envelope expands, but when we look at the "want to haves" on P1, taking a display as an example, it's kind of amazing where we are at! We can get a whole scan line worth of data in a mere fraction of the scanline now at FPGA speeds, and that is 16 bit color! Drop that down to 8 bit, or even 16 color displays at some nice resolutions, and a bitmap COG is going to be idle a lot of the time, where on P1 it was busy enough that we had to fall back on scan-line drivers to do anything other than simple tiles.

    We have a whole game in a COG. Granted it's a simple one, but still! When I think of packaging things together, display, keyboard, mouse, other basic human I/O, maybe moderate speed comms, I think it's going to be difficult to saturate a COG's HUB cycles fully. There are wait times, calculation times, etc...

    Anyway, that is the basis for my opinion on compiled code, and or big PASM, whatever.
  • Heater.Heater. Posts: 21,230
    edited 2014-03-30 03:31
    RossH,
    ...but nobody benchmarks on programs small enough to fit in the entire P2 hub memory any more. Haven't done for years. Modern benchmarks are designed to exceed the size of level 2 and even level 3 caches - and even on small micros, these caches are measured in megabytes, not kilobytes.

    Sure they do. Do tell, what small micro-controllers do you have in mind with megabytes of cache?

    I might suggest that something like an STM32 F4 from ST Microelectronics is a chip that is comparable to the PII in price/performance. Here we are looking at a megabyte of FLASH and 256K of RAM. More or less depending on variant. These things have floating point and really fly with quoted benchmarks of 225 DMIPS/608 CoreMark as advertised here: http://www.stmicroelectronics.com.cn/web/en/jp/catalog/mmc/SC1244/SS1583/SC1169/SS1577
    They are in the price range, the chips are only about 10 dollars, heck the very nice STM32F4 DISCOVERY development board is only 15 dollars. They run JavaScript a treat :)

    Yes, yes, someone will want to remind me that a PII is not a "run of the mill" single core machine like the STM32F4's. Very true and thank God for that. But when scanning around for micro-controllers for their next designs, engineers the world over will be making such "check box" comparisons. Speed, RAM, ROM, I/O, peripherals, price. They will be comparing high level language performance, not looking at the 500 opcodes of the PII.
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-03-30 03:56
    Wow, I missed a lot over the w/e, although I did take a quick look Sat night.

    Here are a few of my comments regarding Mooching (the current name for slot sharing)...

    1. Mooching cannot increase hub bandwidth - wrong!

    Say we have a hubexec program that continually needs to load the wide instruction cache every 8 clocks. But often we have an instruction that also writes (or reads) a hub long (a data access). With Mooching, say the wide loads on slot #0. A couple instructions later (from the cache) a wrlong to hub occurs. With Mooching (presuming an unused cycle is available immediately) the wrlong executes, followed by some more instructions, and a new Wide load. Now I can write a program that would be able to fetch the wide on slot #0, wrlong on say slot #5 and fetch the next wide on my next usual slot #0.
    Bingo, I have just increased my cog's hub bandwidth from a wide (8 longs) every 8 slots to 9 longs.
    This will not be the norm, but the scenario could well be played out in many loops within C or other big hubexec code.
    Of course, none of this can be deterministic, but could be extremely beneficial for one or two main large hubexec programs.
    This scenario (of continual wide loading of the instruction cache) would be common for multitask hubexec code.

    What Mooching cannot do is increase the overall hub bandwidth. This is limited to 8 longs per slot.
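    The arithmetic in point 1 can be sketched as a toy count. The window size and wide width below are taken from the scenario above; none of this is a claim about final P2 silicon:

    ```python
    # Toy count for the scenario in point 1: a hubexec cog fetches one wide
    # (8 longs) on its own slot every 8 clocks, and mooching occasionally lets
    # a data wrlong (1 long) slip into an otherwise-unused slot in between.

    SLOTS_PER_WINDOW = 8
    WIDE_LONGS = 8

    def longs_per_window(moochable_slots):
        """Longs transferred per 8-slot window: one wide on the cog's own
        slot, plus one extra long for each unused slot it can mooch."""
        return WIDE_LONGS + moochable_slots

    print(longs_per_window(0))  # 8 longs - no mooching
    print(longs_per_window(1))  # 9 longs - one borrowed slot for the wrlong
    ```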

    2. Effective hub bandwidth can be increased with Mooching.

    Say we have a simple cog program that regularly takes 7 clocks between hub accesses. We might indeed have a few of these programs in various cogs.
    With Mooching, the first cog gets its hub access at slot #1 (it's cog #1). The next access it tries is for clock 1+7 = slot #0. If it is unused, it gets slot #0, and now slot #1 will be unused. Therefore its latency has been reduced. Now let's say we have another cog whose slot is #2 but it is ready for hub access 1 clock early. So it gets slot #1 and then #2 becomes unused.
    As you can see, here is an example where sharing unused slots by Mooching reduces the latency, without causing any issues with Non-Mooching cogs.
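    The latency argument in point 2 can be sketched as a tiny round-robin model. The 8-cog rotation is from the scenario above; the ready-times and the `used` set are invented for illustration:

    ```python
    # Small round-robin model of the latency argument in point 2, under
    # invented assumptions: 8 cogs, slot t belongs to cog t mod 8, and a
    # mooching cog ready early may take the first unused slot it sees.

    def wait_clocks(ready_clock, own_cog, used, mooch):
        """Clocks a cog waits from ready_clock until it gets a hub slot.
        `used` is the set of cogs that will use their own slot this rotation."""
        for t in range(ready_clock, ready_clock + 8):
            owner = t % 8
            if owner == own_cog:
                return t - ready_clock      # own slot always works
            if mooch and owner not in used:
                return t - ready_clock      # borrow an unused slot
        return 8

    # Cog 1 becomes ready at clock 0; slot 0's owner (cog 0) is idle:
    print(wait_clocks(0, own_cog=1, used={1}, mooch=False))  # waits 1 clock
    print(wait_clocks(0, own_cog=1, used={1}, mooch=True))   # waits 0 clocks
    ```

    Note the non-mooching cog is unaffected either way: it still gets its own slot on schedule, which is the point of the scenario.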

    3. Scenario of donating slots with/without priority.

    This is a mode I have asked for. Firstly, it only works when a cog specifically donates its slot to another cog, with or without priority. Priority means the other cog will get my slot if it wants it, otherwise I can use it. Preferably this mode would give me the option of using the other cog's slot if it did not require it.

    I suggest these "paired" cogs be 4 (ie 4 slots) apart. These cogs would run special "paired" code.

    (a) This might actually enable me to double my bandwidth at the expense of the paired cog.
    (b) It might enable me to Ping-Pong double bandwidth between paired cogs.
    (c) This would enable me to ensure that my hub access latency is a maximum of 4 clocks/slots. I may very well have two co-operating cog programs where one of these requires a maximum latency. I had thought this could be a fall-back option for my FS USB code if I cannot do it in 1 cog.
    (d) Games and Video modes could benefit from these paired modes. Once AUX is fully loaded ready to display frame(s), this cog does not require its slot, so it could be donated to the code modifying the display. It's a WIN-WIN for both cogs.
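    The with/without-priority distinction in point 3 could be sketched like this. The function name and the donation map are hypothetical, purely to make the arbitration rule concrete:

    ```python
    # Sketch of the "paired donor" idea in point 3: cog n donates its slot
    # to a partner cog, with or without priority. Pure illustration - none
    # of this is confirmed P2 behaviour.

    def slot_winner(owner, donations, wants, priority):
        """Which cog uses a slot. `donations` maps donor cog -> partner cog;
        `wants` is the set of cogs requesting hub access this clock."""
        partner = donations.get(owner)
        if partner is not None:
            if priority:
                # partner gets first refusal; donor may reclaim if unused
                if partner in wants:
                    return partner
                return owner if owner in wants else None
            # without priority the donor keeps its slot when it wants it
            if owner in wants:
                return owner
            return partner if partner in wants else None
        return owner if owner in wants else None

    # Cog 0 donates to cog 4; both want the slot:
    print(slot_winner(0, {0: 4}, {0, 4}, priority=True))   # cog 4 wins
    print(slot_winner(0, {0: 4}, {0, 4}, priority=False))  # cog 0 keeps it
    ```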

    Summary

    There are many benefits, both to simple Mooching, and the simple "Paired Donor" slot sharing.
    Slot sharing gives "wasted" resources back to programs that could make use of them, or require them.

    ZiCog would benefit nicely from slot sharing. In this implementation, only the VM cog would use Mooching. No other cogs would be "harmed" :)
  • RossHRossH Posts: 5,344
    edited 2014-03-30 03:59
    Heater. wrote: »
    Sure they do. Do tell what small micro-controllers do you have in mind with megabytes of cache?

    The ARM family of microcontrollers supports cache sizes up to 8Mb.

    Ross.
  • Heater.Heater. Posts: 21,230
    edited 2014-03-30 04:11
    Are we confusing micro-controllers, micro-processors and those new fangled SOC devices?
    Or is there a micro-controller you can point to with 8Mb of cache?

    I'm comparing PII with other micro-controllers which I think is appropriate.
  • potatoheadpotatohead Posts: 10,253
    edited 2014-03-30 04:27
    I'm not OK with anything that isn't completely passive and simple.

    Seems to me the end game on pairing is a 4 COG machine.

    As for priorities, etc... that's Chip's call if it even happens at all.

    8Mb cache! Seems a fluid definition of "micro" is in play. :)
  • evanhevanh Posts: 15,184
    edited 2014-03-30 04:51
    potatohead wrote: »
    Seems to me the end game on pairing is a 4 COG machine.

    I believe that 4 Cogs was one config that was asked for way back when Chip first asked about Cogs vs RAM.

    ... which makes the end-game on mooching a single Cog machine. :P
  • RossHRossH Posts: 5,344
    edited 2014-03-30 05:06
    Heater. wrote: »
    Are we confusing micro-controllers, micro-processors and those new fangled SOC devices?
    Or is there a micro-controller you can point to with 8Mb of cache?

    I'm comparing PII with other micro-controllers which I think is appropriate.

    I agree it is unlikely anyone would ever even put multi-megabytes of RAM (let alone cache) on a chip that is at all comparable to the P2 - but a quick search found ARMs with 16kb, 64kb and 128kb cache sizes. Definitely microcontrollers (or at least they call themselves that). Larger memory sizes than that and I agree you are talking more of a "System on a Chip". But some of these SoCs would target the same embedded market the P2 would be after, so I'm not sure there is a real difference.

    Ross.
  • RossHRossH Posts: 5,344
    edited 2014-03-30 05:08
    potatohead wrote: »
    8Mb cache! Seems a fluid definition of "micro" is in play. :)

    Yes, that just seems to be a limit imposed by the ARM architecture. I doubt anyone would ever package up a microcontroller version of an ARM with that much RAM. But a microprocessor version - definitely!
  • jmgjmg Posts: 15,144
    edited 2014-03-30 12:10
    RossH wrote: »
    I agree it is unlikely anyone would ever even put multi-megabytes of RAM (let alone cache) on a chip that is at all comparable to the P2

    I would not be quite so broad. The RaspPi has a stacked die processor, admittedly in BGA, and Nuvoton have TQFP128 ARMs with stacked memory. The N329xx series are 200MHz, and come with up to 32MByte of DDR memory, and they target toys so will likely be cheaper than P2.
  • Heater.Heater. Posts: 21,230
    edited 2014-03-30 12:43
    Exactly, the Raspberry Pi SOC, like similar SOCs in phones and tabs etc, is not generally regarded as a "micro-controller". The Raspberry Pi and fellow SOCs are not comparable to the PII micro-controller. If only because they depend on external RAM and such.

    Having said that, these definitions are very woolly nowadays. As we move from the 8051, through P1 and P2, through STM32, through Raspi and other SOCs, when do we actually make the transition away from being a "micro-controller"?
  • RossHRossH Posts: 5,344
    edited 2014-03-30 13:03
    Heater. wrote: »
    Exactly, the Raspberry Pi SOC, like similar SOCs in phones and tabs etc, is not generally regarded as a "micro-controller". The Raspberry Pi and fellow SOCs are not comparable to the PII micro-controller. If only because they depend on external RAM and such.

    Having said that, these definitions are very woolly nowadays. As we move from the 8051, through P1 and P2, through STM32, through Raspi and other SOCs, when do we actually make the transition away from being a "micro-controller"?

    Yes, it's difficult. I used to think the difference was that a "microcontroller" had to have built-in RAM and EEPROM (or FLASH), so that an entire system could be built without needing any external chips.

    But by that definition, even the P1 doesn't qualify! :frown:

    Ross.
  • potatoheadpotatohead Posts: 10,253
    edited 2014-03-30 13:07
    I think it's somewhat relative.

    At any given time we have basic control tasks. Many of those can always be done with what we all would consider an OLD micro-controller, or CPU.

    The difference then was a CPU was designed to be part of a computing system capable of applications, general purpose use, etc... The micro processor.

    The micro-controller often was the same CPU, but packaged in ways better suited to control applications. Things like some small on chip RAM, I/O, special signals, etc...

    Today, those tasks can be done on modern hardware and the CPUs from the past, essentially turning them into micro-controllers more than micro processors due to size and scale of computing today.

    Often modern hardware is overkill for control tasks, but it's cheap and familiar, so it gets used.

    The scope of what we call a simple control task has expanded, as has what we call a micro controller, as has computing changing what we call a micro processor.

    The P2 has on board RAM and a lot of I/O features making it a micro controller. If the P2 were designed to employ external RAM as easily as it does internal, or its internal RAM were large enough it could be called a micro processor. A much larger internal RAM might qualify it for more of a system on chip definition too.

    The intriguing thing is it being a multi processor that also has multi tasking. It also has some basic computer type functionality intended for on chip self hosted development.

    I have no idea how that will all play out, but we will end up using it as a micro computer despite its micro controller roots.

    Fun times!

    I often think of the cool embedded guys I knew in the 8/16 bit era. They went embedded because they all wanted to make little systems they understood cold, not just use micro controllers, and I always found that notable.

    Today, P2 will fit that mind set, dated as it may be. Who knows, it could be a killer app in the right hands. The guys I knew running OS/9 or FLEX on 6809 computers, some they made themselves, were capable of amazing things and they required nothing but a display / terminal, etc... to get them done. They would get a home computer, apple, coco, whatever and bootstrap their own system off of that. 6502, z80, 6809, etc... then they rolled their own, choosing to develop on the target system, or their computer, depending. Many wrote their tools and then just used them.

    Notably, systems like the PI require a significant OS to develop on chip. Or these larger capacity devices we find more difficult to classify require tool chains and are typically developed for, not ON. Can you all envision Tachyon on a P2? Holy Smile! Peter will be able to do a lot and never leave the chip or his tools.

    That too is something a P2 will be different in. I'm very interested in seeing where some of that all goes.

    And we will have SPIN, C, etc... too, just as potent. Just different.

    The real challenge is to somehow identify the sweet spots and communicate the P2 in ways that resonate enough to see adoption.
  • potatoheadpotatohead Posts: 10,253
    edited 2014-03-30 13:09
    @Ross, yes. My thought exactly. Internal RAM = micro controller.

    I just had another thought. Legacy systems.

    They are everywhere. Some documented, many not. Some have tools available, others don't. Some tools can run on stuff we have today, some won't, or require hacks, cracks, etc to make happen.

    When somebody builds to last, it often does, but what then? Often we have the dead programmer problem, leaving this thing that must be somehow repaired or enhanced / replaced.

    The answer to that is standard tools, but standard tools get old too, leaving us old standard things! That is better than some old non standard thing, but sometimes not by much.

    P1 will always work the way it did when it was released. We opened it, and so we can always build for it, even though it may require building the tools to build for it first.

    The P2 will always work the way it did when released, and it runs its own tools. Well, it does if the fuses allow for it. One can always drop in a P2 that does run that way at least, assuming there are some lying about.

    Seeing that happen is a lot like the embedded systems those guys liked to build. They were always talking about just needing to understand the problem, and the native dev tools, and that always works, given a terminal or some basic interface.

    That may be another niche. Custom, built to last and always accessible.
  • jmgjmg Posts: 15,144
    edited 2014-03-30 13:45
    RossH wrote: »
    Yes, it's difficult. I used to think the difference was that a "microcontroller" had to have built-in RAM and EEPROM (or FLASH), so that an entire system could be built without needing any external chips.

    But by that definition, even the P1 doesn't qualify! :frown:

    Some early Microcontrollers needed external CODE memory for their full software reach, as the on-chip memory was often limited by cost/die area to values like 4k/8k, or sometimes was ROM only.

    These days, 'Microcontroller' is mostly a 'largely single package' yardstick - so a small 8-pin memory chip still just sneaks into Microcontroller territory, but a P2 with SDRAM that is not stacked die starts to get debatable.
    It would probably sneak into Microcontroller most of the time, because of how it is used, but it is now a 3-package solution.
  • kwinnkwinn Posts: 8,697
    edited 2014-03-30 22:12
    Then again, perhaps the difference between microprocessor, microcontroller, and soc is more a matter of viewpoint or application than hardware.
  • Heater.Heater. Posts: 21,230
    edited 2014-03-31 00:22
    Then again perhaps people like Intel developed the first micro-controllers, these were chips that included a CPU, ROM, RAM, timers, UART, clocks, IO pins. Everything required to build a programmable control system integrated into a single chip.

    Seems the first ever was the TI TMS1000 in 1971. Intel was next up with the very popular 8048. There was one in every PC keyboard for a long time.

    These guys invented the concept and they got to name it "micro-controller".
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-03-31 01:03
    Heater. wrote: »
    Then again perhaps people like Intel developed the first micro-controllers, these were chips that included a CPU, ROM, RAM, timers, UART, clocks, IO pins. Everything required to build a programmable control system integrated into a single chip.

    Seems the first ever was the TI TMS1000 in 1971. Intel was next up with the very popular 8048. There was one in every PC keyboard for a long time.

    These guys invented the concept and they got to name it "micro-controller".
    Think you are about 10 years too early. About 1981 was about right for the single-chip micros with internal EPROM such as the 8748 and MC68705.
  • Heater.Heater. Posts: 21,230
    edited 2014-03-31 01:36
    Cluso,

    The TI TMS1000 was developed in 1971 and went on sale about 1974.

    The Intel 8048 was on sale in 1977.

    In 1981 the IBM PC was launched featuring the 8048 as the keyboard scanner. Intel made a pile of cash out of that.

    Admittedly these first devices probably were only one-time programmable. I don't remember when the first UV erasable (EPROM) device appeared. Pretty soon though.

    EEPROM devices (No UV eraser required) did not come on the scene until the early 1990's
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-03-31 01:55
    Heater. wrote: »
    Cluso,

    The TI TMS1000 was developed in 1971 and went on sale about 1974.

    The Intel 8048 was on sale in 1977.

    In 1981 the IBM PC was launched featuring the 8048 as the keyboard scanner. Intel made a pile of cash out of that.

    Admittedly these first devices probably were only one-time programmable. I don't remember when the first UV erasable (EPROM) device appeared. Pretty soon though.

    EEPROM devices (No UV eraser required) did not come on the scene until the early 1990's
    The 8080 and the MC6800 appeared in 1976 or maybe late 1975? The 8008 and 4004 preceded both these. I did not think TI entered this market before Intel and Motorola. I didn't think the 8048 appeared that early. However, the EPROM version of the 8048 (the 8748) and the Motorola MC68705P3S were both available around 1981. I was the largest user of the MC68705P3S in the Southern Hemisphere for some time in the early 80's, having released a printer interface based on a pair of these around 1981/1982. In fact, my product was being sold and used commercially before the production version of the MC68705P3S became available. I had a contract with Motorola to be able to use the final sample chips in my product because the final bug did not affect my product.