Propeller II update - BLOG

Heater. · 2014-01-15 04:01

Oh, yes, "more specific". It seems I can sometimes read only half the words in a post and end up with a totally different story than the one being told :)

David,

Re: SSD and compilation. In my experience compilation speed has been dominated by CPU time, not disk read/write time. Admittedly I have a bunch of slow old machines here. I imagine that when building VHDL into an FPGA there is a lot of CPU-intensive synthesis going on, so super-fast disk I/O might not help.

P.S. Perhaps you should try to build propgcc with Clang. It is said to be faster than GCC.

Heater. · 2014-01-15 04:05

Chip,

Finns are more like this

Oh yes. They have a lot in common with the Scottish that way.

David Betz · 2014-01-15 04:13

Heater. wrote: »

Oh, yes, "more specific". It seems I can sometimes read only half the words in a post and end up with a totally different story than the one being told :)

David,

Re: SSD and compilation. In my experience compilation speed has been dominated by CPU time, not disk read/write time. Admittedly I have a bunch of slow old machines here. I imagine that when building VHDL into an FPGA there is a lot of CPU-intensive synthesis going on, so super-fast disk I/O might not help.

P.S. Perhaps you should try to build propgcc with Clang. It is said to be faster than GCC.

I do use Clang since that's the compiler that Xcode uses. My machine isn't terribly new. It's about four years old and has a dual core i5 processor in it.

Cluso99 · 2014-01-15 04:28

Congratulations Chip! There are going to be a lot of uses for hubexec mode!

It's going to simplify my P2 LMM Debugger in a major way.

I can sympathise with your "burn and learn". Reminds me of my UV eraser loaded up with 2708-2764 and MC68705s.

David Betz · 2014-01-15 04:29

cgracey wrote: »

Okay. Hub exec caching is working like it's supposed to now.

There are two cache modes that can be set via special operand-less instructions:

ICACHEP - icache prefetch, default, exploits unused hub cycles to prefetch the next cache line, allows single-task straight-line code to run at full speed (when no hub instructions)
ICACHEN - icache no prefetch, loads a cache line only when a cache miss occurs, slower, but more cache-line efficient, actually faster for 4-way hub multitasking, potentially lower power

I'm tidying up the code now and then I'll do a final mental-synthesis reality check. This was all very tricky to think about because there are some subtleties to the icache logic that cause the Verilog to be smaller than you'd expect, with some operation cases working because of layered contingencies which are not obvious. It's been fun, but quite fatiguing to get this worked out. It's taken three days.

Hub execution really opens up new ways of thinking about cogs and code. It basically buys your application (if you're executing from hub) a whole new RAM resource, being the old cog RAM, now free for data or fast code. This is a mental paradigm shift for me. It's going to take a while to figure out the new balance of things. Like the ROM Monitor program - all that code can be executed from hub ROM, and I'll be able to use the cog RAM for much bigger data-entry buffers. This is kind of a mind-bender, from where I'm coming from.

Wow! That's great news! Can't wait to play with it.

Bill Henning · 2014-01-15 05:09

Excellent news!

I knew that you'd get it running soon. Now go get some well-deserved sleep.

p.s.

I really like the control over execution that ICACHE{P|N} gives.

cgracey wrote: »

Okay. Hub exec caching is working like it's supposed to now.

There are two cache modes that can be set via special operand-less instructions:

ICACHEP - icache prefetch, default, exploits unused hub cycles to prefetch the next cache line, allows single-task straight-line code to run at full speed (when no hub instructions)
ICACHEN - icache no prefetch, loads a cache line only when a cache miss occurs, slower, but more cache-line efficient, actually faster for 4-way hub multitasking, potentially lower power

I'm tidying up the code now and then I'll do a final mental-synthesis reality check. This was all very tricky to think about because there are some subtleties to the icache logic that cause the Verilog to be smaller than you'd expect, with some operation cases working because of layered contingencies which are not obvious. It's been fun, but quite fatiguing to get this worked out. It's taken three days.

Hub execution really opens up new ways of thinking about cogs and code. It basically buys your application (if you're executing from hub) a whole new RAM resource, being the old cog RAM, now free for data or fast code. This is a mental paradigm shift for me. It's going to take a while to figure out the new balance of things. Like the ROM Monitor program - all that code can be executed from hub ROM, and I'll be able to use the cog RAM for much bigger data-entry buffers. This is kind of a mind-bender, from where I'm coming from.
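One way to get a feel for why prefetch lets straight-line code run at full speed is a toy clock-by-clock model. The sketch below is only an illustration, not Chip's Verilog: it assumes a cache line of 8 longs, one hub window per cog every 8 clocks, one instruction per clock, an unbounded cache, and no hub instructions competing for the window. All of those values, and the function name `run_straight_line`, are assumptions made for the example.

```python
def run_straight_line(n_instructions, prefetch, line_longs=8, hub_period=8):
    """Count clocks needed to execute n sequential instructions from hub.

    Toy model (assumed, not the real icache logic):
      - a hub window opens whenever clock % hub_period == 0
      - a cached instruction executes in one clock
      - on a miss the cog stalls until the next hub window loads the line
      - with prefetch, an otherwise idle hub window loads the next line
    """
    cached = set()  # indices of cache lines already loaded (unbounded toy cache)
    clock = 0
    pc = 0
    while pc < n_instructions:
        line = pc // line_longs
        hub_window = (clock % hub_period == 0)
        if line in cached:
            if prefetch and hub_window:
                cached.add(line + 1)  # use the spare window for the next line
            pc += 1                   # execute one instruction this clock
        elif hub_window:
            cached.add(line)          # miss: load the line, resume next clock
        # otherwise: miss outside a hub window, so the cog just stalls
        clock += 1
    return clock


if __name__ == "__main__":
    print("prefetch   :", run_straight_line(64, prefetch=True), "clocks")
    print("no prefetch:", run_straight_line(64, prefetch=False), "clocks")
```

In this model the prefetching cog pays only for the initial line load, while the no-prefetch cog stalls until the next hub window at every line boundary. The model says nothing about the multitasking case, where Chip notes that ICACHEN can come out ahead.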

Bill Henning · 2014-01-15 05:30

Heater. wrote: »

Bill,
A warning is a very good idea. Note however that in general the warning cannot be 100% reliable. Even if a module has hungry mode setting instructions that does not mean they actually get executed. We don't have a way to do the static code analysis required to know if hungry mode will actually be enabled at run time.

Agreed - however, such warnings would (ugh) warn a user to watch out for the potential of a starving hungry cog.

Heater. wrote: »

[usually only one cog running in HUNGRY mode]
I would not bank on this.

Only time will tell; however, looking at my likely usage cases, I suspect *I* will normally use only one hungry cog.

Heater. wrote: »

Perhaps you are right. As long as "hungry code" is that sloppy (timing wise) stuff that wants to run as fast as possible overall but does not care about little hiccups on the way and is ultimately not worried if it ends up running at half speed.

This is rather like processes on Unix. Unix tries to manage resources (disk, I/O, memory, etc.) so as to get the best aggregate performance for all processes but makes no guarantees about timing. This makes the overall throughput higher.

Of course everyone complains "You can't do real-time work on Linux":)

Exactly! I want the "main application" that is written in C/Spin/JavaScript(I saw your later message)/other VM to run as fast as possible, when possible.

We have all the other cogs for ridiculously hard real time usage (5ns precision for WAIT's and counters) - heck each of the four potential tasks in a cog is roughly 2.5x more capable than a whole P1 cog!

I think there will be far more "unused" hub slots than most people realize; in an earlier post I did calculations for a six cog data decimator that had 5x 50Mbps bit-banged SPI ports reading 1Msps each, with a sixth cog making running averages... the six cogs left 96%+ of their hub cycles unused.

A 1080p 60Hz 32-bit display cog would leave ~37% of its hub cycles unused.
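The percentages above can be sanity-checked with back-of-envelope arithmetic. The thread does not state the parameters behind them, so the sketch below assumes a 200 MHz clock, one hub slot per cog every 8 clocks, 32 bytes moved per slot, no allowance for video blanking, and one hub access per SPI sample. Those values and the function names are assumptions for illustration only; under them the results land near the quoted figures.

```python
# Assumed parameters (not stated in the thread)
CLOCK_HZ = 200_000_000   # assumed system clock
CLOCKS_PER_SLOT = 8      # assumed: each cog gets one hub slot every 8 clocks
BYTES_PER_SLOT = 32      # assumed: widest hub transfer per slot


def slots_per_second(clock_hz=CLOCK_HZ, clocks_per_slot=CLOCKS_PER_SLOT):
    """Hub slots available to one cog per second."""
    return clock_hz / clocks_per_slot


def display_slot_usage(width, height, fps, bytes_per_pixel):
    """Fraction of one cog's hub slots needed to stream a frame buffer."""
    bytes_needed = width * height * fps * bytes_per_pixel
    bytes_available = slots_per_second() * BYTES_PER_SLOT
    return bytes_needed / bytes_available


def sampler_slot_usage(samples_per_second):
    """Fraction of slots used if each sample costs one hub access."""
    return samples_per_second / slots_per_second()


if __name__ == "__main__":
    display = display_slot_usage(1920, 1080, 60, 4)
    spi = sampler_slot_usage(1_000_000)
    print(f"display cog: {display:.1%} used, {1 - display:.1%} unused")
    print(f"SPI cog:     {spi:.1%} used, {1 - spi:.1%} unused")
```

Changing the assumed clock or transfer width shifts the percentages considerably, which is a reminder that the headroom depends heavily on the final silicon.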

The P2 will be a BEAST

Heater. wrote: »

Bottom line is that the speed of a hungry COG is now dependent on the activity going on in other COGs. This is a BIG new feature of the Propeller. Your claim is that it does not matter and the speed gains are worth that coupling. We worry that such coupling will break things in unexpected ways sometimes. As I said it's for Chip to decide the pros and cons here.

Yep, the extra speed a hungry cog gets is dependent on how much other cogs saturate their hub slots. I believe there are a ton of unused slots available. Time will tell.

Given that the only coupling is to the extra speed a hungry cog may get, it can only affect the hungry cog. Hungry cogs can be clearly documented as not being guaranteed of a large amount of extra hub cycles, and not being deterministic, so the only potential issues are if people ignore the design guidelines. Ignoring documentation and design rules can lead to plenty of other issues totally unrelated to hungry (waitxxx, pushing too much data on AUX stacks, etc etc etc)

Of course it is up to Chip!

cgracey · 2014-01-15 05:36

I got the cache code all tidied up. Tomorrow I'll run the mental-synthesis check to make sure it looks like it should work the way that it does. Then, I'll add the hub stack calls and returns, which shouldn't be too hard. At that point, hub execution is complete. There will probably need to be some new assembler directives to help with coding for hub exec.

I think we may need a new instruction to get an absolute address from a relative address. There will be no bits for where to store it, so it may have to push into the task's 4-level hardware stack.

jazzed · 2014-01-15 08:31

Hi Chip.

Glad you're past your recent troubles.

Heater. wrote: »

P.S. Perhaps you should try to build propgcc with Clang. It is said to be faster than GCC.

We do on Mac where we have no choice.

Heater. · 2014-01-15 09:49

Sorry, I forgot you are all Mac users. Never used one myself. Perhaps the Propeller C compiler should be llvm based. We could call it "Prang":)

dMajo · 2014-01-15 10:18

Bill Henning wrote: »

Exactly! I want the "main application" that is written in C/Spin/JavaScript(I saw your later message)/other VM to run as fast as possible, when possible.

OK Bill, you got my attention ... PERHAPS you are right ...

You've certainly pretty much convinced Heater with the above assertion ...
It seems you know well how to steer people. Were you a politician in your previous life?

@potatohead: It's the first time I've heard okiedokie ... after a little googling, now I know something more ... thanks. The only thing I haven't understood is whether it's a sweet OK or teasing.

KeithE · 2014-01-15 10:20

cgracey wrote: »

I've never done that. I just burn and learn.

To set up a Verilog test bench at this point seems like a lot of extra work. I just run the FPGA and employ the Prop2 trace output, if necessary (assuming it's all working enough to even do that).

I've tried the simulator in Quartus before, it was so much monkey motion to configure and get running that I just decided I'd stick with watching what the FPGA does. If I had gone to school for this, I'm sure I would have had a more standard approach, but this way works fine, although sometimes I break something that is difficult to analyze. The good thing about running the FPGA is that it's close to full-speed, so you get a feel of the ergonomics of the chip, which a simulator would never give you.

Yep - I think that everyone is doing emulation with FPGAs or other technologies today. If I remember correctly you do all functional testing in production, so I figured that you must have some sort of testbench that you can use to generate the vectors and do fault grading. A basic testbench shouldn't be hard to develop for this sort of thing. Without one, how do you determine your test coverage and calculate your expected DPM (defects per million)? I would be interested to know if you have a really clever method. In my experience large customers will need these numbers before they will close a deal with a supplier, so they are critical. This could come after tapeout though.

jmg · 2014-01-15 11:10

Bill Henning wrote: »

Only time will tell; however, looking at my likely usage cases, I suspect *I* will normally use only one hungry cog.

That will be the most common.

Bill Henning wrote: »

Exactly! I want the "main application" that is written in C/Spin/JavaScript(I saw your later message)/other VM to run as fast as possible, when possible. ... The P2 will be a BEAST

Correct.

Bill Henning wrote: »

Yep, the extra speed a hungry cog gets is dependent on how much other cogs saturate their hub slots. I believe there are a ton of unused slots available.

Even here, the designer, who is the one in control, can determine how-much/if the other COGS saturate.
If there is some (rare) needed minimum above the 'own slot' BW base, secondary COGS can be coded to have deliberate gaps in their peak HUB needs. As you say, averages are less likely to be a problem.
The designer chooses bandwidth allocation.

Heater. · 2014-01-15 11:48

jmg,

... secondary COGS can be coded to have deliberate gaps in their peak HUB needs. As you say, averages are less likely to be a problem.
The designer chooses bandwidth allocation.

In the world you describe the "designer" writes all the code in his project. In that case it does not matter what interactions any component has with any other. The "designer" will know all about that and create his design accordingly. As complicated as it may be.

This is not where we should be.

A modern day designer will expect to have services available: UARTs, SPI, I2C, USB etc etc. He does not want to recreate those; it's been done a thousand times already. He just wants to use them.

For the Prop, these services are created in software running on COGS. Hopefully from off the shelf software components or "objects". As such it is very important that the actions of one COG do not affect the timing of others.

jmg · 2014-01-15 12:30

Heater. wrote: »

In the world you describe the "designer" writes all the code in his project. In that case it does not matter what interactions any component has with any other. The "designer" will know all about that and create his design accordingly. As complicated as it may be.

This is not where we should be.

Close, but not quite. My designer has access to all source. He does not need to have written all the code.
He can harvest all the objects he chooses.

Heater. wrote: »

For the Prop, these services are created in software running on COGS. Hopefully from off the shelf software components or "objects". As such it is very important that the actions of one COG do not affect the timing of others.

Sure, and it is only in rare cases, that any of that source code will need to be touched.
These are not mutually exclusive. Only when someone wants to push the envelope is more control needed.

Where some seek to remove control from the designer, I seek to give him choice, and let him push the envelope.

jazzed · 2014-01-15 12:55

Heater. wrote: »

Sorry, I forgot you are all Mac users. Never used one myself. Perhaps the Propeller C compiler should be llvm based. We could call it "Prang":)

I'm a reluctant Mac user. It's just another platform to me like windows, linux, iOS, and android. I don't have a windows phone anymore and I refuse to buy one.

I suggested we consider making Propeller-C LLVM early on. The decision to avoid that path was not mine.

Heater. · 2014-01-15 12:59

jmg,

My designer has access to all source,

In all my discussions I have always assumed open source software. We have no time for anything else.

Sure, and it is only in rare cases, that any of that source code will need to be touched.

That is good. Because my mythical designer will turn to an STM32 ARM chip that has hardware SPI, I2C, PWM when he finds out it's too hard to do on a Prop due to weird timing interactions between COGs.

I am reminded of the Story of Mel here:

http://www.catb.org/jargon/html/story-of-mel.html

I think Chip will like the Story of Mel.

potatohead · 2014-01-15 13:01

Re: Okie Dokie

In this case, it's a snarky response to attempts at marginalization. Give some, get some kind of thing.

And in that vein, I offer this:

Concerns and considerations based on real experiences are not fear, and underlines really don't change that, nor does emphasis.

To which, I asked a question so far unanswered...

potatohead · 2014-01-15 13:03

Looking forward to the HUB exec. I too have been thinking about the possibilities with cog registers, fast code segments, etc... It is a really intriguing model and I am really happy we did this change.

Thanks a lot Chip. It is hard and you are appreciated.

User Name · 2014-01-15 13:15

Heater. wrote: »

That is good. Because my mythical designer will turn to an STM32 ARM chip that has hardware SPI, I2C, PWM when he finds out it's too hard to do on a Prop due to weird timing interactions between COGs.

Those who favor slot sharing are always separating OBEX from slot sharing. Those who don't favor slot sharing are always stuffing slot sharing back into the OBEX so they have some sort of grist with which to fabricate dire scenarios.

Bill Henning · 2014-01-15 13:17

Heater,

You keep saying "weird timing interaction between cogs" etc trying to kill the use of ***UNUSED SPARE HUB SLOTS***

Please provide a valid example where this could happen.

Hand waving, saying that people *might* write two hungry apps that *may* have some poor interaction (which could only occur between the hungry cogs), is not a technical argument.

1) by default all cogs will NOT be hungry
2) in most cases, only the overall "business logic" or "HMI" app would use "hungry" mode
3) anyone who writes a driver in hubexec, depending on extra slots, will realize if something goes wrong due to warnings
4) totally deterministic timing apps/drivers simply should not use "hungry"

Even with a thoughtless developer, hungry cogs CANNOT affect the hub slots and deterministic performance of non-hungry cogs.
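The claim can be illustrated with a small round-robin model. This is a sketch under assumptions, not the P2's actual arbitration logic: each slot belongs to one cog in rotation, the owner always has priority, and a slot the owner leaves idle is handed to a hungry cog. The function `simulate_hub` and the usage pattern `use_every` are hypothetical names invented for the example.

```python
def simulate_hub(n_cogs, use_every, hungry, rotations):
    """Return {cog: [slot times granted]} for a round-robin hub.

    Assumed rules (illustrative only):
      - slot t is owned by cog t % n_cogs
      - use_every[c] = k: non-hungry cog c uses its own slot on every
        k-th rotation (k = 0 means it never touches the hub)
      - a hungry cog always uses its own slot and also takes any slot
        whose owner leaves it idle (rotating among hungry cogs)
    """
    grants = {c: [] for c in range(n_cogs)}
    hungry_order = sorted(hungry)
    next_hungry = 0
    for t in range(n_cogs * rotations):
        owner = t % n_cogs
        rotation = t // n_cogs
        k = use_every[owner]
        owner_wants = owner in hungry or (k > 0 and rotation % k == 0)
        if owner_wants:
            grants[owner].append(t)          # owner always wins its own slot
        elif hungry_order:
            taker = hungry_order[next_hungry % len(hungry_order)]
            next_hungry += 1
            grants[taker].append(t)          # spare slot goes to a hungry cog
    return grants


if __name__ == "__main__":
    # Hypothetical mix: cog 0 = main app, cog 1 = busy driver,
    # cog 2 = driver polling every other rotation, cogs 3-7 idle.
    pattern = [1, 1, 2, 0, 0, 0, 0, 0]
    normal = simulate_hub(8, pattern, hungry=set(), rotations=4)
    shared = simulate_hub(8, pattern, hungry={0}, rotations=4)
    for cog in (1, 2):
        print(f"cog {cog} timing unchanged:", normal[cog] == shared[cog])
    print("cog 0 grants:", len(normal[0]), "->", len(shared[0]))
```

In this model the non-hungry cogs receive exactly the same slots at exactly the same times whether or not cog 0 is hungry; only cog 0's own throughput varies with what the others leave unused.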

You are an excellent programmer and a smart guy... I cannot understand your opposition.

Heater. wrote: »

jmg,

In all my discussions I have always assumed open source software. We have no time for anything else.

That is good. Because my mythical designer will turn to an STM32 ARM chip that has hardware SPI, I2C, PWM when he finds out it's too hard to do on a Prop due to weird timing interactions between COGs.

I am reminded of the Story of Mel here:

http://www.catb.org/jargon/html/story-of-mel.html

I think Chip will like the Story of Mel.

potatohead · 2014-01-15 13:17

On the contrary, I don't think OBEX is the primary consideration.

This is about expectations. This feature is very easy to over-promise and then under-deliver on, regardless of the code reuse issues. My primary objection is along these lines. An example of another similar feature and concern would be incorporating P1 hardware support for some sort of code compatibility.

ctwardell · 2014-01-15 13:20

Heater. wrote: »

A modern day designer will expect to have services available: UARTs, SPI, I2C, USB etc etc. He does not want to recreate those; it's been done a thousand times already. He just wants to use them.

For the Prop, these services are created in software running on COGS. Hopefully from off the shelf software components or "objects". As such it is very important that the actions of one COG do not affect the timing of others.

This seems like spreading FUD to me.

- It is very clear that nobody is advocating that a cog can steal slots from another cog.
- The proposed default is that cogs do not share any slots, so a developer cannot naively write code that benefits from spare slots; they must deliberately code to make use of the slots AND code at least one other cog to donate slots.
- We could have a global setting that disables all slot sharing. That would make it very easy to see if your application still works with no spare slots, albeit slower, without manually changing code in every object.
- UARTs, SPI, I2C are actually good examples of cogs that COULD give up unused slots. If an object like a UART polls the hub every other slot it would have very low latency compared to typical data rates and could donate 50% of its slots to another process.

C.W.

Bill Henning · 2014-01-15 13:27

Here is how I would control expectations:

"Enabling HUNGRY mode for your main application may provide a significant boost to the speed of your business logic or HMI, however please note that you cannot rely on any specific performance improvement as any gain in performance would come from cogs not using all their hub access slots. If all non-HUNGRY cogs use all of their hub access slots, you will not get any performance improvement"

"The possible gain may be a factor of 2x to 3x for compiled code and Spin code, however the improvements are only available when other cogs do not use all of their hub access slots. Examples: a 1080p60 32 bit color display cog uses 64% of its hub slots. A cog dedicated to bit-banged 50Mbps SPI reading of up to 40 bits from an SPI slave uses approximately 4% of its hub slots. Therefore you CANNOT count on any specific level of improvement, and should think of HUNGRY mode as a "free" speed boost for your business logic or HMI. YOU MAY NOT RELY ON GETTING ADDITIONAL SLOTS."

I'd put the above in the manual where the instruction to enable HUNGRY was documented, ditto for Spin, C etc manuals.

potatohead wrote: »

On the contrary, I don't think OBEX is the primary consideration.

This is about expectations. This feature is very easy to over-promise and then under-deliver on, regardless of the code reuse issues. My primary objection is along these lines. An example of another similar feature and concern would be incorporating P1 hardware support for some sort of code compatibility.

Bill Henning · 2014-01-15 13:33

ctwardell wrote: »

This seems like spreading FUD to me.

- It is very clear that nobody is advocating that a cog can steal slots from another cog.

Correct.

ctwardell wrote: »

- The proposed default is that cogs do not share any slots, so a developer cannot naively write code that benefits from spare slots; they must deliberately code to make use of the slots AND code at least one other cog to donate slots.

Not quite.

"- the default for all cogs is NOT to use any spare slots. A cog must specifically enable the HUNGRY mode to be able to access spare slots from other cogs."

There is no need to explicitly donate slots, as another cog using an unused, spare slot, cannot affect the donor.

ctwardell wrote: »

- We could have a global setting that disables all slot sharing. That would make it very easy to see if your application still works with no spare slots, albeit slower, without manually changing code in every object.

Interesting idea.

Or the compiler could omit the "HUNGRY" ops, or pasm could replace them with NOP in "test" mode.
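The "replace them with NOP in test mode" idea could be prototyped as a simple source pass run before assembly. The sketch below is only that: it assumes the mode switch is an operand-less instruction spelled `HUNGRY` (the mnemonic is not final), that comments start with an apostrophe as in PASM, and that a plain `NOP` is an acceptable stand-in. The function name `neutralize_hungry` is made up for the example.

```python
import re

# Assumed mnemonic; the real instruction name was still undecided in the thread.
HUNGRY_RE = re.compile(r"\bHUNGRY\b", re.IGNORECASE)


def neutralize_hungry(source):
    """Return assembly text with every HUNGRY instruction replaced by NOP.

    Only the code part of each line is touched; anything after the first
    apostrophe is treated as a comment and left alone.
    """
    out = []
    for line in source.splitlines():
        code, sep, comment = line.partition("'")
        code = HUNGRY_RE.sub("NOP", code)
        out.append(code + sep + comment)
    return "\n".join(out)


if __name__ == "__main__":
    demo = (
        "start   HUNGRY          ' grab spare hub slots\n"
        "        rdlong  x, ptr  ' HUNGRY is only mentioned in this comment\n"
        "hungry_loop\n"
        "        jmp     #hungry_loop"
    )
    print(neutralize_hungry(demo))
```

Building once with and once without the pass would show whether an application still works when it gets no spare slots at all, which is the check ctwardell's global setting is meant to provide.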

ctwardell wrote: »

- UARTs, SPI, I2C are actually good examples of cogs that COULD give up unused slots. If an object like a UART polls the hub every other slot it would have very low latency compared to typical data rates and could donate 50% of its slots to another process.

C.W.

Absolutely. See my calculation a page or two ago - a 1Msps 50Mbps SPI handling cog only needs 4% of its hub slots.

ctwardell · 2014-01-15 13:44

Bill Henning wrote: »

Not quite.

"- the default for all cogs is NOT to use any spare slots. A cog must specifically enable the HUNGRY mode to be able to access spare slots from other cogs."

There is no need to explicitly donate slots, as another cog using an unused, spare slot, cannot affect the donor.

You are correct, I was thinking that the default was to not offer up the unused cycles.

Maybe the default should be that unused slots are NOT made available. By making this explicit the designer will be much more aware of what is happening.

This way you have to purposefully code at least one cog to give up spare slots before seeing a benefit.

C.W.

Bill Henning · 2014-01-15 14:04

ctwardell wrote: »

You are correct, I was thinking that the default was to not offer up the unused cycles.

Maybe the default should be that unused slots are NOT made available. By making this explicit the designer will be much more aware of what is happening.

This way you have to purposefully code at least one cog to give up spare slots before seeing a benefit.

C.W.

Sorry, that does not make any sense to me whatsoever as the non-hungry cogs cannot be affected by another cog using their "table scraps" (unused hub cycles)

If I build a project, I don't want to have to modify every object, and some objects have binary blobs.

I think warnings as Heater suggested, combined with all cogs defaulting to normal (non-hungry) mode, and requiring an instruction for a cog to enter hungry mode, are more than sufficient.

The speed gains for every VM and compiled code are potentially very large, way too large to ignore the potential performance boost.

Heck, we could even ban "hungry" objects from Obex!

The "Donate" instruction would make some sense if a cog wanted to give up all of its hub slots... but even there it is not necessary, as that cog could simply not use hub instructions.

jazzed · 2014-01-15 14:15

Bill Henning wrote: »

The "Donate" instruction would make some sense if a cog wanted to give up all of its hub slots... but even there it is not necessary, as that cog could simply not use hub instructions.

And it can display a PayPal button

ctwardell · 2014-01-15 14:17

Bill Henning wrote: »

Sorry, that does not make any sense to me whatsoever as the non-hungry cogs cannot be affected by another cog using their "table scraps" (unused hub cycles)

If I build a project, I don't want to have to modify every object, and some objects have binary blobs.

I think warnings as Heater suggested, combined with all cogs defaulting to normal (non-hungry) mode, and requiring an instruction for a cog to enter hungry mode, are more than sufficient.

The speed gains for every VM and compiled code are potentially very large, way too large to ignore the potential performance boost.

Heck, we could even ban "hungry" objects from Obex!

The "Donate" instruction would make some sense if a cog wanted to give up all of its hub slots... but even there it is not necessary, as that cog could simply not use hub instructions.

The purpose isn't to protect the donors; it is understood that you can't steal. The purpose is to make the sharing explicit.

This helps prevent the scenario of someone naively building the app that runs in the much feared 'super cog' that only worked because you hadn't done anything with some of the other cogs and unknowingly had the benefit of plenty of spare slots.

I get that this na

Kerry S · 2014-01-15 14:34

ctwardell wrote: »

Maybe the default should be that unused slots are NOT made available. By making this explicit the designer will be much more aware of what is happening.

This way you have to purposefully code at least one cog to give up spare slots before seeing a benefit.

C.W.

To me that would just make it seem that the 'donating' cog is impacted by the hungry cog, which it is not.

As long as there is no way for a Hungry Cog to steal another cog's slot it seems like a very basic way to increase data availability for programs that need more than would fit into cog ram.

Looking at how much time cogs will be spending on things other than using their hub slot, having a Hungry option seems like a simple way to increase the P2's performance significantly for certain types of programs. While it is not guaranteed, in most cases there would be a significant boost, and if you need that and it is not available, what is the alternative? Use another processor that has more memory available without the Hub/Cog limitations?
