
Who do you think will buy the Prop2 ?


Comments

  • Good morning all.

    My background is in building automated test equipment at Lockheed Martin. I can envision lots of applications for the P2 in the test and measurement area. Unfortunately, quantities would only be in the hundreds, maybe thousands. You definitely want to produce a module, similar to the P1 Quickstart board.

    On the software side I see many people recommending C/C++. Not sure this is a good idea. If you use a general-purpose compiler, like gcc, it will not support the P2's unique features. If you add a library of functions to support these features, it will never come close to the performance of PASM code. What you need is a language specific to the P2 that supports all of the P2's powerful features. Just make this language look like C so that developers who know C can learn it easily. This will unlock the full potential of the P2.

    I am now retired, so I don’t know of any specific sales, but if you make a P2 Quickstart board I promise I will buy one.

    Mike
  • Ditto Heater's suggestion of pairing a P2 with an ESP32
  • Heater. Posts: 21,230
    @MikeChristle,

    Yes, I always thought a P1 or P2 would be great for test fixtures. What with its tightly coupled I/O, real-time behavior, etc.

    I have a downer on the idea of creating a new language just for the Propeller though. Because:

    The world has enough languages already.
    If it's not C/C++ most of the world won't look at it.
    If it looks like C/C++ but isn't it's confusing.
    Designing a language, building a compiler and tools and then supporting all that forever is a major task.
    I suspect building P2 features into the language will not get you the performance anyway.
    P2 features can be built into C++ with macros, inline functions, whatever.

    Anecdotally, I wrote a Fast Fourier Transform for the P1 in PASM and also in C. I forget the figures now, but the GCC-compiled version's speed compared very favorably with the hand-crafted PASM version, even though the PASM version had been subject to a lot of performance tweaks suggested by forum members. It's amazing what compiler optimization does nowadays.

    @Mike,

    Thanks. Can't wait to get one now....

  • jmg Posts: 15,140
    ...
    On the software side I see many people recommending C/C++. Not sure this is a good idea. If you use a general-purpose compiler, like gcc, it will not support the P2's unique features. If you add a library of functions to support these features, it will never come close to the performance of PASM code. What you need is a language specific to the P2 that supports all of the P2's powerful features. Just make this language look like C so that developers who know C can learn it easily. This will unlock the full potential of the P2.
    I'm not sure I follow - if you "Just make this language look like C", then it is C, surely ?
    "a library of functions" can be written in PASM, so the claim of "it will never come close to the performance of PASM code" seems strange ?

    However, I think I understand what you were trying to say... that good in-line PASM support is vital?

    Where generic C compilers do seem to vary is in their in-line ASM support, with some much better than others...
    (some are simply awful, and P2 C really needs to be at the better end of good for in-line ASM support)

    Google finds some examples:
    https://msdn.microsoft.com/en-us/library/5f7adz6y.aspx
    - but some comments say Microsoft's in-line ASM is x86 only, not ARM/x64?

    http://comments.gmane.org/gmane.comp.compilers.sdcc.user/5407

    and an example of less than ideal, with a lot of bonus chaff....
     #include <stdlib.h>  /* for exit() */
     
     /* GCC extended asm in Intel syntax: build with gcc -m32 -masm=intel
        (pusha/popa exist only in 32-bit mode). Note every instruction is
        a "...\n" string fragment - the chaff being complained about. */
     int main(void)
     {
          int temp = 0;
          int usernb = 3;
     
          __asm__ volatile (
               "pusha \n"        /* save all registers       */
               "mov eax, %0 \n"  /* eax = temp               */
               "inc eax \n"
               "mov ecx, %1 \n"  /* ecx = usernb             */
               "xor ecx, %1 \n"  /* ecx ^= usernb -> 0       */
               "mov %1, ecx \n"  /* usernb = 0               */
               "mov eax, %1 \n"  /* eax = usernb (then lost) */
               "popa \n"         /* restore all registers    */
               : // no output
               : "m" (temp), "m" (usernb) ); // input
          exit(0);
     }
    
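    For contrast, a minimal sketch (not from the linked page) of the same fragment written with proper output/clobber constraints, so GCC tracks the scratch registers itself and the blanket pusha/popa save/restore becomes unnecessary:
     #include <stdlib.h>
     
     int main(void)
     {
          int temp = 0;
          int usernb = 3;
     
          /* Same net effect, default AT&T syntax; the clobber list
             tells the compiler which registers the asm scribbles on. */
          __asm__ volatile (
               "movl %1, %%eax\n\t"      /* eax = temp   */
               "incl %%eax\n\t"          /* eax++        */
               "xorl %%ecx, %%ecx\n\t"   /* ecx = 0      */
               "movl %%ecx, %0\n\t"      /* usernb = 0   */
               : "=m" (usernb)
               : "m" (temp)
               : "eax", "ecx" );
          exit(0);
     }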

    and, for other languages, this example of in-line x64 ASM for FPC (Free Pascal):
    http://forum.lazarus.freepascal.org/index.php?topic=27725.15
    The claim is that this works on x64:
    {$asmmode intel}
    
    procedure Div128(MSDividend : QWord; LSDividend : QWord; Divisor : QWord;
                     var Quotient : QWord; var Remainder : QWord);
    { Find MSDividend:LSDividend div Divisor -> Quotient rem Remainder, }
    { assuming MSDividend < Divisor }
    begin
       asm
          mov rdx,MSDividend
          mov rax,LSDividend       { rdx:rax = 128-bit dividend }
          div Divisor              { 128bits div 64bits = 64bits rem 64bits }
          mov rdi,Quotient         { rdi = ptr to Quotient }
          mov qword ptr [rdi],rax  { Quotient = rdx:rax div Divisor }
          mov rdi,Remainder        { rdi = ptr to Remainder }
          mov qword ptr [rdi],rdx  { Remainder = rdx:rax mod Divisor }
       end ['rdx','rax','rdi'];
    end;
    
    
    var
       _MSDividend, _LSDividend, _Divisor, _Quotient, _Remainder: QWord;
    
    begin { main }
       _MSDividend := $0000000000000001;
       _LSDividend := $1100000000000017;
       _Divisor := $10;
    
       Div128(_MSDividend, _LSDividend, _Divisor, _Quotient, _Remainder);
    
       WriteLn(HexStr(_Quotient,16));  { 1110000000000001 expected }
       WriteLn(HexStr(_Remainder,16)); { 0000000000000007 expected }
    end.
    
  • jmg,
    With the Visual C++ x64 compiler, they don't have inline assembly at all. Instead they have a full set of intrinsic functions that expose all of the extended instructions (including SSE/AVX/etc. stuff). Most of the intrinsics map one-to-one to an instruction, so you can get essentially the same thing as inline asm using them. You can see the extensive list of them here: https://docs.microsoft.com/en-us/cpp/intrinsics/compiler-intrinsics

    In game dev we use these all the time for high performance SSE/AVX math stuff and other special case needs.
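    For example, a minimal illustration of that one-to-one mapping, using real SSE intrinsics (any SSE-capable x86 compiler):
     #include <immintrin.h>
     
     /* Each _mm_* intrinsic compiles to (roughly) one SSE instruction,
        giving inline-asm-level control without writing any asm. */
     static __m128 mul_add(__m128 a, __m128 b, __m128 c)
     {
          return _mm_add_ps(_mm_mul_ps(a, b), c);  /* MULPS then ADDPS */
     }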
  • jmg Posts: 15,140
    Roy Eltham wrote: »
    jmg,
    With the Visual C++ x64 compiler, they don't have inline assembly at all. Instead they have a full set of intrinsic functions that expose all of the extended instructions (including SSE/AVX/etc. stuff). Most of the intrinsics map one-to-one to an instruction, so you can get essentially the same thing as inline asm using them. You can see the extensive list of them here: https://docs.microsoft.com/en-us/cpp/intrinsics/compiler-intrinsics

    In game dev we use these all the time for high performance SSE/AVX math stuff and other special case needs.
    Ahh, thanks, that makes good sense.
    I was surprised that seemed to be missing...
    Do those give 100% opcode coverage ?

    The multiple-keyword nature of PASM is less compatible with that approach, I think?
  • Whatever it takes for inline asm. Anything but "...\n", though, which is just ridiculous. I can't imagine having to code PASM that way.

    -Phil
  • jmg,
    It's not 100% coverage. It's mainly for the extended instructions, but most of the reason for inline asm is to use those. On the x86_64 architecture, doing inline asm just to hand-code asm is kind of a waste of time; the optimizer is going to beat you easily in most cases, especially when you consider the wide variety of CPUs (not just AMD vs. Intel, but all the various families each vendor has). The x86 and x86_64 optimizers are very mature.

    Obviously for PASM, this is not the case, but an intrinsics approach could work well for getting usable performance out of high level languages.
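    As a purely hypothetical sketch of what that could look like for PASM - none of these __p2_* intrinsics exist; the names are invented for illustration, with plain-C stand-ins so the snippet compiles anywhere:
     #include <stdint.h>
     
     /* Hypothetical P2 intrinsics: a real compiler would emit the named
        PASM instruction for each call instead of the C fallback body. */
     static inline uint32_t __p2_rol(uint32_t v, uint32_t n)  /* would emit ROL */
     {
          n &= 31;
          return (v << n) | (v >> ((32u - n) & 31u));
     }
     
     static inline uint32_t __p2_rev(uint32_t v)              /* would emit REV */
     {
          uint32_t r = 0;
          for (int i = 0; i < 32; i++)       /* bit-reverse all 32 bits */
               r |= ((v >> i) & 1u) << (31 - i);
          return r;
     }
     
     uint32_t scramble(uint32_t x)
     {
          return __p2_rev(__p2_rol(x, 3));   /* two "instructions" */
     }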
  • Graphical applications? I wish there were a full interface to SDRAM, with a high-resolution TFT and a good touch screen to go with it.
  • Doing that, or doing it with one of the newer fast RAMs mentioned here, should be possible.

  • evanh Posts: 15,126
    edited 2017-05-29 05:43
    John,
    As Spud indicated, there is hardware support for full speed burst mode transfers to/from SDRAM, namely built around a Streamer DMA engine.

    I'm not sure if that is the answer you were wanting. Like pretty much everything in the Propeller, it isn't one dedicated controller that only does SDRAM. A Cog will have to manage the control lines and initiate sequencing of the Streamer-SDRAM interactions. There will be examples/objects in the Obex.

    EDIT: I think someone did do an example not long ago.
  • evanh Posts: 15,126
    PS: In the case of a native 44-pin TFT LCD interface, a second Streamer could be pacing the colour data out from a line buffer in HubRAM that the first Streamer has just filled.

    And because the SDRAM can be burst-managed, this has the nice side effect of freeing up lots of time for a Cog to modify the contents of the SDRAM.
  • potatohead Posts: 10,253
    edited 2017-05-29 06:14
    I thought this too. For most UX / control applications, it will be fast enough.

    Some assets, text, buttons, etc. can be buffered in HUB too. I'm not sure any of us has run the pixel mixer yet. It does alpha blending.

    That gives us good quality compositing from multiple storage sources.
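    (For reference, the alpha blend itself is simple per colour channel; generic C for illustration, not P2-specific code:)
     #include <stdint.h>
     
     /* Classic per-channel alpha blend, out = src*a + dst*(1-a),
        in 8-bit integer form with rounding. */
     static inline uint8_t alpha_blend(uint8_t src, uint8_t dst, uint8_t a)
     {
          return (uint8_t)((src * a + dst * (255 - a) + 127) / 255);
     }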
  • jmg Posts: 15,140
    edited 2017-05-29 06:55
    Graphical applications? I wish there were a full interface to SDRAM, with a high-resolution TFT and a good touch screen to go with it.
    There are increasing choices in LCDs; some are here:
    http://www.newhavendisplay.com/new-products-c-985.html

    Those 800x480 models need a 24-bit parallel bus with a CLK + DE qualifier, ~30 to <50 MHz (no min spec is given?).
    A streamer and LUT could manage a 256-colour palette, with one COG in charge of streamer playback & DE timing; another COG (or COGs) would need to construct the line info, as the full pixel count is way over any predicted P2 RAM.

    SDRAM currently has no native P2 hardware double-data-rate support, so the jury is still out on what bandwidth it can actually deliver there.

    If you wanted to try to stream full pixel info from 8-bit HyperRAM to this display, you need to sustain > 90 MHz average byte reads (i.e. 90 MB/s), & I've not seen actual numbers yet for P2 + HyperRAM.
    You also need to have spare slots for writing to that memory.
    Going by that display spec, it looks to allow up to ~34% inactive DE during frame flyback, so maybe that would be enough write bandwidth ?
  • We used the P1 for a mass-production run a couple of years ago. As the design engineer, I had no experience with the P1 before that. The only reason we chose the P1 was its CVBS video-out capability with minimal effort. I have years of experience with ARM, MIPS, and 8051 cores (asm and C), but for the sake of the P1, I learnt some Spin and P1 assembly.

    After a couple of batch runs, our customer requested an increase in resolution and number of colors to come up with a more modern UI, and we got stuck there, because it wasn't possible to implement such a thing without using a dedicated graphics IC (Solomon etc.) or a high-end processor + video encoder while still keeping it under their price range.

    Currently we are manufacturing an advanced version of the original design that uses a SoC, DDR2 memory, and a TFT LCD. But we're still keeping an eye on P2 silicon from time to time.

  • jmg wrote: »
    However, I think I understand what you were trying to say... that good in-line PASM support, is vital ?
    Exactly. With a generic compiler you need inline assembly to get acceptable performance. Messy and hard to maintain. That's why I developed PropC for the P1. See the attached files. I can work entirely in C and still get hand-coded PASM performance. Also, it supports 100% of PASM features. You will never get this kind of performance with GCC without a lot of time spent tweaking.

    Mike

  • Heater. wrote: »
    The way to add wireless for low cost is to put an ESP32 onto the Propeller 2 breakout board.

    That gets you WIFI and Bluetooth.

    Network connectivity for less than the price of an Ethernet jack.

    Not to mention a device that is increasingly familiar to millions of hackers that they can feel comfortable with.

    How about a module, like the P1 Quickstart, with a socket that plugs into a Raspberry Pi? For less than $100 you get a complete test instrument with great potential. That would sell.

    Mike
  • potatohead Posts: 10,253
    edited 2017-05-29 20:48
    You also need to have spare slots for writing to that memory.
    Going by that display spec, it looks to allow up to ~34% inactive DE during frame flyback, so maybe that would be enough write bandwidth ?

    That's a good percentage. For many things, fill rate isn't a limiting factor. Say that rounds down to 30 percent of the frame changing in one frame of time.
    
    One can buffer the changes into a queue, and so long as full, frame-locked motion is not required, users and developers may not even notice the occasional skipped frame or delayed draw.
    
    Add priority to the queue, and a portion of the display can update frame-locked; the remainder happens as it can (see the sketch after this post).
    
    End result is something like 15 Hz full-frame redraw. Many use cases won't go there. Careful UX planning will leave the display running nicely.
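    A minimal sketch of that prioritised draw queue (the DrawOp type and all names here are invented for illustration):
     #include <stdbool.h>
     #include <stddef.h>
     
     typedef struct {
          int  x, y, w, h;        /* dirty rectangle to redraw        */
          bool frame_locked;      /* must be drawn in the next frame? */
     } DrawOp;
     
     #define QMAX 64
     static DrawOp queue[QMAX];
     static size_t qlen;
     
     /* Each frame: emit every frame-locked op first, then as many
        best-effort ops as the remaining draw budget allows. */
     static size_t drain(DrawOp *out, size_t budget)
     {
          size_t n = 0;
          for (int locked = 1; locked >= 0; locked--)   /* locked pass first */
               for (size_t i = 0; i < qlen && n < budget; i++)
                    if (queue[i].frame_locked == (locked != 0))
                         out[n++] = queue[i];
          qlen = 0;   /* a real queue would retain ops that didn't fit */
          return n;
     }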

  • jmg Posts: 15,140
    edited 2017-05-29 21:22
    potatohead wrote: »
    You also need to have spare slots for writing to that memory.
    Going by that display spec, it looks to allow up to ~34% inactive DE during frame flyback, so maybe that would be enough write bandwidth ?

    That's a good percentage. For many things, fill rate isn't a limiting factor. Say that rounds down to 30 percent of the frame changing in one frame of time.
    That's not a practical usable number, just what the display specs allow.
    In real-world use, line buffering is going to be more practical, as there is not enough space for frame buffering.

    The new 120MHz P2 builds might just allow testing the 800 x 480 display spec linked above.

    P2 -> LCD can be 30MHz, which is SysCLK/4, so that's looking comfortable.

    However, HyperRAM needs to read at > 3x that, so a 120MHz bus rate (60MHz clk) should read the required 2400 bytes per line OK.
    This bit is as yet unproven.

    Longest line budget is (512+800)/30M = 43.73us; shortest is (85+800)/30M = 29.5us.
    Frame-time limits I get as 32.144ms to 14.278ms, and even the slowest frame is inside the HyperRAM 85°C refresh-frame MAX of 64ms
    (even the slowest 2 frames are very close to that, so refresh could be spread over 2 frames, if proven necessary).

    If we assume 120M, HyperRAM can burst-fetch** 2400 bytes in ~21us (of the 29.5~43.73us line time), & the spare time can be split between WRITEs and user-managed refresh.
    Looks like it may just fit a whole scan-line write along with a whole scan-line read in one H time-slot, tho maybe allowing 50% write per line is safer initially.

    A first-pass design could skip refresh and just have a single frame store, where the repeated reads will auto-refresh just those displayed RAM cells.
    This wastes some of the RAM, but they are cheap.
    Once the data-flow rates and streamers are P2 proven, then the refresh of other display frames could be added into spare time slots.

    ** this data spec will be important - to me, this says no stuttering effects:
    "When configured in linear burst mode, the device will automatically fetch the next sequential row from the memory array to support a continuous linear burst. Simultaneously accessing the next row in the array while the read or write data transfer is in progress allows for a linear sequential burst operation that can provide a sustained data rate of 333 MB/s (1 byte (8-bit data bus) * 2 (data clock edges) * 166 MHz = 333 MB/s)."
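    A quick plain-C rendering of that budget arithmetic (the 4-line minimum vertical blanking is inferred from the quoted frame limits, not stated explicitly; 120 Mbyte/s HyperRAM throughput is the assumption above):
     #include <stdio.h>
     
     int main(void)
     {
          const double pclk = 30e6;                  /* LCD pixel clock, Hz */
          const double h_active = 800, v_active = 480;
     
          double tH_min = ( 85 + h_active) / pclk;   /* shortest line */
          double tH_max = (512 + h_active) / pclk;   /* longest line  */
     
          double frame_min = tH_min * (v_active +   4);   /* 14.278 ms */
          double frame_max = tH_max * (v_active + 255);   /* 32.144 ms */
     
          double fetch = 2400.0 / 120e6;   /* one 2400-byte line at 120 MB/s */
     
          printf("tH    : %.2f .. %.2f us\n", tH_min * 1e6, tH_max * 1e6);
          printf("frame : %.3f .. %.3f ms (HyperRAM refresh max 64 ms)\n",
                 frame_min * 1e3, frame_max * 1e3);
          printf("fetch : %.1f us per line, before command overhead\n",
                 fetch * 1e6);
          return 0;
     }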
  • For sure. I always tend to take the coarse metric, then work forward.

    That last spec does seem to indicate the display can be fed while display is happening. However, prep on the source data must also happen. That's what I was getting at.

    Would be super nice to write it all (source) during blank. Then, as you say there, it can be streamed out lines at a time.

    If not, some percentage fill rate will be in play. That doesn't have to mean stutter. Movies, games, etc. will be demanding; ordinary UX typically won't be.

    Will be spiffy to give these external memories a go.

  • jmg Posts: 15,140
    potatohead wrote: »
    For sure. I always tend to take the coarse metric, then work forward.
    ...
    Will be spiffy to give these external memories a go.
    Yes, refresh is very poorly spec'd, but the datasheet does say this:

    "The host system may also effectively increase the tCMS value by explicitly taking responsibility for performing all refresh and doing burst refresh reading of multiple sequential rows in order to catch up on distributed refreshes missed by longer transactions."

    which seems to say you can manage refresh yourself, but they fail to give any examples of just how many clocks are needed per ROW change :(
    I was expecting some register-space means to "refresh++", but it seems they have a fixed, buried timer only, or the usual R/W access??

    I guess a test could start with a refresh dummy-read of one byte per Row, confirm that works, and then try pruning clocks until it breaks.
    The good news is that repeated read/replay of a single frame, inside that 64ms, does not need any added refresh work.
  • potatohead Posts: 10,253
    edited 2017-05-30 03:04
    Seems to me, they are allowing a choice.

    If the frame is buffered, writes are throttled by refresh and reads. Only so much may change per frame, but it can be done smoothly. Timed region schemes, blanking, and so on.

    If one wants to stream it and/or do more, fine, but then the behavior must be known and the timing met however makes sense.

    Edit: I'm confused. Does this display have its own buffer?

  • jmg Posts: 15,140
    potatohead wrote: »
    Edit: I'm confused. Does this display have its own buffer?
    The display I've been working backwards from is this one, from the page linked above. Quite nice - sunlight readable and cap-sense touch.
    http://www.newhavendisplay.com/specs/NHD-7.0-800480EF-ASXN-CTP.pdf
    I think that has no frame buffer visible to the user, but expects pixels to arrive in raster fashion.
    They do spec limits on DE windows, which implies two limits (@ 30MHz) of 14.278ms <= Frame <= 32.144ms.
    My guess is the slowest limit is imposed by some DRAM-like refresh being needed on 1T TFT pixel cells?
    When testing something else, like HyperRAM paths, it's best to stay inside these specs.
  • potatohead Posts: 10,253
    edited 2017-05-30 04:01
    That's what I see too.

    In that sense, it's not going to differ much from a native analog display. No color transforms, just raw RGB. The pixel mixer could play a role.

    Then, yes. The burden falls on the source data updates.

    Either one has to get assets from that RAM, or dynamically create them, or fetch them from HUB for compositing. May not need that beyond a mouse cursor.

    Then write them back to prep display data, if streamed from external RAM. Work in chunks ahead of the display, if streamed from a small HUB buffer.

    The nice thing about the latter is that multiple COGs could build the display data.

    If streamed from external RAM, it's all about blanking times, smart queuing.

    If streamed from HUB buffer, it's all about sorting draw ops, objects, and nearly the entire frame is available.

    The latter is more complex, but can yield a lot more changes per frame. The former is simpler, but will be rate limited.

  • jmg Posts: 15,140
    potatohead wrote: »
    If streamed from external RAM, it's all about blanking times, smart queuing.
    If streamed from HUB buffer, it's all about sorting draw ops, objects, and nearly the entire frame is available.
    I'm basing my maths on a bit of both - line buffering done in HUB (just 800*3 bytes/line) and frame buffering done in external RAM - and assuming 120MHz bus write rates.

    Assuming the most bone-headed, inefficient user-refresh I can imagine Cypress may have designed, that needs ~24 SysCLKs per ROW, which maps to ~53 tH times.
    In other words, for frame-blanking times of > 53 lines, refresh can be done in one frame flyback.
    Max spec for frame DE blanking is 255 tH, so that seems comfortable. If that 24 SysCLKs can be reduced, this only improves.

  • But the assets have to be either buffered again somewhere else, dynamically generated, or read-modified-written.

    The HUB lacks room, so it's gonna be external. Read, modify, write.

    In any case, soon we can play with this stuff.
  • evanh wrote: »
    John,
    As Spud indicated, there is hardware support for full speed burst mode transfers to/from SDRAM, namely built around a Streamer DMA engine.

    I'm not sure if that is the answer you were wanting. Like pretty much everything in the Propeller, it isn't one dedicated controller that only does SDRAM. A Cog will have to manage the control lines and initiate sequencing of the Streamer-SDRAM interactions. There will be examples/objects in the Obex.

    EDIT: I think someone did do an example not long ago.

    It's great that SDRAM is supported, as SDRAM chips are easily sourced in the market. However, the other one, called "HyperRAM", is only available in ball-grid packages, which isn't convenient for many prototypers. Are we anticipating a HyperRAM controller inside too?

    Once the P2 is out, I would think about making a mini board with SDRAM on it. That is why I feel SDRAM is very, very important. Many other big microcontrollers like the STM32 and PIC32 (the latest one with a stacked 32MB SDRAM inside the chip) have support for SDRAM and graphics.

    I'm also wondering whether the P2 can run its code from SDRAM? That would be good for people who write retro-processor emulators, as a huge pool of memory allows the emulated processor to cover all the address bits of its RAM and ROM.
  • jmg Posts: 15,140
    It's great that SDRAM is supported, as SDRAM chips are easily sourced in the market. However, the other one, called "HyperRAM", is only available in ball-grid packages, which isn't convenient for many prototypers. Are we anticipating a HyperRAM controller inside too?

    Depends what you mean by "HyperRAM controller inside"?
    The streamer can move bytes, with any provided count, and (hopefully) a smart pin can provide a tightly coupled, associated clock count.
    There will be some software to configure the ChipSelect and prime the bursts, so the 'controller' is a mix of SW & HW
    - but this will need to be proven before final silicon sign-off, to catch any small gotchas lurking.
    Once the P2 is out, I would think about making a mini board with SDRAM on it. That is why I feel SDRAM is very, very important. Many other big microcontrollers like the STM32 and PIC32 (the latest one with a stacked 32MB SDRAM inside the chip) have support for SDRAM and graphics.

    I'm also wondering whether the P2 can run its code from SDRAM? That would be good for people who write retro-processor emulators, as a huge pool of memory allows the emulated processor to cover all the address bits of its RAM and ROM.
    Once HyperRAM is working, you could morph that to other widths. 16 bits wide could be either 2 HyperRAMs or one SDRAM. SDRAM is not too great a jump from HyperRAM, just less pin-efficient.

    A key element of any DRAM is going to be refresh handling, where it will be important to push streamer speed to reduce the overhead that refresh imposes.
    That's another detail that will need testing, tuning and improving.

  • Capt. Quirk Posts: 872
    edited 2017-05-31 23:08
    Obviously the people that purchase the P2 will be people that succeeded first with the P1.

    But has the P1 actually been a success???
    Ken Gracey wrote: »
    I think P2 will find itself being used in newer inventions, including robotics, machine control, signal processing

    as a favorite general-purpose microcontroller by people who value the quick development from flexible I/O

    The kinds of "products" it goes into would be between 10 and 1K units in the near-term, likely not high volumes.

    I have heard the catch phrase "by people who value the quick development from flexible I/O". That phrase implies that they must learn the hard way first. Too many people are downloading Atmel Studio for their first big project, when they should be purchasing their first Propeller!

    The P1 has a lot of horsepower (80MHz & 8 cogs), but a simple 8-bit AVR has a lot of torque (lots of peripheral functions). Speaking for myself, I haven't heard anyone say how complicated all those registers, interrupts, and the IDEs are. Instead I hear "people who value the quick development from flexible I/O."
    Ken Gracey wrote: »
    The kinds of "products" it goes into would be between 10 and 1K units in the near-term, likely not high volumes.

    If that is true, then the P1 needs to be the way a newbie completes his first big project. I tried to upload two old model airplane magazines. They are full of how to design, build, and fly model airplane projects. They show off models made by their readers, they talk about design trends and building your own radio control. They are not (imo) an example of how the past Parallax books and articles were written. They have a different spirit: they advise on how to do your own thing, while helping those that need to copy someone else's thing.

    AeroModeller 1950/06 June

    Model Aircraft 1953/05 May

    The links to download these magazine examples are at the bottom of the page.

    These 2 magazines are just off the top of the stack, nothing special.


    Bill M.
  • >by people who value the quick development from flexible I/O

    I think you can break this down even further to:

    >flexible I/O

    I think the P1 is awesome as a dynamically reconfigurable peripheral co-processor, and this is how I use it.

    I am days away from launching a Kickstarter for my (many years in the making) P1-based 'killer app' and have hopes of ordering 10K+ P1's this year. I crave a faster P1, and I think the P2 would do the job, but it would take a huge effort to refactor all my P1 code (all PASM - tight soft-peripheral code not suitable for C/Spin), since the P2 is not directly compatible.

    I like the availability of the Verilog for the P1, so if the P2 bankrupts Parallax I have a way forward. Also, I just have no idea when the P2 will be available! If I knew for sure it would be available by the end of the year, I might even refactor my application to use the P2, and my own application would be so much better off for it. As it stands, the P2 is a maybe for a next-generation version, but by then the code-base will have grown and I will be forced to maintain 2 product lines and 2 code-bases (P1 & P2), with 2 documentation and feature lineups, and so on. In the end it may prove better to roll my own P1V+ ASIC to get the performance I need while keeping my P1 code-base and a single product line. (I think it's a big mistake to not be backward compatible with the P1.) It's so hard to make decisions in this environment of "it's ready when it's ready".

    For me the winning factor is:
    * Soft peripherals that can use any pin and can be dynamically reprogrammed in less than 5 seconds.. that's it.
