The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

David Betz · 2014-04-07 09:53

Bill Henning wrote: »

even a simplified hubexec would be much faster than LMM and simpler than QLMM - see my other posts

Didn't you say at one point that hubexec would only provide a 25% improvement over LMM?

Roy Eltham · 2014-04-07 09:53

Rayman wrote: »

Don't know if this makes sense and I don't want to ask for more complexity... But...

Can the DAC shifter possibly go straight to 8 consecutive pins as an 8-bit digital shift out?

And/or, can the shifter be bidirectional?

I know what Rayman's after here, and it would be quite useful for connecting to a lot of LCD panels.

JRetSapDoog · 2014-04-07 09:56

cgracey wrote: »

First, hub memory will be comprised of 16 instances of 32768 x 8 RAM for 512KB. This is going to yield a hub data path of 128 bits. Only the RAMs that are needed on a given cycle will be activated, saving power. ... Though the cog memory map is still 512x32, cog RAM will be physically organized as 128 x 128, so we can read or write four contiguous registers with RDQUAD/WRQUAD instructions. ... This also gives cogs running at 200MHz (100 MIPS) a hub memory bandwidth of 200MB/s, which is enough to do any kind of VGA that we have the internal hub memory to support, at any color depth (up to 24bpp)

Come, let us calculate together: 200MHz / 16 cogs * 16-byte transfers (per hub slot) = 200MB/s. Have I got that right? Absolutely fantastic!!!
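The arithmetic above can be checked with a few lines of Python. This is only a back-of-envelope sketch: it assumes each cog gets exactly one hub slot per 16-clock rotation and moves a full 16 bytes in that slot. The names `CLOCK_HZ`, `NUM_COGS`, and `BYTES_PER_SLOT` are made up for the illustration.

```python
# Back-of-envelope check of the per-cog hub bandwidth figure.
CLOCK_HZ = 200_000_000   # cog clock quoted in Chip's post
NUM_COGS = 16            # assumption: one hub slot per cog per rotation
BYTES_PER_SLOT = 16      # 128-bit hub data path = 16 bytes per transfer


def hub_bandwidth_bytes_per_sec(clock_hz, num_cogs, bytes_per_slot):
    """Return per-cog hub bandwidth in bytes per second."""
    slots_per_sec = clock_hz // num_cogs  # hub slots each cog sees per second
    return slots_per_sec * bytes_per_slot


slots = CLOCK_HZ // NUM_COGS
bandwidth = hub_bandwidth_bytes_per_sec(CLOCK_HZ, NUM_COGS, BYTES_PER_SLOT)
print(slots, "slots/s per cog,", bandwidth, "bytes/s per cog")
```

Under those assumptions the result matches the 200MB/s in Chip's post.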

Chip, it's so cool that you're willing to work hard to keep the video features. I know some folks don't need it. But it brought many of us to the Prop. Thanks!

Guess we'll be losing the P2's color lookup table (CLUT)? Maybe we can alter colors on the fly. Any way to keep the CLUT, or what are people's thoughts?

Bill Henning · 2014-04-07 09:57

That was based on 32 bit (1 long) hub bus, same as P1.

Now Chip expanded that to 128 bits. BIG difference.

In order to avoid arguing with half the forum here, I pm'd him a very simple hubexec that suits this that would be ~50MIPS (simple instructions) for incredibly few gates (relatively speaking). With 32 entry slot mapping, it would get close to 100MIPS. WAY better C performance.

non-FCACHE LMM on this latest design tops out at 12.5MIPS, fcache can help a lot - when the compiler can effectively use it.

non-FCACHE QLMM, if effectively used, would be roughly 25MIPS (lose one slot to add pc,#16)
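For anyone following the numbers, here is a rough model of where a figure like 12.5MIPS comes from. It is a sketch under a loud assumption: a non-FCACHE loop is limited by hub slots, and each slot delivers a fixed number of useful instructions. The function name `mips_ceiling` and the parameter `useful_instr_per_slot` are hypothetical, and the model ignores branches and kernel overhead.

```python
def mips_ceiling(clock_hz, num_cogs, useful_instr_per_slot):
    """Upper bound on MIPS when execution is paced by hub slots (assumed model)."""
    slots_per_sec = clock_hz / num_cogs  # hub slots per second for one cog
    return slots_per_sec * useful_instr_per_slot / 1e6


# Plain LMM: assume one instruction fetched and executed per hub slot.
lmm = mips_ceiling(200_000_000, 16, 1)

# Two useful instructions per slot on average would give twice that.
two_per_slot = mips_ceiling(200_000_000, 16, 2)

print(lmm, two_per_slot)
```

In this model the ~25MIPS QLMM estimate corresponds to about two useful instructions per hub slot on average. The post does not spell out that accounting, so treat the mapping as a guess.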

I am extremely tired of some people criticizing me on a non-technical basis, so I did not post (NOT talking about you)

David Betz wrote: »

Didn't you say at one point that hubexec would only provide a 25% improvement over LMM?

Bill Henning · 2014-04-07 09:57

+ 1,000,000

Rayman wrote: »

Don't know if this makes sense and I don't want to ask for more complexity... But...

Can the DAC shifter possibly go straight to 8 consecutive pins as an 8-bit digital shift out?

And/or, can the shifter be bidirectional?

pjv · 2014-04-07 09:59

Hi All;

I too think CMPSUB is important.

Is there an estimate of what the power budget might be? ... this chip might just be the one I'm needing.

Cheers,

Peter (pjv)

Roy Eltham · 2014-04-07 10:01

Bill, that's really quite contrary to how we've all been doing things here. Besides if Chip uses it, then we are all going to know it with the first FPGA release, and all you've done is delay the arguments until then.

Bill Henning wrote: »

That was based on 32 bit (1 long) hub bus, same as P1.

Now Chip expanded that to 128 bits. BIG difference.

In order to avoid arguing with half the forum here, I pm'd him a very simple hubexec that suits this that would be ~50MIPS (simple instructions) for incredibly few gates.

potatohead · 2014-04-07 10:02

I like the video changes. We can rather easily do component and/or TV in software.

Had no idea that color space was so expensive! Ditch it.

I'm still catching up, but I vote now on code compatibility.

The idea that we take the best features from our efforts and respect the power budget is a winner!

I'm stoked about this. Chip, you are awesome. Really.

Heater. · 2014-04-07 10:04

Does anyone have all the sources from OBEX that can be grepped for all the different PASM mnemonics so that we can count what ops are used a lot and what are not?

It's not definitive of course as some rarely occurring opcode may actually be executed a lot in reality. But gives an idea at least.
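A minimal sketch of such a count, assuming the OBEX sources are available as plain text. The `MNEMONICS` set is a hypothetical subset to be extended, and `count_mnemonics` is an illustrative helper, not an existing tool. It is a static count only, so it shares the caveat above about rarely written but frequently executed opcodes.

```python
import re
from collections import Counter

# Hypothetical subset of P1 mnemonics to look for; extend as needed.
MNEMONICS = {
    "ABS", "ABSNEG", "NEG", "CMPSUB", "CMP", "SUB", "ADD", "MOV",
    "MUXC", "MUXNC", "MUXZ", "MUXNZ", "RDLONG", "WRLONG", "JMP", "DJNZ",
}


def count_mnemonics(source, mnemonics=MNEMONICS):
    """Count occurrences of known mnemonics in PASM text, ignoring ' comments."""
    counts = Counter()
    for line in source.splitlines():
        code = line.split("'", 1)[0]  # drop everything after a line comment
        for word in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code):
            upper = word.upper()
            if upper in mnemonics:
                counts[upper] += 1
    return counts


sample = """
loop    cmpsub  x, y wc      ' cmpsub in a comment is ignored
        muxnc   outa, mask
        djnz    n, #loop
        CMPSUB  a, b
"""
counts = count_mnemonics(sample)
print(counts.most_common())
```

To run it over a whole tree, one could read each `.spin` file found with `pathlib.Path(...).rglob("*.spin")` and add the resulting counters together; the directory layout is up to whoever has the archive.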

Rayman · 2014-04-07 10:07

Actually, now that I think about... I suppose with 16 cogs, I could just devote 1 cog to this task and not feel too bad about it...

Roy Eltham wrote: »

I know what Rayman's after here, and it would be quite useful for connecting to a lot of LCD panels.

Bill Henning · 2014-04-07 10:08

Roy,

I WOULD FAR PREFER TO DO IT ON THE OPEN FORUM

But I have HAD IT with personal attacks, or criticism without understanding what I propose, and not doing a technical analysis of it.

I LOVE TO GET TECHNICAL, ANALYTIC CRITICISM/ARGUMENTS - I LEARN FROM THAT! SO DOES EVERYONE!

But I have had it with "hand waving", "it is not like the P1", or criticism without a well founded technical argument to support it.

If moderators were doing their jobs well, they should have come down hard on personal attacks, and non-technical criticism. But they are not.

You guys have NO idea how close I came to leaving the P2 forum for ever in the past week. Or at least a year like last time.

I actually wonder if some forum members would prefer if I disappeared.

Roy Eltham wrote: »

Bill, that's really quite contrary to how we've all been doing things here. Besides if Chip uses it, then we are all going to know it with the first FPGA release, and all you've done is delay the arguments until then.

pedward · 2014-04-07 10:10

Roy Eltham wrote: »

Instructions that I think we could drop from the P1 set to make room for required new ones:

These can be done with the separate instructions:
CMPSUB

Probably don't need all flavors of MUX, just keep MUXC/MUXZ and drop MUXNC/MUXNZ.

I disagree strongly with removing CMPSUB and MUXNC/MUXNZ, these are some seminal instructions that are very useful for atomic operations and dealing with bit tests. I've written some code that could *not* have been written without these instructions, specifically interfacing gray-code encoders requires those instructions.

I agree that some of the arithmetic instructions are lesser used, specifically the absolute values. Perhaps we could remove all ABS instructions in asm, since they can easily be done as 2 instructions:

ABS:

NEG D, S WC WZ NR
IF_C_AND_NZ NEG D, S

ABSNEG:

NEG D, S WC WZ NR
IF_NC_AND_NZ NEG D, S

There are other instructions that can be broken down into "microcode" instead of dedicated instruction slots.
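The two sequences above can be modeled in Python to see how they behave. This is a sketch with stated assumptions: NEG with WC sets C to bit 31 of the source and with WZ sets Z when the source is zero (as I read the P1 manual), and D and S name the same register, so the value is already in place when the conditional NEG is skipped. With distinct registers a MOV would presumably be needed first. All function names are made up for the illustration.

```python
MASK32 = 0xFFFFFFFF


def neg_with_flags(s):
    """Model NEG D,S WC WZ: return (result, C, Z); C = S[31], Z = (S == 0)."""
    s &= MASK32
    return (-s) & MASK32, bool(s >> 31), s == 0


def abs_two_step(reg):
    """NEG reg,reg WC WZ NR, then IF_C_AND_NZ NEG reg,reg (same register assumed)."""
    negated, c, z = neg_with_flags(reg)  # NR: flags only, register not written
    if c and not z:
        reg = negated
    return reg & MASK32


def absneg_two_step(reg):
    """NEG reg,reg WC WZ NR, then IF_NC_AND_NZ NEG reg,reg (same register assumed)."""
    negated, c, z = neg_with_flags(reg)
    if (not c) and (not z):
        reg = negated
    return reg & MASK32


def to_signed(value):
    """Interpret a 32-bit pattern as a signed integer."""
    return value - (1 << 32) if value & 0x80000000 else value


for v in (0, 1, 5, -1, -5, 2**31 - 1, -(2**31 - 1)):
    print(v, to_signed(abs_two_step(v & MASK32)), to_signed(absneg_two_step(v & MASK32)))
```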

Bean · 2014-04-07 10:14

I would like to vote to keep CMPSUB also.

Bean

potatohead · 2014-04-07 10:14

I vote yes on the mode to make the shifter work bidirectionally across 8 pins. We will use the Smile out of that.

Re: composite loss

PAL might get tough again, but we have a lot of P1 software that drives a TV sans the chroma circuit.

It worked well, but for the coarse signals. With DACS this will be fine. No tear shed here. A 200MHz or so display COG can nail a TV signal. I'll end up doing one. I like TV. Another COG can do color lookups and integrate data from buffers.

If games are done, we have an awful lot of sprite COGS and the ability to partition the display among them still. Nuts crazy things are possible.

We don't lose much, but the nice color space and fast pixel clocks. We can still do commercial quality video.

Roy Eltham · 2014-04-07 10:17

Bill,
I for one do not want you to leave, and while I may disagree with you at times, I still respect your methods and contributions. However, bucking the system to avoid arguments from a few is unfair to the system.

pedward,
Then keep MUXNC/MUXNZ and drop MUXC/MUXZ ? Do you need both? CMPSUB can be done with a CMP and a conditional SUB, is it really required to be in one instruction? Remember the instructions are inherently twice as fast (2 clock instead of 4).
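As a sketch of the CMP plus conditional SUB idea, here is a small Python model comparing it with a single CMPSUB. It assumes the P1 semantics as I understand them: CMPSUB subtracts when D >= S (unsigned) and sets C when it did, while CMP with WC sets C on an unsigned borrow (D < S). The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def cmpsub(d, s):
    """Model CMPSUB D,S WC: return (new_d, c); subtract only if D >= S unsigned."""
    d &= MASK32
    s &= MASK32
    if d >= s:
        return (d - s) & MASK32, True
    return d, False


def cmp_then_sub(d, s):
    """Model CMP D,S WC followed by IF_NC SUB D,S: return (new_d, c)."""
    d &= MASK32
    s &= MASK32
    borrow = d < s          # CMP sets C when D < S (unsigned)
    if not borrow:          # IF_NC: subtract only when no borrow
        d = (d - s) & MASK32
    return d, borrow


pairs = [(10, 3), (3, 10), (7, 7), (0, 0), (0xFFFFFFFF, 1), (1, 0xFFFFFFFF)]
for d, s in pairs:
    print(d, s, cmpsub(d, s), cmp_then_sub(d, s))
```

In this model the register result is identical, but the C flag comes out with the opposite polarity, so code that tests C afterwards (a shift-and-subtract divide loop, for example) would need its conditions flipped.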

Bill Henning · 2014-04-07 10:21

Roy,

Roy Eltham wrote: »

I for one do not want you to leave, and while I may disagree with you at times, I still respect your methods and contributions.

Thank you.

Roy Eltham wrote: »

However, bucking the system to avoid arguments from a few is unfair to the system.

Why is it bucking the system?

I know that at the very least you and jazzed make suggestions in person to Chip, being close enough to physically visit. I am sure there are others.

That is exactly the same as a PM.

Your request, that I only post on the forum, is illogical and unfair.

Regarding avoiding arguments:

I LOVE GOOD TECHNICAL ARGUMENTS

Moderators are not really doing their job stamping on personal attacks.

Quite a few people do not bother to read technical proposals, and have illogical, non-supported knee-jerk reactions. Which influence others, and result in many uninformed "Mee too!" postings.

Such non-technical arguments should also be stamped on.

FYI,

Just like I do not have the right to request that those close enough to visit Chip and make suggestions, or email, phone or pm suggestions to him...

rjo__ · 2014-04-07 10:24

Excellent news.

potatohead · 2014-04-07 10:24

I would very much appreciate the fact that Chip can do this regardless of where the suggestions come from.

The power considerations are firm in mind, as is the vision of lots of COGS to do stuff we packed into single ones.

I want fast big programs. Any suggestion toward that end makes sense to me. That is a big exception to P1. Let's be rid of it.

David Betz · 2014-04-07 10:25

Bill Henning wrote: »

That was based on 32 bit (1 long) hub bus, same as P1.

Now Chip expanded that to 128 bits. BIG difference.

In order to avoid arguing with half the forum here, I pm'd him a very simple hubexec that suits this that would be ~50MIPS (simple instructions) for incredibly few gates (relatively speaking). With 32 entry slot mapping, it would get close to 100MIPS. WAY better C performance.

non-FCACHE LMM on this latest design tops out at 12.5MIPS, fcache can help a lot - when the compiler can effectively use it.

non-FCACHE QLMM, if effectively used, would be roughly 25MIPS (lose one slot to add pc,#16)

I am extremely tired of some people criticizing me on a non-technical basis, so I did not post (NOT talking about you)

PropGCC does use fcache even on P1.

Heater. · 2014-04-07 10:29

Bill,

I actually wonder if some forum members would prefer if I disappeared.

Some forum members can prefer what they like; I would very much miss your contributions.

David Betz · 2014-04-07 10:29

Bill Henning wrote: »

That was based on 32 bit (1 long) hub bus, same as P1.

Now Chip expanded that to 128 bits. BIG difference.

In order to avoid arguing with half the forum here, I pm'd him a very simple hubexec that suits this that would be ~50MIPS (simple instructions) for incredibly few gates (relatively speaking). With 32 entry slot mapping, it would get close to 100MIPS. WAY better C performance.

non-FCACHE LMM on this latest design tops out at 12.5MIPS, fcache can help a lot - when the compiler can effectively use it.

non-FCACHE QLMM, if effectively used, would be roughly 25MIPS (lose one slot to add pc,#16)

I am extremely tired of some people criticizing me on a non-technical basis, so I did not post (NOT talking about you)

I'm not convinced that QLMM is going to be all that big a help. There are issues like how to handle branches and LMM macros. Do you have a complete QLMM proposal?

jazzed · 2014-04-07 10:31

cgracey wrote: »

This thread is about the new chip we are going to build in the 180nm process.

Thanks Chip,

Your posts sound good.

Backward PASM code timing compatibility is not an issue. If instructions can be executed in one or two clock cycles, that's far more important than PASM timing compatibility.

However, P1 instruction compatibility is pretty important. If all the OBEX code and all the other code written before that people rely on for revenue can be ported, then it's a non-issue. DEC/ENC were never usable anyway, so those can easily be removed. MUL/DIV should be implemented of course.

Thanks for agreeing to not remove RDWORD/WRWORD.

If people want color NTSC or PAL they can use P1.

Any chance that 1MB HUB RAM would fit? Perry's suggestion makes sense looking forward.

Any chance that RD*C instructions can be used? If not that's fine.

--Steve

Roy Eltham · 2014-04-07 10:32

Bill,
I'm actually quite far from Chip, but I do visit from time to time (once or twice a year). There have been times that Chip has contacted me about features, but they got shared here pretty quickly. Also, usually when I visit, we talk about almost everything else besides the chips. Anyway, I mostly just want to hear your idea because I am interested in how it works, since LMM/HUBEXEC type stuff is going to be key in making C/C++ perform well and I think that is very important for Parallax.

Heater. · 2014-04-07 10:33

David Betz,

PropGCC does use fcache even on P1.

Yep, and it works very well. The C version of the heater_FFT runs almost as fast as the hand-crafted PASM version!

How that translates to more general purpose usage I have no idea but it's most encouraging.

ctwardell · 2014-04-07 10:34

I like the idea of having some form of HUBEXEC, if it gets designed in public that's fine, if it happens behind closed doors that's fine as well.

C.W.

Bill Henning · 2014-04-07 10:34

Yes it does. My posts above give the minimum expected performance, because until a given bit of code is tried, the % improvement is unknown.

As I came up with LMM, FCACHE, FLIB (renamed "kernel extensions"), I know that you realize I am well aware of how good FCACHE can be.

With something like an FFT, on a small enough array that it fits in the cog, it can closely approach native MIPS for the fcached part.

With stack based recursive code, accessing local variables, it has very little effect.

hubexec is much faster than the base LMM performance, and saves a lot of hub memory.

QLMM has the potential of roughly 2x the performance of hubexec, at the expense of having to write a VLIW style GCC code generator. Money better reserved for P1+ / P2 shuttle runs in my opinion.

At one point, Ken mentioned that improving compiled C performance was high on customers' "want" list.

David Betz wrote: »

PropGCC does use fcache even on P1.

jazzed · 2014-04-07 10:34

I see this thread has degenerated already. Moderators, please remove perceived personal attacks and posts about perceived personal attacks (including this one).

Bill Henning · 2014-04-07 10:36

Thank you my friend.

Heater. wrote: »

Bill,

Some forum members, can prefer what they like, I would very much miss your contributions.

Bob Lawrence (VE1RLL) · 2014-04-07 10:37

About code compatibility with Prop1:

Is this absolutely mandatory?

I say break it if necessary. As much as we would love to have code compatibility, sometimes it's just not possible. I have always felt that the biggest mistake that Microsoft made was trying to stay compatible with DOS when Windows was developed. They should have made a clean break.

I'll post FPGA images for the DE0-Nano and DE2-115 boards as soon as we have something working.

I'm looking forward to that. :cool:

So, what can I do with 16 cogs? Now I have to think of a project to do. LOL

Bill Henning · 2014-04-07 10:41

Thanks Roy.

Can I count on you and others to stamp on personal attacks on me, and technically unsupported criticism on my posts?

If so, I will be happy to post it.

I love technical criticism, it leads to improvements!

I know, and I would strongly prefer to post all such suggestions, but they get lost in unwarranted non-technical criticisms... which I am tired of responding to, but have to, as otherwise (people not technically as versed) will fall for it, and add "me too!"... and the vicious circle continues.

Perhaps I should have vented earlier.

Roy Eltham wrote: »

Bill,
I'm actually quite far from Chip, but I do visit from time to time (once or twice a year). There have been times that Chip has contacted me about features, but they got shared here pretty quickly. Also, usually when I visit, we talk about almost everything else besides the chips. Anyway, I mostly just want to hear your idea because I am interested in how it works, since LMM/HUBEXEC type stuff is going to be key in making C/C++ perform well and I think that is very important for Parallax.
