The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

Heater. · 2014-08-18 08:44

Those 16 bit TI processors were very popular with assembler programmers back in the day. Nice simple instruction set and they were actually 16 bit in a world that was mostly still 8 bit.

They were not much like the Propeller COGs though. Their registers were, as you say, out in main memory, and that is where code was executed from. The Prop, by contrast, executes its code from its own internal registers.

kwinn · 2014-08-19 07:55

@Leon

Great idea to use a macro-assembler as a cross-assembler. Wish I had heard or thought of it years ago. Would have saved me a lot of time and money as well as the tedium of hand assembling code.

@Heater

Having registers instead of standard memory and executing code from them is fantastic.

It would also make it very simple to implement a simple single interrupt per cog. All the state information you need to save is the PC and flags, which would fit in a single word or shadow register, and no worries or wasted time saving registers.

Heater. · 2014-08-19 08:21

kwinn,

I agree, the Prop mode of execution is one of the things I love about it most. Fantastic.

That single, very simple interrupt mechanism was my suggestion:)

It's about the only "interrupt" idea I can tolerate.

Thinking about it now, it's not really like an interrupt at all but rather two sets of execution state, PC and flags as you say, that can be swapped between as and when required by software control or hardware input. There is no "background" and "interrupt"; there are just two of the same going on. Very simple, very elegant. No stack required, no priorities, no fuss.

Perhaps even better than the old P2 hardware thread scheduling idea.

Makes for an event driven system rather than an interrupt mechanism. But we can call it "interrupt" to keep those who think interrupts are important happy:)

User Name · 2014-08-19 10:50

Having registers instead of standard memory and executing code from them is fantastic.
It would also make it very simple to implement a simple single interrupt per cog. All the state information you need to save is the PC and flags, which would fit in a single word or shadow register, and no worries or wasted time saving registers.

Perhaps even better than the old P2 hardware thread scheduling idea.
Makes for an event driven system rather than an interrupt mechanism. But we can call it "interrupt" to keep those who think interrupts are important happy.

This would be a great feature to explore with Open P1. Even though I've never found the Propeller wanting for interrupts, why not expand the Propeller's repertoire?

jmg · 2014-08-19 12:55

Heater. wrote: »

....
Thinking about it now, it's not really like an interrupt at all but rather two sets of execution state, PC and flags as you say, that can be swapped between as and when required by software control or hardware input. There is no "background" and "interrupt"; there are just two of the same going on. Very simple, very elegant. No stack required, no priorities, no fuss.

Perhaps even better than the old P2 hardware thread scheduling idea.

It sounds identical to the P2 thread idea, with just a single bit index.
( You do need to also duplicate the pipeline, if you want 2+ code streams running)

Given COG RAM is one of the more die-area costly parts of the COG, some form of time-sliced core usage, to allow as much code to run as possible, makes good sense.
The other issue is unused COGS == wasted RAM, but I guess that one is harder to crack.

Bill Henning · 2014-08-19 13:42

I'd love to see hardware multitasking come back to the P2.

I personally feel that for serial, i2c and many other drivers 1/4 of a P2 cog would be more than sufficient, and for those saying we have 16 cogs, so we don't need it, I think that one cog, tasking four ways, will use less power (and thus generate less heat) than using four cogs. As for interrupts, with four tasks, each task could be WAITing on a pin... presto, four "interrupts" per cog, with very fast response, at very low power when waiting.

jmg wrote: »

It sounds identical to the P2 thread idea, with just a single bit index.
( You do need to also duplicate the pipeline, if you want 2+ code streams running)

Given COG RAM is one of the more die-area costly parts of the COG, some form of time-sliced core usage, to allow as much code to run as possible, makes good sense.
The other issue is unused COGS == wasted RAM, but I guess that one is harder to crack.

mark · 2014-08-19 13:52

Bill Henning wrote: »

As for interrupts, with four tasks, each task could be WAITing on a pin...

I thought that wasn't possible, as any WAIT instruction would stall execution of all tasks IIRC.

Heater. · 2014-08-19 14:30

And there you hit the problem. WAITxx only waits on one thing. Meanwhile anything else is blocked.

What we need is to wait on many things. That is why people from interrupt land crave interrupts.

But we have many COGS so waiting on many things is a bit silly. Do we really need 8 levels of interrupt, in the traditional style, on each of 16 COGS?
With all that overhead of a stack and priorities and whatever. I think not.

Hence my single interrupt suggestion.

No stack required.

WAITxx on something, if something else happens, swap PC and flags and go and do that. When it's done swap PC and flags back again.

Except, unlike interrupts, this is totally symmetrical: there is no "background" and "interrupt" and "interrupt of interrupt"...; there are only two threads of execution that get swapped between. It's hard to tell which is which.
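To make the ping-pong idea concrete, here is a minimal Python sketch of a cog with two symmetrical execution states. It is only a model of the suggestion above, not anything Parallax has specified: the names `Context`, `TwoContextCog`, `swap` and `run` are invented for illustration, and an "event" is just a boolean per instruction slot.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """The only state that needs saving: program counter plus flags."""
    pc: int
    z: bool = False
    c: bool = False


class TwoContextCog:
    """Hypothetical cog holding two symmetrical execution states."""

    def __init__(self, pc_a: int, pc_b: int) -> None:
        self.active = Context(pc_a)   # state currently driving execution
        self.shadow = Context(pc_b)   # state parked in the shadow register

    def step(self) -> int:
        """Pretend to execute one instruction; return the address used."""
        addr = self.active.pc
        self.active.pc += 1
        return addr

    def swap(self) -> None:
        """Exchange active and shadow state. Nothing is pushed or popped."""
        self.active, self.shadow = self.shadow, self.active


def run(cog: TwoContextCog, events: list[bool]) -> list[int]:
    """Execute one instruction per entry, swapping first when an event is set."""
    trace = []
    for event in events:
        if event:
            cog.swap()
        trace.append(cog.step())
    return trace


if __name__ == "__main__":
    cog = TwoContextCog(pc_a=0x010, pc_b=0x100)
    print([hex(a) for a in run(cog, [False, False, True, False, True, False])])
```

Each event simply exchanges which of the two states is live, so the same mechanism both enters and leaves the "handler", and neither thread is privileged.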

jmg · 2014-08-19 14:49

Heater. wrote: »

Hence my single interrupt suggestion.

No stack required.

WAITxx on something, if something else happens, swap PC and flags and go and do that. When it's done swap PC and flags back again.

I'm not following the semantics here, what you have described IS a stack, just a 1 level one.
An advantage of time-slice threading, is total independence of the running code.
If you do a response-stack-return, then that may save a little silicon, but it makes execution a lot more granular than time-sliced.

Some processors have an 'event manager', which allows HW events to run a simple state machine.
Perhaps a WAITTABLE type of operation could allow an event-branch at low silicon cost ?
Could be more suited to P1V ?

jmg · 2014-08-19 14:57

I think that depends on the design.
If tasks have their own pipelines, then there need be no stall.

mark · 2014-08-19 15:30

jmg wrote: »

I think that depends on the design.
If tasks have their own pipelines, then there need be no stall.

Right, I just believe that was the case with Chip's proposed implementation of tasks.

kwinn · 2014-08-19 15:57

User Name wrote: »

This would be a great feature to explore with Open P1. Even though I've never found the Propeller wanting for interrupts, why not expand the Propeller's repertoire?

I agree. On a couple of projects I had to input and save data that came a few times per second at random intervals, but was only available for 200 ns. Adding a latch would not have helped since the next datum could arrive as little as 200 ns later. The code was only a few lines but I had to dedicate an entire cog to it so no data was missed. A simple interrupt would have made adding that task to another cog possible.

kwinn · 2014-08-19 16:10

jmg wrote: »

It sounds identical to the P2 thread idea, with just a single bit index.
( You do need to also duplicate the pipeline, if you want 2+ code streams running)

I was under the impression that the P2 threading involved multiple PC's and executing the instructions pointed to by each PC in sequence. More like a propeller within a propeller than an interrupt.
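A rough Python model of that "propeller within a propeller" picture, assuming every task simply gets an equal turn (the real P2 scheduling details are not given in this thread). The function name `interleave` and the start addresses are made up for the example.

```python
from itertools import cycle


def interleave(start_pcs: list[int], n_slots: int) -> list[tuple[int, int]]:
    """Return (task, address) for each instruction slot, tasks taking turns."""
    pcs = list(start_pcs)
    order = cycle(range(len(pcs)))
    trace = []
    for _ in range(n_slots):
        task = next(order)
        trace.append((task, pcs[task]))
        pcs[task] += 1          # each task advances only its own PC
    return trace


if __name__ == "__main__":
    for task, addr in interleave([0, 100, 200, 300], 6):
        print(f"task {task} executes address {addr}")
```

Unlike the two-state swap, no event is needed to change threads here: the hardware rotates through the PCs on every instruction slot, which is why a blocking WAIT in one task matters so much.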

Given COG RAM is one of the more die-area costly parts of the COG, some form of time-sliced core usage, to allow as much code to run as possible, makes good sense.
The other issue is unused COGS == wasted RAM, but I guess that one is harder to crack.

Agreed.

mklrobo · 2014-08-19 16:19

cgracey wrote: »

This thread is about the new chip we are going to build in the 180nm process.
The big picture:

I've been hammering out a new, minimalist design that should be along the lines of a Prop1 in cog complexity, with a few things taken away and a few things added.

First, hub memory will be comprised of 16 instances of 32768 x 8 RAM for 512KB. This is going to yield a hub data path of 128 bits. Only the RAMs that are needed on a given cycle will be activated, saving power.

Though the cog memory map is still 512x32, cog RAM will be physically organized as 128 x 128, so we can read or write four contiguous registers with RDQUAD/WRQUAD instructions. This is way better than what we had on the Prop2, because, rather than just affecting mappable overlay registers, these transfers are into and out of the actual cog registers, themselves. These 128 bit paths don't take too much mux'ing and they keep the power down to reasonable levels. Interfaces to any peripherals can take advantage of them, too. This also gives cogs running at 200MHz (100 MIPS) a hub memory bandwidth of 200MB/s, which is enough to do any kind of VGA that we have the internal hub memory to support, at any color depth (up to 24bpp) - without any hub slot reallocation scheme needed to favor particular cogs. LMM greatly benefits from this, too.
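The figures in the two paragraphs above can be cross-checked with a few lines of Python. One assumption is needed that the post does not spell out: each of the 16 cogs gets one full-width (128-bit) hub access every 16 clocks. With that assumption, the 200MB/s figure falls out directly. The function name `hub_figures` is just for this sketch.

```python
def hub_figures(clock_hz: int = 200_000_000, cogs: int = 16) -> dict[str, int]:
    """Recompute the quoted hub numbers from the stated RAM organisation."""
    rams, depth, width = 16, 32768, 8           # 16 instances of 32768 x 8
    hub_bytes = rams * depth
    path_bits = rams * width                    # all RAMs read side by side
    # Assumption: each cog gets one full-width hub access every `cogs` clocks.
    slots_per_second = clock_hz // cogs
    bandwidth = (path_bits // 8) * slots_per_second
    return {
        "hub_kb": hub_bytes // 1024,
        "path_bits": path_bits,
        "bytes_per_second": bandwidth,
        "cog_bits_logical": 512 * 32,           # cog memory map
        "cog_bits_physical": 128 * 128,         # physical organisation
    }


if __name__ == "__main__":
    for name, value in hub_figures().items():
        print(f"{name}: {value}")
```

The last two entries confirm that 512 x 32 and 128 x 128 describe the same 16384 bits of cog RAM, just arranged so that four registers sit on one physical row.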

VGA is going to simply be a use case of a shifter that drives four DAC outputs to a set of fixed pins attached to each cog. The four channels from highest to lowest can be used as: R, G, B, HSYNC. In some modes, the shifter handles data in ways that are obviously for video, but, otherwise, it's a generic circuit that can simultaneously update four DACs with unique 8-bit data. You can also write the DACs directly in software, with 8 bits of dither, to realize something like 16-bit DACs.
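The post does not say how the dither would be generated, so the following is only one plausible reading: treat the upper 8 bits of a 16-bit value as the DAC code and the lower 8 bits as a fraction, and use a simple accumulator to decide when to output the next code up. The function name `dithered_samples` is hypothetical.

```python
def dithered_samples(value16: int, n: int) -> list[int]:
    """Emit n 8-bit DAC codes whose mean approximates value16 / 256."""
    base, frac = divmod(value16, 256)
    acc = 0
    out = []
    for _ in range(n):
        acc += frac
        if acc >= 256:
            acc -= 256
            out.append(min(base + 1, 255))   # bump one code, clamp at full scale
        else:
            out.append(base)
    return out


if __name__ == "__main__":
    samples = dithered_samples(0x8040, 256)
    print(sorted(set(samples)), sum(samples) / len(samples))
```

Averaged over 256 samples the output lands on the 16-bit target, which is the sense in which an 8-bit DAC plus 8 bits of dither behaves "something like" a 16-bit DAC once the output is low-pass filtered.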

What complicated the heck out of the P2 video was all the accommodation to support fancy color-space conversion for TV's. I plan to get rid of all that, as it's very costly, being full of staged multipliers and CORDIC rotators. Every flat screen TV I've seen has a VGA connector, and it is tidier than component connections, anyway. You can still drive a TV using the DAC shifter to make a 1-wire composite signal, but color modulation is no longer part of the package. This is a little sad, in that a one-wire color signal was nice, but you wouldn't want to have to read small text on a TV, anyway, so it was kind of a novelty.

There are some Prop1 instructions that I've never used, like CMPSX. Maybe we could cull a few of those for other things. Any ideas on getting rid of any of those instructions?

Do we really need to support words any more, with RDWORD/WRWORD? Bytes, I think, are always needed, but I don't think I've ever used words for anything before. Convention says we need them, but do we, really?

Here is the pin-out, as posted earlier in another thread:

Attachment not found.

This is going to take several weeks, probably, to develop. I'll post FPGA images for the DE0-Nano and DE2-115 boards as soon as we have something working.

With this kind of power, what applications do you expect? With all the things that have been done with the P1, what directions do you have in mind?
A Propeller laptop? Will this allow it to compete with the BeagleBone/Raspberry Pi, and yet remain stable in the environment in which the P1 dominates?
That would be interesting.

jmg · 2014-08-19 18:14

mklrobo wrote: »

...
A Propeller laptop? Will this allow it to compete with the BeagleBone/Raspberry Pi, and yet remain stable in the environment in which the P1 dominates?

A laptop, nope, completely the wrong design for that.
Not even really competition with BeagleBone/Raspberry Pi, as they run Linux. P2 is more complementary to them (and Intel's Galileo), as they do real-time I/O very poorly. So think of the P2 more as an advanced I/O and real-time processor, ideal for peripheral work.

kwinn · 2014-08-20 10:01

mklrobo wrote: »

With this kind of power, what applications do you expect? With all the things that have been done with the P1, what directions do you have in mind?
A Propeller laptop? Will this allow it to compete with the BeagleBone/Raspberry Pi, and yet remain stable in the environment in which the P1 dominates?
That would be interesting.

The laptop and tablet market is already overcrowded with competitors racing towards low margin commodity pricing. On top of that the P2 architecture does not suit those types of applications. Better to aim for things like HMI, building and industrial automation, real time I/O, data acquisition/logging and instrumentation applications.

Invent-O-Doc · 2014-08-21 04:07

I don't want ANY additional features on P2 apart from the most recent design. Just want something finished I can buy.

__red__ · 2014-08-21 09:51

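Can't you use WAITPNE to monitor multiple pins?

Read current state, give that and your mask to WAITPNE.

You now sleep until any of those pins change state?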

kwinn · 2014-08-21 15:27

Invent-O-Doc wrote: »

I don't want ANY additional features on P2 apart from the most recent design. Just want something finished I can buy.

I wouldn't want any additional features either if they would delay the P2, and with double the number of cogs there is less of a need for an interrupt/event switch feature. On the other hand if the P1 were to be redone for any reason adding a simple interrupt/event as Heater suggested would be a very beneficial and simple change.

kwinn · 2014-08-21 15:35

__red__ wrote: »

Can't you use WAITPNE to monitor multiple pins?

Read current state, give that and your mask to WAITPNE.

You now sleep until any of those pins change state?

Yes, you can monitor multiple pins and execute different code depending on which pin changed state, but that is not the same as having hardware multitasking where each task monitors and responds to one or more pins. There is overhead in deciding which one of the pins changed and jumping to that task, and in the meantime the other pins are not being monitored.
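The exchange above can be sketched in Python. This models WAITPNE as "block until (INA & mask) differs from state" and then shows the extra dispatch step of working out which pin moved. The helper names `wait_pne` and `changed_pins`, and the list of pin samples, are stand-ins for real hardware.

```python
def wait_pne(samples, state: int, mask: int) -> int:
    """Consume pin samples until (sample & mask) != state; return that sample."""
    for ina in samples:
        if (ina & mask) != state:
            return ina
    raise RuntimeError("pins never changed")


def changed_pins(before: int, after: int, mask: int) -> list[int]:
    """List the pin numbers inside mask whose level differs."""
    diff = (before ^ after) & mask
    return [bit for bit in range(32) if diff & (1 << bit)]


if __name__ == "__main__":
    mask = 0b1011                               # watch pins 0, 1 and 3
    samples = [0b0001, 0b0001, 0b0101, 0b1001]  # pin 2 toggles, then pin 3
    state = samples[0] & mask                   # "read current state"
    woke_on = wait_pne(iter(samples), state, mask)
    print(changed_pins(state, woke_on, mask))
```

The change on pin 2 is ignored because it is outside the mask, and the wake-up on pin 3 still needs the XOR-and-scan step before any handler can run. That dispatch cost, plus the fact that nothing is being watched while the handler executes, is the overhead described above.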

Heater. · 2014-08-24 01:19

@jmg,

I'm not following the semantics here, what you have described IS a stack, just a 1 level one.

Call it a stack if you like. I would not.

Two items that you swap between does not look like a stack to me. It does not have the semantics of PUSH and POP.

With a stack of size 2 you can have A on the stack [A]. You can PUSH B on the stack [A, B]. When you do a POP you only have A on the stack again [A]. The B has vanished. Those are stack semantics.

What I describe is two items A and B and a SWAP operation that gets you from one to the other. In this case A and B are the process state, program counter and flags, and swapping gets you running a different thread. The swapping being potentially triggered by external events in the manner of interrupts.
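The difference in semantics can be shown side by side in a few lines of Python. This is purely illustrative; `stack_demo` and `swap_demo` are invented names, and "A" and "B" stand for the two sets of PC and flags.

```python
def stack_demo() -> list[list[str]]:
    """PUSH then POP: B exists only while it is on top, then it is gone."""
    stack = ["A"]
    history = [list(stack)]
    stack.append("B")           # PUSH B
    history.append(list(stack))
    stack.pop()                 # POP discards B
    history.append(list(stack))
    return history


def swap_demo() -> list[tuple[str, str]]:
    """SWAP twice: both states persist, only their roles change."""
    active, parked = "A", "B"
    history = [(active, parked)]
    for _ in range(2):
        active, parked = parked, active
        history.append((active, parked))
    return history


if __name__ == "__main__":
    print(stack_demo())
    print(swap_demo())
```

After the POP the stack has forgotten B entirely, whereas after two SWAPs both A and B are still present and resumable, which is the distinction being drawn here.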

I do agree about the time-sliced threading, as used in XMOS devices. Such hardware-scheduled threading was introduced into the Propeller II design. That design was abandoned when it was realized that it was too big, complex and power hungry. I get the feeling we will not see such threading on the table again for the current P2 design effort.

Hence my suggestion of the simple two thread model. Threads being "ping ponged" between in response to external events rather than relying on JMPRET or TSKSWAP type instructions being used in cooperative scheduling. It sounds like a small, simple, elegant way to get event driven code, like FDS for example, working with minimal latency and code size.

I don't want ANY additional features on P2 apart from the most recent design. Just want something finished I can buy.

I agree. Ship it already.

David Betz · 2014-08-27 17:32

Invent-O-Doc wrote: »

I don't want ANY additional features on P2 apart from the most recent design. Just want something finished I can buy.

I wish I could keep track of what is in the "most recent design". Has an FPGA image been released and I missed it?

ozpropdev · 2014-08-27 17:58

David Betz wrote: »

I wish I could keep track of what is in the "most recent design". Has an FPGA image been released and I missed it?

David,
The last P2 FPGA release was 24 March 2014 and was the "BIG" design.
Nothing released yet in the "new" P2 design.
Chip hinted a few weeks ago that he hoped to have an FPGA image soon.

Cheers
Brian

RossH · 2014-08-27 20:06

ozpropdev wrote: »

Chip hinted a few weeks ago that he hoped to have an FPGA image soon.

Just keep in mind that Chip's concept of "soon" may not be the same as ours.

Ross.

Dave Hein · 2014-08-28 05:36

On August 10 Chip posted that the FPGA image would be available in "a week or so". Let's hope it's not too much longer.

mindrobots · 2014-08-28 07:53

Let's hope it is as long as Chip needs to do what he wants/needs to do! :0)

rjo__ · 2014-08-28 10:03

Imagine a really complex design running on a system that has a capacity to corrupt files on a somewhat random basis...leaving the text files appearing to look ok.

David Betz · 2014-08-28 10:07

mindrobots wrote: »

Let's hope it is as long as Chip needs to do what he wants/needs to do! :0)

I believe Ken posted a message a while back that Chip would be available for supporting the P1v effort until the end of August and would then be mostly working on the P2. So I would guess that the "a week or two" now starts in September, with maybe a little added on for getting back up to speed. Don't rush the wizard! :-)

Dave Hein · 2014-08-28 11:57

You guys have a lot more patience than I do. When the PropGCC development was started over 3 years ago the target was for the P2. The target changed to P1 when PropGCC started to come up, and there was no P2 to run it on. I understand that there have been some major problems that had to be overcome during the P2 development, and that P2 was completely redesigned earlier this year, but if an FPGA image wasn't going to be available a week or so from the 10th, then don't say that it will be available. Chip, I admire your technical abilities, but 4 or 5 years for P2 development seems a little excessive.

Of course, P2 is more important for Parallax than it is for the customers. We can always use some other chip. There are several alternatives available, and more are coming out as time moves on. I suspect that some people who were active on the forum a few years ago have moved on to something else. To me, it seems like it is important for Parallax to get the P2 out as soon as possible, and then move on to the P3.

Heater. · 2014-08-28 12:45

Dave,

When the PropGCC development was started over 3 years ago the target was for the P2. The target changed to P1 when PropGCC started to come up, and there was no P2 to run it on.

I cannot believe that is true. Is it really possible that people are going to write a compiler for an instruction set that does not exist?
