Propeller II update - BLOG

whicker · 2013-12-02 00:08

I finally got through this thread to the end here:

My opinion (and I know this is not a democracy):

Keep package the same
8 Cogs
Cog dependent Analog output
512 long AUX memory
Simple 1-5, 2-6, 3-7, 4-8 dual hub access (optional) to double data rate for the single thread that runs high level code and for video.
(I foresee the drop from 8 to 6 cogs as being acceptable to get main code and video driver the data rate it needs, any more is excessive)
((The cog that wants to forfeit its hub access in favor of its "twin" on the other side says so. If the twin doesn't take it, then the local hubop will complete.))
The remaining space used for more HUB memory.
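
A minimal sketch of the twin-slot donation described above, written in Python purely as a model. It assumes 8 cogs numbered 0-7 (the post counts 1-8), one hub slot per clock in round-robin order, and a donating cog whose twin may or may not have a hub operation pending. The names `twin`, `slot_user`, `donates` and `wants_hub` are hypothetical and not taken from any P2 documentation.

```python
NUM_COGS = 8  # assumed cog count, numbered 0..7 here

def twin(cog):
    """Cog on the opposite side of the hub rotation (0<->4, 1<->5, ...)."""
    return (cog + NUM_COGS // 2) % NUM_COGS

def slot_user(slot, donates, wants_hub):
    """Return which cog uses hub slot `slot`, or None if the slot goes idle.

    donates:   set of cogs that have offered their slot to their twin
    wants_hub: set of cogs with a hub operation pending in this slot
    """
    owner = slot % NUM_COGS
    if owner in donates and twin(owner) in wants_hub:
        return twin(owner)      # twin takes the forfeited slot
    if owner in wants_hub:
        return owner            # twin declined: the local hubop completes
    return None

# Cog 0 donates, cog 4 always has hub work pending: over one rotation
# cog 4 is served in slot 0 and in its own slot 4.
served = [slot_user(s, donates={0}, wants_hub={4}) for s in range(NUM_COGS)]
```

In this model the donation is opt-in per cog, so cogs that never donate see exactly the same hub timing as before.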

A SERDES that does:

1) USB NRZ logic
2) synchronous data in with selectable rising or falling edge of clock.
3) synchronous clock out selectable rising or falling edge 1/2T from when data changes.
4) adjustable start and stop bits for asynchronous data in and data out. oversampling. space for parity bit, but manually calculated because higher level processing should be using block level CRC's anyways.

NO QUAD SPI HARDWARE SUPPORT for what can be done in software. From what I looked at, it seems like the quad mode is used in bursts anyway, meaning there is an initial SPI command on just data pin 0, then maybe some dummy clocks, then the data starts arriving on all 4 data pins. It's too complicated for hardware because the switchover from one bit to 4 bits is command dependent and chip dependent.

ozpropdev · 2013-12-02 00:22

The discussions (arguments) about what to do with the free die space have been interesting to say the least.

When first brought up by Chip, the idea of more Cogs immediately got my attention.
As the day progressed I started shifting from this to the idea of more HUB ram.

As further time passed I shifted to AUX ram as the prime target.
How I ended up here was for a few reasons.
Multi-tasking is a pretty good replacement for multiple cogs. My early DE0-Nano stuff taught me that.
Video would benefit a lot with more ram (no brainer there).

All the new instruction tweaks have helped reduce code size to assist in better COG ram management.
COG ram is the one area where we all want more, but we all know this is impossible
without an architecture change, and that's certainly not going to happen.
So reducing the use of COG ram as registers by using more AUX ram seems a good option.
Swapping code snippets from AUX to COG is a nice option for larger cog programs.

The problem with more AUX is the limitation of 8bit immediate addressing in the instruction set.

ZCWS		0000110 ZC I CCCC DDDDDDDDD SSSSSSSSS		RDAUX	D,S/#0..FF/SPx
ZCWS		0000111 ZC I CCCC DDDDDDDDD SSSSSSSSS		RDAUXR	D,S/#0..FF/SPx

So I am now at the point where I think the best, simplest and safest use of the free die space is extra HUB ram.

Phew, I feel better now.

Ozpropdev

Roy Eltham · 2013-12-02 00:23

A single P2 COG can write 16 bytes (4 longs) to HUB memory every 8 cycles. At 160MHz that's 320MB/s. That's enough to fill HUB memory (at 126K) a little more than 2480 times per second. If the chip ends up running at 200MHz, then it'll be 400MB/s or just over 3100 HUB fills per second. If all 8 COGs are doing this then they can achieve 2,560MB/s at 160MHz or 3,200MB/s at 200MHz.

Of course, this assumes the COG is getting that much data fed to it from someplace...
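
The arithmetic behind those figures can be checked with a short Python sketch. It only uses the numbers quoted in the post (16 bytes per hub window, one window every 8 clocks, 126K of hub RAM); the function names are made up for illustration.

```python
HUB_BYTES = 126 * 1024   # 126K of hub RAM, as quoted in the post
BYTES_PER_SLOT = 16      # one quad (4 longs) per hub window
CYCLES_PER_SLOT = 8      # each cog gets a hub window every 8 clocks

def cog_bandwidth(clock_hz):
    """Peak bytes per second one cog can move through its hub window."""
    return clock_hz // CYCLES_PER_SLOT * BYTES_PER_SLOT

def hub_fills_per_second(clock_hz):
    """How many times per second one cog could rewrite all of hub RAM."""
    return cog_bandwidth(clock_hz) / HUB_BYTES

for mhz in (160, 200):
    hz = mhz * 1_000_000
    print(f"{mhz} MHz: {cog_bandwidth(hz) / 1e6:.0f} MB/s per cog, "
          f"{8 * cog_bandwidth(hz) / 1e6:.0f} MB/s for 8 cogs, "
          f"{hub_fills_per_second(hz):.1f} hub fills/s")
```

These are peak numbers: they assume every hub window is used for a full quad transfer.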

Also, how many instructions can the cog execute between writes to HUB if it has 2 timeslots (let alone having more)? Wouldn't it be 3 instructions (because one of the 4 is the read or write to/from hub)? What can it do with so few instructions that would get it enough data to feed into HUB? Or if it's reading HUB what can it be outputting to with so few instructions? I'm guessing only SDRAM or video output. You can already go from SDRAM directly out to video via the AUX ram streaming mode, so no need for HUB access there. That's the fast path for that usage case.

Am I missing something? What can you do with 3 instructions between writes that would warrant one COG using two timeslots? Am I wrong and it's more instructions?

It just seems silly to me to be talking about ways to speed up access to HUB, when it's only 126K...

Cluso99 · 2013-12-02 00:43

Not every cog is equal in a working design.

The main execution cog(s) and the video cog(s) require extra resources that the other cogs do not.

Using this as a base, what if the extra space was used for 2 blocks of ~48KB (~12K longs) memory - or one of 64KB and one of 32KB if you think that would be better.
Each block has a separate address/data bus that goes to all cogs. These two blocks then become "shared" resources.

Any 2 cogs can claim the first 48KB block.
Any 2 cogs can claim the second 48KB block.
The same cog could claim both blocks, with different cogs claiming the other side of each block, i.e. CogA has both blocks, one block is shared with CogB and the other with CogC.
The same cog could claim both blocks and another cog could claim both blocks too, i.e. CogA shares both blocks with CogB.

The first claiming cog sets access to be either 1:2 clocks or it has priority over the paired cog.

It's probably easy to implement and would give another high speed pathway between cog pairs.

Any thoughts??

ozpropdev · 2013-12-02 00:45

@Roy
The area that has the most to gain from faster hub access is multi-tasking.
Remember you potentially have 4 tasks within the COG FIGHTING over the same time slot.
Pipeline stall being the symptom here.

JRetSapDoog · 2013-12-02 00:49

Hmm...I've run out of cogs, and also conceptually run out of them knowing that there weren't enough to entertain a particular design approach. I basically don't even consider objects that take multiple cogs. It almost sounds like some would be happy with fewer cogs than the 8 that we have. But if there's anything a Propeller should have, it's cogs!

In my case though, I guess I like cogs because they sort of provide some of the features that an operating system provides with more determinism but less of the complexity. Full disclosure: I've probably got some special requirements that call for more cogs, and based upon what I've read about the P2, the time-slicing isn't going to address those needs. But maybe that's just me, though I wouldn't have thought so.

However, from all of the rather bold statements that more cogs is a waste due to lack of memory per cog and so on, then perhaps some weren't really satisfied with the P2 before the extra silicon was opened up, either. But wouldn't most of us have been pleasantly surprised when the chip was formally announced to have found more cogs? The chip's strengths are increasing in all other areas except cogs. I can see some more cogs being a natural extension. Yes, there's the time-slicing thing with cogs, but that doesn't allow objects from the OBEX to be snapped together so freely, not without major mods (and assuming that things will fit together in a cog).

As for me, I certainly don't think Chip's inquiry was a no-brainer by any means (if it was that clear-cut, he wouldn't have asked). Well, for about 20 minutes of forum-reading time, I was relishing the thought of 12 cogs, until, you know, calmer heads prevailed and informed me that I didn't need them, or, if I did, I was doing it all wrong (which might be the case). It's a pity, though, considering how relatively easy Chip said they would be to implement. Of course, extra memory should be similarly straightforward to add, too, I'll admit.

Hey, how about a compromise: 10 cogs (2 additional) with another 50K or so of HUB RAM tossed in? Still too many cogs?

Anyway, whatever features ultimately make the cut are bound to make for a powerful chip. Finishing it is key!

Roy Eltham · 2013-12-02 01:16

ozpropdev,
With 4 tasks running, and each of them continuously using HUB, they would each get a timeslot every 32 cycles, right (best case in the current scheme)? That's still 5,000,000 (5 million) HUB accesses per second per task (over 38 full HUB memory reads or writes per second, using just longs, assuming you can't really use the quad read/write well with 4 tasks since there is only one quad buffer). Also, again, I run into the number of instructions between hub accesses thing. You will have the same problem as I mentioned before. If you get more hub access windows, then you can do fewer instructions in between, so what are you going to do with those extra accesses?

ozpropdev · 2013-12-02 01:53

Roy Eltham wrote: »

With 4 tasks running, and each of them continuously using HUB, they would each get a timeslot every 32 cycles, right (best case in the current scheme)? That's still 5,000,000 (5 million) HUB accesses per second per task (over 38 full HUB memory reads or writes per second, using just longs, assuming you can't really use the quad read/write well with 4 tasks since there is only one quad buffer). Also, again, I run into the number of instructions between hub accesses thing. You will have the same problem as I mentioned before. If you get more hub access windows, then you can do fewer instructions in between, so what are you going to do with those extra accesses?

Roy,
The problem is that the tasks are not split evenly (25%) cog time each.
Allocation of the cog time seems to be more about compensating for pipeline stall.
It seems wasteful (there's that word again) to give more cog time to a task just to cover a HUBop.
It's hard to put meaningful numbers to this, it's based on just tuning it up till it works.
Does this make sense?

Ozpropdev

jmg · 2013-12-02 01:58

Roy Eltham wrote: »

If you get more hub access windows, then you can do fewer instructions in between, so what are you going to do with those extra accesses?

I think one area where this can gain is things like buffering DRAM to video. Because of the nature of SDRAM, you want to transfer a block fast enough to still have slack to display it. That work is very simple, and can be repeat coded, but in this realm 5M accesses/s is a low number, not a high one.

Another area would be display-list playbacks - More Speed allows a larger list to complete in the same time, and this is what the FT800 does, to allow it to get by with less RAM.

As you have said, a Prop is RAM STARVED, which means anything that buys (virtual) extra memory, is well worth the trouble.
Cars are shipped with more than one gear, and everyone is used to that.

jmg · 2013-12-02 02:08

whicker wrote: »

NO QUAD SPI HARDWARE SUPPORT for what can be done in software. From what I looked at, it seems like the quad mode is used in bursts anyway, meaning there is an initial SPI command on just data pin 0, then maybe some dummy clocks, then the data starts arriving on all 4 data pins. It's too complicated for hardware because the switchover from one bit to 4 bits is command dependent and chip dependent.

I'd call this half right.
Yes, for devices that need a single-bit SPI preamble, you set the HW to one-bit SPI in SW, then use the HW for the one-bit address until the mode flips (again in SW); at that point you flip the HW to nibble-wide and the transfer bandwidth jumps.
So you use SW for the chip-dependent stuff, and HW for the fast stuff.

It's not too complicated at all, because you do not need to try to make a state engine do all the decisions.
Hardware does what it is best at, which is moving bits. Framing is managed in SW.
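
The split described here (software decides the framing, hardware only shifts bits at a given width) can be sketched in Python as a model. This is an illustration under assumptions, not a description of any real device: it assumes MSB-first bit order in the one-bit phase and high-nibble-first in the four-bit phase, and the opcode `0x6B`, the dummy-clock count and all function names are arbitrary examples.

```python
def serial_bits(byte):
    """One-bit phase: one bit per clock on a single data pin, MSB first."""
    return [(byte >> i) & 1 for i in range(7, -1, -1)]

def quad_nibbles(data):
    """Four-bit phase: one nibble per clock across four pins, high nibble first."""
    out = []
    for b in data:
        out.append(b >> 4)
        out.append(b & 0xF)
    return out

def from_quad_nibbles(nibbles):
    """Reassemble bytes from a nibble stream captured in the four-bit phase."""
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))

def transfer_plan(command, dummy_clocks, payload):
    """Software-managed framing: each phase is (shifter mode, per-clock values)."""
    return [
        ("1-bit", serial_bits(command)),
        ("dummy", [None] * dummy_clocks),
        ("4-bit", quad_nibbles(payload)),
    ]

# Example opcode and dummy count are placeholders; real values are chip dependent.
plan = transfer_plan(0x6B, dummy_clocks=8, payload=b"\xA5\x3C")
```

Only `transfer_plan` knows about the command set; the two shifter functions stay the same for any chip, which is the point being argued.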

jmg · 2013-12-02 02:15

Ariba wrote: »

But if you use many objects with fast DACs (Audio, Functiongenerators and so on) then the cog allocation can only work if you start the drivers in the exact right order.

Lots of SW development involves getting things organized into the right places, so this is a SW management problem.

It just needs a means to define, and check, that you are getting what you expect.
PCs are great at this sort of mundane allocate and cross check stuff, they do it in milliseconds.

Cluso99 · 2013-12-02 02:37

Initially, I thought wow, 12 cogs. But I changed and thought memory (hub, aux or new block buffers) would be best.

Now that things have quietened a bit I have had time to think this through.

Cogs are what I now wish for. Take this scenario:

0: Main program (spin or C)
1: Video generator (output)
2: Video manipulation (game engine, gui, etc)
3: SD Driver (at least 1 cog)
4: SDRAM and Cache Driver
5: USB Low Level Driver
6: USB Command Processor (USB FS will require at least 2 cogs)

Now I only have 1 cog left for keyboard/mouse/serial.

Desperately need more cogs!

How to implement slots?
Simple, default 1:12 clocks, with the same mechanism as Chip has detailed.
By using 2 donor cogs, we can get 1:, or with 1 donor, 1:6.
Also unused slots can be used by request.
Both of these are more important with 12 cogs.
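
The slot ratios above follow from simple division, sketched here in Python. The sketch assumes each cog owns one hub slot per rotation and that a cog's own slot and any donated slots are evenly spaced around the rotation; both function names are hypothetical.

```python
from fractions import Fraction

def hub_share(num_cogs, donors=0):
    """Fraction of hub slots a cog gets when `donors` other cogs give it theirs."""
    return Fraction(1 + donors, num_cogs)

def clocks_between_windows(num_cogs, donors=0):
    """Clocks between hub windows, assuming own and donated slots are evenly spaced."""
    slots = 1 + donors
    if num_cogs % slots:
        raise ValueError("slots cannot be evenly spaced in this rotation")
    return num_cogs // slots

default_12 = clocks_between_windows(12)          # 1:12 default
one_donor_12 = clocks_between_windows(12, 1)     # 1:6 with one donor
two_donors_12 = clocks_between_windows(12, 2)    # this model's result for two donors
```

Under these assumptions two donors on a 12-cog rotation work out to one window every 4 clocks; that figure comes from the model, not from the post.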

ozpropdev · 2013-12-02 03:08

This is quite a dilemma Ray.

The more I think about cogs, hub and aux, the more I can justify all of them.
I see pros and cons for all of them. What to do?

The hub slot issue has most of my attention at the moment.
I worry that this feature may be easy to implement but will be vetoed because of OBEX concerns.
It would be SAD when we have silicon and we start hearing "I wish we had that time slot thingy".

Ozpropdev

Kerry S · 2013-12-02 03:29

Hard choice.

For me it is more Cogs. The limit I see hitting is Cog Ram. No way to change that. Except if I have more Cogs, then I can split my processes between them (remember we have WAY more pins to control now) and still have full real time control of them. Now I have to limit my Cog pin use to what I can fit into cog ram. Running out of cog ram = not being able to use pins. I realize that I am a special case (industrial process control) where I have a LOT of pins that need solid real time control to be fast and effective.

For those writing soft applications they need HUB ram, for those doing maxed out video they seem to want AUX. Chip and Ken will have to look at who their biggest commercial buyers are and see which of the three areas are paying the bills and do what will keep them happy and the rest of us will still have an amazing one of a kind processor to use.

Regardless of what Chip decides I am going to do my best to design my product using the P2. If it turns out to be too limited to do it then I will just wait for the P3 (hopefully not in 7 more years).

ozpropdev · 2013-12-02 03:38

Kerry S wrote: »

(remember we have WAY more pins to control now)

A very valid point that hasn't been mentioned in this debate.
You're right that more cogs solves the COG ram shortage!

Erik Friesen · 2013-12-02 04:33

Really, this onboard regulator idea would be nice, but if that can't be, then I suppose hub ram, or both.

Tubular · 2013-12-02 04:57

Cluso99 wrote: »

Initially, I thought wow, 12 cogs. But I changed and thought memory (hub, aux or new block buffers) would be best.

Now that things have quietened a bit I have had time to think this through.

Cogs are what I now wish for. Take this scenario:

0: Main program (spin or C)
1: Video generator (output)
2: Video manipulation (game engine, gui, etc)
3: SD Driver (at least 1 cog)
4: SDRAM and Cache Driver
5: USB Low Level Driver
6: USB Command Processor (USB FS will require at least 2 cogs)

Now I only have 1 cog left for keyboard/mouse/serial.

Desperately need more cogs!

The P2 cogs are really potent and useful. For your application above you could easily have a "HMI" cog that takes care of Video, Keyboard, Mouse, serial if you want. You could also probably fit the gui in as well. Then that cog becomes a useful, reusable block. This is pretty much exactly what OzPropDev has done with his invaders demo, although you'd probably put the game engine in with the main program, or hub ram.

Are you sure USB FS will require at least 2 cogs, if we get silicon clocking at 160MHz+? Or will those proposed decoding instructions get it into 1 cog? How demanding is the command processor?

David Betz · 2013-12-02 05:10

cgracey wrote: »

It is untyped, though the compiler knows where variables are. To access the AUX variables by index, you would use 'AUXR[index]' in Spin. Spin would give you the address of a local in AUX by '@localvariable'. You'd just better use that index in an AUXR[index] expression. My nephews saw my mom in the Apple app store (on her iPad) trying to find handbags from Macy's.

While this would obviously work, it seems like it would be awkward for the programmer. If I pass the address of a variable to a function I'd like to be able to use that in a generic way and not need to know if it points to AUX memory or hub memory. Otherwise I need a different variant of the function for AUX and for hub.

SRLM · 2013-12-02 05:39

David Betz wrote: »

While this would obviously work, it seems like it would be awkward for the programmer. If I pass the address of a variable to a function I'd like to be able to use that in a generic way and not need to know if it points to AUX memory or hub memory. Otherwise I need a different variant of the function for AUX and for hub.

And to further this thought: what about passing variables from one cog to another? If it's in AUX then you can't do that.

David Betz · 2013-12-02 06:14

SRLM wrote: »

And to further this thought: what about passing variables from one cog to another? If it's in AUX then you can't do that.

This is why I don't think that the AUX memory will be that useful for PropGCC, too many restrictions as to how it is used.

potatohead · 2013-12-02 06:42

You know, SPIN is going to run about as fast as PASM does on P1. Somewhere close anyway. At that speed, doing the SD card, or moving some pixels around, mouse, etc... can happen in SPIN, making a big library COG that can do lots of things, every so often, or frequently but not real time, pretty easy. Honestly, that plus the snippets mentioned below transforms SPIN into something useful for more than just a larger supervisor type program. Many of the things we use PASM for today can happen in SPIN.

That will be true of C too, and it's faster still! Fewer support COGS needed. But RAM is needed for the magic to happen on a program size that really does stuff.

COG threads can pack video, and input devices together no problem. Ozpropdev has video running in a 1/8 or was it 1/16th task! For basic video needs, it almost doesn't take a COG, due to how waitvid has been seriously improved over the demanding one in P1.

This will all work best with nice and roomy HUB RAM.

Without it, we are going to end up streaming data from all over the place, and that's just a tougher, slower model overall. More HUB RAM means room to page in from SDRAM, or whatever serial thing makes sense too. Bigger pages, or more granular ones that can keep something moving while the paging is being done, etc...

For those wanting a GUI environment, that HUB RAM is the key to having it at some reasonable color resolution, mouse, backing store, etc... We can have nice ones! And they can be mostly SPIN too. And, since SPIN will allow inline PASM snippets, some little bits here and there, or an LMM thing needed for speed without blowing a COG, can very easily happen. So we won't always need a math COG, for example, nor a blitter / graphics COG, unless we want one.

These COGS are considerably more potent than the P1 COGS. HUB RAM.

SRLM has been working on lean and mean C. (nice job man!)

Having a lot of HUB RAM means still doing that, so that C programmers can get the most possible, but it also means being able to do it for a nice, luxurious kernel with libraries that do lots of stuff users can just use with much fewer worries, and still have room for a respectable program. We want this.

In short, P2 will bring us a new trade-off: COGS won't always be a must have, but a want to have based on a speed requirement. Tons of things done by a whole COG on P1 can be done as a task, or snippet, or just plain old SPIN or C program, which frees COGS where those trade-offs make sense.

potatohead · 2013-12-02 06:58

@David

I agree with you. If it's used in SPIN, it will require some thinking. Nice to have for some speedy thing, but not so nice in the context of a larger program, which will use the HUB largely as we do today.

A bigger AUX RAM will complicate larger programs, mainly by pushing them to external RAM much quicker for lack of HUB RAM. Some data streaming, video, sprites, etc... would be improved, but I have to say our video is going to be nuts good anyway. I would feel bad about maximizing it when the return from larger programs, easier larger programs will impact so many more users.

In SPIN, the AUX may well be useful as a quick scratch pad for PASM snippets, maybe holding some data across a few snippets that get loaded.

For inline PASM with C, the same things can be done. So it's there if needed and there is enough of it to pay off nicely, IMHO.

Kerry S · 2013-12-02 07:59

potatohead wrote: »

... Many of the things we use PASM for today can happen in SPIN.

That will be true of C too, and it's faster still! Fewer support COGS needed. But RAM is needed for the magic to happen on a program size that really does stuff.

COG threads can pack video, and input devices together no problem. Ozpropdev has video running in a 1/8 or was it 1/16th task! For basic video needs, it almost doesn't take a COG, due to how waitvid has been seriously improved over the demanding one in P1.

This will all work best with nice and roomy HUB RAM.

That makes things seem a lot better. All I have to go on at this point are experiments with the P1, wish I had one of the FPGA to play with, and my concern was going from I/O limited on the P1 to cog mem limited on the P2. I need 62 I/O not including video, mouse, keyboard, ext mem, etc. 40 of those need to be high speed real time. If I really can do a nice, commercial quality, user interface with 4 cogs or less then I should get the performance I need out of the process control side.

potatohead · 2013-12-02 08:19

Yeah, mostly, it takes an entire P1 to do a nice user interface. And color / resolution / speed limits really didn't allow for as much as people would have liked to do without exotic tricks, or extra hardware. IMHO, the best results came from text oriented interfaces, and some specialized graphical ones people did. Some of those turned out pretty great. Almost enough.

It's early, but I'm feeling like that basic quality GUI is going to be possible to do on P2. Of course, that really depends on what people call "commercial quality" too. The basic elements are there: resolution, color depth, pixel processor, SDRAM. When we get this latest change done and commit to a synthesis run, it will be time to build tools up again and people can try some stuff.

I'm itching to go there myself. We had C working and I started to explore that and PASM. Chip has an early version of SPIN 2 done, but it needs the usual work before it's ready to bang around on. Once this change is sorted, lots of us are looking forward to that and the revision to PropGCC being whipped into good enough shape to start doing stuff.

Seems it all was a nice dry run so far. We learned a lot, P2 got tweaked, and it's looking good now!

The nice thing we will have is the SDRAM operates fast enough for the video to render right out of it. Nice looking things are going to be possible. For those so inclined, very clever things are going to be possible too, and I'm thinking of dynamically drawn, custom displays that are more than just bitmaps.

And we've got the gamut: NTSC / PAL (yes!), Composite, S-video, VGA, Component. Many would love HDMI, but that's really just an interface chip away.

The good component capability interests me because that can drive everything from TV resolutions and sweep rates, right up through a full 1080 HDTV display on just three pins! If you don't need color, one pin does it nicely for a very good 8 bit grey scale type display.

Good times ahead for sure!

User Name · 2013-12-02 08:24

Brian Fairchild wrote: »

The moment you make it difficult for developers, particularly commercial developers, they will look elsewhere. Why would I choose a processor where I can't freely use a 6-channel 32-bit timer/PWM module alongside a USB device module without considering how my processor will allocate these things called 'slots' and how it will access system RAM?

This strikes me as illogical. Just don't use the feature if you don't like the feature. Use only objects that don't take advantage of the feature. It's simple.

Have you ever developed a product with an ARM? Clearly not. If you had, you'd realize that complexity doesn't drive away potential commercial developers, anyway. The typical ARM IDE alone is so much more complex than this simple slot sharing business is that it's laughable to think it would scare anyone away! Even Phil and heater would soon realize that they wanted it.

Kerry S · 2013-12-02 09:21

potatohead wrote: »

Of course, that really depends on what people call "commercial quality" too

Not too insane, at least I hope. Something any cheap 'pad' would be able to do today. No retina display, no 3d graphics, no gaming. Just a resolution that people will find acceptable compared to what they are used to today, i.e. minimum 1024x768, but bigger is always better for fitting in information/controls. Something in the 10" class I am thinking. Image wise it would be 75% text based but it needs graphical elements for icons, bar graphs, sliders and the like. Nothing really super fancy, just functional, clean and professional looking. Could get by with 256 colors but with more you can do some nice textures with your icons so they 'look' modern. Other than looking at hardware, this is the first thing they interact with. It needs to be at least nice enough that they don't notice it...

Electrodude · 2013-12-02 09:31

David Betz wrote: »

While this would obviously work, it seems like it would be awkward for the programmer. If I pass the address of a variable to a function I'd like to be able to use that in a generic way and not need to know if it points to AUX memory or hub memory. Otherwise I need a different variant of the function for AUX and for hub.

Maybe you could make the top few bits of the address indicate if it's hub or aux ram or somewhere else?

%00...: hub ram
%01...: aux ram
%1...: user defined (external, cog ram, etc.)
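
A rough Python sketch of what such tagged addresses could look like. It assumes a 32-bit pointer, treats the offset as an index into a list of longs, and uses made-up helper names (`classify`, `aux_pointer`, `read_long`); none of this is an existing Spin2 or PropGCC interface.

```python
ADDR_BITS = 32  # assumed pointer width

def classify(addr):
    """Split a tagged address into (memory space, offset) using its top bits."""
    if addr >> (ADDR_BITS - 1):                       # %1...: user defined
        return "user", addr & ((1 << (ADDR_BITS - 1)) - 1)
    offset = addr & ((1 << (ADDR_BITS - 2)) - 1)
    if (addr >> (ADDR_BITS - 2)) & 1:                 # %01...: aux ram
        return "aux", offset
    return "hub", offset                              # %00...: hub ram

def aux_pointer(index):
    """Build a %01-tagged pointer to an AUX location."""
    return (0b01 << (ADDR_BITS - 2)) | index

def read_long(addr, hub, aux, user_read):
    """Generic read: the caller does not need to know where `addr` points."""
    space, offset = classify(addr)
    if space == "hub":
        return hub[offset]
    if space == "aux":
        return aux[offset]
    return user_read(offset)

hub = [10, 11, 12]
aux = [20, 21, 22]
value = read_long(aux_pointer(2), hub, aux, user_read=lambda off: -off)
```

A scheme like this lets one function accept either kind of pointer, but since AUX is private to each cog, an AUX-tagged pointer still only means something in the cog that created it.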

David Betz · 2013-12-02 09:43

Electrodude wrote: »

Maybe you could make the top few bits of the address indicate if it's hub or aux ram or somewhere else?

%00...: hub ram
%01...: aux ram
%1...: user defined (external, cog ram, etc.)

Are you suggesting this for Spin2 or PropGCC? PropGCC already does something like that although it doesn't currently include support for AUX memory but that could be added. It would be nice if there was hardware support for this but I guess that's asking a lot. Also, it still leaves the problem that a pointer to AUX memory no matter how it is encoded will not be useful if passed to another COG.

Cluso99 · 2013-12-02 09:44

It is great that we have multitasking. That will permit standard drivers to be written, combining keyboard/mouse/serial, etc. But many users will not want that level of complexity just to drive their other pins, etc. Hence, my reasoning that it will be the cogs in shortest supply.

I mentioned above putting a pair of aux pointers into cog $1F0-1F1 that work with standard instructions just like INDA/INDB do. This would open up aux ram for better variable usage. Using this would likely take 2 instructions and therefore 2 clocks, but would be quite useful. The work has already been done with inda/indb and is proven.
Example
SETAUX #auxreg
XOR AUXA

Cluso99 · 2013-12-02 09:50

sorry, xoom/android problem

example
SETAUXA #auxreg
XOR AUXA,#$7F

and define an aux block as
DAT
ORG 0 (or offset)
AUXREG res 1
var2 res 1
var3 res 1

This would increase the aux ram potential significantly.
