What would you want more of, cogs or RAM?

Gadgetman · 2006-11-25 23:01

Because now each instruction only takes ONE clock pulse to execute instead of the four on the current Propeller, which quadruples the speed, and the clock-speed is also doubled...
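
A quick check of that arithmetic, as a sketch only. It assumes the current part's 80 MHz clock and 4 clocks per instruction (both figures appear later in this thread) and takes "doubled" to mean 160 MHz; the function name is made up for illustration.

```python
# Rough per-cog throughput estimate. The 80 MHz / 4-clock figures describe
# the current Propeller; 160 MHz / 1-clock is the proposal as described above.
def mips(clock_mhz: float, clocks_per_instruction: int) -> float:
    """Millions of instructions per second for one cog."""
    return clock_mhz / clocks_per_instruction

current = mips(80, 4)     # today's part
proposed = mips(160, 1)   # assumed: doubled clock, one clock per instruction
speedup = proposed / current
print(current, proposed, speedup)
```

Under those assumptions the two changes multiply, so the combined gain is larger than the 4x from the pipeline alone.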

I'm not certain, but my guess is that 5V is avoided because it adds speed constraints.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Don't visit my new website...

parsko · 2006-11-26 00:30

Martin Hebel said...
While I don't think the plan is to bring port B out to the I/O pins with this version, it would be great if it were available internally to allow a 32-bit bus between cogs for inter-cog communications.

Great discussions,
Martin

Martin, you read my mind! I had remembered just that reading through the posts, then got to yours... I remember someone mentioning this a few months back, something of a "ghost" register. I haven't thought of how you would then coordinate the pin outputs if it was external. But internal use would be nice, I'd second that suggestion.

-Parsko

lairdt · 2006-11-26 00:51

8 (faster) cogs and 256k RAM, would prefer 1024 word cog RAM as well.

Gavin · 2006-11-26 01:08

How about moving the current part to faster technologies.
Leave everything the same except make the cogs run at 160mips.
Just thinking it might be faster to get silicon so we can all get more power sooner. No major design change except shrinking the micron size. If the die gets smaller it should be cheaper than the current part.
We can then call it the Turboprop and have a Supercharged Hydra.

Gavin

IanM · 2006-11-26 02:41

Looks like a lot of people are voting for more RAM. I would prefer more cogs. It is the cogs that distinguish the prop from all other uCs. 16 cogs would simplify (not complicate) a lot of applications. It doesn't seem to fit the prop philosophy if the next version just adds more memory and speed. If you need that much more memory you can go off chip, or perhaps the prop is not the best option.

However, whichever wins, looking forward to it!

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Ian Mitchell
www.research.utas.edu.au

lairdt · 2006-11-26 03:12

Coming from a MCU with 16MB RAM/EEPROM addressing, I don't think the 256k increase is too much to ask for in Propeller v2.0. Coupled with the added processing power of 8 cogs however, not having 256k is a real problem. It would only get worse in a system with 8 faster or 16 total cogs.

Matthew Hay · 2006-11-26 03:42

Okay, unfortunately I've never used the prop (only used the BS2, 8052, and Atmel Butterfly) but I'd say go for the 8/256K.

Also I'd go for adding in the ability to run multiple chips, i.e. have a master chip and slave chips. You could have the slave chips act as extensions of the master (i.e. more cogs / IO pins). With a little creative programming you could almost do that now (though not having used the chip I can't say for sure).

Anyway, that's just a crazy idea I had, though I'm not sure how hard it would be to build that into a chip.

-Matt Hay

Tracy Allen · 2006-11-26 03:47

I'd vote for 8 COGs and 256k RAM, with 8 cycles per hub access. Especially when combined with the pipeline for one instruction per cycle.

Do branches take 4 cycles, when the pipeline has to be emptied and refilled?

Would the 256 kbytes (loaded in at startup from a 256k EEPROM) be available for Spin objects, images of Propasm programs, and for data processing? Just checking that the 256k hub RAM is not banked or limited in some way.

The bandwidth selection ideas are intriguing, but it sounds like it might be a headache to document and support. There is something very easy to grasp about 1:8 KISS.

I second the idea of implementing the port registers for the 64 pin device, if simply as a side door method of communicating between cogs. Or a port-independent register of that sort, that follows the same control rules.

Also I second the idea of extended COG counters, to include the increment on hub read or write. There are other selections I'd like for input and output on the cog counters. For example, the capability to allow a selection of output from PHS or CARRY in the DETector modes. But that is another topic.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Tracy Allen
www.emesystems.com

Mike Green · 2006-11-26 04:29

Sorry I've been away from the discussion ... I vote for 8 cogs and 256K. The faster cogs and hub access will allow more functionality with fewer cogs. I like the idea of adjustable hub access bandwidth with common-case defaults. Most of the time, the adjustable access won't be necessary, but it will save a cog or two occasionally (and the associated complexity) when the issue is indeed hub access bandwidth.

Phil Pilgrim (PhiPi) · 2006-11-26 05:10

Chip Gracey said...
Maybe when a cog is launched, its hub-access requirement could be stated, and then the launch would pass/fail based not just on whether or not a cog was available, but also on whether or not a requested-bandwidth hub slot was available. For example, you could have 1:4 being the highest, then 1:8, 1:16, and finally 1:32. Every program should use the lowest-possible setting. It would take only a bit of logic in the hub to negotiate the setup requests and then serve them deterministically thereafter.

Chip, I really like the idea; but it could get a little tricky, depending on the order in which the requests are made. For optimal "time-packing", you'd want the most demanding cogs assigned first and the least-demanding last. But there's no guarantee that that's the order in which the requests will come. And you can't jiggle things after the fact, since it'll throw the timing off for cogs that've already been assigned.
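
The order dependence can be shown with a small model. This is only a sketch under assumptions that are not in the posts: a 32-slot hub wheel, a request of 1:N meaning "one evenly spaced slot every N slots", and a naive first-fit allocator that never moves a cog once it is assigned.

```python
WHEEL = 32  # hypothetical hub wheel length, in slots

def assign(requests):
    """First-fit assignment. Each request N means 'one hub slot every N slots'.

    Returns one offset per request, or None where the request could not be met.
    """
    owner = [None] * WHEEL
    offsets = []
    for cog, period in enumerate(requests):
        for offset in range(period):
            slots = range(offset, WHEEL, period)
            if all(owner[s] is None for s in slots):
                for s in slots:
                    owner[s] = cog
                offsets.append(offset)
                break
        else:
            offsets.append(None)  # launch fails: no evenly spaced slots left
    return offsets

# Same five requests, two different arrival orders.
demanding_first = assign([4, 32, 32, 32, 32])
demanding_last = assign([32, 32, 32, 32, 4])
print(demanding_first)
print(demanding_last)
```

In this model the 1:4 request succeeds when it arrives first, but fails when four 1:32 cogs have already taken offsets 0 through 3, even though 28 of the 32 slots are still free.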

Did you already have an algorithm in mind for this? If so, I'm very curious what it might be.

Also, would it be possible for a cog to request a different access priority after being launched? This would enable more efficient bandwidth sharing when rapid hub access is needed only in short bursts.

-Phil

power mousey · 2006-11-26 06:04

hey Chip,

how about both options. true.

and also a third option....up to 16 cogs active and with a maximum of 256k of ram. also, maybe expand the rom too.

also how about this?: have the capability in the hardware and software of the propeller chip for sharing and using some of the other cogs' general purpose ram, as extra fast memory for some of the program code and data, with some of the code in their registers.

for example: use a few cogs in an application...launch a few other cogs and use their general purpose ram for some of the code and data too. yet, even though these cogs are launched and active...their registers are used for some of the code and data.

cheers,

power mousey

cgracey · 2006-11-26 06:14

Phil Pilgrim (PhiPi) said...

Chip, I really like the idea; but it could get a little tricky, depending on the order in which the requests are made. For optimal "time-packing", you'd want the most demanding cogs assigned first and the least-demanding last. But there's no guarantee that that's the order in which the requests will come. And you can't jiggle things after the fact, since it'll throw the timing off for cogs that've already been assigned.

Did you already have an algorithm in mind for this? If so, I'm very curious what it might be.

Also, would it be possible for a cog to request a different access priority after being launched? This would enable more efficient bandwidth sharing when rapid hub access is needed only in short bursts.

-Phil

Yes, in thinking more about it, it seems it would be hard to avoid fragmentation, especially after a few cogs have re-launched. I think it very quickly becomes a "memory management" type of problem, to which there is no (simple?) solution. Like you said, you can't reassign a cog's time-slot on the fly because it could potentially destroy its established function. About setting bandwidth during runtime: Any cog asking for more bandwidth probably needs it, and what if it can't get it? The more I program the Propeller, the more timing-centric everything is becoming. Timing and function are more often than not inseparable concepts. Potatohead pointed out something similar to this. I'm convinced that anything that introduces indeterminacy into timing is really poisonous. Determinism has that wonderful KISS quality, which is always right.

BTW, here's what Potatohead wrote (red text is critical):

I've dealt with high end applications for a lot of years. Many of these were running on SGI NUMA machines. Interesting philosophy that turned out to be very true in a lotta cases: Any compute problem, properly coded, becomes an I/O problem.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Post Edited (Chip Gracey (Parallax)) : 11/26/2006 7:12:59 AM GMT

Bill Henning · 2006-11-26 07:14

Chip, I have a potentially interesting idea for allocating bandwidth.

Basic assumptions:

8 cogs competing for memory time slices.

80 MHz HUB RAM speed (12.5 ns)

Why not have a set of special registers in the hub memory that allocate timing slots?

For yucks, let's use an 80-entry table

by default, the table is filled as follows:

0
1
2
3
4
5
6
7
0
1
2
3
4
5
6
7
... until the end of the table

HOWEVER

the table can be re-written under cog control!

Mind you, the table would have to be insanely fast - say 2 ns or less access speed

That way, the memory access would be TOTALLY soft, totally programmable

I would NOT want the entries packed, and it may be best to have the entries be a bit mask by bit position as it would simplify the decode logic (and make it a bit faster) - we can live with the limit of 32 cogs sharing hub memory that it would impose on a 32 bit prop.

What do you think?

A slightly more elaborate version would have every *second* slot or every fourth slot fixed, to guarantee a minimum certain bandwidth per cog.
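
A minimal software model of the rewritable table, offered only as a sketch. The 8 cogs and 80 entries come from the post; the helper names and the particular rewrite shown are assumptions for illustration, and nothing here models the access-speed requirement.

```python
from collections import Counter

COGS = 8
TABLE_SIZE = 80  # entry count taken from the post

def default_table():
    """Round-robin fill: 0, 1, ..., 7, 0, 1, ... to the end of the table."""
    return [i % COGS for i in range(TABLE_SIZE)]

def hub_share(table):
    """Fraction of hub slots each cog owns in one pass over the table."""
    counts = Counter(table)
    return {cog: counts.get(cog, 0) / len(table) for cog in range(COGS)}

table = default_table()
# Example rewrite (hypothetical): cog 7 hands every other one of its
# slots to cog 0, which then owns entries 0, 7, 8, 16, 23, 24, ...
for i in range(7, TABLE_SIZE, 16):
    table[i] = 0
share = hub_share(table)
print(share)
```

Note that after the rewrite cog 0's slots are no longer evenly spaced, which is exactly the kind of timing detail a cog program would have to know about.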

EDIT:

A potentially easier/better idea:

Allow for 32 max potential timing slots

A new hub instruction, called SCHEDMEM, could be used to request a bitmask of timing slots; giving up slots normally scheduled for a cog that it did not want, and trying to allocate ones it wanted

It could return the allocated slots

so the default configuration with eight cogs would be something like

cog0: 10000000100000001000000010000000
cog1: 01000000010000000100000001000000
cog2: 00100000001000000010000000100000
cog3: 00010000000100000001000000010000
cog4: 00001000000010000000100000001000
cog5: 00000100000001000000010000000100
cog6: 00000010000000100000001000000010
cog7: 00000001000000010000000100000001

the above would be the default access mask for the cogs, for the current behaviour

however say cog4 only needed/wanted one hub access cycle in every 32 possible cycles

it could release three cycles!

there could be a globally available hub register showing currently allocated memory slots

RAMUSED: 000100010010000111000001000000

a cog could then tell what cycles it can request

every time a cog released its slot, it would become available for another cog

Btw, this is also easier to implement in gates than the time slot registers I suggested above
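
A sketch of how the bitmask scheme might behave. SCHEDMEM and RAMUSED are the names used in the post; the exact semantics modelled here (keep wanted slots already held, release the rest, grant wanted slots nobody else holds) are an assumption, as is treating the leftmost character of each mask as slot 0.

```python
SLOTS = 32
COGS = 8

def default_mask(cog):
    """One slot in every eight; slot n is the n-th character from the left."""
    return sum(1 << (SLOTS - 1 - s) for s in range(cog, SLOTS, COGS))

def show(mask):
    """Render a mask as a 32-character bit string, slot 0 first."""
    return format(mask, "032b")

class Hub:
    def __init__(self):
        self.masks = [default_mask(c) for c in range(COGS)]

    def ram_used(self):
        """Model of the RAMUSED register: OR of every cog's slots."""
        used = 0
        for m in self.masks:
            used |= m
        return used

    def schedmem(self, cog, wanted):
        """Hypothetical SCHEDMEM: returns the mask actually granted."""
        others = self.ram_used() & ~self.masks[cog]
        self.masks[cog] = wanted & ~others
        return self.masks[cog]

hub = Hub()
# cog 4 keeps only its first slot, releasing the other three
kept = hub.schedmem(4, 1 << (SLOTS - 1 - 4))
# cog 0 asks for its own slots plus all four of cog 4's default slots
granted = hub.schedmem(0, default_mask(0) | default_mask(4))
print(show(kept))
print(show(granted))
```

Here cog 0 picks up the three slots cog 4 released but is refused slot 4, which cog 4 still holds.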

Post Edited (Bill Henning) : 11/26/2006 7:25:18 AM GMT

cgracey · 2006-11-26 07:36

Bill,

That would certainly be flexible, and if you knew exactly what you wanted for the whole system, it would be ideal. But, if cogs spawning from objects are trying to set up their own requirements, they could be clobbering the schedules of others. That whole thing might have to be locked for individual cog access. I could see a lot of cog code getting spent on iffy setup procedures. Do you know what I mean? This would preclude unknown cog schemes from deterministically starting with their bandwidth requirements under an RTOS' control. The RTOS would have to have some data on what the cog needed so that it, alone, could set it up. Nobody else had better interfere, either.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Post Edited (Chip Gracey (Parallax)) : 11/26/2006 7:40:26 AM GMT

potatohead · 2006-11-26 08:02

Given this, I'm thinking this is not a good path. Enough complexity has been brought up to totally validate Phil's point --and it's not even fully defined yet.

I guess this means the current symmetry in the design is another one of those don't touch items as well..

Bill Henning · 2006-11-26 09:14

I figured spin or a hypothetical rtos would manage bandwidth allocation roughly as follows:

- a cog may give up any of its default allocation

- a cog may request any slice, and for any requested slice, if it is not allocated, the request is granted

- one way to enforce fairness is to hardwire one access in every 32 slots to a cog - if it asks for it or not; ie on startup, each cog gets one access cycle every 32 clocks (or if the default is 16 it matches the current scheme, while leaving 16 slices up for grabs for bw hungry apps!)

- and cogs that did not need access every 1/16 slots would be allowed to give up ONE of their two slots

- this scheme would also work for a 16 cog prop, but by default each cog would only get two time slices at hub access per 32 cycles

ie

*---*---*---*---*---*---*---*---
0---1---2---3---4---5---6---7---

slots with * are hard locked, may not be given up

by default each cog gets one cycle in a 32 stroke "wheel"

24 slots not allocated

only unallocated slots may be claimed by cogs

cogs can only free up claimed slots, not the "hard" slot they are initially allocated

I agree it's non-trivial, but say a "hyper" speed app could claim 24 slots out of 32 if it needed it!
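
The rules above can be sketched as a small model. The 32-slot wheel and the hard slot every fourth position follow the diagram; the class and method names are made up for illustration, and this says nothing about how the hub logic would actually be built.

```python
WHEEL = 32
COGS = 8
# slot -> owning cog; these are the '*' slots and can never be released
HARD = {cog * 4: cog for cog in range(COGS)}

class Wheel:
    def __init__(self):
        self.owner = [HARD.get(s) for s in range(WHEEL)]

    def claim(self, cog, slot):
        """Grant the slot only if nobody holds it."""
        if self.owner[slot] is None:
            self.owner[slot] = cog
            return True
        return False

    def free(self, cog, slot):
        """Release a claimed slot; a hard slot always stays with its cog."""
        if slot in HARD or self.owner[slot] != cog:
            return False
        self.owner[slot] = None
        return True

    def count(self, cog):
        return self.owner.count(cog)

wheel = Wheel()
# A bandwidth-hungry cog 0 tries to claim every slot on the wheel.
claimed = sum(wheel.claim(0, s) for s in range(WHEEL))
print(claimed, wheel.count(0), wheel.count(1))
```

Even after cog 0 grabs all 24 free slots, every other cog still has its one guaranteed access per 32 clocks.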

Chip Gracey (Parallax) said...

Bill,

That would certainly be flexible, and if you knew exactly what you wanted for the whole system, it would be ideal. But, if cogs spawning from objects are trying to set up their own requirements, they could be clobbering the schedules of others. That whole thing might have to be locked for individual cog access. I could see a lot of cog code getting spent on iffy setup procedures. Do you know what I mean? This would preclude unknown cog schemes from deterministically starting with their bandwidth requirements under an RTOS' control. The RTOS would have to have some data on what the cog needed so that it, alone, could set it up. Nobody else had better interfere, either.

Post Edited (Bill Henning) : 11/26/2006 9:21:59 AM GMT

Peter Jakacki · 2006-11-26 11:50

There is so much information here that I am afraid that the really useful and practical-to-implement stuff might get buried. From what I understand of silicon, digital design, embedded hardware and software requirements, and perhaps even the market viability part, I am offering my 2 cents' worth. Remember, most of us aren't all-round gurus, we each have our own experience, requirements, and expectations. Together it should make one mean pie.

COG MEMORY
I understand the limitation that you have with 512 longs/cog is directly related to the KISS/FAST instruction decode and cannot be changed without changing the code/cpu itself or extending the instruction width longitudinally perhaps to 40 bits or more (I won't mention banking). Ok, so we are stuck with 512 longs per cog, let's work from there.

MORE COGS? YES! HOW?
Someone mentioned why have multiple video registers when all we really need is one (even more) and that can be accessed centrally as part of the main memory map. That may not have been practical on the original but could indeed be with the proposed new design. If this is the case then I would like to see a simple 8/16/32-bit SPI-like interface on each cog as this would not take up any more silicon than the current video generator would. The use of SPI would permit pchips to communicate with other pchips effectively and efficiently and because we would have at least 8 SPI interfaces per chip that means we could connect them in the most suitable fashion, whether that be a simple chip to chip or a transputer like connection where they are connected in a 2D XY matrix or perhaps even a 3D matrix if we start getting really fancy.

16 cogs would seem an advantage at first but not when an 8-cog chip could access more main memory much faster, remember, we only have 512 longs, we need efficient access to that main memory. Consider this, if 16 cogs would be beneficial then why not 32 or 64? It seems that at some point we run into a barrier with adding more cogs as there is no efficient inter-chip communications method, so I suggest the SPI-like method and simply add more chips when we need more cogs.

DETERMINISTIC
Keep the main memory access deterministic, even though my original thought when I first played with the pchip was: why didn't they have programmable access? The approach to mux'ing main memory access may seem a little bit plain and simple but it works. The Spin development environment of creating sharable objects is part of the success and ease of use of the Propeller. Imagine if one object required a certain type of access and another object required something different, and you tried to use these objects together, with one hogging what it needs but not leaving enough for what the other requires. Or hand-tweaking the application? No thanks!

KEEP US IN THE LOOP
Make advance information available, even if it is tentative. As you know, there is a long evaluate/prototype/evaluate/development/production/whatever cycle in most commercial products. Having advance information plus the experience of working with the pchip now, plus the fact that we won't look elsewhere while we are in expectation and salivating, means that Propeller II can expect a much faster end-user utilization, and a shorter development cost amortization than perhaps has been experienced with the original. We want you guys to stay in business.

There are plenty of other good suggestions, some pie-in-the-sky, but there is only so much time and money available, and these few things that I have outlined seem in my opinion both desirable and doable.

*Peter*

parsko · 2006-11-26 11:53

                                  NOW (8/32)       8/256            16/128
Total Cog RAM (bytes)             16384            16384            32768
Global RAM (bytes)                32k              256k (8x)        128k (4x)
Hub Access Time                   1/16 = 200 ns    1/8 = 100 ns     1/16 = 200 ns
PASSY Command Execution Time      4 clk = 50 ns    1 clk = 12.5 ns  ??????
Clock Speed (MHz/MIPS)            80/20            80/(160?)        80/(160?)

Did I miss anything important? Did I get something wrong?
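
For anyone checking the table, the filled-in figures can be reproduced with a few lines. This is a sketch that assumes 512 longs of 4 bytes per cog and the 80 MHz clock used throughout the table; the function names are made up for illustration.

```python
CLOCK_MHZ = 80        # clock assumed by every column of the table
LONGS_PER_COG = 512
BYTES_PER_LONG = 4

def total_cog_ram_bytes(cogs):
    """Combined cog RAM across all cogs, in bytes."""
    return cogs * LONGS_PER_COG * BYTES_PER_LONG

def hub_window_ns(clocks_between_accesses, clock_mhz=CLOCK_MHZ):
    """Time between one cog's hub access opportunities, in nanoseconds."""
    return clocks_between_accesses * 1000 / clock_mhz

def instruction_ns(clocks_per_instruction, clock_mhz=CLOCK_MHZ):
    """Execution time of one instruction, in nanoseconds."""
    return clocks_per_instruction * 1000 / clock_mhz

print(total_cog_ram_bytes(8), total_cog_ram_bytes(16))
print(hub_window_ns(16), hub_window_ns(8))
print(instruction_ns(4), instruction_ns(1))
```

The entries marked with question marks in the table are left open here as well, since they depend on decisions not stated in the thread.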

After having a night to sleep on it, one important thing to me (and I think likely Cliff and/or KaosKidd too) is the Total Cog RAM. Don't we gain some hidden benefits from having double the amount of COG RAM available? Especially if COG-COG communication is faster...?

-Parsko

Post Edited (parsko) : 11/26/2006 2:27:29 PM GMT

ciw1973 · 2006-11-26 12:51

Having given this some more thought overnight, I'm finding my original preference for 16 cogs and 128K RAM is once again looking more appealing, but only if there were bandwidth allocation features implemented as well.

My main reason for going with 8 faster cogs wasn't the additional memory, but that access to this memory would be slower. Being able to allocate more slots to processes requiring faster access to this memory would largely negate the issue. OK, so it would introduce other issues which would need to be overcome, but I think there is a lot of potential there.

If we consider the allocation of hub slots to be a similar issue to the issue of allocating blocks of memory, where the slots are determined in the spin which prepares the assembly code for loading into a cog, then it becomes fairly simple to manage.

I'm still very keen on the idea of any new Propeller also including some on-chip FLASH though, to keep the component count down for smaller designs. I know physical silicon space is an issue, so how about a version of the 8 cog chip that has 128K SRAM and 128K FLASH?

Post Edited (ciw1973) : 11/26/2006 12:57:44 PM GMT

PVJohn · 2006-11-26 13:06

I prefer more memory.
Can you make it in DIP 40 package and pin compatible with current version, so that we can continue to use HYDRA board?

PVJohn

ciw1973 · 2006-11-26 13:51

Agreed having a 40 pin DIP version makes it much easier for the hobbyist and for prototyping in general. Adding FLASH would make it even more so.

Cobalt · 2006-11-26 15:57

I think I'll change my vote from the 16 cogs to the 8 cogs - I didn't quite realise that it would be faster and that the increased memory would let things run on fewer cogs... although I would want more IO pins :D

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
while alive = 1
wakeup
program(propeller)
eat(3)
sleep(7)

potatohead · 2006-11-26 16:57

Bill, I like your idea of a micro OS to manage things. However, would that not fragment the body of prop code?

This is gonna happen anyway as people build stuff and other people build on top of it. For a specific application, no biggie. One gets the bits they need, tweaks them to play together and moves on to the application.

However, that change would more or less mandate Parallax to provide some sort of management scheme. That may or may not make sense to them. Also, I'm not sure that for a lot of applications the additional throughput possible would be worth the overhead. A similar level of granularity, setting peak performance aside, could be achieved with COG threading as well. From the application's point of view, there would be very little difference.

Thought of something else. A new chip cycle means another shot at what's on the ROM.

Does it all have to be ROM? EEPROM maybe, for those who don't care about anything beyond the SPIN interpreter and its necessary elements?

Would different contents really matter, knowing what we do now?

I personally would like to see a small 8x8 character set in there, among other things. Is this something open to discussion Chip?

nutson · 2006-11-26 17:48

Chip's question suggests 1 Cog + 512x32 bit registers equals 16k8 HUB RAM in silicon real estate, so the Cog takes the biggest part of the real estate. Allow me to throw some new variables into the discussion: backwards compatibility, hyperthreading, SPIN interpreter.

Although I back the 8/256 direction, I still have doubts; assigning a 160 MIPS processor to the task of inputting characters from the keyboard will make me feel guilty, I am sure. The 16 Cog advocates have a point that the simplicity of having 16 absolutely equal resources greatly reduces the problems in programming very diverse I/O tasks and eases the reuse of objects etc., which is one of the Prop's strong points. Bill's method of executing streams from HUB memory is a way out, but has the disadvantage of task and context switching overhead, limited register use, and different programming methods for "full" and "shared" Cog use. But can't we have one 160 MIPS processor "hyperthreading" 8 threads at 20 MIPS, which would give us backward compatibility with current objects?? Give each thread a separate program counter and C/Z set, and share all other resources.

One Cog without local memory could execute one thread out of HUB memory at 20 MIPS, it could execute 8 threads out of HUB memory at 1.125MIPS. One Cog with 4K32 local memory could hyperthread 8 threads at 20 MIPS.
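
One way to picture the threading idea is a fixed round-robin interleave, where each cycle issues one instruction from the next thread. This is a sketch only: the strict rotation is an assumption about what "hyperthreading" would mean here, and the C/Z fields are carried as per-thread state without being exercised.

```python
from dataclasses import dataclass

@dataclass
class Thread:
    pc: int = 0          # each thread has its own program counter
    c: bool = False      # ... and its own carry flag
    z: bool = False      # ... and its own zero flag
    executed: int = 0    # instructions issued so far

def run(threads, cycles):
    """Issue one instruction per cycle, rotating through the threads."""
    for cycle in range(cycles):
        t = threads[cycle % len(threads)]
        t.pc += 1        # stand-in for executing one instruction
        t.executed += 1
    return threads

# 8 threads with hypothetical start addresses 64 longs apart, 160 cycles.
threads = run([Thread(pc=i * 64) for i in range(8)], cycles=160)
per_thread_mips = 160 / len(threads)
print([t.executed for t in threads], per_thread_mips)
```

With eight threads sharing a 160 MIPS cog in strict rotation, each thread gets an eighth of the issue slots, which matches the 20 MIPS figure quoted for the current part.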

What is the role of the SPIN interpreter in this? Could the SPIN interpreter be made to run multiple (low speed) threads from a single COG??

Nico Hattink

iam7805 · 2006-11-26 18:11

I'd say go with more RAM. Would be useful for game programming.

Mike Green · 2006-11-26 19:40

Some support for high speed clocked serial communications would be very useful for multiple chip-to-chip communications as well as the Ethernet that's already been mentioned. Using a self-clocking system (like Manchester encoding) would save I/O pins, but, with a larger package needed anyway, that may not be as much of a problem. Most high speed serial chips now use SPI. If Chip decides to add a little FIFO buffering for video, it wouldn't take much logic to use the same buffer for SPI output as well. Input buffering would also be very useful, but would require more supporting logic since the original reason for putting in the FIFO is for video generation.

If SPI support were to be added, it could also be used for cog-to-cog communications

Phillip Y. · 2006-11-26 19:40

MORE Intercog communication would be useful.

port A (same as always)
port B w/wo real I/O pins

OR;

Register for access to the other cogs similar to the port A and B,
One for all cogs (port C) but not for I/O,
Using port C instead of port B would ELIMINATE issues of moving programs that use port B with 32 I/O to 64 I/O versions of the chips.

OR;

Two registers for adjacent cogs , i.e. cog to the right , cog to the left . (port R, port L)
together Port R and port L connections would use silicon = to port C alone,
many times 2 or 3 cogs work together closely and don't need special access to other cogs.

Phil Pilgrim (PhiPi) · 2006-11-26 20:00

I agree with the need for better inter-cog comm support. And an unpinned port B may be all it takes. But, at 160MIPS, I'm not sure that any more is really needed in the hardware for fast serial I/O.

What I would like to see, though, are more counters that could be combined in various ways to support hardware PWM, for example. The DUTY mode is just too fast for some D/A apps, especially those requiring MOSFETs to drive an inductive load, say.

-Phil

Phillip Y. · 2006-11-26 20:13

PhilPi ;
I am only talking about 32 bit parallel access within the Propeller chip, not serial or between chips.

Phil Pilgrim (PhiPi) · 2006-11-26 20:20

Hi Phillip,

'Sorry, I should've kept the two topics separate. The portion of my comment regarding serial I/O was in response to Mike's posting just above yours, which also alluded to inter-cog comms.

-Phil
