Should the next Propeller be code-compatible?

Graham Stabler · 2008-08-28 23:59

clean slate

cgracey · 2008-08-29 00:07

jazzed said...

...A pseudo-DMA function would also be much appreciated since we
can't access pins very fast or much faster even with the Prop-II.
Pseudo-DMA would let us specify a buffer and maximum access
length and provide start, stop, and terminal-count flags in a
control/status register. This could be a mode in the counter registers.

Jazzed, because each cog gets a turn at the bus, in sequence, it is only once every 16 clocks that you can assert a hub R/W command. Special hardware cannot speed this up. You can do a RDxxxx/WRxxxx instruction, plus 14 normal instructions and still not miss your window. This brings up another question I will ask you all...
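As a rough illustration of the round-robin timing described here, the following Python sketch computes how long a cog waits for its next hub window. It assumes the simplest possible model (one hub slot per clock, cogs served strictly in order); the function name and parameters are made up for the example and are not taken from any Parallax documentation.

```python
def clocks_until_hub_window(cog_id, clock, num_cogs=16):
    """Clocks a cog must wait for its next hub slot (0 if the slot is now).

    Assumed model: the hub serves cog (clock % num_cogs) on each clock,
    so every cog gets exactly one window per num_cogs clocks.
    """
    return (cog_id - clock) % num_cogs


if __name__ == "__main__":
    # A cog that just missed its slot waits almost a full rotation.
    for clock in (3, 4, 10):
        print(clock, clocks_until_hub_window(3, clock))
```

Under this model no extra hardware shortens the wait: the rotation period is fixed by the number of cogs.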

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Sapieha · 2008-08-29 00:12

Hi Chip Gracey.
Thanks for the reply.

You said:
"Each cog has a 3-read/1-write port"

Is that what I got slightly wrong in the first picture of the solution in my post?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.

Sapieha

Mike Green · 2008-08-29 00:12

Clearly in a variety of ways more is better. 16 cogs is much better than 8 and 256KB is much better than 128KB.
Die size is an absolute cost parameter that doesn't seem to come down much over time. What changes is how much
you can fit in a given size die, how much functionality it can provide, and how fast it can run. There may be clear
financial disadvantages to the larger (8mm x 8mm) die size. There may be significant incremental functional limitations (compared
to the Prop I) of the 6mm x 6mm die size. It strikes me that improving the ability to "cascade" chips would mitigate
many of the limitations of the 6mm x 6mm version. There needs to be some way to synchronize two cogs across two separate
chips, perhaps across more than two. There needs to be some simple way (hardware and software) to transfer data from one
cog to another between chips a short distance apart (nearby on the same board). High speed serial of some sort would be the
simplest if it were handled in hardware. It should be almost as simple as RDxxxx/WRxxxx although not as fast.

cgracey · 2008-08-29 00:33

ANOTHER QUESTION:

If we have 16 cogs, bus latency gets to be an issue, as you have only one opportunity every 16 clocks to do a RDxxxx/WRxxxx. This will hamper things like graphics and LMM code.

We had talked in the past about having special instructions that could read or write 8 longs in a single hub access. This is 32 bytes all at once! For writing, there would need to be a whole 32-bit register of write-enable bits to gate what bytes get written. This is efficient to do in silicon.
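The write-enable idea can be modelled in a few lines of Python. This is only a sketch under assumptions: hub RAM is a `bytearray`, bit i of the mask gates byte i of the 32-byte line, and the name `masked_line_write` is invented for the example.

```python
LINE_BYTES = 32  # 8 longs x 4 bytes, written in one hub access


def masked_line_write(hub, base, line, write_enable):
    """Copy line[i] into hub[base + i] only where bit i of write_enable is set.

    Assumption: bit 0 of the 32-bit mask corresponds to the lowest byte address.
    """
    if len(line) != LINE_BYTES:
        raise ValueError("a line is exactly 32 bytes")
    for i in range(LINE_BYTES):
        if (write_enable >> i) & 1:
            hub[base + i] = line[i]


if __name__ == "__main__":
    hub = bytearray(64)
    # Only the first long (bytes 0..3 of the line) is enabled.
    masked_line_write(hub, 32, bytes(range(1, 33)), 0x0000000F)
    print(hub[32:40].hex())
```

A single-byte write is then just a line write with one mask bit set, which is why the mask register can stand in for separate byte, word, and long write paths.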

What's a pain about this is ALSO supporting the regular RDxxxx/WRxxxx instructions, as they are going to require a ton of mux's to deal with a 256-bit data bus, where only 8 bits may be of interest.

Can you all imagine living with 'cache-line' style reads and writes, where you must pick through the 8 longs read into a strip of registers (say $1E0..$1E7) for what you need? We could provide some indirection for this data, so that you could index by bytes, words, and longs elegantly. But, gone would be something as simple as RDBYTE.
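To make "picking through the 8 longs" concrete, here is a small Python sketch of indexing a strip of eight 32-bit registers by byte, word, or long. It assumes little-endian byte order within each long and naturally aligned fields; `read_field` is a hypothetical helper, not a proposed instruction.

```python
def read_field(strip, byte_offset, size):
    """Extract a 1-, 2- or 4-byte field from a strip of 8 longs.

    strip       : list of eight 32-bit values (e.g. registers $1E0..$1E7)
    byte_offset : 0..31, must be a multiple of size (aligned access assumed)
    size        : 1 (byte), 2 (word) or 4 (long)
    """
    if size not in (1, 2, 4):
        raise ValueError("size must be 1, 2 or 4")
    if byte_offset % size or not 0 <= byte_offset <= 32 - size:
        raise ValueError("offset out of range or misaligned")
    long_value = strip[byte_offset // 4]   # which of the 8 longs
    shift = (byte_offset % 4) * 8          # position inside that long
    mask = (1 << (size * 8)) - 1
    return (long_value >> shift) & mask


if __name__ == "__main__":
    strip = [0x44332211, 0x88776655] + [0] * 6
    print(hex(read_field(strip, 2, 2)))
```

In software this costs a shift and a mask per access, which is roughly the work the proposed indirection hardware would hide.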

What do you think?

Post Edited (Chip Gracey (Parallax)) : 8/29/2008 12:38:44 AM GMT

Sapieha · 2008-08-29 00:51

Hi Chip Gracey.

That is one problem I had in mind.
Look at the picture in my thread (COGs).
Often more than one COG serves a single program, e.g. Spin plus floating-point support, a 2-COG VGA driver, etc.
My proposal was to group two or more COGs into one logical block to overcome bus latency.

Mike Green · 2008-08-29 00:55

For writing it's easy. You have a 32-bit by 32 long ROM with write-enable bits for each byte address in a page of 32 addresses. You'd also have a 32-bit by 16 long ROM with write-enable bits for each word address in a page of 16 addresses and a 32-bit by 8 long ROM with write-enable bits for each long address in a page of 8 long addresses. Masked ROM may be cheaper than mux's.
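One way to picture the three lookup ROMs is as precomputed tables of 32-bit write-enable masks, selected by the low address bits. The Python sketch below is an interpretation of that description, not a hardware design; the table names and the `write_enable` function are invented for the example, and it assumes bit i of a mask enables byte i of a 32-byte page.

```python
# One entry per byte, word, or long position within a 32-byte page.
BYTE_ROM = [0x1 << i for i in range(32)]        # 32 entries, 1 bit each
WORD_ROM = [0x3 << (2 * i) for i in range(16)]  # 16 entries, 2 bits each
LONG_ROM = [0xF << (4 * i) for i in range(8)]   # 8 entries, 4 bits each


def write_enable(addr, size):
    """Look up the 32-bit write-enable mask for an aligned write at addr."""
    if size == 1:
        return BYTE_ROM[addr & 31]
    if size == 2:
        return WORD_ROM[(addr >> 1) & 15]
    if size == 4:
        return LONG_ROM[(addr >> 2) & 7]
    raise ValueError("size must be 1, 2 or 4")


if __name__ == "__main__":
    print(hex(write_enable(6, 2)))   # word at byte offset 6 of the page
    print(hex(write_enable(36, 4)))  # long at byte offset 4 of the page
```

The tables only solve the mask side of a write; steering the data itself onto the right byte lanes is a separate problem.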

Another option for compatibility could be to have a masked ROM with routines to execute the compatibility instructions. Since the timing would be different for the Prop II, the main issue is that instructions like RDxxxx/WRxxxx not take any longer in the Prop II than they do in the Prop I and that they be deterministic. If a RDxxxx/WRxxxx takes several Prop II instruction times to execute, who's going to notice as long as the result is correct?

The main problem with the 'cache-line' style reads and writes without the standard RDxxxx/WRxxxx is that you're stuck with a single channel for serialized data. If you need to access a single item randomly somewhere, you've lost the current contents of the 'buffer' and will need to restore it later. That can be dealt with, but it's awkward and may complicate coding unnecessarily.

Rayman · 2008-08-29 00:57

I've seen that some of your code uses the special registers for other things... Maybe one easy thing would be an instruction that copies N longs into the special registers...

ImageCraft · 2008-08-29 01:05

Hmm... speaking for LMM C only, the cache line read would benefit the instruction fetch, but probably not for the data access. In fact, it would add complexity for accessing data as the compiler then will have to know whether something is in the cache or in main memory....

I'd vote for pushing the complexity to the hardware (no surprise) and support both cache line read (for instruction) and random access for data.

cgracey · 2008-08-29 01:08

Mike Green said...
For writing it's easy. You have a 32-bit by 32 long ROM with write-enable bits for each byte address in a page of 32 addresses. You'd also have a 32-bit by 16 long ROM with write-enable bits for each word address in a page of 16 addresses and a 32-bit by 8 long ROM with write-enable bits for each long address in a page of 8 long addresses. Masked ROM may be cheaper than mux's.

This is simple enough, but it's all the data muxing that gets out of control.

Another option for compatibility could be to have a masked ROM with routines to execute the compatibility instructions. Since the timing would be different for the Prop II, the main issue is that instructions like RDxxxx/WRxxxx not take any longer in the Prop II than they do in the Prop I and that they be deterministic. If a RDxxxx/WRxxxx takes several Prop II instruction times to execute, who's going to notice as long as the result is correct?

The main problem with the 'cache-line' style reads and writes without the standard RDxxxx/WRxxxx is that you're stuck with a single channel for serialized data. If you need to access a single item randomly somewhere, you've lost the current contents of the 'buffer' and will need to restore it later. That can be dealt with, but it's awkward and may complicate coding unnecessarily.

I agree. This is a major disruption.

simonl · 2008-08-29 01:09

@Mike: Yup, I whole-heartedly agree. If we can have an 8 COG / 256K PropII with really simple & fast inter-chip comm's (with COG sync'ing too) I think that'd tick all the boxes - and scale really nicely too

@Chip: Is that doable and likely? I suspect final chip cost could put off prospective customers (given that the Prop is already more expensive than many uCs), and I wouldn't want to push for a 16 COG chip if it meant lost sales for Parallax or product builders.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Cheers,

Simon

www.norfolkhelicopterclub.co.uk
You'll always have as many take-offs as landings, the trick is to be sure you can take-off again ;-)
BTW: I type as I'm thinking, so please don't take any offense at my writing style

Luis Digital · 2008-08-29 01:13

Option B: "made a little different to better accommodate the expanded memory"

8 COGs, if the final product is cheaper.

I have never used all 8 COGs, and there is always the option of reusing a COG or running several objects in one COG.

Please, a Linux version of the IDE or libraries.

cgracey · 2008-08-29 01:14

ImageCraft said...
Hmm... speaking for LMM C only, the cache line read would benefit the instruction fetch, but probably not for the data access. In fact, it would add complexity for accessing data as the compiler then will have to know whether something is in the cache or in main memory....

I'd vote for pushing the complexity to the hardware (no surprise) and support both cache line read (for instruction) and random access for data.

Richard, this is a great idea! We could have separate cache-line instructions, plus random access via old-style instructions. I've got to see about minimizing the mux's somehow. Using tri-state logic would decimate them in the silicon. I think I'm going to try this now. I'm scared of how much FPGA it might suck up. I've got 8 cogs and have already used 52% of the logic resources. I think I need a bigger FPGA.

Fred Hawkins · 2008-08-29 01:16

Clean slate, but provide a conversion utility. It could be as simple as: this won't work, please redo (with this and that, see manual pages xx and yy). Utility ought to go in either direction.

cgracey · 2008-08-29 01:20

ANOTHER QUESTION:

If each new cog is more powerful than a whole current Propeller chip, do you really need 16 of them? Would 8 not suffice? Personally, I've never used all 8, except in some demo to show what the chip could do. To me, 8 is quite rich. By the time we get to 16, we are hub-starved and have to resort to cache-line style hub accesses to get the bandwidth back up (well, way up).
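The bandwidth trade-off can be put into rough numbers. The Python sketch below assumes a 160 MHz clock (a figure recalled elsewhere in this thread, not a confirmed spec), one hub window per cog every N clocks for N cogs, and 4 bytes per long; the function name is made up for the example.

```python
def hub_bandwidth_bytes_per_sec(clock_hz, num_cogs, longs_per_access):
    """Peak per-cog hub bandwidth under a simple round-robin model.

    Assumes each cog gets one hub window every num_cogs clocks and moves
    longs_per_access 32-bit longs in that window.
    """
    windows_per_sec = clock_hz // num_cogs
    return windows_per_sec * longs_per_access * 4


if __name__ == "__main__":
    clock = 160_000_000  # assumed clock rate
    for cogs, longs in ((8, 1), (16, 1), (16, 8)):
        mb = hub_bandwidth_bytes_per_sec(clock, cogs, longs) / 1e6
        print(f"{cogs} cogs, {longs} long(s) per access: {mb:.0f} MB/s per cog")
```

Under these assumptions, going from 8 to 16 cogs halves per-cog bandwidth for single-long accesses, while an 8-long access with 16 cogs gives four times the bandwidth of a single-long access with 8 cogs.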

Are you guys sure about 16 cogs?

cgracey · 2008-08-29 01:23

simonl said...
@Mike: Yup, I whole-heartedly agree. If we can have an 8 COG / 256K PropII with really simple & fast inter-chip comm's (with COG sync'ing too) I think that'd tick all the boxes - and scale really nicely too

@Chip: Is that doable and likely? I suspect final chip cost could put off prospective customers (given that the Prop is already more expensive than many uCs), and I wouldn't want to push for a 16 COG chip if it meant lost sales for Parallax or product builders.

You bet it's doable! I feel younger just thinking about it.

simonl · 2008-08-29 01:29

LOL, you're already too young for your abilities Chip!

I vote for 8 COGs then

I seem to remember a post you made about giving each COG a reverse video circuit (that's not what you called it, but you were talking about de-serialising Ethernet or such I think). I'm guessing you're already WAY ahead of all of us, and had 8 COGs & fast inter-chip comm's in-mind all along!

Sapieha · 2008-08-29 01:29

Hi Chip Gracey.

For me:
As already mentioned, if I can have very fast Propeller-to-Propeller communication, it is better with 8 COGs,
and more power/capability in every COG.

Oldbitcollector (Jeff) · 2008-08-29 01:30

@Chip,

You know what my use of the Propeller is for. A one-chip personal microcomputer.
It might be one of the applications to actually use many cogs.

I know I risk getting a chair tossed my way for even suggesting this, but if 8 cogs
run 10x faster (or more) than the current ones couldn't we devise a way to
..."interrupt"... <ducks> to allow cogs operating at these higher rates to do
more than one thing at a time?

Better coders than myself could answer this...
(gotta duck outta sight now!)

OBC

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
New to the Propeller?

Getting started with a Propeller Protoboard?
Check out: Introduction to the Proboard & Propeller Cookbook 1.4
Updates to the Cookbook are now posted to: Propeller.warrantyvoid.us
Got an SD card connected? - PropDOS

simonl · 2008-08-29 01:32

@OBC: In-coming! (Only kidding).

Mike Green · 2008-08-29 01:34

OBC,
There's already something like that, the JMPRET instruction. This is used in FullDuplexSerial and in the TV driver for just that purpose. For FullDuplexSerial, there's a receive routine and a transmit routine and these are interleaved using JMPRET. For the TV driver, the initialization routine is interleaved that way so it executes during the vertical retrace interval with pieces executing on successive horizontal scan lines.
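As a loose analogy only (this is not Propeller code), the interleaving of two routines by handing control back and forth can be sketched with Python generators, where each `yield` plays the role of the point at which one routine gives control to the other. The task names and the `run` scheduler are invented for the example.

```python
def receive_task(log):
    """Stand-in for a receive routine: do one small step, then hand off."""
    n = 0
    while True:
        log.append(f"rx{n}")
        n += 1
        yield  # give control to the other task


def transmit_task(log):
    """Stand-in for a transmit routine, structured the same way."""
    n = 0
    while True:
        log.append(f"tx{n}")
        n += 1
        yield  # give control back


def run(steps):
    """Alternate between the two tasks for a fixed number of steps."""
    log = []
    tasks = [receive_task(log), transmit_task(log)]
    for i in range(steps):
        next(tasks[i % 2])  # resume whichever task is due
    return log


if __name__ == "__main__":
    print(run(4))
```

Each task resumes exactly where it left off, so neither needs to know about the other's internal state; the cost is that a task must reach a hand-off point before the other can run.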

Ron Sutcliffe · 2008-08-29 01:34

Hippy’s suggestion of two PropIIs, presumably using a common clock, makes a lot of sense.

Mounted on a single PCB they might just end up costing less than a single 16-cog version.

It certainly allows more wriggle room, and looking to the future it may be worth considering developing 8-Cog PROPII versions with different specialized features.

Given the PROPII versions maintain a degree of compatibility, they simply become bricks in the wall.

Ron

Mike Green · 2008-08-29 01:49

Chip,
The 16 cog configuration was intended to solve two things:

1) As we add functionality to the Prop II programs, particularly I/O functionality, we may have more than 8 functions. It's awkward at best to combine disparate functions in a single cog, particularly if the routines haven't been designed for this (like the 4-port serial driver). True, we can combine some of the multicog video drivers, but it's easy to run through cogs: keyboard/mouse, video, I2C/SPI, floating point 1, floating point 2, 4-port serial driver, Ethernet driver, main program. Some of this can be handled by using an LMM interpreter for slower I/O functions (and this may end up being preferable).

2) Some of the multicog usage has to do with the size of cog memory and the maximum allowable program size for a cog. The floating point package is a prime example. It takes two cogs to properly do this. Floating point has to be very fast, so running an LMM interpreter will give you a big performance hit. Perhaps a mix with the basic FP routines in one cog and a specialized LMM interpreter would help. In the case of video, line buffering may still end up filling the cog's memory. It's hard to tell until the routines actually get written.

waltc · 2008-08-29 01:49

I agree with Chip: if the Prop II Cogs are much faster and feature-rich, I don't see the need for 16 of them, or for having to deal with resource allocation/starvation problems.

So stay with 8 for the time being. Don't need to introduce unneeded complexity.

And if folks really need more Cogs, just buy another Prop! It's not like they are costly and have a giant footprint!

I just think some folks really want a Cell Processor or Xmos beast on the cheap.

Erik Friesen · 2008-08-29 01:49

My vote is for 8 cogs based on a couple of factors.

While 16 cogs is fun to play with, when the chip cost gets up around $20+ in small lots it doesn't have quite the fun factor anymore. You have to have a pretty good idea and product to justify an expensive chip in a production or semi production device.

Most projects can fit within 8 cogs with some careful coding. Especially if there are more cog options like more counters and misc. When a person changes to the parallel processing concept it is easy to keep dumping stuff off onto a cog when in reality it can be managed another way.

Phil Pilgrim (PhiPi) · 2008-08-29 01:51

I just knew the "I" word would come up in this context! Hopefully, it's not being considered. Although I agree with Mike about the use of JMPRET and its utility for coroutines, that approach does preclude the use of any WAITxxx instructions. I wonder, though, how difficult it would be to implement hyperthreading, i.e. two program counters, two sets of status bits, and interleaved execution? You'd have to be able to turn it on and off, of course. Also, if instructions are more than one clock long (I can't remember), it would affect the time-granularity of the WAITxxx instructions, since they'd have to be synced to the instruction cycle instead of the processor clock.

-Phil

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
'Still some PropSTICK Kit bare PCBs left!

potatohead · 2008-08-29 01:54

What's the clock again? I seem to remember 160 MHz.

(goes off to look)

Yep, it was 160.

Does this decision impact the number of COGs?

Man, this is a tough one!

On one hand, the cache-line would seriously improve image manipulation for sprite displays. That, plus REP for handling masking and shifting of sprite data, will be a very significant improvement.

On the other, it's complex and messy. Lots of simple, low throughput stuff will see greater complexity than it would otherwise.

Will the cache-line be aligned in any particular fashion, or will this be up to the user?

Maybe this is not as complex as I'm thinking it is. Would a user then be able to just do a rdhub, wrhub, start address and go from there, picking and choosing from the block that always shows up? And that's where the starting address can be any address, not just some aligned one?

It's hard to turn down the greater throughput!

Initially, my vote was for the simpler instruction as it would bring greater compatibility, but we put that one to bed. Now it's just about complexity -vs- speed.

Maybe a mask register in addition to the auto increment register I think I'm already seeing? One could set that, then do ordinary mov instructions, knowing the mask is set for byte, word, long, etc... Seems to me that savings would balance a lot of the complexity out. Probably would cut down on a lot of mov, shift combinations too, further packing things into the COG instruction area.

Edit: After thread catch up, I agree with 8 cogs. Truth is, once the chip is out, 16 could be added at Prop III, or Prop II+ to account for those sets of tasks that 8 just does not fit well, and if there is demand. I think the overall risk is lower with 8. Cost is already a sales exception. At this stage, sans serious demand, I see no reason to add to that.

One other thing is the rep command will increase overall instruction density per cog. We will get more out of those 512 longs, and I'll bet that's significant, once people really put it to use.

So, yeah. 8 cogs, keep the 256KB RAM, and if something has to go besides the fantasy COGs, let it be the simpler rdbyte type instructions. If that bulk read, write is easy to use, it can always be ignored, leaving the user to just hit the HUB more often and use constant instructions for accessing the buffer, if they don't need the speed, given there is no forced alignment.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Propeller Wiki: Share the coolness!

Chat in real time with other Propellerheads on IRC #propeller @ freenode.net

Post Edited (potatohead) : 8/29/2008 2:06:31 AM GMT

ImageCraft · 2008-08-29 02:00

re: 8 or 16 Cogs

If the Cog is faster (and it has to be, due to smaller geometry and a faster clock), I'd vote for staying with 8 so as not to blow away the performance of the Hub access. If I understand it correctly, for PropII, most programs will still be bigger than the Cog memory size (I think that stays at 512 longs, right?) so the bottleneck will still be Hub RAM access. While the cache line read would improve the instruction fetch significantly, it will not help with data access, so any way to minimize that would be good. With a faster Cog, one can probably do some kind of virtual interrupt (think the equivalent of hyperthreading).

I agree with Sapieha too: if there is a fast PropII<->PropII connection, we will somehow find a use for it. Transputer, Hypercube, etc. come to mind.

// richard

Sapieha · 2008-08-29 02:08

Hi waltc

You said
"" And if folks really need more Cogs, just buy another Prop! ""

Yes, you can buy 100 Props, but if you can't communicate between them it is meaningless.

And yes, folks want a good product.

Post Edited (Sapieha) : 8/29/2008 2:30:12 AM GMT

potatohead · 2008-08-29 02:10

@Oldbit

I too want to see the small scale personal computer back. IMHO, the faster LMM speeds possible + 256KB RAM are gonna do some serious damage on this score. A supervisor type OS, written in LMM C or just LMM ASM, can provide a lot of basic services, and it's basically one COG. Another one or two for graphics (and I'm thinking single COG drivers can do Atari / C64 level graphics with the proposals mentioned), another for sound and that leaves plenty for special stuff!

Betcha the 8 vs 16 discussion has little impact on these things.
