Should the next Propeller be code-compatible?

Capt. Quirk · 2008-09-04 21:16

Will the counters be modified to better handle motor control on Prop2, similar to Phil Pilgrim's thread, or will that remain a code issue?

evanh · 2008-09-05 06:29

Quadrature counting can easily be done in software. 500 kHz is the maximum expected pulse rate from such a device. And while the existing example object might struggle with the full 500 kHz on the Prop I, I'm sure the Prop II won't have any problems keeping up.
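As a rough illustration of what software quadrature counting involves, here is a minimal table-driven decoder sketch. The table layout, the sign convention for direction, and the function name `count_quadrature` are assumptions made for this example; they are not taken from the existing Propeller object.

```python
# Transition table indexed by (previous_state << 2) | current_state,
# where each state is the 2-bit value (A << 1) | B.
# +1 / -1 are valid single steps; 0 means no change or an invalid double step.
# The choice of which direction counts as +1 is an assumption.
QUAD_TABLE = [
     0, +1, -1,  0,
    -1,  0,  0, +1,
    +1,  0,  0, -1,
     0, -1, +1,  0,
]

def count_quadrature(samples):
    """Return the net position after a sequence of 2-bit (A, B) samples."""
    position = 0
    previous = samples[0]
    for current in samples[1:]:
        position += QUAD_TABLE[(previous << 2) | current]
        previous = current
    return position

# One full Gray-code cycle in each direction (hypothetical sample data).
forward = count_quadrature([0b00, 0b01, 0b11, 0b10, 0b00])
backward = count_quadrature([0b00, 0b10, 0b11, 0b01, 0b00])
```

On a real cog the same lookup would sit in a tight loop that samples the two input pins; the sampling loop has to run faster than the highest edge rate it is expected to track.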

What's really neat is that this is a perfect I/O-device-type application to dedicate a cog to.

I would say the same for serial capture also except that serial data rates often massively exceed 500 kHz.

Evan

Cluso99 · 2008-09-05 07:10

Chip...

If the Prop II is as powerful as we now expect, can you ensure that the instructions can cope with 16 cogs (or more)? I have a feeling you might be making a version with 16 (or 32) soon after, and cost will not be an issue for that version. I'd hate to lose compatibility between these two versions (not Prop I). I realise there will be hub interface differences.

Javalin · 2008-09-05 13:06

Chip,

Any chance of some floating point hardware in the prop2?

James

Mike Green · 2008-09-05 14:41

Floating point hardware is very complex. Given that there's relatively little need for floating point in embedded systems and that there already is a very good floating point library with good execution speed that will be even faster on the Prop II, it's just not worth the silicon to build it in.

A lot of applications that use floating point can use fixed point scaled arithmetic which is very fast on the Prop I and Prop II. Transcendental functions can be done using CORDIC which will have hardware assist on the Prop II since that's very simple.
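To make the fixed-point and CORDIC point concrete, here is a small sketch of a rotation-mode CORDIC that computes sine and cosine using only integer adds, subtracts, and shifts. The Q16 format, the iteration count, and the name `cordic_sin_cos` are assumptions for illustration; this is not the Propeller's floating point library or the Prop II hardware.

```python
import math

FRAC_BITS = 16                 # Q16 fixed point: 1.0 is represented as 1 << 16
ONE = 1 << FRAC_BITS
ITERATIONS = 16

# atan(2^-i) in Q16 radians. Floating point is used here only to build the
# constant table; real hardware or PASM would hold these as fixed constants.
ATAN_TABLE = [round(math.atan(2.0 ** -i) * ONE) for i in range(ITERATIONS)]

# Starting x at the inverse CORDIC gain makes the result come out unscaled.
GAIN = round(ONE * math.prod(1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
                             for i in range(ITERATIONS)))

def cordic_sin_cos(angle_q16):
    """Return (sin, cos) in Q16 for a Q16 angle in radians within [-pi/2, pi/2]."""
    x, y, z = GAIN, 0, angle_q16
    for i in range(ITERATIONS):
        if z >= 0:
            # Rotate counter-clockwise by atan(2^-i) using shifts and adds only.
            x, y, z = x - (y >> i), y + (x >> i), z - ATAN_TABLE[i]
        else:
            # Rotate clockwise by the same angle.
            x, y, z = x + (y >> i), y - (x >> i), z + ATAN_TABLE[i]
    return y, x

sin_q16, cos_q16 = cordic_sin_cos(round(math.pi / 6 * ONE))
```

Dividing the results by `ONE` gives values close to 0.5 and 0.866 for this 30-degree input; the accuracy is limited by the Q16 resolution and the number of iterations.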

Lord Steve · 2008-09-05 15:52

Dear Chip,

Would 16 cogs be possible if the cog RAM were triple-port (two read, one read/write) as opposed to quad port (three read, one write)?

OwenS · 2008-09-05 16:38

Lord Steve,

Three ports isn't enough. Every cycle you need to read an instruction, source and destination, and write the destination.

This is, of course, why most older and embedded processor designs go for small register files and large memory. (Of course, some desktop processors, like the Itanium, have insane amounts of registers, then do something silly that makes compilers chew through them, like not supporting offset addressing. Again, I'm looking at the Itanium.)

Lord Steve · 2008-09-05 20:58

OwenS,

The problem is that there must be 4 memory accesses per cycle... having four-port RAM is one possible solution and at present is the one Chip and Company have adopted. However, that solution has been found to preclude having 16 cogs in the Propeller II, which makes me cry at night when I'm all alone. Therefore, I wonder if there isn't some way around it.

I asked about three-port RAM (two read ports, one read/write) because maybe it would be possible to use a write buffer to store the write while the read is happening on the third (read/write) port of the RAM. Then, between cycles, the write buffer would commit the result to the RAM. Chip indicated that the RAM could be "overclocked" to some degree without that RAM becoming the critical path. Perhaps that would allow what I am suggesting or something similar... one (even complicated) write buffer is a lot cheaper silicon-wise than quad-port RAM.

simonl · 2008-09-05 21:20

I'd settle for one, dedicated, full-duplex, inter-chip comm's COG, so I can just keep adding more PropIIs - especially if it runs REALLY fast (maybe using Beau's high-speed comm's) and allows writing/reading HUB RAM on specified Prop.

I've been thinking that good old Token-Ring sits perfectly with the Prop's deterministic nature

(I really should set aside some time to build a Token-Ring object; now that the exchange rate's put paid to the Propeller webshop I've been working on :SOB :)

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Cheers,
Simon

www.norfolkhelicopterclub.com

You'll always have as many take-offs as landings, the trick is to be sure you can take-off again

BTW: I type as I'm thinking, so please don't take any offence at my writing style

Cluso99 · 2008-09-06 00:01

Think we have lost the 16 cog version. Pity, because I would have preferred it, even with a 4 cycle instruction (single cog RAM path), or 2 cycle (dual cog RAM path). The Prop II is going to fly anyway, and the hub access seems solved. I would rather have paid for the extra silicon cost, but that's just me. I think this Prop II is going to impress a lot of engineers.

Just thinking - would 32 cogs with single cog ram path (4 clock execution) use as much silicon as 8 cogs with quad cog ram path (1 clock execution)???

Sync'ing multiple props...
They can already use the same xtal, so they would remain in sync. But we have to get them into sync first. We could use a pin (presumably they are communicating anyway) and have, say, cog 7 in the master prop toggle it. Each prop has, say, cog 6 in a waitpeq loop looking for the common pin to toggle; then cog 6 in each prop would save the CNT to a global variable in its respective prop. Now we have them all aligned. (It would be much better if we could somehow reset CNT using a special hub instruction.) Any ideas, Chip?
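The alignment step described above can be sketched as simple offset arithmetic. This assumes every prop sees the sync edge on the same clock and saves its own 32-bit CNT at that moment; the function name `synced_time` and the sample values are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # CNT is a 32-bit free-running counter, so it wraps around

def synced_time(local_cnt, cnt_at_sync):
    """Clock ticks elapsed since the shared sync edge, allowing for wraparound."""
    return (local_cnt - cnt_at_sync) & MASK32

# Hypothetical captures: prop A saved 1000 at the edge, prop B saved 0xFFFFFF00.
# 500 ticks later their raw CNT values differ, but the shared timebase agrees.
a_now = (1000 + 500) & MASK32
b_now = (0xFFFFFF00 + 500) & MASK32
a_time = synced_time(a_now, 1000)
b_time = synced_time(b_now, 0xFFFFFF00)
```

Each prop keeps its own CNT and only the saved offset differs, which is why a hardware way to reset CNT would be tidier than carrying the offset around in software.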

Phil Pilgrim (PhiPi) · 2008-09-06 01:46

Chip,

Would it be possible to make the Prop II pins 5V-tolerant — similar to the way 74LVC logic parts do it? Or does a pin's inherent bidirectionality prevent that? I was thinking that for input and simulated (via DIRx) open-drain output, this would be a benefit. Obviously, normal totem-pole outputs could not take advantage of such a feature, unless you had a separate VCCIO pin like the FTDI chips provide.

This capability would eliminate an entire category of forum questions and tech support issues.

-Phil

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
'Still some PropSTICK Kit bare PCBs left!

Paul Baker · 2008-09-06 05:28

Hi Phil, this has been asked of him before while I was around, and the gist is that in order to support higher voltages, the transistor width (and therefore the size) must be increased to place the breakdown voltage further out of reach. Stepping up 1 voltage level isn't a big deal, but jumping up two levels is. By enabling 5V operation, 1.8V operation is encumbered due to the dramatic increase in capacitance. I may be a little off on the explanation, but this is how I remember his response.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer

Parallax, Inc.

Sleazy - G · 2008-09-06 09:24

Chip Gracey (Parallax) said...

Oldbitcollector said...

Does this mean that the next Propeller might require some additional considerations for cooling?

No way! It's just going to draw more quiescent current. If the chip winds up running hot, it's not a good thing, as heat slows things down. This is where clock-gating comes in to minimize unnecessary signal transitions, which contribute to current draw, which generate heat. A hot chip is to be avoided.

The only extra power dissipation would come from loads connected to the extra pins. Otherwise, it will have the same power characteristics of the current Propeller.

I think if the power characteristics were similar between Prop I and Prop II, the Prop II would most certainly have to get hotter than Prop I.

Imagine putting two Prop I's next to each other. They would definitely help heat each other up if in proximity. I think doing some finite element heat transfer simulations would demonstrate this.

Beau Schwabe · 2008-09-06 13:28

Sleazy - G,

The native internal voltage of the Prop II will be about half that of the Prop I, so this will be a power-determining factor as well that must be considered.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Beau Schwabe

IC Layout Engineer
Parallax, Inc.

Paul Baker · 2008-09-06 19:41

The heat generated by CMOS logic is proportional to fV² (holding fixed the switched capacitance C in the usual CV²f expression), where f is the speed of operation and V is the operating voltage. Doubling the speed generates twice as much heat, but doubling the voltage quadruples it. Conversely, halving the voltage results in one fourth the amount of heat generated. Since the target speed is twice that of the current Propeller and the internal voltage is about half, the resultant proportionality is roughly 1/2 that of the current chip. This isn't an exact computation; it won't be as low as one half because the subthreshold leakage is worse in smaller processes, the hub will be operating at the speed of the cogs (whereas the original Propeller's hub operates at half the speed), and there will be more circuitry in the chip.
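The arithmetic behind that estimate can be written out as a short sketch. It models only the dynamic (switching) term and assumes the switched capacitance is unchanged, so leakage and the extra circuitry mentioned above are left out; the function name is made up for this example.

```python
def relative_dynamic_power(speed_ratio, voltage_ratio):
    """Dynamic power relative to a baseline chip: linear in speed, quadratic in voltage."""
    return speed_ratio * voltage_ratio ** 2

double_speed = relative_dynamic_power(2.0, 1.0)     # twice the clock, same voltage
double_voltage = relative_dynamic_power(1.0, 2.0)   # same clock, twice the voltage
prop2_estimate = relative_dynamic_power(2.0, 0.5)   # twice the clock, half the voltage
```

With twice the speed and half the voltage the ratio works out to 2 × 0.25 = 0.5, the "roughly 1/2" figure in the post.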

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer

Parallax, Inc.

Post Edited (Paul Baker (Parallax)) : 9/6/2008 7:52:13 PM GMT

Phil Pilgrim (PhiPi) · 2008-09-06 19:49

'Just a random question with (hopefully) not a random answer: What gets read from a four-port memory cell that's being written in the same clock cycle? The new contents, or the prior contents? If the latter, is there a predictive mechanism in the Prop II to read the new data from an internal register instead?

Thanks,
Phil

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
'Still some PropSTICK Kit bare PCBs left!

bambino · 2008-09-06 19:58

Along the same lines as Phil's question, if a cog writes to the hub, will the next cog read that hub address's new or prior data?
Sounds silly to ask, but with this speed I thought I would!

OwenS · 2008-09-06 20:08

Phil: Presumably the reading instruction would be stalled if any other instruction which would write to that data was in the pipeline

Mark Swann · 2008-09-06 20:46

OwenS said...
Phil: Presumably the reading instruction would be stalled if any other instruction which would write to that data was in the pipeline

OwenS,

Stalling the instruction would affect determinism. I suspect Chip would not allow that, but would use something like a content-addressable cache to store unwritten results, so that an instruction reading that value would pull it from the cache instead. Just speculation. The cache would not need to be large. A max of two or three unwritten values at a time should be sufficient.
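The speculation above amounts to a small forwarding buffer in front of the register RAM. The following is a behavioural sketch only, with made-up names (`ForwardingRegisterFile`, `depth`) and a made-up commit policy; nothing here is known about how the Prop II actually handles this.

```python
class ForwardingRegisterFile:
    """Register RAM with a tiny buffer of results that are not yet written back."""

    def __init__(self, size, depth=3):
        self.ram = [0] * size
        self.pending = []      # (address, value) pairs, oldest first
        self.depth = depth     # assumed maximum number of unwritten results

    def write(self, address, value):
        # Queue the result; commit the oldest entry once the buffer is over capacity.
        self.pending.append((address, value))
        if len(self.pending) > self.depth:
            self.commit_oldest()

    def commit_oldest(self):
        address, value = self.pending.pop(0)
        self.ram[address] = value

    def read(self, address):
        # Search the newest pending writes first, so a read never sees stale data
        # and never has to stall waiting for the write to land.
        for pending_address, value in reversed(self.pending):
            if pending_address == address:
                return value
        return self.ram[address]

regs = ForwardingRegisterFile(size=16)
regs.write(5, 42)
value_seen = regs.read(5)   # served from the buffer; the RAM still holds the old value
```

The address comparison against every pending entry is the "content-addressable" part; with only two or three entries that is a handful of comparators.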

Mark

Capt. Quirk · 2008-09-06 21:01

Why does asynchronous serial communication need to be part of the Prop2? I understand why it has been a big part of Parallax's product-to-product communication in the past, but why now? On all of my projects, Parallax serial communication devices and programmed pause statements (required by chips) created hurdles that led me to use the Prop1.

Why wouldn't some form of synchronous serial communication be better (for more cogs and other devices), especially if it's going to be able to process huge amounts of data as Chip suggested? Or is it better to have no serial bias, and leave it up to the programmer to exploit the Prop2's capabilities?

Mike Green · 2008-09-06 22:16

A lot of external devices use asynchronous serial communications, including PCs. The only place where the Prop II (or Prop I) has built-in support for asynchronous serial communications is in the bootloader in ROM. Host communications is serial because it's easier for the PC and it's built-in, either as a serial port or as a virtual serial port via USB.

I don't think anyone has suggested built-in asynchronous serial communications support in the Prop 2 except for decoding things like Manchester encoding and the like for ethernet interfacing and possibly interchip communications where self-clocking may be needed because of I/O pin limitations in some applications.

Sapieha · 2008-09-06 22:43

Hi Mike Green.

My proposal for the serial link was a variable-length SERIN/OUT (1-32 bits).
That way, if I program it for 9-11 bits, I can have pseudo-asynchronous serial communication
(1 start bit, 8 data bits, 1-2 stop bits, triggering the first bit on a negative-going input signal), and by testing the receive-count-complete flag I can check for occurrences of the protocol. In send mode it simply sends 9-11 bits.
In all other cases it acts as synchronous communication without start/stop bits, with a receive/send flag on the bit count.

PS: With that construction it can act as SERIN/OUT and as pattern IN/OUT at a given frequency.
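To show how a generic variable-length shifter could carry an asynchronous character, here is a sketch of the framing only. It assumes the usual convention of a low start bit, data sent least significant bit first, and high stop bits; `frame_byte` and `unframe_byte` are hypothetical helper names, not part of any proposed instruction set.

```python
def frame_byte(data, stop_bits=1):
    """Bit list for one character: start bit (0), 8 data bits LSB first, stop bit(s) (1)."""
    return [0] + [(data >> i) & 1 for i in range(8)] + [1] * stop_bits

def unframe_byte(bits):
    """Recover the data byte from a 10- or 11-bit frame, checking start and stop bits."""
    if bits[0] != 0 or any(bit != 1 for bit in bits[9:]):
        raise ValueError("framing error")
    return sum(bit << i for i, bit in enumerate(bits[1:9]))

frame = frame_byte(0x55)          # 10 bits: start, 8 data, 1 stop
recovered = unframe_byte(frame)   # gives back 0x55
```

With the shifter length set to 10 or 11 and the first bit triggered by the falling edge of the start bit, the same hardware that does plain synchronous shifting would cover this case as well.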

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.

Sapieha

Post Edited (Sapieha) : 9/7/2008 1:02:39 AM GMT

Cluso99 · 2008-09-07 00:40

Sapieha,

That would be a great feature

It would simplify the async driver and I am sure a lot of other uses for this feature will be found

Sapieha · 2008-09-07 00:44

Hi Cluso99.

Yes, it has many possibilities.

With one more function, one that starts and stops the clock along with the bit count start and stop, it becomes simple I2C serial with the possibility of sending 8 and 16 I2C data bits.

PS: The Propeller's power is its programmability. I love this construction, and all my proposals are meant to enhance it rather than add specialised single-function I/O blocks. And Chip probably has much more to add to it.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.

Sapieha

Post Edited (Sapieha) : 9/8/2008 2:42:07 AM GMT

evanh · 2008-09-07 04:05

Phil Pilgrim (PhiPi) said...
What gets read from a four-port memory cell that's being written in the same clock cycle? The new contents, or the prior contents?

I know the execution question has been answered but I'll add one more detail for educational purposes.

General flip-flop timings allow simultaneous clocking of old data out and new data in on the same clock edge. This is the basis of synchronous design and also what creates the high current spikes that digital circuits are renowned for.

There can be many simultaneous reads of the same bit with no interference between them. The added complexity to the sram cell is in the attachment to multiple buses. This applies exactly the same for a single write port also, it's just one more bus. Because there is only one write port there is no contention for simultaneous access. This fact simplifies the memory structure compared to fully dual-ported sram where both buses have write access.

Evan

rjo_ · 2008-09-07 04:31

To the original question... it doesn't matter. As long as you signal your intentions far enough in advance for your professional users that should be all that is required.

I love the rest of this thread... Chip asks a simple question ... and 22 pages of answers later, it is still spinning.

Rich

Roy Eltham · 2008-09-07 10:21

New question: With the normal RDLONG, would it be possible to have swizzling options?

Something like this:

RDLONG_1324 D, S  (where 1324 is the byte order placed into D from the read)

There are several useful combinations, like 4123, 3412, 2341, 4321, 1324, 2413. Also certain replication ones are handy, like 1111, 2222, 3333, 4444, 1212, 3434, 1122, 3344, etc.
If it was possible to specify the swizzle mapping in a general way instead of fixed mappings that would be ideal, but fixed mappings would go a long way...
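What a general swizzle mapping would do can be sketched in a few lines. The post does not say how the bytes are numbered, so this example assumes byte 1 is the most significant byte and that the digits list the source for each destination byte from most significant to least; the function name `swizzle` is made up.

```python
def swizzle(value, order):
    """Rearrange the four bytes of a 32-bit value according to a 4-digit order string.

    Each digit (1-4) selects a source byte, with 1 assumed to be the most
    significant; digits are listed from the most significant destination byte down.
    """
    source = [(value >> shift) & 0xFF for shift in (24, 16, 8, 0)]  # bytes 1..4
    result = 0
    for digit in order:
        result = (result << 8) | source[int(digit) - 1]
    return result

reversed_bytes = swizzle(0x11223344, "4321")   # 0x44332211
swapped_words = swizzle(0x11223344, "3412")    # 0x33441122
replicated = swizzle(0x11223344, "1111")       # 0x11111111
```

Encoding the mapping as four 2-bit selectors would fit in a single byte, which is one way a general form rather than fixed mappings could be specified.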

My usage would be for graphics image manipulations and "vector" math like operations on 16 and 8 bit portions using integer/fixed math formats (similar stuff I do in shaders on GPUs at work).

Post Edited (Roy Eltham) : 9/7/2008 12:36:31 PM GMT

Roy Eltham · 2008-09-07 10:26

Chip Gracey (Parallax) said...
ANOTHER QUESTION:

If each new cog is more powerful than a whole current Propeller chip, do you really need 16 of them? Would 8 not suffice? Personally, I've never used all 8, except in some demo to show what the chip could do. To me, 8 is quite rich. By the time we get to 16, we are hub-starved and have to resort to cache-line style hub accesses to get the bandwidth back up (well, way up).

Are you guys sure about 16 cogs?

I'm much more interested in each cog being as powerful as possible, and having as streamlined as possible access to the hub. If that means only 8 cogs, then so be it.

Post Edited (Roy Eltham) : 9/7/2008 11:25:20 AM GMT

Roy Eltham · 2008-09-07 11:02

Chip Gracey (Parallax) said...
NEXT QUESTION:

I talked to Andre LaMothe (HYDRA) tonight about what he thought could be done to improve the video circuitry.

He thought adding a layer on top of the current video circuitry which would automate the gathering and outputting of color and pixel data would be good. This would mean that rather than doing flurries of WAITVIDs, you could point at the beginning of a stream of color and pixel longs within cog RAM and have the video circuit fetch them automatically when needed by stalling cog execution periodically for one clock to get access to the cog RAM. So, you would set the begin and end pointers, set VSCL, do a WAITVID, and it would release you as soon as it took your command. After that, it would gather and output all color/pixel long pairs, not accepting another WAITVID until it was done with the series. This means that rather than doing lots of WAITVIDs, you'd be free to compose a whole scan line. During that time, you would lose a cycle here and there, but not otherwise be interrupted.

Also, we talked about the possibility of putting a color lookup RAM into the video circuit which would translate those 8-bit pixels into 16-bit pixels which could be output as follows:

for composite video: %PPPPPP_TTTTT_BBBBB; where P=phase, T=top level, B=bottom level. This would use a 5-bit R2R DAC.

for vga (possibility): %RRRRR_GGGGG_BBBB_HV; where RGB have 5:5:4 bits, and Horizontal and Vertical are in the LSBs. Each color would use an R2R DAC.

This would make both composite and vga quite high quality.

The lookup table would be loaded 1-color-word-at-a-time by special instructions.
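As a sketch of how such a lookup could be used, the following packs the VGA bit pattern exactly as written above (five red, five green, four blue, then H and V sync in the two LSBs) and builds a 256-entry table. The helper name `pack_vga`, the sync values, and the grey-ramp palette are illustrative assumptions only.

```python
def pack_vga(red, green, blue, hsync, vsync):
    """Pack colour and sync bits into the 16-bit pattern %RRRRR_GGGGG_BBBB_HV."""
    return (((red & 0x1F) << 11) | ((green & 0x1F) << 6) |
            ((blue & 0x0F) << 2) | ((hsync & 1) << 1) | (vsync & 1))

# Hypothetical 256-entry lookup table: a grey ramp with both syncs held high.
clut = [pack_vga(i >> 3, i >> 3, i >> 4, 1, 1) for i in range(256)]

def translate(pixels):
    """Translate 8-bit pixel indices into 16-bit output words via the table."""
    return [clut[p] for p in pixels]

white = pack_vga(31, 31, 15, 1, 1)   # every bit set: 0xFFFF
words = translate([0, 255])
```

Loading the table one color word at a time, as described, corresponds to filling `clut` entry by entry.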

Would these modifications be beneficial to you?

Absolutely YES PLEASE. I would love it if the VGA output could be even more: have an option for the 8-bit CLUT to translate into 32-bit colors for VGA, with 10 bits per color component and 2 bits for HV syncs (10:10:10:2). People could decide to use fewer of the color component bits if they wanted to save output pins for other uses, but having the option to approach top-end video signals would be REALLY nice. HDMI is capable of 10, 12, and 16 bits per color component, and a lot of decent VGA monitors are capable of displaying more than 8 bits per color component. However, I could live with the 32-bit color mode only being 8 bits per color component and having some unused bits. Whatever is possible would be great!

Also, for your 16-bit variant it would be 5:5:4 14-bit color with 2-bit HV sync; I would prefer this to be 4:5:5 for the color part (since human vision is typically more sensitive to blue shades).

In any case, whatever improvements you can do to the video hardware in the cogs would be awesome. I could even live with only having one actual VSU that was super beefy, and just have the cogs share it. It's pretty atypical to have more than one display.

Post Edited (Roy Eltham) : 9/7/2008 12:42:23 PM GMT

evanh · 2008-09-08 08:28

Yeah, red is the least sensitive and green is the most sensitive.

The 32-bit output idea is a bit nutty though. That's half the pins and just grossly overkill for the memory size. That said, I guess it might be possible to build a nice self-service kiosk or something with an SD card. Dunno, my mind borks just a tad at this one.

Evan
