Propeller II update - BLOG

jmg · 2012-03-27 13:26

Phil Pilgrim (PhiPi) wrote: »

Things to do after first silicon "in hand":
Thoroughly test functionality.

Characterize for voltage, current consumption, speed, etc., over temperature.

Assuming first silicon checks out, schedule a production run.

Produce a complete datasheet.

Produce and document dev tools.

Design and manufacture EVM kits.

Write example software and educational materials for the EVMs.

Design a marketing program and produce sales materials.

It's show time! (Sometime in mid 2013?)

I'm sure I've left out some steps.

-Phil

Much of that they can start before silicon is in hand.
I'd also reverse the 'checks out' logic: the odds of something this complex working the first time are low,
so they need to schedule the errata loop tightly, and that means starting tests on 'hot spots' first.

If the errata are annoying rather than show-stopping, companies often release engineering samples on a throttled basis, e.g. on eval boards and to selected customers for pilot runs.

My hope is that they have not forgotten the silicon details in the 'do it all in SW' focus, because the state of the art moves on. As an example, this is just out from Freescale (this silicon will be production-real before Prop II):

http://www.freescale.com/files/32bit/doc/brochure/VYBRIDBYNDBITS.pdf

besides the usual 'more glitter' fluff, there is interesting stuff in the details of this Vybrid, especially this :

["Dual Quad SPI supporting a double data rate interface, an enhanced read data buffering scheme,
Execute-in-Place and support for dual-die flashes.... An enhanced read data buffering architecture, minimizes the
latency impact of cache misses. Only the CPU will access the Quad SPI in the XiP mode of operation.
Double data rate support for Spansion (data learning) and Macronix (DTR2 mode) serial flash "]

Yes, that is in-Silicon support, for DDR QuadSPI (!), up to 132MBytes/sec (2 x SO8), and with eXecute in Place support.

["Up to 1.5 MB on-chip SRAM with ECC support on 512 KB"]

That's large enough to run many control tasks purely in on chip RAM, and with the XiP backup, you can allocate all the message and config stuff rarely used, into that. (There are even dual-core Vybrids, but in larger pin-count packages. )

aside : These top end Vybrids look like a better platform for a product like the Raspberry Pi, than the Broadcom part they chose.
PCB re-spins are relatively cheap.

Phil Pilgrim (PhiPi) · 2012-03-27 13:39

jmg wrote:

Yes, that is in-Silicon support, for DDR QuadSPI (!)

Yeah, so?

-Phil

jmg · 2012-03-27 14:35

Phil Pilgrim (PhiPi) wrote: »

Yeah, so?
-Phil

So you do not care if the Prop II can do that, or not ?
132MBytes/sec if you do it in silicon, (Silicon cost is not large) - care to take a guess at how much lower a SW-bash approach will be ?

rod1963 · 2012-03-27 14:45

Hmmm, first shuttle run in May, that's a good sign. With a little luck the user community will have them before year's end.

Software will be the hard part. Would love to see a development suite like the one Cypress has for their PSOC5 line.

Parallax will also need some white papers showing how the PropII is a better solution to various problems than, say, the upper end ARM and FPGA solutions, along with various performance benchmarks. Coremark anyone?

And those Freescale Vybrids: monster SoC solutions like these are what the PropII will be competing with.

Leon · 2012-03-27 15:34

XMOS and GreenArrays are probably the main competitors.

pedward · 2012-03-27 15:40

jmg wrote: »

So you do not care if the Prop II can do that, or not ?
132MBytes/sec if you do it in silicon, (Silicon cost is not large) - care to take a guess at how much lower a SW-bash approach will be ?

Seems to me that the bottleneck is the protocol, not the chip. If it's DDR and Dual quad, the clock must be 66 MHz.
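The 66 MHz figure follows from the quoted 132 MB/s. A minimal sketch of that arithmetic, assuming "dual quad" means eight data lines in total (two 4-bit devices) and that DDR transfers data on both clock edges; the function names are illustrative only:

```python
def bus_bandwidth_mb_s(clock_mhz, data_lines, edges_per_clock):
    """Peak throughput in MB/s for a bus that moves `data_lines` bits per clock edge."""
    bits_per_clock = data_lines * edges_per_clock
    return clock_mhz * bits_per_clock / 8


def clock_for_bandwidth_mhz(target_mb_s, data_lines, edges_per_clock):
    """Clock in MHz needed to reach `target_mb_s` on the same kind of bus."""
    return target_mb_s * 8 / (data_lines * edges_per_clock)


# Two quad-SPI devices side by side (8 data lines), double data rate.
dual_quad_ddr = bus_bandwidth_mb_s(66, data_lines=8, edges_per_clock=2)
required_clock = clock_for_bandwidth_mhz(132, data_lines=8, edges_per_clock=2)

print(dual_quad_ddr)   # 132.0
print(required_clock)  # 66.0
```

This is peak streaming rate only; it ignores the command and address preamble at the start of each read.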

Since the Prop2 is supposed to be 200 MHz, 200 MIPS per COG, I don't see any reason why it couldn't turn on a counter, then read 8 bits on both the rising and falling clock edges. That would be 100 MHz at 8 bits, or 200 MB/s, which is FASTER than the claimed peripheral.

Also, is this burst mode only? How much can be read at that 66MHz DDR?

The Prop2 has an instruction for doing bulk transfers to COG ram or HUB ram, right now it's supposed to be intended for SDRAM, but in reality it is any 8 or 16bit bus.

I had a discussion with Chip about this, at 16 bits, you can read 1 word per clock with the external bulk transfer instruction, which works out to 4 longs in 8 clocks.

The quad transfer instruction moves 4 longs per hub access window, with 8 clocks between windows you effectively get 1 long per 2 clocks, or 400MB/s to HUB ram.
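The rates quoted in the last few paragraphs can be checked with a few lines of arithmetic. This is only a sketch, assuming a 200 MHz system clock and 4-byte longs as stated above; the helper name is made up for illustration:

```python
CLOCK_HZ = 200_000_000   # assumed Prop2 system clock, per the figures above
BYTES_PER_LONG = 4


def mb_per_s(bytes_moved, clocks, clock_hz=CLOCK_HZ):
    """Throughput in MB/s when `bytes_moved` bytes are transferred every `clocks` cycles."""
    return bytes_moved / clocks * clock_hz / 1_000_000


# 8 bits on both edges of a 100 MHz external clock = 1 byte per 200 MHz system clock.
software_ddr_read = mb_per_s(bytes_moved=1, clocks=1)

# 16-bit bulk transfer: 1 word per clock, i.e. 4 longs in 8 clocks.
bulk_16bit = mb_per_s(bytes_moved=4 * BYTES_PER_LONG, clocks=8)

# Quad transfer: 4 longs per hub window, one window every 8 clocks.
quad_hub = mb_per_s(bytes_moved=4 * BYTES_PER_LONG, clocks=8)

print(software_ddr_read, bulk_16bit, quad_hub)  # 200.0 400.0 400.0
```

Note that the 16-bit bulk transfer and the quad hub transfer come out at the same rate, which is consistent with one feeding the other without a bottleneck.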

However, because the bulk transfer function can write to HUB memory, you effectively get DMA. A lot of the functions of the new Prop were developed separately and so they have separate state machines, which means they run in parallel to the main ALU in each COG. The multiplier unit, quad cache read, bulk transfer, and counters all run independently. The texture mapping engine is dedicated hardware too.

The Prop-2-Prop serial engine is another capability that is parallelled. The engine will receive data asynchronously to the COG, and when you call the instruction you can block or non-block, if you non-block and call with the wc flag, it signifies if there is data in the buffer. The serial engine is actually triple buffered, there is the read buffer, the long buffer, and the long in your COG ram. The long buffer holds the last value received, while the receive buffer receives a new value, then you can periodically fetch the long from the long buffer and store it in your COG ram.
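A toy model of the buffering described above may help. It is a sketch only: the class and method names are invented, and the choice to overwrite an unread long when a new one arrives is an assumption, not something stated in the preliminary specifications.

```python
class SerialReceiverModel:
    """Toy model of the receive path: receive buffer -> long buffer -> COG RAM."""

    def __init__(self):
        self.receive_buffer = None  # value currently being shifted in
        self.long_buffer = None     # last complete value received
        self.has_data = False       # what the wc flag would report (assumed)

    def hardware_receive(self, value):
        """Engine side: a complete long arrives, independently of the COG."""
        self.receive_buffer = value
        # Assumption: a completed long replaces whatever is in the long buffer.
        self.long_buffer = self.receive_buffer
        self.has_data = True

    def read_nonblocking(self):
        """COG side: returns (value, flag); the flag is False if nothing is waiting."""
        if not self.has_data:
            return None, False
        self.has_data = False
        return self.long_buffer, True


rx = SerialReceiverModel()
print(rx.read_nonblocking())   # (None, False): nothing received yet
rx.hardware_receive(0x1234)
cog_ram_long, flag = rx.read_nonblocking()
print(hex(cog_ram_long), flag) # 0x1234 True
```

The point of the model is that the COG polls at its own pace while reception continues in the background.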

All of this isn't apparent from the preliminary specifications, but it really opens up the possibilities, and the WOW factor of what you can do.

The Prop2 is supposed to be 1600MIPS, which is equivalent to a 1.6 GHz ARM SoC, but it has 8 parallel cores to make that happen. I would have to wonder if most jobs would be more efficiently handled if spread across 8 cores at a lower speed than 1 core at a higher speed, having to stop and handle interrupts, do data comms, screen updates, etc.

Phil Pilgrim (PhiPi) · 2012-03-27 15:42

jmg wrote:

So you do not care if the Prop II can do that, or not ?

Not really. Besides, we've already beat the subject to death. Why bring it up again?

-Phil

jmg · 2012-03-27 15:53

Phil Pilgrim (PhiPi) wrote: »

Not really. Besides, we've already beat the subject to death. Why bring it up again?

Because state of the art changes, as time moves on. Closing your eyes does not make a problem go away.
Just a few months ago, there were no DDR QuadSPI devices, now we have both FLASH devices, and these new Freescale processors, supporting this.

This will quickly become an 'expected feature' in silicon, and on a constrained device like a Prop, such detail is actually very important.

jmg · 2012-03-27 16:08

pedward wrote: »

Seems to me that the bottleneck is the protocol, not the chip. If it's DDR and Dual quad, the clock must be 66 MHz.

Since the Prop2 is supposed to be 200 MHz, 200 MIPS per COG, I don't see any reason why it couldn't turn on a counter, then read 8 bits on both the rising and falling clock edges. That would be 100 MHz at 8 bits, or 200 MB/s, which is FASTER than the claimed peripheral.

Only if you don't do anything with the data - if you want to shift/combine/store those fragments, that's many more cycles, and you need to contrast that with a Silicon support, which will deliver 32 bit words (buffered?) every 12 cpu cycles - and a Prop II can do real useful work in 12 cycles, and still not begin to slow down the interface.
Also, any spare cycles can translate directly to power savings.

pedward wrote: »

Also, is this burst mode only? How much can be read at that 66MHz DDR?

Once you have the READ mode address preamble done, most devices can stream forever at that speed.

pedward wrote: »

The Prop-2-Prop serial engine is another capability that is parallelled. The engine will receive data asynchronously to the COG, and when you call the instruction you can block or non-block, if you non-block and call with the wc flag, it signifies if there is data in the buffer. The serial engine is actually triple buffered, there is the read buffer, the long buffer, and the long in your COG ram. The long buffer holds the last value received, while the receive buffer receives a new value, then you can periodically fetch the long from the long buffer and store it in your COG ram.

Now imagine if all of that silicon support, also applied and worked with a DDR QuadSPI memory link too ?

Far more customers will want to link to FLASH, than those wanting to link to another Prop II
Clearly booting would/could use this as well.

evanh · 2012-03-27 16:43

Sequentially streamed, so, in practice, limited, but undefined, in length.

"State of the art" is prolly not a good classification for the low end stuff like this. It's more along the lines of making use of cheap mass-produced commodities.

Also, just remember that Cogs will never execute native outside of their native 9 bit (496?) address range. Everything bigger has to be Hub or some other virtual addressing scheme. This means that all large memory accesses are going to be done as data. XIP for the Prop is worthless. Doubly so because to fill 8 Cogs would need 8 buses each running at 800 MB/s. And that's just for instruction fetches!
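The 800 MB/s figure can be reproduced as follows. This is a rough sketch assuming 200 MIPS per Cog and one 32-bit instruction fetched per instruction executed; the 132 MB/s comparison value is the DDR QuadSPI peak quoted earlier in the thread.

```python
MIPS_PER_COG = 200          # assumed, from the figures quoted in this thread
BYTES_PER_INSTRUCTION = 4   # one 32-bit long per instruction
COGS = 8
QUADSPI_DDR_MB_S = 132      # peak rate quoted for the Vybrid interface

fetch_per_cog_mb_s = MIPS_PER_COG * BYTES_PER_INSTRUCTION
fetch_all_cogs_mb_s = fetch_per_cog_mb_s * COGS
shortfall = fetch_per_cog_mb_s / QUADSPI_DDR_MB_S

print(fetch_per_cog_mb_s)   # 800
print(fetch_all_cogs_mb_s)  # 6400
print(round(shortfall, 1))  # 6.1 (one Cog alone needs ~6x the flash bandwidth)
```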

Cluso99 · 2012-03-27 16:59

I am just hoping we can boot directly from an SD card without requiring an eeprom. An sd socket is cheaper than any eeprom. For hobby projects, the user can supply the microSD, making the whole board cheaper. For commercial products, the microSD is replaceable, or it can be updated by connecting to a pc. At $6 for a 2GB microSD card retail, it's a no-brainer for me.

First shuttle chips in May...
A lot of validation has been done in FPGA already.
Already a shuttle run of test sections has been run and verified - remember the fuses failed.
Providing everything checks out then a small run of engineering samples in maybe 3-4 months.
My best estimate with no hiccups is end of 2012 for production chips available.

BTW: Some still seem to think that Prop II is a replacement for Prop 1. This is DEFINATELY not the case. I see Prop 1 sales increasing as a result of Prop II legitimising Parallax (and so do Parallax and others here on the forums). The two chips will do different things, but with overlap of course. Don't underestimate the capability of Prop 1. We are still discovering techniques and abilities of Prop 1, ~6 years after release. And we still don't know all the capabilities of the counter/video circuits on Prop 1.

pedward · 2012-03-27 18:11

Cluso, the FLASH size Chip and I talked about is an 8Mbit, that's $1.10 in quantities he's interested in. The socket is like $3 IIRC, plus $5-$10 for a card. That means $8 minimum.

The problem with SD is that it may be an SPI bus, but it doesn't implement the same command set as flash.

The Prop2 won't use EEPROM, it will use FLASH, which is managed in sectors of 4KB instead of byte rewrite. The Prop2 needs a 1Mbit FLASH to store the program for conventional usage.

When talking with Chip about the bootloader I wanted to make certain that the Prop2 only needs a 4KB FLASH, so you can put a 4KB flash on board and have a bootloader that will boot firmware from SD.

The Prop2 shouldn't have the limitation of requiring a FLASH chip the same size as the RAM.

localroger · 2012-03-27 18:16

While they were floating the boot-from-SD thing last year at UPEW I'm pretty sure that has gone by the wayside. SD sockets aren't really cheaper than EEPROM, and that's before you buy the SD card to plug into them. They have circuit board mounting and acreage requirements. And booting from SD requires some fairly sophisticated logic in the boot loader to navigate the file system.

For commercial devices, that $6 plus socket versus < $2 in both SOIC and thru-hole can kill you.

P.S. agree about P2 not driving P1 away; P1 will be much lower power and simpler to use. I see myself doing P1 stuff for awhile. Also it's DEFIN-I-TELY. :-)

jmg · 2012-03-27 18:42

localroger wrote: »

... And booting from SD requires some fairly sophisticated logic in the boot loader to navigate the file system.

I don't think it is quite as bad as this - I recall discussions about FPGA type loaders, and there I think if the trigger-header is reasonably unique, it simply 'chugs along until it finds it'.
Of course, if you had a SD card with 20 valid Loader files on it, this simple method falls down, but most SD usages/users could easily live with a single valid boot file.

That would have a very low cost in a ROM loader ?
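As a sketch of the 'chug along until it finds it' idea: the loader reads raw sectors in order and stops at the first one that starts with a known header, with no file system parsing at all. The magic value, sector size, and function names below are hypothetical, chosen only for illustration.

```python
SECTOR_SIZE = 512            # assumed sector size
BOOT_MAGIC = b"PROPBOOT"     # hypothetical trigger header


def find_boot_sector(read_sector, sector_count, magic=BOOT_MAGIC):
    """Return the index of the first sector that starts with `magic`, or None."""
    for index in range(sector_count):
        if read_sector(index).startswith(magic):
            return index
    return None


# Stand-in for a card: a 16-sector in-memory image with the header at sector 5.
image = bytearray(SECTOR_SIZE * 16)
offset = 5 * SECTOR_SIZE
image[offset:offset + len(BOOT_MAGIC)] = BOOT_MAGIC


def read_sector(index):
    """Read one raw sector from the in-memory image."""
    start = index * SECTOR_SIZE
    return bytes(image[start:start + SECTOR_SIZE])


found = find_boot_sector(read_sector, sector_count=16)
print(found)  # 5
```

With several valid boot images on one card this simply takes the first match, which is the limitation noted above.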

I see the newest Spansion QuadSPI parts even have a special Bootloader NVM register, where you can specify an offset for data to stream from, in an Auto-boot application. Seemed like a nice idea, and makes ROM loaders even simpler.

jmg · 2012-03-27 18:51

pedward wrote: »

When talking with Chip about the bootloader I wanted to make certain that the Prop2 only needs a 4KB FLASH, so you can put a 4KB flash on board and have a bootloader that will boot firmware from SD.

If this still has i2c as an option, I see SOT23 sized memory comes up to 8K bytes, so that gives a very flexible tiered boot choice.

pedward · 2012-03-27 22:17

The Prop2 is switching to SPI for the FLASH, the same densities aren't available at an economic price in I2C. It's 4x more for I2C than SPI, for the equivalent part.

I did some research, a 2KB EEPROM is 17-33 cents in reel quantities. The smallest SPI FLASH that Digikey carries is 512Kb (64KB) and that is 26 cents in reel quantities.

The reality is that SPI FLASH is much cheaper than EEPROM, so much so that it pretty much doesn't make sense to load from SD. The novelty of having removable memory is what drives the desire to boot from SD, but you could just as easily put a bunch of code in the SPI FLASH and use the SD for other stuff.

Reel quantities of 8Mbit FLASH are 61 cents, it's just ridiculous how little difference in price there is between devices!
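To put the prices above on a common footing, here is a small sketch that converts densities to kilobytes and works out cents per kilobyte. It uses only the reel prices quoted in this post; the helper names are illustrative.

```python
def kbit_to_kbyte(kbit):
    """Convert a density in kilobits to kilobytes."""
    return kbit / 8


def cents_per_kbyte(price_cents, kbytes):
    """Price per kilobyte of storage."""
    return price_cents / kbytes


flash_512kbit_kb = kbit_to_kbyte(512)        # 64.0 KB
flash_8mbit_kb = kbit_to_kbyte(8 * 1024)     # 1024.0 KB

eeprom_2kb_low = cents_per_kbyte(17, 2)      # cheapest quoted 2KB EEPROM
flash_64kb = cents_per_kbyte(26, flash_512kbit_kb)
flash_1mb = cents_per_kbyte(61, flash_8mbit_kb)

print(eeprom_2kb_low)        # 8.5
print(flash_64kb)            # 0.40625
print(round(flash_1mb, 4))   # 0.0596
```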

Cluso99 · 2012-03-27 22:47

pedward: Please make sure you get your facts correct before persuading people the microSD is not an option...

microSD (hinged) smt socket $0.1067384 in Qty 800 reel
microSD push/push smt socket $0.2034935 in Qty 1000 reel
SD smt socket $0.2435 Qty 2000

And yes, I realise SPI Flash will be used. IMHO, I don't want the space or the part for Flash on my pcbs if I can avoid it. Just no reason to use Flash if you have a micro/SD card with just a few extra instructions (use the boot area on the SD if necessary, or demand the first file space).

Postedit: I am not advocating the removal of the Flash boot, just proposing (like had been discussed previously) that there be a boot option from SD card.

If you want a small footprint microSD socket, see tubulars post today... the Molex device is a microSD socket. Bit more expensive (~80c? in low volume) but the card sits over the top of the prop chip!

And from digikey http://search.digikey.com/us/en/products/DM3D-SF/HR1941CT-ND/1786515
$1.24/100, 0.8246/1000

Circuitsoft · 2012-03-28 00:34

evanh wrote: »

Also, just remember that Cogs will never execute native outside of their native 9 bit (496?) address range.

504 General purpose longs on PII.

evanh wrote: »

XIP for the Prop is worthless. Doubly so because to fill 8 Cogs would need 8 buses each running at 800 MB/s. And that's just for instruction fetches!

Well, good thing hub ram has 3200MB/s bandwidth then. Granted, that's 400MB/s/cog, but since several cogs will be used to implement hardware peripherals that can be defined in <504 instructions, XIP isn't needed quite so much. Also, some of the cacheing mechanisms implemented by some LMM kernels have proven to be pretty effective.

John A. Zoidberg · 2012-03-28 05:09

Cluso99 wrote: »

I am just hoping we can boot directly from an SD card without requiring an eeprom. An sd socket is cheaper than any eeprom. For hobby projects, the user can supply the microSD, making the whole board cheaper. For commercial products, the microSD is replaceable, or it can be updated by connecting to a pc. At $6 for a 2GB microSD card retail, it's a no-brainer for me.

First shuttle chips in May...
A lot of validation has been done in FPGA already.
Already a shuttle run of test sections has been run and verified - remember the fuses failed.
Providing everything checks out then a small run of engineering samples in maybe 3-4 months.
My best estimate with no hiccups is end of 2012 for production chips available.

BTW: Some still seem to think that Prop II is a replacement for Prop 1. This is DEFINATELY not the case. I see Prop 1 sales increasing as a result of Prop II legitimising Parallax (and so do Parallax and others here on the forums). The two chips will do different things, but with overlap of course. Don't underestimate the capability of Prop 1. We are still discovering techniques and abilities of Prop 1, ~6 years after release. And we still don't know all the capabilities of the counter/video circuits on Prop 1.

Ooh! That's pretty cool news if these are available by the end of 2012. Makes a good Christmas gift for the multi-core fanatics!

evanh · 2012-03-28 05:26

SD cards are best used as a filesystem. And having more than one socket works best. Just the sort of thing that is done after an OS is loaded up.

Booting from removable media is best kept to a second stage function. So it's then up to the system designer to include it. OS or first stage boot sits in a soldered part. SPI Flash is the norm for this sort of thing.

Chip is on the right track.

Pharseid380 · 2012-03-28 20:06

I posted this question to the wrong thread yesterday, so please forgive this double post of sorts. I see in the preliminary feature list that texture mapping is still there. I know how texture mapping in general works, but does anyone know how it works specifically on the Prop II?

-phar

Seairth · 2012-05-04 09:11

I realize that all of this information will be forthcoming, but I don't suppose we could get some details on the following new features:

SETMAP : Used for context switches? If so, how? Maybe a simple example of how the registers are "mapped" before and after this instruction would be enough to clarify.

SETINDx : When using SETINDx, an address range is specified. Does the auto-increment mechanism just wrap around when the highest register is accessed? Also, is there going to be a GETINDx pair of instructions? I see this as a very convenient way to implement a circular queue (write to INDA, read from INDB).

SETPORx : Obviously, there are more I/O ports than physical pins. If the map is to subsequent blocks of 32 pins, does that mean that the third block will have unimplemented pins (4, it looks like)?

SNDSER/RCVSER: what is the protocol used? I'm guessing that one cog provides the clock while the other is slaved? And based on the commands, the actual I/O is asynchronous (one cog can send independent of the other)?

SETXCH : I take it that calling SETPORx and setting the matching DIRx is not enough to use the internal I/O? The text "Reconfigure Port D I/O masks given D or n to select which cogs to listen to" is a bit cryptic. Was that supposed to be "Port 3" instead of "Port D"?

SETXFR : Does this instruction trigger the actual read/write? Also, is this a multi-cycle instruction (considering 5ns instruction clock speed @ 200MHz and a typical 60-80ns access time for RAM)? How are the address and I/O lines configured (I'm expecting this to use up to two I/O ports)?
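For scale, the timing mentioned in the SETXFR question works out as below. This is just the arithmetic, assuming a 200 MHz clock and the 60-80 ns access times given above; it says nothing about how SETXFR actually behaves.

```python
import math


def clocks_for_access(access_ns, clock_hz=200_000_000):
    """Whole clock cycles needed to cover `access_ns` at the given clock rate."""
    period_ns = 1e9 / clock_hz          # 5 ns at 200 MHz
    return math.ceil(access_ns / period_ns)


fast_ram_clocks = clocks_for_access(60)
slow_ram_clocks = clocks_for_access(80)

print(fast_ram_clocks, slow_ram_clocks)  # 12 16
```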

tonyp12 · 2012-05-04 09:55

This looks like a good SPI flash: 256 kilobytes at 35 cents each.
http://www.futureelectronics.com/en/technologies/semiconductors/memory/flash/serial/Pages/6000867-MX25L2006EM1I-12G.aspx?IM=0

But will there be a USB bootstrap loader in ROM?
Preprogramming a software-based USB UART into flash first, with its inherent risk of being completely deleted, is not good.

Dave Hein · 2012-05-04 10:10

Seairth, my understanding is that INDA and INDB will wrap from the top value back to the bottom value. There doesn't seem to be a way to read the values of the two index registers.

evanh · 2012-05-04 19:48

Seairth wrote: »

SETMAP : Used for context switches? If so, how? Maybe a simple example of how the registers are "mapped" before and after this instruction would be enough to clarify.

I'd guess it's a simple XOR of the upper 6 bits of Cog address lines leaving the lower 3 bits for 8 registers per bank. This would render the last bank, with its special registers, a bit tricky to use as it would end up in a different logical bank for each mapping, so there will likely be an exception for the last bank or two to keep them at fixed addresses.
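To make that guess concrete, here is a sketch of what such a remap could look like. It models the guess only, not documented SETMAP behaviour; the function name and the choice of keeping the last two banks (0x1F0 and up) fixed are assumptions.

```python
OFFSET_BITS = 3        # lower 3 bits select one of 8 registers in a bank
FIXED_FROM = 0x1F0     # assumed: last two banks stay at fixed addresses


def remap_address(address, bank_xor, fixed_from=FIXED_FROM):
    """XOR the upper 6 bits of a 9-bit Cog address with `bank_xor`."""
    if address >= fixed_from:
        return address                      # special registers are not remapped
    bank = address >> OFFSET_BITS
    offset = address & ((1 << OFFSET_BITS) - 1)
    return ((bank ^ bank_xor) << OFFSET_BITS) | offset


print(hex(remap_address(0x005, 0)))   # 0x5: mapping 0 leaves addresses alone
print(hex(remap_address(0x005, 1)))   # 0xd: register 5 of bank 0 now lands in bank 1
print(hex(remap_address(0x1F5, 1)))   # 0x1f5: fixed region is untouched
```

Because XOR is its own inverse, applying the same mapping twice returns the original address, which is what would make it cheap for context switching. A real design would also have to keep remapped addresses out of the fixed region, which this sketch does not attempt.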

evanh · 2012-05-04 20:07

tonyp12 wrote: »

But will there be a USB bootstrap loader in ROM?

Holy Smile no! SD cards are at least simple even if the stored content is not. USB is a hugely complicated protocol.

Preprogramming a software-based USB UART into flash first, with its inherent risk of being completely deleted, is not good.

As a system/device designer you have to be careful if you are going to allow reprogramming of soldered parts. No different to many many many modern devices these days.

evanh · 2012-05-04 20:25

Seairth wrote: »

SETXCH : I take it that calling SETPORx and setting the matching DIRx is not enough to use the internal I/O? The text "Reconfigure Port D I/O masks given D or n to select which cogs to listen to" is a bit cryptic. Was that supposed to be "Port 3" instead of "Port D"?

Assembly has always been a little more cryptic than the higher level languages. In this case there is no indication of the indirectness of the addressing ... The "D" in Port D is referring to a register selected port (Known as register direct addressing). That is, the instruction contains the register number (D), and that register contains the port number, rather than the instruction holding an immediate port number.

Seairth · 2012-05-04 20:38

Dave Hein wrote: »

Seairth, my understanding is that INDA and INDB will wrap from the top value back to the bottom value. There doesn't seem to be a way to read the values of the two index registers.

Too bad about the read operations. It would be easy enough to do the bookkeeping with additional registers, but testing the indexes would have been nicer.
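A sketch of that bookkeeping, modelled in Python rather than PASM: two auto-incrementing indexes that wrap within a register range (standing in for INDA and INDB), plus a separate count, since the indexes themselves cannot be read back. All names and the post-increment behaviour are assumptions for illustration.

```python
class WrappingIndex:
    """Auto-incrementing index that wraps from `top` back to `bottom`."""

    def __init__(self, bottom, top):
        self.bottom, self.top = bottom, top
        self.value = bottom

    def post_increment(self):
        current = self.value
        self.value = self.bottom if current == self.top else current + 1
        return current


class CircularQueue:
    """Queue over a register range: write via one index, read via the other."""

    def __init__(self, bottom, top):
        self.registers = [0] * 512
        self.write_index = WrappingIndex(bottom, top)   # plays the role of INDA
        self.read_index = WrappingIndex(bottom, top)    # plays the role of INDB
        self.capacity = top - bottom + 1
        self.count = 0                                  # the extra bookkeeping register

    def push(self, value):
        if self.count == self.capacity:
            return False                                # full
        self.registers[self.write_index.post_increment()] = value
        self.count += 1
        return True

    def pop(self):
        if self.count == 0:
            return None                                 # empty
        self.count -= 1
        return self.registers[self.read_index.post_increment()]


queue = CircularQueue(bottom=100, top=103)
for item in (1, 2, 3, 4):
    queue.push(item)
print(queue.push(5))   # False: all four registers are in use
print(queue.pop())     # 1
print(queue.push(5))   # True: the freed slot is reused after the write index wrapped
```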

Seairth · 2012-05-05 10:31

evanh wrote: »

The "D" in Port D is referring to a register selected port (Known as register direct addressing) .

I think you are incorrect in this case. It separately mentions "given D or n" as the mask. Maybe another way to ask the question is "what is the function of the mask?"

Beau Schwabe · 2012-05-05 13:54

Update:

1) To date we have met with people in Wisconsin that are working on the inner core (glue-logic) ... this was a face to face and to make sure that the streamout data that they provided could successfully be streamed in.

2) I have implemented some minor layout changes to the IO's that will increase functionality ... This is the result of a brainstorm between Chip and Roy Eltham :-)

3) I have a small change to make on the SLOW_DAC resistors ... this is a small change that will improve bit accuracy. This DAC is used internally as a reference.

4) Chip is making some small changes to the inner core (glue logic) which will not affect the layout that I am doing.

As it is as of today (May 5th, 2012) I have a current transistor count. I would like to have a mini contest. The person that guesses the closest number of transistors used in the Propeller 2 design will win a prize. I haven't decided what the prize is yet, but maybe we can get Matt G involved to come up with something. Perhaps first dibs on Beta testing the new Propeller 2?

For now, let's just leave it open... Anyone who guesses the closest number in any of the three categories is the winner.

1) Number of transistors inside the core
2) Number of transistors outside the core
3) Total Number of Transistors

A little info:
- The Chip is roughly 7.5 mm square
- The process is 180nm
- The method used to count transistors was to look for where Poly (the transistor gate) intersects Diffusion (the substrate). Each intersection forms a transistor. For transistors with multiple fingers or gates, each gate/finger counts as a separate transistor.
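The counting rule in the last item can be illustrated with plain rectangles. This is a simplified sketch, not the layout tool's actual method: shapes are axis-aligned boxes, the names are invented, and each overlapping poly/diffusion pair is counted once.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a, b):
    """True if two axis-aligned rectangles share any area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def count_transistors(poly_shapes, diffusion_shapes):
    """Count every poly/diffusion crossing as one transistor."""
    return sum(1 for p in poly_shapes for d in diffusion_shapes if overlaps(p, d))


# One diffusion strip crossed by three poly fingers, plus one poly wire elsewhere.
diffusion = [Rect(0, 0, 10, 2)]
poly = [
    Rect(1, -1, 2, 3),
    Rect(4, -1, 5, 3),
    Rect(7, -1, 8, 3),
    Rect(20, 0, 21, 3),   # routing only, crosses no diffusion
]

transistors = count_transistors(poly, diffusion)
print(transistors)  # 3: a three-finger device counts as three transistors
```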

Please E-mail your responses to: Propeller2@cox.net
... and to make it fair, I will send Matt G the numbers that I have right now and use those (as they are subject to change slightly from this point forward)

Attached are some pics of the current Propeller 2 artwork.
