Who Will be Prop2's Biggest Competition?

KeithE · 2018-01-03 22:36

There are already boards with ARMs and iCE40 parts, so perhaps less wrong. If you bundled it with P1 Verilog and had a sort of NAND2TETRIS set of docs it could be an exciting educational experience. A lot of work to develop though.

David Betz · 2018-01-03 22:37

KeithE wrote: »

There’s a guy “Gnarly Grey” selling UPDuino iCE40UP5K boards for $8. Are there many Boards that pair a microcontroller with a low cost FPGA? Parallax could probably make a nice educational product in this space if they thought it was worth it.

Does the FPGA have enough space for a P1v COG?

jmg · 2018-01-03 22:42

Heater. wrote: »

I thought we were trying to sell the P2. A P1 board including cheap FPGA from Parallax would be sending totally the wrong message.

Yes and no - Certainly, the Parallax (Chip) focus needs to be P2 release, but Parallax should also not ignore other things in their stable.
Modules are relatively cheap to design and release.

The iCE40UP has been largely held up by lack of real parts, but I see Digikey has stock of ICE40UP5K-SG48, and Mouser shows Qty 1944 due later in Jan.

So it becomes a question of is there a FLiP+, and what should go onto that ?
My choices would be better precision clock, an EFM8UB3 for more flexible USB interface (and buys some board area), and perhaps a ICE40UP5K-SG48 ?

Another point to ponder: a key selling point for P2 is the Smart Pins, which raises the question of whether a board with P1 + ICE40UP5K-SG48 + some Smart Pins would let real Smart Pin development get started earlier ?
How could those Smart-pins be implemented, to make code-porting to P2 easiest ?

I can see a 32b data field, 6 bits of pin ID and 6 pin opcodes (3 bits), so that's 41 bits to transport between P1 and FPGA ?
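
A rough sketch of how such a 41-bit transfer might be packed, assuming the 6 opcodes are encoded in 3 bits (32 + 6 + 3 = 41). The field order and the names `pack_frame` / `unpack_frame` are hypothetical and only illustrate the bit budget; nothing here comes from a P2 or Lattice document.

```python
DATA_BITS = 32   # 32b data field
PIN_BITS = 6     # 6 bits of pin ID (64 pins)
OP_BITS = 3      # assumption: 6 opcodes encoded in 3 bits
FRAME_BITS = DATA_BITS + PIN_BITS + OP_BITS


def pack_frame(opcode, pin, data):
    """Pack opcode, pin ID and data into one integer (hypothetical layout)."""
    if not 0 <= opcode < 6:
        raise ValueError("opcode must be 0..5")
    if not 0 <= pin < (1 << PIN_BITS):
        raise ValueError("pin must be 0..63")
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("data must fit in 32 bits")
    return (opcode << (PIN_BITS + DATA_BITS)) | (pin << DATA_BITS) | data


def unpack_frame(frame):
    """Split a packed frame back into (opcode, pin, data)."""
    data = frame & ((1 << DATA_BITS) - 1)
    pin = (frame >> DATA_BITS) & ((1 << PIN_BITS) - 1)
    opcode = frame >> (DATA_BITS + PIN_BITS)
    return opcode, pin, data
```

Whatever the real link looks like (serial, nibble-wide, byte-wide), the payload per Smart Pin access would be about this size.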

jmg · 2018-01-03 22:44

David Betz wrote: »

KeithE wrote: »

There’s a guy “Gnarly Grey” selling UPDuino iCE40UP5K boards for $8. Are there many Boards that pair a microcontroller with a low cost FPGA? Parallax could probably make a nice educational product in this space if they thought it was worth it.

Does the FPGA have enough space for a P1v COG?

I think so, but do not recall seeing a specific Lattice build for ICE40UP5K-SG48 yet.

Addit: There are some Lattice numbers here, for a different FPGA family.
http://forums.parallax.com/discussion/165645/p1v-with-lattice-ecp5-fpgas/p1
Seems to be a tad under 2k LUT per COG, so if that maps to the ICE40, it means 1 COG + plenty-logic, or maybe 2 COGS, and sub 20% spare ?

Tubular · 2018-01-03 23:59

I would think 5k LUT might get you 2 P1V cogs. The 8k LE Max10 fpga gets you 3 P1V cogs.

About 5 of us locally are waiting on this 24k Fleaohm board, then we'll see what's possible. Will be interested to see what MHz it manages.
https://www.indiegogo.com/projects/fleafpga-ohm-fpga-experimenter-board-arduino#/

cgracey · 2018-01-04 00:04

If the P2 could replace FPGAs for some types of apps where there was flexibility in redefining the problem, somewhat, to accommodate a P2 solution, that would be nice to demonstrate. Requiring the customer to deal with an FPGA tool chain, though, doesn't seem so nice. We need to show rapid application development with just the P2. I know that doesn't fit every scenario, but it could fit a lot of scenarios. Those are where the P2 would really shine.

jmg · 2018-01-04 00:23

cgracey wrote: »

If the P2 could replace FPGAs for some types of apps where there was flexibility in redefining the problem, somewhat, to accommodate a P2 solution, that would be nice to demonstrate. Requiring the customer to deal with an FPGA tool chain, though, doesn't seem so nice. We need to show rapid application development with just the P2. I know that doesn't fit every scenario, but it could fit a lot of scenarios. Those are where the P2 would really shine.

I also see a market where P2 allows someone to choose a smaller FPGA.
In the thread I linked above, rough prices are a drop from 44k LUT to 24k LUT, can save $8.53, or from 45k to 12k, can save $13, (and P2 adds a LOT more ram )

Tubular · 2018-01-04 00:28

There seems to be a certain "learned skepticism/unease" associated with successfully developing with fpgas. I think the P2 will feel more solid to develop with.

At one stage we discussed putting small fpga fabric at each smartpin. With hindsight I think the way the smartpins are now is the best.

Seairth · 2018-01-04 01:34

cgracey wrote: »

If the P2 could replace FPGAs for some types of apps where there was flexibility in redefining the problem, somewhat, to accommodate a P2 solution, that would be nice to demonstrate. Requiring the customer to deal with an FPGA tool chain, though, doesn't seem so nice. We need to show rapid application development with just the P2. I know that doesn't fit every scenario, but it could fit a lot of scenarios. Those are where the P2 would really shine.

The way you sell it is as a rapid prototyping tool prior to using an FPGA. 90% of the time, they'll just stay with the P2 (because they've already written working code).

Cluso99 · 2018-01-04 02:30

Tubular wrote: »

There seems to be a certain "learned skepticism/unease" associated with successfully developing with fpgas. I think the P2 will feel more solid to develop with.

At one stage we discussed putting small fpga fabric at each smartpin. With hindsight I think the way the smartpins are now is the best.

IMHO, it would have been better to have a micro-COG drive each smart pin. ie an instruction reduced P1/P2 COG core with say 256-long cog ram/registers. Even better if they were 1 clock instructions. Bet we could have done a lot more with these

jmg · 2018-01-04 02:34

Cluso99 wrote: »

Tubular wrote: »

At one stage we discussed putting small fpga fabric at each smartpin. With hindsight I think the way the smartpins are now is the best.

IMHO, it would have been better to have a micro-COG drive each smart pin. ie an instruction reduced P1/P2 COG core with say 256-long cog ram/registers. Even better if they were 1 clock instructions. Bet we could have done a lot more with these

Easy to say, apart from two teensy issues :
a) That would have run much slower, & been more granular - bit-banging on a 180nm process is going to be glacial in modern terms.
b) It simply would have eaten silicon. Those 64 MCUs are a LOT of chip area.

David Betz · 2018-01-04 02:38

Since there continue to be suggestions about how to improve the current P2 design it makes me wonder how many people here will be happy if the current design is fabricated without any further changes. Is this thing really done?

Heater. · 2018-01-04 02:40

How do you make a 1 clock per instruction processor?

As far as I know it cannot be done without introducing pipelines and other complexities. But then whilst you now complete an instruction every clock, each instruction still has a two or more clocks latency.

Even with 256 registers surely this is a huge lot of transistors to find space for. Especially if it's one per pin!

jmg · 2018-01-04 02:48

Heater. wrote: »

How do you make a 1 clock per instruction processor?

As far as I know it cannot be done without introducing pipelines and other complexities. But then whilst you now complete an instruction every clock, each instruction still has a two or more clocks latency.

Pipelines are not mandatory - you can craft an MCU that does one opcode per clock, but the MHz it can run at, is made higher by using a pipeline.
- because branches are somewhat rare, most chip designers choose the higher straight line speed.
Some MCUs have a skip opcode, which works with the 'straight lines are faster' idea.

Ariba · 2018-01-04 02:51

David Betz wrote: »

KeithE wrote: »

There’s a guy “Gnarly Grey” selling UPDuino iCE40UP5K boards for $8. Are there many Boards that pair a microcontroller with a low cost FPGA? Parallax could probably make a nice educational product in this space if they thought it was worth it.

Does the FPGA have enough space for a P1v COG?

I can fit 2 cogs into the UPDuino if I remove the video generators.
But it was not fully working. The Propeller got identified by the PropTool, but downloading code failed with a RAM checksum error.
Maybe I've done something wrong with the reduction to two cogs, or with the SPRAM blocks.
The performance was so poor that I have not invested the time to find the error (about 24 MHz sysclock = 6 MIPS per cog). Two cogs are not enough and you have no free space to implement useful additions.

Maybe with a single cog you can make it faster, up to 1 cycle per instruction with pipelining, but you can buy an 8-cog Propeller 1 for the same price as the UltraPlus FPGA. So it's not worth the effort.

Remember the Ultra in UltraPlus stands for ultra low power, and therefore these FPGAs are quite slow. The ECP5 should give us much better performance and the 12k type is even cheaper.
I also wait for the Flea-Ohm.

Andy
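
For reference, the 24 MHz = 6 MIPS figure above follows from the P1 cog taking 4 system clocks for most instructions. A minimal sketch of that arithmetic; the function name `mips_per_cog` is made up for illustration, and hub instructions and waits, which take longer, are ignored.

```python
P1_CLOCKS_PER_INSTRUCTION = 4  # most P1 cog instructions take 4 system clocks


def mips_per_cog(sysclock_mhz, clocks_per_instruction=P1_CLOCKS_PER_INSTRUCTION):
    """Instruction throughput of one cog in MIPS for a given system clock."""
    return sysclock_mhz / clocks_per_instruction


print(mips_per_cog(24))      # P1V on the UPDuino at 24 MHz
print(mips_per_cog(80))      # a real P1 at the usual 80 MHz
print(mips_per_cog(24, 1))   # hypothetical 1-clock pipelined cog at 24 MHz
```

Even a fully pipelined single cog at 24 MHz would only be slightly ahead of one cog of a real P1 at 80 MHz.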

Heater. · 2018-01-04 03:00

My understanding of processor architecture is almost nil.

Somehow you have to fetch an instruction, fetch a couple of operands, do the work and write the result.

I'd want a clock to do each one of those, that's 5. Or perhaps 4 if we can combine a couple of things.

If we have multi-port RAM I can imagine getting that down to 2 or 3. As we can now fetch the operands in one go or one instruction ahead whilst working on the current one. But that is already the beginnings of a pipeline.

Or if we have a load/store architecture and all other ops are done between registers. Then typical instructions, like ADD, don't have to fetch their operands.

How do we get down to 1 clock per instruction?

How do we get down to 1 clock per instruction in a COG style architecture?

Whilst we are thinking of these deep things, how many clocks does the P2 take per instruction?

Ariba · 2018-01-04 03:02

cgracey wrote: »

If the P2 could replace FPGAs for some types of apps where there was flexibility in redefining the problem, somewhat, to accommodate a P2 solution, that would be nice to demonstrate. Requiring the customer to deal with an FPGA tool chain, though, doesn't seem so nice. We need to show rapid application development with just the P2. I know that doesn't fit every scenario, but it could fit a lot of scenarios. Those are where the P2 would really shine.

The big advantage of the P2 is the analog function per pin. Some FPGAs have ADCs built in, but only on a few pins and I've not seen one with DACs. I think there is also no other microcontroller with many DACs.
So the P2 will shine in analog applications, but also the many possible USB ports are unique.

Modern FPGAs have fast SERDES with 2..10 GHz, which is needed for all the new standards, like PCIe, HDMI, MIPI, USB-C ...

Andy

jmg · 2018-01-04 03:04

Ariba wrote: »

I can fit 2 cogs into the UPDuino if I remove the video generators.
But it was not fully working. The Propeller got identified by the PropTool, but downloading code failed with a RAM checksum error.
Maybe I've done something wrong with the reduction to two cogs, or with the SPRAM blocks.
The performance was so poor that I have not invested the time to find the error (about 24 MHz sysclock = 6 MIPS per cog). Two cogs are not enough and you have no free space to implement useful additions.
Remember the Ultra in UltraPlus stands for ultra low power, and therefore these FPGAs are quite slow.

What MHz can a 32b counter or adder run at, in iCE40UP ? - data mentions 40MHz for 64b, 100MHz for 16b, but nothing for 32b.

Looks like they chose that 48MHz on chip osc for good reason.

Ariba wrote: »

Maybe with a single cog you can make it faster, up to 1 cycle per instruction with pipelining, but you can buy an 8-cog Propeller 1 for the same price as the UltraPlus FPGA. So it's not worth the effort.

On a Dollars-per-COG basis, FPGA are not near competing yet, but you can always use a P1 and a FPGA.

Dave Hein · 2018-01-04 03:07

The current P2 design uses a 2-port RAM for the cog memory. This allows instructions to be executed in 2 cycles. The first cycle fetches the instruction and writes the result of a previous instruction. The second cycle reads the two operands.

The P2-Hot used a 4-port RAM, and could execute one instruction per cycle.
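
A back-of-envelope way to see where those numbers come from, assuming each instruction needs four cog-RAM accesses (instruction fetch, two operand reads, one result write) and that each RAM port handles one access per clock. This is only a throughput bound under those assumptions, not a model of the actual P2 pipeline; `min_cycles_per_instruction` is a made-up name.

```python
import math

# Assumed cog-RAM accesses per instruction:
# fetch instruction, read D, read S, write result.
ACCESSES_PER_INSTRUCTION = 4


def min_cycles_per_instruction(ports):
    """Lower bound on clocks per instruction if each port does one access per clock."""
    return math.ceil(ACCESSES_PER_INSTRUCTION / ports)


for ports in (1, 2, 4):
    print(ports, "port(s):", min_cycles_per_instruction(ports), "clock(s) per instruction")
```

With 2 ports the bound is 2 clocks and with 4 ports it is 1 clock, which lines up with the figures quoted for the current P2 and the P2-Hot.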

Heater. · 2018-01-04 03:12

jmg,

On a Dollars-per-COG basis, FPGA are not near competing yet...

True.

But consider. If you have a super cheap Lattice FPGA you don't need all those COGS.

A typical P1 application has one COG running the "business logic", as the kids like to say today, and the other COGs are used for devices, UART, SPI, I2C, whatever bit banging.

On a cheap FPGA that translates into a processor core in HDL. All those interfaces become HDL modules around it and are much smaller and faster than full up COGs.

Cost wise the FPGA is starting to look really good nowadays. Especially as we now have Open Source and easy to use tools to develop HDL for them.

cgracey · 2018-01-04 03:13

To get those three read ports and one write port, P2-Hot used THREE instances of 512x32 dual-port RAM per cog.

The current P2 uses just one instance of 512x32 dual-port RAM per cog.

Heater. · 2018-01-04 03:23

Ah, thanks Chip.

Now I understand the "hot" part

So how many clocks per instruction is the P2 down to ?

cgracey · 2018-01-04 03:32

I think there are a lot of sensing and control problems which have easily-conceivable/tedious-to-implement FPGA solutions, but by reorganizing those problems somewhat, they could be solved by writing assembly code for the P2, with less initial effort and easier modification. We have the needed timing determinism, but not always the 1-clock latency of pure logic. Being able to stay in software and quickly iterate changes is a huge benefit compared to the tedium and delays of writing Verilog, compiling, and loading an FPGA. We're talking one second without leaving the keyboard, as opposed to minutes of monkey motion.

cgracey · 2018-01-04 03:33

Heater. wrote: »

Ah, thanks Chip.

Now I understand the "hot" part

So how many clocks per instruction is the P2 down to ?

Two clocks per instruction.

Ariba · 2018-01-04 03:40

Dave Hein wrote: »

The current P2 design uses a 2-port RAM for the cog memory. This allows instructions to be executed in 2 cycles. The first cycle fetches the instruction and writes the result of a previous instruction. The second cycle reads the two operands.

The P2-Hot used a 4-port RAM, and could execute one instruction per cycle.

The ICE40 FPGAs have only Blockrams with one read and one write port. But you can get a 2 or 3 port read if you just use a separate RAM for every read and do the write to all blocks in parallel.
For the Propeller a read for Instruction, SourceReg and DestReg is needed, so 3 times a 512x32 RAM. Then you can read all in the same cycle, and write can be done independently.
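
A small behavioural sketch of that replication trick, in Python rather than HDL: every write goes to all copies, and each copy serves one read port. The class name `ReplicatedRegisterFile` and its methods are hypothetical, and this says nothing about timing on the real block RAMs.

```python
class ReplicatedRegisterFile:
    """One write port, N read ports, built from N identical 1R/1W RAM copies."""

    def __init__(self, depth=512, read_ports=3):
        # One copy of the 512x32 RAM per read port.
        self.banks = [[0] * depth for _ in range(read_ports)]

    def write(self, addr, value):
        # The single write is broadcast to every copy so they stay identical.
        for bank in self.banks:
            bank[addr] = value & 0xFFFFFFFF

    def read_all(self, addrs):
        # Each read port uses its own copy, so all reads can happen together.
        if len(addrs) != len(self.banks):
            raise ValueError("need exactly one address per read port")
        return [bank[addr] for bank, addr in zip(self.banks, addrs)]


rf = ReplicatedRegisterFile()
rf.write(0x010, 0x11111111)   # pretend instruction word
rf.write(0x020, 0x22222222)   # pretend source register
rf.write(0x030, 0x33333333)   # pretend destination register
instr, src, dst = rf.read_all([0x010, 0x020, 0x030])
```

The cost is memory: three read ports means three times the block RAM for the same 512 longs.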

jmg:

I don't know if the UP5k chips on the UPDuino are bad pre-versions (and therefore so cheap), but the timing models of the IDE do not fit the chips. The max frequency is about 2/3 of what the tool reports. I almost despaired until I found out, because it worked in parts but not fully, and if I changed what did not work, something else stopped working - nothing that could be explained logically.

Anyway after lowering the frequency from 48 to 24 MHz my tests worked okay, and in the meantime 36 MHz is my preferred setting (with the PLL).

The architecture of these FPGAs seems not to be made for 32bit processors, they slow down if you do adders and muxes over about 24 bits. I can run my own 16bit processor with 45 MHz, while the same with 32 bit runs only at about 25 MHz.
Also the multipliers are 16x16bit, but there is a 32bit adder in the DSP blocks, which I have not tried so far.
An ALU consists of several blocks in a row (input multiplexer, adder, and a result multiplexer, for example) and all the delays add together. This seems to be worse with 32 bits.

Andy

kwinn · 2018-01-04 04:04

Ariba wrote: »

...

The architecture of these FPGAs seems not to be made for 32bit processors, they slow down if you do adders and muxes over about 24 bits. I can run my own 16bit processor with 45 MHz, while the same with 32 bit runs only at about 25 MHz.
Also the multipliers are 16x16bit, but there is a 32bit adder in the DSP blocks, which I have not tried so far.
An ALU consists of several blocks in a row (input multiplexer, adder, and a result multiplexer, for example) and all the delays add together. This seems to be worse with 32 bits.

Andy

The more bits the longer it takes for the carry to propagate from the lsb to the msb and overflow flag.
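
To illustrate the carry point, here is a plain ripple-carry adder in Python that also reports the longest run of bit positions a carry travels through. It is only a toy model assuming one full adder per bit with no carry lookahead; the function name `ripple_add` is made up, and real FPGA carry chains are much faster per bit than general LUT logic.

```python
def ripple_add(a, b, width):
    """Add two width-bit numbers bit by bit.

    Returns (sum, carry_out, longest_carry_chain), where the chain length is
    the longest run of consecutive bit positions that produced a carry.
    """
    carry = 0
    result = 0
    chain = 0
    longest = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        carry_out = (x & y) | (carry & (x ^ y))
        if carry_out and (x ^ y):
            chain += 1      # incoming carry was passed along
        elif carry_out:
            chain = 1       # a new carry was generated at this bit
        else:
            chain = 0       # the carry stopped here
        longest = max(longest, chain)
        carry = carry_out
    return result, carry, longest


print(ripple_add(0xFFFF, 1, 16))       # worst case for 16 bits
print(ripple_add(0xFFFFFFFF, 1, 32))   # worst case for 32 bits
```

In the worst case the carry has to pass through every bit, so the critical path grows with the word width.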

Ariba · 2018-01-04 04:44

Yes, but there are special carry chains between the LUTs, which should make the carry logic fast.
If I go from 16 to 24bit I don't see the same proportional reduction of the frequency.
It may also have to do with the routing resources. You can find quite detailed information about these FPGAs in the icestorm project, and there they say the longest routing wires can only span 12 LUTs, then you need to sacrifice some LUTs to go wider.

The UltraPlus 5k is now also supported by the free FPGA toolchain (experimental). And that's a huge advantage. You can do a lot with the free tools that gets rejected by the IceCube software. I have the feeling Lattice fixes some hardware bugs just by setting some rules into the router; I got a routing error, without further explanation, when I wanted to feed the PLL input clock at a certain pin. And they tested only some DSP settings and just do not allow all the other possible settings. If you try them with the free tools they seem to work okay.

The person that added UP5k to icestorm, also found a secret oscillator tune register.

jmg · 2018-01-04 05:00

Ariba wrote: »

...
The UltraPlus 5k is now also supported by the free FPGA toolchain (experimental). And that's a huge advantage. You can do a lot with the free tools that gets rejected by the IceCube software. I have the feeling Lattice fixes some hardware bugs just by setting some rules into the router; I got a routing error, without further explanation, when I wanted to feed the PLL input clock at a certain pin. And they tested only some DSP settings and just do not allow all the other possible settings. If you try them with the free tools they seem to work okay.

That all sounds good.

Ariba wrote: »

The person that added UP5k to icestorm, also found a secret oscillator tune register.

Any links ? - What granularity did that give ? - Is this a bitstream / boot time register, or a run time live trim that user code can access ?

Ariba · 2018-01-04 05:06

jmg wrote: »

....
Any links ? - What granularity did that give ? - Is this a bitstream / boot time register, or a run time live trim that user code can access ?

Here is the link.

thej · 2018-01-04 05:33

cgracey wrote: »

To get those three read ports and one write port, P2-Hot used THREE instances of 512x32 dual-port RAM per cog.

The current P2 uses just one instance of 512x32 dual-port RAM per cog.

Did the P2-Hot use any pipelining?
J
