
Math coprocessors?

John A. Zoidberg Posts: 514
edited 2010-10-21 15:46 in Propeller 1
Hey there,

Has anyone tried to use a dsPIC or any other 16-bit microcontroller as a coprocessor for the Propeller, to get quicker floating point math operations? I'm wondering about interfacing the dsPIC to the Propeller through SPI. :)

Comments

  • Mike Green Posts: 23,101
    edited 2010-10-20 19:08
    Floating point on the Propeller is pretty fast (~40us per operation). It's hard to run an external floating point coprocessor much faster, particularly over a serial link like I2C or SPI.
  • John A. Zoidberg Posts: 514
    edited 2010-10-20 19:11
    I see. Well, would it work just as well if I dedicated all the floating point operations to another cog? :)
  • hinv Posts: 1,255
    edited 2010-10-20 22:58
    I speak without experience, but I have read the documentation for the Floating Point library. It covers a few different scenarios, one of which is dedicating a cog. Check it out here: http://obex.parallax.com/objects/202/
    That's one of the cool things about the Propeller: somebody's already done most of the work for you. You just have to glue the stuff together with some Spin.
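    For example, a minimal sketch of that dedicated-cog approach (assuming the Float32 object from that library and its start/FFloat/FAdd methods) might look like this in Spin:

      OBJ
        f32 : "Float32"                   ' PASM floating point engine in its own cog

      PUB Demo | a, b, c
        f32.start                         ' launch the dedicated floating point cog
        a := f32.FFloat(3)                ' 3.0 as an IEEE-754 single
        b := 1.5                          ' Spin float literal, stored as IEEE-754 bits
        c := f32.FAdd(a, b)               ' c = 4.5, computed by the other cog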

    I hope this helps.

    Doug
  • max72 Posts: 1,155
    edited 2010-10-21 01:21
    You can also check
    http://forums.parallax.com/showthread.php?t=125498&highlight=float
    Or SpinLMM
    http://obex.parallax.com/objects/635/

    The first is an optimized float object, while SpinLMM includes a fast embedded float routine as an example, with the basic functions. It only works under BST, but if you are at ease with PASM you can try it and add functions. The advantage is that it doesn't require additional cogs.

    Massimo
  • Humanoido Posts: 5,770
    edited 2010-10-21 03:06
    Mike has a good point. It's surprising how fast floating point can work on the Propeller. If you're looking for speed, the manual is very detailed regarding assembler and Spin code. You can pick and choose to optimize the speed-critical statements.

    Using Floating Point in Assembler Code
    Assembler code provides significantly faster execution speed. The following shows a quick comparison for a floating point add:

    FloatMath FAdd (Spin)      371.0 usec
    Float32 FAdd (Spin)         39.0 usec
    _FAdd (Assembler)            4.7 usec

    Each statement has its own execution time.
  • Ale Posts: 2,363
    edited 2010-10-21 05:31
    The problem with an external co-processor is the transfer of the arguments. Even with single-cycle execution of the FP operation at, say, 10 MHz, you still have to transfer the arguments, i.e. 64 bits (plus the instruction), 1 bit at a time over SPI/I2C. With a 5 MHz SPI clock you still have to wait 12.8 us just to transfer those bits...
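    As a rough sanity check on that figure, here is a hypothetical Spin constant block (names are illustrative only) working out the same arithmetic:

      CON
        SPI_HZ   = 5_000_000                            ' assumed SPI bit clock
        ARG_BITS = 64                                   ' two 32-bit IEEE-754 arguments
        XFER_NS  = ARG_BITS * (1_000_000_000 / SPI_HZ)  ' 64 * 200 ns = 12_800 ns = 12.8 us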

    It only makes sense if you cannot get the speed you want in assembler, or you need more precision or expensive operations like transcendentals, where a fast implementation will compensate for the slower transfer. I'd recommend using 8-bit transfers to recover some of that dead time.
  • Peter Jakacki Posts: 10,193
    edited 2010-10-21 05:49
    Just thinking about that other thread about I2C peripheral micros, or PPCs as I call them: I have been advocating a small 32-bit micro, the LPC111x. When you think about it, the chip does a 32-bit integer multiply in 20ns, so floating point should be quite fast compared to the Prop. The only limiting factor, as Mike pointed out, is the link speed.

    I have been using a 32-bits-at-a-time serial transfer that runs at the clock frequency, so a full 32 bits can be transferred in under 2us. However, the UART on the LPC111x only runs up to 3.125Mbps. Hmmmm, but there is the enhanced SPI, which is a fully functional Synchronous Serial Port capable of 16-bit transfers at 25Mbits. Using a modified serial transfer format from the Prop, we can essentially communicate synchronously between the two chips. So assuming we can offload a lot of FP ops to this chip and transfer back and forth quickly enough, this may produce results many times faster than a cog doing the FP.

    This is just a little bit of musing, but it's food for thought and you know that this is only a $1 chip and more can be had.
  • lonesock Posts: 917
    edited 2010-10-21 09:58
    So in the F32 code linked in Massimo's post, the floating point multiply takes about 1000 clocks (~12.5us on an 80MHz system) *if* called from assembler. The portion of the code that actually performs the multiplication takes only 40% of that; the rest is the boring but necessary unpacking of the input parameters and the subsequent packing of the result. So even having a single-cycle integer multiply would only speed up the routine by about 40%.

    Calling any F32 commands from Spin has quite a bit more overhead. Of course, in the extremely unlikely case that you only need one cog for the control code, you could run 7 instances of the F32 code, so your average multiply time would be under 2us [8^)
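    A minimal sketch of that seven-instance idea (assuming the F32 object's start method, and using Spin's object arrays) could look like:

      OBJ
        f[7] : "F32"                      ' seven instances, each will claim its own cog

      PUB LaunchAll | i
        repeat i from 0 to 6
          f[i].start                      ' start one floating point engine per instance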

    Jonathan
  • Tracy Allen Posts: 6,666
    edited 2010-10-21 14:47
    I'm thinking of the Micromega uM-FPU V3.1 coprocessor that Parallax sells. (It was designed by Cam Thomson, who wrote the PASM floating point library for the Prop.) As a coprocessor it has a lot of extra nice features, including its own toolset and serial I/O with NMEA parsing. The transfer overhead is less of an issue when the FPU has to perform a complex chain of calculations involving few variables but many parameters and functions.
  • Peter Jakacki Posts: 10,193
    edited 2010-10-21 15:41
    Was that 1,000 clocks or 1,000 instructions? Remember that a cog executes at 1/4 of the clock rate, so each instruction takes 50ns (at 80MHz), which is 2.5 times slower than even the $1 ARM's cycle time of 20ns.

    lonesock wrote: »
    So in the F32 code linked in Massimo's post, the floating point multiply takes about 1000 clocks (~12.5us on an 80MHz system) *if* called from assembler. The portion of the code that actually performs the multiplication takes only 40% of that; the rest is the boring but necessary unpacking of the input parameters and the subsequent packing of the result. So even having a single-cycle integer multiply would only speed up the routine by about 40%.

    Calling any F32 commands from Spin has quite a bit more overhead. Of course, in the extremely unlikely case that you only need one cog for the control code, you could run 7 instances of the F32 code, so your average multiply time would be under 2us [8^)

    Jonathan
  • lonesock Posts: 917
    edited 2010-10-21 15:46
    Clocks. Yep, got it, just providing a reference point on the F32 code.

    Jonathan