Welcome to the Parallax Discussion Forums, sign-up to participate.

Mark_T
Posts: **1,981**

in Propeller 2

I'm in the throes of designing a standalone FFT library/object for the P2 using 32bit cordic operations.

It strikes me it would be good to get some input about what people might want from such a thing.

Currently I have a core routine that expects 32bit+32bit fixpoint real or complex input values,

and generates the forward FFT in-place from this. Optional post-processing to calculate magnitude+phase

or magnitude squared (actually twice magnitude squared which is what is needed for spectral measurement)

It has been tested with N upto 2^15 (above that and P2 runs out of hub ram!)

What I'm thinking of doing is extending the input options so that the input vector need not be shared

with the work vector, can be just real values (4 bytes per element rather than 8 for complex), or 16 bit

real values. The input fixed-point representation would be specifiable, as would the working vector

fixed point (probably limited between fix2.30 and fix16.16).

output options could include logarithmic (dB) as well as raw, magnitude/phase and mag squared.

the option to divide the output by N (implemented by shifting right one on each round) is already their,

there needs to be an inverse operation of course, with option to synthesize negative frequency components

from the non-negative automatically.

The execution model is a cog that waits for command to be written to its command address, which would

need a locking convention for multi-threaded use, but it could just be fired up and killed after a single use too.

Its what I'm used to on the P1, but is that the best approach?

My use for this is spectrum analysis, but I can see others wanting a workhorse for implementing filtering

or fast convolution/fast correlation. My use case is repeatedly throwing data at it and averaging the

results, with double buffering so data acquisition doesn't have to stall for it.

My current performance figures are, for a 4096 point FFT, 37.1ms (about 110k SPS). The inner core

is a butterfly computation routine using 4 cordic multiplies and 1 cordic rotate in parallel.

I have to praise the design of the cordic unit, it does everything needed for signal processing code,

I think I've used every operation now.

__Mark

It strikes me it would be good to get some input about what people might want from such a thing.

Currently I have a core routine that expects 32bit+32bit fixpoint real or complex input values,

and generates the forward FFT in-place from this. Optional post-processing to calculate magnitude+phase

or magnitude squared (actually twice magnitude squared which is what is needed for spectral measurement)

It has been tested with N upto 2^15 (above that and P2 runs out of hub ram!)

What I'm thinking of doing is extending the input options so that the input vector need not be shared

with the work vector, can be just real values (4 bytes per element rather than 8 for complex), or 16 bit

real values. The input fixed-point representation would be specifiable, as would the working vector

fixed point (probably limited between fix2.30 and fix16.16).

output options could include logarithmic (dB) as well as raw, magnitude/phase and mag squared.

the option to divide the output by N (implemented by shifting right one on each round) is already their,

there needs to be an inverse operation of course, with option to synthesize negative frequency components

from the non-negative automatically.

The execution model is a cog that waits for command to be written to its command address, which would

need a locking convention for multi-threaded use, but it could just be fired up and killed after a single use too.

Its what I'm used to on the P1, but is that the best approach?

My use for this is spectrum analysis, but I can see others wanting a workhorse for implementing filtering

or fast convolution/fast correlation. My use case is repeatedly throwing data at it and averaging the

results, with double buffering so data acquisition doesn't have to stall for it.

My current performance figures are, for a 4096 point FFT, 37.1ms (about 110k SPS). The inner core

is a butterfly computation routine using 4 cordic multiplies and 1 cordic rotate in parallel.

I have to praise the design of the cordic unit, it does everything needed for signal processing code,

I think I've used every operation now.

__Mark

## Comments

1,9811,981about twice as fast now, 210kSPS at 2048 points, 170kSPS at 16384 points, 150dB or so of dynamic range with

fix5.27 internal representation.

1,981synchronizing with Vsync on refreshing the frame buffer (guess P2 is plenty fast for this)

7,262reveal a surprising finding that contradicts Danish physicist Niels Bohr's established view

—the jumps are neither abrupt nor as random as previously thought."

15,076My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBladeProp OS(also see Sphinx, PropDos, PropCmd, Spinix)Website: www.clusos.comProp Tools (Index) , Emulators (Index) , ZiCog (Z80)1,981like a proper FFT analyzer eventually, with peak measurements, options for TFT as well as VGA display, control

interface for modes and sampling rates, external ADC interface (there's a nice 18 bit 400kSPS SAR chip, the ADS8885,

which I've got my hands on, although its apparently EOL'd).

Current module break-down is

Acquisition - built-in sigma/delta, external I2S, external SPI options

Display - DAC to 'scope, VGA, TFT options

FFT

sampling window processing (I'll probably throw this in with the FFT module as they are strongly related)

top level driver

and on the horizon:

peak finding + measurements

control module (accepts mode changes and other commands) - initially serial

BTW I was partly inspired by this paper:

http://holometer.fnal.gov/GH_FFT.pdf

9521,981should (once I have it working) give a nearly 2x speedup, although complicates the various

options - hope to have some testsuite code for most of the features and release it as an

object within a week or so.

The method of fixing up an N/2 point complex FFT to perform an N-point real input FFT seems

rather confusingly documented in various sources I found on the web, several articles with

errors or crucial details omitted! In the end its a matter of summing and differencing

and QROTATE calls.

And with a decent VGA monitor and green graphics, its starting to look like the business (calibration

test data spectrum with 30dB between each peak, showing off the dynamic range of fix5.27 calculations and

the limitations of the 114dB side lobe flat-top window used - the noise floor is mainly due to CORDIC unit

I think)

15,076My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBladeProp OS(also see Sphinx, PropDos, PropCmd, Spinix)Website: www.clusos.comProp Tools (Index) , Emulators (Index) , ZiCog (Z80)1,981working, meaning a speed up to 325kSPS from 210kSPS.

(So now 6.3ms per 2048-point FFT with a single cog - haven't thought about parallization yet)

Not quite the doubling in speed I'd hoped, but I did shake out a few bugs that only affected the phase of the

FFT result which were just breaking the real FFT completely in baffling manner.

Next stage code and options rationalization/tidy up...

1,981unrolling some loop bodies substantially - using LUTram now for more codespace. as a consequence.

There's a tiny testsuite in spin, and hopefully enough documentation at the head of the file to explain

how to play with it. Not all the modes are tested, but real and complex inputs with stride 8 have tests.

Current raw FFT speed for real input on 160MHz P2 with 2048 point FFTs is about 745kSPS, and for

complex input, 475kSPS. With real input and output squared-magnitude mode the speed is 655kSPS.

As a for-instance, reading 2048 sample blocks at 48kSPS from analog smartpin, which takes 42.6ms to

acquire, each block can be processed in 3.1ms, about 13 times real-time, leaving lots of slack for overlap/add

style processing (although there would have to be some more copying overhead in this scenario).

[ Its interesting to spot the various uses of new P2 instructions like incmod, subr, decod, although I'm sure

there's a few I've not used but would be ideal, there are so many new instructions ]

[ And I've just noticed I've achieved 6 times the initial performance figures, looking back at my first post ]

1,981of the FFT code and testsuite, and the other code I have for spectral analysis, https://github.com/MarkTillotson/Prop2_FFT

And I've added text/numeric labels and info to the display code. Screenshot below. I've been playing with

an 18 bit SAR ADC chip, the ADS8885, and it seems to perform very nicely, can really show up the noise

floor of the smartpin DACs. This isn't a sigma-delta chip, its specified to be able to settle to full 18 bit

accuracy in single sample time, and goes to 400kSPS... Not cheap, but useful for my ultimate aim of a Prop2

precision FFT analyzer. been side-tracked designing a board with two ADS8885's and a voltage reference

as per the chip's datasheet recommendations.

15,076My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBladeProp OS(also see Sphinx, PropDos, PropCmd, Spinix)Website: www.clusos.comProp Tools (Index) , Emulators (Index) , ZiCog (Z80)7,262reveal a surprising finding that contradicts Danish physicist Niels Bohr's established view

—the jumps are neither abrupt nor as random as previously thought."

3,605Looks like a nice ADC.

1,981Log freq scale means using a bigger FFT to avoid large gaps at the low frequency end, but is certainly doable as

part of the display loop that draws the trace.

My latest tweak was to combine sync serial output to one colour while streaming another, allows the graticule

to be output in a different colour to the trace - previously each screen area was streamed to a different VGA

colour pin. One bit per pixel for 1280x768 is 123kB, I would like to reduce that.

A limitation I've hit is that synchronous serial TX mode doesn't work without an enabled clock pin - for

video generation I don't need that pin (although for some TFT panels it is needed anyway).

1,981or 'scope

I've added a few more tests on the raw FFT library, which now also supports spectral windows (needs

tests still for that, also needs inverse FFT and logarithmic conversion adding).

https://github.com/MarkTillotson/Prop2_FFT

Included are 3 libraries for acquisition sharing a common interface, and a very hacked around test signal

generator driving a smart pin in DAC mode. I currently connect that to an ADC smartpin or my ADS8885

breakout on a little satellite breadboard.

The VGA 1280x768 resolution allows a 1025x513 pixel area for plotting spectrum and is 1-bit-per-pixel,

and different coloured vertical zones using streamer. The graph area has hardcoded graticule generated

on the fly on red VGA signal using synx TX mode while the streamer drives green pin from the bit map.

Like this:

15,076I like the idea for the graticule overlay.

My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBladeProp OS(also see Sphinx, PropDos, PropCmd, Spinix)Website: www.clusos.comProp Tools (Index) , Emulators (Index) , ZiCog (Z80)7,262Async serial uses internal pacing. This would eliminate the external serial clock.

The smallest pulse, start-data-stop, that could be output this way is two serial clocks long. Which should work if the serial clock were double speed of the pixel clock.

EDIT: Oops, it wouldn't work for long lines.

reveal a surprising finding that contradicts Danish physicist Niels Bohr's established view

—the jumps are neither abrupt nor as random as previously thought."

7,262reveal a surprising finding that contradicts Danish physicist Niels Bohr's established view

—the jumps are neither abrupt nor as random as previously thought."

1,981That's a good idea for the graticule, although I was also thinking about ways to do run-length-encoded

graphics too, in order to expand to more colours.

7,262In the simplest form, think of the start and stop bits as the two ends of a horizontal line. If there was no data between start and stop then the visible line length would be one serial clock. But I think the minimum data length is one bit, so the minimum line length is two serial clocks long. And to make that the same as one pixel would need doubled serial clock. You'd also need to use inverted output in the custom pin config because start bit is normally low and stop bit is normally high.

reveal a surprising finding that contradicts Danish physicist Niels Bohr's established view

—the jumps are neither abrupt nor as random as previously thought."

7,262Pixel patterns can be made from a maximum of 33 serial bits, and minimum 2.

Not sure what happens if the serial shifter time base is adjusted mid way.

reveal a surprising finding that contradicts Danish physicist Niels Bohr's established view

—the jumps are neither abrupt nor as random as previously thought."