Shop OBEX P1 Docs P2 Docs Learn Events
Wish list for general P2 FFT object? — Parallax Forums

Wish list for general P2 FFT object?

I'm in the throes of designing a standalone FFT library/object for the P2 using 32bit cordic operations.

It strikes me it would be good to get some input about what people might want from such a thing.

Currently I have a core routine that expects 32bit+32bit fixpoint real or complex input values,
and generates the forward FFT in-place from this. Optional post-processing to calculate magnitude+phase
or magnitude squared (actually twice magnitude squared which is what is needed for spectral measurement)
It has been tested with N upto 2^15 (above that and P2 runs out of hub ram!)

What I'm thinking of doing is extending the input options so that the input vector need not be shared
with the work vector, can be just real values (4 bytes per element rather than 8 for complex), or 16 bit
real values. The input fixed-point representation would be specifiable, as would the working vector
fixed point (probably limited between fix2.30 and fix16.16).

output options could include logarithmic (dB) as well as raw, magnitude/phase and mag squared.
the option to divide the output by N (implemented by shifting right one on each round) is already their,
there needs to be an inverse operation of course, with option to synthesize negative frequency components
from the non-negative automatically.

The execution model is a cog that waits for command to be written to its command address, which would
need a locking convention for multi-threaded use, but it could just be fired up and killed after a single use too.
Its what I'm used to on the P1, but is that the best approach?

My use for this is spectrum analysis, but I can see others wanting a workhorse for implementing filtering
or fast convolution/fast correlation. My use case is repeatedly throwing data at it and averaging the
results, with double buffering so data acquisition doesn't have to stall for it.

My current performance figures are, for a 4096 point FFT, 37.1ms (about 110k SPS). The inner core
is a butterfly computation routine using 4 cordic multiplies and 1 cordic rotate in parallel.

I have to praise the design of the cordic unit, it does everything needed for signal processing code,
I think I've used every operation now. :)

__Mark

Comments

  • Couple of videos

  • Figured out I was being a bit dim with the FFT inner loop and its down to one cordic rotate setup the previous loop,
    about twice as fast now, 210kSPS at 2048 points, 170kSPS at 16384 points, 150dB or so of dynamic range with
    fix5.27 internal representation.
  • Mark_TMark_T Posts: 1,981
    edited 2019-02-16 20:34
    Managed to cobble together a crude VGA interface which is remarkably flicker-free considering I'm not
    synchronizing with Vsync on refreshing the frame buffer (guess P2 is plenty fast for this)


  • evanhevanh Posts: 16,068
    Cool. Is that a grid on there too?
  • Cluso99Cluso99 Posts: 18,069
    well done!!!
  • All the VGA stuff is a crude hack at the moment, but I'm adding a font and hope to get something looking more
    like a proper FFT analyzer eventually, with peak measurements, options for TFT as well as VGA display, control
    interface for modes and sampling rates, external ADC interface (there's a nice 18 bit 400kSPS SAR chip, the ADS8885,
    which I've got my hands on, although its apparently EOL'd).

    Current module break-down is

    Acquisition - built-in sigma/delta, external I2S, external SPI options
    Display - DAC to 'scope, VGA, TFT options
    FFT
    sampling window processing (I'll probably throw this in with the FFT module as they are strongly related)
    top level driver

    and on the horizon:

    peak finding + measurements
    control module (accepts mode changes and other commands) - initially serial

    BTW I was partly inspired by this paper:
    http://holometer.fnal.gov/GH_FFT.pdf
  • Ahle2Ahle2 Posts: 1,179
    Great work Mark! Yes the cordic solver is such a fantastic addition to an already cool MCU! I just love the P2, the HW features and the instruction set. I use the rotation function at a throughput rate of 8 cycles per operation per cog in an upcoming applicition. Each rotation is 4 multiplications, 2 additions and 2 sine lookups if done without the cordic. Amazing really!
  • Mark_TMark_T Posts: 1,981
    edited 2019-02-20 13:40
    I'm adding the feature to do real input FFT packed into half-size complex array which
    should (once I have it working) give a nearly 2x speedup, although complicates the various
    options - hope to have some testsuite code for most of the features and release it as an
    object within a week or so.

    The method of fixing up an N/2 point complex FFT to perform an N-point real input FFT seems
    rather confusingly documented in various sources I found on the web, several articles with
    errors or crucial details omitted! In the end its a matter of summing and differencing
    and QROTATE calls.

    And with a decent VGA monitor and green graphics, its starting to look like the business (calibration
    test data spectrum with 30dB between each peak, showing off the dynamic range of fix5.27 calculations and
    the limitations of the 114dB side lobe flat-top window used - the noise floor is mainly due to CORDIC unit
    I think)
    P1010299.JPG
  • Cluso99Cluso99 Posts: 18,069
    Nice work :smiley:
  • Update, after two pretty unproductive evenings of bug-hunting finally got real FFT (packed into halfsize complex FFT)
    working, meaning a speed up to 325kSPS from 210kSPS.
    (So now 6.3ms per 2048-point FFT with a single cog - haven't thought about parallization yet)

    Not quite the doubling in speed I'd hoped, but I did shake out a few bugs that only affected the phase of the
    FFT result which were just breaking the real FFT completely in baffling manner.

    Next stage code and options rationalization/tidy up...
  • Mark_TMark_T Posts: 1,981
    edited 2019-02-23 20:18
    Right, time to post some code, I've done some speedups by using block reads and writes wherever possible and
    unrolling some loop bodies substantially - using LUTram now for more codespace. as a consequence.

    There's a tiny testsuite in spin, and hopefully enough documentation at the head of the file to explain
    how to play with it. Not all the modes are tested, but real and complex inputs with stride 8 have tests.

    Current raw FFT speed for real input on 160MHz P2 with 2048 point FFTs is about 745kSPS, and for
    complex input, 475kSPS. With real input and output squared-magnitude mode the speed is 655kSPS.

    As a for-instance, reading 2048 sample blocks at 48kSPS from analog smartpin, which takes 42.6ms to
    acquire, each block can be processed in 3.1ms, about 13 times real-time, leaving lots of slack for overlap/add
    style processing (although there would have to be some more copying overhead in this scenario).

    [ Its interesting to spot the various uses of new P2 instructions like incmod, subr, decod, although I'm sure
    there's a few I've not used but would be ideal, there are so many new instructions ]
    [ And I've just noticed I've achieved 6 times the initial performance figures, looking back at my first post ]
  • I've setup a github repo for this, hopefully to encourage me to finish preparing a release version
    of the FFT code and testsuite, and the other code I have for spectral analysis, https://github.com/MarkTillotson/Prop2_FFT

    And I've added text/numeric labels and info to the display code. Screenshot below. I've been playing with
    an 18 bit SAR ADC chip, the ADS8885, and it seems to perform very nicely, can really show up the noise
    floor of the smartpin DACs. This isn't a sigma-delta chip, its specified to be able to settle to full 18 bit
    accuracy in single sample time, and goes to 400kSPS... Not cheap, but useful for my ultimate aim of a Prop2
    precision FFT analyzer. been side-tracked designing a board with two ADS8885's and a voltage reference
    as per the chip's datasheet recommendations.
    P1010337.JPG
  • Cluso99Cluso99 Posts: 18,069
    Very nice indeed :smiley:
  • evanhevanh Posts: 16,068
    edited 2019-03-06 02:20
    I've always thought a log frequency scale would look cool but I have no idea if it has any use beyond looks. I guess it would be a burden to calculate.
  • What op amp are you using to drive the ADC Mark?

    Looks like a nice ADC.
  • No opamp, just a capacitor and resistor dividers to hold the inputs mid-rail.

    Log freq scale means using a bigger FFT to avoid large gaps at the low frequency end, but is certainly doable as
    part of the display loop that draws the trace.

    My latest tweak was to combine sync serial output to one colour while streaming another, allows the graticule
    to be output in a different colour to the trace - previously each screen area was streamed to a different VGA
    colour pin. One bit per pixel for 1280x768 is 123kB, I would like to reduce that.

    A limitation I've hit is that synchronous serial TX mode doesn't work without an enabled clock pin - for
    video generation I don't need that pin (although for some TFT panels it is needed anyway).
  • Github repo for my FFT library and now the ancilliary code I use to display live power spectra on VG
    or 'scope

    I've added a few more tests on the raw FFT library, which now also supports spectral windows (needs
    tests still for that, also needs inverse FFT and logarithmic conversion adding).

    https://github.com/MarkTillotson/Prop2_FFT

    Included are 3 libraries for acquisition sharing a common interface, and a very hacked around test signal
    generator driving a smart pin in DAC mode. I currently connect that to an ADC smartpin or my ADS8885
    breakout on a little satellite breadboard.

    The VGA 1280x768 resolution allows a 1025x513 pixel area for plotting spectrum and is 1-bit-per-pixel,
    and different coloured vertical zones using streamer. The graph area has hardcoded graticule generated
    on the fly on red VGA signal using synx TX mode while the streamer drives green pin from the bit map.

    Like this:
    P1010338.JPG
  • Cluso99Cluso99 Posts: 18,069
    Nice :smiley:

    I like the idea for the graticule overlay.
  • evanhevanh Posts: 16,068
    edited 2019-03-11 03:28
    Mark,
    Async serial uses internal pacing. This would eliminate the external serial clock.

    The smallest pulse, start-data-stop, that could be output this way is two serial clocks long. Which should work if the serial clock were double speed of the pixel clock.


    EDIT: Oops, it wouldn't work for long lines.
  • evanhevanh Posts: 16,068
    Another option: %00100 = pulse/cycle output. This one allows the construction of any mark-space ratio and specify the number of pulses. It could do a whole scan line in one command I think.
  • Async isn't usable due to start/stop bits as far as I can see.
    %00100 = pulse/cycle output
    That's a good idea for the graticule, although I was also thinking about ways to do run-length-encoded
    graphics too, in order to expand to more colours.
  • evanhevanh Posts: 16,068
    Oh, on third thoughts, with that same dynamic pixel scaling by adjusting the time base, async serial can do a single start-data-stop to handle a solid red over the whole scan line. So it could be used for more complex patterns than the pulse output mode.

    In the simplest form, think of the start and stop bits as the two ends of a horizontal line. If there was no data between start and stop then the visible line length would be one serial clock. But I think the minimum data length is one bit, so the minimum line length is two serial clocks long. And to make that the same as one pixel would need doubled serial clock. You'd also need to use inverted output in the custom pin config because start bit is normally low and stop bit is normally high.
  • evanhevanh Posts: 16,068
    To detail that a little more: The 1025 pixel horizontal line would work out as 25 serial bits at a shift rate of 41 pixels per bit-shift. One bit is the start-stop, leaving 24 data bits.

    Pixel patterns can be made from a maximum of 33 serial bits, and minimum 2.

    Not sure what happens if the serial shifter time base is adjusted mid way.
Sign In or Register to comment.