Wish list for general P2 FFT object?
Mark_T
Posts: 1,981
in Propeller 2
I'm in the throes of designing a standalone FFT library/object for the P2 using 32bit cordic operations.
It strikes me it would be good to get some input about what people might want from such a thing.
Currently I have a core routine that expects 32bit+32bit fixpoint real or complex input values,
and generates the forward FFT in-place from this. Optional post-processing to calculate magnitude+phase
or magnitude squared (actually twice magnitude squared which is what is needed for spectral measurement)
It has been tested with N upto 2^15 (above that and P2 runs out of hub ram!)
What I'm thinking of doing is extending the input options so that the input vector need not be shared
with the work vector, can be just real values (4 bytes per element rather than 8 for complex), or 16 bit
real values. The input fixed-point representation would be specifiable, as would the working vector
fixed point (probably limited between fix2.30 and fix16.16).
output options could include logarithmic (dB) as well as raw, magnitude/phase and mag squared.
the option to divide the output by N (implemented by shifting right one on each round) is already their,
there needs to be an inverse operation of course, with option to synthesize negative frequency components
from the non-negative automatically.
The execution model is a cog that waits for command to be written to its command address, which would
need a locking convention for multi-threaded use, but it could just be fired up and killed after a single use too.
Its what I'm used to on the P1, but is that the best approach?
My use for this is spectrum analysis, but I can see others wanting a workhorse for implementing filtering
or fast convolution/fast correlation. My use case is repeatedly throwing data at it and averaging the
results, with double buffering so data acquisition doesn't have to stall for it.
My current performance figures are, for a 4096 point FFT, 37.1ms (about 110k SPS). The inner core
is a butterfly computation routine using 4 cordic multiplies and 1 cordic rotate in parallel.
I have to praise the design of the cordic unit, it does everything needed for signal processing code,
I think I've used every operation now.
__Mark
It strikes me it would be good to get some input about what people might want from such a thing.
Currently I have a core routine that expects 32bit+32bit fixpoint real or complex input values,
and generates the forward FFT in-place from this. Optional post-processing to calculate magnitude+phase
or magnitude squared (actually twice magnitude squared which is what is needed for spectral measurement)
It has been tested with N upto 2^15 (above that and P2 runs out of hub ram!)
What I'm thinking of doing is extending the input options so that the input vector need not be shared
with the work vector, can be just real values (4 bytes per element rather than 8 for complex), or 16 bit
real values. The input fixed-point representation would be specifiable, as would the working vector
fixed point (probably limited between fix2.30 and fix16.16).
output options could include logarithmic (dB) as well as raw, magnitude/phase and mag squared.
the option to divide the output by N (implemented by shifting right one on each round) is already their,
there needs to be an inverse operation of course, with option to synthesize negative frequency components
from the non-negative automatically.
The execution model is a cog that waits for command to be written to its command address, which would
need a locking convention for multi-threaded use, but it could just be fired up and killed after a single use too.
Its what I'm used to on the P1, but is that the best approach?
My use for this is spectrum analysis, but I can see others wanting a workhorse for implementing filtering
or fast convolution/fast correlation. My use case is repeatedly throwing data at it and averaging the
results, with double buffering so data acquisition doesn't have to stall for it.
My current performance figures are, for a 4096 point FFT, 37.1ms (about 110k SPS). The inner core
is a butterfly computation routine using 4 cordic multiplies and 1 cordic rotate in parallel.
I have to praise the design of the cordic unit, it does everything needed for signal processing code,
I think I've used every operation now.
__Mark
Comments
about twice as fast now, 210kSPS at 2048 points, 170kSPS at 16384 points, 150dB or so of dynamic range with
fix5.27 internal representation.
synchronizing with Vsync on refreshing the frame buffer (guess P2 is plenty fast for this)
like a proper FFT analyzer eventually, with peak measurements, options for TFT as well as VGA display, control
interface for modes and sampling rates, external ADC interface (there's a nice 18 bit 400kSPS SAR chip, the ADS8885,
which I've got my hands on, although its apparently EOL'd).
Current module break-down is
Acquisition - built-in sigma/delta, external I2S, external SPI options
Display - DAC to 'scope, VGA, TFT options
FFT
sampling window processing (I'll probably throw this in with the FFT module as they are strongly related)
top level driver
and on the horizon:
peak finding + measurements
control module (accepts mode changes and other commands) - initially serial
BTW I was partly inspired by this paper:
http://holometer.fnal.gov/GH_FFT.pdf
should (once I have it working) give a nearly 2x speedup, although complicates the various
options - hope to have some testsuite code for most of the features and release it as an
object within a week or so.
The method of fixing up an N/2 point complex FFT to perform an N-point real input FFT seems
rather confusingly documented in various sources I found on the web, several articles with
errors or crucial details omitted! In the end its a matter of summing and differencing
and QROTATE calls.
And with a decent VGA monitor and green graphics, its starting to look like the business (calibration
test data spectrum with 30dB between each peak, showing off the dynamic range of fix5.27 calculations and
the limitations of the 114dB side lobe flat-top window used - the noise floor is mainly due to CORDIC unit
I think)
working, meaning a speed up to 325kSPS from 210kSPS.
(So now 6.3ms per 2048-point FFT with a single cog - haven't thought about parallization yet)
Not quite the doubling in speed I'd hoped, but I did shake out a few bugs that only affected the phase of the
FFT result which were just breaking the real FFT completely in baffling manner.
Next stage code and options rationalization/tidy up...
unrolling some loop bodies substantially - using LUTram now for more codespace. as a consequence.
There's a tiny testsuite in spin, and hopefully enough documentation at the head of the file to explain
how to play with it. Not all the modes are tested, but real and complex inputs with stride 8 have tests.
Current raw FFT speed for real input on 160MHz P2 with 2048 point FFTs is about 745kSPS, and for
complex input, 475kSPS. With real input and output squared-magnitude mode the speed is 655kSPS.
As a for-instance, reading 2048 sample blocks at 48kSPS from analog smartpin, which takes 42.6ms to
acquire, each block can be processed in 3.1ms, about 13 times real-time, leaving lots of slack for overlap/add
style processing (although there would have to be some more copying overhead in this scenario).
[ Its interesting to spot the various uses of new P2 instructions like incmod, subr, decod, although I'm sure
there's a few I've not used but would be ideal, there are so many new instructions ]
[ And I've just noticed I've achieved 6 times the initial performance figures, looking back at my first post ]
of the FFT code and testsuite, and the other code I have for spectral analysis, https://github.com/MarkTillotson/Prop2_FFT
And I've added text/numeric labels and info to the display code. Screenshot below. I've been playing with
an 18 bit SAR ADC chip, the ADS8885, and it seems to perform very nicely, can really show up the noise
floor of the smartpin DACs. This isn't a sigma-delta chip, its specified to be able to settle to full 18 bit
accuracy in single sample time, and goes to 400kSPS... Not cheap, but useful for my ultimate aim of a Prop2
precision FFT analyzer. been side-tracked designing a board with two ADS8885's and a voltage reference
as per the chip's datasheet recommendations.
Looks like a nice ADC.
Log freq scale means using a bigger FFT to avoid large gaps at the low frequency end, but is certainly doable as
part of the display loop that draws the trace.
My latest tweak was to combine sync serial output to one colour while streaming another, allows the graticule
to be output in a different colour to the trace - previously each screen area was streamed to a different VGA
colour pin. One bit per pixel for 1280x768 is 123kB, I would like to reduce that.
A limitation I've hit is that synchronous serial TX mode doesn't work without an enabled clock pin - for
video generation I don't need that pin (although for some TFT panels it is needed anyway).
or 'scope
I've added a few more tests on the raw FFT library, which now also supports spectral windows (needs
tests still for that, also needs inverse FFT and logarithmic conversion adding).
https://github.com/MarkTillotson/Prop2_FFT
Included are 3 libraries for acquisition sharing a common interface, and a very hacked around test signal
generator driving a smart pin in DAC mode. I currently connect that to an ADC smartpin or my ADS8885
breakout on a little satellite breadboard.
The VGA 1280x768 resolution allows a 1025x513 pixel area for plotting spectrum and is 1-bit-per-pixel,
and different coloured vertical zones using streamer. The graph area has hardcoded graticule generated
on the fly on red VGA signal using synx TX mode while the streamer drives green pin from the bit map.
Like this:
I like the idea for the graticule overlay.
Async serial uses internal pacing. This would eliminate the external serial clock.
The smallest pulse, start-data-stop, that could be output this way is two serial clocks long. Which should work if the serial clock were double speed of the pixel clock.
EDIT: Oops, it wouldn't work for long lines.
That's a good idea for the graticule, although I was also thinking about ways to do run-length-encoded
graphics too, in order to expand to more colours.
In the simplest form, think of the start and stop bits as the two ends of a horizontal line. If there was no data between start and stop then the visible line length would be one serial clock. But I think the minimum data length is one bit, so the minimum line length is two serial clocks long. And to make that the same as one pixel would need doubled serial clock. You'd also need to use inverted output in the custom pin config because start bit is normally low and stop bit is normally high.
Pixel patterns can be made from a maximum of 33 serial bits, and minimum 2.
Not sure what happens if the serial shifter time base is adjusted mid way.