Efficiently processing continuous signals

evanh · 2026-02-05 09:06

@"Christof Eb." said:
Hi,
perhaps this is again my not-knowing...
As far as I know, you can only use the streamer for data movement if your code is executing from cog or LUT memory. I don't think that inline asm is loaded into cog memory?

Streamer and Fifo aren't entirely the same thing. The RDFAST instruction starts up the Fifo independent of the Streamer. It allows other Cog instructions like RFLONG access to the Fifo's FIFO.

Inline assembly is by default loaded into cogRAM just before it is executed. So not entirely inline as such. This is quite important when wanting to use the Fifo within the "inline" Pasm2 code. The Spin2 interpreter uses the Fifo itself. And Flexspin executes natively as hubexec which also requires the Fifo.

bob_g4bby · 2026-02-05 10:34

The "counter := @BHwindow" worked, @evanh , thank you. Attached is a test program to be run in debug mode for the scope displays: a Blackman-Harris window is applied to a sine wave. The window is applied to a polar signal, which is easier to do when preloading the CORDIC engine. My attempt to do it on a cartesian signal didn't produce the correct result. I was not getting all the correct values out of the CORDIC: with twice the number of multiplies, the 8-cycle pattern was broken. I may return to that; I expect it needs pacing to 16 cycles.
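As a rough illustration of why the polar form is convenient, here is a minimal Python sketch (not the attached Spin2/PASM2 code). It assumes each sample is held as a magnitude and an angle, so the window costs one multiply on the magnitude before the polar-to-cartesian conversion that the CORDIC would perform on the P2. The function name `window_polar` is hypothetical.

```python
import math

def window_polar(magnitudes, angles, window):
    """Scale each polar magnitude by its window value, then convert to (x, y).

    On the P2 the conversion step would be a CORDIC operation; here it is
    modelled with cos/sin purely for illustration.
    """
    out = []
    for r, theta, w in zip(magnitudes, angles, window):
        scaled = r * w  # one multiply per sample in polar form
        out.append((scaled * math.cos(theta), scaled * math.sin(theta)))
    return out
```

A cartesian signal would need the window multiplied into both x and y, which matches the "twice the number of multiplies" mentioned above.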

@"Christof Eb." , my inline assembly methods use the "org ... end" form, which means the code is loaded into cog RAM before execution. Had I used "orgh ... end", the code would execute from hub RAM. (See the "In-line PASM Code" section of the Spin2 manual.) The changes section also states:-

PUB/PRI methods now support ORGH (hub) inline PASM code, in addition to ORG (cog) inline PASM code.
○ Like ORG, ORGH loads the first 16 local long variables from hub RAM into cog registers, executes the inline code, and then updates the registers back to hub RAM.
○ Unlike ORG inline code, ORGH inline code does not load code into cog registers $000..$11F, but can be up to $FFFF instructions long, since it stays and executes in hub RAM.
○ ORGH allows inline PASM code without interfering with the $000..$11F cog register space, so those cog registers can be used entirely for stay-resident code, like interrupt service routines or frequently-called fast PASM routines.

As you can see in the window demo here, both the "sine" and the "polwin" methods in the dsp library use the FIFO successfully.

evanh · 2026-02-05 11:06

@bob_g4bby said:
The "counter := @BHwindow" worked, @evanh , thank you.

LOL, I failed Spin's := syntax too. I get that wrong every time I start a new Spin project. First compile has me fixing them.

Christof Eb. · 2026-02-05 16:42

@evanh said:

Streamer and Fifo aren't entirely the same thing. The RDFAST instruction starts up the Fifo independent of the Streamer. It allows other Cog instructions like RFLONG access to the Fifo's FIFO.

Does this mean, that when DDS from LUT to DAC is used, the STREAMER is used but not the FIFO and therefore code can in parallel be executed from HUB?

Inline assembly is by default loaded into cogRAM just before it is executed. So not entirely inline as such. This is quite important when wanting to use the Fifo within the "inline" Pasm2 code. The Spin2 interpreter uses the Fifo itself. And Flexspin executes natively as hubexec which also requires the Fifo.

OK, sorry, I was not aware that Spin2 loads inline asm into cog RAM.

evanh · 2026-02-05 21:03

@"Christof Eb." said:
Does this mean, that when DDS from LUT to DAC is used, the STREAMER is used but not the FIFO and therefore code can in parallel be executed from HUB?

Yep.

bob_g4bby · 2026-02-07 09:30

I spotted the scope plot of the windowing method didn't look quite right. The whole idea is for the waveform to start at 0 and end at 0. The scope trace above doesn't look like it does that. However, it was only the two traces being set on AUTO scaling. When I defined the max and min plot scale, then the two traces did show zero ends:-

The window in the DAT section was not laboriously typed in. It was a matter of:-
1. Producing the 1024-sample window in a spreadsheet
2. In another column, scaling up the window (samples in the range 0-1) to the point where the first sample was just under 1. The window happened to peak at 16384
3. In another column, converting that scaled window to integers
4. In a column to the left of the integers, putting 'word' in every row (or 'long' or 'byte' or whatever you need)
5. Copying these two columns and pasting them directly into the editor in Pnut or Spin Tools
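The same table could also be generated with a short script instead of a spreadsheet. The Python sketch below is only an illustration under stated assumptions: it uses the common 4-term Blackman-Harris coefficients and the periodic form of the window (dividing by N rather than N-1), neither of which is confirmed by the post. The names `blackman_harris` and `dat_lines` are hypothetical.

```python
import math

# Assumed: the widely used 4-term Blackman-Harris coefficients.
BH_COEFFS = (0.35875, 0.48829, 0.14128, 0.01168)

def blackman_harris(num_samples):
    """Return a periodic-form Blackman-Harris window with values in 0..1."""
    a0, a1, a2, a3 = BH_COEFFS
    window = []
    for n in range(num_samples):
        x = 2.0 * math.pi * n / num_samples
        window.append(a0 - a1 * math.cos(x)
                      + a2 * math.cos(2 * x) - a3 * math.cos(3 * x))
    return window

def dat_lines(window, scale, directive="word"):
    """Format scaled, rounded samples as lines ready to paste into a DAT section."""
    return [f"{directive} {round(value * scale)}" for value in window]

# 1024 samples scaled so that the window peaks at 16384, as in the post.
table = dat_lines(blackman_harris(1024), 16384)
print("\n".join(table[:4]))
```

Changing `directive` to 'long' or 'byte' mirrors step 4; the scale would need to suit the chosen size.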

Tatevrt · 2026-02-09 15:18

@bob_g4bby said:
Applications involving continuous signals such as music or software radios generally operate on buffers of data. All calculations must complete on average faster than the sample rate, else signal dropouts occur. Also it is desirable to minimise the delay between signal in and signal out.

To make an application as powerful as possible in a P2, maybe all 8 cogs need to be kept as near 100% loaded with tasks consistent with the above. Desirable 'features' are legion.

What algorithm for cog management have folks found works for them to achieve the above?

I'm familiar with Labview which operates on a dataflow principle, but wonder if that is o.t.t. for P2?

I guess this has been discussed long ago with P1? I've searched but not found it?

Cheers, Bob

Use rdfast #0, ##BHwindow instead of #BHwindow. The double # tells PASM2 it’s an address, not a value. Make sure the code runs from COG or LUT memory since rdfast cannot stream from Hub memory directly.

bob_g4bby · 2026-02-09 15:52

Message deleted

ersmith · 2026-02-09 22:57

@Tatevrt said:

Use rdfast #0, ##BHwindow instead of #BHwindow. The double # tells PASM2 it’s an address, not a value. Make sure the code runs from COG or LUT memory since rdfast cannot stream from Hub memory directly.

A small nit-pick: ## tells PASM2 that it's more than 9 bits (or more precisely, that it needs an AUGS or AUGD instruction to encode), not necessarily that it's an address. Addresses are more than 9 bits, so they do often fall into the "needs double #" category, but so do other large constants. And a few jump style instructions can actually hold an address without an AUGD/AUGS, so don't need two # in them.
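To make the 9-bit point concrete, here is a small Python sketch of the arithmetic, not of any assembler's actual implementation. It assumes the immediate field of an instruction holds 9 bits and that an AUGS/AUGD prefix supplies the remaining upper 23 bits of a 32-bit value. The helper names are hypothetical.

```python
IMMEDIATE_BITS = 9  # width of the immediate field in the instruction

def needs_aug(value):
    """True if a 32-bit constant does not fit in the 9-bit immediate field."""
    value &= 0xFFFFFFFF
    return value > (1 << IMMEDIATE_BITS) - 1

def split_immediate(value):
    """Split a 32-bit constant into (upper 23 bits, lower 9 bits)."""
    value &= 0xFFFFFFFF
    return value >> IMMEDIATE_BITS, value & ((1 << IMMEDIATE_BITS) - 1)
```

For example, `needs_aug(0x1FF)` is False while `needs_aug(0x200)` is True, whether or not the number happens to be an address.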

shannon1948 · 2026-02-13 14:35

My approach, which seems to work (using the edge module with PSRAM), is to fill a small chunk of samples from the ADC in hub ram, and between each sample do quite a bit:
(1) copy these chunks to a large buffer in PSRAM (burst moves work very efficiently - 100 MB/s in my case)
(2) move blocks from the large PSRAM buffer to a ring buffer on demand from a signal processing cog
This allows the signal processing cog (or cogs, as needed) to work "at its leisure" on ring buffer data while the first cog is flowing data in to PSRAM. This accounts for the fact that in my case, the signal processing code can take a variable amount of time depending on how many interesting features are in a segment of data.
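The ring buffer hand-off described above can be sketched in a few lines. This Python model is only an illustration of the idea, assuming a single producer and a single consumer that never run at the same instant; on the P2 the two sides would be separate cogs sharing hub RAM and would need their own synchronisation. The class and method names are hypothetical.

```python
class RingBuffer:
    """Fixed-capacity ring buffer: a producer writes blocks, a consumer reads them."""

    def __init__(self, capacity):
        self.data = [0] * capacity
        self.capacity = capacity
        self.head = 0   # index of the next slot to write
        self.tail = 0   # index of the next slot to read
        self.count = 0  # number of unread samples

    def free(self):
        return self.capacity - self.count

    def write(self, block):
        """Store a whole block, or refuse it if there is not enough room."""
        if len(block) > self.free():
            return False
        for sample in block:
            self.data[self.head] = sample
            self.head = (self.head + 1) % self.capacity
        self.count += len(block)
        return True

    def read(self, max_samples):
        """Return up to max_samples unread samples, oldest first."""
        n = min(max_samples, self.count)
        out = []
        for _ in range(n):
            out.append(self.data[self.tail])
            self.tail = (self.tail + 1) % self.capacity
        self.count -= n
        return out
```

Because `read` takes whatever is available, the consumer can run for a variable amount of time per block without the producer having to wait, as long as the buffer does not fill.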

It seems awkward to do this (and it is complicated to manage), but it allows me to keep one continuous segment of data in PSRAM, and as needed flush it as a file to an SD card for later analysis on a PC. In my case, this segment is 6 MB, so HUB RAM cannot be used for the large buffer.

The primary limitation is the capacity of PSRAM. In my case (which depends on sampling rate, number of channels, and sample bit depth) I can only store a maximum of 15 seconds of data in PSRAM.

I do not need my code to work on a continuous signal - I can fill the PSRAM buffer and then crunch using the signal processing cog, pulling the data into the ring buffer as needed. I have not thought about how to make this approach work on continuous signals, although the ring buffer approach seems amenable. In an SDR app that processes continuous signals, I think that performing everything on-chip could be done with smaller buffers, but am not sure.

I use Spin2 and use the hub memory only, so this is far from the most efficient way -- e.g., some day I could work on storing the small chunks in COG/LUT RAM. But even with this slow implementation, I have over 1000 clock cycles between each ADC sample to do the buffer management.

To build this, I relied on experiments using the system counter, which is invaluable. I've been using flexspin, thinking that this runs faster than interpreted byte codes (?).

The reason I like the P2 for this is I can implement true hardware threads in a shared memory architecture (I looked carefully at dual-core STM Cortex chips and in my view it looked picky and painful to do anything close to what you can do on the P2). As a result, so far I have not needed a scheduler. However, in this regard, I am curious if others have played with Spin2 cooperative tasks.

@bob_g4bby, thanks for starting this discussion!

Paul

bob_g4bby · 2026-02-15 12:07

I'm working towards a bandpass filter for use in a software defined radio, whose low and high cut-off frequencies can be changed by the user whilst listening to the radio. The FFT fast-convolution filter I intend to use requires the bandpass filter impulse response as one of its inputs.

I made a demo of producing that impulse response on the P2. I've used SPIN2 floating point for now, so as to be sure of not losing information with integer rounding. These are the debug plots it produces:-

Top line, left to right: The Blackman Harris window, The lowpass filter impulse response, The windowed lowpass filter impulse response
Bottom line, left to right: An iq sinewave, The windowed lowpass filter impulse response converted to bandpass impulse response by multiplying by the sinewave
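The chain of plots above (window, lowpass impulse response, windowed lowpass, multiplication by an IQ sinewave) can be sketched in Python as follows. This is not the Spin2 demo itself, just a minimal model under stated assumptions: a sinc lowpass whose cutoff is half the passband width, a symmetric 4-term Blackman-Harris window, and a complex exponential at the passband centre as the IQ sinewave. All function and parameter names are hypothetical.

```python
import cmath
import math

def sinc(x):
    """Normalised sinc: sin(pi*x) / (pi*x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def blackman_harris(num_taps):
    """Symmetric 4-term Blackman-Harris window (assumed coefficients)."""
    a0, a1, a2, a3 = 0.35875, 0.48829, 0.14128, 0.01168
    return [a0 - a1 * math.cos(2 * math.pi * n / (num_taps - 1))
            + a2 * math.cos(4 * math.pi * n / (num_taps - 1))
            - a3 * math.cos(6 * math.pi * n / (num_taps - 1))
            for n in range(num_taps)]

def bandpass_impulse_response(num_taps, f_low, f_high, sample_rate):
    """Windowed-sinc lowpass shifted up to the centre of f_low..f_high."""
    centre = (f_low + f_high) / 2.0
    cutoff = (f_high - f_low) / 2.0          # lowpass prototype cutoff
    norm = 2.0 * cutoff / sample_rate
    mid = (num_taps - 1) / 2.0
    window = blackman_harris(num_taps)
    taps = []
    for n in range(num_taps):
        t = n - mid
        lowpass = norm * sinc(norm * t)       # lowpass impulse response
        carrier = cmath.exp(2j * math.pi * centre * t / sample_rate)  # IQ sinewave
        taps.append(lowpass * window[n] * carrier)
    return taps

def gain_at(taps, freq, sample_rate):
    """Magnitude of the filter's frequency response at one frequency."""
    return abs(sum(h * cmath.exp(-2j * math.pi * freq * n / sample_rate)
                   for n, h in enumerate(taps)))
```

For example, `bandpass_impulse_response(1024, 300.0, 2700.0, 48000.0)` gives complex taps whose gain is close to 1 in the middle of the passband and very small well outside it, which `gain_at` can be used to check.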

I was gratified that the results look to be identical to the filter I produced for a LabView receiver, written quite a few years back.

I had to update @ersmith 's BinFloat.spin2 library from the obex to allow it to compile under Spin2 v52 (two names, exp and exp10, are now reserved words). I removed methods that are now built into Spin2.

Cheers, Bob

bob_g4bby · 2026-02-15 12:23

I've been modelling the above in a free WYSIWYG maths program called SMath Studio, which has been around for years. Here's the SMath file I've written.

bob_g4bby · 2026-02-18 02:59

This is a bit more interesting - the bandpass filter impulse response is passed through an FFT, ready for multiplying with the FFT of the signal input, as part of the FFT fast convolution filter. All this code would be executed only if the user changes the upper or lower cut-off frequency of the filter. The final waveform produced gives some idea of the filter shape:

The variables Flow and Fhigh may be changed by editing and the bandpass filter FFT is observed to change accordingly. Note the desirable flat top and sharp vertical sides. The left hand side of the plot is 0Hz, the right hand side is 48kHz.
Bob
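For readers unfamiliar with FFT fast convolution, the following Python sketch shows the core idea: transform both the signal and the impulse response, multiply the spectra, and transform back. It is a minimal illustration using a plain recursive radix-2 FFT, not the P2 code, and it processes one zero-padded block rather than a continuous stream (a real receiver would add overlap-save or overlap-add on top). All names are hypothetical.

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def ifft(spectrum):
    """Inverse FFT via the conjugate trick."""
    n = len(spectrum)
    conj = fft([v.conjugate() for v in spectrum])
    return [v.conjugate() / n for v in conj]

def fast_convolve(signal, impulse):
    """Linear convolution of two sequences by multiplying their FFTs."""
    out_len = len(signal) + len(impulse) - 1
    size = 1
    while size < out_len:          # pad to a power of two to avoid wrap-around
        size *= 2
    sig_fft = fft([complex(v) for v in signal] + [0j] * (size - len(signal)))
    # This spectrum only needs recomputing when the filter itself changes.
    imp_fft = fft([complex(v) for v in impulse] + [0j] * (size - len(impulse)))
    product = [a * b for a, b in zip(sig_fft, imp_fft)]
    return ifft(product)[:out_len]
```

Keeping the impulse response spectrum between blocks is what makes it reasonable to rebuild the filter only when a cut-off frequency changes.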

bob_g4bby · 2026-02-18 03:19

The code above is inefficient; it's for demo purposes only, as I'm just feeling my way here step by step. The final code would have far fewer repeat loops to speed things up, and buffers would be reused where possible to reduce memory size.
