Computing a running median

In a recent thread, a question came up about array sorting using a BASIC Stamp. As part of the discussion, Tracy Allen mentioned sorting as a step in finding the median value in a sequence of sensor readings. The median is an important statistical concept, because it is less sensitive to the occasional "wild" reading, which can easily skew the mean (average) of a sequence of values.
The discussion got me thinking whether sorting is a necessary precursor in this process, or whether more efficient methods can be used. This is of particular importance for arrays stored in scratchpad RAM in the BS2p- family of Stamps, since swapping values (e.g. during a sort) can be a bit cumbersome with GET and PUT. I was hoping to find an algorithm that didn't require moving a value's location once it found its way into the scratchpad.
In pursuit of this, I came up with the following program that computes a "running median" using the N most recent sensor readings. (The test program uses random data in place of sensor readings.) After each reading, the program compares the current median with the new reading and with the value it replaces. Based on this, it determines whether the median value needs to change and, if so, scans the N most recent readings once to figure out the new median. For N readings, then, the algorithm is O(N2), the same as a bubble sort. But neither the readings themselves, nor an index table, need to be shuffled. Once a reading is written to RAM, it stays put until it's replaced by a newer reading.
Here's the program:
To test this program, I wrote a PC program in Perl that takes output from the PBASIC program and computes the median at each step the old-fashioned way using a sort. In 1000 steps, it didn't find any errors, so I'm fairly confident the PBASIC program works as it should. Here's the test program. It prints only if an error is found:
-Phil
Post Edited (Phil Pilgrim (PhiPi)) : 5/24/2008 10:53:01 PM GMT
The discussion got me thinking whether sorting is a necessary precursor in this process, or whether more efficient methods can be used. This is of particular importance for arrays stored in scratchpad RAM in the BS2p- family of Stamps, since swapping values (e.g. during a sort) can be a bit cumbersome with GET and PUT. I was hoping to find an algorithm that didn't require moving a value's location once it found its way into the scratchpad.
In pursuit of this, I came up with the following program that computes a "running median" using the N most recent sensor readings. (The test program uses random data in place of sensor readings.) After each reading, the program compares the current median with the new reading and with the value it replaces. Based on this, it determines whether the median value needs to change and, if so, scans the N most recent readings once to figure out the new median. For N readings, then, the algorithm is O(N2), the same as a bubble sort. But neither the readings themselves, nor an index table, need to be shuffled. Once a reading is written to RAM, it stays put until it's replaced by a newer reading.
Here's the program:
' {$STAMP BS2pe} ' {$PBASIC 2.5} 'Program to compute the running median of sample inputs. N CON 11 'Number of running samples for median. N must be odd. N2 CON N * 2 'For word addressing. NMID CON N / 2 'Midpoint of array. ptr VAR Byte 'Location of last value written to scratchpad. i VAR Byte k VAR Word nsame VAR Byte Rand VAR Word NewVal VAR Word Median VAR Word CurVal VAR Word NewMedian VAR Word ptr = N2 - 2 Median = 0 'Scratchpad initialized to 0's, so median is also 0. FOR k = 1 TO 1000 'Acquire 1000 samples. RANDOM Rand NewVal = Rand // 10 + 50 'Sample is a random integer on [noparse][[/noparse]50, 59]. GOSUB RunningMedian 'Compute running median. FOR i = 0 TO N2 - 2 STEP 2 'Print last N samples, <highlighting> latest one. GET i, Word CurVal IF (ptr = i) THEN DEBUG "<", DEC2 CurVal, ">" ELSE DEBUG " ", DEC2 CurVal, " " ENDIF NEXT DEBUG ": ", DEC3 Median, CR 'Print computed median. NEXT END 'Subroutine fo computing median of N most recent samples. RunningMedian: ptr = ptr + 2 // N2 'Increment pointer. GET ptr, Word CurVal 'Get value at that location. IF (NewVal <> CurVal) THEN 'Is new value different? PUT ptr, Word NewVal ' Yes: Write it. IF (NewVal <= Median AND CurVal >= Median) THEN ' NewVal <= Median <= CurVal? NewMedian = 0 ' Yes: Find highest value in list nsame = 0 ' less than median, FOR i = 0 TO N2 - 2 STEP 2 GET i, Word CurVal IF (CurVal < Median) THEN nsame = nsame + 1 ' and count qualifying values. IF (CurVal > NewMedian) THEN NewMedian = CurVal ENDIF NEXT ELSEIF (NewVal => Median AND CurVal <= Median) THEN ' NewVal >= Median >= CurVal? NewMedian = 65535 ' Yes: Find lowest value in list nsame = 0 ' greater than median, FOR i = 0 TO N2 - 2 STEP 2 GET i, Word CurVal IF (CurVal > Median) THEN nsame = nsame + 1 ' and count qualifying values. IF (CurVal < NewMedian) THEN NewMedian = CurVal ENDIF NEXT ENDIF IF (nsame > NMID) THEN Median = NewMedian 'Median changes iff number of qualifiers > NMID. ENDIF RETURN
To test this program, I wrote a PC program in Perl that takes output from the PBASIC program and computes the median at each step the old-fashioned way using a sort. In 1000 steps, it didn't find any errors, so I'm fairly confident the PBASIC program works as it should. Here's the test program. It prints only if an error is found:
use strict; my $prev; while (<DATA>) { s/[noparse][[/noparse]\n\l\r]+//; my @array = split /[noparse][[/noparse] :<>]+/; shift @array; my $median = pop @array; @array = sort {$a <=> $b} @array; if ($array[noparse][[/noparse] 5] != $median) { print "$prev\n$_\n\n" } $prev = $_; } __END__ <52> 00 00 00 00 00 00 00 00 00 00 : 000 52 <52> 00 00 00 00 00 00 00 00 00 : 000 52 52 <52> 00 00 00 00 00 00 00 00 : 000 52 52 52 <58> 00 00 00 00 00 00 00 : 000 52 52 52 58 <57> 00 00 00 00 00 00 : 000 52 52 52 58 57 <57> 00 00 00 00 00 : 052 52 52 52 58 57 57 <59> 00 00 00 00 : 052 52 52 52 58 57 57 59 <59> 00 00 00 : 052 52 52 52 58 57 57 59 59 <56> 00 00 : 056 52 52 52 58 57 57 59 59 56 <57> 00 : 057 52 52 52 58 57 57 59 59 56 57 <50>: 057 <55> 52 52 58 57 57 59 59 56 57 50 : 057 55 <52> 52 58 57 57 59 59 56 57 50 : 057 55 52 <51> 58 57 57 59 59 56 57 50 : 057 55 52 51 <59> 57 57 59 59 56 57 50 : 057 55 52 51 59 <56> 57 59 59 56 57 50 : 056 55 52 51 59 56 <58> 59 59 56 57 50 : 056 ...
-Phil
Post Edited (Phil Pilgrim (PhiPi)) : 5/24/2008 10:53:01 PM GMT
Comments
That is very nice and it is always interesting to follow through your logic!
I think your algorithm is closer to O(N) than to O(N2). Apart from the initial and final overhead, it makes only one pass through the data.
Actually the general bubble sort if properly implemented is O(n log n), not O(n2). That is because each successive pass through the loop is shorter than the last, and because the algorithm terminates when a pass through is made with zero swaps. The worst case is when the entire data array is repopulated with random numbers. Moreover, in the case where the data is already almost sorted, it can come pretty close to O(n). I do think bubbleSort routine I posted in the other thread can be improved in speed for the situation we are talking about. You've given me some food for thought.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Tracy Allen
www.emesystems.com
Each step of my algorithm is either O(N) or O(1), so O(N) overall. The O(N2) is for a group of N readings, in order to make it a fair comparison with regular, "non-running-median" calculations.
-Phil
The attached program is another one that is faster than a full bubble sort, but it does maintain a sorted table. This is for situations such as a multiplayer game, where the players take turns, or in data acquisition, where readings are processed one at a time in sequence. This is I think slower than yours, because it does do a lot of GETting and PUTting, but it does maintain a sorted table. The trick is that it takes only one bubble pass, the direction of which is determined by the new score entered in comparison to the score that it replaces. Here is the core procedure. It took some doing to compress both the scan (jx=0 TO np2) and the scan (jx=np2 TO 0) into one routine!
That still isn't very efficient. With a binary search for the ranking, it should be possible to achieve O(log n + n/k).
One data acquisition scheme is a combination median and window average, by including in the average only the values that fall within, say, the two middle quartiles.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Tracy Allen
www.emesystems.com
Post Edited (Tracy Allen) : 5/27/2008 4:34:23 PM GMT