Pattern matching and data extraction in Spin using regular expressions.
Phil Pilgrim (PhiPi)
Posts: 23,514
One question that comes up frequently in the forums goes something like, "How do I extract latitude and longitude from GPS sentences?" Extracting data from incoming character streams is a common requirement, which usually entails searching for patterns in the character stream, so you know where to look for the data. PBASIC includes rudimentary pattern matching with its input WAIT modifier, but no such facility is native to Spin.
A commonly used and very powerful pattern matching tool can be found in regular expressions. It's beyond the scope of a forum post to describe regular expresisons in any detail, but there are several good online references that do so:
····www.regular-expressions.info/tutorial.html
····en.wikipedia.org/wiki/Regular_expression
····etext.lib.virginia.edu/services/helpsheets/unix/regex.html
In order to to facilitate some upcoming GPS work, I decided to write a regular expression parser and pattern matcher in Spin. It uses pretty much the standard regex vocabulary and includes many of the standard features, but with some notable differences:
1. Only two anchors are supported for now: ^ (string beginning) and $ (string end).
2. The {m,n} repeat count is not yet supported.
3. Rather than extracting all the parenthesized groupings in a pattern, only those which begin with ($1 through ($9 are extracted.
4. My version does not do any backtracking. Once a portion of the string is matched, the matching engine will only move forward. Backtracking is difficult to implement efficiently and often causes a lot of churning to attain a match. Since my engine is written in Spin, backtracking could really slow things to a crawl.
5. Some special escape sequences have not yet been implemented.
6. Most regular expression engines compile the regex first before applying it to a string. In mine, the regex is applied entirely interpretively: the parsing and pattern matching occur simultaneously.
7. This is a matching and extraction engine only: there's no substitution or translation facility built in.
The best way to show what it does is to use a common GPS string as an example. Here you see some NMEA sentences as they might have come from a GPS unit and which exist in a string buffer somewhere:
····$GPGSV,2,1,08,01,40,083,46,02,17,308,41,12,07,344,39,14,22,228,45*75
····$GPRMC,123519,A,4807.038,N,01131.000,E,022.4,084.4,230394,003.1,W*6A
····$GPVTG,054.7,T,034.4,M,005.5,N,010.2,K*48
For this example, what we're interested in is the RMC sentence and the latitude and longitude info it contains:
····$GPRMC,123519,A,4807.038,N,01131.000,E,022.4,084.4,230394,003.1,W*6A
The red fields are latitude, and the blue fields are longitude. The data can be extracted using regex.spin and the following pattern (regular expression):
····\$GPRMC\s*,[noparse][[/noparse]^,]*,[noparse][[/noparse]^,]*,($1[noparse][[/noparse]\d\.]+)\s*,($2N|S)\s*,($3[noparse][[/noparse]\d\.]+)\s*,($4E|W)
The colors indicate those portions of the pattern used for the actual data extraction. This probably looks like a real mess to the uninitiated, and it's a fact that regular expressions are much easier to write than they are to read. But the individual elements are very simple, so I will try to explain them one at a time.
The first thing you see is \$GPRMC. This is there to make sure we're extracting data from the right sentence. The dollar sign is prepended with a backslash because $ by itself has special meaning, and the \ quotes it as a character to match. So the pattern matcher will scan the input stirng until it sees $GPRMC
Next is the rather cryptic-looking \s*. \s matches any whitespace character, such as space, CR, LF, and TAB. The * says to match the whitespace characters 0 or more times. Normally, the $GPRMC will be followed immediately by a comma, but this is put in the pattern in case some GPS receiver somewhere decides to throw in some extra blanks.
Next comes a comma, which needs to be matched, followed by another odd-looking construction: [noparse][[/noparse]^,]*. A list of characters inside square brackets defines a catagory. Any single character in the input string will match anything included in the category. Prepending the carat ^ to the list of characters means to match everything but the characters in the list. So, taken together with the *, [noparse][[/noparse]^,]* meaans to match zero or more occurances of anything besides a comma. This, along with the comma itself is used to skip data fields that we're not interested in.
Next comes ($1[noparse][[/noparse]\d\.]+). Anything in parentheses is a group that's treated as a single element. A group that starts with ($ followed by a digit is a special group whose data we want to extract. The digit (1-9) specifies which slot in the return array the extracted data should be stored. The actual data for this group has to match the pattern [noparse][[/noparse]\d\.]+. Again the bracketed set is a class consisting of two items: \d, which matches any decimal digit and \., which matches the decimal point. The latter is prepended with the backslash escape because, by itself, it has special significance: a lone period matches any single character. The plus following the class means to match the class one or more times. So, taken together, [noparse][[/noparse]\d\.]+ means to match a group of digits and decimal point(s) until something else comes along. This is the numerical part of the latitude and will be stored in position 1 of the results array. (Position 0 is reserved for the portion of the string that matched the entire pattern.)
Position 2 of the results will be either N or S. Its pattern (following the comma) is ($2N|S). The vertical bar means just what you think it does: OR. N|S will match either N or S. (It could also have been expressed [noparse][[/noparse]NS] to the same effect. But the vertical bar can also be used to separate subpatterns of more than one character, viz. NORTH|SOUTH.)
Positions 3 and 4 of the extracted data are for longitude and work just like positions 1 and 2.
Here's a sample program that takes a string containing the sentences above, locates the $GPRMC sentence, and extracts the lat/lon data from it:
Here's what the output looks like:
The results are displayed using the regex object's field method, given the index number for each field.
That's about all I can write about it here. Hopefully, I'll have a more thorough document available at a later date. In the meantime, give the program try if it's something that interests you. It's still really raw and very alpha, so don't rely on it too heavily until it receives more testing (and, possibly, some changes).
-Phil
Edit: Fixed several errors where \w was used when \s was intended. Added updated archive. Demo now uses field method to print, instead of substr method.
Post Edited (Phil Pilgrim (PhiPi)) : 6/30/2010 6:19:58 AM GMT
A commonly used and very powerful pattern matching tool can be found in regular expressions. It's beyond the scope of a forum post to describe regular expresisons in any detail, but there are several good online references that do so:
····www.regular-expressions.info/tutorial.html
····en.wikipedia.org/wiki/Regular_expression
····etext.lib.virginia.edu/services/helpsheets/unix/regex.html
In order to to facilitate some upcoming GPS work, I decided to write a regular expression parser and pattern matcher in Spin. It uses pretty much the standard regex vocabulary and includes many of the standard features, but with some notable differences:
1. Only two anchors are supported for now: ^ (string beginning) and $ (string end).
2. The {m,n} repeat count is not yet supported.
3. Rather than extracting all the parenthesized groupings in a pattern, only those which begin with ($1 through ($9 are extracted.
4. My version does not do any backtracking. Once a portion of the string is matched, the matching engine will only move forward. Backtracking is difficult to implement efficiently and often causes a lot of churning to attain a match. Since my engine is written in Spin, backtracking could really slow things to a crawl.
5. Some special escape sequences have not yet been implemented.
6. Most regular expression engines compile the regex first before applying it to a string. In mine, the regex is applied entirely interpretively: the parsing and pattern matching occur simultaneously.
7. This is a matching and extraction engine only: there's no substitution or translation facility built in.
The best way to show what it does is to use a common GPS string as an example. Here you see some NMEA sentences as they might have come from a GPS unit and which exist in a string buffer somewhere:
····$GPGSV,2,1,08,01,40,083,46,02,17,308,41,12,07,344,39,14,22,228,45*75
····$GPRMC,123519,A,4807.038,N,01131.000,E,022.4,084.4,230394,003.1,W*6A
····$GPVTG,054.7,T,034.4,M,005.5,N,010.2,K*48
For this example, what we're interested in is the RMC sentence and the latitude and longitude info it contains:
····$GPRMC,123519,A,4807.038,N,01131.000,E,022.4,084.4,230394,003.1,W*6A
The red fields are latitude, and the blue fields are longitude. The data can be extracted using regex.spin and the following pattern (regular expression):
····\$GPRMC\s*,[noparse][[/noparse]^,]*,[noparse][[/noparse]^,]*,($1[noparse][[/noparse]\d\.]+)\s*,($2N|S)\s*,($3[noparse][[/noparse]\d\.]+)\s*,($4E|W)
The colors indicate those portions of the pattern used for the actual data extraction. This probably looks like a real mess to the uninitiated, and it's a fact that regular expressions are much easier to write than they are to read. But the individual elements are very simple, so I will try to explain them one at a time.
The first thing you see is \$GPRMC. This is there to make sure we're extracting data from the right sentence. The dollar sign is prepended with a backslash because $ by itself has special meaning, and the \ quotes it as a character to match. So the pattern matcher will scan the input stirng until it sees $GPRMC
Next is the rather cryptic-looking \s*. \s matches any whitespace character, such as space, CR, LF, and TAB. The * says to match the whitespace characters 0 or more times. Normally, the $GPRMC will be followed immediately by a comma, but this is put in the pattern in case some GPS receiver somewhere decides to throw in some extra blanks.
Next comes a comma, which needs to be matched, followed by another odd-looking construction: [noparse][[/noparse]^,]*. A list of characters inside square brackets defines a catagory. Any single character in the input string will match anything included in the category. Prepending the carat ^ to the list of characters means to match everything but the characters in the list. So, taken together with the *, [noparse][[/noparse]^,]* meaans to match zero or more occurances of anything besides a comma. This, along with the comma itself is used to skip data fields that we're not interested in.
Next comes ($1[noparse][[/noparse]\d\.]+). Anything in parentheses is a group that's treated as a single element. A group that starts with ($ followed by a digit is a special group whose data we want to extract. The digit (1-9) specifies which slot in the return array the extracted data should be stored. The actual data for this group has to match the pattern [noparse][[/noparse]\d\.]+. Again the bracketed set is a class consisting of two items: \d, which matches any decimal digit and \., which matches the decimal point. The latter is prepended with the backslash escape because, by itself, it has special significance: a lone period matches any single character. The plus following the class means to match the class one or more times. So, taken together, [noparse][[/noparse]\d\.]+ means to match a group of digits and decimal point(s) until something else comes along. This is the numerical part of the latitude and will be stored in position 1 of the results array. (Position 0 is reserved for the portion of the string that matched the entire pattern.)
Position 2 of the results will be either N or S. Its pattern (following the comma) is ($2N|S). The vertical bar means just what you think it does: OR. N|S will match either N or S. (It could also have been expressed [noparse][[/noparse]NS] to the same effect. But the vertical bar can also be used to separate subpatterns of more than one character, viz. NORTH|SOUTH.)
Positions 3 and 4 of the extracted data are for longitude and work just like positions 1 and 2.
Here's a sample program that takes a string containing the sentences above, locates the $GPRMC sentence, and extracts the lat/lon data from it:
[b]CON[/b] [b]_clkmode[/b] = [b]xtal1[/b] + [b]pll16x[/b] [b]_xinfreq[/b] = 5_000_000 [b]OBJ[/b] re : "regex" io : "FullDuplexSerial" [b]PUB[/b] Start | teststr, pattern, resaddr, rslt, i io.start(31, 30, 0, 9600) io.tx(0) teststr := [b]string[/b]("$GPGSV,2,1,08,01,40,083,46,02,17,308,41,12,07,344,39,14,22,228,45*75", 13, { } "$GPRMC,123519,A,4807.038,N,01131.000,E,022.4,084.4,230394,003.1,W*6A", 13, { } "$GPVTG,054.7,T,034.4,M,005.5,N,010.2,K*48", 13) pattern := [b]string[/b]("\$GPRMC\s*,[noparse][[/noparse]*^,]*,[noparse][[/noparse]*^,]*,($1[noparse][[/noparse]*\d\.]+)\s*,($2N|S)\s*,($3[noparse][[/noparse]*\d\.]+)\s*,($4E|W)") io.[b]str[/b]([b]string[/b]("[b]String[/b]:", 13, 13)) io.[b]str[/b](teststr) io.[b]str[/b]([b]string[/b](13, 13, "Pattern:", 13, 13)) io.[b]str[/b](pattern) rslt := -[b]cnt[/b] resaddr := re.match(teststr, pattern, re#NOALT) rslt += [b]cnt[/b] io.[b]str[/b]([b]string[/b](13, 13, "Time: ")) io.dec(rslt / 80_000) io.[b]str[/b]([b]string[/b](" ms.")) io.[b]str[/b]([b]string[/b](13, 13, "Results:", 13)) [b]if[/b] (resaddr < 0) io.tx(13) io.[b]str[/b]([b]string[/b]("Error #")) io.dec(-resaddr) [b]elseif[/b] (resaddr == 0) io.tx(13) io.[b]str[/b]([b]string[/b]("No match.")) [b]else[/b] [b]repeat[/b] i [b]from[/b] 0 to 9 [b]if[/b] (rslt := [b]long[/b][noparse][[/noparse]*resaddr][noparse][[/noparse]*i]) io.tx(13) io.dec(i) io.[b]str[/b]([b]string[/b](": ")) io.[b]str[/b](re.field(i)) io.[b]str[/b]([b]string[/b](13, 13, "Remainder of [b]string[/b]:", 13, 13)) io.[b]str[/b](re.remainder)
Here's what the output looks like:
String: $GPGSV,2,1,08,01,40,083,46,02,17,308,41,12,07,344,39,14,22,228,45*75 $GPRMC,123519,A,4807.038,N,01131.000,E,022.4,084.4,230394,003.1,W*6A $GPVTG,054.7,T,034.4,M,005.5,N,010.2,K*48 Pattern: \$GPRMC\s*,[noparse][[/noparse]^,]*,[noparse][[/noparse]^,]*,($1[noparse][[/noparse]\d\.]+)\s*,($2N|S)\s*,($3[noparse][[/noparse]\d\.]+)\s*,($4E|W) Time: 143 ms. Results: 0: $GPRMC,123519,A,4807.038,N,01131.000,E 1: 4807.038 2: N 3: 01131.000 4: E Remainder of string: ,022.4,084.4,230394,003.1,W*6A $GPVTG,054.7,T,034.4,M,005.5,N,010.2,K*48
The results are displayed using the regex object's field method, given the index number for each field.
That's about all I can write about it here. Hopefully, I'll have a more thorough document available at a later date. In the meantime, give the program try if it's something that interests you. It's still really raw and very alpha, so don't rely on it too heavily until it receives more testing (and, possibly, some changes).
-Phil
Edit: Fixed several errors where \w was used when \s was intended. Added updated archive. Demo now uses field method to print, instead of substr method.
Post Edited (Phil Pilgrim (PhiPi)) : 6/30/2010 6:19:58 AM GMT
Comments
I really love regexp, and having them on the propeller is a dream...
Thanks,
Massimo
I had a heck of time writing one to incorporate into RobotBASIC just recently......it works
just like the PERL standard with extended syntax and replacing etc.
I would have never been courageous enough to do an engine in Spin.......you are Good.
For anyone wanting to learn Regular Expressions and wants a way to play with them
and/or follow along a tutorial....here is a program that lets you do so easily. It is fun to
use, and REALLY useful for following along with a book's examples or even a
web tutorial and·also gives you some standard patterns such as matching an email, a date
etc. It is a compiled EXE of a program written in RobotBASIC and uses the RB RegExp engine
that is now part of the new version that is about to be released...I am posting only the EXE...
If you want the source code it will be in the down load zip when I release the new version (V4.0.2)
very soon. But the post below is an EXE and you do not need anything else other than just run it.
·
Samuel
·
Post Edited (SamMishal) : 11/11/2009 11:30:07 AM GMT
Superb work (as usual) Phil. So you have already made a good first step towards porting perl...
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
If you always do what you always did, you always get what you always got.
Having to read modem line noise "\$GPRMC\w*,[noparse][[/noparse]^,]*,[noparse][[/noparse]^,]*,($1[noparse][[/noparse]\d\.]+)\w*,($2N|S)\w*,($3[noparse][[/noparse]\d\.]+)\w*,($4E|W)" is not fun.
Well done[noparse]:)[/noparse]
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
I'm in run-on mode. Back to Topic. Great to see a RegEx object for the Prop. And thanks to you for having the depth of knowledge that recognized this as needed. Now if only someone can take the double precision stuff from the unofficial Propeller Wiki to an actual object with demo code encapsulation.
Missed seeing you at the expo this year.
Cheers,
--Steve
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Propeller Pages: Propeller JVM
THANK YOU!!!!
THANK YOU!!!!
THANK YOU!!!!
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Andrew Williams
WBA Consulting
PowerTwig Dual Output Power Supply Module
My Prop projects: Reverse Geo-Cache Box, Custom Metronome, Micro Plunge Logger
cheers ... BBR
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
cheers ... brian riley, n1bq, underhill center, vermont
The Shoppe at Wulfden
www.wulfden.org/TheShoppe/
www.wulfden.org/TheShoppe/prop/ - Propeller Products
www.wulfden.org/TheShoppe/k107/ - Serial LCD Display Gear
By contrast, the "*" is a modifier that will cause the unit it follows to match zero or more of that unit in sequence. It's typically used to skip over spaces when you don't know whether there are any or, if there are, how many.
-Phil
We want a full implementation of AWK on the Propeller!
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina
At this point, it's a mere awklet. A Propellerous awklet (Aethia cogspinicus), to be exact.
-Phil
$1 is time string,
$6 is GPS fix quality
$7 is number of satellites
$8 is altitude
$9 units of altitude - M is meters
** EDIT **regex pattern corrected for mistake in Phil's original \w's replaced by \s's.
cheers ... BBR
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
cheers ... brian riley, n1bq, underhill center, vermont
The Shoppe at Wulfden
www.wulfden.org/TheShoppe/
www.wulfden.org/TheShoppe/prop/ - Propeller Products
www.wulfden.org/TheShoppe/k107/ - Serial LCD Display Gear
Post Edited (Brian Riley) : 6/30/2010 6:10:00 AM GMT
Phil ... \w or \s for whitespace??? The comments in the beginning of regex.spin made me think it was \s
cheers ... BBR
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
cheers ... brian riley, n1bq, underhill center, vermont
The Shoppe at Wulfden
www.wulfden.org/TheShoppe/
www.wulfden.org/TheShoppe/prop/ - Propeller Products
www.wulfden.org/TheShoppe/k107/ - Serial LCD Display Gear
-Phil
Addendum: Yes, Brian, you beat me to it!
-Phil
This won't compile. "re#NOALT" is not in the regex.spin you posted last night. Looking at original code, I assume its value is "0" ??? also "re.field()" is unheard of .....
cheers ... BBR
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
cheers ... brian riley, n1bq, underhill center, vermont
The Shoppe at Wulfden
www.wulfden.org/TheShoppe/
www.wulfden.org/TheShoppe/prop/ - Propeller Products
www.wulfden.org/TheShoppe/k107/ - Serial LCD Display Gear
Post Edited (Brian Riley) : 6/30/2010 6:04:26 AM GMT
-Phil
How could this post remain hidden in the shadows for so many months?
I'm doing a complex state machine based on strings commands sent via RS232. If i had access to your excelent contribution before, this machine ( including GPS ) would be finished days ago.
At this momment, your post has 2 days on my browser, since i have a lot of work here.
Once i get some free time, i will do an exhaustive test on your code, i´ll change my entire state machine
Thank you!
Best of best regards
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Are you looking for a professional PCB/Schematichs software?
Kicad: GPL. You can do Schematics, PCBS, Gerber interfaces, 3d-views and more
www.lis.inpg.fr/realise_au_lis/kicad/
Search in Parallax Forums Using Google
www.google.com/advanced_search?q=+site:forums.parallax.com&num=20&hl=en&lr=
I mucked with the demo program and made some of the display code into methods. It has both patterns (GPRMC and GPGGA) and parses both. The patterns reflect the correction for the whitespace. Its zipped up and attached here.
cheers ... BBR
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
cheers ... brian riley, n1bq, underhill center, vermont
The Shoppe at Wulfden
www.wulfden.org/TheShoppe/
www.wulfden.org/TheShoppe/prop/ - Propeller Products
www.wulfden.org/TheShoppe/k107/ - Serial LCD Display Gear
The bulk of the time savings has to do with the NOALT option. When this option is asserted, you're telling match that there are no alternate patterns to look for at the top level, so it will not scan throught the entire pattern looking for a "|" symbol every time it fails to match a character.
-Phil