
Thread: Pattern matching and data extraction in Spin using regular expressions.

  1. #1

    One question that comes up frequently in the forums goes something like, "How do I extract latitude and longitude from GPS sentences?" Extracting data from incoming character streams is a common requirement, which usually entails searching for patterns in the character stream, so you know where to look for the data. PBASIC includes rudimentary pattern matching with its input WAIT modifier, but no such facility is native to Spin.

    A commonly used and very powerful pattern matching tool can be found in regular expressions. It's beyond the scope of a forum post to describe regular expressions in any detail, but there are several good online references that do so:

        www.regular-expressions.info/tutorial.html
        en.wikipedia.org/wiki/Regular_expression
        etext.lib.virginia.edu/services/helpsheets/unix/regex.html

    In order to facilitate some upcoming GPS work, I decided to write a regular expression parser and pattern matcher in Spin. It uses pretty much the standard regex vocabulary and includes many of the standard features, but with some notable differences:

    1. Only two anchors are supported for now: ^ (string beginning) and $ (string end).

    2. The {m,n} repeat count is not yet supported.

    3. Rather than extracting all the parenthesized groupings in a pattern, only those which begin with ($1 through ($9 are extracted.

    4. My version does not do any backtracking. Once a portion of the string is matched, the matching engine will only move forward. Backtracking is difficult to implement efficiently and often causes a lot of churning to attain a match. Since my engine is written in Spin, backtracking could really slow things to a crawl.

    5. Some special escape sequences have not yet been implemented.

    6. Most regular expression engines compile the regex first before applying it to a string. In mine, the regex is applied entirely interpretively: the parsing and pattern matching occur simultaneously.

    7. This is a matching and extraction engine only: there's no substitution or translation facility built in.
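
    To make point 4 concrete, here is a toy sketch of what "only move forward" means. It is written in Python rather than Spin, it is not taken from regex.spin, and the helper name match_forward_only is made up for illustration. It handles only literal characters, each optionally followed by *, and compares the result with Python's own backtracking re module. Whether regex.spin fails on this exact input has not been checked; the sketch only shows the general trade-off.

```python
import re


def match_forward_only(pattern, text):
    """Match pattern at the start of text without ever backing up.

    Supports only literal characters, each optionally followed by '*'.
    Returns the number of characters consumed, or None on failure.
    """
    pos = 0
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        starred = i + 1 < len(pattern) and pattern[i + 1] == "*"
        if starred:
            # Greedy: take every repeat and never give any of them back.
            while pos < len(text) and text[pos] == ch:
                pos += 1
            i += 2
        else:
            if pos >= len(text) or text[pos] != ch:
                return None
            pos += 1
            i += 1
    return pos


# A backtracking engine lets a* give one 'a' back so the final "ab" can match.
backtracking = re.match(r"a*ab", "aaab")

# The forward-only matcher lets a* swallow all three a's, then fails on the literal 'a'.
forward_only = match_forward_only("a*ab", "aaab")

print(backtracking.group(0), forward_only)
```

    The forward-only version never revisits a character, which keeps it cheap, at the cost of missing matches like this one.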

    The best way to show what it does is to use a common GPS string as an example. Here you see some NMEA sentences as they might have come from a GPS unit and which exist in a string buffer somewhere:

        $GPGSV,2,1,08,01,40,083,46,02,17,308,41,12,07,344,39,14,22,228,45*75
        $GPRMC,123519,A,4807.038,N,01131.000,E,022.4,084.4,230394,003.1,W*6A
        $GPVTG,054.7,T,034.4,M,005.5,N,010.2,K*48


    For this example, what we're interested in is the RMC sentence and the latitude and longitude info it contains:

        $GPRMC,123519,A,4807.038,N,01131.000,E,022.4,084.4,230394,003.1,W*6A

    The fields 4807.038,N are latitude, and the fields 01131.000,E are longitude. The data can be extracted using regex.spin and the following pattern (regular expression):

        \$GPRMC\s*,[^,]*,[^,]*,($1[\d\.]+)\s*,($2N|S)\s*,($3[\d\.]+)\s*,($4E|W)

    The four groups beginning with ($1 through ($4 are the portions of the pattern used for the actual data extraction. This probably looks like a real mess to the uninitiated, and it's a fact that regular expressions are much easier to write than they are to read. But the individual elements are very simple, so I will try to explain them one at a time.

    The first thing you see is \$GPRMC. This is there to make sure we're extracting data from the right sentence. The dollar sign is prepended with a backslash because $ by itself has special meaning, and the \ quotes it as a character to match. So the pattern matcher will scan the input string until it sees $GPRMC.

    Next is the rather cryptic-looking \s*. \s matches any whitespace character, such as space, CR, LF, and TAB. The * says to match the whitespace characters 0 or more times. Normally, the $GPRMC will be followed immediately by a comma, but this is put in the pattern in case some GPS receiver somewhere decides to throw in some extra blanks.

    Next comes a comma, which needs to be matched, followed by another odd-looking construction: [^,]*. A list of characters inside square brackets defines a category. A single character in the input string matches if it is any one of the characters in the category. Prepending the caret ^ to the list of characters means to match everything but the characters in the list. So, taken together with the *, [^,]* means to match zero or more occurrences of anything besides a comma. This, along with the comma itself, is used to skip data fields that we're not interested in.

    Next comes ($1[\d\.]+). Anything in parentheses is a group that's treated as a single element. A group that starts with ($ followed by a digit is a special group whose data we want to extract. The digit (1-9) specifies which slot in the return array the extracted data should be stored. The actual data for this group has to match the pattern [\d\.]+. Again the bracketed set is a class consisting of two items: \d, which matches any decimal digit and \., which matches the decimal point. The latter is prepended with the backslash escape because, by itself, it has special significance: a lone period matches any single character. The plus following the class means to match the class one or more times. So, taken together, [\d\.]+ means to match a group of digits and decimal point(s) until something else comes along. This is the numerical part of the latitude and will be stored in position 1 of the results array. (Position 0 is reserved for the portion of the string that matched the entire pattern.)

    Position 2 of the results will be either N or S. Its pattern (following the comma) is ($2N|S). The vertical bar means just what you think it does: OR. N|S will match either N or S. (It could also have been expressed [NS] to the same effect. But the vertical bar can also be used to separate subpatterns of more than one character, viz. NORTH|SOUTH.)

    Positions 3 and 4 of the extracted data are for longitude and work just like positions 1 and 2.
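
    If you have a PC handy, the individual elements can be tried one at a time. The sketch below uses Python's re module, which is purely a convenience and not part of regex.spin; the elements shown here (\$, [^,]*, [\d\.]+ and N|S) are standard vocabulary, so they should behave the same way there. The sample strings are fragments of the RMC sentence, except for the "noise" prefix, which is made up to show the scan.

```python
import re

# \$ : the backslash makes the dollar sign a literal character to match.
literal_dollar = re.search(r"\$GPRMC", "noise$GPRMC,123519").group(0)

# [^,]* : zero or more characters that are not a comma (skips one field).
skipped_field = re.match(r"[^,]*", "123519,A,4807.038").group(0)

# [\d\.]+ : one or more digits or decimal points.
number = re.match(r"[\d\.]+", "4807.038,N").group(0)

# N|S : either alternative; [NS] is equivalent for single characters.
hemisphere = re.match(r"N|S", "S,01131.000").group(0)
same_as_class = re.match(r"[NS]", "S,01131.000").group(0)

print(literal_dollar, skipped_field, number, hemisphere, same_as_class)
```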

    Here's a sample program that takes a string containing the sentences above, locates the $GPRMC sentence, and extracts the lat/lon data from it:

    Code:
    CON
    
      _clkmode      = xtal1 + pll16x
      _xinfreq      = 5_000_000
    
    OBJ
    
      re    : "regex"
      io    : "FullDuplexSerial"
    
    PUB  Start | teststr, pattern, resaddr, rslt, i
    
      io.start(31, 30, 0, 9600)
      io.tx(0)
      
      teststr := string("$GPGSV,2,1,08,01,40,083,46,02,17,308,41,12,07,344,39,14,22,228,45*75", 13, {
                     }  "$GPRMC,123519,A,4807.038,N,01131.000,E,022.4,084.4,230394,003.1,W*6A", 13, {
                     }  "$GPVTG,054.7,T,034.4,M,005.5,N,010.2,K*48", 13)
                     
      pattern := string("\$GPRMC\s*,[^,]*,[^,]*,($1[\d\.]+)\s*,($2N|S)\s*,($3[\d\.]+)\s*,($4E|W)")
    
      io.str(string("String:", 13, 13))
      io.str(teststr)
      io.str(string(13, 13, "Pattern:", 13, 13))
      io.str(pattern)
    
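      rslt := -cnt                                        ' Note the system counter before matching.
      resaddr := re.match(teststr, pattern, re#NOALT)
      rslt += cnt                                         ' rslt now holds the elapsed clock ticks.
    
      io.str(string(13, 13, "Time: "))
      io.dec(rslt / 80_000)                               ' 80 MHz clock: 80_000 ticks per millisecond.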
      io.str(string(" ms."))
      io.str(string(13, 13, "Results:", 13))
    
      if (resaddr < 0)
        io.tx(13)
        io.str(string("Error #"))
        io.dec(-resaddr)
      elseif (resaddr == 0)
        io.tx(13)
        io.str(string("No match."))
      else
        repeat i from 0 to 9
          if (rslt := long[resaddr][i])
            io.tx(13)
            io.dec(i)
            io.str(string(": "))
            io.str(re.field(i))
    
        io.str(string(13, 13, "Remainder of string:", 13, 13))
        io.str(re.remainder)


    Here's what the output looks like:

    Code:
    String:
    
    $GPGSV,2,1,08,01,40,083,46,02,17,308,41,12,07,344,39,14,22,228,45*75
    $GPRMC,123519,A,4807.038,N,01131.000,E,022.4,084.4,230394,003.1,W*6A
    $GPVTG,054.7,T,034.4,M,005.5,N,010.2,K*48
    
    
    Pattern:
    
    \$GPRMC\s*,[^,]*,[^,]*,($1[\d\.]+)\s*,($2N|S)\s*,($3[\d\.]+)\s*,($4E|W)
    
    Time: 143 ms.
    
    Results:
    
    0: $GPRMC,123519,A,4807.038,N,01131.000,E
    1: 4807.038
    2: N
    3: 01131.000
    4: E
    
    Remainder of string:
    
    ,022.4,084.4,230394,003.1,W*6A
    $GPVTG,054.7,T,034.4,M,005.5,N,010.2,K*48


    The results are displayed using the regex object's field method, given the index number for each field.
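
    To cross-check a pattern on a PC before loading it onto the Propeller, something like the following sketch may help. It uses Python's re module rather than regex.spin, so treat it as an approximation: the ($1 through ($4 extraction groups have no direct equivalent there and are written as ordinary parentheses, and Python's engine backtracks where mine does not. The names TEXT, PATTERN, fields and remainder are just illustrative.

```python
import re

# The three sentences from the demo, separated by CR (13) as in the Spin string.
TEXT = (
    "$GPGSV,2,1,08,01,40,083,46,02,17,308,41,12,07,344,39,14,22,228,45*75\r"
    "$GPRMC,123519,A,4807.038,N,01131.000,E,022.4,084.4,230394,003.1,W*6A\r"
    "$GPVTG,054.7,T,034.4,M,005.5,N,010.2,K*48\r"
)

# Same pattern with ($1 .. ($4 rewritten as ordinary capture groups.
PATTERN = r"\$GPRMC\s*,[^,]*,[^,]*,([\d\.]+)\s*,(N|S)\s*,([\d\.]+)\s*,(E|W)"

m = re.search(PATTERN, TEXT)

# Index 0 is the whole match; 1..4 are the extracted fields.
fields = [m.group(i) for i in range(5)] if m else []

# Everything after the matched portion, like the demo's "Remainder of string".
remainder = TEXT[m.end():] if m else TEXT

for i, value in enumerate(fields):
    print(f"{i}: {value}")
```

    For this input the five fields come out the same as in the Spin results above.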

    That's about all I can write about it here. Hopefully, I'll have a more thorough document available at a later date. In the meantime, give the program a try if it's something that interests you. It's still really raw and very alpha, so don't rely on it too heavily until it receives more testing (and, possibly, some changes).

    -Phil

    Edit: Fixed several errors where \w was used when \s was intended. Added updated archive. Demo now uses field method to print, instead of substr method.

    Post Edited (Phil Pilgrim (PhiPi)) : 6/30/2010 6:19:58 AM GMT
    Last edited by ForumTools; 10-01-2010 at 03:08 PM. Reason: Forum Migration

  2. #2

    That's great!
    I really love regexp, and having them on the propeller is a dream...

    Thanks,
    Massimo

  3. #3

    WOW....Phil....you are a MASTER.....RegExps on the Propeller....you are REALLY A MASTER....

    I had a heck of a time writing one to incorporate into RobotBASIC just recently......it works
    just like the PERL standard with extended syntax and replacing etc.


    I would have never been courageous enough to do an engine in Spin.......you are Good.

    For anyone wanting to learn Regular Expressions and wants a way to play with them
    and/or follow along a tutorial....here is a program that lets you do so easily. It is fun to
    use, and REALLY useful for following along with a book's examples or even a
    web tutorial and also gives you some standard patterns such as matching an email, a date
    etc. It is a compiled EXE of a program written in RobotBASIC and uses the RB RegExp engine
    that is now part of the new version that is about to be released...I am posting only the EXE...
    If you want the source code it will be in the download zip when I release the new version (V4.0.2)
    very soon. But the post below is an EXE and you do not need anything else other than to just run it.

    Samuel

    Post Edited (SamMishal) : 11/11/2009 11:30:07 AM GMT

  4. #4

    I'm keeping that in my favorites :)

  5. #5

    Phil Pilgrim (PhiPi) said...

    In order to facilitate some upcoming GPS work, I decided to write a regular expression parser and pattern matcher in Spin. It uses pretty much the standard regex vocabulary and includes many of the standard features, but with some notable differences:
    Superb work (as usual) Phil. So you have already made a good first step towards porting perl...

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    If you always do what you always did, you always get what you always got.

  6. #6

    Ahhhh...My worst nightmare has come to the Propeller world. Regular expressions give me a severe headache.
    Having to read modem line noise "\$GPRMC\w*,[^,]*,[^,]*,($1[\d\.]+)\w*,($2N|S)\w*,($3[\d\.]+)\w*,($4E|W)" is not fun.

    Well done:)

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    For me, the past is not over yet.

  7. #7

    Wow Phil, excellent work. I live in Perl on other machines and use PCRE in C a lot. Nice to see this. The NMEA parsing example is good, but in Spin there is no way to define on-time per NMEA for, let's say, a cog displaying time on an LED via perhaps a parallel connection. I wish for a RegEx NMEA driver in PASM. I've been looking at NMEA parsers written in C/C++ for AVR microcontrollers that do only one thing: an on-time dump of fairly accurate time to serially connected time displays. This would be a replacement for the likes of IRIG-B, which is notoriously difficult to demodulate and decode (this is another target for the Propeller for many reasons, if only there were a hardware multiplier/divider).

    I'm in run-on mode. Back to Topic. Great to see a RegEx object for the Prop. And thanks to you for having the depth of knowledge that recognized this as needed. Now if only someone can take the double precision stuff from the unofficial Propeller Wiki to an actual object with demo code encapsulation.

  8. #8

    Phil, Thanks for writing this :)
    Missed seeing you at the expo this year.

    Cheers,
    --Steve

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Propeller Pages: Propeller JVM

  9. #9

    Phil, have you been reading my notes or something? This is exactly what I have been trying to grasp recently so....
    THANK YOU!!!!
    THANK YOU!!!!
    THANK YOU!!!!


    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Andrew Williams
    WBA Consulting
    PowerTwig Dual Output Power Supply Module
    My Prop projects: Reverse Geo-Cache Box, Custom Metronome, Micro Plunge Logger

  10. #10

    Now we have two problems...

  11. #11

    Great work, Phil. I will have a use for this right quick ... I was going to ask for a string to parse $GPGGA ... but I went back and reread your excellent mini-tutorial and I figured I could do it myself. So, I started by modifying $GPRMC to also get time and date strings ... it worked on the first try ... I think I understand most of it, more than enough to do $GPGGA ... however, could you please explain a little more about the "+"?

    Code:
    String:
    
    $GPGSV,2,1,08,01,40,083,46,02,17,308,41,12,07,344,39,14,22,228,45*75
    $GPRMC,123519,A,4807.038,N,01131.000,E,022.4,084.4,230394,003.1,W*6A
    $GPVTG,054.7,T,034.4,M,005.5,N,010.2,K*48
    
    
    Pattern:
    
    \$GPRMC\w*,($1[\d]+)\w*,[^,]*,($2[\d\.]+)\w*,($3N|S)\w*,($4[\d\.]+)\w*,($5E|W)\w*,[^,]*,[^,]*,($6[\d]+)
    
    Results:
    
    0: $GPRMC,123519,A,4807.038,N,01131.000,E,022.4,084.4,230394
    1: 123519
    2: 4807.038
    3: N
    4: 01131.000
    5: E
    6: 230394



    cheers ... BBR

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    cheers ... brian riley, n1bq, underhill center, vermont
    The Shoppe at Wulfden
    www.wulfden.org/TheShoppe/
    www.wulfden.org/TheShoppe/prop/ - Propeller Products
    www.wulfden.org/TheShoppe/k107/ - Serial LCD Display Gear

  12. #12

    The "+" is a modifier to the unit it follows, such that it will match one or more of that unit in sequence. For example, "[ab]+" will match "aabbbababc" up to, but not including, the letter "c". It's often used for capturing numeric digits (i.e., "\d+"), when you know there's at least one but not how many beyond that.

    By contrast, the "*" is a modifier that will cause the unit it follows to match zero or more of that unit in sequence. It's typically used to skip over spaces when you don't know whether there are any or, if there are, how many.
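
    Here's a quick way to see the difference on a PC. This is only a sketch using Python's re module, not regex.spin, but "+" and "*" mean the same thing in both. The test strings other than "aabbbababc" are made up for illustration.

```python
import re

# "[ab]+" consumes the run of a's and b's and stops before the "c".
run = re.match(r"[ab]+", "aabbbababc").group(0)

# "\d+" needs at least one digit; "\d*" is satisfied by none at all.
plus_on_empty = re.match(r"\d+", ",N")
star_on_empty = re.match(r"\d*", ",N")

# "\s*" skips optional blanks before a comma, however many there are.
no_blanks = re.match(r"E\s*,", "E,022.4")
some_blanks = re.match(r"E\s*,", "E  ,022.4")

print(repr(run), plus_on_empty, repr(star_on_empty.group(0)))
```

    Note that "\d*" still reports a match on ",N"; it just matches an empty string, which is why "*" is the one to use for things that may be absent.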

    -Phil

  13. #13

    Aw, c'mon Phil - why tease us and then stop half way?

    We want a full implementation of AWK on the Propeller!

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Catalina - a FREE C compiler for the Propeller - see Catalina

  14. #14

    RossH,

    At this point, it's a mere awklet. A Propellerous awklet (Aethia cogspinicus), to be exact.

    -Phil

  15. #15

    OK, Thanks for the additional explanation Phil. I cleaned up my pattern some, then I sat down and did $GPGGA. Here is the pattern for that ... in case anyone wants it

    $1 is time string,
    $6 is GPS fix quality
    $7 is number of satellites
    $8 is altitude
    $9 units of altitude - M is meters

    ** EDIT ** Regex pattern corrected for the mistake in Phil's original: \w's replaced by \s's.

    Code:
    String:
    
    $GPGSV,2,1,08,01,40,083,46,02,17,308,41,12,07,344,39,14,22,228,45*75
    $GPGGA,170834,4124.8963,N,08151.6838,W,1,05,1.5,280.2,M,-34.0,M,,,*75
    $GPVTG,054.7,T,034.4,M,005.5,N,010.2,K*48
    
    
    Pattern:
    
    \$GPGGA\s*,($1\d+)\s*,($2[\d\.]+)\s*,($3N|S)\s*,($4[\d\.]+)\s*,($5E|W)\s*,($6\d)\s*,($7\d+)\s*,[^,]*,($8[\d\.]+)\s*,($9M)
    
    Time: 1576 ms.
    
    Results:
    
    0: $GPGGA,170834,4124.8963,N,08151.6838,W,1,05,1.5,280.2,M
    1: 170834
    2: 4124.8963
    3: N
    4: 08151.6838
    5: W
    6: 1
    7: 05
    8: 280.2
    9: M
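
    For anyone who wants to sanity-check the GGA pattern off-chip, here is a rough equivalent using Python's re module. That is an assumption of convenience, not regex.spin: the ($1 through ($9 groups become plain parentheses, and the labels in NAMES are hypothetical names for the nine slots (slots 2 through 5 are taken to be latitude and longitude, as in the RMC example).

```python
import re

SENTENCE = "$GPGGA,170834,4124.8963,N,08151.6838,W,1,05,1.5,280.2,M,-34.0,M,,,*75"

# Same pattern as above with ($1 .. ($9 rewritten as ordinary capture groups.
PATTERN = (
    r"\$GPGGA\s*,(\d+)\s*,([\d\.]+)\s*,(N|S)\s*,([\d\.]+)\s*,(E|W)"
    r"\s*,(\d)\s*,(\d+)\s*,[^,]*,([\d\.]+)\s*,(M)"
)

# Hypothetical labels for slots 1..9, in order.
NAMES = ["time", "lat", "lat_hemi", "lon", "lon_hemi",
         "fix_quality", "satellites", "altitude", "altitude_units"]

m = re.search(PATTERN, SENTENCE)

# Pair each label with its extracted text; empty dict if nothing matched.
fix = dict(zip(NAMES, m.groups())) if m else {}

for name, value in fix.items():
    print(f"{name}: {value}")
```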


    cheers ... BBR

    Post Edited (Brian Riley) : 6/30/2010 6:10:00 AM GMT

  16. #16

    Phil Pilgrim (PhiPi) said...
    Next is the rather cryptic-looking \w*. \w matches any whitespace character, such as space, CR, LF, and TAB. The * says to match the whitespace characters 0 or more times. Normally, the $GPRMC will be followed immediately by a comma, but this is put in the pattern in case some GPS receiver somewhere decides to throw in some extra blanks.
    Phil ... \w or \s for whitespace??? The comments in the beginning of regex.spin made me think it was \s

    cheers ... BBR


  17. #17

    I just realized that I screwed something up, and it reflects an error I seem to repeat in my Perl programming more often than I care to admit. \w, in standard RE parlance, is supposed to represent a word character [A-Za-z0-9_], not whitespace. Whitespace is represented by \s. This is a change I will have to make. My apologies for the confusion.

    -Phil

    Addendum: Yes, Brian, you beat me to it!

  18. #18

    'Turns out the object is correct, but the demo is wrong. It works only because there aren't any embedded whitespace characters in the NMEA sentences. Attached is a corrected demo program. I'll also make the necessary changes to the original post.
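
    To show why the mistaken demo still appeared to work, here is a small sketch. It uses Python's re module rather than regex.spin, on the assumption that \w and \s carry their standard meanings in both; the padded test string is made up.

```python
import re

# \w is a word character [A-Za-z0-9_]; \s is whitespace.
word_hits_space = re.match(r"\w", " ")
space_hits_space = re.match(r"\s", " ")

W_VERSION = r"\$GPRMC\w*,"   # what the original demo used by mistake
S_VERSION = r"\$GPRMC\s*,"   # what was intended

# With no blank after the header, \w* matches zero characters, so both work.
clean_w = re.match(W_VERSION, "$GPRMC,123519")
clean_s = re.match(S_VERSION, "$GPRMC,123519")

# Put a blank before the comma and only the \s version still matches.
padded_w = re.match(W_VERSION, "$GPRMC ,123519")
padded_s = re.match(S_VERSION, "$GPRMC ,123519")

print(bool(clean_w), bool(clean_s), bool(padded_w), bool(padded_s))
```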

    -Phil

  19. #19

    Phil Pilgrim (PhiPi) said...
    'Turns out the object is correct, but the demo is wrong. It works only because there aren't any embedded whitespace characters in the NMEA sentences. Attached is a corrected demo program. I'll also make the necessary changes to the original post.

    -Phil

    This won't compile. "re#NOALT" is not in the regex.spin you posted last night. Looking at original code, I assume its value is "0" ??? also "re.field()" is unheard of .....

    cheers ... BBR

    Post Edited (Brian Riley) : 6/30/2010 6:04:26 AM GMT

  20. #20

    Sorry, Brian. Try the corrected archive I uploaded to the original post instead. I must've added some stuff to the regex object in the interim.

    -Phil
