Parsing more than eight characters

dbpage · 2014-12-21 15:03

What is a good way to parse a text string that is longer than eight characters? In the code examples that follow, if I add more "if" statements, I will get the error message "Limit of 8 nested blocks exceeded." I would like to parse a few more characters. After reading timecode there is a space, end of file or invalid format. After reading command, there is a space, carriage return and line feed or end of file or invalid format. I resort to additional methods to continue parsing, but I don't like separating parsing statements.

Example 1:

' Parse timecode MM:SS.T, Where: M=Minutes, S=Seconds, T=Tenths of seconds
  i := 0
  if C[i]<>-1 and digit := lookdown(C[i]:"0".."5")
    time := --digit * _6000tenths                                   
    if digit := lookdown(C[++i]:"0".."9")
      time := (--digit * _600tenths) + time
      if C[++i] == ":"
        if digit := lookdown(C[++i]:"0".."5")
          time := (--digit * _100tenths) + time
          if digit := lookdown(C[++i]:"0".."9")
            time := (--digit * _10tenths) + time
            if C[++i] == "."
              if digit := lookdown(C[++i]:"0".."9")
                time := (--digit * tenths) + time
                return true
  return false                                  ' End of line, file or invalid command format

Example 2:

' Parse command string 0NN-NNN
  i := 0                                      
  if (C[i]=="0")                            
    if digit := lookdown(C[++i]:"0".."9")
      Addr := --digit
      if digit := lookdown(C[++i]:"0".."9")
        Addr := Addr * 10 + --digit
        if C[++i] == "-"
          if digit := lookdown(C[++i]:"0".."2")
            Fcn := --digit
            if digit := lookdown(C[++i]:"0".."9")
              Fcn := Fcn * 10 + --digit
              if digit := lookdown(C[++i]:"0".."9")
                Fcn := Fcn * 10 + --digit
                interpret
                return true                     
  return false                                  ' End of line, file or invalid command format

Heater. · 2014-12-21 15:36

Parsing is a huge topic.

But the next most stupid way to do it might look like the following pseudo code:

c = char[0]
if c != what I want
    return sytax error
do domething with c
c = char[1]
if c != what I want
    return sytax error
do domething with c
c = char[2]
if c != what I want
    return sytax error
do domething with c
c = char[3]
if c != what I want
    return sytax error
do domething with c

For a little more sophisticated ways of parsing have a read of "Let's build a Compiler" http://compilers.iecc.com/crenshaw/

Don't be scared off by the "Build a Compiler" thing. That text is very easy to read and starts with ideas about how to parse numbers and white space etc. All done in Pascal which looks like Spin.

kwinn · 2014-12-21 20:54

A string is a byte array terminated by a zero byte so you can use an if statement in a repeat loop to scan the string one byte at a time.

turbosupra · 2014-12-22 06:12

When I go past 8 IF statements, I use the case function or the repeat function with a variable for an index value

I too am having a hard time parsing data though

JonnyMac · 2014-12-22 09:36

What is a good way to parse a text string that is longer than eight characters? In the code examples that follow, if I add more "if" statements, I will get the error message "Limit of 8 nested blocks exceeded." I would like to parse a few more characters. After reading timecode there is a space, end of file or invalid format. After reading command, there is a space, carriage return and line feed or end of file or invalid format. I resort to additional methods to continue parsing, but I don't like separating parsing statements.

It would be useful to see more examples of the strings you're wanting to parse, and how you get them. Are there separator or terminating characters?

kwinn · 2014-12-22 13:00

Sorry about the silly post earlier. Should have taken a better look at the code you posted. Parsing time and dates are always a bit of a pain.
This code should work. It compiles but I do not have a propeller with me to test it.

VAR

  long  time

PUB start

PRI timetobin | i

' Parse timecode MM:SS.T, Where: M=Minutes, S=Seconds, T=Tenths of seconds

  i := 0                                                ' point to tens of hours

  if C[i] <"0" and C[i] >"5"
    return false                                        ' exit if 10's of minutes not 0-5    
  time := (C[i++] and $F) * 6000                        ' set time = 10's of minutes in tenths of a second
  
  if C[i] <"0" and C[i]>"9"
    return false                                        ' exit if minutes not 0-9
  time := time + ((C[i++] and $F) * 600)                ' add minutes to time

  if C[++i] <"0" and C[i]>"5"
    return false                                        ' exit if 10's of seconds not 0-5    
  time := time + ((C[i++] and $F) * 100)                ' add 10's of seconds to time
  
  if C[i] <"0" and C[i]>"9"
    return false                                        ' exit if seconds not 0-9
  time := time + ((C[i++] and $F) * 10)                 ' add seconds to time  
  
  if C[++i] <"0" and C[i]>"9"
    return false                                        ' exit if tenths of seconds not 0-9
  time := time + (C[i] and $F)                          ' add tenths of seconds
  
  return true

DAT

C       byte  "MM:SS.T"

dbpage · 2014-12-22 18:27

I am approaching the conclusion that multiple embedded "return false" statements are the best way to handle more than 8 nested "if" (or "case") statements. Multiple exits remind me of BASIC spaghetti code, but what is the alternative? A full expression parser such as the one described in Jack Crenshaw's article brought back many memories (thank you Heater), including the Pascal parser I first wrote in my undergraduate computer science class. This almost prompted me to implement the parser as a reverse Polish stack for nostalgic reasons, but the source expressions seem too simple to go that far.

For Jon McPhalen, I offer the following full disclosure:

MicroSD card file format (94 characters per line maximum):
  Header:
  ct0CL or ct0-381CL or ct0-382CL or ct0-xxxCL
  One or more lines:
  MM:SS.TAAA-FFF{SAAA-FFF}{SAAA-FFF}{SAAA-FFF}{SAAA-FFF}{SAAA-FFF}{SAAA-FFF}{SAAA-FFF}{SAAA-FFF}{SAAA-FFF}{SAAA-FFF}CL
  EOF

  Timecodes are 7 characters, where:
    MM = Minutes
    SS = Seconds
    T  = Tenths of secods
    Colon separates Minutes and Seconds
    Period separates Tenths
 Commands are 6 digits, where: 
    AAA = 3 digit Address
    FFF = 3 digit Function
    -   = Dash separates Address and Function
    S   = Space separates multiple commands, if any
    C   = Carriage return
    L   = Line feed (End of line delimiter)
  Minimum 1 command; maximum 11 commands per line
  EOF = End of File

I have working versions (Examples 1 and 2). My intent is to perfect and learn.

kwinn · 2014-12-22 19:41

dbpage wrote: »
I am approaching the conclusion that multiple embedded "return false" statements are the best way to handle more than 8 nested "if" (or "case") statements. Multiple exits remind me of BASIC spaghetti code, but what is the alternative? A full expression parser such as the one described in Jack Crenshaw's article brought back many memories (thank you Heater), including the Pascal parser I first wrote in my undergraduate computer science class. This almost prompted me to implement the parser as a reverse Polish stack for nostalgic reasons, but the source expressions seem too simple to go that far.

For Jon McPhalen, I offer the following full disclosure:
MicroSD card file format (94 characters per line maximum):
  Header:
  ct0CL or ct0-381CL or ct0-382CL or ct0-xxxCL
  One or more lines:
  MM:SS.TAAA-FFF{SAAA-FFF}{SAAA-FFF}{SAAA-FFF}{SAAA-FFF}{SAAA-FFF}{SAAA-FFF}{SAAA-FFF}{SAAA-FFF}{SAAA-FFF}{SAAA-FFF}CL
  EOF

  Timecodes are 7 characters, where:
    MM = Minutes
    SS = Seconds
    T  = Tenths of secods
    Colon separates Minutes and Seconds
    Period separates Tenths
 Commands are 6 digits, where: 
    AAA = 3 digit Address
    FFF = 3 digit Function
    -   = Dash separates Address and Function
    S   = Space separates multiple commands, if any
    C   = Carriage return
    L   = Line feed (End of line delimiter)
  Minimum 1 command; maximum 11 commands per line
  EOF = End of File
I have working versions (Examples 1 and 2). My intent is to perfect and learn.

I think this could be done with only one "return false" using a loop and index to select limits and multipliers but inline seemed to be the simplest and clearest way to do it. It is a bit reminiscent of Basic spaghetti code but at least the code is short enough that it can all be displayed at once.

Mark Mara · 2014-12-23 06:52

You might consider using a finite state machine to do the parsing. I much prefer a state machine to nested if statements or case statements for parsing. It is much cleaner and easer to debug.

Heater. · 2014-12-23 07:28

Mark Mara,

I do agree that a state machine is very useful in parsing. As some have noted above.

How do you make your state machines without case statements?

In the past I have done it with goto. Shameful I know but when you need the speed goto is the way to go. As it were.

I would be inclined to divide an conquer:

Make a method that parses numbers, hopefully with a variable number of digits for flexibility.

Make another method that parses alpa or alpha-numeric symbols, again of variable length and perhaps including "$" or "_" or whatever you might want to allow in such symbols.

Perhaps make a method that skips over white space.

Then you can tie all these together to parse all manner of data that contains numbers and symbols. That higher level parser will probably have to take care of detecting terminators like space, end of line, comma, semi-colon, whatever.

As a bonus you end up with some methods that can be reused for parsing in other different problems. And your finished code will be a lot more readable when you get back to it.

If you want to get really serious, use those low level methods to parse out tokens, the numbers and symbols etc and then have a higher level that works through the tokenized data.

Before you you know it you will be writing a compiler or interpreter as described by Crenshaw:)

Mark Mara · 2014-12-23 08:14

I use 2 arrays. The arrays are indexed by (current_state, token_number). The new_state array is an array of integers. The action array is an array of pointers to functions. To cycle the fsm you use the current state and token number to index into action array to fire off a function then index into the new_state array to get the new state. I probably have some code examples if that would help.

Heater. · 2014-12-23 08:19

Mark,

A good technique for sure.

Sadly we don't have the ability to make or call function pointers in Spin.

Tracy Allen · 2014-12-23 09:09

Here is a method that has served me well in various instances. It might suggest an approach for you, as it works by moving the data stream past an 8 byte window, aliased as two longs. That is true of yours with <CR>MM:SS.T and xAAA-FFF. (where "x" is either T or S). Even though the whole string is longer, the basic units are 7 bytes + separator.

Here is a snippet of how it works for parsing a GPS sentence. In this case the bytes come from the serial port, but in your case they would come from the sd card. The bytes are shifted msb first into an eight byte (two longs) buffer. After each shift, a condition is tested against the contents of the longs. In the case of the GPS, it is looking for a match to a set of key strings in a table. In your case, rather than a table match, you might have two conditions:
1) time string: <CR> in byte7, ":" in byte3, and "." in byte 1
2) command string: "-" in byte3 and either T or S in byte7
When that condition is met, you have all the other bytes in known position for quick analysis.

repeat  ' rotate bytes from GPS_PORT thru 8 bytes buffer until command key is detected in byte 1
     if char := uarts.rx(GPS_PORT)
       char0 := (char0 >> 8) + (char1.byte[0]<<24)   ' rotate bytes of char0 and char1, msb to lsb
       char1 := (char1 >> 8) + (char << 24)
       if char0 ==  long[@gps]  ' this is the key string for the device, a sentence start $GP has been detected.
         case char1
           long[@gga] : ParseGGA
           long[@rmc] : ParseRMC
           long[@scv] : ParseSCV
           ' capture values using "," or "*" or <CR> as delimiter.
DAT
  table long    0                               ' force long alignment, 4 bytes per entry to retain alignment
  gga byte "GGA,"                               ' chars GGA,
  rmc byte "RMC,"
  scv byte "GSV,"
  gps byte  $A,"$GP"                            ' line feed and chars $GP

You have to be aware of endedness. The bytes are shifted in msb first so that a match can be made with a text string that is stored in normal left to right order. For example, in "GGA,", the first G ends up in the least significant byte of the double long char1:char0.

kwinn · 2014-12-23 13:01

Heater. wrote: »

Mark Mara,

I do agree that a state machine is very useful in parsing. As some have noted above.

How do you make your state machines without case statements?

In the past I have done it with goto. Shameful I know but when you need the speed goto is the way to go. As it were.

I would be inclined to divide an conquer:

Make a method that parses numbers, hopefully with a variable number of digits for flexibility.

Make another method that parses alpa or alpha-numeric symbols, again of variable length and perhaps including "$" or "_" or whatever you might want to allow in such symbols.

Perhaps make a method that skips over white space.

Then you can tie all these together to parse all manner of data that contains numbers and symbols. That higher level parser will probably have to take care of detecting terminators like space, end of line, comma, semi-colon, whatever.

As a bonus you end up with some methods that can be reused for parsing in other different problems. And your finished code will be a lot more readable when you get back to it.

If you want to get really serious, use those low level methods to parse out tokens, the numbers and symbols etc and then have a higher level that works through the tokenized data.

Before you you know it you will be writing a compiler or interpreter as described by Crenshaw:)

I was tempted to divide and conquer for the minute and second fields but it hardly seemed worth the effort, and the multiple "return false" could be eliminated by setting an error flag instead. In some cases it might be worth setting multiple error flags for each line that is parsed so that a comprehensive list of errors could could be provided for debugging after a single pass through all the commands.

Parsing more than eight characters

Comments