Fork me on GitHub

Extracting data using recursive descent parsing

The other day there was an interesting query on the bangpypers mailing list : How should I do it?. It was a question which wondered how to extract data out of a datastructure. The question reproduced below was as follows :

I have a txt file in the following format:

"confident" => {
  count => 4,
  trans => {
     "ashahvasahta" => 0.74918568,
    "atahmavaishahvaasa" => 0.09095465,
    "pahraaram\.nbha" => 0.06990729,
         "mailatae" => 0.02856427,
           "utanai" => 0.01929341,
             "anaa" => 0.01578552,
         "uthaanae" => 0.01403157,
         "jaitanae" => 0.01227762,
    },
},
"consumers" => {
  count => 4,
  trans => {
    "upabhaokahtaa" => 0.75144362,
    "upabhaokahtaaom\.n" => 0.12980166,
    "sauda\�\�\�dha" => 0.11875471,
    },
},
"a" => {
  count => 1164,
  trans => {
              "eka" => 0.14900491,
           "kaisai" => 0.08834675,
             "haai" => 0.06774697,
             "kaoi" => 0.05394308,
              "kai" => 0.04981982,
         "\(none\)" => 0.04400085,
              "kaa" => 0.03726579,
              "kae" => 0.03446450,
    },
},

and I need to extract "confident" , "ashahvasahta" from the first record, "consumers", "upabhaokahtaa" from the second record... i.e. "word in english" and the "first word in the probable-translations"

The first level key is an english word and the entries against the subkey "trans" seem translations with their associated weightages. The task is to extract the english word followed by the (sanskrit) translation word that has the highest weightage associated with it corresponding to the word.

While the problem could have been solved in different (and perhaps simpler) ways, it seemed an attractive one since it also could be solved using a recursive descent parser - something that I had not worked upon before (which then became the primary motivator for solving it). A quick study showed up a number of recursive descent parsers written in python language and the one I quickly settled on was pyparsing. A good introductory writeup on the same can be found at O'reilly onlamp.com at Building Recursive Descent Parsers with Python.

The code below is a suggested solution for the same. Hopefully the comments inline are sufficiently self explanatory.

A quick legend to some of the pyparsing constructs : Forward() : This is a forward declaration to a token whose structure will be defined later Word(characters) : A contiguous sequence of one or more characters that form the word Group() : A construct which separately classifies the matching data in the results token ^ token : The caret indicates an OR ie one of the two tokens should be matched. Optional(token) : An instruction which matches 0 or 1 occurences of token (equivalent to '?' in regex) .setResultsName(name) : An instruction which marks the matched result with the given name token.suppress() : An instruction which matches but then ignores the matched data in the result

# coding=utf-8
from pyparsing import *
import pprint
import sys

data = '''
"confident" => {
   count => 4,
   trans => {
     "ashahvasahta" => 0.74918568,
     "atahmavaishahvaasa" => 0.09095465,
     "pahraaram\.nbha" => 0.06990729,
     "mailatae" => 0.02856427,
     "utanai" => 0.01929341,
     "anaa" => 0.01578552,
     "uthaanae" => 0.01403157,
     "jaitanae" => 0.01227762,
   },
},
"consumers" => {
 count => 4,
 trans => {
   "upabhaokahtaa" => 0.75144362,
   "upabhaokahtaaom\.n" => 0.12980166,
   "sauda\�\�\�dha" => 0.11875471,
   },
},
"a" => {
 count => 1164,
 trans => {
             "eka" => 0.14900491,
          "kaisai" => 0.08834675,
            "haai" => 0.06774697,
            "kaoi" => 0.05394308,
             "kai" => 0.04981982,
        "\(none\)" => 0.04400085,
             "kaa" => 0.03726579,
             "kae" => 0.03446450,
   },
}
'''

# Setup pyparsing tokens

# Set up a forward reference to a dict (to be defined later)
dict_ = Forward()
# A key is a word of alphabets & numbers or is a quoted string
key = (Word(alphas + nums) ^ quotedString).setResultsName("key")
# A value is a sequence of digits and . or a dictionary
val = (Word("0123456789.") ^ dict_).setResultsName("value")
# A key value pair is a key followed by => operator followed by value
key_value_pair = Group(key + Literal("=>").suppress() + val)
# A key value pair list is a comma delimited list of key value pairs
key_value_pair_list = delimitedList(key_value_pair)
# A dictionary is { plus a key value pair list plus an optional trailing
# comma followed by a }
dict_ << Group(Literal("{").suppress() + key_value_pair_list + 
            Optional(Literal(",").suppress()) + Literal("}").suppress())

# parse data
parsed = key_value_pair_list.parseString(data)

# function to extract the parsed data into a dictionary
# Note : the function is kind of weird and does type checks
# due to the way ParseResult is modeled  
def extract(result):    
        if 'key' in result.keys() :
            if isinstance(result.value,ParseResults) :
                return ( result.key,  extract(result.value) )
            else :
                return ( result.key,  result.value )
        else :
            return(dict(extract(elem) for elem in result))

# This converts the parsed ParseResult structures into a dictionary       
extracted = extract(parsed)

# print the dictionary
pprint.pprint(extracted, sys.stdout)

# print the english word and first translated word

print "\n\n============\nTranslations\n============\n"
translated = {}
for english, translations in extracted.items() :
    current_word = ''
    current_weight = 0.0
    for word,weight in translations['trans'].items() :
        if float(weight) > current_weight :
            current_word = word
            current_weight = float(weight)
    translated[english] = current_word
print translated

As usual comments are most welcome. Enjoy!

Comments !

social