Parsing

Parsing is a procedure that recognises a sentence and discovers how it is built (i.e. gives its grammatical structure). Recognition involves finding out whether the sentence under consideration belongs to the particular language, i.e. whether it follows all the rules of well-formedness that the language prescribes. Discovering the structure involves identifying and marking the various components of a sentence - the phrases and the individual parts of speech such as noun, verb, preposition etc.. Both the above fuctions require some concept of grammar of the underlying language.

Parsing is the first step in natural language processing. Given a sentence, what is needed is a procedure that recognizes the sentence and also discovers how it is built. The execution of that procedure is called parsing and the thing that executes it is called a parser. This breakdown essentially is the first step in understanding the meaning of the sentence.

Parsers essentially do two things:

  1. When presented with a string they have to recognize it as a sentence of the language they can parse.
  2. They have to assign to that sentence a structure which they have to output. This implies that parsers must rely on linguistic information as contained in a grammar.

A typical parser consists of the following components:

  1. Database of words
  2. Principles for joining words into phrases
  3. Principles for checking for grammatical correctness of a phrase

A database of words (i.e vocabulary) is required for any parser so that the parser is able to recognize the words in the sentence that it has to parse. The database can be regarded as a kind of memory, whose chief characteristics are the items in it, the structure imposed on them, and the way they are accessed. Hence, the database of any parser would essentially consist of the words, along with all the different ways in which they are used in the language or part of the language that is to be parsed. A phrase can consist of words, phrases or both necessiating the presence of some principles which can differentiate any grammatically correct phrase from a mere cluster of words. These principles constitute grammar.

Parsing with computers is fundamentally a search process, where one grammatical rule after another is tried on the input string until a set of rules is found that is completely satisfied by the string under consideration. Usually, no unique set is found and the parser has to output all the different breakdowns that it has obtained, not all of which may be correct. Of the numerous correct interpretations that may be obtained, which one is correct will depend upon other factors such as the context of the utterance. For example, the first computerised parser output five different breakdowns for the following sentence:

Time flies like an arrow.

Parsing Algorithms

There are different approaches that one can take towards parsing. The first one being top-down. In this we start from the top i.e. at the level of the sentence and try to break it down into phrases using rules of the grammar. These phrases are further broken down according to given grammatical rules until we achieve terminal nodes which are then compared to the words in the utterance. So these parsers are 'hypothesis-driven', a kind of depth first search, exploring a particular derivation until it meets success or failure and in the case of failure switching to the next grammatical rule that may be used. The other genre of parsers are the bottom-up parsers. As the name suggests here we begin from the bottom that is the words in the utterance and move towards the grammatical sentence. In this process, we first replace the lexical entries (words) with their grammatical equivalents. e.g. 'the' can be replaced by 'determiners'. Now, we try binding these grammatical entities to give other entities higher up the heirarchy, finally reaching the sentence. Both the methods have their respective disadvantages. The first one involves backtracking and can only be meaningfully applied when the grammar is over-simplified. The second one on the other hand, blindly finds all the substructures that can be put together without using any constraints.