contextera

Friday, March 30, 2012

NLTK : A Python module for Natural Language Processing

This page describes my exploration of the steps involved in Natural Language Processing and how to perform these steps in Python using the "nltk" module.

Applications of NLP

  • Sentiment mining
  • Information Extraction
  • Document classification
  • Named Entity Recognition
  • Relation Extraction between entities

Phases in NLP

Sentence Tokenization

AIM : To separate out sentences from a given paragraph or text.

A simple implementation is to split the text on ".". When parsing abstracts for Pathway interaction database generation, we split sentences on ". " [dot followed by a space]. But this approach may fail when parsing general text from the web, for example on abbreviations such as "Dr." that contain a dot. NLTK provides a separate "sent_tokenize" method which takes care of these kinds of issues.
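To make the difference concrete, here is a minimal sketch of the naive ". " split next to a thin wrapper around NLTK's "sent_tokenize". The function names and the sample text are assumptions chosen for illustration. The NLTK wrapper is only defined here, not called, because it needs nltk installed and its Punkt sentence model downloaded.

```python
def naive_sentences(text):
    """Split on '. ' (dot followed by a space), as described above."""
    parts = text.split(". ")
    # split() consumes the dot, so re-attach it to every piece but the last.
    sentences = [p + "." for p in parts[:-1]] + [parts[-1]]
    return [s.strip() for s in sentences if s.strip()]


def nltk_sentences(text):
    """Delegate to NLTK (requires nltk and its Punkt model to be installed)."""
    from nltk.tokenize import sent_tokenize
    return sent_tokenize(text)


# Hypothetical sample text containing abbreviations.
sample = "Dr. Smith studies pathways. He works in the U.S. on kinases."
naive = naive_sentences(sample)
print(naive)
```

The naive version breaks this two-sentence sample into four pieces because every abbreviation dot followed by a space looks like a sentence end. Before calling nltk_sentences, the Punkt model has to be fetched once with nltk.download("punkt"); depending on the NLTK version, the resource may be named "punkt_tab" instead.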

Example :

Output :

Word Tokenization

AIM : Identify words in a given sentence

The simple logic, again, is to split on " " [space] and then ensure that words separated by multiple spaces are handled properly. NLTK provides a "word_tokenize" method for this.
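The multiple-spaces problem mentioned above can be sketched as follows. The helper names and the sample sentence are assumptions for illustration. The NLTK wrapper is defined but not called, since "word_tokenize" needs nltk and its tokenizer data installed.

```python
def naive_words(sentence):
    """split() with no argument treats any run of whitespace as one separator."""
    return sentence.split()


def nltk_words(sentence):
    """Delegate to NLTK (requires nltk and its tokenizer data)."""
    from nltk.tokenize import word_tokenize
    return word_tokenize(sentence)


# Hypothetical sample with repeated spaces and punctuation.
sample = "NLTK  makes   tokenizing easy, doesn't it?"

print(sample.split(" "))    # splitting on a single space leaves empty strings
print(naive_words(sample))  # whitespace runs are collapsed
```

Even with the spacing handled, the naive tokens still carry punctuation ("easy," and "it?"). NLTK's tokenizer is designed to separate punctuation and contractions into their own tokens, which is the main reason to prefer it over a plain split.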

Example :

Output :

Stemming and Lemmatization

AIM : To identify the root word or lemma [the word that can be found in the dictionary].

Stemming : Stemming usually refers to a crude heuristic process that chops off the ends of words [affixes] in the hope of identifying the root word.
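As a rough illustration of "chopping off the ends of words", here is a toy suffix stripper. It is not the Porter algorithm; the suffix list and the minimum-length rule are assumptions made up for this sketch. A wrapper around NLTK's PorterStemmer is included for comparison but not called, since it requires nltk to be installed.

```python
# Hypothetical, illustrative suffix list; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def porter_stem(word):
    """Delegate to NLTK's Porter stemmer (requires nltk)."""
    from nltk.stem import PorterStemmer
    return PorterStemmer().stem(word)


for w in ["jumping", "jumped", "jumps", "running", "was"]:
    print(w, "->", crude_stem(w))
```

The toy version maps "jumping", "jumped" and "jumps" to "jump", but turns "running" into "runn", which shows why stemming is called crude: the result need not be a dictionary word.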

Example :

Output :

Lemmatization : Identify and remove the affixes only if the resulting word is present in a given vocabulary, so the output is always a dictionary word.
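The vocabulary check is what separates this from stemming, and it can be sketched with a toy lemmatizer. The tiny vocabulary and the suffix rules below are assumptions for illustration only. The wrapper around NLTK's WordNetLemmatizer is defined but not called, since it needs nltk and the WordNet corpus.

```python
# Hypothetical tiny dictionary standing in for a real vocabulary.
VOCABULARY = {"jump", "run", "city", "be"}

# Illustrative (suffix, replacement) rules, tried in order.
RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]


def toy_lemmatize(word):
    """Strip an affix only if the result is a known vocabulary word."""
    if word in VOCABULARY:
        return word
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in VOCABULARY:
                return candidate
    return word  # no rule produced a dictionary word, so leave it alone


def wordnet_lemmatize(word, pos="n"):
    """Delegate to NLTK (requires nltk and the WordNet corpus)."""
    from nltk.stem import WordNetLemmatizer
    return WordNetLemmatizer().lemmatize(word, pos=pos)


for w in ["cities", "jumping", "running", "bus"]:
    print(w, "->", toy_lemmatize(w))
```

Here "running" and "bus" are returned unchanged because stripping their endings would give "runn" and "bu", neither of which is in the vocabulary. To use wordnet_lemmatize, the corpus has to be fetched once with nltk.download("wordnet").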

Example :

Output :

More information about the algorithms used can be found at Stemming & Lemmatization

Parts of Speech Tagging

AIM : To identify the part of speech of each word in the text. Some of the tags output by NLTK's POS tagger are listed below.

Tag Description
CC Coordinating conjunction
CD Cardinal number
IN Preposition or subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
UH Interjection
VB Verb, base form
VBD Verb, past tense

The entire list can be accessed at Penn TreeBank parts of speech
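A small sketch of how the tags in the table above can be used: NLTK's "pos_tag" returns a list of (word, tag) pairs, and a lookup table turns the tags into readable descriptions. The helper names are assumptions, the lookup covers only a few rows of the table, and the tagged sentence is written by hand for illustration rather than produced by the tagger. The NLTK wrapper is not called because it needs nltk plus a downloaded tagger model.

```python
# A subset of the tag table above.
TAG_DESCRIPTIONS = {
    "CD": "Cardinal number",
    "JJ": "Adjective",
    "NN": "Noun, singular or mass",
    "NNS": "Noun, plural",
    "NNP": "Proper noun, singular",
    "VBD": "Verb, past tense",
}


def describe(tagged):
    """Attach a human-readable description to each (word, tag) pair."""
    return [(word, tag, TAG_DESCRIPTIONS.get(tag, "unknown tag"))
            for word, tag in tagged]


def nltk_tag(sentence):
    """Tokenize and tag with NLTK (requires nltk and its tagger model)."""
    import nltk
    return nltk.pos_tag(nltk.word_tokenize(sentence))


# Hand-written pairs in the shape pos_tag returns; not real tagger output.
hand_tagged = [("John", "NNP"), ("ate", "VBD"), ("two", "CD"),
               ("red", "JJ"), ("apples", "NNS")]

for row in describe(hand_tagged):
    print(row)
```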

Example :

Output :

Named Entity Recognition

AIM : Identify the named entities (people, places, organizations and so on) mentioned in the sentence

A simple approach is to identify the parts of speech of the words in the sentence and group consecutively occurring nouns.
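Grouping consecutively occurring nouns can be sketched in a few lines once the words are tagged. The function name is an assumption, and the tagged input is written by hand in the (word, tag) shape that NLTK's "pos_tag" returns; it is not real tagger output.

```python
from itertools import groupby


def group_nouns(tagged):
    """Join runs of consecutive noun-tagged words (NN, NNS, NNP, NNPS)."""
    groups = []
    # groupby yields one run per stretch of adjacent noun / non-noun tokens.
    for is_noun, run in groupby(tagged, key=lambda pair: pair[1].startswith("NN")):
        if is_noun:
            groups.append(" ".join(word for word, _ in run))
    return groups


# Hand-written (word, tag) pairs for illustration.
hand_tagged = [("Barack", "NNP"), ("Obama", "NNP"), ("visited", "VBD"),
               ("New", "NNP"), ("York", "NNP"), ("City", "NNP"),
               ("in", "IN"), ("March", "NNP")]

print(group_nouns(hand_tagged))
```

This heuristic collects every noun run, so it also picks up common nouns and cannot say whether a group is a person, a place or a date. NLTK additionally ships a model-based chunker, "ne_chunk", which labels entity types but needs extra model data to be downloaded.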

Example :

Output :

Document classification

AIM : Given a set of documents and their categories, train a model and use that trained model to classify other documents.
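One possible way to set this up with NLTK is a Naive Bayes classifier trained on bag-of-words features. This is a sketch under assumptions, not necessarily the approach originally used here: the toy corpus, the category names and the helper names are all made up for illustration. The training function is defined but not called, since it requires nltk to be installed.

```python
def bag_of_words(text):
    """Turn a document into the feature dict that NLTK classifiers expect."""
    return {word: True for word in text.lower().split()}


# Hypothetical toy corpus of (document, category) pairs.
TRAINING = [
    ("the striker scored a late goal", "sports"),
    ("the team won the final match", "sports"),
    ("the senate passed the budget bill", "politics"),
    ("the minister won the election vote", "politics"),
]


def train_classifier(labeled_docs):
    """Train a Naive Bayes model on the labeled documents (requires nltk)."""
    from nltk.classify import NaiveBayesClassifier
    featuresets = [(bag_of_words(doc), label) for doc, label in labeled_docs]
    return NaiveBayesClassifier.train(featuresets)


def classify(classifier, text):
    """Predict the category of an unseen document."""
    return classifier.classify(bag_of_words(text))


print(bag_of_words("Goal scored"))
```

With nltk available, usage would look like classifier = train_classifier(TRAINING) followed by classify(classifier, "a late goal in the match"). A corpus this small is only good for checking that the pipeline runs; real training needs many documents per category.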

Example :

Relation Extraction TBD

AIM : Identify the entities and the relationship between them.

Resources
