Tech Updates: NLTK : A Python module for Natural Language Processing

This page describes my exploration of the steps involved in Natural Language Processing (NLP) and how to perform those steps in Python using the "nltk" module.

Applications of NLP
Phases in NLP
Resources

Applications of NLP

Sentiment mining
Information Extraction
Document classification
Named Entity Recognition
Relation Extraction between entities

Phases in NLP

Sentence Tokenization

AIM : To separate out sentences from a given paragraph or text.

A simple implementation is to split the text on ".". When parsing abstracts for Pathway Interaction Database generation, we split sentences on ". " [a dot followed by a space]. But this approach may fail when parsing general text from the web, for example on abbreviations such as "Dr." that are followed by a space. NLTK provides a separate "sent_tokenize" method that takes care of these kinds of issues.

Example :

Output :
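
A minimal sketch contrasting the two approaches described in this section. The sample text and the helper names naive_sentences and nltk_sentences are made up for illustration. The NLTK helper assumes that nltk and its sentence tokenizer data are installed, and returns None when they are not.

```python
def naive_sentences(text):
    """Split on ". " (a dot followed by a space)."""
    return [part.strip() for part in text.split(". ") if part.strip()]


def nltk_sentences(text):
    """Use NLTK's sent_tokenize; return None if NLTK or its data is missing."""
    try:
        from nltk.tokenize import sent_tokenize
        return sent_tokenize(text)
    except (ImportError, LookupError):
        return None


sample = "Dr. Smith arrived late. He sat down."  # invented sample text

# The naive split wrongly treats the abbreviation "Dr." as a sentence end.
print(naive_sentences(sample))
print(nltk_sentences(sample))
```

The naive version breaks after "Dr", which is the kind of failure the paragraph above warns about on general web text.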

Word Tokenization

AIM : Identify words in a given sentence

Again, the simple approach is to split on " " [space] and then make sure that words separated by multiple spaces are handled properly. NLTK provides a "word_tokenize" method, which also separates punctuation from the words.

Example :

Output :
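
A minimal sketch of both approaches. The sample sentence and the helper names naive_words and nltk_words are hypothetical. The NLTK helper assumes that nltk and its tokenizer data are installed, and returns None otherwise.

```python
def naive_words(sentence):
    """Split on whitespace; str.split() with no argument collapses runs of spaces."""
    return sentence.split()


def nltk_words(sentence):
    """Use NLTK's word_tokenize; return None if NLTK or its data is missing."""
    try:
        from nltk.tokenize import word_tokenize
        return word_tokenize(sentence)
    except (ImportError, LookupError):
        return None


sample = "Hello,   world!  NLTK is useful."  # invented sample with extra spaces

# The naive split leaves punctuation attached to the words ("Hello,", "world!").
print(naive_words(sample))
print(nltk_words(sample))
```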

Stemming and Lemmatization

AIM : To identify the root word or lemma [the form of the word that can be found in the dictionary].

Stemming : Stemming usually refers to a crude heuristic process that chops off the ends of words [affixes] in the hope of identifying the root word.

Example :

Output :
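
A minimal sketch of the "chop off the ends" idea. crude_stem is a toy written for this illustration and is not the algorithm NLTK uses. The porter_stem helper calls NLTK's PorterStemmer, assuming nltk is installed, and returns None otherwise.

```python
def crude_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Toy stemmer: remove the first matching suffix, keeping at least 3 letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def porter_stem(word):
    """Stem with NLTK's PorterStemmer; return None if NLTK is not installed."""
    try:
        from nltk.stem import PorterStemmer
    except ImportError:
        return None
    return PorterStemmer().stem(word)


for word in ["running", "walked", "cats", "bus"]:
    # The toy stemmer can produce non-words such as "runn".
    print(word, crude_stem(word), porter_stem(word))
```

The toy version shows why stemming is called crude: the result does not have to be a dictionary word.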

Lemmatization : Identify and remove the affixes only when the resulting form is present in a given vocabulary [dictionary], so that the result is a valid word.

Example :

Output :

More information about the algorithms used can be found at Stemming & Lemmatization.
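
A minimal sketch of vocabulary-based lemmatization. toy_lemmatize and its three-word vocabulary are invented for illustration. The wordnet_lemmatize helper calls NLTK's WordNetLemmatizer, assuming nltk and its WordNet data are installed, and returns None otherwise.

```python
def toy_lemmatize(word, vocabulary, suffixes=("ing", "ed", "es", "s")):
    """Remove a suffix only if what remains is a word in the vocabulary."""
    if word in vocabulary:
        return word
    for suffix in suffixes:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            if candidate in vocabulary:
                return candidate
    return word  # no valid lemma found, so leave the word unchanged


def wordnet_lemmatize(word, pos="n"):
    """Lemmatize with NLTK's WordNetLemmatizer; None if NLTK or WordNet is missing."""
    try:
        from nltk.stem import WordNetLemmatizer
        return WordNetLemmatizer().lemmatize(word, pos=pos)
    except (ImportError, LookupError):
        return None


vocabulary = {"cat", "walk", "run"}  # tiny made-up dictionary

for word in ["cats", "walked", "running"]:
    print(word, toy_lemmatize(word, vocabulary), wordnet_lemmatize(word))
```

Unlike a crude stemmer, the toy lemmatizer leaves "running" untouched, because "runn" is not in its vocabulary.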

Parts of Speech Tagging

AIM : To identify the part of speech of each word in the text. Some of the tags output by the POS tagger in nltk are listed below.

Tag	Description
CC	Coordinating conjunction
CD	Cardinal number
IN	Preposition or subordinating conjunction
JJ	Adjective
JJR	Adjective, comparative
JJS	Adjective, superlative
NN	Noun, singular or mass
NNS	Noun, plural
NNP	Proper noun, singular
NNPS	Proper noun, plural
PDT	Predeterminer
POS	Possessive ending
PRP	Personal pronoun
PRP$	Possessive pronoun
RB	Adverb
RBR	Adverb, comparative
RBS	Adverb, superlative
RP	Particle
SYM	Symbol
UH	Interjection
VB	Verb, base form
VBD	Verb, past tense

The entire list can be accessed at Penn TreeBank parts of speech

Example :

Output :
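
A minimal sketch of tagging a sentence and reading the tags. The sentence, the hand-tagged list and the helper names are made up for illustration. nltk_tag assumes that nltk and its tokenizer and tagger data are installed, and returns None otherwise.

```python
# A few Penn Treebank tags copied from the table above.
TAG_DESCRIPTIONS = {
    "CD": "Cardinal number",
    "JJ": "Adjective",
    "NN": "Noun, singular or mass",
    "NNS": "Noun, plural",
    "NNP": "Proper noun, singular",
    "VBD": "Verb, past tense",
}


def describe(tagged_words):
    """Attach a readable description to each (word, tag) pair."""
    return [
        (word, tag, TAG_DESCRIPTIONS.get(tag, "not in this short list"))
        for word, tag in tagged_words
    ]


def nltk_tag(sentence):
    """Tag with nltk.pos_tag; return None if NLTK or its data is missing."""
    try:
        import nltk
        return nltk.pos_tag(nltk.word_tokenize(sentence))
    except (ImportError, LookupError):
        return None


# Hand-tagged sample in the (word, tag) format that nltk.pos_tag returns.
hand_tagged = [("John", "NNP"), ("bought", "VBD"), ("two", "CD"), ("books", "NNS")]
print(describe(hand_tagged))
print(nltk_tag("John bought two books"))
```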

Named Entity Recognition

AIM : Identify the named entities (nouns such as people, places and organizations) mentioned in the sentence

A simple approach is to identify the parts of speech of the words in the sentence and group consecutively occurring nouns.

Example :

Output :
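
A minimal sketch of the grouping step in plain Python. The function name group_nouns and the hand-tagged sentence are invented for illustration. The input uses the (word, tag) format that nltk.pos_tag returns, so tagger output could be passed in directly.

```python
def group_nouns(tagged_words):
    """Join runs of consecutive noun-tagged words (NN, NNS, NNP, NNPS) into phrases."""
    phrases, current = [], []
    for word, tag in tagged_words:
        if tag.startswith("NN"):
            current.append(word)
        else:
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:  # flush a noun run that ends the sentence
        phrases.append(" ".join(current))
    return phrases


# Hand-tagged sample sentence, not actual tagger output.
hand_tagged = [
    ("Barack", "NNP"), ("Obama", "NNP"), ("visited", "VBD"),
    ("New", "NNP"), ("York", "NNP"), ("in", "IN"), ("March", "NNP"),
]
print(group_nouns(hand_tagged))  # ['Barack Obama', 'New York', 'March']
```

This heuristic groups every noun run, so common nouns are returned alongside proper names.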

Document classification

AIM : Given a set of documents and their categories, train a model and use that trained model to classify other documents.

Example :
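
A minimal sketch of training and using a classifier, here with NLTK's NaiveBayesClassifier as one possible choice. The training documents, categories and helper names are invented for illustration. train_and_classify assumes nltk is installed and returns None otherwise.

```python
def word_features(document):
    """Bag-of-words features: each lowercased word maps to True."""
    return {word: True for word in document.lower().split()}


def train_and_classify(training_docs, new_document):
    """Train a Naive Bayes model on (text, label) pairs and label a new document."""
    try:
        from nltk.classify import NaiveBayesClassifier
    except ImportError:
        return None
    train_set = [(word_features(text), label) for text, label in training_docs]
    classifier = NaiveBayesClassifier.train(train_set)
    return classifier.classify(word_features(new_document))


# Tiny made-up training set: (document text, category).
training_docs = [
    ("the team won the match", "sports"),
    ("the player scored a goal", "sports"),
    ("the new phone has a fast processor", "tech"),
    ("the laptop software was updated", "tech"),
]
print(train_and_classify(training_docs, "the team scored a late goal"))
```

A training set this small is only enough to show the shape of the calls. A usable model needs many labelled documents per category.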

Relation Extraction TBD

AIM : Identify the entities and the relationship between them.

contextera

Friday, March 30, 2012
