This page records my exploration of the steps involved in Natural Language Processing (NLP) and how to perform these steps in Python using the "nltk" module.
Applications of NLP
- Sentiment mining
- Information Extraction
- Document classification
- Named Entity Recognition
- Relation Extraction between entities
Phases in NLP
Sentence Tokenization
AIM : To separate out sentences from a given paragraph or text.
A simple implementation is to split the text on ".". When parsing abstracts to generate a pathway interaction database, we split sentences on ". " [dot followed by a space]. But this approach may fail when parsing general text from the web, for example at abbreviations such as "Dr." or "e.g.". NLTK provides a separate "sent_tokenize" method which takes care of these kinds of issues.
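As a rough illustration of the difference, the sketch below puts the plain split on ". " next to a thin wrapper around NLTK's "sent_tokenize". The helper names `naive_sentences` and `nltk_sentences` and the sample text are made up for this example. The NLTK wrapper assumes that the Punkt model data has been fetched once with `nltk.download("punkt")` (newer NLTK releases ask for `"punkt_tab"` instead).

```python
def naive_sentences(text):
    """Split on '. ' (dot followed by a space) and drop empty pieces."""
    return [part.strip() for part in text.split(". ") if part.strip()]


def nltk_sentences(text):
    """Delegate to NLTK; assumes nltk and its Punkt data are installed."""
    # Imported here so that the naive version above runs without NLTK.
    from nltk.tokenize import sent_tokenize
    return sent_tokenize(text)


text = "Dr. Smith studies the p53 pathway. It regulates apoptosis."

# The naive split also breaks after the abbreviation "Dr."
print(naive_sentences(text))
# -> ['Dr', 'Smith studies the p53 pathway', 'It regulates apoptosis.']
```

Calling `nltk_sentences(text)` on the same text should keep "Dr. Smith studies the p53 pathway." together as one sentence, because the pre-trained Punkt model knows common English abbreviations.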
Word Tokenization
AIM : Identify words in a given sentence
The simple approach is again to split on " " [space] and then make sure that words separated by multiple spaces are handled properly. NLTK provides a "word_tokenize" method for this.
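A small sketch of both options, under the same assumptions as for sentence tokenization: the helper names and the sample sentence are invented for illustration, and the NLTK wrapper needs the Punkt data to be installed.

```python
def naive_words(sentence):
    """Split on whitespace; str.split() with no argument collapses
    runs of spaces, so multiple spaces give no empty tokens."""
    return sentence.split()


def nltk_words(sentence):
    """Delegate to NLTK; assumes nltk and its Punkt data are installed."""
    # Imported here so that the naive version above runs without NLTK.
    from nltk.tokenize import word_tokenize
    return word_tokenize(sentence)


sentence = "NLTK  is fun,  isn't it?"

# Splitting on a single space leaves empty strings behind.
print(sentence.split(" "))
# -> ['NLTK', '', 'is', 'fun,', '', "isn't", 'it?']

# Splitting on any whitespace fixes that, but punctuation stays attached.
print(naive_words(sentence))
# -> ['NLTK', 'is', 'fun,', "isn't", 'it?']
```

`nltk_words(sentence)` would typically go further and separate punctuation into its own tokens, and split contractions such as "isn't" into "is" and "n't".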
Stemming and Lemmatization
AIM : To identify the root word or lemma [the form of the word that can be found in the dictionary].
Stemming : Stemming usually refers to a crude heuristic process that chops off the ends of words [affixes] in the hope of identifying the root word.
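A minimal stemming sketch using NLTK's `PorterStemmer`, which needs no extra data downloads. The word list is just a sample picked for this example.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["caresses", "ponies", "cats", "running"]

# Each word is reduced by suffix-stripping rules, with no dictionary lookup.
stems = [stemmer.stem(word) for word in words]
print(stems)
# -> ['caress', 'poni', 'cat', 'run']
```

Note that "poni" is not a real word: the stemmer only applies its rules and never checks the result against a dictionary, which is why the process is called crude.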
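Lemmatization : Uses a vocabulary and morphological analysis to remove only the inflectional endings, and returns the dictionary form of the word [the lemma]. An affix is removed only if the result is present in the given vocabulary.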
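To make the "remove the affix only if the result is in the vocabulary" idea concrete, here is a toy sketch. The suffix rules, the tiny vocabulary and the name `toy_lemmatize` are all invented for illustration and are far simpler than what NLTK does. The second helper wraps NLTK's `WordNetLemmatizer`, and assumes the WordNet corpus has been fetched with `nltk.download("wordnet")`.

```python
# Suffix rewrite rules, tried in order: (suffix to strip, replacement).
SUFFIX_RULES = [("ies", "y"), ("es", ""), ("s", ""), ("ing", ""), ("ed", "")]


def toy_lemmatize(word, vocabulary):
    """Return the first suffix-stripped candidate found in `vocabulary`."""
    word = word.lower()
    if word in vocabulary:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in vocabulary:
                return candidate
    return word  # no dictionary form found, so leave the word unchanged


def wordnet_lemma(word, pos="n"):
    """Delegate to NLTK; assumes nltk and its WordNet corpus are installed."""
    # Imported here so that the toy version above runs without NLTK.
    from nltk.stem import WordNetLemmatizer
    return WordNetLemmatizer().lemmatize(word, pos=pos)


vocabulary = {"pony", "cat", "walk", "caress"}
words = ["ponies", "cats", "walked", "caresses", "wolves"]
print([toy_lemmatize(word, vocabulary) for word in words])
# -> ['pony', 'cat', 'walk', 'caress', 'wolves']
```

Unlike the stemmer, "ponies" becomes the real word "pony", and "wolves" is left alone because no candidate form is in this small vocabulary.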
More information about the algorithms used can be found at Stemming & Lemmatization
Parts of Speech Tagging
AIM : To identify the part of speech of each word in the text. Some of the tags output by the POS tagger from nltk are listed below.
| Tag | Description |
| CC | Coordinating conjunction |
| CD | Cardinal number |
| IN | Preposition or subordinating conjunction |
| JJ | Adjective |
| JJR | Adjective, comparative |
| JJS | Adjective, superlative |
| NN | Noun, singular or mass |
| NNS | Noun, plural |
| NNP | Proper noun, singular |
| NNPS | Proper noun, plural |
| PDT | Predeterminer |
| POS | Possessive ending |
| PRP | Personal pronoun |
| PRP$ | Possessive pronoun |
| RB | Adverb |
| RBR | Adverb, comparative |
| RBS | Adverb, superlative |
| RP | Particle |
| SYM | Symbol |
| UH | Interjection |
| VB | Verb, base form |
| VBD | Verb, past tense |
The entire list can be accessed at Penn TreeBank parts of speech
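A short sketch of how the tagger can be called and how its output can be read against the tag table. The helper names, the `TAG_DESCRIPTIONS` dictionary and the hand-written sample are made up for this example. `nltk_pos_tags` assumes that the tokenizer and tagger data are installed (for example `nltk.download("averaged_perceptron_tagger")`; the exact resource name depends on the NLTK version).

```python
# Descriptions for a few Penn Treebank tags (taken from the table above).
TAG_DESCRIPTIONS = {
    "CD": "Cardinal number",
    "NNP": "Proper noun, singular",
    "NNS": "Noun, plural",
    "RB": "Adverb",
    "VBD": "Verb, past tense",
}


def nltk_pos_tags(sentence):
    """Tokenize and tag with NLTK; assumes nltk and its data are installed."""
    # Imported here so that the rest of the sketch runs without NLTK.
    import nltk
    return nltk.pos_tag(nltk.word_tokenize(sentence))


def describe(tagged_tokens):
    """Attach a human-readable description to each (word, tag) pair."""
    return [
        (word, tag, TAG_DESCRIPTIONS.get(tag, "see the full Penn Treebank list"))
        for word, tag in tagged_tokens
    ]


# Hand-written sample in the (word, tag) format that nltk.pos_tag returns.
sample = [("John", "NNP"), ("quickly", "RB"), ("ate", "VBD"),
          ("two", "CD"), ("apples", "NNS")]

for word, tag, description in describe(sample):
    print(f"{word:8} {tag:4} {description}")
```

With NLTK and its data in place, `describe(nltk_pos_tags("John quickly ate two apples"))` can be used instead of the hand-written sample.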
Named Entity Recognition
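AIM : Identify the entities mentioned in the sentence. These are usually nouns or groups of nouns, such as the names of people, organizations and places.
A simple approach is to identify the parts of speech of the words in the sentence and group consecutively occurring nouns.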
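The noun-grouping idea can be sketched in plain Python on top of POS-tagged tokens. The function name `group_consecutive_nouns` and the hand-tagged sample sentence are assumptions made for this example; in practice the (word, tag) pairs would come from the POS tagger.

```python
from itertools import groupby


def is_noun(tagged_token):
    """All Penn Treebank noun tags (NN, NNS, NNP, NNPS) start with 'NN'."""
    return tagged_token[1].startswith("NN")


def group_consecutive_nouns(tagged_tokens):
    """Join each run of adjacent nouns into one candidate entity."""
    groups = []
    for noun_run, run in groupby(tagged_tokens, key=is_noun):
        if noun_run:
            groups.append(" ".join(word for word, _ in run))
    return groups


# Hand-tagged sample in the (word, tag) format produced by a POS tagger.
tagged = [("Barack", "NNP"), ("Obama", "NNP"), ("visited", "VBD"),
          ("New", "NNP"), ("York", "NNP"), ("in", "IN"), ("May", "NNP")]

print(group_consecutive_nouns(tagged))
# -> ['Barack Obama', 'New York', 'May']
```

This only finds candidate entities and does not say whether each one is a person, a place or a date. NLTK also ships `nltk.ne_chunk`, which takes the same kind of tagged input and labels the entity types, but it needs additional model data to be downloaded.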
Document classification
AIM : Given a set of documents and their categories, train a model and use that trained model to classify other documents.
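A minimal sketch of the train-then-classify flow using NLTK's `NaiveBayesClassifier`, which needs no extra data downloads. The four training documents, the two categories and the bag-of-words feature function are toy assumptions for this example; a real classifier needs far more documents and usually better features.

```python
from nltk import NaiveBayesClassifier


def bag_of_words(document):
    """Represent a document by the set of lowercase words it contains."""
    return {word: True for word in document.lower().split()}


# Toy labelled documents: (text, category).
training_documents = [
    ("the team won the match", "sports"),
    ("the player scored a goal", "sports"),
    ("the election results were announced", "politics"),
    ("the minister gave a speech", "politics"),
]

# NLTK expects a list of (feature dictionary, label) pairs.
training_set = [(bag_of_words(text), category)
                for text, category in training_documents]
classifier = NaiveBayesClassifier.train(training_set)

# Classify documents that were not part of the training set.
print(classifier.classify(bag_of_words("the player won the match")))
# -> sports
print(classifier.classify(bag_of_words("the minister announced the results")))
# -> politics
```

Words that never appeared in training are ignored by the classifier, so new documents are judged only on the vocabulary it has already seen.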
Relation Extraction TBD
AIM : Identify the entities and the relationship between them.