Monday, December 10, 2012
Blogger preview icon similar to sencha loading icon
Login option on Wikipedia pages
Saturday, March 31, 2012
How to convert .flv files to .mpg
My laptop could play .flv files, but my DVD player couldn't read that format.
So I started looking into how to convert .flv files into .mpg. Many programs are available for this purpose, but the easiest way I found is to use FFmpeg.
Below are the steps to do this on an Ubuntu machine.
# To install ffmpeg
But the above command failed for some .flv files with the below error
After reading the man pages, I tried the below command and it worked. It tries to auto-detect the frame rate and uses that for the conversion.
Friday, March 30, 2012
NLTK : A Python module for Natural Language Processing
This page describes my exploration of the steps involved in Natural Language Processing and how to perform these steps in Python using the "nltk" module.
Applications of NLP
- Sentiment mining
- Information Extraction
- Document classification
- Named Entity Recognition
- Relation Extraction between entities
Phases in NLP
Sentence Tokenization
AIM : To separate out sentences from a given paragraph or text.
A simple implementation is to split the text on ".". When parsing abstracts to generate a pathway interaction database, we split sentences on ". " [a dot followed by a space]. But this approach may fail when parsing general text from the web. NLTK provides a separate "sent_tokenize" method which takes care of these kinds of issues.
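To see why the simple split can fail on general text, here is a minimal pure-Python sketch of the dot-plus-space rule. The function name and the sample sentences are my own illustration, not NLTK code; it only shows the failure mode that a trained tokenizer such as "sent_tokenize" is meant to avoid.

```python
def naive_sent_split(text):
    """Split on '. ' (a dot followed by a space), the simple rule described above."""
    parts = text.split(". ")
    # split() drops the separator, so put the dot back on every piece except the last.
    sentences = [p + "." for p in parts[:-1]] + [parts[-1]]
    return [s.strip() for s in sentences if s.strip()]


# Illustrative inputs (made up for this sketch).
clean = "Protein A binds protein B. This activates the pathway."
messy = "Dr. Smith visited the lab. He met Mr. Jones."

print(naive_sent_split(clean))  # two pieces, as intended
print(naive_sent_split(messy))  # abbreviations such as "Dr." are cut off as sentences
```

The second input really contains two sentences, but the naive rule returns four pieces because it treats every abbreviation as a sentence end.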
Example :
Output :
Word Tokenization
AIM : Identify words in a given sentence
The simple logic here is again to split on " " [space] and then ensure that words separated by multiple spaces are handled properly. NLTK provides a "word_tokenize" method.
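The multiple-space problem is easy to show with plain Python string methods. This is only a sketch with a made-up sentence; it does not use NLTK. Note that punctuation stays glued to the words in both variants, which is one of the things a dedicated word tokenizer handles for you.

```python
sentence = "NLTK  makes   tokenizing easy, right?"

# Splitting on a single space leaves empty strings wherever spaces repeat.
by_single_space = sentence.split(" ")

# split() with no argument splits on any run of whitespace instead.
by_whitespace = sentence.split()

print(by_single_space)
print(by_whitespace)
```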
Example :
Output :
Stemming and Lemmatization
AIM : To identify the root word or lemma [the word that can be found in the dictionary].
Stemming : Stemming usually refers to a crude heuristic process that chops off the ends of words [affixes] in the hope of identifying the root word.
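To make "crude heuristic" concrete, here is a toy suffix stripper. It is a sketch only: the suffix list and the minimum-length rule are assumptions I picked for illustration, and it is far simpler than the real stemming algorithms that NLTK implements.

```python
# Illustrative suffix list, checked in this order (an assumption for this sketch).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Chop off the first matching suffix, with no dictionary check at all."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["jumping", "jumped", "boxes", "cats", "running", "is"]:
    print(w, "->", crude_stem(w))
```

"running" comes out as "runn", which is not a word. That is the price of chopping without looking anything up.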
Example :
Output :
Lemmatization : Identify and remove the affixes only if the resulting word is present in a given vocabulary, so that the result is always a dictionary word.
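The difference from stemming is the vocabulary check, which can be sketched as follows. The tiny vocabulary and the suffix list are hypothetical and chosen only for this example; a real lemmatizer works against a full dictionary.

```python
# Hypothetical mini-dictionary and suffix list, for illustration only.
VOCABULARY = {"run", "box", "cat", "jump", "bus"}
SUFFIXES = ("ning", "ing", "ed", "es", "s")


def crude_lemmatize(word):
    """Strip a suffix only when what remains is a known dictionary word."""
    if word in VOCABULARY:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            if candidate in VOCABULARY:
                return candidate
    # No suffix led to a dictionary word, so leave the input unchanged.
    return word


for w in ["running", "buses", "jumped", "thing", "bus"]:
    print(w, "->", crude_lemmatize(w))
```

Here "thing" is left alone because "th" is not in the vocabulary, whereas a blind suffix stripper would happily cut it.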
Example :
Output :
More information about the algorithms used can be found at Stemming & Lemmatization
Parts of Speech Tagging
AIM : To identify the part of speech of each word in the text. Some tags output by the POS tagger from NLTK are listed below.
| Tag | Description |
| CC | Coordinating conjunction |
| CD | Cardinal number |
| IN | Preposition or subordinating conjunction |
| JJ | Adjective |
| JJR | Adjective, comparative |
| JJS | Adjective, superlative |
| NN | Noun, singular or mass |
| NNS | Noun, plural |
| NNP | Proper noun, singular |
| NNPS | Proper noun, plural |
| PDT | Predeterminer |
| POS | Possessive ending |
| PRP | Personal pronoun |
| PRP$ | Possessive pronoun |
| RB | Adverb |
| RBR | Adverb, comparative |
| RBS | Adverb, superlative |
| RP | Particle |
| SYM | Symbol |
| UH | Interjection |
| VB | Verb, base form |
| VBD | Verb, past tense |
The entire list can be accessed at Penn TreeBank parts of speech
Example :
Output :
Named Entity Recognition
AIM : Identify nouns mentioned in the sentence
Identify the parts of speech of words in the sentence and group consecutively occurring nouns
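The grouping step can be sketched without a tagger by starting from hand-tagged input. The (word, tag) pairs below are written by hand with Penn Treebank tags as an assumption; in practice they would come from a POS tagger. The function name is my own.

```python
from itertools import groupby

# Hand-tagged (word, tag) pairs for illustration; a POS tagger would normally produce these.
tagged = [
    ("Barack", "NNP"), ("Obama", "NNP"), ("visited", "VBD"),
    ("New", "NNP"), ("York", "NNP"), ("in", "IN"), ("March", "NNP"),
]


def group_nouns(tagged_words):
    """Join runs of consecutive noun-tagged words (NN, NNS, NNP, NNPS) into phrases."""
    phrases = []
    # groupby yields one group per run of neighbours that share the same key value.
    for is_noun, group in groupby(tagged_words, key=lambda pair: pair[1].startswith("NN")):
        if is_noun:
            phrases.append(" ".join(word for word, _tag in group))
    return phrases


print(group_nouns(tagged))
```

This finds candidate entities but does not say what kind each one is (person, place, date); that would need a further classification step.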
Example :
Output :
Document classification
AIM : Given a set of documents and their categories, train a model and use that trained model to classify other documents.
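As a rough idea of what "train a model and classify" means, here is a small Naive Bayes sketch in plain Python. Naive Bayes is my choice for the illustration, the four training documents are made up, and the function names are hypothetical. NLTK ships its own classifiers, which would be the natural choice in real use.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per category and documents per category."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in doc_counts:
        # Start from the log prior: how common the category is in training.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Made-up training data for illustration.
training = [
    ("the team won the match", "sports"),
    ("a great goal in the final game", "sports"),
    ("the election results were announced", "politics"),
    ("parliament passed the new law", "politics"),
]
model = train(training)

print(classify("the team scored a goal", model))
print(classify("the new law was passed", model))
```

With so little training data the result is only a toy, but the two phases (counting during training, scoring during classification) are the same ones a real classifier goes through.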
Example :
Relation Extraction TBD
AIM : Identify the entities and the relationship between them.
