contextera

Saturday, March 31, 2012

How to convert .flv files to .mpg

I downloaded some YouTube videos, which were in .flv format.
My laptop was able to play .flv files, but my DVD player couldn't understand that format.

Hence I started looking into how to convert .flv files to .mpg. Many tools are available for this purpose, but the easiest way I found is to use FFmpeg.
Below are the steps to do this on an Ubuntu machine.
# To install ffmpeg
But the above command failed for some .flv files with the error below.

After reading the man pages, I tried the command below and it worked. It tries to auto-detect the frame rate and uses that for the conversion.
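
For anyone who wants to drive the conversion from a script, here is a minimal Python sketch. It assumes ffmpeg is installed and on the PATH. The file names and the helper names (build_ffmpeg_command, convert) are hypothetical. The optional -r flag forces an output frame rate, which is a different workaround from the auto-detecting command mentioned above, so treat it as an assumption rather than the command I used.

```python
import subprocess


def build_ffmpeg_command(source, target, frame_rate=None):
    """Return the ffmpeg argument list for converting source to target."""
    command = ["ffmpeg", "-i", source]
    if frame_rate is not None:
        # -r placed before the output file sets the output frame rate.
        command += ["-r", str(frame_rate)]
    command.append(target)
    return command


def convert(source, target, frame_rate=None):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(source, target, frame_rate), check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(build_ffmpeg_command("video.flv", "video.mpg"))
    print(build_ffmpeg_command("video.flv", "video.mpg", frame_rate=25))
```

Calling convert("video.flv", "video.mpg") would run the plain conversion; passing frame_rate=25 is one thing to try if the plain call is rejected because of the frame rate.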
 

Friday, March 30, 2012

NLTK : A Python module for Natural Language Processing

This page describes my exploration of the steps involved in Natural Language Processing and how to perform these steps in Python using the "nltk" module.

Applications of NLP

  • Sentiment mining
  • Information Extraction
  • Document classification
  • Named Entity Recognition
  • Relation Extraction between entities

Phases in NLP

Sentence Tokenization

AIM : To separate out sentences from a given paragraph or text.

A simple implementation is to split the text on ".". When parsing abstracts to generate a pathway interaction database, we split sentences on ". " [a dot followed by a space]. But when parsing general text from the web, this approach may fail. NLTK provides a separate "sent_tokenize" method which takes care of these kinds of issues.

Example :
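
A minimal sketch of the naive ". " split in plain Python, to show where it works and where it breaks. The helper name naive_sentences and the sample texts are made up for illustration; this is not NLTK's algorithm.

```python
def naive_sentences(text):
    """Split text on '. ' (a dot followed by a space)."""
    parts = text.split(". ")
    # Every piece except the last lost its trailing dot in the split; restore it.
    sentences = [part + "." for part in parts[:-1]]
    if parts[-1].strip():
        sentences.append(parts[-1])
    return [s.strip() for s in sentences if s.strip()]


simple = "NLTK is a Python module. It helps with text processing."
tricky = "Dr. Smith studies pathways. He uses NLTK."

print(naive_sentences(simple))
# ['NLTK is a Python module.', 'It helps with text processing.']
print(naive_sentences(tricky))
# ['Dr.', 'Smith studies pathways.', 'He uses NLTK.']
```

The second result splits after the abbreviation "Dr.", which is the kind of case a dedicated sentence tokenizer is meant to handle.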

Output :

Word Tokenization

AIM : Identify words in a given sentence

The simple logic again is to split on " " [space] and then ensure that words separated by multiple spaces are handled properly. NLTK provides a "word_tokenize" method.

Example :
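
A minimal sketch of the space-splitting logic in plain Python (the sample sentence is made up). It shows why multiple spaces need care, and what the simple approach still leaves unsolved.

```python
sentence = "NLTK   makes tokenizing text easy, doesn't it?"

# Splitting on a single space leaves empty strings where spaces repeat.
pieces = sentence.split(" ")
print(pieces)
# ['NLTK', '', '', 'makes', 'tokenizing', 'text', 'easy,', "doesn't", 'it?']

# split() with no argument collapses any run of whitespace.
words = sentence.split()
print(words)
# ['NLTK', 'makes', 'tokenizing', 'text', 'easy,', "doesn't", 'it?']
```

Punctuation stays glued to the words ("easy," and "it?"), which is one reason to prefer a real tokenizer over plain splitting.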

Output :

Stemming and Lemmatization

AIM : To identify the root word or lemma [the word that can be found in the dictionary].

Stemming : Stemming usually refers to a crude heuristic process that chops off the ends of words [affixes] in the hope of identifying the root word.

Example :
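
A toy suffix stripper in plain Python, to illustrate what "chopping off the ends of words" means. The suffix list and the three-letter minimum are assumptions chosen for this sketch; real stemmers such as the Porter stemmer shipped with NLTK use far more careful rules.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def crude_stem(word):
    """Chop the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["walking", "walked", "walks", "running", "news"]:
    print(word, "->", crude_stem(word))
# walking -> walk
# walked -> walk
# walks -> walk
# running -> runn
# news -> new
```

The last two results show the crude part: "runn" is not a word, and "news" is wrongly reduced to "new".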

Output :

Lemmatization : Identify the affixes and remove them only if the resulting word is present in a given vocabulary [dictionary].

Example :
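
A toy sketch of the vocabulary check in plain Python. The tiny VOCABULARY set and the suffix list are hypothetical stand-ins; NLTK's WordNetLemmatizer looks words up in WordNet instead.

```python
VOCABULARY = {"walk", "run", "news", "box"}  # hypothetical tiny dictionary
SUFFIXES = ("ing", "ed", "es", "s")


def simple_lemma(word):
    """Strip a suffix only when the remaining word is in VOCABULARY."""
    if word in VOCABULARY:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            if candidate in VOCABULARY:
                return candidate
    return word


for word in ["walking", "boxes", "news", "running"]:
    print(word, "->", simple_lemma(word))
# walking -> walk
# boxes -> box
# news -> news
# running -> running
```

Unlike a bare suffix chopper, "news" is left alone because it is already in the vocabulary. "running" is unchanged only because this sketch does not handle doubled consonants.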

Output :

More information about the algorithms used can be found at Stemming & Lemmatization.

Parts of Speech Tagging

AIM : To identify the part of speech of each word in the text. Some of the tags output by the POS tagger in NLTK are listed below.

Tag     Description
CC      Coordinating conjunction
CD      Cardinal number
IN      Preposition or subordinating conjunction
JJ      Adjective
JJR     Adjective, comparative
JJS     Adjective, superlative
NN      Noun, singular or mass
NNS     Noun, plural
NNP     Proper noun, singular
NNPS    Proper noun, plural
PDT     Predeterminer
POS     Possessive ending
PRP     Personal pronoun
PRP$    Possessive pronoun
RB      Adverb
RBR     Adverb, comparative
RBS     Adverb, superlative
RP      Particle
SYM     Symbol
UH      Interjection
VB      Verb, base form
VBD     Verb, past tense

The entire list can be found at Penn Treebank parts of speech.

Example :

Output :

Named Entity Recognition

AIM : Identify the named entities mentioned in the sentence. The simple approach used here treats runs of nouns as candidate entities.

Identify the part of speech of each word in the sentence and group consecutively occurring nouns.

Example :
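
A minimal sketch of the grouping step in plain Python. The input is hand-tagged with Penn Treebank tags and stands in for POS tagger output; the helper name group_nouns and the sample sentence are made up for illustration.

```python
from itertools import groupby


def is_noun(pair):
    """True when the tag of a (word, tag) pair is one of the NN* noun tags."""
    return pair[1].startswith("NN")


def group_nouns(tagged_words):
    """Join each run of consecutive noun-tagged words into one phrase."""
    phrases = []
    for noun_run, run in groupby(tagged_words, key=is_noun):
        if noun_run:
            phrases.append(" ".join(word for word, _ in run))
    return phrases


# Hand-tagged sentence: "Marie Curie visited New York in March"
tagged = [
    ("Marie", "NNP"), ("Curie", "NNP"), ("visited", "VBD"),
    ("New", "NNP"), ("York", "NNP"), ("in", "IN"), ("March", "NNP"),
]
print(group_nouns(tagged))
# ['Marie Curie', 'New York', 'March']
```

Grouping on the NN prefix covers NN, NNS, NNP and NNPS from the tag table above.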

Output :

Document classification

AIM : Given a set of documents and their categories, train a model and use that trained model to classify other documents.

Example :
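
A small Naive Bayes sketch in plain Python to make the train-then-classify idea concrete. The four training documents and their categories are invented for illustration, and Naive Bayes is just one possible model, chosen here as an assumption; NLTK ships its own classifiers, such as NaiveBayesClassifier.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Count words per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labelled_docs:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    return word_counts, doc_counts


def classify(text, model):
    """Return the category with the highest Naive Bayes log score."""
    word_counts, doc_counts = model
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        counts = word_counts[category]
        total_words = sum(counts.values())
        # Log prior plus add-one smoothed log likelihood of each word.
        score = math.log(doc_counts[category] / total_docs)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical training documents.
docs = [
    ("the match ended with a late goal", "sports"),
    ("the team won the league title", "sports"),
    ("the election results were announced", "politics"),
    ("parliament passed the new budget", "politics"),
]
model = train(docs)
print(classify("a goal for the team", model))               # sports
print(classify("parliament announced the budget", model))   # politics
```

Add-one smoothing keeps a word that never appeared in a category (such as "for") from driving that category's score to zero.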

Relation Extraction TBD

AIM : Identify the entities and the relationship between them.

Resources

Syntax Highlighter

In my earlier posts, I used to write code as plain text, which would lose its indentation when copied and pasted. After some Google searches and some surprising results, I came across a link that explains how to add syntax-highlighted code snippets to both websites and blogs. I am talking about Syntax Highlighter. Below is just a sample to test it with some Python code. Syntax Highlighter is a JavaScript library and it supports around 20 programming languages. Refer to Syntax Highlighter supported languages.
Refer to http://geektalkin.blogspot.in/2009/11/embed-code-syntax-highlighting-in-blog.html for a detailed explanation of how to set up Syntax Highlighter on your blog or website.

Wednesday, March 28, 2012

Check this out ...

YUI Component Gallery

A repository of Yahoo UI components that can be readily embedded into web pages. Cool stuff.

Thursday, March 15, 2012

JavaFX and Desktop

The link below has a nice tutorial on the features of JavaFX that can be used in desktop UI applications: http://docs.oracle.com/javafx/1.3/tutorials/ui/overview/index.html