Tech Updates: 2012

Monday, December 10, 2012

Blogger preview icon similar to Sencha loading icon

Another observation: when I previewed my earlier blog post, Blogspot displayed a nice loading image, http://www.blogger.com/img/gear.gif, which looked like the one that Ext JS by Sencha (http://www.sencha.com/) uses.

I am learning Ext JS for an office project, which is how I recognized it.

GMail was down for a while

Login option on Wikipedia pages

I was trying to see how Google News works and came across a link to Wikipedia on the same topic. I was surprised to see "Create Account" and "Log In" links at the top right-hand side of the page. Just sharing the info.

Saturday, March 31, 2012

How to convert .flv files to .mpg

I downloaded some YouTube videos, which were in .flv format.
My laptop could play .flv files, but my DVD player could not read that format.

So I started looking into how to convert .flv files to .mpg. Plenty of software is available for this, but the easiest way I found is to use FFmpeg.
Below are the steps to do this on an Ubuntu machine.
# To install ffmpeg
However, the above command failed for some .flv files with the error below.

After reading the man pages, I tried the command below and it worked. It tries to auto-detect the frame rate and uses that for the conversion.

Friday, March 30, 2012

NLTK : A Python module for Natural Language Processing

This page covers my exploration of the steps involved in Natural Language Processing (NLP) and how to perform them in Python using the "nltk" module.

Applications of NLP
Phases in NLP
Resources

Applications of NLP

Sentiment mining
Information Extraction
Document classification
Named Entity Recognition
Relation Extraction between entities

Phases in NLP

Sentence Tokenization

AIM : To separate out sentences from a given paragraph or text.

A simple implementation is to split the text on ".". When parsing abstracts for pathway interaction database generation, we split sentences on ". " [dot followed by a space]. But when parsing general text from the web, this approach may fail. NLTK provides a separate "sent_tokenize" method which takes care of these kinds of issues.
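To see why the dot-plus-space rule breaks on general text, here is a minimal sketch of that naive splitter in plain Python. The function name and both sample strings are made up for illustration; this is not NLTK's algorithm, only the simple approach described above.

```python
def split_on_dot_space(text):
    """Naive splitter: break wherever a period is followed by a space."""
    parts = text.split(". ")
    # split() drops the delimiter, so put the period back on every piece
    # except the last one, which keeps whatever ending it already had.
    sentences = [p + "." for p in parts[:-1]] + [parts[-1]]
    return [s.strip() for s in sentences if s.strip()]


# Hypothetical sample texts
abstract = "Protein A binds protein B. The complex activates gene C."
web_text = "Dr. Smith visited the lab. He stayed for 2.5 hours."

print(split_on_dot_space(abstract))  # two clean sentences
print(split_on_dot_space(web_text))  # "Dr." is wrongly split off as its own sentence
```

The abstract-style text splits cleanly, but the abbreviation "Dr." in the second text produces a bogus one-word sentence. That is the kind of case "sent_tokenize" is meant to handle.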

Example :

Output :

Word Tokenization

AIM : Identify words in a given sentence

The simple approach, again, is to split on " " [space] and then make sure that words separated by multiple spaces are handled properly. NLTK provides a "word_tokenize" method for this.
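A rough sketch of the simple approach in plain Python, for comparison. The helper names and the sample sentence are my own, and the regular expression is just one possible way to peel punctuation off words; it is not what "word_tokenize" does internally.

```python
import re


def split_words(sentence):
    """Whitespace tokenizer: split() with no argument collapses runs of spaces."""
    return sentence.split()


def split_words_and_punct(sentence):
    """Slightly richer sketch: also separate punctuation marks from words."""
    return re.findall(r"\w+|[^\w\s]", sentence)


# Hypothetical sample sentence with extra spaces after the first word
sentence = "NLTK   makes tokenizing easy, right?"

print(sentence.split(" "))              # naive split leaves empty strings behind
print(split_words(sentence))            # multiple spaces handled, punctuation still attached
print(split_words_and_punct(sentence))  # punctuation becomes separate tokens
```

Splitting on a single space leaves empty strings wherever spaces repeat, which is the problem mentioned above. Calling split() with no argument avoids that, but punctuation still sticks to the words unless it is handled separately.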

Example :

Output :

Stemming and Lemmatization

AIM : To identify the root word or lemma [the word that can be found in the dictionary].

Stemming : Stemming usually refers to a crude heuristic process that chops off the ends of words [affixes] in the hope of identifying the root word.
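To make "crude heuristic" concrete, here is a toy suffix chopper in plain Python. The suffix list and the minimum stem length are arbitrary choices of mine. Real stemmers, such as the Porter stemmer that NLTK includes, use far more careful rules.

```python
# Illustrative suffix list; "es" is checked before "s" so "boxes" becomes "box"
SUFFIXES = ("ing", "ly", "ed", "es", "s")


def crude_stem(word, min_stem_len=3):
    """Chop the first matching suffix, keeping at least min_stem_len characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem_len:
            return lowered[: -len(suffix)]
    return lowered


for w in ["walking", "walked", "walks", "quickly", "running", "was"]:
    print(w, "->", crude_stem(w))
```

The "walk" variants all collapse to "walk", but "running" becomes "runn", which is not a real word. A stemmer does not check its result against a dictionary, so this kind of output is expected.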

Example :

Output :

Lemmatization : Remove the affixes only when the resulting word is present in a given vocabulary [dictionary].
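The difference from stemming is the dictionary check. Below is a toy sketch of that idea in plain Python. The vocabulary, suffix list, and function name are all invented for illustration. NLTK's own lemmatizer works against WordNet rather than a hand-written set like this.

```python
# Toy dictionary of known base forms (an assumption for this sketch)
VOCABULARY = {"walk", "run", "box", "lie"}

SUFFIXES = ("ing", "ed", "es", "s")


def toy_lemmatize(word, vocabulary=VOCABULARY):
    """Strip a suffix only if what remains is a known dictionary word."""
    lowered = word.lower()
    if lowered in vocabulary:
        return lowered
    for suffix in SUFFIXES:
        if lowered.endswith(suffix):
            candidate = lowered[: -len(suffix)]
            if candidate in vocabulary:
                return candidate
    # No dictionary form found, so leave the word unchanged
    return lowered


for w in ["walks", "boxes", "running", "lying"]:
    print(w, "->", toy_lemmatize(w))
```

Here "walks" and "boxes" reduce to dictionary words, while "running" and "lying" are left alone because "runn" and "ly" are not in the vocabulary. Mapping them to "run" and "lie" would need extra rules or a richer dictionary.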

Example :

Output :

More information about the algorithms used can be found at Stemming & Lemmatization

Parts of Speech Tagging

AIM : To identify the part of speech of each word in the text. Some of the tags output by the NLTK POS tagger are listed below.

Tag	Description
CC	Coordinating conjunction
CD	Cardinal number
IN	Preposition or subordinating conjunction
JJ	Adjective
JJR	Adjective, comparative
JJS	Adjective, superlative
NN	Noun, singular or mass
NNS	Noun, plural
NNP	Proper noun, singular
NNPS	Proper noun, plural
PDT	Predeterminer
POS	Possessive ending
PRP	Personal pronoun
PRP$	Possessive pronoun
RB	Adverb
RBR	Adverb, comparative
RBS	Adverb, superlative
RP	Particle
SYM	Symbol
UH	Interjection
VB	Verb, base form
VBD	Verb, past tense

The entire list can be accessed at Penn TreeBank parts of speech
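A tagger returns its result as (word, tag) pairs, so a small lookup table makes the output easier to read. This sketch uses a handful of tags from the table above. The sentence is tagged by hand for illustration and is not actual tagger output.

```python
# Subset of the Penn Treebank tags listed in the table above
TAG_DESCRIPTIONS = {
    "CD": "Cardinal number",
    "IN": "Preposition or subordinating conjunction",
    "JJ": "Adjective",
    "NN": "Noun, singular or mass",
    "NNS": "Noun, plural",
    "NNP": "Proper noun, singular",
    "VBD": "Verb, past tense",
}


def describe(tagged_tokens):
    """Attach a human-readable description to each (word, tag) pair."""
    return [
        (word, tag, TAG_DESCRIPTIONS.get(tag, "unknown tag"))
        for word, tag in tagged_tokens
    ]


# Hand-tagged example sentence (an assumption, not produced by a tagger)
tagged = [("John", "NNP"), ("bought", "VBD"), ("two", "CD"), ("red", "JJ"), ("apples", "NNS")]

for word, tag, description in describe(tagged):
    print(word, tag, description)
```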

Example :

Output :

Named Entity Recognition

AIM : Identify nouns mentioned in the sentence

Identify the parts of speech of the words in the sentence, then group consecutively occurring nouns into a single entity.
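The grouping step can be sketched in plain Python once the words are tagged. The tagged sentence below is written by hand as an assumption, and the function name is my own. Every tag starting with "NN" (NN, NNS, NNP, NNPS) is treated as a noun.

```python
from itertools import groupby


def group_consecutive_nouns(tagged_tokens):
    """Join each run of adjacent noun-tagged words into one phrase."""
    phrases = []
    runs = groupby(tagged_tokens, key=lambda pair: pair[1].startswith("NN"))
    for is_noun, run in runs:
        if is_noun:
            phrases.append(" ".join(word for word, _tag in run))
    return phrases


# Hand-tagged example sentence (an assumption, not produced by a tagger)
tagged = [
    ("Barack", "NNP"), ("Obama", "NNP"), ("visited", "VBD"),
    ("New", "NNP"), ("York", "NNP"), ("in", "IN"), ("March", "NNP"),
]

print(group_consecutive_nouns(tagged))
```

For this sentence the groups are "Barack Obama", "New York", and "March". Note that the approach picks up any noun run, so common nouns are grouped too unless the tags are filtered down to NNP and NNPS.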

Example :

Output :

Document classification

AIM : Given a set of documents and their categories, train a model and use that trained model to classify other documents.
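As a rough illustration of the train-then-classify flow, here is a small hand-rolled naive Bayes classifier over word counts in plain Python. Naive Bayes is my choice for the sketch, and the training sentences and category names are made up. NLTK ships its own classifiers, which would normally be used instead.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Collect per-category word counts from (text, category) pairs."""
    word_counts = defaultdict(Counter)  # category -> word -> count
    doc_counts = Counter()              # category -> number of documents
    vocabulary = set()
    for text, category in labeled_docs:
        words = text.lower().split()
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in doc_counts:
        # Log prior plus log likelihood of each word, with add-one smoothing
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            count = word_counts[category][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical training documents and categories
training = [
    ("the team won the match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the new phone has a fast processor", "tech"),
    ("the laptop ships with more memory", "tech"),
]
model = train(training)

print(classify("the goal won the match", model))
print(classify("a phone with more memory", model))
```

With this tiny training set, the first test sentence comes out as "sports" and the second as "tech". A real setup would use far more documents and hold some of them back to measure accuracy.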

Example :

Relation Extraction TBD

AIM : Identify the entities and the relationship between them.

Resources

Syntax Highlighter

In my earlier posts, I used to paste code as plain text, and it would lose its indentation when copied. After some Google searches and some surprising results, I came across this link, which explains how to add syntax-highlighted code snippets to both websites and blogs. I am talking about Syntax Highlighter. Below is just a sample to test it with some Python code. Syntax Highlighter is a JavaScript library, and it supports around 20 programming languages. Refer to Syntax Highlighter supported languages.
Refer to http://geektalkin.blogspot.in/2009/11/embed-code-syntax-highlighting-in-blog.html for a detailed explanation of how to set up Syntax Highlighter on your blog or website.

Wednesday, March 28, 2012

Check this out ...

YUI Component Gallery

A repository of Yahoo UI components that can be readily embedded into web pages. Cool stuff.

Thursday, March 15, 2012

JavaFX and Desktop

The link below has a nice tutorial on JavaFX features that can be used in desktop UI applications: http://docs.oracle.com/javafx/1.3/tutorials/ui/overview/index.html

Tech Updates

contextera
