contextera

Wednesday, March 15, 2017

Working on Classification algorithm



After a long time, I got a chance today to work on classification in Python. I picked the world-favorite Iris dataset.

One of my teams submitted a very interesting idea as part of a hackathon and needed help with the classification logic. I am not revealing the idea itself, as the team might file a patent for it.

The code below is what I wrote to demonstrate how they can use classification to solve their problem.

I used the Iris dataset available at https://archive.ics.uci.edu/ml/datasets/Iris as a sample dataset.

To give a brief overview of this dataset: Iris is a flower, and the dataset contains measurements of 150 such flower samples.

Each sample (each row) in this dataset has the following 5 columns:
  1. sepal length in cm
  2. sepal width in cm
  3. petal length in cm
  4. petal width in cm
  5. class: one of "Iris Setosa", "Iris Versicolour" or "Iris Virginica"
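To get a quick feel for the data, here is a small sketch that loads the dataset and counts the samples per class. As an assumption on my part (to avoid a network fetch), it uses scikit-learn's bundled copy of Iris, whose class names drop the "Iris-" prefix; loading the UCI CSV directly works the same way.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the bundled copy of the Iris dataset (the same 150 samples as the UCI file)
data = load_iris()
iris = pd.DataFrame(data.data, columns=["sepal length", "sepal width",
                                        "petal length", "petal width"])
iris["classification"] = [data.target_names[t] for t in data.target]

print(iris.shape)                             # 150 rows, 5 columns
print(iris["classification"].value_counts())  # 50 samples of each class
```

The dataset is perfectly balanced, which is one reason it is such a popular first example for classification.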

In the dataset, some rows use "Versicolor" instead of "Versicolour" (the US versus British spelling of "colour"), so I had to handle both spellings in my code.

Now, coming to the classification logic itself: as with any classification problem, I used this dataset to train a model. The model predicts the flower class given the four measurements [sepal length, sepal width, petal length, petal width].

Finally, below is the code that does the classification:


import pandas as pd
from sklearn.linear_model import SGDClassifier


def get_iris_data():
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
    col_names = ["sepal length", "sepal width", "petal length", "petal width", "classification"]

    # The UCI file has no header row, so supply the column names ourselves
    iris = pd.read_csv(url, header=None, names=col_names)
    return iris


def classify(training_features, training_classification, test_features):
    # A linear SVM trained with stochastic gradient descent
    classifier = SGDClassifier(loss="hinge", penalty="l2")
    classifier.fit(training_features, training_classification)
    pred_class = classifier.predict([test_features])
    return pred_class


def print_class(ind):
    if ind == 1:
        return "Iris-setosa"
    elif ind == 2:
        return "Iris-versicolour"
    elif ind == 3:
        return "Iris-virginica"


def get_numeric_classification(iris_classification):
    trans_classification = []
    for iclass in iris_classification:
        if iclass == "Iris-setosa":
            trans_classification.append(1)
        # Accept both the US and British spellings
        elif iclass in ("Iris-versicolour", "Iris-versicolor"):
            trans_classification.append(2)
        elif iclass == "Iris-virginica":
            trans_classification.append(3)

    return trans_classification


# Get the Iris dataset
iris = get_iris_data()

# Except for "classification", all other columns are used to build the model
train_columns = ["sepal length", "sepal width", "petal length", "petal width"]

# Separate out the features used to build the model from the main dataset
iris_features = iris[train_columns]

# Separate out the classification column used to train the model
iris_classification = iris.classification

# Since the classifier cannot take strings, transform the class labels into integers
iris_transformed_classification = get_numeric_classification(iris_classification)

# Sample test data
test_data = [1.1, 2.3, 4.5, 2.8]

# The predicted class of the sample test data
pred_class = classify(iris_features, iris_transformed_classification, test_data)

print(pred_class)

# Transform the predicted class back into the class name
print(print_class(pred_class[0]))
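One thing the demo above does not do is measure how good the model actually is. A common sanity check is to hold out part of the data and score the model on it. The sketch below is one possible way to do that; it uses scikit-learn's bundled Iris copy and adds a StandardScaler, since SGD-based classifiers are sensitive to feature scale (both are my additions, not part of the original demo).

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples so the model is scored on flowers it never saw,
# stratified so each class is equally represented in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Standardize the features before fitting, then train the same kind of
# hinge-loss SGD classifier as in the demo above
model = make_pipeline(StandardScaler(),
                      SGDClassifier(loss="hinge", penalty="l2", random_state=42))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("hold-out accuracy: %.2f" % accuracy)
```

Fixing the random seeds makes the run reproducible; without the scaler, the accuracy of SGD on this dataset can vary quite a bit from run to run.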
