Saturday 15 September 2012

Classification and Prediction



  • Classification :


" Assigning Data cases to one of the fixed number of classes (Represented by Nominal Numbers) "

" Classification is the separation of ordering of classes " 


  • Prediction :


" It is one type of process with the use of predictors in the prediction of one or more than one variable  "


  • Applications of Classification and Prediction :


- Weather Analysis
- Prediction of Null space
- Cryptography
- Target Marketing
- Medical Diagnosis
- Fraud Detection
- Performance Prediction
- Manufacturing


  • Five criteria for evaluation of Classification and Prediction :
  1. Predictive Accuracy
  2. Computational Speed
  3. Robustness
  4. Scalability
  5. Interpretability

  • Classification by Decision Tree Induction
    • Decision tree induction is the learning of decision trees from class-labeled training tuples.
    • A decision tree is a flowchart-like structure, where each non-leaf (internal) node denotes a test on an attribute.
    • Each branch represents an outcome of the test, and each leaf node holds a class label.
    • The topmost node in a tree is the root node.
    • How are decision trees used for classification ? Given a tuple X whose class label is unknown, the attribute values of X are tested against the tree; a path is traced from the root to a leaf node, which holds the class prediction.
    • Why are decision tree classifiers so popular ?
Because they do not require any domain knowledge or parameter setting, and are therefore appropriate for exploratory knowledge discovery.
Decision trees can handle high-dimensional data.
      • The learned trees are generally intuitive and easy for humans to interpret.
      • The learning and classification steps are simple and fast.
      • Decision tree classifiers generally have good accuracy.
    • Several algorithms have been developed for decision tree induction. The most popular of them are ID3, C4.5, and CART.
    • C4.5 is the successor of ID3. The book CART (Classification and Regression Trees) describes the generation of binary decision trees.
    • Most algorithms for decision tree induction follow a similar top-down approach.
    • Basic algorithm for decision tree induction (Generate_decision_tree) :
      • Create a node N;
      • If the tuples in D are all of the same class C, then
        • Return N as a leaf node labeled with class C;
      • If attribute_list is empty, then
        • Return N as a leaf node labeled with the majority class in D;
      • Apply Attribute_selection_method(D, attribute_list) to find the "best" splitting_criterion;
      • Label node N with splitting_criterion;
      • If the splitting attribute is discrete-valued and multiway splits are allowed, then // not restricted to binary trees
        • attribute_list <- attribute_list - splitting_attribute; // remove the splitting attribute
      • For each outcome j of splitting_criterion : // partition the tuples and grow subtrees for each partition
        • Let Dj be the set of data tuples in D satisfying outcome j; // a partition
        • If Dj is empty, then
          • Attach a leaf labeled with the majority class in D to node N;
        • Else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
      • End for
      • Return N;
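
A minimal Python sketch of this procedure follows (assuming categorical attributes, multiway splits, and information gain as the selection measure; all function and variable names here are illustrative, not from any particular library):

import math
from collections import Counter

def info(tuples, label):
    """Expected information (entropy) Info(D) of a set of class-labeled tuples."""
    counts = Counter(t[label] for t in tuples)
    total = len(tuples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def attribute_selection_method(tuples, attribute_list, label):
    """Return the attribute with the highest information gain."""
    base = info(tuples, label)
    def gain(attr):
        info_a = 0.0
        for value in set(t[attr] for t in tuples):
            dj = [t for t in tuples if t[attr] == value]
            info_a += (len(dj) / len(tuples)) * info(dj, label)
        return base - info_a
    return max(attribute_list, key=gain)

def generate_decision_tree(tuples, attribute_list, label):
    classes = [t[label] for t in tuples]
    majority = Counter(classes).most_common(1)[0][0]
    if len(set(classes)) == 1:            # all tuples belong to the same class C
        return classes[0]                 # leaf labeled with C
    if not attribute_list:                # attribute list is empty
        return majority                   # leaf labeled with the majority class in D
    split_attr = attribute_selection_method(tuples, attribute_list, label)
    node = {"attribute": split_attr, "branches": {}}
    remaining = [a for a in attribute_list if a != split_attr]   # remove the splitting attribute
    for value in set(t[split_attr] for t in tuples):             # one branch per outcome
        dj = [t for t in tuples if t[split_attr] == value]
        node["branches"][value] = generate_decision_tree(dj, remaining, label)
    return node

# Tiny made-up usage example
data = [
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no",  "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
print(generate_decision_tree(data, ["outlook", "windy"], "play"))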

  • Attribute Selection Measures 
    • An attribute selection measure is a heuristic for selecting the splitting criterion that "best" separates a given data partition D
    • of class-labeled training tuples into individual classes.
    • Ideally, the best splitting criterion is the one that results in the purest partitions.
    • " Attribute selection measures are also known as splitting rules because they determine how the tuples at a given node are to be split. "
    • The attribute having the best score for the measure is chosen as the splitting attribute for the given tuples.
    • Three attribute selection measures are described here :
      • Information Gain
      • Gain Ratio
      • Gini Index
    • Information Gain :
      • ID3 uses information gain as its attribute selection measure.
      • The attribute with the highest information gain is chosen as the splitting attribute for node N.
      • This attribute minimizes the information needed to classify the tuples in the resulting partitions (i.e., it minimizes their impurity).
      • It also minimizes the expected number of tests needed to classify a tuple and helps ensure that a simple tree is found.
      • The expected information needed to classify a tuple in D is given by
        • Info(D) = - Σ pi log2(pi), where i runs from 1 to m.
          • Here pi is the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|.
          • A log function to base 2 is used because the information is encoded in bits.
          • " Info(D) is just the average amount of information needed to identify the class label of a tuple in D. "
          • How much more information would still be needed (after partitioning on attribute A) in order to arrive at an exact classification ?
            • It is measured by InfoA(D) = Σ ( |Dj| / |D| ) x Info(Dj), where j runs from 1 to v.
            • The smaller the expected information still required, the greater the purity of the partitions.
          • Information gain is defined as the difference between the original information requirement and the new requirement (after partitioning on A).
          • Gain(A) = Info(D) - InfoA(D)
          • Gain(A) tells us how much would be gained by branching on A.
          • The attribute A with the highest information gain is chosen as the splitting attribute at node N.
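
As a worked illustration (the numbers are invented for this example): suppose D contains 10 tuples, 6 of class "yes" and 4 of class "no", and attribute A splits D into D1 (4 yes, 0 no) and D2 (2 yes, 4 no). A few lines of Python verify the computation:

import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

info_d  = entropy([6, 4])                                      # Info(D)  ~ 0.971 bits
info_ad = (4/10) * entropy([4, 0]) + (6/10) * entropy([2, 4])  # InfoA(D) ~ 0.551 bits
gain_a  = info_d - info_ad                                     # Gain(A)  ~ 0.420 bits
print(round(info_d, 3), round(info_ad, 3), round(gain_a, 3))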

    • Gain Ratio :
      • Information gain is biased toward attributes having a large number of distinct values.
      • Partitioning on such a many-valued attribute is not a good idea.
      • Example : suppose we partition on an attribute such as Product_Id. The split produces a very large number of branches, each of which is pure but contains only a single tuple.
      • Such a partition is useless for classification, so this kind of attribute should not be chosen as the splitting attribute.
      • The C4.5 algorithm uses an extension of information gain known as gain ratio.
      • It overcomes this bias by applying a kind of normalization to information gain using a "split information" value.
      • SplitInfoA(D) = - Σ ( |Dj| / |D| ) x log2( |Dj| / |D| ), where j runs from 1 to v.
      • This value represents the potential information generated by splitting the training data set D into v partitions.
      • Then GainRatio(A) = Gain(A) / SplitInfoA(D).
      • The attribute with the maximum gain ratio is selected as the splitting attribute.
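
Continuing the invented numbers from the information gain example above (a split of the 10 tuples into partitions of size 4 and 6), a short sketch of the gain ratio computation:

import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

gain_a     = 0.420                                 # Gain(A) from the previous example
split_info = entropy([4, 6])                       # SplitInfoA(D) uses partition sizes only
gain_ratio = gain_a / split_info
print(round(split_info, 3), round(gain_ratio, 3))  # ~0.971 and ~0.433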


    • Gini Index :
      • It is used in the CART (Classification and Regression Trees) method.
      • It measures the impurity of D, a data partition or set of training tuples.
      • Gini(D) = 1 - Σ pi^2, where i runs from 1 to m and pi is the probability that a tuple in D belongs to class Ci.
      • The Gini index considers a binary split for each attribute.
      • If A has v possible values, then there are 2^v possible subsets of those values that could define a split.
      • When considering a binary split, we compute a weighted sum of the impurity of each resulting partition : GiniA(D) = ( |D1| / |D| ) x Gini(D1) + ( |D2| / |D| ) x Gini(D2).
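
A small illustrative computation (reusing the invented counts from the earlier examples) of the Gini index before and after a binary split:

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

gini_d    = gini([6, 4])                                   # Gini(D) = 1 - (0.6^2 + 0.4^2) = 0.48
gini_ad   = (4/10) * gini([4, 0]) + (6/10) * gini([2, 4])  # weighted Gini of the binary split on A
reduction = gini_d - gini_ad                               # the attribute maximizing this reduction is chosen
print(gini_d, round(gini_ad, 3), round(reduction, 3))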

Tree Pruning :


    • When a decision tree is built, many of its branches will reflect anomalies in the training data due to noise or outliers.
    • Tree pruning addresses this problem of overfitting the data.
    • There are two common approaches to tree pruning (a rough illustration follows below) :
    • 1. Prepruning (halt tree construction early)  2. Postpruning (remove subtrees from a fully grown tree)
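
As a rough illustration only (using scikit-learn, which is not mentioned in these notes; treat the parameter values as arbitrary): limiting the tree depth acts as prepruning, while cost-complexity pruning (ccp_alpha) acts as postpruning.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Prepruning: stop growing the tree early by limiting its depth.
pre_pruned = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Postpruning: grow the tree fully, then prune subtrees via cost-complexity pruning.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())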
Back Propagation :
    • Backpropagation is a neural network learning algorithm; it learns by propagating the error backward through the network and adjusting the weights.
    • It is based on the concept of a multilayer feed-forward neural network.
    • It is also widely used for prediction.
    • Input values are usually normalized to the range 0.0 to 1.0.
    • Neural networks can be used for both classification and prediction (a minimal sketch follows below) :
    • Classification (predict the class label of a given tuple)
    • Prediction (predict a continuous-valued output)
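
A minimal from-scratch sketch of backpropagation on a tiny toy problem (XOR); the network size, learning rate, and epoch count are arbitrary illustrative choices, not taken from these notes:

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs already scaled to [0.0, 1.0]
y = np.array([[0], [1], [1], [0]], dtype=float)              # class labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # input -> hidden layer weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # hidden -> output layer weights
lr = 0.5

for epoch in range(10000):
    # Forward pass: propagate the inputs through the network.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: propagate the error from the output layer back toward the inputs.
    delta_out = (out - y) * out * (1 - out)          # error term at the output layer
    delta_hid = (delta_out @ W2.T) * h * (1 - h)     # error term at the hidden layer

    # Update weights and biases in the direction that reduces the error.
    W2 -= lr * h.T @ delta_out;  b2 -= lr * delta_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ delta_hid;  b1 -= lr * delta_hid.sum(axis=0, keepdims=True)

print(out.round(2).ravel())   # predictions should move toward [0, 1, 1, 0]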

By : Ishan K. Rajani
Contact : ishanrajani@gmail.com


