Saturday, 15 September 2012

Classification and Prediction



  • Classification :


" Assigning Data cases to one of the fixed number of classes (Represented by Nominal Numbers) "

" Classification is the separation of ordering of classes " 


  • Prediction :


" It is one type of process with the use of predictors in the prediction of one or more than one variable  "


  • Applications of Classification and Prediction :


- Weather Analysis
- Prediction of Null space
- Cryptography
- Target Marketing
- Medical Diagnosis
- Fraud Detection
- Performance Prediction
- Manufacturing


  • Five criteria for evaluating Classification and Prediction methods :
  1. Predictive Accuracy
  2. Computational Speed
  3. Robustness
  4. Scalability
  5. Interpretability

  • Classification by Decision Tree Induction
    • It is the learning of decision trees from class-labeled training tuples.
    • A decision tree is a flowchart-like structure in which each non-leaf (internal) node denotes a test on an attribute.
    • Each branch represents an outcome of the test, and each leaf node holds a class label.
    • The topmost node in a tree is the root node.
    • How are decision trees used for classification ? Given a tuple X with an unknown class label, its attribute values are tested against the tree; a path is traced from the root down to a leaf node, which holds the class prediction.
    • Why are decision tree classifiers so popular ?
Because they do not require any domain knowledge or parameter setting, and are therefore appropriate for exploratory knowledge discovery.
Decision trees can handle high-dimensional data.
      • The learned trees are generally easy for humans to understand.
      • The induction process is simple and fast.
      • Decision tree classifiers have good accuracy.
    • Different algorithms have been developed for Decision Tree Induction. The most popular of them are ID3 and C4.5.
    • C4.5 is the successor of ID3. CART (Classification and Regression Trees) describes the generation of binary trees.
    • Most algorithms for decision tree induction follow such a top-down approach.
    • Basic algorithm for Decision Tree Induction (a minimal Python sketch follows this listing) : 
      • create a node N;
      • if the tuples in D are all of the same class C, then
        • return N as a leaf node labeled with class C;
      • if attribute_list is empty, then
        • return N as a leaf node labeled with the majority class in D;
      • apply attribute_selection_method(D, attribute_list) to find the "best" splitting_criterion;
      • label node N with the splitting_criterion;
      • if the splitting attribute is discrete-valued and
        • multiway splits are allowed, then // not restricted to binary trees
      • attribute_list <- attribute_list - splitting_attribute; // remove the splitting attribute
      • for each outcome j of the splitting_criterion // partition the tuples and grow subtrees for each partition
      • let Dj be the set of data tuples in D satisfying outcome j; // a partition
      • if Dj is empty, then
        • attach a leaf labeled with the majority class in D to node N;
      • else attach the node returned by generate_decision_tree(Dj, attribute_list) to node N;
      • end for
      • return N;
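
As an illustration, here is a minimal Python sketch of the top-down induction procedure above, using information gain as the attribute selection measure. The data layout (tuples as dictionaries plus a parallel list of class labels) and helper names such as entropy, info_gain and build_tree are illustrative assumptions, not from the original algorithm text.

    import math
    from collections import Counter

    def entropy(labels):
        # Info(D): expected information needed to classify a tuple in D
        total = len(labels)
        return -sum((c / total) * math.log2(c / total)
                    for c in Counter(labels).values())

    def info_gain(rows, labels, attr):
        # Gain(attr) = Info(D) - Info_attr(D) for a discrete-valued attribute
        total = len(labels)
        info_a = 0.0
        for value in set(r[attr] for r in rows):
            subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
            info_a += (len(subset) / total) * entropy(subset)
        return entropy(labels) - info_a

    def build_tree(rows, labels, attributes):
        # all tuples of the same class -> leaf labeled with that class
        if len(set(labels)) == 1:
            return labels[0]
        # attribute list empty -> leaf labeled with the majority class in D
        if not attributes:
            return Counter(labels).most_common(1)[0][0]
        # choose the attribute with the highest information gain as the splitting attribute
        best = max(attributes, key=lambda a: info_gain(rows, labels, a))
        node = {best: {}}
        remaining = [a for a in attributes if a != best]   # remove the splitting attribute
        for value in set(r[best] for r in rows):           # one branch per outcome
            # (values are drawn from the rows themselves, so partitions here are never empty)
            part_rows = [r for r in rows if r[best] == value]
            part_labels = [lab for r, lab in zip(rows, labels) if r[best] == value]
            node[best][value] = build_tree(part_rows, part_labels, remaining)
        return node

    # Example (hypothetical data): build_tree([{"outlook": "sunny"}, {"outlook": "rain"}],
    #                                         ["no", "yes"], ["outlook"])
    #   -> {'outlook': {'sunny': 'no', 'rain': 'yes'}}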

  • Attribute Selection Measures 
    • It is a heuristic for selecting the splitting criterion that "best" separates a given data partition D of class-labeled training tuples into individual classes.
    • The best splitting criterion is the one that produces the purest partitions.
    • " Attribute selection measures are also known as splitting rules because they determine how the tuples at a given node are to be split. "
    • The attribute with the best score for the measure is chosen as the splitting attribute.
    • Three popular attribute selection measures are :
      • Information Gain
      • Gain Ratio
      • Gini Index
    • Information Gain :
      • ID3 uses it as its attribute selection measure.
      • The attribute with the highest information gain is chosen as the splitting attribute for node N.
      • This attribute minimizes the information (impurity) needed to classify the tuples in the resulting partitions.
      • It also minimizes the expected number of tests needed to classify a tuple and guarantees that a simple (but not necessarily the simplest) tree is found.
      • The expected information needed to classify a tuple in D is given by
        • Info(D) = - Σ_{i=1..m} p_i log2(p_i)
          • here p_i is the probability that an arbitrary tuple in D belongs to class Ci, and it is estimated by |C_i,D| / |D|.
          • A log function to base 2 is used because the information is encoded in bits.
          • " Info(D) is just the average amount of information needed to identify the class label of a tuple in D. "
          • How much more information would we still need (after partitioning on attribute A) in order to arrive at an exact classification ?
            • It is measured by InfoA(D) = Σ_{j=1..v} ( |Dj| / |D| ) x Info(Dj)
            • The smaller the expected information still required, the greater the purity of the partitions.
          • Information gain is defined as the difference between the original information requirement and the new requirement.
          • Gain(A) = Info(D) - InfoA(D)
          • Gain(A) tells us how much would be gained by branching on A.
          • The attribute A with the highest information gain is chosen as the splitting attribute at node N. A small worked example follows.
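
As a small worked example (the counts below are hypothetical, chosen only for illustration): suppose D contains 10 tuples, 6 of class "yes" and 4 of class "no", and a discrete attribute A splits D into D1 (4 yes, 0 no) and D2 (2 yes, 4 no).

    Info(D)  = -(6/10) log2(6/10) - (4/10) log2(4/10)  ≈ 0.971 bits
    InfoA(D) = (4/10) x Info(D1) + (6/10) x Info(D2)
             = (4/10) x 0 + (6/10) x 0.918             ≈ 0.551 bits
    Gain(A)  = 0.971 - 0.551                           ≈ 0.420 bits

So branching on A removes about 0.42 bits of uncertainty about the class label.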

    • Gain Ratio :
      • Information gain is biased toward attributes having a large number of values.
      • Partitioning on an attribute with many unique values is not a good idea.
      • Ex : suppose we partition on the attribute Product_Id; this produces a large number of branches, each of which is pure but contains only a single tuple.
      • Such a split is useless for classification, so this kind of attribute should not be chosen as the splitting attribute.
      • The C4.5 algorithm uses an extension of information gain known as Gain Ratio.
      • It overcomes this bias by applying a kind of normalization to information gain using a "split information" value.
      • SplitInfoA(D) = - Σ_{j=1..v} ( |Dj| / |D| ) x log2( |Dj| / |D| )
      • The above equation represents the information generated by splitting the training data set D into v partitions.
      • Then GainRatio(A) = Gain(A) / SplitInfoA(D)
      • The attribute with the maximum gain ratio is selected as the splitting attribute. A small code sketch follows this list.
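
A minimal sketch of the split-information and gain-ratio computation, assuming the same data layout as the earlier sketch (tuples as dictionaries); the helper names are illustrative.

    import math
    from collections import Counter

    def split_info(rows, attr):
        # SplitInfoA(D): information generated by splitting D on attribute attr
        total = len(rows)
        counts = Counter(r[attr] for r in rows)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def gain_ratio(gain, rows, attr):
        # GainRatio(A) = Gain(A) / SplitInfoA(D); guard against a zero denominator
        si = split_info(rows, attr)
        return gain / si if si > 0 else 0.0

    # Example: an identifier-like attribute with one tuple per value
    rows = [{"product_id": i} for i in range(5)]
    print(split_info(rows, "product_id"))   # log2(5) ≈ 2.32, which penalizes the split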


    • Gini Index :
      • It is used in the CART (Classification and Regression Trees) method.
      • It measures the impurity of D, a data partition or set of training tuples.
      • Gini(D) = 1 - Σ_{i=1..m} p_i^2,   where p_i is the probability that a tuple in D belongs to class Ci.
      • The Gini index considers a binary split for each attribute.
      • If A is discrete-valued with v possible values, then there are 2^v possible subsets (2^v - 2 after excluding the full set and the empty set).
      • When considering a binary split, we compute a weighted sum of the impurity of each resulting partition:
      • GiniA(D) = ( |D1| / |D| ) x Gini(D1) + ( |D2| / |D| ) x Gini(D2). A small code sketch follows this list.
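
A minimal sketch of the Gini computation for a binary split, under the same assumed data layout as the earlier sketches:

    from collections import Counter

    def gini(labels):
        # Gini(D) = 1 - sum of squared class probabilities
        total = len(labels)
        return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

    def gini_binary_split(labels_left, labels_right):
        # Weighted impurity GiniA(D) of a binary split into D1 and D2
        total = len(labels_left) + len(labels_right)
        return ((len(labels_left) / total) * gini(labels_left)
                + (len(labels_right) / total) * gini(labels_right))

    # Example: a pure left partition and a mixed right partition
    print(gini_binary_split(["yes", "yes"], ["yes", "no"]))   # 0.5*0 + 0.5*0.5 = 0.25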

Tree Pruning :


    • When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers.
    • Tree pruning addresses this problem of overfitting the data.
    • There are two common approaches to tree pruning (a short example follows),
    • 1. Prepruning (halt tree construction early)  2. Postpruning (remove subtrees from a fully grown tree)
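
As a short, hedged illustration using the scikit-learn library (an assumption; the original notes do not mention any particular tool): max_depth and min_samples_leaf act as prepruning limits, while ccp_alpha enables cost-complexity postpruning of the fully grown tree.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Prepruning: halt growth early via limits on depth and leaf size
    pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

    # Postpruning: grow the tree, then prune it back with cost-complexity pruning
    post_pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

    print(pre_pruned.get_depth(), post_pruned.get_depth())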
Back Propagation :
    • Backpropagation works in a backward manner : the error at the output is propagated backwards through the network to update the weights.
    • It is based on the concept of a neural network.
    • It is also widely used for prediction.
    • Input values are typically normalized to the range 0.0 to 1.0.
    • A neural network can be used for both Classification and Prediction (a small sketch follows).
    • Classification (predict the class label of a given tuple)
    • Prediction (predict a continuous-valued output)
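
A minimal NumPy sketch of backpropagation for a one-hidden-layer network on a toy problem; the architecture, learning rate, and XOR-style data are illustrative choices only, and the inputs are already in the 0.0-1.0 range (as would result from min-max normalization).

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Toy inputs (already min-max normalized to 0.0-1.0) and targets
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = np.array([[0.], [1.], [1.], [0.]])

    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros(4)   # input  -> hidden
    W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)   # hidden -> output
    lr = 0.5

    for epoch in range(10000):
        # forward pass
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)

        # backward pass: propagate the error from the output layer back towards the input
        err_out = (out - y) * out * (1 - out)
        err_hid = (err_out @ W2.T) * h * (1 - h)

        # gradient-descent weight updates
        W2 -= lr * (h.T @ err_out); b2 -= lr * err_out.sum(axis=0)
        W1 -= lr * (X.T @ err_hid); b1 -= lr * err_hid.sum(axis=0)

    print(np.round(out, 2))   # should approach [[0], [1], [1], [0]]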

By : Ishan K. Rajani
Contact : ishanrajani@gmail.com




      
  

         



