Saturday 15 September 2012

Data Mining & Data Warehousing Introduction


                                                                         

Introduction
  • Data Mining :


" It is a process of extracting or mining knowledge from multiple sources Data Bases, or Data Warehouses or other information repositories. "


  • Data Warehouse :


"A Data warehouse is a subject-oriented, integrated, Time-Variant and Non-volatile collection of data, which can be further used for Data mining processes. "

"  It refers to a database that is maintained separately from an organization's separate operational database  "


  • Two types of Data Mining Tasks :


(1) Descriptive : It characterizes the general properties of the data in the database repositories, and is also used to produce more general summaries of the data.

(2) Predictive : It performs inference on the current data in order to make predictions. Regression is an example of this category.

  • Applications of Data Mining :


1. Weather analysis
2. Market study
3. Handling of different types of databases
4. Outlier analysis
5. Multidimensional representation of data
6. Data transformation
7. Missing data prediction
8. Removal of noise from data
9. Fast computation (ex. by using sparse structures like the Iceberg cube)
10. Generalization of task relevant data


  • Major issues in Data Mining :


Mining methodology and user interaction issues :

(1) Mining different kinds of knowledge in databases
(2) Interactive mining of knowledge at multiple levels of abstraction
(3) Incorporation of background knowledge
(4) Data mining query languages and ad-hoc data mining
(5) Presentation and visualization of data mining results
(6) Handling noisy or incomplete data
(7) Pattern evaluation

Issues regarding performance :

(1) Efficiency and scalability of data mining algorithms
(2) Parallel, Distributed and incremental mining algorithms

Issues relating to the diversity of database types :

(1) Handling of relational and complex types of data
(2) Mining information from heterogeneous databases and global information systems.


  • Data Mining can deal with the following kinds of databases :
(1) Relational Database
(2) Data Warehouse (Preprocessed Databases)
(3) Transactional Databases
(4) Object-Relational Database
(5) Temporal & Time series Databases
(6) Sequential Databases
(7) Spatial and Spatiotemporal Database
(8) Text and Multimedia Database
(9) Heterogeneous and Legacy Database
(10) Data Streams
(11) WWW
  • KDD(Knowledge Discovery from Data) process in Data Mining :


KDD stands for Knowledge Discovery from Data. Some pre-processing operations are required to produce clean data in the data warehouse before that data is used for data mining processes.

KDD includes data cleaning, integration, selection, transformation and reduction as its basic preprocessing activities.




  • Integration of a Data Mining System with a DB or Data Warehouse system


There are mainly four types of schemes for it

(1) No coupling : This type of system does not utilize any function of a database or data warehouse system.

(2) Loose coupling : This type of data mining system uses some facilities of a DB or DW system, fetching data from data repositories managed by that system.

It is better than no coupling because it can fetch any data from the database using operations like query processing, indexing etc.

(3) Semitight coupling : It means that besides linking a DM system with DB/DW system, efficient implementations of DM mining primitives can be provided in DB/DW systems

(4) Tight coupling : It simply means that DM system is smoothly integrated into the DB/DW system.

                                       Here I found a nice video about the introduction to data mining; hope it helps you. You can check out the following link:

http://www.youtube.com/watch?v=8fh2zUNs22U

      Pre-processing on Data


  • The objective of pre-processing on data is to remove noise from the data and to remove redundant data, creating clean data for the further mining process.
  • There are mainly 4 types of pre-processing activities: data cleaning, data integration, data transformation, and data reduction.
Data Cleaning :

  • Used to create noise-free data through different methods. It mainly includes the following:
    • filling in missing values
    • smoothing noisy data by binning, regression, and clustering (a small binning sketch follows this list)
    • identifying or removing outliers.
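Below is a minimal sketch (in Python, on a small sorted list of illustrative price values) of smoothing by bin means: the sorted values are divided into equal-frequency bins and each value is replaced by its bin's mean.

# Smoothing noisy data by binning (equal-frequency bins, smoothed by bin means).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]     # illustrative sorted values

def smooth_by_bin_means(values, n_bins):
    size = len(values) // n_bins                # assumes len(values) is divisible by n_bins
    smoothed = []
    for b in range(n_bins):
        bin_vals = values[b * size:(b + 1) * size]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

print(smooth_by_bin_means(prices, 3))
# -> [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]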
Data Integration :

  • It means merging of data from multiple data stores.
  • There are some issues while performing Integration as follows :
    • Schema integration and Object matching can be tricky
    • Redundancy 
    • Detection and Resolution of data value conflicts.
Data Transformation :

  • In this preprocessing activity data are transformed into forms appropriate for mining.
  • Data transformation can involve the following.
    • Smoothing : It works to remove noise from data. It includes binning, regression and clustering.
    • Aggregation : Summary or aggregation operations are applied to data.
    • Generalization : Lower level data are replaced by higher-level concepts through the use of concept hierarchies.
    • Normalization : Here attribute data are scaled so as to fall within a small specified range (a min-max sketch follows this list).
    • Attribute Construction : New attributes are constructed and added from the given set of attributes to help the mining process.
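As a small illustration of normalization, here is a sketch of min-max normalization (the function name and income values are made up for this example), which rescales attribute values into a specified range such as 0.0 to 1.0.

# Min-max normalization: scale values into [new_min, new_max].
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    span = old_max - old_min                    # assumes not all values are equal
    return [(v - old_min) / span * (new_max - new_min) + new_min
            for v in values]

incomes = [12000, 35000, 58000, 98000]          # illustrative attribute values
print(min_max_normalize(incomes))               # values scaled into [0, 1]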








Concepts of OLTP, OLAP and OLMP

 



  • OLTP :  (Online Transaction Processing)



--> It covers mostly the results of recent operations, i.e. day-to-day records.
       Ex : purchasing, inventory, banking, payroll transactions etc.
--> It is customer-oriented and used for recent query processing.
--> Usually adopts an entity-relationship (ER) data model and application oriented database design.
--> The view of OLTP data is mostly detailed, flat relational records.
--> Also very important for daily sales analysis in industries.
--> The number of records accessed per operation is in the tens, and the number of users may be in the thousands.


  • OLAP : (Online Analytic Processing)


--> It generally focuses on historical data.
--> Mostly used for long term operations.
--> Can provide summarized or multidimensional view.
--> The number of users is in the hundreds because it does not focus on particular operations but on bulk analysis of many records.
--> It is highly flexible and offers end-user autonomy.
--> Mostly read-only operations because most of the data warehouses store historical data.


  • Difference between OLTP and OLAP



        Operations of OLAP :

(1) Roll-Up  :

- It performs aggregation by climbing up a concept hierarchy for a dimension or by dimension reduction.
- It can be performed by removing one or more dimensions.
   Ex : street < city < province_or_state < country
- When roll-up is performed by dimension reduction, one or more dimensions are removed from the cube; when it is performed by climbing a hierarchy such as the one above, the data is aggregated at a higher level.
- For example, rather than grouping the data by city, the resulting cube groups the data by country.


(2) Drill-Down : 

- Drilling Down is the reverse of Roll-up.
- It navigates from less detailed data to more detailed data.
- Because it adds more detail to the given data, it can also be performed by adding a new dimension to a cube.


(3) Slice and Dice :

- Slice performs a selection on one dimension of the given cube, resulting in a subcube.
- Dice defines a subcube by performing a selection on two or more dimensions.

(4) Pivot (Rotate) :

- It is a visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data.
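The following is a small sketch (in Python with pandas; the column names and sales figures are made up) showing how roll-up, drill-down, slice, dice and pivot look on a tiny "sales" data set.

import pandas as pd

sales = pd.DataFrame({
    "country": ["India", "India", "USA", "USA"],
    "city":    ["Rajkot", "Mumbai", "NYC", "Boston"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "amount":  [100, 150, 200, 120],
})

# Roll-up: climb the location hierarchy (city -> country) by aggregating.
rollup = sales.groupby("country")["amount"].sum()

# Drill-down: move back to more detailed data by adding the lower level.
drilldown = sales.groupby(["country", "city"])["amount"].sum()

# Slice: selection on one dimension (quarter = "Q1") gives a subcube.
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: selection on two or more dimensions at once.
dice = sales[(sales["quarter"] == "Q1") & (sales["country"] == "India")]

# Pivot: rotate the axes to get an alternative presentation of the data.
pivoted = sales.pivot_table(index="country", columns="quarter",
                            values="amount", aggfunc="sum")

print(rollup, drilldown, slice_q1, dice, pivoted, sep="\n\n")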



Data Cube Computation and Data Generalization


Mining Frequent Patterns, Associations and Classification


  • Frequent Patterns :

These are patterns that appear in a data set frequently. They may be itemsets, subsequences, or substructures.

Ex : The items milk and bread appear together frequently in many transactions, so they are frequent items and the set {milk, bread} is a frequent itemset.

Sequential Pattern : A sequence such as a customer first buying a PC, then antivirus software, then headphones. This kind of sequential data can be represented as a sequential pattern.

Structural Pattern : If a substructure occurs frequently, it is called a frequent structured pattern.
Ex : Cuboids in Cube structure.

  • Support and Confidence:
    •  For a rule A => B :
    •  Support : P(A ∪ B), the proportion of transactions containing both A and B.
    •  Confidence : support_count(A ∪ B) / support_count(A).
      • It can also be defined as P(B|A).
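Here is a small sketch (with a made-up transaction list) that computes support and confidence for the rule {milk} => {bread} directly from these definitions.

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

n = len(transactions)
count_A  = sum(1 for t in transactions if {"milk"} <= t)            # A = {milk}
count_AB = sum(1 for t in transactions if {"milk", "bread"} <= t)   # A ∪ B

support    = count_AB / n          # fraction of transactions containing A ∪ B
confidence = count_AB / count_A    # P(B|A)

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
# -> support = 0.50, confidence = 0.67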


Concepts and Algorithms for Frequent Itemset/Pattern Mining

  • Market Basket Analysis :
    • With huge amounts of data being collected and stored, many industries are becoming interested in mining such patterns from their databases.
    • Market basket analysis helps in following activities.
      • Business Decision making process
      • Catalog Design
      • Marketing purpose
      • Customer shopping behavior analysis
      • product placing in stores
      • Inventory management 
    •  This process analyzes customer buying habits by finding associations between the different items that customers place in their shopping baskets.
    • The discovery of such associations helps retailers develop marketing strategies by gaining insight into which items are frequently purchased together. 
    • Ex : If customers who purchase computers also tend to buy antivirus software at the same time, then placing the hardware display nearer to the software display may help increase the sales of both items.
    •  If we think of the universe as the set of items available at the store, then each item has a Boolean variable representing the presence or absence of that item.
    • Each basket can be represented by a Boolean Vector of values assigned to these variables.
    • The Boolean vectors can be analyzed for buying patterns that reflect items that are frequently associated or purchased together.
    • These patterns can be represented as association rules.    
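As a quick sketch (the item universe and baskets are made up), each basket can be turned into a Boolean vector over the store's items like this:

items = ["computer", "antivirus", "printer", "webcam"]   # the item "universe"

baskets = [
    {"computer", "antivirus"},
    {"computer", "printer"},
    {"webcam"},
]

# One Boolean vector per basket: True if the item is present, False otherwise.
vectors = [[item in basket for item in items] for basket in baskets]

for basket, vec in zip(baskets, vectors):
    print(vec, basket)
# e.g. [True, True, False, False] {'computer', 'antivirus'}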
Efficient and Scalable Frequent Itemset Mining Methods

The Apriori Algorithm :
  • Apriori Property : All nonempty subsets of a frequent itemset must also be frequent.
  • The name " Apriori " is based on the fact that the algorithm uses prior knowledge of frequent itemsets.
  • It employs an iterative approach known as a level-wise search.
  • First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count of each item, and collecting those items that satisfy minimum support.
  • Here, the Apriori property is used to reduce the search space. 
  • The property is based on the following observation:
    • If an itemset I does not satisfy the minimum support threshold (min_sup), then I is not frequent; that is, P(I) < min_sup.
    • If an item A is added to the itemset I, then the resulting itemset (I ∪ A) cannot occur more frequently than I.
    • This property belongs to a special category of properties called antimonotone, meaning that if a set cannot pass the test, all of its supersets will fail the same test as well. The property is called monotonic in the context of failing a test. 
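The following is a compact, illustrative sketch of the Apriori level-wise search (itemsets are frozensets and min_sup is an absolute support count); it is a simplification, not an optimized implementation.

from itertools import combinations

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]

    # First scan: frequent 1-itemsets and their counts.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s for s, c in counts.items() if c >= min_sup}
    frequent = {s: counts[s] for s in current}

    k = 2
    while current:
        # Join step: merge (k-1)-itemsets whose union has exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Scan the database to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c for c, n in counts.items() if n >= min_sup}
        frequent.update({c: counts[c] for c in current})
        k += 1
    return frequent

# Example usage on a toy database (min_sup = 2):
db = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
      {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
      {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
print(apriori(db, min_sup=2))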
 Generating Association Rules from Frequent Item sets :

  • This step defines rules for the association of different itemsets.
  • Here strong association rules satisfy both a minimum support threshold and a minimum confidence threshold.
  • This can be done by calculating support and confidence by following equation.
    •  Support = P(A U B)
    •  Confidence =  support_count(A U B) / support_count(A) = support(A U B) / support(A)
      • It can also be defined as P(B|A)
      • Here support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B.
  • Based on this equation, association rules can be generated as follows :
    • For each frequent itemset L, generate all nonempty subsets of L.
    • For every nonempty subset s of L, output the rule s => (L - s) if sup_count(L) / sup_count(s) >= min_confidence, where min_confidence is the minimum confidence threshold.
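Continuing from the Apriori sketch above (so `frequent` is assumed to be a dict mapping frozenset itemsets to their support counts), the rule-generation step can be sketched as:

from itertools import combinations

def generate_rules(frequent, min_confidence):
    rules = []
    for L, sup_L in frequent.items():
        if len(L) < 2:
            continue
        # All nonempty proper subsets s of L (their counts are in `frequent`
        # because every subset of a frequent itemset is itself frequent).
        for r in range(1, len(L)):
            for s in map(frozenset, combinations(L, r)):
                confidence = sup_L / frequent[s]
                if confidence >= min_confidence:
                    rules.append((set(s), set(L - s), confidence))
    return rules

# Example usage, continuing from the toy database above:
# for lhs, rhs, conf in generate_rules(apriori(db, 2), 0.7):
#     print(lhs, "=>", rhs, f"(confidence = {conf:.2f})")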
 Improving the efficiency of Apriori : 
  • Hash based techniques :
    • A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck.
    • In this technique, itemsets are hashed into buckets and a count is kept for each bucket.
    • Itemsets whose corresponding bucket count in the hash table is below the support threshold cannot be frequent and should be removed from the candidate set.
  • Transaction Reduction :  
    • It reduces the number of transactions scanned in future iterations.
    • A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets.
    • such transactions can be marked or removed from further consideration.
  •   Partitioning :
    • Partitioning the data to find candidate itemsets.
    • It consists of two phases.
      • Phase 1 : The algorithm subdivides the transactions of D into n nonoverlapping partitions. If the minimum support threshold for transactions in D is min_sup (expressed as a fraction), then the minimum support count for a partition is min_sup * the number of transactions in that partition.
        • for each partition all frequent itemsets within the partition are found. These are referred to as local frequent itemsets. 
        • A property of local frequent itemsets, analogous to the Apriori property, can be stated as follows:
          • Any itemset that is potentially frequent with respect to D must occur as a frequent itemset in at least one of the partitions.
          • The collection of frequent itemsets from all partitions forms the global candidate itemsets with respect to D.
      •   Phase 2 : A second scan of D is conducted in which the actual support of each candidate is assessed in order to determine the global frequent item sets.
  • Sampling :
    • Mining on a subset of given data
    • The basic idea is to pick a random sample S of the given data D, and Search for frequent itemsets in S instead of D.
  • Dynamic itemset counting :
    • Adding candidate itemsets at different points during a scan
    • A dynamic itemset counting technique was proposed in which the database is partitioned into blocks marked by start points.
    • In this technique, new candidate itemsets can be added at any start point, unlike Apriori, which determines new candidate itemsets only immediately before each complete database scan. 
Mining Frequent Itemsets without candidate generation :
    •  The Apriori candidate generate-and-test method significantly reduces the size of candidate sets, leading to good performance gains.
    • Even so, it can suffer from two nontrivial costs :
      • It may need to generate a huge number of candidate sets
      • It may need to repeatedly scan the database and check a large set of candidates by pattern matching.
    •  An important method to generate frequent itemsets without candidate generation is Frequent-Pattern growth, or FP-growth.
    • It adopts divide-and-conquer strategy.
    • First : It compresses the database representing frequent items into a frequent-pattern tree, or FP-tree, which retains the itemset association information.
    • Secondly : It divides the compressed database into a set of conditional databases, each associated with one frequent item.
    • Here the first scan is the same as in Apriori; it derives the set of frequent items and their support counts (frequencies).
    • Let the min_sup count be 2. The set of frequent items is sorted in descending order of support count; for example, L = {{I2:7},{I1:6},{I3:6},{I4:2},{I5:2}}.
    • Following are the steps to construct an FP-tree :
      • First create the root of the tree, labeled with " null ".
      • Then scan the database a second time, processing the items in each transaction in L order (i.e., descending support count).
      • A branch is created for each transaction.
      • For example :
        • The scan of the first transaction, T1 : I2,I1,I5 in L order, leads to the construction of the first branch of the tree with three nodes,
        • where I2 is linked to the root node, I1 is linked to I2, and I5 is linked to I1.
        • The second transaction contains I2,I4 in L order. It results in a branch where I2 is linked to the root and I4 is linked to I2; this branch shares the common prefix I2 with the first branch.
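A minimal sketch of this construction (the FPNode class and helper below are illustrative, and the node-links/header table are omitted for brevity) looks like this:

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item          # item label ("null" for the root)
        self.count = 0            # number of transactions sharing this node
        self.parent = parent
        self.children = {}        # child item -> FPNode

def build_fp_tree(transactions, min_sup):
    # First scan: count item frequencies and keep the frequent items.
    freq = {}
    for t in transactions:
        for item in t:
            freq[item] = freq.get(item, 0) + 1
    freq = {i: c for i, c in freq.items() if c >= min_sup}

    root = FPNode("null")
    # Second scan: insert each transaction with its frequent items in L order
    # (descending support count); shared prefixes only increment counts.
    for t in transactions:
        ordered = sorted((i for i in t if i in freq),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = FPNode(item, parent=node)
            node = node.children[item]
            node.count += 1
    return root, freq

# Example usage with the two transactions discussed above:
tree, L = build_fp_tree([{"I2", "I1", "I5"}, {"I2", "I4"}], min_sup=1)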
By : Ishan K. Rajani
Contact : ishanrajani@gmail.com




   
 

Classification and Prediction



  • Classification :


" Assigning Data cases to one of the fixed number of classes (Represented by Nominal Numbers) "

" Classification is the separation of ordering of classes " 


  • Prediction :


" It is one type of process with the use of predictors in the prediction of one or more than one variable  "


  • Application of Classification and Prediction :


- Weather Analysis
- Prediction of Null space
- Cryptography
- Target Marketing
- Medical Diagnosis
- Fraud Detection
- Performance Prediction
- Manufacturing


  • Five criteria for evaluation of Classification and Prediction :
  1. Predictive Accuracy
  2. Computational Speed
  3. Robustness
  4. Scalability
  5. Interpretability

  • Classification by Decision Tree Induction
    • It is the learning of decision trees from class-labeled training tuples.
    • A decision tree is a flowchart-like structure, where each non-leaf node denotes a test on an attribute.
    • Each branch represents an outcome of the test, and each leaf node holds a class label.
    • The topmost node in a tree is the root node.
    • Now the question can be asked : how are decision trees used for classification ?
    • Why are decision tree classifiers so popular ?
Because they do not require any domain knowledge or parameter setting, and are therefore appropriate for exploratory knowledge discovery.
Decision trees can handle high dimensional data.
      • The resulting trees are generally easy for humans to interpret.
      • This type of Induction is easy and fast.
      • Decision tree classifiers have good accuracy.
    • Different algorithms have been developed for decision tree induction. The most popular of them are ID3 and C4.5.
    • C4.5 is the successor of ID3. The book CART (Classification and Regression Trees) describes the generation of binary trees.
    • Most algorithms for decision tree induction follow such a top-down approach.
    • Basic algorithm for Decision Tree Induction : 
      • create a node N.
      • If tuples in D are all of the same class, C then
        • Return N as a leaf node labeled with class C
      • If attribute_list is empty then
        • Return N as a leaf node labeled with the majority class in D
      •  Apply attribute_selection_method(D, attribute_list) to find the best splitting_criterion
      • Label node N with splitting_criterion
      • if splitting attribute is discrete-valued and
        • multiway splits allowed then //not restricted to binary trees
      • attribute_list << attribute_list - splitting_attribute; // remove splitting attribute
      •  for each outcome j of splitting_criterion // partition the tuples and grow subtrees for each partition
      • let Dj be the set of data tuples in D satisfying outcome j; // a partition
      • if Dj is empty then
        • attach a leaf labeled with the majority class in D to node N;
      • else attach the node returned by generate decision tree(Dj,attribute_list) to node N;
      • end for
      • Return N;
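A runnable sketch of this procedure is shown below (the tree is returned as nested tuples, leaves use majority voting, and select_attribute is a pluggable function, e.g. one that picks the attribute with the highest information gain as computed later in this post).

from collections import Counter

def majority_class(D):
    # D is a list of (attribute_dict, class_label) pairs.
    return Counter(label for _, label in D).most_common(1)[0][0]

def generate_decision_tree(D, attribute_list, select_attribute):
    labels = {label for _, label in D}
    if len(labels) == 1:                         # all tuples in the same class C
        return ("leaf", labels.pop())
    if not attribute_list:                       # attribute list is empty
        return ("leaf", majority_class(D))

    A = select_attribute(D, attribute_list)      # best splitting criterion
    remaining = [a for a in attribute_list if a != A]
    branches = {}
    for value in {x[A] for x, _ in D}:           # one branch per outcome j
        Dj = [(x, y) for x, y in D if x[A] == value]
        branches[value] = (("leaf", majority_class(D)) if not Dj
                           else generate_decision_tree(Dj, remaining,
                                                       select_attribute))
    return ("node", A, branches)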

  • Attribute Selection Measures 
    • It is a heuristic for selecting the splitting criterion that "best" separates a given data partition D of class-labeled training tuples into individual classes.
    • The best splitting criterion is the one that gives the purest partitions.
    • " Attribute selection measures are also known as splitting rules because they determine how the tuples at a given node are to be split. "
    • The attribute having the best score for the measure is chosen as the splitting attribute.
    • This section describes mainly three measures :
      • Information Gain
      • Gain Ratio
      • Gini Index
    • Information Gain :
      • ID3 uses it as its attribute selection measure.
      • The attribute with the highest information gain is chosen as the splitting attribute for node N.
      • This attribute minimizes the impurity of the resulting partitions.
      • It also Minimizes the expected number of tests and guarantees that a simple tree is found.
      • The expected information needed to classify a tuple in D is given by
        • Info(D) = - Σ (i = 1 to m) pi log2(pi)
          • Here pi is the probability that an arbitrary tuple in D belongs to class Ci, and is estimated by |Ci,D| / |D|.
          • here base 2 function is used because the information is encoded in bits.
          • " Info(D) is just the average amount of information needed to identify the class label of tuple in D "
          • How much more information would we still need in order to arrive at an exact classification ?
            • It can be measured by InfoA(D) = Σ (j = 1 to v) ( |Dj| / |D| ) × Info(Dj)
            • The smaller the information required the greater the purity of the partition.
          • Information gain is defined as the difference between the original information requirement and the new requirement.
          • Gain(A) = Info(D) - InfoA(D)
          • Here Gain(A) tells us how much would be gained by branching on A.
          • The attribute A with the highest information gain can be chosen as the splitting criterion on node N.
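A small sketch of these formulas (D is again a list of (attribute_dict, class_label) pairs; the example data set is made up) is given below.

from collections import Counter
from math import log2

def info(D):
    # Info(D) = - sum over classes of pi * log2(pi)
    n = len(D)
    counts = Counter(label for _, label in D)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def info_after_split(D, A):
    # InfoA(D) = sum over partitions Dj of (|Dj| / |D|) * Info(Dj)
    n = len(D)
    partitions = {}
    for x, y in D:
        partitions.setdefault(x[A], []).append((x, y))
    return sum((len(Dj) / n) * info(Dj) for Dj in partitions.values())

def gain(D, A):
    return info(D) - info_after_split(D, A)

# Example usage with a tiny made-up data set:
D = [({"age": "youth",  "student": "no"},  "no"),
     ({"age": "youth",  "student": "yes"}, "yes"),
     ({"age": "senior", "student": "no"},  "yes"),
     ({"age": "senior", "student": "yes"}, "yes")]
print(gain(D, "student"), gain(D, "age"))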

    • Gain Ratio :
      • Information Gain prefers to select attributes having a large number of values.
      • Partitioning on an attribute with many unique values is not a good idea.
      • Ex : Suppose we partition by considering the attribute Product_Id; the split will have many branches, each of which is pure but holds only a single tuple.
      • So this type of attribute should not be considered as the splitting criterion for further classification.
      • The C4.5 algorithm uses an extension of information gain known as gain ratio.
      • It overcomes this bias by applying a normalization to information gain using a split information value.
      • SplitInfoA(D) = - Σ (j = 1 to v) ( |Dj| / |D| ) log2( |Dj| / |D| )
      • Above equation represents the information generated by splitting the training data set D into v partition.
      • Then GainRatio(A) = Gain(A) / SplitInfo(A)
      • The attribute with the maximum gain ratio is selected as the splitting attribute.


    • Gini Index :
      • It is used in CART(Classification and Regression Tree) method.
      • It measures the impurity of D, a data partition or set of training tuples.
      • Gini(D) = 1 - Σ (i = 1 to m) pi²
      • The Gini index considers a binary split for each attribute.
      • If A has v possible values, then there are 2^v possible subsets.
      • When considering binary split, we compute a weighted sum of the impurity of each resulting partition.
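A small sketch of the Gini index and of the weighted impurity of a binary split (the labels below are made up) follows the formulas above.

from collections import Counter

def gini(D):
    # Gini(D) = 1 - sum over classes of pi^2
    n = len(D)
    counts = Counter(label for _, label in D)
    return 1 - sum((c / n) ** 2 for c in counts.values())

def gini_binary_split(D1, D2):
    # Weighted sum of the impurity of each resulting partition.
    n = len(D1) + len(D2)
    return (len(D1) / n) * gini(D1) + (len(D2) / n) * gini(D2)

# Example usage:
D1 = [({}, "yes"), ({}, "yes"), ({}, "no")]
D2 = [({}, "no"), ({}, "no")]
print(gini(D1 + D2), gini_binary_split(D1, D2))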

Tree Pruning :


    • While constructing a decision tree, many of the branches may reflect noise or outliers in the training data.
    • Tree pruning addresses the problem of overfitting the data.
    • There are mostly two common approaches for establishing a pruning method,
    • 1. Prepruning  2. Postpruning
Back Propagation :
    • Backpropagation basically works in a backward manner: the error at the output is propagated backward through the network.
    • It is based on the concept of a neural network.
    • Also widely used for prediction
    • Input values are usually normalized to a scale of 0.0 to 1.0
    • Neural network can be used for both Classification and Prediction
    • Classification (Predict Class Label of Given Tuple)
    • Prediction (To Predict Continuous valued output)

By : Ishan K. Rajani
Contact : ishanrajani@gmail.com




      
  

         




Cluster Mining


Graph Mining

It is one type of structure mining.

" Structure mining is the process of finding and extracting useful information from semi-structured data sets. "

Text, Multimedia Mining

Text mining, also referred to as Text Data Mining, is the process of deriving high-quality information from text.

High-quality information is typically achieved through the analysis of patterns and trends, such as by statistical pattern learning.


Web Mining

Web mining is the application of data mining techniques to discover patterns from the WWW.

There are mainly three types of Web Mining :


1. Web Usage Mining
2. Web Content Mining
3. Web Structure Mining


1. Web Usage Mining :  It is the process of extracting useful information from server logs. It is also stated as the process of finding out what users are looking for on the Internet.


2. Web Content Mining : It focuses on the data content, such as documents, web pages, etc.




Trends in Data Mining