Saturday 15 September 2012

Mining Frequent Patterns, Associations and Classification


  • Frequent Patterns :

These are patterns that appear frequently in a data set. They may be itemsets, subsequences, or substructures.

Ex : A set of items, such as milk and bread, that appears together frequently in many transactions is a frequent itemset.

Sequential Pattern : A sequence in which a customer first buys a PC, then antivirus software, and then headphones. This type of sequential data can be represented as a sequential pattern.

Structural Pattern : If a substructure occurs frequently, it is called a frequent structured pattern.
Ex : Cuboids in a cube structure.

  • Support and Confidence :
    •  For a rule A ⇒ B :
    •  Support = P(A U B), the proportion of transactions that contain both A and B.
    •  Confidence = support(A U B) / support(A), the proportion of transactions containing A that also contain B.
      • It can also be defined as P(B|A).
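For instance, here is a minimal sketch in Python, with a hypothetical five-transaction database, of computing support and confidence for the rule {milk} ⇒ {bread} :

    # Hypothetical transaction database for illustration only.
    transactions = [
        {"milk", "bread", "butter"},
        {"milk", "bread"},
        {"bread", "eggs"},
        {"milk", "eggs"},
        {"milk", "bread", "eggs"},
    ]
    A, B = {"milk"}, {"bread"}
    both = sum(1 for t in transactions if A | B <= t)  # transactions containing A U B
    only_a = sum(1 for t in transactions if A <= t)    # transactions containing A
    support = both / len(transactions)                 # P(A U B) = 3/5 = 0.6
    confidence = both / only_a                         # P(B|A)   = 3/4 = 0.75
    print(support, confidence)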


Concepts and Algorithms for Frequent Itemset/Pattern Mining

  • Market Basket Analysis :
    • With huge amounts of data being collected and stored, many industries have become interested in mining such patterns from their databases.
    • Market basket analysis helps in the following activities :
      • Business decision-making
      • Catalog design
      • Marketing
      • Customer shopping behavior analysis
      • Product placement in stores
      • Inventory management
    •  This process analyzes customers' buying habits by finding associations between the different items that customers place in their shopping baskets.
    • The discovery of such associations helps retailers develop marketing strategies by gaining insight into which items are frequently purchased together.
    • Ex : If customers who purchase computers also tend to buy antivirus software at the same time, then placing the hardware display closer to the software display may help increase the sales of both items.
    •  If we think of the universe as the set of items available at the store, then each item has a Boolean variable representing the presence or absence of that item.
    • Each basket can then be represented by a Boolean vector of values assigned to these variables.
    • The Boolean vectors can be analyzed for buying patterns that reflect items that are frequently associated or purchased together.
    • These patterns can be represented as association rules.    
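For instance, here is a minimal sketch of this Boolean-vector encoding, using a made-up item universe and baskets :

    # Hypothetical universe of items and customer baskets.
    items = ["computer", "antivirus", "printer", "mouse"]
    baskets = [
        {"computer", "antivirus"},
        {"computer", "mouse"},
        {"printer"},
    ]
    # One Boolean variable per item: True if the basket contains that item.
    vectors = [[item in basket for item in items] for basket in baskets]
    for vec in vectors:
        print(vec)
    # [True, True, False, False]
    # [True, False, False, True]
    # [False, False, True, False]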
Efficient and Scalable Frequent Itemset Mining Methods

The Apriori Algorithm :
  • Apriori Property : All nonempty subsets of a frequent itemset must also be frequent.
  • The name " Apriori " comes from the fact that the algorithm uses prior knowledge of frequent itemset properties.
  • It employs an iterative approach known as a level-wise search, where frequent k-itemsets are used to explore frequent (k+1)-itemsets.
  • First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item and collecting those items that satisfy minimum support.
  • Here, the Apriori property is used to reduce the search space.
  • The property is based on the following observation :
    • If an itemset I does not satisfy the minimum support threshold (min_sup), then I is not frequent; that is, P(I) < min_sup.
    • If an item A is added to the itemset I, then the resulting itemset (I U A) cannot occur more frequently than I, so (I U A) is not frequent either.
    • This condition belongs to a special category of properties called antimonotonic : if a set cannot pass a test, all of its supersets will fail the same test as well. The property is named this way because it is monotonic in the context of failing a test.
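The level-wise search and the Apriori-property pruning can be sketched as follows. This is a compact illustration rather than an optimized implementation, and the function and variable names are my own :

    from itertools import combinations

    def apriori(transactions, min_sup):
        """Level-wise Apriori sketch; min_sup is a support count."""
        transactions = [frozenset(t) for t in transactions]

        def count(itemset):
            return sum(1 for t in transactions if itemset <= t)

        # First scan: frequent 1-itemsets.
        items = {i for t in transactions for i in t}
        level = {frozenset([i]) for i in items if count(frozenset([i])) >= min_sup}
        frequent = set(level)
        k = 2
        while level:
            # Join step: combine frequent (k-1)-itemsets into k-item candidates.
            candidates = {a | b for a in level for b in level if len(a | b) == k}
            # Prune step (Apriori property): drop any candidate that has an
            # infrequent (k-1)-subset.
            candidates = {c for c in candidates
                          if all(frozenset(s) in level for s in combinations(c, k - 1))}
            # Scan the database to keep only candidates meeting min_sup.
            level = {c for c in candidates if count(c) >= min_sup}
            frequent |= level
            k += 1
        return frequent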
Generating Association Rules from Frequent Itemsets :

  • This step defines rules describing how different itemsets occur together.
  • A strong association rule is one that satisfies both a minimum support threshold and a minimum confidence threshold.
  • This can be done by calculating support and confidence with the following equations :
    •  Support = P(A U B)
    •  Confidence = support_count(A U B) / support_count(A) = support(A U B) / support(A)
      • It can also be defined as P(B|A).
      • Here, support_count(A U B) is the number of transactions containing the itemset A U B.
  • Based on these equations, association rules can be generated as follows :
    • For each frequent itemset L, generate all nonempty subsets of L.
    • For every nonempty subset s of L, output the rule s ⇒ (L - s) if support_count(L) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.
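A minimal sketch of this rule-generation step. It assumes the frequent itemsets are frozensets and that support_counts is a dictionary mapping each frozenset to its support count (such counts would already be available from the Apriori pass); the names are hypothetical :

    from itertools import combinations

    def generate_rules(frequent_itemsets, support_counts, min_conf):
        rules = []
        for L in frequent_itemsets:
            if len(L) < 2:
                continue  # a rule needs a nonempty itemset on each side
            for r in range(1, len(L)):
                for subset in combinations(L, r):
                    s = frozenset(subset)
                    # confidence(s => L - s) = sup_count(L) / sup_count(s)
                    conf = support_counts[L] / support_counts[s]
                    if conf >= min_conf:
                        rules.append((set(s), set(L - s), conf))
        return rules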
 Improving the efficiency of Apriori : 
  • Hash based techniques :
    • A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck (especially C2).
    • In this technique, itemsets are hashed into buckets, and each bucket keeps a count of the itemsets hashed to it.
    • An itemset whose corresponding bucket count in the hash table is below the support threshold cannot be frequent and should be removed from the candidate set.
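A minimal sketch of this bucket-counting idea for 2-itemsets; the bucket count and hash function here are arbitrary choices for illustration :

    from itertools import combinations

    def bucket_counts(transactions, num_buckets=11):
        # While scanning transactions, hash every 2-itemset into a bucket
        # and increment that bucket's count.
        buckets = [0] * num_buckets
        for t in transactions:
            for pair in combinations(sorted(t), 2):
                buckets[hash(pair) % num_buckets] += 1
        return buckets

    def prune_c2(candidates_c2, buckets, min_sup, num_buckets=11):
        # A candidate 2-itemset whose bucket count is below min_sup
        # cannot be frequent, so it is removed from C2.
        return [c for c in candidates_c2
                if buckets[hash(tuple(sorted(c))) % num_buckets] >= min_sup]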
  • Transaction Reduction :  
    • It reduces the number of transactions scanned in future iterations.
    • A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets.
    • Such transactions can therefore be marked or removed from further consideration.
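As a one-line sketch (frequent_k is assumed to hold the frequent k-itemsets as frozensets) :

    def reduce_db(transactions, frequent_k):
        # Keep only transactions containing at least one frequent k-itemset;
        # the others cannot contribute any frequent (k+1)-itemset.
        return [t for t in transactions if any(c <= t for c in frequent_k)]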
  •   Partitioning :
    • Partitioning the data to find candidate itemsets.
    • It consists of two phases.
      • Phase 1 : The algorithm subdivides the transactions of D into n nonoverlapping partitions. If the minimum support threshold for transactions in D is min_sup (as a fraction), then the minimum support count for a partition is min_sup × the number of transactions in that partition.
        • For each partition, all frequent itemsets within that partition are found. These are referred to as local frequent itemsets.
        • For local frequent itemsets, an Apriori-like property can be stated as follows :
          • Any itemset that is potentially frequent with respect to D must occur as a frequent itemset in at least one of the partitions.
          • The collection of frequent itemsets from all partitions forms the global candidate itemsets with respect to D.
      • Phase 2 : A second scan of D is conducted in which the actual support of each candidate is assessed in order to determine the global frequent itemsets.
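A minimal two-phase sketch, assuming min_sup_fraction is the support threshold as a fraction of the database size and local_miner is any in-memory frequent-itemset miner (e.g., the Apriori sketch above, called with a count threshold); the names are hypothetical :

    def partition_mine(D, min_sup_fraction, n, local_miner):
        parts = [D[i::n] for i in range(n)]  # n nonoverlapping partitions
        # Phase 1: local frequent itemsets; their union is the global
        # candidate set, since a globally frequent itemset must be locally
        # frequent in at least one partition.
        candidates = set()
        for p in parts:
            candidates |= set(local_miner(p, min_sup_fraction * len(p)))
        # Phase 2: one more full scan of D to count actual global support.
        min_count = min_sup_fraction * len(D)
        return [c for c in candidates
                if sum(1 for t in D if c <= t) >= min_count]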
  • Sampling :
    • Mining on a subset of the given data.
    • The basic idea is to pick a random sample S of the given data D, and search for frequent itemsets in S instead of D.
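A minimal sketch; the function names are hypothetical, and the lowered in-sample threshold is a common heuristic to reduce the chance of missing globally frequent itemsets, not part of a fixed algorithm :

    import random

    def sample_mine(D, min_sup_fraction, sample_size, local_miner):
        S = random.sample(D, sample_size)  # random sample of the database
        # Mine S with a threshold somewhat lower than min_sup (heuristic).
        return local_miner(S, 0.9 * min_sup_fraction * sample_size)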
  • Dynamic itemset counting :
    • Adding candidate itemsets at different points during a scan
    • A dynamic itemset counting technique was proposed in which the database is partitioned into blocks marked by start points.
    • In this technique, new candidate itemsets can be generated at any start point, unlike Apriori, which determines new candidate itemsets only immediately before each complete database scan.
Mining Frequent Itemsets without candidate generation :
    •  The Apriori candidate generate-and-test method significantly reduces the size of candidate sets, leading to a good performance gain.
    • Even so, it can suffer from two nontrivial costs :
      • It may need to generate a huge number of candidate sets.
      • It may need to repeatedly scan the database and check a large set of candidates by pattern matching.
    •  An important method that mines frequent itemsets without candidate generation is frequent-pattern growth, or FP-growth.
    • It adopts a divide-and-conquer strategy.
    • First : it compresses the database representing frequent items into a frequent-pattern tree, or FP-tree, which retains the itemset association information.
    • Second : it divides the compressed database into a set of conditional databases, each associated with one frequent item, and mines each such database separately.
    • Here, the first scan is the same as in Apriori : it derives the set of frequent items (1-itemsets) and their support counts (frequencies).
    • Let the minimum support count be 2. The set of frequent items is sorted in descending order of support count; for example, L = {{I2:7}, {I1:6}, {I3:6}, {I4:2}, {I5:2}}.
    • Following are the steps to construct an FP-tree :
      • First, create the root of the tree, labeled with " null ".
      • Then scan the database a second time, processing the items in each transaction in L order (i.e., in descending support count).
      • A branch is created for each transaction.
      • For example :
        • The scan of the first transaction, T1 : I2, I1, I5 (in L order), leads to the construction of the first branch of the tree, with three nodes,
        • where I2 is linked to the root node, I1 is linked to I2, and I5 is linked to I1.
        • The second transaction contains I2, I4 in L order. It would result in a branch where I2 is linked to the root and I4 is linked to I2. However, this branch shares the common prefix I2 with the existing branch, so the count of the existing I2 node is incremented and only a new I4 node is created as a child of I2.
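The two scans and the prefix-sharing insertion can be sketched as follows. This minimal version builds only the tree itself (a full FP-tree also keeps a header table of node-links for the mining phase, which is omitted here), ties in support count are broken alphabetically so that L matches the order above, and the sample transactions are chosen to be consistent with the counts in the example :

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 1
            self.children = {}

    def build_fp_tree(transactions, min_sup):
        # First scan: support count of each item (same as Apriori's first pass).
        counts = {}
        for t in transactions:
            for i in t:
                counts[i] = counts.get(i, 0) + 1
        # L: frequent items in descending support-count order (ties alphabetical).
        L = sorted((i for i in counts if counts[i] >= min_sup),
                   key=lambda i: (-counts[i], i))
        order = {item: rank for rank, item in enumerate(L)}
        root = FPNode(None, None)  # root labeled "null"
        # Second scan: insert each transaction's frequent items in L order.
        for t in transactions:
            node = root
            for item in sorted((i for i in t if i in order), key=order.get):
                if item in node.children:
                    node.children[item].count += 1  # shared prefix: bump count
                else:
                    node.children[item] = FPNode(item, node)  # new branch node
                node = node.children[item]
        return root, L

    txns = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
            {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
            {"I1", "I2", "I3"}]
    root, L = build_fp_tree(txns, min_sup=2)
    print(L)  # ['I2', 'I1', 'I3', 'I4', 'I5'], with counts 7, 6, 6, 2, 2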
By : Ishan K. Rajani
Contact : ishanrajani@gmail.com




   
 
