Saturday 15 September 2012

Data Mining & Data Warehousing Introduction


                                                                         

Introduction
  • Data Mining :


"It is the process of extracting, or mining, knowledge from data held in multiple sources such as databases, data warehouses or other information repositories."


  • Data Warehouse :


"A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data, which can be further used for data mining processes."

"It refers to a database that is maintained separately from an organization's operational databases."


  • Two types of Data Mining Tasks :


(1) Descriptive : It characterizes the general properties of the data in the database repositories, and is also used to generalize that data.

(2) Predictive : It performs inference on the current data in order to make predictions. Regression is an example of this category.
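
To make the difference between the two tasks concrete, here is a minimal sketch in plain Python. The data values and the names `fit_line`, `summary` and `predicted` are my own assumptions for illustration; they are not from any particular data mining tool. The descriptive part just summarizes the data, while the predictive part fits a straight line by ordinary least squares and uses it to estimate an unseen value.

```python
from statistics import mean

# Hypothetical data (made up for this sketch): x = ad spend, y = sales.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]

# Descriptive task: summarize general properties of the existing data.
summary = {"min": min(y), "max": max(y), "mean": mean(y)}

def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Predictive task: infer a model from current data, then predict a new value.
slope, intercept = fit_line(x, y)
predicted = slope * 6.0 + intercept
print(summary, slope, intercept, predicted)
```
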

  • Applications of Data Mining :


1. Weather analysis
2. Market study
3. Handling different types of databases
4. Outlier analysis
5. Multidimensional representation of data
6. Data transformation
7. Missing data prediction
8. Removal of noise from data
9. Fast computation (e.g. by using sparse structures like the iceberg cube)
10. Generalization of task relevant data
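
As a rough illustration of the iceberg cube idea from point 9, the sketch below keeps only those group-by cells whose count reaches a minimum threshold, so the stored cube stays sparse. The records, the dimension names and the helper `iceberg_cube` are hypothetical. Note that this naive version still counts every cell before filtering; it only shows what ends up being stored, not how a real algorithm would prune the search.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sales records (city, item, year); values are made up.
rows = [
    ("Pune", "pen", 2011),
    ("Pune", "pen", 2012),
    ("Pune", "ink", 2012),
    ("Delhi", "pen", 2012),
]
DIMS = ("city", "item", "year")
MIN_COUNT = 2  # assumed iceberg threshold (minimum support)

def iceberg_cube(records, dims, min_count):
    """Count every group-by cell and keep only cells with count >= min_count."""
    cube = {}
    for k in range(len(dims) + 1):
        for idx in combinations(range(len(dims)), k):
            counts = Counter(tuple(r[i] for i in idx) for r in records)
            for values, n in counts.items():
                if n >= min_count:
                    key = tuple((dims[i], v) for i, v in zip(idx, values))
                    cube[key] = n
    return cube

cube = iceberg_cube(rows, DIMS, MIN_COUNT)
for cell, count in cube.items():
    print(cell, count)
```
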


  • Major issues in Data Mining :


Mining methodology and user interaction issues :

(1) Mining different kinds of knowledge in databases
(2) Interactive mining of knowledge at multiple levels of abstraction
(3) Incorporation of background knowledge
(4) Data mining query languages and ad-hoc data mining
(5) Presentation and visualization of data mining results
(6) Handling noisy or incomplete data
(7) Pattern evaluation

Issues regarding performance :

(1) Efficiency and scalability of data mining algorithms
(2) Parallel, Distributed and incremental mining algorithms

Issues relating to the diversity of database types :

(1) Handling of relational and complex types of data
(2) Mining information from heterogeneous databases and global information systems.


  • Data Mining can deal with the following kinds of databases :
(1) Relational Database
(2) Data Warehouse (Reprocessed Databases)
(3) Transactional Databases
(4) Object-Relational Database
(5) Temporal & Time series Databases
(6) Sequential Databases
(7) Spatial and Spatiotemporal Database
(8) Text and Multimedia Database
(9) Heterogeneous and Legacy Database
(10) Data Streams
(11) WWW

  • KDD (Knowledge Discovery from Data) process in Data Mining :


KDD stands for Knowledge Discovery from Data. Some pre-processing operations are required to produce clean data in the data warehouse before that data is used for data mining.

KDD includes data cleaning, integration, selection, transformation and reduction as its basic pre-processing activities.




  • Integration of a Data Mining System with a DB or Data Warehouse system


There are mainly four types of schemes for it

(1) No coupling : The data mining system does not use any function of a database or data warehouse system.

(2) Loose coupling : The data mining system uses some facilities of a DB or DW system, fetching data from a data repository managed by that system.

It is better than no coupling because it can fetch any portion of the stored data using the database's own operations, such as query processing and indexing.

(3) Semitight coupling : Besides linking the DM system to a DB/DW system, efficient implementations of a few essential data mining primitives are provided in the DB/DW system.

(4) Tight coupling : The DM system is smoothly integrated into the DB/DW system.
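
The difference between loose and semitight coupling can be sketched with Python's built-in `sqlite3` module. This is only an approximation under my own assumptions: the `sales` table and its rows are made up, and SQL's `GROUP BY` stands in for a "mining primitive" provided by the database. In the loose case the database only hands over rows and the counting happens in the application; in the semitight case the aggregation runs inside the database.

```python
import sqlite3
from collections import Counter

# Hypothetical in-memory database with a made-up sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, item TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("a", "pen"), ("a", "ink"), ("b", "pen"), ("c", "pen"), ("c", "ink")],
)

# Loose coupling: the DB is used only to fetch rows;
# the counting (the "mining" step) is done by the application.
rows = conn.execute("SELECT item FROM sales").fetchall()
loose_counts = dict(Counter(item for (item,) in rows))

# Semitight coupling (approximated): the aggregation primitive
# is executed inside the database itself.
semitight_counts = dict(
    conn.execute("SELECT item, COUNT(*) FROM sales GROUP BY item").fetchall()
)
conn.close()

print(loose_counts, semitight_counts)
```

Both variants give the same item counts; what changes is where the work is done, which matters once the data no longer fits comfortably in the application's memory.
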

I found a nice video introducing data mining; I hope it helps you. You can check it out at the following link:

http://www.youtube.com/watch?v=8fh2zUNs22U

Pre-processing of Data


  • The objective of pre-processing is to remove noise and redundant data, so that clean data is available for the mining process.
  • There are mainly 4 types of pre-processing activities: data cleaning, data integration, data transformation and data reduction.
Data Cleaning :

  • Used to create noise-free data by different methods. It mainly includes the following methods.
    • filling in missing values
    • smoothing noisy data by binning, regression and clustering
    • identifying or removing outliers
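
The three cleaning methods above can be sketched in a few lines of plain Python. The function names and sample values are my own assumptions. Filling with the mean is just one possible strategy for missing values, equal-frequency bins with bin means are one of several binning variants, and the "more than k standard deviations from the mean" rule is only one simple way of flagging outliers.

```python
from statistics import mean, pstdev

def fill_missing(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    m = mean(known)
    return [m if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort, split into equal-frequency bins, replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        bin_values = ordered[i:i + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

def find_outliers(values, k=2.0):
    """Flag values lying more than k standard deviations from the mean."""
    m, s = mean(values), pstdev(values)
    return [v for v in values if abs(v - m) > k * s]

# Made-up sample data for illustration.
print(fill_missing([4, None, 8]))
print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
print(find_outliers([10, 12, 11, 13, 12, 100]))
```
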
Data Integration :

  • It means merging data from multiple data stores.
  • Some issues arise while performing integration :
    • Schema integration and object matching can be tricky
    • Redundancy
    • Detection and resolution of data value conflicts
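
Here is a small sketch of how these three issues show up when two sources are merged. Everything in it is hypothetical: the `crm` and `billing` records, the `FIELD_MAP` that maps each source's key name onto one shared schema, and the "first source wins" policy for conflicting values. Real integration usually needs metadata and domain knowledge to decide which value is right.

```python
# Two made-up sources that describe the same customers with different schemas.
crm = [
    {"cust_id": 1, "name": "Asha", "city": "Pune"},
    {"cust_id": 2, "name": "Ravi", "city": "Delhi"},
]
billing = [
    {"customer_number": 1, "city": "Pune"},
    {"customer_number": 2, "city": "New Delhi"},
    {"customer_number": 3, "city": "Surat"},
]

# Schema integration: map source-specific field names onto one shared name.
FIELD_MAP = {"cust_id": "customer_id", "customer_number": "customer_id"}

def to_common_schema(record):
    return {FIELD_MAP.get(k, k): v for k, v in record.items()}

def integrate(*sources):
    """Merge records by customer_id; report value conflicts instead of hiding them."""
    merged, conflicts = {}, []
    for source in sources:
        for record in map(to_common_schema, source):
            # Redundancy: records for the same customer collapse into one entry.
            target = merged.setdefault(record["customer_id"], {})
            for field, value in record.items():
                if field in target and target[field] != value:
                    # Value conflict: keep the first value seen, log the clash.
                    conflicts.append(
                        (record["customer_id"], field, target[field], value)
                    )
                else:
                    target[field] = value
    return merged, conflicts

merged, conflicts = integrate(crm, billing)
print(merged)
print(conflicts)
```
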
Data Transformation :

  • In this pre-processing activity, data are transformed into forms appropriate for mining.
  • Data transformation can involve the following.
    • Smoothing : It removes noise from data. Techniques include binning, regression and clustering.
    • Aggregation : Summary or aggregation operations are applied to the data.
    • Generalization : Low-level data are replaced by higher-level concepts through the use of concept hierarchies.
    • Normalization : Attribute data are scaled so as to fall within a small specified range, such as 0.0 to 1.0.
    • Attribute Construction : New attributes are constructed from the given set of attributes and added to help the mining process.
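
A few of these transformations are easy to sketch in plain Python. The helper names, the sample values and the tiny city-to-country hierarchy below are assumptions made up for this illustration. Min-max scaling is only one normalization method, and the date handling assumes ISO strings like "2012-09-15".

```python
from collections import defaultdict

def min_max(values, new_min=0.0, new_max=1.0):
    """Normalization: rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # all values equal, nothing to scale
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def generalize(values, hierarchy):
    """Generalization: replace low-level values by their higher-level concept."""
    return [hierarchy.get(v, v) for v in values]

def aggregate_by_month(daily_sales):
    """Aggregation: sum (date, amount) pairs into monthly totals."""
    totals = defaultdict(float)
    for date, amount in daily_sales:
        totals[date[:7]] += amount  # "YYYY-MM" prefix of an ISO date
    return dict(totals)

def add_area(record):
    """Attribute construction: derive a new attribute from existing ones."""
    return {**record, "area": record["width"] * record["height"]}

# Made-up sample data for illustration.
scaled = min_max([12000, 73600, 98000])
countries = generalize(["Pune", "Paris", "Oslo"], {"Pune": "India", "Paris": "France"})
monthly = aggregate_by_month(
    [("2012-09-14", 10.0), ("2012-09-15", 5.0), ("2012-10-01", 2.5)]
)
plot = add_area({"width": 2, "height": 3})
print(scaled, countries, monthly, plot)
```
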







