Counter of Festivals

Ashok Blog for SQL Learners and Beginners and Experts

Tuesday, 8 January 2013

Data Warehouse & Data mining Concepts

 

What is a data warehouse - A 101 guide to modern data warehousing

This article discusses data warehousing from a holistic standpoint and quickly touches upon all the relevant concepts that one needs to know. Start here if you do not know where to start from.
Since this site is dedicated to data warehousing professionals, it was plain for us to assume that visitors of this site are already quite familiar with the basic concepts of data warehousing and business intelligence. Creating basic tutorials on these subjects were perhaps too obvious for us to do and that explains why there are not many articles on the basic concepts and definitions of data warehousing in this site. But something began to strike us.
We began to realize that there are still a lot of craving among our readers to understand the basic. And wikipedia definition of data warehousing is not enough for them - partly because of the inadequate elaboration and partly because of the heterogeneous background and experiences of our users who are still struggling to grasp the fundamentals. So we decided to produce a set of comprehensive and basic ground up articles for our readers to enable them venture deep into the subject of data warehousing. So let's begin with what is data warehousing, and what is business intelligence and why should we care.

What is data warehousing?

Data warehousing is the science of storing data for the purpose of meaningful future analysis.
Yes, it is a science (not much art involved!) and it deals with the mechanism of electronically storing and retrieving data so that some analysis can be performed on that data to corroborate / support a business decision or to predict a business outcome.
What is Business Intelligence?
Business Intelligence, on the other hand, is simply the art and science of presenting historical data in a meaningful way (often by using different data visualization techniques).
Raw data stored in databases turns into valuable information through the implementation of Business Intelligence processes.

Why data warehouse?

DW technologies provide historical, current and predictive views of business operations by analyzing the present and historical business data. Data analysis is often done using visualization techniques that turn complex data into images that tells compelling story. Raw data by this process of analysis help management take right decisions.
As an example of how modern visualization techniques are helping to unlock the complex and hidden information stored deep inside the data, visit http://www.gapminder.org
GapMinder is a small tool conceptualized by Prof. Hans Rosling, that analyzes complex socio-economical data to reveal world's most important trends
To further demonstrate the need of data warehousing, consider this.
English: Typical Data Warehouse Functional Diagram Let's imagine a company called "FairShop" that has 1000 retail outlets across USA. The company has built one data warehouse to store the data collected from all the shop outlets so that they can analyze the data to gather business intelligence.
The company collects raw sales data from all of their outlet shops (through a process called ETL) and then load them into a place called Data warehouse or data mart (at this point don't bother too much about the exact meaning and differences of data mart and data warehouse - we will get to it in detail later).
Once the data is there in data warehouse (or data mart) business intelligence techniques are applied to that data for analysis and reporting. Since the company now has the sales and purchase information from all their shops in a centralized place, it can easily use this data to answer some rudimentary questions about their business e.g. what shop makes highest sales, which product is most popular across the shop, what is the stock balance etc.
It is very common to talk about both data warehousing and business intelligence together since business intelligence in this context refers to analyzing the data in data warehouse. As Wikipedia puts it:
"Common functions of business intelligence technologies are reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining and predictive analytics."

Building the definition of data warehouse

There are couple of points to notice from the typical retail scenario shown above. These points form the base of our discussion on the definition.
  1. Objective of building a data warehouse is to store data that are required for analysis and reporting
  2. For a data warehouse to function, it should be supplemented with a process that can collect and load data into it (e.g. ETL)
  3. In a data warehouse, data actually flows from the source to target - so the contents of the data warehouse would be different between 2 given points in time
Now if I tell you that, the definition of data warehouse can be constructed from the above 3 points - that shouldn't surprise you. But what will surprise you is - a lot of these points are not really considered in the classic definition of data warehouse.
So let's discuss the classic definitions of data warehouse first.

Classic Definition of Data Warehouse - A peek in the hostory

The history of data warehouse dates back to 1960. Without going into detail, here we will quickly touch upon a few noteworthy events of the history of data warehouse. In 1970, ACNielsen, a global marketting research company, published sales data pertaining to retail industry in the form of dimensional data mart.
Earlier than this, the concept of data warehouse for analysis was only a subject of academic pursuit.
Along the same time, the concept of decision support system were gradually developing and people started to realize that data stored in operational systems (e.g. data stored in the individual stores of a retail chain) are not easy to analyze in the time of decision making. So in 1983, Teradata introduced a database management system specifically designed for decision support.
In this decade and next, several people experimented with several designs of data warehouse and some of them were quite successful. In the year 1992, one of them, named Bill Inmon published a book - Building the Data Warehouse - which among other things, gave us a widely accepted definition of what a data warehouse is. We will soon jump into that definition. But before that let me mention one more author - Ralph Kimball - who 4 years later in 1996 wrote another book - Data warehouse toolkit - showing us yet another approach of defining and building a data warehouse. Since then, both Inmon and Kimball approaches are widely accepted and implemented throughout the globe.
So How did Bill Inmon defined a data warehouse? Here it is:
A data warehouse is a subject oriented, non-volatile, integrated, time variant collection of data in support of management's decisions.
Now let's understand this definition.

Explanation on the classic definition of data warehouse

Subject Oriented

This means a data warehouse has a defined scope and it only stores data under that scope. So for example, if the sales team of your company is creating a data warehouse - the data warehouse by definition is required to contain data related to sales (and not the data related to production management for example)

Non-volatile

This means that data once stored in the data warehouse are not removed or deleted from it and always stay there no matter what.

Integrated

This means that the data stored in a data warehouse make sense. Fact and figures are related to each other and they are integrable and projects a single point of truth.

Time variant

This means that data is not constant, as new and new data gets loaded in the warehouse, data warehouse also grows in size.

Identifying a data warehouse based on its definition

4 simple terms by Inmon defined data warehouse succinctly. Let's now check how the definition help us identify a data warehouse from other types of data stores.
Is a book written on how to create 17 different peas pudding, a data warehouse? It is subject oriented (deals with peas pudding), Non-volatile (deals with fixed 17 methods that are there to stay), integrated (makes sense). But it's not time variant. Answer is it's not a data warehouse.
So is the folder on my desktop named "Account Statements" a data warehouse? Subject oriented (deals with financial accounting), non-volatile (until I manually delete it), Time variant (every month new account statements pour in) but it's not integrated (one file in the folder containing the account statement from bank XYZ for the month of May has no way to link to the other file in the folder containing the account statement of the bank ABC for the month of June). So - not a data warehouse.
So is the sales information collected in one store of a big retail chain a data warehouse? It's subject oriented, time variant and integrated (after all there is a relational database behind). But is it non-volatile? Mostly not. And even if it is - there is a fifth factor. Is it being used for the purpose of management decision making? Surely not. After all who will take an enterprise wide management decision based on the data collected from a single store?

A broader definition of data warehouse

The classic definition that we discussed above does not focus much on the purpose of the data warehouse. The purpose is something which distinguishes a data warehouse from a data mart if you will and help us understand the need of the data warehouse. The purpose of a data warehouse, as we discussed before, is to render a timely data-driven insight that was otherwise inconceivable directly from the raw data. A data warehouse which stores data, is time variant and subject oriented and integrated yet does not solve this purpose - is no better than just a data dump.
An alternative (and more concurrent) definition of data warehouse will be:
A data warehouse is an electronically stored collection of integrated data that can be used for the purpose of intelligent analysis.
Dropping the time variance from the above definition broadens the coverage of the definition and omission of non-volatility condition makes the definition more realistic rather than idealistic. There are many data that are not time variant (historical and scientific data) but can be stored in a data warehouse for analysis. Similarly modern data warehouses are purged regularly when the data lose its purpose. Adding a sense of purpose in the definition enables us to create a more reliable and goal-oriented data warehouse.

Schematic View of a data warehouse

Simple schematic for a data warehous... The diagram above shows a typical schematic structure of a data warehouse. As one can see here, most data warehouses collect data from multiple sources to form one integrated warehouse. Before loading to the warehouse, these data often need special treatment which is done in the ETL layer (ETL - Extraction, Transformation, Loading). ETL layer is mostly responsible for 2 types of treatments on the data:
  • Data Integration - So that some links can be established between data coming from separate systems, and
  • Qualitative Treatment - so that the validity and quality of the data can be checked (and if required corrected) before loading to the data warehouse.

 

 

 

A road-map on Testing in Data Warehouse

Testing in data warehouse projects are till date a less explored area. However, if not done properly, this can be a major reason for data warehousing project failures - especially in user acceptance phase. Given here a mind-map that will help a project manager to think all the aspects of testing in data warehousing.

Testing Mindmap

DWBI Testing

Points to consider for DWBI Testing

  1. Why is it important?
    • To bug-free the code
    • To ensure data quality
    • To increase credibility of BI Reports
    • More BI projects fail after commissioning due to quality issue
  2. What constitutes DWBI Testing?
    • Performance Testing
    • Functional Testing
    • Canned Report Testing
    • Ad-hoc testing
    • Load Reconciliation
  3. What can be done to ease it?
    • Plan for testing
    • Start building DWBI Test competency
    • Design code that generates debug information
    • Build reconciliation mechanism
  4. Why is it difficult?
    • Limited Testing Tool
    • Automated Testing not always possible
    • Data traceability not always available
    • Requires extensive functional knowledge
    • Metadata management tool often fails
    • Deals with bulk data - has performance impact
    • Number of data conditions are huge
Use the above mind-map to plan and prepare the testing activity for your data warehousing project

Data Mining - a simple guide for beginners

This paper introduces the subject of data mining in simple lucid language and moves on to build more complex concepts. Start here if you are a beginner.

Data Mining. I have an allergy to this term.

Not because I hate the subject of data mining itself, but because this term is so much over-used and misused and exploited and commercialized and often conveyed in inaccurate manner, in inappropriate places and often with intentional vagueness.
So when I decided to write about what is data mining, I was convinced that I need to write about what is NOT data mining first, in order to build a formal definition of data mining.

What is Data Mining? (And what it is not)

Here is the Wikipedia definition of data mining:
Data mining … is the process of discovering new patterns from large data sets
Now the question is: what does the above definition really mean and how does it differ from finding information from databases? We often store information in databases (as in data warehouses) and retrieve the information from the database when we need it. Is that data mining? Answer is ‘no’. We will soon see why is it so.
Let’s start with the big picture first. This all starts with something called "Knowledge Discovery in Database". Data mining is basically one of the steps in the process of knowledge discovery in database (KDD). Knowledge discovery process is basically divided in 5 steps:
  1. Selection
  2. Pre-processing
  3. Transformation
  4. Data Mining
  5. Evaluation
“Selection” is the step where we identify the data, “pre-processing” is where we cleanse and profile the data, “transformation” step is required for data preparation, and then is data mining. Lastly we use “Evaluation” to test the result of the data mining.
Notice here the term – “Knowledge” as in Knowledge Discovery in Database (KDD). Why did you say “Knowledge”? Why not “information” or “data”?
This is because there are differences among the terms “data”, “information” and “knowledge”. Let’s understand this difference through one example.
You run a local departmental store and you log all the details of your customers in the store database. You know the names of your customers and what items they buy each day.
For example, Alex, Jessica and Paul visit your store every Sunday and buys candle. You store this information in your store database. This is data. Any time you want to know who are the visitors that buy candle, you can query your database and get the answer. This is information. You want to know how many candles are sold on each day of week from your store, you can again query your database and you’d get the answer – that’s also information.
But suppose there are 1000 other customers who also buy candle from you on every Sunday (mostly – with some percentage of variations) and all of them are Christian by religion. So, you can conclude that Alex, Jessica and Paul must be also Christian.
Now the religion of Alex, Jessica and Paul were not given to you as data. This could not be retrieved from the database as information. But you learnt this piece of information indirectly. This is the ”knowledge” that you discovered. And this discovery was done through a process called “Data Mining”.
Now there are chances that you are wrong about Alex, Jessica and Paul. But there are fare amount of chances that you are actually right. That is why it is very important to “evaluate” the result of KDD process.
I gave you this example because I wanted to make a clear distinction between knowledge and information in the context of data mining. This is important to understand our first question – why retrieving information from deep down of your database is not same as data mining. No matter how complex the information retrieval process is, no matter how deep the information is located at, it’s still not data mining.
As long as you are not dealing with predictive analysis or not discovering “new” pattern from the existing data – you are not doing data mining.

What are the applications of Data Mining?

One of the most important applications of data mining can be observed in the field of machine learning
When it comes to applying data mining, your imagination is the only barrier (not really ☺ there are technological hindrances as well as we will see later). But it’s true that data mining is applied in almost any fields starting from genetics to human rights violation. One of the most important applications is in “Machine Learning”. Machine learning is a branch of artificial intelligence concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data. Machine learning makes it possible for computers to take autonomous decisions based on the data available from past experiences. Many of the standard problems of today’s world are being solved by the application of machine learning as solving them otherwise (e.g. through the deterministic algorithmic approach) would be impossible given the breadth and depth of the problem.
Let me start with one example of the application of data mining that enables machine-learning algorithm to drive an autonomous vehicle. This vehicle does not have any driver and it moves around the road all by itself. The way it maneuvers and overcomes the obstacles is by applying the images that it sees (through a VGA camera) and then using data mining to determine the course of action based on the data of its past experiences.

YTB_VD1_AUTONOMOUS_VEHICLE

Fig. Autonomous Vehicle Designed in Stanford University using 
Data Mining methods to maneuver (Video)
There are notable applications of data mining in the subjects such as –
  • Voice recognition
  • Think of Siri in iPhone. How does it understand your commands? Clearly it’s not deterministically programmable as every body has different tone and accent and voice. And not only it understands, it also adapts better with your voice as you keep using it more and more.
  • Classification of DNA sequences
  • DNA sequence contains biological information. One of the many approaches of DNA sequencing is through sequence mining where data mining techniques are applied to find statistically relevant patters, which are then compared with previously studied sequences to understand the given sequence.
  • Natural Language processing
  • Consider the following conversations between customer (Mike) and shop-keeper (Linda).
    Mike: You have playing cards?
    Linda: We have one blue stack from Jackson’s and also one other from Deborah
    Mike: What is the price?
    Linda: Jackson’s $4 and Deborah’s $7.
    Mike: Okay give me the blue one please.
    Now consider this. What if “Linda” was an automated machine? You could probably have the same kind of conversations still, but it would probably had much more unnatural.
    Mike: You have playing cards?
    Robot: Yes. 
    Mike: What type of playing cards do you have?
    Robot: We have Jackson’s and Deborah’s playing cards.
    Mike: What are the colors of the playing cards?
    Robot: Which Company’s playing card do you want to know the color of?
    Mike: What is the color of Jackson’s playing cards?
    Robot: Blue.
    Mike: What are the prices of Jackson’s and deborah’s playing cards?
    Robot: Jacksons’ playing cards cost you $4 and Deborah’s playing cards cost you $7.
    Mike: Ok, then can I buy the blue ones?
    Robot: We do not have any product called ‘blue ones’.
    Mike: Can I have the blue color playing cards please?
    Robot: Sure!
    I know the above example is a bit of overshoot, but you got the idea. Machines do not understand natural language. And it’s a challenge to make them understand the same. And until we do that we wont be able to build a really useful human-computer interface.
    Recently, real advancement on natural language processing is done after the application of data mining. Prior implementations of language-processing tasks typically involved the direct hand coding of large sets of rules. But the machine-learning paradigm instead used general learning algorithms — often, although not always, grounded in statistical inference — to automatically learn such rules through the analysis of large corpora of typical real-world examples.

Methods of data mining

Now if the above examples interest you then let’s continue learning more about data mining. One of the first tasks that we have to do next is to understand the different approaches that are used in the field of data mining. Below list shows most of the important methods:

Anomaly Detection

This is the method of detecting patterns in a given data set that does not conform to an established normal behavior. This is applied in number of different fields such as – network intrusion detection, share market fraud detection etc.

Association Rule Learning

This is a method of discovering interesting relations between variables in large databases. Ever seen “Buyers who bought this product, also bought these:” type of messages in e-commerce websites (e.g. in Amazon.com)? That’s an example of Association Rule learning.

Clustering

Clustering is the method of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters. Cluster analysis is widely used in market research when working with multivariate data. Market researchers often use this to create customer segmentation, product segmentation etc.

Classification

This method is used for the task of generalizing known structure to apply to new data. For example, an email program might attempt to classify an email as legitimate or spam.

Regression

Attempts to find a function, which models the data with the least error. The above example of autonomous driving uses this method.
.