DatawareHousing Questions and Answers
what is datawarehouse?
A data warehouse is a electronic storage of an Organization's historical data for the purpose of analysis and reporting
a datawarehouse should be subject-oriented, non-volatile, integrated and time-variant.
What is ETL(Extract
Transform Load)?
ETL is the abbreviation f extract, transform, load. ETL is software that enables businesses to consolidate their deisparate data
while moving it from place to place, and it doesn’t really matter that that
data is in different forms or formats. The data can come from any source.
First the Extract functions reads data from a specified
source database and extracts a desired subset of data. Next, the transform
function works with the acquired data using rules or lookup tables or
creating combinations with other data to convert it to the desired state.
Finally, the load function is ued to
write the resulting data to a target database.
what is fact table?
A fact table is the central table in a star schema of a data warehouse. A fact table stores
quantitative information for analysis
A fact table works with dimension tables. A fact table holds the data to be analyzed, and a
dimension table stores data about the ways in which the data in the fact table can be analyzed.
Thus, the fact table consists of two types of columns. The foreign keys column allows joins with
dimension tables, and the measures columns contain the data that is being analyzed.
Suppose that a company sells products to customers. Every sale is a fact that happens, and the
fact table is used to record these facts. For example:
Time ID | Product ID | Customer ID | Unit Sold |
4 | 17 | 2 | 1 |
8 | 21 | 3 | 2 |
8 | 4 | 1 | 1 |
Now we can add a dimension table about customers:
Customer ID
|
Name
|
Gender
|
Income
|
Education
|
Region
|
1 | Brian Edge | M | 2 | 3 | 4 |
2 | Fred Smith | M | 3 | 5 | 1 |
3 | Sally Jones | F | 1 | 7 | 3 |
In this example, the customer ID column in the fact table is the foreign key that joins with the
dimension table. By following the links, you can see that row 2 of the fact table records the fact
that customer 3, Sally Jones, bought two items on day 8. The company would also have a product
table and a time table to determine what Sally bought and exactly when.
When building fact tables, there are physical and data limits. The ultimate size of the object
as well as access paths should be considered. Adding indexes can help with both. However, from a
logical design perspective, there should be no restrictions. Tables should be built based on
current and future requirements, ensuring that there is as much flexibility as possible built into
the design to allow for future enhancements without having to rebuild the data.
what is dimension table?
A dimension table is a table in a star schema of a data warehouse. A dimension table stores
attributes, or dimensions, that describe the objects in a fact table.
In data warehousing, a dimension is a collection of reference information about a measurable event. These events are known as facts and are stored in a fact table. Dimensions categorize and describe data warehouse facts and measures in ways that support meaningful answers to business questions. They form the very core of dimensional modeling.
A data warehouse organizes descriptive attributes as columns in dimension tables. For
example, a customer dimension’s attributes could include first and last name, birth date, gender,
etc., or a website dimension would include site name and URL attributes.
A dimension table has a primary key column that uniquely identifies each dimension record
(row). The dimension table is associated with a fact table using this key. Data in the
fact table can be filtered and grouped (“sliced and diced”) by various combinations of
attributes. For example, a Login fact with Customer, Website, and Date dimensions can be
queried for “number of males age 19-25 who logged in to funsportsite.com more than once during the
last week of September 2010, grouped by day.”
Dimension tables are referenced by fact tables using keys. When creating a dimension table in a
data warehouse, a system-generated key is used to uniquely identify a row in the dimension. This
key is also known as a surrogate key. The surrogate key is used as the primary key in the dimension
table. The surrogate key is placed in the fact table and a foreign key is defined between the two
tables. When the data is joined, it does so just as any other join within the database.
Like fact tables, dimension tables are often highly de-normalized, because these structures are
not built to manage transactions they are built to enable users to analyze data as easily as
possible.
What is the difference between OLTP and OLAP?
OLTP is the transaction system that collects business data. Whereas OLAP is the reporting and analysis system on that data.
OLTP systems are optimized for INSERT, UPDATE operations and
therefore highly normalized. On the other hand, OLAP systems are
deliberately denormalized for fast data retrieval through SELECT
operations.
What is data mart?
Data marts are generally designed for a single subject area. An
organization may have data pertaining to different departments like
Finance, HR, Marketting etc. stored in data warehouse and each
department may have separate data marts. These data marts can be built
on top of the data warehouse.
What is ER model?
ER model or entity-relationship model is a particular methodology of
data modeling wherein the goal of modeling is to normalize the data by
reducing redundancy. This is different than dimensional modeling where
the main goal is to improve the data retrieval mechanism.
What is dimensional modeling?
Dimensional model consists of dimension and fact tables. Fact tables
store different transactional measurements and the foreign keys from
dimension tables that qualifies the data. The goal of Dimensional model
is not to achive high degree of normalization but to facilitate easy and faster data retrieval.
What is dimension?
A dimension is something that qualifies a quantity (measure).
For an example, consider this: If I just say… “20kg”, it does not
mean anything. But if I say, "20kg of Rice (Product) is sold to Ramesh
(customer) on 5th April (date)", then that gives a meaningful sense.
These product, customer and dates are some dimension that qualified the measure - 20kg.
Dimensions are mutually independent. Technically speaking, a dimension
is a data element that categorizes each item in a data set into
non-overlapping regions.
What is Fact?
A fact is something that is quantifiable (Or measurable). Facts are
typically (but not always) numerical values that can be aggregated.
What is Star-schema?
This schema is used in data warehouse models where one centralized fact
table references number of dimension tables so as the keys (primary key)
from all the dimension tables flow into the fact table (as foreign key)
where measures are stored. This entity-relationship diagram looks like a
star, hence the name.
Consider a fact table that stores sales quantity for each product and
customer on a certain time. Sales quantity will be the measure here and
keys from customer, product and time dimension tables will flow into the
fact table.What is snow-flake schema?
This is another logical arrangement of tables in dimensional modeling
where a centralized fact table references number of other dimension
tables; however, those dimension tables are further normalized into
multiple related tables.
Consider a fact table that stores sales quantity for each product and
customer on a certain time. Sales quantity will be the measure here and
keys from customer, product and time dimension tables will flow into the
fact table. Additionally all the products can be further grouped under
different product families stored in a different table so that primary
key of product family tables also goes into the product table as a
foreign key. Such construct will be called a snow-flake schema as
product table is further snow-flaked into product family.
Note
Snow-flake increases degree of normalization in the design.
What is surrogate key ? Where we use it explain with examples?
Surrogate key is a
unique identification key, it is like an artificial or alternative key
to production key, bz the production key may be alphanumeric or
composite key but the surrogate key is always single numeric key.
Assume the production key is an alphanumeric field if you create an
index for this fields it will occupy more space, so it is not advisable
to join/index, bz generally all the datawarehousing fact table are
having historical data. These factable are linked with so many
dimension table. if it's a numerical fields the performance is high
or
Surrogate key is a substitution for the natural primary key.
It
is just a unique identifier or number for each row that can be used for
the primary key to the table. The only requirement for a surrogate
primary key is that it is unique for each row in the table.
Data
warehouses typically use a surrogate, (also known as artificial or
identity key), key for the dimension tables primary keys. They can use
Infa sequence generator, or Oracle sequence, or SQL Server Identity values for the surrogate key.
It
is useful because the natural primary key (i.e. Customer Number in
Customer table) can change and this makes updates more difficult.
Some
tables have columns such as AIRPORT_NAME or CITY_NAME which are stated
as the primary keys (according to the business users) but, not only can
these change, indexing on a numerical value is probably better and you
could consider creating a surrogate key called, say, AIRPORT_ID. This
would be internal to the system and as far as the client is concerned
you may display only the AIRPORT_NAME.
Another benefit you can get from surrogate keys (SID) is :
Tracking the SCD - Slowly Changing Dimension.
Let me give you a simple, classical example:
On
the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1'
(that's what would be in your Employee Dimension). This employee has a
turnover allocated to him on the Business Unit 'BU1' But on the 2nd of
June the Employee 'E1' is muted from Business Unit 'BU1' to Business
Unit 'BU2.' The entire new turnovers have to belong to the new Business
Unit 'BU2' but the old one should belong to the Business Unit 'BU1.'
If
you used the natural business key 'E1' for your employee within your
datawarehouse everything would be allocated to Business Unit 'BU2' even
what actually belongs to 'BU1.'
If you use surrogate keys, you
could create on the 2nd of June a new record for the Employee 'E1' in
your Employee Dimension with a new surrogate key.
This way, in
your fact table, you have your old data (before 2nd of June) with the
SID of the Employee 'E1' + 'BU1.' All new data (after 2nd of June) would
take the SID of the employee 'E1' + 'BU2.'
You could consider
Slowly Changing Dimension as an enlargement of your natural key: natural
key of the Employee was Employee Code 'E1' but for you it becomes
Employee
Code + Business Unit - 'E1' + 'BU1' or 'E1' + 'BU2.' But the difference
with the natural key enlargement process, is that you might not have
all part of your new key within your fact table, so you might not be
able to do the join on the new enlarge key -> so you need another id.
What are the different types of dimension?
In a data warehouse model, dimension can be of following types,
- Conformed Dimension
- Junk Dimension
- Degenerated Dimension
- Role Playing Dimension
Based on how frequently the data inside a dimension changes, we can further classify dimension as
- Unchanging or static dimension (UCD)
- Slowly changing dimension (SCD)
- Rapidly changing Dimension (RCD)
What is a 'Conformed Dimension'?
A conformed dimension is the dimension that is shared across multiple
subject area. Consider 'Customer' dimension. Both marketing and sales
department may use the same customer dimension table in their reports.
Similarly, a 'Time' or 'Date' dimension will be shared by different
subject areas. These dimensions are conformed dimension.
Theoretically, two dimensions which are either identical or strict
mathematical subsets of one another are said to be conformed.
What is degenerated dimension?
A degenerated dimension is a dimension that is derived from fact table and does not have its own dimension table.
A dimension key, such as transaction number, receipt number, Invoice
number etc. does not have any more associated attributes and hence can
not be designed as a dimension table.
What is junk dimension?
A junk dimension is a grouping of typically low-cardinality attributes
(flags, indicators etc.) so that those can be removed from other tables
and can be junked into an abstract dimension table.
These junk dimension attributes might not be related. The only purpose
of this table is to store all the combinations of the dimensional
attributes which you could not fit into the different dimension tables
otherwise. Junk dimensions are often used to implement Rapidly Changing Dimensions in data warehouse.
What is a role-playing dimension?
Dimensions are often reused for multiple applications within the same
database with different contextual meaning. For instance, a "Date"
dimension can be used for "Date of Sale", as well as "Date of Delivery",
or "Date of Hire". This is often referred to as a 'role-playing
dimension'
What is SCD?
SCD stands for slowly changing dimension, i.e. the dimensions where data
is slowly changing. These can be of many types, e.g. Type 0, Type 1,
Type 2, Type 3 and Type 6, although Type 1, 2 and 3 are most common.
What is rapidly changing dimension?
This is a dimension where data changes rapidly.
Describe different types of slowly changing Dimension (SCD)
Type 0:
A Type 0 dimension is where dimensional changes are not considered. This
does not mean that the attributes of the dimension do not change in
actual business situation. It just means that, even if the value of the
attributes change, history is not kept and the table holds all the
previous data.
Type 1:
A type 1 dimension is where history is not maintained and the table
always shows the recent data. This effectively means that such dimension
table is always updated with recent data whenever there is a change,
and because of this update, we lose the previous values.
Type 2:
A type 2 dimension table tracks the historical changes by creating
separate rows in the table with different surrogate keys. Consider there
is a customer C1 under group G1 first and later on the customer is
changed to group G2. Then there will be two separate records in
dimension table like below,
Key | Customer | Group | Start Date | End Date |
1 | C1 | G1 | 1st Jan 2000 | 31st Dec 2005 |
2 | C1 | G2 | 1st Jan 2006 | NULL |
Note that separate surrogate keys are generated for the two records.
NULL end date in the second row denotes that the record is the current
record. Also note that, instead of start and end dates, one could also
keep version number column (1, 2 … etc.) to denote different versions of
the record.
Type 3:
A type 3 dimension stored the history in a separate column instead of
separate rows. So unlike a type 2 dimension which is vertically growing,
a type 3 dimension is horizontally growing. See the example below,
Key | Customer | Previous Group | Current Group |
1 | C1 | G1 | G2 |
This is only good when you need not store many consecutive histories and when date of change is not required to be stored.
Type 6:
A type 6 dimension is a hybrid of type 1, 2 and 3 (1+2+3) which acts
very similar to type 2, but only you add one extra column to denote
which record is the current record.
Key | Customer | Group | Start Date | End Date | Current Flag |
1 | C1 | G1 | 1st Jan 2000 | 31st Dec 2005 | N |
2 | C1 | G2 | 1st Jan 2006 | NULL | Y |
What is a mini dimension?
Mini dimensions can be used to handle rapidly changing dimension
scenario. If a dimension has a huge number of rapidly changing
attributes it is better to separate those attributes in different table
called mini dimension. This is done because if the main dimension table
is designed as SCD type 2, the table will soon outgrow in size and
create performance issues. It is better to segregate the rapidly
changing members in different table thereby keeping the main dimension
table small and performing.
What is a fact-less-fact?
A fact table that does not contain any measure is called a fact-less
fact. This table will only contain keys from different dimension tables.
This is often used to resolve a many-to-many cardinality issue.
Explanatory Note:
Consider a school, where a single student may be taught by many teachers
and a single teacher may have many students. To model this situation in
dimensional model, one might introduce a fact-less-fact table joining
teacher and student keys. Such a fact table will then be able to answer
queries like,
- Who are the students taught by a specific teacher.
- Which teacher teaches maximum students.
- Which student has highest number of teachers.etc. etc.
What is a coverage fact?
A fact-less-fact table can only answer 'optimistic' queries (positive
query) but can not answer a negative query. Again consider the
illustration in the above example. A fact-less fact containing the keys
of tutors and students can not answer a query like below,
- Which teacher did not teach any student?
- Which student was not taught by any teacher?
Why not? Because fact-less fact table only stores the positive scenarios
(like student being taught by a tutor) but if there is a student who is
not being taught by a teacher, then that student's key does not appear in this table, thereby reducing the coverage of the table.
Coverage fact table attempts to answer this - often by adding an extra
flag column. Flag = 0 indicates a negative condition and flag = 1
indicates a positive condition. To understand this better, let's
consider a class where there are 100 students and 5 teachers. So
coverage fact table will ideally store 100 X 5 = 500 records (all
combinations) and if a certain teacher is not teaching a certain
student, the corresponding flag for that record will be 0.
What are incident and snapshot facts
A fact table stores some kind of measurements. Usually these
measurements are stored (or captured) against a specific time and these
measurements vary with respect to time. Now it might so happen that the
business might not able to capture all of its measures always for every
point in time. Then those unavailable measurements can be kept empty
(Null) or can be filled up with the last available measurements. The
first case is the example of incident fact and the second one is the
example of snapshot fact.
What is aggregation and what is the benefit of aggregation?
A data warehouse usually captures data with same degree of details as
available in source. The "degree of detail" is termed as granularity.
But all reporting requirements from that data warehouse do not need the
same degree of details.
To understand this, let's consider an example from retail business. A
certain retail chain has 500 shops accross Europe. All the shops record
detail level transactions regarding the products they sale and those
data are captured in a data warehouse.
Each shop manager can access the data warehouse and they can see which
products are sold by whom and in what quantity on any given date. Thus
the data warehouse helps the shop managers with the detail level data
that can be used for inventory management, trend prediction etc.
Now think about the CEO of that retail chain. He does not really care
about which certain sales girl in London sold the highest number of
chopsticks or which shop is the best seller of 'brown breads'. All he is
interested is, perhaps to check the percentage increase of his revenue
margin accross Europe. Or may be year to year sales growth on eastern
Europe. Such data is aggregated in nature. Because Sales of goods in
East Europe is derived by summing up the individual sales data from each
shop in East Europe.
Therefore, to support different levels of data warehouse users, data aggregation is needed.
What is slicing-dicing?
Slicing means showing the slice of a data, given a certain set of
dimension (e.g. Product) and value (e.g. Brown Bread) and measures (e.g.
sales).
Dicing means viewing the slice with respect to different dimensions and in different level of aggregations.
Slicing and dicing operations are part of pivoting.
What is drill-through?
Drill through is the process of going to the detail level data from summary data.
Consider the above example on retail shops. If the CEO finds out that
sales in East Europe has declined this year compared to last year, he
then might want to know the root cause of the decrease. For this, he may
start drilling through his report to more detail level and eventually
find out that even though individual shop sales has actually increased,
the overall sales figure has decreased because a certain shop in Turkey
has stopped operating the business. The detail level of data, which CEO
was not much interested on earlier, has this time helped him to pin
point the root cause of declined sales. And the method he has followed
to obtain the details from the aggregated data is called drill through.
Data Warehousing Q & A
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Thanking you for your valuable comments
ReplyDelete