Data Warehouse Architecture
Now that you’ve developed an understanding of how your business community actually uses its data and where it wants to go with it, it’s time to develop a logical data model. Data Warehouses can be architected in many different ways, depending on the specific needs of a business. In short, data is moved from databases used in operational systems into a data warehouse staging area, then into a data warehouse and finally into a set of conformed data marts. A data warehouse environment can include an extraction, transportation, transformation, and loading (ETL) solution, statistical analysis, reporting, data mining capabilities, client analysis tools, and other applications that manage the process of gathering data, transforming it into useful, actionable information, and delivering it to business users.
Traditional Approaches to Data Warehouse Architecture
When it comes to designing a data warehouse for your business, the two most commonly discussed methods are the approaches introduced by Bill Inmon and Ralph Kimball. Each methodology has its own advantage and differentiating factors, and both are continuously used based on particular business needs.
Bill Inmon’s Methodology
A normalized enterprise data model starts with its design and is followed by the dimensional data marts, which contain data required for specific business processes or specific departments that are created from the data warehouse. A data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in third normal form.
Ralph Kimball’s Methodology
This is a dimensional design approach in which the data marts facilitating reports and analysis are created first, then combined together to create a broad data warehouse. Data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.
Which Methodology is Best for You?
Deciding which is best suitable for your business needs depends on the business objectives of an organization, nature of business, as well as time and cost involved. There is no right or wrong between these two ideas, as they represent different data warehousing philosophies.
Planning a quality data warehouse design requires the objectives for query performance to be met as well as the complete lifecycle of data as it enters and exits in the data warehouse over a period of time. Knowledge of how the data warehouse database is used and maintained plays an essential part in many of the necessary design factors. Consider the following:
- Query Performance: It is important to understand query performance because it affects many aspects of design such as the placement of database objects.
- Data Availability: Data availability affects the scheduling of maintenance operations.
- Currency of the Data: How current/fresh must the data be? Determine frequency of data pull (hourly, daily, weekly). What is the schedule of maintenance operations such as backup? The strategy for these operations affects your data warehouse design.
- Populating the data warehouse: How is the data loaded or extracted from the data warehouse? Understanding how to perform these operations in your data warehouse can help you to determine whether you need a staging layer. The way that you remove or archive the data also influences your table partitioning and design.
- Data Volume: The volume of data to be loaded affects indexing, table partitioning, and maintaining the aggregation layer.
- User Skills & Knowledge:
- Unskilled in data acquisition- Need preformed reports and charts. Typically do not know the data.
- Skilled in data acquisition- Can write their own SQL quires, can use BI tools, and/or are Excel gurus- they must know the data
- Management- Skilled or unskilled but need information in actionable format
- Access Tools: How will the data be retrieved and viewed?
- Only SQL
- BI tools
A Data Structure represents both physical and logical data contained in common data architecture. It includes data entities, data subjects, its contents, relationships and arrangement.
Data component refers to a component of the metadata warehouse that contains the structure of data within the common data architecture.
- Data structures define what and how data will be stored in the database or data warehouse
- How long the data will be stored
- If it is no longer needed how will it be disposed of or archived
- Who will be responsible for collecting and ensuring quality
- Who will have access to data
Types of Data Structures
- Structured Data: Information that conforms with the formal structure of data models associated with relational databases or other forms of data tables.
- Unstructured Data: Information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Typically text-heavy, but may contain data such as dates, numbers, and facts.
- Semi-structured Data: Variant of structured data that does not fully conform with the formal above definition, but contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Things like XML, JASON, and Standard Generalized Markup Language (SGML) document (e.g. Word document).
- Big Data: A term for collections of data so large or complex that traditional data processing application are inadequate. Can be any mixture of structured, unstructured, or semi-structured data.
Structured Data Designs
- Copy of Source: Direct copy of the source data in its format
- Third Normal Form: Relational database in 3rd normal (or more) form.
- Dimensional Model: Summary Tables, Star Schemas, and/or Data Cubes.
Physical database design takes into account data and transaction volume as well as typical queries to produce a schema and environment that will meet necessary performance requirements.
- Third Normal Form: Direct one to 3rd normal form model entity to relational database table.
- Star Schema: direct one to one dimensional model entity to database table with no summarization.
- Summary Table: just the dimensional model fact entity to a database table summarizing a single layer of each dimension.
- Data Cube: Multi-dimensional data are seen and represented as data cubes
It’s is important to understand what your data is and why it is needed when deciding on which design best suites your needs. If the speed of getting information into the warehouse is not critical, quick query response time is not needed, and the users are skilled, then the best structure is third normal form. Third normal form is also best for data mining. If fast response time is critical, actionable information is of high importance, and user skill levels are not as strong, then a combination of the dimensional model designs it the best.
The very best design is one that has all these pieces:
- All data in third normal form.
- Star schema and summary tables for the most often used dimensions such as time, customer, and product.
- Very specialized summary tables for actionable information.
- Data cubes for analytics and to drive dashboards.
Having read this article won’t make you an expert in data modeling, but you will hopefully understand what sort of skill level is required to achieve the business objectives for a data warehouse. You’ll need an experienced data modeler who can do both the acquisition and logical modeling steps if your warehouse is to be successful and valuable.