What is data warehousing?
A data warehouse is a system that stores data from a company’s operational databases as well as external sources. Data warehouse platforms are different from operational databases because they store historical information, making it easier for business leaders to analyze data over a specific period of time. These data warehouses can contain many types of data. A common database houses topics and allows them to combine to create a single report.
Why is data warehousing important?
Data warehouses are an increasingly important business intelligence tool, allowing organizations to:
- Get the big picture. Putting strategic corporate data into a single data warehouse allows the business to see all of their data in one place. Allowing for valuable insights across all departments and less time moving data.
- Make better business decisions. Successful business leaders develop data-driven strategies and rarely make decisions without consulting the facts. So, data warehousing improves the ease and speed of accessing different data sets and makes it easier for corporate decision-makers to derive insights that will guide the business and marketing strategies that set them apart from their competitors.
- Improve their bottom line. Data warehouse platforms allow business leaders to quickly access their organization’s historical activities. Providing the ability to evaluate initiatives that have been successful — or unsuccessful — in the past. This allows executives to see where they can adjust their strategy to decrease costs, maximize efficiency and increase sales to improve their bottom line.
Furthermore, cloud-based technology has revolutionized the business world for the better. Allowing companies to easily manage valuable data about their customers, products and employees anywhere they work. Important business decisions are based on corporate data. Collecting all of this data from an ever-increasing amount of data sources and applications can be difficult. Therefore, it is imperative to have the ability to bring all of that data to one place to gain a 360-degree view of your business.
How is an OLTP different from a Data Warehouse?
Online Transaction-Processing (OLTP) systems facilitate and manage transaction-oriented applications. A data entry tool, a financial processing system, or a retail sales program are examples of such applications. In contrast, data warehouses collect, aggregate, and analyze data to provide business insights.
The way these two access data reveals the difference between them. Since transactions are happening every few minutes, new information needs to be added constantly. This makes OLTP applications considered “write-heavy”. As a result, OLTP applications use structured data to provide faster write and update times, taking into consideration natural business relationships. In turn, optimizing transactions and favoring inserts over queries.
On the other hand, given the frequency data is read from a warehouse, data is restructured for simple access and tuned to provide quick query results. This makes a data warehouse considered “read-heavy”. Data warehouses can be simply cloned from the OLTP schemas, but are generally restructured for ease of understanding by reporting users.
Differences in Data Access and Schema Design
Data access priority is only one of a handful of stark differences between OLTP systems and data warehouse applications. OLTP applications only use the data necessary to carry out current transactions. Therefore, using a smaller number of records per transaction provides a more granular view of the data.
Data Warehouses utilize bulk data operations in order to handle large volumes of data. These usually have a high number of records that need to remain organized and quickly accessible. The schema design patterns also differ for these two classes of application. An OLTP system uses highly normalized schemas in order to facilitate quick updates at the cost of slower queries whenever joins are needed. While traditional data warehouses utilize highly denormalized schemas to boost efficiency and query performance.
|High Throughput||High Volume|
|Standard Queries||Ad Hoc Queries|
|Normalized structure||Denormalized, Cube, Star schema, NoSQL|
Necessity drives design in database architecture. Therefore, this determines exactly how data will be designed for an optimal schema. What are the major use cases for an operational data store? For instance, a database schema used for the purpose of data mining will differ greatly from that of a data store used for reporting. For this type of schema, various pre-computed columns provide quick query results for vectors like totals and averages, which contrasts with analysis data stores that use highly denormalized data.
The typical normalized relational model is the most widely used among schemas and seeks to minimize data redundancy and avoid abnormalities in data operations. However, star schema embraces redundancy through “dimensional modeling”. Here multiple specialized data marts are expanded via denormalization to provide fast and efficient reporting.
Data Warehouse Layouts:
On the most abstract level, data warehouses fall into one of three basic layouts: basic, basic with staging areas, basic with a staging area, and data marts. There are numerous data marts within an organization, each optimized specifically for a specific purpose.
Software developers design standard reports with specialized logic and purposes, such as accounting statements. These entail complex business logic, master-detail layout, and customized formatting. Therefore, they are written by programmers, not end-users, using software that requires knowledge of the database structure. Thus, deep knowledge of the data is required, including the meaning and derivation of each field, and how the data is related.
Businesses can create or extend reports using Business Intelligence reporting platforms. Because of this, they are usually written with a layer of metadata that describes the underlying content and structure, making it easy to develop specific queries. Once called “end-user computing”, they offer drop-and-drag user interfaces and don’t require an understanding of physical data structure by end-users. You can develop Slice-and-Dice reports with summary, drill-down, and detail views. BI platforms are metadata-driven with a system administrator predefining what’s exposed.
Analytics are calculation-heavy queries that provide metrics into an organization’s performance. Key Performance Indicators (KPI’s) to measure business performance are often built ahead of time into daily, weekly, monthly, quarterly, and annual time periods. Moreover, the user of these systems can perform trend analysis and form corrective policies based on poor performance factors.
Data mining involves searching for patterns in the data to spot trends, associations of events, or operational anomalies. Therefore, the query filtering capability is very complex to search for specific conditions or very flexible to give you many views on the data. A data mining warehouse provides the ability to find anomalies, fraud, and operational violations in any business.
Data Integration Hub
A warehouse can be used as a data integration hub, providing a source for downstream applications to retrieve transactional data. This data can be received from a system of record as a hub and spoke. All data moving from the various spoke applications to the hub can then move outward to other spokes as needed.
An integration warehouse provides common access to all data and removes the need for silos. This allows for the collection and sharing of corporate data from all applications and departments. A single view (e.g.: Customer Master) is a different structure of common data. Upgrades and replacements of applications only require a change in the interface between the application and the warehouse.
A data integration hub allows common access using a single DBMS platform. This enables table joins and merging data between formerly separate databases. Instead of requiring staging tables or files. You can then use a common reporting or Business Intelligence platform. Finally, access is controlled within the BI tool or by GRANT access, not by physical separation.
One of the beauties of a data integration hub is that an integration developer doesn’t have to know API’s for other applications. As a result, all they have to do is integrate to the hub, and let the other developers for other applications do their own integration to the hub, using the common framework and data model of the hub.
Compliance has become a necessity in today’s enterprises. Contributing to the increased complexity and costs for data warehousing and operations. Ultimately, however, compliance is “backup’s twin”. Therefore, full backups and compliance data stores use similar techniques.
A compliance backup should provide
- A backup of all records
- An ability to keep and flag records that are deleted from the original system
- An audit trail of changed and deleted data, showing all the versions of individual fields or complete records by change date and who changed them.
- An ability to restore data for single or multiple records from any version of historical data, including the relationships between different types of data (a customer and its contacts, for instance)
The data storage technology can vary widely, using flat files or databases to store the versions. The backup can have the same or a different structure as the source, depending on how the backup needs to be accessed. One popular cloud application uses a single database table for all standard fields of all objects and another database table for custom fields. However, when copied to a backup database, there are some issues. All replication products use the logical model provided by the application’s metadata to show customers, contacts, and all the other related records.
In order to copy data, one can use ETL tools, scripts, or programs that understand the source schema and can navigate through all of its objects. For the best results, perform the full copy process on a regular schedule, or incrementally and frequently.
To sum up this article, data warehouses are important for reporting, integration, and compliance. They take a load off production OLTP systems and make reporting and integration more flexible, scalable, and higher performing for the bulk workload. It is important to use the correct structural style when building a warehouse. Star schemas are not for everyone!
Data warehouses, while having many advantages, can also be costly to build. Investigate these costs before undertaking the project.
Rick is the Founder and CEO of Sesame Software. He has developed application systems for 40 years in his roles as a product architect, software developer, systems integrator, and database analyst. Rick developed Relational Junction to meet the market demand for easily managed enterprise application integration and data warehouse products. He has been awarded six patents for technology used in Relational Junction.