Purposes of Data Warehouses

So you’re up and running with all your company’s application systems — marketing, sales, accounting, manufacturing, provisioning, shipping, and human resources — and now you need a reporting system that ties all of this together so senior management actually knows what’s going on. The business has identified over 800 reports that need to be written in the next year, and there’s no way IT can spend that amount of time writing reports, so you need a Business Intelligence package that lets the business write their own reports. But the last time one of your developers ran a financial report against the accounting database, it brought the web store to its knees, and the production support team won’t let you near the production database any more. Even worse, some of your application platforms are in the Cloud, some are on-premises, and the reporting interface for each system only works with data in that system. What are you going to do?

Why Build a Data Warehouse?

Business intelligence activities drive the evolution of data flow within any organization. Data warehouses exist to facilitate the understanding and improvement of an organization’s performance.

Typically they serve three broad purposes:

  • Reporting – offloading I/O from Online Transaction Processing (OLTP) systems avoids runaway queries that could kill production transaction processing, and the warehouse provides an opportunity to restructure data for optimal query access
  • Integration Hub – a central repository for common data that can be shared among applications more easily than point-to-point integration
  • Compliance – a backup of operational data, perhaps with versioning of records, to provide recovery ability and an audit trail of changes

The field of data warehousing and database theory is mired in ambiguous terminology for its various database structures and methodologies. Terminology is crucial to understanding the concepts, yet different parties use the same term in vastly different ways, or describe the same phenomenon with different words. The incentive is plain: a vendor that can control the definitions and frame the conversation can claim to be a trend-setter and influence the market for solutions.

OLTP vs. Warehouse: Access and Structure

Online Transaction Processing (OLTP) systems facilitate and manage transaction-oriented applications—that is, the applications built to handle daily business functions such as data entry, financial processing, and retail sales. Data warehouses, conversely, are built around the idea of collecting, aggregating, and analyzing data in order to provide valuable business insights.

The contrast between these paradigms is best exemplified by how each accesses data: OLTP applications are write-heavy (transactions are conducted continuously, so new information must be incorporated constantly), whereas data warehouse applications are overwhelmingly read-heavy. Given how frequently data is read from a warehouse, it should be structured for simple access and tuned to return query results quickly. Data within OLTP applications, on the other hand, need only be structured to reflect the natural business relationships and to provide the fastest write and update times, optimizing transactions and favoring inserts over queries.

Data access priority is only one of a handful of stark differences between the technical considerations for OLTP systems and data warehouse applications. The volume and nature of the workload also differ greatly: warehouses hold large volumes of mostly historical data, whereas OLTP applications need only the data necessary to carry out current transactions. Data warehouses are designed to accommodate large, dynamic queries, while OLTP systems are much more specialized. Warehouses rely on bulk data operations, whereas OLTP applications use granular operations that touch a considerably smaller number of records per transaction. Even the schema design patterns diverge: OLTP systems use highly normalized schemas to facilitate quick updates at the cost of slower queries whenever joins are needed, while data warehouses use highly denormalized schemas to boost query performance.

OLTP                   Data Warehouse
Write-Heavy            Read-Heavy
Transactional          Analytical
High Throughput        High Volume
Current                Historical
Standard Queries       Ad Hoc Queries
Granular               Bulk
Normalized structure   Denormalized, Cube, Star schema, NoSQL
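
To make the contrast concrete, here is a minimal sketch using Python’s built-in sqlite3 module and hypothetical order tables (customer, orders, and sales_fact are invented names): a normalized OLTP design that joins at query time, next to a denormalized warehouse table where the join has already been resolved and the data is ready for bulk aggregation.

import sqlite3

conn = sqlite3.connect(":memory:")

# OLTP style: normalized tables, fast single-row inserts, joins at query time.
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE orders   (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                       order_date TEXT, amount REAL,
                       FOREIGN KEY (customer_id) REFERENCES customer(customer_id));
INSERT INTO customer VALUES (1, 'Acme Corp', 'West');
INSERT INTO orders   VALUES (100, 1, '2023-06-01', 250.0);
""")

# Warehouse style: one wide, denormalized row per order, with customer
# attributes copied in so analytical queries scan a single table with no joins.
conn.executescript("""
CREATE TABLE sales_fact (order_id INTEGER, order_date TEXT,
                         customer_name TEXT, region TEXT, amount REAL);
INSERT INTO sales_fact
    SELECT o.order_id, o.order_date, c.name, c.region, o.amount
    FROM orders o JOIN customer c ON c.customer_id = o.customer_id;
""")

# Typical analytical read: bulk aggregation, no joins required.
for row in conn.execute("SELECT region, SUM(amount) FROM sales_fact GROUP BY region"):
    print(row)   # ('West', 250.0)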

Schema Styles

Necessity drives design in database architecture: how the data will be used determines the optimal schema strategy, so the first question is what the major use cases for the data store actually are. A schema for a database used for data mining differs greatly from one used for reporting, where pre-computed columns provide quick results for values such as totals and averages; both in turn differ from analysis data stores, which rely on highly denormalized data.

The typical normalized relational model is the most widely used schema style; it seeks to minimize data redundancy and avoid anomalies in data operations. Star schemas, by contrast, embrace redundancy through “dimensional modeling,” in which specialized data marts are expanded via denormalization to provide fast, efficient reporting. Similarly, data “cubes” expand the data with pre-computed fields for faster analysis. Big Data systems also use this strategy, with JSON (JavaScript Object Notation) representing the logical structure of documents that may or may not be strictly structured.

{  "glossary": {
"title": "example glossary",
    "GlossDiv": {
      "title": "S",
      "GlossList": {
        "GlossEntry":
                   "ID": "SGML",
                   "SortAs": "SGML",
                   "GlossTerm": "Standard Generalized Markup Language",
                   "Acronym": "SGML",
                   "Abbrev": "ISO 8879:1986",
                   "GlossDef": {
                     "para": "A meta-markup language, used to create markup languages such as DocBook.",
                     "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}

JSON Example

Source: JSON.org
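
For contrast with the document-style JSON above, here is a minimal star schema sketch, again with hypothetical table and column names, expressed as SQLite DDL driven from Python: a central fact table keyed to denormalized dimension tables, followed by a typical slice-and-dice query.

import sqlite3

conn = sqlite3.connect(":memory:")

# A small star schema: one fact table surrounded by denormalized dimensions.
conn.executescript("""
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT,
                           month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, sku TEXT,
                           product_name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT,
                           segment TEXT, region TEXT);
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    quantity     INTEGER,
    revenue      REAL
);
""")

# A typical slice-and-dice query: revenue by year, quarter, and product category.
query = """
SELECT d.year, d.quarter, p.category, SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_date d    ON d.date_key = f.date_key
JOIN dim_product p ON p.product_key = f.product_key
GROUP BY d.year, d.quarter, p.category
"""
print(conn.execute(query).fetchall())   # empty list until the warehouse is loaded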

 

On the most abstract level, data warehouses fall into one of three basic layouts:

  • Basic
  • Basic with a staging area
  • Basic with a staging area and data marts (specialized data stores optimized for specific tasks or areas within an organization)

Standard Reporting

Standard reports are designed by software developers for specialized purposes, such as accounting statements. They entail complex business logic, master-detail layouts, and customized formatting. They are written by programmers, not end users, using report-writer software that requires knowledge of the database structure. Deep knowledge of the data is required, including the meaning and derivation of each field and how the data is related.
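
As an illustration only, with hypothetical invoice tables, the sketch below shows the kind of fixed master-detail logic a developer typically bakes into a standard report: a header block per invoice, formatted detail lines, and a computed total.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE invoice      (invoice_id INTEGER, customer_name TEXT, invoice_date TEXT);
CREATE TABLE invoice_line (invoice_id INTEGER, description TEXT, quantity INTEGER, unit_price REAL);
INSERT INTO invoice      VALUES (100, 'Acme Corp', '2023-06-01');
INSERT INTO invoice_line VALUES (100, 'Widgets', 5, 12.50), (100, 'Freight', 1, 40.00);
""")

# Master-detail layout: one block per invoice header, formatted detail lines
# beneath it, and a computed total -- the fixed logic of a standard report.
for invoice_id, customer, date in conn.execute(
        "SELECT invoice_id, customer_name, invoice_date FROM invoice ORDER BY invoice_date"):
    print(f"Invoice {invoice_id}  {customer}  {date}")
    total = 0.0
    for desc, qty, price in conn.execute(
            "SELECT description, quantity, unit_price FROM invoice_line WHERE invoice_id = ?",
            (invoice_id,)):
        total += qty * price
        print(f"    {desc:<20} {qty:>4} x {price:>8.2f} = {qty * price:>10.2f}")
    print(f"    {'TOTAL':<37}{total:>10.2f}")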

Business Intelligence

Business Intelligence reporting platforms are designed for business analysts to develop or extend reports, usually with a layer of metadata that describes the underlying content and structure in a way that makes it easy to develop specific queries. Once called “end user computing,” they offer drag-and-drop user interfaces and don’t require end users to understand the physical data structure. Analysts can develop slice-and-dice reports with summary, drill-down, and detail views. BI platforms are metadata-driven, with a system administrator predefining what’s exposed.
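
One way to picture the metadata layer is as a mapping from business-friendly names to physical tables and columns, from which the tool generates SQL. The sketch below is a toy illustration with invented metadata and field names, not the behavior of any particular BI product.

# Toy metadata layer: business-friendly names mapped to physical tables and
# columns, so an analyst can pick fields without knowing the schema.
METADATA = {
    "Customer Region": {"table": "dim_customer", "column": "region"},
    "Order Quarter":   {"table": "dim_date",     "column": "quarter"},
    "Revenue":         {"table": "fact_sales",   "column": "revenue", "aggregate": "SUM"},
}

def build_query(selected_fields):
    """Generate a GROUP BY query from the business names the analyst selected."""
    select_parts, group_parts, tables = [], [], set()
    for name in selected_fields:
        field = METADATA[name]
        expr = f'{field["table"]}.{field["column"]}'
        tables.add(field["table"])
        if "aggregate" in field:
            select_parts.append(f'{field["aggregate"]}({expr}) AS "{name}"')
        else:
            select_parts.append(f'{expr} AS "{name}"')
            group_parts.append(expr)
    # Join conditions omitted for brevity; a real tool derives those from metadata too.
    sql = f"SELECT {', '.join(select_parts)} FROM {', '.join(sorted(tables))}"
    if group_parts:
        sql += f" GROUP BY {', '.join(group_parts)}"
    return sql

print(build_query(["Customer Region", "Order Quarter", "Revenue"]))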

Analytics

Analytics are calculation-heavy queries that provide metrics on an organization’s performance. Key Performance Indicators (KPIs) that measure business performance are often built ahead of time for daily, weekly, monthly, quarterly, and annual time periods. Users of these systems can perform trend analysis and form corrective policies when performance falls short.
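
A rough sketch of those pre-built time-period rollups, assuming a hypothetical fact_sales table: the KPI summaries are computed once and stored, so dashboards read small tables instead of scanning the fact table on every request.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (order_date TEXT, revenue REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                 [("2023-01-15", 120.0), ("2023-02-03", 80.0), ("2023-04-20", 200.0)])

# Pre-compute KPI rollups per period so dashboards read small summary tables
# instead of scanning the fact table on every request.
conn.executescript("""
CREATE TABLE kpi_monthly AS
    SELECT strftime('%Y-%m', order_date) AS month, SUM(revenue) AS revenue
    FROM fact_sales GROUP BY month;

CREATE TABLE kpi_quarterly AS
    SELECT strftime('%Y', order_date) || '-Q' ||
           ((CAST(strftime('%m', order_date) AS INTEGER) + 2) / 3) AS quarter,
           SUM(revenue) AS revenue
    FROM fact_sales GROUP BY quarter;
""")

for row in conn.execute("SELECT * FROM kpi_quarterly ORDER BY quarter"):
    print(row)   # ('2023-Q1', 200.0) then ('2023-Q2', 200.0)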

Data Mining

Data mining involves searching for patterns in the data to spot trends, associations of events, or operational anomalies, essentially building the haystack in the hope of discovering needles. The query filtering capability must be very complex, to search for specific conditions, or very flexible, to give many views on the data. A data mining warehouse can keep every possible business fact in the hope of finding anomalies, fraud, operational violations, and the like.
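
A toy illustration of the needle-in-a-haystack idea, with invented data and an arbitrary threshold: scan the full transaction history kept in the warehouse and flag amounts that are far out of line with each account’s own history.

from collections import defaultdict
from statistics import median

# Hypothetical transaction history pulled from the warehouse: (account, amount).
transactions = [
    ("A-100", 25.0), ("A-100", 30.0), ("A-100", 27.0), ("A-100", 950.0),
    ("B-200", 500.0), ("B-200", 520.0), ("B-200", 480.0),
]

by_account = defaultdict(list)
for account, amount in transactions:
    by_account[account].append(amount)

# Crude needle-in-the-haystack filter: flag any amount far above the account's
# own typical value (here, more than five times its median).
for account, amounts in by_account.items():
    typical = median(amounts)
    for amount in amounts:
        if typical > 0 and amount > 5 * typical:
            print(f"Possible anomaly: {account} amount {amount:.2f} (median {typical:.2f})")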

Data Integration Hub


A warehouse can be used as a data integration hub, providing a source for downstream applications to retrieve transactional data from a system of record, or as a hub and spoke, with all data moving from the various spoke applications to the hub and outward to other spokes as needed. An integration warehouse provides common access to all data and mitigates the drawbacks of silos. Corporate data from all applications and all departments is collected and shared, and different structures of common data are resolved into a single view (e.g., a Customer Master). This stabilizes the corporate view of data when applications are upgraded or replaced: only the interface between the changing application and the warehouse needs to change.

A data integration hub allows common access on a single DBMS platform. This enables table joins and data merges between formerly separate databases instead of requiring staging tables or files, and a common reporting or Business Intelligence platform can then sit on top. Finally, access is controlled within the BI tool or through database GRANTs, not by physical separation.

One of the beauties of a data integration hub is that an integration developer doesn’t have to know the APIs of other applications. They only have to integrate with the hub, and let the developers of the other applications do their own integration with the hub, using its common framework and data model.
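
As a minimal illustration with hypothetical source systems and table names: once CRM and ERP extracts land in the same hub database, producing a single customer view is an ordinary join rather than a cross-system integration project.

import sqlite3

hub = sqlite3.connect(":memory:")   # the shared hub database

# Extracts from two formerly separate systems now sit side by side in the hub.
hub.executescript("""
CREATE TABLE crm_customer (crm_id TEXT, customer_name TEXT, email TEXT);
CREATE TABLE erp_account  (erp_id TEXT, crm_id TEXT, credit_limit REAL, balance REAL);
INSERT INTO crm_customer VALUES ('C-1', 'Acme Corp', 'ap@acme.example');
INSERT INTO erp_account  VALUES ('E-9', 'C-1', 50000.0, 12500.0);
""")

# A single customer view is now an ordinary join, not a cross-system project.
rows = hub.execute("""
    SELECT c.customer_name, c.email, a.credit_limit, a.balance
    FROM crm_customer c
    JOIN erp_account a ON a.crm_id = c.crm_id
""").fetchall()
print(rows)   # [('Acme Corp', 'ap@acme.example', 50000.0, 12500.0)]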

Compliance Backup

Compliance has become an increasingly common requirement in today’s enterprises and, in turn, has added complexity and cost to data warehousing and operations. Ultimately, however, compliance is “backup’s twin,” in that a compliance store is simply a backup kept for the specific purpose of complying with a regulation, assisting in audits, or, in the rare case, supporting recovery. Any techniques used to maintain full backups of operational data can therefore be applied to compliance data stores.

A compliance backup should provide the following (a schema sketch follows the list):

  • A backup of all records
  • An ability to keep and flag records that are deleted from the original system
  • An audit trail of changed and deleted data, showing all versions of individual fields or complete records, by change date and by who changed them
  • An ability to restore single or multiple records from any version of historical data, including the relationships between different types of data (a customer and its contacts, for instance)
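
A minimal sketch of such a versioned backup, with hypothetical table and column names: every insert, update, and delete in the source appends a new version row, so deletions stay visible and any prior version can be restored.

import sqlite3
from datetime import datetime, timezone

backup = sqlite3.connect(":memory:")

# One row per version of each source record: inserts, updates, and deletes all
# append a new version, so nothing is lost and deletions remain visible.
backup.execute("""
CREATE TABLE customer_history (
    source_id  TEXT,     -- primary key of the record in the source system
    version_no INTEGER,  -- 1, 2, 3, ... per source_id
    changed_at TEXT,     -- when the change was captured
    changed_by TEXT,     -- who made the change, if the source records it
    is_deleted INTEGER,  -- 1 once the record disappears from the source
    payload    TEXT      -- full record contents (JSON, in this sketch)
)
""")

def record_version(source_id, payload, changed_by, is_deleted=False):
    """Append a new version instead of overwriting the previous one."""
    (last,) = backup.execute(
        "SELECT COALESCE(MAX(version_no), 0) FROM customer_history WHERE source_id = ?",
        (source_id,),
    ).fetchone()
    backup.execute(
        "INSERT INTO customer_history VALUES (?, ?, ?, ?, ?, ?)",
        (source_id, last + 1, datetime.now(timezone.utc).isoformat(),
         changed_by, int(is_deleted), payload),
    )

record_version("C-1", '{"name": "Acme Corp", "tier": "gold"}', "jsmith")
record_version("C-1", '{"name": "Acme Corporation", "tier": "gold"}', "jsmith")
record_version("C-1", "{}", "system", is_deleted=True)   # deletion is kept and flagged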

The data storage technology can vary widely, using flat files or databases to store the versions. The backup can have the same structure as the source or a different one, depending on how the backup needs to be accessed. One popular cloud application uses a single database table for all standard fields of all objects and another table for custom fields; but when the data is copied to a backup database, replication products typically use the logical model provided by the application’s metadata to show customers, contacts, and all the other related records.

The actual copy process can be painstakingly built using an ETL tool or scripts, or handled by purpose-built technology that understands the source schema and navigates through any or all objects. The copy can be a full copy performed on a regular schedule, or an incremental copy performed more frequently.
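
A sketch of the incremental variant, assuming a hypothetical last_modified column in the source and a backup table of the same shape: each run copies only the rows changed since the previous run’s watermark.

import sqlite3

source = sqlite3.connect("source.db")   # hypothetical operational database
backup = sqlite3.connect("backup.db")   # hypothetical compliance copy

# Remember how far each table has been copied.
backup.execute(
    "CREATE TABLE IF NOT EXISTS sync_state (table_name TEXT PRIMARY KEY, watermark TEXT)"
)

def incremental_copy(table, watermark_column="last_modified"):
    """Copy only rows changed since the last run, then advance the watermark.
    Assumes the backup already holds a table of the same shape as the source."""
    row = backup.execute(
        "SELECT watermark FROM sync_state WHERE table_name = ?", (table,)
    ).fetchone()
    watermark = row[0] if row else "1970-01-01T00:00:00"

    changed = source.execute(
        f"SELECT * FROM {table} WHERE {watermark_column} > ?", (watermark,)
    ).fetchall()
    if not changed:
        return 0

    placeholders = ", ".join("?" for _ in changed[0])
    backup.executemany(f"INSERT INTO {table} VALUES ({placeholders})", changed)

    (new_watermark,) = source.execute(
        f"SELECT MAX({watermark_column}) FROM {table}"
    ).fetchone()
    backup.execute("INSERT OR REPLACE INTO sync_state VALUES (?, ?)", (table, new_watermark))
    backup.commit()
    return len(changed)

# incremental_copy("customer")   # e.g. called from a nightly or hourly scheduler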

Takeaways

Data warehouses are important for Reporting, Integration, and Compliance. They take load off production OLTP systems and make reporting and integration more flexible, scalable, and higher performing for bulk workloads. There are several structural styles of warehouse, each suited to a particular context; star schemas are not for everyone! Data warehouses offer many benefits that are strategic to modern companies, but the costs of building them should be fully explored and managed.