Extraction is the first operation, literally the “E” in ETL. The extraction of data from various source systems for use in the warehouse is the most laborious task in building a data warehouse. Source systems come in varying levels of complexity and a multitude of physical architectures. However, the general rule is that the structure of source systems cannot be modified. That inflexibility is one of the reasons that data warehouses are so valuable.
The extraction process involves deciding
- How often to load the data (latency),
- Whether the data is to be extracted in full, incrementally, or in real time,
- The physical mechanism by which data is extracted from the source (the extract layer),
- How the extracted data are to be further processed (the transformation layer), and
- How to load the data (the load layer).
The user community has a sense of tolerance for latency. Data mining to gain insights and spot trends in the data can happen in a historical context with monthly summaries and daily updates. Management by exception can require instantaneous alerts to operational or senior staff. The CEO of American Airlines gets a daily summary of all flight operations, which is used to avert or manage the day’s natural and man-made crises on an enormous scale. Banking systems typically settle overnight anyway, so there is no concept of what the assets of the bank are from minute to minute.
The latency requirements in turn determine whether extraction can be full or incremental. If you have the luxury of a nightly window during which no one needs access to the data warehouse, and the time to reliably empty and entirely reload the data, you may be able to get by with a full load process. Otherwise, the load process needs to be incremental or even continuous.
Full loads offer a simpler manner to store source data, since there is no need to implement change data capture, timestamp-based queries, event-driven logic, or to handle deleted source records. This may be outweighed by the cost of significantly slower run times and the possibility that failed jobs will produce an empty, unusable warehouse for the day. The art of bulk loading attempts to overcome the performance and scalability issues within a large organization that produces a lot of data and yet has employees needing access to the warehouse in multiple time zones.
Incremental loads can usually solve low latency requirements, with near-real-time updates of the warehouse. Incremental logic requires metadata within the extraction mechanism to accommodate change history and ensure data integrity, a way to remember where we were in each data source so we don’t miss any changes and yet don’t create duplicate records in the warehouse. Incremental logic may also involve recomputing calculated summary fields or summary tables. Incremental logic allows for less latency, so you’re looking at current data instead of day-old data. And if something goes wrong with the loads, at least you won’t have empty tables if the load jobs truncate and then fail to load some tables perfectly every night.
Real-time continuous loads are required in a limited set of circumstances, usually when something needs to be done about an exceptional condition immediately and the source system can’t be used to manage the condition. A system that collects data from geographically dispersed devices in order to alert operational staff of a condition, such as fire control or medical alerts, has to be designed to both collect the data continuously and to alert the appropriate staff. A situation where a customer is on the phone and needs immediately follow-up is another case of real-time latency.
Physical Sources of Data
Online extractions pull from a direct connection to the source system using JDBC, OBDC, or similar connection methods. Offline extractions utilize data staged explicitly outside the source system to synchronize data without the need for source-intervention, using such techniques as flat files, database dump files, database change logs, and DBMS-specific mechanisms such as Oracle’s transportable table spaces or snapshot tables. So essentially, you can pull from the source, a copy of the source, or an altered state of the source in another medium
Source data for a data warehouse will be from operational systems, typically stored in
- Relational databases designed for transaction processing, such as Oracle, SQL Server, MySQL, DB2, or PostGreSQL
- Cloud application databases that are accessed with API’s, such as Salesforce.com, NetSuite, or the Cloud versions of Oracle and SAP
- Legacy databases (IBM’s IMS, Software AG’s Adabas, dBase, etc.),
- Flat files, which could be original source data or just interface files created from the source data,
- XML files created for interface purposes,
- JSON (Java Script Object Notation) files,
- Spread sheets, and
- NoSQL databases such as Objectivity, MongoDB, Apache Cassandra, MarkLogic, Datastax, and Redis, if your organization has made the leap into building new systems on this kind of platform.
Systems of record often store data in a format designed for programmers instead of business users. Ones and zeroes, Boolean true-false, Yes-No-Maybe conditions in the data are probably not understood by users if printed verbatim on a computer screen. A data warehouse should be a user-friendly view of the information, decrypted in common terminology that people will understand.
An example of this is meteorological data represented in METAR (Meteorological Terminal Aviation Routine Weather Report) format that is translated into a more easily understood format.
While reading METAR text is something that pilots have to understand to pass their pilot’s license exam, reading an interpreted report on an iPad using sophisticated aeronautic navigation software enables faster decision making with fewer mistakes. Data warehouses should strive for this kind of obviousness and clarity by interpreting the raw data.
What are the tools to transform data? This depends on the ETL framework, whether the warehouse is built with SQL, procedural SQL, scripts, compiled code, or a commercial ETL tool. You will need the ability to do numeric, date, string, and logical operations, hopefully with a set of built-in functions to which you can add your own custom functions.
Common examples of transformations are:
- Changing internal dates stored as seconds since an “epoch” of January 1, 1970 to a human readable format, such as “1970-01-01T00:00:00Z ” or just “January 1, 1970” instead of “0”.
- Formatting numbers, such as changing 12345.67 to $12,345.67.
- Changing logical bits into “Yes” or “No”
- Manipulating string values into mixed case (CAMEL CASE à Camel Case)
- Changing codes into words with Case logic (WàWon, LàLost, OàOpen)
- Looking up values from a reference table (US à United States)
- Concatenating three address lines into a single field with line breaks with no extra empty lines:
- Address1=5201 Great America Parkway
- Address2=Suite 320
- Address3=Mail Stop 17
5201 Great America Parkway
Mail Stop 17
If the ETL platform doesn’t allow flexibility to write your own functions, you’re going to hit a dead end eventually or have to resort to programmatic manipulation by compiled code tearing strings apart letter by letter.
Challenges in Loading a Warehouse
Finding exceptional conditions is a key reason to build data warehouses.In a demo of one warehouse decades ago, a business user of a major American car and truck manufacturer asked a crazy question: show warrantee repairs for body work for trucks delivered with only a chassis – no panel trucks, no trucks with pickup beds. Crazy question, but up pops a dealership in New Jersey that had filed a number of fraudulent warrantee claims.
In a warehouse for a major DSL provider, it was discovered that the best way to get service orders expedited was to date the service order request before the date of the initial order, bumping the service order ahead of the queue. This totally threw off service level reporting metrics, but also pointed out the level of desperation throughout the entire customer service process.
The difficulty of loading a target database, whether it is a data warehouse or another application system, depends on
- The complexity of the database schema. Flattened (denormalized) table structures can be loaded in fewer stages; from a complex join of several tables, only one table is loaded that represents the logical view of the source query in one physical table. Normalized or star schemas require multiple passes to populate the nodes of the schema and link the relationships.
- Staging tables (also known as “landing tables”) and staging files can be a crutch to bridge diverse data structures from different data sources into a common area before being restructured into the final destination. Some ETL tools use intermediary flat files to merge data from different physical databases that cannot be joined with SQL directly, making it appear to the integration developer that it’s a simple merge when in fact there is a lot of moving parts behind the scene. Using landing tables when there is only one source and one target is a probably the result of not having sophisticated tools. This is a common approach needed when populating tables using non-procedural SQL.
- Foreign keys from the source systems (Customer ID, Product ID) should be used instead of internally generated GUIDs (Globally Unique Identifiers) or automatically generated sequence numbers where possible, allowing users to trace data lineage to the source systems.
- Business rule enforcement is important to make sense out of data that naturally meets the rules of the business. But one important use of data warehouses is to find exceptions to business rules. Enforce the rules you can, but be prepared to handle exceptions to the rules and report these back to the user community.
- If a warehouse is used as a feeder system for “downstream” applications or as a hub-and-spoke architecture for data sharing, bi-directional integration creates issues of which system “wins” and when to stop the back-and-forth updates. Systems of record may have to “be satisfied” by table or even field level ownership if the “downstream” application sends data fields back to the source system. An example of this would be flagging sales orders in a sales system when those orders are turned into invoices in the accounting system, or updating addresses in the sales system if the finance department determines that there is a correction to be made.
Physical Loading Styles
There is more than one way to load a data warehouse, given the source extraction and transformations are handled.
- Transactional loading can be either bulk oriented or business transaction oriented. You might load a single database table using ordinary SQL, committing thousands of records at a time, then move on to the next table. Or, you could process individual business transactions one at a time, creating the Customer, the Addresses, the Order, and Order Line Items. The first method is going to be conceptually simpler to implement, as you only have to solve one problem at a time, and potentially faster to run, but there will be temporary data inconsistency during the load process (customers without transactions, orders without product lines).
- Bulk loading using DBMS vendor-supplied utilities can bypass the overhead of SQL inserts. Column-oriented SQL databases store just one column for many records in a single database block, making reads of a few columns very fast but inserts glacially slow, as the DBMS has to split records into separate physical blocks, resulting potentially in hundreds of physical writes per record. Bulk loaders of many records at a time optimize this process and can be extremely fast. Bulk loaders to not typically handle updates or deletes, however, and usually involve staging files.
- APIs are used for some of the newer Cloud-based platforms. If your warehouse is on Salesforce.com’s WAVE platform, for instance, you are going to be forced into using API’s or products that exploit those API’s. The load process will be weighted toward performance, and the API’s are designed to handle sets of records rather than one at a time. Anything in the Cloud that is done one record at a time is going to suffer from severe latency issues. The number of hops required to transmit only one record over the internet over relatively slow bandwidth will result in a just a few records per second loaded.
Factors of Success in Load Architecture
We will do a deeper dive in later articles on the following topics, but understand for now that you need to fully explore:
- Available database loading tools
- Non-procedural SQL (SELECT INTO…)
- Procedural SQL – stored procedures
- Database Utilities – database vendors often supply database bulk load utilities
- Scripts – PHP or other non-compiled languages with good database interfaces
- Compiled custom code – Java/JDBC or similar programming languages
- ETL tools – there are many products designed specifically to load databases
- APIs – Cloud databases use API’s instead of native SQL to overcome performance issues in pushing large amounts of data over the internet
- Stable production environments are not well suited for one-time data migration, since everything is an experiment the first time, but will be needed for ongoing warehouse or integration processes. For one-time loads, consider a dedicated server that the developers can have hands-on access to rather than continuously going through complicated change control processes. Otherwise, you may have a very stable but very empty database on Monday morning after a failed migration.
- Point to point integration directly from a single data source to the target environment is simple to implement, but becomes a limiting factor if multiple data sources are combined.
- Hub and spoke architecture uses a central staging area where all data goes before it is sent the downstream applications. It can also be leverages for reporting, if you aren’t fixated on star schemas.
- Data bus technology is a publish-and-subscribe model which is designed for real-time integration of disparate applications but not bulk loading.
- Available quality assurance tools
- Database queries can be easily developed by technical staff that show anomalies in the data. There is often an expectation by the user community of what the source data looks like (a, b, c) that isn’t borne out by actually verifying the clustering of values (a, b, c, d, e…).
- QA tools can get the user community involved in the testing process without having to learn SQL.
- Applications can be developed that show the anomalies in the data, such as user-facing exception reports or alerts.
- Project Management
- Scope creep is a constant menace to satisfying business expectations. Sometimes, a quick and dirty warehouse will resolve immediate business needs that one meticulously developed over a year will not. Many companies have gone out of business during the development of computer systems because they couldn’t meet critical needs, whether it’s an order entry system that isn’t ready when the products are ready to be shipped or an analytics system that wasn’t ready in time to diagnose fatal trends in the business. The user community should be apprised of what is and what isn’t feasible within a given time frame.
- Changing database platforms or ETL technology during the project can force a total rewrite. The choice of ETL tools can actually consume more time than the actual implementation of a warehouse, especially if you include the learning curve involved in some of the tools.
- QA methods should be developed by people experienced in breaking systems, not developers who are interested in proving that things work. One applications systems vendor was aghast when a consultant developed a test script that was pretty much guaranteed to make a banking transaction fail, exclaiming “you can’t do that – that’s negative testing!”
Loading a data warehouse should be done as a series of steps where you solve each step before you can move on to the next step. The implementation choices should be based on user requirements, not the default style of the integration developer or how they did their last project.
- What data is to be warehoused?
- Who will use the warehouse and for what purpose?
- What are the latency requirements?
- What should be the structure of the warehouse schema?
- Should the warehouse be loaded entirely or incrementally or continuously?
- How will the data be extracted from the source systems?
- How will the data be transformed into an easily understood form?
- How will the data be loaded to support the latency and data volume requirements?