ETL Tools vs. Hand-Written SQL

Two hands-on integration experts have reached wildly different conclusions on the merits of ETL tools, despite each having years of experience with both hand-written SQL and ETL tools. Both experts represent the IT side. Both earn a living implementing data integration solutions. But they disagree on the best approach: home-grown native SQL or vendor ETL products. This demonstrates how fine minds can respectfully disagree based on a number of considerations. Neither is a fan of Citizen Integration, having seen what happens when business analysts fail in the execution phase.

David Aldridge wrote the original remarks in a 2006 blog post, A List: Ten Reasons Not To Use An External ETL Tool, to which an unknown author added three more reasons. June Tong wrote the responses years later. The following dialogue is reprinted with their permission.

  1. They require little database expertise to use: This sounds at first blush like an advantage, but knowledge of the graphical interface used for generating transformations can mask a shallow appreciation of the performance-enhancing internal mechanisms of the database, such as bulk operations, parallelism, partitioning, etc. When a performance problem arises, you are into a multi-person analysis, which is less efficient, more time-consuming, and potentially divisive. Another issue, particularly for smaller teams, is that dividing the expertise between database people and ETL people can require higher staffing levels — instead of a DBA and two database developers (who can at a pinch substitute for the DBA on vacations and sick days) you need a DBA, a developer and two ETL developers. You can still find yourself in a bind with one member of staff on leave and no margin of safety in case of illness or other duties.

This is why you should never hire someone who just claims to know the ETL tool well; you need someone who knows databases too. When I was evaluating DI [Business Objects Data Integrator] for Salesforce.com, I asked them how DI would produce a complex SQL query like one I had in my PL/SQL, and they said “If you can write this kind of query, you’re overqualified to be doing DI development.” Which is exactly the problem: companies don’t hire people who have the right combination of skills to make the ETL work well. They think you can just learn how to drag and drop stuff and make it work (inefficiently). If you don’t understand how to force DI to generate the proper SQL, you make the ETL engine do all the work and hog up the network pulling millions of rows of data to the ETL server. Not all PL/SQL is good either. As with everything (including PL/SQL development), there’s a huge difference between having someone who knows enough to make it work and someone who is really good at it.
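
To make June’s point concrete, here is a minimal sketch (table and column names are hypothetical) of the kind of single set-based statement a database-savvy developer would push the tool to generate, rather than letting the ETL engine pull every row across the network and apply changes one at a time:

```sql
-- Hypothetical staging and warehouse tables. One set-based MERGE keeps
-- the join and the inserts/updates inside the database instead of
-- round-tripping millions of rows through the ETL server.
MERGE INTO dw_orders d
USING stg_orders s
   ON (d.order_id = s.order_id)
WHEN MATCHED THEN
  UPDATE SET d.amount = s.amount,
             d.status = s.status
WHEN NOT MATCHED THEN
  INSERT (order_id, amount, status)
  VALUES (s.order_id, s.amount, s.status);
```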

  2. They are generally not as good at bulk operations as the database: Bulk, nologging operations are the number one performance enhancer for data warehouse loads, and your external ETL tool may not allow them, or may make it hard to tell whether they are being used by requiring multiple configuration items to be set.

Picking the right ETL tool is also critical.  Unlike the novice ETL developer, the developers who actually wrote the ETL tool usually have a very good knowledge of the database and have put a lot of thought into how they can use the database’s functionality to help their tool achieve better performance.  In order to pick the right ETL tool, the people evaluating the tools have to know enough to ask the right questions.
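
For readers who have not seen it, the bulk pattern David refers to looks roughly like this in Oracle (object names are hypothetical; this is a sketch, not a complete load script):

```sql
-- NOLOGGING on the target plus a direct-path (APPEND) insert minimizes
-- redo generation; the trade-off is that a failed load must simply be
-- rerun rather than recovered from the logs.
ALTER TABLE dw_sales NOLOGGING;
ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ APPEND PARALLEL(dw_sales, 4) */ INTO dw_sales
SELECT * FROM stg_sales;

COMMIT;
```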

  3. You will have to use some in-database transformations anyway: Related to issue number one. Your ETL developer may be unqualified to do this work, yet it is almost always going to be required. Gathering statistics, refreshing materialized views or other precalculated result sets, disabling and rebuilding indexes …

Same as #1.
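
The in-database housekeeping David lists might look something like the following sketch (object names are hypothetical; the package calls are standard Oracle):

```sql
BEGIN
  -- Make the index unusable before the bulk load, rebuild it afterward.
  EXECUTE IMMEDIATE 'ALTER INDEX dw_sales_ix1 UNUSABLE';
  -- ... bulk load runs here ...
  EXECUTE IMMEDIATE 'ALTER INDEX dw_sales_ix1 REBUILD NOLOGGING';

  -- Fast-refresh a dependent materialized view.
  DBMS_MVIEW.REFRESH('dw_sales_mv', method => 'F');

  -- Re-gather optimizer statistics on the freshly loaded table.
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'DW_SALES');
END;
/
```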

  4. They don’t know as much about the data as the database does: We expend a lot of effort in making sure that the database knows as much about the data as possible. Row counts, average row lengths, cardinalities, foreign keys, unique keys, dimensions, etc. all allow the database to make the most efficient choices in optimizing operations. Your ETL tool will rarely take advantage of these features. Indeed, when it starts to read data it often has no idea whether it will read 10 rows or 10 million … how can it make a choice on an efficient join mechanism under those circumstances?

Oh come on.  Oracle usually requires a bunch of optimizer hints before it does the right thing anyway, and a person had to decide what indexes to build, and how to write the SQL.  If you know enough to give Oracle the right hints, you can make the ETL tool do what you want, too.  Again, this is the same as #1.     
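
The hinting June mentions is a one-line affair; here is a minimal sketch (hypothetical tables) of steering the optimizer toward a full scan and hash join when it cannot judge cardinalities on its own:

```sql
-- The hints force a full scan of the fact table and a hash join,
-- choices a human makes when the statistics cannot.
SELECT /*+ FULL(f) USE_HASH(f) */
       c.customer_name,
       SUM(f.amount) AS total_amount
FROM   dw_customers c
JOIN   dw_sales f ON f.customer_id = c.customer_id
GROUP  BY c.customer_name;
```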

  5. Database functionality has expanded to encompass many ETL tasks: Not by coincidence, Oracle has many features that now make it a very efficient load-and-transform tool. Full outer joins, external tables, analytic functions, regular expressions … every one of these helps to kick the feet out from under the external tool market.

This is the same as #2.  Many ETL tools also allow you to modify or write your own SQL, if you are so inclined.  I’ve done this, and taken advantage of a lot of Oracle functions.
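
Two of the features David cites, sketched with hypothetical names: an external table exposes a flat file to SQL as if it were a table, and an analytic function computes in one pass what once required a self-join.

```sql
CREATE TABLE stg_orders_ext (
  order_id   NUMBER,
  amount     NUMBER,
  order_date DATE
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY etl_dir          -- assumes a directory object exists
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('orders.csv')
);

-- Running total without a self-join, courtesy of an analytic function.
SELECT order_id,
       amount,
       SUM(amount) OVER (ORDER BY order_date) AS running_total
FROM   stg_orders_ext;
```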

  6. Difficult transaction control: External tools generally enforce coding in units across which transactions cannot span. You are often forced into smaller transactions than is ideal, leading to more complex recovery mechanisms in the event of a processing failure.

But they also make it easier to avoid problems that arise when your transactions are too big.  And they take care of it under the covers, without requiring a lot of custom coding by the developer.
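
Hand-written SQL has an answer here too; a sketch (hypothetical names) of DML error logging, which lets one large transaction survive bad rows instead of being chopped into small commits:

```sql
-- One-time setup creates the err$_dw_sales table:
--   EXEC DBMS_ERRLOG.CREATE_ERROR_LOG('DW_SALES');
-- Rows that fail are diverted to the error log; the statement keeps going.
INSERT INTO dw_sales
SELECT * FROM stg_sales
LOG ERRORS INTO err$_dw_sales ('nightly load') REJECT LIMIT UNLIMITED;

COMMIT;
```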

  7. External tools do not have a performance analysis interface as well developed as Oracle’s: Good luck looking for a wait interface. A performance problem can be in the tool, in the database, in the network …. Interestingly, despite the tools being allegedly easier to use, creating a performance test case on an external ETL tool can be a very tricky and time-consuming proposition. If you recall back in the days when everyone was using BCHRs [Buffer Cache Hit Ratio] and rules of thumb for Oracle performance tuning, that is where ETL tools are today. Scary, huh?

Never having used Oracle’s analysis tools, I can’t comment.
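
For readers in the same position, the wait interface David means is just a set of queryable views; a minimal sketch (the bind variable is hypothetical):

```sql
-- Where has this session spent its time? Time, by wait event.
SELECT event, total_waits, time_waited
FROM   v$session_event
WHERE  sid = :etl_session_sid
ORDER  BY time_waited DESC;
```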

  8. Hardware issues: Buy separate hardware and pay the price, or host the tool on the database hardware and be unable to load share effectively between them? Poor database performance? Sorry, the ETL tool is using all your resources right now. High hardware costs? Sorry, the ETL tool needs it to avoid contention with the database, but sits idle twelve hours a day.

Oh, please.  The hardware for the ETL tool is usually pretty cheap, and can be combined with the report server that would otherwise go unused at night.

  9. Purchase Costs: Have you seen the price of these things? And whatever we need is sliced into a different piece or module requiring a separate license.

Yes, but they can really speed up development, and that’s worth something, if only the saved time of the developer. In reality, shortened implementation time is worth a lot in the business world. I can write something in DI in less than half the time it takes to write a PL/SQL procedure, and I am really good at SQL.

  10. Smaller user base: There are good resources out there in terms of forums, but precious few in comparison to Oracle.

True of Data Integrator. Not true of Informatica and Ascential. Both of those are so prevalent (only a small minority of DW shops write their own code anymore) that expertise is easy to find.

  11. Lack of Source Control: When your ETL tool stores its configuration in a bunch of tables in a database repository, you are pretty much stuck with whatever source control they give you, including “none” or what I think of as “marketingware” (features that marketing needs the software to include in order to sell the product, but which are little more than an unreliable stub). With text-based code you are free to use best practices and any tool you like — including, of course, “none”.

Again, this should be an issue when picking your tool.  Most ETL tools have some source code control functionality (DI’s is not quite as robust as I would like, but it’s adequate.)

  12. High Consulting Costs: Usually involves work being formulated and implemented by highly paid consultants, at least in the initial phase.

Yeah, I actually like this part!  But there’s nothing to stop you from developing or hiring your own in-house expertise to avoid paying consulting fees.  PL/SQL developers aren’t cheap either.  I charge the same amount whether I’m developing in DI or PL/SQL.

  13. Lack of knowledge of Data: The tool developers or consultants do not understand the data at the time of the initial rollout of the project; by the time they start gaining some expertise with the data, the needs, and the sources, they move on to another project. Most of the time the complex tasks are handed off to the SQL and PL/SQL developers …

Same as #1 and #12.

I agree with many of these points.  [But] this sounds like it was written a long time ago [Editor’s note: Original list appeared in 2006 on The Oracle Sponge blog.]  ETL tools have been around for over 10 years.  Almost no one writes their own ETL in PL/SQL anymore.  The competition among ETL tools has forced them to address all the tool-related issues that were raised, at least by the major players (among which I don’t consider DI).  What this all boils down to is, get in-house ETL developers who understand databases and your data instead of relying on consulting companies or people who just learned how to drag-and-drop with the tool. [My remarks mostly pertain to] Oracle. Once you start to bring databases like Redshift or GreenPlum into the picture, it’s not the same story.

Takeaways

  • Expertise should be recruited and respected. In-house staff will be needed for long term support, so reliance on consultants is a short-term solution. Avoid hiring developers with no experience in your chosen tool. Choose a tool with consideration for the pool of available talent.
  • Don’t expect a pricey tool to make up for a lack of technical experience. Trusting the tool to make up for the integration developer’s technical deficiencies, whether it’s a so-called “citizen integrator” or a programmer who is inexperienced with the tool set, is going to produce a negative outcome no matter what technology is used.
  • Consider purchasing an ETL tool with the flexibility to “drop down” into native SQL to resolve performance issues or handle complexity that the drag-and-drop user interface may not support.
  • Have a plan to diagnose performance issues. ETL tools typically do not offer diagnostics of this nature.