Everything in life has an opportunity cost as well as a financial cost, and acquiring or building a software solution for data integration is no different. Choosing the least expensive product would presumably mean writing everything yourself with open source tools and scripting. But wait – that approach hides the true cost of development and maintenance, unless you work for free or are just looking to justify your employment in the absence of “real work.”
Although a completely custom system gives you exactly what you want, home-built solutions generally require a team of people in a build-and-fix phase lasting many months – and they create a high risk of having no skilled support person after go-live. Someone will have to continually improve the system to meet unforeseen needs, fix ongoing bugs, and support the business community’s evolving integration projects. That dedicated maintainer probably won’t be anyone from the original build team, because software developers typically don’t stick around after a system is finished; they move on to build something entirely new to sharpen their skills and stave off boredom. Want to be that guy? You’ll be stuck in that role forever, or you’ll leave your employer with a gaping hole in their staff when you depart for greener pastures. Soon the system will start to fall apart because no one knows how it works, and it will be time for a total replacement. Why can’t anyone understand it? Because – let’s be real about this – developers never document anything. This is how home-grown software projects end: the developers leave, and the undocumented systems fall apart.
Any solution you build yourself can have three of the following: Functionality, Quality, Economy, and Timeliness – but not all four. Vendors that sell millions of copies of office automation software, accounting packages, or smart phone apps can give you all four, but the more specialized and customized the solution, the more trade-offs you’ll have to accept, since the vendor must amortize its costs over a smaller number of customers. A home-built software solution is the worst case: all the overhead of designing, building, testing, and support, with no cost sharing among many customers who agree on the basic tenets of how the system should work.
Sadly, buying an off-the-shelf integration product can involve more work to learn and evaluate the various packages than the actual integration project itself. The vendors make it hard, with products designed to impress in a demo with “circles and arrows and a paragraph on the back of each one explaining what each one was to be used as evidence against us.” Yet those products typically take more than a month to reach even limited proficiency. And vendors don’t disclose their pricing up front because it’s usually “whatever we can get for it,” according to a comparative study of ETL products by Passioned Group. There is a special breed of consultants who spend months evaluating integration products for each client, developing detailed studies and recommendations for what should be a labor-saving tool.
Despite the complexity of choosing an off-the-shelf data integration product, we will explain why your company needs a commercial vendor-supported tool to work with all of your corporate data, and what architecture is the most flexible, reliable, scalable, and easiest to implement and maintain. But first, what are the business requirements for any solution?
To build a modern data integration capability, you would need to develop all the following features:
- The ability to handle common relational databases.
- A transport mechanism to replicate data to and from your Cloud applications using their native APIs.
- The ability to handle any other exotic or external data sources you have or might need down the road, from flat files to NoSQL databases.
- A metadata layer to map the structure of Cloud data to your database, with automated discovery of new objects and fields, and the ability to handle data type changes.
- A data transformation mechanism (the “T” in ETL) to map and modify raw data into a format recognized by the target system (such as changing Boolean bit flags to human readable text). It should come with built-in functions but allow the creation of user-written functions.
- A method of handling the mappings that doesn’t break when the Cloud schema is modified.
- A scheduling mechanism to tell the transport layer when to run.
- A method of handling incremental loading of data, because you don’t have time to reload everything every time.
- A recovery mechanism to handle all possible network and system failures from your Cloud applications – and these are a frequent occurrence. There may be days when the API interface is down for hours, which would cripple anything that requires direct API access. You’ll need functionality to handle outages properly and to retry connections.
- A restart mechanism to retry failed loads without missing records and without creating duplicate records.
- A technique to handle millions of records without timing out. Cloud applications are designed to protect the vendor’s infrastructure from overly zealous use and abuse by batch or polling processes, so there are limits and governors built in to prevent you from getting all that data too easily.
- A method of determining which side wins when there are multiple updaters of the same record, including handling record ownership issues down to individual fields.
- A logging mechanism to track job history and run statistics, and to remember where we were in the incremental queries that rely on timestamps in the source data.
- A failure logging mechanism to trap record level exceptions with all data exposed for debugging purposes.
- A notification mechanism that will inform staff of problems without overwhelming them with information.
- A complete retesting of all functionality for every new release of your Cloud applications. There will be several major releases each year, each of which requires code modification, often without documentation until after the release is in production.
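The transformation requirement above – built-in functions plus user-written ones – can be sketched as a simple per-field rule table. Everything here (the `TRANSFORMS` mapping, the field names) is hypothetical, but it illustrates the book’s own example of turning Boolean bit flags into human-readable text:

```python
# A minimal field-level transform layer: built-in rules plus a
# user-written one, applied per field. All names are illustrative.
def bool_to_text(value):
    """The book's example: Boolean bit flag -> human-readable text."""
    return "Yes" if value else "No"

TRANSFORMS = {
    "is_active": bool_to_text,   # user-written rule
    "name": str.strip,           # "built-in" rule: trim whitespace
}

def transform(record: dict) -> dict:
    # Fields with no rule pass through unchanged.
    return {field: TRANSFORMS.get(field, lambda v: v)(value)
            for field, value in record.items()}

out = transform({"name": "  Acme  ", "is_active": 1, "id": 7})
print(out)  # {'name': 'Acme', 'is_active': 'Yes', 'id': 7}
```

A real product wraps the same idea in a mapping UI, but the core is still a rule per target field.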
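Incremental loading, and the logging that “remembers where we were,” usually come down to a persisted high-water mark on a source timestamp. The sketch below uses an in-memory source and passes the watermark around as a string; a real job would persist it in a control table between runs:

```python
# Hypothetical in-memory source; a real job would query the Cloud API
# or a database, and persist the watermark in a control table.
SOURCE = [
    {"id": 1, "name": "Acme",    "modified": "2024-01-01T10:00:00"},
    {"id": 2, "name": "Globex",  "modified": "2024-01-02T09:30:00"},
    {"id": 3, "name": "Initech", "modified": "2024-01-03T08:15:00"},
]

def incremental_extract(watermark: str) -> tuple[list[dict], str]:
    """Return records modified after `watermark`, plus the new watermark."""
    changed = [r for r in SOURCE if r["modified"] > watermark]
    new_watermark = max((r["modified"] for r in changed), default=watermark)
    return changed, new_watermark

# First run picks up only what changed since the stored watermark...
batch, wm = incremental_extract("2024-01-01T12:00:00")
print([r["id"] for r in batch])
# ...and the next run, using the saved watermark, sees nothing new.
batch2, _ = incremental_extract(wm)
print(len(batch2))
```

Note that ISO-8601 timestamps compare correctly as strings, which keeps the example short; production code would use real datetime types and worry about clock skew and late-arriving updates.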
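The recovery requirement – surviving API outages and rate-limit governors – is typically handled with retries and exponential backoff. This is a generic sketch, not any vendor’s implementation; `flaky` stands in for a real HTTP call:

```python
import time

def call_with_retries(request, max_attempts=5, base_delay=1.0):
    """Retry a flaky zero-argument callable with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return request()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the outage to the scheduler
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, 8s...

# Simulate an API that fails twice and then recovers.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("API temporarily unavailable")
    return {"status": "ok"}

result = call_with_retries(flaky, base_delay=0.01)
print(result)  # {'status': 'ok'}
```

The same pattern, with longer delays and a cap, covers the hours-long outages mentioned above; when retries are exhausted the failure is logged and handed to the notification mechanism rather than swallowed.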
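The restart requirement – retrying failed loads without missing or duplicating records – usually rests on idempotent writes keyed by the source system’s ID (an “upsert”). A minimal sketch, with a dict standing in for the target table:

```python
# Target keyed by the source system's ID, so replaying a batch after a
# failure overwrites rows in place instead of inserting duplicates.
target: dict[str, dict] = {}

def upsert_batch(batch: list[dict]) -> None:
    for record in batch:
        target[record["external_id"]] = record

batch = [{"external_id": "001", "amount": 100},
         {"external_id": "002", "amount": 250}]
upsert_batch(batch)
upsert_batch(batch)  # replayed after a simulated failure mid-load
print(len(target))   # still 2 rows, no duplicates
```

In a real database this is `MERGE` or `INSERT ... ON CONFLICT`; the point is that a restart can safely re-run the whole failed batch.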
Making the build-vs.-buy decision and choosing specific solutions involve weighing many factors:
- Total Cost of Ownership (TCO) – Consider the cost over the lifespan of the solution, not just the up-front licensing costs. No single organization can build or maintain a solution for its own use as inexpensively as it can by leveraging a vendor solution, because vendors spread their costs across a large number of customers.
- Time to Market – Can the business live without a solution while the platform decision is researched, the platform itself is purchased and learned (or built internally), and the integration is actually built? Lucidera was a business intelligence vendor that went out of business after spending $1.5 million over a year on twelve offshore developers building a Salesforce.com data warehouse engine behind their BI product. That is typical of the time it takes, with dedicated resources, to develop a solution that has even the most rudimentary feature set, has been tested thoroughly, and scales to handle expected data volumes over time. If time to market matters at all, you’ll leverage a packaged solution instead of building the infrastructure yourself.
- Functionality – Does the solution meet the current and future needs of the enterprise? If you’re working at a startup and choose a solution that is too small for future needs, factor in the cost of a later transition. Vendors survive because they adapt to new data sources and technical challenges – the same challenges that quickly make home-grown solutions obsolete.
- Maintainability – Vendor solutions come with product upgrades and support, and hopefully a community of developers with existing skill sets that are readily adaptable to the product. Internally built solutions come with staffing issues as people are promoted out of current responsibilities or leave the company.
- Architecture that Supports Your Needs – Choose an integration style first, then pick or design a solution that meets it. Point-to-point integration is quick, fragile, and short-term. Hub-centric integration lets you leverage data warehouse technology, create a central repository of corporate data, and consume Cloud data locally. A data bus is for real-time integration.
- Reliability – Try before you buy. Do a proof of concept with candidate vendor solutions, using your toughest use case, for a couple of weeks. Does it fall over when there is too much data? Is the vendor responsive to support issues? This is exactly what you can’t test ahead of time with a home-built solution – and that gap will create support nightmares down the road.
- Scalability – Try copying your largest set of data, especially if it’s coming from or going to a Cloud application or Cloud data warehouse. Again, you can’t do this ahead of time with home-built solutions. Once you’ve committed to an internal development project, if it doesn’t work, you’ve thrown away your entire investment.
- Risk of Failure – Vendor solutions are generally known quantities with predictable results and a community of developers with applicable skills. Home-grown projects can fail, take longer than expected, or lose key developers mid-project.
 With apologies to Arlo Guthrie, “Alice’s Restaurant Massacree”