Regardless of the industry, mid- to large-size companies are looking for integration solutions that scale in a cloud-based data warehouse. For both B2B and B2C companies, the amount of data they warehouse is likely to continue growing exponentially. Even mid-size firms can already have half a terabyte of data stored in the cloud; in the next few years, a terabyte or more will likely be the norm.
Being limited to data queries at 40 records per second, whether because of internet latency or restrictions imposed by SaaS platforms, means most of a company’s analytical data will be effectively inaccessible.
One way to improve the accessibility of data is to mirror cloud data to a data warehouse sitting on an on-premise or colocated cloud server.
Although many products do this, the offerings differ vastly in scalability, performance, robustness, and ease of implementation. Any workable solution needs some form of query logic that either streams the data or "sips" it from the cloud in chunks, limiting the number of records queried in a single pass.
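To make the "sipping" idea concrete, here is a minimal sketch of chunked extraction. The `query_page` callable is a hypothetical stand-in for any real SaaS client call; the chunk size and pause are illustrative placeholders, not values any particular platform requires.

```python
import time

def fetch_in_chunks(query_page, chunk_size=2000):
    """Pull records from a cloud source in fixed-size chunks.

    `query_page(offset, limit)` is a hypothetical callable returning a
    list of records; a real SaaS client method would substitute here.
    """
    offset = 0
    while True:
        batch = query_page(offset, chunk_size)
        if not batch:
            break          # source exhausted
        yield from batch
        offset += len(batch)
        time.sleep(0.1)    # brief pause to stay under API rate limits

# Demo with an in-memory list standing in for the cloud API:
records = list(range(4500))
fake_page = lambda off, lim: records[off:off + lim]
fetched = list(fetch_in_chunks(fake_page))
```

Keeping each request small means no single pass trips a platform governor, and a failed chunk can be retried without restarting the whole extract.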
Replicated Data Warehouse
Employing a replicated data warehouse allows you to run reporting and integration jobs involving large data requests faster than SaaS platforms can deliver them, using the business intelligence or analytics tool of your choice. Not only does this warehouse allow faster reporting and data integration, it also gives companies the additional peace of mind of having a backup of their data for compliance purposes.
In the last decade, super-fast cloud databases have supported instantaneous report results on vast amounts of data. Snowflake, Oracle Autonomous Data Warehouse, Google BigQuery, Greenplum, Redshift, Teradata, and many other on-demand databases scale without having to buy more disk drives, CPUs, and RAM when data volumes and workloads increase. In many cases, vendors will also automatically upgrade and patch the database software.
Cloud Reporting Database Architecture
The architecture of cloud reporting databases supports fast reads of a few columns across many records. Data is physically organized by column instead of by row, which makes table scans on large datasets much faster: the column values for many records are stored together and read in one physical disk operation.
There is a downside, though. These "columnar" databases do not have efficient transactional support for insert, update, and delete operations on individual records. As a result, loading millions of records with INSERT statements can be very slow, because the data for every column goes into a different database block.
UPDATE statements can also be slow for the same reason: the database has to assemble the record from each of its columns in separate blocks to read it, then write all the columns back to separate blocks. Every read and write requires a physical movement of the disk head for each column.
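The row-versus-column trade-off described above can be illustrated with plain Python lists standing in for disk blocks; this is a conceptual sketch, not how any real columnar engine lays out storage.

```python
# Row-oriented layout: each record's fields are stored together.
rows = [{"id": i, "amount": i * 10, "region": "US"} for i in range(5)]

# Column-oriented layout: each column's values are stored contiguously.
columns = {
    "id":     [r["id"] for r in rows],
    "amount": [r["amount"] for r in rows],
    "region": [r["region"] for r in rows],
}

# Scanning one column touches a single contiguous array -- analogous to
# one physical disk read -- instead of skipping through every row record.
total = sum(columns["amount"])

# Updating one record, by contrast, touches a separate list (block) per
# column, which is why single-record UPDATEs are slow on columnar stores.
columns["amount"][2] = 999
```

The scan reads one contiguous structure; the single-record update has to reach into every column's storage, which is exactly the pattern that makes transactional workloads expensive on these databases.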
The physical storage structure of the database is not the only significant performance factor. Pushing large amounts of data in transactional mode, with an individual INSERT or UPDATE for each record, incurs internet latency on every record. You are sending single records over the internet with the built-in lag of a dozen or more router hops along the way, over links that might be overloaded or underpowered, yielding throughput of two to six records per second. Clearly, this is unacceptable. Relational Junction instead uses bulk loaders for pushing data over the internet, taking advantage of their optimized performance.
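The difference between per-record inserts and batched loading can be sketched with Python's built-in sqlite3 module, used here purely as a stand-in; real cloud warehouses go further with file-based bulk loaders that stage a flat file and issue a single load command.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")

data = [(i, i * 1.5) for i in range(10_000)]

# Batched insert: one statement execution over the whole dataset instead
# of one network round trip (and one latency penalty) per record.
conn.executemany("INSERT INTO sales VALUES (?, ?)", data)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

At two to six records per second, 10,000 individual remote inserts would take roughly half an hour to well over an hour; a single bulk operation amortizes the latency across the entire batch.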
SaaS and Scalability
Having a cloud-based data warehouse that is both scalable and easily accessible is tricky, especially for companies working with SaaS enterprise platforms such as Salesforce.com and NetSuite.
The SaaS resource consumption model is designed to prevent a single user from consuming all the resources of the cloud server, maintaining reasonable performance for all users. Governors and limits on the number of cloud-based records any client can access in one session keep the system moving. However, they also prevent customers from getting all the reports they want, whenever they want, in unlimited quantities.
The vast majority of those self-imposed SaaS bottlenecks are time-based. For example, a business user wants to generate reports on sales completion rates based on two million records. This request would likely time out well before any reports were delivered, leaving the user empty-handed.
By building these time constraints into any request for data, SaaS providers essentially force their clients into receiving data from the cloud in much smaller chunks—and that’s a numbers game many companies will have trouble winning.
Transmitting only 40 records a second, the average rate for many vendors and SaaS providers, means that performing data integration on five million cloud-based records will take up more than an entire workday and still not be complete. Delays like that can end up crippling a company’s real-time decision-making and put it at a serious disadvantage.
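The arithmetic behind that claim is worth spelling out:

```python
records = 5_000_000
rate = 40                      # records per second
seconds = records / rate       # 125,000 seconds
hours = seconds / 3600         # ~34.7 hours -- well past one workday
```

Even a perfectly reliable 40-records-per-second feed needs almost a day and a half of continuous transfer to move five million records, before accounting for retries, timeouts, or governor resets.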
The Solution to Accessible Cloud-Based Data
All of these database platforms come with very fast bulk load utilities, but they require a programmer to:
- Write scripts to create flat files with the data and control files with the instructions for the bulk loader to parse the files
- Send the data and control files to the database server
- Execute the bulk loader
- Check for errors due to file mismatches with the target database or other reasons
- Report those errors to the programmer
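The scripted steps above can be sketched as follows. The control-file syntax and the commented-out `sqlldr` invocation are illustrative (SQL*Loader-style); a real pipeline would substitute the staging format and loader utility of its target warehouse.

```python
import csv
import os
import tempfile

def stage_for_bulk_load(records, table):
    """Write the flat data file and control file a bulk loader consumes.

    Returns the two file paths. Actually executing the loader (steps 3-5)
    requires a real database, so that part is shown but commented out.
    """
    workdir = tempfile.mkdtemp()
    data_path = os.path.join(workdir, f"{table}.csv")
    ctl_path = os.path.join(workdir, f"{table}.ctl")

    # 1. Flat file containing the data
    with open(data_path, "w", newline="") as f:
        csv.writer(f).writerows(records)

    # 2. Control file with parsing instructions for the loader
    with open(ctl_path, "w") as f:
        f.write(f"LOAD DATA INFILE '{data_path}' INTO TABLE {table} "
                "FIELDS TERMINATED BY ','\n")

    # 3-5. Execute the loader, check for errors, report them
    # (hypothetical invocation; requires a real loader and database):
    # import subprocess
    # result = subprocess.run(["sqlldr", f"control={ctl_path}"],
    #                         capture_output=True)
    # if result.returncode != 0:
    #     raise RuntimeError(result.stderr.decode())

    return data_path, ctl_path

paths = stage_for_bulk_load([(1, "alpha"), (2, "beta")], "demo")
```

Even this toy version hints at the maintenance burden: the script must be regenerated and retested whenever a table or field changes in the source system.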
If you were to copy all of the tables or objects in the more sophisticated accounting systems to a warehouse, you would have to analyze, code, and test scripts for thousands of tables.
Experience has shown that this results in delays, cost overruns, and months of work, plus a permanent team of support people fixing problems at all hours of the day and night. Reliability will also be poor: hand-built solutions rarely account for tables or fields being added to or deleted from the source system.
Relational Junction has patented features to replicate data as fast as possible. This includes dealing with SaaS limitations, governors, and databases that are optimized for very fast reporting. Relational Junction rapidly connects to your data, routes the data where you need it, and then selectively replicates the items you desire in near-real-time. You can be up and running within minutes with no data modeling, design, data mapping, or even a project. Being able to warehouse all of your application systems into a modern data warehouse eliminates data silos, enabling quicker and better-informed decisions using all your relevant data.
If you would like to learn more about Relational Junction, schedule a discovery call today!
Rick is the Founder and CEO of Sesame Software. He has developed application systems for 40 years in his roles as a product architect, software developer, systems integrator, and database analyst. Rick developed Relational Junction to meet the market demand for easily managed enterprise application integration and data warehouse products. He has been awarded six patents for technology used in Relational Junction.