Scalability and Performance

With enormous amounts of data comes great responsibility – and great headaches. While the business continues its pursuit of tracking everything forever, the IT department must keep abreast of the technology required to store, access, maintain, back up, and recover all of that data. The number of commercial and open source DBMS (DataBase Management System) products is growing faster than anyone can comprehend, much less master. Each DBMS has its own purpose and its own limitations, and each solution attracts a community of evangelists and developers. Vendors respond with applications that use the DBMS, business intelligence software to report on the data, tools to integrate with it, and systems to back up and recover it.

The broad classes of data storage solutions include but are certainly not limited to:

  • File-oriented storage
    • Fixed length records used primarily by legacy mainframe systems, readable by COBOL or other second generation languages. Punched cards, 9-track tapes, and disk files were typical physical media.
    • Comma Separated Values (CSV) or tab-delimited files, with the first record holding the titles of the columns for each subsequent row of data. This format allows variable-length fields, which both reduces storage costs and provides the flexibility to handle long text data.
    • XML files, which use tags to delimit the start and end of data values similar to HTML, such as “<DataName>data value</DataName>”.
    • JSON (JavaScript Object Notation) files, which represent data as “name”: value pairs grouped into objects and arrays.
  • Hierarchical and Network Databases were invented decades ago to store and efficiently retrieve complex data structures.
    • IBM’s IMS (Information Management System) hierarchical database uses “records” consisting of “segments” of repeating data, with parent-child relationships. You can make any part of the record searchable with indexes.
    • CA Technologies’ IDMS (Integrated Database Management System) is a network database with a CODASYL-standard programming language interface as well as a SQL interface.
    • Software AG’s ADABAS (Adaptable DAta BAse System) uses the NATURAL language or SQL.
  • Relational Databases were invented to support relationships between data stored in tables, using mathematical set theory to describe the data and retrieve the data using SQL (Structured Query Language). Relational databases include:
    • Oracle
    • DB2
    • SQL Server
    • Sybase
    • Ingres
    • Vectorwise
    • PostgreSQL
    • Informix
    • Teradata
    • Tandem NonStop SQL
    • MySQL
    • H2
  • NoSQL databases (“non SQL” or “not only SQL”) store data without relying on tables and relationships. They make use of distributed storage for scalability, and “eventual consistency” instead of guaranteeing transactional integrity at every point in time.
    • Architectural types include Column-oriented, Document, Key-value, Graph, Object, and Multi-model.
    • DBMS solutions include Apache Cassandra, BigTable, Clusterpoint, Couchbase, CouchDB, Datastax, DocumentDB, DovetailDB, Druid, Dynamo, Dynomite, HBase, Hypertable, IBM Domino, KAI, KDI, MarkLogic, MongoDB, MotionDb, Objectivity, OpenNeptune, Oracle NoSQL Database, Qbase, Qizx, Redis, RethinkDB, Riak, SubRecord, Voldemort, and XML-databases.
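
To make the flat-file formats described above concrete, here is one hypothetical customer record rendered as CSV, XML, and JSON using Python's standard library. The field names and values are invented for illustration:

```python
import csv, io, json
from xml.etree import ElementTree as ET

# A single hypothetical record (all names and values are invented).
record = {"CustomerId": "1001", "Name": "Acme Corp", "Balance": "2500.00"}

# CSV: the first row holds the column titles, subsequent rows hold data.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=record.keys())
writer.writeheader()
writer.writerow(record)
csv_text = buf.getvalue()

# XML: tags delimit the start and end of each data value.
root = ET.Element("Customer")
for name, value in record.items():
    ET.SubElement(root, name).text = value
xml_text = ET.tostring(root, encoding="unicode")

# JSON: "name": value pairs inside braces.
json_text = json.dumps(record)

print(csv_text)
print(xml_text)
print(json_text)
```

Note that the CSV form carries the field names only once, in the header row, while the XML form repeats them around every value – one reason XML files tend to be the bulkiest of the three.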

See Wikipedia for a more detailed discussion of NoSQL databases.

Common Elements of Scalability

Scalability is the ability to both store and retrieve large amounts of data. It doesn’t accomplish much to store an ocean of data if we can’t get it back when we need it. It may take years to accumulate data, but retrieving it has to happen in a time frame that has value to the business: meeting batch processing windows, and returning information to impatient humans who are just trying to get their jobs done without losing a customer on the phone – or losing their minds wondering when the computer will respond.

The essence of DBMS storage scalability is using the available physical media in ways that allow data to be written and read by many users at the same time. This implies that there can be no bottlenecks in accessing a set of records, and that individual records must be accessible without reading the entire database. Tape drives provided only sequential access to a single batch job. Disk drives allowed DBMS vendors to provide random or sequential access to individual records or sets of records. Software architecture that locks records and makes efficient use of the physical blocks supported by the hardware is what makes online transaction processing systems possible. Physical storage may span multiple physical files on one or more disk drives, or even multiple computers in a federated configuration. Disk drives with movable read/write heads are giving way to solid state drives with no moving parts and much faster access. In-memory databases can be used if data volumes are not too large, and even large databases can be kept partially in memory for improved performance.
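
The record-locking idea can be sketched in a few lines: give each record its own lock, so concurrent writers block only on the record they touch rather than on the whole store. This is a toy illustration of the concept, not a real DBMS lock manager:

```python
import threading
from collections import defaultdict

class RecordStore:
    """Toy store with one lock per record, so writers to different
    records never block one another (a sketch, not a real lock manager)."""

    def __init__(self):
        self._data = {}
        self._locks = defaultdict(threading.Lock)  # one lock per record key
        self._meta = threading.Lock()              # guards the lock table itself

    def update(self, key, value):
        with self._meta:
            lock = self._locks[key]
        with lock:                                 # only this record is locked
            self._data[key] = value

    def read(self, key):
        with self._meta:
            lock = self._locks[key]
        with lock:
            return self._data.get(key)

store = RecordStore()
threads = [threading.Thread(target=store.update, args=(k, k * 10))
           for k in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(store.read(7))   # → 70
```

A real engine adds much more – lock escalation, deadlock detection, shared versus exclusive modes – but the principle is the same: narrow the scope of each lock so throughput scales with concurrent users.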

Common Elements of Performance

Data must be structured for performance. The data model must include accessibility as one of its primary objectives. Indexes should be employed to avoid reading all the data to find just a few records. However, if a typical query retrieves a large percentage of the records, it is usually better not to use indexes, since the random lookups driven by a sequential read of the index can involve more physical I/O than reading the entire set of data. Ultimately, performance at the database level boils down to minimizing physical reads, especially those that move the read/write head of a disk drive for each randomly accessed record. Waiting for the platter to rotate and the head to move adds up to a lot of time over the span of many records, or on a heavily utilized system.
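
The index trade-off can be illustrated with a toy in-memory table (the rows and column names below are invented). A hash index answers a highly selective lookup with one probe instead of a scan, while a query that matches half the rows is served perfectly well by a sequential scan:

```python
# Toy table: a list of rows, plus a hash index on one column.
rows = [{"id": i, "region": "east" if i % 2 else "west"}
        for i in range(100_000)]

# Build the index once: column value -> row position.
id_index = {row["id"]: pos for pos, row in enumerate(rows)}

# Selective lookup: one index probe instead of scanning 100,000 rows.
target = rows[id_index[42_017]]

# Low-selectivity query ("half the rows"): a sequential scan is the
# better plan, since an index would only add a random lookup per match.
east_rows = [row for row in rows if row["region"] == "east"]

print(target["id"], len(east_rows))
```

Real query optimizers make this same choice automatically, estimating selectivity from statistics before deciding between an index lookup and a full scan.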

Reading data in bulk is facilitated by transmitting multiple logical records in one physical block, or buffer. This is especially important when there is significant network latency, as happens with internet-based transmission over many network hops and many miles.
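
A rough back-of-the-envelope calculation shows why blocking matters over a high-latency link. The latency and batch figures below are invented for illustration:

```python
RECORDS = 10_000
BATCH_SIZE = 500     # assumed records per physical block/buffer
LATENCY_MS = 30      # assumed round-trip network latency per request

# One record per round trip: every record pays the full latency.
row_at_a_time_ms = RECORDS * LATENCY_MS

# One buffer of 500 records per round trip: latency is amortized.
round_trips = -(-RECORDS // BATCH_SIZE)   # ceiling division → 20 trips
buffered_ms = round_trips * LATENCY_MS

print(row_at_a_time_ms, buffered_ms)   # 300000 ms vs 600 ms of latency
```

Under these assumptions, buffering cuts the latency cost by a factor of the batch size – five minutes of waiting versus well under a second – before any bandwidth effects are even considered.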

Writing buffers of data works well for sequential writes or mass inserts. It gets more complicated when the database has to work out the “upsert” (update-or-insert) logic using uniquely valued “keys” in the data. The application layer or integration tool will typically have to decide in advance whether to present the data for insert or update.
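
Some modern SQL dialects can push the update-or-insert decision into the database itself rather than the application layer. As one example, SQLite (version 3.24 and later, bundled with recent Python releases) supports an ON CONFLICT clause, shown here against an invented table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")

def upsert(acct_id, balance):
    # The unique key (id) lets the engine decide: insert a new row,
    # or update the existing row when the key already exists.
    con.execute(
        """INSERT INTO account (id, balance) VALUES (?, ?)
           ON CONFLICT(id) DO UPDATE SET balance = excluded.balance""",
        (acct_id, balance),
    )

upsert(1, 100.0)   # no row with id=1 yet -> insert
upsert(1, 250.0)   # id=1 already exists   -> update
print(con.execute("SELECT balance FROM account WHERE id = 1").fetchone())
# → (250.0,)
```

On platforms without such a clause, the application or integration tool is left with the choice described above: attempt an UPDATE and fall back to INSERT when no row matches, or look up the key first.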

Takeaways

Scalability is the volume of data that can be stored and retrieved. Performance is the speed with which this is accomplished. There is no single solution that satisfies these requirements across all file and database management platforms, but there are common techniques of indexing, buffering, and organizing the data for optimum retrieval.