The Architecture of Databases
The Architecture of Databases
What is The Architecture of Databases?
- The different components: Transport System, Query Processor, Execution Engine, and the Storage Engine.
The Architecture of Databases
Database Management Systems provide an abstraction that you can use in your applications for storing and (later) retrieving data.
The DBMS exposes an API that you can use through some type of Query Language and it provides a set of guarantees on how you can store/retrieve your data from the database.
A couple examples of guarantees are things like durability (you won’t lose your data if the database server crashes), strong consistency (for distributed databases - once you write some data to the database, all subsequent reads will return that value and you won’t get “stale” data) or read/write speeds (IOPS is the standard measure of input and output operations per second on storage devices).
The architecture of Database Management Systems can vary widely based on its guarantees and design goals (a distributed database meant to store petabytes of data will have a different design than an embedded database like SQLite).
However, databases have some common themes in their architectures, and knowing these themes can provide a useful model for how they work.
The general architecture can be described by this diagram...
Database Management Systems use a client/server model.
When you use a database, your application is the client and the database system is the server (either hosted on the same machine or hosted on a different machine).
The Transport system is how the DBMS accepts client requests. Client requests come in the form of a database query and are usually expressed in some type of query language.
The Transport system first takes in the client’s query and passes it on to the Query Processor.
The first part of the Query Processor is the Query Parser. It parses the client’s query ( using an Abstract Syntax Tree for example) and makes sure the client’s query is valid. If the query isn’t valid, then the database will return an error to the client.
The Query Parser may also run access control checks, and make sure that the client is correctly permission to access/modify the data that they’re requesting.
After, the parsed query is passed to the Query Optimizer.
A parsed query can normally be satisfied in different ways, but the different ways vary inefficiency. The optimizer’s job is to find the most efficient way.
The optimizer will first eliminate redundant parts of the query and then use internal database statistics (index cardinality, approximate intersection size, etc.) to find the most efficient way to execute the query.
For distributed databases, the Optimizer will also consider data placement like which node in the cluster holds the data and the costs associated with the transfer.
The output from the Optimizer is an Execution Plan that describes the optimal method of executing the query. This plan is also called the Query Plan or Query Execution Plan.
This Execution Plan gets passed on to the Execution Engine which carries out the plan. When you’re using a distributed database, the plan can involve Remote Execution (making network requests for data that is stored on a different machine). Otherwise, it’s just Local Execution (carrying out queries for data that is stored locally).
Remote Execution involves Cluster Communication, where we communicate with other machines in our database cluster and send them requests for data.
Local Execution involves talking to the Storage Engine to get the data.
The Storage Engine is the component in the database that is directly responsible for storing, retrieving, and managing data in memory and on disk.
Storage Engines typically provide a simple data manipulation API (allowing for creating, reading, updating, delete features) and contain all the logic for the actual details of how to manipulate the data.
Examples of Storage Engines include BerkeleyDB, LevelDB, RocksDB, etc.
Databases will often allow you to pick the Storage Engine that’s being used.
MySQL, for example, has several choices for the storage engine, including RocksDB and InnoDB.
The Storage Engine then consists of several components
Transaction Manager - responsible for creating transaction objects and managing their atomicity (either the entire transaction succeeds or it is rolled back).
Lock Manager - Transactions will be executing concurrently, so the Lock Manager manages the locks on database objects being accessed by each transaction (and releasing those locks when the transaction either commits or rolls back).
Access Methods - These manage access, compression, and organizing data on a disk. Access methods include heap files and storage structures such as B-trees.
Buffer Manager - This manager caches data pages in RAM to reduce the number of accesses to the disk.
Recovery Manager - Maintains the operation log and restores the system state in case of a failure.
Different Storage Engines make different tradeoffs between these components resulting in a differing performance for things like compression, scaling, partitioning, speed, etc.
Comments
Post a Comment