The Next Evolution of the Database Sharding Architecture – InfoQ.com

Key Takeaways

With mobile phones and the internet becoming what some would call a daily necessity, it's not uncommon for websites and business services to receive billions of visits on a weekly basis - and thats not all.

Sales days like Black Friday in North America or Double Eleven (aka Singles Day) in Asia are excellent examples of traditional retail enterprises adapting to the digital world. These enterprises must now contend with new needs and challenges, to successfully achieve their business goals.

They all have to answer the same question: we need to drive our digital sales on this Black Friday, but when we succeed and this incredible traffic reaches the database cluster, will our databases be able to handle it?

Different business cases have multiple options when it comes to database solutions. Such options range from NoSQL products (e.g., MongoDB, Cassandra, Amazon DynamoDB, etc.), to NewSQL products (like Amazon Aurora or CockroachDB, which are popular these days).

Besides these great solutions, some industries will also consider transparent sharding on top of existing database clusters.

According to the database trend ranking DB-Engines, although many new database offerings are hitting the market, the traditional relational database still retains a considerable share of the pie.

Considering the new challenges databases are facing, is there an efficient and cost-effective way to leverage these types of databases and enhance them through some new practical ideas? Database transparent sharding is one of the best answers to this question.

Database popularity ranking on DB-Engines

One of the best techniques for this is to split the data into separate rows and columns. This splitting of large database tables into multiple small tables are known as shards. The original table is divided into either vertical shards or horizontal shards. Terminologies used to label these tables can be subjective to VS1 for vertical shards and HS1 for flat shards. The number represents the first table or the first schema. Then 2 and 3, and so on. These subsets of data are referred to as the table's original schema.

So what is the difference between sharding and partitioning? Both sharding and partitioning include breaking large data sets into smaller ones. But a key difference is that sharding implies that the breakdown of data is spread across multiple computers, either as horizontal or vertical partitioning. On the other hand, partitioning is when the database is broken down into different subsets but held within a single database, sometimes referred to as the database instance.

Since for sharding data is divided into numerous pieces stored across different machines, this approach offers the following advantages:

However, sharding architecture is not perfect and has some drawbacks:

Sharding: One to multiple shards

Like for most things in technology, let alone in life, there is no silver bullet. You should perform a thorough analysis to have a complete picture of your needs and scenarios, and only then move on to choose the best possible solution.

Generally, the advantages of sharding architecture prevail, and many excellent products that play an essential role in the database industry are based on this architecture. Citus or Vitess have their respective definitions, but they're based on database-sharding architecture in nature.

Citus manages a coordinator (proxy) cluster to distribute the PostgreSQL cluster, while Vitess shards MySQL alike. Both of them concentrate on providing a low-cost and efficient distributed solution of traditional but prevailing relational databases. Actually, sharding architecture is fundamental for most NoSQL and NewSQL products as well, but that would be another topic focusing on sharding with NoSQL and NewSQL. This article focuses on sharding with relational databases since there are some innovations brought to the classic sharding technique.

The occurrence of sharding is the result of the distributed needs of databases. These days an increasing number of new issues involve databases, such as privacy protection, SQL audit, tenant, distributed authentication, etc.

These represent real-world new demands for databases. How to deal with these issues is an inevitable question for all database products, no matter the type of database. Could these issues be addressed with a database sharding solution? It looks like sharding needs to evolve to meet these challenges, which is our topic i.e., what's the next evolution for database sharding architecture.

My answer is Database Plus, the guiding concept for creating a distributed database system for more than sharding, positioned above DBMS.

It was conceived with the aim to build a standardized layer and ecosystem above existing and fragmented databases, as well as provide a unified and standardized database usage specification. This provides for upper-level applications, and the challenges faced by businesses due to underlying databases fragmentation get minimized as much as possible. The result is an environment where the applications only need to speak with a standardized service, instead of different ones for each database.

This idea was initiated by the PMC (Project Management Committee) of Apache ShardingSphere, and it took about a year to release 5.0.0 GA and implement this concept in its architecture.

In the 3.x and 4.x release stages, we defined Apache ShardingSphere as a distributed database middleware (sharding architecture) to only solve the sharding issue. However, new challenges for databases and the community pushed this project to evolve and include more features such as data encryption, shadow database, distributed authentication, distributed governance, etc. All of these changes go beyond the traditional sharding range, as sharding is just one part of Database Plus.

The evolution of ShardingSphere's Database plus architecture

Apache ShardingSphere's example supports my thesis that a simple and classic sharding architecture can do more than sharding. The kernel mechanism directs all traffic through a proxy or driver, and then if it could parse SQLs and know the location of every database, the following jobs will be easy to perform:

So what do these jobs mean to the end users? Based on these kernel jobs, the products of Apache ShardingSphere are qualified to ease users database pain points.

Originally sharding, data encryption, shadow database, distributed authentication, distributed governance, etc. were all based on the necessary steps above. The architecture proposed by the Database Plus concept of Apache ShardingSphere brings these enhancing features with flexibility in mind.

All functions are just plugins that can be added or removed at any given time in this distributed system. Some people may just want to shard a database, while others may prefer to do data encryption. Users needs never stop evolving and are diverse, and for this reason Database Plus can be fully customizable and continuously receive new plugins (features) allowing it to meet users demands one by one specifically and flexibly.

ShardingSpheres architecture includes the following four layers as shown in Figure 1 below.

ShardingSphere's Four Layer Architecture

Foundation Layer:Provides a variety of access terminals such as driver or proxy to flexibly meet the needs of users in different scenarios.

Storage Layer:All the functions are supported in these databases with the possibility to include more.

Function Layer:Provides a variety of functional plug-ins that meet users' needs, allowing a high degree of flexibility in plug-in choice and combination.

Solution Layer:End users are provided with industry-oriented (e.g. financial, e-commerce and entertainment industries) and specific scenario-oriented standard product solutions (e.g. distributed database solution, encrypted database solution or database gateway).

ShardingSphere JDBC and ShardingSphere Proxy have been polished and tested for five years and are now available in production. Many community users provided relevant production cases, and production feasibility has been verified.

With the shared core functions among different ShardingSphere clients, users can also choose hybrid deployment to achieve balance between query performance and management convenience (shown in Figure 2 below).

ShardingSphere JDBC and Proxy Hybrid Development

The Apache ShardingSphere community proposed a SQL dialect, i.e., DistSQL (distributed SQL), to operate and manage all functions of ShardingSphere.

SQL is the standard and conventional interaction method with databases. However there are many new features in this distributed database system which required us to think of a SQL dialect to configure and use these new functions.

DistSQL allows users to use SQL-like commands to create, modify or delete a distributed database and table, or to encrypt or decrypt data. All of the above mentioned functions could be performed with distributed SQL. Some DistSQL snippets are presented below.

DistSQL in Action

Distributed database system governance capability is necessary to alleviate the suffering of distributed cluster management. In the ShardingSphere ecosystem where computing and storage are separated, features are greatly enhanced in the new version including:

Additionally, the distributed lock new feature is scheduled to be released soon.

ShardingSpheres Distributed Governance

Although many advantages have been listed above, there are some constraints or limitations that are worth mentioning. You should give careful consideration to the following items before adopting ShardingSphere::

This section will introduce two practical examples to demonstrate how create a distributed database and create an encrypted table with DistSQL - the SQL dialect connecting all the elements of the ShardingSphere ecosystem.

This part will guide you through an example of how to leverage DistSQL to create a distributed database. Users and applications visit Proxy to achieve a logic table (distributed table) which has been sharded among different servers. There is no need to take care of these shards, instead make your applications operate and manage this logical table.

Prerequisites:

Process:

Login Proxy CLI by executing SQLcommand:

mysql -h127.0.0.1 -uroot -P3307 -proot

Register two MySQL databases using DistSQL

ADD RESOURCE ds_0( HOST=127.0.0.1, PORT=3306, DB=demo_ds_0, USER=root, PASSWORD=root );

ADD RESOURCE ds_1 ( HOST=127.0.0.1, PORT=3306, DB=demo_ds_1, USER=root, PASSWORD=root );

Create sharding rule by distSQL

CREATE SHARDING TABLE RULE t_order( RESOURCES(ds_0,ds_1), SHARDING_COLUMN=order_id, TYPE(NAME=hash_mod,PROPERTIES("sharding-count"=4)), GENERATED_KEY(COLUMN=order_id,TYPE(NAME=snowflake,PROPERTIES("worker-id"=123))) );

Create sharding table by the previous sharding rule

CREATE TABLE `t_order` ( `order_id` int NOT NULL, `user_id` int NOT NULL, `status` varchar(45) DEFAULT NULL, PRIMARY KEY (`order_id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Show resources, sharding databases and sharding tables

sql SHOW SCHEMA RESOURCES;

SHOW DATABASES;

SHOW TABLES;

Show sharding tables

SHOW TABLES;

Below are the tables in MySQL:

And below are the tables in ShardingSphere Proxy:

Drop sharding table

DROP TABLE t_order;

This example shows you how to create an encrypted table with DistSQL. The data encryption feature is ShardingSphere Proxy, which helps encrypt and decrypt data. Applications do not need any coding refactoring, and just send the plaintext to Proxy, where plaintext is encrypted and resends the ciphertext to the databases. Additionally, users can configure which column in which table should be encrypted by which encryption algorithm.

Prerequisites:

Process:

Login Proxy CLI by executing the following command:

Add Resources by distSQL.

ADD RESOURCE ds_0 ( HOST=127.0.0.1, PORT=3306, DB=ds_0, USER=root, PASSWORD=root );

Create encrypt rule

CREATE ENCRYPT RULE t_encrypt ( COLUMNS( (NAME=user_id,PLAIN=user_plain,CIPHER=user_cipher,TYPE(NAME=AES,PROPERTIES('aes-key-value'='123456abc')))));

SHOW ENCRYPT TABLE RULE t_encrypt;

Create encrypt table

CREATE TABLE `t_encrypt` ( `order_id` int NOT NULL, `user_plain` varchar(45) DEFAULT NULL, `user_cipher` varchar(45) DEFAULT NULL, PRIMARY KEY (`order_id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Below is the result in MySQL:

Read the original here:
The Next Evolution of the Database Sharding Architecture - InfoQ.com

Related Posts
This entry was posted in $1$s. Bookmark the permalink.