Sharding a SQL Server database

Posted in: Microsoft SQL Server, Technical Track

This blog post covers sharding a SQL Server database using Azure tools and PowerShell script snippets.

Sharding a 1TB database into four 256 GB DBs

Sharding a database

Sharding, at its core, is breaking up a single, large database into multiple smaller, self-contained ones. This is usually done by companies that need to logically break the data up, for example a SaaS provider segregating client data. The performance benefits of this are clear, as the sharded database is generally much smaller than the original, and so queries, maintenance, and all other tasks are much faster. However, the company now needs to deal with many more (possibly hundreds of) databases than it previously had.

At a high level, sharding works like this:

  1. The company chooses a logical method to separate the data called the Sharding Key
  2. A Shard Map is created in a new database. This map ties the sharding key to the database it’s data is associated with
  3. New databases are created and the data is moved to it’s new home.
  4. The client connections are changed. They will now query the shard map to find the shard’s data, and then connect to the new database.

In addition, with Azure and sharding, we see a lot of people making use of a set of sharded databases and then placing them all in an Elastic Pool for the performance and maintenance gains see there.

The rest of this blog post is a mid to high-level walk-through of the steps needed to shard a database.

Identify sharding key

The Sharding key is the value that will be used to break up the data into separate shards. The key is used by the Sharding Map to identify where the required user data is being stored, and to route connections there appropriately. It is critical that the Sharding key be able to be mapped to every value that will be migrated. Any values without a Sharding key will be skipped.

For example, a retail business with multiple stores across the US may choose to use a StoreID value as a Sharding Key. StoreID may be a uniqueidentifier or an INT IDENTITY, and logically this means that the data will be sharded by store.

Identify sharding method

Shards can be stored in their respective databases via one of two methods:

Range sharding
Range Sharding stores several shards in one database based on the Sharding key being within a defined range of values. For example, in a system with an Integer Sharding key, the values 1-10 could be stored within the same database, and data with the values 11-20 stored in a second database. This allows database resources to be shared across several Sharding keys, and reduces the overall number of databases that must be maintained.

List/point sharding
Point Sharding stores the data for every shard in a separate database for each key. This allows a guaranteed level of service for each shard as database resources are not shared; however, it can also mean that many databases are created and must be maintained.

Identify shard tables and reference tables

There are two types of tables in a Sharded database.

The Shard tables are the tables that have been broken up based on the Sharding key. For these tables, the data will be different depending on which database the client connects to.

The Reference tables are exactly the same regardless of the database.

In the retail store example, a Product table may be a reference table because all stores will need a complete list of all products. On the other hand, the ProductSold table would have data that only relates to an individual store, so it is a Shard table.

Migrate sharing key to shard tables

As mentioned earlier, all tables that will be sharded must have the Sharding key as a column. This step is simply creating the [StoreID] column in every sharded table and the updating the value to the associated store. For this piece, manual scripts will need to be created and run.

Create shard map database

The Shard Map tracks which shards are in which database. It also handles returning the correct connection string to the application.
It is important this be placed in a separate database to ensure performance can be maintained for all clients regardless of any one client having issues.

The Shard Map database is a regular Azure SQL DB and should be created via the Azure portal front-end. This database will be hit by all clients to discover which shard database they need to connect to, so make sure it’s powerful enough to handle the expected load.

Register database schema in shard map

The database schema must be registered in the Shard Map. This is used by the Split-Merge process to identify the Sharded tables and the Reference tables.

The below PowerShell commands give an example of how to do this. Note that it takes advantage of a module written by the Azure Shard team.

$ShardMapManager = new-ShardMapManager -UserName 'example' -Password '*****!' -SqlServerName 'example.database.windows.net' -SqlDatabaseName 'ShardAdmin'

# $ShardMapManager is the shard map manager object
new-ListShardMap -KeyType $([guid]) -ShardMapManager $ShardMapManager -ListShardMapName 'ListShardMap'

$SchemaInfo = New-Object Microsoft.Azure.SqlDatabase.ElasticScale.ShardManagement.Schema.SchemaInfo

# Reference Table
$ReferenceTableName = "Product"
$SchemaInfo.Add($(New-Object Microsoft.Azure.SqlDatabase.ElasticScale.ShardManagement.Schema.ReferenceTableInfo($ReferenceTableName)))

# Sharded Table
$ShardedTableSchemaName = "dbo" 
$ShardedTableName = "ProductSold" 
$ShardedTableKeyColumnName = "StoreID" 
$SchemaInfo.Add($(New-Object Microsoft.Azure.SqlDatabase.ElasticScale.ShardManagement.Schema.ShardedTableInfo($ShardedTableSchemaName, $ShardedTableName, $ShardedTableKeyColumnName)))

$SchemaInfoCollection = $ShardMapManager.GetSchemaInfoCollection()

# Add the SchemaInfo for this Shard Map to the Schema Info Collection
$SchemaInfoCollection.Add('StoreListShardMap', $SchemaInfo)

Create cloud services to run split-merge process

The split-merge process is run via a cloud service in Azure. After registering the shard with the Shard Map, a notification is sent to the Split-Merge process, and a new request is queued up. Each request is worked through serially, and because of this we recommend having multiple cloud services to run different split-merge requests.

To create a cloud service for the Split-Merge process, follow this tutorial.

Create new databases and assign shards

For every shard in the existing database, these steps will have to be performed:

Create a new Azure SQL database and database objects like tables, views, etc…
Depending on the number of shards you’re dealing with, this is almost certainly going to be easier with a PowerShell script of some kind. It is important that you do not create, or at least enable, constraints at this point. The split-merge utility does not reference them when inserting data, and the process will fail.

Associate the new database with the GUID shard value in the Shard Map
The following code snippet will do this:

	
$ShardMapManager = Get-ShardMapManager -UserName 'example' -Password '****!' -SqlServerName 'example.database.windows.net' -SqlDatabaseName 'ShardAdmin'

# Get Shard Map. 
$ShardMap = Get-ListShardMap -KeyType $([guid]) -ShardMapManager $ShardMapManager -ListShardMapName $ShardMapName

# Add new DB to shard map
Add-Shard -ShardMap $ShardMap -SqlServerName $FullSQLServerName -SqlDatabaseName $ShardName | wait-process

# Add shard to shard map -- Mapped to $SourceDB because that's where it is currently
Add-ListMapping -keyType $([guid]) -ListShardMap $ShardMap -ListPoint $Guid -SqlServerName $FullSQLServerName -SQLDatabaseName $SourceDB | wait-process

Assign the new shard to a Cloud Service for the Split-Merge process
Again, this code snippet is an example of doing this. In this case, a modulus value is used to assign each shard to a different merge-split service.

$mod = $NumOfShards % $NumOfMergeSplitApps

if ($mod -eq 1) {
  $SplitMergeURL = "https://example-mergesplit.cloudapp.net"
  $LogOutput = (Get-Date).ToShortDateString() + " " + (Get-Date).ToShortTimeString() + " : " + $ShardName + " sent to " + $SplitMergeURL 
  Add-Content -Path $LogFile -value $LogOutput
}
elseif ($mod -eq 2) {
  $SplitMergeURL = "https://example-mergesplit2.cloudapp.net"
  $LogOutput = (Get-Date).ToShortDateString() + " " + (Get-Date).ToShortTimeString() + " : " + $ShardName + " sent to " + $SplitMergeURL 
  Add-Content -Path $LogFile -value $LogOutput
}
else{
  $SplitMergeURL = "https://example-mergesplit10.cloudapp.net"
  $LogOutput = (Get-Date).ToShortDateString() + " " + (Get-Date).ToShortTimeString() + " : " + $ShardName + " sent to " + $SplitMergeURL 
  Add-Content -Path $LogFile -value $LogOutput
}

# Queue up the split database operation
$OperationID = Submit-ShardletMoveRequest `
  -SplitMergeServiceEndpoint $SplitMergeURL `
  -ShardMapManagerServerName $ShardMapServerName `
  -ShardMapManagerDatabaseName $ShardMapDB `
  -ShardMapName $ShardMapName `
  -ShardKeyType 'guid' `
  -ShardletValue $Guid `
  -TargetServerName $FullSQLServerName `
  -TargetDatabaseName $ShardName `
  -UserName $AdminLogin `
  -Password $AdminPasswd `
  -CertificateThumbprint '####' #Unique to your project. See MS Tutorial

$LogOutput = (Get-Date).ToShortDateString() + " " + (Get-Date).ToShortTimeString() + " : " + "Operation ID: " + $OperationID
Add-Content -Path $LogFile -value $LogOutput

Monitor split-merge processes

The Split-Merge process logs its current status to a database, and each process has its own DB. The databases for this example will be located on the shard map database server and are named example-mergesplitN where N is a number.
The below queries will return information about the currently executing split process, any successful or failed process, and how many processes are left in the queue.

SELECT 
  [TimeStamp] LastUpdateTime, 
  [Status], 
  Progress [EstPercentDone], 
  OperationID, 
  CancelRequest [Cancelled], 
  Details 
FROM 
  RequestStatus 
WHERE
  OperationID IN (select operationid from requeststatus where [status] not in ('Queued', 'Canceled', 'Failed', 'Succeeded')) 
ORDER BY 
  LastUpdateTime desc;

SELECT
  *
FROM
  RequestStatus 
WHERE
  status IN ('Failed', 'Succeeded') 
ORDER BY 
  [timestamp] desc;

SELECT
  *
FROM
  RequestStatus 
WHERE
  status = 'queued' 
ORDER BY 
  [timestamp] desc;

Change application code to use shard map

Microsoft has written a set of libraries called the ShardMapManagerFactory to enable an easy transition to a sharded database. These libraries allow a client to pass in a Sharding Key and will return a connection string to the database associated with that Shard.

Enable foreign key constraints

The Split-Merge process does not perform INSERT or DELETE operations in any particular order, and does not respect Foreign Key constraints. Because of this, all constraints must be disabled prior to running the Split-Merge process.

email
Want to talk with an expert? Schedule a call with our team to get the conversation started.

17 Comments. Leave new

Adinath Kamode
August 9, 2018 12:36 am

Nice Article, How database writes would be handled? Assuming that application will route connections to appropriate shard according to key, will other shards will have a full copy of data ? If write happens to shard A, will it be auto populated to Shard B… C etc?

Thanks,
Adinath Kamode

Reply
Scott McCormick
August 9, 2018 12:08 pm

No, the data is not replicated to the other shards. Each shard has identical schemas, but completely separate data that needs to be managed on its own.

Reply
Keith Valentine
September 12, 2018 9:41 am

How much cost in speed when you have to query the shard manager to map to a specific shard, or can you just setup an application service to hit a specific shard once all the data is separated? Just wondering if we make this switch if it is better to start isolating at the .net service layer and only use elastic queries for data warehouse type queries.

Reply
Scott McCormick
September 13, 2018 4:02 am

If you’re using the connection library that Microsoft wrote, it will hit the shard map once per connection to get the sharded database’s connection info. After that, all connections will be direct to that DB, so it’s a very low cost.

If your application opens/closes connections to the DB many times, you might want to think about a workaround, but if it just establishes a connection to use for the entire session then I wouldn’t worry about it.

Reply

Sharding is, in essence, horizontal partitioning. We already have one database per client (an SaaS environment). What advantage does sharding provide over simply mapping clients, for processing by ClientID (i.e. what would be the sharding key)? For example, I might have a windows service instance, that only maps to ClientID’s 1-10 and another that manages 11-20 etc. Would sharding give me more bang for my buck, so to speak?

Reply
Scott McCormick
October 22, 2018 3:56 am

From your description, I would say you’ve already sharded the data. The only item from this blog that might be helpful is the sharding library. Once you’ve configured that and set up the map, it would be fairly easy for the developers to connect to the correct database. It’s up to you if it’s worth the effort though, since you might already have a solution in place for that.

Reply

Interesting read. Couple of questions

1) does the application accessing the DB need to be shard aware? ie would we need to reprogram our software?
2) can sharding be done with any version of SQL eg express, standard?

Reply
Scott McCormick
December 10, 2018 12:56 pm

The connection strings for the application will need to be changed. Your developers will call into a .NET library which looks up the correct database for the shard, and then passes back a connection to that database.

Sharding can be done for any version. This is not a built in feature of SQL Server at all.

Reply

Great article!

How does sharding handle the PKs of your tables. For example, say you have Tenant 1 on one shard and 2 on another. There is an order table that has OrderId and TenantId. The TenantId is the Shard Key but the OrderID is an Identity column. Let’s say each Tenant has 5 orders. So before you broke them into separate shards Tenant 1 had order ids 1-5 and Tenant 2 had orders 6-10. If your application creates another order, for Tenant 1, will the OrderId be 6 or 11. I’m thinking the ShardMap has to be aware of this type of thing. If you ever wanted to use the Split/Merge tool to put both Tenants back on the same shard, these order ids would have to be maintained. Or does it just remap all our PKs and FKs so everything is in sync.

Reply
Scott McCormick
January 23, 2019 9:11 am

It’s not that aware, unfortunately. If you merge the databases back together, you will need to manually handle any PK/FK/Unique Key conflicts.

Reply
Gourav Gupta
March 25, 2019 6:02 am

I am using .Net library for Sharding (on-premises or managed instance). I am not using Azure sql server (Pass) so I can not use elastic pool feature like elastic query, split-merge utility, Elastic Database jobs etc. Do I need to create libraries for these features (Provided by elastic pool)

Reply
Florence Tissot
April 3, 2019 7:09 pm

Thanks for the article. Can you clarify what happens to the reference tables? are these replicated somehow in each shard? or stored in the shardmap database?

Reply
Scott McCormick
April 4, 2019 8:52 am

The lookup tables are kept in each database.

Theoretically if you have 100’s of sharded databases & a lookup table that is updated frequently, you could come up with a different architecture (or a process to push out changes). That’s outside the scope of this article though :)

Reply

Can this be implemented on On-Premise SQL Server or is this exclusive to Azure SQL DB only ?

Reply
Scott McCormick
January 25, 2021 8:10 am

I believe that this would work for on-premise SQL Server (I haven’t tested), but it does make use of some tools that live in Azure for the initial migration. So you’d need to be able to access them, or manually move the data between databases.

Reply

Hello.

Quick Question.

A customer wants to implement this in order to obtain data from different databases into a “central” DB.

Wouldn’t it be the same if they create a View or Materialized View?

What would be the benefits about this?

Thanks!

Reply
Scott McCormick
February 10, 2021 3:11 am

No. Using a view, or even a materialized view, wouldnt be the same. A sharded database is about spreading the workload over multiple databases on several instances & servers. A view/materialized view is going to be in the same instance as the underlying tables, and so you don’t get that benefit. Of course, view is a lot simpler to setup.

Reply

Leave a Reply

Your email address will not be published. Required fields are marked *