You are viewing an older version of this section. View current production version.

Backing Up and Restoring Data min read


Info

The following sections in this topic do not apply to MemSQL Helios: “Incremental Backups on Columnstores” and “Verifying Full Backup Files”.

With MemSQL you can back up a database and later restore it. The backup files are saved on the master aggregator and on the leaves.

You can make a full backup or an incremental backup of a database. When you make a full back up, you can target local file systems, network file systems, Azure, Google Cloud Storage, S3, and S3 Compatible Storage.

Info

This topic applies to database backups made using the BACKUP DATABASE command or using the MemSQL tools memsql-admin create-backup command.

Alert

MemSQL does not support restoring database backups from a newer version of MemSQL into an older version.

Remote databases, i.e. databases on a secondary cluster to which the primary database is being replicated, cannot be backed up with the BACKUP DATABASE command.

Store database backups in a safe place. You should use the built-in functionality that automatically copies local backups to Azure, Google Cloud Storage, S3, or S3 Compatible Storage. Alternatively, manually copy local backups to another location.

The Backup History Table

The information_schema.mv_backup_history table stores information about backups that have been made. The columns in this table are described below.

Column Description
backup_id A unique identifier for the backup.
cluster_name Name of the cluster that carried out the backup.
cluster_id Unique id that can identify a cluster (see show cluster id).
database_name Name of the database that was backed up.
start_timestamp Timestamp at which the backup started.
end_timestamp Timestamp at which the backup completed.
num_partitions Number of partitions for the backup.
backup_path Path to where the backup file was placed after backup completed.
checksum Concatenated CRC32C checksum of all backup files, see Verifying Backup Files.
status Success or Failure.
size Size of all backup files, in megabytes written.
incr_backup_id A unique identifier for an incremental backup. If the backup is a full backup instead of an incremental backup, incr_backup_id is NULL.

Note: information_schema.MV_BACKUP_HISTORY is populated only when the BACKUP DATABASE command is run to create a database backup.

Items Included in and Excluded from a Backup

When you make a database backup, the following items are included in and excluded from the backup.

Included Excluded
Tables Resource Pools
Views System Variables
Procedures Users
Pipelines Grants
Functions
Stored Procedures

Types of Backups

You can make full and incremental database backups. The following sections explain the differences between these backups and how they work.

Full Backups

A full backup contains a complete copy of a database.

Making a Full Backup

To make a full backup, run the BACKUP DATABASE command. Specify the name of the database to back up, followed by the directory on disk where the backup will reside. The following example makes a backup of test_db and stores it in the /test_db_backups_week_1 directory.

BACKUP DATABASE test_db to "/test_db_backups_week_1";

BACKUP DATABASE outputs one row containing the BACKUP_ID column. This column contains the id of the backup that was just made.

You can view the history all of the backups made in the cluster by querying information_schema.mv_backup_history. The output of the following query assumes that only one backup was made.

SELECT backup_id, incr_backup_id, database_name, start_timestamp
    FROM information_schema.mv_backup_history
    ORDER BY backup_id;
****
+-----------+----------------+-----------------+---------------------+
| backup_id | incr_backup_id | database_name   | start_timestamp     |
+-----------+----------------+-----------------+---------------------+
|         1 |           NULL | test_db         | 2019-08-04 00:00:00 |
+-----------+----------------+-----------------+---------------------+

You cannot save a full backup to a directory that already contains a full backup. Instead, save the full backup in a different directory or delete the contents in the original directory.

Restoring a Full Backup

Info

You can restore a full backup into 7.0 from all supported releases 5.0 and later.

You can also restore a full database backup from one taken in 7.0 RC (i.e. 7.0.7) and 7.0 beta 3 (i.e. 7.0.6).

In MemSQL Helios, restoring a database from a local filesystem is not supported.

To restore a full backup, use the RESTORE DATABASE command and specify the database name along with the directory where the backup is stored. The following example restores test_db from the test_db_backups_week_1 directory.

RESTORE DATABASE 'test_db' FROM "/test_db_backups_week_1";

Incremental Backups on Columnstores

Info

This section does not apply to MemSQL Helios or to backups made using the MemSQL tools memsql-admin create-backup command.

An incremental backup contains only the changes to columnstore data that were made since the previous backup was taken.

Info

You can make an incremental backup on a database that has rowstore data, but a full copy of the rowstore data will be saved in the backup.

A backup set contains an initial full backup followed by any number of incremental backups. All of the backups in a backup set are stored in the same directory.

Making Incremental Backups

Suppose you want to store incremental backups for one week in the same backup set.

First, you make an initial full backup by running the BACKUP DATABASE command and include the WITH INIT clause. This is shown in the following example.

BACKUP DATABASE test_db to "/test_db_backups_week_1" WITH INIT;

To make the first incremental backup, run the BACKUP DATABASE command and include the WITH DIFFERENTIAL clause. This is shown in the following example.

BACKUP DATABASE test_db to "/test_db_backups_week_1" WITH DIFFERENTIAL;

To make the second incremental backup, run the same command again:

BACKUP DATABASE test_db to "/test_db_backups_week_1" WITH DIFFERENTIAL;

To view the history of the incremental backups taken in the cluster, query information_schema.mv_backup_history. The output of the following query assumes that the only backups made are those in this section, Making Incremental Backups.

SELECT backup_id, incr_backup_id, database_name, start_timestamp
    FROM information_schema.mv_backup_history
    ORDER BY backup_id;
****
+-----------+----------------+-----------------+---------------------+
| backup_id | incr_backup_id | database_name   | start_timestamp     |
+-----------+----------------+-----------------+---------------------+
|        10 |             10 | test_db         | 2019-08-04 00:00:00 |
|        11 |             11 | test_db         | 2019-08-05 00:00:00 |
|        12 |             12 | test_db         | 2019-08-06 00:00:00 |
+-----------+----------------+-----------------+---------------------+

Restoring an Incremental Backup

Info

You can restore an incremental backup from one taken in 7.0 RC (i.e. 7.0.7) and 7.0 beta 3 (i.e. 7.0.6).

In MemSQL Helios, restoring a database from a local filesystem is not supported.

To restore an incremental backup, run the RESTORE DATABASE command, including the WITH FILE clause. In this clause, specify the incr_backup_id of the incremental backup that you want to restore.

The following command assumes the backups in the previous section, Making Incremental Backups, have been made. The command restores the incremental backup with the incr_backup_id 12. Internally, MemSQL uses the initial full backup with the incr_backup_id 10 and applies the changes with the incr_backup_id 11 and 12 to restore the backup.

RESTORE DATABASE test_db FROM "/test_db_backups_week_1" WITH FILE = 12;

You can restore any incremental backup from a backup set. You are not limited to restoring the latest incremental backup, as was shown in the previous example.

Making Frequent Incremental Backups and Occasional Full Backups

When you make an incremental backup, it may no longer contain columnstore data that was in a previous incremental backup. This data in the previous backup is not needed once the current backup is made; the data and will occupy extra disk space.

To save space, you can make incremental backups and occasional full backups. The following example explains a schedule you could use.

Example

  • Make incremental backups on every day of the week except Sunday.
  • On Sunday, take a initial full backup (using BACKUP DATABASE ... WITH INIT). This will create a new backup set.
  • Store the backups for each week in a separate directory.
  • When the full backup made on the Sunday of a new week succeeds, delete the backup directory for the previous week (if desired).

Incremental Backup Performance

Generally, making an incremental backup will be significantly faster than making a full backup. This is because an incremental backup contains only the changes that were made since the previous backup was taken.

If you run OPTIMIZE TABLE FULL prior to making an incremental backup, the backup take significantly longer to complete than it normally would. This is because OPTIMIZE TABLE FULL sorts the entire columnstore table.

Clearing Backup History

You can remove the records in the mv_backup_history table in the information_schema database by running CLEAR BACKUP_HISTORY;.

Note: information_schema.MV_BACKUP_HISTORY is populated only when the BACKUP DATABASE command is run to create a database backup.

Backup and Restore on Specific Systems

Backup and Restore on NFS (Network File System)

MemSQL makes BACKUP DATABASE and RESTORE DATABASE easy to use with a Network File System (NFS) by naming the .backup files so that that no two leaves will attempt to write to the same file even if they are all writing to the same NFS directory.

To back up MemSQL with NFS:

  • Ensure that the same NFS share is mounted in the same directory on all leaves (e.g. /mnt/backup_nfs/
  • When running BACKUP DATABASE, simply select a backup_path that points to a directory on that NFS share (e.g. BACKUP DATABASE memsql_demo to '/mnt/backup_nfs/backup_2016_05_04/')

Backup and Restore without NFS

If your cluster is not on NFS and you are restoring a cluster with a different configuration or on different hardware than the original, you must manually distribute partitions across the cluster before running RESTORE DATABASE. If the cluster has redundancy level greater than one, you must group leaf nodes into replication pairs. Paired nodes must have the same set of partition backups in their local backup directories (specified by backup_path). MemSQL will automatically choose master and replica partitions when you run RESTORE DATABASE on the master aggregator.

Backup and Restore on S3 and Azure

Backup and restore can also specify an S3 bucket or Azure container. In this case, all the backup files across all the leaves (one per partition as described above) are put directly into the S3 bucket or Azure container. When restore is called, MemSQL will distribute the backup files appropriately and restore the data (similar to how restore from an NFS drive works).

Verifying Full Backup Files

Info

This section does not apply to MemSQL Helios.

This section applies to full backups, only. It does not apply to initial full backups (those taken with BACKUP DATABASE ... WITH INIT) or to incremental backups.

With each backup, there is a checksum associated with the files that comprise the backup. Each checksum is a long string of hex characters, with each eight characters representing a subsection of the backup. To create this checksum, MemSQL utilizes CRC32C, a common variant of CRC32, to process the backup files.

Our implementation of CRC32C has been verified with the commonly used package crcmod, but any library that implements or uses CRC32C can be utilized.

To illustrate how a checksum is comprised of the discrete parts of a database backup, consider the following example where all backup files for the cluster are stored into a NFS share named backup_database:

| backup_database
  \  
  | BACKUP_COMPLETE
  | db.backup
  | db_0.backup
  | db_0.backup_columns0.tar
  | db_0.backup_columns1.tar
  | db_1.backup
  | db_2.backup
  | db_2.backup_columns0.tar
  | db_2.backup_columns1.tar
  | db_2.backup_columns2.tar
  | db_3.backup
  | db_3.backup_columns0.tar

This is a backup is for a small, mixed columnstore/rowstore to an NFS drive with four partitions. The checksum for this backup is 91b3d3353709baa1eff3ba5cfba6bac939b318f41652eac4 and is found in the BACKUP_COMPLETE sentinel file that was created on the master aggregator node in its relative /data/db_backup/ MemSQL directory.

The first 8 characters are the CRC32C of the reference database, db.backup, on the master aggregator.

Next, each partition in order, either has 8 or 16 characters, depending on whether it has columnstore segment tar files or not. In the latter case, the first 8 characters are reserved for the partitions snapshot, db_0.backup while the second 8 characters are for the segment tar files.

If segment tar files have been created, each segment tar file is concatenated, and then the CRC32C checksum is taken. This is equivalent to calculating the CRC32C of db_0.backup_columns0.tar and db_0.backup_columns1.tar concatenated together or calculating the CRC32C of the first tar file, and then the second file, and so on without finalizing the checksum.

Note: Even if a database has a columnstore table, there may not be a corresponding tar file for a partition, because data may be cached in the hidden rowstore table for that columnstore table. In this case, the data would be contained inside the rowstore snapshot.

The following Python script can be used to verify the backup checksum. Copy and paste this script into a new file named verify_backup.py. Usage instructions are covered in the code comments for the script. After running the script, run echo $?. A result of 0 means the verification was successful, while a result of -1 means it was not. The script is well commented and includes debugging output if there was an error in the validation process.

Info

Script completion time is dependent on the size of the backup and may take hours to complete.

# verify_backup.py
#
# Given a directory, this script will verify that the backup crc and size both
# are unchanged.
#
# REQUIRES: crcmod to be installed: https://pypi.org/project/crcmod/
#
# USAGE: python verify_backup.py /absolute/path/to/backup
#
# NOTE: This script needs read privileges on all files being verified.
#
import crcmod
import glob
import json
import sys
import errno

# VerifyBackup:
#   Verifies the CRC located in the backup sentinel file (BACKUP_COMPLETE)
#   matches the calculated CRC of files in backupDirectory.
#
# Param backupDirectory: absolute path to directory where backup exists.
# Return: 0 on success, -1 on failure.
#
def verifyBackup(backupDirectory):
    # Strip off trailing '/' if exists.
    #
    if backupDirectory[-1] == '/':
        backupDirectory = backupDirectory[:(len(backupDirectory)-1)]

    with open("%s/BACKUP_COMPLETE" % backupDirectory, "r") as f:
        buf = f.read()
        backupDictionary = json.loads(buf)
        try:
            finalCrc = backupDictionary["Checksum"]
            dbName = backupDictionary["Database_Name"]
            numPartitions = int(backupDictionary["Num_Partitions"])
        except KeyError as e:
            print e
            print "Sentinel File 'BACKUP_COMPLETE' is from unsupported version of backup."
            return -1

    # This is in the crc32c specification, crcmod also has crc32c hardcoded,
    # so either can be used.
    #
    crc = crcmod.Crc(0x11EDC6F41, rev=True, initCrc=0x00000000, xorOut =0xFFFFFFFF)
    crclist = ""

    # Process the reference snapshot.
    #
    with open("%s/%s.backup" % (backupDirectory, dbName), "r") as f:
            buf = f.read()
            crc.update(buf)
    crclist += crc.hexdigest()

    # Process each partition.
    #
    for i in range (numPartitions):
        crc = crc.new()

        # Process Partition snapshot.
        # Each Partition MUST have a snapshot.
        #
        with open("%s/%s_%d.backup" % (backupDirectory, dbName, i), "r") as f:
            buf = f.read()
            crc.update(buf)
        crclist += crc.hexdigest()

        # Snapshots and Columns are checksummed with seperate CRC's.
        #
        crc = crc.new()

        # To emulate a do while loop in Python.
        tarFound = True

        # If the columnar blobs is non empty, append the crc to the list.
        columnCrc = False
        j = 0

        # Process all tarballed columnstore files.
        #
        # NOTE: Even if a database has a columnstore, this does not imply
        # each partition has columnar blobs. The data might exist in the
        # rowstore snapshot or might be skewed such that all rows exist
        # in other partitions.
        #
        while tarFound:
            try:
                with open("%s/%s_%d.backup_columns%d.tar" % (backupDirectory, dbName, i,j) , "r") as f:
                    buf = f.read()
                    crc.update(buf)
                    j += 1
                    columnCrc = True
            except IOError as e:
                if e.errno == errno.ENOENT:
                    tarFound = False
                else:
                    assert e
        if columnCrc:
            crclist += crc.hexdigest()

    # CRC's will be of different case, make both uppercase.
    #
    if crclist != finalCrc.upper():
        print "Crc calculated from directory:" + crclist
        print "Crc in backup file :" + finalCrc
        print "Crcs do not match!"
        return -1
    return 0

if __name__ == '__main__':
    backupDirectory = sys.argv[1]
    if len(sys.argv) != 2:
        print "Incorrect usage: please include just the directory where the backup is located."
    sys.exit(verifyBackup(backupDirectory))

Related Topics