MemSQL Loader

Warning

This documentation does not reflect the latest version of MemSQL.
This page refers to MemSQL 4.1.

For the latest MemSQL product documentation, visit http://docs.memsql.com/latest.

MemSQL Loader is an open-source tool that lets you load sets of files from Amazon S3, the Hadoop Distributed File System (HDFS), and the local filesystem into MemSQL (or MySQL) with just one command. You can specify all of the files you want to load with one command, and MemSQL Loader will take care of deduplicating files, parallelizing the workload, retrying files if they fail to load, and more.

Background

One of the most common tasks with any database is loading large amounts of data into it from an external data store. Both MemSQL and MySQL provide the LOAD DATA command for this task. Loader can only read from a single file at a time so if you want to load multiple files you will need to issue multiple LOAD DATA commands. To perform this work in parallel, you will need to write your own scripts.

At MemSQL, we’ve acutely felt all of these limitations. That’s why we developed MemSQL Loader, which solves all of the above problems and more.

Basic Usage

Downloading the Loader

The loader is a standalone binary that you can download and run directly. Downloading the loader binary is the easiest way to get started with memsql-loader. We keep the latest version hosted at our GitHub repository. The binary is produced by compiling this Python project with PyInstaller.

To get started with the binary, just run

$ tar -xzf memsql-loader
$ cd memsql-loader
$ ./memsql-loader --help

If you would like to download and run the source code directly, you can download the repository from GitHub and run the manual steps outlined in the README.

Running the Loader

The primary interface to the loader is the memsql-loader load command. The command takes arguments that specify the source, parsing options, and destination server. For example, to load some files from S3, you can run

$ ./memsql-loader load -h 127.0.0.1 -u root --database db --table t \
    s3://memsql-loader-examples/sanity/*

The loader automatically daemonizes and runs the load process in a background server. You can monitor its progress with

$ ./memsql-loader ps --watch

If you would like to run this example against MemSQL or MySQL, run

memsql> CREATE DATABASE db;
memsql> CREATE TABLE    db.t (a int, b int, primary key (a));

File Pattern Syntax

The loader supports loading files from Amazon S3, HDFS, and the local filesystem. The file’s prefix determines the source. You can specify “s3://”, “hdfs://”, or “file://”. If you omit the prefix, then the loader defaults to the local filesystem.

The loader also supports glob syntax (with semantics similar to bash). A single * matches files in the current directory, and ** matches files recursively. MemSQL Loader uses the glob2 library under the hood to facilitate this.

File Parsing Options

MemSQL Loader’s command line options mirror the LOAD DATA command’s syntax. See the load data options section in ./memsql-loader load –help for reference.

Automatic Deduplication

MemSQL Loader is designed to support one-time loads as well as synchronizing behavior. You can use this functionality to effectively sync a table’s data to the set of files matching a path. The loader will automatically deduplicate files that it knows it does not need to load (by using the MD5 feature on S3), and transactionally delete and reload data when the contents of a file have changed.

Note

This reload behavior requires specifying a column to use as a file_id.

Spec Files

We found with usage that it was really convenient to be able to define a load job as a JSON file, instead of just command line options. MemSQL Loader lets you use “spec files” to accomplish this. To generate one, just append –print-spec to the ./memsql-loader load command. It will generate a spec file that you can use with –spec. Any command line options that you provide along with –spec will override options in the spec file.