S3 Pipelines Overview

MemSQL Pipelines can extract objects from Amazon S3 buckets, optionally transform them, and insert them into a destination table. To understand Amazon S3’s core concepts and the terminology used in this topic, please read the Amazon S3 documentation. To get started quickly using an S3 pipeline, see S3 Pipelines Quickstart.

Example S3 Pipeline Scenario

Imagine you have an Amazon S3 bucket that contains a large number of files in CSV or TSV format, either as flat files or as zipped files (the S3 extractor will automatically unzip files with a .gz extension). New files are being generated every minute, and you need a persistent, reliable, and automatic method of loading their contents into a database.

An S3 Pipeline is the ideal solution to this scenario. Consider a MemSQL cluster with one master aggregator node, one child aggregator node, and two leaf nodes. Given this topology, the following steps occur when you start an S3 Pipeline:

  1. The MemSQL cluster’s master aggregator accesses the specified S3 bucket and authenticates if necessary.
  2. The master aggregator reads the contents of the bucket, divides the number of objects in the bucket by the number of leaf partitions in the cluster, and assigns a portion of the objects to each leaf.
  3. Each leaf node begins streaming their assigned objects from the S3 bucket, and also authenticating if necessary. If a transform executable was specified when the pipeline was created, the leaf node will transform the object’s data as it is streamed from S3.
  4. The object’s records are streamed into the destination table immediately after extraction and optional transformation.

As new files are added to the source bucket, the master aggregator constantly monitors it to ensure that the files will be processed automatically.

Info

As a best practice to ensure the highest processing rate for additional files added to your bucket, you should prefix the files with an increasing alpha-numeric value, such as a timestamp or some other marker (e.g. YYYY-MM-DD-filename.extension). This will ensure that S3 pipelines connected to buckets with a large file count perform as efficiently as possible.

S3 Pipeline Syntax Examples

The following syntax demonstrates how to create a new S3 Pipeline. For complete syntax documentation, see CREATE PIPELINE.

Example 1 – Read all objects in a bucket using ID and secret credentials for CSV files:

CREATE PIPELINE library
AS LOAD DATA S3 'my-bucket-name'
CONFIG '{"region": "us-west-1"}'
CREDENTIALS '{"aws_access_key_id": "your_access_key_id", "aws_secret_access_key": "your_secret_access_key"}'
INTO TABLE `classic_books`
FIELDS TERMINATED BY ',';

Example 2 – Read all objects in a folder using IAM role credentials for TSV files:

CREATE PIPELINE library
AS LOAD DATA S3 'my-bucket-name/my-folder/'
CONFIG '{"region": "us-west-1"}'
CREDENTIALS '{"role_arn": "arn:aws:iam::<AccountID>:role/S3ReadOnly"}'
INTO TABLE `classic_books`
FIELDS TERMINATED BY '\t';

Authentication and Access Control in Amazon S3

An S3 pipeline requires you to authenticate to the desired bucket. Depending on the provided credentials, a bucket’s access permissions or policies may require changes to allow access. To familiarize yourself with access management for buckets and objects, see the Amazon S3 documentation.

Authentication

An S3 pipeline must authenticate to Amazon S3 before it can begin reading objects from a bucket. While Amazon S3 allows anonymous authentication to buckets and their objects, S3 pipelines currently support only explicit authentication using credentials or an EC2 instance profile. This section describes the different ways you can authenticate to Amazon S3, some of which depend on the MemSQL cluster’s deployment environment.

Option 1: Access Key ID and Secret Access Key

You can authenticate by providing the Access Key ID and Secret Access Key of either the root Amazon AWS user or an IAM user account. For example, when executing the CREATE PIPELINE statement, these credentials are specified in the following way:

...
CREDENTIALS '{"aws_access_key_id": "your_access_key_id", "aws_secret_access_key": "your_secret_access_key"}'
...

If the credentials for a non-root IAM user account are specified, the desired bucket may require a policy to grant access to the user. See this Amazon S3 documentation page for more information on configuring a bucket policy.

Option 2: EC2 Instance Profile

You can authenticate by using an EC2 instance profile if MemSQL is deployed in Amazon EC2. Instance profiles allow MemSQL to authenticate without providing explicit credentials in the CREATE PIPELINE statement. As a result, this authentication method is more secure because no credentials are sent or stored. For more information about configuring instance profiles, see this Amazon EC2 documentation page.

In addition to an instance profile, you can specify an IAM role that MemSQL will inherit to perform additional authorization. This additional authorization is called assuming a role, and can only be used if MemSQL is deployed on an EC2 instance with an appropriately configured instance profile. For more information on assuming a role, see this IAM documentation page.

To assume a role, executing a CREATE PIPELINE statement and specify CREDENTIALS in the following way:

...
CREDENTIALS '{"role_arn": "arn:aws:iam::<AccountID>:role/S3ReadOnly"}'
...

Permissions and Policies

An S3 pipeline can be created to read from either a bucket or an object. Both of these resource types may be configured with access policies or permissions. Before creating an S3 pipeline, it’s important to consider both the provided user credentials and any existing policies or permissions on the desired resource.

For example, if you provide credentials that implicitly grant access to all objects in a bucket, you may not need to add or modify any access policies or permissions on the desired resource. However, if you provide credentials that do not have permission to access the resource, an administrator will allow access.

Consider the following scenarios:

  • Read all objects in a bucket: An S3 pipeline configured for a bucket will automatically read all objects it has access to. Changes to a bucket-level policy or permissions may be required.
  • Read all objects in a bucket’s folder: An S3 pipeline configured for a folder, which is an object with key prefix, will automatically read all objects in the folder. Changes to a bucket-level policy with a prefix may be required to allow access the desired folder.
  • Read a specific object in a bucket: An S3 pipeline configured for a specific object may require changes to the policies and permissions for both the object and bucket.

Authentication Errors

You may receive authentication errors if you attempt to create an S3 pipeline without providing credentials or if the provided credentials are invalid.

NoCredentialProviders: no valid providers in chain.

This error is caused by one or more of the following conditions:

  • No CREDENTIALS were specified in the CREATE PIPELINE statement or the JSON was malformed.
  • An IAM role was specified, but your EC2 instance was not configured with an instance profile.

“aws_access_key_id” specified, but not “aws_secret_access_key”

This error is caused by a missing aws_secret_access_key key in the CREDENTIALS JSON of your CREATE PIPELINE statement, or if the JSON key is malformed.

“aws_secret_access_key” specified, but not “aws_access_key_id”

This error is caused by a missing aws_access_key_id key in the CREDENTIALS JSON of your CREATE PIPELINE statement, or if the JSON is malformed.

InvalidAccessKeyID: The access key ID you provided does not exist in our records

This error is caused by specifying an Access Key ID that does not exist.

SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your key and signing method

This error is caused by specifying an invalid combination of an Access Key ID and a Secret Access Key.

S3 Pipeline Batches

Metadata and configuration information for every pipeline can be found in tables in the information_schema database. These tables use the term batch, which is a standard term across all MemSQL pipelines, but have the following definitions in the context of S3 pipelines.

When the master aggregator reads a container’s contents, it schedules a subset of the objects for ingest across all database partitions. After each each partition across the cluster has finished extracting, transforming, and loading its object, a batch has been completed. Therefore, a S3 pipeline batch is defined as a cluster-level operation where each partition processes at most a single object from the source container.

Consider the following example: There are 4 objects in a source container. If your cluster has 2 leaf nodes that have 2 partitions each (4 partitions total), all of the container’s objects can be ingested in 1 batch. In the same cluster, if there are 40 objects in the source container, it will take 10 batches to fully ingest the data.

Information about recent batches can be found in information_schema.PIPELINES_BATCHES_SUMMARY.

Information about files to be loaded can be found in information_schema.PIPELINES_FILES.

S3 Pipeline Limitations

An S3 pipeline has a few limitations and also inherits limitations from the Amazon S3 service itself. See the S3 Documentation for more detailed information.

  • Versioning: If your S3 pipeline is configured to read from a version-enabled bucket, you will only receive the latest version of the object from the bucket. Currently, you cannot configure your S3 pipeline to read specific version IDs.
  • 5 TB Max Object Size: Amazon S3 supports a maximum object size of 5 TB, and this limit also applies to an S3 pipeline.
  • Rate Limiting: A large MemSQL cluster configured with an S3 pipeline might encounter throttling or rate limiting imposed by Amazon S3. Currently, an S3 pipeline cannot be configured to bypass these limits by reducing the frequency of read requests. For more information on Amazon S3’s rate limiting and how to optimize your data, see this S3 documentation page.
Was this article useful?