Working with GCS Pipelines min read


MemSQL Pipelines can extract objects from Google Cloud Storage, optionally transform them, and insert them into a destination table. To understand GCS core concepts and the terminology used in this topic, please read the Google Cloud Storage Documentation. To get started quickly using a GCS pipeline, see GCS Pipelines Quickstart.

Example GCS Pipeline Scenario

Imagine you have a GCS bucket that contains a large number of files in CSV or TSV format, either as flat files or as zipped files. New files are being generated every minute, and you need a persistent, reliable, and automatic method of loading their contents into a database. A GCS Pipeline is an ideal solution to this scenario. Consider a MemSQL cluster with one master aggregator node, one child aggregator node, and two leaf nodes. Given this topology, the following steps occur when you start a GCS pipeline:

  1. The MemSQL cluster’s master aggregator accesses the specified GCS bucket and authenticates if necessary.
  2. The master aggregator reads the contents of the bucket, divides the number of objects in the bucket by the number of leaf partitions in the cluster, and assigns a portion of the objects to each leaf.
  3. Each leaf node begins streaming its assigned objects from the GCS bucket and also authenticating if necessary. If a transform executable was specified when the pipeline was created, the leaf node will transform the object’s data as it is streamed from GCS.
  4. The object’s records are streamed into the destination table immediately after extraction and optional transformation.
  5. As new files are added to the source bucket, the master aggregator constantly monitors it to ensure that the files will be processed automatically.
Info

As a best practice to ensure the highest processing rate for additional files added to your bucket, you should prefix the files with an increasing alpha-numeric value, such as a timestamp or some other marker (e.g. YYYY-MM-DD-filename.extension). This will ensure that GCS pipelines connected to buckets with a large file count perform as efficiently as possible.

GCS Pipeline Syntax Examples

The following syntax demonstrates how to create a new GCS Pipeline. For complete syntax documentation, see CREATE PIPELINE.

Example – Read all objects in a bucket using ID and secret credentials for CSV files:

CREATE PIPELINE library
AS LOAD DATA GCS 'my-bucket-name'
CONFIG '{"suffixes": ["csv"]}'
CREDENTIALS '{"access_id": "your_google_access_key", "secret_key": "your_google_secret_key"}'
INTO TABLE `classic_books`
FIELDS TERMINATED BY ',';

Authentication and Access Control in GCS

A GCS pipeline requires you to authenticate to the desired bucket. Depending on the provided credentials, a bucket’s access permissions or policies may require changes to allow access. To familiarize yourself with access management for buckets and objects, see the GCS documentation here.

Authentication

A GCS pipeline must authenticate to Google before it can begin reading objects from a bucket. While GCS allows anonymous authentication to buckets and their objects, MemSQL GCS pipelines support only HMAC keys. The pipeline requires you to provide a GCS access key and secret key.

The CREDENTIALS clause should be a JSON with two fields:

access_id: usually a 24 or 60 character alphanumeric string, which is linked to the Google account, typically is all uppercase and starts with GOOG

secret_key: usually a 40 character Base-64 encoded string that is linked to a specific access_id.

...
CREDENTIALS '{"access_id": "your_google_access_key", "secret_key": "your_google_secret_key"}'
...

GCS Pipeline Batches

Metadata and configuration information for every pipeline can be found in tables in the information_schema database. These tables use the term batch, which is a standard term across all MemSQL pipelines, but have the following definitions in the context of GCS pipelines.

When the master aggregator reads a container’s contents, it schedules a subset of the objects for ingest across all database partitions. After each partition across the cluster has finished extracting, transforming, and loading its object, a batch has been completed. Therefore, a GCS pipeline batch is defined as a cluster-level operation where each partition processes at most a single object from the source container.

Consider the following example: There are 4 objects in a source container. If your cluster has 2 leaf nodes that have 2 partitions each (4 partitions total), all of the container’s objects can be ingested in 1 batch. In the same cluster, if there are 40 objects in the source container, it will take 10 batches to fully ingest the data.

Information about recent batches can be found in information_schema.PIPELINES_BATCHES_SUMMARY.

Information about files to be loaded can be found in information_schema.PIPELINES_FILES.

GCS Pipeline Limitations

A GCS pipeline has a few limitations and also inherits limitations from the Google Cloud Storage service itself. See the GCS Documentation for more detailed information.

  • Versioning: If your GCS pipeline is configured to read from a version-enabled bucket, you will only receive the latest version of the object from the bucket. Currently, you cannot configure your GCS pipeline to read specific version IDs.
  • 5 TB Max Object Size: Google Cloud Storage supports a maximum object size of 5 TB, and this limit also applies to an GCS pipeline.
  • Rate Limiting: A large MemSQL cluster configured with a GCS pipeline might encounter throttling or rate limiting imposed by GCS. Currently, a GCS pipeline cannot be configured to bypass these limits by reducing the frequency of read requests. For more information on GCS’s rate limiting and how to optimize your data, see this GCS documentation page.