
Pipelines Overview

MemSQL Pipelines is a MemSQL Database feature that natively ingests real-time data from external sources. As a built-in component of the database, Pipelines can extract, transform, and load external data without the need for third-party tools or middleware. Pipelines is robust, scalable, highly performant, and supports fully distributed workloads.

Introduced in MemSQL 5.5, Pipelines supports Apache Kafka and Amazon S3 data sources, along with the other sources listed under Supported Data Sources. See Kafka Pipelines Overview and S3 Pipelines Overview for more information.

Overview

All database products provide native mechanisms to load data. For example, MemSQL can natively load data from a file, a Kafka cluster, cloud repositories like Amazon S3, or from other databases. However, modern database workloads require data ingestion from an increasingly large ecosystem of data sources. These sources often use unique protocols or schemas and thus require custom connectivity that must be updated regularly.

The challenges posed by this dynamic ecosystem are often resolved by using middleware – software that knows how to deal with the nuances of each data source and can perform the process of Extract, Transform, and Load (ETL). This ETL process ensures that source data is properly structured and stored in the database.

Most ETL processes are external, third-party systems that integrate with a database; they’re not a component of the database itself. As a result, ETL middleware can introduce additional problems of its own, such as cost, complexity, latency, maintenance, and downtime.

Unlike external middleware, MemSQL Pipelines is a built-in ETL feature of MemSQL Database. Pipelines can be used to extract data from a source, transform that data using arbitrary code, and then load the transformed data into MemSQL Database.
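The extract, transform, and load stages described above can be illustrated with a minimal Python sketch. This is a conceptual model only, not MemSQL's implementation or API: the function names (extract, transform, load, run_pipeline), the JSON record shape, and the in-memory list standing in for a table are all assumptions made for illustration.

```python
import json
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int, str]


def extract(source: Iterable[str]) -> Iterator[str]:
    """Extract: yield raw records from a source (here, any iterable of strings)."""
    for raw in source:
        yield raw


def transform(raw: str) -> Row:
    """Transform: arbitrary code that reshapes one raw record into a row."""
    event = json.loads(raw)
    return (event["id"], event["name"].strip().lower())


def load(rows: Iterable[Row], table: List[Row]) -> int:
    """Load: append rows to the destination table and return how many were added."""
    count = 0
    for row in rows:
        table.append(row)
        count += 1
    return count


def run_pipeline(source: Iterable[str], table: List[Row]) -> int:
    """Chain the three stages; records stream through one at a time."""
    return load((transform(raw) for raw in extract(source)), table)
```

Because each stage is a generator or a plain function, records flow through the chain without being staged in an intermediate system, which is the property the built-in approach is meant to provide.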

Features

The features of MemSQL Pipelines make it a powerful alternative to third-party ETL middleware in many scenarios:

Use Cases

MemSQL Pipelines is ideal for scenarios where data from a supported source must be ingested and processed in real time. Pipelines is also a good alternative to third-party middleware for basic ETL operations that must be executed as fast as possible. Traditional long-running processes, such as overnight batch jobs, can be eliminated by using Pipelines.

Terms and Concepts

MemSQL Pipelines uses the following terminology to describe core concepts:

Supported Data Sources

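Apache Kafka, Amazon S3, Azure Blob, and the filesystem are supported as data sources. The table below lists the minimum versions required for each. For more information, see Extractors.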

Data Source           Data Source Version   MemSQL Version
Apache Kafka          0.8.2.2 or newer      5.5.0 or newer
Amazon S3             N/A                   5.7.1 or newer
Filesystem Extractor  N/A                   5.8.5 or newer
Azure Blob            N/A                   5.8.5 or newer
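
The version requirements in the table can be encoded as a small lookup, for example to validate a deployment before creating a pipeline. The sketch below is illustrative only: the source keys ("kafka", "s3", "filesystem", "azure_blob") and the helper names are hypothetical and are not MemSQL identifiers. It checks only the MemSQL version and does not check the Kafka broker version.

```python
from typing import Dict, Tuple

Version = Tuple[int, int, int]

# Minimum MemSQL version per data source, taken from the table above.
# The dictionary keys are hypothetical labels chosen for this sketch.
MIN_MEMSQL_VERSION: Dict[str, Version] = {
    "kafka": (5, 5, 0),
    "s3": (5, 7, 1),
    "filesystem": (5, 8, 5),
    "azure_blob": (5, 8, 5),
}


def parse_version(version: str) -> Version:
    """Parse 'major.minor.patch', padding missing components with zeros."""
    parts = [int(part) for part in version.split(".")]
    parts += [0] * (3 - len(parts))
    return (parts[0], parts[1], parts[2])


def is_source_supported(source: str, memsql_version: str) -> bool:
    """Return True if the given MemSQL version meets the source's minimum."""
    minimum = MIN_MEMSQL_VERSION.get(source)
    if minimum is None:
        return False  # unknown source: treated as unsupported in this sketch
    return parse_version(memsql_version) >= minimum
```

For instance, is_source_supported("s3", "5.5.0") is False because S3 requires 5.7.1 or newer, while is_source_supported("kafka", "5.5.0") is True.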

Pipelines Scheduling

MemSQL supports running multiple pipelines in parallel. Pipelines run in parallel until all MemSQL partitions are saturated. For example, consider a MemSQL cluster with 10 partitions. With this architecture, it is possible to run 5 parallel pipelines using 2 partitions each, 2 pipelines using 5 partitions each, and so on. If no two pipelines have partition requirements that sum to at most the total number of MemSQL partitions, the pipelines run serially in a round-robin fashion. Note that the number of partitions a pipeline uses depends on the pipeline source. For more information, see Extractors.
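
The partition arithmetic above can be made concrete with a simplified model. The sketch below is not MemSQL's actual scheduler; it assumes a first-fit policy that groups pipelines into rounds so that the partition requirements within a round never exceed the cluster total. The function name plan_rounds and the pipeline names are hypothetical.

```python
from typing import Dict, List


def plan_rounds(total_partitions: int, requirements: Dict[str, int]) -> List[List[str]]:
    """Group pipelines into rounds; pipelines in the same round run in parallel.

    Uses first-fit packing (an assumption of this sketch): each pipeline joins
    the first round that still has enough free partitions, else starts a new round.
    """
    rounds: List[List[str]] = []  # pipeline names per round
    free: List[int] = []          # partitions still free in each round

    for name, needed in requirements.items():
        if needed > total_partitions:
            # Assumption: a requirement larger than the cluster is rejected.
            raise ValueError(f"{name} needs more partitions than the cluster has")
        for i, available in enumerate(free):
            if needed <= available:
                rounds[i].append(name)
                free[i] -= needed
                break
        else:
            # No existing round has room, so this pipeline starts a new round.
            rounds.append([name])
            free.append(total_partitions - needed)
    return rounds
```

With 10 partitions, five pipelines needing 2 partitions each fit in a single round, as do two pipelines needing 5 each. Three pipelines needing 6, 7, and 8 partitions cannot be paired, so the model places each in its own round, which corresponds to the serial, round-robin case.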

