MemSQL Pipelines is a MemSQL Database feature that natively ingests real-time data from external sources. As a built-in component of the database, Pipelines can extract, transform, and load external data without the need for third-party tools or middleware. Pipelines is robust, scalable, highly performant, and supports fully distributed workloads.
Introduced in MemSQL 5.5, Pipelines currently only supports Kafka data sources. For more information, see Kafka Pipelines Overview.
All database products provide native mechanisms to load data. For example, MemSQL Database can natively load data from a file, from cloud repositories like Amazon S3, or even from other databases. However, modern database workloads require data ingestion from an increasingly large ecosystem of data sources. These sources often use unique protocols or schemas and thus require custom connectivity that must be updated regularly.
The challenges posed by this dynamic ecosystem are often resolved by using middleware – software that knows how to deal with the nuances of each data source and can perform the process of Extract, Transform, and Load (ETL). This ETL process ensures that source data is properly structured and stored in the database.
Most ETL processes are external, third-party systems that integrate with a database; they’re not a component of the database itself. As a result, ETL middleware can introduce additional problems of its own, such as cost, complexity, latency, maintenance, and downtime.
Unlike external middleware, MemSQL Pipelines is a built-in ETL feature of MemSQL Database. Pipelines can be used to extract data from a source, transform that data using arbitrary code, and then load the transformed data into MemSQL Database.
The features of MemSQL Pipelines make it a powerful alternative to third-party ETL middleware in many scenarios:
- Scalability: Pipelines inherently scales with MemSQL clusters as well as distributed data sources like Kafka.
- High Performance: Pipelines data is loaded in parallel from the data source to MemSQL leaves, which improves throughput by bypassing the aggregator. Additionally, Pipelines has been optimized for low lock contention and concurrency.
- Exactly-once Semantics: The architecture of Pipelines ensures that transactions are processed exactly once, even in the event of failover.
- Debugging: Pipelines makes it easier to debug each step in the ETL process by storing exhaustive metadata about transactions, including stack traces and
MemSQL Pipelines is ideal for scenarios where data from a supported source must be ingested and processed in real time. Pipelines is also a good alternative to third-party middleware for basic ETL operations that must be executed as fast as possible. Traditional long-running processes, such as overnight batch jobs, can be eliminated by using Pipelines.
Terms and Concepts
MemSQL Pipelines uses the following terminology to describe core concepts:
- Data Source: A data source is any system that provides a way to access its data. A data source requires a matching extractor that knows how to communicate with it.
- Extractor: An extractor is the core Pipelines component that extracts data from a data source. Extractors provide a connection to the data source by communicating via supported protocols, schemas, or APIs.
- Transform: A transform is a user-defined program that executes arbitrary code to transform extracted data into CSV format. The transformed CSV data is then written into the specified table in the database.
- Pipeline: A pipeline is a conceptual term that represents a single instance of three unified components:
- A connection to a data source,
- The extractor being used, e.g. Kafka Extractor, and
- The transform that is converting the data (Optional)
Supported Data Sources
Apache Kafka is currently the only supported data source.
|Data Source||Data Source Version||MemSQL Version|
|Apache Kafka||0.8.2.2 or newer||5.5.0 or newer|