HDFS Pipelines Overview

MemSQL Pipelines can extract data from HDFS. To understand HDFS core concepts and how it works, please read the Apache Hadoop HDFS documentation. This topic assumes that you understand the basic concepts and terminology of HDFS.

HDFS Pipeline Syntax Example

The following syntax demonstrates how to create a new HDFS Pipeline. For complete syntax documentation, see CREATE PIPELINE.

CREATE PIPELINE `hadoop-cluster`
AS LOAD DATA HDFS 'hdfs://<namenode DNS | IP address>:<port>/<directory>'
INTO TABLE `my_table`
FIELDS TERMINATED BY ','

The directory location must be provided to indicate the directory from which the HDFS extractor will begin pulling data. This could be any directory, such as input or output directory. If you have subdirectories in your target directory, the extractor will recursively walk your directory tree and import additional data, if present.

Output directories have a _SUCCESS file in them, so the example shown below uses an Hadoop client to search all directories for the hadoop user and list those with a _SUCCESS file.

$ hadoop fs -find hdfs://hadoop-namenode:8020/user/hadoop/ -name _SUCCESS -print
hdfs://hadoop-namenode:8020/user/hadoop/output/_SUCCESS
Info

Connection strings cannot start or end with whitespaces and cannot contain wildcard characters (e.g. *).

Files Imported

When extracting output files from HDFS, they will have the following naming conventions if they are the output of Hadoop, Pig, or Spark jobs:

  • part-r-NNNNN
  • part-m-NNNNN
  • part-NNNNN
  • part-NNNN-XXXX-XXXX-XXXX-XXXXX (where X is a postfix ID)

These files are written to an output directory specified during the job execution statement. The part files can also be optionally archived with Gzip. The HDFS extractor will import (and optionally unzip) output files into the table specified in your CREATE PIPELINE statement, but skip any _SUCCESS files present.

HDFS Pipeline Batches

Metadata and configuration information for every pipeline can be found in tables in the information_schema database. These tables use the term batch, which is a standard term across all MemSQL pipelines, but have the following definitions in the context of HDFS pipelines.

When the master aggregator reads a container’s contents, it schedules a subset of the objects for ingest across all database partitions. After each each partition across the cluster has finished extracting, transforming, and loading its object, a batch has been completed. Therefore, an HDFS pipeline batch is defined as a cluster-level operation where each partition processes at most a single object from the source container.

Consider the following example: There are 4 objects in a source container. If your cluster has 2 leaf nodes that have 2 partitions each (4 partitions total), all of the container’s objects can be ingested in 1 batch. In the same cluster, if there are 40 objects in the source container, it will take 10 batches to fully ingest the data.

Information about recent batches can be found in information_schema.PIPELINES_BATCHES_SUMMARY.

Information about files to be loaded can be found in information_schema.PIPELINES_FILES.

Example

The following example creates an HDFS pipeline that extracts and loads data from a simple Hadoop wordcount job and pushes it into a table named messages.

CREATE PIPELINE `hdfs_test`
AS LOAD DATA HDFS 'hdfs://hadoop-namenode:8020/user/hadoop/output'
INTO TABLE `messages`
FIELDS TERMINATED BY '\t';
Was this article useful?