Configuration Files

Replicate uses YAML configuration files to define database connections, filters, maps, and schemas, and to fine-tune replication settings.

Connection

A connection configuration file must contain the minimal set of key-value pairs required to establish a database connection: type, host, port, username, and max-connections. The password key is the one exception and is optional.

The type key is used to specify the database type that Replicate will connect to. Currently, ORACLE and MEMSQL are the only supported source types, while MEMSQL is the only supported destination type.

For Oracle, an additional service-name key-value pair is also required.

The connection configuration format is the same for both the source and destination databases.

File Format

type: [ORACLE || MEMSQL]
host: <host>
port: <port>
service-name: <service-name>  # Oracle only
username: "<username>"
password: "<password>"
max-connections: <number_of_connections>

Examples

Oracle (source only)

type: ORACLE
host: localhost
port: 1521
service-name: IO
username: "replicate"
password: "Replicate#123"
max-connections: 16

MemSQL (source or destination)

type: MEMSQL
host: localhost
port: 3306
username: "replicate"
password: "Replicate#123"
max-connections: 16

Extractor

An extractor configuration file can be used to fine-tune Replicate’s behavior when retrieving data from a source.

An extractor configuration file is typically optional and, if not provided, default values are used. An extractor configuration file may be required in specific cases.

File Format

An extractor configuration file consists of two sections: snapshot and realtime. Note that the configuration under each section is indented relative to its section key.

snapshot:
  <snapshot_configuration>
realtime:
  <realtime_configuration>

Snapshot Configuration

threads: <number_of_threads>
fetch-size-rows: <number_of_rows>
_split-tables: <true_or_false>
min-job-size-rows: <number_of_rows>
max-jobs-per-chunk: <number_of_jobs>

By default, Replicate creates twice the number of threads as there are available CPUs. This default can be changed via the threads key. Generally, it’s advisable to use at least as many threads as there are available destination connections.

Use fetch-size-rows to specify the number of rows to retrieve in each round trip to the database. Note: Reducing this value increases the frequency of round trips, which can impair performance. Increasing this value also significantly increases Replicate’s memory requirements.

When source tables are partitioned, data from each partition or subpartition is read separately. This behavior can be disabled by setting _split-tables to false. The default behavior is preferred unless the tables have a large number of small partitions. As a rule of thumb, the number of rows in each partition and subpartition should be an order of magnitude larger than fetch-size-rows to achieve best performance.
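For example, a snapshot section could disable partition-level reads for tables with many small partitions (the values below are illustrative, not recommendations):

```yaml
snapshot:
  fetch-size-rows: 5_000
  # Tables with many small partitions: read each table as a whole
  # instead of partition by partition
  _split-tables: false
```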

To achieve intra-table parallelization when reading the data from the source, each table is split into multiple jobs. Use min-job-size-rows to prevent the creation of a large number of small jobs. Job size should be at least equal to fetch-size-rows.

Where tables significantly differ in size, use max-jobs-per-chunk to further limit the number of jobs created for each individual table. If tables are partitioned and _split-tables is set to true, the jobs are created at the partition or subpartition level rather than the table level.

Oracle source supports the following additional keys:

parallel-query: <true_or_false>
flashback-query: <true_or_false>
inf-number-behavior: [EXCEPTION || ROUND || CONVERT_TO_NULL] 

Replicate reads data in parallel, and Oracle additionally offers parallel query execution. Set parallel-query to false to disable Oracle’s parallel query execution.

Oracle’s flashback query is used to ensure consistency when reading data from Oracle. However, if a Data Definition Language (DDL) statement is executed on a table, flashback query won’t work for that table and replication will fail. In this case, disable the use of flashback query by setting flashback-query to false. Once disabled, snapshot load consistency is not guaranteed, nor can it be followed by CDC replication.

While infinite numbers can be stored in Oracle’s NUMBER columns, they cannot be accurately replicated to the destination. Replicate exits with an exception when it encounters such values. Set inf-number-behavior to ROUND to represent these values as either rounded to the minimum value (for negative infinity) or maximum value (for positive infinity) as supported by the destination column. Alternatively, set inf-number-behavior to CONVERT_TO_NULL to convert these values to null.
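As a sketch, a snapshot section that tolerates DDL on source tables and converts infinite NUMBER values to null, rather than failing, might look like this:

```yaml
snapshot:
  # Snapshot consistency is no longer guaranteed, and CDC cannot follow
  flashback-query: false
  # Replace positive/negative infinity with null instead of
  # exiting with an exception
  inf-number-behavior: CONVERT_TO_NULL
```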

Realtime Configuration

fetch-size-rows: <number_of_rows>

Similar to snapshot, fetch-size-rows in a realtime configuration can be used to specify how many changes are retrieved from the log with each round trip. By default, this value is set to 0, which means the default driver behavior is used. Regardless of whether this key is set, a fetch size of 1 is used initially, and an adjustment is made only if there is a high volume of activity in the log. When this adjustment is made, either the user-specified value or the driver default is used.

Oracle source supports one additional key:

retention-period-min: <number_of_minutes>

By default, this value is set to 0 (disabled). Use retention-period-min to ignore transactions from the log based on when they were started. Any transaction that was started before current_time - retention-period-min is ignored. Note: This option should only be used when there are issues related to long-running transactions.
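For instance, to ignore any transaction that has been open for more than two hours, the realtime section could be written as follows (the 120-minute window is purely illustrative):

```yaml
realtime:
  fetch-size-rows: 0
  # Skip transactions started more than 120 minutes ago;
  # use only to work around long-running transactions
  retention-period-min: 120
```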

Examples

Oracle

snapshot:
  threads: 16
  fetch-size-rows: 5_000
  _split-tables: true
  min-job-size-rows: 1_000_000
  max-jobs-per-chunk: 32
  inf-number-behavior: EXCEPTION
  flashback-query: true
  parallel-query: true
realtime:
  fetch-size-rows: 0
  retention-period-min: 0

MemSQL

snapshot:
  threads: 16
  fetch-size-rows: 5_000
  _split-tables: true
  min-job-size-rows: 1_000_000
  max-jobs-per-chunk: 32
realtime:
  fetch-size-rows: 0

Applier

An applier configuration file can be used to fine-tune Replicate’s behavior when applying changes to MemSQL.

An applier configuration file is typically optional and, if not provided, default values are used. An applier configuration file may be required in some cases.

File Format

An applier configuration file consists of two sections: snapshot and realtime. Note that the configuration under each section is indented relative to its section key.

snapshot:
  <snapshot_configuration>
realtime:
  <realtime_configuration>

Snapshot Configuration

threads: <number_of_threads>
batch-size-rows: <number_of_rows>
txn-size-rows: <number_of_rows>
bulk-load:
  enable: <true_or_false>
  type: [PIPE || FILE]
  _ignore-duplicates: <true_or_false>
_skip-tables-on-failures: <true_or_false>

By default, Replicate creates twice the number of threads as there are available CPUs. This default can be changed via the threads key. Generally, it’s advisable to use at least as many threads as there are available destination connections.

Use batch-size-rows to set the target size of each batch. Since executing each batch requires a round trip to the database, reducing this value could impair performance. Conversely, using larger batch sizes increases Replicate’s memory requirements, especially if there are a large number of destination connections. Replicate may occasionally execute smaller batches than specified in order to avoid deadlocks. This option takes effect only when data is loaded by executing INSERT statements and is ignored if bulk loading is used.

As there are no individual transactional boundaries during snapshot load, use txn-size-rows to change the target size of each transaction. When bulk loading is used, which offers significantly better performance, txn-size-rows determines how many rows are loaded by each file or pipe. However, this option is mainly intended to handle cases where the destination must process large transactions.

By default, Replicate writes data to MemSQL by executing INSERT statements.

To use bulk loading, set the value of the enable key to true in the bulk-load section. When bulk loading is used, data must first be converted to CSV format and then loaded to the destination.

By default, files are used for bulk loading. Pipes can be used instead by changing the value of the type key in the bulk-load section to PIPE. Using pipes generally offers some benefits over using files.

Set the _ignore-duplicates key to true to skip duplicate rows when bulk loading. Duplicate rows are skipped based on the primary key.
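Putting these keys together, a snapshot section that bulk loads through pipes and skips duplicate rows might be sketched as follows (values are illustrative):

```yaml
snapshot:
  txn-size-rows: 1_000_000   # rows loaded by each pipe
  bulk-load:
    enable: true
    type: PIPE
    # Skip rows whose primary key already exists in the destination
    _ignore-duplicates: true
```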

Set the value of the _skip-tables-on-failures key to true to exclude any table from replication when Replicate encounters an issue creating it or loading data into it. All excluded tables are listed in the data/<instance_name>/error_summary.txt file.

Realtime Configuration

batch-size-rows: <number_of_rows>

Use batch-size-rows to control how many rows are written to the destination with each round trip.

Example

MemSQL

snapshot:
  threads: 16
  batch-size-rows: 5_000
  txn-size-rows: 1_000_000
  bulk-load:
    enable: false
    type: FILE
    _ignore-duplicates: false
  _skip-tables-on-failures: false
realtime:
  batch-size-rows: 100