Join the MemSQL Community Today
Get expert advice, develop skills, and connect with others.

Configuration Files

Replicate uses YAML configuration files to define database connections, filters, maps, and schemas, and to fine-tune replication settings.

Connection

A connection configuration contains a minimal set of key-value pairs required to establish a database connection, including type, host, port, username, and max-connections. The one exception is password, which is optional.

The type field is used to specify the database system that Replicate will connect to. Currently supported source type values are: ORACLE, MEMSQL, MYSQL, POSTGRESQL, MARIADB, and SQLSERVER. The only supported destination type is MEMSQL.

Depending on the specified type, some additional parameters may be available or required.

The connection configuration format is the same for both the source and destination databases.

File Format

type: <type>

host: <host>
port: <port>

username: "<username>"
password: "<password>"

#... [other parameters if any]

max-connections: <number_of_connections>

max-retries: <number_of_retries>
retry-wait-duration-ms: <duration_between_each_retry>

Examples

Oracle (source only)

type: ORACLE

host: localhost
port: 1521

service-name: IO

username: 'replicate'
password: 'Replicate#123'

max-connections: 30

max-retries: 10
retry-wait-duration-ms: 1000

MemSQL (source or destination)

type: MEMSQL

host: localhost
port: 3306

username: "replicate"
password: "Replicate#123"

#credential-store:
#  type: PKCS12
#  path: #Location of key-store
#  key-prefix: "memsql_"
#  password: #If password to key-store is not provided then default password will be used

max-connections: 30

max-retries: 10
retry-wait-duration-ms: 1000

MySQL and MariaDB (source only)

type: MYSQL

host: localhost
port: 3306

username: "replicate"
password: "Replicate#123"

slave-server-ids: [1]
max-connections: 30

max-retries: 10
retry-wait-duration-ms: 1000

PostgreSQL (source only)

type: POSTGRESQL

host: localhost
port: 5432

database: 'tpch'
username: 'replicate'
password: 'Replicate#123'

max-connections: 30

max-retries: 10
retry-wait-duration-ms: 1000

replication-slots:
  io_replicate:
    - wal2json
  io_replicate1:
    - wal2json

SQL Server (source only)

type: SQLSERVER

host: localhost
port: 1433

username: 'replicate'
password: 'Replicate#123'

max-connections: 30

Set Up a Credential Store

Replicate supports consuming username and password configurations from a credential store instead of specifying them in plain text in the config file.

Note: Setting credential-store is optional. Username and password may be specified in plain text if desired.

As an example, assume that a user must specify a username and password to connect to MemSQL. Keystore and encrypted keys can be created for these instead:

echo "<MemSQL_Username>" | keytool -importpass -keystore <keystore_file_name>.jks -storetype pkcs12 -storepass <License_UUID> -alias <key-prefix>username -keypass <License_UUID> -noprompt
echo "<MemSQL_Password>" | keytool -importpass -keystore <keystore_file_name>.jks -storetype pkcs12 -storepass <License_UUID> -alias <key-prefix>password -keypass <License_UUID> -noprompt

To confirm that these values have been saved in the keystore, run the following:

keytool -keystore <keystore_file_name>jks -list
****
Enter keystore password:
Keystore type: PKCS12
Keystore provider: SUN

Your keystore contains 2 entries

<key-prefix>password, 10-Apr-2020, SecretKeyEntry,
<key-prefix>username, 10-Apr-2020, SecretKeyEntry,

Once the keystore has been set up, the details, including type, path, and key-prefix, must be specified in Replicate’s connection config file as shown below:

type: <type>

host: localhost
port: 3306

credential-store:
  type: PKCS12
  path: <path-of-key-store>
  key-prefix: "prefix_"
  password: #If password to key-store is not provided then default password will be used

#... [other parameters if any]

max-connections: 30

max-retries: 10
retry-wait-duration-ms: 1000

Extractor

An extractor configuration file can be used to fine-tune Replicate’s behavior when retrieving data from a source.

An extractor configuration file is typically optional and, if not provided, default values are used. An extractor configuration file may be required in specific cases.

File Format

An extractor configuration file consists of two sections: snapshot and realtime. Note that both sections are indented relative to their keys.

snapshot:
  <snapshot_configuration>
realtime:
  <realtime_configuration>

Snapshot Configuration

  • threads: Maximum number of threads that Replicate should use for data extraction from the source database.
  • fetch-size-rows: Maximum number of records/documents fetched each time by Replicate from the source database. min-job-size-rows: Chunks tables/collections into multiple jobs for replication. This configuration specifies minimum size for each such job. This has a positive correlation with the memory footprint of Replicate.
  • max-jobs-per-chunk: Determines the maximum number of jobs created per source table/collection, split-key: Used by Replicate to split the table/collection into multiple jobs in order to perform parallel extraction. The specified split key column must be of numeric or timestamp type. Source data extraction using this split key provides significant performance improvement if:
    • The split key has uniform data distribution in the table, with minimal data skew in the values of split-key.
    • An index is present on split-key on the source database.
  • per-table-config: Use this section to override certain configurations on a per-table basis.
    • catalog: <catalog-name>
    • schema: <schema-name>
    • tables:
      • <table-name>:
        • max-jobs-per-chunk: Use this to control intra-table parallelism. Set it to 1 if there is no split-key candidate in a given collection/table.
        • split-key: Use this to specify the split key for this table.
        • row-identifier-key: A list of columns which uniquely identify a row in this table. If a table does not have a PK/UK defined, and if such a subset of columns exists in a table which can uniquely identify rows in a table, then it is strongly recommended to specify that subset of columns as a row-identifier-key. Specifying this can provide significant performance improvements in the overall replication of this table.

Oracle source supports the following additional keys:

parallel-query: <true_or_false>
flashback-query: <true_or_false>
inf-number-behavior: [EXCEPTION || ROUND || CONVERT_TO_NULL] 

Replicate reads data in parallel and Oracle offers parallel query execution. Set parallel-query to false to disable this behavior.

While infinite numbers can be stored in Oracle’s NUMBER columns, they cannot be accurately replicated to the destination. Replicate exits with an exception when it encounters such values. Set inf-number-behavior to ROUND to represent these values as either rounded to the minimum value (for negative infinity) or maximum value (for positive infinity) as supported by the destination column. Alternatively, set inf-number-behavior to CONVERT_TO_NULL to convert these values to null.

Realtime Configuration

  • threads: Maximum number of threads to be used by Replicate for real-time extraction.

  • fetch-size-rows: The number of records/documents fetched by Replicate from the source database each time.

  • heartbeat: With this configuration enabled, Replicate provisions a heartbeat table on source database. Replicate periodically updates this table at a configurable frequency. The heartbeat table helps to forcefully flush CDC logs for all committed transactions so that any remaining transactions will be replicated. The heartbeat table must be created in the catalog/schema which is to be replicated via Replicate with the DDL below.

    CREATE TABLE <catalog>.<schema>.replicate_io_cdc_heartbeat(timestamp <data_type_equivalent_to_long>)
    enable: enable heartbeat mechanism
    catalog: catalog of the heartbeat table
    schema: schema of the heartbeat table
    interval-ms: Interval at which heartbeat table should be updated with the latest timestamp ( milliseconds since epoch) by Replicate.
    

Oracle source supports one additional key:

retention-period-min: <number_of_minutes>

By default, this value is set to 0 (disabled). Use retention-period-min to ignore transactions from the log based on when the transactions were started. Any transaction that was started before current_time-retention_period_min is ignored. Note: This option should only be used when there are issues related to long-running transactions.

Examples

Sample files can be found in conf/src/<database-name>.yaml in Replicate’s HOME directory.

Oracle

snapshot:
   threads: 16
   fetch-size-rows: 10_000
   verify-row-count: false
   _fetch-exact-row-count: true
   _traceDBTasks: true
#  inf-number-behavior: EXCEPTION   # EXCEPTION, ROUND, CONVERT_TO_NULL
#  flashback-query: true
#  parallel-query: true
#  fetch-user-roles: true

#   per-table-config:
#   - schema: tpch
# 	tables:
#   	lineitem1:
#     	row-identifier-key: [ORDERKEY, LINENUMBER]
#    	products:
#      	per-partition-config:
#      	- partition-name: SYS_P463
#        	row-count: 1
#      	- partition-name: SYS_P464
#        	row-count: 3

realtime:
  threads: 4
  _traceDBTasks: true
  #fetch-size-rows: 0
  #retention-period-min: 0
  heartbeat:
	enable: true
	schema: "tpch"
	interval-ms: 10000

MemSQL

snapshot:
  #threads: 1
  #fetch-size-rows: 10_000

  min-job-size-rows: 5_000_000
  #max-jobs-per-chunk: 1
  #verify-row-count: false
  _traceDBTasks: true

  per-table-config:
  - catalog: tpch
    tables:
# 	testTable:
#   	  split-key: split-key-column
  	lineitem:
    	  row-identifier-key: [l_orderkey, l_linenumber]
    	  split-key: l_orderkey
#    	  split-hints:
#         row-count-estimate: 15000
#         split-key-min-value: 1
#         split-key-max-value: 60_000

Applier

An applier configuration file can be used to fine-tune Replicate’s behavior when applying changes to MemSQL.

An applier configuration file is typically optional and, if not provided, default values are used. An applier configuration file may be required in some cases.

File Format

An applier configuration file consists of two sections: snapshot and realtime. Note that both sections are indented relative to their keys.

snapshot:
  <snapshot_configuration>
realtime:
  <realtime_configuration>

Snapshot Configuration

  • threads: Maximum number of threads Replicate should use for writing to the target.
  • batch-size-rows: Sets the size of a batch. It is recommended to refer to the value provided in the sample extractor configuration file in the Replicate release. Experimenting with your own deployment to determine an appropriate value is also encouraged.
  • txn-size-rows: Sets the unit of applier-side job size, where a transaction consists of multiple batches.
  • bulk-load: By default, Replicate writes data to MemSQL by executing INSERT statements. To change this, enable bulk loading. Replicate can leverage FILE- or PIPE-based bulk loading into MemSQL. Available values are:
    • enable: true || false
    • type: FILE || PIPE
  • skip-tables-on-failures: When set, Replicates skips a table/collection if an issue is encountered (even with repeated attempts) and continues replicating other tables. table-store: Table store to use: ROW, COLUMN, etc. per-table-config: Sets the various properties for target tables on a per-table basis:
    • table-type: Type of table (REFERENCE in the case of MemSQL).
    • table-store: Table store to use: ROW, COLUMN, etc.
    • sort-key: Sort key to be created for a table (if applicable for target).
    • shard-key: Shard key to be created for a target table.

Realtime Configuration

  • threads: Maximum number of threads Replicate should use for writing to the target.
  • txn-size-rows: Sets the unit of applier-side job size, where a transaction consists of multiple batches.
  • batch-size-rows: Sets the size of a batch.

Example

MemSQL

snapshot:
  threads: 32
#  batch-size-rows: 10_000
#  txn-size-rows: 1_000_000

  bulk-load:
	enable: true
	type: FILE

  skip-tables-on-failures : false
  _traceDBTasks: true
  #table-store: ROW

  per-table-config:
  - catalog: tpch
	tables:
  	MemSQL_orders:
    	  table-store: COLUMN
    	  sort-key: [MemSQL_orderkey]
    	  #shard-key: [c2]
  	partsupp:
    	  #table-type: REGULAR | REFERENCE
    	  table-store: ROW
  	partsupp_macro_delta:
    	  table-store: COLUMN
    	  sort-key: [partkey, suppkey]

  _delete-using-insert-opt: true
  #deferred-delete: true

realtime:
  threads: 8
  txn-size-rows: 10000
  batch-size-rows: 1000
  _oper-queue-size-rows: 20000