
SharePlex Connector for Hadoop 8.5.6 - Installation Guide

conn_snapshot.sh

Use this script to take a fresh copy of an Oracle table for replication.

Shell Script Usage

conn_snapshot.sh -t <TABLE_OWNER.TABLE_NAME> [-f <FILE_TYPE>] [-s <FIELD_SEPARATOR>] [-e <CREATE_EXTERNAL_TABLE>] [-h <HIVE_HOME_DIR>] [-m <NUM_OF_MAPPERS>] [-n <CHANGES_THRESHOLD>] [-i <TIME_THRESHOLD>] [-r] [-d] [-v] [--partition-key <LIST_OF_COLUMNS>] [--compression-codec <CODEC_NAME>] [--help] [--version]

Options

Parameter

Description

-t <TABLE_OWNER.TABLE_NAME>

Name and owner of the table to import from Oracle. Required.

-f <FILE_TYPE>

Applicable to HDFS Near Real Time Replication.

File type for import. [Text|Sequence|Avro] (Default = Text. Use -f Sequence for the Sequence file type and -f Avro for the Avro file type.)

-f Sequence

All data is replicated in Sequence files on HDFS. To read from or write to the Sequence files, you need access to the Writable classes used to generate them. See the SharePlex Connector for Hadoop lib/sqoop-records directory and lib/sqoop-records.jar.

-s <FIELD_SEPARATOR>

The separator between each field/column. The separator must be enclosed in single quotes.

-e <CREATE_EXTERNAL_TABLE>

Copy to Hive. [true|false] (Default = false. Use -e true to enable.)

If true then

  • Copy to Hive over HDFS if a copy is taken for HDFS Near Real Time Replication. Enabling the creation of an external table in Hive over HDFS informs the connector to maintain two versions of the replicated data on HDFS. Versioned data ensures that HDFS replication does not interfere with Hive queries, and merging operations (Map jobs) work without any side effects on Hive queries.
  • Copy to Hive over HBase if a copy is taken for HBase Real Time Replication.

-h <HIVE_HOME_DIR>

Path to the Hive home directory.

If not specified, the value of the HIVE_HOME environment variable is used. If neither this option nor the HIVE_HOME environment variable is set, the path is derived relative to HADOOP_HOME.
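For example, the Hive home directory can be supplied either way (the path below is illustrative only; substitute your actual Hive installation directory):

```shell
# Option 1: pass the Hive home directory explicitly (path is an example).
./conn_snapshot.sh -t Schema.Table -h /usr/lib/hive

# Option 2: export HIVE_HOME before running the script.
export HIVE_HOME=/usr/lib/hive
./conn_snapshot.sh -t Schema.Table
```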

-m <NUM_OF_MAPPERS>

The number of mappers to be used.

-n <CHANGES_THRESHOLD>

Use to override the default setting for how often SharePlex Connector for Hadoop replicates the table (measured by the number of changes to the table).

  • Applicable to HDFS Near Real Time Replication.
  • The default setting was set in conn_setup.sh.
  • Replication is executed on the first condition met: on the given number of changes to the table or the set time period -i, whichever comes first.
  • SharePlex Connector for Hadoop remembers this setting and makes use of it during further execution of conn_snapshot.

-i <TIME_THRESHOLD>

Use to override the default setting for how often SharePlex Connector for Hadoop replicates the table (measured by the number of minutes).

  • Applicable to HDFS Near Real Time Replication.
  • The default setting was set in conn_setup.sh.
  • Replication is executed on the first condition met: on the given number of changes to the table -n or the set time period, whichever comes first.
  • SharePlex Connector for Hadoop remembers this setting and makes use of it during further execution of conn_snapshot.

-r, -d

Use to override the settings in conn_setup.sh. If neither option is specified, this Oracle table is replicated using HBase and/or HDFS as per the settings in conn_setup.sh.

-r

A copy of the table is taken for HBase Real Time Replication.

Do not replicate this Oracle table using HDFS.

This overrides the settings in conn_setup.sh.

-d

A copy of the table is taken for HDFS Near Real Time Replication.

Do not replicate this Oracle table using HBase.

This overrides the settings in conn_setup.sh.

-r -d

A copy of the table is taken for HBase Real Time Replication and HDFS Near Real Time Replication.

This overrides the settings in conn_setup.sh.

SharePlex Connector for Hadoop remembers these settings and makes use of them during further execution of conn_snapshot.

-v

Verbose - Show detailed information of each step.

--partition-key <LIST_OF_COLUMNS>

Applicable to HDFS Near Real Time Replication. Use to provide a list of columns (along with a range, if required) on which table data should be partitioned on HDFS. The order of columns in this list determines the directory structure on HDFS.

--compression-codec <CODEC_NAME>

Applicable to HDFS Near Real Time Replication with the Avro file format. Use to specify the compression codec for the Avro file format. Currently only the snappy and deflate codecs are supported. Using a compression codec may degrade the performance of Hive queries.
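As an illustrative sketch (the column names below are hypothetical, and the exact range syntax for --partition-key should be checked against your version), partitioning and compression might be combined as:

```shell
# Avro import with snappy compression, partitioned by the listed columns.
# REGION and SALE_DATE are hypothetical column names; the directory
# structure on HDFS follows the column order given.
./conn_snapshot.sh -t Schema.Table -f Avro \
    --partition-key REGION,SALE_DATE \
    --compression-codec snappy
```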

--help

Show this help and exit.

--version

Show version information and exit.

Example

[user@host bin]$ ./conn_snapshot.sh -t Schema.Table -s ';'

Use Cases

 

Take a copy of the Oracle table Schema.Table for replication over HDFS and/or HBase as per the settings in conn_setup.sh.

conn_snapshot.sh -t Schema.Table -s ';'

As above and ...

  • Copy to Hive over HDFS if a copy is taken for HDFS Near Real Time Replication.
  • Copy to Hive over HBase if a copy is taken for HBase Real Time Replication.

conn_snapshot.sh -t Schema.Table -s ';' -e true

Take a copy of the Oracle table Schema.Table for replication over HBase. Do not replicate over HDFS.

conn_snapshot.sh -t Schema.Table -s ';' -r

Take a copy of the Oracle table Schema.Table for replication over HDFS. Do not replicate over HBase. Replication is set for every 20 minutes or 100 changes, whichever comes first.

conn_snapshot.sh -t Schema.Table -s ';' -d -i 20 -n 100

conn_cdc.sh

Use this script to start capturing change history data for insert, update, delete and before update operations on HDFS for a specified table name.

Note: For the conn_cdc script to run, the change history feature must be enabled in the setup script. Capturing history data for a table starts only when you execute the conn_cdc script for that table.

Shell Script Usage

conn_cdc.sh -t <TABLE_OWNER.TABLE_NAME> [-i] [-u] [-d] [-b] [--help] [--version]

Options

Parameter

Description

-t <TABLE_OWNER.TABLE_NAME>

Name and owner of the table to import from Oracle. Required.

-i

If specified, enables change history support for insert operations.

-u

If specified, enables change history support for update operations.

-d

If specified, enables change history support for delete operations.

-b

If specified, enables change history support for before update values.

--help

Show this help and exit.

--version

Show version information and exit.

Example

[user@host bin]$ ./conn_cdc.sh -t Schema.Table

Use Cases

 
If none of the parameters [-i] [-u] [-d] [-b] is provided, all captured changes (insert, update, and delete) are recorded by default. Before update values are not recorded.

conn_cdc.sh -t Schema.Table

Capture change data for insert operations only. The other operations are not recorded.

conn_cdc.sh -t Schema.Table -i

Capture change data for update and delete operations only. Insert and before update operations are not recorded.

conn_cdc.sh -t Schema.Table -u -d

Capture change data for update and before update operations. The other operations are not recorded.

conn_cdc.sh -t Schema.Table -u -b

Note: Execute the conn_cdc script again for the same table if you want to change which operations are captured.

 

Accessing captured history data written on HDFS

History data written on HDFS for a particular table can be accessed in two ways:

  1. Hive external tables pointing to history data on HDFS.

    SharePlex Connector for Hadoop creates external tables in Hive for the change history feature. The external Hive tables are created in a Hive database with the same name as the Oracle table owner/schema. Each external Hive table points to the data stored for that table in the change history HDFS destination directory.

    SharePlex Connector for Hadoop maps Oracle data types to appropriate data types in the external Hive table as shown below. This allows you to use the built-in Hive UDFs available for the specific Hive data types.

    Oracle data type                Hive data type
    CHAR                            STRING
    VARCHAR2                        STRING
    NCHAR                           STRING
    NVARCHAR2                       STRING
    NUMBER                          DOUBLE
    FLOAT                           DOUBLE
    DATE                            TIMESTAMP
    TIMESTAMP                       TIMESTAMP
    TIMESTAMP WITH TIME ZONE        STRING
    TIMESTAMP WITH LOCAL TIME ZONE  STRING
    INTERVAL YEAR TO MONTH          STRING
    INTERVAL DAY TO SECOND          STRING
    ROWID                           STRING

    Note: If the data type of any column in the Oracle table is modified, the Hive data type for that column is mapped to STRING.
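    As a sketch, assuming the external table is created in a Hive database named after the Oracle schema and carries a name derived from the Oracle table (the exact name the connector generates may differ; check your Hive database), the change history can be queried with standard HiveQL:

```shell
# Query the change history external table from the Hive CLI.
# "schema_name" and "table_name" are placeholders for illustration;
# substitute the database and table the connector actually created.
hive -e "SELECT * FROM schema_name.table_name LIMIT 10;"
```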

  2. Custom Map-Reduce programs

    SharePlex Connector for Hadoop generates change-data-record classes (Java utility classes) per table, which custom Map-Reduce programs can use to read the history data stored on HDFS. The change-data-record classes are generated under the lib directory and are also added to the lib/change-data-records.jar file.
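    A custom Map-Reduce job can put the generated record classes on its classpath with the standard Hadoop -libjars option. The jar name, main class, and HDFS paths below are hypothetical, for illustration only:

```shell
# Run a custom Map-Reduce job against the change history data,
# adding the generated record classes to the job classpath.
# my-cdc-reader.jar, com.example.CdcReader, and both paths are
# placeholder names; substitute your own job and directories.
hadoop jar my-cdc-reader.jar com.example.CdcReader \
    -libjars lib/change-data-records.jar \
    /shareplex/changehistory/SCHEMA.TABLE /tmp/cdc-output
```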

conn_ctrl.sh

Use this script to start, restart, or shut down SharePlex Connector for Hadoop, and to reload SharePlex Connector for Hadoop configurations.

Shell Script Usage

conn_ctrl.sh [start|restart|shutdown|reload-conf|--help|--version|--status]

Options

Parameter

Description

start

Start SharePlex Connector for Hadoop.

restart

Stop and start SharePlex Connector for Hadoop.

shutdown

Stop SharePlex Connector for Hadoop.

reload-conf

Reload some of the SharePlex Connector for Hadoop configuration settings without restarting SharePlex Connector for Hadoop. These settings include:

  • the HDFS Near Real Time Replication changes threshold and time interval.

--help

Show this help and exit.

--version

Show version information and exit.

--status

Show the status (started / shutdown) of SharePlex Connector for Hadoop.

Example

[user@host bin]$ ./conn_ctrl.sh
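Typical invocations, following the usage and options above:

```shell
./conn_ctrl.sh start        # start the connector
./conn_ctrl.sh --status     # show whether the connector is started or shut down
./conn_ctrl.sh reload-conf  # reload the HDFS NRT thresholds without a restart
./conn_ctrl.sh shutdown     # stop the connector
```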

conn_monitor.sh

Use this script to show a list of the tables being replicated and to monitor the status of a table under replication.

Shell Script Usage

conn_monitor.sh [--summary|--status <TABLE_OWNER.TABLE_NAME> [--hdfs|--hbase|--cdc]|--help|--version]

Options

Parameter

Description

--summary

Show a summary of the tables under replication.

--status <TABLE_OWNER.TABLE_NAME> [--hdfs|--hbase|--cdc]

Show the status of the given table.

Include the --hdfs option (optional) to limit the display to HDFS Near Real Time Replication.

Include the --hbase option (optional) to limit the display to HBase Real Time Replication.

Include the --cdc option (optional) to limit the display to the change history (formerly Change Data Capture) feature.

--help

Show this help and exit.

--version

Show version information and exit.
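Typical invocations, following the usage and options above:

```shell
./conn_monitor.sh --summary                     # summary of all tables under replication
./conn_monitor.sh --status Schema.Table         # full status for one table
./conn_monitor.sh --status Schema.Table --hdfs  # limit to HDFS Near Real Time Replication
```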
