
SharePlex Connector for Hadoop 8.5.6 - Installation Guide

Configure Numerous Partitions

Creating a large number of partitions on HDFS using Sqoop import may fail because resources become unavailable; for example, the number of open files may exceed a set limit. The following configuration changes may reduce the chance of failure, although none of them guarantees support for very large numbers of partitions.

Use Sqoop-Oraoop import to increase the number of mappers

Use the -m parameter in the conn_snapshot.sh script to request more mappers. Sqoop spawns as many mappers as available resources permit, but does not guarantee that the number specified with -m will be reached.

Example:

./conn_snapshot.sh -t qauser.sensor_data --partition-key time_stamp,device_id -e true -m 100

This example sets the number of mappers to 100 for the sqoop-oraoop job during the snapshot.

Increase the limit on the number of open files

Increase the hard and soft limits on the number of open files on each node in the Hadoop cluster. The exact configuration steps may differ between Linux distributions.

  1. Edit the /etc/security/limits.conf file to append the following lines:

    * soft nofile 20000
    * hard nofile 50000

    Choose these limits based on the specifications of your hardware. The soft limit value cannot exceed the hard limit value.

  2. Edit the /etc/pam.d/login file to append the following line:

    session required <path to pam_limits.so>

  3. Restart the Hadoop cluster.
  4. If you use Cloudera, edit the cmf-agent file to modify the value of the ulimit parameter.

  5. Restart the Cloudera Hadoop cluster.

  6. Restart Cloudera Manager.
  7. Verify that the limits are properly set (see the example after this list):

    1. Issue sudo ps aux | grep datanode to get the PID of the DataNode process.
    2. Issue sudo cat /proc/PID_Of_Datanode/limits and check the Max open files entry to verify the new setting.
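
For example, on a DataNode host the verification might look like the following sketch (the grep pattern and the PID 12345 are illustrative placeholders):

# Find the PID of the DataNode process.
sudo ps aux | grep datanode

# Inspect that process's limits; substitute the PID found above for 12345.
sudo grep "Max open files" /proc/12345/limits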

Configure Hive to support numerous partitions

  1. Add the following properties to the hive-site.xml file; a sample XML fragment follows these steps. You may need to adjust the values based on the cluster specification.

    Hive Property                                       Value
    hive.server.thrift.socket.timeout                   1000
    hive.client.thrift.socket.timeout                   1000
    hive.metastore.client.socket.timeout                1000
    hive.server.read.socket.timeout                     1000
    hive.server.write.socket.timeout                    1000
    hive.client.read.socket.timeout                     1000
    hive.client.write.socket.timeout                    10000
    hive.metastore.batch.retrieve.table.partition.max   1000
  2. Restart HiveServer.
  3. Restart the Hive metastore.
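
Each property goes into hive-site.xml as a standard Hadoop configuration entry. A minimal sketch for the first two properties follows; the remaining properties use the same pattern:

<property>
  <name>hive.server.thrift.socket.timeout</name>
  <value>1000</value>
</property>
<property>
  <name>hive.client.thrift.socket.timeout</name>
  <value>1000</value>
</property>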

Use Cases

Set Up and Start Replication

Use case: Set Up and Start Replication

Note: Ensure you complete all Initial Setup instructions first.

1. Start SharePlex for Oracle and sp_ctrl

Ensure SharePlex for Oracle and sp_ctrl are running. The prompt should be sp_ctrl (host:port)>. Refer to the SharePlex for Oracle documentation for more information.

/u01/app/shareplex/prod/bin > ./sp_ctrl

sp_ctrl ()>

2. Define the Oracle tables to replicate/capture change data

Use the SharePlex for Oracle create config command to create the file ConfigFile. The file is opened in vi. Declare all the Oracle tables you want captured into Hadoop, one table per line.

sp_ctrl ()> create config ConfigFile

#######################

datasource: O.OracleSID

OracleSchema.OracleTable !jms[:TargetSchema.TargetTable] IPHostPostQueue[:PostQueueName]

#####################

Example line: soo70.G_AUTHORS !jms 10.20.26.28:q2

IPHostPostQueue is the name or IP address of the host on which the SharePlex post queue is running.

For more information on PostQueueName see Configure ActiveMQ to work with SharePlex.

TIP: To verify there are no errors in the config file, run the command sp_ctrl ()> verify config ConfigFile.
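
Putting the pieces together, a complete ConfigFile for two tables might look like the following sketch (OracleSID is a placeholder for your Oracle SID, and the second table, soo70.G_BOOKS, is a hypothetical addition to the example above):

datasource: O.OracleSID

soo70.G_AUTHORS !jms 10.20.26.28:q2
soo70.G_BOOKS !jms 10.20.26.28:q2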

3. Stop the post queue

SharePlex for Oracle uses the post queue to send messages to the JMS queue.

sp_ctrl ()> stop post

4. Run activate config

Use the SharePlex for Oracle activate config command to activate the file ConfigFile.

sp_ctrl ()> activate config ConfigFile

If you see the error "minimal supplemental logging should be enabled," see Prepare the Oracle Data Source.

5. SharePlex™ Connector for Hadoop® - Run conn_snapshot.sh and/or conn_cdc.sh

 

I) Run conn_snapshot.sh

To enable HDFS Near Real Time Replication or HBase Real Time Replication, execute the SharePlex Connector for Hadoop conn_snapshot.sh script.

Specify an Oracle table to replicate. This makes a copy of that Oracle table. Note: Only the HDFS Near Real Time Replication feature supports replication of tables that do not have a primary key.

If you declared more than one Oracle table in ConfigFile in step 2, run conn_snapshot.sh only for the tables that are added for replication.

Note: Take a snapshot of the tables that are added for replication before you start the post queue.

The conn_snapshot.sh script is fully customizable; see conn_snapshot.sh for complete documentation.

conn_snapshot.sh -t Schema.Table -s ';'

You will be prompted to enter the Oracle password. This is the password for the Oracle username supplied during configuration. For more information, see Run conn_setup.sh.
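
For example, using the hypothetical table from the partitioning example earlier in this guide:

./conn_snapshot.sh -t qauser.sensor_data -s ';'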

 

II) Run conn_cdc.sh

To start capturing change data on HDFS for an Oracle table, execute the SharePlex Connector for Hadoop conn_cdc.sh script.

If you declared more than one Oracle table in ConfigFile in step 2, run conn_cdc.sh only for the tables that are added for capturing change data.

Note: Run the conn_cdc.sh script before you start the post queue.

The conn_cdc.sh script is fully customizable; see conn_cdc.sh for complete documentation.

The script captures change data on HDFS for insert, update, and delete operations.

conn_cdc.sh -t Schema.Table
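
For example, to capture change data for the same hypothetical table used in the snapshot example:

./conn_cdc.sh -t qauser.sensor_data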

6. SharePlex for Oracle - Start the post queue

Start the post queue so SharePlex for Oracle can send messages from the post queue to the JMS queue.

sp_ctrl ()> start post

7. Start SharePlex Connector for Hadoop

Return to SharePlex Connector for Hadoop. For more on the conn_ctrl.sh command see conn_ctrl.sh.

conn_ctrl.sh start

Replication Paused or Data Inconsistent (Out of Sync)

Use Case: Replication Paused or Data Inconsistent (Out of Sync)

SharePlex Connector for Hadoop compares the values stored in HBase / HDFS with the lookup values received from SharePlex. If they do not match, a data inconsistency is reported on the console and in the shareplex-connector-alert.log file.

REPLICATION PAUSED

Scenarios that may lead to data inconsistency where SharePlex Connector for Hadoop pauses replication of a table include:

  • For HDFS Near Real Time replication, the entire merge job on Hadoop fails. Replication of the table is paused because subsequent changes would be inconsistent.
  • Changes to the table schema (an ALTER operation).

Data Inconsistent (Out of Sync) - Take a fresh snapshot

Scenarios that may lead to data inconsistency include: receiving a delete for a row that does not exist, receiving an update for a deleted row, or receiving an insert for a row that has already been inserted.

SharePlex Connector for Hadoop suggests you take a new snapshot of the table.

Follow these instructions.

1. SharePlex for Oracle - Stop the post queue

SharePlex for Oracle uses the post queue to send messages to the JMS queue. Stop the post queue before you take a snapshot.

TIP: Run ./sp_ctrl from /u01/app/shareplex/prod/bin to open the sp_ctrl ()> prompt.

sp_ctrl ()> stop post


2. SharePlex Connector for Hadoop - Run conn_snapshot.sh

In SharePlex Connector for Hadoop, run the conn_snapshot.sh script for the Oracle table associated with the REPLICATION PAUSED or DATA_INCONSISTENT message. This makes a fresh copy of that Oracle table.

This script is fully documented in conn_snapshot.sh.

conn_snapshot.sh -t Schema.Table -s ';'

Note: You will be prompted to enter the Oracle password. This is the password for the Oracle user name supplied during configuration. For more information, see Run conn_setup.sh.

3. SharePlex for Oracle - Start the post queue

Return to SharePlex for Oracle. Start the post queue so SharePlex for Oracle can send messages from the post queue to the JMS queue.

sp_ctrl ()> start post
