SharePlex Connector for Hadoop 8.5.5 - SharePlex Connector for Hadoop Installation Guide

Configure Custom Partitioning

Custom Partitioning allows you to create a partition for each unique value of the Oracle column that is used for partitioning. For each unique value in the column, a directory is created on HDFS. The partition directory names take the form:

partition_columnName=value

where:

  • columnName is the name of the column.
  • value is the value on which partitioning is done.

Note: For the Date/Timestamp datatype, the partition value contains only the date; the time portion is ignored.

The directory hierarchy on HDFS depends on the order of the columns passed to the --partition-key option in the conn_snapshot script. (For more information on the conn_snapshot.sh command, refer to conn_snapshot.sh.)
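The order matters. For example (illustrative, using the SENSOR_DATA table from the example that follows), the command

./conn_snapshot.sh -t qauser.sensor_data --partition-key time_stamp,device_id -e true

creates directories of the form /baseDirectory/partition_time_stamp=<value>/partition_device_id=<value>/, while reversing the column order

./conn_snapshot.sh -t qauser.sensor_data --partition-key device_id,time_stamp -e true

creates directories of the form /baseDirectory/partition_device_id=<value>/partition_time_stamp=<value>/.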

Example of Custom Partitioning

The following table SENSOR_DATA owned by qauser is the basis for this partitioning example.

  • run_id - number (primary key)
  • device_id - number
  • partition_id - number
  • time_stamp - timestamp
  • start_time - timestamp
  • event_name - varchar
  • event_type - varchar
  • alarm_id - varchar
  • alarm_code - varchar

Create the partitions

The following command creates custom partitions based on the time_stamp column, sub-partitioned by the device_id column.

./conn_snapshot.sh -t qauser.sensor_data --partition-key time_stamp,device_id -e true

The HDFS directory looks as follows:

/baseDirectory/partition_time_stamp=1990-02-10/partition_device_id=101/

/baseDirectory/partition_time_stamp=1990-02-11/partition_device_id=102/

/baseDirectory/partition_time_stamp=1991-03-14/partition_device_id=102/

The Hive table schema looks as follows:

  • run_id - double
  • device_id - double
  • partition_id - double
  • time_stamp - string
  • start_time - string
  • event_name - string
  • event_type - string
  • alarm_id - string
  • alarm_code - string
  • partition_time_stamp - string
  • partition_device_id - double

SharePlex Connector for Hadoop has created extra columns (partition_time_stamp and partition_device_id) in the Hive table schema: one for each partition column specified for conn_snapshot.
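As an illustration only (the exact DDL generated by SharePlex Connector for Hadoop may differ), the resulting Hive external table is conceptually similar to the following, with the partition columns declared in the PARTITIONED BY clause rather than in the column list:

CREATE EXTERNAL TABLE sensor_data (
  run_id double,
  device_id double,
  partition_id double,
  time_stamp string,
  start_time string,
  event_name string,
  event_type string,
  alarm_id string,
  alarm_code string
)
PARTITIONED BY (partition_time_stamp string, partition_device_id double)
LOCATION '/baseDirectory';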

Issue queries

This example supports the following queries:

  • select * from sensor_data where partition_time_stamp = '1990-02-10';

    This outputs records for the date 1990-02-10.

  • select * from sensor_data where partition_time_stamp = '1990-02-10' and partition_device_id = 102;

    This outputs records for the date 1990-02-10 whose device_id is 102.

  • select * from sensor_data where partition_device_id > 100 and partition_device_id < 200;

    This outputs records for all devices whose device_id is between 100 and 200.

    Note: If the filters in a Hive query are on partition columns, Hive does not trigger a Map-Reduce job. Otherwise, it runs a Map-Reduce job to retrieve the data.
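If you want to confirm this behavior on your cluster (assuming access to the Hive CLI or Beeline), you can inspect the query plan with EXPLAIN:

EXPLAIN select * from sensor_data where partition_time_stamp = '1990-02-10';

When the filter is on a partition column, the plan typically shows a simple fetch over the matching partition directory rather than a Map-Reduce stage.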

Configure Range Partitioning

Range partitioning allows you to create partitions based on ranges of the values in a column. For each range interval, a separate directory is created on HDFS. To create a range, you specify a list of the columns to be used for partitioning, along with the interval for each range. SharePlex Connector for Hadoop supports only Number and Date/Timestamp datatypes for range partitioning.

The naming convention for directories on HDFS is the same as that of Custom Partitioning, but the partition name depends on the interval used.

The following list shows the supported datatypes and the interval format to use for each.

  • Number: Any positive integer. The interval is numerical; the range value is not polarity based.
  • Date/Timestamp: Any of the string constants Y, M, or D, where Y stands for Year, M for Month, and D for Day.

If the interval format is specified as Y for a column with the TimeStamp datatype, then SharePlex Connector for Hadoop sets the value of Month and Day to 01 when creating a partition for that column value.

Similarly, if the interval format is specified as M for a column with a TimeStamp datatype, SharePlex Connector for Hadoop sets the value of Day to 01 when creating a partition for that column value.

For example, if a range partition is created on a time_stamp column with an interval of Y and a record has the value 1990-02-08 12:00:02, then the partition value for that record will be partition_time_stamp=1990-01-01.

Note: For the TimeStamp datatype, the time part of the value is ignored when creating the partition.
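To make the mapping concrete, the following illustrative values (derived from the examples in this section) show how column values map to partition values:

time_stamp = 1990-02-08 12:00:02, interval Y   ->  partition_time_stamp=1990-01-01
time_stamp = 1990-02-08 12:00:02, interval M   ->  partition_time_stamp=1990-02-01
device_id = 7,  interval 10                    ->  partition_device_id=10
device_id = 15, interval 10                    ->  partition_device_id=20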

Example of Range Partitioning

The following table SENSOR_DATA owned by qauser is the basis for this partitioning example.

  • run_id - number (primary key)
  • device_id - number
  • partition_id - number
  • time_stamp - timestamp
  • start_time - timestamp
  • event_name - varchar
  • event_type - varchar
  • alarm_id - varchar
  • alarm_code - varchar

Create the partitions

The following command creates range partitions based on the time_stamp column with an interval of Year (Y), sub-partitioned by the device_id column with an interval of 10.

./conn_snapshot.sh -t qauser.sensor_data --partition-key time_stamp[Y],device_id[10]

The HDFS directory looks as follows:

/baseDirectory/partition_time_stamp=1990-01-01/partition_device_id=10/

/baseDirectory/partition_time_stamp=1991-01-01/partition_device_id=20/

/baseDirectory/partition_time_stamp=1992-01-01/partition_device_id=30/

The Hive table schema looks as follows:

  • run_id - double
  • device_id - double
  • partition_id - double
  • time_stamp - string
  • start_time - string
  • event_name - string
  • event_type - string
  • alarm_id - string
  • alarm_code - string
  • partition_time_stamp - string
  • partition_device_id - double

SharePlex Connector for Hadoop has created extra columns (partition_time_stamp and partition_device_id) in the Hive table schema: one for each partition column specified for conn_snapshot.

Issue queries

This example supports the following queries:

  • select * from sensor_data where partition_time_stamp = '1990-01-01';

    This outputs records for the year 1990.

  • select * from sensor_data where partition_time_stamp = '1990-01-01' and partition_device_id = 10;

    This outputs records for the year 1990 whose device_id is between 0 and 10.

  • select * from sensor_data where partition_device_id = 20;

    This outputs records whose device_id is between 11 and 20, irrespective of their dates.

Create Hive External Tables on Partitioned Data

SharePlex Connector for Hadoop can create a Hive external table for partitioned data. SharePlex Connector for Hadoop adds extra columns in the Hive table for each column that is used for partitioning. These column names are prepended with the partition_ string.

To create a Hive external table on partitioned data

Specify the -e true option with the --partition-key option in the conn_snapshot script.

For example, if you take a snapshot using device_id as the partition column, then SharePlex Connector for Hadoop will create a Hive external table on that data with an extra column that is named partition_device_id. SharePlex Connector for Hadoop does this to differentiate between actual and partition columns in the Hive table definition.
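For example, the following command (using the SENSOR_DATA table from the earlier examples) takes a snapshot partitioned by device_id and creates the corresponding Hive external table, which then contains the extra column partition_device_id:

./conn_snapshot.sh -t qauser.sensor_data --partition-key device_id -e true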

Note: You can specify these partition columns in a query for faster data retrieval. Hive does not trigger a Map-Reduce job when partition columns are used as filters in Hive queries.

Configure Numerous Partitions

Creating a large number of partitions on HDFS using Sqoop import may fail due to the unavailability of resources. For example, the number of open files may exceed a set limit. The following configuration changes may reduce the chance of failure, although none of them guarantee support for very large numbers of partitions.

Use Sqoop-Oraoop import to increase the number of mappers

Use the -m parameter in the conn_snapshot.sh script. Sqoop spawns as many mappers as available resources permit, but does not guarantee that the number specified with -m will be reached.

Example:

./conn_snapshot.sh -t qauser.sensor_data --partition-key time_stamp,device_id -e true -m 100

This example configures the number of mappers to 100 for the sqoop-oraoop job during the snapshot.

Increase the limit on the number of open files

Increase the hard and soft limits on the number of open files on each node in the Hadoop cluster. The following configuration steps may differ between Linux distributions.

  1. Edit the /etc/security/limits.conf file to append the following lines:

    * soft nofile 20000
    * hard nofile 50000

    Decide these limits based on the specifications of the hardware. The soft limit value cannot exceed the hard limit value.

  2. Edit the /etc/pam.d/login file to append the following line:

    session required <path to pam_limits.so>

  3. Restart the Hadoop cluster.
  4. Edit the cmf-agent file of Cloudera to modify the value of the ulimit parameter.

  5. Restart the Cloudera Hadoop cluster.

  6. Restart the Cloudera manager.
  7. Verify that the limits are properly set (a session-level check is also shown after this list):

    1. Issue sudo ps aux | grep datanode to get the PID of the datanode.
    2. Issue sudo cat /proc/PID_Of_Datanode/limits and view the Max open files entry to verify the new setting.
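In addition to the per-process check above, you can confirm the limits that apply to a new login session with the ulimit shell builtin:

ulimit -Sn
ulimit -Hn

These print the current soft and hard limits on the number of open files, which should match the values set in /etc/security/limits.conf.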

Configure Hive to support numerous partitions

  1. Add the following properties to the hive-site.xml file (a sample hive-site.xml entry is shown after these steps). You may need to adjust the values based on the cluster specification.

    Hive Property                                          Value
    hive.server.thrift.socket.timeout                      1000
    hive.client.thrift.socket.timeout                      1000
    hive.metastore.client.socket.timeout                   1000
    hive.server.read.socket.timeout                        1000
    hive.server.write.socket.timeout                       1000
    hive.client.read.socket.timeout                        1000
    hive.client.write.socket.timeout                       10000
    hive.metastore.batch.retrieve.table.partition.max      1000
  2. Restart the HiveServer.
  3. Restart the Hive metastore.
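For reference, each of the properties above is declared in the hive-site.xml file using the standard Hadoop configuration format, one <property> element per entry. For example:

<property>
  <name>hive.metastore.client.socket.timeout</name>
  <value>1000</value>
</property>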