
SharePlex Connector for Hadoop 8.5.6 - Installation Guide

About Partitions on HDFS

About Support for Partitions in HDFS Replication

SharePlex Connector for Hadoop supports partitioning of data for the HDFS replication feature. The --partition-key parameter of the snapshot script specifies the column name(s) of the Oracle table for which partitions are to be created on HDFS.

SharePlex Connector for Hadoop supports both custom and range partitioning.

Note: Partitioning is supported only for the Text and Avro file formats. It is not supported for the Sequence file format.
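In general, the snapshot command takes the form shown below (a sketch; angle brackets mark placeholders, and the full set of options is described in conn_snapshot.sh):

./conn_snapshot.sh -t <owner>.<table_name> --partition-key <comma-separated column list>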

Partitioning support includes:

Configure Custom Partitioning

Configure Range Partitioning

Create Hive External Tables on Partitioned Data


Configure Custom Partitioning

Custom partitioning allows you to create a partition for each unique value of the Oracle column(s) used for partitioning. For each unique value in a column, a directory is created on HDFS. The names of the partition directories take the form:

partition_columnName=value

where:

  • columnName is the name of the column.
  • value is the value on which partitioning is done.

Note: For the Date/Timestamp datatype, the partition value contains only the date. The time portion is ignored.

The hierarchy of the directories on HDFS depends on the order of the columns provided to the --partition-key option in the conn_snapshot script. (For more information on the conn_snapshot.sh command, refer to conn_snapshot.sh.)
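For example, reversing the column order reverses the directory nesting. The following paths are illustrative only; the base directory depends on your configuration:

--partition-key time_stamp,device_id produces directories such as:

/baseDirectory/partition_time_stamp=1990-02-10/partition_device_id=101/

--partition-key device_id,time_stamp produces directories such as:

/baseDirectory/partition_device_id=101/partition_time_stamp=1990-02-10/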

Example of Custom Partitioning

The following table, SENSOR_DATA, owned by qauser, is the basis for this partitioning example.

  • run_id - number (primary key)
  • device_id - number
  • partition_id - number
  • time_stamp - timestamp
  • start_time - timestamp
  • event_name - varchar
  • event_type - varchar
  • alarm_id - varchar
  • alarm_code - varchar

Create the partitions

The following command creates a custom partition based on the time_stamp column, sub-partitioned by the device_id column.

./conn_snapshot.sh -t qauser.sensor_data --partition-key time_stamp,device_id -e true

The HDFS directory looks as follows:

/baseDirectory/partition_time_stamp=1990-02-10/partition_device_id=101/

/baseDirectory/partition_time_stamp=1990-02-11/partition_device_id=102/

/baseDirectory/partition_time_stamp=1991-03-14/partition_device_id=102/

The Hive table schema looks as follows:

  • run_id - double
  • device_id - double
  • partition_id - double
  • time_stamp - string
  • start_time - string
  • event_name - string
  • event_type - string
  • alarm_id - string
  • alarm_code - string
  • partition_time_stamp - string
  • partition_device_id - double

SharePlex Connector for Hadoop has created extra columns (partition_time_stamp and partition_device_id) in the Hive table schema: one for each partition column specified for conn_snapshot.
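For reference, the table definition generated for this example is conceptually equivalent to the following HiveQL. This is a sketch only; the actual DDL produced by SharePlex Connector for Hadoop, including the row format, storage format, and location, may differ:

CREATE EXTERNAL TABLE sensor_data (
    run_id DOUBLE,
    device_id DOUBLE,
    partition_id DOUBLE,
    time_stamp STRING,
    start_time STRING,
    event_name STRING,
    event_type STRING,
    alarm_id STRING,
    alarm_code STRING)
PARTITIONED BY (partition_time_stamp STRING, partition_device_id DOUBLE)
STORED AS TEXTFILE
LOCATION '/baseDirectory';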

Issue queries

This example supports the following queries:

  • select * from sensor_data where partition_time_stamp='1990-02-10';

    This outputs records for the date 1990-02-10.

  • select * from sensor_data where partition_time_stamp='1990-02-10' and partition_device_id=102;

    This outputs records for the date 1990-02-10 whose device_id is 102.

  • select * from sensor_data where partition_device_id > 100 and partition_device_id < 200;

    This outputs records for all devices whose device_id is between 100 and 200.

    Note: If filters in Hive queries are on partition columns, Hive does not trigger a Map-Reduce job. Otherwise, it runs a Map-Reduce job to retrieve data.


Configure Range Partitioning

Range partitioning allows you to create partitions based on ranges of the values in a column. For each range interval, a separate directory is created on HDFS. To create a range, you specify a list of the columns to be used for partitioning, along with the interval for each range. SharePlex Connector for Hadoop supports only Number and Date/Timestamp datatypes for range partitioning.

The naming convention for directories on HDFS is the same as that of Custom Partitioning, but the partition name depends on the interval used.

The following table shows the supported datatypes and the interval format to be used for each one.

Datatype: Number
Interval format: Any positive integer.
Description: A numerical interval. The range value is not polarity based.

Datatype: Date/TimeStamp
Interval format: Any of the Y, M, or D string constants, where Y is for YEAR, M is for MONTH, and D is for DAY.

If the interval format is specified as Y for a column with the TimeStamp datatype, then SharePlex Connector for Hadoop sets the value of Month and Day to 01 when creating a partition for that column value.

Similarly, if the interval format is specified as M for a column with a TimeStamp datatype, SharePlex Connector for Hadoop sets the value of Day to 01 when creating a partition for that column value.

For example, if a range partition is created on the time_stamp column with an interval of Y and a record has a column value of 1990-02-08 12:00:02, then the partition value for that record will be partition_time_stamp=1990-01-01.

Note: For the TimeStamp datatype, the time part of the value is ignored when creating the partition.
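To illustrate, a hypothetical time_stamp value of 1990-02-08 12:00:02 maps to the following partition values, depending on the interval format:

Interval Y: partition_time_stamp=1990-01-01
Interval M: partition_time_stamp=1990-02-01
Interval D: partition_time_stamp=1990-02-08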

Example of Range Partitioning

The following table, SENSOR_DATA, owned by qauser, is the basis for this partitioning example.

  • run_id - number (primary key)
  • device_id - number
  • partition_id - number
  • time_stamp - timestamp
  • start_time - timestamp
  • event_name - varchar
  • event_type - varchar
  • alarm_id - varchar
  • alarm_code - varchar

Create the partitions

The following command creates a range partition based on the time_stamp column with a Year (Y) interval, sub-partitioned by the device_id column with an interval of 10 per partition.

./conn_snapshot.sh -t qauser.sensor_data --partition-key time_stamp[Y],device_id[10]

The HDFS directory looks as follows:

/baseDirectory/partition_time_stamp=1990-01-01/partition_device_id=10/

/baseDirectory/partition_time_stamp=1991-01-01/partition_device_id=20/

/baseDirectory/partition_time_stamp=1992-01-01/partition_device_id=30/

The Hive table schema looks as follows:

  • run_id - double
  • device_id - double
  • partition_id - double
  • time_stamp - string
  • start_time - string
  • event_name - string
  • event_type - string
  • alarm_id - string
  • alarm_code - string
  • partition_time_stamp - string
  • partition_device_id - double

SharePlex Connector for Hadoop has created extra columns (partition_time_stamp and partition_device_id) in the Hive table schema: one for each partition column specified for conn_snapshot.

Issue queries

This example supports the following queries:

  • select * from sensor_data where partition_time_stamp='1990-01-01';

    This outputs records for the year 1990.

  • select * from sensor_data where partition_time_stamp='1990-01-01' and partition_device_id='10';

    This outputs records for the year 1990 whose device_id is between 0 and 10.

  • select * from sensor_data where partition_device_id='20';

    This outputs records whose device_id is between 11 and 20, irrespective of their dates.
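Because the partition values are stored as strings in date format, you can also filter across a range of partitions. For example (illustrative):

select * from sensor_data where partition_time_stamp >= '1990-01-01' and partition_time_stamp < '1992-01-01';

This outputs records for the years 1990 and 1991.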


Create Hive External Tables on Partitioned Data

SharePlex Connector for Hadoop can create a Hive external table for partitioned data. SharePlex Connector for Hadoop adds an extra column in the Hive table for each column that is used for partitioning. These column names are prefixed with the string partition_.

To create a Hive external table on partitioned data

Specify the -e true option with the --partition-key option in the conn_snapshot script.

For example, if you take a snapshot using device_id as the partition column, then SharePlex Connector for Hadoop will create a Hive external table on that data with an extra column that is named partition_device_id. SharePlex Connector for Hadoop does this to differentiate between actual and partition columns in the Hive table definition.
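For example, the following command (illustrative, using the sample SENSOR_DATA table from the earlier examples) takes a snapshot partitioned by device_id and creates the corresponding Hive external table:

./conn_snapshot.sh -t qauser.sensor_data --partition-key device_id -e true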

Note: You can specify these partition columns in a query for faster data retrieval. Hive does not trigger a Map-Reduce job when partition columns are used as filters in Hive queries.
