
SharePlex Connector for Hadoop 8.5.6 - Installation Guide

About Partitions on HDFS

About Support for Partitions in HDFS Replication

SharePlex Connector for Hadoop supports partitioning of data for the HDFS replication feature. The --partition-key parameter of the snapshot script specifies the column name(s) of the Oracle table for which partitions are to be created on HDFS.

SharePlex Connector for Hadoop supports both custom and range partitioning.

Note: Partitioning is supported only for the Text and Avro file formats. It is not supported for the Sequence file format.
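In general, the snapshot command takes the form shown below (a sketch; angle brackets mark placeholders, and the full set of options is described in conn_snapshot.sh):

./conn_snapshot.sh -t <owner>.<table_name> --partition-key <comma-separated column list>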

Partitioning support includes:

Configure Custom Partitioning

Configure Range Partitioning

Create Hive External Tables on Partitioned Data


Configure Custom Partitioning

Custom partitioning allows you to create a partition for each unique value of the Oracle column(s) used for partitioning. For each unique value in a column, a directory is created on HDFS. The names of the partition directories take the form:

partition_columnName=value

where:

  • columnName is the name of the column.
  • value is the value on which partitioning is done.

Note: For the Date/Timestamp datatype, the partition value contains only the date. The time portion is ignored.

The hierarchy of the directories on HDFS depends on the order of the columns provided to the --partition-key option in the conn_snapshot script. (For more information on the conn_snapshot.sh command, refer to conn_snapshot.sh.)
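For example, reversing the column order reverses the directory nesting. The following paths are illustrative only; the base directory depends on your configuration:

--partition-key time_stamp,device_id produces directories such as:

/baseDirectory/partition_time_stamp=1990-02-10/partition_device_id=101/

--partition-key device_id,time_stamp produces directories such as:

/baseDirectory/partition_device_id=101/partition_time_stamp=1990-02-10/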

Example of Custom Partitioning

The following table, SENSOR_DATA, owned by qauser, is the basis for this partitioning example.

  • run_id - number (primary key)
  • device_id - number
  • partition_id - number
  • time_stamp - timestamp
  • start_time - timestamp
  • event_name - varchar
  • event_type - varchar
  • alarm_id - varchar
  • alarm_code - varchar

Create the partitions

The following command creates a custom partition based on the time_stamp column, sub-partitioned by the device_id column.

./conn_snapshot.sh -t qauser.sensor_data --partition-key time_stamp,device_id -e true

The HDFS directory looks as follows:

/baseDirectory/partition_time_stamp=1990-02-10/partition_device_id=101/

/baseDirectory/partition_time_stamp=1990-02-11/partition_device_id=102/

/baseDirectory/partition_time_stamp=1991-03-14/partition_device_id=102/

The Hive table schema looks as follows:

  • run_id - double
  • device_id - double
  • partition_id - double
  • time_stamp - string
  • start_time - string
  • event_name - string
  • event_type - string
  • alarm_id - string
  • alarm_code - string
  • partition_time_stamp - string
  • partition_device_id - double

SharePlex Connector for Hadoop has created extra columns (partition_time_stamp and partition_device_id) in the Hive table schema: one for each partition column specified for conn_snapshot.
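For reference, the table definition generated for this example is conceptually equivalent to the following HiveQL. This is a sketch only; the actual DDL produced by SharePlex Connector for Hadoop, including the row format, storage format, and location, may differ:

CREATE EXTERNAL TABLE sensor_data (
    run_id DOUBLE,
    device_id DOUBLE,
    partition_id DOUBLE,
    time_stamp STRING,
    start_time STRING,
    event_name STRING,
    event_type STRING,
    alarm_id STRING,
    alarm_code STRING)
PARTITIONED BY (partition_time_stamp STRING, partition_device_id DOUBLE)
STORED AS TEXTFILE
LOCATION '/baseDirectory';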

Issue queries

This example supports the following queries:

  • select * from sensor_data where partition_time_stamp='1990-02-10';

    This outputs records for the date 1990-02-10.

  • select * from sensor_data where partition_time_stamp='1990-02-10' and partition_device_id=102;

    This outputs records for the date 1990-02-10 whose device_id is 102.

  • select * from sensor_data where partition_device_id > 100 and partition_device_id < 200;

    This outputs records for all devices whose device_id is between 100 and 200.

    Note: If filters in Hive queries are on partition columns, Hive does not trigger a Map-Reduce job. Otherwise, it runs a Map-Reduce job to retrieve data.


Configure Range Partitioning

Range partitioning allows you to create partitions based on ranges of the values in a column. For each range interval, a separate directory is created on HDFS. To create a range, you specify a list of the columns to be used for partitioning, along with the interval for each range. SharePlex Connector for Hadoop supports only Number and Date/Timestamp datatypes for range partitioning.

The naming convention for directories on HDFS is the same as that of Custom Partitioning, but the partition name depends on the interval used.

The following table shows the supported datatypes and the interval format to be used for each one.

Datatype: Number
Interval format: Any positive integer.
Description: A numerical interval. The range value is not polarity based.

Datatype: Date/TimeStamp
Interval format: Any of the Y, M, or D string constants, where Y is for YEAR, M is for MONTH, and D is for DAY.

If the interval format is specified as Y for a column with the TimeStamp datatype, then SharePlex Connector for Hadoop sets the value of Month and Day to 01 when creating a partition for that column value.

Similarly, if the interval format is specified as M for a column with a TimeStamp datatype, SharePlex Connector for Hadoop sets the value of Day to 01 when creating a partition for that column value.

For example, if a range partition is created on the time_stamp column with an interval of Y and a record has a column value of 1990-02-08 12:00:02, then the partition value for that record will be partition_time_stamp=1990-01-01.

Note: For the TimeStamp datatype, the time part of the value is ignored when creating the partition.
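To illustrate, a hypothetical time_stamp value of 1990-02-08 12:00:02 maps to the following partition values, depending on the interval format:

Interval Y: partition_time_stamp=1990-01-01
Interval M: partition_time_stamp=1990-02-01
Interval D: partition_time_stamp=1990-02-08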

Example of Range Partitioning

The following table, SENSOR_DATA, owned by qauser, is the basis for this partitioning example.

  • run_id - number (primary key)
  • device_id - number
  • partition_id - number
  • time_stamp - timestamp
  • start_time - timestamp
  • event_name - varchar
  • event_type - varchar
  • alarm_id - varchar
  • alarm_code - varchar

Create the partitions

The following command creates a range partition based on the time_stamp column with a Year (Y) interval, sub-partitioned by the device_id column with an interval of 10 per partition.

./conn_snapshot.sh -t qauser.sensor_data --partition-key time_stamp[Y],device_id[10]

The HDFS directory looks as follows:

/baseDirectory/partition_time_stamp=1990-01-01/partition_device_id=10/

/baseDirectory/partition_time_stamp=1991-01-01/partition_device_id=20/

/baseDirectory/partition_time_stamp=1992-01-01/partition_device_id=30/

The Hive table schema looks as follows:

  • run_id - double
  • device_id - double
  • partition_id - double
  • time_stamp - string
  • start_time - string
  • event_name - string
  • event_type - string
  • alarm_id - string
  • alarm_code - string
  • partition_time_stamp - string
  • partition_device_id - double

SharePlex Connector for Hadoop has created extra columns (partition_time_stamp and partition_device_id) in the Hive table schema: one for each partition column specified for conn_snapshot.

Issue queries

This example supports the following queries:

  • select * from sensor_data where partition_time_stamp='1990-01-01';

    This outputs records for the year 1990.

  • select * from sensor_data where partition_time_stamp='1990-01-01' and partition_device_id='10';

    This outputs records for the year 1990 whose device_id is between 0 and 10.

  • select * from sensor_data where partition_device_id='20';

    This outputs records whose device_id is between 11 and 20, irrespective of their dates.
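Because the partition values are stored as strings in date format, you can also filter across a range of partitions. For example (illustrative):

select * from sensor_data where partition_time_stamp >= '1990-01-01' and partition_time_stamp < '1992-01-01';

This outputs records for the years 1990 and 1991.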


Create Hive External Tables on Partitioned Data

SharePlex Connector for Hadoop can create a Hive external table for partitioned data. SharePlex Connector for Hadoop adds an extra column in the Hive table for each column that is used for partitioning. These column names are prefixed with the string partition_.

To create a Hive external table on partitioned data

Specify the -e true option with the --partition-key option in the conn_snapshot script.

For example, if you take a snapshot using device_id as the partition column, then SharePlex Connector for Hadoop will create a Hive external table on that data with an extra column that is named partition_device_id. SharePlex Connector for Hadoop does this to differentiate between actual and partition columns in the Hive table definition.
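For example, the following command (illustrative, using the sample SENSOR_DATA table from the earlier examples) takes a snapshot partitioned by device_id and creates the corresponding Hive external table:

./conn_snapshot.sh -t qauser.sensor_data --partition-key device_id -e true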

Note: You can specify these partition columns in a query for faster data retrieval. Hive does not trigger a Map-Reduce job when partition columns are used as filters in Hive queries.
