SharePlex Connector for Hadoop supports the partitioning of data for the HDFS replication feature. SharePlex Connector for Hadoop takes the --partition-key parameter of the snapshot script, which is used to specify the column name(s) of an Oracle table for which partition(s) are to be created on HDFS.
SharePlex Connector for Hadoop supports both custom and range partitioning.
Note: Partitioning is only supported for the Text and Avro file formats. Partitioning is not supported for the Sequence file format.
Partitioning support includes:
Custom Partitioning
Range Partitioning
Create Hive External Tables on Partitioned Data
Custom Partitioning allows you to create a partition for each unique value of an Oracle column that is used for partitioning. For each unique value in a column, a directory is created on HDFS. The name of the partition directories takes the form of:
partition_columnName=value
where:
partition_columnName is the name of the partitioning column, prefixed with the partition_ string
value is the unique value of that column for the partition
Note: For the Date/Timestamp datatype, the partition value contains only the date. The time portion of the timestamp is ignored.
The hierarchy of a directory on HDFS depends on the order of the columns provided as a parameter to the --partition-key option in the conn_snapshot script. (For more information on the conn_snapshot.sh command, refer to conn_snapshot.sh).
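As a rough illustration of the rules above, the following Python sketch (a hypothetical helper, not part of the connector) shows how a nested partition path could be derived from the ordered list of partition columns:

```python
from datetime import datetime

def partition_path(base_dir, partition_cols, record):
    """Build the nested HDFS directory path for one record (illustrative only)."""
    parts = []
    for col in partition_cols:  # order matters: the first column is the top directory
        value = record[col]
        # For Date/Timestamp values, only the date part is kept in the partition name.
        if isinstance(value, datetime):
            value = value.date().isoformat()
        parts.append(f"partition_{col}={value}")
    return "/".join([base_dir] + parts)

path = partition_path(
    "/baseDirectory",
    ["time_stamp", "device_id"],
    {"time_stamp": datetime(1990, 2, 10, 12, 0, 2), "device_id": 101},
)
print(path)  # /baseDirectory/partition_time_stamp=1990-02-10/partition_device_id=101
```

Swapping the column order in the list would reverse the directory hierarchy, which is why the order passed to --partition-key matters.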
The following table, SENSOR_DATA, owned by qauser, is the basis for this partitioning example.
run_id - number (primary key)
device_id - number
partition_id - number
time_stamp - timestamp
start_time - timestamp
event_name - varchar
event_type - varchar
alarm_id - varchar
alarm_code - varchar
Create the partitions
The following command creates a custom partition based on the time_stamp column and sub-partitioned by the device_id column.
./conn_snapshot.sh -t qauser.sensor_data --partition-key time_stamp,device_id -e true
The HDFS directory looks as follows:
/baseDirectory/partition_time_stamp=1990-02-10/partition_device_id=101/
/baseDirectory/partition_time_stamp=1990-02-11/partition_device_id=102/
/baseDirectory/partition_time_stamp=1991-03-14/partition_device_id=102/
The Hive table schema looks as follows:
run_id - double
device_id - double
partition_id - double
time_stamp - string
start_time - string
event_name - string
event_type - string
alarm_id - string
alarm_code - string
partition_time_stamp - string
partition_device_id - double
SharePlex Connector for Hadoop has created extra columns (partition_time_stamp and partition_device_id) in the Hive table schema: one for each partition column specified for conn_snapshot.
Issue queries
This example supports the following queries:
select * from sensor_data where partition_time_stamp='1990-02-10';
This outputs records for the date 1990-02-10.
select * from sensor_data where partition_time_stamp='1990-02-10' and partition_device_id=102;
This outputs records for the date 1990-02-10 whose device_id is 102.
select * from sensor_data where partition_device_id > 100 and partition_device_id < 200;
This outputs records for all devices whose device_id is between 100 and 200.
Note: If the filters in a Hive query are on partition columns, Hive does not trigger a Map-Reduce job. Otherwise, it runs a Map-Reduce job to retrieve the data.
Range partitioning allows you to create partitions based on ranges of the values in a column. For each range interval, a separate directory is created on HDFS. To create a range, you specify a list of the columns to be used for partitioning, along with the interval for each range. SharePlex Connector for Hadoop supports only Number and Date/Timestamp datatypes for range partitioning.
The naming convention for directories on HDFS is the same as that of Custom Partitioning, but the partition name depends on the interval used.
The following table shows the supported datatypes and the interval format to be used for each one.

Datatype | Interval Format | Description
---|---|---
Number | Any positive integer | A numerical interval. The range value is not polarity based.
Date/TimeStamp | Any of the string constants Y, M, or D, where Y is for YEAR, M is for MONTH, and D is for DAY | If the interval format is specified as Y for a column with the TimeStamp datatype, then SharePlex Connector for Hadoop sets the values of Month and Day to 01 when creating a partition for that column value. Similarly, if the interval format is specified as M, SharePlex Connector for Hadoop sets the value of Day to 01. For example, for a range partition on a time_stamp column with an interval of Y and a column value of 1990-02-08 12:00:02, the partition value for that record is partition_time_stamp=1990-01-01. Note: For the TimeStamp datatype, the time part of the value is ignored when creating the partition.
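The interval rules above can be sketched in Python; this is an illustrative approximation, where the function names and the exact bucketing of Number ranges are assumptions inferred from the query examples later in this section, not the connector's actual code:

```python
import math
from datetime import datetime

def number_partition_value(value, interval):
    # Assumed bucketing: with interval 10, values 1..10 fall under partition 10
    # and 11..20 under partition 20, matching the query examples in this section.
    return math.ceil(value / interval) * interval

def date_partition_value(ts, interval):
    # Y truncates month and day to 01; M truncates day to 01; D keeps the full date.
    # The time part of a timestamp is always ignored.
    if interval == "Y":
        return f"{ts.year:04d}-01-01"
    if interval == "M":
        return f"{ts.year:04d}-{ts.month:02d}-01"
    return f"{ts.year:04d}-{ts.month:02d}-{ts.day:02d}"

print(date_partition_value(datetime(1990, 2, 8, 12, 0, 2), "Y"))  # 1990-01-01
print(number_partition_value(7, 10))  # 10
```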
The following table, SENSOR_DATA, owned by qauser, is the basis for this partitioning example.
run_id - number (primary key)
device_id - number
partition_id - number
time_stamp - timestamp
start_time - timestamp
event_name - varchar
event_type - varchar
alarm_id - varchar
alarm_code - varchar
Create the partitions
The following command creates a range partition based on the time_stamp column with an interval of Year and sub-partitioned by the device_id column with an interval of 10 per partition.
./conn_snapshot.sh -t qauser.sensor_data --partition-key time_stamp[Y],device_id[10]
The HDFS directory looks as follows:
/baseDirectory/partition_time_stamp=1990-01-01/partition_device_id=10/
/baseDirectory/partition_time_stamp=1991-01-01/partition_device_id=20/
/baseDirectory/partition_time_stamp=1992-01-01/partition_device_id=30/
The Hive table schema looks as follows:
run_id - double
device_id - double
partition_id - double
time_stamp - string
start_time - string
event_name - string
event_type - string
alarm_id - string
alarm_code - string
partition_time_stamp - string
partition_device_id - double
SharePlex Connector for Hadoop has created extra columns (partition_time_stamp and partition_device_id) in the Hive table schema: one for each partition column specified for conn_snapshot.
Issue queries
This example supports the following queries:
select * from sensor_data where partition_time_stamp ='1990-01-01';
This outputs records for the year 1990.
select * from sensor_data where partition_time_stamp='1990-01-01' and partition_device_id=10;
This outputs records for the year 1990 whose device_id is between 1 and 10.
select * from sensor_data where partition_device_id=20;
This outputs records whose device_id is between 11 and 20, irrespective of their dates.
SharePlex Connector for Hadoop can create a Hive external table for partitioned data. SharePlex Connector for Hadoop adds extra columns in the Hive table for each column that is used for partitioning. These column names are prepended with the partition_ string.
To create a Hive external table on partitioned data
Specify the -e true option along with the --partition-key option in the conn_snapshot script.
For example, if you take a snapshot using device_id as the partition column, then SharePlex Connector for Hadoop will create a Hive external table on that data with an extra column that is named partition_device_id. SharePlex Connector for Hadoop does this to differentiate between actual and partition columns in the Hive table definition.
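The schema transformation can be pictured with a small sketch (a hypothetical helper, not the connector's code): for each partition key, one extra column with the partition_ prefix is appended to the table's column list.

```python
def hive_columns(table_cols, partition_keys):
    # One extra partition_ column is appended per partition key, so the Hive
    # table definition can distinguish actual columns from partition columns.
    return table_cols + [f"partition_{key}" for key in partition_keys]

cols = hive_columns(["run_id", "device_id", "time_stamp"], ["device_id"])
print(cols)  # ['run_id', 'device_id', 'time_stamp', 'partition_device_id']
```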
Note: You can specify these partition columns in a query for faster data retrieval. Hive does not trigger a Map-Reduce job when partition columns are used as filters in Hive queries.
© 2024 Quest Software Inc. ALL RIGHTS RESERVED.