Custom Partitioning allows you to create a partition for each unique value of an Oracle column that is used for partitioning. For each unique value in a column, a directory is created on HDFS. The name of the partition directories takes the form of:
partition_columnName=value
where columnName is the name of the Oracle column used for partitioning and value is a unique value from that column.
Note: For the Date/Timestamp datatype, the partition value contains only the date; the time portion is ignored.
The hierarchy of a directory on HDFS depends on the order of the columns provided as a parameter to the --partition-key option in the conn_snapshot script. (For more information on the conn_snapshot.sh command, refer to conn_snapshot.sh).
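For example, reversing the column order reverses the directory nesting. The following sketch assumes the sensor_data example used later in this section and a configured HDFS base directory of /baseDirectory:
./conn_snapshot.sh -t qauser.sensor_data --partition-key device_id,time_stamp -e true
/baseDirectory/partition_device_id=101/partition_time_stamp=1990-02-10/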
The following table SENSOR_DATA owned by qauser is the basis for this partitioning example.
run_id - number (primary key)
device_id - number
partition_id - number
time_stamp - timestamp
start_time - timestamp
event_name - varchar
event_type - varchar
alarm_id - varchar
alarm_code - varchar
Create the partitions
The following command creates a custom partition based on the time_stamp column and sub-partitioned by the device_id column.
./conn_snapshot.sh -t qauser.sensor_data --partition-key time_stamp,device_id -e true
The HDFS directory looks as follows:
/baseDirectory/partition_time_stamp=1990-02-10/partition_device_id=101/
/baseDirectory/partition_time_stamp=1990-02-11/partition_device_id=102/
/baseDirectory/partition_time_stamp=1991-03-14/partition_device_id=102/
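To inspect the generated layout, you can list the partition directories recursively with a standard HDFS command (a sketch; substitute your configured base directory for /baseDirectory):
hdfs dfs -ls -R /baseDirectory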
The Hive table schema looks as follows:
run_id - double
device_id - double
partition_id - double
time_stamp - string
start_time - string
event_name - string
event_type - string
alarm_id - string
alarm_code - string
partition_time_stamp - string
partition_device_id - double
SharePlex Connector for Hadoop has created extra columns (partition_time_stamp and partition_device_id) in the Hive table schema: one for each partition column specified for conn_snapshot.
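You can confirm the partition columns and partition values from the Hive command line, for example (a sketch assuming the external table was created as sensor_data in the current database):
describe sensor_data;
show partitions sensor_data;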
Issue queries
This example supports the following queries:
select * from sensor_data where partition_time_stamp ='1990-02-10';
This outputs records for the date 1990-02-10.
select * from sensor_data where partition_time_stamp ='1990-02-10' and partition_device_id =102;
This outputs records for the date 1990-02-10 whose device_id is 102.
select * from sensor_data where partition_device_id > 100 and partition_device_id < 200;
This outputs records for all of the devices whose device_id is between 100 and 200.
Note: If filters in Hive queries are on partition columns, Hive does not trigger a Map-Reduce job. Otherwise, it runs a Map-Reduce job to retrieve data.
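One way to check whether a Map-Reduce job will be triggered is to prefix a query with explain, for example:
explain select * from sensor_data where partition_time_stamp ='1990-02-10';
When the filter is only on partition columns, the plan should not contain a Map-Reduce stage.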
Range partitioning allows you to create partitions based on ranges of the values in a column. For each range interval, a separate directory is created on HDFS. To create a range, you specify a list of the columns to be used for partitioning, along with the interval for each range. SharePlex Connector for Hadoop supports only Number and Date/Timestamp datatypes for range partitioning.
The naming convention for directories on HDFS is the same as that of Custom Partitioning, but the partition name depends on the interval used.
The following table shows the supported datatypes and the interval format to be used for each one.
| Datatype | Interval Format | Description |
|---|---|---|
| Number | Any positive integer | A numerical interval. The range value is not polarity based. |
| Date/TimeStamp | Any of the Y, M, or D string constants, where Y is for YEAR, M is for MONTH, and D is for DAY | If the interval format is specified as Y for a column with the TimeStamp datatype, then SharePlex Connector for Hadoop sets the value of Month and Day to 01 when creating a partition for that column value. Similarly, if the interval format is specified as M for a column with the TimeStamp datatype, SharePlex Connector for Hadoop sets the value of Day to 01 when creating a partition for that column value. For example, to create a range partition on a time_stamp column with an interval of Y, assuming a column value of 1990-02-08 12:00:02, the partition value for that record will be partition_time_stamp=1990-01-01. |
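Based on the partition boundaries described in this example, individual column values map to partition values roughly as follows (the device_id values are hypothetical and shown only for illustration):
device_id = 5 with an interval of 10 -> partition_device_id=10
device_id = 15 with an interval of 10 -> partition_device_id=20
time_stamp = 1990-02-08 12:00:02 with an interval of Y -> partition_time_stamp=1990-01-01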
The following table SENSOR_DATA owned by qauser is the basis for this partitioning example.
run_id - number (primary key)
device_id - number
partition_id - number
time_stamp - timestamp
start_time - timestamp
event_name - varchar
event_type - varchar
alarm_id - varchar
alarm_code - varchar
Create the partitions
The following command creates a range partition based on the time_stamp column with an interval of Year and sub-partitioned by the device_id column with an interval of 10 per partition.
./conn_snapshot.sh -t qauser.sensor_data --partition-key time_stamp[Y],device_id[10]
The HDFS directory looks as follows:
/baseDirectory/partition_time_stamp=1990-01-01/partition_device_id=10/
/baseDirectory/partition_time_stamp=1991-01-01/partition_device_id=20/
/baseDirectory/partition_time_stamp=1992-01-01/partition_device_id=30/
The Hive table schema looks as follows:
run_id - double
device_id - double
partition_id - double
time_stamp - string
start_time - string
event_name - string
event_type - string
alarm_id - string
alarm_code - string
partition_time_stamp - string
partition_device_id - double
SharePlex Connector for Hadoop has created extra columns (partition_time_stamp and partition_device_id) in the Hive table schema: one for each partition column specified for conn_snapshot.
Issue queries
This example supports the following queries:
select * from sensor_data where partition_time_stamp ='1990-01-01';
This outputs records for the year 1990.
select * from sensor_data where partition_time_stamp ='1990-01-01' and partition_device_id='10';
This outputs records for the year 1990 whose device_id is between 0 and 10.
select * from sensor_data where partition_device_id='20';
This outputs records whose device_id is between 11 and 20, irrespective of their dates.
SharePlex Connector for Hadoop can create a Hive external table for partitioned data. SharePlex Connector for Hadoop adds extra columns in the Hive table for each column that is used for partitioning. These column names are prepended with the partition_ string.
To create a Hive external table on partitioned data
Specify the -e true option with the --partition-key option in the conn_snapshot script.
For example, if you take a snapshot using device_id as the partition column, then SharePlex Connector for Hadoop will create a Hive external table on that data with an extra column that is named partition_device_id. SharePlex Connector for Hadoop does this to differentiate between actual and partition columns in the Hive table definition.
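For example, a snapshot partitioned only by device_id with a Hive external table could be taken as follows (a sketch based on the conn_snapshot options shown earlier in this section):
./conn_snapshot.sh -t qauser.sensor_data --partition-key device_id -e true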
Note: You can specify these partition columns in a query for faster data retrieval. Hive does not trigger a Map-Reduce job when partition columns are used as filters in Hive queries.
Creating a large number of partitions on HDFS using Sqoop import may fail due to the unavailability of resources; for example, the number of open files may exceed a set limit. The following configuration changes may reduce the chance of failure, although none of them guarantees support for very large numbers of partitions.
Use Sqoop-Oraoop import to increase the number of mappers
Use the -m parameter in the conn_snapshot.sh script. Sqoop spawns as many mappers as available resources permit, but does not guarantee that the -m number will be reached.
Example:
./conn_snapshot.sh -t qauser.sensor_data --partition-key time_stamp,device_id -e true -m 100
This example configures the number of mappers to 100 for the sqoop-oraoop job during the snapshot.
Increase the limit on the number of open files
Increase the hard and soft limits on the number of open files on each node in the Hadoop cluster. The exact configuration steps may differ between Linux distributions.
Edit the /etc/security/limits.conf file to append the following lines:
* soft nofile 20000
* hard nofile 50000
Decide these limits based on the specifications of the hardware. The soft limit value cannot exceed the hard limit value.
Edit the /etc/pam.d/login file to append the following lines:
session required <path to pam_limits.so>
Edit the cmf-agent file of Cloudera to modify the value of the ulimit parameter.
Restart the Cloudera Hadoop cluster.
Verify that the limits are properly set:
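For example, you can check the soft and hard limits of the current session with the standard ulimit built-in (run as the user that owns the Hadoop processes):
ulimit -Sn
ulimit -Hn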
Configure Hive to support numerous partitions
Add the following properties to the hive-site.xml file. You may need to adjust the values based on the cluster specification.
| Hive Property | Value |
|---|---|
| hive.server.thrift.socket.timeout | 1000 |
| hive.client.thrift.socket.timeout | 1000 |
| hive.metastore.client.socket.timeout | 1000 |
| hive.server.read.socket.timeout | 1000 |
| hive.server.write.socket.timeout | 1000 |
| hive.client.read.socket.timeout | 1000 |
| hive.client.write.socket.timeout | 10000 |
| hive.metastore.batch.retrieve.table.partition.max | 1000 |
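Each property is added to hive-site.xml as a standard property element, for example (a sketch showing one of the properties above):
<property>
  <name>hive.metastore.client.socket.timeout</name>
  <value>1000</value>
</property>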