Requirements

Hive 2.0.X

Support data type

Numeric Types

Type	Supported
TINYINT	true
SMALLINT	true
INTEGER	true
FLOAT	true
DOUBLE	true
DECIMAL	false

Date/Time Types

Type	Supported
TIMESTAMP	false
DATE	false
INTERVAL	false

String Types

Type	Supported
STRING	true
VARCHAR	false
CHAR	false

Misc Types

Type	Supported
BOOLEAN	true
BINARY	true

Complex Types

Type	Supported
ARRAYS	true
MAPS	true
STRUCT	true
UNION	true

Support vectorization

Supports primitive type only

Support pushdown

It corresponds to both predicate pushdown and pushdown filter. However, as for pushdown filter, column binaryization is pluggable, so it is different whether binary is supported.

jars

jp_dataplatform_config.jar
jp_dataplatform_schema_common.jar
jp_dataplatform_schema_hive.jar
jp_dataplatform_schema_jackson.jar
dataplatform_mds_common.jar
dataplatform_mds_hive.jar
dataplatform_mds_schema.jar

Setup Hive

See Apache Hive

Add jars

add jar jp_dataplatform_config.jar;
add jar jp_dataplatform_schema_common.jar;
add jar jp_dataplatform_schema_hive.jar;
add jar jp_dataplatform_schema_jackson.jar;
add jar dataplatform_mds_common.jar;
add jar dataplatform_mds_hive.jar;
add jar dataplatform_mds_schema.jar;

Create table

create table sample(
  name string,
  price bigint
)
ROW FORMAT SERDE
  'jp.co.yahoo.dataplatform.mds.hadoop.hive.MDSSerde'
STORED AS INPUTFORMAT
  'jp.co.yahoo.dataplatform.mds.hadoop.hive.io.MDSHiveLineInputFormat'
OUTPUTFORMAT
  'jp.co.yahoo.dataplatform.mds.hadoop.hive.io.MDSHiveParserOutputFormat'

Make complicated data correspond to vectorization

Supports array expansion

Supports flattening Map and Struct types

Array expansion and flattening

TBLPROPERTIES Bug

MR job that MR does not run can not acquire MapWork, so the setting will not be valid.

I want to know how to acquire MapWork.

Code

Base table

CREATE EXTERNAL TABLE mds_t(
  `unix_timestamp` bigint,
  `f_map` map<string,string>,
  `f_array` array<struct<key1:string>>
)
ROW FORMAT SERDE
  'jp.co.yahoo.dataplatform.mds.hadoop.hive.MDSSerde'
STORED AS INPUTFORMAT
  'jp.co.yahoo.dataplatform.mds.hadoop.hive.io.MDSHiveLineInputFormat'
OUTPUTFORMAT
  'jp.co.yahoo.dataplatform.mds.hadoop.hive.io.MDSHiveParserOutputFormat'
LOCATION '/tmp/mds_t'
TBLPROPERTIES (
  'mds.expand' = '{
    "base" : {
      "node" : "f_array" ,  "link_name" : "f_expand_array" 
    }
  }'
);

Expand table

This feature is read only.

CREATE EXTERNAL TABLE mds_t_expand(
  `unix_timestamp` bigint,
  `f_map` map<string,string>,
  `f_expand_array` struct<key1:string>
)
ROW FORMAT SERDE
  'jp.co.yahoo.dataplatform.mds.hadoop.hive.MDSSerde'
STORED AS INPUTFORMAT
  'jp.co.yahoo.dataplatform.mds.hadoop.hive.io.MDSHiveLineInputFormat'
OUTPUTFORMAT
  'jp.co.yahoo.dataplatform.mds.hadoop.hive.io.MDSHiveParserOutputFormat'
LOCATION '/tmp/mds_t'
TBLPROPERTIES (
  'mds.expand' = '{
    "base" : {
      "node" : "f_array" ,  "link_name" : "f_expand_array" 
    }
  }'
);

Flatten table

This feature is read only.

CREATE EXTERNAL TABLE mds_t_flatten(
  `unix_timestamp` bigint,
  `f_map_key1` string,
  `f_map_key2` string,
  `f_array` array<struct<key1:string>>
)
ROW FORMAT SERDE
  'jp.co.yahoo.dataplatform.mds.hadoop.hive.MDSSerde'
STORED AS INPUTFORMAT
  'jp.co.yahoo.dataplatform.mds.hadoop.hive.io.MDSHiveLineInputFormat'
OUTPUTFORMAT
  'jp.co.yahoo.dataplatform.mds.hadoop.hive.io.MDSHiveParserOutputFormat'
LOCATION '/tmp/mds_t_2'
TBLPROPERTIES (
  "mds.flatten" = '[ 
    { "link_name" : "unix_timestamp" , "nodes" : [ "unix_timestamp"] } ,
    { "link_name" : "f_array" , "nodes" : [ "f_array"] } ,
    { "link_name" : "f_map_key1" , "nodes" : [ "f_map" , "key1"] } ,
    { "link_name" : "f_map_key2" , "nodes" : [ "f_map" , "key2"] } 
  ]'
);

Expand and Flatten table

This feature is read only.

CREATE EXTERNAL TABLE mds_t_expand_flatten(
  `unix_timestamp` bigint,
  `f_map_key1` string,
  `f_map_key2` string,
  `f_expand_array_key1` string
)
ROW FORMAT SERDE
  'jp.co.yahoo.dataplatform.mds.hadoop.hive.MDSSerde'
STORED AS INPUTFORMAT
  'jp.co.yahoo.dataplatform.mds.hadoop.hive.io.MDSHiveLineInputFormat'
OUTPUTFORMAT
  'jp.co.yahoo.dataplatform.mds.hadoop.hive.io.MDSHiveParserOutputFormat'
LOCATION '/tmp/mds_t_2'
TBLPROPERTIES (
  'mds.expand' = '{
    "base" : {
      "node" : "f_array" ,  "link_name" : "f_expand_array" 
    }
  }',
  "mds.flatten" = '[ 
    { "link_name" : "f_expand_array_key1" , "nodes" : [ "f_expand_array" , "key1"] } ,
    { "link_name" : "unix_timestamp" , "nodes" : [ "unix_timestamp"] } ,
    { "link_name" : "f_map_key1" , "nodes" : [ "f_map" , "key1"] } ,
    { "link_name" : "f_map_key2" , "nodes" : [ "f_map" , "key2"] } 
  ]'
);

Performance

Measure processing speed with very complicated log data.

About 500 GB with Gzip compression

Version 2.0.0

Hive on tez

Lateral view
----------------------------------------------------------------------------------------------
  VERTICES   DURATION(ms)  CPU_TIME(ms)  GC_TIME(ms)  INPUT_RECORDS  OUTPUT_RECORDS
----------------------------------------------------------------------------------------------
     Map 1      272029.00    98,781,530    2,399,386  1,086,932,934         194,236
 Reducer 2       71392.00       746,700       17,408        194,236             381
 Reducer 3        1072.00         3,800           24            381               0
----------------------------------------------------------------------------------------------
 
Expand and flatten
----------------------------------------------------------------------------------------------
  VERTICES   DURATION(ms)  CPU_TIME(ms)  GC_TIME(ms)  INPUT_RECORDS  OUTPUT_RECORDS
----------------------------------------------------------------------------------------------
     Map 1       81328.00    11,281,810      208,822 21,509,301,388          96,809
 Reducer 2       17627.00       652,470       18,826         96,809             381
 Reducer 3        1258.00         3,260            0            381               0
----------------------------------------------------------------------------------------------

Benchmark

Version 2.0.0

Hive on tez

TCP-DS scale 500

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

getting_started_hive.md

getting_started_hive.md

Requirements

Support data type

Numeric Types

Date/Time Types

String Types

Misc Types

Complex Types

Support vectorization

Support pushdown

jars

Setup Hive

Add jars

Create table

Make complicated data correspond to vectorization

TBLPROPERTIES Bug

Base table

Expand table

Flatten table

Expand and Flatten table

Performance

Benchmark

Files

getting_started_hive.md

Latest commit

History

getting_started_hive.md

File metadata and controls

Requirements

Support data type

Numeric Types

Date/Time Types

String Types

Misc Types

Complex Types

Support vectorization

Support pushdown

jars

Setup Hive

Add jars

Create table

Make complicated data correspond to vectorization

TBLPROPERTIES Bug

Base table

Expand table

Flatten table

Expand and Flatten table

Performance

Benchmark