HiveQL

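Data types

Primitive data types
TINYINT     1-byte signed integer
SMALLINT    2-byte signed integer
INT         4-byte signed integer
BIGINT      8-byte signed integer
BOOLEAN     Boolean value, true or false
FLOAT       Single-precision floating-point number
DOUBLE      Double-precision floating-point number
STRING      Character string
TIMESTAMP   Timestamp in the format YYYY-MM-DD hh:mm:ss.fffffffff
BINARY      Byte array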

Collection data types
STRUCT
MAP
ARRAY

Create a table with fields of collection data types
create table employees (
name string,
salary float,
subordinates array<string>,  -- ARRAY
deductions map<string, float>,  -- MAP
address struct<street:string, city:string, state:string, zip:int>  -- STRUCT
)
row format delimited
fields terminated by '\001'
collection items terminated by '\002'
map keys terminated by '\003'
lines terminated by '\n'
stored as textfile;

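Default record and field delimiters in Hive (octal values in parentheses)
\n          Newline; separates records
^A (\001)   Separates fields
^B (\002)   Separates the elements of an ARRAY or STRUCT, or the key/value pairs of a MAP
^C (\003)   Separates each key from its value within a MAP key/value pair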
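
Since the delimiters written out in the CREATE TABLE statement above are the same as these defaults, the ROW FORMAT clause should be optional. The following is a minimal sketch under that assumption; the table name employees_default is hypothetical, and STORED AS is omitted on the assumption that hive.default.fileformat is left at its default (TextFile).

```sql
-- Hypothetical table name; same columns as employees.
-- No ROW FORMAT clause: Hive falls back to \001, \002, \003 and \n.
create table employees_default (
  name string,
  salary float,
  subordinates array<string>,
  deductions map<string, float>,
  address struct<street:string, city:string, state:string, zip:int>
);
```

Writing the delimiters explicitly, as in the original statement, is still useful as documentation of the file layout.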

$ cat -v employees.data
Jone Doe^A1000.0^AMary Smith^Btodd Jones^AFederal Taxes^C.2^Bstate Taxes^C.05^BInsurance^C.1^A1Michigan Ave.^BChicago^BIL^B60600

hive> load data local inpath "/home/vagrant/data/employees.data" into table employees;

hive> select * from employees;
Jone Doe        1000.0  ["Mary Smith","todd Jones"]     {"Federal Taxes":0.2,"state Taxes":0.05,"Insurance":0.1}        
{"street":"1Michigan Ave.","city":"Chicago","state":"IL","zip":60600}

Accessing a MAP element
hive> select name, deductions["state Taxes"] from employees;
Jone Doe        0.05

Accessing a STRUCT element
hive> select name, address.city from employees;
Jone Doe        Chicago
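
The remaining collection type, ARRAY, is accessed by index. A short sketch against the same employees table; with the sample row loaded above, the first query should return "Mary Smith" for subordinates[0].

```sql
-- ARRAY elements are referenced with a zero-based index.
select name, subordinates[0] from employees;

-- size() returns the number of elements in an ARRAY (or MAP).
select name, size(subordinates) from employees;
```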

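Loading data

LOCAL keyword: the path is treated as a path on the local file system.
OVERWRITE keyword: the existing data in the table is deleted and replaced with the loaded data. Without it, the data is appended.
hive> load data local inpath "/home/vagrant/data/employees.data" overwrite into table employees;

insert into: appends the rows to the existing data without replacing it
insert overwrite: deletes the existing data and replaces it with the query result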
hive> insert overwrite table sales
    > select * from sales
    > where sales = 100;
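
For comparison, the appending form might look like the sketch below. The source table sales_staging is a hypothetical name used only for illustration and is assumed to have the same columns as sales.

```sql
-- Append the matching rows from a hypothetical staging table.
-- Rows already in sales are kept.
insert into table sales
select * from sales_staging
where sales = 100;
```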

Create a table and load data into it with a single query
hive> create table sin_sales
    > as select * from sales
    > where sales > 500
    > ;

Exporting data

directory: the export destination
hive> insert overwrite local directory '/home/vagrant/tmp'
    > select * from sales
    > where sales > 100;

hive> !ls /home/vagrant/tmp;
000000_0

hive> !cat /home/vagrant/tmp/000000_0;
新宿店340
池袋店874
渋谷店400
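
The shop name and the amount look run together in this output because the exported file uses the default field delimiter ^A, which is not printable. A sketch of exporting with a visible delimiter instead, assuming Hive 0.11 or later (where a ROW FORMAT clause is accepted on INSERT OVERWRITE DIRECTORY):

```sql
-- Assumes Hive 0.11 or later.
-- Fields in the exported file are separated by commas instead of ^A.
insert overwrite local directory '/home/vagrant/tmp'
row format delimited
fields terminated by ','
select * from sales
where sales > 100;
```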

Split the export destination by condition
hive> from sales sa
    > insert overwrite local directory '/home/vagrant/tmp/a'
    > select * where sa.sales = 100
    > insert overwrite local directory '/home/vagrant/tmp/b'
    > select * where sa.sales = 400
    > ;

Indexes

hive> create index sales_index on table sales(shop)
    > AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    > with deferred rebuild
    > in table sales_index_table;

hive> show tables;
OK
sales
sales_index_table

hive> show formatted index on sales;
idx_name        tab_name        col_names       idx_tab_name            idx_type        comment
sales_index     sales           shop            sales_index_table       compact

hive> alter index sales_index on sales rebuild;  -- rebuild the index (*)

* The index is not rebuilt automatically when the table is updated; it has to be rebuilt explicitly.
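
Because the rebuild is manual, a load would typically be followed by a rebuild. A sketch of that sequence, assuming a Hive version that still supports indexes (they were removed in Hive 3.0); the data path is reused from elsewhere in these notes purely as an example.

```sql
-- Load new data, then rebuild so that the index reflects it.
load data local inpath "/home/vagrant/tmp/data.csv" into table sales;
alter index sales_index on sales rebuild;

-- Drop the index when it is no longer needed.
drop index if exists sales_index on sales;
```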

Nested SELECT

hive> from (
    >   select
    >     shop, (sales * 2) as sales2
    >   from
    >     sales
    > ) e
    > select
    >   e.shop, e.sales2
    > where
    >   e.sales2 < 1000
    > ;

CASE ... WHEN ... THEN expressions

hive> select shop, sales,
    >   case
    >     when sales < 300 then 'low'
    >     when sales >= 300 and sales < 500 then 'middle'
    >     else 'high'
    >   end as rank
    > from sales
    > ;

WHERE clause

hive> select
    >   shop, (sales * 2) as sales2
    > from
    >   sales
    > where
    >   sales2 > 1000
    > ;
FAILED: SemanticException [Error 10004]: Line 7:2 Invalid table alias or column reference 'sales2': 
(possible column names are: shop, sales)

A column alias cannot be referenced in the WHERE clause, so use a nested SELECT instead.
hive> select
    >   e.*
    > from
    >   (
    >     select
    >       shop, (sales * 2) as sales2
    >     from
    >       sales
    >   ) e
    > where
    >   sales2 < 1000
    > ;
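
For an expression this simple, another option is to repeat the expression itself in the WHERE clause rather than nesting. A minimal sketch using the same sales table:

```sql
-- The alias sales2 is only visible in the output,
-- so the WHERE clause repeats the underlying expression.
select shop, (sales * 2) as sales2
from sales
where (sales * 2) < 1000;
```

The nested form is probably easier to maintain once the expression gets long or is used in several conditions.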

Detecting strings that contain control characters

$ cat -v data.csv
sibuya,400
ro^Appong,500      <- contains the control character ^A
akasa^Cka,600      <- contains the control character ^C

Create the table and load data.csv
$ hive

hive> create table sales(shop string,sales int) row format delimited fields terminated by ',';
hive> load data local inpath "/home/vagrant/tmp/data.csv" into table sales;

hive> select * from sales;
sibuya  400
roppong 500
akasaka 600

Extract the records that contain ^A (the match is true)
hive> select * from sales where regexp(shop, '^.*\\x01.*$') = true;
OK
roppong 500
Time taken: 0.054 seconds, Fetched: 1 row(s)

Extract the records that do not contain ^A (the match is false)
hive> select * from sales where regexp(shop, '^.*\\x01.*$') = false;
OK
sibuya  400
akasaka 600
Time taken: 0.047 seconds, Fetched: 2 row(s)
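
The same checks can likely be written more compactly with the RLIKE operator, which is true when any substring matches, so the leading ^.* and trailing .*$ are not needed. The sketch below also generalizes the pattern to the ASCII control range 0x00-0x1f and uses regexp_replace to strip such characters; the character range is my own choice for illustration, not something taken from the reference.

```sql
-- Records whose shop contains ^A (0x01).
select * from sales where shop rlike '\\x01';

-- Records whose shop contains any ASCII control character (0x00-0x1f).
select * from sales where shop rlike '[\\x00-\\x1f]';

-- Return shop with the control characters removed.
select regexp_replace(shop, '[\\x00-\\x1f]', '') as shop, sales
from sales;
```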

References

Programming Hive (プログラミングHive), Edward Capriolo, O'Reilly Japan, June 2013