Thursday, September 20, 2012

Hive Indexing


People coming from an RDBMS background know the benefit of indexing: indexes allow faster access to rows in a table. If we apply indexing in Hive, the first expectation might be that an indexed query should take less time to fetch records and should not launch a map reduce job. In practice, a map reduce job is still launched for a Hive query even when an index exists on the Hive table. The job runs against the table that holds the index data to collect all the relevant offsets into the main table, and then uses those offsets to figure out which blocks to read from the main table. So you will not see map/reduce go away even when you are running queries on tables with indexes on them. The biggest advantage of an index is that it avoids a full table scan and reads only the HDFS blocks required.
The difference between compact and bitmap indexes (both available in Hive 0.8) is how they store the mapping from values to the rows in which the value occurs: a compact index stores (value, block-id) pairs, while a bitmap index stores (value, list of rows as a bitmap). A sketch of both creation syntaxes follows.
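As a hedged sketch, reusing this post's table and column names (the 'BITMAP' shorthand is assumed to resolve to org.apache.hadoop.hive.ql.index.bitmap.BitmapIndexHandler in Hive 0.8):

-- Compact index: stores (value, block-id) pairs; suits columns with many distinct values.
CREATE INDEX test_column_compact_idx ON TABLE test_table(test_column)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

-- Bitmap index: stores a bitmap of matching rows per value; suits low-cardinality columns.
CREATE INDEX test_column_bitmap_idx ON TABLE test_table(test_column)
AS 'BITMAP'
WITH DEFERRED REBUILD;

-- WITH DEFERRED REBUILD leaves the index empty until an explicit REBUILD,
-- exactly as done for the compact index in the session below.
ALTER INDEX test_column_bitmap_idx ON test_table REBUILD;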

For example, on a Hadoop cluster, if we run the following query without an index:
select * from test_table where test_column=02196666033;

It takes 71 seconds to run because it performs a complete table scan. Note that this table is stored in RCFile format and has roughly 1 billion records.
The cluster is an 8-node Hadoop cluster.
Now we can create the index with the following commands:

hive> create INDEX test_column_index ON TABLE test_table(test_column) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'  WITH DEFERRED REBUILD;
OK
Time taken: 3.134 seconds
hive> ALTER INDEX  test_column_index  ON test_table REBUILD; 
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 26
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201208160845_0121, Tracking URL = http://test-cluster-1:50030/jobdetails.jsp?jobid=job_201208160845_0121
Kill Command = /home/hadoop/hadoop-0.20.2-hdh3u3/bin/../bin/hadoop job  -Dmapred.job.tracker=test-cluster-1:50300 -kill job_201208160845_0121
2012-08-20 12:43:32,481 Stage-1 map = 0%,  reduce = 0%
2012-08-20 12:43:36,503 Stage-1 map = 1%,  reduce = 0%
…………………………………………………………………………………………….
2012-08-20 12:44:34,797 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201208160845_0121
Loading data to table default.default__test_table_test_column_index__
Deleted hdfs://test-cluster-1:54310/user/hive/warehouse/default__test_table_test_column_index__
Table default.default__test_table_test_column_index__ stats: [num_partitions: 0, num_files: 26, num_rows: 0, total_size: 179836098]
OK
Time taken: 68.077 seconds

Note that this has now created an index table named “default__test_table_test_column_index__” under the default database.
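As a hedged aside, you can confirm the index exists with Hive's SHOW INDEX statement (available since Hive 0.7):

hive> SHOW INDEX ON test_table;

To actually use the index in this version of Hive, we next query the index table for the (bucketname, offsets) entries matching our predicate and write them to a temporary file that Hive will consult: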

hive> INSERT OVERWRITE DIRECTORY "/tmp/index_test_result" SELECT `_bucketname` , `_offsets` FROM default__test_table_test_column_index__ where  test_column=02196666033 ;
……………..
Moving data to: /tmp/index_test_result
1 Rows loaded to /tmp/index_test_result
OK
Time taken: 2.683 seconds

These two settings point Hive at the extracted index data and switch to the compact-index-aware input format, so the next query reads only the blocks listed in /tmp/index_test_result:

hive> SET hive.index.compact.file=/tmp/index_test_result;
hive> SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;

hive> SELECT * FROM  test_table WHERE test_column=02196666033;
15836400000     21              02196666033     2813    13                     11               3               0       61075   1       0       0       2078   01
15836400000     21              02196666033     2813    13                     11               3               0       61075   1       0       0       2078   01
15836580000     21              02196666033     2813    13                     11               3               0       61075   1       0       0       2078   01
15836760000     21              02196666033     2813    13                     11               3               0       61075   1       0       0       2078   01
………………………………
15836760000     21              02196666033     2813    13                     11               3               0       61075   1       0       0       2078   01
Time taken: 2.479 seconds

Now you can see the difference: the same query takes about 2.5 seconds instead of 71 seconds because it is using the index.

hive> SELECT count(*) FROM  test_table WHERE test_column=02196666033;
602
Time taken: 3.417 seconds


What has happened is that the index table stores the HDFS file and the block offsets where the data for this value resides; when the query is fired, Hive reads the actual data only from those blocks:
hive> describe  default__test_table_test_column_index__;
OK
test_column  string  from deserializer
_bucketname     string  from deserializer
_offsets        array<bigint>   from deserializer

hive> select * from default__test_table_test_column_index__ where test_column=02196666033;
02196666033     hdfs://test-cluster-1:54310/user/hive/warehouse/test_table/000033_0       [1653776,9918864,6611866,8265443,28615496,18337446,30330621,21765257,25190947,14908809,26903328,16623458,20052481,194,4958659,3305997,23478272,11571563,13224916]
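A hedged note on newer releases: from Hive 0.8 onwards the optimizer can be asked to consult compact indexes automatically, which would replace the manual INSERT OVERWRITE DIRECTORY and SET steps shown above; a minimal sketch, assuming the hive.optimize.index.filter settings behave as documented:

-- Assumption: available from Hive 0.8 onwards. When enabled, the optimizer
-- rewrites eligible queries to read only the blocks listed in the index.
SET hive.optimize.index.filter=true;
-- Optional: lower the minimum table size at which the index is applied.
SET hive.optimize.index.filter.compact.minsize=0;

SELECT * FROM test_table WHERE test_column=02196666033;

This may be why, on later versions such as Hive 0.13, the manual settings are no longer needed (a question raised in the comments below).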


8 comments:

  1. Thanks for sharing the useful info!

  2. Great insights on Hive indexes. Thanks Yogesh.

  3. Thanks for such an interesting post.

    So, for each query you'll need to create the temporary file extracted from the index, and then query the table itself, right?

    Then at least, you'll need your query to wait for the creation of no less than 2 different map-reduce jobs, the first of them reading the index, and the second one going through your temporary file to the final blocks where the data is, right?

    While this looks nice, you should count the total time, since for every query you need to create a new temporary file, so overall time is around 6 secs ... a good improvement, but this is still not a very easy way to work with indices. Shouldn't Hive be the one doing something like that behind the scenes when you have the index?

    I'm still not sure what the best approach to boost Hive is. I'm exploring columnar formats like ORC, and indexing, and have seen nice improvements: queries taking 30-40 secs are down to 10s. But most of that time is not really CPU time; with CPU time of just 1-2 secs, the elapsed time for the query to return is still 10-11 secs, which is probably not acceptable if I want to build a web front-end with that as a back-end. That maybe makes it more desirable to choose a different storage to support that (like HBase), and leave Hive just for batch processing of large volumes. Impala seems like a good approach; I need to have it working on Hortonworks to try it out, and so does Drill (waiting for it).

  4. Great post Yogesh.. Good insight to Hive indexes...

    Thanks!

  5. Great writeup! How much disk space, and how many mappers for your source data?

    A note: bitmap indexes are useful for fields with a small number of values, while compact indexes are for fields with many possible values.

  6. Good Post Yogesh.
    I am using Hive 0.13. I have reduced the query time by using compact indexes & RCFile table storage.

    I have not done any setting for hive.index.compact.file & hive.input.format as you have mentioned. Is it required in Hive 0.13?

  7. Thanks for sharing very useful information

  8. Thanks for sharing this post :)
