Thursday, September 20, 2012

Hive Indexing


People coming from an RDBMS background know the benefit of indexing: indexes allow faster access to rows in a table. If we apply indexing in Hive, the first expectation might be that an indexed query should take less time to fetch records and should not launch a map reduce job. In practice, a map reduce job is still launched for a Hive query even when an index exists on the Hive table. The job runs against the table that holds the index data to collect all the relevant offsets into the main table, and then uses those offsets to figure out which blocks to read from the main table. So you will not see map/reduce go away even when you are running queries on tables with indexes on them. The biggest advantage of an index is that it avoids a full table scan and reads only the HDFS blocks required.
The difference between compact and bitmap indexes (both available in Hive 0.8) is how they store the mapping from values to the rows in which the value occurs: a compact index stores (value, block-id) pairs, while a bitmap index stores (value, list of rows as a bitmap). A sketch of both creation syntaxes follows.
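As a hedged sketch, reusing this post's table and column names (the 'BITMAP' shorthand is assumed to resolve to org.apache.hadoop.hive.ql.index.bitmap.BitmapIndexHandler in Hive 0.8):

-- Compact index: stores (value, block-id) pairs; suits columns with many distinct values.
CREATE INDEX test_column_compact_idx ON TABLE test_table(test_column)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

-- Bitmap index: stores a bitmap of matching rows per value; suits low-cardinality columns.
CREATE INDEX test_column_bitmap_idx ON TABLE test_table(test_column)
AS 'BITMAP'
WITH DEFERRED REBUILD;

-- WITH DEFERRED REBUILD leaves the index empty until an explicit REBUILD,
-- exactly as done for the compact index in the session below.
ALTER INDEX test_column_bitmap_idx ON test_table REBUILD;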

For example, on a Hadoop cluster, if we run the following query without an index:
select * from test_table where test_column=02196666033;

It takes 71 seconds to run because it performs a complete table scan. Note that this table is stored in RCFile format and has roughly 1 billion records.
The cluster is an 8-node Hadoop cluster.
Now we can create the index with the following commands:

hive> create INDEX test_column_index ON TABLE test_table(test_column) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'  WITH DEFERRED REBUILD;
OK
Time taken: 3.134 seconds
hive> ALTER INDEX  test_column_index  ON test_table REBUILD; 
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 26
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201208160845_0121, Tracking URL = http://test-cluster-1:50030/jobdetails.jsp?jobid=job_201208160845_0121
Kill Command = /home/hadoop/hadoop-0.20.2-hdh3u3/bin/../bin/hadoop job  -Dmapred.job.tracker=test-cluster-1:50300 -kill job_201208160845_0121
2012-08-20 12:43:32,481 Stage-1 map = 0%,  reduce = 0%
2012-08-20 12:43:36,503 Stage-1 map = 1%,  reduce = 0%
…………………………………………………………………………………………….
2012-08-20 12:44:34,797 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201208160845_0121
Loading data to table default.default__test_table_test_column_index__
Deleted hdfs://test-cluster-1:54310/user/hive/warehouse/default__test_table_test_column_index__
Table default.default__test_table_test_column_index__ stats: [num_partitions: 0, num_files: 26, num_rows: 0, total_size: 179836098]
OK
Time taken: 68.077 seconds

Note that this has now created an index table named “default__test_table_test_column_index__” under the default database.
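As a hedged aside, you can confirm the index exists with Hive's SHOW INDEX statement (available since Hive 0.7):

hive> SHOW INDEX ON test_table;

To actually use the index in this version of Hive, we next query the index table for the (bucketname, offsets) entries matching our predicate and write them to a temporary file that Hive will consult: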

hive> INSERT OVERWRITE DIRECTORY "/tmp/index_test_result" SELECT `_bucketname` , `_offsets` FROM default__test_table_test_column_index__ where  test_column=02196666033 ;
……………..
Moving data to: /tmp/index_test_result
1 Rows loaded to /tmp/index_test_result
OK
Time taken: 2.683 seconds

These two settings point Hive at the extracted index data and switch to the compact-index-aware input format, so the next query reads only the blocks listed in /tmp/index_test_result:

hive> SET hive.index.compact.file=/tmp/index_test_result;
hive> SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;

hive> SELECT * FROM  test_table WHERE test_column=02196666033;
15836400000     21              02196666033     2813    13                     11               3               0       61075   1       0       0       2078   01
15836400000     21              02196666033     2813    13                     11               3               0       61075   1       0       0       2078   01
15836580000     21              02196666033     2813    13                     11               3               0       61075   1       0       0       2078   01
15836760000     21              02196666033     2813    13                     11               3               0       61075   1       0       0       2078   01
………………………………
15836760000     21              02196666033     2813    13                     11               3               0       61075   1       0       0       2078   01
Time taken: 2.479 seconds

Now you can see the difference: the same query takes about 2.5 seconds instead of 71 seconds because it is using the index.

hive> SELECT count(*) FROM  test_table WHERE test_column=02196666033;
602
Time taken: 3.417 seconds


What has happened is that the index table stores the HDFS file and the block offsets where the data for this value resides; when the query is fired, Hive reads the actual data only from those blocks:
hive> describe  default__test_table_test_column_index__;
OK
test_column  string  from deserializer
_bucketname     string  from deserializer
_offsets        array<bigint>   from deserializer

hive> select * from default__test_table_test_column_index__ where test_column=02196666033;
02196666033     hdfs://test-cluster-1:54310/user/hive/warehouse/test_table/000033_0       [1653776,9918864,6611866,8265443,28615496,18337446,30330621,21765257,25190947,14908809,26903328,16623458,20052481,194,4958659,3305997,23478272,11571563,13224916]
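A hedged note on newer releases: from Hive 0.8 onwards the optimizer can be asked to consult compact indexes automatically, which would replace the manual INSERT OVERWRITE DIRECTORY and SET steps shown above; a minimal sketch, assuming the hive.optimize.index.filter settings behave as documented:

-- Assumption: available from Hive 0.8 onwards. When enabled, the optimizer
-- rewrites eligible queries to read only the blocks listed in the index.
SET hive.optimize.index.filter=true;
-- Optional: lower the minimum table size at which the index is applied.
SET hive.optimize.index.filter.compact.minsize=0;

SELECT * FROM test_table WHERE test_column=02196666033;

This may be why, on later versions such as Hive 0.13, the manual settings are no longer needed (a question raised in the comments below).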


8 comments:

  1. Thanks for sharing the useful info!

  2. Great insights on Hive indexes. Thanks Yogesh.

  3. Thanks for such an interesting post.

    So, for each query you'll need to create the temporary file extracted from the index, and then query the table itself, right?

    Then at least, you'll need your query to wait for the creation of no less than 2 different map-reduce jobs, the first of them reading the index, and the second one going through your temporary file to the final blocks where the data is, right?

    While this looks nice, you should count the total time, since for every query you need to create a new temporary file, so overall time is around 6 secs ... a good improvement, but this is still not a very easy way to work with indices. Shouldn't Hive be the one doing something like that behind the scenes when you have the index?

    I'm still not sure what the best approach to boost Hive is. I'm exploring columnar formats like ORC, and indexing, and have seen nice improvements: queries taking 30-40 secs are down to 10s. But most of that time is not really CPU time; with CPU time of just 1-2 secs, the elapsed time for the query to return is still 10-11 secs, which is probably not acceptable if I want to build a web front-end with that as a back-end. That maybe makes it more desirable to choose a different storage to support that (like HBase), and leave Hive just for batch processing of large volumes. Impala seems like a good approach; I need to have it working on Hortonworks to try it out, and so does Drill (waiting for it).

  4. Great post Yogesh.. Good insight to Hive indexes...

    Thanks!

  5. Great writeup! How much disk space, and how many mappers for your source data?

    A note: bitmap indexes are useful for fields with a small number of values, while compact indexes are for fields with many possible values.

  6. Good Post Yogesh.
    I am using Hive 0.13. I have reduced the query time by using compact indexes & RCFile table storage.

    I have not done any setting for hive.index.compact.file & hive.input.format as you have mentioned. Is it required in Hive 0.13?

  7. Thanks for sharing very useful information

  8. Thanks for sharing this post :)
