Big Data Part 5.2: Advanced Hive features

In this article we will continue to look at the capabilities of Hive, an engine that translates SQL-like queries into MapReduce jobs.

In the previous article, we looked at the basic features of Hive, such as creating tables, loading data, and performing simple SELECT queries. Now let's talk about the advanced features that will let you squeeze the most out of Hive.


User Defined Functions


One of the main obstacles when working with Hive is the rigidity of standard SQL. This problem can be solved with language extensions, the so-called “User Defined Functions”. Quite a lot of useful functions are built right into the Hive language. Here are some of the most interesting ones, in my opinion (the information is taken from the official documentation):

JSON

A fairly common task when working with big data is processing unstructured data stored in JSON format. To work with JSON, Hive supports a special get_json_object function, which allows you to extract values from JSON documents. Values are addressed using a limited version of JSONPath notation. The following operations are supported:
  • $ : returns the root object
  • . : child operator (accesses a child of the current object)
  • [] : addresses an array element by index
  • * : wildcard for []

Examples of working with JSON from the official documentation:

Suppose there is a table src_json consisting of a single column (json) and a single row:
{"store": {"fruit":\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}], "bicycle":{"price":19.95,"color":"red"} }, "email":"amy@only_for_json_udf_test.net", "owner":"amy" } 

Example queries against the table:
 hive> SELECT get_json_object(src_json.json, '$.owner') FROM src_json;
 amy
 hive> SELECT get_json_object(src_json.json, '$.store.fruit[0]') FROM src_json;
 {"weight":8,"type":"apple"}
 hive> SELECT get_json_object(src_json.json, '$.non_exist_key') FROM src_json;
 NULL

XPath

Similarly, if the data that needs to be processed with Hive is stored not in JSON but in XML, it can be handled with the xpath function, which parses XML using the XPath language. An example of parsing XML data with xpath:
 hive> select xpath('<a><b>b1</b><b>b2</b></a>', 'a/*/text()') from sample_table limit 1;
 ["b1","b2"]

Other useful built-in functions:

Hive ships with a fairly rich set of built-in functions. They fall into several groups (a short sketch of some of them in action follows the list):
  • Mathematical functions (sin, cos, log, ...)
  • Functions for working with dates and times (from_unixtime, to_date, current_date, hour(string date), datediff, ...): a very rich choice of functions for converting dates and times
  • Functions for working with strings. Both generally applicable functions are supported, such as length, reverse, regexp, and specific ones, such as parse_url or the already covered get_json_object
  • Various system functions: current_user, current_database, ...
  • Cryptographic functions: sha, md5, aes_encrypt, aes_decrypt, ...
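
For illustration, here is a minimal sketch of a few of these built-ins in action (sample_table is a hypothetical table, referenced only because older Hive versions require a FROM clause even for constant expressions; md5 is only available in relatively recent Hive versions):

 hive> select length('hello'), reverse('hello') from sample_table limit 1;
 5	olleh
 hive> select datediff('2016-01-10', '2016-01-01') from sample_table limit 1;
 9
 hive> select md5('hello') from sample_table limit 1;
 5d41402abc4b2a76b9719d911017c592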

A full list of the functions built into Hive can be found in the official Hive documentation.

Writing your own UDF


The built-in Hive functions are not always enough to solve the task at hand. If there is no suitable built-in function, you can write your own UDF. This is done in Java.

Let us walk through the creation of a custom UDF using the example of a simple function that converts a string to lowercase:

1. Create the package com/example/hive/udf and create the Lower.java class in it:
 mkdir -p com/example/hive/udf
 edit com/example/hive/udf/Lower.java

2. Implement the Lower class itself:
 package com.example.hive.udf;

 import org.apache.hadoop.hive.ql.exec.UDF;
 import org.apache.hadoop.io.Text;

 public final class Lower extends UDF {
     public Text evaluate(final Text s) {
         if (s == null) {
             return null;
         }
         return new Text(s.toString().toLowerCase());
     }
 }

3. Add the necessary libraries to CLASSPATH (in your Hadoop distribution the paths to the jar files may differ slightly):
 export CLASSPATH=/opt/cloudera/parcels/CDH/lib/hive/lib/hive-exec.jar:/opt/cloudera/parcels/CDH/lib/hadoop/hadoop-common.jar 

4. Compile our UDF and build a jar archive:
 javac com/example/hive/udf/Lower.java
 jar cvf my_udf.jar *

5. To be able to use the function in Hive, you need to declare it explicitly:
 hive> ADD JAR my_udf.jar;
 hive> create temporary function my_lower as 'com.example.hive.udf.Lower';
 hive> select my_lower('HELLO') from sample_table limit 1;
 hello

Transforming a table with scripts


Another way to extend standard Hive functionality is the TRANSFORM method, which allows you to transform data using custom scripts written in any programming language (this is especially suitable for those who do not like Java and do not want to write UDFs in it).
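
A minimal sketch of the general pattern (the table sample_table, its column name, and the script to_upper.py are hypothetical; Hive feeds the selected columns to the script as tab-separated lines on stdin and reads the resulting rows back from its stdout):

 hive> ADD FILE to_upper.py;
 hive> SELECT TRANSFORM(name) USING 'python to_upper.py' AS name_upper FROM sample_table;

Here to_upper.py can be any executable that reads lines from standard input and writes the transformed lines to standard output; since the whole exchange goes through stdin/stdout, the language the script is written in does not matter to Hive.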



