Tuesday 30 September 2014

Introduction to Hive UDFs

Apache Hive comes with a set of pre-defined User Defined Functions (UDFs). A complete listing of Hive UDFs is available here. Some common UDFs are unix_timestamp(), to_date(string timestamp), concat(string|binary A, string|binary B...), etc. However sometimes custom UDF is needed to solve specific problems.

In this post I will go through the process of creating custom UDFs.

Difference between UDF and GenericUDF
Hive UDFs are written in Java. In order to create a Hive UDF you need to derive from one of two classes UDF or GenericUDF. GenericUDFis bit complex to develop compared to UDF but it offers better performance and it supports all non-primitive parameters as input parameters and return types.

For writing custom UDFs by extending GenericUDF we need to overwrite 3 methods: initialize(), evaluate() and getDisplayString().

initialize()
This method only gets called once per JVM at the beginning to initilize the UDF. initilialize() is used to assert and validate the number and type of parameters that a UDF takes and the type of argument it returns. It also returns an ObjectInspector corresponding to the return type of the UDF.

evaluate()
This method is called once for every row of data being processed. Here the actual logic for transformation/processing of each row is written. It will return an object containing the result of processing logic. 

getDisplayString()
A simple method for returning the display string for the UDF when explain is used.

Apart from these we can have these Annotations also -
  • @UDFType(deterministic = true)
A deterministic UDF is one which always gives the same result when passed the same parameters. An example of such UDF are length(string input), regexp_replace(string initial_string, string pattern, string replacement), etc. A non-deterministic UDF, on the other hand can return different result for the same set of parameters. For example, unix_timestamp() returns the current timestamp using the default time zone. Therefore, when unix_timestamp() is invoked with the same parameters (no parameters) at different times, different results are obtained, making it non-deterministic. This annotation allows Hive to perform some optimization if the UDF is deterministic.
  • @Description(name="my_udf", value="This will be the result returned by explain statement.", extended="This will be result returned by the explain extended statement.")
This annotation tells Hive the name of your UDF. It will also be used to populate the result of queries like `DESCRIBE FUNCTION MY_UDF` or `DESCRIBE FUNCTION EXTENDED MY_UDF`.

In my next post I will give an example of GenericUDF to latitude and longitude of a location.

No comments:

Post a Comment