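Preface
An earlier post covered the two ways to implement a Hive UDF. GenericUDF is the more complex of the two, so to understand it better I read through the source of one of Hive's built-in functions and took the notes below. I'm new to this and still learning, so corrections are welcome.
Source Code Walkthrough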
// import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
public class GenericUDFDateDiff extends GenericUDF {
    // import java.text.SimpleDateFormat; formatter used to parse string arguments
    private transient SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd");
    // import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters.Converter;
    // One converter per argument; each turns the raw input into the writable type the function works with
    private transient Converter inputConverter1;
    private transient Converter inputConverter2;
    // import org.apache.hadoop.io.IntWritable; holds the return value.
    // IntWritable is Hadoop's serializable wrapper around a Java int
    private IntWritable output = new IntWritable();
    // import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector.PrimitiveCategory;
    // Hive primitive category (STRING, TIMESTAMP, DATE, ...) of each argument
    private transient PrimitiveCategory inputType1;
    private transient PrimitiveCategory inputType2;
    // Reusable holder that the private evaluate method fills in
    private IntWritable result = new IntWritable();

    public GenericUDFDateDiff() {
        // import java.util.TimeZone; parse date strings in UTC
        this.formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
    }
}
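The constructor pins the formatter to UTC. The source does not say why, but one plausible reason is that the later day arithmetic divides by a fixed 86,400,000 ms, which only lines up with calendar days in a zone without daylight saving time. The following is a minimal JDK-only sketch of that effect; the class and method names are made up for illustration and are not part of Hive.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

class UtcFormatterDemo {
    // Parse a yyyy-MM-dd string in the given time zone and return epoch milliseconds.
    static long parseMillis(String day, String zoneId) {
        SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd");
        formatter.setTimeZone(TimeZone.getTimeZone(zoneId));
        try {
            return formatter.parse(day).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("not a yyyy-MM-dd date: " + day, e);
        }
    }

    public static void main(String[] args) {
        // In UTC, midnight values are clean multiples of one day.
        System.out.println(parseMillis("1970-01-02", "UTC")); // 86400000

        // In a zone with daylight saving time, the day of the switch is 23 hours long.
        long springForward = parseMillis("2021-03-15", "America/New_York")
                - parseMillis("2021-03-14", "America/New_York");
        System.out.println(springForward); // 82800000
    }
}
```

Dividing 82,800,000 by 86,400,000 with integer division gives 0, so a formatter in that zone would report zero days between two consecutive dates.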
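The GenericUDFDateDiff class declaration above extends GenericUDF and declares the fields used by the methods that follow. Next comes the overridden initialize method: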
// import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
@Override
public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    // import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
    // Check the argument count and throw unless there are exactly two
    if (arguments.length != 2) {
        throw new UDFArgumentLengthException("datediff() requires 2 argument, got " + arguments.length);
    } else {
        // Validate each argument's type and build its converter
        this.inputConverter1 = this.checkArguments(arguments, 0);
        this.inputConverter2 = this.checkArguments(arguments, 1);
        // import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
        // Record the primitive category of each argument
        this.inputType1 = ((PrimitiveObjectInspector) arguments[0]).getPrimitiveCategory();
        this.inputType2 = ((PrimitiveObjectInspector) arguments[1]).getPrimitiveCategory();
        // import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
        // The result is an int, exposed to Hive as an IntWritable
        ObjectInspector outputOI = PrimitiveObjectInspectorFactory.writableIntObjectInspector;
        return outputOI;
    }
}
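The overridden initialize first checks the argument count and throws an exception unless exactly two arguments were passed. It then initializes the argument-type and converter fields declared earlier, and returns the ObjectInspector that describes the int result.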
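The shape of initialize (validate the call once, cache per-argument state, report the return type) can be sketched without Hive on the classpath. Everything below is a hypothetical stand-in: type names are plain strings rather than ObjectInspector instances, and IllegalArgumentException takes the place of UDFArgumentLengthException.

```java
class InitializeSketch {
    // Hypothetical stand-in for the per-argument state that initialize caches.
    private String[] argumentTypes;

    // Validate the call shape once, before any row is processed,
    // and report the declared return type.
    String initialize(String... typeNames) {
        if (typeNames.length != 2) {
            throw new IllegalArgumentException(
                    "datediff() requires 2 arguments, got " + typeNames.length);
        }
        this.argumentTypes = typeNames.clone();
        return "int";
    }

    public static void main(String[] args) {
        InitializeSketch udf = new InitializeSketch();
        System.out.println(udf.initialize("string", "date"));
        try {
            udf.initialize("string");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing here means a malformed call is rejected when the function is set up rather than while rows are being processed.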
private Converter checkArguments(ObjectInspector[] arguments, int i) throws UDFArgumentException {
    // import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
    // import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector.Category;
    // Reject anything that is not a primitive type
    if (arguments[i].getCategory() != Category.PRIMITIVE) {
        throw new UDFArgumentTypeException(0, "Only primitive type arguments are accepted but " + arguments[i].getTypeName() + " is passed. as first arguments");
    } else {
        // Get the primitive category of the argument
        PrimitiveCategory inputType = ((PrimitiveObjectInspector) arguments[i]).getPrimitiveCategory();
        Converter converter;
        // Pick the converter that matches the concrete type
        switch (inputType) {
            case STRING:
            case VARCHAR:
            case CHAR:
                converter = ObjectInspectorConverters.getConverter((PrimitiveObjectInspector) arguments[i], PrimitiveObjectInspectorFactory.writableStringObjectInspector);
                break;
            case TIMESTAMP:
                // import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorConverter.TimestampConverter;
                converter = new TimestampConverter((PrimitiveObjectInspector) arguments[i], PrimitiveObjectInspectorFactory.writableTimestampObjectInspector);
                break;
            case DATE:
                converter = ObjectInspectorConverters.getConverter((PrimitiveObjectInspector) arguments[i], PrimitiveObjectInspectorFactory.writableDateObjectInspector);
                break;
            default:
                throw new UDFArgumentException("DATEDIFF() only take STRING/TIMESTAMP/DATEWRITABLE types as " + (i + 1) + "-th argument,got " + inputType);
        }
        return converter;
    }
}
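checkArguments first verifies that the argument is one of Hive's primitive types and throws an exception otherwise. It then assigns a converter that matches the concrete type. Any primitive type other than STRING, VARCHAR, CHAR, TIMESTAMP, or DATE also results in an exception.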
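The switch relies on stacked case labels so that STRING, VARCHAR, and CHAR share a single branch. Below is a small JDK-only sketch of that dispatch pattern. The Kind enum is a hypothetical, trimmed-down stand-in for PrimitiveCategory, and the lambdas stand in for Hive's Converter objects; none of this is the Hive API.

```java
import java.util.function.Function;

class ConverterDispatchSketch {
    // Hypothetical, trimmed-down stand-in for Hive's PrimitiveCategory enum.
    enum Kind { STRING, VARCHAR, CHAR, TIMESTAMP, DATE, INT }

    // Pick a converter for the given type; stacked case labels share one branch.
    static Function<Object, String> converterFor(Kind kind, int position) {
        switch (kind) {
            case STRING:
            case VARCHAR:
            case CHAR:
                return value -> "text:" + value;
            case TIMESTAMP:
                return value -> "timestamp:" + value;
            case DATE:
                return value -> "date:" + value;
            default:
                // Unsupported types are rejected with the 1-based argument position.
                throw new IllegalArgumentException(
                        "only STRING/TIMESTAMP/DATE accepted as argument "
                                + (position + 1) + ", got " + kind);
        }
    }

    public static void main(String[] args) {
        System.out.println(converterFor(Kind.VARCHAR, 0).apply("2020-01-02")); // text:2020-01-02
        System.out.println(converterFor(Kind.DATE, 1).apply("2020-01-02"));    // date:2020-01-02
    }
}
```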
// import java.util.Date; import java.sql.Timestamp; import java.text.ParseException;
// import org.apache.hadoop.hive.ql.metadata.HiveException;
// import org.apache.hadoop.hive.serde2.io.TimestampWritable; import org.apache.hadoop.hive.serde2.io.DateWritable;
private Date convertToDate(PrimitiveCategory inputType, Converter converter, DeferredObject argument) throws HiveException {
    assert converter != null;
    assert argument != null;
    // A NULL argument yields a NULL date
    if (argument.get() == null) {
        return null;
    } else {
        Date date = new Date();
        switch (inputType) {
            case STRING:
            case VARCHAR:
            case CHAR:
                String dateString = converter.convert(argument.get()).toString();
                try {
                    date = this.formatter.parse(dateString);
                } catch (ParseException var8) {
                    // An unparseable string yields NULL instead of an error
                    return null;
                }
                break;
            case TIMESTAMP:
                Timestamp ts = ((TimestampWritable) converter.convert(argument.get())).getTimestamp();
                date.setTime(ts.getTime());
                break;
            case DATE:
                DateWritable dw = (DateWritable) converter.convert(argument.get());
                date = dw.get();
                break;
            default:
                throw new UDFArgumentException("TO_DATE() only takes STRING/TIMESTAMP/DATEWRITABLE types,got " + inputType);
        }
        return date;
    }
}
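convertToDate takes the argument's type, its converter, and the argument value, and returns a java.util.Date. String inputs are parsed with the yyyy-MM-dd pattern, and a string that fails to parse yields null rather than an exception.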
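The string branch can be tried out with the JDK alone. The sketch below mirrors only that branch (parse with yyyy-MM-dd in UTC, return null on failure); the class and method names are made up for illustration. It also shows that DateFormat.parse(String) does not have to consume the whole input, so trailing text after the date is ignored.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

class StringToDateSketch {
    // Mirrors the string branch: parse with yyyy-MM-dd in UTC, null on failure.
    static Date toDate(String text) {
        SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd");
        formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return formatter.parse(text);
        } catch (ParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Text after the date part is ignored, so both inputs map to the same instant.
        System.out.println(toDate("2020-01-02 10:30:00").equals(toDate("2020-01-02"))); // true
        // Input that does not start with a date yields null.
        System.out.println(toDate("not a date")); // null
    }
}
```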
Next come the getDisplayString and evaluate overrides:
@Override
public String getDisplayString(String[] children) {
    return this.getStandardDisplayString("datediff", children);
}

// Day difference between two dates; null if either date is missing
private IntWritable evaluate(Date date, Date date2) {
    if (date != null && date2 != null) {
        long diffInMilliSeconds = date.getTime() - date2.getTime();
        // 86400000 ms = 24 h * 60 min * 60 s * 1000 ms
        this.result.set((int) (diffInMilliSeconds / 86400000L));
        return this.result;
    } else {
        return null;
    }
}

@Override
public IntWritable evaluate(DeferredObject[] arguments) throws HiveException {
    // Convert both arguments to dates, then delegate to the private overload
    this.output = this.evaluate(
        this.convertToDate(this.inputType1, this.inputConverter1, arguments[0]),
        this.convertToDate(this.inputType2, this.inputConverter2, arguments[1]));
    return this.output;
}
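The class first defines a private evaluate overload that computes the number of days between two dates. It then overrides the public evaluate method, which converts both arguments to dates and delegates to that overload.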
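The arithmetic in the private evaluate is a subtraction followed by integer division by the number of milliseconds in a day. Here is a minimal sketch of just that step in plain Java, with hypothetical names and raw millisecond values in place of Date objects. Java integer division truncates toward zero, so a partial day is dropped in either direction.

```java
class DayDiffSketch {
    static final long MILLIS_PER_DAY = 86_400_000L; // 24 * 60 * 60 * 1000

    // Same arithmetic as the private evaluate: subtract, then integer-divide.
    static int dayDiff(long endMillis, long startMillis) {
        return (int) ((endMillis - startMillis) / MILLIS_PER_DAY);
    }

    public static void main(String[] args) {
        long day = MILLIS_PER_DAY;
        System.out.println(dayDiff(3 * day, 1 * day));     // 2
        System.out.println(dayDiff(1 * day, 3 * day));     // -2
        // 36 hours apart: the half day is truncated toward zero.
        System.out.println(dayDiff(day + day / 2, 0));     // 1
        System.out.println(dayDiff(0, day + day / 2));     // -1
    }
}
```

The argument order matters: the first argument is the end date, so swapping the two flips the sign of the result.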
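Summary
After reading through the source, what stands out is how strictly it defines, converts, and checks data types. That is worth carrying over into my own development work.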