Preface
I have recently been focusing on Spark development and am writing down my work and learning along the way, hoping to exchange ideas and grow together with readers.
The full code can be found in my GitHub repository:
https://github.com/guofei1219/RiskControl
The code for this article is located at:
https://github.com/guofei1219/RiskControl/tree/master/src/main/scala/clickstream
For the detailed implementation and reasoning, please refer to my earlier article:
//www.greatytc.com/p/ccba410462ba
Given the limited length of this article, please read up on Maven, IntelliJ IDEA, Scala, and related topics on your own.
Errors found after publication will be corrected and new content added promptly; feel free to follow along.
QQ: 86608625  WeChat: guofei1990123
Background
Spark supports several run modes, such as local and standalone. For convenience during development and testing, local mode is used throughout this article.
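As a reference point, below is a minimal sketch of wiring up a StreamingContext in local mode; the application name matches the main program referred to later, while the 5-second batch interval and the local[2] thread count are illustrative assumptions rather than values taken from the repository.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// run the driver and tasks inside the local JVM with 2 worker threads
val conf = new SparkConf()
  .setAppName("PageViewStream")
  .setMaster("local[2]")
// one micro-batch every 5 seconds (assumed interval)
val ssc = new StreamingContext(conf, Seconds(5))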
Local development environment
IDE: IntelliJ IDEA 2016.1.2
Build/packaging tool: apache-maven-3.3.9
Spark version: 1.3.0
JDK version: jdk1.8.0_66
Scala SDK version: 2.10.4
Kafka version: kafka_2.10
Operating system: Windows 10
Local IP: 192.168.61.1
Implementation approach and key code
- Simulate a Kafka message producer that writes data to the corresponding Kafka topic. The core logic is shown below; the helper values getOsType and click it uses are sketched after the snippet:
val topic = "user_events"
val brokers = "hc4:9092"
val props = new Properties()
props.put("metadata.broker.list", brokers)
props.put("serializer.class", "kafka.serializer.StringEncoder")
val kafkaConfig = new ProducerConfig(props)
val producer = new Producer[String, String](kafkaConfig)
while(true) {
// prepare event data
val event = new JSONObject()
event
.put("uid", UUID.randomUUID()) // randomly generated user id
.put("event_time", System.currentTimeMillis.toString) // event timestamp
.put("os_type", getOsType) // device/OS type
.put("click_count", click) // number of clicks
// produce event message
producer.send(new KeyedMessage[String, String](topic, event.toString))
println("Message sent: " + event)
Thread.sleep(200)
}
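The snippet above references getOsType and click, which are defined in the full source on GitHub. A minimal sketch of what such helpers could look like is given below; the OS type list and the click range are illustrative assumptions, not the values used in the repository.
import scala.util.Random

// hypothetical helper: pick a random OS type for the simulated event
def getOsType: String = {
  val osTypes = Array("android", "ios", "other") // assumed values
  osTypes(Random.nextInt(osTypes.length))
}

// hypothetical helper: a random click count for the simulated event
def click: Long = Random.nextInt(10).toLong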
- The Spark Streaming program consumes the data from the corresponding Kafka topic and applies the business logic.
The core logic for consuming Kafka data is as follows; a sketch of how the per-user click counts used in the next step can be derived is given after the snippet:
// Kafka Topic
val topics = Set("user_events")
// Kafka brokers
val brokers = "hc4:9092"
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> brokers,
  "serializer.class" -> "kafka.serializer.StringEncoder")
// Create a direct stream
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
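The userClicks DStream used in the storage step below is built from kafkaStream in the full source. A minimal sketch of one way to derive it, assuming the JSON fields written by the producer above (uid, click_count) and an org.json-style parser (the JSON library actually used in the repository may differ):
import org.json.JSONObject

// kafkaStream yields (key, value) pairs; the value is the JSON event string
val events = kafkaStream.map { case (_, line) => new JSONObject(line) }

// aggregate the click count per user id within each micro-batch
val userClicks = events
  .map(e => (e.getString("uid"), e.getLong("click_count")))
  .reduceByKey(_ + _)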
- Storing the aggregated results
The core logic for saving the result data to HBase:
/**
 * userClicks.foreachRDD hands you the data of one micro-batch
 * rdd.foreachPartition hands you that batch's partition on each Spark executor node
 * partitionOfRecords.foreach hands you each individual record of that partition
 */
userClicks.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    // HBase configuration
    val tableName = "PageViewStream"
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.zookeeper.quorum", "hc4")
    hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
    hbaseConf.set("hbase.defaults.for.version.skip", "true")
    // open the table once per partition rather than once per record
    val statTable = new HTable(hbaseConf, TableName.valueOf(tableName))
    statTable.setAutoFlush(false, false)
    // client-side write buffer
    statTable.setWriteBufferSize(3 * 1024 * 1024)
    partitionOfRecords.foreach(pair => {
      // user ID
      val uid = pair._1
      // click count
      val click = pair._2
      System.out.println("uid: " + uid + " click: " + click)
      // assemble the row; the table is created beforehand with: create 'PageViewStream','Stat'
      val put = new Put(Bytes.toBytes(uid))
      put.add("Stat".getBytes, "ClickStat".getBytes, Bytes.toBytes(click))
      statTable.put(put)
    })
    // flush the buffered puts for this partition and release the table handle
    statTable.flushCommits()
    statTable.close()
  })
})
Running the example
- Run the Kafka producer simulator (KafkaMessageGenerator)
To run the packaged jar in a Linux environment instead, use:
//java -classpath ./spark-streaming-1.0-SNAPSHOT-shaded.jar guofei.KafkaEventProducer
java -classpath <path to the jar> <fully qualified class name of KafkaMessageGenerator>
- Run the Spark Streaming main program (PageViewStream) and open the Spark UI in a browser to check the job status. The URL is:
http://<local IP>:4040/jobs
- Check the aggregated data in the table through the HBase client (hbase shell):
hbase shell
scan 'PageViewStream'
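In the scan output, each row key is a user id and its click count is stored under the Stat:ClickStat column, matching the Put built in the write logic above. Because the count is written with Bytes.toBytes, the shell displays the value as escaped binary bytes rather than a readable number.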
FAQ
- Running the Streaming main program fails because the Hadoop winutils binary cannot be found
Failed to locate the winutils binary in the hadoop binary path
Running Streaming in local mode requires a local Hadoop installation with HADOOP_HOME configured.
Fix: extract a Hadoop build compiled for Windows (which contains winutils.exe), set the HADOOP_HOME environment variable, and restart IDEA.
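If you prefer not to touch the system environment variables, a common workaround is to point the hadoop.home.dir system property at the extracted directory from inside the driver program, before the SparkContext is created. The path below is an illustrative placeholder:
// assumption: the Windows Hadoop build (with bin\winutils.exe) was extracted to D:\hadoop
System.setProperty("hadoop.home.dir", "D:\\hadoop")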
- The program fails at startup with a NoSuchMethodError (the SecurityManager line is just normal log output)
SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hc-3450); users with modify permissions: Set(hc-3450)
Exception in thread "main" java.lang.NoSuchMethodError: scala.collection.immutable.HashSet$.empty()Lscala/collection/immutable/HashSet;
Cause: the Scala SDK version does not match the Scala version that the Spark and Kafka artifacts were built against.
Fix: switch the Scala SDK to the Scala version matching your Spark and Kafka builds (2.10 for the environment above).
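The Scala version is encoded in the artifact names, so the quickest check is the dependency coordinates. A sketch in sbt notation, assuming the versions listed in the environment section (in a Maven pom the artifactIds carry the same suffix, e.g. spark-streaming-kafka_2.10):
// assumption: Spark 1.3.0 built against Scala 2.10, as in the environment above
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming"       % "1.3.0",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.3.0"  // resolves to spark-streaming-kafka_2.10
)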