Integrating Spark Streaming with Kafka

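This article covers real-time stream processing with Spark Streaming and its integration with the Kafka message queue. In a real-time pipeline, Kafka serves as the message middleware and Spark Streaming as the data processing engine. Depending on how Spark Streaming receives data, the integration takes one of two forms: 1. Receiver mode, in which a long-running receiver task continuously takes in data from Kafka and Spark Streaming processes whatever has been received; 2. Direct mode, in which Spark Streaming itself pulls the data from Kafka.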

1. Receiver mode

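In Receiver mode, once the Spark Streaming application has started, a receiver task receives data from Kafka (through Kafka's high-level consumer API), persists it [default storage level: MEMORY_AND_DISK_SER_2] and replicates it to another node. After the data has been persisted, the offset consumed so far is saved to ZooKeeper. The Driver then uses the location of the stored data to dispatch tasks to the best-placed nodes.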

Code:

Kafka producer code

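import java.util.Properties;
import java.util.Random;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;
import kafka.serializer.StringEncoder;

/**
 * Simulates a Kafka data producer (legacy Kafka 0.8 producer API)
 */
public class KafkaProducer {

    private static String topic_name = "test";

    public static void main(String[] args) {
        Properties prop = new Properties();
        // Broker list and message serializer for the legacy producer
        prop.put("metadata.broker.list", "node01:9092,node02:9092,node03:9092");
        prop.put("serializer.class", StringEncoder.class.getName());
        Producer<Integer, String> producer = new Producer<Integer, String>(new ProducerConfig(prop));

        String[] words = new String[]{"JianShu", "ZhiHu", "CSDN", "BoKeYuan"};
        Random random = new Random();
        // Send a randomly chosen word to the topic in an endless loop
        while (true) {
            int index = random.nextInt(words.length);
            System.out.println(index);
            producer.send(new KeyedMessage<Integer, String>(topic_name, words[index]));
        }
    }
}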

Receiver mode: real-time processing with Spark Streaming

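import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

import scala.Tuple2;

public class SparkStream_Receiver {

    private static String topic_name = "test";

    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        // local[2]: at least two threads are needed, one to run the receiver task and one to process data
        conf.setMaster("local[2]").setAppName("Receiver");
        // Batch interval of 3 seconds: every 3 seconds of data forms one batch of the DStream
        JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(3));

        // topic name -> number of consumer threads used by the receiver
        Map<String, Integer> map = new HashMap<>();
        map.put(topic_name, 1);

        /**
         * Simple word count
         */
        // Arguments: ZooKeeper quorum, consumer group id, topic map
        JavaPairReceiverInputDStream<String, String> receiverInputDStream = KafkaUtils.createStream(jsc,
                "node01:2181,node02:2181,node03:2181", "test", map);
        // Each record is a (key, message) pair; count by the message, which holds the word
        JavaPairDStream<String, Integer> pairWord = receiverInputDStream.mapToPair(record -> new Tuple2<>(record._2(), 1));
        JavaPairDStream<String, Integer> result = pairWord.reduceByKey((x, y) -> x + y);
        result.print();

        // Start Spark Streaming
        jsc.start();
        try {
            jsc.awaitTermination();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        // Stop Spark Streaming
        jsc.stop();
    }
}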

That completes a simple Receiver-mode implementation.

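The problem with Receiver mode:

Suppose the data has been persisted and the offset has been committed to ZooKeeper, and then the driver dies. The executors die with it, so the data being processed at that moment never finishes computing. On restart, the application reads the committed offset from ZooKeeper and continues from there, so the data of the interrupted batch is lost.

Solution: enable the write-ahead log (WAL) by setting spark.streaming.receiver.writeAheadLog.enable to true and configuring a checkpoint directory. Received data is then also backed up to HDFS, and in the situation above it can be read back from HDFS, which prevents the loss. The drawback of this mechanism is the performance cost of writing the data to HDFS.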
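To make the failure sequence concrete, here is a toy model in plain Java. It is only a sketch and not Spark or Kafka code: the class ReceiverLossDemo, the method processedAfterRestart, and the in-memory lists standing in for the Kafka partition, the ZooKeeper offset and the WAL are all hypothetical names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model (not Spark API) of a receiver that commits its offset before the batch is processed. */
final class ReceiverLossDemo {

    /**
     * Simulates: receive the first `received` records, commit the offset,
     * crash before processing, restart. Returns what gets processed after the restart.
     *
     * @param log      stand-in for one Kafka partition
     * @param received number of records received and committed before the crash
     * @param walOn    whether received records were also written to a durable log
     */
    static List<String> processedAfterRestart(List<String> log, int received, boolean walOn) {
        List<String> wal = new ArrayList<>();            // stand-in for the WAL on HDFS
        List<String> executorMemory = new ArrayList<>(); // received blocks held by executors

        // 1. The receiver stores the records, then the offset is committed.
        for (int i = 0; i < received; i++) {
            executorMemory.add(log.get(i));
            if (walOn) {
                wal.add(log.get(i));
            }
        }
        int committedOffset = received; // stand-in for the offset kept in ZooKeeper

        // 2. The driver crashes before the batch is processed; executor memory is gone.
        executorMemory.clear();

        // 3. Restart: replay whatever the WAL holds, then continue from the committed offset.
        List<String> processed = new ArrayList<>(wal);
        processed.addAll(log.subList(committedOffset, log.size()));
        return processed;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("JianShu", "ZhiHu", "CSDN", "BoKeYuan");
        System.out.println("without WAL: " + processedAfterRestart(log, 2, false));
        System.out.println("with WAL: " + processedAfterRestart(log, 2, true));
    }
}
```

In this model the run without the WAL processes only CSDN and BoKeYuan after the restart, so the two committed but unprocessed records are lost, while the run with the WAL processes all four.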

2. Direct mode

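In Direct mode, once the Spark Streaming application has started, it actively pulls data from Kafka and tracks the consumed offsets itself. If a checkpoint directory has been set, the offsets are also persisted to the checkpoint.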
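The idea of an application tracking its own offsets can be sketched in plain Java. This is a toy model under stated assumptions, not the Spark API: the class DirectOffsetTracker and its methods planBatch and markProcessed are hypothetical, and a single in-memory map stands in for the offsets that Spark Streaming keeps (and optionally checkpoints).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model (not Spark API) of a Direct-style consumer that tracks its own offsets. */
final class DirectOffsetTracker {

    // Next offset to read for each partition, kept by the application instead of ZooKeeper.
    private final Map<Integer, Long> nextOffset = new LinkedHashMap<>();

    /** Plans the half-open range [from, until) to pull for one partition in the next batch. */
    long[] planBatch(int partition, long latestOffsetInKafka) {
        long from = nextOffset.getOrDefault(partition, 0L);
        long until = Math.max(from, latestOffsetInKafka);
        return new long[] {from, until};
    }

    /** Advances the stored offset only after the batch has been processed successfully. */
    void markProcessed(int partition, long until) {
        nextOffset.put(partition, until);
    }

    public static void main(String[] args) {
        DirectOffsetTracker tracker = new DirectOffsetTracker();

        long[] first = tracker.planBatch(0, 5);
        System.out.println("planned: " + first[0] + " to " + first[1]);

        // Simulated failure: markProcessed is never called for the first plan,
        // so the unprocessed offsets are planned again.
        long[] retry = tracker.planBatch(0, 5);
        System.out.println("replanned: " + retry[0] + " to " + retry[1]);

        tracker.markProcessed(0, retry[1]);
        long[] next = tracker.planBatch(0, 9);
        System.out.println("next: " + next[0] + " to " + next[1]);
    }
}
```

Because the stored offset moves only after a batch succeeds, a failed batch is read again rather than skipped, which is the property the Receiver mode lacks without the WAL.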

Code: [the Kafka producer code is the same as in Receiver mode]

Direct mode:

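import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

import scala.Tuple2;

public class SparkStream_Direct {

    private static String topic_name = "test";

    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        // Direct mode runs no receiver task, so no thread is reserved for receiving;
        // local[2] is simply kept from the Receiver example
        conf.setMaster("local[2]").setAppName("Direct");
        // Batch interval of 3 seconds: every 3 seconds of data forms one batch of the DStream
        JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(3));

        // Connect to the Kafka brokers directly instead of going through ZooKeeper
        Map<String, String> kafkaconf = new HashMap<>();
        kafkaconf.put("metadata.broker.list", "node01:9092,node02:9092,node03:9092");
        Set<String> topics = new HashSet<>();
        topics.add(topic_name);

        /**
         * Simple word count
         */
        // This is where the code differs from Receiver mode
        JavaPairInputDStream<String, String> directStream = KafkaUtils.createDirectStream(jsc, String.class, String.class,
                StringDecoder.class, StringDecoder.class, kafkaconf, topics);
        // Each record is a (key, message) pair; count by the message, which holds the word
        JavaPairDStream<String, Integer> pairWord = directStream.mapToPair(record -> new Tuple2<>(record._2(), 1));
        JavaPairDStream<String, Integer> result = pairWord.reduceByKey((x, y) -> x + y);
        result.print();

        // Start Spark Streaming
        jsc.start();
        try {
            jsc.awaitTermination();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        // Stop Spark Streaming
        jsc.stop();
    }
}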
