Learning PySpark: Sorted WordCount
Environment:
1. Set up a Spark cluster.
2. Set up Python. In the python folder under the Spark installation directory, run `python setup.py install` to install the pyspark library.
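Alternatively, recent versions of pyspark are published on PyPI, so it can usually be installed with pip instead of the setup.py step (this is an alternative the original note does not cover; version output will vary by installation):

```shell
# Install the pyspark library from PyPI (pulls in py4j as a dependency)
pip install pyspark

# Verify that the module is importable
python -c "import pyspark; print(pyspark.__version__)"
```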
3. Code:
import os
from pyspark.context import SparkContext

# os.environ['JAVA_HOME'] = "/usr/local/java/jdk1.8.0_231"

if __name__ == "__main__":
    sc = SparkContext(appName="WordCount")
    lines = sc.textFile("All along the River.txt")
    output = lines.flatMap(lambda line: line.split(' ')) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b) \
                  .sortBy(lambda x: x[1], False) \
                  .take(50)
    for (word, count) in output:
        print("%s : %i" % (word, count))
    sc.stop()
Notes:
lines.flatMap(lambda line: line.split(' '))  # split each line on spaces
.map(lambda word: (word, 1))                 # emit (word, 1) for each word
.reduceByKey(lambda a, b: a + b)             # sum the values of identical keys
.sortBy(lambda x: x[1], False)               # sort by value in descending order
.take(50)                                    # take the first 50 entries
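To see what each of the steps above computes, the same pipeline can be sketched in plain Python without Spark; the `text` string here is a made-up sample, not the book used in the original code:

```python
from collections import Counter

# Hypothetical sample input standing in for the text file
text = "the quick brown fox jumps over the lazy dog the fox"

words = text.split(' ')   # flatMap: split into words
counts = Counter(words)   # map + reduceByKey: count occurrences per word
# sortBy descending on the count, then take(50)
top = sorted(counts.items(), key=lambda x: x[1], reverse=True)[:50]

for word, count in top:
    print("%s : %i" % (word, count))
```

Running this prints `the : 3` first, since "the" occurs three times in the sample.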
Error:
- Exception: Java gateway process exited before sending its port number
Fix: add this line (adjusted to your JDK path) before creating the SparkContext:
os.environ['JAVA_HOME'] = "/usr/local/java/jdk1.8.0_231"