Configuring PyCharm for PySpark Development, with an Example

To develop PySpark programs in PyCharm, two settings are needed:
1. Install pyspark (for example via pip install pyspark); it installs py4j automatically.
2. In Edit Configuration, add a content root and select the pyspark.zip and py4j-*-src.zip archives under python/lib in the Spark download. A quick smoke test for the setup follows below.
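To confirm that the interpreter can actually see PySpark, a minimal smoke test (the appName and the local[*] master here are arbitrary choices for a local run):

from pyspark.sql import SparkSession

# If the configuration is correct, this prints the PySpark version and exits cleanly
spark = SparkSession.builder.appName("smoke-test").master("local[*]").getOrCreate()
print(spark.version)
spark.stop()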

Code example:

from pyspark.sql import Row
from pyspark.sql import SparkSession

logFile = "file:///Users/peishuaishuai/tmp/sparktest.txt"  # Should be some file on your system
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

# Parse each line as "name<TAB>grade", keep only well-formed lines,
# and turn them into Row objects
rows = spark.read.text(logFile).rdd.map(
    lambda x: str(x[0]).split("\t")
).filter(
    lambda x: len(x) == 2
).map(
    lambda x: Row(name=x[0], grade=int(x[1]))
)

# Infer a schema from the Rows and register the result as a temp view
schemaData = spark.createDataFrame(rows)
schemaData.createOrReplaceTempView("tb")

print(schemaData.count())
schemaData.printSchema()

# Sum grades per name via Spark SQL, then format each row as "name<TAB>total"
datas = spark.sql("select name,sum(grade) from tb group by name").rdd.map(
    lambda x: "\t".join([x[0], str(x[1])])
)

# Write three output partitions; saveAsTextFile fails if the target directory already exists
datas.repartition(3).saveAsTextFile("file:///Users/peishuaishuai/tmp/sparktest_output")

spark.stop()
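The same job can also stay in the DataFrame API instead of dropping down to RDDs, which keeps the schema (and the IDE's completion) intact end to end. A sketch under the same input path; the _df output directory and the overwrite mode are my own choices, not from the original run, and this version does not replicate the len == 2 filter (malformed lines would surface as nulls instead):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SimpleAppDF").getOrCreate()

# Read the tab-separated file straight into a typed two-column DataFrame
df = spark.read.option("sep", "\t").schema("name STRING, grade LONG").csv(
    "file:///Users/peishuaishuai/tmp/sparktest.txt"
)

# Same aggregation as the SQL above, but through the DataFrame API
agg = df.groupBy("name").agg(F.sum("grade").alias("total"))

# "overwrite" avoids the directory-already-exists error of saveAsTextFile;
# the _df output path is a made-up name for this sketch
agg.coalesce(3).write.mode("overwrite").option("sep", "\t").csv(
    "file:///Users/peishuaishuai/tmp/sparktest_output_df"
)

spark.stop()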


The input data is (columns are tab-separated):

name1   11
name2   12
name3   13
name4   14
name5   15
name1   16
name2   17
name3   18
name4   19
name5   20
name11  21
name12  22
name1   23
name2   24
name3   25
name4   26
name5   27
name18  28
name19  29
name20  30
name21  31
name1   32
name2   33
name3   34
name4   35
name5   36
name27  37
name28  38
name29  39
name1   40
name2   41
name3   42
name4   43

The printed output is:

33
root
 |-- grade: long (nullable = true)
 |-- name: string (nullable = true)

(grade appears before name in the schema because older PySpark versions sort a Row's keyword fields alphabetically.)

The output files contain:

name3    132
name19  29
name2   127
name12  22
name11  21
name20  30
name28  38
name27  37
name5   98
name29  39
name21  31
name4   137
name1   122
name18  28

A pain point of PySpark development: when calls are chained too deeply, type information gets lost and the IDE stops offering completions, which is quite annoying. Assigning intermediate results to annotated variables restores the hints, as sketched below.
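A minimal sketch of that workaround, assuming PySpark 3.x (whose API ships type hints and makes RDD generic); on older versions the third-party pyspark-stubs package provides the same annotations:

from pyspark.rdd import RDD
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("typed-steps").getOrCreate()

# Each step is annotated, so PyCharm knows the element type at every stage
lines: RDD[str] = spark.read.text(
    "file:///Users/peishuaishuai/tmp/sparktest.txt"
).rdd.map(lambda r: str(r[0]))
fields: RDD[list] = lines.map(lambda s: s.split("\t"))
records: RDD[Row] = fields.filter(lambda f: len(f) == 2).map(
    lambda f: Row(name=f[0], grade=int(f[1]))
)

spark.stop()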

Having compared Python, Scala, and Java, I find Java the most dependable for writing large Spark codebases: it is strongly typed, so code completion is direct and pleasant.
