Official Python example: calling Kafka from PySpark

Z时代
2024-01-10
Category: General

"""
 Consumes messages from one or more topics in Kafka and does wordcount.
 Usage: structured_kafka_wordcount.py <bootstrap-servers> <subscribe-type> <topics>
   <bootstrap-servers> The Kafka "bootstrap.servers" configuration. A
   comma-separated list of host:port.
   <subscribe-type> One of three types: "assign", "subscribe",
   "subscribePattern".
   |- <assign> Specific TopicPartitions to consume. JSON string
   |  {"topicA":[0,1],"topicB":[2,4]}.
   |- <subscribe> The topic list to subscribe to. A comma-separated list of
   |  topics.
   |- <subscribePattern> The pattern used to subscribe to topic(s).
   |  Java regex string.
   |- Only one of the "assign", "subscribe" or "subscribePattern" options can be
   |  specified for the Kafka source.
   <topics> The value format depends on the value of "subscribe-type".

 Run the example
    `$ bin/spark-submit examples/src/main/python/sql/streaming/structured_kafka_wordcount.py \
    host1:port1,host2:port2 subscribe topic1,topic2`
"""
from __future__ import print_function

import sys

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("""
        Usage: structured_kafka_wordcount.py <bootstrap-servers> <subscribe-type> <topics>
        """, file=sys.stderr)
        sys.exit(-1)

    bootstrapServers = sys.argv[1]
    subscribeType = sys.argv[2]
    topics = sys.argv[3]

    spark = (
        SparkSession.builder
        .appName("StructuredKafkaWordCount")
        .getOrCreate()
    )

    # Create a DataFrame representing the stream of input lines from Kafka.
    # subscribeType is used as the option name, topics as its value.
    lines = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrapServers)
        .option(subscribeType, topics)
        .load()
        .selectExpr("CAST(value AS STRING)")
    )

    # Split the lines into words
    words = lines.select(
        # explode turns each item in an array into a separate row
        explode(
            split(lines.value, " ")
        ).alias("word")
    )

    # Generate running word count
    wordCounts = words.groupBy("word").count()

    # Start running the query that prints the running counts to the console
    query = (
        wordCounts.writeStream
        .outputMode("complete")
        .format("console")
        .start()
    )

    query.awaitTermination()
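
The docstring spells out rules for `<subscribe-type>` and `<topics>`, but the script passes both straight to `.option(...)` without checking them. Below is a minimal sketch of a pre-flight check in plain Python. The helper name `check_subscription` and the constant `VALID_SUBSCRIBE_TYPES` are hypothetical and are not part of Spark or of the original example. The checks only mirror what the docstring states; Spark and Kafka may accept or reject more than this.

```python
import json

# Option names listed in the example's docstring.
VALID_SUBSCRIBE_TYPES = ("assign", "subscribe", "subscribePattern")


def check_subscription(subscribe_type, topics):
    """Hypothetical helper: return the pair unchanged, or raise ValueError."""
    if subscribe_type not in VALID_SUBSCRIBE_TYPES:
        raise ValueError("unknown subscribe type: %r" % (subscribe_type,))

    if subscribe_type == "assign":
        # Expected shape per the docstring: {"topicA":[0,1],"topicB":[2,4]}
        # json.loads raises a ValueError subclass on malformed JSON.
        spec = json.loads(topics)
        if not isinstance(spec, dict) or not spec:
            raise ValueError("assign expects a non-empty JSON object")
        for topic, partitions in spec.items():
            if not isinstance(partitions, list) or not all(
                isinstance(p, int) for p in partitions
            ):
                raise ValueError("partitions for %r must be a list of ints" % (topic,))
    elif subscribe_type == "subscribe":
        # Comma-separated list of topics; reject empty entries such as "a,,b".
        if not all(name.strip() for name in topics.split(",")):
            raise ValueError("empty topic name in %r" % (topics,))
    else:
        # subscribePattern is a Java regex; it is not compiled here because
        # Python's regex syntax differs. Only emptiness is checked.
        if not topics:
            raise ValueError("subscribePattern must not be empty")

    return subscribe_type, topics
```

In the script above, this could be called as `check_subscription(subscribeType, topics)` right after the arguments are read, so that a mistyped option name fails before the streaming query starts.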

Source link: utcz.com/z/512900.html
