How do I return a "tuple type" from a UDF in PySpark?

All the data types in pyspark.sql.types are:

__all__ = [
    "DataType", "NullType", "StringType", "BinaryType", "BooleanType", "DateType",
    "TimestampType", "DecimalType", "DoubleType", "FloatType", "ByteType", "IntegerType",
    "LongType", "ShortType", "ArrayType", "MapType", "StructField", "StructType"]

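I have to write a UDF (in pyspark) that returns an array of tuples. What should I pass as its second argument, which is the return type of the udf method? It would be something along the lines of ArrayType(TupleType()).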

Answer:

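There is no such thing as a TupleType in Spark. Product types are represented as structs with fields of specific types. For example, if you want to return an array of (string, integer) pairs, you can use a schema like this: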

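from pyspark.sql.types import (
    ArrayType, IntegerType, StringType, StructField, StructType
)

# An array of structs; each struct is one (char, count) pair.
# The third argument to StructField is nullable=False.
schema = ArrayType(StructType([
    StructField("char", StringType(), False),
    StructField("count", IntegerType(), False)
]))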

Example usage:

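from pyspark.sql.functions import udf
from collections import Counter

# The lambda returns a list of (char, count) tuples, which Spark
# converts to the array-of-structs type declared in schema.
char_count_udf = udf(
    lambda s: Counter(s).most_common(),
    schema
)

# Assumes an active SparkContext named sc, as in the pyspark shell.
df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["id", "value"])

# show(2, False) prints up to 2 rows without truncating column values.
df.select("*", char_count_udf(df["value"])).show(2, False)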

## +---+-----+-------------------------+
## |id |value|PythonUDF#<lambda>(value)|
## +---+-----+-------------------------+
## |1  |foo  |[[o,2], [f,1]]           |
## |2  |bar  |[[r,1], [a,1], [b,1]]    |
## +---+-----+-------------------------+
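
To see why the return value fits the schema, it may help to look at what the lambda produces before Spark touches it. The following is a minimal sketch in plain Python (no Spark required); the function name char_counts is a hypothetical stand-in for the lambda and is not part of the original answer.

```python
from collections import Counter
from typing import List, Tuple


def char_counts(s: str) -> List[Tuple[str, int]]:
    """Return (character, count) pairs, most frequent first.

    Hypothetical named equivalent of the lambda used in the UDF.
    """
    return Counter(s).most_common()


pairs = char_counts("foo")
print(pairs)  # [('o', 2), ('f', 1)]

# Each tuple holds the string first and the integer second, which is the
# same order as the "char" and "count" fields declared in the schema.
for char, count in pairs:
    assert isinstance(char, str) and isinstance(count, int)
```

A named function like this should be usable in place of the lambda, as in udf(char_counts, schema). Characters with equal counts may come out in a different order depending on the Python version, so the row for "bar" will not necessarily match the output shown above.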
