PySpark: Using the Tokenizer

Mapping words

I am starting my journey with PySpark and I am stuck at one point. For example, I have this code (taken from https://spark.apache.org/docs/2.1.0/ml-features.html):

from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

sentenceDataFrame = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat")
], ["id", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")

regexTokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="\\W")
# alternatively, pattern="\\w+", gaps(False)

countTokens = udf(lambda words: len(words), IntegerType())

tokenized = tokenizer.transform(sentenceDataFrame)
tokenized.select("sentence", "words") \
    .withColumn("tokens", countTokens(col("words"))).show(truncate=False)

regexTokenized = regexTokenizer.transform(sentenceDataFrame)
regexTokenized.select("sentence", "words") \
    .withColumn("tokens", countTokens(col("words"))).show(truncate=False)
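As a side note, counting tokens does not actually require a Python UDF: Spark's built-in size function returns the length of an array column natively, which avoids the Python serialization round-trip of a UDF. A minimal sketch of that variant, reusing the tokenized DataFrame from above:

from pyspark.sql.functions import col, size

# size() computes the array length inside the JVM,
# so no Python UDF round-trip is needed
tokenized.select("sentence", "words") \
    .withColumn("tokens", size(col("words"))).show(truncate=False)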

And then I added something like this:

test = sqlContext.createDataFrame([
    (0, "spark"),
    (1, "java"),
    (2, "i")
], ["id", "word"])

The output is:

+---+-----------------------------------+------------------------------------------+------+
|id |sentence                           |words                                     |tokens|
+---+-----------------------------------+------------------------------------------+------+
|0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |5     |
|1  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7     |
|2  |Logistic,regression,models,are,neat|[logistic, regression, models, are, neat] |5     |
+---+-----------------------------------+------------------------------------------+------+

Is it possible to achieve something like this ([id from "test", id from "regexTokenized"]):

2, 0
2, 1
1, 1
0, 1

so that, based on the tokenized "words", I can map "test" against "regexTokenized" and get the IDs from both datasets? Or should I take a different approach?

Thanks in advance for any help :)

Answer:

Use explode and join:

from pyspark.sql.functions import explode

(testTokenized.alias("test")
    .select("id", explode("words").alias("word"))
    .join(
        trainTokenized.select("id", explode("words").alias("word")).alias("train"),
        "word"))
