How to set new list values based on a condition in a PySpark DataFrame?

I have a DataFrame like the one below. How can I set new values for a list (array) column based on a condition in PySpark?

+---+------------------------------------------+
|id |features                                  |
+---+------------------------------------------+
|1  |[6.629056, 0.26771536, 0.79063195, 0.8923]|
|2  |[1.4850719, 0.66458416, -2.1034079]       |
|3  |[3.0975454, 1.571849, 1.9053307]          |
|4  |[2.526619, -0.33559006, -1.4565022]       |
|5  |[-0.9286196, -0.57326394, 4.481531]       |
|6  |[3.594114, 1.3512149, 1.6967168]          |
+---+------------------------------------------+

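I want to replace the features values for some rows based on a condition, namely where id=1, id=2 or id=6.

Where id=1, the current features value is [6.629056, 0.26771536, 0.79063195, 0.8923], and I want to set it to [0, 0, 0, 0].

Where id=2, the current features value is [1.4850719, 0.66458416, -2.1034079], and I want to set it to [0, 0, 0].

My final output should be: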

+---+-----------------------------------+
|id |features                           |
+---+-----------------------------------+
|1  |[0, 0, 0, 0]                       |
|2  |[0, 0, 0]                          |
|3  |[3.0975454, 1.571849, 1.9053307]   |
|4  |[2.526619, -0.33559006, -1.4565022]|
|5  |[-0.9286196, -0.57326394, 4.481531]|
|6  |[0, 0, 0]                          |
+---+-----------------------------------+

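Answer:

Shaido's answer is fine if you have a limited set of ids and you also know the length of the corresponding features array.

If not, it is cleaner to use a UDF, and the ids to convert can be loaded from a separate Seq:

In Scala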

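import org.apache.spark.sql.functions.udf
import scala.collection.mutable.WrappedArray
import spark.implicits._  // assumes a SparkSession named spark; enables $"col"

val arr = Seq(1, 2, 6)

// Replace the array with zeros of the same length when the id is in arr
val fillArray = udf { (id: Int, array: WrappedArray[Double]) =>
  if (arr.contains(id)) Seq.fill[Double](array.length)(0.0)
  else array
}

df.withColumn("new_features", fillArray($"id", $"features")).show(false)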

In Python

from pyspark.sql import functions as f
from pyspark.sql.types import ArrayType, DoubleType

arr = [1, 2, 6]

def fillArray(id, features):
    # Zero out the array, keeping its length, when the id is in arr
    if id in arr:
        return [0.0] * len(features)
    return features

fill_array_udf = f.udf(fillArray, ArrayType(DoubleType()))

# truncate=False prints the full arrays, as in the output below
df.withColumn("new_features", fill_array_udf(f.col("id"), f.col("features"))).show(truncate=False)

Output

+---+------------------------------------------+-----------------------------------+
|id |features                                  |new_features                       |
+---+------------------------------------------+-----------------------------------+
|1  |[6.629056, 0.26771536, 0.79063195, 0.8923]|[0.0, 0.0, 0.0, 0.0]               |
|2  |[1.4850719, 0.66458416, -2.1034079]       |[0.0, 0.0, 0.0]                    |
|3  |[3.0975454, 1.571849, 1.9053307]          |[3.0975454, 1.571849, 1.9053307]   |
|4  |[2.526619, -0.33559006, -1.4565022]       |[2.526619, -0.33559006, -1.4565022]|
|5  |[-0.9286196, -0.57326394, 4.481531]       |[-0.9286196, -0.57326394, 4.481531]|
|6  |[3.594114, 1.3512149, 1.6967168]          |[0.0, 0.0, 0.0]                    |
+---+------------------------------------------+-----------------------------------+
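
One thing to watch with a Python UDF like this: if features is null for some row, len(features) raises a TypeError. Below is a minimal sketch of a null-safe variant that also keeps the ids in a set for faster membership checks. The names ids_to_zero, zero_if_selected and make_fill_udf are hypothetical (not from the answer), and it assumes features is an array<double> column. pyspark is imported inside make_fill_udf only so that the plain function can be checked without a Spark session.

```python
ids_to_zero = frozenset([1, 2, 6])  # hypothetical name; same ids as arr in the answer


def zero_if_selected(row_id, features, ids=ids_to_zero):
    """Return zeros of the same length when row_id is selected, else the input."""
    if features is None:
        # Null arrays pass through unchanged instead of failing on len(None)
        return None
    if row_id in ids:
        return [0.0] * len(features)
    return features


def make_fill_udf(ids=ids_to_zero):
    """Wrap zero_if_selected as a Spark UDF returning array<double>."""
    # Imported here so the plain function above can be tested without Spark
    from pyspark.sql import functions as f
    from pyspark.sql.types import ArrayType, DoubleType

    return f.udf(
        lambda row_id, features: zero_if_selected(row_id, features, ids),
        ArrayType(DoubleType()),
    )


# Plain-Python check of the logic, no Spark session needed
print(zero_if_selected(1, [6.629056, 0.26771536, 0.79063195, 0.8923]))  # [0.0, 0.0, 0.0, 0.0]
print(zero_if_selected(3, [3.0975454, 1.571849, 1.9053307]))  # unchanged
```

With a DataFrame df shaped like the one in the question, this would be applied as df.withColumn("new_features", make_fill_udf()(f.col("id"), f.col("features"))), with f being pyspark.sql.functions.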

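Answer:

Use when and otherwise if you only have a small set of ids to change:

from pyspark.sql.functions import array, lit, when

# lit(0.0) rather than lit(0) so every branch has type array<double>
df.withColumn("features",
    when(df.id == 1, array(lit(0.0), lit(0.0), lit(0.0), lit(0.0)))
    .when((df.id == 2) | (df.id == 6), array(lit(0.0), lit(0.0), lit(0.0)))
    .otherwise(df.features))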

This should be faster than a UDF, but with many ids it quickly turns into a lot of code. In that case, use a UDF as in philantrovert's answer.
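
For a long list of ids, a possible middle ground that stays with built-in functions is to combine an IN test with Spark SQL's higher-order transform function (available since Spark 2.4), so that neither the array lengths nor one branch per id has to be written out. This is only a minimal sketch, assuming id is an integer column and features is array<double>. The helper names zero_fill_expr and with_zeroed_features are hypothetical and not part of either answer, and I have not benchmarked it against the UDF.

```python
def zero_fill_expr(ids, id_col="id", arr_col="features"):
    """Build a Spark SQL expression that zeroes arr_col for the given ids."""
    ids = list(ids)
    if not ids:
        # Nothing selected: an empty IN () would be invalid SQL, so keep the column
        return arr_col
    # int() guards against injecting arbitrary text into the SQL string
    id_list = ", ".join(str(int(i)) for i in ids)
    return (
        f"CASE WHEN {id_col} IN ({id_list}) "
        f"THEN transform({arr_col}, x -> CAST(0 AS DOUBLE)) "
        f"ELSE {arr_col} END"
    )


def with_zeroed_features(df, ids):
    """Return df with the features column zeroed for the given ids."""
    # Imported lazily so the string builder can be used without Spark installed
    from pyspark.sql import functions as F

    return df.withColumn("features", F.expr(zero_fill_expr(ids)))


print(zero_fill_expr([1, 2, 6]))
# CASE WHEN id IN (1, 2, 6) THEN transform(features, x -> CAST(0 AS DOUBLE)) ELSE features END
```

Calling with_zeroed_features(df, [1, 2, 6]) would then replace features in place. Because transform maps every element to 0.0, the zeroed array keeps the original length of each row.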

