PySpark: how to create a DataFrame from another DataFrame
I'm using PySpark v1.6.1 and I want to create a DataFrame from another one by:
- converting a column that holds a struct of three values into three separate columns
- converting a timestamp from string to datetime
- creating additional columns from that timestamp
- changing column names and types
Right now I'm doing this with .map(func), using a function that takes a Row of the original DataFrame and returns a new Row, roughly as sketched below. But this creates an RDD, which is what I'd like to avoid.
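Roughly, the current approach looks like this (a minimal sketch only; the column names person and datetime_str and the conversion logic are hypothetical placeholders):

# hypothetical RDD round-trip: map each Row to a new Row, then rebuild a DataFrame
from datetime import datetime
from pyspark.sql import Row

def convert(row):
    # flatten a struct column and parse a string timestamp into a datetime
    ts = datetime.strptime(row.datetime_str, '%m-%d-%Y %H:%M:%S')
    return Row(name=row.person.name, age=row.person.age, sex=row.person.sex, event_ts=ts)

new_df = old_df.rdd.map(convert).toDF()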
Is there a better way to do this?
Thanks!
Answer:
Hope this helps!
from pyspark.sql.functions import unix_timestamp, col, to_date, struct

####
#sample data 
#### 
df = sc.parallelize([[25, 'Prem', 'M', '12-21-2006 11:00:05','abc', '1'], 
         [20, 'Kate', 'F', '05-30-2007 10:05:00', 'asdf', '2'], 
         [40, 'Cheng', 'M', '12-30-2017 01:00:01', 'qwerty', '3']]).\ 
    toDF(["age","name","sex","datetime_in_strFormat","initial_col_name","col_in_strFormat"]) 
#create 'struct' type column by combining first 3 columns of sample data - (this is built to answer query #1) 
df = df.withColumn("struct_col", struct('age', 'name', 'sex')).\ 
    drop('age', 'name', 'sex') 
df.show() 
df.printSchema() 
#### 
#query 1 
#### 
#Convert a field that has a struct of three values (i.e. 'struct_col') in different columns (i.e. 'name', 'age' & 'sex') 
df = df.withColumn('name', col('struct_col.name')).\ 
    withColumn('age', col('struct_col.age')).\ 
    withColumn('sex', col('struct_col.sex')).\ 
    drop('struct_col') 
df.show() 
df.printSchema() 
#### 
#query 2 
#### 
#Convert the timestamp from string (i.e. 'datetime_in_strFormat') to datetime (i.e. 'datetime_in_tsFormat') 
df = df.withColumn('datetime_in_tsFormat', 
        unix_timestamp(col('datetime_in_strFormat'), 'MM-dd-yyyy hh:mm:ss').cast("timestamp")) 
df.show() 
df.printSchema() 
#### 
#query 3 
#### 
#create more columns using above timestamp (e.g. fetch date value from timestamp column) 
df = df.withColumn('datetime_in_dateFormat', to_date(col('datetime_in_tsFormat'))) 
df.show() 
#### 
#query 4.a 
#### 
#Change column name (e.g. 'initial_col_name' is renamed to 'new_col_name') 
df = df.withColumnRenamed('initial_col_name', 'new_col_name') 
df.show() 
#### 
#query 4.b 
#### 
#Change column type (e.g. string type in 'col_in_strFormat' is converted to double type in 'col_in_doubleFormat') 
df = df.withColumn("col_in_doubleFormat", col('col_in_strFormat').cast("double")) 
df.show() 
df.printSchema() 
Sample data:
+---------------------+----------------+----------------+------------+
|datetime_in_strFormat|initial_col_name|col_in_strFormat|  struct_col|
+---------------------+----------------+----------------+------------+
|  12-21-2006 11:00:05|             abc|               1| [25,Prem,M]|
|  05-30-2007 10:05:00|            asdf|               2| [20,Kate,F]|
|  12-30-2017 01:00:01|          qwerty|               3|[40,Cheng,M]|
+---------------------+----------------+----------------+------------+
root 
|-- datetime_in_strFormat: string (nullable = true) 
|-- initial_col_name: string (nullable = true) 
|-- col_in_strFormat: string (nullable = true) 
|-- struct_col: struct (nullable = false) 
|    |-- age: long (nullable = true) 
|    |-- name: string (nullable = true) 
|    |-- sex: string (nullable = true) 
Final output data:
+---------------------+------------+----------------+-----+---+---+--------------------+----------------------+-------------------+
|datetime_in_strFormat|new_col_name|col_in_strFormat| name|age|sex|datetime_in_tsFormat|datetime_in_dateFormat|col_in_doubleFormat|
+---------------------+------------+----------------+-----+---+---+--------------------+----------------------+-------------------+
|  12-21-2006 11:00:05|         abc|               1| Prem| 25|  M| 2006-12-21 11:00:05|            2006-12-21|                1.0|
|  05-30-2007 10:05:00|        asdf|               2| Kate| 20|  F| 2007-05-30 10:05:00|            2007-05-30|                2.0|
|  12-30-2017 01:00:01|      qwerty|               3|Cheng| 40|  M| 2017-12-30 01:00:01|            2017-12-30|                3.0|
+---------------------+------------+----------------+-----+---+---+--------------------+----------------------+-------------------+
root 
|-- datetime_in_strFormat: string (nullable = true) 
|-- new_col_name: string (nullable = true) 
|-- col_in_strFormat: string (nullable = true) 
|-- name: string (nullable = true) 
|-- age: long (nullable = true) 
|-- sex: string (nullable = true) 
|-- datetime_in_tsFormat: timestamp (nullable = true) 
|-- datetime_in_dateFormat: date (nullable = true) 
|-- col_in_doubleFormat: double (nullable = true) 
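For reference, the same transformations can also be expressed in a single select() instead of chained withColumn() calls. This is just a minimal sketch, assuming the original sample DataFrame df from above (i.e. the version that still contains struct_col):

from pyspark.sql.functions import col, unix_timestamp, to_date

result = df.select(
    col('datetime_in_strFormat'),
    col('initial_col_name').alias('new_col_name'),                        # rename column
    col('struct_col.name'),                                               # flatten struct
    col('struct_col.age'),
    col('struct_col.sex'),
    unix_timestamp(col('datetime_in_strFormat'), 'MM-dd-yyyy hh:mm:ss')
        .cast('timestamp').alias('datetime_in_tsFormat'),                 # string -> timestamp
    col('col_in_strFormat').cast('double').alias('col_in_doubleFormat'))  # string -> double
result = result.withColumn('datetime_in_dateFormat',
                           to_date(col('datetime_in_tsFormat')))          # date from timestamp
result.show()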