SparkSQL中的Concat

我有一个模式的菲尼克斯表(电子邮件,产品名称,购买日期,数量)。 我在Spark Dataframe中加载这个表来处理。SparkSQL中的Concat

Email  | ProductName | PurchaseDate    | Quantity 

[email protected] Dell 2016-03-31 14:30:00.0 5

[email protected] Lenovo 2016-03-31 14:30:00.0 2

[email protected] Intel 2016-04-21 14:30:00.0 14

[email protected] Lenovo 2016-06-31 14:30:00.0 3

[email protected] Nokia 2016-03-21 14:30:00.0 5

的投入将是一个EMAILID和截止日期进出应该是以下格式: 如:输入“[email protected]”和日期小于“2016年4月1日00:00: 00.0

电子邮件| ProductList |总量 [email protected]诺基亚 - >戴尔 - >英特尔19

这基本上是连接的列和数量的总和。

我无法获得上述格式的输出。

任何帮助? 在此先感谢!

更新: 由于@Daniel de Paula建议,并且由于collect_list属于HiveContext,因此我已更改为val sqlContext = new HiveContext(sc)。我得到这个错误,
2611 [主] INFO DataNucleus.Persistence - 物业datanucleus.cache.level2未知 - 会被忽略

2611 [main] INFO DataNucleus.Persistence - Property hive.metastore.integral.jdo.pushdown unknown - will be ignored 

4805 [main] INFO DataNucleus.Datastore - The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.

4806 [main] INFO DataNucleus.Datastore - The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.

5859 [main] INFO DataNucleus.Datastore - The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.

5859 [main] INFO DataNucleus.Datastore - The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.

6671 [main] INFO DataNucleus.Datastore - The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.

7785 [main] INFO DataNucleus.Persistence - Property datanucleus.cache.level2 unknown - will be ignored

7785 [main] INFO DataNucleus.Persistence - Property hive.metastore.integral.jdo.pushdown unknown - will be ignored

10488 [main] INFO DataNucleus.Datastore - The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.

10489 [main] INFO DataNucleus.Datastore - The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.

10778 [main] INFO DataNucleus.Datastore - The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.

10778 [main] INFO DataNucleus.Datastore - The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.

11008 [main] INFO DataNucleus.Query - Reading in results for query "[email protected]" since the connection used is closing

11128 [main] ERROR org.apache.hadoop.hive.metastore.ObjectStore - Version information found in metastore differs 2.1.0 from expected schema version 1.2.0. Schema verififcation is disabled hive.metastore.schema.verification so setting version.

11271 [main] WARN hive.ql.metadata.Hive - Failed to access metastore. This class should not accessed in runtime.

org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)

at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)

at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166)

at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)

at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:194)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)

at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:526)

at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:249)

at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:327)

at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)

at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)

at org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:226)

at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:229)

at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)

at SimpleApp$.<init>(SimpleApp.scala:50)

at SimpleApp$.<clinit>(SimpleApp.scala)

at SimpleApp.main(SimpleApp.scala)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)

at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)

at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)

at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)

at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)

at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)

at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)

at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)

at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)

at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)

... 27 more

Caused by: java.lang.reflect.InvocationTargetException

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)

at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:526)

at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)

... 33 more

Caused by: java.lang.NoSuchMethodError: org.apache.thrift.EncodingUtils.setBit(BIZ)B

at org.apache.hadoop.hive.metastore.api.PrivilegeGrantInfo.setCreateTimeIsSet(PrivilegeGrantInfo.java:245)

at org.apache.hadoop.hive.metastore.api.PrivilegeGrantInfo.<init>(PrivilegeGrantInfo.java:163)

at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultRoles_core(HiveMetaStore.java:675)

at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultRoles(HiveMetaStore.java:645)

at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:462)

at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)

at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)

at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)

at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)

at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)

... 38 more

11288 [main] INFO DataNucleus.Query - Reading in results for query "[email protected]" since the connection used is closing

Exception in thread "main" java.lang.ExceptionInInitializerError

at SimpleApp.main(SimpleApp.scala)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)

at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)

at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)

at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Caused by: java.lang.reflect.InvocationTargetException

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)

at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:526)

at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:249)

at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:327)

at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)

at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)

at org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:226)

at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:229)

at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)

at SimpleApp$.<init>(SimpleApp.scala:50)

at SimpleApp$.<clinit>(SimpleApp.scala)

... 10 more

Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)

at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:194)

... 23 more

Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)

at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)

at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)

at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)

at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)

at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)

at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)

... 24 more

Caused by: java.lang.reflect.InvocationTargetException

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)

at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:526)

at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)

... 30 more

Caused by: java.lang.NoSuchMethodError: org.apache.thrift.EncodingUtils.setBit(BIZ)B

at org.apache.hadoop.hive.metastore.api.PrivilegeGrantInfo.setCreateTimeIsSet(PrivilegeGrantInfo.java:245)

at org.apache.hadoop.hive.metastore.api.PrivilegeGrantInfo.<init>(PrivilegeGrantInfo.java:163)

at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultRoles_core(HiveMetaStore.java:675)

at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultRoles(HiveMetaStore.java:645)

at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:462)

at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)

at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)

at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)

at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)

at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)

... 35 more

回答:

我的建议是做一个组,收集产品名称为列表;那么,您使用的UserDefinedFunction来连接列表中的值:

import org.apache.spark.sql.functions._ 

val newDF = df.groupBy("Email").agg(

collect_list("ProductName").as("ProductList"),

sum("Quantity").as("TotalQuantity")

)

val concatList = (xs: Seq[String]) => xs.foldLeft("")({

case (s1, s2) => if (s1 == "") s2 else s1 + " --> " + s2

})

val myUDF = udf(concatList)

val result = newDF.withColumn("ProductList", myUDF(col("ProductList")))

以上是 SparkSQL中的Concat 的全部内容, 来源链接: utcz.com/qa/265218.html

回到顶部