Problems encountered when a job built against Spark 1.5.0 is run on Spark 2.0.0
After upgrading Spark to 2.0.0, the other components that Spark depends on must also be upgraded to compatible versions, including but not limited to:
- hadoop-2.7.2
- hbase-1.2.2
- hive-2.1.0
The default build environment for Spark 2.0.0 is:
- scala-2.11.8
- jdk-1.8.0
so both Scala and the JDK need to be upgraded to the required versions; a build-configuration sketch follows below.
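For a Maven build, the upgrade can be expressed in the pom.xml; this is a minimal sketch, assuming the project uses a conventional scala.version property (the property name is illustrative):
<properties>
    <!-- Scala version matching the Spark 2.0.0 default build -->
    <scala.version>2.11.8</scala.version>
    <!-- Compile sources with JDK 1.8 -->
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
</properties>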
Tips:
1. Change the value of the --master parameter of the spark-submit command to yarn; yarn-cluster is deprecated:
16/11/16 13:06:24 WARN spark.SparkConf: spark.master yarn-cluster is deprecated in Spark 2.0+, please instead use "yarn" with specified deploy mode.
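A submit command following the new convention would look like the sketch below; the jar name is a placeholder:
spark-submit --master yarn --deploy-mode cluster --class mx.tc.spark.erroinfo.ErroInfoStreaming error-info.jar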
2. The spark.akka.timeout configuration is no longer supported. The reason is as follows:
16/11/16 13:06:24 WARN spark.SparkConf: The configuration key spark.akka.timeout is not supported any more because Spark doesn't use Akka since 2.0
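Because the Akka-based RPC layer is gone, the comparable timeout setting in Spark 2.0 is spark.network.timeout; for example (the value here is illustrative):
--conf spark.network.timeout=300s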
Problem 1:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Logging
The Logging class cannot be found. In Spark 2.0 the org.apache.spark.Logging trait was moved to org.apache.spark.internal and made private, so artifacts compiled against Spark 1.x (such as the old spark-streaming-kafka connector) can no longer load it.
Solution:
Add the following dependency to the pom.xml file:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    <version>2.0.0</version>
</dependency>
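In Spark 2.0 the Kafka connector was split into separate spark-streaming-kafka-0-8 and spark-streaming-kafka-0-10 artifacts; the 0-8 artifact keeps the familiar KafkaUtils.createStream API. A minimal receiver-based stream with this connector looks like the sketch below; the ZooKeeper quorum, group id, and topic name are placeholders:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaSmokeTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaSmokeTest")
    val ssc = new StreamingContext(conf, Seconds(10))
    // Receiver-based stream from the 0-8 connector; all arguments are placeholders
    val lines = KafkaUtils.createStream(ssc, "zk1:2181", "test-group", Map("test-topic" -> 1))
    // Each record is a (key, message) pair; print the messages
    lines.map(_._2).print()
    ssc.start()
    ssc.awaitTermination()
  }
}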
Problem 2:
Caused by: java.net.URISyntaxException: Relative path in absolute URI: file:D:/workspace/sparkstreaming/error-info/spark-warehouse
Detailed error log:
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:D:/workspace/sparkstreaming/error-info/spark-warehouse
    at org.apache.hadoop.fs.Path.initialize(Path.java:206)
    at org.apache.hadoop.fs.Path.<init>(Path.java:172)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:114)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:145)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
    at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)
    at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95)
    at org.apache.spark.sql.internal.SessionState$$anon$1.<init>(SessionState.scala:112)
    at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112)
    at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:111)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:161)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
    at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
    at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:441)
    at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:395)
    at org.apache.spark.sql.SQLImplicits.rddToDatasetHolder(SQLImplicits.scala:163)
    at mx.tc.spark.erroinfo.ErroInfoStreaming$$anonfun$main$1.apply(ErroInfoStreaming.scala:8)
    at mx.tc.spark.erroinfo.ErroInfoStreaming$$anonfun$main$1.apply(ErroInfoStreaming.scala:6)
    at mx.tc.sparkstreaming.DapLogStreaming$$anonfun$run$1.apply(DapLogStreaming.scala:61)
    at mx.tc.sparkstreaming.DapLogStreaming$$anonfun$run$1.apply(DapLogStreaming.scala:58)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
    at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
    at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:245)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:245)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:245)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
    at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:244)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: file:D:/workspace/sparkstreaming/error-info/spark-warehouse
    at java.net.URI.checkPath(Unknown Source)
    at java.net.URI.<init>(Unknown Source)
    at org.apache.hadoop.fs.Path.initialize(Path.java:203)
    ... 37 more
The error is thrown where the SQLContext is initialized: it clearly cannot resolve this path during initialization, yet nothing like it appears in my configuration. After looking through the official documentation, I found that Spark SQL in 2.0.0 added a new configuration option:
spark.sql.warehouse.dir
The official configuration example is as follows:
spark = SparkSession.builder \
.master('local[*]') \
.appName('My App') \
.config('spark.sql.warehouse.dir', 'file:///C:/path/to/my/') \
.getOrCreate()
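For a Scala job such as this one, the equivalent setup would be the sketch below, assuming the warehouse directory already exists; the path is illustrative:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("My App")
  // Point Spark SQL at an existing local warehouse directory (path is a placeholder)
  .config("spark.sql.warehouse.dir", "file:///D:/workspace/sparkstreaming/error-info/spark-warehouse")
  .getOrCreate()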
After finding the cause, I added the missing configuration option and created the corresponding directory, and the problem was completely solved!
My local build environment (JDK 1.8, Scala 2.11.8) has already been upgraded, but I still keep running into this problem.
Looking forward to a reply, many thanks!!!