
[Original] Issues to Watch When Upgrading from Spark 1.5.0 to 2.0.0

Problems that appeared when jobs compiled against Spark 1.5.0 were run on a Spark 2.0.0 environment.

After upgrading Spark to 2.0.0, the other components Spark works with also need to be upgraded to compatible versions, including but not limited to:

  • hadoop-2.7.2
  • hbase-1.2.2
  • hive-2.1.0

The default build environment for Spark 2.0.0 is:

  • scala-2.11.8
  • jdk-1.8.0

so both Scala and the JDK need to be upgraded to the required versions.

Tips:

1. Change the value of the spark-submit --master parameter to yarn; yarn-cluster is deprecated, as the warning below shows.

16/11/16 13:06:24 WARN spark.SparkConf: spark.master yarn-cluster is deprecated in Spark 2.0+, please instead use "yarn" with specified deploy mode.
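A Spark 2.x submission in cluster mode therefore looks roughly like the following sketch (the main class and jar name are placeholders for your own job):

spark-submit \
	--master yarn \
	--deploy-mode cluster \
	--class com.example.MyStreamingJob \
	my-streaming-job.jar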

2. The spark.akka.timeout configuration is no longer supported, for the reason shown below:

16/11/16 13:06:24 WARN spark.SparkConf: The configuration key spark.akka.timeout is not supported any more because Spark doesn't use Akka since 2.0
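If your job still sets the old key, a minimal adjustment in Scala might look like the sketch below; spark.network.timeout is assumed here to be the timeout you actually wanted to raise, since Spark 2.x replaced Akka with its own RPC layer.

import org.apache.spark.SparkConf

// "my-streaming-job" is a placeholder app name
val conf = new SparkConf()
	.setAppName("my-streaming-job")
	// spark.akka.timeout is ignored in Spark 2.x; the general
	// network/RPC timeout is controlled by spark.network.timeout
	.set("spark.network.timeout", "120s")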

Problem 1:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Logging

The org/apache/spark/Logging class cannot be found. In Spark 2.0 the org.apache.spark.Logging trait was moved into an internal package, so jars built against Spark 1.x that still reference the old class fail at runtime with this error.

Solution 1:

http://stackoverflow.com/questions/39212906/exception-in-thread-main-java-lang-noclassdeffounderror-org-apache-spark-logg

Solution 2:

Add the following dependency to pom.xml:

<dependency>
	<groupId>org.apache.spark</groupId>
	<artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
	<version>2.0.0</version>
</dependency>
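With that artifact on the classpath, the receiver-based Kafka API resolves against classes built for Spark 2.0. A minimal sketch of how the 0-8 connector is typically wired up is shown below; the ZooKeeper quorum, consumer group and topic name are placeholders, not values from this post.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaStreamSketch {
	def main(args: Array[String]): Unit = {
		val conf = new SparkConf().setAppName("kafka-0-8-sketch")
		val ssc = new StreamingContext(conf, Seconds(10))

		// createStream comes from spark-streaming-kafka-0-8_2.11:2.0.0;
		// "zk-host:2181", "my-group" and "my-topic" are placeholders
		val lines = KafkaUtils.createStream(
			ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1)
		).map(_._2)

		lines.count().print()

		ssc.start()
		ssc.awaitTermination()
	}
}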

Problem 2:

Caused by: java.net.URISyntaxException: Relative path in absolute URI: file:D:/workspace/sparkstreaming/error-info/spark-warehouse

Full error log:

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:D:/workspace/sparkstreaming/error-info/spark-warehouse
	at org.apache.hadoop.fs.Path.initialize(Path.java:206)
	at org.apache.hadoop.fs.Path.<init>(Path.java:172)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:114)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:145)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
	at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)
	at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95)
	at org.apache.spark.sql.internal.SessionState$$anon$1.<init>(SessionState.scala:112)
	at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112)
	at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:111)
	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
	at org.apache.spark.sql.Dataset.<init>(Dataset.scala:161)
	at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
	at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
	at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:441)
	at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:395)
	at org.apache.spark.sql.SQLImplicits.rddToDatasetHolder(SQLImplicits.scala:163)
	at mx.tc.spark.erroinfo.ErroInfoStreaming$$anonfun$main$1.apply(ErroInfoStreaming.scala:8)
	at mx.tc.spark.erroinfo.ErroInfoStreaming$$anonfun$main$1.apply(ErroInfoStreaming.scala:6)
	at mx.tc.sparkstreaming.DapLogStreaming$$anonfun$run$1.apply(DapLogStreaming.scala:61)
	at mx.tc.sparkstreaming.DapLogStreaming$$anonfun$run$1.apply(DapLogStreaming.scala:58)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
	at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
	at scala.util.Try$.apply(Try.scala:192)
	at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:245)
	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:245)
	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:245)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:244)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: file:D:/workspace/sparkstreaming/error-info/spark-warehouse
	at java.net.URI.checkPath(Unknown Source)
	at java.net.URI.<init>(Unknown Source)
	at org.apache.hadoop.fs.Path.initialize(Path.java:203)
	... 37 more

The error is raised where the SQLContext is initialized: clearly this path cannot be resolved at initialization time, yet my configuration never set any such path. After consulting the official documentation, I found that Spark SQL in 2.0.0 introduced a new configuration option:

spark.sql.warehouse.dir

The official documentation configures it as follows:

spark = SparkSession.builder \
           .master('local[*]') \
           .appName('My App') \
           .config('spark.sql.warehouse.dir', 'file:///C:/path/to/my/') \
           .getOrCreate()

Once the cause was found, I added the missing configuration option and created the corresponding directory, and the problem was completely solved.
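For a Scala job the same setting can be applied when building the SparkSession; a minimal sketch, assuming a placeholder warehouse directory that already exists on the local disk:

import org.apache.spark.sql.SparkSession

// Point spark.sql.warehouse.dir at a proper file: URI so that the
// SessionCatalog can build a qualified path at startup;
// file:///D:/spark-warehouse/ is a placeholder path
val spark = SparkSession.builder()
	.master("local[*]")
	.appName("My App")
	.config("spark.sql.warehouse.dir", "file:///D:/spark-warehouse/")
	.getOrCreate()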


Reader Comments (2)

OP, please help: for Problem 1 (java.lang.NoClassDefFoundError: org/apache/spark/Logging), neither Solution 1 nor Solution 2 works for me. What should I do?
My local build environment has already been upgraded to JDK 1.8 and Scala 2.11.8, but I still keep hitting this error.
Hoping for a reply, much appreciated!
yefeiss 2 years ago (2017-03-20)
@yefeiss: Following Solution 1, did you download the corresponding jar and put it under the Spark root directory?
数据为王 2 years ago (2017-03-22)