**Describe the bug**

Hi team, GraphFrames does not seem to work with databricks-connect on Databricks Runtime 17.3 (Spark 4.0).
**To Reproduce**

Steps to reproduce the behavior:
- Set up a Databricks cluster on Databricks Runtime 17.3.
- Install `io.graphframes:graphframes-spark4_2.13:0.10.1` on the cluster from Maven.
- Build the Spark Connect server jar for Databricks and install it on the cluster:

  ```
  ./build/sbt connect/assembly -Dvendor.name=dbx -Dscala.version=2.13.16 -Dspark.version=4.0.0
  ```

- Run a small PySpark script locally through databricks-connect (the same code works when run in a notebook on Databricks, without going through databricks-connect):
```python
from databricks.connect import (  # type: ignore
    DatabricksEnv,
    DatabricksSession,
)
from graphframes import GraphFrame

spark = DatabricksSession.builder.getOrCreate()

nodes = [(1, "Alice", 30), (2, "Bob", 25), (3, "Charlie", 35)]
nodes_df = spark.createDataFrame(nodes, ["id", "name", "age"])

edges = [
    (1, 2, "friend"),
    (2, 1, "friend"),
    (2, 3, "friend"),
    (3, 2, "enemy"),  # eek!
]
edges_df = spark.createDataFrame(edges, ["src", "dst", "relationship"])
edges_df.show()

g = GraphFrame(nodes_df, edges_df)
g.connectedComponents().show()
```
**Expected behavior**

A completed `connectedComponents()` run through databricks-connect.
**System [please complete the following information]:**

Databricks:
- Databricks Runtime: 17.3.x-scala2.13
- Operating System: Ubuntu 24.04.2 LTS
- Java: Zulu17.58+21-CA
- Scala: 2.13.16
- Python: 3.12.3
- Delta Lake: 4.0.0
- Spark: 4.0.0
- GraphFrames: io.graphframes:graphframes-spark4_2.13:0.10.1

Local system:
- Java: openjdk 17.0.11 2024-04-16 LTS
- Python: 3.12.12
- databricks-connect: 17.3
- databricks-sdk: 0.64
- graphframes-py: 0.10.1
**Component**

**Additional context**

I get two errors coming from gRPC/protobuf. The first one appears on the first run of the test code after the cluster starts:
```
status = StatusCode.UNKNOWN
details = "grpc_shaded/com/google/protobuf/RuntimeVersion$RuntimeDomain"
debug_error_string = "UNKNOWN:Error received from peer {grpc_status:2, grpc_message:"grpc_shaded/com/google/protobuf/RuntimeVersion$RuntimeDomain"}"
java.lang.NoClassDefFoundError: grpc_shaded/com/google/protobuf/RuntimeVersion$RuntimeDomain
	at org.graphframes.connect.proto.GraphFramesAPI.<clinit>(GraphFramesAPI.java:22)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:569)
	at grpc_shaded.com.google.protobuf.Internal.getDefaultInstance(Internal.java:353)
	at grpc_shaded.com.google.protobuf.Any.is(Any.java:85)
	at org.apache.spark.sql.graphframes.GraphFramesConnect.transform(GraphFramesConnect.scala:16)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.$anonfun$transformRelationPlugin$1(SparkConnectPlanner.scala:375)
	at scala.collection.Iterator$$anon$9.next(Iterator.scala:584)
	at scala.collection.IterableOnceOps.find(IterableOnce.scala:677)
	at scala.collection.IterableOnceOps.find$(IterableOnce.scala:674)
	at scala.collection.AbstractIterable.find(Iterable.scala:935)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelationPlugin(SparkConnectPlanner.scala:378)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.$anonfun$transformRelation$1(SparkConnectPlanner.scala:343)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$8(SessionHolder.scala:743)
	at org.apache.spark.sql.connect.service.SessionHolder.measureSubtreeRelationNodes(SessionHolder.scala:759)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$6(SessionHolder.scala:742)
	at scala.Option.getOrElse(Option.scala:201)
	at org.apache.spark.sql.connect.service.SessionHolder.usePlanCache(SessionHolder.scala:740)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:229)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:215)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformShowString(SparkConnectPlanner.scala:433)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.$anonfun$transformRelation$1(SparkConnectPlanner.scala:232)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$8(SessionHolder.scala:743)
	at org.apache.spark.sql.connect.service.SessionHolder.measureSubtreeRelationNodes(SessionHolder.scala:759)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$6(SessionHolder.scala:742)
	at scala.Option.getOrElse(Option.scala:201)
	at org.apache.spark.sql.connect.service.SessionHolder.usePlanCache(SessionHolder.scala:740)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:229)
	at org.apache.spark.sql.connect.execution.SparkConnectPlanExecution.handlePlan(SparkConnectPlanExecution.scala:96)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.handlePlan(ExecuteThreadRunner.scala:385)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1(ExecuteThreadRunner.scala:291)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1$adapted(ExecuteThreadRunner.scala:247)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$2(SessionHolder.scala:536)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:860)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$1(SessionHolder.scala:536)
	at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:97)
	at org.apache.spark.sql.artifact.ArtifactManager.$anonfun$withResources$1(ArtifactManager.scala:124)
	at org.apache.spark.sql.artifact.ArtifactManager.withClassLoaderIfNeeded(ArtifactManager.scala:118)
	at org.apache.spark.sql.artifact.ArtifactManager.withResources(ArtifactManager.scala:123)
	at org.apache.spark.sql.connect.service.SessionHolder.withSession(SessionHolder.scala:535)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.executeInternal(ExecuteThreadRunner.scala:247)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$execute$1(ExecuteThreadRunner.scala:141)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
	at com.databricks.spark.connect.service.UtilizationMetrics.recordActiveQueries(UtilizationMetrics.scala:43)
	at com.databricks.spark.connect.service.UtilizationMetrics.recordActiveQueries$(UtilizationMetrics.scala:40)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.recordActiveQueries(ExecuteThreadRunner.scala:53)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.org$apache$spark$sql$connect$execution$ExecuteThreadRunner$$execute(ExecuteThreadRunner.scala:139)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner$ExecutionThread.$anonfun$run$2(ExecuteThreadRunner.scala:595)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
	at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:51)
	at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:104)
	at com.databricks.unity.HandleImpl.$anonfun$runWithAndClose$1(UCSHandle.scala:109)
	at scala.util.Using$.resource(Using.scala:296)
	at com.databricks.unity.HandleImpl.runWithAndClose(UCSHandle.scala:108)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner$ExecutionThread.run(ExecuteThreadRunner.scala:595)
Caused by: java.lang.ClassNotFoundException: grpc_shaded.com.google.protobuf.RuntimeVersion$RuntimeDomain
	at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)
	at com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader.loadClass(ClassLoaders.scala:152)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
	... 57 more
```
The second one appears on every subsequent run of the same code:
```
status = StatusCode.UNKNOWN
details = "Could not initialize class org.graphframes.connect.proto.GraphFramesAPI"
debug_error_string = "UNKNOWN:Error received from peer {grpc_status:2, grpc_message:"Could not initialize class org.graphframes.connect.proto.GraphFramesAPI"}"
java.lang.NoClassDefFoundError: Could not initialize class org.graphframes.connect.proto.GraphFramesAPI
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:569)
	at grpc_shaded.com.google.protobuf.Internal.getDefaultInstance(Internal.java:353)
	at grpc_shaded.com.google.protobuf.Any.is(Any.java:85)
	at org.apache.spark.sql.graphframes.GraphFramesConnect.transform(GraphFramesConnect.scala:16)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.$anonfun$transformRelationPlugin$1(SparkConnectPlanner.scala:375)
	at scala.collection.Iterator$$anon$9.next(Iterator.scala:584)
	at scala.collection.IterableOnceOps.find(IterableOnce.scala:677)
	at scala.collection.IterableOnceOps.find$(IterableOnce.scala:674)
	at scala.collection.AbstractIterable.find(Iterable.scala:935)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelationPlugin(SparkConnectPlanner.scala:378)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.$anonfun$transformRelation$1(SparkConnectPlanner.scala:343)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$8(SessionHolder.scala:743)
	at org.apache.spark.sql.connect.service.SessionHolder.measureSubtreeRelationNodes(SessionHolder.scala:759)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$6(SessionHolder.scala:742)
	at scala.Option.getOrElse(Option.scala:201)
	at org.apache.spark.sql.connect.service.SessionHolder.usePlanCache(SessionHolder.scala:740)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:229)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:215)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformShowString(SparkConnectPlanner.scala:433)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.$anonfun$transformRelation$1(SparkConnectPlanner.scala:232)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$8(SessionHolder.scala:743)
	at org.apache.spark.sql.connect.service.SessionHolder.measureSubtreeRelationNodes(SessionHolder.scala:759)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$6(SessionHolder.scala:742)
	at scala.Option.getOrElse(Option.scala:201)
	at org.apache.spark.sql.connect.service.SessionHolder.usePlanCache(SessionHolder.scala:740)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:229)
	at org.apache.spark.sql.connect.execution.SparkConnectPlanExecution.handlePlan(SparkConnectPlanExecution.scala:96)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.handlePlan(ExecuteThreadRunner.scala:385)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1(ExecuteThreadRunner.scala:291)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1$adapted(ExecuteThreadRunner.scala:247)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$2(SessionHolder.scala:536)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:860)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$1(SessionHolder.scala:536)
	at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:97)
	at org.apache.spark.sql.artifact.ArtifactManager.$anonfun$withResources$1(ArtifactManager.scala:124)
	at org.apache.spark.sql.artifact.ArtifactManager.withClassLoaderIfNeeded(ArtifactManager.scala:118)
	at org.apache.spark.sql.artifact.ArtifactManager.withResources(ArtifactManager.scala:123)
	at org.apache.spark.sql.connect.service.SessionHolder.withSession(SessionHolder.scala:535)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.executeInternal(ExecuteThreadRunner.scala:247)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$execute$1(ExecuteThreadRunner.scala:141)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
	at com.databricks.spark.connect.service.UtilizationMetrics.recordActiveQueries(UtilizationMetrics.scala:43)
	at com.databricks.spark.connect.service.UtilizationMetrics.recordActiveQueries$(UtilizationMetrics.scala:40)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.recordActiveQueries(ExecuteThreadRunner.scala:53)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.org$apache$spark$sql$connect$execution$ExecuteThreadRunner$$execute(ExecuteThreadRunner.scala:139)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner$ExecutionThread.$anonfun$run$2(ExecuteThreadRunner.scala:595)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
	at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:51)
	at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:104)
	at com.databricks.unity.HandleImpl.$anonfun$runWithAndClose$1(UCSHandle.scala:109)
	at scala.util.Using$.resource(Using.scala:296)
	at com.databricks.unity.HandleImpl.runWithAndClose(UCSHandle.scala:108)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner$ExecutionThread.run(ExecuteThreadRunner.scala:595)
Caused by: java.lang.ExceptionInInitializerError: Exception java.lang.NoClassDefFoundError: grpc_shaded/com/google/protobuf/RuntimeVersion$RuntimeDomain [in thread "SparkConnectExecuteThread_opId=63e2cdea-e873-41be-ac29-32fb5d8b5882"]
	at org.graphframes.connect.proto.GraphFramesAPI.<clinit>(GraphFramesAPI.java:22)
	... 56 more
```
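For what it's worth, the `ClassNotFoundException` suggests the shaded protobuf class `grpc_shaded/com/google/protobuf/RuntimeVersion$RuntimeDomain` never made it into the assembled jar (I believe `RuntimeVersion` only exists in the protobuf 4.x Java runtime, so an assembly shaded against a 3.x runtime would lack it). Since a jar is just a zip archive, a small helper like the hypothetical one below (the function name and jar path are my own, not part of GraphFrames) can verify whether the class is actually present in the assembly before uploading it to the cluster:

```python
import zipfile

def jar_contains_class(jar_path: str, binary_name: str) -> bool:
    """Return True if the jar (a zip archive) contains the given class entry.

    binary_name uses JVM binary form with slashes, e.g.
    "grpc_shaded/com/google/protobuf/RuntimeVersion$RuntimeDomain".
    """
    with zipfile.ZipFile(jar_path) as jar:
        # Class files are stored as <binary name>.class entries in the jar.
        return binary_name + ".class" in jar.namelist()
```

For example, `jar_contains_class("graphframes-connect-assembly.jar", "grpc_shaded/com/google/protobuf/RuntimeVersion$RuntimeDomain")` (jar name assumed; use your actual assembly output). If the entry is missing, the shading step dropped or renamed the protobuf 4.x runtime; if it is present, the problem may instead be classloader isolation on the server (note the `LibraryClassLoader` frame in the first trace).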
I tried changing a lot of different parameters, including manually setting the protoc version in `build.sbt` (l. 34), without much success with any 3.x or later version.
**Are you planning on creating a PR?**

Thanks a lot for your support and for everything that you are doing!

Joshua