feat: all paths between two nodes by SemyonSinchenko · Pull Request #828 · graphframes/graphframes

SemyonSinchenko · 2026-04-29T10:01:13Z

What changes were proposed in this pull request?

A simple wrapper over AggregateNeighbors for enumerating all paths between two nodes.

Why are the changes needed?

Close #200

P.S. PySpark bindings and docs will follow once the API is pre-approved.

james-willis · 2026-05-08T17:52:58Z

+
+    agg
+      .run()
+      .select(col("path"), size(col("path")).cast("long").alias("len"))


len should be size(col("path")) - 1, since it's the number of edges, not the number of nodes.

this will fix issues with off-by-one hop counts as well.
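To make the off-by-one concrete, here is a minimal sketch using plain Scala collections instead of Spark columns. The object and method names are hypothetical, and treating an empty path as 0 hops is an assumption rather than something the PR specifies.

```scala
object HopCount {
  // A path stored as an ordered list of vertex ids with n vertices
  // traverses n - 1 edges. An empty path is clamped to 0 hops (assumption).
  def hops(path: Seq[String]): Long = math.max(path.length - 1, 0).toLong
}
```

For example, `HopCount.hops(Seq("a", "b", "c"))` counts the two edges a-b and b-c, whereas the size of the array would report 3.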

james-willis · 2026-05-08T17:56:48Z

+        "path",
+        array(col(GraphFrame.ID)),
+        concat(col("path"), array(AggregateNeighbors.dstAttr(GraphFrame.ID))))
+      .setRequiredVertexAttributes(Seq(GraphFrame.ID))


I think since we only require id here, then toExpr will only be able to reference id.

Is that intended?

You are right, we should keep all the attributes here. Maybe in the future we can use the expression parsing you built to infer the required state implicitly. For now, let's pray that Catalyst will be able to optimize it.

james-willis · 2026-05-08T17:59:33Z

Should the 0 hop path be returned? BFS would return that 0 hop path

james-willis · 2026-05-08T18:08:47Z

+      val reversed = graph.edges.select(
+        (Seq(
+          col(GraphFrame.DST).alias(GraphFrame.SRC),
+          col(GraphFrame.SRC).alias(GraphFrame.DST)) ++
+          edgeColumns.filterNot(c => c == GraphFrame.SRC || c == GraphFrame.DST).map(col)): _*)
+      GraphFrame(graph.vertices, graph.edges.unionByName(reversed))


Should we call distinct in case the user already has a large number of bidirectional edges?

Suggested change

-      val reversed = graph.edges.select(
-        (Seq(
-          col(GraphFrame.DST).alias(GraphFrame.SRC),
-          col(GraphFrame.SRC).alias(GraphFrame.DST)) ++
-          edgeColumns.filterNot(c => c == GraphFrame.SRC || c == GraphFrame.DST).map(col)): _*)
-      GraphFrame(graph.vertices, graph.edges.unionByName(reversed))
+      val reversed = graph.edges.select(
+        (Seq(
+          col(GraphFrame.DST).alias(GraphFrame.SRC),
+          col(GraphFrame.SRC).alias(GraphFrame.DST)) ++
+          edgeColumns.filterNot(c => c == GraphFrame.SRC || c == GraphFrame.DST).map(col)): _*)
+      GraphFrame(graph.vertices, graph.edges.unionByName(reversed).distinct()) // distinct in case graph already has a large number of bidirectional edges

Distinct over duplicated edges is quite an expensive operation. I would rather note this in the documentation than call distinct unconditionally.

fair enough
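The trade-off can be illustrated with plain Scala collections standing in for the edges DataFrame. This is only a sketch: `Edge`, `weight`, and `withReversed` are hypothetical names, and the real implementation operates on DataFrames via unionByName.

```scala
object UndirectedEdges {
  final case class Edge(src: String, dst: String, weight: Double)

  // Mirrors every edge so traversal can go both ways; this is the
  // collections analogue of unioning the edges with their reversed copy.
  def withReversed(edges: Seq[Edge]): Seq[Edge] =
    edges ++ edges.map(e => e.copy(src = e.dst, dst = e.src))
}
```

If the input already contains both a-b and b-a with identical attributes, `withReversed` yields four rows where two would do. Deduplicating removes them, but on a distributed dataset that is a shuffle over all edges, which is why leaving it to the caller and documenting it is a defensible choice.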

james-willis · 2026-05-08T18:10:07Z

+    agg
+      .run()
+      .select(col("path"), size(col("path")).cast("long").alias("len"))
+      .distinct()


path is a sequence of vertices but ignores the edges used to get there. Are two different edges between the same two nodes distinct paths?

Yes it is a distinct path based on the definitions from books...

We probably need to include edges in the returned path value, and also produce different paths for different edges.

But an edge in the GraphFrames representation is just a src - dst pair. What would be the difference between:

path of vertices : A, B, C

path of edges : A-B,B-C

To me they are the same, but the first one is much easier to work with in our API.

After all, we do not have a unique ID on edges (unless the user provides one as a property).

sometimes edges have more than src and dst as fields. idk what we should do.
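The open question can be shown with a small plain-Scala sketch. Everything here is hypothetical (the `relationship` field, the helper names, the fixed two-hop depth); it only illustrates how parallel edges with different attributes collapse into one vertex path.

```scala
object ParallelEdges {
  final case class Edge(src: String, dst: String, relationship: String)

  // All two-edge walks src -> mid -> dst, kept as sequences of edges.
  def twoHopEdgePaths(edges: Seq[Edge], src: String, dst: String): Seq[Seq[Edge]] =
    for {
      first <- edges if first.src == src
      second <- edges if second.src == first.dst && second.dst == dst
    } yield Seq(first, second)

  // Projects a non-empty edge path onto the vertices it visits.
  def vertices(path: Seq[Edge]): Seq[String] = path.head.src +: path.map(_.dst)
}
```

With edges a-b (friend), a-b (colleague), and b-c (friend), there are two edge paths from a to c but only one vertex path, A, B, C, once the result is deduplicated.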

james-willis · 2026-05-08T18:10:48Z

+  def edgeFilter(value: String): this.type = edgeFilter(expr(value))
+
+  def run(): DataFrame = {
+    require(fromExpression != null, "fromExpr is required.")


should we add requires to check for collisions on 'len' and 'path'?
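Such a guard might look like the following sketch. It is written against a plain list of column names rather than a DataFrame schema, and the object name, method name, and error message are assumptions, not code from the PR.

```scala
object ReservedColumns {
  // Output column names produced by the algorithm, per the diff above.
  val Reserved: Set[String] = Set("path", "len")

  // Fails fast if any existing column would clash with an output column.
  def requireNoCollisions(existingColumns: Seq[String]): Unit = {
    val clashes = existingColumns.filter(Reserved.contains)
    require(
      clashes.isEmpty,
      s"Columns collide with reserved output columns: ${clashes.mkString(", ")}")
  }
}
```

Calling it with the input column names at the top of run() would surface the conflict as an IllegalArgumentException before any computation starts.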

james-willis · 2026-05-08T18:14:57Z

+
+    edgeFilterExpression.foreach { ef =>
+      agg.setEdgeFilter(SparkShims.applyExprToCol(graph.spark, ef, "edge_attributes"))
+      agg.setRequiredEdgeAttributes(graph.edges.columns.toSeq)


Is this not the default?

james-willis · 2026-05-08T18:26:41Z

+import org.graphframes.WithDirection
+
+/**
+ * Computes all simple paths between source and destination vertices.


The class scaladoc is the most likely entry point for users of this API — could we expand it a bit to surface a few things that are currently only discoverable by reading the implementation?

What "simple" means here (no repeated vertices), since that's a load-bearing semantic for what's returned vs. excluded.

That direction is configurable via setIsDirected(false) for undirected traversal.

The default maxPathLength = 10.

A short example block (similar to the one on BFS) would also help users get started without grepping through the suite.
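As a reference for the semantics the scaladoc would need to describe, here is a minimal in-memory sketch in plain Scala. It is not the PR's implementation (which wraps AggregateNeighbors); the names are hypothetical, and returning no path when source equals destination is an assumption based on the discussion in this thread.

```scala
object SimplePaths {
  type Edge = (String, String)

  // Enumerates all simple paths (no vertex visited twice) from src to dst
  // that use at most maxPathLength edges. With directed = false, edges
  // may also be walked backwards.
  def allSimplePaths(
      edges: Seq[Edge],
      src: String,
      dst: String,
      maxPathLength: Int = 10,
      directed: Boolean = true): Set[Seq[String]] = {
    val all = if (directed) edges else edges ++ edges.map(_.swap)
    val adjacency: Map[String, Seq[String]] =
      all.groupBy(_._1).map { case (k, v) => k -> v.map(_._2).distinct }

    def extend(path: Vector[String]): Set[Seq[String]] = {
      val last = path.last
      if (last == dst) Set[Seq[String]](path)
      else if (path.length - 1 >= maxPathLength) Set.empty[Seq[String]]
      else
        adjacency
          .getOrElse(last, Seq.empty)
          .filterNot(v => path.contains(v)) // "simple": never revisit a vertex
          .flatMap(next => extend(path :+ next))
          .toSet
    }

    // Assumption: no zero-hop path is reported when src and dst coincide.
    if (src == dst) Set.empty[Seq[String]] else extend(Vector(src))
  }
}
```

On the edges a-b, b-c, a-c, c-a, the call `SimplePaths.allSimplePaths(edges, "a", "c")` finds the direct path and the path through b; the cycle back to a is never followed because a is already on the path.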

SemyonSinchenko · 2026-05-08T19:05:58Z

Should the 0 hop path be returned? BFS would return that 0 hop path

What is a zero-hop path between two nodes? Is it a self-loop in the case where the source and target are the same? Does it make any sense?

james-willis · 2026-05-08T19:50:59Z

yes the 0 hop path is the self loop. maybe it is a 1-hop case...

SemyonSinchenko · 2026-05-08T20:02:36Z

yes the 0 hop path is the self loop. maybe it is a 1-hop case...

And to get it, the user would have to specify srcExpression === dstExpression, which would be strange. I do not see any reason to support it, tbh. For cycle detection we have a standalone algorithm. I do not see this API as an API for detecting cycles.

james-willis · 2026-05-09T01:29:52Z

I see, so since we don't support cycles, we don't support loops.

feat: all paths between two nodes

a83f620

SemyonSinchenko self-assigned this Apr 29, 2026

SemyonSinchenko added the scala label Apr 29, 2026

SemyonSinchenko requested a review from james-willis April 29, 2026 10:01

fix: let's have some 2.12 vs 2.13 fun

fea94ca

james-willis reviewed May 8, 2026
