Skip to content

feat: all paths between two nodes#828

Open
SemyonSinchenko wants to merge 2 commits into
graphframes:mainfrom
SemyonSinchenko:200-all-paths
Open

feat: all paths between two nodes#828
SemyonSinchenko wants to merge 2 commits into
graphframes:mainfrom
SemyonSinchenko:200-all-paths

Conversation

@SemyonSinchenko
Copy link
Copy Markdown
Collaborator

What changes were proposed in this pull request?

A simple wrapper over AggregateNeighbors for enumerating all paths between two nodes.

Why are the changes needed?

Close #200

P.S. PySpark + Docs after pre-approve of the API


agg
.run()
.select(col("path"), size(col("path")).cast("long").alias("len"))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

len should be size(col("path") - 1 since its the length of the edges, not nodes.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will fix issues with off-by-one hop counts as well.

"path",
array(col(GraphFrame.ID)),
concat(col("path"), array(AggregateNeighbors.dstAttr(GraphFrame.ID))))
.setRequiredVertexAttributes(Seq(GraphFrame.ID))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think since we only require id here, then toExpr will only be able to reference id.

Is that intended?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, we should keep all the attributes here. Maybe in the future we will apply expression-parsing you made to implicitly determine required state. For now let's pray that Catalyst will be able to optimize it.

@james-willis
Copy link
Copy Markdown
Collaborator

Should the 0 hop path be returned? BFS would return that 0 hop path

Comment on lines +87 to +92
val reversed = graph.edges.select(
(Seq(
col(GraphFrame.DST).alias(GraphFrame.SRC),
col(GraphFrame.SRC).alias(GraphFrame.DST)) ++
edgeColumns.filterNot(c => c == GraphFrame.SRC || c == GraphFrame.DST).map(col)): _*)
GraphFrame(graph.vertices, graph.edges.unionByName(reversed))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call distinct in case the user already has a large number of bidirectional edges?

Suggested change
val reversed = graph.edges.select(
(Seq(
col(GraphFrame.DST).alias(GraphFrame.SRC),
col(GraphFrame.SRC).alias(GraphFrame.DST)) ++
edgeColumns.filterNot(c => c == GraphFrame.SRC || c == GraphFrame.DST).map(col)): _*)
GraphFrame(graph.vertices, graph.edges.unionByName(reversed))
val reversed = graph.edges.select(
(Seq(
col(GraphFrame.DST).alias(GraphFrame.SRC),
col(GraphFrame.SRC).alias(GraphFrame.DST)) ++
edgeColumns.filterNot(c => c == GraphFrame.SRC || c == GraphFrame.DST).map(col)): _*)
GraphFrame(graph.vertices, graph.edges.unionByName(reversed).distinct()) // distinct in case graph already has a large number of bidirectional edges

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Distinct over duplicated edges is quite an expensive operation. I would like to mark it in documentation somehow instead of unconditionally call distinct.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough

agg
.run()
.select(col("path"), size(col("path")).cast("long").alias("len"))
.distinct()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

path is a set of vertices but ignores the edges used to get there. is two difference edges between the same 2 nodes a distinct path?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is a distinct path based on the definitions from books...

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably need to include edges in the return path value. and also have different paths for different edges.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the edge in graphframes representation is just a src - dst. What will be the difference between:

  1. path of vertices : A, B, C
  2. path of edges : A-B,B-C

For me it the same but with the first one is much easier to work in our API

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not have unique ID on edges by the end (until user provides it as a property)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sometimes edges have more than src and dst as fields. idk what we should do.

def edgeFilter(value: String): this.type = edgeFilter(expr(value))

def run(): DataFrame = {
require(fromExpression != null, "fromExpr is required.")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add requires to check for collisions on 'len' and 'path'?


edgeFilterExpression.foreach { ef =>
agg.setEdgeFilter(SparkShims.applyExprToCol(graph.spark, ef, "edge_attributes"))
agg.setRequiredEdgeAttributes(graph.edges.columns.toSeq)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not the default?

import org.graphframes.WithDirection

/**
* Computes all simple paths between source and destination vertices.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class scaladoc is the most likely entry point for users of this API — could we expand it a bit to surface a few things that are currently only discoverable by reading the implementation?

  • What "simple" means here (no repeated vertices), since that's a load-bearing semantic for what's returned vs. excluded.
  • That direction is configurable via setIsDirected(false) for undirected traversal.
  • The default maxPathLength = 10.

A short example block (similar to the one on BFS) would also help users get started without grepping through the suite.

@SemyonSinchenko
Copy link
Copy Markdown
Collaborator Author

Should the 0 hop path be returned? BFS would return that 0 hop path

What is zer-hop path between two nodes? Is it a self-loop in the case both source and target are the same? Does it make any sense?

@james-willis
Copy link
Copy Markdown
Collaborator

yes the 0 hop path is the self loop. maybe it is a 1-hop case...

@SemyonSinchenko
Copy link
Copy Markdown
Collaborator Author

yes the 0 hop path is the self loop. maybe it is a 1-hop case...

And to get it user must specify srcExpression === dstExpression that would be strange. I do not see any reasons to support it tbh. For cycles detection we have a standalone algorithm. I do not see this API as an API for detecting cycles.

@james-willis
Copy link
Copy Markdown
Collaborator

I see so since we dont support cycles we dont support loops.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: all paths between two nodes

2 participants