feat: all paths between two nodes#828
Conversation
|
|
||
| agg | ||
| .run() | ||
| .select(col("path"), size(col("path")).cast("long").alias("len")) |
There was a problem hiding this comment.
len should be size(col("path") - 1 since its the length of the edges, not nodes.
There was a problem hiding this comment.
this will fix issues with off-by-one hop counts as well.
| "path", | ||
| array(col(GraphFrame.ID)), | ||
| concat(col("path"), array(AggregateNeighbors.dstAttr(GraphFrame.ID)))) | ||
| .setRequiredVertexAttributes(Seq(GraphFrame.ID)) |
There was a problem hiding this comment.
I think since we only require id here, then toExpr will only be able to reference id.
Is that intended?
There was a problem hiding this comment.
You are right, we should keep all the attributes here. Maybe in the future we will apply expression-parsing you made to implicitly determine required state. For now let's pray that Catalyst will be able to optimize it.
|
Should the 0 hop path be returned? BFS would return that 0 hop path |
| val reversed = graph.edges.select( | ||
| (Seq( | ||
| col(GraphFrame.DST).alias(GraphFrame.SRC), | ||
| col(GraphFrame.SRC).alias(GraphFrame.DST)) ++ | ||
| edgeColumns.filterNot(c => c == GraphFrame.SRC || c == GraphFrame.DST).map(col)): _*) | ||
| GraphFrame(graph.vertices, graph.edges.unionByName(reversed)) |
There was a problem hiding this comment.
Should we call distinct in case the user already has a large number of bidirectional edges?
| val reversed = graph.edges.select( | |
| (Seq( | |
| col(GraphFrame.DST).alias(GraphFrame.SRC), | |
| col(GraphFrame.SRC).alias(GraphFrame.DST)) ++ | |
| edgeColumns.filterNot(c => c == GraphFrame.SRC || c == GraphFrame.DST).map(col)): _*) | |
| GraphFrame(graph.vertices, graph.edges.unionByName(reversed)) | |
| val reversed = graph.edges.select( | |
| (Seq( | |
| col(GraphFrame.DST).alias(GraphFrame.SRC), | |
| col(GraphFrame.SRC).alias(GraphFrame.DST)) ++ | |
| edgeColumns.filterNot(c => c == GraphFrame.SRC || c == GraphFrame.DST).map(col)): _*) | |
| GraphFrame(graph.vertices, graph.edges.unionByName(reversed).distinct()) // distinct in case graph already has a large number of bidirectional edges |
There was a problem hiding this comment.
Distinct over duplicated edges is quite an expensive operation. I would like to mark it in documentation somehow instead of unconditionally call distinct.
| agg | ||
| .run() | ||
| .select(col("path"), size(col("path")).cast("long").alias("len")) | ||
| .distinct() |
There was a problem hiding this comment.
path is a set of vertices but ignores the edges used to get there. is two difference edges between the same 2 nodes a distinct path?
There was a problem hiding this comment.
Yes it is a distinct path based on the definitions from books...
There was a problem hiding this comment.
probably need to include edges in the return path value. and also have different paths for different edges.
There was a problem hiding this comment.
But the edge in graphframes representation is just a src - dst. What will be the difference between:
- path of vertices :
A, B, C - path of edges :
A-B,B-C
For me it the same but with the first one is much easier to work in our API
There was a problem hiding this comment.
We do not have unique ID on edges by the end (until user provides it as a property)
There was a problem hiding this comment.
sometimes edges have more than src and dst as fields. idk what we should do.
| def edgeFilter(value: String): this.type = edgeFilter(expr(value)) | ||
|
|
||
| def run(): DataFrame = { | ||
| require(fromExpression != null, "fromExpr is required.") |
There was a problem hiding this comment.
should we add requires to check for collisions on 'len' and 'path'?
|
|
||
| edgeFilterExpression.foreach { ef => | ||
| agg.setEdgeFilter(SparkShims.applyExprToCol(graph.spark, ef, "edge_attributes")) | ||
| agg.setRequiredEdgeAttributes(graph.edges.columns.toSeq) |
There was a problem hiding this comment.
Is this not the default?
| import org.graphframes.WithDirection | ||
|
|
||
| /** | ||
| * Computes all simple paths between source and destination vertices. |
There was a problem hiding this comment.
The class scaladoc is the most likely entry point for users of this API — could we expand it a bit to surface a few things that are currently only discoverable by reading the implementation?
- What "simple" means here (no repeated vertices), since that's a load-bearing semantic for what's returned vs. excluded.
- That direction is configurable via
setIsDirected(false)for undirected traversal. - The default
maxPathLength = 10.
A short example block (similar to the one on BFS) would also help users get started without grepping through the suite.
What is zer-hop path between two nodes? Is it a self-loop in the case both source and target are the same? Does it make any sense? |
|
yes the 0 hop path is the self loop. maybe it is a 1-hop case... |
And to get it user must specify |
|
I see so since we dont support cycles we dont support loops. |
What changes were proposed in this pull request?
A simple wrapper over
AggregateNeighborsfor enumerating all paths between two nodes.Why are the changes needed?
Close #200
P.S. PySpark + Docs after pre-approve of the API