feat: requiredEdgeColumns API in Pregel #835

@SemyonSinchenko

Description

Is your feature request related to a problem? Please describe.
At the moment, all edge attributes are packed into a struct and persisted (!):

    val edges = graph.edges
      .select(col(SRC).alias("edge_src"), col(DST).alias("edge_dst"), struct(col("*")).as(EDGE))
      .repartition(col("edge_src"))
      .persist(intermediateStorageLevel)

The built-in algorithms work around this by explicitly selecting only the required columns, but it would be nice to add an API that lets end users specify the required columns as well.

Describe the solution you would like
requiredEdgeColumns: if specified, only the listed edge columns are selected. If it is empty, struct(col("*")).as(EDGE) should be removed from edges altogether. The current behaviour is more like a bug / unclear behaviour: a struct is always added, it contains at least SRC and DST (duplicating edge_src and edge_dst), and it is persisted (!). That is bad.

Component

  • Scala Core Internal
  • Scala API
  • Spark Connect Plugin
  • Infrastructure
  • PySpark Classic
  • PySpark Connect

Additional context
While it may look like a breaking change, to me it is more like fixing unspecified behavior that only matters in the very rare case where someone (wrongly) uses Pregel.edge(SRC) instead of Pregel.src(ID). If requiredEdgeColumns is empty, I would like to drop EDGE from edges entirely.

By default, requiredEdgeColumns should contain all edge columns except SRC and DST, and if there are no additional columns, the EDGE struct should not be created.
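To make the proposed semantics concrete, here is a minimal sketch of the column-selection rule in plain Scala, with no Spark dependency. Everything in it is hypothetical: `EdgeProjectionSketch` and `edgeStructColumns` are illustrative names, not existing GraphFrames API, and the `SRC` / `DST` values are assumed to match the GraphFrame constants. `None` as the result stands for "do not create the EDGE struct at all".

```scala
// Hypothetical helper, not part of GraphFrames: decides which edge columns
// would be packed into the EDGE struct under the proposed semantics.
object EdgeProjectionSketch {
  // Assumed to mirror the GraphFrame SRC / DST column-name constants.
  val SRC = "src"
  val DST = "dst"

  /**
   * @param edgeColumns         all column names of the edges DataFrame
   * @param requiredEdgeColumns None = use the default (everything except SRC and DST);
   *                            Some(cols) = pack only these columns
   * @return Some(columns to pack into EDGE), or None if EDGE should not be created
   */
  def edgeStructColumns(
      edgeColumns: Seq[String],
      requiredEdgeColumns: Option[Seq[String]] = None): Option[Seq[String]] = {
    // Default: every edge column except the endpoints.
    val requested = requiredEdgeColumns
      .getOrElse(edgeColumns.filterNot(c => c == SRC || c == DST))
      .distinct

    // Fail early on names that do not exist in the edges DataFrame.
    val unknown = requested.filterNot(edgeColumns.contains)
    require(unknown.isEmpty, s"Unknown edge columns: ${unknown.mkString(", ")}")

    // Nothing to pack: skip the EDGE struct entirely.
    if (requested.isEmpty) None else Some(requested)
  }
}
```

With a result of `Some(cols)`, the select in Pregel could presumably use `struct(cols.map(col): _*).as(EDGE)` in place of `struct(col("*")).as(EDGE)`; with `None`, it would select only `edge_src` and `edge_dst`, so nothing beyond the endpoints is persisted.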

We can mention this in the release notes for the rare case where someone still relies on the old EDGE for any reason (very unlikely, imo).

Are you planning on creating a PR?

  • I'm willing to make a pull-request
