feat: selections for Pregel on the step of triplets generation #723

@SemyonSinchenko

Description

Is your feature request related to a problem? Please describe.

At the moment in Pregel, when we construct triplets, we always take all the columns of the source and destination vertices. This creates a huge dataset in memory, especially for algorithms that have a big state (cycle detection, future random walks, etc.).

In practice, we do not always need the full state, only part of it. For example, in the Rocha-Thatte algorithm, it is enough for each triplet to carry only the source vertex's sequences.

Describe the solution you would like

I would like to have an API like:

requiredSrcColumns(col: Column, cols: Column*)
requiredDstColumns(col: Column, cols: Column*)

so that, at the triplet-generation step, only the required columns are selected instead of the whole Pregel state of both the src and dst vertices.
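To make the idea concrete, here is a minimal sketch of column pruning at triplet generation, written against plain Spark DataFrames. It is not the GraphFrames implementation: `TripletPruningSketch` and `prunedTriplets` are hypothetical names, the required columns are passed as plain strings rather than the `Column` arguments proposed above, and the vertex columns `sequences` and `payload` are made up for illustration. The only assumption about layout is that triplets nest vertex and edge attributes under `src`, `edge`, and `dst` struct columns, with `id` always kept for the joins.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct}

object TripletPruningSketch {

  // Hypothetical helper (not a GraphFrames API): builds triplets in which the
  // "src" and "dst" structs carry only "id" plus the explicitly requested columns.
  def prunedTriplets(
      vertices: DataFrame,
      edges: DataFrame,
      requiredSrcColumns: Seq[String],
      requiredDstColumns: Seq[String]): DataFrame = {

    // Pack the selected vertex columns into a single struct column called `name`.
    def pack(name: String, required: Seq[String]): DataFrame = {
      val kept = ("id" +: required).distinct.map(col)
      vertices.select(struct(kept: _*).as(name))
    }

    val src = pack("src", requiredSrcColumns)
    val dst = pack("dst", requiredDstColumns)
    // Edge attributes are kept whole in this sketch.
    val edge = edges.select(struct(edges.columns.map(col): _*).as("edge"))

    src
      .join(edge, col("src.id") === col("edge.src"))
      .join(dst, col("edge.dst") === col("dst.id"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
    import spark.implicits._

    // Illustrative data: "payload" stands in for a large state column.
    val vertices = Seq((1L, "seq-1", "large-1"), (2L, "seq-2", "large-2"))
      .toDF("id", "sequences", "payload")
    val edges = Seq((1L, 2L)).toDF("src", "dst")

    // Only the source vertex's "sequences" travels with each triplet.
    prunedTriplets(vertices, edges, Seq("sequences"), Seq.empty).printSchema()
    spark.stop()
  }
}
```

In this sketch the `payload` column never enters the join, so the triplets dataset grows only with the columns that the message expressions actually read.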

Bonus: update the existing Pregel-based algorithms to explicitly provide only the required columns (based on the context of sendToSrc and sendToDst).

Bonus 2: provide PySpark Classic / Connect APIs.

Component

  • Scala Core Internal
  • Scala API
  • Spark Connect Plugin
  • Infrastructure
  • PySpark Classic
  • PySpark Connect

Additional context

Are you planning on creating a PR?

  • I'm willing to make a pull-request
