Skip to content

feat: add NeighborhoodAwareCDLP community detection algorithm#825

Open
SemyonSinchenko wants to merge 14 commits into
graphframes:mainfrom
SemyonSinchenko:791-community-detection
Open

feat: add NeighborhoodAwareCDLP community detection algorithm#825
SemyonSinchenko wants to merge 14 commits into
graphframes:mainfrom
SemyonSinchenko:791-community-detection

Conversation

@SemyonSinchenko
Copy link
Copy Markdown
Collaborator

@SemyonSinchenko SemyonSinchenko commented Apr 7, 2026

What changes were proposed in this pull request?

  • Introduce NeighborhoodAwareCDLP: a neighborhood-aware variant of label propagation that weights incoming votes by a combination of direct-link strength (a) and neighborhood overlap (c * commonNeighbors).
  • Add implementation at core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala with:
    • approximate common-neighbor estimation using theta sketches,
    • parameters for a, c, initial label column, and sketch size,
    • Pregel-based propagation and integration with GraphFrame options.
  • Expose API on GraphFrame as structureAwareLabelPropagation.
  • Add comprehensive unit tests at core/src/test/scala/org/graphframes/lib/NeighborhoodAwareCDLPSuite.scala covering basic propagation, parameter sensitivity, directed/undirected behavior, isolated vertices, and disconnected components.
  • Bump default Spark version from 3.5.7 to 3.5.8 in build.sbt.
  • Note: the theta-sketch based overlap estimation requires Spark >= 4.1; the implementation checks the Spark version and fails fast on older versions.

Why are the changes needed?

The current CDLP is very "basic" but optimized well for it's own problem. I do not want to break it. The new implementation is mostly based on the https://arxiv.org/pdf/1105.3264 with my won adjustments.

image

(on the picture c=0 is a classical CDLP)

Close #791
Close #301
Close #456 (partially?)

Python?

After I get an approve on the core I will add python.

- Introduce NeighborhoodAwareCDLP: a neighborhood-aware variant of label
  propagation that weights incoming votes by a combination of direct-link
  strength (a) and neighborhood overlap (c * commonNeighbors).
- Add implementation at
  core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala with:
  - approximate common-neighbor estimation using theta sketches,
  - parameters for a, c, initial label column, and sketch size,
  - Pregel-based propagation and integration with GraphFrame options.
- Expose API on GraphFrame as structureAwareLabelPropagation.
- Add comprehensive unit tests at
  core/src/test/scala/org/graphframes/lib/NeighborhoodAwareCDLPSuite.scala
  covering basic propagation, parameter sensitivity, directed/undirected
  behavior, isolated vertices, and disconnected components.
- Bump default Spark version from 3.5.7 to 3.5.8 in build.sbt.
- Note: the theta-sketch based overlap estimation requires Spark >= 4.1;
  the implementation checks the Spark version and fails fast on older versions.#
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new community detection algorithm to GraphFrames: a neighborhood-aware variant of label propagation that weights label “votes” using a direct-link term plus an approximate common-neighbor overlap term (Theta sketches), and exposes it via the GraphFrame API.

Changes:

  • Introduces NeighborhoodAwareCDLP implementation using Pregel and Spark 4.1+ Theta sketch SQL functions.
  • Exposes the algorithm on GraphFrame as structureAwareLabelPropagation.
  • Adds a new Scala test suite for correctness/sensitivity cases and bumps the default Spark version to 3.5.8.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 7 comments.

File Description
core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala Implements NeighborhoodAwareCDLP (weighted label propagation with sketch-based overlap) and integrates with Pregel/options.
core/src/main/scala/org/graphframes/GraphFrame.scala Adds a public entrypoint structureAwareLabelPropagation.
core/src/test/scala/org/graphframes/lib/NeighborhoodAwareCDLPSuite.scala Adds unit tests covering propagation behavior, parameter effects, directionality, and edge cases.
build.sbt Bumps default Spark version from 3.5.7 to 3.5.8.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala Outdated
Comment thread core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala
Comment thread core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala Outdated
Comment thread core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala Outdated
Comment thread core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala
Comment thread core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala Outdated
Comment thread core/src/test/scala/org/graphframes/lib/NeighborhoodAwareCDLPSuite.scala Outdated
SemyonSinchenko and others added 3 commits April 7, 2026 23:16
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…uite.scala

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 8, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 1.12360% with 88 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.55%. Comparing base (a28a4e8) to head (5bdc3e1).
⚠️ Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
...la/org/graphframes/lib/NeighborhoodAwareCDLP.scala 0.00% 70 Missing ⚠️
...park/sql/graphframes/GraphFramesConnectUtils.scala 0.00% 14 Missing ⚠️
core/src/main/scala/org/graphframes/mixins.scala 25.00% 3 Missing ⚠️
...re/src/main/scala/org/graphframes/GraphFrame.scala 0.00% 1 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #825      +/-   ##
==========================================
- Coverage   80.75%   79.55%   -1.20%     
==========================================
  Files          78       79       +1     
  Lines        4421     4485      +64     
  Branches      543      548       +5     
==========================================
- Hits         3570     3568       -2     
- Misses        851      917      +66     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala Outdated
Comment thread core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala
Comment thread core/src/test/scala/org/graphframes/TestUtils.scala Outdated
Comment thread core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala Outdated
Comment thread core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala Outdated
Comment thread core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala Outdated
Comment thread core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala Outdated
Comment thread core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala Outdated
Comment thread core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala Outdated
Comment thread core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala Outdated
Comment thread core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala
Comment thread core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala Outdated
Multiple changes due to applying a formatter. Real changes only for
the new algorithm.
Comment thread docs/src/04-user-guide/06-graph-clustering.md
@james-willis
Copy link
Copy Markdown
Collaborator

Claude raised the following naming inconsistencies:

Naming mismatches worth raising:

  • Entrypoint name differs across languages. Scala: GraphFrame.structureAwareLabelPropagation (singular "structure", no "neighborhood"). Python: g.neighborhood_aware_cdlp(...). The Scala class is NeighborhoodAwareCDLP. Three different names for the same algorithm — pick one.
  • Python casing is inconsistent with siblings. Existing Python public methods are camelCase (labelPropagation, pageRank, shortestPaths), but the new one is neighborhood_aware_cdlp. Should probably be neighborhoodAwareCDLP (or whatever name aligns with the Scala entrypoint after the rename above).

@SemyonSinchenko
Copy link
Copy Markdown
Collaborator Author

  • Python casing is inconsistent with siblings. Existing Python public methods are camelCase (labelPropagation, pageRank, shortestPaths), but the new one is neighborhood_aware_cdlp. Should probably be neighborhoodAwareCDLP (or whatever name aligns with the Scala entrypoint after the rename above).

We should update AGENTS.md. Overall goal is to move in the direction when the Python API is "pythonic" (multi-arg methods instead of builders, snake_case naming, etc.). We cannot just change all the existing code, but we can at least avoid adding new tech debt. See #713 and feel free to comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

4 participants