feat: add NeighborhoodAwareCDLP community detection algorithm #825
SemyonSinchenko wants to merge 14 commits
- Introduce NeighborhoodAwareCDLP: a neighborhood-aware variant of label propagation that weights incoming votes by a combination of direct-link strength (a) and neighborhood overlap (c * commonNeighbors).
- Add implementation at core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala with:
  - approximate common-neighbor estimation using theta sketches,
  - parameters for a, c, initial label column, and sketch size,
  - Pregel-based propagation and integration with GraphFrame options.
- Expose API on GraphFrame as structureAwareLabelPropagation.
- Add comprehensive unit tests at core/src/test/scala/org/graphframes/lib/NeighborhoodAwareCDLPSuite.scala covering basic propagation, parameter sensitivity, directed/undirected behavior, isolated vertices, and disconnected components.
- Bump default Spark version from 3.5.7 to 3.5.8 in build.sbt.
- Note: the theta-sketch based overlap estimation requires Spark >= 4.1; the implementation checks the Spark version and fails fast on older versions.
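The vote weighting described above can be illustrated with a minimal pure-Python sketch. This is not the PR's implementation: it runs a single synchronous step on an undirected graph, counts common neighbors exactly instead of estimating them with theta sketches, and breaks ties by picking the smallest label. The function name `propagate_once` and the tie-breaking rule are assumptions made for illustration only.

```python
from collections import defaultdict


def propagate_once(edges, labels, a=1.0, c=1.0):
    """One synchronous step of neighborhood-aware label propagation.

    Illustrative only: each neighbor u votes for its label with weight
    a + c * |N(u) & N(v)|, using exact common-neighbor counts.
    """
    # Build undirected adjacency sets.
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    new_labels = {}
    for v in labels:
        votes = defaultdict(float)
        for u in neighbors[v]:
            common = len(neighbors[u] & neighbors[v])
            votes[labels[u]] += a + c * common
        if not votes:
            # Isolated vertex: nothing to vote on, keep the current label.
            new_labels[v] = labels[v]
        else:
            # Highest total weight wins; ties go to the smallest label
            # (an assumption of this sketch, not taken from the PR).
            new_labels[v] = min(votes, key=lambda lab: (-votes[lab], lab))
    return new_labels


# Vertex 0 has two neighbors in a triangle (label "A") and three
# weak ties to otherwise unconnected vertices (label "B").
edges = [(0, 1), (0, 2), (1, 2), (0, 3), (0, 4), (0, 5)]
labels = {0: "X", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

plain = propagate_once(edges, labels, a=1.0, c=0.0)      # plain majority vote
weighted = propagate_once(edges, labels, a=1.0, c=1.0)   # overlap-aware vote
print(plain[0], weighted[0])  # prints "B A"
```

With `c=0.0` vertex 0 follows the raw majority (three "B" votes against two "A" votes). With `c=1.0` each "A" neighbor shares one common neighbor with vertex 0, so "A" collects a total weight of 4 against 3 for "B".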
Pull request overview
Adds a new community detection algorithm to GraphFrames: a neighborhood-aware variant of label propagation that weights label “votes” using a direct-link term plus an approximate common-neighbor overlap term (Theta sketches), and exposes it via the GraphFrame API.
Changes:
- Introduces `NeighborhoodAwareCDLP` implementation using Pregel and Spark 4.1+ Theta sketch SQL functions.
- Exposes the algorithm on `GraphFrame` as `structureAwareLabelPropagation`.
- Adds a new Scala test suite for correctness/sensitivity cases and bumps the default Spark version to 3.5.8.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| core/src/main/scala/org/graphframes/lib/NeighborhoodAwareCDLP.scala | Implements NeighborhoodAwareCDLP (weighted label propagation with sketch-based overlap) and integrates with Pregel/options. |
| core/src/main/scala/org/graphframes/GraphFrame.scala | Adds a public entrypoint structureAwareLabelPropagation. |
| core/src/test/scala/org/graphframes/lib/NeighborhoodAwareCDLPSuite.scala | Adds unit tests covering propagation behavior, parameter effects, directionality, and edge cases. |
| build.sbt | Bumps default Spark version from 3.5.7 to 3.5.8. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…uite.scala Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #825 +/- ##
==========================================
- Coverage 80.75% 79.55% -1.20%
==========================================
Files 78 79 +1
Lines 4421 4485 +64
Branches 543 548 +5
==========================================
- Hits 3570 3568 -2
- Misses 851 917 +66
|
Most of the changes come from applying a formatter. The only real changes are for the new algorithm.
|
Claude raised the following naming inconsistencies: Naming mismatches worth raising:
|
We should update AGENTS.md. The overall goal is to move toward a "pythonic" Python API (multi-arg methods instead of builders, snake_case naming, etc.). We cannot just change all the existing code, but we can at least avoid adding new tech debt. See #713 and feel free to comment. |
What changes were proposed in this pull request?
Why are the changes needed?
The current CDLP is very "basic" but well optimized for its own problem, and I do not want to break it. The new implementation is mostly based on https://arxiv.org/pdf/1105.3264, with my own adjustments.
(In the picture, c=0 corresponds to classical CDLP.)
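Why c=0 falls back to classical CDLP can be seen by writing the vote weight out. The notation below is an assumption for illustration (it is taken from neither the PR code nor the paper): N(v) is the neighbor set of v and ℓ(u) is the current label of u.

```latex
w(u, v) = a + c \cdot \lvert N(u) \cap N(v) \rvert,
\qquad
\ell'(v) = \arg\max_{l} \sum_{\substack{u \in N(v) \\ \ell(u) = l}} w(u, v)
```

For c = 0 and a > 0 every neighbor contributes the same constant weight a, so the arg max simply picks the most frequent label among the neighbors, which is the classical label propagation rule.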
Close #791
Close #301
Close #456 (partially?)
Python?
Once the core part is approved, I will add the Python API.