Skip to content

feat: add approximate triangle counting algorithm using DataSketches#789

Merged
SemyonSinchenko merged 5 commits into
graphframes:mainfrom
SemyonSinchenko:786-approx-triangle-count
Feb 3, 2026
Merged

feat: add approximate triangle counting algorithm using DataSketches#789
SemyonSinchenko merged 5 commits into
graphframes:mainfrom
SemyonSinchenko:786-approx-triangle-count

Conversation

@SemyonSinchenko
Copy link
Copy Markdown
Collaborator

What changes were proposed in this pull request?

This introduces a new "approx" algorithm for triangle counting that leverages Apache Spark's DataSketches (Theta sketches) integration (available since Spark 4.1). This provides a scalable alternative to the existing "exact" implementation, which often encounters Out-of-Memory (OOM) issues on power-law graphs with high-degree nodes.

Changes include:

  • Core: Added approximateRun to TriangleCount.scala using theta_sketch_agg and theta_intersection.
  • Core: Added GraphFramesRequireSpark exception for feature gating.
  • Connect: Updated protobuf definition and Scala/Python logic to support algorithm and lg_nom_entries parameters.
  • Python: Updated triangleCount API in both classic and connect implementations to support the new parameters.
  • Docs: Expanded the traversal guide with a detailed comparison between exact and approximate implementations and usage guidelines.
  • Tests: Added Scala and Python test suites for the approximate algorithm.

Why are the changes needed?

Closes #786

This introduces a new "approx" algorithm for triangle counting that
leverages Apache Spark's DataSketches (Theta sketches) integration
(available since Spark 4.1). This provides a scalable alternative to the
existing "exact" implementation, which often encounters
Out-of-Memory (OOM) issues on power-law graphs with high-degree nodes.

Changes include:
- Core: Added `approximateRun` to `TriangleCount.scala` using
  `theta_sketch_agg` and `theta_intersection`.
- Core: Added `GraphFramesRequireSpark` exception for feature gating.
- Connect: Updated protobuf definition and Scala/Python logic to support
  `algorithm` and `lg_nom_entries` parameters.
- Python: Updated `triangleCount` API in both classic and connect
  implementations to support the new parameters.
- Docs: Expanded the traversal guide with a detailed comparison between
  exact and approximate implementations and usage guidelines.
- Tests: Added Scala and Python test suites for the approximate algorithm.

Closes graphframes#786
@SemyonSinchenko SemyonSinchenko added scala pyspark-classic GraphFrames on PySpark Classic pyspark-connect GraphFrames on PySpark Connect labels Jan 29, 2026
Comment thread core/src/main/scala/org/graphframes/lib/TriangleCount.scala
Comment thread core/src/main/scala/org/graphframes/lib/TriangleCount.scala Outdated
Comment thread python/graphframes/classic/graphframe.py
Comment thread python/graphframes/connect/graphframes_client.py
Comment thread core/src/main/scala/org/graphframes/lib/TriangleCount.scala Outdated
Comment thread core/src/main/scala/org/graphframes/exceptions.scala Outdated
Comment thread core/src/main/scala/org/graphframes/exceptions.scala Outdated
Comment thread core/src/main/scala/org/graphframes/lib/TriangleCount.scala
Comment thread core/src/test/scala/org/graphframes/lib/TriangleCountSuite.scala Outdated
Comment thread python/graphframes/classic/graphframe.py
@SemyonSinchenko SemyonSinchenko merged commit 48e0c15 into graphframes:main Feb 3, 2026
7 checks passed
@codecov-commenter
Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 16.27907% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.75%. Comparing base (d36a5ac) to head (ea3cfdc).
⚠️ Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
...main/scala/org/graphframes/lib/TriangleCount.scala 19.44% 29 Missing ⚠️
...park/sql/graphframes/GraphFramesConnectUtils.scala 0.00% 7 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #789      +/-   ##
==========================================
- Coverage   84.90%   83.75%   -1.15%     
==========================================
  Files          68       68              
  Lines        3444     3539      +95     
  Branches      431      415      -16     
==========================================
+ Hits         2924     2964      +40     
- Misses        520      575      +55     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@SemyonSinchenko SemyonSinchenko deleted the 786-approx-triangle-count branch February 3, 2026 09:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pyspark-classic GraphFrames on PySpark Classic pyspark-connect GraphFrames on PySpark Connect scala

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: approximate triangles count API

3 participants