
feat: random walks and embeddings#752

Merged
SemyonSinchenko merged 87 commits into
graphframes:mainfrom
SemyonSinchenko:726-sampling-api
Mar 10, 2026
Conversation

@SemyonSinchenko
Collaborator

What changes were proposed in this pull request?

  • RandomWalks Base
  • RandomWalks with Restart Impl
  • Edges Sampling API

Why are the changes needed?

Close #726
Close #324

@SemyonSinchenko
Collaborator Author

Work is still in progress, actually. I'm still thinking about how best to implement RW: what should live in the abstraction vs. the implementations, what configuration should be exposed, etc.

Current idea:

  • limiting the number of collected neighbors (would be nice to have Reservoir Sampling, cc: @SauronShepherd )
  • run in batches
  • each batch generates short walks and saves them to parquet (partitioning???)
  • at the end we join all the batches on the initially generated RW UUID

I think this should allow generating really long walks with quite limited resources... Not sure about performance.
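For illustration only, the batching idea above can be sketched in a toy single-machine way in Python (the in-memory graph, UUID walk keys, and batch loop are all hypothetical stand-ins for the DataFrame/parquet pipeline described in the bullets):

```python
import random
import uuid

# Toy adjacency list standing in for the edges DataFrame.
graph = {1: [2, 3], 2: [3, 4], 3: [1], 4: [2]}

rng = random.Random(42)

# One UUID per walk; each batch extends every walk by a few hops,
# and the per-batch segments are later concatenated ("joined") by UUID.
walks = {uuid.uuid4().hex: [v] for v in graph}

def run_batch(walks, hops):
    """Extend each walk by up to `hops` steps, returning short segments."""
    segments = {}
    for wid, path in walks.items():
        node, seg = path[-1], []
        for _ in range(hops):
            nbrs = graph.get(node, [])
            if not nbrs:  # dead end: stop this walk early
                break
            node = rng.choice(nbrs)
            seg.append(node)
        segments[wid] = seg  # in the PR's design this would go to parquet
    return segments

for _ in range(3):  # three batches of short walks
    for wid, seg in run_batch(walks, hops=2).items():
        walks[wid].extend(seg)  # "join" the batches on the walk UUID
```

The point of the sketch is that each batch only touches the walk frontier, so a long walk is assembled from many short, cheap extensions.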

@rjurney
Collaborator

rjurney commented Nov 20, 2025

@SemyonSinchenko How could this work? Like a Pregel thing? Maybe nodes carry their paths with them? I've long been confused here... joins don't seem to scale. I'm not sure but would love to understand... can you explain? I mean, I have read other implementations, like the one I shared in the other thread: https://github.com/data61/stellar-random-walk

@SemyonSinchenko
Collaborator Author

@SemyonSinchenko How could this work? Like a Pregel thing? Maybe nodes carry their paths with them? I've long been confused here... joins don't seem to scale. I'm not sure but would love to understand... can you explain? I mean, I have read other implementations, like the one I shared in the other thread: https://github.com/data61/stellar-random-walk

Change my mind, but a pure second-order RW is not scalable. Just to understand: imagine two nodes with degree 1000 (a common case in power-law graphs). You need to collect two sets of neighborhoods of size 1000. It scales even worse than GraphFrames' triangleCount, which suffers from the same problem.
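To make the point concrete, here is a toy Python sketch (not project code) of a node2vec-style second-order step: weighting the candidates for a single transition along (prev → curr) already requires materializing both hub neighborhoods.

```python
def second_order_weights(graph, prev, curr, p=1.0, q=2.0):
    """node2vec-style biased weights for one step of a second-order walk.
    Needs curr's full neighbor list AND membership checks against prev's
    neighbors, so two hubs force collecting two large sets per step."""
    prev_nbrs = set(graph[prev])          # first 1000-element set
    weights = {}
    for nxt in graph[curr]:               # second 1000-element set
        if nxt == prev:
            weights[nxt] = 1.0 / p        # return to the previous vertex
        elif nxt in prev_nbrs:
            weights[nxt] = 1.0            # distance 1 from prev
        else:
            weights[nxt] = 1.0 / q        # distance 2 from prev
    return weights

# Two hubs of degree 1000 with partially overlapping neighborhoods:
graph = {
    "a": ["b"] + [f"n{i}" for i in range(999)],
    "b": ["a"] + [f"n{i}" for i in range(500, 1499)],
}
w = second_order_weights(graph, prev="a", curr="b")
```

In a distributed setting, every walker sitting on such an edge needs both of these sets shipped to it, which is where the scalability pain comes from.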

@SemyonSinchenko
Collaborator Author

Yes, it is doing joins. Joins are slow, but they are scalable. I see no other options, tbh. To avoid huge neighborhoods, I'm using a limit: at each batch, take only part of a vertex's neighbors.

@SauronShepherd
Contributor

I'm afraid I'm not knowledgeable enough to give a strong opinion here, but after a short chat with ChatGPT I wondered: what about integrating an existing system like ThunderRW with GraphFrames?

@SemyonSinchenko
Collaborator Author

I'm starting to think that we do not need RWs at all. @SauronShepherd it is not a problem to write it from scratch in Spark. My question to you was mostly about the reservoir sampling aggregation function, if you are interested in implementing it.

@rjurney
Collaborator

rjurney commented Nov 21, 2025

Various embedding algorithms require random walks and I've seen implementations out there, but maybe they're not top priority?

@SemyonSinchenko
Collaborator Author

SemyonSinchenko commented Nov 21, 2025

Various embedding algorithms require random walks and I've seen implementations out there, but maybe they're not top priority?

This PR is a WIP implementation, so if you have any suggestions or comments, feel free. I'm trying to make it scalable, and at first look it is.

- Added Scala-style docstrings to all classes, traits, methods, and fields
- Improved documentation for random walk algorithms and configurations
- Correct element_at index from 0 to 1 for 1-based Spark SQL arrays
- Fix walk array construction by appending nextNode instead of currVisitingVertex
- Add null handling for nodes with no outgoing neighbors in restart logic
- Add comprehensive Scala docstrings to RandomWalkBase and RandomWalkWithRestart
- Create RWExample.scala demonstrating RandomWalkWithRestart on LDBC datasets

...
This commit introduces a new Word2Vec-based embedding method using the hashing trick to handle large vocabularies efficiently in graph frames, particularly for random walk sequences. It includes configurable parameters like number of hashing functions, max features, and standard W2V settings, with comprehensive Scaladoc for public APIs.

- Added core/src/main/scala/org/graphframes/embeddings/Word2VecHashingTrick.scala: New class implementing hashing trick by applying multiple Murmur3 hash functions and modulo to map features to a fixed-size space, reducing collisions and memory usage. It trains a W2V model on expanded sequences and provides a companion model class for vector retrieval via averaging hashed embeddings. Setters include docstrings explaining trade-offs (e.g., more hashes improve quality but multiply dataset size).
- Modified core/src/main/scala/org/graphframes/examples/RWExample.scala: Updated main method to accept a single file path argument for edge loading instead of downloading LDBC datasets, simplifying usage for local files. Replaced vertex loading with direct derivation from edges for consistency and reduced I/O.
- Modified core/src/main/scala/org/graphframes/exceptions.scala: Added GraphFramesW2VException class to handle W2V-specific errors, such as unsupported input types in hashing.
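As a toy illustration of the hashing trick described above (not the project's Scala code: md5 stands in for Murmur3, and the parameter names are hypothetical), each token is replaced by several bucket ids in a fixed-size feature space before W2V training:

```python
import hashlib

def hashed_features(token, num_hashes=3, max_features=1 << 18):
    """Map a vocabulary token to `num_hashes` bucket ids in a fixed-size
    feature space. A walk sequence is expanded by replacing every token
    with its hashed ids; lookup later averages the ids' vectors, so two
    tokens only fully collide if ALL of their hashes collide."""
    ids = []
    for salt in range(num_hashes):
        digest = hashlib.md5(f"{salt}:{token}".encode()).digest()
        ids.append(int.from_bytes(digest[:8], "big") % max_features)
    return ids

ids = hashed_features("vertex_42")
```

This mirrors the trade-off noted in the setter docstrings: more hash functions improve embedding quality but multiply the training dataset size.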
Replace collect_set + shuffle + slice with ReservoirSamplingAgg UDAF for
efficient sampling of up to maxNbrs neighbors per vertex. This improves
performance by avoiding full neighbor list aggregation and shuffling,
especially beneficial for high-degree vertices.

- Add ReservoirSamplingAgg trait: generic aggregator using reservoir
  sampling algorithm, supporting merge operations for distributed
  computation.
- Handle various vertex ID types (String, Short, Byte, Int, Long) with
  appropriate encoders.
- Raise GraphFramesUnsupportedVertexTypeException for unsupported types.
- Add comprehensive test suite covering reduce, merge, and finish
  operations with edge cases and fixed seeds for determinism.

Modified files:
- .gitignore: Ignore Emacs temp files for cleaner diffs.
- core/src/main/scala/org/graphframes/exceptions.scala: New exception class.
- core/src/main/scala/org/graphframes/rw/RandomWalkBase.scala: Integrate
  ReservoirSamplingAgg in prepareGraph method.

New files:
- core/src/main/scala/org/apache/spark/sql/graphframes/expressions/ReservoirSamplingAgg.scala
- core/src/test/scala/org/apache/spark/sql/graphframes/expressions/ReservoirSamplingAggSuite.scala
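As an aside, the reduce/merge/finish shape of such an aggregator can be sketched in plain Python (an illustrative toy, not the Scala UDAF; the merge shown draws from each side proportionally to its element count, which is only approximately uniform):

```python
import random

class ReservoirSampler:
    """Mergeable reservoir sampler sketch: `reduce` consumes one element,
    `merge` combines two partial reservoirs, `finish` returns the sample."""

    def __init__(self, k, seed=0):
        self.k = k
        self.seen = 0
        self.sample = []
        self.rng = random.Random(seed)  # fixed seed for determinism

    def reduce(self, x):
        self.seen += 1
        if len(self.sample) < self.k:
            self.sample.append(x)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.k:  # keep x with probability k / seen
                self.sample[j] = x

    def merge(self, other):
        # Draw from each partial reservoir proportionally to how many
        # elements it has seen (an approximation, fine for a sketch).
        n1, n2 = self.seen, other.seen
        s1, s2 = list(self.sample), list(other.sample)
        merged = []
        while len(merged) < self.k and (s1 or s2):
            left = s1 and (not s2 or self.rng.random() < n1 / (n1 + n2))
            src = s1 if left else s2
            merged.append(src.pop(self.rng.randrange(len(src))))
        self.sample, self.seen = merged, n1 + n2

    def finish(self):
        return list(self.sample)
```

The key property for Spark is that `merge` lets each partition aggregate its own reservoir before a final combine, so memory stays O(k) per vertex.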
- delete wrong implementation of w2v + hashing
- replace Reservoir sampling by KMinSampling
- add L2norm to Hash2vec
- add an optional convolution step to RW embeddings
- small updates and performance fixes
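For readers unfamiliar with the technique, KMin sampling can be sketched in a few lines of Python (an illustrative toy, not the project's Scala aggregator; md5 stands in for Murmur3). Its appeal over a reservoir is that the merge is exact and needs no coordinated random state:

```python
import hashlib
import heapq

def _h(x):
    # Stable 64-bit hash, standing in for Murmur3 in the Scala code.
    return int.from_bytes(hashlib.md5(str(x).encode()).digest()[:8], "big")

class KMinSample:
    """Keep the k elements with the smallest hash values. The hash ranks
    elements pseudo-randomly, so the k survivors form a uniform sample,
    and merging two partial aggregates is just "union, then keep the k
    smallest again" -- constant memory, no random state to coordinate."""

    def __init__(self, k):
        self.k = k
        self.items = []  # max-heap via negated hash: worst survivor on top

    def reduce(self, x):
        entry = (-_h(x), x)
        if len(self.items) < self.k:
            heapq.heappush(self.items, entry)
        elif entry > self.items[0]:  # smaller hash than the current worst
            heapq.heapreplace(self.items, entry)

    def merge(self, other):
        for _, x in other.items:
            self.reduce(x)

    def finish(self):
        return [x for _, x in self.items]
```

Because the k globally smallest hashes are always contained in the union of each partition's k smallest, the partial aggregates lose no information.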
@SemyonSinchenko
Collaborator Author

Mostly, the latest changes are related to the performance of Hash2Vec and GC pressure on huge graphs / walks / data. Based on my tests, these are resolved and no longer a problem.

@SemyonSinchenko
Collaborator Author

@james-willis Hi! Thanks for the review. I addressed all of your comments; could you take another look?

@james-willis
Collaborator

Looking at the KMinSampling implementation, I have a suggestion for future optimization:

Performance Enhancement: Bloom Filters for High-Degree Vertices

The current KMinSampling approach works well for most cases, but could benefit from bloom filters when dealing with very high-degree vertices (e.g., >10k neighbors).

For vertices with extremely large neighbor sets, a bloom filter could:

  • Pre-filter candidates before reservoir sampling
  • Reduce memory pressure during sampling
  • Improve performance on scale-free networks with hub vertices

This could be implemented as an optional optimization in a follow-up PR, perhaps triggered automatically when the degree exceeds a threshold or when degree distribution analysis indicates the presence of super-nodes.

The current implementation is solid and this would be a nice-to-have enhancement for very large graphs.

Comment by Claude (AI Assistant)

@SemyonSinchenko
Collaborator Author

Looking at the KMinSampling implementation, I have a suggestion for future optimization:

Performance Enhancement: Bloom Filters for High-Degree Vertices

The current KMinSampling approach works well for most cases, but could benefit from bloom filters when dealing with very high-degree vertices (e.g., >10k neighbors).

For vertices with extremely large neighbor sets, a bloom filter could:

  • Pre-filter candidates before reservoir sampling
  • Reduce memory pressure during sampling
  • Improve performance on scale-free networks with hub vertices

This could be implemented as an optional optimization in a follow-up PR, perhaps triggered automatically when the degree exceeds a threshold or when degree distribution analysis indicates the presence of super-nodes.

The current implementation is solid and this would be a nice-to-have enhancement for very large graphs.

Comment by Claude (AI Assistant)

I don't understand what this means, tbh. The KMinSampling is here exactly to avoid the super-nodes problem:

  • it has pre-aggregate mechanics (partial aggregation)
  • it has constant memory consumption
  • the order of the if-else branches is written specifically to mitigate the super-nodes problem (because the "short path" is useless on small-degree nodes)

Where, how, and why should I put bloom filters here? We need samples, not a probabilistic structure for checking whether a set contains an element... Also, to build a bloom filter you would need another aggregation similar to this KMin.

Could you clarify please?

Collaborator

@james-willis james-willis left a comment

LGTM.

Please ignore the bloom filter question. I agree it doesn't make sense with the sieve filter (honestly can't figure out what a bloom filter could be used for here).

@SemyonSinchenko SemyonSinchenko merged commit 6a0f34c into graphframes:main Mar 10, 2026
7 checks passed
@SemyonSinchenko SemyonSinchenko deleted the 726-sampling-api branch March 10, 2026 20:12

Labels

documentation, pyspark-classic (GraphFrames on PySpark Classic), pyspark-connect (GraphFrames on PySpark Connect), scala

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: sampling API and strategies
feat: implement Node2Vec
feat: support for generating random walks

6 participants