diff --git a/README.md b/README.md
index 69c3a8eac..bdbda309c 100644
--- a/README.md
+++ b/README.md
@@ -1,39 +1,219 @@
-# graphframes
+
+
[](https://github.com/graphframes/graphframes/actions/workflows/scala-ci.yml)
[](https://github.com/graphframes/graphframes/actions/workflows/python-ci.yml)
[](https://github.com/graphframes/graphframes/actions/workflows/pages/pages-build-deployment)
-
# GraphFrames: DataFrame-based Graphs
-This is a package for DataFrame-based graphs on top of Apache Spark.
-Users can write highly expressive queries by leveraging the DataFrame API, combined with a new
-API for motif finding. The user also benefits from DataFrame performance optimizations
-within the Spark SQL engine.
+This is a package for graph processing and analytics at scale. It is built on top of Apache Spark and relies on the DataFrame abstraction. Users can write highly expressive queries by leveraging the DataFrame API, combined with a new API for network motif finding. The user also benefits from DataFrame performance optimizations within the Spark SQL engine. GraphFrames works in Java, Scala, and Python.
+
+You can find the user guide and API docs at https://graphframes.github.io/graphframes
+
+## GraphFrames is Back!
+
+This project was in maintenance mode for some time, but we are happy to announce that it is now back in active development! We are working on a new release with many bug fixes and improvements, as well as a new website and documentation.
+
+## Installation and Quick-Start
+
+The easiest way to start using GraphFrames is through the [Spark Packages system](https://spark-packages.org/package/graphframes/graphframes). Just run one of the following commands:
+
+```bash
+# Interactive Scala/Java
+$ spark-shell --packages graphframes:graphframes:0.8.4-spark3.5-s_2.12
+
+# Interactive Python
+$ pyspark --packages graphframes:graphframes:0.8.4-spark3.5-s_2.12
+
+# Submit a script in Scala/Java/Python
+$ spark-submit --packages graphframes:graphframes:0.8.4-spark3.5-s_2.12 script.py
+```
+
+Now you can create a GraphFrame as follows.
+
+In Python:
+
+```python
+from pyspark.sql import SparkSession
+from graphframes import GraphFrame
+
+spark = SparkSession.builder.getOrCreate()
+
+nodes = [
+ (1, "Alice", 30),
+ (2, "Bob", 25),
+ (3, "Charlie", 35)
+]
+nodes_df = spark.createDataFrame(nodes, ["id", "name", "age"])
+
+edges = [
+ (1, 2, "friend"),
+ (2, 1, "friend"),
+ (2, 3, "friend"),
+ (3, 2, "enemy") # eek!
+]
+edges_df = spark.createDataFrame(edges, ["src", "dst", "relationship"])
+
+g = GraphFrame(nodes_df, edges_df)
+```
+
+Now let's run some graph algorithms at scale!
+
+```python
+g.inDegrees.show()
+
+# +---+--------+
+# | id|inDegree|
+# +---+--------+
+# | 2| 2|
+# | 1| 1|
+# | 3| 1|
+# +---+--------+
+
+g.outDegrees.show()
+
+# +---+---------+
+# | id|outDegree|
+# +---+---------+
+# | 1| 1|
+# | 2| 2|
+# | 3| 1|
+# +---+---------+
+
+g.degrees.show()
+
+# +---+------+
+# | id|degree|
+# +---+------+
+# | 1| 2|
+# | 2| 4|
+# | 3| 2|
+# +---+------+
+
+g2 = g.pageRank(resetProbability=0.15, tol=0.01)
+g2.vertices.show()
-You can find user guide and API docs at https://graphframes.github.io/graphframes.
+# +---+-------+---+------------------+
+# | id|   name|age|          pagerank|
+# +---+-------+---+------------------+
+# |  1|  Alice| 30|0.7758750474847483|
+# |  2|    Bob| 25|1.4482499050305027|
+# |  3|Charlie| 35|0.7758750474847483|
+# +---+-------+---+------------------+
+
+# GraphFrames' most used feature...
+# Connected components can do big data entity resolution on billions or even trillions of records!
+# First connect records with a similarity metric, then run connectedComponents.
+# This gives you groups of identical records, which you then link by same_as edges or merge into list-based master records.
+spark.sparkContext.setCheckpointDir("/tmp/graphframes-example-connected-components")  # required by connectedComponents
+g.connectedComponents().show()
+
+# +---+-------+---+---------+
+# | id|   name|age|component|
+# +---+-------+---+---------+
+# |  1|  Alice| 30|        1|
+# |  2|    Bob| 25|        1|
+# |  3|Charlie| 35|        1|
+# +---+-------+---+---------+
+
+# Find frenemies with network motif finding! See how graph and relational queries are combined?
+(
+ g.find("(a)-[e]->(b); (b)-[e2]->(a)")
+ .filter("e.relationship = 'friend' and e2.relationship = 'enemy'")
+ .show()
+)
+
+# These are paths, which you can aggregate and count to find complex patterns.
+# +------------+--------------+----------------+-------------+
+# | a| e| b| e2|
+# +------------+--------------+----------------+-------------+
+# |{2, Bob, 25}|{2, 3, friend}|{3, Charlie, 35}|{3, 2, enemy}|
+# +------------+--------------+----------------+-------------+
+```
+
+## Learn GraphFrames
+
+To learn more about GraphFrames, check out these resources:
+* [GraphFrames Documentation](https://graphframes.github.io/graphframes)
+* [GraphFrames Network Motif Finding Tutorial](https://graphframes.github.io/graphframes/docs/_site/motif-tutorial.html)
+* [Introducing GraphFrames](https://databricks.com/blog/2016/03/03/introducing-graphframes.html)
+* [On-Time Flight Performance with GraphFrames for Apache Spark](https://databricks.com/blog/2016/03/16/on-time-flight-performance-with-graphframes-for-apache-spark.html)
+
+## Community Resources
+
+* [GraphFrames Google Group](https://groups.google.com/forum/#!forum/graphframes)
+* [#graphframes Discord Channel on GraphGeeks](https://discord.com/channels/1162999022819225631/1326257052368113674)
+
+## `graphframes-py` is our Official PyPI Package
+
+We recommend using the Spark Packages system to install the latest version of GraphFrames, but we now publish a build of our Python package to PyPI as [graphframes-py](https://pypi.org/project/graphframes-py/). It can be used to provide type hints in IDEs, but it does not load the Java side of GraphFrames, so it will not work without loading the GraphFrames package. See [Installation and Quick-Start](#installation-and-quick-start).
+
+```bash
+pip install graphframes-py
+```
+
+This project does not own or control the [graphframes PyPI package](https://pypi.org/project/graphframes/) (installs 0.6.0) or [graphframes-latest PyPI package](https://pypi.org/project/graphframes-latest/) (installs 0.8.4).
+
+## GraphFrames and sbt
+
+If you use the sbt-spark-package plugin, add the following to your sbt build file (pulled from [GraphFrames on Spark Packages](https://spark-packages.org/package/graphframes/graphframes)):
+
+```
+spDependencies += "graphframes/graphframes:0.8.4-spark3.5-s_2.12"
+```
+
+Otherwise,
+
+```
+resolvers += "Spark Packages Repo" at "https://repos.spark-packages.org/"
+
+libraryDependencies += "graphframes" % "graphframes" % "0.8.4-spark3.5-s_2.12"
+```
+
+## GraphFrames and Maven
+
+GraphFrames is not on Maven Central yet, but we plan to restore it there soon. For now, use the Spark Packages system to install the package: [https://spark-packages.org/package/graphframes/graphframes](https://spark-packages.org/package/graphframes/graphframes).
+
+```xml
+<!-- Add the GraphFrames dependency -->
+<dependencies>
+  <dependency>
+    <groupId>graphframes</groupId>
+    <artifactId>graphframes</artifactId>
+    <version>0.8.4-spark3.5-s_2.12</version>
+  </dependency>
+</dependencies>
+
+<!-- Add the Spark Packages repository -->
+<repositories>
+  <repository>
+    <id>SparkPackagesRepo</id>
+    <url>https://repos.spark-packages.org/</url>
+  </repository>
+</repositories>
+```
+
+## GraphFrames Internals
+
+To learn how GraphFrames works internally to combine graph and relational queries, check out the paper [GraphFrames: An Integrated API for Mixing Graph and Relational Queries, Dave et al. 2016](https://people.eecs.berkeley.edu/~matei/papers/2016/grades_graphframes.pdf).
## Building and running unit tests
-To compile this project, run `build/sbt assembly` from the project home directory.
-This will also run the Scala unit tests.
+To compile this project, run `build/sbt assembly` from the project home directory. This will also run the Scala unit tests.
-To run the Python unit tests, run the `run-tests.sh` script from the `python/` directory.
-You will need to set `SPARK_HOME` to your local Spark installation directory.
+To run the Python unit tests, run the `run-tests.sh` script from the `python/` directory. You will need to set `SPARK_HOME` to your local Spark installation directory.
## Release new version
+
Please see guide `dev/release_guide.md`.
## Spark version compatibility
-This project is compatible with Spark 2.4+. However, significant speed improvements have been
-made to DataFrames in more recent versions of Spark, so you may see speedups from using the latest
-Spark version.
+This project is compatible with Spark 3.4+. Significant speed improvements have been made to DataFrames in recent versions of Spark, so you may see speedups from using the latest Spark version.
## Contributing
-GraphFrames is collaborative effort among UC Berkeley, MIT, and Databricks.
-We welcome open source contributions as well!
+GraphFrames is a collaborative effort among UC Berkeley, MIT, Databricks, and the open source community. We welcome open source contributions!
## Releases:
diff --git a/dev/release_guide.md b/dev/release_guide.md
index 19be87366..f89708c9c 100644
--- a/dev/release_guide.md
+++ b/dev/release_guide.md
@@ -1,8 +1,8 @@
-# Guild for releasing a new Graphframe version
+# Guide for releasing a new GraphFrames version
-## How to build GraphFrame package ?
+## How to build the GraphFrames package?
-To build a GraphFrame package for releasing, you only need to run the following command:
+To build a GraphFrames package for release, you only need to run the following command:
```
cd graphframe_repo
@@ -30,10 +30,9 @@ then upload the zip file generated by instructions in "How to build GraphFrame p
## How to publish the GraphFrame doc ?
-GraphFrame doc is hosted in 'https://graphframes.github.io/graphframes/', to publish doc,
-you just need to build doc content, then push the doc content to gh-pages branch of https://github.com/graphframes/graphframes project.
+GraphFrames docs are hosted at 'https://graphframes.github.io/graphframes/'. To publish the docs, build the doc content, then push it to the gh-pages branch of the https://github.com/graphframes/graphframes project.
-Before building doc, you need to install jekyll, please refer to 'docs/README.md' for details.
+Before building the docs, you need to install Jekyll; please refer to 'docs/README.md' for details.
The following command is for building and publishing doc:
```
diff --git a/docs/README.md b/docs/README.md
index 769c1cd3f..6305494fc 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -1,39 +1,28 @@
Welcome to the GraphFrames Spark Package documentation!
-This readme will walk you through navigating and building the GraphFrames documentation, which is
-included here with the source code.
+This readme will walk you through navigating and building the GraphFrames documentation, which is included here with the source code.
-Read on to learn more about viewing documentation in plain text (i.e., markdown) or building the
-documentation yourself. Why build it yourself? So that you have the docs that correspond to
-whichever version of GraphFrames you currently have checked out of revision control.
+Read on to learn more about viewing documentation in plain text (i.e., markdown) or building the documentation yourself. Why build it yourself? So that you have the docs that correspond to whichever version of GraphFrames you currently have checked out of revision control.
## Generating the Documentation HTML
-We include the GraphFrames documentation as part of the source (as opposed to using a hosted wiki, such as
-the github wiki, as the definitive documentation) to enable the documentation to evolve along with
-the source code and be captured by revision control (currently git). This way the code automatically
-includes the version of the documentation that is relevant regardless of which version or release
-you have checked out or downloaded.
+We include the GraphFrames documentation as part of the source (as opposed to using a hosted wiki, such as the GitHub wiki, as the definitive documentation) to enable the documentation to evolve along with the source code and be captured by revision control (currently git). This way the code automatically includes the version of the documentation that is relevant regardless of which version or release you have checked out or downloaded.
-In this directory you will find textfiles formatted using Markdown, with an ".md" suffix. You can
-read those text files directly if you want. Start with index.md.
+In this directory you will find text files formatted using Markdown, with an ".md" suffix. You can read those text files directly if you want. Start with index.md.
The markdown code can be compiled to HTML using the [Jekyll tool](http://jekyllrb.com).
-`Jekyll` and a few dependencies must be installed for this to work. We recommend
-installing via the Ruby Gem dependency manager. Since the exact HTML output
-varies between versions of Jekyll and its dependencies, we list specific versions here
-in some cases:
+`Jekyll` and a few dependencies must be installed for this to work. We recommend installing via the Ruby Gem dependency manager. Since the exact HTML output varies between versions of Jekyll and its dependencies, we list specific versions here in some cases:
- $ sudo gem install jekyll
- $ sudo gem install jekyll-redirect-from
+ $ gem install jekyll
+ $ gem install jekyll-redirect-from
On macOS, with the default Ruby, please install Jekyll with Bundler as [instructed on offical website](https://jekyllrb.com/docs/quickstart/). Otherwise the build script might fail to resolve dependencies.
- $ sudo gem install jekyll bundler
- $ sudo gem install jekyll-redirect-from
+ $ gem install jekyll bundler
+ $ gem install jekyll-redirect-from
-Execute `jekyll build` from the `docs/` directory to compile the site. Compiling the site with Jekyll will create a directory
-called `_site` containing index.html as well as the rest of the compiled files.
+Execute `jekyll build` from the `docs/` directory to compile the site. Compiling the site with Jekyll will create a directory called `_site` containing index.html as well as the rest of the compiled files.
You can modify the default Jekyll build as follows:
@@ -45,27 +34,23 @@ You can modify the default Jekyll build as follows:
$ PRODUCTION=1 jekyll build
Note that `SPARK_HOME` must be set to your local Spark installation in order to generate the docs.
+
To manually point to a specific `Spark` installation,
$ SPARK_HOME= PRODUCTION=1 jekyll build
## Sphinx
-We use Sphinx to generate Python API docs, so you will need to install it by running
-`sudo pip install sphinx`.
+We use Sphinx to generate the Python API docs, so you will need to install it by running (once we upgrade to Python 3.10 it will be added to the dev requirements):
+
+ pip install sphinx
## API Docs (Scaladoc, Sphinx)
You can build just the scaladoc by running `build/sbt unidoc` from the GRAPHFRAMES_PROJECT_ROOT directory.
-Similarly, you can build just the Python docs by running `make html` from the
-GRAPHFRAMES_PROJECT_ROOT/python/docs directory. Documentation is only generated for classes that are listed as
-public in `__init__.py`.
+Similarly, you can build just the Python docs by running `make html` from the GRAPHFRAMES_PROJECT_ROOT/python/docs directory. Documentation is only generated for classes that are listed as public in `__init__.py`.
-When you run `jekyll` in the `docs` directory, it will also copy over the scaladoc for the various
-subprojects into the `docs` directory (and then also into the `_site` directory). We use a
-jekyll plugin to run `build/sbt unidoc` before building the site so if you haven't run it (recently) it
-may take some time as it generates all of the scaladoc. The jekyll plugin also generates the
+When you run `jekyll` in the `docs` directory, it will also copy over the scaladoc for the various subprojects into the `docs` directory (and then also into the `_site` directory). We use a jekyll plugin to run `build/sbt unidoc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the
Python docs [Sphinx](http://sphinx-doc.org/).
-NOTE: To skip the step of building and copying over the Scala, Python API docs, run `SKIP_API=1
-jekyll build`. To skip building Scala API docs, run `SKIP_SCALADOC=1 jekyll build`; to skip building Python API docs, run `SKIP_PYTHONDOC=1 jekyll build`.
+NOTE: To skip the step of building and copying over the Scala, Python API docs, run `SKIP_API=1 jekyll build`. To skip building Scala API docs, run `SKIP_SCALADOC=1 jekyll build`; to skip building Python API docs, run `SKIP_PYTHONDOC=1 jekyll build`.
diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index 3ee1a85ee..51bc021ea 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -74,6 +74,7 @@
diff --git a/docs/img/4-node-directed-graphlets.png b/docs/img/4-node-directed-graphlets.png
new file mode 100644
index 000000000..74d8e1806
Binary files /dev/null and b/docs/img/4-node-directed-graphlets.png differ
diff --git a/docs/img/Directed-Graphlet-G17.png b/docs/img/Directed-Graphlet-G17.png
new file mode 100644
index 000000000..4feea4f37
Binary files /dev/null and b/docs/img/Directed-Graphlet-G17.png differ
diff --git a/docs/img/Directed-Graphlet-G22.png b/docs/img/Directed-Graphlet-G22.png
new file mode 100644
index 000000000..1778e56fe
Binary files /dev/null and b/docs/img/Directed-Graphlet-G22.png differ
diff --git a/docs/img/G11_motif.png b/docs/img/G11_motif.png
new file mode 100644
index 000000000..1e2524093
Binary files /dev/null and b/docs/img/G11_motif.png differ
diff --git a/docs/img/G4_and_G5_directed_network_motif.png b/docs/img/G4_and_G5_directed_network_motif.png
new file mode 100644
index 000000000..83d34c901
Binary files /dev/null and b/docs/img/G4_and_G5_directed_network_motif.png differ
diff --git a/docs/img/GraphFrames-Logo-Dark-Small.png b/docs/img/GraphFrames-Logo-Dark-Small.png
new file mode 100644
index 000000000..05b0c3bc3
Binary files /dev/null and b/docs/img/GraphFrames-Logo-Dark-Small.png differ
diff --git a/docs/img/GraphFrames-Logo-Large.png b/docs/img/GraphFrames-Logo-Large.png
new file mode 100644
index 000000000..bfac7ebcb
Binary files /dev/null and b/docs/img/GraphFrames-Logo-Large.png differ
diff --git a/docs/img/GraphFrames-Logo-Small.png b/docs/img/GraphFrames-Logo-Small.png
new file mode 100644
index 000000000..1e052e776
Binary files /dev/null and b/docs/img/GraphFrames-Logo-Small.png differ
diff --git a/docs/img/directed_graphlets.webp b/docs/img/directed_graphlets.webp
new file mode 100644
index 000000000..caa02321c
Binary files /dev/null and b/docs/img/directed_graphlets.webp differ
diff --git a/docs/index.md b/docs/index.md
index 9bd2ccb82..b9d8917bc 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -63,14 +63,21 @@ GraphFrames supplied as a package.
* [Quick Start](quick-start.html): a quick introduction to the GraphFrames API; start here!
* [GraphFrames User Guide](user-guide.html): detailed overview of GraphFrames
in all supported languages (Scala, Java, Python)
+* [Motif Finding Tutorial](motif-tutorial.html): learn to perform pattern recognition with GraphFrames using a technique called network motif finding over the knowledge graph for the `stackexchange.com` subdomain [data dump](https://archive.org/details/stackexchange)
**API Docs:**
* [GraphFrames Scala API (Scaladoc)](api/scala/index.html#org.graphframes.package)
* [GraphFrames Python API (Sphinx)](api/python/index.html)
+**Community Forums:**
+
+* [GraphFrames Mailing List](https://groups.google.com/g/graphframes/): ask questions about GraphFrames here
+* [#graphframes Discord Channel on GraphGeeks](https://discord.com/channels/1162999022819225631/1326257052368113674)
+
**External Resources:**
* [Apache Spark Homepage](http://spark.apache.org)
* [Apache Spark Wiki](https://cwiki.apache.org/confluence/display/SPARK)
-* [Mailing Lists](http://spark.apache.org/mailing-lists.html): Ask questions about Spark here
+* [Apache Spark Mailing Lists](http://spark.apache.org/mailing-lists.html)
+* [GraphFrames on Stack Overflow](https://stackoverflow.com/questions/tagged/graphframes)
diff --git a/docs/motif-tutorial.md b/docs/motif-tutorial.md
new file mode 100644
index 000000000..4d512f656
--- /dev/null
+++ b/docs/motif-tutorial.md
@@ -0,0 +1,773 @@
+---
+layout: global
+displayTitle: GraphFrames Network Motif Finding Tutorial
+title: Network Motif Finding Tutorial
+description: GraphFrames GRAPHFRAMES_VERSION motif finding tutorial - teaches you to find motifs using Stack Exchange data
+---
+
+This tutorial covers GraphFrames' motif finding feature. We perform pattern matching on a property graph representing a Stack Exchange site using Apache Spark and [GraphFrames' motif finding](user-guide.html#motif-finding) feature. We will download the `stats.meta` archive from the [Stack Exchange Data Dump at the Internet Archive](https://archive.org/details/stackexchange), use PySpark to build a property graph and then mine it for property graph network motifs by combining both graph and relational queries.
+
+* Table of contents (This text will be scraped.)
+ {:toc}
+
+# What are graphlets and network motifs?
+
+Graphlets are small, connected subgraphs of a larger graph. Network motifs are recurring patterns in complex networks that are significantly more frequent than in random networks. They are the building blocks of complex networks and can be used to understand the structure and function of networks. Network motifs can be used to identify functional modules in biological networks, detect anomalies in social networks, detect money laundering and terrorism financing in financial networks, and predict the behavior of complex systems.
+
+
+
+We are going to mine motifs using Stack Exchange data. The Stack Exchange network is a complex network of users, posts, votes, badges, and tags. We will use GraphFrames to build a property graph from the Stack Exchange data dump and then use GraphFrames' motif finding feature to find network motifs in the graph. You'll see how to combine graph and relational queries to find complex patterns in the graph.
+
+# Download the Stack Exchange Dump for [stats.meta](https://stats.meta.stackexchange.com)
+
+The Python tutorials include a CLI utility at `graphframes stackexchange` for downloading any site's [Stack Exchange Data Dump](https://archive.org/details/stackexchange) from the Internet Archive. The command takes the subdomain as an argument, downloads the corresponding 7zip archive and expands it into the `python/graphframes/tutorials/data` folder.
+
+
+{% highlight bash %}
+Usage: graphframes [OPTIONS] COMMAND [ARGS]...
+
+ GraphFrames CLI: a collection of commands for graphframes.
+
+Options:
+ --help Show this message and exit.
+
+Commands:
+ stackexchange Download Stack Exchange archive for a given SUBDOMAIN.
+{% endhighlight %}
+
+
+Use `graphframes stackexchange stats.meta` to download the Stack Exchange Data Dump for `stats.meta.stackexchange.com`.
+
+
+
+# Build the Graph
+
+We will build a property graph from the Stack Exchange data dump using PySpark in the [python/graphframes/tutorials/stackexchange.py](python/graphframes/tutorials/stackexchange.py) script. The data comes as a single XML file, so we use [spark-xml](https://github.com/databricks/spark-xml) (which is moving into Spark itself as of Spark 4.0) to load the data, extract the relevant fields and build the nodes and edges of the graph. Spark XML uses a lot of RAM, so we need to increase the driver and executor memory to at least 4GB.
+
+
+
+The script will output the nodes and edges of the graph in the `python/graphframes/tutorials/data` folder. We can now use GraphFrames to load the graph and perform motif finding.
+
+# Motif Finding
+
+We will use GraphFrames to find motifs in the Stack Exchange property graph. The script [python/graphframes/tutorials/motif.py](python/graphframes/tutorials/motif.py) demonstrates how to load the graph, define various motifs and find all instances of the motif in the graph.
+
+NOTE: I use the term `node` interchangeably with `vertex`, and `edge` with `link` or `relationship`. The API is [GraphFrame.vertices](https://graphframes.github.io/graphframes/docs/_site/api/python/graphframes.html#graphframes.GraphFrame.vertices) and [GraphFrame.edges](https://graphframes.github.io/graphframes/docs/_site/api/python/graphframes.html#graphframes.GraphFrame.edges), but some documentation says `relationships`. We need to add an alias from `g.vertices` to `g.nodes`, and from `g.edges` to both `g.relationships` and `g.links`.
+
+For a quick run-through of the script, use the following command:
+
+
+
+Let's walk through what it does, line by line. The script starts by importing the necessary modules and defining some utility functions for visualizing paths returned by [g.find()](https://graphframes.github.io/graphframes/docs/_site/user-guide.html#motif-finding). Note that if you give the `python/graphframes/tutorials/download.py` CLI a different subdomain, you will need to change the `STACKEXCHANGE_SITE` variable.
+
+
+{% highlight python %}
+import pyspark.sql.functions as F
+from graphframes import GraphFrame
+from pyspark import SparkContext
+from pyspark.sql import DataFrame, SparkSession
+
+# Initialize a SparkSession
+
+spark: SparkSession = (
+ SparkSession.builder.appName("Stack Overflow Motif Analysis")
+ # Lets the Id:(Stack Overflow int) and id:(GraphFrames ULID) coexist
+ .config("spark.sql.caseSensitive", True)
+ .getOrCreate()
+)
+sc: SparkContext = spark.sparkContext
+sc.setCheckpointDir("/tmp/graphframes-checkpoints")
+
+# Change me if you download a different stackexchange site
+
+STACKEXCHANGE_SITE = "stats.meta.stackexchange.com"
+BASE_PATH = f"python/graphframes/tutorials/data/{STACKEXCHANGE_SITE}"
+{% endhighlight %}
+
+
+Load the nodes and edges of the graph from the `data` folder and count the types of nodes and edges. We repartition the nodes and edges to give our motif searches parallelism. GraphFrames likes nodes/vertices and edges/relationships to be cached.
+
+
+{% highlight python %}
+#
+# Load the nodes and edges from disk, repartition, checkpoint (the query plan gets long otherwise) and cache.
+#
+
+# We created these in stackexchange.py from Stack Exchange data dump XML files
+
+NODES_PATH: str = f"{BASE_PATH}/Nodes.parquet"
+nodes_df: DataFrame = spark.read.parquet(NODES_PATH)
+
+# Repartition the nodes to give our motif searches parallelism
+
+nodes_df = nodes_df.repartition(50).checkpoint().cache()
+
+# We created these in stackexchange.py from Stack Exchange data dump XML files
+
+EDGES_PATH: str = f"{BASE_PATH}/Edges.parquet"
+edges_df: DataFrame = spark.read.parquet(EDGES_PATH)
+
+# Repartition the edges to give our motif searches parallelism
+
+edges_df = edges_df.repartition(50).checkpoint().cache()
+{% endhighlight %}
+
+
+Check out the node types we have to work with:
+
+
+{% highlight python %}
+# What kinds of nodes do we have to work with?
+node_counts = (
+ nodes_df
+ .select("id", F.col("Type").alias("Node Type"))
+ .groupBy("Node Type")
+ .count()
+ .orderBy(F.col("count").desc())
+ # Add a comma formatted column for display
+ .withColumn("count", F.format_number(F.col("count"), 0))
+)
+node_counts.show()
+{% endhighlight %}
+
+
+Note: you don't need to run the code in this section, it is just for reference. The data we loaded above is already prepared for use. Jump ahead to Creating GraphFrames and run that next :)
+
+At the moment, GraphFrames has a limitation: there is only one node and edge type. There are many fields in the nodes of our `GraphFrame` because only one node type is available. I have combined the different types of node into a single type by including all properties of all types in one class of node. I created a `Type` field for each type of node, then merged all fields into a single, global `nodes_df` `DataFrame`. This `Type` column can then be used in relational [DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) operations to distinguish between types of nodes.
+
+This limitation is an annoyance that should be fixed in the future, with the ability to have multiple node types in a `GraphFrame`. In practice it isn't a big hit in productivity, but it means you have to [DataFrame.select](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.select.html) certain columns for each node `Type` when you do a [DataFrame.show()](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.show.html) or the width of the DataFrame will be too wide to easily read.
+
+Here is how that was accomplished in `python/graphframes/tutorials/stackexchange.py`.
+
+
+{% highlight python %}
+#
+# Form the nodes from the UNION of posts, users, votes and their combined schemas
+#
+
+from typing import List, Tuple
+
+import pyspark.sql.functions as F
+import pyspark.sql.types as T
+
+all_cols: List[Tuple[str, T.StructField]] = list(
+ set(
+ list(zip(posts_df.columns, posts_df.schema))
+ + list(zip(post_links_df.columns, post_links_df.schema))
+ + list(zip(comments_df.columns, comments_df.schema))
+ + list(zip(users_df.columns, users_df.schema))
+ + list(zip(votes_df.columns, votes_df.schema))
+ + list(zip(tags_df.columns, tags_df.schema))
+ + list(zip(badges_df.columns, badges_df.schema))
+ )
+)
+all_column_names: List[str] = sorted([x[0] for x in all_cols])
+
+def add_missing_columns(df: DataFrame, all_cols: List[Tuple[str, T.StructField]]) -> DataFrame:
+ """Add any missing columns from any DataFrame among several we want to merge."""
+ for col_name, schema_field in all_cols:
+ if col_name not in df.columns:
+ df = df.withColumn(col_name, F.lit(None).cast(schema_field.dataType))
+ return df
+
+# Now apply this function to each of your DataFrames to get a consistent schema
+
+posts_df = add_missing_columns(posts_df, all_cols).select(all_column_names)
+post_links_df = add_missing_columns(post_links_df, all_cols).select(all_column_names)
+users_df = add_missing_columns(users_df, all_cols).select(all_column_names)
+votes_df = add_missing_columns(votes_df, all_cols).select(all_column_names)
+tags_df = add_missing_columns(tags_df, all_cols).select(all_column_names)
+badges_df = add_missing_columns(badges_df, all_cols).select(all_column_names)
+assert (
+ set(posts_df.columns)
+ == set(post_links_df.columns)
+ == set(users_df.columns)
+ == set(votes_df.columns)
+ == set(all_column_names)
+ == set(tags_df.columns)
+ == set(badges_df.columns)
+)
+{% endhighlight %}
+
+
+
+# Creating GraphFrames
+
+Now we create a [GraphFrame object](https://graphframes.github.io/graphframes/docs/_site/api/python/graphframes.html#graphframes.GraphFrame) from the `nodes_df` and `edges_df` `DataFrames`. We will use this object to find motifs in the graph.
+
+Back to our motifs :) It is time to create our GraphFrame object. It has a number of powerful APIs, including the [GraphFrame.find()](https://graphframes.github.io/graphframes/docs/_site/api/python/graphframes.html#graphframes.GraphFrame.find) method for finding motifs in the graph.
+
+
+
+Let's validate that all edges in our `GraphFrame` object have valid IDs - it is common to make mistakes in ETL for knowledge graph construction and have edges that point nowhere. GraphFrames tries to validate itself but can sometimes accept bogus edges.
+
+
+{% highlight python %}
+# Sanity test that all edges have valid ids
+edge_count = g.edges.count()
+valid_edge_count = (
+ g.edges.join(g.vertices, on=g.edges.src == g.vertices.id)
+ .select("src", "dst", "relationship")
+ .join(g.vertices, on=g.edges.dst == g.vertices.id)
+ .count()
+)
+
+# Just up and die if we have edges that point to non-existent nodes
+
+assert (
+ edge_count == valid_edge_count
+), f"Edge count {edge_count} != valid edge count {valid_edge_count}"
+print(f"Edge count: {edge_count:,} == Valid edge count: {valid_edge_count:,}")
+{% endhighlight %}
+
+
+Let's look for a simple motif: a directed triangle. We will find all instances of a directed triangle in the graph. The [`GraphFrame.find()`](https://graphframes.github.io/graphframes/docs/_site/api/python/graphframes.html#graphframes.GraphFrame.find) method takes a string as an argument that specifies the structure of a motif one edge at a time, in a syntax similar to Cypher, with a semicolon between edges. For a triangle motif, that works out to: `(a)-[e]->(b); (b)-[e2]->(c); (c)-[e3]->(a)`. Edge labels are optional; this is a valid graph query: `(a)-[]->(b)`.
+
+The `g.find()` method returns a `DataFrame` with a column for each of the node and edge labels in the pattern. To further express the motif you're interested in, you can then use relational `DataFrame` operations to filter, group, and aggregate the results. This makes network motif finding in GraphFrames very powerful, and this type of property graph motif was originally defined in the [GraphFrames paper](https://people.eecs.berkeley.edu/~matei/papers/2016/grades_graphframes.pdf).
+
+A complete description of the graph query language is in the [GraphFrames User Guide](https://graphframes.github.io/graphframes/docs/_site/user-guide.html#motif-finding). Now let's find all instances of the directed triangle motif in our graph.
+
+
+
+This can be overwhelming to look at, so in practice you will use `DataFrame.select()` (a path is just a `pyspark.sql.DataFrame`) to pull out the properties of interest.
+
+Aggregating paths can express powerful semantics. Let's count the occurrences of this triangle motif in the graph by node and edge type.
+
+
+
+The result shows that the only continuous triangles in the graph are 39 question-link loops. Motif matching for simple motifs based on topology alone can be used for exploratory data analysis over a knowledge graph, in the same way you might run GROUP BY / COUNT queries on a table in a relational database to start to understand its contents.
+
+
+
+1. (Tag)-[Tags]->(Question B); (Tag)-[Tags]->(Question C); (Question C)-[Links]->(Question B), or "A tag is used on a question, that tag is used on another question, and the two questions are linked." It makes sense that questions sharing tags are often linked.
+2. (User)-[Asks]->(Question B); (User)-[Posts]->(Answer C); (Answer C)-[Answers]->(Question B), or "A user answers their own question."
+3. A triangle of linked questions.
+4. (Tag)-[Tags]->(Question B); (Tag)-[Tags]->(Question C); (Question B)-[Duplicates]->(Question C), or "A tag appears on a pair of duplicate questions."
+5. A user asks linked questions.
+
+
Property Graph Motifs
+
+Simple motif finding can be used to explore a knowledge graph. It is also possible to use domain knowledge to define and match known patterns, and then explore new variant motifs. This lets you apply and then expand domain knowledge about a knowledge graph. It is powerful stuff!
+
+We can do more with the properties of paths than just count them by node and edge type. We can use the properties of the nodes and edges in the paths to filter, group, and aggregate the results to form property graph motifs. Such complex motifs were first defined (without being formally named) in the paper describing this project, [GraphFrames: An Integrated API for Mixing Graph and Relational Queries, Dave et al. 2016](https://people.eecs.berkeley.edu/~matei/papers/2016/grades_graphframes.pdf). They are a combination of graph and relational queries, and we can use them to find complex patterns in the graph.
+
+The larger motifs get, the more interesting they are. Five nodes is often the practical limit on a Spark cluster, depending on how large your graph is. In this tutorial I will limit myself to four-node patterns, as you may not have a Spark cluster on which to learn. Keep in mind that I am talking about paths: through aggregation, a motif might cover thousands of nodes!
+
+First, let's express the structural logic of the motif we are looking for. Let's try G22: a triangle with a fourth node pointing at the node with an in-degree of 2. The pattern is `(a)-[e1]->(b); (a)-[e2]->(c); (c)-[e3]->(b); (d)-[e4]->(b)`.
+
+Visually this pattern looks like this:
+
+
+
+The simplest pattern with four nodes is a 3-path, the directed graphlet G30. Let's see how aggregation makes this a more powerful pattern than we might at first guess.
+
+
+{% highlight python %}
+# G17: A directed 3-path is a surprisingly diverse graphlet
+paths = g.find("(a)-[e1]->(b); (b)-[e2]->(c); (d)-[e3]->(c)")
+{% endhighlight %}
+
+
+Let's count the number of instances of this path in the graph by type. A hard-won tip: alias each column with its label in the pattern. This makes the results much easier to read, even when `c` points to `a` or `b` rather than `d`.
+
+
+
+The fourth row catches my eye - there are 300,017 matches for the votes cast for linked questions: (Vote A)-[CastFor]->(Question B); (Question B)-[Links]->(Question C); (Vote D)-[CastFor]->(Question C). This gives a way to compare the popularity of linked questions! Let's calculate how correlated linked questions are.
+
+
+{% highlight python %}
+# Keep only paths where two votes are cast for a pair of linked questions
+linked_vote_paths = paths.filter(
+ (F.col("a.Type") == "Vote") &
+ (F.col("e1.relationship") == "CastFor") &
+ (F.col("b.Type") == "Question") &
+ (F.col("e2.relationship") == "Links") &
+ (F.col("c.Type") == "Question") &
+ (F.col("e3.relationship") == "CastFor") &
+ (F.col("d.Type") == "Vote")
+)
+
+# Sanity check the count - it should match the table above
+
+linked_vote_paths.count()
+
+300017
+{% endhighlight %}
+
+
+We start by using aggregation to count the total votes cast for each end of a question link. To get the count for Question B, take the distinct 3-paths, group by its ID, and count the votes.
+
+
+
+Now join the counts to the links to get the total votes for each pair of linked questions. Then run `DataFrame.stat.corr()` to get the correlation between the vote counts for linked questions. We'll filter on `Vote.VoteTypeId` to ensure only positive votes (upvotes) are counted.
+
+
+{% highlight python %}
+linked_vote_counts = (
+ linked_vote_paths
+ .filter((F.col("a.VoteTypeId") == 2) & (F.col("d.VoteTypeId") == 2))
+ .select("b", "c")
+ .join(b_vote_counts, on="b", how="inner")
+ .withColumnRenamed("count", "b_count")
+ .join(c_vote_counts, on="c", how="inner")
+ .withColumnRenamed("count", "c_count")
+)
+linked_vote_counts.stat.corr("b_count", "c_count")
+0.4287709940689788
+{% endhighlight %}
+
+We conclude there is a moderate correlation between the vote counts of linked questions, which makes sense. Note that this was only the fourth row of the table: there are many more patterns to be examined and considered.
+
+This is just one type of aggregation you can employ, but hopefully it illustrates how properties, aggregation, and other relational operators can transform simple pattern matching into a powerful tool for exploring a knowledge graph.
+
+
Conclusion
+
+In this tutorial, we learned to use GraphFrames to find network motifs in a property graph. We saw how to combine graph and relational queries to find complex patterns, and how to use the properties of the nodes and edges along paths to filter, group, and aggregate the results into property graph motifs. Motif finding in GraphFrames is a powerful technique for exploring and understanding complex networks; network motifs are their building blocks.
diff --git a/docs/quick-start.md b/docs/quick-start.md
index 56ec541f6..f7e16c8ce 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -30,7 +30,7 @@ We use the `--packages` argument to download the graphframes package and any dep
diff --git a/docs/user-guide.md b/docs/user-guide.md
index 7e66f74ff..5d2a112a9 100644
--- a/docs/user-guide.md
+++ b/docs/user-guide.md
@@ -5,8 +5,7 @@ title: User Guide
description: GraphFrames GRAPHFRAMES_VERSION user guide
---
-This page gives examples of how to use GraphFrames for basic queries, motif finding, and
-general graph algorithms. This includes code examples in Scala and Python.
+This page gives examples of how to use GraphFrames for basic queries, motif finding, and general graph algorithms. This includes code examples in Scala and Python.
* Table of contents (This text will be scraped.)
{:toc}
@@ -174,7 +173,9 @@ from graphframes.examples import Graphs
g = Graphs(spark).friends() # Get example graph
# Display the vertex and edge DataFrames
+
g.vertices.show()
+
# +--+-------+---+
# |id| name|age|
# +--+-------+---+
@@ -188,6 +189,7 @@ g.vertices.show()
# +--+-------+---+
g.edges.show()
+
# +---+---+------------+
# |src|dst|relationship|
# +---+---+------------+
@@ -204,12 +206,12 @@ g.edges.show()
# Get a DataFrame with columns "id" and "inDegree" (in-degree)
vertexInDegrees = g.inDegrees
-# Find the youngest user's age in the graph.
-# This queries the vertex DataFrame.
+# Find the youngest user's age in the graph
+# This queries the vertex DataFrame
g.vertices.groupBy().min("age").show()
-# Count the number of "follows" in the graph.
-# This queries the edge DataFrame.
+# Count the number of "follows" in the graph
+# This queries the edge DataFrame
numFollows = g.edges.filter("relationship = 'follow'").count()
{% endhighlight %}
@@ -218,13 +220,9 @@ numFollows = g.edges.filter("relationship = 'follow'").count()
# Motif finding
-Motif finding refers to searching for structural patterns in a graph.
+Motif finding refers to searching for structural patterns in a graph. For an example of real-world use, check out the [Motif Finding Tutorial](motif-tutorial.html).
-GraphFrame motif finding uses a simple Domain-Specific Language (DSL) for expressing structural
-queries. For example, `graph.find("(a)-[e]->(b); (b)-[e2]->(a)")` will search for pairs of vertices
-`a,b` connected by edges in both directions. It will return a `DataFrame` of all such
-structures in the graph, with columns for each of the named elements (vertices or edges)
-in the motif. In this case, the returned columns will be "a, b, e, e2."
+GraphFrame motif finding uses a simple Domain-Specific Language (DSL) for expressing structural queries. For example, `graph.find("(a)-[e]->(b); (b)-[e2]->(a)")` will search for pairs of vertices `a,b` connected by edges in both directions. It will return a `DataFrame` of all such structures in the graph, with columns for each of the named elements (vertices or edges) in the motif. In this case, the returned columns will be "a, b, e, e2."
DSL for expressing structural patterns:
@@ -304,11 +302,11 @@ from graphframes.examples import Graphs
g = Graphs(spark).friends() # Get example graph
-# Search for pairs of vertices with edges in both directions between them.
+# Search for pairs of vertices with edges in both directions between them
motifs = g.find("(a)-[e]->(b); (b)-[e2]->(a)")
motifs.show()
-# More complex queries can be expressed by applying filters.
+# More complex queries can be expressed by applying filters
motifs.filter("b.age > 30").show()
{% endhighlight %}
@@ -375,14 +373,16 @@ g = Graphs(spark).friends() # Get example graph
chain4 = g.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d)")
# Query on sequence, with state (cnt)
-# (a) Define method for updating state given the next element of the motif.
+# (a) Define method for updating state given the next element of the motif
sumFriends =\
lambda cnt,relationship: when(relationship == "friend", cnt+1).otherwise(cnt)
-# (b) Use sequence operation to apply method to sequence of elements in motif.
-# In this case, the elements are the 3 edges.
+
+# (b) Use sequence operation to apply method to sequence of elements in motif
+# In this case, the elements are the 3 edges
condition =\
reduce(lambda cnt,e: sumFriends(cnt, col(e).relationship), ["ab", "bc", "cd"], lit(0))
-# (c) Apply filter to DataFrame.
+
+# (c) Apply filter to DataFrame
chainWith2Friends2 = chain4.where(condition >= 2)
chainWith2Friends2.show()
{% endhighlight %}
@@ -428,8 +428,8 @@ from graphframes.examples import Graphs
g = Graphs(spark).friends() # Get example graph
-# Select subgraph of users older than 30, and relationships of type "friend".
-# Drop isolated vertices (users) which are not contained in any edges (relationships).
+# Select subgraph of users older than 30, and relationships of type "friend"
+# Drop isolated vertices (users) which are not contained in any edges (relationships)
g1 = g.filterVertices("age > 30").filterEdges("relationship = 'friend'").dropIsolatedVertices()
{% endhighlight %}
@@ -470,15 +470,17 @@ from graphframes.examples import Graphs
g = Graphs(spark).friends() # Get example graph
# Select subgraph based on edges "e" of type "follow"
-# pointing from a younger user "a" to an older user "b".
+# pointing from a younger user "a" to an older user "b"
paths = g.find("(a)-[e]->(b)")\
.filter("e.relationship = 'follow'")\
.filter("a.age < b.age")
-# "paths" contains vertex info. Extract the edges.
+
+# "paths" contains vertex info. Extract the edges
+
e2 = paths.select("e.src", "e.dst", "e.relationship")
-# In Spark 1.5+, the user may simplify this call:
-# val e2 = paths.select("e.*")
+# In Spark 1.5+, the user may simplify this call
+# val e2 = paths.select("e.*")
# Construct the subgraph
g2 = GraphFrame(g.vertices, e2)
{% endhighlight %}
@@ -539,11 +541,11 @@ from graphframes.examples import Graphs
g = Graphs(spark).friends() # Get example graph
-# Search from "Esther" for users of age < 32.
+# Search from "Esther" for users of age < 32
paths = g.bfs("name = 'Esther'", "age < 32")
paths.show()
-# Specify edge filters or max path lengths.
+# Specify edge filters or max path lengths
g.bfs("name = 'Esther'", "age < 32",\
edgeFilter="relationship != 'friend'", maxPathLength=3)
{% endhighlight %}
@@ -741,15 +743,16 @@ from graphframes.examples import Graphs
g = Graphs(spark).friends() # Get example graph
-# Run PageRank until convergence to tolerance "tol".
+# Run PageRank until convergence to tolerance "tol"
results = g.pageRank(resetProbability=0.15, tol=0.01)
+
# Display resulting pageranks and final edge weights
-# Note that the displayed pagerank may be truncated, e.g., missing the E notation.
-# In Spark 1.5+, you can use show(truncate=False) to avoid truncation.
+# Note that the displayed pagerank may be truncated, e.g., missing the E notation
+# In Spark 1.5+, you can use show(truncate=False) to avoid truncation
results.vertices.select("id", "pagerank").show()
results.edges.select("src", "dst", "weight").show()
-# Run PageRank for a fixed number of iterations.
+# Run PageRank for a fixed number of iterations
results2 = g.pageRank(resetProbability=0.15, maxIter=10)
# Run PageRank personalized for vertex "a"
@@ -874,15 +877,15 @@ from graphframes.examples import Graphs
g = Graphs(spark).friends() # Get example graph
-# Save vertices and edges as Parquet to some location.
+# Save vertices and edges as Parquet to some location
g.vertices.write.parquet("hdfs://myLocation/vertices")
g.edges.write.parquet("hdfs://myLocation/edges")
-# Load the vertices and edges back.
+# Load the vertices and edges back
sameV = spark.read.parquet("hdfs://myLocation/vertices")
sameE = spark.read.parquet("hdfs://myLocation/edges")
-# Create an identical GraphFrame.
+# Create an identical GraphFrame
sameG = GraphFrame(sameV, sameE)
{% endhighlight %}
@@ -945,7 +948,7 @@ from pyspark.sql.functions import sum as sqlsum
g = Graphs(spark).friends() # Get example graph
-# For each user, sum the ages of the adjacent users.
+# For each user, sum the ages of the adjacent users
msgToSrc = AM.dst["age"]
msgToDst = AM.src["age"]
agg = g.aggregateMessages(
@@ -1038,3 +1041,8 @@ val g2: GraphFrame = GraphFrame.fromGraphX(gx)
These conversions are only supported in Scala since GraphX does not have a Python API.
+
+# GraphFrames Internals
+
+To learn how GraphFrames works internally to combine graph and relational queries, check out the paper [GraphFrames: An Integrated API for Mixing Graph and Relational Queries, Dave et al. 2016](https://people.eecs.berkeley.edu/~matei/papers/2016/grades_graphframes.pdf).
diff --git a/python/graphframes/tutorials/stackexchange.py b/python/graphframes/tutorials/stackexchange.py
index 02ebb2bb5..5dab1eafe 100644
--- a/python/graphframes/tutorials/stackexchange.py
+++ b/python/graphframes/tutorials/stackexchange.py
@@ -5,6 +5,7 @@
#
# Batch Usage: spark-submit --packages com.databricks:spark-xml_2.12:0.18.0 python/graphframes/tutorials/stackexchange.py
#
+
from __future__ import annotations
import re