November 7, 2022TUTORIAL · MACHINE-LEARNING

# Computing PageRank

## Introduction

PageRank is best known for its success in ranking web pages in Google Search engine. Here is a quick description:

PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

In fact, the algorithm outputs a probability distribution that represents the likelihood of arriving at any particular page after randomly clicking on links for a while. We will simulate this behavior by the following 'surfing the Internet' procedure:

• Initially, at each page, some amount of people start surfing the internet from that page.
• In each turn, some users decide to click on a random link and visit a new page.
• We iterate for a fixed number of rounds.

## Code

First things first - imports and constants.

from typing import Any

import pathway as pw

### I/O Data

We use an Edge schema to represent the graph and Result schema to represent the final ranks.

class Edge(pw.Schema):
u: Any
v: Any

class Result(pw.Schema):
rank: float

pagerank performs one turn of 'surfing the Internet' procedure by uniformly distributing rank from each node to all its adjacent nodes, for a fixed number of rounds.

def pagerank(edges: pw.Table[Edge], steps: int = 5) -> pw.Table[Result]:
in_vertices = edges.groupby(id=edges.v).reduce(degree=0)
out_vertices = edges.groupby(id=edges.u).reduce(degree=pw.reducers.count())
degrees = pw.Table.update_rows(in_vertices, out_vertices)
base = out_vertices.difference(in_vertices).select(flow=0)

ranks = degrees.select(rank=6_000)

grouper = edges.groupby(id=edges.v)

for step in range(steps):
outflow = degrees.select(
flow=pw.if_else(
degrees.degree == 0, 0, (ranks.rank * 5) // (degrees.degree * 6)
)
)

inflows = edges.groupby(id=edges.v).reduce(
flow=pw.reducers.sum(outflow.ix(edges.u).flow)
)

inflows = pw.Table.concat(base, inflows)

ranks = inflows.select(rank=inflows.flow + 1_000).with_universe_of(degrees)
return ranks

### Tests

We present two easy test cases here. A test case with a single 3-vertices loop with one backward edge.

# directed graph
vertices = pw.debug.table_from_markdown(
"""
|
a |
b |
c |
"""
).select()
edges = pw.debug.table_from_markdown(
"""
u | v
a | b
b | c
c | a
c | b
""",
).select(u=vertices.pointer_from(pw.this.u), v=vertices.pointer_from(pw.this.v))

pw.debug.compute_and_print(pagerank(edges))
[2023-12-08T16:31:17]:INFO:Preparing Pathway computation

| rank
^HJRW9JY... | 3945
^4DM90AW... | 6981
^G8TD95H... | 7069

Why these numbers? 3945, 6981, 7069? Feel free to skip the quite mathy explanation below.

Let us calculate what the correct answer should be. PageRank actually finds a stationary distribution of a random walk on a graph in which the probability of each move depends only on the currently visited state, i.e. it is a Markov Chain.

One may think that the transition matrix of the Markov chain in our example is

We move to a new page with probability 5/6 uniformly distributed among all the linked (adjacent) pages, and with probability 1/6 we mix uniformly at random. The result is a stationary distribution roughly of which is proportional to the rank returned. However, we output only the approximation of this result, and our output is not normalized.

### Summary

As always, feel free to play and experiment with this code! In case you are looking for cool real-world graphs to experiment with, the Stanford Network Analysis Project is an excellent source of reference instances, big and small.

Pathway Team

pagerankgraph