# Computing PageRank

## Introduction

PageRank is best known for its success in ranking web pages in the Google search engine. Here is a quick description:

PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

In fact, the algorithm outputs a probability distribution that represents the likelihood of arriving at any particular page after randomly clicking on links for a while. We will simulate this behavior by the following 'surfing the Internet' procedure:

- Initially, at each page, some amount of people start surfing the internet from that page.
- In each turn, some users decide to click on a random link and visit a new page.
- We iterate for a fixed number of rounds.
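The steps above can be sketched in plain Python, without any library. This is our own illustrative helper (the name `naive_pagerank` and the 0.85/0.15 damping constants are taken from the Pathway code later in this post, not from a library API):

```python
# A plain-Python sketch of the iterative 'surfing' procedure described above.
# `naive_pagerank` is our own illustrative helper, not part of any library.
def naive_pagerank(edges, steps=50):
    nodes = {u for u, v in edges} | {v for u, v in edges}
    out = {n: [v for u, v in edges if u == n] for n in nodes}
    rank = {n: 1.0 for n in nodes}  # every page starts with one unit of rank
    for _ in range(steps):
        new_rank = {n: 0.15 for n in nodes}  # baseline from random jumps
        for n in nodes:
            for v in out[n]:  # nodes without outgoing links send no flow
                new_rank[v] += 0.85 * rank[n] / len(out[n])
        rank = new_rank
    return rank

# The small example graph used later in this post: a -> b, b -> c, c -> a, c -> b
ranks = naive_pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("c", "b")])
```

After 50 rounds the ranks are very close to their fixed point, matching the Pathway output shown further down (up to per-iteration rounding).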

## Code

First things first: imports and constants.

```python
from typing import Any

import pathway as pw
```

### I/O Data

We use an `Edge` schema to represent the graph and a `Result` schema to represent the final ranks.

```python
class Edge(pw.Schema):
    u: Any
    v: Any


class Result(pw.Schema):
    rank: float
```

`pagerank` runs the 'surfing the Internet' procedure for a fixed number of rounds; in each round, every node distributes its rank uniformly among its adjacent nodes.

```python
def pagerank(edges: pw.Table[Edge], steps: int = 50) -> pw.Table[Result]:
    degrees = edges.groupby(id=edges.u).reduce(
        degree=pw.reducers.count(None)
    )  # indexed by refs to vertices
    ranks = degrees.select(rank=1.0)  # indexed by refs to vertices
    for step in range(steps):
        # if a node has no degree, its degree is 0 and we can ignore its outflow
        outflow = ranks.intersect(degrees).select(
            flow=ranks.rank / degrees.degree
        )  # indexed by refs to vertices
        edges_flows = outflow.having(edges.u).select(
            edges.v, flow=outflow.ix[edges.u].flow
        )
        inflows = edges_flows.groupby(id=edges_flows.v).reduce(
            flow=pw.reducers.sum(edges_flows.flow)
        )  # indexed by refs to vertices
        ranks = inflows.select(
            rank=pw.apply_with_type(
                lambda x: round(x, 4), float, inflows.flow * 0.85 + 0.15
            ),
        )  # indexed by refs to vertices
    return ranks
```

### Tests

We present an easy test case here: a single 3-vertex cycle with one additional backward edge.

```python
# directed graph
vertices = pw.debug.table_from_markdown(
    """
      |
    a |
    b |
    c |
    """
).select()
edges = pw.debug.table_from_markdown(
    """
    u | v
    a | b
    b | c
    c | a
    c | b
    """,
).select(u=vertices.pointer_from(pw.this.u), v=vertices.pointer_from(pw.this.v))
pw.debug.compute_and_print(pagerank(edges))
```
```
            | rank
^EGS8DW6... | 0.6444
^76BNCAZ... | 1.1634
^JASWCNM... | 1.1923
```

Why these numbers: 0.6444, 1.1634, 1.1923? Feel free to skip the rather mathy explanation below.

Let us calculate what the correct answer should be. PageRank in fact finds the stationary distribution of a random walk on the graph in which the probability of each move depends only on the currently visited page, i.e., it is a Markov chain.

One may think that the transition matrix of the Markov chain in our example is

$$
P = \begin{pmatrix}
0 & 1 & 0 \\
0 & 0 & 1 \\
\frac{1}{2} & \frac{1}{2} & 0
\end{pmatrix}
$$

(rows and columns ordered $a, b, c$). However, the code applies damping: we move to a new page with probability 0.85, distributed uniformly among all the linked (adjacent) pages, and with probability 0.15 we jump to a page chosen uniformly at random. The resulting stationary distribution, roughly $(0.215, 0.397, 0.388)$ for $(a, b, c)$, is proportional to the ranks returned. The difference is due to the ranks not being normalized: they sum to 3, the number of vertices, rather than to 1.
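As a sanity check (our own, not part of the original code), we can solve the damped fixed-point system $r = 0.85\,P^{\top} r + 0.15$ directly by iteration and compare with the printed ranks:

```python
# Solve the damped fixed-point system r = 0.85 * P^T r + 0.15 by iteration.
# Rows of P describe the link structure: a -> b; b -> c; c -> a and c -> b
# (each with probability 1/2). This is a sanity check we added ourselves.
P = [
    [0.0, 1.0, 0.0],  # a
    [0.0, 0.0, 1.0],  # b
    [0.5, 0.5, 0.0],  # c
]
r = [1.0, 1.0, 1.0]
for _ in range(200):
    r = [0.15 + 0.85 * sum(P[j][i] * r[j] for j in range(3)) for i in range(3)]

total = sum(r)  # the ranks sum to the number of vertices, 3
pi = [x / total for x in r]  # normalizing yields the stationary distribution
```

After 200 iterations `r` agrees with the ranks 0.6444, 1.1923, 1.1634 for a, b, c to well within rounding.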

### Summary

As always, feel free to play and experiment with this code! In case you are looking for cool real-world graphs to experiment with, the Stanford Network Analysis Project is an excellent source of reference instances, big and small.
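Many SNAP datasets are distributed as whitespace-separated edge lists with `#` comment lines. A minimal loader sketch for that format (our own helper, assuming string node ids are acceptable):

```python
# Minimal loader for the whitespace-separated edge-list format used by many
# SNAP datasets (lines starting with '#' are comments). Our own sketch.
def load_edge_list(text):
    edges = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        u, v = line.split()[:2]
        edges.append((u, v))
    return edges

sample = """# Directed graph
# FromNodeId\tToNodeId
0\t1
1\t2
"""
edges = load_edge_list(sample)
```

The resulting pairs can then be fed into a table (e.g. via Pathway's CSV or debug inputs) to run `pagerank` on a real-world graph.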