# Computing PageRank

## Introduction

PageRank is best known for its success in ranking web pages in the Google search engine. Here is a quick description:

PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

In fact, the algorithm outputs a probability distribution representing the likelihood of arriving at any particular page after randomly clicking on links for a while. We will simulate this behavior with the following 'surfing the Internet' procedure:

- Initially, some number of surfers start browsing from each page.
- In each turn, the surfers click on a random link and move to a new page.
- We iterate for a fixed number of rounds.
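Before diving into the Pathway implementation below, the procedure above can be sketched in plain Python on an adjacency-list graph. This is a standalone sketch (the `simulate_pagerank` helper and the 0.85/0.15 damping constants mirror the Pathway code later in this article, but the function itself is ours, not part of Pathway):

```python
def simulate_pagerank(out_links: dict[str, list[str]], steps: int = 50) -> dict[str, float]:
    """Iteratively redistribute rank along links; every node must appear
    as a key of out_links (possibly with an empty list of targets)."""
    ranks = {u: 1.0 for u in out_links}  # every page starts with rank 1.0
    for _ in range(steps):
        inflow = {u: 0.0 for u in out_links}
        for u, targets in out_links.items():
            if not targets:  # dangling node: nothing flows out
                continue
            share = ranks[u] / len(targets)  # split rank evenly over links
            for v in targets:
                inflow[v] += share
        # damping: keep 85% of the redistributed rank, add a constant 0.15
        ranks = {u: 0.85 * inflow[u] + 0.15 for u in out_links}
    return ranks


graph = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}
print({u: round(r, 4) for u, r in simulate_pagerank(graph).items()})
```

On this graph the ranks converge to roughly 0.6444, 1.1923, and 1.1634 for `a`, `b`, and `c`, matching the Pathway output shown in the Tests section.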

This article assumes that you are already familiar with some basics of Pathway transformers.

## Code

First things first: the imports.

```python
from typing import Any

import pathway as pw
```

### I/O Data

We use an `Edge` schema to represent the graph and a `Result` schema to represent the final ranks.

```python
class Edge(pw.Schema):
    u: Any
    v: Any


class Result(pw.Schema):
    rank: float
```

`pagerank` runs the 'surfing the Internet' procedure for a fixed number of rounds; in each round it uniformly distributes the rank of every node to all of its adjacent nodes.

```python
def pagerank(edges: pw.Table[Edge], steps: int = 50) -> pw.Table[Result]:
    degrees = edges.groupby(id=edges.u).reduce(
        degree=pw.reducers.count(None)
    )  # indexed by refs to vertices
    ranks = degrees.select(rank=1.0)  # indexed by refs to vertices

    for step in range(steps):
        # if a node has no degree, its degree is 0 and we can ignore its outflow
        outflow = ranks.intersect(degrees).select(
            flow=ranks.rank / degrees.degree
        )  # indexed by refs to vertices
        edges_flows = outflow.having(edges.u).select(
            edges.v, flow=outflow.ix[edges.u].flow
        )
        inflows = edges_flows.groupby(id=edges_flows.v).reduce(
            flow=pw.reducers.sum(edges_flows.flow)
        )  # indexed by refs to vertices
        ranks = inflows.select(
            rank=pw.apply_with_type(
                lambda x: round(x, 4), float, inflows.flow * 0.85 + 0.15
            ),
        )  # indexed by refs to vertices
    return ranks
```
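To make one round of the update concrete, here is the first iteration traced by hand in plain Python on a small graph with edges a → b, b → c, c → a, c → b (the same graph used in the Tests section). This is an illustrative sketch, independent of Pathway:

```python
# First round, starting from rank 1.0 everywhere.
ranks = {"a": 1.0, "b": 1.0, "c": 1.0}
out = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}

# Outflow: each node splits its rank evenly over its outgoing links:
# a sends 1.0 to b; b sends 1.0 to c; c sends 0.5 each to a and b.
flow = {u: ranks[u] / len(vs) for u, vs in out.items()}

# Inflow: sum up what arrives at each node, then apply damping.
inflow = {"a": flow["c"], "b": flow["a"] + flow["c"], "c": flow["b"]}
new_ranks = {u: round(0.85 * f + 0.15, 4) for u, f in inflow.items()}
print(new_ranks)  # -> {'a': 0.575, 'b': 1.425, 'c': 1.0}
```

Repeating this update drives the ranks toward the fixed point reported below.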

### Tests

We present a simple test case here: a directed graph forming a single 3-vertex cycle with one additional backward edge.

```python
# directed graph
vertices = pw.debug.table_from_markdown(
    """
      |
    a |
    b |
    c |
    """
).select()
edges = pw.debug.table_from_markdown(
    """
    u | v
    a | b
    b | c
    c | a
    c | b
    """,
).select(u=vertices.pointer_from(pw.this.u), v=vertices.pointer_from(pw.this.v))
pw.debug.compute_and_print(pagerank(edges))
```

```
| rank
^EGS8DW6... | 0.6444
^76BNCAZ... | 1.1634
^JASWCNM... | 1.1923
```

Why these numbers, 0.6444, 1.1634, and 1.1923? Feel free to skip the rather mathy explanation below.

Let us calculate what the correct answer should be. PageRank actually finds a stationary distribution of a random walk on the graph in which the probability of each move depends only on the currently visited page, i.e., the walk is a Markov chain.

Without teleportation, the transition matrix of the Markov chain in our example would be

$$M = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ \frac{1}{2} & \frac{1}{2} & 0 \end{pmatrix}$$

In each step we move to a linked page with probability 0.85, distributed uniformly among all the linked (adjacent) pages, and with probability 0.15 we jump to a page chosen uniformly at random (these are the constants `0.85` and `0.15` from the code above). The resulting stationary distribution is roughly $x = (0.215, 0.397, 0.388)$, which is proportional to the ranks returned. The difference is due to the ranks not being normalized: they sum to $3$ rather than $1$.
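We can verify the stationary distribution numerically with a short power iteration (a standalone NumPy sketch; `M` below is the link matrix of the example graph, with rows indexed by the source page a, b, c and columns by the target page):

```python
import numpy as np

# Link matrix of the example graph: a -> b; b -> c; c -> a, c -> b.
M = np.array([
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
])
# Full transition matrix: follow a link with prob. 0.85,
# teleport uniformly (prob. 0.15 split over 3 pages) otherwise.
P = 0.85 * M + 0.15 / 3

x = np.full(3, 1 / 3)  # start from the uniform distribution
for _ in range(100):
    x = x @ P          # one step of the Markov chain
print(x.round(3))      # -> [0.215 0.397 0.388]
```

Multiplying this distribution by 3 recovers the unnormalized ranks 0.6444, 1.1923, and 1.1634 printed by the Pathway test.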

### Summary

As always, feel free to play and experiment with this code! If you are looking for cool real-world graphs to experiment with, the Stanford Network Analysis Project is an excellent source of reference instances, big and small.