27 August 2024 ~ 2 Comments

The Glass Door of Wikipedia’s Notable People

I think Wikipedia is great. I spend tons of time on it. I especially like to read about history, because it allows me to quickly jump into obscure details about anything, without the need to scout for specialized literature that might be super hard to find. But one question always creeps in the back of my mind: am I reading something as fair as it can be? How much are the editors’ biases driving my discovery process? These are testable questions! And the subject of this blog post, and a paper I recently published.

I don’t need this judgemental look when I’m deep in a clicking rabbit hole that started wondering why Brown Noise is called that way…

The paper is “Traces Of Unequal Entry Requirement For Illustrious People On Wikipedia Based On Their Gender” recently published in the Advances in Complex Systems journal. This is mostly the product of brilliant Lea Krivaa‘s master thesis. In the paper, we decided to focus on a specific bias: the role gender plays in the inclusion criteria of notable people on Wikipedia.

The hypothesis is that women need to do more than men to “deserve” a Wikipedia page. The are a few problems with this hypothesis. For starters, we can’t really prove it by simply saying that there are way more men than women on Wikipedia. That can happen and still be fair, because Wikipedia is just working with whatever it can collect from the notoriously male-centric historiography. Moreover, a true fairness test is hard to make: it’s not feasible to collect from the already-biased archives every person’s CV and see that there are discarded CVs from women that are as good as some of the included men on Wikipedia. Good luck checking the Roman Empire CVs after the Visigoths sacked the capital in 410AD.

Gosh darn it, I’m doing the rabbit hole clicking thing again, am I? I’ll never be done writing this article…

However, it turns out that we can find traces of this glass entrance door by using some unexpected network science techniques. We built a network of notable people: we took the set of people from Pantheon, because it’s a curated list of people that are on multiple language Wikipedia editions — this ensures they’re not just a pet peeve of some local editor. Then we connected them with an edge if the page of one person has a hyperlink connecting it to the page of another.

Crucially, we’re able to estimate the weight of the edge with some natural language processing: we count the number of times the target of this hyperlink is mentioned in the page containing the link. Knowing the edge weight is fundamental, because then we can use my backboning method to know the significance of this weight: how likely is it to be a noisy link?

I don’t mean to brag (I do), but I’m quite big in the network backboning industry. Wait, am I on Wikipedia again? Please send help.

Backboning is done to sparsify a network, by dropping the least significant edges. But here we’re just interested in looking at the significance values themselves. By looking at them we discover something odd: the edges involving women are on average more significant. If we were to establish a high significance threshold, we would end up isolating (and dropping) way more men than women. This shouldn’t happen if there was no bias.

On the left we have the distribution of significance for all four types of edges (since it’s a directed network). On the right we have the share of nodes we isolate with a given edge significance threshold. In both cases, you can see women’s edges and nodes are more impervious to harsher significance thresholds.

Our interpretation is that this is a hint that the glass entrance door exist: to be included in multiple language Wikipedias, a woman needs to have more significant ties with other notable people than a man.

This might seem a stretch or a bit abstract, but there’s a neat way to test this interpretation. On March, Wikipedia has a tradition of celebrating the month by improving its coverage of notable women. This means that, in March, it is “easier” for a woman to get added to Wikipedia than normal. And we can confirm this with our analysis! If we only look at pages created in March, the gap we observe is noticeably smaller.

The dark lines (March pages) are closer to each other than the faded ones (all other months).

Of course all of this should be taken with a grain of salt. Since we rely on Pantheon’s curation of profiles, we inherit all of their biases. Moreover, we only focus on the 1750-1950 time period, for various data quality reasons. And there are other factors affecting how much we can read in this analysis. For instance it might be that we simply do not have enough women to include in Wikipedia, because of the male bias in historiography I already mentioned. However, we think this is an interesting question to ask, because we can do better to improve inclusivity. If the gap can shrink in March, we ask: why can’t it shrink the whole year around?

Continue Reading

15 April 2021 ~ 0 Comments

A Bayesian Framework for Online Social Network Sampling

Analyzing an online social network is hard because there’s no “download” button to get the data. In the best case scenario, you have to query an API (Application Program Interface) system. You have to tell the API which user you want to see the connections of, and you can’t just ask for all users at the same time. With the current API rate limits, downloading the Twitter graph would take seven centuries, as the picture below explains.

The gray circle represents the totality of Twitter’s users. The red one is what you’d get with a continuous crawl of one year.

I don’t know about you, but I don’t have seven centuries to publish my papers. I want to become a famous scientist now. If you’re like me, you’re forced to extract only a sample. In this post, I’ll describe a network sampling algorithm that works with social media APIs to maximize the amount of information you get from your sample. The method has been published in the paper “Noise Corrected Sampling of Online Social Networks“, which recently appeared in the journal Transactions on Knowledge Discovery from Data.

My Noise Corrected sampling (NC) is a topological method. This means that you start from one (or more) user IDs that are known, and then you gather new user IDs from their friends, then you move on to the friends’ friends and so on, following the connections in the graph you’re sampling. This is the usual approach, because there is no way to know the user IDs beforehand — unless you have inside knowledge about how the social media platform works.

The underlying common question behind a topological sampler is: how do we pick the best user to explore next? Different samplers answer this question differently. A Random Walk sampler says: “Screw it! Thinking is too hard! I’m gonna get a random node from the friends of the currently explored node! Then, with the time I saved, I’m gonna get some ice cream.” The Metropolis-Hastings Random Walk — the nerd’s answer to the jock’s Random Walk — wants to preserve the degree distribution, so it penalizes high-degree nodes — since they’re more likely to be oversampled. Alternatively, we could use classical graph exploration techniques such as Depth-First and Breadth-First Search — and, again, you can enhance them so that you can preserve some properties of the network, for instance its clustering coefficient.

A random walk on a graph. It looks like a drunk going around. I don’t drink and sample, and neither should you.

My NC departs from all these approaches in a fundamental way. The thing they all have in common is that they don’t use information from the sample they have currently gathered. They only look at the potential next users and make a decision on whom to explore based on their characteristics alone. For instance Metropolis-Hastings Random Walk checks the user’s popularity in number of connections. NC, instead, looks at the entire collected sample and asks a straightforward question: “which of these next users is most likely to give me the most information about the whole structure, given what I explored so far?”

It does so with a Bayesian framework. The gory math is in the paper, but the underlying philosophy is that you can calculate a z-score for each edge of the network. It tells you how surprising it is that this edge is there. In probability theory, “surprise” is a good thing: it is directly related to how much information something gives you. Probability theorists are a bunch of jolly folks, always happy to discover jacks-in-a-box. If you always pick the most surprising edge in the network to explore the next user, you’re maximizing the amount of information you’re going to get back. The picture below unravels what this entails on a simple example.

In this example, the green nodes are explored, the blue nodes are known — because they’re friends of an explored node — but not explored yet. The red nodes are unknown.

The first figure (a) is the start of the process, where we have a network and we pick a random starting point, node 7. I use the edge thickness to show its z-score: we always follow the widest edge. Initially, you only pick a random edge, because all edges have the same z-score in a star. However, once I pick node 3, all z-scores change as we discover new edges (b). In this case, first we prioritize the star around node 7 (c), then we start following the path from node 3 until node 9 (d-e), then we explore the star around node 9 until there are no more nodes nor edges to discover in this toy network (f). Note how edges exhaust their surprise as you learn more about the neighborhoods of the nodes they connect.

Our toy example is an unweighted network, where all edges have the same weight, but this method thrives if you have information about how strong a connection is. The edge weight is additional information that can be taken into account to measure its surprise value. Under the hood, NC estimates the expected weight of an edge given the average edge weights of the two nodes it connects. In an unweighted network this is equivalent to the degree, because all edges have the same weight. But in a weighted network you actually have meaningful differences between nodes that tend to have strong/weak connections.

So, the million dollar question: how does NC fare when tested against the alternatives I presented before? The first worry you might have is that it takes too much time to calculate the z-scores. After all, to do so you need to look at the entire sampled network so far, while the other methods only look at a local neighborhood. NC is indeed slower than them—it has to be. But it doesn’t matter. The figure above shows how much time it takes to calculate the z-scores for more and more edges. Those runtimes (less than a tenth of a second) are puny when compared to the multi-second wait times due to the APIs’ rate limits — or even to simple network latency. After all, NC is a simplification of the Noise-Corrected Backbone algorithm I developed with Frank Neffke, and we showed in the original paper that it was pretty darn fast.

What about the quality of the samples? This is a question with a million answers, because it depends on what you want to do with the sample, besides what are the characteristics of the original graph and of the API system you’re wrestling against, and how much time you have to gather your sample. For instance, in the figure below we see that NC is the best method when you want to estimate the assortativity value of a disassortative attribute (think gender in a dating network), but it is a boring middle-of-the-pack method when you want to reconstruct the original communities of a network.

Left: the Mean Absolute Error between the sample’s and the original graph’s assortativity values (lower is better). Right: the Normalized Mutual Information between the sample’s and the original graph’s communities (higher is better). NC in red.

NC ranks between the 3rd and 4th position on average across a wide range of applications, network topologies, API systems, and budget levels. This is good because I test eight alternatives, thus the expectation is that NC will practically always be better than average. Among all eight methods tested, in fact, no other method has a better rank average. The second best method is Depth-First Search, which ranks 4th three out of four times, against NC’s one out of two.

I have shared some code allowing to replicate my results. If you speak Python better than you speak academiquese, you could use that code to infer how to implement an NC sampling strategy next time you need to download Twitter.

Continue Reading

22 August 2019 ~ 0 Comments

Deriving Networks isn’t as Easy as it Looks

Networks are cool because they’re a relatively simple model that allows you to understand complex systems. The problem is that they’re too cool: sometimes they make you want to do network analysis on something that isn’t really a network. For instance, consider Netflix. Here you have people watching movies. You want to know which movies are similar to each other so that you can suggest them to similar users. On the wings of Maslow’s Lawwhen you’re holding a hammer everything starts looking like a nail –, the network scientist would want to build a movie-movie network.

Of course XKCD has something about this.

The problem is that there are many different ways to make a movie-movie network from Netflix data. Each of these different ways will alter the shape of your network in dramatic ways, which will affect the results you’re going to get once you use it for your aims. With Luca Rossi, I started exploring this space. This resulted in the paper “The Impact of Projection and Backboning on Network Topologies“, which I will present next week at the ASONAM conference.

In the paper we take some real-world data and we apply all possible combinations of network building techniques on it. We systematically explore the key topological properties of the resulting networks, and see that they dramatically change depending on which strategy you picked. Meaning that you’re going to get completely different results from the same analysis later on.

Good network analysis is like good art: if you gaze long at the hairball, the hairball will gaze back at you. Image property of the Albright-Knox Art Gallery, Buffalo, NY.

The first thing we need to understand is that, to get to the movie-movie network, we need to perform two major steps. Each movie is a vector, containing information about each user. It could simply be a one if the user watched the movie, or a zero if they didn’t. Thus first we need to apply a similarity measure quantifying how similar two movies are to each other (what I call “projection”). Then we’ll realize that all we got is a hairball. Every movie has a non-zero similarity with any other movie. After all, there are millions of users, but just a handful of movies, so the probability that any two movies were watched by at least one user is pretty high. So you need to filter out your movie-movie similarity, otherwise your resulting network will be too dense.

Comparing two vectors is the oldest profession in the world, assuming your world is completely made up of linear algebra — mine, sadly, is. Thus you can pick and choose dozens of similarity metrics — Euclidean, cosine, I have a soft spot for the Mahalanobis distance myself. However, you’d be better served by the measures that were developed with complex networks in mind. You see, the binary movie-user vectors will have typical broad degree distributions: some movies are very popular — everybody watches them –, some people like me are pathological movie buffs and will watch everything — my watchlist has ~6,500 entries. Thus for this paper we focus on a few of those “bipartite projection” techniques: hyperbolic, resource allocation (ProbS), and my beloved YCN method.

Since I already jumped on the XKCD wagon, I see no harm in continuing down this path…

Then, to filter out connections, you have to have an idea of what’s a “strong, significant” connection and what isn’t. If you’re naive and just think that you should only keep connections with higher weights (what I call “naive thresholding”), boy do I have news for you. Also in this case, we’re going to consider a couple of ways to filter out noisy connections: the disparity filter and my noise corrected backbone.

Ok, the stage is set. If you were paying attention, you’ll figure out what’s coming next: a total mess.

Above (click to enlarge) you see the filtering techniques — top to bottom: naive threshold, disparity filter, noise corrected –, for each projection — line color –, on different network properties. From left to right we see: how many nodes survive the filter step, how much clustered the network is, and how well separated its communities are. The threshold levels (x-axis) attempt to preserve a comparable number of edges for each technique combination.

Yeah, it doesn’t look good. Look at the middle column: there are some versions of this network with perfect clustering, meaning that every common neighbor of a movie is connected to every other; while networks have a transitivity of zero; with almost every possible other values in between. The same holds for modularity, which can span from ~0.2 to practically 1. So there’s no way of saying whether these are properties of the system or just properties of the cleaning procedures you used. Keep in mind that the original data is the same. We could conclude anything by stirring our pile of linear algebra. Want to argue that the movie space doesn’t cluster? Project with YCN and filter with noise corrected method. Want to find strong communities instead? No biggie: project with resource allocation and do a simple threshold of the result.

Famous network scientist and two-time “world’s best mustache” winner Nietzsche once said: “He who fights with hairballs should look to it that he himself does not become a hairball.”

I wish I had a wise message to wrap up this blog post. Something about how to choose the best projection-filtering pair best fitting a specific analysis — one that you cannot tune to obtain the results you want. However, that will have to wait for further research. For now, I just want you to grow suspicious about specific results you see out there from networks that really aren’t network. If your nodes aren’t really connecting directly — like physical connections would do, for instance between neurons –, pretending they do so might lead you down a catastrophic over-confident path.

Continue Reading

25 January 2017 ~ 0 Comments

Network Backboning with Noisy Data

Networks are a fantastic tool for understanding an interconnected world. But, to paraphrase Spider Man, with networks’ expressive power come great headaches. Networks lure us in with their promise of clearly representing complex phenomena. However, once you start working with them, all you get is a tangled mess. This is because, most of the time, there’s noise in the data and/or there are too many connections: you need to weed out the spurious ones. The process of shaving the hairball by keeping only the significant connections — the red ones in the picture below —  is called “network backboning”. The network backbone represents the true relationships better and will play much nicer with other network algorithms. In this post, I describe a backboning method I developed with Frank Neffke, from the paper “Network Backboning with Noisy Data” accepted for publication in the International Conference on Data Engineering (the code implementing the most important backbone algorithms is available here).

bb1

Network backboning is as old as network analysis. The first solution to the problem was to keep edges according to their weight. If you want to connect people who read the same books, pairs who have few books in common are out. Serrano et al. pointed out that edge weight distributions can span many orders of magnitude — as shown in the figure below (left). Even with a small threshold, we are throwing away a lot of edges. This might not seem like a big deal — after all we’re in the business of making the network sparser — except that the weights are not distributed randomly. The weight of an edge is correlated with the weights of the edges sharing a node with it — as shown by the figure below (right). It is easy to see why: if you have a person who read only one book, all its edges can have at most weight one.

Their weights might be low in comparison with the rest of the network, but they are high for their nodes, given their propensity to connect weakly. Isolating too many nodes because we accidentally removed all their edges is a no-no, so Serrano and coauthors developed the Disparity Filter (DF): a technique to estimate the significance of one node’s connections given its typical edge weight, regardless of what the rest of the network says.

bb2

This sounds great, but DF and other network backboning approaches make imprecise assumptions about the possibility of noise in our estimate of edge weights. In our book example, noise means that a user might have accidentally said that she read a book she didn’t, maybe because the titles were very similar. One thing DF gets wrong is that, when two nodes are not connected in the raw network data, it would say that measurement error is absent. This is likely incorrect, and it screams for a more accurate estimate of noise. I’m going to leave the gory math details in the paper, but the bottom line is that we used Bayes’ rule. The law allows us to answer the question: how surprising is the weight of this edge, given the weights of the two connected nodes? How much does it defy my expectation?

The expectation here can be thought of as an extraction without replacement, much like Bingo (which statisticians — notorious for being terrible at naming things — would call a “hypergeometric” one). Each reader gets to extract a given number of balls (n, the total number of books she read), drawing from a bin in which all balls are the other users. If a user read ten books, then there are ten balls representing her in the bin. This is a good way to have an expectation for zero edge weights (nodes that are not connected), because we can estimate the probability of never extracting a ball with a particular label.

bb4

I highlighted the words one and two, because they’re a helpful key to understand the practical difference between the approaches. Consider the toy example below. In it, each edge’s thickness is proportional to its weight. Both DF and our Noise Corrected backbone (NC) select the black edges: they’re thick and important. But they have different opinions about the blue and red edges. DF sees that nodes 2 and 3 have mostly weak connections, meaning their thick connection to node 1 stands out. So, DF keeps the blue edges and it drops the red edge. It only ever looks at one node at a time.

bb5

NC takes a different stance. It selects the red edge and drops the blue ones. Why? Because for NC what matters more is the collaboration between the two nodes. Sure, the blue connection is thicker than the red one. But node 1 always has strong connections, and its blue edges are actually particularly weak. On the other hand, node 3 usually has weak connections. Proportionally speaking, the red edge is more important for it, and so it gets saved.

To sum up, NC:

  1. Refines our estimate of noise in the edge weights;
  2. Sees an edge as the collaboration between two nodes rather that an event happening to one of them;
  3. Uses a different model exploiting Bayes’ law to bake these aspects together.

bb6

How does that work for us in practice? Above you see some simulations made with artificial networks, of which we know the actual underlying structure, plus some random noise — edges thrown in that shouldn’t exist. The more noise we add the more difficult it is to recover the original structure. When there is little noise, DF (in blue) is better. NC (in red) starts to shine as we increase the amount of noise, because that’s the scenario we were targeting.

In the paper we also show that NC backbones have a comparable stability with DF, meaning that extracting the backbone from different time snapshots of the same phenomenon usually does not yield wildly different results. Coverage — the number of nodes that still have at least one edge in the network — is also comparable. Then we look at quality. When we want to predict some future relationship among the nodes, we expect noisy edges to introduce errors in the estimates. Since a backbone aims at throwing them away, it should increase our predictive power. The table below (click it to enlarge) shows that, in different country-country networks, the predictive quality (R2) using an NC backbone is higher than the one we get using the full noisy network. The quality of prediction can get as high as twice the baseline (the table reports the quality ratio: R2 of the backbone over R2 of the full network, for different methods).

bb8

The conclusion is that, when you are confident about the measurement of your network, you should probably extract its backbone using DF. However, in cases of uncertainty, NC is the better option. You can test it yourself!

Continue Reading