Michele Coscia - Connecting Humanities

I am an associate prof at IT University of Copenhagen. I mainly work on algorithms for the analysis of complex networks, and on applying the extracted knowledge to a variety of problems. My background is in Digital Humanities, i.e. the connection between unstructured human knowledge and the coldness of computer science. I have a PhD in Computer Science, obtained in June 2012 at the University of Pisa. In the past, I visited Barabasi's CCNR at Northeastern University, and worked for 6 years at CID, Harvard University.

23 January 2018 ~ 0 Comments

Hitting the Front Page Triggers an Evolutionary Arms Race

I’m a conformist. Just like everyone else in computer science working on memes, I am lured by the Holy Grail of questions. Can I figure out what makes content go viral? I really want to answer “yes”, but my absence from Reddit’s front page speaks louder than any advice I could give to you. That didn’t dissuade me from trying to look for a question fitting my preferred answer. After building Earth and waiting a few billion years for it to process the question, this is what I got: “can I figure out what makes content not go viral?” I guess that’s half of the job, right?

In 2014 I proposed my explanation, a rather trivial one: the content that does not go viral is the one that is unoriginal, the one that is too close a copy of something that is already out there. My explanation sounds uncontroversial: to be successful you have to be original, yeah thank you very much. But there was a little problem with it: karma trains. Very often, topics stay popular for multiple days: Reddit and social media are flooded with posts about the same thing, seemingly tapping into a never-ending pit of attention and upvotes. Unoriginal content actually makes it to the front page. Was it time to throw my theory in the dustbin?* I didn’t think so. So, in this month’s issue of Communications of the ACM, I published a sequel: “Popularity Spikes Hurt Future Chances for Viral Propagation of Protomemes.”

I need to defuse the danger that karma trains represent for my theory, and I do so in two parts, asking two questions. First: is it really true that, in karma trains, content that stays closer to the original post gets more attention and success? Second: is it really true that karma trains contain exclusively unoriginal content? For these questions, I specifically define karma trains as the collection of posts using a meme after it hit the front page. To answer these questions I focus mainly on Reddit. I use data kindly donated to me by Tim Weninger from the University of Notre Dame (in particular from this paper of his). I look at all catchphrases used frequently — hence the word “protomeme” in the title: my definition is a bit wider than just memes — and I track how successful they are on each day.

For the first question I have to debunk the notion that unoriginal content is successful in karma trains. First, I check if a meme hit the front page on a particular day. Then I look at all the Reddit posts containing that meme the day after. A karma train implies that more people will use the meme — so, more posts using the catchphrase — and that posts including the meme will be on average more successful. The first part is true: karma trains do exist and, after a meme hits the front page, more people will use it. But the second part is crucially false: on average these posts score poorly. This is not just regression towards the mean: obviously if the meme just hit the front page, its average popularity the day after can only go down. But I control for that. I control for the entire history of the meme. Its average popularity the day after hitting the front page is significantly lower than its regular average popularity, its recent popularity trends, and the average popularity of everything else that day.

So what gives? If the meme is doing poorly, why are karma trains a thing? Because, over those many attempts, a few are going to hit the front page again. Since the front page is very noticeable, we’re tricked into thinking that all posts using the meme are doing well. This transitions nicely into the second part of the paper. Why are those few posts successful? Bell-shaped random distributions teach us that, if you try enough times, one of the attempts is going to be really good. Is that the case? Are we looking at statistical aberrations or is there something interesting? There is: these posts are not ending up on the top randomly. There’s something special about them.

I made up a measure of post originality. Given that a post contains a meme, I want to know if it is repeating it in a boring way, or if it is adding something to the mix. It answers the question: how canonical is the usage of the meme in this post? That is why I called this measure “canonicity”. In practice, for every word that ever appeared in a Reddit title, I calculate the probability that the word is going to be used in a post including that meme. So for every new post I can calculate the average word probability, ending up with an estimate of how surprising this particular post title is.
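To make the mechanics concrete, here is a minimal sketch of how a canonicity-style score could be computed. The function names and the toy corpus are mine, for illustration only; the paper's actual implementation may differ in its details.

```python
# A minimal sketch of a canonicity-style score (illustrative, not the paper's code).
from collections import Counter

def word_probabilities(titles):
    """For each word, the fraction of meme-bearing titles that contain it."""
    counts = Counter(w for title in titles for w in set(title.lower().split()))
    return {w: n / len(titles) for w, n in counts.items()}

def canonicity(title, probs, default=1e-6):
    """Average word probability: high = canonical (boring), low = surprising."""
    words = title.lower().split()
    return sum(probs.get(w, default) for w in words) / len(words)

# Toy corpus: past post titles containing the same meme.
past_titles = ["one does not simply walk into mordor",
               "one does not simply upvote this",
               "one does not simply read the article"]
probs = word_probabilities(past_titles)
print(canonicity("one does not simply walk into mordor", probs))  # high: all words seen before
print(canonicity("one does not simply pass peer review", probs))  # lower: three unseen words
```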

You know what I’m going to say next. The more unsurprising a post is, the less likely it is to be successful. A high-canonicity post has roughly half the odds of being widely successful — e.g. hitting the front page — compared to a low-canonicity one. And the fact that there are low-canonicity posts in karma trains is interesting in itself. It confirms my hunch that, among the posts that jump on the bandwagon of popular memes, there are also original thoughts. This is the evolutionary arms race I mention in the title: when a meme hits the front page, subsequent implementations of it have to constantly innovate, else the meme will be infested by high-canonicity copies and fade into oblivion. This is generally true for all memes, but it is an effect on steroids for recently successful memes, because they are the ones being copied the most in a particular time period.

The story has another interesting turn. Low-canonicity is a double-edged sword. It gives you better odds to hit the front page, but if you fail at it then your performance is going to be atrocious. In that case, high-canonicity posts are actually doing better than low-canonicity ones. In other words, a meme after hitting the front page does a sort of “canonicity sandwich”: the very best AND very worst posts are low-canonicity ones, and in the middle you have high-canonicity posts. Why does this happen? Maybe it’s because of familiarity. Familiar content is reassuring and so people “get it” and upvote. It just does not soar high enough. Or it can be a million other reasons that I haven’t tested yet, so I can only speculate.

What the canonicity sandwich means is that content originality has a varying effect: high canonicity harms you in some cases, but it’s good for you in others. This discovery is important, because other researchers have found that a post’s content doesn’t seem to explain its success very well. The sandwich effect might very well be the cause of our disagreement.

To wrap up, I hope I put on a credible defense of my theory in the face of karma trains. These annoying meme critters are dangerous to my explanation of popularity, because they directly contradict it. Karma trains seem to be a collection of popular unoriginal content: the exact thing my theory says shouldn’t exist. Upon closer inspection, I noticed that (a) it isn’t really true that karma train posts are particularly successful and (b) it isn’t really true that they only contain unoriginal content. So, my theory is going to die another day, like in all good James Bond flicks**.


* Yes, but I need tenure, so I’ll have to put up a fight.

** Which Die Another Day wasn’t.

25 October 2017 ~ 0 Comments

Nice Transaction Data Clustering Algorithm You Have There. It would be a Shame if Someone were to Misapply it.

I’m coming out of a long hiatus to write a quick post about a nice little paper I just put together with Riccardo Guidotti. The topic today is community discovery: the task of grouping nodes in a network because they connect to each other more strongly than to the rest of the network. No, wait, the actual topic is transactional data clustering: the task of grouping customers of a supermarket because they usually buy the same stuff. Wait what? What do these problems have to do with each other? Well, that’s what we concluded: the two things can be the same.

The title of the paper is “On the Equivalence between Community Discovery and Clustering” and it was accepted for publication at the GoodTechs conference in Pisa. The starting point of this journey was peculiar. Riccardo, being the smart cookie he is, published earlier this year a paper in the prestigious SIGKDD conference. In the paper, he and his coauthors describe the Tx-means algorithm, which does exactly what I said before: it looks at the shopping carts of people, and figures out which ones are similar to each other. You can use it to give suggestions of items to buy and other nice applications. There is a three-minute video explaining the algorithm better than I can, which incidentally also shows that Riccardo has some serious acting chops on the side of his research career:

The title of the paper is “Clustering Individual Transactional Data for Masses of Users” and you should check it out. So the question now is: how can I make all of this about me? Well, Riccardo wanted some help to break into the community discovery literature. So he asked my advice. My first advice was: don’t. Seriously, it’s a mess. He didn’t follow it. So the only option left was to help him out.

The main contribution of the paper is to propose one of the possible ways in which transactional data clustering can be interpreted as community discovery. The two tasks are known to be very similar in the literature, but there are many ways to map one onto the other, with no clear reason to prefer one translation over another. In this paper, we show one of those translations, and some good reasons why it’s a good candidate. Now, the article itself looks like a five-year-old took a bunch of Greek letters from a bag and scattered them randomly on a piece of paper, so I’ll give you a better intuition for it.

The picture shows you transactional data. Each cart contains items. As the video points out, this type of data can also represent other things: a cart could be a user and the items could be the webpages she read. But we can go even more abstract than that. There is no reason why the items and the carts should be different entity types. A cart can contain… carts. We can define it as the list of carts it is related to, for any arbitrary relatedness criterion you’re free to choose (maybe it’s because they are cart-friends or something).

Once we do that, we can pivot our representation. Now it’s not transactional data any more: it is an edge list. Each node (the cart) lists the other nodes (cart-items) to which it connects. So this is a network — as shown below. What would it mean to find communities in this network? Well, it would mean — as I said at the beginning — to find groups of nodes that connect to each other. Or, in other words, groups of carts that contain the same items (carts). Which is what Tx-means does.
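Here is a tiny, hypothetical sketch of that pivot in Python (the carts and the relatedness criterion are invented):

```python
# Hypothetical example of the cart -> network pivot; the data is invented.
import networkx as nx

# Each cart "contains" the carts it is related to.
carts = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}

# Read the same dictionary as an edge list and you have a network.
G = nx.Graph()
for cart, items in carts.items():
    G.add_edges_from((cart, item) for item in items)

print(sorted(G.edges()))  # [('A', 'B'), ('A', 'C'), ('B', 'C'), ('C', 'D')]
```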

This approach is a bit of a bastardization of community discovery because, if we apply it, then the community definition shifts from “densely connected nodes” to “similarly connected nodes.” But that’s a valid community definition — and that’s why community discovery is a mess — so we get a pass. It’s more or less the same thing I did on another working paper I’m sweating on, and so far only one person has called me out on that (thanks Adam), so I call it a success.

Why on Earth would we want to do this? Well, because Tx-means comes with some interesting advantages. First, let’s pitch it against some state-of-the-art community discovery methods. In our experiments, Tx-means was the best at guessing the number of communities in networks of non-trivial size. This came without a big penalty in Normalized Mutual Information (NMI) — which measures how aligned the discovered communities are to the “real” communities as we gather them from metadata.
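For reference, this is the kind of alignment check involved; a toy example, not our actual benchmark code:

```python
# Toy NMI check between detected communities and metadata ground truth.
from sklearn.metrics import normalized_mutual_info_score

ground_truth = [0, 0, 0, 1, 1, 1]  # "real" community of each node
detected     = [1, 1, 1, 0, 0, 2]  # labels found by the algorithm (one community too many)
print(normalized_mutual_info_score(ground_truth, detected))  # 1.0 would be a perfect match
```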

More importantly — as you saw in the video — Tx-means is able to extract “typical carts.” Translated into network lingo: with Tx-means you can extract the prototypical nodes of a community, its most representative members. This has a number of applications if you are interested in, for instance, knowing who the leaders are, around whom the nodes in the community coalesce. They are highlighted in colors in the picture above. You cannot really do this with Infomap, for instance.

So that’s the reason for this little exercise. Other transactional data clustering algorithms can use the same trick to go from carts to nodes, and therefore translate their features into the context of community discovery — features that might be missing from the traditional algorithms we use to detect communities.

 

22 May 2017 ~ 0 Comments

Netonets @ Netsci17: Program

As previously announced, this year’s edition of the Netonets event will happen again as a satellite of NetSci. The conference will take place in Indianapolis. The general program of the conference can be found here: http://netsci2017.net/program. I will be there, hosting the satellite just like last year. It will take place on Tuesday, June 20th, and will run for the entire day.

We have just completed a tentative program. We are going to have four great invited speakers: Marta C. Gonzales, Romualdo Pastor-Satorras, Gareth Baxter and Paul Hines. We also have five contributed talks. You can find the first draft of the program down here, subject to change in case of conflicting schedules for any of the participants. I will keep it up to date in case that happens.

Looking forward to seeing you in Indianapolis!

Session I

9:00 – 9:15: Room set up
9:15 – 9:30: Welcome from the organizers
9:30 – 10:15: Invited I: Marta Gonzales: “Coupled networks of mobility and energy”

10:15 – 10:45: Coffee Break

Session II

10:45 – 11:30: Invited II: Gareth Baxter: “Critical dynamics of the interdependent network transition”
11:30 – 11:50: Contributed I: Dana Vaknin, Michael Danziger and Shlomo Havlin: “Spreading of localized attacks in spatial multiplex networks”
11:50 – 12:10: Contributed II: Ivana Bachmann and Javier Bustos-Jiménez: “Seven years of interdependent network’s robustness”

12:10 – 14:00: Lunch Break

Session III

14:00 – 14:45: Invited III: Romualdo Pastor-Satorras: “Effects of temporal correlations in social multiplex networks”
14:45 – 15:05: Contributed III: Zhiwei Yang, Jichao Li, Danling Zhao, Yuejin Tan and Kewei Yang: “Operation Loop based Structural Robustness of Combat Networks of Weapon System-of-systems”
15:05 – 15:25: Contributed IV: Shawn Gu and Tijana Milenkovic: “From Homogeneous to Heterogeneous Network Alignment”
15:25 – 15:45: Contributed V: Louis Shekhtman, Michael Danziger, Ivan Bonamassa, Sergey Buldyrev, Vinko Zlatic and Shlomo Havlin: “Secure message passing in networks with communities”

15:45 – 16:10: Coffee Break

Session IV

16:10 – 16:55: Invited IV: Paul Hines: “Increasing interconnectivity between networks can increase network robustness”
16:55 – 17:30: Round table – Open discussion
17:30 – 17:45: Organizers wrap up

21 February 2017 ~ 0 Comments

Netonets @ NetSci 2017: Call for Contributions!

We are delighted to invite submissions for

Network of Networks
7th Edition – Satellite Symposium at NetSci2017

taking place in JW Marriott Indianapolis, Indiana, United States,
on June 20th, 2017.

Submission:
We invite you to submit a 300 word abstract including one descriptive figure using our EasyChair submission link:
https://easychair.org/conferences/?conf=netonets2017

Deadline for submission: April 21st, 2017.
Notification of acceptance will be sent out by April 28th, 2017.

Further Information at: http://www.netonets.org/events/netonets2017

Abstract:
For the seventh time, it is our pleasure to bring together pioneering work in the study of networks of networks. Networks of networks are networks in which the nodes may be connected through different relations, be part of interdependent layers, and be coupled by higher-order dynamics. They can represent multifaceted social interactions, critical infrastructure and complex relational data structures. In our call, we are looking for a diversity of research contributions revolving around networks of networks of any kind: in social media, in infrastructure, in culture. We are particularly keen on receiving works raising novel issues and provocative queries in the investigation of networks of networks, as well as new contributions tackling these challenges. How do networks of networks change the paradigm of established problems like percolation or community detection? How are we shifting our thoughts to be ready for this evolution? Running parallel to NetSci, the top network science conference, this event provides a valuable opportunity to connect with leading researchers in complex network science.

Confirmed Keynote:

Marta Gonzales – MIT
Gareth Baxter – University of Aveiro
Romualdo Pastor-Satorras – Universitat Politècnica de Catalunya
Paul Hines – University of Vermont (UNCONFIRMED)

The final program will strive for the inclusion of contributions from different research fields, creating an interdisciplinary dialogue about networks of networks.

Best regards,
The Netonets organizers,

Antonio Scala, La Sapienza – antonio.scala.phys@gmail.com
Gregorio D’Agostino, ENEA – gregorio.dagostino@enea.it
Michele Coscia, Harvard University – michele_coscia@hks.harvard.edu
Przemysław Kazienko, Wroclaw University of Technology – przemyslaw.kazienko@pwr.edu.pl

25 January 2017 ~ 0 Comments

Network Backboning with Noisy Data

Networks are a fantastic tool for understanding an interconnected world. But, to paraphrase Spider-Man, with networks’ expressive power come great headaches. Networks lure us in with their promise of clearly representing complex phenomena. However, once you start working with them, all you get is a tangled mess. This is because, most of the time, there’s noise in the data and/or there are too many connections: you need to weed out the spurious ones. The process of shaving the hairball by keeping only the significant connections — the red ones in the picture below — is called “network backboning”. The network backbone represents the true relationships better and will play much nicer with other network algorithms. In this post, I describe a backboning method I developed with Frank Neffke, from the paper “Network Backboning with Noisy Data”, accepted for publication at the International Conference on Data Engineering (the code implementing the most important backbone algorithms is available here).

[figure: bb1]

Network backboning is as old as network analysis. The first solution to the problem was to keep edges according to their weight. If you want to connect people who read the same books, pairs who have few books in common are out. Serrano et al. pointed out that edge weight distributions can span many orders of magnitude — as shown in the figure below (left). Even with a small threshold, we are throwing away a lot of edges. This might not seem like a big deal — after all we’re in the business of making the network sparser — except that the weights are not distributed randomly. The weight of an edge is correlated with the weights of the edges sharing a node with it — as shown in the figure below (right). It is easy to see why: if you have a person who read only one book, all her edges can have at most weight one.

Their weights might be low in comparison with the rest of the network, but they are high for their nodes, given their propensity to connect weakly. Isolating too many nodes because we accidentally removed all their edges is a no-no, so Serrano and coauthors developed the Disparity Filter (DF): a technique to estimate the significance of one node’s connections given its typical edge weight, regardless of what the rest of the network says.
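The DF criterion is compact enough to sketch in a few lines. This follows the published formula, as my own shorthand rather than the reference implementation:

```python
# A compact sketch of the Disparity Filter; my shorthand of the published
# formula, not necessarily the reference implementation linked above.
import networkx as nx

def disparity_filter(G, alpha=0.05):
    """Keep an edge if it is significant for at least one endpoint.
    Expects an undirected graph with a 'weight' attribute on every edge."""
    backbone = nx.Graph()
    for i, j, w in G.edges(data="weight"):
        for node in (i, j):
            k = G.degree(node)
            if k > 1:
                p = w / G.degree(node, weight="weight")  # share of the node's strength
                if (1 - p) ** (k - 1) < alpha:           # edge p-value under the DF null
                    backbone.add_edge(i, j, weight=w)
                    break
    return backbone
```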

[figure: bb2]

This sounds great, but DF and other network backboning approaches make imprecise assumptions about the possibility of noise in our estimate of edge weights. In our book example, noise means that a user might have accidentally said that she read a book she didn’t, maybe because the titles were very similar. One thing DF gets wrong is that, when two nodes are not connected in the raw network data, it assumes that measurement error is absent. This is likely incorrect, and it screams for a more accurate estimate of noise. I’m going to leave the gory math details in the paper, but the bottom line is that we used Bayes’ rule. The rule allows us to answer the question: how surprising is the weight of this edge, given the weights of the two connected nodes? How much does it defy my expectation?

The expectation here can be thought of as an extraction without replacement, much like Bingo (which statisticians — notorious for being terrible at naming things — would call a “hypergeometric” one). Each reader gets to extract a given number of balls (n, the total number of books she read), drawing from a bin in which the balls represent the other users. If a user read ten books, then there are ten balls representing her in the bin. This is a good way to have an expectation for zero edge weights (nodes that are not connected), because we can estimate the probability of never extracting a ball with a particular label.
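With scipy, this expectation is a one-liner. The numbers below are invented, just to show the mechanics:

```python
# Invented numbers: the chance that reader A never draws one of reader B's
# balls under the hypergeometric ("Bingo") null described above.
from scipy.stats import hypergeom

T = 1000  # balls in the bin: total books read by all the other users
K = 10    # balls representing user B, who read ten books
n = 25    # draws made by user A, one per book he read

print(hypergeom.pmf(0, T, K, n))  # expected probability of a zero-weight edge
```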

[figure: bb4]

I highlighted the words “one” and “two” above, because they’re a helpful key to understanding the practical difference between the approaches. Consider the toy example below. In it, each edge’s thickness is proportional to its weight. Both DF and our Noise Corrected backbone (NC) select the black edges: they’re thick and important. But they have different opinions about the blue and red edges. DF sees that nodes 2 and 3 have mostly weak connections, meaning their thick connection to node 1 stands out. So, DF keeps the blue edges and it drops the red edge. It only ever looks at one node at a time.

[figure: bb5]

NC takes a different stance. It selects the red edge and drops the blue ones. Why? Because for NC what matters more is the collaboration between the two nodes. Sure, the blue connection is thicker than the red one. But node 1 always has strong connections, and its blue edges are actually particularly weak. On the other hand, node 3 usually has weak connections. Proportionally speaking, the red edge is more important for it, and so it gets saved.

To sum up, NC:

  1. Refines our estimate of noise in the edge weights;
  2. Sees an edge as the collaboration between two nodes rather than an event happening to one of them;
  3. Uses a different model exploiting Bayes’ law to bake these aspects together.

[figure: bb6]

How does that work for us in practice? Above you see some simulations made with artificial networks, of which we know the actual underlying structure, plus some random noise — edges thrown in that shouldn’t exist. The more noise we add the more difficult it is to recover the original structure. When there is little noise, DF (in blue) is better. NC (in red) starts to shine as we increase the amount of noise, because that’s the scenario we were targeting.

In the paper we also show that NC backbones have stability comparable to DF’s, meaning that extracting the backbone from different time snapshots of the same phenomenon usually does not yield wildly different results. Coverage — the number of nodes that still have at least one edge in the network — is also comparable. Then we look at quality. When we want to predict some future relationship among the nodes, we expect noisy edges to introduce errors in the estimates. Since a backbone aims at throwing them away, it should increase our predictive power. The table below (click it to enlarge) shows that, in different country-country networks, the predictive quality (R2) using an NC backbone is higher than the one we get using the full noisy network. The quality of prediction can get as high as twice the baseline (the table reports the quality ratio: R2 of the backbone over R2 of the full network, for different methods).

[figure: bb8]

The conclusion is that, when you are confident about the measurement of your network, you should probably extract its backbone using DF. However, in cases of uncertainty, NC is the better option. You can test it yourself!

30 November 2016 ~ 1 Comment

Exploring the Uncharted Export

Exporting goods is great for countries: it is a way to attract foreign currency. Exports are also fairly easy to analyze, since they are put in big crates and physically shipped through borders, where they are usually triple checked*. However, there is another way to attract foreign currency that escapes this analytical convenience. And it is a huge one. Tourism. When tourists get inside your country, you are effectively exporting something: anything that they buy. Finding out exactly what and how much you’re exporting is tricky. Some things are easy: hotels, vacation resorts, and the like. Does that cover all they buy? Probably not.

Investigating this question is what I decided to do with Ricardo Hausmann and Frank Neffke in our new CID Working Paper “Exploring the Uncharted Export: An Analysis of Tourism-Related Foreign Expenditure with International Spend Data”. The paper analyzes tourism with a new and unique dataset. The MasterCard Center for Inclusive Growth endowed us with a data grant, sharing with us anonymized and aggregated transaction data that gives us insight into the spending behavior of foreigners inside two countries, Colombia and the Netherlands.

[figure: tourism1]

The first question to clear up is: does tourism really matter? Tourism might be huge for some countries — Seychelles or the Bahamas come to mind** — but does it matter for any other country? Using World Bank estimates — which, as we’ll see, are probably underestimates — tourism would be the number one export of many countries. Above you see two treemaps (click on them to enlarge) showing the composition of the export baskets of Zimbabwe and Spain. The larger the square, the more the country makes exporting that product. Tourism would add a square larger than tobacco for Zimbabwe, and twice as big as cars for Spain. Countries make a lot of money out of tourism, so it is crucial to have a more precise way to investigate it.

[figure: tourism2]

How do we measure tourism? As said before, we’re working with anonymized and aggregated transaction data. In practice, for each postal code of the country of destination we can know how many cards were used, and how much was spent in total, in different retail sectors. We focus on cards which were issued outside the country we are analyzing. This way we can be confident we are capturing mostly foreign expenditures. There are many caveats to keep in mind which affect our results: we do not see cash expenditures, we have only a non-random sample of MasterCard cards, and so on. However, when we look at maps showing the dollar intensity in the various parts of the country (above for Colombia and the Netherlands — click on them to enlarge), we find comforting validation in external data: the top six tourism destinations as reported by TripAdvisor always correspond to areas where we see a lot of activity in our data too.
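In code terms, the level of aggregation we work at looks roughly like this pandas sketch. All columns and rows are invented here; the real data is proprietary and arrives already anonymized and aggregated.

```python
# Invented toy transactions illustrating the aggregation level we work at.
import pandas as pd

tx = pd.DataFrame({
    "postal_code":    ["1012", "1012", "6211", "6211"],
    "sector":         ["Accommodations", "Restaurants", "Grocery", "Grocery"],
    "issuer_country": ["US", "BE", "BE", "NL"],
    "amount":         [120.0, 35.5, 80.0, 15.0],
})

# Keep foreign-issued cards only, then count cards and sum spend per area and sector.
foreign = tx[tx["issuer_country"] != "NL"]
print(foreign.groupby(["postal_code", "sector"])["amount"].agg(["count", "sum"]))
```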

[figure: nld_communities]

We also see an additional thing, and it turns out to be related to the advantage of our data over traditional tourism reports. A lot is happening on the border. In fact, the second most popular Colombian city after Bogotá is Cúcuta. If you have never heard of Cúcuta it just means that you are not from Colombia, or Venezuela: Cúcuta is a city on the northeastern border of the country. It is the place where many Venezuelans cross the border to do their shopping, representing a huge influx of cash for Colombia. Until the border got closed, at least (the data is from before this happened; now it’s open again). In the Netherlands, you can cluster municipalities according to the dominant foreign countries observed there — see the map above. You will find a Belgian cluster, for instance (in purple). This cluster is dominated by grocery and shopping.

[figure: tourism3]

While these Belgian shoppers are probably commuters rather than tourists, they are nevertheless bringing foreign currency to local coffers, so that’s great. And it is something not really captured by alternative methodologies. We classify a merchant type as “commuting” if it is predominant in the purple cluster, because it is more popular with “local” Belgian travelers. Everything else is either “tourism” — if it is predominant in the other, non-border municipalities — or “other” if there is no clear dominance anywhere. In the tourism cluster you find things like “Accommodations” and “Travel Agencies and Tour Operators”; in the commuting cluster you have merchants classified under “Automotive Retail” and “Pet Stores”. When you look at the share of expenditures going to the commuting cluster (above in green), you realize how significant this is. One out of four foreign dollars spent in the Netherlands goes to non-tourism-related activities. The share for Colombia goes up to 30%.

[figure: tourism4]

A post in this blog would not be complete without a gratuitous network visualization, so here we are. What you see is what we call the “Origin Space” for Colombia (above) and the Netherlands (below). Nodes are countries of origin, and they are connected if the tourists from these countries behave similarly in what and where they make their purchases. The color of the node tells you the continent of the country. The size is the presence of tourists in the country of destination, relative to the origin’s GDP. The size and color of the edges are proportional to how similar the two origins are (orange = very similar; blue = similar, but not so much). We can see that Colombia has a lot of large red nodes — from the Americas — and the Netherlands is strong in blue nodes — from Europe.

If you click on the picture and zoom into the Colombia network you will see why this kind of visualization is useful. Colombia is fairly well placed in the Australian market: the corresponding node is quite large. One thing then jumps out. Australia has a large and very orange connection. To New Zealand. No surprise: Australians and New Zealanders are similar. Yet, the New Zealand node is much smaller than Australia’s. It shouldn’t be: these are relative expenditures. This means that, for some reason, Colombia is not currently an appealing destination for New Zealanders, even though it should be, given their similarity with Australians. New Zealand should then be a target of further investigation, which might lead to tapping a new potential market for Colombian tourism.

And this concludes the reasons why this data is so amazing for studying tourism. To wrap up the message: we first validated the data, showing that it agrees with what we expect to be the most important tourism destinations of a country. Then, we unleashed its potential: the ability to detect “non-tourism” foreign cash inflows, and the promising initial development of tools to discover potentially missed opportunities.


* The process is not foolproof, though. I don’t remember where I read it, but it seems that if you sum all declared exports of countries and all declared imports, and subtract the two, you get a quite high positive number. I wonder where all these extra exports are going. Mars?

** When I was told we were doing a tourism project I got high hopes. I’m still waiting for a fully paid work mission to be approved by management. Any day now.

22 August 2016 ~ 0 Comments

It’s Not All in the Haka: Networks Matter in Rugby Too

If there is one thing that I love more than looking at silly pictures on the Interwebz for work, it is watching rugby for work. I love rugby: in my opinion it is the most beautiful team sport out there. It tingles my network senses: 15 men on the field have to coordinate like a single organism to achieve their goal — crossing the goal line with the ball, which can only be passed backwards, never forward. When Optasports made available some data collected during 18 rugby matches, I felt I could not miss the opportunity for some hardcore network nerding on them. The way teams weave their collaboration networks during a match must have some relationship with their performance, and I was going to find out what this relationship might be.

[figure]

For my quest I teamed up with Luca Pappalardo and Paolo Cintia, two friends of mine who are making an impact on network and big data sports analytics, both in soccer and in cycling. The result was “The Haka Network: Evaluating Rugby Team Performance with Dynamic Graph Analysis“, a paper recently presented at the DyNo workshop in San Francisco. Our questions were:

  1. Is there a relationship between the topology of the network of passes and the success of the team?
  2. Is there a relationship between disruptions made by tackles and territorial gains?
  3. If we want to predict a team’s success, is it better to build networks of passes and disruptions for each action separately or for the entire match?
  4. Can we use these relationships to “predict” the outcome of the match?

[figure: rugby1]

A pass network is simply a network whose nodes are the players of a team, with directed connections going from the player originating a pass to the player receiving the ball. We consider only completed passes: the ones that did not result in an error or lost possession. In the above picture, those are the green edges, and they are always established between players belonging to the same team. In rugby, players are allowed to tackle the current ball carrier of the opponent team. When that happens, we create another directed edge, this time in what we call the “disruption network”. The aim of a tackle is to prevent the opponent team from gaining meters. These are the red edges in the above picture, and they can only be established between players belonging to opposite teams. The picture you see is the collection of all passes and tackles which happened in the Italy vs New Zealand match in 2012. It is a multilayer network, as it contains edges of two different types: passes and tackles.
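If you want to play with the idea, the two-layer structure is easy to mock up, for instance in networkx (players and events are invented here):

```python
# A mock-up of the two-layer rugby network; players and events invented.
import networkx as nx

M = nx.MultiDiGraph()
# Pass layer: within-team edges, from passer to receiver.
M.add_edge(("ITA", 9), ("ITA", 10), layer="pass")
M.add_edge(("ITA", 10), ("ITA", 12), layer="pass")
# Disruption layer: cross-team edges, from tackler to ball carrier.
M.add_edge(("NZL", 7), ("ITA", 12), layer="disruption")

passes = [(u, v) for u, v, d in M.edges(data=True) if d["layer"] == "pass"]
print(passes)
```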

Once we have pass and disruption networks we can calculate a collection of network measures. I’ll give a brief idea here (if you are looking for more formal definitions you’ll have to search for them in the paper), and a toy computation follows the list:

  • Connectivity: how many pass connections you have to remove to isolate players;
  • Assortativity: the tendency of players to pass the ball to players with a similar number of connections — in high assortativity central players pass to other central players and marginal players to other marginal players;
  • Components: how many “sinks” there are — groups of players from which the ball never goes back to the bulk of the team once it is passed to them;
  • Clustering: how many triangles there are, meaning that the team can be decomposed into many different smaller sub-teams of three players.
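As promised, here is the toy computation, with networkx functions as rough stand-ins for our formal definitions:

```python
# Toy stand-ins for the four features; the paper's definitions differ in detail.
import networkx as nx

P = nx.DiGraph([(9, 10), (10, 12), (12, 13), (13, 10), (10, 15)])  # toy passes
U = P.to_undirected()

print(nx.edge_connectivity(U))                     # passes to cut to isolate someone
print(nx.degree_assortativity_coefficient(U))      # do hubs pass to hubs?
print(nx.number_strongly_connected_components(P))  # "sinks" the ball never leaves
print(nx.average_clustering(U))                    # density of 3-player sub-teams
```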

These are the features we calculated for the pass networks. The disruption case is slightly different. We calculated the same features for the team when removing the tackled player, weighted by the relative number of tackles. If 50% of the tackles hit player number 11, then 50% of the disrupted connectivity is the connectivity value of the pass network when removing player 11. The reason is that the tackled player is temporarily removed from the game, so we need to know how the team performs without him, weighted by the number of times this happens.
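Continuing the toy example, a sketch of that weighting (the tackle shares are invented):

```python
# Sketch of the tackle-weighted disruption feature, with invented shares.
import networkx as nx

P = nx.DiGraph([(9, 10), (10, 12), (12, 13), (13, 10), (10, 15)])
tackle_share = {12: 0.5, 10: 0.3, 13: 0.2}  # fraction of tackles each player received

def disrupted_connectivity(P, shares):
    """Connectivity of the pass network with each tackled player removed,
    weighted by how often that player was the one being tackled."""
    total = 0.0
    for player, share in shares.items():
        H = P.to_undirected()
        H.remove_node(player)
        total += share * nx.edge_connectivity(H)
    return total

print(disrupted_connectivity(P, tackle_share))
```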

So, it is time to give some answers. Shall we?

1. Is there a relationship between the topology of the network of passes and the success of the team?

[figure: rugby2]

Yes, there is. We calculate “success” as the number of meters gained, ball in hand, by the team. The objective of rugby is to cross the goal line carrying the ball, so meters made is a pretty good indicator. We control for two things. First, the total number of passes: it simply means the team was able to hold onto the ball longer, so it is trivially expected to result in more meters. Second, the home advantage, which is a huge factor in rugby: Italy won only 12 out of 85 matches in the European “Six Nations” tournament, and 11 of them were in Italy. After these controls, we find that two features correlate well with meters made: connectivity and components. The more edges are needed to isolate a player, the more meters a team is expected to make (p < .01, R2 = 47%). More sinks in a team are associated with lower gains in meters.
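The kind of controlled regression behind these numbers looks like the following sketch (the data is a toy; the real estimates are in the paper):

```python
# Toy controlled regression: meters gained vs. connectivity, controlling
# for total passes and home advantage. All numbers invented.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "meters":       [310, 450, 280, 520, 390, 610],
    "connectivity": [2, 3, 2, 4, 3, 5],
    "total_passes": [90, 120, 85, 140, 100, 160],
    "home":         [0, 1, 0, 1, 0, 1],
})
print(smf.ols("meters ~ connectivity + total_passes + home", data=df).fit().params)
```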

2. Is there a relationship between disruptions made by tackles and territorial gains?

[figure: rugby3]

Again: yes. In this case it seems that all calculated features matter to predict meters made. The strongest factor is again leftover connectivity. It means that if the connectivity of the pass network increases after the tackled player is removed from it, then the team is able to advance more. Simplifying: if you are able to tackle only low-connectivity players, then your opponent is able to gain more territory (p < .01, R2 = 48%).

3. Is it better to build networks of passes and disruptions for each action separately or for the entire match?

The answers to the previous two questions were obtained by calculating the features on the global match networks. The global network uses all the data from a match, exactly like the pass and disruption edges depicted in the above figure. In principle, one could calculate these features as the match unfolds: sequence by sequence. In fact, network features at the action level work very well in soccer, as Luca and Paolo already proved. Does that work in rugby too?

Surprisingly, the answer is no. We recalculated the features for each passage of play. A passage of play is the part of a match from when a team gets into possession of the ball until it loses it, scores, or the game flow stops for an infraction. When we calculate features at this level, we find very weak correlations: almost nothing is significant and, when it is, the predictive power is very low. We think this is because in rugby our definition of sequence is too strict. While soccer is a tactical game — where each sequence counts for itself — rugby is a grand strategy game: sequences build a cumulative advantage which pays off only after a series of them, or in the match as a whole.

4. Can we use these relationships to “predict” the outcome of the match?

[figure]

This is the real queen question of the post, and we do not fully answer it, unfortunately. However, we have a very good reason to think that the answer could be positive. We created a predictor which trains on 17 matches and then, given the global multilayer network, picks the winner. You can see the problem of the approach here: we use the network of the match as it happened to “predict” the outcome. However, we did that only because we did not have enough matches for each individual team: we believe we can first predict how pass and disruption networks will take shape in a new match using historic data, and then use that to predict the outcome. That will be future work, maybe if some team is intrigued by networks and wants to contact us for a collaboration… (wink wink).

The reason I still like to report on our predictor is that it has a very promising property. Its accuracy was 83%. We compared it with a prediction made from official rugby rankings, whose performance is worse: 76% accuracy. We also tested against bookmakers, who are better than us with their 86% accuracy. However, historic data on bets only covers the more important matches — only 14 out of 18 — and matches between minor teams are usually less predictable. The fact that we are on par on a more difficult task is remarkable. More importantly, bookies tend to just “choose the best team”. For instance, they always predict a New Zealand win. The Haka, however, is not always enough, and our networks caught that. New Zealand lost to England in a big upset on December 1st 2012. The bookmakers didn’t see that coming, but our network approach could have.

07 July 2016 ~ 0 Comments

Building Data-Driven Development

A few weeks ago I had the honor to speak at my group’s “Global Empowerment Meeting” about my research on data science and economic development. I’m linking here the Youtube video of my talk and my transcript for those who want to follow it. The transcript is not 100% accurate given some last minute edits — and the fact that I’m a horrible presenter 🙂 — but it should be good enough. Enjoy!


We think that the big question of this decade is about data. Data are the building blocks of our modern society. We think that in development we are not currently using enough of these blocks, we are not exploiting data nearly as much as we should. And we want to fix that.

Many of the fastest growing companies in the world, and definitely the ones that are shaping the progress of humanity, are data-intensive companies. Here at CID we just want to add the entire world to the party.

So how do we do it? To fix the data problem the development literature has, we focus on knowing what the global knowledge building looks like. And we inspect three floors: how does knowledge flow between countries? What lessons can we learn inside these countries? What are the policy implications?

To answer these questions, we were helped by two big data players. The quantity and quality of the data they collect represent a revolution in the economic development literature. You heard them speaking at the event: they are MasterCard – through their Center for Inclusive Growth – and Telefonica.

Let’s start with MasterCard: they help us with the first question, how does knowledge flow between countries? Credit card data answers that. Some of you might have a corporate-issued credit card in your wallet right now. And you are here, offering your knowledge and assimilating the knowledge offered by the people sitting at the table with you. The movements of these cards are movements of brains, ideas and knowledge.

When you aggregate this at the global level you can draw the map of international knowledge exchange. When you have a map, you have a way to know where you are and where you want to be. The map doesn’t tell you why you are where you are. That’s why CID builds something better than a map.

We are developing a method to tell why people are traveling. And reasons are different for different countries: equity in foreign establishments like the UK, trade partnerships like Saudi Arabia, foreign greenfield investments like Taiwan.

Using this map, it is easy to envision where you want to go. You can find countries that have a profile similar to yours and copy their best practices. For Kenya, Taiwan seems to be the best choice. You can see that, if investments drive more knowledge into a country, then you should attract investments. And we have preliminary results to suggest whom to attract: the people carrying the knowledge you can use.

The Product Space helps here. If you want to attract knowledge, you want to attract the one you can more easily use. The one connected to what you already know. Nobody likes to build cathedrals in a desert. More than having a cool knowledge building, you want your knowledge to be useful. And used.

There are other things you can do with international traveler flows. Like tourism. Tourism is a great export: for many countries it is the first export. See these big portions of the exports of Zimbabwe or Spain? For them tourism would look like this.

Tourism is hard to pin down. But it is easier with our data partners. We can know when, where and which foreigners spend their money in a country. You cannot paint pictures as accurate as these without the unique dataset MasterCard has.

Let’s go to our second question: what lessons can we learn from knowledge flows inside a country? Telefonica data is helping us answer this question. Here we focus on a test country: Colombia. We use anonymized call metadata to paint the knowledge map of Colombia, and we discover that the country has its own knowledge departments. You can see them here, where each square is a municipality, connecting to the ones it talks to. These departments correlate only slightly with the actual political boundaries. But they matter so much more.

In fact, we asked if these boundaries could explain the growth in wages inside the country. And they seem to be able to do it, in surprisingly different ways. If you are a poor municipality in a rich state in Colombia, we see your wage growth penalized. You are on a path of divergence.

However, if you are a poor municipality and you talk to rich ones, we have evidence to show that you are on a path of convergence: you grow faster than you expect to. Our preliminary results seem to suggest that being in a rich knowledge state matters.

So, how do you use this data and knowledge? To do so you have to drill down to the city level. We look not only at communication links, but also at mobility ones. We ask if a city like Bogota is really one city, or different cities sharing the same metropolitan area. With the data you can draw four different “mobility districts”, with a lot of movement inside them, and not so much across them.

The mobility districts matter, because combining mobility and economic activities we can map the potential of a neighborhood, answering the question: if I live here, how productive can I be? A lot in the green areas, not so much in the red ones.

With this data you can reshape urban mobility. You know where the entrance barriers to productivity are, and you can tear them down. You remodel your city to include in its productive structure people who are currently isolated by commuting time and cost. These people have valuable skills and knowhow, but they are relegated to the informal sector.

So, MasterCard data told us how knowledge flows between countries. Telefonica data showed the lessons we can learn inside a country. We are left with the last question: what are the policy implications?

So far we have mapped the landscape of knowledge, at different levels. But to hike through it you need a lot of equipment. And governments provide part of that equipment. Some in better ways than others.

To discover the policy implications, we unleashed a data collector program on the Web. We wanted to know what the structure of government in the US looks like. Our program returned a picture of the hierarchical organization of government functions. We know how each state structures its own version of this hierarchy. And we know how all those connections fit together in the union, state by state. We are discovering that the way a state government is shaped seems to be the result of two main ingredients: where a state is, and what its productive structure looks like.

We want to establish that the way a state expresses its government on the Web reflects the way it actually performs its functions. We seem to find a positive answer: for instance, having your environmental agencies talk with each other seems to work well to improve your environmental indicators, as recorded by the EPA. Wiring organizations together when we see positive feedback, and rethinking them when we see negative feedback, is a direct consequence of this Web investigation.

I hope I was able to communicate to you the enthusiasm CID discovered in the usage of big data. Zooming out to gaze at the big picture, we start to realize what the knowledge building looks like. As the building grows, so does our understanding of the world, development and growth. And here’s the punchline of CID: the building of knowledge grows with data, but the shape it takes is up to what we make of this data. We chose to shape this building with larger doors, so that it can be used to ensure a more inclusive world.


By the way, the other presentations of my session were great, and we had a nice panel after that. You can check out the presentations in the official Center for International Development Youtube channel. I’m embedding the panel’s video below:

09 June 2016 ~ 0 Comments

Netsci 2016 Report

[figure: netsci1]

Another NetSci edition went by, as interconnected as ever. This year we got to enjoy Northeast Asia, a new scenario for us network scientists, and an appropriate one: many new faces popped up both among speakers and attendees. Seoul was definitely what NetSci needed at this time. I want to spend just a few words about what impressed me the most during this trip — well, second most after what Koreans did with their pizzas: that is unbeatable. Let’s go chronologically, starting with the satellites.

You all know I was co-organizing the one on Networks of networks (you didn’t? Then scroll down a bit and get informed!). I am pleased with how things went: the talks we gathered this year were most excellent. Space constraints don’t allow me to give everyone the attention they deserve, but I want to mention two. First is Yong-Yeol Ahn, who was the star of this year. He gave four talks at the conference — provided I haven’t miscounted — and his plenary one on the analysis of the LinkedIn graph was just breathtaking. At Netonets, he talked about the internal belief network each one of us carries in her own brain, and its relationship with how macro societal behaviors arise in social networks. An original take on networks of networks, and one that spurred the idea: how much are the inner workings of one’s belief network affected by the metabolic and bio-connectome networks of one’s own body? Should we study networks of networks of networks? Second, Nitesh Chawla showed us how higher-order networks unveil real relationships among nodes. The same node can behave like it is many different ones, depending on which of its connections we are considering.

[figure: yy1]

Besides the most awesome networks of networks satellite, other ones caught my attention. Again, space is my tyrant here, so I get to award just one slot, and I would like to give it to Hyejin Youn. Her satellite was on the evolution of technological networks. She does amazing things tracking how the patent network has evolved from the depths of the 1800s until now. The idea is to find viable innovation paths, and to predict which fields will have the largest impact in the future.

When it comes to the plenary sessions, I think Yang-Yu Liu stole the spotlight with a flashy presentation about the microcosmos everybody carries in their guts. The analysis of the human microbiome is a very hot topic right now, and it pleases me to know that there is somebody working on a network perspective of it. Besides scientific merits, whoever extensively quotes Minute Earth videos — bonus points for it being the one about poop transplants — has my eternal admiration. I also want to highlight Ginestra Bianconi’s talk. She has an extraordinary talent for bringing the most cutting-edge aspects of physics to network science. Her line of research combining quantum gravity and network geometry is a dream come true for a physics nerd like myself. I always wished to see advanced physics concepts translated into network terms, but I never had the capacity to do so: now I just have to sit back and wait for Ginestra’s next paper.

[figure: netsci2]

What about contributed talks? The race for the second best is very tight. The very best was clearly mine on the link between mobility and communication patterns, about which I showed a scaling relationship connecting them (paper, post). I will be magnanimous and spare you all the praises I could sing of it. Enough joking around, let’s move on. Juyong Park gave two fantastic talks on networks and music. This was a nice breath of fresh air for digital humanities: this NetSci edition was orphaned of the great satellite chaired by Max Schich. Juyong showed how to navigate through collaboration networks on classical music CDs, and through judge biases in music competitions. By the way, Max dominated — as expected — the lightning talk session, showing some new products coming from his digital humanities landmark published last year in Science. Tomomi Kito was also great: she borrowed the tools of economic complexity and shifted her focus from the macro analysis of countries to the micro analysis of networks of multinational corporations. A final mention goes to Roberta Sinatra. Her talk was about her struggle to make PhD committees recognize that what she is doing is actually physics. It resonates with my personal experience trying to convince hiring committees that what I’m doing is actually computer science. Maybe we should all give up the struggle and just create a network science department.

And so we get to the last treat of the conference: the Erdos-Renyi prize, awarded to the most excellent network researcher under the age of 40. This year it went to Aaron Clauset, and this pleases me for several reasons. First, because Aaron is awesome, and he deserves it. Second, because he is the first computer scientist to be awarded the prize, and this gives me hope that our work too is getting recognized by the network gurus. His talk was fantastic on two accounts.

[figure: aaron1]

For starters, he presented his brand new Index of Complex Networks. The interface is pretty clunky, especially on my Ubuntu Firefox, but that does not hinder the usefulness of such an instrument. With his collaborators, Aaron collected the most important papers in the network literature, trying to find a link to a publicly available network for each. If they were successful, that link went into the index, along with some metadata about the network. This is going to be a prime resource for network scientists, both for starting new projects and for the sorely needed task of replicating previous results.

Replication is the core of the second reason I loved Aaron’s talk. Once he collected all these networks, for fun he took a jab at some of the dogmas of network science. The main one everybody knows is: “power-laws are everywhere”. You can see where this is going: the impertinent Colorado University boy showed that yes, power-laws are very common… among the 5-10% of networks in which it is possible to find them. Not so much “everywhere” any more, huh? This was especially irreverent given that, not so long before, Stefan Thurner had given a very nice plenary talk featuring a carousel of power laws. I’m not picking sides on the debate — I feel hardly qualified to do so. I just think that questioning dearly held results is always a good thing, to avoid fooling ourselves into believing we’ve reached an objective truth.

[figure: netsci3]

Among the non-scientific merits of the conference, I talked with Vinko Zlatic about the Croatian government on the brink of collapse, spread the word about the Center for International Development’s search for a new network scientist, and discovered that Korean pizzas are topped with almonds (you didn’t really think I was going to let slip that pizza reference at the beginning of the post, did you?). And now I have made myself sad: I wish there was another NetSci right away, to shove my brain into another blender of awesomeness. Oh well, there are going to be plenty of occasions to do so. See you maybe in Dubrovnik, Tel-Aviv or Indianapolis?

20 May 2016 ~ 0 Comments

Program of Netonets 2016 is Out!

As announced in the previous post, the symposium on networks of networks is happening in less than two weeks: May 31st @ 9AM, room Dongkang C of the K-Hotel Seoul, South Korea. Przemek Kazienko, Gregorio D’Agostino and I have a fantastic program and set of speakers to keep you entertained on multilayer, interdependent and multislice networks. Take a look for yourself!

Session I

9:00 – 9:15: Room set up
9:15 – 9:30: Welcome from the organizers
9:30 – 10:15: Invited I: Yong-Yeol Ahn: Dynamics of social network of belief networks
10:15 – 11:00: Invited II: Luca Maria Aiello: The Nature of Social Links

11:00 – 11:30: Coffee Break

Session II

11:30 – 12:15: Invited III: Jianxi Gao: Networks of Networks: From Structure to Dynamics
12:15 – 13:00: Invited IV: Tomasz Kajdanowicz: Fusion methods for classification in multiplex networks

13:00 – 14:30: Lunch Break

Session III

14:30 – 15:15: Invited V: Michael Danziger: Beyond interdependent networks
15:15 – 15:35: Contributed I: Bruno Coutinho: Greedy Leaf Removal on Hypergraphs
15:35 – 15:55: Contributed II: Yong Zhuang: Complex Contagions in Clustered Random Multiplex Networks

15:55 – 16:30: Coffee Break

Session IV

16:30 – 17:15: Invited VI: Nitesh Chawla: From complex interactions to networks: the higher-order network representation

17:15 – 18:00: Round table – Open discussion
18:00 – 18:15: Organizers wrap up

Remember to register for the main NetSci conference if you want to attend.

Incidentally, the end of May is going to be a rather busy period for me. Besides co-organizing Netonets and speaking at the main NetSci conference, I’m also going to present at the Core50 conference in Louvain-la-Neuve, Belgium, on the role of social and mobility networks in shaping the economic growth of a country. Thanks to Jean-Charles Delvenne for inviting me!

I hope to see many of you there!