Michele Coscia - Connecting Humanities

Michele Coscia I am an associate prof at IT University of Copenhagen. I mainly work on algorithms for the analysis of complex networks, and on applying the extracted knowledge to a variety of problems. My background is in Digital Humanities, i.e. the connection between the unstructured knowledge and the coldness of computer science. I have a PhD in Computer Science, obtained in June 2012 at the University of Pisa. In the past, I visited Barabasi's CCNR at Northeastern University, and worked for 6 years at CID, Harvard University.

08 June 2022 ~ 0 Comments

Give Me Some Space!, or: the Slow Descent into Unicode Madness

My wife is the literal embodiment of this comic strip:

Laptop Issues
Source: https://xkcd.com/2083/

She has to overcome the weirdest code golfing issues to get through her research. The other day her issue was: “I have a bunch of text and I want to surround every unicode emoticon with spaces” — the idea being to treat each emoji like a full word.

Sounds easy enough until you get to the word “unicode” and you start hearing helicopters in the background, a Doors song comes on, and you flash memories of going upstream in a jungle river on a barge.

My problem is that, instead, I am the embodiment of the victim in this comic strip:

Nerd Sniping
Source: https://xkcd.com/356/

When confronted with something seemingly simple but with an interesting level of complexity behind it, I just have to drop everything I’m doing for my actual job for an hour, until I solve it.

At first glance, it seems easy enough: find a dictionary of emoticons and use it as the basis of a regular expression. The emoji package can take care of the first part. A naive solution could be the following:

import re
from emoji import unicode_codes

allemojis = "".join(unicode_codes.UNICODE_EMOJI['en'])
searcher = re.compile(u'([%s])' % allemojis)
spacemoji = {key: searcher.sub(r' \1 ', texts[key]) for key in texts}

This assumes that “texts” is a dictionary with a collection of texts we’re interested in. The “searcher” wraps a bunch of characters between square brackets, which means that any one of those characters will match. The “.sub” method then replaces whatever matched (“\1“) with its content surrounded by spaces.

Easy enough. Let’s test it with some example strings:

Wonderful
Everything works as it should and is awesome
Say what now?
I wonder if they still hire at McDonald’s

A passing knowledge of unicode, or a quick Google search about the mysterious \u200d code popping out of nowhere in example #3, leads to a relatively quick diagnosis. Emoticons are not single characters: they can combine multiple unicode characters to modify the appearance of a single glyph. In my case, the baseline turban emoticon is a man with the default yellow skin tone. To obtain a white lady with a turban you need to combine the turban emoticon with the light skin tone modifier and the woman symbol.

Same goes for numbers: some emoticons contain raw digit characters, and thus those will match even when not “inside” an emoticon.
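You can see this for yourself with the keycap emojis, which wrap a raw digit between two invisible modifiers — a quick check in plain Python, no extra packages needed:

```python
# The "keycap digit one" emoji is a raw "1" followed by a variation
# selector (U+FE0F) and the combining keycap character (U+20E3).
keycap_one = "1\ufe0f\u20e3"

print(len(keycap_one))    # 3 code points, not 1
print("1" in keycap_one)  # True: a plain digit hides inside the emoji
```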

So here’s a step-by-step cure for our unicode woes:

  1. Don’t work with string or unicode string objects. Always work with bytestrings by calling “.encode("utf-8")”. This way you can see what the machine actually sees. It’s not “”, it’s “\xf0\x9f\x91\xb3\xf0\x9f\x8f\xbb\xe2\x80\x8d\xe2\x99\x80” (easy, right?).
  2. Don’t use square brackets for your regular expression, because it will only match one character. Emoticons aren’t characters, they are words. Use the pipe, which allows for matching groups of characters.
  3. Store your emoticons in a list and sort it by descending length. The regular expression stops at the first match, and the modified emoticon is longer than the baseline one it is built from. The simple turban emoticon is actually “\xf0\x9f\x91\xb3” (note how these are the first four bytes of the previous emoticon). Sorting by length first ensures the regular expression will not match the “man with a turban” inside the “white woman with a turban“.
  4. Escape your regular expression’s special characters. Some emoticons contain the raw character “*“, which will make your compiler scream offensive things at you.
  5. Remember to encode your input text and then to decode your output, so that you obtain back a string and not a byte sequence — assuming you’re one of those basic people who want to be able to read the outputs of their pile of code.
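To convince yourself of step 1, you can check that the four code points of the “white woman with a turban” glyph really produce the byte string quoted earlier:

```python
# U+1F473 (person with turban) + U+1F3FB (light skin tone)
# + U+200D (zero-width joiner) + U+2640 (female sign)
emoji = "\U0001F473\U0001F3FB\u200d\u2640"

print(emoji.encode("utf-8"))
# b'\xf0\x9f\x91\xb3\xf0\x9f\x8f\xbb\xe2\x80\x8d\xe2\x99\x80'

# The plain turban emoticon is the first code point alone: its four bytes
# are a prefix of the longer sequence, which is why sorting by length matters.
print(emoji.encode("utf-8").startswith("\U0001F473".encode("utf-8")))  # True
```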

If we put all of this together, here’s the result of my hour of swearing at the screen in Italian 🤌:

import re
from emoji import unicode_codes

allemojis = [x.encode("utf-8") for x in unicode_codes.UNICODE_EMOJI['en']]
allemojis = sorted(allemojis, key = len, reverse = True)
allemojis = b"|".join(allemojis)

searcher = re.compile(b'(%s)' % allemojis.replace(b'*', b'\\*'))  # escape the raw "*" some emoticons contain
spacemoji = {key: searcher.sub(rb' \1 ', texts[key].encode("utf-8")).decode("utf-8") for key in texts}

Which restores some of my sanity:

I’m sure better programmers than me can find better solutions (in less time) and that there are infinite edge cases I’m not considering here, but this was good enough for me. At least I can now go back and do the stuff my employer is actually paying me to do, and I don’t feel I’m in a Francis Ford Coppola masterpiece any more.

17 May 2022 ~ 0 Comments

Node Attribute Distances, Now Available on Multilayer Networks! (Until Supplies Last)

I’ve been a longtime fan of measuring distances between node attributes on networks: I’ve reviewed the methods to do it and even proposed new ones. One of the things bothering me was that no one had so far tried to extend these methods to multilayer networks — networks with more than one type of relationship. Well, it bothers me no more, because I just made the extension myself! It is the basis of my new paper: “Generalized Euclidean Measure to Estimate Distances on Multilayer Networks,” which was published in the TKDD journal this month.

Image from https://atlas.cid.harvard.edu/

You might be wondering: what does it mean to “measure the distance between node attributes on networks”? Why is it useful? Let’s look at a use case. The Product Space is a super handy network connecting products on the global trade network based on their similarity. You can have attributes saying how much of a product a country exported in a given year — in the image above you see what Egypt exported in 2018. This is super interesting, because the ability of a country to spread over all the products in the Product Space is a good predictor of its future growth. The question is: how can we tell how much a country moved in the last ten years? Can we say that country A moved more or less than country B? Yes, we can! Exactly by measuring the distance between the node attributes on the network!

The Product Space is but an example of many. One can estimate distances between node attributes when they tell you something about:

  • When and how much people were affected by a disease in a social network;
  • Which customers purchased how many products in a co-purchase network (à la Amazon);
  • Which country an airport belongs to in a flight network;
  • etc…
Image from https://manliodedomenico.com/

Let’s focus on that last example. In this scenario, each airport has an attribute per country: the attribute is equal to 1 if the airport is located in that country, and 0 otherwise. The network connects airports if there is at least one scheduled flight between them. In this way, you could calculate the network distance between two countries. But wait: it’s not a given that you can fly seamlessly between two countries even if they are connected by flights across airports. You could get from airport A to airport B using flight company X, but it’s not a given that X also provides a flight to airport C, which might be your desired final destination. You might need to switch to airline Y — the image above shows the routes of four different companies: they can be quite different! Switching between airlines might be far from trivial — as every annoyed traveler will confirm — and it is essentially invisible to the measure.

It becomes visible if, instead of using the simple network I just described, you use a multilayer network. In a multilayer network, you can say that each airline is a layer of the network. The layer only contains the flight routes provided by that company. In this scenario, to go from airport A to airport C, you pay the additional cost of switching between layers X and Y. This cost can be embedded in my Generalized Euclidean measure, and I show how in the paper — I’ll spare you the linear algebra lingo.
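For the curious, the single-layer version of the Generalized Euclidean measure boils down to a few lines of linear algebra. Here is a sketch of that base case — not the paper’s full implementation, which works on the supra-Laplacian of all the layers plus the interlayer couplings:

```python
import numpy as np

def ge_distance(p, q, A):
    """Generalized Euclidean distance between two node-attribute
    vectors p and q on a network with adjacency matrix A."""
    L = np.diag(A.sum(axis=1)) - A  # graph Laplacian
    L_pinv = np.linalg.pinv(L)      # Moore-Penrose pseudoinverse
    d = p - q
    return np.sqrt(d @ L_pinv @ d)

# On a 3-node path 0-1-2, d^T L^+ d with one-hot attribute vectors is the
# effective resistance between the two nodes, so the endpoints are farther
# apart than adjacent nodes:
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
e0, e1, e2 = np.eye(3)
print(ge_distance(e0, e1, A))  # ~1.0
print(ge_distance(e0, e2, A))  # ~1.414
```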

Image from yours truly

One thing I’ll say — though — is that there are easy ways to embed such layer-switching costs in other measures, such as the Earth Mover’s Distance. However, these measures all consider edge weights as costs — e.g., how long does it take to fly from A to B. My measure, instead, sees edge weights as capacities — e.g. how many flights the airline has between A and B. This is not splitting hairs, it has practical repercussions: edge weights as costs are ambiguous in linear algebra, because they can be confused with the zeros in the adjacency matrices. The zeros encode absent edges, which are effectively infinite costs. Thus there is an ambiguity* in measures using this approach: as edges get cheaper and cheaper they look more and more like infinitely costly ones. No such ambiguity exists in my approach. The image above shows you how to translate between weights-as-costs and weights-as-capacities, and you can see how you can get in trouble in one direction but not in the other.

In the paper, I show one useful case study for this multilayer node distance measure. For instance, I am able to quantify how important the national flagship airline company is for the connectivity of its country. It’s usually extremely important for small countries like Belgium, Czechia, or Ireland, and less crucial for large ones like France, the UK, or Italy.

The code I developed to estimate node attribute distances on multilayer networks is freely available as a Python library — along with data and code necessary to replicate the results. So you have no more excuses now: go and calculate distances on your super complex super interesting networks!


* This is not completely unsolvable. I show in the paper how one could get around this. But I’d argue it’s still better not to have this problem at all 🙂

19 April 2022 ~ 0 Comments

Complex Networks in Economics Satellite @ NetSci22

For NetSci22, I will join forces once again with the great Morgan Frank to bring you the second edition of the “Complex Networks in Economics and Innovation” satellite (post and official website of the first edition).

Once again, we’re looking for contributed talks, giving you an opportunity to showcase your work. Topics that are more than welcome include:

  • Mapping the relationship of complex economic activities at the global, regional, and local level;
  • Tracking flows of knowhow in all its forms;
  • Creating networks of related tasks and skills to estimate knock-on effects and productivity gains of automation;
  • Investigating the dynamics of research and innovation via analysis of patents, inventions, and science;
  • Uncovering scaling laws and other growth trends able to describe the systemic increase in complexity of activities due to agglomeration;
  • In general, any application of network analysis that can be used to further our understanding of economics.

The submission link is: https://easychair.org/my/conference?conf=cnei22. The full call text is here. And this is the official website. You should send a one-page one-figure abstract before May 13th, 2022.

We have a fantastic lineup of invited speakers you’ll mingle with.

The event will be held online on Zoom. The organization details are still pending confirmation, but we should have secured the date of July 18th, 2022. We should start at 8AM Eastern time (2PM Europe time) and have a 6-hour program. This could still change, so stay tuned for updates!

I hope to see your abstract in my inbox and then you presenting at the satellite!

28 January 2022 ~ 0 Comments

Avoiding Conflicts on Social Media Might Make Things Worse

Look, I get it. Sometimes you really don’t want to get mired in a Facebook discussion with that uncle of yours who thinks that the moon doesn’t exist. It’s easier to block, unfriend, ignore, rather than engage. However, have you considered that avoiding conflicts might make things worse? This is a question Luca Rossi and I asked ourselves as a part of our research on polarization on social media. Part of the answer comes from an Agent Based Model (ABM) we have recently published on Plos One in the paper “How Minimizing Conflicts Could Lead to Polarization on Social Media: an Agent-Based Model Investigation.”

You sit on a throne of lies.

Specifically, we were interested in looking at how news sources react when they get attacked on social media with backlash and flagging. This is a follow-up to our previous paper, where we found — surprisingly — that this backlash and flagging is mostly directed at neutral and factual news sources. The reason these sources are magnets for controversies is that their stories are widely read, and thus attract the ire of all sorts of quacks. Quackery, instead, is only read by quacks agreeing with it, and thus they don’t quack at it so much.

So now the question is: what does this negative attention from quacks do to a neutral news source? To answer the question we updated our ABM. In the original version, each news source and user had a fixed political position — a numerical value between -1 (extreme left) and +1 (extreme right), with 0 as perfect neutrality. In the new version, their position can change. People get attracted by similar points of view and repulsed by opinions that are too far from their position. For instance, a +1 user might get attracted if they read a +0.9 news item (moving to, say, a +0.95 position), but will be repulsed again if they read a -0.5 item next (moving back to, say, a +0.98 position).
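The attraction/repulsion rule can be sketched like this — the tolerance threshold and step size below are made-up illustration values, not the parameters we actually used in the paper:

```python
def update_opinion(user_op, item_op, tolerance=0.5, step=0.05):
    # Attracted if the item is within the user's tolerance,
    # repulsed (pushed the other way) if it is too far.
    if abs(user_op - item_op) <= tolerance:
        user_op += step * (item_op - user_op)  # move toward the item
    else:
        user_op -= step * (item_op - user_op)  # move away from it
    return max(-1.0, min(1.0, user_op))        # opinions stay in [-1, +1]

op = update_opinion(1.0, 0.9)  # attracted: drifts slightly toward +0.9
op = update_opinion(op, -0.5)  # repulsed: pushed back toward the +1 extreme
```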

Our starting assumption is that users and sources are mostly neutral. Here you can see the initial distributions of how many agents (y axis) have a given opinion (x axis). News sources in red and users in blue.

If a user is repulsed, they will also flag the news source. A news source doesn’t want to be flagged. A flag is a bad omen: too many flags and the news source might get banned by the social media platform, or be subject to big scary fact-checking banners. They might even — gasp! — make Mark Zuckerberg leak a tear. Social media are too important for news sources to let this happen. So they will try to avoid conflict. Since they know flags come from people with a different opinion from their own, the only thing they can do is to change their stance. The safest bet is to average the opinions of all their readers. Taking the average of their neighbors would in most cases lead them to settle in the middle of the polarity spectrum, but this is not guaranteed.

One example of the strategy social media have tried to use to combat misinformation online.

Four factors together create the rules of the game: how much users feel the need to share new articles on social media; how tolerant they are of diverse viewpoints; how much news sources will try to resist the pressure to change their spin; and how quickly users change their own opinion following what they read. Having an ABM allows you to run a lot of simulations and see what effect each of these aspects has on the final system.

We find that:

  • The more people share, the more news sources will be pushed away from neutrality and become partisan;
  • The less tolerant users are, the more they will increase polarization;
  • The resistance a news organization puts up against such a pressure is irrelevant to the final state of the system;
  • If users change their opinions easily, they will be attracted to the extreme ends of the polarity spectrum.
The difference between low tolerance (left) and high tolerance (right) in the opinion distributions of the users after the model has run for a while. Notice the extremist peaks in the left distribution.

Some of this is unsurprising — intolerance breeds polarization — while other things might be worth looking at a second time. For instance, we think polarization is a bigger deal nowadays exactly because social media helps sharing in a way newspapers, radio, and television do not. Our results say that this oversharing exacerbates polarization. But a nagging question remains: is our ABM just a theoretical toy, or can it reproduce reality?

We think the latter is true, because we tested it against a real world Twitter network. We have a network topology of who talks with whom, and a polarity score for each user based on the news sources they cite in their tweets. The parameter combination that best reproduces real world data is this: high sharing, high opinion volatility, and low tolerance. Which in our model is the exact recipe for escalating conflict. And all of this just because news sources eschew conflict and don’t want to be flagged. Ouch.

This is what happens to the polarity distribution of the users when we try to fit our ABM model on real data from Twitter. Double ouch.

The same grains of salt that you should take our previous paper with also apply here. The model is based on assumptions, and thus it is only as good as those assumptions. Moreover, reality is more complicated than our ABM. For instance, we assume tolerance is uncorrelated with one’s own political opinion. But what if some political opinions tend to go together with being less tolerant of other points of view? And what if users don’t genuinely flag what they think is outrageous, but make a more strategic use of the flag button to advance their own agenda? These are questions we will explore in further developments of our work.

27 October 2021 ~ 1 Comment

Pearson Correlations for Networks

We all know that correlation doesn’t imply causation:

And yet, we calculate correlations all the time. Because knowing when two things correlate is still pretty darn useful. Even if there is no causation link at all. For instance, it’d be great to know whether reading makes you love reading more. Part of the answer could start by correlating the number of books you read with the number of books you want to read.

The very important questions the Pearson correlation coefficient allows you to ask: will consuming cheese bring upon you the doom of dying by suffocating in your bedsheets? source: https://www.tylervigen.com/spurious-correlations

As a network scientist, you might think that you could calculate correlations of variables attached to the nodes of your network. Unfortunately, you cannot do this, because normal correlation measures assume that nodes do not influence each other — the measures are basically assuming the network doesn’t exist. Well, you couldn’t, until I decided to make a correlation coefficient that works on networks. I describe it in the paper “Pearson Correlations on Complex Networks,” which appeared in the Journal of Complex Networks earlier this week.

The formula you normally use to calculate the correlation between two variables is the Pearson correlation coefficient. What I realized is that this formula is the special case of a more general formula that can be applied to networks.

In Pearson, you compare two vectors, which are just two sequences of numbers. One could be all the numbers of books that the people in our sample have read, and the other all of their ages. In the example, you might expect that older people have had more time to read more books. To check, you go through the two vectors entry by entry: every time you consider a new person, if their age is higher than the average person’s, then the number of books they have read should also be higher.

If you are in a network, each entry of these vectors is the value of a node. In our book-reading case, you might have a social network: for each person you know who their friends are. Now you shouldn’t look at each person in isolation, because the numbers of books and the ages of people also correlate in different parts of the network — this is known as homophily. Some older people might be pressured into reading more books by their book-addicted older friends. Thus, leaving out the network might cause us to miss something: that a person’s age tells us not just about the number of books they have read, but it also allows us to predict the number of books their friends have read.

This is the type of networks you are forced to work with when you use the Pearson correlation. That’s just silly, isn’t it?

To put it simply, the classical Pearson correlation coefficient assumes that there is a very special network behind the data: a network in which each node is isolated and only connects to itself — see the image above. When we slightly modify the math behind its formula, it can take into account how close two nodes are in the network — for instance, by calculating their shortest path length.
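In code, the generalization amounts to swapping the identity matrix for a node-proximity matrix. This is a sketch of that idea, not the exact estimator from the paper — `W` here could be, say, a decaying function of shortest path lengths:

```python
import numpy as np

def network_pearson(x, y, W):
    # W is a node-proximity matrix; with W = identity this collapses
    # to the classical Pearson correlation coefficient.
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ W @ yc) / np.sqrt((xc @ W @ xc) * (yc @ W @ yc))

x = np.array([1., 2., 3., 4.])
y = np.array([2., 1., 4., 3.])
print(network_pearson(x, y, np.eye(4)))  # same value as np.corrcoef(x, y)[0, 1]
```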

You can interpret the results from this network correlation coefficient the same way you do with the Pearson one. The maximum value of +1 means that there is a perfect positive relation: for every extra year of age you read a certain amount of new books. The minimum of -1 means that there is a perfect negative relationship: a weird world where the oldest people have not read much. The midpoint of 0 means that the two variables have no relation at all.

Is the network correlation coefficient useful? Two answers. First: how dare you, asking me if the stuff I do has any practical application. The nerve of some people. Second: Yes! To begin with, in the paper I build a bunch of artificial cases in which I show how the Pearson coefficient would miss correlations that are actually present in a network. But you’re not here for synthetic data: you’re a data science connoisseur, you want the real deal, actual real world data. Above you can see a line chart, showing the vanilla Pearson (in blue) and the network-flavored (in red) correlations for a social network of book readers as they evolve over time.

The data comes from Anobii, a social network for bibliophiles. The plot is a correlation between number of books read and number of books in the wishlist of a user. These two variables are positively correlated: the more you have read, the more you want to read. However, the Pearson correlation coefficient greatly underestimates the true correlation, at 0.25, while the network correlation is beyond 0.6. This is because bookworms like each other and connect with each other, thus the number of books you have read also correlates with the wishlist size of your friends.

This other beauty of a plot, instead, shows the correlation between the age of a user and the number of tags they used to tag books. What is interesting here is that, for Pearson, there practically isn’t a correlation: the coefficient is almost zero and not statistically significant. Instead, when we consider the network, there is a strong and significant negative correlation at around -0.11. Older users are less inclined to tag the books they read — it’s just a fad kids do these days — and they are even less inclined if their older friends do not tag either. If you were to hypothesize a link between age and tag activity and all you had was lousy Pearson, you’d miss this relationship. Luckily, you know good ol’ Michele.

If this makes you want to mess around with network correlations, you can do it because all the code I wrote for the paper is open and free to use. Don’t forget to like and subscrib… I mean, cite my paper if you are fed up with the Pearson correlation coefficient and you find it useful to estimate network correlations properly.

08 June 2021 ~ 0 Comments

Program for Networks in Economics Satellite @ Networks21 Conference

This year, I’ll be organizing a satellite event at the Networks21 conference, the major 2021 event bringing together all network scientists — merging the Sunbelt and the NetSci crowds for the first time! The satellite will be about network applications on research about economic development and innovation. The most excellent Morgan Frank and Lingfei Wu are the other engines behind this project. I briefly introduced it in this earlier post.

I’m glad to report that now we have a final roster of participants. We received several abstracts for the contributed talks and we managed to squeeze eight of them in our program. What follows is the current schedule — minor changes might happen due to author constraints and I will try to keep this post up to date. Note that the satellite will happen Wednesday, June 30, 2021, and all times reported are US East coast time.

  • 8:30AM Invited I: Lü Linyuan
  • 9:10AM Invited II: R. Maria del Rio-Chanona
  • 9:50AM Contributed I: Process-driven network analysis of a mobile money system in Asia. Carolina Mattsson and Frank Takes.
  • 10:10AM Contributed II: Discovering industries in networks of words. Juan Mateos-Garcia, Alex Bishop and George Richardson.
  • 10:30AM Break
  • 10:50AM Invited III: Yong-Yeol “YY” Ahn
  • 11:30AM Contributed III: From code to market: Network of developers and correlated returns of cryptocurrencies. Lorenzo Lucchini, Laura Maria Alessandretti, Bruno Lepri, Angela Gallo and Andrea Baronchelli.
  • 11:50AM Contributed IV: What is a Labor Market? Classifying Workers and Jobs Using Network Theory. James Fogel and Bernardo Modenesi.
  • 12:10PM Invited IV: Hyejin Youn
  • 12:50PM Lunch Break
  • 1:30PM Invited V: Esteban Moro
  • 2:10PM Invited VI: Marta Gonzalez
  • 2:50PM Contributed V: How to Govern Facebook. Seth Benzell and Avinash Collis.
  • 3:10PM Contributed VI: The network limits of infectious disease control via occupation-based targeting. Demetris Avraam, Nick Obradovich, Niccolò Pescetelli, Manuel Cebrian and Alex Rutherford.
  • 3:30PM Break
  • 3:50PM Invited VII: Daniel Rock
  • 4:30PM Contributed VII: Measuring Fraudulent Transactions On Complex Economic Networks Using Optimality Gap. Danilo Bernardineli and Wenjia Hu.
  • 4:50PM Contributed VIII: Local connections drive global structure for technological innovation. Dion O’Neale, Demival Vasques Filho and Shaun Hendy.
  • 5:10PM Invited VIII: Jiang Zhang

If you want to attend, you need to register for the Networks21 conference (deadline on June 20th) and then you will receive a Zoom link to the event.

I don’t know about you, but I’m very excited to see all of this! The official page of the satellite has more information for you.

15 April 2021 ~ 0 Comments

A Bayesian Framework for Online Social Network Sampling

Analyzing an online social network is hard because there’s no “download” button to get the data. In the best case scenario, you have to query an API (Application Program Interface) system. You have to tell the API which user you want to see the connections of, and you can’t just ask for all users at the same time. With the current API rate limits, downloading the Twitter graph would take seven centuries, as the picture below explains.

The gray circle represents the totality of Twitter’s users. The red one is what you’d get with a continuous crawl of one year.

I don’t know about you, but I don’t have seven centuries to publish my papers. I want to become a famous scientist now. If you’re like me, you’re forced to extract only a sample. In this post, I’ll describe a network sampling algorithm that works with social media APIs to maximize the amount of information you get from your sample. The method has been published in the paper “Noise Corrected Sampling of Online Social Networks“, which recently appeared in the journal Transactions on Knowledge Discovery from Data.

My Noise Corrected sampling (NC) is a topological method. This means that you start from one (or more) user IDs that are known, and then you gather new user IDs from their friends, then you move on to the friends’ friends and so on, following the connections in the graph you’re sampling. This is the usual approach, because there is no way to know the user IDs beforehand — unless you have inside knowledge about how the social media platform works.

The underlying common question behind a topological sampler is: how do we pick the best user to explore next? Different samplers answer this question differently. A Random Walk sampler says: “Screw it! Thinking is too hard! I’m gonna get a random node from the friends of the currently explored node! Then, with the time I saved, I’m gonna get some ice cream.” The Metropolis-Hastings Random Walk — the nerd’s answer to the jock’s Random Walk — wants to preserve the degree distribution, so it penalizes high-degree nodes — since they’re more likely to be oversampled. Alternatively, we could use classical graph exploration techniques such as Depth-First and Breadth-First Search — and, again, you can enhance them so that you can preserve some properties of the network, for instance its clustering coefficient.
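For reference, the Metropolis-Hastings correction mentioned above is tiny: a proposed neighbor is accepted with probability min(1, deg(current)/deg(candidate)), otherwise the walk stays put. A sketch of one step, not tied to any particular API:

```python
import random

def mh_random_walk_step(graph, current):
    # graph: dict node -> set of neighbors.
    # Pick a random neighbor and accept it with probability
    # min(1, deg(current) / deg(candidate)). Penalizing high-degree
    # nodes this way makes the walk's stationary distribution uniform
    # over nodes instead of biased toward hubs.
    candidate = random.choice(sorted(graph[current]))
    if random.random() < min(1.0, len(graph[current]) / len(graph[candidate])):
        return candidate
    return current
```

On a star graph, for example, the walk always leaves the hub (acceptance probability 1 toward any leaf) but only rarely leaves a leaf for the hub.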

A random walk on a graph. It looks like a drunk going around. I don’t drink and sample, and neither should you.

My NC departs from all these approaches in a fundamental way. The thing they all have in common is that they don’t use information from the sample they have currently gathered. They only look at the potential next users and make a decision on whom to explore based on their characteristics alone. For instance Metropolis-Hastings Random Walk checks the user’s popularity in number of connections. NC, instead, looks at the entire collected sample and asks a straightforward question: “which of these next users is most likely to give me the most information about the whole structure, given what I explored so far?”

It does so with a Bayesian framework. The gory math is in the paper, but the underlying philosophy is that you can calculate a z-score for each edge of the network. It tells you how surprising it is that this edge is there. In probability theory, “surprise” is a good thing: it is directly related to how much information something gives you. Probability theorists are a bunch of jolly folks, always happy to discover jacks-in-a-box. If you always pick the most surprising edge in the network to explore the next user, you’re maximizing the amount of information you’re going to get back. The picture below unravels what this entails on a simple example.

In this example, the green nodes are explored, the blue nodes are known — because they’re friends of an explored node — but not explored yet. The red nodes are unknown.

The first figure (a) is the start of the process, where we have a network and we pick a random starting point, node 7. I use edge thickness to show each edge’s z-score: we always follow the widest edge. Initially, we pick a random edge, because all edges have the same z-score in a star. However, once I pick node 3, all z-scores change as we discover new edges (b). In this case, first we prioritize the star around node 7 (c), then we start following the path from node 3 until node 9 (d-e), then we explore the star around node 9 until there are no more nodes or edges to discover in this toy network (f). Note how edges exhaust their surprise as you learn more about the neighborhoods of the nodes they connect.

Our toy example is an unweighted network, where all edges have the same weight, but this method thrives if you have information about how strong a connection is. The edge weight is additional information that can be taken into account to measure its surprise value. Under the hood, NC estimates the expected weight of an edge given the average edge weights of the two nodes it connects. In an unweighted network this is equivalent to the degree, because all edges have the same weight. But in a weighted network you actually have meaningful differences between nodes that tend to have strong/weak connections.
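Putting it together, the exploration loop looks something like the sketch below. The `surprise` function is a deliberately dumb stand-in for the paper’s Bayesian z-score — the point here is the greedy frontier structure, not the scoring:

```python
def nc_style_sample(graph, start, budget, surprise):
    # graph: dict node -> set of neighbors, standing in for the API calls.
    # surprise(u, v, sampled): score of edge (u, v) given the current sample.
    explored = {start}
    sampled = {(start, v) for v in graph[start]}  # edges seen so far
    while len(explored) < budget:
        # Frontier: sampled edges pointing at nodes we haven't queried yet.
        frontier = [(u, v) for (u, v) in sampled if v not in explored]
        if not frontier:
            break
        # Always follow the most surprising edge, recomputed on the sample.
        u, v = max(frontier, key=lambda e: surprise(e[0], e[1], sampled))
        explored.add(v)                           # "query the API" for v
        sampled |= {(v, w) for w in graph[v]}
    return explored, sampled

toy = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0, 4}, 4: {3}}
nodes, edges = nc_style_sample(toy, 0, budget=3, surprise=lambda u, v, s: v)
```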

So, the million dollar question: how does NC fare when tested against the alternatives I presented before? The first worry you might have is that it takes too much time to calculate the z-scores. After all, to do so you need to look at the entire sampled network so far, while the other methods only look at a local neighborhood. NC is indeed slower than them—it has to be. But it doesn’t matter. The figure above shows how much time it takes to calculate the z-scores for more and more edges. Those runtimes (less than a tenth of a second) are puny when compared to the multi-second wait times due to the APIs’ rate limits — or even to simple network latency. After all, NC is a simplification of the Noise-Corrected Backbone algorithm I developed with Frank Neffke, and we showed in the original paper that it was pretty darn fast.

What about the quality of the samples? This is a question with a million answers, because it depends on what you want to do with the sample, on the characteristics of the original graph and of the API system you’re wrestling with, and on how much time you have to gather your sample. For instance, in the figure below we see that NC is the best method when you want to estimate the assortativity value of a disassortative attribute (think gender in a dating network), but it is a boring middle-of-the-pack method when you want to reconstruct the original communities of a network.

Left: the Mean Absolute Error between the sample’s and the original graph’s assortativity values (lower is better). Right: the Normalized Mutual Information between the sample’s and the original graph’s communities (higher is better). NC in red.
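Both metrics are standard fare and easy to compute yourself. A minimal sketch, assuming networkx for assortativity and scikit-learn for NMI (the two library functions are real; the wrappers are mine):

```python
import networkx as nx
from sklearn.metrics import normalized_mutual_info_score

def assortativity_error(G_full, G_sample, attr):
    """Absolute error between the attribute assortativity of the
    original graph and of the sample (lower is better)."""
    full = nx.attribute_assortativity_coefficient(G_full, attr)
    samp = nx.attribute_assortativity_coefficient(G_sample, attr)
    return abs(full - samp)

def community_nmi(original_labels, sample_labels):
    """NMI between the original community labels and the ones
    recovered from the sample, for the sampled nodes (higher is
    better)."""
    return normalized_mutual_info_score(original_labels, sample_labels)
```

Averaging `assortativity_error` over many sampling runs gives the Mean Absolute Error reported in the figure.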

NC ranks between the 3rd and 4th position on average across a wide range of applications, network topologies, API systems, and budget levels. This is good: since I test eight alternatives, it means NC is practically always better than average. In fact, among all eight methods tested, no other has a better average rank. The second best method is Depth-First Search, which ranks 4th three times out of four, against NC’s one out of two.

I have shared some code to replicate my results. If you speak Python better than you speak academiquese, you could use that code to figure out how to implement an NC sampling strategy next time you need to download Twitter.

15 March 2021 ~ 0 Comments

Networks in Economics Satellite @ Networks21 Conference

This year’s NetSci conference will be special. For the first time it will be held jointly with the other major network event of the year: Sunbelt, the main network conference for the social sciences. I could not miss an opportunity like this, so I decided to organize a satellite event with the excellent Morgan Frank and Lingfei Wu. The topic of the satellite will be network applications in research on economic development and innovation.

We’re looking for contributors to send an abstract about their work in the area. If you’re unsure about what area that is, think about my research on the Product Space, or on the impact of business travel on economic growth, or economic convergence in Colombia, etc. Specifically, if you are interested in issues like:

  • Mapping the relationship of complex economic activities at the global, regional, and local level;
  • Tracking flows of knowhow in all its forms;
  • Estimating the relatedness of tasks and skills to gauge the knock-on effects and productivity gains of automation;
  • Investigating the dynamics of research and innovation via analysis of patents, inventions, and science;
  • Uncovering scaling laws and other growth trends able to describe the systemic increase in complexity of activities due to agglomeration;

and you study them using networks and the tools of the science of complex systems, then you really should send us your abstract. The submission link is: https://easychair.org/my/conference?conf=cnei21. You should send a one-page one-figure abstract before May 5th, 2021.

We have a fantastic lineup of invited speakers you’ll mingle with:

The event will be held online on Zoom. The exact date is still to be determined, but it will be between June 21st and July 3rd. So stay tuned for updates! You should bookmark the official website of the satellite, to get fresh news about it: https://mrfrank8176.github.io/Complex-Networks-in-Economics-and-Innovation/

I hope to see your abstract in my inbox and then you presenting at the satellite!

11 February 2021 ~ 2 Comments

PhD Call: Combat Disinformation in Complex Social Systems

The IT University of Copenhagen has put out a call searching for people interested in starting a PhD in computer science in Fall 2021. One of the projects you could work on is my project on disinformation and social networks. You should apply if you think you: might be interested in pursuing a PhD related to misinformation and social media; like the idea of living in the happiest country in the world; and are ok with having a clueless supervisor like me. Link to apply.

The easiest way to figure out what sort of things you’d be doing is by reading my post on the topic of the PhD project. To sum it up briefly: in the first paper of this project, Luca Rossi and I made an agent-based model, simulating the way social media handle misinformation via flagging. We show that the system is counterproductive: its mechanics end up penalizing mostly popular and neutral news sources. Luca will be your co-supervisor, so you’d be working with a real social scientist who has an actual clue about what’s going on.

This project is born out of the idea of turning that paper into an actual research program. There are a few paths that are all worth exploring and require a person to give them their full-time attention. A few ideas:

  • Expand the agent-based model to make it more realistic and to enable it to answer more interesting questions (what if users and sources change their polarity? What if one side is less tolerant than the other? What if tolerance changes over time? What if…?);
  • Find real-world data and use it to validate, motivate, or find ideas on how to expand the model and its results;
  • Design intervention strategies for social media platforms that are less counterproductive than the ones currently deployed.
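To give a flavor of what such an agent-based model looks like, here is a toy of my own (emphatically not the model from the paper): users and sources have a polarity, users follow sources close enough to their own views, and flag the posts they receive that are too far from them. A neutral source reaches both camps, so it collects flags from both sides.

```python
import random

def toy_flagging(n_users=10000, follow_radius=0.8, tolerance=0.3, seed=42):
    """Toy flagging model: users follow sources whose polarity is
    within `follow_radius` of their own, and flag received posts
    whose polarity differs by more than `tolerance`. Returns the
    flag count per source."""
    rng = random.Random(seed)
    users = [rng.uniform(-1, 1) for _ in range(n_users)]
    sources = {"partisan": -0.9, "neutral": 0.0}
    flags = {}
    for name, pol in sources.items():
        followers = [u for u in users if abs(u - pol) <= follow_radius]
        flags[name] = sum(1 for u in followers if abs(u - pol) > tolerance)
    return flags
```

Even this crude sketch reproduces one intuition from the paper: the neutral source, despite reaching a broader audience, racks up more flags than the partisan one. All the hypothesized parameter names and values here are illustrative only.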

The position is fully funded by ITU, so it’s a pretty safe path for 3 to 4 years. The group where I work, NERDS, is full of amazing professionals doing extremely cool research, publishing in Nature, collaborating with UNICEF, and the like. We’re fun, really.

Link to apply, where you should specify that you want to work on the “Modelling Complex Social Systems to Handle Disinformation” project. I hope you’ll give it a thought!

09 February 2021 ~ 0 Comments

The Atlas for the Aspiring Network Scientist v1.1

Last month I put out in public my Atlas for the Aspiring Network Scientist. The reaction to it was very pleasant and people contacted me with a number of corrections, opinions, and comments. I just uploaded version 1.1 to arXiv, doing my best to address whatever could be addressed quickly. The PDF on the official website is also updated and, in fact, that link will always direct you to the most up-to-date version.

Corrections involve mostly some notation, a few references, and the like. One important thing I want to point out is my rephrasing/retraction of some humorous parts. I still stand by my decision to use humor, but not when it comes at the expense of the feeling of inclusiveness in the community. Science is a social process and everyone should feel welcome to it. Using language that opposes that aim is a net loss for society. One example is in the chapter about tools, where some rough humor didn’t paint the correct picture: these open source tools are fantastic gifts to the community and should be unequivocally celebrated. All remaining jokes are self-deprecation about my inability to measure up to the awesome fellows behind these libraries/software.

One thing that was flagged to me but I couldn’t touch was references. There are just too many for me to check them all. I’m asking your help: if you find some issue with references (missing information, or things like editors as authors, etc), please write me flagging the specific reference with the issue: mcos@itu.dk.

I’m also glad to announce that you can buy a physical copy of the book, in case you need it handy for whatever reason. This is only v1, though, so none of the corrections mentioned above are included. When v2 comes out, I’ll also make it available for physical purchase. The book was printed via IngramSpark, so there’s a good chance you can find it for sale & shipping almost everywhere. For instance, it is available on Amazon or, if you’re in Denmark (where I live), on Saxo. You could even buy it on friggin’ Walmart.

The final object’s quality is… eh. Some of it is by design: I wanted this to be as accessible as possible. You’ll hardly find another 650+ page full-color book in US Letter format for less than $40. Compromises needed to be made. However, most of the things making it a clearly amateurish product are my fault. Take a look at the left margin on the back cover:

Eww… Also, since I had to upload the cover separately, I forgot to include a blank page. So the left pages are on the right and vice versa, which makes page numbers practically invisible in the middle:

That said, if you ignore everything that makes this book ugly, it’s actually pretty nice:

Also, apologies to your backs, but this thing is hyuuuge. It’s as tall as a half-kilo pack of bucatini and twice as thick (packs of bucatini are a standard unit of measure in Italy):

Finally, I would love to give a shout out to everyone I interacted with after the book came out. Everyone was super nice and/or super helpful, most were both. I discovered many things I wasn’t aware of. One of them is NETfrix, a network science podcast by super cool fellow Asaf Shapira. The podcast has transcripts in English available here.

That’s it for now! Hopefully new research posts will follow soon.