Democratic Community Discovery
When thinking about our social life, we instinctively recall the reason why we know the people we know. There is my sister, there’s the friend with whom I shared all my experiences – from stealing snacks at the primary school to the bachelor thesis we wrote together – and finally there’s that obscure guy who I don’t really know but he added me on Facebook and I don’t want to reduce my friend counter. A long time ago a very nice app called Nexus allowed Facebook users to visualize this concept. Nexus then died, it was replaced by Social Graph, which died too. Now you can use Touch Graph, but it is not nearly as good and, besides this, it’s a different story from the one I want to tell.
Nexus’ visualizations looked more or less like this:
The grey dots are my friends on Facebook and they are connected if they are friends on Facebook too. It’s clear that people you know for the same reason are densely connected to each other, because they are very likely to know each other too. The structure is pretty spectacular, I have to admit. It also seems to make a lot of sense and these different dense areas (“communities”) do not look too hard to be extracted automatically by an algorithm.
It’s on the shoulders of this gigantic positivist naïvety that countless authors decided to write such algorithm. Soon enough, a new branch of complex network analysis was born, called Community Discovery. The aim of community discovery is to group nodes that are densely connected to each other. The groups are then called “communities” and nodes are said to be “members” of a community. Community discovery has a long, long history made by initial successes, coups de théâtre, drama and romance, but it is a complicated story and not in the scope of this post. Basically, many authors realized that they were dealing with a larger mess than previously thought. The main reason is that most, if not all, real world networks do not look like the above example. At. All. They look like this:
An ugly and meaningless hairball. Most people scratched their heads and decided to stick with their definition of community. This is the wrong way because it implies that we just give up in analyzing the second case and we consider the two examples as completely different phenomena. Guess what: they are not. The second network, too, is a sample of the Facebook friendship network and it contains the first. The sole difference between the two is the scale: the first only considers the immediate neighborhood of a node (me), the second just samples 15,000 users and the connections between them.
It’s on the basis of these considerations that I wrote a paper on community discovery, together with my co-authors at the KDDLab in Italy. We developed an algorithm that exploits the order at the local level around a node to find a sense of the connective mess at the global scale. It is called DEMON: Democratic Estimate of the Modular Organization of a Network.
The details of the implementation are included in the paper accepted at the SIGKDD 2012 conference on Data Mining. It was recently presented there by my co-author Giulio Rossetti who also happens to have written its implementation, freely available (and Open Source!). In this post, I’ll just give you a quick idea about how DEMON works.
Let’s consider a simple, messy, network:
It looks more like the messy hairball, doesn’t it? Let’s now select a node:
And then all the other nodes that are directly connected to it:
Now, let us ignore all the other nodes of the network, included the first selected. We just create a network only using the green nodes and the connections between them. What does this network look like? Like this:
Surprise! We’ve fallen back into the first, neatly divided, example. Now what we need to do is to just apply a very simple, old-school, community discovery algorithm to this sample and we have an idea about the communities surrounding the yellow node. We apply this operation to all the nodes of the network and then we merge together the communities that share an extensive portion of their members. I won’t bother you with the details and the proofs about how awesome this method is and how it outperforms the current state-of-the-art community discovery algorithms because everything is in the paper and I don’t like to brag (twice).