Memetics
This page serves as an entry point for multiple publications. These are (click on the link to scroll to the corresponding content):
- “Competition and Success in the Meme Pool: a Case Study on Quickmeme.com“;
- “Average is Boring: How Similarity Kills a Meme’s Success“;
- “Popularity Spikes Hurt the Future Implementations of a Protomeme“;
- “Posts on Central Websites Need Less Originality to be Noticed“.
This is the companion data of the paper “Competition and Success in the Meme Pool: a Case Study on Quickmeme.com“.
You can download the data by clicking here. The data has been collected from Quickmeme.com during October 15th-16th, 2012.
The ZIP archive contains a text file with 5 tab-separated columns (a loading sketch in Python follows the list). The columns, in order, are:
- Meme ID: the ID of the meme assigned by Quickmeme.com. You can download the corresponding meme images by connecting to the URL http://i.qkme.me/<meme_id>.jpg. For example, meme “353npm” corresponds to http://i.qkme.me/353npm.jpg.
- Data ID: this is an internal Quickmeme ID. It is useful because it can be cast to a number, and a larger data ID always means that the meme was created later in time.
- Total ratings of the meme: the total ratings that the meme had at the moment of the crawl (October 2012).
- Meme template name: the name of the meme template. Before using it, be sure to make it case insensitive; as provided it is not (so it distinguishes between “Socially Awkward Penguin” and “socially awkward penguin”, in case you are interested in meme misspellings).
- Week of creation of the meme: the week in which the meme was created, in format “yyyy-[m]m-w”, where w = {1, 2, 3, 4}.
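For quick exploration, here is a minimal Python sketch of how the file could be loaded with pandas. The file name “quickmeme.tsv” and the column names are placeholders of mine (the actual file name inside the archive may differ), and I assume the file has no header row.

# Minimal sketch (not part of the archive): load the Quickmeme dump with pandas.
import pandas as pd

cols = ["meme_id", "data_id", "total_ratings", "template_name", "week"]
df = pd.read_csv("quickmeme.tsv", sep="\t", names=cols, header=None)

# Data ID becomes sortable in time once cast to a number.
df["data_id"] = df["data_id"].astype(int)

# Make the template name case insensitive before grouping by template.
df["template_name"] = df["template_name"].str.lower()

# Rebuild the image URL for each meme.
df["image_url"] = "http://i.qkme.me/" + df["meme_id"].astype(str) + ".jpg"

print(df.sort_values("data_id").head())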
This is the companion data of the paper “Average is Boring: How Similarity Kills a Meme’s Success“. Please cite the paper if you find the data useful!
You can download the data by clicking here. The data has been collected from Memegenerator.net between June 27th and July 6th, 2013.
The ZIP archive contains two text files.
The first file, “instanceid_generatoridfiltered_votes_text_timestep”, contains the actual data in 5 tab-separated columns. The columns contain:
- ID of the meme implementation: this is a progressive ID. You can use it to retrieve the corresponding meme implementation using the URL http://memegenerator.net/instance/<implementation_id>. So meme “10057023” can be retrieved at the URL http://memegenerator.net/instance/10057023;
- ID of the meme that has been used for the implementation;
- Number of upvotes that the meme implementation had received by the crawling time;
- Text of the implementation, i.e. the text that the user superimposed on the meme;
- The bimonthly timestep that has been derived from the meme implementation’s ID.
The second file, “generators”, contains additional metadata on the memes in 4 tab-separated columns (a join sketch in Python follows the list):
- ID of the meme: this column is the primary key of the file, and column #2 of the previous file references it like a relational database foreign key;
- URL slug of the meme;
- Meme name;
- URL of the meme template image.
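As an example of how the two files fit together, here is a minimal Python sketch that joins them with pandas. The file names are the ones given above; the column names are labels of mine, and I assume neither file has a header row.

# Minimal sketch (not part of the archive): join implementations with meme metadata.
import csv
import pandas as pd

impl = pd.read_csv(
    "instanceid_generatoridfiltered_votes_text_timestep",
    sep="\t", header=None, quoting=csv.QUOTE_NONE,  # the free text may contain quote characters
    names=["implementation_id", "meme_id", "upvotes", "text", "timestep"],
)
memes = pd.read_csv(
    "generators",
    sep="\t", header=None, quoting=csv.QUOTE_NONE,
    names=["meme_id", "url_slug", "meme_name", "template_image_url"],
)

# meme_id in the first file is a foreign key pointing to the primary key of the second.
merged = impl.merge(memes, on="meme_id", how="left")

# Example: average upvotes per meme.
print(merged.groupby("meme_name")["upvotes"].mean().sort_values(ascending=False).head())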
This is the companion data of the paper “Popularity Spikes Hurt the Future Implementations of a Protomeme“. Crawled originally by Tim Weninger et al. [1].
You can download the data by clicking here (warning: download size >350MB). The data has been collected from Reddit.com in [1], and from Hacker News. In my postprocessing, I kept only posts submitted to Reddit from April 5th, 2012 to April 26th, 2013, and to Hacker News from January 7th, 2010 to May 29th, 2014.
The archive contains two folders, both containing two files. I’ll describe the content of the Reddit folder, as it is logically equivalent to the Hacker News folder. The first file, “id_roots_date_votes_1333584000”, contains the main dataset in 4 tab-separated columns:
- Post ID. This is a unique identifier of the post, assigned by the Reddit system. Using this ID, you can obtain the original post by visiting the URL http://www.reddit.com/<ID>. So, if you want to visit post “xbfwb”, you just go to http://www.reddit.com/xbfwb.
- List of meanings in the title. This is a comma-separated list of meaning IDs. To know which meaning corresponds to which ID, look at the map provided in the archive and explained below.
- Timestamp. When the post was submitted to Reddit, in the standard UNIX format.
- Score. This is the score of the post at the moment of the data collection.
The second file, “root_map”, contains the map from a meaning ID to its actual word root. The word root is in the first column and the ID in the second, separated by a tab. The word roots are stemmed and all stopwords have already been removed.
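Here is a minimal Python sketch of how the map can be used to translate a post’s meaning IDs back into word roots. The file names are the ones given above; the variable names are mine, and I assume neither file has a header row.

# Minimal sketch (not part of the archive): resolve meaning IDs to word roots.
import csv

# root_map: word root in the first column, meaning ID in the second.
id_to_root = {}
with open("root_map", encoding="utf-8") as f:
    for root, meaning_id in csv.reader(f, delimiter="\t"):
        id_to_root[meaning_id] = root

# Main dataset: post ID, comma-separated meaning IDs, timestamp, score.
with open("id_roots_date_votes_1333584000", encoding="utf-8") as f:
    for post_id, meanings, timestamp, score in csv.reader(f, delimiter="\t"):
        roots = [id_to_root.get(m, "?") for m in meanings.split(",")]
        print(post_id, roots, timestamp, score)
        break  # remove this line to process the whole file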
[1] Tim Weninger, Xihao Avi Zhu, and Jiawei Han. An exploration of discussion threads in social news sites: a case study of the reddit community. In ASONAM, pages 579–583, 2013.
This is the code and (some of the) data needed to replicate the results in the paper “Posts on Central Websites Need Less Originality to be Noticed”. You can download the data by clicking here.
You do not need the raw data to replicate most of the results. The folders “03_originality” and “05_twitter” contain a series of “regr_table*.csv” files which support most of the core results of the paper by themselves — specifically all the regression tables and the figures derived from them. Simply run the R scripts in those folders to get the tables.
If you want to reproduce the entire pipeline, you will have to populate the “00_data” folder with all the raw inputs (a download sketch in Python follows the list). These are:
- Reddit data: https://files.pushshift.io/reddit/submissions/RS_*.zst (with * being 19-10, 19-11, 19-12, and 20-01)
- Common Crawl nodes: https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2019-20-nov-dec-jan/domain/cc-main-2019-20-nov-dec-jan-domain-vertices.txt.gz
- Common Crawl edges: https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2019-20-nov-dec-jan/domain/cc-main-2019-20-nov-dec-jan-domain-edges.txt.gz
- Twitter data: https://files.pushshift.io/twitter/US_PoliticalTweets.tar.gz (you’ll need to extract the archive after downloading it)
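A minimal Python sketch to fetch these into “00_data” could look like the following. The Common Crawl and Twitter URLs are copied from the list above; the exact Reddit file names are an assumption of mine, based on the “RS_2019-12.zst” name used in the zstd command further down, and the whole thing only works as long as the Pushshift mirrors are still online.

# Minimal sketch (not part of the repository): download the raw inputs into 00_data.
import urllib.request
from pathlib import Path

out = Path("00_data")
out.mkdir(exist_ok=True)

urls = [
    # Reddit submissions (file names assumed from the RS_2019-12.zst pattern used below)
    "https://files.pushshift.io/reddit/submissions/RS_2019-10.zst",
    "https://files.pushshift.io/reddit/submissions/RS_2019-11.zst",
    "https://files.pushshift.io/reddit/submissions/RS_2019-12.zst",
    "https://files.pushshift.io/reddit/submissions/RS_2020-01.zst",
    # Common Crawl hyperlink graph
    "https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2019-20-nov-dec-jan/domain/cc-main-2019-20-nov-dec-jan-domain-vertices.txt.gz",
    "https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2019-20-nov-dec-jan/domain/cc-main-2019-20-nov-dec-jan-domain-edges.txt.gz",
    # Twitter data (still needs to be extracted after the download)
    "https://files.pushshift.io/twitter/US_PoliticalTweets.tar.gz",
]

for url in urls:
    target = out / url.rsplit("/", 1)[-1]
    if not target.exists():  # skip files that are already there
        print("downloading", url)
        urllib.request.urlretrieve(url, target)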
Then you need to run the scripts in their numbered order, starting from script 01 in folder 01 and ending with script 03 in folder 06. Things should run properly, unless I messed something up when renaming paths to create this archive from the original scripts; in that case the fixes should be relatively obvious, since the scripts will fail with a “file not found” exception that points you to the path to fix. Another potential issue is a missing column here and there, because I may have dropped columns from the data once I no longer needed them.
You will have to pass the month and the year as arguments to the scripts, so they know which dataset to use for the analysis. The standard choice for the paper is “12 19”. Exceptions include:
- 01_clean_reddit_data/04_common_crawl_filter.py, 02_clean_web_data/*: these scripts also need to know which Common Crawl file to use. Standard choice for the paper is “2019-20-nov-dec-jan“.
- 03_originality/01_originality.py: this script also needs the month and year of the training dataset used to learn the bigram probabilities. Standard choice for the paper is “11 19”.
Note that the 01_metadata_cleaning.py script in the 01_clean_reddit_data folder expects to receive the Reddit data as a stream from standard input. To make it work, you should uncompress the Reddit data and pipe it to this script via standard input. If you have access to a standard Unix terminal (like the ones in Ubuntu or macOS), you can simply type:
zstd -cdq ../00_data/RS_2019-12.zst | python 01_metadata_cleaning.py 12 19
at the prompt, from inside the 01_clean_reddit_data folder. This assumes you have downloaded the Reddit data and that you have the zstd program (v1.4.8) installed on your system.
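For reference, this is not the actual 01_metadata_cleaning.py, only a sketch of the stdin-streaming pattern it relies on, assuming (as is the case for the Pushshift submission dumps) that the decompressed stream is newline-delimited JSON, one submission per line.

# Sketch of the stdin-streaming pattern (not the actual 01_metadata_cleaning.py).
# Feed it with: zstd -cdq ../00_data/RS_2019-12.zst | python this_sketch.py 12 19
import json
import sys

month, year = sys.argv[1], sys.argv[2]  # e.g. "12" and "19"

for line in sys.stdin:
    try:
        submission = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip malformed lines
    # The real script cleans and keeps the metadata it needs at this point; as a
    # placeholder we just echo two fields that Reddit submissions are known to have.
    print(submission.get("id"), submission.get("subreddit"))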
All scripts in the 01 folder also take, as the last parameter, an instruction on whether or not to filter out the domains/subreddits with fewer than 5 submissions in the data. If you want to filter them, type “filter” (without quotes); if you don’t want to filter them, write something else (like “nofilter”).
The script “00_run_framework.sh” will do everything for you to reproduce the main results of the paper if you call it this way:
../00_run_framework.sh 12 19 2019-20-nov-dec-jan 11 19 filter
The exceptions are Tables S13 and S14, for which you’ll have to re-run everything with the “nofilter” option at the end. Also, for Tables S11 and S12 you should run this same script once per month, in order, for all months from 10/19 to 01/20.
The scripts were written and used on Ubuntu 21.10, using Python 3.9.7, R 4.0.4, and Gnuplot 5.4 patchlevel 1 (the latter only for making figures), on a Xeon(R) W-11955M CPU with 32GB of memory. Everything ran in a few hours without the need for additional memory. The script that takes the longest is script 01 in folder 06, because it wants to generate 10k null models. If you are content with fewer null models (say 100), you can simply edit the for loop in that script and it should only take an hour or so on an equally powerful CPU.
The scripts depend on the following libraries that are not included by default:
- Python: pandas 1.4.1, nltk 3.7, numpy 1.21.5, sklearn 1.0.2, networkx 2.7.1, scipy 1.8.0, pyemd 0.5.1 (the latter is not strictly necessary: it is only imported by the network_distance.py script, and the function that uses it is never called, so you can comment the import line out if you do not want to install pyemd).
- R: igraph 1.2.11, lme4 1.1-29, stargazer 5.2.3, survival 3.2-10