Health Crawler
implemented by me
This is the data and code companion for the paper “Mapping the International Health Aid Community Using Web Data” by me, Katsumasa Hamaguchi, Maria Elena Pinglo, and Antonio Giuffrida, published in EPJ Data Science.
The release has two folders: “data” and “code”.
The “data” folder contains six files. The first three contain our cleaned inputs, the entity lists; a loading sketch follows this list.
– countries.csv. A three-column file connecting each country ID (first column) to the country name (second column). If the country name has aliases, these are listed in the third column.
– keywords.csv. The file connects each keyword ID (first column) to the keyword name (second column). Columns three to six indicate whether the keyword is part of one of four macro categories.
– organizations.csv. The file connects each organization ID (first column) to the organization’s name (second column), acronym (third column), website URL (fourth column), and possible aliases (fifth column).
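As a convenience, the entity lists can be loaded into simple ID-to-name dictionaries. The sketch below is not part of the original release; it assumes the layout described above (ID in the first column, name in the second) and skips a header row if one is present. Adjust if your copies differ.

    # Sketch (not part of the release): load the entity lists into ID -> name maps.
    # Assumes ID in column 1 and name in column 2, as described above.
    # The isdigit() check skips a header row, if any.
    import csv

    def load_names(path):
        names = {}
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.reader(f):
                if row and row[0].strip().isdigit():
                    names[int(row[0])] = row[1].strip()
        return names

    countries = load_names("countries.csv")
    keywords = load_names("keywords.csv")
    organizations = load_names("organizations.csv")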
The other three files contain the data, sliced differently; a parsing sketch follows the note after this list.
– data_clean.csv. The file has five columns:
– “organization”: the ID of the organization from which we obtain the observation.
– “entity 1 type” & “entity 2 type”: these columns indicate how to interpret the following two columns. Since all IDs are numeric, an ID of 1 could refer to organization 1, issue (keyword) 1, or country 1. If “entity 1 type” is “countries”, a value of “3” in the “entity 1” column refers to country 3.
– “entity 1” & “entity 2”: these columns report the IDs of the connected entities. Check the entity type columns for how to interpret these values.
– “score”: the WS score of the entity pair, meaning the log number of pages in which the entities co-occur.
– data_clean_bycountry.csv. The file has the same structure as the previous one, with the only difference being the first column. Instead of reporting the organization ID (the website from which the data comes), the first column here contains the country ID: the country to which the data refers. The last column, “score”, is therefore the aggregation of the WS scores across all websites.
– data_clean_citations.csv. The file has four columns:
– “source”: the ID of the organization making the citation.
– “target type”: this column indicates how to interpret the “target” column, in the same way “entity 1 type” & “entity 2 type” do in the data_clean.csv file.
– “target”: the ID of the entity being cited by the organization.
– “score”: the WS score of the citation, meaning the log number of pages in which the target is mentioned.
In all data files, the type “organization_links” refers to an “organization”. The difference is that an “organization” is a mention by name (or acronym), while an “organization_links” is a mention by hyperlink.
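The sketch below illustrates how the entity type columns can be used to resolve the IDs in data_clean.csv, reusing the load_names() helper from the earlier sketch. It is not part of the release: the column names follow the description above and assume a header row, and the label used for keyword entities is an assumption to verify against your copy of the file.

    # Sketch (not part of the release): resolve entity references in data_clean.csv.
    # countries, keywords, and organizations are the ID -> name maps built in the
    # previous sketch. The "keywords" type label is an assumption; check the file.
    import csv

    lookups = {
        "countries": countries,
        "organization": organizations,        # mention by name or acronym
        "organization_links": organizations,  # mention by hyperlink
        "keywords": keywords,                 # assumed label for keyword entities
    }

    with open("data_clean.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            e1 = lookups.get(row["entity 1 type"], {}).get(int(row["entity 1"]))
            e2 = lookups.get(row["entity 2 type"], {}).get(int(row["entity 2"]))
            print(row["organization"], e1, e2, row["score"])

The same pattern applies to data_clean_citations.csv, using the “target type” and “target” columns instead of the entity pair.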
The “code” folder contains four files and a subfolder. The csv files are the same input files contained in the data folder: countries.csv, keywords.csv, organizations.csv. The Python script wbcrawl_manager.py is used to launch the crawler. The script assumes that you have created a file called “batch”. This file should contain a list of organization IDs, one per line. The script will spawn a different process per ID, crawling the corresponding website. If launched on a machine with sufficient processors and memory, “batch” can contain all the organization IDs, thus launching the entire crawl in parallel.
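For reference, a minimal way to generate a “batch” file covering every organization is sketched below. It is not part of the release and assumes the organizations.csv layout described above (ID in the first column).

    # Sketch (not part of the release): write a "batch" file listing every
    # organization ID, one per line, as wbcrawl_manager.py expects.
    import csv

    with open("organizations.csv", newline="", encoding="utf-8") as f_in, \
         open("batch", "w", encoding="utf-8") as f_out:
        for row in csv.reader(f_in):
            if row and row[0].strip().isdigit():   # keep numeric IDs, skip a header row
                f_out.write(row[0].strip() + "\n")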
!!!IMPORTANT NOTE!!! The crawler REQUIRES the PhantomJS binary, which is NOT included in this package. You can download it from http://phantomjs.org/ and place it in the code folder. This might create compatibility issues if run on a system different from the development one. The crawler should run with no issues on Ubuntu 16.04.
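To verify the PhantomJS setup before launching a full crawl, a quick Selenium smoke test can help. This is only a sketch, not part of the release: it assumes the binary is named “phantomjs” and sits in the code folder, and it relies on webdriver.PhantomJS, which only the older Selenium releases contemporary with this project still provide.

    # Sketch (not part of the release): check that Selenium can drive the
    # PhantomJS binary placed in the code folder. Adjust executable_path if
    # the binary lives elsewhere or has a different name.
    from selenium import webdriver

    driver = webdriver.PhantomJS(executable_path="./phantomjs")
    driver.get("http://example.org/")
    print(driver.title)   # prints the page title if PhantomJS works
    driver.quit()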
The crawler assumes that you have properly installed the Scrapy library (https://scrapy.org/). In fact, the crawler is nothing more than a standard Scrapy project. The Scrapy version used to crawl the webpages is 1.3.2; I cannot guarantee that other Scrapy versions will work without issues. The other required nonstandard Python libraries (see the version-check sketch after this list) are:
– Selenium
– BeautifulSoup
– Six
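A quick way to confirm that the dependencies import correctly is sketched below (not part of the release); the reported Scrapy version can be compared against 1.3.2.

    # Sketch (not part of the release): confirm the required libraries import
    # and print their versions for comparison with the ones used for the paper.
    import scrapy, selenium, bs4, six

    print("Scrapy:", scrapy.__version__)       # 1.3.2 was used for the crawl
    print("Selenium:", selenium.__version__)
    print("BeautifulSoup:", bs4.__version__)
    print("six:", six.__version__)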
Most of the code is standard Scrapy code. The custom parts of the crawler are mostly in wb_crawl/spiders/health_spider.py, which contains the logic of the crawler.
If you are going to use the crawler for your own personal project, PLEASE edit the USER_AGENT (line 19 in wb_crawl/settings.py) and include your contact information.
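For illustration, a contact-bearing user agent might look like the line below. The project name, URL, and e-mail address are placeholders to replace with your own details, and the exact line number may differ in your copy of settings.py.

    # wb_crawl/settings.py (illustrative placeholder values)
    USER_AGENT = "YourProjectCrawler/1.0 (+https://example.org/your-project; contact: you@example.org)"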