Full Dataset Description

Parse.ly is a web analytics company that tells enterprise media companies how much attention their content is receiving. At a basic level, we collect two types of data, both of which are included in this data set.

Pageviews – We collect data on every pageview on our customers’ sites, allowing publishers to know how often each article is read, how readers discover each article (via search, social, or the publication’s homepage), which device types readers are using, and which geographic regions visitors come from. We’ve tracked more than 480 billion pageviews in the last three years alone.

Articles – We scrape every article page and obtain its full text, as well as other metadata such as title, author, and publication date. We enrich each article by running its full text through state-of-the-art NLP algorithms, which extract fine-grained information on the categories, people, companies, places, and other entities that each article focuses on. We’ve scraped, enriched, and tracked pageviews on over 250 million articles.
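
As a rough illustration only, the two record types described above could be modeled as follows. This is a sketch under stated assumptions: the class names, field names, and sample values are hypothetical and are not the actual schema of the data set. It shows how pageview records might be related to article records through a shared URL.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Pageview:
    # Hypothetical fields mirroring the attributes described in the text.
    article_url: str
    referrer_type: str  # e.g. "search", "social", or "homepage"
    device_type: str    # e.g. "mobile" or "desktop"
    region: str         # geographic region of the visitor


@dataclass
class Article:
    # Hypothetical fields: scraped metadata plus NLP-derived entities.
    url: str
    title: str
    author: str
    pub_date: str
    full_text: str
    entities: List[str] = field(default_factory=list)


def pageviews_per_article(pageviews: List[Pageview]) -> Counter:
    """Count how many pageview records reference each article URL."""
    return Counter(pv.article_url for pv in pageviews)


# Made-up sample records, for illustration only.
sample_pageviews = [
    Pageview("https://example.com/a", "search", "mobile", "US"),
    Pageview("https://example.com/a", "social", "desktop", "GB"),
    Pageview("https://example.com/b", "homepage", "mobile", "US"),
]
counts = pageviews_per_article(sample_pageviews)
```

Keying pageviews by article URL is just one plausible way to join the two record types; the real data set may use a different identifier.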