Setting up the environment

castarter facilitates downloading and extracting textual contents from a website, or from a given section of a website. It can be used to analyse a single website, or more websites as part of a project, facilitating comparison among cases. In order to smooth replication and prevent data loss, by default castarter stores a copy of downloaded web pages in html format. To do so, it includes a function that creates the relevant folder structure, and another function that facilitates archiving the html files in compressed format.

In castarter, every dataset is expected to be extracted from a website, and each website is expected to be part of a project. This provides a consistent way of storing the materials gathered and facilitates comparison among cases, but it does not imply additional limitations: it is possible to have projects composed of a single website, and to make comparisons among websites that are part of different projects.

As an example, this vignette will demonstrate how to extract press releases published by the Kremlin. When starting a new castarter project, the first step is usually to define the name of the project and the name of the website. In this case, the project will be called “Presidents”, assuming that other websites to be analysed later may be those of presidents of other countries.

SetCastarter(project = "Presidents", website = "Kremlin_en")

This stores the names of the project and website as options that all castarter functions retrieve automatically, without the need to input them each time a function saves or loads files.
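
The stored values can then be retrieved at any point with base R’s getOption(). The option names below are an assumption for illustration only, as the exact names used internally by castarter may differ:

# hypothetical option names, shown for illustration only
getOption("castarter.project")
getOption("castarter.website")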

Then, it is time to create the folder structure where all outputs will be saved:
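
CreateFolders()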

This creates the following folders in the current working directory:

  • Presidents
    • Kremlin_en
      • Archives
      • Dataset
      • Html
      • IndexHtml
      • Logs
      • Outputs
      • Txt

Notice that there is no need to provide the names of the project and website to the function CreateFolders(), since they are retrieved from the previously set options.

Download index pages

On the Kremlin’s website, as on many other websites, it is easy to understand the structure of the archive: it is composed of http://en.kremlin.ru/events/president/news/page/ plus a number. By clicking on the “previous” link at the bottom of the page, as is customary, the user can see older press releases. In castarter, it is possible to get the links to all the pages that would be reached by clicking the ‘previous page’ link repeatedly, with the following function:

indexLinks <- CreateLinks(linkFirstChunk = "http://en.kremlin.ru/events/president/news/page/",
                          startPage = 1,
                          endPage = 1082)
head(indexLinks)

N.B. Another frequent way for websites to provide access to their archives is by including dates directly in the URL of archive pages. If, for example, a website offers its contents in the form of http://www.example.com/archive/2015-10-22, then it would be possible to generate the links for all the dates, say, between January 2012 and December 2015, with the following parameters: CreateLinks(linkFirstChunk = "http://www.example.com/archive/", startDate = "2012-01-01", endDate = "2015-12-31", dateSeparator = "-")

This command creates a vector (indexLinks) containing direct links to the index pages, such as http://en.kremlin.ru/events/president/news/page/1, http://en.kremlin.ru/events/president/news/page/2, and so on.

It is now necessary to download these index pages. The following command downloads the relevant index pages and saves them in html format in the /Presidents/Kremlin_en/IndexHtml/ folder.

DownloadContents(links = indexLinks, type = "index")

If the process is interrupted for any reason, it is possible to simply run the same function again: it will check which files have already been downloaded, and download only the missing ones. Sometimes, for a number of reasons, one or a few pages may not download correctly. It is usually appropriate to run the following function, which re-downloads files that are too small to contain meaningful contents (if all pages have been downloaded correctly, it will not do anything).

DownloadContents(links = indexLinks, type = "index", missingPages = FALSE)

Download pages
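
Before downloading the actual press releases, it is necessary to extract their direct links from the index pages downloaded above. This can be done with ExtractLinks(), keeping only links that point to individual news items; the parameters below are the same as those used in the complete script at the end of this vignette:

links <- ExtractLinks(domain = "http://en.kremlin.ru/",
                      partOfLink = "president/news/",
                      partOfLinkToExclude = c("page",
                                              "photos",
                                              "videos",
                                              "special"))

This creates a vector (links) containing direct links to the individual press releases.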

Now it’s time to download the actual press releases.

DownloadContents(links = links)

This command downloads the html file of each page, and stores it as a numbered html file in the Html subfolder. If for any reason you have to interrupt the process, you can re-run the command: by default, it will download only pages that have not previously been downloaded.

When downloading thousands of pages, it sometimes happens that a few of them are not downloaded correctly due to server errors or connection issues. It is advisable to re-run the above command at least once to ensure that all files have been properly downloaded. In other cases, an empty or almost-empty page may have been downloaded (e.g. a web page containing only an error message). Such pages can be downloaded again with the following command (the minimum acceptable size can be adjusted with the size parameter).

DownloadContents(links = links, missingPages = FALSE)

Extract data and metadata

[vignette to be completed; in the meantime, sections 5 and 6 of the complete script below show how to extract metadata and text, and how to save the resulting dataset]


Here is a complete script that outlines the whole procedure and can be copy-pasted and adapted to different websites:

#### 1. Preliminary steps ####

## Install castarter (devtools required for installing from github)
# install.packages("devtools")
devtools::install_github(repo = "giocomai/castarter")

## Load castarter
library("castarter")

## Set project and website name.
# They will remain in memory in the current R session,
# so there is no need to include them in each function call.
SetCastarter(project = "Presidents", website = "Kremlin_en")

## Creates the folder structure where files will be saved within the
## current working directory
CreateFolders()

#### 2. Download index/archive pages ####

# Check out the archive page, e.g. http://en.kremlin.ru/events/president/news/
# Find the number of the last archive page, assuming archive pages are numbered consecutively

indexLinks <- CreateLinks(linkFirstChunk = "http://en.kremlin.ru/events/president/news/page/",
                          startPage = 1,
                          endPage = 1082)

## Downloads all html files of the archive pages
# This can be interrupted: by default, if the command is issued again,
# it will download only missing pages (i.e. pages that have not been downloaded previously)
DownloadContents(links = indexLinks, type = "index")
# Download again files with an unusually small size (usually errors), if any.
DownloadContents(links = indexLinks, type = "index", missingPages = FALSE)

#### 3. Extract links to individual pages ####

# Find criteria to filter only direct links to news items
# Open an archive page at random.
# browseURL(url = sample(x = indexLinks, size = 1))

links <- ExtractLinks(domain = "http://en.kremlin.ru/",
                      partOfLink = "president/news/",
                      partOfLinkToExclude = c("page",
                                              "photos",
                                              "videos",
                                              "special"))
# length(links)/length(indexLinks)
# Explore the links and check whether they look plausible and their number is realistic
# View(links)
# head(links)

#### 4. Download pages ####

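# The wait parameter adds a pause between consecutive downloads, to reduce the load on the server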
DownloadContents(links = links, wait = 3)
DownloadContents(links = links, missingPages = FALSE, wait = 3)

#### 5. Extract metadata and text ####

# open an article at random to find out where metadata are located
# browseURL(url = sample(x = links, size = 1))

id <- ExtractId()

titles <- ExtractTitles(removeEverythingAfter = " •",
                        id = id)


dates <- ExtractDates(container = "time",
                      attribute = "datetime",
                      containerInstance = 1,
                      dateFormat = "Ymd",
                      id = id)

# check how many dates were not captured
sum(is.na(dates))
links[is.na(dates)]

language <- "english"

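## Gather the extracted metadata (id, dates, titles, language, links)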
metadata <- ExportMetadata(id = id,
                           dates = dates,
                           titles = titles,
                           language = language,
                           links = links)

## Extract text
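# Text is taken from <p> elements inside the <div> with class "read__content",
# dropping everything after "Published in:"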
text <- ExtractText(container = "div",
                    containerClass = "read__content",
                    subElement = "p",
                    removeEverythingAfter = "\nPublished in:",
                    id = id)

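# Spot-check a randomly selected item to verify that titles, dates, text and links are aligned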
i <- sample(x = 1:length(text), 1)
titles[i]
dates[i]
text[i]
links[id][i]

#### 6. Save and export ####

SaveWebsite(dataset = TRUE)