Extracts the titles of individual pages from a vector of HTML files or from a named vector of links.

ExtractTitles(container = "title", containerClass = NULL,
  containerId = NULL, htmlLocation = NULL, id = NULL, links = NULL,
  removePunctuation = FALSE, onlyStandardCharacters = FALSE,
  removeString = NULL, removeEverythingBefore = NULL,
  removeEverythingAfter = NULL, customXpath = "",
  maxCharacters = NULL, encoding = "UTF-8", progressBar = TRUE,
  exportParameters = TRUE, importParameters = NULL, project = NULL,
  website = NULL)

Arguments

container

HTML element where the title is found. The title can usually be found in one of the following:

  • "links": Extract the title from links (the links argument is then required). Titles are taken from the text of each link as found on the index pages.

  • "title": Default. Extract the title from the HTML <title> element, usually shown in the title bar or tab of web browsers.

  • "h1": Extract the title from the first occurrence of a level 1 heading, i.e. the <h1> HTML tag.

  • "h2": Extract the title from the first occurrence of a level 2 heading, i.e. the <h2> HTML tag.
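
To illustrate what "first occurrence" means for the "title", "h1" and "h2" containers, here is a minimal base R sketch. The helper first_tag_text() and the sample page are hypothetical and are not part of the package; the regular expression is a deliberate simplification that assumes single-line HTML, and it is not meant to describe how ExtractTitles parses pages internally.

```r
# Hypothetical helper, not part of the package: returns the text of the
# first <tag>...</tag> pair found in an HTML string, or NA if none is found.
first_tag_text <- function(html, tag) {
  # Non-greedy match so that only the first element is captured.
  pattern <- paste0("<", tag, "[^>]*>(.*?)</", tag, ">")
  match <- regmatches(html, regexec(pattern, html, perl = TRUE))[[1]]
  if (length(match) < 2) {
    return(NA_character_)
  }
  trimws(match[2])
}

# Made-up page used only for this illustration.
page <- paste0(
  "<html><head><title>Site name - News</title></head>",
  "<body><h1>First heading</h1><h1>Second heading</h1></body></html>"
)

first_tag_text(page, "title") # text of the <title> element
first_tag_text(page, "h1")    # text of the first <h1> only
first_tag_text(page, "h2")    # NA, since the page has no <h2>
```

Pages that repeat the site name in <title> are a common reason to prefer "h1" or "h2" as the container.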

htmlLocation

Path to the folder where HTML files, typically downloaded with DownloadContents(links), are stored. If not given, it defaults to the Html folder inside the project/website folders.

id

Defaults to NULL. If provided, it should be a vector of integers. Only the HTML files corresponding to the given ids in the relevant htmlLocation will be processed.

links

A named character vector, typically created by the ExtractLinks function.

removeString

A character vector of one or more strings to be removed from the extracted title.

removeEverythingAfter

Removes everything after the given string.

maxCharacters

An integer. Defines the maximum number of characters to be kept in the output for each title.
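
The effect of removeString, removeEverythingAfter and maxCharacters can be approximated in base R as follows. This is a sketch under stated assumptions: clean_title() is a hypothetical helper, the order in which the steps run is assumed, strings are matched literally, and the marker passed to removeEverythingAfter is assumed to be dropped together with what follows it. The function itself may behave differently on each of these points.

```r
# Hypothetical helper, not part of the package.
clean_title <- function(title, removeString = NULL,
                        removeEverythingAfter = NULL, maxCharacters = NULL) {
  # Remove each given string wherever it appears (literal match, no regex).
  if (!is.null(removeString)) {
    for (s in removeString) {
      title <- gsub(s, "", title, fixed = TRUE)
    }
  }
  # Cut each title at the first occurrence of the marker.
  # Assumption: the marker itself is removed as well.
  if (!is.null(removeEverythingAfter)) {
    pos <- regexpr(removeEverythingAfter, title, fixed = TRUE)
    found <- pos > 0
    if (any(found)) {
      title[found] <- substr(title[found], 1, pos[found] - 1)
    }
  }
  # Keep at most maxCharacters characters per title.
  if (!is.null(maxCharacters)) {
    title <- substr(title, 1, maxCharacters)
  }
  title
}

# Made-up titles used only for this illustration.
raw <- c("Budget approved | Example News",
         "Interview: the mayor speaks | Example News")

cleaned <- clean_title(raw,
                       removeString = "Interview: ",
                       removeEverythingAfter = " | ",
                       maxCharacters = 12)
print(cleaned)
```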

progressBar

Logical, defaults to TRUE. If FALSE, the progress bar is not shown (useful, for example, when including scripts in R Markdown documents).

exportParameters

Defaults to TRUE. If TRUE, the function parameters are exported to the project/website folder, so that they can be reused to update the corpus. Requires the project and website parameters.

importParameters

Defaults to NULL. If TRUE, ignores all parameters given in the function call and imports them from the parameters file stored in "project/website/Logs/parameters.rds".
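
The export/import round trip described for exportParameters and importParameters relies on an .rds file. The sketch below shows the general idea with base R's saveRDS() and readRDS() in a temporary folder. The folder names exampleProject and exampleWebsite and the flat list of parameters are assumptions made for illustration; the actual layout of the parameters file written by the package may differ.

```r
# Temporary folder standing in for a project/website/Logs directory.
# "exampleProject" and "exampleWebsite" are hypothetical names.
base_dir <- tempfile("parameters-sketch-")
logs_dir <- file.path(base_dir, "exampleProject", "exampleWebsite", "Logs")
dir.create(logs_dir, recursive = TRUE)
params_file <- file.path(logs_dir, "parameters.rds")

# Assumed structure: a plain named list of the arguments used in a call.
params <- list(container = "h1",
               removeEverythingAfter = " | ",
               maxCharacters = 80L)

# Export: write the parameters to disk.
saveRDS(params, params_file)

# Import: read them back in a later session instead of retyping them.
restored <- readRDS(params_file)
identical(restored, params)
```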

Value

A character vector of article titles.

Examples

# NOT RUN {
titles <- ExtractTitles()
# }