Extracts direct links to individual articles from index pages according to a selected pattern.

ExtractLinks(domain = NULL, partOfLink = NULL,
  partOfLinkToExclude = NULL, container = NULL,
  containerClass = NULL, containerId = NULL, attributeType = NULL,
  htmlLocation = NULL, id = NULL, extractText = FALSE,
  minLength = NULL, maxLength = NULL, indexLinks = NULL,
  sortLinks = FALSE, linkTitle = TRUE, appendString = NULL,
  export = FALSE, removeString = NULL, progressBar = TRUE,
  project = NULL, website = NULL, importParameters = NULL,
  exportParameters = TRUE)

Arguments

domain

Web domain of the website. It is added at the beginning of each link found. If links in the page already include the full web address, this argument should be left unset. Defaults to "".

partOfLink

Part of the URL found only in links to the individual articles to be downloaded. If more than one string is provided, all links that contain any of the strings are included.

partOfLinkToExclude

If a URL includes this string, it is excluded from the output. One or more strings may be provided.

container

Type of html container from where links are to be extracted, such as "div", "ul", and others. containerClass or containerId must also be provided.

attributeType

Type of attribute to extract from links, when different from href.

htmlLocation

Path to the folder where html files, typically downloaded with DownloadContents(links, type = "index"), are stored. If not given, it defaults to the IndexHtml folder inside the project/website folders.

id

Defaults to NULL. If provided, it should be a vector of integers. Only html files corresponding to given id in the relevant htmlLocation will be processed.

minLength

If a link is shorter than the number of characters given in minLength, it is excluded from the output.

maxLength

If a link is longer than the number of characters given in maxLength, it is excluded from the output.

indexLinks

A character vector, defaults to NULL. If provided, indexLinks are removed from the extracted articlesLinks.

sortLinks

Defaults to FALSE. If TRUE, links are sorted in alphabetical order.

linkTitle

Defaults to TRUE. If TRUE, text of links is included as names of the vector.

appendString

If provided, appends the given string to each extracted link. Typically used to create links for print or mobile versions of the extracted page.

removeString

If provided, removes the given string (or strings) from links.

progressBar

Logical, defaults to TRUE. If FALSE, the progress bar is not shown (useful, for example, when including scripts in rmarkdown).

exportParameters

Defaults to TRUE. If TRUE, function parameters are exported in the project/website folder. They can be used to update the corpus.

Value

A named character vector of links to articles. Name of the link may be the article title.
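
As a rough illustration of the shape of this return value, the sketch below builds such a vector by hand in base R. The titles and URLs are invented for the example and are not output of ExtractLinks.

```r
# Hypothetical result, written by hand for illustration only.
links <- c(
  "First article"  = "http://www.example.com/news/first-article",
  "Second article" = "http://www.example.com/news/second-article"
)

# The URLs are the values of the vector ...
unname(links)
# ... and the link text, when available, is stored in the names.
names(links)

# A data frame can be convenient for inspecting titles and links together.
linksDf <- data.frame(title = names(links), link = unname(links),
                      stringsAsFactors = FALSE)
```

Keeping the title in the names means the vector can still be passed wherever a plain character vector of URLs is expected.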

Examples

# NOT RUN {
links <- ExtractLinks(domain = "http://www.example.com/", partOfLink = "news/")
# }
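
The filtering arguments can be hard to picture without the html files at hand. The following base R sketch mimics, on a hand-made vector of links, what partOfLink, partOfLinkToExclude, minLength, removeString, domain and sortLinks are documented to do. It is not the package's internal code: the sample links, the helper containsAny(), literal (non-regex) matching, and the order in which the steps are applied are all assumptions made for the illustration.

```r
# Illustration only: base R on invented data, not castarter's internals.
candidateLinks <- c(
  "Home"          = "index.html",
  "News section"  = "news/",
  "Election day"  = "news/election-day?utm=feed",
  "Photo gallery" = "news/gallery/summer",
  "Budget vote"   = "news/budget-vote?utm=feed"
)

domain <- "http://www.example.com/"
partOfLink <- "news/"
partOfLinkToExclude <- "gallery/"
minLength <- 10
removeString <- "?utm=feed"

# Hypothetical helper: TRUE where a link contains any of the given strings.
# Strings are matched literally here (fixed = TRUE), which is an assumption.
containsAny <- function(x, patterns) {
  Reduce(`|`,
         lapply(patterns, function(p) grepl(p, x, fixed = TRUE)),
         rep(FALSE, length(x)))
}

# Keep links that match partOfLink, do not match partOfLinkToExclude,
# and are not shorter than minLength characters.
keep <- containsAny(candidateLinks, partOfLink) &
  !containsAny(candidateLinks, partOfLinkToExclude) &
  nchar(candidateLinks) >= minLength
selected <- candidateLinks[keep]

# Strip the unwanted string, then prepend the domain.
# Assigning with [] keeps the link titles stored in the names.
selected[] <- gsub(removeString, "", selected, fixed = TRUE)
selected[] <- paste0(domain, selected)

# Equivalent of sortLinks = TRUE: alphabetical order, names preserved.
selected <- sort(selected)
selected
```

In this sketch only the "Budget vote" and "Election day" links survive: the home page lacks "news/", the section page is shorter than minLength, and the gallery link is excluded explicitly.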