Extracts direct links to individual articles from index pages.

Extracts direct links to individual articles from index pages according to a selected pattern.

ExtractLinks(domain = NULL, partOfLink = NULL,
  partOfLinkToExclude = NULL, container = NULL,
  containerClass = NULL, containerId = NULL, attributeType = NULL,
  htmlLocation = NULL, id = NULL, extractText = FALSE,
  minLength = NULL, maxLength = NULL, indexLinks = NULL,
  sortLinks = FALSE, linkTitle = TRUE, appendString = NULL,
  export = FALSE, removeString = NULL, progressBar = TRUE,
  project = NULL, website = NULL, importParameters = NULL,
  exportParameters = TRUE)

Arguments

domain	Web domain of the website. Will be added at the beginning of each link found. If links in the page already include the full web address, this should be ignored. Defaults to "".
partOfLink	Part of URL found only in links of individual articles to be downloaded. If more than one is provided, all links that contain any of the given strings are included.
partOfLinkToExclude	If a URL includes this string, it is excluded from the output. One or more strings may be provided.
container	Type of HTML container from which links are to be extracted, such as "div", "ul", and others. containerClass or containerId must also be provided.
attributeType	Type of attribute to extract from links, when different from href.
htmlLocation	Path to the folder where html files, typically downloaded with DownloadContents(links, type = "index"), are stored. If not given, it defaults to the IndexHtml folder inside the project/website folders.
id	Defaults to NULL. If provided, it should be a vector of integers. Only html files corresponding to given id in the relevant htmlLocation will be processed.
minLength	If a link is shorter than the number of characters given in minLength, it is excluded from the output.
maxLength	If a link is longer than the number of characters given in maxLength, it is excluded from the output.
indexLinks	A character vector, defaults to NULL. If provided, indexLinks are removed from the extracted articlesLinks.
sortLinks	Defaults to FALSE. If TRUE, links are sorted in alphabetical order.
linkTitle	Defaults to TRUE. If TRUE, text of links is included as names of the vector.
appendString	If provided, appends the given string to each extracted link. Typically used to create links for print or mobile versions of the extracted page.
removeString	If provided, removes the given string (or strings) from links.
progressBar	Logical, defaults to TRUE. If FALSE, the progress bar is not shown (useful, for example, when including scripts in rmarkdown).
exportParameters	Defaults to TRUE. If TRUE, function parameters are exported in the project/website folder. They can be used to update the corpus.
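
The filtering arguments above can be easier to follow when applied to a plain character vector. The following is a minimal base R sketch of how partOfLink, partOfLinkToExclude, minLength, maxLength, removeString, appendString and sortLinks interact as described. filter_links is a hypothetical helper written for illustration, not part of the package; the order in which the steps are applied, and the use of literal (non-regex) matching, are assumptions.

```r
# Hypothetical helper (not the package's implementation): applies the
# documented filtering arguments to a character vector of links.
filter_links <- function(links, partOfLink = NULL, partOfLinkToExclude = NULL,
                         minLength = NULL, maxLength = NULL,
                         removeString = NULL, appendString = NULL,
                         sortLinks = FALSE) {
  # TRUE for each link that contains at least one of the given strings
  contains_any <- function(x, patterns) {
    Reduce(`|`, lapply(patterns, function(p) grepl(p, x, fixed = TRUE)))
  }
  # Keep links that contain any of the partOfLink strings
  if (!is.null(partOfLink)) {
    links <- links[contains_any(links, partOfLink)]
  }
  # Drop links that contain any of the partOfLinkToExclude strings
  if (!is.null(partOfLinkToExclude)) {
    links <- links[!contains_any(links, partOfLinkToExclude)]
  }
  # Drop links shorter than minLength or longer than maxLength
  if (!is.null(minLength)) links <- links[nchar(links) >= minLength]
  if (!is.null(maxLength)) links <- links[nchar(links) <= maxLength]
  # Remove each unwanted string, then append the suffix if given
  for (s in removeString) links <- gsub(s, "", links, fixed = TRUE)
  if (!is.null(appendString)) links <- paste0(links, appendString)
  if (sortLinks) links <- sort(links)
  links
}

# Illustrative input; these paths are made up for the example
links <- c("/news/2019/budget-vote", "/news/archive", "/about",
           "/news/2018/election?ref=home")
result <- filter_links(links,
                       partOfLink = "news/",
                       partOfLinkToExclude = "archive",
                       removeString = "?ref=home",
                       sortLinks = TRUE)
print(result)
```

Here "/about" is dropped because it lacks "news/", "/news/archive" is dropped by the exclusion string, and the tracking suffix is stripped before sorting.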

Value

A named character vector of links to articles. When linkTitle is TRUE, the names are the text of each link, which is often the article title.
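
As a brief illustration of working with a return value of this shape, the sketch below builds a named character vector by hand and reads names and URLs from it. The titles and URLs are invented for the example and are not output of the function.

```r
# Hand-built stand-in for a result returned with linkTitle = TRUE;
# titles and URLs are illustrative only.
articlesLinks <- c(
  "Budget vote passes" = "http://www.example.com/news/budget-vote",
  "Election results"   = "http://www.example.com/news/election"
)

titles <- names(articlesLinks)   # link text, used as names
urls   <- unname(articlesLinks)  # bare URLs without names

# Look up a single link by its title
oneLink <- articlesLinks[["Election results"]]
print(oneLink)
```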

Examples

# NOT RUN {
links <- ExtractLinks(domain = "http://www.example.com/", partOfLink = "news/")
# }
