Downloads HTML pages based on a vector of links.

DownloadContents(links, type = "articles", path = NULL, size = 500,
  linksToCheck = NULL, linksToDownload = NULL, wgetSystem = FALSE,
  method = "auto", missingPages = TRUE, start = 1, wait = 1,
  ignoreSSLcertificates = FALSE, createScript = FALSE,
  project = NULL, website = NULL)

Arguments

links

A character vector of links, commonly generated with the functions CreateLinks or ExtractLinks.
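
For illustration, links can also be supplied directly as a character vector of URLs (the addresses below are hypothetical):

links <- c("https://www.example.com/news/article1.html",
           "https://www.example.com/news/article2.html")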

type

Accepted values are "articles" (default) or "index"; this determines the folder where files are stored.

path

Defaults to NULL. If given, it overrides the type parameter and stores HTML files in the given path, as a subfolder of project/website. The folder must already exist and should be empty.

size

Defaults to 500. The minimum size, in bytes, that downloaded HTML files should have: smaller files will be downloaded again. Used only when missingPages == FALSE.

linksToCheck

A logical vector. Only links corresponding to TRUE will be considered for download. Corresponds to `links[linksToCheck]`, but keeps the id in line with the original position in the links vector. Unlike `linksToDownload`, it respects the other parameters: e.g. if `missingPages = TRUE`, selected pages are downloaded only if they have not previously been downloaded (see the sketch under linksToDownload below).

linksToDownload

A logical vector. Only links corresponding to TRUE will be downloaded. Corresponds to `links[linksToDownload]`, but keeps the id in line with the original position in the links vector. If given, the other parameters are ignored and all selected pages are downloaded (overwriting existing files).
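
For illustration, a minimal sketch of both parameters; the pattern passed to grepl() is hypothetical:

# Download pages whose URL contains "2019" only if not previously downloaded
DownloadContents(links, linksToCheck = grepl("2019", links))
# Download all pages whose URL contains "2019", overwriting existing files
DownloadContents(links, linksToDownload = grepl("2019", links))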

wgetSystem

Logical, defaults to FALSE. If TRUE, calls wget as a system command through the system() function. wget must be installed on the system.

method

Defaults to "auto". Method is passed to the function utils::download.file(); available options are "internal", "wininet" (Windows only) "libcurl", "wget" and "curl". For more information see ?utils::download.file()

missingPages

Logical, defaults to TRUE. If TRUE, checks whether a downloaded HTML file already exists for each element of links; when no such file is found, the page is downloaded.

start

Integer. Only links whose position in the links vector is equal to or greater than start will be downloaded: links[start:length(links)]
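
For example, to resume a download interrupted after the first 100 links (a hypothetical scenario):

DownloadContents(links, start = 101)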

wait

Defaults to 1. Number of seconds to wait between downloading one page and the next. Can be increased to reduce server load, or can be set to 0 when this is not an issue.

ignoreSSLcertificates

Logical, defaults to FALSE. If TRUE, it uses wget to download the page and does not check whether the SSL certificate is valid. Useful, for example, for https pages with an expired or misconfigured SSL certificate.
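
For example, to download pages from a website with an expired certificate (wget must be installed, as noted above):

DownloadContents(links, ignoreSSLcertificates = TRUE)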

createScript

Logical, defaults to FALSE. Tested on Linux only. If TRUE, creates a downloadPages.sh executable file that can be used to download all relevant pages from a terminal.

project

Name of 'castarter' project. Must correspond to the name of a folder in the current working directory.

website

Name of a website included in a 'castarter' project. Must correspond to the name of a sub-folder of the project folder.

Value

By default, returns nothing; the function is used for its side effects (it downloads HTML files into the relevant folder). Downloaded files can then be imported into a character vector with the function ImportHtml.

Examples

# NOT RUN {
DownloadContents(links)
# }
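
A slightly fuller sketch; the project and website names are hypothetical, and the corresponding folders are assumed to already exist in the working directory:

# NOT RUN {
DownloadContents(links,
                 type = "articles",
                 project = "exampleProject",
                 website = "exampleWebsite",
                 wait = 2)
# Downloaded files can then be imported with ImportHtml (see ?ImportHtml)
# }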