R/extractLinks.R
ExtractLinks.Rd
Extracts direct links to individual articles from index pages according to a selected pattern.
ExtractLinks(domain = NULL, partOfLink = NULL, partOfLinkToExclude = NULL, container = NULL, containerClass = NULL, containerId = NULL, attributeType = NULL, htmlLocation = NULL, id = NULL, extractText = FALSE, minLength = NULL, maxLength = NULL, indexLinks = NULL, sortLinks = FALSE, linkTitle = TRUE, appendString = NULL, export = FALSE, removeString = NULL, progressBar = TRUE, project = NULL, website = NULL, importParameters = NULL, exportParameters = TRUE)
Argument | Description |
---|---|
domain | Web domain of the website. It is added at the beginning of each link found. If links in the page already include the full web address, this should be ignored. Defaults to "". |
partOfLink | Part of the URL found only in links to the individual articles to be downloaded. If more than one string is provided, all links that contain any of the given strings are included. |
partOfLinkToExclude | If a URL includes this string, it is excluded from the output. One or more strings may be provided. |
container | Type of html container from where links are to be extracted, such as "div", "ul", and others. containerClass or containerId must also be provided. |
attributeType | Type of attribute to extract from links, when different from href. |
htmlLocation | Path to the folder where html files, typically downloaded with DownloadContents(links, type = "index"), are stored. If not given, it defaults to the IndexHtml folder inside project/website folders. |
id | Defaults to NULL. If provided, it should be a vector of integers. Only html files corresponding to given id in the relevant htmlLocation will be processed. |
minLength | If a link is shorter than the number of characters given in minLength, it is excluded from the output. |
maxLength | If a link is longer than the number of characters given in maxLength, it is excluded from the output. |
indexLinks | A character vector, defaults to NULL. If provided, indexLinks are removed from the extracted articlesLinks. |
sortLinks | Defaults to FALSE. If TRUE, links are sorted in alphabetical order. |
linkTitle | Defaults to TRUE. If TRUE, text of links is included as names of the vector. |
appendString | If provided, appends the given string to the extracted links. Typically used to create links for print or mobile versions of the extracted page. |
removeString | If provided, removes the given string (or strings) from links. |
progressBar | Logical, defaults to TRUE. If FALSE, the progress bar is not shown (useful, for example, when including scripts in rmarkdown). |
exportParameters | Defaults to TRUE. If TRUE, function parameters are exported in the project/website folder. They can be used to update the corpus. |
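
The filtering arguments described in the table (partOfLink, partOfLinkToExclude, minLength, maxLength) can be illustrated with a minimal base R sketch. The helper `filter_links()` and the sample links are hypothetical and not part of the package; the sketch also assumes literal (fixed) string matching, which may differ from what ExtractLinks does internally.

```r
# Hypothetical helper, not part of the package: mimics the documented
# filtering arguments on a plain character vector of links.
filter_links <- function(links,
                         partOfLink = NULL,
                         partOfLinkToExclude = NULL,
                         minLength = NULL,
                         maxLength = NULL) {
  if (!is.null(partOfLink)) {
    # Keep links that contain at least one of the given strings.
    keep <- Reduce(`|`, lapply(partOfLink, function(p) grepl(p, links, fixed = TRUE)))
    links <- links[keep]
  }
  if (!is.null(partOfLinkToExclude)) {
    # Drop links that contain any of the given strings.
    drop <- Reduce(`|`, lapply(partOfLinkToExclude, function(p) grepl(p, links, fixed = TRUE)))
    links <- links[!drop]
  }
  # Links shorter than minLength or longer than maxLength are excluded.
  if (!is.null(minLength)) links <- links[nchar(links) >= minLength]
  if (!is.null(maxLength)) links <- links[nchar(links) <= maxLength]
  links
}

# Made-up sample of links as they might appear on an index page.
candidate_links <- c("/news/2019/first-article",
                     "/news/",
                     "/about/",
                     "/news/2019/first-article?print=1")

filtered <- filter_links(candidate_links,
                         partOfLink = "news/",
                         partOfLinkToExclude = "print=",
                         minLength = 10)
print(filtered)
```

Here the bare section link "/news/" is dropped by minLength, "/about/" fails the partOfLink test, and the print version is removed by partOfLinkToExclude.
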
A named character vector of links to articles. Name of the link may be the article title.
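
To show the shape of such a return value, the following base R sketch builds a named character vector by hand. The titles, paths, and the "?print=1" suffix are made up for illustration, and the steps only approximate what the domain, linkTitle, sortLinks, and appendString arguments are documented to do.

```r
# Hypothetical link texts and relative paths, standing in for what
# might be found on an index page.
titles <- c("Second article", "First article")
hrefs  <- c("news/second-article", "news/first-article")

domain <- "http://www.example.com/"

links <- paste0(domain, hrefs)   # domain is added at the beginning of each link
names(links) <- titles           # analogue of linkTitle = TRUE

links <- links[order(links)]     # analogue of sortLinks = TRUE; names follow their links

# Analogue of appendString; paste0() drops names, so they are restored.
print_links <- paste0(links, "?print=1")
names(print_links) <- names(links)

print(links)
print(print_links)
```
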
# NOT RUN {
links <- ExtractLinks(domain = "http://www.example.com/", partOfLink = "news/")
# }