Extracts titles of individual pages from a vector of html files or from a named vector of links.
ExtractTitles(container = "title", containerClass = NULL, containerId = NULL, htmlLocation = NULL, id = NULL, links = NULL, removePunctuation = FALSE, onlyStandardCharacters = FALSE, removeString = NULL, removeEverythingBefore = NULL, removeEverythingAfter = NULL, customXpath = "", maxCharacters = NULL, encoding = "UTF-8", progressBar = TRUE, exportParameters = TRUE, importParameters = NULL, project = NULL, website = NULL)
container | HTML element where the title is found. The title can usually be found in one of the following: |
---|---|
htmlLocation | Path to the folder where html files, typically downloaded with DownloadContents(links), are stored. If not given, it defaults to the Html folder inside the project/website folders. |
id | Defaults to NULL. If provided, it should be a vector of integers. Only html files corresponding to the given ids in the relevant htmlLocation will be processed. |
links | A named character vector, typically created by the ExtractLinks function. |
removeString | A character vector of one or more strings to be removed from the extracted title. |
removeEverythingAfter | Removes everything after given string. |
maxCharacters | An integer. Defines the maximum number of characters to be kept in the output for each title. |
progressBar | Logical, defaults to TRUE. If FALSE, the progress bar is not shown (useful, for example, when including scripts in rmarkdown). |
exportParameters | Defaults to TRUE. If TRUE, function parameters are exported in the project/website folder. They can be used to update the corpus. Requires parameters project/website. |
importParameters | Defaults to NULL. If TRUE, ignores all parameters given in the function call, and imports them from parameters file stored in "project/website/Logs/parameters.rds". |
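The clean-up arguments described above (removeString, removeEverythingAfter, maxCharacters) can be illustrated with a minimal base R sketch. This is not the package's implementation: `clean_title` is a hypothetical helper, the sample titles are made up, and it assumes that the marker string passed to removeEverythingAfter is itself dropped along with the text that follows it.

```r
# Hypothetical helper, not part of the package: mimics the documented
# clean-up options on a character vector of raw titles.
clean_title <- function(title, removeString = NULL,
                        removeEverythingAfter = NULL, maxCharacters = NULL) {
  # Drop each unwanted string, matched literally (fixed = TRUE disables regex).
  for (s in removeString) {
    title <- gsub(s, "", title, fixed = TRUE)
  }
  # Assumption: the marker and everything after its first occurrence are removed.
  if (!is.null(removeEverythingAfter)) {
    pos <- as.integer(regexpr(removeEverythingAfter, title, fixed = TRUE))
    found <- pos > 0
    title[found] <- substr(title[found], 1, pos[found] - 1)
  }
  # Keep at most maxCharacters characters of each title.
  if (!is.null(maxCharacters)) {
    title <- substr(title, 1, maxCharacters)
  }
  trimws(title)
}

# Made-up sample titles for illustration only.
raw <- c("Breaking: Budget approved | Example News",
         "Weather update | Example News")
cleaned <- clean_title(raw, removeString = "Breaking: ",
                       removeEverythingAfter = " | ", maxCharacters = 10)
print(cleaned)
```

Literal matching is used here so that characters such as `|` or `.` in the strings are not interpreted as regular expression syntax; whether the function itself treats these arguments literally or as patterns is not stated in this page.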
A character vector of article titles.
# NOT RUN {
titles <- ExtractTitles()
# }
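To show what the container argument selects and what shape the returned character vector has, here is a simplified base R sketch that works on in-memory HTML strings rather than on downloaded files. `extract_container_text` is a hypothetical helper and the pages are made up; it uses a regular expression purely for illustration, and a real HTML parser would be more robust on actual pages.

```r
# Hypothetical helper, not part of the package: returns the text of the first
# <container> element found in each HTML string.
extract_container_text <- function(html, container = "title") {
  pattern <- paste0("<", container, "[^>]*>(.*?)</", container, ">")
  m <- regmatches(html, regexec(pattern, html, perl = TRUE))
  # Each match is c(full match, captured text); pages without a match give NA.
  vapply(m, function(x) if (length(x) == 2) trimws(x[2]) else NA_character_,
         character(1))
}

# Made-up pages for illustration only.
pages <- c(
  "<html><head><title>First article</title></head><body><h1>Headline one</h1></body></html>",
  "<html><head><title>Second article</title></head><body><h1>Headline two</h1></body></html>"
)
titles <- extract_container_text(pages)
headlines <- extract_container_text(pages, container = "h1")
print(titles)
print(headlines)
```

Switching container from "title" to "h1" changes which element supplies the text, which is the kind of choice the container argument exposes.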