Extracts textual contents from a vector of html files.
ExtractText(container = NULL, containerClass = NULL, containerId = NULL, subElement = NULL, noChildren = NULL, htmlLocation = NULL, id = NULL, encoding = "UTF-8", metadata = NULL, export = FALSE, maxTitleCharacters = 80, removeString = NULL, customXpath = NULL, removeEverythingAfter_pre = NULL, removeEverythingBefore_pre = NULL, removeEverythingBefore = NULL, removeEverythingAfter = NULL, keepEverything = FALSE, removePunctuationInFilename = TRUE, removeTitleFromTxt = FALSE, titles = NULL, importParameters = NULL, exportParameters = TRUE, progressBar = TRUE, project = NULL, website = NULL)
container | Defaults to NULL. If provided, it must be an html element such as "div", "span", etc. |
---|---|
containerClass | Defaults to NULL. If provided, `container` must also be given (and `containerId` must be NULL). Only text found inside the given container/class combination will be extracted. |
containerId | Defaults to NULL. If provided, `container` must also be given (and `containerClass` must be NULL). Only text found inside the given container/id combination will be extracted. |
subElement | Defaults to NULL. If provided, `container` must also be given. Only text within elements of the given type under the chosen container/containerClass combination will be extracted. When given, it will typically be "p", to extract all p elements inside the selected div. |
noChildren | Defaults to NULL, in which case all subelements of the selected combination (e.g. div with given class) are extracted. If TRUE, only text found directly under the given combination (but not in its subelements) will be extracted. Corresponds to the xpath string `/node()[not(self::div)]`. |
htmlLocation | Path to the folder where html files, typically downloaded with DownloadContents(links), are stored. If not given, it defaults to the Html folder inside the project/website folders. |
id | Defaults to NULL. If provided, it should be a vector of integers. Only html files corresponding to given id in the relevant htmlLocation will be processed. |
metadata | Defaults to NULL. A data.frame, presumably created with ExportMetadata(), including information on all articles. The number of rows must correspond to the number of articles to be processed. This is required when export == TRUE, in order to provide meaningful filenames. |
export | Logical, defaults to FALSE. If TRUE, textual contents are saved as individual txt files in a dedicated folder. The filename is based on the metadata. |
maxTitleCharacters | Maximum number of characters allowed in the title. Defaults to 80. |
removeString | A character vector of one or more strings. Provided strings are removed from each article. |
removeEverythingAfter_pre | Defaults to NULL. Everything after this string is removed before processing the HTML file. |
removeEverythingBefore_pre | Defaults to NULL. Everything before this string is removed before processing the HTML file. |
keepEverything | Logical. If TRUE, extracts all visible text. |
removePunctuationInFilename | Logical, defaults to TRUE. If TRUE (and export == TRUE), it removes punctuation marks from filenames to prevent errors when saving files. |
progressBar | Logical, defaults to TRUE. If FALSE, progress bar is not shown. |
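
To make the interplay of `container`, `containerClass`, `containerId`, `subElement` and `noChildren` more concrete, the sketch below composes an XPath string from those selectors. `build_xpath` is a hypothetical helper written for illustration only; it is not part of the package, and the expressions ExtractText() builds internally may differ (only the `/node()[not(self::div)]` suffix is taken from the argument description above). It also assumes that the class attribute must match the given string exactly.

```r
# Hypothetical helper (NOT the package's implementation): composes an XPath
# string from the same selectors that ExtractText() accepts.
build_xpath <- function(container, containerClass = NULL, containerId = NULL,
                        subElement = NULL, noChildren = FALSE) {
  stopifnot(is.character(container), length(container) == 1)
  # The argument table states that class and id are mutually exclusive.
  if (!is.null(containerClass) && !is.null(containerId)) {
    stop("Provide either containerClass or containerId, not both.")
  }
  xpath <- paste0("//", container)
  if (!is.null(containerClass)) {
    # Assumption: exact match on the whole class attribute.
    xpath <- paste0(xpath, "[@class='", containerClass, "']")
  } else if (!is.null(containerId)) {
    xpath <- paste0(xpath, "[@id='", containerId, "']")
  }
  if (!is.null(subElement)) {
    # Any descendant of the given type, e.g. all "p" inside the div.
    xpath <- paste0(xpath, "//", subElement)
  }
  if (isTRUE(noChildren)) {
    # Suffix quoted in the noChildren description.
    xpath <- paste0(xpath, "/node()[not(self::div)]")
  }
  xpath
}

build_xpath("div", containerClass = "article", subElement = "p")
# "//div[@class='article']//p"
build_xpath("div", containerId = "content", noChildren = TRUE)
# "//div[@id='content']/node()[not(self::div)]"
```

When none of these combinations fits a given website, the `customXpath` argument listed in the usage is presumably the place to supply such a string directly.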
A character vector of text, and individual articles saved as txt files in a dedicated folder if 'export' is set to TRUE.
# NOT RUN {
text <- ExtractText(container = "div", containerClass = "article")
# }
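
The clean-up and filename arguments (`removeEverythingAfter`, `removeEverythingBefore`, `removeString`, `maxTitleCharacters`, `removePunctuationInFilename`) can be pictured with a few lines of base R. All helper names below are hypothetical and are not functions of the package; in particular, it is an assumption here that the marker string itself is dropped along with the text before or after it, and that a filename is just the cleaned title plus `.txt` (the real filenames are based on the metadata and may include other fields).

```r
# Hypothetical helpers for illustration only (base R, no package needed).

# Keep only the text that precedes the first occurrence of `marker`.
remove_after <- function(text, marker) {
  pos <- as.integer(regexpr(marker, text, fixed = TRUE))
  ifelse(pos > 0, substr(text, 1, pos - 1), text)
}

# Keep only the text that follows the first occurrence of `marker`.
remove_before <- function(text, marker) {
  pos <- as.integer(regexpr(marker, text, fixed = TRUE))
  ifelse(pos > 0, substr(text, pos + nchar(marker), nchar(text)), text)
}

# Remove each of the given literal strings from every article.
remove_strings <- function(text, strings) {
  for (s in strings) {
    text <- gsub(s, "", text, fixed = TRUE)
  }
  text
}

# Build a filename from a title: strip punctuation, then truncate.
make_filename <- function(title, maxTitleCharacters = 80,
                          removePunctuation = TRUE) {
  if (removePunctuation) {
    title <- gsub("[[:punct:]]", "", title)
  }
  paste0(substr(title, 1, maxTitleCharacters), ".txt")
}

article <- "Home > News. Body of the article. Share this article"
article <- remove_before(article, "Home > News. ")
article <- remove_after(article, " Share this")
article
# "Body of the article."

remove_strings("Read more: text (photo)", c("Read more: ", " (photo)"))
# "text"

make_filename("What's new? A/B tests: results", maxTitleCharacters = 20)
# "Whats new AB tests r.txt"
```

Stripping punctuation before truncating means the character limit applies to the cleaned title, which is one reasonable ordering among several.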