Extracts textual contents from a vector of html files

Extracts textual contents from a vector of html files.

ExtractText(container = NULL, containerClass = NULL,
  containerId = NULL, subElement = NULL, noChildren = NULL,
  htmlLocation = NULL, id = NULL, encoding = "UTF-8",
  metadata = NULL, export = FALSE, maxTitleCharacters = 80,
  removeString = NULL, customXpath = NULL,
  removeEverythingAfter_pre = NULL, removeEverythingBefore_pre = NULL,
  removeEverythingBefore = NULL, removeEverythingAfter = NULL,
  keepEverything = FALSE, removePunctuationInFilename = TRUE,
  removeTitleFromTxt = FALSE, titles = NULL, importParameters = NULL,
  exportParameters = TRUE, progressBar = TRUE, project = NULL,
  website = NULL)

Arguments

container	Defaults to NULL. If provided, it must be an html element such as "div", "span", etc.
containerClass	Defaults to NULL. If provided, `container` must also be given (and `containerId` must be NULL). Only text found inside the provided combination of container/class will be extracted.
containerId	Defaults to NULL. If provided, `container` must also be given (and `containerClass` must be NULL). Only text found inside the provided combination of container/id will be extracted.
subElement	Defaults to NULL. If provided, `container` must also be given. Only text within elements of the given type under the chosen combination of container/containerClass will be extracted. When given, it will typically be "p", to extract all p elements inside the selected div.
noChildren	Defaults to NULL. By default, all subelements of the selected combination (e.g. div with given class) are extracted. If TRUE, only text found directly under the given combination (but not in its subelements) will be extracted. Corresponds to the xpath string `/node()[not(self::div)]`.
htmlLocation	Path to the folder where html files, typically downloaded with DownloadContents(links), are stored. If not given, it defaults to the Html folder inside project/website folders.
id	Defaults to NULL. If provided, it should be a vector of integers. Only html files corresponding to given id in the relevant htmlLocation will be processed.
metadata	Defaults to NULL. A data.frame, typically created with ExportMetadata(), including information on all articles. The number of rows must correspond to the number of articles to be processed. This is required when export == TRUE, in order to provide meaningful filenames.
export	Logical, defaults to FALSE. If TRUE, textual contents are saved as individual txt files in a dedicated folder. Filename is based on the metadata.
maxTitleCharacters	Maximum number of characters allowed in the title. Defaults to 80.
removeString	A character vector of one or more strings. Provided strings are removed from each article.
removeEverythingAfter_pre	Defaults to NULL. Everything after this string is removed before processing the HTML file.
removeEverythingBefore_pre	Defaults to NULL. Everything before this string is removed before processing the HTML file.
keepEverything	Logical. If TRUE, extracts all visible text.
removePunctuationInFilename	Logical, defaults to TRUE. If TRUE (and export == TRUE), it removes punctuation signs from filenames to prevent errors in saving files.
progressBar	Logical, defaults to TRUE. If FALSE, progress bar is not shown.

Value

A character vector of text, and individual articles saved as txt files in a dedicated folder if 'export' is set to TRUE.

Examples

# NOT RUN {
text <- ExtractText(container = "div", containerClass = "article")
# }
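
The selection arguments described above (container, containerClass, subElement, noChildren) can be illustrated with plain XPath queries. The sketch below is not the internal code of ExtractText(); it assumes the xml2 package is installed and uses a made-up HTML snippet, only to show what each kind of selection matches.

```r
library(xml2)

# Hypothetical HTML snippet, used only for illustration
html <- read_html(paste0(
  '<html><body>',
  '<div class="article">Lead text <p>First paragraph.</p><div>Nested box</div></div>',
  '<div class="footer">Footer</div>',
  '</body></html>'
))

# container = "div", containerClass = "article":
# all text inside the matching div, including its subelements
all_text <- xml_text(xml_find_all(html, "//div[@class='article']"))

# subElement = "p": only the p elements inside the matching div
only_p <- xml_text(xml_find_all(html, "//div[@class='article']//p"))

# noChildren = TRUE: child nodes of the matching div, excluding nested divs
# (the xpath fragment quoted in the noChildren argument description)
no_children <- xml_text(
  xml_find_all(html, "//div[@class='article']/node()[not(self::div)]")
)
```

Here the noChildren-style query skips the nested div while keeping the text and p element that sit directly under the selected container.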
