Extracts dates from a vector of html files

Extracts dates from a vector of HTML files and returns them as a vector of class Date.

ExtractDates(dateFormat = "dmY", container = NULL,
  containerClass = NULL, containerId = NULL,
  containerInstance = NULL, htmlLocation = NULL, inputVector = NULL,
  id = NULL, customXpath = NULL, customRegex = NULL,
  attribute = NULL, language = Sys.getlocale(category = "LC_TIME"),
  customString = "", minDate = NULL, maxDate = NULL,
  encoding = "UTF-8", keepAllString = FALSE,
  removeEverythingBefore = NULL, progressBar = TRUE,
  exportParameters = TRUE, importParameters = NULL, project = NULL,
  website = NULL)

Arguments

dateFormat	A string expressing the date format. In line with standards (see ?strptime), 'd' stands for day, 'm' for month in figures, 'b' and 'B' for months spelled out as words (abbreviated and full names, respectively, in ?strptime), 'y' for year without the century, and 'Y' for year with four digits. Standard separation marks among parts of the date (e.g. '-', '/', '.') should not be included. The following date formats are available: "dmY" (default), "Ymd", "dbY", "YBd", "dB,Y", "db.'y", "Bd,Y", and "xdBY" (customString must be provided).
containerInstance	Defaults to NULL. If given, it must be an integer. If a given element is found more than once in the same page, it keeps only the relevant occurrence for further extraction.
inputVector	Defaults to NULL. If provided, the function parses the given character vector instead of looking for downloaded HTML files.
id	Defaults to NULL. If provided, it should be a vector of integers. Only HTML files corresponding to the given id in the relevant htmlLocation will be processed.
customRegex	Defaults to NULL. If provided, the regex parsing done before extraction will follow this formula, e.g. `[[:digit:]][[:digit:]][[:punct:]][[:space:]][[:digit:]][[:digit:]][[:punct:]][[:space:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]]`.
attribute	Defaults to NULL. Can be specified only if customXpath is given, in order to extract a given attribute, e.g. customXpath = "//meta[@property='article:published_time']" together with attribute = "content".
language	Provide a language in order to extract the names of months. Defaults to the locale currently active in R (usually, the system language). Generic forms such as "english" or "russian" are usually accepted. See ?locales for more details. On Linux, you can run system("locale -a", intern = TRUE) to see all available locales.
minDate, maxDate	Minimum and maximum possible dates in the format year-month-day, e.g. "2007-06-24". Introduces NA in the place of impossibly high or low dates.
encoding	Defaults to 'UTF-8'. If the source is not in UTF-8, its encoding can be specified here. A list of valid values can be found using iconvlist().
keepAllString	Logical, defaults to FALSE. If TRUE, it directly tries to parse the date with the given dateFormat, without trying to polish the string provided accordingly.
progressBar	Logical, defaults to TRUE. If FALSE, progress bar is not shown.
exportParameters	Defaults to TRUE. If TRUE, function parameters are exported in the project/website folder. They can be used to update the corpus. Requires parameters project/website.
importParameters	Defaults to NULL. If TRUE, ignores all parameters given in the function call, and imports them from parameters file stored in "project/website/Logs/parameters.rds".
project	Name of 'castarter' project. Must correspond to the name of a folder in the current working directory. Defaults to NULL, required for storing export parameters (with exportParameters = TRUE). This can be left blank if previously set with SetCastarter(project = "project", website = "website").
website	Name of a website included in a 'castarter' project. Must correspond to the name of a sub-folder of the project folder. Defaults to NULL, required for storing export parameters (with exportParameters = TRUE). This can be left blank if previously set with SetCastarter(project = "project", website = "website").

Value

A vector of the Date class.
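
The shape of the returned value, and the effect described for minDate and maxDate, can be pictured with a minimal base R sketch. It does not call ExtractDates and is not the package's own implementation; the sample dates and the bounds are made up for illustration.

```r
# Illustration only: base R, hypothetical values, not castarter internals.
dates <- as.Date(c("2007-06-24", "1907-06-24", "2107-06-24"))

# Hypothetical bounds, given in year-month-day form as for minDate/maxDate.
minDate <- as.Date("2000-01-01")
maxDate <- as.Date("2020-12-31")

# Dates outside the allowed range are replaced by NA; the class stays Date.
dates[dates < minDate | dates > maxDate] <- NA
print(dates)
```

The vector keeps its length, so each element still lines up with the file it came from.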

Examples

# NOT RUN {
dates <- ExtractDates()
# }
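
As a rough illustration of what the customRegex pattern quoted in the Arguments section matches, the following base R sketch isolates a date string from some text and parses it. It does not call ExtractDates and only assumes that a day-month-year order corresponds to "%d", "%m" and "%Y" in ?strptime; the sample text is invented.

```r
# Illustration only: base R, not castarter's internal code.
page_text <- "Press release, 24. 06. 2007, Brussels"  # invented sample text

# The pattern quoted in the customRegex description, split for readability.
date_regex <- paste0(
  "[[:digit:]][[:digit:]][[:punct:]][[:space:]]",
  "[[:digit:]][[:digit:]][[:punct:]][[:space:]]",
  "[[:digit:]][[:digit:]][[:digit:]][[:digit:]]"
)

# Keep only the first substring that matches the pattern.
date_string <- regmatches(page_text, regexpr(date_regex, page_text))

# Parse the isolated string as day, month, four-digit year.
parsed <- as.Date(date_string, format = "%d. %m. %Y")
print(parsed)
```

Isolating the date string first keeps surrounding text from interfering with the parsing step.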
