castarter is designed to make it easy, even for relatively inexperienced users, to create a textual dataset from a website or a section of a website, keep it up to date, and explore it through word frequency graphs or a web interface that makes it possible to tag items.
Documentation is available on
Books dedicated to content analysis typically assume that the researcher has already created, has access to, or may buy access to a structured dataset. They may include sections on sampling (Krippendorff 2004; Riffe, Lacy, and Fico 2005), but they do not explicitly discuss how new datasets can be built. Commercial software packages generally share the same expectation. In the R ecosystem, a number of packages can be used to access existing datasets (e.g.
manifestoR) or to import textual content from social media into R through their APIs (e.g.
However, there is no package dedicated to importing the textual contents of regular websites into R and extracting key metadata (date and title) in order to populate a corpus that can be analysed in R or with other software packages. ‘castarter - content analysis starter toolkit for R’ aims to accomplish just that, offering moderately incompetent users the opportunity to extract textual contents from websites and prepare them for content analysis. It allows users to export datasets in a number of standard formats for further processing, and it facilitates basic word frequency analysis in R.
castarter will be particularly beneficial for relatively inexperienced users. Many of the functions it offers will look trivial to expert R users. Yet even experienced users will benefit from some of
castarter’s convenience functions, including its ability to download and parse pages systematically (and to resume a download process if it has been interrupted), and to automatically keep a dataset up to date.
For further debate on the usefulness of a structured approach to analysing web contents in the context of area studies, and for some practical examples of how
castarter has been used, see:
Comai, Giorgio (2017). Quantitative Analysis of Web Content in Support of Qualitative Research. Examples from the Study of Post-Soviet De Facto States, Studies of Transition States and Societies, 9(1), 14-34. http://publications.tlu.ee/index.php/stss/article/view/346/446.
Given a few basic criteria, it allows users to:
tidytext package, but it can easily be adapted for further analysis with the
quanteda packages among others)
As parameters for downloading web pages and extracting metadata and text are stored by default, it is easy to keep the dataset up-to-date with the dedicated
UpdateDataset() function (which parses index pages for new links, downloads them, and adds them to the latest dataset).
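The update logic described above can be sketched in base R as follows. This is only an illustration of the idea, not castarter's implementation: the `dataset` and `index_links` objects and the example URLs are hypothetical.

```r
# Hypothetical existing dataset: one row per article already downloaded.
dataset <- data.frame(
  link = c("https://example.com/news/1", "https://example.com/news/2"),
  stringsAsFactors = FALSE
)

# Hypothetical links found by parsing the index pages again.
index_links <- c(
  "https://example.com/news/3",
  "https://example.com/news/2",
  "https://example.com/news/1"
)

# Keep only the links that are not yet part of the dataset.
new_links <- setdiff(index_links, dataset$link)

# A real workflow would download and parse the new pages at this point;
# here the new links are simply appended to the existing dataset.
updated <- rbind(dataset, data.frame(link = new_links, stringsAsFactors = FALSE))
```

Because only links missing from the stored dataset are processed, repeating the update does not download the same page twice.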
castarter facilitates basic word frequency analysis with a dedicated series of functions.
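As a rough illustration of what such an analysis involves, the base R sketch below counts how often a term appears on each date. The `articles` data frame, its `date` and `text` columns, and the `count_term()` helper are hypothetical and are not castarter functions.

```r
# Hypothetical example data: one row per article.
articles <- data.frame(
  date = as.Date(c("2018-01-10", "2018-01-10", "2018-02-05")),
  text = c("Trade talks continue.", "New trade deal: trade grows.", "Talks on security."),
  stringsAsFactors = FALSE
)

# Count whole-word, case-insensitive occurrences of a term in each text.
count_term <- function(text, term) {
  words <- strsplit(tolower(text), "[^[:alnum:]]+")
  vapply(words, function(w) sum(w == tolower(term)), integer(1))
}

articles$n <- count_term(articles$text, "trade")

# Sum the counts by date to obtain a simple time series.
frequency_by_date <- aggregate(n ~ date, data = articles, FUN = sum)
```

The resulting `frequency_by_date` data frame has one row per date and can be passed to any plotting function.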
castarter functionalities are also available through web interfaces, further facilitating the creation and analysis of datasets. This is meant to make
castarter usable also by users who are not familiar with R and are not able to code.
To see them in action, run the following commands, which download an example dataset and store locally all press releases issued by the Kremlin and available on its website.
    library("castarter")
    devtools::install_github(repo = "giocomai/castarterpresidents")
    #> Skipping install of 'castarterpresidents' from a github remote, the SHA1 (3c809129) has not changed since last install.
    #> Use `force = TRUE` to force installation
    CreateFolders(project = "presidents", website = "kremlin_en")
    SaveWebsite(dataset = castarterpresidents::kremlin_en, project = "presidents", website = "kremlin_en")
    #> Dataset saved in castarter/presidents/kremlin_en/2018-10-07-presidents-kremlin_en-dataset.rds
CreateDataset() will make it possible to carry out the whole procedure of extracting contents from a website directly from a web interface, without requiring any knowledge of R or coding.
It is not yet functional.
AnalyseDataset() makes it possible to interactively explore a textual dataset by creating word frequency time series for terms given by the user (or to compare different terms when they are comma-separated). It makes it easy to explore not only word frequency, but also the actual contents, by presenting the sentences that include a given keyword in an interactive table below the graph.
Users can choose which datasets to include in the analysis among those stored locally through an interactive interface.
It may acquire new functionalities (including convenience functions for sharing the interface), but it is already fully functional.
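The two ideas behind this interface, splitting comma-separated terms and showing the sentences that mention them, can be sketched in base R as follows. This is a simplified illustration rather than the app's actual code; the `input` string and the example `text` are hypothetical.

```r
# Hypothetical user input, as it might be typed into a search box.
input <- "Russia, Europe"
terms <- trimws(strsplit(input, ",", fixed = TRUE)[[1]])

text <- "Russia signed the agreement. Talks with Europe continue. The weather was fine."

# Split the text into sentences at whitespace that follows ., ! or ?
sentences <- strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)[[1]]

# For each term, keep the sentences that mention it (case-insensitive).
matches <- lapply(terms, function(term) {
  sentences[grepl(term, sentences, ignore.case = TRUE)]
})
names(matches) <- terms
```

Each element of `matches` holds the sentences for one term, which is the kind of content an interactive table below a graph could display.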
ReadAndTag() facilitates reading through the dataset, tagging articles according to different criteria, and filtering the available articles by either keyword or tag.
This can be used to skim quickly through a dataset, or to conduct structured qualitative analysis of the dataset. Tags are stored automatically in the interactive session.
This app is already functional, even if not thoroughly tested (make sure everything works as expected if you intend to use it for bigger projects).
It will be possible to do some basic analysis of the tags within the app, and to export the tagging in standard formats.
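Tagging and filtering of this kind can be modelled in base R roughly as follows. This is a minimal sketch under assumptions: the `articles` and `tags` data frames and the `add_tag()` helper are hypothetical and do not reflect how the app stores its tags.

```r
# Hypothetical articles to read through.
articles <- data.frame(
  id = 1:3,
  title = c("Visit to Sochi", "Statement on trade", "Trade forum opens"),
  stringsAsFactors = FALSE
)

# Tags kept as a long table: one row per (article, tag) pair.
tags <- data.frame(id = integer(0), tag = character(0), stringsAsFactors = FALSE)

# Hypothetical helper that records a tag for an article.
add_tag <- function(tags, id, tag) {
  rbind(tags, data.frame(id = id, tag = tag, stringsAsFactors = FALSE))
}

tags <- add_tag(tags, 2L, "economy")
tags <- add_tag(tags, 3L, "economy")
tags <- add_tag(tags, 1L, "travel")

# Filter articles by keyword in the title, or by tag.
by_keyword <- articles[grepl("trade", articles$title, ignore.case = TRUE), ]
by_tag <- articles[articles$id %in% tags$id[tags$tag == "economy"], ]
```

Keeping tags in a separate long table means an article can carry several tags, and the table can be written out with `write.csv()` for use in other software.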
Most of the functionality that
castarter is ever likely to offer has already been implemented, in full or in part.
Enhancements to current functions will likely focus on:
UpdateDataset() for keeping multiple datasets up to date
Forthcoming releases will likely include fully functional and enhanced versions of the following shiny apps:
- `CreateDataset()` - to create new textual datasets
- `AnalyseDataset()` - to conduct basic quantitative content analysis based on word frequency and time series
- `ReadAndTag()` - to conduct basic qualitative content analysis through tagging
Other planned features include:
castarter can easily be installed from GitHub with
To use interactive web interfaces, you will need to have installed on your system also the
Some examples of analysis of media contents conducted with
castarter are available on the author’s blog: