
Overview: Tools for working with digital trace data

Johannes Breuer
Senior Researcher, GESIS - Leibniz Institute for the Social Sciences
Original post on 25.04.2025 by Johannes Breuer

What is digital trace data?

Online discourse can be researched with a variety of methods and types of data. A widely used data type in this domain is digital trace data. These can be defined as "'records of activity (trace data) undertaken through an online information system (thus, digital)' (Howison et al., 2011, p. 769) and can be collected from a multitude of technical systems, such as websites, social media platforms, smartphone apps, or sensors" (Stier et al., 2020, p. 504). Social media data is, in large part, digital trace data, but, as this definition shows, the category is broader. The definition also makes apparent that digital trace data can be further categorized in different ways. One way is distinguishing between sources of digital trace data, such as social media platforms, websites, or sensors. Another is to categorize the data based on the type of information they contain. For example, social media data can be categorized based on their modality (text, image, audio, video) or their structure (e.g., network or metadata).

On a more abstract level, Menchen-Trevino (2013) makes a distinction between participation traces (e.g., comments or posts) and transactional data (e.g., login data or logs more generally). Similarly, Hox (2017) differentiates between intentional and unintentional traces. Regarding the scope of the data, Menchen-Trevino (2013) further distinguishes between horizontal (e.g., all posts from a social media platform containing a specific hashtag) and vertical trace data (comprehensive usage data - potentially from different sources - for a limited group of users). The Data Knowledge Hub section on "Social media data types" combines some of these categorizations by dividing social media data into content data, interaction data, user data, and metadata (plus three subtypes for each of those).

A variety of available tools

Regardless of the specific type, when working with digital trace data we make use of different tools. Typically, we work with combinations of tools, or tool stacks. The purpose of this entry is to provide an overview of tools that can be used across the research data cycle when working with digital trace data: from data collection to preprocessing, analysis and visualization, reporting, and archiving and sharing data. In addition to tools that can be used specifically for one of those steps, the next section also discusses general-purpose tools that can be used across different phases of the research data cycle. After introducing the different tool options, this entry ends with an outlook on the new world of AI(-based) tools and suggests some criteria for choosing the "right" tools.

A few things should be noted when reading this entry:

  • This entry is meant to provide a broad overview of tools. It does not provide any in-depth introductions. There are other entries here in the Data Knowledge Hub that provide more in-depth introductions to specific methods and/or tools, and there are hundreds of introductions and tutorials on the tools mentioned in this entry. For working with digital trace data in general, the GESIS Guides to Digital Behavioral Data can offer some further guidance, covering various aspects, e.g., related to data collection or data management and ethics.
  • Most of the tools discussed in the following are not specifically created for working with digital trace data or studying online discourse. They can be used for a variety of purposes. However, they are suitable options for being included in a tool stack for research (on online discourse) with digital trace data.
  • The tools are not listed in any particular order, and the list is not exhaustive.
  • The focus is on open-source tools, but some commercial tools are also included.

General-purpose tools

Research using digital trace data (on online discourse as well as other topics) typically involves writing code for one or more of the steps in the research data cycle, most commonly for data preprocessing, analysis, and visualization. The most widely used programming languages for this are Python and R. Both languages have many packages and libraries that can be used for a wide range of tasks across the research data cycle. This makes them highly extensible and versatile.

When working with R, Python, or any other programming language, it makes sense to use an integrated development environment (IDE) that provides an interface for writing and executing code. There are many different IDEs available. The most widely used one for R is RStudio. Its potential successor Positron also offers full support for Python. A popular IDE for Python is PyCharm, and another widely used option that works with, and offers specific extensions for, many programming languages (including R and Python) is Visual Studio Code.

Another general-purpose tool that is especially helpful with regard to transparency and reproducibility is the version control system git. As for R and Python, there are plenty of introductions to git available online, but a good starting point is the official documentation. git is particularly useful for sharing research output (especially code) and collaborating with others if used in combination with a public git hosting service. The most popular one is GitHub; another widely used option is GitLab. A very comprehensive resource focusing on the use of git and GitHub in combination with R and RStudio is the online guide "Happy Git and GitHub for the useR".

Moving from the general to the more specific, in the following, I will provide an overview of tools that can be used for the different steps in the research data cycle. Notably, many tools can be used for several steps. In such cases, the respective tools will be discussed in the section that is most relevant for their main purpose. Another thing to take into account is that the distinction between the different steps is not always clear-cut. For example, data collection and data preprocessing are often closely intertwined. It also depends on the specific research question or use case whether a tool is used for data collection, preprocessing, or analysis and visualization.

Data collection

There are different options for collecting digital trace data and all of those have their own pros and cons (Breuer et al., 2020; Ohme et al., 2023). The most widely used options in research with digital trace data in general and research on online discourse in particular are APIs (application programming interfaces) and web scraping. However, there are also other options, such as reusing existing data(sets), direct collaborations with platforms, or data donations from users.
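
To give a flavor of what the web scraping approach can look like in practice, here is a minimal sketch in Python using the requests and BeautifulSoup libraries. The URL and the CSS selector are hypothetical placeholders, and before scraping you should always check a site's terms of service and robots.txt.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical thread URL; replace with a page you are actually allowed to scrape
url = "https://example.com/forum/thread/123"
response = requests.get(
    url,
    headers={"User-Agent": "research-project/0.1"},  # identify your project politely
    timeout=30,
)
response.raise_for_status()

# The CSS class "post-text" is a placeholder; inspect the page to find the real one
soup = BeautifulSoup(response.text, "html.parser")
posts = [p.get_text(strip=True) for p in soup.select(".post-text")]
print(f"Collected {len(posts)} posts")
```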

There are several entries within the Data Knowledge Hub that provide detailed guidance on some specific methods for collecting digital trace data. API access, (web) scraping, and data donations are discussed in the Data Knowledge Hub entry on "Common data collection methods on social media platforms illustrated with TikTok". Another entry on "Data Collection on X (Twitter)" is an in-depth guide on how to collect data from X (Twitter) using the APIs offered by the platform, and there is also a section on "Webscraping techniques with R". Besides that, the aforementioned GESIS Guides to Digital Behavioral Data series also includes guides on collecting data via APIs and web scraping in general, as well as for specific platforms, such as YouTube or Google Trends.

Different types of data and different collection methods require different tools. There are various collections and overviews of tools for social media data. While these lists can quickly become outdated because tools are no longer maintained, new tools are developed, or the platforms that the tools are created for change their APIs, they can still be helpful for finding, comparing, and choosing data collection tools. Typically, these lists also provide some criteria that can be used for selecting tools. The guide by Deubel et al. (2023), e.g., provides information on what platform(s) the tool can be used for, whether it makes use of an API, requires programming skills (and in what language), offers a graphical user interface (GUI) and analysis features, and whether it is free and open source (FOS) or not. Other tool overviews are provided in the Wiki of the Social Media Observatory at the Leibniz Institute for Media Research, the list of Social Media Research Tools by Brandon Silverman and Chris Miles, or the Social Media Research Toolkit by the Social Media Lab at Ted Rogers School of Management, Toronto Metropolitan University.

Besides choosing the appropriate tool(s) for collecting digital trace data, it is also important to consider legal aspects as well as the ethical implications of data collection. The relevant legal and ethical considerations for working with social media data are discussed in detail in the respective sections within the Data Knowledge Hub. For digital trace data in general, the guide by Breuer et al. (2025) gives an overview and some recommendations for ethical questions related to doing research with this type of data.

Data preprocessing

Which preprocessing steps are required depends strongly on the type(s) of data being used. This, of course, also determines which tools are appropriate for the task. For textual data, the Data Knowledge Hub entry on "How to analyse social media data" provides a good overview and also points to some further resources. Sticking with the example of text data, there are many packages and libraries available for text mining and natural language processing (NLP) in R and Python (as well as other programming languages). A very powerful and widely used NLP library for Python is spaCy, which also has an R implementation called spacyr. spacyr is part of the Quanteda ecosystem, a very popular family of packages for the processing and quantitative analysis of textual data in R. Another widely used R package for text mining is tidytext.
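
As an illustration, here is a minimal text preprocessing sketch using spaCy in Python. The example sentence is made up, and the small English model is assumed to be installed.

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Researchers collected 10,000 posts about the election. #voting")

# Typical steps: lemmatize, lowercase, drop stop words, punctuation, and whitespace
tokens = [
    token.lemma_.lower()
    for token in doc
    if not token.is_stop and not token.is_punct and not token.is_space
]
print(tokens)
```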

The analysis of audio, image, or video data requires other or additional preprocessing steps. A common approach is to transform such data into textual data via speech recognition or computer vision tools. A popular speech recognition model is Whisper by OpenAI, which can be used in Python, via a command-line tool, or through an independently developed R package. There are also many tools and libraries for image and video analysis. For Python, OpenCV is a widely used library. For R, an interesting new package for image analysis that makes use of large language models (LLMs) is kuzco. While the landscape of tools and libraries for image and video analysis is changing quickly, the book on images as data by Webb Williams et al. (2020) still provides a good introduction.
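
For example, transcribing an audio file with the open-source Whisper package takes only a few lines of Python. This is a minimal sketch: the file name is a placeholder, and the openai-whisper package (plus ffmpeg) needs to be installed.

```python
import whisper  # pip install openai-whisper (also requires ffmpeg)

# "interview.mp3" is a placeholder file name
model = whisper.load_model("base")  # larger models are more accurate but slower
result = model.transcribe("interview.mp3")
print(result["text"])
```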

Data analysis & visualization

As for preprocessing, what types of analysis and visualization are possible and sensible heavily depends on a) the specific type(s) of data being used and b) the respective research question(s). For textual data, common analysis methods include, e.g., supervised machine learning and topic modeling (see the "Overview: How to analyse social media data" for an introduction to these as well as other methods for text analysis). The other data analysis projects included in the Data Knowledge Hub provide further examples of analysis methods for social media data, such as social network analysis and hashtag analysis.

Notably, many of the tools and libraries/packages mentioned in the previous two sections can also be used for analyzing and visualizing digital trace data, but there are also many other options, especially for Python and R. Regarding data visualization in general, two very popular and versatile options are matplotlib for Python and ggplot2 for R. A very helpful resource for choosing the right data visualization that also provides code examples for both Python and R is the website From Data to Viz. For analyzing and visualizing network data, which is also widely used in research on online discourse, two widely used options are Gephi and igraph, the latter of which offers libraries for Python and R.
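
As a small illustration, the following sketch builds a tiny, made-up retweet network with python-igraph, computes in-degrees, and draws the graph via igraph's matplotlib backend. The edge list and user names are purely hypothetical.

```python
import igraph as ig
import matplotlib.pyplot as plt

# Hypothetical retweet edge list: (retweeting user, retweeted user)
edges = [("ana", "ben"), ("carl", "ben"), ("dina", "ana"), ("ben", "ana")]
g = ig.Graph.TupleList(edges, directed=True)

# In-degree here: how often a user was retweeted
for name, indegree in zip(g.vs["name"], g.degree(mode="in")):
    print(name, indegree)

# Draw the network using python-igraph's matplotlib backend
fig, ax = plt.subplots()
ig.plot(g, target=ax, vertex_label=g.vs["name"])
plt.show()
```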

Reporting

There are many digital tools and pieces of software that can be used for reporting research (on online discourse) with digital trace data. Given the complexity of the data and methods commonly used in this field, two particularly suitable formats are notebooks and dashboards. Notebooks represent a popular type of literate programming framework that combines text, code, and its output (e.g., figures) in one document. Dashboards are interactive web applications that allow users to explore data and results in a more dynamic way. The following table provides an overview of some of the most popular tools for creating notebooks and dashboards.

| Tool | Format(s) | Supported languages | Costs |
| --- | --- | --- | --- |
| Quarto | Notebook, Dashboard | Python, R, Julia, Observable JavaScript | FOS¹ |
| Jupyter | Notebook | Julia, Python, R | FOS², ³ |
| RMarkdown | Notebook, Report | R | FOS¹ |
| marimo | Notebook | Python, SQL | FOS¹ |
| Shiny | Dashboards, Web Apps | R, Python | FOS¹ |
| Google Colab | Notebook | Focus on Python, but several others work as well (including R) | Freemium |
| Observable | Notebook, Dashboard | Observable JavaScript | Paid hosting via ObservableHQ |
| Tableau | Dashboard | GUI and proprietary visual query language (VizQL), but offers integration with SQL, Python, R | Commercial |
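
To give an impression of how lightweight these tools can be, here is a minimal dashboard sketch using Shiny for Python. A real dashboard would, of course, read actual data; the slider and text output here are just placeholders.

```python
from shiny import App, render, ui

# Minimal Shiny for Python app; run with: shiny run app.py
app_ui = ui.page_fluid(
    ui.input_slider("n", "Number of posts to show", min=1, max=100, value=10),
    ui.output_text_verbatim("summary"),
)

def server(input, output, session):
    @render.text
    def summary():
        # Reactively updates whenever the slider value changes
        return f"Showing the {input.n()} most recent posts."

app = App(app_ui, server)
```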

Notably, as is the case for all other sections, this list is not exhaustive. There are many more tools and services for creating (and publishing) notebooks, dashboards, or 'reproducible reports', especially commercial ones (Curvenote is one example). Some of the tools (including a few from the table above) have also started integrating AI-based features (e.g., Deepnote); AI-based tools are discussed in more detail in the outlook section below.

Archiving & sharing data

The usual output of a research project is one or more reports in the form of an article, blog post, and/or an interactive notebook or dashboard (using one of the tools mentioned in the previous section). However, there are also other research products that can and should be shared. Apart from the code, which can, e.g., be shared via GitHub, another important product is the data that is collected and analyzed. Archiving and sharing data is important to ensure and increase the transparency and reproducibility of research.

As in the data collection phase, ethical and legal questions become especially relevant when it comes to archiving and sharing digital trace data. In addition to the resources mentioned before, the book chapter by Bishop and Gray (2017) provides some general guidance on archiving and sharing digital trace data (and social media data in particular). It is important to clarify whether and how the data can (legally) and should (ethically) be shared. However, in most cases, even if the full/raw data cannot be shared, it should be possible to at least share a processed or aggregated version of the data that allows for a reproduction of the reported analyses.
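
As a simple illustration of that last point, the following Python sketch uses pandas to aggregate hypothetical post-level data (which might be too sensitive to share in raw form) into a daily summary table that could be shared alongside a publication. All variable names and values are made up.

```python
import pandas as pd

# Hypothetical post-level data that may be too sensitive to share in raw form
posts = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u3"],
    "date": pd.to_datetime(["2025-01-01", "2025-01-01", "2025-01-02", "2025-01-02"]),
    "toxicity": [0.1, 0.7, 0.3, 0.2],
})

# Aggregate to the daily level: drops user identifiers and individual posts
daily = (
    posts.groupby("date")
    .agg(n_posts=("user_id", "count"), mean_toxicity=("toxicity", "mean"))
    .reset_index()
)
daily.to_csv("daily_aggregates.csv", index=False)
```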

There are many options to choose from for archiving and sharing research data. The Registry of Research Data Repositories (re3data) is a dedicated search engine for finding research data repositories. Repositories can be general (e.g., the OSF, Zenodo, or the Harvard Dataverse) or specific to a discipline, country, or data type. Depending on your needs, criteria to consider when choosing one include where the archive is situated, whether it focuses on specific types of data, whether it uses persistent identifiers (such as DOIs), and whether it allows for access control. Klein et al. (2018) suggest some further criteria for choosing a repository for archiving and sharing research data in general.

Regardless of the chosen repository, the data should ideally be shared in a way that is consistent with the FAIR data principles (Wilkinson et al., 2016), meaning that the data are findable, accessible, interoperable, and reusable. In addition to choosing a suitable repository, this also brings requirements with regard to the format and documentation of the data (see Breuer et al., 2021 for a discussion of this for social media data).
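
One small, concrete step toward findability and reusability is shipping a machine-readable codebook with the data. The following sketch writes such a file in Python; it is a hypothetical minimal example, not a formal metadata standard (real projects may want to use established standards such as DDI).

```python
import json

# Hypothetical minimal codebook for the aggregated data from the earlier sketch
metadata = {
    "title": "Daily aggregated toxicity scores (hypothetical example)",
    "created": "2025-04-25",
    "license": "CC-BY-4.0",
    "variables": {
        "date": "Calendar day (YYYY-MM-DD)",
        "n_posts": "Number of posts collected on that day",
        "mean_toxicity": "Mean toxicity score across posts (range 0-1)",
    },
}

with open("daily_aggregates_metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```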

Outlook: AI(-based) tools

The advent of large language models (LLMs) and generative AI (genAI) also has a major impact on tools for research with digital trace data. There are AI(-based) tools that can be used for (almost) every task in the research process, even going beyond the phases included here (e.g., also for searching and summarizing relevant literature). As with other tools, some AI(-based) tools are general-purpose, while others are specifically designed for certain tasks (Breuer, 2023).

Since research with digital trace data typically involves writing code for one or more steps in the research process, coding assistance tools like GitHub Copilot⁴, Claude Code by Anthropic, or Gemini Code Assist by Google can be helpful. LLMs are also becoming increasingly popular for preprocessing and analyzing textual data (Törnberg, 2024) as well as image and video data (Jo et al., 2024). The use of LLMs for text analysis is also discussed in the "Overview: How to analyse social media data".
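
To make the text annotation use case concrete, here is a minimal zero-shot classification sketch using the official OpenAI Python client. The model name, prompt, and example post are purely illustrative, and an API key is assumed to be set in the environment.

```python
from openai import OpenAI  # pip install openai; expects OPENAI_API_KEY in the environment

client = OpenAI()

post = "This is the worst take I have ever read."  # made-up example post

# Zero-shot annotation; the model name and prompt are illustrative only
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "Classify the sentiment of the user's post as positive, "
                       "neutral, or negative. Reply with one word.",
        },
        {"role": "user", "content": post},
    ],
)
print(response.choices[0].message.content)
```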

Given the speed of development in this area, it is hard to keep track of new models and tools, which are released almost daily. However, the list "AI Tools for Research Workflow in Academia" by Niels Van Quaquebeke, which is still updated fairly frequently, can serve as a good overview and starting point.

One thing that should generally be noted when using AI(-based) tools is that, while they can be very useful and facilitate steps in the research process, they are also associated with specific challenges. Besides potential costs, the most notable ones are a lack of transparency, (data) privacy concerns, and a possible (increased) dependency on commercial providers.

Choosing the right tools

As should be evident from the previous sections, there is a large variety of tools to choose from for research (on online discourse) with digital trace data (and the combination possibilities are practically endless). There are, however, some important criteria to consider when choosing the right tools for your project. In general, there are three main aspects to consider:

  1. The specific task(s) that the tools are meant to accomplish: For example, if you want to collect data from a specific social media platform, you need to choose a tool that is able to do so.

  2. Available resources. This relates to the time and money you have available for your project, but also to skills and expertise. Some tools are free, while others require a subscription or a one-time payment. Some tools are easy to use, while others require more technical knowledge.

  3. Requirements. There may be requirements from employers, clients, or collaborators that need to be considered. Some institutions may require the use of specific tools or solutions for data processing, analysis, or storage. In addition, there may be legal requirements that need to be considered when choosing tools, such as data protection regulations.

Since we typically work with combinations of different tools (tool stacks), another relevant aspect to consider is the compatibility of the tools. Some tools are designed to work together, while others may require additional steps to integrate them.

Finally, to increase the trustworthiness of research results, transparency and reproducibility are important aspects to consider when choosing tools. According to the glossary of the Framework for Open and Reproducible Research Training (FORRT), transparency means "[h]aving one’s actions open and accessible for external evaluation". The Turing Way defines reproducible research as "work that can be independently recreated from the same data and the same code that the original team used". In general, FOS tools are better suited for ensuring transparency and reproducibility. Another important aspect in this regard is documentation, both of the tools themselves and of the ways in which they are used in specific projects.

References

  • Bishop, L., & Gray, D. (2017). Chapter 7: Ethical Challenges of Publishing and Sharing Social Media Research Data. In K. Woodfield (Ed.), Advances in Research Ethics and Integrity (Vol. 2, pp. 159–187). Emerald Publishing Limited. https://doi.org/10.1108/S2398-601820180000002007

  • Breuer, J. (2023). Putting the AI into social science – How artificial intelligence tools are changing and challenging research in the social sciences. In A. Sudmann, A. Echterhölter, M. Ramsauer, F. Retkowski, J. Schröter, & A. Waibel (Eds.), Beyond Quantity. Research with Subsymbolic AI (pp. 255–273). transcript. https://www.transcript-open.de/pdf_chapter/bis%206999/9783839467664/9783839467664-014.pdf

  • Breuer, J., Bishop, L., & Kinder-Kurlanda, K. (2020). The practical and ethical challenges in acquiring and sharing digital trace data: Negotiating public-private partnerships. New Media & Society, 22(11), 2058–2080. https://doi.org/10.1177/1461444820924622

  • Breuer, J., Borschewski, K., Bishop, L., Vávra, M., Štebe, J., Strapcova, K., & Hegedűs, P. (2021). Archiving Social Media Data: A guide for archivists and researchers. https://doi.org/10.5281/zenodo.6517880

  • Breuer, J., Stier, S., Lukito, J., Mangold, F., Wieland, M., & Radovanovic, D. (2025). Overview of Ethical Considerations when Working with Digital Behavioral Data (No. 14; GESIS Guides to Digital Behavioral Data). https://www.gesis.org/fileadmin/admin/Dateikatalog/pdf/guides/14_Breuer_et_al._Overview_Ethics_DBD.pdf

  • Deubel, A., Breuer, J., & Weller, K. (2023). Collecting Social Media Data: Tools for Obtaining Data from Social Media Platforms (No. 1; Navigating Research Data and Methods). Center for Advanced Internet Studies. https://www.cais-research.de/wp-content/uploads/Collecting-Social-Media-Data.pdf

  • Howison, J., Wiggins, A., & Crowston, K. (2011). Validity Issues in the Use of Social Network Analysis with Digital Trace Data. Journal of the Association for Information Systems, 12(12), 767–797. https://doi.org/10.17705/1jais.00282

  • Hox, J. J. (2017). Computational Social Science Methodology, Anyone? Methodology, 13(Supplement 1), 3–12. https://doi.org/10.1027/1614-2241/a000127

  • Jo, C. W., Wesołowska, M., & Wojcieszak, M. (2024). Harmful YouTube Video Detection: A Taxonomy of Online Harm and MLLMs as Alternative Annotators (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2411.05854

  • Klein, O., Hardwicke, T. E., Aust, F., Breuer, J., Danielsson, H., Mohr, A. H., IJzerman, H., Nilsonne, G., & Frank, M. C. (2018). A practical guide for transparency in psychological science. Collabra: Psychology, 4(1). https://doi.org/10.1525/collabra.158

  • Menchen-Trevino, E. (2013). Collecting vertical trace data: Big possibilities and big challenges for multi-method research. Policy & Internet, 5(3), 328–339. https://doi.org/10.1002/1944-2866.poi336

  • Ohme, J., Araujo, T., Boeschoten, L., Freelon, D., Ram, N., Reeves, B. B., & Robinson, T. N. (2023). Digital Trace Data Collection for Social Media Effects Research: APIs, Data Donation, and (Screen) Tracking. Communication Methods and Measures, 18(2), 124–141. https://doi.org/10.1080/19312458.2023.2181319

  • Stier, S., Breuer, J., Siegers, P., & Thorson, K. (2020). Integrating Survey Data and Digital Trace Data: Key Issues in Developing an Emerging Field. Social Science Computer Review, 38(5), 503–516. https://doi.org/10.1177/0894439319843669

  • Törnberg, P. (2024). Best Practices for Text Annotation with Large Language Models. Sociologica, 18(2), 67–85. https://doi.org/10.6092/ISSN.1971-8853/19461

  • Webb Williams, N., Casas, A., & Wilkerson, J. D. (2020). Images as Data for Social Science Research: An Introduction to Convolutional Neural Nets for Image Classification (1st ed.). Cambridge University Press. https://doi.org/10.1017/9781108860741

  • Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., Da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., … Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3(1), 160018. https://doi.org/10.1038/sdata.2016.18

Footnotes

  1. Depending on what solution you choose for that, there may be costs for publishing/hosting the output (online).

  2. JupyterLab offers the equivalent of an IDE for Jupyter notebooks.

  3. Binder offers an option for the free hosting of reproducible analyses using Jupyter notebooks.

  4. GitHub Copilot can also be easily integrated into RStudio and VS Code.