The Data Knowledge Hub for Researching Online Discourse
Background and rationale: What, why, and how does it help you?
The Data Knowledge Hub for Researching Online Discourse (Data Knowledge Hub) is an initiative that aims to provide a central resource for researchers, social and data scientists, journalists, other practitioners, and policy makers interested in independently researching social media and, more broadly, online discourse.
Why do we feel this is necessary?
Online discourse has changed how we inform ourselves, whom and what we trust, and how information is accessed in the first place. On online platforms and social media in particular, recommender systems and other design features can be gamed to fuel disinformation, hate speech, and outrage. In addition, messaging services and alternative platforms are increasingly at risk of exploitation, providing agitators with vast audiences for spreading falsehoods. But how and why exactly this happens remains under-researched and is illustrated mostly by anecdotes. If we want to strengthen our information ecosystem and increase each other’s ability to decide what’s trustworthy and what’s not, we need to move away from anecdotes towards broad, continuous, and ideally real-time data-driven insight.
The challenge
Given the growing number of social media and other digital platforms and the huge amounts of data to analyse, it is critical to enable and empower more researchers, both social and data scientists, on two fronts:
- to conduct independent research on social media and online discourse at the technical level, and
- to assess the data in their socio-political context.
There are already renowned, well-established organisations doing incredible work on social media monitoring, including CeMAS, Democracy Reporting International, the SPARTA Project of the Bundeswehr University Munich, and the Institute for Strategic Dialogue. Yet even these established players face several challenges, among them:
- the multitude of digital platforms;
- the sheer amount of data and necessary server capacities;
- fast-developing and constantly changing narratives;
- new and changing actors and agitators.
Building a foundation for solving these challenges
To reduce these obstacles and lower the threshold for independently researching online discourse, we are launching this Data Knowledge Hub. Hosted on GitHub, open source and under a Creative Commons license, it continuously welcomes contributions of new data, code, and written content, fostering a collaborative environment for all. Cooperation and collaboration on development, design, content, and scope among established actors is key to turning this Data Knowledge Hub into a useful tool and an enabler for future research.
For the first publication in September 2023, we gathered initial contributions on the legal basis and ethical standards, good practices and exemplary research on web scraping, data collection on Twitter and TikTok, as well as code samples for monitoring various platforms. The Data Knowledge Hub will be continuously updated and reviewed, and with the help of community and crowdsourced contributions we hope to include a broad range of samples and organic input, over time providing all relevant information for researching and understanding the dynamics of online discourse.
You can help and contribute, too
We welcome additional contributions on a rolling basis. Right now, we are particularly interested in including and discussing chapters on:
- Social media usage: users worldwide, number of posts/messages, regional differences etc.
- Data access and ethics:
  - How to deal with dark socials?
  - Data access rights beyond the European Union and the U.S.
- Data collection: sock puppets, snowball sampling, and other innovative approaches
- Examples of data collection: Facebook, Instagram, YouTube, Fediverse and others
- Examples of data analysis: topic modelling, sentiment analysis, geospatial analysis, infrastructure as code, and others (see the topic-modelling sketch after this list)
- Additional aspects that benefit from monitoring as a research method
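
To illustrate the kind of data-analysis chapter we are hoping for, here is a minimal topic-modelling sketch in Python using scikit-learn. The toy documents and parameter choices are placeholders for illustration only, not material from any existing chapter.

```python
# Minimal topic-modelling sketch (illustrative only; toy data and parameters).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder corpus: in practice this would be collected posts or comments.
documents = [
    "election fraud claims spread on social media",
    "new study on vaccine safety published",
    "viral video fuels outrage about immigration",
    "fact checkers debunk claims about the election",
]

# Bag-of-words representation of the corpus.
vectorizer = CountVectorizer(stop_words="english")
doc_term_matrix = vectorizer.fit_transform(documents)

# Fit a small LDA model; n_components is the number of topics to extract.
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(doc_term_matrix)

# Print the highest-weighted words per topic.
terms = vectorizer.get_feature_names_out()
for topic_id, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {topic_id}: {', '.join(top_terms)}")
```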
Living document - How to navigate the Data Knowledge Hub
The Data Knowledge Hub is hosted in a GitHub repository. For better usability, we use a documentation framework that also renders the content as a static website, so it can be read like a digital book. This means that all text content is created using Markdown. Code projects are included as a single file (e.g. a Jupyter Notebook) or in folders that can be pulled from GitHub. We intend to continuously update the content and invite contributions on additional aspects of independently researching social media and online discourse. A first version was published in September 2023; chapters already in the pipeline are marked as “forthcoming”, and a list of invited contributions can be found in the “editorial”.
All contributors are listed here as well as named in their respective chapters.
Code projects
The table below lists all code projects currently included in the Data Knowledge Hub; click the Code link to go to the project page. A short illustrative web-scraping sketch follows the table.
| Project | Description | Language | Platform | Code |
|---|---|---|---|---|
| tiktok-scraping | Collect data on TikTok using puppeteer | JavaScript | TikTok | Code |
| tiktok-hashtag-analysis | Analyse TikTok hashtags | Python | TikTok | Code |
| blog-webscraping | Web scraping using rvest and selenium | R | Blogs | Code |
| twitter-streaming | Large-scale data collection on X (Twitter) | Python | Twitter / X | Code |
| twitter-social-network | Social Network Analysis with R | R | Twitter / X | Code |
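
To give a flavour of what the listed projects do, here is a minimal web-scraping sketch in Python using requests and BeautifulSoup. It is not the code of the blog-webscraping project (which uses rvest and selenium in R); the URL and the CSS selector are hypothetical placeholders.

```python
# Illustrative blog-scraping sketch; URL and selector are placeholders.
import requests
from bs4 import BeautifulSoup

url = "https://example.com"  # placeholder: replace with the blog you are studying

# Fetch the page; always check the site's robots.txt and terms before scraping.
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the HTML and extract article titles and links.
soup = BeautifulSoup(response.text, "html.parser")
for article in soup.select("article"):  # placeholder selector
    title = article.find("h2")
    link = article.find("a")
    if title and link:
        print(title.get_text(strip=True), "->", link.get("href"))
```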
Design principles
The editorial team has adopted four guiding principles for content on the Data Knowledge Hub:
- From general to specific: Cater to different target groups by starting each chapter with a general and easy-to-follow introduction. More specific topics such as use cases, projects, or code examples will be added throughout the project. We use three difficulty labels to help users orient themselves: no code, beginners, advanced.
- Rich links: Enable non-linear interaction with internal and external links, highlighting diverse initiatives, projects, or code libraries.
- Reproducibility: For code examples, we focus on Python and R due to their widespread use in data science (though use cases in other languages such as JavaScript, Julia, or Rust are also welcome). All code should be reproducible; a minimal sketch of what this can mean in practice follows this list.
- Open Source: Content and code will be accessible on GitHub under a CC BY License.
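
As a rough illustration of what reproducibility can involve, the following Python sketch shows two common ingredients: fixing random seeds and recording the interpreter and package versions an analysis was run with. The package names are examples, not requirements of the Data Knowledge Hub.

```python
# Reproducibility sketch: fixed seeds plus a record of the environment.
import random
import sys
from importlib import metadata

import numpy as np

SEED = 42  # fix all sources of randomness used by the analysis
random.seed(SEED)
np.random.seed(SEED)

# Record interpreter and package versions alongside the results,
# so others can recreate the environment. Package names are examples.
print("python", sys.version.split()[0])
for package in ("numpy", "pandas", "scikit-learn"):
    try:
        print(package, metadata.version(package))
    except metadata.PackageNotFoundError:
        print(package, "not installed")
```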
Structure of the Data Knowledge Hub
The structure of the hub is based on the different stages of a data analysis project:
- How to get started: Overview of legal and ethical considerations as well as tools you may use during your project.
- How to access data on platforms: Information on data access options available for each platform.
- How to collect data: Summary of data collection methods and tools, along with challenges, limitations, and potential.
- How to analyse data: Introduction to research designs and methods like natural language processing, network analysis, and machine learning (see the short network-analysis sketch after this list).
- Literature and Illustrative Research: Overview of literature and selected research and studies.
- Contribute: Now it's your turn. Information on how you can support us and make your research available to the data knowledge community.
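
As a taste of the "How to analyse data" stage, here is a minimal network-analysis sketch in Python using networkx. The edge list is invented for illustration; in practice it would come from collected interaction data such as mentions or retweets.

```python
# Minimal network-analysis sketch: who mentions whom, and who is most central.
# The edge list below is invented for illustration only.
import networkx as nx

# Directed graph: an edge (a, b) means account a mentioned account b.
mentions = [
    ("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
    ("dave", "carol"), ("erin", "alice"), ("erin", "carol"),
]
graph = nx.DiGraph(mentions)

# In-degree centrality highlights accounts that receive the most attention.
centrality = nx.in_degree_centrality(graph)
for account, score in sorted(centrality.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{account}: {score:.2f}")
```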
Questions and improvements
If you have any questions or ideas, please do not hesitate to contact us at upgrade.democracy@bertelsmann-stiftung.de.
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.