The Data Knowledge Hub for Researching Online Discourse

Background and rationale: What, why, and how does it help you?

Cathleen Berger

Upgrade Democracy | Bertelsmann Stiftung

Charlotte Freihse

Upgrade Democracy | Bertelsmann Stiftung

The Data Knowledge Hub for Researching Online Discourse (Data Knowledge Hub) is an initiative that aims to provide a central resource for researchers, social scientists, data scientists, journalists, practitioners, and policy makers interested in independently researching social media and online discourse more broadly.

Why do we feel this is necessary?

Online discourse has changed how we inform ourselves, whom and what we trust, and how we access information in the first place. On online platforms and social media in particular, recommender systems and other design features can be gamed to fuel disinformation, hate speech, and outrage. In addition, messaging services and alternative platforms are increasingly at risk of exploitation and provide agitators with vast audiences to spread falsehoods. But how and why exactly this is happening remains under-researched and is illustrated only anecdotally. If we want to strengthen our information ecosystem and increase each other’s ability to decide what’s trustworthy and what’s not, we need to move away from anecdotes towards broad, continuous, and ideally real-time data-driven insight.

The challenge

Due to the increasing number of social media and other digital platforms as well as the huge amounts of data to analyse, it is critical to enable and empower more researchers, social as well as data scientists, on two fronts:

To conduct independent research of social media and online discourse on a technical level

To assess the data in its socio-political context

There are already renowned, well-established organisations that do incredible work on social media monitoring, including CeMAS, Democracy Reporting International, the SPARTA Project of the Bundeswehr University Munich, and the Institute for Strategic Dialogue. Yet even these established players face several challenges, including:

the multitude of digital platforms;
the sheer amount of data and necessary server capacities;
fast-developing and constantly changing narratives;
new and changing actors and agitators.

Building a foundation for solving these challenges

To reduce the obstacles and lower the threshold to independently researching online discourse, we are launching this Data Knowledge Hub. Hosted on GitHub as open source under a Creative Commons license, it welcomes contributions of new data, code, and written content on an ongoing basis, fostering a collaborative environment for all. Cooperation and collaboration on development, design, content, and scope among established actors is key to turning this Data Knowledge Hub into a useful tool and an enabler for future research.

For the first publication in September 2023, we gathered initial contributions on the legal basis and ethical standards, good practices and exemplary research for webscraping, and data collection on Twitter and TikTok, as well as code samples to monitor various platforms. This Data Knowledge Hub will be continuously updated and reviewed. With the help of community and crowdsourced contributions, we hope to include a broad range of samples and organic input, so that over time it provides all relevant information for researching and understanding the dynamics of online discourse.

You can help and contribute, too

We welcome additional contributions on a rolling basis. Right now, we would be particularly interested in including and discussing chapters on:

Social media usage: users worldwide, number of posts/messages, regional differences etc.
Data access and ethics:
- How to deal with dark socials?
- Data access rights beyond the European Union and the U.S.
Data collection: sock puppet, snowball sampling and other innovative approaches
Examples of data collection: Facebook, Instagram, Reddit, Fediverse and others
Examples of data analysis: Topic modelling, sentiment analysis, geospatial analysis, infrastructure as code, and others
Additional aspects that benefit from monitoring as a research method

Living document - How to navigate the Data Knowledge Hub

Johannes Müller

&effect data solutions GmbH

The Data Knowledge Hub is hosted in a GitHub repository. For better usability, we use a documentation framework that lets users switch to a static website and read the content as a digital book. This means that all text content is written in Markdown. Code projects are included as a single file (e.g. a Jupyter Notebook) or in folders that can be pulled from GitHub. We intend to continuously update content and invite contributions on additional aspects of independently researching social media and online discourse. A first version was published in September 2023. Chapters that are already in the pipeline are marked as “forthcoming”, and a list of invited contributions can be found in the “editorial”.

All contributors are listed here as well as named in their respective chapters.

Code projects

Here is a table with all projects that are currently included in the Data Knowledge Hub. Click on the link to go to the project page.

Project	Description	Language	Platform	Code
tiktok-scraping	Collect data on TikTok using puppeteer	JavaScript	TikTok	Code
tiktok-hashtag-analysis	Analyse TikTok hashtags	Python	TikTok	Code
blog-webscraping	Webscraping using rvest and selenium	R	Blogs	Code
twitter-streaming	Large-scale data collection on X (Twitter)	Python	Twitter / X	Code
twitter-social-network	Social Network Analysis with R	R	Twitter / X	Code

Design principles

The editorial team has adopted four guiding principles for content on the Data Knowledge Hub:

From general to specific

Cater to different target groups by starting each chapter with a general and easy-to-follow introduction. More specific topics such as content on use cases, projects, or code examples will be added throughout the project. We use three labels to indicate difficulty that will help users to orientate themselves: no code, beginners, advanced.

Rich links

Enable non-linear interaction with internal and external links, highlighting diverse initiatives, projects, or code libraries.

Reproducibility

For code examples, we focus on Python and R due to their widespread use in data science; use cases in other languages, such as JavaScript, Julia, or Rust, are also welcome. All code should be reproducible.
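
As an illustration of what reproducibility can mean in practice, the following minimal sketch fixes a random seed and records the software environment alongside the results. It is not a requirement of the Data Knowledge Hub; the helper names `set_seed` and `environment_snapshot` and the package names passed to them are hypothetical and chosen only for this example.

```python
import json
import platform
import random
import sys
from importlib import metadata


def set_seed(seed: int = 42) -> None:
    """Seed Python's built-in random number generator."""
    random.seed(seed)


def environment_snapshot(packages: list[str]) -> dict:
    """Record the interpreter, the OS, and the installed versions of the named packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


# Seeding first makes the random draw repeatable across runs
# (assuming the same Python version).
set_seed(42)
sample = random.sample(range(1000), k=5)

# "pandas" and "requests" are placeholder package names for this sketch.
snapshot = environment_snapshot(["pandas", "requests"])
print(json.dumps(snapshot, indent=2))
```

Storing such a snapshot next to the collected data, for instance as a JSON file in the project folder, is one way to let others see which versions produced a given result. Projects that use other sources of randomness, such as NumPy, would need to seed those separately.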

Open Source

Content and code will be accessible on GitHub under a CC BY License.