
Best Practices to Stream and Store Data on X

Level required: Expert
Andreas Neumeier
SPARTA | University of the German Federal Armed Forces
Jasmin Riedl
SPARTA | University of the German Federal Armed Forces
Wiebke Drews
SPARTA | University of the German Federal Armed Forces

How to Deal with Varying Amounts of X Streaming Data

Expect the number of tweets arriving through the filtered stream to vary with factors such as time of day or current events, so your processing capacity must be able to adapt to rising and falling volume. Note also that X's (Twitter's) filtered stream endpoint does not include a built-in buffer: if incoming data is not processed promptly, the stream may disconnect and stop delivering data.

If you experience a disconnection, you may be able to recover missed data for up to five minutes after the interruption. It is still best to minimize interruptions so that the data flow stays consistent and efficient. One effective approach is a local buffer that temporarily stores incoming tweets for later processing.
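The local buffer described above can be sketched with Python's standard library. This is a minimal illustration, not a production reader: `run_buffered`, `stream`, and `handle` are hypothetical names, `stream` stands for any iterable that yields raw tweets, and the choice to drop items when the buffer is full is an assumption made for brevity.

```python
import queue
import threading


def run_buffered(stream, handle, maxsize=10_000):
    """Read from `stream` without waiting for processing; return the drop count.

    `stream` is any iterable of raw tweets and `handle` is the (possibly slow)
    processing function. Both are placeholders for this sketch.
    """
    buffer = queue.Queue(maxsize=maxsize)
    done = object()  # sentinel that tells the worker to stop
    dropped = 0

    def worker():
        # Process buffered tweets at the worker's own pace.
        while True:
            item = buffer.get()
            if item is done:
                break
            handle(item)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    for raw in stream:
        try:
            # Never block the reader: a stalled reader risks a disconnect.
            buffer.put_nowait(raw)
        except queue.Full:
            dropped += 1  # assumption: dropping is acceptable in this sketch

    buffer.put(done)  # blocks until there is room, then signals shutdown
    thread.join()
    return dropped
```

Dropping on a full buffer keeps the reader responsive at the cost of data loss. Where loss is not acceptable, spilling to disk or handing tweets to a message broker may be the better design.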

Moreover, incorporating a publish-subscribe system like Kafka is worth considering. With this system in place, incoming tweets are stored instantly, allowing subsequent processes to access and analyze them at their own pace. This approach effectively manages peak loads and balances the data processing workload.
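To illustrate why this decouples ingestion from analysis, the following toy log keeps one read offset per consumer, which is the core idea behind a Kafka topic. It is a conceptual stand-in only and does not use or imitate the Kafka client API; the class `TweetLog` and its methods are hypothetical.

```python
class TweetLog:
    """Append-only in-memory log with an independent read offset per consumer."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # consumer name -> index of next unread record

    def publish(self, record):
        # Writers only append; they never wait for readers.
        self._records.append(record)

    def poll(self, consumer, max_records=100):
        # Return the next batch for this consumer and advance its offset.
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch

    def lag(self, consumer):
        # Number of records this consumer has not read yet.
        return len(self._records) - self._offsets.get(consumer, 0)


log = TweetLog()
for i in range(5):
    log.publish({"id": str(i)})

first_batch = log.poll("sentiment", max_records=2)  # a slow consumer reads two
```

After this run the hypothetical "sentiment" consumer has three unread records, while a consumer that has not started yet (for example "archive") still has all five available. A real broker adds durability, partitioning, and retention on top of this idea.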

Always monitor your analysis pipeline's capacity. If you notice that it struggles to keep up with the regular tweet volume, it is time to scale up your resources. This keeps your system reliable and responsive, especially during times of high demand.
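One simple signal that a pipeline is falling behind is a backlog that keeps growing. The sketch below assumes you sample the number of unprocessed tweets at regular intervals; the function name, the window size, and the "strictly increasing" rule are illustrative assumptions rather than a recommended threshold.

```python
def backlog_is_growing(depths, window=5):
    """Return True if the last `window` backlog samples rise strictly."""
    recent = depths[-window:]
    if len(recent) < window:
        return False  # not enough samples to judge a trend
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

A temporary spike that drains again does not trigger this check, whereas a steadily rising backlog does, which matches the case where regular volume already exceeds capacity.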

How to Store your X Data

To process tweets later, store them in a structured format. How you store them depends on the application; both relational and non-relational databases are feasible.

Choosing an appropriate database structure is crucial for efficient storage and retrieval. The right structure depends heavily on the nature of your application and on how you intend to query the stored data. Here is a quick overview of how tweets can be stored and organised in each type of database:

1. Relational Databases

In a relational database (e.g., PostgreSQL, MySQL), information is arranged into tables with predetermined columns, and connections between these tables are clearly defined through primary and foreign keys.

tip

Remember that the decision to use a relational database such as PostgreSQL should depend on your application's requirements. If your application relies heavily on structured queries and data relationships, a relational database like PostgreSQL can be very beneficial.

Example PostgreSQL schema for tweets:

id (text)

  • Description: A unique identifier for each tweet.
  • Constraints: Must be unique for each tweet.
  • Example: "1234567890"

project (text)

  • Description: The name of a project. This allows you to store tweets from multiple projects in the same table.
  • Example: "project1"

created_at (timestamp with time zone)

  • Description: The date and time when the tweet was published.
  • Constraints: Must be a valid date and time.
  • Example: 2023-08-24 12:34:56+00

author_id (text)

  • Description: The unique identifier of the user who authored the tweet.
  • Example: "9876543210"

text (text)

  • Description: The actual content of the tweet, limited to a specific number of characters.
  • Example: "This is a sample tweet!"

retweeted_id (text)

  • Description: If the tweet is a retweet, this field stores the original tweet's unique identifier.
  • Constraints: Should map to a valid tweet id or be NULL if it's not a retweet.
  • Example: "987654321"

conversation_id (text)

  • Description: An identifier that groups tweets from the same conversation or thread.
  • Constraints: Equals the tweet's own id if the tweet starts a conversation; otherwise it maps to the id of the tweet that started the conversation.
  • Example: "1122334455"

media (jsonb)

  • Description: A JSON array of objects describing any media (such as images or videos) attached to the tweet.
  • Example:
    [
      {
        "url": "https://pbs.twimg.com/media/test.jpg",
        "type": "photo",
        "width": 4086,
        "height": 3065,
        "media_key": "3_123456",
        "public_metrics": {}
      }
    ]

polls (jsonb)

  • Description: A JSON array of objects describing any polls included in the tweet.
  • Example:
    [
      {
        "id": "123456789",
        "question": "Favorite color?",
        "options": [
          { "label": "Red", "votes": 18, "position": 1 },
          { "label": "Blue", "votes": 28, "position": 2 },
          { "label": "Green", "votes": 18, "position": 3 }
        ],
        "end_datetime": "2023-08-24 12:34:56.000Z",
        "voting_status": "closed",
        "duration_minutes": 10080
      }
    ]

places (jsonb)

  • Description: A JSON array of objects containing geolocation or place information associated with the tweet.
  • Constraints: Should adhere to a pre-defined structure for storing place data.
  • Example:
    [
      {
        "id": "37439688c6302728",
        "geo": {
          "bbox": [11.360589, 48.061634, 11.722918, 48.248124],
          "type": "Feature",
          "properties": {}
        },
        "name": "Munich",
        "country": "Germany",
        "full_name": "Munich, Germany",
        "place_type": "city",
        "country_code": "DE"
      }
    ]

entities (jsonb)

  • Description: A JSON object that captures specific entities within the tweet like hashtags, user mentions, URLs, etc.
  • Constraints: Should adhere to a pre-defined structure for capturing these entities.
  • Example:
    {
      "hashtags": [
        { "end": 15, "start": 0, "tag": "#example" },
        { "end": 30, "start": 15, "tag": "#database" }
      ],
      "mentions": [
        { "id": "123456", "end": 15, "start": 0, "username": "@username" }
      ]
    }

tweet (jsonb)

  • Description: A JSON object that captures the entire raw tweet data, useful in case something changes in the Twitter API, for backup, or for applications that need the full payload.
  • Example:
    {"id": "1234567890", "text": "This is a sample tweet!", "lang": "de", "public_metrics": {"like_count": 0, "quote_count": 0, "reply_count": 0, "retweet_count": 0, "impression_count": 0}, ...}

compliance (text)

  • Description: The compliance status of the tweet, reflecting compliance events such as deleted, edited, or withheld. It indicates whether a compliance event exists for the tweet.
  • Example: "deleted"
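The column list above can be sketched as a table definition plus a function that flattens one streamed payload into a row. This is an illustration under stated assumptions, not the authors' actual schema code: the table name `tweets`, the constant `CREATE_TWEETS_TABLE`, and the function `tweet_to_row` are hypothetical, `created_at` uses `timestamptz` because PostgreSQL has no `datetime` type, and the payload is assumed to follow the X API v2 shape with the tweet under `data` and expanded media, polls, and places under `includes`.

```python
import json

# DDL written for PostgreSQL; constraints beyond the primary key are assumptions.
CREATE_TWEETS_TABLE = """
CREATE TABLE IF NOT EXISTS tweets (
    id              text PRIMARY KEY,
    project         text NOT NULL,
    created_at      timestamptz,
    author_id       text,
    text            text,
    retweeted_id    text,
    conversation_id text,
    media           jsonb,
    polls           jsonb,
    places          jsonb,
    entities        jsonb,
    tweet           jsonb NOT NULL,
    compliance      text
);
"""


def _as_json(value):
    """Serialize a value for a jsonb column, keeping missing values as NULL."""
    return None if value is None else json.dumps(value)


def tweet_to_row(payload, project):
    """Map one stream payload (assumed v2 shape) to a dict keyed by column name."""
    tweet = payload["data"]
    includes = payload.get("includes", {})
    # A retweet references the original tweet with type "retweeted".
    retweeted_id = next(
        (ref["id"] for ref in tweet.get("referenced_tweets", [])
         if ref.get("type") == "retweeted"),
        None,
    )
    return {
        "id": tweet["id"],
        "project": project,
        "created_at": tweet.get("created_at"),
        "author_id": tweet.get("author_id"),
        "text": tweet.get("text"),
        "retweeted_id": retweeted_id,
        "conversation_id": tweet.get("conversation_id"),
        "media": _as_json(includes.get("media")),
        "polls": _as_json(includes.get("polls")),
        "places": _as_json(includes.get("places")),
        "entities": _as_json(tweet.get("entities")),
        "tweet": json.dumps(tweet),  # full raw tweet kept as a backup
        "compliance": None,          # set later if a compliance event arrives
    }
```

With `id` as the sole primary key, a tweet can belong to only one project. If the same tweet may be collected by several projects, a composite key on `(id, project)` might fit better.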

2. Non-Relational Databases

Non-relational databases (e.g., MongoDB, Cassandra) offer greater storage flexibility since they don't need a fixed schema.

Document-based storage (like MongoDB) example:

Each tweet can be a document:

{
    "_id": "1234567890",
    "author_id": "9876543210",
    "text": "This is a sample tweet!",
    "created_at": "2023-08-24 12:34:56+00",
    "entities": {
        "hashtags": [
            {"end": 15, "start": 0, "tag": "#example"},
            {"end": 30, "start": 15, "tag": "#database"}
        ],
        "mentions": [
            {"id": "123456", "end": 15, "start": 0, "username": "@username"}
        ]
    },
    ...
}
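A document like the one above can be derived from a tweet object by reusing the tweet id as the document key. The helper below is a minimal sketch with a hypothetical name (`tweet_to_document`); it only reshapes a dictionary and does not talk to a database.

```python
def tweet_to_document(tweet):
    """Return a copy of `tweet` whose "id" field is stored as "_id"."""
    doc = dict(tweet)           # shallow copy so the input stays unchanged
    doc["_id"] = doc.pop("id")  # MongoDB uses "_id" as the unique document key
    return doc
```

Because `_id` is unique within a collection, writing the document with an upsert (for example pymongo's `replace_one({"_id": doc["_id"]}, doc, upsert=True)`) should make re-processing the same tweet overwrite the existing document instead of creating a duplicate.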

Non-relational databases excel when:

  • The data structure is flexible and might change over time.
  • Horizontal scalability is required.
  • High write speeds are necessary, and data consistency can be slightly relaxed.

In Short: How to choose your database

Base the choice between relational and non-relational databases on your application's specific requirements. If the application relies heavily on structured relationships, complex queries, and data integrity, a relational database is preferable. If it needs scalability and a flexible schema and can compromise somewhat on consistency, a non-relational database might be the better choice.