Data collection

We collected depression-related posts from the most popular English-language health forums.

The dataset contains posts from February 15, 2016 until February 15, 2019, and covers only publicly available posts (registration is not required to read the forums), which were shared willingly by their authors.

The forums’ content was downloaded, in compliance with GDPR regulations, by SentiOne, a social listening platform.

We filtered the collected corpus in two rounds:
(1) we selected threads that contained the word “depression” or “depressed” in the title or in at least one post, then
(2) we selected posts whose link, topic or content contained at least one of these depression-related terms: depression, depressed, bummer, desolation, desperation, moody, upset, gloom, hopelessness, depressant, melancholia, sorrow, unhappiness, feeling blue, depressive, depressive disorder, unipolar depression, bipolar, bipolar depression, major depression, mdd, persistent depressive disorder, pdd, cyclothymia, mood disorder, adjustment disorder, chronic fatigue syndrome, cfs, premenstrual dysphoric disorder.
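The two filtering rounds can be sketched as follows. This is only an illustration, not the code used for the study: the dictionary layout of threads and posts (`title`, `posts`, `link`, `topic`, `content`) is hypothetical, `POST_TERMS` is an excerpt of the full term list, and case-insensitive whole-word matching is an assumption, since the matching rule is not specified above.

```python
import re

# Round 1 keywords, and an excerpt of the round 2 term list given above.
THREAD_TERMS = ("depression", "depressed")
POST_TERMS = ("depression", "depressed", "feeling blue", "mdd", "mood disorder", "bipolar")


def contains_term(text, terms):
    """Return True if any term occurs in text as a whole word or phrase (case-insensitive)."""
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def filter_corpus(threads):
    """Apply both filtering rounds to a list of thread dictionaries (hypothetical layout)."""
    selected = []
    for thread in threads:
        # Round 1: keep the thread if its title or at least one post mentions a thread term.
        texts = [thread["title"]] + [post["content"] for post in thread["posts"]]
        if not any(contains_term(text, THREAD_TERMS) for text in texts):
            continue
        # Round 2: keep posts whose link, topic or content contains a post-level term.
        for post in thread["posts"]:
            fields = " ".join([post["link"], post["topic"], post["content"]])
            if contains_term(fields, POST_TERMS):
                selected.append(post)
    return selected
```

Whole-word matching avoids counting a term that merely appears inside a longer word; substring matching would be the looser alternative.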

The filtered corpus contained 79 889 posts.


We used the following preprocessing steps:

  • Deletion of reposted (quoted) parts of the text
  • Removal of duplicate posts and of posts that were too short (<20 words)
  • Deletion of URLs and e-mail addresses
  • Identification of names of mental disorders (n-grams)
    [supplemented with some general medical terms such as “unipolar depression” or “depressive disorder”; retrieved: 15-04-2019]
  • Lemmatization (WordNet lemmatizer from Python NLTK)
  • Significant bigram detection (BigramCollocationFinder, NLTK, with PMI measure and human evaluation)
  • Stop-word removal (NLTK Stopwords Corpus)
  • Steps that turned out not to be relevant: detection of significant trigrams, and named entity recognition for persons’ names
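The cleaning steps at the start of the list could look roughly like the sketch below. It is an assumption-laden illustration: the regular expressions are simple approximations, "duplicate" is taken to mean an exact match after lower-casing and whitespace normalisation, and the steps are applied in the order in which they are listed above. The function and constant names are hypothetical.

```python
import re

URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
MIN_WORDS = 20  # posts with fewer words than this are dropped


def strip_urls_and_emails(text):
    """Delete URLs and e-mail addresses, then collapse leftover whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    return " ".join(text.split())


def clean_posts(posts):
    """Drop short and duplicate posts, then strip URLs and e-mail addresses."""
    seen = set()
    kept = []
    for raw in posts:
        if len(raw.split()) < MIN_WORDS:
            continue
        # Assumed duplicate criterion: same text ignoring case and spacing.
        key = " ".join(raw.lower().split())
        if key in seen:
            continue
        seen.add(key)
        kept.append(strip_urls_and_emails(raw))
    return kept
```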

After preprocessing, the corpus contained 67 857 posts.

Software and packages used

We used Python 3.7 with the NLTK package.

Reference: Bird, S., Loper, E. and Klein, E. (2009), Natural Language Processing with Python. O’Reilly Media Inc.
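To illustrate the PMI measure used in the significant bigram detection step, here is a dependency-free sketch. It is not the NLTK code used in the study; in NLTK the same score is available as `BigramAssocMeasures.pmi` and is typically used with `BigramCollocationFinder`. The function name `pmi_bigrams` and the `min_count` threshold are hypothetical.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_count=2):
    """Rank adjacent word pairs by PMI = log2( P(x, y) / (P(x) * P(y)) )."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (first, second), count in bigram_counts.items():
        if count < min_count:  # very rare pairs get inflated PMI, so skip them
            continue
        # Probabilities are estimated as counts divided by the number of tokens.
        scores[(first, second)] = math.log2(
            count * n / (unigram_counts[first] * unigram_counts[second])
        )
    # Highest-scoring candidates first; these would then be reviewed by a human.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because PMI favours rare pairs, a frequency threshold and a manual check of the top candidates are a common combination.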


The annotated database is a random sample of 4500 posts drawn from the corpus, which contains only posts that are at least 20 words long. The annotators were instructed to assign a secondary label to a text if needed (34% of the posts received a second label).

Two independent annotators labelled each text. In cases of explicit disagreement between the annotators, a senior researcher also labelled the text (this concerned 12.3% of the posts).

We aggregated the labels of the annotators into an integrated label based on these principles:

  • Only two labels were given and they match: that category was chosen.
  • Only one of the annotators gave a “valid” label to the post, while the other label was “unclassifiable” or “irrelevant”: the “valid” (primary) label was chosen.
  • Two of the three labels match (one annotator chose one category, the other chose two categories): that category was chosen.
  • Only two of the four labels used are identical: that category was chosen.
  • Both annotators gave the same primary and secondary labels in the same order: the primary label was chosen.
  • The two annotators used the same labels, but in a different order: third annotator’s decision.
  • All labels differ, and none of them is “unclassifiable” or “irrelevant”: third annotator’s decision.
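The aggregation principles above can be written as a small decision function. This is a sketch under stated assumptions rather than the procedure actually used: each annotator's labels are passed as an ordered sequence with the primary label first, `THIRD_ANNOTATOR` is a hypothetical sentinel meaning that the senior researcher decides, and the case where both annotators give non-matching "unclassifiable" / "irrelevant" labels is not covered by the principles, so it is escalated here by assumption.

```python
INVALID = {"unclassifiable", "irrelevant"}
THIRD_ANNOTATOR = "THIRD_ANNOTATOR"  # sentinel: a senior researcher must decide


def aggregate(labels_a, labels_b):
    """Combine two annotators' ordered labels (primary first) into one label."""
    labels_a, labels_b = tuple(labels_a), tuple(labels_b)
    valid_a = labels_a[0] not in INVALID
    valid_b = labels_b[0] not in INVALID
    # Only one annotator gave a valid label: take that annotator's primary label.
    if valid_a != valid_b:
        return labels_a[0] if valid_a else labels_b[0]
    # Identical labels in identical order (one label each, or primary plus secondary).
    if labels_a == labels_b:
        return labels_a[0]
    shared = set(labels_a) & set(labels_b)
    # Same two labels in swapped order, or nothing in common: escalate.
    if len(shared) != 1:
        return THIRD_ANNOTATOR
    # Exactly one label in common among three or four labels: choose it.
    return shared.pop()
```

For example, `aggregate(("bio-medical", "social"), ("social", "bio-medical"))` is escalated to the third annotator, whereas `aggregate(("social",), ("bio-medical", "social"))` yields `"social"`. The category names other than bio-medical are placeholders.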

Features for the automated classification

In order to move beyond word frequencies and to build some background knowledge into the model, we tried to define further variables (‘features’) which might help the learner to classify the posts. Most of them are general linguistic characteristics of the post that are possibly related to the social status or age of the author.

Other features are more specific to the topic of depression. There is also an author-level characteristic describing the author’s engagement in the forum community. We used these features as predictors because we assumed they might be related to framing either in a direct way (e.g. the occurrence of a dosage may indicate bio-medical framing) or in a rather indirect way (e.g. a higher level of lexical diversity may be associated with a higher educational level, which might affect framing).

Features used, by type:

  • Specific characteristics of the post possibly related to framing
      • Number of times any mental health medication is mentioned
      • Number of times a drug dosage (e.g. ‘8 mg’) is mentioned
  • Characteristics of the author
      • Author’s level of activity on the forum
  • General linguistic characteristics of the post
      • Proportion of nouns / verbs / adjectives / emojis / misspelled words / sentences / words / stop words / numbers / punctuation marks / commas
      • Average length of sentences / words
      • Number of words
      • Lexical diversity
      • Sentiment score
      • Number of abbreviations commonly used in chat forums (e.g., 2nite=tonight) [retrieved: 15-04-2019]
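A few of these features can be computed with the standard library alone, as in the sketch below. The definitions are assumptions made for illustration: lexical diversity is taken as the type-token ratio (the measure actually used is not stated above), words are runs of letters and apostrophes, and the dosage pattern covers only a handful of units. The names `post_features`, `DOSAGE_RE` and `WORD_RE` are hypothetical.

```python
import re

# Assumed dosage pattern: a number followed by a small set of common units.
DOSAGE_RE = re.compile(r"\b\d+(?:\.\d+)?\s?(?:mg|mcg|g|ml)\b", re.IGNORECASE)
WORD_RE = re.compile(r"[A-Za-z']+")


def post_features(text):
    """Compute a few illustrative post-level features for one post."""
    words = WORD_RE.findall(text.lower())
    n_words = len(words)
    return {
        "n_words": n_words,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        # Type-token ratio: distinct words divided by all words (an assumption).
        "lexical_diversity": len(set(words)) / n_words if n_words else 0.0,
        "n_dosages": len(DOSAGE_RE.findall(text)),
    }
```

The type-token ratio depends on text length, so a length-normalised variant may be preferable when posts differ greatly in size.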