Data collection
We collected depression-related posts from the most popular English-language health forums.
The dataset contains posts from February 15, 2016 until February 15, 2019, and covers only publicly available posts (registration is not required to read the forums), which were shared willingly by their authors.
The forums’ content was downloaded, in compliance with GDPR regulations, by SentiOne, a social listening platform.
We filtered the collected corpus in two rounds:
(1) we selected threads which contained the word “depression” or “depressed” in the title or in at least one post, then
(2) we selected posts whose link, topic or content contained at least one of these depression-related terms: depression, depressed, bummer, desolation, desperation, moody, upset, gloom, hopelessness, depressant, melancholia, sorrow, unhappiness, feeling blue, depressive, depressive disorder, unipolar depression, bipolar, bipolar depression, major depression, mdd, persistent depressive disorder, pdd, cyclothymia, mood disorder, adjustment disorder, chronic fatigue syndrome, cfs, premenstrual dysphoric disorder.
The filtered corpus contained 79 889 posts.
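The two filtering rounds can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the thread and post dictionary layout, the field names (`title`, `posts`, `link`, `topic`, `content`), and the choice of case-insensitive whole-word matching are all assumptions, and only a subset of the term list is shown.

```python
import re

# Round 1 keywords (taken from the text).
THREAD_TERMS = ["depression", "depressed"]

# Round 2 terms: an abbreviated subset of the list given in the text.
POST_TERMS = ["depression", "depressed", "feeling blue", "mdd",
              "bipolar", "mood disorder"]


def build_pattern(terms):
    """Compile a case-insensitive whole-word pattern matching any term."""
    # Longer terms first, so multi-word terms are tried before shorter ones.
    ordered = sorted(terms, key=len, reverse=True)
    body = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)


THREAD_RE = build_pattern(THREAD_TERMS)
POST_RE = build_pattern(POST_TERMS)


def thread_matches(thread):
    """Round 1: keyword in the title or in at least one post."""
    if THREAD_RE.search(thread["title"]):
        return True
    return any(THREAD_RE.search(p["content"]) for p in thread["posts"])


def post_matches(post):
    """Round 2: term in the post's link, topic, or content."""
    return any(POST_RE.search(post.get(field, ""))
               for field in ("link", "topic", "content"))


def filter_corpus(threads):
    """Apply both rounds and return the surviving posts."""
    return [p for t in threads if thread_matches(t)
            for p in t["posts"] if post_matches(p)]
```

Whole-word matching means that an inflected form such as "depressions" would not match; whether the original filter used substring or whole-word matching is not stated in the text.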
Preprocessing
We used the following preprocessing steps:
- Deletion of reposted parts of the text
- Removal of duplicate posts and of posts that were too short (<20 words)
- Deletion of URLs and e-mail addresses
- Identification of names of mental disorders (n-grams) [completed with some general medical terms such as “unipolar depression” or “depressive disorder”; retrieved: 15-04-2019]
- Lemmatization (WordNet lemmatizer from Python NLTK)
- Significant bigram detection (BigramCollocationFinder, NLTK, with PMI measure and human evaluation)
- Stop-word removal (NLTK Stopwords Corpus)
- Tested but found not relevant: significant trigrams, named entity recognition for persons’ names
After preprocessing, the corpus contained 67 857 posts.
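A few of the cleaning steps listed above (URL and e-mail deletion, removal of short and duplicate posts) can be sketched with the standard library alone. The regular expressions, the order of the steps, and the case-insensitive exact-match definition of a duplicate are assumptions made for illustration; lemmatization, collocation detection and stop-word removal are not shown.

```python
import re

# Simple, deliberately permissive patterns (assumptions, not the original ones).
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

MIN_WORDS = 20  # threshold stated in the text


def strip_urls_and_emails(text):
    """Delete URLs and e-mail addresses, then normalise whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    return " ".join(text.split())


def clean_corpus(posts):
    """Return cleaned posts, dropping short posts and duplicates."""
    seen = set()
    kept = []
    for raw in posts:
        text = strip_urls_and_emails(raw)
        if len(text.split()) < MIN_WORDS:
            continue  # fewer than MIN_WORDS whitespace-separated words
        key = text.lower()
        if key in seen:
            continue  # same text as an earlier post, ignoring case
        seen.add(key)
        kept.append(text)
    return kept
```

Here the word count is taken after URLs and e-mail addresses are removed, so a post consisting mostly of links is treated as short; counting before removal would be an equally plausible reading of the list.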
Software and packages
We used Python 3.7 with the NLTK package.
Reference: Bird, S., Loper, E. and Klein, E. (2009), Natural Language Processing with Python. O’Reilly Media Inc.
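To illustrate the PMI measure named in the bigram detection step, the sketch below scores adjacent word pairs by pointwise mutual information using only the standard library. It is a hand computation for illustration, not NLTK's implementation; the function name `bigram_pmi` and the `min_freq` cut-off are hypothetical, and the human evaluation step has no counterpart here.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_freq=2):
    """Rank adjacent word pairs by PMI = log2(p(w1, w2) / (p(w1) * p(w2))).

    Pairs seen fewer than min_freq times are skipped, because PMI
    overrates rare pairs.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)  # number of adjacent pairs

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_freq:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    # Highest-scoring (most strongly associated) pairs first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Candidate pairs at the top of the ranking would then be reviewed manually before being merged into single tokens.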
Annotation
The annotated database is a random sample of 4500 posts drawn from the corpus, which contained only posts at least 20 words long. The annotators were instructed to assign a secondary label to a text if needed (34% of the posts received a second label).
Two independent annotators labelled each text. In case of explicit disagreement between the annotators, a senior researcher also labelled the text (this affected 12.3% of the posts).
We aggregated the labels of the annotators into an integrated label based on these principles:
Case | Decision |
Only two labels, and they match | That category was chosen |
Only one annotator gave a “valid” label to the post; the other label was “unclassifiable” or “irrelevant” | The “valid” (primary) label was chosen |
Two of the three labels match (one annotator chose one category, the other chose two categories) | That category was chosen |
Only two of the four labels used are identical | That category was chosen |
Both annotators gave primary and secondary labels in the same order | The primary label was chosen |
The two annotators used the same labels, but in a different order | Third annotator’s decision |
Only different labels, none of them “unclassifiable” or “irrelevant” | Third annotator’s decision |
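The decision table can be expressed as a small function. This is one possible reading of the rules, not the procedure actually implemented: each annotator's labels are assumed to be a tuple of the form (primary,) or (primary, secondary), the category names "A", "B" and "C" are placeholders, and `None` is used as a hypothetical marker for cases that go to the third annotator.

```python
INVALID = {"unclassifiable", "irrelevant"}
NEEDS_THIRD = None  # marker: the third annotator decides


def aggregate(a, b):
    """Combine two annotators' label tuples into one integrated label.

    a, b: tuples of the form (primary,) or (primary, secondary).
    Returns a label, or NEEDS_THIRD when the table defers the decision.
    """
    a_valid = a[0] not in INVALID
    b_valid = b[0] not in INVALID

    # Exactly one annotator gave a valid label: take its primary label.
    if a_valid != b_valid:
        return a[0] if a_valid else b[0]

    # Identical labels in identical order (one or two labels each).
    if a == b:
        return a[0]

    # Same two labels, but primary and secondary swapped.
    if set(a) == set(b):
        return NEEDS_THIRD

    # Exactly one label in common among three or four labels.
    shared = set(a) & set(b)
    if len(shared) == 1:
        return next(iter(shared))

    # Only different labels (or any case the table does not cover).
    return NEEDS_THIRD
```

For example, `aggregate(("A", "B"), ("B", "C"))` returns `"B"`, while `aggregate(("A", "B"), ("B", "A"))` returns `None`. Cases the table leaves open, such as two different invalid labels, fall through to the third annotator in this sketch.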
Features for the automated classification
To move beyond word frequencies and to build some background knowledge into the model, we defined further variables (‘features’) that might help the learner classify the posts. Most of them are general linguistic characteristics of the post, possibly related to the social status or age of the author.
Other features are more specific to the topic of depression. There is also an author-level characteristic describing the author’s engagement in the forum community. We used these features as predictors because we assumed they might be related to framing either directly (e.g. the occurrence of a dosage may indicate bio-medical framing) or more indirectly (e.g. higher lexical diversity may be associated with a higher educational level, which might affect framing).
Type | Used feature |
Specific characteristics of the post possibly related to framing | Number of times any mental health medication occurs; number of times a drug dosage (‘8 mg’) occurs |
Characteristics of the author | Author’s level of activity on the forum |
General linguistic characteristics of the post | Proportion of nouns / verbs / adjectives / emojis / misspelled words / sentences / words / stop words / numbers / punctuation marks / commas; average length of sentences / words; number of words; lexical diversity; sentiment score; number of abbreviations commonly used in chat forums (e.g., 2nite=tonight) [retrieved: 15-04-2019] |
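Some of the features in the table can be computed with the standard library, as sketched below. The dosage pattern, the two-item medication lexicon, the tokenisation, the use of the type-token ratio as lexical diversity, and the per-character denominators for the proportions are all assumptions for illustration; part-of-speech proportions, spell checking and sentiment scoring need additional resources and are left out.

```python
import re
import string

# A number followed by a unit, e.g. "8 mg" (the unit list is an assumption).
DOSAGE_RE = re.compile(r"\b\d+(?:\.\d+)?\s?(?:mg|mcg|g|ml)\b", re.IGNORECASE)
WORD_RE = re.compile(r"[A-Za-z']+")
SENT_SPLIT_RE = re.compile(r"[.!?]+")

# Hypothetical mini-lexicon; a real list would be far longer.
MEDICATIONS = {"sertraline", "fluoxetine"}


def extract_features(text):
    """Return a dict of simple post-level features."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    # Naive sentence split; decimals such as "2.5" would be split too.
    sentences = [s for s in SENT_SPLIT_RE.split(text) if s.strip()]
    n_words = len(words)
    n_chars = len(text)
    n_punct = sum(c in string.punctuation for c in text)
    return {
        "n_words": n_words,
        "n_dosage": len(DOSAGE_RE.findall(text)),
        "n_medication": sum(w in MEDICATIONS for w in words),
        "avg_word_len": sum(map(len, words)) / n_words if n_words else 0.0,
        "avg_sentence_len": n_words / len(sentences) if sentences else 0.0,
        # Type-token ratio: distinct words divided by all words.
        "lexical_diversity": len(set(words)) / n_words if n_words else 0.0,
        "prop_punct": n_punct / n_chars if n_chars else 0.0,
        "prop_commas": text.count(",") / n_chars if n_chars else 0.0,
    }
```

The type-token ratio depends on post length, so a length-normalised diversity measure may be preferable when posts vary widely in size.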