Data – Machine learning of concepts hard even for humans: the case of online depression forums

Data collection

We collected depression-related posts from the most popular English-speaking health forums.

The dataset contains posts from February 15, 2016 until February 15, 2019, and covers only publicly available posts (registration is not required to read the forums), which were shared willingly by their authors.

The forums’ content was downloaded, in compliance with the GDPR, by SentiOne, a social listening platform.

We filtered the collected corpus in two rounds:
(1) we selected threads that contained the word "depression" or "depressed" in the title or in at least one post, then
(2) we selected posts whose link, topic or content contained at least one of these depression-related terms: depression, depressed, bummer, desolation, desperation, moody, upset, gloom, hopelessness, depressant, melancholia, sorrow, unhappiness, feeling blue, depressive, depressive disorder, unipolar depression, bipolar, bipolar depression, major depression, mdd, persistent depressive disorder, pdd, cyclothymia, mood disorder, adjustment disorder, chronic fatigue syndrome, cfs, premenstrual dysphoric disorder.
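
The two filtering rounds can be sketched as follows. This is only an illustration, not the code used for the study: the data layout (dictionaries with `title`, `posts`, `link`, `topic` and `content` keys), the shortened term list and the case-insensitive whole-word matching are all assumptions.

```python
import re

# Shortened, illustrative subset of the depression-related terms listed above.
TERMS = ["depression", "depressed", "feeling blue", "mdd", "mood disorder"]

# Assumption: case-insensitive whole-word matching.
TERM_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in TERMS) + r")\b", re.IGNORECASE
)
THREAD_RE = re.compile(r"\b(?:depression|depressed)\b", re.IGNORECASE)


def filter_posts(threads):
    """Return the posts that pass both filtering rounds.

    threads: list of dicts with a 'title' and a list of 'posts';
    each post is a dict with 'link', 'topic' and 'content' (hypothetical layout).
    """
    selected = []
    for thread in threads:
        # Round 1: the title or at least one post mentions depression/depressed.
        texts = [thread["title"]] + [p["content"] for p in thread["posts"]]
        if not any(THREAD_RE.search(t) for t in texts):
            continue
        # Round 2: keep posts whose link, topic or content contains a term.
        for post in thread["posts"]:
            if any(TERM_RE.search(post[f]) for f in ("link", "topic", "content")):
                selected.append(post)
    return selected
```

Under these assumptions, a post is kept only if its thread passes round (1) and the post itself passes round (2), so a post with a matching topic in an unrelated thread is dropped.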

The filtered corpus contained 79 889 posts.

Preprocessing

We used the following preprocessing steps:

Deletion of reposted parts of the text
Removal of duplicate posts and of posts that were too short (<20 words)
Deletion of URLs and e-mail addresses
Identification of names of mental disorders (n-grams)
[completed with some general medical terms such as "unipolar depression" or "depressive disorder"; retrieved: 15-04-2019]
Lemmatization (WordNet lemmatizer from Python NLTK)
Significant bigram detection (BigramCollocationFinder, NLTK, with the PMI measure and human evaluation)
Stop-word removal (NLTK Stopwords Corpus)
Steps that turned out not to be relevant: significant trigram detection and named entity recognition for personal names
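
A minimal sketch of the first cleaning steps (removal of short and duplicate posts, deletion of URLs and e-mail addresses) is given below. It uses only the Python standard library and is not the original pipeline: the regular expressions, the case-insensitive duplicate check and the order of the steps (taken from the list above) are assumptions. The NLTK-based steps (lemmatization, bigram detection, stop-word removal) are left out.

```python
import re

# Assumed patterns; the original patterns are not documented.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")


def clean_posts(posts, min_words=20):
    """Drop short and duplicate posts, then strip URLs and e-mail addresses."""
    seen = set()
    cleaned = []
    for text in posts:
        normalized = " ".join(text.split())  # collapse whitespace
        if len(normalized.split()) < min_words:
            continue  # too short (< 20 words by default)
        key = normalized.lower()
        if key in seen:
            continue  # duplicate (compared case-insensitively, an assumption)
        seen.add(key)
        without_urls = URL_RE.sub(" ", normalized)
        without_emails = EMAIL_RE.sub(" ", without_urls)
        cleaned.append(" ".join(without_emails.split()))
    return cleaned
```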

After preprocessing, the corpus contained 67 857 posts.

Software and packages used

We used Python 3.7 with the NLTK package.

Reference: Bird, S., Loper, E. and Klein, E. (2009), Natural Language Processing with Python. O’Reilly Media Inc.

Annotation

The annotated database is a random sample of 4500 posts drawn from the corpus of posts at least 20 words long. The annotators were instructed to assign a secondary label to a text if needed (34% of the posts received a second label).

Two independent annotators labelled each text. In case of explicit disagreement between the annotators, a senior researcher also labelled the text (this concerned 12.3% of the posts).

We aggregated the labels of the annotators into an integrated label based on these principles:

Case	Decision
Only two labels, and they match	That category was chosen
Only one of the annotators gave a "valid" label to the post; the other label was "unclassifiable" or "irrelevant"	The "valid" (primary) label was chosen
Two matching labels out of three (one annotator chose one category, the other chose two categories)	That category was chosen
Only two identical labels among the four labels used	That category was chosen
Both annotators gave the same primary and secondary labels in the same order	The primary label was chosen
The two annotators used the same labels, but in a different order	Third annotator’s decision
Only different labels, none of which was "unclassifiable" or "irrelevant"	Third annotator’s decision
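
The decision rules in the table can be expressed as a small function. The sketch below is our reading of the table, not the original implementation: the tuple representation of an annotation, the category names used in the example (`"bio"`, `"psy"`, `"soc"`) and the fallback to the third annotator for cases the table does not cover are assumptions.

```python
INVALID = {"unclassifiable", "irrelevant"}


def aggregate(a, b, third=None):
    """Combine two annotations into one integrated label.

    a, b:  (primary, secondary) tuples; secondary is None if not given.
    third: label given by the senior researcher, if any.
    """
    a_valid = a[0] not in INVALID
    b_valid = b[0] not in INVALID
    # Only one annotator gave a "valid" label: take that primary label.
    if a_valid != b_valid:
        return a[0] if a_valid else b[0]
    # Identical annotations (one or two labels, same order): primary label.
    if a == b:
        return a[0]
    labels_a = {label for label in a if label is not None}
    labels_b = {label for label in b if label is not None}
    # Same two labels in a different order: third annotator decides.
    if labels_a == labels_b:
        return third
    # Exactly one label shared among three or four labels: take it.
    shared = labels_a & labels_b
    if len(shared) == 1:
        return next(iter(shared))
    # Only different labels (or a case not covered by the table).
    return third


# Hypothetical category names, for illustration only.
print(aggregate(("bio", None), ("psy", "bio")))          # one shared label
print(aggregate(("bio", "psy"), ("psy", "bio"), "psy"))  # third annotator decides
```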

Features for the automated classification

In order to move beyond word frequencies and to build some background knowledge into the model, we defined further variables (‘features’) that might help the learner to classify the posts. Most of them are general linguistic characteristics of the post, possibly related to the social status or age of the author.

Other features are more specific to the topic of depression. There is also one author-level characteristic describing the author’s engagement in the forum community. We used these features as predictors because we assumed they might be related to framing either in a direct way (e.g. the occurrence of a dosage may indicate bio-medical framing) or in a rather indirect way (e.g. a higher level of lexical diversity may be associated with a higher educational level, which might affect framing).

Type	Feature used
Specific characteristics of the post possibly related to framing	Number of occurrences of any mental health medication; number of occurrences of a drug dosage (e.g. ‘8 mg’)
Characteristics of the author	Author’s level of activity on the forum
General linguistic characteristics of the post	Proportion of nouns / verbs / adjectives / emojis / misspelled words / sentences / words / stop words / numbers / punctuation marks / commas; average length of sentences / words; number of words; lexical diversity; sentiment score; number of abbreviations commonly used in chat forums (e.g., 2nite=tonight) [retrieved: 15-04-2019]
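
A few of these features can be computed with the standard library alone, as in the sketch below. It is illustrative only and does not reproduce the original feature extraction: the tokenization, the dosage pattern and the definition of lexical diversity as a type-token ratio are assumptions, since the exact definitions are not given here.

```python
import re

WORD_RE = re.compile(r"[A-Za-z']+")
# Assumed dosage pattern: a number followed by "mg", e.g. "8 mg" or "20mg".
DOSAGE_RE = re.compile(r"\b\d+(?:\.\d+)?\s?mg\b", re.IGNORECASE)
# Naive sentence splitter; note that it also splits on decimal points.
SENT_SPLIT_RE = re.compile(r"[.!?]+")


def post_features(text):
    """Compute a handful of illustrative post-level features."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    sentences = [s for s in SENT_SPLIT_RE.split(text) if s.strip()]
    n_words = len(words)
    return {
        "n_words": n_words,
        "avg_word_length": sum(map(len, words)) / n_words if n_words else 0.0,
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        # Assumption: lexical diversity as type-token ratio (unique / all words).
        "lexical_diversity": len(set(words)) / n_words if n_words else 0.0,
        "dosage_count": len(DOSAGE_RE.findall(text)),
    }
```

The part-of-speech proportions, misspelling counts and the sentiment score would need additional resources (a tagger, a dictionary, a sentiment lexicon) and are therefore not sketched here.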