Datasets used in the paper "Comparing Health Forums: User Engagement, Salient Entities, Medical Detail"

This data comprises features extracted from posts about hypertension, diabetes, and depression made to the health communities Health Boards, Patient, r/Hypertension, r/BloodPressure, r/Diabetes, and r/Depression. Post text and other details (e.g., author) are omitted for privacy and to comply with the forums' terms, but URLs are given so that the posts can be retrieved. Text and user information are not needed to replicate our experiments.

The files are formatted as JSON, where each entry corresponds to a post thread, with the following fields:

    sub_id: submission id
    category: depression, hypertension, or diabetes
    url: link to the thread
    num_replies: total number of replies in the thread
    num_unique_users: number of distinct users who participate in the thread
    entity_list: list of medical entities mentioned in the thread, given by their UMLS codes
    drug_dosages: list of drug dosages mentioned in the thread, if any

If you use this data, please cite us as:

@inproceedings{Guimaraes_CSCW2021,
  TITLE = {Comparing Health Forums: User Engagement, Salient Entities, Medical Detail},
  AUTHOR = {Anna Guimar{\~{a}}es and Erisa Terolli and Gerhard Weikum},
  BOOKTITLE = {Companion of the 2021 {ACM} Conference on Computer Supported Cooperative Work and Social Computing, {CSCW}, October 23-27, 2021},
  PUBLISHER = {{ACM}},
  ADDRESS = {Virtual Event},
  YEAR = {2021}
}
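
The field layout described above can be explored with a few lines of Python. The following is a minimal sketch, not part of the original release. It assumes that each file's top level is either a list of thread entries or an object mapping ids to entries, that num_replies is stored as a number, and that entity_list is a list of strings. The helper names load_threads and summarize are hypothetical, and the sample values at the bottom are made-up toy data rather than entries from the dataset.

```python
import json
from collections import Counter


def load_threads(path):
    """Load thread entries from one JSON file (path is supplied by the caller)."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Assumption: the top level is a list of entries, or a dict of id -> entry.
    return list(data.values()) if isinstance(data, dict) else list(data)


def summarize(threads):
    """Per category: thread count, total and mean replies, entity frequencies."""
    summary = {}
    for t in threads:
        s = summary.setdefault(
            t["category"], {"threads": 0, "replies": 0, "entities": Counter()}
        )
        s["threads"] += 1
        s["replies"] += t["num_replies"]
        # entity_list may be empty or missing; treat both as "no entities".
        s["entities"].update(t.get("entity_list") or [])
    for s in summary.values():
        s["mean_replies"] = s["replies"] / s["threads"]
    return summary


if __name__ == "__main__":
    # Toy entries with illustrative values only (not taken from the dataset).
    sample = [
        {"sub_id": "a1", "category": "diabetes", "url": "https://example.org/a1",
         "num_replies": 4, "num_unique_users": 3,
         "entity_list": ["C0011849", "C0021641"], "drug_dosages": []},
        {"sub_id": "b2", "category": "hypertension", "url": "https://example.org/b2",
         "num_replies": 0, "num_unique_users": 1,
         "entity_list": [], "drug_dosages": []},
    ]
    for category, s in summarize(sample).items():
        print(category, s["threads"], s["mean_replies"], s["entities"].most_common(3))
```

To run this on a real file, pass the result of load_threads to summarize, for example summarize(load_threads(path)) with path pointing at one of the released JSON files.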