Datasets used in the paper "X-Posts Explained: Analyzing and Predicting Controversial Contributions in Thematically Diverse Reddit Forums"

This data comprises features extracted from the initial posts of reply paths collected from the Politics, WorldNews, Relationships, and Soccer subreddits. Post text and other details (e.g., author) are omitted for privacy and terms compliance, but IDs are given so they can be retrieved from Reddit. Text and user information are not needed to replicate our experiments.

The files are formatted as JSON, with one object per line. Each line describes a path of direct replies and has the following fields:

sub_id: submission id (can be used to access the original submission, if still available, through reddit.com/r//comments/)
post_id: root (top-level) comment id (can be used to access the original comment, if still available, through reddit.com/r//comments//_/)
frac_neg: fraction of comments with negative sentiment in the path
frac_neu: fraction of comments with neutral sentiment in the path
frac_pos: fraction of comments with positive sentiment in the path
sub_sent: compound sentiment value of the submission text
avg_sent, var_sent: average and variance of compound sentiment values of comments in the path
avg_pos, var_pos: average and variance of compound sentiment values of comments with positive sentiment in the path
avg_neg, var_neg: average and variance of compound sentiment values of comments with negative sentiment in the path
frac_diff_sent: fraction of consecutive comments with different sentiment polarities
avg_post_sim, var_post_sim: average and variance of the similarity between the textual content of consecutive comments in the path
avg_sub_sim, var_sub_sim: average and variance of the similarity between the textual content of the submission and comments in the path
avg_root_sim, var_root_sim: average and variance of the similarity between the textual content of the root post and later comments in the path
avg_replies: average number of replies received by each comment in the path
avg_delay: average time delay between consecutive comments in the path, given by the difference in their posting times
frac_controversial: fraction of comments in the path that have been flagged by Reddit as controversial
frac_unique_users: fraction of unique users in the path (unique users / total comments)
entities: list of entities that appear in the path (see paper for entity selection strategy)
prior_X: binary flag denoting whether an X-post is present in the first four comments of the path
future_X: binary flag denoting whether an X-post is present in the later comments of the path

If you are using this data, please cite us as:

@inproceedings{Guimaraes_ICWSM2021,
  TITLE = {X-Posts Explained: Analyzing and Predicting Controversial Contributions in Thematically Diverse Reddit Forums},
  AUTHOR = {Anna Guimar{\~{a}}es and Gerhard Weikum},
  BOOKTITLE = {Proceedings of the 15th International Conference on Web and Social Media, {ICWSM}, June 7th-10th, 2021},
  PUBLISHER = {{AAAI} Press},
  ADDRESS = {Held Virtually},
  YEAR = {2021}
}
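
Since each line of a file is one JSON object with the fields listed above, the data can be read line by line. The following is a minimal Python sketch, not the authors' code. The helper names (load_paths, features_and_label), the file name in the comment, and the choice of future_X as the prediction label are assumptions made for illustration. The sample record is made up and does not come from the dataset.

```python
import json
from typing import Dict, Iterable, List, Tuple


def load_paths(lines: Iterable[str]) -> List[Dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def features_and_label(path: Dict, label: str = "future_X") -> Tuple[Dict, int]:
    """Separate the feature fields from the identifiers, entity list, and label.

    Using future_X as the label is an assumption for this sketch.
    """
    skip = {"sub_id", "post_id", "entities", label}
    features = {key: value for key, value in path.items() if key not in skip}
    return features, int(path[label])


# Made-up record for illustration only; the values are not from the dataset.
sample_lines = [
    '{"sub_id": "abc", "post_id": "def", "frac_neg": 0.5, '
    '"entities": [], "prior_X": 0, "future_X": 1}'
]

paths = load_paths(sample_lines)
features, label = features_and_label(paths[0])

# With a real file (hypothetical name), pass the open file handle instead:
#     with open("politics.json", encoding="utf-8") as handle:
#         paths = load_paths(handle)
```

The identifiers sub_id and post_id are kept out of the feature dictionary because they only serve to look up the original content on Reddit.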