Datasets used in the paper ``X-Posts Explained: Analyzing and Predicting Controversial Contributions in Thematically Diverse Reddit Forums''

This data comprises features extracted from the initial posts of reply paths collected from the Politics, WorldNews, Relationships, and Soccer subreddits. Post text and other details (e.g., author) are omitted for privacy and terms-of-use compliance, but IDs are given so they can be retrieved from Reddit. Text and user information are not needed to replicate our experiments.

The files are formatted as JSON, with one path of direct replies per line. Each path has the following fields:

	sub_id: submission id (can be used to access the original submission, if still available, through reddit.com/r/<name of subreddit>/comments/<sub_id>)
	post_id: root (top-level) comment id (can be used to access the original comment, if still available, through reddit.com/r/<name of subreddit>/comments/<sub_id>/_/<post_id>)

	frac_neg: fraction of comments with negative sentiment in the path
	frac_neu: fraction of comments with neutral sentiment
	frac_pos: fraction of comments with positive sentiment
	sub_sent: compound sentiment value of the submission text
	avg_sent, var_sent: average and variance of compound sentiment values of comments in the path
	avg_pos, var_pos: average and variance of compound sentiment values of comments with positive sentiment in the path
	avg_neg, var_neg: average and variance of compound sentiment values of comments with negative sentiment
	frac_diff_sent: fraction of consecutive comments with different sentiment polarities

	avg_post_sim, var_post_sim: average and variance of the similarity between the textual content of consecutive comments in the path
	avg_sub_sim, var_sub_sim: average and variance of the similarity between the textual content of the submission and comments in the path
	avg_root_sim, var_root_sim: average and variance of the similarity between the textual content of the root post and later comments in the path

	avg_replies: average of the number of replies received by each comment in the path
	avg_delay: average of the time delay between consecutive comments in the path, given by the difference in their posting times
	frac_controversial: fraction of comments in the path that have been flagged by Reddit as controversial
	frac_unique_users: fraction of unique users in the path (unique users/total comments)
	entities: list of entities which appear in the path (see paper for entity selection strategy)
	prior_X: binary flag denoting whether an X-post is present in the first four comments of the path
	future_X: binary flag denoting whether an X-post is present in the later comments of the path
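
As a minimal sketch of how the records might be read and linked back to Reddit, the Python snippet below parses one JSON object per line and builds the two URL patterns given for sub_id and post_id. It assumes each line is a standalone JSON object; the function names, the "https://www." prefix, and the sample ids are illustrative placeholders and not part of the dataset. The subreddit name is passed in separately because it is not listed among the fields.

```python
import json


def load_paths(lines):
    """Parse an iterable of text lines, one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def submission_url(subreddit, sub_id):
    """URL of the original submission, following the sub_id pattern."""
    # The "https://www." prefix is an assumption added for convenience.
    return f"https://www.reddit.com/r/{subreddit}/comments/{sub_id}"


def root_comment_url(subreddit, sub_id, post_id):
    """URL of the root (top-level) comment, following the post_id pattern."""
    return f"https://www.reddit.com/r/{subreddit}/comments/{sub_id}/_/{post_id}"


# Made-up record for illustration only; real records carry all listed fields.
sample_lines = [
    '{"sub_id": "abc123", "post_id": "def456", "prior_X": 0, "future_X": 1}',
]

paths = load_paths(sample_lines)
for path in paths:
    print(submission_url("politics", path["sub_id"]))
    print(root_comment_url("politics", path["sub_id"], path["post_id"]))
```

To read an actual file, the same function can be given an open file handle, e.g. load_paths(open(path_to_file, encoding="utf-8")), where path_to_file is whichever dataset file you downloaded.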

If you are using this data, please cite us as:

@inproceedings{Guimaraes_ICWSM2021,
TITLE = {X-Posts Explained: Analyzing and Predicting Controversial Contributions in Thematically Diverse Reddit Forums},
AUTHOR = {Anna Guimar{\~{a}}es and Gerhard Weikum},
BOOKTITLE = {Proceedings of the 15th International Conference on Web and Social Media, {ICWSM}, June 7th-10th, 2021},
PUBLISHER = {{AAAI} Press},
ADDRESS = {Held Virtually},
YEAR = {2021}
}