
Argument Mining Corpora

A compilation of the most widely used argument mining corpora in English.



Id | Corpus | Relevant Papers
1 | AraucariaDB | Reed, 2005; Reed et al., 2008
2 | European Court of Human Rights (ECHR) | Mochales & Moens, 2007; Mochales & Moens, 2008
3 | Internet Argument Corpus (IAC) | Walker et al., 2012
4 | Argument Annotated Essays Corpus (AAEC) | Stab & Gurevych, 2014a
5 | Wikipedia articles | Aharoni et al., 2014
6 | User-generated Web Discourse (gold standard Toulmin corpus, study 2) | Habernal & Gurevych, 2017

Details of the corpora

Some statistics and characteristics of the previously listed corpora are presented below.

1 - AraucariaDB
Domain: Newspapers and court cases
Language: English
Size: Over 700 analyses, about 80,000 words in total
Argument model: Walton's argumentation schemes
Annotation process:
  1. AML: Argument Markup Language (an XML-based tree structure; see the parsing sketch below)
  2. Two annotators (one main analyst and one secondary analyst)
Agreement: Unknown
Comments: Text gathered from newspaper editorials, parliamentary records, judicial summaries, and discussion boards
URL: https://arg-tech.org/index.php/research/araucariadb/
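
Since AraucariaDB analyses are distributed as AML (XML) trees, a minimal parsing sketch is shown below. The file name and the PROP/PROPTEXT element names are assumptions for illustration and may not match the exact AML schema shipped with the corpus.

```python
# Minimal sketch: reading propositions out of an AraucariaDB AML (XML) analysis.
# The file name and the PROP / PROPTEXT element names are illustrative assumptions;
# check the actual AML schema before relying on them.
import xml.etree.ElementTree as ET

def load_propositions(aml_path):
    """Return the text of every proposition node found in one AML analysis."""
    tree = ET.parse(aml_path)
    root = tree.getroot()
    propositions = []
    for prop in root.iter("PROP"):          # proposition nodes (assumed tag name)
        text_node = prop.find("PROPTEXT")   # textual content of the proposition (assumed)
        if text_node is not None and text_node.text:
            propositions.append(text_node.text.strip())
    return propositions

if __name__ == "__main__":
    for p in load_propositions("analysis_example.aml"):  # hypothetical file name
        print(p)
```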
2 - European Court of Human Rights (ECHR)
Domain: Legal
Language: English
Size: 12,904 sentences (10,133 non-argumentative and 2,771 argumentative); 2,355 premises and 416 conclusions
Time: 4 weeks; documents analysed by 2 lawyers
Argument model: Argumentation schemes
  AC: conclusion and premise
  AR: support / attack
Annotation process:
  1. All documents were annotated independently by 2 lawyers
  2. The annotations were then reviewed by a third legal expert
  3. A fourth legal expert annotated the documents again
Agreement: κ = 0.58, κ = 0.80 (see the agreement sketch below)
Comments: 55 documents composed of 25 legal cases and 29 admissibility reports
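
The κ values above are Cohen's kappa scores for inter-annotator agreement. As a minimal illustration (not the authors' actual evaluation script), agreement between two annotators' sentence-level labels could be computed as follows; the labels are invented for the example.

```python
# Minimal sketch: Cohen's kappa between two annotators' sentence-level labels.
# The labels below are invented for illustration; the reported ECHR kappas were
# computed by the corpus authors on the real annotations.
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["arg", "non-arg", "arg", "arg", "non-arg", "non-arg"]
annotator_2 = ["arg", "non-arg", "non-arg", "arg", "non-arg", "arg"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```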
3 - Internet Argument Corpus (IAC)
Domain: Political
Language: English
Size: 390,704 posts in 11,800 discussions (from the debate site 4forums.com)
Annotation process:
  1. Annotation with Amazon Mechanical Turk: a complex, multi-stage process involving several Turkers
  2. Data stored in JSON files, with most annotations in CSV format (see the loading sketch below)
Agreement: κ(topic) = 0.22–0.60; κ(avg) = 0.47
Comments: A corpus for research on political debate in Internet forums. It consists of approximately 11,000 discussions, 390,000 posts, and some 73,000,000 words
URL: https://nlds.soe.ucsc.edu/iac
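
Because the IAC distributes discussions as JSON with most annotations in CSV, loading it might look like the sketch below. The file names and the general layout are assumptions for illustration, not the corpus's documented structure.

```python
# Minimal sketch: loading IAC-style data, assuming one JSON file per discussion
# and a CSV of post-level annotations. File names are illustrative assumptions.
import csv
import json

def load_discussion(json_path):
    """Read one discussion (its posts and metadata) from a JSON file."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)

def load_annotations(csv_path):
    """Read post-level annotations from a CSV file into a list of dicts."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

if __name__ == "__main__":
    discussion = load_discussion("discussion_00001.json")   # hypothetical file
    annotations = load_annotations("post_annotations.csv")  # hypothetical file
    print(type(discussion), len(annotations))
```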
4 - Argument Annotated Essays Corpus (AAEC)
Domain: Persuasive essays (various topics)
Language: English
Size: 90 persuasive essays, 1,673 sentences with 34,917 tokens
Argument model:
  AC: major claim, claim, and premise
  AR: support / attack
Annotation process:
  1. Three annotators score the ACs and ARs
  2. ACs are consolidated and selected by majority vote (see the consolidation sketch below)
  3. The same is done for the ARs
Agreement: αU(comp) = 0.72; αU(rel) = 0.81
Comments: The corpus consists of 90 English persuasive essays (1,879 sentences) collected from an online essay forum; the second version of the corpus (AAECv2) comprises 402 essays about 8 controversial topics
URL: http://corpora.aifdb.org/AAECv2
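
As a rough illustration of the majority-vote consolidation step (a sketch under assumed data shapes, not the authors' actual tooling), per-unit labels from the three annotators could be merged like this:

```python
# Minimal sketch: consolidating three annotators' component labels by majority vote.
# Data shapes and labels are illustrative assumptions, not the authors' tooling.
from collections import Counter

def majority_vote(labels_per_annotator):
    """labels_per_annotator: list of equal-length label lists, one per annotator."""
    consolidated = []
    for unit_labels in zip(*labels_per_annotator):
        label, count = Counter(unit_labels).most_common(1)[0]
        # Keep the label only if a strict majority of annotators agree on it.
        consolidated.append(label if count > len(unit_labels) // 2 else None)
    return consolidated

annotations = [
    ["claim", "premise", "premise", "major_claim"],  # annotator 1
    ["claim", "premise", "claim",   "major_claim"],  # annotator 2
    ["claim", "claim",   "premise", "premise"],      # annotator 3
]
print(majority_vote(annotations))  # ['claim', 'premise', 'premise', 'major_claim']
```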
5 - Wikipedia articles
Domain: Various
Language: English
Size: ~50,000 sentences; 2,683 argument elements collected in the context of 33 controversial topics
Argument model:
  AC: a claim and its associated supporting evidence
  In detail: Topic, Context Dependent Claim (CDC), and Context Dependent Evidence (CDE)
Annotation process: 20 carefully trained in-house labelers, following a two-stage labeling approach:
  1. Five labelers worked independently on the same text
  2. Five labelers independently cross-checked the joint list of detected candidates
  Candidates confirmed by at least three labelers were included in the corpus (see the filtering sketch below)
Agreement: κ(claim) = 0.39; κ(evidence) = 0.40
Comments: A corpus of 2,683 argument elements, collected in the context of 33 predefined controversial topics
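
A toy sketch of that confirmation filter follows: a candidate is kept only when at least three of the five cross-checking labelers confirm it. The vote table is an illustrative assumption, not the published data format.

```python
# Minimal sketch: keep only candidates confirmed by at least 3 of the 5
# cross-checking labelers. The vote table is an illustrative assumption.
MIN_CONFIRMATIONS = 3

# candidate -> confirmations (1) or rejections (0) from the five cross-checkers
votes = {
    "CDC: violent video games increase aggression": [1, 1, 1, 0, 1],
    "CDE: a meta-analysis cited in the article":    [1, 0, 0, 1, 0],
    "CDC: school uniforms improve discipline":      [0, 1, 1, 1, 0],
}

confirmed = [cand for cand, v in votes.items() if sum(v) >= MIN_CONFIRMATIONS]
print(confirmed)  # the first and third candidates pass the threshold
```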
6 - User-generated Web Discourse
Language: English
Size: 340 documents
Time: Each annotator spent 35 hours annotating over the course of 5 weeks; discussion and consolidation of the gold data took another 6 hours
Argument model: Adaptation of Toulmin's model
  AC: claim, premise, backing, rebuttal, and refutation
Annotation process: All documents were annotated by 3 independent annotators, in three phases:
  1. 50 random comments and forum posts
  2. 148 comments and forum posts as well as 41 blog posts
  3. 96 comments/forum posts, 8 blog posts, and 8 articles
Agreement: αU = 0.48 for the joint logos dimension (claim, premise, backing, rebuttal, refutation) over articles, blog posts, comments, and forum posts (see the agreement sketch below)
Comments: Contains 340 documents about 6 controversial topics in education
URL: https://bit.ly/2vdkHOD
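
The αU reported above is Krippendorff's unitized alpha. As a rough, non-unitized stand-in, standard nominal Krippendorff's alpha over three annotators' labels could be computed with the krippendorff Python package as sketched below; the label matrix is invented for the example, and the authors' αU additionally accounts for unit boundaries.

```python
# Minimal sketch: Krippendorff's alpha (nominal) over three annotators' labels.
# This is the standard alpha, not the unitized alpha_U reported for the corpus,
# and the label matrix below is invented for illustration.
import numpy as np
import krippendorff  # pip install krippendorff

# Rows = annotators, columns = annotation units; np.nan marks a missing label.
# 0 = claim, 1 = premise, 2 = backing, 3 = rebuttal, 4 = refutation
reliability_data = np.array([
    [0, 1, 1, 2, np.nan, 3],
    [0, 1, 0, 2, 4,      3],
    [0, 1, 1, 2, 4,      np.nan],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```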

Authors

Created on Oct 04, 2022
Created by:

License

This project is licensed under the terms of the Apache License 2.0.

Acknowledgements

This work was supported by the Spanish Ministry of Science and Innovation (PID2019-108965GB-I00).