
Argument Mining Corpora

A compilation of the most widely used argument mining corpora in English.



Id | Corpus | Relevant Papers
1 | AraucariaDB | Reed, 2005; Reed et al., 2008
2 | European Court of Human Rights (ECHR) | Mochales & Moens, 2007; Mochales & Moens, 2008
3 | Internet Argument Corpus (IAC) | Walker et al., 2012
4 | Argument Annotated Essays Corpus (AAEC) | Stab & Gurevych, 2014a
5 | Wikipedia articles | Aharoni et al., 2014
6 | User-generated Web Discourse (gold standard Toulmin corpus, study 2) | Habernal & Gurevych, 2017

Details of the corpora

Some statistics and characteristics of the previously listed corpora are presented below.

1 - AraucariaDB
Domain: Newspapers and court cases
Language: English
Size: Over 700 analyses, about 80,000 words in total
Argument model: Walton's argumentation schemes
Annotation process:
  1. AML: Argument Markup Language (an XML-based tree structure; see the parsing sketch below)
  2. Two annotators (one main analyst and one secondary analyst)
Agreement: Unknown
Comments: Text gathered from newspaper editorials, parliamentary records, judicial summaries, and discussion boards
URL: https://arg-tech.org/index.php/research/araucariadb/
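
Since AraucariaDB analyses are distributed as AML (XML) trees, a minimal parsing sketch is shown below. The file name and the PROP/PROPTEXT element names are assumptions for illustration and may not match the exact AML schema shipped with the corpus.

```python
# Minimal sketch: reading propositions out of an AraucariaDB AML (XML) analysis.
# The file name and the PROP / PROPTEXT element names are illustrative assumptions;
# check the actual AML schema before relying on them.
import xml.etree.ElementTree as ET

def load_propositions(aml_path):
    """Return the text of every proposition node found in one AML analysis."""
    tree = ET.parse(aml_path)
    root = tree.getroot()
    propositions = []
    for prop in root.iter("PROP"):          # proposition nodes (assumed tag name)
        text_node = prop.find("PROPTEXT")   # textual content of the proposition (assumed)
        if text_node is not None and text_node.text:
            propositions.append(text_node.text.strip())
    return propositions

if __name__ == "__main__":
    for p in load_propositions("analysis_example.aml"):  # hypothetical file name
        print(p)
```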
2 - European Court of Human Rights (ECHR)
Domain: Legal
Language: English
Size: 12,904 sentences (10,133 non-argumentative and 2,771 argumentative); 2,355 premises and 416 conclusions
Time: 4 weeks; documents analysed by 2 lawyers
Argument model: Argumentation schemes
  AC: conclusion and premise
  AR: support / attack
Annotation process:
  1. All documents were annotated independently by 2 lawyers
  2. The annotations were then reviewed by a third legal expert
  3. A fourth legal expert annotated the documents again
Agreement: κ = 0.58, κ = 0.80 (see the agreement sketch below)
Comments: 55 documents composed of 25 legal cases and 29 admissibility reports
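
The κ values above are Cohen's kappa scores for inter-annotator agreement. As a minimal illustration (not the authors' actual evaluation script), agreement between two annotators' sentence-level labels could be computed as follows; the labels are invented for the example.

```python
# Minimal sketch: Cohen's kappa between two annotators' sentence-level labels.
# The labels below are invented for illustration; the reported ECHR kappas were
# computed by the corpus authors on the real annotations.
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["arg", "non-arg", "arg", "arg", "non-arg", "non-arg"]
annotator_2 = ["arg", "non-arg", "non-arg", "arg", "non-arg", "arg"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```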
3 - Internet Argument Corpus (IAC)
Domain: Political
Language: English
Size: 390,704 posts in 11,800 discussions (from the debate site 4forums.com)
Annotation process:
  1. Annotation with Amazon Mechanical Turk: a complex, multi-stage process involving several Turkers
  2. Data stored in JSON files, with most annotations in CSV format (see the loading sketch below)
Agreement: κ(topic) = 0.22–0.60; κ(avg) = 0.47
Comments: A corpus for research on political debate in Internet forums. It consists of approximately 11,000 discussions, 390,000 posts, and some 73,000,000 words
URL: https://nlds.soe.ucsc.edu/iac
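
Because the IAC distributes discussions as JSON with most annotations in CSV, loading it might look like the sketch below. The file names and the general layout are assumptions for illustration, not the corpus's documented structure.

```python
# Minimal sketch: loading IAC-style data, assuming one JSON file per discussion
# and a CSV of post-level annotations. File names are illustrative assumptions.
import csv
import json

def load_discussion(json_path):
    """Read one discussion (its posts and metadata) from a JSON file."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)

def load_annotations(csv_path):
    """Read post-level annotations from a CSV file into a list of dicts."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

if __name__ == "__main__":
    discussion = load_discussion("discussion_00001.json")   # hypothetical file
    annotations = load_annotations("post_annotations.csv")  # hypothetical file
    print(type(discussion), len(annotations))
```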
4 - Argument Annotated Essays Corpus (AAEC)
Domain: Persuasive essays (various topics)
Language: English
Size: 90 persuasive essays, 1,673 sentences with 34,917 tokens
Argument model:
  AC: major claim, claim, and premise
  AR: support / attack
Annotation process:
  1. Three annotators score the ACs and ARs
  2. ACs are consolidated and selected by majority vote (see the consolidation sketch below)
  3. The same is done for the ARs
Agreement: αU(comp) = 0.72; αU(rel) = 0.81
Comments: The corpus consists of 90 English persuasive essays (1,879 sentences) collected from an online essay forum; the second version of the corpus (AAECv2) comprises 402 essays about 8 controversial topics
URL: http://corpora.aifdb.org/AAECv2
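
As a rough illustration of the majority-vote consolidation step (a sketch under assumed data shapes, not the authors' actual tooling), per-unit labels from the three annotators could be merged like this:

```python
# Minimal sketch: consolidating three annotators' component labels by majority vote.
# Data shapes and labels are illustrative assumptions, not the authors' tooling.
from collections import Counter

def majority_vote(labels_per_annotator):
    """labels_per_annotator: list of equal-length label lists, one per annotator."""
    consolidated = []
    for unit_labels in zip(*labels_per_annotator):
        label, count = Counter(unit_labels).most_common(1)[0]
        # Keep the label only if a strict majority of annotators agree on it.
        consolidated.append(label if count > len(unit_labels) // 2 else None)
    return consolidated

annotations = [
    ["claim", "premise", "premise", "major_claim"],  # annotator 1
    ["claim", "premise", "claim",   "major_claim"],  # annotator 2
    ["claim", "claim",   "premise", "premise"],      # annotator 3
]
print(majority_vote(annotations))  # ['claim', 'premise', 'premise', 'major_claim']
```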
5 - Wikipedia articles
Domain: Various
Language: English
Size: ~50,000 sentences; 2,683 argument elements collected in the context of 33 controversial topics
Argument model:
  AC: a claim and its associated supporting evidence
  In detail: Topic, Context Dependent Claim (CDC), and Context Dependent Evidence (CDE)
Annotation process: 20 carefully trained in-house labelers, following a two-stage labeling approach:
  1. Five labelers worked independently on the same text
  2. Five labelers independently cross-checked the joint list of detected candidates
  Candidates confirmed by at least three labelers were included in the corpus (see the filtering sketch below)
Agreement: κ(claim) = 0.39; κ(evidence) = 0.40
Comments: A corpus of 2,683 argument elements, collected in the context of 33 predefined controversial topics
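
A toy sketch of that confirmation filter follows: a candidate is kept only when at least three of the five cross-checking labelers confirm it. The vote table is an illustrative assumption, not the published data format.

```python
# Minimal sketch: keep only candidates confirmed by at least 3 of the 5
# cross-checking labelers. The vote table is an illustrative assumption.
MIN_CONFIRMATIONS = 3

# candidate -> confirmations (1) or rejections (0) from the five cross-checkers
votes = {
    "CDC: violent video games increase aggression": [1, 1, 1, 0, 1],
    "CDE: a meta-analysis cited in the article":    [1, 0, 0, 1, 0],
    "CDC: school uniforms improve discipline":      [0, 1, 1, 1, 0],
}

confirmed = [cand for cand, v in votes.items() if sum(v) >= MIN_CONFIRMATIONS]
print(confirmed)  # the first and third candidates pass the threshold
```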
6 - User-generated Web Discourse
Language: English
Size: 340 documents
Time: Each annotator spent 35 hours annotating over the course of 5 weeks; discussion and consolidation of the gold data took another 6 hours
Argument model: Adaptation of Toulmin's model
  AC: claim, premise, backing, rebuttal, and refutation
Annotation process: All documents were annotated by 3 independent annotators, in three phases:
  1. 50 random comments and forum posts
  2. 148 comments and forum posts as well as 41 blog posts
  3. 96 comments/forum posts, 8 blog posts, and 8 articles
Agreement: αU = 0.48 for the joint logos dimension (claim, premise, backing, rebuttal, refutation) over articles, blog posts, comments, and forum posts (see the agreement sketch below)
Comments: Contains 340 documents about 6 controversial topics in education
URL: https://bit.ly/2vdkHOD
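
The αU reported above is Krippendorff's unitized alpha. As a rough, non-unitized stand-in, standard nominal Krippendorff's alpha over three annotators' labels could be computed with the krippendorff Python package as sketched below; the label matrix is invented for the example, and the authors' αU additionally accounts for unit boundaries.

```python
# Minimal sketch: Krippendorff's alpha (nominal) over three annotators' labels.
# This is the standard alpha, not the unitized alpha_U reported for the corpus,
# and the label matrix below is invented for illustration.
import numpy as np
import krippendorff  # pip install krippendorff

# Rows = annotators, columns = annotation units; np.nan marks a missing label.
# 0 = claim, 1 = premise, 2 = backing, 3 = rebuttal, 4 = refutation
reliability_data = np.array([
    [0, 1, 1, 2, np.nan, 3],
    [0, 1, 0, 2, 4,      3],
    [0, 1, 1, 2, 4,      np.nan],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```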

Authors

Created on Oct 04, 2022
Created by:

License

This project is licensed under the terms of the Apache License 2.0.

Acknowledgements

This work was supported by the Spanish Ministry of Science and Innovation (PID2019-108965GB-I00).