How to get rid of the noise in the corpus: cleaning large samples of digital newspaper texts
Title: How to get rid of the noise in the corpus: cleaning large samples of digital newspaper texts
Series/Number: International Relations Online Working Paper; 2011/2
Abstract: Large digital text samples are promising sources for text-analytical research in the social sciences. However, they can become very troublesome when they are not cleaned of the 'noise' of doublets and sampling errors, which introduce biases and undermine the reliability of content-analytical results. This paper argues that these problems can be remedied by making innovative use of computational and corpus-linguistic procedures. Automatic pairwise document comparison based on a vector space model brings doublets to light, while sampling errors can be detected with the help of text-mining procedures that measure the 'keyness' of a document, i.e. the degree to which it contains keywords representing the research topic.
Type of Access: openAccess
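The abstract's two steps can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes scikit-learn is available, that documents are plain strings, and that the similarity threshold, example documents, and keyword list are purely hypothetical stand-ins for the corpus and research topic described above.

```python
# Sketch of the two cleaning steps from the abstract (illustrative only):
# 1) doublet detection via pairwise cosine similarity of TF-IDF vectors,
# 2) a simple 'keyness' score based on topic keywords per document.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def find_doublets(docs, threshold=0.9):
    """Return index pairs of documents whose cosine similarity exceeds the threshold."""
    tfidf = TfidfVectorizer().fit_transform(docs)   # documents as TF-IDF vectors
    sim = cosine_similarity(tfidf)                  # pairwise similarity matrix
    return [(i, j)
            for i in range(len(docs))
            for j in range(i + 1, len(docs))
            if sim[i, j] >= threshold]


def keyness(doc, keywords):
    """Share of tokens in the document that belong to the topic keyword list."""
    tokens = doc.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in keywords)
    return hits / len(tokens)


# Hypothetical mini-corpus and keyword list for demonstration.
docs = [
    "The parliament debated the new climate bill yesterday.",
    "The parliament debated the new climate bill yesterday.",  # doublet
    "Local football team wins the regional cup final.",
]
keywords = {"parliament", "climate", "bill"}

print("doublet candidates:", find_doublets(docs))
for i, doc in enumerate(docs):
    print(f"doc {i} keyness: {keyness(doc, keywords):.2f}")
```

In this sketch, document pairs above the similarity threshold are flagged as doublet candidates for removal, while documents with a keyness score near zero are candidates for exclusion as sampling errors; the actual thresholds would have to be calibrated against the corpus at hand.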