Date: 2011
Type: Working Paper
How to get rid of the noise in the corpus: cleaning large samples of digital newspaper texts
International Relations Online Working Paper, 2011/2
KANTNER, Cathleen, KUTTER, Amelie, HILDEBRANDT, Andreas, PÜTTCHER, Mark, How to get rid of the noise in the corpus: cleaning large samples of digital newspaper texts, International Relations Online Working Paper, 2011/2 - https://hdl.handle.net/1814/40289
Retrieved from Cadmus, EUI Research Repository
Large digital text samples are promising sources for text-analytical research in the social sciences. However, they can prove very troublesome when not cleaned of the 'noise' of doublets and sampling errors, which introduce biases and undermine the reliability of content-analytical results. This paper argues that these problems can be remedied by making innovative use of computational and corpus-linguistic procedures. Automatic pairwise document comparison based on a vector space model brings doublets to light, while sampling errors can be discerned with the help of text-mining procedures that measure the 'keyness' of a document, i.e. the degree to which it does or does not contain keywords representing the research topic.
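The two procedures described in the abstract can be illustrated with a minimal sketch. This is not the authors' code; it assumes plain term-frequency vectors with cosine similarity for the pairwise comparison, and a simple keyword-share score for 'keyness' — the paper itself may use different weighting and thresholds.

```python
# Hypothetical sketch (not the paper's implementation): find doublets by
# pairwise cosine similarity of term-frequency vectors, and score a
# document's 'keyness' as the share of its tokens that are topic keywords.
from collections import Counter
from itertools import combinations
import math

def tf_vector(text):
    """Bag-of-words term-frequency vector for one document."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def find_doublets(docs, threshold=0.9):
    """Return pairs of document names whose similarity exceeds the threshold."""
    vecs = {name: tf_vector(text) for name, text in docs.items()}
    return [(i, j) for i, j in combinations(vecs, 2)
            if cosine(vecs[i], vecs[j]) >= threshold]

def keyness(text, keywords):
    """Share of a document's tokens that belong to the topic keyword set."""
    tokens = text.lower().split()
    return sum(t in keywords for t in tokens) / len(tokens) if tokens else 0.0
```

For example, two articles republished verbatim in different editions would score near 1.0 in `find_doublets` and be flagged as a doublet, while an off-topic article sampled by accident would show a `keyness` close to zero for the topic keyword set. The 0.9 threshold is illustrative; in practice it would be tuned on the corpus at hand.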
Cadmus permanent link: https://hdl.handle.net/1814/40289
ISSN: 2192-7278
Series/Number: International Relations Online Working Paper; 2011/2
Files associated with this item: there are no files associated with this item.