How to get rid of the noise in the corpus: cleaning large samples of digital newspaper texts
Title: How to get rid of the noise in the corpus: cleaning large samples of digital newspaper texts
Series/Number: International Relations Online Working Paper; 2011/2
Abstract: Large digital text samples are promising sources for text-analytical research in the social sciences. However, they can become very troublesome when they are not cleaned of the 'noise' of doublets and sampling errors, which introduce biases and undermine the reliability of content-analytical results. This paper argues that these problems can be remedied by making innovative use of computational and corpus-linguistic procedures. Automatic pairwise document comparison based on a vector space model brings doublets to light, while sampling errors can be detected with the help of text-mining procedures that measure the 'keyness' of a document, i.e. the degree to which it contains keywords representing the research topic.
Type of Access: openAccess
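The abstract's two steps can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes scikit-learn is available, that documents are plain strings, and that the similarity threshold, example documents, and keyword list are purely hypothetical stand-ins for the corpus and research topic described above.

```python
# Sketch of the two cleaning steps from the abstract (illustrative only):
# 1) doublet detection via pairwise cosine similarity of TF-IDF vectors,
# 2) a simple 'keyness' score based on topic keywords per document.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def find_doublets(docs, threshold=0.9):
    """Return index pairs of documents whose cosine similarity exceeds the threshold."""
    tfidf = TfidfVectorizer().fit_transform(docs)   # documents as TF-IDF vectors
    sim = cosine_similarity(tfidf)                  # pairwise similarity matrix
    return [(i, j)
            for i in range(len(docs))
            for j in range(i + 1, len(docs))
            if sim[i, j] >= threshold]


def keyness(doc, keywords):
    """Share of tokens in the document that belong to the topic keyword list."""
    tokens = doc.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in keywords)
    return hits / len(tokens)


# Hypothetical mini-corpus and keyword list for demonstration.
docs = [
    "The parliament debated the new climate bill yesterday.",
    "The parliament debated the new climate bill yesterday.",  # doublet
    "Local football team wins the regional cup final.",
]
keywords = {"parliament", "climate", "bill"}

print("doublet candidates:", find_doublets(docs))
for i, doc in enumerate(docs):
    print(f"doc {i} keyness: {keyness(doc, keywords):.2f}")
```

In this sketch, document pairs above the similarity threshold are flagged as doublet candidates for removal, while documents with a keyness score near zero are candidates for exclusion as sampling errors; the actual thresholds would have to be calibrated against the corpus at hand.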