The experts below are selected from a list of 291 experts worldwide, as ranked by the ideXlab platform.
Martin Potthast - One of the best experts on this subject based on the ideXlab platform.
-
Wikidata Vandalism Corpus 2015 (WDVC-15)
2020
Co-Authors: Benno Stein, Martin Potthast, Stefan Heindorf, Gregor Engels
Abstract: The Wikidata Vandalism Corpus 2015 (WDVC-15) is a corpus for the evaluation of automatic vandalism detectors for Wikidata. For research purposes, the corpus can be used free of charge.
-
Debiasing Vandalism Detection Models at Wikidata
The Web Conference, 2019
Co-Authors: Stefan Heindorf, Gregor Engels, Yan Scholten, Martin Potthast
Abstract: Crowdsourced knowledge bases like Wikidata suffer from low-quality edits and vandalism, and employ machine learning-based approaches to detect both kinds of damage. We reveal that state-of-the-art detection approaches discriminate against anonymous and new users: benign edits from these users receive much higher vandalism scores than benign edits from established users, causing newcomers to abandon the project prematurely. We address this problem for the first time by analyzing and measuring the sources of bias, and by developing a new vandalism detection model that avoids them. Our model FAIR-S reduces the bias ratio of the state-of-the-art vandalism detector WDVD from 310.7 to only 11.9 while maintaining high predictive performance at 0.963 ROC-AUC and 0.316 PR-AUC.
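The abstract's central quantity is a "bias ratio" between the scores a detector assigns to benign edits from different user groups. The sketch below is one plausible reading of that idea, not the paper's formal definition; the function name, data, and the exact ratio used are illustrative assumptions.

```python
# Hypothetical sketch: a "bias ratio" comparing the vandalism scores a
# detector assigns to *benign* edits from anonymous/new users versus
# established users. The definition here (ratio of mean scores) is an
# assumption for illustration; the paper's measure may differ.

def bias_ratio(scores, is_newcomer, is_benign):
    """Mean score on benign newcomer edits divided by mean score on
    benign established-user edits. A ratio near 1.0 means no group bias."""
    newcomer = [s for s, n, b in zip(scores, is_newcomer, is_benign) if n and b]
    veteran = [s for s, n, b in zip(scores, is_newcomer, is_benign) if not n and b]
    return (sum(newcomer) / len(newcomer)) / (sum(veteran) / len(veteran))

scores      = [0.90, 0.80, 0.10, 0.05, 0.95]
is_newcomer = [True, True, False, False, True]
is_benign   = [True, True, True, True, False]   # last edit is actual vandalism
print(bias_ratio(scores, is_newcomer, is_benign))  # well above 1.0 -> biased detector
```

On this toy data the benign newcomer edits average 0.85 while the benign established-user edits average 0.075, so the detector looks heavily biased against newcomers.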
-
Overview of the Wikidata Vandalism Detection Task at WSDM Cup 2017
arXiv: Information Retrieval, 2017
Co-Authors: Stefan Heindorf, Martin Potthast, Gregor Engels, Benno Stein
Abstract: We report on the Wikidata vandalism detection task at the WSDM Cup 2017. The task received five submissions, for which this paper describes their evaluation and a comparison to state-of-the-art baselines. Unlike previous work, we recast Wikidata vandalism detection as an online learning problem, requiring participant software to predict vandalism in near real-time. The best-performing approach achieves a ROC-AUC of 0.947 at a PR-AUC of 0.458. In particular, this task was organized as a software submission task: to maximize reproducibility as well as to foster future research and development on this task, the participants were asked to submit their working software to the TIRA experimentation platform along with the source code for open-source release.
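The task is evaluated with ROC-AUC and PR-AUC. As a minimal sketch of what these two measures compute, the plain-Python functions below implement the standard definitions (ROC-AUC as the probability a random positive outranks a random negative, PR-AUC as average precision); the scores and labels are made-up illustration data, not task results.

```python
# Minimal sketch of the two ranking measures reported for the task,
# computed in plain Python on invented data.

def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative
    (ties count half) -- equivalent to the area under the ROC curve."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def pr_auc(labels, scores):
    """Average precision: precision averaged at each recalled positive."""
    ranked = sorted(zip(scores, labels), reverse=True)
    tp, precisions = 0, []
    for i, (_, label) in enumerate(ranked, start=1):
        if label:
            tp += 1
            precisions.append(tp / i)
    return sum(precisions) / len(precisions)

labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
print(roc_auc(labels, scores), pr_auc(labels, scores))
```

The gap between the two numbers on the same ranking illustrates why vandalism detection, a heavily imbalanced problem, reports both: a high ROC-AUC can coexist with a much lower PR-AUC.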
-
Towards Vandalism Detection in Knowledge Bases: Corpus Construction and Analysis
International ACM SIGIR Conference on Research and Development in Information Retrieval, 2015
Co-Authors: Stefan Heindorf, Martin Potthast, Benno Stein, Gregor Engels
Abstract: We report on the construction of the Wikidata Vandalism Corpus WDVC-2015, the first corpus for vandalism in knowledge bases. Our corpus is based on the entire revision history of Wikidata, the knowledge base underlying Wikipedia. Among Wikidata's 24 million manual revisions, we have identified more than 100,000 cases of vandalism. An in-depth corpus analysis lays the groundwork for research and development on automatic vandalism detection in public knowledge bases. Our analysis shows that 58% of the vandalism revisions can be found in the textual portions of Wikidata, with the remainder in structural content, e.g., subject-predicate-object triples. Moreover, we find that some vandals also target Wikidata content whose manipulation may impact content displayed on Wikipedia, revealing potential vulnerabilities. Given the importance of knowledge bases for today's information systems, this shows that public knowledge bases must be used with caution.
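A corpus like this needs ground-truth labels over millions of revisions. One common labeling idea for wiki histories, consistent with the rollback-based tagging mentioned in the STiki papers later in this listing, is to treat revisions undone by an administrative rollback as vandalism. The sketch below illustrates that idea on an invented, simplified revision format; the field names are assumptions, not the WDVC-2015 schema.

```python
# Hypothetical labeling sketch: in a revision history, edits undone via
# a "rollback" action are labeled vandalism, everything else is treated
# as (presumed) benign. The dict format is an invented stand-in for a
# real revision dump.

def label_revisions(revisions):
    """Mark a revision as vandalism if a later rollback targets it."""
    rolled_back = {rid
                   for rev in revisions if rev.get("is_rollback")
                   for rid in rev.get("undoes", [])}
    return [{"id": rev["id"], "vandalism": rev["id"] in rolled_back}
            for rev in revisions if not rev.get("is_rollback")]

history = [
    {"id": 1},
    {"id": 2},                                      # rolled back below
    {"id": 3, "is_rollback": True, "undoes": [2]},  # the rollback itself
]
print(label_revisions(history))  # revision 2 is labeled vandalism
```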
Kathleen R. McKeown - One of the best experts on this subject based on the ideXlab platform.
-
"Got You!": Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling
International Conference on Computational Linguistics, 2010
Co-Authors: William Yang Wang, Kathleen R. McKeown
Abstract: Discriminating vandalism edits from non-vandalism edits in Wikipedia is a challenging task, as ill-intentioned edits can include a variety of content and be expressed in many different forms and styles. Previous studies are limited to rule-based methods and learning based on lexical features, lacking in linguistic analysis. In this paper, we propose a novel Web-based shallow syntactic-semantic modeling method, which utilizes Web search results as a resource and trains topic-specific n-tag and syntactic n-gram language models to detect vandalism. By combining basic task-specific and lexical features, we have achieved high F-measures using logistic boosting and logistic model tree classifiers, surpassing the results reported by major Wikipedia vandalism detection systems.
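To give a loose feel for the language-model intuition behind this approach (this is not the authors' system), the sketch below builds a bigram model from topic-relevant reference text and scores incoming edits by average log-probability: off-topic insertions fit the model poorly and score low. All text, the add-one smoothing, and the comparison are invented for illustration.

```python
# Loose sketch of the n-gram language-model intuition: an edit that fits a
# topic-specific model (trained here on one reference sentence) scores
# higher than an off-topic, spam-like insertion. Everything below is an
# illustrative assumption, not the paper's actual pipeline.
import math
from collections import Counter

def train_bigrams(text):
    words = text.lower().split()
    return Counter(zip(words, words[1:])), Counter(words)

def avg_logprob(text, bigrams, unigrams, vocab_size):
    words = text.lower().split()
    pairs = list(zip(words, words[1:]))
    # add-one smoothing gives unseen bigrams a small, nonzero probability
    lp = [math.log((bigrams[p] + 1) / (unigrams[p[0]] + vocab_size)) for p in pairs]
    return sum(lp) / len(lp)

reference = "the solar system contains eight planets orbiting the sun"
bi, uni = train_bigrams(reference)
good = avg_logprob("eight planets orbiting the sun", bi, uni, len(uni))
bad = avg_logprob("buy cheap pills now lol", bi, uni, len(uni))
print(good > bad)  # the on-topic edit scores higher
```

The paper goes well beyond this: it retrieves topic text via Web search and uses syntactic n-grams over part-of-speech tags, but the scoring principle is the same.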
William Yang Wang - One of the best experts on this subject based on the ideXlab platform.
-
"Got You!": Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling
International Conference on Computational Linguistics, 2010
Co-Authors: William Yang Wang, Kathleen R. McKeown
Abstract: Discriminating vandalism edits from non-vandalism edits in Wikipedia is a challenging task, as ill-intentioned edits can include a variety of content and be expressed in many different forms and styles. Previous studies are limited to rule-based methods and learning based on lexical features, lacking in linguistic analysis. In this paper, we propose a novel Web-based shallow syntactic-semantic modeling method, which utilizes Web search results as a resource and trains topic-specific n-tag and syntactic n-gram language models to detect vandalism. By combining basic task-specific and lexical features, we have achieved high F-measures using logistic boosting and logistic model tree classifiers, surpassing the results reported by major Wikipedia vandalism detection systems.
Andrew G West - One of the best experts on this subject based on the ideXlab platform.
-
Wikipedia Vandalism Detection: Combining Natural Language, Metadata, and Reputation Features
International Conference on Computational Linguistics, 2011
Co-Authors: B. Thomas Adler, Luca De Alfaro, Santiago M. Mola-Velasco, Paolo Rosso, Andrew G West
Abstract: Wikipedia is an online encyclopedia which anyone can edit. While most edits are constructive, about 7% are acts of vandalism: modifications made in bad faith that introduce spam and other inappropriate content. In this work, we present the results of an effort to integrate three of the leading approaches to Wikipedia vandalism detection: a spatio-temporal analysis of metadata (STiki), a reputation-based system (WikiTrust), and natural language processing features. The resulting joint system improves on the state of the art set by all previous methods and establishes a new baseline for Wikipedia vandalism detection. We examine in detail the contribution of the three approaches, both for the task of discovering fresh vandalism and for the task of locating vandalism in the complete set of Wikipedia revisions.
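One natural way to integrate independent detectors, sketched here as a hedged assumption rather than the paper's actual method, is stacking: treat each detector's score as a feature for a small meta-classifier. The logistic model below is trained with plain gradient descent on invented data.

```python
# Hedged sketch of the integration idea: the scores of three independent
# detectors (metadata-, reputation-, and NLP-based) become features of a
# tiny logistic meta-classifier. Training data and setup are invented;
# the actual joint system in the paper differs.
import math

def train_meta(features, labels, lr=0.5, epochs=2000):
    w = [0.0] * (len(features[0]) + 1)             # weights + bias
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
            p = 1 / (1 + math.exp(-z))
            g = p - y                               # gradient of log-loss
            for i, xi in enumerate(x):
                w[i] -= lr * g * xi
            w[-1] -= lr * g
    return w

def predict(w, x):
    z = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
    return 1 / (1 + math.exp(-z))

# columns: [STiki-style metadata score, WikiTrust-style reputation score, NLP score]
X = [[0.9, 0.8, 0.7], [0.8, 0.9, 0.6], [0.1, 0.2, 0.3], [0.2, 0.1, 0.2]]
y = [1, 1, 0, 0]
w = train_meta(X, y)
print(predict(w, [0.85, 0.9, 0.8]) > 0.5)  # vandalism-like input
```

A virtue of stacking here is that the learned weights also quantify each approach's contribution, which mirrors the per-approach analysis the abstract describes.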
-
Spatio-Temporal Analysis of Wikipedia Metadata and the STiki Anti-Vandalism Tool
International Symposium on Wikis and Open Collaboration, 2010
Co-Authors: Andrew G West, Sampath Kannan
Abstract: The bulk of Wikipedia anti-vandalism tools require natural language processing over the article or diff text. However, our prior work demonstrated the feasibility of using spatio-temporal properties to locate malicious edits. STiki is a real-time, on-Wikipedia tool leveraging this technique. The associated poster reviews STiki's methodology and performance. We find that competing anti-vandalism tools inhibit maximal performance. However, the tool proves particularly adept at mitigating long-term embedded vandalism. Further, its robust and language-independent nature makes it well suited for use in less-patrolled wiki installations.
-
STiki: An Anti-Vandalism Tool for Wikipedia Using Spatio-Temporal Analysis of Revision Metadata
International Symposium on Wikis and Open Collaboration, 2010
Co-Authors: Andrew G West, Sampath Kannan
Abstract: STiki is an anti-vandalism tool for Wikipedia. Unlike similar tools, STiki does not rely on natural language processing (NLP) over the article or diff text to locate vandalism. Instead, STiki leverages spatio-temporal properties of revision metadata. The feasibility of utilizing such properties was demonstrated in our prior work, which found they perform comparably to NLP efforts while being more efficient, robust to evasion, and language-independent. STiki is a real-time, on-Wikipedia implementation based on these properties. It consists of (1) a server-side processing engine that examines revisions, scoring the likelihood that each is vandalism, and (2) a client-side GUI that presents likely vandalism to end users for definitive classification (and, if necessary, reversion on Wikipedia). Our demonstration will provide an introduction to spatio-temporal properties, demonstrate the STiki software, and discuss alternative research uses for the open-source code.
-
Detecting Wikipedia Vandalism via Spatio-Temporal Analysis of Revision Metadata
European Workshop on System Security, 2010
Co-Authors: Andrew G West, Sampath Kannan
Abstract: Blatantly unproductive edits undermine the quality of the collaboratively edited encyclopedia Wikipedia. They not only disseminate dishonest and offensive content, but force editors to waste time undoing such acts of vandalism. Language processing has been applied to combat these malicious edits, but as with email spam, these filters are evadable and computationally complex. Meanwhile, recent research has shown spatial and temporal features to be effective in mitigating email spam while being lightweight and robust. In this paper, we leverage the spatio-temporal properties of revision metadata to detect vandalism on Wikipedia. An administrative form of reversion called rollback enables the tagging of malicious edits, which are contrasted with non-offending edits along numerous dimensions. Crucially, none of these features require inspection of the article or revision text. Ultimately, a classifier is produced which flags vandalism at performance comparable to the natural-language efforts we intend to complement (85% accuracy at 50% recall). The classifier is scalable (processing 100+ edits per second) and has been used to locate over 5,000 manually confirmed incidents of vandalism outside our labeled set.
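The key property of this line of work is that every feature comes from revision metadata, never from article text. As an illustrative sketch of metadata features of that general kind (the field names and the specific feature set are assumptions, not STiki's actual features):

```python
# Illustrative sketch of spatio-temporal metadata features -- computed
# from revision metadata only, never the article text. The input format
# and chosen features are assumptions for illustration.
from datetime import datetime

def metadata_features(rev, prev_rev_time):
    ts = datetime.fromisoformat(rev["timestamp"])
    return {
        "hour_of_day": ts.hour,                    # temporal: edit time of day
        "is_weekend": ts.weekday() >= 5,
        "anonymous": rev["user"].count(".") == 3,  # crude IPv4 check (spatial proxy)
        "secs_since_prev": (ts - datetime.fromisoformat(prev_rev_time)).total_seconds(),
        "comment_len": len(rev.get("comment", "")),
    }

rev = {"timestamp": "2010-03-06T02:14:00", "user": "203.0.113.7", "comment": ""}
print(metadata_features(rev, "2010-03-05T22:14:00"))
```

Feature vectors like this would then feed the classifier described in the abstract, with rollback-tagged revisions supplying the vandalism labels for training.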
Gregor Engels - One of the best experts on this subject based on the ideXlab platform.
-
Wikidata Vandalism Corpus 2015 (WDVC-15)
2020
Co-Authors: Benno Stein, Martin Potthast, Stefan Heindorf, Gregor Engels
Abstract: The Wikidata Vandalism Corpus 2015 (WDVC-15) is a corpus for the evaluation of automatic vandalism detectors for Wikidata. For research purposes, the corpus can be used free of charge.
-
Debiasing Vandalism Detection Models at Wikidata
The Web Conference, 2019
Co-Authors: Stefan Heindorf, Gregor Engels, Yan Scholten, Martin Potthast
Abstract: Crowdsourced knowledge bases like Wikidata suffer from low-quality edits and vandalism, and employ machine learning-based approaches to detect both kinds of damage. We reveal that state-of-the-art detection approaches discriminate against anonymous and new users: benign edits from these users receive much higher vandalism scores than benign edits from established users, causing newcomers to abandon the project prematurely. We address this problem for the first time by analyzing and measuring the sources of bias, and by developing a new vandalism detection model that avoids them. Our model FAIR-S reduces the bias ratio of the state-of-the-art vandalism detector WDVD from 310.7 to only 11.9 while maintaining high predictive performance at 0.963 ROC-AUC and 0.316 PR-AUC.
-
Overview of the Wikidata Vandalism Detection Task at WSDM Cup 2017
arXiv: Information Retrieval, 2017
Co-Authors: Stefan Heindorf, Martin Potthast, Gregor Engels, Benno Stein
Abstract: We report on the Wikidata vandalism detection task at the WSDM Cup 2017. The task received five submissions, for which this paper describes their evaluation and a comparison to state-of-the-art baselines. Unlike previous work, we recast Wikidata vandalism detection as an online learning problem, requiring participant software to predict vandalism in near real-time. The best-performing approach achieves a ROC-AUC of 0.947 at a PR-AUC of 0.458. In particular, this task was organized as a software submission task: to maximize reproducibility as well as to foster future research and development on this task, the participants were asked to submit their working software to the TIRA experimentation platform along with the source code for open-source release.
-
Towards Vandalism Detection in Knowledge Bases: Corpus Construction and Analysis
International ACM SIGIR Conference on Research and Development in Information Retrieval, 2015
Co-Authors: Stefan Heindorf, Martin Potthast, Benno Stein, Gregor Engels
Abstract: We report on the construction of the Wikidata Vandalism Corpus WDVC-2015, the first corpus for vandalism in knowledge bases. Our corpus is based on the entire revision history of Wikidata, the knowledge base underlying Wikipedia. Among Wikidata's 24 million manual revisions, we have identified more than 100,000 cases of vandalism. An in-depth corpus analysis lays the groundwork for research and development on automatic vandalism detection in public knowledge bases. Our analysis shows that 58% of the vandalism revisions can be found in the textual portions of Wikidata, with the remainder in structural content, e.g., subject-predicate-object triples. Moreover, we find that some vandals also target Wikidata content whose manipulation may impact content displayed on Wikipedia, revealing potential vulnerabilities. Given the importance of knowledge bases for today's information systems, this shows that public knowledge bases must be used with caution.