Research Projects

 Detecting Sockpuppets in Wikipedia

Wikipedia is the world’s largest repository of knowledge, built through the collaboration of hundreds of thousands of volunteer editors spread all across the world. Vandals often try to subvert the collaborative editing process in Wikipedia by creating false identities that are called sockpuppets. These sockpuppets are used to bypass bans, create a false majority, or to spread misinformation in article discussion pages and policy pages. Anyone can create an account or multiple accounts in Wikipedia, and the IP address of a user is not available to most administrators. Currently, Wikipedia uses a manual process to detect sockpuppets and the puppet operators through an analysis of the users’ IP addresses and behaviors. However, this process is time-consuming, requires years of expertise, and often cannot detect smart sockpuppets that use IP spoofing tools. In this research, we propose a scheme for detecting sockpuppets solely based on the small snippets of comments they make in various discussion pages. We tackle this problem as a form of authorship attribution and adapt state of the art approaches for this task. To evaluate the effectiveness of our proposed solution, we take real sockpuppet cases from Wikipedia and apply our techniques. Experiments show that our tool is 70% effective in detecting sockpuppets and agrees with the human evaluators of Wikipedia.


Cross-Domain Authorship Attribution

We are currently building tools to retrieve and analyze spontaneous data for cross-domain authorship attribution. By taking data from social media, research publications, and spoken word lectures and interviews, we hope to be able to use data from one domain to identify the author in the other. Authorship attribution studies have been successful when done within the same domain, but there has been very few studies in cross-domain applications. We hope to add to the burgeoning literature by assembling a corpus and using it to develop effective techniques to attribute authorship both in and between multiple domains.


Document Provenance

Provenance is the derivation history of an object, indicating its origin and lineage. When provenance or causal relationships are not recorded during document creation/modification, it is very difficult to recreate the relationships by only looking at the documents. We posit that document signatures can be very useful to establish provenance or lineage of a document. Given a set of documents, their document signatures can be used to determine which documents are related to one another. To check if two given documents are related, we can match their signatures to see the probability of the documents being related.