Tools and Datasets

Datasets
Wikipedia Sockpuppets
You can download our dataset here: LREC 2014 (Data Set | Features List)
RSS File
The RSS file contains 10,995 items from 33 blog authors. Each author's posts are kept in a separate folder and stored as text files. Each text file contains one post, and each post retains its original inline HTML markup.
The RSS file can be downloaded here: blogs.zip
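Because the posts still contain their HTML, most uses of this file start by walking the author folders and stripping the markup. The following is a minimal sketch of that step, not part of the distributed data. It assumes the archive has been extracted to a directory with one sub-folder per author and that the post files end in `.txt`; the function names `strip_html` and `load_posts` and the UTF-8 encoding are illustrative assumptions.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects the text nodes of a document and discards the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the text of `markup` with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join text nodes with a space so that <br> and block tags separate words.
    return " ".join(" ".join(parser.parts).split())


def load_posts(root):
    """Return {author_folder_name: [plain-text post, ...]}.

    Assumed layout (hypothetical): root/<author>/<post>.txt
    """
    posts = {}
    for author_dir in sorted(Path(root).iterdir()):
        if not author_dir.is_dir():
            continue  # ignore stray files next to the author folders
        texts = []
        for post_file in sorted(author_dir.glob("*.txt")):
            raw = post_file.read_text(encoding="utf-8", errors="replace")
            texts.append(strip_html(raw))
        posts[author_dir.name] = texts
    return posts
```

Calling `load_posts("blogs")` on the extracted archive would give a dictionary keyed by author folder name. This simple extractor also keeps the contents of `<script>` and `<style>` elements, so a dedicated HTML library may be preferable if the posts contain them.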
Tweets File
The tweets file contains 64,808 tweets from 49 authors. Each tweet is stored in a text file labeled with the tweet's unique ID (assigned by Twitter). Each author's tweets are stored in a folder labeled with the author's name.
The tweets file can be downloaded here: tweets.zip

 

Tools
PAN Plagiarism Detection
This is the system we submitted to the Text Alignment task of PAN'13. Given a pair of source and suspicious files, the program reports the character index and offset of the plagiarized parts in both files. The output is currently in XML format. The program can be downloaded here: text_alignment.tar
Please read the readme file in the root folder, which has instructions for running the program and for changing its parameters.
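For readers who want to post-process the XML output, here is a minimal parsing sketch. It assumes the output follows the standard PAN text alignment annotation format, in which each detection is a `feature` element named `detected-plagiarism` with `this_offset`, `this_length`, `source_reference`, `source_offset` and `source_length` attributes. That is an assumption about this program's output, so check the readme and an actual output file first. The function name `read_detections` and the sample document are illustrative only.

```python
import xml.etree.ElementTree as ET


def read_detections(xml_text):
    """Return one dict per detected passage in a PAN-style annotation document.

    The element and attribute names are assumed from the PAN text alignment
    format; they are not confirmed for this particular program.
    """
    root = ET.fromstring(xml_text)
    detections = []
    for feature in root.iter("feature"):
        if feature.get("name") != "detected-plagiarism":
            continue  # skip any other annotation types
        detections.append({
            "suspicious_offset": int(feature.get("this_offset")),
            "suspicious_length": int(feature.get("this_length")),
            "source_file": feature.get("source_reference"),
            "source_offset": int(feature.get("source_offset")),
            "source_length": int(feature.get("source_length")),
        })
    return detections


# Hand-written sample in the assumed format, not real program output.
SAMPLE = """<document reference="suspicious.txt">
  <feature name="detected-plagiarism" this_offset="5" this_length="100"
           source_reference="source.txt" source_offset="40" source_length="95"/>
</document>"""

for detection in read_detections(SAMPLE):
    print(detection)
```

Each dictionary gives the start and length, in characters, of a passage in the suspicious file together with the matching passage in the source file.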