Masters Computing Minor Thesis

Concept based Tree Structure Representation for Paraphrased Plagiarism Detection

By

KietNim

A thesis submitted for the degree of

Master of Science (Computer and Information Science)

School of Computer and Information Science

University of South Australia

November 2012

Supervisor

Dr. Jixue Liu

Associate Supervisor

Dr. Jiuyong Li

Declaration

I declare that the thesis presents the original works conducted by myself and does not incorporate without reference to any material submitted previously for a degree in any university. To the best of my knowledge, the thesis does not contain any material previously published or written except where due acknowledgement is made in the content.

KietNim

November 2012

Acknowledgements

I would like to express my sincere gratitude to my supervisors, Dr. Jixue Liu and Dr. Jiuyong Li – professors and researchers at University of South Australia – for their dedicated support, professional advice, feedback and encouragement throughout the period of conducting thestudy. In addition, I would like to thank all of my course coordinators for their dedicated and in-depth teaching. Finally, I would like to thank my family for always encouraging and providing me their full support throughout the study in Australia.

Abstract

In the era of World Wide Web, searching for information can be performed easily by the support of several search engines and online databases. However, this also makes the task of protecting intellectual property from information abuses become more difficult. Plagiarism is one of those dishonest behaviors. Most existing systems can efficiently detect literal plagiarism where exact copy or only minor changes are made. In cases where plagiarists use intelligent methods to hide their intentions, these PD systems usually fail to detect plagiarized documents.

The concept based tree structure representation can be the potential solution for paraphrased plagiarism detection – one of the intelligent plagiarism tactics. By exploiting WordNet as the background knowledge, concept-based feature can be generated. The additional feature in combining with the traditional term-based feature and the term-based tree structure can enhance document representation. In particular, this modified model not only can capture syntactic information like the term-based model does but also can discover hidden semantic information of a document. Consequently, semantic-similar documents can be detected and retrieved.

The contributions of the modified structure can be expressed in the following two. Firstly, a real-time prototype for high level plagiarism detection is proposed in this study. Secondly, the additional concept-based feature provides considerable improvements for the task of Document Clustering in the way that more semantic-related documents can be grouped into same clusters even though they are expressed in different ways. Consequently, the task of Document Retrieval can relatively retrieve more relevant documents in same topics.

Table of Contents

Declaration

Acknowledgements

Abstract

List of Figures

List of Tables

Chapter 1 – Introduction

1.1.Background

1.2.Motivations

1.3.Fields of Thesis

1.4.Research Question

1.5.Contributions

Chapter 2 – Literature Review

2.1.Plagiarism Taxonomy

2.2.Document Representation

2.2.1.Flat Feature Representation

2.2.2.Structural Representation

2.3.Plagiarism Detection Techniques

2.4.Limitations

Chapter 3 – Methodology

3.1.Document Representation and Indexing

3.1.1.Term based Vocabulary Construction

3.1.2.Concept based Vocabulary Construction

3.1.3.Document Representation

3.1.4.Document Indexing

3.2.Source Detection and Retrieval

3.3.Detail Plagiarism Analysis

3.3.1.Paragraph Level Plagiarism Analysis

3.3.2.Sentence Level Plagiarism Analysis

Chapter 4 – Experiments

4.1.Experiment Initialization

4.1.1.The Dataset and Workstation Configuration

4.1.2.Performance Measures and Parameter Configuration

4.2.Source Detection and Retrieval for Literal Plagiarism

4.3.Source Detection and Retrieval for Paraphrased Plagiarism

4.4.Study of Parameters

4.4.1.Size of Term based Vocabulary

4.4.2.Size of Concept based Vocabulary

4.4.3.Dimensions of Term based PCA feature

4.4.4.Dimensions of Concept based PCA feature

4.4.5.Contribution of the Weights and

Chapter 5 – Conclusion

5.1.Concluding Remarks

5.2.Future Works

References

Appendix A – Source code of the Modified Porter Algorithm

Appendix B – Output Example of a Term based Vocabulary

Appendix C – Output Example of a Concept based Vocabulary

List of Figures

Figure 1 - Taxonomy of Plagiarism (Alzahrani et al. 2012)

Figure 2 - Term-Document Matrix (Marksberry 2011)

Figure 3 - Singular Value Decomposition of term-document matrix A (Letsche et al. 1997)

Figure 4 - 3 layer document-paragraphs-sentences tree representation (Zhang et al. 2011)

Figure 5 - Comparision of the original & modified Porter Stemmers

Figure 6 - Data structure of Term-based Vocabulary

Figure 7 - Example of looking for synonyms, hypernyms and hyponyms

Figure 8 - Data structure for the concept-based Vocabulary

Figure 9 - Concept based Document Tree Representation

Figure 10 - 2 level SOMs for document-paragraph-sentence document tree (Chow et al. 2009)

Figure 11 - Performance of Source Detection & Retrieval for Literal Plagiarism

Figure 12 - Performance of Source Detection & Retrieval for Paraphrased Plagiarism

Figure 13 - Performance based on different sizes of Term based Vocabulary

Figure 14 - Performance based on different sizes of Concept based Vocabulary

Figure 15 - Performance based on different dimensions of Term based PCA feature

Figure 16 - Performance based on different dimensions of Concept based PCA feature

List of Tables

Table 1 - Configuration of Parameters for Literal Plagiarism

Table 2 - Source Detection & Retrieval for Literal Plagiarism

Table 3 - Configuration of Parameters for Paraphrased Plagiarism

Table 4 - Source Detection & Retrieval for Paraphrased Plagiarism

Table 5 - Performance based on different sizes of Term-based Vocabulary

Table 6 - Performance based on different sizes of Concept based Vocabulary

Table 7 - Performance based on different dimensions of Term based PCA feature

Table 8 - Performance based on different dimensions of Concept based PCA feature

Table 9 - Performance based on different values of and

1

Chapter 1 – Introduction

1.1.Background

In the era of World Wide Web, more and more documents are being digitalized and made available for accessing remotely.Searching for information has become even easier with the support of variety of search engines and online databases. However, these advantages also make the task of protecting intellectual property from information abuses become more difficult. One of those dishonest behaviors is plagiarism. It is clear that plagiarism has caused significant damages to intellectual property. Most cases have been detected in academic works such as student assignments and researches. Lukashenko et al. [1] define plagiarism as activities of “turning of someone else’s work as your own without reference to original source”.

Several systems and algorithms have been developed to tackle this problem. However, most of them can only detect word-by-word plagiarism or can be referred as literal plagiarism. These are cases in which plagiarists do the exact copy or only make minor changes to original sources. But in cases where significant changes are made, most of these “flat feature” based methods fail to detect plagiarized documents [2]. This type is referred as intellectual or intelligent plagiarism including text manipulation, translation and idea adoption.

In this study, my focus is to improve an existing structural model and conduct several experiments to test the detectionof one tactic of text manipulation- Paraphrasing.

1.2.Motivations

Paraphrasing is a strategy of intellectual plagiarism to bypass the systems only detecting exact copy or plagiarized documents with minor modifications. For instance, one popular and widely used system in academic for plagiarism detection is Turnitin. It can be seen that Turnitin can detect word-by-word copy efficiently to sentence level. However, by simply paraphrasing detected terms using their synonyms/hyponyms/hypernyms or similar phrases, it can be bypassed easily. Apparently, paraphrasing is just one of many existing intelligent tactics. It is clear that plagiarism has become more and more sophisticated [3]. Therefore, it is also urgent to have more powerful mechanisms to protect intellectual property from high level plagiarism.

In different plagiarism detection (PD) systems, documents are represented by different non-structural or structural schemes. Non-structural or flat feature based representations are the earliest mechanisms for document representation. Systems such as COPS [4] and SCAM [5] are typical applications of these representation schemes where documents are firstly broken into smallchunks of words or sentences. These chunks are then hashed and registered against a hash table to perform document retrieval (DR) and PD. Systems in [6-8] use character or word n-grams as the units for similarity detection. In common, all these flat feature based systems ignore the contextual information of words/terms in a document. Therefore, structural representations are recently developed to overcome this limitation. In these schemes, 2 promising candidates which can capture the rich textual information are graphs [9, 10] and tree structure representations [2, 11-13]. The applications of structural representation have shown significant improvements in the tasks of DR, document clustering (DC) and PD. However, majority of both non-structural and structural representation schemes are mostly based on word histograms or derived features from word histograms. Obviously, they can be used to effectively detect literal plagiarism but are not strong enough to perform such tasks of intelligent plagiarism detection.

In the research, I focus on analyzing the tree structure representation and the studies of Chow et al. [2, 11-13]. In their works, a document is hierarchically organized into layers. In this way, the tree can capture not only syntactic but also semantic information of a document. While the root represents global information or main topics of a document, other layers capture local information or sub-topics of the main topics and the leaf level can be used to perform detailed comparison. Their proposed models have improved significantly the accuracy of DC, DR and PD. However, the features used to represent each layer are still derived from the term-based Vocabulary and, hence, the systems show some limitations when performing the task of intelligent plagiarism detection.

Therefore, this study provides an extension to the term-based tree structure representation, particularly, the features used to represent each layer in order to perform the detection of one specific type of high level plagiarism – Paraphrasing. The modified structure representation is referred as the Conceptbased Tree Structure Representation.

1.3.Fields of Thesis

Document Representation; Information Retrieval; Plagiarism Detection; Text Mining.

1.4.Research Question

As outlined in section 1.2, most existing PD systems only implement flat-feature based representation for DC, DR and PD. Even though there are some applications of structural representation recently, the features used in those schemes are still derivatives of word histograms which ignore semantic similarity between words or terms. Consequently, semantic-similar documents might be considered as unrelated. Secondly, plagiarism has evolved and become more sophisticated with multiple forms including text manipulation, translation and idea adoptions. These tactics can be used to easily bypass systems only based on flat features. Even though structural feature based systems have been proved to be more effective than flat feature based systems, they are still suffered from such devious techniques. Therefore, it is urgent to develop either new or additional features to improve current structural feature based systems in order to protect intellectual property from being abused by high level plagiarism.

This thesis presents the study and development of a new mechanism in details to detect one particular tactic of sophisticated plagiarism – Paraphrasing. The aim of the research is to extend the structural model studied by Chow et al. in [2, 12,13].The original tree structure representation based solely on word histograms is enhanced with an additional feature to capture multi-dimensional semantic information. The additional feature can be referred as the Concept based feature. The ultimate aim of thestudy is to answer the following question “Is the Concept based feature in combination with the tree structure representation and the Term based feature capable of discovering plagiarism by paraphrasing and, potentially, higher level plagiarism?”.

In addition to the main research question, there are also multiple sub-questions that need to be addressed and answered including:

  • What is the tree model used to represent a document?
  • How to construct 2 types of feature for each layer?
  • The necessity of document organization and indexing?
  • What is the scheme for candidate detection and retrieval?
  • What is the scheme for detailed plagiarism analysis?

To answer the main research question, the experiments carried out focus on examining how concept-based feature in combination with the term-based tree structure representation contributes to the tasks of document organization, document retrieval and paraphrased plagiarism detection. However, the experiment model can only be built when all research sub-questions are answered and they are outlined in Methodology section in details.

In the modified structural representation, for a brief Methodologyoverview, each node of the tree is represented by 2 derived vectors of terms and concepts. To overcome the “Curse of Dimensionality” due to the lengths of these vectors, Principal Component Analysis (PCA) [14] – a well-known technique for dimensionality reduction – is further applied. For the number of tree layers, I choose the 3-layer document-paragraph-sentence model to represent a document. The task of document organization is also taken into consideration by applying the Self-Organizing Map (SOM) clustering technique [15]. In document retrieval, comparing only documents in the same areas is taken into account since documents mention different topics are regarded as serving no purpose [16], e.g. comparing a CIS paper against a collection of CIS papers rather than a CIS paper against biology papers. To generate the concept-based feature, I consider using the external background knowledge – WordNet – to firstly generate the concept-based Vocabulary. After that, the concept-based feature is derived from thisVocabulary and together with the term-based feature used to represent a document.

1.5.Contributions

In this thesis, the C-PCA-SOM 3-stage prototype for high level Plagiarism Detection is introduced. The 3 stages include: Stage 1 – Document Representation & Indexing;Stage 2 – Source Detection & Retrieval and Stage 3 – Detail Plagiarism Analysis. In addition, due to the achievement in constant processing time when conducting experiments, the prototype can provide real-time applications for Document Representation, Document Clustering, Document Retrieval and, potentially, Paraphrased Plagiarism Detection.

Through experiments, it is verified that the introduction of the additional Concept-based feature can improve the performance of Source Detection and Retrieval comparing with models solely based on Term-based feature. Furthermore, it is also proved that the enhanced tree structure representation not only can capture syntactic information like the original scheme does but also can discover hidden semantic information of a document. By capturing multi-dimensional information of a document, the task of Document Clustering can be improved in the way that more semantic-related documents can be grouped into meaningful clusters even though they are expressed differently. As a result, Document Retrieval can also be benefited since more documents mentioning the same topics can be detected and retrieved.

Chapter 2 – Literature Review

This section provides a comprehensive overview of literature on different types of plagiarism in section 2.1. In section 2.2, variety of document representation schemes are discussed including non-structural or flat feature based representation and structural representation. Existing plagiarism detection techniques are outlined in section 2.3. Finally, the limitations of these PD techniques and representation schemes are discussed for potential improvements and further studies in section 2.4.

2.1.Plagiarism Taxonomy

When human began the artifact of producing papers as a part of intellectual documentation, plagiarism also came into existence. Documentation and plagiarism exist in parallel but they are two different sides of a coin totally. While one contributes to the knowledge body of human society, the other one causes serious damages to intellectual property. Realizing this matter of fact, ethical community has developed many techniques to fight against plagiarism. However, the battle against this phenomenon is a lifelong battle since plagiarism has also evolved and become more sophisticated. Therefore, to efficiently engage such devious enemy, it is necessary to have a mapping scheme to identify and classify different types of plagiarism into meaningful categories. Many studies have been conducted to perform this task [1, 3,17]. Lukashenko et al. [1] point out different types of plagiarism activities including:

  • Copy-paste plagiarism (word to word copying).
  • Paraphrasing (using synonyms/phrases to express same content).
  • Translated plagiarism (copying content expressed in another languages).
  • Artistic plagiarism (expressing plagiarized works in different formats such as images or text).
  • Idea plagiarism (extracting and using others’ ideas).
  • Code plagiarism (copying others’ programming codes).
  • No proper use of quotation marks.
  • Misinformation of references.

More precisely, Alzahrani et al. [3] use a taxonomy and classify different types of plagiarism into 2 main categories: literal and intelligent plagiarism Fig. 1. In the former, plagiarists make exact copy or only few changes to original sources. Thus, this type of plagiarism can be detected easily. However, the latter case is much more difficult to detect because plagiarists try to hide their intentions by using many intelligent ways to change original sources. These tactics include: text manipulation, translation and idea adoption. In text manipulating, plagiarists try to change the appearance of the text while keeping the same semantic meaning or idea of the text. Paraphrasing is one tactic of text manipulation that is usually performed. It transforms text appearance by using synonyms, hyponyms, hypernyms or equivalent phrases. In the research, my main focus is to detect this type of intelligent plagiarism. Plagiarism based on Translation is also known as cross-lingual plagiarism. Offenders can use some translation softwares to copy text written in other languages to bypass monolingual systems. Finally, Alzahrani et al. consider idea adoption is the most serious and dangerous type of plagiarism since stealing ideas from others’ works without proper referencing is the most disrespectful action toward their authors and intellectual property. Apparently, this type of plagiarism is also the hardest to detect because plagiarized text might not carry any similar syntactic information to original sources. Another reason is that the ideas being plagiarized can be extracted from multiple parts of original documents.