Plagiarism is the reuse of someone else’s prior ideas, processes, results, or words without explicitly acknowledging the original author and source. In recent years, plagiarism has raised great concern over intellectual property protection. Plagiarists violate intellectual property rights either by copying source or binary code or by stealing and covertly implementing protected algorithms. The first case is also known as software plagiarism. Plagiarism involves reproducing existing information either in modified form or verbatim. It is quite common among students, researchers, and academics, and it has made the research community more aware of the need to prevent this kind of misuse. This article explains plagiarism detection.
Overview of Plagiarism Detection
Because natural languages are rich, a word may have several possible meanings and senses. This makes detecting plagiarism a hard task, especially when it involves semantic meaning and not just searching for patterns of text that were copied from others (for example, text copied and pasted from digital resources without acknowledging the original source).
Plagiarism occurs in various forms: submitting another’s work verbatim without proper citation, paraphrasing text, reordering sentences, substituting synonyms, changing the grammar, plagiarizing code, and so on. It is defined as the use or close imitation of the language and thoughts of another author and the representation of them as one’s own original work. Plagiarism carried out by substituting synonyms is difficult for commercial software to recognize. Plagiarism lowers the quality of students’ education and may thereby harm a country’s economic standing. Paraphrased work can be identified through similarities between keywords, verbatim overlaps, and changes of sentences from one form to another, using resources such as WordNet.
Plagiarism detection in text documents is an important field in information processing. It consists of searching for similar or identical text between documents. This is a very complex task because most plagiarists reuse text from source documents while trying to conceal the plagiarism by replacing words with synonyms or by reordering sentences. Plagiarism can be classified into five categories:
- Copy & Paste Plagiarism.
- Word Switch Plagiarism.
- Style Plagiarism.
- Metaphor Plagiarism.
- Idea Plagiarism.
For both textual document plagiarism and source code plagiarism, detection can be either manual or automatic.
- Manual detection: performed by a human. It is suitable for lecturers and teachers checking students’ assignments, but it is ineffective and impractical for a large number of documents, and it is uneconomical because it demands considerable effort and time.
- Automatic detection (computer-assisted detection): many software tools are used for automatic plagiarism detection, such as PlagAware, PlagScan, Check for Plagiarism, iThenticate, PlagiarismDetection.org, Academic Plagiarism, The Plagiarism Checker, Urkund, Docoloc, and more.
Defining Plagiarism Detection
Plagiarism refers to a piece of writing that has been taken from a source without proper citation. It is therefore intellectual theft, which consists of passing off someone else’s work as your own. Plagiarism arises in many different scenarios and poses a growing challenge, affecting academia and the publishing industry in particular. Plagiarism cases are an everyday topic in, for example, academia, journalism, scientific research, and even politics.
Figure 1: Process of Plagiarism Detection
With the explosive growth of content found throughout the Web, people can find nearly everything they need for their written work, but detection of such cases can become a tedious task. For these reasons society needs to tackle this problem with computer-assisted approaches, and consequently, multiple studies in the field are being conducted. All of the following are considered plagiarism:
- Turning in someone else’s work as your own
- Copying words or ideas from someone else without giving credit
- Failing to put a quotation in quotation marks
- Giving incorrect information about the source of a quotation
- Changing words but copying the sentence structure of a source without giving credit
- Copying so many words or ideas from a source that it makes up the majority of your work, whether you give credit or not
Textual Plagiarism Detection Techniques
Researchers have developed a number of methods for automatic detection of textual plagiarism, including the following:
Grammar-based method
The grammar-based method is an important tool for detecting plagiarism. It focuses on the grammatical structure of documents and uses a string-based matching approach to detect and measure similarity between documents. It is suitable for detecting exact copies without any modification, but not for detecting copied text that has been modified by rewriting or by switching words for others with the same meaning. This is one of the method’s limitations.
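As a rough illustration of string-based matching, the sketch below compares two texts by the overlap of their word trigrams. The choice of trigrams, the Jaccard measure, and the function names are assumptions made for this example and are not taken from any particular tool.

```python
import re

def word_ngrams(text, n=3):
    """Return the set of word n-grams (shingles) found in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard_similarity(a, b, n=3):
    """Jaccard overlap of the word n-gram sets of two texts (0.0 to 1.0)."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

source = "The quick brown fox jumps over the lazy dog"
exact_copy = "The quick brown fox jumps over the lazy dog"
reworded = "The fast brown fox leaps over the idle dog"

print(jaccard_similarity(source, exact_copy))  # 1.0: every trigram matches
print(jaccard_similarity(source, reworded))    # 0.0: the synonyms break every trigram
```

The second result shows the limitation in miniature: three substituted words are enough to remove all shared trigrams, even though the sentence is plainly copied.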
Semantics-based method
The semantics-based method, also regarded as an important method for plagiarism detection, focuses on detecting similarities between documents by using the vector space model. It counts how often each word recurs in a document, and then uses the resulting fingerprint of each document to match it against the fingerprints of other documents and determine their similarity. Because it works on the whole document and uses the vector space to match documents, the method is suitable for non-partial plagiarism. If a document has been only partially plagiarized, however, it cannot achieve good results, because it is difficult to locate the copied text in the original document. This is one of the method’s limitations.
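A minimal sketch of the vector space idea follows, assuming raw term counts as the document fingerprint and cosine similarity as the matching measure. Real systems typically add weighting such as TF-IDF; the function names here are illustrative only.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Represent a document as a bag of lower-cased word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between the term-count vectors of two texts."""
    va, vb = term_vector(a), term_vector(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

reference = "plagiarism detection is hard"
reordered = "hard is detection plagiarism"
partial = ("cats sleep all day and dogs bark all night while birds sing at dawn "
           "plagiarism detection is hard")

print(cosine_similarity(reference, reordered))  # word order does not matter
print(cosine_similarity(reference, partial))    # copied sentence is diluted by other text
```

The score for the partially copied document is a single number for the whole text, so it is lower and gives no indication of where the copied sentence sits, which matches the limitation described above.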
Grammar semantics hybrid method
The grammar semantics hybrid method is regarded as the most important method for detecting plagiarism in natural languages. It improves plagiarism detection results and is suitable for copied text, including text modified by rewriting or by switching words for others with the same meaning, which the grammar-based method cannot detect. It also addresses the limitation of the semantics-based method: the hybrid method can determine the location of the plagiarized parts of a document, which the semantics-based method cannot, and it calculates the similarity between documents.
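One possible reading of this combination is sketched below: texts are split into sentences, words are mapped through a synonym table before comparison, and matching sentence positions are reported. The tiny `SYNONYMS` table is a hypothetical stand-in for a lexical resource such as WordNet, and the threshold and function names are assumptions for illustration, not the published method.

```python
import re

# Hypothetical synonym table; a real system might consult a resource such as WordNet.
SYNONYMS = {"fast": "quick", "leaps": "jumps", "idle": "lazy"}

def split_sentences(text):
    """Split text into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def normalized_words(sentence):
    """Lower-case the words and replace known synonyms by a canonical form."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return {SYNONYMS.get(w, w) for w in words}

def find_matches(suspicious, source, threshold=0.8):
    """Return (suspicious_index, source_index, score) for similar sentence pairs."""
    source_sets = [normalized_words(s) for s in split_sentences(source)]
    matches = []
    for i, sentence in enumerate(split_sentences(suspicious)):
        words = normalized_words(sentence)
        for j, source_words in enumerate(source_sets):
            union = words | source_words
            score = len(words & source_words) / len(union) if union else 0.0
            if score >= threshold:
                matches.append((i, j, score))
    return matches

source = "The quick brown fox jumps over the lazy dog. Foxes are wild animals."
suspicious = "I wrote this myself. The fast brown fox leaps over the idle dog."

print(find_matches(suspicious, source))  # [(1, 0, 1.0)]
```

The result says that sentence 1 of the suspicious text corresponds to sentence 0 of the source, so the reworded copy is both detected and located.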
External plagiarism detection method
External plagiarism detection relies on a reference corpus composed of documents from which passages might have been plagiarized. A passage could be made up of paragraphs, a fixed-size block of words, a block of sentences, and so on. A suspicious document is checked for plagiarism by searching for passages that are duplicates or near-duplicates of passages in documents within the reference corpus. An external plagiarism detection system then reports these findings to a human controller, who decides whether the detected passages are plagiarized. A naive solution is to compare each passage in a suspicious document with every passage of each document in the reference corpus. This is prohibitively expensive: the reference corpus has to be large in order to find as many plagiarized passages as possible, which translates directly into very high runtimes for the naive approach. External plagiarism detection is similar to textual information retrieval (IR). Given a set of query terms, an IR system returns a ranked set of documents from a corpus that best match the query terms. The most common structure for answering such queries is an inverted index. An external plagiarism detection system that uses an inverted index indexes the passages of the documents in the reference corpus. Such a system has been presented for finding duplicate or near-duplicate documents.
External plagiarism detection can also be viewed as a nearest-neighbor problem in a vector space.
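The inverted-index approach can be sketched roughly as follows. The passage identifiers, the `min_shared` cut-off, and the function names are hypothetical choices for this example; the point is only that a suspicious passage is compared against the few reference passages that share terms with it instead of against the whole corpus.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case word tokens of a passage."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(passages):
    """Map each term to the set of passage ids that contain it."""
    index = defaultdict(set)
    for passage_id, text in passages.items():
        for term in set(tokenize(text)):
            index[term].add(passage_id)
    return index

def candidate_passages(index, query, min_shared=3):
    """Return ids of passages sharing at least min_shared distinct terms, best first."""
    counts = defaultdict(int)
    for term in set(tokenize(query)):
        for passage_id in index.get(term, ()):
            counts[passage_id] += 1
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [passage_id for passage_id, shared in ranked if shared >= min_shared]

# Hypothetical reference corpus, already split into passages.
reference = {
    "doc1:p1": "Document clustering reduces the searching time",
    "doc1:p2": "An inverted index maps terms to passages",
    "doc2:p1": "Cats sleep for most of the day",
}
index = build_index(reference)

print(candidate_passages(index, "The inverted index maps every term to passages"))
# ['doc1:p2']
```

Only the returned candidates would then be passed to a more expensive passage-level comparison and, finally, to the human controller.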
Clustering in plagiarism detection
Document clustering is an important technique used in information retrieval for many purposes. It has been used in document summarization to improve data retrieval by reducing the time needed to locate a document, and it is also used for result presentation. In plagiarism detection, document clustering is used to reduce the search time. Clustering still has some limitations and difficulties with respect to time and space. Most of the methods above have been used in plagiarism detection for textual documents.
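To show how clustering can cut down the search, the sketch below groups reference documents with a simple single-pass "leader" clustering and then compares a suspicious text only against the closest cluster. The text above does not name a clustering algorithm, so the leader scheme, the word-set Jaccard measure, and the threshold are all assumptions chosen for brevity.

```python
import re

def words(text):
    """Set of lower-cased words in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard overlap of two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def leader_clusters(docs, threshold=0.3):
    """Greedy single-pass clustering; each cluster is represented by its first document."""
    clusters = []  # list of (leader_word_set, [document ids])
    for doc_id, text in docs.items():
        doc_words = words(text)
        for leader, members in clusters:
            if jaccard(doc_words, leader) >= threshold:
                members.append(doc_id)
                break
        else:
            clusters.append((doc_words, [doc_id]))
    return clusters

def candidates(clusters, suspicious):
    """Return the members of the cluster whose leader is closest to the suspicious text."""
    suspicious_words = words(suspicious)
    best = max(clusters, key=lambda c: jaccard(suspicious_words, c[0]), default=None)
    return best[1] if best else []

# Hypothetical reference documents.
docs = {
    "a": "plagiarism detection in text documents",
    "b": "text plagiarism detection methods",
    "c": "cooking pasta with tomato sauce",
    "d": "pasta with fresh tomato sauce",
}
clusters = leader_clusters(docs)

print(candidates(clusters, "detection of plagiarism in documents"))  # ['a', 'b']
```

Here the suspicious text is compared with two cluster leaders and then with two documents instead of all four. The sketch also reflects the limitations mentioned above: the result depends on document order and on the threshold, and the clusters themselves must be stored.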