Please use this identifier to cite or link to this item:
http://localhost:8080/xmlui/handle/123456789/4469
Title: | Duplicate detection and elimination in XML data for a data warehouse |
Authors: | Mahdi, Ghaith; Hamad, Murtadha |
Keywords: | Blocking Technique; Levenshtein Distance; Smith-Waterman Similarity; ANN (Back-Propagation) |
Issue Date: | 1-Jan-2018 |
Publisher: | International Journal of Engineering & Technology |
Abstract: | Due to the significant increase in the volume of data in recent decades, the problem of duplicate data has emerged because of the multiplicity of resources from which data is collected in different formats. Duplicates arise from the existence of different formulations of the same data, so duplicate data must be cleaned to obtain a pure data set. The main concern of this study is to clean XML data, which is known for its complex hierarchical structure, in a data warehouse. This is achieved by detecting duplicates in large databases in order to increase the efficiency of data mining. In the current study, the proposed duplicate-elimination system passes through three stages. The first stage (pre-processing) includes two parts. The first part is exact-match elimination, which removes elements that are completely identical; this saves considerable time and effort by keeping many elements out of the processing stage, which is usually the most complex. In the second part, a blocking technique based on Levenshtein distance is used to minimize the number of comparisons and to block elements more accurately than traditional methods. These processes improve the dataset. The second stage (processing) computes the similarity ratio between each pair of elements within each block using the Smith-Waterman similarity algorithm. The third stage classifies each element as duplicate or non-duplicate; the Artificial Neural Network technique (back-propagation) is used for this purpose, with a threshold of 0.65 determined from the results obtained. The efficiency of the proposed system is reflected in an accuracy approaching 100%, achieved by reducing the number of "false negatives" and "false positives" relative to the "true positives". |
URI: | http://localhost:8080/xmlui/handle/123456789/4469 |
Appears in Collections: | Department of Computer Science
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
IJET-20419.pdf | | 529.76 kB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
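
The abstract above outlines a three-stage pipeline: exact-match elimination, Levenshtein-based blocking, pairwise Smith-Waterman scoring within blocks, and classification against a 0.65 threshold. Below is a minimal Python sketch of that pipeline, assuming string-valued XML elements; the function names, the `block_dist` radius, the alignment scoring parameters, and the hard 0.65 cutoff (standing in for the paper's trained back-propagation network) are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the three-stage pipeline described in the abstract.
# All names and parameters here are illustrative assumptions; the paper's
# feature set and trained back-propagation ANN are not reproduced.
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-1) -> int:
    """Best local-alignment score (Smith-Waterman)."""
    h = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best


def deduplicate(elements, block_dist=3, threshold=0.65):
    """Return pairs of elements judged to be duplicates."""
    # Stage 1a: exact-match elimination (order-preserving).
    unique = list(dict.fromkeys(elements))
    # Stage 1b: Levenshtein-based blocking against each block's first element.
    blocks = []
    for e in unique:
        for block in blocks:
            if levenshtein(e, block[0]) <= block_dist:
                block.append(e)
                break
        else:
            blocks.append([e])
    # Stages 2-3: Smith-Waterman similarity within blocks only; a fixed 0.65
    # threshold stands in for the paper's back-propagation classifier.
    duplicates = []
    for block in blocks:
        for x, y in combinations(block, 2):
            # Normalise to [0, 1]: max possible score is match * min length
            # (match defaults to 2 above); guard against empty strings.
            max_score = 2 * min(len(x), len(y)) or 1
            if smith_waterman(x, y) / max_score >= threshold:
                duplicates.append((x, y))
    return duplicates
```

Restricting the quadratic Smith-Waterman comparisons to within-block pairs, rather than all pairs in the dataset, is what the abstract credits the blocking step for: far fewer comparisons without discarding near-matches that exact-match elimination misses.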