Please use this identifier to cite or link to this item: http://localhost:8080/xmlui/handle/123456789/8693
Title: Data Quality Management for Distributed File System
Authors: Khalil, Majida
Hamad, Murtadha
Keywords: Big Data
ETL
Unstructured Data
Metadata
Distributed File System
Fragmentation
Hadoop
HDInsight
Statistical Model
Issue Date: 1-Jan-2020
Publisher: University of Anbar
Abstract: The continuous growth of data is one of the key issues in developing systems and applications for the storage, management and processing of huge amounts of unstructured data. Traditional data management approaches are inappropriate because of the size and complexity of the data. In this thesis, several procedures and algorithms are proposed for dealing with big data, including data collection and data preprocessing. Because big data comes from sources of different forms and types, an Extraction, Transformation and Loading (ETL) procedure is used to standardize the data before storing it. Unstructured data such as video is handled by an algorithm that converts it into structured data using metadata. The distributed processing (fragmentation) performs the function of a distributed implementation of the traditional time-sharing file system model, in which multiple users share files and storage resources. The Hadoop framework is also used to improve query performance and reduce response time. The Apache Hadoop project provides safe, scalable, distributed computing. Finally, to ensure data quality, a statistical model is used to evaluate higher education institutions. The results showed that Hadoop is the best approach for dealing with big data when measured by query response time: for example, a complex query took 00:00:01 on Hadoop, compared with 00:01:11 for the same query on the fragmentation approach and 00:05:13 on the standard database. As a result, the metadata will be included in reports, fields and descriptions. The total time to access complex queries is shorter in distributed processing than in non-distributed processing.
Statistical functions were also used to compare the scientific colleges with the humanities colleges and to assess their homogeneity. The value of the quality assurance function (T-test) is 0.329 and the value of T-dis is 2.05; because the t-test value is smaller than the T-dis value, it can be deduced that there is no difference between the means of the scientific and humanities samples. The Coefficient of Variation (COV) is 10.135 for the scientific colleges and 8.977 for the humanities colleges. Under the law of homogeneity, the set with the smaller COV is the more homogeneous, so the humanities set is the more homogeneous.
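The statistical comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names are hypothetical, the COV is assumed to be the sample standard deviation divided by the mean and expressed as a percentage, and the t statistic is assumed to be the pooled-variance two-sample form, since the abstract does not say which variant was used.

```python
from math import sqrt
from statistics import mean, stdev


def coefficient_of_variation(values):
    """COV as a percentage: sample standard deviation / mean * 100 (assumed definition)."""
    return stdev(values) / mean(values) * 100


def pooled_t_statistic(a, b):
    """Two-sample t statistic with pooled variance (assumed variant, equal variances)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))


def means_differ(t_stat, t_critical):
    """Decision rule from the abstract: the means differ only if |t| exceeds the critical value."""
    return abs(t_stat) > t_critical


def more_homogeneous(cov_by_group):
    """Return the name of the group with the smallest COV."""
    return min(cov_by_group, key=cov_by_group.get)


# Applying the two decision rules to the values reported in the abstract.
print(means_differ(0.329, 2.05))  # False: no difference between the means
print(more_homogeneous({"scientific": 10.135, "humanities": 8.977}))  # humanities
```

The first two functions would be applied to the raw scores of each college group, which the abstract does not list; the last two only restate the decision rules, so they can be checked directly against the reported figures.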
URI: http://localhost:8080/xmlui/handle/123456789/8693
Appears in Collections: Department of Computer Science

Files in This Item:
File | Description | Size | Format
رسالة ماجستير ماجدة ياسين خليل.pdf | | 10.28 MB | Adobe PDF

