Enhancing Data Cleaning through the Extraction and Expansion of Relaxed Functional Dependencies
- Department of Computer Engineering, Ne.C, Islamic Azad University, Neyshabur, Iran
- Islamic Azad University, Mashhad
- Department of Computer Engineering, Ma.C, Islamic Azad University, Mashhad, Iran
- Department of Computer Engineering, Qu.C, Islamic Azad University, Quchan, Iran
Received: 2025-10-28
Revised: 2026-02-26
Accepted: 2026-03-26
Published in Issue 2026-03-30
Copyright (c) 2026 Mona Kardehi Moghadam, Seyyed Javad Seyyed Mahdavi Chabok, Reza Sheybani, Reza Ghaemi (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract
Data cleaning is pivotal for ensuring high-quality datasets in machine learning and data analytics, yet traditional methods often rely on manual thresholding, which is subjective and inefficient. This paper proposes RFD-Sugeno-KMeans-Clean, a data cleaning framework that uses a Sugeno fuzzy inference system for automatic thresholding when extracting Relaxed Functional Dependencies (RFDs). It then applies K-means++ clustering to select the most effective pattern functional dependencies for cleaning complex datasets. In the second phase, a two-step SQL method detects violations and cleans the data. The method also uses Principal Component Analysis (PCA), splines for identifying and repairing multiple violations, the median, and an Adaptive Neuro-Fuzzy Inference System (ANFIS) to enhance outlier detection across diverse datasets. The method was validated on a dataset from the UCI repository. The results show that it reduces the number of data points and tuples by approximately 99.5% in most cases. By extracting optimal patterns for data cleaning, it also yields more stable and reliable responses than earlier methods, making it a viable solution for data cleaning challenges in various domains.
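To make the two central ideas of the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a zero-order Sugeno system with one input (a normalized attribute spread in [0, 1]) and three hypothetical rules, and it treats an RFD X → Y as "tuples that are close on X must be close on Y". The membership functions, rule constants, attribute names, and sample rows are all illustrative assumptions; K-means++ clustering and the two-step SQL phase are not shown.

```python
from itertools import combinations


def tri(x, a, b, c):
    """Triangular membership function with feet a and c and peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def sugeno_threshold(spread):
    """Zero-order Sugeno inference: weighted average of rule constants.

    Hypothetical rules (not taken from the paper):
      spread is LOW    -> threshold 0.0
      spread is MEDIUM -> threshold 1.0
      spread is HIGH   -> threshold 2.0
    """
    rules = [
        (tri(spread, -0.5, 0.0, 0.5), 0.0),
        (tri(spread, 0.0, 0.5, 1.0), 1.0),
        (tri(spread, 0.5, 1.0, 1.5), 2.0),
    ]
    total_weight = sum(w for w, _ in rules)
    if total_weight == 0:
        return 0.0
    return sum(w * z for w, z in rules) / total_weight


def rfd_violations(rows, lhs, rhs, lhs_tol, rhs_tol):
    """Return index pairs that are similar on lhs but too far apart on rhs."""
    bad = []
    for (i, r), (j, s) in combinations(enumerate(rows), 2):
        similar_lhs = abs(r[lhs] - s[lhs]) <= lhs_tol
        similar_rhs = abs(r[rhs] - s[rhs]) <= rhs_tol
        if similar_lhs and not similar_rhs:
            bad.append((i, j))
    return bad


# Illustrative data only; "age" and "dose" are made-up attribute names.
rows = [
    {"age": 30, "dose": 10},
    {"age": 31, "dose": 11},
    {"age": 30, "dose": 25},
    {"age": 50, "dose": 40},
]
rhs_tol = sugeno_threshold(0.5)  # threshold derived instead of hand-picked
violations = rfd_violations(rows, "age", "dose", lhs_tol=1, rhs_tol=rhs_tol)
print(rhs_tol, violations)
```

In this sketch the tolerance on the right-hand side comes from the fuzzy system rather than from a manually chosen constant, which is the role the abstract assigns to automatic thresholding; the pairs it returns are candidates for the detection and repair phase.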
Keywords
- Data Cleaning; Relaxed Functional Dependencies; K-means Clustering Algorithm; Sugeno Fuzzy System; Automatic Thresholding; Two-Step SQL Method; Unit Violations; Multiple Violations; Outlier Detection; ANFIS
DOI: 10.57647/fomj.2026.0701.06