
Automated Data Preparation using Semantics of Data Science Artifacts



Vashisth, Shubham (2023) Automated Data Preparation using Semantics of Data Science Artifacts. Masters thesis, Concordia University.

Text (application/pdf)
Vashisth_MCompSc_F2023.pdf - Accepted Version
Available under License Spectrum Terms of Access.


Data preparation is critical for improving model accuracy. However, data scientists often work independently, spending most of their time writing code to identify and select relevant features and to enrich, clean, and transform their datasets before training predictive models for a machine learning problem. Working in isolation from each other, they lack support for learning from what other data scientists have done on similar datasets. This thesis addresses these challenges by presenting a novel approach that automates data preparation using the semantics of data science artifacts. Specifically, it proposes KGFarm, a holistic platform for automating data preparation based on machine learning models trained on the semantics of data science artifacts, captured as a knowledge graph (KG). These artifacts comprise datasets and pipeline scripts. KGFarm integrates with existing data science platforms, enabling scientific communities to automatically discover and learn from each other's work. KGFarm's models were trained on a KG constructed from the 1,000 top-rated Kaggle datasets and the 13,800 pipeline scripts with the highest number of votes. Our evaluation uses 130 unseen datasets collected from different AutoML benchmarks to compare KGFarm against state-of-the-art systems in data cleaning, data transformation, feature selection, and feature engineering tasks. Our experiments show that KGFarm consumes significantly less time and memory than the state-of-the-art systems while achieving comparable or better accuracy. Hence, KGFarm effectively handles large-scale datasets and empowers data scientists to automate data preparation pipelines interactively.

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type: Thesis (Masters)
Authors: Vashisth, Shubham
Institution: Concordia University
Degree Name: M. Comp. Sc.
Program: Computer Science
Date: 4 August 2023
Thesis Supervisor(s): Mansour, Essam
Keywords: Data Preparation, Feature Engineering, Data Transformation, Data Cleaning, Feature Selection, Knowledge Graphs, Data Science, Machine Learning
ID Code: 992618
Deposited By: Shubham Vashisth
Deposited On: 14 Nov 2023 20:38
Last Modified: 14 Nov 2023 20:38
Additional Information: https://github.com/CoDS-GCS/kgfarm


All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
