By Andreas Loizou
Over the years, as the amount of data within organizations has exponentially increased, the need for designing and developing more accurate workflows for data analytics has become paramount. A major concern in this domain is the pervasive issue of low data quality throughout organizational databases, which undermines the effectiveness of analytics-driven decision-making. Traditionally, efforts have focused on enhancing or scaling algorithmic performance primarily by increasing data input, without a thorough examination of the data’s content. In contrast, data-centric AI techniques prioritize the quality and relevance of data, tailoring it to the specific requirements of analytics tasks.
Andrew Ng, a well-known figure in the scientific and entrepreneurial communities of computer science, has consistently stated that “data is food for AI” [1], emphasizing that just as the quality of food is crucial for a healthy body, the quality of data is critical for effective analytics. Data-centric AI approaches aim to improve dataset quality using various methodologies, particularly through the application of Artificial Intelligence (AI). These approaches have gained significant traction in recent years, profoundly impacting how organizations leverage data for analytics.
Benefits of data-centric AI for big data analytics workflows
Data-centric AI, when applied to big data analytics workflows, prioritises dataset quality over quantity by eliminating noisy data instances and outliers, thereby enhancing the effectiveness of analytics models and predictions. Automation enabled by data-centric AI methods streamlines the analytics process, lowering time-to-insight and allowing data scientists to focus their efforts on higher-value tasks such as model refinement and interpretation. Scalability and flexibility are further advantages: data-centric AI methods enable organisations to adapt easily to changing data landscapes and company demands, ensuring that analytics programmes remain agile and responsive. Furthermore, leveraging diverse and comprehensive databases leads to improved prediction accuracy, allowing organisations to find hidden patterns, trends, and correlations that drive strategic decision-making and competitive advantage.
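As a rough illustration of what eliminating outliers can look like in practice, the sketch below filters a list of numeric readings with Tukey's interquartile-range rule. The rule, the threshold `k=1.5`, and the sample readings are all assumptions chosen for the example; the article does not prescribe a specific detection method.

```python
import statistics


def iqr_filter(values, k=1.5):
    """Split values into (kept, dropped) using Tukey's IQR rule.

    k=1.5 is the conventional threshold; it is an assumption here,
    not something prescribed by the article.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    dropped = [v for v in values if v < low or v > high]
    return kept, dropped


# Hypothetical sensor readings with one obviously noisy value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 55.0, 10.1]
kept, dropped = iqr_filter(readings)
print("kept:", kept)
print("dropped:", dropped)
```

A real workflow would usually choose the detection rule per column and per task, since a value that is noise for one analysis may be the signal in another.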
Challenges of data-centric AI
Despite the promising potential of data-centric AI in big data analytics tasks, various hurdles must be overcome before that potential can be fully realised. A major challenge is ensuring data privacy and security, especially as organisations deal with more stringent rules and consumer expectations for data protection. The use of AI to analyse massive volumes of personal data raises fears about potential biases, discrimination, and unintended consequences. Furthermore, the sheer volume and variety of data create logistical issues, with organisations trying to handle and analyse data at scale while maintaining quality and relevance. Establishing robust data governance standards tailored to each organisation’s requirements and investing in technologies that enable data discovery, lineage tracking, and auditability are essential steps towards overcoming these hurdles. By addressing these issues, organizations can harness the transformative potential of data-centric AI for big-data analytics.
Real-world applications
Several academic and industry organisations have embraced data-centric AI to optimize their big data analytics workflows and drive actionable insights. Giannakopoulos et al. [2] developed a three-step workflow that constructs a similarity matrix across multiple distinct datasets and then, using an operator together with a small amount of knowledge from all the datasets, builds a Machine Learning model. Their evaluation indicates that the proposed methodology can produce accurate analytic models and minimize the loss error. Amirata Ghorbani and James Zou [3] developed a framework to evaluate data quality within datasets using two approximations of Data Shapley: Truncated Monte Carlo Shapley and Gradient Shapley. Their results show that the framework can detect outliers and noisy data, and that removing them increased accuracy.
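To make the idea behind Truncated Monte Carlo Shapley more concrete, here is a minimal sketch under stated assumptions; it is not the authors' implementation. It assumes a toy one-dimensional dataset, a 1-nearest-neighbour classifier as the learner, validation accuracy as the performance score, and a chance score of 0.5 for the empty training set. The function names, the data, and the parameter defaults are all hypothetical.

```python
import random


def one_nn_accuracy(train, valid):
    """Validation accuracy of a 1-nearest-neighbour classifier.

    Returns 0.5 (chance level for two balanced classes) when there is
    no training data; this convention is an assumption of the sketch.
    """
    if not train:
        return 0.5
    correct = 0
    for x, y in valid:
        _, predicted = min(train, key=lambda point: abs(point[0] - x))
        correct += predicted == y
    return correct / len(valid)


def tmc_shapley(train, valid, iterations=200, tolerance=0.01, seed=0):
    """Estimate a value per training point by sampling random permutations.

    Each point is credited with the change in score it causes when added
    to the points that precede it in the permutation. Once the running
    score is within `tolerance` of the full-data score, the remaining
    marginal contributions in that permutation are treated as zero
    (the truncation step).
    """
    rng = random.Random(seed)
    n = len(train)
    full_score = one_nn_accuracy(train, valid)
    totals = [0.0] * n
    for _ in range(iterations):
        order = list(range(n))
        rng.shuffle(order)
        subset = []
        previous = one_nn_accuracy(subset, valid)
        for index in order:
            subset.append(train[index])
            if abs(full_score - previous) < tolerance:
                current = previous  # truncated: skip re-evaluating the model
            else:
                current = one_nn_accuracy(subset, valid)
            totals[index] += current - previous
            previous = current
    return [total / iterations for total in totals]


# Hypothetical data: (feature, label). The last training point is mislabelled.
train = [(0.0, 0), (0.2, 0), (0.8, 1), (1.0, 1), (0.1, 1)]
valid = [(0.12, 0), (0.3, 0), (0.7, 1), (0.9, 1)]

values = tmc_shapley(train, valid)
for point, value in zip(train, values):
    print(point, round(value, 3))

# Drop points with a negative estimated value and re-score.
cleaned = [point for point, value in zip(train, values) if value >= 0]
print("accuracy before:", one_nn_accuracy(train, valid))
print("accuracy after: ", one_nn_accuracy(cleaned, valid))
```

In this toy setting the mislabelled point receives a negative value because adding it never helps and often hurts the validation score, which mirrors the observation reported in [3] that removing low-value points can raise accuracy.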
In the enterprise sector, several organisations have developed their own tools to identify and improve data quality. Each tool focuses on different data quality issues and supports different data types. Tools like Soda [4] and Acceldata [5] work only with tabular numerical data and try to improve data quality by finding low-quality data in the datasets. The LightUp [6] tool also targets tabular numerical data, focusing on identifying and removing duplicate data. Great Expectations [7] offers a versatile data quality tool that works with multiple data types (numerical, string, etc.) and improves dataset quality by finding and eliminating non-value data. IBM's data quality tool, Databand [8], is designed to work with different data types (numerical, string, etc.) and tries to identify data anomalies from metadata in order to improve data quality.
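The kinds of checks these tools automate, such as finding duplicates and missing values, can be sketched in a few lines of plain Python. The example below is only an illustration of the idea and does not use or imitate the API of any of the tools named above; the function name `quality_report`, the field names, and the sample rows are hypothetical.

```python
def quality_report(rows, required_fields):
    """Count duplicate rows and missing values in a list of dict records.

    A row counts as a duplicate if all required fields match an earlier
    row; a value counts as missing if it is None or an empty string.
    """
    seen = set()
    duplicates = 0
    missing = {field: 0 for field in required_fields}
    for row in rows:
        key = tuple(row.get(field) for field in required_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
        for field in required_fields:
            if row.get(field) in (None, ""):
                missing[field] += 1
    return {"rows": len(rows), "duplicates": duplicates, "missing": missing}


# Hypothetical records mixing numerical and string fields.
rows = [
    {"id": 1, "city": "Athens", "temp_c": 21.5},
    {"id": 2, "city": "", "temp_c": 19.0},
    {"id": 1, "city": "Athens", "temp_c": 21.5},
    {"id": 3, "city": "Patras", "temp_c": None},
]
report = quality_report(rows, ["id", "city", "temp_c"])
print(report)
```

Production tools add scheduling, alerting, and metadata-based anomaly detection on top of checks like these, which is where most of their value lies.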
These real-world applications demonstrate the critical role of data-centric AI approaches in big data analytics. By focusing on data quality, organizations can significantly improve their analytics decision-making and derive more reliable and actionable insights from their data assets.
Conclusion
Data-centric AI has evolved into a key approach in big data analytics, emphasising data quality over quantity. Organisations may get more accurate and trustworthy results from analytics tasks by tackling the inherent challenges of data privacy, security, and data governance, and by using novel methods to cleanse and enrich datasets. Work from both academia and industry has produced algorithms and tools that enhance data quality and prove effective in uncovering valuable insights and driving strategic decision-making. Embracing data-centric AI is essential for organisations aiming to maximise the value of their data assets, assuring agility, scalability, and a competitive edge in the ever-evolving digital landscape. Moving forward, the continued focus on data-centric AI methodologies will be critical for gaining deeper insights and driving innovation in big data analytics.
References
[1] A. Ng, “AI Doesn’t Have to Be Too Complicated or Expensive for Your Business,” [Online]. Available: https://hbr.org/2021/07/ai-doesnt-have-to-be-too-complicated-or-expensive-for-your-business.
[2] I. Giannakopoulos, D. Ioannis and N. Koziris, “A Content-Based Approach for Modeling Analytics Operators,” CIKM’18, pp. 227–236, 2018.
[3] A. Ghorbani and J. Zou, “Data Shapley: Equitable Valuation of Data for Machine Learning,” arXiv, Jun 2019.
[4] “Soda,” [Online]. Available: https://www.soda.io/.
[5] “Acceldata,” [Online]. Available: https://www.acceldata.io/.
[6] “LightUp,” [Online]. Available: https://lightup.ai/.
[7] “Great Expectations,” [Online]. Available: https://greatexpectations.io/.
[8] “Databand,” [Online]. Available: https://www.ibm.com/products/databand.