DC Research Projects

DC1: Continuous/concurrent data summarization

Objectives: We target algorithmic constructs and data structures that maintain synopses in parallel, supporting concurrent updates and queries over varying window frames, and we aim to advance knowledge on combinable summaries that allow state partitioning. We will study adaptive semantic relaxation of concurrent data structures in both coarse-grained batch and fine-grained streaming scenarios of large-scale data processing and analytics. We will also develop consistency models appropriate for these settings in the presence of concurrency, and ways to achieve them in the aforementioned constructs, while tuning trade-offs among consistency/accuracy, time efficiency and cost (in parallelism and memory).
Expected Results: Algorithms and implementations of relaxed, elastic, concurrent synopsis data structures, integrated into streaming and batch big data processing environments, together with application-oriented use cases.
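
As an illustration of what a "combinable summary" means, here is a minimal sketch using a Count-Min sketch, chosen purely as a familiar example and not necessarily one of the synopses the project will study. The class name, the parameters `width` and `depth`, and the hashing scheme are all assumptions made for this example. The point it shows is that sketches built independently over two partitions of a stream can be merged cell by cell into the sketch of the whole stream, which is what makes state partitioning possible.

```python
import hashlib


class CountMinSketch:
    """Illustrative frequency summary; estimates never undercount."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived by salting with the row number.
        digest = hashlib.blake2b(
            str(item).encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item, count=1):
        for r in range(self.depth):
            self.rows[r][self._index(item, r)] += count

    def estimate(self, item):
        return min(self.rows[r][self._index(item, r)] for r in range(self.depth))

    def merge(self, other):
        """Combine two sketches with identical parameters by cell-wise addition."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must share width and depth")
        merged = CountMinSketch(self.width, self.depth)
        for r in range(self.depth):
            merged.rows[r] = [a + b for a, b in zip(self.rows[r], other.rows[r])]
        return merged


# Two workers summarize disjoint partitions; a third sees the whole stream.
stream = ["a", "b", "a", "c", "a", "b"]
left, right, whole = CountMinSketch(), CountMinSketch(), CountMinSketch()
for x in stream[:3]:
    left.update(x)
for x in stream[3:]:
    right.update(x)
for x in stream:
    whole.update(x)
merged = left.merge(right)
```

This single-threaded sketch ignores concurrency entirely; supporting concurrent updates and queries, and deciding how much consistency to relax, is exactly the open question the objectives describe.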

DC2: Staleness and Asynchrony for performance gains in deep learning and iterative large data processing

Objectives: Investigate process coordination models and algorithms that trade staleness of state for performance, without losing accuracy, in deep learning and iterative large data processing algorithms.
Expected Results: Asynchronous parallel algorithms for deep learning and iterative data processing (from SGD to clustering and graph algorithms).
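
To make the notion of staleness concrete, the following is a toy, single-threaded simulation and not one of the project's algorithms. It runs gradient descent on the quadratic 0.5 * (w - target)**2 while computing each gradient from a parameter value that is `staleness` steps old, mimicking a worker that read the shared state some time ago. The function name, the objective and all parameter values are assumptions chosen for illustration.

```python
from collections import deque


def stale_gradient_descent(staleness, lr=0.1, steps=200, w0=0.0, target=3.0):
    """Minimize 0.5 * (w - target)**2 using gradients of stale parameters."""
    w = w0
    # Sliding window of recent parameter values; history[0] is the oldest one.
    history = deque([w0] * (staleness + 1), maxlen=staleness + 1)
    for _ in range(steps):
        stale_w = history[0]     # value from `staleness` steps ago
        grad = stale_w - target  # gradient evaluated at the stale value
        w -= lr * grad
        history.append(w)        # the oldest snapshot is dropped automatically
    return w


fresh = stale_gradient_descent(staleness=0)
stale = stale_gradient_descent(staleness=5)
print(abs(fresh - 3.0), abs(stale - 3.0))
```

With this small step size both runs settle near the optimum, so a moderate delay costs no final accuracy here. Larger delays or step sizes can make the same iteration oscillate or diverge, which is the kind of trade-off the objectives refer to.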

DC3: Integrating event stream processing in ML and analytics systems

Objectives: Investigate how to efficiently bring model building and model serving for data analytics tightly together in a shared execution, while allowing incremental (stream processing) and bulk-iterative workloads to operate in unison.
Expected Results: Algorithms and system implementations of streaming components for data analytics and ML, integrated into data analytics and ML frameworks.

DC4: Data-centric analytics modelling for complex tasks

Objectives: Many analytics tasks require multiple input datasets, and the number of possible combinations, and with it the search space for the most profitable ones, grows exponentially. Moreover, real-life tasks usually require long, complex or ad hoc compositions of simpler tasks into workflows. This makes it very hard to estimate which input data qualitatively optimize an arbitrary workflow. In this task we plan to address these challenges by studying how to create accurate models that connect structural, semantic and operational data features with the performance of complex analytics tasks and workflows, especially given that data sources are distributed, created and updated.
Expected Results: Algorithms and tools that create performance models for analytics workflows as well as multi-input tasks; an efficient design and implementation for data that reside in different physical locations.

DC5: Analytics modelling over uncertain and variable data inputs

Objectives: Data sources may regularly exhibit varying levels of uncertainty, such as noise and missing attributes. The relationship between data uncertainty and its impact on analytics performance is still poorly understood. Similarly, data velocity is a regular source of complexity in analytics. In this task, we plan to address the challenging aspects of data uncertainty and data churn through a content-centric approach: we wish to define and implement methods and tools that model, maintain in real time, and help analysts select multiple data inputs that change in different aspects and/or contain varying degrees of uncertainty.
Expected Results: Algorithms and tools that create performance models for analytics tasks over multiple possible inputs that exhibit varying degrees of (i) churn and (ii) uncertainty.

DC6: Numeric Accuracy and Reproducibility in Deep Learning Training and Inference

Objectives: Different versions of machine-learning hardware and software typically yield slightly different answers because they evaluate floating-point operations in different orders. The result is often poorer accuracy, or the same overall accuracy but different classifications between two implementations, with unpredictable results. The goal of this work is to develop methods that yield trained models with sharper distinctions between classifications, so that the models are more resilient to minor changes.
Expected Results: Statistical methods, algorithms and implementation for multi-objective optimization; notations to describe the order of evaluation and bounds on the range of results from each implementation; adversarial measures of sensitivity to perturbation.
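
The effect of evaluation order can be reproduced in a few lines. The sketch below is a contrived illustration with made-up scores and class names ("cat", "dog"), not data from the project: the same three terms are summed in two orders, the two sums differ because floating-point addition is not associative, and the difference is enough to flip the winning class when the margin between classes is small.

```python
import math


def sequential_sum(values):
    """Add values strictly left to right, as a simple loop on hardware would."""
    total = 0.0
    for v in values:
        total += v
    return total


terms = [1e16, 1.0, -1e16]
forward = sequential_sum(terms)                   # 1.0 is absorbed by 1e16
reordered = sequential_sum([1e16, -1e16, 1.0])    # large terms cancel first
exact = math.fsum(terms)                          # correctly rounded reference


def classify(cat_score, dog_score=0.5):
    """Pick the class with the larger score (hypothetical two-class model)."""
    return "cat" if cat_score > dog_score else "dog"


print(forward, reordered, exact)
print(classify(forward), classify(reordered))
```

A model whose class scores are separated by a wider margin would give the same label under both orders, which is the sense in which sharper distinctions make a model more resilient.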

DC7: Arithmetic and Number Systems for Deep Learning

Objectives: Develop numeric types that match the value distributions and operations of training better than existing default types. Identify number systems that make better use of limited encodings for both inference and training. Investigate domain-specific and application-specific number systems and encodings for improved compactness, and customize the level of precision of data to the movement of the data within the parallel/distributed computing system.
Expected Results: Improved representations that better match value distributions and changes in values during training. Domain-specific and problem-specific number systems that improve encoding density and reduce data movement. A principled analytic approach to customizing precision to the level of the memory hierarchy and the movement of distributed data.
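
For a sense of what a denser encoding costs in precision, the sketch below emulates bfloat16 in software by keeping only the top 16 bits of an IEEE 754 single-precision value, with round-to-nearest-even. bfloat16 is used here only as a well-known existing reduced-precision format; it is an assumption for illustration, not a number system proposed by the project, and the function name is hypothetical. NaN and overflow handling are deliberately omitted.

```python
import struct


def to_bfloat16(x):
    """Round x to the nearest bfloat16 value, returned as a Python float.

    Illustrative only: assumes x is finite and well inside the float32 range.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 low-order bits that will be dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


for value in (1.0, 0.1, 3.14159265):
    approx = to_bfloat16(value)
    print(value, approx, abs(approx - value) / abs(value))
```

Each value now fits in half the bits of a float32, at the price of a relative error of up to about 2**-8. Whether that error is acceptable depends on the value distribution at hand, which is the question the objectives raise.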

DC8: Transparent Data Management and Verification

Objectives: Advance transparency and verification methods beyond XAI (Explainable Artificial Intelligence) approaches that cover core data analytics functionality only. We create holistic end-to-end descriptions of, and interactions with, data-intensive applications, contributing new concepts, methods and algorithms for transparent data science that allow users to verify and transparently explore the impact of data quality, as well as of data science design and modelling decisions, on system results.
Expected Results: Transparency and explanation models and algorithms for data-intensive systems that integrate with XAI approaches, but cover the entire design and development process of data-intensive applications.

DC9: Interaction techniques for adaptable models and streaming data systems

Objectives: Address maintenance, changing tasks, dynamically varying and streaming data, and collaborative tasks. Unlike existing solutions for XAI, which assume standardized systems in stable environments, we devise methods that semi-automatically adapt to dynamic changes in batch and streaming data, and that provide the means for interactive updates, suitable interfaces and active learning mechanisms for modern data analysis and XAI.
Expected Results: Active learning models for interactive analytics and XAI models, incremental update mechanisms and adaptation strategies for dynamically changing data (streams).

DC10: Efficient and responsible analytics for urban mobility and allied applications

Objectives: Develop high-throughput analytics for high-velocity streams of heterogeneous data from urban mobility scenarios, in order to develop application-specific trade-offs between accuracy and performance to underpin mobility applications such as traffic routing and real-time spatial provisioning of shared resources.
Expected Results: Algorithms and indexing structures for high-throughput analytics, to be published in top data science venues.

DC11: Interactive and Intelligent exploration of big complex data

Objectives: Investigate novel data exploration techniques to cope with very large and dynamic complex data such as graphs, trajectories or texts, including (i) anytime approximation techniques for efficient data exploration, (ii) efficient techniques to deal with dynamic data such as transaction data or streaming data, and (iii) scalable parallel processing approaches for data exploration on modern hardware architectures with high throughput and low workload.
Expected Results: Intelligent data exploration methods that automatically learn the intrinsic structure of the data to derive results more efficiently and effectively; anytime approximation techniques that can work under arbitrary time constraints; work-efficient parallel processing methods on shared memory and distributed systems; and incremental data exploration techniques to cope with dynamic data.

DC12: Application-Aware Relaxed Synchronisation for Distributed Graph Processing

Objectives: Investigating and improving relaxed synchronisation in distributed graph processing; designing relaxed synchronisation algorithms based on monotonicity of graph analytic algorithms; designing communication schemes that minimise redundant work; deriving properties of graph analytics problems that enhance relaxation.
Expected Results: Relaxed scheduling and communication algorithms; prototype implementation.
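
Why monotonicity permits relaxed synchronisation can be seen on a small example. The sketch below is a sequential toy, not the project's algorithm: it computes single-source shortest paths by relaxing pending vertices in an arbitrary, randomly chosen order instead of the strict priority order Dijkstra's algorithm would enforce. Because every update can only lower a distance, any processing order reaches the same final answer; a bad order only causes redundant work. The graph, the function name and the `seed` parameter are assumptions for illustration, and edge weights are assumed non-negative.

```python
import random


def relaxed_sssp(graph, source, seed=0):
    """Shortest paths with vertices processed in arbitrary order.

    graph maps each vertex to a list of (neighbour, weight) pairs.
    """
    rng = random.Random(seed)
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    worklist = [source]
    while worklist:
        # Pick any pending vertex; no global ordering is enforced.
        i = rng.randrange(len(worklist))
        worklist[i], worklist[-1] = worklist[-1], worklist[i]
        u = worklist.pop()
        for v, w in graph[u]:
            if dist[u] + w < dist[v]:  # monotone update: distances only shrink
                dist[v] = dist[u] + w
                worklist.append(v)
    return dist


graph = {
    "a": [("b", 4), ("c", 1)],
    "b": [("d", 1)],
    "c": [("b", 2), ("d", 5)],
    "d": [],
}
results = [relaxed_sssp(graph, "a", seed=s) for s in range(10)]
print(results[0])
```

In a distributed setting the arbitrary order would come from message delays and unsynchronised workers rather than a random number generator, and limiting the redundant work such orders cause is where communication schemes matter.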