subgraph isomorphism search | DIPSA: Data-Intensive Parallel Systems and Algorithms

I presented a poster on “Graphlet-based Filtering for High-Performance Subgraph Isomorphism Search” at the 19th International Summer School on Advanced Computer Architecture and Compilation for High-performance Embedded Systems organised by HiPEAC at Fiuggi, Italy between July 9 – 15, 2023.

Summary

Introduction

Graphs are powerful tools for representing complex relationships in various domains such as social networks, bioinformatics, and computer networks. The Subgraph Isomorphism (SI) search problem is crucial in graph analytics, involving finding instances of a pattern graph within a larger data graph. This problem has applications in pattern discovery and graph database queries.

However, the SI problem is NP-complete, meaning that it becomes computationally intensive for larger graphs. To tackle this challenge, researchers have proposed heuristic algorithms that aim to speed up the SI search by pruning the search space through various techniques.

Three Stages of Heuristic Algorithms

These heuristic algorithms can be divided into three stages:

Filtering: In this stage, a candidate set of data graph vertices is generated for each vertex of the pattern graph. The goal is to select a subset of data graph vertices that are potential matches for each pattern vertex. Two common filtering techniques are Label and Degree Filtering (LDF) and Neighbourhood Label Filtering (NLF). LDF ensures that only vertices with matching labels and sufficient degrees are considered, while NLF adds further constraints based on the labels of neighbours.
Ordering: The vertices of the pattern graph are ordered for mapping with the candidate data vertices. The Highest Degree First (HDF) strategy is commonly used.
Backtracking: A backtracking search is performed based on the ordered vertices to find subgraph isomorphisms.

Proposed Approach: Graphlet Filtering (GLF)

The proposed approach, Graphlet Filtering (GLF), introduces an additional stage to the heuristic algorithms. Graphlets are recurring small subgraphs, and orbits represent groups of vertices within these graphlets. By counting the occurrences of orbits in both pattern and data graphs, we propose that the count of an orbit in the pattern graph must be less than or equal to the count of the same orbit in the candidate data vertices’ graphs. This idea enhances the filtering stage by providing more stringent filtering criteria.

Preliminary Results

We conducted preliminary experiments on various datasets and found that GLF filtering provides additional filtering enhancements ranging from 4.2% to 15.38% depending on the dataset and pattern graph density. Across all datasets, the GLF technique improved filtering by 9.93% for sparse pattern graphs and 8.49% for dense pattern graphs.

Conclusion

The study suggests that incorporating graphlet-based filtering (GLF) into the existing heuristic algorithms for the Subgraph Isomorphism problem can lead to more effective filtering and pruning of the search space. We plan to explore the impact of GLF on the runtime of different heuristic algorithms. If successful, this technique could significantly reduce the execution time of algorithms used for tasks such as pattern discovery and graph database queries.

Download the poster and the abstract below.

ACACES_Abstract Download

ACACES_Poster Download

Alpha draft of the PowerPoint presentation

In this post, I’ll be introducing the core aspects of my research, discussing the significance of subgraph isomorphism search in graphs, and outlining the objectives and milestones that lay ahead for me in my PhD journey.

As you may know, graphs are a fundamental structure in computer science, finding applications in diverse real-life scenarios like computer networks, social networks, bioinformatics, astrophysics, chemistry and road systems. Graph analytics, in turn, focuses on deciphering the complexities of these interconnected structures to extract valuable insights. Graph analytics is pivotal in addressing various challenges and opportunities that arise in graph datasets across industries. Applications of graph analytics include classification, clustering and pattern recognition. In my PhD, I will focus on one particular challenge, i.e., the pattern recognition problem, also called the subgraph isomorphism search problem.

This subgraph isomorphism search problem involves identifying instances of a smaller graph within a larger graph – finding a needle in a haystack. While this problem holds immense potential for diverse applications, its complexity arises out of it being NP-complete, making it incredibly challenging to find efficient (i.e., polynomial-time) solutions. To address the challenge, exact heuristic solutions and inexact approximate solutions have been proposed in the literature. In my research, I am focusing on exact heuristic solutions for the subgraph isomorphism problem.

There are three objectives in my PhD research.

Objective 1: Dynamic Heuristic Selection

One of my core objectives revolves around the dynamic selection of heuristics based on the characteristics of the pattern and data pair. I aim to develop an intelligent system that can determine the most suitable heuristic before the initiation of the search process itself. This adaptive approach could significantly enhance the efficiency of the search process and potentially outperform existing solutions.

Objective 2: Vector-Based Embedding

To enhance efficiency, I aim to propose a novel heuristic that relies on vector-based embeddings of graph nodes. I hypothesise that this departure from conventional scalar-based heuristics would enable the encapsulation of richer information (i.e., context) within the embeddings. By leveraging this additional context, I hope to enable accurate search space reduction (filtering) and early-stage matches (ordering), thereby reducing the need for backtracking and accelerating the matching process.

Objective 3: Parallelism and Optimization

My final objective focuses on the application of parallelism to both the previous objectives. While parallelism holds immense potential for speedup, it also introduces challenges such as load balancing and synchronization. My aim is to focus on shared memory parallelism while addressing concurrent potential pitfalls and, if necessary, exploring the realm of distributed memory systems.

The presentation outlines a comprehensive work plan for my research journey, marked by milestones and deadlines. My initial focus lies on completing experimentation and crafting a conference paper that showcases the outcomes of my pursuit of objectives.

My PhD project aims to enhance the efficiency and effectiveness of graph analytics, specifically in the context of subgraph isomorphism search. By adopting heuristics dynamically, embracing vector-based embeddings, and harnessing parallelism, I aim to contribute to the understanding of subgraph isomorphism search.

DIPSA: Data-Intensive Parallel Systems and Algorithms

Tag Archives: subgraph isomorphism search

Poster Presentation | Graphlet-based Filtering for High-Performance Subgraph Isomorphism Search

High-Performance Distributed Graph Analytics | Sub-graph Isomorphism | Initial Review