Detecting Similar Complex Data Structures in Large-Scale Datasets
Abstract
Pierpaolo Massoli
The increasing volumes of complex data stored in today’s databases are driving the scientific community towards elaborating more efficient methods for data analysis. The data structures contained within them require appropriate mathematical modeling, as is the case in network structures, which can be effectively modeled by applying concepts from Graph Theory. The search for similar networks is therefore often viewed as a graph matching problem, which poses a fundamental challenge in real-world applications. This study introduces a novel approach by leveraging Locality Sensitive Hashing to efficiently address the graph matching problem. Finding an isomorphism between graphs as well as the search for the common subgraph embedded within them is achieved by hashing the graphs, thus transforming the problem into a similarity search problem. Due to its approximate nature the method applied generates false duplicates. Usual diagnostics do not guarantee high levels of accuracy of the solution. This study therefore proposes the use of the popular Conformal Prediction framework in order to evaluate the validity of the results with greater accuracy. A real-world case study is considered to test the potential of the proposed approach.