Assessing the Quality in the Detection of Similar Complex Data Structures in Large-Scale Datasets
Pierpaolo Massoli
Abstract
Complex data structures are increasingly common in today's data collections and call for more efficient methods of data analysis. Graph Theory can be effectively applied to model these structures mathematically, as is done in network analysis. The search for similar networks may therefore be viewed as a graph matching problem, which poses a fundamental challenge in real-world applications. This study investigates the quality of the detection of similar complex data structures obtained with a recently introduced approach. The approach uses basic concepts from Graph Theory to leverage Locality Sensitive Hashing, efficiently addressing the graph matching problem of finding isomorphic graphs as well as the common subgraph embedded within them. The method may generate false duplicates, which affect the accuracy of the solution, so that even the finest tuning of the hyperparameters does not guarantee high accuracy. This study therefore proposes an in-depth investigation of the crucial aspects of the detection approach in order to assess its accuracy. The similarity of the detected pairs of graphs is analyzed, and the critical hashing step is investigated by bootstrapping the solution in order to assess its statistical properties. A real-world case study is considered to validate the potential of the proposed approach.
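To make the idea concrete, the following is a minimal illustrative sketch and not the method evaluated in this study. It assumes that each graph is represented as a non-empty set of labeled edges, that MinHash with banding serves as the Locality Sensitive Hashing scheme, and that the stability of the hashing step is probed by repeating it with freshly drawn hash functions (a simplified stand-in for bootstrapping). All function names and parameters are hypothetical. Because node labels are assumed to be shared across graphs, the sketch does not address general graph isomorphism.

```python
import random
from collections import defaultdict
from itertools import combinations


def minhash_signature(edges, seeds):
    # One min-hash value per seed; hash((seed, edge)) is a stand-in hash family.
    return tuple(min(hash((seed, e)) for e in edges) for seed in seeds)


def candidate_pairs(graphs, seeds, bands):
    # Banding: two graphs become a candidate pair if they agree on every
    # row of at least one band of their signatures.
    rows = len(seeds) // bands
    buckets = defaultdict(set)
    for name, edges in graphs.items():
        sig = minhash_signature(edges, seeds)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(name)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def jaccard(a, b):
    # Exact edge-set similarity, used to check candidates for false duplicates.
    return len(a & b) / len(a | b)


def detection_frequency(graphs, n_runs=50, n_hashes=16, bands=4, rng_seed=0):
    # Repeat the hashing step with new hash seeds and record how often
    # each pair is reported as a candidate.
    rng = random.Random(rng_seed)
    counts = defaultdict(int)
    for _ in range(n_runs):
        seeds = [rng.getrandbits(32) for _ in range(n_hashes)]
        for pair in candidate_pairs(graphs, seeds, bands):
            counts[pair] += 1
    return {pair: c / n_runs for pair, c in counts.items()}


# Hypothetical toy data: g1 and g2 share all edges, g3 shares none.
graphs = {
    "g1": frozenset({(1, 2), (2, 3), (3, 4)}),
    "g2": frozenset({(1, 2), (2, 3), (3, 4)}),
    "g3": frozenset({(7, 8), (8, 9), (9, 10)}),
}
freq = detection_frequency(graphs)
```

In this sketch, a pair that is reported in only a fraction of the runs, or whose exact Jaccard similarity is low, would be a candidate false duplicate of the kind whose effect on accuracy is examined here.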