Skip to content
Multi-source Entity Resolution

Overview

Multi-source data refers to data from different sources with different formats, semantics, and structures. The integration of multi-source data is a challenge for data quality in data processing, and in the subsequent data analysis and application processes, it is easy to bring problems in data accuracy, homogeneity, completeness and so on. The entity recognition and enhancement function of the Rock system is designed to solve these problems. This function realizes entity grouping, entity matching, entity clustering, optimal record acquisition and master data management, integrates time-ordered optimal rules, analyzes data based on user-defined credibility/priority of rules and accumulated a priori knowledge, and achieves the effect of analyzing entities more accurately by merging and transferring entities several times and recommends the optimal values for different values of the same entity data.

Challenges

Data quality issues
Entity resolution of multi-source data requires processing a large amount of data, and these data often have quality issues such as missing values, erroneous values, and duplicate values, which affect the accuracy and efficiency of entity resolution.
Entity disambiguation problem
Entity resolution of multi-source data requires disambiguation of entities in the text, determining whether entities refer to the same entity. This is a complex problem that requires consideration of multiple factors such as context, semantics, and entity attributes.
Algorithm efficiency problem
Traditional entity resolution algorithms are often inefficient and cannot meet the requirements of large-scale data processing involved in entity resolution of multi-source data.

Architecture

Benefits

Improve the data quality
Based on the auto-discovered entity enhancement rules and data repair algorithms with theoretical guarantees, iteratively discover the wrong data in the data and complete the repair with correctness guarantees to improve the data quality.
Output the best entity data
Combine logic rules and natural language processing technology to resolve text entities accurately and solve semantic diversity problems. Incorporate features such as temporal rules and user-defined optimal rules, stipulate different dimensions such as longest data/shortest data/closest data-update-time as the optimal entity data, and output the optimal entity records based on entity resolution and classification.
High accuracy and efficiency
Propose entity enhancement rules and correctness-guaranteed data repair. Compared with the industry benchmark products, the comparison experiments based on DBLP dataset show that the overall execution performance is 382% faster, the entity resolution accuracy rate is improved by 56%, and the number of duplicate entities found is 16 times more.