Current Issues

A commercial bank procured data from multiple enterprises and, while using it, discovered a large number of duplicated entity names, name changes, and subtle differences in descriptions. As a result, using company information required extensive verification work, and it was sometimes hard to determine which data entry to use. The primary issues were as follows:

Key attributes that represent business information are difficult to extract.
Core data for distinguishing entities effectively is severely lacking.
Requiring a match on the company name together with the enterprise's business license, tax registration certificate, and organization code certificate as the identification condition for an entity is overly stringent and leads to complex computations.
There is no capability for entity propagation and merging.

Solution and Effect

Rock addresses these issues by applying automatic rule discovery directly to the multiple data sources, which outputs rules related to the ThreeLicense. These rules provide a robust reference for extracting key attributes. In addition, machine learning is used to match company names as an auxiliary entity rule, gradually expanding data coverage. The system also supports rule priorities: the Unified Social Credit Code has the highest priority, followed by the organization code, with the business registration code third. Applying the rules in this order narrows the calculation scope step by step and keeps entity recognition accurate.
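The prioritized rules described above can be illustrated with a minimal sketch. It is not Rock's implementation: the field names, the `resolve_entities` function, and the conflict policy (a lower-priority rule does not merge two groups whose higher-priority identifiers disagree) are all assumptions made for illustration. Merging uses a simple union-find so that matches propagate transitively across records.

```python
from collections import defaultdict

# Hypothetical field names, listed from highest to lowest priority.
RULE_PRIORITY = [
    "unified_social_credit_code",
    "organization_code",
    "business_registration_code",
]


def resolve_entities(records):
    """Return groups of record indices that are treated as the same entity."""
    parent = list(range(len(records)))
    # Identifier values known for each group, stored at the group's root index.
    ids = [
        {f: ({r[f]} if r.get(f) else set()) for f in RULE_PRIORITY}
        for r in records
    ]

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for level, field in enumerate(RULE_PRIORITY):
        higher = RULE_PRIORITY[:level]
        first_seen = {}  # identifier value -> first record index carrying it
        for i, record in enumerate(records):
            value = record.get(field)
            if not value:
                continue
            if value not in first_seen:
                first_seen[value] = i
                continue
            a, b = find(first_seen[value]), find(i)
            if a == b:
                continue  # already merged by a higher-priority rule
            # Assumed policy: disagreeing higher-priority identifiers block the merge.
            if any(
                ids[a][f] and ids[b][f] and ids[a][f].isdisjoint(ids[b][f])
                for f in higher
            ):
                continue
            parent[b] = a
            for f in RULE_PRIORITY:
                ids[a][f] |= ids[b][f]

    groups = defaultdict(list)
    for i in range(len(records)):
        groups[find(i)].append(i)
    return sorted(groups.values())


# Made-up sample data: records 0-3 chain together through shared identifiers,
# while record 4 shares an organization code but has a different credit code.
SAMPLE_RECORDS = [
    {"unified_social_credit_code": "USCC-001", "organization_code": "ORG-1"},
    {"unified_social_credit_code": "USCC-001"},
    {"organization_code": "ORG-1", "business_registration_code": "REG-9"},
    {"business_registration_code": "REG-9"},
    {"unified_social_credit_code": "USCC-002", "organization_code": "ORG-1"},
]
print(resolve_entities(SAMPLE_RECORDS))  # [[0, 1, 2, 3], [4]]
```

Records merged by a higher-priority rule are skipped cheaply in later passes, which is one plausible reading of how the calculation scope narrows at each step.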

Previous approach:

SQL statements expressing the ThreeLicense rules had to be written manually.
Company names were matched without machine-learning-based matching, so many matches were missed.
Priorities could not be set for the known ThreeLicense rules.
Every run processed the full dataset, which was potentially inefficient.

With Rock:

Rules related to the ThreeLicense are discovered automatically, which reduces the difficulty of designing rules by hand.
Manual rule design supplements the discovered rules, with the support of machine learning.
Rule priorities are set according to business definitions.
Runs are iterative and incremental, gradually narrowing the data scope and lowering the demands on computing power.
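As a rough illustration of name matching used as an auxiliary rule on an already narrowed scope, the sketch below compares only the records left unmatched by identifier rules against the names of existing entities. The source does not describe Rock's machine learning model, so `difflib.SequenceMatcher` from the Python standard library stands in for it here; the suffix list, the `suggest_merges` helper, and the 0.85 threshold are assumptions.

```python
import re
from difflib import SequenceMatcher

# Assumed list of legal-form suffixes to ignore when comparing names.
LEGAL_SUFFIXES = re.compile(
    r"\b(co|company|ltd|limited|inc|corp|corporation)\b\.?", re.IGNORECASE
)


def normalize_name(name):
    """Lowercase, drop legal-form suffixes and punctuation, collapse spaces."""
    name = LEGAL_SUFFIXES.sub(" ", name.lower())
    name = re.sub(r"[^\w\s]", " ", name)
    return " ".join(name.split())


def name_similarity(a, b):
    """Similarity in [0, 1]; a stand-in for a learned name-matching model."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


def suggest_merges(entity_names, leftover_names, threshold=0.85):
    """Map each leftover name to the index of its best entity, or None."""
    suggestions = {}
    for name in leftover_names:
        scored = [(name_similarity(name, e), i) for i, e in enumerate(entity_names)]
        score, idx = max(scored, default=(0.0, None))
        suggestions[name] = idx if score >= threshold else None
    return suggestions


# Made-up example: only the leftovers are compared, not the full dataset.
entities = ["Acme Trading Co., Ltd.", "Globex Corporation"]
leftovers = ["ACME Trading Company", "Initech Limited"]
print(suggest_merges(entities, leftovers))
```

Names that fall below the threshold stay unmatched and would be candidates for manual confirmation or a later run.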

Achievements

Merging reduced the business entity data from 396,000 to 226,000 entities, with an accuracy rate of 98.86%. Previously, merging could only bring the count down to 350,000+ entities.

By combining Rock's automatic rule discovery with supplementary manual rule design, data coverage increased from 80% to 99%.

The workload for manual confirmation was significantly reduced. Manually confirming the 396,000 data entries used to take approximately 2 days; it can now be completed in about 0.5 days.
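For readers who want the reported figures as ratios, the short calculation below derives them from the numbers stated above. It introduces no new data; the variable names are illustrative only.

```python
# Figures taken from the achievements reported above.
entities_before = 396_000
entities_after = 226_000
days_before = 2.0
days_after = 0.5

merged_away = entities_before - entities_after   # duplicate entries merged
reduction = merged_away / entities_before        # share of entries removed
speedup = days_before / days_after               # manual confirmation speedup

print(f"{merged_away:,} entries merged ({reduction:.1%} reduction)")
print(f"manual confirmation about {speedup:.0f}x faster")
```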