Theory

Data quality models and methods with correctness guarantee

Research Introduction

Low-quality data can lead to significant financial and operational losses. Data quality management involves mining data quality rules and automatically identifying and repairing errors in data, in order to enhance data availability. Building on our original data quality theory, we develop a data availability platform that supports automated management. This platform establishes a positive interaction between data and the platform, promotes the application of high-quality data in various industrial domains, and leads the global development of data availability.

Research Field

We address five key issues of data quality: consistency, accuracy, completeness, timeliness, and data deduplication. To this end, we propose a new data quality model and develop a reasoning system that unifies machine learning and logical rules. We support automatic algorithms for rule discovery, rule verification, error detection, and error correction.
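As a rough illustration of rule-based error detection for the consistency issue, the sketch below checks one functional dependency (rows that agree on `zip` should agree on `city`). The rule, the attribute names, the sample rows, and the helper `find_fd_violations` are all assumptions made for this example; they are not taken from the platform described here.

```python
from collections import defaultdict


def find_fd_violations(rows, lhs, rhs):
    """Return groups of rows that agree on the `lhs` attributes but
    disagree on the `rhs` attribute (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        groups[key].append(row)

    violations = {}
    for key, members in groups.items():
        # More than one distinct right-hand-side value breaks the rule.
        if len({member[rhs] for member in members}) > 1:
            violations[key] = members
    return violations


# Made-up sample data: the first two rows conflict on `city`.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "94105", "city": "San Francisco"},
]

violations = find_fd_violations(rows, lhs=["zip"], rhs="city")
for key, members in violations.items():
    print(key, [m["city"] for m in members])
```

Detection only flags the conflicting group. Deciding which value is correct is the separate error-correction step and needs more evidence than a single rule provides.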

Logic + AI

Research Introduction

Machine learning methods are widely applied in big data analysis. It is well acknowledged in both academia and industry that machine learning systems mainly work in a statistical or black-box manner and are hard to interpret, which limits their applicability. We tackle this challenge from a structural causal reasoning perspective: we improve accuracy using machine learning models and ensure interpretability via logic rules. Meanwhile, we propose a novel logical rule system that supports plug-and-play machine learning models and is able to discover implicit relationships in the underlying data.

Research Field

We logically connect the inputs and outputs of machine learning models based on the structure of graph and relational data, and we exploit the topological structures and association relationships in the data (e.g., data hierarchy). On this basis, we investigate how to effectively reveal the reasoning logic in natural language processing, intelligent question answering, semantic model analysis, and so forth. This enables researchers to perform targeted optimizations, thereby enhancing the performance of machine learning models and broadening their scope of application. We propose a novel logical rule system that unifies logic rules and machine learning models, preserving the interpretability of logical reasoning while leveraging machine learning models to capture semantic relationships. This logical rule system has been widely used in data quality and association analysis.
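One way to picture a rule that combines logic predicates with a pluggable model is the minimal sketch below. It is only an illustration: `similarity_model` uses the standard-library `difflib.SequenceMatcher` as a stand-in for a trained model, and the rule, the attribute names, and the threshold are assumptions rather than the rule system described above.

```python
from difflib import SequenceMatcher


def similarity_model(a, b):
    """Stand-in for a trained ML model: returns a score in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rule_same_entity(t1, t2, model=similarity_model, threshold=0.8):
    """Hypothetical rule:
    t1.zip = t2.zip AND model(t1.name, t2.name) >= threshold
        -> t1 and t2 refer to the same entity.
    The model is passed in as a parameter, so it can be swapped
    without touching the logical part of the rule."""
    logic_predicate = t1["zip"] == t2["zip"]
    ml_predicate = model(t1["name"], t2["name"]) >= threshold
    return logic_predicate and ml_predicate


a = {"name": "Acme Corp.", "zip": "10001"}
b = {"name": "ACME Corp", "zip": "10001"}
c = {"name": "Zenith Ltd", "zip": "10001"}

print(rule_same_entity(a, b))  # names are near-identical, zips match
print(rule_same_entity(a, c))  # zips match, names do not
```

Because the rule states its precondition explicitly, a match can be explained predicate by predicate, while the model only supplies the similarity judgment.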

Parallel Scalability

Research Introduction

Distributed computing has become an essential mode of big data computing. A single machine can hardly cope with the high computational cost of big data workloads. On the other hand, distributed computing increases the communication overhead among parallel computing resources because of data exchange, which can reduce the efficiency of big data computing. We are developing a novel strategy for balancing computing resources and efficiency, which cuts the overhead incurred by distributed computing, including both computation time and communication time.
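The trade-off between computation and communication can be made concrete with a toy cost model. The sketch below is not the strategy described above. It simply assumes that computation time shrinks as `work / p` with `p` workers, while communication time grows linearly as `comm_cost * (p - 1)`. Both formulas and all numbers are illustrative assumptions.

```python
def total_time(work, comm_cost, p):
    """Toy model: computation is split evenly across p workers, and each
    extra worker adds a fixed communication cost (assumed, not measured)."""
    return work / p + comm_cost * (p - 1)


def best_worker_count(work, comm_cost, max_workers):
    """Pick the worker count in 1..max_workers with the lowest modelled time."""
    return min(
        range(1, max_workers + 1),
        key=lambda p: total_time(work, comm_cost, p),
    )


work, comm_cost = 1000, 10
best = best_worker_count(work, comm_cost, max_workers=64)
print(best, total_time(work, comm_cost, best))
print(64, total_time(work, comm_cost, 64))
```

Under these assumptions, adding workers beyond the optimum makes the job slower because communication dominates, which is the effect the paragraph above describes.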

Research Field

We build models of parallel scalability at different complexity levels for various computing problems, in order to identify parallel scalability problems of the same complexity. We also study methods for designing parallel scalable algorithms for different types of computing problems. We aim to offer an integrated approach that balances multiple factors while improving computational performance.

Awards

Certificate of Specialized Evaluation for Fundamental Capabilities of Data Quality Management Platform

First Prize in the Data Governance Track of the Beijing Big Data Skills Competition

Member of the Big Data Technology Standards Promotion Committee

The Best Paper Award at The 23rd International Conference on Data Engineering (ICDE)

The Best Paper Award at The 36th International Conference on Very Large Data Bases (VLDB)

Publications

Splitting Tuples of Mismatched Entities

ACM SIGMOD Conference on Management of Data (SIGMOD), 2024.

Wenfei Fan, Ziyan Han, Weilong Ren, Ding Wang, Yaoshu Wang, Min Xie, and Mengyi Yan.

Discovering Top-k Rules using Subjective and Objective Criteria

ACM SIGMOD Conference on Management of Data (SIGMOD), 2023.

Wenfei Fan, Ziyan Han, Yaoshu Wang, and Min Xie.

Learning and Deducing Temporal Orders

The 49th International Conference on Very Large Data Bases (VLDB), 2023.

Wenfei Fan, Resul Tugay, Yaoshu Wang, Min Xie, and Muhammad Asif Ali.

Parallel Rule Discovery from Large Datasets by Sampling

ACM SIGMOD Conference on Management of Data (SIGMOD), 2022.

Wenfei Fan, Ziyan Han, Yaoshu Wang, and Min Xie.
