GroupByRule

Deduplicate data using fuzzy and deterministic matching rules.

🚧 under construction 🚧

GroupByRule is a Python package for data cleaning and deduplication. It integrates with the pandas groupby() function so that dataframe rows can be grouped not only by a given identifier, but also by logical rules and partial matching. In other words, it provides tools for deterministic record linkage and entity resolution in structured databases. It can also be used for blocking, a form of filtering used to speed up more complex entity resolution algorithms. See the references below to learn more about these topics.
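To make the idea of rule-based grouping concrete, here is a minimal sketch written with plain pandas and a small union-find structure. It does not use or reproduce the GroupByRule API: the function name group_by_any, the column names, and the sample records are all hypothetical. The rule illustrated is "two rows belong together if they share a non-missing value in any of the listed columns", applied transitively.

```python
import pandas as pd


def group_by_any(df, columns):
    """Label rows so that rows sharing a non-missing value in any of
    `columns`, directly or through a chain of rows, get the same label.

    Hypothetical helper for illustration; not part of GroupByRule.
    """
    parent = list(range(len(df)))  # union-find forest over row positions

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smallest row position as the cluster representative.
            parent[max(ri, rj)] = min(ri, rj)

    for col in columns:
        first_seen = {}  # value -> first row position holding that value
        for pos, value in enumerate(df[col]):
            if pd.isna(value):
                continue  # missing values never link two rows
            if value in first_seen:
                union(first_seen[value], pos)
            else:
                first_seen[value] = pos

    labels = [find(i) for i in range(len(df))]
    return pd.Series(labels, index=df.index, name="cluster")


# Made-up records: rows 0-1 share an email, rows 2-3 share a phone number.
records = pd.DataFrame({
    "name": ["Ann Lee", "A. Lee", "Bob Roy", "Robert Roy", "Cy Poe"],
    "email": ["ann@x.org", "ann@x.org", "bob@y.org", None, "cy@z.org"],
    "phone": ["555-0100", None, "555-0199", "555-0199", None],
})
records["cluster"] = group_by_any(records, ["email", "phone"])

# The resulting labels plug directly into the usual pandas groupby().
sizes = records.groupby("cluster").size()
```

Each cluster is labelled with the position of its first row, so the labels here are 0, 0, 2, 2, 4. The same labels could serve as blocks: a more expensive comparison would then only be run on pairs of rows within a cluster.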

One of the main goals of GroupByRule is to be user-friendly. Matching rules and clustering algorithms are composable, and the performance of algorithms can be readily evaluated given training data. The package is built on top of pandas for data manipulation and on igraph for graph clustering and related computations.
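As an illustration of what evaluating a deduplication result against training data can look like, the sketch below computes pairwise precision and recall from two lists of cluster labels. This is one common choice of metric, not necessarily the one GroupByRule implements; the function names and the example labels are hypothetical. Returning 1.0 when there are no pairs to score is a convention chosen here for simplicity.

```python
from itertools import combinations


def linked_pairs(labels):
    """Return the set of index pairs (i, j), i < j, placed in the same cluster."""
    members = {}
    for i, label in enumerate(labels):
        members.setdefault(label, []).append(i)
    return {pair for rows in members.values() for pair in combinations(rows, 2)}


def pairwise_precision_recall(predicted, truth):
    """Compare predicted cluster labels with ground-truth labels.

    Precision: share of predicted links that are true links.
    Recall: share of true links that were predicted.
    """
    pred_pairs = linked_pairs(predicted)
    true_pairs = linked_pairs(truth)
    correct = len(pred_pairs & true_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 1.0
    recall = correct / len(true_pairs) if true_pairs else 1.0
    return precision, recall


# Five records: the prediction merges record 2 with the wrong group.
predicted = [0, 0, 0, 1, 1]
truth = ["a", "a", "b", "b", "b"]
precision, recall = pairwise_precision_recall(predicted, truth)
```

In this example both clusterings contain four linked pairs and agree on two of them, so precision and recall are both 0.5. Label values only need to be hashable; the two lists do not have to use the same naming scheme.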

Additionally, GroupByRule provides efficient C++ implementations of common string distance functions through its comparator submodule. The submodule can be used independently of the record linkage algorithms.
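For readers unfamiliar with string distance functions, the following pure-Python sketch shows the Levenshtein (edit) distance, a typical example of this family, together with a normalized similarity score. It is a reference illustration only: it is not the comparator submodule's code or interface, and the function names are hypothetical. A compiled implementation would compute the same values much faster.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string `a` into string `b`."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the shorter row
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


distance = levenshtein("kitten", "sitting")
score = similarity("kitten", "sitting")
```

A partial-matching rule can then be expressed as a threshold on the similarity score, for example treating two names as a match when the score exceeds a cutoff chosen from training data.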

Indices and tables