Analysis Of The Study “Auto-Join: Joining Tables By Leveraging Transformations”

Problem being addressed

The authors address the problem of joining tables whose data come from different sources and whose key columns are formatted differently. Joins are essential for combining records in data analysis, but present-day tools mostly support equi-joins, which work well on structured, well-curated data and fail to give correct results when the data is less curated. Applying suitable transformations to the key columns before the equi-join improves the results. However, manually identifying rows and columns with joinable values and writing those transformations by hand is not feasible for tables with thousands of rows: it is slow and often still fails to produce matches. The objective is therefore to meet this challenge by automating transformation-based joins and scaling the algorithm efficiently to large datasets.
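To make the problem concrete, the following minimal Python sketch shows an equi-join that finds no matches on differently formatted keys and succeeds once a transformation is applied to the key. The sample rows, the `equi_join` helper, and the hand-written `last_first_to_first_last` transformation are all illustrative assumptions and are not taken from the paper; the paper's aim is to derive such a transformation automatically rather than by hand.

```python
# Hypothetical data: the same people keyed in two different formats.
left = [("Obama, Barack", 1961), ("Bush, George W.", 1946)]
right = [("Barack Obama", "Democratic"), ("George W. Bush", "Republican")]


def equi_join(left_rows, right_rows, key=lambda k: k):
    """Pair rows whose (optionally transformed) left key equals the right key."""
    index = {row[0]: row for row in right_rows}
    return [
        (row, index[key(row[0])])
        for row in left_rows
        if key(row[0]) in index
    ]


def last_first_to_first_last(name):
    """Hand-written transformation: 'Last, First' -> 'First Last'."""
    last, first = name.split(", ", 1)
    return f"{first} {last}"


# A plain equi-join compares the raw strings and finds nothing.
plain = equi_join(left, right)

# Transforming the left key first makes both rows joinable.
transformed = equi_join(left, right, key=last_first_to_first_last)
```

Here `plain` is empty, while `transformed` pairs each left row with its counterpart on the right.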

Proposed solution

The authors make three major contributions: the Auto-Join algorithm itself, techniques for scaling Auto-Join to large tables, and benchmarks built from authentic test cases that require transformation joins. To solve the join problem, they present a system that generates a transformation program which, when executed, makes the input tables joinable. Its search strategy allows the system to work on large datasets with a high success rate. The system consists of three main stages:

  1. Finding Joinable Row Pairs - Candidate joinable row pairs are guessed using q-gram matches, on the assumption, motivated by a power law, that a q-gram matching uniquely between the two tables indicates a joinable pair. A search algorithm is then developed to find these pairs: for every column in the source and target tables, a suffix array index is built by extracting all possible suffixes of each value and sorting them in ascending order.
  2. Learning Transformations - Once the joinable row pairs have been found, the goal is to find the least complex transformation that generalizes to produce the desired result set. To this end, physical and logical operators are used to narrow the search down to the desired result. The process is then extended to work on large datasets.
  3. Constrained Fuzzy Join - Input datasets often contain inconsistencies, so a technique is developed that finds a fuzzy join matching the maximum number of rows while maintaining the cardinality constraints, thereby improving join coverage.

To keep up with interactive speed on large datasets, a sampling scheme was designed that obtains the results with high probability. The proposed solutions are supported by experiments, including one that uses web and enterprise benchmarks drawn from real datasets, together with evaluation metrics that measure join quality. Eight different methods were implemented for comparing the experimental results, e.g., Substring Matching and Fuzzy Join - Full Row.
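The first stage above, finding joinable row pairs through unique q-gram matches, can be illustrated with a small Python sketch. It is a simplification under stated assumptions: it uses a plain dictionary index instead of the suffix array index the paper describes, it fixes a single q rather than searching over lengths, and the function names and sample columns are hypothetical.

```python
from collections import defaultdict


def qgrams(value, q):
    """Return the distinct substrings of length q contained in value."""
    return {value[i:i + q] for i in range(len(value) - q + 1)}


def qgram_index(column, q):
    """Map each q-gram to the set of row ids whose value contains it."""
    index = defaultdict(set)
    for row_id, value in enumerate(column):
        for gram in qgrams(value, q):
            index[gram].add(row_id)
    return index


def candidate_pairs(source, target, q):
    """Row pairs linked by a q-gram found in exactly one source row
    and exactly one target row (a unique match)."""
    src_index = qgram_index(source, q)
    tgt_index = qgram_index(target, q)
    pairs = set()
    for gram, src_rows in src_index.items():
        tgt_rows = tgt_index.get(gram, set())
        if len(src_rows) == 1 and len(tgt_rows) == 1:
            pairs.add((next(iter(src_rows)), next(iter(tgt_rows))))
    return pairs


# Hypothetical key columns from two differently formatted tables.
source = ["Obama, Barack", "Bush, George W.", "Clinton, Bill"]
target = ["Barack Obama", "George W. Bush", "Bill Clinton"]

pairs = candidate_pairs(source, target, q=5)
```

For these sample columns, `pairs` links each source row to the target row with the same index, because q-grams such as "Obama" or "Clint" occur in exactly one row on each side. The resulting pairs would then serve as examples for the second stage, which learns a transformation from them.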

Discussion and evaluation

The authors conduct detailed experiments on several benchmarks, and the results clearly indicate that their algorithm provides a performance boost. The related work section summarizes several papers, points out potential issues in the earlier work, and explains how the proposed system overcomes them. The examples at every step are clear and aid understanding of the concepts. Overall, the paper is well written and well organized, with concrete algorithms. It is also now used as part of a Microsoft data preparation system, which suggests that the proposed solution can be applied commercially and is useful and efficient in practice.

15 July 2020