Dataset Joining

Model by Modzy

This model provides a convenient way of combining two datasets in JSON format by performing a user-specified join operation, such as inner (default), outer, left, or right join, over optionally given keys. If no keys are explicitly specified, the model finds the best possible keys (if any) to link the two datasets, saving data scientists and researchers time on data preprocessing.

    Product Description

    PERFORMANCE METRICS

    100% Deterministic – This model correctly performs the user-specified join as long as the input files are formatted according to the given specifications.

    Because this model is deterministic, no evaluation metrics are necessary. To confirm that it performs join operations correctly, its output was empirically reviewed on dozens of pairs of test datasets.

    OVERVIEW:

    This model accepts three JSON files as input. The first two are the datasets to be joined; they cannot contain nested dictionaries. The third is a configuration file that specifies the key(s) to match on and the type of join to perform (inner, outer, left, or right). The default join type is inner. If no join keys are specified, the model selects one automatically: for each key that appears in both datasets, it computes the Intersection over Union (IOU) of the two sets of values stored under that key, and it joins on the key with the highest IOU, provided that value is above 0.3. The input datasets are loaded as Pandas DataFrames and merged with the Pandas merge function according to the values in the configuration file.
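
    The key selection and merge described above can be sketched roughly as follows. This is an illustrative reconstruction, not the model's actual source: the function names (iou, find_best_key, join_datasets) are hypothetical, and raising an error when no key clears the threshold is an assumption, since the behavior in that case is not documented here.

```python
import pandas as pd

IOU_THRESHOLD = 0.3  # threshold stated in the overview


def iou(values_a, values_b):
    """Intersection over Union of two collections of values."""
    set_a, set_b = set(values_a), set(values_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def find_best_key(left, right, threshold=IOU_THRESHOLD):
    """Return the shared column with the highest IOU above threshold, else None."""
    best_key, best_score = None, threshold
    for key in left.columns.intersection(right.columns):
        score = iou(left[key], right[key])
        if score > best_score:
            best_key, best_score = key, score
    return best_key


def join_datasets(data_1, data_2, config):
    """Join two column-oriented datasets according to an input_3-style config."""
    left, right = pd.DataFrame(data_1), pd.DataFrame(data_2)
    how = config.get("match_type", "inner")  # inner is the documented default
    on = config.get("match_on") or find_best_key(left, right)
    if on is None:
        # Assumption: the real model's behavior here is not documented.
        raise ValueError("no shared key with an IOU above the threshold")
    return left.merge(right, how=how, on=on)


# Hypothetical example data: "id" is shared, with IOU = 3 / 5 = 0.6.
data_1 = {"id": [1, 2, 3, 4], "name": ["a", "b", "c", "d"]}
data_2 = {"id": [2, 3, 4, 5], "score": [0.5, 0.7, 0.9, 0.1]}

joined = join_datasets(data_1, data_2, {})  # empty config: inner join, auto key
print(joined.to_dict(orient="list"))
```

    With an empty configuration, the sketch picks "id" as the join key and keeps only the rows whose id appears in both datasets.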

    TRAINING:

    Because this model is deterministic, no training was involved.

    VALIDATION:

    This model was validated by reviewing the outputs on dozens of pairs of datasets.

    INPUT SPECIFICATION:

    The input(s) to this model must adhere to the following specifications:

    Filename        Maximum Size    Accepted Format(s)
    input_1.json    50M             .json
    input_2.json    50M             .json
    input_3.json    1M              .json

    The input_1.json and input_2.json files contain the datasets to be joined, where keys are column names and values are lists of entries. These files cannot have a nested structure. The input_3.json file can contain two optional keys: match_on (the column(s) to join on, which must match key(s) in both input_1.json and input_2.json) and match_type (the type of join to perform: inner, outer, left, or right). If match_type is not specified, it defaults to inner. If match_on is not specified, the model tries to find a candidate join key with an IOU above 0.3. If neither key is specified, input_3.json must be submitted as {}.
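
    As a rough illustration of these specifications, the sketch below builds three example payloads and checks them before serialization. The data, the helper names (check_dataset, check_config), and the choice to pass match_on as a list of column names are assumptions made for the example; the exact value format accepted for match_on is not stated here. Treating lists inside a column as nested structure is also an assumption.

```python
import json

ALLOWED_MATCH_TYPES = {"inner", "outer", "left", "right"}


def check_dataset(dataset):
    """Raise ValueError unless dataset maps column names to flat lists."""
    for column, values in dataset.items():
        if not isinstance(values, list):
            raise ValueError(f"column {column!r} must map to a list of entries")
        # Assumption: any dict or list entry counts as nested structure.
        if any(isinstance(value, (dict, list)) for value in values):
            raise ValueError(f"column {column!r} contains nested values")


def check_config(config):
    """Raise ValueError if match_type is present but not a documented value."""
    match_type = config.get("match_type", "inner")
    if match_type not in ALLOWED_MATCH_TYPES:
        raise ValueError(f"unsupported match_type: {match_type!r}")


# Hypothetical example payloads.
input_1 = {"employee_id": [101, 102, 103], "name": ["Ada", "Grace", "Alan"]}
input_2 = {"employee_id": [102, 103, 104], "team": ["compilers", "logic", "systems"]}
input_3 = {"match_on": ["employee_id"], "match_type": "left"}
input_3_defaults = {}  # required form when neither key is specified

check_dataset(input_1)
check_dataset(input_2)
check_config(input_3)
check_config(input_3_defaults)

# Serialized contents of input_1.json, input_2.json, and input_3.json.
payloads = {
    "input_1.json": json.dumps(input_1),
    "input_2.json": json.dumps(input_2),
    "input_3.json": json.dumps(input_3),
}
```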

    OUTPUT DETAILS:

    This model will output the following:

    Filename        Maximum Size    Format
    results.json    200M            .json

    The output file, “results.json”, will contain the joined dataset where keys are column names and values are lists of entries.
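
    Because the output uses the same column-oriented layout as the inputs, it can be loaded back into a table directly. The snippet below is a minimal sketch that uses a made-up results.json payload held in a string; the column names and values are hypothetical.

```python
import io
import json

import pandas as pd

# Hypothetical results.json content: keys are column names, values are lists.
results_text = '{"id": [2, 3, 4], "name": ["b", "c", "d"], "score": [0.5, 0.7, 0.9]}'

# io.StringIO stands in for an opened results.json file.
results = json.load(io.StringIO(results_text))
table = pd.DataFrame(results)  # one row per joined record
print(table)
```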