Skip to main content

Joining platform datasets

In this example we will examine how to join datasets that already exist on the platform by downloading them locally performing the join and uploading them back to the EvoML Platform.

Step 1: Connecting into the platform

As a first step you will need to connect into the platform using your credentials.

from typing import Final
import evoml_client as ec
import pandas as pd

# Pease replace with your deploy-platform URL
API_URL: Final[str] = "https://evoml.ai"

# Please replace with your username
USERNAME: Final[str] = ""
# Please replace with your password
PASSWORD: Final[str] = ""

# Connect to the platform
ec.init(base_url=API_URL, username=USERNAME, password=PASSWORD)

Preparation Step

To ensure the example works properly, we will first upload and create two Datasets on the EvoML platform. Afterward, we will proceed with the assumption that these datasets already exist on the platform.

The two datasets we will be using are:

  • users.csv: Contains user information with columns:

    • user_id: Unique identifier for the user.
    • name: Name of the user.
    • job_id: Foreign key linking to the jobs.csv dataset.
    • salary: Salary of the user.
  • jobs.csv: Contains job information with columns:

    • job_id: Unique identifier for the job.
    • job_title: Title of the job.
users_csv_df = pd.read_csv("./data/users.csv")
users_dataset = ec.Dataset.from_pandas(users_csv_df, "Users Dataset", "A Dataset containing users")
users_dataset.put()

users_dataset_id = users_dataset.dataset_id
print(users_dataset_id)
jobs_csv_df = pd.read_csv("./data/jobs.csv")
jobs_dataset = ec.Dataset.from_pandas(users_csv_df, "Jobs Dataset", "A Dataset containing jobs")
jobs_dataset.put()

jobs_dataset_id = jobs_dataset.dataset_id
print(jobs_dataset_id)

Step 2: Loading the platform Datasets locally

We can use the dataset ids for downloading the datasets locally.

users_dataset = ec.Dataset.from_id(users_dataset_id)
users_df = users_dataset.data
users_df.head()
jobs_dataset = ec.Dataset.from_id(jobs_dataset_id)
jobs_df = jobs_dataset.data
jobs_df.head()

Step 3: Joining the datasets

We will join the two datasets using pandas, which allows merging on specific columns or indexes with control over the join type. Available join types are:

  • left: Uses only keys from the left dataframe (SQL left outer join), preserves key order.
  • right: Uses only keys from the right dataframe (SQL right outer join), preserves key order.
  • outer: Uses the union of keys from both frames (SQL full outer join), sorts keys lexicographically.
  • inner: Uses the intersection of keys from both frames (SQL inner join), preserves the order of left keys.
  • cross: Creates a cartesian product of both frames, preserves left key order.

For more details, refer to the pandas documentation. For this example we will be joining the two datases on the job_id column using the left join.

df_merged = pd.merge(users_df, jobs_df, left_on="job_id", right_on="job_id", how="left")
df_merged.head()

Step 4: Upload the Joined Dataset into the evoml Platform

In this step, we will upload the merged dataset to the evoml platform. The following code snippet demonstrates how to do this:

dataset = ec.Dataset.from_pandas(df_merged, "Users-Jobs Dataset", "A merged dataset")
dataset.put()

print(dataset.dataset_id)