Joining local datasets

This example shows how to join local datasets with pandas and upload the result to the EvoML platform.

Step 1: Connecting to the platform

First, connect to the platform using your credentials.

from typing import Final
import evoml_client as ec
import pandas as pd

# Please replace with the URL of your platform deployment
API_URL: Final[str] = "https://evoml.ai"

# Please replace with your username
USERNAME: Final[str] = ""
# Please replace with your password
PASSWORD: Final[str] = ""

# Connect to the platform
ec.init(base_url=API_URL, username=USERNAME, password=PASSWORD)
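
If you would rather not keep credentials in the script, one option is to read them from environment variables. The following is a minimal sketch, not part of the EvoML client: the helper load_credentials and the variable names EVOML_API_URL, EVOML_USERNAME and EVOML_PASSWORD are hypothetical and chosen only for illustration.

```python
import os
from typing import Mapping


def load_credentials(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Build the keyword arguments for ec.init from environment variables.

    The variable names below are assumptions made for this sketch.
    """
    required = ("EVOML_USERNAME", "EVOML_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        # Fall back to the URL used in the example above if none is set.
        "base_url": env.get("EVOML_API_URL", "https://evoml.ai"),
        "username": env["EVOML_USERNAME"],
        "password": env["EVOML_PASSWORD"],
    }
```

With the client imported as ec, the connection call would then become ec.init(**load_credentials()), using the same keyword arguments as in the snippet above.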

Step 2: Load the Datasets

In this example, we will use the following datasets:

  • users.csv: Contains user information with columns:

    • user_id: Unique identifier for the user.
    • name: Name of the user.
    • job_id: Foreign key linking to the jobs.csv dataset.
    • salary: Salary of the user.
  • jobs.csv: Contains job information with columns:

    • job_id: Unique identifier for the job.
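    • job_title: Title of the job.

# Load the users dataset and preview the first rows
users_df = pd.read_csv("./data/users.csv")
users_df.head()

# Load the jobs dataset and preview the first rows
jobs_df = pd.read_csv("./data/jobs.csv")
jobs_df.head()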
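
If you do not have the two CSV files at hand, the sketch below builds small in-memory stand-ins with the same columns and checks the join key before merging. The rows are made up for illustration only, and the names users_demo and jobs_demo are hypothetical so they do not overwrite the real dataframes.

```python
import io

import pandas as pd

# Made-up rows for illustration; the real files live in ./data/.
USERS_CSV = """user_id,name,job_id,salary
1,Alice,10,52000
2,Bob,20,48000
3,Carol,10,61000
4,Dan,30,39000
"""

JOBS_CSV = """job_id,job_title
10,Engineer
20,Analyst
40,Designer
"""

users_demo = pd.read_csv(io.StringIO(USERS_CSV))
jobs_demo = pd.read_csv(io.StringIO(JOBS_CSV))

# The join key must exist on both sides.
assert "job_id" in users_demo.columns and "job_id" in jobs_demo.columns

# job_id should be unique in the jobs table, otherwise a join duplicates users.
print(jobs_demo["job_id"].is_unique)

# The key columns should share a dtype so that values compare as expected.
print(users_demo["job_id"].dtype, jobs_demo["job_id"].dtype)
```

The same three checks can be run on users_df and jobs_df once the real files are loaded.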

Step 3: Joining the datasets

We will join the two datasets with pandas, whose merge function joins on specific columns or indexes and selects the join type through its how parameter. The available join types are:

  • left: Uses only keys from the left dataframe (SQL left outer join), preserves key order.
  • right: Uses only keys from the right dataframe (SQL right outer join), preserves key order.
  • outer: Uses the union of keys from both frames (SQL full outer join), sorts keys lexicographically.
  • inner: Uses the intersection of keys from both frames (SQL inner join), preserves the order of left keys.
  • cross: Creates a cartesian product of both frames, preserves left key order.
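
The join types above can be compared by counting the rows each one produces on the same pair of frames. This is a small sketch on made-up data (users_demo and jobs_demo are hypothetical names); it assumes pandas 1.2 or later, where how="cross" is available.

```python
import pandas as pd

# Toy frames: job_id 30 appears only on the left, job_id 40 only on the right.
users_demo = pd.DataFrame({
    "user_id": [1, 2, 3],
    "job_id": [10, 20, 30],
})
jobs_demo = pd.DataFrame({
    "job_id": [10, 20, 40],
    "job_title": ["Engineer", "Analyst", "Designer"],
})

# Number of rows produced by each key-based join type.
row_counts = {}
for how in ("left", "right", "outer", "inner"):
    row_counts[how] = len(pd.merge(users_demo, jobs_demo, on="job_id", how=how))

# A cross join takes no key: every left row is paired with every right row.
row_counts["cross"] = len(pd.merge(users_demo, jobs_demo, how="cross"))

print(row_counts)
```

Here the left and right joins each return 3 rows, the outer join 4, the inner join 2, and the cross join 9 (3 × 3).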

For more details, refer to the pandas documentation. For this example, we will join the two datasets on the job_id column using a left join, so every user is kept even if their job_id has no match in jobs.csv.

df_merged = pd.merge(users_df, jobs_df, on="job_id", how="left")
df_merged.head()
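
A left join keeps users whose job_id has no match, leaving their job_title empty. The sketch below shows one way to surface such rows, using the validate and indicator parameters of pd.merge on made-up data (users_demo and jobs_demo are hypothetical names).

```python
import pandas as pd

# Toy frames: user 3 references job_id 30, which is absent from the jobs table.
users_demo = pd.DataFrame({
    "user_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "job_id": [10, 20, 30],
})
jobs_demo = pd.DataFrame({
    "job_id": [10, 20],
    "job_title": ["Engineer", "Analyst"],
})

checked = pd.merge(
    users_demo,
    jobs_demo,
    on="job_id",
    how="left",
    validate="many_to_one",  # raises MergeError if job_id repeats in jobs_demo
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Users without a matching job keep their row but get NaN for job_title.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["user_id", "job_id"]])
```

Drop the _merge column before uploading if you only needed it for the check, for example with checked.drop(columns="_merge").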

Step 4: Upload the Joined Dataset to the EvoML Platform

In this step, we will upload the merged dataset to the EvoML platform. The following code snippet demonstrates how to do this:

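# Create a dataset from the merged dataframe and upload it
dataset = ec.Dataset.from_pandas(df_merged, "Users-Jobs Dataset", "A merged dataset")
dataset.put()

# Print the identifier of the uploaded dataset
print(dataset.dataset_id)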