Joining local datasets

In this example we will examine how to join local datasets and then upload them into the EvoML platform.

Step 1: Connecting into the platform

As a first step you will need to connect into the platform using your credentials.

from typing import Final
import evoml_client as ec
import pandas as pd

# Pease replace with your deploy-platform URL
API_URL: Final[str] = "https://evoml.ai"

# Please replace with your username
USERNAME: Final[str] = ""
# Please replace with your password
PASSWORD: Final[str] = ""

# Connect to the platform
ec.init(base_url=API_URL, username=USERNAME, password=PASSWORD)

Step 2: Load the Datasets

In this example, we will use the following datasets:

users.csv: Contains user information with columns:
- user_id: Unique identifier for the user.
- name: Name of the user.
- job_id: Foreign key linking to the jobs.csv dataset.
- salary: Salary of the user.
jobs.csv: Contains job information with columns:
- job_id: Unique identifier for the job.
- job_title: Title of the job.

users_df = pd.read_csv("./data/users.csv")
users_df.head()

jobs_df = pd.read_csv("./data/jobs.csv")
jobs_df.head()

Step 3: Joining the datasets

We will join the two datasets using pandas, which allows merging on specific columns or indexes with control over the join type. Available join types are:

left: Uses only keys from the left dataframe (SQL left outer join), preserves key order.
right: Uses only keys from the right dataframe (SQL right outer join), preserves key order.
outer: Uses the union of keys from both frames (SQL full outer join), sorts keys lexicographically.
inner: Uses the intersection of keys from both frames (SQL inner join), preserves the order of left keys.
cross: Creates a cartesian product of both frames, preserves left key order.

For more details, refer to the pandas documentation. For this example we will be joining the two datases on the job_id column using the left join.

df_merged = pd.merge(users_df, jobs_df, left_on="job_id", right_on="job_id", how="left")
df_merged.head()

Step 4: Upload the Joined Dataset into the evoml Platform

In this step, we will upload the merged dataset to the evoml platform. The following code snippet demonstrates how to do this:

dataset = ec.Dataset.from_pandas(df_merged, "Users-Jobs Dataset", "A merged dataset")
dataset.put()

print(dataset.dataset_id)

Step 1: Connecting into the platform​

Step 2: Load the Datasets​

Step 3: Joining the datasets​

Step 4: Upload the Joined Dataset into the evoml Platform​

Step 1: Connecting into the platform

Step 2: Load the Datasets

Step 3: Joining the datasets

Step 4: Upload the Joined Dataset into the evoml Platform