Joining local datasets
In this example, we join two local datasets with pandas and then upload the result to the EvoML platform.
Step 1: Connecting to the platform
As a first step, connect to the platform using your credentials.
from typing import Final
import evoml_client as ec
import pandas as pd
# Please replace with your deploy-platform URL
API_URL: Final[str] = "https://evoml.ai"
# Please replace with your username
USERNAME: Final[str] = ""
# Please replace with your password
PASSWORD: Final[str] = ""
# Connect to the platform
ec.init(base_url=API_URL, username=USERNAME, password=PASSWORD)
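If you prefer not to keep credentials in the script itself, one option is to read them from environment variables. The following is a minimal sketch; the EVOML_API_URL, EVOML_USERNAME and EVOML_PASSWORD variable names and the load_credentials helper are assumptions made for this illustration and are not part of the evoml_client package.

```python
import os
from typing import Mapping, NamedTuple


class Credentials(NamedTuple):
    base_url: str
    username: str
    password: str


def load_credentials(env: Mapping[str, str] = os.environ) -> Credentials:
    """Read connection settings from environment variables.

    The EVOML_* variable names are hypothetical, chosen for this sketch.
    """
    # Fail early with a clear message if a required value is absent or empty.
    missing = [key for key in ("EVOML_USERNAME", "EVOML_PASSWORD") if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return Credentials(
        # Fall back to the URL used in this example when none is set.
        base_url=env.get("EVOML_API_URL", "https://evoml.ai"),
        username=env["EVOML_USERNAME"],
        password=env["EVOML_PASSWORD"],
    )
```

The returned values can then be passed to the same call as above: ec.init(base_url=creds.base_url, username=creds.username, password=creds.password), where creds = load_credentials().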
Step 2: Load the Datasets
In this example, we will use the following datasets:
- users.csv: Contains user information with columns:
  - user_id: Unique identifier for the user.
  - name: Name of the user.
  - job_id: Foreign key linking to the jobs.csv dataset.
  - salary: Salary of the user.
- jobs.csv: Contains job information with columns:
  - job_id: Unique identifier for the job.
  - job_title: Title of the job.
users_df = pd.read_csv("./data/users.csv")
users_df.head()
jobs_df = pd.read_csv("./data/jobs.csv")
jobs_df.head()
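If the CSV files are not at hand, the same two dataframes can be built in memory to follow along. The rows below are invented sample data that only mirror the column layout described above; they are not the contents of the real users.csv and jobs.csv files.

```python
import pandas as pd

# Hypothetical sample rows that follow the column layout described above.
users_df = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4],
        "name": ["Ada", "Grace", "Linus", "Margaret"],
        "job_id": [10, 20, 10, 99],  # 99 deliberately has no match in jobs_df
        "salary": [72000, 85000, 64000, 91000],
    }
)
jobs_df = pd.DataFrame(
    {
        "job_id": [10, 20, 30],
        "job_title": ["Data Analyst", "ML Engineer", "Product Manager"],
    }
)

# Basic checks before joining: the key exists on both sides
# and identifies each job exactly once.
assert "job_id" in users_df.columns and "job_id" in jobs_df.columns
assert jobs_df["job_id"].is_unique

# Inspect the column types; the join key should have the same dtype in both frames.
print(users_df.dtypes)
print(jobs_df.dtypes)
```

Checking that job_id has the same dtype in both frames is worthwhile because pandas matches keys by value, so a key read as text on one side and as an integer on the other will not match as intended.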
Step 3: Joining the datasets
We will join the two datasets using the pandas merge function, which allows merging on specific columns or indexes with control over the join type. Available join types are:
- left: Uses only keys from the left dataframe (SQL left outer join), preserves key order.
- right: Uses only keys from the right dataframe (SQL right outer join), preserves key order.
- outer: Uses the union of keys from both frames (SQL full outer join), sorts keys lexicographically.
- inner: Uses the intersection of keys from both frames (SQL inner join), preserves the order of left keys.
- cross: Creates a cartesian product of both frames, preserves left key order.
For more details, refer to the pandas documentation.
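To see how the join types listed above differ in practice, the sketch below merges two small frames with each value of how and records the number of resulting rows. The frames are invented sample data, not the tutorial's CSV files.

```python
import pandas as pd

# Invented sample data: job_id 99 exists only on the left, job_id 30 only on the right.
left = pd.DataFrame(
    {"job_id": [10, 20, 10, 99], "name": ["Ada", "Grace", "Linus", "Margaret"]}
)
right = pd.DataFrame(
    {"job_id": [10, 20, 30], "job_title": ["Data Analyst", "ML Engineer", "Product Manager"]}
)

row_counts = {}
for how in ("left", "right", "outer", "inner"):
    row_counts[how] = len(pd.merge(left, right, on="job_id", how=how))

# A cross join takes no key columns: every left row is paired with every right row.
row_counts["cross"] = len(pd.merge(left, right, how="cross"))

print(row_counts)
```

With these rows, the left and right joins each return 4 rows, the outer join 5, the inner join 3, and the cross join 12. The left join keeps the user with job_id 99 and fills job_title with NaN, while the inner join drops that user.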
For this example, we will join the two datasets on the job_id column using a left join.
df_merged = pd.merge(users_df, jobs_df, on="job_id", how="left")
df_merged.head()
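Before uploading, it can help to confirm that the left join behaved as expected. The sketch below uses two optional pd.merge parameters, validate and indicator, on invented sample data: validate="many_to_one" raises an error if job_id is duplicated in the jobs frame, and indicator=True adds a _merge column showing which rows found a match.

```python
import pandas as pd

# Invented sample data with the same columns as users.csv and jobs.csv.
users_df = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4],
        "name": ["Ada", "Grace", "Linus", "Margaret"],
        "job_id": [10, 20, 10, 99],
        "salary": [72000, 85000, 64000, 91000],
    }
)
jobs_df = pd.DataFrame(
    {"job_id": [10, 20, 30], "job_title": ["Data Analyst", "ML Engineer", "Product Manager"]}
)

checked = pd.merge(
    users_df,
    jobs_df,
    on="job_id",
    how="left",
    validate="many_to_one",  # raises pandas.errors.MergeError if job_id repeats in jobs_df
    indicator=True,  # adds a "_merge" column: "both" or "left_only" for a left join
)

# Users whose job_id has no counterpart in jobs_df end up with a missing job_title.
unmatched = checked[checked["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(checked)} users have no matching job")

# Drop the helper column so the result has the same columns as a plain left join.
df_merged = checked.drop(columns="_merge")
```

Whether unmatched rows should be kept, filled, or removed depends on how the dataset will be used downstream; the check only makes them visible.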
Step 4: Upload the Joined Dataset to the EvoML Platform
In this step, we upload the merged dataset to the EvoML platform. The following code snippet demonstrates how to do this:
dataset = ec.Dataset.from_pandas(df_merged, "Users-Jobs Dataset", "A merged dataset")
dataset.put()
print(dataset.dataset_id)