Datawrangling

In data analysis, we often repeat similar actions to get valuable information from the data. In this Chapter we describe the most common techniques applied in data analysis, and how they are performed in different programming languages and libraries.

Split-apply-combine

Many data analysis problems involve the application of a split-apply-combine strategy, where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together. group_by is the most common function name for split-apply-combine technique.

These actions can be performed using different libraries in various programming languages. Based on the sample data created with the Synthea patient generator and Create views using DuckDB, we will demonstrate standard data analysis actions. We will illustrate how to implement these actions across multiple libraries and languages.

Datawrangling languages and libraries

In this chapter we use the following languages and libraries:

Pandas 2.0
Pandas is a popular python data wrangling library in the data science comunity. Pandas 2.0 mainly introduces improvements in performance, making it faster and more memory-efficient.

Polars
Polars is very similar to pandas in terms of functionality. However, polars is based on Rust, which makes it highly performant.

IBIS
Ibis is a python library which enables you to write your code in a pandas-like way, and the translates it to a lot of different backends. This enables you to change backends with only minor adjustments to your data wrangling code.

PySpark PySpark is a python library for Apache Spark, an open source computing system. It enables you to work with large datasets, or run complex calculations.

dplyr
dplyr is a widely used library in r.

SQL - DuckDB
In this handbook, we use DuckDB to demonstrate how to perform data wrangling with SQL. DuckDB SQL is almost the same as plain SQL but has a couple of extra functions to handle nested data, along with a few functions that aid in data analysis. DuckDB is a fast and efficient open-source database for analyzing large datasets using SQL.

Overview data transformations in different libraries

In the table below, you can find an overview of techniques that are frequently employed in data analysis, along with the corresponding functions for these techniques in the discussed libraries/languages.

concept pandas Polars IBIS PySpark dplyr SQL
Split groupby() group_by() group_by() groupBy() group_by() GROUP BY
Apply many functions many functions many functions many functions many functions many functions
combine join (), merge() join() left_join, inner_join() etc. join() left_join, inner_join() etc. LEFT JOIN, JOIN etc.
Filtering (row based) loc[], query() filter() filter() filter() filter() WHERE
Filtering (column based) loc[], iloc[], select() - select() select() SELECT
Mutate assign() with_columns() mutuate() withColumn() mutate() ADD
Ordering sort_values() sort() order_by() orderBy() arrange() ORDER BY

Example using the Synthea dataset

To demonstrate the most common data analysis techniques in the different languages and libraries, we use the data generated with the Synthea Patient Generator.

load required libraries

Before getting started, we need to load the required libraries.

import pandas as pd
import polars as pl
import duckdb
import ibis 
import pandas as pd
import duckdb
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType, DateType,TimestampType,IntegerType
from pyspark.sql import functions as F
%load_ext rpy2.ipython
%%R
library(DBI)
library(dplyr)
library(duckdb)
library(glue)
library(lubridate)
library(knitr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Attaching package: ‘lubridate’

The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union

Retrieve data

In this section we will retrieve the data from the parquet files created in Creating views with DuckDB.

First, we set the paths where the data is located

from pathlib import Path
ROOT = Path('.')
if 'src' in str(ROOT.resolve()): # locally
    ROOT = '..' / ROOT # go one folder up
BRONZE = ROOT / 'data' / 'bronze'
SILVER = ROOT / 'data' / 'silver'
%%R
library(here)

ROOT <- here()


# Check if the 'src' directory is in the path and adjust the ROOT if needed
if (grepl("src", ROOT, fixed = TRUE)) {
  ROOT <- dirname(ROOT)
}

# Define BRONZE and SILVER directories
BRONZE <- file.path(ROOT, "data", "bronze")
SILVER <- file.path(ROOT, "data", "silver")
here() starts at /home/runner/work/hdc-data-analytics-handbook/hdc-data-analytics-handbook

Thereafter, we import the data in a dataframe

# load data from parquet
df_timeline_pandas = pd.read_parquet(f"{SILVER}/parquet_export/patient_timeline.parquet")
df_price_list_pandas = pd.read_parquet(f"{SILVER}/parquet_export/price_list.parquet")
# read parquet files in polars dateframe
df_timeline_polars = pl.read_parquet(f"{SILVER}/parquet_export/patient_timeline.parquet")
df_price_list_polars = pl.read_parquet(f"{SILVER}/parquet_export/price_list.parquet")

In Creating views using DuckDB, we have already demonstrated how to create a DuckDB database and import Parquet data into DuckDB. Therefore, here we will connect to the previously created database, which only contains the “patient_timeline” and “price_list” tables.

# connect to the database
con_duckdb = duckdb.connect()

con_duckdb.sql(
    f"""
    CREATE TABLE patient_timeline AS
    SELECT *
    FROM read_parquet('{SILVER}/parquet_export/patient_timeline.parquet');
    """
)

con_duckdb.sql(
    f"""
    CREATE TABLE price_list AS
    SELECT *
    FROM read_parquet('{SILVER}/parquet_export/price_list.parquet');
    """
)
# read parquet files with ibis
df_timeline_IBIS = ibis.read_parquet(f'{SILVER}/parquet_export/patient_timeline.parquet')
df_price_list_ibis = ibis.read_parquet(f'{SILVER}/parquet_export/price_list.parquet')
# create spark session
spark = SparkSession.builder.appName("patient_timeline_analysis").getOrCreate()

# convert timeline to spark dataframe
df_timeline_spark = spark.read.parquet(f'{SILVER}/parquet_export/patient_timeline.parquet')

# convert pricelist to spark dataframe
df_price_list_spark = spark.read.parquet(f'{SILVER}/parquet_export/price_list.parquet')
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/12/19 09:25:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
%%R
con <- dbConnect(
  duckdb::duckdb(),
  dbdir = glue("{SILVER}/pregnancy.duckdb"),
  read_only = TRUE
)
%%R
df <- dbGetQuery(con, "SELECT * FROM patient_timeline")
df_price_list <- dbGetQuery(con, "SELECT * FROM price_list")
In addition: Warning message:
In rapi_execute(res@stmt_lst$ref, res@arrow, res@connection@driver@bigint ==  :
  Coercing nanoseconds to a lower resolution may result in a loss of data.

Split-apply-combine

From the patient timeline data, we want to know how many patients each organization has recorded. For this task we can use the split-apply-combine technique.

  • we split the data in groups, where each group is a organization
  • we apply a function to count the number of patients in the group
  • we combine the data back together again

In the code below, we can see that the groupby function splits the data into different groups (and later combines them again). the agg function applies an aggregation, in this case the unique count of patients in the group. There are a few more actions in this code, renaming the index to organization, and the patient_id column to # patients.

df_grouped_pandas = df_timeline_pandas.groupby("organization_name").agg(
    {"patient_id":pd.Series.nunique}
).rename(
    columns = {'patient_id':"# patients"}
).rename_axis('organization')

# show first 5 records 
df_grouped_pandas.head()
# patients
organization
4499 ACUSHNET AVENUE OPERATING COMPANY, LLC 5
A&A HEALTHCARE LLC 9
ACCESS PRIMARY CARE INC 1
ACHIEVE WELLNESS, LLC 3
ACTIVATED BY WELLNESS LLC 5
df_grouped_polars = (
    df_timeline_polars
    .group_by("organization_name")
    .agg(
        pl.n_unique("patient_id")
    )
)
print(df_grouped_polars.head())
shape: (5, 2)
┌───────────────────────────────────┬────────────┐
│ organization_name                 ┆ patient_id │
│ ---                               ┆ ---        │
│ str                               ┆ u32        │
╞═══════════════════════════════════╪════════════╡
│ SUBURBAN INTERNAL MEDICINE, INC.  ┆ 2          │
│ LAWRENCE GENERAL HOSPITAL         ┆ 9          │
│ CONTINUUM HEALTHCARE SOLUTIONS L… ┆ 4          │
│ WHITLEY WELLNESS LLC              ┆ 29         │
│ HOPKINTON MEDICAL CARE, P.C.      ┆ 5          │
└───────────────────────────────────┴────────────┘

Polars supports both lazy and non-lazy (eager) evaluation. Lazy evaluation means that execution is deferred to the last minute, which can have significant performance advantages.

df_grouped_polars = (
    df_timeline_polars.lazy()
    .group_by("organization_name")
    .agg(
        pl.n_unique("patient_id")
    )
    .collect()
)
print(df_grouped_polars)
shape: (352, 2)
┌───────────────────────────────────┬────────────┐
│ organization_name                 ┆ patient_id │
│ ---                               ┆ ---        │
│ str                               ┆ u32        │
╞═══════════════════════════════════╪════════════╡
│ DEVENS TREATMENT & RECOVERY CENT… ┆ 4          │
│ RENAISSANCE FAMILY MEDICINE OF W… ┆ 1          │
│ COMMUNICATIVE HEALTH CARE ASSOCI… ┆ 2          │
│ THE SHRINERS HOSPITAL FOR CHILDR… ┆ 19         │
│ …                                 ┆ …          │
│ OUTER  CAPE HEALTH SERVICES       ┆ 1          │
│ AUBURN PRIMARY CARE AND AESTHETI… ┆ 9          │
│ COMMUNITY HEALTH PROGRAMS INC     ┆ 3          │
│ ZION HEALTH & WELLNESS LLC        ┆ 2          │
└───────────────────────────────────┴────────────┘
query = f"""
create or replace table organization_grouped as (
    select 
        organization_name as organization,
        count(distinct patient_id) as patient_count
    from patient_timeline
    group by organization_name
) ;
select * from organization_grouped 
limit 5;


"""
con_duckdb.sql(query).to_df()
organization patient_count
0 MEASURED WELLNESS LLC 4
1 FRANCISCAN HOSPITAL FOR CHILDREN INC 8
2 FENWAY COMMUNITY HEALTH CENTER, INC 4
3 ARBOUR-FULLER HOSPITAL 13
4 NASH HEALTH CARE PRACTICES LLC 4
df_grouped_ibis = df_timeline_IBIS.group_by("organization_name").aggregate(df_timeline_IBIS.patient_id.nunique()).to_pandas()
df_grouped_ibis.head()
organization_name CountDistinct(patient_id)
0 NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC 8
1 DUTTON FAMILY CARE ASSOCIATES LLP 8
2 THE SHRINERS HOSPITAL FOR CHILDREN 19
3 TREE OF LIFE HEALTHCARE ASSOCIATES 4
4 GREATER BOSTON INTERNAL MEDICINE PC 2
df_grouped_ibis = df_timeline_IBIS.group_by("organization_name").aggregate(df_timeline_IBIS.patient_id.nunique())
ibis.show_sql(df_grouped_ibis)
SELECT
  t0.organization_name,
  COUNT(DISTINCT t0.patient_id) AS "CountDistinct(patient_id)"
FROM _ibis_read_parquet_2b2nxedlfbbiljrr2xyonlxd34 AS t0
GROUP BY
  1
#option 1, using buid in functions
df_grouped_pyspark = df_timeline_spark.groupBy('organization_name').agg(
        F.countDistinct('patient_id'
    ).alias('# patients')).orderBy(
        '# patients',ascending = False
    )

df_grouped_pyspark = df_grouped_pyspark.withColumnRenamed("organization_name", "organization")

# option 2: using SQL 
df_timeline_spark.createOrReplaceTempView("timeline")
df_grouped_pyspark = spark.sql('''
Select 
    organization_name, 
    count(distinct patient_id) as nr_of_patients 
from timeline group by organization_name 
order by count(distinct patient_id) DESC
''')

df_grouped_pyspark.limit(5).show()
+--------------------+--------------+
|   organization_name|nr_of_patients|
+--------------------+--------------+
|BOSTON MEDICAL CE...|            29|
|WHITLEY WELLNESS LLC|            29|
|  TEWKSBURY HOSPITAL|            26|
|CAPE COD HOSPITAL...|            25|
|HARBOR HEALTH SER...|            23|
+--------------------+--------------+
%%R
df_grouped <- df %>% 
  group_by(organization_name) %>%
  summarise(
    number_of_patients = n_distinct(patient_id),
    .groups = "drop"
  )
kable(head(df_grouped))


|organization_name                           | number_of_patients|
|:-------------------------------------------|------------------:|
|4499 ACUSHNET AVENUE OPERATING COMPANY, LLC |                  5|
|A&A HEALTHCARE LLC                          |                  9|
|ACCESS PRIMARY CARE INC                     |                  1|
|ACHIEVE WELLNESS, LLC                       |                  3|
|ACTIVATED BY WELLNESS LLC                   |                  5|
|ADVANTAGE HEALTH & WELLNESS INC             |                  2|

Filter

In data analysis, it is common to be interested in only a subset of the data. For this reason, we apply the filtering technique. From our patients we only want to keep the ‘well child visits’.

# option 1
value = 'Well child visit (procedure)'
df_filtered_pandas = df_timeline_pandas[df_timeline_pandas['procedure_name'] == value]

# option 2
df_filtered_pandas = df_timeline_pandas.loc[df_timeline_pandas['procedure_name'] == value]

# option 3
df_filtered_pandas = df_filtered_pandas.query(f"procedure_name == '{value}' ")

#show results
df_filtered_pandas[['patient_id','start_time','procedure_name']].head()
patient_id start_time procedure_name
2 fca8d2ca-7aef-2c27-3bba-3f94723012f5 2016-04-01T04:09:48+02:00 Well child visit (procedure)
3 19936964-a432-d501-2cd5-fa52db6b9f41 2022-01-07T02:19:02+01:00 Well child visit (procedure)
4 0d016955-26db-966c-3d52-26d441bfcb97 2014-08-18T14:52:06+02:00 Well child visit (procedure)
5 3a2311a7-eb30-2aae-59c0-55ae273d1581 2016-11-01T04:38:02+01:00 Well child visit (procedure)
6 f9133ee8-f952-0e7d-f642-99d84fc9c6ad 2014-06-03T20:31:52+02:00 Well child visit (procedure)
df_filtered_polars = (
    df_timeline_polars
    .filter(pl.col("procedure_name") == "Well child visit (procedure)")
)
print(df_filtered_polars.head())
shape: (5, 18)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ patient_id ┆ social_sec ┆ prefix ┆ first_nam ┆ … ┆ end_time  ┆ Vaccine_n ┆ vaccine_c ┆ vaccine_c │
│ ---        ┆ urity_numb ┆ ---    ┆ e         ┆   ┆ ---       ┆ ame       ┆ ode       ┆ ode_syste │
│ str        ┆ er         ┆ str    ┆ ---       ┆   ┆ str       ┆ ---       ┆ ---       ┆ m         │
│            ┆ ---        ┆        ┆ str       ┆   ┆           ┆ str       ┆ str       ┆ ---       │
│            ┆ str        ┆        ┆           ┆   ┆           ┆           ┆           ┆ str       │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ fca8d2ca-7 ┆ 999-24-859 ┆ null   ┆ Marlen929 ┆ … ┆ 2016-04-0 ┆ Hep B,    ┆ 08        ┆ CVX       │
│ aef-2c27-3 ┆ 9          ┆        ┆           ┆   ┆ 1T04:24:4 ┆ adolescen ┆           ┆           │
│ bba-3f9472 ┆            ┆        ┆           ┆   ┆ 8+02:00   ┆ t or      ┆           ┆           │
│ 30…        ┆            ┆        ┆           ┆   ┆           ┆ pediatric ┆           ┆           │
│ 19936964-a ┆ 999-33-797 ┆ null   ┆ Monique14 ┆ … ┆ 2022-01-0 ┆ Hep B,    ┆ 08        ┆ CVX       │
│ 432-d501-2 ┆ 4          ┆        ┆ 8         ┆   ┆ 7T02:34:0 ┆ adolescen ┆           ┆           │
│ cd5-fa52db ┆            ┆        ┆           ┆   ┆ 2+01:00   ┆ t or      ┆           ┆           │
│ 6b…        ┆            ┆        ┆           ┆   ┆           ┆ pediatric ┆           ┆           │
│ 0d016955-2 ┆ 999-47-850 ┆ Ms.    ┆ Tracy345  ┆ … ┆ 2014-08-1 ┆ Influenza ┆ 140       ┆ CVX       │
│ 6db-966c-3 ┆ 0          ┆        ┆           ┆   ┆ 8T15:07:0 ┆ ,         ┆           ┆           │
│ d52-26d441 ┆            ┆        ┆           ┆   ┆ 6+02:00   ┆ seasonal, ┆           ┆           │
│ bf…        ┆            ┆        ┆           ┆   ┆           ┆ injectabl ┆           ┆           │
│            ┆            ┆        ┆           ┆   ┆           ┆ e,…       ┆           ┆           │
│ 3a2311a7-e ┆ 999-40-608 ┆ null   ┆ Janene3   ┆ … ┆ 2016-11-0 ┆ Hep B,    ┆ 08        ┆ CVX       │
│ b30-2aae-5 ┆ 4          ┆        ┆           ┆   ┆ 1T04:53:0 ┆ adolescen ┆           ┆           │
│ 9c0-55ae27 ┆            ┆        ┆           ┆   ┆ 2+01:00   ┆ t or      ┆           ┆           │
│ 3d…        ┆            ┆        ┆           ┆   ┆           ┆ pediatric ┆           ┆           │
│ f9133ee8-f ┆ 999-93-354 ┆ Ms.    ┆ Ozie87    ┆ … ┆ 2014-06-0 ┆ Influenza ┆ 140       ┆ CVX       │
│ 952-0e7d-f ┆ 6          ┆        ┆           ┆   ┆ 3T20:46:5 ┆ ,         ┆           ┆           │
│ 642-99d84f ┆            ┆        ┆           ┆   ┆ 2+02:00   ┆ seasonal, ┆           ┆           │
│ c9…        ┆            ┆        ┆           ┆   ┆           ┆ injectabl ┆           ┆           │
│            ┆            ┆        ┆           ┆   ┆           ┆ e,…       ┆           ┆           │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘
query = """
create or replace table filtered_table as (
    Select 
        * 
    from patient_timeline 
    where procedure_name = 'Well child visit (procedure)'
);
select * from filtered_table limit 5;
"""
con_duckdb.sql(query).to_df()
patient_id social_security_number prefix first_name last_name birthDate code system organization_id organization_name practitioner_name practitioner_id procedure_name start_time end_time Vaccine_name vaccine_code vaccine_code_system
0 fca8d2ca-7aef-2c27-3bba-3f94723012f5 999-24-8599 None Marlen929 Greenholt190 2016-04-01 410620009 SNOMED 3f12ebb4-e03c-3453-88d2-4fc9682383df NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC Dr. Homero668 Rolón954 9999962894 Well child visit (procedure) 2016-04-01T04:09:48+02:00 2016-04-01T04:24:48+02:00 Hep B, adolescent or pediatric 08 CVX
1 19936964-a432-d501-2cd5-fa52db6b9f41 999-33-7974 None Monique148 Haley279 2022-01-07 410620009 SNOMED c6b019eb-28ec-36f6-abf3-bcc4d1d58966 DUTTON FAMILY CARE ASSOCIATES LLP Dr. Maren639 Aufderhar910 9999950790 Well child visit (procedure) 2022-01-07T02:19:02+01:00 2022-01-07T02:34:02+01:00 Hep B, adolescent or pediatric 08 CVX
2 0d016955-26db-966c-3d52-26d441bfcb97 999-47-8500 Ms. Tracy345 Smith67 2000-07-17 410620009 SNOMED 39c15c0f-5c49-311e-99d2-1fb99d80e06e HARBOR HEALTH SERVICES INC Dr. Salvador46 Homenick806 9999977496 Well child visit (procedure) 2014-08-18T14:52:06+02:00 2014-08-18T15:07:06+02:00 Influenza, seasonal, injectable, preservative ... 140 CVX
3 3a2311a7-eb30-2aae-59c0-55ae273d1581 999-40-6084 None Janene3 Howe413 2016-11-01 410620009 SNOMED 2e8b42f8-4300-30a9-a576-d513d1021a0a GREATER LOWELL FAMILY PRACTICE PC Dr. Homero668 Salcedo309 9999959890 Well child visit (procedure) 2016-11-01T04:38:02+01:00 2016-11-01T04:53:02+01:00 Hep B, adolescent or pediatric 08 CVX
4 f9133ee8-f952-0e7d-f642-99d84fc9c6ad 999-93-3546 Ms. Ozie87 Nitzsche158 1998-04-21 410620009 SNOMED 0e9d0119-1157-35a0-8bb5-7e0aa3041928 CARING HEALTH CENTER, INC Dr. Kari181 Trantow673 9999979997 Well child visit (procedure) 2014-06-03T20:31:52+02:00 2014-06-03T20:46:52+02:00 Influenza, seasonal, injectable, preservative ... 140 CVX
df_filtered_ibis = df_timeline_IBIS.filter(df_timeline_IBIS.procedure_name == "Well child visit (procedure)")
df_filtered_ibis.to_pandas().head()
patient_id social_security_number prefix first_name last_name birthDate code system organization_id organization_name practitioner_name practitioner_id procedure_name start_time end_time Vaccine_name vaccine_code vaccine_code_system
0 fca8d2ca-7aef-2c27-3bba-3f94723012f5 999-24-8599 None Marlen929 Greenholt190 2016-04-01 410620009 SNOMED 3f12ebb4-e03c-3453-88d2-4fc9682383df NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC Dr. Homero668 Rolón954 9999962894 Well child visit (procedure) 2016-04-01T04:09:48+02:00 2016-04-01T04:24:48+02:00 Hep B, adolescent or pediatric 08 CVX
1 19936964-a432-d501-2cd5-fa52db6b9f41 999-33-7974 None Monique148 Haley279 2022-01-07 410620009 SNOMED c6b019eb-28ec-36f6-abf3-bcc4d1d58966 DUTTON FAMILY CARE ASSOCIATES LLP Dr. Maren639 Aufderhar910 9999950790 Well child visit (procedure) 2022-01-07T02:19:02+01:00 2022-01-07T02:34:02+01:00 Hep B, adolescent or pediatric 08 CVX
2 0d016955-26db-966c-3d52-26d441bfcb97 999-47-8500 Ms. Tracy345 Smith67 2000-07-17 410620009 SNOMED 39c15c0f-5c49-311e-99d2-1fb99d80e06e HARBOR HEALTH SERVICES INC Dr. Salvador46 Homenick806 9999977496 Well child visit (procedure) 2014-08-18T14:52:06+02:00 2014-08-18T15:07:06+02:00 Influenza, seasonal, injectable, preservative ... 140 CVX
3 3a2311a7-eb30-2aae-59c0-55ae273d1581 999-40-6084 None Janene3 Howe413 2016-11-01 410620009 SNOMED 2e8b42f8-4300-30a9-a576-d513d1021a0a GREATER LOWELL FAMILY PRACTICE PC Dr. Homero668 Salcedo309 9999959890 Well child visit (procedure) 2016-11-01T04:38:02+01:00 2016-11-01T04:53:02+01:00 Hep B, adolescent or pediatric 08 CVX
4 f9133ee8-f952-0e7d-f642-99d84fc9c6ad 999-93-3546 Ms. Ozie87 Nitzsche158 1998-04-21 410620009 SNOMED 0e9d0119-1157-35a0-8bb5-7e0aa3041928 CARING HEALTH CENTER, INC Dr. Kari181 Trantow673 9999979997 Well child visit (procedure) 2014-06-03T20:31:52+02:00 2014-06-03T20:46:52+02:00 Influenza, seasonal, injectable, preservative ... 140 CVX
# option 1
value = 'Well child visit (procedure)'
df_filtered_pyspark = df_timeline_spark.filter(df_timeline_spark['procedure_name']==value)

# option 2
df_filtered_pyspark = df_timeline_spark.filter(F.expr(f"procedure_name == '{value}'"))

# option 3 - sql
df_timeline_spark.createOrReplaceTempView("timeline")
df_filtered_pyspark = spark.sql(f"Select * from timeline where procedure_name =='{value}'")

#show results
df_filtered_pyspark[['patient_id','start_time','procedure_name']].limit(5).show()
+--------------------+--------------------+--------------------+
|          patient_id|          start_time|      procedure_name|
+--------------------+--------------------+--------------------+
|fca8d2ca-7aef-2c2...|2016-04-01T04:09:...|Well child visit ...|
|19936964-a432-d50...|2022-01-07T02:19:...|Well child visit ...|
|0d016955-26db-966...|2014-08-18T14:52:...|Well child visit ...|
|3a2311a7-eb30-2aa...|2016-11-01T04:38:...|Well child visit ...|
|f9133ee8-f952-0e7...|2014-06-03T20:31:...|Well child visit ...|
+--------------------+--------------------+--------------------+
%%R
df_filtered <- df %>%
  filter(procedure_name == "Well child visit (procedure)")
kable(head(df_filtered))


|patient_id                           |social_security_number |prefix |first_name |last_name    |birthDate  |      code|system |organization_id                      |organization_name                               |practitioner_name          |practitioner_id |procedure_name               |start_time                |end_time                  |Vaccine_name                                       |vaccine_code |vaccine_code_system |
|:------------------------------------|:----------------------|:------|:----------|:------------|:----------|---------:|:------|:------------------------------------|:-----------------------------------------------|:--------------------------|:---------------|:----------------------------|:-------------------------|:-------------------------|:--------------------------------------------------|:------------|:-------------------|
|fca8d2ca-7aef-2c27-3bba-3f94723012f5 |999-24-8599            |NA     |Marlen929  |Greenholt190 |2016-04-01 | 410620009|SNOMED |3f12ebb4-e03c-3453-88d2-4fc9682383df |NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC |Dr. Homero668 Rolón954     |9999962894      |Well child visit (procedure) |2016-04-01T04:09:48+02:00 |2016-04-01T04:24:48+02:00 |Hep B, adolescent or pediatric                     |08           |CVX                 |
|19936964-a432-d501-2cd5-fa52db6b9f41 |999-33-7974            |NA     |Monique148 |Haley279     |2022-01-07 | 410620009|SNOMED |c6b019eb-28ec-36f6-abf3-bcc4d1d58966 |DUTTON FAMILY CARE ASSOCIATES LLP               |Dr. Maren639 Aufderhar910  |9999950790      |Well child visit (procedure) |2022-01-07T02:19:02+01:00 |2022-01-07T02:34:02+01:00 |Hep B, adolescent or pediatric                     |08           |CVX                 |
|0d016955-26db-966c-3d52-26d441bfcb97 |999-47-8500            |Ms.    |Tracy345   |Smith67      |2000-07-17 | 410620009|SNOMED |39c15c0f-5c49-311e-99d2-1fb99d80e06e |HARBOR HEALTH SERVICES INC                      |Dr. Salvador46 Homenick806 |9999977496      |Well child visit (procedure) |2014-08-18T14:52:06+02:00 |2014-08-18T15:07:06+02:00 |Influenza, seasonal, injectable, preservative free |140          |CVX                 |
|3a2311a7-eb30-2aae-59c0-55ae273d1581 |999-40-6084            |NA     |Janene3    |Howe413      |2016-11-01 | 410620009|SNOMED |2e8b42f8-4300-30a9-a576-d513d1021a0a |GREATER LOWELL FAMILY PRACTICE PC               |Dr. Homero668 Salcedo309   |9999959890      |Well child visit (procedure) |2016-11-01T04:38:02+01:00 |2016-11-01T04:53:02+01:00 |Hep B, adolescent or pediatric                     |08           |CVX                 |
|f9133ee8-f952-0e7d-f642-99d84fc9c6ad |999-93-3546            |Ms.    |Ozie87     |Nitzsche158  |1998-04-21 | 410620009|SNOMED |0e9d0119-1157-35a0-8bb5-7e0aa3041928 |CARING HEALTH CENTER, INC                       |Dr. Kari181 Trantow673     |9999979997      |Well child visit (procedure) |2014-06-03T20:31:52+02:00 |2014-06-03T20:46:52+02:00 |Influenza, seasonal, injectable, preservative free |140          |CVX                 |
|f9133ee8-f952-0e7d-f642-99d84fc9c6ad |999-93-3546            |Ms.    |Ozie87     |Nitzsche158  |1998-04-21 | 410620009|SNOMED |0e9d0119-1157-35a0-8bb5-7e0aa3041928 |CARING HEALTH CENTER, INC                       |Dr. Kari181 Trantow673     |9999979997      |Well child visit (procedure) |2014-06-03T20:31:52+02:00 |2014-06-03T20:46:52+02:00 |meningococcal MCV4P                                |114          |CVX                 |
In addition: Warning message:
Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this. 

Besides filtering data by content it is also possible to filter the columns one wishes to show

df_filtered_pandas.filter(items = ['patient_id','birthDate','organization_name']).head(1)
patient_id birthDate organization_name
2 fca8d2ca-7aef-2c27-3bba-3f94723012f5 2016-04-01 NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC
df_filtered_pandas[['patient_id','birthDate','organization_name']].head(1)
patient_id birthDate organization_name
2 fca8d2ca-7aef-2c27-3bba-3f94723012f5 2016-04-01 NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC
df_filtered_polars.select([
    pl.col("patient_id"), pl.col("birthDate"), pl.col("organization_name")
]).head()
shape: (5, 3)
patient_id birthDate organization_name
str datetime[μs] str
"fca8d2ca-7aef-… 2016-04-01 00:00:00 "NEW BEDFORD IN…
"19936964-a432-… 2022-01-07 00:00:00 "DUTTON FAMILY …
"0d016955-26db-… 2000-07-17 00:00:00 "HARBOR HEALTH …
"3a2311a7-eb30-… 2016-11-01 00:00:00 "GREATER LOWELL…
"f9133ee8-f952-… 1998-04-21 00:00:00 "CARING HEALTH …
query = """
Select 
    patient_id,
    birthDate,
    organization_name
from patient_timeline limit 1
"""
con_duckdb.sql(query).to_df()
patient_id birthDate organization_name
0 4fc244f3-2c0e-4017-d64d-c2c4cd03655f 1965-01-07 HOLYOKE HEALTH CENTER INC
df_filtered_ibis.select(
    'patient_id','birthDate','organization_name'
).to_pandas().head()
patient_id birthDate organization_name
0 fca8d2ca-7aef-2c27-3bba-3f94723012f5 2016-04-01 NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC
1 19936964-a432-d501-2cd5-fa52db6b9f41 2022-01-07 DUTTON FAMILY CARE ASSOCIATES LLP
2 0d016955-26db-966c-3d52-26d441bfcb97 2000-07-17 HARBOR HEALTH SERVICES INC
3 3a2311a7-eb30-2aae-59c0-55ae273d1581 2016-11-01 GREATER LOWELL FAMILY PRACTICE PC
4 f9133ee8-f952-0e7d-f642-99d84fc9c6ad 1998-04-21 CARING HEALTH CENTER, INC
df_filtered_pyspark.select('patient_id','birthDate','organization_name').limit(1).show()
+--------------------+-------------------+--------------------+
|          patient_id|          birthDate|   organization_name|
+--------------------+-------------------+--------------------+
|fca8d2ca-7aef-2c2...|2016-04-01 00:00:00|NEW BEDFORD INTER...|
+--------------------+-------------------+--------------------+
#TODO

Mutate

Another common action is the need to adapt data. For instance, we may want to create an additional column indicating the year of a visit in the patient timeline, based on the available start_time. Currently, the date is in a specific timezone, and it’s important to consider this when converting it to a datetime format. For now, we maintain the date in the same timezone before extracting the year from it.

# transform start_time to datetime (its currently string format)
df_timeline_pandas['start_time'] = pd.to_datetime(df_timeline_pandas['start_time'], utc=  True)
df_timeline_pandas['start_time'].dtype

#option 1
df_timeline_pandas['year'] = df_timeline_pandas['start_time'].dt.year

#option 2
df_timeline_pandas = df_timeline_pandas.assign(year=df_timeline_pandas['start_time'].dt.year )

df_timeline_pandas[['start_time','year']].head(1)
start_time year
0 2015-01-08 06:11:47+00:00 2015
df_new_var_polars = (
    df_timeline_polars
    .with_columns(
        pl.col("start_time").str.to_datetime(format="%Y-%m-%dT%H:%M:%S%z")
        .dt.year().alias("start_year")
    )
)
print(df_new_var_polars.head())
shape: (5, 19)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ patient_id ┆ social_sec ┆ prefix ┆ first_nam ┆ … ┆ Vaccine_n ┆ vaccine_c ┆ vaccine_c ┆ start_yea │
│ ---        ┆ urity_numb ┆ ---    ┆ e         ┆   ┆ ame       ┆ ode       ┆ ode_syste ┆ r         │
│ str        ┆ er         ┆ str    ┆ ---       ┆   ┆ ---       ┆ ---       ┆ m         ┆ ---       │
│            ┆ ---        ┆        ┆ str       ┆   ┆ str       ┆ str       ┆ ---       ┆ i32       │
│            ┆ str        ┆        ┆           ┆   ┆           ┆           ┆ str       ┆           │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 4fc244f3-2 ┆ 999-53-581 ┆ Mrs.   ┆ Alyce744  ┆ … ┆ zoster    ┆ 121       ┆ CVX       ┆ 2015      │
│ c0e-4017-d ┆ 3          ┆        ┆           ┆   ┆ vaccine,  ┆           ┆           ┆           │
│ 64d-c2c4cd ┆            ┆        ┆           ┆   ┆ live      ┆           ┆           ┆           │
│ 03…        ┆            ┆        ┆           ┆   ┆           ┆           ┆           ┆           │
│ 4fc244f3-2 ┆ 999-53-581 ┆ Mrs.   ┆ Alyce744  ┆ … ┆ Influenza ┆ 140       ┆ CVX       ┆ 2015      │
│ c0e-4017-d ┆ 3          ┆        ┆           ┆   ┆ ,         ┆           ┆           ┆           │
│ 64d-c2c4cd ┆            ┆        ┆           ┆   ┆ seasonal, ┆           ┆           ┆           │
│ 03…        ┆            ┆        ┆           ┆   ┆ injectabl ┆           ┆           ┆           │
│            ┆            ┆        ┆           ┆   ┆ e,…       ┆           ┆           ┆           │
│ fca8d2ca-7 ┆ 999-24-859 ┆ null   ┆ Marlen929 ┆ … ┆ Hep B,    ┆ 08        ┆ CVX       ┆ 2016      │
│ aef-2c27-3 ┆ 9          ┆        ┆           ┆   ┆ adolescen ┆           ┆           ┆           │
│ bba-3f9472 ┆            ┆        ┆           ┆   ┆ t or      ┆           ┆           ┆           │
│ 30…        ┆            ┆        ┆           ┆   ┆ pediatric ┆           ┆           ┆           │
│ 19936964-a ┆ 999-33-797 ┆ null   ┆ Monique14 ┆ … ┆ Hep B,    ┆ 08        ┆ CVX       ┆ 2022      │
│ 432-d501-2 ┆ 4          ┆        ┆ 8         ┆   ┆ adolescen ┆           ┆           ┆           │
│ cd5-fa52db ┆            ┆        ┆           ┆   ┆ t or      ┆           ┆           ┆           │
│ 6b…        ┆            ┆        ┆           ┆   ┆ pediatric ┆           ┆           ┆           │
│ 0d016955-2 ┆ 999-47-850 ┆ Ms.    ┆ Tracy345  ┆ … ┆ Influenza ┆ 140       ┆ CVX       ┆ 2014      │
│ 6db-966c-3 ┆ 0          ┆        ┆           ┆   ┆ ,         ┆           ┆           ┆           │
│ d52-26d441 ┆            ┆        ┆           ┆   ┆ seasonal, ┆           ┆           ┆           │
│ bf…        ┆            ┆        ┆           ┆   ┆ injectabl ┆           ┆           ┆           │
│            ┆            ┆        ┆           ┆   ┆ e,…       ┆           ┆           ┆           │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘
query = """
Select
    cast(start_time as datetime) as start_time,
    year(cast(start_time as datetime)) as year
from patient_timeline;
"""
con_duckdb.sql(query).to_df()
start_time year
0 2015-01-08 06:11:47 2015
1 2015-01-08 06:11:47 2015
2 2016-04-01 02:09:48 2016
3 2022-01-07 01:19:02 2022
4 2014-08-18 12:52:06 2014
... ... ...
13832 2019-07-24 02:17:26 2019
13833 2020-07-29 02:17:26 2020
13834 2021-08-04 02:17:26 2021
13835 2022-08-10 02:17:26 2022
13836 2023-08-16 02:17:26 2023

13837 rows × 2 columns

df_new_var_ibis = df_timeline_IBIS.mutate(start_year=df_timeline_IBIS.start_time.cast("timestamp").year()).to_pandas()
df_new_var_ibis.head()
patient_id social_security_number prefix first_name last_name birthDate code system organization_id organization_name practitioner_name practitioner_id procedure_name start_time end_time Vaccine_name vaccine_code vaccine_code_system start_year
0 4fc244f3-2c0e-4017-d64d-c2c4cd03655f 999-53-5813 Mrs. Alyce744 Bergstrom287 1965-01-07 162673000 SNOMED b03b624d-c939-3688-986d-9555b8009a3b HOLYOKE HEALTH CENTER INC Dr. Bennett146 Hartmann983 9999981894 General examination of patient (procedure) 2015-01-08T07:11:47+01:00 2015-01-08T07:26:47+01:00 zoster vaccine, live 121 CVX 2015
1 4fc244f3-2c0e-4017-d64d-c2c4cd03655f 999-53-5813 Mrs. Alyce744 Bergstrom287 1965-01-07 162673000 SNOMED b03b624d-c939-3688-986d-9555b8009a3b HOLYOKE HEALTH CENTER INC Dr. Bennett146 Hartmann983 9999981894 General examination of patient (procedure) 2015-01-08T07:11:47+01:00 2015-01-08T07:26:47+01:00 Influenza, seasonal, injectable, preservative ... 140 CVX 2015
2 fca8d2ca-7aef-2c27-3bba-3f94723012f5 999-24-8599 None Marlen929 Greenholt190 2016-04-01 410620009 SNOMED 3f12ebb4-e03c-3453-88d2-4fc9682383df NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC Dr. Homero668 Rolón954 9999962894 Well child visit (procedure) 2016-04-01T04:09:48+02:00 2016-04-01T04:24:48+02:00 Hep B, adolescent or pediatric 08 CVX 2016
3 19936964-a432-d501-2cd5-fa52db6b9f41 999-33-7974 None Monique148 Haley279 2022-01-07 410620009 SNOMED c6b019eb-28ec-36f6-abf3-bcc4d1d58966 DUTTON FAMILY CARE ASSOCIATES LLP Dr. Maren639 Aufderhar910 9999950790 Well child visit (procedure) 2022-01-07T02:19:02+01:00 2022-01-07T02:34:02+01:00 Hep B, adolescent or pediatric 08 CVX 2022
4 0d016955-26db-966c-3d52-26d441bfcb97 999-47-8500 Ms. Tracy345 Smith67 2000-07-17 410620009 SNOMED 39c15c0f-5c49-311e-99d2-1fb99d80e06e HARBOR HEALTH SERVICES INC Dr. Salvador46 Homenick806 9999977496 Well child visit (procedure) 2014-08-18T14:52:06+02:00 2014-08-18T15:07:06+02:00 Influenza, seasonal, injectable, preservative ... 140 CVX 2014
# transform start_time to datetime (its currently string format)
df_timeline_spark = df_timeline_spark.withColumn("start_time", F.to_utc_timestamp(df_timeline_spark["start_time"], "Europe/Amsterdam"))

df_timeline_spark = df_timeline_spark.withColumn("year", F.year(df_timeline_spark["start_time"]))
df_timeline_spark.select('start_time','year').limit(3).show()
+-------------------+----+
|         start_time|year|
+-------------------+----+
|2015-01-08 05:11:47|2015|
|2015-01-08 05:11:47|2015|
|2016-04-01 00:09:48|2016|
+-------------------+----+
%%R
df_new_var <- df %>%
  mutate(start_year = year(as.Date(start_time, format = "%Y-%m-%d")))
kable(head(df_new_var))


|patient_id                           |social_security_number |prefix |first_name |last_name    |birthDate  |      code|system |organization_id                      |organization_name                               |practitioner_name          |practitioner_id |procedure_name                             |start_time                |end_time                  |Vaccine_name                                       |vaccine_code |vaccine_code_system | start_year|
|:------------------------------------|:----------------------|:------|:----------|:------------|:----------|---------:|:------|:------------------------------------|:-----------------------------------------------|:--------------------------|:---------------|:------------------------------------------|:-------------------------|:-------------------------|:--------------------------------------------------|:------------|:-------------------|----------:|
|4fc244f3-2c0e-4017-d64d-c2c4cd03655f |999-53-5813            |Mrs.   |Alyce744   |Bergstrom287 |1965-01-07 | 162673000|SNOMED |b03b624d-c939-3688-986d-9555b8009a3b |HOLYOKE HEALTH CENTER INC                       |Dr. Bennett146 Hartmann983 |9999981894      |General examination of patient (procedure) |2015-01-08T07:11:47+01:00 |2015-01-08T07:26:47+01:00 |zoster vaccine, live                               |121          |CVX                 |       2015|
|4fc244f3-2c0e-4017-d64d-c2c4cd03655f |999-53-5813            |Mrs.   |Alyce744   |Bergstrom287 |1965-01-07 | 162673000|SNOMED |b03b624d-c939-3688-986d-9555b8009a3b |HOLYOKE HEALTH CENTER INC                       |Dr. Bennett146 Hartmann983 |9999981894      |General examination of patient (procedure) |2015-01-08T07:11:47+01:00 |2015-01-08T07:26:47+01:00 |Influenza, seasonal, injectable, preservative free |140          |CVX                 |       2015|
|fca8d2ca-7aef-2c27-3bba-3f94723012f5 |999-24-8599            |NA     |Marlen929  |Greenholt190 |2016-04-01 | 410620009|SNOMED |3f12ebb4-e03c-3453-88d2-4fc9682383df |NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC |Dr. Homero668 Rolón954     |9999962894      |Well child visit (procedure)               |2016-04-01T04:09:48+02:00 |2016-04-01T04:24:48+02:00 |Hep B, adolescent or pediatric                     |08           |CVX                 |       2016|
|19936964-a432-d501-2cd5-fa52db6b9f41 |999-33-7974            |NA     |Monique148 |Haley279     |2022-01-07 | 410620009|SNOMED |c6b019eb-28ec-36f6-abf3-bcc4d1d58966 |DUTTON FAMILY CARE ASSOCIATES LLP               |Dr. Maren639 Aufderhar910  |9999950790      |Well child visit (procedure)               |2022-01-07T02:19:02+01:00 |2022-01-07T02:34:02+01:00 |Hep B, adolescent or pediatric                     |08           |CVX                 |       2022|
|0d016955-26db-966c-3d52-26d441bfcb97 |999-47-8500            |Ms.    |Tracy345   |Smith67      |2000-07-17 | 410620009|SNOMED |39c15c0f-5c49-311e-99d2-1fb99d80e06e |HARBOR HEALTH SERVICES INC                      |Dr. Salvador46 Homenick806 |9999977496      |Well child visit (procedure)               |2014-08-18T14:52:06+02:00 |2014-08-18T15:07:06+02:00 |Influenza, seasonal, injectable, preservative free |140          |CVX                 |       2014|
|3a2311a7-eb30-2aae-59c0-55ae273d1581 |999-40-6084            |NA     |Janene3    |Howe413      |2016-11-01 | 410620009|SNOMED |2e8b42f8-4300-30a9-a576-d513d1021a0a |GREATER LOWELL FAMILY PRACTICE PC               |Dr. Homero668 Salcedo309   |9999959890      |Well child visit (procedure)               |2016-11-01T04:38:02+01:00 |2016-11-01T04:53:02+01:00 |Hep B, adolescent or pediatric                     |08           |CVX                 |       2016|

Order

It often occurs that you want to order your data. In this case, let’s arrange the patient timeline by start_time. Most libraries automatically order in ascending (oldest first) and allow you to order in descending (newest first).

df_ordered_pandas = df_timeline_pandas.sort_values(by = 'start_time', ascending = False)
df_ordered_pandas[['patient_id','start_time','procedure_name']].head()
patient_id start_time procedure_name
1871 1d24c422-98a8-713d-3184-fbd05c3695a1 2023-09-05 10:06:14+00:00 Well child visit (procedure)
1872 1d24c422-98a8-713d-3184-fbd05c3695a1 2023-09-05 10:06:14+00:00 Well child visit (procedure)
9916 404e3ebd-a1e3-4678-7d4e-58a48ca21418 2023-09-05 03:32:07+00:00 Well child visit (procedure)
5872 4efaa926-30cf-0fcd-04d9-0d31c4071483 2023-09-04 04:12:02+00:00 General examination of patient (procedure)
12301 8f262a31-6268-5a96-7432-1723d168687f 2023-09-04 01:54:17+00:00 General examination of patient (procedure)
df_ordered_polars = (
    df_timeline_polars
    .with_columns(
        pl.col("start_time").str.to_datetime(format="%Y-%m-%dT%H:%M:%S%z")
    )
    .sort("start_time")
)
print(df_ordered_polars.head())
shape: (5, 18)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ patient_id ┆ social_sec ┆ prefix ┆ first_nam ┆ … ┆ end_time  ┆ Vaccine_n ┆ vaccine_c ┆ vaccine_c │
│ ---        ┆ urity_numb ┆ ---    ┆ e         ┆   ┆ ---       ┆ ame       ┆ ode       ┆ ode_syste │
│ str        ┆ er         ┆ str    ┆ ---       ┆   ┆ str       ┆ ---       ┆ ---       ┆ m         │
│            ┆ ---        ┆        ┆ str       ┆   ┆           ┆ str       ┆ str       ┆ ---       │
│            ┆ str        ┆        ┆           ┆   ┆           ┆           ┆           ┆ str       │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 9808a732-1 ┆ 999-90-152 ┆ Mrs.   ┆ Pattie91  ┆ … ┆ 2013-09-1 ┆ Influenza ┆ 140       ┆ CVX       │
│ f51-6483-d ┆ 4          ┆        ┆           ┆   ┆ 1T19:23:1 ┆ ,         ┆           ┆           │
│ 8b0-ec0d61 ┆            ┆        ┆           ┆   ┆ 4+02:00   ┆ seasonal, ┆           ┆           │
│ f8…        ┆            ┆        ┆           ┆   ┆           ┆ injectabl ┆           ┆           │
│            ┆            ┆        ┆           ┆   ┆           ┆ e,…       ┆           ┆           │
│ 6f93e13d-d ┆ 999-65-407 ┆ Mrs.   ┆ Carrie738 ┆ … ┆ 2013-09-1 ┆ Influenza ┆ 140       ┆ CVX       │
│ 237-acc2-4 ┆ 9          ┆        ┆           ┆   ┆ 1T21:06:5 ┆ ,         ┆           ┆           │
│ 7f0-d1fa20 ┆            ┆        ┆           ┆   ┆ 0+02:00   ┆ seasonal, ┆           ┆           │
│ c6…        ┆            ┆        ┆           ┆   ┆           ┆ injectabl ┆           ┆           │
│            ┆            ┆        ┆           ┆   ┆           ┆ e,…       ┆           ┆           │
│ 7aafabc2-f ┆ 999-49-954 ┆ null   ┆ Maximina9 ┆ … ┆ 2013-09-1 ┆ Hep B,    ┆ 08        ┆ CVX       │
│ 4cf-3d82-f ┆ 2          ┆        ┆ 16        ┆   ┆ 2T05:05:4 ┆ adolescen ┆           ┆           │
│ 1e9-64d809 ┆            ┆        ┆           ┆   ┆ 5+02:00   ┆ t or      ┆           ┆           │
│ 4b…        ┆            ┆        ┆           ┆   ┆           ┆ pediatric ┆           ┆           │
│ 1c2f968a-4 ┆ 999-61-837 ┆ null   ┆ Claribel7 ┆ … ┆ 2013-09-1 ┆ varicella ┆ 21        ┆ CVX       │
│ 735-c752-0 ┆ 5          ┆        ┆ 06        ┆   ┆ 3T02:11:3 ┆           ┆           ┆           │
│ 7dd-54a664 ┆            ┆        ┆           ┆   ┆ 5+02:00   ┆           ┆           ┆           │
│ fe…        ┆            ┆        ┆           ┆   ┆           ┆           ┆           ┆           │
│ 1c2f968a-4 ┆ 999-61-837 ┆ null   ┆ Claribel7 ┆ … ┆ 2013-09-1 ┆ IPV       ┆ 10        ┆ CVX       │
│ 735-c752-0 ┆ 5          ┆        ┆ 06        ┆   ┆ 3T02:11:3 ┆           ┆           ┆           │
│ 7dd-54a664 ┆            ┆        ┆           ┆   ┆ 5+02:00   ┆           ┆           ┆           │
│ fe…        ┆            ┆        ┆           ┆   ┆           ┆           ┆           ┆           │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘
query = """
Select 
    * 
from patient_timeline
order by start_time asc 
limit 5
"""
con_duckdb.sql(query).to_df()
patient_id social_security_number prefix first_name last_name birthDate code system organization_id organization_name practitioner_name practitioner_id procedure_name start_time end_time Vaccine_name vaccine_code vaccine_code_system
0 9808a732-1f51-6483-d8b0-ec0d61f8cead 999-90-1524 Mrs. Pattie91 Abbott774 1967-08-23 162673000 SNOMED b44955d1-f0e2-3beb-a013-708a81dbe430 SIGNATURE HEALTHCARE MEDICAL GROUP INC Dr. Ernest565 Runte676 9999970699 General examination of patient (procedure) 2013-09-11T19:08:14+02:00 2013-09-11T19:23:14+02:00 Influenza, seasonal, injectable, preservative ... 140 CVX
1 6f93e13d-d237-acc2-47f0-d1fa20c62df2 999-65-4079 Mrs. Carrie738 Blick895 1985-07-03 162673000 SNOMED 11a40a26-a5a3-331d-aef0-d06dd90e2b4e CAPE COD PRIMARY CARE LLC Dr. Len277 Wintheiser220 9999947994 General examination of patient (procedure) 2013-09-11T20:51:50+02:00 2013-09-11T21:06:50+02:00 Influenza, seasonal, injectable, preservative ... 140 CVX
2 7aafabc2-f4cf-3d82-f1e9-64d8094beeb6 999-49-9542 None Maximina916 Rutherford999 2013-09-12 410620009 SNOMED dd9914b0-9491-3330-a890-d34e95e497ea CHILD & FAMILY SERVICES, INC. Dr. Assunta351 Haley279 9999967794 Well child visit (procedure) 2013-09-12T04:50:45+02:00 2013-09-12T05:05:45+02:00 Hep B, adolescent or pediatric 08 CVX
3 1c2f968a-4735-c752-07dd-54a664fef295 999-61-8375 None Claribel706 Ruecker817 2008-10-03 410620009 SNOMED 542185c1-943a-30f4-aba0-0efaa92488a4 EVEREST HEALTH CARE SPECIALISTS PLLC Dr. Luetta803 Rau926 9999955096 Well child visit (procedure) 2013-09-13T01:56:35+02:00 2013-09-13T02:11:35+02:00 MMR 03 CVX
4 1c2f968a-4735-c752-07dd-54a664fef295 999-61-8375 None Claribel706 Ruecker817 2008-10-03 410620009 SNOMED 542185c1-943a-30f4-aba0-0efaa92488a4 EVEREST HEALTH CARE SPECIALISTS PLLC Dr. Luetta803 Rau926 9999955096 Well child visit (procedure) 2013-09-13T01:56:35+02:00 2013-09-13T02:11:35+02:00 DTaP 20 CVX
df_ordered_ibis = df_timeline_IBIS.order_by(df_timeline_IBIS.start_time.cast("timestamp")).to_pandas()
df_ordered_ibis.head()
patient_id social_security_number prefix first_name last_name birthDate code system organization_id organization_name practitioner_name practitioner_id procedure_name start_time end_time Vaccine_name vaccine_code vaccine_code_system
0 9808a732-1f51-6483-d8b0-ec0d61f8cead 999-90-1524 Mrs. Pattie91 Abbott774 1967-08-23 162673000 SNOMED b44955d1-f0e2-3beb-a013-708a81dbe430 SIGNATURE HEALTHCARE MEDICAL GROUP INC Dr. Ernest565 Runte676 9999970699 General examination of patient (procedure) 2013-09-11T19:08:14+02:00 2013-09-11T19:23:14+02:00 Influenza, seasonal, injectable, preservative ... 140 CVX
1 6f93e13d-d237-acc2-47f0-d1fa20c62df2 999-65-4079 Mrs. Carrie738 Blick895 1985-07-03 162673000 SNOMED 11a40a26-a5a3-331d-aef0-d06dd90e2b4e CAPE COD PRIMARY CARE LLC Dr. Len277 Wintheiser220 9999947994 General examination of patient (procedure) 2013-09-11T20:51:50+02:00 2013-09-11T21:06:50+02:00 Influenza, seasonal, injectable, preservative ... 140 CVX
2 7aafabc2-f4cf-3d82-f1e9-64d8094beeb6 999-49-9542 None Maximina916 Rutherford999 2013-09-12 410620009 SNOMED dd9914b0-9491-3330-a890-d34e95e497ea CHILD & FAMILY SERVICES, INC. Dr. Assunta351 Haley279 9999967794 Well child visit (procedure) 2013-09-12T04:50:45+02:00 2013-09-12T05:05:45+02:00 Hep B, adolescent or pediatric 08 CVX
3 1c2f968a-4735-c752-07dd-54a664fef295 999-61-8375 None Claribel706 Ruecker817 2008-10-03 410620009 SNOMED 542185c1-943a-30f4-aba0-0efaa92488a4 EVEREST HEALTH CARE SPECIALISTS PLLC Dr. Luetta803 Rau926 9999955096 Well child visit (procedure) 2013-09-13T01:56:35+02:00 2013-09-13T02:11:35+02:00 varicella 21 CVX
4 1c2f968a-4735-c752-07dd-54a664fef295 999-61-8375 None Claribel706 Ruecker817 2008-10-03 410620009 SNOMED 542185c1-943a-30f4-aba0-0efaa92488a4 EVEREST HEALTH CARE SPECIALISTS PLLC Dr. Luetta803 Rau926 9999955096 Well child visit (procedure) 2013-09-13T01:56:35+02:00 2013-09-13T02:11:35+02:00 IPV 10 CVX
df_ordered_pyspark = df_timeline_spark.orderBy('start_time',ascending = False)
df_ordered_pyspark.limit(5).show()
+--------------------+----------------------+------+-----------+--------------+-------------------+---------+------+--------------------+--------------------+--------------------+---------------+--------------------+-------------------+--------------------+--------------------+------------+-------------------+----+
|          patient_id|social_security_number|prefix| first_name|     last_name|          birthDate|     code|system|     organization_id|   organization_name|   practitioner_name|practitioner_id|      procedure_name|         start_time|            end_time|        Vaccine_name|vaccine_code|vaccine_code_system|year|
+--------------------+----------------------+------+-----------+--------------+-------------------+---------+------+--------------------+--------------------+--------------------+---------------+--------------------+-------------------+--------------------+--------------------+------------+-------------------+----+
|1d24c422-98a8-713...|           999-29-9635|  NULL|  Taneka590|       Howe413|2011-08-16 00:00:00|410620009|SNOMED|545dcda5-7471-374...|COMMUNITY HEALTH ...|Dr. Lezlie553 Kos...|     9999981191|Well child visit ...|2023-09-05 08:06:14|2023-09-05T12:21:...|   HPV, quadrivalent|          62|                CVX|2023|
|1d24c422-98a8-713...|           999-29-9635|  NULL|  Taneka590|       Howe413|2011-08-16 00:00:00|410620009|SNOMED|545dcda5-7471-374...|COMMUNITY HEALTH ...|Dr. Lezlie553 Kos...|     9999981191|Well child visit ...|2023-09-05 08:06:14|2023-09-05T12:21:...|Influenza, season...|         140|                CVX|2023|
|404e3ebd-a1e3-467...|           999-54-8765|  NULL|Nicholle822|      Runte676|2013-08-27 00:00:00|410620009|SNOMED|84abfbd5-41c1-3df...|SKIN WELLNESS PHY...|Dr. Cole117 Block661|     9999953190|Well child visit ...|2023-09-05 01:32:07|2023-09-05T05:47:...|Influenza, season...|         140|                CVX|2023|
|4efaa926-30cf-0fc...|           999-66-1148|  Mrs.|    Lisa683|Schamberger479|1951-04-30 00:00:00|162673000|SNOMED|0583e4f7-f1c2-310...|COMMUNITY HEALTH ...| Dr. Jewel43 Howe413|     9999978494|General examinati...|2023-09-04 02:12:02|2023-09-04T06:27:...|Influenza, season...|         140|                CVX|2023|
|8f262a31-6268-5a9...|           999-13-3109|  Mrs.|      Pa577|     Graham902|1970-08-17 00:00:00|162673000|SNOMED|e1892a2c-6dca-3c4...|PRIMARY & PREVENT...|Dr. Derek111 Erdm...|     9999954990|General examinati...|2023-09-03 23:54:17|2023-09-04T04:09:...|Influenza, season...|         140|                CVX|2023|
+--------------------+----------------------+------+-----------+--------------+-------------------+---------+------+--------------------+--------------------+--------------------+---------------+--------------------+-------------------+--------------------+--------------------+------------+-------------------+----+
%%R
df_ordered <- df %>%
  mutate(start_time = as.Date(start_time), format = "%Y-%m-%d") %>%
  arrange(start_time)
kable(head(df_ordered))


|patient_id                           |social_security_number |prefix |first_name  |last_name     |birthDate  |      code|system |organization_id                      |organization_name                             |practitioner_name        |practitioner_id |procedure_name                             |start_time |end_time                  |Vaccine_name                                       |vaccine_code |vaccine_code_system |format   |
|:------------------------------------|:----------------------|:------|:-----------|:-------------|:----------|---------:|:------|:------------------------------------|:---------------------------------------------|:------------------------|:---------------|:------------------------------------------|:----------|:-------------------------|:--------------------------------------------------|:------------|:-------------------|:--------|
|6f93e13d-d237-acc2-47f0-d1fa20c62df2 |999-65-4079            |Mrs.   |Carrie738   |Blick895      |1985-07-03 | 162673000|SNOMED |11a40a26-a5a3-331d-aef0-d06dd90e2b4e |CAPE COD PRIMARY CARE LLC                     |Dr. Len277 Wintheiser220 |9999947994      |General examination of patient (procedure) |2013-09-11 |2013-09-11T21:06:50+02:00 |Influenza, seasonal, injectable, preservative free |140          |CVX                 |%Y-%m-%d |
|9808a732-1f51-6483-d8b0-ec0d61f8cead |999-90-1524            |Mrs.   |Pattie91    |Abbott774     |1967-08-23 | 162673000|SNOMED |b44955d1-f0e2-3beb-a013-708a81dbe430 |SIGNATURE HEALTHCARE MEDICAL GROUP INC        |Dr. Ernest565 Runte676   |9999970699      |General examination of patient (procedure) |2013-09-11 |2013-09-11T19:23:14+02:00 |Influenza, seasonal, injectable, preservative free |140          |CVX                 |%Y-%m-%d |
|7aafabc2-f4cf-3d82-f1e9-64d8094beeb6 |999-49-9542            |NA     |Maximina916 |Rutherford999 |2013-09-12 | 410620009|SNOMED |dd9914b0-9491-3330-a890-d34e95e497ea |CHILD & FAMILY SERVICES, INC.                 |Dr. Assunta351 Haley279  |9999967794      |Well child visit (procedure)               |2013-09-12 |2013-09-12T05:05:45+02:00 |Hep B, adolescent or pediatric                     |08           |CVX                 |%Y-%m-%d |
|0a78537c-628d-77a4-d16b-ec23b6327c9e |999-89-9767            |NA     |Kalyn451    |Smith67       |2007-09-28 | 410620009|SNOMED |7a75decf-7d3f-3b0d-8c2b-3abf0a8c7f58 |NEW ENGLAND PROFESSIONAL HOME HEALTH CARE LLC |Dr. Afton574 Kulas532    |9999944595      |Well child visit (procedure)               |2013-09-13 |2013-09-13T06:33:22+02:00 |Influenza, seasonal, injectable, preservative free |140          |CVX                 |%Y-%m-%d |
|1c2f968a-4735-c752-07dd-54a664fef295 |999-61-8375            |NA     |Claribel706 |Ruecker817    |2008-10-03 | 410620009|SNOMED |542185c1-943a-30f4-aba0-0efaa92488a4 |EVEREST HEALTH CARE SPECIALISTS PLLC          |Dr. Luetta803 Rau926     |9999955096      |Well child visit (procedure)               |2013-09-13 |2013-09-13T02:11:35+02:00 |varicella                                          |21           |CVX                 |%Y-%m-%d |
|1c2f968a-4735-c752-07dd-54a664fef295 |999-61-8375            |NA     |Claribel706 |Ruecker817    |2008-10-03 | 410620009|SNOMED |542185c1-943a-30f4-aba0-0efaa92488a4 |EVEREST HEALTH CARE SPECIALISTS PLLC          |Dr. Luetta803 Rau926     |9999955096      |Well child visit (procedure)               |2013-09-13 |2013-09-13T02:11:35+02:00 |IPV                                                |10           |CVX                 |%Y-%m-%d |

Rename

Another action that is done frequently is the renaming of a column. As an example we rename the birthDate column to birth_date

In pandas, many actions such as rename can be adapted directly in the same dataframe by adding the option inplace = True.

Alternatively, one can create a new dataframe and keep the name the same in the original dataframe.

#option 1 - change in dataframe
df_timeline_pandas.rename(columns = {'birthDate':'birth_date'}, inplace = True)

#option 2 - change only in new dataframe
df_renamed_pandas = df_timeline_pandas.rename(columns = {'birthDate':'birth_date'})
df_renamed_pandas.head(1)
patient_id social_security_number prefix first_name last_name birth_date code system organization_id organization_name practitioner_name practitioner_id procedure_name start_time end_time Vaccine_name vaccine_code vaccine_code_system year
0 4fc244f3-2c0e-4017-d64d-c2c4cd03655f 999-53-5813 Mrs. Alyce744 Bergstrom287 1965-01-07 162673000 SNOMED b03b624d-c939-3688-986d-9555b8009a3b HOLYOKE HEALTH CENTER INC Dr. Bennett146 Hartmann983 9999981894 General examination of patient (procedure) 2015-01-08 06:11:47+00:00 2015-01-08T07:26:47+01:00 zoster vaccine, live 121 CVX 2015
df_renamed_polars = (
    df_timeline_polars
    .rename({"birthDate": "birth_date"})
)
print(df_renamed_polars.head(1))
shape: (1, 18)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ patient_id ┆ social_sec ┆ prefix ┆ first_nam ┆ … ┆ end_time  ┆ Vaccine_n ┆ vaccine_c ┆ vaccine_c │
│ ---        ┆ urity_numb ┆ ---    ┆ e         ┆   ┆ ---       ┆ ame       ┆ ode       ┆ ode_syste │
│ str        ┆ er         ┆ str    ┆ ---       ┆   ┆ str       ┆ ---       ┆ ---       ┆ m         │
│            ┆ ---        ┆        ┆ str       ┆   ┆           ┆ str       ┆ str       ┆ ---       │
│            ┆ str        ┆        ┆           ┆   ┆           ┆           ┆           ┆ str       │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 4fc244f3-2 ┆ 999-53-581 ┆ Mrs.   ┆ Alyce744  ┆ … ┆ 2015-01-0 ┆ zoster    ┆ 121       ┆ CVX       │
│ c0e-4017-d ┆ 3          ┆        ┆           ┆   ┆ 8T07:26:4 ┆ vaccine,  ┆           ┆           │
│ 64d-c2c4cd ┆            ┆        ┆           ┆   ┆ 7+01:00   ┆ live      ┆           ┆           │
│ 03…        ┆            ┆        ┆           ┆   ┆           ┆           ┆           ┆           │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘
query = """
alter table patient_timeline
rename column birthDate to birth_date;

select * from patient_timeline limit 5;
"""
print(con_duckdb.sql(query).to_df())

con_duckdb.sql("alter table patient_timeline rename column birth_date to birthDate")
                             patient_id social_security_number prefix   
0  4fc244f3-2c0e-4017-d64d-c2c4cd03655f            999-53-5813   Mrs.  \
1  4fc244f3-2c0e-4017-d64d-c2c4cd03655f            999-53-5813   Mrs.   
2  fca8d2ca-7aef-2c27-3bba-3f94723012f5            999-24-8599   None   
3  19936964-a432-d501-2cd5-fa52db6b9f41            999-33-7974   None   
4  0d016955-26db-966c-3d52-26d441bfcb97            999-47-8500    Ms.   

   first_name     last_name birth_date       code  system   
0    Alyce744  Bergstrom287 1965-01-07  162673000  SNOMED  \
1    Alyce744  Bergstrom287 1965-01-07  162673000  SNOMED   
2   Marlen929  Greenholt190 2016-04-01  410620009  SNOMED   
3  Monique148      Haley279 2022-01-07  410620009  SNOMED   
4    Tracy345       Smith67 2000-07-17  410620009  SNOMED   

                        organization_id   
0  b03b624d-c939-3688-986d-9555b8009a3b  \
1  b03b624d-c939-3688-986d-9555b8009a3b   
2  3f12ebb4-e03c-3453-88d2-4fc9682383df   
3  c6b019eb-28ec-36f6-abf3-bcc4d1d58966   
4  39c15c0f-5c49-311e-99d2-1fb99d80e06e   

                                 organization_name   
0                        HOLYOKE HEALTH CENTER INC  \
1                        HOLYOKE HEALTH CENTER INC   
2  NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC   
3                DUTTON FAMILY CARE ASSOCIATES LLP   
4                       HARBOR HEALTH SERVICES INC   

            practitioner_name practitioner_id   
0  Dr. Bennett146 Hartmann983      9999981894  \
1  Dr. Bennett146 Hartmann983      9999981894   
2      Dr. Homero668 Rolón954      9999962894   
3   Dr. Maren639 Aufderhar910      9999950790   
4  Dr. Salvador46 Homenick806      9999977496   

                               procedure_name                 start_time   
0  General examination of patient (procedure)  2015-01-08T07:11:47+01:00  \
1  General examination of patient (procedure)  2015-01-08T07:11:47+01:00   
2                Well child visit (procedure)  2016-04-01T04:09:48+02:00   
3                Well child visit (procedure)  2022-01-07T02:19:02+01:00   
4                Well child visit (procedure)  2014-08-18T14:52:06+02:00   

                    end_time   
0  2015-01-08T07:26:47+01:00  \
1  2015-01-08T07:26:47+01:00   
2  2016-04-01T04:24:48+02:00   
3  2022-01-07T02:34:02+01:00   
4  2014-08-18T15:07:06+02:00   

                                        Vaccine_name vaccine_code   
0                               zoster vaccine, live          121  \
1  Influenza, seasonal, injectable, preservative ...          140   
2                     Hep B, adolescent or pediatric           08   
3                     Hep B, adolescent or pediatric           08   
4  Influenza, seasonal, injectable, preservative ...          140   

  vaccine_code_system  
0                 CVX  
1                 CVX  
2                 CVX  
3                 CVX  
4                 CVX  
df_renamed_ibis = df_timeline_IBIS.relabel({"birthDate": "birth_date"}).to_pandas()
df_renamed_ibis.head(1)
patient_id social_security_number prefix first_name last_name birth_date code system organization_id organization_name practitioner_name practitioner_id procedure_name start_time end_time Vaccine_name vaccine_code vaccine_code_system
0 4fc244f3-2c0e-4017-d64d-c2c4cd03655f 999-53-5813 Mrs. Alyce744 Bergstrom287 1965-01-07 162673000 SNOMED b03b624d-c939-3688-986d-9555b8009a3b HOLYOKE HEALTH CENTER INC Dr. Bennett146 Hartmann983 9999981894 General examination of patient (procedure) 2015-01-08T07:11:47+01:00 2015-01-08T07:26:47+01:00 zoster vaccine, live 121 CVX
df_renamed_pyspark = df_timeline_spark.withColumnRenamed("birthDate", "birth_date")
df_renamed_pyspark.head(1)
[Row(patient_id='4fc244f3-2c0e-4017-d64d-c2c4cd03655f', social_security_number='999-53-5813', prefix='Mrs.', first_name='Alyce744', last_name='Bergstrom287', birth_date=datetime.datetime(1965, 1, 7, 0, 0), code=162673000, system='SNOMED', organization_id='b03b624d-c939-3688-986d-9555b8009a3b', organization_name='HOLYOKE HEALTH CENTER INC', practitioner_name='Dr. Bennett146 Hartmann983', practitioner_id='9999981894', procedure_name='General examination of patient (procedure)', start_time=datetime.datetime(2015, 1, 8, 5, 11, 47), end_time='2015-01-08T07:26:47+01:00', Vaccine_name='zoster vaccine, live', vaccine_code='121', vaccine_code_system='CVX', year=2015)]
%%R
df_renamed <- df %>%
  rename(birth_date = birthDate)
kable(head(df_renamed))


|patient_id                           |social_security_number |prefix |first_name |last_name    |birth_date |      code|system |organization_id                      |organization_name                               |practitioner_name          |practitioner_id |procedure_name                             |start_time                |end_time                  |Vaccine_name                                       |vaccine_code |vaccine_code_system |
|:------------------------------------|:----------------------|:------|:----------|:------------|:----------|---------:|:------|:------------------------------------|:-----------------------------------------------|:--------------------------|:---------------|:------------------------------------------|:-------------------------|:-------------------------|:--------------------------------------------------|:------------|:-------------------|
|4fc244f3-2c0e-4017-d64d-c2c4cd03655f |999-53-5813            |Mrs.   |Alyce744   |Bergstrom287 |1965-01-07 | 162673000|SNOMED |b03b624d-c939-3688-986d-9555b8009a3b |HOLYOKE HEALTH CENTER INC                       |Dr. Bennett146 Hartmann983 |9999981894      |General examination of patient (procedure) |2015-01-08T07:11:47+01:00 |2015-01-08T07:26:47+01:00 |zoster vaccine, live                               |121          |CVX                 |
|4fc244f3-2c0e-4017-d64d-c2c4cd03655f |999-53-5813            |Mrs.   |Alyce744   |Bergstrom287 |1965-01-07 | 162673000|SNOMED |b03b624d-c939-3688-986d-9555b8009a3b |HOLYOKE HEALTH CENTER INC                       |Dr. Bennett146 Hartmann983 |9999981894      |General examination of patient (procedure) |2015-01-08T07:11:47+01:00 |2015-01-08T07:26:47+01:00 |Influenza, seasonal, injectable, preservative free |140          |CVX                 |
|fca8d2ca-7aef-2c27-3bba-3f94723012f5 |999-24-8599            |NA     |Marlen929  |Greenholt190 |2016-04-01 | 410620009|SNOMED |3f12ebb4-e03c-3453-88d2-4fc9682383df |NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC |Dr. Homero668 Rolón954     |9999962894      |Well child visit (procedure)               |2016-04-01T04:09:48+02:00 |2016-04-01T04:24:48+02:00 |Hep B, adolescent or pediatric                     |08           |CVX                 |
|19936964-a432-d501-2cd5-fa52db6b9f41 |999-33-7974            |NA     |Monique148 |Haley279     |2022-01-07 | 410620009|SNOMED |c6b019eb-28ec-36f6-abf3-bcc4d1d58966 |DUTTON FAMILY CARE ASSOCIATES LLP               |Dr. Maren639 Aufderhar910  |9999950790      |Well child visit (procedure)               |2022-01-07T02:19:02+01:00 |2022-01-07T02:34:02+01:00 |Hep B, adolescent or pediatric                     |08           |CVX                 |
|0d016955-26db-966c-3d52-26d441bfcb97 |999-47-8500            |Ms.    |Tracy345   |Smith67      |2000-07-17 | 410620009|SNOMED |39c15c0f-5c49-311e-99d2-1fb99d80e06e |HARBOR HEALTH SERVICES INC                      |Dr. Salvador46 Homenick806 |9999977496      |Well child visit (procedure)               |2014-08-18T14:52:06+02:00 |2014-08-18T15:07:06+02:00 |Influenza, seasonal, injectable, preservative free |140          |CVX                 |
|3a2311a7-eb30-2aae-59c0-55ae273d1581 |999-40-6084            |NA     |Janene3    |Howe413      |2016-11-01 | 410620009|SNOMED |2e8b42f8-4300-30a9-a576-d513d1021a0a |GREATER LOWELL FAMILY PRACTICE PC               |Dr. Homero668 Salcedo309   |9999959890      |Well child visit (procedure)               |2016-11-01T04:38:02+01:00 |2016-11-01T04:53:02+01:00 |Hep B, adolescent or pediatric                     |08           |CVX                 |

Join

It’s rare that data analysis involves only a single data frame. Typically you have many data frames, and you must join them together to answer the questions that you’re interested in.

An inner join keeps observations that appear in both tables.

Inner join

An outer join keeps observations that appear in at least one of the tables. There are three types of outer joins:

  • A left join keeps all observations in x.
  • A right join keeps all observations in y.
  • A full join keeps all observations in x and y.

Outer joins

The most commonly used join is the left join: you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn’t a match. The left join should be your default join: use it unless you have a strong reason to prefer one of the others.

df_joined_pandas = pd.merge(
    df_timeline_pandas,
    df_price_list_pandas, 
    left_on = ['vaccine_code','vaccine_code_system'], 
    right_on = ['code','system'], 
    how ='left'
)

df_joined_pandas.filter(items = ['patient_id','start_time','year','vaccine_name','item_claimed','USD']).head()
patient_id start_time year item_claimed USD
0 4fc244f3-2c0e-4017-d64d-c2c4cd03655f 2015-01-08 06:11:47+00:00 2015 zoster vaccine, live 136.0
1 4fc244f3-2c0e-4017-d64d-c2c4cd03655f 2015-01-08 06:11:47+00:00 2015 Influenza, seasonal, injectable, preservative ... 136.0
2 fca8d2ca-7aef-2c27-3bba-3f94723012f5 2016-04-01 02:09:48+00:00 2016 Hep B, adolescent or pediatric 136.0
3 19936964-a432-d501-2cd5-fa52db6b9f41 2022-01-07 01:19:02+00:00 2022 Hep B, adolescent or pediatric 136.0
4 0d016955-26db-966c-3d52-26d441bfcb97 2014-08-18 12:52:06+00:00 2014 Influenza, seasonal, injectable, preservative ... 136.0
df_joined_polars = (
    df_timeline_polars
    .join(
        df_price_list_polars,
        left_on=["vaccine_code", "vaccine_code_system"],
        right_on=["code", "system"],
        how="left"
    )
)
print(df_joined_polars.head())
shape: (5, 20)
┌────────────┬────────────┬────────┬────────────┬───┬────────────┬────────────┬────────────┬───────┐
│ patient_id ┆ social_sec ┆ prefix ┆ first_name ┆ … ┆ vaccine_co ┆ vaccine_co ┆ item_claim ┆ USD   │
│ ---        ┆ urity_numb ┆ ---    ┆ ---        ┆   ┆ de         ┆ de_system  ┆ ed         ┆ ---   │
│ str        ┆ er         ┆ str    ┆ str        ┆   ┆ ---        ┆ ---        ┆ ---        ┆ f64   │
│            ┆ ---        ┆        ┆            ┆   ┆ str        ┆ str        ┆ str        ┆       │
│            ┆ str        ┆        ┆            ┆   ┆            ┆            ┆            ┆       │
╞════════════╪════════════╪════════╪════════════╪═══╪════════════╪════════════╪════════════╪═══════╡
│ 4fc244f3-2 ┆ 999-53-581 ┆ Mrs.   ┆ Alyce744   ┆ … ┆ 121        ┆ CVX        ┆ zoster     ┆ 136.0 │
│ c0e-4017-d ┆ 3          ┆        ┆            ┆   ┆            ┆            ┆ vaccine,   ┆       │
│ 64d-c2c4cd ┆            ┆        ┆            ┆   ┆            ┆            ┆ live       ┆       │
│ 03…        ┆            ┆        ┆            ┆   ┆            ┆            ┆            ┆       │
│ 4fc244f3-2 ┆ 999-53-581 ┆ Mrs.   ┆ Alyce744   ┆ … ┆ 140        ┆ CVX        ┆ Influenza, ┆ 136.0 │
│ c0e-4017-d ┆ 3          ┆        ┆            ┆   ┆            ┆            ┆ seasonal,  ┆       │
│ 64d-c2c4cd ┆            ┆        ┆            ┆   ┆            ┆            ┆ injectable ┆       │
│ 03…        ┆            ┆        ┆            ┆   ┆            ┆            ┆ ,…         ┆       │
│ fca8d2ca-7 ┆ 999-24-859 ┆ null   ┆ Marlen929  ┆ … ┆ 08         ┆ CVX        ┆ Hep B,     ┆ 136.0 │
│ aef-2c27-3 ┆ 9          ┆        ┆            ┆   ┆            ┆            ┆ adolescent ┆       │
│ bba-3f9472 ┆            ┆        ┆            ┆   ┆            ┆            ┆ or         ┆       │
│ 30…        ┆            ┆        ┆            ┆   ┆            ┆            ┆ pediatric  ┆       │
│ 19936964-a ┆ 999-33-797 ┆ null   ┆ Monique148 ┆ … ┆ 08         ┆ CVX        ┆ Hep B,     ┆ 136.0 │
│ 432-d501-2 ┆ 4          ┆        ┆            ┆   ┆            ┆            ┆ adolescent ┆       │
│ cd5-fa52db ┆            ┆        ┆            ┆   ┆            ┆            ┆ or         ┆       │
│ 6b…        ┆            ┆        ┆            ┆   ┆            ┆            ┆ pediatric  ┆       │
│ 0d016955-2 ┆ 999-47-850 ┆ Ms.    ┆ Tracy345   ┆ … ┆ 140        ┆ CVX        ┆ Influenza, ┆ 136.0 │
│ 6db-966c-3 ┆ 0          ┆        ┆            ┆   ┆            ┆            ┆ seasonal,  ┆       │
│ d52-26d441 ┆            ┆        ┆            ┆   ┆            ┆            ┆ injectable ┆       │
│ bf…        ┆            ┆        ┆            ┆   ┆            ┆            ┆ ,…         ┆       │
└────────────┴────────────┴────────┴────────────┴───┴────────────┴────────────┴────────────┴───────┘
query = """
Select 
* 
from patient_timeline pt
left join price_list pl on pt.vaccine_code = pl.code and pt.vaccine_code_system = pl.system
limit 5;
"""
con_duckdb.sql(query).to_df()
patient_id social_security_number prefix first_name last_name birthDate code system organization_id organization_name ... procedure_name start_time end_time Vaccine_name vaccine_code vaccine_code_system code_2 system_2 item_claimed USD
0 4fc244f3-2c0e-4017-d64d-c2c4cd03655f 999-53-5813 Mrs. Alyce744 Bergstrom287 1965-01-07 162673000 SNOMED b03b624d-c939-3688-986d-9555b8009a3b HOLYOKE HEALTH CENTER INC ... General examination of patient (procedure) 2015-01-08T07:11:47+01:00 2015-01-08T07:26:47+01:00 zoster vaccine, live 121 CVX 121 CVX zoster vaccine, live 136.0
1 4fc244f3-2c0e-4017-d64d-c2c4cd03655f 999-53-5813 Mrs. Alyce744 Bergstrom287 1965-01-07 162673000 SNOMED b03b624d-c939-3688-986d-9555b8009a3b HOLYOKE HEALTH CENTER INC ... General examination of patient (procedure) 2015-01-08T07:11:47+01:00 2015-01-08T07:26:47+01:00 Influenza, seasonal, injectable, preservative ... 140 CVX 140 CVX Influenza, seasonal, injectable, preservative ... 136.0
2 fca8d2ca-7aef-2c27-3bba-3f94723012f5 999-24-8599 None Marlen929 Greenholt190 2016-04-01 410620009 SNOMED 3f12ebb4-e03c-3453-88d2-4fc9682383df NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC ... Well child visit (procedure) 2016-04-01T04:09:48+02:00 2016-04-01T04:24:48+02:00 Hep B, adolescent or pediatric 08 CVX 08 CVX Hep B, adolescent or pediatric 136.0
3 19936964-a432-d501-2cd5-fa52db6b9f41 999-33-7974 None Monique148 Haley279 2022-01-07 410620009 SNOMED c6b019eb-28ec-36f6-abf3-bcc4d1d58966 DUTTON FAMILY CARE ASSOCIATES LLP ... Well child visit (procedure) 2022-01-07T02:19:02+01:00 2022-01-07T02:34:02+01:00 Hep B, adolescent or pediatric 08 CVX 08 CVX Hep B, adolescent or pediatric 136.0
4 0d016955-26db-966c-3d52-26d441bfcb97 999-47-8500 Ms. Tracy345 Smith67 2000-07-17 410620009 SNOMED 39c15c0f-5c49-311e-99d2-1fb99d80e06e HARBOR HEALTH SERVICES INC ... Well child visit (procedure) 2014-08-18T14:52:06+02:00 2014-08-18T15:07:06+02:00 Influenza, seasonal, injectable, preservative ... 140 CVX 140 CVX Influenza, seasonal, injectable, preservative ... 136.0

5 rows × 22 columns

df_joined_ibis = df_timeline_IBIS.left_join(
    df_price_list_ibis, [
        df_timeline_IBIS.vaccine_code == df_price_list_ibis.code,
        df_timeline_IBIS.vaccine_code_system == df_price_list_ibis.system
    ]
).to_pandas()
df_joined_ibis.head()
patient_id social_security_number prefix first_name last_name birthDate code system organization_id organization_name ... procedure_name start_time end_time Vaccine_name vaccine_code vaccine_code_system code_right system_right item_claimed USD
0 4fc244f3-2c0e-4017-d64d-c2c4cd03655f 999-53-5813 Mrs. Alyce744 Bergstrom287 1965-01-07 162673000 SNOMED b03b624d-c939-3688-986d-9555b8009a3b HOLYOKE HEALTH CENTER INC ... General examination of patient (procedure) 2015-01-08T07:11:47+01:00 2015-01-08T07:26:47+01:00 zoster vaccine, live 121 CVX 121 CVX zoster vaccine, live 136.0
1 4fc244f3-2c0e-4017-d64d-c2c4cd03655f 999-53-5813 Mrs. Alyce744 Bergstrom287 1965-01-07 162673000 SNOMED b03b624d-c939-3688-986d-9555b8009a3b HOLYOKE HEALTH CENTER INC ... General examination of patient (procedure) 2015-01-08T07:11:47+01:00 2015-01-08T07:26:47+01:00 Influenza, seasonal, injectable, preservative ... 140 CVX 140 CVX Influenza, seasonal, injectable, preservative ... 136.0
2 fca8d2ca-7aef-2c27-3bba-3f94723012f5 999-24-8599 None Marlen929 Greenholt190 2016-04-01 410620009 SNOMED 3f12ebb4-e03c-3453-88d2-4fc9682383df NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC ... Well child visit (procedure) 2016-04-01T04:09:48+02:00 2016-04-01T04:24:48+02:00 Hep B, adolescent or pediatric 08 CVX 08 CVX Hep B, adolescent or pediatric 136.0
3 19936964-a432-d501-2cd5-fa52db6b9f41 999-33-7974 None Monique148 Haley279 2022-01-07 410620009 SNOMED c6b019eb-28ec-36f6-abf3-bcc4d1d58966 DUTTON FAMILY CARE ASSOCIATES LLP ... Well child visit (procedure) 2022-01-07T02:19:02+01:00 2022-01-07T02:34:02+01:00 Hep B, adolescent or pediatric 08 CVX 08 CVX Hep B, adolescent or pediatric 136.0
4 0d016955-26db-966c-3d52-26d441bfcb97 999-47-8500 Ms. Tracy345 Smith67 2000-07-17 410620009 SNOMED 39c15c0f-5c49-311e-99d2-1fb99d80e06e HARBOR HEALTH SERVICES INC ... Well child visit (procedure) 2014-08-18T14:52:06+02:00 2014-08-18T15:07:06+02:00 Influenza, seasonal, injectable, preservative ... 140 CVX 140 CVX Influenza, seasonal, injectable, preservative ... 136.0

5 rows × 22 columns

df_joined_pyspark = df_timeline_spark.join(df_price_list_spark,(df_timeline_spark['vaccine_code'] == df_price_list_spark['code']) & (df_timeline_spark['vaccine_code_system'] == df_price_list_spark['system']),'inner')

df_joined_pyspark.limit(5).show()
+--------------------+----------------------+------+----------+------------+-------------------+---------+------+--------------------+--------------------+--------------------+---------------+--------------------+-------------------+--------------------+--------------------+------------+-------------------+----+----+------+--------------------+-----+
|          patient_id|social_security_number|prefix|first_name|   last_name|          birthDate|     code|system|     organization_id|   organization_name|   practitioner_name|practitioner_id|      procedure_name|         start_time|            end_time|        Vaccine_name|vaccine_code|vaccine_code_system|year|code|system|        item_claimed|  USD|
+--------------------+----------------------+------+----------+------------+-------------------+---------+------+--------------------+--------------------+--------------------+---------------+--------------------+-------------------+--------------------+--------------------+------------+-------------------+----+----+------+--------------------+-----+
|4fc244f3-2c0e-401...|           999-53-5813|  Mrs.|  Alyce744|Bergstrom287|1965-01-07 00:00:00|162673000|SNOMED|b03b624d-c939-368...|HOLYOKE HEALTH CE...|Dr. Bennett146 Ha...|     9999981894|General examinati...|2015-01-08 05:11:47|2015-01-08T07:26:...|zoster vaccine, live|         121|                CVX|2015| 121|   CVX|zoster vaccine, live|136.0|
|4fc244f3-2c0e-401...|           999-53-5813|  Mrs.|  Alyce744|Bergstrom287|1965-01-07 00:00:00|162673000|SNOMED|b03b624d-c939-368...|HOLYOKE HEALTH CE...|Dr. Bennett146 Ha...|     9999981894|General examinati...|2015-01-08 05:11:47|2015-01-08T07:26:...|Influenza, season...|         140|                CVX|2015| 140|   CVX|Influenza, season...|136.0|
|fca8d2ca-7aef-2c2...|           999-24-8599|  NULL| Marlen929|Greenholt190|2016-04-01 00:00:00|410620009|SNOMED|3f12ebb4-e03c-345...|NEW BEDFORD INTER...|Dr. Homero668 Rol...|     9999962894|Well child visit ...|2016-04-01 00:09:48|2016-04-01T04:24:...|Hep B, adolescent...|          08|                CVX|2016|  08|   CVX|Hep B, adolescent...|136.0|
|19936964-a432-d50...|           999-33-7974|  NULL|Monique148|    Haley279|2022-01-07 00:00:00|410620009|SNOMED|c6b019eb-28ec-36f...|DUTTON FAMILY CAR...|Dr. Maren639 Aufd...|     9999950790|Well child visit ...|2022-01-07 00:19:02|2022-01-07T02:34:...|Hep B, adolescent...|          08|                CVX|2022|  08|   CVX|Hep B, adolescent...|136.0|
|0d016955-26db-966...|           999-47-8500|   Ms.|  Tracy345|     Smith67|2000-07-17 00:00:00|410620009|SNOMED|39c15c0f-5c49-311...|HARBOR HEALTH SER...|Dr. Salvador46 Ho...|     9999977496|Well child visit ...|2014-08-18 10:52:06|2014-08-18T15:07:...|Influenza, season...|         140|                CVX|2014| 140|   CVX|Influenza, season...|136.0|
+--------------------+----------------------+------+----------+------------+-------------------+---------+------+--------------------+--------------------+--------------------+---------------+--------------------+-------------------+--------------------+--------------------+------------+-------------------+----+----+------+--------------------+-----+
%%R
df_joined <- df %>%
  left_join(
    df_price_list,
    by = c("vaccine_code" = "code", "vaccine_code_system" = "system")
  )
kable(head(df_joined))


|patient_id                           |social_security_number |prefix |first_name |last_name    |birthDate  |      code|system |organization_id                      |organization_name                               |practitioner_name          |practitioner_id |procedure_name                             |start_time                |end_time                  |Vaccine_name                                       |vaccine_code |vaccine_code_system |item_claimed                                       | USD|
|:------------------------------------|:----------------------|:------|:----------|:------------|:----------|---------:|:------|:------------------------------------|:-----------------------------------------------|:--------------------------|:---------------|:------------------------------------------|:-------------------------|:-------------------------|:--------------------------------------------------|:------------|:-------------------|:--------------------------------------------------|---:|
|4fc244f3-2c0e-4017-d64d-c2c4cd03655f |999-53-5813            |Mrs.   |Alyce744   |Bergstrom287 |1965-01-07 | 162673000|SNOMED |b03b624d-c939-3688-986d-9555b8009a3b |HOLYOKE HEALTH CENTER INC                       |Dr. Bennett146 Hartmann983 |9999981894      |General examination of patient (procedure) |2015-01-08T07:11:47+01:00 |2015-01-08T07:26:47+01:00 |zoster vaccine, live                               |121          |CVX                 |zoster vaccine, live                               | 136|
|4fc244f3-2c0e-4017-d64d-c2c4cd03655f |999-53-5813            |Mrs.   |Alyce744   |Bergstrom287 |1965-01-07 | 162673000|SNOMED |b03b624d-c939-3688-986d-9555b8009a3b |HOLYOKE HEALTH CENTER INC                       |Dr. Bennett146 Hartmann983 |9999981894      |General examination of patient (procedure) |2015-01-08T07:11:47+01:00 |2015-01-08T07:26:47+01:00 |Influenza, seasonal, injectable, preservative free |140          |CVX                 |Influenza, seasonal, injectable, preservative free | 136|
|fca8d2ca-7aef-2c27-3bba-3f94723012f5 |999-24-8599            |NA     |Marlen929  |Greenholt190 |2016-04-01 | 410620009|SNOMED |3f12ebb4-e03c-3453-88d2-4fc9682383df |NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC |Dr. Homero668 Rolón954     |9999962894      |Well child visit (procedure)               |2016-04-01T04:09:48+02:00 |2016-04-01T04:24:48+02:00 |Hep B, adolescent or pediatric                     |08           |CVX                 |Hep B, adolescent or pediatric                     | 136|
|19936964-a432-d501-2cd5-fa52db6b9f41 |999-33-7974            |NA     |Monique148 |Haley279     |2022-01-07 | 410620009|SNOMED |c6b019eb-28ec-36f6-abf3-bcc4d1d58966 |DUTTON FAMILY CARE ASSOCIATES LLP               |Dr. Maren639 Aufderhar910  |9999950790      |Well child visit (procedure)               |2022-01-07T02:19:02+01:00 |2022-01-07T02:34:02+01:00 |Hep B, adolescent or pediatric                     |08           |CVX                 |Hep B, adolescent or pediatric                     | 136|
|0d016955-26db-966c-3d52-26d441bfcb97 |999-47-8500            |Ms.    |Tracy345   |Smith67      |2000-07-17 | 410620009|SNOMED |39c15c0f-5c49-311e-99d2-1fb99d80e06e |HARBOR HEALTH SERVICES INC                      |Dr. Salvador46 Homenick806 |9999977496      |Well child visit (procedure)               |2014-08-18T14:52:06+02:00 |2014-08-18T15:07:06+02:00 |Influenza, seasonal, injectable, preservative free |140          |CVX                 |Influenza, seasonal, injectable, preservative free | 136|
|3a2311a7-eb30-2aae-59c0-55ae273d1581 |999-40-6084            |NA     |Janene3    |Howe413      |2016-11-01 | 410620009|SNOMED |2e8b42f8-4300-30a9-a576-d513d1021a0a |GREATER LOWELL FAMILY PRACTICE PC               |Dr. Homero668 Salcedo309   |9999959890      |Well child visit (procedure)               |2016-11-01T04:38:02+01:00 |2016-11-01T04:53:02+01:00 |Hep B, adolescent or pediatric                     |08           |CVX                 |Hep B, adolescent or pediatric                     | 136|

Finishing up

When using duckdb it is very important to always close the database connection. If one is connected to the databas a wall is created to block anyone else from connecting at the same time. This prevents conflicts when more are manipulating the data at the same time.

#close connection to the duckdb database
con_duckdb.close()