import pandas as pd
Datawrangling
In data analysis, we often repeat similar actions to get valuable information from the data. In this Chapter we describe the most common techniques applied in data analysis, and how they are performed in different programming languages and libraries.
Split-apply-combine
Many data analysis problems involve the application of a split-apply-combine strategy, where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together. group_by is the most common function name for split-apply-combine technique.
These actions can be performed using different libraries in various programming languages. Based on the sample data created with the Synthea patient generator and Create views using DuckDB, we will demonstrate standard data analysis actions. We will illustrate how to implement these actions across multiple libraries and languages.
Datawrangling languages and libraries
In this chapter we use the following languages and libraries:
Pandas 2.0
Pandas is a popular python data wrangling library in the data science comunity. Pandas 2.0 mainly introduces improvements in performance, making it faster and more memory-efficient.
Polars
Polars is very similar to pandas in terms of functionality. However, polars is based on Rust, which makes it highly performant.
IBIS
Ibis is a python library which enables you to write your code in a pandas-like way, and the translates it to a lot of different backends. This enables you to change backends with only minor adjustments to your data wrangling code.
PySpark PySpark is a python library for Apache Spark, an open source computing system. It enables you to work with large datasets, or run complex calculations.
dplyr
dplyr is a widely used library in r.
SQL - DuckDB
In this handbook, we use DuckDB to demonstrate how to perform data wrangling with SQL. DuckDB SQL is almost the same as plain SQL but has a couple of extra functions to handle nested data, along with a few functions that aid in data analysis. DuckDB is a fast and efficient open-source database for analyzing large datasets using SQL.
Overview data transformations in different libraries
In the table below, you can find an overview of techniques that are frequently employed in data analysis, along with the corresponding functions for these techniques in the discussed libraries/languages.
concept | pandas | Polars | IBIS | PySpark | dplyr | SQL |
---|---|---|---|---|---|---|
Split | groupby() | group_by() | group_by() | groupBy() | group_by() | GROUP BY |
Apply | many functions | many functions | many functions | many functions | many functions | many functions |
combine | join (), merge() | join() | left_join, inner_join() etc. | join() | left_join, inner_join() etc. | LEFT JOIN, JOIN etc. |
Filtering (row based) | loc[], query() | filter() | filter() | filter() | filter() | WHERE |
Filtering (column based) | loc[], iloc[], | select() | - | select() | select() | SELECT |
Mutate | assign() | with_columns() | mutuate() | withColumn() | mutate() | ADD |
Ordering | sort_values() | sort() | order_by() | orderBy() | arrange() | ORDER BY |
Example using the Synthea dataset
To demonstrate the most common data analysis techniques in the different languages and libraries, we use the data generated with the Synthea Patient Generator.
load required libraries
Before getting started, we need to load the required libraries.
import polars as pl
import duckdb
import ibis
import pandas as pd
import duckdb
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType, DateType,TimestampType,IntegerType
from pyspark.sql import functions as F
%load_ext rpy2.ipython
%%R
library(DBI)
library(dplyr)
library(duckdb)
library(glue)
library(lubridate) library(knitr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
Attaching package: ‘lubridate’
The following objects are masked from ‘package:base’:
date, intersect, setdiff, union
Retrieve data
In this section we will retrieve the data from the parquet files created in Creating views with DuckDB.
First, we set the paths where the data is located
from pathlib import Path
= Path('.')
ROOT if 'src' in str(ROOT.resolve()): # locally
= '..' / ROOT # go one folder up
ROOT = ROOT / 'data' / 'bronze'
BRONZE = ROOT / 'data' / 'silver' SILVER
%%R
library(here)
<- here()
ROOT
# Check if the 'src' directory is in the path and adjust the ROOT if needed
if (grepl("src", ROOT, fixed = TRUE)) {
<- dirname(ROOT)
ROOT
}
# Define BRONZE and SILVER directories
<- file.path(ROOT, "data", "bronze")
BRONZE <- file.path(ROOT, "data", "silver") SILVER
here() starts at /home/runner/work/hdc-data-analytics-handbook/hdc-data-analytics-handbook
Thereafter, we import the data in a dataframe
# load data from parquet
= pd.read_parquet(f"{SILVER}/parquet_export/patient_timeline.parquet")
df_timeline_pandas = pd.read_parquet(f"{SILVER}/parquet_export/price_list.parquet") df_price_list_pandas
# read parquet files in polars dateframe
= pl.read_parquet(f"{SILVER}/parquet_export/patient_timeline.parquet")
df_timeline_polars = pl.read_parquet(f"{SILVER}/parquet_export/price_list.parquet") df_price_list_polars
In Creating views using DuckDB, we have already demonstrated how to create a DuckDB database and import Parquet data into DuckDB. Therefore, here we will connect to the previously created database, which only contains the “patient_timeline” and “price_list” tables.
# connect to the database
= duckdb.connect()
con_duckdb
con_duckdb.sql(f"""
CREATE TABLE patient_timeline AS
SELECT *
FROM read_parquet('{SILVER}/parquet_export/patient_timeline.parquet');
"""
)
con_duckdb.sql(f"""
CREATE TABLE price_list AS
SELECT *
FROM read_parquet('{SILVER}/parquet_export/price_list.parquet');
"""
)
# read parquet files with ibis
= ibis.read_parquet(f'{SILVER}/parquet_export/patient_timeline.parquet')
df_timeline_IBIS = ibis.read_parquet(f'{SILVER}/parquet_export/price_list.parquet') df_price_list_ibis
# create spark session
= SparkSession.builder.appName("patient_timeline_analysis").getOrCreate()
spark
# convert timeline to spark dataframe
= spark.read.parquet(f'{SILVER}/parquet_export/patient_timeline.parquet')
df_timeline_spark
# convert pricelist to spark dataframe
= spark.read.parquet(f'{SILVER}/parquet_export/price_list.parquet') df_price_list_spark
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/12/19 09:25:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
%%R
<- dbConnect(
con
duckdb::duckdb(),= glue("{SILVER}/pregnancy.duckdb"),
dbdir = TRUE
read_only )
%%R
<- dbGetQuery(con, "SELECT * FROM patient_timeline")
df <- dbGetQuery(con, "SELECT * FROM price_list") df_price_list
In addition: Warning message:
In rapi_execute(res@stmt_lst$ref, res@arrow, res@connection@driver@bigint == :
Coercing nanoseconds to a lower resolution may result in a loss of data.
Split-apply-combine
From the patient timeline data, we want to know how many patients each organization has recorded. For this task we can use the split-apply-combine technique.
- we split the data in groups, where each group is a organization
- we apply a function to count the number of patients in the group
- we combine the data back together again
In the code below, we can see that the groupby function splits the data into different groups (and later combines them again). the agg function applies an aggregation, in this case the unique count of patients in the group. There are a few more actions in this code, renaming the index to organization, and the patient_id column to # patients.
= df_timeline_pandas.groupby("organization_name").agg(
df_grouped_pandas "patient_id":pd.Series.nunique}
{
).rename(= {'patient_id':"# patients"}
columns 'organization')
).rename_axis(
# show first 5 records
df_grouped_pandas.head()
# patients | |
---|---|
organization | |
4499 ACUSHNET AVENUE OPERATING COMPANY, LLC | 5 |
A&A HEALTHCARE LLC | 9 |
ACCESS PRIMARY CARE INC | 1 |
ACHIEVE WELLNESS, LLC | 3 |
ACTIVATED BY WELLNESS LLC | 5 |
= (
df_grouped_polars
df_timeline_polars"organization_name")
.group_by(
.agg("patient_id")
pl.n_unique(
)
)print(df_grouped_polars.head())
shape: (5, 2)
┌───────────────────────────────────┬────────────┐
│ organization_name ┆ patient_id │
│ --- ┆ --- │
│ str ┆ u32 │
╞═══════════════════════════════════╪════════════╡
│ SUBURBAN INTERNAL MEDICINE, INC. ┆ 2 │
│ LAWRENCE GENERAL HOSPITAL ┆ 9 │
│ CONTINUUM HEALTHCARE SOLUTIONS L… ┆ 4 │
│ WHITLEY WELLNESS LLC ┆ 29 │
│ HOPKINTON MEDICAL CARE, P.C. ┆ 5 │
└───────────────────────────────────┴────────────┘
Polars supports both lazy and non-lazy (eager) evaluation. Lazy evaluation means that execution is deferred to the last minute, which can have significant performance advantages.
= (
df_grouped_polars
df_timeline_polars.lazy()"organization_name")
.group_by(
.agg("patient_id")
pl.n_unique(
)
.collect()
)print(df_grouped_polars)
shape: (352, 2)
┌───────────────────────────────────┬────────────┐
│ organization_name ┆ patient_id │
│ --- ┆ --- │
│ str ┆ u32 │
╞═══════════════════════════════════╪════════════╡
│ DEVENS TREATMENT & RECOVERY CENT… ┆ 4 │
│ RENAISSANCE FAMILY MEDICINE OF W… ┆ 1 │
│ COMMUNICATIVE HEALTH CARE ASSOCI… ┆ 2 │
│ THE SHRINERS HOSPITAL FOR CHILDR… ┆ 19 │
│ … ┆ … │
│ OUTER CAPE HEALTH SERVICES ┆ 1 │
│ AUBURN PRIMARY CARE AND AESTHETI… ┆ 9 │
│ COMMUNITY HEALTH PROGRAMS INC ┆ 3 │
│ ZION HEALTH & WELLNESS LLC ┆ 2 │
└───────────────────────────────────┴────────────┘
= f"""
query create or replace table organization_grouped as (
select
organization_name as organization,
count(distinct patient_id) as patient_count
from patient_timeline
group by organization_name
) ;
select * from organization_grouped
limit 5;
"""
con_duckdb.sql(query).to_df()
organization | patient_count | |
---|---|---|
0 | MEASURED WELLNESS LLC | 4 |
1 | FRANCISCAN HOSPITAL FOR CHILDREN INC | 8 |
2 | FENWAY COMMUNITY HEALTH CENTER, INC | 4 |
3 | ARBOUR-FULLER HOSPITAL | 13 |
4 | NASH HEALTH CARE PRACTICES LLC | 4 |
= df_timeline_IBIS.group_by("organization_name").aggregate(df_timeline_IBIS.patient_id.nunique()).to_pandas()
df_grouped_ibis df_grouped_ibis.head()
organization_name | CountDistinct(patient_id) | |
---|---|---|
0 | NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC | 8 |
1 | DUTTON FAMILY CARE ASSOCIATES LLP | 8 |
2 | THE SHRINERS HOSPITAL FOR CHILDREN | 19 |
3 | TREE OF LIFE HEALTHCARE ASSOCIATES | 4 |
4 | GREATER BOSTON INTERNAL MEDICINE PC | 2 |
= df_timeline_IBIS.group_by("organization_name").aggregate(df_timeline_IBIS.patient_id.nunique())
df_grouped_ibis ibis.show_sql(df_grouped_ibis)
SELECT
t0.organization_name,
COUNT(DISTINCT t0.patient_id) AS "CountDistinct(patient_id)"
FROM _ibis_read_parquet_2b2nxedlfbbiljrr2xyonlxd34 AS t0
GROUP BY
1
#option 1, using buid in functions
= df_timeline_spark.groupBy('organization_name').agg(
df_grouped_pyspark 'patient_id'
F.countDistinct('# patients')).orderBy(
).alias('# patients',ascending = False
)
= df_grouped_pyspark.withColumnRenamed("organization_name", "organization")
df_grouped_pyspark
# option 2: using SQL
"timeline")
df_timeline_spark.createOrReplaceTempView(= spark.sql('''
df_grouped_pyspark Select
organization_name,
count(distinct patient_id) as nr_of_patients
from timeline group by organization_name
order by count(distinct patient_id) DESC
''')
5).show() df_grouped_pyspark.limit(
+--------------------+--------------+
| organization_name|nr_of_patients|
+--------------------+--------------+
|BOSTON MEDICAL CE...| 29|
|WHITLEY WELLNESS LLC| 29|
| TEWKSBURY HOSPITAL| 26|
|CAPE COD HOSPITAL...| 25|
|HARBOR HEALTH SER...| 23|
+--------------------+--------------+
%%R
<- df %>%
df_grouped %>%
group_by(organization_name)
summarise(= n_distinct(patient_id),
number_of_patients = "drop"
.groups
) kable(head(df_grouped))
|organization_name | number_of_patients|
|:-------------------------------------------|------------------:|
|4499 ACUSHNET AVENUE OPERATING COMPANY, LLC | 5|
|A&A HEALTHCARE LLC | 9|
|ACCESS PRIMARY CARE INC | 1|
|ACHIEVE WELLNESS, LLC | 3|
|ACTIVATED BY WELLNESS LLC | 5|
|ADVANTAGE HEALTH & WELLNESS INC | 2|
Filter
In data analysis, it is common to be interested in only a subset of the data. For this reason, we apply the filtering technique. From our patients we only want to keep the ‘well child visits’.
# option 1
= 'Well child visit (procedure)'
value = df_timeline_pandas[df_timeline_pandas['procedure_name'] == value]
df_filtered_pandas
# option 2
= df_timeline_pandas.loc[df_timeline_pandas['procedure_name'] == value]
df_filtered_pandas
# option 3
= df_filtered_pandas.query(f"procedure_name == '{value}' ")
df_filtered_pandas
#show results
'patient_id','start_time','procedure_name']].head() df_filtered_pandas[[
patient_id | start_time | procedure_name | |
---|---|---|---|
2 | fca8d2ca-7aef-2c27-3bba-3f94723012f5 | 2016-04-01T04:09:48+02:00 | Well child visit (procedure) |
3 | 19936964-a432-d501-2cd5-fa52db6b9f41 | 2022-01-07T02:19:02+01:00 | Well child visit (procedure) |
4 | 0d016955-26db-966c-3d52-26d441bfcb97 | 2014-08-18T14:52:06+02:00 | Well child visit (procedure) |
5 | 3a2311a7-eb30-2aae-59c0-55ae273d1581 | 2016-11-01T04:38:02+01:00 | Well child visit (procedure) |
6 | f9133ee8-f952-0e7d-f642-99d84fc9c6ad | 2014-06-03T20:31:52+02:00 | Well child visit (procedure) |
= (
df_filtered_polars
df_timeline_polarsfilter(pl.col("procedure_name") == "Well child visit (procedure)")
.
)print(df_filtered_polars.head())
shape: (5, 18)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ patient_id ┆ social_sec ┆ prefix ┆ first_nam ┆ … ┆ end_time ┆ Vaccine_n ┆ vaccine_c ┆ vaccine_c │
│ --- ┆ urity_numb ┆ --- ┆ e ┆ ┆ --- ┆ ame ┆ ode ┆ ode_syste │
│ str ┆ er ┆ str ┆ --- ┆ ┆ str ┆ --- ┆ --- ┆ m │
│ ┆ --- ┆ ┆ str ┆ ┆ ┆ str ┆ str ┆ --- │
│ ┆ str ┆ ┆ ┆ ┆ ┆ ┆ ┆ str │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ fca8d2ca-7 ┆ 999-24-859 ┆ null ┆ Marlen929 ┆ … ┆ 2016-04-0 ┆ Hep B, ┆ 08 ┆ CVX │
│ aef-2c27-3 ┆ 9 ┆ ┆ ┆ ┆ 1T04:24:4 ┆ adolescen ┆ ┆ │
│ bba-3f9472 ┆ ┆ ┆ ┆ ┆ 8+02:00 ┆ t or ┆ ┆ │
│ 30… ┆ ┆ ┆ ┆ ┆ ┆ pediatric ┆ ┆ │
│ 19936964-a ┆ 999-33-797 ┆ null ┆ Monique14 ┆ … ┆ 2022-01-0 ┆ Hep B, ┆ 08 ┆ CVX │
│ 432-d501-2 ┆ 4 ┆ ┆ 8 ┆ ┆ 7T02:34:0 ┆ adolescen ┆ ┆ │
│ cd5-fa52db ┆ ┆ ┆ ┆ ┆ 2+01:00 ┆ t or ┆ ┆ │
│ 6b… ┆ ┆ ┆ ┆ ┆ ┆ pediatric ┆ ┆ │
│ 0d016955-2 ┆ 999-47-850 ┆ Ms. ┆ Tracy345 ┆ … ┆ 2014-08-1 ┆ Influenza ┆ 140 ┆ CVX │
│ 6db-966c-3 ┆ 0 ┆ ┆ ┆ ┆ 8T15:07:0 ┆ , ┆ ┆ │
│ d52-26d441 ┆ ┆ ┆ ┆ ┆ 6+02:00 ┆ seasonal, ┆ ┆ │
│ bf… ┆ ┆ ┆ ┆ ┆ ┆ injectabl ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ e,… ┆ ┆ │
│ 3a2311a7-e ┆ 999-40-608 ┆ null ┆ Janene3 ┆ … ┆ 2016-11-0 ┆ Hep B, ┆ 08 ┆ CVX │
│ b30-2aae-5 ┆ 4 ┆ ┆ ┆ ┆ 1T04:53:0 ┆ adolescen ┆ ┆ │
│ 9c0-55ae27 ┆ ┆ ┆ ┆ ┆ 2+01:00 ┆ t or ┆ ┆ │
│ 3d… ┆ ┆ ┆ ┆ ┆ ┆ pediatric ┆ ┆ │
│ f9133ee8-f ┆ 999-93-354 ┆ Ms. ┆ Ozie87 ┆ … ┆ 2014-06-0 ┆ Influenza ┆ 140 ┆ CVX │
│ 952-0e7d-f ┆ 6 ┆ ┆ ┆ ┆ 3T20:46:5 ┆ , ┆ ┆ │
│ 642-99d84f ┆ ┆ ┆ ┆ ┆ 2+02:00 ┆ seasonal, ┆ ┆ │
│ c9… ┆ ┆ ┆ ┆ ┆ ┆ injectabl ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ e,… ┆ ┆ │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘
= """
query create or replace table filtered_table as (
Select
*
from patient_timeline
where procedure_name = 'Well child visit (procedure)'
);
select * from filtered_table limit 5;
"""
con_duckdb.sql(query).to_df()
patient_id | social_security_number | prefix | first_name | last_name | birthDate | code | system | organization_id | organization_name | practitioner_name | practitioner_id | procedure_name | start_time | end_time | Vaccine_name | vaccine_code | vaccine_code_system | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | fca8d2ca-7aef-2c27-3bba-3f94723012f5 | 999-24-8599 | None | Marlen929 | Greenholt190 | 2016-04-01 | 410620009 | SNOMED | 3f12ebb4-e03c-3453-88d2-4fc9682383df | NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC | Dr. Homero668 Rolón954 | 9999962894 | Well child visit (procedure) | 2016-04-01T04:09:48+02:00 | 2016-04-01T04:24:48+02:00 | Hep B, adolescent or pediatric | 08 | CVX |
1 | 19936964-a432-d501-2cd5-fa52db6b9f41 | 999-33-7974 | None | Monique148 | Haley279 | 2022-01-07 | 410620009 | SNOMED | c6b019eb-28ec-36f6-abf3-bcc4d1d58966 | DUTTON FAMILY CARE ASSOCIATES LLP | Dr. Maren639 Aufderhar910 | 9999950790 | Well child visit (procedure) | 2022-01-07T02:19:02+01:00 | 2022-01-07T02:34:02+01:00 | Hep B, adolescent or pediatric | 08 | CVX |
2 | 0d016955-26db-966c-3d52-26d441bfcb97 | 999-47-8500 | Ms. | Tracy345 | Smith67 | 2000-07-17 | 410620009 | SNOMED | 39c15c0f-5c49-311e-99d2-1fb99d80e06e | HARBOR HEALTH SERVICES INC | Dr. Salvador46 Homenick806 | 9999977496 | Well child visit (procedure) | 2014-08-18T14:52:06+02:00 | 2014-08-18T15:07:06+02:00 | Influenza, seasonal, injectable, preservative ... | 140 | CVX |
3 | 3a2311a7-eb30-2aae-59c0-55ae273d1581 | 999-40-6084 | None | Janene3 | Howe413 | 2016-11-01 | 410620009 | SNOMED | 2e8b42f8-4300-30a9-a576-d513d1021a0a | GREATER LOWELL FAMILY PRACTICE PC | Dr. Homero668 Salcedo309 | 9999959890 | Well child visit (procedure) | 2016-11-01T04:38:02+01:00 | 2016-11-01T04:53:02+01:00 | Hep B, adolescent or pediatric | 08 | CVX |
4 | f9133ee8-f952-0e7d-f642-99d84fc9c6ad | 999-93-3546 | Ms. | Ozie87 | Nitzsche158 | 1998-04-21 | 410620009 | SNOMED | 0e9d0119-1157-35a0-8bb5-7e0aa3041928 | CARING HEALTH CENTER, INC | Dr. Kari181 Trantow673 | 9999979997 | Well child visit (procedure) | 2014-06-03T20:31:52+02:00 | 2014-06-03T20:46:52+02:00 | Influenza, seasonal, injectable, preservative ... | 140 | CVX |
= df_timeline_IBIS.filter(df_timeline_IBIS.procedure_name == "Well child visit (procedure)")
df_filtered_ibis df_filtered_ibis.to_pandas().head()
patient_id | social_security_number | prefix | first_name | last_name | birthDate | code | system | organization_id | organization_name | practitioner_name | practitioner_id | procedure_name | start_time | end_time | Vaccine_name | vaccine_code | vaccine_code_system | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | fca8d2ca-7aef-2c27-3bba-3f94723012f5 | 999-24-8599 | None | Marlen929 | Greenholt190 | 2016-04-01 | 410620009 | SNOMED | 3f12ebb4-e03c-3453-88d2-4fc9682383df | NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC | Dr. Homero668 Rolón954 | 9999962894 | Well child visit (procedure) | 2016-04-01T04:09:48+02:00 | 2016-04-01T04:24:48+02:00 | Hep B, adolescent or pediatric | 08 | CVX |
1 | 19936964-a432-d501-2cd5-fa52db6b9f41 | 999-33-7974 | None | Monique148 | Haley279 | 2022-01-07 | 410620009 | SNOMED | c6b019eb-28ec-36f6-abf3-bcc4d1d58966 | DUTTON FAMILY CARE ASSOCIATES LLP | Dr. Maren639 Aufderhar910 | 9999950790 | Well child visit (procedure) | 2022-01-07T02:19:02+01:00 | 2022-01-07T02:34:02+01:00 | Hep B, adolescent or pediatric | 08 | CVX |
2 | 0d016955-26db-966c-3d52-26d441bfcb97 | 999-47-8500 | Ms. | Tracy345 | Smith67 | 2000-07-17 | 410620009 | SNOMED | 39c15c0f-5c49-311e-99d2-1fb99d80e06e | HARBOR HEALTH SERVICES INC | Dr. Salvador46 Homenick806 | 9999977496 | Well child visit (procedure) | 2014-08-18T14:52:06+02:00 | 2014-08-18T15:07:06+02:00 | Influenza, seasonal, injectable, preservative ... | 140 | CVX |
3 | 3a2311a7-eb30-2aae-59c0-55ae273d1581 | 999-40-6084 | None | Janene3 | Howe413 | 2016-11-01 | 410620009 | SNOMED | 2e8b42f8-4300-30a9-a576-d513d1021a0a | GREATER LOWELL FAMILY PRACTICE PC | Dr. Homero668 Salcedo309 | 9999959890 | Well child visit (procedure) | 2016-11-01T04:38:02+01:00 | 2016-11-01T04:53:02+01:00 | Hep B, adolescent or pediatric | 08 | CVX |
4 | f9133ee8-f952-0e7d-f642-99d84fc9c6ad | 999-93-3546 | Ms. | Ozie87 | Nitzsche158 | 1998-04-21 | 410620009 | SNOMED | 0e9d0119-1157-35a0-8bb5-7e0aa3041928 | CARING HEALTH CENTER, INC | Dr. Kari181 Trantow673 | 9999979997 | Well child visit (procedure) | 2014-06-03T20:31:52+02:00 | 2014-06-03T20:46:52+02:00 | Influenza, seasonal, injectable, preservative ... | 140 | CVX |
# option 1
= 'Well child visit (procedure)'
value = df_timeline_spark.filter(df_timeline_spark['procedure_name']==value)
df_filtered_pyspark
# option 2
= df_timeline_spark.filter(F.expr(f"procedure_name == '{value}'"))
df_filtered_pyspark
# option 3 - sql
"timeline")
df_timeline_spark.createOrReplaceTempView(= spark.sql(f"Select * from timeline where procedure_name =='{value}'")
df_filtered_pyspark
#show results
'patient_id','start_time','procedure_name']].limit(5).show() df_filtered_pyspark[[
+--------------------+--------------------+--------------------+
| patient_id| start_time| procedure_name|
+--------------------+--------------------+--------------------+
|fca8d2ca-7aef-2c2...|2016-04-01T04:09:...|Well child visit ...|
|19936964-a432-d50...|2022-01-07T02:19:...|Well child visit ...|
|0d016955-26db-966...|2014-08-18T14:52:...|Well child visit ...|
|3a2311a7-eb30-2aa...|2016-11-01T04:38:...|Well child visit ...|
|f9133ee8-f952-0e7...|2014-06-03T20:31:...|Well child visit ...|
+--------------------+--------------------+--------------------+
%%R
<- df %>%
df_filtered filter(procedure_name == "Well child visit (procedure)")
kable(head(df_filtered))
|patient_id |social_security_number |prefix |first_name |last_name |birthDate | code|system |organization_id |organization_name |practitioner_name |practitioner_id |procedure_name |start_time |end_time |Vaccine_name |vaccine_code |vaccine_code_system |
|:------------------------------------|:----------------------|:------|:----------|:------------|:----------|---------:|:------|:------------------------------------|:-----------------------------------------------|:--------------------------|:---------------|:----------------------------|:-------------------------|:-------------------------|:--------------------------------------------------|:------------|:-------------------|
|fca8d2ca-7aef-2c27-3bba-3f94723012f5 |999-24-8599 |NA |Marlen929 |Greenholt190 |2016-04-01 | 410620009|SNOMED |3f12ebb4-e03c-3453-88d2-4fc9682383df |NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC |Dr. Homero668 Rolón954 |9999962894 |Well child visit (procedure) |2016-04-01T04:09:48+02:00 |2016-04-01T04:24:48+02:00 |Hep B, adolescent or pediatric |08 |CVX |
|19936964-a432-d501-2cd5-fa52db6b9f41 |999-33-7974 |NA |Monique148 |Haley279 |2022-01-07 | 410620009|SNOMED |c6b019eb-28ec-36f6-abf3-bcc4d1d58966 |DUTTON FAMILY CARE ASSOCIATES LLP |Dr. Maren639 Aufderhar910 |9999950790 |Well child visit (procedure) |2022-01-07T02:19:02+01:00 |2022-01-07T02:34:02+01:00 |Hep B, adolescent or pediatric |08 |CVX |
|0d016955-26db-966c-3d52-26d441bfcb97 |999-47-8500 |Ms. |Tracy345 |Smith67 |2000-07-17 | 410620009|SNOMED |39c15c0f-5c49-311e-99d2-1fb99d80e06e |HARBOR HEALTH SERVICES INC |Dr. Salvador46 Homenick806 |9999977496 |Well child visit (procedure) |2014-08-18T14:52:06+02:00 |2014-08-18T15:07:06+02:00 |Influenza, seasonal, injectable, preservative free |140 |CVX |
|3a2311a7-eb30-2aae-59c0-55ae273d1581 |999-40-6084 |NA |Janene3 |Howe413 |2016-11-01 | 410620009|SNOMED |2e8b42f8-4300-30a9-a576-d513d1021a0a |GREATER LOWELL FAMILY PRACTICE PC |Dr. Homero668 Salcedo309 |9999959890 |Well child visit (procedure) |2016-11-01T04:38:02+01:00 |2016-11-01T04:53:02+01:00 |Hep B, adolescent or pediatric |08 |CVX |
|f9133ee8-f952-0e7d-f642-99d84fc9c6ad |999-93-3546 |Ms. |Ozie87 |Nitzsche158 |1998-04-21 | 410620009|SNOMED |0e9d0119-1157-35a0-8bb5-7e0aa3041928 |CARING HEALTH CENTER, INC |Dr. Kari181 Trantow673 |9999979997 |Well child visit (procedure) |2014-06-03T20:31:52+02:00 |2014-06-03T20:46:52+02:00 |Influenza, seasonal, injectable, preservative free |140 |CVX |
|f9133ee8-f952-0e7d-f642-99d84fc9c6ad |999-93-3546 |Ms. |Ozie87 |Nitzsche158 |1998-04-21 | 410620009|SNOMED |0e9d0119-1157-35a0-8bb5-7e0aa3041928 |CARING HEALTH CENTER, INC |Dr. Kari181 Trantow673 |9999979997 |Well child visit (procedure) |2014-06-03T20:31:52+02:00 |2014-06-03T20:46:52+02:00 |meningococcal MCV4P |114 |CVX |
In addition: Warning message:
Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this.
Besides filtering data by content it is also possible to filter the columns one wishes to show
filter(items = ['patient_id','birthDate','organization_name']).head(1) df_filtered_pandas.
patient_id | birthDate | organization_name | |
---|---|---|---|
2 | fca8d2ca-7aef-2c27-3bba-3f94723012f5 | 2016-04-01 | NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC |
'patient_id','birthDate','organization_name']].head(1) df_filtered_pandas[[
patient_id | birthDate | organization_name | |
---|---|---|---|
2 | fca8d2ca-7aef-2c27-3bba-3f94723012f5 | 2016-04-01 | NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC |
df_filtered_polars.select(["patient_id"), pl.col("birthDate"), pl.col("organization_name")
pl.col( ]).head()
patient_id | birthDate | organization_name |
---|---|---|
str | datetime[μs] | str |
"fca8d2ca-7aef-… | 2016-04-01 00:00:00 | "NEW BEDFORD IN… |
"19936964-a432-… | 2022-01-07 00:00:00 | "DUTTON FAMILY … |
"0d016955-26db-… | 2000-07-17 00:00:00 | "HARBOR HEALTH … |
"3a2311a7-eb30-… | 2016-11-01 00:00:00 | "GREATER LOWELL… |
"f9133ee8-f952-… | 1998-04-21 00:00:00 | "CARING HEALTH … |
= """
query Select
patient_id,
birthDate,
organization_name
from patient_timeline limit 1
"""
con_duckdb.sql(query).to_df()
patient_id | birthDate | organization_name | |
---|---|---|---|
0 | 4fc244f3-2c0e-4017-d64d-c2c4cd03655f | 1965-01-07 | HOLYOKE HEALTH CENTER INC |
df_filtered_ibis.select('patient_id','birthDate','organization_name'
).to_pandas().head()
patient_id | birthDate | organization_name | |
---|---|---|---|
0 | fca8d2ca-7aef-2c27-3bba-3f94723012f5 | 2016-04-01 | NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC |
1 | 19936964-a432-d501-2cd5-fa52db6b9f41 | 2022-01-07 | DUTTON FAMILY CARE ASSOCIATES LLP |
2 | 0d016955-26db-966c-3d52-26d441bfcb97 | 2000-07-17 | HARBOR HEALTH SERVICES INC |
3 | 3a2311a7-eb30-2aae-59c0-55ae273d1581 | 2016-11-01 | GREATER LOWELL FAMILY PRACTICE PC |
4 | f9133ee8-f952-0e7d-f642-99d84fc9c6ad | 1998-04-21 | CARING HEALTH CENTER, INC |
'patient_id','birthDate','organization_name').limit(1).show() df_filtered_pyspark.select(
+--------------------+-------------------+--------------------+
| patient_id| birthDate| organization_name|
+--------------------+-------------------+--------------------+
|fca8d2ca-7aef-2c2...|2016-04-01 00:00:00|NEW BEDFORD INTER...|
+--------------------+-------------------+--------------------+
#TODO
Mutate
Another common action is the need to adapt data. For instance, we may want to create an additional column indicating the year of a visit in the patient timeline, based on the available start_time. Currently, the date is in a specific timezone, and it’s important to consider this when converting it to a datetime format. For now, we maintain the date in the same timezone before extracting the year from it.
# transform start_time to datetime (its currently string format)
'start_time'] = pd.to_datetime(df_timeline_pandas['start_time'], utc= True)
df_timeline_pandas['start_time'].dtype
df_timeline_pandas[
#option 1
'year'] = df_timeline_pandas['start_time'].dt.year
df_timeline_pandas[
#option 2
= df_timeline_pandas.assign(year=df_timeline_pandas['start_time'].dt.year )
df_timeline_pandas
'start_time','year']].head(1) df_timeline_pandas[[
start_time | year | |
---|---|---|
0 | 2015-01-08 06:11:47+00:00 | 2015 |
= (
df_new_var_polars
df_timeline_polars
.with_columns("start_time").str.to_datetime(format="%Y-%m-%dT%H:%M:%S%z")
pl.col("start_year")
.dt.year().alias(
)
)print(df_new_var_polars.head())
shape: (5, 19)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ patient_id ┆ social_sec ┆ prefix ┆ first_nam ┆ … ┆ Vaccine_n ┆ vaccine_c ┆ vaccine_c ┆ start_yea │
│ --- ┆ urity_numb ┆ --- ┆ e ┆ ┆ ame ┆ ode ┆ ode_syste ┆ r │
│ str ┆ er ┆ str ┆ --- ┆ ┆ --- ┆ --- ┆ m ┆ --- │
│ ┆ --- ┆ ┆ str ┆ ┆ str ┆ str ┆ --- ┆ i32 │
│ ┆ str ┆ ┆ ┆ ┆ ┆ ┆ str ┆ │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 4fc244f3-2 ┆ 999-53-581 ┆ Mrs. ┆ Alyce744 ┆ … ┆ zoster ┆ 121 ┆ CVX ┆ 2015 │
│ c0e-4017-d ┆ 3 ┆ ┆ ┆ ┆ vaccine, ┆ ┆ ┆ │
│ 64d-c2c4cd ┆ ┆ ┆ ┆ ┆ live ┆ ┆ ┆ │
│ 03… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 4fc244f3-2 ┆ 999-53-581 ┆ Mrs. ┆ Alyce744 ┆ … ┆ Influenza ┆ 140 ┆ CVX ┆ 2015 │
│ c0e-4017-d ┆ 3 ┆ ┆ ┆ ┆ , ┆ ┆ ┆ │
│ 64d-c2c4cd ┆ ┆ ┆ ┆ ┆ seasonal, ┆ ┆ ┆ │
│ 03… ┆ ┆ ┆ ┆ ┆ injectabl ┆ ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ e,… ┆ ┆ ┆ │
│ fca8d2ca-7 ┆ 999-24-859 ┆ null ┆ Marlen929 ┆ … ┆ Hep B, ┆ 08 ┆ CVX ┆ 2016 │
│ aef-2c27-3 ┆ 9 ┆ ┆ ┆ ┆ adolescen ┆ ┆ ┆ │
│ bba-3f9472 ┆ ┆ ┆ ┆ ┆ t or ┆ ┆ ┆ │
│ 30… ┆ ┆ ┆ ┆ ┆ pediatric ┆ ┆ ┆ │
│ 19936964-a ┆ 999-33-797 ┆ null ┆ Monique14 ┆ … ┆ Hep B, ┆ 08 ┆ CVX ┆ 2022 │
│ 432-d501-2 ┆ 4 ┆ ┆ 8 ┆ ┆ adolescen ┆ ┆ ┆ │
│ cd5-fa52db ┆ ┆ ┆ ┆ ┆ t or ┆ ┆ ┆ │
│ 6b… ┆ ┆ ┆ ┆ ┆ pediatric ┆ ┆ ┆ │
│ 0d016955-2 ┆ 999-47-850 ┆ Ms. ┆ Tracy345 ┆ … ┆ Influenza ┆ 140 ┆ CVX ┆ 2014 │
│ 6db-966c-3 ┆ 0 ┆ ┆ ┆ ┆ , ┆ ┆ ┆ │
│ d52-26d441 ┆ ┆ ┆ ┆ ┆ seasonal, ┆ ┆ ┆ │
│ bf… ┆ ┆ ┆ ┆ ┆ injectabl ┆ ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ e,… ┆ ┆ ┆ │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘
= """
query Select
cast(start_time as datetime) as start_time,
year(cast(start_time as datetime)) as year
from patient_timeline;
"""
con_duckdb.sql(query).to_df()
start_time | year | |
---|---|---|
0 | 2015-01-08 06:11:47 | 2015 |
1 | 2015-01-08 06:11:47 | 2015 |
2 | 2016-04-01 02:09:48 | 2016 |
3 | 2022-01-07 01:19:02 | 2022 |
4 | 2014-08-18 12:52:06 | 2014 |
... | ... | ... |
13832 | 2019-07-24 02:17:26 | 2019 |
13833 | 2020-07-29 02:17:26 | 2020 |
13834 | 2021-08-04 02:17:26 | 2021 |
13835 | 2022-08-10 02:17:26 | 2022 |
13836 | 2023-08-16 02:17:26 | 2023 |
13837 rows × 2 columns
= df_timeline_IBIS.mutate(start_year=df_timeline_IBIS.start_time.cast("timestamp").year()).to_pandas()
df_new_var_ibis df_new_var_ibis.head()
patient_id | social_security_number | prefix | first_name | last_name | birthDate | code | system | organization_id | organization_name | practitioner_name | practitioner_id | procedure_name | start_time | end_time | Vaccine_name | vaccine_code | vaccine_code_system | start_year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4fc244f3-2c0e-4017-d64d-c2c4cd03655f | 999-53-5813 | Mrs. | Alyce744 | Bergstrom287 | 1965-01-07 | 162673000 | SNOMED | b03b624d-c939-3688-986d-9555b8009a3b | HOLYOKE HEALTH CENTER INC | Dr. Bennett146 Hartmann983 | 9999981894 | General examination of patient (procedure) | 2015-01-08T07:11:47+01:00 | 2015-01-08T07:26:47+01:00 | zoster vaccine, live | 121 | CVX | 2015 |
1 | 4fc244f3-2c0e-4017-d64d-c2c4cd03655f | 999-53-5813 | Mrs. | Alyce744 | Bergstrom287 | 1965-01-07 | 162673000 | SNOMED | b03b624d-c939-3688-986d-9555b8009a3b | HOLYOKE HEALTH CENTER INC | Dr. Bennett146 Hartmann983 | 9999981894 | General examination of patient (procedure) | 2015-01-08T07:11:47+01:00 | 2015-01-08T07:26:47+01:00 | Influenza, seasonal, injectable, preservative ... | 140 | CVX | 2015 |
2 | fca8d2ca-7aef-2c27-3bba-3f94723012f5 | 999-24-8599 | None | Marlen929 | Greenholt190 | 2016-04-01 | 410620009 | SNOMED | 3f12ebb4-e03c-3453-88d2-4fc9682383df | NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC | Dr. Homero668 Rolón954 | 9999962894 | Well child visit (procedure) | 2016-04-01T04:09:48+02:00 | 2016-04-01T04:24:48+02:00 | Hep B, adolescent or pediatric | 08 | CVX | 2016 |
3 | 19936964-a432-d501-2cd5-fa52db6b9f41 | 999-33-7974 | None | Monique148 | Haley279 | 2022-01-07 | 410620009 | SNOMED | c6b019eb-28ec-36f6-abf3-bcc4d1d58966 | DUTTON FAMILY CARE ASSOCIATES LLP | Dr. Maren639 Aufderhar910 | 9999950790 | Well child visit (procedure) | 2022-01-07T02:19:02+01:00 | 2022-01-07T02:34:02+01:00 | Hep B, adolescent or pediatric | 08 | CVX | 2022 |
4 | 0d016955-26db-966c-3d52-26d441bfcb97 | 999-47-8500 | Ms. | Tracy345 | Smith67 | 2000-07-17 | 410620009 | SNOMED | 39c15c0f-5c49-311e-99d2-1fb99d80e06e | HARBOR HEALTH SERVICES INC | Dr. Salvador46 Homenick806 | 9999977496 | Well child visit (procedure) | 2014-08-18T14:52:06+02:00 | 2014-08-18T15:07:06+02:00 | Influenza, seasonal, injectable, preservative ... | 140 | CVX | 2014 |
# transform start_time to datetime (its currently string format)
= df_timeline_spark.withColumn("start_time", F.to_utc_timestamp(df_timeline_spark["start_time"], "Europe/Amsterdam"))
df_timeline_spark
= df_timeline_spark.withColumn("year", F.year(df_timeline_spark["start_time"]))
df_timeline_spark 'start_time','year').limit(3).show() df_timeline_spark.select(
+-------------------+----+
| start_time|year|
+-------------------+----+
|2015-01-08 05:11:47|2015|
|2015-01-08 05:11:47|2015|
|2016-04-01 00:09:48|2016|
+-------------------+----+
%%R
<- df %>%
df_new_var = year(as.Date(start_time, format = "%Y-%m-%d")))
mutate(start_year kable(head(df_new_var))
|patient_id |social_security_number |prefix |first_name |last_name |birthDate | code|system |organization_id |organization_name |practitioner_name |practitioner_id |procedure_name |start_time |end_time |Vaccine_name |vaccine_code |vaccine_code_system | start_year|
|:------------------------------------|:----------------------|:------|:----------|:------------|:----------|---------:|:------|:------------------------------------|:-----------------------------------------------|:--------------------------|:---------------|:------------------------------------------|:-------------------------|:-------------------------|:--------------------------------------------------|:------------|:-------------------|----------:|
|4fc244f3-2c0e-4017-d64d-c2c4cd03655f |999-53-5813 |Mrs. |Alyce744 |Bergstrom287 |1965-01-07 | 162673000|SNOMED |b03b624d-c939-3688-986d-9555b8009a3b |HOLYOKE HEALTH CENTER INC |Dr. Bennett146 Hartmann983 |9999981894 |General examination of patient (procedure) |2015-01-08T07:11:47+01:00 |2015-01-08T07:26:47+01:00 |zoster vaccine, live |121 |CVX | 2015|
|4fc244f3-2c0e-4017-d64d-c2c4cd03655f |999-53-5813 |Mrs. |Alyce744 |Bergstrom287 |1965-01-07 | 162673000|SNOMED |b03b624d-c939-3688-986d-9555b8009a3b |HOLYOKE HEALTH CENTER INC |Dr. Bennett146 Hartmann983 |9999981894 |General examination of patient (procedure) |2015-01-08T07:11:47+01:00 |2015-01-08T07:26:47+01:00 |Influenza, seasonal, injectable, preservative free |140 |CVX | 2015|
|fca8d2ca-7aef-2c27-3bba-3f94723012f5 |999-24-8599 |NA |Marlen929 |Greenholt190 |2016-04-01 | 410620009|SNOMED |3f12ebb4-e03c-3453-88d2-4fc9682383df |NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC |Dr. Homero668 Rolón954 |9999962894 |Well child visit (procedure) |2016-04-01T04:09:48+02:00 |2016-04-01T04:24:48+02:00 |Hep B, adolescent or pediatric |08 |CVX | 2016|
|19936964-a432-d501-2cd5-fa52db6b9f41 |999-33-7974 |NA |Monique148 |Haley279 |2022-01-07 | 410620009|SNOMED |c6b019eb-28ec-36f6-abf3-bcc4d1d58966 |DUTTON FAMILY CARE ASSOCIATES LLP |Dr. Maren639 Aufderhar910 |9999950790 |Well child visit (procedure) |2022-01-07T02:19:02+01:00 |2022-01-07T02:34:02+01:00 |Hep B, adolescent or pediatric |08 |CVX | 2022|
|0d016955-26db-966c-3d52-26d441bfcb97 |999-47-8500 |Ms. |Tracy345 |Smith67 |2000-07-17 | 410620009|SNOMED |39c15c0f-5c49-311e-99d2-1fb99d80e06e |HARBOR HEALTH SERVICES INC |Dr. Salvador46 Homenick806 |9999977496 |Well child visit (procedure) |2014-08-18T14:52:06+02:00 |2014-08-18T15:07:06+02:00 |Influenza, seasonal, injectable, preservative free |140 |CVX | 2014|
|3a2311a7-eb30-2aae-59c0-55ae273d1581 |999-40-6084 |NA |Janene3 |Howe413 |2016-11-01 | 410620009|SNOMED |2e8b42f8-4300-30a9-a576-d513d1021a0a |GREATER LOWELL FAMILY PRACTICE PC |Dr. Homero668 Salcedo309 |9999959890 |Well child visit (procedure) |2016-11-01T04:38:02+01:00 |2016-11-01T04:53:02+01:00 |Hep B, adolescent or pediatric |08 |CVX | 2016|
Order
It often occurs that you want to order your data. In this case, let’s arrange the patient timeline by start_time. Most libraries automatically order in ascending (oldest first) and allow you to order in descending (newest first).
= df_timeline_pandas.sort_values(by = 'start_time', ascending = False)
df_ordered_pandas 'patient_id','start_time','procedure_name']].head() df_ordered_pandas[[
patient_id | start_time | procedure_name | |
---|---|---|---|
1871 | 1d24c422-98a8-713d-3184-fbd05c3695a1 | 2023-09-05 10:06:14+00:00 | Well child visit (procedure) |
1872 | 1d24c422-98a8-713d-3184-fbd05c3695a1 | 2023-09-05 10:06:14+00:00 | Well child visit (procedure) |
9916 | 404e3ebd-a1e3-4678-7d4e-58a48ca21418 | 2023-09-05 03:32:07+00:00 | Well child visit (procedure) |
5872 | 4efaa926-30cf-0fcd-04d9-0d31c4071483 | 2023-09-04 04:12:02+00:00 | General examination of patient (procedure) |
12301 | 8f262a31-6268-5a96-7432-1723d168687f | 2023-09-04 01:54:17+00:00 | General examination of patient (procedure) |
= (
df_ordered_polars
df_timeline_polars
.with_columns("start_time").str.to_datetime(format="%Y-%m-%dT%H:%M:%S%z")
pl.col(
)"start_time")
.sort(
)print(df_ordered_polars.head())
shape: (5, 18)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ patient_id ┆ social_sec ┆ prefix ┆ first_nam ┆ … ┆ end_time ┆ Vaccine_n ┆ vaccine_c ┆ vaccine_c │
│ --- ┆ urity_numb ┆ --- ┆ e ┆ ┆ --- ┆ ame ┆ ode ┆ ode_syste │
│ str ┆ er ┆ str ┆ --- ┆ ┆ str ┆ --- ┆ --- ┆ m │
│ ┆ --- ┆ ┆ str ┆ ┆ ┆ str ┆ str ┆ --- │
│ ┆ str ┆ ┆ ┆ ┆ ┆ ┆ ┆ str │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 9808a732-1 ┆ 999-90-152 ┆ Mrs. ┆ Pattie91 ┆ … ┆ 2013-09-1 ┆ Influenza ┆ 140 ┆ CVX │
│ f51-6483-d ┆ 4 ┆ ┆ ┆ ┆ 1T19:23:1 ┆ , ┆ ┆ │
│ 8b0-ec0d61 ┆ ┆ ┆ ┆ ┆ 4+02:00 ┆ seasonal, ┆ ┆ │
│ f8… ┆ ┆ ┆ ┆ ┆ ┆ injectabl ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ e,… ┆ ┆ │
│ 6f93e13d-d ┆ 999-65-407 ┆ Mrs. ┆ Carrie738 ┆ … ┆ 2013-09-1 ┆ Influenza ┆ 140 ┆ CVX │
│ 237-acc2-4 ┆ 9 ┆ ┆ ┆ ┆ 1T21:06:5 ┆ , ┆ ┆ │
│ 7f0-d1fa20 ┆ ┆ ┆ ┆ ┆ 0+02:00 ┆ seasonal, ┆ ┆ │
│ c6… ┆ ┆ ┆ ┆ ┆ ┆ injectabl ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ e,… ┆ ┆ │
│ 7aafabc2-f ┆ 999-49-954 ┆ null ┆ Maximina9 ┆ … ┆ 2013-09-1 ┆ Hep B, ┆ 08 ┆ CVX │
│ 4cf-3d82-f ┆ 2 ┆ ┆ 16 ┆ ┆ 2T05:05:4 ┆ adolescen ┆ ┆ │
│ 1e9-64d809 ┆ ┆ ┆ ┆ ┆ 5+02:00 ┆ t or ┆ ┆ │
│ 4b… ┆ ┆ ┆ ┆ ┆ ┆ pediatric ┆ ┆ │
│ 1c2f968a-4 ┆ 999-61-837 ┆ null ┆ Claribel7 ┆ … ┆ 2013-09-1 ┆ varicella ┆ 21 ┆ CVX │
│ 735-c752-0 ┆ 5 ┆ ┆ 06 ┆ ┆ 3T02:11:3 ┆ ┆ ┆ │
│ 7dd-54a664 ┆ ┆ ┆ ┆ ┆ 5+02:00 ┆ ┆ ┆ │
│ fe… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 1c2f968a-4 ┆ 999-61-837 ┆ null ┆ Claribel7 ┆ … ┆ 2013-09-1 ┆ IPV ┆ 10 ┆ CVX │
│ 735-c752-0 ┆ 5 ┆ ┆ 06 ┆ ┆ 3T02:11:3 ┆ ┆ ┆ │
│ 7dd-54a664 ┆ ┆ ┆ ┆ ┆ 5+02:00 ┆ ┆ ┆ │
│ fe… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘
= """
query Select
*
from patient_timeline
order by start_time asc
limit 5
"""
con_duckdb.sql(query).to_df()
patient_id | social_security_number | prefix | first_name | last_name | birthDate | code | system | organization_id | organization_name | practitioner_name | practitioner_id | procedure_name | start_time | end_time | Vaccine_name | vaccine_code | vaccine_code_system | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 9808a732-1f51-6483-d8b0-ec0d61f8cead | 999-90-1524 | Mrs. | Pattie91 | Abbott774 | 1967-08-23 | 162673000 | SNOMED | b44955d1-f0e2-3beb-a013-708a81dbe430 | SIGNATURE HEALTHCARE MEDICAL GROUP INC | Dr. Ernest565 Runte676 | 9999970699 | General examination of patient (procedure) | 2013-09-11T19:08:14+02:00 | 2013-09-11T19:23:14+02:00 | Influenza, seasonal, injectable, preservative ... | 140 | CVX |
1 | 6f93e13d-d237-acc2-47f0-d1fa20c62df2 | 999-65-4079 | Mrs. | Carrie738 | Blick895 | 1985-07-03 | 162673000 | SNOMED | 11a40a26-a5a3-331d-aef0-d06dd90e2b4e | CAPE COD PRIMARY CARE LLC | Dr. Len277 Wintheiser220 | 9999947994 | General examination of patient (procedure) | 2013-09-11T20:51:50+02:00 | 2013-09-11T21:06:50+02:00 | Influenza, seasonal, injectable, preservative ... | 140 | CVX |
2 | 7aafabc2-f4cf-3d82-f1e9-64d8094beeb6 | 999-49-9542 | None | Maximina916 | Rutherford999 | 2013-09-12 | 410620009 | SNOMED | dd9914b0-9491-3330-a890-d34e95e497ea | CHILD & FAMILY SERVICES, INC. | Dr. Assunta351 Haley279 | 9999967794 | Well child visit (procedure) | 2013-09-12T04:50:45+02:00 | 2013-09-12T05:05:45+02:00 | Hep B, adolescent or pediatric | 08 | CVX |
3 | 1c2f968a-4735-c752-07dd-54a664fef295 | 999-61-8375 | None | Claribel706 | Ruecker817 | 2008-10-03 | 410620009 | SNOMED | 542185c1-943a-30f4-aba0-0efaa92488a4 | EVEREST HEALTH CARE SPECIALISTS PLLC | Dr. Luetta803 Rau926 | 9999955096 | Well child visit (procedure) | 2013-09-13T01:56:35+02:00 | 2013-09-13T02:11:35+02:00 | MMR | 03 | CVX |
4 | 1c2f968a-4735-c752-07dd-54a664fef295 | 999-61-8375 | None | Claribel706 | Ruecker817 | 2008-10-03 | 410620009 | SNOMED | 542185c1-943a-30f4-aba0-0efaa92488a4 | EVEREST HEALTH CARE SPECIALISTS PLLC | Dr. Luetta803 Rau926 | 9999955096 | Well child visit (procedure) | 2013-09-13T01:56:35+02:00 | 2013-09-13T02:11:35+02:00 | DTaP | 20 | CVX |
= df_timeline_IBIS.order_by(df_timeline_IBIS.start_time.cast("timestamp")).to_pandas()
df_ordered_ibis df_ordered_ibis.head()
patient_id | social_security_number | prefix | first_name | last_name | birthDate | code | system | organization_id | organization_name | practitioner_name | practitioner_id | procedure_name | start_time | end_time | Vaccine_name | vaccine_code | vaccine_code_system | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 9808a732-1f51-6483-d8b0-ec0d61f8cead | 999-90-1524 | Mrs. | Pattie91 | Abbott774 | 1967-08-23 | 162673000 | SNOMED | b44955d1-f0e2-3beb-a013-708a81dbe430 | SIGNATURE HEALTHCARE MEDICAL GROUP INC | Dr. Ernest565 Runte676 | 9999970699 | General examination of patient (procedure) | 2013-09-11T19:08:14+02:00 | 2013-09-11T19:23:14+02:00 | Influenza, seasonal, injectable, preservative ... | 140 | CVX |
1 | 6f93e13d-d237-acc2-47f0-d1fa20c62df2 | 999-65-4079 | Mrs. | Carrie738 | Blick895 | 1985-07-03 | 162673000 | SNOMED | 11a40a26-a5a3-331d-aef0-d06dd90e2b4e | CAPE COD PRIMARY CARE LLC | Dr. Len277 Wintheiser220 | 9999947994 | General examination of patient (procedure) | 2013-09-11T20:51:50+02:00 | 2013-09-11T21:06:50+02:00 | Influenza, seasonal, injectable, preservative ... | 140 | CVX |
2 | 7aafabc2-f4cf-3d82-f1e9-64d8094beeb6 | 999-49-9542 | None | Maximina916 | Rutherford999 | 2013-09-12 | 410620009 | SNOMED | dd9914b0-9491-3330-a890-d34e95e497ea | CHILD & FAMILY SERVICES, INC. | Dr. Assunta351 Haley279 | 9999967794 | Well child visit (procedure) | 2013-09-12T04:50:45+02:00 | 2013-09-12T05:05:45+02:00 | Hep B, adolescent or pediatric | 08 | CVX |
3 | 1c2f968a-4735-c752-07dd-54a664fef295 | 999-61-8375 | None | Claribel706 | Ruecker817 | 2008-10-03 | 410620009 | SNOMED | 542185c1-943a-30f4-aba0-0efaa92488a4 | EVEREST HEALTH CARE SPECIALISTS PLLC | Dr. Luetta803 Rau926 | 9999955096 | Well child visit (procedure) | 2013-09-13T01:56:35+02:00 | 2013-09-13T02:11:35+02:00 | varicella | 21 | CVX |
4 | 1c2f968a-4735-c752-07dd-54a664fef295 | 999-61-8375 | None | Claribel706 | Ruecker817 | 2008-10-03 | 410620009 | SNOMED | 542185c1-943a-30f4-aba0-0efaa92488a4 | EVEREST HEALTH CARE SPECIALISTS PLLC | Dr. Luetta803 Rau926 | 9999955096 | Well child visit (procedure) | 2013-09-13T01:56:35+02:00 | 2013-09-13T02:11:35+02:00 | IPV | 10 | CVX |
= df_timeline_spark.orderBy('start_time',ascending = False)
df_ordered_pyspark 5).show() df_ordered_pyspark.limit(
+--------------------+----------------------+------+-----------+--------------+-------------------+---------+------+--------------------+--------------------+--------------------+---------------+--------------------+-------------------+--------------------+--------------------+------------+-------------------+----+
| patient_id|social_security_number|prefix| first_name| last_name| birthDate| code|system| organization_id| organization_name| practitioner_name|practitioner_id| procedure_name| start_time| end_time| Vaccine_name|vaccine_code|vaccine_code_system|year|
+--------------------+----------------------+------+-----------+--------------+-------------------+---------+------+--------------------+--------------------+--------------------+---------------+--------------------+-------------------+--------------------+--------------------+------------+-------------------+----+
|1d24c422-98a8-713...| 999-29-9635| NULL| Taneka590| Howe413|2011-08-16 00:00:00|410620009|SNOMED|545dcda5-7471-374...|COMMUNITY HEALTH ...|Dr. Lezlie553 Kos...| 9999981191|Well child visit ...|2023-09-05 08:06:14|2023-09-05T12:21:...| HPV, quadrivalent| 62| CVX|2023|
|1d24c422-98a8-713...| 999-29-9635| NULL| Taneka590| Howe413|2011-08-16 00:00:00|410620009|SNOMED|545dcda5-7471-374...|COMMUNITY HEALTH ...|Dr. Lezlie553 Kos...| 9999981191|Well child visit ...|2023-09-05 08:06:14|2023-09-05T12:21:...|Influenza, season...| 140| CVX|2023|
|404e3ebd-a1e3-467...| 999-54-8765| NULL|Nicholle822| Runte676|2013-08-27 00:00:00|410620009|SNOMED|84abfbd5-41c1-3df...|SKIN WELLNESS PHY...|Dr. Cole117 Block661| 9999953190|Well child visit ...|2023-09-05 01:32:07|2023-09-05T05:47:...|Influenza, season...| 140| CVX|2023|
|4efaa926-30cf-0fc...| 999-66-1148| Mrs.| Lisa683|Schamberger479|1951-04-30 00:00:00|162673000|SNOMED|0583e4f7-f1c2-310...|COMMUNITY HEALTH ...| Dr. Jewel43 Howe413| 9999978494|General examinati...|2023-09-04 02:12:02|2023-09-04T06:27:...|Influenza, season...| 140| CVX|2023|
|8f262a31-6268-5a9...| 999-13-3109| Mrs.| Pa577| Graham902|1970-08-17 00:00:00|162673000|SNOMED|e1892a2c-6dca-3c4...|PRIMARY & PREVENT...|Dr. Derek111 Erdm...| 9999954990|General examinati...|2023-09-03 23:54:17|2023-09-04T04:09:...|Influenza, season...| 140| CVX|2023|
+--------------------+----------------------+------+-----------+--------------+-------------------+---------+------+--------------------+--------------------+--------------------+---------------+--------------------+-------------------+--------------------+--------------------+------------+-------------------+----+
%%R
<- df %>%
df_ordered = as.Date(start_time), format = "%Y-%m-%d") %>%
mutate(start_time
arrange(start_time) kable(head(df_ordered))
|patient_id |social_security_number |prefix |first_name |last_name |birthDate | code|system |organization_id |organization_name |practitioner_name |practitioner_id |procedure_name |start_time |end_time |Vaccine_name |vaccine_code |vaccine_code_system |format |
|:------------------------------------|:----------------------|:------|:-----------|:-------------|:----------|---------:|:------|:------------------------------------|:---------------------------------------------|:------------------------|:---------------|:------------------------------------------|:----------|:-------------------------|:--------------------------------------------------|:------------|:-------------------|:--------|
|6f93e13d-d237-acc2-47f0-d1fa20c62df2 |999-65-4079 |Mrs. |Carrie738 |Blick895 |1985-07-03 | 162673000|SNOMED |11a40a26-a5a3-331d-aef0-d06dd90e2b4e |CAPE COD PRIMARY CARE LLC |Dr. Len277 Wintheiser220 |9999947994 |General examination of patient (procedure) |2013-09-11 |2013-09-11T21:06:50+02:00 |Influenza, seasonal, injectable, preservative free |140 |CVX |%Y-%m-%d |
|9808a732-1f51-6483-d8b0-ec0d61f8cead |999-90-1524 |Mrs. |Pattie91 |Abbott774 |1967-08-23 | 162673000|SNOMED |b44955d1-f0e2-3beb-a013-708a81dbe430 |SIGNATURE HEALTHCARE MEDICAL GROUP INC |Dr. Ernest565 Runte676 |9999970699 |General examination of patient (procedure) |2013-09-11 |2013-09-11T19:23:14+02:00 |Influenza, seasonal, injectable, preservative free |140 |CVX |%Y-%m-%d |
|7aafabc2-f4cf-3d82-f1e9-64d8094beeb6 |999-49-9542 |NA |Maximina916 |Rutherford999 |2013-09-12 | 410620009|SNOMED |dd9914b0-9491-3330-a890-d34e95e497ea |CHILD & FAMILY SERVICES, INC. |Dr. Assunta351 Haley279 |9999967794 |Well child visit (procedure) |2013-09-12 |2013-09-12T05:05:45+02:00 |Hep B, adolescent or pediatric |08 |CVX |%Y-%m-%d |
|0a78537c-628d-77a4-d16b-ec23b6327c9e |999-89-9767 |NA |Kalyn451 |Smith67 |2007-09-28 | 410620009|SNOMED |7a75decf-7d3f-3b0d-8c2b-3abf0a8c7f58 |NEW ENGLAND PROFESSIONAL HOME HEALTH CARE LLC |Dr. Afton574 Kulas532 |9999944595 |Well child visit (procedure) |2013-09-13 |2013-09-13T06:33:22+02:00 |Influenza, seasonal, injectable, preservative free |140 |CVX |%Y-%m-%d |
|1c2f968a-4735-c752-07dd-54a664fef295 |999-61-8375 |NA |Claribel706 |Ruecker817 |2008-10-03 | 410620009|SNOMED |542185c1-943a-30f4-aba0-0efaa92488a4 |EVEREST HEALTH CARE SPECIALISTS PLLC |Dr. Luetta803 Rau926 |9999955096 |Well child visit (procedure) |2013-09-13 |2013-09-13T02:11:35+02:00 |varicella |21 |CVX |%Y-%m-%d |
|1c2f968a-4735-c752-07dd-54a664fef295 |999-61-8375 |NA |Claribel706 |Ruecker817 |2008-10-03 | 410620009|SNOMED |542185c1-943a-30f4-aba0-0efaa92488a4 |EVEREST HEALTH CARE SPECIALISTS PLLC |Dr. Luetta803 Rau926 |9999955096 |Well child visit (procedure) |2013-09-13 |2013-09-13T02:11:35+02:00 |IPV |10 |CVX |%Y-%m-%d |
Rename
Another action that is done frequently is the renaming of a column. As an example we rename the birthDate column to birth_date
In pandas, many actions such as rename can be adapted directly in the same dataframe by adding the option inplace = True.
Alternatively, one can create a new dataframe and keep the name the same in the original dataframe.
#option 1 - change in dataframe
= {'birthDate':'birth_date'}, inplace = True)
df_timeline_pandas.rename(columns
#option 2 - change only in new dataframe
= df_timeline_pandas.rename(columns = {'birthDate':'birth_date'})
df_renamed_pandas 1) df_renamed_pandas.head(
patient_id | social_security_number | prefix | first_name | last_name | birth_date | code | system | organization_id | organization_name | practitioner_name | practitioner_id | procedure_name | start_time | end_time | Vaccine_name | vaccine_code | vaccine_code_system | year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4fc244f3-2c0e-4017-d64d-c2c4cd03655f | 999-53-5813 | Mrs. | Alyce744 | Bergstrom287 | 1965-01-07 | 162673000 | SNOMED | b03b624d-c939-3688-986d-9555b8009a3b | HOLYOKE HEALTH CENTER INC | Dr. Bennett146 Hartmann983 | 9999981894 | General examination of patient (procedure) | 2015-01-08 06:11:47+00:00 | 2015-01-08T07:26:47+01:00 | zoster vaccine, live | 121 | CVX | 2015 |
= (
df_renamed_polars
df_timeline_polars"birthDate": "birth_date"})
.rename({
)print(df_renamed_polars.head(1))
shape: (1, 18)
┌────────────┬────────────┬────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ patient_id ┆ social_sec ┆ prefix ┆ first_nam ┆ … ┆ end_time ┆ Vaccine_n ┆ vaccine_c ┆ vaccine_c │
│ --- ┆ urity_numb ┆ --- ┆ e ┆ ┆ --- ┆ ame ┆ ode ┆ ode_syste │
│ str ┆ er ┆ str ┆ --- ┆ ┆ str ┆ --- ┆ --- ┆ m │
│ ┆ --- ┆ ┆ str ┆ ┆ ┆ str ┆ str ┆ --- │
│ ┆ str ┆ ┆ ┆ ┆ ┆ ┆ ┆ str │
╞════════════╪════════════╪════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 4fc244f3-2 ┆ 999-53-581 ┆ Mrs. ┆ Alyce744 ┆ … ┆ 2015-01-0 ┆ zoster ┆ 121 ┆ CVX │
│ c0e-4017-d ┆ 3 ┆ ┆ ┆ ┆ 8T07:26:4 ┆ vaccine, ┆ ┆ │
│ 64d-c2c4cd ┆ ┆ ┆ ┆ ┆ 7+01:00 ┆ live ┆ ┆ │
│ 03… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
└────────────┴────────────┴────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘
= """
query alter table patient_timeline
rename column birthDate to birth_date;
select * from patient_timeline limit 5;
"""
print(con_duckdb.sql(query).to_df())
"alter table patient_timeline rename column birth_date to birthDate") con_duckdb.sql(
patient_id social_security_number prefix
0 4fc244f3-2c0e-4017-d64d-c2c4cd03655f 999-53-5813 Mrs. \
1 4fc244f3-2c0e-4017-d64d-c2c4cd03655f 999-53-5813 Mrs.
2 fca8d2ca-7aef-2c27-3bba-3f94723012f5 999-24-8599 None
3 19936964-a432-d501-2cd5-fa52db6b9f41 999-33-7974 None
4 0d016955-26db-966c-3d52-26d441bfcb97 999-47-8500 Ms.
first_name last_name birth_date code system
0 Alyce744 Bergstrom287 1965-01-07 162673000 SNOMED \
1 Alyce744 Bergstrom287 1965-01-07 162673000 SNOMED
2 Marlen929 Greenholt190 2016-04-01 410620009 SNOMED
3 Monique148 Haley279 2022-01-07 410620009 SNOMED
4 Tracy345 Smith67 2000-07-17 410620009 SNOMED
organization_id
0 b03b624d-c939-3688-986d-9555b8009a3b \
1 b03b624d-c939-3688-986d-9555b8009a3b
2 3f12ebb4-e03c-3453-88d2-4fc9682383df
3 c6b019eb-28ec-36f6-abf3-bcc4d1d58966
4 39c15c0f-5c49-311e-99d2-1fb99d80e06e
organization_name
0 HOLYOKE HEALTH CENTER INC \
1 HOLYOKE HEALTH CENTER INC
2 NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC
3 DUTTON FAMILY CARE ASSOCIATES LLP
4 HARBOR HEALTH SERVICES INC
practitioner_name practitioner_id
0 Dr. Bennett146 Hartmann983 9999981894 \
1 Dr. Bennett146 Hartmann983 9999981894
2 Dr. Homero668 Rolón954 9999962894
3 Dr. Maren639 Aufderhar910 9999950790
4 Dr. Salvador46 Homenick806 9999977496
procedure_name start_time
0 General examination of patient (procedure) 2015-01-08T07:11:47+01:00 \
1 General examination of patient (procedure) 2015-01-08T07:11:47+01:00
2 Well child visit (procedure) 2016-04-01T04:09:48+02:00
3 Well child visit (procedure) 2022-01-07T02:19:02+01:00
4 Well child visit (procedure) 2014-08-18T14:52:06+02:00
end_time
0 2015-01-08T07:26:47+01:00 \
1 2015-01-08T07:26:47+01:00
2 2016-04-01T04:24:48+02:00
3 2022-01-07T02:34:02+01:00
4 2014-08-18T15:07:06+02:00
Vaccine_name vaccine_code
0 zoster vaccine, live 121 \
1 Influenza, seasonal, injectable, preservative ... 140
2 Hep B, adolescent or pediatric 08
3 Hep B, adolescent or pediatric 08
4 Influenza, seasonal, injectable, preservative ... 140
vaccine_code_system
0 CVX
1 CVX
2 CVX
3 CVX
4 CVX
= df_timeline_IBIS.relabel({"birthDate": "birth_date"}).to_pandas()
df_renamed_ibis 1) df_renamed_ibis.head(
patient_id | social_security_number | prefix | first_name | last_name | birth_date | code | system | organization_id | organization_name | practitioner_name | practitioner_id | procedure_name | start_time | end_time | Vaccine_name | vaccine_code | vaccine_code_system | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4fc244f3-2c0e-4017-d64d-c2c4cd03655f | 999-53-5813 | Mrs. | Alyce744 | Bergstrom287 | 1965-01-07 | 162673000 | SNOMED | b03b624d-c939-3688-986d-9555b8009a3b | HOLYOKE HEALTH CENTER INC | Dr. Bennett146 Hartmann983 | 9999981894 | General examination of patient (procedure) | 2015-01-08T07:11:47+01:00 | 2015-01-08T07:26:47+01:00 | zoster vaccine, live | 121 | CVX |
= df_timeline_spark.withColumnRenamed("birthDate", "birth_date")
df_renamed_pyspark 1) df_renamed_pyspark.head(
[Row(patient_id='4fc244f3-2c0e-4017-d64d-c2c4cd03655f', social_security_number='999-53-5813', prefix='Mrs.', first_name='Alyce744', last_name='Bergstrom287', birth_date=datetime.datetime(1965, 1, 7, 0, 0), code=162673000, system='SNOMED', organization_id='b03b624d-c939-3688-986d-9555b8009a3b', organization_name='HOLYOKE HEALTH CENTER INC', practitioner_name='Dr. Bennett146 Hartmann983', practitioner_id='9999981894', procedure_name='General examination of patient (procedure)', start_time=datetime.datetime(2015, 1, 8, 5, 11, 47), end_time='2015-01-08T07:26:47+01:00', Vaccine_name='zoster vaccine, live', vaccine_code='121', vaccine_code_system='CVX', year=2015)]
%%R
<- df %>%
df_renamed = birthDate)
rename(birth_date kable(head(df_renamed))
|patient_id |social_security_number |prefix |first_name |last_name |birth_date | code|system |organization_id |organization_name |practitioner_name |practitioner_id |procedure_name |start_time |end_time |Vaccine_name |vaccine_code |vaccine_code_system |
|:------------------------------------|:----------------------|:------|:----------|:------------|:----------|---------:|:------|:------------------------------------|:-----------------------------------------------|:--------------------------|:---------------|:------------------------------------------|:-------------------------|:-------------------------|:--------------------------------------------------|:------------|:-------------------|
|4fc244f3-2c0e-4017-d64d-c2c4cd03655f |999-53-5813 |Mrs. |Alyce744 |Bergstrom287 |1965-01-07 | 162673000|SNOMED |b03b624d-c939-3688-986d-9555b8009a3b |HOLYOKE HEALTH CENTER INC |Dr. Bennett146 Hartmann983 |9999981894 |General examination of patient (procedure) |2015-01-08T07:11:47+01:00 |2015-01-08T07:26:47+01:00 |zoster vaccine, live |121 |CVX |
|4fc244f3-2c0e-4017-d64d-c2c4cd03655f |999-53-5813 |Mrs. |Alyce744 |Bergstrom287 |1965-01-07 | 162673000|SNOMED |b03b624d-c939-3688-986d-9555b8009a3b |HOLYOKE HEALTH CENTER INC |Dr. Bennett146 Hartmann983 |9999981894 |General examination of patient (procedure) |2015-01-08T07:11:47+01:00 |2015-01-08T07:26:47+01:00 |Influenza, seasonal, injectable, preservative free |140 |CVX |
|fca8d2ca-7aef-2c27-3bba-3f94723012f5 |999-24-8599 |NA |Marlen929 |Greenholt190 |2016-04-01 | 410620009|SNOMED |3f12ebb4-e03c-3453-88d2-4fc9682383df |NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC |Dr. Homero668 Rolón954 |9999962894 |Well child visit (procedure) |2016-04-01T04:09:48+02:00 |2016-04-01T04:24:48+02:00 |Hep B, adolescent or pediatric |08 |CVX |
|19936964-a432-d501-2cd5-fa52db6b9f41 |999-33-7974 |NA |Monique148 |Haley279 |2022-01-07 | 410620009|SNOMED |c6b019eb-28ec-36f6-abf3-bcc4d1d58966 |DUTTON FAMILY CARE ASSOCIATES LLP |Dr. Maren639 Aufderhar910 |9999950790 |Well child visit (procedure) |2022-01-07T02:19:02+01:00 |2022-01-07T02:34:02+01:00 |Hep B, adolescent or pediatric |08 |CVX |
|0d016955-26db-966c-3d52-26d441bfcb97 |999-47-8500 |Ms. |Tracy345 |Smith67 |2000-07-17 | 410620009|SNOMED |39c15c0f-5c49-311e-99d2-1fb99d80e06e |HARBOR HEALTH SERVICES INC |Dr. Salvador46 Homenick806 |9999977496 |Well child visit (procedure) |2014-08-18T14:52:06+02:00 |2014-08-18T15:07:06+02:00 |Influenza, seasonal, injectable, preservative free |140 |CVX |
|3a2311a7-eb30-2aae-59c0-55ae273d1581 |999-40-6084 |NA |Janene3 |Howe413 |2016-11-01 | 410620009|SNOMED |2e8b42f8-4300-30a9-a576-d513d1021a0a |GREATER LOWELL FAMILY PRACTICE PC |Dr. Homero668 Salcedo309 |9999959890 |Well child visit (procedure) |2016-11-01T04:38:02+01:00 |2016-11-01T04:53:02+01:00 |Hep B, adolescent or pediatric |08 |CVX |
Join
It’s rare that data analysis involves only a single data frame. Typically you have many data frames, and you must join them together to answer the questions that you’re interested in.
An inner join keeps observations that appear in both tables.
An outer join keeps observations that appear in at least one of the tables. There are three types of outer joins:
- A left join keeps all observations in x.
- A right join keeps all observations in y.
- A full join keeps all observations in x and y.
The most commonly used join is the left join: you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn’t a match. The left join should be your default join: use it unless you have a strong reason to prefer one of the others.
= pd.merge(
df_joined_pandas
df_timeline_pandas,
df_price_list_pandas, = ['vaccine_code','vaccine_code_system'],
left_on = ['code','system'],
right_on ='left'
how
)
filter(items = ['patient_id','start_time','year','vaccine_name','item_claimed','USD']).head() df_joined_pandas.
patient_id | start_time | year | item_claimed | USD | |
---|---|---|---|---|---|
0 | 4fc244f3-2c0e-4017-d64d-c2c4cd03655f | 2015-01-08 06:11:47+00:00 | 2015 | zoster vaccine, live | 136.0 |
1 | 4fc244f3-2c0e-4017-d64d-c2c4cd03655f | 2015-01-08 06:11:47+00:00 | 2015 | Influenza, seasonal, injectable, preservative ... | 136.0 |
2 | fca8d2ca-7aef-2c27-3bba-3f94723012f5 | 2016-04-01 02:09:48+00:00 | 2016 | Hep B, adolescent or pediatric | 136.0 |
3 | 19936964-a432-d501-2cd5-fa52db6b9f41 | 2022-01-07 01:19:02+00:00 | 2022 | Hep B, adolescent or pediatric | 136.0 |
4 | 0d016955-26db-966c-3d52-26d441bfcb97 | 2014-08-18 12:52:06+00:00 | 2014 | Influenza, seasonal, injectable, preservative ... | 136.0 |
= (
df_joined_polars
df_timeline_polars
.join(
df_price_list_polars,=["vaccine_code", "vaccine_code_system"],
left_on=["code", "system"],
right_on="left"
how
)
)print(df_joined_polars.head())
shape: (5, 20)
┌────────────┬────────────┬────────┬────────────┬───┬────────────┬────────────┬────────────┬───────┐
│ patient_id ┆ social_sec ┆ prefix ┆ first_name ┆ … ┆ vaccine_co ┆ vaccine_co ┆ item_claim ┆ USD │
│ --- ┆ urity_numb ┆ --- ┆ --- ┆ ┆ de ┆ de_system ┆ ed ┆ --- │
│ str ┆ er ┆ str ┆ str ┆ ┆ --- ┆ --- ┆ --- ┆ f64 │
│ ┆ --- ┆ ┆ ┆ ┆ str ┆ str ┆ str ┆ │
│ ┆ str ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
╞════════════╪════════════╪════════╪════════════╪═══╪════════════╪════════════╪════════════╪═══════╡
│ 4fc244f3-2 ┆ 999-53-581 ┆ Mrs. ┆ Alyce744 ┆ … ┆ 121 ┆ CVX ┆ zoster ┆ 136.0 │
│ c0e-4017-d ┆ 3 ┆ ┆ ┆ ┆ ┆ ┆ vaccine, ┆ │
│ 64d-c2c4cd ┆ ┆ ┆ ┆ ┆ ┆ ┆ live ┆ │
│ 03… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 4fc244f3-2 ┆ 999-53-581 ┆ Mrs. ┆ Alyce744 ┆ … ┆ 140 ┆ CVX ┆ Influenza, ┆ 136.0 │
│ c0e-4017-d ┆ 3 ┆ ┆ ┆ ┆ ┆ ┆ seasonal, ┆ │
│ 64d-c2c4cd ┆ ┆ ┆ ┆ ┆ ┆ ┆ injectable ┆ │
│ 03… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ,… ┆ │
│ fca8d2ca-7 ┆ 999-24-859 ┆ null ┆ Marlen929 ┆ … ┆ 08 ┆ CVX ┆ Hep B, ┆ 136.0 │
│ aef-2c27-3 ┆ 9 ┆ ┆ ┆ ┆ ┆ ┆ adolescent ┆ │
│ bba-3f9472 ┆ ┆ ┆ ┆ ┆ ┆ ┆ or ┆ │
│ 30… ┆ ┆ ┆ ┆ ┆ ┆ ┆ pediatric ┆ │
│ 19936964-a ┆ 999-33-797 ┆ null ┆ Monique148 ┆ … ┆ 08 ┆ CVX ┆ Hep B, ┆ 136.0 │
│ 432-d501-2 ┆ 4 ┆ ┆ ┆ ┆ ┆ ┆ adolescent ┆ │
│ cd5-fa52db ┆ ┆ ┆ ┆ ┆ ┆ ┆ or ┆ │
│ 6b… ┆ ┆ ┆ ┆ ┆ ┆ ┆ pediatric ┆ │
│ 0d016955-2 ┆ 999-47-850 ┆ Ms. ┆ Tracy345 ┆ … ┆ 140 ┆ CVX ┆ Influenza, ┆ 136.0 │
│ 6db-966c-3 ┆ 0 ┆ ┆ ┆ ┆ ┆ ┆ seasonal, ┆ │
│ d52-26d441 ┆ ┆ ┆ ┆ ┆ ┆ ┆ injectable ┆ │
│ bf… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ,… ┆ │
└────────────┴────────────┴────────┴────────────┴───┴────────────┴────────────┴────────────┴───────┘
= """
query Select
*
from patient_timeline pt
left join price_list pl on pt.vaccine_code = pl.code and pt.vaccine_code_system = pl.system
limit 5;
"""
con_duckdb.sql(query).to_df()
patient_id | social_security_number | prefix | first_name | last_name | birthDate | code | system | organization_id | organization_name | ... | procedure_name | start_time | end_time | Vaccine_name | vaccine_code | vaccine_code_system | code_2 | system_2 | item_claimed | USD | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4fc244f3-2c0e-4017-d64d-c2c4cd03655f | 999-53-5813 | Mrs. | Alyce744 | Bergstrom287 | 1965-01-07 | 162673000 | SNOMED | b03b624d-c939-3688-986d-9555b8009a3b | HOLYOKE HEALTH CENTER INC | ... | General examination of patient (procedure) | 2015-01-08T07:11:47+01:00 | 2015-01-08T07:26:47+01:00 | zoster vaccine, live | 121 | CVX | 121 | CVX | zoster vaccine, live | 136.0 |
1 | 4fc244f3-2c0e-4017-d64d-c2c4cd03655f | 999-53-5813 | Mrs. | Alyce744 | Bergstrom287 | 1965-01-07 | 162673000 | SNOMED | b03b624d-c939-3688-986d-9555b8009a3b | HOLYOKE HEALTH CENTER INC | ... | General examination of patient (procedure) | 2015-01-08T07:11:47+01:00 | 2015-01-08T07:26:47+01:00 | Influenza, seasonal, injectable, preservative ... | 140 | CVX | 140 | CVX | Influenza, seasonal, injectable, preservative ... | 136.0 |
2 | fca8d2ca-7aef-2c27-3bba-3f94723012f5 | 999-24-8599 | None | Marlen929 | Greenholt190 | 2016-04-01 | 410620009 | SNOMED | 3f12ebb4-e03c-3453-88d2-4fc9682383df | NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC | ... | Well child visit (procedure) | 2016-04-01T04:09:48+02:00 | 2016-04-01T04:24:48+02:00 | Hep B, adolescent or pediatric | 08 | CVX | 08 | CVX | Hep B, adolescent or pediatric | 136.0 |
3 | 19936964-a432-d501-2cd5-fa52db6b9f41 | 999-33-7974 | None | Monique148 | Haley279 | 2022-01-07 | 410620009 | SNOMED | c6b019eb-28ec-36f6-abf3-bcc4d1d58966 | DUTTON FAMILY CARE ASSOCIATES LLP | ... | Well child visit (procedure) | 2022-01-07T02:19:02+01:00 | 2022-01-07T02:34:02+01:00 | Hep B, adolescent or pediatric | 08 | CVX | 08 | CVX | Hep B, adolescent or pediatric | 136.0 |
4 | 0d016955-26db-966c-3d52-26d441bfcb97 | 999-47-8500 | Ms. | Tracy345 | Smith67 | 2000-07-17 | 410620009 | SNOMED | 39c15c0f-5c49-311e-99d2-1fb99d80e06e | HARBOR HEALTH SERVICES INC | ... | Well child visit (procedure) | 2014-08-18T14:52:06+02:00 | 2014-08-18T15:07:06+02:00 | Influenza, seasonal, injectable, preservative ... | 140 | CVX | 140 | CVX | Influenza, seasonal, injectable, preservative ... | 136.0 |
5 rows × 22 columns
= df_timeline_IBIS.left_join(
df_joined_ibis
df_price_list_ibis, [== df_price_list_ibis.code,
df_timeline_IBIS.vaccine_code == df_price_list_ibis.system
df_timeline_IBIS.vaccine_code_system
]
).to_pandas() df_joined_ibis.head()
patient_id | social_security_number | prefix | first_name | last_name | birthDate | code | system | organization_id | organization_name | ... | procedure_name | start_time | end_time | Vaccine_name | vaccine_code | vaccine_code_system | code_right | system_right | item_claimed | USD | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4fc244f3-2c0e-4017-d64d-c2c4cd03655f | 999-53-5813 | Mrs. | Alyce744 | Bergstrom287 | 1965-01-07 | 162673000 | SNOMED | b03b624d-c939-3688-986d-9555b8009a3b | HOLYOKE HEALTH CENTER INC | ... | General examination of patient (procedure) | 2015-01-08T07:11:47+01:00 | 2015-01-08T07:26:47+01:00 | zoster vaccine, live | 121 | CVX | 121 | CVX | zoster vaccine, live | 136.0 |
1 | 4fc244f3-2c0e-4017-d64d-c2c4cd03655f | 999-53-5813 | Mrs. | Alyce744 | Bergstrom287 | 1965-01-07 | 162673000 | SNOMED | b03b624d-c939-3688-986d-9555b8009a3b | HOLYOKE HEALTH CENTER INC | ... | General examination of patient (procedure) | 2015-01-08T07:11:47+01:00 | 2015-01-08T07:26:47+01:00 | Influenza, seasonal, injectable, preservative ... | 140 | CVX | 140 | CVX | Influenza, seasonal, injectable, preservative ... | 136.0 |
2 | fca8d2ca-7aef-2c27-3bba-3f94723012f5 | 999-24-8599 | None | Marlen929 | Greenholt190 | 2016-04-01 | 410620009 | SNOMED | 3f12ebb4-e03c-3453-88d2-4fc9682383df | NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC | ... | Well child visit (procedure) | 2016-04-01T04:09:48+02:00 | 2016-04-01T04:24:48+02:00 | Hep B, adolescent or pediatric | 08 | CVX | 08 | CVX | Hep B, adolescent or pediatric | 136.0 |
3 | 19936964-a432-d501-2cd5-fa52db6b9f41 | 999-33-7974 | None | Monique148 | Haley279 | 2022-01-07 | 410620009 | SNOMED | c6b019eb-28ec-36f6-abf3-bcc4d1d58966 | DUTTON FAMILY CARE ASSOCIATES LLP | ... | Well child visit (procedure) | 2022-01-07T02:19:02+01:00 | 2022-01-07T02:34:02+01:00 | Hep B, adolescent or pediatric | 08 | CVX | 08 | CVX | Hep B, adolescent or pediatric | 136.0 |
4 | 0d016955-26db-966c-3d52-26d441bfcb97 | 999-47-8500 | Ms. | Tracy345 | Smith67 | 2000-07-17 | 410620009 | SNOMED | 39c15c0f-5c49-311e-99d2-1fb99d80e06e | HARBOR HEALTH SERVICES INC | ... | Well child visit (procedure) | 2014-08-18T14:52:06+02:00 | 2014-08-18T15:07:06+02:00 | Influenza, seasonal, injectable, preservative ... | 140 | CVX | 140 | CVX | Influenza, seasonal, injectable, preservative ... | 136.0 |
5 rows × 22 columns
= df_timeline_spark.join(df_price_list_spark,(df_timeline_spark['vaccine_code'] == df_price_list_spark['code']) & (df_timeline_spark['vaccine_code_system'] == df_price_list_spark['system']),'inner')
df_joined_pyspark
5).show() df_joined_pyspark.limit(
+--------------------+----------------------+------+----------+------------+-------------------+---------+------+--------------------+--------------------+--------------------+---------------+--------------------+-------------------+--------------------+--------------------+------------+-------------------+----+----+------+--------------------+-----+
| patient_id|social_security_number|prefix|first_name| last_name| birthDate| code|system| organization_id| organization_name| practitioner_name|practitioner_id| procedure_name| start_time| end_time| Vaccine_name|vaccine_code|vaccine_code_system|year|code|system| item_claimed| USD|
+--------------------+----------------------+------+----------+------------+-------------------+---------+------+--------------------+--------------------+--------------------+---------------+--------------------+-------------------+--------------------+--------------------+------------+-------------------+----+----+------+--------------------+-----+
|4fc244f3-2c0e-401...| 999-53-5813| Mrs.| Alyce744|Bergstrom287|1965-01-07 00:00:00|162673000|SNOMED|b03b624d-c939-368...|HOLYOKE HEALTH CE...|Dr. Bennett146 Ha...| 9999981894|General examinati...|2015-01-08 05:11:47|2015-01-08T07:26:...|zoster vaccine, live| 121| CVX|2015| 121| CVX|zoster vaccine, live|136.0|
|4fc244f3-2c0e-401...| 999-53-5813| Mrs.| Alyce744|Bergstrom287|1965-01-07 00:00:00|162673000|SNOMED|b03b624d-c939-368...|HOLYOKE HEALTH CE...|Dr. Bennett146 Ha...| 9999981894|General examinati...|2015-01-08 05:11:47|2015-01-08T07:26:...|Influenza, season...| 140| CVX|2015| 140| CVX|Influenza, season...|136.0|
|fca8d2ca-7aef-2c2...| 999-24-8599| NULL| Marlen929|Greenholt190|2016-04-01 00:00:00|410620009|SNOMED|3f12ebb4-e03c-345...|NEW BEDFORD INTER...|Dr. Homero668 Rol...| 9999962894|Well child visit ...|2016-04-01 00:09:48|2016-04-01T04:24:...|Hep B, adolescent...| 08| CVX|2016| 08| CVX|Hep B, adolescent...|136.0|
|19936964-a432-d50...| 999-33-7974| NULL|Monique148| Haley279|2022-01-07 00:00:00|410620009|SNOMED|c6b019eb-28ec-36f...|DUTTON FAMILY CAR...|Dr. Maren639 Aufd...| 9999950790|Well child visit ...|2022-01-07 00:19:02|2022-01-07T02:34:...|Hep B, adolescent...| 08| CVX|2022| 08| CVX|Hep B, adolescent...|136.0|
|0d016955-26db-966...| 999-47-8500| Ms.| Tracy345| Smith67|2000-07-17 00:00:00|410620009|SNOMED|39c15c0f-5c49-311...|HARBOR HEALTH SER...|Dr. Salvador46 Ho...| 9999977496|Well child visit ...|2014-08-18 10:52:06|2014-08-18T15:07:...|Influenza, season...| 140| CVX|2014| 140| CVX|Influenza, season...|136.0|
+--------------------+----------------------+------+----------+------------+-------------------+---------+------+--------------------+--------------------+--------------------+---------------+--------------------+-------------------+--------------------+--------------------+------------+-------------------+----+----+------+--------------------+-----+
%%R
<- df %>%
df_joined
left_join(
df_price_list,= c("vaccine_code" = "code", "vaccine_code_system" = "system")
by
) kable(head(df_joined))
|patient_id |social_security_number |prefix |first_name |last_name |birthDate | code|system |organization_id |organization_name |practitioner_name |practitioner_id |procedure_name |start_time |end_time |Vaccine_name |vaccine_code |vaccine_code_system |item_claimed | USD|
|:------------------------------------|:----------------------|:------|:----------|:------------|:----------|---------:|:------|:------------------------------------|:-----------------------------------------------|:--------------------------|:---------------|:------------------------------------------|:-------------------------|:-------------------------|:--------------------------------------------------|:------------|:-------------------|:--------------------------------------------------|---:|
|4fc244f3-2c0e-4017-d64d-c2c4cd03655f |999-53-5813 |Mrs. |Alyce744 |Bergstrom287 |1965-01-07 | 162673000|SNOMED |b03b624d-c939-3688-986d-9555b8009a3b |HOLYOKE HEALTH CENTER INC |Dr. Bennett146 Hartmann983 |9999981894 |General examination of patient (procedure) |2015-01-08T07:11:47+01:00 |2015-01-08T07:26:47+01:00 |zoster vaccine, live |121 |CVX |zoster vaccine, live | 136|
|4fc244f3-2c0e-4017-d64d-c2c4cd03655f |999-53-5813 |Mrs. |Alyce744 |Bergstrom287 |1965-01-07 | 162673000|SNOMED |b03b624d-c939-3688-986d-9555b8009a3b |HOLYOKE HEALTH CENTER INC |Dr. Bennett146 Hartmann983 |9999981894 |General examination of patient (procedure) |2015-01-08T07:11:47+01:00 |2015-01-08T07:26:47+01:00 |Influenza, seasonal, injectable, preservative free |140 |CVX |Influenza, seasonal, injectable, preservative free | 136|
|fca8d2ca-7aef-2c27-3bba-3f94723012f5 |999-24-8599 |NA |Marlen929 |Greenholt190 |2016-04-01 | 410620009|SNOMED |3f12ebb4-e03c-3453-88d2-4fc9682383df |NEW BEDFORD INTERNAL MEDICINE & GERIATRICS, LLC |Dr. Homero668 Rolón954 |9999962894 |Well child visit (procedure) |2016-04-01T04:09:48+02:00 |2016-04-01T04:24:48+02:00 |Hep B, adolescent or pediatric |08 |CVX |Hep B, adolescent or pediatric | 136|
|19936964-a432-d501-2cd5-fa52db6b9f41 |999-33-7974 |NA |Monique148 |Haley279 |2022-01-07 | 410620009|SNOMED |c6b019eb-28ec-36f6-abf3-bcc4d1d58966 |DUTTON FAMILY CARE ASSOCIATES LLP |Dr. Maren639 Aufderhar910 |9999950790 |Well child visit (procedure) |2022-01-07T02:19:02+01:00 |2022-01-07T02:34:02+01:00 |Hep B, adolescent or pediatric |08 |CVX |Hep B, adolescent or pediatric | 136|
|0d016955-26db-966c-3d52-26d441bfcb97 |999-47-8500 |Ms. |Tracy345 |Smith67 |2000-07-17 | 410620009|SNOMED |39c15c0f-5c49-311e-99d2-1fb99d80e06e |HARBOR HEALTH SERVICES INC |Dr. Salvador46 Homenick806 |9999977496 |Well child visit (procedure) |2014-08-18T14:52:06+02:00 |2014-08-18T15:07:06+02:00 |Influenza, seasonal, injectable, preservative free |140 |CVX |Influenza, seasonal, injectable, preservative free | 136|
|3a2311a7-eb30-2aae-59c0-55ae273d1581 |999-40-6084 |NA |Janene3 |Howe413 |2016-11-01 | 410620009|SNOMED |2e8b42f8-4300-30a9-a576-d513d1021a0a |GREATER LOWELL FAMILY PRACTICE PC |Dr. Homero668 Salcedo309 |9999959890 |Well child visit (procedure) |2016-11-01T04:38:02+01:00 |2016-11-01T04:53:02+01:00 |Hep B, adolescent or pediatric |08 |CVX |Hep B, adolescent or pediatric | 136|
Finishing up
When using duckdb it is very important to always close the database connection. If one is connected to the databas a wall is created to block anyone else from connecting at the same time. This prevents conflicts when more are manipulating the data at the same time.
#close connection to the duckdb database
con_duckdb.close()