Hydra + JupySQL: Plotting Large Datasets with Jupyter Notebook

Joe Sciarrino

CEO of Hydra

Jupyter is one of the most widely used tools for data exploration, so it's no surprise that data scientists spend most of their time in front of a Jupyter notebook slicing and dicing data. In many cases, the data lives in a Postgres database!

Traditionally, data scientists must pull data from Postgres with SQL and visualize it with Python libraries such as matplotlib. There are two big problems with this approach:

  • Postgres isn't optimized for analytical queries, and slow analytics can degrade the performance of a transactional workload running in the same database.
  • Moving data from Postgres to a local machine and loading it into memory with Python is a slow, manual process, and visualization is then limited by the memory available locally.
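To make the second problem concrete, here's a small synthetic sketch (the numbers are illustrative): aggregating a million-row table down to a per-group summary shrinks what has to cross the wire, and fit in local memory, by several orders of magnitude.

```python
import numpy as np
import pandas as pd

# Simulate a "large" table: 1M trips spread across 6 passenger counts
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "passenger_count": rng.integers(1, 7, size=1_000_000),
    "trip_distance": rng.exponential(2.7, size=1_000_000),
})

raw_bytes = df.memory_usage(deep=True).sum()

# The per-group summary is what an analytical query would return
agg = df.groupby("passenger_count")["trip_distance"].mean()
agg_bytes = agg.memory_usage(deep=True)

print(f"raw: {raw_bytes / 1e6:.0f} MB, aggregated: {agg_bytes} bytes")
```

Pushing the GROUP BY to the database means only the handful of aggregated rows ever leaves it.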

In this blog post, we'll demonstrate how to finally solve these issues with Hydra and JupySQL. Together, Hydra and JupySQL allow users to plot large-scale datasets by:

  1. Pushing the aggregation step to Hydra
  2. Running SQL from a notebook
  3. Using Hydra’s columnar storage, parallel execution, and vectorization on Postgres for speedy analytics

Let’s dive in!

What is JupySQL?

JupySQL brings better SQL support to Jupyter notebooks. It is a fork of ipython-sql. To install it, run pip install jupysql.

What is Hydra?

Hydra is building the best data warehouse for developers. Hydra is open source, simple, and speeds through the toughest analytics by adding columnar storage, vectorization, and query parallelization to Postgres. Check out the Hydra repo and add a Github star ⭐!

Tutorial

With JupySQL, we'll combine Jupyter notebooks and Hydra, enabling data scientists to query data in SQL directly from Jupyter. With Hydra's Postgres, hundreds or even thousands of gigabytes of data can be combined and leveraged. Locally, you can experiment with importing and combining datasets across Hydra's columnar tables, Postgres's built-in row tables, and external data via FDWs.

In the example below, we’ll use the NYC taxi data set, run analytics on average trip time, and efficiently create a histogram on 1.4M observations with Postgres.

Requirements

To run this tutorial, you need to install the following Python packages:

%pip install jupysql pandas pyarrow psycopg2-binary --quiet

To run the \copy command, you need pgspecial<2. We recommend getting it via conda install, since pip install might give you some issues.

%conda install "pgspecial<2" -c conda-forge -y --quiet

You also need Docker installed and running to start the Hydra instance.

Starting a Hydra instance

Let's fetch the Docker image and start the Hydra instance. The next command may take a couple of minutes, as it needs to pull the image from the registry:

docker run --name hydra -e POSTGRES_DB=db \
  -e POSTGRES_USER=user \
  -e POSTGRES_PASSWORD=password \
  -p 5432:5432 -d ghcr.io/hydradatabase/hydra

Data download

Now, let's fetch some sample data. We'll be using the NYC taxi dataset:

import pandas as pd

df = pd.read_parquet("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet")
print(f"Number of rows: {len(df):,}")

# we convert it to CSV so we can upload it using the \COPY postgres command
df.to_csv("taxi.csv", index=False)
Number of rows: 1,369,769

As you can see, this dataset contains ~1.4M rows.
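One detail worth flagging before uploading: index=False matters, because pandas otherwise writes the row index as an unnamed first column, which would misalign \copy against the table definition. A tiny sketch with a stand-in frame:

```python
import io

import pandas as pd

# Two columns standing in for the taxi data
tiny = pd.DataFrame({"trip_distance": [2.1, 0.2], "fare_amount": [8.0, 3.0]})

with_index, without_index = io.StringIO(), io.StringIO()
tiny.to_csv(with_index)                  # default: index becomes an extra first column
tiny.to_csv(without_index, index=False)  # columns match the table definition

print(with_index.getvalue().splitlines()[0])     # ,trip_distance,fare_amount
print(without_index.getvalue().splitlines()[0])  # trip_distance,fare_amount
```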

Uploading data to Hydra

We're ready to upload our data; we'll load the JupySQL extension and start the database connection:

%load_ext sql
%sql postgresql://user:password@localhost/db

Let's create the table; note that Hydra adds a USING columnar option to the CREATE TABLE statement, which will optimize storage for analytical queries.

CREATE TABLE "taxi" (
    "VendorID" DECIMAL NOT NULL,
    tpep_pickup_datetime TIMESTAMP WITHOUT TIME ZONE,
    tpep_dropoff_datetime TIMESTAMP WITHOUT TIME ZONE,
    passenger_count DECIMAL,
    trip_distance DECIMAL NOT NULL,
    "RatecodeID" DECIMAL,
    store_and_fwd_flag BOOLEAN,
    "PULocationID" DECIMAL NOT NULL,
    "DOLocationID" DECIMAL NOT NULL,
    payment_type DECIMAL NOT NULL,
    fare_amount DECIMAL NOT NULL,
    extra DECIMAL NOT NULL,
    mta_tax DECIMAL NOT NULL,
    tip_amount DECIMAL NOT NULL,
    tolls_amount DECIMAL NOT NULL,
    improvement_surcharge DECIMAL NOT NULL,
    total_amount DECIMAL NOT NULL,
    congestion_surcharge DECIMAL,
    airport_fee DECIMAL
) USING columnar;
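If you adapt this tutorial to another dataset, the DDL can be sketched from the DataFrame's dtypes. The helper below is hypothetical and its type mapping deliberately simplified (the table above was written by hand):

```python
import pandas as pd

def rough_ddl(df: pd.DataFrame, table: str) -> str:
    """Derive a rough columnar CREATE TABLE statement from a DataFrame."""
    type_map = {
        "float64": "DECIMAL",
        "int64": "DECIMAL",
        "datetime64[ns]": "TIMESTAMP WITHOUT TIME ZONE",
        "bool": "BOOLEAN",
    }
    cols = ",\n    ".join(
        f'"{name}" {type_map.get(str(dtype), "TEXT")}'
        for name, dtype in df.dtypes.items()
    )
    return f'CREATE TABLE "{table}" (\n    {cols}\n) USING columnar;'

sample = pd.DataFrame({
    "trip_distance": [2.1],
    "tpep_pickup_datetime": pd.to_datetime(["2021-01-01 00:30:10"]),
})
print(rough_ddl(sample, "taxi"))
```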

Let's now upload the data:

\copy taxi from 'taxi.csv' WITH DELIMITER ',' CSV HEADER;

Let's now query our data:

SELECT COUNT(*) FROM taxi
| count   |
| ------- |
| 1369769 |

We see that all ~1.4M rows are there. Let's take a look at the first rows:

SELECT * FROM taxi
LIMIT 3
| VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee |
| -------- | -------------------- | --------------------- | --------------- | ------------- | ---------- | ------------------ | ------------ | ------------ | ------------ | ----------- | ----- | ------- | ---------- | ------------ | --------------------- | ------------ | -------------------- | ----------- |
| 1        | 2021-01-01 00:30:10  | 2021-01-01 00:36:12   | 1.0             | 2.1           | 1.0        | False              | 142          | 43           | 2            | 8.0         | 3.0   | 0.5     | 0.0        | 0.0          | 0.3                   | 11.8         | 2.5                  | None        |
| 1        | 2021-01-01 00:51:20  | 2021-01-01 00:52:19   | 1.0             | 0.2           | 1.0        | False              | 238          | 151          | 2            | 3.0         | 0.5   | 0.5     | 0.0        | 0.0          | 0.3                   | 4.3          | 0.0                  | None        |
| 1        | 2021-01-01 00:43:30  | 2021-01-01 01:11:06   | 1.0             | 14.7          | 1.0        | False              | 132          | 165          | 1            | 42.0        | 0.5   | 0.5     | 8.65       | 0.0          | 0.3                   | 51.95        | 0.0                  | None        |

Hydra allows us to perform analytical queries efficiently. Let's compute the average trip distance given the passenger count:

SELECT
    passenger_count, AVG(trip_distance) AS avg_trip_distance
FROM taxi
GROUP BY passenger_count
ORDER BY passenger_count ASC
| passenger_count | avg_trip_distance      |
| --------------- | ---------------------- |
| 0.0             | 2.5424466811344758     |
| 1.0             | 2.6805563237138753     |
| 2.0             | 2.7948325921160876     |
| 3.0             | 2.7576410606577899     |
| 4.0             | 2.8681984015618327     |
| 5.0             | 2.6940995207308051     |
| 6.0             | 2.5745177825092658     |
| 7.0             | 11.1340000000000000    |
| 8.0             | 1.05000000000000000000 |
| None            | 29.6651257727346673    |
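Note the last row: SQL's GROUP BY keeps the NULL passenger counts as their own group. If you ever cross-check an aggregation like this in pandas, remember that groupby drops NaN keys by default; a small sketch:

```python
import numpy as np
import pandas as pd

trips = pd.DataFrame({
    "passenger_count": [1.0, 1.0, 2.0, np.nan, np.nan],
    "trip_distance": [2.0, 3.0, 4.0, 20.0, 40.0],
})

# Default pandas behaviour: NaN keys are silently dropped (2 groups)
print(trips.groupby("passenger_count")["trip_distance"].mean())

# dropna=False matches the SQL result above, which has a row for NULL (3 groups)
print(trips.groupby("passenger_count", dropna=False)["trip_distance"].mean())
```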

JupySQL comes with powerful plotting capabilities. Let's create a histogram of trip distance:

%sqlplot histogram --table taxi --column trip_distance --bins 50
<AxesSubplot: title={'center': "'trip_distance' from 'taxi'"}, xlabel='trip_distance', ylabel='Count'>
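Conceptually, a histogram over a large table only needs a min, a max, and one count per bin, so the heavy scan can stay in the database and only ~50 numbers return to the client. A rough sketch of the idea (not JupySQL's actual implementation):

```python
import numpy as np

def histogram_pushdown(values, bins=50):
    """Bin counts computed in one pass, as a database could do server-side."""
    lo, hi = min(values), max(values)   # SELECT MIN(col), MAX(col) FROM t
    width = (hi - lo) / bins or 1
    counts = [0] * bins                 # one COUNT(*) per bucket, GROUP BY bucket
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts

counts = histogram_pushdown(np.random.default_rng(0).exponential(2.7, 10_000), bins=50)
print(sum(counts))  # 10000: every observation lands in exactly one bin
```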

We cannot see much since there are some outliers. Let's find the 99th percentile:

SELECT percentile_disc(0.99) WITHIN GROUP (ORDER BY trip_distance)
FROM taxi
| percentile_disc |
| --------------- |
| 19.24           |
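percentile_disc is a discrete percentile: it returns an actual value from the column (the first one whose cumulative position reaches the given fraction), rather than interpolating between values. A small pure-Python sketch of the semantics:

```python
import math

def percentile_disc(values, fraction):
    """First element of the sorted data whose cumulative position >= fraction."""
    ordered = sorted(values)
    k = math.ceil(fraction * len(ordered)) - 1
    return ordered[max(k, 0)]

distances = [0.5, 1.2, 2.1, 3.3, 19.24, 150.0]
print(percentile_disc(distances, 0.99))  # 150.0 — an element of the data, not an interpolation
```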

Now, let's use this value as a cutoff. We'll store the query as a snippet named no_outliers (JupySQL's --save option) so that %sqlplot can reference it:

%%sql --save no_outliers
SELECT trip_distance
FROM taxi
WHERE trip_distance < 19.24
%sqlplot histogram --table no_outliers --column trip_distance --bins 50 --with no_outliers
<AxesSubplot: title={'center': "'trip_distance' from 'no_outliers'"}, xlabel='trip_distance', ylabel='Count'>


Much better! We just created a histogram of 1.4M observations!

Clean up

To finish the tutorial, let's shut down the container:

%%capture out
! docker container ls --filter ancestor=ghcr.io/hydradatabase/hydra --quiet

container_id = out.stdout.strip()
print(f"Container id: {container_id}")
Container id: fd21cec1f520

! docker container stop {container_id}
fd21cec1f520

! docker container rm {container_id}
fd21cec1f520

Package versions

For reproducibility, these are the package versions we used:

# jupysql
import sql; sql.__version__ # 0.5.3.dev
import pandas; pandas.__version__ # 1.5.2
import pyarrow; pyarrow.__version__ # 8.0.0
import psycopg2; psycopg2.__version__ # 2.9.3 (dt dec pq3 ext lo64)
import pgspecial; pgspecial.__version__ # 1.13.1

Conclusion

In this blog post, we walked through how to use Hydra and JupySQL to analyze and plot datasets. Used together, Hydra and JupySQL enable data practitioners to run fast analytical queries and plot large-scale datasets by pushing the aggregation steps to Hydra. If you want to learn more, check out Hydra and JupySQL documentation!
