JupySQL is a Python library that lets you run SQL in Jupyter notebooks and the IPython shell.
In this guide, we're going to learn how to query data using chDB and JupySQL.
We're going to use one of Jeff Sackmann's tennis_atp dataset, which contains metadata about players and their rankings over time.
Let's start by downloading the rankings files:
from urllib.request import urlretrieve
files = ['00s', '10s', '20s', '70s', '80s', '90s', 'current']
base = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master"
for file in files:
_ = urlretrieve(
f"{base}/atp_rankings_{file}.csv",
f"atp_rankings_{file}.csv",
)
Now we're going to store the data from these CSV files in a table.
The default database doesn't persist data on disk, so we need to create another database first:
%sql CREATE DATABASE atp
And now we're going to create a table called rankings whose schema will be derived from the structure of the data in the CSV files:
%%sql
CREATE TABLE atp.rankings
ENGINE=MergeTree
ORDER BY ranking_date AS
SELECT * REPLACE (
toDate(parseDateTime32BestEffort(toString(ranking_date))) AS ranking_date
)
FROM file('atp_rankings*.csv')
SETTINGS schema_inference_make_columns_nullable=0
And then create a table called players based on the content of the CSV file.
We'll also clean up the dob field so that its a Date32 type.
In ClickHouse, the Date type only supports dates from 1970 onwards. Since the dob column contains dates from before 1970, we'll use the Date32 type instead.
%%sql
CREATE TABLE atp.players
Engine=MergeTree
ORDER BY player_id AS
SELECT * REPLACE (
makeDate32(
toInt32OrNull(substring(toString(dob), 1, 4)),
toInt32OrNull(substring(toString(dob), 5, 2)),
toInt32OrNull(substring(toString(dob), 7, 2))
)::Nullable(Date32) AS dob
)
FROM file('atp_players.csv')
SETTINGS schema_inference_make_columns_nullable=0
Once that's finished running, we can have a look at the data we've ingested:
%sql SELECT * FROM atp.players LIMIT 10
+-----------+------------+-----------+------+------------+-----+--------+-------------+
| player_id | name_first | name_last | hand | dob | ioc | height | wikidata_id |
+-----------+------------+-----------+------+------------+-----+--------+-------------+
| 100001 | Gardnar | Mulloy | R | 1913-11-22 | USA | 185 | Q54544 |
| 100002 | Pancho | Segura | R | 1921-06-20 | ECU | 168 | Q54581 |
| 100003 | Frank | Sedgman | R | 1927-10-02 | AUS | 180 | Q962049 |
| 100004 | Giuseppe | Merlo | R | 1927-10-11 | ITA | 0 | Q1258752 |
| 100005 | Richard | Gonzalez | R | 1928-05-09 | USA | 188 | Q53554 |
| 100006 | Grant | Golden | R | 1929-08-21 | USA | 175 | Q3115390 |
| 100007 | Abe | Segal | L | 1930-10-23 | RSA | 0 | Q1258527 |
| 100008 | Kurt | Nielsen | R | 1930-11-19 | DEN | 0 | Q552261 |
| 100009 | Istvan | Gulyas | R | 1931-10-14 | HUN | 0 | Q51066 |
| 100010 | Luis | Ayala | R | 1932-09-18 | CHI | 170 | Q1275397 |
+-----------+------------+-----------+------+------------+-----+--------+-------------+
Data ingestion is done, now it's time for the fun part - querying the data!
Tennis players receive points based on how well they perform in the tournaments they play.
The points for each player over a 52 week rolling period.
We're going to write a query that finds the maximum points accumulate by each player along with their ranking at the time:
%%sql
SELECT name_first, name_last,
max(points) as maxPoints,
argMax(rank, points) as rank,
argMax(ranking_date, points) as date
FROM atp.players
JOIN atp.rankings ON rankings.player = players.player_id
GROUP BY ALL
ORDER BY maxPoints DESC
LIMIT 10
We can save queries using the --save parameter on the same line as the %%sql magic.
The --no-execute parameter means that query execution will be skipped.
%%sql --save best_points --no-execute
SELECT name_first, name_last,
max(points) as maxPoints,
argMax(rank, points) as rank,
argMax(ranking_date, points) as date
FROM atp.players
JOIN atp.rankings ON rankings.player = players.player_id
GROUP BY ALL
ORDER BY maxPoints DESC
When we run a saved query it will be converted into a Common Table Expression (CTE) before executing.
In the following query we compute the maximum points achieved by players when they were ranked 1:
We can also use parameters in our queries.
Parameters are just normal variables:
rank = 10
And then we can use the {{variable}} syntax in our query.
The following query finds the players who had the least number of days between when they first had a ranking in the top 10 and last had a ranking in the top 10:
%%sql
SELECT name_first, name_last,
MIN(ranking_date) AS earliest_date,
MAX(ranking_date) AS most_recent_date,
most_recent_date - earliest_date AS days,
1 + (days/7) AS weeks
FROM atp.rankings
JOIN atp.players ON players.player_id = rankings.player
WHERE rank <= {{rank}}
GROUP BY ALL
ORDER BY days
LIMIT 10
JupySQL also has limited charting functionality.
We can create box plots or histograms.
We're going to create a histogram, but first let's write (and save) a query that computes the rankings within the top 100 that each player has achieved.
We'll be able to use this to create a histogram that counts how many players achieved each ranking:
%%sql --save players_per_rank --no-execute
select distinct player, rank
FROM atp.rankings
WHERE rank <= 100
We can then create a histogram by running the following: