Randy Reflects

How to Translate SQL Data Analysis Patterns into Pandas

Mar 7, 2026, 9:32 PM @ 📑 The Office

In my job as a data analyst, I do a lot of data manipulation and analysis using SQL on various platforms such as Snowflake. However, data often comes from disparate sources, so sometimes I need to perform analysis and validation outside of SQL. For those cases, my preferred tool is the Python library Pandas. In this short article, I demonstrate how a few common data analysis tasks can be done in both SQL and Pandas.

Assumptions made in this article

For this article, I’m assuming that you have a solid understanding of SQL and its analytic and aggregation functions. On the other hand, I’ll assume familiarity with Python but not with the Pandas library specifically.

A full tutorial on Pandas is out of scope for this article, but the Pandas library’s website has a ten-minute introduction, W3Schools has a tutorial series, and plenty of other material about the library is available online.

For starters, here’s my super short introduction.

# Usually, we import both pandas and numpy to start.
# They're both useful libraries that often get used together.
import pandas as pd
import numpy as np


# Import a data file (CSV in this case) into what Pandas
# calls a "dataframe" (df for short). A df is basically
# a table with a lot of powerful functionality built in.
df = pd.read_csv("sales_fact_table.csv")

# Show basic info about the df, including columns, null
# value counts, and the data type of each column
df.info()

# Show a single column or set of columns
df["txn_id"]
df[["txn_id", "payment_datetime"]]

# Filter the df based on a condition
my_filter = df["unit_price"] > 100
df[my_filter]
# Or as a one-liner
df[df["unit_price"] > 100]

The above examples will get you pretty far, but if you’re having trouble following, then I suggest looking at one of the tutorials I noted above or going to YouTube for a walk-through.
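One more building block worth knowing before we start: filter conditions can be combined with & (and), | (or), and ~ (not), with each condition wrapped in parentheses. Here’s a minimal sketch using a made-up dataframe (the column names are just for illustration):

```python
import pandas as pd

# A tiny made-up dataframe for illustration
df = pd.DataFrame({
    "unit_price": [50, 150, 200, 90],
    "qty": [1, 2, 1, 5],
})

# Combine conditions with & (and) and | (or);
# each condition needs its own parentheses
expensive_or_bulk = df[(df["unit_price"] > 100) | (df["qty"] >= 5)]

# Negate a condition with ~ (not)
cheap = df[~(df["unit_price"] > 100)]
```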

Checking for duplicate rows

Duplicate records in a dataset are a common obstacle in data analysis. In SQL, there are several ways to check for them. One common method is to use group by with having:

select txn_id, order_id, count(*) as ct
from sales_fact_table
where returned_dt is null
group by txn_id, order_id
having count(*) > 1

In this pattern, rows are grouped by txn_id and order_id, and the having clause keeps only the combinations that appear more than once. If you want to inspect each duplicated record individually rather than grouped, you can also use qualify with an aggregate or analytic function:

select *
from sales_fact_table
where returned_dt is null
qualify count(*) over (
    partition by txn_id, order_id
) > 1

With Pandas, this task takes only a single line of code. To see duplicate rows, you use the duplicated method on a dataframe. It has two optional parameters, subset and keep. The subset parameter lets you give a list of columns to consider when checking for duplicates. The keep parameter tells Pandas which occurrences to mark as duplicates: with "first" or "last", every occurrence except the first or last is marked as a duplicate; with keep=False, all occurrences are marked, which is what we want when inspecting them.

The first example below will show duplicates where every column is the same. The second example gives the same results as the qualify from the SQL query above.

sales = pd.read_csv("sales_fact_table.csv")
# Keep only non-returned sales (returned_dt is null)
sales = sales[sales["returned_dt"].isna()]

# Use Boolean indexing to show only rows with duplicates
sales[sales.duplicated(keep=False)]

# Show duplicates for combos of txn_id, order_id (same as SQL qualify above)
sales[sales.duplicated(subset=["txn_id", "order_id"], keep=False)]
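Once you’ve inspected the duplicates, removing them is just as short. The drop_duplicates method accepts the same subset and keep parameters and is roughly the Pandas counterpart of select distinct. A small sketch with made-up data:

```python
import pandas as pd

# A small made-up dataframe with one exact duplicate row
sales = pd.DataFrame({
    "txn_id": [1, 1, 2, 3],
    "order_id": [10, 10, 20, 30],
    "amount": [5.0, 5.0, 7.5, 9.0],
})

# Equivalent to: select distinct * from sales
deduped = sales.drop_duplicates()

# Keep one row per (txn_id, order_id) combination;
# keep="first" retains the first occurrence of each combo
one_per_combo = sales.drop_duplicates(subset=["txn_id", "order_id"], keep="first")
```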

Alternatively, you could use the transform method to accomplish the same task. This method behaves like a SQL window function. It works by first grouping the rows of the dataset and then applying the result of an aggregation function to each group. Instead of returning one row for each group, it maps the value of each group to the rows in the original dataframe.

The qualify version in Pandas would look like this:

# Group the rows; selecting one column keeps the result a simple Series
groups = sales.groupby(["txn_id", "order_id"])["txn_id"]
# Apply the window function with transform().
# transform("size") is equivalent to count(*) in SQL
count = groups.transform("size")
sales[count > 1]

# Or as a one-liner:
sales[sales.groupby(["txn_id", "order_id"])["txn_id"].transform("size") > 1]

The benefit of transform() is that it goes beyond just finding duplicates: you can use it to manipulate a dataframe much the same way you’d use a window function or qualify in SQL.
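For instance, transform can mirror a ratio-to-group-total calculation like amount / sum(amount) over (partition by region). A sketch with a made-up dataframe:

```python
import pandas as pd

# A made-up dataframe of order amounts by region
orders = pd.DataFrame({
    "region": ["east", "east", "west"],
    "amount": [30.0, 70.0, 50.0],
})

# Equivalent to: amount / sum(amount) over (partition by region)
orders["pct_of_region"] = (
    orders["amount"] / orders.groupby("region")["amount"].transform("sum")
)
```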

Row numbers and custom sorting

Another common SQL pattern involves using an analytic function such as row_number() to select a preferred row from a group of records. This next example shows how that pattern translates to Pandas.

In this scenario, a retailer’s POS systems create a new transaction row in the database for each change of status. There are three status codes: "UNPAID", "PENDING", and "PAID". To deduplicate transaction records, we select the latest row in order of status preference: "PAID", then "PENDING", then "UNPAID". As before, we need to exclude returned transactions. To illustrate this, I had ChatGPT cook up an example dataframe:

data = {
    "txn_id": [1, 1, 1, 2, 2, 3, 3],
    "txn_date": [
        "2026-01-01", "2026-01-02", "2026-01-03",
        "2026-01-01", "2026-01-04",
        "2026-01-01", "2026-01-02"
    ],
    "txn_status": ["UNPAID", "PENDING", "PAID", "UNPAID", "PENDING", "UNPAID", "UNPAID"],
    "txn_type": ["online", "online", "online", "store", "store", "online", "online"],
    "txn_amount": [50, 50, 50, 30, 30, 20, 20],
    "returned_dt": [None, None, None, None, None, "2026-01-03", None]
}
s = pd.DataFrame(data)
txn_id  txn_date    txn_status  txn_type  txn_amount  returned_dt
1       2026-01-01  UNPAID      online    50          NaN
1       2026-01-02  PENDING     online    50          NaN
1       2026-01-03  PAID        online    50          NaN
2       2026-01-01  UNPAID      store     30          NaN
2       2026-01-04  PENDING     store     30          NaN
3       2026-01-01  UNPAID      online    20          2026-01-03
3       2026-01-02  UNPAID      online    20          NaN

Notice that with the data structured this way, we can’t simply sum the txn_amount column to determine total sales. We first have to deduplicate by selecting a single representative row for each transaction.

The rows we want to capture are the "PAID" row for transaction 1 and the latest "PENDING" row for transaction 2 (transaction 3 is excluded because it was returned). To do this in SQL, first remove the transactions that have a return date. There are several ways to do that, but I’ve chosen to use qualify in the example below. Then, the case expression in the order by puts rows in the desired order. This illustrates one of the common uses of row_number() in SQL.

with latest_txn as (
    select
        *,
        row_number() over (
            partition by txn_id
            order by
                case
                    when txn_status = 'PAID'
                    then 0
                    when txn_status = 'PENDING'
                    then 1
                    when txn_status = 'UNPAID'
                    then 2
                end,
                txn_date desc
        ) as rn
    from sales_fact_table
    /* Remove all transactions that have any return record */
    qualify count(returned_dt) over (partition by txn_id) = 0
)
select txn_id, txn_date, txn_type, txn_amount
from latest_txn
where rn = 1

In Pandas, this follows slightly different logic. First, remove transactions with a return date. This groupby().transform() pattern is roughly equivalent to a SQL window function: x.isna().all() means "keep the group only if every returned_dt value in it is null." The .copy() at the end avoids a SettingWithCopyWarning when we add a column in the next step.

sold_txns = s[s.groupby("txn_id")["returned_dt"].transform(
    lambda x: x.isna().all()
)].copy()

Then, we create a column based on the priority for each status. This mirrors the case statement logic in the SQL version. The map() method looks up each row’s value in the priorities dictionary and returns the value that it finds, similar to a VLOOKUP in Excel.

priorities = {
    "PAID": 0,
    "PENDING": 1,
    "UNPAID": 2
}

sold_txns["priority"] = sold_txns["txn_status"].map(priorities)

Lastly, we sort the dataframe so that the desired row for each group is always first. Then, we group by transaction ID and take the first row from each group.

sold_txns_sorted = sold_txns.sort_values(
    ["priority", "txn_date"], ascending=[True, False])
latest_txns = sold_txns_sorted.groupby(
    "txn_id", as_index=False).first().drop(columns="priority")

Notice that in this version, we don’t actually need to create a row number because the sorting takes care of the ordering and selection logic for us. The example below does the whole operation in a single step using the assign() method, which creates columns that can be used in a method chain.

latest_txn = (
    sold_txns
    # Create a "priority" column based on the transaction status
    .assign(priority=lambda x: x["txn_status"].map(priorities))

    # Sort the values so that the row we want is first in each group
    .sort_values(["priority", "txn_date"], ascending=[True, False])

    # Make our groups and take the first value of each
    # column for each group. Drop the priority column
    # that we created above because it's no longer necessary.
    .groupby("txn_id", as_index=False)
    .first()
    .drop(columns=["priority"])
)

The result is just the two transactions and their latest statuses:

txn_id  txn_date    txn_status  txn_type  txn_amount  returned_dt
1       2026-01-03  PAID        online    50          NaN
2       2026-01-04  PENDING     store     30          NaN

In this example, the sales data is more like a sales-related event log than an actual record of sales. This technique of selecting a preferred row for each group enables us to perform aggregations on the data, for example by summing the total sales in a period.
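As a final step, aggregating the deduplicated result is an ordinary groupby or sum. A sketch that rebuilds the two-row result from above by hand and totals it:

```python
import pandas as pd

# The deduplicated two-row result from above, rebuilt by hand
latest_txn = pd.DataFrame({
    "txn_id": [1, 2],
    "txn_type": ["online", "store"],
    "txn_amount": [50, 30],
})

# Equivalent to: select sum(txn_amount) from latest_txn
total_sales = latest_txn["txn_amount"].sum()

# Equivalent to: select txn_type, sum(txn_amount)
#                from latest_txn group by txn_type
by_type = latest_txn.groupby("txn_type", as_index=False)["txn_amount"].sum()
```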

Conclusion

These two examples demonstrate how common data analysis techniques translate between SQL and the popular Pandas library. Once you get to know the library well, Pandas becomes an excellent tool for data analysis. While not every technique translates one-to-one between SQL and Pandas, there’s almost certainly a Pandas way to do whatever your SQL task is. Often, Pandas can do it in fewer lines of code, too.
