Originally published at: ChatGPT Proves Useful for Data Exploration - TidBITS
Sometimes, when asked questions about data generated by tech systems I’ve created, I feel like I’m channeling Dr. McCoy in the original Star Trek. “Dammit, Jim, I’m a writer, not a statistician!” I have a firm grasp on integer math but a tenuous one on statistics, data analysis, and related concepts.
That ignorance—and, let’s be honest, lack of interest in learning skills in that field—caused me to turn to ChatGPT the other day for answers to some questions that would have taken me far too long to figure out any other way. I recently subscribed to ChatGPT Plus, which provides more extensive access to GPT-4o and better handling of file uploads, making this kind of data analysis possible. Other AI assistants like Claude offer similar capabilities.
Understanding Referral Sources
Here’s the problem: The Finger Lakes Runners Club puts on 25 races per year, and the sheer number of races, combined with minor data inconsistencies, makes large-scale analysis of the registration data daunting. For instance, for some of our races, we ask where registrants learned about the race, giving them a set of choices. Our intent is to learn which of our promotional efforts are the most successful.
I can easily imagine how to build a spreadsheet with separate sheets for each race and use lookups to count the number of each answer for each race. However, actually building such a spreadsheet would take hours, particularly troubleshooting the funky lookup formulas. It hasn’t happened. My recent realization that ChatGPT can analyze CSV and Excel files pointed to another solution.
First, I uploaded my files by dragging them into a new chat. (It could take only ten at a time, so I had to upload in batches.) I asked ChatGPT to create a table counting the number of times each item in the Source column appears across all the race registrations. Magically, it did! But once I could see it, the table wasn’t quite what I had in mind. It included an index column and lumped all the races together. I changed gears slightly, asking ChatGPT to remove the index column and add per-race data to the rows in the table. Its first try confused the rows and columns, but asking it to transpose the data fixed that. A few more commands gave me a Total row and column, an alphabetically sorted list of races, and shorter column names.
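For the technically curious: ChatGPT’s data analysis feature runs Python with the pandas library behind the scenes, as the dataframe name in its error message later in this article hints. Here’s a rough sketch of what my request probably boils down to. The registrations folder, the Source column name, and taking the race name from the filename are all my assumptions, not the club’s actual setup.

```python
# A rough pandas equivalent of "count each Source value for every race."
# Folder, filenames, and column names are my guesses, not the real ones.
from pathlib import Path
import pandas as pd

frames = []
for path in Path("registrations").glob("*.csv"):
    df = pd.read_csv(path)
    df["Race"] = path.stem                 # use the filename as the race name
    frames.append(df[["Race", "Source"]])

all_regs = pd.concat(frames, ignore_index=True)

# Rows are races, columns are referral sources, values are counts.
table = pd.crosstab(all_regs["Race"], all_regs["Source"])

# The Total row and column I asked ChatGPT to add.
table.loc["Total"] = table.sum()
table["Total"] = table.sum(axis=1)
print(table)
```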
I confirmed a few of the numbers and calculations by cross-checking them against my original spreadsheets and doing spot-checks with simple formulas. The overall goal was also sufficiently unimportant that I wouldn’t have been all that perturbed by a few minor mistakes. Unfortunately, apart from some generalities about which approaches were the most common and a few outliers—the large “From a Friend” count suggests that Turkey Trot works largely by word of mouth—the data was hard to interpret.
Next, I asked ChatGPT to switch from raw numbers to percentages, which allowed me to compare the relative effectiveness of each promotional channel more easily. However, examining each number in the table still required significant effort, so I had ChatGPT create a visualization. It suggested a stacked bar chart and produced the one below.
Frankly, this chart is exactly what I need to pass on to the club’s Communications Team so they can evaluate what they’re doing. (A few of the bars aren’t quite at 100% due to an earlier rounding step, but again, I’m interested in the forest here, not a few twigs.)
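If you ever want to reproduce this step without an AI intermediary, the percentage conversion and the stacked bar chart are only a few lines of pandas and matplotlib. This sketch continues from the hypothetical table in the earlier example, dropping its Total row and column first.

```python
# Convert the earlier crosstab to per-race percentages and draw a stacked
# bar chart; "table" comes from the previous sketch.
import matplotlib.pyplot as plt

counts = table.drop(index="Total", columns="Total")
percentages = counts.div(counts.sum(axis=1), axis=0) * 100   # rows sum to 100%

ax = percentages.plot(kind="bar", stacked=True, figsize=(10, 6))
ax.set_ylabel("Share of registrations (%)")
ax.set_title("Referral sources by race")
ax.legend(title="Source", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
plt.show()
```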
Evaluating Late Race Registrations
Buoyed by the relative ease of extracting the referral source data, I tackled a more complex problem. Over the past few years, we’ve seen a significant uptick in late race registrations, with many people registering in the last day or two before a race. Not having a feel for how large a race will be until the last few days causes uncertainty and stress for race directors, most notably when estimating how much food to buy for post-race snacks.
We’ve played with various techniques for encouraging earlier registrations, including pleas to register earlier, price increases, and early cutoffs, but none have moved the needle. People apologize for being a pain, but they don’t stop. Our races aren’t expensive enough that price increases make much difference. Early cutoffs generate anguished emails from those who missed out and people showing up at the race begging for day-of-race registration, both of which ultimately cause more stress and work. Since we’re unwilling to employ draconian policies inappropriate for a volunteer-led community club focused on inclusivity, late registrants have become a fact of our modern world.
However, knowing how many people will likely register in the last week would help us estimate food needs better and avoid worrying that insufficient PR has been done. Since I had already uploaded all those race registration spreadsheets to ChatGPT, I decided to see if it could help us visualize the percentage of runners who sign up late.
Since not all races had registrations on race day, and that date couldn’t be calculated any other way, I had to feed ChatGPT a list of the actual race dates. After that, I got it to create bar charts that showed how many people signed up the day of a race, one day before, two days before, and so on. As you can see, some races are much more heavily weighted to late registrations (Turkey Trot, left) than others (Skunk Cabbage, right).
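For anyone who wants to replicate the calculation, the days-before-race math is simple once you have the race dates. In this sketch, the Registration Date column name and the race dates are placeholders of mine, and the real race_dates dictionary would need an entry for every race.

```python
# Compute how many days before race day each person registered.
# Column name and dates below are placeholders, not the club's data.
from pathlib import Path
import pandas as pd

race_dates = {
    "Turkey Trot": "2024-11-28",     # placeholder dates
    "Skunk Cabbage": "2024-04-07",
}

def days_before(path, race_name):
    df = pd.read_csv(path)
    reg_day = pd.to_datetime(df["Registration Date"]).dt.normalize()
    return (pd.Timestamp(race_dates[race_name]) - reg_day).dt.days

# Counts of sign-ups 0, 1, 2, ... days out, ready to plot as a bar chart.
turkey = days_before(Path("registrations/Turkey Trot.csv"), "Turkey Trot")
turkey.value_counts().sort_index().plot(
    kind="bar", title="Turkey Trot: registrations by days before race")
```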
However, when I asked ChatGPT to combine all the races, convert the raw numbers to percentages of the whole, and condense the days into weeks, the resulting chart was stark—over 50% of people sign up in the last week.
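That combined chart boils down to pooling the days-before values from every race, dividing by seven to get weeks, and converting the counts to percentages. Here’s a sketch that builds on the hypothetical days_before() helper above.

```python
# Pool every race, bin days into weeks, and convert counts to percentages.
# Reuses the days_before() helper and race_dates placeholders from above.
from pathlib import Path
import pandas as pd

all_days = pd.concat(
    [days_before(p, p.stem) for p in Path("registrations").glob("*.csv")],
    ignore_index=True)

weeks_before = all_days // 7                  # week 0 = the final seven days
week_percent = weeks_before.value_counts(normalize=True).sort_index() * 100

ax = week_percent.plot(kind="bar")
ax.set_xlabel("Weeks before race day")
ax.set_ylabel("Share of registrations (%)")
```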
I would typically have stopped here, but I was curious if ChatGPT could suggest other helpful ways of examining the information. It offered a handful, including a cumulative line chart, a heatmap, a stacked area chart, and a percentage stacked bar chart.
Only the cumulative line chart made sense to me, and even there, I had trouble internalizing what the slope of the line meant. So, I had ChatGPT change the chronological order so that the race day was on the right side of the graph and the line increased over time. I also asked it to switch from raw numbers to percentages again, which produced graphs like this for Turkey Trot and Skunk Cabbage.
Now we’re talking! For Turkey Trot, which gets a lot of late registrations, I can easily see that 50% of registrants will sign up in the last three days. However, the 50% mark for Skunk Cabbage comes about two months beforehand. I plan to share these charts with the various race directors so they can understand what’s likely to happen with registrations this year.
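If you’re curious how such a chart comes together, the cumulative view is just a running total over the days-before counts, expressed as a percentage of the field. This sketch reuses the hypothetical turkey Series from the days-before example.

```python
# Cumulative share of the field registered by each days-before value.
# "turkey" is the hypothetical days-before Series from the earlier sketch.
counts = turkey.value_counts().sort_index(ascending=False)
cumulative_pct = counts.cumsum() / counts.sum() * 100

ax = cumulative_pct.plot(title="Turkey Trot: cumulative registrations")
ax.set_xlabel("Days before race day")
ax.set_ylabel("Cumulative share of registrations (%)")
ax.invert_xaxis()    # race day (day 0) on the right, so the line climbs to 100%
```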
To be clear, I wouldn’t have thought of doing a cumulative line chart on my own, and even if I had, I’m not sure I would have been able to construct these charts. Certainly not without hours of work, rather than the 15 or 20 minutes I spent giving ChatGPT directions.
The Extreme Oddity of an AI Assistant
My retelling of how I arrived at these charts glosses over all the back and forth it took. In most cases, I’d get something shockingly impressive after a prompt or two. Within minutes, I’d have tables or charts that looked reasonable.
However, even if I didn’t care about extreme accuracy, the numbers needed to be roughly correct. Several times, when I spot-checked a number, it was utterly wrong. That also happens in spreadsheets I build, but instead of figuring out where I’d gone wrong in a formula, I just told ChatGPT to fix the mistake. I had to repeat myself on a few occasions until the numbers aligned with what I had uploaded.
Most of my time was spent on little things. While working on the table of referral sources, I wanted to change how it was sorted. That was easily accomplished, but ChatGPT also sorted the Total row and column rather than leaving them at the bottom and right. It had no conception that they were different from the data and should remain in those positions, though it was happy to put them back there when I complained. It also used decimal notation in some places, even though everything was a whole number, so I had to tell it to stick with integers.
Also, when building the graphs, I had to make ChatGPT use integers on both the X and Y axes because the data did not contain half-registrations or half-days. It stumbled, too, over the registration dates including times—it was starting down the timezone rabbit hole—but I short-circuited that problem by telling it to ignore the times. That alone was vastly easier than fighting with date formats in a spreadsheet, which I find almost entirely inscrutable.
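For reference, both of those fixes are one-liners if you’re doing the work yourself in pandas and matplotlib: normalize the timestamps to drop the time of day, and ask for integer tick marks. The timestamps below are invented for illustration.

```python
# Drop the time-of-day from registration timestamps and force integer ticks.
import pandas as pd
from matplotlib.ticker import MaxNLocator

timestamps = pd.Series(pd.to_datetime([
    "2024-11-25 14:32:00", "2024-11-26 09:05:00", "2024-11-26 23:59:00"]))
reg_days = timestamps.dt.normalize()          # keep the date, drop the time

ax = reg_days.value_counts().sort_index().plot(kind="bar")
ax.yaxis.set_major_locator(MaxNLocator(integer=True))  # whole registrations only
```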
Even stranger was having to change formatting with prompts. Changing column titles, chart titles, and even the location of the legend required that I explain what I wanted rather than just doing it myself.
I seldom work with people on this sort of thing, so most of my iterative work happens in my head without ever being externalized. Upon reflection, though, the experience wasn’t all that dissimilar from when Josh Centers first developed the financial graphs we use when covering Apple’s quarterly reports. He would show me a graph, and I would ask him to change the color, tweak a title, or adjust the legend spacing. A few times, when I asked why the graphs didn’t look right, he found that a formula had gone awry and was calculating the numbers wrong. ChatGPT made much weirder errors than Josh, but it was also much faster to respond to requests like “Change the graphs so the line increases over time rather than decreasing.”
I did encounter a couple of unexpected snags. First, like a cat in a tree, ChatGPT has trouble turning around. If I sent it down a path that I later decided was a dead end, telling ChatGPT to go back and try a different approach often failed.
I think the problem lies in ChatGPT’s memory limits. It processes instructions in the current thread but doesn’t retain the full history of operations indefinitely. For example, once I told ChatGPT to switch from raw numbers to percentages, the raw numbers were no longer part of its context, making it difficult to revert. Without a mechanism to “go back,” earlier stages of the data were essentially erased.
The only solution was to start over in a new chat, which required repeating all the desired prompts up to the point where I wanted to try a different approach. It might have been possible to copy those prompts from the previous chat and combine them into a single prompt in the new chat, but I didn’t try that.
The second snag was that ChatGPT would get “tired” after a while. I apologize for the anthropomorphizing, but things it had done without issue earlier eventually became problematic. It started forgetting about some of the races I had uploaded, and when I complained, it made up data entirely. Eventually, it threw in the towel, saying:
It seems the working dataset for registrations (filtered_corrected_days_before_all_df) is no longer available. To accurately recalculate the total number of registrations, I’ll need to reload and process the original registration files. Could you please re-upload the necessary files or confirm how you’d like to proceed?
I suspect that this is the same problem as trying to get it to revert to an earlier result and start over. With a limited context, too much of my data had been tweaked, munged, and transmogrified to continue the conversation in any useful fashion. In other words, ChatGPT was like the little boy in The Far Side cartoon who says, “Mrs. Johnson, may I be excused? My brain is full.”
Despite some snags, generating tables and graphs by simply uploading data and prompting ChatGPT felt like a glimpse into the future. Once I adapted my approach to interacting with ChatGPT—framing directions clearly and anticipating iterative revisions as I would when working with a person—the process became surprisingly efficient. For anyone faced with analyzing large datasets, ChatGPT offers a compelling alternative to traditional tools. I encourage you to experiment with it the next time you want to explore or visualize complex data quickly.