One of the hot new features in Notes in iOS 18 and macOS 15 Sequoia is the ability to record audio and generate a transcript. However, don’t give Apple innovation points since MacWhisper and Audio Hijack have been providing transcription on the Mac since 2023 by leveraging OpenAI’s open source Whisper speech-to-text technology. Still, with recording and transcription built into Notes, these capabilities are now within easy reach of millions of users.
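If you’re curious about the engine underneath MacWhisper and Audio Hijack, Whisper can also be driven directly from Python via OpenAI’s openai-whisper package. Here’s a rough sketch for the technically inclined; the model size and audio file name are placeholders, and none of the apps discussed in this article require you to touch code.

```python
# A minimal sketch of transcribing a file with OpenAI's open source Whisper,
# the engine MacWhisper and Audio Hijack build on. The file name is a
# placeholder; install the package with `pip install openai-whisper`.
import whisper

# Model sizes range from "tiny" to "large"; bigger models are slower but
# more accurate, much like the free vs. Pro model options in MacWhisper.
model = whisper.load_model("small")

result = model.transcribe("earnings-call.mp3")
print(result["text"])
```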
Exploring the Test Space
After the transcription feature appeared in Notes, I began recording Apple’s online presentations—earnings calls, product announcements, and more—because I often want to revisit precisely what was said when writing an article. I wanted to see how well Notes performed, but I soon realized I should compare it against Audio Hijack’s Transcribe block.
For Notes, I placed my iPhone next to one of my Mac’s external speakers, which seemed funky but captured clear audio, albeit with a bit of room echo. (Notes can’t create a transcript from imported audio.) Extracting the transcript from Notes was easy—tap the ••• button for commands to add the transcript to the note or copy it to the clipboard. With Audio Hijack, I created a simple workflow that captured audio from Safari, transcribed it to a text file, and saved an MP3 file.
The Notes transcript was easier to read because it inserted line breaks at pauses to form paragraphs. In contrast, Audio Hijack puts the entire transcript into one exceptionally long line. The lack of formatting made little difference in actual usage because I would always search for a specific word to learn what had been said.
That formatting made Notes seem more accurate at first, but when I examined the transcripts more closely, I got the feeling that Notes was making more mistakes than Audio Hijack. I also discovered that Notes synced the audio between my devices via iCloud, allowing me to trigger transcription separately on my M1 MacBook Air. How did that version stack up against the other two? Once I was comparing three versions, it seemed only fair to include MacWhisper, but the free version of MacWhisper is restricted to a small model, while the Pro version can access larger, more capable models. My friend Allison Sheridan uses MacWhisper Pro for her podcast transcripts and agreed to help me test.
At that point, I was beginning to feel overwhelmed. How could I possibly compare so many transcript versions, especially since I didn’t have a ground truth for any of Apple’s presentations? Then I recalled that NPR publishes transcripts for some of its shows. I quickly found an episode of NPR’s Short Wave podcast with a transcript. I set up Audio Hijack to transcribe it and placed my iPhone next to the speaker to record it in Notes. I also listened to the podcast and followed the official transcript.
Amusingly, the official transcript overlooked at least four important words and omitted (likely intentionally) several repeated phrases, such as when the guest said “you know” twice in rapid succession. So much for ground truth.
In the end, I had seven transcripts: one from Audio Hijack, two from Notes (iOS and macOS), and four from MacWhisper (every combination of the two audio recordings and two models).
Measuring Transcription Accuracy
Next came the question of how to assess the accuracy of these files. I started with ChatGPT since I had recently experienced success with it analyzing spreadsheets (see “ChatGPT Proves Useful for Data Exploration,” 20 January 2025). ChatGPT was happy to analyze the files, but I kept getting different numbers for missing words, extra words, mistaken words, punctuation mistakes, and capitalization errors. For tasks like this, ChatGPT provides a little link at the end of the response that displays the code it generated to conduct the analysis, and while it appeared sound, I had no way to verify that it was performing as I intended.
To evaluate ChatGPT’s approach, I created several small test files containing a known number of errors and ran them through ChatGPT’s analysis. The results were way off, and when I asked ChatGPT to list the mistakes, I found it was duplicating many of the results, often treating mistaken words as missing or extra words. During this process, ChatGPT referenced a measurement called Word Error Rate (WER), which is considered something of an industry standard. When I researched WER, I learned that it ignores punctuation, capitalization, whitespace, and line breaks. Instead, WER is defined as the sum of all word substitutions, deletions, and insertions in the transcript, divided by the total number of words in the reference file. Asking ChatGPT to apply WER brought its previously inconsistent results into a more sensible range.
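To make that definition concrete, here is a minimal Python sketch of a WER calculation using a word-level edit distance. It’s illustrative only—the calculators mentioned below likely normalize text more carefully—and the function name and toy strings are my own.

```python
# A minimal sketch of Word Error Rate: (substitutions + deletions + insertions)
# divided by the number of words in the reference. In practice you would first
# lowercase the text and strip punctuation, since WER ignores both.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()

    # Word-level edit distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i              # deleting every remaining reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j              # inserting every remaining hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Toy example: one wrong word in a five-word reference is a 20% WER.
print(word_error_rate("the quick brown fox jumps",
                      "the quick browne fox jumps"))  # 0.2
```

By this measure, the four words missing from the official transcript producing a 0.2% WER implies a reference of roughly 2,000 words.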
ChatGPT also directed me to two sites that provide WER calculators: Amberscript and Kensho. Unfortunately, although each was internally consistent, they usually disagreed with each other and with ChatGPT. I abandoned the search for a definitive answer and instead built a table to show the variability and average WER across all three tools. As you’ll see, the only point of agreement is that the four words missing from the official transcript worked out to a 0.2% WER.
| Transcription | Amberscript | Kensho | ChatGPT | Average |
| --- | --- | --- | --- | --- |
| Official transcript before correction | 0.2% | 0.2% | 0.2% | 0.2% |
| MacWhisper Pro (Large V2, AH audio) | 3.4% | 4.2% | 3.6% | 3.7% |
| MacWhisper (Small, iPhone audio) | 4.2% | 6.1% | 4.2% | 4.8% |
| MacWhisper Pro (Large V2, iPhone audio) | 4.3% | 7.0% | 4.3% | 5.2% |
| MacWhisper (Small, AH audio) | 5.5% | 5.8% | 5.5% | 5.6% |
| Audio Hijack (AH audio) | 6.2% | 6.7% | 5.9% | 6.2% |
| Notes (macOS, iPhone audio) | 6.0% | 7.2% | 6.6% | 6.6% |
| Notes (iOS, iPhone audio) | 7.3% | 8.7% | 8.2% | 8.1% |
Key Findings and Observations
From these test results, we can draw some broad conclusions:
- Accept imperfection: There is no such thing as a perfect transcription of unscripted speech. Even with NPR’s audio, which I assume was significantly cleaned up and edited, whoever created the official transcript chose to remove artifacts of human speech. Plus, they missed four words, which is interesting because none of the transcription tools missed them.
- Audio quality matters: The quality of the recording matters, though not necessarily in predictable ways. In general, audio recorded directly by Audio Hijack produced better transcripts, although the MacWhisper Small model in the free version performed better with the iPhone’s audio. Also, my test audio was particularly good. Unedited audio from non-professionals and recordings made in noisy or echoey environments won’t be nearly as clean, and transcripts of poor audio will suffer significantly.
- Cross-platform discrepancies with Notes: Notes does not perform identically across platforms, with the Mac version producing notably better results than the iPhone version. If you care about transcription accuracy, let a Notes recording on the iPhone sync to an Apple silicon Mac and transcribe it there.
- Challenges with names and proper nouns: All the transcription solutions stumble over uncommon names and proper nouns. NPR’s audio engineer on this podcast was Kwesi Lee, but the tools rendered his first name as “Kweisi,” “Quacy,” “Quasey,” and “Quas.”
- Handling of speech artifacts: Whisper automatically removes artifacts like “um” and “uh” for MacWhisper and Audio Hijack, which helps their accuracy. Notes transcribes such artifacts; stripping them out before scoring improves its accuracy numbers but doesn’t change the overall ranking.
- Importance of formatting: Since I usually search within the transcript, line breaks don’t make much difference for me, but they will for many other uses. Notes does fairly well at separating sentences into standalone paragraphs. MacWhisper can break text into segments in the free version, but the AI-powered cleanup to combine them into sentences requires the Pro version. Rogue Amoeba tells me that adding line breaks to transcripts is on the road map for Audio Hijack.
- Lack of speaker identification: None of these solutions attempt to assign speakers to the text, as do the transcription features built into videoconferencing tools like Zoom. Fathom also does an excellent job of identifying speakers in a transcript. Allison Sheridan has written about assigning speakers in ChatGPT and MacWhisper.
- Integration of synced text and audio: Another big win for Notes and MacWhisper is that they sync the text with the audio, so you can hear what’s being said while reading along. This feature makes it possible to identify any words that were particularly mangled in transcription.
- Varied language support: Audio Hijack and MacWhisper support about 100 languages because they use Whisper. Notes, on the other hand, is currently limited to English. However, accuracy in non-English languages may vary significantly from what I observed in my English-only testing.
Which Tool Should You Choose?
Ultimately, I’m a little unsatisfied. I had hoped that my testing would point toward an obvious solution.
Notes provides line breaks and syncs the text and audio. Most importantly, it’s free and available to all iPhones and iPads that can run iOS 18 and iPadOS 18, as well as Apple silicon Macs running macOS 15 Sequoia. However, it’s the least accurate, especially on the iPhone, and it requires that you record audio you can hear, meaning it will also pick up other people speaking in the room. I’d recommend it for those who need transcripts only occasionally and don’t care much about accuracy.
Audio Hijack is more accurate than Notes, and transcription comes on top of numerous other audio recording capabilities. However, it doesn’t provide line breaks, and if accuracy is important to you, MacWhisper is a better choice. You shouldn’t buy Audio Hijack for transcription alone, but the feature may be helpful if you already have the app for other needs. It costs $64 with a 20% discount for TidBITS members.
MacWhisper offers the best accuracy, syncs text and audio, and is focused on transcribing audio. While the free version works well enough with audio recorded elsewhere (such as Notes), if you’re serious about transcription, you’ll want the €49 Pro version (with discounts for students, journalists, and non-profits). It can record app audio, offers multiple models for speed and accuracy, can combine segments into sentences, supports batch transcription, transcribes YouTube videos, and much more.