Comparing Audio Transcription in Notes, Audio Hijack, and MacWhisper

Originally published at: Comparing Audio Transcription in Notes, Audio Hijack, and MacWhisper - TidBITS

One of the hot new features in Notes in iOS 18 and macOS 15 Sequoia is the ability to record audio and generate a transcript. However, don’t give Apple innovation points since MacWhisper and Audio Hijack have been providing transcription on the Mac since 2023 by leveraging OpenAI’s open-source Whisper speech-to-text technology. Still, with recording and transcription integrated into Notes, it has become much easier for millions of users to access these tools.

Exploring the Test Space

After the transcription feature appeared in Notes, I began recording Apple’s online presentations—earnings calls, product announcements, and more—because I often want to revisit precisely what was said when writing an article. I wanted to see how well Notes performed, but I soon realized I should compare it against Audio Hijack’s Transcribe block.

For Notes, I placed my iPhone next to one of my Mac’s external speakers, which seemed funky but captured clear audio, albeit with a bit of room echo. (Notes can’t create a transcript from imported audio.) Extracting the transcript from Notes was easy—tap the ••• button for commands to add the transcript to the note or copy it to the clipboard. With Audio Hijack, I created a simple workflow that captured audio from Safari, transcribed it to a text file, and saved an MP3 file.

The Notes transcript was easier to read because it inserted line breaks at pauses to form paragraphs. In contrast, Audio Hijack put the entire transcript into one exceptionally long line. The lack of formatting made little difference in actual usage because I would always search for a specific word to learn what had been said.

That formatting made Notes seem more accurate at first, but when I examined the transcripts more closely, I got the feeling that Notes was making more mistakes than Audio Hijack. I also discovered that Notes synced the audio between my devices via iCloud, allowing me to trigger transcription separately on my M1 MacBook Air. How did that version stack up against the other two? Once I was comparing three versions, it seemed only fair to include MacWhisper, but the free version of MacWhisper is restricted to a small model, while the Pro version can access larger, more capable models. My friend Allison Sheridan uses MacWhisper Pro for her podcast transcripts and agreed to help me test.

At that point, I was beginning to feel overwhelmed. How could I possibly compare so many transcript versions, especially since I didn’t have a ground truth for any of Apple’s presentations? Then I recalled that NPR publishes transcripts for some of its shows. I quickly found an episode of NPR’s Short Wave podcast with a transcript. I set up Audio Hijack to transcribe it and placed my iPhone next to the speaker to record it in Notes. I also listened to the podcast and followed the official transcript.

Amusingly, the official transcript overlooked at least four important words and omitted (likely intentionally) several repeated phrases, such as when the guest said “you know” twice in rapid succession. So much for ground truth.

In the end, I had seven transcripts: one from Audio Hijack, two from Notes (iOS and macOS), and four from MacWhisper, covering both audio sources and two different models.

Measuring Transcription Accuracy

Next came the question of how to assess the accuracy of these files. I started with ChatGPT since I had recently experienced success with it analyzing spreadsheets (see “ChatGPT Proves Useful for Data Exploration,” 20 January 2025). ChatGPT was happy to analyze the files, but I kept getting different numbers for missing words, extra words, mistaken words, punctuation mistakes, and capitalization errors. For tasks like this, ChatGPT provides a little link at the end of the response that displays the code it generated to conduct the analysis, and while it appeared sound, I had no way to verify that it was performing as I intended.

To evaluate ChatGPT’s approach, I created several small test files containing a known number of errors and ran them through ChatGPT’s analysis. The results were way off, and when I asked ChatGPT to list the mistakes, I found it was duplicating many of the results, often treating mistaken words as missing or extra words. During this process, ChatGPT referenced a measurement called Word Error Rate (WER), which is considered something of an industry standard. When I researched WER, I learned that it ignores punctuation, capitalization, whitespace, and line breaks. Instead, WER is defined as the sum of all word substitutions, deletions, and insertions in the transcript, divided by the total number of words in the reference file. Asking ChatGPT to apply WER brought its previously inconsistent results into a more sensible range.
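The WER definition above is simple enough to implement directly with a word-level Levenshtein alignment. Here is a minimal Python sketch; note that the normalization step (deciding what counts as a "word") is my own assumption, and differences in normalization are a likely reason the various calculators disagree:

```python
import re

def normalize(text):
    # WER ignores punctuation, capitalization, and whitespace,
    # so reduce the text to a plain list of lowercase words.
    return re.findall(r"[a-z0-9']+", text.lower())

def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in reference,
    computed with a standard word-level edit-distance alignment."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # i deletions to match an empty hypothesis
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # j insertions to match an empty reference
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution/match
    return dist[len(ref)][len(hyp)] / len(ref)

# One wrong word out of five reference words yields a 20% WER.
print(word_error_rate("So much for ground truth.",
                      "So much for grand truth"))  # → 0.2
```

Real WER tools differ mainly in that normalization step (handling of numerals, contractions, filler words, and so on), which is consistent with the spread of results in the table below.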

ChatGPT also directed me to two sites that provide WER calculators: Amberscript and Kensho. Unfortunately, although each was internally consistent, they usually disagreed with each other and with ChatGPT. I abandoned the search for a definitive answer and instead built a table to show the variability and average WER across all three tools. As you’ll see, the only point of agreement is that the four words missing from the official transcript worked out to a 0.2% WER.

| Transcription | Amberscript | Kensho | ChatGPT | Average |
| --- | --- | --- | --- | --- |
| Official transcript before correction | 0.2% | 0.2% | 0.2% | 0.2% |
| MacWhisper Pro (Large V2, AH audio) | 3.4% | 4.2% | 3.6% | 3.7% |
| MacWhisper (Small, iPhone audio) | 4.2% | 6.1% | 4.2% | 4.8% |
| MacWhisper Pro (Large V2, iPhone audio) | 4.3% | 7.0% | 4.3% | 5.2% |
| MacWhisper (Small, AH audio) | 5.5% | 5.8% | 5.5% | 5.6% |
| Audio Hijack (AH audio) | 6.2% | 6.7% | 5.9% | 6.2% |
| Notes (macOS, iPhone audio) | 6.0% | 7.2% | 6.6% | 6.6% |
| Notes (iOS, iPhone audio) | 7.3% | 8.7% | 8.2% | 8.1% |

Key Findings and Observations

From these test results, we can draw some broad conclusions:

  • Accept imperfection: There is no such thing as a perfect transcription of unscripted text. Even with NPR’s audio, which I assume was significantly cleaned up and edited, whoever created the official transcript chose to remove artifacts of human speech. Plus, they missed four words, which is interesting because none of the other apps missed them.
  • Audio quality matters: The quality of the recording matters, though not necessarily in predictable ways. In general, audio recorded directly by Audio Hijack produced better transcripts, although the MacWhisper Small model in the free version performed better with the iPhone’s audio. Also, my test audio was particularly good. Unedited audio from non-professionals and recordings made in environments that have ambient noise or echo won’t be nearly as clean. Transcripts will significantly suffer from poor audio.
  • Cross-platform discrepancies with Notes: Notes does not perform identically across platforms, with the Mac version producing notably better results than the iPhone version. If you care about transcription accuracy, let a Notes recording on the iPhone sync to an Apple silicon Mac and transcribe it there.
  • Challenges with names and proper nouns: All the transcription solutions stumble over uncommon names and proper nouns. NPR’s audio engineer on this podcast was Kwesi Lee, but the tools rendered his first name as “Kweisi,” “Quacy,” “Quasey,” and “Quas.”
  • Handling of speech artifacts: Whisper automatically removes artifacts like “um” and “uh” for MacWhisper and Audio Hijack, which helps their accuracy. Notes does transcribe such artifacts, and although removing them helps its accuracy ratings, it doesn’t change the overall order.
  • Importance of formatting: Since I usually search within the transcript, line breaks don’t make much difference for me, but they will for many other uses. Notes does fairly well at separating sentences into standalone paragraphs. MacWhisper can break text into segments in the free version, but the AI-powered cleanup to combine them into sentences requires the Pro version. Rogue Amoeba tells me that adding line breaks to transcripts is on the road map for Audio Hijack.
  • Lack of speaker identification: None of these solutions attempt to assign speakers to the text, as do the transcription features built into videoconferencing tools like Zoom. Fathom also does an excellent job of identifying speakers in a transcript. Allison Sheridan has written about assigning speakers in ChatGPT and MacWhisper.
  • Integration of synced text and audio: The other big win of Notes and MacWhisper is that they can sync the text and the audio so you can hear what’s being said while reading it. This feature enables you to identify any words that were particularly mangled in transcription.
  • Varied language support: Audio Hijack and MacWhisper support about 100 languages because they use Whisper. Notes, on the other hand, is currently limited to English. However, accuracy in non-English languages may vary significantly from what I observed in my English-only testing.

Which Tool Should You Choose?

Ultimately, I’m a little unsatisfied. I had hoped that my testing would point toward an obvious solution.

Notes provides line breaks and syncs the text and audio. Most importantly, it’s free and available on all iPhones and iPads that can run iOS 18 and iPadOS 18, as well as on Apple silicon Macs running macOS 15 Sequoia. However, it’s the least accurate, especially on the iPhone, and it requires that you record audio you can hear, meaning it will also pick up other people speaking in the room. I’d recommend it for those who need transcripts only occasionally and don’t care much about accuracy.

Audio Hijack is more accurate than Notes, and transcription comes on top of numerous other audio recording capabilities. However, it doesn’t provide line breaks, and if accuracy is important to you, MacWhisper is a better choice. You shouldn’t buy Audio Hijack for transcription alone, but the feature may be helpful if you already have the app for other needs. It costs $64 with a 20% discount for TidBITS members.

MacWhisper offers the best accuracy, syncs text and audio, and is focused on transcribing audio. While the free version works well enough with audio recorded elsewhere (such as Notes), if you’re serious about transcription, you’ll want the €49 Pro version (with discounts for students, journalists, and non-profits). It can record app audio, offers multiple models for speed and accuracy, can combine segments into sentences, supports batch transcription, transcribes YouTube videos, and much more.

 


I’ve been using Audio Hijack to transcribe a few episodes of the BBC’s ‘In Our Time’ podcast. Having realised that the macOS podcast app (in Sonoma) offers transcriptions with line breaks, I’m now going straight to that text.

The Apple transcription is certainly no worse than the Audio Hijack version and I am very glad to have the line breaks.

The pace of development in audio transcription is astonishing. I am, however, unable to drink the Kool-Aid. You cannot accept these transcriptions as an accurate record; there are just too many errors. There are errors in human transcriptions, too, but in general you are paying for those transcriptions and so have more reason to challenge them when their results are imperfect.

Yes, when for $19.95 you can get an idea of what was said it seems wonderful and useful for a quick take but the wonderfulness disappears when a few missed words means you completely misunderstood what was said.

Buyer beware.

Dave


The few times I have needed transcription (I interview families at conventions with a lot of background noise as well as particular terms used), I used Aiko, which is available for macOS and iOS and has made my life a lot simpler for my blog. I always listen to the interview while looking at the transcription and usually only have to make a few corrections. As I am using a handheld recorder, it will take the file and transcribe it from that.


Yes, that’s been around for a while.

I wonder how it compares, given that it uses Apple server-side technology rather than any client-side processing. I would hope that it is better.

There’s no question about the number of errors—that’s why I calculated the WER. But I think it’s more helpful to think about whether the transcript, with all its errors, is useful to you. For my use, which is verifying statements in spoken text, it’s massively useful and the lack of accuracy is irrelevant because I can (and do) listen to the synced-up audio once I’ve found the desired text. For me, the transcript is an audio navigation tool, not a standalone text.

Others’ uses may vary, but it’s hard to imagine a situation where AI-driven transcription isn’t helpful, at least as a starting point.

How would that be helpful? Identifying human-generated errors would be incredibly tedious and itself error-prone. (I know, I listened to the entire podcast twice while reading the official transcript, and I found more errors on the second pass.) Plus, it would be so vastly more expensive that it would be used only for the highest-value use cases like court transcripts.


I completely agree. In particular because you’re a professional editor and reporter who knows what he’s doing.

There are still human-staffed transcription services because there are all sorts of places, not just courts, where transcripts must be accurate. I am sure they’re using voice recognition software to save vast quantities of time (like professional translators now do) but they do provide surety to their clients who willingly pay for it.

What I’m worried about are non- or sorta-tech-savvy people who assume the transcription is correct because a computer did it. They could get into all sorts of trouble without realizing it, like, harrumph, the brain-dead lawyers who submitted ChatGPT-generated briefs to courts. :crazy_face:

It’s like AI-generated image masking. “Oooh cool! Wow! That’s fast! I’m sending that off to Joey right now! He’ll :rofl:!” Send it to a professional publication and it will get sent right back, because the masking, although surprising in what it captures, is pretty shoddy.

Dave

I use Transcribe, a funky little app that runs on my 10.12 Desktop. It works surprisingly well on imported files. It also displays the text synced to the audio, which is really helpful for final editing.

I’ve used Transcribeme for years, and it’s great. They charge, but it’s essentially micro charge, something like 6 or 7 cents per minute.

Every time I do a comparison to other tools, it comes out on top, though I haven’t tested lately. And it gets better and better. It’s been a year or more since I had to refer back to a recording to sort out some problem (I.e. it doesn’t get everything right, but it’s so close that I can easily figure out the actual text).

They offer human transcription too, so you need to choose machine transcription. And the UI sucks. But I’m incredibly happy with it.

They have an iOS app, but my flow fwiw is to use RecUp for voice recording, which automatically sends to Dropbox. IFTTT notices the addition and emails me a reminder to transcribe, so when I get home I upload to transcribeme.com and grab the transcript. At some point I’ll switch to the iOS app, but whenever I use it I find it annoying.

On my iPhone (English operating system, but Siri speaks Italian), Notes can transcribe both in English and in Italian.

Just to let you know.

Gabriele

Well, that’s interesting. Apple will need to update its iOS 18 manual.

I’m a professional writer and reporter, and I have been using otter.ai for transcribing interviews, while taking notes using Nisus. I then go through the transcription to correct errors. I’m very careful to get the facts straight, but I don’t need a word-by-word 100% accurate transcription. Excellent article! I’ll add a few observations.

Most interviews include some hesitant vowels, repeated words and other mis-speaks. I’m very aware of that because I do all of those and I think I sound awful when I speak because of them. Professional speakers are much better, but otherwise you can’t get a clean interview without editing out something.

Do not set the transcription to skip silent periods. At least for Otter, that can cut off the beginnings and ends of words, which can wreak havoc with numbers. For example, what one speaker said as “two point five billion years” came out as “two billion years.” I caught it because I checked the transcription right after the interview and the number was an important fact that I knew. I would not have caught it otherwise, because Otter had cut off the “point five” from the recording it played back for me to check the transcript.

I’ll echo other comments on the importance of checking spelling of names.

Transcription software can be very helpful in clarifying the speech of someone with an accent. My aging ears need extra time to process accented speech, so I often miss words or can’t keep up, but the transcriptions were invaluable. To check the transcript, I slow down the recording to make sure the transcript is correct.

I have used Auphonic and found it quite amazing, certainly for sweetening audio captured in less than ideal situations. You can use it for free up to a limit, at least long enough to check its efficacy. If you feed it a two way interview it automagically transcribes indicating Speaker 1, speaker 2 etc.

It can also delete hums & ha’s, errs, silences etc. Incredible wot we can do these days.

L

Honestly, since Nuance abandoned the Mac, there really hasn’t been a good voice-to-text application. I used Dragon Dictate Medical, which did an amazingly good job right out of the box, with the minor annoyance of confusing the colon character (:) with that part of the intestine.

@iflatow I’m curious if you know more about how NPR creates its transcripts.

Apple’s Voice Memos on the iPhone offers a transcript that can be viewed in the app or copied.

I haven’t had the chance to try it yet, but I see that MacWhisper Pro 12 (released March 13, 2025) now includes automatic speaker recognition. From the release notes:

Automatic Speaker Recognition! Finally! Automatically recognise speakers in your recordings using local models. To use it, make sure you select a model that supports speaker recognition (WhisperKit). After your transcription is complete it will automatically be grouped by speaker. We’re still working on improvements so let us know what you think! (Pro)
