Originally published at: Help Build a Tool To Track Apple Support Page Changes - TidBITS
It’s possible to write software that can traverse the entire universe of Apple support pages. Wouldn’t it be helpful if there were a tool that would tell you which pages were new or had changed?
Possible, certainly; worthwhile, I’m not so sure. That is a LOT of links to traverse, with multiple references to the same page, so you would need to keep track of pages you’ve already scanned… I also wonder what would happen if a bunch of people started using it at the same time.
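The bookkeeping for already-scanned pages is the usual "seen set" used in any crawler. Here is a minimal sketch, assuming the link structure is already available as a dictionary (a stand-in for fetching and parsing real pages); the function name `crawl_order` and the `links` mapping are made up for illustration.

```python
from collections import deque


def crawl_order(start, links):
    """Breadth-first traversal that visits every reachable page exactly once.

    `links` maps a URL to the list of URLs it links to. It is a hypothetical
    stand-in for fetching a page and extracting its links.
    """
    seen = {start}          # every URL that has been queued at least once
    queue = deque([start])  # URLs waiting to be scanned
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for target in links.get(url, []):
            if target not in seen:  # skip pages referenced more than once
                seen.add(target)
                queue.append(target)
    return order
```

Because a URL is added to `seen` when it is queued rather than when it is scanned, a page linked from many places is still fetched only once.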
That’s why I suggest storing the pages in a local database for easy comparison. The system would scan Apple’s site on a regular schedule, perhaps once per day, following robots.txt rules, but all users accessing it would be dealing with the local database, not Apple’s site. Apart from general considerations when building an interactive website, I don’t see this as having any unusual usage concerns.
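As a rough illustration of the local-database idea, here is a minimal sketch using SQLite from Python's standard library. The table layout, the function names `open_store` and `record_page`, and the choice to compare SHA-256 hashes are all assumptions made for the example, not a description of any existing tool.

```python
import hashlib
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the snapshot database. The schema is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, sha256 TEXT NOT NULL, body TEXT NOT NULL)"
    )
    return db


def record_page(db, url, body):
    """Save the latest copy of a page and report 'new', 'changed', or 'unchanged'."""
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    row = db.execute("SELECT sha256 FROM pages WHERE url = ?", (url,)).fetchone()
    if row is None:
        status = "new"
    elif row[0] != digest:
        status = "changed"
    else:
        return "unchanged"  # nothing to write
    db.execute(
        "INSERT OR REPLACE INTO pages (url, sha256, body) VALUES (?, ?, ?)",
        (url, digest, body),
    )
    db.commit()
    return status
```

A daily scan would call `record_page` once per fetched page, and the website would read only from this database. Keeping old versions for diffs would need an extra history table, which this sketch leaves out.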
The idea of writing software to track every change across Apple’s support pages may seem beneficial on the surface, but it’s likely to be a massive waste of resources for several reasons:
- Scale and Complexity: The sheer number of support pages Apple maintains means this tool would be continuously overburdened with trivial updates, like minor wording tweaks, drowning users in unnecessary notifications.
- Redundancy with AI: AI advancements are already making such tasks obsolete. Within a few years, AI could handle this far more efficiently, rendering such a tool redundant.
- Misguided Focus: Instead of tracking endless updates, focusing on significant content like key policy changes or new feature rollouts would be far more valuable. A tool dedicated to capturing every single change risks becoming a noise generator rather than a genuinely useful resource.
Investing time and energy into a project like this, given the direction of AI and the complexity involved, could be more trouble than it’s worth. A more strategic approach would involve leveraging AI to filter and highlight only the most critical updates.
Well, you’ve certainly got more experience with large-scale websites than I. The few I’ve done were quite small, the kind you knock out in a text editor. It still seems like a lot of work for a small return to me.
Possibly, but none of your reasons are entirely convincing. #1 is possible but remains to be seen; #2 hasn’t happened yet; #3 is what TidBITS already does with its articles (also, this strikes me as similar to the old remark about MS Word – that no one uses more than 20% of its features, but it’s a different 20% for each person. People are going to have wildly different ideas about what’s important and what’s not).
As I said, I think this would be of interest mostly to Apple sysadmins and consultants, along with journalists like me, not a general-purpose site that everyday users would track. But I’d be able to use the results to inform articles that would surface information that otherwise remains nearly undiscoverable.
It is true that Apple makes small wording changes to its pages on occasion, as you can see if you compare versions in the Wayback Machine. But just as the Wayback Machine can distinguish between small and large changes, any system that can do a diff can do the same.
So it would be easy to give people the option to be notified of all changes or only large changes. A sufficiently flexible system could also let people pick and choose their topics—I could easily see someone caring only about changes affecting macOS and not visionOS, for instance.
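To make the small-versus-large distinction concrete, here is a minimal sketch using Python's standard `difflib`. It assumes the page text has already been extracted from the HTML, and the 0.95 threshold and the function name `classify_change` are arbitrary choices for illustration that would need tuning against real pages.

```python
import difflib


def classify_change(old_text, new_text, threshold=0.95):
    """Return 'none', 'small', or 'large' based on word-level similarity.

    `threshold` is an assumed cutoff: a similarity ratio at or above it
    counts as a minor wording tweak.
    """
    if old_text == new_text:
        return "none"
    matcher = difflib.SequenceMatcher(
        None, old_text.split(), new_text.split(), autojunk=False
    )
    return "small" if matcher.ratio() >= threshold else "large"
```

A notification setting of "large changes only" would then simply skip pages classified as `small`.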
As to your point about AI, today’s AI chatbots wouldn’t help here because they have limited insight into recently changed pages. Even AI-based search engines like Perplexity aren’t useful (I checked) because they retrieve only a small number of pages to answer queries, and they don’t have the ability to direct their searches with metadata such as modification date.
Have you looked at Airtable.com?
Perhaps check the published date at the bottom of each page, and process the page only if that date has changed.
The headers returned give you a lot of information, in particular Cache-Control, ETag, Expires, and Last-Modified. For example:
$ curl --head -s 'https://support.apple.com/en-us/108382' | sort | uniq
Access-Control-Allow-Headers: origin
Access-Control-Max-Age: 1
Cache-Control: no-siteapp
Cache-Control: public, no-transform, max-age=1571
Connection: keep-alive
Content-Language: en-US
Content-Security-Policy: default-src 'self' blob: data: *.apple.com; connect-src 'self' *.apple.com *.apple.com.cn; script-src 'self' 'unsafe-inline' 'unsafe-eval' *.apple.com; img-src 'self' data: *.apple.com; child-src 'self' support.apple.com apple.com km.support.apple.com; style-src 'self' 'unsafe-inline' *.apple.com; font-src 'self' data: *.apple.com
Content-Type: text/html;charset=utf-8
Date: Thu, 08 Aug 2024 23:28:16 GMT
ETag: a56T1AULltDRsJuly24JD4Z110=====--gzip
Expires: Thu, 08 Aug 2024 23:54:27 GMT
Host: support-shd-mdn.corp.apple.com
Host: support.apple.com
Last-Modified: Fri, 12 Apr 2024 19:12:33 GMT
Referrer-Policy: no-referrer-when-downgrade
SS-Article-Version: 2.0.15.0
Server: Apple
Strict-Transport-Security: max-age=31536000; includeSubdomains
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
And by doing a HEAD request (which is what curl --head does in that example), you get just the headers. After looking at them if you decide you want the whole page, then you can do a GET.
(SS-Article-Version looks interesting too, but it’s a nonstandard header, so I don’t know exactly how Apple uses it.)
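Here is a minimal sketch of how a script might use those headers to decide whether a GET is worth doing. It assumes the ETag and Last-Modified values from the previous scan were saved somewhere; the function names `parse_headers` and `needs_refetch` are made up for the example, and whether Apple's ETag values stay stable across servers and encodings is something that would have to be tested.

```python
def parse_headers(raw):
    """Turn 'Name: value' lines (as printed by curl --head) into a dict.

    Names are lowercased; if a header repeats, the last value wins.
    """
    headers = {}
    for line in raw.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # ignore lines without a colon, such as the status line
            headers[name.strip().lower()] = value.strip()
    return headers


def needs_refetch(stored, current):
    """Compare saved validators with fresh HEAD headers.

    Prefers ETag, falls back to Last-Modified, and asks for a fetch when
    the two sets of headers have no validator in common.
    """
    for name in ("etag", "last-modified"):
        if name in stored and name in current:
            return stored[name] != current[name]
    return True
```

With the headers shown above saved from one scan, a later HEAD response carrying the same ETag would make `needs_refetch` return `False`, so the page body would not need to be downloaded.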
I have, but it looked like a more-involved platform than I wanted to learn for a one-off project.
Interesting! I think that SS-Article-Version is what the Neptyne guy was able to access too. I could never figure out Apple’s system surrounding those numbers.
From Apple’s terms of use:
…no part of the Site and no Content may be copied, reproduced, republished, uploaded, posted, publicly displayed, encoded, translated, transmitted or distributed in any way (including “mirroring”) to any other computer, server, Web site or other medium for publication or distribution or for any commercial enterprise, without Apple’s express prior written consent…
…You may not use any “deep-link”, “page-scrape”, “robot”, “spider” or other automatic device, program, algorithm or methodology, or any similar or equivalent manual process, to access, acquire, copy or monitor any portion of the Site or any Content, or in any way reproduce or circumvent the navigational structure or presentation of the Site or any Content, to obtain or attempt to obtain any materials, documents or information through any means not purposely made available through the Site. Apple reserves the right to bar any such activity…
That might present something of a barrier to what you propose.
Of course, if we take Apple’s terms of use at their word, we’re in violation by posting them here:
Except as expressly provided in these Terms of Use, no part of the Site and no Content may be copied, reproduced, republished, uploaded, posted, publicly displayed, encoded, translated, transmitted or distributed in any way (including “mirroring”) to any other computer, server, Web site or other medium for publication or distribution or for any commercial enterprise, without Apple’s express prior written consent.
You may use information on Apple products and services (such as data sheets, knowledge base articles, and similar materials) purposely made available by Apple for downloading from the Site, provided that you (1) not remove any proprietary notice language in all copies of such documents, (2) use such information only for your personal, non-commercial informational purpose and do not copy or post such information on any networked computer or broadcast it in any media, (3) make no modifications to any such information, and (4) not make any additional representations or warranties relating to such documents.
Interestingly, here’s what support.apple.com’s robots.txt says. It doesn’t seem perturbed about standard support articles.
# robots for Inquira throttling
User-agent: IQ-WWW
Request-rate: 5/1 #maximum rate is 5 pages per second
User-agent: Baiduspider
Allow: /zh-cn
Allow: *viewlocale=zh_CN*
Allow: *locale=zh_CN*
Allow: /zh_CN
Disallow: /
User-agent: *
Disallow: /kb/index?*page=search*
Disallow: *src=support_app*
Disallow: /*/docs/product/*
Disallow: /docs/product/*
Disallow: */MANUALS/*.pdf$
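For anyone who wants a crawler to check these rules automatically, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules below are an abbreviated copy of the file quoted above, and the user-agent name `TrackerBot` is hypothetical. One caveat: as far as I know, the standard-library parser matches rules as plain path prefixes and does not implement the `*` and `$` wildcard extensions, so the wildcard rules in Apple's file would need a different matcher to be honored properly.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the rules quoted above (not the full file).
ROBOTS_TXT = """\
User-agent: IQ-WWW
Request-rate: 5/1

User-agent: Baiduspider
Allow: /zh-cn
Disallow: /

User-agent: *
Disallow: /kb/index?*page=search*
Disallow: /docs/product/*
"""


def make_parser(text):
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    parser = make_parser(ROBOTS_TXT)
    article = "https://support.apple.com/en-us/108382"
    # "TrackerBot" is a made-up agent name; it falls under the "*" rules.
    print(parser.can_fetch("TrackerBot", article))
    print(parser.can_fetch("Baiduspider", article))
```

A real crawler would load the live file with `set_url()` and `read()` instead of a pasted copy, and would re-read it periodically in case the rules change.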
Ron, you just violated the TOS by posting them here.
(DOUBLE EDIT: I mean, come on, that suggests that linking to Apple violates their TOS)
EDIT: Ninja’d!
I’ll be turning myself in to the Apple police, then. I’m so ashamed.
Yes, this seems very doable (I have 25+ years of experience in web applications, though it’s been a few years since I wrote code to crawl sites regularly). It might violate Apple’s terms for access to its pages, but by those rules, so does any browser that caches pages (spoiler: browsers almost invariably store and reproduce content from sites via caching) (however, I am not a lawyer). It’s not super hard to do something like this, but it does take some work to tread softly. Also, diffing HTML pages can be rather complicated (though there are often viable “shortcuts” or workarounds).
One concern is that after building a tool to do this work, Apple might block the tool (via robots.txt or via legal means), which would render the effort a waste (I really don’t know if Apple would bother—I hardly think it’s worth their time, so I think the likelihood of problems is low).
With a smart intern and a good part-time advisor, I think this could be built in 1-2 months (not including showing visual diffs of versions of pages). I suspect that a small cloud virtual machine could fetch updates on at least a weekly basis (not sure about daily without knowing average latency as well as how many URLs need to be accessed; ETags and Last-Modified headers let you make conditional requests that can help a lot). Anyone have a “smart intern” available looking for this kind of work?
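For reference, a conditional request just means sending the saved validators back to the server and treating a 304 Not Modified response as "nothing to do". Here is a minimal sketch with Python's standard `urllib`; the function names are made up, and whether Apple's servers actually answer 304 for these pages is an assumption that would need checking.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request headers from validators saved on the previous scan."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag  # sent back exactly as received
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None, timeout=30):
    """Return (body, etag, last_modified), or None if the server says 304."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib reports 304 as an HTTPError
            return None
        raise
```

When most pages are unchanged, each check costs one small round trip with no body, which is what makes a daily or weekly pass over many URLs plausible on a small machine.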
Yes, it absolutely would be useful. In the past, Apple Support had an RSS feed, but at some point it just stopped updating, and then eventually it got 404’ed. Indeed most changes were trivial, and multilingual, making following that feed quite tedious at times; nevertheless, it was really awesome.
Unfortunately I don’t think replicating this can ever be as good as that feed, and as noted you run the risk of hitting some anti-user/anti-scraping limit imposed by Apple’s web servers trying to make it work. Nevertheless I’d be interested in such a thing, if anyone ever figures out the schema the site uses, and how we could comprehensively scrape the support base.
Off-topic.
I am firmly convinced that more than 99% of organizations expect users not to read their terms of service. The immediate result is that I am firmly convinced that those organizations do not care if users adhere to terms of service. I am not a lawyer, this is not legal advice, and I think organizations should rein in their lawyers so that terms of service are easily readable.
Agreed, this is a job for RSS or, even better, Atom XML. I subscribe to feeds with alerts, firmware updates, etc. all the time. It would be immensely useful to pull any Apple Support article updates.
But Apple needs to implement it and doing so properly takes some effort. Do it wrong and the feed is less useful. Plenty of broken or useless feeds out there.
Attempt to DIY with a scraping app and you risk Apple’s wrath. If such an app became popular, it would be akin to a DDoS attack. If you do code something, don’t share it, and don’t download every article, not to mention the copyright issues.
Dunno about the technical side of things, but have to say that this would be an awesome idea, even for us norm users.
So go for it! ;-)
Sounds like a great project for a college coding group. Anyone have any connections with CS departments looking for projects? I can try to reach out to people at Cornell, but none of my friends are in quite the right area.