Getting a (complete) web page into PDF

dredfearn · December 11, 2022, 6:55pm

I got a new camera (a Sony A7RV) and have been reading web articles about setting up and configuring the camera, which is amazingly complex. These web articles have lots of embedded images, which are essential to reading the document. When I switch to the Reader and try to print to a pdf most of these embedded images are missing - lots of big empty blank areas in the resulting pdf. (The text survives.) The save to pdf results in the same problem. Safari is my default browser, but the same issue exists with Chrome. Does anyone know of a way to get complete pdf documents from these web pages? I never used to have this problem, but I have noticed it became acute after Big Sur arrived. I can certainly use the web pages, but a (complete) pdf would be really nice.

cwilcox · December 11, 2022, 7:53pm

It would help if you provided an example article.

It may be that the images are not in img elements but as CSS background images; Safari has a “Print backgrounds” checkbox in the Print dialog.

A site could have a CSS ‘print’ stylesheet that doesn’t display the images when a page is printed. Maybe it’s to discourage exactly what you’re trying to do so people will keep revisiting and they can make more money from ad impressions.
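One way to check the print-stylesheet theory is to save the site's CSS and scan it for `@media print` rules that hide things. The following is only a rough sketch using the Python standard library: the function names are made up for illustration, the list of "hiding" declarations is an assumption about what such a stylesheet might contain, and the regex-based matching is not a real CSS parser (for example, it would also match `@media not print`).

```python
import re

# Declarations that commonly make content disappear when printing.
# This list is an assumption, not an exhaustive catalogue.
HIDING = re.compile(
    r"display\s*:\s*none|background(-image)?\s*:\s*none|visibility\s*:\s*hidden",
    re.IGNORECASE,
)


def print_blocks(css):
    """Return the bodies of @media blocks whose query mentions 'print'."""
    blocks = []
    for match in re.finditer(r"@media[^{]*\bprint\b[^{]*\{", css):
        depth = 1
        start = i = match.end()
        # Walk forward until the brace that opened the @media block is closed.
        while i < len(css) and depth > 0:
            if css[i] == "{":
                depth += 1
            elif css[i] == "}":
                depth -= 1
            i += 1
        end = i - 1 if depth == 0 else i
        blocks.append(css[start:end])
    return blocks


def hiding_rules(css):
    """Return (selector, declaration) pairs in print blocks that hide content."""
    found = []
    for body in print_blocks(css):
        for rule in re.finditer(r"([^{}]+)\{([^{}]*)\}", body):
            selector = rule.group(1).strip()
            for declaration in rule.group(2).split(";"):
                if HIDING.search(declaration):
                    found.append((selector, declaration.strip()))
    return found
```

If `hiding_rules` reports a selector that matches the images or their containers, the print stylesheet is the likely culprit; an empty result points toward one of the other explanations.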

You can take a screenshot of the entire webpage instead; that link is to instructions for how to do it in Safari, but it’s basically the same in Chrome and Firefox (in Firefox, you can right-click in the middle of the page to get the screenshot tool).

dredfearn · December 11, 2022, 8:36pm

OK, here is one of the pages I wanted to “print” to pdf:

I actually found a web site that seems to work to create a complete pdf of the web page - https://webtopdf.com - though it doesn’t reproduce the page layout exactly.

I was hoping for an app or a Safari extension.

David

dianed143 · December 11, 2022, 9:12pm

Probably overkill for you, but Devonthink has a webarchive feature, which usually works. I have noticed some “save as PDF” pages through DT have been blank as well and haven’t had time to troubleshoot it.

Diane

paal · December 11, 2022, 9:17pm

Acrobat Pro can download websites as PDF. You can get a free trial that works for a week. Sometimes it works, sometimes it gets chaotic. I have used it now and then to get a manual to print.

davbro · December 11, 2022, 9:28pm

Not sure if this will meet your needs, but in Safari you can save the complete page as a webarchive.
https://support.apple.com/guide/safari/save-part-or-all-of-a-webpage-ibrw1089/mac

cwilcox · December 11, 2022, 9:44pm

At least with “Print background” checked, the only thing that seems to be missing is the initial “hero image,” which isn’t meaningful content. I confirmed that the hero image is a background image and the site has ‘print’ styles that make backgrounds not appear; I think it’s just a blanket set of rules to make page content more readable if it’s printed, especially on a black and white printer.

Safari’s ‘Save as’ has a .webarchive option. It looks like if that page is saved as a .webarchive and the .webarchive is opened and printed to PDF, the image quality is higher.

dredfearn · December 11, 2022, 10:00pm

Save as a web archive does work. That is fine for my purposes to save the pages. I just run into these issues with NYT or WSJ articles that I want to share - a web archive wouldn’t work very well for that. The Times and WSJ also let me send a link to an article for non-subscribers and I have been doing that. I am just limited to a certain number of links per month.

Thanks for the suggestions.

David

davbro · December 11, 2022, 10:16pm

Yes, although in my setup at least (13.0.1), to prevent some lines at the bottom of the page from getting horizontally sliced, with the upper half appearing on one page and the bottom half on the next one, in the same dialog you also have to check “Print headers and footers.”

jzw · December 11, 2022, 10:35pm

If you Print to PDF, you usually won’t get a replica of what you see in your browser. Sometimes it’s very different depending on the CSS files the page specifies (which can be different for print and display). The best/easiest way to save an ‘exact’ copy of what you’re viewing in your browser is to use the File > Export as PDF… menu item (cmd-shift-E). This will save a single, long (no page breaks), PDF showing what you see in Safari without trying to reformat it for printing.

One of the problems I think you’re having, though, is that some pages load images dynamically, so if you haven’t scrolled to a portion of the page, the image hasn’t been loaded. I would slowly scroll through the whole page first and make sure all images are there, and then Export as PDF.

Shamino · December 11, 2022, 10:55pm

This web page is pretty interesting, and seems to be using scripting to dynamically load the images.

Using Firefox, I can see the images loading and appearing in the text individually as I scroll forward through the document, probably the result of the page’s scripts. The browser’s print feature doesn’t trigger these scripts, so it only prints the images that have been downloaded (that is, those that I’ve already seen as a result of scrolling through the document). Images I haven’t yet seen don’t print and only blank placeholders are shown.

I was able to save it properly by scrolling through the entire document (42 pages worth), pausing briefly for the images to load as I go. Then print to a PDF file. The resulting PDF seems to have all the inline images (except for the big one at the top of the page, which is some kind of CSS background image.)
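The scroll-and-pause routine can be described as a small loop: move down a little less than one window height at a time, wait, and repeat until the bottom is reached. The sketch below only models that loop in Python. `scroll_to` is a hypothetical callback standing in for whatever actually moves the page (the thread doesn't name an automation tool, so none is assumed here), and the overlap and pause values are guesses.

```python
import time


def scroll_offsets(page_height, viewport_height, overlap=100):
    """Yield scroll positions that cover the page top to bottom.

    Each step is slightly shorter than the window so that images near
    the edge of one view are also inside the next one.
    """
    step = max(1, viewport_height - overlap)
    last = max(0, page_height - viewport_height)
    offset = 0
    while offset < last:
        yield offset
        offset += step
    yield last  # always finish exactly at the bottom of the page


def scroll_through(page_height, viewport_height, scroll_to,
                   pause_seconds=0.5, sleep=time.sleep):
    """Visit every offset, pausing after each so images have time to load.

    scroll_to is a hypothetical callback that takes a pixel offset.
    Returns the number of positions visited.
    """
    visited = 0
    for offset in scroll_offsets(page_height, viewport_height):
        scroll_to(offset)
        sleep(pause_seconds)
        visited += 1
    return visited
```

Printing or exporting would happen only after `scroll_through` returns. Based on the reports in this thread, a longer pause or a second pass may be needed on very image-heavy pages.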

GFS · December 11, 2022, 11:04pm

2 things.

You don’t want to ‘Print’ a PDF. In Safari you have ‘Export as PDF…’ in the File Menu. This is much better, because it produces a single long scrolling page, just as it appears in Safari itself. You may also find that narrowing the Safari window first (on responsive sites) gives you a nicer view/page width.

Sometimes (and this site you’re trying with is an absolute extreme as far as I’ve ever seen) there’s what is probably a cache problem, which is why you’re missing images and just getting blank spaces. So you have to scroll all the way to the bottom before Export as PDF…

However… and I’ve not seen this before, but I’m guessing that it’s due to the number of images and the length of page, even if you scroll all the way down, when you Export as PDF, the resultant PDF is still missing some images.

Trick that worked: Grab the Scrollbar and scroll back up doing a frantic up/down as you go. So each image is being loaded 2 or 3 or more times. When I did this, I got the entire page (14mb) except for some of the images in the footer.

I also noticed, that this page doesn’t load properly in Reader mode. It only has the first image + paragraph. So the html is perhaps a little iffy.

dredfearn · December 11, 2022, 11:25pm

I turned on print backgrounds and it didn’t work for me. Web archive does work fine.

David

dredfearn · December 11, 2022, 11:33pm

Regarding the other suggestions. Yes, I run into web pages with dynamic loading of images. And I have tried scrolling through the document to make sure they are all loaded. That doesn’t work for me most of the time - the NYT pages are particularly difficult. Save as a pdf also doesn’t work for me most of the time. Saving as a web archive is the only way I can get an accurate copy. Works fine to save the pages but not so good to share the page with others.

dredfearn · December 11, 2022, 11:35pm

Yep - I used to use Reader all the time and it mostly worked. Now, it mostly fails.

David

cwilcox · December 11, 2022, 11:36pm

Yes, it’s using JavaScript to watch the page as it’s scrolled and only when an image’s area is getting close to being visible does it insert the img urls so the browser requests them. Not loading all images when the page is first visited can save the user and server time and bandwidth when it’s common for people to not view a whole page.

Browsers can now do this natively, without JavaScript, just by adding loading="lazy" to the img element (it’s meant for iframe elements too but currently only works by default in Chrome). It’s on by default in Safari 15.4 and newer (released in March); before that it was an experimental feature that had to be enabled. In browsers that don’t know what loading="lazy" means, all the images are requested and loaded when the page is visited.
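To see which images on a saved page are deferred in either of these ways, a short script can list them. This is a minimal sketch with the Python standard library; `find_lazy_images` is a made-up name, and treating a `data-src` attribute with no `src` as "script-loaded" is an assumption about a common convention, not something confirmed for the page discussed here.

```python
from html.parser import HTMLParser


class LazyImageFinder(HTMLParser):
    """Collect img elements that would not load until scrolled into view."""

    def __init__(self):
        super().__init__()
        self.lazy = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # Native lazy loading, as described above.
        native = (attributes.get("loading") or "").lower() == "lazy"
        # Assumed script convention: the real URL waits in data-src
        # until JavaScript copies it into src.
        deferred = "data-src" in attributes and not attributes.get("src")
        if native or deferred:
            self.lazy.append(
                attributes.get("src") or attributes.get("data-src") or ""
            )


def find_lazy_images(html):
    """Return the URLs of images that are lazy-loaded in the given HTML."""
    finder = LazyImageFinder()
    finder.feed(html)
    finder.close()
    return finder.lazy
```

A long list from `find_lazy_images` suggests that printing without scrolling first will leave blank placeholders where those images belong.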

blinken · December 12, 2022, 5:46am

Try taking a screenshot, which gives you a PNG image file. That converts easily to PDF: when you open it, it usually opens automatically in Preview, which can then export it to PDF if you want.

pcarrington · December 12, 2022, 12:12pm

David: I shared this with the MacGeekGab community a while back. I, too, sometimes need to print something from Safari to PDF, and I often want the URLs to be printed with it. After selecting Reader view, I choose File > Print and then Save as PDF from the PDF drop-down list. I don’t know if this is what you need, but it gives me Reader view with a bit more information on where the material I’m printing came from. Best, Patrick

dredfearn · December 12, 2022, 5:45pm

When I try this I get a one page document to print. I have seen this before - the Reader view just shows one page. That’s strange and seems like a Reader bug. I never used to have these problems and I think Safari and Reader have changed. (Though, as I have mentioned, Chrome also has these problems. So maybe people are constructing web pages differently now.)

David

dennishenley · December 12, 2022, 10:17pm

David,

I went to the URL you provided and was able to print from Safari using Print Friendly (the PrintFriendly & PDF Safari extension). Print Friendly produced an 80 page PDF that appeared to have all the images including the eye-candy opener.

I’ve used Print Friendly on other “difficult to print” pages. The Safari extension allows you to delete blocks from the web page, but there is no undo if you click the wrong area.