Copytables Simplifies Extracting Tabular Data from Web Pages

Dafuki · August 23, 2023, 3:21am

@ace Having written a web page table parser, I can assure you that wacky HTML is the norm rather than the exception. Good ol’ simple HTML tables are a comparative breeze because they’re a simple structure (cough), but modern CSS-driven tables are a nightmare because cells can be placed anywhere visually, with no relation to the normal left-to-right, top-to-bottom expectation. (I gave up.)
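To show why the simple case is a breeze, here is a minimal sketch in Python using only the standard library's `html.parser`. It is not the parser mentioned above, and the class name `SimpleTableParser` and the helper `extract_rows` are made up for illustration. It assumes plain `<tr>`/`<td>`/`<th>` markup, tolerates missing closing tags, and ignores `rowspan`, `colspan`, nested tables, and anything positioned by CSS, which is exactly where the nightmare starts.

```python
from html.parser import HTMLParser


class SimpleTableParser(HTMLParser):
    """Collects cell text from plain <tr>/<td>/<th> markup, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row being built, or None when outside a <tr>
        self._cell = None   # text fragments of the open cell, or None

    def _close_cell(self):
        # Finish the open cell (if any), collapsing runs of whitespace.
        if self._cell is not None and self._row is not None:
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _close_row(self):
        # Finish the open row (if any), closing a dangling cell first.
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()       # copes with a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()      # copes with a missing </td>
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)


def extract_rows(html):
    """Return the table content of `html` as a list of rows of strings."""
    parser = SimpleTableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    sample = (
        "<table><tr><th>Name</th><th>Qty</th></tr>"
        "<tr><td>Apples</td><td> 3 </td></tr></table>"
    )
    for row in extract_rows(sample):
        print("\t".join(row))
```

The output is tab-separated text, so it pastes straight into a text editor or spreadsheet. Everything here relies on the document order of the cells matching their visual order, which holds for old-style tables and not for CSS-positioned ones.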

I noticed mention of clipboards somewhere in here, and the modern clipboard is nothing like the old one. You think “a” clipboard; the reality is 6, 8, 10 or more different representations of the same stuff. It wouldn’t surprise me to discover an EBCDIC clipboard at some point. In fact, the clipboards are a pretty miraculous piece of engineering. Complicated translation on the fly. . . .

As far as grabbing tables goes, I’ve found that copying a web page (not necessarily a table) into Pages gives pretty amazing results, both in accuracy and in not pulling your hair out. I’m not sure why TextEdit results and Pages results are different (given the probably identical engines), but they are. You then copy that out of Pages into BBEdit and away you go. . . .

Dave