There are two parts of your code. Code that can be unit tested and code that can't be unit tested.
Code that can't be unit tested is simple. Any code that has to touch IO can't be unit tested. Period. Any code that doesn't touch IO can be unit tested. It's that simple.
Keep ALL IO segregated from the rest of your code. Keep IO functions and methods super small. Do not inject IO polluted objects into other parts of your code.
In hindsight, this is obvious. The issue I have with unit testing projects like “Project: Lumbergh,” “Project: Sippy-Cup” or “Project: Cleese” is that they're nearly all IO with very little logic (with the caveat that “Project: Lumbergh” implements all the business logic interspaced with IO).
With that said, even though the author also stated not to mock at all, I do. We have a few network services that “Project: Lumbergh” relies upon, and I have written my own versions of those services that basically respond with a canned answer for a particular query. It was easy since the services speak a common protocol (DNS in this case) and I don't have to implement the logic of those services to determine the answer.
I also decided to implement a new regression test and keep the one I have working as a separate test. This way, it'll be easier to implement tests like “data source B replies, data source A times out.” I'm also keeping each test in its own file, so adding new ones should be way easier and hopefully, we won't end up with another 16,000 tests.
As a consequence, the closer to plain text a website is, the higher it'll score. The more markup it has in relation to its text, the lower it will score. Each script tag is punished. One script tag will still give the page a relatively high score, given all else is premium quality; but once you start having multiple script tags, you'll very quickly find yourself at the bottom of the search results.
Modern web sites have a lot of script tags. The web page of Rolling Stone Magazine has over a hundred script tags in its HTML code. Its quality rating is of the order 10-51%.
And from the few tests I ran, it seems to be a pretty decent search engine for what I'd use it for.