Challenge

Swift and Precise Categorization of 100,000 Domains

The researcher generated a Tranco list of 100,000 domains and programmed his own crawler in Python to extract their robots.txt files. He then needed to obtain the domains’ WHOIS information and website classification before he could finally perform an in-depth analysis of the dataset.
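The case study does not reproduce the researcher's code, but a robots.txt crawler along these lines can be sketched in Python. The file names, the use of the requests library, and the two-column "rank,domain" Tranco CSV layout are assumptions made for illustration.

```python
# Minimal sketch of a robots.txt crawler over a Tranco domain list.
# File names and the requests-based fetch are illustrative assumptions.
import csv
import json

import requests


def fetch_robots_txt(domain: str, timeout: int = 10) -> str | None:
    """Try to retrieve https://<domain>/robots.txt; return None on failure."""
    try:
        resp = requests.get(f"https://{domain}/robots.txt", timeout=timeout)
        if resp.status_code == 200:
            return resp.text
    except requests.RequestException:
        pass
    return None


def crawl(domain_list_path: str, output_path: str) -> None:
    """Fetch robots.txt for every domain in the list and dump the results to JSON."""
    results = {}
    with open(domain_list_path, newline="") as f:
        # Assumed input format: one "rank,domain" row per line.
        for _rank, domain in csv.reader(f):
            robots = fetch_robots_txt(domain)
            if robots is not None:
                results[domain] = robots
    with open(output_path, "w") as out:
        json.dump(results, out)


if __name__ == "__main__":
    crawl("tranco_top_100k.csv", "robots_txt_dump.json")
```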

Diaz initially attempted to use AI to categorize the domains and create clusters but found the approach time-consuming and impractical. He needed another way to efficiently and accurately classify all of the domains in the study so he could correlate domain categories with the bots they allow or block.

Solution

Fast and Easy-to-Use Website Categorization API

A quick online search led Diaz to WhoisXML API's Website Categorization API. With the tool, the researcher did not have to devise his own category taxonomy or classify each domain individually. He simply made web classification requests to the API and saved the responses in a JSON file.
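A batch of classification requests of that kind might look like the sketch below. The endpoint URL, query parameter names, and API key are placeholders assumed for illustration rather than a verbatim copy of WhoisXML API's documented request format.

```python
# Hedged sketch: classify a list of domains and save the raw API responses
# as one JSON file. The ENDPOINT value and query parameters are assumptions;
# consult WhoisXML API's documentation for the exact request format.
import json

import requests

API_KEY = "YOUR_API_KEY"  # placeholder credential
ENDPOINT = "https://website-categorization.whoisxmlapi.com/api/v3"  # assumed endpoint


def categorize(domain: str) -> dict:
    """Request the website categories for a single domain."""
    resp = requests.get(
        ENDPOINT,
        params={"apiKey": API_KEY, "domainName": domain},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


def categorize_all(domains: list[str], output_path: str) -> None:
    """Classify each domain and save the combined results as JSON."""
    results = {domain: categorize(domain) for domain in domains}
    with open(output_path, "w") as out:
        json.dump(results, out, indent=2)


if __name__ == "__main__":
    categorize_all(["example.com", "example.org"], "website_categories.json")
```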

As a result, the researcher found similarities across domain categories and determined which ones were more likely to allow or disallow certain types of web crawlers.

He found the API straightforward and easy to use, with highly accurate and relevant results. The well-parsed outputs made the results easy to analyze.

Results

Correlation between Domain Categories and Bots

Website Categorization API enabled the researcher to carry out a comprehensive analysis of the Robots Exclusion Protocol, allowing him to determine which bot types appear most frequently in certain website categories and whether they are allowed or blocked.

The tool’s precise and fast results and well-parsed outputs made the study more comprehensive and reliable.