Carlos III University of Madrid and WhoisXML API: Analyzing the Presence of Bots across Website Categories
About
Sergio Diaz, a Master's in Cybersecurity student at the Carlos III University of Madrid, set out to analyze the Robots Exclusion Protocol, specifically the robots.txt files of the Tranco top 100,000 domains, to determine the nature of the blocked crawlers or bots and which types of websites allow or disallow them. The Robots Exclusion Protocol aims to keep crawlers under control, and the project showed that the bots can be identified and appear more frequently in some website categories than in others.
Highlights
- Classifying huge volumes of domains takes time.
- Website Categorization API helped by allowing the researcher to quickly and accurately classify all of the domains in the study.
- The researcher was able to correlate the websites’ categories and the crawlers present in their robots.txt files.
Swift and Precise Categorization of 100,000 Domains
The researcher generated a Tranco list of 100,000 domains and programmed his own crawler in Python to extract their robots.txt files. He then needed to obtain the domains’ WHOIS information and website classification before he could finally perform an in-depth analysis of the dataset.
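For readers who want a sense of what that crawling step involves, a minimal Python sketch is shown below; the input file name and the in-memory output structure are illustrative assumptions rather than the researcher's actual code.

```python
import requests

# Illustrative sketch: fetch robots.txt for each domain in a Tranco-style list.
# The file name and output layout are assumptions, not the study's actual code.
def fetch_robots(domain, timeout=10):
    try:
        resp = requests.get(f"https://{domain}/robots.txt", timeout=timeout)
        if resp.status_code == 200 and "text" in resp.headers.get("Content-Type", ""):
            return resp.text
    except requests.RequestException:
        pass  # unreachable or misbehaving hosts are simply skipped
    return None

# Assumed file containing one domain per line, e.g., exported from Tranco.
with open("tranco_top_100k.txt") as infile:
    domains = [line.strip() for line in infile if line.strip()]

robots_files = {}
for domain in domains:
    content = fetch_robots(domain)
    if content is not None:
        robots_files[domain] = content
```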
Diaz initially attempted to use AI to categorize the domains and create clusters but found it time-consuming and impractical. He needed another way to classify all of the domains in the study efficiently and accurately so he could correlate their categories with the bots named in their robots.txt files.
Fast and Easy-to-Use Website Categorization API
A quick online search led Diaz to WhoisXML API’s Website Categorization API. With the tool, the researcher did not have to devise his own categories or classify the domains individually. He simply made web classification requests to the API and saved the data in a JSON file.
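That classification step can be sketched along the following lines; the endpoint URL, API version, and parameter names are assumptions about the request format, so the official Website Categorization API documentation should be treated as the authoritative reference.

```python
import json
import requests

# Illustrative sketch of the classification step. The endpoint URL, API
# version, and parameter names below are assumptions about the request
# format; consult WhoisXML API's documentation for the exact details.
API_KEY = "YOUR_API_KEY"  # placeholder credential
ENDPOINT = "https://website-categorization.whoisxmlapi.com/api/v3"  # assumed URL

def categorize(domain):
    """Request category data for a single domain and return the parsed JSON."""
    params = {"apiKey": API_KEY, "domainName": domain}
    resp = requests.get(ENDPOINT, params=params, timeout=10)
    resp.raise_for_status()
    return resp.json()

# Toy domain list standing in for the Tranco top 100,000.
domains = ["example.com", "example.org"]
categories = {domain: categorize(domain) for domain in domains}

# Save the raw API responses to a JSON file, as described in the study.
with open("website_categories.json", "w") as outfile:
    json.dump(categories, outfile, indent=2)
```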
As a result, the researcher found similarities across domain categories and determined which ones are more likely to allow or disallow certain types of web crawlers.
He found the API straightforward and easy to use, with highly accurate and relevant results. The well-parsed outputs made the results easy to analyze.
“WhoisXML API’s Website Categorization API helped me save time. If I did not use it, I would still be categorizing domains manually. It also improved the study’s accuracy, as the API had well-defined categories, and the output is precise and relevant.”
Correlation between Domain Categories and Bots
Website Categorization API enabled the researcher to carry out a comprehensive analysis of the Robots Exclusion Protocol, allowing him to determine which bot types appear more frequently in certain website categories and whether or not they are blocked.
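The correlation itself can be illustrated with a small sketch that tallies the user agents declared in each robots.txt file per website category; the domains, categories, and bot names below are toy placeholders, not findings from the study.

```python
import re
from collections import Counter, defaultdict

# Toy inputs standing in for the collected robots.txt files and the
# categories returned by Website Categorization API.
robots_by_domain = {
    "example-shop.com": "User-agent: ExampleBot\nDisallow: /\n\nUser-agent: *\nAllow: /",
    "example-news.com": "User-agent: *\nDisallow: /private/",
}
category_by_domain = {
    "example-shop.com": "Shopping",
    "example-news.com": "News and Media",
}

# Count how often each declared user agent appears per website category.
bots_per_category = defaultdict(Counter)
for domain, robots_txt in robots_by_domain.items():
    category = category_by_domain.get(domain, "Uncategorized")
    agents = re.findall(r"(?im)^user-agent:\s*(\S+)", robots_txt)
    bots_per_category[category].update(agents)

for category, counts in bots_per_category.items():
    print(category, counts.most_common(5))
```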
The tool’s precise and fast results and well-parsed outputs made the study more comprehensive and reliable.