Carlos III University of Madrid and WhoisXML API: Analyzing the Presence of Bots across Website Categories
About
Sergio Diaz, a Master's in Cybersecurity student at the Carlos III University of Madrid, set out to analyze the Robots Exclusion Protocol, specifically the robots.txt files of the Tranco top 100,000 domains, to determine the nature of the blocked crawlers or bots and which types of websites allow or disallow them. The Robots Exclusion Protocol aims to keep crawlers under control, and the project showed that the bots can be identified and appear more frequently in some website categories than in others.
Highlights
- Classifying huge volumes of domains takes time.
- Website Categorization API helped by allowing the researcher to quickly and accurately classify all of the domains in the study.
- The researcher was able to correlate the websites’ categories and the crawlers present in their robots.txt files.
Swift and Precise Categorization of 100,000 Domains
The researcher generated a Tranco list of 100,000 domains and programmed his own crawler in Python to extract their robots.txt files. He then needed to obtain the domains’ WHOIS information and website classification before he could finally perform an in-depth analysis of the dataset.
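For readers who want a sense of what that crawling step involves, a minimal Python sketch is shown below; the input file name and the in-memory output structure are illustrative assumptions rather than the researcher's actual code.

```python
import requests

# Illustrative sketch: fetch robots.txt for each domain in a Tranco-style list.
# The file name and output layout are assumptions, not the study's actual code.
def fetch_robots(domain, timeout=10):
    try:
        resp = requests.get(f"https://{domain}/robots.txt", timeout=timeout)
        if resp.status_code == 200 and "text" in resp.headers.get("Content-Type", ""):
            return resp.text
    except requests.RequestException:
        pass  # unreachable or misbehaving hosts are simply skipped
    return None

# Assumed file containing one domain per line, e.g., exported from Tranco.
with open("tranco_top_100k.txt") as infile:
    domains = [line.strip() for line in infile if line.strip()]

robots_files = {}
for domain in domains:
    content = fetch_robots(domain)
    if content is not None:
        robots_files[domain] = content
```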
Diaz initially attempted to use AI to categorize the domains and create clusters but found it time-consuming and impractical. He needed another way to classify all of the domains in the study efficiently and accurately so he could correlate their categories with the bots named in their robots.txt files.
Fast and Easy-to-Use Website Categorization API
A quick online search led Diaz to WhoisXML API’s Website Categorization API. With the tool, the researcher did not have to devise his own categories or classify the domains individually. He simply made web classification requests to the API and saved the data in a JSON file.
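That classification step can be sketched along the following lines; the endpoint URL, API version, and parameter names are assumptions about the request format, so the official Website Categorization API documentation should be treated as the authoritative reference.

```python
import json
import requests

# Illustrative sketch of the classification step. The endpoint URL, API
# version, and parameter names below are assumptions about the request
# format; consult WhoisXML API's documentation for the exact details.
API_KEY = "YOUR_API_KEY"  # placeholder credential
ENDPOINT = "https://website-categorization.whoisxmlapi.com/api/v3"  # assumed URL

def categorize(domain):
    """Request category data for a single domain and return the parsed JSON."""
    params = {"apiKey": API_KEY, "domainName": domain}
    resp = requests.get(ENDPOINT, params=params, timeout=10)
    resp.raise_for_status()
    return resp.json()

# Toy domain list standing in for the Tranco top 100,000.
domains = ["example.com", "example.org"]
categories = {domain: categorize(domain) for domain in domains}

# Save the raw API responses to a JSON file, as described in the study.
with open("website_categories.json", "w") as outfile:
    json.dump(categories, outfile, indent=2)
```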
As a result, the researcher found similarities across domain categories and determined which ones are more likely to allow or disallow certain types of web crawlers.
He found the API straightforward and easy to use, with highly accurate and relevant results. The well-parsed outputs made the results easy to analyze.
“WhoisXML API’s Website Categorization API helped me save time. If I did not use it, I would still be categorizing domains manually. It also improved the study’s accuracy, as the API had well-defined categories, and the output is precise and relevant.”
Correlation between Domain Categories and Bots
Website Categorization API enabled the researcher to carry out a comprehensive analysis of the Robots Exclusion Protocol, allowing him to determine which bot types appear more frequently in certain website categories and whether or not they are blocked.
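The correlation itself can be illustrated with a small sketch that tallies the user agents declared in each robots.txt file per website category; the domains, categories, and bot names below are toy placeholders, not findings from the study.

```python
import re
from collections import Counter, defaultdict

# Toy inputs standing in for the collected robots.txt files and the
# categories returned by Website Categorization API.
robots_by_domain = {
    "example-shop.com": "User-agent: ExampleBot\nDisallow: /\n\nUser-agent: *\nAllow: /",
    "example-news.com": "User-agent: *\nDisallow: /private/",
}
category_by_domain = {
    "example-shop.com": "Shopping",
    "example-news.com": "News and Media",
}

# Count how often each declared user agent appears per website category.
bots_per_category = defaultdict(Counter)
for domain, robots_txt in robots_by_domain.items():
    category = category_by_domain.get(domain, "Uncategorized")
    agents = re.findall(r"(?im)^user-agent:\s*(\S+)", robots_txt)
    bots_per_category[category].update(agents)

for category, counts in bots_per_category.items():
    print(category, counts.most_common(5))
```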
The tool’s precise and fast results and well-parsed outputs made the study more comprehensive and reliable.