
Importing Premium DNS 365 into ClickHouse

This project aimed to load the DNS_Premium_365 dataset into a local ClickHouse database for efficient, rapid querying.

Approach

Decompressing

The Premium DNS files are very large, and decompressing them with a single-threaded tool such as gzip takes a long time; the uncompressed data can exceed 1 TB. We therefore recommend a multi-threaded compression/decompression tool such as pigz, available at https://zlib.net/pigz/. Pigz uses as many cores as the system offers, reducing decompression time by 70% or more.
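For example, a minimal sketch (the file-name pattern is illustrative, not the dataset's actual naming):

# Decompress all Premium DNS archives; -d decompresses,
# -p sets the thread count (here, all available cores).
pigz -d -p "$(nproc)" dns_premium_365_*.gz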

Data Preprocessing

Using shell scripts, we processed 17 .gz files from the DNS_Premium_365 dataset for March 2024. Each file's row count and size were logged after decompression to validate data integrity and prepare for subsequent processing.
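A sketch of that logging step, assuming the decompressed files are CSVs in the current directory (file names are illustrative):

# Record each file's row count and on-disk size for later validation.
for f in *.csv; do
    printf '%s\t%s rows\t%s\n' "$f" "$(wc -l < "$f")" "$(du -h "$f" | cut -f1)" >> preprocessing.log
done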


TLD Pre-Filtering (Optional)

We used a shell script to filter out the TLDs of interest from the decompressed files, generating corresponding domain name lists. This step helps ensure the quality and manageability of the data preparation phase.

The TLDs we are using are:  [ "com", "org", "net", "ru", "de", "cn", "uk", "br", "jp", "ai", "io", "fr", "it", "ir", "il", "ua", "kp", "xyz", "online", "top", "shop", "site",  "icu", "club", "store", "vip" ]
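One way to implement this filter, assuming each record's first comma-separated field is the domain (the field layout and file names are assumptions for illustration):

# Keep only domains whose TLD appears in the list above.
TLDS='com|org|net|ru|de|cn|uk|br|jp|ai|io|fr|it|ir|il|ua|kp|xyz|online|top|shop|site|icu|club|store|vip'
for f in *.csv; do
    awk -F',' -v re="[.](${TLDS})$" '$1 ~ re { print $1 }' "$f" > "${f%.csv}.domains.txt"
done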


ClickHouse Database Processing

Installation and Initialization

Please see the official documentation for the ClickHouse installation process at https://clickhouse.com/docs/en/getting-started/quick-start
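At the time of writing, the quick start amounts to downloading the single binary and launching a local server (check the documentation above for the current commands):

# Download the ClickHouse binary, then start a local server.
curl https://clickhouse.com/ | sh
./clickhouse server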


Database Creation

Execute an SQL command to create the `production_db` database:

./clickhouse client --query="CREATE DATABASE production_db"

Data Table Design and Import

Tables can be created according to customer needs and the data structure. Below is an example of how to create a table:

./clickhouse client --query="CREATE TABLE production_db.domain_new2 (domain String, timestamp UInt64, ip String) ENGINE = MergeTree() ORDER BY domain"

A script automated this process for the 17 DNS files: it iterates through the files, creating a corresponding table for each and inserting its data, as sketched below.
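A minimal sketch of such a loop, assuming one tab-separated file of (domain, timestamp, ip) records per table; the file-naming scheme is illustrative:

# Create one table per filtered file and bulk-insert its rows.
for f in dns_part_*.tsv; do
    table="production_db.$(basename "$f" .tsv)"
    ./clickhouse client --query="CREATE TABLE IF NOT EXISTS ${table} (domain String, timestamp UInt64, ip String) ENGINE = MergeTree() ORDER BY domain"
    ./clickhouse client --query="INSERT INTO ${table} FORMAT TSV" < "$f"
done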


Use Case Example: 

  1. After uploading data to ClickHouse, we can select our database using the 'SHOW DATABASES' and 'USE your_database' commands, and view existing tables with 'SHOW TABLES'.
  2. SQL commands can then be used for data analysis, such as searching for specific domains or IP addresses. For example, a single query retrieved 1.20 billion FQDNs with the .com top-level domain (TLD) in 8.343 seconds (a sketch of such a query follows this list).
  3. A script can iterate through all tables to find specific domains or IP addresses, for example scanning 19 tables for IPs that start with 104.154.* (see the sketch after this list).
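A query of the kind described in step 2, counting .com FQDNs (the table name is illustrative):

./clickhouse client --query="SELECT count() FROM production_db.domain_new2 WHERE domain LIKE '%.com'"

And a shell sketch for step 3, running the same IP-prefix search against every table in the database:

# Query each table in production_db for IPs starting with 104.154.
for t in $(./clickhouse client --query="SHOW TABLES FROM production_db" --format=TSV); do
    ./clickhouse client --query="SELECT domain, ip FROM production_db.${t} WHERE ip LIKE '104.154.%'"
done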

Conclusion

This project successfully used ClickHouse and automation scripts to process and analyze a large volume of domain and IP address data. SQL commands and scripts for automatic table creation and bulk data insertion made data management straightforward, and iterating through all tables to identify specific domains or IP addresses demonstrated the setup's analytical reach. Moving forward, these methods can be scaled and adapted to handle even larger datasets.
