Importing Premium DNS 365 into ClickHouse
This project loads data from the DNS_Premium_365 dataset into a local ClickHouse database to enable efficient, fast querying.
Approach
Decompressing
The Premium DNS files are very large, and the uncompressed data can exceed 1 TB, so decompressing them with a single-threaded tool such as gzip takes a long time. We therefore recommend a multi-threaded compression/decompression tool such as pigz (https://zlib.net/pigz/). Pigz uses as many cores as the system has to offer, reducing decompression time by 70% or more.
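A minimal sketch of the decompression step, assuming pigz is installed and using an illustrative file name pattern (not the actual dataset file names):
# -d decompress, -k keep the original .gz files, -p use all available cores
pigz -d -k -p "$(nproc)" dns_premium_365_2024-03_*.gz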
Data Preprocessing
Utilizing shell scripts, we processed 17 .gz files from the DNS_Premium_365 dataset for March 2024. Each file's row count and size were logged post-decompression to validate data integrity and prepare for subsequent processing.
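A rough sketch of that logging step (not the actual script; file names and log path are illustrative):
for f in dns_premium_365_2024-03_*.txt; do
  rows=$(wc -l < "$f")               # row count after decompression
  size=$(du -h "$f" | cut -f1)       # human-readable file size
  echo "$(date -u +%FT%TZ) $f rows=$rows size=$size" >> preprocess.log
done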
TLD Pre-Filtering (Optional)
A shell script filters the decompressed files for the TLDs of interest and generates corresponding domain name lists (see the sketch after the TLD list below). This step helps ensure the quality of the data preparation phase.
The TLDs we are using are: [ "com", "org", "net", "ru", "de", "cn", "uk", "br", "jp", "ai", "io", "fr", "it", "ir", "il", "ua", "kp", "xyz", "online", "top", "shop", "site", "icu", "club", "store", "vip" ]
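One way to implement this filtering with standard tools, assuming each line of the decompressed file begins with a fully qualified domain name (input and output file names are illustrative):
for tld in com org net ru de cn uk br jp ai io fr it ir il ua kp xyz online top shop site icu club store vip; do
  # keep lines whose domain ends in the current TLD and write a per-TLD domain list
  grep -E "^[^ ,]*\.${tld}([ ,]|$)" dns_premium_decompressed.txt > "domains_${tld}.txt"
done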
ClickHouse Database Processing
Installation and Initialization
Please see the official documentation for the ClickHouse installation process at https://clickhouse.com/docs/en/getting-started/quick-start
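For reference, the quick start on that page amounts to downloading the single binary and starting a local server (check the linked page for the current commands, as they may change):
curl https://clickhouse.com/ | sh
./clickhouse server
# in a second terminal, connect with the client
./clickhouse client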
Database Creation
We executed a SQL command to create the `production_db` database:
./clickhouse client --query="CREATE DATABASE production_db"
Data Table Design and Import
Tables can be created according to customer needs and the data structure. Below is an example of how to create a table:
./clickhouse client --query="CREATE TABLE production_db.domain_new2 (domain String, timestamp UInt64, ip String) ENGINE = MergeTree() ORDER BY domain"
A script automates this process for the 17 DNS files: it iterates through each file, creates a corresponding table, and inserts the data into it.
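A minimal sketch of such a loop, with illustrative file and table names (the actual script may differ):
for f in dns_premium_part_*.csv; do
  table="production_db.$(basename "$f" .csv)"   # derive the table name from the file name
  ./clickhouse client --query="CREATE TABLE IF NOT EXISTS ${table} (domain String, timestamp UInt64, ip String) ENGINE = MergeTree() ORDER BY domain"
  ./clickhouse client --query="INSERT INTO ${table} FORMAT CSV" < "$f"   # bulk-insert the file contents
done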
Use Case Example:
- After uploading the data to ClickHouse, we can list databases with 'SHOW DATABASES', select ours with 'USE your_database', and view existing tables with 'SHOW TABLES'.
- In ClickHouse, SQL commands can be used for data analysis, such as searching for specific domains or IP addresses. For example, a query filtering for FQDNs with the .com top-level domain (TLD) returned 1.20 billion rows in 8.343 seconds (a sketch of such a query appears after this list).
- A script can also iterate through all tables to find specific domains or IP addresses; for example, iterating through 19 tables to find IPs that start with 104.154.* (see the sketch after this list).
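Hedged sketches of such queries, using the example table from above and illustrative values (not the exact commands used in the project):
# count FQDNs under the .com TLD in one table
./clickhouse client --query="SELECT count() FROM production_db.domain_new2 WHERE domain LIKE '%.com'"
# iterate through every table in the database, printing rows whose IP starts with 104.154.
for t in $(./clickhouse client --query="SHOW TABLES FROM production_db"); do
  ./clickhouse client --query="SELECT domain, ip FROM production_db.${t} WHERE ip LIKE '104.154.%'"
done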
Conclusion
In conclusion, this project successfully utilized ClickHouse and automation scripts to efficiently process and analyze a large volume of domain and IP address data. Leveraging SQL commands and scripts for automatic table creation and bulk data insertion ensured seamless data management. The ability to iterate through all tables to identify specific domains or IP addresses demonstrated significant analytical capabilities. Moving forward, these methods can be scaled and adapted to handle even larger datasets, reaffirming their effectiveness in data processing and analysis tasks.