
Download WhoisXML API daily data ASAP part 2: An application with trending English words in .com

In the first part of this blog, we demonstrated how to download data from WhoisXML API's daily data feeds right after their publication, using the recently introduced RSS feed as a trigger for the download. In particular, we showed how to download the list of domains newly registered in the .com top-level domain (TLD).

Now, to make the task a bit more interesting, we demonstrate the use of our domain list with a showcase application: we compute the list of the most frequent English words appearing in the domain names registered on a given day. This can be useful in various applications. Domainers, for instance, can get information on the newest trends in domain registrations, while journalists and researchers can spot topics gaining popularity.

We will keep working in the same environment as in the first part of the blog (a Linux system with BASH and Python, at an intermediate level). We simply extend the demonstration from the first part, so we will be working in the "daily_words" subdirectory and assume that everything we did previously is still in place and working.

We will use the downloaded lists of newly registered domains in the .com top-level domain to immediately find out which English words appear most commonly in them. To do so, we need "word segmentation" (consult, e.g., this page for scientific details). We will be using a Python package called "Word Ninja", which does a good job of finding English words in strings without spaces and is easy to install:

pip3 install wordninja

(Remember to do this in the shell where the virtual environment is active, i.e., do not forget to run

source ./daily_words_venv/bin/activate

before this and everything that follows.) We wrote a little Python program that finds English words in all the strings read from the lines of standard input and outputs the top 100. This get_top_words.py reads:

#!/usr/bin/env python3

import sys
import re
import wordninja

words_counts = {}

# Count every word of more than three characters, containing at least
# one letter, found by segmenting the input lines
for line in sys.stdin.readlines():
    for word in wordninja.split(line.strip()):
        if len(word) > 3 and re.search('[a-zA-Z]', word):
            try:
                words_counts[word] += 1
            except KeyError:
                words_counts[word] = 1

# Print the 100 most frequent words, most frequent first
for w in sorted(words_counts, key=words_counts.get, reverse=True)[:100]:
    print(w)
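To illustrate what the counting stage does in isolation, here is a minimal, self-contained sketch in which collections.Counter stands in for the manual dictionary above; the pre-segmented word lists are made-up examples of what wordninja.split() could return for a few domain names:

```python
import re
from collections import Counter

# Hypothetical segmented domain labels; in the real script these
# words come from wordninja.split() applied to each domain name.
segmented = [
    ["best", "crypto", "invest", "ment"],
    ["crypto", "news", "today"],
    ["best", "crypto", "deals"],
]

# Same filter as in get_top_words.py: words longer than three
# characters that contain at least one letter
counts = Counter(
    word
    for words in segmented
    for word in words
    if len(word) > 3 and re.search("[a-zA-Z]", word)
)

for word, n in counts.most_common(3):
    print(word, n)
```

Counter.most_common() already returns words sorted by decreasing count, so the sort-and-slice step of the script becomes a one-liner.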

Let's now create a subdirectory "top_100_words_daily" and add the following lines:

echo "Now postprocessing"
datehyph=$(echo $available | sed -e s/_/-/g)
./get_top_words.py < ./data/domain_names_new/com/${datehyph}/add.com.csv \
   > top_100_words_daily/com_top_100_words_${available}.txt

to get_com_daily_new.sh, developed in the previous part, at the end of the branch that downloads fresh data, i.e., before the line containing "fi". The complete script (remember that this is the one we run from crontab) thus reads:

#!/bin/bash
 
MY_PLAN=lite
MY_KEY=MY_KEY
 
PROJECT_DIR="$( cd "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )"
cd $PROJECT_DIR
 
source ./daily_words_venv/bin/activate
 
(rsstail -n1 -u \
   "https://${MY_KEY}:${MY_KEY}@newly-registered-domains.whoisxmlapi.com/datafeeds/Newly_Registered_Domains/$MY_PLAN/domain_names_new/status/download_ready_rss.xml" \
   > rss_output.txt & echo $! >&3) 3>rsstail.pid
 
sleep 5
 
last_date=$(cat last_date.txt)
echo "---------------"
date
echo "Last download for day: $last_date"
available=$(tail --lines 1 < rss_output.txt | cut -d " " -f 4)
echo "Last available data for day:" $available
if [[ $last_date == $available ]];then
   echo "We already have data for $available"
else
   echo "Downloading for $available"
   datearg=$(echo $available | sed -e s/_//g)
   cd ./download_whois_data
   ./download_whois_data.py --feed domain_names_new \
      --tlds com \
      --dataformats csv \
      --startdate $datearg \
      --plan $MY_PLAN \
      --password $MY_KEY \
      --output-dir ../data
   cd ..
   echo $available > last_date.txt
   echo "Now postprocessing"
   datehyph=$(echo $available | sed -e s/_/-/g)
   ./get_top_words.py < ./data/domain_names_new/com/${datehyph}/add.com.csv \
      > top_100_words_daily/com_top_100_words_${available}.txt
fi
 
kill $(cat rsstail.pid)
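Note the two sed substitutions in the script: from the date string found in the RSS item, they derive the two formats used downstream. A quick sketch with a hypothetical date value:

```shell
# Hypothetical value, shaped like the date field extracted from the RSS item
available=2021_04_20

# Underscores removed: the format expected by --startdate
datearg=$(echo $available | sed -e s/_//g)

# Underscores replaced by hyphens: the name of the daily data subdirectory
datehyph=$(echo $available | sed -e s/_/-/g)

echo $datearg
echo $datehyph
```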

From now on, immediately after the new domain list has been published, we will find the list of the top 100 words in the files of top_100_words_daily, e.g., the list for 20 April 2021 will be in the file top_100_words_daily/com_top_100_words_2021_04_20.txt. We conclude our demonstration by creating a word cloud of these data:

pip3 install wordcloud
wordcloud_cli --width 640 --height 480 \
  --no_collocations --no_normalize_plurals \
  --text top_100_words_daily/com_top_100_words_2021_04_20.txt \
  --imagefile top_100_words_daily/com_top_100_words_2021_04_20_wordcloud.gif

The result is like this: 

Word cloud

To visualize the trends, one may even create an animation from these daily clouds, like the one below for 1 to 14 April 2021:

Word cloud GIF
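As a closing aside, the per-day files also lend themselves to simple programmatic trend queries. The sketch below (daily_top_words is a hypothetical helper of our own, not part of the scripts above) assumes the naming convention com_top_100_words_<date>.txt and lists the words that made the top 100 on every observed day:

```python
import re
from pathlib import Path

def daily_top_words(directory):
    """Map each date string (YYYY_MM_DD) to the set of top words in
    the corresponding com_top_100_words_<date>.txt file."""
    result = {}
    for path in sorted(Path(directory).glob("com_top_100_words_*.txt")):
        match = re.search(r"\d{4}_\d{2}_\d{2}", path.name)
        if match:
            result[match.group(0)] = set(path.read_text().split())
    return result

daily = daily_top_words("top_100_words_daily")
if daily:
    # Words that appeared in the top 100 on every observed day
    print(sorted(set.intersection(*daily.values())))
```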