Cluster massive amounts of keywords in PPC


    Why should you cluster keywords or Search Terms in PPC?

    Sample size is crucial for making decisions in PPC. Maybe you want to block some parts of your traffic using negative keywords, or you want to spend more of your budget on search terms / keywords that bring value to your website. For both tasks, you’ll struggle when you don’t have stable conversion rates to judge performance.

    One big goal of keyword clustering is to bring sample sizes up to enable decisions on stable numbers.

    Why is clustering challenging for big keyword sets?

    We already posted some approaches to clustering keywords (also look at our Free Online Clustering Tool) based on algorithms like k-means, making use of vector representations of the keywords (TF-IDF).

    The above approach works fine for a few thousand keywords, but you’ll run into trouble when you want to process more than 500,000 keywords:

    • Processing time gets very long.
    • Memory usage exceeds what your computer can handle.

    Google made some changes to how they match queries to keywords (close variants), which caused an exploding number of search terms. They also made some updates to the Search Term Performance Report, and additional search terms started to be included, which increased the number of queries by 5–7 times.

    Hands on: Clustering Google Shopping Queries

    What does the initial data set look like?

    My input data is about 130 MB in size and contains 2,500,000 search terms. In addition to the Queries, we also added Clicks, Conversions, ConversionValue, Cost and Impression data. Let’s have a look at the distribution of Clicks per Query:

    Rule            # Queries
    Clicks = 1      1,650,000
    Clicks > 1      830,000
    Clicks > 10     120,000
    Clicks > 100    12,600
    Clicks > 1000   1,000

    More than 50% of all queries got just one click. Less than 1% of the search terms have more than 100 clicks, which is still too few to get stable conversion rates in our case.
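The click distribution above can be reproduced with a short pandas sketch; the column names `Query` and `Clicks` match the input file used later, but the sample rows here are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the real input has ~2.5M rows
df = pd.DataFrame({
    "Query": ["dior", "parfum", "eau de parfum", "dior hypnotic poison", "hypnotic poison"],
    "Clicks": [1, 1, 5, 120, 1500],
})

# Count queries per click bucket, mirroring the table above
buckets = {
    "Clicks = 1": (df["Clicks"] == 1).sum(),
    "Clicks > 1": (df["Clicks"] > 1).sum(),
    "Clicks > 10": (df["Clicks"] > 10).sum(),
    "Clicks > 100": (df["Clicks"] > 100).sum(),
    "Clicks > 1000": (df["Clicks"] > 1000).sum(),
}
for rule, n in buckets.items():
    print(f"{rule}: {n}")
```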

    To sum it up: It’s impossible to make scalable actions by just looking at full queries. Let’s start grouping keywords together.

    Define possible keyword clusters based on rules

    Our goal is to assign keywords with low sample sizes to the next best cluster, where we can make a good guess about their expected performance. User searches follow a hierarchy: by adding words, the search gets more specific:

    We’ll use this hierarchy to run the clustering logic, based on the following rules. We want to assign each query to the next best cluster one step up in that hierarchy.

    • Rule 1: We create possible clusters only from search queries that appear somewhere in the big keyword list on their own. This means that phrases which never appear in a standalone way aren’t considered valid clusters. E.g., “buy” and “review” are words that are often used within user queries, but they never appear as standalone queries.
    • Rule 2: A valid cluster, which is used to absorb more specific queries, needs a minimum number of samples/observations. We used a threshold of 500 clicks in our code example.
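Rules 1 and 2 boil down to a lookup set: a phrase qualifies as a cluster only if it occurs as a standalone query and exceeds the click threshold. A minimal sketch (the queries and click counts are made up; `MIN_CLICKS` matches the 500-click threshold, and the 4-word limit matches the full code further down):

```python
MIN_CLICKS = 500

# Hypothetical (query, clicks) pairs from the full keyword list;
# a word like "buy" never appears as a standalone query, so it is
# simply absent from this dict (Rule 1)
queries = {
    "dior": 9000,
    "parfum": 50000,
    "dior hypnotic poison": 800,
    "dior hypnotic poison eau de parfum": 40,
}

# Rule 2: require a minimum number of clicks (and at most 4 words)
valid_clusters = {
    q for q, clicks in queries.items()
    if clicks > MIN_CLICKS and len(q.split()) <= 4
}
print(sorted(valid_clusters))
```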

    It becomes clearer when you look at some examples:

    Analyzing queries: Check for possible root queries that appeared in the full query list.

    The Query column contains the search term we want to cluster. The column “root_queries” contains a set of queries that would be a valid cluster.
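The lookup behind “root_queries” can be sketched in a few lines: take every 1- to 4-word phrase of the query and keep those that are valid clusters. This version uses a small helper instead of nltk’s `ngrams`, and the lookup set is made up:

```python
query = "dior hypnotic poison eau de parfum"
valid_clusters = {"dior", "parfum", "hypnotic poison",
                  "dior hypnotic poison", "eau de parfum"}  # hypothetical lookup set

def ngram_phrases(tokens, n):
    # All contiguous n-word phrases of the token list
    return (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = query.split()
# Collect every 1- to 4-word phrase of the query that is itself a valid cluster
root_queries = [
    phrase
    for n in range(1, 5)
    for phrase in ngram_phrases(tokens, n)
    if phrase in valid_clusters
]
print(root_queries)
```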

    Selecting the most relevant keyword cluster

    You’ll realize we have more than one option available. Let’s have a look at the first Query from my screenshot:

    “dior hypnotic poison eau de parfum” can be assigned to:

    • dior
    • parfum
    • hypnotic poison
    • dior hypnotic poison
    • eau de parfum

    All possible clusters appear as standalone queries in the big keyword list of 2,500,000 keywords, and all of them have more than 500 clicks. If we have to pick just one cluster now, which one should we choose?

    “parfum” has by far the most clicks of all clusters, but it is probably not a good choice as a cluster. When it comes to relevancy, we should come from the other end: the possible clusters with the lowest click numbers are closer to the query we want to cluster. In our example, “dior hypnotic poison” would be the chosen cluster. Let’s add the final rule to the clustering logic:

    • Rule 3: If there are multiple cluster options, choose the cluster with the minimum number of Clicks.

    The “Cluster” column contains the most relevant cluster out of the “root_queries” list.
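Rule 3 is a simple argmin over the candidates’ click totals. A sketch with made-up click counts for the example above:

```python
# Hypothetical click totals for the candidate clusters of
# "dior hypnotic poison eau de parfum"
cluster_clicks = {
    "dior": 9000,
    "parfum": 50000,
    "hypnotic poison": 1200,
    "dior hypnotic poison": 800,
    "eau de parfum": 30000,
}

# Rule 3: the candidate with the fewest clicks is the most specific,
# hence the most relevant cluster
best = min(cluster_clicks, key=cluster_clicks.get)
print(best)  # -> dior hypnotic poison
```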

    After running this logic on the full keyword set, we were able to assign 1,500,000 (out of 2,500,000) queries that have fewer than 100 clicks (hard to judge performance) to the next best cluster. Based on that, we can make a good estimate of the keyword performance.

    Here is our Python code. You can easily adjust it to your needs:

    # Cluster PPC Keywords
    # Author: Stefan Neefischer
    
    import pandas as pd
    from collections import defaultdict
    from nltk import ngrams
    
    
    # Click totals of valid clusters (roots) and clicks collected per candidate
    roots = defaultdict(int)
    root_clicksum = defaultdict(int)
    
    # What is a valid Cluster? (Rule 2: minimum number of clicks)
    MIN_CLICKS = 500
    
    # Read Queries (expects at least the columns Query and Clicks)
    df = pd.read_csv('PLA Queries 2.csv', sep=";")
    
    def defineRoots(query, kpi):
        # Rules 1 + 2: a query is a valid cluster if it appears standalone,
        # has at most 4 words and exceeds the click threshold
        if len(str(query).split(" ")) <= 4 and kpi > MIN_CLICKS:
            roots[query] = kpi
    
    def getWordcount(query):
        return len(str(query).split(" "))
    
    def lookupRoots(query, clicks):
        # Collect every 1- to 4-gram of the query that is a valid cluster,
        # and accumulate the query's clicks on each matching cluster
        rootlist = []
        for n in range(1, 5):
            for gram in ngrams(str(query).split(), n):
                key = " ".join(gram)
                if key in roots:
                    rootlist.append(key)
                    root_clicksum[key] += clicks
        return rootlist
    
    def selectCluster(clusteroptions):
        # Rule 3: among the candidates, pick the cluster with the fewest clicks
        if not clusteroptions:
            return "undef"
        return min(clusteroptions, key=lambda option: root_clicksum[option])
    
    
    # Pass 1: find all valid clusters (roots)
    df.apply(lambda row: defineRoots(row['Query'], row['Clicks']), axis=1)
    df["nr_words"] = df.apply(lambda row: getWordcount(row['Query']), axis=1)
    # Pass 2: per query, find all candidate clusters ...
    df["root_queries"] = df.apply(lambda row: lookupRoots(row['Query'], row['Clicks']), axis=1)
    # Pass 3: ... and select the most relevant one
    df["Cluster"] = df.apply(lambda row: selectCluster(row['root_queries']), axis=1)
    print(df)
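Once every query carries a “Cluster” label, performance can be estimated at the cluster level by summing clicks and conversions per cluster. A minimal sketch with made-up rows; the column names match the script’s output, but the real DataFrame comes from the script above:

```python
import pandas as pd

# Hypothetical clustered output; in practice this is the df produced by the script
df = pd.DataFrame({
    "Query": ["dior hypnotic poison edp", "dior hypnotic poison 50ml", "dior sauvage"],
    "Cluster": ["dior hypnotic poison", "dior hypnotic poison", "dior"],
    "Clicks": [40, 25, 300],
    "Conversions": [2, 1, 12],
})

# Aggregate clicks and conversions per cluster, then derive a
# cluster-level conversion rate that rests on a larger sample
agg = df.groupby("Cluster")[["Clicks", "Conversions"]].sum()
agg["CR"] = agg["Conversions"] / agg["Clicks"]
print(agg)
```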
    
