Search on Google Cloud Platform - App Engine and Search API

I have come across the topic "Search on Google Cloud Platform" a few times, i.e. people asking how to implement search in different ways. I'm not sure if it's part of some Google Cloud test or exam, but it's an interesting topic and GCP offers several ways to do it, so I decided to do a series on it. In each article I want to describe one way of implementing it, with code explanation and load testing.

The task description goes like this: imagine you run an e-shop and you want to implement autocomplete for your product search, so when users type some words in the search box, they get products which contain those words. How would you do it on GCP so that it's scalable, fast, etc.?

Text search by itself is a major topic with many functionalities and approaches, and I don't consider myself an expert at all, so pardon me if I omit something.

To get some realistic data: Best Buy has a GitHub repository which contains data about some 50,000 products. To really simulate a big e-shop, I found a dataset on Kaggle, https://www.kaggle.com/c/acm-sf-chapter-hackathon-big/data (it's necessary to register in order to get the product files), which contains over a million products, and that is a number which should be more fun to work with. I wrote a small script to extract the necessary information, since the data is in XML and spread over multiple files. I won't extract all the data, since there are about 70 fields per product.

All code is on GitHub at https://github.com/zdenulo/gcp-search and it's written in Python 3 (except for parts that run only in Python 2, as in this case).

General architecture

To implement the functionality, we will need a frontend with an input field to enter the text to be searched and to display the results. For this I will use the jQuery autocomplete library, which makes requests to the server with the input query and then displays the results automatically.

Next we will need a backend server, for which I will use Google App Engine (GAE) (both Standard and Flexible, mostly Flexible) since it's easy to deploy and it scales automatically.

And of course we will need storage where the product data used for search will be stored, which is the whole essence of this series. In truth, e-shops have more complex database architectures, but I'm simplifying here because we are interested only in the search functionality. Normally you would keep the usual stuff (dozens of properties related to a product) in a database, keep only what will be searched (the product name) in the search engine, and maintain a reference between the two.

Extracting information

After downloading product_data.tar.gz from the Kaggle website and unpacking it, running the script extract_product_data.py extracts some information from the multiple XML files into one CSV file. There are dozens of fields per product, but I am saving only a few, and perhaps I won't even use all of those. Obviously the product name is the most important. The CSV file isn't included in the repository since it's ~260MB big :).
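
To give an idea of what the extraction does, here is a minimal sketch, assuming the XML files contain product elements with name and sku children (the element names, paths and output columns here are illustrative; the real script is extract_product_data.py in the repo):

import csv
import glob
import xml.etree.ElementTree as ET

# illustrative paths and element names
with open('products.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    writer.writerow(['sku', 'name'])
    for xml_file in glob.glob('product_data/*.xml'):
        tree = ET.parse(xml_file)
        for product in tree.getroot().iter('product'):
            sku = product.findtext('sku', '')
            name = product.findtext('name', '')
            if name:
                writer.writerow([sku, name])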

Frontend

The frontend is simple and straightforward.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Autocomplete</title>
    <link rel="stylesheet" href="//code.jquery.com/ui/1.12.1/themes/base/jquery-ui.css">
    <script src="https://code.jquery.com/jquery-1.12.4.js"></script>
    <script src="https://code.jquery.com/ui/1.12.1/jquery-ui.js"></script>
    <script>
        $(function() {
            $("#search").autocomplete({
                source: function (request, response) {
                    $.ajax({
                        dataType: "json",
                        url: "/search",
                        data: {query: request.term},
                        success: function (data) {
                            response(data);
                        }
                    })
                },
                minLength: 2
            });
        });
    </script>
</head>
<body>
<div class="ui-widget">
    <form>
        <input type="text" id="search" size="55" >
    </form>
</div>

</body>
</html>

Basically, as I wrote earlier, I am using the jQuery autocomplete library, which with a few settings automatically makes queries to the /search URL, sending the typed query and rendering the received results.

Search API

As I wrote in the beginning, the first service/product I will use is the Search API, which is integrated into Google App Engine Standard.

A high-level overview of the Search API:

  • there is no setup. Not even a single line, value or anything... amazing :). In the application's code you need to define an index object with a name, and basically you wrap your code for inserts, queries, etc. around it.
  • the API is included in the GAE Standard SDK. Unfortunately, the Search API can't be used outside of GAE Standard (not even in Flexible). This should be taken into consideration if you want to create an application on GAE Standard and use the Search API.
  • I heard there are plans to make the Search API available outside GAE as a standalone service, but I don't know when that could happen.
  • there is a free quota of 0.25GB of stored data, 1,000 search queries per day and 0.01GB of data added to the index per day.
  • pricing is as follows: 10,000 queries cost $0.50, storage costs $0.18/GB and indexing data costs $2/GB.
  • depending on your use case and budget, it can be a bit expensive; by using the memcache service to cache results, you can lower the number of search queries.
  • regarding scaling, there is a maximum of 100 aggregated minutes of query execution time per minute (I admit I don't understand what this means).
  • data needs to be saved in structured documents (there are several field types: Text, Number, Date, Geo, HTML, Atom). Documents can contain multiple fields. One document can be at most 1MB.
  • documents need to be stored in "indexes"; there can be multiple indexes per project. A search query is executed against an index.
  • you can provide an id for a document or it can be generated automatically. With the document id, you can update the document's field values.
  • a single index can have a maximum size of 10GB, which can be increased up to 200GB by submitting a request to GCP, and there can be an unlimited number of indexes per project / application.
  • a maximum of 15,000 documents can be added per minute (250 per second), but there is a limit of 200 documents which can be inserted at once (in one request). This has some time implications for our ~1M products upload (see the back-of-the-envelope calculation below).
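
A quick back-of-the-envelope calculation (my numbers, not from the docs): at the cap of 250 documents per second, inserting ~1,000,000 products takes at least 1,000,000 / 250 = 4,000 seconds, i.e. roughly 67 minutes of pure indexing time, and in practice more once request overhead and pauses between batches are added.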

Search API queries

There are many interesting search features (a small query example follows the list):

  • querying on multiple fields (using boolean operators)
  • geo queries, i.e. searching for documents within some radius based on latitude and longitude. Of course, the document needs to have a Geo field for this to work.
  • prefix matching: if a document contains the text "google", a search query for "goo" will return it, but a search for "ogl" will not.
  • you can create snippets from a result, i.e. text that includes the query string and the surrounding text.
  • faceted search: a search query can return subcategories and document counts, which can be used to further refine the search.
  • it's possible to sort query results based on different properties, like search term frequency.
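
As a taste of the query language, here is a minimal sketch of a field-restricted query with boolean operators and snippets, based on my reading of the Search API docs (the query string and options are illustrative):

from google.appengine.api import search

index = search.Index('products')

# field restrictions and boolean operators go in the query string;
# snippets are requested for the 'name' field
query = search.Query(
    query_string='name:mouse AND NOT name:pad',
    options=search.QueryOptions(
        limit=20,
        snippeted_fields=['name'],
    ))
results = index.search(query)
for doc in results.results:
    print(doc.field('name').value)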

GAE web app

I am using App Engine since it's the only way to use the Search API, but besides that it's lightweight, easy to deploy, and scales up and down automatically.

The code for the web application is in the folder gae_search_api/webapp. As mentioned earlier, the application runs on GAE Standard (Python 2). I will explain the most important parts.

search_base.py contains the class SearchEngine, with which I wrap all operations related to search.
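
The actual file is in the repo; conceptually it's just the interface each search backend in this series implements, something along these lines (a sketch, not the verbatim file):

class SearchEngine(object):
    """Interface which each search backend implements."""

    def search(self, query):
        """Returns a list of matching products for the query string."""
        raise NotImplementedError

    def insert(self, item):
        """Inserts a single product into the search backend."""
        raise NotImplementedError

    def insert_bulk(self, items):
        """Inserts a batch of products at once."""
        raise NotImplementedError

    def delete_all(self):
        """Deletes all indexed products."""
        raise NotImplementedError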

search_api.py contains the implementation of the SearchEngine class using the Search API; here is the full code.

import logging

from search_base import SearchEngine

from google.appengine.api import search


class SearchAPI(SearchEngine):
    """GAE Search API implementation, can be used only withing GAE"""

    def __init__(self, client=None):
        self.client = search.Index('products')  # setting Index

    def search(self, query):
        """Making search with SearchAPI and returning result"""
        try:
            search_results = self.client.search(query)
            results = search_results.results
            output = []
            for item in results:
                out = {
                    'value': item.field('name').value,
                    'label': item.field('name').value,
                    'sku': item.field('sku').value
                }
                output.append(out)
        except Exception:
            # log the error instead of silently failing the autocomplete request
            logging.exception('search failed')
            output = []
        return output

    def insert(self, item):
        """Inserts document in the Search Index"""
        doc = search.Document(
            fields=[
                search.TextField(name='name', value=item['name']),
                search.TextField(name='sku', value=item['sku']),
            ]
        )
        self.client.put(doc)

    def insert_bulk(self, items):
        """Inserts multiple documents into the Search Index (max 200 per call)"""
        docs = []
        for item in items:
            doc = search.Document(
                fields=[
                    search.TextField(name='name', value=item['name']),
                    search.TextField(name='sku', value=item['sku']),
                ]
            )
            docs.append(doc)
        self.client.put(docs)

    def delete_all(self):
        while True:
            document_ids = [
                document.doc_id
                for document
                in self.client.get_range(ids_only=True)]

            # If no IDs were returned, we've deleted everything.
            if not document_ids:
                break

            # Delete the documents for the given IDs
            self.client.delete(document_ids)

There is not much to explain, except that I am inserting only two product fields, name and sku, both as TextFields in the document.

The web application (file main.py) is written in Flask and implements some general URLs for saving product data (because we can use the Search API only within a GAE application), search and delete, and of course it renders the HTML page for the autocomplete.

import logging

from flask import Flask, render_template, request
from flask.json import jsonify

from google.appengine.ext import deferred

from search_api import SearchAPI

app = Flask(__name__)

search_client = SearchAPI()


@app.route('/')
def index():
    return render_template('index.html')


@app.route('/search')
def search():
    """based on user query it executes search and returns list of item in json"""
    query = request.args.get('query', '')
    results = search_client.search(query)
    return jsonify(results)


@app.route('/upload', methods=['POST'])
def upload():
    """gets list of products and saves into search index"""
    json_data = request.get_json()
    search_client.insert(json_data)
    return 'ok'


@app.route('/upload_bulk', methods=['POST'])
def upload_bulk():
    """gets list of products and saves into search index"""
    json_data = request.get_json()
    logging.info("received {} items".format(len(json_data)))
    search_client.insert_bulk(json_data)
    return 'ok'


@app.route('/delete')
def delete():
    """deletes all items in search"""
    deferred.defer(search_client.delete_all)
    return 'ok'

To deploy the GAE web application, you need to have the Cloud SDK installed. Before uploading the application, you first need to install some libraries locally (they will be uploaded with the application).

In the webapp folder, execute this command:

>pip install -r requirements.txt -t lib
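
On GAE Standard Python 2, vendored libraries in the lib folder are wired in via an appengine_config.py file; if your copy of the project doesn't have one, it looks like this:

from google.appengine.ext import vendor

# add the locally installed third-party libraries to sys.path
vendor.add('lib')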

In the folder load_data there is a script, upload.py, which reads data from the CSV file and makes requests to the GAE application. Since we are limited to 200 documents per insert request (I am doing batch imports) and 250 documents per second, I am sending 200 products per request and making a small pause in between. I don't remember exactly how long it took to upload all the data, but something like 3 hours or maybe even more. I guess it's no problem if you only upload it once.
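
The core of the upload loop looks roughly like this (a sketch; the URL and CSV column names are illustrative, and the real script is load_data/upload.py in the repo):

import csv
import time

import requests

APP_URL = 'https://<your-project>.appspot.com/upload_bulk'  # illustrative
BATCH_SIZE = 200  # Search API limit: max 200 documents per request

with open('products.csv') as f:
    batch = []
    for row in csv.DictReader(f):
        batch.append({'name': row['name'], 'sku': row['sku']})
        if len(batch) == BATCH_SIZE:
            requests.post(APP_URL, json=batch)
            batch = []
            time.sleep(1)  # small pause to stay under 250 documents per second
    if batch:
        requests.post(APP_URL, json=batch)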

Now, if you have uploaded the application as well as the data, you can try the search on your app's URL:

GCP Search

The Search API returns 20 results by default, here product names which contain the word "mouse". It supports pagination, i.e. continuing to fetch more results, which could be implemented as an extra feature; a sketch of how that works is below. This would also be a great case for faceted search, which allows refining search results. Maybe in some other article I will create an example with faceted search.
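
For completeness, pagination in the Search API is done with cursors; based on my reading of the docs, it looks roughly like this (query string illustrative):

from google.appengine.api import search

index = search.Index('products')

# first page: pass an empty cursor so the results include one
options = search.QueryOptions(limit=20, cursor=search.Cursor())
results = index.search(search.Query('mouse', options=options))

# next page: reuse the cursor returned with the previous results
if results.cursor:
    options = search.QueryOptions(limit=20, cursor=results.cursor)
    results = index.search(search.Query('mouse', options=options))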

Load testing

Of course, it's no problem to play around as a single user on your web app, but it would be interesting to see how the search behaves (responds) with multiple users. That's why I will do distributed load testing using Kubernetes and the load testing framework Locust, based on this article: https://cloud.google.com/solutions/distributed-load-testing-using-kubernetes. The GitHub repository referenced in that article is out of date (Kubernetes version), so I used this one: https://github.com/fawaz-moh/distributed-load-testing-using-kubernetes.

In the load-testing folder is everything needed to set up the load testing. This is also a several-step effort; I'll try to explain briefly how to set it up. First we create a Docker image which contains the Locust files for the load testing (I'm not going into details). Then we create a Kubernetes cluster on Google Kubernetes Engine, deploy the Docker image, and initiate the load testing, which makes requests and gathers stats about response times. The step-by-step process is explained in the Readme file in the load-testing folder, so I won't go into details here.

What I am doing with the Locust framework is that I parsed words from the product names and I am using those to make the search queries. The Locust configuration allows setting the hatch rate (number of users added per second) and the final number of users. Every user makes a request every 1 to 5 seconds.
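
A minimal sketch of such a Locust file, using the older HttpLocust API that was current at the time, with an illustrative word list (the real file parses the words from the product names):

import random

from locust import HttpLocust, TaskSet, task

# illustrative; the real list is parsed from product names
SEARCH_WORDS = ['mouse', 'keyboard', 'laptop', 'camera']

class SearchTasks(TaskSet):
    @task
    def search(self):
        # same endpoint the autocomplete widget calls
        self.client.get('/search', params={'query': random.choice(SEARCH_WORDS)})

class SearchUser(HttpLocust):
    task_set = SearchTasks
    min_wait = 1000  # each user waits 1-5 seconds between requests
    max_wait = 5000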

The cluster is the default one with 3 nodes of the n1-standard-1 VM type, and I am using preemptible instances to save money :). This allows running 12 slaves which make the requests. So here are some graphs and stats.

This is a graph of the number of requests per second; as displayed, at the end it was around 630 RPS, which is a decent load. The whole load test lasted around 10 minutes.

Locust

The average response time varied; you can see that in the beginning it was higher due to new instances being created to serve the requests. The growth of the number of users was linear.

Locust

The stats are also interesting: out of 286,119 requests, only 3 had errors; the median response time was 57ms and the average 181ms.

Locust

Here is also a screenshot from the GAE dashboard where the number of instances is displayed over time.

GAE

And finally, an excerpt from the logs.

The point of this load test was to demonstrate how the Search API scales above a single-user load, and together with App Engine it handled it without problems. This playing around cost me ~$16.

A more detailed and thorough description with examples is in the official documentation: https://cloud.google.com/appengine/docs/standard/python/search/.

In conclusion, the Search API has great search capabilities and, with no configuration, it's easy to use directly in code. Disadvantages can be (depending on the use case) the higher price and the lock-in to GAE Standard.

In the next article we will look at Cloud Datastore and see how we can use it to make text search queries.