Search inside attachments with AWS Elasticsearch

Gridium’s Tikkit work order management system runs on a microservices architecture, with Docker containers managed by Kubernetes. We’ve divided our application into small, reusable components, each of which runs in its own Docker container. Kubernetes takes care of keeping the right number of containers alive and talking to each other. We have containers that provide a messaging service, an auth service, Ember-based UIs for building managers and tenants, and a backend API that supports both UIs.

I wanted to add a new search service that the backend API could query for work order information. I’d used Elasticsearch before, and was impressed by its ease of use, flexibility, and scalability. It seemed like it should be easy: get the official Elasticsearch Docker image, install the Mapper Attachments plugin, and start indexing and searching content. But it didn’t turn out that way.

We use Kubernetes to manage our Docker containers. It “groups the containers which make up an application into logical units for easy management and discovery,” and it wants to be in charge of all networking operations between containers. Part of the value proposition of Elasticsearch is that it manages its own cluster. By default, it uses multicast to discover and communicate with other search nodes, but Kubernetes-managed minions don’t like this. How can I make these two pieces get along? I asked for help from Ray Lu, another Gridium engineer. He spent several unhappy hours messing with Kubernetes configuration and AWS networking setup, without success. We could easily set up and run a single Elasticsearch container, but of course that wouldn’t scale to production loads. Now what?

Just as we were trying to figure out where to go next, Amazon announced hosted Elasticsearch. I could “set up and configure [an] Amazon Elasticsearch cluster in minutes.” Yay! Just a few clicks on an AWS console, and search is working, right? Well, sort of. The AWS Elasticsearch service allows only a very limited set of plugins; installing additional plugins is not supported. The Mapper Attachments plugin is not on Amazon’s short list. Maybe they’ll add more in the future, but for now I need to do without Mapper Attachments.

I have a choice: use the working, scalable AWS-hosted Elasticsearch cluster without attachment search, or dive deeper into the mysterious depths of Kubernetes configuration, and try to make it do something it didn’t seem intended to support. Can I live without attachment search? Or maybe I don’t have to…

The Mapper Attachments plugin is really a thin layer over Apache Tika, which does the hard work of extracting text from thousands of attachment formats. Tika is written in Java and distributed as a set of jar files that you can add to your application. Gridium’s current Docker stack doesn’t include Java; this is why using the Mapper Attachments plugin seemed like the way to go at first. I checked out using Tika directly, and found there’s already a Docker container that exposes a REST API to the Tika library. I could use the REST API to extract text from my attachments before they get to Elasticsearch. I’m almost there!

To get attachments indexed I need to:

run my own Tika service
have the API send attachments to the Tika service
get back the text content of the attachment
index the now-plain text in Elasticsearch
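
The first three steps boil down to a single HTTP call. Here is a minimal sketch using only the Python standard library, assuming the Tika server accepts a PUT of the raw file bytes at /tika and answers with UTF-8 plain text. The address tika:9998 and the helper names build_tika_request and extract_text are illustrative, not part of Tika or of Gridium's code:

```python
import urllib.request

TIKA_URL = 'http://tika:9998/tika'  # assumed address of the Tika container


def build_tika_request(data, url=TIKA_URL):
    """Build a PUT request carrying the raw attachment bytes."""
    req = urllib.request.Request(url, data=data, method='PUT')
    # Ask for plain text back rather than HTML.
    req.add_header('Accept', 'text/plain')
    return req


def extract_text(data, url=TIKA_URL, opener=urllib.request.urlopen):
    """Send attachment bytes to Tika and return the extracted text.

    Assumes the response body is UTF-8 encoded.
    """
    with opener(build_tika_request(data, url)) as resp:
        return resp.read().decode('utf-8')
```

The opener argument exists only so the network call can be swapped out in tests; in normal use extract_text(file_bytes) is enough.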

On my laptop, I updated my docker-compose.yml config file to add a search container (the official Elasticsearch Docker image), a tika container (the unofficial Tika server Docker image), an ambassador for Tika, and links to search and tika from my API worker container. Here are the additions to my local docker-compose.yml:

apiworker:
  image: gridium/api:latest
  command: worker
  links:
    - search
    - tikaambassador:tika

search:
  restart: always
  image: elasticsearch:1.5  # matches AWS
  volumes:
    - ./search/config:/usr/share/elasticsearch/config
    - ./search/data:/usr/share/elasticsearch/data
  expose:
    - 9200
    - 9300
  ports:
    - "9200:9200"
    - "9300:9300"

tika:
  restart: always
  image: logicalspark/docker-tikaserver
  expose:
    - 9998
  ports:
    - "9998:9998"

tikaambassador:
  image: cpuguy83/docker-grand-ambassador
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
  command: -name=docker_tika_1

My production configuration is similar. Instead of running Elasticsearch in a Docker container, I want to send requests to the AWS-hosted Elasticsearch cluster. Usually, Elasticsearch runs on port 9200, but the AWS service exposes it on port 80. The production service accepts connections from other containers on port 9200, then forwards them to port 80 on the AWS instance.

Ben Straub, another Gridium engineer, set up the magic in the search-service.yml config file used by Kubernetes:

kind: Service
apiVersion: v1
metadata:
  name: search
  labels:
    name: search

spec:
  ports:
    # AWS ES instances listen on port 80, but clients expect port
    # 9200.  This proxies 9200 traffic to port 80 on the endpoint,
    # which passes it straight through
    - port: 9200
      targetPort: 80
      name: es

Both my laptop and production see the Tika service at tika:9998 and Elasticsearch at search:9200. I need to tell the API how to index an attachment: get the attachment data, send it to Tika, and then send the text on to Elasticsearch.
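
Because the hostnames and ports are identical in both environments, a quick reachability check can run unchanged on a laptop or in production. This is a rough sketch, not something from Gridium's codebase: check_services is a hypothetical helper, and it only verifies that a TCP connection can be opened, not that the service behind it is healthy.

```python
import socket

# Hostnames and ports as seen from the API container, per the text above.
SERVICES = {'tika': ('tika', 9998), 'search': ('search', 9200)}


def check_services(services=SERVICES, connect=socket.create_connection, timeout=2.0):
    """Return {name: True/False} for whether each service accepts a TCP connection."""
    status = {}
    for name, address in services.items():
        try:
            conn = connect(address, timeout)
            conn.close()
            status[name] = True
        except OSError:
            # Covers refused connections, timeouts, and DNS lookup failures.
            status[name] = False
    return status
```

Calling check_services() from inside the API container returns something like a dict of service names mapped to booleans, which is enough to spot a missing link or a misnamed service.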

Here’s what that looks like, written in Python within Gridium’s API service:

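import os

import boto
import boto.s3.key
import requests

# the rest runs inside the function that indexes one attachment
try: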
    s3conn = boto.connect_s3(config.AWS_ACCESS_KEY, config.AWS_SECRET_KEY)
    k = boto.s3.key.Key(s3conn.get_bucket(config.S3_BUCKET, validate=False))
    k.key = '%s/requests/%s%s' % (
        fields['account'], fields['uuid'], os.path.splitext(fields['name'])[1])
    # send content to tika to extract text
    r = requests.put('http://tika:9998/tika', data=k.get_contents_as_string())
    r.raise_for_status()  # don't index an error response as attachment text
    text = r.text
except Exception as e:
    print('error loading attachment %s: %s' % (fields['id'], str(e)))
    return
if not text:
    print('no text to index for attachment %s' % fields['id'])
    return
# req is the work order request object, loaded previously
if not req.get('attachments', None):
    req['attachments'] = []
req['attachments'].append({'content': text})
print(es.index(index='tikkit', doc_type='request', id=req_id, body=req))
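
The two pure pieces of the snippet above, building the S3 key and merging the extracted text into the request document, can be pulled out so they are easy to test without S3, Tika, or Elasticsearch. This is a sketch of one way to do that; the function names attachment_key and add_attachment_text are hypothetical, while the key layout and the attachments structure follow the code above.

```python
import os


def attachment_key(account, uuid, name):
    """S3 key for an attachment: <account>/requests/<uuid><original extension>."""
    extension = os.path.splitext(name)[1]
    return '%s/requests/%s%s' % (account, uuid, extension)


def add_attachment_text(req, text):
    """Append extracted text to the request document and return it."""
    # Treat a missing, None, or empty attachments value the same way.
    if not req.get('attachments'):
        req['attachments'] = []
    req['attachments'].append({'content': text})
    return req
```

For example, attachment_key('acme', 'abc-123', 'floorplan.pdf') gives 'acme/requests/abc-123.pdf', and calling add_attachment_text twice on the same document leaves two entries under attachments, one per file.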

I call this once an attachment has been created and attached to a work order request. Then, I can search attachment content within our AWS Elasticsearch like this:

query = {
    'multi_match': {
        'type': 'most_fields',
        'query': q,
        'fields': ['body', 'body.english', 'content'],  # ... plus more fields
    }
}
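
To run that query, it has to be wrapped in a search request body. The following is a minimal sketch, assuming the same es client, index name, and doc type used for indexing above; build_search_body is a hypothetical helper and the example search terms are made up.

```python
def build_search_body(q, fields):
    """Wrap a multi_match query in the body shape the search API expects."""
    return {
        'query': {
            'multi_match': {
                'type': 'most_fields',
                'query': q,
                'fields': list(fields),
            }
        }
    }


body = build_search_body('boiler inspection', ['body', 'body.english', 'content'])
# With an elasticsearch-py client named es (as in the indexing code), roughly:
#   results = es.search(index='tikkit', doc_type='request', body=body)
#   hits = results['hits']['hits']
```

Keeping the field list as a parameter makes it simple to add or drop fields from the search without touching the query structure.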

After creating a container for the Tika attachment extraction and wiring up the backend API to talk to it and to an AWS-hosted Elasticsearch instance, I have everything needed for a useful search experience: our Elasticsearch instance can search inside both requests and attachment content, and can scale as we need it.

About Kimberly Nicholls
