Painlessly reindexing your Elasticsearch index without any downtime

I believe anyone who uses Elasticsearch can relate to this: Elasticsearch is not really schemaless - if, for example, you decide you need to change one of your field mappings, you will have to reindex your data.

Here at PushApps, we migrated our Elasticsearch cluster from 1.x to 2.2 (the details of the installation were discussed here), and that's where we needed to think about how to migrate the data as well.

Elasticsearch offers the migration plugin - you can run it on your cluster and see what issues you can expect to encounter when trying to migrate your indices 'as is' - with their current mapping. For us there were too many issues to solve manually, so we decided to simply reindex all of our data.

Our settings

1. Daily indices
Let's assume we store tweets (of course we don't, but everybody uses this example...). We will have daily tweets indices: each index stores the tweets from one day, and we add the date as a suffix to the index name, so the tweets from March 18th 2016 will be stored in the tweets_20160318 index, and so on.

2. Index templates
Our indices are created on the fly, when we try to index the first tweet of the day and the index doesn't exist yet. Without this super useful feature we wouldn't be able to apply the same custom mapping to all of our tweets indices.
A simple template creation would look like this:

PUT /_template/tweets
{
    "template": "tweets_*",
    "mappings": {
        "tweet": {
            "dynamic_templates": [
                {
                    "notanalyzed": {
                        "match": "*",
                        "match_mapping_type": "string",
                        "mapping": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            ],
            "properties": {
                "timestamp": {
                 "type": "date",
                 "format": "strict_date_optional_time||epoch_millis"
                },
                "user": {
                    "type": "string"
                },
                "message": {
                    "type": "string",
                    "fields": {
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    }
}

The user field is a simple string; message is analyzed, but we also keep its raw form for exact searches and aggregations. We also decided that any new field we index without defining its mapping beforehand will not be analyzed. Of course, when you want to index a tweet you need to check its timestamp and then decide which index to store it in.
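A minimal sketch of that index selection in bash (the variable names here are ours, purely for illustration):

TIMESTAMP="2016-03-18T11:08:06.075+02:00"
DAY="${TIMESTAMP:0:10}"                      # take the date part: 2016-03-18
INDEX="tweets_$(echo "$DAY" | tr -d '-')"    # -> tweets_20160318
echo "$INDEX"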

Let's store some sample data:

POST /tweets_20160318/tweet/1  
{
    "user" : "Orr Chen",
    "message" : "just writing my new blog post",
    "timestamp" : "2016-03-18T11:08:06.075+02:00"
}
POST /tweets_20160318/tweet/2  
{
    "user" : "Leo Messi",
    "message" : "winning Copa america 2016 will make me greater than Maradona",
    "timestamp" : "2016-03-18T13:10:06.075+02:00"
}
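With the template applied, the message.raw sub-field can now be used for exact matches and aggregations. Here is a quick sketch of a terms aggregation on it (the aggregation name exact_messages is made up for illustration):

POST /tweets_*/tweet/_search
{
    "size": 0,
    "aggs": {
        "exact_messages": {
            "terms": { "field": "message.raw" }
        }
    }
}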

This is just an example of using templates - you can adapt it however fits your data.

The migration itself

Now, let's get down to business. You need your original Elasticsearch cluster, which we will refer to as source, and the new one, which we will refer to as target.

1. Recreate the templates

Every template you had on source - recreate it on target the same way you first created it. If you are migrating to a new version of Elasticsearch, make sure your mappings are valid. For example, in 1.x you could map the same field differently in different types of the same index; this is not allowed in 2.x.
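If you prefer to pull the existing template from source instead of re-running the original PUT, here is a sketch using curl and jq (jq is assumed to be installed; SOURCE and TARGET are the same variables used in the migration script further down):

SOURCE=http://old-es:9200
TARGET=http://new-es:9200
# GET returns the template wrapped as {"tweets": {...}}, so unwrap it before re-creating it
curl -s "$SOURCE/_template/tweets" | jq '.tweets' | curl -s -XPUT "$TARGET/_template/tweets" -d @-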

2. Switch to the new server

If you don't want any downtime, it's important that you first move your production load to the new cluster and make sure today's index is created and data is being stored. From this point on, all of your new data will be stored in the new indices.
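A quick sanity check we find useful (a sketch, using the same TARGET variable as in the script below):

TARGET=http://new-es:9200
# confirm today's tweets index exists on the new cluster and is receiving documents
curl -s "$TARGET/_cat/indices/tweets_$(date +%Y%m%d)?v"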

3. Reindex old data

The full script with documentation is available on GitHub.

Now let's take care of all the data from before the switch.

For this step we chose to use elasticsearch-dump. To install it:

npm install elasticdump  

List your relevant indices for migration:

SOURCE=http://old-es:9200  
TARGET=http://new-es:9200  
PREFIX=tweets  
list="$(curl -s -XGET "$SOURCE/_cat/indices/${PREFIX}_*" | awk '{print $3}')"

For each index, copy the data using the bulk API. First, check where elasticdump is installed by typing:

which elasticdump

Then run the dump for each index, using the path you found (in our case /usr/local/bin/elasticdump):

for item in $list
do
  echo "migrating $item ..."
  /usr/local/bin/elasticdump --bulk=true --input=$SOURCE/$item --output=$TARGET/$item
done

Now your "former" indices will start to recreate themselves on the new cluster. This might take a while, depending on your dataset size.
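An optional sanity check once the dump finishes (a sketch, reusing the $list, $SOURCE and $TARGET variables from above) is to compare document counts per index on both clusters:

for item in $list
do
  src=$(curl -s "$SOURCE/_cat/count/$item" | awk '{print $3}')
  tgt=$(curl -s "$TARGET/_cat/count/$item" | awk '{print $3}')
  echo "$item: source=$src target=$tgt"
done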

That's it! By first recreating the templates, moving the load to the new indices, and only then copying the old data, we made sure we didn't lose any data and didn't suffer any downtime.

On a final note, the last step of reindexing can be done in a few other ways; one cool option is the reindex helper in the Python client.

This is a process we will probably have to do again in the future, so we would be happy to hear your comments and learn how you think it's best done.

Orr Chen
