Monday, August 12, 2013

elasticsearch Basics

With any data, there's often a need to search for specific pieces, especially when that data comes grouped like documents. elasticsearch is a great tool for building a search mechanism over your data.  This search engine is very easy to start up after downloading:
bin/elasticsearch # add -f to run it in console mode

elasticsearch Queries and Testing:
http://localhost:9200/twitter/_search?pretty  - simple index level search on the twitter example

Test the default analyzer:  
curl -XGET localhost:9200/_analyze?pretty -d 'Test String and Data' # it will drop the "and"

Test with a specific analyzer:  
curl -XGET 'localhost:9200/index_name/_analyze?analyzer=autocomplete&pretty' -d 'New Text' #URL quoted so the shell leaves the & alone

Test with tokenizers/filters:  
curl 'localhost:9200/index_name/_analyze?tokenizer=whitespace&filters=lowercase,engram&pretty' -d 'Newer data' #two filters, one tokenizer; the built-in filter is named lowercase, and engram has to be defined in the index settings
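
To see at a glance which tokens an analyzer produced, the response can be boiled down to just the token values. A rough sketch, assuming the ?pretty layout puts each "token" entry on its own line; list_tokens is a made-up helper name and the sample response is hand-written rather than captured from a real node:

```shell
# list_tokens (hypothetical helper): print one token per line from an
# _analyze?pretty response read on stdin.
list_tokens() {
  sed -n 's/.*"token" *: *"\([^"]*\)".*/\1/p'
}

# Hand-written sample shaped like an _analyze response (illustrative only).
sample='{
  "tokens" : [ {
    "token" : "test",
    "position" : 1
  }, {
    "token" : "string",
    "position" : 2
  }, {
    "token" : "data",
    "position" : 4
  } ]
}'

printf '%s\n' "$sample" | list_tokens

# Against a live node (assumed at localhost:9200):
# curl -s 'localhost:9200/_analyze?pretty' -d 'Test String and Data' | list_tokens
```

This makes it easy to diff the output of two analyzers on the same text.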

Explaining a match:  
curl -XGET 'localhost:9200/index_name/type/document/_explain?pretty&q=quiet'

Validate a search:
curl 'localhost:9200/index_name/type/_validate/query?explain&pretty' -d '{...}' #the URL is quoted because an unquoted & sends the command to the background on *nix

Setting up and monitoring a cluster:
cluster.name : nodes with same cluster name will try to form a cluster
node.name : instance name; one is chosen automatically on each start up if unset, so it is best to set a unique one yourself
set the open files limit above 32k; give elasticsearch no more than 1/2 of system memory (the rest is left for disk caching)
curl localhost:9200/_cluster/health?pretty
(to turn this into an alert, grep for status and grep for anything that isn't green - that will give you an indication that elasticsearch is reporting a problem.  You might also want to check that 'timed_out' is false.)
To check health at the index or shard level, add the level parameter to the request above (?level=indices or ?level=shards) - see below.
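
The alert idea above can be sketched as a small shell function. This is only a minimal version under stated assumptions: health_is_ok is a made-up name, the sample responses are hand-written and trimmed to the two fields being checked, and the node address in the last comment is assumed.

```shell
# health_is_ok (hypothetical helper): read a _cluster/health response on
# stdin; succeed only if status is green and timed_out is false.
health_is_ok() {
  body=$(cat)
  printf '%s\n' "$body" | grep -q '"status" *: *"green"' || return 1
  printf '%s\n' "$body" | grep -q '"timed_out" *: *false' || return 1
}

# Hand-written sample responses (illustrative, trimmed to the fields checked).
green='{ "status" : "green", "timed_out" : false }'
yellow='{ "status" : "yellow", "timed_out" : false }'

printf '%s\n' "$green"  | health_is_ok && echo "OK"
printf '%s\n' "$yellow" | health_is_ok || echo "ALERT: cluster is not green"

# In a cron job against a live node (assumed at localhost:9200):
# curl -s 'localhost:9200/_cluster/health?pretty' | health_is_ok || echo "ALERT"
```

An empty response (for example when curl cannot connect) also fails the check, so an unreachable node raises the alert too.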

shutdown whole cluster : curl -XPOST localhost:9200/_cluster/nodes/_shutdown

shutdown a node in cluster: curl -XPOST localhost:9200/_cluster/nodes/node_name/_shutdown
get node names from : curl localhost:9200/_cluster/nodes

 ... and cluster/node stats:
curl localhost:9200/index_name,index_name2/_stats?pretty #in addition, can grep for count to find number of docs
curl 'localhost:9200/_cluster/health?pretty&level=indices' #add grep for status and grep -v green to turn into alert
curl 'localhost:9200/_cluster/health?pretty&level=shards' #add grep for status and grep -v green to turn into alert
curl localhost:9200/_stats
curl localhost:9200/_nodes # _nodes will also return OS status
curl localhost:9200/_nodes/SpecificNodeName
curl localhost:9200/_nodes/stats ; curl localhost:9200/_nodes/SpecificNodeName/stats
curl localhost:9200/_cluster/state #can add ?filter_nodes to remove node info (or filter_blocks,filter_indices, etc)


Documents:
Use: 
  • PUT (curl -XPUT) to put a document with a specific id into the system: localhost:9200/index/type/id -d 'document body'
  • POST (curl -XPOST) to have ES generate an ID, which needs to be read from the response: localhost:9200/index/type -d 'document body' #id in response body
  • GET (curl -XGET) to retrieve a known document: localhost:9200/index/type/id
  • /_update to update a document
  • /_search to query, either with a body or with a URL query like curl localhost:9200/index/_search?q=customerId:20020

curl -XGET localhost:9200/_search -d '{"query":{"match_all":{}}}' #get all data
use PUT if specifying the doc id, use POST if you want ES to generate it; different types can have different mappings
curl -XPUT localhost:9200/entertainment/movies/1 -d '{"movie_title": "The Monsters", "actor":"some nutty guy", "revenue":2000}'
curl -XPUT localhost:9200/entertainment/movies/2 -d '{"movie_title": "Alien Remake", "actor":"some nutty guy", "revenue":150000}'
curl -XPUT localhost:9200/entertainment/movies/flop -d '{"movie_title": "Slugs", "actor":"some nutty guy as bob", "revenue":123}' #note the change in document naming style - not always a good idea
curl -XPOST localhost:9200/entertainment/movies -d '{"movie_title": "Hairslugs", "actor":"bob as guy", "revenue":12300000}' # ES will return an id in the response - that's the key to finding this document directly
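
Since the generated id only shows up in the response body, a script that POSTs documents needs to pull it out. A minimal sketch, assuming the response carries a single "_id" field; extract_id is a made-up helper name and the sample response (including the id value) is hand-written for illustration:

```shell
# extract_id (hypothetical helper): print the _id value from an index
# response read on stdin.
extract_id() {
  sed -n 's/.*"_id" *: *"\([^"]*\)".*/\1/p'
}

# Hand-written sample response; the id value is made up.
response='{"ok":true,"_index":"entertainment","_type":"movies","_id":"abc123","_version":1}'

doc_id=$(printf '%s\n' "$response" | extract_id)
echo "fetch it later with: curl -XGET localhost:9200/entertainment/movies/$doc_id"

# Against a live node (assumed at localhost:9200):
# doc_id=$(curl -s -XPOST localhost:9200/entertainment/movies -d '{...}' | extract_id)
```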

curl -XGET localhost:9200/entertainment/movies/1 # retrieve that doc
curl -XDELETE localhost:9200/entertainment/movies/1 # remove it
curl -XGET localhost:9200/_search # across all indices
curl -XGET localhost:9200/entertainment/_search # across all types in entertainment
curl -XGET localhost:9200/entertainment/movies/_search # doc type movies in entertainment index

simple query:
curl -XGET localhost:9200/_search -d '{"query": {"query_string":{"query":"Monsters"}}}'
curl -XGET localhost:9200/_search -d '{"query": {"query_string":{"query":"Monsters", "fields":["movie_title"]}}}'

Filtered search:
curl -XGET localhost:9200/_search -d '{"query": {"filtered": {"query": {"query_string":{"query":"Monsters", "fields":["movie_title"]}}, "filter": { "term": {"revenue":2000}}}}}'
could switch the query to 'match_all' and just filter over everything
or use constant score like this:
'{"query": { "constant_score":{ "filter":{ "term":{"revenue":2000}}}}}'
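
The nesting in a filtered query is easy to get wrong by hand: the "query" and the "filter" both sit inside "filtered". One way to keep it straight is to generate the body. A small sketch; filtered_body is a made-up helper name, it does no escaping of quotes in its arguments, and it assumes the filter value is numeric:

```shell
# filtered_body (hypothetical helper): print a filtered-query body.
#   $1 = query_string text, $2 = field to search,
#   $3 = term filter field,  $4 = numeric term value
# Arguments are inserted as-is, so they must not contain double quotes.
filtered_body() {
  printf '{"query":{"filtered":{"query":{"query_string":{"query":"%s","fields":["%s"]}},"filter":{"term":{"%s":%s}}}}}\n' "$1" "$2" "$3" "$4"
}

body=$(filtered_body Monsters movie_title revenue 2000)
echo "$body"

# Against a live node (assumed at localhost:9200):
# curl -XGET localhost:9200/_search -d "$body"
```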

Mapping example:
Here is a multi-field (to have normally tokenized analysis and unaltered indexing) mapping (basically a schema) example.
Create the mapping on a new index (you might need to -XDELETE the index if it already exists or try updating an existing index, but you won't be able to update existing documents from what I understand). Below, we're creating a mapping on one type (movies) on an index (entertainment):
curl -XPUT localhost:9200/entertainment/movies/_mapping -d '{
 "movies":{ "properties":{ "actor":{ "type": "multi_field", "fields": {"actor": {"type":"string"},
 "fullname":{"type":"string", "index":"not_analyzed"}}}}}}'
and check your mappings on an index:
curl -XGET localhost:9200/entertainment/movies/_mapping

This sets actor to be analyzed as normal, but also adds actor.fullname as a field that is indexed unaltered, so it can be searched as one combined value.

To create a mapping on an entire index (all document types): use a similar query, but include your document types in the mapping:

curl -XPUT localhost:9200/entertainment/ -d '{
"mappings" : {
 "movies":{ 
    "properties":{ 
     "actor":{ 
        "type": "multi_field", "fields": {
             "actor": {"type":"string"},
             "fullname":{"type":"string", "index":"not_analyzed"}
        }
      }
    }
  },
 "cinemas":{
    "properties":{
     "location":{ "type": "string" }
    }
  }
 }
}'
Here we're creating two types of documents (entertainment/movies and entertainment/cinemas) using one mapping file.  Alternatively, this mapping could be loaded by
curl -XPUT localhost:9200/entertainment -d @mapping.json
where mapping.json is a file containing the json above.
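
Hand-edited mapping files tend to lose a brace somewhere, so a quick check before loading one can save a confusing error. The sketch below is deliberately crude: braces_balanced is a made-up helper that only counts braces (it is not a JSON validator, and braces inside string values would fool it), and the temp file stands in for mapping.json.

```shell
# braces_balanced (hypothetical helper): succeed when the file has as many
# opening as closing braces. A crude sanity check, not a JSON parser.
braces_balanced() {
  opens=$(( $(tr -cd '{' < "$1" | wc -c) ))
  closes=$(( $(tr -cd '}' < "$1" | wc -c) ))
  [ "$opens" -eq "$closes" ]
}

mapping_file=$(mktemp)
cat > "$mapping_file" <<'EOF'
{ "mappings": {
    "movies":  { "properties": { "actor": { "type": "multi_field", "fields": {
        "actor":    { "type": "string" },
        "fullname": { "type": "string", "index": "not_analyzed" } } } } },
    "cinemas": { "properties": { "location": { "type": "string" } } }
} }
EOF

if braces_balanced "$mapping_file"; then
  echo "braces balance, ok to load"
  # Against a live node (assumed at localhost:9200):
  # curl -XPUT localhost:9200/entertainment -d @"$mapping_file"
else
  echo "unbalanced braces in $mapping_file"
fi
rm -f "$mapping_file"
```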

Warming queries:
elasticsearch allows you to register warming queries, which are run to warm up an index before it is made available for search (for example during start up).  The queries are defined
via PUTting a query to localhost:9200/index/_warmer/warmer_query_name -d'...'
A warming query can be retrieved by GETting it: localhost:9200/index/_warmer/warmer_query_name
Warming queries can be deleted like any document (with -XDELETE on the same URL) and can be disabled by putting a setting ( {"index.warmer.enabled": false} ) to localhost:9200/index/_settings

A number of useful links:
https://github.com/elasticsearch/elasticsearch - documentation including the twitter example
http://elasticsearch-users.115913.n3.nabble.com/help-needed-with-the-query-tt3177477.html#a3178856
https://gist.github.com/justinvw/5025854 - auto-suggest example using a custom analyzer
http://www.elasticsearch.org/guide/reference/api/search/term-suggest/ - details on the auto-suggest
http://joelabrahamsson.com/elasticsearch-101/ - similar to some examples above
http://exploringelasticsearch.com/ - good, book like resource
http://www.elasticsearch.org/guide/reference/api/index_/
http://elasticsearch-users.115913.n3.nabble.com/Performance-on-indices-for-each-language-td4035879.html#a4036048
