Monday, August 26, 2013

Using rrdtool

RRDTool and Round Robin Databases

A round-robin database (rrd) is basically a fixed-size database that overwrites the oldest data once all of its slots have been used.  This behavior is common in caches that use a least-recently-used approach, for example.  However, it seems a little odd for a database - after all, you put things in a database because you want to keep them, right?  Well, not for all data - imagine storing time series data where old values are aggregated into less detailed information and stored elsewhere, or where data more than a year old just doesn't matter anymore.

RRDtool is a round robin database and a suite of tools all in one (graphite is a somewhat nicer alternative).  You can store data in RRDtool and retrieve the data or better yet, retrieve plots of the data.

(Update from 2019: DB-Engines ranks RRDtool still comparable to Graphite and Prometheus although Prometheus has been rising very quickly for the last few years. This is likely due to what RRDtool did to open up this field years ago.)

Here are some simple commands for using RRDtool: 
sudo yum install rrdtool #install or sudo apt-get install rrdtool

man rrdtool #if you want to read man pages
man rrdcreate #more details on creating an rrd file

Create an example rrd for storing counts of updates versus time - it's a made-up metric here, but it could be a count of hits on a web page or the number of packets on a network interface:
 #--start is set one step before the first sample loaded below; rrdtool rejects updates at or before the start time
 rrdtool create counts.rrd --start 1377509700 \
 -s 300 DS:count_updt:COUNTER:600:U:U RRA:AVERAGE:0.5:1:12 RRA:AVERAGE:0.5:3:10
#-s 300: step of 300s; DS: COUNTER type, heartbeat of 600s (the value is unknown if no update arrives within that time), Unknown (no limit) for min and max
#RRA (round robin archive) AVERAGE:0.5:1:12: AVERAGE is the consolidation function, 0.5 means up to half of the points in a row may be unknown, 1 step per row (maybe silly), 12 rows
#RRA:AVERAGE:0.5:3:10: another archive with 0.5, 3 steps consolidated per row, 10 rows (step = 5m, 3 steps = 15m, 10 rows = 150m)
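
As a rough check on the arithmetic in the comment above, the time span an archive covers is step x steps-per-row x rows. The helper below is only an illustrative sketch (the name rra_span_seconds is made up here); it takes an RRA string of the form used in the create command.

```python
def rra_span_seconds(step, rra):
    """Return the number of seconds covered by one RRA definition.

    `rra` looks like "RRA:AVERAGE:0.5:3:10":
    consolidation function, xff, steps per row, number of rows.
    """
    _, _cf, _xff, steps, rows = rra.split(":")
    return step * int(steps) * int(rows)


for spec in ("RRA:AVERAGE:0.5:1:12", "RRA:AVERAGE:0.5:3:10"):
    print(spec, rra_span_seconds(300, spec) // 60, "minutes")
```

With the 300s step used above, the first archive keeps an hour of 5-minute values and the second keeps 150 minutes of 15-minute averages.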

If you read the manual pages, you'll see CF frequently. CF is the 'consolidation function', or how RRDtool consolidates data for storage or display in a plot. Above, the CF is AVERAGE; there are others (MIN, MAX and LAST) as well as more sophisticated approaches like Holt-Winters.
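
To make the AVERAGE consolidation a little more concrete, here is a small Python sketch of the idea, not RRDtool's actual code. It assumes that a consolidated row becomes unknown when more than the xff fraction of its primary points is unknown, and that known points are otherwise averaged; the function name and sample values are made up.

```python
def consolidate_average(points, steps, xff=0.5):
    """Average each run of `steps` primary points; None marks an unknown value."""
    rows = []
    for start in range(0, len(points) - steps + 1, steps):
        window = points[start:start + steps]
        known = [p for p in window if p is not None]
        unknown_fraction = (steps - len(known)) / steps
        if not known or unknown_fraction > xff:
            rows.append(None)  # too many unknowns: the whole row is unknown
        else:
            rows.append(sum(known) / len(known))
    return rows


# First row has 1 of 3 unknown (kept), second has 2 of 3 unknown (dropped).
print(consolidate_average([3.0, None, 6.0, None, None, 9.0], steps=3))
```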

Load some data using the Unix timestamp (date +%s). Below, we're loading a number of points starting at 1377510000 which works with the rrd created above. You may need to update this to the current time if you change things.
  rrdtool update counts.rrd 1377510000:1234 1377510300:1230 1377510600:1200
  rrdtool update counts.rrd 1377510900:1363 1377511200:963 1377511500:1275
  rrdtool update counts.rrd 1377511800:1083 1377512100:999 1377512400:1099
  rrdtool update counts.rrd 1377512700:1500 1377513000:1810 1377513300:840
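
Because count_updt is a COUNTER, RRDtool stores a per-second rate between readings rather than the values themselves, and it reads a decrease as a counter overflow. The sketch below illustrates that behaviour as the rrdcreate man page describes it; it is an approximation written for this post, not RRDtool's implementation, and the function name is made up.

```python
def counter_rate(previous, current, seconds):
    """Per-second rate between two COUNTER readings, allowing for one wrap."""
    delta = current - previous
    if delta < 0:
        delta += 2 ** 32  # assume a 32-bit counter wrapped around
    if delta < 0:
        delta += 2 ** 64 - 2 ** 32  # otherwise assume a 64-bit wrap
    return delta / seconds


print(counter_rate(1200, 1363, 300))  # an increasing counter
print(counter_rate(1234, 1230, 300))  # a decrease is read as a wrap
```

The sample values loaded above go both up and down, so expect some very large rates when fetching them back; values like these fit the GAUGE type better.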

Pull the data back to double check - first pull it, then over a time range, then a dump:
 rrdtool fetch counts.rrd AVERAGE
 rrdtool fetch counts.rrd AVERAGE --start 1377510000  --end 1377513000
 rrdtool dump counts.rrd

Create an image from the data:
 rrdtool graph count_updt.png --start 1377510000 --end 1377513000   DEF:count_updt=counts.rrd:count_updt:AVERAGE LINE2:count_updt#FF0000 

DEF:virtual_name=rrd_filename:data-source-name:CF defines a virtual_name for the data-source-name in rrd_filename, read with the given consolidation function; here we've only used one data source (graphite is a little easier for things like this, but once you have the format, rrdtool is fine). LINE2:... means draw a line of weight 2 for the virtual name, followed by an HTML color code.

Check the source docs (here and here) for more detailed information. Also, read the man pages:
 man rrdgraph_data
 man rrdgraph_graph
 man rrdgraph_examples

Let's create a little more involved one without the rigid time settings. The type has been switched from COUNTER to GAUGE (counter expects increasing values while gauge can vary up and down):

  rrdtool create counts1.rrd --start now-7200 DS:count_updt:GAUGE:600:U:U RRA:AVERAGE:0.5:1:12 RRA:AVERAGE:0.5:3:10
Fill with some data - in this case increasing values each time:
 i=0; d=`date +%s`
 while [ $i -lt 24 ] ; do echo $i; let i=$i+1; let t1=d-7200+300*$i; c=$(($i*500)); echo $t1 $c ; rrdtool update counts1.rrd $t1:$c ; done

Have a look at the data, if you want:
 rrdtool dump counts1.rrd
Create a plot:
 rrdtool graph count_secd.png --start now-7200 DEF:count_updt=counts1.rrd:count_updt:AVERAGE AREA:count_updt#FF0000

Now, let's create and fill an rrd like the above, but with varying data:
 i=0; d=`date +%s` #reset i, otherwise the loop below never runs after the previous example
 rrdtool create noise.rrd --start=now-7200 DS:noise_meas:GAUGE:600:U:U RRA:AVERAGE:0.5:1:12   RRA:AVERAGE:0.5:3:10
 rrdtool dump noise.rrd
 let t=$d-7200
 while [ $i -lt 24 ] ; do echo $i; let i=$i+1; let t1=t+300*$i; v=$RANDOM; rrdtool update noise.rrd $t1:$v ; done
 rrdtool graph noise.png DEF:noise_meas=noise.rrd:noise_meas:AVERAGE LINE2:noise_meas#00FF00:"example_line\l" -t "Sample Graph" -v "values" -w 500 -h 200 -c BACK#AAAAAA -c GRID#333333
 rrdtool graph noise.png --start now-7200 DEF:noise_meas=noise.rrd:noise_meas:AVERAGE AREA:noise_meas#00FF00:"example_line\l" -t "Sample Graph" -v "values" -w 500 -h 200 -c BACK#AAAAAA -c GRID#333333

Now, let's create a plot with the data from the noise and counts1 data:
 rrdtool graph noise_count.png --start now-7200 DEF:noise_meas=noise.rrd:noise_meas:AVERAGE DEF:count_updt=counts1.rrd:count_updt:AVERAGE AREA:noise_meas#00FF00:"example_line\l" LINE2:count_updt#FF0000:"count\l" -t "Sample Graph" -v "values" -w 500 -h 200 -c BACK#AAAAAA -c GRID#333333

Using with Python
To make RRDtool more useful, it's good to link it with a bit of code to insert data as desired. Here, we're using the python-rrdtool package, with info here and here. There are other Python pages and packages for other libraries.

Using python with rrdtool will require a package install like this: yum install python-rrdtool

Add some python (fairly straightforward, but details depend on the package):
#quick set of python commands for putting rrd data into rrdtool:
from rrdtool import update as rrdtool_update
value = "30"
#explicit unix timestamp; it must be later than the last update already stored in counts.rrd
result = rrdtool_update('counts.rrd', '1552217030:%s' % value)
# this was the format for rrdtool-python:
#result = rrdtool_update('test.rrd', 'N:%s' % value) #N means now, otherwise could specify a specific unix timestamp
#could send two values if both were created in test.rrd: 'N:%s:%s' % (value1, value2)
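
The "time:value" strings passed to the update call are easy to get wrong by hand, so a tiny helper can build them. This is only a sketch; update_arg is a made-up name and nothing in it depends on the rrdtool package itself.

```python
import time


def update_arg(values, timestamp=None):
    """Build the "time:value[:value...]" string that rrdtool update expects.

    With no timestamp, "N" (now) is used, as in the commented line above.
    """
    when = "N" if timestamp is None else str(int(timestamp))
    return ":".join([when] + [str(v) for v in values])


print(update_arg([30]))
print(update_arg([30, 12], timestamp=time.time()))
```

The result can be passed straight through, for example rrdtool_update('counts.rrd', update_arg([30])).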

Combine this with some web calls (see the python page on this blog) and you'll be storing time series data quickly! Before you write your own simple receiver, though, have a look at rrdcached, which handles receiving metrics.

Saturday, August 17, 2013

Using Graphite - metrics data

Graphite is a tool for time series data storage and graphing.  The main page is here: Graphite.

Graphite (and its Carbon receiver and Whisper data store) stores a metric value against a specific time stamp for the metric that you want to track.  In other words, if you want to know the number of requests per minute that a web server is receiving, you'd send the data to your Graphite setup with a command like this (from a Unix/Linux/Mac command line):

echo "webserver1.requests_per_minute 201 1376748060" | nc 2003

Where webserver1.requests_per_minute is the metric to be saved. 201 is the value of the metric and 1376748060 is the time in seconds since Jan 1, 1970.  This simple set of values - metric name, metric value, metric timestamp in seconds is all that's needed by Graphite.
In the example above, this information is then piped into the netcat command to the graphite server at port 2003, the default port for Graphite.  There are other ways to get information into Graphite - python's pickle format, for example (there's also apparently AMQP support).
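
The same plaintext message can be sent from Python with a plain TCP socket. This is a minimal sketch: the function names are made up, and graphite.example.com is a hypothetical host standing in for your own Graphite server.

```python
import socket
import time


def format_metric(name, value, timestamp=None):
    """Return one plaintext line: metric name, value, timestamp, newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (name, value, timestamp)


def send_metric(host, name, value, timestamp=None, port=2003):
    """Open a TCP connection to Carbon and send a single metric line."""
    line = format_metric(name, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(line.encode("ascii"))


print(format_metric("webserver1.requests_per_minute", 201, 1376748060), end="")
# send_metric("graphite.example.com", "webserver1.requests_per_minute", 201)  # hypothetical host
```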

Graphite will store this data into a whisper directory as set up in the graphite installation. Whisper is the database Graphite uses to store data. It is a round-robin database very similar to rrdtool's storage (rrd meaning round robin database), which is where Graphite started off; however, limitations in the version of rrd at the time led to the creation of whisper.

In the Graphite/Whisper storage area, the metric data will be stored in a hierarchy based on the metric name given. In the example above, the metric name is webserver1.requests_per_minute which will lead to a directory called webserver1 in which there will be a file called requests_per_minute.wsp.  Graphite uses the "." as a delimiter to create the hierarchy.
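
The mapping from metric name to file is simple enough to sketch in a few lines of Python. The storage root below is an assumption (a common default location); use whatever your installation is configured with.

```python
import posixpath


def whisper_path(metric, root="/opt/graphite/storage/whisper"):
    """Map a dotted metric name to the .wsp file Whisper would use for it."""
    # Each "."-separated part becomes a directory; the last part is the file.
    return posixpath.join(root, *metric.split(".")) + ".wsp"


print(whisper_path("webserver1.requests_per_minute"))
```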

To view the data, use a web browser to go to the graphite front end, where you can browse the metrics that Graphite is storing and create graphs and dashboards of graphs for viewing.  Individual charts can be viewed by creating the right URL as in:

This URL will cause Graphite to draw a graph of the requests_per_minute from webserver1 and webserver2 for the last 24 hours until now and with a chart size of 400x250 pixels.
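
A URL of that kind can be assembled from Graphite's render parameters (target, from, until, width, height). The sketch below assumes a hypothetical host, graphite.example.com, and a made-up helper name; only the parameter names come from Graphite.

```python
from urllib.parse import urlencode


def render_url(base, targets, options):
    """Build a Graphite /render URL with one `target` parameter per metric."""
    params = [("target", t) for t in targets] + list(options.items())
    return base.rstrip("/") + "/render?" + urlencode(params)


url = render_url(
    "http://graphite.example.com",  # hypothetical host
    ["webserver1.requests_per_minute", "webserver2.requests_per_minute"],
    {"from": "-24h", "until": "now", "width": 400, "height": 250},
)
print(url)
```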

To see the raw data, add "&format=raw" to the end of a request; it will print the data per time slot, but won't show the time stamp. To view json data, add "&format=json" instead. To see the time stamp and the values, you'll need to use some Whisper commands:

  • requests_per_minute.wsp will show the data with the timestamps.
  • requests_per_minute.wsp will show basic information about the wsp file such as the expected time intervals.
  • requests_per_minute.wsp 5m:1y 30m:3y will resize the whisper data file to store data every 5 minutes for a year and then start aggregating values to 30 minutes for 3 years.
  • requests_per_minute.wsp 1376748060:199 will overwrite the currently stored value at time 1376748060 (201) with the new value (199).  I've had trouble getting this command to work, but have had success resubmitting the information via the netcat command as above.
  • will show a mix of and data including unfilled slots.
  • to create a new metric file - this isn't needed as sending the data to Graphite will cause it to create the file with defaults matching the metric.
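
Although the raw output has no per-point timestamps, each line starts with the series start time, end time and step, so the timestamps can be rebuilt. The sketch below assumes the raw format is "target,start,end,step|v1,v2,..." with None for empty slots, as Graphite's render documentation describes; the sample line is made up.

```python
def parse_raw(line):
    """Turn one line of format=raw output into (name, [(timestamp, value), ...])."""
    header, _, data = line.strip().partition("|")
    # Split from the right so commas inside the target name are left alone.
    name, start, _end, step = header.rsplit(",", 3)
    start, step = int(start), int(step)
    values = [None if v == "None" else float(v) for v in data.split(",")]
    return name, [(start + i * step, v) for i, v in enumerate(values)]


sample = "webserver1.requests_per_minute,1376748000,1376748180,60|201.0,None,199.0"
name, points = parse_raw(sample)
for timestamp, value in points:
    print(name, timestamp, value)
```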

Whisper files are created with default values set in the storage-schemas.conf file which has entries like:
pattern = webserver*
retentions = 60s:90d 5m:3y

which sets the data intervals and retentions to every 60s for 90d, then every 5 minutes for 3 years.  The default values are every minute for 24 hours/1 day. Make sure you resize the file or set the defaults before creating it.  When Whisper aggregates data into a coarser interval, it requires a certain fraction of the data points to be present. The default is xFilesFactor = 0.5, which means that at least 50% of the data points must exist for an aggregated value to be created. If you have a flaky data injection, you might want to reduce this amount.
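
To see what a retention string means in terms of stored points, and how the 50% rule plays out, here is a small Python sketch. It assumes Whisper's unit sizes (a year counted as 365 days) and is not Whisper's own code; the function names are made up.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def to_seconds(text):
    """Convert a value such as "60s", "5m" or "3y" to seconds."""
    return int(text[:-1]) * UNIT_SECONDS[text[-1]]


def parse_retention(spec):
    """Return (seconds per point, number of points) for e.g. "60s:90d"."""
    precision, duration = spec.split(":")
    seconds_per_point = to_seconds(precision)
    return seconds_per_point, to_seconds(duration) // seconds_per_point


def can_aggregate(points, x_files_factor=0.5):
    """True if enough points are known (not None) to create an aggregate."""
    known = sum(1 for p in points if p is not None)
    return known / len(points) >= x_files_factor


for spec in "60s:90d 5m:3y".split():
    print(spec, parse_retention(spec))
print(can_aggregate([201, None, 199, None, 205]))
```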

To stop and start carbon: stop start
For more info on this, look at:

Scaling a Graphite system
Graphite can be clustered to provide data and performance and even up-time at a scale that isn't possible on a single system.  The clustering available allows spreading data out and/or duplicating data for availability.  See more here:

Monday, August 12, 2013

elasticsearch Basics

With any data, there's often a need to search for specific pieces, especially when that data comes grouped like documents. elasticsearch is a great tool for building a search mechanism over your data.  This search engine is very easy to start up after downloading:
bin/elasticsearch # add -f to run it in console mode

elasticsearch Queries and Testing:
http://localhost:9200/twitter/_search?pretty  - simple index level search on the twitter example

Test the default analyzer:  
curl -XGET localhost:9200/_analyze?pretty -d 'Test String and Data' # it will drop the "and"

Test with a specific analyzer:  
curl -XGET 'localhost:9200/index_name/_analyze?analyzer=autocomplete&pretty' -d 'New Text'

Test with tokenizers/filters:  
curl 'localhost:9200/index_name/_analyze?tokenizer=whitespace&filters=lowercase,engram&pretty' -d 'Newer data' #two filters, one tokenizer

Explaining a match:  
curl -XGET 'localhost:9200/index_name/type/document/_explain?pretty&q=quiet'

Validate a search:
curl 'localhost:9200/index_name/type/_validate/query?explain&pretty' -d '{...}' #remember to quote the URL or escape the & on *nix

Setting up and monitoring a cluster: : nodes with same cluster name will try to form a cluster : instance name, will be chosen automatically on each start up, but best to set unique one
set num of open files > 32k; set memory to no more than 1/2 of system memory (to allow for disk caching)
curl localhost:9200/_cluster/health?pretty
(to turn this into an alert, grep for status and grep for anything that isn't green - that will give you an indication that elasticsearch is reporting a problem.  You might also want to check that 'timed_out' is false.)
To check index or shard level add a parameter to the statement above ?level=indices - see below.
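
The grep-based alert can also be written as a short Python check. This is a sketch under the assumption that the health response is JSON with "status" and "timed_out" fields, as described above; the function names and sample response are made up.

```python
import json
from urllib.request import urlopen


def fetch_health(base="http://localhost:9200"):
    """Fetch and decode the cluster health document."""
    with urlopen(base + "/_cluster/health", timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


def health_problems(health):
    """Return a list of reasons to alert; an empty list means all is well."""
    problems = []
    if health.get("status") != "green":
        problems.append("status is %s" % health.get("status"))
    if health.get("timed_out"):
        problems.append("health request timed out")
    return problems


sample = json.loads('{"cluster_name": "elasticsearch", "status": "yellow", "timed_out": false}')
print(health_problems(sample))
# print(health_problems(fetch_health()))  # against a running node
```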

shutdown whole cluster : curl -XPOST localhost:9200/_cluster/nodes/_shutdown

shutdown a node in cluster: curl -XPOST localhost:9200/_cluster/nodes/node_name/_shutdown
get node names from : curl localhost:9200/_cluster/nodes

 ... and cluster/node stats:
curl localhost:9200/index_name,index_name2/_stats?pretty #in addition, can grep for count to find number of docs
curl 'localhost:9200/_cluster/health?pretty&level=indices' #add grep for status and grep -v green to turn into alert
curl 'localhost:9200/_cluster/health?pretty&level=shards' #add grep for status and grep -v green to turn into alert
curl localhost:9200/_stats
curl localhost:9200/_nodes # _nodes will also return OS status
curl localhost:9200/_nodes/SpecificNodeName
curl localhost:9200/_nodes/stats ; curl localhost:9200/_nodes/SpecificNodeName/stats
curl localhost:9200/_cluster/state #can add ?filter_nodes to remove node info (or filter_blocks,filter_indices, etc)

  • XPUT to put a document with a specific id into the system: localhost:9200/index/type/id -d 'document body'
  • POST to have ES generate an ID, which needs to be read from the response: localhost:9200/index/type -d 'document body' #id in response body
  • XGET to retrieve a known document: localhost:9200/index/type/id
  • /_update to update
  • /_search to query - with a body or like curl localhost:9200/index/_search?q=customerId:20020

localhost:9200/_search -d '{ "query":{"match_all":{}}}' #get all data
use PUT if specifying the doc id, use POST if you want ES to do it; different types can have different mappings
curl -XPUT localhost:9200/entertainment/movies/1 -d '{"movie_title": "The Monsters", "actor":"some nutty guy", "revenue":2000}'
curl -XPUT localhost:9200/entertainment/movies/2 -d '{"movie_title": "Alien Remake", "actor":"some nutty guy", "revenue":150000}'
curl -XPUT localhost:9200/entertainment/movies/flop -d '{"movie_title": "Slugs", "actor":"some nutty guy as bob", "revenue":123}' #note the change in document naming style - not always a good idea
curl -XPOST localhost:9200/entertainment/movies -d '{"movie_title": "Hairslugs", "actor":"bob as guy", "revenue":12300000}' # ES will return an id in the response - that's the key to finding this document directly
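
The PUT-versus-POST choice is easy to wrap in code. The sketch below only builds the request object with Python's standard library; index_request is a made-up name, and the Content-Type header is an assumption that is harmless on old versions and expected by newer ones.

```python
import json
from urllib.request import Request


def index_request(base, index, doc_type, document, doc_id=None):
    """Build a PUT (known id) or POST (ES picks the id) indexing request."""
    path = "/%s/%s" % (index, doc_type)
    if doc_id is not None:
        path += "/%s" % doc_id
    return Request(
        base.rstrip("/") + path,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT" if doc_id is not None else "POST",
    )


movie = {"movie_title": "The Monsters", "actor": "some nutty guy", "revenue": 2000}
request = index_request("http://localhost:9200", "entertainment", "movies", movie, doc_id=1)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it to a running node.
```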

curl -XGET localhost:9200/entertainment/movies/1 #retrieve that doc
curl -XDELETE localhost:9200/entertainment/movies/1 #remove it
curl -XGET localhost:9200/_search #across all indices
curl -XGET localhost:9200/entertainment/_search #across all types in entertainment
curl -XGET localhost:9200/entertainment/movies/_search #doc type movies in entertainment index

simple query:
curl -XGET localhost:9200/_search -d '{"query": {"query_string":{"query":"Monsters"}}}'
curl -XGET localhost:9200/_search -d '{"query": {"query_string":{"query":"Monsters", "fields":["movie_title"]}}}'

Filtered search:
curl -XGET localhost:9200/_search -d '{"query": {"filtered": {"query": {"query_string":{"query":"Monsters", "fields":["movie_title"]}}, "filter": { "term": {"revenue":2000}}}}}'
could switch the query to 'match_all' and just filter over everything
or use constant score like this:
'{"query": { "constant_score":{ "filter":{ "term":{"revenue":2000}}}}}'
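
Nested query bodies like these are easier to get right when built as data and serialized. The Python sketch below assumes the "filtered" query form of the elasticsearch versions this post was written against (later versions replaced it with a bool query and filter clause); the function name is made up.

```python
import json


def filtered_query(text, fields, term_field, term_value):
    """Build a filtered search body: a query_string query plus a term filter."""
    return {
        "query": {
            "filtered": {
                "query": {"query_string": {"query": text, "fields": fields}},
                "filter": {"term": {term_field: term_value}},
            }
        }
    }


body = filtered_query("Monsters", ["movie_title"], "revenue", 2000)
print(json.dumps(body))
```

The printed JSON can be passed to curl with -d or sent with any HTTP client.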

Mapping example:
Here is a multi-field (to have normally tokenized analysis and unaltered indexing) mapping (basically a schema) example.
Create the mapping on a new index (you might need to -XDELETE the index if it already exists or try updating an existing index, but you won't be able to update existing documents from what I understand). Below, we're creating a mapping on one type (movies) on an index (entertainment):
curl -XPUT localhost:9200/entertainment/movies/_mapping -d '{
 "movies":{ "properties":{ "actor":{ "type": "multi_field", "fields": {"actor": {"type":"string"},
 "fullname":{"type":"string", "index":"not_analyzed"}}}}}} '
and check your mappings on an index:
curl -XGET localhost:9200/entertainment/movies/_mapping

 This sets actor to be analyzed as normal, but also adds actor.fullname as a field that can be searched as a single, unaltered value.

To create a mapping on an entire index (all document types): use a similar query, but include your document types in the mapping:

curl -XPUT localhost:9200/entertainment/ -d '{
"mappings" : {
    "movies": { "properties": { "actor": {
        "type": "multi_field", "fields": {
             "actor": {"type":"string"},
             "fullname":{"type":"string", "index":"not_analyzed"} } } } },
    "cinemas": { "properties": {
     "location":{ "type": "string" } } }
}
} '
Here we're creating two types of documents (entertainment/movies and entertainment/cinemas) using one mapping file.  Alternatively, this mapping could be loaded by
curl -XPUT localhost:9200/entertainment -d @mapping.json
where mapping.json is a file containing the json above.

Warming queries:
elasticsearch allows you to set warming queries to be run during start up.  The queries are defined
via PUTting a query to localhost:9200/index/_warmer/warmer_query_name -d'...'
A warming script can be retrieved by GETting it: localhost:9200/index/_warmer/warmer_query_name
Warming queries can be deleted like any document and can be disabled by putting a setting ({"index.warmer.enabled": false}) to localhost:9200/index/_settings

A number of useful links:
  • documentation including the twitter example
  • auto-suggest example using a custom analyzer
  • details on the auto-suggest
  • similar to some examples above
  • good, book like resource