What is Elasticsearch? Why do I need it?
Elasticsearch is a NoSQL, distributed full-text database. This means the database is document based: instead of using tables and a schema, we use documents… lots and lots of documents. The project was started in 2010 by Shay Banon, who wanted to create a storage and search engine that would be easy to operate. Elasticsearch is based on the Lucene engine, on top of which Shay added an HTTP REST interface, resulting in a distributed search engine that is incredibly easy to scale and returns results at lightning speed.
As a developer used to relational databases, I did face situations in my career where I needed to search tables with millions of records, which led to overly complex database views/stored procedures and to bolting full-text search onto relational database fields. That is something I personally dislike, as it made the database twice the size and the speed was not optimal either. Relational databases are simply not built for such operations.
Now with Elasticsearch we can achieve the speed we would like, as it lets us index millions of documents. But what’s the use of indexing our documents if we can’t find the one we are looking for just as quickly? Well as we will see, Elasticsearch can perform queries across all those millions of documents and return accurate results in a fraction of a second.
Secondly, there is search relevancy and scoring. In a typical relational SQL database you may try to write code as follows:
SELECT POST_CONTENT FROM BLOG WHERE POST LIKE '%something%';
This sort of gives us what we want, whereas Elasticsearch has sophisticated query techniques that allow us to apply scoring and relevance with a simple REST call.
Example of a query in Elasticsearch:
GET blog/_search
{
  "query": {
    "match": {
      "post": "something"
    }
  }
}
Finally, Elasticsearch offers statistical analysis tools, which allow us to see trends in our data.
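For example, a terms aggregation can surface trends directly in a search request. A minimal sketch, assuming a blog index with a hypothetical `author` keyword field:

```json
GET blog/_search
{
  "size": 0,
  "aggs": {
    "posts_per_author": {
      "terms": { "field": "author" }
    }
  }
}
```

Setting "size" to 0 skips the document hits and returns only the aggregated counts per author.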
Why would I want to use Elasticsearch?
Elasticsearch can be used for various use cases. For example, it can serve as a blog storage engine if you would like your blog to be searchable; traditional SQL doesn't readily give you the means to do that.
How about analytics tools? Most software generates tons of data that is worth analyzing, and Elasticsearch comes with Logstash and Kibana to give you a full analytics system.
Finally, I like to see Elasticsearch as a data warehouse, where you have documents with many different attributes and unpredictable schemas. Since Elasticsearch is schemaless, it won't matter that you store various kinds of documents there; you will still be able to search them easily and quickly.
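To illustrate (the index name and fields here are made up), two documents with completely different shapes can live in the same index and remain searchable:

```json
PUT warehouse/_doc/1
{ "type": "invoice", "amount": 199.95, "customer": "ACME" }

PUT warehouse/_doc/2
{ "type": "sensor-reading", "temperature": 21.4, "unit": "celsius" }
```

Elasticsearch simply adds the new fields to the index mapping as it sees them, so neither document needs an upfront schema.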
On the other hand, having a powerful tool like Kibana allows you to build a custom dashboard that gives non-technical managers the opportunity to view and analyze this data.
How does Elasticsearch save data?
Elasticsearch does not have tables, and a schema is not required. Elasticsearch stores documents, which are JSON objects, inside an index.
{
  "id": 1,
  "firstName": "Alexandra",
  "lastName": "Hamilton",
  "isActive": false,
  "balance": "2,815.91",
  "age": 35,
  "eyeColor": "green"
}
The fields are like the columns of a SQL table, and the values represent the data in the row cells.
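To make this concrete, the document above could be saved with a single REST call. The index name `people` and the document id are made up for illustration (and on older Elasticsearch versions a type name would appear in the path instead of `_doc`):

```json
PUT people/_doc/1
{
  "id": 1,
  "firstName": "Alexandra",
  "lastName": "Hamilton",
  "isActive": false,
  "balance": "2,815.91",
  "age": 35,
  "eyeColor": "green"
}
```

If the index does not exist yet, it is created automatically on the first write.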
When you save a document in Elasticsearch, you save it in an index. An index is like a database in a relational database. An index is split across multiple shards, and shards are stored on one or more servers, called nodes; multiple nodes form a cluster.
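As a sketch of how this maps to the API (the index name and the numbers are illustrative), the shard and replica counts can be set when the index is created:

```json
PUT blog
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}
```

Afterwards, `GET _cat/shards/blog?v` shows which node each primary and replica shard landed on.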
To get started with Elasticsearch:
http://joelabrahamsson.com/elasticsearch-101/
Elasticsearch Clients & Integrations:
Elasticsearch exposes a RESTful API, but in most cases we don't want to expose the full Elasticsearch REST API to the outside world.
Elasticsearch supports the following clients:
Java API
Javascript API
Groovy API
.NET API
PHP API
Perl API
Python API
Ruby API
Rivers
And some community contributions.
As I did my projects with Java and .NET, I was very surprised by how smooth the integration was.
Java:
For Java you can install the Elasticsearch client via Maven. As I was already using Jackson, mapping to the API was amazingly easy.
Java API: http://www.elasticsearch.org/guide/en/elasticsearch/client/java-api/current/index.html
.NET:
As for the .NET project, I used it with the NEST client. The installation is extremely simple with NuGet.
.NET API: http://www.elasticsearch.org/guide/en/elasticsearch/client/net-api/current/_nest.html
All APIs provide a very structured and clean approach to managing the data in the Elasticsearch server.
Elasticsearch plugins and tools: Plugins are a way to enhance the basic Elasticsearch functionality in a custom manner, while tools like Kibana and Logstash are essential to get a full ELK stack (Elasticsearch, Logstash, Kibana).
Logstash: a log parser. It is the answer to the question “How do I get my log data into Elasticsearch?” It is used to scrub your logs and parse all data sources into an easy-to-read JSON format. Logstash is extremely easy to use, and it’s the best tool if you are using Elasticsearch to monitor and analyze server logs. Logstash can centralize the logs in one place, while Elasticsearch indexes this data.
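Once Logstash is shipping logs, they end up in date-based indices (named `logstash-YYYY.MM.DD` by default), which you can query like any other index. A sketch assuming the default index naming and the standard `@timestamp` field:

```json
GET logstash-*/_search
{
  "query": {
    "range": {
      "@timestamp": { "gte": "now-1h" }
    }
  }
}
```

The wildcard `logstash-*` searches across all daily indices at once, which is what makes this setup so convenient for log analysis.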
You can read more about Logstash:
http://www.elasticsearch.org/overview/logstash
Kibana: my favorite of all. Kibana is the data visualization engine that allows us to build custom, dynamic dashboards to view the data.
With Kibana I could take data coming from a custom system, together with logs shipped via Logstash, and analyze it all to give the best overview of the data.
You can find more about Kibana:
http://www.elasticsearch.org/overview/kibana
To start with, I installed the following plugins:
Marvel:
Marvel lets you manage and monitor your Elasticsearch cluster. It is free for development use and gives you a wide overview of your cluster and all its nodes.
http://www.elasticsearch.org/overview/marvel/
Kopf:
Kopf is simpler than Marvel; it’s free and gives you an overview of the Elasticsearch cluster.
https://github.com/lmenezes/elasticsearch-kopf
River:
River is a pluggable service running within an Elasticsearch cluster that pulls data (or has data pushed to it), which is then indexed into the cluster.
There are several plugins that support River, some of them are described below.
Elasticsearch River JDBC:
As developers, we were raised in an era where relational databases were the (only) way… with the JDBC river we can move all the data from those databases into Elasticsearch for further usage, such as searching it, analyzing it, or applying smart relevancy to it.
https://github.com/jprante/elasticsearch-river-jdbc
RabbitMQ River:
RabbitMQ River allows you to automatically index a RabbitMQ queue. It even allows messages to be bulk-indexed in Elasticsearch.
https://github.com/elasticsearch/elasticsearch-river-rabbitmq
To view a detailed and full overview about Elasticsearch plugins:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html
Useful resources:
Would you like to try Elasticsearch?
Getting started:
http://joelabrahamsson.com/elasticsearch-101/
Elasticsearch for .NET developers on Pluralsight:
(For non-.NET developers: only one section covers .NET; the rest of the course is about Elasticsearch in general)
http://www.pluralsight.com/courses/elasticsearch-for-dotnet-developers
Nicely done. Thank you for the overall view. I got what I came to your page for, and I’d like to understand the idea quoted below. Can you point me to where I can go to understand shards, nodes, and what happens in an index?
” When you save a document in Elasticsearch, you save it in an index. An index is like a database in relational database. An index is saved across multiple shards and shards are then stored in one or more servers which are called nodes, multiple nodes form a cluster. “
Hi Rakes,
I’m glad it helped you; as you noticed, this is an old post, but it still applies.
You could watch this presentation, which explains quantitative cluster sizing:
https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing
So basically, when you have Elasticsearch installed on multiple servers (nodes), you can give a command to create a new index, and this index will be replicated depending on your cluster configuration.
For instance, if you have a cluster with 3 nodes and your indexes are relatively small, you could create an index with 3 shards and 2 replicas.
#3 nodes
index.number_of_shards: 3
index.number_of_replicas: 2
Now your data will be spread over the 3 servers, with 2 replica copies of each shard. Check this image and you will understand it better:
https://www.openprogrammer.info/wp-content/uploads/2016/10/37qyng8pSBCHjJMEHIZw_reduce-shards07.png
Remember: do not give an Elasticsearch node more than 32 GB of heap, and the best practice is to test Elasticsearch with the type of data you have and check how it performs.
Cheers,
Gabriel
Ok Gabriel. I’d like to dig more into Elasticsearch. Thank you for the response.