Elasticsearch is a popular open-source search and analytics engine that allows users to store, search, and analyze large volumes of data in real-time. When working with Elasticsearch, it's useful to have an understanding of its underlying components and how it stores data. Having this understanding can help you make more informed decisions about how to structure your data and optimize your search queries.
In this post, we'll explore the inner workings of Elasticsearch and examine the components that enable it to be such a powerful tool. We will cover Nodes & Clusters, Documents, Indices, Mappings, Shards, and Analyzers. By the end of this post, you'll have a solid grasp of the foundations of Elasticsearch and how it performs its magic.
Elastic Stack
This blog post focuses on Elasticsearch, which is part of the Elastic Stack. If you're new to Elastic, it's worth understanding how Elasticsearch fits into the stack.
The Elastic Stack is a set of open-source tools that consists of four main components:
- Logstash - The data processing pipeline that collects, parses, and transforms data from various sources. It provides many plugins that can process different data formats into Elastic Common Schema. This helps normalize the fields and data stored in Elasticsearch.
- Elasticsearch - The search and analytics engine that allows users to store, search, and analyze data.
- Kibana - The data visualization platform that enables users to explore and visualize data stored in Elasticsearch. Kibana provides a graphical interface that abstracts the Elasticsearch API and allows users to easily interact with their data.
- Beats - A set of lightweight data shippers that allow users to send data from various sources to Elasticsearch or Logstash. A Beats agent, along with its configuration file, is installed on a computer or server and sends data to the Elastic cluster.
Nodes & Clusters
A node is a single instance of Elasticsearch running on a server. The node stores data and participates in the indexing and search operations of the cluster.
A cluster is a collection of one or more nodes working together to provide distributed indexing and search capabilities. Nodes within a cluster are aware of each other and communicate to share data and distribute queries across the cluster.
Clusters provide fault tolerance, high availability, and horizontal scalability by allowing nodes to be added or removed from the cluster as needed to meet the evolving needs of the environment.
Documents
A document is the basic unit of data that can be indexed and searched in Elasticsearch. Documents are created from the JSON data that is ingested and indexed by Elasticsearch. A simple example of a JSON object that represents a customer is:
{
  "name": "John Doe",
  "age": 35,
  "contact": {
    "email": "johndoe@email.com",
    "phone": "111-222-3333"
  },
  "is_customer": true,
  "comments": "John owns his own business and is looking for cheaper distribution channels. Has a wife (Penny) and 2 kids (Sally & John Jr.)"
}
Elasticsearch saves the values of each field into an index, a process called "indexing". You can then run queries against that index, such as age > 30 or contact.email : johndoe*.
Each row that you see in Kibana's Discover view is a document, and each document belongs to an index.
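To make the two example queries concrete, here is a plain-Python sketch of what they ask of a single document. It is only an illustration of the query semantics, not how Elasticsearch executes them; the helper names (get_field, age_over_30, email_matches) are made up for this example.

```python
from fnmatch import fnmatchcase

# The customer document from the example above (comments field omitted for brevity).
customer = {
    "name": "John Doe",
    "age": 35,
    "contact": {"email": "johndoe@email.com", "phone": "111-222-3333"},
    "is_customer": True,
}


def get_field(doc, path):
    """Resolve a dotted field path such as 'contact.email' in a nested dict."""
    value = doc
    for key in path.split("."):
        value = value[key]
    return value


def age_over_30(doc):
    # Same question as the query: age > 30
    return get_field(doc, "age") > 30


def email_matches(doc, pattern="johndoe*"):
    # Same question as the query: contact.email : johndoe*
    return fnmatchcase(get_field(doc, "contact.email"), pattern)


print(age_over_30(customer), email_matches(customer))
```

The dotted path contact.email is how a nested field is addressed in a query, which is why the helper walks the dictionary one key at a time.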
Indices
An index is a collection of documents that share the same data structure.
Using our prior customer example, we expect all of our other customers to have the same data, so we could create a customer index to hold all the documents of this type.
The following JSON demonstrates the creation of an index named "customers" that will hold our customer data.
PUT /customers
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "age": { "type": "integer" },
      "contact": {
        "properties": {
          "email": { "type": "keyword" },
          "phone": { "type": "keyword" }
        }
      },
      "is_customer": { "type": "boolean" },
      "comments": { "type": "text" }
    }
  }
}
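If you would rather issue that request from a script than from the Kibana console, a minimal sketch using only the Python standard library might look like this. The address http://localhost:9200 and the function name build_create_index_request are assumptions for illustration (a local, unsecured development node); the request is built but not sent, so the snippet runs without a cluster.

```python
import json
import urllib.request

# Assumption: a local development node with security disabled.
ES_URL = "http://localhost:9200"


def build_create_index_request(index_name, shards=1, replicas=1):
    """Build (but do not send) the PUT request that creates the index."""
    body = {
        "settings": {
            "index": {"number_of_shards": shards, "number_of_replicas": replicas}
        },
        "mappings": {
            "properties": {
                "name": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
                "age": {"type": "integer"},
                "contact": {
                    "properties": {
                        "email": {"type": "keyword"},
                        "phone": {"type": "keyword"},
                    }
                },
                "is_customer": {"type": "boolean"},
                "comments": {"type": "text"},
            }
        },
    }
    return urllib.request.Request(
        url=f"{ES_URL}/{index_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_create_index_request("customers")
print(request.get_method(), request.full_url)

# To send it to a running node, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Keeping the build step separate from the send step makes it easy to inspect the exact body before anything touches the cluster.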
When it comes to an index, two of the most important things to understand are Shards and Mappings.
Shards
Shards are the underlying data structures that hold the indexed data. Each index is divided into one or more primary shards, which are distributed across the nodes in the Elasticsearch cluster.
Primary shards are essentially smaller, independent units of the index. Sharding the index enables Elasticsearch to distribute the index across multiple nodes in a cluster, which allows for better performance, fault tolerance, and scalability.
When you execute a search query, Elasticsearch will simultaneously search all relevant shards across the nodes for the data. This distributed searching is one of the reasons Elasticsearch can quickly return results.
Types of Shards
There are two types of shards: Primary and Replica. An index will have at least one primary shard, and each primary shard will have zero or more replica shards.
When a document is indexed in Elasticsearch, it is first written to a primary shard. After the primary shard is updated, it forwards the change to all of its replica shards. A system of sequence numbers and acknowledgments ensures that the replica shards apply the changes in the proper order. Once all in-sync replica shards have acknowledged the change, the primary shard considers the write successful.
Replica shards are exact copies of primary shards and are also distributed across different nodes in the cluster. In the event of a primary shard failure, one of its replica shards is promoted to serve as the new primary shard. This process is transparent to the user and it ensures that data remains available and consistent even in the face of hardware or software failures.
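The write path and failover described above can be pictured with a toy model. This is a deliberately simplified sketch: the ShardGroup class and its methods are invented for illustration, and it skips everything a real cluster does around ordering, networking, and allocating a fresh replica after a promotion.

```python
class ShardGroup:
    """Toy model of one primary shard and its replica copies."""

    def __init__(self, replica_count):
        self.primary = {}  # doc_id -> document
        self.replicas = [{} for _ in range(replica_count)]

    def index(self, doc_id, doc):
        # Write to the primary first, then forward the same change to every replica.
        self.primary[doc_id] = doc
        for replica in self.replicas:
            replica[doc_id] = doc
        # In this toy model the write counts as successful once all copies hold it.
        return True

    def fail_primary(self):
        # Promote one replica to take over as the new primary.
        if not self.replicas:
            raise RuntimeError("no replica available to promote")
        self.primary = self.replicas.pop(0)


group = ShardGroup(replica_count=1)
group.index("1", {"name": "John Doe"})
group.fail_primary()
print(group.primary["1"])  # the document is still available after the failure
```

With replica_count=0, the same failure leaves nothing to promote, which is the availability risk of running an index without replicas.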
How Shards Affect Your Data
The number of primary shards for an index is set when the index is created and cannot be changed in place afterwards; changing it means creating a new index, for example with the split, shrink, or reindex APIs. The number of replica shards, however, can be increased or decreased after the index is created. By default, Elasticsearch creates one replica shard for each primary shard, but this can be configured during index creation.
The number of primary shards determines how many pieces an index is divided into. Elasticsearch distributes documents between all primary shards, so the number of primary shards affects the size of each primary shard and its replicas. (Remember, since replicas are exact copies, they will be the same size as their primary shards.) The larger the shard, the more time it takes for Elasticsearch to search the data within it. It also takes more time to migrate larger shards when Elasticsearch rebalances the cluster.
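To see how documents end up spread across primary shards, and why the shard count is fixed at creation, here is a small sketch of hash-based routing. It assumes the simplified rule "hash the routing value (by default the document ID), then take it modulo the number of primary shards". The pick_shard function and the customer-N IDs are hypothetical, and zlib.crc32 is only a stand-in to keep the example dependency-free; it is not the hash function Elasticsearch uses.

```python
import zlib


def pick_shard(doc_id, num_primary_shards):
    """Map a document ID to a primary shard number (simplified stand-in)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


# Hypothetical document IDs.
doc_ids = [f"customer-{n}" for n in range(1000)]

# Count how many documents land on each of 3 primary shards.
counts = [0, 0, 0]
for doc_id in doc_ids:
    counts[pick_shard(doc_id, 3)] += 1
print(counts)

# With a different shard count, many documents would be routed somewhere else.
moved = sum(pick_shard(d, 3) != pick_shard(d, 4) for d in doc_ids)
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Because the shard number depends on the shard count, changing the count would leave existing documents sitting on shards where lookups no longer expect them.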
Primary shards are used to index documents, so the number of primary shards available affects write performance. However, Elasticsearch uses both primary and replica shards to read data, so the number of replica shards can improve search performance as queries can be distributed to the relevant primary and replica shards across all nodes.
How Many Shards Should You Have?
Increasing the number of primary shards can improve indexing and write performance by distributing the data across more nodes; however, it also raises the overhead of managing the shards. With many small shards, the processing per shard is faster, but a query that runs against many shards results in higher overhead and could ultimately reduce performance.
Increasing the number of replica shards improves data availability and search performance by providing redundancy and distributing search queries across more nodes. However, adding more replica shards also increases the amount of storage required and the network bandwidth required to maintain the replicas.
It's important to choose the right number of shards for your use case, as too few shards can lead to poor performance, while too many shards can result in excessive overhead. The optimal number of shards depends on the environment; there is no magic formula.
In general, if you have an index that will hold a massive amount of data or has high indexing throughput requirements, you may want to increase the number of primary shards to improve performance. If you have high availability requirements or just need to improve the performance of search queries, you may want to increase the number of replica shards.
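Since the replica count can be changed on a live index, adjusting it is a small request to the index settings endpoint. The sketch below builds that request with the Python standard library without sending it; the address http://localhost:9200 and the function name build_replica_update_request are assumptions for illustration.

```python
import json
import urllib.request

# Assumption: a local development node with security disabled.
ES_URL = "http://localhost:9200"


def build_replica_update_request(index_name, replicas):
    """Build (but do not send) a request that changes number_of_replicas."""
    body = {"index": {"number_of_replicas": replicas}}
    return urllib.request.Request(
        url=f"{ES_URL}/{index_name}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_replica_update_request("customers", replicas=2)
print(request.get_method(), request.full_url, request.data.decode("utf-8"))

# To send it to a running node, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Keep in mind that each extra replica is a full copy of every primary shard, so storage use grows with the replica count.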
Mappings
Mappings in Elasticsearch provide the framework that defines how data is stored in an index. Mappings are important because they define the schema for the data that is being indexed, which helps Elasticsearch understand how to parse and store the data.
As a reminder, the previous mapping for our customer data was:
"mappings": {
  "properties": {
    "name": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    },
    "age": { "type": "integer" },
    "contact": {
      "properties": {
        "email": { "type": "keyword" },
        "phone": { "type": "keyword" }
      }
    },
    "is_customer": { "type": "boolean" },
    "comments": { "type": "text" }
  }
}
A mapping is the blueprint for an index: it specifies the fields and properties that the index will contain. Each field has its own mapping which defines the data type, format, and other properties that describe the field. Mappings also provide control over how data is analyzed for a field, which is important for understanding how the data is stored and queried.
When creating a mapping in Elasticsearch, it is important to consider the data types of the fields and how they will be used in searches. Elasticsearch provides a variety of data types, including text, keyword, date, numeric, and more. You can also customize how a field is mapped, for example by indexing the same value in more than one way with multi-fields (as the "name" field does above), to accommodate unique data requirements.
Data Structures
To best accommodate these different data types, Elasticsearch uses different data structures to store the data. Field types such as keyword and integer are intended for sorting and aggregating, so they use a doc_values structure. In contrast, values that are expected to contain a lot of words or unstructured text use a text field (such as the "comments" field in our previous example). A text field uses an inverted index structure to better support fast full-text searches.
Understanding these differences is important because values indexed in a doc_values structure are stored verbatim, whereas values stored in an inverted index are analyzed and stored as individual tokens.
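One way to picture a doc_values structure is as a column: one list of verbatim values per field, addressed by document ID. The sketch below is only a mental model in plain Python, not the on-disk format; the second and third customers are made-up sample data.

```python
# Three toy documents, identified by their position (doc IDs 0, 1, 2).
# Only the first comes from the article; the others are hypothetical.
docs = [
    {"name": "John Doe", "age": 35},
    {"name": "Ann Lee", "age": 28},
    {"name": "Raj Patel", "age": 41},
]

# Column-oriented view: one list per field, indexed by doc ID, values kept verbatim.
doc_values = {field: [doc[field] for doc in docs] for field in ("name", "age")}

# Sorting: order doc IDs by the "age" column without touching other fields.
ids_by_age = sorted(range(len(docs)), key=lambda doc_id: doc_values["age"][doc_id])

# Aggregating: compute an average over the same column.
average_age = sum(doc_values["age"]) / len(docs)

print(ids_by_age, average_age)
```

Sorting and aggregating both read one field for many documents, which is the access pattern a column layout serves well.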
For more information, check out this article:
Inverted Index
An inverted index in Elasticsearch is a specialized data structure used to improve the efficiency of searching and retrieving text data. An inverted index is created for each text field specified in the mappings of an index.
The inverted index is a list of the unique terms found across that field in all documents, along with the locations of those terms within each document. By referencing the inverted index, Elasticsearch can quickly return search results without having to read through every document in the set.
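Here is a minimal sketch of that idea in Python: each unique term maps to the documents that contain it and the positions where it appears. The function name build_inverted_index is made up, the second comment is hypothetical sample data, and splitting on whitespace after lowercasing stands in for real analysis.

```python
from collections import defaultdict

# Document 1 borrows from the customer example; document 2 is hypothetical.
comments = {
    1: "John owns his own business",
    2: "Penny owns a bakery",
}


def build_inverted_index(texts):
    """Map each term to {doc_id: [positions of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in texts.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


index = build_inverted_index(comments)

# Looking up a term is a single dictionary access; no document is scanned.
print(index.get("owns"))
print(index.get("john"))
print(index.get("pizza"))  # a term that was never indexed
```

The lookup cost depends on the number of terms in the query rather than the number of documents, which is what makes full-text search fast.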
To process the text that is stored in the inverted index, Elasticsearch uses analyzers.
Analyzers
An analyzer is an essential component of Elasticsearch. An analyzer takes raw text and transforms it into a stream of tokens by breaking the text down into smaller units based on a set of rules. The primary purpose of an analyzer is to make text more relevant and searchable by normalizing and enriching it. Analyzers are used during both the indexing and search phases of Elasticsearch. There are several built-in analyzers, but you can also create custom analyzers tailored to your specific needs.
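As a rough illustration, the sketch below chains a tokenizer and a lowercase step, which loosely resembles what a basic analyzer does to English text. It is an approximation for intuition only: the function names are invented, and the regular expression is much cruder than the tokenization rules Elasticsearch applies.

```python
import re


def tokenize(text):
    # Tokenizer step: keep runs of letters, digits, and underscores; drop the rest.
    return re.findall(r"\w+", text)


def lowercase(tokens):
    # Normalization step: make matching case-insensitive.
    return [token.lower() for token in tokens]


def analyze(text):
    """Turn raw text into a list of normalized tokens."""
    return lowercase(tokenize(text))


# Index time: the tokens that would be stored for a text field.
print(analyze("John owns his own business."))

# Search time: the query text goes through the same steps, so "JOHN" finds "john".
print(analyze("JOHN"))
```

Running the same analysis at index time and at search time is what lets a query match text that differs in case or punctuation.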
For a detailed look at Analyzers, check out this post:
Conclusion
Wow! We covered a lot in this post!
We learned about Nodes & Clusters, Documents, Indices, Mappings, Shards, and Analyzers. Taking the time to understand how these pieces fit together is crucial to learning how Elasticsearch handles our data. If you have any questions or feedback, don't hesitate to drop a comment down below.
Thanks for reading!