Elasticsearch Overview
Elasticsearch is a distributed search engine that offers high scalability, reliability, and ease of management. It is built on Apache Lucene and supports full-text retrieval, structured retrieval, and analytics, and it can combine all three in a single query. Elasticsearch is widely used in log management, real-time data analysis, full-text retrieval, and SRA.
Elasticsearch has the following features:
- Distributed architecture: It supports horizontal scaling and can handle massive volumes of data.
- Near-real-time search: By default, data becomes searchable within about one second after it is written.
- High availability: Data replication and failover mechanisms keep the system resilient and accessible.
- Flexible query language: A wide range of search and aggregation operations are supported.
- RESTful APIs: Easy-to-use RESTful APIs are provided, so clients written in any programming language can access Elasticsearch over HTTP.
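As a rough illustration of the query language and REST interface, the following Python sketch builds the request for a single search that combines a full-text match, a structured filter, and an aggregation. The index name app-logs and the fields message, level, and host are hypothetical, and the sketch only constructs the request; it does not send it to a cluster.

```python
import json

def build_search_request(index, text, level):
    """Return (method, path, body) for one search that mixes three kinds of work."""
    body = {
        "query": {
            "bool": {
                # Full-text retrieval: analyzed match on a text field.
                "must": [{"match": {"message": text}}],
                # Structured retrieval: exact-value filter, not scored.
                "filter": [{"term": {"level": level}}],
            }
        },
        # Analytics: bucket the matching documents by host.
        "aggs": {"per_host": {"terms": {"field": "host"}}},
        "size": 10,
    }
    return "POST", f"/{index}/_search", json.dumps(body)

method, path, payload = build_search_request("app-logs", "connection timeout", "error")
print(method, path)
print(payload)
```

Any HTTP client can send this body to the cluster's search endpoint, which is why no language-specific driver is strictly required.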
For more information about Elasticsearch, visit the Elasticsearch official website.
Programming language: Java
Brief description: distributed search engine
Open-source license: XXX
When using open-source software, comply with the applicable licenses.
Recommended Software Version
The recommended version is Elasticsearch 8.10.1.
Principles
Elasticsearch is an open-source distributed search and analytics engine built on Apache Lucene. Its core design aims to efficiently store, retrieve, and analyze massive volumes of data. It uses a distributed architecture, dividing an index into multiple shards. The shards can be distributed across different nodes, enabling horizontal data scaling and load balancing. To ensure high availability, each primary shard can be configured with one or more replica shards. Replica shards provide fault tolerance and offload query requests to boost read performance.
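The sharding and replication scheme described above can be sketched in a few lines of Python. This is a simplified model, not Elasticsearch's allocator: the CRC32 hash is a stand-in for the hash function Elasticsearch actually uses, the round-robin placement is an assumption made for illustration, and the function names are hypothetical.

```python
import zlib

def route_to_shard(routing_key, num_primary_shards):
    """Map a document's routing key to a primary shard: hash(key) mod shard count."""
    # CRC32 is only a deterministic stand-in hash for this sketch.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards

def place_shards(num_primary, num_replicas, nodes):
    """Spread primaries and replicas round-robin so that no node holds
    two copies of the same shard."""
    if num_replicas + 1 > len(nodes):
        raise ValueError("not enough nodes to separate all copies of a shard")
    placement = {node: [] for node in nodes}
    for shard in range(num_primary):
        for copy in range(num_replicas + 1):
            node = nodes[(shard + copy) % len(nodes)]
            role = "primary" if copy == 0 else "replica"
            placement[node].append((shard, role))
    return placement

placement = place_shards(num_primary=3, num_replicas=1, nodes=["node-1", "node-2", "node-3"])
for node, copies in placement.items():
    print(node, copies)
print("doc-42 goes to shard", route_to_shard("doc-42", 3))
```

Because the shard number depends on the number of primary shards, changing that number would send existing keys to different shards. Replica counts do not enter the routing formula, so they can vary without affecting where a document lives.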
The key to Elasticsearch's fast search is its inverted index. An inverted index maps each term to the list of IDs of the documents that contain it. When a user performs a search, Elasticsearch first tokenizes the query text, then looks up the corresponding document ID lists directly in the inverted index, and finally merges these lists, calculates relevance scores, and returns the sorted results. Because the lookup goes from terms to documents, the engine never has to scan every document, which makes the process highly efficient.
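A toy version of this lookup can be written in Python as follows. It is a minimal sketch under simplifying assumptions: tokenization is plain lowercase whitespace splitting, and the score is just the number of query terms a document contains, which stands in for a real relevance function such as BM25.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Look up each query term, merge the ID lists, and rank by match count."""
    scores = defaultdict(int)
    for term in tokenize(query):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document ID.
    return sorted(scores, key=lambda d: (-scores[d], d))

docs = {
    1: "quick brown fox",
    2: "lazy brown dog",
    3: "quick red fox jumps",
}
index = build_inverted_index(docs)
print(search(index, "brown fox"))  # [1, 2, 3]
```

Document 1 ranks first because it contains both query terms. Only the posting lists for "brown" and "fox" are read, regardless of how many other terms the index holds.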
For data writes, Elasticsearch provides a near-real-time search experience. Newly written documents are first stored in an in-memory buffer and recorded in the transaction log (translog) so that acknowledged writes can be recovered after a failure. By default, the in-memory buffer is refreshed every second: its contents are written to a new immutable segment in the file system cache. At this point, the documents become searchable, but the segment itself is not yet safely persisted to disk. Periodic flush operations persist the segments in the file system cache to disk and clear the corresponding translog entries. In addition, a background merge process combines multiple small segments into larger, more efficient ones and physically removes documents marked for deletion, which optimizes storage and query performance.
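The write path above can be modeled with a small in-memory simulation. The ToyShard class and its method names are hypothetical and exist only for illustration; nothing here touches a disk, and the sketch assumes document IDs are never reused after deletion.

```python
class ToyShard:
    """In-memory model of the buffer, translog, segments, and deletion marks."""

    def __init__(self):
        self.buffer = []      # newly indexed docs, not yet searchable
        self.translog = []    # operations recorded since the last flush
        self.segments = []    # immutable segments; only these are searched
        self.deleted = set()  # IDs marked for deletion

    def index(self, doc_id, text):
        self.buffer.append((doc_id, text))
        self.translog.append(("index", doc_id, text))

    def delete(self, doc_id):
        self.deleted.add(doc_id)
        self.translog.append(("delete", doc_id, None))

    def refresh(self):
        """Turn the buffer into a new immutable segment, making it searchable."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        """Model persisting segments to disk, after which the translog is cleared."""
        self.refresh()
        self.translog = []

    def merge(self):
        """Combine all segments into one and drop documents marked as deleted."""
        live = tuple(d for seg in self.segments for d in seg if d[0] not in self.deleted)
        self.segments = [live] if live else []

    def search(self, term):
        return [doc_id for seg in self.segments for doc_id, text in seg
                if doc_id not in self.deleted and term in text.split()]

shard = ToyShard()
shard.index(1, "disk timeout")
print(shard.search("timeout"))  # [] because the doc is still in the buffer
shard.refresh()
print(shard.search("timeout"))  # [1] once a segment exists
```

The gap between index and refresh is what makes search near-real-time rather than real-time, while the translog entry written at index time is what protects the document during that gap.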
The distributed nature of Elasticsearch is further reflected in its cluster coordination and fault tolerance. The cluster uses a discovery mechanism for node management and master node election (Zen Discovery in versions earlier than 7.0, and a redesigned cluster coordination subsystem from 7.0 onward). The master node manages the cluster state and shard allocation. When a node joins or leaves the cluster, Elasticsearch automatically rebalances data by relocating shards to maintain even data distribution and system stability. This design enables Elasticsearch to handle petabyte-scale structured or unstructured data, making it a common choice for full-text search, log analytics, and real-time monitoring.
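The rebalancing idea can be illustrated with a greedy sketch in Python. This is not Elasticsearch's allocation algorithm, which weighs several factors; it only balances shard counts, ignores replicas, and uses a hypothetical rebalance function and node names.

```python
def rebalance(assignment):
    """Move shards from the fullest node to the emptiest one until the
    shard counts differ by at most one. Returns the list of moves made."""
    moves = []
    while True:
        busiest = max(assignment, key=lambda n: len(assignment[n]))
        idlest = min(assignment, key=lambda n: len(assignment[n]))
        if len(assignment[busiest]) - len(assignment[idlest]) <= 1:
            return moves
        shard = assignment[busiest].pop()
        assignment[idlest].append(shard)
        moves.append((shard, busiest, idlest))

# Two nodes hold six shards; a third, empty node then joins the cluster.
cluster = {"node-1": [0, 1, 2, 3], "node-2": [4, 5]}
cluster["node-3"] = []
for shard, source, target in rebalance(cluster):
    print(f"relocate shard {shard}: {source} -> {target}")
print(cluster)
```

In this example two shards move from node-1 to node-3, leaving two shards on each node. The same loop handles a node leaving if its shards are first reassigned to the remaining nodes.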