ES query

ES Query Performance Analysis

{
  "_source": ["type", "content_id", "title"],
  "size": 10, 
  "from": 0,
  "query": {
	"bool": {
	  "must": [
		{
		  "query_string": {
			"query": "*st*", 
			"fields": ["title", "content"]
		  }
		}
	  ],
	  "filter": [
		{
		  "terms": {
			"type": ["some_content"]
		  }
		},
		{
		  "term": {
			"user_id": userId 
		  }
		}
	  ]
	}
  },
  "highlight": {
	"fields": {
	  "content": {
		"number_of_fragments": 1,
		"fragment_size": 100,
		"pre_tags": [""],
		"post_tags": [""]
	  }
	}
  }
}

Suppose we have a query like the one above.
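The `userId` in the `term` filter above is a placeholder, so the body is not valid JSON as written. A minimal sketch of building the same request body in Python follows; `build_search_body` and its parameter names are hypothetical, introduced here only for illustration.

```python
import json


def build_search_body(user_id, keyword, content_type="some_content", size=10, offset=0):
    """Return the search request body shown above as a Python dict."""
    return {
        "_source": ["type", "content_id", "title"],
        "size": size,
        "from": offset,
        "query": {
            "bool": {
                # must: the scored full-text part (wildcard on both sides of the keyword)
                "must": [
                    {"query_string": {"query": f"*{keyword}*", "fields": ["title", "content"]}}
                ],
                # filter: yes/no conditions that do not contribute to the score
                "filter": [
                    {"terms": {"type": [content_type]}},
                    {"term": {"user_id": user_id}},
                ],
            }
        },
        "highlight": {
            "fields": {
                "content": {
                    "number_of_fragments": 1,
                    "fragment_size": 100,
                    "pre_tags": [""],
                    "post_tags": [""],
                }
            }
        },
    }


body = build_search_body(user_id=1, keyword="st")
print(json.dumps(body, indent=2))
```

Building the body as a dict and serializing it with `json.dumps` avoids hand-editing the placeholder for every request.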

The response to this query looks like this.

{
	"took": 8,
	"timed_out": false,
	"_shards": {
		"total": 1,
		"successful": 1,
		"skipped": 0,
		"failed": 0
	},
	"hits": {
		"total": {
			"value": 1,
			"relation": "eq"
		},
		"max_score": 1.0,
		"hits": [
			{
				"_index": "some_content_index_1",
				"_id": "some_content_1",
				"_score": 1.0,
				"_source": {
					"content_id": 1,
					"type": "some_content",
					"title": "예시"
				},
				"highlight": {
					"content": [
						"string"
					]
				}
			}
		]
	}
}

The first key in the response, took, is the elapsed time in milliseconds. In other words, the query above took 8 ms.

For a more detailed analysis of the query, set profile to true as shown below and send the request again.

"profile": true,
  "query": {
	"bool": {
	  "must": [
		...

Shard information:

The network times (inbound_network_time_in_millis and outbound_network_time_in_millis) are both 0, so there is no network latency.

Search query information:

query: the main part of the query consists of a BooleanQuery and several sub-queries (DisjunctionMaxQuery, MultiTermQueryConstantScoreWrapper, PointRangeQuery, and so on).
time_in_nanos: the execution time of each query type, in nanoseconds. For example, the first BooleanQuery entry took about 8 ms (8378877 ns).
breakdown: the time spent in each stage of the query. Here the build_scorer stage took the most time ("build_scorer": 8350220).
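Reading the profile output by hand gets tedious once the query tree is nested. The sketch below walks the `profile` section of a response and reports, for each query node, its total time and its slowest breakdown stage. It assumes the usual `shards → searches → query → children` layout of the profile response. The `sample_profile` dict is a trimmed stand-in: only the two BooleanQuery numbers come from the text above, and the child entry is a made-up placeholder.

```python
def summarize_profile(profile):
    """Return (depth, query type, total ms, slowest stage) for every profiled query node."""
    rows = []

    def walk(node, depth):
        breakdown = node.get("breakdown", {})
        # Keep only durations; keys ending in "_count" are call counts.
        timings = {k: v for k, v in breakdown.items() if not k.endswith("_count")}
        slowest = max(timings, key=timings.get) if timings else None
        rows.append((depth, node["type"], node["time_in_nanos"] / 1_000_000, slowest))
        for child in node.get("children", []):
            walk(child, depth + 1)

    for shard in profile["shards"]:
        for search in shard["searches"]:
            for node in search["query"]:
                walk(node, 0)
    return rows


# Trimmed sample: BooleanQuery values are from the text, the child is hypothetical.
sample_profile = {
    "shards": [{
        "searches": [{
            "query": [{
                "type": "BooleanQuery",
                "time_in_nanos": 8378877,
                "breakdown": {"build_scorer": 8350220, "build_scorer_count": 2},
                "children": [{
                    "type": "DisjunctionMaxQuery",
                    "time_in_nanos": 1000,                # placeholder value
                    "breakdown": {"build_scorer": 900},  # placeholder value
                }],
            }]
        }]
    }]
}

rows = summarize_profile(sample_profile)
for depth, query_type, millis, stage in rows:
    print(f"{'  ' * depth}{query_type}: {millis:.3f} ms (slowest stage: {stage})")
```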

How ES Works

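ES is built on an inverted index.

A typical RDBMS table has the structure below.

ID	content
1	cat cute
2	dog cute
3	hedgehog cute
4	parrot

So running a LIKE query for the string “cute” against this table has to scan the content of every row, which is slow.

ES, on the other hand, has the structure below.

Term	ID
cat	1
dog	2
hedgehog	3
parrot	4
cute	1, 2, 3

So if I query for “cute”

→ the IDs of the matching rows can be fetched directly.

💡 As a result, when the data grows, the number of rows to scan does not grow. A new document only appends its id to the id list of each term it contains, so there is no large drop in performance.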
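The contrast between the two lookups can be sketched in a few lines of Python. This is a toy model: it splits text on whitespace, whereas a real ES index runs the text through an analyzer, and the function names are made up for the example.

```python
from collections import defaultdict

# The four example rows, one whitespace-separated token per word.
docs = {1: "cat cute", 2: "dog cute", 3: "hedgehog cute", 4: "parrot"}


def like_scan(docs, term):
    """RDBMS-style LIKE: inspect the content of every row."""
    return [doc_id for doc_id, text in docs.items() if term in text]


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].split()):
            index[term].append(doc_id)
    return dict(index)


index = build_inverted_index(docs)
print(like_scan(docs, "cute"))  # touches all four rows
print(index["cute"])            # one dictionary lookup
```

Both calls return the same ids, but the scan reads every row while the index lookup reads only the entry for the term.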

How Filters Work

Query context

In a query context, a query clause answers the question “How well does this document match this query clause?” Besides deciding whether the document matches, the query clause also calculates a relevance score in the _score metadata field.

Query context is in effect whenever a query clause is passed to a query parameter, such as the query parameter in the search API.

Filter context

In a filter context, a query clause answers the question “Does this document match this query clause?” The answer is a simple Yes or No; no scores are calculated. Filter context is mostly used for filtering structured data, e.g.

Does this timestamp fall into the range 2015 to 2016?
Is the status field set to "published"?

Frequently used filters will be cached automatically by Elasticsearch, to speed up performance.

Filter context is in effect whenever a query clause is passed to a filter parameter, such as the filter or must_not parameters in the bool query.

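Query vs. filter: Elasticsearch/OpenSearch distinguishes between search queries (query) and filters (filter). A query calculates a relevance score, while a filter narrows the data down quickly and calculates no score.
Filter execution steps:

The user specifies filter conditions in the query (e.g. a specific userId or type).
Elasticsearch/OpenSearch uses the inverted index to quickly find the IDs of the documents that satisfy those conditions.
Filtering runs without any relevance scoring, so it is faster than a scored query.
Filter results can be cached, so repeated queries with the same filter conditions are processed faster.

→ The filtered dataset is much smaller than the full dataset, so the search query has less data to process.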
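The steps above can be sketched as a toy model in Python. This is a conceptual illustration only, not how Lucene schedules clauses internally. The documents, the substring-count "score", and the `filter_cache` dict are all made up for the example.

```python
# Toy documents; field names mirror the query at the top, values are invented.
docs = {
    1: {"type": "some_content", "user_id": 1, "title": "first post"},
    2: {"type": "some_content", "user_id": 2, "title": "second post"},
    3: {"type": "other", "user_id": 1, "title": "post list"},
    4: {"type": "some_content", "user_id": 1, "title": "stats"},
}


def build_postings(field):
    """Inverted index for one structured field: value -> set of doc ids."""
    postings = {}
    for doc_id, doc in docs.items():
        postings.setdefault(doc[field], set()).add(doc_id)
    return postings


postings = {field: build_postings(field) for field in ("type", "user_id")}
filter_cache = {}  # (field, value) -> frozenset of matching doc ids


def filter_ids(field, value):
    """Yes/no match via the inverted index; the result is cached for reuse."""
    key = (field, value)
    if key not in filter_cache:
        filter_cache[key] = frozenset(postings[field].get(value, ()))
    return filter_cache[key]


def search(keyword, user_id, content_type):
    # Filter context: intersect id sets, no scoring involved.
    candidates = filter_ids("type", content_type) & filter_ids("user_id", user_id)
    # Query context: score only the documents that survived the filters.
    scored = [(docs[doc_id]["title"].count(keyword), doc_id) for doc_id in candidates]
    return sorted((pair for pair in scored if pair[0] > 0), reverse=True)


results = search("st", user_id=1, content_type="some_content")
print(results)  # list of (score, doc_id), best match first
```

A second call with the same `user_id` and `content_type` reuses the cached id sets and only repeats the scoring step.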

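💡 So what is the time complexity?

With that question in mind I searched Google extensively, but the official documentation does not describe the internal mechanics in any detail.

I also looked through Stack Overflow and various developer communities, but did not find a clear answer.

→ Wouldn't GPT know…?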

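During filtering, OpenSearch and Lucene do not scan every document sequentially. They use more efficient methods instead, so the time complexity of filtering is generally better than O(n).

The exact time complexity of filtering is hard to pin down, but in general the filtering used by OpenSearch or Lucene is more complex than O(1) (constant time) or O(log N) (logarithmic time). The filtering process in these systems has the following characteristics:

Not O(1): O(1) complexity means an operation runs in constant time regardless of input size. Filtering generally takes time that depends on the size of the input data, so it is not O(1).
Can be more than O(log N): O(log N) complexity appears in algorithms such as binary search, where doubling the data adds only one more step. Filtering uses the inverted index to efficiently find the documents that match a keyword or condition, but depending on the query this can still be more complex than O(log N).
Depends on the kind of filter: the actual time complexity depends on the type and complexity of the filter, the data distribution, the index structure, system optimizations, and so on. Some simple filters run very quickly, while complex queries or filtering over large datasets can take longer.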

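→ So GPT's answer is that the time complexity of the filtering logic is generally not O(n). The actual complexity depends on the query type, the index structure, the data volume and layout, and so on, so a reasonable conclusion seems to be that it is not O(n) but close to O(log n), possibly somewhat higher.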
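One way to make this conclusion concrete is a simplified cost model. The model is an assumption of this sketch, not Lucene's actual data structure: finding a term in a sorted term dictionary takes about O(log T) comparisons for T distinct terms, and reading its posting list takes O(k) for k matching documents. Neither step depends directly on the total document count n, while a full scan always does.

```python
import bisect


def build_index(n_docs, n_users):
    """Sorted term list plus parallel posting lists; doc i belongs to user i % n_users."""
    postings = {}
    for doc_id in range(n_docs):
        postings.setdefault(doc_id % n_users, []).append(doc_id)
    terms = sorted(postings)
    return terms, [postings[term] for term in terms]


def term_filter(terms, posting_lists, value):
    """Binary-search the term list, then return that term's posting list."""
    pos = bisect.bisect_left(terms, value)  # O(log T), T = number of distinct terms
    if pos < len(terms) and terms[pos] == value:
        return posting_lists[pos]           # reading it is O(k), k = matching docs
    return []


def full_scan(n_docs, n_users, value):
    """Baseline that inspects every document: O(n)."""
    return [doc_id for doc_id in range(n_docs) if doc_id % n_users == value]


# Document counts follow the test section; 1,000 users is an arbitrary choice.
for n_docs in (10_000, 80_000, 120_000):
    terms, posting_lists = build_index(n_docs, n_users=1000)
    matches = term_filter(terms, posting_lists, 7)
    print(n_docs, len(terms), len(matches))
```

In this model, going from 10,000 to 120,000 documents leaves the term lookup unchanged and grows only the posting list that has to be read.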

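Test

Using the /_bulk API

In Postman: Body → binary → import the JSON file

Added 10,000 documents

Average took: 15 ms

Added 20,000 documents

Average took: about the same as above

After adding more data, up to 80,000 documents, I ran the simple analysis below.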

import numpy as np

response_times = np.array([12, 19, 20, 17, 16, 17, 37, 19, 22, 18, 19, 17, 26, 17, 18, 17, 21, 47, 15, 16, 20, 22, 19, 44, 39, 25, 18])


mean = np.mean(response_times)
median = np.median(response_times)
std_dev = np.std(response_times)
min_time = np.min(response_times)
max_time = np.max(response_times)

print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Standard Deviation: {std_dev}")
print(f"Min: {min_time}")
print(f"Max: {max_time}")

->
Mean: 22.11111111111111
Median: 19.0
Standard Deviation: 8.791129086601103
Min: 12
Max: 47

import pandas as pd

response_series = pd.Series(response_times)

print(response_series.describe())


-> 
count	27.000000
mean	 22.111111
std	   8.958594
min	  12.000000
25%	  17.000000
50%	  19.000000
75%	  22.000000
max	  47.000000
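The two outputs above report different standard deviations (8.79 vs 8.96) for the same data. That is expected: `np.std` divides by N by default (population standard deviation, `ddof=0`), while pandas divides by N - 1 (sample standard deviation, `ddof=1`). A small check with the standard library's `statistics` module, used here only to keep the example dependency-free:

```python
import statistics

response_times = [12, 19, 20, 17, 16, 17, 37, 19, 22, 18, 19, 17, 26, 17,
                  18, 17, 21, 47, 15, 16, 20, 22, 19, 44, 39, 25, 18]

# Divides by N, like np.std with its default ddof=0.
population_std = statistics.pstdev(response_times)
# Divides by N - 1, like pandas Series.std / describe() with the default ddof=1.
sample_std = statistics.stdev(response_times)

print(round(population_std, 3))
print(round(sample_std, 3))
```

With only 27 samples the gap is visible, so it is worth stating which one is being compared when the 80,000 and 120,000 runs are put side by side.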

import matplotlib.pyplot as plt

plt.hist(response_times, bins=10)
plt.title('Response Times Histogram')
plt.xlabel('Response Time (ms)')
plt.ylabel('Frequency')
plt.show()

120,000 documents

Mean: 29.40740740740741
Median: 23.0
Standard Deviation: 16.999475502277644
Min: 14
Max: 69

count	27.000000
mean	 29.407407
std	  17.323304
min	  14.000000
25%	  16.000000
50%	  23.000000
75%	  39.500000
max	  69.000000

import csv
import json
import sys 


csv.field_size_limit(sys.maxsize)


def csv_to_json(csv_file_path, json_file_path):
	jsonArray = []

	# Read the CSV file
	with open(csv_file_path, encoding='utf-8') as csvf:
		csvReader = csv.DictReader(csvf)

		# Convert each row to a JSON-compatible dict
		for row in csvReader:
			jsonArray.append(row)

	# Write the JSON file
	with open(json_file_path, 'w', encoding='utf-8') as jsonf:
		jsonf.write(json.dumps(jsonArray, indent=4))

# Run the function
csv_to_json('test.csv', 'test.json')

import json

def json_to_bulk_format(json_file_path, bulk_file_path):
	# Load the JSON data.
	with open(json_file_path, 'r', encoding='utf-8') as json_file:
		data = json.load(json_file)

	# Convert it to the bulk format.
	with open(bulk_file_path, 'w', encoding='utf-8') as bulk_file:
		for doc in data:
			# Action metadata line
			action = {
				"index": {
					"_index": "some_index_name",
					"_id": f"content-{doc['id']}"
				}
			}
			# Document source line
			document = {
				"type": "some_content",
				"user_id": doc["user_id"],
				"title": "some_content",
				"content": doc["content"]
			}

			# Write both lines to the bulk file
			bulk_file.write(json.dumps(action) + '\n')
			bulk_file.write(json.dumps(document) + '\n')

# Paths of the JSON file and the bulk file.
json_file_path = '/80000-test.json'
bulk_file_path = '80000-test.json'

# Run the function
json_to_bulk_format(json_file_path, bulk_file_path)


# def extract_docs(input_file_path, output_file_path, start_line, end_line):
#	 with open(input_file_path, 'r', encoding='utf-8') as input_file, \
#		  open(output_file_path, 'w', encoding='utf-8') as output_file:
#		 for current_line, line in enumerate(input_file, start=1):
#			 if current_line >= start_line:
#				 output_file.write(line)
#			 if current_line >= end_line:
#				 break

# # Paths of the original bulk data file and the new file
# input_file_path = '/result.json'
# output_file_path = '/80000-result.json'

# extract_docs(input_file_path, output_file_path, start_line=80000, end_line=130000)
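One caveat about the commented-out helper: it slices by raw line number, but each document in a bulk file occupies two lines (action, then source). Counting from 1, an even `start_line` such as 80000 lands on a source line, so the extracted file would begin with a source line that has no action line before it. A minimal sketch of slicing by document number instead, where `extract_bulk_docs` is a hypothetical name and the in-memory data is made up for the demo:

```python
import io
from itertools import islice


def extract_bulk_docs(src, dst, start_doc, end_doc):
    """Copy documents start_doc..end_doc (1-indexed, inclusive) between bulk files."""
    first_line = 2 * (start_doc - 1)  # 0-based index of start_doc's action line
    stop_line = 2 * end_doc           # one past end_doc's source line
    for line in islice(src, first_line, stop_line):
        dst.write(line)


# In-memory stand-ins for the real files: five documents, two lines each.
src = io.StringIO("".join(
    f'{{"index": {{"_id": "content-{i}"}}}}\n{{"content": "doc {i}"}}\n'
    for i in range(1, 6)
))
dst = io.StringIO()

extract_bulk_docs(src, dst, start_doc=2, end_doc=3)
print(dst.getvalue(), end="")
```

With real files, `src` and `dst` would be the handles returned by `open(...)`. The slice always starts on an action line and ends on a source line.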
