Elasticsearch调研

Apr 3, 2020 in UNDEFINED
elasticsearch
6 min read

Table of Content

目的：调研elasticsearch的启动、debug、评分、插件以实现自定义评分插件。

首先，关于es启动流程的大体介绍lanffy.github.io。在这片文章中，将会主要关注加载插件的部分。

在idea中启动调试elasticsearch

es6.6.2需要使用jdk11启动。

git clone https://github.com/elastic/elasticsearch.git
git checkout v6.6.2

将项目导入到idea

到项目根目录: ./gradlew idea
Idea create from existing source选择gradle，auto-import

要对项目做几处修改：下面直接复制git修改信息

Index: server/build.gradle
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
--- server/build.gradle	(revision 3bd3e59556628bb84c8d53b09b10c9ac8255e251)
+++ server/build.gradle	(date 1585882097224)
@@ -78,7 +78,7 @@
   compile "org.elasticsearch:elasticsearch-secure-sm:${version}"
   compile "org.elasticsearch:elasticsearch-x-content:${version}"
 
-  compileOnly project(':libs:plugin-classloader')
+  compile project(':libs:plugin-classloader')
   testRuntime project(':libs:plugin-classloader')
 
   // lucene

设置maven仓库为阿里云镜像

Index: build.gradle
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
--- build.gradle	(revision 3bd3e59556628bb84c8d53b09b10c9ac8255e251)
+++ build.gradle	(date 1585878050973)
@@ -48,6 +48,20 @@
   group = 'org.elasticsearch'
   version = VersionProperties.elasticsearch
   description = "Elasticsearch subproject ${project.path}"
+
+  // 增加下面部分
+  repositories {
+      google()
+      jcenter()
+      // maven库
+      def cn = "http://maven.aliyun.com/nexus/content/groups/public/"
+      def abroad = "http://central.maven.org/maven2/"
+      // 先从url中下载jar若没有找到，则在artifactUrls中寻找
+      maven {
+        url cn
+        artifactUrls abroad
+      }
+  }
 }
 
 apply plugin: 'nebula.info-scm'

（可选）设置该项目的gradle代理，加速访问

Index: gradle.properties
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>GBK
===================================================================
--- gradle.properties	(revision 3bd3e59556628bb84c8d53b09b10c9ac8255e251)
+++ gradle.properties	(date 1585878076127)
@@ -1,3 +1,8 @@
 org.gradle.daemon=true
 org.gradle.jvmargs=-Xmx2g
 options.forkOptions.memoryMaximumSize=2g
+systemProp.http.proxyHost=127.0.0.1
+systemProp.http.proxyPort=1080
+# systemProp.http.proxyUser=userid
+# systemProp.http.proxyPassword=password
+systemProp.http.nonProxyHosts=maven.aliyun.com|localhost
\ No newline at end of file

下载es6.6.2的release包https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.6.2.zip

解压到D:\elasticsearch-6.6.2

设置idea项目的vm options

-Des.path.home=D:\elasticsearch-6.6.2
-Des.path.conf=D:\elasticsearch-6.6.2\config
-Xms1g
-Xmx1g
-Dlog4j2.disable.jmx=true
-Djava.security.policy=D:\es\config\es.policy

D:\es\config\es.policy如下

grant {
    permission javax.management.MBeanTruePermission "register";
    permission javax.management.MBeanServerPermission "createMBeanServer";
    permission java.lang.RuntimePermission "createClassLoader";
};

至此我可以成功启动，如果遇到其他问题，请先refer https://blog.csdn.net/weixin_38380858/article/details/84258372

es 评分模块

https://www.elastic.co/guide/en/elasticsearch/reference/6.6/index-modules-similarity.html

es索引的评分规则在创建或更新索引setting的时候设定，用户可以配置es内置评分算法（如BM25）的参数或创建自己的scripted similarity。

可以使用scripted similarity

PUT /index
{
  "settings": {
    "number_of_shards": 1,
    "similarity": {
      "scripted_tfidf": {
        "type": "scripted",
        "script": {
          "source": "double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "field": {
          "type": "text",
          "similarity": "scripted_tfidf"
        }
      }
    }
  }
}

PUT /index/_doc/1
{
  "field": "foo bar foo"
}

PUT /index/_doc/2
{
  "field": "bar baz"
}

POST /index/_refresh

GET /index/_search?explain=true
{
  "query": {
    "query_string": {
      "query": "foo^1.7",
      "default_field": "field"
    }
  }
}

注意：

While scripted similarities provide a lot of flexibility, there is a set of rules that they need to satisfy. Failing to do so could make Elasticsearch silently return wrong top hits or fail with internal errors at search time:

返回分值必须是正的
当所有其他变量不变时，当doc.freq上升，socre不能下降
当所有其他变量不变时，当doc.length上升，score不能上升

上面例子中的script similarity计算方式中包含跟文档无关的部分：query.boost * idf。这一部分可以放在weight_script。如下:

    "similarity": {
      "scripted_tfidf": {
        "type": "scripted",
        "weight_script": {
          "source": "double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; return query.boost * idf;"
        },
        "script": {
          "source": "double tf = Math.sqrt(doc.freq); double norm = 1/Math.sqrt(doc.length); return weight * tf * norm;"
        }
    }

es索引修改使用的评分算法

es提供多个评分算法，并且用户可以自行扩展，那么在检索时，使用哪套评分算法？

1. 创建/更新mapping时按字段设置

PUT /index/_mapping/_doc
{
  "properties" : {
    "title" : { "type" : "text", "similarity" : "my_similarity" }
  }
}

2. 设置默认评分算法

POST /index/_close

PUT /index/_settings
{
  "index": {
    "similarity": {
      "default": {
        "type": "boolean"
      }
    }
  }
}

POST /index/_open

es插件模块

https://www.elastic.co/guide/en/elasticsearch/plugins/6.6/index.html

插件包含：jar包、脚本和配置文件。插件必须在集群中的每个节点安装，安装后必须重启节点，插件才可用（动态加载的实现看起来较困难。涉及到集群元数据的插件，需要整个集群的重启

插件编写指南

开发调试插件

es可以从classpath加载插件，但是在发行版的代码中，隐藏了这个这个功能。稍微改下代码，把这个东西放出来

Index: server/src/main/java/org/elasticsearch/node/Node.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
--- server/src/main/java/org/elasticsearch/node/Node.java	(revision 1c439191c30172708dceae79ce3125822f8d6e12)
+++ server/src/main/java/org/elasticsearch/node/Node.java	(date 1586246418270)
@@ -265,6 +265,10 @@
         this(environment, Collections.emptyList(), true);
     }
 
+    public Node(Environment environment,Collection<Class<? extends Plugin>> classpathPlugins) {
+        this(environment, classpathPlugins, true);
+    }
+
     /**
      * Constructs a node
      *

然后，修改BootStrap中的Node创建代码

Index: server/src/main/java/org/elasticsearch/bootstrap/Bootstrap.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
--- server/src/main/java/org/elasticsearch/bootstrap/Bootstrap.java	(revision 1c439191c30172708dceae79ce3125822f8d6e12)
+++ server/src/main/java/org/elasticsearch/bootstrap/Bootstrap.java	(date 1586246418279)
@@ -214,22 +214,9 @@
             throw new BootstrapException(e);
         }

-        node = new Node(environment) {
+        Collection plugins = new ArrayList<>();
+        Collections.addAll(plugins, MBM25SimilarityPlugin.class);
+        node = new Node(environment,plugins) {
             @Override
             protected void validateNodeBeforeAcceptingRequests(
                 final BootstrapContext context,

下面是一个简单的插件（修改了BM25算法的参数）（这个插件没什么意义，修改BM25参数不需要搞插件）

package org.elasticsearch.plugin;

import org.apache.lucene.search.similarities.BM25Similarity;
import org.elasticsearch.index.IndexModule;
import org.elasticsearch.plugins.Plugin;


public class MBM25SimilarityPlugin extends Plugin {
    public String name() {
        return "elasticsearch-position-similarity";
    }

    public String description() {
        return "Elasticsearch scoring plugin based on matching a term or a phrase relative to a position of the term in a searched field.";
    }

    public void onIndexModule(IndexModule indexModule) {
        indexModule.addSimilarity("position", (settings, version, scriptService)->{
            String DISCOUNT_OVERLAPS="discount_overlaps";
            // BM25的k1默认是1.2 b默认是0.75
            float k1 = settings.getAsFloat("k1", 1.4f);
            float b = settings.getAsFloat("b", 0.8f);
            boolean discountOverlaps = settings.getAsBoolean(DISCOUNT_OVERLAPS, true);

            BM25Similarity similarity = new BM25Similarity(k1, b);
            similarity.setDiscountOverlaps(discountOverlaps);
            return similarity;
        });
    }
}

核心是indexModule.addXXXX方法，提供了扩展es各个功能的方法。

再看scripted_similarity

上文见识了scripted_similarity的实例，但是没有看用于script的参数的含义，这里再来看下。

各项参数定义的java类，是org.elasticsearch. index.similarity.ScriptedSimilarity类的私有静态内部类。elasticsearch的文档把这些参数写在了painless context中，链接：painless-similarity-context.html

参数：

params (Map, read-only)
- User-defined parameters passed in at query-time.
- (提供查询时传参能力，但是7.6版本已移除)
weight (float, read-only)
- The weight as calculated by a weight script
- 如果没有weight_script则为1
query.boost (float, read-only)
- The boost value if provided by the query. If this is not provided the value is 1.0f.
- 查询时传的参数，会乘一下
field.docCount (long, read-only)
- The number of documents that have a value for the current field.
- 分片shard中该字段有值的文档数量
field.sumDocFreq (long, read-only)
- The sum of all terms that exist for the current field. If this is not available the value is -1.
field.sumTotalTermFreq (long, read-only)
- The sum of occurrences in the index for all the terms that exist in the current field. If this is not available the value is -1.
term.docFreq (long, read-only)
- The number of documents that contain the current term in the index.
- 分片shard中包含该term的文档数量
term.totalTermFreq (long, read-only)
- The total occurrences of the current term in the index.
- index中该term的总出现次数
doc.length (long, read-only)
- The number of tokens the current document has in the current field. This is decoded from the stored norms and may be approximate for long fields
- 该字段的token数量
doc.freq (long, read-only)
- The number of occurrences of the current term in the current document for the current field.
- 该term在该doc的该field中出现的次数

有点抽象，尤其是field.sumDocFreq是Σterm.docFreq, field.sumTotalTermFreq是Σterm.totalTermFreq。他们到底有啥用？

可以从一个博客文本相似度-bm25算法原理及实现寻找bm25是如何使用这些参数的（很值得一看，理解es是怎么对每个field评分的）

这里不详细说里面的内容，大概介绍下使用这些参数的方式。field的评分是Σ语素（分词后的token）的评分。token的评分由weight(加权)*token的分数得出。weight最一般的实现是idf，上述script的idf是Math.log((field.docCount+1.0) /(term.docFreq+1.0)) + 1.0 ——field.docCount是总文档数，term.docFreq是该term出现的次数。

另一篇介绍BM25算法的文章https://www.jianshu.com/p/0b372804ff45

对similarity的简单总结：es提供给similarity评分计算的参数有限。无法获取文档中某字段的值，也无法获取用户输入参数。另外，es通过script_similarity给我们提供了自定义相似度算法的方法（当然，只能使用es给我们的那些参数，不能扩展）。因此，使用插件来自定义similarity算法意义不大。

在Elasticsearch自定义评分插件中，将介绍使用插件来实现自定义script_score（function_score），以实现自定义评分算法。