Elasticsearch系列数据建模实战

Z时代
2024-01-10
分类：综合

概要

本篇以实际案例为背景，介绍不同技术组件对数据建模的特点，并以ES为背景，介绍常用的联合查询的利弊，最后介绍了一下文件系统分词器path_hierarchy和嵌套对象的使用。

数据模型对比

实际项目中，电商平台系统常见的组合Java、Mysql和Elasticsearch，以基础的部门-员工实体为案例。

JavaBean类型定义

如果是JavaBean类型，会这样定义

public class Department {
	private Long id;
	private String name;
	private String desc;
	private List<Long> userIds;
}
public class Employee {
	private Long id;
	private String name;
	private byte gender;
	private Department dept;
}

数据库模型定义

如果是关系型数据库(mysql)，会这样建表

create table t_department (
	id bigint(20) not null auto_increment,
	name varchar(30) not null,
	desc varchar(80) not null,
	PRIMARY KEY (`id`)
)
create table t_employee (
	id bigint(20) not null auto_increment,
	name varchar(30) not null,
	gender tinyint(1) not null,
	dept_id bigint(20),
	PRIMARY KEY (`id`)
)

依据数据库三范式设计表，每个实体设计成独立的表，用主外键约束进行关联，按照现有的数据表规范，已经不再使用外键约束了，外键约束放在应用层控制。

ES文档数据模型

如果es的文档数据模型，会这样设计document

{
	"deptId": 1,
	"deptname": "CEO办公室",
	"desc":"这一个有情怀的CEO",
	"employee":[
		{
			"userId":1,
			"name":Lily,
			"gender":0
		},
		{
			"userId":2,
			"name":Lucy,
			"gender":0
		},
		{
			"userId":3,
			"name":Tom,
			"gender":1
		}
	]
}

es更类似面向对象的数据模型，将所有关联的数据放在一个document里。

JOIN查询

我们以博客网站为案例背景，建立博客网站中博客与用户的数据模型。

将用户与博客分别建立document，分割实体，类似数据库三范式，并使用关键field（userId）建立依赖关系

先建立两个实体document，放一条示例数据

PUT /blog/user/1 { "id":1, "username":"Lily", "age":18 } PUT /website/article/1 { "title":"my frist blog", "content":"this is my first blog, thank you", "userId":1 }

需求：要查询用户名Lily发表的博客

步骤：1）查询用户document，根据名字Lily查询到它的userId；

2）根据第一步查询返回的userId，重新组装请求报文，查询博客docuement

示例报文：

GET /blog/user/_search
{
  "query": {
    "match": {
      "username.keyword": "Lily"
    }
  }
}
GET /website/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "userId": [
            "1"
          ]
        }
      }
    }
  }
}

以上步骤叫做应用层Join实现关联查询

优点：结构清晰明了，数据不冗余，维护方便。

缺点：应用层join，如关联的数据过多，查询性能很低。

适用场景：两层join，第一层document查询基本上能做到精准查询，返回的结果数很少，并且第二层数据量特别大。如案例中的场景，根据名称找userId，返回的数据相对较少，第二层的查询性能就比较高，第二层数据属于业务数据类型，数据量肯定特别大。

适度冗余减少应用层Join查询

普通查询

接上面案例，修改博客document，将username冗余到该document中，如：

PUT /website/article/2
{
  "title":"my second blog",
  "content":"this is my second blog, thank you",
  "userInfo": {
    "id":1,
    "username":"Lily"
  }
}

查询时直接指定username:

GET /website/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "userInfo.username.keyword": "Lily"
        }
      }
    }
  }
}

优点：一次查询即可，性能较高

缺点：若冗余的字段有更新，维护非常麻烦

适合场景：适当的冗余比较有必要，可以减小join查询，关系型数据库设计也经常有冗余数据的优化，只要挑选冗余字段时要注意尽量选变更可能性小的字段，避免查询一时爽，更新想撞墙这种事情发生。

数据冗余设计后聚合分组查询

造点测试数据进去

PUT /website/article/3
{
  "title":"my third blog",
  "content":"this is my third blog, thank you",
  "userInfo": {
    "id":2,
    "username":"Lucy"
  }
}
PUT /website/article/4
{
  "title":"my 4th blog",
  "content":"this is my 4th blog, thank you",
  "userInfo": {
    "id":2,
    "username":"Lucy"
  }
}

分组查询：Lily发表了哪些博客，Lucy发表了哪些博客

GET website/article/_search
{
  "size": 0,
  "aggs": {
    "group_by_username": {
      "terms": {
        "field": "userInfo.username.keyword"
      },
      "aggs": {
        "top_articles": {
          "top_hits": {
            "size": 10,
            "_source": {
              "includes": "title"
            }
          }
        }
      }
    }
  }
}

文件搜索

文件类型的数据有个很大的特点：有目录层次关系。如果我们有对文件搜索的需求，可以这个建立索引：

PUT /files
{
  "settings": {
    "analysis": {
      "analyzer": {
        "paths": {
          "tokenizer":"path_hierarchy"
        }
      }
    }
  }
}
PUT /files/_mapping/file
{
  "properties": {
    "name": {
      "type": "keyword"
    },
    "path": {
      "type": "keyword",
      "fields": {
        "tree": {
          "type": "text",
          "analyzer": "paths"
        }
      }
    }
  }
}

注意分词器path_hierarchy，会把/opt/data/log分成

/opt/ /opt/data/ /opt/data/log

插入一条测试数据

PUT /files/file/1
{
  "name":"hello.txt",
  "path":"/opt/data/txt/"
}

搜索案例

指定文件名，指定具体路径搜索

GET files/file/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "hello.txt"
          }
        },
        {
          "match": {
            "path": "/opt/data/txt/"
          }
        }
      ]
    }
  }
}

/opt路径下的hello.txt文件（包含子目录）

GET files/file/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "hello.txt"
          }
        },
        {
          "match": {
            "path.tree": "/opt/"
          }
        }
      ]
    }
  }
}

区别：path与path.tree的使用

path.tree是会分词的，并且指定分词器为path_hierarchy

path不分词，直接使用。

nested object数据类型

提出问题

用普通的object对象做数据冗余时，如果冗余的数据是一个数组集合类的，查询可能会出问题，例如：博客信息下面的评论，是一个集合类型

PUT /website/article/5 { "title": "清茶豆奶发表的一篇技术帖子", "content": "我是清茶豆奶，大家要不要考虑关注一下Java架构社区啊", "tags": [ "IT技术", "Java架构社区" ], "comments": [ { "name": "清茶", "comment": "有什么干货没有啊？", "age": 29, "stars": 4, "date": "2019-10-29" }, { "name": "豆奶", "comment": "我最喜欢研究技术，真好", "age": 32, "stars": 5, "date": "2019-10-30" } ] }

需求：查询被29岁的豆奶用户评论过的博客

GET /website/article/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "comments.name.keyword": "豆奶"
          }
        },
        {
          "match": {
            "comments.age": "29"
          }
        }
      ]
    }
  }
}

根据这条演示数据，这个条件是查不到结果的，但实际却查出来这条数据，为什么？

原因：object类型底层数据结果，会将json进行扁平化存储，如上例子，存储结构将变成：

{
	"title":["清茶","豆奶","发表","一篇","技术","帖子"],
	"content": ["我","清茶","豆奶","大家","要不要","考虑","关注","一下","Java架构社区"],
	tags:["IT技术", "Java架构社区"],
	comments.name:["清茶","豆奶"],
	comments.comment:["有","什么","干货","没有啊","我","最喜欢","研究","技术","真好"],
	comments.age:[29,32],
	comments.stars:[4,5],
	comments.date:["2019-10-29","2019-10-30"]
}

这样"豆奶"和29就被命中了，跟预期的结果不一致。

解决办法

引入nested object类型，就可以解决这种问题。

修改mapping，将comments的类型改成nested object。

先删掉索引，再重新建立

PUT /website
{
  "mappings": {
    "article": {
      "properties": {
        "comments": {
          "type": "nested",
          "properties": {
            "name": {"type":"text"},
            "comment": {"type":"text"},
            "age":     {"type":"short"},
            "stars":   {"type":"short"},
            "date":  {"type":"date"}
          }
        }
      }
    }
  }
}

这样底层数据结构就成变成：

{
	"title":["清茶","豆奶","发表","一篇","技术","帖子"],
	"content": ["我","清茶","豆奶","大家","要不要","考虑","关注","一下","Java架构社区"],
	tags:["IT技术", "Java架构社区"],
	comments:[
		{
			"name":"清茶",
			"comment":["有","什么","干货","没有啊"],
			"age":29,
			"stars":4,
			"date":"2019-10-29"
		},
		{
			"name":"豆奶",
			"comment":["我","最喜欢","研究","技术","真好"],
			"age":32,
			"stars":5,
			"date":"2019-10-30"
		}
	]
}

再查询结果为空，符合预期。

聚合查询示例

求博客每天评论的平均星数

GET /website/article/_search
{
  "size": 0,
  "aggs": {
    "comments_path": {
      "nested": {
        "path": "comments"
      }, 
      "aggs": {
        "group_by_comments_date": {
          "date_histogram": {
            "field": "comments.date",
            "interval": "day",
            "format": "yyyy-MM-dd"
          }, 
          "aggs": {
            "stars_avg": {
              "avg": {
                "field": "comments.stars"
              }
            }
          }
        }
      }
    }
  }
}

响应结果(有删节)：

{
  "aggregations": {
    "comments_path": {
      "doc_count": 2,
      "group_by_comments_date": {
        "buckets": [
          {
            "key_as_string": "2019-10-29",
            "key": 1572307200000,
            "doc_count": 1,
            "stars_avg": {
              "value": 4
            }
          },
          {
            "key_as_string": "2019-10-30",
            "key": 1572393600000,
            "doc_count": 1,
            "stars_avg": {
              "value": 5
            }
          }
        ]
      }
    }
  }
}