Hive内(外)部表

Hive内部表与外部表的区别

  • 创建表时

    • 创建内部表:会将数据移动到数据仓库指向的路径;
    • 创建外部表:仅记录数据所在的路径, 不对数据的位置做任何改变。

  • 删除表时

    • 内部表的元数据和数据会被一起删除
    • 外部表只删除元数据,不删除数据。外部表相对来说更加安全,数据组织更加灵活,方便共享源数据。

1.未被external修饰的是内部表【managed table】,被external修饰的为外部表【external table】。

2.内部表数据由Hive自身管理,外部表数据由HDFS管理。

3.内部表数据存储在hive.metastore.warehouse.dir【默认:/user/hive/warehouse】,外部表数据存储位置由用户自己决定。

4.删除内部表会直接删除元数据【metadata】及存储数据,删除外部表仅仅删除元数据,HDFS上的文件不会被删除。

5.对内部表的修改会直接同步到元数据,而对外部表的表结构和分区进行修改,则需要修改【MSCK REPAIR TABLE table_name】。

实践操作

把数据源上传至HDFS中

[hadoop@xinxingdata001 data]$ touch student.txt

[hadoop@xinxingdata001 data]$ vim student.txt

1,xinxing,22,boy

2,laocao,24,boy

3,shiqin,19,girl

4,xiaoming,50,boy

5,xiaohong,49,girl

[hadoop@xinxingdata001 data]$ hdfs dfs -mkdir /data

[hadoop@xinxingdata001 data]$ hdfs dfs -put student.txt /data

[hadoop@xinxingdata001 data]$ hdfs dfs -ls /data

Found 1 items

-rw-r--r-- 1 hadoop supergroup 87 2020-06-11 16:18 /data/student.txt

建(内部)表

create managed table tbl_internal (

id int,

name string,

age int,

sex string

)

comment '这是一张内部表'

row format delimited fields terminated by ',';

managed为建表默认参数,普通表默认为内部表

1、查看表结构

hive (xinxing)> desc formatted tbl_internal;

2、装载数据源到Hive

hive (xinxing)> LOAD DATA INPATH '/data/student.txt' OVERWRITE INTO TABLE tbl_internal;

3、查询数据是否装载成功

hive (xinxing)> select * from tbl_internal;

OK

tbl_internal.id tbl_internal.name tbl_internal.age tbl_internal.sex

1 xinxing 22 boy

2 laocao 24 boy

3 shiqin 19 girl

4 xiaoming 50 boy

5 xiaohong 49 girl

Time taken: 0.205 seconds, Fetched: 5 row(s)

4、打开HDFS的web页面查看student.txt路径

  • 在hive中建库,如果不指定location默认路径则为/user/hive/warehouse/

  • 原本存储至/data/student.txt路径下的源数据发生移动,移动到了/user/hive/warehouse/xinxing.db/tbl_internal

建(外部)表

create external table tbl_external (

id int,

name string,

age int,

sex string

)

comment '这是一张外部表'

row format delimited fields terminated by ',';

1、查看表结构

hive (xinxing)> desc formatted tbl_external;

2、装载数据源到Hive

hive (xinxing)> LOAD DATA INPATH '/data/student.txt' OVERWRITE INTO TABLE tbl_external;

3、查询数据是否装载成功

hive (xinxing)> select * from tbl_external;

OK

tbl_internal.id tbl_internal.name tbl_internal.age tbl_internal.sex

1 xinxing 22 boy

2 laocao 24 boy

3 shiqin 19 girl

4 xiaoming 50 boy

5 xiaohong 49 girl

Time taken: 0.205 seconds, Fetched: 5 row(s)

查看内外部表的区别

1、删除两张表

hive (xinxing)> drop table tbl_internal;

OK

Time taken: 0.518 seconds

hive (xinxing)> drop table tbl_external;

OK

Time taken: 0.056 seconds

2、在HDFS中查看两张表

  • 由此可见tbl_internal这张表已不存在HDFS中,tbl_external/student.txt仍然存在

3、在次创建tbl_external表结构

create external table tbl_external (

id int,

name string,

age int,

sex string

)

comment '这是一张外部表'

row format delimited fields terminated by ',';

4、查询tbl_external

hive (xinxing)> select * from tbl_external;

OK

tbl_internal.id tbl_internal.name tbl_internal.age tbl_internal.sex

1 xinxing 22 boy

2 laocao 24 boy

3 shiqin 19 girl

4 xiaoming 50 boy

5 xiaohong 49 girl

Time taken: 0.205 seconds, Fetched: 5 row(s)

总结

1、使用LOAD加载数据到Hive中时,不论是内部表还是外部表源数据都会被移动至默认路径/database/tbl_库下

2、使用location指定源数据路径时,源数据则不会移动

3、内部表的删除,删除了表结构以及数据

4、外部表的删除,仅仅删除了表结构,数据仍存在于HDFS中。

官网对于内(外)部表的介绍

A table created without the EXTERNAL clause is called a managed table because Hive manages its data.

Managed and External Tables

By default Hive creates managed tables, where files, metadata and statistics are managed by internal Hive processes. A managed table is stored under the hive.metastore.warehouse.dir path property, by default in a folder path similar to /apps/hive/warehouse/databasename.db/tablename/. The default location can be overridden by the location property during table creation. If a managed table or partition is dropped, the data and metadata associated with that table or partition are deleted. If the PURGE option is not specified, the data is moved to a trash folder for a defined duration.

Use managed tables when Hive should manage the lifecycle of the table, or when generating temporary tables.

An external table describes the metadata / schema on external files. External table files can be accessed and managed by processes outside of Hive. External tables can access data stored in sources such as Azure Storage Volumes (ASV) or remote HDFS locations. If the structure or partitioning of an external table is changed, an MSCK REPAIR TABLE table_name statement can be used to refresh metadata information.

Use external tables when files are already present or in remote locations, and the files should remain even if the table is dropped.

Managed or external tables can be identified using the DESCRIBE FORMATTED table_name command, which will display either MANAGED_TABLE or EXTERNAL_TABLE depending on table type.

Statistics can be managed on internal and external tables and partitions for query optimization.

官网链接:cwiki.apache.org/confluence/…

以上是 Hive内(外)部表 的全部内容, 来源链接: utcz.com/a/23528.html

回到顶部