TiSpark Insert SQL User Guide

TiSpark enables you to use TiDB as the data source of Apache Spark, similar to other data sources (PostgreSQL, HDFS, S3, etc.).

It supports insert data using Spark SQL now.

Use Spark SQL

Initiate SparkConf.

val sparkConf = new SparkConf().
  setIfMissing("spark.master", "local[*]").
  setIfMissing("spark.app.name", getClass.getName).
  setIfMissing("spark.sql.extensions", "org.apache.spark.sql.TiExtensions").
  setIfMissing("spark.tispark.pd.addresses", "pd0:2379").
  setIfMissing("spark.tispark.tidb.addr", "tidb").
  setIfMissing("spark.tispark.tidb.port", "4000")

val spark = SparkSession.builder.config(sparkConf).getOrCreate()

Write using Spark SQL.

/* create table before run the code
CREATE TABLE tpch_test.target_table_orders (
  `O_ORDERKEY` int(11) NOT NULL,
  `O_CUSTKEY` int(11) NOT NULL,
  `O_ORDERSTATUS` char(1) NOT NULL,
  `O_TOTALPRICE` decimal(15,2) NOT NULL,
  `O_ORDERDATE` date NOT NULL,
  `O_ORDERPRIORITY` char(15) NOT NULL,
  `O_CLERK` char(15) NOT NULL,
  `O_SHIPPRIORITY` int(11) NOT NULL,
  `O_COMMENT` varchar(79) NOT NULL
)
*/

//use catalog
spark.sql("use tidb_catalog")

// write data to tidb
spark.sql(
      "insert into tpch_test.target_table_orders values" +
        "(1,1,'0',11.11,date'2022-1-1','first','spark',12345,'no comment')")

// select data
spark.sql("select * from tpch_test.target_table_orders").show

Configurations

The following table shows the insert configurations, which can be passed in through SparkConf or spark.conf.set. But you should not set tidb.password in SparkConf and only use spark.conf.set to set it.

tidb.addr, tidb.port, tidb.user and tidb.password are only needed when enableUpdateTableStatistics is true.

Key	Required	Default	Description
tidb.addr	false	-	TiDB address, needed when enableUpdateTableStatistics is true
tidb.port	false	-	TiDB Port, needed when enableUpdateTableStatistics is true
tidb.user	false	-	TiDB User, needed when enableUpdateTableStatistics is true
tidb.password	false	-	TiDB Password, needed when enableUpdateTableStatistics is true
enableRegionSplit	false	true	To split Region to avoid hot Region during insertion
scatterWaitMS	false	300000	Max time to wait scatter region
bytesPerRegion	false	100663296 (96M)	Decrease this parameter to split more regions (increase write concurrency)
enableUpdateTableStatistics	false	false	Update statistics in table `mysql.stats_meta` (`tidb.user` must own update privilege to table `mysql.stats_meta` if set true)
rowFormatVersion	false	2	Version of row format to save data

TiDB Version

TiDB's version must be 4.0.0 or later.

Limitations

TiSpark support write into partition table using INSERT SQL now.

But TiSpark does not support insert with a partition spec and a column list, like:

INSERT INTO test.test PARTITION (i = 1)(s, k) VALUES ('hello', 'world');

Type Conversion For Insert

The following types of SparkSQL Data are currently not supported for writing to TiDB:

ArrayType
MapType
StructType

Spark SQL and TIDB have different conversion rules. We follow this rule when using DataSource API, but follow the rule of Spark SQL when using insert.

There is some CAST rules supported by Spark which must be explicit indicated. Like SELECT CAST('2020-01-01' AS Timestamp). Here for more details.

When convert Float/Double to Int/Short/Long, there is a subtle different between them. TiDB will use round(num), Spark SQL will use floor(num).

The complete conversion metrics are as follows.

target\source	Boolean	Binary	Short	Integer	Long	Float	Double	String	Decimal	Date	Timestamp
BIT	❌	✅	✅	✅	✅	✅	✅	❌	❌	❌	❌
BOOLEAN	✅	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌
TINYINT	❌	✅	✅	✅	✅	✅	✅	❌	❌	❌	❌
SMALLINT	❌	✅	✅	✅	✅	✅	✅	❌	❌	❌	❌
MEDIUMINT	❌	✅	✅	✅	✅	✅	✅	❌	❌	❌	❌
INTEGER	❌	✅	✅	✅	✅	✅	✅	❌	❌	❌	❌
BIGINT	❌	✅	✅	✅	✅	✅	✅	❌	❌	❌	❌
FLOAT	❌	✅	✅	✅	✅	✅	✅	❌	❌	❌	❌
DOUBLE	❌	✅	✅	✅	✅	✅	✅	❌	❌	❌	❌
DECIMAL	❌	✅	✅	✅	✅	✅	✅	❌	✅	❌	❌
DATE	❌	❌	❌	❌	❌	❌	❌	❌	❌	✅	✅
DATETIME	❌	❌	❌	❌	❌	❌	❌	❌	❌	✅	✅
TIMESTAMP	❌	❌	❌	❌	❌	❌	❌	❌	❌	✅	✅
TIME	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌
YEAR	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌
CHAR	❌	✅	✅	✅	✅	❌	❌	✅	✅	✅	✅
VARCHAR	❌	✅	✅	✅	✅	❌	❌	✅	✅	✅	✅
TINYTEXT	❌	✅	✅	✅	✅	❌	❌	✅	✅	✅	✅
TEXT	❌	✅	✅	✅	✅	❌	❌	✅	✅	✅	✅
MEDIUMTEXT	❌	✅	✅	✅	✅	❌	❌	✅	✅	✅	✅
LONGTEXT	❌	✅	✅	✅	✅	❌	❌	✅	✅	✅	✅
BINARY	❌	✅	❌	❌	❌	❌	❌	❌	❌	❌	❌
VARBINARY	❌	✅	❌	❌	❌	❌	❌	❌	❌	❌	❌
TINYBLOB	❌	✅	❌	❌	❌	❌	❌	❌	❌	❌	❌
BLOB	❌	✅	❌	❌	❌	❌	❌	❌	❌	❌	❌
MEDIUMBLOB	❌	✅	❌	❌	❌	❌	❌	❌	❌	❌	❌
LONGBLOB	❌	✅	❌	❌	❌	❌	❌	❌	❌	❌	❌
ENUM	❌	✅	✅	✅	✅	✅	✅	✅	❌	❌	❌
SET	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌
JSON	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TiSpark Insert SQL User Guide

Use Spark SQL

Configurations

TiDB Version

Limitations

Type Conversion For Insert

FilesExpand file tree

insert_sql_userguide.md

Latest commit

History

insert_sql_userguide.md

File metadata and controls

TiSpark Insert SQL User Guide

Use Spark SQL

Configurations

TiDB Version

Limitations

Type Conversion For Insert