Skip to content

Latest commit

 

History

History
132 lines (105 loc) · 12.8 KB

File metadata and controls

132 lines (105 loc) · 12.8 KB

TiSpark Insert SQL User Guide

TiSpark enables you to use TiDB as the data source of Apache Spark, similar to other data sources (PostgreSQL, HDFS, S3, etc.).

It supports insert data using Spark SQL now.

Use Spark SQL

  1. Initiate SparkConf.

    val sparkConf = new SparkConf().
      setIfMissing("spark.master", "local[*]").
      setIfMissing("spark.app.name", getClass.getName).
      setIfMissing("spark.sql.extensions", "org.apache.spark.sql.TiExtensions").
      setIfMissing("spark.tispark.pd.addresses", "pd0:2379").
      setIfMissing("spark.tispark.tidb.addr", "tidb").
      setIfMissing("spark.tispark.tidb.port", "4000")
    
    val spark = SparkSession.builder.config(sparkConf).getOrCreate()
  2. Write using Spark SQL.

    /* create table before run the code
    CREATE TABLE tpch_test.target_table_orders (
      `O_ORDERKEY` int(11) NOT NULL,
      `O_CUSTKEY` int(11) NOT NULL,
      `O_ORDERSTATUS` char(1) NOT NULL,
      `O_TOTALPRICE` decimal(15,2) NOT NULL,
      `O_ORDERDATE` date NOT NULL,
      `O_ORDERPRIORITY` char(15) NOT NULL,
      `O_CLERK` char(15) NOT NULL,
      `O_SHIPPRIORITY` int(11) NOT NULL,
      `O_COMMENT` varchar(79) NOT NULL
    )
    */
    
    //use catalog
    spark.sql("use tidb_catalog")
    
    // write data to tidb
    spark.sql(
          "insert into tpch_test.target_table_orders values" +
            "(1,1,'0',11.11,date'2022-1-1','first','spark',12345,'no comment')")
    
    // select data
    spark.sql("select * from tpch_test.target_table_orders").show

Configurations

The following table shows the insert configurations, which can be passed in through SparkConf or spark.conf.set. But you should not set tidb.password in SparkConf and only use spark.conf.set to set it.

tidb.addr, tidb.port, tidb.user and tidb.password are only needed when enableUpdateTableStatistics is true.

Key Required Default Description
tidb.addr false - TiDB address, needed when enableUpdateTableStatistics is true
tidb.port false - TiDB Port, needed when enableUpdateTableStatistics is true
tidb.user false - TiDB User, needed when enableUpdateTableStatistics is true
tidb.password false - TiDB Password, needed when enableUpdateTableStatistics is true
enableRegionSplit false true To split Region to avoid hot Region during insertion
scatterWaitMS false 300000 Max time to wait scatter region
bytesPerRegion false 100663296 (96M) Decrease this parameter to split more regions (increase write concurrency)
enableUpdateTableStatistics false false Update statistics in table mysql.stats_meta (tidb.user must own update privilege to table mysql.stats_meta if set true)
rowFormatVersion false 2 Version of row format to save data

TiDB Version

TiDB's version must be 4.0.0 or later.

Limitations

TiSpark support write into partition table using INSERT SQL now.

But TiSpark does not support insert with a partition spec and a column list, like:

INSERT INTO test.test PARTITION (i = 1)(s, k) VALUES ('hello', 'world');

Type Conversion For Insert

The following types of SparkSQL Data are currently not supported for writing to TiDB:

  • ArrayType
  • MapType
  • StructType

Spark SQL and TIDB have different conversion rules. We follow this rule when using DataSource API, but follow the rule of Spark SQL when using insert.

There is some CAST rules supported by Spark which must be explicit indicated. Like SELECT CAST('2020-01-01' AS Timestamp). Here for more details.

When convert Float/Double to Int/Short/Long, there is a subtle different between them. TiDB will use round(num), Spark SQL will use floor(num).

The complete conversion metrics are as follows.

target\source Boolean Binary Short Integer Long Float Double String Decimal Date Timestamp
BIT
BOOLEAN
TINYINT
SMALLINT
MEDIUMINT
INTEGER
BIGINT
FLOAT
DOUBLE
DECIMAL
DATE
DATETIME
TIMESTAMP
TIME
YEAR
CHAR
VARCHAR
TINYTEXT
TEXT
MEDIUMTEXT
LONGTEXT
BINARY
VARBINARY
TINYBLOB
BLOB
MEDIUMBLOB
LONGBLOB
ENUM
SET
JSON