TiSpark enables you to use TiDB as the data source of Apache Spark, similar to other data sources (PostgreSQL, HDFS, S3, etc.).
It supports insert data using Spark SQL now.
-
Initiate
SparkConf.val sparkConf = new SparkConf(). setIfMissing("spark.master", "local[*]"). setIfMissing("spark.app.name", getClass.getName). setIfMissing("spark.sql.extensions", "org.apache.spark.sql.TiExtensions"). setIfMissing("spark.tispark.pd.addresses", "pd0:2379"). setIfMissing("spark.tispark.tidb.addr", "tidb"). setIfMissing("spark.tispark.tidb.port", "4000") val spark = SparkSession.builder.config(sparkConf).getOrCreate()
-
Write using Spark SQL.
/* create table before run the code CREATE TABLE tpch_test.target_table_orders ( `O_ORDERKEY` int(11) NOT NULL, `O_CUSTKEY` int(11) NOT NULL, `O_ORDERSTATUS` char(1) NOT NULL, `O_TOTALPRICE` decimal(15,2) NOT NULL, `O_ORDERDATE` date NOT NULL, `O_ORDERPRIORITY` char(15) NOT NULL, `O_CLERK` char(15) NOT NULL, `O_SHIPPRIORITY` int(11) NOT NULL, `O_COMMENT` varchar(79) NOT NULL ) */ //use catalog spark.sql("use tidb_catalog") // write data to tidb spark.sql( "insert into tpch_test.target_table_orders values" + "(1,1,'0',11.11,date'2022-1-1','first','spark',12345,'no comment')") // select data spark.sql("select * from tpch_test.target_table_orders").show
The following table shows the insert configurations, which can be passed in through SparkConf or spark.conf.set.
But you should not set tidb.password in SparkConf and only use spark.conf.set to set it.
tidb.addr, tidb.port, tidb.user and tidb.password are only needed when enableUpdateTableStatistics is true.
| Key | Required | Default | Description |
|---|---|---|---|
| tidb.addr | false | - | TiDB address, needed when enableUpdateTableStatistics is true |
| tidb.port | false | - | TiDB Port, needed when enableUpdateTableStatistics is true |
| tidb.user | false | - | TiDB User, needed when enableUpdateTableStatistics is true |
| tidb.password | false | - | TiDB Password, needed when enableUpdateTableStatistics is true |
| enableRegionSplit | false | true | To split Region to avoid hot Region during insertion |
| scatterWaitMS | false | 300000 | Max time to wait scatter region |
| bytesPerRegion | false | 100663296 (96M) | Decrease this parameter to split more regions (increase write concurrency) |
| enableUpdateTableStatistics | false | false | Update statistics in table mysql.stats_meta (tidb.user must own update privilege to table mysql.stats_meta if set true) |
| rowFormatVersion | false | 2 | Version of row format to save data |
TiDB's version must be 4.0.0 or later.
TiSpark support write into partition table using INSERT SQL now.
But TiSpark does not support insert with a partition spec and a column list, like:
INSERT INTO test.test PARTITION (i = 1)(s, k) VALUES ('hello', 'world');
The following types of SparkSQL Data are currently not supported for writing to TiDB:
- ArrayType
- MapType
- StructType
Spark SQL and TIDB have different conversion rules. We follow this rule when using DataSource API, but follow the rule of Spark SQL when using insert.
There is some CAST rules supported by Spark which must be explicit indicated. Like
SELECT CAST('2020-01-01' AS Timestamp). Here
for more details.
When convert Float/Double to Int/Short/Long, there is a subtle different between them. TiDB will use
round(num), Spark SQL will use floor(num).
The complete conversion metrics are as follows.
| target\source | Boolean | Binary | Short | Integer | Long | Float | Double | String | Decimal | Date | Timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BIT | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| BOOLEAN | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| TINYINT | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| SMALLINT | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| MEDIUMINT | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| INTEGER | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| BIGINT | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| FLOAT | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| DOUBLE | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| DECIMAL | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ |
| DATE | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ |
| DATETIME | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ |
| TIMESTAMP | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ |
| TIME | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| YEAR | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| CHAR | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ |
| VARCHAR | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ |
| TINYTEXT | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ |
| TEXT | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ |
| MEDIUMTEXT | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ |
| LONGTEXT | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ |
| BINARY | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| VARBINARY | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| TINYBLOB | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| BLOB | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| MEDIUMBLOB | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| LONGBLOB | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| ENUM | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| SET | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| JSON | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |