[SPARK-21774][SQL] The rule PromoteStrings should cast a string to double type when compare with a int/long #18986

stanzhai · 2017-08-18T03:16:54Z

What changes were proposed in this pull request?

The rule PromoteStrings should cast a string to double type when compare with a int.

This PR fixed this.

How was this patch tested?

Origin test cases updated.

DonnyZone · 2017-08-18T09:25:17Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

@@ -127,8 +127,10 @@ object TypeCoercion {
    case (DateType, TimestampType) => Some(StringType)
    case (StringType, NullType) => Some(StringType)
    case (NullType, StringType) => Some(StringType)
+    case (StringType, IntegerType) => Some(DoubleType)


How abount other types (e.g., LongType)? Is the similar issue still exists?
I think the logic in pr-15880 (#15880) is reasonable.

gatorsmile · 2017-08-18T16:18:01Z

ok to test

gatorsmile · 2017-08-18T16:29:03Z

@stanzhai Implicit type casting has to follow some rules. If you think this is a bug, could you do a complete investigation about this issue? For example, how Hive handles the type promotion?

If needed, we can consider supporting different type promotion/casting modes. (e.g., one is Hive compatible)

SparkQA · 2017-08-18T18:56:39Z

Test build #80849 has finished for PR 18986 at commit fd77e1e.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

stanzhai · 2017-08-19T03:13:55Z

In MySQL conversion of values from one string type to numeric, will be compared as floating-point (real) numbers.

https://dev.mysql.com/doc/refman/5.7/en/type-conversion.html

The following rules describe how conversion occurs for comparison operations:

If one or both arguments are NULL, the result of the comparison is NULL, except for the NULL-safe <=> equality comparison operator. For NULL <=> NULL, the result is true. No conversion is needed.
If both arguments in a comparison operation are strings, they are compared as strings.
If both arguments are integers, they are compared as integers.
Hexadecimal values are treated as binary strings if not compared to a number.
If one of the arguments is a TIMESTAMP or DATETIME column and the other argument is a constant, the constant is converted to a timestamp before the comparison is performed. This is done to be more ODBC-friendly. > Note that this is not done for the arguments to IN()! To be safe, always use complete datetime, date, or time strings when doing comparisons. For example, to achieve best results when using BETWEEN with date or time values, use CAST() to explicitly convert the values to the desired data type.
A single-row subquery from a table or tables is not considered a constant. For example, if a subquery returns an integer to be compared to a DATETIME value, the comparison is done as two integers. The integer is not converted to a temporal value. To compare the operands as DATETIME values, use CAST() to explicitly convert the subquery value to DATETIME.
If one of the arguments is a decimal value, comparison depends on the other argument. The arguments are compared as decimal values if the other argument is a decimal or integer value, or as floating-point values if the other argument is a floating-point value.
In all other cases, the arguments are compared as floating-point (real) numbers.

SparkQA · 2017-08-19T03:48:04Z

Test build #80866 has finished for PR 18986 at commit 627116a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-08-19T05:32:40Z

Test build #80868 has finished for PR 18986 at commit 1647447.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-08-19T18:31:12Z

@stanzhai @wangyum How about Hive? I think your usage scenarios are for hive compatibility, right?

wangyum · 2017-08-20T02:31:13Z

@gatorsmile @maropu Cast to double seems correct.

hive> create table spark_21646(c1 string, c2 string);
hive> insert into spark_21646 values('922337203685477580719223372036854775807192233720368547758071', 'a');
hive> insert into spark_21646 values('21474836471', 'b');
hive> insert into spark_21646 values('10', 'c');
hive> insert into spark_21646 values('0.125', 'd');
hive> insert into spark_21646 values('1.25', 'e');
hive> select * from spark_21646 where c1 > 0;
922337203685477580719223372036854775807192233720368547758071	a
0.125	d
21474836471	b
1.25	e
10	c

spark-sql>  select * from spark_21646 where c1 > 0;
1.25    e                                                                       
10	c
spark-sql>  select * from spark_21646 where c1 > 0L;
21474836471     b                                                               
1.25	e
10	c
spark-sql>  select * from spark_21646 where c1 > 0D;
922337203685477580719223372036854775807192233720368547758071    a               
0.125	d
21474836471	b
1.25	e
10	c
spark-sql>

gatorsmile · 2017-08-20T05:47:30Z

In Hive, 1 = 'true' will return true? 19157170390056973L = '19157170390056971' will return true?

maropu · 2017-08-20T08:35:31Z

Either way, we couldn't touch on this behaviour until the next major release? (It seems we've already fixed this policy for the 2.x releases in #15880... maybe). Btw, I feel we first should document this implicit casting behaviour somewhere like https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-AllowedImplicitConversions cuz this makes users much confused.

gatorsmile · 2017-08-20T16:12:15Z

@maropu Previously I did it in https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala#L34-L59

We can follow what MySQL did https://dev.mysql.com/doc/refman/5.7/en/type-conversion.html. It looks more user friendly.

maropu · 2017-08-20T23:29:47Z

Yea, since this topic is important for some users, I mean we better move the doc into ./docs/ ( I feel novices dont seem to check the code documents).

DonnyZone · 2017-08-21T05:37:24Z

@gatorsmile @wangyum For this issue, I think the behevior in PromoteStrings rule is reasonable, but there are problems in underlying converter UTF8String.

As described in PR-15880 (#15880):

It's more reasonable to follow postgres, i.e. cast string to the type of the other side, but return null if the string is not castable to keep hive compatibility.

However, the underlying UTF8String still returns true for cases that are not castable. From the below code, we can get res=true and wrapper.value=0. Consequently, it leads to the wrong answer in SPARK-21774.

val x = UTF8String.fromString("0.1")
val wrapper = new IntWrapper
val res = x.toInt(wrapper)

stanzhai · 2017-08-22T02:59:30Z

@DonnyZone @gatorsmile @cloud-fan PostgreSQL will throw an error when comparing a string to a int.

postgres=# select * from tb;
  a   | b
------+---
 0.1  | 1
 a    | 1
 true | 1
(3 rows)

postgres=# select * from tb where a>0;
ERROR:  operator does not exist: character varying > integer
LINE 1: select * from tb where a>0;
                                ^
HINT:  No operator matches the given name and argument type(s). You might need to add explicit type casts.

stanzhai · 2017-08-22T03:31:38Z

@gatorsmile @DonnyZone When comparing a string to a int in Hive, it will cast string type to double.

hive> select * from tb;
0	0
0.1	0
true	0
19157170390056971	0
hive> select * from tb where a = 0;
0	0
hive> select * from tb where a = 19157170390056973L;
WARNING: Comparing a bigint and a string may result in a loss of precision.
19157170390056973	0
hive> select 1 = 'true';
NULL
hive> select 19157170390056973L = '19157170390056971';
WARNING: Comparing a bigint and a string may result in a loss of precision.
true

So, I think that cast a string to double type when compare with a numeric is more reasonable.

Actually, my usage scenarios are for Spark compatibility. The problem I found when I upgraded Spark to 2.2.0, and lots of SQL's results are wrong.

Neilxzn · 2019-05-06T12:48:24Z

2.4.0 has the same problem without this patch. For example, select '0.99' > 0 in spark 2.4.0 will return false, but in hive it will return true.

cloud-fan · 2019-05-06T16:30:24Z

I don't think there is a perfect solution here. When comparing a string and a number, you have no idea which type we should cast the string to, without looking into the string content.

For this specific case: select '0.99' > 0. I think returning null is more reasonable, as it's invalid to cast "0.99" to int type.

AmplabJenkins · 2019-09-16T18:25:06Z

Can one of the admins verify this patch?

miomiocat · 2019-11-11T05:33:35Z

any update for this issue right now?

github-actions · 2020-02-20T00:11:52Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

stanzhai added 2 commits August 18, 2017 10:17

fix a bug of TypeCoercion

1a289a5

add a test case for SPARK-21774

fd77e1e

DonnyZone reviewed Aug 18, 2017

View reviewed changes

fix string -> long

69becf6

stanzhai changed the title ~~[SPARK-21774][SQL] The rule PromoteStrings should cast a string to double type when compare with a int~~ [SPARK-21774][SQL] The rule PromoteStrings should cast a string to double type when compare with a int/long Aug 19, 2017

update test case

627116a

replace IntegerType with NumericType

1647447

dongjoon-hyun added the SQL label Jun 14, 2019

github-actions bot added the Stale label Feb 20, 2020

github-actions bot closed this Feb 21, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-21774][SQL] The rule PromoteStrings should cast a string to double type when compare with a int/long #18986

[SPARK-21774][SQL] The rule PromoteStrings should cast a string to double type when compare with a int/long #18986

stanzhai commented Aug 18, 2017

DonnyZone Aug 18, 2017

gatorsmile commented Aug 18, 2017

gatorsmile commented Aug 18, 2017

SparkQA commented Aug 18, 2017

stanzhai commented Aug 19, 2017 •

edited

SparkQA commented Aug 19, 2017

SparkQA commented Aug 19, 2017

gatorsmile commented Aug 19, 2017

wangyum commented Aug 20, 2017 •

edited

gatorsmile commented Aug 20, 2017

maropu commented Aug 20, 2017

gatorsmile commented Aug 20, 2017

maropu commented Aug 20, 2017

DonnyZone commented Aug 21, 2017 •

edited

stanzhai commented Aug 22, 2017

stanzhai commented Aug 22, 2017

Neilxzn commented May 6, 2019

cloud-fan commented May 6, 2019

AmplabJenkins commented Sep 16, 2019

miomiocat commented Nov 11, 2019

github-actions bot commented Feb 20, 2020

[SPARK-21774][SQL] The rule PromoteStrings should cast a string to double type when compare with a int/long #18986

[SPARK-21774][SQL] The rule PromoteStrings should cast a string to double type when compare with a int/long #18986

Conversation

stanzhai commented Aug 18, 2017

What changes were proposed in this pull request?

How was this patch tested?

DonnyZone Aug 18, 2017

Choose a reason for hiding this comment

gatorsmile commented Aug 18, 2017

gatorsmile commented Aug 18, 2017

SparkQA commented Aug 18, 2017

stanzhai commented Aug 19, 2017 • edited

SparkQA commented Aug 19, 2017

SparkQA commented Aug 19, 2017

gatorsmile commented Aug 19, 2017

wangyum commented Aug 20, 2017 • edited

gatorsmile commented Aug 20, 2017

maropu commented Aug 20, 2017

gatorsmile commented Aug 20, 2017

maropu commented Aug 20, 2017

DonnyZone commented Aug 21, 2017 • edited

stanzhai commented Aug 22, 2017

stanzhai commented Aug 22, 2017

Neilxzn commented May 6, 2019

cloud-fan commented May 6, 2019

AmplabJenkins commented Sep 16, 2019

miomiocat commented Nov 11, 2019

github-actions bot commented Feb 20, 2020

stanzhai commented Aug 19, 2017 •

edited

wangyum commented Aug 20, 2017 •

edited

DonnyZone commented Aug 21, 2017 •

edited