Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-21774][SQL] The rule PromoteStrings should cast a string to double type when compare with a int/long #18986

Closed
wants to merge 5 commits into from

Conversation

stanzhai
Copy link
Contributor

What changes were proposed in this pull request?

The rule PromoteStrings should cast a string to double type when compare with a int.

This PR fixed this.

How was this patch tested?

Origin test cases updated.

@@ -127,8 +127,10 @@ object TypeCoercion {
case (DateType, TimestampType) => Some(StringType)
case (StringType, NullType) => Some(StringType)
case (NullType, StringType) => Some(StringType)
case (StringType, IntegerType) => Some(DoubleType)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How abount other types (e.g., LongType)? Is the similar issue still exists?
I think the logic in pr-15880 (#15880) is reasonable.

@gatorsmile
Copy link
Member

ok to test

@gatorsmile
Copy link
Member

@stanzhai Implicit type casting has to follow some rules. If you think this is a bug, could you do a complete investigation about this issue? For example, how Hive handles the type promotion?

If needed, we can consider supporting different type promotion/casting modes. (e.g., one is Hive compatible)

@SparkQA
Copy link

SparkQA commented Aug 18, 2017

Test build #80849 has finished for PR 18986 at commit fd77e1e.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@stanzhai stanzhai changed the title [SPARK-21774][SQL] The rule PromoteStrings should cast a string to double type when compare with a int [SPARK-21774][SQL] The rule PromoteStrings should cast a string to double type when compare with a int/long Aug 19, 2017
@stanzhai
Copy link
Contributor Author

stanzhai commented Aug 19, 2017

In MySQL conversion of values from one string type to numeric, will be compared as floating-point (real) numbers.

https://dev.mysql.com/doc/refman/5.7/en/type-conversion.html

The following rules describe how conversion occurs for comparison operations:

  • If one or both arguments are NULL, the result of the comparison is NULL, except for the NULL-safe <=> equality comparison operator. For NULL <=> NULL, the result is true. No conversion is needed.
  • If both arguments in a comparison operation are strings, they are compared as strings.
  • If both arguments are integers, they are compared as integers.
  • Hexadecimal values are treated as binary strings if not compared to a number.
  • If one of the arguments is a TIMESTAMP or DATETIME column and the other argument is a constant, the constant is converted to a timestamp before the comparison is performed. This is done to be more ODBC-friendly. > Note that this is not done for the arguments to IN()! To be safe, always use complete datetime, date, or time strings when doing comparisons. For example, to achieve best results when using BETWEEN with date or time values, use CAST() to explicitly convert the values to the desired data type.
  • A single-row subquery from a table or tables is not considered a constant. For example, if a subquery returns an integer to be compared to a DATETIME value, the comparison is done as two integers. The integer is not converted to a temporal value. To compare the operands as DATETIME values, use CAST() to explicitly convert the subquery value to DATETIME.
  • If one of the arguments is a decimal value, comparison depends on the other argument. The arguments are compared as decimal values if the other argument is a decimal or integer value, or as floating-point values if the other argument is a floating-point value.
  • In all other cases, the arguments are compared as floating-point (real) numbers.

@SparkQA
Copy link

SparkQA commented Aug 19, 2017

Test build #80866 has finished for PR 18986 at commit 627116a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 19, 2017

Test build #80868 has finished for PR 18986 at commit 1647447.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

@stanzhai @wangyum How about Hive? I think your usage scenarios are for hive compatibility, right?

@wangyum
Copy link
Member

wangyum commented Aug 20, 2017

@gatorsmile @maropu Cast to double seems correct.

hive> create table spark_21646(c1 string, c2 string);
hive> insert into spark_21646 values('922337203685477580719223372036854775807192233720368547758071', 'a');
hive> insert into spark_21646 values('21474836471', 'b');
hive> insert into spark_21646 values('10', 'c');
hive> insert into spark_21646 values('0.125', 'd');
hive> insert into spark_21646 values('1.25', 'e');
hive> select * from spark_21646 where c1 > 0;
922337203685477580719223372036854775807192233720368547758071	a
0.125	d
21474836471	b
1.25	e
10	c
spark-sql>  select * from spark_21646 where c1 > 0;
1.25    e                                                                       
10	c
spark-sql>  select * from spark_21646 where c1 > 0L;
21474836471     b                                                               
1.25	e
10	c
spark-sql>  select * from spark_21646 where c1 > 0D;
922337203685477580719223372036854775807192233720368547758071    a               
0.125	d
21474836471	b
1.25	e
10	c
spark-sql> 

@gatorsmile
Copy link
Member

In Hive, 1 = 'true' will return true? 19157170390056973L = '19157170390056971' will return true?

@maropu
Copy link
Member

maropu commented Aug 20, 2017

Either way, we couldn't touch on this behaviour until the next major release? (It seems we've already fixed this policy for the 2.x releases in #15880... maybe). Btw, I feel we first should document this implicit casting behaviour somewhere like https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-AllowedImplicitConversions cuz this makes users much confused.

@gatorsmile
Copy link
Member

@maropu
Copy link
Member

maropu commented Aug 20, 2017

Yea, since this topic is important for some users, I mean we better move the doc into ./docs/ ( I feel novices dont seem to check the code documents).

@DonnyZone
Copy link
Contributor

DonnyZone commented Aug 21, 2017

@gatorsmile @wangyum For this issue, I think the behevior in PromoteStrings rule is reasonable, but there are problems in underlying converter UTF8String.

As described in PR-15880 (#15880):

It's more reasonable to follow postgres, i.e. cast string to the type of the other side, but return null if the string is not castable to keep hive compatibility.

However, the underlying UTF8String still returns true for cases that are not castable. From the below code, we can get res=true and wrapper.value=0. Consequently, it leads to the wrong answer in SPARK-21774.

val x = UTF8String.fromString("0.1")
val wrapper = new IntWrapper
val res = x.toInt(wrapper)

@stanzhai
Copy link
Contributor Author

@DonnyZone @gatorsmile @cloud-fan PostgreSQL will throw an error when comparing a string to a int.

postgres=# select * from tb;
  a   | b
------+---
 0.1  | 1
 a    | 1
 true | 1
(3 rows)

postgres=# select * from tb where a>0;
ERROR:  operator does not exist: character varying > integer
LINE 1: select * from tb where a>0;
                                ^
HINT:  No operator matches the given name and argument type(s). You might need to add explicit type casts.

@stanzhai
Copy link
Contributor Author

@gatorsmile @DonnyZone When comparing a string to a int in Hive, it will cast string type to double.

hive> select * from tb;
0	0
0.1	0
true	0
19157170390056971	0
hive> select * from tb where a = 0;
0	0
hive> select * from tb where a = 19157170390056973L;
WARNING: Comparing a bigint and a string may result in a loss of precision.
19157170390056973	0
hive> select 1 = 'true';
NULL
hive> select 19157170390056973L = '19157170390056971';
WARNING: Comparing a bigint and a string may result in a loss of precision.
true

So, I think that cast a string to double type when compare with a numeric is more reasonable.

Actually, my usage scenarios are for Spark compatibility. The problem I found when I upgraded Spark to 2.2.0, and lots of SQL's results are wrong.

@Neilxzn
Copy link

Neilxzn commented May 6, 2019

2.4.0 has the same problem without this patch. For example, select '0.99' > 0 in spark 2.4.0 will return false, but in hive it will return true.

@cloud-fan
Copy link
Contributor

I don't think there is a perfect solution here. When comparing a string and a number, you have no idea which type we should cast the string to, without looking into the string content.

For this specific case: select '0.99' > 0. I think returning null is more reasonable, as it's invalid to cast "0.99" to int type.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@miomiocat
Copy link

any update for this issue right now?

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Feb 20, 2020
@github-actions github-actions bot closed this Feb 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet