
[Gold Standard]: Initial code for spark only setup with a single query #384

Merged: 31 commits merged into microsoft:master from gs_codeonly on Mar 16, 2021

Conversation

apoorvedave1 (Contributor):

What is the context for this pull request?

What changes were proposed in this pull request?

In this PR, we introduce the code for the Spark-only (non-Hyperspace) version of the Gold Standard tests. This PR is also limited to only query q1 of the TPC-DS queries.

In the subsequent PR #337, we will push the updated plans for all the remaining queries, q2-q99. The sole aim of this PR is code validation with an example query, q1.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

Comment on lines +35 to +37
trait TPCDSBase extends SparkFunSuite with SparkInvolvedSuite {

  val conf = SQLConf.get
apoorvedave1 (author):
Note: superclasses changed due to lack of Spark 3.0 support.


// The TPCDS queries below are based on v1.4.
// TODO: Fix build pipeline for q49 and re-enable q49.
val tpcdsQueries = Seq("q1")
apoorvedave1 (author):

Note: this is the complete list of queries that will run as part of this test. Currently only q1 is selected; in subsequent PRs, all queries will be enabled (see the sketch below).
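A sketch, purely an assumption, of what the fully enabled list might look like in those later PRs; note that TPC-DS v1.4 also includes multi-part variants (e.g. q14a/q14b) that a plain numeric range would miss:

// Hypothetical full list of TPC-DS v1.4 queries (multi-part variants omitted):
val tpcdsQueries: Seq[String] = (1 to 99).map(i => s"q$i")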

// TODO: Fix build pipeline for q49 and re-enable q49.
val tpcdsQueries = Seq("q1")

private val tableColumns = Map(
apoorvedave1 (author):

Note: no change in tableColumns.


val tableNames: Iterable[String] = tableColumns.keys

def createTable(
apoorvedave1 (author):

Note: no change in createTable.


private val originalCBCEnabled = conf.cboEnabled
private val originalJoinReorderEnabled = conf.joinReorderEnabled

apoorvedave1 (author):

Note: removed val originalPlanStatsEnabled from the source. It is only required for stats-based tests, which are not yet supported.

private val originalCBCEnabled = conf.cboEnabled
private val originalJoinReorderEnabled = conf.joinReorderEnabled

override def beforeAll(): Unit = {
apoorvedave1 (author):

Note: simplified beforeAll and afterAll by removing some stats-related code which we don't support yet. A sketch of the resulting save/restore pattern is below.
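A minimal sketch of the save/restore pattern these hooks follow, assuming the relevant SQLConf entries are CBO_ENABLED and JOIN_REORDER_ENABLED (matching the vals above); the concrete values set in beforeAll are placeholders, not necessarily what the PR uses:

import org.apache.spark.sql.internal.SQLConf

private val originalCBCEnabled = conf.cboEnabled
private val originalJoinReorderEnabled = conf.joinReorderEnabled

override def beforeAll(): Unit = {
  super.beforeAll()
  // Placeholder values; set whatever the suite actually requires.
  conf.setConf(SQLConf.CBO_ENABLED, false)
  conf.setConf(SQLConf.JOIN_REORDER_ENABLED, false)
}

override def afterAll(): Unit = {
  // Restore the original values so other suites are unaffected.
  conf.setConf(SQLConf.CBO_ENABLED, originalCBCEnabled)
  conf.setConf(SQLConf.JOIN_REORDER_ENABLED, originalJoinReorderEnabled)
  super.afterAll()
}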

*
* To run the entire test suite:
* {{{
* sbt "test:testOnly *PlanStabilitySuite"
apoorvedave1 (author):

Note: the run command differs from Spark's because of the difference between the Spark and Hyperspace project structures; in Spark, this test lives inside the sql/ project, so the command looks slightly different. The same applies to the other run commands.

*/
// scalastyle:on filelinelengthchecker

trait PlanStabilitySuite extends TPCDSBase with Logging {
apoorvedave1 (author):

Note: superclasses changed because of lack of Spark 3.0 support.


override def afterAll(): Unit = {
  super.afterAll()
}
apoorvedave1 (author):

Note: in beforeAll and afterAll, some Spark conf values have been changed (a sketch follows this list):

  • originalMaxToStringFields = conf.maxToStringFields => conf not present in Spark 2.4
  • spark.sql.crossJoin.enabled => set to true because some queries fail during the query-optimization phase in case of cross joins
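A one-line sketch of the cross-join setting, assuming a SparkSession named spark is in scope (e.g. provided by SparkInvolvedSuite, an assumption); spark.sql.crossJoin.enabled is the standard Spark 2.4 conf key:

// Allow the optimizer to plan cross joins; some TPC-DS queries otherwise
// fail during the optimization phase on Spark 2.4.
spark.conf.set("spark.sql.crossJoin.enabled", "true")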

case subquery: SubqueryExec =>
  subqueriesMap.getOrElseUpdate(subquery, subqueriesMap.size + 1)
case _ => -1
}
apoorvedave1 (author):

Note: removed a couple of case-match branches because of lack of Spark 3.0 support (a sketch is below):

  • case SubqueryBroadcastExec
  • case ReusedSubqueryExec
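A sketch, under assumptions, of how the Spark 3.0 original extends this match: the enclosing method name getId, the map keying, and the branch bodies are reconstructions based on the snippet above, and SubqueryBroadcastExec / ReusedSubqueryExec exist only in Spark 3.0:

import scala.collection.mutable

import org.apache.spark.sql.execution.{ReusedSubqueryExec, SparkPlan, SubqueryBroadcastExec, SubqueryExec}

private val subqueriesMap = mutable.Map[SparkPlan, Int]()

// Hypothetical Spark 3.0 form, including the branches removed in this PR:
private def getId(plan: SparkPlan): Int = plan match {
  case subquery: SubqueryExec =>
    subqueriesMap.getOrElseUpdate(subquery, subqueriesMap.size + 1)
  case subquery: SubqueryBroadcastExec =>
    subqueriesMap.getOrElseUpdate(subquery, subqueriesMap.size + 1)
  case ReusedSubqueryExec(child) =>
    subqueriesMap.getOrElseUpdate(child, subqueriesMap.size + 1)
  case _ => -1
}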

* "sum(sr_return_amt#14)", so we remove all of these using regex
*/
def cleanUpReferences(references: AttributeSet): String = {
referenceRegex.replaceAllIn(references.toSeq.map(_.name).sorted.mkString(","), "")
apoorvedave1 (author):

Note: added sorting of references for consistent behavior between the local and Azure build-pipeline setups. A small illustration is below.
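A minimal illustration of the behavior, assuming referenceRegex matches Spark expression IDs such as "#14" (the exact pattern in the PR may differ):

import scala.util.matching.Regex

val referenceRegex: Regex = "#\\d+".r // assumed pattern for expression IDs

val names = Seq("sr_return_amt#14", "sr_customer_sk#3")
// Sorting before joining makes the output deterministic across environments:
val cleaned = referenceRegex.replaceAllIn(names.sorted.mkString(","), "")
// cleaned == "sr_customer_sk,sr_return_amt"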

imback82 (Contributor) left a comment:

Generally looking fine to me.

@imback82 added the enhancement (New feature or request) label on Mar 16, 2021
@imback82 added this to the February/March 2021 (v0.5.0) milestone on Mar 16, 2021
Comment on lines +243 to +245
s"Location.*spark-warehouse/",
"Location [not included in comparison]/{warehouse_dir}/")
}
apoorvedave1 (author):

Note: normalization logic changed very slightly. We depend on the df.explain() command to generate output, whereas Spark depends on QueryExecution.explainString(), which generates a different output.

(to compare: link)
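A sketch of the location normalization, grounded in the snippet above; the method name normalizeLocation comes from the call site below, while the enclosing code is an assumption:

// Replace machine-specific warehouse paths so golden files stay portable.
private def normalizeLocation(plan: String): String = {
  plan.replaceAll(
    "Location.*spark-warehouse/",
    "Location [not included in comparison]/{warehouse_dir}/")
}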

classLoader = Thread.currentThread().getContextClassLoader)
val qe = spark.sql(queryString).queryExecution
val plan = qe.executedPlan
val explain = normalizeLocation(normalizeIds(explainString(qe)))
apoorvedave1 (author):

Note: small change here: use of a private def explainString instead of the Spark 3.0-only qe.explainString.

}
}

def explainString(queryExecution: QueryExecution): String = {
apoorvedave1 (author):

Note: new method, not present in the original. A sketch of what it might look like is below.
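A sketch of what such a helper might look like on Spark 2.4, where QueryExecution has no explainString method; the body shown (QueryExecution.simpleString) is an assumption, not necessarily the PR's implementation:

import org.apache.spark.sql.execution.QueryExecution

// Produce the physical-plan text that df.explain() would print on Spark 2.4.
def explainString(queryExecution: QueryExecution): String = {
  queryExecution.simpleString
}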

override def afterAll(): Unit = {
  super.afterAll()
}

apoorvedave1 (author):

Note: No change from this line onward, until my next comment.

apoorvedave1 (author):

> Generally looking fine to me.

Thanks @imback82, I added the fix to the TODO comment.

imback82 (Contributor):

Bintray is down (https://status.bintray.com/), causing build failures.

apoorvedave1 (author):

> Bintray is down (https://status.bintray.com/), causing build failures.

Looks like the issue got fixed. The latest build succeeded. @imback82, please take a look.

imback82 (Contributor) left a comment:

LGTM, thanks @apoorvedave1!

@imback82 merged commit 3ccb0ea into microsoft:master on Mar 16, 2021
@apoorvedave1 deleted the gs_codeonly branch on Mar 16, 2021 at 22:39