
Flattening Tuples, finding Variables by name, and converting Time Tuples #115

Merged: 72 commits into master from time-tuples, Aug 28, 2020

Conversation

@sapols (Contributor) commented Jul 28, 2020

I'm opening a PR early to start getting eyes on this code I've been working on. It's all pretty rough at the moment.

Changes by file:

dataType.scala

  • Ported findAllVariables from v2
  • Handled Tuple IDs in flatten (this was tricky and my solution is smelly)
  • Added TODOs in getPath that point out its inability to find paths to Tuples

dataTypeSuite.scala

  • Added a variety of unit tests to test edge cases of flattening Tuples (I did test-driven development!)

TimeTupleToTime.scala

  • Started this Operation (development is largely stalled until getPath is figured out)

dailyFlux.txt / dailyFlux.fdml

  • A toy dataset that splits time variables into a Tuple

Open questions:

  • I forgot what naming conventions we decided on when using FunSuite. Is it a problem that the file name is dataTypeSuite.scala but the class is TupleFlattenSuite? Would it not be appropriate to add other classes to this file that test other dataType things?

  • Is my solution to flattening Tuples too smelly? I struggled with that problem for a long time and this finally does everything it should, but the code isn't pretty.

@dlindhol (Member) left a comment

I added some thoughts on a few things. I'm saving the flatten method and the tests for another round.

* Find all Variables within this Variable by the given name.
* TODO: support aliases
*/
def findAllVariables(variableName: String): Seq[DataType] = {
Member

I'm a bit uneasy about this approach from v2 since it implies that there can be multiple variables with the same name. Maybe this is OK in the context of name (as opposed to id) given aliases. Though, ambiguous aliases smell like a potential problem. The idea is that a user could use an alias in a selection. Should it apply to all variables with that alias? Probably not.

This is relevant in v2 where any variable of type Time gets the alias time. In that case we apply the selection to the first (presumably the primary) time variable. The idea here is that we can use standard names for standard variables like time, wavelength, irradiance....

We also add the original variable name to the alias list when it is renamed.

These uses of alias are convenient, except when they aren't. Mathematical rigor would suggest that we never have this ambiguity problem, but aliases are convenient. Hmm. Maybe we better punt on aliases for now and hide this from the public API.

@sapols (Contributor, Author) commented Aug 19, 2020

I believe I have addressed all the recent concerns we've been discussing in our standups. So please, review away. I'm done making changes unless any get suggested.

@@ -73,34 +102,46 @@ sealed trait DataType extends MetadataLike with Serializable {
* Return this DataType with all nested Tuples flattened to a single Tuple.
* A Scalar will remain a Scalar.
* This form is consistent with Samples which don't preserve nested Functions.
* Flattened Tuples retain the ID of the outermost Tuple.
*/
def flatten: DataType = {
Contributor

I have a question that's more for @dlindhol and @lindholc . flatten on List will only flatten one layer deep, so something like List[List[List[Int]]] will flatten to List[List[Int]]. It looks like this implementation on DataType will flatten to have no nested tuples. Do we want to be consistent with List? I'm worried the name is misleading.

Contributor Author

Point well taken, but I think the point is to eliminate nested tuples because (1) they cause problems and (2) they aren't preserved on the Data side anyway. So that's definitely what the method should do; maybe the question is more if "flatten" is an appropriate name or not.

Contributor

I agree flattening everything is more useful and it's what we want here. I wouldn't dare suggest a new name for it because that's hard to do.

Member

Eew. I didn't realize that List.flatten didn't recurse. How about something like flattenAll? Surely there is some precedent out there.
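For anyone verifying the one-level behavior discussed here, a standalone Scala sketch (not code from this PR):

```scala
// List.flatten removes exactly one level of nesting per call,
// unlike the proposed DataType.flatten, which removes them all.
val nested: List[List[List[Int]]] = List(List(List(1, 2), List(3)))

val once: List[List[Int]] = nested.flatten  // one layer peeled off
val twice: List[Int]      = once.flatten    // fully flattened

println(once)   // List(List(1, 2), List(3))
println(twice)  // List(1, 2, 3)
```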

case Function(d, r) => acc :+ Function(d.flatten, r.flatten)
case s: Scalar if tupIds.isEmpty => acc :+ s
//prepend Tuple ID(s) with dot(s) and drop leading dot(s)
case s: Scalar => acc :+ s.rename(s"$tupIds.${s.id}".replaceFirst("^\\.+", ""))
Contributor

If you're dropping the leading dots, why even distinguish this case from if tupIds.isEmpty?

Contributor Author

Oh thank you for noticing this! At one point I wasn't dropping leading dots, so these two cases were distinct. When I started dropping them I just didn't notice that they became indistinguishable. It always bothered me that I had to move the arrows => so far to the right to be in line with this case. Removing it lets me scooch those arrows to the left and that makes me happy.

//prepend Tuple ID(s) with dot(s) and drop leading dot(s)
case s: Scalar => acc :+ s.rename(s"$tupIds.${s.id}".replaceFirst("^\\.+", ""))
case Function(d, r) => acc :+ Function(d.flatten, r.flatten)
case t @ Tuple(es @ _*) => if (tupIds.isEmpty && !t.id.isEmpty) tupIds = t.id else tupIds += s".${t.id}"
@RyanJHeld (Contributor), Aug 21, 2020

After the else it looks like it's assumed both tupIds and t.id are not empty, but this part of the code would be reached if:

  1. tupIds is not empty and t.id is empty, resulting in tupIds having a trailing dot.
  2. both are empty, resulting in tupIds being just a dot.

@sapols (Contributor, Author), Aug 21, 2020

Hey someone finally noticed this. 🤓 Thank you for getting your head into this, Ryan. At the moment, that's actually intentional. It's also why I drop leading dots in case s: Scalar.

The reason for this is actually to make sure that flattened Tuples retain the ID of the outermost Tuple. Look at the last line in this method:

else Tuple(Metadata(tupIds.split('.').head), types)

That's the line that grabs the ID of the outermost Tuple. The problem I had was: if I don't allow the case t @ Tuple(es @ _*) case to add just dots, the line actually grabs the ID of the first named Tuple, which isn't the same thing.

To see this, imagine an anonymous outer Tuple with a nested Tuple named "tup1". tupIds will be ".tup1". ".tup1".split('.').head is "" which is correct. If I didn't allow the leading dot for the anonymous Tuple, tupIds would instead be "tup1". "tup1".split('.').head is tup1 which is incorrect.

So effectively, the leading dot keeps track of the fact that an anonymous Tuple was there. This was the cleanest solution I could think of. I considered using a boolean flag to keep track of whether the outermost Tuple was anonymous, but boolean flags are like the least functional thing ever so I didn't go there.

Contributor

It all makes sense now! There's more going on here than I knew. Thanks for explaining that. Is there a way to condense that into a one-line comment so future developers (let's be real, I mean future us) know what this is doing? Maybe "anonymous tuples are represented with empty strings here".

Contributor Author

Good call. I came up with this comment:

//build up a dot-separated String of Tuple IDs, including empty IDs that stand in for anonymous Tuples
case t @ Tuple(es @ _*) => if (tupIds.isEmpty && !t.id.isEmpty) tupIds = t.id else tupIds += s".${t.id}"
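The role of the empty-string placeholders can be checked in isolation (a standalone sketch, not code from this PR):

```scala
// A nested Tuple "tup1" inside an anonymous outer Tuple yields tupIds ".tup1".
// The leading dot is the empty ID standing in for the anonymous outer Tuple.
val withPlaceholder    = ".tup1"
val withoutPlaceholder = "tup1"

// split('.').head recovers the outermost Tuple's ID:
println(withPlaceholder.split('.').head.isEmpty)  // true: outer Tuple correctly anonymous
println(withoutPlaceholder.split('.').head)       // tup1: inner ID wrongly promoted to outer
```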

//extract text values and join with space
//TODO: join with delimiter, problem when we use regex?
timePos match {
//TODO: reduce code duplication? Only difference is dd vs rd (types don't match) and replacing domain vs range
Contributor

You could define a function

def time(pos: Int, data: Seq[Data]): Datum = {
  val timeData = data.slice(pos, pos+timeLen)
  Data.StringValue(
    timeData.map { case d: Datum => d.asString }.mkString(" ")
  )
}

I think that would allow you to pass dd or rd and still get time out. Then your case statement would be

case DomainPosition(n) =>
  val domain = dd.slice(0, n) ++ Seq(time(n, dd)) ++ dd.slice(n+timeLen, dd.length)
  Sample(domain, rd)
case RangePosition(n) =>
  val range = rd.slice(0, n) ++ Seq(time(n, rd)) ++ rd.slice(n+timeLen, rd.length)
  Sample(dd, range)
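The suggestion above can be tried out with plain Strings standing in for Data/Datum (a runnable approximation of the idea, not the actual LaTiS types):

```scala
val timeLen = 3

// Join timeLen adjacent elements starting at pos into one space-separated value.
def time(pos: Int, data: Seq[String]): String =
  data.slice(pos, pos + timeLen).mkString(" ")

// Splice the joined value back in place of the elements it replaced.
def splice(pos: Int, data: Seq[String]): Seq[String] =
  data.slice(0, pos) ++ Seq(time(pos, data)) ++ data.slice(pos + timeLen, data.length)

// ("2020", "08", "28") collapses into a single "2020 08 28" element:
println(splice(0, Seq("2020", "08", "28", "42.0")))  // List(2020 08 28, 42.0)
```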

@@ -73,34 +100,45 @@ sealed trait DataType extends MetadataLike with Serializable {
* Return this DataType with all nested Tuples flattened to a single Tuple.
* A Scalar will remain a Scalar.
* This form is consistent with Samples which don't preserve nested Functions.
Contributor

I'm just noticing this comment. I know this line isn't part of the PR, but flatten does preserve nested Functions.

Member

This should say "... Samples which don't preserve nested Tuples"

@dlindhol (Member) left a comment

Looks like you got to be a guinea pig ( 🇬🇳 🐷 ) for bigger issues again. Some of these comments simply seemed like a good place to bring up some style questions that we should agree on. Others certainly suggest an alternate solution, but I'll leave it to your (and other reviewers') discretion whether they need to be addressed as part of this PR.

def map(f: DataType => DataType): DataType = this match {
case s: Scalar => f(s)
case Tuple(es @ _*) => Tuple(es.map(f))
case Function(d, r) => Function(d.map(f), r.map(f))
case t @ Tuple(es @ _*) => f(Tuple(t.metadata, es.map(f)))
Member

Note that we can always include metadata munging as part of the function f. This simply carries the original metadata through so it can do so. That makes me feel better about DataType.map.

Contributor Author

Good! I was pretty convinced I nailed the logic here. It's pretty simple actually.
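The metadata-carrying behavior can be illustrated with a toy ADT (hypothetical types DT/Tup, much simplified from the real DataType):

```scala
sealed trait DT
case class Scalar(id: String) extends DT
case class Tup(metadata: String, es: Seq[DT]) extends DT

// A structure-preserving map: the Tuple's original metadata is passed
// through to f, so f may keep it or rewrite it.
def mapDT(dt: DT)(f: DT => DT): DT = dt match {
  case s: Scalar   => f(s)
  case Tup(md, es) => f(Tup(md, es.map(mapDT(_)(f))))
}

val renamed = mapDT(Tup("outer", Seq(Scalar("a")))) {
  case Scalar(id) => Scalar(id.toUpperCase)
  case other      => other  // leaves the Tuple (and its metadata) untouched
}
println(renamed)  // Tup(outer,List(Scalar(A)))
```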


}
}

/**
* Return the path within this DataType to a given variable ID
* as a sequence of SamplePositions.
* Note that length of the path reflects the number of nested Functions.
* When searching a Tuple's ID, the path to the first Scalar in the Tuple is returned.
*/
def getPath(id: String): Option[SamplePath] = {
Member

I think this is where the idea of a ConstantFunction can come to the rescue. A Scalar or Tuple can be thought of as a Function of arity zero. A Sample of such would have an empty domain with the data in the range. So, a SamplePosition for a Scalar or Tuple would necessarily be a RangePosition.

if (id == dt.id) Some(currentPath) //found it //TODO: use hasName to cover aliases?
//TODO: use hasName to cover aliases?
//searching fully qualified ID with namespace
if (id.contains('.') && dt.id == id) Some(currentPath) //found it
Member

What did we decide about dots in Identifiers? I recall that they are not allowed in an Identifier but that qualified identifiers are another matter.

I also recall cringing at the thought that we need tuples of arity one to represent, for example, B.x where the tuple provides the B namespace if we can't define the scalar identifier to be B.x.

Member

The conclusion for now: Stick with string identifiers with dots. We'll make a new ticket to rely exclusively on the model structure to preserve namespace and use Identifier here and elsewhere without support for dots (".") (#130). flatten can thus be an encapsulated implementation detail as needed to achieve some result.

I don't think there is a use case for users to flatten their datasets since it doesn't alter the data. The concept only seems useful for us to map a variable to a sample element (as here) or to capture the structural name space in a flattened view of the metadata (e.g. csv column headings). It doesn't seem worthy of the public API.

Presumably, the getPath use case could be done without needing to reconstruct a flattened representation of the model. Something logically equivalent to zipping the hierarchical structure of the model with the flattened structure of a Sample, but without the baggage of actually doing so. (#129)


val time: Scalar = timeTuple match {
case Tuple(es @ _*) =>
//build up format string
Member

Style question: Should this kind of comment be elevated a bit with a space and capital letter? Maybe include the s to be consistent with method comments?
// Builds ...

Contributor Author

Eh, personally, I like it better lowercase and without a space. Lowercase and no space is what we normally do for comments, and I don't see this comment as anything special in need of "elevating." I'd be cool with adding s to make it present tense though. But even that doesn't seem necessary because normal comments don't make it into the Scaladoc. As I recall, the only reason we add s to Scaladoc comments is so that they read consistently for users who are browsing the Scaladoc.

//TODO: join with delimiter, problem when we use regex?
timePos match {
case DomainPosition(n) =>
val domain = dd.slice(0, n) ++ Seq(time(n, timeLen, dd)) ++ dd.slice(n+timeLen, dd.length)
Member

Factoring out the time function was a good idea. Could we take it a step further and move the slicing logic there, too?

Contributor Author

Good call. No reason that wouldn't work.


class GetPathSuite extends FunSuite {
test("getPath to Tuple in domain") {
val func = {
Member

Could you use ModelParser.parse("(a, b, c) -> d")?

Maybe we should add DataType.fromExpression?

Contributor Author

No, for a couple of reasons: (1) that code fails: Left(latis.util.LatisException: Failed to parse model expression: (a, b, c) -> d. Failure reading:'_'), and (2) the name of the Tuple is actually important here, and ModelParser would produce an anonymous Tuple.

import latis.metadata.Metadata
import org.scalatest.FunSuite

class GetPathSuite extends FunSuite {
Member

Oh, if there were only a way to automatically generate test cases. It's hard to know when the test coverage is sufficient. Property based testing seems so tempting if we could just figure out a lawful property to preserve. Or an alternate approach to generate the right answer?

val metadata: Metadata = Metadata("MockDataset") + ("title" -> "Mock Dataset")
val model: DataType = Function(
Tuple(Metadata("time"),
Scalar(Metadata("year") + ("type" -> "int") + ("units" -> "yyyy")),
Member

I would expect that these types would need to be "text" given units of this form. Not that it matters here in the test.

Contributor Author

Good catch. Don't wanna be inconsistent even when it doesn't matter. Fixed.

),
Scalar(Metadata("flux") + ("type" -> "int"))
)
val data: MemoizedFunction = SeqFunction(
Member

Could we use a "dataset builder" to make this for us more easily?
Note that this is a 3D dataset with a 1D manifold (i.e. a line in 3D "space") as opposed to a Cartesian dataset.

Contributor Author

I mean in theory, anything's possible. But that'd be so much work that I'm gonna say no to doing it for this PR. The metadata is actually the important stuff here, not the data values really. So we'd need a "dataset builder" that could not only produce a properly-structured dataset but also one with specified metadata everywhere.

@sapols (Contributor, Author) commented Aug 25, 2020

Comment that's mostly for myself to read in the morning: Refactoring I still need to do in response to Doug's comments:

  • change findVariable to use findAllVariables's head then use findVariable in my code
  • move my mapFunction's slicing logic into the time helper function (then rename that function)

I believe I've addressed everything else.

@sapols (Contributor, Author) commented Aug 26, 2020

Okay all, I'm done making changes. Ready for review again.

@lindholc (Member) left a comment

I've left a small suggestion. Make sure to squash or something before merging.

Comment on lines 47 to 50
findAllVariables(variableName) match {
case Seq() => None
case vs: Seq[DataType] => Some(vs.head)
}
Member

I think this ought to work:

findAllVariables(variableName).headOption

Contributor Author

Neato!
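For the record, headOption's behavior on empty vs non-empty results (a standalone sketch with a hypothetical firstMatch helper, not PR code):

```scala
// headOption returns Some(first element) or None, replacing the explicit
// pattern match without risking a non-exhaustive-match warning.
def firstMatch(names: Seq[String], target: String): Option[String] =
  names.filter(_ == target).headOption

println(firstMatch(Seq("time", "flux"), "time"))  // Some(time)
println(firstMatch(Seq("time", "flux"), "year"))  // None
```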

@sapols sapols merged commit e44cfcb into master Aug 28, 2020
@sapols sapols deleted the time-tuples branch August 28, 2020 19:08
4 participants