
Flattening Tuples, finding Variables by name, and converting Time Tuples #115

Merged: 72 commits into master from time-tuples, Aug 28, 2020

Conversation

@sapols (Contributor) commented Jul 28, 2020

I'm opening a PR early to start getting eyes on this code I've been working on. It's all pretty rough at the moment.

Changes by file:

dataType.scala

  • Ported findAllVariables from v2
  • Handled Tuple IDs in flatten (this was tricky and my solution is smelly)
  • Added TODOs in getPath that point out its inability to find paths to Tuples

dataTypeSuite.scala

  • Added a variety of unit tests to test edge cases of flattening Tuples (I did test-driven development!)

TimeTupleToTime.scala

  • Started this Operation (development is largely stalled until getPath is figured out)

dailyFlux.txt / dailyFlux.fdml

  • A toy dataset that splits time variables into a Tuple

Open questions:

  • I forgot what naming conventions we decided on when using FunSuite. Is it a problem that the file name is dataTypeSuite.scala but the class is TupleFlattenSuite? Would it not be appropriate to add other classes to this file that test other dataType things?

  • Is my solution to flattening Tuples too smelly? I struggled with that problem for a long time and this finally does everything it should, but the code isn't pretty.

@dlindhol (Member) left a comment

I added some thoughts on a few things. I'm saving the flatten method and the tests for another round.

* Find all Variables within this Variable by the given name.
* TODO: support aliases
*/
def findAllVariables(variableName: String): Seq[DataType] = {
Member

I'm a bit uneasy about this approach from v2 since it implies that there can be multiple variables with the same name. Maybe this is OK in the context of name (as opposed to id) given aliases. Though, ambiguous aliases smell like a potential problem. The idea is that a user could use an alias in a selection. Should it apply to all variables with that alias? Probably not.

This is relevant in v2 where any variable of type Time gets the alias time. In that case we apply the selection to the first (presumably the primary) time variable. The idea here is that we can use standard names for standard variables like time, wavelength, irradiance....

We also add the original variable name to the alias list when it is renamed.

These uses of alias are convenient, except when they aren't. Mathematical rigor would suggest that we never have this ambiguity problem, but aliases are convenient. Hmm. Maybe we better punt on aliases for now and hide this from the public API.

@sapols (Contributor, Author) commented Aug 19, 2020

I believe I have addressed all the recent concerns we've been discussing in our standups. So please, review away. I'm done making changes unless any get suggested.

@@ -73,34 +102,46 @@ sealed trait DataType extends MetadataLike with Serializable {
* Return this DataType with all nested Tuples flattened to a single Tuple.
* A Scalar will remain a Scalar.
* This form is consistent with Samples which don't preserve nested Functions.
* Flattened Tuples retain the ID of the outermost Tuple.
*/
def flatten: DataType = {
Contributor

I have a question that's more for @dlindhol and @lindholc . flatten on List will only flatten one layer deep, so something like List[List[List[Int]]] will flatten to List[List[Int]]. It looks like this implementation on DataType will flatten to have no nested tuples. Do we want to be consistent with List? I'm worried the name is misleading.

Contributor Author

Point well taken, but I think the point is to eliminate nested tuples because (1) they cause problems and (2) they aren't preserved on the Data side anyway. So that's definitely what the method should do; maybe the question is more if "flatten" is an appropriate name or not.

Contributor

I agree flattening everything is more useful and it's what we want here. I wouldn't dare suggest a new name for it because that's hard to do.

Member

Eew. I didn't realize that List.flatten didn't recurse. How about something like flattenAll? Surely there is some precedent out there.
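For anyone verifying the one-level behavior discussed here, a standalone Scala sketch (not code from this PR):

```scala
// List.flatten removes exactly one level of nesting per call,
// unlike the proposed DataType.flatten, which removes them all.
val nested: List[List[List[Int]]] = List(List(List(1, 2), List(3)))

val once: List[List[Int]] = nested.flatten  // one layer peeled off
val twice: List[Int]      = once.flatten    // fully flattened

println(once)   // List(List(1, 2), List(3))
println(twice)  // List(1, 2, 3)
```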

case Function(d, r) => acc :+ Function(d.flatten, r.flatten)
case s: Scalar if tupIds.isEmpty => acc :+ s
//prepend Tuple ID(s) with dot(s) and drop leading dot(s)
case s: Scalar => acc :+ s.rename(s"$tupIds.${s.id}".replaceFirst("^\\.+", ""))
Contributor

If you're dropping the leading dots, why even distinguish this case from if tupIds.isEmpty?

Contributor Author

Oh thank you for noticing this! At one point I wasn't dropping leading dots, so these two cases were distinct. When I started dropping them I just didn't notice that they became indistinguishable. It always bothered me that I had to move the arrows => so far to the right to be in line with this case. Removing it lets me scooch those arrows to the left and that makes me happy.

//prepend Tuple ID(s) with dot(s) and drop leading dot(s)
case s: Scalar => acc :+ s.rename(s"$tupIds.${s.id}".replaceFirst("^\\.+", ""))
case Function(d, r) => acc :+ Function(d.flatten, r.flatten)
case t @ Tuple(es @ _*) => if (tupIds.isEmpty && !t.id.isEmpty) tupIds = t.id else tupIds += s".${t.id}"
@RyanJHeld (Contributor), Aug 21, 2020

After the else it looks like it's assumed both tupIds and t.id are not empty, but this part of the code would be reached if:

  1. tupIds is not empty and t.id is empty, resulting in tupIds having a trailing dot.
  2. both are empty, resulting in tupIds being just a dot.

@sapols (Contributor, Author), Aug 21, 2020

Hey someone finally noticed this. 🤓 Thank you for getting your head into this, Ryan. At the moment, that's actually intentional. It's also why I drop leading dots in case s: Scalar.

The reason for this is actually to make sure that flattened Tuples retain the ID of the outermost Tuple. Look at the last line in this method:

else Tuple(Metadata(tupIds.split('.').head), types)

That's the line that grabs the ID of the outermost Tuple. The problem I had was: if I don't allow the case t @ Tuple(es @ _*) case to add just dots, the line actually grabs the ID of the first named Tuple, which isn't the same thing.

To see this, imagine an anonymous outer Tuple with a nested Tuple named "tup1". tupIds will be ".tup1". ".tup1".split('.').head is "" which is correct. If I didn't allow the leading dot for the anonymous Tuple, tupIds would instead be "tup1". "tup1".split('.').head is tup1 which is incorrect.

So effectively, the leading dot keeps track of the fact that an anonymous Tuple was there. This was the cleanest solution I could think of. I considered using a boolean flag to keep track of whether the outermost Tuple was anonymous, but boolean flags are like the least functional thing ever so I didn't go there.

Contributor

It all makes sense now! There's more going on here than I knew. Thanks for explaining that. Is there a way to condense that into a one-line comment so future developers (let's be real, I mean future us) know what this is doing? Maybe "anonymous tuples are represented with empty strings here".

Contributor Author

Good call. I came up with this comment:

//build up a dot-separated String of Tuple IDs, including empty IDs that stand in for anonymous Tuples
case t @ Tuple(es @ _*) => if (tupIds.isEmpty && !t.id.isEmpty) tupIds = t.id else tupIds += s".${t.id}"
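The role of the empty-string placeholders can be checked in isolation (a standalone sketch, not code from this PR):

```scala
// A nested Tuple "tup1" inside an anonymous outer Tuple yields tupIds ".tup1".
// The leading dot is the empty ID standing in for the anonymous outer Tuple.
val withPlaceholder    = ".tup1"
val withoutPlaceholder = "tup1"

// split('.').head recovers the outermost Tuple's ID:
println(withPlaceholder.split('.').head.isEmpty)  // true: outer Tuple correctly anonymous
println(withoutPlaceholder.split('.').head)       // tup1: inner ID wrongly promoted to outer
```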

//extract text values and join with space
//TODO: join with delimiter, problem when we use regex?
timePos match {
//TODO: reduce code duplication? Only difference is dd vs rd (types don't match) and replacing domain vs range
Contributor

You could define a function

def time(pos: Int, data: Seq[Data]): Datum = {
  val timeData = data.slice(pos, pos+timeLen)
  Data.StringValue(
    timeData.map { case d: Datum => d.asString }.mkString(" ")
  )
}

I think that would allow you to pass dd or rd and still get time out. Then your case statement would be

case DomainPosition(n) =>
  val domain = dd.slice(0, n) ++ Seq(time(n, dd)) ++ dd.slice(n+timeLen, dd.length)
  Sample(domain, rd)
case RangePosition(n) =>
  val range = rd.slice(0, n) ++ Seq(time(n, rd)) ++ rd.slice(n+timeLen, rd.length)
  Sample(dd, range)
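The suggestion above can be tried out with plain Strings standing in for Data/Datum (a runnable approximation of the idea, not the actual LaTiS types):

```scala
val timeLen = 3

// Join timeLen adjacent elements starting at pos into one space-separated value.
def time(pos: Int, data: Seq[String]): String =
  data.slice(pos, pos + timeLen).mkString(" ")

// Splice the joined value back in place of the elements it replaced.
def splice(pos: Int, data: Seq[String]): Seq[String] =
  data.slice(0, pos) ++ Seq(time(pos, data)) ++ data.slice(pos + timeLen, data.length)

// ("2020", "08", "28") collapses into a single "2020 08 28" element:
println(splice(0, Seq("2020", "08", "28", "42.0")))  // List(2020 08 28, 42.0)
```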

@@ -73,34 +100,45 @@ sealed trait DataType extends MetadataLike with Serializable {
* Return this DataType with all nested Tuples flattened to a single Tuple.
* A Scalar will remain a Scalar.
* This form is consistent with Samples which don't preserve nested Functions.
Contributor

I'm just noticing this comment. I know this line isn't part of the PR, but flatten does preserve nested Functions.

Member

This should say "... Samples which don't preserve nested Tuples"

@dlindhol (Member) left a comment

Looks like you got to be a guinea pig ( 🇬🇳 🐷 ) for bigger issues again. Some of these comments simply seemed like a good place to bring up some style questions that we should agree on. Others certainly suggest an alternate solution, but I'll leave it to your (and other reviewers') discretion whether they need to be addressed as part of this PR.

def map(f: DataType => DataType): DataType = this match {
case s: Scalar => f(s)
case Tuple(es @ _*) => Tuple(es.map(f))
case Function(d, r) => Function(d.map(f), r.map(f))
case t @ Tuple(es @ _*) => f(Tuple(t.metadata, es.map(f)))
Member

Note that we can always include metadata munging as part of the function f. This simply carries the original metadata through so it can do so. That makes me feel better about DataType.map.

Contributor Author

Good! I was pretty convinced I nailed the logic here. It's pretty simple actually.
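The metadata-carrying behavior can be illustrated with a toy ADT (hypothetical types DT/Tup, much simplified from the real DataType):

```scala
sealed trait DT
case class Scalar(id: String) extends DT
case class Tup(metadata: String, es: Seq[DT]) extends DT

// A structure-preserving map: the Tuple's original metadata is passed
// through to f, so f may keep it or rewrite it.
def mapDT(dt: DT)(f: DT => DT): DT = dt match {
  case s: Scalar   => f(s)
  case Tup(md, es) => f(Tup(md, es.map(mapDT(_)(f))))
}

val renamed = mapDT(Tup("outer", Seq(Scalar("a")))) {
  case Scalar(id) => Scalar(id.toUpperCase)
  case other      => other  // leaves the Tuple (and its metadata) untouched
}
println(renamed)  // Tup(outer,List(Scalar(A)))
```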


}
}

/**
* Return the path within this DataType to a given variable ID
* as a sequence of SamplePositions.
* Note that length of the path reflects the number of nested Functions.
* When searching a Tuple's ID, the path to the first Scalar in the Tuple is returned.
*/
def getPath(id: String): Option[SamplePath] = {
Member

I think this is where the idea of a ConstantFunction can come to the rescue. A Scalar or Tuple can be thought of as a Function of arity zero. A Sample of such would have an empty domain with the data in the range. So, a SamplePosition for a Scalar or Tuple would necessarily be a RangePosition.

if (id == dt.id) Some(currentPath) //found it //TODO: use hasName to cover aliases?
//TODO: use hasName to cover aliases?
//searching fully qualified ID with namespace
if (id.contains('.') && dt.id == id) Some(currentPath) //found it
Member

What did we decide about dots in Identifiers? I recall that they are not allowed in an Identifier but that qualified identifiers are another matter.

I also recall cringing at the thought that we need tuples of arity one to represent, for example, B.x where the tuple provides the B namespace if we can't define the scalar identifier to be B.x.

Member

The conclusion for now: Stick with string identifiers with dots. We'll make a new ticket to rely exclusively on the model structure to preserve namespace and use Identifier here and elsewhere without support for dots (".") (#130). flatten can thus be an encapsulated implementation detail as needed to achieve some result.

I don't think there is a use case for users to flatten their datasets since it doesn't alter the data. The concept only seems useful for us to map a variable to a sample element (as here) or to capture the structural name space in a flattened view of the metadata (e.g. csv column headings). It doesn't seem worthy of the public API.

Presumably, the getPath use case could be done without needing to reconstruct a flattened representation of the model. Something logically equivalent to zipping the hierarchical structure of the model with the flattened structure of a Sample, but without the baggage of actually doing so. (#129)


val time: Scalar = timeTuple match {
case Tuple(es @ _*) =>
//build up format string
Member

Style question: Should this kind of comment be elevated a bit with a space and capital letter? Maybe include the s to be consistent with method comments?
// Builds ...

Contributor Author

Eh, personally, I like it better lowercase and without a space. Lowercase and no space is what we normally do for comments, and I don't see this comment as anything special in need of "elevating." I'd be cool with adding s to make it present tense though. But even that doesn't seem necessary because normal comments don't make it into the Scaladoc. As I recall, the only reason we add s to Scaladoc comments is so that they read consistently for users who are browsing the Scaladoc.

//TODO: join with delimiter, problem when we use regex?
timePos match {
case DomainPosition(n) =>
val domain = dd.slice(0, n) ++ Seq(time(n, timeLen, dd)) ++ dd.slice(n+timeLen, dd.length)
Member

Factoring out the time function was a good idea. Could we take it a step further and move the slicing logic there, too?

Contributor Author

Good call. No reason that wouldn't work.


class GetPathSuite extends FunSuite {
test("getPath to Tuple in domain") {
val func = {
Member

Could you use ModelParser.parse("(a, b, c) -> d")?

Maybe we should add DataType.fromExpression?

Contributor Author

No, for a couple of reasons: (1) that code fails: Left(latis.util.LatisException: Failed to parse model expression: (a, b, c) -> d. Failure reading:'_'), and (2) the name of the Tuple is actually important here, and ModelParser would produce an anonymous Tuple.

import latis.metadata.Metadata
import org.scalatest.FunSuite

class GetPathSuite extends FunSuite {
Member

Oh, if there were only a way to automatically generate test cases. It's hard to know when the test coverage is sufficient. Property based testing seems so tempting if we could just figure out a lawful property to preserve. Or an alternate approach to generate the right answer?

val metadata: Metadata = Metadata("MockDataset") + ("title" -> "Mock Dataset")
val model: DataType = Function(
Tuple(Metadata("time"),
Scalar(Metadata("year") + ("type" -> "int") + ("units" -> "yyyy")),
Member

I would expect that these types would need to be "text" given units of this form. Not that it matters here in the test.

Contributor Author

Good catch. Don't wanna be inconsistent even when it doesn't matter. Fixed.

),
Scalar(Metadata("flux") + ("type" -> "int"))
)
val data: MemoizedFunction = SeqFunction(
Member

Could we use a "dataset builder" to make this for us more easily?
Note that this is a 3D dataset with a 1D manifold (i.e. a line in 3D "space") as opposed to a Cartesian dataset.

Contributor Author

I mean in theory, anything's possible. But that'd be so much work that I'm gonna say no to doing it for this PR. The metadata is actually the important stuff here, not the data values really. So we'd need a "dataset builder" that could not only produce a properly-structured dataset but also one with specified metadata everywhere.

@sapols (Contributor, Author) commented Aug 25, 2020

Comment that's mostly for myself to read in the morning: Refactoring I still need to do in response to Doug's comments:

  • change findVariable to use findAllVariables's head then use findVariable in my code
  • move my mapFunction's slicing logic into the time helper function (then rename that function)

I believe I've addressed everything else.

@sapols (Contributor, Author) commented Aug 26, 2020

Okay all, I'm done making changes. Ready for review again.

@lindholc (Member) left a comment

I've left a small suggestion. Make sure to squash or something before merging.

Comment on lines 47 to 50
findAllVariables(variableName) match {
case Seq() => None
case vs: Seq[DataType] => Some(vs.head)
}
Member

I think this ought to work:

findAllVariables(variableName).headOption

Contributor Author

Neato!
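For the record, headOption's behavior on empty vs non-empty results (a standalone sketch with a hypothetical firstMatch helper, not PR code):

```scala
// headOption returns Some(first element) or None, replacing the explicit
// pattern match without risking a non-exhaustive-match warning.
def firstMatch(names: Seq[String], target: String): Option[String] =
  names.filter(_ == target).headOption

println(firstMatch(Seq("time", "flux"), "time"))  // Some(time)
println(firstMatch(Seq("time", "flux"), "year"))  // None
```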

@sapols sapols merged commit e44cfcb into master Aug 28, 2020
@sapols sapols deleted the time-tuples branch August 28, 2020 19:08
4 participants