Flattening Tuples, finding Variables by name, and converting Time Tuples #115
Conversation
I added some thoughts on a few things. I'm saving the flatten method and the tests for another round.
* Find all Variables within this Variable by the given name.
* TODO: support aliases
*/
def findAllVariables(variableName: String): Seq[DataType] = {
I'm a bit uneasy about this approach from v2 since it implies that there can be multiple variables with the same name. Maybe this is OK in the context of name (as opposed to id) given aliases. Though, ambiguous aliases smell like a potential problem. The idea is that a user could use an alias in a selection. Should it apply to all variables with that alias? Probably not.

This is relevant in v2 where any variable of type Time gets the alias time. In that case we apply the selection to the first (presumably the primary) time variable. The idea here is that we can use standard names for standard variables like time, wavelength, irradiance....

We also add the original variable name to the alias list when it is renamed.

These uses of alias are convenient, except when they aren't. Mathematical rigor would suggest that we never have this ambiguity problem, but aliases are convenient. Hmm. Maybe we'd better punt on aliases for now and hide this from the public API.
I believe I have addressed all the recent concerns we've been discussing in our standups. So please, review away. I'm done making changes unless any get suggested.
@@ -73,34 +102,46 @@ sealed trait DataType extends MetadataLike with Serializable {
* Return this DataType with all nested Tuples flattened to a single Tuple.
* A Scalar will remain a Scalar.
* This form is consistent with Samples which don't preserve nested Functions.
* Flattened Tuples retain the ID of the outermost Tuple.
*/
def flatten: DataType = {
I have a question that's more for @dlindhol and @lindholc. flatten on List will only flatten one layer deep, so something like List[List[List[Int]]] will flatten to List[List[Int]]. It looks like this implementation on DataType will flatten until there are no nested tuples. Do we want to be consistent with List? I'm worried the name is misleading.
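For reference, a quick illustration of the standard library behavior being described here (plain Scala, not from this PR):

val nested: List[List[List[Int]]] = List(List(List(1, 2), List(3)))
nested.flatten          // List(List(1, 2), List(3)) -- only one layer of nesting removed
nested.flatten.flatten  // List(1, 2, 3) -- a second pass is needed to fully flatten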
Point well taken, but I think the point is to eliminate nested tuples because (1) they cause problems and (2) they aren't preserved on the Data side anyway. So that's definitely what the method should do; maybe the question is really whether "flatten" is an appropriate name.
I agree flattening everything is more useful and it's what we want here. I wouldn't dare suggest a new name for it because that's hard to do.
Eew. I didn't realize that List.flatten didn't recurse. How about something like flattenAll? Surely there is some precedent out there.
case Function(d, r) => acc :+ Function(d.flatten, r.flatten)
case s: Scalar if tupIds.isEmpty => acc :+ s
//prepend Tuple ID(s) with dot(s) and drop leading dot(s)
case s: Scalar => acc :+ s.rename(s"$tupIds.${s.id}".replaceFirst("^\\.+", ""))
If you're dropping the leading dots, why even distinguish this case from if tupIds.isEmpty?
Oh, thank you for noticing this! At one point I wasn't dropping leading dots, so these two cases were distinct. When I started dropping them I just didn't notice that they became indistinguishable. It always bothered me that I had to move the arrows => so far to the right to be in line with this case. Removing it lets me scooch those arrows to the left, and that makes me happy.
//prepend Tuple ID(s) with dot(s) and drop leading dot(s)
case s: Scalar => acc :+ s.rename(s"$tupIds.${s.id}".replaceFirst("^\\.+", ""))
case Function(d, r) => acc :+ Function(d.flatten, r.flatten)
case t @ Tuple(es @ _*) => if (tupIds.isEmpty && !t.id.isEmpty) tupIds = t.id else tupIds += s".${t.id}"
After the else it looks like it's assumed both tupIds and t.id are not empty, but this part of the code would be reached if:
- tupIds is not empty and t.id is empty, resulting in tupIds having a trailing dot.
- both are empty, resulting in tupIds being just a dot.
Hey, someone finally noticed this. 🤓 Thank you for getting your head into this, Ryan. At the moment, that's actually intentional. It's also why I drop leading dots in case s: Scalar.

The reason for this is actually to make sure that flattened Tuples retain the ID of the outermost Tuple. Look at the last line in this method:

else Tuple(Metadata(tupIds.split('.').head), types)

That's the line that grabs the ID of the outermost Tuple. The problem I had was: if I don't allow the case t @ Tuple(es @ _*) case to add just dots, that line actually grabs the ID of the first named Tuple, which isn't the same thing.

To see this, imagine an anonymous outer Tuple with a nested Tuple named "tup1". tupIds will be ".tup1". ".tup1".split('.').head is "", which is correct. If I didn't allow the leading dot for the anonymous Tuple, tupIds would instead be "tup1". "tup1".split('.').head is "tup1", which is incorrect.

So effectively, the leading dot keeps track of the fact that an anonymous Tuple was there. This was the cleanest solution I could think of. I considered using a boolean flag to keep track of whether the outermost Tuple was anonymous, but boolean flags are like the least functional thing ever, so I didn't go there.
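For reference, the split behavior being relied on here (plain Scala, easy to check in a REPL):

".tup1".split('.').head  // "" -- the anonymous outer Tuple is remembered as an empty ID
"tup1".split('.').head   // "tup1" -- without the leading dot, the first named Tuple wins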
It all makes sense now! There's more going on here than I knew. Thanks for explaining that. Is there a way to condense that into a one-line comment so future developers (let's be real, I mean future us) know what this is doing? Maybe "anonymous tuples are represented with empty strings here".
Good call. I came up with this comment:
//build up a dot-separated String of Tuple IDs, including empty IDs that stand in for anonymous Tuples
case t @ Tuple(es @ _*) => if (tupIds.isEmpty && !t.id.isEmpty) tupIds = t.id else tupIds += s".${t.id}"
//extract text values and join with space
//TODO: join with delimiter, problem when we use regex?
timePos match {
  //TODO: reduce code duplication? Only difference is dd vs rd (types don't match) and replacing domain vs range
You could define a function

def time(pos: Int, data: Seq[Data]): Datum = {
  val timeData = data.slice(pos, pos + timeLen)
  Data.StringValue(
    timeData.map { case d: Datum => d.asString }.mkString(" ")
  )
}

I think that would allow you to pass dd or rd and still get time out. Then your case statement would be

case DomainPosition(n) =>
  val domain = dd.slice(0, n) ++ Seq(time(n, dd)) ++ dd.slice(n + timeLen, dd.length)
  Sample(domain, rd)
case RangePosition(n) =>
  val range = rd.slice(0, n) ++ Seq(time(n, rd)) ++ rd.slice(n + timeLen, rd.length)
  Sample(dd, range)
Co-authored-by: Ryan Heins <ryan.heins@lasp.colorado.edu>
@@ -73,34 +100,45 @@ sealed trait DataType extends MetadataLike with Serializable {
* Return this DataType with all nested Tuples flattened to a single Tuple.
* A Scalar will remain a Scalar.
* This form is consistent with Samples which don't preserve nested Functions.
I'm just noticing this comment. I know this line isn't part of the PR, but flatten does preserve nested Functions.
This should say "... Samples which don't preserve nested Tuples"
Looks like you got to be a guinea pig ( 🇬🇳 🐷 ) for bigger issues again. Some of these comments simply seemed like a good place to bring up style questions that we should agree on. Others certainly suggest an alternate solution, but I'll leave it to your (and the other reviewers') discretion whether they need to be addressed as part of this PR.
def map(f: DataType => DataType): DataType = this match {
  case s: Scalar => f(s)
  case Tuple(es @ _*) => Tuple(es.map(f))
  case Function(d, r) => Function(d.map(f), r.map(f))
  case t @ Tuple(es @ _*) => f(Tuple(t.metadata, es.map(f)))
Note that we can always include metadata munging as part of the function f. This simply carries the original metadata through so it can do so. That makes me feel better about DataType.map.
Good! I was pretty convinced I nailed the logic here. It's pretty simple actually.
  }
}

/**
 * Return the path within this DataType to a given variable ID
 * as a sequence of SamplePositions.
 * Note that length of the path reflects the number of nested Functions.
 * When searching a Tuple's ID, the path to the first Scalar in the Tuple is returned.
 */
def getPath(id: String): Option[SamplePath] = {
I think this is where the idea of a ConstantFunction can come to the rescue. Scalar or Tuple can be thought of as Functions of arity zero. A Sample of such would have an empty domain with the data in the range. So, a SamplePosition for a Scalar or Tuple would necessarily be a RangePosition.
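For illustration only, a rough sketch of that idea using the names that appear in this PR (Sample, Data.StringValue, RangePosition); this is a conceptual sketch, not existing code, and the exact constructors are assumptions:

// a Scalar or Tuple viewed as an arity-zero Function: its Sample has an empty
// domain, so any path into it necessarily ends at a RangePosition
val constantSample = Sample(Seq.empty, Seq(Data.StringValue("42")))
val path = List(RangePosition(0))  // hypothetical SamplePath for that value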
if (id == dt.id) Some(currentPath) //found it //TODO: use hasName to cover aliases?
//TODO: use hasName to cover aliases?
//searching fully qualified ID with namespace
if (id.contains('.') && dt.id == id) Some(currentPath) //found it
What did we decide about dots in Identifiers? I recall that they are not allowed in an Identifier but that qualified identifiers are another matter.

I also recall cringing at the thought that we need tuples of arity one to represent, for example, B.x, where the tuple provides the B namespace if we can't define the scalar identifier to be B.x.
The conclusion for now: Stick with string identifiers with dots. We'll make a new ticket to rely exclusively on the model structure to preserve namespace and use Identifier here and elsewhere without support for dots (".") (#130). flatten can thus be an encapsulated implementation detail as needed to achieve some result.

I don't think there is a use case for users to flatten their datasets since it doesn't alter the data. The concept only seems useful for us to map a variable to a sample element (as here) or to capture the structural namespace in a flattened view of the metadata (e.g. csv column headings). It doesn't seem worthy of the public API.

Presumably, the getPath use case could be done without needing to reconstruct a flattened representation of the model. Something logically equivalent to zipping the hierarchical structure of the model with the flattened structure of a Sample, but without the baggage of actually doing so. (#129)
val time: Scalar = timeTuple match {
  case Tuple(es @ _*) =>
    //build up format string
Style question: Should this kind of comment be elevated a bit with a space and a capital letter? Maybe include the s to be consistent with method comments? Something like:

// Builds ...
Eh, personally, I like it better lowercase and without a space. Lowercase and no space is what we normally do for comments, and I don't see this comment as anything special in need of "elevating." I'd be cool with adding s to make it present tense, though. But even that doesn't seem necessary because normal comments don't make it into the Scaladoc. As I recall, the only reason we add s to Scaladoc comments is so that they read consistently for users who are browsing the Scaladoc.
//TODO: join with delimiter, problem when we use regex?
timePos match {
  case DomainPosition(n) =>
    val domain = dd.slice(0, n) ++ Seq(time(n, timeLen, dd)) ++ dd.slice(n+timeLen, dd.length)
Factoring out the time function was a good idea. Could we take it a step further and move the slicing logic there, too?
Good call. No reason that wouldn't work.
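A rough sketch of what that might look like, reusing the names from the earlier suggestion (timeLen, Data, Datum, Sample, and the position classes are taken from the surrounding code; the helper name is hypothetical, not part of the PR):

// hypothetical helper: build the joined time Datum and splice it into the value sequence
def spliceTime(pos: Int, timeLen: Int, data: Seq[Data]): Seq[Data] = {
  val timeDatum = Data.StringValue(
    data.slice(pos, pos + timeLen).map { case d: Datum => d.asString }.mkString(" ")
  )
  data.slice(0, pos) ++ Seq(timeDatum) ++ data.slice(pos + timeLen, data.length)
}

// the match would then collapse to something like:
// case DomainPosition(n) => Sample(spliceTime(n, timeLen, dd), rd)
// case RangePosition(n)  => Sample(dd, spliceTime(n, timeLen, rd))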
class GetPathSuite extends FunSuite {
  test("getPath to Tuple in domain") {
    val func = {
Could you use ModelParser.parse("(a, b, c) -> d")? Maybe we should add DataType.fromExpression?
No, for a couple reasons. (1) That code fails: Left(latis.util.LatisException: Failed to parse model expression: (a, b, c) -> d. Failure reading:'_'). (2) The name of the Tuple is actually important here, and ModelParser would produce an anonymous Tuple.
import latis.metadata.Metadata
import org.scalatest.FunSuite

class GetPathSuite extends FunSuite {
Oh, if only there were a way to automatically generate test cases. It's hard to know when the test coverage is sufficient. Property-based testing seems so tempting if we could just figure out a lawful property to preserve. Or an alternate approach to generate the right answer?
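For what it's worth, a hedged sketch of what a ScalaCheck property might look like, assuming the Scalar/Tuple/Function constructors used elsewhere in this PR (the generators and names here are made up, not part of the change):

import org.scalacheck.{Gen, Prop}

// generate small, possibly nested DataTypes with arbitrary identifiers
val genScalar: Gen[DataType] =
  Gen.identifier.map(id => (Scalar(Metadata(id) + ("type" -> "int")): DataType))

def genDataType(depth: Int): Gen[DataType] =
  if (depth <= 0) genScalar
  else Gen.oneOf(
    genScalar,
    for (a <- genDataType(depth - 1); b <- genDataType(depth - 1)) yield (Tuple(Metadata("tup"), a, b): DataType),
    for (d <- genScalar; r <- genDataType(depth - 1)) yield (Function(d, r): DataType)
  )

// one lawful-ish property: after flatten, no Tuple appears directly inside another Tuple
def noNestedTuples(dt: DataType): Boolean = dt match {
  case _: Scalar      => true
  case Tuple(es @ _*) => es.forall { case _: Tuple => false; case e => noNestedTuples(e) }
  case Function(d, r) => noNestedTuples(d) && noNestedTuples(r)
}

val flattenProp = Prop.forAll(genDataType(3))(dt => noNestedTuples(dt.flatten))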
val metadata: Metadata = Metadata("MockDataset") + ("title" -> "Mock Dataset")
val model: DataType = Function(
  Tuple(Metadata("time"),
    Scalar(Metadata("year") + ("type" -> "int") + ("units" -> "yyyy")),
I would expect that these types would need to be "text" given units of this form. Not that it matters here in the test.
Good catch. Don't wanna be inconsistent even when it doesn't matter. Fixed.
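For concreteness, the corrected line would presumably read:

Scalar(Metadata("year") + ("type" -> "text") + ("units" -> "yyyy")),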
  ),
  Scalar(Metadata("flux") + ("type" -> "int"))
)
val data: MemoizedFunction = SeqFunction(
Could we use a "dataset builder" to make this for us more easily?
Note that this is a 3D dataset with a 1D manifold (i.e. a line in 3D "space") as opposed to a Cartesian dataset.
I mean in theory, anything's possible. But that'd be so much work that I'm gonna say no to doing it for this PR. The metadata is actually the important stuff here, not the data values really. So we'd need a "dataset builder" that could not only produce a properly-structured dataset but also one with specified metadata everywhere.
Comment that's mostly for myself to read in the morning: Refactoring I still need to do in response to Doug's comments:
I believe I've addressed everything else.
Okay all, I'm done making changes. Ready for review again.
I've left a small suggestion. Make sure to squash or something before merging.
findAllVariables(variableName) match {
  case Seq() => None
  case vs: Seq[DataType] => Some(vs.head)
}
I think this ought to work:
findAllVariables(variableName).headOption
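(A quick plain-Scala illustration, not from this PR, of why headOption covers both cases of the match above:)

Seq.empty[Int].headOption  // None    -- the Seq() => None case
Seq(1, 2, 3).headOption    // Some(1) -- the Some(vs.head) case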
Neato!
I'm opening a PR early to start getting eyes on this code I've been working on. It's all pretty rough at the moment.
Changes by file:
- dataType.scala
  - findAllVariables from v2
  - flatten (this was tricky and my solution is smelly)
  - getPath … that point out its inability to find paths to Tuples
- dataTypeSuite.scala
- TimeTupleToTime.scala … getPath is figured out
- dailyFlux.txt / dailyFlux.fdml
Open questions:
- I forgot what naming conventions we decided on when using FunSuite. Is it a problem that the file name is dataTypeSuite.scala but the class is TupleFlattenSuite? Would it not be appropriate to add other classes to this file that test other dataType things?
- Is my solution to flattening Tuples too smelly? I struggled with that problem for a long time and this finally does everything it should, but the code isn't pretty.