[DataCatalog2.0]: Update KedroDataCatalog CLI logic and make it reusable #3312
Could we have something like ...
This Viz issue is related: kedro-org/kedro-viz#1480

That would be perfect! We would need such a thing.
@MarcelBeining Can you explain a bit more why you need this? I am thinking about this again because I am trying to build a plugin for Kedro, and this would come in handy to compile a static version of the configuration.
@noklam We try to find Kedro datasets for which we have not written a data test, hence we iterate over the datasets returned by catalog.list().
@MarcelBeining Did I understand this question correctly: you want to find which datasets are not written in catalog.yml, including dataset factory resolutions? Does kedro catalog resolve give you what you need?
@noklam "Find which datasets is not written in catalog.yml including dataset factory resolves, yet" , yes kedro catalog resolve shows what I need, but it is a CLI command and I need it within Python (of course one could use os.system etc, but a simple extension of catalog.list() should not be that hard) |
@MarcelBeining Are you integrating this with some extra functionality? How do you consume this information, if that is OK to share?
Adding on from our discussion on Slack: I'd also like that information easily consumable in a notebook (for example). So if my catalog stores models like:

"{experiment}.model":
  type: pickle.PickleDataset
  filepath: data/06_models/{experiment}/model.pickle
  versioned: true

I would want to be able to (somehow) do something like:

models = {}
for model_dataset in [d for d in catalog.list(*~*magic*~*) if ".model" in d]:
    models[model_dataset] = catalog.load(model_dataset)

It's a small thing, but I was kind of surprised not to see my resolved model datasets when listing the catalog.
Another related request, bumped to this issue.
What if we supported iteration directly?

for datasets in data_catalog:
    ...
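A minimal sketch of one way to support that, assuming a hypothetical mixin on top of the existing list() method (illustrative, not current Kedro code):

class IterableCatalogMixin:
    # Hypothetical: make the catalog usable in a for-loop by
    # delegating iteration to the existing list() method.
    def __iter__(self):
        yield from self.list()

# Usage sketch, once the catalog class mixes this in:
# for dataset_name in catalog:
#     print(dataset_name)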
I think it's neat @noklam, but I don't know if it's discoverable. To me ...
why_not_both.gif
I've also wanted to be able to iterate through the datasets for a while, but it raises some unanswered questions. We always face the same issue: we would need to "resolve" the dataset factories first, relative to a pipeline; it would eventually give: ... The real advantage of doing so is that we do not need to create a search method with all types of supported search (by extension, by regex... as suggested in the corresponding issue), because it's easily customisable, so it's less maintenance burden in the end.
Catalog.list already supports regex; isn't that identical to what you suggest as catalog.search?
@noklam you can only search by name; namespaces aren't really supported, and you can't search by attribute.
A namespace is just a prefix string, so it works pretty well. I do believe there are benefits to improving it, but I think we should at least add an example for the existing feature, since @Galileo-Galilei told me he was not aware of it and most likely very few people are.
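For illustration, name-based filtering with the existing API looks roughly like this in a kedro ipython session (regex_search is the parameter on DataCatalog.list(); the patterns and the "data_processing" namespace below are made up):

model_datasets = catalog.list(regex_search=r"\.model$")        # names ending in ".model"
namespaced = catalog.list(regex_search=r"^data_processing\.")  # a namespace is just a name prefix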
Inconsistency between CLI commands and the interactive workflow is a problem related to all catalog commands. Thus, we suggest refactoring the CLI logic and moving it to the session level. We also think that we should not couple catalog and pipelines, so we do not consider extending the catalog itself.
So this came up yesterday. My colleague said "I don't think the catalog is working". It took 10 minutes of playing with the catalog before we worked out what was going on. What may have helped? ... What did we do? We wrote something horrible like this:
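The original snippet isn't preserved in this thread; a hack of roughly this shape (a hypothetical reconstruction relying on a private API) would force factory patterns to materialise so they show up when listing the catalog:

# Hypothetical reconstruction, not the original snippet.
from kedro.framework.project import pipelines

for name in pipelines["__default__"].datasets():
    try:
        catalog._get_dataset(name)  # private API: triggers factory-pattern resolution
    except Exception:
        pass  # some datasets may fail to instantiate, e.g. missing credentials

print(sorted(catalog.list()))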
In summary, I think we don't have to over-engineer this: having expected patterns show up in the listing would solve most of the confusion.
Which, I realise, is exactly what I pitched 18 months ago 😂
@datajoely The idea of listing datasets and patterns together looks interesting. But doing so could lead to confusion, as people may not differentiate datasets and patterns. So providing an interface to access both separately, and communicating this better, seems more reasonable to me. There is also a high chance that ... I see the pain point about the discrepancy between ...

Long story short, we suggest: ...

These 5 points should address the points you mentioned as well as others we discussed above.
Okay, we're actually designing for different things. Perhaps we could print out some of these recommendations when the ipython extension is loaded, because even experienced Kedro users aren't going to know how to retrieve patterns etc.
@datajoely, yep, now it's a good time to point out possible improvements. Can you please elaborate on what you mean by ...?
I think a JSON structure something like this would be more useful from a machine-interpretability point of view:

[
  {"dataset_name": "...", "dataset_type": "...", "pipelines": ["...", "..."]},
]

There is more that is possibly useful, such as the classpath for custom datasets, which YAML file it was found in, etc.
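As a sketch, hypothetical glue code assembling that JSON from the current APIs could look like this (note _get_dataset is private, and the field names simply mirror the example above):

import json
from kedro.framework.project import pipelines

rows = []
for name in catalog.list():
    rows.append({
        "dataset_name": name,
        "dataset_type": type(catalog._get_dataset(name)).__name__,
        "pipelines": [p for p, pl in pipelines.items() if name in pl.datasets()],
    })
print(json.dumps(rows, indent=2))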
1. Summary

The prototype of the updated catalog commands is here: #4480

Note 1: We cannot proceed with the aforementioned PR until #4481 and #4475 are resolved. So, for now, the goal is to get feedback and validate some implementation details and the functionality of the updated commands.

Note 2: To demonstrate changes in the interactive environment, I made the session and context stateful as if we had implemented #4481. Please do not pay attention to those changes now.

2. Implementation

2.1 Where should commands live

Context

The bare minimum needed to implement catalog commands is a loaded catalog and pipelines. Previously, we instantiated a session inside each command and accessed the catalog via the context; pipelines were also imported there.

kedro/kedro/framework/cli/catalog.py (line 59 in ed04197):

from kedro.framework.project import pipelines

def list_datasets(metadata: ProjectMetadata, pipeline: str, env: str) -> None:
    session = _create_session(metadata.package_name, env=env)
    context = session.load_context()
    catalog = context.catalog

Since the commands live in the CLI layer, this logic cannot be reused outside the CLI.

Suggestion

Thus, we suggest moving the logic for catalog commands to the session level, particularly to kedro/kedro/framework/session/catalog.py (line 77 in b5391f8).
For example, if we move the logic for listing catalog patterns to the session level, we will be able to update the rank command like this:

@catalog.command("rank")
@env_option
@click.pass_obj
def rank_catalog_factories(metadata: ProjectMetadata, env: str) -> None:
    """List all dataset factories in the catalog, ranked by priority by which they are matched."""
    session = _create_session(metadata.package_name, env=env)
    catalog_factories = session.list_catalog_patterns()
    if catalog_factories:
        click.echo(yaml.dump(catalog_factories))
    else:
        click.echo("There are no dataset factories in the catalog.")

And in the interactive environment we'll be able to do the following:

kedro ipython

In [1]: session.list_catalog_patterns()
Out[1]: ['{name}.{folder}#csv', '{name}_data', 'out-{dataset_name}', '{dataset_name}#csv', 'in-{dataset_name}']

Why not move catalog commands to the catalog level

We could implement commands at the catalog level, but that would lead to coupling catalog and pipelines, which is not desired.

2.2 Implementation at the session level

We suggest implementing the catalog commands logic as a mixin class in kedro/kedro/framework/session/catalog.py (line 15 in b5391f8).
That way we will be able to extend the session without modifying it, and any changes on the commands side will not affect the session itself (see kedro/kedro/framework/session/session.py, line 82 in b5391f8).
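A minimal sketch of the mixin layout described above (class and method names here are illustrative assumptions, not the final API):

class CatalogCommandsMixin:
    # Command logic lives in the mixin and only relies on objects
    # the session already knows how to load.
    def list_catalog_patterns(self) -> list[str]:
        catalog = self.load_context().catalog
        return catalog.config_resolver.list_patterns()

class KedroSession(CatalogCommandsMixin):
    ...  # existing session implementation, unchanged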
3. Functionality

Each command is implemented as a function that outputs serialisable objects, so the results can be saved in any format. These functions are planned to back the CLI commands too, where we can decide on the specific output format; for example, @datajoely suggested printing output in JSON format. Below you can find the changes made for each command.

3.1 kedro catalog rank

kedro/kedro/framework/session/catalog.py:

def list_catalog_patterns(self) -> list[str]:

The functionality of this command is left unchanged, but now we can do some interactive things:
kedro ipython
In [1]: session.list_catalog_patterns()
Out[1]: ['{name}.{folder}#csv', '{name}_data', 'out-{dataset_name}', '{dataset_name}#csv', 'in-{dataset_name}']
In [2]: runtime_pattern = {"{default}": {"type": "MemoryDataset"}}
In [3]: catalog.config_resolver.add_runtime_patterns(runtime_pattern)
In [4]: session.list_catalog_patterns()
Out[4]: ['{name}.{folder}#csv', '{name}_data', 'out-{dataset_name}', '{dataset_name}#csv', 'in-{dataset_name}', '{default}']
3.2 kedro catalog resolve
Current functionality
This command resolves catalog patterns against pipeline datasets.
kedro catalog resolve
X_test:
type: MemoryDataset
companies#csv:
filepath: data/01_raw/companies.csv
type: pandas.CSVDataset
shuttle_id_dataset:
credentials: db_credentials
execution_options:
stream_results: true
load_args:
chunksize: 1000
sql: select shuttle, shuttle_id from spaceflights.shuttles;
type: pandas.SQLQueryDataset
Updated functionality

kedro/kedro/framework/session/catalog.py (line 83 in b5391f8):

def resolve_catalog_patterns(self, include_default: bool = False) -> dict[str, Any]:

Changes:
- Output the full dataset configuration for catalog datasets
- Added an option to include default datasets (by default they're excluded, as before): session.resolve_catalog_patterns(include_default=True)
- Interactive workflow support
kedro ipython
In [1]: session.resolve_catalog_patterns()
Out[1]:
{
'X_test': {'type': 'kedro.io.memory_dataset.MemoryDataset', 'copy_mode': None},
'companies#csv': {
'type': 'pandas.CSVDataset',
'filepath': '/Projects/kedro-tests/default/data/01_raw/companies.csv'
},
'shuttle_id_dataset': {
'type': 'kedro_datasets.pandas.sql_dataset.SQLQueryDataset',
'sql': 'select shuttle, shuttle_id from spaceflights.shuttles;',
'credentials': 'shuttle_id_dataset_credentials',
'execution_options': {'stream_results': True},
'load_args': {'chunksize': 1000},
'fs_args': None,
'filepath': None
},
}
In [2]: catalog["new_dataset"] = 123
In [3]: session.resolve_catalog_patterns()
Out[3]:
{
'X_test': {'type': 'kedro.io.memory_dataset.MemoryDataset', 'copy_mode': None},
'companies#csv': {
'type': 'pandas.CSVDataset',
'filepath': '/Projects/kedro-tests/default/data/01_raw/companies.csv'
},
'shuttle_id_dataset': {
'type': 'kedro_datasets.pandas.sql_dataset.SQLQueryDataset',
'sql': 'select shuttle, shuttle_id from spaceflights.shuttles;',
'credentials': 'shuttle_id_dataset_credentials',
'execution_options': {'stream_results': True},
'load_args': {'chunksize': 1000},
'fs_args': None,
'filepath': None
},
'new_dataset': {'type': 'kedro.io.memory_dataset.MemoryDataset', 'copy_mode': None},
}
3.3 kedro catalog list
Current functionality
This command shows datasets per type.
kedro catalog list -p data_processing
Datasets in 'data_processing' pipeline:
Datasets generated from factories:
pandas.CSVDataset:
- reviews.01_raw#csv
- companies#csv
Datasets mentioned in pipeline:
DefaultDataset:
- preprocessed_companies
ExcelDataset:
- shuttles
ParquetDataset:
- preprocessed_shuttles
- model_input_table
Datasets not mentioned in pipeline:
MemoryDataset:
- X_test
PickleDataset:
- regressor
SQLQueryDataset:
- shuttle_id_dataset
Updated functionality

kedro/kedro/framework/session/catalog.py (line 22 in b5391f8):

def list_catalog_datasets(self, pipelines: list[str] | None = None) -> dict:

Changes:
- Output fully qualified dataset types
- For each pipeline, output 3 categories:
  - datasets: pipeline datasets configured in the catalog
  - factories: pipeline datasets resolved using catalog patterns
  - defaults: pipeline datasets matching the catalog default pattern
- Removed the "Datasets not mentioned in pipeline" section
- Interactive workflow support
kedro ipython
In [1]: session.list_catalog_datasets(pipelines=["data_processing"])
Out[1]:
{
'data_processing': {
'datasets': {
'kedro_datasets.pandas.parquet_dataset.ParquetDataset': ['model_input_table', 'preprocessed_shuttles'],
'kedro_datasets.pandas.excel_dataset.ExcelDataset': ['shuttles']
},
'factories': {'kedro_datasets.pandas.csv_dataset.CSVDataset': ['reviews.01_raw#csv', 'companies#csv']},
'defaults': {'kedro.io.memory_dataset.MemoryDataset': ['preprocessed_companies']}
}
}
In [2]: catalog["companies#csv"]
Out[2]: kedro_datasets.pandas.csv_dataset.CSVDataset(filepath=PurePosixPath('/Projects/kedro-tests/default/data/01_raw/companies.csv'), protocol='file', load_args={}, save_args={'index': False})
In [3]: session.list_catalog_datasets(pipelines=["data_processing"])
Out[3]:
{
'data_processing': {
'datasets': {
'kedro_datasets.pandas.parquet_dataset.ParquetDataset': ['model_input_table', 'preprocessed_shuttles'],
'kedro_datasets.pandas.excel_dataset.ExcelDataset': ['shuttles'],
'kedro_datasets.pandas.csv_dataset.CSVDataset': ['companies#csv']
},
'factories': {'kedro_datasets.pandas.csv_dataset.CSVDataset': ['reviews.01_raw#csv']},
'defaults': {'kedro.io.memory_dataset.MemoryDataset': ['preprocessed_companies']}
}
}
3.4 kedro catalog create

This command creates a YAML catalog configuration with the missing datasets, i.e. it saves the MemoryDatasets not mentioned in the catalog to a new YAML file.

I haven't implemented an updated version of this command because it would now just save the defaults from the updated list command to a new YAML file, so it's not clear whether it is still useful.
Thanks for the long writeup @ElenaKhaustova! Left an idea about the interactive functionality in #4481 (comment). I understand it's a thorny issue; hope we can unblock it. About your "current vs updated functionality", just to clarify: is the idea to keep the ...?
Description
Parent issue: #4472
Suggested plan: #3312 (comment)
Context
Background: https://linen-slack.kedro.org/t/16064885/when-i-say-catalog-list-in-a-kedro-jupter-lab-instance-it-do#ad3bb4aa-f6f9-44c6-bb84-b25163bfe85c
With dataset factories, the definition of a dataset is not known until the pipeline is run. When a user is in a Jupyter notebook, they expect to see the full list of datasets with catalog.list. The current workaround to see the datasets for the __default__ pipeline looks like this: ...

When using the CLI commands, e.g. kedro catalog list, we do matching to figure out which factory patterns mentioned in the catalog match the datasets used in the pipeline, but when going through the interactive flow no such matching has been done yet.
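For context, doing that matching by hand in a notebook looks roughly like this (a sketch assuming the KedroDataCatalog config_resolver API):

from kedro.framework.project import pipelines

explicit = set(catalog.list())
for name in sorted(pipelines["__default__"].datasets()):
    if name not in explicit:
        pattern = catalog.config_resolver.match_pattern(name)
        print(f"{name}: {'pattern ' + pattern if pattern else 'default MemoryDataset'}")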
Possible Implementation

We could check dataset existence when the session is created; we need to verify whether that has any unexpected side effects.
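A sketch of that idea using the existing after_catalog_created hook (hypothetical plugin code; the warning logic and names are assumptions):

from kedro.framework.hooks import hook_impl

class CatalogCoverageHooks:
    @hook_impl
    def after_catalog_created(self, catalog):
        from kedro.framework.project import pipelines

        explicit = set(catalog.list())
        for name in pipelines["__default__"].datasets():
            if name not in explicit and not catalog.config_resolver.match_pattern(name):
                print(f"Warning: '{name}' is not in the catalog and matches no pattern")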
The scope of this ticket is still open and we don't have a specific implementation in mind. The person who picks it up can evaluate different approaches, considering side effects and avoiding coupling with other components.
Possible Alternatives

- catalog.list(pipeline=<name>): not a good solution, because the catalog wouldn't have access to a pipeline.
- Doing the matching the way kedro catalog list does, at the point it is called.