Issue 99 #102

Merged 69 commits on Jan 11, 2023
69 commits
94ee1be
term program if closing early
HobnobMancer Nov 15, 2022
c258475
add opt to use seqs from FASTA file
HobnobMancer Nov 15, 2022
53d2506
shorten line lengths
HobnobMancer Nov 15, 2022
d53203c
update docs with new CLI args
HobnobMancer Nov 15, 2022
ec6cb54
add tutorial for using genbank seq cahce
HobnobMancer Nov 15, 2022
8c41446
cache seqs as retrieved. refactorise code
HobnobMancer Nov 15, 2022
cf3e395
compress the cache directory once done
HobnobMancer Nov 15, 2022
c504074
update version num
HobnobMancer Nov 15, 2022
3f84d95
update installation instructions
HobnobMancer Nov 15, 2022
d01a69f
update closing message unit tests
HobnobMancer Nov 15, 2022
b396a49
correct add to append for list
HobnobMancer Nov 15, 2022
b7c2f41
add logging messages
HobnobMancer Nov 15, 2022
6512e0e
IncompleteRead error capture when posting ids
HobnobMancer Nov 16, 2022
297e1c3
ignore blank links in the acc file
HobnobMancer Nov 16, 2022
908dcce
use re to retrieve ncbi accessions
HobnobMancer Nov 29, 2022
8e0bbd1
add unit tests and remove unused imports
HobnobMancer Nov 29, 2022
7546238
update unit tests
HobnobMancer Nov 29, 2022
8f28199
fix merge conflicts
HobnobMancer Jan 11, 2023
91422ea
update version number
HobnobMancer Nov 18, 2022
36c482b
refactor funcs
HobnobMancer Nov 21, 2022
9faaae0
define missing logger
HobnobMancer Nov 21, 2022
c503b5a
update parser
HobnobMancer Nov 21, 2022
f38925f
get protein acc for gene name
HobnobMancer Dec 8, 2022
50d559e
update dict keys to separte gene and protein names
HobnobMancer Dec 8, 2022
28c290a
retrieve gene names from uniprot to link gene names and genbank prote…
HobnobMancer Dec 8, 2022
f72ad97
add missing option to choices for the api
HobnobMancer Dec 8, 2022
055b947
add new args to parser
HobnobMancer Dec 8, 2022
3c10e02
update version number
HobnobMancer Dec 8, 2022
ca7be8b
update citation information
HobnobMancer Dec 8, 2022
8e234c5
fix unit tests for expand.unprot
HobnobMancer Dec 8, 2022
de5a920
fix unit tests for parsers
HobnobMancer Dec 8, 2022
aeb48d4
update requirements. use bioservices for the new uniprot api
HobnobMancer Dec 8, 2022
e3cbe5b
do not retry invalid gene names
HobnobMancer Dec 8, 2022
f80b6ac
update documentation
HobnobMancer Dec 8, 2022
3ec8477
update documentation
HobnobMancer Dec 8, 2022
1b36121
map gene names when couldn't be retrieved from uniprot
HobnobMancer Dec 8, 2022
8174df8
use the uniprot mapping service
HobnobMancer Dec 9, 2022
ec0e0ed
correct typo in variable name genebank to genbank
HobnobMancer Dec 9, 2022
8bfac94
mock getting gene names
HobnobMancer Dec 9, 2022
43994ec
mockmapping accessions
HobnobMancer Dec 9, 2022
6befcfd
correct key names to get genbank acc from uniprot dict
HobnobMancer Dec 9, 2022
9d4a721
check the retrieved gbk acc is in the local db before adding protein …
HobnobMancer Dec 9, 2022
d3fa24d
log addition of data to the local db
HobnobMancer Dec 9, 2022
e64489e
check the retrieved gbk acc is in the local db before adding protein …
HobnobMancer Dec 9, 2022
3a1e5b4
update variable name to print gbk not in lcoal db
HobnobMancer Dec 9, 2022
82e9baf
update unit tests
HobnobMancer Dec 9, 2022
d2ff454
cache uniprot accessions that could not be mapped to genbank
HobnobMancer Dec 9, 2022
c8474e2
expand error catching when querying ncbi
HobnobMancer Jan 5, 2023
ec3d6d4
move ncbi-calling funcs to ncbi mod
HobnobMancer Jan 6, 2023
2d0822b
update imports for getting prot seqs
HobnobMancer Jan 6, 2023
75f3405
process invalids containing batches after processing failed connectio…
HobnobMancer Jan 6, 2023
a70c2b9
update version number
HobnobMancer Jan 6, 2023
9a896e7
log when cache and downloaded seqs don't match
HobnobMancer Jan 6, 2023
d4ecf03
correct syntax errors
HobnobMancer Jan 6, 2023
b0f1384
add missing args to func calls
HobnobMancer Jan 10, 2023
53198c1
add missing IncompeteRead import
HobnobMancer Jan 10, 2023
6bb4849
add missing IncompeteRead import
HobnobMancer Jan 10, 2023
2af7469
correct . to , typo
HobnobMancer Jan 10, 2023
3119346
add missing args param to func calls
HobnobMancer Jan 10, 2023
8ce63b6
use joined list as str as failed connections key
HobnobMancer Jan 11, 2023
63f869b
correct typos success to successful
HobnobMancer Jan 11, 2023
2d08be9
correct old var name to new var name: gbk_acc_to_retrieve to acc_to_r…
HobnobMancer Jan 11, 2023
771b3d1
correct list to set: successful_accessions
HobnobMancer Jan 11, 2023
79350d2
remove whitespace from ids
HobnobMancer Jan 11, 2023
94d59d1
fix failed_connections bug
HobnobMancer Jan 11, 2023
4324751
update docs on cache files
HobnobMancer Jan 11, 2023
35c2fd8
Merge branch 'master' into issue_99
HobnobMancer Jan 11, 2023
b8268e5
change sqlalchemy requirement to fix v num
HobnobMancer Jan 11, 2023
b3b9dc4
Merge branch 'issue_99' of https://github.com/HobnobMancer/cazy_websc…
HobnobMancer Jan 11, 2023
8 changes: 6 additions & 2 deletions README.md
@@ -57,8 +57,6 @@ Protein sequences (retrieved from GenBank and/or UniProt) from the local CAZyme

Please see the [full documentation at ReadTheDocs](https://cazy-webscraper.readthedocs.io/en/latest/?badge=latest).

-**The `bioconda` installation method is not currently supported, but we are working on getting this fixed soon**. For now please install via pypi or from source.


## Table of Contents
<!-- TOC -->
@@ -599,6 +597,8 @@ Providing the file types **is** case sensitive, but the order the file types are
cw_get_uniprot_data my_cazyme_db/cazyme_db.db --ec_filter 'EC1.2.3.4,EC2.3.1.-'
```

+`-F`, `--file_only` - Only add seqs provided via JSON and/or FASTA file. Do not retrieve data from NCBI.

`--families` - List of CAZy (sub)families to scrape.

`--genbank_accessions` - Path to text file containing a list of GenBank accessions to retrieve protein data for. A unique accession per line.
@@ -619,6 +619,10 @@ cw_get_uniprot_data my_cazyme_db/cazyme_db.db --ec_filter 'EC1.2.3.4,EC2.3.1.-'

`--retries`, `-r` - Define the number of times to retry making a connection to CAZy if the connection should fail. Default: 10.

+`--seq_dict` - Path to a JSON file, keyed by GenBank accessions and valued by protein sequence. Add seqs in file to the local CAZyme database.

+`--seq_file` - Path to a FASTA file, keyed by GenBank accessions and valued by protein sequence. Add seqs in file to the local CAZyme database.

`--sql_echo` - Set SQLite engine echo parameter to True, causing SQLite to print log messages. Default: False.

`--species` - List of species (written as Genus Species) to restrict the scraping of CAZymes to. CAZymes will be retrieved for **all** strains of each given species.
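
As a rough illustration of the two input formats described for `--seq_dict` and `--seq_file`, the sketch below builds a JSON string keyed by GenBank accession and an equivalent FASTA string. The accessions and sequences are made up, the helper names `to_seq_dict_json` and `to_fasta` are hypothetical, and using the bare accession as the FASTA header is an assumption that the README excerpt does not confirm.

```python
import json


def to_seq_dict_json(seqs):
    """Serialise {GenBank accession: protein sequence} as a JSON string."""
    return json.dumps(seqs, indent=2)


def to_fasta(seqs, width=60):
    """Render the same mapping as FASTA text, one record per accession.

    Assumption: the header line holds only the accession.
    """
    lines = []
    for accession, sequence in seqs.items():
        lines.append(f">{accession}")
        # Wrap the sequence at a fixed line width.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Hypothetical accessions and sequences, for illustration only.
example_seqs = {
    "ABC12345.1": "MKTAYIAKQR",
    "XYZ67890.2": "MSLNNRWQ",
}

seq_dict_text = to_seq_dict_json(example_seqs)
seq_file_text = to_fasta(example_seqs)
```

Writing `seq_dict_text` and `seq_file_text` to disk would give files of the general shape that `--seq_dict` and `--seq_file` are described as accepting.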
5 changes: 4 additions & 1 deletion cazy_webscraper/__init__.py
@@ -54,7 +54,7 @@
from cazy_webscraper.sql import sql_orm


-__version__ = "2.2.3"
+__version__ = "2.2.4"


VERSION_INFO = f"cazy_webscraper version: {__version__}"
@@ -127,6 +127,9 @@ def closing_message(job, start_time, args, early_term=False):
     else:
         print(message)

+    if early_term:
+        sys.exit(1)


 def display_citation_info():
     """Display citation information.
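
The change to `closing_message` makes an early termination exit with a non-zero status after the closing message is shown. The sketch below is a simplified stand-in and not the project's function: the name `closing_message_sketch`, the message wording, and the dropped `args` parameter are all assumptions made for illustration.

```python
import sys
from datetime import datetime


def closing_message_sketch(job, start_time, early_term=False):
    """Print a closing message, then exit with status 1 on early termination."""
    elapsed = datetime.now() - start_time
    message = f"{job} finished in {elapsed}"  # hypothetical wording
    print(message)

    if early_term:
        # A non-zero exit status lets calling scripts detect the early stop.
        sys.exit(1)

    return message
```

Because `sys.exit` raises `SystemExit`, a caller or a unit test can catch the exception and inspect its `code` attribute instead of letting the interpreter stop.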