Introduction

This entry describes the annotations available for the three model species supported by SeedMatchR (human, mouse, and rat). SeedMatchR needs two inputs: a GTF-style annotation that maps genes to transcripts, and the sequences of the features being searched. Users can generate these independently or load them with the SeedMatchR::load_annotations function, which loads predefined annotations based on the species of interest, the filtering criteria, and the features of interest.

NOTES:

  1. To use rnor7, you must have R 4.3.0 or greater installed, because AnnotationHub ≥ 3.8.0 is required.
  2. rnor6 and mm10 cannot be used with ensembldb::TxIsCanonicalFilter, so the canonical argument of SeedMatchR::load_annotations must be set to FALSE.
  3. rnor6 and rnor7 cannot be used with ensembldb::TxSupportLevelFilter, so a transcript support level should not be passed to SeedMatchR::load_annotations when working with the rat transcriptome.
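
The three notes above can be sketched as a small pre-flight check. This is a minimal illustration in base R, not part of SeedMatchR: the function name `check_annotation_request` and its `tx_support_filter` flag are hypothetical, and the rat rule is applied to any reference whose name starts with "rnor" on the assumption that it covers all rat builds.

```r
# Hypothetical helper (not part of SeedMatchR): returns a character vector of
# problems found for a planned load_annotations() call, based on the notes above.
check_annotation_request <- function(reference, canonical = FALSE,
                                     tx_support_filter = FALSE) {
  problems <- character(0)

  # Note 1: rnor7 needs R >= 4.3.0 (required by AnnotationHub >= 3.8.0)
  if (reference == "rnor7" && getRversion() < "4.3.0") {
    problems <- c(problems, "rnor7 requires R 4.3.0 or greater")
  }

  # Note 2: rnor6 and mm10 do not support the canonical transcript filter
  if (reference %in% c("rnor6", "mm10") && isTRUE(canonical)) {
    problems <- c(problems,
                  paste0(reference, ": set canonical = FALSE"))
  }

  # Note 3: rat builds do not support the transcript support level filter
  # (assumption: every reference name starting with "rnor" is a rat build)
  if (startsWith(reference, "rnor") && isTRUE(tx_support_filter)) {
    problems <- c(problems,
                  paste0(reference, ": do not pass a transcript support level"))
  }

  problems
}

mm10_problems <- check_annotation_request("mm10", canonical = TRUE)
hg38_problems <- check_annotation_request("hg38", canonical = TRUE)
rnor6_problems <- check_annotation_request("rnor6", canonical = TRUE,
                                           tx_support_filter = TRUE)
```

An empty result means none of the three documented restrictions applies to the requested combination.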

Current Annotation Options

Species  SeedMatchR arg  Reference Build  ENSEMBL Build  Release Date   2bit ID   ensembldb ID
Human    hg38            GRCh38.p13       109            February 2023  AH106283  AH109606
Rat      rnor6           Rnor_6.0         104            May 2021       AH93578   AH95846
Rat      rnor7           mRatBN7.2        109            February 2023  AH106786  AH109732
Rat      rnor7.113       mRatBN7.2        113            October 2024   AH106786  AH119437
Mouse    mm39            GRCm39           109            February 2023  AH106440  AH109655
Mouse    mm10            GRCm38           102            October 2020   AH88475   AH89211
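
For scripted use, the table above can be encoded as a data frame and looked up by SeedMatchR argument. The sketch below is an illustration only: `annotation_table` and `lookup_annotation_ids` are hypothetical names, not SeedMatchR objects, and the values are copied from the table.

```r
# Values copied from the table above; the object and column names are
# hypothetical and chosen only for this example.
annotation_table <- data.frame(
  species         = c("Human", "Rat", "Rat", "Rat", "Mouse", "Mouse"),
  seedmatchr_arg  = c("hg38", "rnor6", "rnor7", "rnor7.113", "mm39", "mm10"),
  ensembl_release = c(109, 104, 109, 113, 109, 102),
  two_bit_id      = c("AH106283", "AH93578", "AH106786", "AH106786",
                      "AH106440", "AH88475"),
  ensdb_id        = c("AH109606", "AH95846", "AH109732", "AH119437",
                      "AH109655", "AH89211"),
  stringsAsFactors = FALSE
)

# Return the single table row for one SeedMatchR reference name
lookup_annotation_ids <- function(arg, table = annotation_table) {
  hit <- table[table$seedmatchr_arg == arg, , drop = FALSE]
  if (nrow(hit) != 1) {
    stop("Unknown SeedMatchR reference: ", arg)
  }
  hit
}

ids <- lookup_annotation_ids("rnor7")
ids$two_bit_id  # AnnotationHub ID of the .2bit sequence file
ids$ensdb_id    # AnnotationHub ID of the EnsDb annotation
```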

Functions for working with annotations and building transcriptomes for queries

Load data for Rnor7

annodb <- load_annotations("rnor7", feature.type = "exons", protein.coding = FALSE, canonical = FALSE, return_gene_name = FALSE)
#> Build AnnotationFilter for transcript features based on the following parameters: 
#> Keep only standard chroms: TRUE
#> Remove rows with NA in transcript ID: TRUE
#> Keep only protein coding genes and transcripts: FALSE
#> Filtering for transcripts with support level: FALSE
#> Keep only the ENSEMBL canonical transcript: FALSE
#> Filtering for specific genes: FALSE
#> Filtering for specific transcripts: FALSE
#> Filtering for specific gene symbols: FALSE
#> Filtering for specific entrez id: FALSE
#> Loading annotations from AnnotationHub for rnor7
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> require("rtracklayer")
#> Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
#> 'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> require("ensembldb")
#> Extracting exons from ensembldb object.
#> Extracting sequences for each feature.
#> Keeping sequences that are >= 8

How annotations were retrieved

Selecting GTF

The GTF files are the full Ensembl gene annotation for a given reference build and release. Repeat masking does not apply to them, because a GTF holds feature coordinates rather than sequence. These files have names like Mus_musculus.GRCm39.109.gtf.

Selecting .2bit DNA sequences

The .2bit files were selected to have the full DNA sequence visible without any type of masking of repeats. These files have names like Homo_sapiens.GRCh38.dna.primary_assembly.2bit. For rat, no dna.primary_assembly file was available, so we chose the dna_sm.toplevel file instead. This is the soft-masked version of the genome, in which repeats are kept in the sequence but written in lowercase.
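
To make the soft-masking convention concrete, the base R sketch below uses a made-up sequence in which a repeat is written in lowercase. It only illustrates the file convention; it does not describe how SeedMatchR or the .2bit import treats case.

```r
# Made-up soft-masked sequence: the lowercase run stands for a masked repeat
soft_masked <- "ACGTTGCAacgtacgtaaGGCTTAGC"

# Locate lowercase (soft-masked) runs; gregexpr() is case-sensitive by default
runs <- gregexpr("[acgtn]+", soft_masked)[[1]]
masked <- data.frame(
  start = as.integer(runs),
  width = attr(runs, "match.length")
)

# Upper-casing recovers the same bases as an unmasked file would contain
unmasked <- toupper(soft_masked)
```

Because the bases themselves are preserved, a soft-masked file carries the same sequence as an unmasked one, unlike a hard-masked (dna_rm) file where repeats are replaced with N.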

Code for querying AnnotationHub for references

Hg38

.2bit

# Create the hub object used by every query in this section
ah <- AnnotationHub::AnnotationHub()
AnnotationHub::query(ah, c("GRCh38", "2bit"))
#> AnnotationHub with 120 records
#> # snapshotDate(): 2025-04-08
#> # $dataprovider: Ensembl
#> # $species: Homo sapiens, homo sapiens
#> # $rdataclass: TwoBitFile
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["AH49722"]]' 
#> 
#>              title                                           
#>   AH49722  | Homo_sapiens.GRCh38.cdna.all.2bit               
#>   AH49723  | Homo_sapiens.GRCh38.dna.primary_assembly.2bit   
#>   AH49724  | Homo_sapiens.GRCh38.dna_rm.primary_assembly.2bit
#>   AH49725  | Homo_sapiens.GRCh38.dna_sm.primary_assembly.2bit
#>   AH49726  | Homo_sapiens.GRCh38.ncrna.2bit                  
#>   ...        ...                                             
#>   AH106282 | Homo_sapiens.GRCh38.cdna.all.2bit               
#>   AH106283 | Homo_sapiens.GRCh38.dna.primary_assembly.2bit   
#>   AH106284 | Homo_sapiens.GRCh38.dna_rm.primary_assembly.2bit
#>   AH106285 | Homo_sapiens.GRCh38.dna_sm.primary_assembly.2bit
#>   AH106286 | Homo_sapiens.GRCh38.ncrna.2bit

ensembldb

Version 110 for GRCh38.p13
AnnotationHub::query(ah, c("GRCh38", "110"))
#> AnnotationHub with 1 record
#> # snapshotDate(): 2025-04-08
#> # names(): AH113665
#> # $dataprovider: Ensembl
#> # $species: Homo sapiens
#> # $rdataclass: EnsDb
#> # $rdatadateadded: 2023-04-25
#> # $title: Ensembl 110 EnsDb for Homo sapiens
#> # $description: Gene and protein annotations for Homo sapiens based on Ensem...
#> # $taxonomyid: 9606
#> # $genome: GRCh38
#> # $sourcetype: ensembl
#> # $sourceurl: http://www.ensembl.org
#> # $sourcesize: NA
#> # $tags: c("110", "Annotation", "AnnotationHubSoftware", "Coverage",
#> #   "DataImport", "EnsDb", "Ensembl", "Gene", "Protein", "Sequencing",
#> #   "Transcript") 
#> # retrieve record with 'object[["AH113665"]]'
Version 80 for GRCh38.p2
AnnotationHub::query(ah, c("GRCh38", "80"))
#> AnnotationHub with 1 record
#> # snapshotDate(): 2025-04-08
#> # names(): AH47066
#> # $dataprovider: Ensembl
#> # $species: Homo sapiens
#> # $rdataclass: GRanges
#> # $rdatadateadded: 2015-05-22
#> # $title: Homo_sapiens.GRCh38.80.gtf
#> # $description: Gene Annotation for Homo sapiens
#> # $taxonomyid: 9606
#> # $genome: GRCh38
#> # $sourcetype: GTF
#> # $sourceurl: ftp://ftp.ensembl.org/pub/release-80/gtf/homo_sapiens/Homo_sap...
#> # $sourcesize: 44733199
#> # $tags: c("GTF", "ensembl", "Gene", "Transcript", "Annotation") 
#> # retrieve record with 'object[["AH47066"]]'
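
Once an ID has been picked from a query result, the record can be retrieved with `[[`, as the printed hint in the output suggests. The wrapper below is a minimal sketch: `fetch_hub_records` is a hypothetical name, and only `AnnotationHub::AnnotationHub()` and the `[[` accessor are taken from the AnnotationHub package. The hub is a default argument, so nothing is downloaded until the function is called.

```r
# Hypothetical convenience wrapper: fetch several hub records by ID.
# `hub` defaults to a live AnnotationHub; it is only created when needed.
fetch_hub_records <- function(ids, hub = AnnotationHub::AnnotationHub()) {
  stopifnot(is.character(ids), length(ids) > 0)
  records <- lapply(ids, function(id) hub[[id]])
  names(records) <- ids
  records
}

# Not run here: needs the AnnotationHub package plus a network connection
# or a populated cache. IDs are the hg38 entries from the table above.
# hg38 <- fetch_hub_records(c("AH106283", "AH109606"))
# hg38[["AH109606"]]  # the EnsDb object
```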

Rat

.2bit

Rnor6
query(ah, c("Rnor", "release-104"))
#> AnnotationHub with 7 records
#> # snapshotDate(): 2025-04-08
#> # $dataprovider: Ensembl
#> # $species: rattus norvegicus
#> # $rdataclass: TwoBitFile, GRanges
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["AH92418"]]' 
#> 
#>             title                                          
#>   AH92418 | Rattus_norvegicus.Rnor_6.0.104.abinitio.gtf    
#>   AH92419 | Rattus_norvegicus.Rnor_6.0.104.chr.gtf         
#>   AH92420 | Rattus_norvegicus.Rnor_6.0.104.gtf             
#>   AH93576 | Rattus_norvegicus.Rnor_6.0.cdna.all.2bit       
#>   AH93577 | Rattus_norvegicus.Rnor_6.0.dna_rm.toplevel.2bit
#>   AH93578 | Rattus_norvegicus.Rnor_6.0.dna_sm.toplevel.2bit
#>   AH93579 | Rattus_norvegicus.Rnor_6.0.ncrna.2bit
Rnor7
query(ah, c("Rattus", ".2bit"))
#> AnnotationHub with 106 records
#> # snapshotDate(): 2025-04-08
#> # $dataprovider: Ensembl, UCSC
#> # $species: Rattus norvegicus, rattus norvegicus
#> # $rdataclass: TwoBitFile
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["AH14021"]]' 
#> 
#>              title                                           
#>   AH14021  | rn6.2bit                                        
#>   AH14022  | rn5.2bit                                        
#>   AH14023  | rn4.2bit                                        
#>   AH49867  | Rattus_norvegicus.Rnor_6.0.cdna.all.2bit        
#>   AH49868  | Rattus_norvegicus.Rnor_6.0.dna_rm.toplevel.2bit 
#>   ...        ...                                             
#>   AH104348 | Rattus_norvegicus.mRatBN7.2.ncrna.2bit          
#>   AH106784 | Rattus_norvegicus.mRatBN7.2.cdna.all.2bit       
#>   AH106785 | Rattus_norvegicus.mRatBN7.2.dna_rm.toplevel.2bit
#>   AH106786 | Rattus_norvegicus.mRatBN7.2.dna_sm.toplevel.2bit
#>   AH106787 | Rattus_norvegicus.mRatBN7.2.ncrna.2bit

ensembldb

Rnor6
query(ah, c("rat", 104, "ensdb"))
#> AnnotationHub with 6 records
#> # snapshotDate(): 2025-04-08
#> # $dataprovider: Ensembl
#> # $species: Sparus aurata, Rattus norvegicus, Mesocricetus auratus, Kryptole...
#> # $rdataclass: EnsDb
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["AH95677"]]' 
#> 
#>             title                                        
#>   AH95677 | Ensembl 104 EnsDb for Carassius auratus      
#>   AH95725 | Ensembl 104 EnsDb for Echeneis naucrates     
#>   AH95749 | Ensembl 104 EnsDb for Kryptolebias marmoratus
#>   AH95762 | Ensembl 104 EnsDb for Mesocricetus auratus   
#>   AH95846 | Ensembl 104 EnsDb for Rattus norvegicus      
#>   AH95850 | Ensembl 104 EnsDb for Sparus aurata
Rnor7
query(ah, c("rat", 109, "ensdb"))
#> AnnotationHub with 6 records
#> # snapshotDate(): 2025-04-08
#> # $dataprovider: Ensembl
#> # $species: Sparus aurata, Rattus norvegicus, Mesocricetus auratus, Kryptole...
#> # $rdataclass: EnsDb
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["AH109518"]]' 
#> 
#>              title                                        
#>   AH109518 | Ensembl 109 EnsDb for Carassius auratus      
#>   AH109583 | Ensembl 109 EnsDb for Echeneis naucrates     
#>   AH109611 | Ensembl 109 EnsDb for Kryptolebias marmoratus
#>   AH109625 | Ensembl 109 EnsDb for Mesocricetus auratus   
#>   AH109732 | Ensembl 109 EnsDb for Rattus norvegicus      
#>   AH109736 | Ensembl 109 EnsDb for Sparus aurata
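
The two rat queries above each return six records because the pattern "rat" also matches other species names. As far as we understand, `query()` treats each pattern as a case-insensitive regular expression over the record metadata. The base R sketch below mimics that matching on the six titles printed above only; it is a simplification, since the real query searches more metadata fields than the title.

```r
# Titles copied from the Ensembl 109 query output above
titles <- c(
  "Ensembl 109 EnsDb for Carassius auratus",
  "Ensembl 109 EnsDb for Echeneis naucrates",
  "Ensembl 109 EnsDb for Kryptolebias marmoratus",
  "Ensembl 109 EnsDb for Mesocricetus auratus",
  "Ensembl 109 EnsDb for Rattus norvegicus",
  "Ensembl 109 EnsDb for Sparus aurata"
)

# "rat" is a substring of auratus, naucrates, marmoratus, aurata and Rattus
broad <- grepl("rat", titles, ignore.case = TRUE)

# The full species name leaves only the rat record
narrow <- grepl("Rattus norvegicus", titles, ignore.case = TRUE)
titles[narrow]

# A narrower hub query along the same lines (not run here):
# query(ah, c("Rattus norvegicus", "109", "EnsDb"))
```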

Mouse

.2bit

mm39 or GRCm39
query(ah, c("GRCm39", ".2bit"))
#> AnnotationHub with 25 records
#> # snapshotDate(): 2025-04-08
#> # $dataprovider: Ensembl
#> # $species: mus musculus
#> # $rdataclass: TwoBitFile
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["AH90962"]]' 
#> 
#>              title                                           
#>   AH90962  | Mus_musculus.GRCm39.cdna.all.2bit               
#>   AH90963  | Mus_musculus.GRCm39.dna.primary_assembly.2bit   
#>   AH90964  | Mus_musculus.GRCm39.dna_rm.primary_assembly.2bit
#>   AH90965  | Mus_musculus.GRCm39.dna_sm.primary_assembly.2bit
#>   AH90966  | Mus_musculus.GRCm39.ncrna.2bit                  
#>   ...        ...                                             
#>   AH106439 | Mus_musculus.GRCm39.cdna.all.2bit               
#>   AH106440 | Mus_musculus.GRCm39.dna.primary_assembly.2bit   
#>   AH106441 | Mus_musculus.GRCm39.dna_rm.primary_assembly.2bit
#>   AH106442 | Mus_musculus.GRCm39.dna_sm.primary_assembly.2bit
#>   AH106443 | Mus_musculus.GRCm39.ncrna.2bit
mm10 or GRCm38
query(ah, c("GRCm38", ".2bit"))
#> AnnotationHub with 95 records
#> # snapshotDate(): 2025-04-08
#> # $dataprovider: Ensembl
#> # $species: Mus musculus, mus musculus
#> # $rdataclass: TwoBitFile
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["AH49772"]]' 
#> 
#>             title                                           
#>   AH49772 | Mus_musculus.GRCm38.cdna.all.2bit               
#>   AH49773 | Mus_musculus.GRCm38.dna.primary_assembly.2bit   
#>   AH49774 | Mus_musculus.GRCm38.dna_rm.primary_assembly.2bit
#>   AH49775 | Mus_musculus.GRCm38.dna_sm.primary_assembly.2bit
#>   AH49776 | Mus_musculus.GRCm38.ncrna.2bit                  
#>   ...       ...                                             
#>   AH88474 | Mus_musculus.GRCm38.cdna.all.2bit               
#>   AH88475 | Mus_musculus.GRCm38.dna.primary_assembly.2bit   
#>   AH88476 | Mus_musculus.GRCm38.dna_rm.primary_assembly.2bit
#>   AH88477 | Mus_musculus.GRCm38.dna_sm.primary_assembly.2bit
#>   AH88478 | Mus_musculus.GRCm38.ncrna.2bit

ensembldb

mm39 or GRCm39 (Ensembl 104)
query(ah, c("mus", 104, "ensdb"))
#> AnnotationHub with 11 records
#> # snapshotDate(): 2025-04-08
#> # $dataprovider: Ensembl
#> # $species: Ursus maritimus, Scophthalmus maximus, Prolemur simus, Periophth...
#> # $rdataclass: EnsDb
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["AH95670"]]' 
#> 
#>             title                                              
#>   AH95670 | Ensembl 104 EnsDb for Balaenoptera musculus        
#>   AH95763 | Ensembl 104 EnsDb for Mus caroli                   
#>   AH95775 | Ensembl 104 EnsDb for Mus musculus                 
#>   AH95778 | Ensembl 104 EnsDb for Mus pahari                   
#>   AH95779 | Ensembl 104 EnsDb for Mus spicilegus               
#>   ...       ...                                                
#>   AH95790 | Ensembl 104 EnsDb for Neogobius melanostomus       
#>   AH95825 | Ensembl 104 EnsDb for Periophthalmus magnuspinnatus
#>   AH95836 | Ensembl 104 EnsDb for Prolemur simus               
#>   AH95861 | Ensembl 104 EnsDb for Scophthalmus maximus         
#>   AH95879 | Ensembl 104 EnsDb for Ursus maritimus
mm39 or GRCm39 (Ensembl 109)
query(ah, c("mus", 109, "ensdb"))
#> AnnotationHub with 27 records
#> # snapshotDate(): 2025-04-08
#> # $dataprovider: Ensembl
#> # $species: Mus musculus, Ursus maritimus, Scophthalmus maximus, Prolemur si...
#> # $rdataclass: EnsDb
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["AH109509"]]' 
#> 
#>              title                                              
#>   AH109509 | Ensembl 109 EnsDb for Balaenoptera musculus        
#>   AH109626 | Ensembl 109 EnsDb for Mus caroli                   
#>   AH109640 | Ensembl 109 EnsDb for Mus musculus                 
#>   AH109641 | Ensembl 109 EnsDb for Mus musculus                 
#>   AH109642 | Ensembl 109 EnsDb for Mus musculus                 
#>   ...        ...                                                
#>   AH109671 | Ensembl 109 EnsDb for Neogobius melanostomus       
#>   AH109709 | Ensembl 109 EnsDb for Periophthalmus magnuspinnatus
#>   AH109721 | Ensembl 109 EnsDb for Prolemur simus               
#>   AH109750 | Ensembl 109 EnsDb for Scophthalmus maximus         
#>   AH109783 | Ensembl 109 EnsDb for Ursus maritimus

Session information

sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] ensembldb_2.32.0        AnnotationFilter_1.32.0 GenomicFeatures_1.60.0 
#>  [4] AnnotationDbi_1.70.0    Biobase_2.68.0          rtracklayer_1.68.0     
#>  [7] GenomicRanges_1.60.0    GenomeInfoDb_1.44.0     IRanges_2.42.0         
#> [10] S4Vectors_0.46.0        SeedMatchR_2.0.0        AnnotationHub_3.16.0   
#> [13] BiocFileCache_2.16.0    dbplyr_2.5.0            BiocGenerics_0.54.0    
#> [16] generics_0.1.4         
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1            dplyr_1.1.4                
#>  [3] blob_1.2.4                  filelock_1.0.3             
#>  [5] Biostrings_2.76.0           bitops_1.0-9               
#>  [7] fastmap_1.2.0               lazyeval_0.2.2             
#>  [9] RCurl_1.98-1.17             GenomicAlignments_1.44.0   
#> [11] XML_3.99-0.18               digest_0.6.37              
#> [13] mime_0.13                   lifecycle_1.0.4            
#> [15] ProtGenerics_1.40.0         KEGGREST_1.48.1            
#> [17] RSQLite_2.4.1               magrittr_2.0.3             
#> [19] compiler_4.5.1              rlang_1.1.6                
#> [21] sass_0.4.10                 tools_4.5.1                
#> [23] yaml_2.3.10                 knitr_1.50                 
#> [25] S4Arrays_1.8.1              htmlwidgets_1.6.4          
#> [27] bit_4.6.0                   curl_6.4.0                 
#> [29] DelayedArray_0.34.1         abind_1.4-8                
#> [31] BiocParallel_1.42.1         withr_3.0.2                
#> [33] purrr_1.1.0                 desc_1.4.3                 
#> [35] grid_4.5.1                  SummarizedExperiment_1.38.1
#> [37] cli_3.6.5                   rmarkdown_2.29             
#> [39] crayon_1.5.3                ragg_1.4.0                 
#> [41] httr_1.4.7                  rjson_0.2.23               
#> [43] DBI_1.2.3                   cachem_1.1.0               
#> [45] parallel_4.5.1              BiocManager_1.30.26        
#> [47] XVector_0.48.0              restfulr_0.0.16            
#> [49] matrixStats_1.5.0           vctrs_0.6.5                
#> [51] Matrix_1.7-3                jsonlite_2.0.0             
#> [53] bit64_4.6.0-1               systemfonts_1.2.3          
#> [55] jquerylib_0.1.4             glue_1.8.0                 
#> [57] pkgdown_2.1.3               codetools_0.2-20           
#> [59] BiocVersion_3.21.1          BiocIO_1.18.0              
#> [61] UCSC.utils_1.4.0            tibble_3.3.0               
#> [63] pillar_1.11.0               rappdirs_0.3.3             
#> [65] htmltools_0.5.8.1           GenomeInfoDbData_1.2.14    
#> [67] R6_2.6.1                    textshaping_1.0.1          
#> [69] lattice_0.22-7              evaluate_1.0.4             
#> [71] png_0.1-8                   Rsamtools_2.24.0           
#> [73] memoise_2.0.1               bslib_0.9.0                
#> [75] SparseArray_1.8.0           xfun_0.52                  
#> [77] fs_1.6.6                    MatrixGenerics_1.20.0      
#> [79] pkgconfig_2.0.3