sequences source

itai
New Member

Posts: 3

Post by itai on Sept 13, 2023 13:06:18 GMT 2

hello

We have downloaded all the sequences of a specific bacterial class from your database.

We aim to associate each sequence with its origin, such as its host or its environment. Is there a method to do this robustly?

thank you

Admin
Administrator

Posts: 10

Post by Admin on Sept 13, 2023 13:32:15 GMT 2

Hi Itai,
How many sequences did you get as the result of your query? What exactly would you like to get as output for each sequence?
Would you like the complete f-score table of all the terms associated with each sequence? Or the best-matching term out of a list of possible terms? Or something else?
I think it would help if you could also explain your motivation (i.e. why you are looking for all sequences of a given class, and what your end goal with these sequences is).

itai
New Member

Posts: 3

Post by itai on Sept 27, 2023 14:16:29 GMT 2

Thank you Amnon,
We are trying to build a tree of a certain class of bacteria and annotate the tree with the source of each sequence.
We received around 30k sequences.
So we'd like to get the best-matching term related to the source of the sample (preferably animal/environment), so we can tag it on the tree and do further analysis with it.
Thank you

Admin
Administrator

Posts: 10

Post by Admin on Sept 27, 2023 15:40:20 GMT 2

Hi Itai,
There are several ways to approach this, depending on your exact goal:
For a given ASV, we can calculate an f-score quantifying the association between this ASV and each dbBact term. The question is whether, for each ASV, you want the term with the highest f-score overall, or the term with the highest f-score out of a given set of terms.
For example, if you have an ASV that is found in feces of dogs, it could be that "feces" has a higher f-score than "dog", but you may be more interested in the animal host than in the "feces" term.
Or maybe you are interested in something different (for example, getting the f-scores of all terms in a given set for each ASV, or using a metric other than the f-score)? Let me know.
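
To make the two options concrete, here is a minimal sketch in plain Python. It does not call dbBact or calour: the f-score values, the `host_terms` set and the `best_term` helper are all made up for illustration, and the only thing it shows is the difference between "highest f-score overall" and "highest f-score within a given set of terms".

```python
# Hypothetical f-scores for a single ASV (numbers invented for illustration).
fscores = {"feces": 0.62, "dog": 0.41, "homo sapiens": 0.05, "soil": 0.01}


def best_term(fscores, allowed=None):
    """Return the term with the highest f-score.

    If `allowed` is given, only terms in that set are considered.
    Returns None when no candidate term is left.
    """
    candidates = {
        term: score
        for term, score in fscores.items()
        if allowed is None or term in allowed
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Hypothetical set of host terms we care about.
host_terms = {"dog", "homo sapiens", "mus musculus"}

print(best_term(fscores))              # feces (best term overall)
print(best_term(fscores, host_terms))  # dog (best term among host terms)
```

With these made-up numbers the unrestricted call picks "feces", while restricting to host terms picks "dog" even though its f-score is lower.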

In any case, the simplest way is to use the calour and dbbact-calour Python modules in a Jupyter notebook:

Assuming you have Python (preferably in a conda environment):

1. install calour (see here)

2. install dbbact-calour (installation instructions are part of the calour installation instructions).

3. use the Experiment.add_terms_to_features() function. See here for documentation

An example notebook can be found here

Let me know how it goes
Amnon

itai
New Member

Posts: 3

Post by itai on Sept 27, 2023 15:54:31 GMT 2

Yes, we would want the association with "dog" rather than "feces", even if it has a lower f-score.
So we were wondering whether there is a way to filter such terms out, so that for each ESV we get only the terms that are host/environment associations.
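
One possible way to picture this kind of filtering over many sequences is sketched below in plain Python. It is only an illustration under stated assumptions: it does not use the dbBact or calour API, the `ORIGIN_TERMS` mapping is a hypothetical hand-curated list, and the sequences and f-scores are invented. Each sequence is tagged with its best host/environment term, and sequences with no such term are marked as unassigned so they can still be placed on the tree.

```python
import csv
import io

# Hypothetical mapping from origin terms to a category. A real list would
# have to be curated for the bacterial class being studied.
ORIGIN_TERMS = {
    "dog": "host",
    "homo sapiens": "host",
    "soil": "environment",
    "sea water": "environment",
}


def tag_origins(fscores_by_seq, origin_terms=ORIGIN_TERMS):
    """Map each sequence to (term, category, f-score) of its best origin term.

    Terms missing from `origin_terms` (e.g. "feces") are ignored.
    Sequences without any origin term map to None.
    """
    tags = {}
    for seq, fscores in fscores_by_seq.items():
        kept = {t: s for t, s in fscores.items() if t in origin_terms}
        if kept:
            term = max(kept, key=kept.get)
            tags[seq] = (term, origin_terms[term], kept[term])
        else:
            tags[seq] = None
    return tags


def to_tsv(tags):
    """Render the tags as a tab-separated table, one row per sequence."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["sequence", "term", "category", "fscore"])
    for seq, tag in tags.items():
        writer.writerow([seq, *(tag if tag else ("unassigned", "", ""))])
    return out.getvalue()


# Invented f-scores for three invented sequences.
example = {
    "TACGGAGG": {"feces": 0.62, "dog": 0.41},
    "TACGTAGG": {"soil": 0.30, "rhizosphere": 0.35},
    "GACGAAGG": {"feces": 0.20},
}
tags = tag_origins(example)
print(to_tsv(tags))
```

The resulting table could then be joined to the tree tips by sequence. How the per-sequence f-scores are obtained in the first place is outside this sketch.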