jin
New Member
Posts: 3
|
Post by jin on Aug 22, 2023 19:51:09 GMT 2
Hi, I have been following the dbBact calour tutorial and am having a hard time figuring out the structure of the biom format. The website uses both a biom file and a text file to load the data:

dat = ca.read_amplicon('data/chronic-fatigue-syndrome.biom', 'data/chronic-fatigue-syndrome.sample.txt', normalize=10000, min_reads=1000)

I was wondering if I could get some help with the structure of the .biom format used in this tutorial. I hope to hear from some of you soon. Thank you, -Jin
|
|
|
Post by Admin on Aug 22, 2023 22:53:59 GMT 2
Hi Jin, the biom format is used to store the sample X feature (ASV) reads matrix. You can find all the details regarding the biom format here.

In addition, the text file contains the per-sample metadata (i.e. various details associated with each sample - for example, whether the subject the sample came from was healthy or sick, etc.). It is a tab-delimited text file with one row per sample, where columns represent the various fields. The first column is the sample ID, and it should match the sample IDs in the biom table. This is the standard mapping file (used, for example, in qiime2).

A bit more about the .biom format: it actually defines 3 different possible file formats - tsv (i.e. a tab-delimited text table file), json, or hdf5, which is a binary format that handles sparse matrices well (which microbiome tables typically are). Note that qiime2 also stores its tables internally in biom format, but embedded in the qza (to see the actual biom table file, you need to unzip the qza file). To get a biom table out of a qiime2 qza table, you can use:

qiime tools export --input-path table.qza --output-path exported-feature-table

Alternatively, calour can read qiime2 tables directly using the calour.read_qiime2() function instead of read_amplicon().

You can also find some more examples of how to use calour and dbBact in the dbBact paper notebooks here.

Does this make sense? What is the specific problem you are facing? In what format is your data?

Amnon
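To make the mapping-file layout concrete, here is a minimal sketch using only the Python standard library (the column names and sample values are illustrative, not from the tutorial):

```python
import csv
import io

# A minimal mapping file: tab-delimited, one row per sample, with the
# first column holding the sample ID. These IDs must match the sample
# IDs stored in the biom table for calour to join the two files.
mapping_tsv = (
    "#SampleID\thealth_status\tsex\n"
    "sample_1\tsick\tmale\n"
    "sample_2\thealthy\tfemale\n"
)

rows = list(csv.reader(io.StringIO(mapping_tsv), delimiter="\t"))
header, body = rows[0], rows[1:]
sample_ids = [row[0] for row in body]

print(header)       # column names; the first is the sample ID column
print(sample_ids)   # these IDs must appear as samples in the biom table
```

A file with this shape is what read_amplicon() expects as its second argument.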
|
|
jin
New Member
Posts: 3
|
Post by jin on Aug 23, 2023 19:41:09 GMT 2
Hi Amnon, I created the otu_table_2.biom file from the otu_table_2.txt file, and otu_table_2.txt looks like the chart below:

| OTU.ID | sample_1 | sample_2 | ... | sample_646 |
| otu_112 | 0 | 1824 | ... | 0 |
| otu_2612 | 0 | 0 | ... | 0 |
| ... | ... | ... | ... | ... |
| otu_4069 | 0 | 0 | ... | 0 |
After that, I have both the .biom file and the .txt file.

cfs = ca.read_amplicon('/z/Proj/Data/otu_table_2.biom', '/z/Proj/Data/test_metadata.txt', normalize=10000, min_reads=1000)
My sample_metadata looked fine; it looked similar to the example in the calour tutorial.

In: cfs.sample_metadata
Out:

| | provider_id | ... | sex | _calour_original_abundance |
| #SampleID | 10 | ... | male | 68150.0 |
| sample_1 | 10 | ... | male | 88992.0 |
| sample_2 | 21 | ... | male | 602945.0 |
| ... | ... | ... | ... | ... |
| sample_603 | 11 | ... | male | 174087.0 |

575 rows × 9 columns

However, when I tried to look at the feature_metadata, it only prints out "_feature_id":

In: cfs.feature_metadata
Out:

| | _feature_id |
| otu_112 | otu_112 |
| otu_2612 | otu_2612 |
| ... | ... |
| otu_1777 | otu_1777 |

4273 rows × 1 columns
So my question is: is the otu_table_2.biom file supposed to look like the chart above (the otu_table_2.txt file)? If not, could you provide an example of the .biom file structure for calour dbBact analysis? Thank you, -Jin Attachments:
|
|
|
Post by Admin on Aug 23, 2023 20:34:09 GMT 2
Hi Jin, I am a bit confused about the specific problem you are facing: As far as I can tell from the screenshot, the biom table you created seems ok. I was not able to see the mapping file, but hopefully it is also ok.
I would recommend setting the calour log level to 'INFO' using:
calour.set_log_level(11)
and then running calour.read_amplicon() with these files. Can you attach the output? (It should give information as to the number of features (ASVs) and samples, and how many samples were matched with the mapping file).
Regarding the feature_metadata, it contains the per feature (i.e. ASV) associated metadata. You can optionally provide a feature_metadata file name in the read_amplicon() (using the feature_metadata_file=XXX parameter, where the XXX file is a TSV containing per-feature info). However, typically, we do not use this feature_metadata_file field, as we do not have any prior information about the ASVs. So having the feature_metadata contain only the _feature_id column is totally fine.
IMPORTANT NOTE: if you want to use the dbbact-calour interface, the biom table should contain the exact ASV sequences instead of the OTU IDs as the first column in each row. Your current table contains the OTU IDs (e.g. "otu_122") where it should contain the exact ASV sequence (such as "TACGTATGGAGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGCAGGGCAAGTCTGATGTGAAAACCCGGGGCTCAACCCCGGGACTGCATTGGAAACTGTCCGGCTGGAGTGCAGGAGAGGTAAGTGGAATTCCTAG").
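Swapping the OTU IDs for the exact ASV sequences can be done on the text table before converting to biom. This is a sketch only, assuming you can export an ID-to-sequence lookup from your pipeline (the `id_to_seq` dict, the `rename_rows` helper, and the truncated sequences below are all hypothetical placeholders):

```python
# Hypothetical lookup exported from the denoising pipeline, mapping each
# OTU/ASV ID to its exact sequence (truncated here for readability).
id_to_seq = {
    "otu_112": "TACGTATGGAGCAAGCGTT",
    "otu_2612": "TGGGGAATTTTGGACAATG",
}

def rename_rows(tsv_lines, id_to_seq):
    """Replace the ID in the first column of each data row with the
    corresponding ASV sequence; the header line is kept unchanged."""
    out = [tsv_lines[0]]
    for line in tsv_lines[1:]:
        cols = line.split("\t")
        cols[0] = id_to_seq[cols[0]]   # swap ID -> exact sequence
        out.append("\t".join(cols))
    return out

table = [
    "#OTUID\tsample_1\tsample_2",
    "otu_112\t0\t1824",
    "otu_2612\t0\t0",
]
print("\n".join(rename_rows(table, id_to_seq)))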
In general, the typical modern way of getting a biom table out of sequencing machine output is to use a denoising method (such as DADA2 or Deblur), rather than the older OTU picking methods (such as closed-reference, open-reference, or uparse). What method/program/parameters did you use to generate your table (can you specify the steps used to generate it, starting from the fasta/fastq output of the sequencing)?
Amnon
|
|
jin
New Member
Posts: 3
|
Post by jin on Aug 24, 2023 23:51:28 GMT 2
Thank you for the thoughtful response, Amnon!

This is the output of calour.read_amplicon(). Based on what we have for the input, we could generate the heatmap plots, but not the "dbBact term enrichment" (diff_abundance_enrichment). As you answered above, we might need to change the biom table format to use the dbbact-calour interface. Does the text file below look ok for the biom file, or do we have to include other column components? When we tried to convert this text table into a biom file using a bash command, it didn't convert; we thought we needed to have OTU IDs to convert into a biom file. It would be really nice if you had suggestions for that.

Regarding your question about the method used to generate the table: although the features are named "otu" in the files from the previous post, please rest assured that we actually used an up-to-date dada2 pipeline. We renamed the ASVs to "otu_*" because the biom command line utility failed to convert our abundance table to biom when they were named "asv_*", but succeeded when they were named "otu_*". Our pipeline produces a mia/TreeSummarizedExperiment data structure (https://microbiome.github.io/mia/) where the features are named "asv_" and the sequences are stored as feature metadata. So, our challenge is to figure out how to extract the right information from that object to provide to dbBact-calour. Thank you, -Jin
|
|
|
Post by Admin on Aug 26, 2023 0:13:07 GMT 2
Hi Jin, thanks for your response. So if I understand correctly, the process for generating the biom table was:

1. Use DADA2 (via R?) to denoise the reads files.
2. Import the DADA2 results into mia (as a TreeSummarizedExperiment) in R.
3. Export the TreeSummarizedExperiment reads-per-sample table to a text file (how?).
4. Convert the text file to a biom table using biom convert (in the shell).

Is this correct? What steps did I miss/write incorrectly?

Calour can read a text tsv table directly (without the need to convert it to a .biom table). I think this will be the easiest method to use. So what you need is to create a tab-delimited text file of your table, with the following format. The first line (entries separated by tabs), e.g.:

#OTUID	SAMPLE1NAME	SAMPLE2NAME	etc...

and the other lines: one per ASV, with the first column being the ASV sequence, and the other entries being the number of reads of the ASV in each sample (again tab separated), e.g.:

TGGGGAATTTTGGACAATGGGCGCAAGCCTGATCCAGCCATGCCGCGTGTCTGAAGAAGGCCTTCGGGTTGTAAAGGACTTTTGTCAGGGAAGAAAAGGATAGGGTTAATACCCCTGTCTGATGACGGTACCTGAAGAATAAGCACCGGC	445	5	450
TGAGGAATATTGGTCAATGGATGCAAATCTGAACCAGCCAAGTAGCGTGCAGGATGACGGCCCTATGGGTTGTAAACTGCTTTTATGTGAGAATAAAGTTAGGTATGTATACTTATTTGCATGTATCACATGAATAAGGACCGGCTAATT	0	0	0
TGGGGAATATTGCACAATGGGGGGAACCCTGATGCAGCCATGCCGCGTGAATGAAGAAGGCCTTCGGGTTGTAAAGTTCTTTCGGTAGCGAGGAAGGCATTTAGTTTAATAGACTAGATGATTGACGTTAACTACAGAAGAAGCACCGGC	108	0	24

(see the example file attached)

Then just use calour.read_amplicon() with this tsv file (and the mapping file). This should work. Can you arrange such a table file with the sequences embedded?

One way to see if the dbbact interface works after loading this file: plot the interactive heatmap (gui='qt5' or 'jupyter'), click on some points in the heatmap, and see if you get the associated dbBact annotations for the selected ASV. Did it work?

Let me know how it goes

Amnon

Attachments: example-table.txt (508 B)
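The table layout described above can be sketched in Python (the sequences below are truncated placeholders standing in for full-length ASV sequences, and the sample names and counts are illustrative; calour itself is not imported here):

```python
import os
import tempfile

# Header line: "#OTUID" followed by the sample names, tab separated.
header = "#OTUID\tSAMPLE1NAME\tSAMPLE2NAME\tSAMPLE3NAME"

# One row per ASV: the first column is the ASV sequence itself, the
# remaining columns are read counts per sample.
rows = [
    ("TGGGGAATTTTGGACAATG", [445, 5, 450]),
    ("TGAGGAATATTGGTCAATG", [0, 0, 0]),
]
lines = [header] + [
    seq + "\t" + "\t".join(str(n) for n in counts) for seq, counts in rows
]

# Write the tab-delimited table to disk; calour.read_amplicon() can read
# such a tsv file directly, without a .biom conversion step.
path = os.path.join(tempfile.mkdtemp(), "table.txt")
with open(path, "w") as f:
    f.write("\n".join(lines) + "\n")
```

With a real table of this shape, calour.read_amplicon(path, mapping_file, normalize=10000, min_reads=1000) would be the loading call, as in the earlier posts.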
|
|