Filtering
Your initial ov run
may produce a result database with unnecessary variants. Reports generated with this database can be too big as well. You can generate a trimmed down version of the result database and corresponding reports using filters.
A filter is a JSON object specifying the conditions used to filter variants. An example filter JSON object is below.
{
"sample": {
"require": [
"sample1",
"sample2",
],
"reject": [
"sample3",
"sample4",
]
},
"genes": [
"KRAS",
"BRAF"
],
"variant": {
"operator": "and",
"rules": [
{
"column": "gnomad3__af",
"test": "lessThan",
"value": 0.01,
"negate": false
},
{
"operator": "or",
"rules": [
{
"column": "clinvar__sig",
"test": "stringContains",
"value": "Pathogenic",
"negate": false
},
{
"column: "cadd__phred",
"test": "greaterThan",
"value": 20,
"negate": false
}
],
"negate": NEGATE
}
],
"negate": NEGATE
}
}
This filter means variants that meet the following conditions.
(they appear in sample1 or sample2, but not in sample3 nor sample4) AND
(they are from the gene KRAS or BRAF) AND
(their gnomAD3 allele frequency is less than 0.01) AND
(their ClinVar significance has Pathogenic OR their CADD Phred is greater than 20)
sample
and genes
can be omitted, which would mean not filtering by samples and genes.
The complete specification of the filter JSON object is below:
{
"sample": {
"require": [
"Sample to include",
"Sample to include",
...
],
"reject": [
"Sample name to exclude",
"Sample to exclude",
...
]
},
"genes": [
"Gene to include",
"Gene to include",
...
],
"variant": {
"operator": OPERATOR ("and" or "or"),
"rules": [
// One filter object can define the filter on one database column.
{
"column": "Column name in the result database" (such as "gnomad3__af"),
"test": TEST_TYPE
"value": VALUE
"negate": NEGATE (true or false)
},
// Filter objects can be grouped and nested in another filter object.
{
"operator": OPERATOR,
"rules": [
// Filter objects for columns, separated by a comma
],
"negate": NEGATE
}
],
"negate": NEGATE
}
}
TEST_TYPE
is one of "equals", "lessThanEq", "lessThan", "greatherThanEq", "greaterThan", "hasData", "stringContains", "stringStarts", "stringEnds", "between", "in", "select", and "inList".
VALUE
is string, integer, float, or in the case of TEST_TYPE select
, a list of string, integer, or float.
Filter JSON objects can be downloaded as JSON files from the GUI result viewer, using the export button in the Filter tab.
Generating reports with filtered variants
Filters can be used directly with ov run
. In this way, an input file is processed to produce a result database of all the variants in the input file, and reports are generated with filtered variants only. For example,
ov run input.vcf -f filter.json -t vcf
will process and generate input.vcf.sqlite
that has all the variants in input.vcf
, and then generate input.vcf.vcf
with only the variants that met the filtering conditions defined in filter.json
.
ov report
also can be used to do the same:
ov report input.vcf.sqlite -f filter.json -t vcf
Generating a filtered version of a result database
A trimmed version of a result database with filtered variants can be generated as well. This way, repeated report generation will not repeat the filtration. The trimmed database can be used for the GUI result viewer as well. To make such a trimmed version, use ov util filtersqlite
command. For example,
ov util filtersqlite input.vcf.sqlite -f filter.json
will generate input.vcf.filtered.sqlite
with the variants in input.vcf.sqlite
that met the conditions defined in filter.json
.
The suffix filtered
in input.vcf.filtered.sqlite
can be changed with --suffix
option:
ov util filtersqlite input.vcf.sqlite -f filter.json --suffix trimmed
will produce input.vcf.trimmed.sqlite
.
The output directory to store the filtered databases can be specified with -o
option.
Multiple result databases can be processed at once:
ov util filtersqlite input_1.vcf.sqlite input_2.vcf.sqlite -f filter.json
will process input_1.vcf.sqlite
and input_2.vcf.sqlite
one by one.
Filtering with SQL
Instead of a filter JSON object, SQL conditions can be directly used to filter variants as well. To do so, use --filtersql
option. This option can be used in ov run
and ov report
. For example,
ov run input.vcf --filtersql "v.base__chrom='chr1'" -t csv
will generate a CSV format output file of the variants that are filtered by the criterion that the chromosome of the variant is chr1.
The v.
in front of base__chrom
means the variant
table in the result database. g.
will mean the gene
table, and s
the sample table.
base__chrom
means base
module's chrom
column. base
module is an abstract column which includes the basic variant information as well as the mapper module. Other column names are used as they are. After a module name and "__
" (double underline), a column name follows. Thus, clinvar__sig
means cliuvar
module's sig
column. Thus, generating a VCF format output of the variants filtered with the criterion that the variants are Pathogenic
in ClinVar would be
ov report input.vcf --filtersql "clinvar__sig='Pathogenic'" -t vcf
SQL's where
command syntax is used. Thus, and
and or
can be used as well. For example,
ov report input.vcf -a clinvar --filtersql "base__hugo='BRCA1' and clinvar__sig like '%Pathogenic%' -t vcf"
will annotate the input file with the ClinVar module and generate a VCF format output of the variants filtered by two criteria, the gene being BRCA1 and the ClinVar significance has Pathogenic
.