API Reference
pyfastx.version
- pyfastx.version(debug=False)
Get current version of pyfastx
- Parameters:
debug (bool) – if true, return versions of pyfastx, zlib, sqlite3 and zran.
- Returns:
version of pyfastx
- Return type:
str
- pyfastx.gzip_check(file_name)
New in pyfastx 0.5.4
Check whether a file is gzip compressed
- Parameters:
file_name (str) – the path of input file
- Returns:
True if the file is gzip compressed, else False
- Return type:
bool
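As an illustration of what such a check involves, here is a minimal pure-Python sketch that tests for the two-byte gzip magic number. It is not pyfastx's implementation, which may inspect the file differently; the helper name looks_gzipped and the demo file names are hypothetical.

```python
import gzip
import os
import tempfile

# Every gzip member starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(file_name):
    """Return True if the file starts with the gzip magic number."""
    with open(file_name, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


# Demonstration with two throwaway files (hypothetical names).
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "plain.fa")
gz_path = os.path.join(tmp_dir, "packed.fa.gz")

with open(plain_path, "w") as handle:
    handle.write(">seq1\nACGT\n")
with gzip.open(gz_path, "wt") as handle:
    handle.write(">seq1\nACGT\n")

print(looks_gzipped(plain_path))
print(looks_gzipped(gz_path))
```

In real code, pyfastx.gzip_check(file_name) is the documented call; the sketch only shows the idea behind it.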
- pyfastx.reverse_complement(seq)
New in pyfastx 2.0.0
Get the reverse complement of a given DNA sequence
- Parameters:
seq (str) – DNA sequence
- Returns:
reverse complement sequence
- Return type:
str
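For readers unfamiliar with the operation, the following pure-Python sketch shows what a reverse complement is. It is an illustration only, not pyfastx's implementation: the function name reverse_complement_sketch is hypothetical, and IUPAC ambiguity codes other than N are passed through unchanged here, which may differ from the library's behavior.

```python
# Translation table for unambiguous DNA bases plus N, in both cases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement_sketch(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement_sketch("ATGC"))
```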
pyfastx.Fasta
- class pyfastx.Fasta(file_name, index_file=None, uppercase=True, build_index=True, full_index=False, full_name=False, memory_index=False, key_func=None)
Read and parse FASTA files. A Fasta object can be used like a dict or a list: use an integer index or a sequence name to get a sequence object, e.g. fasta[0], fasta['seq1']
- Parameters:
file_name (str) – the file path of input FASTA file
index_file (str) – the index file of the FASTA file; by default, an index file with the .fxi extension in the same directory as the FASTA file is used. New in 2.0.0
uppercase (bool) – always output uppercase sequence, default: True
build_index (bool) – build index for random access to FASTA sequences, default: True. If build_index is False, iteration returns a tuple (name, seq); if build_index is True, iteration returns a sequence object.
full_index (bool) – calculate character (e.g. A, T, G, C) composition when building the index; this speeds up GC content extraction but takes more time to build the index, default: False
full_name (bool) – use the full header line instead of the part before the first whitespace as the sequence identifier, even in the mode without building an index. New in 0.6.14, default: False
memory_index (bool) – if memory_index is True, the FASTA index is kept in memory and no index file is generated, default: False
key_func (function) – new in 0.5.1; the key function is generally a lambda expression that splits the header to obtain a shortened identifier, default: None
- Returns:
Fasta object
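The full_name and key_func parameters both control how a sequence identifier is derived from a header line. The sketch below illustrates that idea in plain Python. It is not pyfastx's code: the function sequence_identifier is hypothetical, and both the precedence of key_func over full_name and the assumption that key_func receives the header without the leading '>' are guesses made for the example.

```python
def sequence_identifier(header, full_name=False, key_func=None):
    """Derive an identifier from a FASTA header line (illustrative only)."""
    header = header.rstrip("\r\n")
    if header.startswith(">"):
        header = header[1:]
    if key_func is not None:
        # Assumption: a user-supplied key function takes precedence.
        return key_func(header)
    if full_name:
        return header
    # Default: the part before the first whitespace.
    return header.split(None, 1)[0]


header = ">sp|P12345|ABC_HUMAN some protein"
print(sequence_identifier(header))
print(sequence_identifier(header, full_name=True))
print(sequence_identifier(header, key_func=lambda h: h.split("|")[1]))
```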
- file_name
FASTA file path
- size
total length of sequences in FASTA file
- type
New in pyfastx 0.5.4
get FASTA type, return DNA, RNA, protein, or unknown
- is_gzip
New in pyfastx 0.5.0
return True if fasta is gzip compressed else return False
- gc_content
GC content of whole sequences in FASTA file, return a float value
- composition
nucleotide composition in FASTA file, a dict containing counts of A, T, G, C and N (unknown nucleotide base)
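To make the relationship between composition and gc_content concrete, here is a small plain-Python sketch. It rests on assumptions that the text above does not confirm: that GC content is expressed as a percentage and that N is excluded from the denominator. The helper names composition_sketch and gc_content_sketch are hypothetical, and pyfastx's exact definition may differ.

```python
from collections import Counter


def composition_sketch(seq):
    """Count A, T, G, C and N in a sequence (case-insensitive)."""
    counts = Counter(seq.upper())
    return {base: counts.get(base, 0) for base in "ATGCN"}


def gc_content_sketch(seq):
    """GC percentage over A/T/G/C bases (assumed definition)."""
    comp = composition_sketch(seq)
    acgt = comp["A"] + comp["T"] + comp["G"] + comp["C"]
    if acgt == 0:
        return 0.0
    return 100 * (comp["G"] + comp["C"]) / acgt


print(composition_sketch("AACGTN"))
print(gc_content_sketch("AACGTN"))
```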
- longest
get longest sequence in FASTA file, return a Sequence object
New in pyfastx 0.3.0
- shortest
get shortest sequence in FASTA file, return a Sequence object
New in pyfastx 0.3.0
- mean
get average length of sequences in FASTA file
New in pyfastx 0.3.0
- median
get median length of sequences in FASTA file
New in pyfastx 0.3.0
- fetch(chrom, intervals, strand='+')
Extract subsequences from a given sequence by a start and end coordinate or a list of coordinates. This function caches the full sequence in memory and is suitable for extracting large numbers of subsequences from a specified sequence.
- Parameters:
chrom (str) – chromosome name or sequence name
intervals (list/tuple) – list of [start, end] coordinates
strand (str) – sequence strand, + indicates sense strand, - indicates antisense strand, default: '+'
Note
intervals can be a list or tuple with start and end position e.g. (10, 20). intervals also can be a list or tuple with multiple coordinates e.g. [(10, 20), (50,70)]
- Returns:
sliced subsequences
- Return type:
str
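The two accepted shapes of intervals can be illustrated with a plain-Python sketch operating on an in-memory string. Several details here are assumptions rather than documented behavior: coordinates are treated as 1-based and inclusive, multiple intervals are concatenated in the given order, and strand '-' returns the reverse complement of the joined result. The name fetch_sketch is hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def fetch_sketch(seq, intervals, strand="+"):
    """Slice one or more 1-based inclusive intervals out of seq."""
    # Accept a single (start, end) pair as well as a list of pairs.
    if len(intervals) == 2 and all(isinstance(x, int) for x in intervals):
        intervals = [intervals]
    joined = "".join(seq[start - 1:end] for start, end in intervals)
    if strand == "-":
        # Assumption: antisense means reverse complement of the result.
        joined = joined.translate(_COMPLEMENT)[::-1]
    return joined


seq = "ACGTACGTAC"
print(fetch_sketch(seq, (2, 4)))
print(fetch_sketch(seq, [(1, 2), (9, 10)]))
print(fetch_sketch(seq, (2, 4), strand="-"))
```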
- flank(chrom, start, end, flank_length=50, use_cache=False)
Get the flank sequence of given subsequence with start and end. New in 0.7.0
- Parameters:
chrom (str) – chromosome name or sequence name
start (int) – 1-based start position of subsequence on chrom
end (int) – 1-based end position of subsequence on chrom
flank_length (int) – length of flank sequence, default 50
use_cache (bool) – cache the whole sequence
Note
If you want to extract flank sequences for large numbers of subsequences from the same sequence, using use_cache=True will greatly improve the speed.
- Returns:
left flank and right flank sequence
- Return type:
tuple
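The coordinate arithmetic behind flanking sequences can be sketched on a plain string as follows. This is an illustration, not the library's code: the function flank_sketch is hypothetical, the end position is assumed to be inclusive, and flanks are assumed to be truncated at the sequence boundaries.

```python
def flank_sketch(seq, start, end, flank_length=50):
    """Return (left, right) flanks of the 1-based region start..end."""
    # Convert the 1-based start to a 0-based slice boundary.
    left_begin = max(0, start - 1 - flank_length)
    left = seq[left_begin:start - 1]
    # Assumption: end is inclusive, so the right flank begins at index end.
    right = seq[end:end + flank_length]
    return left, right


seq = "AAAAACCCCCGGGGG"
print(flank_sketch(seq, 6, 10, flank_length=3))
print(flank_sketch(seq, 6, 10))
```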
- build_index()
build index for FASTA file
- keys()
get all names of sequences
- Returns:
a FastaKeys object
- count(n)
get counts of sequences whose length >= n bp
New in pyfastx 0.3.0
- Parameters:
n (int) – number of bases
- Returns:
sequence counts
- Return type:
int
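The length statistics described above (mean, median and count) can be illustrated on a plain list of sequence lengths. The lengths below are made-up example values and count_at_least is a hypothetical helper; the sketch only shows what the numbers mean, not how pyfastx computes them from its index.

```python
import statistics

# Made-up sequence lengths, for illustration only.
lengths = [120, 80, 300, 45, 210]

mean_length = statistics.mean(lengths)
median_length = statistics.median(lengths)


def count_at_least(lengths, n):
    """Number of sequences whose length is >= n."""
    return sum(1 for length in lengths if length >= n)


print(mean_length, median_length, count_at_least(lengths, 100))
```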
pyfastx.Sequence
- class pyfastx.Sequence
Read-only sequence object generated by a Fasta object. A Sequence can be treated as a list and supports slicing, e.g. seq[10:20]
- id
sequence id or order number in FASTA file
- name
sequence name
- description
Get sequence description after name in sequence header
New in pyfastx 0.3.1
- start
start position of sequence
- end
end position of sequence
- gc_content
GC content of current sequence, return a float value
- composition
nucleotide composition of sequence, a dict containing counts of A, T, G, C and N (unknown nucleotide base)
- raw
get the raw string (with header line and sequence lines) of sequence as it appeared in file
New in pyfastx 0.6.3
- seq
get the string of sequence in sense strand
- reverse
get the string of reversed sequence
- complement
get the string of complement sequence
- antisense
get the string of sequence in antisense strand, i.e. the reverse complement of the sequence
- search(subseq, strand='+')
Search for subsequence from given sequence and get the start position of the first occurrence
New in pyfastx 0.3.6
- Parameters:
subseq (str) – a subsequence for search
strand (str) – sequence strand + or -, default +
- Returns:
the one-based start position if the subsequence is found, otherwise None
- Return type:
int or None
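The one-based return convention differs from Python's own str.find, which is zero-based and returns -1 on failure. The following sketch shows the conversion for the sense strand only; search_sketch is a hypothetical name, and how the library treats strand '-' is not covered here.

```python
def search_sketch(seq, subseq):
    """Return the 1-based start of the first occurrence, or None."""
    index = seq.find(subseq)  # 0-based, -1 when absent
    return index + 1 if index >= 0 else None


print(search_sketch("ACGTACGT", "GTA"))
print(search_sketch("ACGTACGT", "TTT"))
```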
pyfastx.Fastq
New in pyfastx 0.4.0
- class pyfastx.Fastq(file_name, index_file=None, phred=0, build_index=True, full_index=False)
Read and parse fastq file
- Parameters:
file_name (str) – input FASTQ file path
index_file (str) – the index file of FASTQ file, default using the index file with extension of .fxi in the same directory of FASTQ file. New in 2.0.0
build_index (bool) – build index for random access to FASTQ reads, default: True. If build_index is False, iteration returns a tuple (name, seq, qual); if build_index is True, iteration returns a read object.
full_index (bool) – calculate character (e.g. A, T, G, C) composition when building the index; this speeds up GC content extraction but takes more time to build the index, default: False
phred (int) – phred offset used to convert quality ASCII characters to integer quality values, usually 33 or 64, default 33
- Returns:
Fastq object
- file_name
FASTQ file path
- size
total bases in FASTQ file
- is_gzip
New in pyfastx 0.5.0
return True if the FASTQ file is gzip compressed, else return False
- gc_content
GC content of whole FASTQ file
- avglen
New in pyfastx 0.6.10
get average length of reads
- maxlen
New in pyfastx 0.6.10
get maximum length of reads
- minlen
New in pyfastx 0.6.10
get minimum length of reads
- maxqual
New in pyfastx 0.6.10
get maximum quality score of bases
- minqual
New in pyfastx 0.6.10
get minimum quality score of bases
- composition
base composition in FASTQ file, a dict containing counts of A, T, G, C and N (unknown nucleotide base)
- phred
get phred value
- encoding_type
New in pyfastx 0.4.1
Guess the quality encoding type used by the FASTQ file
- build_index()
Build the index for the FASTQ file when build_index was set to False
- keys()
New in pyfastx 0.8.0
Get all the names of reads in the FASTQ file
- Returns:
a FastqKeys object
pyfastx.Read
New in pyfastx 0.4.0
- class pyfastx.Read
Read-only read object for obtaining read information, generated by a Fastq object
- id
read id or order number in FASTQ file
- name
read name excluding ‘@’
- description
get the full header line of read
- raw
get the raw string (with header, sequence, comment and quality lines) of read as it appeared in file
New in pyfastx 0.6.3
- seq
get read sequence string
- qual
get read quality ascii string
- quali
get read quality integer value (ascii - phred), return a list
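The "ascii - phred" conversion can be written out in a few lines of plain Python. This sketch mirrors the formula stated above; the function name quality_scores is hypothetical, and the quality strings are made-up examples.

```python
def quality_scores(qual, phred=33):
    """Convert a quality string to integer scores (ASCII code - offset)."""
    return [ord(char) - phred for char in qual]


print(quality_scores("II5!"))            # Phred+33 example
print(quality_scores("hh", phred=64))    # Phred+64 example
```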
pyfastx.Fastx
- class pyfastx.Fastx(file_name, format='auto', uppercase=False)
New in pyfastx 0.8.0
A Python binding of kseq.h that provides a simple API for iterating over sequences in a FASTA/Q file
- Parameters:
file_name (str) – input fasta or fastq file path
format (str) – the input file format, can be “fasta” or “fastq”, default: “auto”, automatically detect the format of sequence file
uppercase (bool) – always output uppercase sequence, only works for FASTA files, default: False
- Returns:
Fastx object
pyfastx.FastaKeys
- class pyfastx.FastaKeys
FastaKeys is a read-only, list-like object that contains all sequence names
- sort(by='id', reverse=False)
Sort keys by sequence id, name or length for iteration
New in pyfastx 0.5.0
- Parameters:
by (str) – order by id, name, or length, default is id
reverse (bool) – used to flag descending sorts, default is False
- Returns:
FastaKeys object itself
- filter(*filters)
Filter keys by sequence name and length for iteration
- Parameters:
filters (list) – filters generated by comparisons like ids > 500 or ids % 'seq1', where ids is an Identifier object
- Returns:
FastaKeys object itself
- reset()
Clear all filters and sort order
- Returns:
FastaKeys object itself
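The effect of sorting and filtering keys can be pictured with a plain list of (id, name, length) records. This sketch is not the FastaKeys API: the records are made-up, sorted_names and names_longer_than are hypothetical helpers, and the reading of a comparison such as ids > 500 as a length filter is an assumption.

```python
# Made-up (id, name, length) records, for illustration only.
records = [(1, "seqB", 300), (2, "seqA", 120), (3, "seqC", 750)]

_FIELD = {"id": 0, "name": 1, "length": 2}


def sorted_names(records, by="id", reverse=False):
    """Return names ordered by id, name or length."""
    column = _FIELD[by]
    ordered = sorted(records, key=lambda rec: rec[column], reverse=reverse)
    return [name for _, name, _ in ordered]


def names_longer_than(records, min_length):
    """Return names of records whose length exceeds min_length."""
    return [name for _, name, length in records if length > min_length]


print(sorted_names(records, by="name"))
print(sorted_names(records, by="length", reverse=True))
print(names_longer_than(records, 500))
```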
pyfastx.FastqKeys
- class pyfastx.FastqKeys
New in pyfastx 0.8.0
FastqKeys is a read-only, list-like object that contains all names of reads