Welcome to pyfastx’s documentation!


pyfastx is a lightweight Python C extension that gives users random access to sequences in plain and gzipped FASTA/Q files. The module aims to provide simple APIs for extracting sequences from FASTA files and reads from FASTQ files by identifier or index number. To avoid consuming excessive memory, pyfastx builds an index, stored in a sqlite3 database file, and uses it for random access. In addition, pyfastx can parse both standard FASTA (each sequence is split across multiple lines of the same length) and nonstandard FASTA (each sequence is split across one or more lines of differing lengths). The module uses kseq.h, written by @attractivechaos in the klib project, to parse plain FASTA/Q files, and zran.c, written by @pauldmccarthy in the indexed_gzip project, to index gzipped files for random access.
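The idea of an on-disk index for random access can be illustrated with a minimal pure-Python sketch. This is not pyfastx's implementation or its database schema: the table layout and the helper names `build_index` and `fetch` are assumptions made for illustration, and the sketch handles plain (uncompressed) files only. It records the byte offset and byte length of each record's sequence block in sqlite3, then seeks straight to a requested record instead of loading the whole file.

```python
import os
import sqlite3
import tempfile


def build_index(fasta_path, db_path=":memory:"):
    """Store the byte offset and byte length of every sequence block (hypothetical schema)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seq "
        "(name TEXT PRIMARY KEY, offset INTEGER, nbytes INTEGER)"
    )
    name, start, pos = None, 0, 0
    with open(fasta_path, "rb") as fh:
        for line in fh:
            if line.startswith(b">"):
                # Close the previous record before starting a new one.
                if name is not None:
                    conn.execute(
                        "INSERT OR REPLACE INTO seq VALUES (?, ?, ?)",
                        (name, start, pos - start),
                    )
                # The identifier is the first word of the header line.
                name = line[1:].split()[0].decode()
                start = pos + len(line)
            pos += len(line)
    if name is not None:
        conn.execute(
            "INSERT OR REPLACE INTO seq VALUES (?, ?, ?)",
            (name, start, pos - start),
        )
    conn.commit()
    return conn


def fetch(fasta_path, conn, name):
    """Seek to one record and return its sequence with line breaks removed."""
    row = conn.execute(
        "SELECT offset, nbytes FROM seq WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    offset, nbytes = row
    with open(fasta_path, "rb") as fh:
        fh.seek(offset)
        raw = fh.read(nbytes)
    # Joining on whitespace also copes with lines of differing lengths.
    return b"".join(raw.split()).decode()


# Demo on a small nonstandard FASTA file (lines of different lengths).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fa")
    with open(demo_path, "w", newline="\n") as out:
        out.write(">seq1 first record\nACGTAC\nGT\n>seq2\nGGGCCC\nAT\nA\n")
    demo_conn = build_index(demo_path)
    demo_seq1 = fetch(demo_path, demo_conn, "seq1")
    demo_seq2 = fetch(demo_path, demo_conn, "seq2")
    demo_conn.close()
```

Because only names and offsets live in the database, memory use stays small regardless of file size. Gzipped input needs an additional index over the compressed stream, which this sketch does not attempt.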

This project was heavily inspired by @mdshw5’s project pyfaidx and @brentp’s project pyfasta.

Features

  • Single file for the Python extension

  • Lightweight and memory-efficient parsing of FASTA files

  • Fast random access to sequences in gzipped FASTA files

  • Read sequences from a FASTA file line by line

  • Calculate assembly N50 and L50

  • Calculate GC content and nucleotide composition

  • Extract reverse, complement, and antisense sequences

  • Broad compatibility, including support for parsing nonstandard FASTA files

  • Support for random access to reads in FASTQ files
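For readers unfamiliar with the statistics and sequence transformations named in the feature list, the following pure-Python sketch shows what they compute. It does not use or reproduce pyfastx's code: the function names are hypothetical, and two conventions are assumptions chosen here, namely that GC content is reported as a percentage and that bases other than A, C, G, and T are excluded from its denominator.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of sequence lengths.

    N50 is the length of the sequence at which the running total of lengths,
    taken from longest to shortest, first reaches half of the assembly size.
    L50 is the number of sequences needed to reach that point.
    """
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running * 2 >= total:
            return length, count
    raise ValueError("lengths must not be empty")


def gc_content(seq):
    """Percentage of G and C among the A, C, G, T bases (assumed convention)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


# Translation table for complementing DNA bases; other characters pass through.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse(seq):
    """Sequence read backwards, bases unchanged."""
    return seq[::-1]


def complement(seq):
    """Each base swapped for its pairing partner, order unchanged."""
    return seq.translate(_COMPLEMENT)


def antisense(seq):
    """Reverse complement: the complement read backwards."""
    return complement(seq)[::-1]


stats = n50_l50([2, 3, 4, 5, 6, 7, 8, 9, 10])
gc = gc_content("ACGTNN")
rc = antisense("AACG")
```

In the example, the lengths sum to 54, and the three longest sequences (10, 9, 8) together reach 27, so `stats` is `(8, 3)`.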
