Trim first N bases in multi fasta file with awk and print with max width format

awk,gawk,fasta
Background: The multi-FASTA format contains several sequence records. Each record begins with a single-line description, followed by several lines of sequence (RNA, DNA, or protein). The description line starts with a greater-than symbol (">"), which is followed by the identifier of the sequence, and the rest of the line contains...
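The question asks for awk; as a rough illustration of the same idea, here is a minimal Python sketch. The function names `read_fasta` and `trim_and_wrap` and the example records are hypothetical and not taken from the original post. It joins each record's sequence lines, drops the first `n` residues, and re-wraps the remainder at a maximum line width.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def trim_and_wrap(lines, n, width=60):
    """Drop the first n residues of every record and re-wrap at `width`."""
    out = []
    for header, seq in read_fasta(lines):
        trimmed = seq[n:]
        out.append(">" + header)
        out.extend(trimmed[i:i + width] for i in range(0, len(trimmed), width))
    return out


# Hypothetical input: two records, the first split over two lines.
example = [">seq1 demo", "ACGTACGTAC", "GTACGT", ">seq2", "TTTTGGGGCCCC"]
print("\n".join(trim_and_wrap(example, n=4, width=5)))
```

Joining the sequence lines before trimming matters: trimming each physical line separately would remove bases from every line rather than only from the start of the record.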

How to generate matrix from fasta files

python,numpy,bioinformatics,biopython,fasta
On the one hand, I have 13 FASTA files like this (extract from one file): >gi|304322925|ref|YP_003856771.1| NADH dehydrogenase subunit 6 (mitochondrion) [Lynx rufus] MMTYIVFILSTIFVVSFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGY TTAMATEPYPEAWTSNKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEA MGIAALYSYGTWLVVVTGWSLLIGVLVIMEVTRGN >gi|295065592|ref|YP_003587393.1| NADH dehydrogenase subunit 6 (mitochondrion) [Nomascus siki] MTYTLFLLSVILVMGFVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYT...

How to extract short sequence using window with specific step size?

python,extract,extraction,biopython,fasta
The code below extracts short sequences from every sequence with a window size of 4. How can I shift the window by a step size of 2 and still extract 4 base pairs? Example code: from Bio import SeqIO with open("testA_out.fasta","w") as f: for seq_record in SeqIO.parse("testA.fasta", "fasta"): i = 0 while ((i+4) < len(seq_record.seq))...
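One way to think about the window and step question, independent of Biopython, is as plain string slicing. The sketch below is only an illustration: the helper name `windows` and the test sequence are hypothetical, and it is not the code from the truncated excerpt. It returns every full-length window of `size` characters, advancing the start position by `step`.

```python
def windows(seq, size=4, step=2):
    """Return every full-length window of `size`, advancing by `step`.

    The upper bound len(seq) - size + 1 keeps the last full window and
    drops any trailing partial window.
    """
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, step)]


# Hypothetical sequence used only for illustration.
for chunk in windows("ATCGCGTC", size=4, step=2):
    print(chunk)
```

With `step=1` the same helper behaves as an ordinary sliding window, so the window size and the step stay independent parameters.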

Convert/transform an abundance (OTU) table/data.frame (to a fasta file) in R

r,fasta
I'm working on a large dataset at the moment, and so far I have been able to solve all my problems through countless Google searches and long trial-and-error sessions. I've managed to use plyr and reshape functions for some transformations of my different datasets and learned a lot, but...

Change the identifier line name to random shortened name in fasta file

perl,random,identifier,short,fasta
I have a FASTA file with about 8,000 sequences in it. I need to change the identifier line name to a random, unique, shortened name (max length 10). The FASTA file contains sequences like this: >AX039539.1.1212 Bacteria;Chloroflexi;Dehalococcoidia;Dehalococcoidales;...

Splitting the data.frame into 2 columns

r,split,data.frame,fasta
I have a FASTA file, which I read into R using "read.delim". The corresponding data.frame looks like the following: >tm_sd_1256_2_1 MJAKDHRZTASDJASJDKASJDURUJDFLSDJFSDIFJKSDFKSJDFLJSDLFD ASDJASDJ >tm_sd_5672_1_2 AIZZTQBCSKLKDSHDADBCMSJHKQUWIRJHJJKKDLJSGDHASGDZGDHGHAGSDZASDASDVASGASDHGCAHGS SADASDA[sample.fasta file][1] >tm_sd_543_1_2 MUZTREQWERNBVXCYMNMVHZTOPOPOEURDASDOPOQWEUZQUIZRZIRIEIWUEWASDHASHDAHSDHAKHHSDHASHDJASHDAHUWIEUROWUOERUOWEUROOWWWW >tm_sd_212_0_2...

Want to add random string to identifier line in fasta file

perl,random,add,identifier,fasta
I want to add a random string to the existing identifier line in a FASTA file, so that I get: MMETSP0259|AmphidiniumcarteCMP1314aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa with the sequence on the following lines as normal. I think I have a problem with the output format. This is what I get: MMETSP0259|AmphidiniumCMP1314aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa CTTCATCGCACATGGATAACTGTGTACCTGACTaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab TCTGGGAAAGGTTGCTATCATGAGTCATAGAATaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac It's added to every line....

matching and appending a string to headers

awk,sed,fasta
I want to append a string to headers in a monolithic FASTA file. Input: >uce-101_seqname GGCTGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGTAGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGA Desired output: >uce-101_seqname |uce-101 GGCTGGCACCAGTTAACTTGGGATATTGGAGTGAAAAGGCCCGTAATCAGCCTTCGGTCATGTAGAACAATGCATAAAATTAAATTGACATTAATGAATAATTGTGTAATGAAAATGGA Example code: awk -F ">" '{if($2 ~ /^uce/){print $0 " |" substr($2,1,7)} else {print $0}}' <inputfile> The example code only works for 7 characters (e.g.,...
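The excerpt's awk one-liner fixes the prefix at 7 characters via `substr`. A regular expression avoids that limit. The Python sketch below is an illustration rather than the accepted answer. It assumes, and this is only an assumption, that the locus ID is always `uce-` followed by one or more digits. The names `UCE_HEADER` and `tag_header` are hypothetical.

```python
import re

# Assumption: a locus ID is "uce-" followed by digits, at the start of the header.
UCE_HEADER = re.compile(r"^>(uce-\d+)")


def tag_header(line):
    """Append ' |<locus id>' to matching header lines; return other lines unchanged."""
    match = UCE_HEADER.match(line)
    if match:
        return f"{line} |{match.group(1)}"
    return line


# Hypothetical lines used only for illustration.
for line in [">uce-101_seqname", "GGCTGGCACC", ">uce-12345_other"]:
    print(tag_header(line))
```

Because the digits are matched greedily, the appended ID grows with the locus number instead of being cut at a fixed width.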

Remove a specific pattern in fasta sequences

python,fasta
I have a FASTA file like this: >IWB12626 AACTTGAGGGACGTGCAGCTAAGGGAGGACTACTATCCAGCACCGGAGAA[T/C]GACATGATGATCACAGAGATGCGGGCTGAATCTTGCCTCCGGTTTGAGCA >IWB49383 CMGCTCATTTCTGCCGGGCTCGATAGCTGCCCTGTTCTTGAGAAGATCTC[A/G]ATTAAGGTGGAGGGCGATCTCCGGACTTGTCCGCGTCCATTTCACGGGTC I need to remove the square brackets "[]", the "/", and the nucleotide that follows the "/", so basically choosing the first of the two variants. This is my script, but I don't know how to tell the program that I...
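The substitution described above (keep the first allele, drop the brackets, the slash, and the second allele) can be written as a single regular-expression replacement. This is a minimal sketch under the assumption that every variant has the form `[X/Y]` with exactly one slash. The names `VARIANT` and `keep_first_allele` are hypothetical.

```python
import re

# Assumption: variants look like "[X/Y]" with exactly one "/" inside the brackets.
VARIANT = re.compile(r"\[([^/\]]+)/[^\]]+\]")


def keep_first_allele(sequence):
    """Replace each '[X/Y]' with 'X', leaving the rest of the sequence untouched."""
    return VARIANT.sub(r"\1", sequence)


# Short fragment taken from around the variant in the first record above.
print(keep_first_allele("CCGGAGAA[T/C]GACATGAT"))
```

Header lines contain no brackets in the example, so the function can be applied to every line of the file without special-casing lines that start with ">".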

Extracting the longest sequence from the tab delim file

r,sequence,fasta,longest-substring
I have a tab-delimited file which contains the following information: >fasta >ss_23_122_0_1 MJSDHWTEZTZEWUIASUDUAISDUASADIASDIAUSIDAUSIDCASDAS >ss_23_167_0_1 WEIURIOWERWKLEJDSAJFASDGASZDTTQZWTEZQWTEZUQWEZQWTEZQTWEZTQW >ss_23_167_0_1 MAASDASDWEPWERIWERIWER >ss_23_167_0_1 QWEKCKLSDOIEOWIOWEUWWEUWEZURZEWURZUWEUZUQZUWZUE >ss_45_201_0_1 HZTMKSKDIUWZUWEZTZWERWUEOIRUOEROOWEWERSDFSDFRRRETERTER >ss_45_201_0_1 ZTTRASOIIDIFOSDIOFISDOFSDFQAWTZETQWE >ss_89_10_0_2...

Python: How to find coordinates of short sequences in a FASTA file?

python,python-2.7,bioinformatics,biopython,fasta
I have a list of short sequences and want to obtain their coordinates (in other words, a BED file) by comparing them with a FASTA file that contains the original sequences. Fasta file: >PGH2 CGTAGCGGCTGAGTGCGCGGATAGCGCGTA Short sequence fasta file: >PGH2 CGGCTGAGT Is there any way to obtain its...
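For exact matches on the forward strand, the coordinates can be found with plain string search. The sketch below is illustrative only: the helper names `find_all` and `bed_lines` are hypothetical, and it assumes exact, forward-strand matching with no mismatches or reverse complements. It reports hits in BED convention, which is 0-based and half-open (the start is inclusive, the end exclusive).

```python
def find_all(reference, query):
    """Return (start, end) for every occurrence of query, including overlaps.

    Coordinates are 0-based and half-open, as in BED.
    """
    hits = []
    start = reference.find(query)
    while start != -1:
        hits.append((start, start + len(query)))
        start = reference.find(query, start + 1)  # step by 1 to allow overlaps
    return hits


def bed_lines(name, reference, query):
    """Format hits as tab-separated BED lines: name, start, end."""
    return [f"{name}\t{s}\t{e}" for s, e in find_all(reference, query)]


# Sequences taken from the excerpt above.
for line in bed_lines("PGH2", "CGTAGCGGCTGAGTGCGCGGATAGCGCGTA", "CGGCTGAGT"):
    print(line)
```

Advancing the search by one position after each hit, rather than by the query length, is a deliberate choice so that overlapping occurrences are reported.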

Python: How to print out sequences with length n from sliding window in FASTA file?

python,python-2.7,python-3.x,biopython,fasta
I have a FASTA file with a few sequences, and I would like to slide a window of size 5 along each sequence and extract the subsequence at every position. For example ( test1.fasta ): >human1 ATCGCGTC >human2 ATTTTCGCGA Expected output ( test1_out.txt ): >human1 ATCGC >human1 TCGCG >human1 CGCGT >human1 GCGTC >human2 ATTTT >human2 TTTTC >human2 TTTCG >human2 TTCGC >human2 TCGCG >human2 CGCGA My following code is only able to extract the first five...

Python: How to extract DNA sequence based on a text file with binary content?

python,python-2.7,bioinformatics,biopython,fasta
For example, I have a FASTA file with the following sequences: >human1 AGGGCGSTGC >human2 GCTTGCGCTAG >human3 TTCGCTAG How can I use Python to read a text file with the following content to extract the sequences? 1 represents true and 0 represents false. Only sequences with value 1 will be extracted. Example...

Subset sequence data in fasta file based on IDs stored in listed data frames

r,subsetting,fasta,seq
I am trying to subset one FASTA file (containing multiple sequences) into several smaller ones based on IDs I stored in a list of data frames (and I have a FASTA called fastafile like this: fastafile <- dput(fastafile) structure(list(r1 = "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcac", r2 = "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgag", r3 = "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgca", r4 = "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcgg",...

validate text box input (for fasta format) before submission

javascript,php,fasta
I want to validate text box input on an HTML page before the data is submitted: whether it is in FASTA format and whether it contains only one sequence. I know PHP but have little idea about JavaScript. I don't think it is possible with PHP....