How do I convert three letter amino acid codes to one letter codes with Python or R?

I have a FASTA file as shown below. I would like to convert the three letter codes to one letter codes. How can I do this with Python or R?

>2ppo
ARGHISLEULEULYS
>3oot
METHISARGARGMET

Desired output:

>2ppo
RHLLK
>3oot
MHRRM

Your suggestions would be appreciated!


Biopython already has built-in dictionaries to help with such translations. The following commands show the whole list of available dictionaries:

import Bio.SeqUtils
help(Bio.SeqUtils.IUPACData)

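The predefined dictionary you are looking for is protein_letters_3to1. Its keys are capitalized like 'Ala', so upper case input such as 'ALA' needs .capitalize() before the lookup: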

Bio.SeqUtils.IUPACData.protein_letters_3to1['Ala']

Use a dictionary to look up the one letter codes:

d = {'CYS': 'C', 'ASP': 'D', 'SER': 'S', 'GLN': 'Q', 'LYS': 'K',
     'ILE': 'I', 'PRO': 'P', 'THR': 'T', 'PHE': 'F', 'ASN': 'N', 
     'GLY': 'G', 'HIS': 'H', 'LEU': 'L', 'ARG': 'R', 'TRP': 'W', 
     'ALA': 'A', 'VAL':'V', 'GLU': 'E', 'TYR': 'Y', 'MET': 'M'}

And a simple function to match the three letter codes with one letter codes for the entire string:

def shorten(x):
    if len(x) % 3 != 0: 
        raise ValueError('Input length should be a multiple of three')

    y = ''
    for i in range(len(x) // 3):
        y += d[x[3 * i : 3 * i + 3]]
    return y

Testing your example:

>>> shorten('ARGHISLEULEULYS')
'RHLLK'
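
The question starts from a FASTA file rather than a single string, so the conversion still has to be applied record by record. A minimal sketch, assuming header lines start with ">" and each sequence sits on a single line as in the question; the names THREE_TO_ONE, shorten_line and convert_fasta are made up for this example, and io.StringIO stands in for the real file:

```python
import io

# Same mapping as the dictionary above, repeated so the snippet runs on its own.
THREE_TO_ONE = {
    'CYS': 'C', 'ASP': 'D', 'SER': 'S', 'GLN': 'Q', 'LYS': 'K',
    'ILE': 'I', 'PRO': 'P', 'THR': 'T', 'PHE': 'F', 'ASN': 'N',
    'GLY': 'G', 'HIS': 'H', 'LEU': 'L', 'ARG': 'R', 'TRP': 'W',
    'ALA': 'A', 'VAL': 'V', 'GLU': 'E', 'TYR': 'Y', 'MET': 'M',
}


def shorten_line(seq):
    """Convert one line of concatenated three letter codes to one letter codes."""
    if len(seq) % 3 != 0:
        raise ValueError('Input length should be a multiple of three')
    return ''.join(THREE_TO_ONE[seq[i:i + 3].upper()]
                   for i in range(0, len(seq), 3))


def convert_fasta(lines):
    """Yield output lines: headers unchanged, sequence lines converted."""
    for line in lines:
        line = line.strip()
        if not line:
            continue                  # skip blank lines
        if line.startswith('>'):
            yield line                # header: keep as is
        else:
            yield shorten_line(line)  # sequence: three letter -> one letter


# Stand-in for an opened FASTA file; any iterable of lines works.
fasta = io.StringIO(">2ppo\nARGHISLEULEULYS\n>3oot\nMETHISARGARGMET\n")
result = list(convert_fasta(fasta))
print("\n".join(result))
```

If a sequence is wrapped over several lines, the lines of one record would have to be joined before conversion, because a line break can fall inside a three letter code.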

Here is a way to do it in R:

# Variables:
foo <- c("ARGHISLEULEULYS","METHISARGARGMET")

# Code maps:
code3 <- c("Ala", "Arg", "Asn", "Asp", "Cys", "Glu", "Gln", "Gly", "His", 
"Ile", "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", 
"Tyr", "Val")
code1 <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", 
"M", "F", "P", "S", "T", "W", "Y", "V")

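# Split each sequence into chunks of three letters and look every chunk up.
# (Running gsub over the whole string can match across chunk boundaries:
# "THRARGPRO" would become "TRP" and then "W".)
names(code1) <- toupper(code3)
foo <- sapply(regmatches(foo, gregexpr(".{3}", foo)),
              function(x) paste(code1[toupper(x)], collapse = ""))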

Results in :

> foo
[1] "RHLLK" "MHRRM"

Note that I changed the variable name, since variable names in R are not allowed to start with a number.


>>> src = "ARGHISLEULEULYS"
>>> trans = {'ARG':'R', 'HIS':'H', 'LEU':'L', 'LYS':'K'}
>>> "".join(trans[src[x:x+3]] for x in range(0, len(src), 3))
'RHLLK'

You just need to add the rest of the entries to the trans dict.

Edit:

To build the rest of trans, you can read it from a file. Contents of the file table:

Ala A
Arg R
Asn N
Asp D
Cys C
Glu E
Gln Q
Gly G
His H
Ile I
Leu L
Lys K
Met M
Phe F
Pro P
Ser S
Thr T
Trp W
Tyr Y
Val V

Read it:

with open("table") as f:
    trans = dict((l.upper(), s) for l, s in
                 (row.split() for row in f if row.strip()))
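
Whether the dict is typed by hand or read from a file, a quick sanity check is cheap. A small sketch under the assumption that the table holds exactly the 20 standard amino acids listed above; load_table and TABLE_TEXT are hypothetical names, and io.StringIO replaces the file so the snippet runs on its own:

```python
import io

# Contents of the "table" file shown above, inlined for the example.
TABLE_TEXT = """\
Ala A
Arg R
Asn N
Asp D
Cys C
Glu E
Gln Q
Gly G
His H
Ile I
Leu L
Lys K
Met M
Phe F
Pro P
Ser S
Thr T
Trp W
Tyr Y
Val V
"""


def load_table(handle):
    """Build a {'ALA': 'A', ...} dict from lines of 'Xxx X' pairs."""
    table = {}
    for row in handle:
        if not row.strip():
            continue                  # tolerate blank lines
        three, one = row.split()
        table[three.upper()] = one
    return table


trans = load_table(io.StringIO(TABLE_TEXT))

# 20 standard amino acids, each with a distinct one letter code.
assert len(trans) == 20
assert len(set(trans.values())) == 20

src = "METHISARGARGMET"
converted = "".join(trans[src[x:x + 3]] for x in range(0, len(src), 3))
print(converted)
```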

You may want to look into installing Biopython, since you are parsing a FASTA file and then converting to one letter codes. At the time of writing, Biopython only has the function seq3 (in the module Bio.SeqUtils), which does the inverse of what you want. Example output in IDLE:

>>> from Bio.SeqUtils import seq3
>>> seq3("MAIVMGRWKGAR*")
'MetAlaIleValMetGlyArgTrpLysGlyAlaArgTer'

There was no 'seq1' function at the time of writing (newer Biopython releases do provide Bio.SeqUtils.seq1), but I thought this might be helpful to you in the future. As for your problem, Junuxx is correct: create a dictionary and use a for loop to read the string in blocks of three and translate. Here is a function similar to the one he provided that is all-inclusive and also handles lower case input.

def AAcode_3_to_1(seq):
    '''Turn a three letter protein into a one letter protein.

    The 3 letter code can be upper, lower, or any mix of cases.
    The seq input length should be a multiple of 3, otherwise
    a ValueError is raised.

    >>> AAcode_3_to_1('METHISARGARGMET')
    'MHRRM'

    '''
    d = {'CYS': 'C', 'ASP': 'D', 'SER': 'S', 'GLN': 'Q', 'LYS': 'K',
         'ILE': 'I', 'PRO': 'P', 'THR': 'T', 'PHE': 'F', 'ASN': 'N',
         'GLY': 'G', 'HIS': 'H', 'LEU': 'L', 'ARG': 'R', 'TRP': 'W', 'TER': '*',
         'ALA': 'A', 'VAL': 'V', 'GLU': 'E', 'TYR': 'Y', 'MET': 'M', 'XAA': 'X'}

    if len(seq) % 3 != 0:
        raise ValueError("Sequence length is not a multiple of 3!")

    upper_seq = seq.upper()
    single_seq = ''
    # Integer division (//) so that range() also works on Python 3
    for i in range(len(upper_seq) // 3):
        single_seq += d[upper_seq[3*i:3*i+3]]
    return single_seq
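
If the input can contain codes outside this table (sequences derived from PDB files sometimes carry modified residues such as MSE), a KeyError may not be what you want. One possible variation, not part of the function above, is to fall back to 'X' for anything unknown. The name three_to_one_lenient and the unknown parameter are made up for this sketch:

```python
# Same table as in the function above, including the stop and unknown codes.
AA_3_TO_1 = {
    'CYS': 'C', 'ASP': 'D', 'SER': 'S', 'GLN': 'Q', 'LYS': 'K',
    'ILE': 'I', 'PRO': 'P', 'THR': 'T', 'PHE': 'F', 'ASN': 'N',
    'GLY': 'G', 'HIS': 'H', 'LEU': 'L', 'ARG': 'R', 'TRP': 'W', 'TER': '*',
    'ALA': 'A', 'VAL': 'V', 'GLU': 'E', 'TYR': 'Y', 'MET': 'M', 'XAA': 'X',
}


def three_to_one_lenient(seq, unknown='X'):
    """Convert three letter codes to one letter codes, any mix of cases.

    Codes missing from AA_3_TO_1 are replaced by `unknown`
    instead of raising a KeyError.
    """
    if len(seq) % 3 != 0:
        raise ValueError("Sequence length is not a multiple of 3!")
    upper_seq = seq.upper()
    return ''.join(AA_3_TO_1.get(upper_seq[i:i + 3], unknown)
                   for i in range(0, len(upper_seq), 3))


print(three_to_one_lenient('MetHisArgArgMet'))  # mixed case input
print(three_to_one_lenient('METMSELYSTER'))     # MSE is not in the table
```

Whether silently substituting 'X' is acceptable depends on what the one letter sequences are used for; raising an error is the safer default when every residue matters.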