Remove all text before colon
I have a file containing a certain number of lines. Each line looks like this:
TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1
I would like to remove everything before the ":" character so that only PKMYT1, which is a gene name, remains. Since I'm not an expert in regex, can anyone help me do this with Unix tools (sed or awk) or in R?
Here are two ways of doing it in R:
foo <- "TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1"
# Remove all before and up to ":":
gsub(".*:","",foo)
# Extract everything after ":":
regmatches(foo,gregexpr("(?<=:).*",foo,perl=TRUE))
A simple regular expression used with gsub():
x <- "TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1"
gsub(".*:", "", x)
"PKMYT1"
See ?regex or ?gsub for more help.
There are certainly more than 2 ways in R. Here's another.
unlist(lapply(strsplit(foo, ':', fixed = TRUE), '[', 2))
If the string has a constant length, I imagine substr would be faster than this or the regex methods.
Using sed:
sed 's/.*://' < your_input_file > output_file
This will replace anything followed by a colon with nothing, so it'll remove everything up to and including the last colon on each line (because * is greedy by default).
As per Josh O'Brien's comment, if you wanted to only replace up to and including the first colon, do this:
sed "s/[^:]*://"
That will match anything that isn't a colon, followed by one colon, and replace with nothing.
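The difference between the two patterns only shows up when a line contains more than one colon. Here is a minimal sketch, assuming a made-up sample line (the value of `line` is hypothetical, not taken from the original file). Since the question also mentions awk, the sketch includes awk and cut commands that should give the same two results.

```shell
# Hypothetical sample line with two colons (not from the asker's file)
line='dir/a.adj:GENE1:extra'

# Greedy .* : removes everything through the LAST colon
last=$(printf '%s\n' "$line" | sed 's/.*://')

# [^:]* cannot cross a colon: removes everything through the FIRST colon
first=$(printf '%s\n' "$line" | sed 's/[^:]*://')

# awk: split on ":" and print the last field (same result as the greedy sed)
awk_last=$(printf '%s\n' "$line" | awk -F: '{print $NF}')

# cut: keep field 2 onward (same result as the first-colon sed)
cut_first=$(printf '%s\n' "$line" | cut -d: -f2-)

printf '%s\n' "$last" "$first" "$awk_last" "$cut_first"
```

For the line in the question, which has a single colon, all four commands should print PKMYT1.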
Note that both of these patterns stop after the first match on each line. If you want the replacement to happen for every match on a line, add the 'g' (global) flag to the end of the substitution, e.g. s/[^:]*://g.
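To see what the global flag changes, here is a small sketch using a made-up three-field line (the value of `line` is an assumption for illustration only):

```shell
# Hypothetical sample line (not from the asker's file)
line='a:b:c'

# Without g: only the first "non-colons then colon" run is removed
once=$(printf '%s\n' "$line" | sed 's/[^:]*://')

# With g: every such run on the line is removed
all=$(printf '%s\n' "$line" | sed 's/[^:]*://g')

printf '%s\n' "$once" "$all"
```

With the greedy `.*:` pattern the flag makes no practical difference, because the first match already consumes everything up to the last colon.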
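Also note that on Linux (GNU sed) you can edit a file in place with a bare -i. On OS X (BSD sed), -i requires a backup suffix argument, which may be empty: -i ''. For example, on Linux: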
sed -i 's/.*://' your_file
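If the same command has to run on both Linux and OS X, one option is to attach a backup suffix directly to -i, a form that both GNU and BSD sed accept. A minimal sketch, assuming a scratch directory and a hypothetical file name genes.txt:

```shell
# Work in a throwaway directory; genes.txt is a hypothetical file name
tmp=$(mktemp -d)
printf '%s\n' 'TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1' > "$tmp/genes.txt"

# Edit in place and keep the original as genes.txt.bak
sed -i.bak 's/.*://' "$tmp/genes.txt"

result=$(cat "$tmp/genes.txt")      # edited content
backup=$(cat "$tmp/genes.txt.bak")  # untouched original line

printf '%s\n' "$result"
```

The .bak copy also serves as a safety net in case the pattern removes more than intended; delete it once the output has been checked.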