How to find string data-type that includes a number in Pandas DataFrame
I have a DataFrame with two columns. One column contains string values that may or may not represent numbers (integer or float).
Sample:
import pandas as pd
import numpy as np
data = [('A', '>10'),
        ('B', '10'),
        ('C', '<10'),
        ('D', '10'),
        ('E', '10-20'),
        ('F', '20.0'),
        ('G', '25.1')]
data_df = pd.DataFrame(data, columns=['name', 'value'])
Entries in column value have string data type, but their values might be numeric or not.
What I want to get:
- Find which rows have numeric values in column value.
- Remove other rows from dataset.
Final result will look like:
name value
'B' 10
'D' 10
'F' 20.0
'G' 25.1
I tried to use the isnumeric() function, but it returns True only for integers (not floats).
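To illustrate the behaviour described above, here is a minimal sketch using the sample values. The names values and is_numeric_chars are made up for this illustration. Series.str.isnumeric() checks whether every character in the string is a numeric character, so the decimal point is enough to make it return False:

```python
import pandas as pd

# Same strings as the 'value' column in the sample data
values = pd.Series(['>10', '10', '<10', '10', '10-20', '20.0', '25.1'])

# True only when every character is a numeric character,
# so '20.0' and '25.1' fail because of the '.'
is_numeric_chars = values.str.isnumeric()
print(is_numeric_chars.tolist())
# [False, True, False, True, False, False, False]
```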
If you have any idea to solve this problem, please let me know.
Updated Question (multi columns):
(The same question when there are more than one column with numeric values)
Similarly, I have a DataFrame with three columns. Two columns contain string values that may or may not include numbers (integer or float).
Sample:
import pandas as pd
import numpy as np
data = [('A', '>10', 'ABC'),
        ('B', '10', '15'),
        ('C', '<10', '>10'),
        ('D', '10', '15'),
        ('E', '10-20', '10-30'),
        ('F', '20.0', 'ABC'),
        ('G', '25.1', '30.1')]
data_df = pd.DataFrame(data, columns=['name', 'value1', 'value2'])
Entries in columns value1 and value2 have string data type, but their values might be numeric or not.
What I want to get:
- Find which rows have numeric values in columns value1 and value2.
- Remove other rows from dataset.
Final result will look like:
name value1 value2
'B' 10 15
'D' 10 15
'G' 25.1 30.1
You can use pandas.to_numeric with errors='coerce', then dropna to remove the invalid rows:
(data_df.assign(value=pd.to_numeric(data_df['value'], errors='coerce'))
.dropna(subset=['value'])
)
NB. This upcasts the integers to floats, because a Series holds a single dtype; upcasting is preferable to forcing an object dtype.
output:
name value
1 B 10.0
3 D 10.0
5 F 20.0
6 G 25.1
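As a rough sketch of what happens before dropna runs, you can look at the converted column on its own. The names raw and converted are only for illustration. Strings that cannot be parsed become NaN, which is also why the result has a float dtype:

```python
import pandas as pd

# Same strings as the 'value' column in the sample data
raw = pd.Series(['>10', '10', '<10', '10', '10-20', '20.0', '25.1'])

# errors='coerce' turns unparseable strings into NaN instead of raising
converted = pd.to_numeric(raw, errors='coerce')

print(converted.dtype)            # float64
print(converted.isna().tolist())  # True marks the rows that dropna removes
# [True, False, True, False, True, False, False]
```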
If you just want to slice the rows and keep the string type:
data_df[pd.to_numeric(data_df['value'], errors='coerce').notna()]
output:
name value
1 B 10
3 D 10
5 F 20.0
6 G 25.1
Updated question (multi columns)
Build a mask and use any/all prior to slicing:
mask = data_df[data_df.columns[1:]].apply(pd.to_numeric, errors='coerce').notna().all(axis=1)
data_df[mask]
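For comparison, here is a small self-contained sketch of the difference between all and any on the three-column sample. The names numeric_cells, all_numeric and any_numeric are hypothetical, and the value columns are selected by name here rather than by position:

```python
import pandas as pd

data = [('A', '>10', 'ABC'),
        ('B', '10', '15'),
        ('C', '<10', '>10'),
        ('D', '10', '15'),
        ('E', '10-20', '10-30'),
        ('F', '20.0', 'ABC'),
        ('G', '25.1', '30.1')]
data_df = pd.DataFrame(data, columns=['name', 'value1', 'value2'])

# Boolean frame: True where the cell parses as a number
numeric_cells = (data_df[['value1', 'value2']]
                 .apply(pd.to_numeric, errors='coerce')
                 .notna())

# all: every value column must be numeric
all_numeric = data_df[numeric_cells.all(axis=1)]
# any: at least one value column must be numeric
any_numeric = data_df[numeric_cells.any(axis=1)]

print(all_numeric['name'].tolist())  # ['B', 'D', 'G']
print(any_numeric['name'].tolist())  # ['B', 'D', 'F', 'G']
```

With all you get the result asked for in the question; any would also keep row F, where only value1 is numeric.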