Using Pandas to pd.read_excel() for multiple worksheets of the same workbook
I have a large spreadsheet file (.xlsx) that I'm processing using python pandas. It happens that I need data from two tabs (sheets) in that large file. One of the tabs has a ton of data and the other is just a few square cells.
When I use pd.read_excel()
on any worksheet, it looks to me like the whole file is loaded (not just the worksheet I'm interested in). So when I use the method twice (once for each sheet), I effectively have to suffer the whole workbook being read in twice (even though we're only using the specified sheet).
How do I only load specific sheet(s) with pd.read_excel()
?
Solution 1:
Try pd.ExcelFile
:
xls = pd.ExcelFile('path_to_file.xls')
df1 = pd.read_excel(xls, 'Sheet1')
df2 = pd.read_excel(xls, 'Sheet2')
As noted by @HaPsantran, the entire Excel file is read in during the ExcelFile()
call (there doesn't appear to be a way around this). This merely saves you from having to read the same file in each time you want to access a new sheet.
Note that the sheet_name
argument to pd.read_excel()
can be the name of the sheet (as above), an integer specifying the sheet number (eg 0, 1, etc), a list of sheet names or indices, or None
. If a list is provided, it returns a dictionary where the keys are the sheet names/indices and the values are the data frames. The default is to simply return the first sheet (ie, sheet_name=0
).
If None
is specified, all sheets are returned, as a {sheet_name:dataframe}
dictionary.
Solution 2:
There are a few options:
Read all sheets directly into an ordered dictionary.
import pandas as pd
# for pandas version >= 0.21.0
sheet_to_df_map = pd.read_excel(file_name, sheet_name=None)
# for pandas version < 0.21.0
sheet_to_df_map = pd.read_excel(file_name, sheetname=None)
Read the first sheet directly into dataframe
df = pd.read_excel('excel_file_path.xls')
# this will read the first sheet into df
Read the excel file and get a list of sheets. Then chose and load the sheets.
xls = pd.ExcelFile('excel_file_path.xls')
# Now you can list all sheets in the file
xls.sheet_names
# ['house', 'house_extra', ...]
# to read just one sheet to dataframe:
df = pd.read_excel(file_name, sheet_name="house")
Read all sheets and store it in a dictionary. Same as first but more explicit.
# to read all sheets to a map
sheet_to_df_map = {}
for sheet_name in xls.sheet_names:
sheet_to_df_map[sheet_name] = xls.parse(sheet_name)
# you can also use sheet_index [0,1,2..] instead of sheet name.
Thanks @ihightower for pointing it out way to read all sheets and @toto_tico,@red-headphone for pointing out the version issue.
sheetname : string, int, mixed list of strings/ints, or None, default 0 Deprecated since version 0.21.0: Use sheet_name instead Source Link
Solution 3:
You can also use the index for the sheet:
xls = pd.ExcelFile('path_to_file.xls')
sheet1 = xls.parse(0)
will give the first worksheet. for the second worksheet:
sheet2 = xls.parse(1)