How do I find the formatting for a subset of text in an Excel document cell

Using Python, I need to find all substrings in a given Excel sheet cell that are either bold or italic.

My problem is similar to this:

Using XLRD module and Python to determine cell font style (italics or not)

..but the solution is not applicable for me as I cannot assume that the same formatting holds for all content in the cell. The value in a single cell can look like this:


1. Some bold text Some normal text. Some italic text.


Is there a way to find the formatting of a range of characters in a cell using xlrd (or any other Python Excel module)?


Solution 1:

Thanks to @Vyassa for all of the right pointers, I've been able to write the following code which iterates over the rows in a XLS file and outputs style information for cells with "single" style information (e.g., the whole cell is italic) or style "segments" (e.g., part of the cell is italic, part of it is not).

import xlrd

# accessing Column 'C' in this example
COL_IDX = 2

book = xlrd.open_workbook('your-file.xls', formatting_info=True)
first_sheet = book.sheet_by_index(0)

for row_idx in range(first_sheet.nrows):
  text_cell = first_sheet.cell_value(row_idx, COL_IDX)
  text_cell_xf = book.xf_list[first_sheet.cell_xf_index(row_idx, COL_IDX)]

  # skip rows where cell is empty
  if not text_cell:
    continue
  print text_cell,

  text_cell_runlist = first_sheet.rich_text_runlist_map.get((row_idx, COL_IDX))
  if text_cell_runlist:
    print '(cell multi style) SEGMENTS:'
    segments = []
    for segment_idx in range(len(text_cell_runlist)):
      start = text_cell_runlist[segment_idx][0]
      # the last segment starts at given 'start' and ends at the end of the string
      end = None
      if segment_idx != len(text_cell_runlist) - 1:
        end = text_cell_runlist[segment_idx + 1][0]
      segment_text = text_cell[start:end]
      segments.append({
        'text': segment_text,
        'font': book.font_list[text_cell_runlist[segment_idx][1]]
      })
    # segments did not start at beginning, assume cell starts with text styled as the cell
    if text_cell_runlist[0][0] != 0:
      segments.insert(0, {
        'text': text_cell[:text_cell_runlist[0][0]],
        'font': book.font_list[text_cell_xf.font_index]
      })

    for segment in segments:
      print segment['text'],
      print 'italic:', segment['font'].italic,
      print 'bold:', segment['font'].bold

  else:
    print '(cell single style)',
    print 'italic:', book.font_list[text_cell_xf.font_index].italic,
    print 'bold:', book.font_list[text_cell_xf.font_index].bold

Solution 2:

xlrd can do this. You must call load_workbook() with the kwarg formatting_info=True, then sheet objects will have an attribute rich_text_runlist_map which is a dictionary mapping cell coordinates ((row, col) tuples) to a runlist for that cell. A runlist is a sequence of (offset, font_index) pairs where offset tells you where in the cell the font begins, and font_index indexes into the workbook object's font_list attribute (the workbook object is what's returned by load_workbook()), which gives you a Font object describing the properties of the font, including bold, italics, typeface, size, etc.

Solution 3:

I don't know if you can do that with xlrd, but since you ask about any other Python Excel module: openpyxl cannot do this in version 1.6.1.

The rich text gets reconstructed away in function get_string() in openpyxl/reader/strings.py. It would be relatively easy to setup a second table with 'raw' strings in that module.