Grouping data columns by shared values

I don't know how to properly describe what I need to do, so I will give an example. A colleague has a data set in Excel like so:

Col A    Col B    Col C
aaaaa    aaaaa    bbbbb
bbbbb    ccccc    ccccc
ccccc    ddddd    eeeee

The end result should be something like this:

Col A    Col B    Col C
aaaaa    aaaaa
bbbbb             bbbbb
ccccc    ccccc    ccccc
         ddddd
                  eeeee

Or even:

      Col A    Col B    Col C
aaaaa Yes      Yes      No
bbbbb Yes      No       Yes

etc.

(if it helps, the columns are protein extraction methods and the letters are protein IDs - we need to determine which proteins are extracted by which methods)

My colleague is doing this by hand, but there is enough data that it would be really helpful to automate it.

Is there a formula in Excel to do this automatically?


This is not a “turn-key” solution, but if you have thousands of rows, this may save you some effort.  (Do this in a scratch copy of your file, just in case something blows up or melts down, because “Undo” doesn’t always work.)  Note: this procedure was developed for Excel 2007 (but I have re-verified it in Excel 2013).

First, copy all your data into a scratch column; let’s call it V.  Note that you must copy the heading from Column A, or else put some dummy value in cell V1.

Illustration of copying data

Now go to the “Data” tab, “Sort & Filter” group, and click on “Advanced”:

                        “Sort & Filter” group, with “Advanced” highlighted

This will bring up the “Advanced Filter” dialog box:

                  “Advanced Filter” dialog box

Verify that “List range” shows your data in Column V.  Select “Copy to another location” and “Unique records only”.  Type “W1” in the “Copy to” field — or click in the field, and then click in W1 (there are several techniques that will get the same result).  Click on “OK”.  You should get something like this:

                              Data from V copied into W with duplicates removed

i.e., a list of your unique data values.  You may need to sort Column W.

Now enter =NOT(ISNA(VLOOKUP($W2,A$2:A$4,1,FALSE))) in X2 (replace the 4 with the number of the last row that contains data), and drag/fill down to match Column W (i.e., one row for each unique value in your original data) and to the right to Column Z (i.e., the number of columns in your data).

                              table showing TRUE/FALSE for whether each value is in each column

This gives you a truth table corresponding to the second form of the desired result in the question (but with “TRUE” and “FALSE” instead of “Yes” and “No”).  For example,

  • X2 is TRUE because Column A contains “aaaaa”,
  • X3 is TRUE because Column A contains “bbbbb”,
  • Y2 is TRUE because Column B contains “aaaaa”,
  • Y3 is FALSE because Column B does not contain “bbbbb”, etc.

Delete column V, and fix the headings (in Row 1) at your leisure.  If you don’t want to keep Columns A-C in the spreadsheet, then copy Columns W-Z and paste values.


Some explanation on the formula: The formula I have presented above is for use in Column X, which corresponds to Column A.  Since I used $W2, this is an absolute reference to Column W and it will refer to cell Wn when the formula is dragged/filled to row n of any column.  By contrast, A$2:A$4 is an absolute reference to Rows 2 through 4, but a relative reference to Column A.  When the formula is dragged to Column Y, this reference will automatically change to B$2:B$4.  When the formula is dragged to Column Z, this reference will automatically change to C$2:C$4.