Finding dead code in large python project [closed]
I've seen How can you find unused functions in Python code?, but that question is quite old and doesn't really answer mine.
I have a large python project with multiple libraries that are shared by multiple entry point scripts. This project has been accreting for many years with many authors, so there's a whole lot of dead code. You know the drill.
I know that finding all dead code is undecidable. All I need is a tool that will find all functions that are not called anywhere. We're not doing anything fancy like calling functions based on strings of function names, so I'm not worried about anything pathological...
I just installed pylint, but it appears to work file by file, paying little attention to inter-file dependencies, or even to dependencies between functions.
Clearly, I could grep for def in all of the files, get all of the function names from that, and do a grep for each of those function names. I'm just hoping that there's something a little smarter than that out there already.
ETA: Please note that I don't expect or want something perfect. I know the halting-problem proof as well as anyone (no, really, I've taught theory of computation; I know when I'm looking at something that's recursively enumerable). Anything that tries to approximate the answer by actually running the code will take far too long. I just want something that goes through the code syntactically and says "this function is definitely used, this function MIGHT be used, and this function is definitely NOT used; no one else even seems to know it exists!" And the first two categories aren't important.
You might want to try out vulture. It can't catch everything due to Python's dynamic nature, but it catches quite a bit, without needing a full test suite the way coverage.py and similar tools do.
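For example, from the shell (a sketch; myscript.py and mypackage/ are placeholders for your own entry points and libraries, and you should check vulture --help for the options in your installed version):

pip install vulture
vulture myscript.py mypackage/
vulture mypackage/ --min-confidence 100   # report only what vulture is most confident is unused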
Try running Ned Batchelder's coverage.py.
Coverage.py is a tool for measuring code coverage of Python programs. It monitors your program, noting which parts of the code have been executed, then analyzes the source to identify code that could have been executed but was not.
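A minimal session might look like this (my_entry_point.py is a placeholder for one of your scripts; you'd want to exercise every entry point to avoid false positives):

pip install coverage
coverage run my_entry_point.py
coverage report -m    # the Missing column lists the line numbers that never ran

Keep in mind this is dynamic analysis: a function is only marked as executed if your run actually reaches it, so anything not triggered during the run will show up as uncovered even if it is live.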
It is very hard to determine which functions and methods are called without executing the code, even if the code doesn't do any fancy stuff. Plain function invocations are rather easy to detect, but method calls are really hard. Just a simple example:
class A(object):
    def f(self):
        pass

class B(A):
    def f(self):
        pass

a = []
a.append(A())
a.append(B())
a[1].f()
Nothing fancy is going on here, but any script that tries to determine whether A.f() or B.f() is called will have a rather hard time doing so without actually executing the code.
While the above code doesn't do anything useful, it certainly uses patterns that appear in real code -- namely putting instances in containers. Real code will usually do even more complex things -- pickling and unpickling, hierarchical data structures, conditionals.
As stated before, just detecting plain function invocations of the form function(...) or module.function(...) will be rather easy. You can use the ast module to parse your source files. You will need to record all imports and the names used to import other modules. You will also need to track top-level function definitions and the calls inside those functions. This will give you a dependency graph, and you can use NetworkX to detect the connected components of this graph.
While this might sound rather complex, it can probably be done in less than 100 lines of code. Unfortunately, almost all major Python projects use classes and methods, so it will be of little help.
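Still, for the function-only case, a minimal sketch might look like this (assuming Python 3 with networkx installed; it only follows plain name(...) calls made inside top-level functions, and ignores methods, imports, and calls at module level):

import ast
import sys

import networkx as nx

def build_call_graph(paths):
    # Nodes are top-level function names; an edge f -> g means f calls g.
    graph = nx.DiGraph()
    for path in paths:
        with open(path) as source:
            tree = ast.parse(source.read(), filename=path)
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                graph.add_node(node.name)
                for call in ast.walk(node):
                    if isinstance(call, ast.Call) and isinstance(call.func, ast.Name):
                        graph.add_edge(node.name, call.func.id)
    return graph

if __name__ == "__main__":
    graph = build_call_graph(sys.argv[1:])
    # A function with no incoming edges is never called from inside another
    # top-level function: a candidate for dead code, modulo the caveats above.
    for name in sorted(n for n in graph.nodes if graph.in_degree(n) == 0):
        print(name)

With a known entry point you could instead use nx.descendants(graph, "main") to collect everything reachable and flag the rest, which is closer to the connected-components idea above.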
Here's the solution I'm using at least tentatively:
grep 'def ' *.py > defs
# ...
# edit defs so that it just contains the function names
# ...
for f in `cat defs`; do
    echo $f >> defCounts                 # record the function name
    cat *.py | grep -c $f >> defCounts   # count its occurrences across all files
    echo >> defCounts
done
Then I look at the individual functions that have very few references (fewer than 3, say).
It's ugly, and it only gives me approximate answers, but I think it's good enough for a start. What are your thoughts?
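For comparison, here is the same counting idea in Python, which sidesteps the shell quoting and the manual editing of defs (a rough sketch; the def regex is just as naive as the grep version):

import glob
import re

# Concatenate every .py file in the current directory.
text = "\n".join(open(path).read() for path in glob.glob("*.py"))

# Naively collect defined function names, then count how often each name
# appears anywhere. A count of 1 means the name only occurs at its own
# definition; a low count suggests few (or no) call sites.
names = set(re.findall(r"^\s*def\s+(\w+)", text, re.MULTILINE))
for name in sorted(names):
    count = len(re.findall(r"\b" + name + r"\b", text))
    if count < 3:
        print(count, name)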