Ubuntu auto delete oldest file in directory when disk is above 90% capacity, repeat until capacity below 80%
I have found a few similar cron job scripts, but nothing that does exactly what I need, and I don't know enough about Linux scripting to try to modify the code myself for this sort of job, which could turn disastrous.
Essentially I have IP cameras that record to /home/ben/ftp/surveillance/, but I need to ensure there is always enough space on the disk for them to do so.
Would someone please be able to guide me on how I can set up a cron job to:
- Check if /dev/sbd/ has reached 90% capacity.
- If so, delete the oldest file in /home/ben/ftp/surveillance/ (including files in subfolders).
- Repeat this until /dev/sbd/ capacity is below 80%.
- Repeat every 10 minutes.
Solution 1:
Writing these kinds of scripts for people always makes me nervous because, in the event anything goes wrong, one of three things will happen:
- I'll kick myself for what's probably a n00b-level typo
- Death threats will come my way because someone blindly copy/pasted without:
  - making an effort to understand the script
  - testing the script
  - having a reasonable backup in place
- All of the above
So, to reduce the risk of all three, here is a starter kit for you:
#!/bin/sh
DIR=/home/ben/ftp/surveillance
ACT=90
df -k $DIR | grep -vE '^Filesystem' | awk '{ print $5 " " $1 }' | while read output;
do
echo $output
usep=$(echo $output | awk '{ print $1}' | cut -d'%' -f1 )
partition=$(echo $output | awk '{ print $2 }' )
if [ $usep -ge $ACT ]; then
echo "Running out of space \"$partition ($usep%)\" on $(hostname) as on $(date)"
oldfile=$(ls -dltr $DIR/*.gz|awk '{ print $9 }' | head -1)
echo "Let's Delete \"$oldfile\" ..."
fi
done
THINGS TO NOTE:
- This script deletes nothing
- DIR is the directory to work with
- ACT is the minimum percentage required to act
- Only one file – the oldest – is selected for "deletion"
- You will want to replace *.gz with the actual file type of your surveillance videos. DO NOT USE *.* OR * BY ITSELF!
- If the partition containing DIR is at a capacity greater than ACT, you will see a message like this:
  97% /dev/sda2
  Running out of space "/dev/sda2 (97%)" on ubuntu-vm as on Wed Jan 12 07:52:20 UTC 2022
  Let's Delete "/home/ben/ftp/surveillance/1999-12-31-video.gz" ...
  Again, this script will not delete anything.
- If you are satisfied with the output, then you can continue to modify the script to delete/move/archive as you see fit
Test often. Test well. And remember: when putting rm in a script, there is no undo.
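Once you are comfortable with what the starter prints, a destructive version might look something like the sketch below. This is only a sketch, not a drop-in answer: it assumes the recordings are .gz files (adjust the -name pattern), that file names contain no newlines, and that df reports the filesystem for $DIR on a single line. Test it with the rm line commented out first.

#!/bin/sh
# Sketch only: assumes .gz recordings, no newlines in file names,
# and that df prints the data for $DIR on a single line.
DIR=/home/ben/ftp/surveillance
ACT=90      # start deleting at this usage percentage
STOP=80     # stop once usage falls below this percentage

usage() {
    df -k "$DIR" | awk 'NR==2 { sub("%", "", $5); print $5 }'
}

if [ "$(usage)" -ge "$ACT" ]; then
    while [ "$(usage)" -ge "$STOP" ]; do
        # Oldest matching file anywhere under $DIR
        oldfile=$(find "$DIR" -type f -name '*.gz' -printf '%T@ %p\n' | sort -n | head -1 | cut -d' ' -f2-)
        [ -n "$oldfile" ] || break      # nothing left to delete
        echo "Deleting $oldfile"
        rm -f -- "$oldfile"
    done
fi

Scheduled from cron every 10 minutes, this gives the behaviour described in the question, but it inherits every caveat above: keep the rm line commented out until you trust the output.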
Solution 2:
I would use Python for such a task. It might lead to more code than a pure bash solution, but:
- it's (IMO) easier to test: just use the pytest or unittest module
- it's readable for non-Linux people (well, except the get_device function, which is Linux-specific...)
- it's easier to get started (again, IMO)
- What if you want to send some emails? Or trigger new actions? Scripts can be enriched easily with a programming language like Python.
Since Python 3.3, the shutil module has come with a function named disk_usage. It can be used to get the disk usage based on a given directory. The minor problem is that I don't know how to easily get the name of the disk, i.e. /dev/sdb, even though it's possible to get its disk usage (using any directory mounted on /dev/sdb, in my case $HOME for example). I wrote a function called get_device for this purpose.
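As a quick illustration of what disk_usage returns (a named tuple of total, used and free bytes), you can run a one-liner like this from a shell before looking at the full script; /home is just an example path:

python3 -c "from shutil import disk_usage; u = disk_usage('/home'); print(u); print(round(u.used / u.total * 100, 2), '% used')"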
#!/usr/bin/env python3
import argparse
from os.path import getmtime
from shutil import disk_usage, rmtree
from sys import exit
from pathlib import Path
from typing import Iterator, Tuple
def get_device(path: Path) -> str:
"""Find the mount for a given directory. This is needed only for logging purpose."""
# Read /etc/mtab to learn about mount points
mtab_entries = Path("/etc/mtab").read_text().splitlines()
# Create a dict of mount points and devices
mount_points = dict([list(reversed(line.split(" ")[:2])) for line in mtab_entries])
# Find the mount point of given path
while path.resolve(True).as_posix() not in mount_points:
path = path.parent
# Return device associated with mount point
return mount_points[path.as_posix()]
def get_directory_and_device(path: str) -> Tuple[str, Path]:
"""Exit the process if directory does not exist."""
fs_path = Path(path)
# Path must exist
if not fs_path.exists():
print(f"ERROR: No such directory: {path}")
exit(1)
# And path must be a valid directory
if not fs_path.is_dir():
print(f"Path must be a directory and not a file: {path}")
exit(1)
# Get the device
device = get_device(fs_path)
return device, fs_path
def get_disk_usage(path: Path) -> float:
    # shutil.disk_usage supports Path-like objects, so no need to cast to string
usage = disk_usage(path)
# Get disk usage in percentage
return usage.used / usage.total * 100
def remove_file_or_directory(path: Path) -> None:
"""Remove given path, which can be a directory or a file."""
# Remove files
if path.is_file():
path.unlink()
# Recursively delete directory trees
if path.is_dir():
rmtree(path)
def find_oldest_files(
path: Path, pattern: str = "*", threshold: int = 80
) -> Iterator[Path]:
"""Iterate on the files or directories present in a directory which match given pattern."""
# List the files in the directory received as argument and sort them by age
files = sorted(path.glob(pattern), key=getmtime)
# Yield file paths until usage is lower than threshold
for file in files:
usage = get_disk_usage(path)
if usage < threshold:
break
yield file
def check_and_clean(
path: str,
threshold: int = 80,
remove: bool = False,
) -> None:
"""Main function"""
device, fspath = get_directory_and_device(path)
    # Get the current disk usage as a percentage of total capacity
    usage = get_disk_usage(fspath)
    # Take action if needed
    if usage > threshold:
        print(
            f"Disk usage is greater than threshold: {usage:.2f}% > {threshold}% ({device})"
        )
# Iterate over files to remove
for file in find_oldest_files(fspath, "*", threshold):
print(f"Removing file {file}")
if remove:
remove_file_or_directory(file)
def main() -> None:
parser = argparse.ArgumentParser(
description="Purge old files when disk usage is above limit."
)
parser.add_argument(
"path", help="Directory path where files should be purged", type=str
)
parser.add_argument(
"--threshold",
"-t",
metavar="T",
help="Usage threshold in percentage",
type=int,
default=80,
)
parser.add_argument(
"--remove",
"--rm",
help="Files are not removed unless --removed or --rm option is specified",
action="store_true",
default=False,
)
args = parser.parse_args()
check_and_clean(
args.path,
threshold=args.threshold,
remove=args.remove,
)
if __name__ == "__main__":
main()
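Assuming the script is saved as purge_old_files.py (the file name and paths below are just examples), a dry run, a real run, and a possible crontab entry would look like this; without --rm the script only reports what it would delete:

# Dry run: list the files that would be removed to bring usage below 80%
python3 purge_old_files.py /home/ben/ftp/surveillance --threshold 80

# Actually remove them
python3 purge_old_files.py /home/ben/ftp/surveillance --threshold 80 --rm

# Possible crontab entry (crontab -e) to repeat the check every 10 minutes
*/10 * * * * /usr/bin/python3 /home/ben/purge_old_files.py /home/ben/ftp/surveillance -t 80 --rm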
If you need to orchestrate many tasks using cron, it might be worth putting together some Python code as a library and reusing it across those tasks.
EDIT: I finally added the CLI part to the script; I think I'll use it myself 😅
Solution 3:
Check if /dev/sbd/ has reached 90% capacity. If so, then delete the oldest file in /home/ben/ftp/surveillance/ (and files in subfolders). Repeat this until /dev/sbd/ capacity is below 80%. Repeat every 10 minutes.
The script below will do exactly that (provided that you add it to your crontab to run at 10-minute intervals). Be extra sure this is what you really want to do, since it could easily erase all files in /home/ben/ftp/surveillance/ if your disk is filling up somewhere outside this directory.
#!/bin/sh
directory='/home/ben/ftp/surveillance'
max_usage=90
goal_usage=80
[ -d "$directory" ] || exit 1
[ "$max_usage" -gt "$goal_usage" ] || exit 1
[ "$( df --output=pcent $directory | \
grep -Ewo '[0-9]+' )" -ge "$max_usage" ] || exit 0
dev_used="$( df -B 1K --output=used $directory | \
grep -Ewo '[0-9]+' )"
goal_usage="$( printf "%.0f" \
$( echo ".01 * $goal_usage * \
$( df -B 1K --output=size $directory | \
grep -Ewo '[0-9]+' )" | bc ) )"
echo "$( find $directory -type f -printf '%Ts,%k,\047%p\047\n' )" | \
sort -k1 | \
awk -F, -v goal="$(($dev_used-$goal_usage))" '\
(sum+$2)>goal{printf "%s ",$3; exit} \
(sum+$2)<=goal{printf "%s ",$3}; {sum+=$2}' | \
xargs rm
How this script works:
The first 3 lines after the shebang are the variables per your parameters:
- directory is the full path to the parent directory containing the files and subdirectories from which you want to remove old files (i.e., /home/ben/ftp/surveillance). The quotes around this value are not necessary unless the path contains spaces.
- max_usage is the percent of disk capacity that will trigger the old-file deletion actions (i.e., 90 percent).
- goal_usage is the percent of disk capacity you want to achieve after deleting old files (i.e., 80 percent).
Note that the values of max_usage and goal_usage must be integers.
[ -d "$directory" ] || exit 1
- Checks that directory exists; otherwise the script ends and exits with status 1.
[ "$max_usage" -gt "$goal_usage" ] || exit 1
- Checks that max_usage is greater than goal_usage; otherwise the script ends and exits with status 1.
[ "$( df --output=pcent $directory | \
grep -Ewo '[0-9]+' )" -ge "$max_usage" ] || exit 0
- Gets the current disk capacity percent used and checks whether it meets or exceeds the threshold set by max_usage. If not, further processing is not required, so the script ends and exits with status 0.
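For example, on a hypothetical system where the filesystem holding the surveillance directory is 91% full, the grep strips everything but the number that the test then compares against max_usage:

$ df --output=pcent /home/ben/ftp/surveillance
Use%
 91%
$ df --output=pcent /home/ben/ftp/surveillance | grep -Ewo '[0-9]+'
91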
dev_used="$( df -B 1K --output=used $directory | \
grep -Ewo '[0-9]+' )"
- Gets the disk capacity currently used, in kilobytes.
goal_usage="$( printf "%.0f" \
$( echo ".01 * $goal_usage * \
$( df -B 1K --output=size $directory | \
grep -Ewo '[0-9]+' )" | bc ) )"
- Converts the goal_usage variable to kilobytes (we'll need this value further down).
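As a worked example with a made-up size: if df -B 1K --output=size reports 100000000 1K-blocks and goal_usage starts out as 80, the bc expression and the printf rounding turn goal_usage into 80000000 (kilobytes):

$ echo ".01 * 80 * 100000000" | bc
80000000.00
$ printf "%.0f" 80000000.00
80000000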
find $directory -type f -printf '%Ts,%k,\047%p\047\n'
- Locates all files in directory (and in all of its subdirectories) and makes a list of these files, one per line, formatted as timestamp,size in kilobytes,'full/path/to/file'. Note that the 'full/path/to/file' is enclosed in single quotes so spaces in the names of files or directories will not cause problems later.
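Each line of that list looks something like this (the timestamp, size and file name are invented for illustration):

1641873140,2048,'/home/ben/ftp/surveillance/cam1/2022-01-10_224500.mp4'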
sort -k1
- Sorts the previously echo'd list of files by timestamp (oldest first).
awk -F, -v goal="$(($dev_used-$goal_usage))"
- awk creates an internal variable goal equal to the difference between dev_used and goal_usage; this is the total kilobytes' worth of files that must be removed in order to bring the disk capacity percent down to the goal_usage set at the start of the script.
(sum+$2)>goal{printf "%s ",$3; exit} \
(sum+$2)<=goal{printf "%s ",$3}; {sum+=$2}'
- awk (continued) processes the list, keeping a running sum of the field 2 values (size in kilobytes) and printing the field 3 values ('full/path/to/file') into a space-separated string, until the sum of kilobytes from field 2 becomes greater than goal, at which point awk stops processing additional lines.
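To see the selection logic in isolation, here is a toy run with invented file entries and a goal of 3000 KB: the first two files are printed (their combined size crosses the goal) and awk exits before reaching the third:

$ printf '%s\n' \
    "1641000000,2048,'/srv/demo/old1.mp4'" \
    "1641000600,2048,'/srv/demo/old2.mp4'" \
    "1641001200,4096,'/srv/demo/old3.mp4'" | \
    sort -k1 | \
    awk -F, -v goal=3000 '(sum+$2)>goal{printf "%s ",$3; exit} (sum+$2)<=goal{printf "%s ",$3}; {sum+=$2}'
'/srv/demo/old1.mp4' '/srv/demo/old2.mp4'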
xargs rm
- The string of 'full/path/to/file' values from awk is piped to xargs, which runs the rm command with that string as its arguments. This removes those files.
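Finally, to get the every-10-minutes behaviour from the question, save the script somewhere (the path below is just an example), make it executable, and add a line like this with crontab -e:

$ chmod +x /home/ben/bin/purge_surveillance.sh
$ crontab -e
# then add this line so cron runs the script every 10 minutes:
*/10 * * * * /home/ben/bin/purge_surveillance.sh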