We cannot always avoid the details of file management, especially when analyzing raw data. Raw data often comes as multiple files distributed over multiple directories, so accessing those files and navigating the directory system can be a critical part of processing it. Luckily, Python has very convenient methods to handle files and directories in the os module. Here we will look at functions that inspect the contents of directories (os.listdir, os.walk) and functions that help us manipulate files and directories (os.rename, os.remove, os.mkdir).
One of the most important functions to manage files is os.listdir(directory). It returns a list of strings containing the names of all directories and files in directory.
    import os
    directory = r"C:\Users\Daniel\tutorial"
    dir_content = os.listdir(directory)
    # ['images', 'file_1.txt', 'file_2.txt']
Now that we know what is in our directory, how do we find out which of these are files and which are subdirectories? You might be tempted to just check for a period (.) inside the string because it separates the file name from its extension. This method is error prone, because directory names can contain periods and files don’t necessarily have extensions. It is much better to use os.path.isfile and os.path.isdir:
    files = [x for x in dir_content if os.path.isfile(directory + os.sep + x)]
    # ['file_1.txt', 'file_2.txt']
    dirs = [x for x in dir_content if os.path.isdir(directory + os.sep + x)]
    # ['images']
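The same split can also be written with os.path.join, which inserts the separator for us. The following is only a minimal sketch: the helper name split_entries and the entries created in the temporary directory (data.v2, README) are made up for illustration. They are chosen to show why checking for a period is unreliable: data.v2 is a directory with a period and README is a file without an extension.

```python
import os
import tempfile

def split_entries(directory):
    """Return (files, dirs) found directly inside `directory` (hypothetical helper)."""
    files, dirs = [], []
    for name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, name)  # adds the separator for us
        if os.path.isfile(full_path):
            files.append(name)
        elif os.path.isdir(full_path):
            dirs.append(name)
    return files, dirs

# Demo in a throwaway directory; all names below are illustrative.
with tempfile.TemporaryDirectory() as tmp:
    for dir_name in ("images", "data.v2"):
        os.mkdir(os.path.join(tmp, dir_name))
    for file_name in ("file_1.txt", "README"):
        open(os.path.join(tmp, file_name), "w").close()
    files, dirs = split_entries(tmp)
    print(files)  # ['README', 'file_1.txt']
    print(dirs)   # ['data.v2', 'images']
```

A check for "." in the name would have classified data.v2 as a file and README as a directory, while the os.path checks get both right.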
Now we know the contents of directory: two files and one sub-directory. In case you were wondering about os.sep, that is the directory separator of the operating system. On my Windows 10 machine that is the backslash (\). What if we need both the files in our directory and those in its sub-directories? This is the perfect case for os.walk(). It gives a convenient way to loop through a directory and all its sub-directories.
    for root, dirs, files in os.walk(directory):
        print(root)
        print(dirs)
        print(files)
    # C:\Users\Daniel\tutorial
    # ['images']
    # ['file_1.txt', 'file_2.txt']
    # C:\Users\Daniel\tutorial\images
    # []
    # ['plot_1.png', 'plot_2.png']
os.walk() goes from the top down. The first name, root, tells us the full path of the directory we are currently in. Printing root shows that we start at the top directory. While root is a string, both dirs and files are lists. They tell us which directories and files the current root contains. For the first directory we already found out the contents on our own: two text files and a sub-directory. Our loop then moves on to the sub-directory images, which contains no further sub-directories but two image files. If there were more sub-directories at any level (directory or directory\images), os.walk would go through all of them. Next we will find out how to create, move, rename and delete files and directories.
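Before moving on, here is one common use of the walk described above: collecting the full path of every file at any depth. This is a minimal sketch, not the only way to do it. The helper name all_file_paths is hypothetical, and the demo recreates the tutorial layout in a temporary directory so it can run anywhere.

```python
import os
import tempfile

def all_file_paths(top):
    """Collect the full path of every file below `top`, at any depth."""
    paths = []
    for root, dirs, files in os.walk(top):
        for name in files:
            # root is the directory currently being visited
            paths.append(os.path.join(root, name))
    return sorted(paths)

# Demo: rebuild the tutorial layout in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "images"))
    for rel in ("file_1.txt", "file_2.txt",
                os.path.join("images", "plot_1.png"),
                os.path.join("images", "plot_2.png")):
        open(os.path.join(tmp, rel), "w").close()
    # Show the paths relative to the top directory for readability.
    found = [os.path.relpath(p, tmp) for p in all_file_paths(tmp)]
    print(found)
```

The printed list contains the two text files followed by the two plots inside images, each written with the separator of the operating system.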
Manipulating Files and Directories
Let’s say I want to rename the .txt files. I don’t like the numbering and would prefer them to have a leading zero in case they go into the double digits. We can use os.rename for this job.
    directory = r"C:\Users\Daniel\tutorial"
    dir_content = os.listdir(directory)
    txt_f = [x for x in dir_content if os.path.isfile(directory + os.sep + x) and x.endswith(".txt")]
    # ['file_1.txt', 'file_2.txt']
    for f_name in txt_f:
        # 'file_1.txt' -> ['file', '1.txt']
        f_name_split = f_name.split("_")
        # '1.txt' -> '1'
        num = f_name_split[1].split(".")[0]
        new_name = f_name_split[0] + "_" + num.zfill(2) + ".txt"
        os.rename(directory + os.sep + f_name, directory + os.sep + new_name)
    os.listdir(directory)
    # ['file_01.txt', 'file_02.txt', 'images']
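The renaming logic can also be pulled into a small function, which makes it easier to check on its own. The sketch below assumes every file name follows a stem_number.extension pattern. The function name zero_padded_name and the extra file file_10.txt are made up for illustration. It uses os.path.splitext to separate the extension instead of splitting on the period by hand.

```python
import os
import tempfile

def zero_padded_name(f_name, width=2):
    """'file_1.txt' -> 'file_01.txt'; assumes a '<stem>_<number>.<ext>' name."""
    base, ext = os.path.splitext(f_name)   # 'file_1', '.txt'
    stem, num = base.rsplit("_", 1)        # 'file', '1'
    return stem + "_" + num.zfill(width) + ext

# Demo in a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("file_1.txt", "file_2.txt", "file_10.txt"):
        open(os.path.join(tmp, name), "w").close()
    for name in os.listdir(tmp):
        new_name = zero_padded_name(name)
        if new_name != name:  # skip names that are already padded
            os.rename(os.path.join(tmp, name), os.path.join(tmp, new_name))
    renamed = sorted(os.listdir(tmp))
    print(renamed)  # ['file_01.txt', 'file_02.txt', 'file_10.txt']
```

With the leading zero, alphabetical order and numerical order agree, which is the reason for padding in the first place.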
Now that we renamed our files, let’s create another sub-directory for these .txt files. To create new directories we use os.mkdir:
    os.listdir(directory)
    # ['file_01.txt', 'file_02.txt', 'images']
    os.mkdir(directory + os.sep + 'texts')
    os.listdir(directory)
    # ['file_01.txt', 'file_02.txt', 'images', 'texts']
Now we need to move the .txt files. There is no dedicated move function in the os module. Instead, we use os.rename, but rather than changing the name of the file, we change the path to its directory.
    directory = r"C:\Users\Daniel\tutorial"
    dir_content = os.listdir(directory)
    txt_f = [x for x in dir_content if os.path.isfile(directory + os.sep + x) and x.endswith(".txt")]
    for f in txt_f:
        old_path = directory + os.sep + f
        new_path = directory + os.sep + "texts" + os.sep + f
        os.rename(old_path, new_path)
    os.listdir(directory)
    # ['images', 'texts']
    os.listdir(directory+os.sep+'texts')
    # ['file_01.txt', 'file_02.txt']
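The create-then-move steps can be combined into one reusable function. The following is a rough sketch under a few assumptions: the function name move_by_extension is hypothetical, and it uses os.makedirs with exist_ok=True (instead of os.mkdir) so that it does not fail when the target directory already exists.

```python
import os
import tempfile

def move_by_extension(directory, extension, target_name):
    """Move files ending in `extension` into the sub-directory `target_name`."""
    target = os.path.join(directory, target_name)
    os.makedirs(target, exist_ok=True)  # create the target only if it is missing
    moved = []
    for name in sorted(os.listdir(directory)):
        old_path = os.path.join(directory, name)
        if os.path.isfile(old_path) and name.endswith(extension):
            # Same file name, different directory: this is the "move".
            os.rename(old_path, os.path.join(target, name))
            moved.append(name)
    return moved

# Demo in a throwaway directory that mimics the tutorial folder.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "images"))
    for name in ("file_01.txt", "file_02.txt"):
        open(os.path.join(tmp, name), "w").close()
    moved = move_by_extension(tmp, ".txt", "texts")
    top_level = sorted(os.listdir(tmp))
    in_texts = sorted(os.listdir(os.path.join(tmp, "texts")))
    print(top_level)  # ['images', 'texts']
    print(in_texts)   # ['file_01.txt', 'file_02.txt']
```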
Now our .txt files are in the ‘\texts’ sub-directory. Unfortunately, there is no copy function in os. Instead we have to use another module called shutil. You can use a signature like this:
    from shutil import copyfile
    copyfile(source, destination)
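To make the signature concrete, here is a small runnable sketch. The file names and the file content are invented for illustration, and everything happens in a temporary directory. Both source and destination are full paths to files, not directories.

```python
import os
import tempfile
from shutil import copyfile

# Demo in a throwaway directory; the names below are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "file_02.txt")
    destination = os.path.join(tmp, "file_02_backup.txt")
    with open(source, "w") as f:
        f.write("some raw data\n")

    copyfile(source, destination)  # the original file stays in place

    listing = sorted(os.listdir(tmp))
    with open(destination) as f:
        copied_text = f.read()
    print(listing)      # ['file_02.txt', 'file_02_backup.txt']
    print(copied_text)  # some raw data
```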
Finally, to remove a file we simply use os.remove:
    os.listdir(directory+os.sep+"texts")
    # ['file_01.txt', 'file_02.txt']
    os.remove(directory+os.sep+'texts'+os.sep+"file_01.txt")
    os.listdir(directory+os.sep+"texts")
    # ['file_02.txt']
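Note that os.remove only works on files and raises an error if the path does not exist or points to a directory. The sketch below wraps it in a small guard. The helper name remove_if_file is hypothetical. The demo also uses os.rmdir, which is not covered above and removes a directory only if it is empty.

```python
import os
import tempfile

def remove_if_file(path):
    """Delete `path` if it is an existing file; report whether it was removed."""
    if os.path.isfile(path):
        os.remove(path)
        return True
    return False

# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    texts = os.path.join(tmp, "texts")
    os.mkdir(texts)
    target = os.path.join(texts, "file_01.txt")
    open(target, "w").close()

    first_try = remove_if_file(target)   # file exists, so it is deleted
    second_try = remove_if_file(target)  # already gone, nothing happens
    dir_try = remove_if_file(texts)      # a directory is left untouched
    os.rmdir(texts)                      # removes the now empty directory
    remaining = os.listdir(tmp)
    print(first_try, second_try, dir_try, remaining)  # True False False []
```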
And that’s it. You might have noticed that we did not cover how to create or read files. The os module is technically able to create and read files, but in data science we usually depend on higher-level interfaces to read files. For example, we might open a .csv file with pandas’ pd.read_csv. Using the lower-level os functions will rarely be necessary. Thank you for reading and let me know if you have any questions.
In case you want to learn more about the os module, here are the os module docs.