Subtitle file exercise, translating Python to Rust

I love the movie Crouching Tiger Hidden Dragon (CTHD), especially the bar scene. I tried to find the screenplay for the movie in Mandarin. I found an awesome GitHub project (subtitle-combiner) and started thinking…

subtitle-combiner is great because its Git repository includes the subtitle files (SRT files) for CTHD in many different languages, including Simplified Mandarin, Traditional Mandarin, Pinyin and English. The subtitles are a treasure. It’s hard to find the CTHD screenplay in Mandarin, so maybe we can create the transcript from the subtitles!

The subtitle-combiner Git repo includes a Python program which combines multiple subtitle files into one subtitle file. For example, it combines Simplified Mandarin subtitles and Pinyin subtitles into Mandarin+Pinyin subtitles.

subtitle-combiner is written in Python, so I can understand it. Specifically, it uses Python 2, so it would be good practice to translate the project to Python 3, and a challenging exercise to translate it to Rust.

Converting to Rust

Creating the rust-srt-combiner

Before I began I installed Rust. I’ve been using the Rust Book experiment. The Rust book and most resources teach you to use rustup to manage your Rust installation.

The first real step in converting to Rust is to create a Rust project. The Hello, Cargo! chapter teaches us how to create a new project using cargo new. Let’s name our project the rust-srt-combiner:

$ cargo new rust-srt-combiner
        Created binary (application) `rust-srt-combiner` package

Our new cargo project comes with a Cargo.toml file. The *.toml format is easy to read, but we should add a description property to give credit where credit’s due. Notice our description property points to the original subtitles-combiner Git repo:

[package]
name = "rust-srt-combiner"
version = "0.1.0"
edition = "2021"
# the description we added vvvvvvvvvvvvvvvvvvvvvvvvvvv
description = "A Rust translation of this project https://github.com/gterzian/Subtitles-combiner"

[dependencies]

Translating combine.py

Next we could create the Rust equivalent of a “Python package” (note subtitles-combiner has a Python __init__.py file), or we could focus on the core logic in the combine.py file. The combine.py file starts by creating an ArgumentParser so subtitles-combiner can be run from the command line. We’ll ignore “packaging” and argument-parsing for now – we can run our Rust program with cargo run and hard-code any arguments. Let’s focus on the most important part: subtitle file parsing.

subtitles-combiner defines these four functions: read_lines, read_files, combine and write_combined_file.

We’ll translate these functions in the order they are executed.

Translating read_lines

read_lines takes a single file, opens the file in read-mode (rt) and yields a result for every nonempty line – the yield means read_lines is a generator. In fact the subtitles-combiner project is described as “an example of using generators for creating data processing pipelines”, and links to this presentation on Python Generator Hacking.

How do we translate this Python function to Rust?

def read_lines(sub_file):
    with open(sub_file, 'rt') as f: 
        for line in f:
            striped = line.strip()
            if striped:
                yield striped.decode('utf-8')

Filepaths are more than strings

I thought Python’s open(file...) function accepted only a string argument containing the filepath, but it accepts any “path-like object”. So does Rust’s File::open(path...): the argument path has the generic type P, where <P: AsRef<Path>>. Notice the use of the Path struct, which Rust describes as “a slice of a path (akin to str)”. Rust accepts a plain String (or &str) because both implement the AsRef<Path> trait. A file “path” is more than a simple string in both Python and Rust because a file path depends on the operating system. Linux and Windows use different path separators (/ vs \), so a path hard-coded as a single string commits to one separator and may only work on one platform. A “path”-like object, built from components, can be cross-platform.
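To make the AsRef<Path> bound concrete, here is a minimal sketch. The helper names file_stem_of and subtitle_path and the path subs/cthd.zh.srt are made up for illustration; only Path, PathBuf and AsRef<Path> come from the standard library.

```rust
use std::path::{Path, PathBuf};

// Hypothetical helper (not part of subtitles-combiner): accepts anything that
// can be viewed as a Path, which is the same bound File::open uses.
fn file_stem_of<P: AsRef<Path>>(path: P) -> Option<String> {
    path.as_ref()
        .file_stem()
        .map(|stem| stem.to_string_lossy().into_owned())
}

// Builds the path from components, so the standard library picks the
// OS-specific separator instead of us hard-coding it in a string.
fn subtitle_path(dir: &str, file_name: &str) -> PathBuf {
    Path::new(dir).join(file_name)
}
```

Calling file_stem_of with a &str, a String, or the PathBuf returned by subtitle_path("subs", "cthd.zh.srt") compiles in all three cases, because each of those types implements AsRef<Path>.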

Close the file when you’re done

Python’s with statement “wraps the execution of a block with methods defined by a context manager”. That means Python will always call __exit__() on your context manager when the with block ends, even if the block raises an exception. When the context manager is a file, __exit__() closes the file. That’s because a Python file inherits from the IOBase class, which implements the __exit__ method. Rust closes the file using the file’s drop function, which is comparable to __exit__(). The drop function is known as the object’s destructor and it “gives the type time to somehow finish what it was doing”. Rust calls the destructor automatically “when an initialized variable or temporary goes out of scope”, so there is no with block to write.
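The scope-based cleanup described above can be observed with a small sketch. FakeFile is a hypothetical stand-in for a real file handle (std::fs::File closes itself in the same way, but silently); the shared log exists only so the order of events is visible.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Hypothetical stand-in for a file handle; it records when it is "closed".
struct FakeFile {
    name: String,
    log: Rc<RefCell<Vec<String>>>,
}

impl Drop for FakeFile {
    // Runs automatically when a FakeFile goes out of scope,
    // much like __exit__() at the end of a Python with block.
    fn drop(&mut self) {
        self.log.borrow_mut().push(format!("closed {}", self.name));
    }
}

fn use_file(log: Rc<RefCell<Vec<String>>>) {
    let _file = FakeFile {
        name: "cthd.srt".to_string(),
        log: Rc::clone(&log),
    };
    log.borrow_mut().push("reading".to_string());
    // _file goes out of scope here, so drop() runs without an explicit call.
}
```

After use_file returns, the log holds "reading" followed by "closed cthd.srt", even though the function never closes anything explicitly.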

Taking it one line at a time

The Python file object inherits from IOBase, which means the file object is a context manager and also that the file can be “iterated over yielding the lines in a stream”. The Rust book teaches us how to read the entire file into a string using fs::read_to_string(file_path).... We don’t want one string – we want to iterate over each line in one subtitle file (so we can combine it with lines in another subtitle file). Rust’s std::io::BufReader and the .lines() method give us a way to iterate over lines. You could also make a custom BufReader which implements Iterator, so you can iterate over the reader directly. Not only is this closer to the Python approach (for line in my_reader), but it also means you don’t have to “allocate a string for each line”.
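Here is a minimal sketch of the BufReader plus .lines() approach. The function names open_lines and first_lines are hypothetical; File::open, BufReader::new and BufRead::lines are standard library calls. Note that .lines() yields io::Result<String>, so every line can carry a read error (including invalid UTF-8).

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader, Lines};
use std::path::Path;

// Hypothetical helper: open a file and return an iterator over its lines.
fn open_lines<P: AsRef<Path>>(path: P) -> io::Result<Lines<BufReader<File>>> {
    let file = File::open(path)?;
    Ok(BufReader::new(file).lines())
}

// Hypothetical helper: works on any buffered reader, not just files,
// and stops after the first n lines instead of reading everything.
fn first_lines<R: BufRead>(reader: R, n: usize) -> io::Result<Vec<String>> {
    reader.lines().take(n).collect()
}
```

Because first_lines only asks for a BufRead, it can be tried on an in-memory byte slice such as "1\nHello\n".as_bytes() before pointing it at a real .srt file.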

Truthy and falsey

Python supports the idea of “truthy” and “falsey”, which means the body of if line: will not execute if line is an empty string. An empty Python string is considered false or “falsey”, and so is any empty Python sequence/collection. Rust if statements don’t work like Python’s; if the condition isn’t a bool, you’ll get a compile error. “Unlike languages such as Ruby and JavaScript, Rust will not automatically try to convert non-Boolean types to a Boolean.” Rust prefers explicitness, so instead of relying on a language feature to treat an empty sequence as false (aka “falsey”), we should explicitly check whether the length of the string is 0.
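As a small sketch of the explicit check, the hypothetical helper keep_line below plays the role of Python’s if striped: test. It uses str::trim and str::is_empty from the standard library.

```rust
// Hypothetical helper: the explicit Rust version of Python's `if striped:`.
// Writing `if line { ... }` with a &str would not compile, because the
// condition of an `if` must be a bool.
fn keep_line(line: &str) -> bool {
    // is_empty() is the idiomatic spelling of `len() == 0`.
    !line.trim().is_empty()
}
```

With this helper, keep_line(" Hello ") is true, while keep_line("") and keep_line("   ") are both false.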

Decoding from Unicode

The Python read_lines function TODO Compare Python 2 and Python 3

Too lazy to be lazy

The file object is iterable, but the read_lines Python function uses the yield keyword. In Python the yield keyword creates a generator. Generators can be iterated (because generators “implement the iterator protocol”). So read_lines is a “generator function”; calling read_lines immediately returns a generator-iterator. The code in the generator function “only runs when called by next(g) or g.send(v), and execution is suspended when yield is encountered”. When should you use a generator? Use a generator when “you don’t know if you are going to need all results, or where you don’t want to allocate the memory for all results at the same time.” So maybe we only want to translate the first 5 lines of dialog in our .srt files – no need to read the entire file for that!
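Rust iterator adaptors are lazy in a similar way: nothing is pulled from the source until a consumer asks for it. The sketch below makes that visible by counting how many input lines are actually visited. The function first_n_dialog_lines and the visited counter are hypothetical and exist only for this demonstration.

```rust
use std::cell::Cell;

// Hypothetical helper: returns the first n non-empty lines and counts,
// in `visited`, how many input lines the iterator chain had to look at.
fn first_n_dialog_lines<'a>(
    lines: &[&'a str],
    n: usize,
    visited: &Cell<usize>,
) -> Vec<&'a str> {
    lines
        .iter()
        .copied()
        // inspect runs once per line that is actually pulled from the slice
        .inspect(|_| visited.set(visited.get() + 1))
        .filter(|line| !line.trim().is_empty())
        // take stops pulling lines as soon as n have been produced
        .take(n)
        .collect()
}
```

Asking for 2 lines from ["a", "", "b", "c", "d"] returns ["a", "b"] and visits only 3 of the 5 input lines; the rest are never read.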

Does Rust have generators? Using the yield keyword technique in Rust requires experimental/unstable features: #![feature(generators, generator_trait)] (renamed to coroutines in more recent nightly toolchains). Instead of using the experimental Rust yield keyword, you could explicitly implement the Iterator trait (i.e. define the fn next(&mut self) function). Our Rust function already returns Lines<BufReader<File>>, which implements Iterator, so I won’t implement Iterator myself, and I won’t try to use the yield keyword in Rust (too advanced for me!).
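For comparison, here is a sketch of what explicitly implementing Iterator could look like on stable Rust. NonEmptyLines and non_empty_lines are hypothetical names; the wrapper skips blank lines and trims the rest, which is roughly what the Python generator does.

```rust
use std::io::{self, BufRead, Lines};

// Hypothetical wrapper around Lines that yields only non-empty, trimmed lines.
struct NonEmptyLines<R: BufRead> {
    inner: Lines<R>,
}

impl<R: BufRead> Iterator for NonEmptyLines<R> {
    type Item = io::Result<String>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            // `?` ends the iteration when the underlying reader is exhausted.
            match self.inner.next()? {
                Ok(line) => {
                    let trimmed = line.trim();
                    if !trimmed.is_empty() {
                        return Some(Ok(trimmed.to_string()));
                    }
                    // Blank line: keep looping to the next one.
                }
                // Pass read errors through to the caller.
                Err(e) => return Some(Err(e)),
            }
        }
    }
}

fn non_empty_lines<R: BufRead>(reader: R) -> NonEmptyLines<R> {
    NonEmptyLines { inner: reader.lines() }
}
```

Like a Python generator, this does no work until next() is called, so a for loop or .take(5) over non_empty_lines(reader) reads only as much input as it needs.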

Instead I’ll consider these Python alternatives to the existing read_lines “generator function”

# Python generator function
def read_lines(sub_file):
    with open(sub_file, 'rt') as f: 
        for line in f:
            striped = line.strip()
            if striped:
                yield striped.decode('utf-8')

Generator expression

Here is a Python read_lines function that returns a “generator expression”. This SO post discusses the difference between a generator expression and a generator function. All the answers agree you should use whichever approach is clearer / more “readable”. What do you think?

# Python function that returns a generator expression
def read_lines(sub_file):
    # Strip first: an unstripped blank line is still '\n', which is truthy
    return (line.strip().decode('utf-8') for line in open(sub_file, 'rt') if line.strip())

Filter and map functions

Here is a Python read_lines function that uses the filter(...) and map(...) functions. In JavaScript I always reach for the .map() and .filter() methods when I need to process some data. For me, seeing the words “filter” and “map” sends a clear message about the purpose of the code. I especially like how JavaScript syntax lets us write .filter() first and then .map(), so you read the methods in the order they are executed (a technique I know as “chaining”, related to the idea of “piping”). Also, in JavaScript each method can be put on its own line, which enhances readability. Python doesn’t make that easy; the filter appears inside the map, and I could use newlines but it is awkward.

# Python function that uses map and filter
def read_lines(sub_file):
    return map(lambda l: l.strip().decode('utf-8'), filter(lambda l: l.strip(), open(sub_file, 'rt')))

Compare that to the JavaScript-like function below. What I love about it: the methods read in the order they are executed, and each method sits on its own line.

// Javascript-like function that uses map and filter
def read_lines(sub_file):
    return file
        .open(sub_file)
        .filter(_=>_)
        .map(_=>_.decode('utf-8'));

So the Python tendency for code to collapse onto a single line makes our generator expression and filter/map functions less readable – I think the original generator function (using the yield keyword) is best. These considerations should influence our Rust translation. The Rust book talks about the concept of readability:

However, one long line is difficult to read, so it’s best to divide it. It’s often wise to introduce a newline and other whitespace to help break up long lines when you call a method with the .method_name() syntax.

So let’s revisit our function. With the help of questions like [“most efficient way to filter Lines<BufReader>”](https://users.rust-lang.org/t/most-efficient-way-to-filter-lines-bufreader-file-based-on-multiple-criteria/74141) and [“how to read, filter, and modify lines from a file”](https://stackoverflow.com/a/30329127/1175496), I came up with this translation:
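A minimal sketch of how such a translation could look, using chained .map() and .filter() calls with one method per line. This is an illustration under my own assumptions, not necessarily the exact code from the project: the helper trimmed_non_empty is a hypothetical name, and I chose to pass read errors through rather than drop them. There is no decode('utf-8') step because BufRead::lines already produces String values and reports invalid UTF-8 as an error.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;

// Hypothetical helper: trims every line and drops the empty ones.
// Read errors are kept so the caller can decide what to do with them.
fn trimmed_non_empty<R: BufRead>(reader: R) -> impl Iterator<Item = io::Result<String>> {
    reader
        .lines()
        .map(|line| line.map(|text| text.trim().to_string()))
        .filter(|line| match line {
            Ok(text) => !text.is_empty(),
            Err(_) => true,
        })
}

// Sketch of read_lines: opens the subtitle file and returns a lazy iterator
// over its non-empty, trimmed lines.
fn read_lines<P: AsRef<Path>>(
    sub_file: P,
) -> io::Result<impl Iterator<Item = io::Result<String>>> {
    let file = File::open(sub_file)?;
    Ok(trimmed_non_empty(BufReader::new(file)))
}
```

The file is closed when the returned iterator is dropped, and nothing is read until the caller starts iterating, so read_lines(path)?.take(5) would only read the start of the file.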

Translating read_files

…TODO…

Translating combine

…TODO…

Zipping it up

Translating write_combined_file

…TODO…

Important Differences

Other Resources