Subtitle file exercise, translating Python to Rust
22 Feb 2023
I love the movie Crouching Tiger Hidden Dragon (CTHD), especially the bar scene. I tried to find the screenplay for the movie in Mandarin. I found an awesome GitHub project (subtitle-combiner) and started thinking…
subtitle-combiner is great because its Git repository includes the subtitle files (SRT files) for CTHD in many different languages, including Simplified Mandarin, Traditional Mandarin, Pinyin and English. The subtitles are a treasure: it’s hard to find the CTHD screenplay in Mandarin, so maybe we can create the transcript from the subtitles!
The subtitle-combiner Git repo includes a Python program which combines multiple subtitle files into one subtitle file. For example, it combines Simplified Mandarin subtitles and Pinyin subtitles into Mandarin+Pinyin subtitles.
subtitle-combiner uses Python so I can understand it; specifically subtitle-combiner uses Python 2 so it would be good practice to translate the project to Python 3 and a challenging exercise to translate the project to Rust.
Converting to Rust
The first real step in converting to Rust is to create a Rust project. The Hello, Cargo! chapter teaches us how to create a new project using cargo new. Let’s name our project rust-srt-combiner:

    $ cargo new rust-srt-combiner
         Created binary (application) `rust-srt-combiner` package
A cargo project comes with a Cargo.toml file. The *.toml format is easy to read, but we should add a description property to give credit where credit’s due. Notice our description property points to the original subtitles-combiner git repo:
    [package]
    name = "rust-srt-combiner"
    version = "0.1.0"
    edition = "2021"
    # the description we added vvvvvvvvvvvvvvvvvvvvvvvvvvv
    description = "A Rust translation of this project https://github.com/gterzian/Subtitles-combiner"

    [dependencies]
Next we could create the Rust equivalent of a “Python package” (note subtitles-combiner has a Python __init__.py file), or we could focus on the core logic in the combine.py file. The combine.py file starts by creating an ArgumentParser so subtitles-combiner can be run from the command line. We’ll ignore “packaging” and argument-parsing for now; we can run our Rust program with cargo run and hard-code any arguments. Let’s focus on the most important part: subtitle file parsing.
subtitles-combiner defines these four functions
We’ll translate these functions in the order they are executed.
read_lines takes a single file, opens the file in read-mode (rt) and yields a result for every nonempty line, so read_lines is a generator. In fact the subtitles-combiner project is described as “an example of using generators for creating data processing pipelines”, and links to this presentation on Python Generator Hacking.
How do we translate this Python function to Rust?
    def read_lines(sub_file):
        with open(sub_file, 'rt') as f:
            for line in f:
                striped = line.strip()
                if striped:
                    yield striped.decode('utf-8')
Filepaths are more than strings
I thought Python’s open(file...) function accepts only a string argument containing the filepath, but it accepts a “path-like object”. So does Rust’s File::open(path...); the argument path uses generic type <P: AsRef<Path>>. Notice the use of the Path struct, which Rust describes as “a slice of a path (akin to str)”. Rust accepts a simple String because String implements the AsRef<Path> trait. A file “path” is more than just a simple string in both Python and Rust because a file path depends on the operating system: Linux and Windows use different path separators (/ and \), so a hard-coded string can only use one path separator and may work only in Linux, or only in Windows. But a “path”-like object could be cross-platform.
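Here is a minimal sketch of what that generic bound buys us. The helper names open_subtitles and subtitle_path are my own inventions for illustration (they are not part of subtitles-combiner); the only standard-library pieces used are File::open, Path::new and Path::join.

```rust
use std::fs::File;
use std::io;
use std::path::{Path, PathBuf};

// Hypothetical helper: accepts anything that can be viewed as a Path
// (&str, String, &Path, PathBuf, ...), just like File::open itself.
fn open_subtitles<P: AsRef<Path>>(path: P) -> io::Result<File> {
    File::open(path)
}

// Hypothetical helper: build a path from components so the
// platform-specific separator is chosen for us.
fn subtitle_path(dir: &str, file_name: &str) -> PathBuf {
    Path::new(dir).join(file_name)
}
```

Calling open_subtitles("subs.srt"), open_subtitles(String::from("subs.srt")) and open_subtitles(subtitle_path("subs", "cthd.srt")) should all compile, because each argument type implements AsRef<Path>.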
Close the file when you’re done
Python’s with statement “wraps the execution of a block with methods defined by a context manager”. That means Python will always call __exit__() on your context. When the context is a file, Python will close the file when you’re done (when the with block ends). That’s because a Python file inherits from the IOBase class, which implements the __exit__ method. Rust closes the file using the file’s drop function. In Rust the drop method is comparable to Python’s __exit__ method here. The drop function is known as the object’s destructor and it “gives the type time to somehow finish what it was doing”. Rust calls the destructor automatically “when an initialized variable or temporary goes out of scope”.
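To convince myself of the scope rule, here is a small sketch that uses a made-up FakeFile type instead of a real file. FakeFile, its fields and use_file are hypothetical names for illustration; the point is only the order in which the Drop implementation runs.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Hypothetical stand-in for a file handle: it records when it is dropped.
struct FakeFile {
    name: String,
    log: Rc<RefCell<Vec<String>>>,
}

impl Drop for FakeFile {
    // Runs automatically when a FakeFile goes out of scope.
    fn drop(&mut self) {
        self.log.borrow_mut().push(format!("closed {}", self.name));
    }
}

fn use_file(log: Rc<RefCell<Vec<String>>>) {
    {
        let _f = FakeFile {
            name: String::from("cthd.srt"),
            log: Rc::clone(&log),
        };
        log.borrow_mut().push(String::from("reading"));
    } // _f goes out of scope here, so drop runs (like leaving a with block)
    log.borrow_mut().push(String::from("after block"));
}
```

A real std::fs::File behaves the same way: there is no close call to remember, the handle is released when the variable leaves scope.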
Taking it one line at a time
The Python file object inherits from IOBase, which means the file object is a context manager and also that the file can be “iterated over yielding the lines in a stream”. The Rust book teaches us how to read the entire file into a string using fs::read_to_string(file_path).... We don’t want one string; we want to iterate over each line in one subtitle file (so we can combine it with lines in another subtitle file). Rust’s std::io::BufReader and the .lines() method give us a way to iterate lines. You could also make a custom BufReader which implements Iterator so you can iterate the BufReader directly. Not only is this closer to the Python approach (for line in my_reader), but it also means you don’t have to “allocate a string for each line”.
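A minimal sketch of the BufReader plus .lines() approach might look like this. The function names file_lines and collect_lines are hypothetical; the second helper is generic over BufRead only so it can be tried on an in-memory byte slice without touching the disk.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;

// Hypothetical helper: open a file and return an iterator over its lines.
// Each item is an io::Result<String> with the line ending removed.
fn file_lines<P: AsRef<Path>>(path: P) -> io::Result<io::Lines<BufReader<File>>> {
    let file = File::open(path)?;
    Ok(BufReader::new(file).lines())
}

// Hypothetical helper: works for any buffered reader, including &[u8].
// Collecting into io::Result<Vec<_>> stops at the first read error.
fn collect_lines<R: BufRead>(reader: R) -> io::Result<Vec<String>> {
    reader.lines().collect()
}
```

Note that .lines() allocates a new String for every line, which is the cost the custom-BufReader idea above tries to avoid.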
Truthy and falsey
Python supports the idea of “truthy” and “falsey”, which means the body of if line: will not execute if line is an empty string. An empty Python string is considered false or “falsey”, and so is any empty Python sequence/collection. Rust if statements don’t work like Python’s; if the condition isn’t a bool, the code won’t compile.
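So in Rust the emptiness check has to be spelled out. Here is a small sketch, assuming we want the same strip-then-test behaviour as the Python function; non_empty is a hypothetical helper name.

```rust
// Hypothetical helper mirroring Python's `striped = line.strip()` / `if striped:`.
fn non_empty(line: &str) -> Option<&str> {
    let stripped = line.trim();
    // `if stripped { ... }` would not compile: the condition must be a bool,
    // so we ask the question explicitly with is_empty().
    if stripped.is_empty() {
        None
    } else {
        Some(stripped)
    }
}
```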
Decoding from Unicode
TODO Compare Python 2 and Python 3
Too lazy to be lazy
The file object is iterable, but the read_lines Python function uses the yield keyword. In Python the yield keyword creates a generator. Generators can be iterated (because generators “implement the iterator protocol”). So read_lines is a “generator function”; calling read_lines immediately returns a generator-iterator. The code in the generator function “only runs when called by g.send(v), and execution is suspended when yield is encountered”. When should you use a generator? Use a generator when “you don’t know if you are going to need all results, or where you don’t want to allocate the memory for all results at the same time.” So maybe we only want to translate the first 5 lines of dialog in our .srt files. No need to read the entire file for that!
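Rust iterator adapters are lazy in a similar way: nothing is pulled from the source until something asks for the next item. The sketch below is my own illustration (first_non_empty is a hypothetical helper, and it works on an in-memory string rather than a file) that counts how many input lines were actually consumed.

```rust
use std::cell::Cell;

// Hypothetical helper: returns the first `n` non-empty lines, plus how many
// input lines the iterator chain actually pulled to produce them.
fn first_non_empty(text: &str, n: usize) -> (Vec<String>, usize) {
    let pulled = Cell::new(0);
    let lines: Vec<String> = text
        .lines()
        .inspect(|_| pulled.set(pulled.get() + 1)) // count each line as it is consumed
        .map(|line| line.trim())
        .filter(|line| !line.is_empty())
        .take(n) // stop asking for more once we have n items
        .map(String::from)
        .collect();
    (lines, pulled.get())
}
```

For the input "a\n\nb\nc\nd\n" and n = 2, only three of the five lines are consumed: "a", the blank line, and "b".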
Does Rust have generators? Using the yield keyword technique in Rust requires experimental/unstable features: #![feature(generators, generator_trait)]. Instead of using the experimental Rust yield keyword, you could explicitly implement the Iterator protocol (i.e. define the fn next(&mut self) function). Our Rust function already returns Lines<B> (for some B: BufRead), which implements the Iterator protocol, so I won’t implement Iterator in Rust (already implemented), and I won’t try to use the yield keyword in Rust (too advanced for me!).
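For the curious, here is roughly what implementing the Iterator protocol by hand could look like. This is a sketch only: NonEmptyLines and non_empty_lines are hypothetical names, and it iterates an in-memory string to keep the example short.

```rust
// Hypothetical iterator that yields the non-empty, trimmed lines of a string.
struct NonEmptyLines<'a> {
    inner: std::str::Lines<'a>,
}

impl<'a> Iterator for NonEmptyLines<'a> {
    type Item = &'a str;

    // Called once per requested item, a bit like resuming a Python generator.
    fn next(&mut self) -> Option<Self::Item> {
        loop {
            // `?` returns None as soon as the underlying lines run out.
            let line = self.inner.next()?.trim();
            if !line.is_empty() {
                return Some(line);
            }
        }
    }
}

fn non_empty_lines(text: &str) -> NonEmptyLines<'_> {
    NonEmptyLines { inner: text.lines() }
}
```

Because NonEmptyLines implements Iterator, it works in a for loop and with adapters like .take(5) for free.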
Instead I’ll consider these Python alternatives to the existing read_lines “generator function”:

    # Python generator function
    def read_lines(sub_file):
        with open(sub_file, 'rt') as f:
            for line in f:
                striped = line.strip()
                if striped:
                    yield striped.decode('utf-8')
Here is a Python read_lines function that returns a “generator expression”. This SO post discusses the difference between a generator expression and a generator function. All the answers agree you should use whichever approach is clearer / more “readable”. What do you think?

    # Python function that returns a generator expression
    # (each line is stripped so blank lines are skipped, like the generator function)
    def read_lines(sub_file):
        return (line.strip().decode('utf-8') for line in open(sub_file, 'rt') if line.strip())
Filter and map functions
Here is a Python read_lines function that uses filter() first and then map():

    # Python function that uses map and filter
    # (each line is stripped so blank lines are filtered out)
    def read_lines(sub_file):
        return map(lambda l: l.strip().decode('utf-8'), filter(lambda l: l.strip(), open(sub_file, 'rt')))
- the code is executed in the order it’s written (i.e. file is opened, then filtered, then mapped)
- each new step in processing is on a new line (and the processing step, e.g. .map, is the first word on the line)
- and indentation is not required (so the code doesn’t get very “wide”, I can read it top-to-bottom instead of left-to-right)
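A Rust method chain has the three properties in the list above. The following is a sketch of how read_lines might be written in that style, not necessarily the final translation; it assumes that silently stopping at the first read error is acceptable, which is a simplification on my part.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;

// Sketch of a Rust read_lines: open, then filter, then map, one step per line.
fn read_lines<P: AsRef<Path>>(sub_file: P) -> io::Result<impl Iterator<Item = String>> {
    let file = File::open(sub_file)?;
    Ok(BufReader::new(file)
        .lines()
        // Assumption: stop at the first I/O or UTF-8 error instead of reporting it.
        .map_while(Result::ok)
        // Skip lines that are empty after trimming whitespace.
        .filter(|line| !line.trim().is_empty())
        // Keep only the trimmed text of each remaining line.
        .map(|line| line.trim().to_string()))
}
```

There is no decode step here: .lines() already yields String values, which must be valid UTF-8.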
So the Python tendency for code to collapse onto a single line makes our generator expression and filter/map functions less readable. I think the original generator function (using the yield keyword) is best. These considerations should influence our Rust translation. The Rust book talks about the concept of readability:
However, one long line is difficult to read, so it’s best to divide it. It’s often wise to introduce a newline and other whitespace to help break up long lines when you call a method with the .method_name() syntax.
So let’s revisit our function. With the help of questions like “most efficient way to filter Lines<BufReader…
Zipping it up
String is the dynamic heap string type, like Vec: use it when you need to own or modify your string data.
str is an immutable sequence of UTF-8 bytes of dynamic length somewhere in memory. Since the size is unknown, one can only handle it behind a pointer. This means that str most commonly appears as &str.
BufReader<R> performs large, infrequent reads on the underlying Read and maintains an in-memory buffer of the results…
BufReader<R> can improve the speed of programs that make small and repeated read calls to the same file or network socket. It does not help when reading very large amounts at once, or reading just one or a few times.
type -> A struct is a type: “type” is the more general category; struct is one kind of type.
?Rust deals with the
- a “lambda” vs a “closure” generally
- Do the general ideas apply to a Python
- Do the general ideas apply to a Python
- “chaining” methods vs “piping” outputs as inputs