Skip to content

fix: Handle symbolic linking race condition#3796

Open
scasagrande wants to merge 1 commit into
bazel-contrib:release/1.9from
scasagrande:fix/ln-race-condition
Open

fix: Handle symbolic linking race condition#3796
scasagrande wants to merge 1 commit into
bazel-contrib:release/1.9from
scasagrande:fix/ln-race-condition

Conversation

@scasagrande
Copy link
Copy Markdown
Contributor

I ran into a problem when I packed a py_binary into an image and the users were launching it multiple times simultaneously. This resulted in a race condition for the ln calls during the venv bootstrapping.

My solution was just to add -fn flags to ln.

I'm still on 1.9 so I am targeting that release branch. Let me know if I am missing something obvious or if I should change anything else.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request replaces standard ln -s calls with ln -sfn in the stage 1 bootstrap template to force symlink creation. However, using ln -sfn is not atomic and can introduce race conditions in concurrent environments due to the underlying unlink-then-symlink sequence, potentially causing transient 'No such file or directory' errors. Additionally, the -n flag is non-standard POSIX. It is recommended to use an atomic fallback pattern instead, such as ln -s ... 2>/dev/null || [[ -L ... ]].

# The bin/ directory may not exist if it is empty.
mkdir -p "$(dirname $python_exe)"
ln -s "$symlink_to" "$python_exe"
ln -sfn "$symlink_to" "$python_exe"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using ln -sfn is not atomic and can introduce a different race condition.\n\nUnder the hood, ln -sf performs an unlink() followed by a symlink(). If multiple processes run this bootstrap script concurrently, one process might delete the symlink created by another process, leading to a transient window where the symlink does not exist. Any concurrent process attempting to execute or read the file during this window will fail with a No such file or directory error. Additionally, the -n flag is not standard POSIX and may behave differently or be unsupported on some platforms.\n\nA safer, fully atomic approach that avoids deleting existing symlinks is to attempt the creation with ln -s and fallback to verifying the symlink's existence if it fails due to already existing:\n\nbash\nln -s \"$symlink_to\" \"$python_exe\" 2>/dev/null || [[ -L \"$python_exe\" ]]\n\n\nSince symlink() is an atomic system call, this guarantees that the symlink is either created successfully or already exists, without ever unlinking it.

Suggested change
ln -sfn "$symlink_to" "$python_exe"
ln -s "$symlink_to" "$python_exe" 2>/dev/null || [[ -L "$python_exe" ]]

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotta love seeing \n\n in a comment


mkdir -p "$venv/bin"
ln -s "$python_exe_actual" "$python_exe"
ln -sfn "$python_exe_actual" "$python_exe"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

As mentioned above, ln -sfn is not atomic and can cause transient No such file or directory errors for concurrent processes due to the unlink and symlink sequence.\n\nUsing ln -s ... 2>/dev/null || [[ -L ... ]] provides an atomic alternative that avoids deleting the symlink if it already exists.

Suggested change
ln -sfn "$python_exe_actual" "$python_exe"
ln -s "$python_exe_actual" "$python_exe" 2>/dev/null || [[ -L "$python_exe" ]]

if [[ ! -e "$venv_site_packages" ]]; then
mkdir -p $(dirname $venv_site_packages)
ln -s "$runfiles_venv_site_packages" "$venv_site_packages"
ln -sfn "$runfiles_venv_site_packages" "$venv_site_packages"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using ln -sfn here can also cause race conditions under concurrent execution. We should use the atomic ln -s ... || [[ -L ... ]] pattern instead.

Suggested change
ln -sfn "$runfiles_venv_site_packages" "$venv_site_packages"
ln -s "$runfiles_venv_site_packages" "$venv_site_packages" 2>/dev/null || [[ -L "$venv_site_packages" ]]


if [[ ! -e "$venv/pyvenv.cfg" ]]; then
ln -s "$runfiles_venv/pyvenv.cfg" "$venv/pyvenv.cfg"
ln -sfn "$runfiles_venv/pyvenv.cfg" "$venv/pyvenv.cfg"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using ln -sfn here can also cause race conditions under concurrent execution. We should use the atomic ln -s ... || [[ -L ... ]] pattern instead.

Suggested change
ln -sfn "$runfiles_venv/pyvenv.cfg" "$venv/pyvenv.cfg"
ln -s "$runfiles_venv/pyvenv.cfg" "$venv/pyvenv.cfg" 2>/dev/null || [[ -L "$venv/pyvenv.cfg" ]]

@scasagrande
Copy link
Copy Markdown
Contributor Author

lmk if you'd rather I look into whatever the heck the bot is yelling about, but we've had zero problems with race conditions since patching this in

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant