Preserve timestamps when unstreaming dirs #966

Merged
merged 3 commits on Jan 27, 2022
8 changes: 8 additions & 0 deletions ansible_runner/utils/streaming.py
@@ -1,3 +1,4 @@
import time
import tempfile
import zipfile
import os
@@ -68,6 +69,7 @@ def unstream_dir(stream, length, target_directory):

with zipfile.ZipFile(tmp.name, "r") as archive:
# Fancy extraction in order to preserve permissions
# AWX relies on the execution bit, in particular, for inventory
# https://www.burgundywall.com/post/preserving-file-perms-with-python-zipfile-module
for info in archive.infolist():
out_path = os.path.join(target_directory, info.filename)
@@ -85,6 +87,12 @@ def unstream_dir(stream, length, target_directory):

archive.extract(info.filename, path=target_directory)

# Fancy logic to preserve modification times
Contributor:

Can we add a bit more context to this comment about why we need to preserve modification times?

Member Author:

I thought this would be kind of a general expectation for extracting a zip file.

In the case of runner, we have a control & execution machine. The control machine creates the private_data_dir, the worker runs the job, and it gets streamed back. The modtimes can be used to distinguish whether the control machine last modified the file or whether it was modified during the job run.

In the AWX case, we use modification times to determine if new facts were written for a host. This is a big performance optimization for the case where the inventory has 10k hosts but you run against 1 host. Most playbooks will gather facts, and AWX maintains a fact cache in the database. A job run which should go super fast would get bogged down saving the rather substantial facts data structure for 10k hosts.

I have trouble putting that into a short sentence :)
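The fact-cache optimization described above can be sketched as follows. This is a hypothetical illustration of the mtime-comparison idea, not the actual AWX code; the helper name and layout are invented:

```python
import os


def changed_fact_files(fact_cache_dir, job_start_time):
    """Return paths of fact-cache files modified after the job started.

    Hypothetical sketch: comparing file mtimes against the job start time
    lets the caller re-save facts only for hosts that were actually
    touched, instead of all 10k hosts in the inventory.
    """
    changed = []
    for name in os.listdir(fact_cache_dir):
        path = os.path.join(fact_cache_dir, name)
        if os.path.getmtime(path) >= job_start_time:
            changed.append(path)
    return changed
```

This only works if the streamed-back directory preserves mtimes; without the change in this PR, every extracted file would look freshly modified.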

Contributor:

The new Executor team (currently just myself and @eqrx) lack A LOT of context around design decisions in the current code base (much of it having to do with AWX where we have no experience or insight), so it makes maintaining it rather difficult. Just a simple sentence like "AWX uses modification times to determine if new facts were written for a host" would give us a lot of clarity when we see code like this where the purpose is not immediately clear.

# AWX uses modification times to determine if new facts were written for a host
# https://stackoverflow.com/questions/9813243/extract-files-from-zip-file-and-retain-mod-date
date_time = time.mktime(info.date_time + (0, 0, -1))
os.utime(out_path, times=(date_time, date_time))

if is_symlink:
link = open(out_path).read()
os.remove(out_path)
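Assembled outside the diff, the extraction approach reads roughly like this. It is a standalone sketch of the technique (permission bits from `external_attr`, mtime from `date_time`), not the exact runner code; symlink handling is omitted:

```python
import os
import stat
import time
import zipfile


def extract_preserving_metadata(zip_path, target_directory):
    """Extract a zip while restoring Unix permission bits and mtimes.

    Sketch of the approach in the diff above; plain extract() would
    discard both pieces of metadata.
    """
    with zipfile.ZipFile(zip_path, "r") as archive:
        for info in archive.infolist():
            out_path = os.path.join(target_directory, info.filename)
            archive.extract(info.filename, path=target_directory)
            # The upper 16 bits of external_attr hold the Unix st_mode
            mode = stat.S_IMODE(info.external_attr >> 16)
            if mode:
                os.chmod(out_path, mode)
            # date_time is a local-time tuple; -1 lets mktime infer DST
            mtime = time.mktime(info.date_time + (0, 0, -1))
            os.utime(out_path, times=(mtime, mtime))
```

Note that `date_time` comes from ZIP's DOS timestamp format, so the restored mtime is only accurate to about 2 seconds.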
47 changes: 47 additions & 0 deletions test/unit/utils/test_utils.py
@@ -1,7 +1,9 @@
import datetime
import io
import json
import os
import signal
import time

from pathlib import Path

@@ -147,6 +149,51 @@ def test_transmit_permissions(tmp_path, fperm):
assert oct(new_file_path.stat().st_mode) == oct(old_file_path.stat().st_mode)


def test_transmit_modtimes(tmp_path):
source_dir = tmp_path / 'source'
source_dir.mkdir()

# python ZipFile uses an old standard that stores seconds in 2 second increments
# https://stackoverflow.com/questions/64048499/zipfile-lib-weird-behaviour-with-seconds-in-modified-time
(source_dir / 'b.txt').touch()
time.sleep(2.0) # flaky for anything less
(source_dir / 'a.txt').touch()

very_old_file = source_dir / 'very_old.txt'
very_old_file.touch()
old_datetime = os.path.getmtime(source_dir / 'a.txt') - datetime.timedelta(days=1).total_seconds()
os.utime(very_old_file, (old_datetime, old_datetime))

# sanity, verify assertions pass for source dir
mod_delta = os.path.getmtime(source_dir / 'a.txt') - os.path.getmtime(source_dir / 'b.txt')
assert mod_delta >= 1.0

outgoing_buffer = io.BytesIO()
outgoing_buffer.name = 'not_stdout'
stream_dir(source_dir, outgoing_buffer)

dest_dir = tmp_path / 'dest'
dest_dir.mkdir()

outgoing_buffer.seek(0)
first_line = outgoing_buffer.readline()
size_data = json.loads(first_line.strip())
unstream_dir(outgoing_buffer, size_data['zipfile'], dest_dir)

# Assure modification times are internally consistent
mod_delta = os.path.getmtime(dest_dir / 'a.txt') - os.path.getmtime(dest_dir / 'b.txt')
assert mod_delta >= 1.0

# Assure modification times are same as original (to the rounded second)
for filename in ('a.txt', 'b.txt', 'very_old.txt'):
difference = abs(os.path.getmtime(dest_dir / filename) - os.path.getmtime(source_dir / filename))
assert difference < 2.0

# Assure the very old timestamp is preserved
old_delta = os.path.getmtime(dest_dir / 'a.txt') - os.path.getmtime(source_dir / 'very_old.txt')
assert old_delta >= datetime.timedelta(days=1).total_seconds() - 2.
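The two-second rounding that the test comment mentions can be demonstrated directly. This is a standalone sketch, not part of the PR; the helper name is invented:

```python
import os
import tempfile
import zipfile


def roundtrip_seconds(path):
    """Zip `path` and return the seconds field recorded in the archive.

    ZIP's DOS timestamp format stores seconds in 2-second increments,
    so the value read back is always rounded down to an even number,
    regardless of the file's actual mtime.
    """
    with tempfile.TemporaryDirectory() as tdir:
        zpath = os.path.join(tdir, 'demo.zip')
        with zipfile.ZipFile(zpath, 'w') as archive:
            archive.write(path, 'member')
        with zipfile.ZipFile(zpath, 'r') as archive:
            return archive.getinfo('member').date_time[5]
```

This is why the test above sleeps 2.0 seconds between touches and compares timestamps with a tolerance of 2.0.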


def test_signal_handler(mocker):
"""Test the default handler is set to handle the correct signals"""
