Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Security feature: Add feature to sanitize strings against CSV injection (aka Formula Injection). #326

Open
Adil-Iqbal opened this issue Aug 7, 2022 · 4 comments
Labels

Comments

@Adil-Iqbal
Copy link

Adil-Iqbal commented Aug 7, 2022

Summary

We'd like to add a feature to CsvGenerator that will enable sanitizing string values against CSV Injection.

CsvGenerator.Feature.SANITIZE_STRING_VALUES

The current method of protecting against this vulnerability is not reliable. Specifically, we request that for all string values, all matches that match the regular expression ^[-\=\+@]+ should be replaced by an empty string before the string value is written.

Background

https://www.veracode.com/blog/secure-development/data-extraction-command-execution-csv-injection

Rationale

The existing feature that is most-often used to protect against CSV injection is CsvGenerator.Feature.ALWAYS_QUOTE_STRINGS. There are two problems with this approach.

  1. It unnecessarily augments all string values, regardless of whether those values are malicious.
  2. End users may still remove double quotes programmatically in their spreadsheet application, not realizing that they are exposing themselves to a critical security issue.

Many developers are attempting to address the issue by using reflection to manually "sanitize" the declared string fields on their POJOs prior to serializing them as a CSV row. The risk is that they are doing it improperly, allowing exposure to still exist in some form. For example, one user on stack overflow decided to prepend malicious string with an apostrophe ('), instead of just removing the offending characters entirely.

Clearly, a more targeted feature is needed to protect the end user.

Mitigating the Risk

The risk is mitigated by ensuring that the following characters do not appear at the start of the string:

  • equals sign (=)
  • plus sign (+)
  • minus sign (-)
  • "At" symbol (@)

Acceptance criteria

Strings should be transformed:

  • After the string values are trimmed.
  • Before ALWAYS_QUOTE_STRINGS is applied.
  • Before string values are written.

The following transformations can be used as a guide for feature.

=foo --> foo
+foo --> foo
-foo --> foo
@foo --> foo
==foo --> foo
++foo --> foo
--foo --> foo
@@foo --> foo
@+foo --> foo
=-foo --> foo
foo --> foo
foo= --> foo=
fo=o --> fo=o
+-=@fo=o= --> fo=o=
@cowtowncoder
Copy link
Member

First of all, thank you for reporting the issue. I will have to read this with thought.

But my first thought is that this seems related to some specific use cases; and that the suggestion cannot be a general rule for all usage. It sounds like this would specifically be against injection when loading documents in Excel -- a common use case to be sure, but not at all the only use case. And as such it cannot dictate general processing: it would not make any sense to -- for example -- remove leading minus sign from negative numbers.
Similarly there are performance concerns in various content cleaning aspects to consider (for cases where replacement/removal is not needed).

It would likely make sense to add new CsvWriteFeature to cover use case(s) (SANITIZE_FOR_EXCEL?).
There are probably also ways to add quoting/escaping, although that is another orthogonal feature.

@Adil-Iqbal
Copy link
Author

You're right. It's certainly application specific. As is the vulnerability. Though, I do feel that "export to csv" like functionality is within the domain of this project.

That said, it seems I didn't consider all the constraints. The negative number like problem seems hard to solve without inspecting the entire string.

On the performance front, perhaps we can qualify which strings we apply the sanitize operation to by checking if the first character is in the set of offending characters. If not, we can omit the sanitize operation.

As for the operation itself, I'll try and think about edge cases and ways to save on time.

On the naming front, SANITIZE_FOR_EXCEL is better than what I originally suggested, but may be too narrow. Formulae are used by every major spreadsheet app, including Google Sheets, OfficeLibre, Zoho Sheets, Apple Numbers on MacOS, etc. All of them use the four characters mentioned above. A better name for the feature might be SANITIZE_FOR_CONSUMER.

@cowtowncoder
Copy link
Member

Yes, I agree that improvement(s) to allow safe(r) operation are important and useful, I hope it did not sound I suggest otherwise.

While I agree that SANITIZE_FOR_EXCEL or similar is sort of too specific, I think there is sometimes practical consideration of what a useful name is -- I am definitely open for alternatives. But one problem there may be is if different tools have different requirements (which is probably true): is it possible to find one common setting, or is there need for separate ones.
But I could go with alternative names; perhaps something more along lines of "SANITIZE_FORMULAS" or "SANITIZE_FORMULA_CHARACTERS".

Now, one practical question: it seemed to me that:

  1. For characters mentioned, the check only needs to be applied for the first character of a cell value (or possible first non-whitespace character)
  2. Problem only occurs if cell value is NOT quoted

is this true? (2) is esp. relevant since ideally I'd prefer not mutating actual values -- but obviously if it won't make any difference then... something needs to be done.

And come to think of that, it may even be necessary to offer some sort of handlers for deciding how to replace possibly problematic characteres (drop/change).
This is not an easy problem to solve cleanly it seems. :)

@pjfanning
Copy link
Member

pjfanning commented Aug 29, 2022

I can't see the security issue here - the security risk is on the consumer side (the Excel user) as opposed to the producer side. The Excel user shouldn't be loading untrusted files.

I'm not aware of any CSV libs that have a feature like this. If you can provide some links to equivalent support in other CSV tools, I'd be interested in seeing them.

If you want to export data using so that Excel users can load it up - why not produce an xlsx file (using Apache POI or something like that)? Then you get to control whether a cell is a text cell or a formula cell.

@cowtowncoder cowtowncoder changed the title Critical Security Issue: Add feature to sanitize strings against CSV injection (aka Formula Injection). Security feature: Add feature to sanitize strings against CSV injection (aka Formula Injection). Aug 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants