Table Cells in Row: Tab Separator #242

Open
ajl000 opened this issue Dec 22, 2021 · 4 comments

ajl000 commented Dec 22, 2021

I am trying to use html-to-text as part of a spreadsheet IMPORTHTML function (webix sheets library).

It works really well using browserify.

With tables it would be wonderful if the cells could be separated by a tab character.

Possibly an option could be used such as selectors: [ { selector: 'table', rowCellSeparator: '\\t' } ]

Many thanks for this great project.

@KillyMXI
Member

This was first requested in #98

Am I right that what you are essentially looking for is HTML to CSV/TSV conversion?

If that's the case then the right approach would be to have a separate formatter rather than an option for the default one shipped with html-to-text.

I'll see whether I can include it in version 9.
Making a custom formatter on your own is also possible. It will be simpler than the default one but still more complicated than the formatters people usually write for other tags.

@ajl000
Author

ajl000 commented Dec 23, 2021

Yes, this is correct: I am using html-to-text to convert HTML to a JSON array (row-column) for a JavaScript spreadsheet, which is essentially CSV/TSV.

I don't know NodeJS so I cannot customize formatter.js to add a \t before each th/td in a row (non-first). Even every th/td in a row would be fine for my needs.

If you would include it, I would be pleased to sponsor a relatively small amount of USD150 for this feature.

Something possibly like the following would be great.

const {convert} = require('html-to-text');
const vs1 = "<p>Heading</p><table><tr><th>Month</th><th>Savings</th></tr><tr><td>January</td><td>$100</td></tr><tr><td>February</td><td>$80</td></tr></table>"

console.log(convert(vs1, {
 selectors: [ { selector: 'table', format: 'dataTableRowCellSeparator', rowCellSeparator: '\\t' } ]
}));

I have tried to look at custom formatters and may not have fully understood your comments above.

As an aside I note that

const {convert} = require('html-to-text');
const vs1 = "<p>Heading</p><table><tr><th>Month</th><th>Savings</th></tr><tr><td>January</td><td>$100</td></tr><tr><td>February</td><td>$80</td></tr></table>"

console.log(convert(vs1, {
 selectors: [ { selector: 'table', format: 'table' } ]
}));

console.log(convert(vs1, {
}));

seems to output all the words together:

"Heading

MonthSavingsJanuary$100February$80"

@KillyMXI
Member

KillyMXI commented Dec 23, 2021

format: 'table'

This is a legacy format that comes together with the tables option (now deprecated). That was the way to select which tables should be rendered as tables before selectors were introduced. Since one of the main purposes of html-to-text is to clean up HTML emails, and many emails use tables for layout, table tags can't be treated as tables by default.
Once I remove the tables option, the default format for tables will simply be block.

MonthSavingsJanuary$100February$80

This is because format: 'table' is essentially equivalent to format: 'block' but there is no format specified for rows and cells, so they are interpreted as inline tags.
Thanks for bringing this up - it can actually be used to achieve the desired output without a complex table formatter.

{
  wordwrap: false,
  whitespaceCharacters: ' \r\n\f\u200b', // excluded tab character
  formatters: {
    'cellFormatter': function (elem, walk, builder, formatOptions) {
      builder.addInline('\t');
      walk(elem.children, builder);
    }
  },
  selectors: [
    { selector: 'table', format: 'block', options: { leadingLineBreaks: 2, trailingLineBreaks: 2 } },
    { selector: 'tr', format: 'block', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } },
    { selector: 'th + th', format: 'cellFormatter' },
    { selector: 'th + td', format: 'cellFormatter' },
    { selector: 'td + td', format: 'cellFormatter' }
  ]
}

This should do the job if there is no complex content inside cells.
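
Since the goal stated earlier in the thread is a row-column array, the tab-separated text produced with these options could then be split into rows and cells. The following is only a minimal sketch: `tsvToRows` is a hypothetical helper, not part of html-to-text, and `sample` is hand-written to resemble the expected shape of the output rather than captured from the library.

```javascript
// Hypothetical helper (not part of html-to-text):
// split tab-separated text into an array of rows, each an array of cells.
function tsvToRows(text) {
  return text
    .split(/\r?\n/)                     // one entry per output line
    .filter((line) => line.length > 0)  // drop blank lines between blocks
    .map((line) => line.split('\t'));   // one entry per cell
}

// Hand-written sample, assumed to resemble the converter output.
const sample = 'Heading\n\nMonth\tSavings\nJanuary\t$100\nFebruary\t$80';
console.log(JSON.stringify(tsvToRows(sample)));
// prints [["Heading"],["Month","Savings"],["January","$100"],["February","$80"]]
```

Text outside the table (here the "Heading" paragraph) ends up as a single-cell row, so a caller may want to filter those rows out or convert only the table markup.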


When I get to a dedicated (and more robust) formatter implementation, I think I'll call it delimitedTable, to match what seems to be the umbrella term: delimiter-separated values.

With the workaround figured out, I don't think I'll make an 8.2.0 release for this. Version 9 is still a few months away, since there are a couple of big issues to address.

@KillyMXI
Member

KillyMXI commented Dec 19, 2022

I haven't included delimitedTable among the new default formatters in version 9, but there is one improvement that simplifies the example above: the builder.addLiteral function.
It is made for markup elements and bypasses whitespace processing, so there is no need to alter whitespaceCharacters.

{
  wordwrap: false,
  formatters: {
    'cellFormatter': function (elem, walk, builder, formatOptions) {
      builder.addLiteral('\t');
      walk(elem.children, builder);
    }
  },
  selectors: [
    { selector: 'table', format: 'block', options: { leadingLineBreaks: 2, trailingLineBreaks: 2 } },
    { selector: 'tr', format: 'block', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } },
    { selector: 'th + th', format: 'cellFormatter' },
    { selector: 'th + td', format: 'cellFormatter' },
    { selector: 'td + td', format: 'cellFormatter' }
  ]
}
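
For readers unsure how to wire these options into a call, here is a minimal sketch. The options object is the one from the comment above; `tabSeparatedOptions` and `htmlToTabSeparated` are hypothetical names introduced for illustration. The `convert` function is passed in as a parameter so the sketch makes no assumption about how the library is loaded (for example `require` in Node or a browserify bundle).

```javascript
// Options from the comment above (relies on builder.addLiteral, available in version 9).
const tabSeparatedOptions = {
  wordwrap: false,
  formatters: {
    // Emit a literal tab, then render the cell's children as usual.
    cellFormatter: function (elem, walk, builder, formatOptions) {
      builder.addLiteral('\t');
      walk(elem.children, builder);
    }
  },
  selectors: [
    { selector: 'table', format: 'block', options: { leadingLineBreaks: 2, trailingLineBreaks: 2 } },
    { selector: 'tr', format: 'block', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } },
    // Adjacent-sibling selectors: only a cell that follows another cell gets a tab,
    // so the first cell in each row is left without one.
    { selector: 'th + th', format: 'cellFormatter' },
    { selector: 'th + td', format: 'cellFormatter' },
    { selector: 'td + td', format: 'cellFormatter' }
  ]
};

// Hypothetical helper: `convert` is expected to be html-to-text's convert function.
function htmlToTabSeparated(html, convert) {
  return convert(html, tabSeparatedOptions);
}

// Usage (assumes html-to-text version 9 is installed):
// const { convert } = require('html-to-text');
// console.log(htmlToTabSeparated('<table><tr><th>Month</th><th>Savings</th></tr></table>', convert));
```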
