Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split Feed and Article models #415

Open
twm opened this issue Jun 5, 2019 · 0 comments
Open

Split Feed and Article models #415

twm opened this issue Jun 5, 2019 · 0 comments

Comments

@twm
Copy link
Owner

twm commented Jun 5, 2019

Focusing on Article, though the same issues are present in Feed:

Right now the Article model (and corresponding SQL table) contains several categories of fields:

Server-set fields

Field Size¹ Notes
guid small immutable after creation
author small there should really be a raw_author
url small² there is currently no limit on URL size
date tiny fixed-size
raw_title small²
raw_content large²
title small²
content large²
content_snippet small up to 500 characters
content_rev tiny

¹ This column represents the typical size. Variable-size fields set by the feed generally accept arbitrary values and are marked ².

² This field is variable size. The size is controlled by the source site.

Client-set fields

Two boolean fields may be set by the user:

Field Size
read tiny
fave tiny

Problems

  1. Mixed writers

    Some field values are taken directly from the feed. Some are post-processed versions of values from the feed. Some are set by the user. In the Feed model (which has all the same issues) this has caused bugs like Bulk marking articles read can produce negative unread count #323. It would be better to model the server-side stuff in a more formal way i.e. fetch feed → postprocess → available for read.

  2. Write amplification of client operations.

    Tables are stored in the SQLite record format in b-tree pages. When a page changes, its revised content is written to a WAL frame. Because the read and fave fields are part of the same page as large fields like content, bulk mutation of articles is likely to cause many page rewrites.

  3. Possible read amplification.

    SQLite stores the columns of a row together in a page. Large fields which are not needed for a query are still read from disk. The "raw" fields are located before the sanitized content on disk, and may be large enough to force SQLite to chase to a continuation page.

@twm twm added the enhancement label Jun 5, 2019
@twm twm added this to To do in Fetch pipeline via automation Jun 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Fetch pipeline
  
To do
Development

No branches or pull requests

1 participant