Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a blame option #6

Open
burrscurr opened this issue Jan 5, 2021 · 1 comment
Open

Add a blame option #6

burrscurr opened this issue Jan 5, 2021 · 1 comment
Assignees
Labels
cli Alters command line interface enhancement New feature or request

Comments

@burrscurr
Copy link
Owner

As a user, I don't just want to know the aggregated quality measure of my recorded line (for instance the average deviation), but also which parts of my line were especially bad, thus contributing greatly to the end result.

This could be implemented with a --blame flag, that triggers creating an hierarchical clustering of deviation measurements along the line and outputs all sections that contribute more than a certain threshold (for instance 5%) to the end result.

@burrscurr burrscurr added the enhancement New feature or request label Jan 5, 2021
@burrscurr burrscurr self-assigned this Jan 5, 2021
@burrscurr burrscurr added the cli Alters command line interface label Jan 5, 2021
@burrscurr
Copy link
Owner Author

This represents a clustering problem (find areas with many points or high deviations, respectively). As a gps track very well could be composed of 10^5 or more points (especially when re-sampled/interpolated), the clustering algorithm must be efficient. Additionally, the algorithm should be able to find clusters of varying length and be robust regarding noise (e.g. a point that is close to the reference line, but surrounded by high-deviation points).

DBSCAN clustering algorithm fulfills these requirements. To translate the algorithm into linesman's domain, one iterate over the line's points and create a cluster when the deviation of surrounding points (epsilon parameter in the original algorithm) is higher than a certain threshold value (minPts parameter in the original algorithm). When a cluster is found, following points are added to the cluster as long as the points in their surrounding pass the minimum threshold. If a point does not meet the threshold, it is not added to any cluster.

When the clusters were calculated, each cluster can be evaluated regarding its influence on the end result (e.g. in form of a percentage) and be presented to the user.

User's could customize the neighborhood size of points to consider when evaluating whether a point is added to a cluster and the minimum threshold. The former represents the minimal length of a bad (high deviation) section, the latter one the sensitivity, e.g. how bad a section must be to be considered bad. This should both be powerful and understandable for users not familiar with DBSCAN or clustering in general. In any case, reasonable defaults should be provided, maybe based on the average deviation of the line.

Additionally, implementation becomes fairly simple compared to regular DBSCAN, since the points given here have linear order. That makes it really easy to calculate the neighborhood of each point. Complexity should be n^2 in worst case, but if the epsilon parameter is relatively small, more like n (which is great).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cli Alters command line interface enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant