fix(query): the hashmap will cause oom #8737
Conversation
Benchmark environment: Alibaba Cloud, 16C32G, comparing ClickHouse (ck) and Databend.
-- databend
mysql> select count() from orders where o_clerk not like o_comment;
+----------+
| count() |
+----------+
| 15000000 |
+----------+
1 row in set (1 min 53.77 sec)
Read 15000000 rows, 1.11 GiB in 113.761 sec., 131.86 thousand rows/sec., 10.00 MiB/sec.
-- ck
SELECT version()
Query id: 58b06879-7ca4-4854-96d1-d4dd02cc2540
┌─version()──┐
│ 22.10.2.11 │
└────────────┘
SELECT count()
FROM orders
WHERE o_clerk NOT LIKE o_comment
Query id: 0d1ff665-04e1-435d-bbb7-e21da85dff00
┌──count()─┐
│ 15000000 │
└──────────┘
1 row in set. Elapsed: 7.925 sec. Processed 15.00 million rows, 1.22 GB (1.89 million rows/s., 153.73 MB/s.)
How does ClickHouse handle this?
Actually I'm not sure. It takes different approaches for different types, but I haven't looked at that branch very closely.
BTW, a LIKE pattern is much simpler than a regex. Maybe we don't need to convert the LIKE pattern into a regex at all?
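For context, the conversion being discussed looks roughly like this. This is a hypothetical sketch (the function name and the exact escaping set are illustrative, not the actual implementation): `%` maps to `.*`, `_` maps to `.`, and regex metacharacters in the pattern must be escaped.

```rust
// Hypothetical sketch of turning a SQL LIKE pattern into an anchored regex.
// `%` matches any sequence, `_` matches a single character; every other
// regex metacharacter in the pattern is escaped literally.
fn like_to_regex(pattern: &str) -> String {
    let mut re = String::from("^");
    for c in pattern.chars() {
        match c {
            '%' => re.push_str(".*"),
            '_' => re.push('.'),
            // escape regex metacharacters so they match literally
            '.' | '+' | '*' | '?' | '(' | ')' | '[' | ']' | '{' | '}'
            | '^' | '$' | '|' | '\\' => {
                re.push('\\');
                re.push(c);
            }
            _ => re.push(c),
        }
    }
    re.push('$');
    re
}

fn main() {
    assert_eq!(like_to_regex("%Hello_World%"), "^.*Hello.World.*$");
    assert_eq!(like_to_regex("100%"), "^100.*$");
}
```

Since the pattern here comes from a column, this conversion (plus a regex build) would run per row, which is the cost the benchmark below measures.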
I ported a quick demo from TiKV. Conclusion: the benches show that a plain like match is much faster than regex matching when the regex has to be rebuilt each time (as it would here):

regex/build_regex_everytime
                        time:   [13.348 µs 13.371 µs 13.392 µs]
regex/reuse_regex       time:   [16.677 ns 16.699 ns 16.722 ns]
regex/like              time:   [3.0577 ns 3.0593 ns 3.0611 ns]

Note: the like function is ported here just for the demo. It could be wrong.

The source code:

fn bench_regex(c: &mut Criterion) {
let mut group = c.benchmark_group("regex");
group.bench_function("build_regex_everytime", |b| {
b.iter(|| {
let re = regex::Regex::new("^Hello\\..World.*%$").unwrap();
re.is_match("Hello, World!");
})
});
let built_re = regex::Regex::new("^Hello\\..World.*%$").unwrap();
group.bench_function("reuse_regex", |b| {
b.iter(|| {
built_re.is_match("Hello, World!");
})
});
group.bench_function("like", |b| {
b.iter(|| {
let _ = like(b"Hello, World", b"^Hello\\..World.*%$", &0);
})
});
}
fn decode_one(data: &[u8]) -> Option<(u8, usize)> {
if data.is_empty() {
None
} else {
Some((data[0], 1))
}
}
fn sort_compare(a: &[u8], b: &[u8]) -> std::cmp::Ordering {
a.cmp(b)
}
fn like(target: &[u8], pattern: &[u8], escape: &i64) -> Option<i64> {
let escape = *escape as u32;
// current search positions in pattern and target.
let (mut px, mut tx) = (0, 0);
// positions for backtrace.
let (mut next_px, mut next_tx) = (0, 0);
while px < pattern.len() || tx < target.len() {
if let Some((c, mut poff)) = decode_one(&pattern[px..]) {
let code: u32 = c.into();
if code == '_' as u32 {
if let Some((_, toff)) = decode_one(&target[tx..]) {
px += poff;
tx += toff;
continue;
}
} else if code == '%' as u32 {
// update the backtrace point.
next_px = px;
px += poff;
next_tx = tx;
next_tx += if let Some((_, toff)) = decode_one(&target[tx..]) {
toff
} else {
1
};
continue;
} else {
if code == escape && px + poff < pattern.len() {
px += poff;
poff = if let Some((_, off)) = decode_one(&pattern[px..]) {
off
} else {
break;
}
}
if let Some((_, toff)) = decode_one(&target[tx..]) {
if let std::cmp::Ordering::Equal =
sort_compare(&target[tx..tx + toff], &pattern[px..px + poff])
{
tx += toff;
px += poff;
continue;
}
}
}
}
// mismatch and backtrace to last %.
if 0 < next_tx && next_tx <= target.len() {
px = next_px;
tx = next_tx;
continue;
}
return Some(false as i64);
}
Some(true as i64)
}
This case is not a constant pattern: it's column-1 LIKE column-2, which is more difficult.
Yes, our vector_const path uses the same implementation as this.
The code is better than the regex version; I vote for this change. cc @TCeason It's OK to address this improvement in this PR.
Ok
If the LIKE pattern contains just a single substring (e.g. %Hello%), we keep the original substring-search logic.
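Detecting that fast path could look like the following sketch. The function name and the exact rejection rules are illustrative assumptions, not the PR's actual code: it accepts only patterns of the shape %substring% where the inner part contains no further wildcards or escapes.

```rust
// Hypothetical sketch: return Some(substring) when a LIKE pattern is exactly
// `%substring%` with no other wildcards, so a plain substring search
// (e.g. via memchr or slice::windows) can replace the full matcher.
fn single_substring_pattern(pattern: &[u8]) -> Option<&[u8]> {
    if pattern.len() < 2 || pattern[0] != b'%' || pattern[pattern.len() - 1] != b'%' {
        return None;
    }
    let inner = &pattern[1..pattern.len() - 1];
    // Reject if the inner part still contains wildcards or an escape byte.
    if inner.iter().any(|&c| c == b'%' || c == b'_' || c == b'\\') {
        return None;
    }
    Some(inner)
}

fn main() {
    assert_eq!(single_substring_pattern(b"%Hello%"), Some(&b"Hello"[..]));
    assert_eq!(single_substring_pattern(b"%He_lo%"), None);
    assert_eq!(single_substring_pattern(b"Hello%"), None);
}
```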
The source code:

~/hello# tree
.
├── Cargo.lock
├── Cargo.toml
├── benches
│ └── benchmark.rs
└── src
└── main.rs
2 directories, 4 files
~/hello# cat Cargo.toml
[package]
name = "hello"
version = "0.1.0"
edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
memchr = "2.5.0"
regex = "1.6.0"
criterion = "0.3"
[[bench]]
name = "benchmark"
harness = false
~/hello# cat benches/benchmark.rs
use criterion::{criterion_group, criterion_main, Criterion};
fn decode_one(data: &[u8]) -> Option<(u8, usize)> {
if data.is_empty() {
None
} else {
Some((data[0], 1))
}
}
fn like(target: &[u8], pattern: &[u8]) -> bool {
// current search positions in pattern and target.
let (mut px, mut tx) = (0, 0);
// positions for backtrace.
let (mut next_px, mut next_tx) = (0, 0);
while px < pattern.len() || tx < target.len() {
if let Some((c, mut poff)) = decode_one(&pattern[px..]) {
let code: u32 = c.into();
if code == '_' as u32 {
if let Some((_, toff)) = decode_one(&target[tx..]) {
px += poff;
tx += toff;
continue;
}
} else if code == '%' as u32 {
// update the backtrace point.
next_px = px;
px += poff;
next_tx = tx;
next_tx += if let Some((_, toff)) = decode_one(&target[tx..]) {
toff
} else {
1
};
continue;
} else {
// in this demo the escape character is hardcoded to byte 0.
if code == 0 && px + poff < pattern.len() {
px += poff;
poff = if let Some((_, off)) = decode_one(&pattern[px..]) {
off
} else {
break;
}
}
if let Some((_, toff)) = decode_one(&target[tx..]) {
if let std::cmp::Ordering::Equal = target[tx..tx + toff].cmp(&pattern[px..px + poff])
{
tx += toff;
px += poff;
continue;
}
}
}
}
// mismatch and backtrace to last %.
if 0 < next_tx && next_tx <= target.len() {
px = next_px;
tx = next_tx;
continue;
}
return false;
}
true
}
fn search_sub_str(str: &[u8], substr: &[u8]) -> Option<usize> {
if substr.len() <= str.len() {
str.windows(substr.len()).position(|w| w == substr)
} else {
None
}
}
fn bench_regex(c: &mut Criterion) {
let mut group = c.benchmark_group("regex");
group.bench_function("build_regex_everytime", |b| {
b.iter(|| {
let re = regex::Regex::new("^.*Hello.*$").unwrap();
re.is_match("Hello, World!");
})
});
let built_re = regex::Regex::new("^.*Hello.*$").unwrap();
group.bench_function("reuse_regex", |b| {
b.iter(|| {
built_re.is_match("Hello, World!");
})
});
group.bench_function("like", |b| {
b.iter(|| {
let _ = like(b"Hello, World", b"%Hello%");
})
});
group.bench_function("match_one_pattern", |b| {
b.iter(|| {
let _ = search_sub_str(b"Hello, World", b"Hello");
})
});
}
criterion_group!(benches, bench_regex);
criterion_main!(benches);
src/query/functions-v2/tests/it/scalars/testdata/comparison.txt
Can we have a benchmark like this again?
Speed up:
--first exec
mysql> select count() from orders where o_clerk not like o_comment;
+----------+
| count() |
+----------+
| 15000000 |
+----------+
1 row in set (1.66 sec)
Read 15000000 rows, 1.11 GiB in 1.656 sec., 9.06 million rows/sec., 686.79 MiB/sec.
--second exec
mysql> select count() from orders where o_clerk not like o_comment;
+----------+
| count() |
+----------+
| 15000000 |
+----------+
1 row in set (0.84 sec)
Read 15000000 rows, 1.11 GiB in 0.832 sec., 18.03 million rows/sec., 1.33 GiB/sec.
I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/
Summary
Now l_col LIKE r_col will generate a hashmap based on r_col; if r_col is huge, the process will be OOM killed.
sys environment:
Table Struct
Prepare Data
data from: https://github.com/leiysky/tpch-databend
Result
test sql:
select count() from orders where o_clerk not like o_comment;
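For illustration, a column-vs-column LIKE can be evaluated row by row with a direct byte matcher, so no per-column hashmap or regex cache is ever built. This is a minimal recursive sketch of that idea (not the PR's actual iterative, backtracking matcher, and exponential on pathological multi-% patterns):

```rust
// Minimal recursive LIKE matcher over bytes: `%` matches any run,
// `_` matches one byte, everything else matches literally.
fn like_match(target: &[u8], pattern: &[u8]) -> bool {
    match pattern.first() {
        None => target.is_empty(),
        Some(b'%') => {
            // `%` may consume any (possibly empty) prefix of the target.
            (0..=target.len()).any(|i| like_match(&target[i..], &pattern[1..]))
        }
        Some(b'_') => !target.is_empty() && like_match(&target[1..], &pattern[1..]),
        Some(&c) => target.first() == Some(&c) && like_match(&target[1..], &pattern[1..]),
    }
}

fn main() {
    // Evaluate `l_col NOT LIKE r_col` row by row, no per-column structure.
    let l_col: Vec<&[u8]> = vec![b"Clerk#000000951", b"Clerk#000000880"];
    let r_col: Vec<&[u8]> = vec![b"%951", b"%999"];
    let not_like: Vec<bool> = l_col
        .iter()
        .zip(&r_col)
        .map(|(l, r)| !like_match(l, r))
        .collect();
    assert_eq!(not_like, vec![false, true]);
}
```

Memory stays constant per row regardless of how large r_col is, which is the point of the fix.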
Closes #8615