tail input stops reading data and gets stuck forever unless deleting SQLite DB #8813

Open
asouchang opened this issue May 10, 2024 · 0 comments

Issue Report

Describe the issue

fluent-bit intermittently stops reading data on our machines. Once this happens, it stays stuck indefinitely. Restarting the process doesn't help: it gets stuck again right after startup.

The pattern we've observed is that, right before it gets stuck, there is always a run of consecutive task-creation log lines like the following:

[2024/05/10 06:03:24] [trace] [task 0x7f7717e397a0] created (id=83)
[2024/05/10 06:03:24] [debug] [task] created task=0x7f7717e397a0 id=83 OK
[2024/05/10 06:03:24] [trace] [task 0x7f7717e39840] created (id=84)
[2024/05/10 06:03:24] [debug] [task] created task=0x7f7717e39840 id=84 OK
[2024/05/10 06:03:24] [trace] [task 0x7f7717e39a20] created (id=85)
[2024/05/10 06:03:24] [debug] [task] created task=0x7f7717e39a20 id=85 OK
[2024/05/10 06:03:24] [trace] [task 0x7f770a033240] created (id=86)
[2024/05/10 06:03:24] [debug] [task] created task=0x7f770a033240 id=86 OK
[2024/05/10 06:03:24] [trace] [task 0x7f770a0332e0] created (id=301)
[2024/05/10 06:03:24] [debug] [task] created task=0x7f770a0332e0 id=301 OK
...
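To spot this burst in a larger trace log, a minimal Python sketch like the following could group the created task ids by timestamp. The function name `tasks_created_per_second` is made up for illustration, and the pattern only assumes the debug-level line format quoted above.

```python
import re
from collections import defaultdict

# Matches the debug-level form quoted above, e.g.
# [2024/05/10 06:03:24] [debug] [task] created task=0x7f7717e397a0 id=83 OK
TASK_CREATED = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[debug\] \[task\] created "
    r"task=(?P<addr>0x[0-9a-f]+) id=(?P<id>\d+) OK\s*$"
)


def tasks_created_per_second(lines):
    """Group created task ids by their one-second-resolution timestamp."""
    created = defaultdict(list)
    for line in lines:
        match = TASK_CREATED.match(line)
        if match:
            created[match.group("ts")].append(int(match.group("id")))
    return dict(created)


# Sample lines copied from the log excerpt above.
sample = [
    "[2024/05/10 06:03:24] [trace] [task 0x7f7717e397a0] created (id=83)",
    "[2024/05/10 06:03:24] [debug] [task] created task=0x7f7717e397a0 id=83 OK",
    "[2024/05/10 06:03:24] [trace] [task 0x7f7717e39840] created (id=84)",
    "[2024/05/10 06:03:24] [debug] [task] created task=0x7f7717e39840 id=84 OK",
]
bursts = tasks_created_per_second(sample)
```

A timestamp with an unusually long list of ids marks the kind of burst described here.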

After sending SIGCONT to fluent-bit, the dump showed that it had not reached the memory limit and that every chunk of the tail input was busy. Every dump requested after it got stuck produced an identical report.
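For reference, requesting the dump amounts to delivering SIGCONT to the fluent-bit process. A minimal Python sketch, where `request_dump` is a hypothetical helper and looking up the right pid is left to the caller:

```python
import os
import signal


def request_dump(pid):
    """Send SIGCONT to the process with the given pid.

    fluent-bit responds to SIGCONT by emitting its internal state report;
    this helper only delivers the signal and does not read the report.
    """
    os.kill(pid, signal.SIGCONT)
```

From a shell, `kill -CONT <pid>` does the same thing.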

The only workaround that works is to delete the SQLite DB files (.db, .db-shm, and .db-wal) and then restart fluent-bit.
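To keep a stuck database around for inspection, the files could be moved aside instead of deleted. The sketch below is only an illustration: `set_aside_tail_db` and `dump_tables` are hypothetical helpers, fluent-bit is assumed to be stopped first, and the table dump deliberately makes no assumption about the schema fluent-bit uses.

```python
import shutil
import sqlite3
from contextlib import closing
from pathlib import Path

# The main database file plus SQLite's WAL-mode side files.
SUFFIXES = ("", "-shm", "-wal")


def set_aside_tail_db(db_path, backup_dir):
    """Move db_path and any -shm/-wal companions into backup_dir.

    Returns the destination paths of the files that were moved.
    """
    backup = Path(backup_dir)
    backup.mkdir(parents=True, exist_ok=True)
    moved = []
    for suffix in SUFFIXES:
        src = Path(str(db_path) + suffix)
        if src.exists():
            dst = backup / src.name
            shutil.move(str(src), str(dst))
            moved.append(dst)
    return moved


def dump_tables(db_path):
    """Return {table_name: rows} for every table found in the database."""
    with closing(sqlite3.connect(str(db_path))) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
            )
        ]
        return {
            name: conn.execute(f'SELECT * FROM "{name}"').fetchall()
            for name in names
        }
```

With the configuration in this report, the call would be `set_aside_tail_db("/var/log/flb_kube.db", some_backup_dir)`, after which `dump_tables` on the moved copy shows what the tail input had recorded when it got stuck.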

We first hit this issue in 1.9.9. Upgrading to 3.0.3 didn't help.

Trace-level logs are attached:

clean-faillog.txt

The relevant fluent-bit configuration is:

[SERVICE]
    Flush          1
    Daemon         Off
    Log_Level      trace
    Parsers_File   parsers.conf
    storage.path              /var/log/flb-storage/
    storage.sync              normal
    storage.checksum          off
    storage.backlog.mem_limit 1000MB
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_PORT    2020
    Hot_Reload   On

[INPUT]
    Name           tail
    Tag_Regex      (?<pod_name>[a-z0-9](?:[-a-z0-9.]*[a-z0-9])?(?:\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
    Tag            kube.<namespace_name>@@@@@@<container_name>@@@@@@<pod_name>@@@@@@<docker_id>
    Path           /var/log/containers/*.log
    Parser         cri
    DB             /var/log/flb_kube.db
    Mem_Buf_Limit  2048MB
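
To make the resulting tags concrete, here is a rough Python approximation of how the Tag_Regex and Tag above combine. It is only a sketch: fluent-bit uses a different regex engine than Python's `re`, `expand_tag` is a made-up helper, and the container log filename in the usage note is invented.

```python
import re

# Tag_Regex and Tag copied from the [INPUT] section above.
TAG_REGEX = (
    r"(?<pod_name>[a-z0-9](?:[-a-z0-9.]*[a-z0-9])?"
    r"(?:\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)"
    r"_(?<namespace_name>[^_]+)_(?<container_name>.+)"
    r"-(?<docker_id>[a-z0-9]{64})\.log$"
)
TAG = "kube.<namespace_name>@@@@@@<container_name>@@@@@@<pod_name>@@@@@@<docker_id>"


def expand_tag(path, tag_regex=TAG_REGEX, tag=TAG):
    """Return the expanded tag for path, or None if the regex does not match."""
    # Python spells named groups (?P<name>...) rather than (?<name>...).
    pattern = re.compile(tag_regex.replace("(?<", "(?P<"))
    match = pattern.search(path)
    if match is None:
        return None
    # Replace each <name> placeholder with the capture group of that name.
    return re.sub(r"<(\w+)>", lambda m: match.group(m.group(1)), tag)
```

For an invented file such as `/var/log/containers/web-0_prod_nginx-<64 hex chars>.log`, this yields a tag of the form `kube.prod@@@@@@nginx@@@@@@web-0@@@@@@<64 hex chars>`.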

Any advice on troubleshooting the issue?

Your Environment

  • Version used: 3.0.3, 1.9.9

  • Environment name and version:
    fluent-bit 3.0.3 Debian package (from packages.fluentbit.io/debian) installed on a docker.io/debian:bullseye base image and run as a container in a Kubernetes cluster.
    It is deployed as a DaemonSet and mounts the host path /var/log.
    /var/log is on the device that holds the host's root partition, not on a network file system.

  • Operating System and version:
    A VM in Alibaba, running Alibaba Cloud Linux release 3 (Soaring Falcon), Linux tri401 5.10.134-16.1.al8.x86_64 #1 SMP Thu Dec 7 14:11:24 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
