Segfault with Sqlalchemy #166

Closed
auxsvr opened this issue Jul 28, 2022 · 2 comments · Fixed by #185

Comments


auxsvr commented Jul 28, 2022

Bug Report

Memray triggers a segfault when running SQLAlchemy code that uses greenlets.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  PyCode_Addr2Line (addrq=0, co=0x66) at Objects/codeobject.c:1248
1248        if (addrq < 0) {
[Current thread is 1 (Thread 0x7fef9e344200 (LWP 23383))]
#0  PyCode_Addr2Line (addrq=0, co=0x66) at Objects/codeobject.c:1248
#1  PyFrame_GetLineNumber (f=<unknown at remote 0x5594cd476a00>) at Objects/frameobject.c:49
#2  0x00007fef9dc2a287 in memray::tracking_api::PythonStackTracker::pushPythonFrame(_frame*) [clone .constprop.52] ()
  from /home/petros/.local/lib/python3.10/site-packages/memray/_memray.cpython-310-x86_64-linux-gnu.so
#3  0x00007fef9dc2b5b8 in memray::tracking_api::PyTraceFunction(_object*, _frame*, int, _object*) ()
  from /home/petros/.local/lib/python3.10/site-packages/memray/_memray.cpython-310-x86_64-linux-gnu.so
#4  0x00007fef9e870e55 in call_trace (func=0x7fef9dc2b4c0 <memray::tracking_api::PyTraceFunction(_object*, _frame*, int,  _object*)>, obj=123, tstate=0x5594cc6614e0, 
   frame=Frame 0x7fef98c76640, for file /home/petros/.local/lib/python3.10/site-packages/sqlalchemy/engine/base.py, line 1330, in execute (self=<Connection(engine=<Engine(pool=<AsyncAdaptedQueuePool(_orig_logging_name=None, _echo=None,  logger=<Logger(filters=[], name='sqlalchemy.pool.impl.AsyncAdaptedQueuePool', level=0, parent=<Logger(filters=[],  name='sqlalchemy', level=30, parent=<RootLogger(filters=[], name='root', level=30, parent=None, propagate=True, handlers=[], disabled=False, _cache={}) at remote 0x7fef9d1b3880>, propagate=True, handlers=[], disabled=False, _cache={}, manager=?<Manager(root=<...>, _disable=0, emittedNoHandlerWarning=False, loggerDict={'rich': <Logger(filters=[], name='rich', level=0, parent=<...>, propagate=True, handlers=[], disabled=False, _cache={}, manager=<...>) at remote 0x7fef9d0dffd0>, 'concurrent.futures': <Logger(filters=[], name='concurrent.futures', level=0, parent=<...>, propagate=True, handlers=[], disabled=False, _cache={}, manager=<...>) at remote 0x7fef9c8ab820>, 'concurrent'...(truncated), trace_info=<optimized out>, what=0, arg=None) at Python/ceval.c:5468
#5  0x00007fef9e870f40 in call_trace_protected (func=<optimized out>, obj=<optimized out>, tstate=0x5594cc6614e0, frame=<optimized out>, trace_info=<optimized out>, what=<optimized out>, 
    arg=None) at Python/ceval.c:5427
#6  0x00007fef9e6c737d in _PyEval_EvalFrameDefault (tstate=0x5594cc6614e0, 
    f=Frame 0x7fef98c76640, for file /home/petros/.local/lib/python3.10/site-packages/sqlalchemy/engine/base.py, line 1330, in execute (self=<Connection(engine=<Engine(pool=<AsyncAdaptedQueuePool(_orig_logging_name=None, _echo=None, logger=<Logger(filters=[], name='sqlalchemy.pool.impl.AsyncAdaptedQueuePool', level=0, parent=<Logger(filters=[], name='sqlalchemy', level=30, parent=<RootLogger(filters=[], name='root', level=30, parent=None, propagate=True, handlers=[], disabled=False, _cache={}) at remote 0x7fef9d1b3880>, propagate=True, handlers=[], disabled=False, _cache={}, manager=<Manager(root=<...>, _disable=0, emittedNoHandlerWarning=False, loggerDict={'rich': <Logger(filters=[], name='rich', level=0, parent=<...>, propagate=True, handlers=[], disabled=False, _cache={}, manager=<...>) at remote 0x7fef9d0dffd0>, 'concurrent.futures': <Logger(filters=[], name='concurrent.futures', level=0, parent=<...>, propagate=True, handlers=[], disabled=False, _cache={}, manager=<...>) at remote 0x7fef9c8ab820>, 'concurrent'...(truncated), throwflag=0) at Python/ceval.c:1654
#7  0x00007fef9e77be93 in _PyEval_EvalFrame (throwflag=0, 
    f=Frame 0x7fef98c76640, for file /home/petros/.local/lib/python3.10/site-packages/sqlalchemy/engine/base.py, line 1330, in execute (self=<Connection(engine=<Engine(pool=<AsyncAdaptedQueuePool(_orig_logging_name=None, _echo=None, logger=<Logger(filters=[], name='sqlalchemy.pool.impl.AsyncAdaptedQueuePool', level=0, parent=<Logger(filters=[], name='sqlalchemy', level=30, parent=<RootLogger(filters=[], name='root', level=30, parent=None, propagate=True, handlers=[], disabled=False, _cache={}) at remote 0x7fef9d1b3880>, propagate=True, handlers=[], disabled=False, _cache={}, manager=<Manager(root=<...>, _disable=0, emittedNoHandlerWarning=False, loggerDict={'rich': <Logger(filters=[], name='rich', level=0, parent=<...>, propagate=True, handlers=[], disabled=False, _cache={}, manager=<...>) at remote 0x7fef9d0dffd0>, 'concurrent.futures': <Logger(filters=[], name='concurrent.futures', level=0, parent=<...>, propagate=True, handlers=[], disabled=False, _cache={}, manager=<...>) at remote 0x7fef9c8ab820>, 'concurrent'...(truncated), tstate=0x5594cc6614e0) at ./Include/internal/pycore_ceval.h:46
#8  _PyEval_Vector (tstate=<optimized out>, con=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=3, kwnames=<optimized out>) at Python/ceval.c:5065
#9  0x00007fef9e78e984 in _PyFunction_Vectorcall (kwnames=<optimized out>, nargsf=3, stack=0x5594cdf53fe8, func=<function at remote 0x7fef9a7af490>) at Objects/call.c:342
#10 _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=3, args=0x5594cdf53fe8, callable=<function at remote 0x7fef9a7af490>, tstate=0x5594cc6614e0) at ./Include/cpython/abstract.h:114
#11 method_vectorcall (method=<optimized out>, args=0x5594cdf53ff0, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/classobject.c:53
#12 0x00007fef9e7f0fe2 in _PyObject_VectorcallTstate (tstate=0x5594cc6614e0, callable=<method at remote 0x7fef98b44fc0>, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
    at ./Include/cpython/abstract.h:114
#13 0x00007fef9e783f8b in call_function (tstate=<optimized out>, trace_info=<optimized out>, pp_stack=0x7ffebfe6ccb0, oparg=<optimized out>, kwnames=<optimized out>) at Python/ceval.c:5866
#14 0x00007fef9e77dfd5 in _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:4231
#15 0x00007fef9e77be93 in _PyEval_EvalFrame (throwflag=0, 
    f=Frame 0x5594cdf53e00, for file /home/petros/.local/lib/python3.10/site-packages/sqlalchemy/orm/session.py, line 1847, in _execute_internal (self=<Session(identity_map=<WeakInstanceDict(_dict={}, _modified=set(), _wr=<weakref at remote 0x7fef98b22b60>) at remote 0x7fef98b2b940>, _new={}, _deleted={}, bind=<Engine(pool=<AsyncAdaptedQueuePool(_orig_logging_name=None, _echo=None, logger=<Logger(filters=[], name='sqlalchemy.pool.impl.AsyncAdaptedQueuePool', level=0, parent=<Logger(filters=[], name='sqlalchemy', level=30, parent=<RootLogger(filters=[], name='root', level=30, parent=None, propagate=True, handlers=[], disabled=False, _cache={}) at remote 0x7fef9d1b3880>, propagate=True, handlers=[], disabled=False, _cache={}, manager=<Manager(root=<...>, _disable=0, emittedNoHandlerWarning=False, loggerDict={'rich': <Logger(filters=[], name='rich', level=0, parent=<...>, propagate=True, handlers=[], disabled=False, _cache={}, manager=<...>) at remote 0x7fef9d0dffd0>, 'concurrent.futures': <Logger(filters=[], name='co...(truncated), tstate=0x5594cc6614e0) at ./Include/internal/pycore_ceval.h:46
#16 _PyEval_Vector (tstate=<optimized out>, con=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=3, kwnames=<optimized out>) at Python/ceval.c:5065
#17 0x00007fef9e78e984 in _PyFunction_Vectorcall (kwnames=<optimized out>, nargsf=3, stack=0x7fef98b39728, func=<function at remote 0x7fef9a4b0820>) at Objects/call.c:342
#18 _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=3, args=0x7fef98b39728, callable=<function at remote 0x7fef9a4b0820>, tstate=0x5594cc6614e0) at ./Include/cpython/abstract.h:114

Input Code

import asyncio
from sqlalchemy import Column, Integer, MetaData, String
from sqlalchemy.future import select
from sqlalchemy.orm import declarative_base, sessionmaker
from sqlalchemy.schema import Table
from sqlalchemy.ext.asyncio import async_scoped_session, AsyncSession, create_async_engine

metadata = MetaData()
Base = declarative_base(metadata=metadata)

t1 = Table(
    "t1",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("d1", String(50)),
    Column("d2", String(50)),
    Column("d3", String(50)),
    Column("d4", String(50)),
    Column("d5", String(50)),
)


engine = create_async_engine(
    "postgresql+psycopg://@/",
    future=True,
    hide_parameters=False,
    max_overflow=50,
    pool_size=30,
)

session_factory = sessionmaker(
    engine,
    expire_on_commit=True,
    autoflush=False,
    class_=AsyncSession,
    future=True,
)

db_session = async_scoped_session(session_factory, scopefunc=asyncio.current_task)


async def stream():
    async with db_session() as session:
        async for partition in (await session.stream(select(t1.c.id))).scalars().partitions(int(1e3)):
            yield partition
            await asyncio.sleep(0)

async def retrieve(id_):
    async with db_session() as session:
        return (await session.execute(select(t1).where(t1.c.id == id_))).scalar_one()


async def test():
    async for partition in stream():
        r = await asyncio.gather(*(retrieve(id_) for id_ in partition))
        print(len(r))


asyncio.run(test())

It seems that the pointer argument to PyCode_Addr2Line is invalid. Testing with PYTHONDEVMODE=1, it becomes 0xdddddddddddddddd, which corresponds to newly freed memory.

Expected behavior/code
No crash.

Environment

  • Python(s): Python 3.10.5
  • sqlalchemy master
  • psycopg 3.0.15
  • openSUSE Tumbleweed
godlygeek (Contributor) commented Jul 28, 2022

Well, a segfault is obviously bad, and we should try to make this fail more gracefully, but there's no reasonable way for us to make this work.

Memray is trying to keep track of the Python stack that led up to each allocation performed by the application. Greenlet plays tricks to switch the stack that's running on a given thread without telling us. As the greenlet docs say:

Standard Python tracing and profiling doesn’t work as expected when used with greenlet since stack and frame switching happens on the same Python thread.

Greenlet breaks the tools that we're using to keep track of the Python stack. The crash is caused by it breaking a pretty fundamental assumption that we made: if our profile function (set by sys.setprofile()) is told that foo is being called, and the next thing it's told is that bar is being called, we believe that foo called bar. Based on that assumption, we believe that the Python frame representing the foo call must continue to exist on the stack until our profile function is told that bar has exited.

In the greenlet world, this assumption doesn't hold: our profile function can be told that foo is being called, and then that bar is being called, and it could mean that foo called bar, or it could mean that one greenlet called foo, and then control switched to another greenlet, and that other greenlet called bar.
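
To make that assumption concrete, here is a minimal sketch (illustrative only, not Memray's actual bookkeeping) of a shadow stack driven by sys.setprofile(); the function and variable names are invented for the example:

import sys
import threading

# One shadow stack per thread; a dict keyed by thread id keeps the sketch short.
_shadow_stacks = {}

def _profile(frame, event, arg):
    stack = _shadow_stacks.setdefault(threading.get_ident(), [])
    if event == "call":
        # Assumption: whatever is currently on top of `stack` is this frame's
        # caller, and it stays alive until we see its matching "return" event.
        # A greenlet switch violates exactly this.
        stack.append((frame.f_code.co_filename, frame.f_code.co_name, frame.f_lineno))
    elif event == "return" and stack:
        stack.pop()

sys.setprofile(_profile)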

In theory, we could use greenlet.settrace() to detect the switches (a rough sketch of such a hook follows the list below), but there are a few problems with that:

  1. It requires us to know that greenlet is being used. If someone imports and starts using greenlet after we've already installed our profiling hooks, we wouldn't find out about it, and we would have no way to inject a greenlet.settrace() call. And this is the most common case - people normally use memray run to launch Memray, and our hooks get installed before the user's script starts running, and before it has imported anything.
  2. It's specific to greenlet, requiring us to hardcode functionality that's specific to a single 3rd party module. Currently we assume that we can rely on the interpreter's profiling machinery to tell us about switches between coroutines. This works for normal async def coroutines used by Trio and Curio and Asyncio. If we had to build something different for greenlet, it would only benefit greenlet.
  3. It would require a significant increase in complexity. Currently, at any given point in time, we only track the single Python stack that's currently executing for each thread. Even if we were hooked into notifications of greenlet switches, we would need to start recording multiple stacks per thread, and again, this complexity would only benefit greenlet, at a maintenance cost for us and a performance cost for everyone, regardless of whether they use greenlet or not.
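
For illustration only, detecting the switches with greenlet.settrace() would look roughly like this; the callback name is made up, and the sketch ignores the three problems above:

import greenlet

def _on_greenlet_event(event, args):
    if event in ("switch", "throw"):
        origin, target = args
        # At this point `origin`'s frames are no longer the running stack on
        # this thread, so a tracker must stop treating subsequent "call"
        # events as a continuation of the previous stack.
        pass

# Note: this installs the hook for the current thread only.
greenlet.settrace(_on_greenlet_event)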

So, we'll take a look at whether we can do something to fail more gracefully, but at this point we don't intend to try to make this actually work.

godlygeek (Contributor) commented:

We've come up with reasonable solutions to several of these problems.

For (2), we're willing to consider adding special support just for Greenlet, given that Greenlet is a fairly heavily used library, as long as the (maintenance and runtime) costs can be kept low enough.
For (1), we can detect Greenlet being loaded after tracing starts by using our existing dlopen hook to watch for a dynamic library whose name starts with _greenlet.
For (3), we think it's possible to support this with no extra per-allocation or per-function-call overhead for applications that aren't using Greenlet, building on the work done in #152. We could use greenlet.settrace() to throw away the old stack and start fresh with a new stack (dynamically recovered from the greenlet object being switched to) each time a greenlet switch happens.
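
As a rough sketch of that last idea (not the actual implementation; it assumes Greenlet's gr_frame attribute and uses a helper name invented for the example), the new stack could be rebuilt by walking the suspended frames of the greenlet being switched to:

def _recover_stack(target):
    # gr_frame is the topmost frame of a suspended greenlet; it is None if the
    # greenlet is currently running, finished, or not yet started.
    frames = []
    frame = target.gr_frame
    while frame is not None:
        frames.append((frame.f_code.co_filename, frame.f_code.co_name, frame.f_lineno))
        frame = frame.f_back
    frames.reverse()  # oldest caller first
    return frames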

But that leads us to a new problem: greenlet.settrace() is thread-local, and we don't have any reasonable way to install our Greenlet trace function (the one that tells Memray about the new stack after a switch) in each of the threads that were already started before Memray tracking began.

I've entered python-greenlet/greenlet#310 to ask whether the Greenlet maintainers can do something to help us out with that.
