zfs_zrele_async can cause txg sync deadlocks
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
zfs-linux (Debian) | Fix Released | Unknown | | | |
zfs-linux (Ubuntu) | Fix Released | Critical | Unassigned | | |
Xenial | Fix Released | High | Heitor Alves de Siqueira | | |
Bionic | Fix Released | Critical | Heitor Alves de Siqueira | | |
Focal | Fix Released | Critical | Heitor Alves de Siqueira | | |
Groovy | Fix Released | Critical | Heitor Alves de Siqueira | | |
Hirsute | Fix Released | Critical | Unassigned | | |
Bug Description
[Impact]
TXG sync stalls, causing ZFS workloads to hang indefinitely
[Description]
For certain ZFS workloads, we can see hung task timeouts in the kernel logs due to a transaction group deadlock. Userspace processes hang and show stack traces similar to the one below:
[49181.619711] clnt_server D 0 21699 28868 0x00000320
[49181.619715] Call Trace:
[49181.619725] __schedule+
[49181.619730] schedule+0x2c/0x80
[49181.619750] cv_wait_
[49181.619763] ? wait_woken+
[49181.619775] __cv_wait+0x15/0x20 [spl]
[49181.619872] zil_commit.
[49181.619884] ? _cond_resched+
[49181.619887] ? mutex_lock+
[49181.619959] zil_commit+
[49181.620026] zfs_fsync+0x77/0xe0 [zfs]
[49181.620093] zpl_fsync+0x68/0xa0 [zfs]
[49181.620100] vfs_fsync_
[49181.620105] do_fsync+0x3d/0x70
[49181.620109] SyS_fsync+0x10/0x20
[49181.620114] do_syscall_
[49181.620119] entry_SYSCALL_
We might also see a kworker thread blocked in the ZFS writeback/evict path:
[49181.881570] kworker/u17:3 D 0 4915 2 0x80000000
[49181.881576] Workqueue: writeback wb_workfn (flush-zfs-10)
[49181.881577] Call Trace:
[49181.881580] __schedule+
[49181.881582] ? atomic_
[49181.881584] schedule+0x2c/0x80
[49181.881588] bit_wait+0x11/0x60
[49181.881592] __wait_
[49181.881596] ? atomic_
[49181.881599] __inode_
[49181.881601] ? bit_waitqueue+
[49181.881605] inode_wait_
[49181.881609] evict+0xb5/0x1a0
[49181.881611] iput+0x19c/0x230
[49181.881648] zfs_iput_
[49181.881682] zfs_get_
[49181.881718] zil_commit.
[49181.881752] zil_commit+
[49181.881784] zpl_writepages+
[49181.881787] do_writepages+
[49181.881790] __writeback_
[49181.881792] ? __writeback_
[49181.881794] writeback_
[49181.881796] wb_writeback+
[49181.881799] wb_workfn+
[49181.881800] ? wb_workfn+
[49181.881803] ? __switch_
[49181.881809] process_
[49181.881811] worker_
[49181.881813] kthread+0x121/0x140
[49181.881815] ? process_
[49181.881817] ? kthread_
[49181.881819] ret_from_
This is caused by a race between the ZFS writeback and evict threads, usually during a transaction group sync operation. It's possible to have two iput() threads racing for the same inode: one of them scheduled async and the other executed synchronously as part of the writeback path. If the writeback thread tries to evict the inode while the async thread is running, it might re-enter the block layer for the same inode because the ZFS counters are in an inconsistent state. This causes the kworker thread to stall the writeback, which in turn prevents the transaction group sync from completing and blocks other ZFS threads.
This is fixed by the upstream commit:
- Fix zrele race in zrele_async that can cause hang (43eaef6de817) [0]
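To make the hazard concrete, here is a minimal userspace sketch of a check-then-act race on a reference count, which is one common way two releasing threads can disagree about who holds the last reference. It is not the upstream patch: `ref_t`, `release_racy`, `release_safe` and `ref_add_unless` are hypothetical names used only for illustration, and `ref_add_unless` merely mirrors the semantics of the kernel's `atomic_add_unless()` using C11 atomics.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an inode reference count. */
typedef struct {
    atomic_int count;
} ref_t;

/*
 * Add `a` to *v unless *v == u. Returns true if the add happened.
 * Modeled on the semantics of the kernel's atomic_add_unless().
 */
bool ref_add_unless(atomic_int *v, int a, int u)
{
    int cur = atomic_load(v);

    while (cur != u) {
        /* On failure, cur is refreshed with the current value. */
        if (atomic_compare_exchange_weak(v, &cur, cur + a))
            return true;
    }
    return false;
}

/*
 * Racy release: the check and the decrement are separate steps.
 * Another thread can drop its reference between them, so this
 * thread may end up performing the final put synchronously.
 * Returns false if the caller must defer the final put.
 */
bool release_racy(ref_t *r)
{
    if (atomic_load(&r->count) == 1)
        return false;
    /* Window: count may have changed since the check above. */
    atomic_fetch_sub(&r->count, 1);
    return true;
}

/*
 * Safer release: decrement only if we are not the last holder,
 * as a single atomic step. Returns false if the caller holds the
 * last reference and must defer the final put.
 */
bool release_safe(ref_t *r)
{
    return ref_add_unless(&r->count, -1, 1);
}
```

With `release_safe`, a caller that gets `false` back knows it still holds the last reference and can hand the final release to an asynchronous worker instead of dropping it inline.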
[Test Case]
Being a race condition, this issue has been hard to reproduce consistently. It has been reported on heavy I/O workloads that mix file creation and deletion. Reports from both upstream and Ubuntu users indicate that it is usually reproducible on, e.g., heavy SQL workloads or complex ccache-enabled builds [1].
[0] https:/
[1] https:/
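As a rough illustration of the kind of workload described in the test case, the sketch below creates, writes, fsyncs and unlinks small files in a loop. It is not a confirmed reproducer for this bug: the function name `churn_files`, the file naming scheme and the payload are all assumptions made for the example.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create, write, fsync and unlink `rounds` small files under `dir`.
 * Returns 0 on success, -1 on the first failure.
 */
int churn_files(const char *dir, int rounds)
{
    char path[4096];
    const char payload[] = "zfs-churn\n";
    const ssize_t len = (ssize_t)(sizeof(payload) - 1);

    for (int i = 0; i < rounds; i++) {
        int n = snprintf(path, sizeof(path), "%s/churn-%d-%d",
                         dir, (int)getpid(), i);
        if (n < 0 || (size_t)n >= sizeof(path))
            return -1;

        int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
        if (fd < 0)
            return -1;

        /* Force the data out so the fsync path is exercised. */
        int ok = (write(fd, payload, (size_t)len) == len) &&
                 (fsync(fd) == 0);
        close(fd);

        /* Remove the file right away to mix in deletions. */
        if (unlink(path) != 0 || !ok)
            return -1;
    }
    return 0;
}
```

Running several instances of such a loop in parallel against a directory on a ZFS dataset, with a large `rounds` value, is one way to put pressure on the sync and evict paths while watching the kernel log for hung task timeouts.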
[Regression Potential]
The patch has been tested in the ZFS test suite and in production environments, so the potential for further regressions should be fairly controlled. Potential regressions might arise in the ZFS writeback path, causing write hangs and eventually stalling all ZFS-backed operations indefinitely. We should monitor heavy I/O workloads that put a lot of stress on the sync and evict paths to exercise the new changes.
description: | updated |
Changed in zfs-linux (Ubuntu Groovy): | |
importance: | Undecided → Critical |
Changed in zfs-linux (Ubuntu Focal): | |
importance: | Undecided → High |
importance: | High → Critical |
Changed in zfs-linux (Ubuntu Bionic): | |
importance: | Undecided → Critical |
Changed in zfs-linux (Ubuntu Xenial): | |
importance: | Undecided → High |
assignee: | nobody → Heitor Alves de Siqueira (halves) |
Changed in zfs-linux (Ubuntu Bionic): | |
assignee: | nobody → Heitor Alves de Siqueira (halves) |
Changed in zfs-linux (Ubuntu Groovy): | |
assignee: | nobody → Heitor Alves de Siqueira (halves) |
Changed in zfs-linux (Ubuntu Focal): | |
assignee: | nobody → Heitor Alves de Siqueira (halves) |
Changed in zfs-linux (Ubuntu Bionic): | |
status: | New → In Progress |
Changed in zfs-linux (Ubuntu Xenial): | |
status: | New → In Progress |
Changed in zfs-linux (Ubuntu Focal): | |
status: | New → In Progress |
Changed in zfs-linux (Ubuntu Groovy): | |
status: | New → In Progress |
Changed in zfs-linux (Ubuntu Hirsute): | |
status: | Confirmed → In Progress |
Changed in zfs-linux (Ubuntu Hirsute): | |
status: | In Progress → Fix Released |
assignee: | Heitor Alves de Siqueira (halves) → nobody |
description: | updated |
Changed in zfs-linux (Debian): | |
status: | Unknown → New |
description: | updated |
Changed in zfs-linux (Debian): | |
status: | New → Fix Released |
FYI, I've sponsored these and uploaded them; they are now waiting in -proposed. I also tested these patches on an AMD64 VM using the more exhaustive kernel team ZFS test suite, and they passed.