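In short, I would rather deploy an Ubuntu 22.04 system and install everything with pip, which is much easier. Otherwise, hunting for matching packages one by one keeps running into dependency problems.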
After checking the official release page (https://github.com/facebookresearch/xformers/releases), I found the entry "v0.0.32: Wheels for PyTorch 2.8.0",
so I ran pip install xformers==0.0.32.post2 --no-deps to install an xformers build that matches my torch 2.8 environment.
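Because --no-deps tells pip to skip dependency resolution, nothing verifies that the installed torch is the one this xformers wheel expects. Below is a minimal sketch of a check one could run afterwards, using only the standard library. The helper names are hypothetical, and whether the xformers wheel declares a torch requirement in its metadata is an assumption, so the filtered list may come back empty.

```python
import re
from importlib import metadata


def declared_requirements(dist_name):
    """Return the requirement strings a distribution declares, or None if it is not installed."""
    try:
        return metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return None


def installed_version(dist_name):
    """Return the installed version string, or None if the distribution is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def requirements_on(requirements, target):
    """Keep only the requirement strings whose project name equals `target`."""
    selected = []
    for req in requirements:
        # The project name is the leading run of name characters, e.g. "torch" in "torch==2.8.0".
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", req.strip())
        if match and match.group(0).lower().replace("_", "-") == target.lower():
            selected.append(req)
    return selected


if __name__ == "__main__":
    reqs = declared_requirements("xformers")
    if reqs is None:
        print("xformers is not installed in this environment")
    else:
        print("xformers declares:", requirements_on(reqs, "torch"))
        print("installed torch:", installed_version("torch"))
```

Running pip check in the same environment gives a similar report for every installed package at once.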
A new problem then appeared when I tried to run the model.
Core error message: RuntimeError: operator torchvision::nms does not exist
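RuntimeError: the program launched successfully but hit a fatal error at run time. According to the traceback, the error is raised while torchvision is being imported, before any model code runs.
operator ... does not exist: a function (an operator) is missing.
torchvision::nms: the missing function is nms (non-maximum suppression) from the torchvision library. It is a core image-processing operator, used mainly to filter overlapping detection boxes.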
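For readers unfamiliar with the operator, here is a minimal pure-Python sketch of what non-maximum suppression computes. It is an illustration only, not torchvision's implementation; the function names and the (x1, y1, x2, y2) box layout are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold):
    """Return indices of the boxes to keep, highest score first.

    A box is dropped when its IoU with an already kept, higher-scoring box
    exceeds iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    # The second box overlaps the first heavily (IoU of about 0.68), so it is suppressed.
    print(nms(boxes, scores, iou_threshold=0.5))  # prints [0, 2]
```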
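Analysis: the most likely cause is a version mismatch between PyTorch and Torchvision, because installing flash-attn with conda earlier changed the torch build in this environment. When torchvision's compiled extension does not match the installed torch, its C++ operators are never registered, so the Python-side registration for nms cannot find the operator.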
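A minimal sketch of how the suspected mismatch could be checked without importing torchvision (the import itself is what crashes). It reads the installed versions from package metadata. The rule "torchvision minor = torch minor + 15" is an observation about the torch 2.0 to 2.8 releases (for example torch 2.8 pairs with torchvision 0.23), not an official formula, and the helper names are hypothetical. The compatibility table in the torchvision README remains the authoritative source.

```python
import re
from importlib import metadata


def major_minor(version):
    """Parse a version such as '2.8.0+cu128' or '0.23.0' into (major, minor)."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def expected_torchvision(torch_version):
    """Expected torchvision (major, minor) for a torch 2.x release, else None.

    Assumption: torchvision 0.(N + 15) pairs with torch 2.N, as observed for 2.0 to 2.8.
    """
    major, minor = major_minor(torch_version)
    if major != 2:
        return None
    return (0, minor + 15)


def versions_match(torch_version, torchvision_version):
    """True when the torchvision version is the one expected for this torch version."""
    expected = expected_torchvision(torch_version)
    return expected is not None and major_minor(torchvision_version) == expected


if __name__ == "__main__":
    try:
        torch_v = metadata.version("torch")
        vision_v = metadata.version("torchvision")
    except metadata.PackageNotFoundError as exc:
        print("missing package metadata:", exc)
    else:
        verdict = "looks consistent" if versions_match(torch_v, vision_v) else "likely mismatched"
        print(f"torch {torch_v} / torchvision {vision_v}: {verdict}")
```

Matching version numbers are necessary but not sufficient: a pip-installed torchvision on top of a conda-installed torch (or different CUDA builds) can still fail in the same way, so reinstalling both packages from one source is the safer fix.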
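The full command and its output follow. The tracebacks from the two ranks are interleaved because both processes write to the same terminal: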
(opensora-conda-forge) studentluo@n1:~/jupyterlab/Open-Sora$ torchrun --nproc_per_node 2 --standalone scripts/diffusion/inference.py configs/diffusion/inference/256px.py \
--ckpt-path YOUR_256px_MODEL_PATH \
--cond_type i2v_head \
--prompt "A plump pig wallows in a muddy pond on a rustic farm, its pink snout poking out as it snorts contentedly. The camera captures the pig's playful splashes, sending ripples through the water under the midday sun. Wooden fences and a red barn stand in the background, framed by rolling green hills. The pig's muddy coat glistens in the sunlight, showcasing the simple pleasures of its carefree life." \
--ref assets/texts/i2v.png
W1114 14:26:30.211000 10317 site-packages/torch/distributed/run.py:774]
W1114 14:26:30.211000 10317 site-packages/torch/distributed/run.py:774] *****************************************
W1114 14:26:30.211000 10317 site-packages/torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1114 14:26:30.211000 10317 site-packages/torch/distributed/run.py:774] *****************************************
Traceback (most recent call last):
File "/home/studentluo/jupyterlab/Open-Sora/scripts/diffusion/inference.py", line 15, in <module>
from opensora.datasets.dataloader import prepare_dataloader
File "/home/studentluo/.conda/envs/opensora-conda-forge/lib/python3.10/site-packages/opensora/datasets/__init__.py", line 1, in <module>
from .datasets import TextDataset, VideoTextDataset
File "/home/studentluo/.conda/envs/opensora-conda-forge/lib/python3.10/site-packages/opensora/datasets/datasets.py", line 8, in <module>
from torchvision.datasets.folder import pil_loader
File "/home/studentluo/.conda/envs/opensora-conda-forge/lib/python3.10/site-packages/torchvision/__init__.py", line 10, in <module>
Traceback (most recent call last):
File "/home/studentluo/jupyterlab/Open-Sora/scripts/diffusion/inference.py", line 15, in <module>
from torchvision import meta_registrations, datasets, io, models, ops, transforms, utils # usort:skip
File "/home/studentluo/.conda/envs/opensora-conda-forge/lib/python3.10/site-packages/torchvision/meta_registrations.py", line 164, in <module>
from opensora.datasets.dataloader import prepare_dataloader
File "/home/studentluo/.conda/envs/opensora-conda-forge/lib/python3.10/site-packages/opensora/datasets/__init__.py", line 1, in <module>
def meta_nms(dets, scores, iou_threshold):
File "/home/studentluo/.conda/envs/opensora-conda-forge/lib/python3.10/site-packages/torch/library.py", line 1069, in register
from .datasets import TextDataset, VideoTextDataset
File "/home/studentluo/.conda/envs/opensora-conda-forge/lib/python3.10/site-packages/opensora/datasets/datasets.py", line 8, in <module>
from torchvision.datasets.folder import pil_loader
File "/home/studentluo/.conda/envs/opensora-conda-forge/lib/python3.10/site-packages/torchvision/__init__.py", line 10, in <module>
use_lib._register_fake(
File "/home/studentluo/.conda/envs/opensora-conda-forge/lib/python3.10/site-packages/torch/library.py", line 219, in _register_fake
from torchvision import meta_registrations, datasets, io, models, ops, transforms, utils # usort:skip
File "/home/studentluo/.conda/envs/opensora-conda-forge/lib/python3.10/site-packages/torchvision/meta_registrations.py", line 164, in <module>
handle = entry.fake_impl.register(
File "/home/studentluo/.conda/envs/opensora-conda-forge/lib/python3.10/site-packages/torch/_library/fake_impl.py", line 50, in register
def meta_nms(dets, scores, iou_threshold):
if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"): File "/home/studentluo/.conda/envs/opensora-conda-forge/lib/python3.10/site-packages/torch/library.py", line 1069, in register
RuntimeError: operator torchvision::nms does not exist
use_lib._register_fake(
File "/home/studentluo/.conda/envs/opensora-conda-forge/lib/python3.10/site-packages/torch/library.py", line 219, in _register_fake
handle = entry.fake_impl.register(
File "/home/studentluo/.conda/envs/opensora-conda-forge/lib/python3.10/site-packages/torch/_library/fake_impl.py", line 50, in register
if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
RuntimeError: operator torchvision::nms does not exist
E1114 14:26:32.970000 10317 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 10321) of binary: /home/studentluo/.conda/envs/opensora-conda-forge/bin/python
Traceback (most recent call last):
File "/home/studentluo/.conda/envs/opensora-conda-forge/bin/torchrun", line 9, in <module>
sys.exit(main())
File "/home/studentluo/.conda/envs/opensora-conda-forge/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
return f(*args, **kwargs)
File "/home/studentluo/.conda/envs/opensora-conda-forge/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/home/studentluo/.conda/envs/opensora-conda-forge/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/studentluo/.conda/envs/opensora-conda-forge/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
return launch_agent(self.config, self.entrypoint, list(args))
File "/home/studentluo/.conda/envs/opensora-conda-forge/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
scripts/diffusion/inference.py FAILED
Failures:
[1]:
time : 2025-11-14_14:26:32
host : n1.example.com
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 10322)
error_file: <N/A>
Root Cause (first observed failure):
[0]:
time : 2025-11-14_14:26:32
host : n1.example.com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 10321)
error_file: <N/A>
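Since both ranks print at the same time, the traceback above is hard to read. A minimal sketch for reproducing the failure in a single process, without torchrun: it imports each module in a fresh interpreter and reports only the final error line. The helper name is hypothetical, and the script is meant to be run with the Python interpreter of the affected environment.

```python
import subprocess
import sys


def last_error_line(command):
    """Run `command`; return None on success, else the last non-empty stderr line."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode == 0:
        return None
    lines = [line for line in result.stderr.splitlines() if line.strip()]
    return lines[-1] if lines else f"exit code {result.returncode}"


if __name__ == "__main__":
    # Each import runs in its own interpreter so one failure cannot mask the other.
    for module in ("torch", "torchvision"):
        error = last_error_line([sys.executable, "-c", f"import {module}"])
        print(f"import {module}:", "ok" if error is None else error)
```

If torch imports cleanly and only torchvision fails, that supports the mismatch explanation rather than a problem in the Open-Sora code itself.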