RuntimeError: CUDA error: an illegal memory access was encountered

4. My solution

I later found out that one of the GPU cards was actually faulty.
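One way to narrow down a suspect card is to exercise each GPU in isolation. The sketch below is not the procedure used in this post; it assumes PyTorch is installed, and the helper name `probe_gpu` and the matrix size are illustrative choices. Each probe runs in its own subprocess, because an illegal memory access usually leaves the CUDA context of the current process unusable.

```python
import os
import subprocess
import sys

# Small workload: GPU compute followed by a GPU -> CPU copy.
PROBE = """
import torch
x = torch.randn(1024, 1024, device="cuda")
y = (x @ x).cpu()
torch.cuda.synchronize()
print("ok")
"""

def probe_gpu(index, timeout=300):
    """Run the probe on one GPU only; return (succeeded, stderr_text)."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(index)  # expose only this card
    env["CUDA_LAUNCH_BLOCKING"] = "1"         # report errors at the failing call
    proc = subprocess.run(
        [sys.executable, "-c", PROBE],
        env=env, capture_output=True, text=True, timeout=timeout,
    )
    return proc.returncode == 0, proc.stderr.strip()
```

Calling `probe_gpu(i)` for each card index and comparing the results may show whether failures cluster on a single device. A passing probe does not prove that a card is healthy, since this workload is much lighter than real training.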

It is not hard to see that the errors almost always appeared at the point where data was being moved from the GPU to the CPU.

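So I wondered whether the CPU side was running short of memory and that was what triggered the memory access error.
Since I run everything in a container, I changed the following setting in docker-compose (or the Dockerfile setup):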
```
    shm_size: 64G  →     shm_size: 128G
```
`shm_size` sets the size of the container's shared memory.
After this change the error mostly stopped appearing...
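To confirm that the new limit is actually in effect inside the container, the shared-memory mount can be inspected directly. This is a minimal sketch, assuming a Linux container where shared memory is mounted at `/dev/shm`; the helper name `shm_usage_gib` is made up for illustration.

```python
import shutil

def shm_usage_gib(path="/dev/shm"):
    """Return (total, used, free) for the given mount, in GiB."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return usage.total / gib, usage.used / gib, usage.free / gib
```

Running `shm_usage_gib()` inside the container before and after changing `shm_size` should show the total move from roughly 64 to roughly 128. Watching the free value while the job runs can hint at whether shared memory is really the bottleneck.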

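The obligatory rant: the first time I hit this error I was speechless. The same code used to run without errors, although a few things had changed since then.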

My first guess was that it had something to do with GPU memory.

First error

Following that line of thought, I kept lowering batch_size (8→5), and the error simply moved to a different location.

Changing approach

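The message is a runtime error thrown by CUDA: an illegal memory access occurred. There is plenty of discussion about it online, but I did not find anyone who had identified the real root cause.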

Many of the suggested fixes are based on guesswork.
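One way to reduce the guesswork is to make the stack trace point at the real failing call. CUDA kernels launch asynchronously, so an error is often reported at a later synchronization point, such as a GPU to CPU copy, instead of at the operation that caused it. Setting the `CUDA_LAUNCH_BLOCKING=1` environment variable forces synchronous launches. The sketch below only builds such an environment; the helper name `blocking_cuda_env` and the script name `train.py` are hypothetical.

```python
import os

def blocking_cuda_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # With CUDA_LAUNCH_BLOCKING=1 each kernel launch waits for completion,
    # so an error is raised at the call that triggered it.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env

# Example (train.py is a hypothetical script name):
#   subprocess.run([sys.executable, "train.py"], env=blocking_cuda_env())
```

Synchronous launches slow training down noticeably, so this setting is best kept for debugging runs only.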

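References:

PyTorch GitHub issue: RuntimeError: CUDA error: an illegal memory access was encountered
- This answer seems to have worked for quite a few people. The author of 一次惨痛的debug的经历-RuntimeError: CUDA error: an illegal memory access was encountered ("A painful debugging experience") solved the problem this way.
Others are more anecdotal:
- CSDN blog: [彻底解决]CUDA error: an illegal memory access was encountered(CUDA错误非法访问内存) ("[Fully solved] CUDA error: illegal memory access")
YOLO GitHub issue: Cuda illegal memory access when running inference on *.engine #6311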

Search for /etc/X11/xorg.conf

References:

https://download.nvidia.com/XFree86/Linux-x86_64/396.51/README/editxconfig.html
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#address-custom-xorg-conf-if-applicable
https://unix.stackexchange.com/questions/200553/multi-nvidia-gpu-overclocking-for-computations-cuda
