Thursday, March 1, 2012

Linux Nvidia: NVRM: GPU at 0000:01:00.0 Has Fallen Off The Bus Error and Solution


I'm using the NVIDIA UNIX x86_64 kernel module (driver) version 280.13 under 64-bit Debian Linux with kernel 2.6.32-5-amd64. However, I'm getting the following errors in my /var/log/messages file:
Feb 13 05:53:39 wks01 kernel: [26652.425207] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
Feb 14 03:59:14 wks01 kernel: [39846.244283] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
Feb 17 04:47:32 wks01 kernel: [35237.485871] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
Feb 18 06:53:19 wks01 kernel: [49298.937949] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
Feb 19 06:14:01 wks01 kernel: [28508.567838] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
This error occurs randomly and my laptop freezes completely. A hard reboot is the only way to recover my Dell M6500 laptop, which runs Debian Linux, from the freeze. How do I fix this problem?
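To see how often the error shows up, you can count the matching lines in the log. The following is a minimal sketch: the function name count_bus_errors is my own, and it assumes the kernel messages use the same wording as the lines shown above.

```shell
#!/bin/sh
# Count "has fallen off the bus" events in kernel log text read from stdin.
# count_bus_errors is a hypothetical helper name, not part of any NVIDIA tool.
count_bus_errors() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps the function's status at success.
    grep -c 'NVRM: GPU at .* has fallen off the bus' || true
}

# Usage (as root, log path as in the question above):
#   count_bus_errors < /var/log/messages
```

A count that keeps growing between reboots is a good sign that the fixes below have not taken effect yet.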

This issue is widely reported, and the most commonly recommended solutions are as follows:

Install Latest Kernel Version and NVIDIA Driver

You need to update your kernel and install the latest NVIDIA Unix driver.

Put NVIDIA Driver In Persistence Mode

You need to set your GPU in persistence mode. From the nvidia-smi man page:
A flag that indicates whether persistence mode is enabled for the GPU. Value is either "Enabled" or "Disabled". When persistence mode is enabled the NVIDIA driver remains loaded even when no active clients, such as X11 or nvidia-smi, exist. This minimizes the driver load latency associated with running dependent apps, such as CUDA programs. For all CUDA-capable products. Linux only.
Edit the /etc/rc.local file and add the following line before the exit 0 statement:
 
/usr/bin/nvidia-smi -pm 1
 
Save and close the file. The above line puts your GPU into persistence mode every time the system boots.
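If you want /etc/rc.local to keep working even when the driver is not installed (for example after a kernel upgrade), you can guard the call. This is only a sketch: enable_persistence is a hypothetical helper name, and the default path is the one used above.

```shell
#!/bin/sh
# Turn on persistence mode only if the nvidia-smi binary is present.
# enable_persistence is a hypothetical helper; the optional first argument
# overrides the default binary path used in this post.
enable_persistence() {
    smi="${1:-/usr/bin/nvidia-smi}"
    if [ -x "$smi" ]; then
        "$smi" -pm 1
    else
        # Do not fail the boot script just because the tool is missing.
        echo "nvidia-smi not found at $smi; skipping persistence mode" >&2
        return 0
    fi
}

# Usage in /etc/rc.local, before the exit 0 statement:
#   enable_persistence
```

Returning 0 in the missing-binary case is a design choice so that the rest of rc.local still runs; drop it if you would rather have the failure be visible.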

How Do I Set Persistence Mode From Command Line?

Type the following command as the root user:
# /usr/bin/nvidia-smi -pm 1

How Do I Verify That Persistence Mode Is Set On My Device?

Type the following command as the root user:
# /usr/bin/nvidia-smi -q | grep -i Persistence
Sample outputs:
    Persistence Mode            : Enabled
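The same check can be scripted so that it returns an exit status instead of text you have to read. A minimal sketch, assuming the output contains a "Persistence Mode : Enabled" line as in the sample above (persistence_enabled is a hypothetical name):

```shell
#!/bin/sh
# Succeed (exit status 0) if `nvidia-smi -q` output on stdin reports
# persistence mode as Enabled. persistence_enabled is a hypothetical helper.
persistence_enabled() {
    # First grep keeps the Persistence Mode line(s); the second checks the value.
    grep -i 'Persistence Mode' | grep -qi ': *Enabled'
}

# Usage (as root):
#   /usr/bin/nvidia-smi -q | persistence_enabled && echo "persistence mode is on"
```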

How Do I View All Settings?

Type the following command to display GPU or unit info:
# nvidia-smi -q | less
Sample outputs:
==============NVSMI LOG==============
Timestamp                       : Tue Feb 21 07:20:20 2012
Driver Version                  : 280.13
Attached GPUs                   : 1
GPU 0000:01:00.0
    Product Name                : Quadro FX 2800M
    Display Mode                : N/A
    Persistence Mode            : Enabled
    Driver Model
        Current                 : N/A
        Pending                 : N/A
    Serial Number               : N/A
    GPU UUID                    : N/A
    Inforom Version
        OEM Object              : N/A
        ECC Object              : N/A
        Power Management Object : N/A
    PCI
        Bus                     : 1
        Device                  : 0
        Domain                  : 0
        Device Id               : 061D10DE
        Bus Id                  : 0000:01:00.0
    Fan Speed                   : N/A
    Memory Usage
        Total                   : 1023 Mb
        Used                    : 74 Mb
        Free                    : 949 Mb
    Compute Mode                : Default
    Utilization
        Gpu                     : N/A
        Memory                  : N/A
    Ecc Mode
        Current                 : N/A
        Pending                 : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory   : N/A
                Register File   : N/A
                L1 Cache        : N/A
                L2 Cache        : N/A
                Total           : N/A
            Double Bit
                Device Memory   : N/A
                Register File   : N/A
                L1 Cache        : N/A
                L2 Cache        : N/A
                Total           : N/A
        Aggregate
            Single Bit
                Device Memory   : N/A
                Register File   : N/A
                L1 Cache        : N/A
                L2 Cache        : N/A
                Total           : N/A
            Double Bit
                Device Memory   : N/A
                Register File   : N/A
                L1 Cache        : N/A
                L2 Cache        : N/A
                Total           : N/A
    Temperature
        Gpu                     : 48 C
    Power Readings
        Power State             : N/A
        Power Management        : N/A
        Power Draw              : N/A
        Power Limit             : N/A
    Clocks
        Graphics                : N/A
        SM                      : N/A
        Memory                  : N/A
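Since overheating is one thing worth ruling out when a GPU drops off the bus, it can be handy to pull just the temperature out of this report. The sketch below assumes the exact layout shown above for driver 280.13 (a "Temperature" heading followed by a "Gpu : NN C" line); other driver versions may print the report differently. gpu_temperature is a hypothetical name.

```shell
#!/bin/sh
# Print the GPU temperature (number only) from `nvidia-smi -q` output on stdin.
# Assumes the report layout shown in this post; gpu_temperature is a
# hypothetical helper name.
gpu_temperature() {
    awk -F': *' '
        /^ *Temperature/ { in_temp = 1; next }   # enter the Temperature section
        in_temp && /Gpu/ {
            sub(/ C$/, "", $2)                   # strip the trailing unit
            print $2
            exit
        }
    '
}

# Usage:
#   nvidia-smi -q | gpu_temperature
```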
