Contents

1 Error: [ERROR] Dvpp vdec process 15015th frame failed, error:-1

2 Error: Chn 0 hi_mpi_vdec_release_frame Fail, Error Code = a0058015

3 Error: ../sysdeps/aarch64/multiarch/../memcpy.S: No such file or directory.

References


1 Error: [ERROR] Dvpp vdec process 15015th frame failed, error:-1

Decoding failed on a Huawei Ascend 310B1 platform with the following error:

    [ERROR]  Dvpp vdec process 15015th frame failed, error:-1
    E0528 11:02:39.182062 50284 decode_impl_acl.cpp:171] [InferServer] [DecoderAcl] SendStream(): acl Decode failed, ret = -1

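Debugging showed that the decoded data for the first frame was fine, but from the second frame on it was not. Debugging alone did not lead to a fix, so I went to the manual and read the description of the hi_mpi_vdec_get_frame interface.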

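The hi_mpi_vdec_get_frame description by itself held nothing surprising, but right below it was hi_mpi_vdec_release_frame. Well, a beginner's mistake: I had forgotten to release the frame. Every frame obtained with hi_mpi_vdec_get_frame has to be handed back with hi_mpi_vdec_release_frame once it is no longer needed.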

2 Error: Chn 0 hi_mpi_vdec_release_frame Fail, Error Code = a0058015

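In my code, sending stream data and retrieving frames happen in one thread, and frame post-processing happens in another. After post-processing, that second thread calls hi_mpi_vdec_release_frame to release the frame, and the call fails with:

    Chn 0 hi_mpi_vdec_release_frame Fail, Error Code = a0058015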

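I spent a long time on this. I wrote debug code, tried one thing after another, and read Huawei's chip manuals. In the end the pattern was clear: releasing in the same thread produced no error, and releasing from the other thread did. I added the line auto ret = aclrtSetDevice(0); at the start of the worker thread, and after that the release succeeded: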

    AclLiteError VdecHelperV2::frameProcessFunc(){
        auto ret = aclrtSetDevice(0);  // with this line added, release no longer fails

        while(1)
        {
            printf("before pop-----------videoFrameQueue_->Size========================%ld\n", videoFrameQueue_->Size());
            frame_info_user_data *frame_userdata = nullptr; 
            videoFrameQueue_->Pop(frame_userdata);
            printf("after pop-----------videoFrameQueue_->Size==========%ld\n", videoFrameQueue_->Size());
           
            // debug only, remember to remove
            auto ret = hi_mpi_vdec_release_frame(0, frame_userdata->frame_info);
            if (ret != HI_SUCCESS) {
                printf("[%s][%d] Chn %u hi_mpi_vdec_release_frame Fail, Error Code = %x \n",
                    __FUNCTION__, __LINE__, 0, ret);
            }

        /*
            if(nullptr != frame_userdata){
                //frame_info_user_data* frame_userdata = (frame_info_user_data*)temp_ptr;
                auto frame    = frame_userdata->frame_info;
                auto userdata = frame_userdata->userData;
                //callbackV2_(frame, userdata);  // debug only, remember to remove
                
                delete frame_userdata;  // debug only, remember to remove
            }   */
        }
        return ACLLITE_OK;
    }

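This made the error go away, but it did not feel like the right fix. After rereading Huawei's manual, I concluded that aclrtSetCurrentContext is the more appropriate call here. aclrtSetDevice makes the thread use the device's default context. aclrtSetCurrentContext binds a specific, existing context (here context_) to the calling thread. The final code is below. Note that the body of the while(1) loop is only debug code.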

    AclLiteError VdecHelperV2::frameProcessFunc(){
        //auto ret = aclrtSetDevice(0);

        aclError ret = aclrtSetCurrentContext(context_);
        if (ret != ACL_SUCCESS) {
            ACLLITE_LOG_ERROR("Video decoder set context failed, error: %d", ret);
        }

        while(1)
        {
            printf("before pop-----------videoFrameQueue_->Size========================%ld\n", videoFrameQueue_->Size());
            frame_info_user_data *frame_userdata = nullptr; 
            videoFrameQueue_->Pop(frame_userdata);
            printf("after pop-----------videoFrameQueue_->Size==========%ld\n", videoFrameQueue_->Size());
           
            // debug only, remember to remove
            auto ret = hi_mpi_vdec_release_frame(0, frame_userdata->frame_info);
            if (ret != HI_SUCCESS) {
                printf("[%s][%d] Chn %u hi_mpi_vdec_release_frame Fail, Error Code = %x \n",
                    __FUNCTION__, __LINE__, 0, ret);
            }

        /*
            if(nullptr != frame_userdata){
                //frame_info_user_data* frame_userdata = (frame_info_user_data*)temp_ptr;
                auto frame    = frame_userdata->frame_info;
                auto userdata = frame_userdata->userData;
                //callbackV2_(frame, userdata);  // debug only, remember to remove
                
                delete frame_userdata;  // debug only, remember to remove
            }   */
        }
        return ACLLITE_OK;
    }

3 Error: ../sysdeps/aarch64/multiarch/../memcpy.S: No such file or directory.

    Thread 37 "rtsp_sink" received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 0xe7ff817ab000 (LWP 3282534)]
    __memcpy_generic () at ../sysdeps/aarch64/multiarch/../memcpy.S:188
    188	../sysdeps/aarch64/multiarch/../memcpy.S: No such file or directory.
    (gdb) bt
    #0  __memcpy_generic () at ../sysdeps/aarch64/multiarch/../memcpy.S:188
    #1  0x0000e7fff6731e70 in memcpy_s () from /data/chw/aclstream/3rdparty/ascend/lib/linux_lib/libc_sec.so
    #2  0x0000e7fff54be080 in drvMemcpyInner () from /data/chw/aclstream/3rdparty/ascend/lib/linux_lib/libascend_hal.so
    #3  0x0000e7fff54be1bc in drvMemcpy () from /data/chw/aclstream/3rdparty/ascend/lib/linux_lib/libascend_hal.so
    #4  0x0000e7fff6f69408 in ?? () from /data/chw/aclstream/3rdparty/ascend/lib/linux_lib/libruntime.so
    #5  0x0000e7fff6c97730 in ?? () from /data/chw/aclstream/3rdparty/ascend/lib/linux_lib/libruntime.so
    #6  0x0000e7fff6e309a4 in ?? () from /data/chw/aclstream/3rdparty/ascend/lib/linux_lib/libruntime.so
    #7  0x0000e7fff6d021e4 in ?? () from /data/chw/aclstream/3rdparty/ascend/lib/linux_lib/libruntime.so
    #8  0x0000e7fff6c2ea20 in rtMemcpy () from /data/chw/aclstream/3rdparty/ascend/lib/linux_lib/libruntime.so
    #9  0x0000e7fff93e92d8 in aclrtMemcpy () from /usr/local/Ascend/ascend-toolkit/latest/lib64/libascendcl.so
    #10 0x0000e7fffac53280 in acllite::CopyDataToHostEx (dest=0xf00018000200, destSize=3133440, src=0xf00014000200, 
        srcSize=3133440, runMode=ACL_DEVICE) at src/AclLiteUtils.cpp:318
    #11 0x0000e7fffac53544 in acllite::CopyDataToHostEx (dest=0xf00018000200, destSize=3133440, src=0xf00014000200, 
        srcSize=3133440, deviceId=0) at src/AclLiteUtils.cpp:360
    #12 0x0000e7fffe5504f4 in infer_server::DecoderAcl::OnFrame (this=0xe7ff880dc210, 
        codec_image=std::shared_ptr<acllite::ImageData> (use count 4, weak count 0) = {...}, channel_id=0, frame_id=5582577)
        at /data/chw/aclstream/infer_server/src/acl/decode_impl_acl.cpp:243
    #13 0x0000e7fffe54f2d4 in infer_server::CallBackVdec (
        decoded_image=std::shared_ptr<acllite::ImageData> (use count 4, weak count 0) = {...}, channel_id=0, 
        frame_id=5582577, user_data=0xe7ff880dc210) at /data/chw/aclstream/infer_server/src/acl/decode_impl_acl.cpp:36
    #14 0x0000e7fffacab484 in acllite::VideoDecoder::DecodeCallback (this=0xe7ff886ca3a0, 
        decodedImage=std::shared_ptr<acllite::ImageData> (use count 4, weak count 0) = {...}, frameId=5582577, 
        userData=0xe7ff880dc210) at src/VideoDecoder.cpp:350
    #15 0x0000e7fffacaadf0 in acllite::VideoDecoder::DvppVdecCallbackV2 (frame=0xe7ff88a5c960, userdata=0xe7ff88894150)
        at src/VideoDecoder.cpp:274
    #16 0x0000e7fffac9bf8c in acllite::VdecHelperV2::frameProcessFunc (this=0xe7ff88006be0) at src/VdecHelperV2.cpp:230
    #17 0x0000e7fffaca0638 in std::__invoke_impl<int, int (acllite::VdecHelperV2::*)(), acllite::VdecHelperV2*> (
        __f=@0xe7ff88001690: (int (acllite::VdecHelperV2::*)(acllite::VdecHelperV2 * const)) 0xe7fffac9bec4 <acllite::VdecHelperV2::frameProcessFunc()>, __t=@0xe7ff88001688: 0xe7ff88006be0) at /usr/include/c++/11/bits/invoke.h:74
    #18 0x0000e7fffaca0578 in std::__invoke<int (acllite::VdecHelperV2::*)(), acllite::VdecHelperV2*> (
        __fn=@0xe7ff88001690: (int (acllite::VdecHelperV2::*)(acllite::VdecHelperV2 * const)) 0xe7fffac9bec4 <acllite::VdecHelperV2::frameProcessFunc()>) at /usr/include/c++/11/bits/invoke.h:96
    #19 0x0000e7fffaca04e0 in std::thread::_Invoker<std::tuple<int (acllite::VdecHelperV2::*)(), acllite::VdecHelperV2*> >::_M_invoke<0ul, 1ul> (this=0xe7ff88001688) at /usr/include/c++/11/bits/std_thread.h:259
    #20 0x0000e7fffaca042c in std::thread::_Invoker<std::tuple<int (acllite::VdecHelperV2::*)(), acllite::VdecHelperV2*> >::operator() (this=0xe7ff88001688) at /usr/include/c++/11/bits/std_thread.h:266
    #21 0x0000e7fffaca03c8 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<int (acllite::VdecHelperV2::*)(), acllite::VdecHelperV2*> > >::_M_run (this=0xe7ff88001680) at /usr/include/c++/11/bits/std_thread.h:211
    #22 0x0000e7fffde531fc in ?? () from /lib/aarch64-linux-gnu/libstdc++.so.6
    #23 0x0000e7fffdc1d5c8 in start_thread (arg=0x0) at ./nptl/pthread_create.c:442

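This segfault was erratic. Sometimes it did not occur at all, sometimes it appeared after a few dozen seconds, and sometimes only after a long run. At first I tried to track it down by debugging the code, but I could not find the cause.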

3.1 Finding the cause, part one

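Later I decided to step back. The segfault always occurred in aclrtMemcpy(), but that did not mean the problem originated there. I ran the program again, and while it ran I used

    watch -n 1 -d npu-smi info

to monitor device memory. I saw that memory usage kept growing as the program ran.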

So I suspected that something in my code was not being freed, and I reread all the code I had modified from start to finish.

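It turned out the growth came from the queue. Its capacity was set to 1000, and post-processing was too slow, so the queue kept filling up until it held 1000 frames. After I reduced the capacity to 200, memory no longer grew to absurd levels.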

The segfault, however, was still there.

3.2 Finding the cause, part two

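Next I removed the queue entirely and processed each frame directly after retrieving it. The segfault still occurred. At that point I went back and read Huawei's demo carefully. I noticed that when the demo retrieves a frame, frame is declared as a local variable.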

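In my code, by contrast, frame was allocated with new every time. After I changed it to a local variable as well, repeated test runs produced no more segfaults.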

References

CANN 5.0.4 Application Software Development Guide (C&C++, Inference) 01.pdf

CANN 8.0.RC1 AscendCL Application Software Development Guide (C&C++, Inference) 01.pdf
