How We Dramatically Improved Inference Server Performance with TensorRT and Triton Inference Server and Shipped It to Production

Author: yossylx (https://blog.hatena.ne.jp/yossylx/)
Blog: LayerX Engineering Blog (LayerX エンジニアブログ), https://tech.layerx.co.jp/
Published: 2024-06-20 17:27:55
URL: https://tech.layerx.co.jp/entry/2024/06/20/172755
Categories: Machine Learning, MLOps

I'm Yoshida, a machine learning engineer. In my previous post I wrote about evaluating the performance of NVIDIA Triton Inference Server; this post is the sequel.

Since that post, I have kept testing Triton Inference Server. As a result, we were able to improve the performance of our inference server substantially and roll it out to production. In this post I describe the improvements and verification work we did on the way to production.

Introduction

Background

At Bakuraku we develop machine learning models such as invoice OCR, and they need to return inference results in real time. The inference API is built on Nginx, Gun…