Papers on microservices and RCA

Survey

Fault Analysis and Debugging of Microservice Systems: Industrial Survey, Benchmark System, and Empirical Study | IEEE Journals & Magazine | IEEE Xplore
- Surveys practitioners about failures in real systems
Enjoy your observability: an industrial survey of microservice tracing and analysis | Empirical Software Engineering
- Surveys engineers who actually operate production systems
Failures and Fixes: A Study of Software System Incident Response | IEEE Conference Publication | IEEE Xplore
- Investigates incidents based on publicly available information
A Qualitative Interview Study of Distributed Tracing Visualisation: A Characterisation of Challenges and Opportunities | IEEE Journals & Magazine | IEEE Xplore
- Interviews engineers about distributed tracing
Unveiling the Hardware and Software Implications of Microservices in Cloud and Edge Systems | IEEE Journals & Magazine | IEEE Xplore
- An article describing the concrete scale of a commercial microservice deployment (Netflix)

Production Microservice Analysis

Characterizing Microservice Dependency and Performance | Proceedings of the ACM Symposium on Cloud Computing
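- Analysis of Alibaba's microservices
- "I read a paper analyzing a huge system designed with Alibaba's microservice architecture" (original title: Alibabaのマイクロサービスアーキテクチャで設計された巨大なシステムを分析した論文を読んだ) | koyama’s blog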
An In-Depth Study of Microservice Call Graph and Runtime Performance | IEEE Journals & Magazine | IEEE Xplore
- Analysis of Alibaba's microservices (2)
Characterizing and synthesizing the workflow structure of microservices in ByteDance Cloud - Wen - 2022 - Journal of Software: Evolution and Process - Wiley Online Library
- Analysis of ByteDance's microservices
Lifting the veil on Meta’s microservice architecture: Analyses of topology and request workflows | USENIX
- Analysis of Meta's microservices
- #122: Lifting the veil on Meta’s microservice architecture – Misreading Chat

Alert storm/RCA

Automatically and Adaptively Identifying Severe Alerts for Online Service Systems | IEEE Conference Publication | IEEE Xplore
- Identifies severe alerts using real data from China Construction Bank
Understanding and Handling Alert Storm for Online Service Systems | IEEE Conference Publication | IEEE Xplore
- Proposes a method for summarizing the alerts in an alert storm, using real data from a Chinese bank
TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems | Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
- Microsoft engineers propose a method that prunes redundant graph structure for Root Cause Analysis in microservice-based systems
Redesigning of OpsGenie Paging Alert System in Responding to Critical Tickets: An Application of DMAIC Methodology | Proceedings of the 2024 6th International Conference on Management Science and Industrial Engineering
- Reports concrete figures for the time that real on-call response takes at an IT Service Desk

cf. #SRE論文紹介 Detection is Better Than Cure: A Cloud Incidents Perspective V. Ganatra et. al., ESEC/FSE’23 - Speaker Deck

Incident Linking

Dependency Aware Incident Linking in Large Cloud Systems | Companion Proceedings of the ACM Web Conference 2024

Log search engine

TencentCLS: the cloud log service with high query performances: Proceedings of the VLDB Endowment: Vol 15, No 12
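- Describes Tencent's log management platform.
- The logs it handles are assumed to reach petabyte scale per day.
- The BKD tree was introduced in Apache Lucene 6.0, but the paper raises as a problem that the BKD tree's complexity scales linearly.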
LogStore | Proceedings of the 2021 International Conference on Management of Data
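- Introduces Alibaba's log management platform.
- Write throughput is heavy: tens of millions of log records are reportedly written per second.
- On the search side, there are reportedly hundreds of thousands of tenants, and queries run over petabytes of logs.
- The paper frames the difficulty of designing cost-effective, scalable log storage as the problem.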
LogLens: A Real-Time Log Analysis System | IEEE Conference Publication | IEEE Xplore
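- Written mainly by researchers at NEC Laboratories America.
- Proposes a real-time log analysis system.
- Also parses application logs using unsupervised machine learning.
- My impression is that methods like these, which discover anomalous events in logs or automatically generate parser patterns for log messages, form a research theme of their own.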
FLAP | Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
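- Written mainly by researchers at Florida International University.
- Introduces the techniques used in the FIU Log Analysis Platform, a platform for analyzing event logs.
- Claims the following three challenges.
  - Given many different kinds of event logs, how to support event analysis in a broad, general way.
  - Given diverse analysis requirements with different goals, how to apply existing analysis methods efficiently.
  - Given diverse analysis results, how to present them to users effectively.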
Distributed Hayabusa | Proceedings of the 15th Asian Internet Engineering Conference
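- The first author was from Lepidum, a Japanese company (now GMO Cybersecurity by Ierae).
- Many of the co-authors are from the University of Tokyo.
- The problem raised is that searching large volumes of logs requires managing a complex storage system or cluster system.
- Proposes a log search engine called Distributed Hayabusa.
- Speeds up search by splitting (sharding) logs into SQLite files by timestamp.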
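
The timestamp-sharding idea in the last bullet can be sketched roughly as follows. This is a minimal illustration, not Distributed Hayabusa's actual implementation: the one-minute bucket size, the `logs(ts, message)` schema, and the helper names `shard_path`, `write_log`, and `search` are all assumptions made for the example.

```python
import sqlite3
import tempfile
from pathlib import Path

BUCKET_SECONDS = 60  # assumed shard granularity: one SQLite file per minute


def shard_path(root: Path, ts: int) -> Path:
    """Map a Unix timestamp to the SQLite file holding its time bucket."""
    return root / f"{ts // BUCKET_SECONDS * BUCKET_SECONDS}.db"


def write_log(root: Path, ts: int, message: str) -> None:
    """Append one log record to the shard that covers its timestamp."""
    conn = sqlite3.connect(shard_path(root, ts))
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS logs (ts INTEGER, message TEXT)")
        conn.execute("INSERT INTO logs VALUES (?, ?)", (ts, message))
        conn.commit()
    finally:
        conn.close()


def search(root: Path, start: int, end: int, keyword: str) -> list[tuple[int, str]]:
    """Scan only the shards whose bucket overlaps [start, end]."""
    hits: list[tuple[int, str]] = []
    bucket = start // BUCKET_SECONDS * BUCKET_SECONDS
    while bucket <= end:
        path = root / f"{bucket}.db"
        if path.exists():  # time ranges with no logs have no file to open
            conn = sqlite3.connect(path)
            try:
                hits.extend(
                    conn.execute(
                        "SELECT ts, message FROM logs "
                        "WHERE ts BETWEEN ? AND ? AND message LIKE ? ORDER BY ts",
                        (start, end, f"%{keyword}%"),
                    ).fetchall()
                )
            finally:
                conn.close()
        bucket += BUCKET_SECONDS
    return hits


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    write_log(root, 0, "sshd: login failed")
    write_log(root, 125, "cron: job failed")
    print(search(root, 0, 130, "failed"))
```

Because each shard is an independent file, a query over a narrow time window touches only a few small files, and the per-shard scans could in principle be handed to separate workers.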
Read as Needed: Building WiSER, a Flash-Optimized Search Engine | USENIX
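- Proposes the search engine WiSER. Introduces techniques that achieve high throughput and low latency while using little main memory.
- Proposes the following as its features.
  - Optimized data placement
  - Two-way cost-aware Bloom filters (this part in particular seems novel)
  - Adaptive prefetching
  - Space-time trade-offs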
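
For readers unfamiliar with the underlying data structure, here is a minimal sketch of a plain Bloom filter. It is not WiSER's cost-aware variant; it only shows the basic membership test that such designs build on. The class name, the default sizes, and the use of SHA-256 as the hash are assumptions chosen for the example.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Plain Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


if __name__ == "__main__":
    bf = BloomFilter()
    bf.add("timeout")
    print(bf.might_contain("timeout"))
```

A search engine can consult such a filter before reading from storage: a negative answer lets it skip the read entirely, which is where the memory and I/O savings come from.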

cf. "Summary of papers on log search systems" (original title: ログ検索システムの論文まとめ) | koyama’s blog

Log clustering

Log Clustering Based Problem Identification for Online Service Systems | IEEE Conference Publication | IEEE Xplore
- Growth in log volume
  - “A Microsoft service system even generates over 1PB of logs every day.”
- Limits of keyword search (in dynamic infrastructure, "kill" and "fail" easily produce false positives)
  - “The systems could proactively kill a job and restart it elsewhere, which causes many “kill” and “fail” keywords in logs.”
- Recurrent problems are not resolved right away and linger, so the same error logs that appeared before keep appearing.
  - “However, in a large-scale online service system, there are many recurrent issues, which could lead to a lot of redundant effort in examining logs and diagnosing the previously known problems.”

Log volume reduction

LogReducer: Identify and Reduce Log Hotspots in Kernel on the Fly | Proceedings of the 45th International Conference on Software Engineering
- Uses eBPF to reduce the number of logs in online and offline processes.
- Similar logs are emitted repeatedly from the same template, and these become hotspots.
- Validated on WeChat's production system.
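
The notion of a template "hotspot" can be illustrated with a small user-space sketch. This is not LogReducer's eBPF-based mechanism. It only shows the counting idea: mask the variable parts of each line to recover an approximate template, then report templates that account for a large share of all lines. The masking regex, the 50% threshold, and the function names are assumptions made for the example.

```python
import re
from collections import Counter

# Assumed notion of "variable parts": hex literals and standalone integers.
_VARIABLE = re.compile(r"0x[0-9a-fA-F]+|\b\d+\b")


def to_template(line: str) -> str:
    """Replace variable tokens with a placeholder to approximate the template."""
    return _VARIABLE.sub("<*>", line)


def find_hotspots(lines: list[str], share_threshold: float = 0.5) -> list[tuple[str, int]]:
    """Return (template, count) pairs whose share of all lines meets the threshold."""
    counts = Counter(to_template(line) for line in lines)
    total = sum(counts.values())
    return [
        (template, count)
        for template, count in counts.most_common()
        if count / total >= share_threshold
    ]


if __name__ == "__main__":
    sample = ["conn 1 timeout", "conn 2 timeout", "conn 33 timeout", "disk full"]
    print(find_hotspots(sample))
```

Once a hotspot template is known, lines matching it are candidates for sampling or dropping, which is the kind of reduction the paper targets.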