Papers on microservices and RCA
Survey
Production Microservice Analysis
Alert storm/RCA
cf. SRE paper introduction (#SRE論文紹介): Detection is Better Than Cure: A Cloud Incidents Perspective, V. Ganatra et al., ESEC/FSE’23 - Speaker Deck
Incident Linking
Log search engine
cf. ログ検索システムの論文まとめ (Summary of papers on log search systems) | koyama’s blog
Log clustering
- Log Clustering Based Problem Identification for Online Service Systems | IEEE Conference Publication | IEEE Xplore
- Log volume is growing to massive scale
- “A Microsoft service system even generates over 1PB of logs every day.”
- Limits of keyword search (in dynamic infrastructure, keywords such as "kill" and "fail" easily produce false positives)
- “The systems could proactively kill a job and restart it elsewhere, which causes many “kill” and “fail” keywords in logs.”
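- Recurring issues are often not fixed right away, so the same error logs that were already appearing keep showing up, and engineers end up re-examining known problems.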
- “However, in a large-scale online service system, there are many recurrent issues, which could lead to a lot of redundant effort in examining logs and diagnosing the previously known problems.”
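The two pain points above (keyword false positives and recurring known issues) can be illustrated with a small sketch. This is not the paper's algorithm, only a simplified stand-in: it masks digits to form a log template, groups lines by template, and hides templates already marked as known. The sample log lines, function names, and the `known` set are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical sample log lines (not taken from the paper).
LOGS = [
    "job 1041 killed by scheduler, restarting on node 17",
    "job 1042 killed by scheduler, restarting on node 3",
    "job 2210 failed: disk quota exceeded on volume 8",
    "request 77 completed in 120 ms",
]


def keyword_hits(lines, keywords=("kill", "fail")):
    """Naive keyword search: return every line containing any keyword."""
    return [ln for ln in lines if any(k in ln.lower() for k in keywords)]


def to_template(line):
    """Mask digit runs so lines that differ only in IDs share one template."""
    return re.sub(r"\d+", "<*>", line)


def cluster_by_template(lines):
    """Group lines by template and count how many lines fall in each group."""
    return Counter(to_template(ln) for ln in lines)


def unseen_templates(lines, known):
    """Return templates that are not in the set of previously known ones."""
    return [t for t in cluster_by_template(lines) if t not in known]


if __name__ == "__main__":
    hits = keyword_hits(LOGS)
    # Assumption: the proactive kill/restart pattern was already triaged as benign.
    known = {"job <*> killed by scheduler, restarting on node <*>"}
    print(len(hits), "keyword hits")
    print(unseen_templates(hits, known))
```

With these sample lines, keyword search flags three lines, two of which are routine scheduler restarts. After grouping by template and filtering the known one, only the disk quota failure remains for inspection.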
Log volume reduction