Comparison of Outlier Detection Algorithms on String Data

📄 English Summary

Comparison of Outlier Detection Algorithms on String Data

Outlier detection is an important and well-studied problem in machine learning. Most of the literature, however, focuses on numerical data, and outlier detection for string data has received comparatively little attention. A robust outlier detection algorithm for string data could help with data cleaning or with anomaly detection in system log files. This study compares two string outlier detection algorithms. The first is a variant of the well-known local outlier factor (LOF) algorithm, adapted to string data by using the Levenshtein distance to compute the density of the dataset. A differently weighted Levenshtein measure is also presented; it takes hierarchical character classes into account and can be used to tune the algorithm for better detection performance.
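
The idea of running LOF on top of a Levenshtein distance can be illustrated with a small sketch. This is not the study's implementation: the character classes in `char_class`, the substitution weights in `class_sub_cost`, the epsilon guard for duplicate strings, and the function names are all assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence

SubCost = Callable[[str, str], float]


def unit_sub_cost(a: str, b: str) -> float:
    """Classic Levenshtein substitution cost: 0 if equal, else 1."""
    return 0.0 if a == b else 1.0


def char_class(c: str) -> str:
    """Hypothetical coarse character classes (an assumption, not from the source)."""
    if c.isdigit():
        return "digit"
    if c.isalpha():
        return "upper" if c.isupper() else "lower"
    return "other"


def class_sub_cost(a: str, b: str) -> float:
    """Hypothetical weights: substitutions inside a class are cheaper."""
    if a == b:
        return 0.0
    ca, cb = char_class(a), char_class(b)
    if ca == cb:
        return 0.25  # e.g. digit -> digit
    if {ca, cb} == {"upper", "lower"}:
        return 0.5   # both are letters, one level up in the hierarchy
    return 1.0       # unrelated classes


def levenshtein(s: str, t: str, sub: SubCost = unit_sub_cost) -> float:
    """Edit distance with unit insert/delete cost and pluggable substitution cost."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [float(i)]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1.0,           # delete a
                           cur[j - 1] + 1.0,        # insert b
                           prev[j - 1] + sub(a, b)))  # substitute a -> b
        prev = cur
    return prev[-1]


def lof_scores(data: Sequence[str], k: int,
               dist: Callable[[str, str], float]) -> List[float]:
    """Local outlier factor over an arbitrary pairwise distance (O(n^2) sketch)."""
    n = len(data)
    d = [[dist(data[i], data[j]) for j in range(n)] for i in range(n)]
    kdist: List[float] = []
    knn: List[List[int]] = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: d[i][j])
        kd = d[i][order[k - 1]]
        kdist.append(kd)
        # Neighbours tied at the k-distance are all kept.
        knn.append([j for j in order if d[i][j] <= kd])

    def lrd(i: int) -> float:
        # Local reachability density: inverse mean reachability distance.
        reach = [max(kdist[j], d[i][j]) for j in knn[i]]
        mean = sum(reach) / len(reach)
        return 1.0 / max(mean, 1e-9)  # epsilon guards against duplicate strings

    dens = [lrd(i) for i in range(n)]
    return [sum(dens[j] for j in knn[i]) / len(knn[i]) / dens[i] for i in range(n)]


if __name__ == "__main__":
    values = ["2021-01-03", "2021-02-14", "2021-03-09",
              "2021-04-21", "2021-05-30", "hello world"]
    weighted = lambda s, t: levenshtein(s, t, class_sub_cost)
    for value, score in zip(values, lof_scores(values, k=2, dist=weighted)):
        print(f"{score:8.2f}  {value}")
```

Scores close to 1 indicate a string whose neighbourhood is about as dense as that of its neighbours; markedly larger scores flag candidates for outliers. Swapping `class_sub_cost` for `unit_sub_cost` gives the unweighted variant, which makes it easy to compare how the weighting changes the ranking.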
