String normalization with Python

Author: torasenriwohashiru (https://blog.hatena.ne.jp/torasenriwohashiru/)
Blog: TorasenLab@はてな (https://torasenriwohashiru.hatenadiary.org/)
Published: 2011-08-06 00:31:30
Source: https://torasenriwohashiru.hatenadiary.org/entry/20110806/1312558290
Categories: Python, Text

Text mining and similar tasks depend on normalizing strings such as documents, sentences, and words. For example, you need to unify upper and lower case in words and unify half-width and full-width characters. Below is the Python code I use for string normalization. More may be added later.

Execution environment
Ubuntu 10.04 64-bit
python 2.6.5

Converting to the unicode type

    def unicode_ignore_invalid_char(text):
        # Python 2: str is a byte string, so decode it as UTF-8
        # and drop any bytes that cannot be decoded.
        if isinstance(text, str):
            return text.decode('utf-8', 'ignore')
        # Already a unicode object (or another type): return unchanged.
        return text

Characters that cannot be converted…
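
The case and half-width/full-width unification mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not the code from the original post. It targets Python 3 rather than the Python 2.6.5 of the post. It uses Unicode NFKC normalization from the standard unicodedata module as one possible way to unify character widths. The function names to_text_ignore_invalid and normalize_text are hypothetical.

```python
import unicodedata


def to_text_ignore_invalid(data):
    """Return a str; UTF-8 bytes are decoded and invalid bytes are dropped.

    Python 3 counterpart (an assumption, not the original code) of the
    Python 2 helper that decodes str to unicode.
    """
    if isinstance(data, bytes):
        return data.decode("utf-8", "ignore")
    return data


def normalize_text(data):
    """Unify character width with NFKC, then lowercase the result."""
    text = to_text_ignore_invalid(data)
    # NFKC maps full-width ASCII letters and digits to their ASCII forms
    # and half-width katakana to full-width katakana.
    text = unicodedata.normalize("NFKC", text)
    # Lowercasing unifies upper and lower case.
    return text.lower()


if __name__ == "__main__":
    print(normalize_text("Ｐｙｔｈｏｎ"))   # full-width Latin letters
    print(normalize_text("ﾃｷｽﾄ"))         # half-width katakana
    print(normalize_text(b"abc\xffDEF"))  # bytes containing an invalid byte
```

Note that NFKC also folds other compatibility characters (for example, circled digits become plain digits), so it may change more than character width. Whether that is acceptable depends on the text being mined.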