How to extract the main text of a web page in Ruby

griefworker (https://blog.hatena.ne.jp/griefworker/), on the blog "present" (https://tnakamura.hatenablog.com/)
Posted 2013-06-30 12:19:32: https://tnakamura.hatenablog.com/entry/2013/06/30/121932

In Ruby, you can just use ExtractContent.

Webページの本文抽出 (nakatani @ cybozu labs)

However, the ExtractContent from the article above did not work on Ruby 1.9 or later. The regular expression engine changed, after all. I searched GitHub, prepared to fix it myself if nothing turned up, and sure enough I found a 1.9-compatible version.

mono0x/extractcontent

Let's try it out. Add gem "extractcontent", github: "mono0x/extractcontent" to the Gemfile and install it with bundle. Usage is simple…
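For reference, here is a rough sketch of what calling the gem might look like. It assumes the mono0x fork keeps the interface of the original ExtractContent, where `ExtractContent.analyse(html)` returns a `[body, title]` pair. I have not verified that against the fork. The helper name `extract_main_text` is my own invention, and the fallback branch is only a naive tag stripper so the snippet still runs without the gem. It is not the ExtractContent algorithm.

```ruby
# Gemfile (from the post):
#   gem "extractcontent", github: "mono0x/extractcontent"

begin
  require "extractcontent"
rescue LoadError
  # Gem not installed; the naive fallback below is used instead.
end

# Hypothetical helper (not part of the gem).
# Returns [body_text, title] for an HTML string.
def extract_main_text(html)
  if defined?(ExtractContent)
    # Assumption: same interface as the original ExtractContent,
    # i.e. analyse(html) returns [body, title].
    body, title = ExtractContent.analyse(html)
    return [body.to_s, title.to_s]
  end

  # Naive fallback, NOT the ExtractContent algorithm:
  # take the <title> text, then drop script/style/title blocks and all tags.
  title = html[/<title[^>]*>(.*?)<\/title>/mi, 1].to_s.strip
  text = html.gsub(/<(script|style)\b.*?<\/\1>/mi, " ")
             .gsub(/<title\b.*?<\/title>/mi, " ")
             .gsub(/<[^>]+>/, " ")
             .gsub(/\s+/, " ")
             .strip
  [text, title]
end
```

With this in place, something like `body, title = extract_main_text(html)` gives you the two strings. Real pages need the gem: the fallback keeps navigation, footers and everything else that happens to sit between tags.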