I wrote a Ruby program that scrapes almost every product listing from SEIYU's online store and writes the results to CSV.
It collects roughly 14,000 products.
The columns are product name, price (¥), product image URL, and three levels of category.
The main gems used are Anemone and Nokogiri.
This started as a personal memo, so please forgive any rough edges and leftover debug output...
require 'rubygems'
require 'nokogiri'
require 'kconv'
require 'open-uri'
require 'anemone'
require 'csv'
def category_url_get
  $category = []
  $category_urls = []
  base = "https://www.the-seiyu.com/front/contents/top/ns/"
  Anemone.crawl(base, :depth_limit => 0, :delay => 3) do |anemone|
    anemone.on_every_page do |page|
      doc = Nokogiri::HTML.parse(page.body.force_encoding("UTF-8"))
      doc.xpath("//*[@id='categoryListWrapper_0002']/div[2]").each do |node|
        node.css("a.level_3_li_Inner").each do |node_3|
          # Non-bang gsub: gsub! returns nil when nothing matches.
          category_3 = node_3.inner_text.gsub(/\s/, "")
          category_url_3 = "https://www.the-seiyu.com" + node_3["href"]
          category_2 = node_3.parent.parent.parent.css("a.level_2_li_Inner").inner_text.gsub(/\s/, "")
          category_1 = node_3.parent.parent.parent.parent.parent.css("div.level_1_li_Inner").xpath("./a/span").inner_text
          puts " ----------------------------------------------"
          $category_urls << category_url_3
          $category << [category_1, category_2, category_3]
        end
      end
    end
  end
end
def scrape
  $result = []
  $category_urls.flatten!
  $category.flatten!(1)
  $category_urls.zip($category).each do |category_url, category|
    p category_url
    p category
    puts "-----------------------"
    # Crawl the category's first page; pagination URLs found there are
    # queued and crawled afterwards. (Pushing into the array passed to
    # Anemone.crawl after the crawl has started has no effect, so the
    # extra pages are crawled in their own passes.)
    page_urls = [category_url]
    first_page = true
    until page_urls.empty?
      Anemone.crawl(page_urls.shift, :depth_limit => 0, :delay => 3) do |anemone|
        anemone.on_every_page do |page|
          doc = Nokogiri::HTML.parse(page.body.force_encoding("UTF-8"))
          doc.xpath("//li[contains(@class, 'jsFlatHeight_list')]").each do |node|
            stuff = []
            stuff << node.xpath(".//img").attribute("title").value
            stuff << node.xpath(".//div/div/span/strong").inner_text
            stuff << node.xpath(".//img").attribute("data-original").value
            # Deep-copy the category triple so each row gets its own array.
            result = Marshal.load(Marshal.dump(category))
            result.concat(stuff)
            $result << result
          end
          puts "-------------------------"
          next unless first_page
          first_page = false
          begin
            # The last pagination link encodes the page count as
            # javascript:move('<n>', ...).
            last_page = doc.xpath("//*[@id='list']/div/ul[2]/li[last()]/a")
                           .attribute("href").value
                           .scan(/javascript:move\('(\d+)',/)
                           .flatten[0].to_i
            (2..last_page).each do |i|
              page_urls << category_url + "&mode=image&pageSize=49&currentPage=#{i}&alignmentSequence=1&resultMessage="
            end
          rescue
            next
          end
        end
      end
    end
  end
end
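Two small pieces of the loop above are worth isolating: the regex that pulls the page count out of the pagination link's `javascript:` href, and the `Marshal` round-trip used as a deep copy. Both are shown here on invented values:

```ruby
# Invented href in the shape the scan above expects.
href = "javascript:move('12','image');"
last_page = href.scan(/javascript:move\('(\d+)',/).flatten[0].to_i
extra_pages = (2..last_page).to_a  # pages still to fetch after page 1
p last_page
p extra_pages

# Marshal round-trip = deep copy: mutating the row must not touch the
# shared category array (dup would only be a shallow copy).
category = ["食品", "飲料", "お茶"]
row = Marshal.load(Marshal.dump(category))
row << "サンプル商品"
p category
p row
```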
category_url_get
scrape
def to_csv
  header = ["カテゴリ1", "カテゴリ2", "カテゴリ3", "商品名", "値段(¥)", "画像URL"]
  CSV.open('seiyu.csv', 'w', :encoding => "Windows-31J", :headers => true) do |file|
    file << header
    $result.each do |line|
      file << line
    end
  end
  puts "--------------------------------------------------------------------------"
end

to_csv
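The CSV step in isolation, with invented rows standing in for `$result`. Windows-31J (Microsoft's superset of Shift_JIS) is what Excel on Windows expects; note that any character outside that encoding would raise an error on write:

```ruby
require 'csv'

header = ["カテゴリ1", "カテゴリ2", "カテゴリ3", "商品名", "値段", "画像URL"]
# Invented sample rows standing in for $result.
rows = [
  ["食品", "飲料", "お茶", "サンプル緑茶",   "198", "https://example.com/tea.jpg"],
  ["食品", "飲料", "お水", "サンプル天然水", "98",  "https://example.com/water.jpg"]
]

CSV.open("sample.csv", "w", :encoding => "Windows-31J") do |csv|
  csv << header
  rows.each { |row| csv << row }
end

# Read it back, transcoding from Windows-31J to UTF-8.
back = CSV.read("sample.csv", :encoding => "Windows-31J:UTF-8", :headers => true)
p back.map { |r| r["商品名"] }
```

If rows might contain characters Windows-31J cannot represent, it is safer to pre-encode them with `String#encode(..., :undef => :replace)` before writing.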
Please be considerate of the server load and always set a generous delay.
Use this at your own risk; if it causes any problems I will take it down immediately.
Questions are welcome too, so feel free to send me a mention.
See you!
たれみみ
@taremimi_7