Prerequisites

Python3
MeCab
NEologd
Linux

For setting up MeCab and NEologd, see https://ykonishi.tokyo/install-mecab-neologd/

Python packages used (logging and csv ship with the standard library; the rest need to be installed)

Flask
logging
MeCab
beautifulsoup4
requests
lxml
cchardet
csv

System overview

I built a conversational bot that takes a message posted on Slack, scrapes the weather forecast, and tells you the result.

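For example, if you post "Ama-Tatsu、今日の新宿区の天気を教えて" ("Ama-Tatsu, tell me today's weather in Shinjuku"), a reply like the following comes back. The bot answers in Japanese; the text below reads "Weather forecast: today's weather in Shinjuku is sunny, the high is 25 degrees, the low is 14 degrees."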

天気予報
新宿区の今日の天気は晴
最高気温は25度
最低気温は14度です。

Rough flow from receiving a message to replying

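Receive the message
Analyze the message to determine the area and whether today's or tomorrow's forecast is wanted
Access the URL that holds the weather information for that area (news gathering)
Scrape the page and extract the required data (editing)
Build the formatted text from the extracted data (editing and drafting)
The bot announces the weather (reporting)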

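Messages posted on Slack are forwarded to a server of your choice by configuring an Outgoing WebHook. The webhook sends each message as a form-encoded POST request (which is why the code below reads it from request.form), and the bot answers with JSON.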
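
To make the payload concrete, here is a minimal sketch of what arrives at the server, assuming the webhook posts a form-encoded body. The field names are the ones stationed.py reads; every value is made up for illustration.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical payload: the field names match what stationed.py reads,
# but every value here is invented for illustration.
sample_fields = {
    "token": "dummy-token",
    "team_id": "T0000",
    "channel_id": "C0000",
    "channel_name": "general",
    "timestamp": "1500000000.000000",
    "user_id": "U0000",
    "user_name": "alice",
    "text": "Ama-Tatsu、今日の新宿区の天気を教えて",
    "trigger_word": "Ama-Tatsu",
}

# What travels over the wire: a percent-encoded form body.
body = urlencode(sample_fields)

# What the server sees after decoding (Flask exposes this as request.form).
decoded = {key: values[0] for key, values in parse_qs(body).items()}

print(decoded["user_name"])
print(decoded["text"])
```

The Japanese text survives the round trip because it is percent-encoded as UTF-8 on the way out and decoded the same way on the way in.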

Code

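cchardet is a package that analyzes the contents of the data and detects the character encoding automatically, and it handles large amounts of text quickly. The encoding is passed explicitly because, when the response header carries no charset information, requests.get assumes ISO-8859-1 and the Japanese text comes out garbled.

url holds the URL of the page with the weather information for the area. The URLs are kept separately, for example in a CSV file. The data scraped is the weather information for today and tomorrow.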
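
The effect of a wrong encoding guess can be reproduced without any network access. This is a small sketch using only the standard library; the sample string is borrowed from the bot's reply, and treating UTF-8 as the page's true encoding is an assumption made for the demo.

```python
# The same bytes decoded two ways. UTF-8 as the "real" encoding is an
# assumption for this demo; the actual page may use a different one.
raw = "新宿区の今日の天気は晴".encode("utf-8")

# ISO-8859-1 maps every byte to some character, so decoding never fails;
# it silently produces mojibake instead of raising an error.
garbled = raw.decode("iso-8859-1")

# Decoding with the real encoding restores the original text.
restored = raw.decode("utf-8")

print(garbled)
print(restored)
```

Because the wrong decode never raises, the problem only shows up as unreadable output, which is why the encoding is detected from the raw bytes and handed to BeautifulSoup.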

scrape_news.py

#coding: utf-8
import requests
import lxml  # not used directly; BeautifulSoup needs it for the 'lxml' parser
import cchardet
from bs4 import BeautifulSoup

class WeatherData:

    def __init__(self, url):
        response = requests.get(url)
        # Detect the encoding from the raw bytes: when the response header has
        # no charset, requests falls back to ISO-8859-1 and the text is garbled.
        response_encoding = cchardet.detect(response.content)["encoding"]
        self.soup = BeautifulSoup(response.content, 'lxml', from_encoding=response_encoding)

    def _scrape_section(self, section_class):
        # Return (weather, high temperature, low temperature) for one forecast section.
        section = self.soup.find('section', class_=section_class)
        weather = section.find('p', class_='weather-telop').get_text()
        high_temp = section.find('dd', class_='high-temp').find('span', class_='value').get_text()
        low_temp = section.find('dd', class_='low-temp').find('span', class_='value').get_text()
        return weather, high_temp, low_temp

    def scrape_today_weather(self):
        return self._scrape_section('today-weather')

    def scrape_tomorrow_weather(self):
        return self._scrape_section('tomorrow-weather')

Next, the weather data is turned into the text of the announcement. municipality.csv is the file that lists the URL of the weather page for each area.

The contents of the CSV can be seen here: https://github.com/nisyuu/Ama-Tatsu/blob/master/municipality.csv
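
As a rough illustration of the layout the code expects (area name in the first column, URL in the second), here is a sketch that reads two rows from memory. The rows and the example.com URLs are placeholders, not the contents of the real file, and load_municipalities is a name made up for this example.

```python
import csv
import io

# Hypothetical rows in the two-column layout: municipality name, forecast URL.
# The URLs are placeholders, not the ones in the real municipality.csv.
sample_csv = (
    "新宿区,https://example.com/weather/shinjuku\n"
    "渋谷区,https://example.com/weather/shibuya\n"
)

def load_municipalities(csv_file):
    """Map each municipality name (column 0) to its URL (column 1)."""
    return {row[0]: row[1] for row in csv.reader(csv_file) if len(row) >= 2}

municipality_dict = load_municipalities(io.StringIO(sample_csv))
print(municipality_dict["新宿区"])
```

Keeping the mapping in a dictionary makes the later lookup by area name a single key access.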

arrange_news.py

#coding: utf-8
import csv
from scrape_news import WeatherData

class ArrangeWeatherData:

    # Load the municipality name -> forecast URL mapping from the CSV file.
    def get_municipality_data(self):
        municipality_dict = {}
        # The file contains Japanese place names, so the encoding is given
        # explicitly (assumed here to be UTF-8).
        with open('municipality.csv', 'r', encoding='utf-8', newline='') as municipality_file:
            reader = csv.reader(municipality_file)
            for municipality_data in reader:
                municipality_dict[municipality_data[0]] = municipality_data[1]
        return municipality_dict

    # Fetch and parse the forecast page for the given municipality.
    def _weather_data(self, municipality):
        municipality_dict = self.get_municipality_data()
        return WeatherData(municipality_dict[municipality])

    # Build one announcement; day is "今日" (today) or "明日" (tomorrow).
    def _format(self, municipality, day, data):
        return (municipality + "の" + day + "の天気は" + data[0] + "\n"
                "最高気温は" + data[1] + "度\n"
                "最低気温は" + data[2] + "度です。")

    # Today's and tomorrow's forecast, in that order.
    def weather_news_format(self, municipality):
        weather_data = self._weather_data(municipality)
        return [self._format(municipality, "今日", weather_data.scrape_today_weather()),
                self._format(municipality, "明日", weather_data.scrape_tomorrow_weather())]

    def today_weather_news_format(self, municipality):
        weather_data = self._weather_data(municipality)
        return [self._format(municipality, "今日", weather_data.scrape_today_weather())]

    def tomorrow_weather_news_format(self, municipality):
        weather_data = self._weather_data(municipality)
        return [self._format(municipality, "明日", weather_data.scrape_tomorrow_weather())]

Next is the part that receives the message and replies. It runs on Flask, a lightweight web framework.

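The message is analyzed with morphological analysis. Based on the result, the bot determines the target area and whether today or tomorrow is meant, then runs the scraping.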
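
The decision step can be looked at separately from MeCab. The sketch below takes a list of words that has already been split and returns the area and the day; the function name decide_forecast, the 'today'/'tomorrow'/'both' labels, and the hand-written token list are all assumptions for illustration, since the real segmentation depends on the dictionary in use.

```python
def decide_forecast(words, municipality_names):
    """Pick the municipality and which day(s) to report from a list of words.

    Returns (municipality, day) with day being 'today', 'tomorrow' or 'both',
    or None when the message is not a weather request for a known area.
    """
    municipality = ''
    keywords = []
    for word in words:
        if word in municipality_names:
            municipality = word
        elif word:
            keywords.append(word)

    # Only react when the weather is mentioned and the area is known.
    if '天気' not in ''.join(keywords) or not municipality:
        return None
    if '今日' in keywords:
        return municipality, 'today'
    if '明日' in keywords:
        return municipality, 'tomorrow'
    return municipality, 'both'


# Hand-written tokens standing in for MeCab output (an assumption).
tokens = ['今日', '新宿区', '天気']
print(decide_forecast(tokens, {'新宿区', '渋谷区'}))
```

Returning None for unknown areas or unrelated messages lets the caller stay silent instead of failing on a missing dictionary key.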

In MeCab.Tagger('-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd'), set the path to match your own environment.

The port number is set to 5001, but you are free to change it.

stationed.py

#coding: utf-8
import logging
import MeCab
from flask import Flask, request, jsonify
from arrange_news import ArrangeWeatherData

app = Flask(__name__)

# Holds the fields of a message posted from Slack.
class PostedSlackApi(object):

    def __init__(self, params):
        self.token = params["token"]
        self.team_id = params["team_id"]
        self.channel_id = params["channel_id"]
        self.channel_name = params["channel_name"]
        self.timestamp = params["timestamp"]
        self.user_id = params["user_id"]
        self.user_name = params["user_name"]
        self.text = params["text"]
        self.trigger_word = params["trigger_word"]

    def __str__(self):
        posted_data = self.__class__.__name__
        posted_data += "@{0.token}[channel={0.channel_name}, user={0.user_name}, text={0.text}]".format(self)
        return posted_data

def say(text):
    # Build the JSON reply that Slack posts back to the channel.
    return jsonify({
        "username": "Ama-Tatsu",
        "icon_emoji": ":slightly_smiling_face:",
        "attachments": [{"title": '天気予報', "text": text}]
    })


@app.route('/', methods=['POST'])
def ama_tatsu():
    posted_data = PostedSlackApi(request.form)
    logging.debug(posted_data)

    # Ignore messages from slackbot.
    if posted_data.user_name == "slackbot":
        return ''

    # Morphological analysis: find the municipality and collect nouns and adjectives.
    municipality = ''
    arranger = ArrangeWeatherData()
    municipality_dict = arranger.get_municipality_data()
    tagger_neologd = MeCab.Tagger('-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')
    # Parsing an empty string first is a common workaround so that
    # node.surface is read correctly afterwards.
    tagger_neologd.parse('')
    node = tagger_neologd.parseToNode(posted_data.text)
    word_list = []
    ignored_words = ['Ama-Tatsu', 'amat', 'amatatsu', 'あまたつ', '\n', 'です', 'ます', '。', '']
    while node:
        word_feature = node.feature.split(',')
        word = node.surface
        if word in municipality_dict:
            municipality = word
        elif word and word_feature[0] in ['名詞', '形容詞'] and word not in ignored_words:
            word_list.append(word)
        node = node.next

    # Reply only to weather requests for a known municipality; returning an
    # empty body keeps Flask from failing on a missing response or a KeyError.
    if '天気' not in ''.join(word_list) or not municipality:
        return ''

    if any(word in word_list for word in ['今日', '今日の天気']):
        weather_news = arranger.today_weather_news_format(municipality=municipality)
        return say(weather_news[0])
    elif any(word in word_list for word in ['明日', '明日の天気']):
        weather_news = arranger.tomorrow_weather_news_format(municipality=municipality)
        return say(weather_news[0])
    else:
        weather_news = arranger.weather_news_format(municipality=municipality)
        return say(weather_news[0] + "\n\n" + weather_news[1])

if __name__ == '__main__':
    app.debug = True
    app.run(host='0.0.0.0', port=5001)

Outgoing WebHooks

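The items to configure are the channel to integrate with, the trigger word, and the destination for the messages. For the trigger word, specify the word the bot should react to; several can be given, separated by commas. For the destination, specify the address and port number of the machine running the bot. If the address is http://hogehoge.huga, the destination is http://hogehoge.huga:5001.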
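
Before wiring up Slack, it can help to check the endpoint by hand. The following is a sketch that builds a POST request shaped like the webhook's, assuming a form-encoded body; build_test_request is a made-up helper, all field values are dummies, and the request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_test_request(endpoint, text, trigger_word):
    """Build a POST request shaped like the one the webhook is assumed to send."""
    # stationed.py reads all nine fields, so each one must be present.
    # Every value except text and trigger_word is a dummy.
    fields = {
        "token": "dummy-token",
        "team_id": "T0000",
        "channel_id": "C0000",
        "channel_name": "general",
        "timestamp": "1500000000.000000",
        "user_id": "U0000",
        "user_name": "alice",
        "text": text,
        "trigger_word": trigger_word,
    }
    body = urlencode(fields).encode("ascii")
    return Request(endpoint, data=body, method="POST")

test_request = build_test_request(
    "http://hogehoge.huga:5001",
    "Ama-Tatsu、今日の新宿区の天気を教えて",
    "Ama-Tatsu",
)
print(test_request.get_method(), test_request.full_url)

# To actually send it once the bot is running:
# from urllib.request import urlopen
# print(urlopen(test_request).read().decode("utf-8"))
```

If the bot is up and the area is listed in municipality.csv, the commented-out call should print the same JSON that Slack would receive.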

Starting the bot

python3 stationed.py

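Incidentally, running nohup python3 stationed.py > output_log/out.log & keeps the bot running in the background while writing its log to a separate file. The output_log directory has to exist beforehand.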

The code is also on GitHub, so please take a look: https://github.com/nisyuu/Ama-Tatsu

Reference: https://ykonishi.tokyo/take-measures-encoding/