I happened to be doing some S3 log analysis recently, so I am taking the chance to write it down and share it.

Writing the parser is basically just a matter of regular expressions, which is not hard; it was also a nice chance to brush up, since I had not used them for a while and was a bit rusty. Below is an example S3 access log line and the field names it maps to.

S3 format


S3 log example

16c30236345580cd721db3bb157f5f75341e9a9ca19c17dacef7443e14324be3 my_bucket [22/Jan/2016:01:41:16 +0000] 180.204.165.97 - 48040A3D82700E22 REST.GET.OBJECT camp/a434c975/a434c975e63f4612b804e4573c8318b49d8d7ff6.jpg "GET /camp/a434c975/a434c975e63f4612b804e4573c8318b49d8d7ff6.jpg HTTP/1.1" 200 - 45214 45214 25 24 "http://xxx.xxx.com/" "Mozilla/5.0 (Linux; Android 5.0; SM-G900I Build/LRX21T; wv) AppleWebKit/547.37 (KHTML, like Gecko) Version/4.0 Chrome/47.0.2526.100 Mobile Safari/537.36" -

Corresponding field names

"bucket_owner", "bucket", "datetime", "ip", "requestor_id", "request_id", "operation", "key", "http_method_uri_proto", "http_status","s3_error", "bytes_sent", "object_size", "total_time", "turn_around_time", "referer", "user_agent"

Regular expression


(\S+) (\S+) \[(.*?)\] (\S+) (\S+) (\S+) (\S+) (\S+) "([^"]+)" (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) "([^"]+)" "([^"]+)"

There are basically only three patterns here, so it is quite easy to follow; a quick sanity check against the example line is included after the list.

  • (\S+): captures a run of one or more non-whitespace characters (+ means "between one and unlimited times, as many times as possible", i.e. greedy)

  • \[(.*?)\]: captures everything between the literal square brackets; .*? matches any character (except newline) lazily, so it stops at the first ]

  • "([^"]+)": captures one or more characters that are not a double quote, i.e. everything between a pair of literal quotes
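
A minimal sketch of that sanity check using Python's built-in re module: compile the pattern above, match it against the example log line, and print a few captured groups (group numbers follow the field order listed earlier):

import re

line = '16c30236345580cd721db3bb157f5f75341e9a9ca19c17dacef7443e14324be3 my_bucket [22/Jan/2016:01:41:16 +0000] 180.204.165.97 - 48040A3D82700E22 REST.GET.OBJECT camp/a434c975/a434c975e63f4612b804e4573c8318b49d8d7ff6.jpg "GET /camp/a434c975/a434c975e63f4612b804e4573c8318b49d8d7ff6.jpg HTTP/1.1" 200 - 45214 45214 25 24 "http://xxx.xxx.com/" "Mozilla/5.0 (Linux; Android 5.0; SM-G900I Build/LRX21T; wv) AppleWebKit/547.37 (KHTML, like Gecko) Version/4.0 Chrome/47.0.2526.100 Mobile Safari/537.36" -'

pattern = re.compile(r'(\S+) (\S+) \[(.*?)\] (\S+) (\S+) (\S+) (\S+) (\S+) "([^"]+)" '
                     r'(\S+) (\S+) (\S+) (\S+) (\S+) (\S+) "([^"]+)" "([^"]+)"')

match = pattern.match(line)
print(match.group(2))    # bucket      -> my_bucket
print(match.group(3))    # datetime    -> 22/Jan/2016:01:41:16 +0000
print(match.group(10))   # http_status -> 200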

Code


s3.parser.py


import getopt
import re
import sys
from os import listdir
from os.path import isfile, join


def parse(path, r, s3_log_format, s3_names, s3_names_analysis):
    # parse one log file and keep only the (name, value) pairs listed in s3_names_analysis
    data = []
    with open(join(path, r), 'r') as log_file:
        for line in log_file:
            match = s3_log_format.match(line)
            if match is not None:
                # 17 capture groups, in the same order as s3_names
                values = [match.group(n + 1) for n in range(17)]
                record = [item for item in zip(s3_names, values)
                          if item[0] in s3_names_analysis]
                data.append(record)

    return data


def main(path):
    s3_log_format = re.compile(
        r'(\S+) (\S+) \[(.*?)\] (\S+) (\S+) (\S+) (\S+) (\S+) "([^"]+)" '
        r'(\S+) (\S+) (\S+) (\S+) (\S+) (\S+) "([^"]+)" "([^"]+)"')

    s3_names = ["bucket_owner", "bucket", "datetime", "ip", "requestor_id",
                "request_id", "operation", "key", "http_method_uri_proto",
                "http_status", "s3_error", "bytes_sent", "object_size",
                "total_time", "turn_around_time", "referer", "user_agent"]

    # choose the names you want to analyze
    s3_names_analysis = ["datetime", "key", "http_method_uri_proto", "http_status"]

    files = [f for f in listdir(path) if isfile(join(path, f))]
    data = []

    for r in files:
        # parse each log file and collect the log info you want
        parsed = parse(path, r, s3_log_format, s3_names, s3_names_analysis)
        data.extend(parsed)
        print(parsed)


if __name__ == "__main__":
    argv = sys.argv[1:]

    try:
        opts, args = getopt.getopt(argv, "hp:", ["help", "path="])
    except getopt.GetoptError:
        print('Usage: python s3.parser.py -p <path>')
        sys.exit(2)

    path = ''

    for opt, arg in opts:
        if opt in ('-h', '--help'):
            print('Usage: python s3.parser.py -p <path>')
            sys.exit()
        elif opt in ('-p', '--path'):
            path = arg

    if path == '':
        print('Usage: python s3.parser.py -p <path>')
        sys.exit(2)

    main(path)
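
To run it, point the script at a local directory containing the downloaded S3 access log files (the path below is just a placeholder):

python s3.parser.py -p /path/to/s3_logs/

For each file it prints the (name, value) pairs selected in s3_names_analysis.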
