Quantcast
Channel: kakakakakku blog
Viewing all articles
Browse latest Browse all 917

parquet-cli で Parquet ファイルを読む

$
0
0

Parquetファイルをサクッと読むときに parquetコマンドを使うと便利❗️

Homebrew なら brew install parquet-cliコマンドで簡単にセットアップできる.今回は最新 v1.12.3を使う.そして parquetコマンドの実装は GitHub の apache/parquet-mrリポジトリにある.parquet helpコマンドの出力だと引数まで細かく把握できず,実装を確認する必要がある.実装は parquet-cli/src/main/java/org/apache/parquet/cli/commands/ディレクトリにある.

$ brew install parquet-cli

$ brew info parquet-cli
==> parquet-cli: stable 1.12.3(bottled), HEAD

github.com

あと関連記事を調べると parquet-toolsも紹介されていたりするけど,現時点では使えなくなっている🔥

$ brew install parquet-tools
Error: parquet-tools has been disabled because it is deprecated upstream!

formulae.brew.sh

parquetコマンド

parquet helpコマンドで確認できる通り,サブコマンドは多くある.今回は代表的なものを紹介する.

  • parquet meta
  • parquet schema
  • parquet head
  • parquet cat
  • parquet column-size
$ parquet help

Usage: parquet [options][command][command options]

  Options:

    -v, --verbose,--debug
    Print extra debugging information

  Commands:

    help
    Retrieves details on the functions of other commands
    meta
    Print a Parquet file's metadata    pages  Print page summaries for a Parquet file    dictionary  Print dictionaries for a Parquet column    check-stats  Check Parquet files for corrupt page and column stats (PARQUET-251)    schema  Print the Avro schema for a file    csv-schema  Build a schema from a CSV data sample    convert-csv  Create a file from CSV data    convert  Create a Parquet file from a data file    to-avro  Create an Avro file from a data file    cat  Print the first N records from a file    head  Print the first N records from a file    column-index  Prints the column and offset indexes of a Parquet file    column-size  Print the column sizes of a parquet file    prune  Prune column(s) in a Parquet file and save it to a new file. The columns left are not changed.    trans-compression  Translate the compression from one to another (It doesn't support bloom filter feature yet).
    masking
    Replace columns with masked values and write to a new Parquet file
    footer
    Print the Parquet file footer in json format

  Examples:

    # print information for meta
    parquet help meta

  See 'parquet help <command>'for more information on a specific command.

Parquet サンプル

今回は TLC Trip Record Dataで公開されている「タクシーデータセット🚕」を使う.Parquet フォーマットで取得できる.

  • 2022/01
  • Yellow Taxi Trip Records
  • yellow_tripdata_2022-01.parquet
    • 36 MB

www.nyc.gov

🪓 parquet metaコマンド

Parquet ファイルのメタデータを取得できる.一部は割愛して載せているけど,スキーマ情報や統計情報を確認できる.

$ parquet meta yellow_tripdata_2022-01.parquet

File path:  yellow_tripdata_2022-01.parquet
Created by: parquet-cpp-arrow version 7.0.0(中略)        

Schema:
message schema {
  optional int64 VendorID;
  optional int64 tpep_pickup_datetime (TIMESTAMP(MICROS,false));
  optional int64 tpep_dropoff_datetime (TIMESTAMP(MICROS,false));
  optional double passenger_count;
  optional double trip_distance;
  optional double RatecodeID;
  optional binary store_and_fwd_flag (STRING);
  optional int64 PULocationID;
  optional int64 DOLocationID;
  optional int64 payment_type;
  optional double fare_amount;
  optional double extra;
  optional double mta_tax;
  optional double tip_amount;
  optional double tolls_amount;
  optional double improvement_surcharge;
  optional double total_amount;
  optional double congestion_surcharge;
  optional double airport_fee;}


Row group 0:  count: 246393115.47 B records  start: 4  total(compressed): 36.361 MB total(uncompressed):63.526 MB
--------------------------------------------------------------------------------
                       type      encodings count     avg size   nulls   min / max
VendorID               INT64     G _ R     24639310.13 B     0"1" / "6"
tpep_pickup_datetime   INT64     G _ R_ F  24639313.84 B     0"2008-12-31T22:23:09.000000" / "2022-05-18T20:41:57.000000"
tpep_dropoff_datetime  INT64     G _ R_ F  24639313.92 B     0"2008-12-31T23:06:56.000000" / "2022-05-18T20:47:45.000000"
passenger_count        DOUBLE    G _ R     24639310.20 B     71503"-0.0" / "9.0"
trip_distance          DOUBLE    G _ R     24639311.39 B     0"-0.0" / "306159.28"
RatecodeID             DOUBLE    G _ R     24639310.06 B     71503"1.0" / "99.0"
store_and_fwd_flag     BINARY    G _ R     24639310.03 B     71503"N" / "Y"
PULocationID           INT64     G _ R     24639310.78 B     0"1" / "265"
DOLocationID           INT64     G _ R     24639310.99 B     0"1" / "265"
payment_type           INT64     G _ R     24639310.15 B     0"0" / "5"
fare_amount            DOUBLE    G _ R     24639311.01 B     0"-480.0" / "401092.32"
extra                  DOUBLE    G _ R     24639310.22 B     0"-4.5" / "33.5"
mta_tax                DOUBLE    G _ R     24639310.02 B     0"-0.5" / "16.59"
tip_amount             DOUBLE    G _ R     24639311.06 B     0"-125.22" / "888.88"
tolls_amount           DOUBLE    G _ R     24639310.09 B     0"-31.4" / "193.3"
improvement_surcharge  DOUBLE    G _ R     24639310.01 B     0"-0.3" / "0.3"
total_amount           DOUBLE    G _ R     24639311.42 B     0"-480.3" / "401095.62"
congestion_surcharge   DOUBLE    G _ R     24639310.09 B     71503"-2.5" / "2.5"
airport_fee            DOUBLE    G _ R     24639310.07 B     71503"-1.25" / "1.25"

parquet.apache.org

🪓 parquet schemaコマンド

Parquet ファイルの Avro スキーマ情報を取得できる.

$ parquet schema yellow_tripdata_2022-01.parquet
{"type" : "record",
  "name" : "schema",
  "fields" : [{"name" : "VendorID",
    "type" : ["null", "long"],
    "default" : null
  }, {"name" : "tpep_pickup_datetime",
    "type" : ["null", {"type" : "long",
      "logicalType" : "timestamp-micros"}],
    "default" : null
  }, {"name" : "tpep_dropoff_datetime",
    "type" : ["null", {"type" : "long",
      "logicalType" : "timestamp-micros"}],
    "default" : null
  }, {"name" : "passenger_count",
    "type" : ["null", "double"],
    "default" : null
  }, {"name" : "trip_distance",
    "type" : ["null", "double"],
    "default" : null
  }, {"name" : "RatecodeID",
    "type" : ["null", "double"],
    "default" : null
  }, {"name" : "store_and_fwd_flag",
    "type" : ["null", "string"],
    "default" : null
  }, {"name" : "PULocationID",
    "type" : ["null", "long"],
    "default" : null
  }, {"name" : "DOLocationID",
    "type" : ["null", "long"],
    "default" : null
  }, {"name" : "payment_type",
    "type" : ["null", "long"],
    "default" : null
  }, {"name" : "fare_amount",
    "type" : ["null", "double"],
    "default" : null
  }, {"name" : "extra",
    "type" : ["null", "double"],
    "default" : null
  }, {"name" : "mta_tax",
    "type" : ["null", "double"],
    "default" : null
  }, {"name" : "tip_amount",
    "type" : ["null", "double"],
    "default" : null
  }, {"name" : "tolls_amount",
    "type" : ["null", "double"],
    "default" : null
  }, {"name" : "improvement_surcharge",
    "type" : ["null", "double"],
    "default" : null
  }, {"name" : "total_amount",
    "type" : ["null", "double"],
    "default" : null
  }, {"name" : "congestion_surcharge",
    "type" : ["null", "double"],
    "default" : null
  }, {"name" : "airport_fee",
    "type" : ["null", "double"],
    "default" : null
  }]}

🪓 parquet headコマンド

Parquet ファイルのデータを取得できる(デフォルトは先頭10件).--num-recordsオプションで件数を変えられる.

$ parquet head --num-records2 yellow_tripdata_2022-01.parquet | jq .
{"VendorID": 1,
  "tpep_pickup_datetime": 1640997340000000,
  "tpep_dropoff_datetime": 1640998409000000,
  "passenger_count": 2,
  "trip_distance": 3.8,
  "RatecodeID": 1,
  "store_and_fwd_flag": "N",
  "PULocationID": 142,
  "DOLocationID": 236,
  "payment_type": 1,
  "fare_amount": 14.5,
  "extra": 3,
  "mta_tax": 0.5,
  "tip_amount": 3.65,
  "tolls_amount": 0,
  "improvement_surcharge": 0.3,
  "total_amount": 21.95,
  "congestion_surcharge": 2.5,
  "airport_fee": 0}{"VendorID": 1,
  "tpep_pickup_datetime": 1640997223000000,
  "tpep_dropoff_datetime": 1640997727000000,
  "passenger_count": 1,
  "trip_distance": 2.1,
  "RatecodeID": 1,
  "store_and_fwd_flag": "N",
  "PULocationID": 236,
  "DOLocationID": 42,
  "payment_type": 1,
  "fare_amount": 8,
  "extra": 0.5,
  "mta_tax": 0.5,
  "tip_amount": 4,
  "tolls_amount": 0,
  "improvement_surcharge": 0.3,
  "total_amount": 13.3,
  "congestion_surcharge": 0,
  "airport_fee": 0}

🪓 parquet catコマンド

Parquet ファイルのデータを取得できる(デフォルトは全件).実装としては headcatも同じ.--num-recordsオプションで件数を変えられたり,--columnsオプションでカラムを指定することもできる.

$ parquet cat--num-records2 yellow_tripdata_2022-01.parquet --columns VendorID,tpep_pickup_datetime,tpep_dropoff_datetime | jq .
{"VendorID": 1,
  "tpep_pickup_datetime": 1640997340000000,
  "tpep_dropoff_datetime": 1640998409000000}{"VendorID": 1,
  "tpep_pickup_datetime": 1640997223000000,
  "tpep_dropoff_datetime": 1640997727000000}

🪓 parquet column-sizeコマンド

Parquet ファイルのカラムごとのサイズを取得できる.--columnsオプションでカラムを指定することもできる.

$ parquet column-size yellow_tripdata_2022-01.parquet
DOLocationID-> Size In Bytes: 2445215 Size In Ratio: 0.06413285
RatecodeID-> Size In Bytes: 144678 Size In Ratio: 0.0037945996
fare_amount-> Size In Bytes: 2492332 Size In Ratio: 0.06536862
tpep_dropoff_datetime-> Size In Bytes: 9646502 Size In Ratio: 0.25300744
congestion_surcharge-> Size In Bytes: 214074 Size In Ratio: 0.0056147105
VendorID-> Size In Bytes: 318048 Size In Ratio: 0.00834173
passenger_count-> Size In Bytes: 488835 Size In Ratio: 0.012821114
tolls_amount-> Size In Bytes: 226301 Size In Ratio: 0.005935399
improvement_surcharge-> Size In Bytes: 26125 Size In Ratio: 6.8520376E-4
trip_distance-> Size In Bytes: 3413238 Size In Ratio: 0.08952205
store_and_fwd_flag-> Size In Bytes: 74134 Size In Ratio: 0.0019443788
payment_type-> Size In Bytes: 377921 Size In Ratio: 0.009912073
total_amount-> Size In Bytes: 3506595 Size In Ratio: 0.09197061
extra-> Size In Bytes: 540290 Size In Ratio: 0.01417067
tip_amount-> Size In Bytes: 2623199 Size In Ratio: 0.06880099
mta_tax-> Size In Bytes: 55403 Size In Ratio: 0.0014531041
airport_fee-> Size In Bytes: 163205 Size In Ratio: 0.004280524
tpep_pickup_datetime-> Size In Bytes: 9459174 Size In Ratio: 0.24809423
PULocationID-> Size In Bytes: 1912073 Size In Ratio: 0.05014965

おまけ : VS Code 拡張機能 parquet-viewer

VS Code 拡張機能 parquet-viewerを使うと,Parquet フォーマットを JSON フォーマットに変換して表示できる(Parquet ファイルが大きすぎるとうまく表示されないこともある).

marketplace.visualstudio.com


Viewing all articles
Browse latest Browse all 917

Trending Articles