λ沢.dev

lambdasawa cheatsheet

AWS Transcribe YouTube Channel Archives

Get video URLs from DevTools on the channel's video list page

https://www.youtube.com/channel/xxxx/videos

[...document.querySelectorAll(".yt-simple-endpoint.style-scope.ytd-grid-video-renderer")]
  .map(e => 
    e.href.includes("watch") ?
      new URLSearchParams(new URL(e.href).search).get("v") :
      e.href.match(/\/shorts\/([\w-]+)/)[1]
  )
  .map(id => `https://www.youtube.com/watch?v=${id}`)
  .join("\n")

public-video-ids.txt

https://www.youtube.com/watch?v=xxxx1
https://www.youtube.com/watch?v=xxxx2
https://www.youtube.com/watch?v=xxxx3
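
Text pasted from the DevTools console can pick up duplicates or stray lines (quotes, blank lines). A minimal sketch of a filter that keeps only well-formed watch URLs and drops repeats; the function name `clean_url_list` is hypothetical and not part of the original workflow:

```shell
# Keep only well-formed watch URLs from stdin and drop duplicates,
# preserving the order of first appearance.
clean_url_list() {
  grep -E '^https://www\.youtube\.com/watch\?v=[A-Za-z0-9_-]+$' | awk '!seen[$0]++'
}

# Possible usage (raw-urls.txt is an assumed file name):
#   clean_url_list <raw-urls.txt >public-video-ids.txt
```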

Get video URLs from DevTools on the playlist's video list page

https://www.youtube.com/playlist?list=xxxx

[...document.querySelectorAll(".yt-simple-endpoint.style-scope.ytd-playlist-video-renderer")]
  .map(e => 
    e.href.includes("watch") ?
      new URLSearchParams(new URL(e.href).search).get("v") :
      e.href.match(/\/shorts\/([\w-]+)/)[1]
  )
  .map(id => `https://www.youtube.com/watch?v=${id}`)
  .join("\n")

private-video-ids.txt

https://www.youtube.com/watch?v=xxxx4
https://www.youtube.com/watch?v=xxxx5
https://www.youtube.com/watch?v=xxxx6

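Upload downloaded videos to S3

aws s3 mb "s3://$MP4_BUCKET_NAME"

while read -r url; do
  # Put the video ID in brackets so the Transcribe step can extract it from the S3 key
  # (youtube-dl's default file name is "title-id.ext", without brackets)
  youtube-dl --format mp4 --output '%(title)s [%(id)s].%(ext)s' "$url"
  aws s3 cp ./*.mp4 "s3://$MP4_BUCKET_NAME"
  rm ./*.mp4
done <public-video-ids.txt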
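
Since a download can fail partway through the loop, it may be worth checking which videos never reached the bucket. A rough sketch, assuming the S3 keys contain the video ID in square brackets (as the Transcribe step expects) and that the key list has been saved to a file; `missing_video_ids` and the file names are hypothetical:

```shell
# Print the video IDs from a URL list that do not appear in a list of S3 keys.
# $1: file with one watch URL per line
# $2: file with one S3 key per line (assumed to contain "[VIDEO_ID]")
missing_video_ids() {
  local urls="$1" keys="$2" id
  sed -n 's/.*[?&]v=\([A-Za-z0-9_-]*\).*/\1/p' "$urls" | while IFS= read -r id; do
    # -F: match "[id]" as a fixed string; --: the ID may start with "-"
    grep -qF -- "[$id]" "$keys" || printf '%s\n' "$id"
  done
}

# Possible usage:
#   aws s3api list-objects-v2 --bucket "$MP4_BUCKET_NAME" | jq -r '.Contents[].Key' >keys.txt
#   missing_video_ids public-video-ids.txt keys.txt
```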

Launch EC2

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.16"
    }
  }

  required_version = ">= 1.2.0"
}

provider "aws" {
  region = "ap-northeast-1"
}

resource "aws_instance" "app_server" {
  ami           = "ami-0a054d6fbb2cc67ee"
  instance_type = "t2.micro"

  tags = {
    Name = "youtube-backup"
  }
}

terraform init
terraform apply -auto-approve

Allow SSH from my IP

TODO
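
One possible way to do this, sketched with the AWS CLI rather than Terraform. It assumes the instance uses a security group whose ID you know (shown as the placeholder `sg-xxxx`) and looks up the current public IP via `https://checkip.amazonaws.com`; the helper names `to_cidr32` and `allow_ssh_from_my_ip` are hypothetical:

```shell
# Turn a bare IPv4 address into a /32 CIDR block; fail on anything else.
to_cidr32() {
  if printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
    printf '%s/32\n' "$1"
  else
    echo "not an IPv4 address: $1" >&2
    return 1
  fi
}

# Open TCP port 22 on the given security group for the caller's public IP only.
allow_ssh_from_my_ip() {
  local sg_id="$1" ip cidr
  ip="$(curl -sS https://checkip.amazonaws.com)" || return 1
  cidr="$(to_cidr32 "$ip")" || return 1
  aws ec2 authorize-security-group-ingress \
    --group-id "$sg_id" --protocol tcp --port 22 --cidr "$cidr"
}

# Possible usage (sg-xxxx is a placeholder):
#   allow_ssh_from_my_ip sg-xxxx
```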

SSH into EC2 and run the script

https://dev.classmethod.jp/articles/ec2-instance-connect/

sudo apt update -y && sudo apt install -y youtube-dl awscli

curl -sSLO https://raw.githubusercontent.com/lambdasawa/lambdasawa/bba6e3864d5b5721247a6d17da46326fe62ce632/docs/transcribe-channel-archives.md

cp transcribe-channel-archives.md public-video-ids.txt
vim public-video-ids.txt

cp transcribe-channel-archives.md private-video-ids.txt
vim private-video-ids.txt

cp transcribe-channel-archives.md backup.sh
vim backup.sh
chmod u+x backup.sh

./backup.sh

Run Transcribe

for key in (aws s3api list-objects-v2 --bucket $MP4_BUCKET_NAME | jq -r ".Contents[] | .Key")
    set video_id (echo $key | sed -E 's/.*\\[(.*)\\].*/\1/')
    set s3_uri "s3://$MP4_BUCKET_NAME/$key"
    set media (printf '{"MediaFileUri":"%s"}' "$s3_uri")

    echo $video_id $s3_uri $media
    aws transcribe start-transcription-job \
        --transcription-job-name "$video_id" \
        --media-format mp4 \
        --language-code ja-JP \
        --media "$media"
end
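
The loop above is written for fish. For reference, a bash sketch of the same ID extraction; `extract_video_id` is a hypothetical name. In bash, single quotes take backslashes literally, so the pattern is written with single backslashes. The greedy `.*` means the last bracketed group in the key wins, and a key without brackets is printed unchanged:

```shell
# Print the text inside the last [...] of an S3 key, i.e. the video ID
# when the file was saved as "title [id].mp4".
extract_video_id() {
  printf '%s\n' "$1" | sed -E 's/.*\[(.*)\].*/\1/'
}
```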

aws s3 mb s3://$TRANSCRIPT_BUCKET_NAME

mkdir -p transcript
pushd transcript

for key in (aws s3api list-objects-v2 --bucket $MP4_BUCKET_NAME | jq -r ".Contents[] | .Key")
    set video_id (echo $key | sed -E 's/.*\\[(.*)\\].*/\1/')

    # TranscriptFileUri is only available once the job has completed
    curl -sSL -o "$video_id.json" (aws transcribe get-transcription-job --transcription-job-name $video_id | jq -r .TranscriptionJob.Transcript.TranscriptFileUri)
    aws s3 cp "$video_id.json" s3://$TRANSCRIPT_BUCKET_NAME/
end

popd
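
Transcription jobs run asynchronously, so the transcript URI is only present after a job reaches `COMPLETED`. A minimal bash sketch (the cheatsheet itself uses fish here) of polling a job before fetching its transcript; the function name `wait_for_transcription_job` and the 30-second default interval are assumptions:

```shell
# Poll until the named transcription job leaves QUEUED / IN_PROGRESS.
# $1: job name, $2: seconds between polls (default 30).
# Returns 0 on COMPLETED, 1 on FAILED, an unexpected status, or a CLI error.
wait_for_transcription_job() {
  local job_name="$1" interval="${2:-30}" job_status
  while :; do
    job_status="$(aws transcribe get-transcription-job \
      --transcription-job-name "$job_name" \
      --query 'TranscriptionJob.TranscriptionJobStatus' \
      --output text)" || return 1
    case "$job_status" in
      COMPLETED) return 0 ;;
      QUEUED|IN_PROGRESS) sleep "$interval" ;;
      *) echo "job $job_name ended with status: $job_status" >&2; return 1 ;;
    esac
  done
}

# Possible usage:
#   wait_for_transcription_job "$video_id" && echo "transcript ready for $video_id"
```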