Grabbing Media with MITMPROXY

Why

I often like to follow online courses on my tablet when I am offline, e.g. on trains and planes. Most content providers don’t give me the option to download the content for offline use.

There are many ways of grabbing content, but sites often try to hide how their content is loaded behind a lot of obfuscated JavaScript. That’s why I wanted a slightly different approach: a man-in-the-middle HTTP/HTTPS proxy. My eyes landed on mitmproxy. It’s a nice proxy that you can extend with your own Python scripts, and it supports on-the-fly HTTPS certificate generation!

Now I can listen in on all traffic from my browser, even encrypted HTTPS traffic 🙂

As an example I will try to get a video course from Cybrary. They offer many great free courses, but unfortunately no offline option. I guess I will get fewer problems with them than with other copyright-crazy media providers, but the procedure will probably be the same.

Setup

First we need to install mitmproxy. If you are using Kali Linux, it’s already installed and you just need to start it up:

mitmproxy --mode regular

This starts the proxy and opens its console interface. The proxy listens on port 8080 by default. The first time mitmproxy starts, it generates a CA certificate that you need to import into your browser. On Linux you find it under “~/.mitmproxy/mitmproxy-ca-cert.cer”.
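Before touching the browser settings, it can help to confirm that the proxy is really listening. The following is a minimal sketch, not part of mitmproxy itself: the function name `proxy_is_listening` is made up for illustration, and the host `127.0.0.1` is an assumption that the proxy runs on the same machine as the browser.

```python
import socket


def proxy_is_listening(host="127.0.0.1", port=8080, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (refused, timeout, ...) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Host and port are assumptions: a local proxy on mitmproxy's default port.
    print("proxy reachable:", proxy_is_listening())
```

This only checks that something accepts connections on the port. It does not prove that the listener is mitmproxy or that the certificate is installed correctly.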

Now you need to point your browser to your new proxy. In Firefox you find the proxy settings under Options -> General -> Network Proxy -> Settings. In my case it looks like this:

Then you import the CA-certificate into the root certificate store of Firefox:

Firefox -> Options -> Privacy & Security -> View Certificates -> Authorities -> Import

Now navigate to https://www.google.com and verify that you have installed the certificate correctly and that Mitmproxy can issue on-the-fly certificates. It should look like this in Firefox:

Basic mitmproxy navigation

In mitmproxy you can use the arrow keys to navigate the captured flows. To look closer at a flow, press Enter on it. You can then browse its details (request and response). Here is an example of some Google traffic:

At any time you can press question mark to get the help menu.

Inspecting some real content

First I started a Cybrary course in Firefox and then had a look in the mitmproxy console. A lot was going on in the beginning, but after the video started things settled down a little. I noticed something like this:

The video appeared to be split up into small, separate video and audio segments of around 20-200 kB each. Judging from the URLs ending in .m4s, the format was m4s, which is apparently a streaming version of MP4.

My first thought was to join all these segments. To do that I used the Mitmproxy Python add-on functionality. By looking at the documentation I made my first script CybraryGrabber.py:

#!/usr/bin/env python3

import re

from mitmproxy import ctx

# Matches e.g. ".../video/<stream id>/chop/segment-<n>.m4s"
SEGMENT_RE = re.compile(r'(video|audio)/(.+)/chop/segment-(\d+)\.m4s')


class CybraryGrabber:
    def __init__(self):
        ctx.log.warn("CybraryGrabber class initiated")

    def writefile(self, filename, content):
        ctx.log.warn("Writing File: {}".format(filename))
        # The with-block closes the file, so no explicit close() is needed.
        with open(filename, "wb") as f:
            f.write(content)

    def response(self, flow):
        # Called by mitmproxy for every HTTP response passing through the proxy.
        url = flow.request.path

        if ".m4s" in url:
            matches = SEGMENT_RE.search(url)
            if matches:
                # <stream id>_<video|audio>_<zero-padded segment number>.m4s
                filename = '{}_{}_{:03}.m4s'.format(
                    matches.group(2), matches.group(1), int(matches.group(3)))
                self.writefile(filename, flow.response.content)


addons = [
    CybraryGrabber()
]

Basically this script hooks into every HTTP response. It looks at the URL, and if it contains .m4s we run a regex on it. The regex extracts the stream ID, whether it is a video or audio file, and the segment number.
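To see what the regex does without running the proxy, the URL-to-filename step can be tried on its own. This is a small standalone sketch: the helper name `segment_filename` and the example path are made up for illustration, and the pattern mirrors the one in the add-on (with the dot before `m4s` escaped).

```python
import re

# Captures: stream type, stream ID, segment number.
SEGMENT_RE = re.compile(r"(video|audio)/(.+)/chop/segment-(\d+)\.m4s")


def segment_filename(path):
    """Map a segment request path to a local filename, or None if it does not match."""
    match = SEGMENT_RE.search(path)
    if match is None:
        return None
    kind, stream_id, number = match.groups()
    # Zero-pad the segment number so the files sort in playback order.
    return "{}_{}_{:03}.m4s".format(stream_id, kind, int(number))


# Hypothetical path, invented for this example.
example = "/some/prefix/video/568839434/chop/segment-1.m4s"
print(segment_filename(example))  # 568839434_video_001.m4s
```

The zero padding matters later: it is what lets a plain shell glob list the segments in the right order.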

Now we start mitmproxy with our new script and visit a course on Cybrary:

mitmproxy --mode regular -s CybraryGrabber.py --set console_eventlog_verbosity="warn"

Analysing the data

In the console you can see all logged events by pressing capital-E. When I started up Mitmproxy, I instructed it only to show events with loglevel >= warning:

warn: CybraryGrabber class initiated
warn: Writing File: 568839438_audio_001.m4s
warn: Writing File: 568839434_video_001.m4s
warn: Writing File: 568839438_audio_002.m4s
warn: Writing File: 568839434_video_002.m4s
warn: Writing File: 568839438_audio_003.m4s
warn: Writing File: 568839434_video_003.m4s
warn: Writing File: 568839434_video_004.m4s
warn: Writing File: 568839438_audio_004.m4s
warn: Writing File: 568839438_audio_005.m4s
warn: Writing File: 568839434_video_005.m4s

My first thought was that these files just needed to be joined, so I did this:

cat 568839438_audio* > joined_audio.m4s
cat 568839434_video* > joined_video.m4s

To my great disappointment, I couldn’t play the files in VLC 🙁 Bummer… What went wrong?

After some googling I realized that a small init segment is often sent before the video stream segments. So I went back into the console and looked at what happened just before segment-1 was sent.
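One way to check whether a captured file is an init segment or a media segment is to list its top-level boxes. MP4 files are a sequence of boxes, each starting with a 4-byte big-endian size and a 4-byte type. In fragmented MP4, an init segment typically carries `ftyp` and `moov`, while media segments typically carry `moof` and `mdat`. The sketch below is a generic box lister, not something from the Cybrary player, and the function name `list_boxes` is made up for illustration.

```python
import struct


def list_boxes(data):
    """Return (type, size) for each top-level MP4 box found in data."""
    boxes = []
    offset = 0
    while offset + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:
            # A size of 1 means a 64-bit size follows the type field.
            if offset + 16 > len(data):
                break
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:
            # A size of 0 means the box runs to the end of the data.
            size = len(data) - offset
        if size < header:
            break  # malformed box: stop instead of looping forever
        boxes.append((box_type.decode("ascii", "replace"), size))
        offset += size
    return boxes


if __name__ == "__main__":
    import sys
    # Usage: python list_boxes.py 568839434_video_001.m4s ...
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            print(path, list_boxes(f.read()))
```

If the joined file starts with `moof` instead of `ftyp`, that would explain why a player cannot make sense of it.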

I stumbled on this JSON file:

Maybe that had to be added to the beginning of my other captured files? I added the following to my response method to also fetch the init_segment entries from the JSON and base64-decode them:

# Goes inside response(), after "url = flow.request.path",
# indented to match the rest of the method.
if "master.json" in url:
    # These imports can also live at the top of the script.
    import json
    import base64

    receivedJson = json.loads( flow.response.content.decode("utf-8") )

    # Each video and audio stream carries its init segment as a base64 string.
    for kind in ("video", "audio"):
        for stream in receivedJson[kind]:
            init_raw = base64.b64decode( stream["init_segment"] )
            # Segment number 000 sorts before the first media segment.
            filename = "{}_{}_000.m4s".format( stream["id"], kind )
            self.writefile( filename, init_raw )
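The decoding step can also be tried outside the proxy. The sketch below assumes only the fields used above (`video`, `audio`, `id`, `init_segment`). The sample payload is invented for illustration and is much smaller than a real master.json, and the helper name `extract_init_segments` is hypothetical.

```python
import base64
import json


def extract_init_segments(master_json_bytes):
    """Return {filename: raw_bytes} for every init segment in a master.json body."""
    master = json.loads(master_json_bytes.decode("utf-8"))
    result = {}
    for kind in ("video", "audio"):
        for stream in master.get(kind, []):
            filename = "{}_{}_000.m4s".format(stream["id"], kind)
            result[filename] = base64.b64decode(stream["init_segment"])
    return result


# Made-up payload: real init segments are binary MP4 data, not short strings.
sample = json.dumps({
    "video": [{"id": "568839434",
               "init_segment": base64.b64encode(b"video-init").decode("ascii")}],
    "audio": [{"id": "568839438",
               "init_segment": base64.b64encode(b"audio-init").decode("ascii")}],
}).encode("utf-8")

for name, raw in extract_init_segments(sample).items():
    print(name, len(raw))
```

Returning the data instead of writing it directly keeps the parsing separate from the file handling, which makes it easy to test with a saved copy of the response.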

After I joined the files, including the new “xxx_video_000.m4s” file, I finally got a video that would play! The last thing to do was to join audio and video with ffmpeg:

ffmpeg.exe -i joined_audio.m4s -i joined_video.m4s joined_final.mp4

Badabing-badaboom 🙂 A fully playable video appeared 🙂
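The ffmpeg call above can also be driven from Python. The sketch below is a variation, not the command used in this post: it adds ffmpeg's `-c copy` option, which copies the streams into the new container instead of re-encoding them. Whether stream copy works depends on the codecs in the captured segments, so treat it as an assumption to verify. The function names are made up for illustration.

```python
import subprocess


def build_mux_command(audio_path, video_path, output_path, ffmpeg="ffmpeg"):
    """Build an ffmpeg argument list that muxes audio and video without re-encoding."""
    return [ffmpeg, "-i", audio_path, "-i", video_path, "-c", "copy", output_path]


def mux(audio_path, video_path, output_path):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_mux_command(audio_path, video_path, output_path), check=True)


if __name__ == "__main__":
    # Only prints the command; call mux(...) to actually run ffmpeg.
    print(" ".join(build_mux_command(
        "joined_audio.m4s", "joined_video.m4s", "joined_final.mp4")))
```

Passing the arguments as a list avoids shell quoting problems if a filename contains spaces.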

Conclusion

All of this is proof-of-concept code to show the idea. Of course, the finished code will do all the file-joining itself.
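The file-joining step could look roughly like the sketch below. It is not the finished script, only an illustration of replacing the `cat` commands with Python. It assumes the `<id>_<kind>_<number>.m4s` naming used by the grabber, and the function name `join_segments` is hypothetical.

```python
from pathlib import Path


def join_segments(directory, stream_id, kind, output_name):
    """Concatenate <stream_id>_<kind>_<number>.m4s files in numeric order."""
    directory = Path(directory)
    parts = sorted(
        directory.glob("{}_{}_*.m4s".format(stream_id, kind)),
        # Sort on the numeric suffix so the init segment (000) comes first.
        key=lambda p: int(p.stem.rsplit("_", 1)[1]),
    )
    if not parts:
        raise FileNotFoundError(
            "no segments found for {} {}".format(stream_id, kind))
    output = directory / output_name
    with open(output, "wb") as out:
        for part in parts:
            out.write(part.read_bytes())
    return output
```

For the files captured earlier, a call such as `join_segments(".", "568839434", "video", "joined_video.m4s")` would do the same job as the `cat` command. Sorting on the number rather than the filename keeps the order correct even if a stream has more than 999 segments.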

This method will work on a lot of content out on the big internet. It just requires a few hours of investigating the HTTP flow to write a small and effective script in Python.

 

3 thoughts to “Grabbing Media with MITMPROXY”

  1. Hello Cron,
    With mitmproxy and your CybraryGrabber script, I’m able to save video and audio segments but, as for you, the joined file is not playable.
    I added your code to “fetch the JSON init_segment and base64 decode it” to the CybraryGrabber script, without success: there is a problem with url.
    I don’t know Python. Could you post the final CybraryGrabber script? With file joining it would be splendid.
    Thank you very much.
