Making a Web Crawler using the Android Web Client

Source Code

Like many others, my coworkers and I have been called back to work in the office for part of the week. Returning to the office hasn’t been without its challenges, especially since the environment has substantially changed. At the end of one week I was asked to collect some information on ads served to the browser in certain countries. To gather this information, I used a VPN to browse from a different country and created a web crawler using JavaScript and Node. It created a browser instance, followed links starting from a specific set of pages, kept track of the resources that the pages loaded, and downloaded content that was accessed from certain domains. The app worked fine and collected the information that I needed. On Monday, when I was in the office, I was asked to produce a similar dataset as seen from a different country. I started my software for this task only to find that the office network now actively blocks VPN connections.

I thought about driving back home to complete the task, but decided to just make a new web crawler to run from my Android tablet. That’s what I did. I made an app with a WebView and had it load each one of my starting pages. For each page that loaded, there were two sets of data that I needed to capture: the resources that the page requested, and the links that were in the page. To retrieve this information, I would need a WebViewClient for the WebView. The WebViewClient is an object whose methods get called to let one intercept or get notifications of what the WebView is doing. I was only concerned with two of its methods (sketched briefly after the list).

  • onPageFinished – Fires once a page has finished loading
  • onLoadResource – Fires when a page is requesting a resource, such as an image
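
The onPageFinished handling is shown in full later in this post. The resource side isn’t, so here is a minimal sketch of how it could look inside the WebViewClient subclass; the dataHandler’s ResourceRequested method is a placeholder name I’m using for illustration, not a name from the actual project.

    // Sketch: record each resource the page requests (images, scripts, stylesheets, ...).
    // ResourceRequested is a placeholder method name, not from the actual project.
    override fun onLoadResource(view: WebView?, url: String?) {
        super.onLoadResource(view, url)
        url?.let { dataHandler.ResourceRequested(it) }
    }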

When a page finishes loading, I grab the links. There is no API specifically for querying the page’s DOM. There is, however, a method on the WebView to execute JavaScript and return the result as a string. I inject a small function into the page that grabs the links, then extract them from the JSON array of strings that comes back. This is the JavaScript.

(function extractLinks(){
     var list = Array.from(document.getElementsByTagName('a'));
     for(var i=0;i<list.length;++i) {
           list[i] = list[i].href;
     }
     return list;
})()

To execute the JavaScript in the WebView, I use the WebView’s evaluateJavascript() method. The method accepts a ValueCallback object, and the value it receives is a string containing the JSON encoding of the result. I convert that to a String array and save the links. The two references to the dataHandler object are to a class that I defined. The two methods of interest here are LinksExtracted(String[]) and PageLoadComplete(). The LinksExtracted method receives all of the URLs of the links in the page; the dataHandler is responsible for saving those. PageLoadComplete is used to create a demarcation in the data between pages. Note that this method of capturing links isn’t perfect; after a page loads, it could dynamically adjust the HTML to remove some links and add others. For my application, the result of this apparent oversight is fine.

    override fun onPageFinished(view: WebView?, url: String?) {
        super.onPageFinished(view, url)

        view!!.evaluateJavascript(
            "(function extractLinks(){var list = Array.from(document.getElementsByTagName('a')); for(var i=0;i<list.length;++i) { list[i] = list[i].href; } return list;})()",
            object : ValueCallback<String> {
                override fun onReceiveValue(value: String) {
                    if (value != null && value != "null") {
                        val gson = GsonBuilder().create()
                        val theList = gson.fromJson<ArrayList<String>>(value, object :
                            TypeToken<ArrayList<String>>() {}.type)
                        if (theList != null) {
                            dataHandler.LinksExtracted(theList.toTypedArray())
                        }
                    }
                    dataHandler.PageLoadComplete()
                }
            }
        )
    }
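
As mentioned, dataHandler is an instance of a class of my own. A sketch of the contract it fulfills might look like the following; LinksExtracted and PageLoadComplete match the description above, while the interface name and ResourceRequested (used in the earlier resource sketch) are placeholders.

// Sketch of the contract the WebViewClient relies on.
// The interface name and ResourceRequested are placeholders for illustration.
interface CrawlDataHandler {
    // Receives the URLs of all links found in the page that just finished loading.
    fun LinksExtracted(urls: Array<String>)

    // Marks the boundary in the captured data between one page and the next.
    fun PageLoadComplete()

    // Records a resource requested by the page (see the onLoadResource sketch above).
    fun ResourceRequested(url: String)
}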

The links are persisted to a SQLite database. To do this, I’ve defined a data class for holding a row of data.

package net.j2i.webcrawler.data
import kotlinx.serialization.Serializable

@Serializable
data class UrlReading(val sessionID: Long = 0L, val pageRequestID: Long = 0L, val url: String = "", val timestamp: Long = -1L)

The sessionID will be the same for all values captured during the same run of the program. pageRequestID increments every time a new page loads. url contains the information of interest, the URL itself. And timestamp contains the time at which the URL was captured.

Creation of the database and insertion of data into it is fairly plain-vanilla code. I won’t post the code here, but if you would like to see it, it’s on GitHub and can be found through this link: https://github.com/j2inet/sample-webcrawler/blob/main/app/src/main/java/net/j2i/webcrawler/data/UrlReadingDataHelper.kt
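
For a sense of its shape, here is a compact sketch assuming a SQLiteOpenHelper and the contract constants used by the extraction method below. The constant names match that code, but the string values, database name, and the insertReading method are placeholders; the real implementation is at the link above.

// Sketch only: the constant names match the extraction code below, but the
// string values, the database name, and insertReading are placeholders.
object UrlReadingsContract {
    const val TABLE_NAME = "url_readings"
    const val COLUMN_NAME_SESSION_ID = "session_id"
    const val COLUMN_NAME_PAGE_REQUEST_ID = "page_request_id"
    const val COLUMN_NAME_URL = "url"
    const val COLUMN_NAME_TIMESTAMP = "timestamp"
}

class UrlReadingDataHelper(context: Context) :
    SQLiteOpenHelper(context, "webcrawler.db", null, 1) {

    override fun onCreate(db: SQLiteDatabase) {
        db.execSQL(
            "CREATE TABLE ${UrlReadingsContract.TABLE_NAME} (" +
            "${BaseColumns._ID} INTEGER PRIMARY KEY, " +
            "${UrlReadingsContract.COLUMN_NAME_SESSION_ID} INTEGER, " +
            "${UrlReadingsContract.COLUMN_NAME_PAGE_REQUEST_ID} INTEGER, " +
            "${UrlReadingsContract.COLUMN_NAME_URL} TEXT, " +
            "${UrlReadingsContract.COLUMN_NAME_TIMESTAMP} INTEGER)")
    }

    override fun onUpgrade(db: SQLiteDatabase, oldVersion: Int, newVersion: Int) {
        // Single-use tool; no migration needed.
    }

    fun insertReading(reading: UrlReading) {
        val values = ContentValues().apply {
            put(UrlReadingsContract.COLUMN_NAME_SESSION_ID, reading.sessionID)
            put(UrlReadingsContract.COLUMN_NAME_PAGE_REQUEST_ID, reading.pageRequestID)
            put(UrlReadingsContract.COLUMN_NAME_URL, reading.url)
            put(UrlReadingsContract.COLUMN_NAME_TIMESTAMP, reading.timestamp)
        }
        writableDatabase.insert(UrlReadingsContract.TABLE_NAME, null, values)
    }
}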

When the data is to be extracted, the program will write it to a CSV file with headers. To minimize the memory demand for this, I have a method on the data helper that will write the data as a cursor is reading it.

    fun writeAllRecords(os:OutputStreamWriter):List<UrlReading>  {

        os.write("SessionID, PageRequestID, Timestamp, URL\r\n")

        val readings = mutableListOf<UrlReading>()
        val db = readableDatabase
        val projection = arrayOf(
            BaseColumns._ID,
            UrlReadingsContract.COLUMN_NAME_SESSION_ID,
            UrlReadingsContract.COLUMN_NAME_PAGE_REQUEST_ID,
            UrlReadingsContract.COLUMN_NAME_URL,
            UrlReadingsContract.COLUMN_NAME_TIMESTAMP
        )
        val sortOrder = "${UrlReadingsContract.COLUMN_NAME_TIMESTAMP} ASC"
        val cursor = db.query(
            UrlReadingsContract.TABLE_NAME,
            projection,
            null,
            null,
            null,
            null,
            sortOrder
        )
        with(cursor) {
            while (moveToNext()) {
                val reading = UrlReading(
                    sessionID = getLong(getColumnIndexOrThrow(UrlReadingsContract.COLUMN_NAME_SESSION_ID)),
                    pageRequestID = getLong(getColumnIndexOrThrow(UrlReadingsContract.COLUMN_NAME_PAGE_REQUEST_ID)),
                    url = getString(getColumnIndexOrThrow(UrlReadingsContract.COLUMN_NAME_URL)),
                    timestamp = getLong(getColumnIndexOrThrow(UrlReadingsContract.COLUMN_NAME_TIMESTAMP))
                )

                val line = "${reading.sessionID}, ${reading.pageRequestID}, ${reading.timestamp}, ${reading.url}\r\n"
                os.write(line)
                readings.add(reading)
            }
        }
        cursor.close()
        return readings
    }
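
Calling it is straightforward. A sketch of exporting the data to a file might look like this; the file name and the dataHelper variable are placeholders, not from the project.

// Sketch of exporting the captured data to a CSV file.
// The path and the dataHelper variable name are placeholders.
val exportFile = File(getExternalFilesDir(null), "url_readings.csv")
OutputStreamWriter(FileOutputStream(exportFile)).use { writer ->
    dataHelper.writeAllRecords(writer)
}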

The program keeps track of the URLs that it has found in links and adds them to a list. When going to the next page, it randomly selects an item from this list (and removes it). However, the program will first visit all of the initial URLs it was given before selecting randomly. Without this, the links found on the first page loaded could crowd out the other initial pages, so that they are never visited or have less influence on which pages get visited. Those initial URLs are added to the list and a count of them is saved.

        UrlList.add("https://msn.com")
        UrlList.add("https://yahoo.com");
        linearLoadCount = UrlList.count()

The method for loading the next page initially dequeues URLs from the beginning of the list. After all of the initial URLs have been consumed, it selects randomly.

    fun openRandomSite() {
        var index = 0
        if (linearLoadCount > 0) {
            // Still working through the initial URLs; take them in order.
            --linearLoadCount
        } else {
            // Initial URLs exhausted; pick a discovered link at random.
            index = random.nextInt(UrlList.count())
        }
        val nextUrl = UrlList[index]
        UrlList.removeAt(index)
        mainWebView!!.loadUrl(nextUrl)
    }

To keep the pages cycling, the PageLoadComplete() handler queues the next call to load a page (with a delay).

            override fun PageLoadComplete() {
                ++pageSessionID;
                mainHandler.postDelayed(object:Runnable {
                    override fun run() {
                        openRandomSite()
                    }
                },NAVIGATE_DELAY)
            }
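
The post doesn’t show the wiring that starts the cycle, but there isn’t much to it. A sketch of the activity’s onCreate might look like this; the layout and view IDs, enabling JavaScript, and the client class name are my assumptions, while mainWebView, UrlList, linearLoadCount, and openRandomSite come from the code above.

    // Sketch of starting the crawl. CrawlerWebViewClient is a placeholder name
    // for the WebViewClient subclass that contains the overrides shown earlier.
    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_main)

        mainWebView = findViewById(R.id.mainWebView)
        mainWebView!!.settings.javaScriptEnabled = true // the injected extractor needs JavaScript
        mainWebView!!.webViewClient = CrawlerWebViewClient(dataHandler)

        // Add the initial URLs and record their count, as shown earlier.
        UrlList.add("https://msn.com")
        UrlList.add("https://yahoo.com")
        linearLoadCount = UrlList.count()

        openRandomSite() // load the first page; PageLoadComplete() keeps the cycle going
    }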

It took less time to write this than it would have taken to drive home. The initial set of URLs is hard-coded in the source. This was written to be used only once, so I skipped practices that would have made the program more generally useful. Nevertheless, I think it might be useful to someone. You can find the complete source code on GitHub.

https://github.com/j2inet/sample-webcrawler


