Avoiding Unnecessary File Downloads While Syncing

I had the opportunity to revisit an old project that was created for a client. The initial release of this project had a program that synced content from a CMS. It was made to download only content that had changed since the last time it synced. For some reason, it was now always downloading all of the files instead of only the ones that had changed. Looking into the problem, I found that changes in the CMS resulted in files no longer having ETag headers, which are used to tell whether a file has changed since the last time it was requested. The files still had a Last-Modified header indicating when they were last updated, and it would be easy enough to use that header instead. But the client had enough requests for changes to justify writing a new syncing component; they had a new CMS with different APIs. File syncing isn’t complex, and I could rewrite the component easily in an evening. I decided to write the new version of the component using .NET 6.0.

Before downloading a file, I need to check the attributes of the file on the server without starting the transfer of the file itself. The HTTP verb for obtaining this information is HEAD. A HEAD request returns the headers for the resource identified by the URI, but it doesn’t return the resource’s data stream itself. As a quick test, I grabbed the URL for an image of an MP3 player I keep seeing in an Amazon advertisement: https://m.media-amazon.com/images/I/61TUVbqPhLL.AC_SL1500.jpg

I used Postman to request the image at the URL and examined the headers. Postman performs a GET request by default. Changing the request from GET to HEAD results in a response that has headers but no body. This is exactly what we want!
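The same check can be made from code. What follows is only a minimal sketch, not the code from the finished component; the class and method names (HeadRequestSketch, BuildHeadRequest, GetLastModifiedAsync) are names I made up for illustration. It assumes the server answers HEAD requests and sends a Last-Modified header.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper class; the names here are illustrative only.
public static class HeadRequestSketch
{
    // Builds a HEAD request for the URL. Kept separate so it can be
    // inspected without generating any network traffic.
    public static HttpRequestMessage BuildHeadRequest(string url)
    {
        return new HttpRequestMessage(HttpMethod.Head, url);
    }

    // Sends the HEAD request and returns the Last-Modified value,
    // or null when the server does not send that header.
    public static async Task<DateTimeOffset?> GetLastModifiedAsync(HttpClient client, string url)
    {
        using (var request = BuildHeadRequest(url))
        using (var response = await client.SendAsync(request))
        {
            response.EnsureSuccessStatusCode();
            // HttpClient exposes Last-Modified among the content headers.
            return response.Content.Headers.LastModified;
        }
    }
}
```

Calling GetLastModifiedAsync with an HttpClient and the image URL should give back the same date that Postman shows in the Last-Modified header, without transferring the image itself.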

There are a couple of things that we need to do with this information. We need to save it somewhere for future use, and when we make future requests, we need to use it to filter what data we transfer. The filtering can be done on the client side, within the logic of the program making the request, or it can be performed on the server side by adding a header named If-Modified-Since to the request. Providing a date in this header causes the server to either send the resource (if it is more recent than the date in the header) or return a 304 (Not Modified) response with headers only (if the server version is not more recent than the date specified). The date must be in the HTTP date format, for example Wed, 30 Oct 2019 16:28:38 GMT. But if you are saving the original date from the response, you can send it back exactly as it was received.
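If the date is stored as a DateTimeOffset rather than as the raw header text, it has to be turned back into that format. Here is a small sketch of the round trip using the "R" (RFC 1123) format specifier. The HttpDateSketch class and its method names are my own for this illustration and are not part of the sample project.

```csharp
using System;
using System.Globalization;

// Hypothetical helper class; the names here are illustrative only.
public static class HttpDateSketch
{
    // Formats a date the way HTTP headers expect it. For a DateTimeOffset,
    // the "R" specifier converts the value to UTC before formatting.
    public static string ToHttpDate(DateTimeOffset value)
    {
        return value.ToString("R", CultureInfo.InvariantCulture);
    }

    // Parses a header value in that same format back into a DateTimeOffset.
    public static DateTimeOffset ParseHttpDate(string headerValue)
    {
        return DateTimeOffset.ParseExact(
            headerValue, "R", CultureInfo.InvariantCulture, DateTimeStyles.AssumeUniversal);
    }
}
```

For example, passing 30 October 2019 18:28:38 with a +02:00 offset to ToHttpDate produces Wed, 30 Oct 2019 16:28:38 GMT, and ParseHttpDate turns that string back into the same instant in time.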

Let’s jump into actual code. I’ve made a data class that stores information about the files I will be downloading.

using Newtonsoft.Json;

namespace FileSyncExample.ViewModels
{
    public class FileData: ViewModelBase
    {
        private DateTimeOffset? _serverLastModifiedDate;
        [JsonProperty("last-modified")]
        public DateTimeOffset? ServerLastModifiedDate
        {
            get => _serverLastModifiedDate;
            set => SetValueIfChanged(() => ServerLastModifiedDate, () => _serverLastModifiedDate, value);
        }

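        // Note: this backing field is public, so Json.NET serializes it too (as "_fileName").
        // Making it private, like the other backing fields, would remove the duplicate entry.
        public string _fileName;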
        [JsonProperty("file-name")]
        public string FileName
        {
            get => _fileName;
            set => SetValueIfChanged(() => FileName, () => _fileName, value);
        }

        private string _clientName;
        [JsonProperty("client-name")]
        public string ClientName
        {
            get => _clientName;
            set => SetValueIfChanged(()=>ClientName, () => _clientName, value);
        }

        private bool _didUpdate;
        [JsonIgnore]
        public bool DidUpdate
        {
            get => _didUpdate;
            set => SetValueIfChanged(()=>DidUpdate, ()=>_didUpdate, value);
        }
    }
}

I’m using this class for two purposes in this example program: building the download list and saving metadata. In the real program, the list is built from a query to the CMS. Here, I create a list of these objects with the file identifiers hard-coded.

        public MainViewModel()
        {
            Files.Add(new FileData() { FileName= "61lLJ85GYXL._AC_SL1000_.jpg" });
            Files.Add(new FileData() { FileName= "61qfFAQ3xKL._AC_SL1500_.jpg" });
            Files.Add(new FileData() { FileName= "71PKvcmV6DL._AC_SX679_.jpg" });
            Files.Add(new FileData() { FileName= "71fOsWX9qlL._AC_UY327_FMwebp_QL65_.jpg" });
        }

All of these images are coming from Amazon. The full URL to the data stream is built by prepending the base URL to the file name. I do this through string interpolation.

var requestUrl = $"https://m.media-amazon.com/images/I/{file.FileName}";

For the download, I am using the HttpClient. It accepts a request and returns the response.

// Needs: using System.Net.Http; using System.Net.Http.Headers;
HttpClient client = new HttpClient();
client.DefaultRequestHeaders.Clear();
client.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));
client.DefaultRequestHeaders.ConnectionClose = true;

For now, let’s code for a single scenario: there are no files already downloaded. We wish to do our priming download and save the file’s data and the metadata about the file. To keep the file system clean, instead of placing the metadata in a separate file, I’m saving it in an alternative data stream. This only works on NTFS file systems. If you would like to learn more about that read here. The significant parts of the code to perform the download follow.

var requestUrl = $"https://m.media-amazon.com/images/I/{file.FileName}";
var request = new HttpRequestMessage(HttpMethod.Get, requestUrl);
var response = await client.SendAsync(request);
var lastModified = response.Content.Headers.LastModified;
if(lastModified.HasValue)
{
    file.ServerLastModifiedDate = lastModified;
}
try
{
    response.EnsureSuccessStatusCode();
    using (FileStream outputStream = new FileStream(Path.Combine(Settings.Default.CachePath, file.FileName), FileMode.Create, FileAccess.Write))
    {
        var data = await response.Content.ReadAsByteArrayAsync();
        outputStream.Write(data, 0, data.Length);
    }
    //Putting the metadata in an alternative stream named meta.json
    var fileMetadata = JsonConvert.SerializeObject(file);
    Debug.WriteLine(fileMetadata);
    var metaFilePath = Path.Combine(Settings.Default.CachePath, $"{file.FileName}:meta.json");
    var fileHandle = NativeMethods.CreateFileW(metaFilePath, NativeConstants.GENERIC_WRITE,
                        0,//NativeConstants.FILE_SHARE_WRITE,
                        IntPtr.Zero,
                        NativeConstants.OPEN_ALWAYS,
                        0,
                        IntPtr.Zero);
    // CreateFileW returns INVALID_HANDLE_VALUE (-1) on failure, which is not IntPtr.MinValue.
    if(fileHandle != new IntPtr(-1))
    {
        using(StreamWriter sw = new StreamWriter(new FileStream(fileHandle, FileAccess.Write)))
        {
            sw.Write(fileMetadata);
        }
    }

}
catch(Exception exc)
{
    // Log the failure instead of silently swallowing it.
    Debug.WriteLine(exc);
}

After running the program, the images show up in my download folder. When I open PowerShell and check the streams, I see that my alternative data stream is present.

Printing out the data in one of the alternative data streams, I see the data in the format that I expect.

PS C:\temp\streams> Get-Item .\61lLJ85GYXL._AC_SL1000_.jpg | Get-Content -Stream meta.json

{"_fileName":"61lLJ85GYXL._AC_SL1000_.jpg","last-modified":"2019-10-30T16:28:38+00:00","file-name":"61lLJ85GYXL._AC_SL1000_.jpg","client-name":"j2i.net"}

PS C:\temp\streams>

Next, we want to modify the program to load this metadata if it exists and grab the last modified date from it. This is all we need; we are going to use it to detect whether the file has been modified.

void RefreshMetadata()
{
    DirectoryInfo cacheDataDirectory = new DirectoryInfo(Settings.Default.CachePath);
    if (!cacheDataDirectory.Exists)
        return;
    foreach(var file in Files)
    {
        var fileInfo = new FileInfo(Path.Combine(cacheDataDirectory.FullName, file.FileName));
        if (!fileInfo.Exists)
            continue;
        //Great! The file exists! Let's load the metadata for it!
        var metaFilePath = $"{fileInfo.FullName}:meta.json";
        var fileHandle = NativeMethods.CreateFileW(metaFilePath, NativeConstants.GENERIC_READ,
                            0,//NativeConstants.FILE_SHARE_WRITE,
                            IntPtr.Zero,
                            NativeConstants.OPEN_ALWAYS,
                            0,
                            IntPtr.Zero);
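        if (fileHandle == new IntPtr(-1)) // INVALID_HANDLE_VALUE: the stream could not be opened
            continue;
        using (StreamReader sr = new StreamReader(new FileStream(fileHandle, FileAccess.Read)))
        {
            var metaString = sr.ReadToEnd();
            // OPEN_ALWAYS creates an empty stream when none exists, so the result can be null.
            var readFileData = JsonConvert.DeserializeObject<FileData>(metaString);
            if (readFileData != null)
                file.ServerLastModifiedDate = readFileData.ServerLastModifiedDate;
        }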

    }
}

The previous code that we wrote needs a few changes. If the file being downloaded has a saved last modified date, add it to the request in a header field named If-Modified-Since. Thankfully, .NET can convert the DateTimeOffset object to the string format that we need for the request through the "R" format specifier.

 if(file.ServerLastModifiedDate.HasValue)
 {
     request.Headers.Add("If-Modified-Since", file.ServerLastModifiedDate.Value.ToString("R"));
 }

When the response comes back, we must examine the response code. If the file has been updated, the response code will be 200 (OK). This is the normal response code that we get when we first access a file. If the file has not been updated since the date we pass in If-Modified-Since, the response code will be 304 (Not Modified) and the response will have no content. We can move on from this file.

var response = await client.SendAsync(request);
if(response.StatusCode == System.Net.HttpStatusCode.NotModified)
{
    continue;
}
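The sample only distinguishes 304 from everything else. If you want the decision spelled out for error responses too, it could be pulled into a small helper along these lines. This is a sketch under my own assumptions; the SyncAction enum and the SyncDecisionSketch class are hypothetical names and are not in the sample project.

```csharp
using System.Net;

// Hypothetical names, for illustration only.
public enum SyncAction { Skip, Download, Fail }

public static class SyncDecisionSketch
{
    // Maps the status code of a conditional GET to what the sync loop should do next.
    public static SyncAction Decide(HttpStatusCode statusCode)
    {
        // 304: the cached copy is still current, so there is nothing to transfer.
        if (statusCode == HttpStatusCode.NotModified)
            return SyncAction.Skip;

        // Any 2xx response carries a body that should replace the cached file.
        int code = (int)statusCode;
        if (code >= 200 && code <= 299)
            return SyncAction.Download;

        // Everything else (404, 500, and so on) is treated as a failure for this file.
        return SyncAction.Fail;
    }
}
```

In the loop, Skip corresponds to the continue statement above, Download to the code that writes the file and its metadata, and Fail to whatever error handling you prefer.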

I can’t modify the images on Amazon to test the behaviour of the app when an image is updated. If you want to test that, you will have to modify the sample program to point to a set of images that you control. The Node.js-based http-server utility is useful here if you want to serve a set of images from your local computer for this purpose.

As always, the code for this post is available on GitHub. You can find it in the following repository.



