Calculating a file’s mean value of data bytes

  accumulate, c++, c++20, unordered-map

Just for fun, I am trying to calculate a file’s mean value of data bytes, essentially replicating a feature of an existing tool (ent). It is simply the result of summing all the bytes of a file and dividing by the file length; if the data are close to random, the result should be about 127.5. I am testing two methods of computing the mean: a simple for loop over an unordered_map of byte frequencies, and std::accumulate applied directly to a string object.
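
To make the goal concrete, here is a minimal sketch of the calculation itself (a stripped-down illustration, not the code I am benchmarking; test.bin is just a placeholder name):

// minimal sketch: read the whole file, sum the byte values, divide by the length
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main() {
  std::ifstream in("test.bin", std::ios::binary);  // placeholder input file
  std::string data((std::istreambuf_iterator<char>(in)),
                   std::istreambuf_iterator<char>());
  if (data.empty()) {
    return 1;  // nothing to average
  }

  unsigned long long sum = 0;
  for (unsigned char c : data) {  // treat each char as an unsigned byte value
    sum += c;
  }

  // prints something close to 127.5 when the input is random
  std::cout << static_cast<double>(sum) / static_cast<double>(data.size()) << '\n';
}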

Both methods work fine when compiled with clang++, and I can reproduce the values that ent reports. However, when I compile with g++, the for loop method starts producing bad output once the input file reaches roughly 2.5GB (on my system); clang++ shows no issues (tested up to 8GB).

Benchmarking both methods also shows that std::accumulate is much slower than the simple for loop. Measured on my system, the accumulate method is on average about 4 times faster when built with clang++ than with g++.

So here are my questions:

  1. Why does the for loop method produce bad output at around 2.5GB of input with g++ but not with clang++? My guess is that I am doing something wrong (probably UB) that just happens to work with clang++; see the sketch after these questions for what I suspect.

  2. Why is the std::accumulate method so much slower with g++ under the same optimization settings?
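
To illustrate what I mean in question 1 (this is only my guess, not something I have confirmed): in the for loop over char_map, each product is formed from an unsigned char and an int, so I suspect the multiplication itself might happen in a 32-bit type before it is added to the unsigned long accumulator. Below is a small standalone variant I could test, where the product is widened to 64 bits first; the bucket content is made up to mimic one byte value of an 8GB random file:

#include <iostream>
#include <unordered_map>

int main() {
  // hypothetical bucket: byte value 0xff seen about 33.5 million times,
  // roughly what one bucket would hold for an 8GB random file
  std::unordered_map<char, long long> char_map{{static_cast<char>(0xff), 33'554'432LL}};

  unsigned long long sum = 0;
  for (const auto& [ch, count] : char_map) {
    // widen before multiplying so the product is computed in 64 bits
    sum += static_cast<unsigned long long>(static_cast<unsigned char>(ch)) * count;
  }

  std::cout << "sum: " << sum << '\n';  // 255 * 33'554'432 = 8'556'380'160
}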

Thanks!


Compiler info (target is x86_64-pc-linux-gnu):

clang version 11.1.0

gcc version 11.1.0 (GCC)

Build info:

g++ -Wall -Wextra -pedantic -O3 -DNDEBUG -std=gnu++2a main.cpp -o main-g

clang++ -Wall -Wextra -pedantic -O3 -DNDEBUG -std=gnu++20 main.cpp -o main-clang

Sample file (using random data):

dd if=/dev/urandom iflag=fullblock bs=1G count=8 of=test-8g.bin (example for 8GB random data file)

Code:

#include <chrono>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <numeric>
#include <stdexcept>
#include <string>
#include <unordered_map>

auto main(int argc, char** argv) -> int {
  using std::cout;

  std::filesystem::path file_path{};

  if (argc == 2) {
    file_path = std::filesystem::path(argv[1]);
  } else {
    return 1;
  }

  std::string input{};
  std::unordered_map<char, int> char_map{};

  std::ifstream istrm(file_path, std::ios::binary);
  if (!istrm.is_open()) {
    throw std::runtime_error("Could not open file");
  }

  const auto file_size = std::filesystem::file_size(file_path);
  input.resize(file_size);
  istrm.read(input.data(), static_cast<std::streamsize>(file_size));

  istrm.close();

  // store frequency of individual chars in unordered_map
  for (const auto& c : input) {
    if (!char_map.contains(c)) {
      char_map.insert(std::pair<char, int>(c, 1));
    } else {
      char_map[c]++;
    }
  }

  unsigned long sum_for_loop = 0;

  // start stopwatch
  auto start_timer = std::chrono::steady_clock::now();
  cout << "using for loopn";

  // this for loop works when compiled with clang++
  // stops working with g++ at 2.5GB input file
  for (const auto& item : char_map) {
    sum_for_loop += static_cast<unsigned char>(item.first) * item.second;
  }

  // stop stopwatch
  cout << std::chrono::duration<double>(std::chrono::steady_clock::now() - start_timer).count() << " s\n";

  auto mean_for_loop = static_cast<double>(sum_for_loop) / static_cast<double>(input.size());

  cout << std::fixed << "sum_for_loop: " << sum_for_loop << " size: " << input.size() << '\n';
  cout << "mean value of data bytes: " << mean_for_loop << '\n';

  // start stopwatch
  start_timer = std::chrono::steady_clock::now();
  cout << "using accumulate()n";

  // accumulate works when compiled with clang++ and g++ but is slow (much slower in g++)
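  // note: the 0.0 initial value makes the accumulator a double, so the whole
  // summation here is done in floating point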
  auto sum_accum =
      std::accumulate(input.begin(), input.end(), 0.0, [](auto current_val, auto each_char) { return current_val + static_cast<unsigned char>(each_char); });
  
  // stop stopwatch
  cout << std::chrono::duration<double>(std::chrono::steady_clock::now() - start_timer).count() << " s\n";

  auto mean_accum = sum_accum / static_cast<double>(input.size());

  cout << std::fixed << "sum_for_loop: " << sum_accum << " size: " << input.size() << '\n';
  cout << "mean value of data bytes: " << mean_accum << '\n';
}

Sample output from 2GB file (clang++):

using for loop
2.024e-05 s
sum_for_loop: 273805913805 size: 2147483648
mean value of data bytes: 127.500814
using accumulate()
1.317576 s
sum_for_loop: 273805913805.000000 size: 2147483648
mean value of data bytes: 127.500814

Sample output from 1GB file (g++):

using for loop
1.9751e-05 s
sum_for_loop: 136903803941 size: 1073741824
mean value of data bytes: 127.501603
using accumulate()
2.597439 s
sum_for_loop: 136903803941.000000 size: 1073741824
mean value of data bytes: 127.501603

Sample output from 8GB file (clang++):

using for loop
1.853e-05 s
sum_for_loop: 1095220441576 size: 8589934592
mean value of data bytes: 127.500440
using accumulate()
5.247585 s
sum_for_loop: 1095220441576.000000 size: 8589934592
mean value of data bytes: 127.500440

Sample output from 8GB file (g++) (note wrong output using for loop):

using for loop
1.942e-05 s
sum_for_loop: 3781096 size: 8589934592
mean value of data bytes: 0.000440
using accumulate()
20.797355 s
sum_for_loop: 1095220441576.000000 size: 8589934592
mean value of data bytes: 127.500440
