
Improve HDF5 support #200

Open
gpregger-ethz wants to merge 2 commits into cgsecurity:master from gpregger-ethz:hdf5_support_clean

@gpregger-ethz

Hi
I needed to recover some HDF5 data files, so I added superblock parsing to file_hdf5.c to determine the file sizes.
I'm told file recovery with my changes was much improved (I'm not qualified to speak on the recovered contents), and additional tests with HDF5 example files look very promising in checksum comparisons.
Unfortunately, I'm not very fluent in either C or this level of data recovery, so I expect there is still room for improvement.
Feel free to let me know if any adjustments need to be made.

In any case, many thanks for providing and maintaining testdisk! 👍

References:
HDF5 Superblock specification: https://support.hdfgroup.org/documentation/hdf5/latest/_f_m_t11.html#subsec_fmt11_boot_super
Additional HDF5 example data files: https://github.com/openPMD/openPMD-example-datasets
(Unfortunately, I was unable to source any HDF5 files with superblock version 1, so I could not test that case.)
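
For readers unfamiliar with the idea, reading a file size out of the superblock can be sketched in standalone C as below. This is a minimal sketch, not the code from this pull request: `read_le64` and `hdf5_eof_address` are hypothetical helper names, only superblock versions 0 and 1 with 8-byte offsets are handled, and the address-block offsets 0x18 and 0x1C are the ones used later in this conversation. The End of File Address is assumed to be the third address field, as laid out in the specification linked above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a little-endian 64-bit value one byte at a
 * time, so the result does not depend on alignment or host endianness. */
static uint64_t read_le64(const unsigned char *p)
{
  uint64_t v = 0;
  int i;
  for (i = 7; i >= 0; i--)
    v = (v << 8) | p[i];
  return v;
}

/* Hypothetical helper: return the End of File Address stored in an HDF5
 * superblock, or 0 when the buffer is not covered by this sketch
 * (bad signature, version other than 0 or 1, offsets not 8 bytes wide,
 * or buffer too short). */
static uint64_t hdf5_eof_address(const unsigned char *buf, size_t len)
{
  static const unsigned char sig[8] =
    { 0x89, 'H', 'D', 'F', '\r', '\n', 0x1a, '\n' };
  size_t addr_start;

  if (len < 14 || memcmp(buf, sig, sizeof(sig)) != 0)
    return 0;
  if (buf[8] == 0)          /* superblock version 0 */
    addr_start = 0x18;
  else if (buf[8] == 1)     /* superblock version 1 */
    addr_start = 0x1C;
  else
    return 0;
  if (buf[13] != 8)         /* "size of offsets" field, versions 0 and 1 */
    return 0;
  if (len < addr_start + 3 * 8)
    return 0;
  /* Address fields in order: base address, free-space address, EOF address. */
  return read_le64(buf + addr_start + 2 * 8);
}
```

Calling `hdf5_eof_address(buffer, buffer_size)` on the first block of a candidate file would give the expected total size in bytes, which a carver can then use to decide where the file ends.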

@cgsecurity
Owner

Your code was unsafe for sb_offsets_size_offset != 8. I have decided to handle only 64-bit offsets for the moment.
Can you try the src/file_hdf5.c file I pushed a few minutes ago?

@gpregger-ethz
Author

gpregger-ethz commented Mar 6, 2026

Thanks for the input. I ran your code on my test image, which has 44 deleted test files on it. My code recovers 44 files, about 80% of which match the source files. Your code unfortunately recovers only 1 file.

For almost all files this check in file_check_hdf5_0 fails:

if(eof_address < eof_address_offset || eof_address < file_recovery->file_size)

It seems to me that file_recovery->file_size just contains garbage here; see the log output below:

file_check_hdf5_0: dec eof_address = 2478040
file_check_hdf5_0: hex eof_address = 0x25CFD8
------------------------------
file_check_hdf5_0: eof_address < eof_address_offset || eof_address < file_recovery->file_size
file_check_hdf5_0: eof_address = 2478040
file_check_hdf5_0: eof_address_offset = 40
file_check_hdf5_0: file_recovery->file_size:  2478080
------------------------------
file_check_hdf5_0: dec eof_address = 2478040
file_check_hdf5_0: hex eof_address = 0x25CFD8
------------------------------
file_check_hdf5_0: eof_address < eof_address_offset || eof_address < file_recovery->file_size
file_check_hdf5_0: eof_address = 2478040
file_check_hdf5_0: eof_address_offset = 40
file_check_hdf5_0: file_recovery->file_size:  2478080
------------------------------
file_check_hdf5_0: dec eof_address = 3561320
file_check_hdf5_0: hex eof_address = 0x365768
------------------------------
file_check_hdf5_0: eof_address < eof_address_offset || eof_address < file_recovery->file_size
file_check_hdf5_0: eof_address = 3561320
file_check_hdf5_0: eof_address_offset = 40
file_check_hdf5_0: file_recovery->file_size:  9809920
------------------------------
file_check_hdf5_0: dec eof_address = 94920
file_check_hdf5_0: hex eof_address = 0x172C8
------------------------------
file_check_hdf5_0: eof_address < eof_address_offset || eof_address < file_recovery->file_size
file_check_hdf5_0: eof_address = 94920
file_check_hdf5_0: eof_address_offset = 40
file_check_hdf5_0: file_recovery->file_size:  1789370368
------------------------------
file_check_hdf5_0: dec eof_address = 16064464
file_check_hdf5_0: hex eof_address = 0xF51FD0
------------------------------
file_check_hdf5_0: eof_address < eof_address_offset || eof_address < file_recovery->file_size
file_check_hdf5_0: eof_address = 16064464
file_check_hdf5_0: eof_address_offset = 40
file_check_hdf5_0: file_recovery->file_size:  16777216
------------------------------
file_check_hdf5_0: dec eof_address = 19529464
file_check_hdf5_0: hex eof_address = 0x129FEF8
/root/recovery/recup_dir.9/f3964928.h5   3966976-3999743

Sometimes the value is strangely close to eof_address, though, so I may have missed something.
What is the point of this check, and where does file_recovery->file_size come from?

Edit: Changing

if(eof_address < eof_address_offset || eof_address < file_recovery->file_size)

to

if(eof_address < eof_address_offset)

leads to the more successful recovery I observed with my code: 44/44 files recovered, 41/44 checksums match.

@cgsecurity
Owner

Can you try with
if(eof_address < eof_address_offset || eof_address > file_recovery->file_size)

@gpregger-ethz
Author

This yields 43/44 recovered with 41/43 checksum matches 👍

Can you quickly comment on the origin of the value file_recovery->file_size?
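
To make the difference between the three conditions tried so far concrete, here is a small standalone C sketch that evaluates each of them on values taken from the log output above. The function names are hypothetical and exist only for this comparison; each returns true when the candidate file would be rejected. The sketch makes no claim about what file_recovery->file_size represents, it only shows how the comparisons behave on the logged numbers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Original check: rejects whenever eof_address is below file_size. */
static bool reject_original(uint64_t eof_address, uint64_t eof_address_offset,
                            uint64_t file_size)
{
  return eof_address < eof_address_offset || eof_address < file_size;
}

/* Variant without the file_size comparison. */
static bool reject_no_size(uint64_t eof_address, uint64_t eof_address_offset,
                           uint64_t file_size)
{
  (void)file_size; /* intentionally unused */
  return eof_address < eof_address_offset;
}

/* Variant with the comparison reversed: rejects only when eof_address
 * lies beyond file_size. */
static bool reject_revised(uint64_t eof_address, uint64_t eof_address_offset,
                           uint64_t file_size)
{
  return eof_address < eof_address_offset || eof_address > file_size;
}
```

With the logged values eof_address = 2478040, eof_address_offset = 40 and file_size = 2478080, the original condition rejects the file, while the other two accept it. The same holds for eof_address = 94920 with file_size = 1789370368. In every logged case file_size is larger than eof_address, which is why the original comparison failed for almost all files.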

@cgsecurity
Owner

cgsecurity commented Mar 6, 2026

What are the results with

static int header_check_hdf5(const unsigned char *buffer, const unsigned int buffer_size, const unsigned int safe_header_only, const file_recovery_t *file_recovery, file_recovery_t *file_recovery_new)
{
  const struct hdf5_superblock *sb=(const struct hdf5_superblock*)&buffer[0];
  /*@ assert \valid_read(sb); */
  if(sb->version > 2)
    return 0;
  if(sb->offsets_size < 1)
    return 0;
  if(sb->offsets_size == 8)
  {
    uint64_t calculated_file_size;
    /* Currently only handle 64-bits offsets */
    if(sb->version == 0)
      calculated_file_size = le64(*(const uint64_t *)(&buffer[0x18 + 2*8]));
    else
      calculated_file_size = le64(*(const uint64_t *)(&buffer[0x1C + 2*8]));
    if(calculated_file_size < 0x1C + 3*8)
      return 0;
    reset_file_recovery(file_recovery_new);
    file_recovery_new->extension=file_hint_hdf5.extension;
    file_recovery_new->calculated_file_size = calculated_file_size;
    file_recovery_new->data_check=&data_check_size;
    file_recovery_new->file_check=&file_check_size;
    return 1;
  }
  reset_file_recovery(file_recovery_new);
  file_recovery_new->extension=file_hint_hdf5.extension;
  return 1;
}

@gpregger-ethz
Author

44/44 recovered, 0 matches :(

@cgsecurity
Owner

There was a missing "return 1;" in my previous copy/paste:

    file_recovery_new->file_check=&file_check_size;
    return 1;
}

@gpregger-ethz
Author

44/44 recovered, 41/44 matches 👍

2 participants