The code is intended for fast deduplication of very large text files. Compared with the traditional command-line approach (such as sort | uniq), it has the following advantages:
- Chunked processing: the read_large_file_in_chunks generator reads the input file in chunks, 100,000 lines per chunk by default, so the whole file never has to be held in memory.
- Per-chunk deduplication: process_chunk deduplicates and sorts each chunk, then writes the result to a temporary file.
- Merging the sorted temporary files: merge_sorted_files performs a heap-based merge of all temporary files, dropping duplicates as it writes (a standalone sketch of this step follows the list).
- Progress and remaining-time display: the tqdm library shows a progress bar during processing and estimates the remaining time from the number of lines processed so far versus the total.
- Error handling: files are read with latin-1 encoding and decoding errors are ignored, so the program runs stably even when the input is not valid UTF-8.
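For reference, the merge-and-deduplicate step can also be expressed with the standard library's heapq.merge. The sketch below is only a minimal illustration, assuming every temporary file is already sorted and that all of them can be open at the same time; the function name merge_unique and its parameters are placeholders, not part of the original script, which implements the same idea with an explicit heap so that file handles can be processed in batches.

```python
import heapq

def merge_unique(sorted_paths, output_path):
    # Open every already-sorted temp file; heapq.merge lazily yields their
    # lines in globally sorted order without loading any file fully.
    files = [open(p, 'r', encoding='utf-8') for p in sorted_paths]
    try:
        with open(output_path, 'w', encoding='utf-8') as out:
            prev = None
            for raw in heapq.merge(*files):
                line = raw.rstrip('\n')
                if line != prev:  # duplicates are adjacent in sorted order
                    out.write(line + '\n')
                    prev = line
    finally:
        for f in files:
            f.close()
```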
```python
import os
import heapq
import time
from tqdm import tqdm

# Read a large file lazily, yielding chunks of at most chunk_size stripped lines.
def read_large_file_in_chunks(file_path, chunk_size=100000):
    with open(file_path, 'r', encoding='latin-1', errors='ignore') as file:
        lines = []
        for line in file:
            lines.append(line.strip())
            if len(lines) >= chunk_size:
                yield lines
                lines = []
        if lines:
            yield lines

# Deduplicate and sort one chunk, then write it to a temporary file.
def process_chunk(lines, temp_dir, chunk_index):
    unique_lines = sorted(set(lines))
    temp_file_path = os.path.join(temp_dir, f'temp_chunk_{chunk_index}.txt')
    with open(temp_file_path, 'w', encoding='utf-8') as temp_file:
        temp_file.write('\n'.join(unique_lines) + '\n')
    return temp_file_path

# Merge the sorted temporary files with a heap, deduplicating as lines are written.
# Files are opened in batches of batch_size so the number of open file handles
# stays bounded; with more than batch_size temp files, each batch is merged and
# appended separately, so duplicates are only removed within a batch.
def merge_sorted_files(temp_files, output_file_path, batch_size=100):
    num_files = len(temp_files)
    batched_files = [temp_files[i:i + batch_size] for i in range(0, num_files, batch_size)]

    # Truncate the output file so re-runs do not append to old results.
    open(output_file_path, 'w', encoding='utf-8').close()

    for batch in batched_files:
        open_files = [open(file, 'r', encoding='utf-8') for file in batch]
        pq = []

        # Seed the priority queue with the first line of every file in the batch.
        for file_index, file in enumerate(open_files):
            line = file.readline()
            if line:  # an empty string means the file is already exhausted
                heapq.heappush(pq, (line.rstrip('\n'), file_index))

        with open(output_file_path, 'a', encoding='utf-8') as output_file:
            prev_line = None
            while pq:
                current_line, file_index = heapq.heappop(pq)
                if current_line != prev_line:
                    output_file.write(current_line + '\n')
                    prev_line = current_line
                next_line = open_files[file_index].readline()
                if next_line:
                    heapq.heappush(pq, (next_line.rstrip('\n'), file_index))

        for file in open_files:
            file.close()

# Count the total number of lines so the progress bar can estimate remaining time.
def count_lines(file_path):
    with open(file_path, 'r', encoding='latin-1', errors='ignore') as file:
        return sum(1 for line in file)

# Full pipeline: split into chunks, dedupe and sort each chunk, merge the results.
def remove_duplicates(input_file_path, output_file_path, temp_dir, chunk_size=100000):
    if not os.path.exists(temp_dir):
        os.makedirs(temp_dir)

    start_time = time.time()
    total_lines = count_lines(input_file_path)
    chunk_index = 0
    temp_files = []

    print("Reading and processing chunks...")
    with tqdm(total=total_lines, desc="Processing chunks") as pbar:
        for chunk in read_large_file_in_chunks(input_file_path, chunk_size):
            temp_file_path = process_chunk(chunk, temp_dir, chunk_index)
            temp_files.append(temp_file_path)
            chunk_index += 1
            pbar.update(len(chunk))

    print("Merging sorted chunks...")
    merge_start_time = time.time()
    merge_sorted_files(temp_files, output_file_path)

    for temp_file in temp_files:
        os.remove(temp_file)

    end_time = time.time()
    print(f"Process completed successfully in {end_time - start_time:.2f} seconds.")
    print(f"Time taken for merging: {end_time - merge_start_time:.2f} seconds.")

# Run deduplication with the functions above.
input_file_path = r'D:\zd\small\input.txt'
output_file_path = r'D:\zd\small\output.txt'
temp_dir = r'D:\zd\small\temp'
remove_duplicates(input_file_path, output_file_path, temp_dir)
```
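To sanity-check the script before pointing it at a huge file, it can be run on a small synthetic input first. The snippet below is a hedged usage example, not part of the original post: the sample paths are placeholders, and it simply reuses remove_duplicates as defined above, then verifies that the output is sorted and free of duplicates.

```python
import random

# Hypothetical test paths -- adjust to a writable directory on your machine.
sample_input = r'D:\zd\small\sample_input.txt'
sample_output = r'D:\zd\small\sample_output.txt'
sample_temp = r'D:\zd\small\sample_temp'

# Write 500,000 lines drawn from only 100,000 distinct values,
# so duplicates are guaranteed to appear.
with open(sample_input, 'w', encoding='utf-8') as f:
    for _ in range(500000):
        f.write(f'user{random.randint(0, 99999)}\n')

remove_duplicates(sample_input, sample_output, sample_temp)

# The result should contain at most 100,000 lines, sorted and unique.
with open(sample_output, 'r', encoding='utf-8') as f:
    lines = f.read().splitlines()
print(len(lines), lines == sorted(set(lines)))
```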