利用python在大量数据文件下删除某一行

大文件操作内存问题一直是个棘手的问题，这里介绍了一种在大文件中操作具体行的方法

python 修改大数据文件时，如果全加载到内存中，可能会导致内存溢出。因此可借用如下方法，将分件分段读取修改。

with open('file.txt', 'r') as old_file:
      with open('file.txt', 'r+') as new_file:

            current_line = 0

            # 定位到需要删除的行
            while current_line < (3 - 1):  #(del_line - 1)
                  old_file.readline()
                  current_line += 1

            # 当前光标在被删除行的行首，记录该位置
            seek_point = old_file.tell()

            # 设置光标位置
            new_file.seek(seek_point, 0)

            # 读需要删除的行，光标移到下一行行首
            old_file.readline()

            # 被删除行的下一行读给 next_line
            next_line = old_file.readline()

            # 连续覆盖剩余行，后面所有行上移一行
            while next_line:
                  new_file.write(next_line)
                  next_line = old_file.readline()

            # 写完最后一行后截断文件，因为删除操作，文件整体少了一行，原文件最后一行需要去掉
            new_file.truncate()

注：truncate（）函数括号可以加数字，表示删除数字之后的字符串，如果不加就从当前光标处开始截断删除

或者这样做

一个修改文件内容同时重命名的脚本函数

import codecs

from tempfile import mkstemp
from shutil import move
from os import remove

def rename_and_date(blog_list, slug, create_time):
      """
      :param blog_list : 一个字典，格式为 {key: file_path}
      :param slug : key
      :param create_time : 创建时间
      """

      print(blog_list[slug])

      def replace(source_file_path):
            fh, target_file_path = mkstemp()
            flag = 1
            is_modify = False
            with codecs.open(target_file_path, 'w', 'utf-8') as target_file:
                  with codecs.open(source_file_path, 'r', 'utf-8') as source_file:
                  for line in source_file:
                        if "date:" in line and flag:
                              flag = 0
                              is_modify = True
                              target_file.write(re.sub(r"[\d+]{4}-[\d]{2}-[\d]{2}", create_time, line))
                        else:
                              target_file.write(line)
            remove(source_file_path)
            move(target_file_path, source_file_path)

            return is_modify

      is_modify = replace(blog_list[slug])

      def rename(file, name):
            path = file.parent
            os.rename(file, path / f"{name}.md")

      if is_modify:
            rename(blog_list[slug], f"{create_time}-{slug}")

利用python在大量数据文件下删除某一行

技术分享