同济子豪兄GAIDC演讲:内卷时代,如何诞生“超级个体”?& 转图片PDF文字至word保存代码

同济子豪兄GAIDC演讲:内卷时代,如何诞生“超级个体”?

image-20230228170103636

image-20230228170112516

image-20230228170119194

image-20230228170009124

以其昏昏 使其昭昭

image-20230228170017188

image-20230228170332325

image-20230228170245236

转图片PDF文字至word保存代码

import io
import os
from pdf2image import convert_from_path
from PIL import Image
import pytesseract
import PyPDF2
from docx import Document
from docx.shared import Inches #现在是pip install python-docx不是doc库

# Open the PDF file
with open('example.pdf', 'rb') as f:
    # Create a PDF reader object
    pdf_reader = PyPDF2.PdfReader(f) #新版本改正为这样 原有的不适用了

    # Create a new Word document
    doc = Document()

    # Extract the text from each page and add it to the Word document
    for page_num in range(len(pdf_reader.pages)):
        # Extract the page as an image
        page = convert_from_path('example.pdf', dpi=300, first_page=page_num+1, last_page=page_num+1)[0]

        # Convert the image to grayscale and apply thresholding
        page = page.convert('L')
        threshold = 200
        page = page.point(lambda x: 255 * (x > threshold))

        # Extract the text from the image using Tesseract
        text = pytesseract.image_to_string(page)

        # Add the text to the Word document
        doc.add_heading(f'Page {page_num+1}', level=1)
        doc.add_paragraph(text)
        print(page_num)

    # Save the Word document
    doc.save('output.docx')

输入pip uninstall xxx -y (xxx换成需要卸载的包)就可以轻轻松松卸载掉啦。

当然也可以直接去/Users/XX/opt/anaconda3/lib/python3.7/site-packages/XXX 找到想卸载的包直接删除,一劳永逸

A\一. 问题:在将网络数据流导入文件时,有可能遇到“'gbk' codec can't encode characte”错误。

二. 分析:

1.在windows下面,新文件(即写入的目标文件)的默认编码是gbk。

2.网络数据流的编码是utf-8。

python解释器会用gbk编码去解析utf-8的网络数据流,于是报错。

三.解决如下,指定目标文件的编码格式为utf-8:

with open('./html','w')as f:
    f.write(html.encode('utf-8').decode('utf-8')) 
#改为
with open('./html','w',encoding='utf-8')as f:
    f.write(html)

B \python安装的docx出错 No module named ‘exceptions‘

这是安装的包不对, 先卸载

pip  uninstall docx

然后重新合适的工具包

pip  install python-docx

C\ tesseract库 https://github.com/tesseract-ocr/tesseract

D 获取jupyter notebook的token等

jupyter notebook list

E、可以直接用word阅读文章
image-20230228203413951